Files
mxaccessgw/docs/plans/2026-06-14-deferred-followups.md
T
2026-06-14 23:49:47 -04:00

17 KiB
Raw Blame History

Deferred Follow-ups Implementation Plan

Date: 2026-06-14 Status: Plan only — NOT yet executed. Saved for review. Context: After the alarm-subtag-fallback cleanup (merged 5976770) and its redeploy to windev (10.100.0.48), five items remain deferred. This plan handles all five. They are independent — execute in any order, or cherry-pick. Items D1D2 are code (branch off main); D3 is a dev-rig validation; D4D5 are ops on the deployed hosts (no code).

Source of the deferred list: the post-deploy review on 2026-06-14. See also memory project_deploy_mechanics, project_wonder_deployment, project_alarm_subtag_fallback, project_rig_alarms_object_driven.


D1 — Surface AlarmProviderSwitchCount on the dashboard metric list

Classification: small · Est: ~10 min · Where: Server (net10), build+test on macOS Why deferred: the dashboard reads provider state from the feed's ProviderStatus badge, not the metrics snapshot, so the snapshot field (added in B1) and the OTEL counter were enough. This makes the cumulative switch count visible in the dashboard's metric table too.

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSnapshotService.cs (CreateMetricSummaries, ~line 178198)
  • Test: src/ZB.MOM.WW.MxGateway.Tests/Dashboard/DashboardSnapshotServiceTests.cs (or the existing snapshot test file)

Steps:

  1. In CreateMetricSummaries, add to the metrics list (near the other counters): new("mxgateway.alarms.provider_switches", snapshot.AlarmProviderSwitchCount), Also consider adding the current mode as a gauge row if useful: the snapshot does not carry the mode int (only the feed does), so DO NOT invent one — only add the switch count, which the snapshot already exposes.
  2. Add/extend a unit test asserting the summary list contains a mxgateway.alarms.provider_switches row equal to the snapshot's AlarmProviderSwitchCount after calling metrics.AlarmProviderSwitched(...).
  3. Verify: dotnet build src/ZB.MOM.WW.MxGateway.Server (macOS) and dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter FullyQualifiedName~DashboardSnapshot.
  4. Commit.

Rollback: revert the one-line list addition.


D2 — Reproduce and fix the AmbiguousMatchException on GET /

Classification: standard · Est: ~3045 min (repro + fix) · Where: Server (net10) Why deferred: observed only in old (06/05, Development) logs: AmbiguousMatchException: The request matched multiple endpoints. Matches: / (/) / (/). Unauthenticated GET / safely 302s to /login; the ambiguity would only throw for an authenticated request that reaches routing. Pre-existing and unrelated to the alarm work, but it would surface as a dashboard-home 500. Confirm it still reproduces before fixing.

Root-cause candidates (the two endpoints both matching /):

  • DashboardHome.razor (@page "/") mapped by MapRazorComponents<App>() (Dashboard/DashboardEndpointRouteBuilderExtensions.cs:92).
  • MapStaticAssets(...) (GatewayApplication.cs:190) — in .NET 810 the static-assets endpoint can register a fingerprinted root that collides with the home page.

Step 0 — Reproduce (do this first; if it does NOT repro, downgrade to a documentation note):

  1. Log into the dashboard as multi-role/password (Administrator) at http://10.100.0.48:5130/login.
  2. Navigate to http://10.100.0.48:5130/ (the "Dashboard" home).
  3. If it renders the Overview page → not a live bug on this build; record that and stop (the dev error-page throw was a Development-only artifact). If it 500s with AmbiguousMatchException → proceed to fix.

Fix (only if it reproduces):

  • Files: src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs (endpoint mapping order/region), src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardEndpointRouteBuilderExtensions.cs, src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor.
  • Candidate fixes, in order of preference:
    1. Inspect the actual two matched endpoint display names at runtime (the exception lists them) — temporarily enable detailed errors or read the stderr Matches: block — to learn exactly which two collide.
    2. If MapStaticAssets owns the second /: it must run, but confirm it isn't double-mapping the home route; if it is, move MapStaticAssets ordering or scope it so it doesn't claim the literal / content path. (Do NOT remove MapStaticAssets — Blazor needs it.)
    3. If two component pages both declare @page "/": give the non-home one a distinct route.
    4. Last resort: give DashboardHome an explicit route (e.g. @page "/overview") and add a trivial MapGet("/", () => Results.Redirect("/overview")) with .RequireAuthorization(...).
  • Add a test that exercises the root route once resolved (a WebApplicationFactory-based integration test asserting authenticated GET / returns 200, if the dashboard test harness supports auth; otherwise a routing-uniqueness assertion).
  • Verify: dotnet build src/ZB.MOM.WW.MxGateway.Server; re-run the authenticated GET / repro.

Rollback: revert the routing change; the home page returns to its prior (ambiguous) state.

Note: this fix ships only via a Server redeploy (D-deploy applies, see D5/windev redeploy procedure in project_deploy_mechanics). Decide whether to bundle it with the next redeploy.


D3 — Validate the failover path end-to-end on the dev rig (ForceSubtag)

Classification: standard (validation, no production code) · Est: ~3045 min · Where: dev rig / windev Why deferred: provider_mode 1 (healthy alarmmgr) is verified in production, but the actual alarmmgr→subtag failover, the degraded badge, the synthesized subtag alarms, and the provider_switches{from,to,reason} counter (with the new bounded failover/failback/unknown tag) have never fired live — a healthy system never switches, and the rig's alarms can't be made to fail COM on demand. Use the explicit ForceSubtag config mode to exercise the degraded path deterministically without an alarmmgr fault.

Approach: run a temporary gateway instance (NOT the production service) in ForceSubtag mode against the dev Galaxy, drive it with the operator/IDE alarm toggle, and confirm the degraded surface. Do this on windev in a throwaway run so production stays on Auto.

Steps:

  1. On windev, from a build worktree at main, run the Server locally (or a second instance on alt ports) with MxGateway:Alarms:Fallback:Mode=ForceSubtag (env MxGateway__Alarms__Fallback__Mode=ForceSubtag), pointed at the dev Galaxy (\\DESKTOP-6JL3KKO\Galaxy!DEV). Use distinct Kestrel ports to avoid clashing with the production service on 5120/5130.
  2. Subscribe an alarm client (or open the dashboard alarms page for that instance) and confirm:
    • The provider badge shows "Subtag monitoring (degraded)" (amber bg-warning).
    • curl .../metrics shows mxgateway_alarms_provider_mode 2.
    • Active alarms appear with degraded=true and a synthetic (MD5-derived) GUID, with the reference shape Galaxy!<realArea>.<object>.<attr> (e.g. Galaxy!TestArea.TestMachine_001.TestAlarm001).
  3. Drive a transition: have the operator/IDE toggle a TestMachine_NNN.TestAlarmNNN true→false (external MXAccess writes are ignored — see project_rig_alarms_object_driven); confirm a synthesized Raise then Clear, and an AckMsg write via AcknowledgeByName returns 0.
  4. (Optional, to exercise the switch counter) run in Auto and induce a primary fault if a safe way exists; otherwise document that the counter is unit-tested only and ForceSubtag covers the degraded surface. Note: ForceSubtag may not increment provider_switches (no runtime switch) — that counter's live exercise remains the one gap; record it explicitly rather than claiming coverage.
  5. Tear down the temporary instance. Production service is untouched (stays Auto).
  6. Record results in project_alarm_subtag_fallback memory (degraded badge + synthesized subtag alarms now live-validated; switch-counter still unit-test-only if not exercised).

Rollback: none — temporary instance only; nothing in production changes.


D4 — Prune stale deploy backups on windev

Classification: trivial (ops) · Est: ~10 min · Where: windev (10.100.0.48), no code Why deferred: backups accumulate on every deploy; harmless but cluttering C:\publish\mxaccessgw\.

Current backups observed (2026-06-14):

  • Server.bak-20260526T143341, Server.bak-20260529T090320, Server.bak529.20260604T122616, Server.bak-theme030-20260605-053056, Server.bak-theme031-20260605-083410, Server.bak-20260614-prefallback (today's), Worker.bak-20260614-prefallback (today's).
  • Inside Server\: ZB.MOM.WW.MxGateway.Server.dll.bak-20260604-loginfix, appsettings.json.bak-20260604-{glauth35,ldapkeys}.

Steps:

  1. Keep Server.bak-20260614-prefallback and Worker.bak-20260614-prefallback — the immediate rollback for the current deploy. Keep until the new build has soaked (e.g. a few days / one business cycle).
  2. Delete the clearly-superseded older backups: the Server.bak-2026052*, Server.bak529.*, and Server.bak-theme03* dirs. Confirm each is a Server dir (not something live) before rmdir /s /q.
  3. Optionally remove the in-Server .bak-* sidecar files (*.dll.bak-*, appsettings.json.bak-*) — but FIRST confirm the live appsettings.json is correct (it is, post-deploy) so the .bak ldap/glauth copies aren't the only record of a needed value.
  4. Verify the service is unaffected: Get-Service MxAccessGw Running, /health/ready 200.

Rollback: none needed (deletions only; the current-deploy backup is retained). If unsure about any single dir, skip it.


D5 — Redeploy the wonder host to bring it to parity

Classification: high-risk (separate production-ish box, fiddly access, divergent build) · Est: ~12 h · Where: wonder-app-vd03.zmr.zimmer.com (10.220.157.247) Why deferred: wonder still runs the pre-feature build — it lacks the entire alarm-fallback feature and the cleanup. Bring it to 5976770 if parity is wanted. This box is materially different from windev (read project_wonder_deployment fully before starting).

Key differences from windev (do NOT reuse the windev recipe blindly):

  • Access: no usable ssh host "cmd" exec. Port 22 OpenSSH (PowerShell) was observed CLOSED 2026-05-26; the reliable path is servecli on port 2222 (cmd.exe PTY + SFTP only) using the base64--EncodedCommand pattern in project_wonder_deployment / ~/Desktop/servecli/instructions.md. Re-test port 22 first; if up, prefer it.
  • Build shape: wonder runs a self-contained single-file win-x64 Server (~118 MB), renamed to MxGateway.Server.exe for the NSSM path E:\ApiInstall\MxGateway\Server\MxGateway.Server.exe. So publish must be -r win-x64 --self-contained -p:PublishSingleFile=true (NOT the framework-dependent windev recipe), then rename the exe to MxGateway.Server.exe.
  • Static assets quirk: ship BOTH ZB.MOM.WW.MxGateway.Server.staticwebassets.endpoints.json and MxGateway.Server.staticwebassets.endpoints.json next to the exe (the ZB-named one is used).
  • Config: wonder's appsettings.json carries HTTP-switched Kestrel (http://0.0.0.0:5130), Dashboard.GroupToRole = { SCADA-Admins/Designers/Deploy-All → Admin }, Dashboard.RequireHttpsCookie:false, LDAP base dc=scadalink,dc=local. Preserve wonder's appsettings.json exactly like windev (do not ship the repo default). Note: wonder's GroupToRole values are "Admin" — VERIFY the current build's validator accepts "Admin" (the windev crash showed it requires 'Administrator'/'Viewer'). If the validator rejects "Admin", wonder's appsettings GroupToRole values must be updated to Administrator BEFORE/with the deploy or the new build will crash-loop. This is the single biggest risk — resolve it first.
  • Worker: wonder's x86 worker is the original; the alarm-fallback adds new subscribe fields. To use subtag fallback on wonder, the x86 worker must also be redeployed (publish x86, rename if needed). If only the Server is updated, confirm worker IPC protobuf compat (contracts changed for alarms — a stale worker may not understand the extended SubscribeAlarms). Safer to deploy both Server and Worker together.

Steps (high level — expand against the runbook before executing):

  1. Confirm parity is actually wanted on wonder, and whether the dashboard (disabled there) and the alarm monitor are even in scope for that box. If alarms/dashboard are off on wonder, this may reduce to "Server binary parity only."
  2. Resolve the GroupToRole value compatibility question (Admin vs Administrator) — inspect the current build's GatewayOptionsValidator and decide whether to patch wonder's appsettings.
  3. Build self-contained Server (-r win-x64 --self-contained -p:PublishSingleFile=true) and x86 Worker from main (5976770).
  4. Transfer via SFTP (servecli) to a staging dir on E:\ApiInstall\MxGateway\.
  5. Stop MxAccessGw; back up ServerServer.bak.<ts> and WorkerWorker.bak.<ts>; swap in the new build; restore wonder's appsettings.json (+ patched GroupToRole if needed); ensure both staticwebassets manifests are present; rename Server exe to MxGateway.Server.exe.
  6. Start MxAccessGw (no dependents). Verify stderr line count does not grow (no crash loop), curl -k https://...:5130/health/live (or HTTP per current config) Healthy, gRPC 5120 up.
  7. Verify the new metric is present if the metrics endpoint is exposed there.

Rollback (documented in memory): stop service, Move-Item Server Server.failed; Move-Item Server.bak.<ts> Server, restore appsettings from the backup, start.


Suggested order

  1. D2 repro (5 min — just log in and hit /): decides whether D2 is a real fix or a no-op.
  2. D1 (small code) + D2 fix if needed — bundle into one Server branch; they ship together on the next Server redeploy.
  3. D4 (prune windev backups) — quick ops, independent.
  4. D3 (ForceSubtag validation) — exercises the degraded surface; do before/after the D1+D2 redeploy so you also confirm the new Server build is healthy.
  5. D5 (wonder) — largest and riskiest; do last, only if parity is wanted, and resolve the GroupToRole-value compatibility question first.

Execution note

This plan is intentionally NOT executed. When ready, execute on a branch off main (feat/deferred-followups or per-item branches) — do not commit to main directly. D1/D2 need a Server redeploy to take effect; D4/D5 are host operations; D3 touches nothing permanent.


D2 — Resolution (2026-06-14)

Static source determination on build 5976770 (runtime repro out of scope). What was checked:

  • Razor @page "/" count: exactly ONE — Dashboard/Components/Pages/DashboardHome.razor. All other pages declare distinct routes (/login, /sessions, /galaxy, /browse, /apikeys, /workers, /events, /alarms, /settings, /sessions/{SessionId}).
  • No root index.html: src/ZB.MOM.WW.MxGateway.Server/wwwroot/ contains only css/ and lib/ subdirectories; no index.html anywhere under wwwroot/.
  • No UseDefaultFiles / no MapFallback: neither appears anywhere in the Server project.
  • MapStaticAssets mapped once (GatewayApplication.cs:190) and UseStaticFiles() once (:41); the static-assets manifest serves fingerprinted CSS/JS assets, not a literal /.
  • No MapGet("/"): the dashboard endpoint builder (Dashboard/DashboardEndpointRouteBuilderExtensions.cs) maps only /auth/login, /logout, /denied, /hubs/{snapshot,alarms,events}, /hubs/token, then MapRazorComponents<App>(). None use pattern /.
  • MapZbHealth / MapZbMetrics come from the external ZB.MOM.WW.Health shared library (not in this repo) and map health/metrics paths, not /.

Root cause of the 2026-06-05 log: code-reviews/Server/findings.md (re-review at 42b0037, 2026-05-24) records that commit de7639a removed the legacy MapGet("/", ...) redirect that was colliding with the Blazor @page "/" (a real 500). That legacy registration was the source of the AmbiguousMatchException. It is gone on the current build, so the second / endpoint no longer exists.

Conclusion: No duplicate / endpoint on build 5976770. The AmbiguousMatchException is not reproducible from source — it was a stale Development-only artifact from before de7639a reached the deployed instance. No source change made (no-op).

Residual: A 100% confirmation still requires an authenticated runtime GET / against a deployed instance (the only path that exercises routing past the unauthenticated 302-to-/login). Recommend a spot-check of authenticated GET / after the next Server redeploy; if it returns 200 (not 500), this item can be fully closed.