Deferred Follow-ups Implementation Plan
Date: 2026-06-14
Status: Plan only — NOT yet executed. Saved for review.
Context: After the alarm-subtag-fallback cleanup (merged 5976770) and its redeploy to
windev (10.100.0.48), five items remain deferred. This plan handles all five. They are
independent — execute in any order, or cherry-pick. Items D1–D2 are code (branch off main);
D3 is a dev-rig validation; D4–D5 are ops on the deployed hosts (no code).
Source of the deferred list: the post-deploy review on 2026-06-14. See also memory
project_deploy_mechanics, project_wonder_deployment, project_alarm_subtag_fallback,
project_rig_alarms_object_driven.
D1 — Surface AlarmProviderSwitchCount on the dashboard metric list
Classification: small · Est: ~10 min · Where: Server (net10), build+test on macOS
Why deferred: the dashboard reads provider state from the feed's ProviderStatus badge,
not the metrics snapshot, so the snapshot field (added in B1) and the OTEL counter were enough.
This makes the cumulative switch count visible in the dashboard's metric table too.
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSnapshotService.cs (CreateMetricSummaries, ~line 178–198)
- Test: src/ZB.MOM.WW.MxGateway.Tests/Dashboard/DashboardSnapshotServiceTests.cs (or the existing snapshot test file)
Steps:
- In CreateMetricSummaries, add to the metrics list (near the other counters): new("mxgateway.alarms.provider_switches", snapshot.AlarmProviderSwitchCount), Do not add the current mode as a gauge row: the snapshot does not carry the mode int (only the feed does), so DO NOT invent one. Only add the switch count, which the snapshot already exposes.
- Add/extend a unit test asserting the summary list contains a mxgateway.alarms.provider_switches row equal to the snapshot's AlarmProviderSwitchCount after calling metrics.AlarmProviderSwitched(...).
- Verify: dotnet build src/ZB.MOM.WW.MxGateway.Server (macOS) and dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter FullyQualifiedName~DashboardSnapshot.
- Commit.
Rollback: revert the one-line list addition.
D2 — Reproduce and fix the AmbiguousMatchException on GET /
Classification: standard · Est: ~30–45 min (repro + fix) · Where: Server (net10)
Why deferred: observed only in old (06/05, Development) logs:
AmbiguousMatchException: The request matched multiple endpoints. Matches: / (/) / (/).
Unauthenticated GET / safely 302s to /login; the ambiguity would only throw for an
authenticated request that reaches routing. Pre-existing and unrelated to the alarm work,
but it would surface as a dashboard-home 500. Confirm it still reproduces before fixing.
Root-cause candidates (the two endpoints both matching /):
- DashboardHome.razor (@page "/"), mapped by MapRazorComponents<App>() (Dashboard/DashboardEndpointRouteBuilderExtensions.cs:92).
- MapStaticAssets(...) (GatewayApplication.cs:190): in .NET 9–10 (MapStaticAssets does not exist before .NET 9) the static-assets endpoint can register a fingerprinted root that collides with the home page.
Step 0 — Reproduce (do this first; if it does NOT repro, downgrade to a documentation note):
- Log into the dashboard as multi-role/password (Administrator) at http://10.100.0.48:5130/login.
- Navigate to http://10.100.0.48:5130/ (the "Dashboard" home).
- If it renders the Overview page → not a live bug on this build; record that and stop (the dev error-page throw was a Development-only artifact). If it 500s with AmbiguousMatchException → proceed to fix.
Fix (only if it reproduces):
- Files: src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs (endpoint mapping order/region), src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardEndpointRouteBuilderExtensions.cs, src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor.
- Candidate fixes, in order of preference:
  - Inspect the actual two matched endpoint display names at runtime (the exception lists them): temporarily enable detailed errors or read the stderr Matches: block to learn exactly which two collide.
  - If MapStaticAssets owns the second /: it must run, but confirm it isn't double-mapping the home route; if it is, move MapStaticAssets ordering or scope it so it doesn't claim the literal / content path. (Do NOT remove MapStaticAssets; Blazor needs it.)
  - If two component pages both declare @page "/": give the non-home one a distinct route.
  - Last resort: give DashboardHome an explicit route (e.g. @page "/overview") and add a trivial MapGet("/", () => Results.Redirect("/overview")) with .RequireAuthorization(...).
- Add a test that exercises the root route once resolved (a WebApplicationFactory-based integration test asserting authenticated GET / returns 200, if the dashboard test harness supports auth; otherwise a routing-uniqueness assertion).
- Verify: dotnet build src/ZB.MOM.WW.MxGateway.Server; re-run the authenticated GET / repro.
Rollback: revert the routing change; the home page returns to its prior (ambiguous) state.
Note: this fix ships only via a Server redeploy (D-deploy applies, see D5/windev redeploy
procedure in project_deploy_mechanics). Decide whether to bundle it with the next redeploy.
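The "routing-uniqueness assertion" mentioned as the fallback test can be illustrated with a small sketch. This is Python rather than the project's C#, and the route inventory is a hypothetical hand-written list; a real test would enumerate the endpoints the Server actually registers. The only assumption carried over from ASP.NET Core is that route matching is case-insensitive.

```python
from collections import Counter


def duplicate_routes(patterns):
    """Return the route patterns that are registered more than once."""
    # Treat "/alarms" and "/alarms/" as the same pattern and compare
    # case-insensitively; the bare root stays "/".
    normalized = [p.rstrip("/").lower() or "/" for p in patterns]
    counts = Counter(normalized)
    return sorted(p for p, n in counts.items() if n > 1)


# Hypothetical inventories: one with a second "/" registration, one without.
colliding = ["/", "/login", "/alarms", "/"]
clean = ["/", "/login", "/alarms", "/sessions/{SessionId}"]

print(duplicate_routes(colliding))  # ['/']
print(duplicate_routes(clean))      # []
```

An assertion of the form "duplicate_routes(...) is empty" would fail on exactly the condition the exception reports: two endpoints whose pattern is /.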
D3 — Validate the failover path end-to-end on the dev rig (ForceSubtag)
Classification: standard (validation, no production code) · Est: ~30–45 min · Where: dev rig / windev
Why deferred: provider_mode 1 (healthy alarmmgr) is verified in production, but the actual
alarmmgr→subtag failover, the degraded badge, the synthesized subtag alarms, and the
provider_switches{from,to,reason} counter (with the new bounded failover/failback/unknown
tag) have never fired live — a healthy system never switches, and the rig's alarms can't be made
to fail COM on demand. Use the explicit ForceSubtag config mode to exercise the degraded path
deterministically without an alarmmgr fault.
Approach: run a temporary gateway instance (NOT the production service) in ForceSubtag
mode against the dev Galaxy, drive it with the operator/IDE alarm toggle, and confirm the
degraded surface. Do this on windev in a throwaway run so production stays on Auto.
Steps:
- On windev, from a build worktree at main, run the Server locally (or a second instance on alt ports) with MxGateway:Alarms:Fallback:Mode=ForceSubtag (env MxGateway__Alarms__Fallback__Mode=ForceSubtag), pointed at the dev Galaxy (\\DESKTOP-6JL3KKO\Galaxy!DEV). Use distinct Kestrel ports to avoid clashing with the production service on 5120/5130.
- Subscribe an alarm client (or open the dashboard alarms page for that instance) and confirm:
  - The provider badge shows "Subtag monitoring (degraded)" (amber bg-warning).
  - curl .../metrics shows mxgateway_alarms_provider_mode 2.
  - Active alarms appear with degraded=true and a synthetic (MD5-derived) GUID, with the reference shape Galaxy!<realArea>.<object>.<attr> (e.g. Galaxy!TestArea.TestMachine_001.TestAlarm001).
- Drive a transition: have the operator/IDE toggle a TestMachine_NNN.TestAlarmNNN true→false (external MXAccess writes are ignored; see project_rig_alarms_object_driven); confirm a synthesized Raise then Clear, and that an AckMsg write via AcknowledgeByName returns 0.
- (Optional, to exercise the switch counter) run in Auto and induce a primary fault if a safe way exists; otherwise document that the counter is unit-tested only and ForceSubtag covers the degraded surface. Note: ForceSubtag may not increment provider_switches (no runtime switch), so that counter's live exercise remains the one gap; record it explicitly rather than claiming coverage.
- Tear down the temporary instance. Production service is untouched (stays Auto).
- Record results in project_alarm_subtag_fallback memory (degraded badge + synthesized subtag alarms now live-validated; switch-counter still unit-test-only if not exercised).
Rollback: none — temporary instance only; nothing in production changes.
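The metrics check in the steps above can be scripted instead of read by eye. The sketch below assumes the /metrics endpoint emits the Prometheus text format, with mxgateway_alarms_provider_mode as an unlabelled sample; the sample text is hand-written for illustration and is not captured output.

```python
def read_gauge(metrics_text, name):
    """Return the first unlabelled sample value for `name`, or None."""
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


# Hand-written sample in Prometheus text format (an assumption).
sample = """\
# TYPE mxgateway_alarms_provider_mode gauge
mxgateway_alarms_provider_mode 2
"""

mode = read_gauge(sample, "mxgateway_alarms_provider_mode")
print("degraded (subtag)" if mode == 2 else f"unexpected mode: {mode}")
```

If the real endpoint attaches labels to this gauge, the exact-name comparison would need to be relaxed to match a name{...} prefix.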
D4 — Prune stale deploy backups on windev
Classification: trivial (ops) · Est: ~10 min · Where: windev (10.100.0.48), no code
Why deferred: backups accumulate on every deploy; harmless but cluttering
C:\publish\mxaccessgw\.
Current backups observed (2026-06-14):
- Server.bak-20260526T143341, Server.bak-20260529T090320, Server.bak529.20260604T122616, Server.bak-theme030-20260605-053056, Server.bak-theme031-20260605-083410, Server.bak-20260614-prefallback (today's), Worker.bak-20260614-prefallback (today's).
- Inside Server\: ZB.MOM.WW.MxGateway.Server.dll.bak-20260604-loginfix, appsettings.json.bak-20260604-{glauth35,ldapkeys}.
Steps:
- Keep Server.bak-20260614-prefallback and Worker.bak-20260614-prefallback: they are the immediate rollback for the current deploy. Keep them until the new build has soaked (e.g. a few days / one business cycle).
- Delete the clearly-superseded older backups: the Server.bak-2026052*, Server.bak529.*, and Server.bak-theme03* dirs. Confirm each is a Server backup dir (not something live) before rmdir /s /q.
- Optionally remove the .bak-* sidecar files inside Server\ (*.dll.bak-*, appsettings.json.bak-*), but FIRST confirm the live appsettings.json is correct (it is, post-deploy) so the .bak ldap/glauth copies aren't the only record of a needed value.
- Verify the service is unaffected: Get-Service MxAccessGw reports Running, and /health/ready returns 200.
Rollback: none needed (deletions only; the current-deploy backup is retained). If unsure about any single dir, skip it.
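The keep/delete rules above are easy to misapply by hand, so a dry-run classification may help before any rmdir. This sketch uses the directory names and glob patterns listed in this section; the helper name plan_prune and the "skip" bucket (anything not explicitly covered) are illustrative choices, and nothing here touches the filesystem.

```python
from fnmatch import fnmatchcase

# Current-deploy rollback backups: never delete these.
KEEP = {"Server.bak-20260614-prefallback", "Worker.bak-20260614-prefallback"}
# Superseded backup patterns named in the steps above.
DELETE_PATTERNS = ["Server.bak-2026052*", "Server.bak529.*", "Server.bak-theme03*"]


def plan_prune(names):
    """Sort directory names into keep / delete / skip without deleting anything."""
    plan = {"keep": [], "delete": [], "skip": []}
    for name in names:
        if name in KEEP:
            plan["keep"].append(name)
        elif any(fnmatchcase(name, pattern) for pattern in DELETE_PATTERNS):
            plan["delete"].append(name)
        else:
            # Live dirs (Server, Worker) and anything unrecognized are left alone.
            plan["skip"].append(name)
    return plan


observed = [
    "Server", "Worker",
    "Server.bak-20260526T143341", "Server.bak-20260529T090320",
    "Server.bak529.20260604T122616",
    "Server.bak-theme030-20260605-053056", "Server.bak-theme031-20260605-083410",
    "Server.bak-20260614-prefallback", "Worker.bak-20260614-prefallback",
]
for bucket, names in plan_prune(observed).items():
    print(bucket, names)
```

Defaulting unknown names to "skip" matches the rollback note: if unsure about any single dir, leave it.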
D5 — Redeploy the wonder host to bring it to parity
Classification: high-risk (separate production-ish box, fiddly access, divergent build) · Est: ~1–2 h · Where: wonder-app-vd03.zmr.zimmer.com (10.220.157.247)
Why deferred: wonder still runs the pre-feature build — it lacks the entire alarm-fallback
feature and the cleanup. Bring it to 5976770 if parity is wanted. This box is materially
different from windev (read project_wonder_deployment fully before starting).
Key differences from windev (do NOT reuse the windev recipe blindly):
- Access: no usable ssh host "cmd" exec. Port 22 OpenSSH (PowerShell) was observed CLOSED 2026-05-26; the reliable path is servecli on port 2222 (cmd.exe PTY + SFTP only) using the base64 --EncodedCommand pattern in project_wonder_deployment / ~/Desktop/servecli/instructions.md. Re-test port 22 first; if up, prefer it.
- Build shape: wonder runs a self-contained single-file win-x64 Server (~118 MB), renamed to MxGateway.Server.exe for the NSSM path E:\ApiInstall\MxGateway\Server\MxGateway.Server.exe. So publish must be -r win-x64 --self-contained -p:PublishSingleFile=true (NOT the framework-dependent windev recipe), then rename the exe to MxGateway.Server.exe.
- Static assets quirk: ship BOTH ZB.MOM.WW.MxGateway.Server.staticwebassets.endpoints.json and MxGateway.Server.staticwebassets.endpoints.json next to the exe (the ZB-named one is used).
- Config: wonder's appsettings.json carries HTTP-switched Kestrel (http://0.0.0.0:5130), Dashboard.GroupToRole = { SCADA-Admins/Designers/Deploy-All → Admin }, Dashboard.RequireHttpsCookie: false, LDAP base dc=scadalink,dc=local. Preserve wonder's appsettings.json exactly, as was done on windev (do not ship the repo default). Note: wonder's GroupToRole values are "Admin". VERIFY the current build's validator accepts "Admin" (the windev crash showed it requires 'Administrator'/'Viewer'). If the validator rejects "Admin", wonder's appsettings GroupToRole values must be updated to Administrator BEFORE/with the deploy or the new build will crash-loop. This is the single biggest risk; resolve it first.
- Worker: wonder's x86 worker is the original; the alarm-fallback adds new subscribe fields. To use subtag fallback on wonder, the x86 worker must also be redeployed (publish x86, rename if needed). If only the Server is updated, confirm worker IPC protobuf compat (contracts changed for alarms, so a stale worker may not understand the extended SubscribeAlarms). Safer to deploy both Server and Worker together.
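The GroupToRole risk can be pre-checked offline against a copy of wonder's appsettings.json. The sketch below rests on two assumptions that must be confirmed against GatewayOptionsValidator: that the accepted role names are exactly Administrator and Viewer (inferred from the windev crash), and that the mapping lives at Dashboard → GroupToRole in the JSON. The sample config is hypothetical, including the Example-Viewers group.

```python
import json

# Assumption: roles the validator accepts, inferred from the windev crash.
ALLOWED_ROLES = {"Administrator", "Viewer"}


def invalid_group_roles(appsettings_text):
    """Return {group: role} for every mapping whose role would be rejected."""
    settings = json.loads(appsettings_text)
    # Assumption: the mapping sits at Dashboard -> GroupToRole.
    mapping = settings.get("Dashboard", {}).get("GroupToRole", {})
    return {group: role for group, role in mapping.items()
            if role not in ALLOWED_ROLES}


# Hypothetical sample; the real file on wonder must be inspected instead.
sample = """
{
  "Dashboard": {
    "RequireHttpsCookie": false,
    "GroupToRole": { "SCADA-Admins": "Admin", "Example-Viewers": "Viewer" }
  }
}
"""

print(invalid_group_roles(sample))  # {'SCADA-Admins': 'Admin'}
```

An empty result means no patch is needed under these assumptions. Note that json.loads rejects comments and trailing commas, which a real appsettings.json may contain.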
Steps (high level — expand against the runbook before executing):
- Confirm parity is actually wanted on wonder, and whether the dashboard (disabled there) and the alarm monitor are even in scope for that box. If alarms/dashboard are off on wonder, this may reduce to "Server binary parity only."
- Resolve the GroupToRole value compatibility question (Admin vs Administrator): inspect the current build's GatewayOptionsValidator and decide whether to patch wonder's appsettings.
- Build self-contained Server (-r win-x64 --self-contained -p:PublishSingleFile=true) and x86 Worker from main (5976770).
- Transfer via SFTP (servecli) to a staging dir on E:\ApiInstall\MxGateway\.
- Stop MxAccessGw; back up Server → Server.bak.<ts> and Worker → Worker.bak.<ts>; swap in the new build; restore wonder's appsettings.json (+ patched GroupToRole if needed); ensure both staticwebassets manifests are present; rename Server exe to MxGateway.Server.exe.
- Start MxAccessGw (no dependents). Verify stderr line count does not grow (no crash loop), curl -k https://...:5130/health/live (or HTTP per current config) returns Healthy, and gRPC 5120 is up.
- Verify the new metric is present if the metrics endpoint is exposed there.
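The "stderr line count does not grow" check amounts to sampling the log a few times after start and comparing the counts. A minimal sketch of that comparison follows; the stderr log path is not given in this plan, so count_lines takes whatever path the runbook specifies, and the sampled numbers below are invented for illustration.

```python
def count_lines(path):
    """Count lines in a log file (path is supplied by the operator)."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def stderr_is_stable(line_counts):
    """True if no later sample exceeds the first one (no new stderr output)."""
    if len(line_counts) < 2:
        raise ValueError("need at least two samples to compare")
    first = line_counts[0]
    return all(count <= first for count in line_counts[1:])


# Invented samples: taken a few seconds apart after starting the service.
print(stderr_is_stable([120, 120, 120]))  # True  -> no crash loop signal
print(stderr_is_stable([120, 134, 150]))  # False -> stderr keeps growing
```

A growing count is only a signal, not proof of a crash loop, so the health endpoint check in the same step still applies.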
Rollback (documented in memory): stop service,
Move-Item Server Server.failed; Move-Item Server.bak.<ts> Server, restore appsettings from the
backup, start.
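Because wonder is reached through a cmd.exe PTY, multi-statement PowerShell such as the rollback is usually sent as an encoded command. The sketch below shows the general PowerShell convention (Base64 over the UTF-16LE bytes of the script); the exact wrapper in instructions.md may differ, and the <ts> placeholder is deliberately left unfilled.

```python
import base64


def encode_powershell(command):
    """Encode a script for PowerShell's -EncodedCommand parameter."""
    # PowerShell expects Base64 of the UTF-16LE bytes, not UTF-8.
    return base64.b64encode(command.encode("utf-16-le")).decode("ascii")


def decode_powershell(encoded):
    """Reverse of encode_powershell, useful for checking before sending."""
    return base64.b64decode(encoded).decode("utf-16-le")


# Rollback moves from this plan; <ts> must be replaced with the real timestamp.
rollback = "Move-Item Server Server.failed; Move-Item Server.bak.<ts> Server"
encoded = encode_powershell(rollback)

print(f"powershell -NoProfile -EncodedCommand {encoded}")
print(decode_powershell(encoded) == rollback)  # True
```

Decoding the string locally before pasting it into the PTY is a cheap way to catch quoting mistakes, since cmd.exe gives no feedback on a malformed payload.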
Suggested order
- D2 repro (5 min: just log in and hit /): decides whether D2 is a real fix or a no-op.
- D1 (small code) + D2 fix if needed: bundle into one Server branch; they ship together on the next Server redeploy.
- D4 (prune windev backups) — quick ops, independent.
- D3 (ForceSubtag validation) — exercises the degraded surface; do before/after the D1+D2 redeploy so you also confirm the new Server build is healthy.
- D5 (wonder) — largest and riskiest; do last, only if parity is wanted, and resolve the GroupToRole-value compatibility question first.
Execution note
This plan is intentionally NOT executed. When ready, execute on a branch off main
(feat/deferred-followups or per-item branches) — do not commit to main directly. D1/D2 need a
Server redeploy to take effect; D4/D5 are host operations; D3 touches nothing permanent.
D2 — Resolution (2026-06-14)
Static source determination on build 5976770 (runtime repro out of scope). What was checked:
- Razor @page "/" count: exactly ONE, in Dashboard/Components/Pages/DashboardHome.razor. All other pages declare distinct routes (/login, /sessions, /galaxy, /browse, /apikeys, /workers, /events, /alarms, /settings, /sessions/{SessionId}).
- No root index.html: src/ZB.MOM.WW.MxGateway.Server/wwwroot/ contains only css/ and lib/ subdirectories; no index.html anywhere under wwwroot/.
- No UseDefaultFiles / no MapFallback: neither appears anywhere in the Server project.
- MapStaticAssets mapped once (GatewayApplication.cs:190) and UseStaticFiles() once (:41); the static-assets manifest serves fingerprinted CSS/JS assets, not a literal /.
- No MapGet("/"): the dashboard endpoint builder (Dashboard/DashboardEndpointRouteBuilderExtensions.cs) maps only /auth/login, /logout, /denied, /hubs/{snapshot,alarms,events}, /hubs/token, then MapRazorComponents<App>(). None use pattern /.
- MapZbHealth / MapZbMetrics come from the external ZB.MOM.WW.Health shared library (not in this repo) and map health/metrics paths, not /.
Root cause of the 2026-06-05 log: code-reviews/Server/findings.md (re-review at 42b0037,
2026-05-24) records that commit de7639a removed the legacy MapGet("/", ...) redirect that
was colliding with the Blazor @page "/" (a real 500). That legacy registration was the source
of the AmbiguousMatchException. It is gone on the current build, so the second / endpoint no
longer exists.
Conclusion: No duplicate / endpoint on build 5976770. The AmbiguousMatchException is not
reproducible from source — it was a stale Development-only artifact from before de7639a reached
the deployed instance. No source change made (no-op).
Residual: A 100% confirmation still requires an authenticated runtime GET / against a
deployed instance (the only path that exercises routing past the unauthenticated 302-to-/login).
Recommend a spot-check of authenticated GET / after the next Server redeploy; if it returns 200
(not 500), this item can be fully closed.
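The outcome of that spot-check can be written down as a small decision rule, which also guards against mistaking an unauthenticated redirect for a pass. The function name and verdict strings are illustrative; the status-code meanings come from this plan (200 closes the item, 500 reopens it, a redirect to /login means the request never reached routing).

```python
def classify_root_check(status, location=None):
    """Map the response of GET / to a verdict for item D2."""
    if status == 200:
        return "close D2"
    if status == 500:
        return "reopen D2"
    if status in (301, 302, 303, 307, 308) and location and "/login" in location:
        # The session was not authenticated, so routing was never exercised.
        return "not authenticated, repeat the check"
    return "inconclusive"


print(classify_root_check(200))            # close D2
print(classify_root_check(302, "/login"))  # not authenticated, repeat the check
print(classify_root_check(500))            # reopen D2
```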