Files
mxaccessgw/docs/plans/2026-06-14-deferred-followups.md
T
Joseph Doherty a56ce0ddbd docs: refresh stillpending.md for completed work; record residuals (§7/E2)
Mark §1.1 (11 worker commands), §1.2 (audit CorrelationId), and §4 client
CLI/helper parity as Resolved with commit refs; correct §4.4 (dotnet version
already worked). Record open residuals: §1.3 live failover counter, §3.2
multi-sample buffered conversion, §1.4 vendor-stub ack, DrainEvents snapshot
semantics.
2026-06-15 11:35:48 -04:00

302 lines
19 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Deferred Follow-ups Implementation Plan
**Date:** 2026-06-14
**Status:** D1 executed (commit 4af24b9 — `mxgateway.alarms.provider_switches` emitted in `DashboardSnapshotService.cs:198`). D2 resolved as no-op (see resolution section below). D3D5 remain pending (ops/validation, no code).
**Context:** After the alarm-subtag-fallback cleanup (merged `5976770`) and its redeploy to
windev (10.100.0.48), five items remain deferred. This plan handles all five. They are
independent — execute in any order, or cherry-pick. Items D1D2 are code (branch off `main`);
D3 is a dev-rig validation; D4D5 are ops on the deployed hosts (no code).
Source of the deferred list: the post-deploy review on 2026-06-14. See also memory
`project_deploy_mechanics`, `project_wonder_deployment`, `project_alarm_subtag_fallback`,
`project_rig_alarms_object_driven`.
---
## D1 — Surface `AlarmProviderSwitchCount` on the dashboard metric list
**Classification:** small · **Est:** ~10 min · **Where:** Server (net10), build+test on macOS
**Why deferred:** the dashboard reads provider *state* from the feed's `ProviderStatus` badge,
not the metrics snapshot, so the snapshot field (added in B1) and the OTEL counter were enough.
This makes the cumulative switch count visible in the dashboard's metric table too.
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSnapshotService.cs` (`CreateMetricSummaries`, ~line 178198)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Dashboard/DashboardSnapshotServiceTests.cs` (or the existing snapshot test file)
**Steps:**
1. In `CreateMetricSummaries`, add to the `metrics` list (near the other counters):
`new("mxgateway.alarms.provider_switches", snapshot.AlarmProviderSwitchCount),`
Also consider adding the current mode as a gauge row if useful:
the snapshot does not carry the mode int (only the feed does), so DO NOT invent one —
only add the switch count, which the snapshot already exposes.
2. Add/extend a unit test asserting the summary list contains a
`mxgateway.alarms.provider_switches` row equal to the snapshot's `AlarmProviderSwitchCount`
after calling `metrics.AlarmProviderSwitched(...)`.
3. Verify: `dotnet build src/ZB.MOM.WW.MxGateway.Server` (macOS) and
`dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter FullyQualifiedName~DashboardSnapshot`.
4. Commit.
**Rollback:** revert the one-line list addition.
---
## D2 — Reproduce and fix the `AmbiguousMatchException` on `GET /`
**Classification:** standard · **Est:** ~3045 min (repro + fix) · **Where:** Server (net10)
**Why deferred:** observed only in *old* (06/05, Development) logs:
`AmbiguousMatchException: The request matched multiple endpoints. Matches: / (/) / (/)`.
Unauthenticated `GET /` safely 302s to `/login`; the ambiguity would only throw for an
*authenticated* request that reaches routing. Pre-existing and unrelated to the alarm work,
but it would surface as a dashboard-home 500. Confirm it still reproduces before fixing.
**Root-cause candidates (the two endpoints both matching `/`):**
- `DashboardHome.razor` (`@page "/"`) mapped by `MapRazorComponents<App>()`
(`Dashboard/DashboardEndpointRouteBuilderExtensions.cs:92`).
- `MapStaticAssets(...)` (`GatewayApplication.cs:190`) — in .NET 810 the static-assets
endpoint can register a fingerprinted root that collides with the home page.
**Step 0 — Reproduce (do this first; if it does NOT repro, downgrade to a documentation note):**
1. Log into the dashboard as `multi-role`/`password` (Administrator) at
`http://10.100.0.48:5130/login`.
2. Navigate to `http://10.100.0.48:5130/` (the "Dashboard" home).
3. If it renders the Overview page → not a live bug on this build; record that and stop
(the dev error-page throw was a Development-only artifact). If it 500s with
`AmbiguousMatchException` → proceed to fix.
**Fix (only if it reproduces):**
- Files: `src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs` (endpoint mapping order/region),
`src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardEndpointRouteBuilderExtensions.cs`,
`src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor`.
- Candidate fixes, in order of preference:
1. Inspect the actual two matched endpoint display names at runtime (the exception lists
them) — temporarily enable detailed errors or read the stderr `Matches:` block — to learn
exactly which two collide.
2. If `MapStaticAssets` owns the second `/`: it must run, but confirm it isn't double-mapping
the home route; if it is, move `MapStaticAssets` ordering or scope it so it doesn't claim
the literal `/` content path. (Do NOT remove `MapStaticAssets` — Blazor needs it.)
3. If two component pages both declare `@page "/"`: give the non-home one a distinct route.
4. Last resort: give `DashboardHome` an explicit route (e.g. `@page "/overview"`) and add a
trivial `MapGet("/", () => Results.Redirect("/overview"))` with `.RequireAuthorization(...)`.
- Add a test that exercises the root route once resolved (a WebApplicationFactory-based
integration test asserting authenticated `GET /` returns 200, if the dashboard test harness
supports auth; otherwise a routing-uniqueness assertion).
- Verify: `dotnet build src/ZB.MOM.WW.MxGateway.Server`; re-run the authenticated `GET /` repro.
**Rollback:** revert the routing change; the home page returns to its prior (ambiguous) state.
**Note:** this fix ships only via a Server redeploy (D-deploy applies, see D5/windev redeploy
procedure in `project_deploy_mechanics`). Decide whether to bundle it with the next redeploy.
---
## D3 — Validate the failover path end-to-end on the dev rig (ForceSubtag)
**Classification:** standard (validation, no production code) · **Est:** ~3045 min · **Where:** dev rig / windev
**Why deferred:** `provider_mode 1` (healthy alarmmgr) is verified in production, but the actual
alarmmgr→subtag **failover**, the **degraded badge**, the synthesized subtag alarms, and the
`provider_switches{from,to,reason}` counter (with the new bounded `failover`/`failback`/`unknown`
tag) have never fired live — a healthy system never switches, and the rig's alarms can't be made
to fail COM on demand. Use the explicit `ForceSubtag` config mode to exercise the degraded path
deterministically without an alarmmgr fault.
**Approach:** run a *temporary* gateway instance (NOT the production service) in `ForceSubtag`
mode against the dev Galaxy, drive it with the operator/IDE alarm toggle, and confirm the
degraded surface. Do this on windev in a throwaway run so production stays on `Auto`.
**Steps:**
1. On windev, from a build worktree at `main`, run the Server locally (or a second instance on
alt ports) with `MxGateway:Alarms:Fallback:Mode=ForceSubtag` (env
`MxGateway__Alarms__Fallback__Mode=ForceSubtag`), pointed at the dev Galaxy
(`\\DESKTOP-6JL3KKO\Galaxy!DEV`). Use distinct Kestrel ports to avoid clashing with the
production service on 5120/5130.
2. Subscribe an alarm client (or open the dashboard alarms page for that instance) and confirm:
- The provider badge shows **"Subtag monitoring (degraded)"** (amber `bg-warning`).
- `curl .../metrics` shows `mxgateway_alarms_provider_mode 2`.
- Active alarms appear with `degraded=true` and a synthetic (MD5-derived) GUID, with the
reference shape `Galaxy!<realArea>.<object>.<attr>` (e.g. `Galaxy!TestArea.TestMachine_001.TestAlarm001`).
3. Drive a transition: have the operator/IDE toggle a `TestMachine_NNN.TestAlarmNNN` true→false
(external MXAccess writes are ignored — see `project_rig_alarms_object_driven`); confirm a
synthesized Raise then Clear, and an `AckMsg` write via AcknowledgeByName returns 0.
4. (Optional, to exercise the switch counter) run in `Auto` and induce a primary fault if a safe
way exists; otherwise document that the counter is unit-tested only and `ForceSubtag` covers
the degraded surface. Note: `ForceSubtag` may not increment `provider_switches` (no runtime
switch) — that counter's live exercise remains the one gap; record it explicitly rather than
claiming coverage.
5. Tear down the temporary instance. Production service is untouched (stays `Auto`).
6. Record results in `project_alarm_subtag_fallback` memory (degraded badge + synthesized
subtag alarms now live-validated; switch-counter still unit-test-only if not exercised).
**Rollback:** none — temporary instance only; nothing in production changes.
---
## D4 — Prune stale deploy backups on windev
**Classification:** trivial (ops) · **Est:** ~10 min · **Where:** windev (10.100.0.48), no code
**Why deferred:** backups accumulate on every deploy; harmless but cluttering
`C:\publish\mxaccessgw\`.
**Current backups observed (2026-06-14):**
- `Server.bak-20260526T143341`, `Server.bak-20260529T090320`, `Server.bak529.20260604T122616`,
`Server.bak-theme030-20260605-053056`, `Server.bak-theme031-20260605-083410`,
`Server.bak-20260614-prefallback` (today's), `Worker.bak-20260614-prefallback` (today's).
- Inside `Server\`: `ZB.MOM.WW.MxGateway.Server.dll.bak-20260604-loginfix`,
`appsettings.json.bak-20260604-{glauth35,ldapkeys}`.
**Steps:**
1. **Keep** `Server.bak-20260614-prefallback` and `Worker.bak-20260614-prefallback` — the
immediate rollback for the current deploy. Keep until the new build has soaked (e.g. a few
days / one business cycle).
2. Delete the clearly-superseded older backups: the `Server.bak-2026052*`, `Server.bak529.*`,
and `Server.bak-theme03*` dirs. Confirm each is a Server dir (not something live) before
`rmdir /s /q`.
3. Optionally remove the in-`Server` `.bak-*` sidecar files (`*.dll.bak-*`,
`appsettings.json.bak-*`) — but FIRST confirm the live `appsettings.json` is correct (it is,
post-deploy) so the `.bak` ldap/glauth copies aren't the only record of a needed value.
4. Verify the service is unaffected: `Get-Service MxAccessGw` Running, `/health/ready` 200.
**Rollback:** none needed (deletions only; the current-deploy backup is retained). If unsure
about any single dir, skip it.
---
## D5 — Redeploy the wonder host to bring it to parity
**Classification:** high-risk (separate production-ish box, fiddly access, divergent build) · **Est:** ~12 h · **Where:** wonder-app-vd03.zmr.zimmer.com (10.220.157.247)
**Why deferred:** wonder still runs the pre-feature build — it lacks the entire alarm-fallback
feature and the cleanup. Bring it to `5976770` if parity is wanted. This box is materially
different from windev (read `project_wonder_deployment` fully before starting).
**Key differences from windev (do NOT reuse the windev recipe blindly):**
- Access: no usable `ssh host "cmd"` exec. Port 22 OpenSSH (PowerShell) was observed CLOSED
2026-05-26; the reliable path is **servecli on port 2222** (cmd.exe PTY + SFTP only) using the
base64-`-EncodedCommand` pattern in `project_wonder_deployment` / `~/Desktop/servecli/instructions.md`.
Re-test port 22 first; if up, prefer it.
- Build shape: wonder runs a **self-contained single-file win-x64** Server (~118 MB), renamed to
`MxGateway.Server.exe` for the NSSM path `E:\ApiInstall\MxGateway\Server\MxGateway.Server.exe`.
So publish must be `-r win-x64 --self-contained -p:PublishSingleFile=true` (NOT the
framework-dependent windev recipe), then rename the exe to `MxGateway.Server.exe`.
- Static assets quirk: ship BOTH `ZB.MOM.WW.MxGateway.Server.staticwebassets.endpoints.json` and
`MxGateway.Server.staticwebassets.endpoints.json` next to the exe (the ZB-named one is used).
- Config: wonder's `appsettings.json` carries HTTP-switched Kestrel (`http://0.0.0.0:5130`),
`Dashboard.GroupToRole` = `{ SCADA-Admins/Designers/Deploy-All → Admin }`,
`Dashboard.RequireHttpsCookie:false`, LDAP base `dc=scadalink,dc=local`. **Preserve wonder's
appsettings.json** exactly like windev (do not ship the repo default). Note: wonder's
GroupToRole values are `"Admin"` — VERIFY the current build's validator accepts `"Admin"` (the
windev crash showed it requires `'Administrator'`/`'Viewer'`). **If the validator rejects
`"Admin"`, wonder's appsettings GroupToRole values must be updated to `Administrator` BEFORE/with
the deploy or the new build will crash-loop.** This is the single biggest risk — resolve it first.
- Worker: wonder's x86 worker is the original; the alarm-fallback adds new subscribe fields. To
use subtag fallback on wonder, the **x86 worker must also be redeployed** (publish x86, rename
if needed). If only the Server is updated, confirm worker IPC protobuf compat (contracts changed
for alarms — a stale worker may not understand the extended SubscribeAlarms). Safer to deploy
both Server and Worker together.
**Steps (high level — expand against the runbook before executing):**
1. Confirm parity is actually wanted on wonder, and whether the dashboard (disabled there) and
the alarm monitor are even in scope for that box. If alarms/dashboard are off on wonder, this
may reduce to "Server binary parity only."
2. Resolve the **GroupToRole value compatibility** question (Admin vs Administrator) — inspect
the current build's `GatewayOptionsValidator` and decide whether to patch wonder's appsettings.
3. Build self-contained Server (`-r win-x64 --self-contained -p:PublishSingleFile=true`) and x86
Worker from `main` (`5976770`).
4. Transfer via SFTP (servecli) to a staging dir on `E:\ApiInstall\MxGateway\`.
5. Stop `MxAccessGw`; back up `Server``Server.bak.<ts>` and `Worker``Worker.bak.<ts>`;
swap in the new build; **restore wonder's `appsettings.json`** (+ patched GroupToRole if
needed); ensure both staticwebassets manifests are present; rename Server exe to
`MxGateway.Server.exe`.
6. Start `MxAccessGw` (no dependents). Verify stderr line count does not grow (no crash loop),
`curl -k https://...:5130/health/live` (or HTTP per current config) Healthy, gRPC 5120 up.
7. Verify the new metric is present if the metrics endpoint is exposed there.
**Rollback (documented in memory):** stop service,
`Move-Item Server Server.failed; Move-Item Server.bak.<ts> Server`, restore appsettings from the
backup, start.
---
## Suggested order
1. **D2 repro** (5 min — just log in and hit `/`): decides whether D2 is a real fix or a no-op.
2. **D1** (small code) + **D2 fix if needed** — bundle into one Server branch; they ship together
on the next Server redeploy.
3. **D4** (prune windev backups) — quick ops, independent.
4. **D3** (ForceSubtag validation) — exercises the degraded surface; do before/after the D1+D2
redeploy so you also confirm the new Server build is healthy.
5. **D5** (wonder) — largest and riskiest; do last, only if parity is wanted, and resolve the
GroupToRole-value compatibility question first.
## Execution note
This plan is intentionally NOT executed. When ready, execute on a branch off `main`
(`feat/deferred-followups` or per-item branches) — do not commit to `main` directly. D1/D2 need a
Server redeploy to take effect; D4/D5 are host operations; D3 touches nothing permanent.
---
## D2 — Resolution (2026-06-14)
Static source determination on build `5976770` (runtime repro out of scope). What was checked:
- **Razor `@page "/"` count:** exactly ONE — `Dashboard/Components/Pages/DashboardHome.razor`.
All other pages declare distinct routes (`/login`, `/sessions`, `/galaxy`, `/browse`,
`/apikeys`, `/workers`, `/events`, `/alarms`, `/settings`, `/sessions/{SessionId}`).
- **No root index.html:** `src/ZB.MOM.WW.MxGateway.Server/wwwroot/` contains only `css/` and
`lib/` subdirectories; no `index.html` anywhere under `wwwroot/`.
- **No `UseDefaultFiles` / no `MapFallback`:** neither appears anywhere in the Server project.
- **`MapStaticAssets` mapped once** (`GatewayApplication.cs:190`) and `UseStaticFiles()` once
(`:41`); the static-assets manifest serves fingerprinted CSS/JS assets, not a literal `/`.
- **No `MapGet("/")`:** the dashboard endpoint builder
(`Dashboard/DashboardEndpointRouteBuilderExtensions.cs`) maps only `/auth/login`, `/logout`,
`/denied`, `/hubs/{snapshot,alarms,events}`, `/hubs/token`, then `MapRazorComponents<App>()`.
None use pattern `/`.
- **`MapZbHealth` / `MapZbMetrics`** come from the external `ZB.MOM.WW.Health` shared library
(not in this repo) and map health/metrics paths, not `/`.
**Root cause of the 2026-06-05 log:** `code-reviews/Server/findings.md` (re-review at `42b0037`,
2026-05-24) records that commit `de7639a` **removed the legacy `MapGet("/", ...)` redirect that
was colliding with the Blazor `@page "/"` (a real 500)**. That legacy registration was the source
of the `AmbiguousMatchException`. It is gone on the current build, so the second `/` endpoint no
longer exists.
**Conclusion:** No duplicate `/` endpoint on build `5976770`. The AmbiguousMatchException is not
reproducible from source — it was a stale Development-only artifact from before `de7639a` reached
the deployed instance. **No source change made** (no-op).
**Residual:** A 100% confirmation still requires an authenticated runtime `GET /` against a
deployed instance (the only path that exercises routing past the unauthenticated 302-to-`/login`).
Recommend a spot-check of authenticated `GET /` after the next Server redeploy; if it returns 200
(not 500), this item can be fully closed.
---
## Recorded residuals after `feat/stillpending-completion` (2026-06-15)
The stillpending.md actionable items were implemented on branch `feat/stillpending-completion`
(see `docs/plans/2026-06-15-stillpending-completion.md`). These environment/vendor-gated residuals
remain explicitly open — none are code defects:
- **`provider_switches{from,to,reason}` counter — live exercise still pending.** The metric is
emitted on the alarm failover/failback path and unit-tested, but the dev rig's object-driven
alarms can't be made to fail a real alarmmgr→subtag switch from outside, so the `reason` tagging
is unproven against a live failover. Re-verify when a rig (or capture) can drive an actual
alarmmgr fault. (stillpending §1.3.)
- **`DrainEvents` is a diagnostic snapshot, not an exhaustive drain.** The worker now answers
`DrainEvents` (handled in `WorkerPipeSession`, off-STA), but it pulls from the same event queue
that the ~25 ms background stream-drain loop continuously empties. With an active event stream a
`DrainEvents` caller therefore receives only events not yet pushed by the stream loop — there is
no loss or double-delivery (the queue drain is lock-protected and destructive), but the result is
a non-deterministic snapshot. Documented here so the semantics aren't mistaken for a bug.
- **Buffered multi-sample conversion (`OnBufferedDataChange`) — unverified live.** `AddBufferedItem`
/ `SetBufferedUpdateInterval` are implemented and live-confirmed; the empty `NoData` bootstrap
event converts cleanly live (`f7ada90`). A real parallel quality/timestamp sample batch
(length > 1) is undrivable on the current rig, so the multi-sample path is exercised only by unit
tests against a fake `IMxAccessServer`. (stillpending §3.2.)
- **8-arg alarm ack operator `domain`/`full_name` — vendor-blocked.** The AVEVA `IwwAlarmConsumer2`
8-arg `AlarmAckByName` returns 55 (stub) and `AlarmAckByGUID` is `E_NOTIMPL` on this build, so the
two fields stay forward-compat-only on the wire. (stillpending §1.4 / §3.4 / §3.5.)