a56ce0ddbd
Mark §1.1 (11 worker commands), §1.2 (audit CorrelationId), and §4 client CLI/helper parity as Resolved with commit refs; correct §4.4 (dotnet version already worked). Record open residuals: §1.3 live failover counter, §3.2 multi-sample buffered conversion, §1.4 vendor-stub ack, DrainEvents snapshot semantics.
302 lines
19 KiB
Markdown
302 lines
19 KiB
Markdown
# Deferred Follow-ups Implementation Plan
|
||
|
||
**Date:** 2026-06-14
|
||
**Status:** D1 executed (commit 4af24b9 — `mxgateway.alarms.provider_switches` emitted in `DashboardSnapshotService.cs:198`). D2 resolved as no-op (see resolution section below). D3–D5 remain pending (ops/validation, no code).
|
||
**Context:** After the alarm-subtag-fallback cleanup (merged `5976770`) and its redeploy to
|
||
windev (10.100.0.48), five items remain deferred. This plan handles all five. They are
|
||
independent — execute in any order, or cherry-pick. Items D1–D2 are code (branch off `main`);
|
||
D3 is a dev-rig validation; D4–D5 are ops on the deployed hosts (no code).
|
||
|
||
Source of the deferred list: the post-deploy review on 2026-06-14. See also memory
|
||
`project_deploy_mechanics`, `project_wonder_deployment`, `project_alarm_subtag_fallback`,
|
||
`project_rig_alarms_object_driven`.
|
||
|
||
---
|
||
|
||
## D1 — Surface `AlarmProviderSwitchCount` on the dashboard metric list
|
||
|
||
**Classification:** small · **Est:** ~10 min · **Where:** Server (net10), build+test on macOS
|
||
**Why deferred:** the dashboard reads provider *state* from the feed's `ProviderStatus` badge,
|
||
not the metrics snapshot, so the snapshot field (added in B1) and the OTEL counter were enough.
|
||
This makes the cumulative switch count visible in the dashboard's metric table too.
|
||
|
||
**Files:**
|
||
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSnapshotService.cs` (`CreateMetricSummaries`, ~line 178–198)
|
||
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Dashboard/DashboardSnapshotServiceTests.cs` (or the existing snapshot test file)
|
||
|
||
**Steps:**
|
||
1. In `CreateMetricSummaries`, add to the `metrics` list (near the other counters):
|
||
`new("mxgateway.alarms.provider_switches", snapshot.AlarmProviderSwitchCount),`
|
||
Also consider adding the current mode as a gauge row if useful:
|
||
the snapshot does not carry the mode int (only the feed does), so DO NOT invent one —
|
||
only add the switch count, which the snapshot already exposes.
|
||
2. Add/extend a unit test asserting the summary list contains a
|
||
`mxgateway.alarms.provider_switches` row equal to the snapshot's `AlarmProviderSwitchCount`
|
||
after calling `metrics.AlarmProviderSwitched(...)`.
|
||
3. Verify: `dotnet build src/ZB.MOM.WW.MxGateway.Server` (macOS) and
|
||
`dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter FullyQualifiedName~DashboardSnapshot`.
|
||
4. Commit.
|
||
|
||
**Rollback:** revert the one-line list addition.
|
||
|
||
---
|
||
|
||
## D2 — Reproduce and fix the `AmbiguousMatchException` on `GET /`
|
||
|
||
**Classification:** standard · **Est:** ~30–45 min (repro + fix) · **Where:** Server (net10)
|
||
**Why deferred:** observed only in *old* (06/05, Development) logs:
|
||
`AmbiguousMatchException: The request matched multiple endpoints. Matches: / (/) / (/)`.
|
||
Unauthenticated `GET /` safely 302s to `/login`; the ambiguity would only throw for an
|
||
*authenticated* request that reaches routing. Pre-existing and unrelated to the alarm work,
|
||
but it would surface as a dashboard-home 500. Confirm it still reproduces before fixing.
|
||
|
||
**Root-cause candidates (the two endpoints both matching `/`):**
|
||
- `DashboardHome.razor` (`@page "/"`) mapped by `MapRazorComponents<App>()`
|
||
(`Dashboard/DashboardEndpointRouteBuilderExtensions.cs:92`).
|
||
- `MapStaticAssets(...)` (`GatewayApplication.cs:190`) — in .NET 8–10 the static-assets
|
||
endpoint can register a fingerprinted root that collides with the home page.
|
||
|
||
**Step 0 — Reproduce (do this first; if it does NOT repro, downgrade to a documentation note):**
|
||
1. Log into the dashboard as `multi-role`/`password` (Administrator) at
|
||
`http://10.100.0.48:5130/login`.
|
||
2. Navigate to `http://10.100.0.48:5130/` (the "Dashboard" home).
|
||
3. If it renders the Overview page → not a live bug on this build; record that and stop
|
||
(the dev error-page throw was a Development-only artifact). If it 500s with
|
||
`AmbiguousMatchException` → proceed to fix.
|
||
|
||
**Fix (only if it reproduces):**
|
||
- Files: `src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs` (endpoint mapping order/region),
|
||
`src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardEndpointRouteBuilderExtensions.cs`,
|
||
`src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor`.
|
||
- Candidate fixes, in order of preference:
|
||
1. Inspect the actual two matched endpoint display names at runtime (the exception lists
|
||
them) — temporarily enable detailed errors or read the stderr `Matches:` block — to learn
|
||
exactly which two collide.
|
||
2. If `MapStaticAssets` owns the second `/`: it must run, but confirm it isn't double-mapping
|
||
the home route; if it is, move `MapStaticAssets` ordering or scope it so it doesn't claim
|
||
the literal `/` content path. (Do NOT remove `MapStaticAssets` — Blazor needs it.)
|
||
3. If two component pages both declare `@page "/"`: give the non-home one a distinct route.
|
||
4. Last resort: give `DashboardHome` an explicit route (e.g. `@page "/overview"`) and add a
|
||
trivial `MapGet("/", () => Results.Redirect("/overview"))` with `.RequireAuthorization(...)`.
|
||
- Add a test that exercises the root route once resolved (a WebApplicationFactory-based
|
||
integration test asserting authenticated `GET /` returns 200, if the dashboard test harness
|
||
supports auth; otherwise a routing-uniqueness assertion).
|
||
- Verify: `dotnet build src/ZB.MOM.WW.MxGateway.Server`; re-run the authenticated `GET /` repro.
|
||
|
||
**Rollback:** revert the routing change; the home page returns to its prior (ambiguous) state.
|
||
|
||
**Note:** this fix ships only via a Server redeploy (D-deploy applies, see D5/windev redeploy
|
||
procedure in `project_deploy_mechanics`). Decide whether to bundle it with the next redeploy.
|
||
|
||
---
|
||
|
||
## D3 — Validate the failover path end-to-end on the dev rig (ForceSubtag)
|
||
|
||
**Classification:** standard (validation, no production code) · **Est:** ~30–45 min · **Where:** dev rig / windev
|
||
**Why deferred:** `provider_mode 1` (healthy alarmmgr) is verified in production, but the actual
|
||
alarmmgr→subtag **failover**, the **degraded badge**, the synthesized subtag alarms, and the
|
||
`provider_switches{from,to,reason}` counter (with the new bounded `failover`/`failback`/`unknown`
|
||
tag) have never fired live — a healthy system never switches, and the rig's alarms can't be made
|
||
to fail COM on demand. Use the explicit `ForceSubtag` config mode to exercise the degraded path
|
||
deterministically without an alarmmgr fault.
|
||
|
||
**Approach:** run a *temporary* gateway instance (NOT the production service) in `ForceSubtag`
|
||
mode against the dev Galaxy, drive it with the operator/IDE alarm toggle, and confirm the
|
||
degraded surface. Do this on windev in a throwaway run so production stays on `Auto`.
|
||
|
||
**Steps:**
|
||
1. On windev, from a build worktree at `main`, run the Server locally (or a second instance on
|
||
alt ports) with `MxGateway:Alarms:Fallback:Mode=ForceSubtag` (env
|
||
`MxGateway__Alarms__Fallback__Mode=ForceSubtag`), pointed at the dev Galaxy
|
||
(`\\DESKTOP-6JL3KKO\Galaxy!DEV`). Use distinct Kestrel ports to avoid clashing with the
|
||
production service on 5120/5130.
|
||
2. Subscribe an alarm client (or open the dashboard alarms page for that instance) and confirm:
|
||
- The provider badge shows **"Subtag monitoring (degraded)"** (amber `bg-warning`).
|
||
- `curl .../metrics` shows `mxgateway_alarms_provider_mode 2`.
|
||
- Active alarms appear with `degraded=true` and a synthetic (MD5-derived) GUID, with the
|
||
reference shape `Galaxy!<realArea>.<object>.<attr>` (e.g. `Galaxy!TestArea.TestMachine_001.TestAlarm001`).
|
||
3. Drive a transition: have the operator/IDE toggle a `TestMachine_NNN.TestAlarmNNN` true→false
|
||
(external MXAccess writes are ignored — see `project_rig_alarms_object_driven`); confirm a
|
||
synthesized Raise then Clear, and an `AckMsg` write via AcknowledgeByName returns 0.
|
||
4. (Optional, to exercise the switch counter) run in `Auto` and induce a primary fault if a safe
|
||
way exists; otherwise document that the counter is unit-tested only and `ForceSubtag` covers
|
||
the degraded surface. Note: `ForceSubtag` may not increment `provider_switches` (no runtime
|
||
switch) — that counter's live exercise remains the one gap; record it explicitly rather than
|
||
claiming coverage.
|
||
5. Tear down the temporary instance. Production service is untouched (stays `Auto`).
|
||
6. Record results in `project_alarm_subtag_fallback` memory (degraded badge + synthesized
|
||
subtag alarms now live-validated; switch-counter still unit-test-only if not exercised).
|
||
|
||
**Rollback:** none — temporary instance only; nothing in production changes.
|
||
|
||
---
|
||
|
||
## D4 — Prune stale deploy backups on windev
|
||
|
||
**Classification:** trivial (ops) · **Est:** ~10 min · **Where:** windev (10.100.0.48), no code
|
||
**Why deferred:** backups accumulate on every deploy; harmless but cluttering
|
||
`C:\publish\mxaccessgw\`.
|
||
|
||
**Current backups observed (2026-06-14):**
|
||
- `Server.bak-20260526T143341`, `Server.bak-20260529T090320`, `Server.bak529.20260604T122616`,
|
||
`Server.bak-theme030-20260605-053056`, `Server.bak-theme031-20260605-083410`,
|
||
`Server.bak-20260614-prefallback` (today's), `Worker.bak-20260614-prefallback` (today's).
|
||
- Inside `Server\`: `ZB.MOM.WW.MxGateway.Server.dll.bak-20260604-loginfix`,
|
||
`appsettings.json.bak-20260604-{glauth35,ldapkeys}`.
|
||
|
||
**Steps:**
|
||
1. **Keep** `Server.bak-20260614-prefallback` and `Worker.bak-20260614-prefallback` — the
|
||
immediate rollback for the current deploy. Keep until the new build has soaked (e.g. a few
|
||
days / one business cycle).
|
||
2. Delete the clearly-superseded older backups: the `Server.bak-2026052*`, `Server.bak529.*`,
|
||
and `Server.bak-theme03*` dirs. Confirm each is a Server dir (not something live) before
|
||
`rmdir /s /q`.
|
||
3. Optionally remove the in-`Server` `.bak-*` sidecar files (`*.dll.bak-*`,
|
||
`appsettings.json.bak-*`) — but FIRST confirm the live `appsettings.json` is correct (it is,
|
||
post-deploy) so the `.bak` ldap/glauth copies aren't the only record of a needed value.
|
||
4. Verify the service is unaffected: `Get-Service MxAccessGw` Running, `/health/ready` 200.
|
||
|
||
**Rollback:** none needed (deletions only; the current-deploy backup is retained). If unsure
|
||
about any single dir, skip it.
|
||
|
||
---
|
||
|
||
## D5 — Redeploy the wonder host to bring it to parity
|
||
|
||
**Classification:** high-risk (separate production-ish box, fiddly access, divergent build) · **Est:** ~1–2 h · **Where:** wonder-app-vd03.zmr.zimmer.com (10.220.157.247)
|
||
**Why deferred:** wonder still runs the pre-feature build — it lacks the entire alarm-fallback
|
||
feature and the cleanup. Bring it to `5976770` if parity is wanted. This box is materially
|
||
different from windev (read `project_wonder_deployment` fully before starting).
|
||
|
||
**Key differences from windev (do NOT reuse the windev recipe blindly):**
|
||
- Access: no usable `ssh host "cmd"` exec. Port 22 OpenSSH (PowerShell) was observed CLOSED
|
||
2026-05-26; the reliable path is **servecli on port 2222** (cmd.exe PTY + SFTP only) using the
|
||
base64-`-EncodedCommand` pattern in `project_wonder_deployment` / `~/Desktop/servecli/instructions.md`.
|
||
Re-test port 22 first; if up, prefer it.
|
||
- Build shape: wonder runs a **self-contained single-file win-x64** Server (~118 MB), renamed to
|
||
`MxGateway.Server.exe` for the NSSM path `E:\ApiInstall\MxGateway\Server\MxGateway.Server.exe`.
|
||
So publish must be `-r win-x64 --self-contained -p:PublishSingleFile=true` (NOT the
|
||
framework-dependent windev recipe), then rename the exe to `MxGateway.Server.exe`.
|
||
- Static assets quirk: ship BOTH `ZB.MOM.WW.MxGateway.Server.staticwebassets.endpoints.json` and
|
||
`MxGateway.Server.staticwebassets.endpoints.json` next to the exe (the ZB-named one is used).
|
||
- Config: wonder's `appsettings.json` carries HTTP-switched Kestrel (`http://0.0.0.0:5130`),
|
||
`Dashboard.GroupToRole` = `{ SCADA-Admins/Designers/Deploy-All → Admin }`,
|
||
`Dashboard.RequireHttpsCookie:false`, LDAP base `dc=scadalink,dc=local`. **Preserve wonder's
|
||
appsettings.json** exactly like windev (do not ship the repo default). Note: wonder's
|
||
GroupToRole values are `"Admin"` — VERIFY the current build's validator accepts `"Admin"` (the
|
||
windev crash showed it requires `'Administrator'`/`'Viewer'`). **If the validator rejects
|
||
`"Admin"`, wonder's appsettings GroupToRole values must be updated to `Administrator` BEFORE/with
|
||
the deploy or the new build will crash-loop.** This is the single biggest risk — resolve it first.
|
||
- Worker: wonder's x86 worker is the original; the alarm-fallback adds new subscribe fields. To
|
||
use subtag fallback on wonder, the **x86 worker must also be redeployed** (publish x86, rename
|
||
if needed). If only the Server is updated, confirm worker IPC protobuf compat (contracts changed
|
||
for alarms — a stale worker may not understand the extended SubscribeAlarms). Safer to deploy
|
||
both Server and Worker together.
|
||
|
||
**Steps (high level — expand against the runbook before executing):**
|
||
1. Confirm parity is actually wanted on wonder, and whether the dashboard (disabled there) and
|
||
the alarm monitor are even in scope for that box. If alarms/dashboard are off on wonder, this
|
||
may reduce to "Server binary parity only."
|
||
2. Resolve the **GroupToRole value compatibility** question (Admin vs Administrator) — inspect
|
||
the current build's `GatewayOptionsValidator` and decide whether to patch wonder's appsettings.
|
||
3. Build self-contained Server (`-r win-x64 --self-contained -p:PublishSingleFile=true`) and x86
|
||
Worker from `main` (`5976770`).
|
||
4. Transfer via SFTP (servecli) to a staging dir on `E:\ApiInstall\MxGateway\`.
|
||
5. Stop `MxAccessGw`; back up `Server` → `Server.bak.<ts>` and `Worker` → `Worker.bak.<ts>`;
|
||
swap in the new build; **restore wonder's `appsettings.json`** (+ patched GroupToRole if
|
||
needed); ensure both staticwebassets manifests are present; rename Server exe to
|
||
`MxGateway.Server.exe`.
|
||
6. Start `MxAccessGw` (no dependents). Verify stderr line count does not grow (no crash loop),
|
||
`curl -k https://...:5130/health/live` (or HTTP per current config) Healthy, gRPC 5120 up.
|
||
7. Verify the new metric is present if the metrics endpoint is exposed there.
|
||
|
||
**Rollback (documented in memory):** stop service,
|
||
`Move-Item Server Server.failed; Move-Item Server.bak.<ts> Server`, restore appsettings from the
|
||
backup, start.
|
||
|
||
---
|
||
|
||
## Suggested order
|
||
|
||
1. **D2 repro** (5 min — just log in and hit `/`): decides whether D2 is a real fix or a no-op.
|
||
2. **D1** (small code) + **D2 fix if needed** — bundle into one Server branch; they ship together
|
||
on the next Server redeploy.
|
||
3. **D4** (prune windev backups) — quick ops, independent.
|
||
4. **D3** (ForceSubtag validation) — exercises the degraded surface; do before/after the D1+D2
|
||
redeploy so you also confirm the new Server build is healthy.
|
||
5. **D5** (wonder) — largest and riskiest; do last, only if parity is wanted, and resolve the
|
||
GroupToRole-value compatibility question first.
|
||
|
||
## Execution note
|
||
|
||
This plan is intentionally NOT executed. When ready, execute on a branch off `main`
|
||
(`feat/deferred-followups` or per-item branches) — do not commit to `main` directly. D1/D2 need a
|
||
Server redeploy to take effect; D4/D5 are host operations; D3 touches nothing permanent.
|
||
|
||
---
|
||
|
||
## D2 — Resolution (2026-06-14)
|
||
|
||
Static source determination on build `5976770` (runtime repro out of scope). What was checked:
|
||
|
||
- **Razor `@page "/"` count:** exactly ONE — `Dashboard/Components/Pages/DashboardHome.razor`.
|
||
All other pages declare distinct routes (`/login`, `/sessions`, `/galaxy`, `/browse`,
|
||
`/apikeys`, `/workers`, `/events`, `/alarms`, `/settings`, `/sessions/{SessionId}`).
|
||
- **No root index.html:** `src/ZB.MOM.WW.MxGateway.Server/wwwroot/` contains only `css/` and
|
||
`lib/` subdirectories; no `index.html` anywhere under `wwwroot/`.
|
||
- **No `UseDefaultFiles` / no `MapFallback`:** neither appears anywhere in the Server project.
|
||
- **`MapStaticAssets` mapped once** (`GatewayApplication.cs:190`) and `UseStaticFiles()` once
|
||
(`:41`); the static-assets manifest serves fingerprinted CSS/JS assets, not a literal `/`.
|
||
- **No `MapGet("/")`:** the dashboard endpoint builder
|
||
(`Dashboard/DashboardEndpointRouteBuilderExtensions.cs`) maps only `/auth/login`, `/logout`,
|
||
`/denied`, `/hubs/{snapshot,alarms,events}`, `/hubs/token`, then `MapRazorComponents<App>()`.
|
||
None use pattern `/`.
|
||
- **`MapZbHealth` / `MapZbMetrics`** come from the external `ZB.MOM.WW.Health` shared library
|
||
(not in this repo) and map health/metrics paths, not `/`.
|
||
|
||
**Root cause of the 2026-06-05 log:** `code-reviews/Server/findings.md` (re-review at `42b0037`,
|
||
2026-05-24) records that commit `de7639a` **removed the legacy `MapGet("/", ...)` redirect that
|
||
was colliding with the Blazor `@page "/"` (a real 500)**. That legacy registration was the source
|
||
of the `AmbiguousMatchException`. It is gone on the current build, so the second `/` endpoint no
|
||
longer exists.
|
||
|
||
**Conclusion:** No duplicate `/` endpoint on build `5976770`. The AmbiguousMatchException is not
|
||
reproducible from source — it was a stale Development-only artifact from before `de7639a` reached
|
||
the deployed instance. **No source change made** (no-op).
|
||
|
||
**Residual:** A 100% confirmation still requires an authenticated runtime `GET /` against a
|
||
deployed instance (the only path that exercises routing past the unauthenticated 302-to-`/login`).
|
||
Recommend a spot-check of authenticated `GET /` after the next Server redeploy; if it returns 200
|
||
(not 500), this item can be fully closed.
|
||
|
||
---
|
||
|
||
## Recorded residuals after `feat/stillpending-completion` (2026-06-15)
|
||
|
||
The stillpending.md actionable items were implemented on branch `feat/stillpending-completion`
|
||
(see `docs/plans/2026-06-15-stillpending-completion.md`). These environment/vendor-gated residuals
|
||
remain explicitly open — none are code defects:
|
||
|
||
- **`provider_switches{from,to,reason}` counter — live exercise still pending.** The metric is
|
||
emitted on the alarm failover/failback path and unit-tested, but the dev rig's object-driven
|
||
alarms can't be made to fail a real alarmmgr→subtag switch from outside, so the `reason` tagging
|
||
is unproven against a live failover. Re-verify when a rig (or capture) can drive an actual
|
||
alarmmgr fault. (stillpending §1.3.)
|
||
|
||
- **`DrainEvents` is a diagnostic snapshot, not an exhaustive drain.** The worker now answers
|
||
`DrainEvents` (handled in `WorkerPipeSession`, off-STA), but it pulls from the same event queue
|
||
that the ~25 ms background stream-drain loop continuously empties. With an active event stream a
|
||
`DrainEvents` caller therefore receives only events not yet pushed by the stream loop — there is
|
||
no loss or double-delivery (the queue drain is lock-protected and destructive), but the result is
|
||
a non-deterministic snapshot. Documented here so the semantics aren't mistaken for a bug.
|
||
|
||
- **Buffered multi-sample conversion (`OnBufferedDataChange`) — unverified live.** `AddBufferedItem`
|
||
/ `SetBufferedUpdateInterval` are implemented and live-confirmed; the empty `NoData` bootstrap
|
||
event converts cleanly live (`f7ada90`). A real parallel quality/timestamp sample batch
|
||
(length > 1) is undrivable on the current rig, so the multi-sample path is exercised only by unit
|
||
tests against a fake `IMxAccessServer`. (stillpending §3.2.)
|
||
|
||
- **8-arg alarm ack operator `domain`/`full_name` — vendor-blocked.** The AVEVA `IwwAlarmConsumer2`
|
||
8-arg `AlarmAckByName` returns −55 (stub) and `AlarmAckByGUID` is `E_NOTIMPL` on this build, so the
|
||
two fields stay forward-compat-only on the wire. (stillpending §1.4 / §3.4 / §3.5.)
|