Files
mxaccessgw/stillpending.md
T
Joseph Doherty a56ce0ddbd docs: refresh stillpending.md for completed work; record residuals (§7/E2)
Mark §1.1 (11 worker commands), §1.2 (audit CorrelationId), and §4 client
CLI/helper parity as Resolved with commit refs; correct §4.4 (dotnet version
already worked). Record open residuals: §1.3 live failover counter, §3.2
multi-sample buffered conversion, §1.4 vendor-stub ack, DrainEvents snapshot
semantics.
2026-06-15 11:35:48 -04:00

163 lines
21 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Still Pending — Deferred / Partial / Unfinished / Missing Functionality
**Generated:** 2026-06-15 · **Commit:** `c7f754c` (main) · **Method:** six parallel read-only audits (Server, Worker, Contracts/proto, all five clients, docs/design/plans, tests + review backlog). Every item cites a verified `file:line`.
> **Resolution update (2026-06-15, branch `feat/stillpending-completion`):** The actionable items were implemented and verified per `docs/plans/2026-06-15-stillpending-completion.md`. **§1.1** (all 11 worker command kinds), **§1.2** (audit CorrelationId), and the **§4** client CLI/helper parity gaps are **Resolved** — see per-item annotations below. Worker COM commands are live-verified on the dev rig (`efd9971`, `f7ada90`). Remaining open items are the documented residuals (**§1.3**, **§1.4**, the **§3** vendor/capture-gated questions incl. the new **§3.2** multi-sample buffered residual) and the deliberate v1 scope of **§2**. Zero `.proto` changes were needed (all reply messages already existed).
## How to read this
Items are graded by what they actually are, because most "pending" surface in this codebase is **deliberate v1 scope**, not accidental:
- 🔴 **Genuine gap** — real unfinished/missing functionality with user-visible impact; a candidate to actually build.
- 🟠 **Parity hole** — declared in-scope (proto/design) but not wired through; breaks "MXAccess parity" or cross-client parity.
- 🟡 **Open question / vendor-gated** — intentionally incomplete, awaiting a live MXAccess capture or an AVEVA fix; raw data preserved meanwhile.
- 🔵 **Intentional v1 scope** — deliberately deferred and documented; listed so it's catalogued, not because it's broken.
-**Verification gap** — code exists but is unverified by default (opt-in/live tests).
- 📄 **Stale doc / dead code** — prose or code that lags reality.
---
## 1. Genuine gaps (real unfinished functionality)
### ✅ 1.1 — Worker does not implement 11 declared command kinds *(biggest real gap)* — **RESOLVED**
- **Resolved:** all 11 now implemented. The **5 control commands** (`Ping`, `GetSessionState`, `GetWorkerInfo`, `DrainEvents`, `ShutdownWorker`) are handled in `WorkerPipeSession` (off-STA — `ShutdownWorker` on the STA would deadlock, and these read pipe-session state) — `bf72cd8`. The **6 COM commands** (`Suspend`, `Activate`, `AuthenticateUser`, `ArchestrAUserToId`, `AddBufferedItem`, `SetBufferedUpdateInterval`) are implemented in `MxAccessCommandExecutor` (STA-dispatched) via new `IMxAccessServer`/`MxAccessComServer` wrappers selecting `ILMXProxyServer2/4/5``2939932`. Live-verified on the dev rig (`efd9971`, `f7ada90`): `ArchestrAUserToId`→Ok(user_id=1), `AddBufferedItem`/`SetBufferedUpdateInterval`→Ok; `AuthenticateUser`/`Suspend`/`Activate`→real `MxaccessFailure`/HResult (parity, not `INVALID_REQUEST`). `FakeWorkerHarness` now answers the control commands so the default gateway suite covers them (`bb5139f`). Note: `DrainEvents` is a *diagnostic snapshot* — it competes with the 25 ms background stream-drain loop, so with an active event stream it returns only events not yet pushed (no loss/double-drain; the queue drain is lock-protected and destructive).
- **Original finding below (for history):**
- **Location:** `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:97-128` (the `Execute` switch; everything else falls to `_ => CreateInvalidRequestReply(...)`).
- **What's missing:** the proto `MxCommandKind` enum defines these, the gateway *validates* (`Grpc/MxAccessGrpcRequestValidator.cs:86-95`), *scopes* (`Security/Authorization/GatewayGrpcScopeResolver.cs:45-47`), and *routes* them, and reply messages exist — but the worker answers `INVALID_REQUEST`:
- **MXAccess COM commands (parity-critical):** `AddBufferedItem`, `SetBufferedUpdateInterval`, `Suspend`, `Activate`, `AuthenticateUser`, `ArchestrAUserToId` (`mxaccess_gateway.proto:107-116,151-160`).
- **Worker control/lifecycle:** `Ping`, `GetSessionState`, `GetWorkerInfo`, `DrainEvents`, `ShutdownWorker` (`mxaccess_gateway.proto:133-137,177-181`).
- **Why it matters:** `gateway.md:890-899`, `docs/MxAccessWorkerInstanceDesign.md:424-433`, and `docs/DesignDecisions.md:169-173` list all six COM commands under **"Phase 4: Full Command Surface"** with the exit criterion *"gRPC command surface covers the installed MXAccess public method set."* So this is a declared-in-scope phase that isn't finished, not a v1 cut.
- **Masked by tests:** the control kinds (`Ping`/`GetWorkerInfo`/`DrainEvents`/`ShutdownWorker`) are exercised only through `FakeWorkerHarness`, so the unit suite is green while the *real* worker can't answer them. The live integration test `WorkerLiveMxAccessSmokeTests.cs:931` even sends `AuthenticateUser`, which would get an invalid-request reply today.
- **Note:** `OnBufferedDataChange` *event* mapping IS fully wired (`Conversion`/`MxAccessEventMapper.cs:231-254`) — but with `AddBufferedItem`/`SetBufferedUpdateInterval` unimplemented there is no way to set buffered eventing up.
### ✅ 1.2 — Constraint-denial audit events drop CorrelationId — **RESOLVED**
- **Resolved (`8415f35`, `55526d5`):** `request.ClientCorrelationId` is threaded from `Invoke``ApplyConstraintsAsync` → the six filter helpers → `IConstraintEnforcer.RecordDenialAsync` (new `string? correlationId` param); the `TODO(Task 2.3)` is gone. The audit `CorrelationId` column is `Guid?`, so a GUID-parseable id is stored typed; **and** the raw string is always preserved in the audit record's `DetailsJson["clientCorrelationId"]` — this matters because the Rust client sends non-GUID ids (`rust-client-<op>-<n>`) on all traffic and Python/Java default to empty, which would otherwise have left the typed column null. An end-to-end test asserts the value propagates through `Invoke`.
- **Original finding:** denied-operation audit records always wrote `CorrelationId = null` (`ConstraintEnforcer.cs:134-136,147`); threading needed an `IConstraintEnforcer` signature change.
### 🔴 1.3 — `provider_switches{from,to,reason}` counter never exercised live
- **Location:** metric emitted in `Alarms/GatewayAlarmMonitor.cs` (failover/failback path); residual recorded in `docs/plans/2026-06-14-deferred-followups.md:124-125`.
- **Evidence:** *"that counter's live exercise remains the one gap; record it explicitly rather than claiming coverage."* Unit-tested (Tests-032) but the dev rig can't drive a real alarmmgr→subtag failover (`project_rig_alarms_object_driven`), so the counter's `reason` tagging is unproven in production.
### 🟠 1.4 — Worker 8-arg alarm ack silently discards operator domain / full name
- **Location:** `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:261-278` (`_ = ackOperatorDomain; _ = ackOperatorFullName;`).
- **Evidence:** *"the IwwAlarmConsumer2 8-arg AlarmAckByName returns -55 on this AVEVA build (looks like a stub) … fields are accepted by the proto for forward-compat but are not propagated to AVEVA today."*
- **Impact:** two contract fields are accepted on the wire and silently not delivered. Root cause is the vendor stub (see 3.5), but the drop is currently invisible to callers.
---
## 2. Intentional v1 scope decisions (deliberately deferred — catalog)
These are documented, deliberate, and mostly enforced. Listed so the deferred surface is in one place — **none are bugs.** Canonical register: `docs/DesignDecisions.md:466-474` ("Later Revisit Items") + `gateway.md` "Post-v1 revisit items".
- 🔵 **Reconnectable sessions** — not in v1. `docs/DesignDecisions.md:63-73`, `gateway.md:1087,1101`.
- 🔵 **Multi-event-subscriber fan-out***plumbed but blocked.* The option flows all the way to `Sessions/GatewaySession.cs:387-408 AttachEventSubscriber(allowMultipleSubscribers)`, but `Configuration/GatewayOptionsValidator.cs:181-185` hard-rejects the only enabling value: *"AllowMultipleEventSubscribers is not supported until event fan-out is implemented."* So the fan-out code path never runs. `docs/DesignDecisions.md:75-80`.
- 🔵 **Gateway restart does not reattach orphan workers** — terminates them on startup. `docs/DesignDecisions.md:65-69`, `CLAUDE.md`.
- 🔵 **Workers run as the gateway service identity** — restricted service account is a reserved extension point. `docs/DesignDecisions.md:179-184`.
- 🔵 **Fail-fast event backpressure, no coalescing** — opt-in coalescing is post-v1. `docs/DesignDecisions.md:187-203`.
- 🔵 **No public command batching**`docs/DesignDecisions.md:206-212`.
- 🔵 **API-key admin is a local CLI only** — no public admin RPC. `docs/DesignDecisions.md:308-323`.
- 🔵 **No Blazor UI component libraries** — hard constraint. `docs/DesignDecisions.md:342-358`.
- 🔵 **Lazy browse is wire-only** — no lazy SQL / cache loading. `docs/DesignDecisions.md:365-376`, `docs/plans/2026-05-28-lazy-browse-design.md:30`.
- 🔵 **No server-side / streaming browse search**`docs/plans/2026-05-28-lazy-browse-design.md:208`.
- 🔵 **Alarm command surface is ack + query only** — no Clear/Disable/Enable/Silence/Shelve/Inhibit; matches the MXAccess alarm-client set. `Worker/MxAccess/AlarmCommandHandler.cs`, shelve/suppress out of scope per `docs/AlarmClientDiscovery.md:60-66`.
- 🔵 **Dashboard EventsHub has no per-session ACL** — any authenticated dashboard user may subscribe to any session group. `Dashboard/Hubs/EventsHub.cs:36-50` (`TODO(per-session-acl)`); only relevant once a per-session role model exists.
---
## 3. MXAccess parity — open questions & vendor-gated items
Intentionally incomplete, awaiting a live capture or an AVEVA fix; raw payload/metadata is preserved in the meantime (no synthesis).
- 🟡 **3.1 `OperationComplete` native trigger condition unknown** — modeled and emitted only from the real event (no synthesis), but the runtime condition that fires it isn't captured. `docs/DesignDecisions.md:280-289`, `gateway.md:1094`, `docs/MxAccessWorkerInstanceDesign.md:341,366`.
- 🟡 **3.2 `OnBufferedDataChange` multi-sample conversion unvalidated — STILL OPEN (residual after B8)**`AddBufferedItem`/`SetBufferedUpdateInterval` are now implemented and live-confirmed (§1.1), and the live B8 test (`f7ada90`) confirms the worker receives and cleanly converts the empty `NoData` bootstrap `OnBufferedDataChange` (no crash, no dropped payload). But the rig's object logic does not drive a multi-sample buffered batch on demand (same limitation as the alarm rig), so a real parallel quality/timestamp sample array (length > 1) has never been observed live — it is exercised only by the B-bundle unit tests against a fake `IMxAccessServer`. Re-run `GatewaySession_WithLiveWorker_BufferedItem_*` against a fast-changing simulated tag to close this. `docs/DesignDecisions.md:291-297`.
- 🟡 **3.3 Completion-only status → `MXSTATUS_PROXY[]` mapping unproven** — completion-only operation-status bytes are kept as raw diagnostic metadata until the analysis proves an exact mapping. `docs/DesignDecisions.md:299-306`.
- 🟡 **3.4 `AlarmAckByGUID` is `E_NOTIMPL` on this AVEVA build** — throws `NotImplementedException`; all acks route through `AlarmAckByName`. Proto/worker keep the path for forward-compat but it is dead today. `docs/AlarmClientDiscovery.md:750-763`.
- 🟡 **3.5 8-arg `AlarmAckByName` v2 is a vendor stub (returns -55)** — worker uses the 6-arg method; the 8-arg `domain`/`full_name` fields are carried for forward-compat only (see 1.4). `docs/AlarmClientDiscovery.md:743-748`.
- 🟡 **3.6 Subtag degraded-mode fidelity limits**`category`, `description`, `alarm_type_name`, operator fields, and `retrigger` are not populated/synthesized in subtag fallback (no subtag exists for them). Documented, by design. `docs/AlarmClientDiscovery.md:913-931`, `docs/plans/2026-06-13-alarm-subtag-fallback-design.md:292-298`.
- 🟡 **3.7 Subtag `Clear` transition unvalidated live** — Raise/Ack/AckMsg are live-confirmed; Clear is externally undrivable on the rig (object logic owns alarm state). Environmental, not code. (`project_alarm_subtag_fallback`, `project_rig_alarms_object_driven`.)
---
## 4. Clients — gaps & cross-client parity
Library RPC surface is at **full parity**: all gateway + GalaxyRepository RPCs and the `LazyBrowseNode` helper exist in all five clients, with **no** TODO/stub/not-implemented markers in production code. The CLI/helper gaps below are **RESOLVED**.
| Capability | Dotnet | Go | Python | Rust | Java |
|---|---|---|---|---|---|
| `Write2` single session helper | ✅ | ✅ `849f1d2` | ✅ | ✅ | ✅ |
| `ping` CLI subcommand | ✅ | ✅ `90529dc` | ✅ | ✅ | ✅ `0d5b488` |
| `version` CLI subcommand | ✅ *(already worked)* | ✅ | ✅ | ✅ | ✅ |
| `galaxy-*` CLI commands (4) | ✅ | ✅ | ✅ `a211fae` | ✅ | ✅ |
| `galaxy-browse` / BrowseChildren CLI | ✅ | ✅ | ✅ | ✅ | ✅ **(5/5)** |
-**4.1 Go single `Write2` helper — RESOLVED** (`849f1d2`): added `Write2`/`Write2Raw` to `clients/go/mxgateway/session.go`, matching the other four clients' signature.
-**4.2 Python `galaxy-*` CLI commands — RESOLVED** (`a211fae`, `a59fc99`): added `galaxy-test-connection`/`galaxy-last-deploy`/`galaxy-discover`/`galaxy-watch` Click commands wrapping `galaxy.py`; README corrected. (Fixed a UTC-offset bug in last-deploy output during review.)
-**4.3 `ping` CLI added to Go + Java — RESOLVED** (Go `90529dc`/`742ced7`, Java `0d5b488`).
-**4.4 `version` CLI in Dotnet — NOT MISSING (audit correction)**: the dotnet `version` subcommand already worked (`MxGatewayClientCli.cs:85``WriteVersion`, prints gateway/worker protocol versions). The original audit was wrong. Minor: unlike Go, dotnet's `version` omits a client-*package*-version line (`MxGatewayClientContractInfo` exposes only the two protocol versions) — cosmetic, not tracked.
-**4.5 Galaxy CLI command-name divergence — RESOLVED** (Java `0d5b488`): `galaxy-test-connection`/`galaxy-last-deploy` are now the canonical Java names, with `galaxy-test`/`galaxy-deploy-time` kept as **deprecated picocli aliases** so existing scripts don't break. (Rust keeps its `galaxy <subcommand>` group style — a clap structural choice, not a name divergence.)
-**4.6 `browse`/`BrowseChildren` CLI — RESOLVED, 0/5 → 5/5** (Rust `639e36b`, Go `8cb416b`, Python `39ec2a3`, dotnet `d7e2a8b`, Java `0d5b488`). All five emit the per-node JSON key `hasChildrenHint` (unified during review). Minor residual divergence: dotnet *nests* the Galaxy object fields under an `object` key while Go/Rust/Python/Java *flatten* them — both carry `hasChildrenHint` + a nested `children` array; harmonizing the object nesting is a cosmetic follow-up, not tracked.
-**4.7 No typed wrappers for the rarer commands**`AuthenticateUser`, `ArchestrAUserToId`, `AddBufferedItem`, `Suspend`/`Activate`, `GetSessionState`/`GetWorkerInfo`/`DrainEvents`/`ShutdownWorker` remain reachable via the generic `Invoke`/`invoke_raw` escape hatch in all five clients (consistent and deliberate; the worker-side commands are now implemented per §1.1, but no client adds dedicated typed wrappers — out of scope, the CLIs that needed them got `ping`/`browse` subcommands).
---
## 5. Verification gaps (code exists, unverified by default)
All live/integration paths are opt-in; the default unit suites do not exercise them.
-**Live MXAccess COM + STA + message pump**`Worker.Tests/MxAccess/MxAccessLiveComCreationTests.cs` (5 `[LiveMxAccessFact]`), gated `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`.
-**Live gateway↔worker↔MXAccess round-trip**`IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` (6 `[LiveMxAccessFact]`).
-**Live Galaxy Repository SQL**`IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs` (4 `[LiveGalaxyRepositoryFact]`), gated `MXGATEWAY_RUN_LIVE_GALAXY_TESTS=1`.
-**Live LDAP dashboard auth**`IntegrationTests/DashboardLdapLiveTests.cs` (5 `[LiveLdapFact]`), gated `MXGATEWAY_RUN_LIVE_LDAP_TESTS=1`.
-**Alarm runtime/discovery probes (dev-rig)**`Worker.Tests/Probes/{WnWrapConsumerProbeTests,AlarmClientWmProbeTests}.cs`, `AlarmClientDiscoveryTests.cs` — hard `[Fact(Skip=...)]`.
-**Live alarm + subtag-fallback smoke**`Worker.Tests/Probes/{AlarmSubtagLiveSmokeTests,AlarmsLiveSmokeTests}.cs``Skip` + one `[LiveMxAccessFact]`; Clear path remains undrivable even when enabled.
-**Python loopback TLS**`clients/python/tests/test_tls.py:111-112` — gated `MXGATEWAY_RUN_TLS_TESTS=1` + openssl; only cert-config parsing runs by default.
-**.NET client live browse smoke** — `clients/dotnet/.../BrowseChildrenSmokeTests.cs:17-18` — hard `[Fact(Skip=...)]`.
-**Cross-language client↔gateway wire behavior** — no per-client integration unit tests; only `scripts/run-client-e2e-tests.ps1` against a live gateway (`MXGATEWAY_INTEGRATION=1`). All client wire behavior is unverified in default unit runs.
No placeholder/empty/`Assert.True(true)` tests were found anywhere.
---
## 6. Config-gated functional gaps (work only after configuration)
- 🟠 **6.1 Alarm ack in subtag mode requires `AckComment` subtag configured** — empty by default; ack fails in subtag mode until set. Names must be validated against live MXAccess, not guessed. `docs/DesignDecisions.md:454-458`. (`AckCommentSubtag` is write-only; `Worker/MxAccess/SubtagAlarmStateMachine.cs:21`.)
- 🔵 **6.2 Multi-subscriber** — see 2 (option exists, validator-blocked).
---
## 7. Stale docs, dead code, accepted gaps
- 📄 **7.1 D1 plan header stale**`docs/plans/2026-06-14-deferred-followups.md:4` still says *"Plan only — NOT yet executed,"* but D1 is **done** (`Dashboard/DashboardSnapshotService.cs:198`, commit `4af24b9`). Update the plan status.
- 📄 **7.2 `AlarmClientDiscovery.md` STA "production fix needed" prose is stale**`docs/AlarmClientDiscovery.md:765-774` reads as a pending follow-up, but alarms now run through the worker STA / `GatewayAlarmMonitor` (merged). Re-check against current code.
- 📄 **7.3 EventsHub "publisher side is a follow-up" comment is stale**`Dashboard/Hubs/EventsHub.cs:9-17`; the `DashboardEventBroadcaster` exists, is DI-registered (`Dashboard/DashboardServiceCollectionExtensions.cs:47`), runs in the live loop (`Grpc/EventStreamService.cs:133`), and `SessionDetailsPage.razor` renders the feed.
- 📄 **7.4 CLAUDE.md project-name drift** — CLAUDE.md uses `src/MxGateway.Server`/`MxGateway.Tests`; the actual tree is `src/ZB.MOM.WW.MxGateway.*`. Misleads path-based work.
-**7.5 Dead `MapSqlException` helper**`Grpc/GalaxyRepositoryGrpcService.cs:350-360`, IDE0051-suppressed, kept for a hypothetical direct-SQL path that doesn't exist.
- **7.6 Accepted code-review gaps (`Won't Fix`, by design):**
- `Client.Python-012``Session.invoke_raw` deliberately skips `ensure_mxaccess_success`, so an embedded MXAccess HRESULT failure surfaces silently (raw-parity inspection). `code-reviews/Client.Python/findings.md:290`.
- `Contracts-003` — closed as not-a-defect. `code-reviews/Contracts/findings.md`.
- *(All 351 review findings are otherwise Resolved; none Open or Deferred.)*
---
## 8. Deferred test-coverage follow-ups (noted in resolutions, never filed as findings)
- **Java CLI bulk-subcommand coverage** — 6 of 13 non-trivial subcommands untested: `read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, `bench-read-bulk` (plus `stream-events`, the four `galaxy-*`, `close-session`). `code-reviews/Client.Java/findings.md:495` (Client.Java-026).
- **Per-session-ACL TODO** at `Server/Dashboard/Hubs/EventsHub.cs` (`code-reviews/Server/findings.md:765`).
- **Worker-Ready retry race** noted at `code-reviews/Server/findings.md:611`.
- **Duplicated `FakeWorkerProcess` harness** flagged as a latent regression vector — `code-reviews/Tests/findings.md:463`.
---
## Bottom line
**Status after `feat/stillpending-completion` (2026-06-15):** the net-new functionality is **done****§1.1** (all 11 worker command kinds, COM half live-verified on the rig), **§1.2** (audit CorrelationId, with a raw-string fallback for non-GUID clients), and the entire **§4** client CLI/helper parity surface (`Write2`, `ping`, `galaxy-*`, `galaxy-browse` 5/5, name aliases). Doc hygiene **§7** is done (`0032d2d`, `bd46ba1`). Zero `.proto` changes were required.
**Still open (all deliberate or environment/vendor-gated):**
- **§1.3** — `provider_switches` counter still only unit-tested; the dev rig can't drive a real alarmmgr→subtag failover, so live `reason`-tag coverage remains a recorded residual.
- **§1.4 / §3.4 / §3.5** — the AVEVA 8-arg `AlarmAckByName` is a vendor stub (55) and `AlarmAckByGUID` is `E_NOTIMPL`; the `domain`/`full_name` fields stay forward-compat-only until AVEVA implements them.
- **§3.2** — buffered commands work and the empty bootstrap converts cleanly live, but a multi-sample buffered batch is undrivable on the rig (unit-tested only).
- **§3.1 / §3.3 / §3.6 / §3.7** — await live MXAccess captures.
- **§2** — deliberate v1 scope. **§5** — opt-in verification gates. **§7.6** — accepted `Won't Fix` review findings.
MXAccess **event/data/value/write** mapping, the **Galaxy** RPC surface, and now the **full command surface** are complete; no `NotImplementedException`s, stubbed RPC bodies, or empty tests remain in the production paths.