Files
mxaccessgw/stillpending.md
T
Joseph Doherty a56ce0ddbd docs: refresh stillpending.md for completed work; record residuals (§7/E2)
Mark §1.1 (11 worker commands), §1.2 (audit CorrelationId), and §4 client
CLI/helper parity as Resolved with commit refs; correct §4.4 (dotnet version
already worked). Record open residuals: §1.3 live failover counter, §3.2
multi-sample buffered conversion, §1.4 vendor-stub ack, DrainEvents snapshot
semantics.
2026-06-15 11:35:48 -04:00

21 KiB
Raw Blame History

Still Pending — Deferred / Partial / Unfinished / Missing Functionality

Generated: 2026-06-15 · Commit: c7f754c (main) · Method: six parallel read-only audits (Server, Worker, Contracts/proto, all five clients, docs/design/plans, tests + review backlog). Every item cites a verified file:line.

Resolution update (2026-06-15, branch feat/stillpending-completion): The actionable items were implemented and verified per docs/plans/2026-06-15-stillpending-completion.md. §1.1 (all 11 worker command kinds), §1.2 (audit CorrelationId), and the §4 client CLI/helper parity gaps are Resolved — see per-item annotations below. Worker COM commands are live-verified on the dev rig (efd9971, f7ada90). Remaining open items are the documented residuals (§1.3, §1.4, the §3 vendor/capture-gated questions incl. the new §3.2 multi-sample buffered residual) and the deliberate v1 scope of §2. Zero .proto changes were needed (all reply messages already existed).

How to read this

Items are graded by what they actually are, because most "pending" surface in this codebase is deliberate v1 scope, not accidental:

  • 🔴 Genuine gap — real unfinished/missing functionality with user-visible impact; a candidate to actually build.
  • 🟠 Parity hole — declared in-scope (proto/design) but not wired through; breaks "MXAccess parity" or cross-client parity.
  • 🟡 Open question / vendor-gated — intentionally incomplete, awaiting a live MXAccess capture or an AVEVA fix; raw data preserved meanwhile.
  • 🔵 Intentional v1 scope — deliberately deferred and documented; listed so it's catalogued, not because it's broken.
  • Verification gap — code exists but is unverified by default (opt-in/live tests).
  • 📄 Stale doc / dead code — prose or code that lags reality.

1. Genuine gaps (real unfinished functionality)

1.1 — Worker does not implement 11 declared command kinds (biggest real gap)RESOLVED

  • Resolved: all 11 now implemented. The 5 control commands (Ping, GetSessionState, GetWorkerInfo, DrainEvents, ShutdownWorker) are handled in WorkerPipeSession (off-STA — ShutdownWorker on the STA would deadlock, and these read pipe-session state) — bf72cd8. The 6 COM commands (Suspend, Activate, AuthenticateUser, ArchestrAUserToId, AddBufferedItem, SetBufferedUpdateInterval) are implemented in MxAccessCommandExecutor (STA-dispatched) via new IMxAccessServer/MxAccessComServer wrappers selecting ILMXProxyServer2/4/52939932. Live-verified on the dev rig (efd9971, f7ada90): ArchestrAUserToId→Ok(user_id=1), AddBufferedItem/SetBufferedUpdateInterval→Ok; AuthenticateUser/Suspend/Activate→real MxaccessFailure/HResult (parity, not INVALID_REQUEST). FakeWorkerHarness now answers the control commands so the default gateway suite covers them (bb5139f). Note: DrainEvents is a diagnostic snapshot — it competes with the 25 ms background stream-drain loop, so with an active event stream it returns only events not yet pushed (no loss/double-drain; the queue drain is lock-protected and destructive).
  • Original finding below (for history):
  • Location: src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:97-128 (the Execute switch; everything else falls to _ => CreateInvalidRequestReply(...)).
  • What's missing: the proto MxCommandKind enum defines these, the gateway validates (Grpc/MxAccessGrpcRequestValidator.cs:86-95), scopes (Security/Authorization/GatewayGrpcScopeResolver.cs:45-47), and routes them, and reply messages exist — but the worker answers INVALID_REQUEST:
    • MXAccess COM commands (parity-critical): AddBufferedItem, SetBufferedUpdateInterval, Suspend, Activate, AuthenticateUser, ArchestrAUserToId (mxaccess_gateway.proto:107-116,151-160).
    • Worker control/lifecycle: Ping, GetSessionState, GetWorkerInfo, DrainEvents, ShutdownWorker (mxaccess_gateway.proto:133-137,177-181).
  • Why it matters: gateway.md:890-899, docs/MxAccessWorkerInstanceDesign.md:424-433, and docs/DesignDecisions.md:169-173 list all six COM commands under "Phase 4: Full Command Surface" with the exit criterion "gRPC command surface covers the installed MXAccess public method set." So this is a declared-in-scope phase that isn't finished, not a v1 cut.
  • Masked by tests: the control kinds (Ping/GetWorkerInfo/DrainEvents/ShutdownWorker) are exercised only through FakeWorkerHarness, so the unit suite is green while the real worker can't answer them. The live integration test WorkerLiveMxAccessSmokeTests.cs:931 even sends AuthenticateUser, which would get an invalid-request reply today.
  • Note: OnBufferedDataChange event mapping IS fully wired (Conversion/MxAccessEventMapper.cs:231-254) — but with AddBufferedItem/SetBufferedUpdateInterval unimplemented there is no way to set buffered eventing up.

1.2 — Constraint-denial audit events drop CorrelationId — RESOLVED

  • Resolved (8415f35, 55526d5): request.ClientCorrelationId is threaded from InvokeApplyConstraintsAsync → the six filter helpers → IConstraintEnforcer.RecordDenialAsync (new string? correlationId param); the TODO(Task 2.3) is gone. The audit CorrelationId column is Guid?, so a GUID-parseable id is stored typed; and the raw string is always preserved in the audit record's DetailsJson["clientCorrelationId"] — this matters because the Rust client sends non-GUID ids (rust-client-<op>-<n>) on all traffic and Python/Java default to empty, which would otherwise have left the typed column null. An end-to-end test asserts the value propagates through Invoke.
  • Original finding: denied-operation audit records always wrote CorrelationId = null (ConstraintEnforcer.cs:134-136,147); threading needed an IConstraintEnforcer signature change.

🔴 1.3 — provider_switches{from,to,reason} counter never exercised live

  • Location: metric emitted in Alarms/GatewayAlarmMonitor.cs (failover/failback path); residual recorded in docs/plans/2026-06-14-deferred-followups.md:124-125.
  • Evidence: "that counter's live exercise remains the one gap; record it explicitly rather than claiming coverage." Unit-tested (Tests-032) but the dev rig can't drive a real alarmmgr→subtag failover (project_rig_alarms_object_driven), so the counter's reason tagging is unproven in production.

🟠 1.4 — Worker 8-arg alarm ack silently discards operator domain / full name

  • Location: src/ZB.MOM.WW.MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:261-278 (_ = ackOperatorDomain; _ = ackOperatorFullName;).
  • Evidence: "the IwwAlarmConsumer2 8-arg AlarmAckByName returns -55 on this AVEVA build (looks like a stub) … fields are accepted by the proto for forward-compat but are not propagated to AVEVA today."
  • Impact: two contract fields are accepted on the wire and silently not delivered. Root cause is the vendor stub (see 3.5), but the drop is currently invisible to callers.

2. Intentional v1 scope decisions (deliberately deferred — catalog)

These are documented, deliberate, and mostly enforced. Listed so the deferred surface is in one place — none are bugs. Canonical register: docs/DesignDecisions.md:466-474 ("Later Revisit Items") + gateway.md "Post-v1 revisit items".

  • 🔵 Reconnectable sessions — not in v1. docs/DesignDecisions.md:63-73, gateway.md:1087,1101.
  • 🔵 Multi-event-subscriber fan-outplumbed but blocked. The option flows all the way to Sessions/GatewaySession.cs:387-408 AttachEventSubscriber(allowMultipleSubscribers), but Configuration/GatewayOptionsValidator.cs:181-185 hard-rejects the only enabling value: "AllowMultipleEventSubscribers is not supported until event fan-out is implemented." So the fan-out code path never runs. docs/DesignDecisions.md:75-80.
  • 🔵 Gateway restart does not reattach orphan workers — terminates them on startup. docs/DesignDecisions.md:65-69, CLAUDE.md.
  • 🔵 Workers run as the gateway service identity — restricted service account is a reserved extension point. docs/DesignDecisions.md:179-184.
  • 🔵 Fail-fast event backpressure, no coalescing — opt-in coalescing is post-v1. docs/DesignDecisions.md:187-203.
  • 🔵 No public command batchingdocs/DesignDecisions.md:206-212.
  • 🔵 API-key admin is a local CLI only — no public admin RPC. docs/DesignDecisions.md:308-323.
  • 🔵 No Blazor UI component libraries — hard constraint. docs/DesignDecisions.md:342-358.
  • 🔵 Lazy browse is wire-only — no lazy SQL / cache loading. docs/DesignDecisions.md:365-376, docs/plans/2026-05-28-lazy-browse-design.md:30.
  • 🔵 No server-side / streaming browse searchdocs/plans/2026-05-28-lazy-browse-design.md:208.
  • 🔵 Alarm command surface is ack + query only — no Clear/Disable/Enable/Silence/Shelve/Inhibit; matches the MXAccess alarm-client set. Worker/MxAccess/AlarmCommandHandler.cs, shelve/suppress out of scope per docs/AlarmClientDiscovery.md:60-66.
  • 🔵 Dashboard EventsHub has no per-session ACL — any authenticated dashboard user may subscribe to any session group. Dashboard/Hubs/EventsHub.cs:36-50 (TODO(per-session-acl)); only relevant once a per-session role model exists.

3. MXAccess parity — open questions & vendor-gated items

Intentionally incomplete, awaiting a live capture or an AVEVA fix; raw payload/metadata is preserved in the meantime (no synthesis).

  • 🟡 3.1 OperationComplete native trigger condition unknown — modeled and emitted only from the real event (no synthesis), but the runtime condition that fires it isn't captured. docs/DesignDecisions.md:280-289, gateway.md:1094, docs/MxAccessWorkerInstanceDesign.md:341,366.
  • 🟡 3.2 OnBufferedDataChange multi-sample conversion unvalidated — STILL OPEN (residual after B8)AddBufferedItem/SetBufferedUpdateInterval are now implemented and live-confirmed (§1.1), and the live B8 test (f7ada90) confirms the worker receives and cleanly converts the empty NoData bootstrap OnBufferedDataChange (no crash, no dropped payload). But the rig's object logic does not drive a multi-sample buffered batch on demand (same limitation as the alarm rig), so a real parallel quality/timestamp sample array (length > 1) has never been observed live — it is exercised only by the B-bundle unit tests against a fake IMxAccessServer. Re-run GatewaySession_WithLiveWorker_BufferedItem_* against a fast-changing simulated tag to close this. docs/DesignDecisions.md:291-297.
  • 🟡 3.3 Completion-only status → MXSTATUS_PROXY[] mapping unproven — completion-only operation-status bytes are kept as raw diagnostic metadata until the analysis proves an exact mapping. docs/DesignDecisions.md:299-306.
  • 🟡 3.4 AlarmAckByGUID is E_NOTIMPL on this AVEVA build — throws NotImplementedException; all acks route through AlarmAckByName. Proto/worker keep the path for forward-compat but it is dead today. docs/AlarmClientDiscovery.md:750-763.
  • 🟡 3.5 8-arg AlarmAckByName v2 is a vendor stub (returns -55) — worker uses the 6-arg method; the 8-arg domain/full_name fields are carried for forward-compat only (see 1.4). docs/AlarmClientDiscovery.md:743-748.
  • 🟡 3.6 Subtag degraded-mode fidelity limitscategory, description, alarm_type_name, operator fields, and retrigger are not populated/synthesized in subtag fallback (no subtag exists for them). Documented, by design. docs/AlarmClientDiscovery.md:913-931, docs/plans/2026-06-13-alarm-subtag-fallback-design.md:292-298.
  • 🟡 3.7 Subtag Clear transition unvalidated live — Raise/Ack/AckMsg are live-confirmed; Clear is externally undrivable on the rig (object logic owns alarm state). Environmental, not code. (project_alarm_subtag_fallback, project_rig_alarms_object_driven.)

4. Clients — gaps & cross-client parity

Library RPC surface is at full parity: all gateway + GalaxyRepository RPCs and the LazyBrowseNode helper exist in all five clients, with no TODO/stub/not-implemented markers in production code. The CLI/helper gaps below are RESOLVED.

Capability Dotnet Go Python Rust Java
Write2 single session helper 849f1d2
ping CLI subcommand 90529dc 0d5b488
version CLI subcommand (already worked)
galaxy-* CLI commands (4) a211fae
galaxy-browse / BrowseChildren CLI (5/5)
  • 4.1 Go single Write2 helper — RESOLVED (849f1d2): added Write2/Write2Raw to clients/go/mxgateway/session.go, matching the other four clients' signature.
  • 4.2 Python galaxy-* CLI commands — RESOLVED (a211fae, a59fc99): added galaxy-test-connection/galaxy-last-deploy/galaxy-discover/galaxy-watch Click commands wrapping galaxy.py; README corrected. (Fixed a UTC-offset bug in last-deploy output during review.)
  • 4.3 ping CLI added to Go + Java — RESOLVED (Go 90529dc/742ced7, Java 0d5b488).
  • 4.4 version CLI in Dotnet — NOT MISSING (audit correction): the dotnet version subcommand already worked (MxGatewayClientCli.cs:85WriteVersion, prints gateway/worker protocol versions). The original audit was wrong. Minor: unlike Go, dotnet's version omits a client-package-version line (MxGatewayClientContractInfo exposes only the two protocol versions) — cosmetic, not tracked.
  • 4.5 Galaxy CLI command-name divergence — RESOLVED (Java 0d5b488): galaxy-test-connection/galaxy-last-deploy are now the canonical Java names, with galaxy-test/galaxy-deploy-time kept as deprecated picocli aliases so existing scripts don't break. (Rust keeps its galaxy <subcommand> group style — a clap structural choice, not a name divergence.)
  • 4.6 browse/BrowseChildren CLI — RESOLVED, 0/5 → 5/5 (Rust 639e36b, Go 8cb416b, Python 39ec2a3, dotnet d7e2a8b, Java 0d5b488). All five emit the per-node JSON key hasChildrenHint (unified during review). Minor residual divergence: dotnet nests the Galaxy object fields under an object key while Go/Rust/Python/Java flatten them — both carry hasChildrenHint + a nested children array; harmonizing the object nesting is a cosmetic follow-up, not tracked.
  • 4.7 No typed wrappers for the rarer commandsAuthenticateUser, ArchestrAUserToId, AddBufferedItem, Suspend/Activate, GetSessionState/GetWorkerInfo/DrainEvents/ShutdownWorker remain reachable via the generic Invoke/invoke_raw escape hatch in all five clients (consistent and deliberate; the worker-side commands are now implemented per §1.1, but no client adds dedicated typed wrappers — out of scope, the CLIs that needed them got ping/browse subcommands).

5. Verification gaps (code exists, unverified by default)

All live/integration paths are opt-in; the default unit suites do not exercise them.

  • Live MXAccess COM + STA + message pumpWorker.Tests/MxAccess/MxAccessLiveComCreationTests.cs (5 [LiveMxAccessFact]), gated MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1.
  • Live gateway↔worker↔MXAccess round-tripIntegrationTests/WorkerLiveMxAccessSmokeTests.cs (6 [LiveMxAccessFact]).
  • Live Galaxy Repository SQLIntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs (4 [LiveGalaxyRepositoryFact]), gated MXGATEWAY_RUN_LIVE_GALAXY_TESTS=1.
  • Live LDAP dashboard authIntegrationTests/DashboardLdapLiveTests.cs (5 [LiveLdapFact]), gated MXGATEWAY_RUN_LIVE_LDAP_TESTS=1.
  • Alarm runtime/discovery probes (dev-rig)Worker.Tests/Probes/{WnWrapConsumerProbeTests,AlarmClientWmProbeTests}.cs, AlarmClientDiscoveryTests.cs — hard [Fact(Skip=...)].
  • Live alarm + subtag-fallback smokeWorker.Tests/Probes/{AlarmSubtagLiveSmokeTests,AlarmsLiveSmokeTests}.csSkip + one [LiveMxAccessFact]; Clear path remains undrivable even when enabled.
  • Python loopback TLSclients/python/tests/test_tls.py:111-112 — gated MXGATEWAY_RUN_TLS_TESTS=1 + openssl; only cert-config parsing runs by default.
  • .NET client live browse smokeclients/dotnet/.../BrowseChildrenSmokeTests.cs:17-18 — hard [Fact(Skip=...)].
  • Cross-language client↔gateway wire behavior — no per-client integration unit tests; only scripts/run-client-e2e-tests.ps1 against a live gateway (MXGATEWAY_INTEGRATION=1). All client wire behavior is unverified in default unit runs.

No placeholder/empty/Assert.True(true) tests were found anywhere.


6. Config-gated functional gaps (work only after configuration)

  • 🟠 6.1 Alarm ack in subtag mode requires AckComment subtag configured — empty by default; ack fails in subtag mode until set. Names must be validated against live MXAccess, not guessed. docs/DesignDecisions.md:454-458. (AckCommentSubtag is write-only; Worker/MxAccess/SubtagAlarmStateMachine.cs:21.)
  • 🔵 6.2 Multi-subscriber — see 2 (option exists, validator-blocked).

7. Stale docs, dead code, accepted gaps

  • 📄 7.1 D1 plan header staledocs/plans/2026-06-14-deferred-followups.md:4 still says "Plan only — NOT yet executed," but D1 is done (Dashboard/DashboardSnapshotService.cs:198, commit 4af24b9). Update the plan status.
  • 📄 7.2 AlarmClientDiscovery.md STA "production fix needed" prose is staledocs/AlarmClientDiscovery.md:765-774 reads as a pending follow-up, but alarms now run through the worker STA / GatewayAlarmMonitor (merged). Re-check against current code.
  • 📄 7.3 EventsHub "publisher side is a follow-up" comment is staleDashboard/Hubs/EventsHub.cs:9-17; the DashboardEventBroadcaster exists, is DI-registered (Dashboard/DashboardServiceCollectionExtensions.cs:47), runs in the live loop (Grpc/EventStreamService.cs:133), and SessionDetailsPage.razor renders the feed.
  • 📄 7.4 CLAUDE.md project-name drift — CLAUDE.md uses src/MxGateway.Server/MxGateway.Tests; the actual tree is src/ZB.MOM.WW.MxGateway.*. Misleads path-based work.
  • 7.5 Dead MapSqlException helperGrpc/GalaxyRepositoryGrpcService.cs:350-360, IDE0051-suppressed, kept for a hypothetical direct-SQL path that doesn't exist.
  • 7.6 Accepted code-review gaps (Won't Fix, by design):
    • Client.Python-012Session.invoke_raw deliberately skips ensure_mxaccess_success, so an embedded MXAccess HRESULT failure surfaces silently (raw-parity inspection). code-reviews/Client.Python/findings.md:290.
    • Contracts-003 — closed as not-a-defect. code-reviews/Contracts/findings.md.
    • (All 351 review findings are otherwise Resolved; none Open or Deferred.)

8. Deferred test-coverage follow-ups (noted in resolutions, never filed as findings)

  • Java CLI bulk-subcommand coverage — 6 of 13 non-trivial subcommands untested: read-bulk, write-bulk, write2-bulk, write-secured-bulk, write-secured2-bulk, bench-read-bulk (plus stream-events, the four galaxy-*, close-session). code-reviews/Client.Java/findings.md:495 (Client.Java-026).
  • Per-session-ACL TODO at Server/Dashboard/Hubs/EventsHub.cs (code-reviews/Server/findings.md:765).
  • Worker-Ready retry race noted at code-reviews/Server/findings.md:611.
  • Duplicated FakeWorkerProcess harness flagged as a latent regression vector — code-reviews/Tests/findings.md:463.

Bottom line

Status after feat/stillpending-completion (2026-06-15): the net-new functionality is done§1.1 (all 11 worker command kinds, COM half live-verified on the rig), §1.2 (audit CorrelationId, with a raw-string fallback for non-GUID clients), and the entire §4 client CLI/helper parity surface (Write2, ping, galaxy-*, galaxy-browse 5/5, name aliases). Doc hygiene §7 is done (0032d2d, bd46ba1). Zero .proto changes were required.

Still open (all deliberate or environment/vendor-gated):

  • §1.3provider_switches counter still only unit-tested; the dev rig can't drive a real alarmmgr→subtag failover, so live reason-tag coverage remains a recorded residual.
  • §1.4 / §3.4 / §3.5 — the AVEVA 8-arg AlarmAckByName is a vendor stub (55) and AlarmAckByGUID is E_NOTIMPL; the domain/full_name fields stay forward-compat-only until AVEVA implements them.
  • §3.2 — buffered commands work and the empty bootstrap converts cleanly live, but a multi-sample buffered batch is undrivable on the rig (unit-tested only).
  • §3.1 / §3.3 / §3.6 / §3.7 — await live MXAccess captures.
  • §2 — deliberate v1 scope. §5 — opt-in verification gates. §7.6 — accepted Won't Fix review findings.

MXAccess event/data/value/write mapping, the Galaxy RPC surface, and now the full command surface are complete; no NotImplementedExceptions, stubbed RPC bodies, or empty tests remain in the production paths.