24 KiB
Still Pending — Deferred / Partial / Unfinished / Missing Functionality
Generated: 2026-06-15 · Commit: c7f754c (main) · Method: six parallel read-only audits (Server, Worker, Contracts/proto, all five clients, docs/design/plans, tests + review backlog). Every item cites a verified file:line.
Resolution update (2026-06-15, branch
feat/stillpending-completion): The actionable items were implemented and verified perdocs/plans/2026-06-15-stillpending-completion.md. §1.1 (all 11 worker command kinds), §1.2 (audit CorrelationId), and the §4 client CLI/helper parity gaps are Resolved — see per-item annotations below. Worker COM commands are live-verified on the dev rig (efd9971,f7ada90). Remaining open items are the documented residuals (§1.3, §1.4, the §3 vendor/capture-gated questions incl. the new §3.2 multi-sample buffered residual) and the deliberate v1 scope of §2. Zero.protochanges were needed (all reply messages already existed).Resolution update (2026-06-16,
main): The session-resilience epic (docs/plans/2026-06-15-session-resilience.md) landed its first 12 of 28 tasks onmain(through mergec446bef), which moves several §2 items off "deferred": multi-subscriber fan-out is now Resolved, and reconnectable sessions has its server-side core (detach-grace window + replay-on-reconnect + a newReplayGapsignal onMxEvent). Still pending in that epic: reconnect Tasks 13–15 (owner re-validation, clientReplayGaphandling, integration test), Phase 4 per-session dashboard ACL (Tasks 16–19 — the §2/§7.6/§8 EventsHub-ACL item), and Phase 5 orphan-worker reattach (Tasks 20–28). Separately, aMxGateway:Dashboard:DisableLogindev flag shipped (ca443b1) — auto-authenticates the dashboard as a multi-role admin; default off, enabled on the 10.100.0.48 deployment. Per-item §2 status annotated below; remaining epic tasks tracked inoldtasks.md.
How to read this
Items are graded by what they actually are, because most "pending" surface in this codebase is deliberate v1 scope, not accidental:
- 🔴 Genuine gap — real unfinished/missing functionality with user-visible impact; a candidate to actually build.
- 🟠 Parity hole — declared in-scope (proto/design) but not wired through; breaks "MXAccess parity" or cross-client parity.
- 🟡 Open question / vendor-gated — intentionally incomplete, awaiting a live MXAccess capture or an AVEVA fix; raw data preserved meanwhile.
- 🔵 Intentional v1 scope — deliberately deferred and documented; listed so it's catalogued, not because it's broken.
- ⚪ Verification gap — code exists but is unverified by default (opt-in/live tests).
- 📄 Stale doc / dead code — prose or code that lags reality.
1. Genuine gaps (real unfinished functionality)
✅ 1.1 — Worker does not implement 11 declared command kinds (biggest real gap) — RESOLVED
- Resolved: all 11 now implemented. The 5 control commands (
Ping,GetSessionState,GetWorkerInfo,DrainEvents,ShutdownWorker) are handled inWorkerPipeSession(off-STA —ShutdownWorkeron the STA would deadlock, and these read pipe-session state) —bf72cd8. The 6 COM commands (Suspend,Activate,AuthenticateUser,ArchestrAUserToId,AddBufferedItem,SetBufferedUpdateInterval) are implemented inMxAccessCommandExecutor(STA-dispatched) via newIMxAccessServer/MxAccessComServerwrappers selectingILMXProxyServer2/4/5—2939932. Live-verified on the dev rig (efd9971,f7ada90):ArchestrAUserToId→Ok(user_id=1),AddBufferedItem/SetBufferedUpdateInterval→Ok;AuthenticateUser/Suspend/Activate→realMxaccessFailure/HResult (parity, notINVALID_REQUEST).FakeWorkerHarnessnow answers the control commands so the default gateway suite covers them (bb5139f). Note:DrainEventsis a diagnostic snapshot — it competes with the 25 ms background stream-drain loop, so with an active event stream it returns only events not yet pushed (no loss/double-drain; the queue drain is lock-protected and destructive). - Original finding below (for history):
- Location:
src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:97-128(theExecuteswitch; everything else falls to_ => CreateInvalidRequestReply(...)). - What's missing: the proto
MxCommandKindenum defines these, the gateway validates (Grpc/MxAccessGrpcRequestValidator.cs:86-95), scopes (Security/Authorization/GatewayGrpcScopeResolver.cs:45-47), and routes them, and reply messages exist — but the worker answersINVALID_REQUEST:- MXAccess COM commands (parity-critical):
AddBufferedItem,SetBufferedUpdateInterval,Suspend,Activate,AuthenticateUser,ArchestrAUserToId(mxaccess_gateway.proto:107-116,151-160). - Worker control/lifecycle:
Ping,GetSessionState,GetWorkerInfo,DrainEvents,ShutdownWorker(mxaccess_gateway.proto:133-137,177-181).
- MXAccess COM commands (parity-critical):
- Why it matters:
gateway.md:890-899,docs/MxAccessWorkerInstanceDesign.md:424-433, anddocs/DesignDecisions.md:169-173list all six COM commands under "Phase 4: Full Command Surface" with the exit criterion "gRPC command surface covers the installed MXAccess public method set." So this is a declared-in-scope phase that isn't finished, not a v1 cut. - Masked by tests: the control kinds (
Ping/GetWorkerInfo/DrainEvents/ShutdownWorker) are exercised only throughFakeWorkerHarness, so the unit suite is green while the real worker can't answer them. The live integration testWorkerLiveMxAccessSmokeTests.cs:931even sendsAuthenticateUser, which would get an invalid-request reply today. - Note:
OnBufferedDataChangeevent mapping IS fully wired (Conversion/MxAccessEventMapper.cs:231-254) — but withAddBufferedItem/SetBufferedUpdateIntervalunimplemented there is no way to set buffered eventing up.
✅ 1.2 — Constraint-denial audit events drop CorrelationId — RESOLVED
- Resolved (
8415f35,55526d5):request.ClientCorrelationIdis threaded fromInvoke→ApplyConstraintsAsync→ the six filter helpers →IConstraintEnforcer.RecordDenialAsync(newstring? correlationIdparam); theTODO(Task 2.3)is gone. The auditCorrelationIdcolumn isGuid?, so a GUID-parseable id is stored typed; and the raw string is always preserved in the audit record'sDetailsJson["clientCorrelationId"]— this matters because the Rust client sends non-GUID ids (rust-client-<op>-<n>) on all traffic and Python/Java default to empty, which would otherwise have left the typed column null. An end-to-end test asserts the value propagates throughInvoke. - Original finding: denied-operation audit records always wrote
CorrelationId = null(ConstraintEnforcer.cs:134-136,147); threading needed anIConstraintEnforcersignature change.
🔴 1.3 — provider_switches{from,to,reason} counter never exercised live
- Location: metric emitted in
Alarms/GatewayAlarmMonitor.cs(failover/failback path); residual recorded indocs/plans/2026-06-14-deferred-followups.md:124-125. - Evidence: "that counter's live exercise remains the one gap; record it explicitly rather than claiming coverage." Unit-tested (Tests-032) but the dev rig can't drive a real alarmmgr→subtag failover (
project_rig_alarms_object_driven), so the counter'sreasontagging is unproven in production.
🟠 1.4 — Worker 8-arg alarm ack silently discards operator domain / full name
- Location:
src/ZB.MOM.WW.MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:261-278(_ = ackOperatorDomain; _ = ackOperatorFullName;). - Evidence: "the IwwAlarmConsumer2 8-arg AlarmAckByName returns -55 on this AVEVA build (looks like a stub) … fields are accepted by the proto for forward-compat but are not propagated to AVEVA today."
- Impact: two contract fields are accepted on the wire and silently not delivered. Root cause is the vendor stub (see 3.5), but the drop is currently invisible to callers.
2. Intentional v1 scope decisions (deliberately deferred — catalog)
These are documented, deliberate, and mostly enforced. Listed so the deferred surface is in one place — none are bugs. Canonical register: docs/DesignDecisions.md:466-474 ("Later Revisit Items") + gateway.md "Post-v1 revisit items".
- 🟡 Reconnectable sessions — server-side core landed, not yet complete (2026-06-16,
main). Epic Phase 3 added the detach-grace retention window, replay-on-reconnect from the bounded replay ring, and theReplayGapsignal onMxEvent(Tasks 10–12, merged). Still pending: owner re-validation on reconnect (Task 13), clientReplayGaphandling across all five clients (Task 14), and the fake-worker reconnect integration test (Task 15) — so reconnect is not safe to rely on end-to-end yet. Overturnsdocs/DesignDecisions.md:63-73. Seedocs/plans/2026-06-15-session-resilience.mdPhase 3. - ✅ Multi-event-subscriber fan-out — RESOLVED (2026-06-16,
main). Epic Phase 2 (Tasks 7–9) removed theGatewayOptionsValidatorrejection, added aMaxEventSubscribersPerSessioncap (default 8), and built fan-out on the newSessionEventDistributor(single pump → N bounded per-subscriber channels with per-subscriber backpressure), with FakeWorkerHarness end-to-end coverage. Was: validator-blocked.docs/plans/2026-06-15-session-resilience.mdPhase 2. - 🟡 Gateway restart does not reattach orphan workers — still true on
main, planned (epic Phase 5, not started). Workers are still terminated on startup. Epic Phase 5 (Tasks 20–28) is designed to reverse this — stable pipe naming, a SQLite adoption manifest, a worker adopt/reconnect proto frame, and nonce-validated re-adoption behind anEnableOrphanReattachflag (default off). Reverses the hard rule indocs/DesignDecisions.md:65-69andCLAUDE.md.docs/plans/2026-06-15-session-resilience.mdPhase 5. - 🔵 Workers run as the gateway service identity — restricted service account is a reserved extension point.
docs/DesignDecisions.md:179-184. - 🔵 Fail-fast event backpressure, no coalescing — opt-in coalescing is post-v1.
docs/DesignDecisions.md:187-203. - 🔵 No public command batching —
docs/DesignDecisions.md:206-212. - 🔵 API-key admin is a local CLI only — no public admin RPC.
docs/DesignDecisions.md:308-323. - 🔵 No Blazor UI component libraries — hard constraint.
docs/DesignDecisions.md:342-358. - 🔵 Lazy browse is wire-only — no lazy SQL / cache loading.
docs/DesignDecisions.md:365-376,docs/plans/2026-05-28-lazy-browse-design.md:30. - 🔵 No server-side / streaming browse search —
docs/plans/2026-05-28-lazy-browse-design.md:208. - 🔵 Alarm command surface is ack + query only — no Clear/Disable/Enable/Silence/Shelve/Inhibit; matches the MXAccess alarm-client set.
Worker/MxAccess/AlarmCommandHandler.cs, shelve/suppress out of scope perdocs/AlarmClientDiscovery.md:60-66. - 🟡 Dashboard EventsHub has no per-session ACL — still true on
main, planned (epic Phase 4, not started). Any authenticated dashboard user may still subscribe to any session group (Dashboard/Hubs/EventsHub.csTODO(per-session-acl)). The enabling foundation (sessionOwnerKeyId) already merged in epic Phase 1; epic Phase 4 (Tasks 16–19) adds the gRPC session-owner gate, a session tag + group-to-tag config, and EventsHub per-session ACL with a hub-token tag claim.docs/plans/2026-06-15-session-resilience.mdPhase 4. (See also §8.)
3. MXAccess parity — open questions & vendor-gated items
Intentionally incomplete, awaiting a live capture or an AVEVA fix; raw payload/metadata is preserved in the meantime (no synthesis).
- 🟡 3.1
OperationCompletenative trigger condition unknown — modeled and emitted only from the real event (no synthesis), but the runtime condition that fires it isn't captured.docs/DesignDecisions.md:280-289,gateway.md:1094,docs/MxAccessWorkerInstanceDesign.md:341,366. - 🟡 3.2
OnBufferedDataChangemulti-sample conversion unvalidated — STILL OPEN (residual after B8) —AddBufferedItem/SetBufferedUpdateIntervalare now implemented and live-confirmed (§1.1), and the live B8 test (f7ada90) confirms the worker receives and cleanly converts the emptyNoDatabootstrapOnBufferedDataChange(no crash, no dropped payload). But the rig's object logic does not drive a multi-sample buffered batch on demand (same limitation as the alarm rig), so a real parallel quality/timestamp sample array (length > 1) has never been observed live — it is exercised only by the B-bundle unit tests against a fakeIMxAccessServer. Re-runGatewaySession_WithLiveWorker_BufferedItem_*against a fast-changing simulated tag to close this.docs/DesignDecisions.md:291-297. - 🟡 3.3 Completion-only status →
MXSTATUS_PROXY[]mapping unproven — completion-only operation-status bytes are kept as raw diagnostic metadata until the analysis proves an exact mapping.docs/DesignDecisions.md:299-306. - 🟡 3.4
AlarmAckByGUIDisE_NOTIMPLon this AVEVA build — throwsNotImplementedException; all acks route throughAlarmAckByName. Proto/worker keep the path for forward-compat but it is dead today.docs/AlarmClientDiscovery.md:750-763. - 🟡 3.5 8-arg
AlarmAckByNamev2 is a vendor stub (returns -55) — worker uses the 6-arg method; the 8-argdomain/full_namefields are carried for forward-compat only (see 1.4).docs/AlarmClientDiscovery.md:743-748. - 🟡 3.6 Subtag degraded-mode fidelity limits —
category,description,alarm_type_name, operator fields, andretriggerare not populated/synthesized in subtag fallback (no subtag exists for them). Documented, by design.docs/AlarmClientDiscovery.md:913-931,docs/plans/2026-06-13-alarm-subtag-fallback-design.md:292-298. - 🟡 3.7 Subtag
Cleartransition unvalidated live — Raise/Ack/AckMsg are live-confirmed; Clear is externally undrivable on the rig (object logic owns alarm state). Environmental, not code. (project_alarm_subtag_fallback,project_rig_alarms_object_driven.)
4. Clients — gaps & cross-client parity
Library RPC surface is at full parity: all gateway + GalaxyRepository RPCs and the LazyBrowseNode helper exist in all five clients, with no TODO/stub/not-implemented markers in production code. The CLI/helper gaps below are RESOLVED.
| Capability | Dotnet | Go | Python | Rust | Java |
|---|---|---|---|---|---|
Write2 single session helper |
✅ | ✅ 849f1d2 |
✅ | ✅ | ✅ |
ping CLI subcommand |
✅ | ✅ 90529dc |
✅ | ✅ | ✅ 0d5b488 |
version CLI subcommand |
✅ (already worked) | ✅ | ✅ | ✅ | ✅ |
galaxy-* CLI commands (4) |
✅ | ✅ | ✅ a211fae |
✅ | ✅ |
galaxy-browse / BrowseChildren CLI |
✅ | ✅ | ✅ | ✅ | ✅ (5/5) |
- ✅ 4.1 Go single
Write2helper — RESOLVED (849f1d2): addedWrite2/Write2Rawtoclients/go/mxgateway/session.go, matching the other four clients' signature. - ✅ 4.2 Python
galaxy-*CLI commands — RESOLVED (a211fae,a59fc99): addedgalaxy-test-connection/galaxy-last-deploy/galaxy-discover/galaxy-watchClick commands wrappinggalaxy.py; README corrected. (Fixed a UTC-offset bug in last-deploy output during review.) - ✅ 4.3
pingCLI added to Go + Java — RESOLVED (Go90529dc/742ced7, Java0d5b488). - ✅ 4.4
versionCLI in Dotnet — NOT MISSING (audit correction): the dotnetversionsubcommand already worked (MxGatewayClientCli.cs:85→WriteVersion, prints gateway/worker protocol versions). The original audit was wrong. Minor: unlike Go, dotnet'sversionomits a client-package-version line (MxGatewayClientContractInfoexposes only the two protocol versions) — cosmetic, not tracked. - ✅ 4.5 Galaxy CLI command-name divergence — RESOLVED (Java
0d5b488):galaxy-test-connection/galaxy-last-deployare now the canonical Java names, withgalaxy-test/galaxy-deploy-timekept as deprecated picocli aliases so existing scripts don't break. (Rust keeps itsgalaxy <subcommand>group style — a clap structural choice, not a name divergence.) - ✅ 4.6
browse/BrowseChildrenCLI — RESOLVED, 0/5 → 5/5 (Rust639e36b, Go8cb416b, Python39ec2a3, dotnetd7e2a8b, Java0d5b488). All five emit the per-node JSON keyhasChildrenHint(unified during review). Minor residual divergence: dotnet nests the Galaxy object fields under anobjectkey while Go/Rust/Python/Java flatten them — both carryhasChildrenHint+ a nestedchildrenarray; harmonizing the object nesting is a cosmetic follow-up, not tracked. - ⚪ 4.7 No typed wrappers for the rarer commands —
AuthenticateUser,ArchestrAUserToId,AddBufferedItem,Suspend/Activate,GetSessionState/GetWorkerInfo/DrainEvents/ShutdownWorkerremain reachable via the genericInvoke/invoke_rawescape hatch in all five clients (consistent and deliberate; the worker-side commands are now implemented per §1.1, but no client adds dedicated typed wrappers — out of scope, the CLIs that needed them gotping/browsesubcommands).
5. Verification gaps (code exists, unverified by default)
All live/integration paths are opt-in; the default unit suites do not exercise them.
- ⚪ Live MXAccess COM + STA + message pump —
Worker.Tests/MxAccess/MxAccessLiveComCreationTests.cs(5[LiveMxAccessFact]), gatedMXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1. - ⚪ Live gateway↔worker↔MXAccess round-trip —
IntegrationTests/WorkerLiveMxAccessSmokeTests.cs(6[LiveMxAccessFact]). - ⚪ Live Galaxy Repository SQL —
IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs(4[LiveGalaxyRepositoryFact]), gatedMXGATEWAY_RUN_LIVE_GALAXY_TESTS=1. - ⚪ Live LDAP dashboard auth —
IntegrationTests/DashboardLdapLiveTests.cs(5[LiveLdapFact]), gatedMXGATEWAY_RUN_LIVE_LDAP_TESTS=1. - ⚪ Alarm runtime/discovery probes (dev-rig) —
Worker.Tests/Probes/{WnWrapConsumerProbeTests,AlarmClientWmProbeTests}.cs,AlarmClientDiscoveryTests.cs— hard[Fact(Skip=...)]. - ⚪ Live alarm + subtag-fallback smoke —
Worker.Tests/Probes/{AlarmSubtagLiveSmokeTests,AlarmsLiveSmokeTests}.cs—Skip+ one[LiveMxAccessFact]; Clear path remains undrivable even when enabled. - ⚪ Python loopback TLS —
clients/python/tests/test_tls.py:111-112— gatedMXGATEWAY_RUN_TLS_TESTS=1+ openssl; only cert-config parsing runs by default. - ⚪ .NET client live browse smoke —
clients/dotnet/.../BrowseChildrenSmokeTests.cs:17-18— hard[Fact(Skip=...)]. - ⚪ Cross-language client↔gateway wire behavior — no per-client integration unit tests; only
scripts/run-client-e2e-tests.ps1against a live gateway (MXGATEWAY_INTEGRATION=1). All client wire behavior is unverified in default unit runs.
No placeholder/empty/Assert.True(true) tests were found anywhere.
6. Config-gated functional gaps (work only after configuration)
- 🟠 6.1 Alarm ack in subtag mode requires
AckCommentsubtag configured — empty by default; ack fails in subtag mode until set. Names must be validated against live MXAccess, not guessed.docs/DesignDecisions.md:454-458. (AckCommentSubtagis write-only;Worker/MxAccess/SubtagAlarmStateMachine.cs:21.) - ✅ 6.2 Multi-subscriber — RESOLVED (validator block removed; fan-out implemented — see §2).
7. Stale docs, dead code, accepted gaps
- 📄 7.1 D1 plan header stale —
docs/plans/2026-06-14-deferred-followups.md:4still says "Plan only — NOT yet executed," but D1 is done (Dashboard/DashboardSnapshotService.cs:198, commit4af24b9). Update the plan status. - 📄 7.2
AlarmClientDiscovery.mdSTA "production fix needed" prose is stale —docs/AlarmClientDiscovery.md:765-774reads as a pending follow-up, but alarms now run through the worker STA /GatewayAlarmMonitor(merged). Re-check against current code. - 📄 7.3 EventsHub "publisher side is a follow-up" comment is stale —
Dashboard/Hubs/EventsHub.cs:9-17; theDashboardEventBroadcasterexists, is DI-registered (Dashboard/DashboardServiceCollectionExtensions.cs:47), runs in the live loop (Grpc/EventStreamService.cs:133), andSessionDetailsPage.razorrenders the feed. - 📄 7.4 CLAUDE.md project-name drift — CLAUDE.md uses
src/MxGateway.Server/MxGateway.Tests; the actual tree issrc/ZB.MOM.WW.MxGateway.*. Misleads path-based work. - ⚪ 7.5 Dead
MapSqlExceptionhelper —Grpc/GalaxyRepositoryGrpcService.cs:350-360, IDE0051-suppressed, kept for a hypothetical direct-SQL path that doesn't exist. - 7.6 Accepted code-review gaps (
Won't Fix, by design):Client.Python-012—Session.invoke_rawdeliberately skipsensure_mxaccess_success, so an embedded MXAccess HRESULT failure surfaces silently (raw-parity inspection).code-reviews/Client.Python/findings.md:290.Contracts-003— closed as not-a-defect.code-reviews/Contracts/findings.md.- (All 351 review findings are otherwise Resolved; none Open or Deferred.)
8. Deferred test-coverage follow-ups (noted in resolutions, never filed as findings)
- Java CLI bulk-subcommand coverage — 6 of 13 non-trivial subcommands untested:
read-bulk,write-bulk,write2-bulk,write-secured-bulk,write-secured2-bulk,bench-read-bulk(plusstream-events, the fourgalaxy-*,close-session).code-reviews/Client.Java/findings.md:495(Client.Java-026). - Per-session-ACL TODO at
Server/Dashboard/Hubs/EventsHub.cs(code-reviews/Server/findings.md:765) — now scheduled as session-resilience epic Phase 4 (Tasks 16–19); not yet started. - Worker-Ready retry race noted at
code-reviews/Server/findings.md:611. - Duplicated
FakeWorkerProcessharness flagged as a latent regression vector —code-reviews/Tests/findings.md:463.
Bottom line
Status after feat/stillpending-completion (2026-06-15): the net-new functionality is done — §1.1 (all 11 worker command kinds, COM half live-verified on the rig), §1.2 (audit CorrelationId, with a raw-string fallback for non-GUID clients), and the entire §4 client CLI/helper parity surface (Write2, ping, galaxy-*, galaxy-browse 5/5, name aliases). Doc hygiene §7 is done (0032d2d, bd46ba1). Zero .proto changes were required.
Still open (all deliberate or environment/vendor-gated):
- §1.3 —
provider_switchescounter still only unit-tested; the dev rig can't drive a real alarmmgr→subtag failover, so livereason-tag coverage remains a recorded residual. - §1.4 / §3.4 / §3.5 — the AVEVA 8-arg
AlarmAckByNameis a vendor stub (−55) andAlarmAckByGUIDisE_NOTIMPL; thedomain/full_namefields stay forward-compat-only until AVEVA implements them. - §3.2 — buffered commands work and the empty bootstrap converts cleanly live, but a multi-sample buffered batch is undrivable on the rig (unit-tested only).
- §3.1 / §3.3 / §3.6 / §3.7 — await live MXAccess captures.
- §2 — mostly deliberate v1 scope, but the session-resilience epic (12/28 tasks merged to
main) has since resolved multi-subscriber fan-out and landed the reconnect server-side core; reconnect Tasks 13–15, per-session ACL (Phase 4), and orphan-worker reattach (Phase 5) remain (seedocs/plans/2026-06-15-session-resilience.md,oldtasks.md). §5 — opt-in verification gates. §7.6 — acceptedWon't Fixreview findings.
MXAccess event/data/value/write mapping, the Galaxy RPC surface, and now the full command surface are complete; no NotImplementedExceptions, stubbed RPC bodies, or empty tests remain in the production paths.