Files
mxaccessgw/docs/plans/2026-06-15-stillpending-completion-design.md
T
Joseph Doherty 4a00b1bdc1 docs: design for completing stillpending.md actionable items
Covers the 11 worker command kinds (§1.1), audit CorrelationId threading
(§1.2), client CLI/helper parity (§4), and doc hygiene (§7). Key finding:
all 11 commands already have proto/validation/scope/routing, so this is a
worker-executor + COM-wrapper + client-CLI effort with zero contract changes.
2026-06-15 09:32:01 -04:00

7.8 KiB

Still-Pending Completion — Design

Date: 2026-06-15 Source: stillpending.md (audit at commit c7f754c) Branch: feat/stillpending-completion Status: Design approved; implementation plan to follow.

Goal

Close the genuinely actionable items in stillpending.md: the 11 unimplemented worker command kinds (§1.1), audit-record CorrelationId threading (§1.2), the client CLI/helper parity gaps (§4), and the documentation/residual-recording hygiene (§7). Items that are deliberate v1 scope (§2), vendor- or rig-gated (§1.3, §1.4, §3), or opt-in verification gates (§5) are documented, not built.

Key discovery that shapes the work

All 11 worker commands already have proto request and reply messages, gateway-side validation (MxAccessGrpcRequestValidator.cs:86-111), scope mapping (GatewayGrpcScopeResolver.cs:45-54), and generic pass-through routing (MxAccessGatewayService.Invoke). Therefore:

  • No .proto changes — no codegen, no net48 CS0246 regen risk.
  • No gateway routing changes for the 11 commands.
  • The real code surface is: worker executor arms, six new COM-wrapper methods, the constraint-path CorrelationId thread, and client CLI/helper additions.

8 of the 11 have dedicated reply messages; 3 (SetBufferedUpdateInterval, Ping, ShutdownWorker) intentionally return the base OK reply.

Architecture

Two-process design is unchanged. Work lands in five independent workstreams.

Workstream A — Worker control/lifecycle commands (5)

Ping, GetSessionState, GetWorkerInfo, DrainEvents, ShutdownWorker. No COM involved. New arms in MxAccessCommandExecutor.Execute (src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:97-128).

The executor currently holds only COM collaborators, so it gains the runtime state these commands report on:

  • DrainEventsMxAccessEventQueue.Drain(maxEvents) (already exists; maxEvents == 0 drains all) → DrainEventsReply { repeated MxEvent events }.
  • Ping → echoes PingCommand.message; base OK reply.
  • GetSessionStateSessionStateReply { SessionState state } from the runtime/heartbeat snapshot.
  • GetWorkerInfoWorkerInfoReply { worker_process_id, worker_version, mxaccess_progid, mxaccess_clsid } from MxAccessInteropInfo + process info.
  • ShutdownWorker → honor grace_period then signal StaRuntime shutdown; base OK reply.

Where the control arms physically live (inside MxAccessCommandExecutor with injected collaborators, vs. intercepted one layer up where the runtime context already exists) is an implementation decision for the plan; the contract surface is identical either way.

Workstream B — Worker MXAccess COM commands (6)

Suspend, Activate, AuthenticateUser, ArchestrAUserToId, AddBufferedItem, SetBufferedUpdateInterval. Each needs:

  1. A wrapper method on IMxAccessServer (src/ZB.MOM.WW.MxGateway.Worker/MxAccess/IMxAccessServer.cs) and dispatch in MxAccessComServer (MxAccessComServer.cs). Open question resolved on windev: which native interface (ILMXProxyServer / …3 / …4) exposes each method — confirm against the interop and docs/MXAccess-Public-API.md before writing the dispatch.
  2. An executor arm mapping request → wrapper call → reply (SuspendReply/ActivateReply carry MxStatusProxy; AuthenticateUserReply/ ArchestrAUserToIdReply carry user_id; AddBufferedItemReply carries item_handle; SetBufferedUpdateInterval returns base OK).

AuthenticateUser credentials must never reach logs (standing rule). AddBufferedItem/SetBufferedUpdateInterval enable the already-wired OnBufferedDataChange event path (MxAccessEventMapper.cs:231-254); per the approved decision we verify the buffered round-trip live on windev, capturing a real multi-sample batch to validate the §3.2 conversion path (Conversion/VariantConverter.cs).

Workstream C — §1.2 audit CorrelationId

Thread request.ClientCorrelationId from MxAccessGatewayService.InvokeApplyConstraintsAsync → the six filter helpers (EnforceReadTagAsync, EnforceWriteHandleAsync, FilterTagBulkAsync, FilterReadBulkAsync, FilterWriteBulkAsync, FilterHandleBulkAsync) → IConstraintEnforcer.RecordDenialAsync, which gains a correlationId parameter.

Type detail: AuditEvent.CorrelationId is Guid?, but ClientCorrelationId is a free-form string. The enforcer does Guid.TryParse and stores the value when parseable, else null. No audit-schema or contract change. Remove the TODO(Task 2.3) at ConstraintEnforcer.cs:134-136.

Workstream D — Client CLI/helper parity (5 clients)

  • Go Write2 single-session helper, modeled on Write (session.go:559).
  • Python CLI — add the 4 galaxy-* commands wrapping existing galaxy.py library methods (test_connection, get_last_deploy_time, discover_hierarchy, watch_deploy_events).
  • ping CLI — add to Go (cmd/mxgw-go/main.go) and Java (MxGatewayCli.java).
  • browse CLI — add to all 5, wrapping each client's existing LazyBrowseNode/Browse helper → 0/5 → 5/5 parity.
  • Galaxy name unification — add canonical galaxy-test-connection / galaxy-last-deploy to Java, keeping galaxy-test / galaxy-deploy-time as deprecated aliases (no script breakage).
  • version in dotnet — the explorer found a version path at MxGatewayClientCli.cs:85 that conflicts with the audit's §4.4. Treat as verify-then-fix-only-if-genuinely-missing.

Workstream E — Docs/hygiene (§7) + residual recording

  • D1 plan header (docs/plans/2026-06-14-deferred-followups.md:4).
  • Stale STA "production fix needed" prose (docs/AlarmClientDiscovery.md:765-774).
  • Stale EventsHub follow-up comment (Dashboard/Hubs/EventsHub.cs:9-17).
  • CLAUDE.md project-name drift (MxGateway.*ZB.MOM.WW.MxGateway.*).
  • Remove dead MapSqlException (GalaxyRepositoryGrpcService.cs:350-360).
  • Record §1.3 (live failover counter unproven on rig) and §1.4 (8-arg ack drops operator domain/full-name; vendor stub) as explicitly-documented residuals.
  • Update stillpending.md to reflect what closed.

Data flow / error handling specifics

  • Control commands return base/typed replies with ProtocolStatusCode.Ok; failures map to the existing CreateInvalidRequestReply / CreateAlarmFailureReply helpers, never a thrown exception across the STA.
  • COM commands surface the native HResult on the reply exactly as the other COM arms do — MXAccess parity means we do not "fix" surprising returns (e.g. AuthenticateUser is allowed to fail; the live test tolerates it).
  • ShutdownWorker must not deadlock the STA: signal shutdown, return the reply, then let the pump drain — sequencing is a plan-level concern.

Testing & cross-platform verification

Workstream Build/test Host
A, B (worker) Worker.Tests xUnit with a fake IMxAccessServer asserting each arm calls the right wrapper and maps the reply windev (net48/x86)
B (live) MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1 smoke incl. buffered capture windev + live MXAccess
C (gateway) unit test: denied op persists parsed CorrelationId local
D (clients) go test, pytest, cargo test+clippy, gradle test, dotnet test local; Java on windev
E (docs) doc-only; --check regen where applicable local

TDD throughout, per-task commits, docs updated in the same change as source.

Out of scope (documented, not built)

§1.3 live-drive (rig can't drive a real failover), §1.4 actual delivery + §3.4/§3.5 (AVEVA AlarmAckByName stub returns -55, AlarmAckByGUID is E_NOTIMPL), §3.1/§3.2-conversion-fix/§3.3/§3.6/§3.7 (await live captures), all of §2 (deliberate v1 scope, incl. validator-blocked multi-subscriber), §5 (opt-in verification gates), §7.6 (Won't Fix review findings).