diff --git a/docs/plans/2026-06-15-stillpending-completion-design.md b/docs/plans/2026-06-15-stillpending-completion-design.md new file mode 100644 index 0000000..0ce16a6 --- /dev/null +++ b/docs/plans/2026-06-15-stillpending-completion-design.md @@ -0,0 +1,152 @@ +# Still-Pending Completion — Design + +**Date:** 2026-06-15 +**Source:** `stillpending.md` (audit at commit `c7f754c`) +**Branch:** `feat/stillpending-completion` +**Status:** Design approved; implementation plan to follow. + +## Goal + +Close the genuinely actionable items in `stillpending.md`: the 11 unimplemented +worker command kinds (§1.1), audit-record CorrelationId threading (§1.2), the +client CLI/helper parity gaps (§4), and the documentation/residual-recording +hygiene (§7). Items that are deliberate v1 scope (§2), vendor- or rig-gated +(§1.3, §1.4, §3), or opt-in verification gates (§5) are documented, not built. + +## Key discovery that shapes the work + +All 11 worker commands **already** have proto request *and* reply messages, +gateway-side validation (`MxAccessGrpcRequestValidator.cs:86-111`), scope +mapping (`GatewayGrpcScopeResolver.cs:45-54`), and generic pass-through routing +(`MxAccessGatewayService.Invoke`). Therefore: + +- **No `.proto` changes** — no codegen, no net48 `CS0246` regen risk. +- **No gateway routing changes** for the 11 commands. +- The real code surface is: worker executor arms, six new COM-wrapper methods, + the constraint-path CorrelationId thread, and client CLI/helper additions. + +8 of the 11 have dedicated reply messages; 3 (`SetBufferedUpdateInterval`, +`Ping`, `ShutdownWorker`) intentionally return the base OK reply. + +## Architecture + +Two-process design is unchanged. Work lands in five independent workstreams. + +### Workstream A — Worker control/lifecycle commands (5) + +`Ping`, `GetSessionState`, `GetWorkerInfo`, `DrainEvents`, `ShutdownWorker`. +No COM involved. New arms in `MxAccessCommandExecutor.Execute` +(`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:97-128`). + +The executor currently holds only COM collaborators, so it gains the runtime +state these commands report on: + +- `DrainEvents` → `MxAccessEventQueue.Drain(maxEvents)` (already exists; + `maxEvents == 0` drains all) → `DrainEventsReply { repeated MxEvent events }`. +- `Ping` → echoes `PingCommand.message`; base OK reply. +- `GetSessionState` → `SessionStateReply { SessionState state }` from the + runtime/heartbeat snapshot. +- `GetWorkerInfo` → `WorkerInfoReply { worker_process_id, worker_version, + mxaccess_progid, mxaccess_clsid }` from `MxAccessInteropInfo` + process info. +- `ShutdownWorker` → honor `grace_period` then signal `StaRuntime` shutdown; + base OK reply. + +Where the control arms physically live (inside `MxAccessCommandExecutor` with +injected collaborators, vs. intercepted one layer up where the runtime context +already exists) is an implementation decision for the plan; the contract surface +is identical either way. + +### Workstream B — Worker MXAccess COM commands (6) + +`Suspend`, `Activate`, `AuthenticateUser`, `ArchestrAUserToId`, +`AddBufferedItem`, `SetBufferedUpdateInterval`. Each needs: + +1. A wrapper method on `IMxAccessServer` + (`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/IMxAccessServer.cs`) and dispatch in + `MxAccessComServer` (`MxAccessComServer.cs`). **Open question resolved on + windev:** which native interface (`ILMXProxyServer` / `…3` / `…4`) exposes + each method — confirm against the interop and `docs/MXAccess-Public-API.md` + before writing the dispatch. +2. An executor arm mapping request → wrapper call → reply + (`SuspendReply`/`ActivateReply` carry `MxStatusProxy`; `AuthenticateUserReply`/ + `ArchestrAUserToIdReply` carry `user_id`; `AddBufferedItemReply` carries + `item_handle`; `SetBufferedUpdateInterval` returns base OK). + +`AuthenticateUser` credentials must never reach logs (standing rule). +`AddBufferedItem`/`SetBufferedUpdateInterval` enable the already-wired +`OnBufferedDataChange` event path (`MxAccessEventMapper.cs:231-254`); per the +approved decision we **verify the buffered round-trip live on windev**, +capturing a real multi-sample batch to validate the §3.2 conversion path +(`Conversion/VariantConverter.cs`). + +### Workstream C — §1.2 audit CorrelationId + +Thread `request.ClientCorrelationId` from `MxAccessGatewayService.Invoke` → +`ApplyConstraintsAsync` → the six filter helpers (`EnforceReadTagAsync`, +`EnforceWriteHandleAsync`, `FilterTagBulkAsync`, `FilterReadBulkAsync`, +`FilterWriteBulkAsync`, `FilterHandleBulkAsync`) → +`IConstraintEnforcer.RecordDenialAsync`, which gains a `correlationId` +parameter. + +**Type detail:** `AuditEvent.CorrelationId` is `Guid?`, but `ClientCorrelationId` +is a free-form string. The enforcer does `Guid.TryParse` and stores the value +when parseable, else null. No audit-schema or contract change. Remove the +`TODO(Task 2.3)` at `ConstraintEnforcer.cs:134-136`. + +### Workstream D — Client CLI/helper parity (5 clients) + +- **Go** `Write2` single-session helper, modeled on `Write` (`session.go:559`). +- **Python CLI** — add the 4 `galaxy-*` commands wrapping existing `galaxy.py` + library methods (`test_connection`, `get_last_deploy_time`, + `discover_hierarchy`, `watch_deploy_events`). +- **`ping` CLI** — add to Go (`cmd/mxgw-go/main.go`) and Java (`MxGatewayCli.java`). +- **`browse` CLI** — add to **all 5**, wrapping each client's existing + `LazyBrowseNode`/`Browse` helper → 0/5 → 5/5 parity. +- **Galaxy name unification** — add canonical `galaxy-test-connection` / + `galaxy-last-deploy` to Java, keeping `galaxy-test` / `galaxy-deploy-time` as + **deprecated aliases** (no script breakage). +- **`version` in dotnet** — the explorer found a `version` path at + `MxGatewayClientCli.cs:85` that conflicts with the audit's §4.4. Treat as + **verify-then-fix-only-if-genuinely-missing**. + +### Workstream E — Docs/hygiene (§7) + residual recording + +- D1 plan header (`docs/plans/2026-06-14-deferred-followups.md:4`). +- Stale STA "production fix needed" prose (`docs/AlarmClientDiscovery.md:765-774`). +- Stale EventsHub follow-up comment (`Dashboard/Hubs/EventsHub.cs:9-17`). +- CLAUDE.md project-name drift (`MxGateway.*` → `ZB.MOM.WW.MxGateway.*`). +- Remove dead `MapSqlException` (`GalaxyRepositoryGrpcService.cs:350-360`). +- Record §1.3 (live failover counter unproven on rig) and §1.4 (8-arg ack drops + operator domain/full-name; vendor stub) as explicitly-documented residuals. +- Update `stillpending.md` to reflect what closed. + +## Data flow / error handling specifics + +- Control commands return base/typed replies with `ProtocolStatusCode.Ok`; + failures map to the existing `CreateInvalidRequestReply` / + `CreateAlarmFailureReply` helpers, never a thrown exception across the STA. +- COM commands surface the native `HResult` on the reply exactly as the other + COM arms do — MXAccess parity means we do not "fix" surprising returns + (e.g. `AuthenticateUser` is allowed to fail; the live test tolerates it). +- `ShutdownWorker` must not deadlock the STA: signal shutdown, return the reply, + then let the pump drain — sequencing is a plan-level concern. + +## Testing & cross-platform verification + +| Workstream | Build/test | Host | +|---|---|---| +| A, B (worker) | `Worker.Tests` xUnit with a fake `IMxAccessServer` asserting each arm calls the right wrapper and maps the reply | **windev** (net48/x86) | +| B (live) | `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1` smoke incl. buffered capture | **windev** + live MXAccess | +| C (gateway) | unit test: denied op persists parsed `CorrelationId` | local | +| D (clients) | `go test`, `pytest`, `cargo test`+clippy, `gradle test`, `dotnet test` | local; Java on windev | +| E (docs) | doc-only; `--check` regen where applicable | local | + +TDD throughout, per-task commits, docs updated in the same change as source. + +## Out of scope (documented, not built) + +§1.3 live-drive (rig can't drive a real failover), §1.4 actual delivery + +§3.4/§3.5 (AVEVA `AlarmAckByName` stub returns -55, `AlarmAckByGUID` is +`E_NOTIMPL`), §3.1/§3.2-conversion-fix/§3.3/§3.6/§3.7 (await live captures), +all of §2 (deliberate v1 scope, incl. validator-blocked multi-subscriber), §5 +(opt-in verification gates), §7.6 (`Won't Fix` review findings).