mxaccessgw/docs/MxAccessWorkerInstanceDesign.md
Joseph Doherty e541339c07 docs(audit): apply per-cluster judgment fixes across living docs
Resolve audit findings: correct WorkerEnvelope proto/route/metric/session
facts; rewrite auth (ZB.MOM.WW.Auth migration), dashboard (ZB.MOM.WW.Theme),
and StyleGuide (foreign-project copy-paste); document alarm subsystem, Ldap
options, and gateway alarm broker; fix client CLI flags and package paths.
2026-06-03 16:01:28 -04:00


MXAccess Worker Instance Detailed Design

Purpose

An MXAccess worker instance is the compatibility boundary around one installed MXAccess COM object. It runs as a disposable .NET Framework 4.8 x86 process, owns one dedicated STA thread, pumps Windows/COM messages, executes MXAccess commands on that STA, and forwards MXAccess events back to the gateway.

The worker's job is not to make MXAccess nicer. Its job is to preserve direct MXAccess behavior while making that behavior available to modern clients through the gateway.

Runtime

  • Target runtime: .NET Framework 4.8.
  • Language: C#.
  • Platform target: x86 by default.
  • Process lifetime: one worker per gateway session.
  • Public network listeners: none.
  • Gateway IPC: one named pipe with protobuf-framed messages.
  • COM apartment: one dedicated STA thread.

Style guides:

Build And Test

Build the SDK-style worker project with the .NET SDK MSBuild entry point. The project targets .NET Framework 4.8, but the SDK resolver comes from the .NET SDK installation:

dotnet msbuild src\ZB.MOM.WW.MxGateway.Worker\ZB.MOM.WW.MxGateway.Worker.csproj /restore /p:Configuration=Debug /p:Platform=x86

docs/ToolchainLinks.md records the Visual Studio MSBuild executable for classic .NET Framework and COM interop builds:

& "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Current\Bin\MSBuild.exe" src\ZB.MOM.WW.MxGateway.Worker\ZB.MOM.WW.MxGateway.Worker.csproj /p:Configuration=Debug /p:Platform=x86

Run the worker tests with the same platform target:

dotnet test src\ZB.MOM.WW.MxGateway.Worker.Tests\ZB.MOM.WW.MxGateway.Worker.Tests.csproj -p:Platform=x86

The only MXAccess interop reference belongs in ZB.MOM.WW.MxGateway.Worker. Gateway and test projects may reference the worker project for metadata and scaffold tests, but they must not reference ArchestrA.MXAccess.dll directly.

Responsibilities

The worker owns:

  • connection to the gateway pipe,
  • protocol hello and readiness reporting,
  • STA thread creation and teardown,
  • COM initialization on the STA,
  • MXAccess COM object creation,
  • MXAccess event sink wiring,
  • command dispatch on the STA,
  • MXAccess handle and advise state tracking,
  • value/status/HRESULT capture,
  • conversion to worker protobuf DTOs,
  • event sequencing,
  • heartbeat reporting,
  • graceful shutdown.

The worker does not own:

  • public gRPC API,
  • client authentication,
  • cross-session routing,
  • worker process supervision,
  • remote TLS,
  • policy decisions for other sessions.

Process Bootstrap

Expected command-line arguments:

--session-id <sessionId>
--pipe-name <pipeName>
--protocol-version <version>

Expected protected environment values:

MXGATEWAY_WORKER_NONCE=<random nonce>

The nonce travels through the environment rather than the command line so it never appears in process-listing tools that expose argument vectors.

Startup sequence:

  1. Parse command-line arguments.
  2. Configure minimal logging.
  3. Validate required values are present.
  4. Connect to the gateway named pipe.
  5. Exchange WorkerHello and GatewayHello.
  6. Validate protocol version, session id, and nonce.
  7. Start the STA runtime.
  8. Create the MXAccess COM object on the STA.
  9. Attach MXAccess event handlers on the STA.
  10. Send WorkerReady.
  11. Start pipe read, pipe write, heartbeat, and shutdown coordination loops.

If validation fails before MXAccess creation, exit quickly with a non-zero exit code. If MXAccess creation fails, send WorkerFault when possible and exit.

WorkerApplication.Run returns one of the structured WorkerExitCode values. Codes 2 through 4 are produced by the bootstrap parse phase before any pipe, STA, or MXAccess work happens; codes 5 and 6 and a clean 0 only become reachable once the parse succeeds and the worker runs its pipe session:

Exit code  Name                    Meaning
0          Success                 The pipe session ran to a clean close.
1          UnexpectedFailure       A non-bootstrap exception reaches the process boundary.
2          InvalidArguments        Required arguments are missing or unknown arguments are present.
3          InvalidProtocolVersion  --protocol-version is not numeric or does not match the supported worker protocol.
4          MissingNonce            MXGATEWAY_WORKER_NONCE is absent or empty.
5          PipeConnectionFailed    The pipe connection raised an IOException or TimeoutException.
6          ProtocolViolation       A WorkerFrameProtocolException escaped the pipe session.

WorkerBootstrapResult.Succeeded is a separate parse-phase gate: it reports whether argument parsing produced usable WorkerOptions. A false result carries one of codes 2 through 4 and the worker exits before running a session, so a successful parse is distinct from the 0 exit code, which only follows a clean pipe-session close.
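The parse-phase gate can be sketched as a pure function from arguments and the nonce to an optional exit code. This is a minimal illustration, not the WorkerOptionsParser source: the class name, the supported version value of 1, and the order in which the version and nonce checks run are assumptions; only the flag names, the environment variable, and exit codes 2, 3, and 4 come from the text above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the bootstrap parse phase.
public static class BootstrapParseSketch
{
    // Exit codes taken from the table above.
    public const int InvalidArguments = 2;
    public const int InvalidProtocolVersion = 3;
    public const int MissingNonce = 4;

    // Assumption: the real supported version is defined elsewhere in the worker.
    public const int SupportedProtocolVersion = 1;

    private static readonly string[] KnownFlags =
        { "--session-id", "--pipe-name", "--protocol-version" };

    // Returns null when parsing succeeds, otherwise the parse-phase exit code.
    public static int? Parse(string[] args, string nonce)
    {
        var values = new Dictionary<string, string>(StringComparer.Ordinal);
        for (int i = 0; i < args.Length; i += 2)
        {
            // An unknown flag, or a flag with no value after it, is an argument error.
            bool known = Array.IndexOf(KnownFlags, args[i]) >= 0;
            if (!known || i + 1 >= args.Length)
                return InvalidArguments;
            values[args[i]] = args[i + 1];
        }

        foreach (string flag in KnownFlags)
        {
            string value;
            if (!values.TryGetValue(flag, out value) || string.IsNullOrWhiteSpace(value))
                return InvalidArguments;
        }

        int version;
        if (!int.TryParse(values["--protocol-version"], out version)
            || version != SupportedProtocolVersion)
            return InvalidProtocolVersion;

        // The nonce is read from MXGATEWAY_WORKER_NONCE by the caller, never from argv.
        if (string.IsNullOrEmpty(nonce))
            return MissingNonce;

        return null;
    }
}
```

A caller would pass Environment.GetEnvironmentVariable("MXGATEWAY_WORKER_NONCE") as the second argument and exit with the returned code when it is not null. Returning null rather than 0 on success keeps the parse outcome separate from the clean-session exit code.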

Bootstrap logs use WorkerConsoleLogger key/value output. WorkerLogRedactor redacts fields whose names indicate nonce, secret, password, token, credential, or API key values before the message is written.
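Name-based redaction of this kind might look like the sketch below. The class, the "[REDACTED]" placeholder, the exact fragment list, and the space-separated key=value layout are all assumptions made for illustration; the actual WorkerLogRedactor and WorkerConsoleLogger formats are not shown in this document.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch: redact values whose field names look sensitive.
public static class LogRedactionSketch
{
    // Fragments follow the categories listed above; matching ignores case.
    private static readonly string[] SensitiveFragments =
        { "nonce", "secret", "password", "token", "credential", "apikey", "api_key" };

    public static bool IsSensitive(string fieldName)
    {
        foreach (string fragment in SensitiveFragments)
        {
            if (fieldName.IndexOf(fragment, StringComparison.OrdinalIgnoreCase) >= 0)
                return true;
        }
        return false;
    }

    // Appends each field as key=value, replacing sensitive values before writing.
    public static string Format(string message, IEnumerable<KeyValuePair<string, string>> fields)
    {
        var builder = new StringBuilder(message);
        foreach (KeyValuePair<string, string> field in fields)
        {
            string value = IsSensitive(field.Key) ? "[REDACTED]" : field.Value;
            builder.Append(' ').Append(field.Key).Append('=').Append(value);
        }
        return builder.ToString();
    }
}
```

Redacting by field name rather than by value means a secret is hidden even when its content looks harmless, at the cost of relying on consistent field naming.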

Internal Components

ZB.MOM.WW.MxGateway.Worker
  Program                       (calls WorkerApplication.Run)
  WorkerApplication             (parse, bootstrap, run pipe session, map exit code)
  Bootstrap
    WorkerOptionsParser         (parse args + env into WorkerOptions)
    WorkerOptions
    WorkerBootstrapResult       (parse outcome + WorkerExitCode)
    WorkerExitCode
    WorkerConsoleLogger / WorkerLogRedactor
  Ipc
    WorkerPipeClient            (named-pipe connect + retry, owns the session)
    WorkerPipeSession           (handshake, read/write/drain/heartbeat loops)
    WorkerFrameReader / WorkerFrameWriter
    WorkerEnvelopeValidator
    WorkerContractInfo          (protocol version + descriptor names)
  Sta
    StaRuntime                  (the dedicated STA thread + message pump loop)
    StaCommandDispatcher
    StaMessagePump
  MxAccess
    MxAccessStaSession          (IWorkerRuntimeSession over the STA)
    MxAccessSession             (handle registry + COM-call orchestration)
    MxAccessCommandExecutor     (IStaCommandExecutor; runs commands on the STA)
    MxAccessBaseEventSink       (OnDataChange tag-data events)
    MxAccessHandleRegistry
    (alarm subsystem — see below)
  Conversion
    VariantConverter            (MxValue <-> COM VARIANT, both directions)
    MxStatusProxyConverter
    HResultConverter / HResultConversion

Threading Model

main thread
  -> parse args
  -> configure host
  -> coordinate shutdown

pipe reader thread/task
  -> read WorkerEnvelope frames
  -> validate protocol
  -> enqueue commands or control messages

pipe writer thread/task
  -> serialize WorkerEnvelope frames
  -> write replies, events, heartbeats, faults

STA thread
  -> CoInitializeEx(COINIT_APARTMENTTHREADED)
  -> create MXAccess COM object
  -> attach event handlers
  -> pump Windows/COM messages
  -> execute queued commands
  -> detach events and release COM on shutdown

watchdog/heartbeat task
  -> observe STA responsiveness
  -> send heartbeat or fault

No MXAccess method may execute outside the STA thread. Do not use Task.Run around COM calls. Do not let event handlers perform pipe writes.

STA Runtime

The STA runtime is the most important part of the worker.

Startup:

  1. Create a dedicated Thread.
  2. Set apartment state to ApartmentState.STA.
  3. Start the thread.
  4. Inside the thread, initialize COM.
  5. Create the MXAccess COM object.
  6. Attach event handlers.
  7. Signal ready to the worker host.
  8. Enter the message pump.

Shutdown:

  1. Mark the command queue as completing.
  2. Drain or reject pending commands according to shutdown mode.
  3. Optionally issue MXAccess cleanup calls for active handles.
  4. Detach event handlers.
  5. Release COM references.
  6. Uninitialize COM.
  7. Exit the thread.

Message Pump

The STA must pump Windows messages while also processing queued commands. A blocking queue that prevents message pumping is not acceptable.

Required loop shape:

while not shutdown:
  while command queue has work:
    execute one command on STA

  MsgWaitForMultipleObjectsEx(
    command_event,
    timeout,
    QS_ALLINPUT,
    MWMO_INPUTAVAILABLE)

  while PeekMessage:
    TranslateMessage
    DispatchMessage

The command queue should signal a Win32 event or equivalent wait handle so the STA can wake without busy-waiting.

The loop should update a heartbeat timestamp after:

  • successfully pumping messages,
  • starting a command,
  • finishing a command,
  • processing an MXAccess event.

StaRuntime implements this runtime boundary in the worker. It starts one background thread named MxGateway.Worker.STA, sets it to ApartmentState.STA, initializes COM through StaComApartmentInitializer, and runs StaMessagePump. Commands are scheduled through InvokeAsync; the command queue signals an AutoResetEvent so MsgWaitForMultipleObjectsEx can wake the STA without busy-waiting. LastActivityUtc records pump, command, startup, and shutdown activity so the heartbeat and watchdog can report whether the STA is still responsive. Shutdown marks the runtime as closing, wakes the pump, rejects new commands, cancels queued work, uninitializes COM on the STA, and waits for the thread to exit.
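The required loop shape could be expressed in C# with P/Invoke roughly as follows. This is a Windows-only sketch under stated assumptions, not the StaMessagePump source: the class name, the delegate parameters, and the 100 ms wait timeout are hypothetical, and WM_QUIT handling is omitted. The Win32 functions and constants are the documented user32 ones.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical sketch of a command-aware STA message pump (Windows only).
public static class StaPumpSketch
{
    private const uint QS_ALLINPUT = 0x04FF;
    private const uint MWMO_INPUTAVAILABLE = 0x0004;
    private const uint PM_REMOVE = 0x0001;

    [StructLayout(LayoutKind.Sequential)]
    private struct POINT { public int X; public int Y; }

    [StructLayout(LayoutKind.Sequential)]
    private struct MSG
    {
        public IntPtr hwnd;
        public uint message;
        public IntPtr wParam;
        public IntPtr lParam;
        public uint time;
        public POINT pt;
    }

    [DllImport("user32.dll", SetLastError = true)]
    private static extern uint MsgWaitForMultipleObjectsEx(
        uint nCount, IntPtr[] pHandles, uint dwMilliseconds, uint dwWakeMask, uint dwFlags);

    [DllImport("user32.dll", CharSet = CharSet.Unicode)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool PeekMessage(
        out MSG lpMsg, IntPtr hWnd, uint wMsgFilterMin, uint wMsgFilterMax, uint wRemoveMsg);

    [DllImport("user32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool TranslateMessage(ref MSG lpMsg);

    [DllImport("user32.dll", CharSet = CharSet.Unicode)]
    private static extern IntPtr DispatchMessage(ref MSG lpMsg);

    // Must be called on the STA thread that owns the COM object.
    public static void Run(
        WaitHandle commandEvent,
        Func<bool> isShutdownRequested,
        Action executeQueuedCommands,
        Action markActivity)
    {
        IntPtr[] handles = { commandEvent.SafeWaitHandle.DangerousGetHandle() };
        while (!isShutdownRequested())
        {
            // Run whatever commands are queued, one at a time, on this thread.
            executeQueuedCommands();
            markActivity();

            // Sleep until the command event is signaled, a message arrives, or the timeout elapses.
            MsgWaitForMultipleObjectsEx(1, handles, 100, QS_ALLINPUT, MWMO_INPUTAVAILABLE);

            // Drain pending window and COM messages so event callbacks can be delivered.
            MSG msg;
            while (PeekMessage(out msg, IntPtr.Zero, 0, 0, PM_REMOVE))
            {
                TranslateMessage(ref msg);
                DispatchMessage(ref msg);
            }
            markActivity();
        }
    }
}
```

When the wait is satisfied by an auto-reset event, the event is reset by the wait itself, so the next iteration runs the queued commands without a separate reset call.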

COM Creation

The MXAccess analysis source at C:\Users\dohertj2\Desktop\mxaccess identifies the installed COM target:

  • interop assembly: C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll
  • assembly identity: ArchestrA.MxAccess, Version=3.2.0.0, PublicKeyToken=23106a86e706d0ae
  • COM class: ArchestrA.MxAccess.LMXProxyServerClass
  • CLSID: {C30B52F5-2CB5-4760-AF0A-3A344A7EB5DC}
  • ProgID: LMXProxy.LMXProxyServer.1
  • version-independent ProgID: LMXProxy.LMXProxyServer
  • registered server: C:\Program Files (x86)\ArchestrA\Framework\Bin\LmxProxy.dll
  • registry view: HKCR\Wow6432Node\CLSID\{C30B52F5-2CB5-4760-AF0A-3A344A7EB5DC}
  • threading model: Apartment

The worker should reference the interop assembly and instantiate LMXProxyServerClass on the dedicated STA thread. Keep the ProgID and assembly path configurable for diagnostics, but this COM class is the v1 default.

MxAccessStaSession owns the initial COM creation path. It starts StaRuntime, creates LMXProxyServerClass through MxAccessComObjectFactory on the STA, attaches MxAccessBaseEventSink, and returns WorkerReady only after those steps succeed. MxAccessSession keeps the raw COM object private, records the STA managed thread id that created it, detaches the base event sink during disposal, and releases the COM reference on the STA. After creation, MxAccessStaSession owns a StaCommandDispatcher backed by MxAccessCommandExecutor; DispatchAsync queues contract commands back to the same STA instead of exposing the COM object to callers.

Creation rules:

  • Create COM object only on the STA.
  • Attach event handlers only on the STA.
  • Keep the COM reference private to the STA runtime.
  • Never marshal the raw COM object to pipe reader/writer threads.
  • Capture COM creation HRESULT or exception details.

If COM creation fails, the worker should send a structured fault with:

  • fault category,
  • exception type,
  • HRESULT when available,
  • COM class or ProgID attempted,
  • worker process id,
  • session id.

WorkerPipeSession maps startup exceptions from this path to WorkerFaultCategory.MxaccessCreationFailed, includes the captured HRESULT when the exception exposes one, and does not send WorkerReady after a failed COM creation attempt.

After WorkerReady, WorkerPipeSession continues reading gateway frames for the lifetime of the process. WorkerCommand frames are dispatched to MxAccessStaSession, replies are written as WorkerCommandReply, and queued worker events are drained after command replies. WorkerShutdown starts the graceful shutdown path and returns WorkerShutdownAck only after the STA cleanup path completes.

Event Sink

The worker subscribes to every public MXAccess event family through MxAccessBaseEventSink:

  • OnDataChange
  • OnWriteComplete
  • OperationComplete
  • OnBufferedDataChange

Alarm transitions arrive on a separate path. They do not originate from the LMXProxyServerClass connection points, so MxAccessAlarmEventSink (driven by the alarm subsystem below) feeds them onto the same MxAccessEventQueue rather than MxAccessBaseEventSink.

Forward these event families only when the native MXAccess COM object raises them. Do not synthesize OperationComplete from write completion or command status. OnBufferedDataChange must be represented in the protocol now, but multi-sample payload conversion should remain capture-validated; preserve raw metadata whenever conversion is incomplete.

Event handling rules:

  • Event handlers are expected to run on the STA.
  • Assign a monotonic worker event sequence.
  • Convert event args to WorkerEvent.
  • Include value, quality, timestamp, handles, status arrays, and raw status details when available.
  • Preserve raw event payload metadata for unsupported buffered or completion-only shapes.
  • Enqueue to the outbound event queue.
  • Return quickly to preserve message pumping.

MxAccessBaseEventSink implements the COM connection-point handlers and keeps the handlers limited to event argument conversion plus enqueue. It uses MxAccessEventMapper to create MxEvent DTOs for OnDataChange, OnWriteComplete, OperationComplete, and OnBufferedDataChange. The mapper converts scalar and array values through VariantConverter, converts MXSTATUS_PROXY[] through MxStatusProxyConverter, and maps installed MxDataType values to the public protobuf enum while preserving the raw data type on buffered events. OperationComplete is only emitted from the native OperationComplete handler; write completion does not synthesize it.

MxAccessEventQueue is the bounded outbound event queue for one worker session. It assigns the monotonic WorkerSequence and WorkerTimestamp when an event is accepted, preserving the order in which MXAccess handlers enqueue events. The default capacity is 10000. When the queue reaches capacity, Enqueue records a WorkerFaultCategory.QueueOverflow fault and then throws MxAccessEventQueueOverflowException so the caller cannot silently drop the event. The event handler catches conversion and enqueue failures (including this overflow exception), records the first fault on the queue, and returns to the STA message pump instead of writing to the pipe.

If event conversion throws, catch it inside the event handler, record a structured WorkerFault, and keep the worker alive only if the fault policy allows it.
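The accept-or-fault behavior of the bounded queue can be illustrated with a small sketch. All type and member names here are hypothetical stand-ins for MxAccessEventQueue and its exception, the fault is reduced to a category string, and starting the sequence at 1 is an assumption; the capacity default of 10000 and the record-then-throw order come from the description above.

```csharp
using System;
using System.Collections.Generic;

public sealed class EventQueueOverflowSketchException : Exception
{
    public EventQueueOverflowSketchException(int capacity)
        : base("Outbound event queue reached capacity " + capacity + ".") { }
}

public sealed class SequencedEventSketch
{
    public long WorkerSequence;
    public DateTime WorkerTimestampUtc;
    public string Family;
}

// Hypothetical sketch of a bounded, sequencing outbound event queue.
public sealed class EventQueueSketch
{
    private readonly object _gate = new object();
    private readonly Queue<SequencedEventSketch> _events = new Queue<SequencedEventSketch>();
    private readonly int _capacity;
    private long _lastSequence;

    public EventQueueSketch(int capacity = 10000) { _capacity = capacity; }

    // Only the first fault is kept, mirroring "records the first fault on the queue".
    public string FirstFault { get; private set; }

    public void RecordFault(string category)
    {
        lock (_gate) { if (FirstFault == null) FirstFault = category; }
    }

    public void Enqueue(string family)
    {
        lock (_gate)
        {
            if (_events.Count >= _capacity)
            {
                RecordFault("QueueOverflow");
                throw new EventQueueOverflowSketchException(_capacity);
            }
            // Sequence and timestamp are assigned at acceptance, preserving handler order.
            _events.Enqueue(new SequencedEventSketch
            {
                WorkerSequence = ++_lastSequence,
                WorkerTimestampUtc = DateTime.UtcNow,
                Family = family
            });
        }
    }

    public bool TryDequeue(out SequencedEventSketch item)
    {
        lock (_gate)
        {
            if (_events.Count == 0) { item = null; return false; }
            item = _events.Dequeue();
            return true;
        }
    }

    // Shape of an event handler: convert, enqueue, never write to the pipe, never throw.
    public static void HandleEvent(EventQueueSketch queue, string family)
    {
        try { queue.Enqueue(family); }
        catch (Exception) { queue.RecordFault("MxAccessEventConversionFailed"); }
    }
}
```

Because the overflow fault is recorded before the exception is thrown, the handler's catch block cannot overwrite it with a less specific category.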

Alarm Subsystem

Alarms come from a different COM surface than tag data, so the worker carries a separate pipeline rather than folding alarms into MxAccessBaseEventSink. The MXAccess LMXProxyServerClass does not expose alarm subscription, so the worker hosts AVEVA's standalone alarm-consumer COM object instead.

  • WnWrapAlarmConsumer is the production IMxAccessAlarmConsumer, backed by WNWRAPCONSUMERLib.wwAlarmConsumerClass. It returns the active alarm set as a BSTR XML string through GetXmlCurrentAlarms2, which avoids the FILETIME-to-DateTime marshaling that crashed the earlier managed alarm client. The CLSID is registered ThreadingModel=Apartment, so the consumer is created and driven entirely on the worker's STA. It owns no internal timer.
  • MxAccessStaSession drives the STA alarm poll loop: RunAlarmPollLoopAsync awaits a fixed 500 ms interval and then calls IAlarmCommandHandler.PollOnce on the STA via the runtime, so every GetXmlCurrentAlarms2 call stays on the apartment that owns the consumer. A poll failure is recorded as a WorkerFault on the event queue rather than terminating the worker.
  • AlarmCommandHandler owns one AlarmDispatcher per session and is the entry point for the alarm IPC commands (SubscribeAlarms, AcknowledgeAlarm by GUID or name, QueryActiveAlarms, Unsubscribe). It rejects a second subscribe before an unsubscribe, mirroring the consumer's non-idempotent Subscribe.
  • AlarmDispatcher wires the consumer's AlarmTransitionEmitted stream onto MxAccessAlarmEventSink.EnqueueTransition. It maps state transitions through AlarmRecordTransitionMapper, composes the canonical \\<machine>\Galaxy!<area> full reference, and projects active-alarm snapshots to ActiveAlarmSnapshot protos for the QueryActiveAlarms refresh stream.
  • MxAccessAlarmEventSink enqueues each decoded transition onto the shared MxAccessEventQueue as a proto alarm-transition event, stamping the session id, so alarms ride the same outbound IPC path as tag-data events.

Command Queue

The pipe reader converts WorkerCommand messages into StaCommand entries.

Each entry should include:

  • correlation id,
  • method name,
  • method-specific request payload,
  • enqueue timestamp,
  • cancellation marker,
  • reply completion path.

The STA command dispatcher:

  1. Dequeues one command.
  2. Checks whether shutdown has started.
  3. Calls the matching MXAccess method.
  4. Captures return values, out parameters, status arrays, and HRESULT.
  5. Converts results to WorkerCommandReply.
  6. Enqueues the reply to the pipe writer.

The STA should execute one command at a time. MXAccess command ordering must be preserved for one worker.
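A command entry and a FIFO drain step along these lines would satisfy the ordering rule. The sketch below is illustrative only: the type names and members are hypothetical, the request payload and COM call are collapsed into a single delegate, and the exception used for rejected commands is a placeholder for whatever status the real dispatcher returns. ExecutePending is meant to be called from the STA pump loop, so commands never run on any other thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical queue entry carrying the fields listed above.
public sealed class StaCommandSketch
{
    public string CorrelationId;
    public string Method;
    public Func<object> Body;        // stands in for the request payload plus the COM call
    public DateTime EnqueuedUtc;
    public bool Canceled;            // cancellation marker, honored only before the command starts
    public readonly TaskCompletionSource<object> Reply =
        new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);
}

public sealed class StaCommandQueueSketch
{
    private readonly object _gate = new object();
    private readonly Queue<StaCommandSketch> _pending = new Queue<StaCommandSketch>();
    private bool _shutdownRequested;

    public Task<object> Enqueue(string correlationId, string method, Func<object> body)
    {
        var command = new StaCommandSketch
        {
            CorrelationId = correlationId,
            Method = method,
            Body = body,
            EnqueuedUtc = DateTime.UtcNow
        };
        lock (_gate) { _pending.Enqueue(command); }
        return command.Reply.Task;
    }

    // Returns false when the command already started or is unknown.
    public bool Cancel(string correlationId)
    {
        lock (_gate)
        {
            foreach (StaCommandSketch command in _pending)
            {
                if (command.CorrelationId == correlationId) { command.Canceled = true; return true; }
            }
        }
        return false;
    }

    public void RequestShutdown() { lock (_gate) { _shutdownRequested = true; } }

    // Call only from the STA thread: runs queued commands one at a time, in arrival order.
    public void ExecutePending()
    {
        while (true)
        {
            StaCommandSketch command;
            bool shutdown;
            lock (_gate)
            {
                if (_pending.Count == 0) return;
                command = _pending.Dequeue();
                shutdown = _shutdownRequested;
            }
            if (command.Canceled) { command.Reply.SetCanceled(); continue; }
            if (shutdown)
            {
                command.Reply.SetException(new InvalidOperationException("Worker is shutting down."));
                continue;
            }
            try { command.Reply.SetResult(command.Body()); }
            catch (Exception ex) { command.Reply.SetException(ex); }
        }
    }
}
```

Creating the completion source with RunContinuationsAsynchronously keeps reply continuations off the STA thread, so a slow consumer cannot stall message pumping.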

Command Dispatch Surface

Phase 1 commands:

  • Register
  • Unregister
  • AddItem
  • RemoveItem

Phase 2 event commands:

  • Advise
  • UnAdvise
  • AdviseSupervisory

Full surface:

  • AddItem2
  • AddBufferedItem
  • SetBufferedUpdateInterval
  • Suspend
  • Activate
  • Write
  • Write2
  • WriteSecured
  • WriteSecured2
  • AuthenticateUser
  • ArchestrAUserToId

Diagnostics:

  • Ping
  • GetSessionState
  • GetWorkerInfo
  • DrainEvents
  • ShutdownWorker

Implement method-specific dispatch instead of a generic string method invoker. Parity tests need stable command-specific request and reply shapes.

MxAccessCommandExecutor implements the first command pair:

  • Register calls LMXProxyServerClass.Register with the requested client name and preserves the returned server handle in both ReturnValue and RegisterReply.ServerHandle.
  • Unregister calls LMXProxyServerClass.Unregister with the requested server handle. The reply has no method-specific payload because the public MXAccess method returns void.

Both commands set Hresult to 0 only after the COM call returns normally. COM exceptions flow through StaCommandDispatcher, which captures the thrown HRESULT and converts the reply to ProtocolStatusCode.MxaccessFailure. MxAccessStaSession.GetRegisteredServerHandlesAsync returns an STA-read snapshot of tracked server handles for diagnostics and future cleanup logic.

MxAccessCommandExecutor also implements the item lifecycle commands:

  • AddItem calls LMXProxyServerClass.AddItem with the requested server handle and item definition. It preserves the returned item handle in both ReturnValue and AddItemReply.ItemHandle.
  • AddItem2 calls LMXProxyServerClass.AddItem2 with the requested server handle, item definition, and context string. The context string is passed to MXAccess exactly as received.
  • RemoveItem calls LMXProxyServerClass.RemoveItem with the requested server handle and item handle. The reply has no method-specific payload because the public MXAccess method returns void.

The worker records item handles only after AddItem or AddItem2 returns normally, and removes item handles only after RemoveItem returns normally. The registry does not prevalidate server or item handles, so invalid and cross-server handle behavior remains owned by MXAccess. COM exceptions continue through StaCommandDispatcher, which preserves the HRESULT and leaves diagnostic registry state unchanged for failed cleanup calls.

MxAccessCommandExecutor implements advice lifecycle commands on the same STA path:

  • Advise calls LMXProxyServerClass.Advise with the requested server handle and item handle.
  • AdviseSupervisory calls LMXProxyServerClass.AdviseSupervisory with the requested server handle and item handle. This remains a distinct command from plain Advise even though observed scalar captures share the same lower-level subscription body.
  • UnAdvise calls LMXProxyServerClass.UnAdvise with the requested server handle and item handle.

The worker records plain and supervisory advice separately only after the COM call returns normally. Successful UnAdvise removes all tracked advice for the server and item pair because the public MXAccess cleanup method has no plain versus supervisory selector. Successful RemoveItem and Unregister also clear related advice state from the worker registry. Failed advice and cleanup calls leave registry state unchanged so diagnostics continue to reflect the last successful MXAccess-owned state transition.

Handle Registry

The worker should track MXAccess state for diagnostics and cleanup, while still treating MXAccess as the authority.

Suggested tracked state:

  • registered server handles,
  • item handles,
  • item names and context,
  • server handle for each item,
  • advise state,
  • buffered item state,
  • authenticated user ids if needed,
  • last command touching each handle.

Rules:

  • Do not invent handles.
  • Do not rewrite handles returned by MXAccess.
  • Record server handles only after Register succeeds.
  • Remove server handles only after Unregister succeeds.
  • Record item handles only after AddItem or AddItem2 succeeds.
  • Remove item handles only after RemoveItem succeeds.
  • Record advice state only after Advise or AdviseSupervisory succeeds.
  • Remove advice state only after UnAdvise, RemoveItem, or Unregister succeeds.
  • Preserve invalid-handle behavior from MXAccess.
  • Preserve cross-server handle behavior from MXAccess.
  • Use registry state for cleanup and diagnostics, not semantic correction.
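The "record only after success" rules can be sketched by running the COM call first and touching the registry only when it returns normally. This is an illustration, not MxAccessHandleRegistry itself: the class, its members, and the use of delegates in place of LMXProxyServerClass calls are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: diagnostic bookkeeping that follows MXAccess rather than correcting it.
public sealed class HandleRegistrySketch
{
    private readonly HashSet<int> _servers = new HashSet<int>();
    private readonly HashSet<(int Server, int Item)> _items = new HashSet<(int Server, int Item)>();
    private readonly HashSet<(int Server, int Item, bool Supervisory)> _advice =
        new HashSet<(int Server, int Item, bool Supervisory)>();

    public int ServerCount { get { return _servers.Count; } }
    public int ItemCount { get { return _items.Count; } }
    public int AdviceCount { get { return _advice.Count; } }

    public int Register(Func<int> comRegister)
    {
        int server = comRegister();      // a thrown COM exception skips the bookkeeping below
        _servers.Add(server);
        return server;                   // handle is returned exactly as MXAccess produced it
    }

    public int AddItem(int server, Func<int> comAddItem)
    {
        int item = comAddItem();         // no prevalidation of the server handle
        _items.Add((server, item));
        return item;
    }

    public void Advise(int server, int item, bool supervisory, Action comAdvise)
    {
        comAdvise();
        _advice.Add((server, item, supervisory));
    }

    public void UnAdvise(int server, int item, Action comUnAdvise)
    {
        comUnAdvise();
        // UnAdvise has no plain/supervisory selector, so both kinds are cleared.
        _advice.RemoveWhere(a => a.Server == server && a.Item == item);
    }

    public void RemoveItem(int server, int item, Action comRemoveItem)
    {
        comRemoveItem();
        _items.Remove((server, item));
        _advice.RemoveWhere(a => a.Server == server && a.Item == item);
    }

    public void Unregister(int server, Action comUnregister)
    {
        comUnregister();
        _servers.Remove(server);
        _items.RemoveWhere(i => i.Server == server);
        _advice.RemoveWhere(a => a.Server == server);
    }
}
```

Since every mutation sits after the call it tracks, a failed call leaves the registry at the last state MXAccess actually confirmed.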

Value Conversion

VariantConverter should convert COM values into the protobuf MxValue union.

Supported scalar projections:

  • bool,
  • int32,
  • int64,
  • float,
  • double,
  • string,
  • timestamp,
  • raw fallback.

Supported arrays:

  • bool array,
  • int32 array,
  • float array,
  • double array,
  • string array,
  • timestamp array,
  • raw fallback.

Rules:

  • Preserve null and empty values distinctly when MXAccess exposes a distinction.
  • Preserve array rank and dimensions when available.
  • Preserve original variant type metadata.
  • If conversion is lossy, include the best typed value plus raw diagnostic metadata.
  • Do not throw away values just because they are awkward.

Credential-bearing values must not be logged.
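The conversion rules above might be approximated as follows, assuming the default .NET marshaling of a COM VARIANT to object (VT_EMPTY as null, VT_NULL as DBNull, VT_I2 as short, and so on). MxValueSketch is a hypothetical stand-in for the protobuf MxValue union, and the kind strings are illustrative labels rather than the real field names.

```csharp
using System;

// Hypothetical stand-in for the protobuf MxValue union.
public sealed class MxValueSketch
{
    public string Kind;          // "empty", "null", a scalar kind, "<kind>[]", or "raw"
    public object Value;
    public string SourceType;    // original CLR type name, kept as diagnostic metadata
    public int[] Dimensions;     // set for arrays only; length equals the array rank
}

public static class VariantConversionSketch
{
    public static MxValueSketch ToMxValue(object comValue)
    {
        var result = new MxValueSketch();
        if (comValue == null) { result.Kind = "empty"; return result; }

        result.SourceType = comValue.GetType().FullName;
        if (comValue is DBNull) { result.Kind = "null"; return result; }

        var array = comValue as Array;
        if (array != null)
        {
            // Preserve rank and per-dimension lengths instead of flattening silently.
            result.Dimensions = new int[array.Rank];
            for (int d = 0; d < array.Rank; d++) result.Dimensions[d] = array.GetLength(d);
            string elementKind = ScalarKind(array.GetType().GetElementType());
            result.Kind = elementKind == "raw" ? "raw" : elementKind + "[]";
            result.Value = array;
            return result;
        }

        result.Kind = ScalarKind(comValue.GetType());
        // 16-bit integers widen to int32; every other value is carried as-is.
        result.Value = result.Kind == "int32" ? (object)Convert.ToInt32(comValue) : comValue;
        return result;
    }

    private static string ScalarKind(Type type)
    {
        if (type == typeof(bool)) return "bool";
        if (type == typeof(short) || type == typeof(int)) return "int32";
        if (type == typeof(long)) return "int64";
        if (type == typeof(float)) return "float";
        if (type == typeof(double)) return "double";
        if (type == typeof(string)) return "string";
        if (type == typeof(DateTime)) return "timestamp";
        return "raw";            // unknown shapes are kept, not discarded
    }
}
```

The raw fallback keeps the original object and its type name, so an unsupported value still reaches diagnostics instead of being dropped.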

Status And HRESULT Capture

MXSTATUS_PROXY arrays must be represented explicitly. Do not collapse status arrays into a single success flag.

For every command reply, capture:

  • protocol success/failure,
  • method name,
  • correlation id,
  • COM HRESULT if available,
  • thrown exception HRESULT if available,
  • MXAccess return value if any,
  • method-specific out parameters,
  • status array,
  • diagnostic message safe for logs.

If a COM call throws, map the exception into a command reply instead of crashing the worker, unless the exception indicates process corruption or the configured policy says to fail the session.
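Mapping a thrown COM failure into a reply, rather than letting it escape, could look like this sketch. The reply type and its fields are hypothetical simplifications of WorkerCommandReply, and the policy check for process-corrupting exceptions is left out. Exception.HResult is the standard .NET property, so it also covers COMException.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical, simplified reply shape.
public sealed class CommandReplySketch
{
    public string CorrelationId;
    public string Method;
    public bool Succeeded;
    public int Hresult;
    public object ReturnValue;
    public string ExceptionType;
    public string SafeMessage;
}

public static class ReplyMappingSketch
{
    public static CommandReplySketch Execute(string correlationId, string method, Func<object> comCall)
    {
        var reply = new CommandReplySketch { CorrelationId = correlationId, Method = method };
        try
        {
            reply.ReturnValue = comCall();
            reply.Hresult = 0;           // set to 0 only after the call returns normally
            reply.Succeeded = true;
        }
        catch (Exception ex)
        {
            reply.Succeeded = false;
            reply.Hresult = ex.HResult;  // for COMException this is the failing HRESULT
            reply.ExceptionType = ex.GetType().FullName;
            // The exception message is deliberately not copied: it may carry request values.
            reply.SafeMessage = method + " failed with HRESULT 0x" + ex.HResult.ToString("X8");
        }
        return reply;
    }
}
```

For example, a call that throws new COMException("x", unchecked((int)0x80004005)) yields a reply with Succeeded false and SafeMessage "Write failed with HRESULT 0x80004005" when the method name is "Write".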

Cancellation

Worker cancellation is cooperative at the queue boundary.

Rules:

  • If a WorkerCancel arrives before a command starts, mark the command canceled and reply or drop according to protocol policy.
  • If a command is already executing on the STA, do not attempt to abort the COM call.
  • When the COM call returns after gateway cancellation, send the reply only if the gateway still wants late replies; otherwise log and discard.
  • Hard cancellation is process kill by the gateway.

Outbound Queues

The worker should use bounded outbound queues for replies, events, heartbeats, and faults.

Priority order when writing:

  1. faults,
  2. command replies,
  3. shutdown acknowledgements,
  4. heartbeats,
  5. events.

Event overflow policy defaults to fail-fast for parity testing. If the event queue fills:

  1. Capture overflow metrics.
  2. Send WorkerFault if possible.
  3. Stop accepting new commands.
  4. Let the gateway close or kill the worker.

Production coalescing may be added later, but it must be explicit and tested. Do not drop or coalesce events in v1.
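The write priority order can be sketched as one bounded lane per message kind, scanned from highest to lowest priority. The names, the shared per-lane capacity, and the use of strings for frames are assumptions for illustration; the sketch is not thread-safe and leaves the overflow policy to the caller, which in v1 would treat a refused event as the fail-fast case described above.

```csharp
using System;
using System.Collections.Generic;

// Enum values double as lane indexes, so declaration order is the priority order.
public enum OutboundKindSketch { Fault = 0, CommandReply = 1, ShutdownAck = 2, Heartbeat = 3, Event = 4 }

// Hypothetical sketch of a priority-ordered, bounded outbound queue.
public sealed class PriorityOutboxSketch
{
    private readonly Queue<string>[] _lanes = new Queue<string>[5];
    private readonly int _laneCapacity;

    public PriorityOutboxSketch(int laneCapacity)
    {
        _laneCapacity = laneCapacity;
        for (int i = 0; i < _lanes.Length; i++) _lanes[i] = new Queue<string>();
    }

    // Returns false when the lane is full; nothing is dropped or coalesced silently.
    public bool TryEnqueue(OutboundKindSketch kind, string frame)
    {
        Queue<string> lane = _lanes[(int)kind];
        if (lane.Count >= _laneCapacity) return false;
        lane.Enqueue(frame);
        return true;
    }

    // Takes the oldest frame from the highest-priority non-empty lane.
    public bool TryDequeue(out OutboundKindSketch kind, out string frame)
    {
        for (int i = 0; i < _lanes.Length; i++)
        {
            if (_lanes[i].Count > 0)
            {
                kind = (OutboundKindSketch)i;
                frame = _lanes[i].Dequeue();
                return true;
            }
        }
        kind = OutboundKindSketch.Event;
        frame = null;
        return false;
    }
}
```

Order within a lane stays FIFO, so event sequencing is preserved even though events yield to faults, replies, acknowledgements, and heartbeats.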

Heartbeat And Watchdog

WorkerPipeSession starts the heartbeat loop after the gateway validates WorkerHello and receives WorkerReady. Heartbeats continue until WorkerShutdown, cancellation, or a pipe/protocol failure stops the session. The loop uses WorkerPipeSessionOptions.HeartbeatInterval; the default matches the gateway worker heartbeat interval.

The worker heartbeat proves that:

  • pipe writer is alive,
  • worker host is alive,
  • STA has recently pumped or completed work.

Heartbeat payload includes:

  • worker process id,
  • session id,
  • current state,
  • last STA activity timestamp,
  • pending command count,
  • outbound event queue depth,
  • event sequence,
  • current command correlation id if any.

MxAccessStaSession.CaptureHeartbeat() reads StaRuntime.LastActivityUtc and StaCommandDispatcher queue state without touching the raw MXAccess COM object outside the STA. Event queue depth and event sequence are reported as zero until the event queue implementation owns those counters.

The STA watchdog currently emits a WorkerFault with WorkerFaultCategory.StaHung when LastStaActivityUtc is older than WorkerPipeSessionOptions.HeartbeatGrace and no command is in flight. StaRuntime.ProcessQueuedCommands calls MarkActivity() only immediately before and after each work item, so a synchronously long-running STA command (for example a ReadBulk waiting timeout_ms for the first OnDataChange) legitimately freezes LastStaActivityUtc for the duration of the wait while the worker is healthy. The watchdog is therefore suppressed while the heartbeat snapshot's CurrentCommandCorrelationId is non-empty: the worker is busy executing a command, not hung, and the heartbeat already surfaces the in-flight correlation id so the gateway can apply its own per-command timeout if it considers the command too slow. The fault still fires on a truly hung STA — no command in flight and no activity for longer than HeartbeatGrace — which is the only case the watchdog can usefully distinguish from a slow command. Command duration and high event queue depth remain observable through heartbeat fields until dedicated thresholds own those warnings. The worker reports stale STA activity, but the gateway owns the final kill decision through its existing heartbeat and worker lifecycle policy.

The in-flight-command suppression itself is bounded by WorkerPipeSessionOptions.HeartbeatStuckCeiling (default 75 seconds = 5 × HeartbeatGrace). The motivating case for the suppression is a legitimately slow synchronous command — but a genuinely stuck COM call (for example against a dead MXAccess provider whose cross-apartment marshaler is permanently blocked, or a write completion that never fires) leaves CurrentCommandCorrelationId non-empty indefinitely. Without an upper bound the worker-side StaHung watchdog would be permanently defeated for that session and only the gateway's per-command timeout would catch the hang — losing the worker-originated diagnostic (StaHung fault category, the stale-by interval) from the gateway audit trail. Once LastStaActivityUtc has been stale for longer than HeartbeatStuckCeiling, the watchdog fires StaHung regardless of whether a command is in flight, on the assumption that no legitimate STA command should run that long without periodically refreshing activity. Deployments that legitimately run very long bulk operations should raise the ceiling rather than disable it.
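The combined watchdog rule, suppression while a command is in flight bounded by the stuck ceiling, reduces to a small decision function. The function and parameter names below are hypothetical; the 15 second grace used in the usage note is inferred from "75 seconds = 5 × HeartbeatGrace" rather than stated directly.

```csharp
using System;

// Hypothetical sketch of the StaHung decision described above.
public static class StaWatchdogSketch
{
    public static bool ShouldReportStaHung(
        DateTime nowUtc,
        DateTime lastStaActivityUtc,
        string currentCommandCorrelationId,
        TimeSpan heartbeatGrace,
        TimeSpan heartbeatStuckCeiling)
    {
        TimeSpan staleFor = nowUtc - lastStaActivityUtc;

        // Past the ceiling the fault fires even if a command is still in flight.
        if (staleFor > heartbeatStuckCeiling) return true;

        // A command in flight means busy, not hung, so the fault is suppressed.
        if (!string.IsNullOrEmpty(currentCommandCorrelationId)) return false;

        // Idle STA: report once activity is older than the grace period.
        return staleFor > heartbeatGrace;
    }
}
```

With a 15 second grace and a 75 second ceiling, an idle STA that has been stale for 20 seconds is reported, the same staleness with a command in flight is not, and 80 seconds of staleness is reported in either case.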

Shutdown

Graceful shutdown sequence:

  1. Pipe reader receives WorkerShutdown.
  2. Worker host marks shutdown requested.
  3. Reject new commands.
  4. Let current STA command finish if within timeout.
  5. Optionally run MXAccess cleanup:
    • UnAdvise,
    • RemoveItem,
    • Unregister.
  6. Detach event handlers.
  7. Release COM object until reference count reaches zero when possible.
  8. Stop pipe reader and writer.
  9. Exit process with success code.

If shutdown wedges, the gateway kills the process. The worker should be written so process kill does not corrupt other sessions.

MxAccessStaSession.ShutdownGracefullyAsync implements the current cleanup path. It first calls StaCommandDispatcher.RequestShutdown() so new commands are rejected and queued commands that have not started receive ProtocolStatusCode.WorkerUnavailable. The command already executing on the STA is allowed to finish until the shutdown grace period expires.

After command dispatch is closed, cleanup runs on the STA in MXAccess handle order:

  1. one UnAdvise call per advised server/item pair,
  2. RemoveItem for active item handles,
  3. Unregister for active server handles,
  4. event sink detach,
  5. COM release.

Each cleanup call is best effort. A failed cleanup operation is recorded as an MxAccessShutdownFailure, logged by WorkerPipeSession, and does not prevent later cleanup calls from running. A shutdown with cleanup failures still returns WorkerShutdownAck with ProtocolStatusCode.Ok because the worker reached the controlled release path. If the grace period expires before cleanup can run or finish, the worker reports WorkerFaultCategory.ShutdownTimeout when possible and relies on the gateway to kill the process.
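Best-effort cleanup in a fixed order can be sketched as a loop that records each failure and keeps going. The types below are hypothetical; in particular, failures are reduced to strings where the worker records MxAccessShutdownFailure entries, and the grace-period timeout is not modeled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cleanup step: a label plus the call to run on the STA.
public sealed class CleanupStepSketch
{
    public string Name;
    public Action Call;
}

public static class ShutdownCleanupSketch
{
    // Runs every step in the given order; a failing step never blocks the ones after it.
    public static List<string> RunBestEffort(IEnumerable<CleanupStepSketch> orderedSteps)
    {
        var failures = new List<string>();
        foreach (CleanupStepSketch step in orderedSteps)
        {
            try
            {
                step.Call();
            }
            catch (Exception ex)
            {
                // Keep type and HRESULT only; messages may contain item names or values.
                failures.Add(step.Name + ": " + ex.GetType().Name
                    + " 0x" + ex.HResult.ToString("X8"));
            }
        }
        return failures;
    }
}
```

The caller would build the step list in the order given above (UnAdvise, RemoveItem, Unregister, event sink detach, COM release) and still acknowledge shutdown when the returned list is non-empty.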

Fault Handling

Worker fault categories:

  • InvalidArguments
  • GatewayAuthenticationFailed
  • ProtocolMismatch
  • ProtocolViolation
  • PipeDisconnected
  • MxAccessCreationFailed
  • MxAccessCommandFailed
  • MxAccessEventConversionFailed
  • StaHung
  • QueueOverflow
  • ShutdownTimeout

Fault payload should include:

  • category,
  • session id,
  • correlation id when command-specific,
  • command method when command-specific,
  • HRESULT when available,
  • exception type when available,
  • safe diagnostic message.

Do not include raw credentials or full secured-write values.

Security

The worker should trust only the launching gateway after validating:

  • expected session id,
  • expected protocol version,
  • nonce,
  • pipe identity where available.

It should not expose any network listener. It should not accept commands from arbitrary local processes.

Credential-bearing commands must keep credential data out of:

  • command line,
  • logs,
  • metrics labels,
  • exception messages,
  • crash dumps when avoidable.

Observability

Worker logs should include:

  • startup arguments except secrets,
  • protocol version,
  • gateway handshake result,
  • MXAccess COM creation result,
  • command start/end with correlation id,
  • HRESULT/status summary,
  • event family and sequence,
  • queue overflow,
  • STA watchdog warnings,
  • shutdown path.

Metrics can be emitted through the gateway or exposed as worker heartbeat fields. The worker does not need its own public metrics endpoint.

Testing Strategy

Worker tests that do not require installed MXAccess:

  • frame reader/writer,
  • protocol validation,
  • command queue ordering,
  • STA command scheduling with a fake COM object,
  • message-pump wake behavior where practical,
  • value conversion,
  • status conversion,
  • event conversion from fake event args,
  • shutdown state transitions,
  • queue overflow behavior.

Live MXAccess tests:

  • COM creation on STA,
  • Register and Unregister,
  • AddItem and RemoveItem,
  • Advise and one OnDataChange,
  • write completion behavior,
  • secured write behavior,
  • buffered data-change behavior,
  • invalid handle behavior,
  • no synthesized OperationComplete when native MXAccess does not raise it,
  • raw metadata preservation for buffered payloads that cannot yet be fully converted.

Live tests should be opt-in and clearly marked because they depend on installed MXAccess COM and provider state. The worker test suite uses MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1 for these tests. AddItem uses TestChildObject.TestInt by default and accepts an override through MXGATEWAY_LIVE_MXACCESS_ITEM; AddItem2 uses the captured parity fixture shape AddItem2("TestInt", "TestChildObject").

WorkerLiveMxAccessSmokeTests in src/ZB.MOM.WW.MxGateway.IntegrationTests/ uses the same opt-in variable for the gateway-to-worker live smoke. It launches the x86 worker through WorkerProcessLauncher, opens a gateway session, runs Register, AddItem, and Advise, waits for one OnDataChange, and closes the session. The smoke accepts MXGATEWAY_LIVE_MXACCESS_WORKER_EXE for a non-default worker executable path and MXGATEWAY_LIVE_MXACCESS_EVENT_TIMEOUT_SECONDS for the bounded event wait.

Initial Implementation Slice

The first worker slice should implement:

  1. Argument parsing and pipe connection.
  2. Protocol hello and nonce validation.
  3. STA thread startup.
  4. COM initialization and MXAccess object creation.
  5. Message pump with command wake event.
  6. WorkerReady.
  7. Shutdown command.
  8. Register, AddItem, and Advise.
  9. Event sink for one OnDataChange.
  10. Basic value/status conversion.
  11. Event model coverage for OperationComplete and OnBufferedDataChange without synthesized events.
  12. Fault reporting.

This slice proves the worker can preserve the core MXAccess requirements: single-process isolation, STA ownership, message pumping, command execution, and event delivery.