# MXAccess Worker Instance Detailed Design
## Purpose
An MXAccess worker instance is the compatibility boundary around one installed
MXAccess COM object. It runs as a disposable .NET Framework 4.8 x86 process,
owns one dedicated STA thread, pumps Windows/COM messages, executes MXAccess
commands on that STA, and forwards MXAccess events back to the gateway.
The worker's job is not to make MXAccess nicer. Its job is to preserve direct
MXAccess behavior while making that behavior available to modern clients through
the gateway.
## Runtime
- Target runtime: .NET Framework 4.8.
- Language: C#.
- Platform target: x86 by default.
- Process lifetime: one worker per gateway session.
- Public network listeners: none.
- Gateway IPC: one named pipe with protobuf-framed messages.
- COM apartment: one dedicated STA thread.
Style guides:
- [C# Style Guide](./style-guides/CSharpStyleGuide.md)
- [Protobuf Style Guide](./style-guides/ProtobufStyleGuide.md)
## Build And Test
Build the SDK-style worker project with the .NET SDK MSBuild entry point. The
project targets .NET Framework 4.8, but the SDK resolver comes from the .NET SDK
installation:
```powershell
dotnet msbuild src\MxGateway.Worker\MxGateway.Worker.csproj /restore /p:Configuration=Debug /p:Platform=x86
```
`docs/ToolchainLinks.md` records the Visual Studio MSBuild executable for
classic .NET Framework and COM interop builds:
```powershell
& "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Current\Bin\MSBuild.exe" src\MxGateway.Worker\MxGateway.Worker.csproj /p:Configuration=Debug /p:Platform=x86
```
Run the worker tests with the same platform target:
```powershell
dotnet test src\MxGateway.Worker.Tests\MxGateway.Worker.Tests.csproj -p:Platform=x86
```
The only MXAccess interop reference belongs in `MxGateway.Worker`. Gateway and
test projects may reference the worker project for metadata and scaffold tests,
but they must not reference `ArchestrA.MXAccess.dll` directly.
## Responsibilities
The worker owns:
- connection to the gateway pipe,
- protocol hello and readiness reporting,
- STA thread creation and teardown,
- COM initialization on the STA,
- MXAccess COM object creation,
- MXAccess event sink wiring,
- command dispatch on the STA,
- MXAccess handle and advise state tracking,
- value/status/HRESULT capture,
- conversion to worker protobuf DTOs,
- event sequencing,
- heartbeat reporting,
- graceful shutdown.
The worker does not own:
- public gRPC API,
- client authentication,
- cross-session routing,
- worker process supervision,
- remote TLS,
- policy decisions for other sessions.
## Process Bootstrap
Expected command-line arguments:
```text
--session-id <sessionId>
--pipe-name <pipeName>
--protocol-version <version>
```
Expected protected environment values:
```text
MXGATEWAY_WORKER_NONCE=<random nonce>
MXGATEWAY_WORKER_LOG_CONTEXT=<optional context>
```
Startup sequence:
1. Parse command-line arguments.
2. Configure minimal logging.
3. Validate required values are present.
4. Connect to the gateway named pipe.
5. Exchange `WorkerHello` and `GatewayHello`.
6. Validate protocol version, session id, and nonce.
7. Start the STA runtime.
8. Create the MXAccess COM object on the STA.
9. Attach MXAccess event handlers on the STA.
10. Send `WorkerReady`.
11. Start pipe read, pipe write, heartbeat, and shutdown coordination loops.
If validation fails before MXAccess creation, exit quickly with a non-zero exit
code. If MXAccess creation fails, send `WorkerFault` when possible and exit.
The bootstrap layer returns structured exit codes before it creates pipes,
starts the STA, or touches MXAccess:
| Exit code | Name | Meaning |
|-----------|------|---------|
| `0` | `Success` | Required bootstrap options are valid. |
| `1` | `UnexpectedFailure` | A non-bootstrap exception reaches the process boundary. |
| `2` | `InvalidArguments` | Required arguments are missing or unknown arguments are present. |
| `3` | `InvalidProtocolVersion` | `--protocol-version` is not numeric or does not match the supported worker protocol. |
| `4` | `MissingNonce` | `MXGATEWAY_WORKER_NONCE` is absent or empty. |
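
The table can be read as a small decision procedure. The Python sketch below is illustrative only (the worker itself is C#). The function name `bootstrap_exit_code`, the supported protocol version `1`, and the order of the checks are assumptions made for the example, not the worker's actual code.

```python
import enum


class ExitCode(enum.IntEnum):
    SUCCESS = 0
    UNEXPECTED_FAILURE = 1
    INVALID_ARGUMENTS = 2
    INVALID_PROTOCOL_VERSION = 3
    MISSING_NONCE = 4


REQUIRED = ("--session-id", "--pipe-name", "--protocol-version")
SUPPORTED_PROTOCOL_VERSION = 1  # assumption: the real value is not stated in this document


def bootstrap_exit_code(argv, env):
    """Return the exit code the bootstrap checks would produce for argv/env."""
    options = {}
    index = 0
    while index < len(argv):
        name = argv[index]
        if name not in REQUIRED or index + 1 >= len(argv):
            return ExitCode.INVALID_ARGUMENTS  # unknown flag or flag without a value
        options[name] = argv[index + 1]
        index += 2
    if any(not options.get(name) for name in REQUIRED):
        return ExitCode.INVALID_ARGUMENTS      # a required flag is missing or empty
    version = options["--protocol-version"]
    if not (version.isascii() and version.isdigit()):
        return ExitCode.INVALID_PROTOCOL_VERSION
    if int(version) != SUPPORTED_PROTOCOL_VERSION:
        return ExitCode.INVALID_PROTOCOL_VERSION
    if not env.get("MXGATEWAY_WORKER_NONCE"):
        return ExitCode.MISSING_NONCE
    return ExitCode.SUCCESS
```

`UnexpectedFailure` is not returned by the function because, per the table, it covers exceptions that escape the bootstrap checks altogether.
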
Bootstrap logs use `WorkerConsoleLogger` key/value output. `WorkerLogRedactor`
redacts fields whose names indicate nonce, secret, password, token,
credential, or API key values before the message is written.
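
A minimal Python sketch of name-based redaction for key/value log output follows. It is not `WorkerLogRedactor` itself: the pattern, the `***` placeholder, and the helper names are assumptions chosen for illustration.

```python
import re

# Assumption: these fragments mirror the field-name categories listed above.
SENSITIVE_NAME = re.compile(
    r"nonce|secret|password|token|credential|api[_-]?key", re.IGNORECASE
)


def redact(fields):
    """Return a copy of `fields` with values of sensitive-looking names masked."""
    return {
        name: ("***" if SENSITIVE_NAME.search(name) else value)
        for name, value in fields.items()
    }


def format_line(fields):
    """Render redacted fields as space-separated key=value pairs."""
    return " ".join(f"{name}={value}" for name, value in redact(fields).items())
```

Redacting by field name rather than by value means the redactor never has to inspect the secret, at the cost of relying on consistent field naming.
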
## Internal Components
```text
MxGateway.Worker
  Program
  Bootstrap
    WorkerOptions
    WorkerHost
  Ipc
    PipeClient
    FrameReader
    FrameWriter
    WorkerProtocol
  Sta
    StaRuntime
    StaCommandQueue
    MessagePump
    StaWatchdog
  MxAccess
    MxAccessSession
    MxAccessCommandDispatcher
    MxAccessEventSink
    MxAccessHandleRegistry
  Conversion
    VariantConverter
    SafeArrayConverter
    StatusProxyConverter
    HResultMapper
```
## Threading Model
```text
main thread
  -> parse args
  -> configure host
  -> coordinate shutdown
pipe reader thread/task
  -> read WorkerEnvelope frames
  -> validate protocol
  -> enqueue commands or control messages
pipe writer thread/task
  -> serialize WorkerEnvelope frames
  -> write replies, events, heartbeats, faults
STA thread
  -> CoInitializeEx(COINIT_APARTMENTTHREADED)
  -> create MXAccess COM object
  -> attach event handlers
  -> pump Windows/COM messages
  -> execute queued commands
  -> detach events and release COM on shutdown
watchdog/heartbeat task
  -> observe STA responsiveness
  -> send heartbeat or fault
```
No MXAccess method may execute outside the STA thread. Do not use `Task.Run`
around COM calls. Do not let event handlers perform pipe writes.
## STA Runtime
The STA runtime is the most important part of the worker.
Startup:
1. Create a dedicated `Thread`.
2. Set apartment state to `ApartmentState.STA`.
3. Start the thread.
4. Inside the thread, initialize COM.
5. Create the MXAccess COM object.
6. Attach event handlers.
7. Signal ready to the worker host.
8. Enter the message pump.
Shutdown:
1. Mark the command queue as completing.
2. Drain or reject pending commands according to shutdown mode.
3. Optionally issue MXAccess cleanup calls for active handles.
4. Detach event handlers.
5. Release COM references.
6. Uninitialize COM.
7. Exit the thread.
## Message Pump
The STA must pump Windows messages while also processing queued commands. A
blocking queue that prevents message pumping is not acceptable.
Required loop shape:
```text
while not shutdown:
    while command queue has work:
        execute one command on STA
    MsgWaitForMultipleObjectsEx(
        command_event,
        timeout,
        QS_ALLINPUT,
        MWMO_INPUTAVAILABLE)
    while PeekMessage:
        TranslateMessage
        DispatchMessage
```
The command queue should signal a Win32 event or equivalent wait handle so the
STA can wake without busy-waiting.
The loop should update a heartbeat timestamp after:
- successfully pumping messages,
- starting a command,
- finishing a command,
- processing an MXAccess event.
`StaRuntime` implements this runtime boundary in the worker. It starts one
background thread named `MxGateway.Worker.STA`, sets it to `ApartmentState.STA`,
initializes COM through `StaComApartmentInitializer`, and runs
`StaMessagePump`. Commands are scheduled through `InvokeAsync`; the command
queue signals an `AutoResetEvent` so `MsgWaitForMultipleObjectsEx` can wake the
STA without busy-waiting. `LastActivityUtc` records pump, command, startup, and
shutdown activity so the heartbeat/watchdog can report whether the STA
is still responsive. Shutdown marks the runtime as closing, wakes the pump,
rejects new commands, cancels queued work, uninitializes COM on the STA, and
waits for the thread to exit.
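
The loop shape described in this section can be approximated in portable Python for reasoning and tests. This is a sketch under stated assumptions, not `StaMessagePump`: `threading.Event` stands in for the Win32 wake event, the `pump_messages` callable stands in for the `PeekMessage`/`DispatchMessage` loop, and the class name `PumpLoop` is hypothetical.

```python
import collections
import threading
import time


class PumpLoop:
    def __init__(self, pump_messages, wait_timeout=0.05):
        self._queue = collections.deque()
        self._wake = threading.Event()       # stand-in for the Win32 command event
        self._shutdown = False
        self._pump_messages = pump_messages  # stand-in for the PeekMessage loop
        self._wait_timeout = wait_timeout
        self.last_activity = time.monotonic()

    def post(self, command):
        """Queue a callable and wake the loop."""
        self._queue.append(command)
        self._wake.set()

    def request_shutdown(self):
        self._shutdown = True
        self._wake.set()

    def run(self):
        while not self._shutdown:
            while self._queue:
                command = self._queue.popleft()
                self.last_activity = time.monotonic()  # starting a command
                command()
                self.last_activity = time.monotonic()  # finishing a command
            # Bounded wait: wakes early when work is posted, and otherwise
            # times out so messages are still pumped while the queue is idle.
            self._wake.wait(self._wait_timeout)
            self._wake.clear()
            self._pump_messages()
            self.last_activity = time.monotonic()      # after pumping messages
```

The queue is always re-checked after the wake event is cleared, so a command posted between the wait returning and the clear is still picked up on the next pass.
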
## COM Creation
The MXAccess analysis source at `C:\Users\dohertj2\Desktop\mxaccess` identifies
the installed COM target:
- interop assembly:
  `C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll`
- assembly identity:
  `ArchestrA.MxAccess, Version=3.2.0.0, PublicKeyToken=23106a86e706d0ae`
- COM class:
  `ArchestrA.MxAccess.LMXProxyServerClass`
- CLSID:
  `{C30B52F5-2CB5-4760-AF0A-3A344A7EB5DC}`
- ProgID:
  `LMXProxy.LMXProxyServer.1`
- version-independent ProgID:
  `LMXProxy.LMXProxyServer`
- registered server:
  `C:\Program Files (x86)\ArchestrA\Framework\Bin\LmxProxy.dll`
- registry view:
  `HKCR\Wow6432Node\CLSID\{C30B52F5-2CB5-4760-AF0A-3A344A7EB5DC}`
- threading model:
  `Apartment`
The worker should reference the interop assembly and instantiate
`LMXProxyServerClass` on the dedicated STA thread. Keep the ProgID and assembly
path configurable for diagnostics, but this COM class is the v1 default.
`MxAccessStaSession` owns the initial COM creation path. It starts `StaRuntime`,
creates `LMXProxyServerClass` through `MxAccessComObjectFactory` on the STA,
attaches `MxAccessBaseEventSink`, and returns `WorkerReady` only after those
steps succeed. `MxAccessSession` keeps the raw COM object private, records the
STA managed thread id that created it, detaches the base event sink during
disposal, and releases the COM reference on the STA. After creation,
`MxAccessStaSession` owns a `StaCommandDispatcher` backed by
`MxAccessCommandExecutor`; `DispatchAsync` queues contract commands back to the
same STA instead of exposing the COM object to callers.
Creation rules:
- Create COM object only on the STA.
- Attach event handlers only on the STA.
- Keep the COM reference private to the STA runtime.
- Never marshal the raw COM object to pipe reader/writer threads.
- Capture COM creation HRESULT or exception details.
If COM creation fails, the worker should send a structured fault with:
- fault category,
- exception type,
- HRESULT when available,
- COM class or ProgID attempted,
- worker process id,
- session id.
`WorkerPipeSession` maps startup exceptions from this path to
`WorkerFaultCategory.MxaccessCreationFailed`, includes the captured HRESULT
when the exception exposes one, and does not send `WorkerReady` after a failed
COM creation attempt.
After `WorkerReady`, `WorkerPipeSession` continues reading gateway frames for
the lifetime of the process. `WorkerCommand` frames are dispatched to
`MxAccessStaSession`, replies are written as `WorkerCommandReply`, and queued
worker events are drained after command replies. `WorkerShutdown` starts the
graceful shutdown path and returns `WorkerShutdownAck` only after the STA
cleanup path completes.
## Event Sink
The worker must subscribe to every public MXAccess event family:
- `OnDataChange`
- `OnWriteComplete`
- `OperationComplete`
- `OnBufferedDataChange`
Forward these event families only when the native MXAccess COM object raises
them. Do not synthesize `OperationComplete` from write completion or command
status. `OnBufferedDataChange` must be represented in the protocol now, but
multi-sample payload conversion should remain capture-validated; preserve raw
metadata whenever conversion is incomplete.
Event handling rules:
- Event handlers are expected to run on the STA.
- Assign a monotonic worker event sequence.
- Convert event args to `WorkerEvent`.
- Include value, quality, timestamp, handles, status arrays, and raw status
details when available.
- Preserve raw event payload metadata for unsupported buffered or
completion-only shapes.
- Enqueue to the outbound event queue.
- Return quickly to preserve message pumping.
`MxAccessBaseEventSink` implements the COM connection-point handlers and keeps
the handlers limited to event argument conversion plus enqueue. It uses
`MxAccessEventMapper` to create `MxEvent` DTOs for `OnDataChange`,
`OnWriteComplete`, `OperationComplete`, and `OnBufferedDataChange`. The mapper
converts scalar and array values through `VariantConverter`, converts
`MXSTATUS_PROXY[]` through `MxStatusProxyConverter`, and maps installed
`MxDataType` values to the public protobuf enum while preserving the raw data
type on buffered events. `OperationComplete` is only emitted from the native
`OperationComplete` handler; write completion does not synthesize it.
`MxAccessEventQueue` is the bounded outbound event queue for one worker
session. It assigns the monotonic `WorkerSequence` and `WorkerTimestamp` when an
event is accepted, preserving the order in which MXAccess handlers enqueue
events. The default capacity is `10000`. When the queue reaches capacity it
records a `WorkerFaultCategory.QueueOverflow` fault and rejects further events.
The event handler catches conversion and enqueue failures, records the first
fault on the queue, and returns to the STA message pump instead of writing to
the pipe.
If event conversion throws, catch it inside the event handler, record a
structured `WorkerFault`, and keep the worker alive only if the fault policy
allows it.
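
The queue behavior described above (monotonic sequence assigned on accept, bounded capacity, first fault recorded, later events rejected) can be sketched as follows. This is an illustrative Python model, not `MxAccessEventQueue`; the class and method names, the string fault value, and starting the sequence at 1 are assumptions.

```python
import dataclasses
import datetime
import threading


@dataclasses.dataclass
class QueuedEvent:
    worker_sequence: int
    worker_timestamp: datetime.datetime
    payload: object


class BoundedEventQueue:
    def __init__(self, capacity=10000):
        self._capacity = capacity
        self._items = []
        self._next_sequence = 1        # assumption: sequence numbering starts at 1
        self._lock = threading.Lock()
        self.fault = None              # first recorded fault, if any

    def try_enqueue(self, payload):
        """Accept the event and stamp it, or reject it without blocking."""
        with self._lock:
            if self.fault is not None:
                return False           # a fault was recorded: reject further events
            if len(self._items) >= self._capacity:
                self.fault = "QueueOverflow"
                return False
            now = datetime.datetime.now(datetime.timezone.utc)
            self._items.append(QueuedEvent(self._next_sequence, now, payload))
            self._next_sequence += 1
            return True

    def drain(self):
        """Remove and return all queued events in acceptance order."""
        with self._lock:
            items, self._items = self._items, []
            return items
```

`try_enqueue` never blocks, which matches the rule that event handlers must return quickly to the message pump.
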
## Command Queue
The pipe reader converts `WorkerCommand` messages into `StaCommand` entries.
Each entry should include:
- correlation id,
- method name,
- method-specific request payload,
- enqueue timestamp,
- cancellation marker,
- reply completion path.
The STA command dispatcher:
1. Dequeues one command.
2. Checks whether shutdown has started.
3. Calls the matching MXAccess method.
4. Captures return values, out parameters, status arrays, and HRESULT.
5. Converts results to `WorkerCommandReply`.
6. Enqueues the reply to the pipe writer.
The STA should execute one command at a time. MXAccess command ordering must be
preserved for one worker.
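
As a rough model of the entry fields and dispatcher steps above, the Python sketch below runs queued commands strictly one at a time in arrival order. The field names, the dictionary-shaped reply, and the `handlers` lookup are assumptions for illustration; the real worker uses protobuf DTOs and method-specific dispatch.

```python
import collections
import dataclasses
import time
from typing import Any, Callable


@dataclasses.dataclass
class StaCommand:
    correlation_id: str
    method: str
    payload: Any
    complete: Callable[[dict], None]   # reply completion path
    enqueued_at: float = dataclasses.field(default_factory=time.monotonic)
    canceled: bool = False             # cancellation marker


def dispatch_all(queue, handlers, shutdown_started=lambda: False):
    """Drain `queue` (a deque of StaCommand) one command at a time, in order."""
    while queue:
        command = queue.popleft()
        if shutdown_started():
            # Commands that never started are answered, not silently dropped.
            command.complete({"correlation_id": command.correlation_id,
                              "status": "WorkerUnavailable"})
            continue
        result = handlers[command.method](command.payload)
        command.complete({"correlation_id": command.correlation_id,
                          "status": "Ok",
                          "return_value": result})
```

Because a single loop owns the queue, replies are produced in the same order the commands were enqueued.
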
## Command Dispatch Surface
Phase 1 commands:
- `Register`
- `Unregister`
- `AddItem`
- `RemoveItem`
Phase 2 event commands:
- `Advise`
- `UnAdvise`
- `AdviseSupervisory`
Full surface:
- `AddItem2`
- `AddBufferedItem`
- `SetBufferedUpdateInterval`
- `Suspend`
- `Activate`
- `Write`
- `Write2`
- `WriteSecured`
- `WriteSecured2`
- `AuthenticateUser`
- `ArchestrAUserToId`
Diagnostics:
- `Ping`
- `GetSessionState`
- `GetWorkerInfo`
- `DrainEvents`
- `ShutdownWorker`
Implement method-specific dispatch instead of a generic string method invoker.
Parity tests need stable command-specific request and reply shapes.
`MxAccessCommandExecutor` implements the first command pair:
- `Register` calls `LMXProxyServerClass.Register` with the requested client
name and preserves the returned server handle in both `ReturnValue` and
`RegisterReply.ServerHandle`.
- `Unregister` calls `LMXProxyServerClass.Unregister` with the requested server
handle. The reply has no method-specific payload because the public MXAccess
method returns `void`.
Both commands set `Hresult` to `0` only after the COM call returns normally.
COM exceptions flow through `StaCommandDispatcher`, which captures the thrown
HRESULT and converts the reply to `ProtocolStatusCode.MxaccessFailure`.
`MxAccessStaSession.GetRegisteredServerHandlesAsync` returns an STA-read
snapshot of tracked server handles for diagnostics and future cleanup logic.
`MxAccessCommandExecutor` also implements the item lifecycle commands:
- `AddItem` calls `LMXProxyServerClass.AddItem` with the requested server
handle and item definition. It preserves the returned item handle in both
`ReturnValue` and `AddItemReply.ItemHandle`.
- `AddItem2` calls `LMXProxyServerClass.AddItem2` with the requested server
handle, item definition, and context string. The context string is passed to
MXAccess exactly as received.
- `RemoveItem` calls `LMXProxyServerClass.RemoveItem` with the requested server
handle and item handle. The reply has no method-specific payload because the
public MXAccess method returns `void`.
The worker records item handles only after `AddItem` or `AddItem2` returns
normally, and removes item handles only after `RemoveItem` returns normally.
The registry does not prevalidate server or item handles, so invalid and
cross-server handle behavior remains owned by MXAccess. COM exceptions continue
through `StaCommandDispatcher`, which preserves the HRESULT and leaves
diagnostic registry state unchanged for failed cleanup calls.
`MxAccessCommandExecutor` implements advice lifecycle commands on the same STA
path:
- `Advise` calls `LMXProxyServerClass.Advise` with the requested server handle
and item handle.
- `AdviseSupervisory` calls `LMXProxyServerClass.AdviseSupervisory` with the
requested server handle and item handle. This remains a distinct command from
plain `Advise` even though observed scalar captures share the same lower-level
subscription body.
- `UnAdvise` calls `LMXProxyServerClass.UnAdvise` with the requested server
handle and item handle.
The worker records plain and supervisory advice separately only after the COM
call returns normally. Successful `UnAdvise` removes all tracked advice for the
server and item pair because the public MXAccess cleanup method has no plain
versus supervisory selector. Successful `RemoveItem` and `Unregister` also clear
related advice state from the worker registry. Failed advice and cleanup calls
leave registry state unchanged so diagnostics continue to reflect the last
successful MXAccess-owned state transition.
## Handle Registry
The worker should track MXAccess state for diagnostics and cleanup, while still
treating MXAccess as the authority.
Suggested tracked state:
- registered server handles,
- item handles,
- item names and context,
- server handle for each item,
- advise state,
- buffered item state,
- authenticated user ids if needed,
- last command touching each handle.
Rules:
- Do not invent handles.
- Do not rewrite handles returned by MXAccess.
- Record server handles only after `Register` succeeds.
- Remove server handles only after `Unregister` succeeds.
- Record item handles only after `AddItem` or `AddItem2` succeeds.
- Remove item handles only after `RemoveItem` succeeds.
- Record advice state only after `Advise` or `AdviseSupervisory` succeeds.
- Remove advice state only after `UnAdvise`, `RemoveItem`, or `Unregister`
succeeds.
- Preserve invalid-handle behavior from MXAccess.
- Preserve cross-server handle behavior from MXAccess.
- Use registry state for cleanup and diagnostics, not semantic correction.
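
The "record only after success" rules can be illustrated with a small Python model. It is a sketch, not `MxAccessHandleRegistry`: the `mx` object is a hypothetical stand-in exposing methods named after the MXAccess calls, and failures are assumed to surface as exceptions.

```python
class HandleRegistry:
    """Diagnostic mirror of MXAccess state; MXAccess remains the authority."""

    def __init__(self, mx):
        self._mx = mx          # stand-in exposing AddItem, Advise, UnAdvise, ...
        self.items = {}        # (server, item) -> item definition
        self.advice = set()    # (server, item, "plain" | "supervisory")

    def add_item(self, server, definition):
        item = self._mx.AddItem(server, definition)  # raises: nothing is recorded
        self.items[(server, item)] = definition      # handle stored exactly as returned
        return item

    def advise(self, server, item, supervisory=False):
        # No prevalidation: invalid or cross-server handles go straight to MXAccess.
        if supervisory:
            self._mx.AdviseSupervisory(server, item)
        else:
            self._mx.Advise(server, item)
        self.advice.add((server, item, "supervisory" if supervisory else "plain"))

    def unadvise(self, server, item):
        self._mx.UnAdvise(server, item)
        self._drop_advice(server, item)   # no plain/supervisory selector exists

    def remove_item(self, server, item):
        self._mx.RemoveItem(server, item)
        self.items.pop((server, item), None)
        self._drop_advice(server, item)

    def _drop_advice(self, server, item):
        self.advice = {a for a in self.advice if a[:2] != (server, item)}
```

Every state change sits on the line after the MXAccess call, so a raised exception leaves the registry at the last successful transition.
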
## Value Conversion
`VariantConverter` should convert COM values into the protobuf `MxValue` union.
Supported scalar projections:
- bool,
- int32,
- int64,
- float,
- double,
- string,
- timestamp,
- raw fallback.
Supported arrays:
- bool array,
- int32 array,
- float array,
- double array,
- string array,
- timestamp array,
- raw fallback.
Rules:
- Preserve null and empty values distinctly when MXAccess exposes a distinction.
- Preserve array rank and dimensions when available.
- Preserve original variant type metadata.
- If conversion is lossy, include the best typed value plus raw diagnostic
metadata.
- Do not throw away values just because they are awkward.
Credential-bearing values must not be logged.
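
The projection rules can be sketched with Python values standing in for COM variants. This is only an analogy: the dictionary shape, the `kind` names, and the use of `repr` for the raw fallback are assumptions, and the real `VariantConverter` works on COM `VARIANT` types and the protobuf `MxValue` union.

```python
import datetime


def to_mx_value(value):
    """Project a value onto a tagged union, always keeping the source type name."""
    original_type = type(value).__name__
    if value is None:
        return {"kind": "null", "original_type": original_type}
    if isinstance(value, (list, tuple)):
        return {"kind": "array",
                "length": len(value),
                "elements": [to_mx_value(element) for element in value],
                "original_type": original_type}
    if isinstance(value, bool):              # checked before int: bool subclasses int
        kind = "bool"
    elif isinstance(value, int) and -2**31 <= value < 2**31:
        kind = "int32"
    elif isinstance(value, int) and -2**63 <= value < 2**63:
        kind = "int64"
    elif isinstance(value, float):
        kind = "double"
    elif isinstance(value, str):
        kind = "string"
    elif isinstance(value, datetime.datetime):
        kind = "timestamp"
    else:
        # Awkward values are kept as raw diagnostics instead of being discarded.
        return {"kind": "raw", "raw": repr(value), "original_type": original_type}
    return {"kind": kind, "value": value, "original_type": original_type}
```

Note that `None` and the empty string map to different kinds, which mirrors the rule about keeping null and empty values distinct.
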
## Status And HRESULT Capture
`MXSTATUS_PROXY` arrays must be represented explicitly. Do not collapse status
arrays into a single success flag.
For every command reply, capture:
- protocol success/failure,
- method name,
- correlation id,
- COM HRESULT if available,
- thrown exception HRESULT if available,
- MXAccess return value if any,
- method-specific out parameters,
- status array,
- diagnostic message safe for logs.
If a COM call throws, map the exception into a command reply instead of
crashing the worker, unless the exception indicates process corruption or the
configured policy says to fail the session.
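
A minimal Python sketch of mapping a thrown COM failure into a reply instead of crashing is shown below. `ComError` is a hypothetical stand-in for a COM exception that carries an HRESULT, and the reply dictionary is an assumption; the status names follow the `ProtocolStatusCode` values mentioned earlier in this document.

```python
class ComError(Exception):
    """Hypothetical stand-in for a COM exception carrying a signed HRESULT."""

    def __init__(self, hresult, message=""):
        super().__init__(message)
        self.hresult = hresult


def execute(correlation_id, method, call):
    """Run `call` and capture its outcome in a reply dictionary."""
    reply = {"correlation_id": correlation_id, "method": method,
             "hresult": None, "return_value": None}
    try:
        reply["return_value"] = call()
        reply["hresult"] = 0               # set only after the call returns normally
        reply["status"] = "Ok"
    except ComError as error:
        # Normalize the signed 32-bit HRESULT to its unsigned form (0x8xxxxxxx).
        reply["hresult"] = error.hresult & 0xFFFFFFFF
        reply["status"] = "MxaccessFailure"
        reply["diagnostic"] = type(error).__name__  # exception type only, no payload values
    return reply
```

Only `ComError` is caught here; any other exception propagates, which leaves room for the "fail the session" policy described above.
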
## Cancellation
Worker cancellation is cooperative at the queue boundary.
Rules:
- If a `WorkerCancel` arrives before a command starts, mark the command
canceled and reply or drop according to protocol policy.
- If a command is already executing on the STA, do not attempt to abort the COM
call.
- When the COM call returns after gateway cancellation, send the reply only if
the gateway still wants late replies; otherwise log and discard.
- Hard cancellation is process kill by the gateway.
## Outbound Queues
The worker should use bounded outbound queues for replies, events, heartbeats,
and faults.
Priority order when writing:
1. faults,
2. command replies,
3. shutdown acknowledgements,
4. heartbeats,
5. events.
Event overflow policy defaults to fail-fast for parity testing. If the event
queue fills:
1. Capture overflow metrics.
2. Send `WorkerFault` if possible.
3. Stop accepting new commands.
4. Let the gateway close or kill the worker.
Production coalescing may be added later, but it must be explicit and tested.
Do not drop or coalesce events in v1.
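
The write priority order in this section can be modeled with a heap keyed on message kind. The Python sketch below is an assumption-laden illustration: the kind names and the `OutboundQueue` class are hypothetical, and the capacity bound and overflow handling described above are deliberately left out to keep the ordering logic visible.

```python
import heapq
import itertools

# Lower number is written first, matching the priority list in this section.
PRIORITY = {"fault": 0, "reply": 1, "shutdown_ack": 2, "heartbeat": 3, "event": 4}


class OutboundQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a kind

    def put(self, kind, message):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._counter), kind, message))

    def get(self):
        """Pop the highest-priority message; oldest first within the same kind."""
        _, _, kind, message = heapq.heappop(self._heap)
        return kind, message

    def __len__(self):
        return len(self._heap)
```

The insertion counter matters: without it, two events with equal priority could be reordered, which would break event sequencing.
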
## Heartbeat And Watchdog
`WorkerPipeSession` starts the heartbeat loop after the gateway validates
`WorkerHello` and receives `WorkerReady`. Heartbeats continue until
`WorkerShutdown`, cancellation, or a pipe/protocol failure stops the session.
The loop uses `WorkerPipeSessionOptions.HeartbeatInterval`; the default matches
the gateway worker heartbeat interval.
The worker heartbeat proves that:
- pipe writer is alive,
- worker host is alive,
- STA has recently pumped or completed work.
Heartbeat payload includes:
- worker process id,
- session id,
- current state,
- last STA activity timestamp,
- pending command count,
- outbound event queue depth,
- event sequence,
- current command correlation id if any.
`MxAccessStaSession.CaptureHeartbeat()` reads `StaRuntime.LastActivityUtc` and
`StaCommandDispatcher` queue state without touching the raw MXAccess COM object
outside the STA. Event queue depth and event sequence are reported as zero until
the event queue implementation owns those counters.
The STA watchdog currently emits a `WorkerFault` with
`WorkerFaultCategory.StaHung` when `LastStaActivityUtc` is older than
`WorkerPipeSessionOptions.HeartbeatGrace` **and no command is in flight**.
`StaRuntime.ProcessQueuedCommands` calls `MarkActivity()` only immediately
before and after each work item, so a synchronously long-running STA command
(for example a `ReadBulk` waiting `timeout_ms` for the first `OnDataChange`)
legitimately freezes `LastStaActivityUtc` for the duration of the wait while
the worker is healthy. The watchdog is therefore suppressed while the
heartbeat snapshot's `CurrentCommandCorrelationId` is non-empty: the worker is
busy executing a command, not hung, and the heartbeat already surfaces the
in-flight correlation id so the gateway can apply its own per-command timeout
if it considers the command too slow. The fault still fires on a truly hung
STA (no command in flight and no activity for longer than `HeartbeatGrace`),
which is the only case the watchdog can usefully distinguish from a slow
command. Command duration and high event queue depth remain observable through
heartbeat fields until dedicated thresholds own those warnings. The worker
reports stale STA activity, but the gateway owns the final kill decision
through its existing heartbeat and worker lifecycle policy.
The in-flight-command suppression itself is bounded by
`WorkerPipeSessionOptions.HeartbeatStuckCeiling` (default 75 seconds = 5 ×
`HeartbeatGrace`). The motivating case for the suppression is a legitimately
slow synchronous command — but a genuinely stuck COM call (for example
against a dead MXAccess provider whose cross-apartment marshaler is
permanently blocked, or a write completion that never fires) leaves
`CurrentCommandCorrelationId` non-empty indefinitely. Without an upper bound
the worker-side `StaHung` watchdog would be permanently defeated for that
session and only the gateway's per-command timeout would catch the hang —
losing the worker-originated diagnostic (`StaHung` fault category, the
stale-by interval) from the gateway audit trail. Once `LastStaActivityUtc`
has been stale for longer than `HeartbeatStuckCeiling`, the watchdog fires
`StaHung` regardless of whether a command is in flight, on the assumption
that no legitimate STA command should run that long without periodically
refreshing activity. Deployments that legitimately run very long bulk
operations should raise the ceiling rather than disable it.
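
The watchdog decision described in the two paragraphs above reduces to a small predicate. The Python sketch below is illustrative; the function name is hypothetical, and the 15 second grace default is inferred from "75 seconds = 5 × `HeartbeatGrace`" rather than stated directly.

```python
def should_raise_sta_hung(stale_seconds, command_in_flight,
                          grace=15.0, stuck_ceiling=75.0):
    """Decide whether the watchdog should emit a StaHung fault.

    stale_seconds: time since the last recorded STA activity.
    command_in_flight: True when the heartbeat snapshot has a non-empty
        current command correlation id.
    """
    if stale_seconds <= grace:
        return False               # recent activity: healthy
    if not command_in_flight:
        return True                # idle and stale: hung
    # A command is running: suppress the fault, but only up to the ceiling.
    return stale_seconds > stuck_ceiling
```

For example, an STA that has been stale for 20 seconds while executing a command is treated as busy, while the same command at 80 seconds trips the fault.
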
## Shutdown
Graceful shutdown sequence:
1. Pipe reader receives `WorkerShutdown`.
2. Worker host marks shutdown requested.
3. Reject new commands.
4. Let current STA command finish if within timeout.
5. Optionally run MXAccess cleanup:
   - `UnAdvise`,
   - `RemoveItem`,
   - `Unregister`.
6. Detach event handlers.
7. Release the COM object, repeating the release until the reference count reaches zero when possible.
8. Stop pipe reader and writer.
9. Exit process with success code.
If shutdown wedges, the gateway kills the process. The worker should be written
so process kill does not corrupt other sessions.
`MxAccessStaSession.ShutdownGracefullyAsync` implements the current cleanup
path. It first calls `StaCommandDispatcher.RequestShutdown()` so new commands
are rejected and queued commands that have not started receive
`ProtocolStatusCode.WorkerUnavailable`. The command already executing on the
STA is allowed to finish until the shutdown grace period expires.
After command dispatch is closed, cleanup runs on the STA in MXAccess handle
order:
1. one `UnAdvise` call per advised server/item pair,
2. `RemoveItem` for active item handles,
3. `Unregister` for active server handles,
4. event sink detach,
5. COM release.
Each cleanup call is best effort. A failed cleanup operation is recorded as an
`MxAccessShutdownFailure`, logged by `WorkerPipeSession`, and does not prevent
later cleanup calls from running. A shutdown with cleanup failures still returns
`WorkerShutdownAck` with `ProtocolStatusCode.Ok` because the worker reached the
controlled release path. If the grace period expires before cleanup can run or
finish, the worker reports `WorkerFaultCategory.ShutdownTimeout` when possible
and relies on the gateway to kill the process.
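
The best-effort cleanup order can be sketched as follows. This Python model is not `ShutdownGracefullyAsync`: the `mx` object is a hypothetical stand-in with methods named after the MXAccess calls, failures are assumed to surface as exceptions, and the tuple-shaped failure record is an assumption. Event sink detach and COM release are omitted.

```python
def run_cleanup(mx, advised, items, servers):
    """Run cleanup calls in order; collect failures instead of stopping.

    advised and items are lists of (server_handle, item_handle) pairs;
    servers is a list of server handles.
    """
    failures = []

    def attempt(name, call, *args):
        try:
            call(*args)
        except Exception as error:   # best effort: record and keep going
            failures.append((name, args, type(error).__name__))

    for server, item in advised:
        attempt("UnAdvise", mx.UnAdvise, server, item)
    for server, item in items:
        attempt("RemoveItem", mx.RemoveItem, server, item)
    for server in servers:
        attempt("Unregister", mx.Unregister, server)
    return failures
```

Returning the failure list rather than raising lets the caller log each failure and still acknowledge the shutdown.
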
## Fault Handling
Worker fault categories:
- `InvalidArguments`
- `GatewayAuthenticationFailed`
- `ProtocolMismatch`
- `ProtocolViolation`
- `PipeDisconnected`
- `MxAccessCreationFailed`
- `MxAccessCommandFailed`
- `MxAccessEventConversionFailed`
- `StaHung`
- `QueueOverflow`
- `ShutdownTimeout`
Fault payload should include:
- category,
- session id,
- correlation id when command-specific,
- command method when command-specific,
- HRESULT when available,
- exception type when available,
- safe diagnostic message.
Do not include raw credentials or full secured-write values.
## Security
The worker should trust only the launching gateway after validating:
- expected session id,
- expected protocol version,
- nonce,
- pipe identity where available.
It should not expose any network listener. It should not accept commands from
arbitrary local processes.
Credential-bearing commands must keep credential data out of:
- command line,
- logs,
- metrics labels,
- exception messages,
- crash dumps when avoidable.
## Observability
Worker logs should include:
- startup arguments except secrets,
- protocol version,
- gateway handshake result,
- MXAccess COM creation result,
- command start/end with correlation id,
- HRESULT/status summary,
- event family and sequence,
- queue overflow,
- STA watchdog warnings,
- shutdown path.
Metrics can be emitted through the gateway or exposed as worker heartbeat
fields. The worker does not need its own public metrics endpoint.
## Testing Strategy
Worker tests that do not require installed MXAccess:
- frame reader/writer,
- protocol validation,
- command queue ordering,
- STA command scheduling with a fake COM object,
- message-pump wake behavior where practical,
- value conversion,
- status conversion,
- event conversion from fake event args,
- shutdown state transitions,
- queue overflow behavior.
Live MXAccess tests:
- COM creation on STA,
- `Register` and `Unregister`,
- `AddItem` and `RemoveItem`,
- `Advise` and one `OnDataChange`,
- write completion behavior,
- secured write behavior,
- buffered data-change behavior,
- invalid handle behavior,
- no synthesized `OperationComplete` when native MXAccess does not raise it,
- raw metadata preservation for buffered payloads that cannot yet be fully
converted.
Live tests should be opt-in and clearly marked because they depend on installed
MXAccess COM and provider state.
The worker test suite uses `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1` for these
tests. `AddItem` uses `TestChildObject.TestInt` by default and accepts an
override through `MXGATEWAY_LIVE_MXACCESS_ITEM`; `AddItem2` uses the captured
parity fixture shape `AddItem2("TestInt", "TestChildObject")`.
`WorkerLiveMxAccessSmokeTests` in `src/MxGateway.IntegrationTests/` uses the
same opt-in variable for the gateway-to-worker live smoke. It launches the x86
worker through `WorkerProcessLauncher`, opens a gateway session, runs
`Register`, `AddItem`, and `Advise`, waits for one `OnDataChange`, and closes
the session. The smoke accepts `MXGATEWAY_LIVE_MXACCESS_WORKER_EXE` for a
non-default worker executable path and
`MXGATEWAY_LIVE_MXACCESS_EVENT_TIMEOUT_SECONDS` for the bounded event wait.
## Initial Implementation Slice
The first worker slice should implement:
1. Argument parsing and pipe connection.
2. Protocol hello and nonce validation.
3. STA thread startup.
4. COM initialization and MXAccess object creation.
5. Message pump with command wake event.
6. `WorkerReady`.
7. Shutdown command.
8. `Register`, `AddItem`, and `Advise`.
9. Event sink for one `OnDataChange`.
10. Basic value/status conversion.
11. Event model coverage for `OperationComplete` and `OnBufferedDataChange`
without synthesized events.
12. Fault reporting.
This slice proves the worker can preserve the core MXAccess requirements:
single-process isolation, STA ownership, message pumping, command execution,
and event delivery.
## Related Documentation
- [Worker Bootstrap](./WorkerBootstrap.md)
- [Worker STA](./WorkerSta.md)
- [Worker Conversion](./WorkerConversion.md)
- [Worker Frame Protocol](./WorkerFrameProtocol.md)
- [Worker Process Launcher](./WorkerProcessLauncher.md)
- [Gateway Process Detailed Design](./GatewayProcessDesign.md)
- [Design Decisions](./DesignDecisions.md)