# Code Review — Worker | Field | Value | |---|---| | Module | `src/MxGateway.Worker` | | Reviewer | Claude Code | | Review date | 2026-05-18 | | Commit reviewed | `6c64030` | | Status | Reviewed | | Open findings | 15 | ## Checklist coverage | # | Category | Result | |---|---|---| | 1 | Correctness & logic bugs | Issues found: heartbeat loop sleeps before first beat (Worker-002), `ProcessCommandAsync` state race drops replies (Worker-003), watchdog/heartbeat state inconsistency (Worker-004), double-dispose path (Worker-006), plus Worker-010/011/015. | | 2 | mxaccessgw conventions | Issue found: Worker-007 (reflection-based COM invocation bypasses the typed interface contract). | | 3 | Concurrency & thread safety | Issues found: Worker-001 (`WnWrapAlarmConsumer` timer fires COM off the STA), Worker-008 (consumer factory STA-affinity not enforced). | | 4 | Error handling & resilience | Issue found: Worker-005 (`OnPoll` silently swallows all poll failures). | | 5 | Security | No secret logging (redaction applied); inbound frame validation reasonable. No issues found. | | 6 | Performance & resource management | Issue found: Worker-009 (per-frame `byte[]` allocations on the hot event path). COM release is correct. | | 7 | Design-document adherence | Code matches `WorkerSta.md`/`WorkerFrameProtocol.md`; stale alarm-path docs (Worker-012). | | 8 | Code organization & conventions | Issue found: Worker-014 (`AlarmCommandHandler.cs` declares two public types in one file). | | 9 | Testing coverage | Issue found: Worker-013 (`StaMessagePump` has no direct tests; poll-loop lifecycle untested). | | 10 | Documentation & comments | Issue found: Worker-012 (stale "future PR / A.3" comments now describe shipped code). | ## Findings ### Worker-001 | Field | Value | |---|---| | Severity | High | | Category | Concurrency & thread safety | | Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:204-207` | | Status | Open | **Description:** When constructed with `pollIntervalMilliseconds > 0`, `Subscribe` starts a `System.Threading.Timer` whose `OnPoll` callback runs `PollOnce()` — which calls `wwAlarmConsumerClass.GetXmlCurrentAlarms2` — on a thread-pool thread. The wnwrap CLSID is registered `ThreadingModel=Apartment`; calling its methods off the owning STA violates the hard rule that all COM calls happen on the dedicated STA thread, and can deadlock on cross-apartment marshaling when the STA is not pumping. The production path (default constructor, interval 0) is safe, but the public 3-arg constructor leaves this footgun callable, and tests/live-smoke use it. **Recommendation:** Remove the internal `Timer` entirely (production already drives `PollOnce` from the STA), or document and gate it so it can only be used from an STA thread. At minimum, make the timer-driven mode unreachable from any production wiring. **Resolution:** _(open)_ ### Worker-002 | Field | Value | |---|---| | Severity | High | | Category | Correctness & logic bugs | | Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:545-549` | | Status | Open | **Description:** `RunHeartbeatLoopAsync` calls `await Task.Delay(_sessionOptions.HeartbeatInterval, ...)` before sending the first heartbeat. The gateway therefore receives no heartbeat for the first full interval (default 5s) after the worker reaches `Ready`. If the gateway's liveness watchdog expects a heartbeat sooner, a healthy worker can be misclassified as hung at startup. **Recommendation:** Send an initial heartbeat immediately on entering the loop, or move the `Task.Delay` to the end of the loop body. **Resolution:** _(open)_ ### Worker-003 | Field | Value | |---|---| | Severity | High | | Category | Correctness & logic bugs | | Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:399-403`, `:416-419` | | Status | Open | **Description:** `ProcessCommandAsync` checks `_state` after `DispatchAsync` completes and silently `return`s without writing a `WorkerCommandReply` (or fault) when `_state` is not `Ready`/`ExecutingCommand`. `_state` is a plain field mutated from multiple tasks (heartbeat loop, event-drain loop, shutdown). A command that completes successfully while `_state` has transitioned will have its reply dropped with no diagnostic, and the gateway's correlation-id wait then hangs until its own timeout. The `_state` read is also not synchronized. **Recommendation:** Always attempt to write the reply/fault for an in-flight command, or explicitly reject in-flight commands with a `Canceled`/`WorkerUnavailable` reply during state transitions. Make `_state` access thread-safe (volatile or locked). **Resolution:** _(open)_ ### Worker-004 | Field | Value | |---|---| | Severity | Medium | | Category | Correctness & logic bugs | | Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:565-588` | | Status | Open | **Description:** After `ReportWatchdogFaultIfNeededAsync` sends an `StaHung` fault, the heartbeat loop continues sending normal heartbeats with `State` derived from `_state`, which the watchdog path never sets to `Faulted`. The heartbeat then keeps reporting a non-faulted state that contradicts the fault just sent. **Recommendation:** Set `_state = WorkerState.Faulted` (thread-safely) when the watchdog fault fires so heartbeat state and fault stay consistent. **Resolution:** _(open)_ ### Worker-005 | Field | Value | |---|---| | Severity | Medium | | Category | Error handling & resilience | | Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:297-313` | | Status | Open | **Description:** `OnPoll` catches every exception from `PollOnce()` and discards it (`_ = ex;`). The production poll path (`MxAccessStaSession.RunAlarmPollLoopAsync` → `AlarmCommandHandler.PollOnce` → `AlarmDispatcher.PollOnce` → `consumer.PollOnce()`) has no fault recording either. A permanently failing alarm provider (e.g. `GetXmlCurrentAlarms2` returning `E_FAIL`, malformed XML throwing in `XmlDocument.LoadXml`) is therefore completely silent — no fault on the event queue, no log. **Recommendation:** Route poll failures to `MxAccessEventQueue.RecordFault` (or a logger) so a broken alarm subscription becomes observable. Update the now-stale comment. **Resolution:** _(open)_ ### Worker-006 | Field | Value | |---|---| | Severity | Medium | | Category | Correctness & logic bugs | | Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:117-124`, `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:386-491` | | Status | Open | **Description:** `RunAsync`'s `finally` calls `_runtimeSession?.Dispose()` unless `_shutdownTimedOut`. On the normal path `ShutdownGracefullyAsync` already disposed the STA runtime, so re-entering `Dispose()` is a harmless no-op only because `ShutdownGracefullyAsync` reached its end and set `disposed = true`. If `ShutdownGracefullyAsync` throws `TimeoutException` after partial teardown with `_shutdownTimedOut` set, the session is never disposed at all — the `finally` skips it — leaking the STA thread and COM object, leaving cleanup to rely solely on process exit. **Recommendation:** Make the dispose decision explicit and confirm process exit always follows a timed-out shutdown; otherwise dispose defensively. At minimum document why disposal is deliberately skipped on timeout. **Resolution:** _(open)_ ### Worker-007 | Field | Value | |---|---| | Severity | Medium | | Category | mxaccessgw conventions | | Location | `src/MxGateway.Worker/MxAccess/MxAccessComServer.cs:130-150` | | Status | Open | **Description:** `Invoke` uses late-bound `Type.InvokeMember` reflection as a fallback when the COM object does not cast to `ILMXProxyServer*`. In production the object is always `LMXProxyServerClass`, so the reflection path exists only for test doubles — it is dead/untested code on the production path and obscures the interface contract. `params object[] arguments` also boxes value-type handles on every call. **Recommendation:** Drop the reflection fallback and require the COM object to implement the interface (tests can supply a typed fake), or clearly mark the fallback as test-only. **Resolution:** _(open)_ ### Worker-008 | Field | Value | |---|---| | Severity | Medium | | Category | Concurrency & thread safety | | Location | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-249`, `:429-447` | | Status | Open | **Description:** `RunAlarmPollLoopAsync` correctly marshals `handler.PollOnce()` onto the STA via `staRuntime.InvokeAsync`, and the cancel/await/dispose ordering in `ShutdownGracefullyAsync` is sound. However, nothing enforces that the `consumerFactory` and all `IMxAccessAlarmConsumer` calls run on the STA thread; a future caller could break STA affinity silently. **Recommendation:** Add an assertion or documented invariant that the consumer factory and all `IMxAccessAlarmConsumer` calls run on the STA thread, mirroring the existing `MxAccessSession.CreationThreadId` pattern. **Resolution:** _(open)_ ### Worker-009 | Field | Value | |---|---| | Severity | Low | | Category | Performance & resource management | | Location | `src/MxGateway.Worker/Ipc/WorkerFrameReader.cs:31,49`, `src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:57-58` | | Status | Open | **Description:** Every frame read allocates a fresh 4-byte length buffer and a payload `byte[]`; every write allocates `ToByteArray()` plus a 4-byte prefix. On the hot event-drain path (batches of up to 128 `WorkerEvent` frames every 25 ms) this produces steady gen-0 garbage. `WorkerFrameWriter` also effectively serializes twice (`CalculateSize()` then `ToByteArray()`). **Recommendation:** Reuse a pooled buffer / `ArrayPool` for the length prefix and payload, and write directly into a pooled buffer using `CodedOutputStream`. Low priority unless event throughput is high. **Resolution:** _(open)_ ### Worker-010 | Field | Value | |---|---| | Severity | Low | | Category | Correctness & logic bugs | | Location | `src/MxGateway.Worker/Conversion/VariantConverter.cs:204-226` | | Status | Open | **Description:** `ConvertInt64Scalar` is reached for `TypeCode.UInt32` and `TypeCode.Int64`. For a `uint` with `expectedDataType == MxDataType.Time`, the value is treated as a Windows `FILETIME` via `DateTime.FromFileTimeUtc(longValue)`; a 32-bit FILETIME is never a valid full FILETIME, so this silently produces a near-epoch timestamp rather than a raw/diagnostic value. Unlikely in practice but a silent misconversion. **Recommendation:** Only apply the `MxDataType.Time` FILETIME projection for 64-bit source types; for `uint` fall through to integer or raw. **Resolution:** _(open)_ ### Worker-011 | Field | Value | |---|---| | Severity | Low | | Category | Correctness & logic bugs | | Location | `src/MxGateway.Worker/Ipc/WorkerPipeClient.cs:169-171` | | Status | Open | **Description:** `retryAttempts` is computed as `(connectTimeout / min(connectTimeout, attemptTimeout)) - 1`. With defaults (30000 / 2000) this yields 14 retries, but each retry also incurs Polly exponential backoff. The overall `connectDeadline` (`CancelAfter(connectTimeout)`) is the real bound, so the computed attempt count can be larger or smaller than the time budget allows, and the formula is opaque. **Recommendation:** Drive retries purely off the `connectDeadline` token (Polly stops when cancelled) and drop the fragile attempt-count arithmetic, or add a comment explaining the intent. **Resolution:** _(open)_ ### Worker-012 | Field | Value | |---|---| | Severity | Low | | Category | Documentation & comments | | Location | `src/MxGateway.Worker/MxAccess/MxAccessAlarmEventSink.cs:44-55`, `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:38-43`, `src/MxGateway.Worker/MxAccess/MxAccessEventMapper.cs:106-112` | | Status | Open | **Description:** Multiple comments describe the alarm path as not-yet-wired future work ("PR A.2 — COM-side subscription scaffold … the worker advertises no alarm subscription", "the worker bootstrap will gain a thin 'run-on-STA' wrapper as part of A.3"). As of commit 6c64030 the alarm command handler, STA poll loop, and `SubscribeAlarms`/`AcknowledgeAlarm`/`QueryActiveAlarms` are all wired. These comments are stale and misleading. **Recommendation:** Update the XML docs/comments to describe the shipped behavior; remove the "future PR" framing. **Resolution:** _(open)_ ### Worker-013 | Field | Value | |---|---| | Severity | Low | | Category | Testing coverage | | Location | `src/MxGateway.Worker/Sta/StaMessagePump.cs` | | Status | Open | **Description:** `StaMessagePump` — the heart of COM event delivery (`MsgWaitForMultipleObjectsEx` + `PeekMessage`/`DispatchMessage`) — has no direct unit tests. `StaRuntimeTests` exercises it indirectly for command wake-up but never verifies that a posted Windows message actually wakes the wait and is dispatched, nor that `PumpPendingMessages` returns a correct count. The alarm poll-loop lifecycle in `MxAccessStaSession` (start/cancel/await on shutdown) also has no test. These are the most failure-sensitive paths in the module. **Recommendation:** Add tests that post a message to the STA thread and assert it is pumped, and tests covering alarm poll-loop start/stop and shutdown ordering. **Resolution:** _(open)_ ### Worker-014 | Field | Value | |---|---| | Severity | Low | | Category | Code organization & conventions | | Location | `src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:33`, `:202` | | Status | Open | **Description:** The file declares two public types — the `AlarmCommandHandler` class and the `IAlarmCommandHandler` interface. The C# style guide and the rest of the module follow one-public-type-per-file (e.g. interfaces in their own `I*.cs` files like `IMxAccessAlarmConsumer.cs`). **Recommendation:** Move `IAlarmCommandHandler` to its own `IAlarmCommandHandler.cs` for consistency. **Resolution:** _(open)_ ### Worker-015 | Field | Value | |---|---| | Severity | Low | | Category | Correctness & logic bugs | | Location | `src/MxGateway.Worker/MxAccess/MxAccessEventQueue.cs:115-145` | | Status | Open | **Description:** On overflow, `Enqueue` records the overflow fault and throws `MxAccessEventQueueOverflowException`; `MxAccessBaseEventSink.EnqueueEvent` catches it and calls `RecordFault` again. `RecordFault` is a no-op when a fault already exists, so the second call is harmless — but the intent is muddled, and there is no test asserting the dropped-event behavior. This is acceptable per the fail-fast design but undocumented at the call site. **Recommendation:** Add a brief comment in `EnqueueEvent` clarifying that an overflow exception is expected and already self-records its fault, so the catch is intentionally a near no-op. **Resolution:** _(open)_