Files
mxaccessgw/code-reviews/Worker/findings.md
T
Joseph Doherty 53e3973209 Resolve Worker-001, Worker-002, Worker-003 code-review findings
Worker-001: WnWrapAlarmConsumer armed a System.Threading.Timer whose OnPoll
callback ran GetXmlCurrentAlarms2 on a thread-pool thread against the
Apartment-threaded wnwrap COM object, which can deadlock on cross-apartment
marshaling. Removed the pollTimer/pollIntervalMs fields, OnPoll, the
poll-interval constructor parameter, and the timer arm/disposal. Polls are
driven externally by the STA via StaRuntime.InvokeAsync(PollOnce).

Worker-002: RunHeartbeatLoopAsync delayed a full HeartbeatInterval before
the first heartbeat. Restructured so the first beat is sent immediately on
entering the loop and the delay applies only between subsequent beats.

Worker-003: ProcessCommandAsync silently returned without a reply when
_state was not a command-serving state after dispatch. Both drop sites now
log a WorkerCommandResultDropped diagnostic with correlation_id via
IWorkerLogger; _state is now volatile.

Three pre-existing tests that asserted strict frame ordering were updated to
tolerate an interleaved first heartbeat (Worker-002 consequence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:59:46 -04:00

253 lines
17 KiB
Markdown

# Code Review — Worker
| Field | Value |
|---|---|
| Module | `src/MxGateway.Worker` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `6c64030` |
| Status | Reviewed |
| Open findings | 12 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: heartbeat loop sleeps before first beat (Worker-002), `ProcessCommandAsync` state race drops replies (Worker-003), watchdog/heartbeat state inconsistency (Worker-004), double-dispose path (Worker-006), plus Worker-010/011/015. |
| 2 | mxaccessgw conventions | Issue found: Worker-007 (reflection-based COM invocation bypasses the typed interface contract). |
| 3 | Concurrency & thread safety | Issues found: Worker-001 (`WnWrapAlarmConsumer` timer fires COM off the STA), Worker-008 (consumer factory STA-affinity not enforced). |
| 4 | Error handling & resilience | Issue found: Worker-005 (`OnPoll` silently swallows all poll failures). |
| 5 | Security | No secret logging (redaction applied); inbound frame validation reasonable. No issues found. |
| 6 | Performance & resource management | Issue found: Worker-009 (per-frame `byte[]` allocations on the hot event path). COM release is correct. |
| 7 | Design-document adherence | Code matches `WorkerSta.md`/`WorkerFrameProtocol.md`; stale alarm-path docs (Worker-012). |
| 8 | Code organization & conventions | Issue found: Worker-014 (`AlarmCommandHandler.cs` declares two public types in one file). |
| 9 | Testing coverage | Issue found: Worker-013 (`StaMessagePump` has no direct tests; poll-loop lifecycle untested). |
| 10 | Documentation & comments | Issue found: Worker-012 (stale "future PR / A.3" comments now describe shipped code). |
## Findings
### Worker-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:204-207` |
| Status | Resolved |
**Description:** When constructed with `pollIntervalMilliseconds > 0`, `Subscribe` starts a `System.Threading.Timer` whose `OnPoll` callback runs `PollOnce()` — which calls `wwAlarmConsumerClass.GetXmlCurrentAlarms2` — on a thread-pool thread. The wnwrap CLSID is registered `ThreadingModel=Apartment`; calling its methods off the owning STA violates the hard rule that all COM calls happen on the dedicated STA thread, and can deadlock on cross-apartment marshaling when the STA is not pumping. The production path (default constructor, interval 0) is safe, but the public 3-arg constructor leaves this footgun callable, and tests/live-smoke use it.
**Recommendation:** Remove the internal `Timer` entirely (production already drives `PollOnce` from the STA), or document and gate it so it can only be used from an STA thread. At minimum, make the timer-driven mode unreachable from any production wiring.
**Resolution:** 2026-05-18 — Removed the off-STA timer infrastructure from `WnWrapAlarmConsumer`: the `Timer? pollTimer` and `pollIntervalMs` fields, the `DefaultPollIntervalMilliseconds` constant, the `OnPoll` callback, the timer-arming arm in `Subscribe`, and the timer disposal block in `Dispose`. The `pollIntervalMilliseconds` parameter is gone from both public constructors (the test-seam ctor is now 2-arg: `wwAlarmConsumerClass` + `maxAlarmsPerFetch`), so the off-STA footgun is structurally unreachable. `PollOnce()` remains the public STA-driven entry point. The stale "poll … on a timer below" comment was corrected. Verified by the regression tests `WnWrapAlarmConsumer_has_no_internal_timer_field` and `WnWrapAlarmConsumer_exposes_no_poll_interval_constructor_parameter`; the `AlarmsLiveSmokeTests` call site was updated to the 2-arg constructor.
### Worker-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:545-549` |
| Status | Resolved |
**Description:** `RunHeartbeatLoopAsync` calls `await Task.Delay(_sessionOptions.HeartbeatInterval, ...)` before sending the first heartbeat. The gateway therefore receives no heartbeat for the first full interval (default 5s) after the worker reaches `Ready`. If the gateway's liveness watchdog expects a heartbeat sooner, a healthy worker can be misclassified as hung at startup.
**Recommendation:** Send an initial heartbeat immediately on entering the loop, or move the `Task.Delay` to the end of the loop body.
**Resolution:** 2026-05-18 — Restructured `RunHeartbeatLoopAsync` so the `Task.Delay(HeartbeatInterval)` is applied between beats only, not before the first. A `firstBeat` guard skips the delay on the initial iteration, so the gateway sees a heartbeat as soon as the worker is `Ready`; cancellation behavior is preserved (the loop still observes the token and the delay still throws on cancellation). Verified by the regression test `RunAsync_SendsFirstHeartbeatImmediatelyOnEnteringLoop`. Three pre-existing tests (`WorkerPipeClientTests.RunAsync_ConnectsToPipeAndCompletesHandshake`, `WorkerPipeClientTests.RunAsync_RetriesUntilPipeServerAppears`, `WorkerPipeSessionTests.RunAsync_WhenCommandThrowsAfterShutdown_DropsLateFaultAndWritesShutdownAck`) assumed strict frame ordering and were updated to skip the now-interleaved first heartbeat while still asserting the same shutdown-ack behavior.
### Worker-003
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:399-403`, `:416-419` |
| Status | Resolved |
**Description:** `ProcessCommandAsync` checks `_state` after `DispatchAsync` completes and silently `return`s without writing a `WorkerCommandReply` (or fault) when `_state` is not `Ready`/`ExecutingCommand`. `_state` is a plain field mutated from multiple tasks (heartbeat loop, event-drain loop, shutdown). A command that completes successfully while `_state` has transitioned will have its reply dropped with no diagnostic, and the gateway's correlation-id wait then hangs until its own timeout. The `_state` read is also not synchronized.
**Recommendation:** Always attempt to write the reply/fault for an in-flight command, or explicitly reject in-flight commands with a `Canceled`/`WorkerUnavailable` reply during state transitions. Make `_state` access thread-safe (volatile or locked).
**Resolution:** 2026-05-18 — Both silent-drop `return` sites in `ProcessCommandAsync` (the post-`DispatchAsync` success path and the exception path) now call a new `LogCommandResultDropped` helper before returning. The helper logs an Information event named `WorkerCommandResultDropped` via the session's `IWorkerLogger`, carrying the command's `correlation_id` plus `command_method` and `worker_state`, so a stuck gateway correlation-id wait is now traceable. The `_state` field was made `volatile` (`WorkerState` is an int-backed protobuf enum, so volatile is valid) so cross-thread reads observe the latest value without tearing; this is a low-risk, non-behavioral change and did not destabilize any test. Verified by the regression test `RunAsync_WhenReplyIsDroppedAfterShutdown_LogsDiagnostic`.
### Worker-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:565-588` |
| Status | Open |
**Description:** After `ReportWatchdogFaultIfNeededAsync` sends an `StaHung` fault, the heartbeat loop continues sending normal heartbeats with `State` derived from `_state`, which the watchdog path never sets to `Faulted`. The heartbeat then keeps reporting a non-faulted state that contradicts the fault just sent.
**Recommendation:** Set `_state = WorkerState.Faulted` (thread-safely) when the watchdog fault fires so heartbeat state and fault stay consistent.
**Resolution:** _(open)_
### Worker-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:297-313` |
| Status | Open |
**Description:** `OnPoll` catches every exception from `PollOnce()` and discards it (`_ = ex;`). The production poll path (`MxAccessStaSession.RunAlarmPollLoopAsync``AlarmCommandHandler.PollOnce``AlarmDispatcher.PollOnce``consumer.PollOnce()`) has no fault recording either. A permanently failing alarm provider (e.g. `GetXmlCurrentAlarms2` returning `E_FAIL`, malformed XML throwing in `XmlDocument.LoadXml`) is therefore completely silent — no fault on the event queue, no log.
**Recommendation:** Route poll failures to `MxAccessEventQueue.RecordFault` (or a logger) so a broken alarm subscription becomes observable. Update the now-stale comment.
**Resolution:** _(open)_
### Worker-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:117-124`, `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:386-491` |
| Status | Open |
**Description:** `RunAsync`'s `finally` calls `_runtimeSession?.Dispose()` unless `_shutdownTimedOut`. On the normal path `ShutdownGracefullyAsync` already disposed the STA runtime, so re-entering `Dispose()` is a harmless no-op only because `ShutdownGracefullyAsync` reached its end and set `disposed = true`. If `ShutdownGracefullyAsync` throws `TimeoutException` after partial teardown with `_shutdownTimedOut` set, the session is never disposed at all — the `finally` skips it — leaking the STA thread and COM object, leaving cleanup to rely solely on process exit.
**Recommendation:** Make the dispose decision explicit and confirm process exit always follows a timed-out shutdown; otherwise dispose defensively. At minimum document why disposal is deliberately skipped on timeout.
**Resolution:** _(open)_
### Worker-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | mxaccessgw conventions |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessComServer.cs:130-150` |
| Status | Open |
**Description:** `Invoke` uses late-bound `Type.InvokeMember` reflection as a fallback when the COM object does not cast to `ILMXProxyServer*`. In production the object is always `LMXProxyServerClass`, so the reflection path exists only for test doubles — it is dead/untested code on the production path and obscures the interface contract. `params object[] arguments` also boxes value-type handles on every call.
**Recommendation:** Drop the reflection fallback and require the COM object to implement the interface (tests can supply a typed fake), or clearly mark the fallback as test-only.
**Resolution:** _(open)_
### Worker-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-249`, `:429-447` |
| Status | Open |
**Description:** `RunAlarmPollLoopAsync` correctly marshals `handler.PollOnce()` onto the STA via `staRuntime.InvokeAsync`, and the cancel/await/dispose ordering in `ShutdownGracefullyAsync` is sound. However, nothing enforces that the `consumerFactory` and all `IMxAccessAlarmConsumer` calls run on the STA thread; a future caller could break STA affinity silently.
**Recommendation:** Add an assertion or documented invariant that the consumer factory and all `IMxAccessAlarmConsumer` calls run on the STA thread, mirroring the existing `MxAccessSession.CreationThreadId` pattern.
**Resolution:** _(open)_
### Worker-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Worker/Ipc/WorkerFrameReader.cs:31,49`, `src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:57-58` |
| Status | Open |
**Description:** Every frame read allocates a fresh 4-byte length buffer and a payload `byte[]`; every write allocates `ToByteArray()` plus a 4-byte prefix. On the hot event-drain path (batches of up to 128 `WorkerEvent` frames every 25 ms) this produces steady gen-0 garbage. `WorkerFrameWriter` also effectively serializes twice (`CalculateSize()` then `ToByteArray()`).
**Recommendation:** Reuse a pooled buffer / `ArrayPool<byte>` for the length prefix and payload, and write directly into a pooled buffer using `CodedOutputStream`. Low priority unless event throughput is high.
**Resolution:** _(open)_
### Worker-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Conversion/VariantConverter.cs:204-226` |
| Status | Open |
**Description:** `ConvertInt64Scalar` is reached for `TypeCode.UInt32` and `TypeCode.Int64`. For a `uint` with `expectedDataType == MxDataType.Time`, the value is treated as a Windows `FILETIME` via `DateTime.FromFileTimeUtc(longValue)`; a 32-bit FILETIME is never a valid full FILETIME, so this silently produces a near-epoch timestamp rather than a raw/diagnostic value. Unlikely in practice but a silent misconversion.
**Recommendation:** Only apply the `MxDataType.Time` FILETIME projection for 64-bit source types; for `uint` fall through to integer or raw.
**Resolution:** _(open)_
### Worker-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeClient.cs:169-171` |
| Status | Open |
**Description:** `retryAttempts` is computed as `(connectTimeout / min(connectTimeout, attemptTimeout)) - 1`. With defaults (30000 / 2000) this yields 14 retries, but each retry also incurs Polly exponential backoff. The overall `connectDeadline` (`CancelAfter(connectTimeout)`) is the real bound, so the computed attempt count can be larger or smaller than the time budget allows, and the formula is opaque.
**Recommendation:** Drive retries purely off the `connectDeadline` token (Polly stops when cancelled) and drop the fragile attempt-count arithmetic, or add a comment explaining the intent.
**Resolution:** _(open)_
### Worker-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessAlarmEventSink.cs:44-55`, `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:38-43`, `src/MxGateway.Worker/MxAccess/MxAccessEventMapper.cs:106-112` |
| Status | Open |
**Description:** Multiple comments describe the alarm path as not-yet-wired future work ("PR A.2 — COM-side subscription scaffold … the worker advertises no alarm subscription", "the worker bootstrap will gain a thin 'run-on-STA' wrapper as part of A.3"). As of commit 6c64030 the alarm command handler, STA poll loop, and `SubscribeAlarms`/`AcknowledgeAlarm`/`QueryActiveAlarms` are all wired. These comments are stale and misleading.
**Recommendation:** Update the XML docs/comments to describe the shipped behavior; remove the "future PR" framing.
**Resolution:** _(open)_
### Worker-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/MxGateway.Worker/Sta/StaMessagePump.cs` |
| Status | Open |
**Description:** `StaMessagePump` — the heart of COM event delivery (`MsgWaitForMultipleObjectsEx` + `PeekMessage`/`DispatchMessage`) — has no direct unit tests. `StaRuntimeTests` exercises it indirectly for command wake-up but never verifies that a posted Windows message actually wakes the wait and is dispatched, nor that `PumpPendingMessages` returns a correct count. The alarm poll-loop lifecycle in `MxAccessStaSession` (start/cancel/await on shutdown) also has no test. These are the most failure-sensitive paths in the module.
**Recommendation:** Add tests that post a message to the STA thread and assert it is pumped, and tests covering alarm poll-loop start/stop and shutdown ordering.
**Resolution:** _(open)_
### Worker-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:33`, `:202` |
| Status | Open |
**Description:** The file declares two public types — the `AlarmCommandHandler` class and the `IAlarmCommandHandler` interface. The C# style guide and the rest of the module follow one-public-type-per-file (e.g. interfaces in their own `I*.cs` files like `IMxAccessAlarmConsumer.cs`).
**Recommendation:** Move `IAlarmCommandHandler` to its own `IAlarmCommandHandler.cs` for consistency.
**Resolution:** _(open)_
### Worker-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessEventQueue.cs:115-145` |
| Status | Open |
**Description:** On overflow, `Enqueue` records the overflow fault and throws `MxAccessEventQueueOverflowException`; `MxAccessBaseEventSink.EnqueueEvent` catches it and calls `RecordFault` again. `RecordFault` is a no-op when a fault already exists, so the second call is harmless — but the intent is muddled, and there is no test asserting the dropped-event behavior. This is acceptable per the fail-fast design but undocumented at the call site.
**Recommendation:** Add a brief comment in `EnqueueEvent` clarifying that an overflow exception is expected and already self-records its fault, so the catch is intentionally a near no-op.
**Resolution:** _(open)_