Set up the code review process scaffolding adapted to mxaccessgw and
record a full per-module review of every src/MxGateway.* project at
commit 6c64030.
- code-reviews/_template/findings.md: per-module findings template
- code-reviews/regen-readme.py: generates README.md from findings.md
files; --check fails if stale
- code-reviews/<Module>/findings.md: reviews for Contracts, Server,
Worker, Tests, Worker.Tests, IntegrationTests (74 findings:
1 Critical, 10 High, 23 Medium, 40 Low; all Open)
- code-reviews/README.md: generated cross-module index
- REVIEW-PROCESS.md: review process document
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 KiB
Code Review — Worker
| Field | Value |
|---|---|
| Module | src/MxGateway.Worker |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | 6c64030 |
| Status | Reviewed |
| Open findings | 15 |
Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: heartbeat loop sleeps before first beat (Worker-002), ProcessCommandAsync state race drops replies (Worker-003), watchdog/heartbeat state inconsistency (Worker-004), double-dispose path (Worker-006), plus Worker-010/011/015. |
| 2 | mxaccessgw conventions | Issue found: Worker-007 (reflection-based COM invocation bypasses the typed interface contract). |
| 3 | Concurrency & thread safety | Issues found: Worker-001 (WnWrapAlarmConsumer timer fires COM off the STA), Worker-008 (consumer factory STA-affinity not enforced). |
| 4 | Error handling & resilience | Issue found: Worker-005 (OnPoll silently swallows all poll failures). |
| 5 | Security | No secret logging (redaction applied); inbound frame validation reasonable. No issues found. |
| 6 | Performance & resource management | Issue found: Worker-009 (per-frame byte[] allocations on the hot event path). COM release is correct. |
| 7 | Design-document adherence | Code matches WorkerSta.md/WorkerFrameProtocol.md; stale alarm-path docs (Worker-012). |
| 8 | Code organization & conventions | Issue found: Worker-014 (AlarmCommandHandler.cs declares two public types in one file). |
| 9 | Testing coverage | Issue found: Worker-013 (StaMessagePump has no direct tests; poll-loop lifecycle untested). |
| 10 | Documentation & comments | Issue found: Worker-012 (stale "future PR / A.3" comments now describe shipped code). |
Findings
Worker-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:204-207 |
| Status | Open |
Description: When constructed with pollIntervalMilliseconds > 0, Subscribe starts a System.Threading.Timer whose OnPoll callback runs PollOnce() — which calls wwAlarmConsumerClass.GetXmlCurrentAlarms2 — on a thread-pool thread. The wnwrap CLSID is registered ThreadingModel=Apartment; calling its methods off the owning STA violates the hard rule that all COM calls happen on the dedicated STA thread, and can deadlock on cross-apartment marshaling when the STA is not pumping. The production path (default constructor, interval 0) is safe, but the public 3-arg constructor leaves this footgun callable, and tests/live-smoke use it.
Recommendation: Remove the internal Timer entirely (production already drives PollOnce from the STA), or document and gate it so it can only be used from an STA thread. At minimum, make the timer-driven mode unreachable from any production wiring.
Resolution: (open)
Worker-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:545-549 |
| Status | Open |
Description: RunHeartbeatLoopAsync calls await Task.Delay(_sessionOptions.HeartbeatInterval, ...) before sending the first heartbeat. The gateway therefore receives no heartbeat for the first full interval (default 5s) after the worker reaches Ready. If the gateway's liveness watchdog expects a heartbeat sooner, a healthy worker can be misclassified as hung at startup.
Recommendation: Send an initial heartbeat immediately on entering the loop, or move the Task.Delay to the end of the loop body.
Resolution: (open)
Worker-003
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:399-403, :416-419 |
| Status | Open |
Description: ProcessCommandAsync checks _state after DispatchAsync completes and silently returns without writing a WorkerCommandReply (or fault) when _state is not Ready/ExecutingCommand. _state is a plain field mutated from multiple tasks (heartbeat loop, event-drain loop, shutdown). A command that completes successfully while _state has transitioned will have its reply dropped with no diagnostic, and the gateway's correlation-id wait then hangs until its own timeout. The _state read is also not synchronized.
Recommendation: Always attempt to write the reply/fault for an in-flight command, or explicitly reject in-flight commands with a Canceled/WorkerUnavailable reply during state transitions. Make _state access thread-safe (volatile or locked).
Resolution: (open)
Worker-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:565-588 |
| Status | Open |
Description: After ReportWatchdogFaultIfNeededAsync sends an StaHung fault, the heartbeat loop continues sending normal heartbeats with State derived from _state, which the watchdog path never sets to Faulted. The heartbeat then keeps reporting a non-faulted state that contradicts the fault just sent.
Recommendation: Set _state = WorkerState.Faulted (thread-safely) when the watchdog fault fires so heartbeat state and fault stay consistent.
Resolution: (open)
Worker-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:297-313 |
| Status | Open |
Description: OnPoll catches every exception from PollOnce() and discards it (_ = ex;). The production poll path (MxAccessStaSession.RunAlarmPollLoopAsync → AlarmCommandHandler.PollOnce → AlarmDispatcher.PollOnce → consumer.PollOnce()) has no fault recording either. A permanently failing alarm provider (e.g. GetXmlCurrentAlarms2 returning E_FAIL, malformed XML throwing in XmlDocument.LoadXml) is therefore completely silent — no fault on the event queue, no log.
Recommendation: Route poll failures to MxAccessEventQueue.RecordFault (or a logger) so a broken alarm subscription becomes observable. Update the now-stale comment.
Resolution: (open)
Worker-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:117-124, src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:386-491 |
| Status | Open |
Description: RunAsync's finally calls _runtimeSession?.Dispose() unless _shutdownTimedOut. On the normal path ShutdownGracefullyAsync already disposed the STA runtime, so re-entering Dispose() is a harmless no-op only because ShutdownGracefullyAsync reached its end and set disposed = true. If ShutdownGracefullyAsync throws TimeoutException after partial teardown with _shutdownTimedOut set, the session is never disposed at all — the finally skips it — leaking the STA thread and COM object, leaving cleanup to rely solely on process exit.
Recommendation: Make the dispose decision explicit and confirm process exit always follows a timed-out shutdown; otherwise dispose defensively. At minimum document why disposal is deliberately skipped on timeout.
Resolution: (open)
Worker-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | mxaccessgw conventions |
| Location | src/MxGateway.Worker/MxAccess/MxAccessComServer.cs:130-150 |
| Status | Open |
Description: Invoke uses late-bound Type.InvokeMember reflection as a fallback when the COM object does not cast to ILMXProxyServer*. In production the object is always LMXProxyServerClass, so the reflection path exists only for test doubles — it is dead/untested code on the production path and obscures the interface contract. params object[] arguments also boxes value-type handles on every call.
Recommendation: Drop the reflection fallback and require the COM object to implement the interface (tests can supply a typed fake), or clearly mark the fallback as test-only.
Resolution: (open)
Worker-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-249, :429-447 |
| Status | Open |
Description: RunAlarmPollLoopAsync correctly marshals handler.PollOnce() onto the STA via staRuntime.InvokeAsync, and the cancel/await/dispose ordering in ShutdownGracefullyAsync is sound. However, nothing enforces that the consumerFactory and all IMxAccessAlarmConsumer calls run on the STA thread; a future caller could break STA affinity silently.
Recommendation: Add an assertion or documented invariant that the consumer factory and all IMxAccessAlarmConsumer calls run on the STA thread, mirroring the existing MxAccessSession.CreationThreadId pattern.
Resolution: (open)
Worker-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | src/MxGateway.Worker/Ipc/WorkerFrameReader.cs:31,49, src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:57-58 |
| Status | Open |
Description: Every frame read allocates a fresh 4-byte length buffer and a payload byte[]; every write allocates ToByteArray() plus a 4-byte prefix. On the hot event-drain path (batches of up to 128 WorkerEvent frames every 25 ms) this produces steady gen-0 garbage. WorkerFrameWriter also effectively serializes twice (CalculateSize() then ToByteArray()).
Recommendation: Reuse a pooled buffer / ArrayPool<byte> for the length prefix and payload, and write directly into a pooled buffer using CodedOutputStream. Low priority unless event throughput is high.
Resolution: (open)
Worker-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Conversion/VariantConverter.cs:204-226 |
| Status | Open |
Description: ConvertInt64Scalar is reached for TypeCode.UInt32 and TypeCode.Int64. For a uint with expectedDataType == MxDataType.Time, the value is treated as a Windows FILETIME via DateTime.FromFileTimeUtc(longValue); a 32-bit FILETIME is never a valid full FILETIME, so this silently produces a near-epoch timestamp rather than a raw/diagnostic value. Unlikely in practice but a silent misconversion.
Recommendation: Only apply the MxDataType.Time FILETIME projection for 64-bit source types; for uint fall through to integer or raw.
Resolution: (open)
Worker-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeClient.cs:169-171 |
| Status | Open |
Description: retryAttempts is computed as (connectTimeout / min(connectTimeout, attemptTimeout)) - 1. With defaults (30000 / 2000) this yields 14 retries, but each retry also incurs Polly exponential backoff. The overall connectDeadline (CancelAfter(connectTimeout)) is the real bound, so the computed attempt count can be larger or smaller than the time budget allows, and the formula is opaque.
Recommendation: Drive retries purely off the connectDeadline token (Polly stops when cancelled) and drop the fragile attempt-count arithmetic, or add a comment explaining the intent.
Resolution: (open)
Worker-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | src/MxGateway.Worker/MxAccess/MxAccessAlarmEventSink.cs:44-55, src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:38-43, src/MxGateway.Worker/MxAccess/MxAccessEventMapper.cs:106-112 |
| Status | Open |
Description: Multiple comments describe the alarm path as not-yet-wired future work ("PR A.2 — COM-side subscription scaffold … the worker advertises no alarm subscription", "the worker bootstrap will gain a thin 'run-on-STA' wrapper as part of A.3"). As of commit 6c64030 the alarm command handler, STA poll loop, and SubscribeAlarms/AcknowledgeAlarm/QueryActiveAlarms are all wired. These comments are stale and misleading.
Recommendation: Update the XML docs/comments to describe the shipped behavior; remove the "future PR" framing.
Resolution: (open)
Worker-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | src/MxGateway.Worker/Sta/StaMessagePump.cs |
| Status | Open |
Description: StaMessagePump — the heart of COM event delivery (MsgWaitForMultipleObjectsEx + PeekMessage/DispatchMessage) — has no direct unit tests. StaRuntimeTests exercises it indirectly for command wake-up but never verifies that a posted Windows message actually wakes the wait and is dispatched, nor that PumpPendingMessages returns a correct count. The alarm poll-loop lifecycle in MxAccessStaSession (start/cancel/await on shutdown) also has no test. These are the most failure-sensitive paths in the module.
Recommendation: Add tests that post a message to the STA thread and assert it is pumped, and tests covering alarm poll-loop start/stop and shutdown ordering.
Resolution: (open)
Worker-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:33, :202 |
| Status | Open |
Description: The file declares two public types — the AlarmCommandHandler class and the IAlarmCommandHandler interface. The C# style guide and the rest of the module follow one-public-type-per-file (e.g. interfaces in their own I*.cs files like IMxAccessAlarmConsumer.cs).
Recommendation: Move IAlarmCommandHandler to its own IAlarmCommandHandler.cs for consistency.
Resolution: (open)
Worker-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/MxAccess/MxAccessEventQueue.cs:115-145 |
| Status | Open |
Description: On overflow, Enqueue records the overflow fault and throws MxAccessEventQueueOverflowException; MxAccessBaseEventSink.EnqueueEvent catches it and calls RecordFault again. RecordFault is a no-op when a fault already exists, so the second call is harmless — but the intent is muddled, and there is no test asserting the dropped-event behavior. This is acceptable per the fail-fast design but undocumented at the call site.
Recommendation: Add a brief comment in EnqueueEvent clarifying that an overflow exception is expected and already self-records its fault, so the catch is intentionally a near no-op.
Resolution: (open)