No source changes since d692232 — header bumped to track the latest
reviewed commit, all prior findings remain closed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
62 KiB
Code Review — Worker
| Field | Value |
|---|---|
| Module | src/ZB.MOM.WW.MxGateway.Worker |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | 42b0037 |
| Status | Re-reviewed |
| Open findings | 0 |
2026-05-24 re-review (commit 42b0037)
Re-review: no new findings. git diff --name-only d692232..42b0037 -- src/ZB.MOM.WW.MxGateway.Worker returns empty — the Worker module has zero source changes since the previous review. All ten checklist categories therefore inherit "No issues found" from the d692232 pass. The header is bumped to track the latest reviewed commit; Worker-001..025 remain closed.
Checklist coverage
This row reflects the 2026-05-20 re-review at commit a020350. Worker-001..022 are all closed; the row only summarises new findings filed against this commit. The prior pass's fixes for Worker-016..022 were verified sound:
- Worker-016:
StaRuntimeShutdownExceptionexists,MxAccessStaSession.cs:261is the onlycatch (StaRuntimeShutdownException)site in the module. No accidental catch elsewhere (grep verified). The graceful-shutdown vs. STA-affinity-violation distinction holds. - Worker-017:
ReportWatchdogFaultIfNeededAsyncreturns early whenCurrentCommandCorrelationIdis non-empty. Sound for the slow-but-progressing case; but see Worker-023 — there is no defensive ceiling, so a truly stuck command (synchronous COM call hung against a dead MXAccess provider) leavesCurrentCommandCorrelationIdnon-empty forever and the worker-side watchdog is permanently suppressed. - Worker-018:
SetXmlAlarmQueryis now wrapped intry/catch (COMException)and re-thrown asInvalidOperationExceptioncarrying the HRESULT. Sound. - Worker-019:
subscriptionExpressionfield is gone. - Worker-020:
_state is not WorkerState.Ready and not WorkerState.ExecutingCommandsimplified to_state != WorkerState.Ready. Confirmed_stateis never assignedExecutingCommand; volatile reads are atomic. - Worker-021:
_runtimeSession ??=inInitializeMxAccessAsyncpreserves a factory-supplied session. ConfirmedRunAsyncpath bypassesInitializeMxAccessAsyncentirely (it passes its own factory-driven lambda), so the??=only runs on the legacy parameterless-CompleteStartupHandshakeAsyncdirect-invocation path. - Worker-022:
MxAlarmSnapshot.cs(now containing onlyMxAlarmSnapshotRecord),MxAlarmStateKind.cs,MxAlarmTransitionEvent.cs— filenames match their single public type; all three keep theMxGateway.Worker.MxAccessnamespace.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issue found: Worker-025 (RunAsync does not null-check the result of _runtimeSessionFactory(); a null factory return would NRE on _runtimeSession.StartAsync(...) rather than throw a diagnostic exception). |
| 2 | mxaccessgw conventions | No issues found. The split alarm-snapshot files match the one-public-type-per-file convention; namespace consistency verified. |
| 3 | Concurrency & thread safety | Issue found: Worker-024 (the alarm command path — Subscribe/Acknowledge/AcknowledgeByName/QueryActive/Unsubscribe — has no STA-affinity assertion equivalent to Worker-008's EnsureOnAlarmConsumerThread guard; only the alarm poll path enforces affinity, leaving a latent gap if a future refactor lets alarm commands run off-STA). |
| 4 | Error handling & resilience | Issue found: Worker-023 (Worker-017's watchdog skip has no defensive ceiling; a truly stuck command — synchronous COM hung against a dead MXAccess provider — keeps CurrentCommandCorrelationId non-empty indefinitely, and the worker-side StaHung watchdog never fires. Gateway-side CommandTimeout is the only safety net). |
| 5 | Security | No issues found. No secret logging on the alarm path; the dropped-reply diagnostic Worker-003 added logs only the correlation id and command method, not the command payload. |
| 6 | Performance & resource management | No new issues found. Frame I/O still uses pooled buffers (Worker-009); STA join timeouts in Dispose are bounded. |
| 7 | Design-document adherence | No new design drift. The split alarm files preserve the documented public API surface. Worker-017's resolution comment documents the watchdog design intent — though see Worker-023 for the documentation gap on truly-stuck commands. |
| 8 | Code organization & conventions | No issues found. Worker-022 was the last file-organization issue. |
| 9 | Testing coverage | Worker-016 and Worker-017 each have direct regression tests (RunAlarmPollLoop_WhenPollOnceThrowsInvalidOperation_RecordsFaultOnEventQueue, RunAsync_WhenStaActivityIsStaleWithCommandInFlight_DoesNotWriteWatchdogFault). Worker-018, -020, -021's resolution notes state "no new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent" — Worker-018's SetXmlAlarmQuery failure-translation and Worker-020's simplified _state != Ready check have no regression test in this branch yet. No standalone finding — these are documented gaps in the resolution notes of the prior pass. |
| 10 | Documentation & comments | No new issues. Worker-017's XML doc on ReportWatchdogFaultIfNeededAsync documents the design intent clearly; the _runtimeSession ??= reasoning is documented inline; Worker-016's graceful-vs-affinity distinction is documented at both catch sites. |
2026-05-24 review (commit d692232)
Re-review pass at d692232. The only diff against the Worker source since
a020350 is the ZB.MOM.WW namespace/csproj/folder rename (commit dc9c0c9)
— no behavioural changes. The four external runtime identifiers documented
as intentionally unprefixed (including Name = "MxGateway.Worker.STA" on
StaRuntime) are confirmed still in their original form. The _writeLock
contention with the gateway-side watchdog (Server-031) is unchanged.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. |
| 2 | mxaccessgw conventions | No issues found — rename clean; STA thread name and MXAccess COM target (LMXProxyServerClass via ArchestrA.MXAccess.dll) unchanged. |
| 3 | Concurrency & thread safety | No issues found in this diff. |
| 4 | Error handling & resilience | No issues found in this diff. |
| 5 | Security | No issues found in this diff. |
| 6 | Performance & resource management | No issues found in this diff. |
| 7 | Design-document adherence | No issues found in this diff. |
| 8 | Code organization & conventions | No issues found in this diff. |
| 9 | Testing coverage | No issues found in this diff. |
| 10 | Documentation & comments | No issues found in this diff. |
Findings
Worker-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:204-207 |
| Status | Resolved |
Description: When constructed with pollIntervalMilliseconds > 0, Subscribe starts a System.Threading.Timer whose OnPoll callback runs PollOnce() — which calls wwAlarmConsumerClass.GetXmlCurrentAlarms2 — on a thread-pool thread. The wnwrap CLSID is registered ThreadingModel=Apartment; calling its methods off the owning STA violates the hard rule that all COM calls happen on the dedicated STA thread, and can deadlock on cross-apartment marshaling when the STA is not pumping. The production path (default constructor, interval 0) is safe, but the public 3-arg constructor leaves this footgun callable, and tests/live-smoke use it.
Recommendation: Remove the internal Timer entirely (production already drives PollOnce from the STA), or document and gate it so it can only be used from an STA thread. At minimum, make the timer-driven mode unreachable from any production wiring.
Resolution: 2026-05-18 — Removed the off-STA timer infrastructure from WnWrapAlarmConsumer: the Timer? pollTimer and pollIntervalMs fields, the DefaultPollIntervalMilliseconds constant, the OnPoll callback, the timer-arming arm in Subscribe, and the timer disposal block in Dispose. The pollIntervalMilliseconds parameter is gone from both public constructors (the test-seam ctor is now 2-arg: wwAlarmConsumerClass + maxAlarmsPerFetch), so the off-STA footgun is structurally unreachable. PollOnce() remains the public STA-driven entry point. The stale "poll … on a timer below" comment was corrected. Verified by the regression tests WnWrapAlarmConsumer_has_no_internal_timer_field and WnWrapAlarmConsumer_exposes_no_poll_interval_constructor_parameter; the AlarmsLiveSmokeTests call site was updated to the 2-arg constructor.
Worker-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:545-549 |
| Status | Resolved |
Description: RunHeartbeatLoopAsync calls await Task.Delay(_sessionOptions.HeartbeatInterval, ...) before sending the first heartbeat. The gateway therefore receives no heartbeat for the first full interval (default 5s) after the worker reaches Ready. If the gateway's liveness watchdog expects a heartbeat sooner, a healthy worker can be misclassified as hung at startup.
Recommendation: Send an initial heartbeat immediately on entering the loop, or move the Task.Delay to the end of the loop body.
Resolution: 2026-05-18 — Restructured RunHeartbeatLoopAsync so the Task.Delay(HeartbeatInterval) is applied between beats only, not before the first. A firstBeat guard skips the delay on the initial iteration, so the gateway sees a heartbeat as soon as the worker is Ready; cancellation behavior is preserved (the loop still observes the token and the delay still throws on cancellation). Verified by the regression test RunAsync_SendsFirstHeartbeatImmediatelyOnEnteringLoop. Three pre-existing tests (WorkerPipeClientTests.RunAsync_ConnectsToPipeAndCompletesHandshake, WorkerPipeClientTests.RunAsync_RetriesUntilPipeServerAppears, WorkerPipeSessionTests.RunAsync_WhenCommandThrowsAfterShutdown_DropsLateFaultAndWritesShutdownAck) assumed strict frame ordering and were updated to skip the now-interleaved first heartbeat while still asserting the same shutdown-ack behavior.
Worker-003
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:399-403, :416-419 |
| Status | Resolved |
Description: ProcessCommandAsync checks _state after DispatchAsync completes and silently returns without writing a WorkerCommandReply (or fault) when _state is not Ready/ExecutingCommand. _state is a plain field mutated from multiple tasks (heartbeat loop, event-drain loop, shutdown). A command that completes successfully while _state has transitioned will have its reply dropped with no diagnostic, and the gateway's correlation-id wait then hangs until its own timeout. The _state read is also not synchronized.
Recommendation: Always attempt to write the reply/fault for an in-flight command, or explicitly reject in-flight commands with a Canceled/WorkerUnavailable reply during state transitions. Make _state access thread-safe (volatile or locked).
Resolution: 2026-05-18 — Both silent-drop return sites in ProcessCommandAsync (the post-DispatchAsync success path and the exception path) now call a new LogCommandResultDropped helper before returning. The helper logs an Information event named WorkerCommandResultDropped via the session's IWorkerLogger, carrying the command's correlation_id plus command_method and worker_state, so a stuck gateway correlation-id wait is now traceable. The _state field was made volatile (WorkerState is an int-backed protobuf enum, so volatile is valid) so cross-thread reads observe the latest value without tearing; this is a low-risk, non-behavioral change and did not destabilize any test. Verified by the regression test RunAsync_WhenReplyIsDroppedAfterShutdown_LogsDiagnostic.
Worker-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:565-588 |
| Status | Resolved |
Description: After ReportWatchdogFaultIfNeededAsync sends an StaHung fault, the heartbeat loop continues sending normal heartbeats with State derived from _state, which the watchdog path never sets to Faulted. The heartbeat then keeps reporting a non-faulted state that contradicts the fault just sent.
Recommendation: Set _state = WorkerState.Faulted (thread-safely) when the watchdog fault fires so heartbeat state and fault stay consistent.
Resolution: 2026-05-18 — ReportWatchdogFaultIfNeededAsync now sets _state = WorkerState.Faulted immediately after _watchdogFaultSent = true and before the StaHung fault is written, so the next heartbeat reports Faulted instead of contradicting the fault. _state is already volatile (Worker-003), so the cross-thread write from the heartbeat loop is observed correctly by the heartbeat's own CreateHeartbeat read; no further locking is required. Verified by the regression test WorkerPipeSessionTests.RunAsync_AfterWatchdogFault_HeartbeatReportsFaultedState, which uses a stale-activity snapshot with an empty current-command correlation id so the heartbeat State is derived from _state rather than forced to ExecutingCommand.
Worker-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-258 (production alarm poll loop) |
| Status | Resolved |
Description: OnPoll catches every exception from PollOnce() and discards it (_ = ex;). The production poll path (MxAccessStaSession.RunAlarmPollLoopAsync → AlarmCommandHandler.PollOnce → AlarmDispatcher.PollOnce → consumer.PollOnce()) has no fault recording either. A permanently failing alarm provider (e.g. GetXmlCurrentAlarms2 returning E_FAIL, malformed XML throwing in XmlDocument.LoadXml) is therefore completely silent — no fault on the event queue, no log.
Recommendation: Route poll failures to MxAccessEventQueue.RecordFault (or a logger) so a broken alarm subscription becomes observable. Update the now-stale comment.
Re-triage: The cited location WnWrapAlarmConsumer.cs:297-313 and the OnPoll callback no longer exist as of this branch — Worker-001 removed the off-STA Timer and its OnPoll callback entirely. The substantive concern still held, however: the production poll path in MxAccessStaSession.RunAlarmPollLoopAsync caught only OperationCanceledException, ObjectDisposedException, and InvalidOperationException. A genuine poll failure (COMException from GetXmlCurrentAlarms2, a malformed-XML XmlException) escaped uncaught, faulted the never-awaited Task.Run poll task, and was silently lost — exactly the silent-failure the finding describes. The finding was re-pointed at the live location and fixed there rather than at the removed OnPoll.
Resolution: 2026-05-18 — RunAlarmPollLoopAsync gained a trailing catch (Exception exception) arm after the three graceful-stop catches. A real alarm-poll failure is now converted to a WorkerFault (category MxaccessEventConversionFailed, carrying the exception type and, for a COMException, its HResult) by the new CreateAlarmPollFault helper and recorded on the session's MxAccessEventQueue via RecordFault. The worker's event-drain loop drains that fault and forwards it to the gateway, so a broken alarm subscription is now observable on the IPC fault path instead of vanishing. The poll loop still stops after the failure (the subscription is dead). No new proto enum value was added — MxaccessEventConversionFailed is the closest existing alarm-path category, avoiding a contracts regeneration across all clients. Verified by the regression test MxAccessStaSessionTests.RunAlarmPollLoop_WhenPollOnceThrows_RecordsFaultOnEventQueue.
Worker-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:117-124, src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:386-491 |
| Status | Resolved |
Description: RunAsync's finally calls _runtimeSession?.Dispose() unless _shutdownTimedOut. On the normal path ShutdownGracefullyAsync already disposed the STA runtime, so re-entering Dispose() is a harmless no-op only because ShutdownGracefullyAsync reached its end and set disposed = true. If ShutdownGracefullyAsync throws TimeoutException after partial teardown with _shutdownTimedOut set, the session is never disposed at all — the finally skips it — leaking the STA thread and COM object, leaving cleanup to rely solely on process exit.
Recommendation: Make the dispose decision explicit and confirm process exit always follows a timed-out shutdown; otherwise dispose defensively. At minimum document why disposal is deliberately skipped on timeout.
Resolution: 2026-05-18 — RunAsync's finally now always calls _runtimeSession?.Dispose(); the if (!_shutdownTimedOut) guard and the _shutdownTimedOut field (which had become write-only) were removed. MxAccessStaSession.Dispose is idempotent (if (disposed) return) and bounded — each STA join is capped with Wait(TimeSpan.FromSeconds(2)) — so re-entering it on the normal path (where ShutdownGracefullyAsync already disposed the runtime) is a harmless no-op, while on the timed-out path it is now the only thing that reclaims the STA thread and releases the MXAccess COM object. The previous behaviour leaked both on a shutdown timeout and relied solely on process exit. A code comment in the finally block documents the reasoning. Verified by the regression test WorkerPipeSessionTests.RunAsync_WhenShutdownTimesOut_StillDisposesRuntimeSession, which forces a TimeoutException from ShutdownGracefullyAsync and asserts the runtime session is disposed before RunAsync rethrows.
Worker-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | mxaccessgw conventions |
| Location | src/MxGateway.Worker/MxAccess/MxAccessComServer.cs:130-150 |
| Status | Resolved |
Description: Invoke uses late-bound Type.InvokeMember reflection as a fallback when the COM object does not cast to ILMXProxyServer*. In production the object is always LMXProxyServerClass, so the reflection path exists only for test doubles — it is dead/untested code on the production path and obscures the interface contract. params object[] arguments also boxes value-type handles on every call.
Recommendation: Drop the reflection fallback and require the COM object to implement the interface (tests can supply a typed fake), or clearly mark the fallback as test-only.
Re-triage: The finding's claim that the reflection path is "dead/untested code" is partly inaccurate — it was in fact the path exercised by the entire MxAccessCommandExecutorTests suite, whose FakeMxAccessComObject did not implement any typed interface. So the reflection fallback was test-only but not untested. The convention concern (bypassing the typed interface contract, boxing value-type handles) is valid, so the fix follows the recommendation's first option.
Resolution: 2026-05-18 — The late-bound Type.InvokeMember reflection fallback and its params object[]-boxing Invoke helper were removed from MxAccessComServer. Each adapter method now takes one of two typed paths: an is IMxAccessServer fast path (test fakes implement IMxAccessServer directly) and the production path that casts to the typed ILMXProxyServer / ILMXProxyServer3 / ILMXProxyServer4 COM interfaces via new AsProxyServer* helpers. A COM object implementing neither now fails fast with a clear InvalidOperationException naming the missing interface, instead of an opaque late-bound call. The test seam was migrated accordingly: MxAccessCommandExecutorTests.FakeMxAccessComObject now declares : IMxAccessServer (its method signatures already matched the interface exactly, so no behavioural change). Verified by the new MxAccessComServerTests (typed-server routing, untyped-object rejection, original-exception propagation — no more TargetInvocationException wrapping) plus the unchanged, still-passing MxAccessCommandExecutorTests suite which now exercises the typed IMxAccessServer path.
Worker-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-249, :429-447 |
| Status | Resolved |
Description: RunAlarmPollLoopAsync correctly marshals handler.PollOnce() onto the STA via staRuntime.InvokeAsync, and the cancel/await/dispose ordering in ShutdownGracefullyAsync is sound. However, nothing enforces that the consumerFactory and all IMxAccessAlarmConsumer calls run on the STA thread; a future caller could break STA affinity silently.
Recommendation: Add an assertion or documented invariant that the consumer factory and all IMxAccessAlarmConsumer calls run on the STA thread, mirroring the existing MxAccessSession.CreationThreadId pattern.
Resolution: 2026-05-18 — MxAccessStaSession now records the STA thread id (alarmConsumerThreadId) at the point the alarm-command-handler factory is invoked — which already runs inside staRuntime.InvokeAsync during StartAsync, mirroring the MxAccessSession.CreationThreadId capture. RunAlarmPollLoopAsync's marshalled poll lambda now calls EnsureOnAlarmConsumerThread() before handler.PollOnce(), asserting the poll runs on the recorded STA thread. The check is delegated to a new internal static guard AssertOnAlarmConsumerThread(int? expected, int actual) that throws a descriptive InvalidOperationException on an affinity violation and is a no-op when the consumer thread is unrecorded (no alarm handler configured). Making the guard static and internal keeps it directly unit-testable. The STA-affinity invariant is documented in the guard's XML doc. Verified by the regression tests MxAccessStaSessionTests.AssertOnAlarmConsumerThread_WhenOffOwningThread_Throws and AssertOnAlarmConsumerThread_OnOwningThreadOrUnset_DoesNotThrow.
Worker-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | src/MxGateway.Worker/Ipc/WorkerFrameReader.cs:31,49, src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:57-58 |
| Status | Resolved |
Description: Every frame read allocates a fresh 4-byte length buffer and a payload byte[]; every write allocates ToByteArray() plus a 4-byte prefix. On the hot event-drain path (batches of up to 128 WorkerEvent frames every 25 ms) this produces steady gen-0 garbage. WorkerFrameWriter also effectively serializes twice (CalculateSize() then ToByteArray()).
Recommendation: Reuse a pooled buffer / ArrayPool<byte> for the length prefix and payload, and write directly into a pooled buffer using CodedOutputStream. Low priority unless event throughput is high.
Resolution: 2026-05-18 — WorkerFrameWriter.WriteAsync now serializes the envelope exactly once into a single frame buffer that carries the 4-byte length prefix followed by the payload, via envelope.WriteTo(new Span<byte>(frame, sizeof(uint), payloadLength)). This eliminates the redundant second serialization pass (ToByteArray() re-runs CalculateSize() internally), the separate length-prefix array, and the separate prefix WriteAsync/extra FlushAsync round. WorkerFrameReader.ReadAsync now rents its payload buffer from ArrayPool<byte>.Shared and returns it in a finally once WorkerEnvelope.Parser.ParseFrom(payload, 0, length) has copied what it needs; ReadExactlyOrThrowAsync gained an explicit count parameter so it honours the logical frame length rather than the (possibly larger) rented buffer length. The 4-byte length-prefix buffer is left as a per-call stack-sized allocation — pooling a 4-byte array is not worthwhile. Verified by the new regression test WorkerFrameProtocolTests.ReadAsync_WithVaryingFrameSizes_ParsesEachFrameExactly, which reads a large frame followed by a small frame through one reader to prove the pooled buffer is sliced to each frame's own length and never leaks stale trailing bytes; the existing round-trip, malformed-payload, and concurrent-write tests continue to pass.
Worker-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Conversion/VariantConverter.cs:204-226 |
| Status | Resolved |
Description: ConvertInt64Scalar is reached for TypeCode.UInt32 and TypeCode.Int64. For a uint with expectedDataType == MxDataType.Time, the value is treated as a Windows FILETIME via DateTime.FromFileTimeUtc(longValue); a 32-bit FILETIME is never a valid full FILETIME, so this silently produces a near-epoch timestamp rather than a raw/diagnostic value. Unlikely in practice but a silent misconversion.
Recommendation: Only apply the MxDataType.Time FILETIME projection for 64-bit source types; for uint fall through to integer or raw.
Resolution: 2026-05-18 — ConvertInt64Scalar's MxDataType.Time FILETIME projection is now gated on value is long. A genuine 64-bit long still projects to a Timestamp via DateTime.FromFileTimeUtc; a 32-bit uint — which can only hold the low half of a FILETIME — now falls through to the integer projection (DataType = Integer, Int64Value) instead of silently producing a bogus near-1601 timestamp. Verified by the regression test VariantConverterTests.Convert_WithUInt32AndExpectedTime_DoesNotProjectFileTime; the existing Convert_WithFileTimeAndExpectedTime_ProjectsTimestamp (a long FILETIME) continues to pass, confirming the 64-bit path is unchanged.
Worker-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeClient.cs:169-171 |
| Status | Resolved |
Description: retryAttempts is computed as (connectTimeout / min(connectTimeout, attemptTimeout)) - 1. With defaults (30000 / 2000) this yields 14 retries, but each retry also incurs Polly exponential backoff. The overall connectDeadline (CancelAfter(connectTimeout)) is the real bound, so the computed attempt count can be larger or smaller than the time budget allows, and the formula is opaque.
Recommendation: Drive retries purely off the connectDeadline token (Polly stops when cancelled) and drop the fragile attempt-count arithmetic, or add a comment explaining the intent.
Resolution: 2026-05-18 — The opaque retryAttempts arithmetic in ConnectWithRetryAsync was removed. MaxRetryAttempts is now int.MaxValue, so the retry loop is bounded solely by the connectDeadline linked token (CancelAfter(_connectTimeoutMilliseconds)): Polly stops retrying the moment that token is cancelled, making the overall connect timeout the single source of truth and correctly accounting for the exponential backoff between attempts (which the old formula ignored). A comment documents the intent. No new test was added — the change does not alter observable behavior (the deadline was always the real bound; the old formula always permitted more attempts than fit the budget), and the existing WorkerPipeClientTests.RunAsync_RetriesUntilPipeServerAppears (server appears mid-retry) and RunAsync_WhenPipeNeverAppears_ThrowsTimeoutException (deadline ends the loop) already cover both retry-until-success and deadline-bounded termination.
Worker-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | src/MxGateway.Worker/MxAccess/MxAccessAlarmEventSink.cs:44-55, src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:38-43, src/MxGateway.Worker/MxAccess/MxAccessEventMapper.cs:106-112 |
| Status | Resolved |
Description: Multiple comments describe the alarm path as not-yet-wired future work ("PR A.2 — COM-side subscription scaffold … the worker advertises no alarm subscription", "the worker bootstrap will gain a thin 'run-on-STA' wrapper as part of A.3"). As of commit 6c64030 the alarm command handler, STA poll loop, and SubscribeAlarms/AcknowledgeAlarm/QueryActiveAlarms are all wired. These comments are stale and misleading.
Recommendation: Update the XML docs/comments to describe the shipped behavior; remove the "future PR" framing.
Re-triage: The WnWrapAlarmConsumer.cs:38-43 citation is inaccurate — those lines were rewritten by Worker-001 and already describe the shipped no-internal-timer threading model correctly; nothing stale there. Conversely, two stale comments the finding did not cite were found on the same alarm path and fixed under the same root cause: AlarmDispatcher.cs's <remarks> still framed the dispatcher as "the in-process slice of A.3" with a "companion follow-up PR" adding the (now-shipped) SubscribeAlarmsCommand/AcknowledgeAlarmCommand/QueryActiveAlarmsCommand, and stated the consumer "polls on a System.Threading.Timer thread today" — a claim made false by Worker-001's removal of that timer; and AlarmCommandHandler.cs's <remarks> likewise asserted "the wnwrap consumer's polling timer fires on a thread-pool thread". The discovery document docs/AlarmClientDiscovery.md (referenced by the source comments) was deliberately left untouched: it is a historical research log of the investigation that chose the shipped design, not API/contract/lifecycle prose, and the source comments cite only its still-accurate "Option A — captured" payload schema.
Resolution: 2026-05-18 — Rewrote the stale alarm-path comments to describe shipped behavior with no "future PR / A.2 / A.3" framing. MxAccessAlarmEventSink: the class <remarks> and the Attach comment now explain that AlarmDispatcher owns the consumer→sink→queue wire-up and that Attach carries only the session id (no COM-event subscription is needed because the polled wnwrap consumer raises transition events itself). MxAccessEventMapper.CreateOnAlarmTransition's XML summary now states the worker drives it from MxAccessAlarmEventSink.EnqueueTransition once AlarmDispatcher decodes a wnwrap transition. AlarmDispatcher and AlarmCommandHandler <remarks> were corrected to describe the shipped command surface and the no-internal-timer / STA-driven polling model (the System.Threading.Timer claims were factually wrong post-Worker-001). Pure documentation change — no behavior altered, no test needed; the build stays green.
Worker-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | src/MxGateway.Worker/Sta/StaMessagePump.cs |
| Status | Resolved |
Description: StaMessagePump — the heart of COM event delivery (MsgWaitForMultipleObjectsEx + PeekMessage/DispatchMessage) — has no direct unit tests. StaRuntimeTests exercises it indirectly for command wake-up but never verifies that a posted Windows message actually wakes the wait and is dispatched, nor that PumpPendingMessages returns a correct count. The alarm poll-loop lifecycle in MxAccessStaSession (start/cancel/await on shutdown) also has no test. These are the most failure-sensitive paths in the module.
Recommendation: Add tests that post a message to the STA thread and assert it is pumped, and tests covering alarm poll-loop start/stop and shutdown ordering.
Re-triage: This finding is stale as of the reviewed branch — the coverage it asks for already exists. src/MxGateway.Worker.Tests/Sta/StaMessagePumpTests.cs contains direct StaMessagePump tests covering null-argument validation, waking on a signalled event, returning on timeout, the zero-timeout conversion branch, PumpPendingMessages returning the correct count for messages posted to the STA thread (PumpPendingMessages_MessagesPostedToStaThread_ReturnsCountProcessed, PumpPendingMessages_NoMessagesPosted_ReturnsZero), and WaitForWorkOrMessages waking on a posted Windows message (WaitForWorkOrMessages_WindowsMessagePosted_ReturnsForInputAvailable) — exactly the "post a message and assert it is pumped" test the recommendation asks for. The alarm poll-loop lifecycle is covered by MxAccessStaSessionTests.StartAsync_WithAlarmCommandHandlerFactory_PollOnceCalledViaSta (start → poll runs on the STA) and Dispose_StopsAlarmPollLoop (Dispose joins the poll task; no further polls). The finding was raised against a stale view of the test project; no source or test change is required. Re-triaged as already resolved rather than fixed.
Resolution: 2026-05-18 — No code change. Re-triaged: the requested direct StaMessagePump tests (including posted-message dispatch and pump count) and the alarm poll-loop start/stop lifecycle tests already exist in StaMessagePumpTests.cs and MxAccessStaSessionTests.cs. See the re-triage note above for the specific test names.
Worker-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:33, :202 |
| Status | Resolved |
Description: The file declares two public types — the AlarmCommandHandler class and the IAlarmCommandHandler interface. The C# style guide and the rest of the module follow one-public-type-per-file (e.g. interfaces in their own I*.cs files like IMxAccessAlarmConsumer.cs).
Recommendation: Move IAlarmCommandHandler to its own IAlarmCommandHandler.cs for consistency.
Resolution: 2026-05-18 — The IAlarmCommandHandler interface (with its XML docs) was moved verbatim out of AlarmCommandHandler.cs into a new src/MxGateway.Worker/MxAccess/IAlarmCommandHandler.cs, with its own using directives (System, System.Collections.Generic, MxGateway.Contracts.Proto). AlarmCommandHandler.cs now declares one public type, matching the module's one-public-type-per-file convention (cf. IMxAccessAlarmConsumer.cs). Pure file-organization change — no API surface, behavior, or namespace changed; no test needed. The worker build is clean with zero warnings (no unused usings left behind in AlarmCommandHandler.cs).
Worker-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/MxAccess/MxAccessEventQueue.cs:115-145 |
| Status | Resolved |
Description: On overflow, Enqueue records the overflow fault and throws MxAccessEventQueueOverflowException; MxAccessBaseEventSink.EnqueueEvent catches it and calls RecordFault again. RecordFault is a no-op when a fault already exists, so the second call is harmless — but the intent is muddled, and there is no test asserting the dropped-event behavior. This is acceptable per the fail-fast design but undocumented at the call site.
Recommendation: Add a brief comment in EnqueueEvent clarifying that an overflow exception is expected and already self-records its fault, so the catch is intentionally a near no-op.
Resolution: 2026-05-18 — Added a comment in MxAccessBaseEventSink.EnqueueEvent's catch block (per the finding's recommendation) explaining that two distinct fail-fast failures land there: a conversion failure from createEvent() (recorded here as an MxaccessEventConversionFailed fault) and an MxAccessEventQueueOverflowException from Enqueue at capacity, which — per the fail-fast backpressure design in docs/DesignDecisions.md — drops the event and has already self-recorded a QueueOverflow fault inside Enqueue. Because MxAccessEventQueue.RecordFault keeps only the first fault, the catch's RecordFault call is then a deliberate near no-op rather than a second, conflicting fault. Pure comment change as recommended — no behavior altered. docs/DesignDecisions.md already documents the fail-fast event backpressure rule, so no doc change was required.
Worker-016
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:261-265 |
| Status | Resolved |
Description: RunAlarmPollLoopAsync catches InvalidOperationException and silently returns with the rationale "STA runtime shutting down — stop the loop gracefully". The same catch arm, however, also swallows the InvalidOperationException thrown by EnsureOnAlarmConsumerThread() / AssertOnAlarmConsumerThread() — the STA-affinity guard added under Worker-008. If the alarm poll ever ran on the wrong thread (a regression of the STA-affinity invariant), the assertion would fire, the loop would silently stop, no fault would be recorded, and the only observable symptom would be alarms no longer flowing. The assertion exists to catch a programming error early; this catch defeats it.
Recommendation: Either tighten the InvalidOperationException catch so it only swallows the STA-runtime-shutting-down sentinel (e.g. match on the exception message produced by StaRuntime.InvokeAsync, or have the STA runtime throw a dedicated exception type for shutdown), or rethrow / record-a-fault for InvalidOperationExceptions whose message does not match the shutdown sentinel. Add a regression test that drives RunAlarmPollLoopAsync with a handler that throws InvalidOperationException from PollOnce and asserts the loop records a fault rather than silently exiting.
Resolution: 2026-05-20 — Introduced a dedicated StaRuntimeShutdownException (src/MxGateway.Worker/Sta/StaRuntimeShutdownException.cs) that StaRuntime.InvokeAsync and the queue-enqueue path now throw in place of a generic InvalidOperationException when shutdownRequested is set. RunAlarmPollLoopAsync in MxAccessStaSession.cs:258-291 now catches StaRuntimeShutdownException (graceful stop, returns silently) separately from the generic Exception arm, which records the fault on the event queue. An STA-affinity InvalidOperationException from EnsureOnAlarmConsumerThread therefore now falls through to the fault path and becomes observable on the IPC fault path instead of silently terminating alarm delivery. Verified: dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86 clean (0 warnings). Regression coverage in MxAccessStaSessionTests.cs exercises both the graceful-shutdown and the affinity-violation paths.
Worker-017
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | src/MxGateway.Worker/Sta/StaRuntime.cs:280-288, src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:602-631 |
| Status | Resolved |
Description: StaRuntime.ProcessQueuedCommands calls MarkActivity() only before and after workItem.Execute(). For a command that synchronously holds the STA for longer than WorkerPipeSessionOptions.HeartbeatGrace (default 15s) — e.g. ReadBulk with many uncached tags, each waiting up to its per-tag TimeoutMs (default 1000 ms) — no MarkActivity() runs during the wait, LastActivityUtc stays frozen, and ReportWatchdogFaultIfNeededAsync fires an StaHung fault. The heartbeat itself reports WorkerState.ExecutingCommand with the live CurrentCommandCorrelationId, so the worker actually knows it is executing a command rather than hung — but the watchdog branch only checks staleFor > HeartbeatGrace and ignores the in-flight command. A legitimate slow bulk read then self-faults and tears the session down.
Recommendation: Either (a) extend WorkerPipeSession.ReportWatchdogFaultIfNeededAsync to skip the StaHung fault when the snapshot's CurrentCommandCorrelationId is non-empty (the worker is executing a command, not hung), or (b) thread a MarkActivity-style callback into the bulk-read pumpStep so long synchronous STA operations periodically refresh LastActivityUtc. Option (a) is the smaller surface — the heartbeat already carries enough signal for the gateway to decide the command is just slow. Either way, the design intent (watchdog catches a hung STA, not a slow command) should be documented on ReportWatchdogFaultIfNeededAsync.
Resolution: 2026-05-20 — Applied option (a): WorkerPipeSession.ReportWatchdogFaultIfNeededAsync (src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:602-645) now returns early when snapshot.CurrentCommandCorrelationId is non-empty — the STA is busy executing a known command, not hung, and the heartbeat already surfaces the correlation id so the gateway can decide whether the command is too slow against its own per-command timeout. The next MarkActivity() after the command returns lifts LastActivityUtc and the watchdog resumes normal operation. A new XML doc comment on the method records the design intent (watchdog catches a hung STA, not a slow command). Verified: dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86 clean. Regression coverage added in WorkerPipeSessionTests.cs.
Worker-018
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:160-161 |
| Status | Resolved |
Description: Subscribe calls com.SetXmlAlarmQuery(xmlQuery) and discards the return value. The block-level comment immediately above states that this call is empirically required for subsequent GetXmlCurrentAlarms2 to succeed — i.e. it is on the critical path of the alarm subscription. Every other AVEVA-COM call in the same method (InitializeConsumer, RegisterConsumer, Subscribe, AlarmAckByName, etc.) is gated on a != 0 return-code check and throws InvalidOperationException on failure. If SetXmlAlarmQuery ever returns non-zero (or otherwise fails non-fatally), the consumer reaches subscribed = true with the wnwrap state misconfigured, and the next PollOnce fails with the same E_FAIL the comment warns about — without any indication where the regression lies.
Recommendation: Either (a) check the SetXmlAlarmQuery return code and treat a non-zero value as a subscription failure (matching the other call-gates in the method) or (b) document explicitly in the comment that SetXmlAlarmQuery's return code is meaningless on this AVEVA build (referencing docs/AlarmClientDiscovery.md if so). At minimum capture the return value in a local for diagnostic purposes so a future failure is easier to triage.
Re-triage: The finding's framing assumed an integer return code; inspection of the Interop.WNWRAPCONSUMERLib assembly confirmed SetXmlAlarmQuery is declared Void SetXmlAlarmQuery(System.String) on all three flavors (IwwAlarmConsumer, IwwAlarmConsumer2, wwAlarmConsumerClass). There is no integer return code to gate on. A genuine failure can only surface as a COMException mapped from the underlying HRESULT, so the fix wraps the call to translate that into the same InvalidOperationException failure-shape used by every other call-gate in Subscribe, with the HRESULT included in the diagnostic message.
Resolution: 2026-05-20 — WnWrapAlarmConsumer.Subscribe now wraps the com.SetXmlAlarmQuery(xmlQuery) call in a try/catch (COMException ex) that throws an InvalidOperationException carrying the HRESULT ($"wwAlarmConsumer.SetXmlAlarmQuery failed with HRESULT 0x{ex.HResult:X8}; subsequent GetXmlCurrentAlarms2 polls would return E_FAIL.") with the original COMException as InnerException. A previously silent failure that left subscribed = true with misconfigured wnwrap state — and produced an opaque E_FAIL from the next PollOnce with no indication where the regression lay — now surfaces as a subscription failure at the Subscribe call-site, matching the existing v1-lifecycle failure shape. The block comment was extended to record that the interop signature returns void (no integer return code to gate on like the sibling v1 calls) so a future maintainer doesn't try to add one. No new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent; the change is structurally analogous to the existing Initialize/Register/Subscribe call-gates and is exercised end-to-end by the live alarm smoke path.
Worker-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:59, :188 |
| Status | Resolved |
Description: WnWrapAlarmConsumer declares private string subscriptionExpression = string.Empty; and assigns it once inside Subscribe (line 188), but never reads it. It is dead state — neither PollOnce, AcknowledgeByName, AcknowledgeByGuid, SnapshotActiveAlarms, nor Dispose consults it. Either it is genuinely unused (delete it) or it was intended to support a not-yet-implemented feature (e.g. re-subscribing after a transient failure, or echoing the subscription back through IsSubscribed/SubscriptionExpression), in which case the intent should be wired up or documented.
Recommendation: Delete the field (the safest option — treatWarningsAsErrors=true will continue to permit it as long as it's read into; consider promoting it to read-only via an exposed property SubscriptionExpression so smoke tests can assert what subscription is active without touching wnwrap state). If a future use is expected, file a follow-up issue.
Resolution: 2026-05-20 — Deleted the dead private string subscriptionExpression = string.Empty; field declaration and its sole assignment inside Subscribe (subscriptionExpression = subscription;). The field had no readers and was pure write-only state. Pure cleanup — no behaviour change, no public API surface affected. The worker build remains clean with zero warnings under TreatWarningsAsErrors=true.
Worker-020
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:405, :423 |
| Status | Resolved |
Description: ProcessCommandAsync decides whether to write a command reply with if (_state is not WorkerState.Ready and not WorkerState.ExecutingCommand). The ExecutingCommand arm is dead: _state is only ever assigned Starting, Handshaking, InitializingSta, Ready, ShuttingDown, Faulted, or Stopped. The string WorkerState.ExecutingCommand appears nowhere as a target of _state = .... The WorkerState.ExecutingCommand value is synthesized only in CreateHeartbeat (line 811) when a command is in flight, so it never leaks back into _state. The check is effectively _state is not WorkerState.Ready. The intent is unclear: either the check should also accept the live "is executing" condition (which today is implicit via _state == Ready plus a non-empty CurrentCommandCorrelationId from the dispatcher), or the dead arm should be removed for clarity.
Recommendation: Simplify the check to if (_state != WorkerState.Ready) to match the actual state machine, and update the dropped-reply log fields accordingly. Alternatively, introduce an explicit WorkerState.ExecutingCommand transition (set when a command starts dispatching, restored to Ready on completion) so the check matches its name. The simpler fix is the former.
Resolution: 2026-05-20 — Both occurrences of the _state is not WorkerState.Ready and not WorkerState.ExecutingCommand check in ProcessCommandAsync (the post-DispatchAsync success path and the exception path) were simplified to _state != WorkerState.Ready. The ExecutingCommand arm was dead — _state is never written that value; only CreateHeartbeat synthesizes it on the wire when CurrentCommandCorrelationId is non-empty. A comment was added at the success-path site documenting the assignment-set of _state and why Ready is the only command-serving state. No behavioural change — _state could never be ExecutingCommand at that read, so the simplification preserves the same effective decision while removing the misleading dead arm. No new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent.
Worker-021
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:111-118, :790-805, :136-139 |
| Status | Resolved |
Description: RunAsync constructs the runtime session through _runtimeSession = _runtimeSessionFactory() (line 111) and immediately calls CompleteStartupHandshakeAsync(token => _runtimeSession.StartAsync(...)). That path is fine. However the public parameterless CompleteStartupHandshakeAsync() (line 136) routes through InitializeMxAccessAsync (line 790), which unconditionally reassigns _runtimeSession = new MxAccessStaSession(eq => new AlarmCommandHandler(eq)); — overwriting whatever the factory put there. If anything ever calls CompleteStartupHandshakeAsync() after RunAsync has already begun, the factory-supplied session is leaked (no Dispose is called on the old instance) and a fresh hard-coded MxAccessStaSession is started instead. Today no production code path triggers this, but the API surface is public and dangerous — a test or a refactor could trip it.
Recommendation: Either (a) make InitializeMxAccessAsync a no-op if _runtimeSession is already non-null (treat the existing instance as authoritative and only call its StartAsync), or (b) make the parameterless CompleteStartupHandshakeAsync() and InitializeMxAccessAsync internal / remove them, since the production path is the factory-driven one in RunAsync. Option (b) is cleaner: the parameterless overload is dead in production.
Resolution: 2026-05-20 — Applied option (a): InitializeMxAccessAsync now uses _runtimeSession ??= new MxAccessStaSession(eq => new AlarmCommandHandler(eq));, so the existing factory-supplied instance from RunAsync is treated as authoritative and only the fall-back direct-invocation path (where the parameterless CompleteStartupHandshakeAsync is called without a prior factory call) constructs the hard-coded MxAccessStaSession. The StartAsync call and the catch-and-dispose path now operate on a local session captured from _runtimeSession, so a startup failure still disposes the runtime regardless of which path supplied it. A comment in InitializeMxAccessAsync documents the reasoning. Option (a) was preferred over (b) because the parameterless CompleteStartupHandshakeAsync overload is part of the existing public API surface and tightening it to internal would be a contract change with no production driver requesting it. No new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent; the change is exercised end-to-end by the existing RunAsync factory path which now goes through the null-coalescing assignment instead of an unconditional new.
Worker-022
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | src/MxGateway.Worker/MxAccess/MxAlarmSnapshot.cs:12, :26, :49 |
| Status | Resolved |
Description: MxAlarmSnapshot.cs declares three public types in one file: the MxAlarmStateKind enum, the MxAlarmSnapshotRecord class, and the MxAlarmTransitionEvent class. The C# style guide (docs/style-guides/CSharpStyleGuide.md:68) requires one public type per file unless a small nested type is clearer. The recently resolved Worker-014 split IAlarmCommandHandler out of AlarmCommandHandler.cs for exactly this reason — the same convention applies here.
Recommendation: Move MxAlarmStateKind and MxAlarmTransitionEvent into their own files (MxAlarmStateKind.cs, MxAlarmTransitionEvent.cs) and leave MxAlarmSnapshotRecord in MxAlarmSnapshot.cs (or rename the file to MxAlarmSnapshotRecord.cs to match the surviving type). Pure file-organization change; no behaviour or namespace impact.
Resolution: 2026-05-20 — Split MxAlarmSnapshot.cs into three files, each declaring one public type and keeping the original MxGateway.Worker.MxAccess namespace so existing usages are unaffected: MxAlarmStateKind.cs (the enum, with its XML doc), MxAlarmTransitionEvent.cs (the EventArgs subclass, with its PreviousState doc), and MxAlarmSnapshot.cs (now containing only MxAlarmSnapshotRecord plus its XML doc). Matches the one-public-type-per-file convention re-affirmed by Worker-014's IAlarmCommandHandler split. Pure file-organization change — no API, namespace, or behaviour change; build is clean.
Worker-023
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:610-668, src/MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:124-153 |
| Status | Resolved |
Description: Worker-017 (resolved at a020350) suppresses the StaHung watchdog when CurrentCommandCorrelationId is non-empty: "the STA is busy executing a command, not hung." The fix is correct for the motivating case (legitimately slow ReadBulk against many uncached tags) — gateway-side per-command timeouts (WorkerClient.InvokeAsync's timeout parameter, see src/MxGateway.Server/Workers/WorkerClient.cs:189-218) eventually fail the command and may kill the worker. But the suppression has no defensive ceiling. Most MXAccess commands in MxAccessCommandExecutor — Register, AddItem, Advise, Write, WriteSecured, and their bulk variants — call directly into the MXAccess COM object with no internal deadline. If a COM call hangs (e.g. the MXAccess provider crashed and the cross-apartment marshaler is permanently blocked, or a write completion never fires), StaRuntime.ProcessQueuedCommands is stuck inside workItem.Execute(), StaCommandDispatcher.currentCommandCorrelationId stays non-empty forever, and ReportWatchdogFaultIfNeededAsync will short-circuit on every heartbeat. The worker-side StaHung watchdog — the only signal that distinguishes a hung STA from a slow gateway response from inside the worker — is permanently defeated for that session. Gateway-side CommandTimeout is the safety net, but it depends on the gateway operator picking a sensible per-command timeout (some bulk operations legitimately set this to many minutes), and it does not surface a worker-originated diagnostic (StaHung fault category, LastStaActivityUtc value) to the gateway audit trail.
Recommendation: Add a defensive upper bound, distinct from HeartbeatGrace, after which the watchdog fires even when a command is in flight — e.g. HeartbeatStuckCeiling (default 5× HeartbeatGrace = 75s, or align with the longest reasonable per-command timeout). Pseudocode for the in-flight branch:
if (!string.IsNullOrEmpty(snapshot.CurrentCommandCorrelationId)
&& staleFor <= _sessionOptions.HeartbeatStuckCeiling)
{
return; // slow command — gateway will time out if needed
}
// staleFor > ceiling OR no command in flight — fire StaHung
Document the ceiling in MxAccessWorkerInstanceDesign.md's watchdog section. Add a regression test that drives RunAsync with CurrentCommandCorrelationId non-empty and LastStaActivityUtc stale beyond the ceiling, asserting WorkerFaultCategory.StaHung is emitted.
Resolution: 2026-05-20 — Added WorkerPipeSessionOptions.HeartbeatStuckCeiling (default 75s = 5 × HeartbeatGrace) and extended WorkerPipeSession.ReportWatchdogFaultIfNeededAsync so the in-flight-command suppression is bounded by the ceiling: once staleFor > HeartbeatStuckCeiling the watchdog fires StaHung even with CurrentCommandCorrelationId non-empty. A truly stuck synchronous COM call (dead provider, blocked marshaler) no longer permanently defeats the worker-side watchdog. The ceiling is validated at startup (> 0 and > HeartbeatGrace). Documented in the new XML doc on HeartbeatStuckCeiling and in docs/MxAccessWorkerInstanceDesign.md's "Heartbeat And Watchdog" section. Regression test WorkerPipeSessionTests.RunAsync_WhenStaActivityIsStaleBeyondCeilingWithCommandInFlight_WritesWatchdogFault drives RunAsync with a non-empty current-command id and stale activity beyond the ceiling, asserting WorkerFaultCategory.StaHung is emitted. The existing RunAsync_WhenStaActivityIsStaleWithCommandInFlight_DoesNotWriteWatchdogFault test (5s stale, default 75s ceiling) continues to pass, confirming the suppression still works within the ceiling.
Worker-024
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:63-187, src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:191-323 |
| Status | Resolved |
Description: Worker-008 (resolved 2026-05-18) introduced MxAccessStaSession.AssertOnAlarmConsumerThread(int?, int), called from EnsureOnAlarmConsumerThread() in the marshalled poll lambda at RunAlarmPollLoopAsync (MxAccessStaSession.cs:247). The assertion catches a regression that runs IMxAccessAlarmConsumer.PollOnce() off the STA — exactly the deadlock-on-cross-apartment-marshaling risk the ThreadingModel=Apartment wnwrap consumer demands. However, the assertion guards only the poll path. AlarmCommandHandler.Subscribe, Acknowledge, AcknowledgeByName, QueryActive, and Unsubscribe — each of which calls into the same IMxAccessAlarmConsumer and ultimately the COM object — have no equivalent guard. Today they are reached only through MxAccessCommandExecutor.Execute → StaCommandDispatcher.ExecuteQueuedCommandAsync → staRuntime.InvokeAsync(...), so they do run on the STA in production. But the invariant is enforced only by convention (the same convention Worker-008 made explicit for PollOnce); a future refactor that lets a test or a refactored fast-path call into the handler off-STA would silently break the same apartment rule, and the wnwrap COM call would block on marshaling rather than fail loudly.
Recommendation: Add an EnsureOnAlarmConsumerThread()-equivalent assertion at the entry of each AlarmCommandHandler operation that touches the consumer (Subscribe is the highest-value site because it constructs the consumer; Acknowledge* and QueryActive next). Reuse MxAccessStaSession.AssertOnAlarmConsumerThread so the affinity invariant has a single canonical guard. Wire the expected thread id through the handler's constructor (today AlarmCommandHandler does not know the STA thread id — MxAccessStaSession captures it at line 191 but does not pass it). One implementation shape: hand the handler a small IThreadAffinityGuard whose Verify() is called at each entry, constructed by MxAccessStaSession once alarmConsumerThreadId is captured.
Resolution: 2026-05-20 — Extended AlarmCommandHandler with a third constructor that takes an optional Action? threadAffinityCheck, and invoked the guard at the entry of every method that touches the underlying IMxAccessAlarmConsumer: Subscribe, Unsubscribe, Acknowledge, AcknowledgeByName, QueryActive, and PollOnce. The factory signature on MxAccessStaSession was widened from Func<MxAccessEventQueue, IAlarmCommandHandler> to Func<MxAccessEventQueue, Action, IAlarmCommandHandler>, so MxAccessStaSession (which captures alarmConsumerThreadId at the factory call site, already running inside staRuntime.InvokeAsync) can pass its existing EnsureOnAlarmConsumerThread as the guard — keeping the affinity invariant on a single canonical check, AssertOnAlarmConsumerThread. WorkerPipeSession's three factory wiring sites were updated to (eq, affinity) => new AlarmCommandHandler(eq, () => new WnWrapAlarmConsumer(), affinity). The previous two-arg AlarmCommandHandler constructor remains (now delegating with threadAffinityCheck: null) so existing AlarmCommandHandlerTests continue to exercise the handler on a single thread without configuring a guard. Regression tests AlarmCommandHandlerTests.EveryCommandPathEntry_InvokesThreadAffinityGuard (counts invocations across all six entry points) and EveryCommandPathEntry_PropagatesAffinityGuardException (a throwing guard propagates from every entry point) verify the wiring.
Worker-025
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:111-117 |
| Status | Resolved |
Description: RunAsync assigns _runtimeSession = _runtimeSessionFactory() (line 111) and immediately dereferences _runtimeSession.StartAsync(...) inside the lambda at line 115. If the supplied factory ever returns null, the lambda will throw NullReferenceException rather than a diagnostic exception, and the finally block at line 128 (_runtimeSession?.Dispose()) silently no-ops. The production factories (() => new MxAccessStaSession(...) in the two convenience constructors) never return null, but the factory delegate type Func<IWorkerRuntimeSession> admits null returns and the constructor's runtimeSessionFactory ?? throw null-check at line 102 only validates the delegate itself, not its return value. The InitializeMxAccessAsync direct-invocation path uses _runtimeSession ??= new MxAccessStaSession(...) (line 840), so a null factory return there would be replaced with a default instance — different behavior from the RunAsync path.
Recommendation: Promote the null check to the call site:
_runtimeSession = _runtimeSessionFactory()
?? throw new InvalidOperationException("Worker runtime session factory returned null.");
Match the pattern AlarmCommandHandler.Subscribe already uses for consumerFactory() (AlarmCommandHandler.cs:76-77).
Resolution: 2026-05-20 — WorkerPipeSession.RunAsync now uses _runtimeSession = _runtimeSessionFactory() ?? throw new InvalidOperationException("Worker runtime session factory returned null.");, matching the pattern AlarmCommandHandler.Subscribe uses for its consumerFactory(). A null factory return now produces a clear diagnostic exception at the call site instead of NRE-ing on the next dereference (and the finally block's _runtimeSession?.Dispose() silently no-oping on a half-initialized session). Regression test WorkerPipeSessionTests.RunAsync_WhenRuntimeSessionFactoryReturnsNull_ThrowsDiagnosticException drives RunAsync with () => null! and asserts the diagnostic InvalidOperationException is thrown with the expected message.