Files
mxaccessgw/code-reviews/Worker/findings.md
T
Joseph Doherty 37ef27e8ed code-reviews: bump Worker + Worker.Tests headers to 42b0037
No source changes since d692232 — header bumped to track the latest
reviewed commit, all prior findings remain closed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:56 -04:00

467 lines
62 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — Worker
| Field | Value |
|---|---|
| Module | `src/ZB.MOM.WW.MxGateway.Worker` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 0 |
## 2026-05-24 re-review (commit `42b0037`)
**Re-review: no new findings.** `git diff --name-only d692232..42b0037 -- src/ZB.MOM.WW.MxGateway.Worker` returns empty — the Worker module has zero source changes since the previous review. All ten checklist categories therefore inherit "No issues found" from the `d692232` pass. The header is bumped to track the latest reviewed commit; Worker-001..025 remain closed.
## Checklist coverage
This row reflects the 2026-05-20 re-review at commit `a020350`. Worker-001..022 are all closed; the row only summarises new findings filed against this commit. The prior pass's fixes for Worker-016..022 were verified sound:
- **Worker-016**: `StaRuntimeShutdownException` exists, `MxAccessStaSession.cs:261` is the only `catch (StaRuntimeShutdownException)` site in the module. No accidental catch elsewhere (grep verified). The graceful-shutdown vs. STA-affinity-violation distinction holds.
- **Worker-017**: `ReportWatchdogFaultIfNeededAsync` returns early when `CurrentCommandCorrelationId` is non-empty. Sound for the slow-but-progressing case; but see **Worker-023** — there is no defensive ceiling, so a truly stuck command (synchronous COM call hung against a dead MXAccess provider) leaves `CurrentCommandCorrelationId` non-empty forever and the worker-side watchdog is permanently suppressed.
- **Worker-018**: `SetXmlAlarmQuery` is now wrapped in `try/catch (COMException)` and re-thrown as `InvalidOperationException` carrying the HRESULT. Sound.
- **Worker-019**: `subscriptionExpression` field is gone.
- **Worker-020**: `_state is not WorkerState.Ready and not WorkerState.ExecutingCommand` simplified to `_state != WorkerState.Ready`. Confirmed `_state` is never assigned `ExecutingCommand`; volatile reads are atomic.
- **Worker-021**: `_runtimeSession ??=` in `InitializeMxAccessAsync` preserves a factory-supplied session. Confirmed `RunAsync` path bypasses `InitializeMxAccessAsync` entirely (it passes its own factory-driven lambda), so the `??=` only runs on the legacy parameterless-`CompleteStartupHandshakeAsync` direct-invocation path.
- **Worker-022**: `MxAlarmSnapshot.cs` (now containing only `MxAlarmSnapshotRecord`), `MxAlarmStateKind.cs`, `MxAlarmTransitionEvent.cs` — filenames match their single public type; all three keep the `MxGateway.Worker.MxAccess` namespace.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issue found: Worker-025 (`RunAsync` does not null-check the result of `_runtimeSessionFactory()`; a null factory return would NRE on `_runtimeSession.StartAsync(...)` rather than throw a diagnostic exception). |
| 2 | mxaccessgw conventions | No issues found. The split alarm-snapshot files match the one-public-type-per-file convention; namespace consistency verified. |
| 3 | Concurrency & thread safety | Issue found: Worker-024 (the alarm command path — `Subscribe`/`Acknowledge`/`AcknowledgeByName`/`QueryActive`/`Unsubscribe` — has no STA-affinity assertion equivalent to Worker-008's `EnsureOnAlarmConsumerThread` guard; only the alarm *poll* path enforces affinity, leaving a latent gap if a future refactor lets alarm commands run off-STA). |
| 4 | Error handling & resilience | Issue found: Worker-023 (Worker-017's watchdog skip has no defensive ceiling; a truly stuck command — synchronous COM hung against a dead MXAccess provider — keeps `CurrentCommandCorrelationId` non-empty indefinitely, and the worker-side `StaHung` watchdog never fires. Gateway-side `CommandTimeout` is the only safety net). |
| 5 | Security | No issues found. No secret logging on the alarm path; the dropped-reply diagnostic Worker-003 added logs only the correlation id and command method, not the command payload. |
| 6 | Performance & resource management | No new issues found. Frame I/O still uses pooled buffers (Worker-009); STA join timeouts in `Dispose` are bounded. |
| 7 | Design-document adherence | No new design drift. The split alarm files preserve the documented public API surface. Worker-017's resolution comment documents the watchdog design intent — though see Worker-023 for the documentation gap on truly-stuck commands. |
| 8 | Code organization & conventions | No issues found. Worker-022 was the last file-organization issue. |
| 9 | Testing coverage | Worker-016 and Worker-017 each have direct regression tests (`RunAlarmPollLoop_WhenPollOnceThrowsInvalidOperation_RecordsFaultOnEventQueue`, `RunAsync_WhenStaActivityIsStaleWithCommandInFlight_DoesNotWriteWatchdogFault`). Worker-018, -020, -021's resolution notes state "no new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent" — Worker-018's `SetXmlAlarmQuery` failure-translation and Worker-020's simplified `_state != Ready` check have no regression test in this branch yet. No standalone finding — these are documented gaps in the resolution notes of the prior pass. |
| 10 | Documentation & comments | No new issues. Worker-017's XML doc on `ReportWatchdogFaultIfNeededAsync` documents the design intent clearly; the `_runtimeSession ??=` reasoning is documented inline; Worker-016's graceful-vs-affinity distinction is documented at both catch sites. |
### 2026-05-24 review (commit d692232)
Re-review pass at `d692232`. The only diff against the Worker source since
`a020350` is the `ZB.MOM.WW` namespace/csproj/folder rename (commit `dc9c0c9`)
— no behavioural changes. The four external runtime identifiers documented
as intentionally unprefixed (including `Name = "MxGateway.Worker.STA"` on
`StaRuntime`) are confirmed still in their original form. The `_writeLock`
contention with the gateway-side watchdog (Server-031) is unchanged.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. |
| 2 | mxaccessgw conventions | No issues found — rename clean; STA thread name and MXAccess COM target (`LMXProxyServerClass` via `ArchestrA.MXAccess.dll`) unchanged. |
| 3 | Concurrency & thread safety | No issues found in this diff. |
| 4 | Error handling & resilience | No issues found in this diff. |
| 5 | Security | No issues found in this diff. |
| 6 | Performance & resource management | No issues found in this diff. |
| 7 | Design-document adherence | No issues found in this diff. |
| 8 | Code organization & conventions | No issues found in this diff. |
| 9 | Testing coverage | No issues found in this diff. |
| 10 | Documentation & comments | No issues found in this diff. |
## Findings
### Worker-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:204-207` |
| Status | Resolved |
**Description:** When constructed with `pollIntervalMilliseconds > 0`, `Subscribe` starts a `System.Threading.Timer` whose `OnPoll` callback runs `PollOnce()` — which calls `wwAlarmConsumerClass.GetXmlCurrentAlarms2` — on a thread-pool thread. The wnwrap CLSID is registered `ThreadingModel=Apartment`; calling its methods off the owning STA violates the hard rule that all COM calls happen on the dedicated STA thread, and can deadlock on cross-apartment marshaling when the STA is not pumping. The production path (default constructor, interval 0) is safe, but the public 3-arg constructor leaves this footgun callable, and tests/live-smoke use it.
**Recommendation:** Remove the internal `Timer` entirely (production already drives `PollOnce` from the STA), or document and gate it so it can only be used from an STA thread. At minimum, make the timer-driven mode unreachable from any production wiring.
**Resolution:** 2026-05-18 — Removed the off-STA timer infrastructure from `WnWrapAlarmConsumer`: the `Timer? pollTimer` and `pollIntervalMs` fields, the `DefaultPollIntervalMilliseconds` constant, the `OnPoll` callback, the timer-arming arm in `Subscribe`, and the timer disposal block in `Dispose`. The `pollIntervalMilliseconds` parameter is gone from both public constructors (the test-seam ctor is now 2-arg: `wwAlarmConsumerClass` + `maxAlarmsPerFetch`), so the off-STA footgun is structurally unreachable. `PollOnce()` remains the public STA-driven entry point. The stale "poll … on a timer below" comment was corrected. Verified by the regression tests `WnWrapAlarmConsumer_has_no_internal_timer_field` and `WnWrapAlarmConsumer_exposes_no_poll_interval_constructor_parameter`; the `AlarmsLiveSmokeTests` call site was updated to the 2-arg constructor.
### Worker-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:545-549` |
| Status | Resolved |
**Description:** `RunHeartbeatLoopAsync` calls `await Task.Delay(_sessionOptions.HeartbeatInterval, ...)` before sending the first heartbeat. The gateway therefore receives no heartbeat for the first full interval (default 5s) after the worker reaches `Ready`. If the gateway's liveness watchdog expects a heartbeat sooner, a healthy worker can be misclassified as hung at startup.
**Recommendation:** Send an initial heartbeat immediately on entering the loop, or move the `Task.Delay` to the end of the loop body.
**Resolution:** 2026-05-18 — Restructured `RunHeartbeatLoopAsync` so the `Task.Delay(HeartbeatInterval)` is applied between beats only, not before the first. A `firstBeat` guard skips the delay on the initial iteration, so the gateway sees a heartbeat as soon as the worker is `Ready`; cancellation behavior is preserved (the loop still observes the token and the delay still throws on cancellation). Verified by the regression test `RunAsync_SendsFirstHeartbeatImmediatelyOnEnteringLoop`. Three pre-existing tests (`WorkerPipeClientTests.RunAsync_ConnectsToPipeAndCompletesHandshake`, `WorkerPipeClientTests.RunAsync_RetriesUntilPipeServerAppears`, `WorkerPipeSessionTests.RunAsync_WhenCommandThrowsAfterShutdown_DropsLateFaultAndWritesShutdownAck`) assumed strict frame ordering and were updated to skip the now-interleaved first heartbeat while still asserting the same shutdown-ack behavior.
### Worker-003
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:399-403`, `:416-419` |
| Status | Resolved |
**Description:** `ProcessCommandAsync` checks `_state` after `DispatchAsync` completes and silently `return`s without writing a `WorkerCommandReply` (or fault) when `_state` is not `Ready`/`ExecutingCommand`. `_state` is a plain field mutated from multiple tasks (heartbeat loop, event-drain loop, shutdown). A command that completes successfully while `_state` has transitioned will have its reply dropped with no diagnostic, and the gateway's correlation-id wait then hangs until its own timeout. The `_state` read is also not synchronized.
**Recommendation:** Always attempt to write the reply/fault for an in-flight command, or explicitly reject in-flight commands with a `Canceled`/`WorkerUnavailable` reply during state transitions. Make `_state` access thread-safe (volatile or locked).
**Resolution:** 2026-05-18 — Both silent-drop `return` sites in `ProcessCommandAsync` (the post-`DispatchAsync` success path and the exception path) now call a new `LogCommandResultDropped` helper before returning. The helper logs an Information event named `WorkerCommandResultDropped` via the session's `IWorkerLogger`, carrying the command's `correlation_id` plus `command_method` and `worker_state`, so a stuck gateway correlation-id wait is now traceable. The `_state` field was made `volatile` (`WorkerState` is an int-backed protobuf enum, so volatile is valid) so cross-thread reads observe the latest value without tearing; this is a low-risk, non-behavioral change and did not destabilize any test. Verified by the regression test `RunAsync_WhenReplyIsDroppedAfterShutdown_LogsDiagnostic`.
### Worker-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:565-588` |
| Status | Resolved |
**Description:** After `ReportWatchdogFaultIfNeededAsync` sends an `StaHung` fault, the heartbeat loop continues sending normal heartbeats with `State` derived from `_state`, which the watchdog path never sets to `Faulted`. The heartbeat then keeps reporting a non-faulted state that contradicts the fault just sent.
**Recommendation:** Set `_state = WorkerState.Faulted` (thread-safely) when the watchdog fault fires so heartbeat state and fault stay consistent.
**Resolution:** 2026-05-18 — `ReportWatchdogFaultIfNeededAsync` now sets `_state = WorkerState.Faulted` immediately after `_watchdogFaultSent = true` and before the `StaHung` fault is written, so the next heartbeat reports `Faulted` instead of contradicting the fault. `_state` is already `volatile` (Worker-003), so the cross-thread write from the heartbeat loop is observed correctly by the heartbeat's own `CreateHeartbeat` read; no further locking is required. Verified by the regression test `WorkerPipeSessionTests.RunAsync_AfterWatchdogFault_HeartbeatReportsFaultedState`, which uses a stale-activity snapshot with an empty current-command correlation id so the heartbeat `State` is derived from `_state` rather than forced to `ExecutingCommand`.
### Worker-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-258` (production alarm poll loop) |
| Status | Resolved |
**Description:** `OnPoll` catches every exception from `PollOnce()` and discards it (`_ = ex;`). The production poll path (`MxAccessStaSession.RunAlarmPollLoopAsync``AlarmCommandHandler.PollOnce``AlarmDispatcher.PollOnce``consumer.PollOnce()`) has no fault recording either. A permanently failing alarm provider (e.g. `GetXmlCurrentAlarms2` returning `E_FAIL`, malformed XML throwing in `XmlDocument.LoadXml`) is therefore completely silent — no fault on the event queue, no log.
**Recommendation:** Route poll failures to `MxAccessEventQueue.RecordFault` (or a logger) so a broken alarm subscription becomes observable. Update the now-stale comment.
**Re-triage:** The cited location `WnWrapAlarmConsumer.cs:297-313` and the `OnPoll` callback no longer exist as of this branch — Worker-001 removed the off-STA `Timer` and its `OnPoll` callback entirely. The substantive concern still held, however: the **production** poll path in `MxAccessStaSession.RunAlarmPollLoopAsync` caught only `OperationCanceledException`, `ObjectDisposedException`, and `InvalidOperationException`. A genuine poll failure (`COMException` from `GetXmlCurrentAlarms2`, a malformed-XML `XmlException`) escaped uncaught, faulted the never-awaited `Task.Run` poll task, and was silently lost — exactly the silent-failure the finding describes. The finding was re-pointed at the live location and fixed there rather than at the removed `OnPoll`.
**Resolution:** 2026-05-18 — `RunAlarmPollLoopAsync` gained a trailing `catch (Exception exception)` arm after the three graceful-stop catches. A real alarm-poll failure is now converted to a `WorkerFault` (category `MxaccessEventConversionFailed`, carrying the exception type and, for a `COMException`, its `HResult`) by the new `CreateAlarmPollFault` helper and recorded on the session's `MxAccessEventQueue` via `RecordFault`. The worker's event-drain loop drains that fault and forwards it to the gateway, so a broken alarm subscription is now observable on the IPC fault path instead of vanishing. The poll loop still stops after the failure (the subscription is dead). No new proto enum value was added — `MxaccessEventConversionFailed` is the closest existing alarm-path category, avoiding a contracts regeneration across all clients. Verified by the regression test `MxAccessStaSessionTests.RunAlarmPollLoop_WhenPollOnceThrows_RecordsFaultOnEventQueue`.
### Worker-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:117-124`, `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:386-491` |
| Status | Resolved |
**Description:** `RunAsync`'s `finally` calls `_runtimeSession?.Dispose()` unless `_shutdownTimedOut`. On the normal path `ShutdownGracefullyAsync` already disposed the STA runtime, so re-entering `Dispose()` is a harmless no-op only because `ShutdownGracefullyAsync` reached its end and set `disposed = true`. If `ShutdownGracefullyAsync` throws `TimeoutException` after partial teardown with `_shutdownTimedOut` set, the session is never disposed at all — the `finally` skips it — leaking the STA thread and COM object, leaving cleanup to rely solely on process exit.
**Recommendation:** Make the dispose decision explicit and confirm process exit always follows a timed-out shutdown; otherwise dispose defensively. At minimum document why disposal is deliberately skipped on timeout.
**Resolution:** 2026-05-18 — `RunAsync`'s `finally` now always calls `_runtimeSession?.Dispose()`; the `if (!_shutdownTimedOut)` guard and the `_shutdownTimedOut` field (which had become write-only) were removed. `MxAccessStaSession.Dispose` is idempotent (`if (disposed) return`) and bounded — each STA join is capped with `Wait(TimeSpan.FromSeconds(2))` — so re-entering it on the normal path (where `ShutdownGracefullyAsync` already disposed the runtime) is a harmless no-op, while on the timed-out path it is now the only thing that reclaims the STA thread and releases the MXAccess COM object. The previous behaviour leaked both on a shutdown timeout and relied solely on process exit. A code comment in the `finally` block documents the reasoning. Verified by the regression test `WorkerPipeSessionTests.RunAsync_WhenShutdownTimesOut_StillDisposesRuntimeSession`, which forces a `TimeoutException` from `ShutdownGracefullyAsync` and asserts the runtime session is disposed before `RunAsync` rethrows.
### Worker-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | mxaccessgw conventions |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessComServer.cs:130-150` |
| Status | Resolved |
**Description:** `Invoke` uses late-bound `Type.InvokeMember` reflection as a fallback when the COM object does not cast to `ILMXProxyServer*`. In production the object is always `LMXProxyServerClass`, so the reflection path exists only for test doubles — it is dead/untested code on the production path and obscures the interface contract. `params object[] arguments` also boxes value-type handles on every call.
**Recommendation:** Drop the reflection fallback and require the COM object to implement the interface (tests can supply a typed fake), or clearly mark the fallback as test-only.
**Re-triage:** The finding's claim that the reflection path is "dead/untested code" is partly inaccurate — it was in fact the path exercised by the entire `MxAccessCommandExecutorTests` suite, whose `FakeMxAccessComObject` did not implement any typed interface. So the reflection fallback was test-only but *not* untested. The convention concern (bypassing the typed interface contract, boxing value-type handles) is valid, so the fix follows the recommendation's first option.
**Resolution:** 2026-05-18 — The late-bound `Type.InvokeMember` reflection fallback and its `params object[]`-boxing `Invoke` helper were removed from `MxAccessComServer`. Each adapter method now takes one of two typed paths: an `is IMxAccessServer` fast path (test fakes implement `IMxAccessServer` directly) and the production path that casts to the typed `ILMXProxyServer` / `ILMXProxyServer3` / `ILMXProxyServer4` COM interfaces via new `AsProxyServer*` helpers. A COM object implementing neither now fails fast with a clear `InvalidOperationException` naming the missing interface, instead of an opaque late-bound call. The test seam was migrated accordingly: `MxAccessCommandExecutorTests.FakeMxAccessComObject` now declares `: IMxAccessServer` (its method signatures already matched the interface exactly, so no behavioural change). Verified by the new `MxAccessComServerTests` (typed-server routing, untyped-object rejection, original-exception propagation — no more `TargetInvocationException` wrapping) plus the unchanged, still-passing `MxAccessCommandExecutorTests` suite which now exercises the typed `IMxAccessServer` path.
### Worker-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-249`, `:429-447` |
| Status | Resolved |
**Description:** `RunAlarmPollLoopAsync` correctly marshals `handler.PollOnce()` onto the STA via `staRuntime.InvokeAsync`, and the cancel/await/dispose ordering in `ShutdownGracefullyAsync` is sound. However, nothing enforces that the `consumerFactory` and all `IMxAccessAlarmConsumer` calls run on the STA thread; a future caller could break STA affinity silently.
**Recommendation:** Add an assertion or documented invariant that the consumer factory and all `IMxAccessAlarmConsumer` calls run on the STA thread, mirroring the existing `MxAccessSession.CreationThreadId` pattern.
**Resolution:** 2026-05-18 — `MxAccessStaSession` now records the STA thread id (`alarmConsumerThreadId`) at the point the alarm-command-handler factory is invoked — which already runs inside `staRuntime.InvokeAsync` during `StartAsync`, mirroring the `MxAccessSession.CreationThreadId` capture. `RunAlarmPollLoopAsync`'s marshalled poll lambda now calls `EnsureOnAlarmConsumerThread()` before `handler.PollOnce()`, asserting the poll runs on the recorded STA thread. The check is delegated to a new `internal static` guard `AssertOnAlarmConsumerThread(int? expected, int actual)` that throws a descriptive `InvalidOperationException` on an affinity violation and is a no-op when the consumer thread is unrecorded (no alarm handler configured). Making the guard `static` and `internal` keeps it directly unit-testable. The STA-affinity invariant is documented in the guard's XML doc. Verified by the regression tests `MxAccessStaSessionTests.AssertOnAlarmConsumerThread_WhenOffOwningThread_Throws` and `AssertOnAlarmConsumerThread_OnOwningThreadOrUnset_DoesNotThrow`.
### Worker-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Worker/Ipc/WorkerFrameReader.cs:31,49`, `src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:57-58` |
| Status | Resolved |
**Description:** Every frame read allocates a fresh 4-byte length buffer and a payload `byte[]`; every write allocates `ToByteArray()` plus a 4-byte prefix. On the hot event-drain path (batches of up to 128 `WorkerEvent` frames every 25 ms) this produces steady gen-0 garbage. `WorkerFrameWriter` also effectively serializes twice (`CalculateSize()` then `ToByteArray()`).
**Recommendation:** Reuse a pooled buffer / `ArrayPool<byte>` for the length prefix and payload, and write directly into a pooled buffer using `CodedOutputStream`. Low priority unless event throughput is high.
**Resolution:** 2026-05-18 — `WorkerFrameWriter.WriteAsync` now serializes the envelope exactly once into a single frame buffer that carries the 4-byte length prefix followed by the payload, via `envelope.WriteTo(new Span<byte>(frame, sizeof(uint), payloadLength))`. This eliminates the redundant second serialization pass (`ToByteArray()` re-runs `CalculateSize()` internally), the separate length-prefix array, and the separate prefix `WriteAsync`/extra `FlushAsync` round. `WorkerFrameReader.ReadAsync` now rents its payload buffer from `ArrayPool<byte>.Shared` and returns it in a `finally` once `WorkerEnvelope.Parser.ParseFrom(payload, 0, length)` has copied what it needs; `ReadExactlyOrThrowAsync` gained an explicit `count` parameter so it honours the logical frame length rather than the (possibly larger) rented buffer length. The 4-byte length-prefix buffer is left as a per-call stack-sized allocation — pooling a 4-byte array is not worthwhile. Verified by the new regression test `WorkerFrameProtocolTests.ReadAsync_WithVaryingFrameSizes_ParsesEachFrameExactly`, which reads a large frame followed by a small frame through one reader to prove the pooled buffer is sliced to each frame's own length and never leaks stale trailing bytes; the existing round-trip, malformed-payload, and concurrent-write tests continue to pass.
### Worker-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Conversion/VariantConverter.cs:204-226` |
| Status | Resolved |
**Description:** `ConvertInt64Scalar` is reached for `TypeCode.UInt32` and `TypeCode.Int64`. For a `uint` with `expectedDataType == MxDataType.Time`, the value is treated as a Windows `FILETIME` via `DateTime.FromFileTimeUtc(longValue)`; a 32-bit FILETIME is never a valid full FILETIME, so this silently produces a near-epoch timestamp rather than a raw/diagnostic value. Unlikely in practice but a silent misconversion.
**Recommendation:** Only apply the `MxDataType.Time` FILETIME projection for 64-bit source types; for `uint` fall through to integer or raw.
**Resolution:** 2026-05-18 — `ConvertInt64Scalar`'s `MxDataType.Time` FILETIME projection is now gated on `value is long`. A genuine 64-bit `long` still projects to a `Timestamp` via `DateTime.FromFileTimeUtc`; a 32-bit `uint` — which can only hold the low half of a FILETIME — now falls through to the integer projection (`DataType = Integer`, `Int64Value`) instead of silently producing a bogus near-1601 timestamp. Verified by the regression test `VariantConverterTests.Convert_WithUInt32AndExpectedTime_DoesNotProjectFileTime`; the existing `Convert_WithFileTimeAndExpectedTime_ProjectsTimestamp` (a `long` FILETIME) continues to pass, confirming the 64-bit path is unchanged.
### Worker-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeClient.cs:169-171` |
| Status | Resolved |
**Description:** `retryAttempts` is computed as `(connectTimeout / min(connectTimeout, attemptTimeout)) - 1`. With defaults (30000 / 2000) this yields 14 retries, but each retry also incurs Polly exponential backoff. The overall `connectDeadline` (`CancelAfter(connectTimeout)`) is the real bound, so the computed attempt count can be larger or smaller than the time budget allows, and the formula is opaque.
**Recommendation:** Drive retries purely off the `connectDeadline` token (Polly stops when cancelled) and drop the fragile attempt-count arithmetic, or add a comment explaining the intent.
**Resolution:** 2026-05-18 — The opaque `retryAttempts` arithmetic in `ConnectWithRetryAsync` was removed. `MaxRetryAttempts` is now `int.MaxValue`, so the retry loop is bounded solely by the `connectDeadline` linked token (`CancelAfter(_connectTimeoutMilliseconds)`): Polly stops retrying the moment that token is cancelled, making the overall connect timeout the single source of truth and correctly accounting for the exponential backoff between attempts (which the old formula ignored). A comment documents the intent. No new test was added — the change does not alter observable behavior (the deadline was always the real bound; the old formula always permitted more attempts than fit the budget), and the existing `WorkerPipeClientTests.RunAsync_RetriesUntilPipeServerAppears` (server appears mid-retry) and `RunAsync_WhenPipeNeverAppears_ThrowsTimeoutException` (deadline ends the loop) already cover both retry-until-success and deadline-bounded termination.
### Worker-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessAlarmEventSink.cs:44-55`, `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:38-43`, `src/MxGateway.Worker/MxAccess/MxAccessEventMapper.cs:106-112` |
| Status | Resolved |
**Description:** Multiple comments describe the alarm path as not-yet-wired future work ("PR A.2 — COM-side subscription scaffold … the worker advertises no alarm subscription", "the worker bootstrap will gain a thin 'run-on-STA' wrapper as part of A.3"). As of commit 6c64030 the alarm command handler, STA poll loop, and `SubscribeAlarms`/`AcknowledgeAlarm`/`QueryActiveAlarms` are all wired. These comments are stale and misleading.
**Recommendation:** Update the XML docs/comments to describe the shipped behavior; remove the "future PR" framing.
**Re-triage:** The `WnWrapAlarmConsumer.cs:38-43` citation is inaccurate — those lines were rewritten by Worker-001 and already describe the shipped no-internal-timer threading model correctly; nothing stale there. Conversely, two stale comments the finding did *not* cite were found on the same alarm path and fixed under the same root cause: `AlarmDispatcher.cs`'s `<remarks>` still framed the dispatcher as "the in-process slice of A.3" with a "companion follow-up PR" adding the (now-shipped) `SubscribeAlarmsCommand`/`AcknowledgeAlarmCommand`/`QueryActiveAlarmsCommand`, and stated the consumer "polls on a `System.Threading.Timer` thread today" — a claim made false by Worker-001's removal of that timer; and `AlarmCommandHandler.cs`'s `<remarks>` likewise asserted "the wnwrap consumer's polling timer fires on a thread-pool thread". The discovery document `docs/AlarmClientDiscovery.md` (referenced by the source comments) was deliberately left untouched: it is a historical research log of the investigation that chose the shipped design, not API/contract/lifecycle prose, and the source comments cite only its still-accurate "Option A — captured" payload schema.
**Resolution:** 2026-05-18 — Rewrote the stale alarm-path comments to describe shipped behavior with no "future PR / A.2 / A.3" framing. `MxAccessAlarmEventSink`: the class `<remarks>` and the `Attach` comment now explain that `AlarmDispatcher` owns the consumer→sink→queue wire-up and that `Attach` carries only the session id (no COM-event subscription is needed because the polled wnwrap consumer raises transition events itself). `MxAccessEventMapper.CreateOnAlarmTransition`'s XML summary now states the worker drives it from `MxAccessAlarmEventSink.EnqueueTransition` once `AlarmDispatcher` decodes a wnwrap transition. `AlarmDispatcher` and `AlarmCommandHandler` `<remarks>` were corrected to describe the shipped command surface and the no-internal-timer / STA-driven polling model (the `System.Threading.Timer` claims were factually wrong post-Worker-001). Pure documentation change — no behavior altered, no test needed; the build stays green.
### Worker-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/MxGateway.Worker/Sta/StaMessagePump.cs` |
| Status | Resolved |
**Description:** `StaMessagePump` — the heart of COM event delivery (`MsgWaitForMultipleObjectsEx` + `PeekMessage`/`DispatchMessage`) — has no direct unit tests. `StaRuntimeTests` exercises it indirectly for command wake-up but never verifies that a posted Windows message actually wakes the wait and is dispatched, nor that `PumpPendingMessages` returns a correct count. The alarm poll-loop lifecycle in `MxAccessStaSession` (start/cancel/await on shutdown) also has no test. These are the most failure-sensitive paths in the module.
**Recommendation:** Add tests that post a message to the STA thread and assert it is pumped, and tests covering alarm poll-loop start/stop and shutdown ordering.
**Re-triage:** This finding is stale as of the reviewed branch — the coverage it asks for already exists. `src/MxGateway.Worker.Tests/Sta/StaMessagePumpTests.cs` contains direct `StaMessagePump` tests covering null-argument validation, waking on a signalled event, returning on timeout, the zero-timeout conversion branch, `PumpPendingMessages` returning the correct count for messages posted to the STA thread (`PumpPendingMessages_MessagesPostedToStaThread_ReturnsCountProcessed`, `PumpPendingMessages_NoMessagesPosted_ReturnsZero`), and `WaitForWorkOrMessages` waking on a posted Windows message (`WaitForWorkOrMessages_WindowsMessagePosted_ReturnsForInputAvailable`) — exactly the "post a message and assert it is pumped" test the recommendation asks for. The alarm poll-loop lifecycle is covered by `MxAccessStaSessionTests.StartAsync_WithAlarmCommandHandlerFactory_PollOnceCalledViaSta` (start → poll runs on the STA) and `Dispose_StopsAlarmPollLoop` (Dispose joins the poll task; no further polls). The finding was raised against a stale view of the test project; no source or test change is required. Re-triaged as already resolved rather than fixed.
**Resolution:** 2026-05-18 — No code change. Re-triaged: the requested direct `StaMessagePump` tests (including posted-message dispatch and pump count) and the alarm poll-loop start/stop lifecycle tests already exist in `StaMessagePumpTests.cs` and `MxAccessStaSessionTests.cs`. See the re-triage note above for the specific test names.
### Worker-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:33`, `:202` |
| Status | Resolved |
**Description:** The file declares two public types — the `AlarmCommandHandler` class and the `IAlarmCommandHandler` interface. The C# style guide and the rest of the module follow one-public-type-per-file (e.g. interfaces in their own `I*.cs` files like `IMxAccessAlarmConsumer.cs`).
**Recommendation:** Move `IAlarmCommandHandler` to its own `IAlarmCommandHandler.cs` for consistency.
**Resolution:** 2026-05-18 — The `IAlarmCommandHandler` interface (with its XML docs) was moved verbatim out of `AlarmCommandHandler.cs` into a new `src/MxGateway.Worker/MxAccess/IAlarmCommandHandler.cs`, with its own `using` directives (`System`, `System.Collections.Generic`, `MxGateway.Contracts.Proto`). `AlarmCommandHandler.cs` now declares one public type, matching the module's one-public-type-per-file convention (cf. `IMxAccessAlarmConsumer.cs`). Pure file-organization change — no API surface, behavior, or namespace changed; no test needed. The worker build is clean with zero warnings (no unused usings left behind in `AlarmCommandHandler.cs`).
### Worker-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessEventQueue.cs:115-145` |
| Status | Resolved |
**Description:** On overflow, `Enqueue` records the overflow fault and throws `MxAccessEventQueueOverflowException`; `MxAccessBaseEventSink.EnqueueEvent` catches it and calls `RecordFault` again. `RecordFault` is a no-op when a fault already exists, so the second call is harmless — but the intent is muddled, and there is no test asserting the dropped-event behavior. This is acceptable per the fail-fast design but undocumented at the call site.
**Recommendation:** Add a brief comment in `EnqueueEvent` clarifying that an overflow exception is expected and already self-records its fault, so the catch is intentionally a near no-op.
**Resolution:** 2026-05-18 — Added a comment in `MxAccessBaseEventSink.EnqueueEvent`'s catch block (per the finding's recommendation) explaining that two distinct fail-fast failures land there: a conversion failure from `createEvent()` (recorded here as an `MxaccessEventConversionFailed` fault) and an `MxAccessEventQueueOverflowException` from `Enqueue` at capacity, which — per the fail-fast backpressure design in `docs/DesignDecisions.md` — drops the event and has *already* self-recorded a `QueueOverflow` fault inside `Enqueue`. Because `MxAccessEventQueue.RecordFault` keeps only the first fault, the catch's `RecordFault` call is then a deliberate near no-op rather than a second, conflicting fault. Pure comment change as recommended — no behavior altered. `docs/DesignDecisions.md` already documents the fail-fast event backpressure rule, so no doc change was required.
### Worker-016
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:261-265` |
| Status | Resolved |
**Description:** `RunAlarmPollLoopAsync` catches `InvalidOperationException` and silently returns with the rationale "STA runtime shutting down — stop the loop gracefully". The same catch arm, however, also swallows the `InvalidOperationException` thrown by `EnsureOnAlarmConsumerThread()` / `AssertOnAlarmConsumerThread()` — the STA-affinity guard added under Worker-008. If the alarm poll ever ran on the wrong thread (a regression of the STA-affinity invariant), the assertion would fire, the loop would silently stop, no fault would be recorded, and the only observable symptom would be alarms no longer flowing. The assertion exists to catch a programming error early; this catch defeats it.
**Recommendation:** Either tighten the `InvalidOperationException` catch so it only swallows the STA-runtime-shutting-down sentinel (e.g. match on the exception message produced by `StaRuntime.InvokeAsync`, or have the STA runtime throw a dedicated exception type for shutdown), or rethrow / record-a-fault for `InvalidOperationException`s whose message does not match the shutdown sentinel. Add a regression test that drives `RunAlarmPollLoopAsync` with a handler that throws `InvalidOperationException` from `PollOnce` and asserts the loop records a fault rather than silently exiting.
**Resolution:** 2026-05-20 — Introduced a dedicated `StaRuntimeShutdownException` (`src/MxGateway.Worker/Sta/StaRuntimeShutdownException.cs`) that `StaRuntime.InvokeAsync` and the queue-enqueue path now throw in place of a generic `InvalidOperationException` when `shutdownRequested` is set. `RunAlarmPollLoopAsync` in `MxAccessStaSession.cs:258-291` now catches `StaRuntimeShutdownException` (graceful stop, returns silently) separately from the generic `Exception` arm, which records the fault on the event queue. An STA-affinity `InvalidOperationException` from `EnsureOnAlarmConsumerThread` therefore now falls through to the fault path and becomes observable on the IPC fault path instead of silently terminating alarm delivery. Verified: `dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86` clean (0 warnings). Regression coverage in `MxAccessStaSessionTests.cs` exercises both the graceful-shutdown and the affinity-violation paths.
### Worker-017
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Worker/Sta/StaRuntime.cs:280-288`, `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:602-631` |
| Status | Resolved |
**Description:** `StaRuntime.ProcessQueuedCommands` calls `MarkActivity()` only before and after `workItem.Execute()`. For a command that synchronously holds the STA for longer than `WorkerPipeSessionOptions.HeartbeatGrace` (default 15s) — e.g. `ReadBulk` with many uncached tags, each waiting up to its per-tag `TimeoutMs` (default 1000 ms) — no `MarkActivity()` runs during the wait, `LastActivityUtc` stays frozen, and `ReportWatchdogFaultIfNeededAsync` fires an `StaHung` fault. The heartbeat itself reports `WorkerState.ExecutingCommand` with the live `CurrentCommandCorrelationId`, so the worker actually knows it is executing a command rather than hung — but the watchdog branch only checks `staleFor > HeartbeatGrace` and ignores the in-flight command. A legitimate slow bulk read then self-faults and tears the session down.
**Recommendation:** Either (a) extend `WorkerPipeSession.ReportWatchdogFaultIfNeededAsync` to skip the `StaHung` fault when the snapshot's `CurrentCommandCorrelationId` is non-empty (the worker is executing a command, not hung), or (b) thread a `MarkActivity`-style callback into the bulk-read `pumpStep` so long synchronous STA operations periodically refresh `LastActivityUtc`. Option (a) is the smaller surface — the heartbeat already carries enough signal for the gateway to decide the command is just slow. Either way, the design intent (watchdog catches a hung STA, not a slow command) should be documented on `ReportWatchdogFaultIfNeededAsync`.
**Resolution:** 2026-05-20 — Applied option (a): `WorkerPipeSession.ReportWatchdogFaultIfNeededAsync` (`src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:602-645`) now returns early when `snapshot.CurrentCommandCorrelationId` is non-empty — the STA is busy executing a known command, not hung, and the heartbeat already surfaces the correlation id so the gateway can decide whether the command is too slow against its own per-command timeout. The next `MarkActivity()` after the command returns lifts `LastActivityUtc` and the watchdog resumes normal operation. A new XML doc comment on the method records the design intent (watchdog catches a hung STA, not a slow command). Verified: `dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86` clean. Regression coverage added in `WorkerPipeSessionTests.cs`.
### Worker-018
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:160-161` |
| Status | Resolved |
**Description:** `Subscribe` calls `com.SetXmlAlarmQuery(xmlQuery)` and discards the return value. The block-level comment immediately above states that this call is empirically required for subsequent `GetXmlCurrentAlarms2` to succeed — i.e. it is on the critical path of the alarm subscription. Every other AVEVA-COM call in the same method (`InitializeConsumer`, `RegisterConsumer`, `Subscribe`, `AlarmAckByName`, etc.) is gated on a `!= 0` return-code check and throws `InvalidOperationException` on failure. If `SetXmlAlarmQuery` ever returns non-zero (or otherwise fails non-fatally), the consumer reaches `subscribed = true` with the wnwrap state misconfigured, and the next `PollOnce` fails with the same `E_FAIL` the comment warns about — without any indication where the regression lies.
**Recommendation:** Either (a) check the `SetXmlAlarmQuery` return code and treat a non-zero value as a subscription failure (matching the other call-gates in the method) or (b) document explicitly in the comment that `SetXmlAlarmQuery`'s return code is meaningless on this AVEVA build (referencing `docs/AlarmClientDiscovery.md` if so). At minimum capture the return value in a local for diagnostic purposes so a future failure is easier to triage.
**Re-triage:** The finding's framing assumed an integer return code; inspection of the `Interop.WNWRAPCONSUMERLib` assembly confirmed `SetXmlAlarmQuery` is declared `Void SetXmlAlarmQuery(System.String)` on all three flavors (`IwwAlarmConsumer`, `IwwAlarmConsumer2`, `wwAlarmConsumerClass`). There is no integer return code to gate on. A genuine failure can only surface as a `COMException` mapped from the underlying HRESULT, so the fix wraps the call to translate that into the same `InvalidOperationException` failure-shape used by every other call-gate in `Subscribe`, with the HRESULT included in the diagnostic message.
**Resolution:** 2026-05-20 — `WnWrapAlarmConsumer.Subscribe` now wraps the `com.SetXmlAlarmQuery(xmlQuery)` call in a `try`/`catch (COMException ex)` that throws an `InvalidOperationException` carrying the HRESULT (`$"wwAlarmConsumer.SetXmlAlarmQuery failed with HRESULT 0x{ex.HResult:X8}; subsequent GetXmlCurrentAlarms2 polls would return E_FAIL."`) with the original `COMException` as `InnerException`. A previously silent failure that left `subscribed = true` with misconfigured wnwrap state — and produced an opaque `E_FAIL` from the next `PollOnce` with no indication where the regression lay — now surfaces as a subscription failure at the `Subscribe` call-site, matching the existing v1-lifecycle failure shape. The block comment was extended to record that the interop signature returns `void` (no integer return code to gate on like the sibling v1 calls) so a future maintainer doesn't try to add one. No new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent; the change is structurally analogous to the existing `Initialize/Register/Subscribe` call-gates and is exercised end-to-end by the live alarm smoke path.
### Worker-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:59`, `:188` |
| Status | Resolved |
**Description:** `WnWrapAlarmConsumer` declares `private string subscriptionExpression = string.Empty;` and assigns it once inside `Subscribe` (line 188), but never reads it. It is dead state — neither `PollOnce`, `AcknowledgeByName`, `AcknowledgeByGuid`, `SnapshotActiveAlarms`, nor `Dispose` consults it. Either it is genuinely unused (delete it) or it was intended to support a not-yet-implemented feature (e.g. re-subscribing after a transient failure, or echoing the subscription back through `IsSubscribed`/`SubscriptionExpression`), in which case the intent should be wired up or documented.
**Recommendation:** Delete the field (the safest option — `treatWarningsAsErrors=true` will continue to permit it as long as it's read into; consider promoting it to read-only via an exposed property `SubscriptionExpression` so smoke tests can assert what subscription is active without touching wnwrap state). If a future use is expected, file a follow-up issue.
**Resolution:** 2026-05-20 — Deleted the dead `private string subscriptionExpression = string.Empty;` field declaration and its sole assignment inside `Subscribe` (`subscriptionExpression = subscription;`). The field had no readers and was pure write-only state. Pure cleanup — no behaviour change, no public API surface affected. The worker build remains clean with zero warnings under `TreatWarningsAsErrors=true`.
### Worker-020
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:405`, `:423` |
| Status | Resolved |
**Description:** `ProcessCommandAsync` decides whether to write a command reply with `if (_state is not WorkerState.Ready and not WorkerState.ExecutingCommand)`. The `ExecutingCommand` arm is dead: `_state` is only ever assigned `Starting`, `Handshaking`, `InitializingSta`, `Ready`, `ShuttingDown`, `Faulted`, or `Stopped`. The string `WorkerState.ExecutingCommand` appears nowhere as a target of `_state = ...`. The `WorkerState.ExecutingCommand` value is synthesized only in `CreateHeartbeat` (line 811) when a command is in flight, so it never leaks back into `_state`. The check is effectively `_state is not WorkerState.Ready`. The intent is unclear: either the check should also accept the live "is executing" condition (which today is implicit via `_state == Ready` plus a non-empty `CurrentCommandCorrelationId` from the dispatcher), or the dead arm should be removed for clarity.
**Recommendation:** Simplify the check to `if (_state != WorkerState.Ready)` to match the actual state machine, and update the dropped-reply log fields accordingly. Alternatively, introduce an explicit `WorkerState.ExecutingCommand` transition (set when a command starts dispatching, restored to `Ready` on completion) so the check matches its name. The simpler fix is the former.
**Resolution:** 2026-05-20 — Both occurrences of the `_state is not WorkerState.Ready and not WorkerState.ExecutingCommand` check in `ProcessCommandAsync` (the post-`DispatchAsync` success path and the exception path) were simplified to `_state != WorkerState.Ready`. The `ExecutingCommand` arm was dead — `_state` is never written that value; only `CreateHeartbeat` synthesizes it on the wire when `CurrentCommandCorrelationId` is non-empty. A comment was added at the success-path site documenting the assignment-set of `_state` and why `Ready` is the only command-serving state. No behavioural change — `_state` could never be `ExecutingCommand` at that read, so the simplification preserves the same effective decision while removing the misleading dead arm. No new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent.
### Worker-021
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:111-118`, `:790-805`, `:136-139` |
| Status | Resolved |
**Description:** `RunAsync` constructs the runtime session through `_runtimeSession = _runtimeSessionFactory()` (line 111) and immediately calls `CompleteStartupHandshakeAsync(token => _runtimeSession.StartAsync(...))`. That path is fine. However the public parameterless `CompleteStartupHandshakeAsync()` (line 136) routes through `InitializeMxAccessAsync` (line 790), which unconditionally reassigns `_runtimeSession = new MxAccessStaSession(eq => new AlarmCommandHandler(eq));` — overwriting whatever the factory put there. If anything ever calls `CompleteStartupHandshakeAsync()` after `RunAsync` has already begun, the factory-supplied session is leaked (no `Dispose` is called on the old instance) and a fresh hard-coded `MxAccessStaSession` is started instead. Today no production code path triggers this, but the API surface is public and dangerous — a test or a refactor could trip it.
**Recommendation:** Either (a) make `InitializeMxAccessAsync` a no-op if `_runtimeSession` is already non-null (treat the existing instance as authoritative and only call its `StartAsync`), or (b) make the parameterless `CompleteStartupHandshakeAsync()` and `InitializeMxAccessAsync` `internal` / remove them, since the production path is the factory-driven one in `RunAsync`. Option (b) is cleaner: the parameterless overload is dead in production.
**Resolution:** 2026-05-20 — Applied option (a): `InitializeMxAccessAsync` now uses `_runtimeSession ??= new MxAccessStaSession(eq => new AlarmCommandHandler(eq));`, so the existing factory-supplied instance from `RunAsync` is treated as authoritative and only the fall-back direct-invocation path (where the parameterless `CompleteStartupHandshakeAsync` is called without a prior factory call) constructs the hard-coded `MxAccessStaSession`. The `StartAsync` call and the `catch`-and-dispose path now operate on a local `session` captured from `_runtimeSession`, so a startup failure still disposes the runtime regardless of which path supplied it. A comment in `InitializeMxAccessAsync` documents the reasoning. Option (a) was preferred over (b) because the parameterless `CompleteStartupHandshakeAsync` overload is part of the existing public API surface and tightening it to `internal` would be a contract change with no production driver requesting it. No new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent; the change is exercised end-to-end by the existing `RunAsync` factory path which now goes through the null-coalescing assignment instead of an unconditional `new`.
### Worker-022
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Worker/MxAccess/MxAlarmSnapshot.cs:12`, `:26`, `:49` |
| Status | Resolved |
**Description:** `MxAlarmSnapshot.cs` declares three public types in one file: the `MxAlarmStateKind` enum, the `MxAlarmSnapshotRecord` class, and the `MxAlarmTransitionEvent` class. The C# style guide (`docs/style-guides/CSharpStyleGuide.md:68`) requires one public type per file unless a small nested type is clearer. The recently resolved Worker-014 split `IAlarmCommandHandler` out of `AlarmCommandHandler.cs` for exactly this reason — the same convention applies here.
**Recommendation:** Move `MxAlarmStateKind` and `MxAlarmTransitionEvent` into their own files (`MxAlarmStateKind.cs`, `MxAlarmTransitionEvent.cs`) and leave `MxAlarmSnapshotRecord` in `MxAlarmSnapshot.cs` (or rename the file to `MxAlarmSnapshotRecord.cs` to match the surviving type). Pure file-organization change; no behaviour or namespace impact.
**Resolution:** 2026-05-20 — Split `MxAlarmSnapshot.cs` into three files, each declaring one public type and keeping the original `MxGateway.Worker.MxAccess` namespace so existing usages are unaffected: `MxAlarmStateKind.cs` (the enum, with its XML doc), `MxAlarmTransitionEvent.cs` (the `EventArgs` subclass, with its `PreviousState` doc), and `MxAlarmSnapshot.cs` (now containing only `MxAlarmSnapshotRecord` plus its XML doc). Matches the one-public-type-per-file convention re-affirmed by Worker-014's `IAlarmCommandHandler` split. Pure file-organization change — no API, namespace, or behaviour change; build is clean.
### Worker-023
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:610-668`, `src/MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:124-153` |
| Status | Resolved |
**Description:** Worker-017 (resolved at `a020350`) suppresses the `StaHung` watchdog when `CurrentCommandCorrelationId` is non-empty: "the STA is busy executing a command, not hung." The fix is correct for the motivating case (legitimately slow `ReadBulk` against many uncached tags) — gateway-side per-command timeouts (`WorkerClient.InvokeAsync`'s `timeout` parameter, see `src/MxGateway.Server/Workers/WorkerClient.cs:189-218`) eventually fail the command and may kill the worker. **But the suppression has no defensive ceiling.** Most MXAccess commands in `MxAccessCommandExecutor``Register`, `AddItem`, `Advise`, `Write`, `WriteSecured`, and their bulk variants — call directly into the MXAccess COM object **with no internal deadline**. If a COM call hangs (e.g. the MXAccess provider crashed and the cross-apartment marshaler is permanently blocked, or a write completion never fires), `StaRuntime.ProcessQueuedCommands` is stuck inside `workItem.Execute()`, `StaCommandDispatcher.currentCommandCorrelationId` stays non-empty forever, and `ReportWatchdogFaultIfNeededAsync` will short-circuit on every heartbeat. The worker-side `StaHung` watchdog — the only signal that distinguishes a hung STA from a slow gateway response from inside the worker — is permanently defeated for that session. Gateway-side `CommandTimeout` is the safety net, but it depends on the gateway operator picking a sensible per-command timeout (some bulk operations legitimately set this to many minutes), and it does not surface a worker-originated diagnostic (`StaHung` fault category, `LastStaActivityUtc` value) to the gateway audit trail.
**Recommendation:** Add a defensive upper bound, distinct from `HeartbeatGrace`, after which the watchdog fires even when a command is in flight — e.g. `HeartbeatStuckCeiling` (default 5× `HeartbeatGrace` = 75s, or align with the longest reasonable per-command timeout). Pseudocode for the in-flight branch:
```csharp
if (!string.IsNullOrEmpty(snapshot.CurrentCommandCorrelationId)
&& staleFor <= _sessionOptions.HeartbeatStuckCeiling)
{
return; // slow command — gateway will time out if needed
}
// staleFor > ceiling OR no command in flight — fire StaHung
```
Document the ceiling in `MxAccessWorkerInstanceDesign.md`'s watchdog section. Add a regression test that drives `RunAsync` with `CurrentCommandCorrelationId` non-empty and `LastStaActivityUtc` stale beyond the ceiling, asserting `WorkerFaultCategory.StaHung` is emitted.
**Resolution:** 2026-05-20 — Added `WorkerPipeSessionOptions.HeartbeatStuckCeiling` (default 75s = 5 × `HeartbeatGrace`) and extended `WorkerPipeSession.ReportWatchdogFaultIfNeededAsync` so the in-flight-command suppression is bounded by the ceiling: once `staleFor > HeartbeatStuckCeiling` the watchdog fires `StaHung` even with `CurrentCommandCorrelationId` non-empty. A truly stuck synchronous COM call (dead provider, blocked marshaler) no longer permanently defeats the worker-side watchdog. The ceiling is validated at startup (`> 0` and `> HeartbeatGrace`). Documented in the new XML doc on `HeartbeatStuckCeiling` and in `docs/MxAccessWorkerInstanceDesign.md`'s "Heartbeat And Watchdog" section. Regression test `WorkerPipeSessionTests.RunAsync_WhenStaActivityIsStaleBeyondCeilingWithCommandInFlight_WritesWatchdogFault` drives `RunAsync` with a non-empty current-command id and stale activity beyond the ceiling, asserting `WorkerFaultCategory.StaHung` is emitted. The existing `RunAsync_WhenStaActivityIsStaleWithCommandInFlight_DoesNotWriteWatchdogFault` test (5s stale, default 75s ceiling) continues to pass, confirming the suppression still works within the ceiling.
### Worker-024
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:63-187`, `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:191-323` |
| Status | Resolved |
**Description:** Worker-008 (resolved 2026-05-18) introduced `MxAccessStaSession.AssertOnAlarmConsumerThread(int?, int)`, called from `EnsureOnAlarmConsumerThread()` in the marshalled poll lambda at `RunAlarmPollLoopAsync` (`MxAccessStaSession.cs:247`). The assertion catches a regression that runs `IMxAccessAlarmConsumer.PollOnce()` off the STA — exactly the deadlock-on-cross-apartment-marshaling risk the `ThreadingModel=Apartment` wnwrap consumer demands. **However, the assertion guards only the poll path.** `AlarmCommandHandler.Subscribe`, `Acknowledge`, `AcknowledgeByName`, `QueryActive`, and `Unsubscribe` — each of which calls into the same `IMxAccessAlarmConsumer` and ultimately the COM object — have no equivalent guard. Today they are reached only through `MxAccessCommandExecutor.Execute``StaCommandDispatcher.ExecuteQueuedCommandAsync``staRuntime.InvokeAsync(...)`, so they do run on the STA in production. But the invariant is enforced only by *convention* (the same convention Worker-008 made explicit for `PollOnce`); a future refactor that lets a test or a refactored fast-path call into the handler off-STA would silently break the same apartment rule, and the wnwrap COM call would block on marshaling rather than fail loudly.
**Recommendation:** Add an `EnsureOnAlarmConsumerThread()`-equivalent assertion at the entry of each `AlarmCommandHandler` operation that touches the consumer (`Subscribe` is the highest-value site because it constructs the consumer; `Acknowledge*` and `QueryActive` next). Reuse `MxAccessStaSession.AssertOnAlarmConsumerThread` so the affinity invariant has a single canonical guard. Wire the expected thread id through the handler's constructor (today `AlarmCommandHandler` does not know the STA thread id — `MxAccessStaSession` captures it at line 191 but does not pass it). One implementation shape: hand the handler a small `IThreadAffinityGuard` whose `Verify()` is called at each entry, constructed by `MxAccessStaSession` once `alarmConsumerThreadId` is captured.
**Resolution:** 2026-05-20 — Extended `AlarmCommandHandler` with a third constructor that takes an optional `Action? threadAffinityCheck`, and invoked the guard at the entry of every method that touches the underlying `IMxAccessAlarmConsumer`: `Subscribe`, `Unsubscribe`, `Acknowledge`, `AcknowledgeByName`, `QueryActive`, and `PollOnce`. The factory signature on `MxAccessStaSession` was widened from `Func<MxAccessEventQueue, IAlarmCommandHandler>` to `Func<MxAccessEventQueue, Action, IAlarmCommandHandler>`, so `MxAccessStaSession` (which captures `alarmConsumerThreadId` at the factory call site, already running inside `staRuntime.InvokeAsync`) can pass its existing `EnsureOnAlarmConsumerThread` as the guard — keeping the affinity invariant on a single canonical check, `AssertOnAlarmConsumerThread`. `WorkerPipeSession`'s three factory wiring sites were updated to `(eq, affinity) => new AlarmCommandHandler(eq, () => new WnWrapAlarmConsumer(), affinity)`. The previous two-arg `AlarmCommandHandler` constructor remains (now delegating with `threadAffinityCheck: null`) so existing `AlarmCommandHandlerTests` continue to exercise the handler on a single thread without configuring a guard. Regression tests `AlarmCommandHandlerTests.EveryCommandPathEntry_InvokesThreadAffinityGuard` (counts invocations across all six entry points) and `EveryCommandPathEntry_PropagatesAffinityGuardException` (a throwing guard propagates from every entry point) verify the wiring.
### Worker-025
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:111-117` |
| Status | Resolved |
**Description:** `RunAsync` assigns `_runtimeSession = _runtimeSessionFactory()` (line 111) and immediately dereferences `_runtimeSession.StartAsync(...)` inside the lambda at line 115. If the supplied factory ever returns `null`, the lambda will throw `NullReferenceException` rather than a diagnostic exception, and the `finally` block at line 128 (`_runtimeSession?.Dispose()`) silently no-ops. The production factories (`() => new MxAccessStaSession(...)` in the two convenience constructors) never return null, but the factory delegate type `Func<IWorkerRuntimeSession>` admits null returns and the constructor's `runtimeSessionFactory ?? throw` null-check at line 102 only validates the delegate itself, not its return value. The `InitializeMxAccessAsync` direct-invocation path uses `_runtimeSession ??= new MxAccessStaSession(...)` (line 840), so a null factory return there would be replaced with a default instance — different behavior from the `RunAsync` path.
**Recommendation:** Promote the null check to the call site:
```csharp
_runtimeSession = _runtimeSessionFactory()
?? throw new InvalidOperationException("Worker runtime session factory returned null.");
```
Match the pattern `AlarmCommandHandler.Subscribe` already uses for `consumerFactory()` (`AlarmCommandHandler.cs:76-77`).
**Resolution:** 2026-05-20 — `WorkerPipeSession.RunAsync` now uses `_runtimeSession = _runtimeSessionFactory() ?? throw new InvalidOperationException("Worker runtime session factory returned null.");`, matching the pattern `AlarmCommandHandler.Subscribe` uses for its `consumerFactory()`. A null factory return now produces a clear diagnostic exception at the call site instead of NRE-ing on the next dereference (and the `finally` block's `_runtimeSession?.Dispose()` silently no-oping on a half-initialized session). Regression test `WorkerPipeSessionTests.RunAsync_WhenRuntimeSessionFactoryReturnsNull_ThrowsDiagnosticException` drives `RunAsync` with `() => null!` and asserts the diagnostic `InvalidOperationException` is thrown with the expected message.