Joseph Doherty a0203503a7 Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules

Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and
fixed them in three priority waves (3 High, 17 Medium, 52 Low).

Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
  GatewayGrpcScopeResolver so non-admin keys can use them; document
  the mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the
  CLI FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression
  in generated tonic code by reformatting the ReadBulkCommand proto
  comment and scoping a #![allow(...)] to the generated submodules.

Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and
  make DisposeAsync race-safe against in-flight CloseAsync (-016);
  add constraint-enforcement test coverage for the bulk-plan path
  (-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop
  can distinguish graceful shutdown from a real STA-affinity
  violation (-016); have the watchdog skip StaHung while
  CurrentCommandCorrelationId is non-empty so a legitimate slow
  ReadBulk no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the
  11 GatewaySession bulk methods (-013); replace the real TCP probe
  in GalaxyHierarchyCacheTests with an IGalaxyRepository fake
  (-016).
- IntegrationTests: drive the StreamEvents writer in the live Write
  test and assert OnWriteComplete (-012); add live tests for
  Unadvise/RemoveItem/Unregister ordering, WriteSecured, and
  abnormal worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
  CreateForTesting factory (-016); cover WorkerCancel and
  unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races
  beforeStart() (-014); return a CancellingCompletableFuture that
  actually forwards cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in
  the CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
  histograms with failed-call durations (-015); add coverage for
  the five MalformedReply paths, the bulk-write helpers, the
  Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write
  command family (-009).

Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
  WorkerAlarmRpcDispatcher missing-session handling; drop the
  duplicate dashboard @page routes; refresh IAlarmRpcDispatcher
  XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
  subscriptionExpression / ExecutingCommand arms; preserve
  factory-supplied runtime sessions; split MxAlarmSnapshot.cs into
  three files.
- Tests: dispose the WebApplication in seven test classes; rebuild
  FakeWorkerProcess.WaitForExitAsync against a real TaskCompletion
  source; switch the heartbeat-expires test to ManualTimeProvider;
  add InvariantCulture to the remaining DateTimeOffset.Parse sites;
  document GalaxyFilterInputSafetyTests in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
  IDisposable, class-level [Trait], single-source ZB default
  connection string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact
  so absent env vars SKIP not pass; PascalCase rename of probe
  [Fact]s; deterministic deadline test; new frame-protocol error
  tests; ComputeTransitions diff-coverage; relocate dev-rig probes
  to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
  Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
  TreatWarningsAsErrors / analysers apply; document
  DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
  bulk-read handles in CLI; surface AcknowledgeAlarm transport
  faults through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
  runWriteBulkVariant; document the six new subcommands in
  writeUsage; drain galaxy-watch events on limit; switch io.EOF
  comparisons to errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout
  option; regex-based credential redaction; Long.toUnsignedString
  for uint64 sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
  _percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
  _api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close
  stop hard-coding correlation IDs; resync RustClientDesign.md
  with the current Session / Error surface and CLI subcommand set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 09:46:47 -04:00

Code Review — Worker

| Field | Value |
|---|---|
| Module | `src/MxGateway.Worker` |
| Reviewer | Claude Code |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |

Checklist coverage

This table reflects the 2026-05-20 re-review at commit 1cd51bb. Worker-001..015 are all closed, so the table summarises only the new findings filed against this branch.

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Worker-018 (`SetXmlAlarmQuery` return code ignored), Worker-019 (`subscriptionExpression` is write-only dead state), Worker-020 (dead `ExecutingCommand` arm in `ProcessCommandAsync` state check), Worker-021 (`InitializeMxAccessAsync` can overwrite an already-set `_runtimeSession`). |
| 2 | mxaccessgw conventions | Issue found: Worker-022 (`MxAlarmSnapshot.cs` declares three public types in one file). |
| 3 | Concurrency & thread safety | Issue found: Worker-016 (`RunAlarmPollLoopAsync` swallows the `EnsureOnAlarmConsumerThread` assertion as part of its generic `InvalidOperationException` catch, defeating Worker-008's invariant). |
| 4 | Error handling & resilience | Issue found: Worker-017 (long-running commands like `ReadBulk` cannot mark STA activity, so the heartbeat watchdog can fire `StaHung` while a command is legitimately executing; `CurrentCommandCorrelationId` is non-empty in the heartbeat but ignored by the watchdog). |
| 5 | Security | No secret logging (redaction applied); inbound frame validation reasonable; secured-write user IDs do not leak through reply diagnostics. No new issues found. |
| 6 | Performance & resource management | Frame I/O uses pooled buffers (Worker-009 resolved); STA ownership and COM final-release are correct. No new issues found. |
| 7 | Design-document adherence | Code matches `gateway.md` / `MxAccessWorkerInstanceDesign.md` / `WorkerFrameProtocol.md`. No new design drift. |
| 8 | Code organization & conventions | Issue found: Worker-022 (see row 2). |
| 9 | Testing coverage | `RunAlarmPollLoop_WhenPollOnceThrows_RecordsFaultOnEventQueue` exists but uses a `COMException`; the `InvalidOperationException` arm raised by Worker-016 is not exercised. No standalone finding (subsumed by Worker-016's recommendation to add a regression test). |
| 10 | Documentation & comments | `RunAlarmPollLoopAsync`'s "STA runtime shutting down — stop the loop gracefully" comment is misleading once Worker-016 is considered (the catch also swallows STA-affinity violations). Noted in Worker-016. |

Findings

Worker-001

| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:204-207` |
| Status | Resolved |

Description: When constructed with pollIntervalMilliseconds > 0, Subscribe starts a System.Threading.Timer whose OnPoll callback runs PollOnce() — which calls wwAlarmConsumerClass.GetXmlCurrentAlarms2 — on a thread-pool thread. The wnwrap CLSID is registered ThreadingModel=Apartment; calling its methods off the owning STA violates the hard rule that all COM calls happen on the dedicated STA thread, and can deadlock on cross-apartment marshaling when the STA is not pumping. The production path (default constructor, interval 0) is safe, but the public 3-arg constructor leaves this footgun callable, and tests/live-smoke use it.

Recommendation: Remove the internal Timer entirely (production already drives PollOnce from the STA), or document and gate it so it can only be used from an STA thread. At minimum, make the timer-driven mode unreachable from any production wiring.

Resolution: 2026-05-18 — Removed the off-STA timer infrastructure from WnWrapAlarmConsumer: the Timer? pollTimer and pollIntervalMs fields, the DefaultPollIntervalMilliseconds constant, the OnPoll callback, the timer-arming arm in Subscribe, and the timer disposal block in Dispose. The pollIntervalMilliseconds parameter is gone from both public constructors (the test-seam ctor is now 2-arg: wwAlarmConsumerClass + maxAlarmsPerFetch), so the off-STA footgun is structurally unreachable. PollOnce() remains the public STA-driven entry point. The stale "poll … on a timer below" comment was corrected. Verified by the regression tests WnWrapAlarmConsumer_has_no_internal_timer_field and WnWrapAlarmConsumer_exposes_no_poll_interval_constructor_parameter; the AlarmsLiveSmokeTests call site was updated to the 2-arg constructor.

Worker-002

| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:545-549` |
| Status | Resolved |

Description: RunHeartbeatLoopAsync calls await Task.Delay(_sessionOptions.HeartbeatInterval, ...) before sending the first heartbeat. The gateway therefore receives no heartbeat for the first full interval (default 5s) after the worker reaches Ready. If the gateway's liveness watchdog expects a heartbeat sooner, a healthy worker can be misclassified as hung at startup.

Recommendation: Send an initial heartbeat immediately on entering the loop, or move the Task.Delay to the end of the loop body.

Resolution: 2026-05-18 — Restructured RunHeartbeatLoopAsync so the Task.Delay(HeartbeatInterval) is applied between beats only, not before the first. A firstBeat guard skips the delay on the initial iteration, so the gateway sees a heartbeat as soon as the worker is Ready; cancellation behavior is preserved (the loop still observes the token and the delay still throws on cancellation). Verified by the regression test RunAsync_SendsFirstHeartbeatImmediatelyOnEnteringLoop. Three pre-existing tests (WorkerPipeClientTests.RunAsync_ConnectsToPipeAndCompletesHandshake, WorkerPipeClientTests.RunAsync_RetriesUntilPipeServerAppears, WorkerPipeSessionTests.RunAsync_WhenCommandThrowsAfterShutdown_DropsLateFaultAndWritesShutdownAck) assumed strict frame ordering and were updated to skip the now-interleaved first heartbeat while still asserting the same shutdown-ack behavior.
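
The restructured loop can be sketched as follows. This is a minimal illustration, not the worker's actual code: `HeartbeatLoopSketch`, `sendHeartbeatAsync`, and `interval` are hypothetical names, and only the `firstBeat` guard and the delay-between-beats ordering come from the resolution text.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class HeartbeatLoopSketch
{
    // Hypothetical stand-in for the heartbeat loop. The delegate and the
    // interval are illustrative parameters, not the real session members.
    public static async Task RunAsync(
        Func<CancellationToken, Task> sendHeartbeatAsync,
        TimeSpan interval,
        CancellationToken cancellationToken)
    {
        bool firstBeat = true;
        while (!cancellationToken.IsCancellationRequested)
        {
            if (!firstBeat)
            {
                // Delay only between beats. Task.Delay throws an
                // OperationCanceledException when the token is cancelled.
                await Task.Delay(interval, cancellationToken).ConfigureAwait(false);
            }

            firstBeat = false;
            await sendHeartbeatAsync(cancellationToken).ConfigureAwait(false);
        }
    }
}
```

With this shape the first heartbeat is written as soon as the loop is entered, and every later beat is preceded by exactly one interval.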

Worker-003

| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:399-403`, `:416-419` |
| Status | Resolved |

Description: ProcessCommandAsync checks _state after DispatchAsync completes and silently returns without writing a WorkerCommandReply (or fault) when _state is not Ready/ExecutingCommand. _state is a plain field mutated from multiple tasks (heartbeat loop, event-drain loop, shutdown). A command that completes successfully while _state has transitioned will have its reply dropped with no diagnostic, and the gateway's correlation-id wait then hangs until its own timeout. The _state read is also not synchronized.

Recommendation: Always attempt to write the reply/fault for an in-flight command, or explicitly reject in-flight commands with a Canceled/WorkerUnavailable reply during state transitions. Make _state access thread-safe (volatile or locked).

Resolution: 2026-05-18 — Both silent-drop return sites in ProcessCommandAsync (the post-DispatchAsync success path and the exception path) now call a new LogCommandResultDropped helper before returning. The helper logs an Information event named WorkerCommandResultDropped via the session's IWorkerLogger, carrying the command's correlation_id plus command_method and worker_state, so a stuck gateway correlation-id wait is now traceable. The _state field was made volatile (WorkerState is an int-backed protobuf enum, so volatile is valid) so cross-thread reads observe the latest value without tearing; this is a low-risk, non-behavioral change and did not destabilize any test. Verified by the regression test RunAsync_WhenReplyIsDroppedAfterShutdown_LogsDiagnostic.

Worker-004

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:565-588` |
| Status | Resolved |

Description: After ReportWatchdogFaultIfNeededAsync sends an StaHung fault, the heartbeat loop continues sending normal heartbeats with State derived from _state, which the watchdog path never sets to Faulted. The heartbeat then keeps reporting a non-faulted state that contradicts the fault just sent.

Recommendation: Set _state = WorkerState.Faulted (thread-safely) when the watchdog fault fires so heartbeat state and fault stay consistent.

Resolution: 2026-05-18 — ReportWatchdogFaultIfNeededAsync now sets _state = WorkerState.Faulted immediately after _watchdogFaultSent = true and before the StaHung fault is written, so the next heartbeat reports Faulted instead of contradicting the fault. _state is already volatile (Worker-003), so the cross-thread write from the heartbeat loop is observed correctly by the heartbeat's own CreateHeartbeat read; no further locking is required. Verified by the regression test WorkerPipeSessionTests.RunAsync_AfterWatchdogFault_HeartbeatReportsFaultedState, which uses a stale-activity snapshot with an empty current-command correlation id so the heartbeat State is derived from _state rather than forced to ExecutingCommand.

Worker-005

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-258` (production alarm poll loop) |
| Status | Resolved |

Description: OnPoll catches every exception from PollOnce() and discards it (_ = ex;). The production poll path (MxAccessStaSession.RunAlarmPollLoopAsync → AlarmCommandHandler.PollOnce → AlarmDispatcher.PollOnce → consumer.PollOnce()) has no fault recording either. A permanently failing alarm provider (e.g. GetXmlCurrentAlarms2 returning E_FAIL, malformed XML throwing in XmlDocument.LoadXml) is therefore completely silent — no fault on the event queue, no log.

Recommendation: Route poll failures to MxAccessEventQueue.RecordFault (or a logger) so a broken alarm subscription becomes observable. Update the now-stale comment.

Re-triage: The cited location WnWrapAlarmConsumer.cs:297-313 and the OnPoll callback no longer exist as of this branch — Worker-001 removed the off-STA Timer and its OnPoll callback entirely. The substantive concern still held, however: the production poll path in MxAccessStaSession.RunAlarmPollLoopAsync caught only OperationCanceledException, ObjectDisposedException, and InvalidOperationException. A genuine poll failure (COMException from GetXmlCurrentAlarms2, a malformed-XML XmlException) escaped uncaught, faulted the never-awaited Task.Run poll task, and was silently lost — exactly the silent-failure the finding describes. The finding was re-pointed at the live location and fixed there rather than at the removed OnPoll.

Resolution: 2026-05-18 — RunAlarmPollLoopAsync gained a trailing catch (Exception exception) arm after the three graceful-stop catches. A real alarm-poll failure is now converted to a WorkerFault (category MxaccessEventConversionFailed, carrying the exception type and, for a COMException, its HResult) by the new CreateAlarmPollFault helper and recorded on the session's MxAccessEventQueue via RecordFault. The worker's event-drain loop drains that fault and forwards it to the gateway, so a broken alarm subscription is now observable on the IPC fault path instead of vanishing. The poll loop still stops after the failure (the subscription is dead). No new proto enum value was added — MxaccessEventConversionFailed is the closest existing alarm-path category, avoiding a contracts regeneration across all clients. Verified by the regression test MxAccessStaSessionTests.RunAlarmPollLoop_WhenPollOnceThrows_RecordsFaultOnEventQueue.

Worker-006

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:117-124`, `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:386-491` |
| Status | Resolved |

Description: RunAsync's finally calls _runtimeSession?.Dispose() unless _shutdownTimedOut. On the normal path ShutdownGracefullyAsync already disposed the STA runtime, so re-entering Dispose() is a harmless no-op only because ShutdownGracefullyAsync reached its end and set disposed = true. If ShutdownGracefullyAsync throws TimeoutException after partial teardown with _shutdownTimedOut set, the session is never disposed at all — the finally skips it — leaking the STA thread and COM object, leaving cleanup to rely solely on process exit.

Recommendation: Make the dispose decision explicit and confirm process exit always follows a timed-out shutdown; otherwise dispose defensively. At minimum document why disposal is deliberately skipped on timeout.

Resolution: 2026-05-18 — RunAsync's finally now always calls _runtimeSession?.Dispose(); the if (!_shutdownTimedOut) guard and the _shutdownTimedOut field (which had become write-only) were removed. MxAccessStaSession.Dispose is idempotent (if (disposed) return) and bounded — each STA join is capped with Wait(TimeSpan.FromSeconds(2)) — so re-entering it on the normal path (where ShutdownGracefullyAsync already disposed the runtime) is a harmless no-op, while on the timed-out path it is now the only thing that reclaims the STA thread and releases the MXAccess COM object. The previous behaviour leaked both on a shutdown timeout and relied solely on process exit. A code comment in the finally block documents the reasoning. Verified by the regression test WorkerPipeSessionTests.RunAsync_WhenShutdownTimesOut_StillDisposesRuntimeSession, which forces a TimeoutException from ShutdownGracefullyAsync and asserts the runtime session is disposed before RunAsync rethrows.
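
A minimal sketch of the always-dispose pattern, assuming a session whose teardown is guarded by a `disposed` flag and a bounded thread join. `BoundedSessionSketch` and `ShutdownSketch` are hypothetical stand-ins; the real `MxAccessStaSession` also releases COM objects, which is omitted here, and the 2-second cap is taken from the resolution text.

```csharp
using System;
using System.Threading;

public sealed class BoundedSessionSketch : IDisposable
{
    private readonly ManualResetEventSlim _stopRequested = new ManualResetEventSlim(false);
    private readonly Thread _thread;
    private bool _disposed;

    public BoundedSessionSketch()
    {
        // Stand-in for the dedicated STA thread: it parks until asked to stop.
        _thread = new Thread(() => _stopRequested.Wait()) { IsBackground = true };
        _thread.Start();
    }

    // Counts real teardowns so idempotency is observable in a test.
    public int TeardownCount { get; private set; }

    public void Dispose()
    {
        if (_disposed)
        {
            return; // Idempotent: a second call is a no-op.
        }

        _disposed = true;
        TeardownCount++;
        _stopRequested.Set();
        _thread.Join(TimeSpan.FromSeconds(2)); // Bounded: never blocks indefinitely.
    }
}

public static class ShutdownSketch
{
    // Dispose runs in finally whether or not graceful shutdown throws.
    public static void Run(BoundedSessionSketch session, Action shutdownGracefully)
    {
        try
        {
            shutdownGracefully();
        }
        finally
        {
            session.Dispose();
        }
    }
}
```

Because `Dispose` is both idempotent and bounded, calling it unconditionally costs nothing on the normal path and reclaims the thread on the timed-out path. This sketch assumes `Dispose` is not called concurrently from two threads.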

Worker-007

| Field | Value |
|---|---|
| Severity | Medium |
| Category | mxaccessgw conventions |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessComServer.cs:130-150` |
| Status | Resolved |

Description: Invoke uses late-bound Type.InvokeMember reflection as a fallback when the COM object does not cast to ILMXProxyServer*. In production the object is always LMXProxyServerClass, so the reflection path exists only for test doubles — it is dead/untested code on the production path and obscures the interface contract. params object[] arguments also boxes value-type handles on every call.

Recommendation: Drop the reflection fallback and require the COM object to implement the interface (tests can supply a typed fake), or clearly mark the fallback as test-only.

Re-triage: The finding's claim that the reflection path is "dead/untested code" is partly inaccurate — it was in fact the path exercised by the entire MxAccessCommandExecutorTests suite, whose FakeMxAccessComObject did not implement any typed interface. So the reflection fallback was test-only but not untested. The convention concern (bypassing the typed interface contract, boxing value-type handles) is valid, so the fix follows the recommendation's first option.

Resolution: 2026-05-18 — The late-bound Type.InvokeMember reflection fallback and its params object[]-boxing Invoke helper were removed from MxAccessComServer. Each adapter method now takes one of two typed paths: an is IMxAccessServer fast path (test fakes implement IMxAccessServer directly) and the production path that casts to the typed ILMXProxyServer / ILMXProxyServer3 / ILMXProxyServer4 COM interfaces via new AsProxyServer* helpers. A COM object implementing neither now fails fast with a clear InvalidOperationException naming the missing interface, instead of an opaque late-bound call. The test seam was migrated accordingly: MxAccessCommandExecutorTests.FakeMxAccessComObject now declares : IMxAccessServer (its method signatures already matched the interface exactly, so no behavioural change). Verified by the new MxAccessComServerTests (typed-server routing, untyped-object rejection, original-exception propagation — no more TargetInvocationException wrapping) plus the unchanged, still-passing MxAccessCommandExecutorTests suite which now exercises the typed IMxAccessServer path.
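
The fail-fast typed dispatch can be illustrated with a small sketch. Everything here is hypothetical: `ITypedServerSketch` stands in for `IMxAccessServer` or the `ILMXProxyServer*` COM interfaces, and `Register` is an invented method used only to show the pattern of casting to a typed interface and rejecting anything else with a descriptive exception.

```csharp
using System;

// Hypothetical typed contract standing in for the real server interfaces.
public interface ITypedServerSketch
{
    int Register(string clientName);
}

public static class TypedDispatchSketch
{
    public static int Register(object comObject, string clientName)
    {
        // Typed path: no reflection, no object[] boxing of arguments.
        if (comObject is ITypedServerSketch typed)
        {
            return typed.Register(clientName);
        }

        // Fail fast with the missing interface named in the message,
        // instead of attempting an opaque late-bound call.
        string actualType = comObject?.GetType().FullName ?? "null";
        throw new InvalidOperationException(
            $"Object of type {actualType} does not implement {nameof(ITypedServerSketch)}.");
    }
}
```

An exception thrown by the typed implementation propagates unchanged, whereas a `Type.InvokeMember` call would surface it wrapped in a `TargetInvocationException`.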

Worker-008

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-249`, `:429-447` |
| Status | Resolved |

Description: RunAlarmPollLoopAsync correctly marshals handler.PollOnce() onto the STA via staRuntime.InvokeAsync, and the cancel/await/dispose ordering in ShutdownGracefullyAsync is sound. However, nothing enforces that the consumerFactory and all IMxAccessAlarmConsumer calls run on the STA thread; a future caller could break STA affinity silently.

Recommendation: Add an assertion or documented invariant that the consumer factory and all IMxAccessAlarmConsumer calls run on the STA thread, mirroring the existing MxAccessSession.CreationThreadId pattern.

Resolution: 2026-05-18 — MxAccessStaSession now records the STA thread id (alarmConsumerThreadId) at the point the alarm-command-handler factory is invoked — which already runs inside staRuntime.InvokeAsync during StartAsync, mirroring the MxAccessSession.CreationThreadId capture. RunAlarmPollLoopAsync's marshalled poll lambda now calls EnsureOnAlarmConsumerThread() before handler.PollOnce(), asserting the poll runs on the recorded STA thread. The check is delegated to a new internal static guard AssertOnAlarmConsumerThread(int? expected, int actual) that throws a descriptive InvalidOperationException on an affinity violation and is a no-op when the consumer thread is unrecorded (no alarm handler configured). Making the guard static and internal keeps it directly unit-testable. The STA-affinity invariant is documented in the guard's XML doc. Verified by the regression tests MxAccessStaSessionTests.AssertOnAlarmConsumerThread_WhenOffOwningThread_Throws and AssertOnAlarmConsumerThread_OnOwningThreadOrUnset_DoesNotThrow.
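
A guard of this shape might look like the sketch below. The `(int? expected, int actual)` signature mirrors the one named in the resolution; the class name, the exception message, and the use of `Environment.CurrentManagedThreadId` to capture the owning thread are assumptions made for illustration.

```csharp
using System;

public static class ThreadAffinityGuardSketch
{
    // Throws when a consumer thread was recorded and the caller is on a
    // different thread. A null expected id means no consumer is configured,
    // so the check is a no-op.
    public static void AssertOnAlarmConsumerThread(int? expected, int actual)
    {
        if (!expected.HasValue)
        {
            return;
        }

        if (expected.Value != actual)
        {
            throw new InvalidOperationException(
                $"Alarm consumer call on thread {actual}, but the consumer is owned by thread {expected.Value}.");
        }
    }
}
```

A caller would record `Environment.CurrentManagedThreadId` once, at the point the consumer is created on the STA, and pass the current value on each poll. Keeping the guard static and free of instance state is what lets a unit test exercise both branches without starting an STA thread.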

Worker-009

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Worker/Ipc/WorkerFrameReader.cs:31,49`, `src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:57-58` |
| Status | Resolved |

Description: Every frame read allocates a fresh 4-byte length buffer and a payload byte[]; every write allocates ToByteArray() plus a 4-byte prefix. On the hot event-drain path (batches of up to 128 WorkerEvent frames every 25 ms) this produces steady gen-0 garbage. WorkerFrameWriter also effectively serializes twice (CalculateSize() then ToByteArray()).

Recommendation: Reuse a pooled buffer / ArrayPool<byte> for the length prefix and payload, and write directly into a pooled buffer using CodedOutputStream. Low priority unless event throughput is high.

Resolution: 2026-05-18 — WorkerFrameWriter.WriteAsync now serializes the envelope exactly once into a single frame buffer that carries the 4-byte length prefix followed by the payload, via envelope.WriteTo(new Span<byte>(frame, sizeof(uint), payloadLength)). This eliminates the redundant second serialization pass (ToByteArray() re-runs CalculateSize() internally), the separate length-prefix array, and the separate prefix WriteAsync/extra FlushAsync round. WorkerFrameReader.ReadAsync now rents its payload buffer from ArrayPool<byte>.Shared and returns it in a finally once WorkerEnvelope.Parser.ParseFrom(payload, 0, length) has copied what it needs; ReadExactlyOrThrowAsync gained an explicit count parameter so it honours the logical frame length rather than the (possibly larger) rented buffer length. The 4-byte length-prefix buffer is left as a per-call stack-sized allocation — pooling a 4-byte array is not worthwhile. Verified by the new regression test WorkerFrameProtocolTests.ReadAsync_WithVaryingFrameSizes_ParsesEachFrameExactly, which reads a large frame followed by a small frame through one reader to prove the pooled buffer is sliced to each frame's own length and never leaks stale trailing bytes; the existing round-trip, malformed-payload, and concurrent-write tests continue to pass.
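
The pooled read path can be sketched roughly as below. This is an illustration under stated assumptions rather than the real `WorkerFrameReader`: the little-endian prefix encoding, the generic `parse` delegate (standing in for `WorkerEnvelope.Parser.ParseFrom`), and all type and method names are hypothetical, and a production reader would also cap the accepted frame length.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class PooledFrameReaderSketch
{
    public static async Task<T> ReadFrameAsync<T>(
        Stream stream,
        Func<byte[], int, int, T> parse,
        CancellationToken cancellationToken)
    {
        // The 4-byte prefix is a small per-call allocation; pooling it is not worthwhile.
        byte[] prefix = new byte[sizeof(uint)];
        await ReadExactlyAsync(stream, prefix, prefix.Length, cancellationToken).ConfigureAwait(false);
        int length = DecodeLength(prefix);

        // Rent may return a larger array than requested, so every later
        // call must use the logical length, never payload.Length.
        byte[] payload = ArrayPool<byte>.Shared.Rent(length);
        try
        {
            await ReadExactlyAsync(stream, payload, length, cancellationToken).ConfigureAwait(false);
            // parse must copy what it needs: the buffer goes back to the pool below.
            return parse(payload, 0, length);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(payload);
        }
    }

    // Assumption: the prefix is little-endian; the source does not state the byte order.
    private static int DecodeLength(byte[] prefix)
    {
        return checked((int)BinaryPrimitives.ReadUInt32LittleEndian(prefix));
    }

    private static async Task ReadExactlyAsync(
        Stream stream, byte[] buffer, int count, CancellationToken cancellationToken)
    {
        int offset = 0;
        while (offset < count)
        {
            int read = await stream
                .ReadAsync(buffer, offset, count - offset, cancellationToken)
                .ConfigureAwait(false);
            if (read == 0)
            {
                throw new EndOfStreamException("Stream ended before the full frame was read.");
            }

            offset += read;
        }
    }
}
```

Passing an explicit `count` to the read helper is the important detail: a rented buffer that previously held a large frame still contains stale trailing bytes when a small frame reuses it.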

Worker-010

| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Conversion/VariantConverter.cs:204-226` |
| Status | Resolved |

Description: ConvertInt64Scalar is reached for TypeCode.UInt32 and TypeCode.Int64. For a uint with expectedDataType == MxDataType.Time, the value is treated as a Windows FILETIME via DateTime.FromFileTimeUtc(longValue). A 32-bit value can hold only the low half of a FILETIME, so this silently produces a timestamp within minutes of the FILETIME epoch (1601-01-01 UTC) rather than a raw/diagnostic value. Unlikely in practice, but a silent misconversion.

Recommendation: Only apply the MxDataType.Time FILETIME projection for 64-bit source types; for uint fall through to integer or raw.

Resolution: 2026-05-18 — ConvertInt64Scalar's MxDataType.Time FILETIME projection is now gated on value is long. A genuine 64-bit long still projects to a Timestamp via DateTime.FromFileTimeUtc; a 32-bit uint — which can only hold the low half of a FILETIME — now falls through to the integer projection (DataType = Integer, Int64Value) instead of silently producing a bogus near-1601 timestamp. Verified by the regression test VariantConverterTests.Convert_WithUInt32AndExpectedTime_DoesNotProjectFileTime; the existing Convert_WithFileTimeAndExpectedTime_ProjectsTimestamp (a long FILETIME) continues to pass, confirming the 64-bit path is unchanged.
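
The type gate can be illustrated with a short sketch. `FileTimeProjectionSketch` and `ProjectTime` are hypothetical names, and returning `object` is a simplification of the real converter, which populates a typed value message.

```csharp
using System;

public static class FileTimeProjectionSketch
{
    // Projects a scalar that the caller expects to be a time value.
    public static object ProjectTime(object value)
    {
        // Only a genuine 64-bit long can carry a full FILETIME
        // (100 ns ticks since 1601-01-01 UTC).
        if (value is long fileTime)
        {
            return DateTime.FromFileTimeUtc(fileTime);
        }

        // Narrower integers such as uint fall through to a plain integer
        // projection instead of a bogus near-1601 timestamp.
        return Convert.ToInt64(value);
    }
}
```

For scale, even `uint.MaxValue` interpreted as FILETIME ticks is under 430 seconds, so any widened 32-bit value would land within about seven minutes of 1601-01-01 UTC.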

Worker-011

| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeClient.cs:169-171` |
| Status | Resolved |

Description: retryAttempts is computed as (connectTimeout / min(connectTimeout, attemptTimeout)) - 1. With defaults (30000 / 2000) this yields 14 retries, but each retry also incurs Polly exponential backoff. The overall connectDeadline (CancelAfter(connectTimeout)) is the real bound, so the computed attempt count can be larger or smaller than the time budget allows, and the formula is opaque.

Recommendation: Drive retries purely off the connectDeadline token (Polly stops when cancelled) and drop the fragile attempt-count arithmetic, or add a comment explaining the intent.

Resolution: 2026-05-18 — The opaque retryAttempts arithmetic in ConnectWithRetryAsync was removed. MaxRetryAttempts is now int.MaxValue, so the retry loop is bounded solely by the connectDeadline linked token (CancelAfter(_connectTimeoutMilliseconds)): Polly stops retrying the moment that token is cancelled, making the overall connect timeout the single source of truth and correctly accounting for the exponential backoff between attempts (which the old formula ignored). A comment documents the intent. No new test was added — the change does not alter observable behavior (the deadline was always the real bound; the old formula always permitted more attempts than fit the budget), and the existing WorkerPipeClientTests.RunAsync_RetriesUntilPipeServerAppears (server appears mid-retry) and RunAsync_WhenPipeNeverAppears_ThrowsTimeoutException (deadline ends the loop) already cover both retry-until-success and deadline-bounded termination.
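
The deadline-as-single-bound idea can be sketched without Polly. The hand-written loop below is a swapped-in illustration, not the worker's retry pipeline: the names, the choice of `IOException` as the transient failure, the 5-second backoff cap, and the conversion to `TimeoutException` are all assumptions.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class DeadlineBoundRetrySketch
{
    public static async Task ConnectWithRetryAsync(
        Func<CancellationToken, Task> connectOnceAsync,
        TimeSpan connectTimeout,
        TimeSpan initialBackoff,
        CancellationToken cancellationToken)
    {
        using (var deadline = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken))
        {
            // The deadline token is the only bound; there is no attempt counter.
            deadline.CancelAfter(connectTimeout);
            TimeSpan backoff = initialBackoff;
            try
            {
                while (true)
                {
                    deadline.Token.ThrowIfCancellationRequested();
                    try
                    {
                        await connectOnceAsync(deadline.Token).ConfigureAwait(false);
                        return;
                    }
                    catch (IOException)
                    {
                        // Transient failure: retry after the backoff below.
                    }

                    // The backoff wait counts against the same deadline.
                    await Task.Delay(backoff, deadline.Token).ConfigureAwait(false);
                    backoff = TimeSpan.FromTicks(
                        Math.Min(backoff.Ticks * 2, TimeSpan.FromSeconds(5).Ticks));
                }
            }
            catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
            {
                // The deadline fired, not the caller: report it as a timeout.
                throw new TimeoutException("Connect deadline elapsed before the pipe became available.");
            }
        }
    }
}
```

Because the backoff delay observes the same token as each attempt, time spent waiting between attempts is charged to the overall budget, which an attempt-count formula cannot express.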

Worker-012

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessAlarmEventSink.cs:44-55`, `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:38-43`, `src/MxGateway.Worker/MxAccess/MxAccessEventMapper.cs:106-112` |
| Status | Resolved |

Description: Multiple comments describe the alarm path as not-yet-wired future work ("PR A.2 — COM-side subscription scaffold … the worker advertises no alarm subscription", "the worker bootstrap will gain a thin 'run-on-STA' wrapper as part of A.3"). As of commit 6c64030 the alarm command handler, STA poll loop, and SubscribeAlarms/AcknowledgeAlarm/QueryActiveAlarms are all wired. These comments are stale and misleading.

Recommendation: Update the XML docs/comments to describe the shipped behavior; remove the "future PR" framing.

Re-triage: The WnWrapAlarmConsumer.cs:38-43 citation is inaccurate — those lines were rewritten by Worker-001 and already describe the shipped no-internal-timer threading model correctly; nothing stale there. Conversely, two stale comments the finding did not cite were found on the same alarm path and fixed under the same root cause: AlarmDispatcher.cs's <remarks> still framed the dispatcher as "the in-process slice of A.3" with a "companion follow-up PR" adding the (now-shipped) SubscribeAlarmsCommand/AcknowledgeAlarmCommand/QueryActiveAlarmsCommand, and stated the consumer "polls on a System.Threading.Timer thread today" — a claim made false by Worker-001's removal of that timer; and AlarmCommandHandler.cs's <remarks> likewise asserted "the wnwrap consumer's polling timer fires on a thread-pool thread". The discovery document docs/AlarmClientDiscovery.md (referenced by the source comments) was deliberately left untouched: it is a historical research log of the investigation that chose the shipped design, not API/contract/lifecycle prose, and the source comments cite only its still-accurate "Option A — captured" payload schema.

Resolution: 2026-05-18 — Rewrote the stale alarm-path comments to describe shipped behavior with no "future PR / A.2 / A.3" framing. MxAccessAlarmEventSink: the class <remarks> and the Attach comment now explain that AlarmDispatcher owns the consumer→sink→queue wire-up and that Attach carries only the session id (no COM-event subscription is needed because the polled wnwrap consumer raises transition events itself). MxAccessEventMapper.CreateOnAlarmTransition's XML summary now states the worker drives it from MxAccessAlarmEventSink.EnqueueTransition once AlarmDispatcher decodes a wnwrap transition. AlarmDispatcher and AlarmCommandHandler <remarks> were corrected to describe the shipped command surface and the no-internal-timer / STA-driven polling model (the System.Threading.Timer claims were factually wrong post-Worker-001). Pure documentation change — no behavior altered, no test needed; the build stays green.

Worker-013

| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/MxGateway.Worker/Sta/StaMessagePump.cs` |
| Status | Resolved |

Description: StaMessagePump — the heart of COM event delivery (MsgWaitForMultipleObjectsEx + PeekMessage/DispatchMessage) — has no direct unit tests. StaRuntimeTests exercises it indirectly for command wake-up but never verifies that a posted Windows message actually wakes the wait and is dispatched, nor that PumpPendingMessages returns a correct count. The alarm poll-loop lifecycle in MxAccessStaSession (start/cancel/await on shutdown) also has no test. These are the most failure-sensitive paths in the module.

Recommendation: Add tests that post a message to the STA thread and assert it is pumped, and tests covering alarm poll-loop start/stop and shutdown ordering.

Re-triage: This finding is stale as of the reviewed branch — the coverage it asks for already exists. src/MxGateway.Worker.Tests/Sta/StaMessagePumpTests.cs contains direct StaMessagePump tests covering null-argument validation, waking on a signalled event, returning on timeout, the zero-timeout conversion branch, PumpPendingMessages returning the correct count for messages posted to the STA thread (PumpPendingMessages_MessagesPostedToStaThread_ReturnsCountProcessed, PumpPendingMessages_NoMessagesPosted_ReturnsZero), and WaitForWorkOrMessages waking on a posted Windows message (WaitForWorkOrMessages_WindowsMessagePosted_ReturnsForInputAvailable) — exactly the "post a message and assert it is pumped" test the recommendation asks for. The alarm poll-loop lifecycle is covered by MxAccessStaSessionTests.StartAsync_WithAlarmCommandHandlerFactory_PollOnceCalledViaSta (start → poll runs on the STA) and Dispose_StopsAlarmPollLoop (Dispose joins the poll task; no further polls). The finding was raised against a stale view of the test project; no source or test change is required. Re-triaged as already resolved rather than fixed.

Resolution: 2026-05-18 — No code change. Re-triaged: the requested direct StaMessagePump tests (including posted-message dispatch and pump count) and the alarm poll-loop start/stop lifecycle tests already exist in StaMessagePumpTests.cs and MxAccessStaSessionTests.cs. See the re-triage note above for the specific test names.

Worker-014

Field	Value
Severity	Low
Category	Code organization & conventions
Location	`src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:33`, `:202`
Status	Resolved

Description: The file declares two public types — the AlarmCommandHandler class and the IAlarmCommandHandler interface. The C# style guide and the rest of the module follow one-public-type-per-file (e.g. interfaces in their own I*.cs files like IMxAccessAlarmConsumer.cs).

Recommendation: Move IAlarmCommandHandler to its own IAlarmCommandHandler.cs for consistency.

Resolution: 2026-05-18 — The IAlarmCommandHandler interface (with its XML docs) was moved verbatim out of AlarmCommandHandler.cs into a new src/MxGateway.Worker/MxAccess/IAlarmCommandHandler.cs, with its own using directives (System, System.Collections.Generic, MxGateway.Contracts.Proto). AlarmCommandHandler.cs now declares one public type, matching the module's one-public-type-per-file convention (cf. IMxAccessAlarmConsumer.cs). Pure file-organization change — no API surface, behavior, or namespace changed; no test needed. The worker build is clean with zero warnings (no unused usings left behind in AlarmCommandHandler.cs).

Worker-015

Field	Value
Severity	Low
Category	Correctness & logic bugs
Location	`src/MxGateway.Worker/MxAccess/MxAccessEventQueue.cs:115-145`
Status	Resolved

Description: On overflow, Enqueue records the overflow fault and throws MxAccessEventQueueOverflowException; MxAccessBaseEventSink.EnqueueEvent catches it and calls RecordFault again. RecordFault is a no-op when a fault already exists, so the second call is harmless — but the intent is muddled, and there is no test asserting the dropped-event behavior. This is acceptable per the fail-fast design but undocumented at the call site.

Recommendation: Add a brief comment in EnqueueEvent clarifying that an overflow exception is expected and already self-records its fault, so the catch is intentionally a near no-op.

Resolution: 2026-05-18 — Added a comment in MxAccessBaseEventSink.EnqueueEvent's catch block (per the finding's recommendation) explaining that two distinct fail-fast failures land there: a conversion failure from createEvent() (recorded here as an MxaccessEventConversionFailed fault) and an MxAccessEventQueueOverflowException from Enqueue at capacity, which — per the fail-fast backpressure design in docs/DesignDecisions.md — drops the event and has already self-recorded a QueueOverflow fault inside Enqueue. Because MxAccessEventQueue.RecordFault keeps only the first fault, the catch's RecordFault call is then a deliberate near no-op rather than a second, conflicting fault. Pure comment change as recommended — no behavior altered. docs/DesignDecisions.md already documents the fail-fast event backpressure rule, so no doc change was required.
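
The interaction between the self-recording overflow and the first-fault-wins rule can be sketched as below. This is a minimal illustration under stated assumptions, not the worker's code: `SketchEventQueue`, `SketchOverflowException`, and `SketchSink` are hypothetical stand-ins for MxAccessEventQueue, MxAccessEventQueueOverflowException, and MxAccessBaseEventSink.EnqueueEvent, and faults are modelled as plain strings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for MxAccessEventQueueOverflowException.
public sealed class SketchOverflowException : Exception
{
    public SketchOverflowException(string message) : base(message) { }
}

// Hypothetical stand-in for MxAccessEventQueue: bounded, fail-fast, first fault wins.
public sealed class SketchEventQueue
{
    private readonly object gate = new object();
    private readonly Queue<string> items = new Queue<string>();
    private readonly int capacity;
    private string fault = string.Empty;

    public SketchEventQueue(int capacity) { this.capacity = capacity; }

    public string Fault { get { lock (gate) { return fault; } } }

    public int Count { get { lock (gate) { return items.Count; } } }

    // Keeps only the first fault; later calls are no-ops.
    public void RecordFault(string newFault)
    {
        lock (gate)
        {
            if (fault.Length == 0) { fault = newFault; }
        }
    }

    // At capacity the event is dropped, the overflow fault is self-recorded,
    // and the call throws so the caller fails fast.
    public void Enqueue(string item)
    {
        lock (gate)
        {
            if (items.Count >= capacity)
            {
                if (fault.Length == 0) { fault = "QueueOverflow"; }
                throw new SketchOverflowException("Event queue is at capacity.");
            }
            items.Enqueue(item);
        }
    }
}

public static class SketchSink
{
    public static void EnqueueEvent(SketchEventQueue queue, Func<string> createEvent)
    {
        try
        {
            queue.Enqueue(createEvent());
        }
        catch (Exception)
        {
            // Two failures land here. A conversion failure from createEvent()
            // is recorded now. An overflow from Enqueue has already recorded
            // its own fault, so this call is a deliberate near no-op.
            queue.RecordFault("EventConversionFailed");
        }
    }
}
```

With a capacity of 1, a second `EnqueueEvent` call drops the event and leaves `Fault` at `"QueueOverflow"`; the catch block's `RecordFault` does not overwrite it.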

Worker-016

Field	Value
Severity	Medium
Category	Concurrency & thread safety
Location	`src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:261-265`
Status	Resolved

Description: RunAlarmPollLoopAsync catches InvalidOperationException and silently returns with the rationale "STA runtime shutting down — stop the loop gracefully". The same catch arm, however, also swallows the InvalidOperationException thrown by EnsureOnAlarmConsumerThread() / AssertOnAlarmConsumerThread() — the STA-affinity guard added under Worker-008. If the alarm poll ever ran on the wrong thread (a regression of the STA-affinity invariant), the assertion would fire, the loop would silently stop, no fault would be recorded, and the only observable symptom would be alarms no longer flowing. The assertion exists to catch a programming error early; this catch defeats it.

Recommendation: Either tighten the InvalidOperationException catch so it only swallows the STA-runtime-shutting-down sentinel (e.g. match on the exception message produced by StaRuntime.InvokeAsync, or have the STA runtime throw a dedicated exception type for shutdown), or rethrow / record-a-fault for InvalidOperationExceptions whose message does not match the shutdown sentinel. Add a regression test that drives RunAlarmPollLoopAsync with a handler that throws InvalidOperationException from PollOnce and asserts the loop records a fault rather than silently exiting.

Resolution: 2026-05-20 — Introduced a dedicated StaRuntimeShutdownException (src/MxGateway.Worker/Sta/StaRuntimeShutdownException.cs) that StaRuntime.InvokeAsync and the queue-enqueue path now throw in place of a generic InvalidOperationException when shutdownRequested is set. RunAlarmPollLoopAsync in MxAccessStaSession.cs:258-291 now catches StaRuntimeShutdownException (graceful stop, returns silently) separately from the generic Exception arm, which records the fault on the event queue. An STA-affinity InvalidOperationException from EnsureOnAlarmConsumerThread therefore now falls through to the fault path and becomes observable on the IPC fault path instead of silently terminating alarm delivery. Verified: dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86 clean (0 warnings). Regression coverage in MxAccessStaSessionTests.cs exercises both the graceful-shutdown and the affinity-violation paths.
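
A minimal sketch of the catch-arm split described in this resolution, assuming a synchronous loop for brevity (the real RunAlarmPollLoopAsync is asynchronous). `SketchShutdownException` and `SketchAlarmPollLoop` are hypothetical names standing in for StaRuntimeShutdownException and the poll loop, and the fault sink is modelled as a plain list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for StaRuntimeShutdownException.
public sealed class SketchShutdownException : Exception
{
    public SketchShutdownException() : base("STA runtime is shutting down.") { }
}

public static class SketchAlarmPollLoop
{
    // Calls pollOnce until it throws.
    // Returns true for a graceful shutdown, false when a fault was recorded.
    public static bool Run(Action pollOnce, List<string> faults)
    {
        try
        {
            while (true)
            {
                pollOnce();
            }
        }
        catch (SketchShutdownException)
        {
            // Dedicated shutdown type: stop quietly, record nothing.
            return true;
        }
        catch (Exception ex)
        {
            // Everything else, including an InvalidOperationException from an
            // STA-affinity guard, becomes an observable fault.
            faults.Add(ex.GetType().Name + ": " + ex.Message);
            return false;
        }
    }
}
```

Because the shutdown signal has its own type, an `InvalidOperationException` raised for any other reason can no longer be mistaken for it.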

Worker-017

Field	Value
Severity	Medium
Category	Error handling & resilience
Location	`src/MxGateway.Worker/Sta/StaRuntime.cs:280-288`, `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:602-631`
Status	Resolved

Description: StaRuntime.ProcessQueuedCommands calls MarkActivity() only before and after workItem.Execute(). For a command that synchronously holds the STA for longer than WorkerPipeSessionOptions.HeartbeatGrace (default 15s) — e.g. ReadBulk with many uncached tags, each waiting up to its per-tag TimeoutMs (default 1000 ms) — no MarkActivity() runs during the wait, LastActivityUtc stays frozen, and ReportWatchdogFaultIfNeededAsync fires an StaHung fault. The heartbeat itself reports WorkerState.ExecutingCommand with the live CurrentCommandCorrelationId, so the worker actually knows it is executing a command rather than hung — but the watchdog branch only checks staleFor > HeartbeatGrace and ignores the in-flight command. A legitimate slow bulk read then self-faults and tears the session down.

Recommendation: Either (a) extend WorkerPipeSession.ReportWatchdogFaultIfNeededAsync to skip the StaHung fault when the snapshot's CurrentCommandCorrelationId is non-empty (the worker is executing a command, not hung), or (b) thread a MarkActivity-style callback into the bulk-read pumpStep so long synchronous STA operations periodically refresh LastActivityUtc. Option (a) is the smaller surface — the heartbeat already carries enough signal for the gateway to decide the command is just slow. Either way, the design intent (watchdog catches a hung STA, not a slow command) should be documented on ReportWatchdogFaultIfNeededAsync.

Resolution: 2026-05-20 — Applied option (a): WorkerPipeSession.ReportWatchdogFaultIfNeededAsync (src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:602-645) now returns early when snapshot.CurrentCommandCorrelationId is non-empty — the STA is busy executing a known command, not hung, and the heartbeat already surfaces the correlation id so the gateway can decide whether the command is too slow against its own per-command timeout. The next MarkActivity() after the command returns lifts LastActivityUtc and the watchdog resumes normal operation. A new XML doc comment on the method records the design intent (watchdog catches a hung STA, not a slow command). Verified: dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86 clean. Regression coverage added in WorkerPipeSessionTests.cs.
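
The early-return rule from option (a) can be expressed as a small pure function. The sketch below is illustrative only: `SketchActivitySnapshot` and `SketchWatchdog.ShouldReportStaHung` are hypothetical names, and the real ReportWatchdogFaultIfNeededAsync is asynchronous and writes a fault rather than returning a flag.

```csharp
using System;

// Hypothetical snapshot of STA activity. The property names follow the
// finding, but the type itself is an assumption made for this sketch.
public sealed class SketchActivitySnapshot
{
    public DateTime LastActivityUtc { get; set; }
    public string CurrentCommandCorrelationId { get; set; } = string.Empty;
}

public static class SketchWatchdog
{
    // Returns true only when the STA looks hung: activity is stale
    // and no command is known to be in flight.
    public static bool ShouldReportStaHung(
        SketchActivitySnapshot snapshot,
        DateTime nowUtc,
        TimeSpan heartbeatGrace)
    {
        // A non-empty correlation id means the STA is busy with a known
        // command, so a slow command is not treated as a hang.
        if (!string.IsNullOrEmpty(snapshot.CurrentCommandCorrelationId))
        {
            return false;
        }

        TimeSpan staleFor = nowUtc - snapshot.LastActivityUtc;
        return staleFor > heartbeatGrace;
    }
}
```

Passing the clock and grace period as parameters keeps the decision deterministic, which makes the slow-command and genuinely-hung cases straightforward to cover in a unit test.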

Worker-018

Field	Value
Severity	Low
Category	Error handling & resilience
Location	`src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:160-161`
Status	Resolved

Description: Subscribe calls com.SetXmlAlarmQuery(xmlQuery) and discards the return value. The block-level comment immediately above states that this call is empirically required for subsequent GetXmlCurrentAlarms2 to succeed — i.e. it is on the critical path of the alarm subscription. Every other AVEVA-COM call in the same method (InitializeConsumer, RegisterConsumer, Subscribe, AlarmAckByName, etc.) is gated on a != 0 return-code check and throws InvalidOperationException on failure. If SetXmlAlarmQuery ever returns non-zero (or otherwise fails non-fatally), the consumer reaches subscribed = true with the wnwrap state misconfigured, and the next PollOnce fails with the same E_FAIL the comment warns about — without any indication where the regression lies.

Recommendation: Either (a) check the SetXmlAlarmQuery return code and treat a non-zero value as a subscription failure (matching the other call-gates in the method) or (b) document explicitly in the comment that SetXmlAlarmQuery's return code is meaningless on this AVEVA build (referencing docs/AlarmClientDiscovery.md if so). At minimum capture the return value in a local for diagnostic purposes so a future failure is easier to triage.

Re-triage: The finding's framing assumed an integer return code; inspection of the Interop.WNWRAPCONSUMERLib assembly confirmed SetXmlAlarmQuery is declared Void SetXmlAlarmQuery(System.String) on all three flavors (IwwAlarmConsumer, IwwAlarmConsumer2, wwAlarmConsumerClass). There is no integer return code to gate on. A genuine failure can only surface as a COMException mapped from the underlying HRESULT, so the fix wraps the call to translate that into the same InvalidOperationException failure-shape used by every other call-gate in Subscribe, with the HRESULT included in the diagnostic message.

Resolution: 2026-05-20 — WnWrapAlarmConsumer.Subscribe now wraps the com.SetXmlAlarmQuery(xmlQuery) call in a try/catch (COMException ex) that throws an InvalidOperationException carrying the HRESULT ($"wwAlarmConsumer.SetXmlAlarmQuery failed with HRESULT 0x{ex.HResult:X8}; subsequent GetXmlCurrentAlarms2 polls would return E_FAIL.") with the original COMException as InnerException. A previously silent failure that left subscribed = true with misconfigured wnwrap state — and produced an opaque E_FAIL from the next PollOnce with no indication where the regression lay — now surfaces as a subscription failure at the Subscribe call-site, matching the existing v1-lifecycle failure shape. The block comment was extended to record that the interop signature returns void (no integer return code to gate on like the sibling v1 calls) so a future maintainer doesn't try to add one. No new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent; the change is structurally analogous to the existing Initialize/Register/Subscribe call-gates and is exercised end-to-end by the live alarm smoke path.
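
The translation from COMException to the shared failure shape might look roughly like this. `SketchCallGate.Invoke` is a hypothetical helper written for illustration; the actual change wraps the single SetXmlAlarmQuery call inline and uses the longer diagnostic message quoted above.

```csharp
using System;
using System.Runtime.InteropServices;

public static class SketchCallGate
{
    // Hypothetical helper: runs a void COM call and converts a COMException
    // into an InvalidOperationException that carries the HRESULT.
    public static void Invoke(string callName, Action comCall)
    {
        try
        {
            comCall();
        }
        catch (COMException ex)
        {
            // X8 renders the 32-bit HRESULT as eight hex digits,
            // for example 0x80004005 for E_FAIL.
            throw new InvalidOperationException(
                $"{callName} failed with HRESULT 0x{ex.HResult:X8}.", ex);
        }
    }
}
```

Keeping the original exception as `InnerException` preserves the COM stack trace for triage while callers only need to handle one exception type.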

Worker-019

Field	Value
Severity	Low
Category	Code organization & conventions
Location	`src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:59`, `:188`
Status	Resolved

Description: WnWrapAlarmConsumer declares private string subscriptionExpression = string.Empty; and assigns it once inside Subscribe (line 188), but never reads it. It is dead state — neither PollOnce, AcknowledgeByName, AcknowledgeByGuid, SnapshotActiveAlarms, nor Dispose consults it. Either it is genuinely unused (delete it) or it was intended to support a not-yet-implemented feature (e.g. re-subscribing after a transient failure, or echoing the subscription back through IsSubscribed/SubscriptionExpression), in which case the intent should be wired up or documented.

Recommendation: Delete the field; this is the safest option. TreatWarningsAsErrors=true does not flag the field as long as it is still written to, so the build will not catch it on its own. Alternatively, expose it through a read-only SubscriptionExpression property so smoke tests can assert which subscription is active without touching wnwrap state. If a future use is expected, file a follow-up issue.

Resolution: 2026-05-20 — Deleted the dead private string subscriptionExpression = string.Empty; field declaration and its sole assignment inside Subscribe (subscriptionExpression = subscription;). The field had no readers and was pure write-only state. Pure cleanup — no behaviour change, no public API surface affected. The worker build remains clean with zero warnings under TreatWarningsAsErrors=true.

Worker-020

Field	Value
Severity	Low
Category	Correctness & logic bugs
Location	`src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:405`, `:423`
Status	Resolved

Description: ProcessCommandAsync decides whether to write a command reply with if (_state is not WorkerState.Ready and not WorkerState.ExecutingCommand). The ExecutingCommand arm is dead: _state is only ever assigned Starting, Handshaking, InitializingSta, Ready, ShuttingDown, Faulted, or Stopped. The string WorkerState.ExecutingCommand appears nowhere as a target of _state = .... The WorkerState.ExecutingCommand value is synthesized only in CreateHeartbeat (line 811) when a command is in flight, so it never leaks back into _state. The check is effectively _state is not WorkerState.Ready. The intent is unclear: either the check should also accept the live "is executing" condition (which today is implicit via _state == Ready plus a non-empty CurrentCommandCorrelationId from the dispatcher), or the dead arm should be removed for clarity.

Recommendation: Simplify the check to if (_state != WorkerState.Ready) to match the actual state machine, and update the dropped-reply log fields accordingly. Alternatively, introduce an explicit WorkerState.ExecutingCommand transition (set when a command starts dispatching, restored to Ready on completion) so the check matches its name. The simpler fix is the former.

Resolution: 2026-05-20 — Both occurrences of the _state is not WorkerState.Ready and not WorkerState.ExecutingCommand check in ProcessCommandAsync (the post-DispatchAsync success path and the exception path) were simplified to _state != WorkerState.Ready. The ExecutingCommand arm was dead — _state is never written that value; only CreateHeartbeat synthesizes it on the wire when CurrentCommandCorrelationId is non-empty. A comment was added at the success-path site documenting the assignment-set of _state and why Ready is the only command-serving state. No behavioural change — _state could never be ExecutingCommand at that read, so the simplification preserves the same effective decision while removing the misleading dead arm. No new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent.
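
The distinction between the stored state and the heartbeat view can be illustrated as follows. This is a sketch under assumptions: `SketchWorkerState` lists only a subset of the real states, `SketchStateMachine` is a hypothetical type, and the exact condition CreateHeartbeat uses to synthesize ExecutingCommand is assumed here to be "stored state is Ready and a correlation id is present".

```csharp
// Illustrative subset of the worker states named in the finding.
public enum SketchWorkerState
{
    Starting,
    Ready,
    ExecutingCommand,
    ShuttingDown,
    Faulted,
    Stopped
}

public sealed class SketchStateMachine
{
    // The stored state is never assigned ExecutingCommand.
    private SketchWorkerState state = SketchWorkerState.Starting;

    public string CurrentCommandCorrelationId { get; set; } = string.Empty;

    public void MarkReady() { state = SketchWorkerState.Ready; }

    public void MarkShuttingDown() { state = SketchWorkerState.ShuttingDown; }

    // Replies are written only while the stored state is Ready.
    public bool CanWriteReply()
    {
        return state == SketchWorkerState.Ready;
    }

    // ExecutingCommand exists only in the heartbeat view of the state.
    public SketchWorkerState HeartbeatState()
    {
        bool commandInFlight = !string.IsNullOrEmpty(CurrentCommandCorrelationId);
        return state == SketchWorkerState.Ready && commandInFlight
            ? SketchWorkerState.ExecutingCommand
            : state;
    }
}
```

Since the stored state can never equal `ExecutingCommand`, a reply check written as "not Ready and not ExecutingCommand" reduces to "not Ready", which is the simplification applied here.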

Worker-021

Field	Value
Severity	Low
Category	Correctness & logic bugs
Location	`src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:111-118`, `:790-805`, `:136-139`
Status	Resolved

Description: RunAsync constructs the runtime session through _runtimeSession = _runtimeSessionFactory() (line 111) and immediately calls CompleteStartupHandshakeAsync(token => _runtimeSession.StartAsync(...)). That path is fine. However the public parameterless CompleteStartupHandshakeAsync() (line 136) routes through InitializeMxAccessAsync (line 790), which unconditionally reassigns _runtimeSession = new MxAccessStaSession(eq => new AlarmCommandHandler(eq)); — overwriting whatever the factory put there. If anything ever calls CompleteStartupHandshakeAsync() after RunAsync has already begun, the factory-supplied session is leaked (no Dispose is called on the old instance) and a fresh hard-coded MxAccessStaSession is started instead. Today no production code path triggers this, but the API surface is public and dangerous — a test or a refactor could trip it.

Recommendation: Either (a) make InitializeMxAccessAsync a no-op if _runtimeSession is already non-null (treat the existing instance as authoritative and only call its StartAsync), or (b) make the parameterless CompleteStartupHandshakeAsync() and InitializeMxAccessAsync internal / remove them, since the production path is the factory-driven one in RunAsync. Option (b) is cleaner: the parameterless overload is dead in production.

Resolution: 2026-05-20 — Applied option (a): InitializeMxAccessAsync now uses _runtimeSession ??= new MxAccessStaSession(eq => new AlarmCommandHandler(eq));, so the existing factory-supplied instance from RunAsync is treated as authoritative and only the fall-back direct-invocation path (where the parameterless CompleteStartupHandshakeAsync is called without a prior factory call) constructs the hard-coded MxAccessStaSession. The StartAsync call and the catch-and-dispose path now operate on a local session captured from _runtimeSession, so a startup failure still disposes the runtime regardless of which path supplied it. A comment in InitializeMxAccessAsync documents the reasoning. Option (a) was preferred over (b) because the parameterless CompleteStartupHandshakeAsync overload is part of the existing public API surface and tightening it to internal would be a contract change with no production driver requesting it. No new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent; the change is exercised end-to-end by the existing RunAsync factory path which now goes through the null-coalescing assignment instead of an unconditional new.
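
A reduced sketch of the null-coalescing assignment and the local capture described above. All type and member names (`SketchRuntimeSession`, `SketchPipeSession`, `UseFactorySession`, `InitializeRuntime`) are hypothetical, and startup is modelled synchronously; the real code awaits StartAsync.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the runtime session.
public sealed class SketchRuntimeSession : IDisposable
{
    public bool FailOnStart { get; set; }
    public bool Started { get; private set; }
    public bool Disposed { get; private set; }

    public void Start()
    {
        if (FailOnStart) { throw new InvalidOperationException("Start failed."); }
        Started = true;
    }

    public void Dispose() { Disposed = true; }
}

public sealed class SketchPipeSession
{
    private readonly Func<SketchRuntimeSession> fallbackFactory;
    private SketchRuntimeSession? runtimeSession;

    public SketchPipeSession(Func<SketchRuntimeSession> fallbackFactory)
    {
        this.fallbackFactory = fallbackFactory;
    }

    public SketchRuntimeSession? RuntimeSession => runtimeSession;

    // Models the factory-driven path assigning the session first.
    public void UseFactorySession(SketchRuntimeSession session)
    {
        runtimeSession = session;
    }

    public void InitializeRuntime()
    {
        // An existing instance is authoritative; the fallback is
        // constructed only when nothing has been supplied yet.
        SketchRuntimeSession session = runtimeSession ??= fallbackFactory();
        try
        {
            session.Start();
        }
        catch
        {
            // Dispose whichever instance was started, then rethrow.
            session.Dispose();
            throw;
        }
    }
}
```

Capturing the session in a local before `Start` means the failure path disposes the same instance that was started, regardless of which path supplied it.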

Worker-022

Field	Value
Severity	Low
Category	Code organization & conventions
Location	`src/MxGateway.Worker/MxAccess/MxAlarmSnapshot.cs:12`, `:26`, `:49`
Status	Resolved

Description: MxAlarmSnapshot.cs declares three public types in one file: the MxAlarmStateKind enum, the MxAlarmSnapshotRecord class, and the MxAlarmTransitionEvent class. The C# style guide (docs/style-guides/CSharpStyleGuide.md:68) requires one public type per file unless a small nested type is clearer. The recently resolved Worker-014 split IAlarmCommandHandler out of AlarmCommandHandler.cs for exactly this reason — the same convention applies here.

Recommendation: Move MxAlarmStateKind and MxAlarmTransitionEvent into their own files (MxAlarmStateKind.cs, MxAlarmTransitionEvent.cs) and leave MxAlarmSnapshotRecord in MxAlarmSnapshot.cs (or rename the file to MxAlarmSnapshotRecord.cs to match the surviving type). Pure file-organization change; no behaviour or namespace impact.

Resolution: 2026-05-20 — Split MxAlarmSnapshot.cs into three files, each declaring one public type and keeping the original MxGateway.Worker.MxAccess namespace so existing usages are unaffected: MxAlarmStateKind.cs (the enum, with its XML doc), MxAlarmTransitionEvent.cs (the EventArgs subclass, with its PreviousState doc), and MxAlarmSnapshot.cs (now containing only MxAlarmSnapshotRecord plus its XML doc). Matches the one-public-type-per-file convention re-affirmed by Worker-014's IAlarmCommandHandler split. Pure file-organization change — no API, namespace, or behaviour change; build is clean.
