Resolve Worker-009..015 code-review findings

Worker-009: WorkerFrameWriter serialized twice and WorkerFrameReader
allocated a payload byte[] per frame. The writer now serializes once into a
single prefix+payload buffer; the reader rents the payload buffer from
ArrayPool and honors the logical frame length.

Worker-010: VariantConverter projected a uint+Time value as a full FILETIME,
producing a near-1601 timestamp. The FILETIME projection is now gated on
`value is long`; uint falls through to the integer projection.

Worker-011: replaced the opaque retryAttempts formula in WorkerPipeClient
with MaxRetryAttempts = int.MaxValue, leaving the connect deadline as the
sole bound.

Worker-012: rewrote stale "future PR / polls on a Timer" comments in
AlarmDispatcher, AlarmCommandHandler, MxAccessAlarmEventSink and
MxAccessEventMapper to match the shipped, post-Worker-001 behavior.

Worker-013 (re-triaged): already resolved — StaMessagePumpTests and
MxAccessStaSessionTests cover the pump and poll loop directly.

Worker-014: moved IAlarmCommandHandler into its own file so
AlarmCommandHandler.cs declares one public type.

Worker-015: clarified the MxAccessBaseEventSink.EnqueueEvent overflow-catch
comment explaining the deliberate double RecordFault no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-18 22:42:17 -04:00
parent fe9044115b
commit 1764eff1cf
13 changed files with 229 additions and 127 deletions
+19 -15
@@ -7,7 +7,7 @@
| Review date | 2026-05-18 |
| Commit reviewed | `6c64030` |
| Status | Reviewed |
| Open findings | 7 |
| Open findings | 0 |
## Checklist coverage
@@ -157,13 +157,13 @@
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Worker/Ipc/WorkerFrameReader.cs:31,49`, `src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:57-58` |
| Status | Open |
| Status | Resolved |
**Description:** Every frame read allocates a fresh 4-byte length buffer and a payload `byte[]`; every write allocates `ToByteArray()` plus a 4-byte prefix. On the hot event-drain path (batches of up to 128 `WorkerEvent` frames every 25 ms) this produces steady gen-0 garbage. `WorkerFrameWriter` also effectively serializes twice (`CalculateSize()` then `ToByteArray()`).
**Recommendation:** Reuse a pooled buffer / `ArrayPool<byte>` for the length prefix and payload, and write directly into a pooled buffer using `CodedOutputStream`. Low priority unless event throughput is high.
**Resolution:** _(open)_
**Resolution:** 2026-05-18 — `WorkerFrameWriter.WriteAsync` now serializes the envelope exactly once into a single frame buffer that carries the 4-byte length prefix followed by the payload, via `envelope.WriteTo(new Span<byte>(frame, sizeof(uint), payloadLength))`. This eliminates the redundant second serialization pass (`ToByteArray()` re-runs `CalculateSize()` internally), the separate length-prefix array, and the separate prefix `WriteAsync`/extra `FlushAsync` round. `WorkerFrameReader.ReadAsync` now rents its payload buffer from `ArrayPool<byte>.Shared` and returns it in a `finally` once `WorkerEnvelope.Parser.ParseFrom(payload, 0, length)` has copied what it needs; `ReadExactlyOrThrowAsync` gained an explicit `count` parameter so it honours the logical frame length rather than the (possibly larger) rented buffer length. The 4-byte length-prefix buffer is left as a per-call stack-sized allocation — pooling a 4-byte array is not worthwhile. Verified by the new regression test `WorkerFrameProtocolTests.ReadAsync_WithVaryingFrameSizes_ParsesEachFrameExactly`, which reads a large frame followed by a small frame through one reader to prove the pooled buffer is sliced to each frame's own length and never leaks stale trailing bytes; the existing round-trip, malformed-payload, and concurrent-write tests continue to pass.
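The single-buffer write and pooled read described above can be sketched roughly as follows. This is a minimal illustration, not the shipped `WorkerFrameWriter`/`WorkerFrameReader`: `FrameSketch`, `WriteFrame`, `ReadFrame` and `ReadExactly` are hypothetical names, the payload is a raw byte span rather than a protobuf envelope, the I/O is synchronous, and the little-endian prefix is an assumption about the wire format.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;
using System.IO;

public static class FrameSketch
{
    // Builds one buffer laid out as [4-byte length prefix][payload],
    // so the stream sees a single write instead of prefix + payload.
    public static void WriteFrame(Stream stream, ReadOnlySpan<byte> payload)
    {
        var frame = new byte[sizeof(uint) + payload.Length];
        // Assumption: little-endian prefix; the real protocol may differ.
        BinaryPrimitives.WriteUInt32LittleEndian(frame, (uint)payload.Length);
        payload.CopyTo(new Span<byte>(frame, sizeof(uint), payload.Length));
        stream.Write(frame, 0, frame.Length);
    }

    // Rents the payload buffer and only ever touches the first `length` bytes;
    // a rented array may be longer than requested and hold stale data.
    public static byte[] ReadFrame(Stream stream)
    {
        var prefix = new byte[sizeof(uint)];
        ReadExactly(stream, prefix, prefix.Length);
        int length = checked((int)BinaryPrimitives.ReadUInt32LittleEndian(prefix));

        byte[] rented = ArrayPool<byte>.Shared.Rent(length);
        try
        {
            ReadExactly(stream, rented, length);
            // Stand-in for the parser copying what it needs out of the buffer.
            var result = new byte[length];
            Buffer.BlockCopy(rented, 0, result, 0, length);
            return result;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }

    // Reads exactly `count` bytes (not buffer.Length) or throws.
    private static void ReadExactly(Stream stream, byte[] buffer, int count)
    {
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0) throw new EndOfStreamException();
            offset += read;
        }
    }
}
```

Reading a large frame and then a small one through the same stream exercises the case the regression test targets: the second read may get back the same, larger rented array, so only the explicit `count` keeps stale trailing bytes out of the result.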
### Worker-010
@@ -172,13 +172,13 @@
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Conversion/VariantConverter.cs:204-226` |
| Status | Open |
| Status | Resolved |
**Description:** `ConvertInt64Scalar` is reached for `TypeCode.UInt32` and `TypeCode.Int64`. For a `uint` with `expectedDataType == MxDataType.Time`, the value is treated as a Windows `FILETIME` via `DateTime.FromFileTimeUtc(longValue)`; a 32-bit FILETIME is never a valid full FILETIME, so this silently produces a near-epoch timestamp rather than a raw/diagnostic value. Unlikely in practice but a silent misconversion.
**Recommendation:** Only apply the `MxDataType.Time` FILETIME projection for 64-bit source types; for `uint` fall through to integer or raw.
**Resolution:** _(open)_
**Resolution:** 2026-05-18 — `ConvertInt64Scalar`'s `MxDataType.Time` FILETIME projection is now gated on `value is long`. A genuine 64-bit `long` still projects to a `Timestamp` via `DateTime.FromFileTimeUtc`; a 32-bit `uint` — which can only hold the low half of a FILETIME — now falls through to the integer projection (`DataType = Integer`, `Int64Value`) instead of silently producing a bogus near-1601 timestamp. Verified by the regression test `VariantConverterTests.Convert_WithUInt32AndExpectedTime_DoesNotProjectFileTime`; the existing `Convert_WithFileTimeAndExpectedTime_ProjectsTimestamp` (a `long` FILETIME) continues to pass, confirming the 64-bit path is unchanged.
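A rough sketch of the gating described above, under stated assumptions: `TimeProjectionSketch` and `Project` are hypothetical names, a `bool` stands in for the `expectedDataType == MxDataType.Time` check, and the return type is simplified to `object` rather than the converter's real result type.

```csharp
using System;

public static class TimeProjectionSketch
{
    // Returns a UTC DateTime only when the source is a genuine 64-bit long
    // and a Time value is expected; every other case yields the raw integer.
    public static object Project(object value, bool expectTime)
    {
        long raw = Convert.ToInt64(value);

        // A boxed uint fails the `is long` test even though it converts to long.
        if (expectTime && value is long)
        {
            return DateTime.FromFileTimeUtc(raw);
        }

        return raw;
    }
}
```

For scale: `uint.MaxValue` interpreted as a FILETIME is about 429 seconds of 100 ns ticks past 1601-01-01, which is why the ungated projection always landed in the year 1601 for 32-bit inputs.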
### Worker-011
@@ -187,13 +187,13 @@
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeClient.cs:169-171` |
| Status | Open |
| Status | Resolved |
**Description:** `retryAttempts` is computed as `(connectTimeout / min(connectTimeout, attemptTimeout)) - 1`. With defaults (30000 / 2000) this yields 14 retries, but each retry also incurs Polly exponential backoff. The overall `connectDeadline` (`CancelAfter(connectTimeout)`) is the real bound, so the computed attempt count can be larger or smaller than the time budget allows, and the formula is opaque.
**Recommendation:** Drive retries purely off the `connectDeadline` token (Polly stops when cancelled) and drop the fragile attempt-count arithmetic, or add a comment explaining the intent.
**Resolution:** _(open)_
**Resolution:** 2026-05-18 — The opaque `retryAttempts` arithmetic in `ConnectWithRetryAsync` was removed. `MaxRetryAttempts` is now `int.MaxValue`, so the retry loop is bounded solely by the `connectDeadline` linked token (`CancelAfter(_connectTimeoutMilliseconds)`): Polly stops retrying the moment that token is cancelled, making the overall connect timeout the single source of truth and correctly accounting for the exponential backoff between attempts (which the old formula ignored). A comment documents the intent. No new test was added — the change does not alter observable behavior (the deadline was always the real bound; the old formula always permitted more attempts than fit the budget), and the existing `WorkerPipeClientTests.RunAsync_RetriesUntilPipeServerAppears` (server appears mid-retry) and `RunAsync_WhenPipeNeverAppears_ThrowsTimeoutException` (deadline ends the loop) already cover both retry-until-success and deadline-bounded termination.
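The deadline-as-sole-bound idea can be illustrated without Polly. The sketch below swaps in a plain loop for the Polly pipeline, so it shows the shape of the approach rather than the actual `ConnectWithRetryAsync`; `DeadlineRetrySketch`, `ConnectAsync`, the `tryConnect` delegate and the backoff values are all hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class DeadlineRetrySketch
{
    // Retries with no attempt cap; the linked deadline token is the only bound.
    // Returns the attempt number that succeeded.
    public static async Task<int> ConnectAsync(
        Func<CancellationToken, Task<bool>> tryConnect,
        TimeSpan connectTimeout,
        CancellationToken cancellationToken)
    {
        using (var deadline = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken))
        {
            deadline.CancelAfter(connectTimeout);
            var delay = TimeSpan.FromMilliseconds(10); // illustrative starting backoff

            for (int attempt = 1; ; attempt++)
            {
                if (await tryConnect(deadline.Token).ConfigureAwait(false))
                {
                    return attempt;
                }

                try
                {
                    // Backoff time counts against the same deadline as the attempts.
                    await Task.Delay(delay, deadline.Token).ConfigureAwait(false);
                }
                catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
                {
                    // The deadline fired, not the caller's token.
                    throw new TimeoutException("Connect deadline elapsed.");
                }

                delay = TimeSpan.FromMilliseconds(Math.Min(delay.TotalMilliseconds * 2, 1000));
            }
        }
    }
}
```

Because the backoff delay waits on the deadline token too, no separate attempt-count arithmetic is needed to keep the loop inside the time budget.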
### Worker-012
@@ -202,13 +202,15 @@
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessAlarmEventSink.cs:44-55`, `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:38-43`, `src/MxGateway.Worker/MxAccess/MxAccessEventMapper.cs:106-112` |
| Status | Open |
| Status | Resolved |
**Description:** Multiple comments describe the alarm path as not-yet-wired future work ("PR A.2 — COM-side subscription scaffold … the worker advertises no alarm subscription", "the worker bootstrap will gain a thin 'run-on-STA' wrapper as part of A.3"). As of commit 6c64030 the alarm command handler, STA poll loop, and `SubscribeAlarms`/`AcknowledgeAlarm`/`QueryActiveAlarms` are all wired. These comments are stale and misleading.
**Recommendation:** Update the XML docs/comments to describe the shipped behavior; remove the "future PR" framing.
**Resolution:** _(open)_
**Re-triage:** The `WnWrapAlarmConsumer.cs:38-43` citation is inaccurate — those lines were rewritten by Worker-001 and already describe the shipped no-internal-timer threading model correctly; nothing stale there. Conversely, two stale comments the finding did *not* cite were found on the same alarm path and fixed under the same root cause: `AlarmDispatcher.cs`'s `<remarks>` still framed the dispatcher as "the in-process slice of A.3" with a "companion follow-up PR" adding the (now-shipped) `SubscribeAlarmsCommand`/`AcknowledgeAlarmCommand`/`QueryActiveAlarmsCommand`, and stated the consumer "polls on a `System.Threading.Timer` thread today" — a claim made false by Worker-001's removal of that timer; and `AlarmCommandHandler.cs`'s `<remarks>` likewise asserted "the wnwrap consumer's polling timer fires on a thread-pool thread". The discovery document `docs/AlarmClientDiscovery.md` (referenced by the source comments) was deliberately left untouched: it is a historical research log of the investigation that chose the shipped design, not API/contract/lifecycle prose, and the source comments cite only its still-accurate "Option A — captured" payload schema.
**Resolution:** 2026-05-18 — Rewrote the stale alarm-path comments to describe shipped behavior with no "future PR / A.2 / A.3" framing. `MxAccessAlarmEventSink`: the class `<remarks>` and the `Attach` comment now explain that `AlarmDispatcher` owns the consumer→sink→queue wire-up and that `Attach` carries only the session id (no COM-event subscription is needed because the polled wnwrap consumer raises transition events itself). `MxAccessEventMapper.CreateOnAlarmTransition`'s XML summary now states the worker drives it from `MxAccessAlarmEventSink.EnqueueTransition` once `AlarmDispatcher` decodes a wnwrap transition. `AlarmDispatcher` and `AlarmCommandHandler` `<remarks>` were corrected to describe the shipped command surface and the no-internal-timer / STA-driven polling model (the `System.Threading.Timer` claims were factually wrong post-Worker-001). Pure documentation change — no behavior altered, no test needed; the build stays green.
### Worker-013
@@ -217,13 +219,15 @@
| Severity | Low |
| Category | Testing coverage |
| Location | `src/MxGateway.Worker/Sta/StaMessagePump.cs` |
| Status | Open |
| Status | Resolved |
**Description:** `StaMessagePump` — the heart of COM event delivery (`MsgWaitForMultipleObjectsEx` + `PeekMessage`/`DispatchMessage`) — has no direct unit tests. `StaRuntimeTests` exercises it indirectly for command wake-up but never verifies that a posted Windows message actually wakes the wait and is dispatched, nor that `PumpPendingMessages` returns a correct count. The alarm poll-loop lifecycle in `MxAccessStaSession` (start/cancel/await on shutdown) also has no test. These are the most failure-sensitive paths in the module.
**Recommendation:** Add tests that post a message to the STA thread and assert it is pumped, and tests covering alarm poll-loop start/stop and shutdown ordering.
**Resolution:** _(open)_
**Re-triage:** This finding is stale as of the reviewed branch — the coverage it asks for already exists. `src/MxGateway.Worker.Tests/Sta/StaMessagePumpTests.cs` contains direct `StaMessagePump` tests covering null-argument validation, waking on a signalled event, returning on timeout, the zero-timeout conversion branch, `PumpPendingMessages` returning the correct count for messages posted to the STA thread (`PumpPendingMessages_MessagesPostedToStaThread_ReturnsCountProcessed`, `PumpPendingMessages_NoMessagesPosted_ReturnsZero`), and `WaitForWorkOrMessages` waking on a posted Windows message (`WaitForWorkOrMessages_WindowsMessagePosted_ReturnsForInputAvailable`) — exactly the "post a message and assert it is pumped" test the recommendation asks for. The alarm poll-loop lifecycle is covered by `MxAccessStaSessionTests.StartAsync_WithAlarmCommandHandlerFactory_PollOnceCalledViaSta` (start → poll runs on the STA) and `Dispose_StopsAlarmPollLoop` (Dispose joins the poll task; no further polls). The finding was raised against a stale view of the test project; no source or test change is required. Re-triaged as already resolved rather than fixed.
**Resolution:** 2026-05-18 — No code change. Re-triaged: the requested direct `StaMessagePump` tests (including posted-message dispatch and pump count) and the alarm poll-loop start/stop lifecycle tests already exist in `StaMessagePumpTests.cs` and `MxAccessStaSessionTests.cs`. See the re-triage note above for the specific test names.
### Worker-014
@@ -232,13 +236,13 @@
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:33`, `:202` |
| Status | Open |
| Status | Resolved |
**Description:** The file declares two public types — the `AlarmCommandHandler` class and the `IAlarmCommandHandler` interface. The C# style guide and the rest of the module follow one-public-type-per-file (e.g. interfaces in their own `I*.cs` files like `IMxAccessAlarmConsumer.cs`).
**Recommendation:** Move `IAlarmCommandHandler` to its own `IAlarmCommandHandler.cs` for consistency.
**Resolution:** _(open)_
**Resolution:** 2026-05-18 — The `IAlarmCommandHandler` interface (with its XML docs) was moved verbatim out of `AlarmCommandHandler.cs` into a new `src/MxGateway.Worker/MxAccess/IAlarmCommandHandler.cs`, with its own `using` directives (`System`, `System.Collections.Generic`, `MxGateway.Contracts.Proto`). `AlarmCommandHandler.cs` now declares one public type, matching the module's one-public-type-per-file convention (cf. `IMxAccessAlarmConsumer.cs`). Pure file-organization change — no API surface, behavior, or namespace changed; no test needed. The worker build is clean with zero warnings (no unused usings left behind in `AlarmCommandHandler.cs`).
### Worker-015
@@ -247,10 +251,10 @@
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessEventQueue.cs:115-145` |
| Status | Open |
| Status | Resolved |
**Description:** On overflow, `Enqueue` records the overflow fault and throws `MxAccessEventQueueOverflowException`; `MxAccessBaseEventSink.EnqueueEvent` catches it and calls `RecordFault` again. `RecordFault` is a no-op when a fault already exists, so the second call is harmless — but the intent is muddled, and there is no test asserting the dropped-event behavior. This is acceptable per the fail-fast design but undocumented at the call site.
**Recommendation:** Add a brief comment in `EnqueueEvent` clarifying that an overflow exception is expected and already self-records its fault, so the catch is intentionally a near no-op.
**Resolution:** _(open)_
**Resolution:** 2026-05-18 — Added a comment in `MxAccessBaseEventSink.EnqueueEvent`'s catch block (per the finding's recommendation) explaining that two distinct fail-fast failures land there: a conversion failure from `createEvent()` (recorded here as an `MxaccessEventConversionFailed` fault) and an `MxAccessEventQueueOverflowException` from `Enqueue` at capacity, which — per the fail-fast backpressure design in `docs/DesignDecisions.md` — drops the event and has *already* self-recorded a `QueueOverflow` fault inside `Enqueue`. Because `MxAccessEventQueue.RecordFault` keeps only the first fault, the catch's `RecordFault` call is then a deliberate near no-op rather than a second, conflicting fault. Pure comment change as recommended — no behavior altered. `docs/DesignDecisions.md` already documents the fail-fast event backpressure rule, so no doc change was required.
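The first-fault-wins behavior that makes the second `RecordFault` call harmless can be sketched as below. All types and fault names here are simplified stand-ins (strings instead of the real event and fault types, `InvalidOperationException` instead of `MxAccessEventQueueOverflowException`), and thread safety is deliberately ignored.

```csharp
using System;
using System.Collections.Generic;

public sealed class EventQueueSketch
{
    private readonly Queue<string> events = new Queue<string>();
    private readonly int capacity;

    public EventQueueSketch(int capacity)
    {
        this.capacity = capacity;
    }

    public string Fault { get; private set; }

    // Keeps only the first fault; later calls are no-ops.
    public void RecordFault(string fault)
    {
        if (Fault == null) Fault = fault;
    }

    // At capacity: self-records the overflow fault, drops the event, and throws.
    public void Enqueue(string evt)
    {
        if (events.Count >= capacity)
        {
            RecordFault("QueueOverflow");
            throw new InvalidOperationException("Event queue overflow.");
        }

        events.Enqueue(evt);
    }
}

public static class EventSinkSketch
{
    public static void EnqueueEvent(EventQueueSketch queue, Func<string> createEvent)
    {
        try
        {
            queue.Enqueue(createEvent());
        }
        catch (Exception)
        {
            // Two failures land here. A conversion failure from createEvent()
            // is recorded now; an overflow already recorded its own fault
            // inside Enqueue, so this call is then a deliberate no-op.
            queue.RecordFault("EventConversionFailed");
        }
    }
}
```

With a capacity of 1, the second enqueue leaves `Fault` as `"QueueOverflow"`, not `"EventConversionFailed"`, which is the property the clarified comment documents.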