Code-review 2026-05-20 sweep #2: re-review at a020350, resolve 48 findings

Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.

High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
  pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
  string (it must be a valid SPDX expression), so `pip wheel .` and
  `pip install -e .` both fail before any source compiles. Tests
  still pass because pytest bypasses the build backend via
  `pythonpath`. Dropped the invalid license string, kept the
  `License :: Other/Proprietary License` classifier, and added
  `tests/test_packaging.py` so a future regression of the same shape
  is caught in CI.

Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
  on WorkerPipeSessionOptions bounds the in-flight-command watchdog
  suppression so a truly stuck COM call still triggers StaHung
  instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
  cross-language bench comparison is apples-to-apples again;
  `failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
  serialisation pattern to DeployEventStream so close() arriving
  after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
  stability check after UnAdvise instead of strict equality against
  the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
  log sink the WriteSecured live test owns (worker stdout/stderr,
  gateway logs, direct WriteLine) so the credential is proven
  absent from the full output buffer, not just the diagnostic
  message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
  for the previously-uncovered Write2Bulk and WriteSecured2Bulk
  arms of WriteBulkConstraintPlan.SetPayload.

Lows (41 — highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
  GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
  AlarmsOptions validated at startup (Server-026); Authorization.md
  Constraint Enforcement snippet/prose enumerate the bulk write/read
  family (Server-027); bulk-read-commands and bulk-write-commands
  capability tokens added to OpenSession (Server-029);
  NotWiredAlarmRpcDispatcher XML doc cleaned up and missing
  scope-resolver and state-machine tests added (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
  guard the poll path uses, at every command entry (Worker-024);
  RunAsync null-checks the runtime-session factory result
  (Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
  GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
  rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
  CancelCommandReturnValue serialised under lock (Worker.Tests-027);
  Probes namespace lifted to MxGateway.Worker.Tests.Probes
  (Worker.Tests-029); cancel-envelope sequence numbers monotonised
  (Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
  section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
  (Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
  test backed by a TaskCompletionSource fake (Tests-022); companion
  FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
  (Tests-023); constraint plan reply-count divergence pinned
  (Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
  end-to-end (IntegrationTests-018); abnormal-exit keyword set
  tightened to pipe-disconnected/end-of-stream and the test now
  asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
  default 30s wall-clock budget doesn't kill them (015);
  BenchStreamEventsAsync observes the inner stream task on every
  exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
  %w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
  RFC3339Nano with fractional seconds (019); runStreamEvents installs
  signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
  table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
  cancellation contract Client.Java-015 established (022); stream-events
  text path uses Long.toUnsignedString for worker_sequence (023);
  bench-read-bulk no longer pollutes success-latency histogram with
  failure durations (024); --shutdown-timeout CLI option propagates
  through to ClientOptions (025); seven new MxGatewayCliTests cover
  the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
  wheel-build smoke test added under tests/test_packaging.py (020);
  README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
  document the AsRef<str> read_bulk genericism (019);
  next_correlation_id re-exported at the crate root, with a
  property-style doc contract and an explicit disclaimer that the
  literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
  IConstraintEnforcer mechanism instead of "tag-allowlist filter"
  (014); BulkReadResult gains explicit per-arm payload-population
  documentation for the success vs failure cases (015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-20 10:28:54 -04:00
parent a0203503a7
commit 1aafd6bde4
74 changed files with 3349 additions and 395 deletions
+83 -12
@@ -5,26 +5,34 @@
| Module | `src/MxGateway.Worker` |
| Reviewer | Claude Code |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Commit reviewed | `a020350` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
This row reflects the 2026-05-20 re-review at commit `1cd51bb`. Worker-001..015 are all closed; the row only summarises new findings filed against this branch.
This row reflects the 2026-05-20 re-review at commit `a020350`. Worker-001..022 are all closed; the row only summarises new findings filed against this commit. The prior pass's fixes for Worker-016..022 were verified sound:
- **Worker-016**: `StaRuntimeShutdownException` exists, `MxAccessStaSession.cs:261` is the only `catch (StaRuntimeShutdownException)` site in the module. No accidental catch elsewhere (grep verified). The graceful-shutdown vs. STA-affinity-violation distinction holds.
- **Worker-017**: `ReportWatchdogFaultIfNeededAsync` returns early when `CurrentCommandCorrelationId` is non-empty. Sound for the slow-but-progressing case; but see **Worker-023** — there is no defensive ceiling, so a truly stuck command (synchronous COM call hung against a dead MXAccess provider) leaves `CurrentCommandCorrelationId` non-empty forever and the worker-side watchdog is permanently suppressed.
- **Worker-018**: `SetXmlAlarmQuery` is now wrapped in `try/catch (COMException)` and re-thrown as `InvalidOperationException` carrying the HRESULT. Sound.
- **Worker-019**: `subscriptionExpression` field is gone.
- **Worker-020**: `_state is not WorkerState.Ready and not WorkerState.ExecutingCommand` simplified to `_state != WorkerState.Ready`. Confirmed `_state` is never assigned `ExecutingCommand`; volatile reads are atomic.
- **Worker-021**: `_runtimeSession ??=` in `InitializeMxAccessAsync` preserves a factory-supplied session. Confirmed `RunAsync` path bypasses `InitializeMxAccessAsync` entirely (it passes its own factory-driven lambda), so the `??=` only runs on the legacy parameterless-`CompleteStartupHandshakeAsync` direct-invocation path.
- **Worker-022**: `MxAlarmSnapshot.cs` (now containing only `MxAlarmSnapshotRecord`), `MxAlarmStateKind.cs`, `MxAlarmTransitionEvent.cs` — filenames match their single public type; all three keep the `MxGateway.Worker.MxAccess` namespace.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Worker-018 (`SetXmlAlarmQuery` return code ignored), Worker-019 (`subscriptionExpression` is write-only dead state), Worker-020 (dead `ExecutingCommand` arm in `ProcessCommandAsync` state check), Worker-021 (`InitializeMxAccessAsync` can overwrite an already-set `_runtimeSession`). |
| 2 | mxaccessgw conventions | Issue found: Worker-022 (`MxAlarmSnapshot.cs` declares three public types in one file). |
| 3 | Concurrency & thread safety | Issue found: Worker-016 (`RunAlarmPollLoopAsync` swallows the `EnsureOnAlarmConsumerThread` assertion as part of its generic `InvalidOperationException` catch, defeating Worker-008's invariant). |
| 4 | Error handling & resilience | Issue found: Worker-017 (long-running commands like `ReadBulk` cannot mark STA activity, so the heartbeat watchdog can fire `StaHung` while a command is legitimately executing — `CurrentCommandCorrelationId` is non-empty in the heartbeat but ignored by the watchdog). |
| 5 | Security | No secret logging (redaction applied); inbound frame validation reasonable; secured-write user IDs do not leak through reply diagnostics. No new issues found. |
| 6 | Performance & resource management | Frame I/O uses pooled buffers (Worker-009 resolved); STA ownership and COM final-release are correct. No new issues found. |
| 7 | Design-document adherence | Code matches `gateway.md` / `MxAccessWorkerInstanceDesign.md` / `WorkerFrameProtocol.md`. No new design drift. |
| 8 | Code organization & conventions | Issue found: Worker-022 (see row 2). |
| 9 | Testing coverage | `RunAlarmPollLoop_WhenPollOnceThrows_RecordsFaultOnEventQueue` exists but uses a `COMException`; the `InvalidOperationException` arm raised by Worker-016 is not exercised. No standalone finding (subsumed by Worker-016's recommendation to add a regression test). |
| 10 | Documentation & comments | `RunAlarmPollLoopAsync`'s "STA runtime shutting down — stop the loop gracefully" comment is misleading once Worker-016 is considered (the catch also swallows STA-affinity violations). Noted in Worker-016. |
| 1 | Correctness & logic bugs | Issue found: Worker-025 (`RunAsync` does not null-check the result of `_runtimeSessionFactory()`; a null factory return would NRE on `_runtimeSession.StartAsync(...)` rather than throw a diagnostic exception). |
| 2 | mxaccessgw conventions | No issues found. The split alarm-snapshot files match the one-public-type-per-file convention; namespace consistency verified. |
| 3 | Concurrency & thread safety | Issue found: Worker-024 (the alarm command path — `Subscribe`/`Acknowledge`/`AcknowledgeByName`/`QueryActive`/`Unsubscribe` — has no STA-affinity assertion equivalent to Worker-008's `EnsureOnAlarmConsumerThread` guard; only the alarm *poll* path enforces affinity, leaving a latent gap if a future refactor lets alarm commands run off-STA). |
| 4 | Error handling & resilience | Issue found: Worker-023 (Worker-017's watchdog skip has no defensive ceiling; a truly stuck command — synchronous COM hung against a dead MXAccess provider — keeps `CurrentCommandCorrelationId` non-empty indefinitely, and the worker-side `StaHung` watchdog never fires. Gateway-side `CommandTimeout` is the only safety net). |
| 5 | Security | No issues found. No secret logging on the alarm path; the dropped-reply diagnostic Worker-003 added logs only the correlation id and command method, not the command payload. |
| 6 | Performance & resource management | No new issues found. Frame I/O still uses pooled buffers (Worker-009); STA join timeouts in `Dispose` are bounded. |
| 7 | Design-document adherence | No new design drift. The split alarm files preserve the documented public API surface. Worker-017's resolution comment documents the watchdog design intent — though see Worker-023 for the documentation gap on truly-stuck commands. |
| 8 | Code organization & conventions | No issues found. Worker-022 was the last file-organization issue. |
| 9 | Testing coverage | Worker-016 and Worker-017 each have direct regression tests (`RunAlarmPollLoop_WhenPollOnceThrowsInvalidOperation_RecordsFaultOnEventQueue`, `RunAsync_WhenStaActivityIsStaleWithCommandInFlight_DoesNotWriteWatchdogFault`). Worker-018, -020, -021's resolution notes state "no new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent" — Worker-018's `SetXmlAlarmQuery` failure-translation and Worker-020's simplified `_state != Ready` check have no regression test in this branch yet. No standalone finding — these are documented gaps in the resolution notes of the prior pass. |
| 10 | Documentation & comments | No new issues. Worker-017's XML doc on `ReportWatchdogFaultIfNeededAsync` documents the design intent clearly; the `_runtimeSession ??=` reasoning is documented inline; Worker-016's graceful-vs-affinity distinction is documented at both catch sites. |
## Findings
@@ -367,3 +375,66 @@ This row reflects the 2026-05-20 re-review at commit `1cd51bb`. Worker-001..015
**Recommendation:** Move `MxAlarmStateKind` and `MxAlarmTransitionEvent` into their own files (`MxAlarmStateKind.cs`, `MxAlarmTransitionEvent.cs`) and leave `MxAlarmSnapshotRecord` in `MxAlarmSnapshot.cs` (or rename the file to `MxAlarmSnapshotRecord.cs` to match the surviving type). Pure file-organization change; no behaviour or namespace impact.
**Resolution:** 2026-05-20 — Split `MxAlarmSnapshot.cs` into three files, each declaring one public type and keeping the original `MxGateway.Worker.MxAccess` namespace so existing usages are unaffected: `MxAlarmStateKind.cs` (the enum, with its XML doc), `MxAlarmTransitionEvent.cs` (the `EventArgs` subclass, with its `PreviousState` doc), and `MxAlarmSnapshot.cs` (now containing only `MxAlarmSnapshotRecord` plus its XML doc). Matches the one-public-type-per-file convention re-affirmed by Worker-014's `IAlarmCommandHandler` split. Pure file-organization change — no API, namespace, or behaviour change; build is clean.
### Worker-023
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:610-668`, `src/MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:124-153` |
| Status | Resolved |
**Description:** Worker-017 (resolved at `a020350`) suppresses the `StaHung` watchdog when `CurrentCommandCorrelationId` is non-empty: "the STA is busy executing a command, not hung." The fix is correct for the motivating case (legitimately slow `ReadBulk` against many uncached tags): gateway-side per-command timeouts (`WorkerClient.InvokeAsync`'s `timeout` parameter, see `src/MxGateway.Server/Workers/WorkerClient.cs:189-218`) eventually fail the command and may kill the worker. **But the suppression has no defensive ceiling.** Most MXAccess commands in `MxAccessCommandExecutor` (`Register`, `AddItem`, `Advise`, `Write`, `WriteSecured`, and their bulk variants) call directly into the MXAccess COM object **with no internal deadline**. If a COM call hangs (e.g. the MXAccess provider crashed and the cross-apartment marshaler is permanently blocked, or a write completion never fires), `StaRuntime.ProcessQueuedCommands` is stuck inside `workItem.Execute()`, `StaCommandDispatcher.currentCommandCorrelationId` stays non-empty forever, and `ReportWatchdogFaultIfNeededAsync` will short-circuit on every heartbeat. The worker-side `StaHung` watchdog, the only signal that distinguishes a hung STA from a slow gateway response from inside the worker, is permanently defeated for that session. Gateway-side `CommandTimeout` is the safety net, but it depends on the gateway operator picking a sensible per-command timeout (some bulk operations legitimately set this to many minutes), and it does not surface a worker-originated diagnostic (`StaHung` fault category, `LastStaActivityUtc` value) to the gateway audit trail.
**Recommendation:** Add a defensive upper bound, distinct from `HeartbeatGrace`, after which the watchdog fires even when a command is in flight — e.g. `HeartbeatStuckCeiling` (default 5× `HeartbeatGrace` = 75s, or align with the longest reasonable per-command timeout). Pseudocode for the in-flight branch:
```csharp
if (!string.IsNullOrEmpty(snapshot.CurrentCommandCorrelationId)
    && staleFor <= _sessionOptions.HeartbeatStuckCeiling)
{
    return; // slow command: gateway will time out if needed
}
// staleFor > ceiling OR no command in flight: fire StaHung
```
Document the ceiling in `MxAccessWorkerInstanceDesign.md`'s watchdog section. Add a regression test that drives `RunAsync` with `CurrentCommandCorrelationId` non-empty and `LastStaActivityUtc` stale beyond the ceiling, asserting `WorkerFaultCategory.StaHung` is emitted.
**Resolution:** 2026-05-20 — Added `WorkerPipeSessionOptions.HeartbeatStuckCeiling` (default 75s = 5 × `HeartbeatGrace`) and extended `WorkerPipeSession.ReportWatchdogFaultIfNeededAsync` so the in-flight-command suppression is bounded by the ceiling: once `staleFor > HeartbeatStuckCeiling` the watchdog fires `StaHung` even with `CurrentCommandCorrelationId` non-empty. A truly stuck synchronous COM call (dead provider, blocked marshaler) no longer permanently defeats the worker-side watchdog. The ceiling is validated at startup (`> 0` and `> HeartbeatGrace`). Documented in the new XML doc on `HeartbeatStuckCeiling` and in `docs/MxAccessWorkerInstanceDesign.md`'s "Heartbeat And Watchdog" section. Regression test `WorkerPipeSessionTests.RunAsync_WhenStaActivityIsStaleBeyondCeilingWithCommandInFlight_WritesWatchdogFault` drives `RunAsync` with a non-empty current-command id and stale activity beyond the ceiling, asserting `WorkerFaultCategory.StaHung` is emitted. The existing `RunAsync_WhenStaActivityIsStaleWithCommandInFlight_DoesNotWriteWatchdogFault` test (5s stale, default 75s ceiling) continues to pass, confirming the suppression still works within the ceiling.
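The bounded suppression described above can be modelled as a pure decision function. This is a minimal sketch, not the code in `WorkerPipeSession`: the `WatchdogPolicySketch` class and its `ShouldFireStaHung` method are hypothetical names, and it assumes the watchdog's base trigger is `staleFor > HeartbeatGrace`, which the finding implies but does not state.

```csharp
using System;

// Hypothetical helper for illustration only; not part of MxGateway.Worker.
public static class WatchdogPolicySketch
{
    // Returns true when the StaHung watchdog should fire.
    public static bool ShouldFireStaHung(
        TimeSpan staleFor,
        bool commandInFlight,
        TimeSpan heartbeatGrace,
        TimeSpan heartbeatStuckCeiling)
    {
        // Mirrors the startup validation in the resolution:
        // the ceiling must be > 0 and > HeartbeatGrace.
        if (heartbeatStuckCeiling <= TimeSpan.Zero
            || heartbeatStuckCeiling <= heartbeatGrace)
        {
            throw new ArgumentOutOfRangeException(nameof(heartbeatStuckCeiling));
        }

        // Assumption: recent STA activity never trips the watchdog.
        if (staleFor <= heartbeatGrace)
        {
            return false;
        }

        // Slow command within the ceiling: stay suppressed and let the
        // gateway-side timeout decide.
        if (commandInFlight && staleFor <= heartbeatStuckCeiling)
        {
            return false;
        }

        // Either no command is in flight, or the ceiling was exceeded.
        return true;
    }
}
```

With the documented defaults (`HeartbeatGrace` = 15s, ceiling = 75s), a command in flight with 30s of staleness stays suppressed, while the same command at 80s fires `StaHung`.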
### Worker-024
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:63-187`, `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:191-323` |
| Status | Resolved |
**Description:** Worker-008 (resolved 2026-05-18) introduced `MxAccessStaSession.AssertOnAlarmConsumerThread(int?, int)`, called from `EnsureOnAlarmConsumerThread()` in the marshalled poll lambda at `RunAlarmPollLoopAsync` (`MxAccessStaSession.cs:247`). The assertion catches a regression that runs `IMxAccessAlarmConsumer.PollOnce()` off the STA, which is exactly the deadlock-on-cross-apartment-marshaling risk that the `ThreadingModel=Apartment` wnwrap consumer carries. **However, the assertion guards only the poll path.** `AlarmCommandHandler.Subscribe`, `Acknowledge`, `AcknowledgeByName`, `QueryActive`, and `Unsubscribe`, each of which calls into the same `IMxAccessAlarmConsumer` and ultimately the COM object, have no equivalent guard. Today they are reached only through `MxAccessCommandExecutor.Execute` → `StaCommandDispatcher.ExecuteQueuedCommandAsync` → `staRuntime.InvokeAsync(...)`, so they do run on the STA in production. But the invariant is enforced only by *convention* (the same convention Worker-008 made explicit for `PollOnce`); a future change that lets a test or a new fast path call into the handler off-STA would silently break the same apartment rule, and the wnwrap COM call would block on marshaling rather than fail loudly.
**Recommendation:** Add an `EnsureOnAlarmConsumerThread()`-equivalent assertion at the entry of each `AlarmCommandHandler` operation that touches the consumer (`Subscribe` is the highest-value site because it constructs the consumer; `Acknowledge*` and `QueryActive` next). Reuse `MxAccessStaSession.AssertOnAlarmConsumerThread` so the affinity invariant has a single canonical guard. Wire the expected thread id through the handler's constructor (today `AlarmCommandHandler` does not know the STA thread id — `MxAccessStaSession` captures it at line 191 but does not pass it). One implementation shape: hand the handler a small `IThreadAffinityGuard` whose `Verify()` is called at each entry, constructed by `MxAccessStaSession` once `alarmConsumerThreadId` is captured.
**Resolution:** 2026-05-20 — Extended `AlarmCommandHandler` with a third constructor that takes an optional `Action? threadAffinityCheck`, and invoked the guard at the entry of every method that touches the underlying `IMxAccessAlarmConsumer`: `Subscribe`, `Unsubscribe`, `Acknowledge`, `AcknowledgeByName`, `QueryActive`, and `PollOnce`. The factory signature on `MxAccessStaSession` was widened from `Func<MxAccessEventQueue, IAlarmCommandHandler>` to `Func<MxAccessEventQueue, Action, IAlarmCommandHandler>`, so `MxAccessStaSession` (which captures `alarmConsumerThreadId` at the factory call site, already running inside `staRuntime.InvokeAsync`) can pass its existing `EnsureOnAlarmConsumerThread` as the guard — keeping the affinity invariant on a single canonical check, `AssertOnAlarmConsumerThread`. `WorkerPipeSession`'s three factory wiring sites were updated to `(eq, affinity) => new AlarmCommandHandler(eq, () => new WnWrapAlarmConsumer(), affinity)`. The previous two-arg `AlarmCommandHandler` constructor remains (now delegating with `threadAffinityCheck: null`) so existing `AlarmCommandHandlerTests` continue to exercise the handler on a single thread without configuring a guard. Regression tests `AlarmCommandHandlerTests.EveryCommandPathEntry_InvokesThreadAffinityGuard` (counts invocations across all six entry points) and `EveryCommandPathEntry_PropagatesAffinityGuardException` (a throwing guard propagates from every entry point) verify the wiring.
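The guard wiring can be illustrated with a reduced model. This is a sketch under stated assumptions, not the actual handler: `GuardedHandlerSketch` and `AffinityGuardSketch` are hypothetical stand-ins, the two methods return placeholder strings instead of calling a COM consumer, and the guard compares managed thread ids, which may differ from how `AssertOnAlarmConsumerThread` performs its check.

```csharp
using System;

// Hypothetical stand-in for AlarmCommandHandler; only the guard wiring is modelled.
public sealed class GuardedHandlerSketch
{
    private readonly Action? _threadAffinityCheck;

    // A null guard keeps single-threaded tests working without configuration.
    public GuardedHandlerSketch(Action? threadAffinityCheck = null)
    {
        _threadAffinityCheck = threadAffinityCheck;
    }

    public string Subscribe(string query)
    {
        _threadAffinityCheck?.Invoke(); // guard runs before any consumer call
        return "subscribed:" + query;
    }

    public string Acknowledge(string alarmId)
    {
        _threadAffinityCheck?.Invoke();
        return "acknowledged:" + alarmId;
    }
}

// Hypothetical guard factory for illustration only.
public static class AffinityGuardSketch
{
    // Captures the calling thread's id; the returned guard throws when it is
    // later invoked from any other thread.
    public static Action CaptureCurrentThread()
    {
        int expected = Environment.CurrentManagedThreadId;
        return () =>
        {
            int actual = Environment.CurrentManagedThreadId;
            if (actual != expected)
            {
                throw new InvalidOperationException(
                    $"Alarm consumer call on thread {actual}; expected thread {expected}.");
            }
        };
    }
}
```

A handler built with `AffinityGuardSketch.CaptureCurrentThread()` serves calls on the capturing thread and throws `InvalidOperationException` from any other thread, so an off-STA call fails loudly instead of blocking on marshaling.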
### Worker-025
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:111-117` |
| Status | Resolved |
**Description:** `RunAsync` assigns `_runtimeSession = _runtimeSessionFactory()` (line 111) and immediately dereferences `_runtimeSession.StartAsync(...)` inside the lambda at line 115. If the supplied factory ever returns `null`, the lambda will throw `NullReferenceException` rather than a diagnostic exception, and the `finally` block at line 128 (`_runtimeSession?.Dispose()`) silently no-ops. The production factories (`() => new MxAccessStaSession(...)` in the two convenience constructors) never return null, but the factory delegate type `Func<IWorkerRuntimeSession>` admits null returns and the constructor's `runtimeSessionFactory ?? throw` null-check at line 102 only validates the delegate itself, not its return value. The `InitializeMxAccessAsync` direct-invocation path uses `_runtimeSession ??= new MxAccessStaSession(...)` (line 840), so a null factory return there would be replaced with a default instance — different behavior from the `RunAsync` path.
**Recommendation:** Promote the null check to the call site:
```csharp
_runtimeSession = _runtimeSessionFactory()
    ?? throw new InvalidOperationException("Worker runtime session factory returned null.");
```
Match the pattern `AlarmCommandHandler.Subscribe` already uses for `consumerFactory()` (`AlarmCommandHandler.cs:76-77`).
**Resolution:** 2026-05-20 — `WorkerPipeSession.RunAsync` now uses `_runtimeSession = _runtimeSessionFactory() ?? throw new InvalidOperationException("Worker runtime session factory returned null.");`, matching the pattern `AlarmCommandHandler.Subscribe` uses for its `consumerFactory()`. A null factory return now produces a clear diagnostic exception at the call site instead of NRE-ing on the next dereference (and the `finally` block's `_runtimeSession?.Dispose()` silently no-oping on a half-initialized session). Regression test `WorkerPipeSessionTests.RunAsync_WhenRuntimeSessionFactoryReturnsNull_ThrowsDiagnosticException` drives `RunAsync` with `() => null!` and asserts the diagnostic `InvalidOperationException` is thrown with the expected message.