Code-review 2026-05-20 sweep #2: re-review at a020350, resolve 48 findings

Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.

High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
  pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
  string (it must be a valid SPDX expression), so `pip wheel .` and
  `pip install -e .` both fail before any source compiles. Tests
  still pass because pytest bypasses the build backend via
  `pythonpath`. Dropped the invalid license string, kept the
  `License :: Other/Proprietary License` classifier, and added
  `tests/test_packaging.py` so a future regression of the same shape
  is caught in CI.

Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
  on WorkerPipeSessionOptions bounds the in-flight-command watchdog
  suppression so a truly stuck COM call still triggers StaHung
  instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
  cross-language bench comparison is apples-to-apples again;
  `failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
  serialisation pattern to DeployEventStream so close() arriving
  after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
  stability check after UnAdvise instead of strict equality against
  the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
  log sink the WriteSecured live test owns (worker stdout/stderr,
  gateway logs, direct WriteLine) so the credential is proven
  absent from the full output buffer, not just the diagnostic
  message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
  for the previously-uncovered Write2Bulk and WriteSecured2Bulk
  arms of WriteBulkConstraintPlan.SetPayload.

Lows (41 — highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
  GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
  AlarmsOptions validated at startup (Server-026); Authorization.md
  Constraint Enforcement snippet/prose enumerate the bulk write/read
  family (Server-027); bulk-read-commands and bulk-write-commands
  capability tokens added to OpenSession (Server-029);
  NotWiredAlarmRpcDispatcher XML doc and missing scope-resolver and
  state-machine tests cleaned up (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
  guard the poll path uses, at every command entry (Worker-024);
  RunAsync null-checks the runtime-session factory result
  (Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
  GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
  rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
  CancelCommandReturnValue serialised under lock (Worker.Tests-027);
  Probes namespace lifted to MxGateway.Worker.Tests.Probes
  (Worker.Tests-029); cancel-envelope sequence numbers monotonised
  (Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
  section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
  (Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
  test backed by a TaskCompletionSource fake (Tests-022); companion
  FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
  (Tests-023); constraint plan reply-count divergence pinned
  (Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
  end-to-end (IntegrationTests-018); abnormal-exit keyword set
  tightened to pipe-disconnected/end-of-stream and the test now
  asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
  default 30s wall-clock budget doesn't kill them (015);
  BenchStreamEventsAsync observes the inner stream task on every
  exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
  %w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
  RFC3339Nano with fractional seconds (019); runStreamEvents installs
  signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
  table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
  cancellation contract Client.Java-015 established (022); stream-events
  text path uses Long.toUnsignedString for worker_sequence (023);
  bench-read-bulk no longer pollutes success-latency histogram with
  failure durations (024); --shutdown-timeout CLI option propagates
  through to ClientOptions (025); seven new MxGatewayCliTests cover
  the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
  wheel-build smoke test added under tests/test_packaging.py (020);
  README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
  document the AsRef<str> read_bulk genericism (019);
  next_correlation_id re-exported at the crate root, with a
  property-style doc contract and an explicit disclaimer that the
  literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
  IConstraintEnforcer mechanism instead of "tag-allowlist filter"
  (014); BulkReadResult gains explicit per-arm payload-population
  documentation for the success vs failure cases (015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-20 10:28:54 -04:00
parent a0203503a7
commit 1aafd6bde4
74 changed files with 3349 additions and 395 deletions
+106 -1
@@ -5,7 +5,7 @@
| Module | `src/MxGateway.Worker.Tests` |
| Reviewer | Claude Code |
| Review date | 2026-05-20 |
-| Commit reviewed | `1cd51bb` |
+| Commit reviewed | `a020350` |
| Status | Reviewed |
| Open findings | 0 |
@@ -41,6 +41,21 @@
| 9 | Testing coverage | Issues found: Worker.Tests-017 (`WorkerCancel` envelope-dispatch path untested), Worker.Tests-022 (`WnWrapAlarmConsumer.PollOnce` transition-delta computation untested at the snapshot-to-transitions level). |
| 10 | Documentation & comments | Issues found: Worker.Tests-023 (`AlarmClientWmProbeTests` and `WnWrapConsumerProbeTests` are unit-test classes carrying 1000+ lines of probe-only code; their `[Fact(Skip=...)]` status is documented but the probe scaffolding is mixed into the same test assembly as regression tests). |
### 2026-05-20 re-review (commit `a020350`)
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No new issues — Worker.Tests-018/024 fixes hold; the new `WriteAsync_WithEmptyEnvelope_ThrowsInvalidEnvelopeFromValidator` correctly documents that the writer-side defensive zero-length branch is intercepted by `WorkerEnvelopeValidator.Validate`. |
| 2 | mxaccessgw conventions | Issues found: Worker.Tests-025 (`LiveMxAccessFactAttribute` duplicated in Worker.Tests and IntegrationTests with no shared constant — divergent-by-drift risk). |
| 3 | Concurrency & thread safety | Issues found: Worker.Tests-027 (`FakeRuntimeSession.CancelCommandReturnValue` mutated without the same `gate` lock that protects `cancelledCorrelationIds`/`snapshot`/`events`). |
| 4 | Error handling & resilience | No new issues — Worker.Tests-021 closed all three uncovered protocol branches. |
| 5 | Security | No new issues. |
| 6 | Performance & resource management | No new issues. |
| 7 | Design-document adherence | Issues found: Worker.Tests-028 (Worker.Tests-023 resolution promised a `docs/GatewayTesting.md` paragraph describing the probe surface; the doc was never updated, so the partition is invisible outside the source tree). |
| 8 | Code organization & conventions | Issues found: Worker.Tests-026 (`MxAccessSession.CreateForTesting` has no runtime guard preventing accidental production use — only the `internal` modifier plus `InternalsVisibleTo` separates it from the live `Create` path); Worker.Tests-029 (Probes moved to `Probes/` folder but kept the unit-test `MxGateway.Worker.Tests` namespace, so a namespace-based filter cannot distinguish probes from regression tests). |
| 9 | Testing coverage | No new issues — the five `LiveMxAccessFact`-gated tests in `MxAccessLiveComCreationTests` and the `ComputeTransitions` unit tests close the previously identified gaps. |
| 10 | Documentation & comments | Issues found: Worker.Tests-030 (`CreateCancelEnvelope` uses `Sequence = 4` while the immediately-following `CreateShutdownEnvelope` uses `Sequence = 3`; the cancel test writes them in 4-then-3 order, which works because the worker has no inbound sequence-monotonicity check — but the numbering is misleading to a future reader and contradicts the gateway-side monotonic-sequence convention `gateway.md` documents for outbound). |
## Findings
### Worker.Tests-001
@@ -402,3 +417,93 @@
**Recommendation:** Strengthen to `InvalidOperationException exception = Assert.Throws<InvalidOperationException>(...); Assert.Contains("simulated wnwrap subscribe failure", exception.Message)` — pin both the type and the originating message so a regression that throws a *different* `InvalidOperationException` from inside `AlarmCommandHandler` fails the test.
**Resolution:** 2026-05-20 — `Subscribe_WhenUnderlyingSubscribeThrows_DisposesConsumer` now captures the thrown exception and asserts `Assert.Contains("simulated wnwrap subscribe failure", exception.Message)` against the fake's exact thrown message. A regression that throws a *different* `InvalidOperationException` from inside `AlarmCommandHandler` (for example its own "already subscribed" guard at line 73 of `AlarmCommandHandler.cs`) now fails the message-contains assertion — the original test's type-only `Assert.Throws<InvalidOperationException>` would have passed silently while hiding the swallowed failure cause. The disposal assertion (`consumer.Disposed == true`) is unchanged; the test now pins both the disposal contract and the origin of the propagated exception. XML doc on the test method documents the regression scenario.
### Worker.Tests-025
| Field | Value |
|---|---|
| Severity | Low |
| Category | mxaccessgw conventions |
| Location | `src/MxGateway.Worker.Tests/TestSupport/LiveMxAccessFactAttribute.cs:23`, `src/MxGateway.IntegrationTests/IntegrationTestEnvironment.cs:5`, `src/MxGateway.IntegrationTests/LiveMxAccessFactAttribute.cs:9-12` |
| Status | Resolved |
**Description:** Worker.Tests-018 resolved the silent-skip issue by adding a Worker.Tests-local `LiveMxAccessFactAttribute`. The resolution called out that "introducing a cross-project shared assembly was not practical" because Worker.Tests targets net48/x86 and IntegrationTests targets net10.0. The two copies are correct today but the contract is held only by convention — both define `LiveMxAccessVariableName = "MXGATEWAY_RUN_LIVE_MXACCESS_TESTS"` as separate `public const string` literals, with the same `=="1"` `StringComparison.Ordinal` check duplicated. The IntegrationTests copy delegates to `IntegrationTestEnvironment.LiveMxAccessTestsEnabled`/`IsEnabled`, so any future opt-in tweak (e.g. accepting `"true"` as well, or honouring a different env-var name) made in `IntegrationTestEnvironment` will silently leave Worker.Tests behind. The XML doc on the Worker.Tests copy acknowledges this risk in prose but the divergence is invisible at compile time — there's no test or assertion that pins the two opt-in checks return the same answer.
**Recommendation:** Either (a) lift the env-var-name string into `MxGateway.Contracts` (which already multi-targets `net10.0;net48`) as a `public const string`, then both `LiveMxAccessFactAttribute` copies reference the same constant; (b) add a single unit test in Worker.Tests that pins `LiveMxAccessFactAttribute.LiveMxAccessVariableName == "MXGATEWAY_RUN_LIVE_MXACCESS_TESTS"` to make the contract literal-visible to any reviewer changing the name; (c) document the synchronization requirement in `docs/GatewayTesting.md` alongside the existing live-opt-in section.
**Resolution:** 2026-05-20 — Added `GatewayContractInfo.LiveMxAccessOptInVariableName` to `MxGateway.Contracts` (net10.0/net48-multi-targeted) and routed both `LiveMxAccessFactAttribute` copies plus `IntegrationTestEnvironment.LiveMxAccessVariableName` through that single constant; the env-var literal now lives in one place.
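The single-constant arrangement described in the resolution can be sketched as follows. This is a minimal illustration, not the repository's code: only `GatewayContractInfo.LiveMxAccessOptInVariableName`, the env-var literal, and the ordinal `"1"` comparison come from the finding; the `LiveMxAccessOptIn` helper and its `readVariable` parameter are hypothetical names introduced so the check can be exercised without touching the process environment.

```csharp
using System;

// Stand-in for the constant the resolution added to MxGateway.Contracts.
public static class GatewayContractInfo
{
    public const string LiveMxAccessOptInVariableName = "MXGATEWAY_RUN_LIVE_MXACCESS_TESTS";
}

// Hypothetical helper (not in the source): one opt-in check that both
// LiveMxAccessFactAttribute copies could route through.
public static class LiveMxAccessOptIn
{
    // readVariable is injected so the check can be tested deterministically.
    public static bool IsEnabled(Func<string, string> readVariable)
    {
        string value = readVariable(GatewayContractInfo.LiveMxAccessOptInVariableName);

        // Same rule the finding describes: exactly "1", ordinal comparison.
        return string.Equals(value, "1", StringComparison.Ordinal);
    }

    // Convenience overload that reads the real process environment.
    public static bool IsEnabled()
    {
        return IsEnabled(name => Environment.GetEnvironmentVariable(name));
    }
}
```

With the literal held in one place, a rename or a relaxed opt-in rule would need to change only this constant and this check, assuming both test projects reference the shared assembly.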
### Worker.Tests-026
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessSession.cs:74-88` |
| Status | Resolved |
**Description:** `MxAccessSession.CreateForTesting` (added in Worker.Tests-016) is declared `internal static`, gated only by `<InternalsVisibleTo Include="MxGateway.Worker.Tests" />` in `MxGateway.Worker.csproj`. The XML doc states "production code must use the `Create` factory", but there is no runtime enforcement. The protection rests on (1) the `internal` modifier — which silently widens if any future `InternalsVisibleTo` directive is added (e.g. for an integration-test shim, a benchmark project, or an `InternalsVisibleTo`-using analyzer); and (2) reviewer attention. Worker.Tests itself contains real STA-running test code (the live tests, the probes), so a future test in Worker.Tests could call `CreateForTesting` from a context that has a real MXAccess COM object and the `new object()` placeholder would silently substitute. The factory hands out a session with `mxAccessComObject = new object()` so any code that later goes through `Marshal.IsComObject` or `Marshal.FinalReleaseComObject` on it would simply return false / no-op, masking lifetime regressions.
**Recommendation:** Add a one-line runtime guard. `[Conditional("DEBUG")]` is not appropriate (the worker also ships Release builds), but the factory could check that `eventSink` is *not* an `MxAccessBaseEventSink` (the production sink) and throw `InvalidOperationException("CreateForTesting must not be used with the production MxAccessBaseEventSink")` if it is. Production code never passes that sink to a "for testing" factory, so the asymmetry is the cheapest signal. Alternatively, gate the factory with `[Obsolete("Test seam — never call from production code", error: false)]` so any production call surfaces as a build warning (and `TreatWarningsAsErrors` would turn that into a build break).
**Resolution:** 2026-05-20 — Added a runtime guard to `MxAccessSession.CreateForTesting` that throws `ArgumentException` when the supplied `eventSink` is an `MxAccessBaseEventSink` (the production sink), so any future caller wiring the live sink into the test factory fails fast instead of silently bypassing `Marshal.IsComObject` on the `new object()` placeholder.
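A guard of the shape the resolution describes might look like the sketch below. The real `MxAccessSession` signature is not shown in this review, so the `IMxAccessEventSink` interface, the `RecordingEventSink` test double, the constructor, and the single-argument factory are all assumptions made for illustration; only the `ArgumentException` on an `MxAccessBaseEventSink` and the `new object()` placeholder come from the finding.

```csharp
using System;

// Hypothetical stand-ins; the real types live in MxGateway.Worker.
public interface IMxAccessEventSink { }
public sealed class MxAccessBaseEventSink : IMxAccessEventSink { } // production sink
public sealed class RecordingEventSink : IMxAccessEventSink { }    // test double (assumed name)

public sealed class MxAccessSession
{
    private MxAccessSession(object mxAccessComObject, IMxAccessEventSink eventSink)
    {
        ComObject = mxAccessComObject;
        EventSink = eventSink;
    }

    internal object ComObject { get; }
    internal IMxAccessEventSink EventSink { get; }

    // Test seam: refuses the production sink so a live caller fails fast
    // instead of running against the placeholder COM object.
    internal static MxAccessSession CreateForTesting(IMxAccessEventSink eventSink)
    {
        if (eventSink is null)
        {
            throw new ArgumentNullException(nameof(eventSink));
        }

        if (eventSink is MxAccessBaseEventSink)
        {
            throw new ArgumentException(
                "CreateForTesting must not be used with the production MxAccessBaseEventSink.",
                nameof(eventSink));
        }

        // Placeholder, not a real COM object: Marshal.IsComObject would return false for it.
        return new MxAccessSession(new object(), eventSink);
    }
}
```

The type check costs one branch per call and does not depend on build configuration, which is presumably why it was preferred over a `[Conditional]`-style gate.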
### Worker.Tests-027
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Worker.Tests/TestSupport/FakeRuntimeSession.cs:174, 179-187` |
| Status | Resolved |
**Description:** The consolidated `FakeRuntimeSession` (introduced by Worker.Tests-014, extended for Worker.Tests-017) reads/writes `cancelledCorrelationIds`, `snapshot`, and `events` under `lock(gate)`. The new `CancelCommandReturnValue` (a `bool` set by the test) is mutated outside any lock and read inside `CancelCommand` outside the lock as well (`return CancelCommandReturnValue;` after the locked `cancelledCorrelationIds.Add`). For a plain `bool` set before the worker's message-loop runs this is harmless on x86 (atomic-on-aligned-write), but it contradicts the rest of the file's locking convention and a future test that flips `CancelCommandReturnValue` mid-dispatch from a different thread would see an undocumented race. The same applies to `BlockDispatch`, `ThrowAfterDispatchReleased`, `ThrowTimeoutOnShutdown`, and `Disposed` — all are `bool`/auto-property without the `gate` lock — but those existed before Worker.Tests-017 and the finding flags only the consistency drift the new property introduces.
**Recommendation:** Either (a) hold `lock(gate)` when reading `CancelCommandReturnValue` inside `CancelCommand`, matching the surrounding locked statement; (b) mark `CancelCommandReturnValue` with `volatile` to document the cross-thread visibility; or (c) add an XML-doc note stating the property must be set before `RunAsync` begins and is not safe to mutate mid-test. Option (c) is cheapest and matches how `BlockDispatch` is used today.
**Resolution:** 2026-05-20 — Converted `CancelCommandReturnValue` to a private-backing-field property whose get/set both hold `lock(gate)`, and folded the return statement of `CancelCommand` inside the existing locked block, so the property now respects the same locking convention as `cancelledCorrelationIds`, `snapshot`, and `events`.
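The locking pattern in the resolution can be illustrated with a reduced fake. This is a sketch under assumptions: the `CancelCommand(string)` signature, the `true` default, and the `CancelledCorrelationIds` snapshot accessor are hypothetical; the `gate` lock, the private backing field, and returning from inside the locked block follow the resolution text.

```csharp
using System.Collections.Generic;

// Reduced illustration of the fake; the real class carries more state.
public sealed class FakeRuntimeSession
{
    private readonly object gate = new object();
    private readonly List<string> cancelledCorrelationIds = new List<string>();
    private bool cancelCommandReturnValue = true; // default value is an assumption

    // Both accessors take the same lock that protects the collections.
    public bool CancelCommandReturnValue
    {
        get { lock (gate) { return cancelCommandReturnValue; } }
        set { lock (gate) { cancelCommandReturnValue = value; } }
    }

    // The record-and-return pair runs as one critical section, so a concurrent
    // setter cannot interleave between the Add and the read.
    public bool CancelCommand(string correlationId)
    {
        lock (gate)
        {
            cancelledCorrelationIds.Add(correlationId);
            return cancelCommandReturnValue;
        }
    }

    // Returns a copy so callers never enumerate the list outside the lock.
    public IReadOnlyList<string> CancelledCorrelationIds
    {
        get { lock (gate) { return cancelledCorrelationIds.ToArray(); } }
    }
}
```

Reading the backing field directly inside `CancelCommand` avoids re-entering the property, although C# monitors are re-entrant, so calling the property there would also be safe.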
### Worker.Tests-028
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `docs/GatewayTesting.md`, `src/MxGateway.Worker.Tests/Probes/` |
| Status | Resolved |
**Description:** The Worker.Tests-023 resolution (commit `a020350`) stated that option (b) was taken — moving the three probe files to `Probes/` — but the recommendation for option (b) was "move them into a `Probes/` subfolder inside `MxGateway.Worker.Tests` **and** add a one-paragraph header in `docs/GatewayTesting.md` describing the probe surface." The folder move was made; the documentation addition was not. `docs/GatewayTesting.md` has no mention of `Probes/`, `AlarmClientWmProbeTests`, `WnWrapConsumerProbeTests`, or `AlarmsLiveSmokeTests` (verified with `Grep` against the doc). A reader navigating `docs/GatewayTesting.md` to understand the testing surface cannot tell the probes exist, what they pin, or how to flip `Skip=null` on the dev rig — the only documentation is the in-source `Skip=...` strings and the per-probe XML doc.
**Recommendation:** Add a `## Dev-rig probes` (or similar) section to `docs/GatewayTesting.md` that names the three probe files, explains the probe contract (live AVEVA COM, `Skip=null` flip, no in-CI coverage), and points to the source location `src/MxGateway.Worker.Tests/Probes/`. One paragraph is enough; the existing `[Fact(Skip=...)]` strings carry the rest of the detail.
**Resolution:** 2026-05-20 — Added a `## Dev-rig Probes` section to `docs/GatewayTesting.md` between the Live MXAccess Smoke and Live Galaxy Repository sections; the new section names the three probe files (`AlarmsLiveSmokeTests`, `AlarmClientWmProbeTests`, `WnWrapConsumerProbeTests`), explains the probe contract (live AVEVA COM, `Skip=null` flip on the dev rig, not part of the regression contract), and points to the source location `src/MxGateway.Worker.Tests/Probes/`.
### Worker.Tests-029
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Worker.Tests/Probes/AlarmsLiveSmokeTests.cs:9`, `src/MxGateway.Worker.Tests/Probes/AlarmClientWmProbeTests.cs:14`, `src/MxGateway.Worker.Tests/Probes/WnWrapConsumerProbeTests.cs:10` |
| Status | Resolved |
**Description:** Worker.Tests-023 partitioned the probes by directory (`Probes/` subfolder) but kept their original namespace `namespace MxGateway.Worker.Tests;` rather than moving them to `namespace MxGateway.Worker.Tests.Probes;`. The folder/namespace mismatch is a minor C# convention drift (the project's other subfolder-grouped tests — `Bootstrap/`, `Conversion/`, `MxAccess/`, `Sta/`, `Ipc/`, `TestSupport/`, `Contracts/`, `ProjectStructure/` — all use a `MxGateway.Worker.Tests.<Subfolder>` namespace matching the directory). It also means an xUnit test filter like `--filter FullyQualifiedName~MxGateway.Worker.Tests.Probes` will discover zero tests, so the partition is invisible to the runner: any CI-side rule that wants to exclude probes still has to enumerate file/class names individually rather than match by namespace.
**Recommendation:** Move the three probe files to `namespace MxGateway.Worker.Tests.Probes;`. xUnit discovers by attribute, not by namespace, so the rename is behaviour-neutral and lets a `FullyQualifiedName~Probes` filter trivially target them. The two other consolidations introduced in this sweep (`TestSupport/` → `MxGateway.Worker.Tests.TestSupport`) already follow this pattern.
**Resolution:** 2026-05-20 — Moved `AlarmsLiveSmokeTests`, `AlarmClientWmProbeTests`, and `WnWrapConsumerProbeTests` to `namespace MxGateway.Worker.Tests.Probes;` so the folder and namespace match the project's other subfolder-grouped tests; a `FullyQualifiedName~MxGateway.Worker.Tests.Probes` filter now targets exactly the three probe classes. Verified by xUnit discovery output: the three probes appear under their new namespace as `[SKIP]`.
### Worker.Tests-030
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs:862-890` |
| Status | Resolved |
**Description:** Within `WorkerPipeSessionTests`, the inbound-envelope helpers assign `Sequence` values that are inconsistent with the order in which the tests send them: `CreateGatewayHelloEnvelope` is `Sequence = 1`, `CreateCommandEnvelope` is `Sequence = 2`, `CreateShutdownEnvelope` is `Sequence = 3`, and `CreateCancelEnvelope` is `Sequence = 4`. The Worker.Tests-017 cancel test sends the cancel (`Sequence = 4`) **before** the shutdown (`Sequence = 3`) — a future reader inspecting the wire trace will see decreasing sequence numbers. The test still passes because the worker has no inbound sequence-monotonicity check (verified by `Grep`ing `Ipc/` for `ValidateSequence`/`monotonic`/sequence-comparison patterns — none exist). But `gateway.md` documents monotonic sequence numbers on the outbound side, and the test's literal sequence values suggest a convention that isn't enforced and can mislead a debugger correlating a frame dump to test intent.
**Recommendation:** Either (a) reassign `CreateCancelEnvelope` to a sequence value `>` shutdown (or pass the sequence as a parameter, matching `CreateGatewayHelloEnvelope`'s parameter style), so the wire trace reads in ascending order; (b) add an XML-doc note on the cancel test stating that the worker has no inbound monotonicity check and the test ignores envelope sequence ordering; (c) parameterise all four helper methods so each test passes its desired sequence and the literal numbers stop carrying implicit meaning. Option (c) is the cleanest because `CreateGatewayHelloEnvelope` is already parameter-driven for nonce/version.
**Resolution:** 2026-05-20 — Took option (c): parameterised `CreateGatewayHelloEnvelope`/`CreateCommandEnvelope`/`CreateCancelEnvelope`/`CreateShutdownEnvelope` with a `ulong sequence` argument (defaults 1/2/2/3 respectively, matching the typical Hello/Command/Cancel/Shutdown ordering), so the literal sequence values no longer carry implicit meaning. Updated the cancel-correlation test's wire trace to ascend (Hello=1, Cancel=2, Shutdown=3) and added a comment noting that the worker has no inbound monotonicity check — the parameter exists so multi-frame tests can pin the trace ordering explicitly when needed.
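Option (c) can be sketched as below. The `Envelope` and `EnvelopeKind` types and the `IsAscending` helper are hypothetical simplifications (the real helpers build protocol envelopes whose shape is not shown here); the four helper names, the `ulong sequence` parameter, and the 1/2/2/3 defaults are taken from the resolution.

```csharp
using System.Collections.Generic;

// Hypothetical, simplified envelope; the real type carries a payload as well.
public enum EnvelopeKind { GatewayHello, Command, Cancel, Shutdown }

public sealed class Envelope
{
    public Envelope(EnvelopeKind kind, ulong sequence)
    {
        Kind = kind;
        Sequence = sequence;
    }

    public EnvelopeKind Kind { get; }
    public ulong Sequence { get; }
}

public static class EnvelopeHelpers
{
    // Defaults follow the resolution: Hello=1, Command=2, Cancel=2, Shutdown=3.
    public static Envelope CreateGatewayHelloEnvelope(ulong sequence = 1)
    {
        return new Envelope(EnvelopeKind.GatewayHello, sequence);
    }

    public static Envelope CreateCommandEnvelope(ulong sequence = 2)
    {
        return new Envelope(EnvelopeKind.Command, sequence);
    }

    public static Envelope CreateCancelEnvelope(ulong sequence = 2)
    {
        return new Envelope(EnvelopeKind.Cancel, sequence);
    }

    public static Envelope CreateShutdownEnvelope(ulong sequence = 3)
    {
        return new Envelope(EnvelopeKind.Shutdown, sequence);
    }

    // Hypothetical check a test could use to pin a strictly ascending wire trace.
    public static bool IsAscending(IReadOnlyList<Envelope> trace)
    {
        for (int i = 1; i < trace.Count; i++)
        {
            if (trace[i].Sequence <= trace[i - 1].Sequence)
            {
                return false;
            }
        }

        return true;
    }
}
```

A multi-frame test that needs a non-default ordering passes the sequence explicitly, for example `CreateCancelEnvelope(sequence: 3)` followed by `CreateShutdownEnvelope(sequence: 4)`, so the numbers in the trace state the test's intent directly.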