Code-review 2026-05-20 sweep #2: re-review at a020350, resolve 48 findings

Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.

High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
  pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
  string (it must be a valid SPDX expression), so `pip wheel .` and
  `pip install -e .` both fail before any source compiles. Tests
  still pass because pytest bypasses the build backend via
  `pythonpath`. Dropped the invalid license string, kept the
  `License :: Other/Proprietary License` classifier, and added
  `tests/test_packaging.py` so a future regression of the same shape
  is caught in CI.

Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
  on WorkerPipeSessionOptions bounds the in-flight-command watchdog
  suppression so a truly stuck COM call still triggers StaHung
  instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
  cross-language bench comparison is apples-to-apples again;
  `failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
  serialisation pattern to DeployEventStream so close() arriving
  after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
  stability check after UnAdvise instead of strict equality against
  the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
  log sink the WriteSecured live test owns (worker stdout/stderr,
  gateway logs, direct WriteLine) so the credential is proven
  absent from the full output buffer, not just the diagnostic
  message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
  for the previously-uncovered Write2Bulk and WriteSecured2Bulk
  arms of WriteBulkConstraintPlan.SetPayload.

Lows (41 — highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
  GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
  AlarmsOptions validated at startup (Server-026); Authorization.md
  Constraint Enforcement snippet/prose enumerate the bulk write/read
  family (Server-027); bulk-read-commands and bulk-write-commands
  capability tokens added to OpenSession (Server-029);
  NotWiredAlarmRpcDispatcher XML doc and missing scope-resolver and
  state-machine tests cleaned up (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
  guard the poll path uses, at every command entry (Worker-024);
  RunAsync null-checks the runtime-session factory result
  (Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
  GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
  rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
  CancelCommandReturnValue serialised under lock (Worker.Tests-027);
  Probes namespace lifted to MxGateway.Worker.Tests.Probes
  (Worker.Tests-029); cancel-envelope sequence numbers monotonised
  (Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
  section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
  (Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
  test backed by a TaskCompletionSource fake (Tests-022); companion
  FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
  (Tests-023); constraint plan reply-count divergence pinned
  (Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
  end-to-end (IntegrationTests-018); abnormal-exit keyword set
  tightened to pipe-disconnected/end-of-stream and the test now
  asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
  default 30s wall-clock budget doesn't kill them (015);
  BenchStreamEventsAsync observes the inner stream task on every
  exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
  %w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
  RFC3339Nano with fractional seconds (019); runStreamEvents installs
  signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
  table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
  cancellation contract Client.Java-015 established (022); stream-events
  text path uses Long.toUnsignedString for worker_sequence (023);
  bench-read-bulk no longer pollutes success-latency histogram with
  failure durations (024); --shutdown-timeout CLI option propagates
  through to ClientOptions (025); seven new MxGatewayCliTests cover
  the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
  wheel-build smoke test added under tests/test_packaging.py (020);
  README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
  document the AsRef<str> read_bulk genericism (019);
  next_correlation_id re-exported at the crate root, with a
  property-style doc contract and an explicit disclaimer that the
  literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
  IConstraintEnforcer mechanism instead of "tag-allowlist filter"
  (014); BulkReadResult gains explicit per-arm payload-population
  documentation for the success vs failure cases (015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-20 10:28:54 -04:00
parent a0203503a7
commit 1aafd6bde4
74 changed files with 3349 additions and 395 deletions
+42 -12
View File
@@ -5,26 +5,26 @@
| Module | `src/MxGateway.Contracts` |
| Reviewer | Claude Code |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Commit reviewed | `a020350` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
This re-review focuses on the contract delta introduced since the prior review at `6c64030` — primarily the new bulk write/read command family added in `5e375f6` (`WriteBulk`, `Write2Bulk`, `WriteSecuredBulk`, `WriteSecured2Bulk`, `ReadBulk`) plus the resolution changes for Contracts-001/002/004/005/006/007/008.
This re-review covers the Contracts module at `a020350`, after Contracts-009 through Contracts-013 (plus Client.Rust-013's proto comment reformat on `ReadBulkCommand`) were resolved against `1cd51bb`. The Contracts source under review is unchanged from `1cd51bb` apart from the documentation-only updates introduced by `a020350`; the pass re-checks every category on the bulk write/read family, the alarm reply surface, and the GalaxyRepository contract that were the target of those resolutions.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | New bulk command kinds, `BulkWriteResult`, and `BulkReadResult` align with the worker executor, validator (`MxAccessGrpcRequestValidator.ExpectedPayload`), and `MxAccessSession.ReadBulk`. Field numbering is contiguous and additive (10-43 on `MxCommand.payload`, 20-40 on `MxCommandReply.payload`); no collisions. No new functional bugs. |
| 2 | mxaccessgw conventions | Additive-only evolution preserved across all three protos; new wire-compatibility policy comment block (added under Contracts-005) is honored by the bulk additions; generated code untouched; naming and oneof usage are consistent with the style guide. No new violations. |
| 3 | Concurrency & thread safety | N/A — pure contract definitions plus a static const class with no shared mutable state. |
| 4 | Error handling & resilience | `BulkWriteResult` carries the full `was_successful` + `hresult` + `statuses` + `error_message` carriers per entry; `BulkReadResult` carries `was_successful` + `was_cached` + per-entry value and statuses. The asymmetry (no `hresult` on `BulkReadResult`) is intentional given ReadBulk's lifecycle. No issues. |
| 5 | Security | The new `WriteSecuredBulkCommand` / `WriteSecured2BulkCommand` carry the redaction note on the outer command only, not on the inner entry's `value` field (Contracts-011); otherwise no secrets forced into loggable shapes. |
| 6 | Performance & resource management | `ReadBulk` is the only command without a 1:1 MXAccess analogue; the per-entry timeout shape (`uint32 timeout_ms`) and `was_cached` semantics avoid disturbing existing subscriptions. No bloat issues. |
| 7 | Design-document adherence | `gateway.md` documents the bulk write/read families, but `docs/Contracts.md` was not updated for them (Contracts-009). This violates the CLAUDE.md "update docs in the same commit as the source" rule for the bulk-read/write addition. |
| 8 | Code organization & conventions | Package / namespace / file layout correct; additive-only contract evolution observed; field numbers continuous and isolated by 100+ from diagnostic/control commands. No new issues. |
| 9 | Testing coverage | The bulk write/read families have no `ProtobufContractRoundTripTests` coverage (Contracts-010); Galaxy Repository protos and `MxArray` raw paths are now covered (per Contracts-007 resolution). |
| 10 | Documentation & comments | `GalaxyAttribute.mx_data_type` lacks an in-proto comment explaining it is a raw Galaxy integer (Contracts-012); the `GatewayContractInfoTests` summary is now stale (Contracts-013); credential-sensitive bulk entry `value` fields lack per-field redaction comments (Contracts-011). |
| 1 | Correctness & logic bugs | Bulk command kinds, `BulkWriteResult`, and `BulkReadResult` align with the worker executor (`MxAccessSession.ReadBulk` / `ExecuteBulkWriteEntry`), the gateway server-side filter (`MxAccessGatewayService.ReplaceWriteBulkEntries`), the validator (`MxAccessGrpcRequestValidator.ExpectedPayload`, covering every new kind), and the round-trip tests added under Contracts-010. Field numbering across all three protos remains additive and contiguous — `MxCommand.payload` 10-43 + 100-104, `MxCommandReply.payload` 20-40 + 100-102, `MxCommandKind` 0-34 + 100-104, `WorkerEnvelope.body` 10-20 — with no number reused or repurposed. No new functional bugs. |
| 2 | mxaccessgw conventions | Wire-compatibility policy comment blocks (Contracts-005 resolution) are present at the top of all three `.proto` files and the bulk additions honour them — every change since the prior review is additive. Generated code under `Generated/` is untouched. Naming, `snake_case` field names, `PascalCase` messages, enum-prefix discipline, oneof usage for command/reply/value/event/envelope, and the credential-sensitivity comments per the ProtobufStyleGuide are all consistent. No new violations. |
| 3 | Concurrency & thread safety | N/A — pure contract definitions plus a static constants class (`GatewayContractInfo`) with no shared mutable state. |
| 4 | Error handling & resilience | `BulkWriteResult` carries `was_successful` + `optional int32 hresult` + `repeated MxStatusProxy statuses` + `error_message` per entry; `BulkReadResult` carries `was_successful` + `was_cached` + per-entry `value`/`quality`/`source_timestamp`/statuses/`error_message`. The deliberate absence of `hresult` on `BulkReadResult` is pinned by `ProtobufContractRoundTripTests.BulkReadReply_RoundTripsCachedAndSnapshotResults` (descriptor assertion) and matches the documented "ReadBulk outcomes are timeout / cache / lifecycle states, not MXAccess COM return codes" rationale. The `AcknowledgeAlarmReply.status` reservation comment (Contracts-008) and the by-name ack reuse comment (Contracts-002) keep ack outcome handling unambiguous. No new issues. |
| 5 | Security | The single-item and bulk `WriteSecured` / `WriteSecured2` paths now carry the credential-sensitivity comment on both the outer command (`WriteSecuredBulkCommand` / `WriteSecured2BulkCommand`) and each entry's `value` field (Contracts-011 resolution). `AuthenticateUserCommand.verify_user_password` carries the same redaction note. No new secret-leak surfaces. |
| 6 | Performance & resource management | `ReadBulk` is still the only command without a 1:1 MXAccess analogue; the per-tag `timeout_ms` cap and `was_cached` short-circuit prevent disturbing existing subscriptions. `BulkWriteReply` / `BulkReadReply` are flat repeated lists with no nested pagination machinery, matching the "one round-trip per batch" Bulk Command Family decision. No bloat issues. |
| 7 | Design-document adherence | `gateway.md`, `docs/Contracts.md` (Contracts-009 resolution), `docs/DesignDecisions.md` (Bulk Command Family), and `docs/AlarmClientDiscovery.md` (Contracts-002 / Contracts-008 resolutions) describe the contracts now in force. The `MX_COMMAND_KIND_WRITE2_BULK` / `MX_COMMAND_KIND_WRITE_SECURED2_BULK` enum-value names use the `<BASE>2_BULK` suffix order while the public reply oneof case names use `Write2Bulk` / `WriteSecured2Bulk` (the `2` precedes `Bulk` in PascalCase); both match the corresponding command-message names — no design-doc divergence. The proto comment on `BulkWriteResult` describes a "gateway's tag-allowlist filter" that does not exist by that name in source or docs — see new finding Contracts-014. |
| 8 | Code organization & conventions | Package / namespace / file layout correct; `csharp_namespace` options remain consistent; the worker proto continues to import `mxaccess_gateway.proto` rather than duplicate the command/reply/event/value/status surface. Additive-only contract evolution observed; field numbers continuous and isolated by 100+ from diagnostic/control commands. No new issues. |
| 9 | Testing coverage | `ProtobufContractRoundTripTests` now exercises all five new bulk write/read commands, both new reply types (with `HasHresult == true` / `HasHresult == false` arms for the proto3 optional, and a descriptor-level assertion that `BulkReadResult` has no `hresult` field), every new `MxCommandReply.payload` oneof case (parameterised `[Theory]`), and the existing alarm / Galaxy / worker-envelope cases. `GatewayContractInfoTests` pins the `GatewayProtocolVersion = 3` constant for both the alarm and bulk write/read additions. No new gaps observed at the contracts surface. |
| 10 | Documentation & comments | The bulk additions all carry per-message documentation comments (`WriteBulkCommand`, `Write2BulkCommand`, `WriteSecuredBulkCommand`, `WriteSecured2BulkCommand`, `ReadBulkCommand`) and per-field credential-sensitivity comments on `WriteSecured*BulkEntry.value`. `GalaxyAttribute.mx_data_type` / `data_type_name` / `mx_attribute_category` / `security_classification` carry the parity-detail comments added under Contracts-012. Two residual gaps remain — the misleading "tag-allowlist filter" wording on `BulkWriteResult` (new finding Contracts-014), and the absence of a comment on `BulkReadResult.value` / `quality` / `source_timestamp` / `statuses` describing what they carry when `was_successful = false` (new finding Contracts-015). |
## Findings
@@ -222,3 +222,33 @@ This re-review focuses on the contract delta introduced since the prior review a
**Recommendation:** Reword the summary to describe what the test pins (the current `GatewayProtocolVersion` constant equals 3) rather than narrating a specific historical bump, OR explicitly enumerate the alarm- and bulk-write/read additions covered under version 3 so readers know both extensions were additive and intentionally did not require a bump.
**Resolution:** _(2026-05-20)_ Reworded the XML summary on `GatewayContractInfoTests.GatewayProtocolVersion_IsVersionThree` to describe what the test actually pins: the current `GatewayProtocolVersion` constant equals 3, with both the alarm proto extension (`AcknowledgeAlarm` / `QueryActiveAlarms` RPCs, `OnAlarmTransitionEvent`, the alarm command/reply payload cases) AND the bulk write/read command family extension (`WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` / `ReadBulk` with their `BulkWriteReply` / `BulkReadReply` payloads) shipping under version 3 as strictly additive changes that did not require a further bump. The new summary also instructs that a future breaking contract change should bump the constant and update the test in lock-step. Test logic is unchanged; the test still passes.
### Contracts-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:549-553` |
| Status | Resolved |
**Description:** The `BulkWriteResult` header comment says `item_handle` mirrors the request entry "so callers can correlate inputs to outputs even when the gateway's tag-allowlist filter dropped some entries before reaching the worker." No "tag-allowlist filter" exists by that name anywhere in `src/`, `gateway.md`, `docs/`, or `docs/style-guides/` — a full-tree search returns matches only inside this proto comment and the prior-pass code-reviews. The real gateway-side bulk-write filter is `MxAccessGatewayService` calling `IConstraintEnforcer.CheckWriteHandleAsync` per entry (see `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:565-585` and `src/MxGateway.Server/Security/Authorization/IConstraintEnforcer.cs`); failures populate a synthetic `BulkWriteResult` with `was_successful = false` and the constraint's `ErrorMessage` is recorded via `constraintEnforcer.RecordDenialAsync`. The mechanism is a per-API-key constraint enforcer that can reject by handle (not a "tag" list), and the failure path covers any `ConstraintFailure` reason (write-handle scope, audit policy, etc.) — not a single inclusive tag allowlist. A future reader of the proto will search for "tag-allowlist" and find nothing, or worse, build a non-existent feature against the misleading name. The contract concept the comment is trying to communicate (item-level correlation matters because the gateway can drop entries before the worker sees them) is correct and worth keeping.
**Recommendation:** Reword the `BulkWriteResult` header comment to identify the actual mechanism — for example: "...so callers can correlate inputs to outputs even when the gateway's per-entry `IConstraintEnforcer.CheckWriteHandleAsync` filter (see `docs/Authorization.md`) dropped some entries before reaching the worker." Comment-only change with no wire-format impact.
**Resolution:** _(2026-05-20)_ Reworded the `BulkWriteResult` header comment in `mxaccess_gateway.proto` to identify the real gateway-side per-entry filter — `IConstraintEnforcer.CheckWriteHandleAsync` invoked by `MxAccessGatewayService.ReplaceWriteBulkEntries` — and cross-referenced `docs/Authorization.md` for the rationale. The contract concept (item-level correlation matters because the gateway can drop entries before the worker sees them) is preserved; the misleading "tag-allowlist filter" name is removed so future readers will not search for or build against a non-existent feature. The "Per-item failures populate `error_message` + `hresult` and never raise" sentence is retained verbatim. Comment-only change; `dotnet build src/MxGateway.Contracts/MxGateway.Contracts.csproj` succeeded with 0 warnings / 0 errors on both `net48` and `net10.0`.
### Contracts-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:571-582` |
| Status | Resolved |
**Description:** `BulkReadResult` carries seven payload-bearing fields beyond the carrier flags — `value`, `quality`, `source_timestamp`, `statuses`, `error_message`, plus `item_handle` and `tag_address` — and the header comment only documents the `was_cached` arm. There is no in-proto statement of which fields carry data on `was_successful = true` versus `was_successful = false`. Cross-checked against the worker: `MxAccessSession.FailedRead` (line 940-956) populates only `ServerHandle`, `TagAddress`, `ItemHandle`, `WasSuccessful = false`, `WasCached`, and `ErrorMessage``value`, `quality`, `source_timestamp`, and `statuses` are all left at their proto3 defaults (null / 0 / null / empty). `SucceededRead` populates the value/quality/source_timestamp/statuses from the cached or snapshotted `OnDataChange`. A client reading `BulkReadResult` from the proto alone has no way to know that `value == null` and `quality == 0` on failure are deliberate "absent" markers rather than "value is null with quality bad" data — both interpretations are wire-equivalent. `BulkWriteResult` has the same shape gap for `statuses` / `hresult` on failed entries, but its header comment at least says "Per-item failures populate `error_message` + `hresult` and never raise"; `BulkReadResult` has no equivalent statement.
**Recommendation:** Extend the `BulkReadResult` header comment (or add per-field comments on `value` / `quality` / `source_timestamp` / `statuses` / `error_message`) to state explicitly which fields are populated on success and which are left at their proto3 defaults on failure — e.g. "On `was_successful = false`, only `server_handle`, `tag_address`, `item_handle` (when allocated), `was_cached`, and `error_message` are populated; `value`, `quality`, `source_timestamp`, and `statuses` are left at their proto3 defaults and must not be read as data." Comment-only change with no wire-format impact.
**Resolution:** _(2026-05-20)_ Extended the `BulkReadResult` header comment in `mxaccess_gateway.proto` with explicit per-arm documentation, mirroring the level of detail the existing `BulkWriteResult` header carries. On `was_successful = true` the comment now states `value` / `quality` / `source_timestamp` / `statuses` carry the read data (from the cached subscription or the snapshot lifecycle, depending on `was_cached`) and `error_message` is empty. On `was_successful = false` the comment lists exactly which fields are populated (`server_handle`, `tag_address`, `item_handle` when allocated, `was_cached`, `error_message`) and warns that `value` / `quality` / `source_timestamp` / `statuses` are left at their proto3 defaults and must not be read as data — explicitly noting they are wire-indistinguishable from "value is null with quality bad" data so a future reader cannot make that mistake. The comment also pins the deliberate absence of an `hresult` field on `BulkReadResult` (cross-referencing `docs/DesignDecisions.md` "Bulk Command Family" for the rationale) and the "Per-tag failures populate `error_message` and never raise" semantic that parallels `BulkWriteResult`. Comment-only change; `dotnet build src/MxGateway.Contracts/MxGateway.Contracts.csproj` succeeded with 0 warnings / 0 errors on both `net48` and `net10.0`.