Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules

Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and
fixed them in three priority waves (3 High, 17 Medium, 52 Low).

Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
  GatewayGrpcScopeResolver so non-admin keys can use them; document
  the mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the
  CLI FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression
  in generated tonic code by reformatting the ReadBulkCommand proto
  comment and scoping a #![allow(...)] to the generated submodules.

Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and
  make DisposeAsync race-safe against in-flight CloseAsync (-016);
  add constraint-enforcement test coverage for the bulk-plan path
  (-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop
  can distinguish graceful shutdown from a real STA-affinity
  violation (-016); have the watchdog skip StaHung while
  CurrentCommandCorrelationId is non-empty so a legitimate slow
  ReadBulk no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the
  11 GatewaySession bulk methods (-013); replace the real TCP probe
  in GalaxyHierarchyCacheTests with an IGalaxyRepository fake
  (-016).
- IntegrationTests: drive the StreamEvents writer in the live Write
  test and assert OnWriteComplete (-012); add live tests for
  Unadvise/RemoveItem/Unregister ordering, WriteSecured, and
  abnormal worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
  CreateForTesting factory (-016); cover WorkerCancel and
  unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races
  beforeStart() (-014); return a CancellingCompletableFuture that
  actually forwards cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in
  the CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
  histograms with failed-call durations (-015); add coverage for
  the five MalformedReply paths, the bulk-write helpers, the
  Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write
  command family (-009).

Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
  WorkerAlarmRpcDispatcher missing-session handling; drop the
  duplicate dashboard @page routes; refresh IAlarmRpcDispatcher
  XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
  subscriptionExpression / ExecutingCommand arms; preserve
  factory-supplied runtime sessions; split MxAlarmSnapshot.cs into
  three files.
- Tests: dispose the WebApplication in seven test classes; rebuild
  FakeWorkerProcess.WaitForExitAsync against a real
  TaskCompletionSource; switch the heartbeat-expires test to
  ManualTimeProvider; add InvariantCulture to the remaining
  DateTimeOffset.Parse sites;
  document GalaxyFilterInputSafetyTests in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
  IDisposable, class-level [Trait], single-source ZB default
  connection string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact
  so absent env vars SKIP not pass; PascalCase rename of probe
  [Fact]s; deterministic deadline test; new frame-protocol error
  tests; ComputeTransitions diff-coverage; relocate dev-rig probes
  to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
  Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
  TreatWarningsAsErrors / analysers apply; document
  DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
  bulk-read handles in CLI; surface AcknowledgeAlarm transport
  faults through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
  runWriteBulkVariant; document the six new subcommands in
  writeUsage; drain galaxy-watch events on limit; switch io.EOF
  comparisons to errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout
  option; regex-based credential redaction; Long.toUnsignedString
  for uint64 sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
  _percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
  _api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close
  stop hard-coding correlation IDs; resync RustClientDesign.md
  with the current Session / Error surface and CLI subcommand set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-20 09:46:47 -04:00
parent 1cd51bbda3
commit a0203503a7
122 changed files with 8723 additions and 757 deletions
+116 -10
@@ -4,8 +4,8 @@
|---|---|
| Module | `clients/dotnet` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `3cc53a8` |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |
@@ -13,16 +13,16 @@
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Minor: handle-selector fallback `?? reply.ReturnValue.Int32Value` can mask a missing typed reply (Client.Dotnet-005); CLI redactor misses env-var keys (Client.Dotnet-008). |
| 1 | Correctness & logic bugs | Issue found (this review): the Client.Dotnet-005 fix did not reach the CLI — `BenchReadBulkAsync`, `BenchStreamEventsAsync`, and `SmokeAsync` still fall through to `reply.ReturnValue.Int32Value` for `Register` / `AddItem` handles (Client.Dotnet-010). |
| 2 | mxaccessgw conventions | Good — consumes the shared contracts project, no forked proto, `authorization: Bearer` metadata correct, parity preserved via split `EnsureProtocolSuccess`/`EnsureMxAccessSuccess`. |
| 3 | Concurrency & thread safety | Issue found: `_disposed` flags unsynchronized; `MxGatewaySession.DisposeAsync` can race a concurrent `CloseAsync` (Client.Dotnet-003). |
| 4 | Error handling & resilience | Issues found: gRPC-to-native mapping collapses non-auth statuses into one untyped exception (Client.Dotnet-001); shared retry/timeout budget (Client.Dotnet-004). |
| 5 | Security | Good — API key never logged by the library, CLI redacts keys, TLS custom-root validation correct. |
| 3 | Concurrency & thread safety | Issues found (this review): `GalaxyRepositoryClient._disposed` is still a plain unsynchronized `bool` (Client.Dotnet-009) — the symmetric fix from Client.Dotnet-003 was applied only to `MxGatewayClient`; the new `bench-stream-events` CLI command races `firstSteadyEventUtc`/`lastSteadyEventUtc` across parallel sessions (Client.Dotnet-011). |
| 4 | Error handling & resilience | No new issues found this review (Client.Dotnet-001 and Client.Dotnet-004 remain resolved). |
| 5 | Security | Good — API key never logged by the library, CLI redacts keys (incl. env-var-sourced), TLS custom-root validation correct, secured-write payloads never logged. |
| 6 | Performance & resource management | No issues found — channels and streaming calls disposed correctly. |
| 7 | Design-document adherence | No issues found — matches `ClientLibrariesDesign.md`. |
| 8 | Code organization & conventions | Issue found: undocumented public members (Client.Dotnet-006). |
| 9 | Testing coverage | Issue found: the production retry path is never exercised (Client.Dotnet-002). |
| 10 | Documentation & comments | Issue found: doc misstates the unary timeout retry budget as per-call (Client.Dotnet-004, Client.Dotnet-007). |
| 7 | Design-document adherence | No issues found — matches `DotnetClientDesign.md` and `ClientLibrariesDesign.md`. |
| 8 | Code organization & conventions | Issues found (this review): the .NET client projects do not inherit `src/Directory.Build.props` so `TreatWarningsAsErrors` / `EnforceCodeStyleInBuild` / `AnalysisLevel=latest` are silently absent (Client.Dotnet-012); `DiscoverHierarchyOptions` and the `DiscoverHierarchyAsync(DiscoverHierarchyOptions, …)` overload have no XML docs (Client.Dotnet-013). |
| 9 | Testing coverage | Issue found (this review): the SDK-level alarm tests pin the fake-transport raw-`RpcException` shape but never exercise the production gRPC-to-native mapping (`GrpcMxGatewayClientTransport.AcknowledgeAlarmAsync`) — the same gap Client.Dotnet-002 closed for `Invoke`, still open for alarms (Client.Dotnet-014). |
| 10 | Documentation & comments | No new issues this review. |
## Findings
@@ -145,3 +145,109 @@
**Recommendation:** Resolve the effective API key (same logic as `ResolveApiKey`) before redacting, so the env-var-sourced key is also stripped from error output.
**Resolution:** (2026-05-18) Confirmed against source: `MxGatewayClientCli.RunCoreAsync`'s catch block redacted only `arguments.GetOptional("api-key")`, so an env-var-sourced key (`--api-key-env`, default `MXGATEWAY_API_KEY`) was never stripped. Note `MxGatewayCliSecretRedactor` itself is correct — the defect was the caller passing the wrong value. Extracted a non-throwing `TryResolveApiKey` helper (used by both the existing `ResolveApiKey` and the catch block) that resolves `--api-key` then the `--api-key-env` environment variable; the catch block now redacts that effective key. Updated `clients/dotnet/README.md` (`smoke` paragraph) to state the CLI redacts the effective key whether from `--api-key` or `--api-key-env`. Regression test `MxGatewayClientCliTests.RunAsync_ErrorOutput_RedactsApiKey_WhenSourcedFromEnvironmentVariable` sets a test env var, forces a transport error echoing the key, and asserts the key is absent and `[redacted]` is present; verified red against the original `GetOptional("api-key")`-only redaction (key printed unredacted).
### Client.Dotnet-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:26,339-348,445-448` |
| Status | Resolved |
**Description:** Client.Dotnet-003 upgraded `MxGatewayClient._disposed` to an `int` accessed via `Interlocked.Exchange` / `Volatile.Read` so a concurrent `ThrowIfDisposed` cannot observe a stale value. The symmetric `GalaxyRepositoryClient._disposed` is still a plain unsynchronised `bool`: `DisposeAsync` reads `if (_disposed)` then writes `_disposed = true` without `Interlocked` or `Volatile`, and `ThrowIfDisposed` does an unsynchronised read. The Galaxy client is publicly `IAsyncDisposable` and exposes `TestConnectionAsync` / `GetLastDeployTimeAsync` / `DiscoverHierarchyAsync` / `WatchDeployEventsAsync` as legal-to-call-concurrently public APIs, so a concurrent dispose can produce the same stale-read race the gateway client fix prevented, and two concurrent `DisposeAsync` calls can both pass the check-then-set guard. The two clients also exhibit the same shape (gRPC channel + transport + retry pipeline), so the divergence is an accidental inconsistency.
**Recommendation:** Mirror Client.Dotnet-003 on `GalaxyRepositoryClient`: change `_disposed` to an `int`, use `Interlocked.Exchange(ref _disposed, 1) != 0` in `DisposeAsync`, and `Volatile.Read(ref _disposed) != 0` in `ThrowIfDisposed`. A duplicated `MxGatewaySession`-style close-lock drain is unnecessary because `GalaxyRepositoryClient` does not own a per-call `SemaphoreSlim`.
**Resolution:** 2026-05-20 — Changed `GalaxyRepositoryClient._disposed` from `bool` to `int`; `DisposeAsync` now uses `Interlocked.Exchange(ref _disposed, 1) != 0` for the once-only guard and `ThrowIfDisposed` uses `Volatile.Read(ref _disposed) != 0`, mirroring the Client.Dotnet-003 fix on `MxGatewayClient`.
### Client.Dotnet-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:638,896,1261,1279` |
| Status | Resolved |
**Description:** Client.Dotnet-005 fixed the silent `Register` / `AddItem` / `AddItem2` handle-fallback to `reply.ReturnValue.Int32Value` inside `MxGatewaySession`, but the same fallback pattern was left in the CLI and is now also present in two new bench commands shipped after that fix. `BenchReadBulkAsync` (line 638) and `BenchStreamEventsAsync` (line 896) both do `int serverHandle = registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value;` after a register call, and `SmokeAsync` (lines 1261 and 1279) passes `reply => reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value` and the equivalent `AddItem?.ItemHandle` selector to `InvokeForHandleAsync`. After `EnsureProtocolSuccess` + `EnsureMxAccessSuccess` pass but the worker did not set the typed `register` / `add_item` oneof case, all four call sites silently produce a zero handle and proceed to drive the rest of the smoke / bench against an invalid handle — exactly the failure mode the SDK-level fix prevents.
**Recommendation:** Either delegate to the SDK helpers (`MxGatewaySession.RegisterAsync` / `AddItemAsync`) which already throw the descriptive `MxGatewayException` via `CreateMissingPayloadException`, or replicate the same null-check explicitly in `InvokeForHandleAsync` and the two bench commands. A unit test that enqueues an `Ok` reply with no typed payload through `FakeCliClient` and asserts the smoke / bench commands fail loudly would prevent regression.
**Resolution:** 2026-05-20 — Added private CLI helpers `RequireRegisterServerHandle` and `RequireAddItemItemHandle` (with a shared `CreateMissingPayloadException` mirroring the SDK-level `MxGatewaySession` helper) that throw a descriptive `MxGatewayException` when the typed `register` / `add_item` payload is absent on an otherwise-successful reply. Replaced all four `?? reply.ReturnValue.Int32Value` fallback sites — `BenchReadBulkAsync` (line 638), `BenchStreamEventsAsync` (line 896), and both `SmokeAsync` selectors (lines 1261, 1279) — with these helpers, so the CLI now fails loudly with the same shape as the SDK helpers rather than silently driving the rest of the command against a zero handle.
### Client.Dotnet-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:857-858,922-963,1014-1015` |
| Status | Resolved |
**Description:** The new `bench-stream-events` command (added in commit `1cd51bb`) supports `--session-count > 1` and runs each session's `StreamEvents` reader in parallel via `openedSessions.Select(RunStreamAsync).ToArray()` then `Task.WhenAll`. Inside the per-session lambda the inner `Task.Run`-spawned event loop updates two shared `DateTime?` fields without synchronisation:
```csharp
if (firstSteadyEventUtc is null)
{
    firstSteadyEventUtc = nowUtc;
}
lastSteadyEventUtc = nowUtc;
```
The integer counters next to them (`steadyEvents`, `steadyDataChangeEvents`, `warmupEvents`) use `Interlocked.Increment`, and the latency list uses an explicit `lock (latencyLock)`, so the rest of the loop is data-race-free, but these two `DateTime?` updates are not. With N parallel sessions, the unsynchronised check-then-set on `firstSteadyEventUtc` (or a torn read of it) produces a non-deterministic "first event time", and the final `steadyElapsedSeconds = (lastSteadyEventUtc.Value - firstSteadyEventUtc.Value).TotalSeconds` can compute a slightly wrong window. The user-visible impact is bench-only (skewed `eventsPerSecond` / `dataChangeEventsPerSecond` numbers), so this is Low. Note that the x64 atomicity of a 64-bit `DateTime` read/write does not extend to `DateTime?`, which also carries a `HasValue` flag, and that the pattern is inconsistent with the rest of the same loop.
**Recommendation:** Either guard the two `DateTime?` updates with the existing `latencyLock` (cheapest), use `Interlocked.CompareExchange` for `firstSteadyEventUtc` and `Volatile.Write` for `lastSteadyEventUtc`, or aggregate per-session in local variables and reduce after `Task.WhenAll`. The reduce-after approach also fixes a related issue: today a faster session can stomp `firstSteadyEventUtc` after a slower one already set it.
**Resolution:** 2026-05-20 — Guarded the `firstSteadyEventUtc` / `lastSteadyEventUtc` reads and writes inside the per-session event loop with the existing `latencyLock`. `firstSteadyEventUtc` now uses the null-coalescing assignment `firstSteadyEventUtc ??= nowUtc;` under the lock so a slower session can't stomp an earlier already-set value. The lock is already held by the latency-list append a few lines below, so the extra cost is one uncontended acquisition per event. The final read in the stats block runs after `Task.WhenAll` (happens-before applies) and stays lock-free.
### Client.Dotnet-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/dotnet/MxGateway.Client/MxGateway.Client.csproj`, `clients/dotnet/MxGateway.Client.Cli/MxGateway.Client.Cli.csproj`, `clients/dotnet/MxGateway.Client.Tests/MxGateway.Client.Tests.csproj` |
| Status | Resolved |
**Description:** `src/Directory.Build.props` enforces `TreatWarningsAsErrors=true`, `EnforceCodeStyleInBuild=true`, `AnalysisLevel=latest`, and `Deterministic=true` for every gateway / worker / contracts project, and `CLAUDE.md` calls this out as a baseline build property. The .NET client projects live under `clients/dotnet/` and there is no `Directory.Build.props` at `clients/` or `clients/dotnet/` — so none of those properties apply to `MxGateway.Client`, `MxGateway.Client.Cli`, or `MxGateway.Client.Tests`. New warnings in the client do not break the build, and code-style violations are not blocked at build time. The `CSharpStyleGuide.md` baseline ("Treat compiler warnings as actionable") and the `CLAUDE.md` table under "Source Update Workflow" both apply equally to `.NET client` ("`dotnet build clients/dotnet/MxGateway.Client.sln`"), but the enforcement floor is missing.
**Recommendation:** Add `clients/dotnet/Directory.Build.props` carrying the same property set: `TreatWarningsAsErrors=true`, `EnforceCodeStyleInBuild=true`, `AnalysisLevel=latest`, `Deterministic=true`. A shared `clients/Directory.Build.props` is not applicable, because the sibling clients (for example the Cargo-based Rust client) are not MSBuild projects; scope the file to `clients/dotnet/` only. Excluding generated code (which already lives under `src/MxGateway.Contracts/Generated`) is automatic because the client only references the contracts project. Build the client locally after adding the file to confirm that no warnings have already crept in.
**Resolution:** 2026-05-20 — Added `clients/dotnet/Directory.Build.props` mirroring `src/Directory.Build.props`: `LangVersion=latest`, `Nullable=enable`, `ImplicitUsings=enable`, `TreatWarningsAsErrors=true`, `AnalysisLevel=latest`, `EnforceCodeStyleInBuild=true`, `Deterministic=true`. The three client `.csproj` files inherit from it automatically. Re-ran `dotnet build clients/dotnet/MxGateway.Client.sln` and confirmed 0 warnings / 0 errors — no pre-existing warnings were silently being tolerated.
### Client.Dotnet-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/dotnet/MxGateway.Client/DiscoverHierarchyOptions.cs:3-24`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:185-187`, `clients/dotnet/MxGateway.Client.Cli/IMxGatewayCliClient.cs:6` |
| Status | Resolved |
**Description:** Client.Dotnet-006 fixed three undocumented public members. Three more remain undocumented in code paths the prior review didn't visit:
- `DiscoverHierarchyOptions` (the public record) has no `<summary>` on the type and no XML doc on any of its ten public properties (`RootGobjectId`, `RootTagName`, `RootContainedPath`, `MaxDepth`, `CategoryIds`, `TemplateChainContains`, `TagNameGlob`, `IncludeAttributes`, `AlarmBearingOnly`, `HistorizedOnly`).
- The second `DiscoverHierarchyAsync(DiscoverHierarchyOptions, CancellationToken)` overload on `GalaxyRepositoryClient` is `public` with no XML doc, while the parameterless overload one method above it carries a full `<summary>` / `<param>` block.
- `IMxGatewayCliClient` is a public interface in the CLI project with no `<summary>` on the type (the member docs are present).
This is the same convention-violation shape Client.Dotnet-006 closed; CLAUDE.md style guidance describes XML docs on the public surface as the baseline expectation.
**Recommendation:** Add `<summary>` docs to each undocumented member. For `DiscoverHierarchyOptions`, the property names map cleanly to the underlying `DiscoverHierarchyRequest` proto fields — a one-line summary per property and a type-level summary tying the record to the Galaxy hierarchy browse is enough. The CLI interface only needs a type-level summary; the members already document themselves.
**Resolution:** 2026-05-20 — Added XML docs at all three locations: a type-level summary plus a one-line summary per property on `DiscoverHierarchyOptions` (ten properties, mapped to the underlying `DiscoverHierarchyRequest` proto fields and noting the root-precedence rule); a `<summary>`/`<param>`/`<returns>` block on the second `DiscoverHierarchyAsync(DiscoverHierarchyOptions, CancellationToken)` overload describing its filter semantics and transparent pagination; and a type-level `<summary>` on the public `IMxGatewayCliClient` interface explaining its CLI-only transport role and the production binding.
### Client.Dotnet-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `clients/dotnet/MxGateway.Client.Tests/MxGatewayClientAlarmsTests.cs:76-98`, `clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:212-231` |
| Status | Resolved |
**Description:** Client.Dotnet-002 closed a coverage gap where the production retry path (`RpcException` → `MxGatewayException` mapping by `RpcExceptionMapper.Map`) was never exercised, by adding a `MapTransportExceptions` flag to `FakeGatewayTransport` and a regression test that runs through the wrapped-exception branch. That flag is wired through `Translate(...)` in `OpenSessionAsync` / `CloseSessionAsync` / `InvokeAsync`, but the new alarm test path is not: `FakeGatewayTransport.AcknowledgeAlarmAsync` throws the queued exception verbatim (line 219), bypassing `Translate`. The accompanying `MxGatewayClientAlarmsTests.AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException` test acknowledges this in a comment ("Note: the FakeGatewayTransport surfaces RpcException directly … the SDK-level test pins the pass-through shape so a future migration to direct mapping won't silently change observable behaviour") and asserts `Assert.ThrowsAsync<RpcException>`. However, the production path through `GrpcMxGatewayClientTransport.AcknowledgeAlarmAsync` (lines 120-134) already calls `RpcExceptionMapper.Map`, so production callers see `MxGatewayAuthenticationException` and not `RpcException`. The test name advertises mapping that the SDK-level harness doesn't exercise, so the alarm-ack mapping reached through `MxGatewayClient.AcknowledgeAlarmAsync` can regress without any test noticing.
**Recommendation:** Either route `FakeGatewayTransport.AcknowledgeAlarmAsync` through the same `Translate` helper the other RPCs use and add a regression test that enables `MapTransportExceptions = true` and asserts `MxGatewayAuthenticationException`; or rename the existing test to make the pass-through shape explicit (e.g. `…_SurfacesRpcExceptionFromFakeTransportVerbatim`) and add a second test exercising the production mapping. Either fix closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for `Invoke`.
**Resolution:** 2026-05-20 — Applied both halves of the recommendation. Routed `FakeGatewayTransport.AcknowledgeAlarmAsync` through the same `Translate` helper the other RPCs use, so when `MapTransportExceptions = true` thrown `RpcException`s now run through the production `RpcExceptionMapper.Map`. Renamed the existing pass-through test to `AcknowledgeAlarmAsync_SurfacesRpcExceptionFromFakeTransportVerbatim_WhenMappingDisabled` (with an updated comment pinning that this shape only applies when mapping is off), and added a new test `AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException` that enables mapping and asserts the production-parity `MxGatewayAuthenticationException` with `StatusCode.Unauthenticated`. Closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for `Invoke`.
+115 -12
@@ -4,8 +4,8 @@
|---|---|
| Module | `clients/go` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `3cc53a8` |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |
@@ -13,16 +13,16 @@
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: a typed-nil `Unwrap`/`errors.As` trap (Client.Go-001), a CLI `panic` on malformed input (Client.Go-003), empty-string correlation id on rand failure (Client.Go-007). |
| 2 | mxaccessgw conventions | Generally good; two test files fail `gofmt`, breaking the documented workflow (Client.Go-004). |
| 3 | Concurrency & thread safety | No issues found — stream goroutines and cancellation are sound. |
| 4 | Error handling & resilience | Issues found: the compatibility event path silently drops events (Client.Go-002); no transient/permanent classification (Client.Go-006). |
| 5 | Security | No issues found — TLS by default with a TLS 1.2 floor, API key redaction, no secret logging. |
| 6 | Performance & resource management | No issues found — connections/streams closed via deferred `Close`/`cancel`. |
| 7 | Design-document adherence | Issues found: deprecated `grpc.DialContext`+`WithBlock` usage and a missing error taxonomy (Client.Go-005, Client.Go-006). |
| 8 | Code organization & conventions | Issue found: duplication between `Client` and `GalaxyClient` (Client.Go-009). |
| 9 | Testing coverage | Issue found: TLS path, `callContext` deadline logic, and `NativeValue`/`NativeArray` edges untested (Client.Go-008). |
| 10 | Documentation & comments | Issue found: a stale `WithBlock` dial-cancellation claim (Client.Go-010). |
| 1 | Correctness & logic bugs | Re-review: previous Client.Go-001/003/007 remain resolved. New issue: a dead/no-op test condition in `alarms_test.go` (Client.Go-011). |
| 2 | mxaccessgw conventions | `gofmt -l ./...` and `go vet ./...` are clean. No new issues. |
| 3 | Concurrency & thread safety | New issue: `runGalaxyWatch` limit-reached path returns without waiting for the WatchDeployEvents goroutine to drain (Client.Go-013). |
| 4 | Error handling & resilience | New issue: direct `err == io.EOF` comparisons should use `errors.Is` for chain robustness (Client.Go-014). |
| 5 | Security | No issues found — TLS-by-default with TLS 1.2 floor, API key redaction in CLI JSON, no secret logging. |
| 6 | Performance & resource management | No issues found — `defer client.Close()` / `defer subscription.Close()` consistently applied across CLI and library; bench-read-bulk preallocates latency slice. |
| 7 | Design-document adherence | No new issues. The lazy `grpc.NewClient` + readiness probe migration (Client.Go-005) was applied uniformly to `Dial` and `DialGalaxy`. |
| 8 | Code organization & conventions | New issue: `runWriteBulkVariant`'s `secured` parameter is computed but unused (Client.Go-015). |
| 9 | Testing coverage | Coverage holes from prior review now filled (Client.Go-008). `fakeGalaxyServer.watchSendInterval` is declared but never set — minor test cruft (Client.Go-016). |
| 10 | Documentation & comments | New issue: the CLI `writeUsage` line is missing the six bulk and bench subcommands now wired into `run` (Client.Go-012). |
## Findings
@@ -175,3 +175,106 @@
**Recommendation:** Reword to describe the actual connect/timeout semantics after resolving Client.Go-005, and clarify that `DialTimeout` bounds the initial connect attempt.
**Resolution:** Resolved 2026-05-18: alongside the Client.Go-005 migration, the `Dial` doc comment was rewritten to describe the lazy `grpc.NewClient` connection, the `DialTimeout`-bounded (default 10s, or ctx deadline when sooner) readiness probe, that a briefly-unavailable gateway recovers instead of producing a hard error, and that cancelling `ctx` aborts the probe. `DialGalaxy` and the new `dial`/`waitForReady`/`callContext` helpers carry matching doc comments.
### Client.Go-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/go/mxgateway/alarms_test.go:66-73` |
| Status | Resolved |
**Description:** `TestAcknowledgeAlarmRejectsNilRequest` contains a no-op `if` with an empty body whose intent is documented in a comment ("Accept either: the helper returned the literal sentinel, or the generic transport error — both prove nil was rejected"). The condition
```go
if err == nil || !errors.Is(err, errors.Unwrap(err)) && err.Error() != "mxgateway: acknowledge alarm request is required" {
	// ...
}
```
evaluates expressions for side effects only and asserts nothing — Go's `&&` binds tighter than `||`, the body is empty, and the actual nil check happens on the very next `if err == nil`. The block is effectively dead code masquerading as a check. It also evaluates `errors.Unwrap(err)` regardless of `err`'s shape, and would call `err.Error()` even when err might be a wrapped status error whose message wording the gateway is free to change — making the apparent assertion brittle on top of being dead.
**Recommendation:** Drop the empty-body `if` entirely (the subsequent `if err == nil { t.Fatalf(...) }` already enforces the contract), or, if the intent is to additionally pin the literal error message for the sentinel path, replace it with a real assertion (`if err.Error() != "mxgateway: acknowledge alarm request is required" { t.Fatalf(...) }`) and remove the spurious `errors.Is(err, errors.Unwrap(err))` clause.
**Resolution:** 2026-05-20 — Removed the empty-body `if` in `TestAcknowledgeAlarmRejectsNilRequest`; the subsequent `if err == nil { t.Fatalf(...) }` already enforces the nil-rejection contract without the dead, brittle compound predicate.
### Client.Go-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `clients/go/cmd/mxgw-go/main.go:1063-1065`, `clients/go/cmd/mxgw-go/main.go:88-104` |
| Status | Resolved |
**Description:** `writeUsage` lists the available subcommands as `version|open-session|close-session|register|add-item|advise|subscribe-bulk|unsubscribe-bulk|write|stream-events|smoke|galaxy-test-connection|galaxy-last-deploy|galaxy-discover|galaxy-watch`. Six subcommands wired into `run` are missing from this list: `read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, and `bench-read-bulk`. A user invoking `mxgw-go` with no args or an unknown command (the two paths that print this banner) sees an incomplete CLI surface and may believe the bulk-write / read-bulk families are not implemented. The README does document them, but the inline usage banner is the first source of truth a CLI user consults.
**Recommendation:** Extend the usage string to include every command registered in the `switch args[0]` in `run`, or generate it from a single source-of-truth slice keyed on command name → handler so the two cannot drift again.
**Resolution:** 2026-05-20 — `writeUsage` now lists the previously missing `read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, and `bench-read-bulk` subcommands alongside the original surface, so the no-args / unknown-command banner reflects every command wired into `run`.
### Client.Go-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `clients/go/cmd/mxgw-go/main.go:1246-1249`, `clients/go/cmd/mxgw-go/main.go:1257-1262` |
| Status | Resolved |
**Description:** In `runGalaxyWatch`, the signal-cancellation branch carefully drains the buffered `events` channel after `cancelStream()` so the `WatchDeployEvents` goroutine can exit (`for range events { }`). The limit-reached branch (`if *limit > 0 && count >= *limit { cancelStream(); return nil }`) skips that drain and returns immediately. After the function returns, `defer client.Close()` runs and tears down the gRPC connection; in the gap before the connection close propagates, the WatchDeployEvents goroutine may still be blocked on `case events <- event:` (the channel is buffered to 16 but a slow producer can refill it) — the goroutine then exits via `<-ctx.Done()` because `streamCtx` was cancelled, so it isn't a permanent leak, but the two cancellation paths behave inconsistently and the limit-reached path can briefly hold a goroutine plus the gRPC stream while the client tears down underneath it.
**Recommendation:** Factor the drain into a helper and use it from both branches, e.g. after `cancelStream()` always `for range events { }` (and let the surrounding `select`/`for` re-evaluate `<-errs` if a terminal error was already buffered). Alternatively, drop the explicit drain in both branches and rely on `defer cancelStream()` plus `defer client.Close()` — but pick one model and apply it consistently.
**Resolution:** 2026-05-20 — The limit-reached branch in `runGalaxyWatch` now drains the buffered `events` channel (`for range events { }`) after `cancelStream()`, matching the signal-cancel branch. Both cancellation paths now wait for the `WatchDeployEvents` goroutine to exit before `defer client.Close()` tears the gRPC connection down.
### Client.Go-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `clients/go/mxgateway/session.go:602`, `clients/go/mxgateway/galaxy.go:189` |
| Status | Resolved |
**Description:** Two stream Recv loops compare end-of-stream with `err == io.EOF` directly:
- `session.go:602`: `if err == io.EOF || status.Code(err) == codes.Canceled || streamCtx.Err() != nil { return }`
- `galaxy.go:189`: `if recvErr == io.EOF { return }`
gRPC's generated `Recv()` does return the `io.EOF` sentinel directly today, so the comparisons work in practice. However, the Go idiom (and the project's `docs/style-guides/GoStyleGuide.md`) is to use `errors.Is(err, io.EOF)` so future wrapping (e.g. an interceptor decorating Recv errors) does not silently flip the loop from "stream finished normally" to "stream produced an error". The mxgateway client itself wraps non-EOF Recv errors in `*GatewayError`, which `errors.Is` already supports — using `errors.Is` keeps both paths consistent.
**Recommendation:** Replace `recvErr == io.EOF` / `err == io.EOF` with `errors.Is(err, io.EOF)` (the `errors` package is already imported in both files).
**Resolution:** 2026-05-20 — Both stream Recv loops now use `errors.Is(err, io.EOF)`: `session.go` already imported `errors`, and `galaxy.go` gained the missing `errors` import alongside the `recvErr == io.EOF` → `errors.Is(recvErr, io.EOF)` change, keeping EOF detection robust against any future Recv-error wrapping.
### Client.Go-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/go/cmd/mxgw-go/main.go:410-512` |
| Status | Resolved |
**Description:** `runWriteBulkVariant(ctx, args, stdout, stderr, command, withTimestamp, secured bool)` accepts `secured` but never uses it; the routing is keyed on `command` (the string `"write-bulk"` / `"write2-bulk"` / `"write-secured-bulk"` / `"write-secured2-bulk"`). The function ends with `_ = secured // currently only used for routing above; reserved for future per-variant validation`, which is misleading because `secured` is not in fact used for routing. The four wrapper functions (`runWriteBulk`, `runWrite2Bulk`, `runWriteSecuredBulk`, `runWriteSecured2Bulk`) all pass a `secured` argument that has no effect. The secured-only CLI options `-current-user-id` and `-verifier-user-id` are unconditionally registered on every variant, including the non-secured ones, so a `write-bulk` invocation that passes `-current-user-id 42` silently does nothing. Either remove `secured` and the dead `_ = secured` comment, or use it to gate the registration of secured-only flags so wrong combinations are rejected with a clean error.
**Recommendation:** Drop the `secured` parameter (the `command` switch already distinguishes the four variants) and the misleading `_ = secured` line; or, if validation is the goal, branch flag registration on `secured` so secured-only flags are unavailable for the non-secured variants and emit a clean usage error if they appear.
**Resolution:** 2026-05-20 — Dropped the unused `secured` parameter from `runWriteBulkVariant` (the `command` switch already distinguishes the four variants) and removed the misleading `_ = secured` line. The variant is now derived locally from `command` and used to gate flag registration: `-current-user-id` / `-verifier-user-id` are only registered for the secured variants and `-user-id` only for Write/Write2, so a wrong-variant flag now fails with a clean `flag provided but not defined` usage error instead of silently no-op'ing. The four `runWrite*Bulk` wrappers were updated to match the new signature.
### Client.Go-016
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `clients/go/mxgateway/galaxy_test.go:382-429` |
| Status | Resolved |
**Description:** `fakeGalaxyServer.watchSendInterval` is declared on the test fake and consulted inside `WatchDeployEvents` (`if s.watchSendInterval > 0 { ... }`) but no test in the package sets a non-zero value. The dead field plus its branch were presumably added to support a backpressure / pacing test that was never landed, and now the only effect is reader confusion ("which test uses this?") and a pointlessly larger fake. Backpressure on the bootstrap-plus-events sequence is also genuinely worth testing, given that `WatchDeployEvents` writes to a 16-deep buffered channel.
**Recommendation:** Either delete the unused `watchSendInterval` field and its branch in `WatchDeployEvents`, or add the test it was added for — e.g. one that pumps more than 16 events with a small interval and asserts the consumer keeps up without losing or reordering events. Linking the field to a `// for TestX` comment if it stays would also help.
**Resolution:** 2026-05-20 — Removed the unused `watchSendInterval` field from `fakeGalaxyServer` and the corresponding `if s.watchSendInterval > 0 { ... }` branch in `WatchDeployEvents`; no test set the field, so the dead code path is gone and the fake is leaner. `gofmt -w` reflowed the struct to drop the no-longer-needed field-name padding.
@@ -4,25 +4,29 @@
|---|---|
| Module | `clients/java` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `3cc53a8` |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
A second-pass review against commit `1cd51bb`. Client.Java-001 through
Client.Java-012 are unchanged from the prior pass; the table below records the
new findings raised in this pass against the same checklist categories.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: `register`/`addItem` silently fall back to `getReturnValue()` masking missing payloads (Client.Java-004); fragile `resolved()` mutation pattern (Client.Java-012). |
| 2 | mxaccessgw conventions | Largely adheres; the gateway protocol-version handshake is never verified despite the contract field existing (Client.Java-003). |
| 3 | Concurrency & thread safety | Issue found: `MxEventStream.next` is a plain field and terminal-state transitions race (Client.Java-002). |
| 4 | Error handling & resilience | Issues found: `close()` can mask the primary exception (Client.Java-005); async/sync error surfaces inconsistent (Client.Java-008). |
| 5 | Security | Issue found: API-key redaction leaks the trailing 4 secret characters (Client.Java-001). |
| 6 | Performance & resource management | Issues found: `close()` does not await termination (Client.Java-006); no stream flow control (Client.Java-011). |
| 7 | Design-document adherence | Matches `JavaClientDesign.md` closely; the protocol-version check is undocumented-missing (Client.Java-003). |
| 8 | Code organization & conventions | Issue found: ~80 duplicated lines across the two clients (Client.Java-009). |
| 9 | Testing coverage | Issue found: alarm RPCs, TLS setup, async streams, and queue overflow untested (Client.Java-007). |
| 10 | Documentation & comments | Issue found: README/Javadoc assert undocumented scope names (Client.Java-010). |
| 1 | Correctness & logic bugs | Issues found: CLI `MxEventStream(1024)` capacity contradicts Javadoc/README "16-element buffer" claim (Client.Java-017); CLI `DeployEvent.sequence` printed with `%d` as signed `long` (Client.Java-020). |
| 2 | mxaccessgw conventions | No new issues found in this pass. |
| 3 | Concurrency & thread safety | Issues found: `MxEventStream.beforeStart` does not honour pre-start `close()` and leaks the gRPC call (Client.Java-014); `MxGatewayChannels.toCompletable` cancellation propagation is broken once the future is wrapped in `thenApply` (Client.Java-015). |
| 4 | Error handling & resilience | Issue found: `MxGatewaySecrets.redactCredentials` only inspects whitespace-delimited tokens, so colon/comma/quote-embedded `mxgw_` credentials leak through (Client.Java-018). |
| 5 | Security | Issue found: same `redactCredentials` leak — see Client.Java-018. |
| 6 | Performance & resource management | Issue found: client `close()` uses the *connect* timeout as its shutdown deadline (Client.Java-019). |
| 7 | Design-document adherence | No new issues found in this pass. |
| 8 | Code organization & conventions | Issue found: channel `close()` / `closeAndAwaitTermination()` are still duplicated verbatim across `MxGatewayClient` and `GalaxyRepositoryClient` despite Client.Java-009's stated resolution (Client.Java-016). |
| 9 | Testing coverage | Issue found: CLI `FakeSession` does not implement the five bulk methods added to `MxGatewayCliSession`, so the CLI test module fails to compile against the current source (Client.Java-013). |
| 10 | Documentation & comments | Issue found: docs claim a 16-element event-stream buffer that is actually 1024 in production (Client.Java-017). |
## Findings
@@ -205,3 +209,123 @@
**Recommendation:** Make `resolved()` return an immutable resolved value object, or compute `resolvedApiKey`/`resolvedTimeout` lazily in their getters so call ordering cannot produce stale output.
**Resolution:** (2026-05-18) Confirmed against source: `resolved()` populated the `resolvedApiKey`/`resolvedTimeout` mutable fields and `toClientOptions()`/`redactedJsonMap()` read them, so calling either before `resolved()` emitted stale empty/30s defaults. The two mutable fields were removed and replaced with side-effect-free accessor methods `resolvedApiKey()` and `resolvedTimeout()` that compute their value on each call (API key from `--api-key` or the `--api-key-env` variable; timeout via `parseDuration`). `toClientOptions()` and `redactedJsonMap()` now call those accessors directly, so call ordering can no longer produce stale output. `resolved()` is retained as a no-op returning `this` purely for call-site readability (`common.resolved()`), with its Javadoc updated to state resolution is now lazy. Pure-refactor with no runtime-behavior change for the existing call order, so no new test was added; covered by the existing `MxGatewayCliTests` JSON-redaction and option-parsing tests.
### Client.Java-013
| Field | Value |
|---|---|
| Severity | High |
| Category | Testing coverage |
| Location | `clients/java/mxgateway-cli/src/test/java/com/dohertylan/mxgateway/cli/MxGatewayCliTests.java:212-304`, `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:1214-1244` |
| Status | Resolved |
**Description:** `MxGatewayCliSession` in `MxGatewayCli.java:1214` was extended in commit `f220908` (the "bulk read/write CLI subcommands" change) with five new abstract methods — `readBulk`, `writeBulk`, `write2Bulk`, `writeSecuredBulk`, `writeSecured2Bulk`. The test-only `FakeSession` in `MxGatewayCliTests.java:212` still only implements the original set (register/addItem/advise/writeRaw/subscribeBulk/unsubscribeBulk/streamEventsAfter) and is declared a concrete (non-abstract) class. A clean compile of `mxgateway-cli`'s test source set therefore fails: a concrete implementer that omits abstract interface methods is a compile error. The stale `.class` files under `build/classes/java/test/` predate the interface change (dated 2026-05-20 03:38 vs CLI source dated 2026-05-20 05:06), which is why the issue is not visible until the next clean build. `gradle test` (or any CI pipeline that does not retain incremental state) will fail to build the CLI test module. The `CLAUDE.md` source-update workflow row "When source code changes, build and test the affected component" was not honoured for this CLI contract change.
**Recommendation:** Add the five missing `@Override` implementations to `FakeSession` (stubs returning empty lists are fine — only `subscribeBulk`/`unsubscribeBulk` are exercised by the existing tests, and the new bulk subcommands have no dedicated CLI tests yet). Optionally also add at least one CLI-level test for `read-bulk`, `write-bulk`, and the `bench-read-bulk` subcommands to keep parity with the .NET / Go / Rust CLI smoke matrix.
**Resolution:** 2026-05-20 — Added the five missing `@Override` stubs (`readBulk`, `writeBulk`, `write2Bulk`, `writeSecuredBulk`, `writeSecured2Bulk`) to `FakeSession` in `clients/java/mxgateway-cli/src/test/java/com/dohertylan/mxgateway/cli/MxGatewayCliTests.java`, each returning an empty `ArrayList<>` to match the interface return types (`List<BulkReadResult>` / `List<BulkWriteResult>`) without throwing. Imported `BulkReadResult`, `BulkWriteResult`, `WriteBulkEntry`, `Write2BulkEntry`, `WriteSecuredBulkEntry`, `WriteSecured2BulkEntry` from `mxaccess_gateway.v1.MxaccessGateway`. `GrpcMxGatewayCliSession` in `MxGatewayCli.java` is the only other implementer and already provides the methods (the source change that introduced the contract added them there). Verified with `gradle clean` followed by `gradle :mxgateway-cli:compileTestJava` and `gradle :mxgateway-cli:test` from `clients/java`, both BUILD SUCCESSFUL. No new CLI-level tests for the bulk subcommands were added — that follow-up is tracked separately and out of scope for this unblock-compilation fix.
### Client.Java-014
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:59-65,117-124` |
| Status | Resolved |
**Description:** `MxEventStream.observer().beforeStart` simply assigns `requestStream` without checking the `closed` flag, while `close()` reads `requestStream` after setting `closed = true`. If `close()` runs *before* the gRPC call has attached its `ClientCallStreamObserver` (a real race when callers cancel immediately after subscribing — e.g. construct, then close in a `finally` block when an unrelated setup step throws), then at close time `requestStream` is `null`, so `stream.cancel(...)` is skipped. `beforeStart` then fires later, stores the live `requestStream`, and never observes `closed` — the underlying gRPC call leaks open and continues delivering events to a `MxEventStream` whose consumer has stopped iterating. The sibling `DeployEventStream.beforeStart` already does the correct thing (`if (closed.get()) { requestStream.cancel(...); }`); the two adaptors should behave identically.
**Recommendation:** Mirror `DeployEventStream`'s pattern in `MxEventStream.beforeStart`: after storing `requestStream`, check the `closed` flag and cancel the stream eagerly if a prior `close()` has already fired. Add a regression test analogous to `GalaxyRepositoryClientTests.deployEventStreamCloseBeforeBeforeStartCancelsStream` to lock in the behavior.
**Resolution:** 2026-05-20 — Mirrored `DeployEventStream.beforeStart` in `MxEventStream.beforeStart`: after storing the `ClientCallStreamObserver`, the observer now reads the `closed` flag and calls `requestStream.cancel("client cancelled event stream", null)` when a prior `close()` already fired, closing the close/beforeStart race that previously leaked the underlying gRPC call. The fix uses the existing `volatile boolean closed` field (already established as a happens-before publisher by `close()` setting it before reading `requestStream`); no field shape changes were needed. `clients/java/README.md` documents the new safe-close-before-beforeStart contract. Regression test: `MxGatewayMediumFindingsTests.mxEventStreamCloseBeforeBeforeStartCancelsStream` (mirrors `GalaxyRepositoryClientTests.deployEventStreamCloseBeforeBeforeStartCancelsStream`).
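The close/`beforeStart` ordering described above can be sketched without gRPC on the classpath. This is a minimal illustration, not the actual `MxEventStream` code: `CancellableCall` is a hypothetical stand-in for `ClientCallStreamObserver`, `EventStreamSketch` is a hypothetical reduction of the stream to its two fields, and marking `requestStream` as `volatile` is an assumption made for the sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CloseBeforeStartSketch {
    /** Hypothetical stand-in for gRPC's ClientCallStreamObserver; only cancel is modelled. */
    interface CancellableCall {
        void cancel(String message, Throwable cause);
    }

    /** Hypothetical reduction of the event stream to its closed flag and call handle. */
    static final class EventStreamSketch {
        private volatile boolean closed;
        private volatile CancellableCall requestStream; // volatile is an assumption of this sketch

        /** Invoked by the transport once the call handle exists. */
        void beforeStart(CancellableCall call) {
            requestStream = call;
            // If close() already ran, it saw a null handle and cancelled nothing,
            // so the cancel has to happen here instead.
            if (closed) {
                call.cancel("client cancelled event stream", null);
            }
        }

        void close() {
            closed = true;
            CancellableCall call = requestStream;
            if (call != null) {
                call.cancel("client cancelled event stream", null);
            }
        }
    }

    /** Returns how many times the call was cancelled for the given ordering. */
    static int cancelCount(boolean closeFirst) {
        AtomicInteger cancels = new AtomicInteger();
        CancellableCall call = (message, cause) -> cancels.incrementAndGet();
        EventStreamSketch stream = new EventStreamSketch();
        if (closeFirst) {
            stream.close();
            stream.beforeStart(call);
        } else {
            stream.beforeStart(call);
            stream.close();
        }
        return cancels.get();
    }

    public static void main(String[] args) {
        System.out.println("close first: " + cancelCount(true));
        System.out.println("beforeStart first: " + cancelCount(false));
    }
}
```

In both orderings the call ends up cancelled. Without the `if (closed)` check in `beforeStart`, the close-first ordering would leave the call open.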
### Client.Java-015
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayChannels.java:112-138`, `MxGatewayClient.java:183-191,224-232,322-329`, `GalaxyRepositoryClient.java:164-170,212-214` |
| Status | Resolved |
**Description:** `MxGatewayChannels.toCompletable` registers a `whenComplete` on the local `target` future to forward cancellation to the source gRPC `ListenableFuture`. Every caller — `openSessionAsync`, `invokeAsync`, `acknowledgeAlarmAsync`, `discoverHierarchyPageAsync`, `getLastDeployTimeAsync` — then chains `.thenApply(normalisingValidator(...))` or `.thenApply(::getOk)` and returns the *chained* future to the user. `CompletableFuture.thenApply` returns a new future whose cancellation does **not** propagate back to the source `target`. Cancelling the user-facing future therefore never sets `target.isCancelled() == true`, so `source.cancel(true)` is never invoked and the underlying gRPC call continues until its deadline expires. The `JavaClientDesign.md` "Streaming" section explicitly says "Stream cancellation should call `ClientCall.cancel`" — the same expectation reasonably applies to the unary `*Async` surface.
**Recommendation:** Either return `target` directly from each `*Async` method (and inline the validator into the `FutureCallback.onSuccess` path so no `thenApply` is needed), or attach the cancellation listener to the *final* returned future. The cleanest fix is to have `MxGatewayChannels.toCompletable` return a future that wraps the validator internally and registers `whenComplete` on the final future. Add a regression test that cancels the user-facing future and verifies the gRPC call was cancelled (e.g. via a `ServerCallStreamObserver.setOnCancelHandler` latch).
**Resolution:** 2026-05-20 — Fixed by inlining the reply validator into `MxGatewayChannels.toCompletable` so the user-visible future is the same future cancellation is bound to: added a new `toCompletable(source, operation, validator)` overload that runs the validator inside the `FutureCallback.onSuccess` path (normalising non-`MxGatewayException` `RuntimeException`s through `MxGatewayErrors.fromGrpc`, matching the existing synchronous `try/catch`). Replaced the previous `whenComplete`-based cancellation listener with a small `CancellingCompletableFuture<T>` subclass whose `cancel(boolean)` forwards to the source `ListenableFuture.cancel(...)` unconditionally, so even the no-validator overload propagates cancellation deterministically (the `whenComplete` listener only fired when `target.isCancelled()` was already true, which is exactly the case `thenApply` broke). Updated `MxGatewayClient.openSessionAsync`, `MxGatewayClient.invokeAsync`, `MxGatewayClient.acknowledgeAlarmAsync`, `GalaxyRepositoryClient.testConnectionAsync`, and `GalaxyRepositoryClient.getLastDeployTimeAsync` to use the new validator overload directly (no `.thenApply` chain). `GalaxyRepositoryClient.discoverHierarchyAsync` is paged via `thenCompose`, so it now publishes the current in-flight page future via an `AtomicReference` and returns a top-level `CompletableFuture` whose overridden `cancel(boolean)` cancels whichever page is currently outstanding. `clients/java/README.md` documents the new cancellation contract: cancelling any `*Async` future aborts the underlying gRPC call. 
Regression tests: `MxGatewayMediumFindingsTests.invokeAsyncCancellationCancelsUnderlyingGrpcCall` (full in-process gRPC test using `ServerCallStreamObserver.setOnCancelHandler` to latch when the server observes RPC cancellation), `toCompletableValidatorOverloadForwardsCancellationToSource`, and `toCompletableNoValidatorOverloadForwardsCancellationToSource` (unit-level proofs that both `MxGatewayChannels.toCompletable` overloads forward `cancel(true)` to the source `ListenableFuture`).
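The underlying `CompletableFuture` behaviour can be reproduced with the JDK alone. The sketch below is illustrative only: `ForwardingFuture` is a hypothetical name standing in for the `CancellingCompletableFuture<T>` described above, and a plain `java.util.concurrent.Future` stands in for the gRPC `ListenableFuture` source.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

final class CancellationSketch {
    /** Hypothetical future whose cancel() is forwarded to an upstream source. */
    static final class ForwardingFuture<T> extends CompletableFuture<T> {
        private final Future<?> source;

        ForwardingFuture(Future<?> source) {
            this.source = source;
        }

        @Override
        public boolean cancel(boolean mayInterruptIfRunning) {
            // Forward unconditionally, then cancel this future as usual.
            source.cancel(mayInterruptIfRunning);
            return super.cancel(mayInterruptIfRunning);
        }
    }

    /** Cancels a thenApply-derived future and reports whether the source noticed. */
    static boolean chainedCancelReachesSource() {
        CompletableFuture<String> source = new CompletableFuture<>();
        CompletableFuture<Integer> chained = source.thenApply(String::length);
        chained.cancel(true);
        return source.isCancelled();
    }

    /** Cancels a forwarding future and reports whether the source noticed. */
    static boolean forwardingCancelReachesSource() {
        CompletableFuture<String> source = new CompletableFuture<>();
        ForwardingFuture<Integer> userFacing = new ForwardingFuture<>(source);
        userFacing.cancel(true);
        return source.isCancelled();
    }

    public static void main(String[] args) {
        System.out.println("thenApply chain reaches source: " + chainedCancelReachesSource());
        System.out.println("forwarding future reaches source: " + forwardingCancelReachesSource());
    }
}
```

The first helper shows why a `whenComplete` listener on the source never fires for a cancellation issued on the chained future. The second shows the forwarding approach reaching the source directly.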
### Client.Java-016
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:361-391`, `GalaxyRepositoryClient.java:285-315` |
| Status | Resolved |
**Description:** Client.Java-009 introduced `MxGatewayChannels` to deduplicate `createChannel`, `withDeadline`, `withStreamDeadline`, and `toCompletable`. The two `close()` / `closeAndAwaitTermination()` methods — added shortly after to fix Client.Java-006 — were not extracted along with them. The 30-line bodies of `MxGatewayClient.close()` + `closeAndAwaitTermination()` and `GalaxyRepositoryClient.close()` + `closeAndAwaitTermination()` are now duplicated verbatim, including the `awaitTermination(connectTimeout)` semantic (see Client.Java-019), the `InterruptedException` handling, and the `ownedChannel == null` guard. A fix to one path (e.g. introducing a dedicated `shutdownTimeout` option) will silently miss the other.
**Recommendation:** Move the shutdown logic into `MxGatewayChannels.shutdown(ManagedChannel channel, MxGatewayClientOptions options)` and `MxGatewayChannels.shutdownAndAwaitTermination(...)`. Have both clients delegate to it. Same recommendation applies to the duplicated `MxGatewayAuthInterceptor` construction in the two constructors (`MxGatewayClient(Channel, ...)` and `GalaxyRepositoryClient(Channel, ...)`).
**Resolution:** 2026-05-20 — Extracted the duplicated shutdown logic into `MxGatewayChannels.shutdown(ManagedChannel, MxGatewayClientOptions)` and `MxGatewayChannels.shutdownAndAwaitTermination(ManagedChannel, MxGatewayClientOptions)`. Both helpers handle the `ownedChannel == null` no-op, the orderly-shutdown / `awaitTermination` / `shutdownNow`-on-timeout escalation, and the `InterruptedException`-restoring-the-interrupt-flag path. `MxGatewayClient.close()`/`closeAndAwaitTermination()` and `GalaxyRepositoryClient.close()`/`closeAndAwaitTermination()` are now one-liners that delegate to the shared helpers, so a future change (such as Client.Java-019's `shutdownTimeout`) lives in one place. Unused `java.util.concurrent.TimeUnit` imports were removed from both clients. The constructor-level `MxGatewayAuthInterceptor` duplication noted in the recommendation was left in place — it is a single intercept call per constructor (2 lines) versus the 30-line shutdown duplication that was the actual maintenance hazard. Regression tests: `MxGatewayLowFindingsIITests.sharedShutdownHelperIsNoOpForNullChannel` (covers the null-channel guard), `shutdownAndAwaitTerminationHonoursShutdownTimeoutNotConnectTimeout`, and `shutdownEscalatesToShutdownNowWhenTimeoutExceeded` (cover the shared shutdown semantics; the second is also the Client.Java-019 regression).
### Client.Java-017
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:25-36`, `clients/java/README.md:99-107` |
| Status | Resolved |
**Description:** The `MxEventStream` buffer was recently widened from 16 elements to 1024 elements (`MxGatewayClient.streamEvents` at line 268: `new MxEventStream(1024)`). The class-level Javadoc on `MxEventStream` still says "the gateway can push events faster than the consumer drains the bounded 16-element buffer", and `clients/java/README.md` line 103 says "uses gRPC's default auto-inbound flow control with a fixed 16-element buffer". The fail-fast event-backpressure contract (Client.Java-011 resolution) was written against the older capacity. The `MxGatewayClient.streamEvents` inline comment even acknowledges the change ("A small queue overflows on any moderately active session; 1024 covers a realistic backlog"). Users of this surface will reason about realistic backpressure budgets using the wrong number.
**Recommendation:** Update the `MxEventStream` Javadoc and the README to say "1024-element buffer" (or, since the capacity is a passed parameter, document it as a parameter rather than a constant). Consider exposing the capacity through `MxGatewayClientOptions` so callers can tune it per session.
**Resolution:** 2026-05-20 — Updated the `MxEventStream` class Javadoc and `clients/java/README.md` so both say "1024-element buffer" instead of the obsolete "16-element buffer". The Javadoc also notes that capacity is a constructor parameter and that the production caller (`MxGatewayClient.streamEvents`) passes `1024` to absorb the session-backlog replay burst, so readers understand the value is a deliberate choice rather than a constant. Exposing the capacity through `MxGatewayClientOptions` was intentionally left out of scope — the v1 design keeps the event-stream surface minimal and `MxGatewayClient.streamEvents` is the only caller; if a tuning need arises in v2 the existing constructor already accepts the capacity.
### Client.Java-018
| Field | Value |
|---|---|
| Severity | Low |
| Category | Security |
| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySecrets.java:54-66` |
| Status | Resolved |
**Description:** `redactCredentials(value)` splits its input on `\\s+` (whitespace) and only redacts whitespace-delimited tokens that start with `mxgw_` or equal `bearer` (case-insensitive). gRPC `Status.getDescription()` strings, log lines, and proto error messages can carry credentials separated by colons (`Bearer:mxgw_id_secret`), commas (`token=mxgw_id_secret,scope=...`), single quotes (`'mxgw_id_secret'`), parentheses (`(mxgw_id_secret)`), or embedded in URLs/paths — all of which leave the `mxgw_` token attached to a non-whitespace neighbour and survive redaction. `MxGatewayErrors.fromGrpc` is the primary consumer; a gateway error description like `authentication failed: 'mxgw_id_secret'` would round-trip the secret into the resulting `MxGatewayAuthenticationException` message.
**Recommendation:** Replace the whitespace-split scrub with a regex-based pass that matches `mxgw_[A-Za-z0-9_-]+` anywhere in the string and substitutes `<redacted>`; also redact `Bearer\s+\S+` as a unit so the token after `Bearer` is masked regardless of the surrounding punctuation. Cover with a fixture-style test alongside `MxGatewayFixtureTests.grpcAuthErrorsAreClassifiedAndRedacted` that asserts a quoted or comma-delimited credential is fully masked.
**Resolution:** 2026-05-20 — Replaced the whitespace-split scrub with two compiled `Pattern` regexes: `mxgw_[A-Za-z0-9_-]+` matches any gateway-shaped credential anywhere in the string regardless of surrounding punctuation, and `(?i)bearer\s+\S+` masks an authorization-header style `Bearer <token>` as a unit so a non-mxgw bearer token cannot leak either. The mxgw pass runs first, so the bearer pass observes `Bearer <redacted>` for the common combined case and renders it idempotently. Regression tests in `MxGatewayFixtureTests`: `redactCredentialsHandlesNonWhitespaceDelimitedTokens` exercises single-quoted, double-quoted, comma-delimited, colon-delimited, parenthesised, URL-embedded, and bearer-header credentials; `redactCredentialsLeavesBenignContentAlone` confirms strings without credentials and a `null` input are unchanged.
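A minimal sketch of the two-pass regex scrub, using the two patterns quoted in the resolution. The class name `RedactionSketch`, the method name `redact`, and the exact replacement text for the bearer pass are assumptions for illustration; the real `MxGatewaySecrets.redactCredentials` may differ in those details.

```java
import java.util.regex.Pattern;

final class RedactionSketch {
    // Matches a gateway-shaped credential anywhere in the string.
    private static final Pattern MXGW_TOKEN = Pattern.compile("mxgw_[A-Za-z0-9_-]+");
    // Matches "Bearer <token>" as one unit, case-insensitively.
    private static final Pattern BEARER = Pattern.compile("(?i)bearer\\s+\\S+");

    /** Masks credentials regardless of the punctuation around them. */
    static String redact(String value) {
        if (value == null) {
            return null;
        }
        // The mxgw pass runs first, so the bearer pass sees an already-masked token.
        String masked = MXGW_TOKEN.matcher(value).replaceAll("<redacted>");
        // Replacement text is an assumption of this sketch.
        return BEARER.matcher(masked).replaceAll("Bearer <redacted>");
    }

    public static void main(String[] args) {
        System.out.println(redact("authentication failed: 'mxgw_id_secret'"));
        System.out.println(redact("token=mxgw_id_secret,scope=read"));
        System.out.println(redact("Authorization: Bearer abc.def"));
    }
}
```

Because the token pattern stops at the first character outside `[A-Za-z0-9_-]`, surrounding quotes, commas, and parentheses are preserved while the credential itself is masked.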
### Client.Java-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:362-391`, `GalaxyRepositoryClient.java:286-315` |
| Status | Resolved |
**Description:** Both clients' `close()` / `closeAndAwaitTermination()` use `options.connectTimeout()` as the upper bound on `awaitTermination`. The `connectTimeout` semantically describes how long the client will wait to *establish* the channel, not how long it should wait for in-flight calls and the Netty event loop to drain after `shutdown()`. With the default 10s connect timeout, shutting down a client with a long-running unary call already in flight will silently escalate to `shutdownNow()` and forcibly cancel it before the call's own deadline expires, defeating the deadline contract on `withDeadline`. Conversely, a caller who sets a small `connectTimeout` (e.g. 500 ms for a health probe) inherits an aggressively short shutdown deadline they probably did not intend.
**Recommendation:** Introduce a dedicated `shutdownTimeout` on `MxGatewayClientOptions` (defaulting to e.g. 5–10 s, independent of `connectTimeout`) and use it in `close()` and `closeAndAwaitTermination()`. Document the precedence in the Javadoc. This pairs naturally with the Client.Java-016 deduplication fix.
**Resolution:** 2026-05-20 — Added a dedicated `shutdownTimeout` `Duration` on `MxGatewayClientOptions` (builder method `shutdownTimeout(Duration)`, accessor `shutdownTimeout()`, default 10 s), independent of `connectTimeout`. Both shared shutdown helpers introduced for Client.Java-016 (`MxGatewayChannels.shutdown` and `shutdownAndAwaitTermination`) call `options.shutdownTimeout()` as the `awaitTermination` upper bound, so a small `connectTimeout` (e.g. a 500 ms health-probe timeout) no longer forces a premature `shutdownNow()` on in-flight calls. The new option is reflected in `toString()` and documented on both helpers and the `close()`/`closeAndAwaitTermination()` Javadoc on both clients; `clients/java/README.md` notes the default and the independence from `connectTimeout`. Regression tests in `MxGatewayLowFindingsIITests`: `shutdownAndAwaitTerminationHonoursShutdownTimeoutNotConnectTimeout` (a 50 ms connect timeout + 1 s shutdown timeout + 200 ms graceful-termination channel never escalates to `shutdownNow()`), `shutdownEscalatesToShutdownNowWhenTimeoutExceeded` (a stuck channel beyond the shutdown timeout is forcibly shut down), and `shutdownTimeoutDefaultIsTenSecondsIndependentOfConnectTimeout` (the default holds even when `connectTimeout` is small).
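The shutdown escalation can be sketched with the JDK's `ExecutorService`, which exposes the same `shutdown()` / `awaitTermination(...)` / `shutdownNow()` method names as gRPC's `ManagedChannel`. This is a stand-in, not the actual `MxGatewayChannels` helper: the class and method names are hypothetical, and calling `shutdownNow()` on interruption is an assumption of the sketch.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ShutdownTimeoutSketch {
    /**
     * Returns true when the channel terminated gracefully within shutdownTimeout,
     * false when it had to be forced down. A null channel is a no-op.
     */
    static boolean shutdownAndAwait(ExecutorService channel, Duration shutdownTimeout) {
        if (channel == null) {
            return true;
        }
        channel.shutdown();
        try {
            if (!channel.awaitTermination(shutdownTimeout.toMillis(), TimeUnit.MILLISECONDS)) {
                channel.shutdownNow(); // escalate only after the dedicated shutdown deadline
                return false;
            }
            return true;
        } catch (InterruptedException e) {
            channel.shutdownNow(); // assumption: force shutdown when interrupted
            Thread.currentThread().interrupt(); // restore the interrupt flag
            return false;
        }
    }

    /** An idle channel should terminate without escalation. */
    static boolean idleChannelTerminatesGracefully() {
        return shutdownAndAwait(Executors.newSingleThreadExecutor(), Duration.ofSeconds(1));
    }

    /** A channel with a long in-flight task exceeds a short shutdown timeout. */
    static boolean stuckChannelTerminatesGracefully() {
        ExecutorService stuck = Executors.newSingleThreadExecutor();
        stuck.execute(() -> {
            try {
                Thread.sleep(5_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        return shutdownAndAwait(stuck, Duration.ofMillis(50));
    }

    public static void main(String[] args) {
        System.out.println("idle graceful: " + idleChannelTerminatesGracefully());
        System.out.println("stuck graceful: " + stuckChannelTerminatesGracefully());
    }
}
```

The point of the sketch is that the deadline passed to `awaitTermination` is its own parameter, so a short connect timeout elsewhere has no effect on when escalation happens.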
### Client.Java-020
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:244-254`, `galaxy_repository.proto:94` |
| Status | Resolved |
**Description:** `galaxy_repository.proto` defines `DeployEvent.sequence` as `uint64`; the protobuf Java mapping projects that to a signed `long`. The CLI's text-mode `galaxy-watch` output prints it as `"seq=%d ..."`, which interprets the value as signed. For genuine wraparound this is implausible (deploy sequences will not reach `2^63`), but the broader pattern is brittle: any unsigned proto field printed via `%d` will display incorrectly past the signed boundary. The JSON path uses `protoJson(event)` which formats unsigned longs as numeric strings via `JsonFormat`, so JSON output is correct; only the text mode is at risk.
**Recommendation:** Print the sequence with `Long.toUnsignedString(event.getSequence())` (or switch the text format to `%s` and pass the unsigned-string conversion). The same rule should apply to any other `uint64` proto fields that surface in CLI text output.
**Resolution:** 2026-05-20 — Updated the `galaxy-watch` text-mode `out.printf` in `MxGatewayCli.GalaxyWatchCommand.call()` to use `%s` for the sequence field and pass `Long.toUnsignedString(event.getSequence())`, so deploy sequences past `2^63` render as their correct unsigned decimal string instead of a negative signed long. The JSON path through `protoJson(event)` was already correct (proto `JsonFormat` emits unsigned longs as decimal strings) and was left unchanged. An inline comment near the printf documents the unsigned-uint64 contract so the next person editing the format string knows not to switch back to `%d`. Regression test: `MxGatewayCliTests.deployEventSequenceRendersAsUnsignedForHighUint64` exercises the format string with the max-uint64 bit pattern (`-1L`) and asserts the output contains `seq=18446744073709551615` and does not contain `seq=-1`.
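The difference between the two format strings is easy to reproduce in isolation. The sketch below is illustrative: the class and method names are hypothetical, and only the `seq=` portion of the CLI line is modelled.

```java
final class UnsignedSequenceSketch {
    /** Renders a uint64 value carried in a signed long as its unsigned decimal form. */
    static String formatSeq(long sequence) {
        return String.format("seq=%s", Long.toUnsignedString(sequence));
    }

    /** The previous behaviour: %d interprets the same bits as a signed long. */
    static String formatSeqSigned(long sequence) {
        return String.format("seq=%d", sequence);
    }

    public static void main(String[] args) {
        long maxUint64Bits = -1L; // all 64 bits set, i.e. the largest uint64 value
        System.out.println(formatSeq(maxUint64Bits));
        System.out.println(formatSeqSigned(maxUint64Bits));
    }
}
```

For values below `2^63` both helpers produce the same text, which is why the defect would not show up in ordinary use.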
+271 -12
@@ -4,25 +4,29 @@
|---|---|
| Module | `clients/python` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `3cc53a8` |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
A re-review at commit `1cd51bb` over the same module. Prior findings
(Client.Python-001 — Client.Python-012) remain closed and are kept as
history. This section reflects categories evaluated in this pass.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: dead `closed` variable (Client.Python-004); float/bytes value-mapping assumptions (Client.Python-008). |
| 2 | mxaccessgw conventions | Largely adheres; one missing export and a `*_raw` MXAccess-failure documentation gap (Client.Python-002, Client.Python-012). |
| 3 | Concurrency & thread safety | Issue found: `close()` idempotency claim does not hold under concurrent close (Client.Python-006). |
| 4 | Error handling & resilience | Issues found: inconsistent timeout-kwarg fallback (Client.Python-003); `success == 0` default-value hazard (Client.Python-011); inconsistent cancel helpers (Client.Python-007). |
| 5 | Security | No issues found — API keys redacted in repr and CLI output, TLS supported, no secret logging. |
| 6 | Performance & resource management | Issue found: `discover_hierarchy` buffers the whole hierarchy in memory (Client.Python-005). |
| 7 | Design-document adherence | Matches the design docs closely; minor CLI doc drift (Client.Python-001). |
| 8 | Code organization & conventions | Issues found: `MxGatewayCommandError` omitted from `__all__` (Client.Python-002); fragile circular-import workaround (Client.Python-010). |
| 9 | Testing coverage | Issue found: `write2`, `add_item2`, bulk-size limits, TLS `ca_file`, and CLI command bodies untested (Client.Python-009). |
| 10 | Documentation & comments | Issue found: stale "scaffold" package description (Client.Python-001). |
| 1 | Correctness & logic bugs | Issue found: `_use_plaintext` silently downgrades any `localhost:` / `127.0.0.1:` endpoint to plaintext (Client.Python-013). |
| 2 | mxaccessgw conventions | No new issues found — secrets redacted, MXAccess parity preserved, generated code untouched, no Blazor/COM violations apply (Python client). |
| 3 | Concurrency & thread safety | No new issues found: the close-idempotency hazard was fixed in Client.Python-006, and the shared `_canceling_iterator` cancels on `CancelledError`. |
| 4 | Error handling & resilience | No new issues found at this commit (prior 003, 007, 011 remain closed). |
| 5 | Security | Issue found: implicit plaintext-on-localhost (Client.Python-013) means a user explicitly listing a TLS-fronted loopback endpoint with `--api-key` but without `--tls`/`--plaintext` silently transmits the bearer token in cleartext. |
| 6 | Performance & resource management | No new issues found: `iter_hierarchy` streams pages lazily (Client.Python-005 resolution). |
| 7 | Design-document adherence | No new issues found — `PythonClientDesign.md` matches the implemented surface. |
| 8 | Code organization & conventions | Issue found: duplicate `from mxgateway.values import` lines in `commands.py:22-23` (Client.Python-014). |
| 9 | Testing coverage | Issues found: `bench_read_bulk` CLI body, `MAX_AGGREGATE_EVENTS` event-cap, and `_use_plaintext` localhost-auto-plaintext path are untested (Client.Python-015, Client.Python-016). |
| 10 | Documentation & comments | Issues found: `pyproject.toml` lacks PyPI metadata (`authors`, `license`, `classifiers`, `urls`) and no PEP 561 `py.typed` marker (Client.Python-017); auto-plaintext behaviour is undocumented (Client.Python-013). |
## Findings
@@ -205,3 +209,258 @@
**Recommendation:** Document explicitly (README + docstring) that `*_raw` methods surface MXAccess HRESULT/status failures only inside the reply and do not raise `MxAccessError`, so parity-test callers know to inspect `protocol_status`/`hresult`/`statuses` themselves.
**Resolution:** 2026-05-18 — Won't Fix (no behaviour change). Confirmed this is intentional, correct parity behaviour: the `*_raw` methods exist precisely so parity-test callers can inspect an unmodified gateway reply, including embedded MXAccess HRESULT/status failures, without an exception masking them. Changing `invoke_raw` to raise `MxAccessError` would defeat its purpose and duplicate `Session.invoke`. The finding's only actionable point is the documentation gap, which has been addressed: `clients/python/README.md` now states explicitly that `*_raw` methods enforce gateway protocol success only and do **not** run MXAccess-failure detection, and the docstrings of `GatewayClient.invoke_raw` and `Session.invoke_raw` say the same and point callers to inspect `protocol_status`/`hresult`/`statuses` (and to `Session.invoke` for the checked variant). No code/test change — the runtime contract is unchanged and correct.
### Client.Python-013
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `clients/python/src/mxgateway_cli/commands.py:757-762` |
| Status | Resolved |
**Description:** `_use_plaintext` silently returns `True` whenever the endpoint
string starts with `localhost:` or `127.0.0.1:`, even if neither `--plaintext`
nor `--tls` is supplied on the command line. Any CLI subcommand (e.g.
`mxgw-py open-session --endpoint localhost:5001 --api-key mxgw_<secret>`) then
attaches the API key to a plaintext gRPC channel without warning. This is a
silent security downgrade: a user who deliberately ran the gateway behind TLS
on loopback (e.g. for testing a production-shaped TLS config locally) and who
passes `--api-key` expecting the secret to be transport-protected gets a
plaintext bearer token instead. The auto-downgrade is also undocumented —
`README.md` and the CLI `--help` text both describe `--plaintext` and `--tls`
as the controls, with no mention that endpoint-prefix matching can override
either. The other client CLIs do not auto-downgrade: the .NET CLI uses
`https://`-prefix detection on a URI scheme (an explicit signal), Go and Java
require an explicit `--plaintext`/`--tls` choice, and Rust uses plaintext
only when `plaintext = true` is set on the options struct.
**Recommendation:** Drop the localhost-prefix auto-plaintext branch and
require the user to pass `--plaintext` or `--tls` (or default to TLS to match
the rest of the matrix). If the implicit-localhost behaviour is kept for
ergonomics, document it prominently in both `README.md` and `--help`, emit a
stderr warning when `--api-key` is combined with the auto-downgrade path, and
add a CLI test asserting the auto-downgrade is in fact active so it is not
silently lost in a future refactor.
**Resolution:** 2026-05-20 — Removed the silent `localhost:` / `127.0.0.1:`
auto-plaintext branch from `_use_plaintext`. The new contract matches the Go
and Java CLIs: **TLS is the default**, `--plaintext` is the only way to opt
in to an unencrypted channel, and `--tls` is accepted as a redundant, explicit
affirmation of the default (mutually exclusive with `--plaintext`, which now
raises `click.UsageError`). The `--plaintext` / `--tls` `--help` text and
`clients/python/README.md` both call out the new behaviour. Added six
regression tests in `clients/python/tests/test_cli.py` covering: (a) a
`localhost:` endpoint with no flags resolves to TLS, (b) a `127.0.0.1:`
endpoint with no flags resolves to TLS, (c) `--plaintext` opts in to plaintext,
(d) `--tls` is accepted and idempotent with the default, (e) `--plaintext`
combined with `--tls` is rejected, and (f) an end-to-end CliRunner test
asserting `ClientOptions.plaintext == False` flows through to
`GatewayClient.connect` when no flag is supplied against a `localhost:`
endpoint. **Behaviour change for callers:** scripts that previously relied on
`mxgw-py … --endpoint localhost:5000 …` selecting plaintext silently must now
add an explicit `--plaintext` flag (or set up TLS on the gateway). Calling
`mxgw-py` with an `--api-key` against a plaintext-only gateway without
`--plaintext` will now fail to connect rather than silently leaking the bearer
token.
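The resolved contract can be sketched as a small pure function. This is an illustration under assumptions, not the code in `commands.py`: the signature is hypothetical, and `ValueError` stands in for the `click.UsageError` the CLI raises so the sketch needs no third-party imports.

```python
def use_plaintext(endpoint: str, plaintext: bool = False, tls: bool = False) -> bool:
    """Decide whether to open an unencrypted channel.

    Hypothetical stand-in for the CLI's `_use_plaintext`. The endpoint is
    accepted but deliberately not inspected: a `localhost:` or `127.0.0.1:`
    prefix no longer influences the result.
    """
    if plaintext and tls:
        # The real CLI raises click.UsageError here.
        raise ValueError("--plaintext and --tls are mutually exclusive")
    # TLS is the default; only an explicit --plaintext opts out.
    return plaintext


print(use_plaintext("localhost:5001"))                  # False
print(use_plaintext("localhost:5001", plaintext=True))  # True
```

Because the endpoint string is ignored, the six regression cases reduce to checking the two flags.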
### Client.Python-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/python/src/mxgateway_cli/commands.py:22-23` |
| Status | Resolved |
**Description:** `commands.py` has two consecutive `from mxgateway.values
import` lines:
```python
from mxgateway.values import to_mx_value
from mxgateway.values import MxValueInput
```
These import from the same module and should be combined into a single
`from mxgateway.values import MxValueInput, to_mx_value`. The split form is
inconsistent with the rest of the file (every other module is imported in a
single statement) and would be flagged by `ruff`/`isort` if any linter were
configured. Pure style, no behavioural impact.
**Recommendation:** Collapse the two imports into one statement, ordered to
match the conventional alphabetical-within-module pattern:
`from mxgateway.values import MxValueInput, to_mx_value`.
**Resolution:** 2026-05-20 — Collapsed the two consecutive
`from mxgateway.values import to_mx_value` / `from mxgateway.values import MxValueInput`
lines in `clients/python/src/mxgateway_cli/commands.py` into a single
`from mxgateway.values import MxValueInput, to_mx_value` statement, matching
the alphabetical-within-module pattern used elsewhere in the file. Pure style
fix — no behavioural impact, covered by the existing CLI tests.
### Client.Python-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `clients/python/src/mxgateway_cli/commands.py:273-294,564-647`, `clients/python/tests/` |
| Status | Resolved |
**Description:** `_bench_read_bulk` is a ~80-line CLI body that opens its own
session, registers, bulk-subscribes, runs a warm-up loop and a measurement loop,
collects per-call latencies, computes a percentile summary, and emits the
shared cross-language JSON schema. It is the largest untested CLI command in
the module — `tests/` has no `bench_read_bulk` test, fake-stub-driven or
otherwise. A drift in the schema field names (`callsPerSecond`,
`cachedReadResults`, `latencyMs.p50`, …) would break the cross-language
`scripts/bench-read-bulk.ps1` aggregation silently. `_percentile_summary` and
`_percentile` are also untested — the boundary cases (`n == 0`, `n == 1`,
quantile interpolation) would benefit from a small unit test since the
identical algorithm is duplicated in the .NET / Go / Rust / Java drivers and
a divergence would corrupt cross-language comparisons.
**Recommendation:** Add a fake-stub-driven `bench_read_bulk` test that drives
a short `--duration-seconds 0 --warmup-seconds 0` run through `CliRunner` and
asserts the JSON schema (`language == "python"`, the full key set,
`latencyMs.p50/p95/p99/max/mean` present). Add unit tests for `_percentile`
covering `n == 0`, `n == 1`, and a known-good interpolated value at p95 so
the implementation cannot silently drift from the other clients.
**Resolution:** 2026-05-20 — Added `clients/python/tests/test_cli_bench_and_helpers.py`
with three layers of coverage. (1) `_percentile` unit tests pin the
cross-language algorithm (`rank = q * (n - 1)`, linear interpolation between
adjacent ranks): empty sample returns `0.0`, single element returns that
element, exact-rank queries return the sample value (p50 of `[10,20,30,40,50]`
is `30.0`), and the interpolated p95/p99 values (`48.0` / `49.6` for that same
five-element sample) are locked down so any drift from the .NET / Go / Rust /
Java drivers fails fast. (2) `_percentile_summary` tests assert the full
`{p50, p95, p99, max, mean}` dict shape, the zero-sample placeholder, and the
3-decimal rounding contract. (3) A `bench-read-bulk` smoke test
(`test_bench_read_bulk_emits_cross_language_schema`) drives the CLI through
`CliRunner` with `--duration-seconds 0 --warmup-seconds 0` against a fake stub
that handles `OpenSession`, `Register`, `SubscribeBulk`, `ReadBulk`, and
`UnsubscribeBulk`, then asserts the emitted JSON has exactly the 16
cross-language schema keys (`language`, `command`, `endpoint`, `clientName`,
`bulkSize`, `durationSeconds`, `warmupSeconds`, `durationMs`, `tags`,
`totalCalls`, `successfulCalls`, `failedCalls`, `totalReadResults`,
`cachedReadResults`, `callsPerSecond`, `latencyMs`) and that `latencyMs` is a
`{p50, p95, p99, max, mean}` sub-object — guarding against silent breakage of
`scripts/bench-read-bulk.ps1`'s cross-language aggregation. No source change —
this is a pure coverage finding.
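For reference, the algorithm the `_percentile` tests pin can be sketched as below. This is a minimal reconstruction from the description above (`rank = q * (n - 1)` with linear interpolation between adjacent ranks), not the module's source; the function name and the decision to sort inside the function are assumptions.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile; q is a fraction in [0, 1]."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        return 0.0  # empty sample returns 0.0
    if n == 1:
        return float(ordered[0])  # single element returns that element
    rank = q * (n - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, n - 1)
    fraction = rank - lower
    # Interpolate between the two samples that bracket the rank.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


sample = [10, 20, 30, 40, 50]
# p95: rank = 0.95 * 4 = 3.8, so 40 + 0.8 * (50 - 40), about 48.0
# p99: rank = 0.99 * 4 = 3.96, so 40 + 0.96 * (50 - 40), about 49.6
print(round(percentile(sample, 0.95), 3), round(percentile(sample, 0.99), 3))
```

The interpolated results carry floating-point error, so a test in this style would compare with a tolerance or after rounding.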
### Client.Python-016
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `clients/python/src/mxgateway_cli/commands.py:25,757-775,805-830` |
| Status | Resolved |
**Description:** Three CLI helper paths are not covered by `tests/`:
1. `_use_plaintext` localhost auto-downgrade (line 762): the
   `endpoint.startswith("localhost:") or endpoint.startswith("127.0.0.1:")`
   branch (see also Client.Python-013) is untested; no test asserts that an
   endpoint without `--plaintext` and without `--tls` resolves to plaintext.
2. `_collect_events` `MAX_AGGREGATE_EVENTS` guard (lines 811-815): passing
   `--max-events` greater than `MAX_AGGREGATE_EVENTS` raises
   `click.BadParameter`, but no test exercises the guard. A silent removal of
   the constant or the comparison would not be caught.
3. `_api_key_from_env` (lines 765-768): only the implicit path through
   `_secrets` is exercised; there is no test that verifies an env-var name
   resolves to a value and that an unset env var produces `None`.
These are all small, fake-stub-driven CLI behaviours rather than end-to-end
paths. The previous coverage finding (Client.Python-009) closed without
adding tests for these specific paths.
**Recommendation:** Add three small `CliRunner` / unit tests: one asserting
the localhost auto-plaintext (or its replacement, if Client.Python-013 is
fixed), one asserting `--max-events 10001` exits non-zero with the
`MAX_AGGREGATE_EVENTS` error message, and one asserting
`_api_key_from_env("MXGATEWAY_API_KEY")` returns the env value and `None` for
an unset variable.
**Resolution:** 2026-05-20 — Scope adjusted: Client.Python-013 has since
removed the `_use_plaintext` localhost auto-plaintext branch, so item (1) is
no longer a real code path — the
`test_use_plaintext_requires_explicit_flag_for_localhost_endpoint` and
`test_cli_localhost_endpoint_defaults_to_tls_via_open_session` regressions
added under Client.Python-013 already pin the new TLS-by-default contract.
The remaining two helpers are now covered in
`clients/python/tests/test_cli_bench_and_helpers.py`. (2)
`MAX_AGGREGATE_EVENTS` cap:
`test_collect_events_rejects_max_events_above_aggregate_cap` drives
`stream-events` with `--max-events 10001` through `CliRunner` against
stubbed `_connect` / `_session` fakes and asserts the CLI exits non-zero with
the documented `less than or equal to 10000` message;
`test_collect_events_accepts_max_events_at_aggregate_cap_boundary` confirms
`--max-events 10000` is accepted at the boundary and returns an empty event
list. (3) `_api_key_from_env`:
`test_api_key_from_env_resolves_value_when_variable_is_set` (env-var
populated → returned),
`test_api_key_from_env_returns_none_when_variable_is_unset` (env-var unset →
`None`), `test_api_key_from_env_returns_none_when_name_is_none` (the
`name is None` early-return), and
`test_api_key_from_env_returns_none_when_name_is_empty_string` (the
`if not name` truthiness guard). No source change — pure coverage finding.
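The two helper behaviours those tests pin can be sketched as follows. This is an illustration under assumptions rather than the code in `commands.py`: the function names are hypothetical, the cap value of 10000 is inferred from the documented error message, and `ValueError` stands in for `click.BadParameter`.

```python
import os
from typing import Optional

# Assumed value, inferred from the "less than or equal to 10000" message.
MAX_AGGREGATE_EVENTS = 10_000


def api_key_from_env(name: Optional[str]) -> Optional[str]:
    """Resolve an API key from the named environment variable."""
    if not name:  # covers both None and the empty string
        return None
    return os.environ.get(name)  # None when the variable is unset


def check_max_events(max_events: int) -> int:
    """Reject event counts above the aggregate cap."""
    if max_events > MAX_AGGREGATE_EVENTS:
        # The real CLI raises click.BadParameter here.
        raise ValueError(
            f"--max-events must be less than or equal to {MAX_AGGREGATE_EVENTS}"
        )
    return max_events
```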
### Client.Python-017
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `clients/python/pyproject.toml:5-25`, `clients/python/src/mxgateway/` |
| Status | Resolved |
**Description:** The package metadata in `pyproject.toml` is minimal for a
published wheel:
* No `authors` field. PyPI / `pip show` will display no author.
* No `license` field, no `license-files` field, and no `LICENSE` file is
referenced from the project. The repo as a whole has no top-level
`LICENSE` either, but other client packages (Java has a license entry, the
.NET package has a license expression in the `csproj`) tend to set this.
* No `classifiers` (no `Programming Language :: Python :: 3.12`,
`Operating System :: Microsoft :: Windows`, `Topic :: …`, no
development-status classifier). Without these the PyPI search facets are
empty and tooling like `pip` cannot tell whether the package is
alpha/beta/stable.
* No `keywords`, no `[project.urls]` (no homepage / source / issue link
pointing back to the repo).
* The package ships no PEP 561 `py.typed` marker file in
`src/mxgateway/`. Type hints are written throughout the module
(`from __future__ import annotations`, full annotations on every public
function), but downstream consumers running `mypy` on `mxaccess-gateway-client`
will not see those hints: under PEP 561, type checkers ignore a package's
inline annotations unless it ships the marker file.
**Recommendation:** Add `authors`, `license = "<spdx>"`, `keywords`, and
`[project.urls]` to `pyproject.toml`; add at least the standard `classifiers`
trio (`Development Status`, `Programming Language :: Python :: 3.12`,
`Intended Audience`); create an empty `src/mxgateway/py.typed` file and
include it in the wheel via `[tool.setuptools.package-data]` so consumers
running `mypy` against an installed wheel pick up the type information.
**Resolution:** 2026-05-20 — Filled out `clients/python/pyproject.toml`
with the missing PyPI metadata: `authors = [{ name = "MXAccess Gateway
Authors" }]`, `license = "Proprietary"` (the repo has no top-level
`LICENSE` file and no other client publishes under an OSS licence, so the
SPDX `Proprietary` expression matches the de-facto status), the standard
classifier set (`Development Status :: 4 - Beta`, `Intended Audience ::
Developers` / `Information Technology`, `Operating System :: Microsoft ::
Windows` and `:: POSIX`, `Programming Language :: Python` /
`Python :: 3` / `Python :: 3.12`, `Topic :: Software Development ::
Libraries :: Python Modules`, `Topic :: System :: Distributed Computing`,
and `Typing :: Typed`), a `keywords` list
(`mxaccess`, `archestra`, `gateway`, `grpc`, `industrial`, `scada`), and
`[project.urls]` with `Homepage` / `Source` / `Issues` pointing at the
Gitea repo. Added the PEP 561 marker file
`clients/python/src/mxgateway/py.typed` (empty, as the spec requires) and
declared it in `[tool.setuptools.package-data] mxgateway = ["py.typed"]`
so the wheel ships the marker and downstream `mypy` users see the
inline type hints. Pure metadata / packaging change — `python -m pytest -q`
still passes (91 tests).
+117 -12
@@ -4,25 +4,27 @@
|---|---|
| Module | `clients/rust` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `3cc53a8` |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
This re-review (`1cd51bb`) covers the changes added since `3cc53a8`: the new bulk-write/read methods on `Session`, the `read_bulk` borrowed-slice signature, `MalformedReply` / `Unavailable` error variants, the projection-on-demand `MxValue`/`MxArrayValue`, the `next_correlation_id` rework, the new ReadBulk and bulk-write CLI subcommands, and the cross-language `bench-read-bulk` driver.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: a stale unit test fails the suite (Client.Rust-003); handle extractors silently return 0 on a shapeless OK reply (Client.Rust-005). |
| 2 | mxaccessgw conventions | `cargo clippy --workspace --all-targets -- -D warnings` fails (Client.Rust-001, Client.Rust-002, Client.Rust-012), violating a CLAUDE.md hard requirement; hard-coded correlation ids (Client.Rust-011). |
| 3 | Concurrency & thread safety | No issues found — clients are cheaply cloneable, streams are `Send`, drop-cancels-call is verified. |
| 4 | Error handling & resilience | Issues found: empty-vec on shapeless bulk reply (Client.Rust-006); no transient/permanent classification (Client.Rust-010). |
| 5 | Security | No issues found — API keys redacted in `Debug`/`Display`, status messages scrubbed, TLS handled correctly. |
| 6 | Performance & resource management | Issue found: value/array projections clone every element, doubling array memory (Client.Rust-008). |
| 7 | Design-document adherence | Issue found: `RustClientDesign.md` documents a stale crate layout and an unused `tracing` dependency (Client.Rust-007). |
| 8 | Code organization & conventions | Issue found: `BulkReplyKind` trips a clippy lint; undocumented public methods (Client.Rust-001, Client.Rust-002). |
| 9 | Testing coverage | Issue found: TLS setup, mid-stream fault propagation, and the bulk-size cap untested (Client.Rust-009). |
| 10 | Documentation & comments | Issue found: the version-constant doc comment is wrong (Client.Rust-004). |
| 1 | Correctness & logic bugs | Issue found: `read_bulk` is missing the OK-but-shapeless `MalformedReply` symmetry of the other bulk helpers, but the bigger issue is no test exercises any of the new `MalformedReply` paths (Client.Rust-016). |
| 2 | mxaccessgw conventions | Issue found: `cargo clippy --workspace --all-targets -- -D warnings` still fails — a fresh `clippy::doc_lazy_continuation` violation in `ReadBulkCommand`'s generated doc comment trips the lint that the prior fixes did not anticipate (Client.Rust-013). CLI subcommands still emit hard-coded `client_correlation_id` strings on the `raw` paths (Client.Rust-014). |
| 3 | Concurrency & thread safety | No issues found — `CORRELATION_SEQUENCE` is `AtomicU64` with `Relaxed`, which is correct for monotonic id generation; clients remain cheaply cloneable; streams are `Send`. |
| 4 | Error handling & resilience | Issue found: `bench-read-bulk` records every `read_bulk` failure into the latency histogram as if it succeeded, skewing p99/max upward (Client.Rust-015). The new `Error::Unavailable` mapping looks correct. |
| 5 | Security | No issues found — API keys still redacted in `Debug`/`Display`, status messages scrubbed, secret arguments unchanged. |
| 6 | Performance & resource management | No issues found in the changed code — `read_bulk` is honest about the unavoidable owned-Vec materialisation; projection-on-demand is now lazy. |
| 7 | Design-document adherence | Issue found: `RustClientDesign.md` was refreshed but never grew the new bulk-write/read methods, the `Unavailable`/`MalformedReply` error variants, or the `bench-read-bulk` CLI command on its current surface (Client.Rust-017). |
| 8 | Code organization & conventions | No new issues — `BulkWriteReplyKind` follows the renamed `BulkReplyKind` shape. |
| 9 | Testing coverage | Issue found: none of the new code paths (bulk-write helpers, `read_bulk`, `MalformedReply`, `Error::Unavailable`, the `bench-read-bulk` flow) are covered by client-side tests (Client.Rust-016). |
| 10 | Documentation & comments | No new issues beyond Client.Rust-017. |
## Findings
@@ -205,3 +207,106 @@
**Recommendation:** Dereference instead of cloning: `*self.state.last_deploy.lock().unwrap()`.
**Resolution:** Resolved in `0d8a28d` (2026-05-18): replaced `.clone()` with a deref. `cargo clippy --workspace --all-targets -- -D warnings` now passes cleanly.
### Client.Rust-013
| Field | Value |
|---|---|
| Severity | High |
| Category | mxaccessgw conventions |
| Location | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:414-424` (origin); `clients/rust/src/generated.rs:11-31` (suppression site) |
| Status | Resolved |
**Description:** `cargo clippy --workspace --all-targets -- -D warnings` fails again on this commit, this time on a `clippy::doc_lazy_continuation` violation in generated code:
```
error: doc list item without indentation
--> .../mxaccess_gateway.v1.rs:526:5
|
526 | /// `timeout_ms == 0` uses the gateway-configured default (1000 ms).
| ^
```
The lint fires because the `ReadBulkCommand` proto comment (added with the bulk Read feature in commit `5e375f6`) writes a bulleted list and then a trailing paragraph without the required blank line. prost-build forwards the proto comment verbatim into Rust doc comments, and the Rust client compiles those generated modules with crate-default lints. The crate already opts out of `clippy::large_enum_variant` in `src/generated.rs` for exactly this kind of generator-style problem, but `doc_lazy_continuation` is not on the allow-list, so the lint reaches `-D warnings` and breaks the documented `cargo clippy --workspace --all-targets -- -D warnings` invocation that CLAUDE.md mandates pass. The Rust client review was previously closed as clippy-clean (Client.Rust-001/002/012); this is the third clippy-clean regression caused by generated code in this module and warrants a more durable fix.
**Recommendation:** Add `#![allow(clippy::doc_lazy_continuation)]` to each generated submodule in `clients/rust/src/generated.rs` alongside `clippy::large_enum_variant`, so generated doc comments — which the client cannot edit — cannot break the `-D warnings` build. Independently, fix the upstream proto comment to insert a blank line before the trailing paragraph so the C# / Go / Python / Java generators do not carry the same flaky text.
**Resolution:** 2026-05-20 — Added `#![allow(clippy::doc_lazy_continuation)]` to each generated submodule in `clients/rust/src/generated.rs` next to the existing `clippy::large_enum_variant` allow, and reformatted the `ReadBulkCommand` proto comment in `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto` to surround the bulleted list with blank lines so doc-comment generators in every language see a properly-terminated list. `cargo clippy --workspace --all-targets -- -D warnings` and `cargo test --workspace` now pass, and `dotnet build src/MxGateway.Contracts/MxGateway.Contracts.csproj` reports 0 warnings.
### Client.Rust-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | mxaccessgw conventions |
| Location | `clients/rust/crates/mxgw-cli/src/main.rs:450,497` |
| Status | Resolved |
**Description:** Client.Rust-011 made `Session` build unique correlation ids per call, but the `mxgw` CLI's `Ping` and `CloseSession` subcommands still hard-code `client_correlation_id: "rust-cli-ping".to_owned()` and `"rust-cli-close-session".to_owned()`. Both go through `client.invoke(…)` / `client.close_session_raw(…)` rather than the `Session` helpers, so the library's id generator does not run. The CLI is the cross-language e2e driver — when the same machine runs concurrent CLI smokes, every `ping`/`close-session` request collides on the same correlation id in gateway logs, defeating the diagnostic value the library fix unlocked.
**Recommendation:** Either (a) expose `session::next_correlation_id` as a `pub(crate)` or library-level helper and have the CLI call it from `Ping`/`CloseSession`, or (b) replace these RPCs with the higher-level `Session` helpers (`Session::close`, and a thin `Session::ping` wrapper) so the CLI shares the library's correlation-id discipline by construction.
**Resolution:** 2026-05-20: Promoted `session::next_correlation_id` from a module-private helper to a `pub` library-level function (it already lived in the `pub mod session`) and updated the `mxgw` CLI's `Ping` and `CloseSession` subcommands to call `mxgateway_client::session::next_correlation_id("cli-ping")` / `next_correlation_id("cli-close-session")` instead of the hard-coded `"rust-cli-ping"` / `"rust-cli-close-session"` strings. Concurrent CLI smokes now produce unique correlation ids per call, driven by the same process-wide `CORRELATION_SEQUENCE` `AtomicU64` the library uses, so concurrent requests are distinguishable in gateway logs again. `cargo fmt`, `cargo build --workspace`, `cargo clippy --workspace --all-targets -- -D warnings`, and `cargo test --workspace` all pass.
### Client.Rust-015
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `clients/rust/crates/mxgw-cli/src/main.rs:1053-1070` |
| Status | Resolved |
**Description:** The new cross-language benchmark `bench-read-bulk` pushes the elapsed time of every `read_bulk` call into `latencies_ms` regardless of whether the call returned `Ok` or `Err`:
```rust
let outcome = session.read_bulk(server_handle, &tags, timeout_ms).await;
let elapsed_ms = call_start.elapsed().as_secs_f64() * 1000.0;
latencies_ms.push(elapsed_ms);
match outcome {
Ok(results) => { successful_calls += 1; }
Err(_) => failed_calls += 1,
}
```
A failed `read_bulk` (transient `Unavailable`, deadline-exceeded mid-call, etc.) typically returns *later* than a successful one — it includes the full per-call timeout that the success path never waits for. The histogram therefore conflates "p99 cached-read latency" with "p99 of (cached-read + timed-out call)", and the JSON document the PowerShell driver collates publishes `latencyMs.p99` / `latencyMs.max` that no longer represent successful-call latency. Worse, the failure category is silently dropped (`Err(_) => failed_calls += 1`) so a benchmark run that fails on every call still emits a coherent-looking JSON without ever surfacing why. This is misleading for a benchmark whose JSON shape is the cross-language comparison contract.
**Recommendation:** Only push elapsed time into `latencies_ms` on `Ok`, or split into two histograms (`successLatencyMs` and `failureLatencyMs`) and log the first failure's error string into the stats record so a partial-failure run is visible at the report layer.
**Resolution:** 2026-05-20 — Extracted the per-iteration accounting in `bench-read-bulk` into a `BenchReadBulkStats` helper with explicit `record_success`/`record_failure` methods. Successful `read_bulk` calls now flow into `success_latencies_ms` (driving the cross-language `latencyMs.p99`/`max` JSON contract), failures flow into a separate `failure_latencies_ms` histogram surfaced as `failureLatencyMs`, and the first failure's redacted error string is stashed as `firstFailure` so a partial-failure run is visible at the report layer instead of producing a coherent-looking JSON that hides every error. Added a unit test (`bench_read_bulk_stats_keeps_failures_out_of_success_latency_histogram`) that records two fast successes plus a deliberately slow failure and asserts the success histogram never sees the failure latency, plus a smaller smoke test for the zero-duration calls-per-second path.
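The accounting split is language-neutral, so it is sketched here in Python for consistency with the other examples in this sweep. It is not the Rust `BenchReadBulkStats` source: the field and method names follow the resolution text, but the exact signatures are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BenchReadBulkStats:
    """Keeps successful and failed call latencies in separate histograms."""

    success_latencies_ms: List[float] = field(default_factory=list)
    failure_latencies_ms: List[float] = field(default_factory=list)
    first_failure: Optional[str] = None

    def record_success(self, elapsed_ms: float) -> None:
        self.success_latencies_ms.append(elapsed_ms)

    def record_failure(self, elapsed_ms: float, error: str) -> None:
        self.failure_latencies_ms.append(elapsed_ms)
        if self.first_failure is None:
            # Only the first failure's message is kept for the report.
            self.first_failure = error


stats = BenchReadBulkStats()
stats.record_success(1.2)
stats.record_success(1.5)
stats.record_failure(5000.0, "unavailable")  # a slow, timed-out call
print(max(stats.success_latencies_ms))  # 1.5
```

With this split, a slow timed-out call raises only the failure histogram's maximum, and the success-side p99 and max stay representative of cached reads.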
### Client.Rust-016
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `clients/rust/tests/client_behavior.rs`, `clients/rust/src/session.rs:489-519,654-768` |
| Status | Resolved |
**Description:** The fixes for Client.Rust-005 / 006 added five new `Error::MalformedReply` paths to `session.rs` (`register_server_handle`, `add_item_handle`, `add_item2_handle`, `bulk_results`, `bulk_write_results`) plus the inline branch in `read_bulk`. None of them are exercised by tests — every test in `client_behavior.rs` feeds the matching payload back to the client, so the malformed-reply branches are dead code from the test suite's perspective. The new bulk-write helpers (`write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk`) only have a single happy-path assertion via `write_bulk`, leaving the three other variants and every per-entry-failure shape untested. The bench-read-bulk flow has no test (the driver script is the only consumer). The `Error::Unavailable` variant from Client.Rust-010 is covered by `event_stream_surfaces_a_mid_stream_status_fault`, but the same variant on a unary `Code::Unavailable` is not.
**Recommendation:** Add three light tests against the existing `FakeGateway`:
1. Have the fake reply to `AddItem` (or `Register` / `AddItem2`) with `protocol_status = Ok` and no payload, and assert the client surfaces `Error::MalformedReply`.
2. Have the fake reply to `WriteBulk` with `protocol_status = Ok` and the wrong payload arm (e.g. an `AddItemReply` body), and assert `Error::MalformedReply`.
3. Have the fake fail the unary `Invoke` with `Status::unavailable(...)` and assert `Error::Unavailable`.
Optionally add Write2Bulk / WriteSecuredBulk / WriteSecured2Bulk smoke assertions so all four bulk-write families have at least one round-trip test.
**Resolution:** 2026-05-20 — Added eight new integration tests in `clients/rust/tests/client_behavior.rs`. Each new `Error::MalformedReply` site is exercised via a test-only `InvokeOverride` injected into `FakeState` that lets a single test pin the fake gateway's `Invoke` handler to one of three malformed shapes (OK reply with no payload, OK reply with the wrong payload arm for `read_bulk`, OK reply with the wrong payload arm for the other bulk / bulk-write families): `register_returns_malformed_reply_when_ok_reply_has_no_payload`, `add_item_returns_malformed_reply_when_ok_reply_has_no_payload`, `add_item2_returns_malformed_reply_when_ok_reply_has_no_payload`, `subscribe_bulk_returns_malformed_reply_on_mismatched_payload_arm`, `write_bulk_returns_malformed_reply_on_mismatched_payload_arm`, and `read_bulk_returns_malformed_reply_on_mismatched_payload_arm`. The unary `Error::Unavailable` path is covered by `unary_invoke_maps_status_unavailable_to_error_unavailable` (the override returns `Status::unavailable(...)`). The remaining three bulk-write families gained round-trip smoke tests — `write2_bulk_round_trips_through_the_fake_gateway`, `write_secured_bulk_round_trips_through_the_fake_gateway`, `write_secured2_bulk_round_trips_through_the_fake_gateway` — extending the fake gateway's dispatcher with happy-path replies for `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk`. The `bench-read-bulk` flow gets a `BenchReadBulkStats` unit test in `crates/mxgw-cli/src/main.rs` (see Client.Rust-015) that asserts the latency-tracking change keeps failed-call durations out of `latencyMs`.
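The two malformed shapes these tests target (an OK reply with no payload, and an OK reply with the wrong payload arm) can be illustrated with a simplified model. The types below are hypothetical stand-ins, not the generated tonic types or the crate's real `Error` enum; only the `Error::MalformedReply` variant name and the `add_item_handle` helper name come from the finding.

```rust
/// Simplified stand-in for the reply payload oneof (illustrative only).
#[derive(Debug, Clone, PartialEq)]
pub enum Payload {
    AddItem { item_handle: i32 },
    BulkWrite { successes: Vec<bool> },
}

/// Simplified stand-in for a command reply.
#[derive(Debug, Clone, PartialEq)]
pub struct Reply {
    pub protocol_ok: bool,
    pub payload: Option<Payload>,
}

/// Reduced error type; the real crate has more variants.
#[derive(Debug, PartialEq)]
pub enum Error {
    MalformedReply(&'static str),
    Command,
}

/// Extracts the item handle, rejecting OK replies that do not carry
/// the AddItem payload arm.
pub fn add_item_handle(reply: &Reply) -> Result<i32, Error> {
    if !reply.protocol_ok {
        return Err(Error::Command);
    }
    match &reply.payload {
        Some(Payload::AddItem { item_handle }) => Ok(*item_handle),
        Some(_) => Err(Error::MalformedReply("wrong payload arm for AddItem")),
        None => Err(Error::MalformedReply("OK reply carried no payload")),
    }
}
```

A fake gateway that always echoes the matching payload only ever reaches the first `match` arm, which is why the other two arms were untested before the override was introduced.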
### Client.Rust-017
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `clients/rust/RustClientDesign.md:79-99,156-163` |
| Status | Resolved |
**Description:** CLAUDE.md requires docs to change with the source. `RustClientDesign.md` was refreshed to fix the layout/`tracing` drift (Client.Rust-007), but the Session API surface in the design (`Library API` block, lines 79-99) still lists only the original six bulk helpers (`add_item_bulk`, `advise_item_bulk`, `remove_item_bulk`, `un_advise_item_bulk`, `subscribe_bulk`, `unsubscribe_bulk`) and is missing the four new bulk-write helpers and `read_bulk` (`write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk`, `read_bulk`) that landed in commits `5e375f6` / `f220908` / `61644e6`. The `Error Handling` block (lines 130-146) still enumerates `Transport`, `Status`, `Authentication`, `Authorization`, `Session`, `Worker`, `Command`, `MxAccess`, `Timeout`, `Cancelled`, but not `MalformedReply`, `Unavailable`, or `InvalidEndpoint`, all of which are now public variants of the crate's `Error` enum. The `Test CLI` block (lines 158-163) lists `version` / `smoke` / `stream-events` / `write` but is missing every new subcommand (`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, `bench-read-bulk`, `galaxy watch`).
**Recommendation:** Bring the design doc back in sync: extend the `Session` API code block to enumerate the bulk-write/read methods, expand the `Error` enum to match `clients/rust/src/error.rs`, and add the missing CLI subcommands. The README is already up to date, so this is design-doc-only churn.
**Resolution:** _(2026-05-20)_ Brought `clients/rust/RustClientDesign.md` back in sync with the implementation. The `Session` block now lists the five new bulk helpers (`write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk`, `read_bulk`) alongside the original six and notes that `session::next_correlation_id` is `pub` for raw-RPC consumers (the CLI). The `Error` enum block now matches `clients/rust/src/error.rs` (`InvalidEndpoint`, `InvalidArgument`, `Transport`, `Authentication`, `Authorization`, `Timeout`, `Cancelled`, `Unavailable`, `Status`, `Command`, `ProtocolStatus`, `MalformedReply`) and adds a short paragraph explaining what `Unavailable`, `MalformedReply`, and `InvalidEndpoint` classify. The `Test CLI` block enumerates every subcommand the binary exposes today: `version`, `ping`, `open-session`, `close-session`, `register`, `add-item`, `advise`, `subscribe-bulk`, `unsubscribe-bulk`, `read-bulk`, `write`, `write2`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, `stream-events`, `bench-read-bulk`, `smoke`, and the `galaxy {test-connection,last-deploy-time,discover-hierarchy,watch}` subtree.
+88 -11
@@ -4,25 +4,27 @@
|---|---|
| Module | `src/MxGateway.Contracts` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `6c64030` |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
This re-review focuses on the contract delta introduced since the prior review at `6c64030` — primarily the new bulk write/read command family added in `5e375f6` (`WriteBulk`, `Write2Bulk`, `WriteSecuredBulk`, `WriteSecured2Bulk`, `ReadBulk`) plus the resolution changes for Contracts-001/002/004/005/006/007/008.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No functional bugs; one missing reply-payload case for the by-name ack command and an `int32`-typed `success` flag that reads like a bool (Contracts-002, Contracts-006). |
| 2 | mxaccessgw conventions | Additive-only evolution honored (no renumbered/removed tags), MXAccess-aligned naming consistent, generated code untouched; no `reserved` statements declared as a guardrail (Contracts-005). |
| 1 | Correctness & logic bugs | New bulk command kinds, `BulkWriteResult`, and `BulkReadResult` align with the worker executor, validator (`MxAccessGrpcRequestValidator.ExpectedPayload`), and `MxAccessSession.ReadBulk`. Field numbering is contiguous and additive (10-43 on `MxCommand.payload`, 20-40 on `MxCommandReply.payload`); no collisions. No new functional bugs. |
| 2 | mxaccessgw conventions | Additive-only evolution preserved across all three protos; new wire-compatibility policy comment block (added under Contracts-005) is honored by the bulk additions; generated code untouched; naming and oneof usage are consistent with the style guide. No new violations. |
| 3 | Concurrency & thread safety | N/A — pure contract definitions plus a static const class with no shared mutable state. |
| 4 | Error handling & resilience | HRESULT / `MxStatusProxy` / `ProtocolStatus` carriers are complete; the worker-side by-name alarm ack has no dedicated reply payload (Contracts-002). |
| 5 | Security | Credential-sensitive fields are clearly commented; no secrets forced into loggable shapes. No issues found. |
| 6 | Performance & resource management | `DiscoverHierarchy` is paged; alarm-snapshot streams are server-streamed; no bloat issues. No issues found. |
| 7 | Design-document adherence | `.proto` files match design intent but `docs/Grpc.md` is stale (Contracts-001); worker vs public alarm-status shapes unreconciled in docs (Contracts-008). |
| 8 | Code organization & conventions | Package/file layout correct; stale class summary (Contracts-004). Contracts-003 (`mxaccess_worker.proto` Protobuf item missing `ProtoRoot`) was re-triaged as not-a-defect — the attribute is already present. |
| 9 | Testing coverage | Gateway/worker/alarm round-trips covered; Galaxy Repository protos and raw `MxArray` paths untested (Contracts-007). |
| 10 | Documentation & comments | Proto comments accurate and domain-rich; one stale class summary (Contracts-004). |
| 4 | Error handling & resilience | `BulkWriteResult` carries the full `was_successful` + `hresult` + `statuses` + `error_message` carriers per entry; `BulkReadResult` carries `was_successful` + `was_cached` + per-entry value and statuses. The asymmetry (no `hresult` on `BulkReadResult`) is intentional given ReadBulk's lifecycle. No issues. |
| 5 | Security | The new `WriteSecuredBulkCommand` / `WriteSecured2BulkCommand` carry the redaction note on the outer command only, not on the inner entry's `value` field (Contracts-011); otherwise no secrets forced into loggable shapes. |
| 6 | Performance & resource management | `ReadBulk` is the only command without a 1:1 MXAccess analogue; the per-entry timeout shape (`uint32 timeout_ms`) and `was_cached` semantics avoid disturbing existing subscriptions. No bloat issues. |
| 7 | Design-document adherence | `gateway.md` documents the bulk write/read families, but `docs/Contracts.md` was not updated for them (Contracts-009). This violates the CLAUDE.md "update docs in the same commit as the source" rule for the bulk-read/write addition. |
| 8 | Code organization & conventions | Package / namespace / file layout correct; additive-only contract evolution observed; field numbers continuous and isolated by 100+ from diagnostic/control commands. No new issues. |
| 9 | Testing coverage | The bulk write/read families have no `ProtobufContractRoundTripTests` coverage (Contracts-010); Galaxy Repository protos and `MxArray` raw paths are now covered (per Contracts-007 resolution). |
| 10 | Documentation & comments | `GalaxyAttribute.mx_data_type` lacks an in-proto comment explaining it is a raw Galaxy integer (Contracts-012); the `GatewayContractInfoTests` summary is now stale (Contracts-013); credential-sensitive bulk entry `value` fields lack per-field redaction comments (Contracts-011). |
## Findings
@@ -145,3 +147,78 @@
**Recommendation:** Document in `docs/Contracts.md` (or `AlarmClientDiscovery.md`) how the worker `native_status` maps onto the public reply's `status`/`hresult` pair so client authors know which field is authoritative.
**Resolution:** _(2026-05-18)_ Verified against `WorkerAlarmRpcDispatcher.AcknowledgeAsync`. The asymmetry is larger than the finding implies: the dispatcher copies the worker `MxCommandReply.hresult` into `AcknowledgeAlarmReply.hresult` but **never** assigns `AcknowledgeAlarmReply.status` — the `MxStatusProxy status` field is left UNSET on every reply. The proto comment on `status` ("Native MxAccess status describing the outcome of the ack") was therefore actively misleading. Fixed: (1) reworded the `mxaccess_gateway.proto` comments on `AcknowledgeAlarmReply.hresult` (now identifies it as the authoritative native-return-code field) and `AcknowledgeAlarmReply.status` (now states it is reserved/unset and clients must not depend on it); (2) extended `docs/AlarmClientDiscovery.md` section 4 with a "Worker `native_status` → public `AcknowledgeAlarmReply` mapping" subsection spelling out that `hresult` is authoritative (`0` = success) and `status` is always unset, and that clients should branch on `protocol_status` then `hresult`, never `status`.
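The branching order the resolution prescribes (check `protocol_status` first, then `hresult`, never `status`) can be sketched on the client side as below. The types are simplified, hypothetical stand-ins for the generated messages; in particular the plain `i32` type for `hresult` and the `AckOutcome` enum are assumptions made for illustration.

```rust
/// Reduced stand-in for the protocol status code (illustrative subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProtocolStatusCode {
    Ok,
    MxaccessFailure,
    Other,
}

/// Simplified stand-in for the public reply.
#[derive(Debug, Clone)]
pub struct AcknowledgeAlarmReply {
    pub protocol_status: ProtocolStatusCode,
    /// Authoritative native return code; 0 means success.
    pub hresult: i32,
    /// Always unset per the resolution above; never branch on it.
    pub status: Option<String>,
}

/// Hypothetical client-side classification of an ack reply.
#[derive(Debug, PartialEq, Eq)]
pub enum AckOutcome {
    Acknowledged,
    NativeFailure { hresult: i32 },
    ProtocolFailure { code: ProtocolStatusCode, hresult: i32 },
}

/// Branches on `protocol_status`, then `hresult`; `status` is ignored.
pub fn classify_ack(reply: &AcknowledgeAlarmReply) -> AckOutcome {
    if reply.protocol_status != ProtocolStatusCode::Ok {
        return AckOutcome::ProtocolFailure {
            code: reply.protocol_status,
            hresult: reply.hresult,
        };
    }
    if reply.hresult != 0 {
        return AckOutcome::NativeFailure { hresult: reply.hresult };
    }
    AckOutcome::Acknowledged
}
```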
### Contracts-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `docs/Contracts.md:13-24` |
| Status | Resolved |
**Description:** Commit `5e375f6` ("Add bulk read/write command family across worker, gateway, and clients") added five new command kinds — `WriteBulk`, `Write2Bulk`, `WriteSecuredBulk`, `WriteSecured2Bulk`, `ReadBulk` — plus the `BulkWriteReply` / `BulkWriteResult` and `BulkReadReply` / `BulkReadResult` shapes to `mxaccess_gateway.proto`. `gateway.md` (lines 299-322) was updated in that commit, but `docs/Contracts.md` was not. It still describes only the older bulk subscription family (`AddItemBulk`, `AdviseItemBulk`, `RemoveItemBulk`, `UnAdviseItemBulk`, `SubscribeBulk`, `UnsubscribeBulk`) returning `BulkSubscribeReply` with no mention of the bulk write/read commands or their per-entry result types. The CLAUDE.md rule "Update docs in the same change as the source. When public APIs, contracts, configuration, build steps, security behavior, event shapes, value conversion, status mapping, or lifecycle rules change, the affected docs … must change in the same commit" was violated for this addition. The result is that the canonical contracts document undercounts the public bulk surface by five commands.
**Recommendation:** Extend the bulk-commands paragraph in `docs/Contracts.md` to list the new `WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` / `ReadBulk` command kinds, the per-entry request shape (`WriteBulkEntry` etc.), and the new reply types (`BulkWriteReply` carrying `BulkWriteResult`; `BulkReadReply` carrying `BulkReadResult`). Cross-reference `gateway.md` for the cached-vs-snapshot `ReadBulk` lifecycle and `docs/DesignDecisions.md` "Bulk Command Family" for the per-entry-result rationale rather than re-stating those details.
**Resolution:** _(2026-05-20)_ Confirmed `docs/Contracts.md` documented only the older bulk subscription family and never mentioned the bulk write/read additions from commit `5e375f6`. Cross-checked against `mxaccess_gateway.proto` (`MxCommand.payload` cases 39-43, `MxCommandKind` 30-34, the `Write*BulkCommand` / `Write*BulkEntry` shapes, `ReadBulkCommand` with `tag_addresses` + `timeout_ms`, `MxCommandReply.payload` cases 36-40, and the `BulkWriteReply`/`BulkWriteResult` + `BulkReadReply`/`BulkReadResult` messages). Extended the "Files" section of `docs/Contracts.md` with a new paragraph listing the five command kinds, the per-entry request shape for each `Write*Bulk` family (with the credential-sensitive redaction rule carried through to `WriteSecuredBulkEntry`/`WriteSecured2BulkEntry`), the `BulkWriteReply` + `BulkWriteResult` reply (including the `optional int32 hresult` field and the no-raise per-entry failure contract), and the `ReadBulkCommand` → `BulkReadReply` + `BulkReadResult` reply with the cached-vs-snapshot dual-mode semantics and the deliberate absence of `hresult` on `BulkReadResult`. Cross-references to `gateway.md` (lifecycle + scopes) and `docs/DesignDecisions.md` "Bulk Command Family" (rationale) were added rather than re-stating those details.
### Contracts-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/MxGateway.Tests/Contracts/ProtobufContractRoundTripTests.cs` |
| Status | Resolved |
**Description:** Contracts-007 (closed 2026-05-18) added Galaxy Repository, bulk-subscribe, `MxValue.raw_value` / `MxArray.raw_values`, and `WorkerFault`/`WorkerHeartbeat` round-trip coverage. The bulk write/read messages added in commit `5e375f6` were never given equivalent coverage. `ProtobufContractRoundTripTests` has no test that exercises any of: `WriteBulkCommand` / `Write2BulkCommand` / `WriteSecuredBulkCommand` / `WriteSecured2BulkCommand` / `ReadBulkCommand`; `BulkWriteReply` / `BulkWriteResult`; `BulkReadReply` / `BulkReadResult`; the new `MxCommandReply.payload` oneof cases (`write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk`, `read_bulk`). The asymmetry that `BulkWriteResult` carries `hresult` and `BulkReadResult` does not, and the `optional int32 hresult` semantics on `BulkWriteResult`, are exactly the kind of wire-shape details prior contract tests have been written to pin.
**Recommendation:** Add `ProtobufContractRoundTripTests` cases mirroring the existing `BulkSubscribeReply_RoundTripsSubscribeResults` / `MxCommandReply_RoundTripsBulkSubscribePayload` pattern: at minimum one round-trip per new request-side message (`WriteBulkCommand` covers the entry-list case; one secured variant proves the credential-sensitive shape; `ReadBulkCommand` covers `timeout_ms`), one round-trip for each new reply payload (`BulkWriteReply` carrying `BulkWriteResult` with `hresult` set + unset to exercise the proto3 `optional` presence; `BulkReadReply` carrying a `was_cached = true` and a `was_cached = false` entry), and at least one `MxCommandReply` test pinning a new payload-oneof case (e.g. `MxCommandReply.PayloadCase == PayloadOneofCase.ReadBulk` for `MxCommandKind.ReadBulk`).
**Resolution:** _(2026-05-20)_ Added round-trip tests in `ProtobufContractRoundTripTests` covering every gap listed: per-request `WriteBulkCommand_RoundTripsEntries`, `Write2BulkCommand_RoundTripsEntriesWithTimestampValue`, `WriteSecuredBulkCommand_RoundTripsCredentialBearingEntries`, `WriteSecured2BulkCommand_RoundTripsCredentialBearingEntriesWithTimestamp`, `ReadBulkCommand_RoundTripsTagAddressesAndTimeout`; per-reply `BulkWriteReply_RoundTripsResultsWithOptionalHresultPresence` (asserts both `HasHresult == true` and `HasHresult == false` arms of the proto3 `optional int32 hresult`) and `BulkReadReply_RoundTripsCachedAndSnapshotResults` (covers `was_cached = true`, `was_cached = false`, and a per-entry failure with `error_message`; additionally pins the deliberate absence of an `hresult` field on `BulkReadResult` via the descriptor); and `MxCommandReply` oneof-case pinning via `MxCommandReply_RoundTripsBulkWritePayloadCases` (a `[Theory]` exercising the four bulk-write payload-oneof cases) plus `MxCommandReply_RoundTripsReadBulkPayload`. All new tests pass; the full `ProtobufContractRoundTripTests` + `GatewayContractInfoTests` filter is 42 tests green.
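The presence semantics these tests pin also matter to Rust clients, where prost maps a proto3 `optional int32` to `Option<i32>`. The sketch below is a hand-written stand-in rather than the generated type, and the `describe_hresult` helper is hypothetical; it shows why "present and zero" must stay distinguishable from "absent".

```rust
/// Hand-written stand-in for the per-entry bulk-write result.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct BulkWriteResult {
    pub was_successful: bool,
    /// `Some(0)` means the native call reported success;
    /// `None` means no native return code was reported at all.
    pub hresult: Option<i32>,
    pub error_message: String,
}

/// Renders the native return code without collapsing "absent" into 0.
pub fn describe_hresult(result: &BulkWriteResult) -> String {
    match result.hresult {
        Some(code) => format!("hresult=0x{:08X}", code),
        None => "hresult not reported".to_string(),
    }
}
```

A client that reads the field with something like `unwrap_or(0)` would erase exactly the distinction that `BulkWriteReply_RoundTripsResultsWithOptionalHresultPresence` asserts on the C# side.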
### Contracts-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Security |
| Location | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:392-397`, `:406-412` |
| Status | Resolved |
**Description:** The single-item `WriteSecuredCommand` (line 234-242) and `WriteSecured2Command` (line 244-253) put the credential-sensitivity redaction note on the `value` field directly ("Credential-sensitive write value. Implementations must not log this field unless an explicit redacted value-logging path is enabled."). The bulk equivalents move the note to the outer message instead — `WriteSecuredBulkCommand` (line 383-386) and `WriteSecured2BulkCommand` (line 399-400) carry it as a header comment — and the inner `WriteSecuredBulkEntry.value` (line 396) and `WriteSecured2BulkEntry.value` (line 410) are left without per-field comments. A future editor reading just `WriteSecuredBulkEntry` to add a new field or change the entry shape will not see the redaction rule. The ProtobufStyleGuide explicitly requires "Mark credential-bearing request fields clearly in comments"; the single-item path follows that rule, the bulk path does not.
**Recommendation:** Add per-field credential-sensitivity comments to `WriteSecuredBulkEntry.value` and `WriteSecured2BulkEntry.value` matching the wording on `WriteSecuredCommand.value` / `WriteSecured2Command.value`. Comment-only change with no wire-format or generated-type impact.
**Resolution:** _(2026-05-20)_ Added per-field credential-sensitivity comments to `WriteSecuredBulkEntry.value` and `WriteSecured2BulkEntry.value` in `mxaccess_gateway.proto`, mirroring verbatim the wording carried on `WriteSecuredCommand.value` / `WriteSecured2Command.value` ("Credential-sensitive write value. Implementations must not log this field unless an explicit redacted value-logging path is enabled."). The outer-message header redaction comment on `WriteSecuredBulkCommand` / `WriteSecured2BulkCommand` is retained so the rule is visible at both scopes. Comment-only change; no wire-format or generated-type impact (the `MxGateway.Contracts` build is clean against the regenerated code).
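The fix itself is comment-only, but the rule the comments state ("must not log this field") is one a consumer can also enforce in code. As a rough sketch, and not something this change introduced, a Rust wrapper can redact the value in its `Debug` output. The struct below is a hypothetical stand-in; its field names and types are assumptions, not the generated `WriteSecuredBulkEntry`.

```rust
use std::fmt;

/// Hypothetical stand-in for a credential-bearing bulk entry.
pub struct SecuredEntry {
    pub item_handle: i32,
    /// Credential-sensitive write value; must not be logged.
    pub value: String,
}

impl fmt::Debug for SecuredEntry {
    /// Prints the handle but replaces the value with a fixed marker,
    /// so `{:?}` in a log statement cannot leak the credential.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("SecuredEntry")
            .field("item_handle", &self.item_handle)
            .field("value", &"<redacted>")
            .finish()
    }
}
```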
### Contracts-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Contracts/Protos/galaxy_repository.proto:120` |
| Status | Resolved |
**Description:** `GalaxyAttribute.mx_data_type` is declared as `int32` with no in-proto comment. The field carries the raw Galaxy SQL DB type identifier (from `dbo.data_type`), which deliberately does NOT correspond to the public `MxDataType` enum in `mxaccess_gateway.proto`; `docs/Contracts.md` calls this out ("The service is metadata-only and does not share types with mxaccess_gateway.proto") and `docs/GalaxyRepository.md:190` documents the choice ("`mx_data_type` is returned as the raw Galaxy integer rather than mapped to a language-neutral enum"), but the proto file itself gives the reader no signal. A client author looking at the .proto without those docs is likely to assume the field is an `MxDataType` value and write a `(MxDataType)` cast that silently misclassifies most attributes. The ProtobufStyleGuide rule "Comment fields that carry MXAccess parity details, raw HRESULT/status information, or compatibility constraints" applies: this is exactly a parity-detail / compatibility-constraint field where the int32 has non-obvious semantics. The accompanying `data_type_name`, `mx_attribute_category`, and `security_classification` fields share the same gap.
**Recommendation:** Add a short comment on `GalaxyAttribute.mx_data_type` (and ideally on `mx_attribute_category` and `security_classification`) clarifying that the value is a raw Galaxy SQL identifier passed through unchanged, NOT a member of the `mxaccess_gateway.v1.MxDataType` enum, with a pointer to `docs/GalaxyRepository.md`. Comment-only change; no wire-format impact.
**Resolution:** _(2026-05-20)_ Added in-proto comments to `GalaxyAttribute.mx_data_type`, `data_type_name`, `mx_attribute_category`, and `security_classification` in `galaxy_repository.proto`. The `mx_data_type` comment explicitly calls out that the value is a raw Galaxy SQL `dbo.data_type` identifier passed through unchanged, that it is NOT a member of `mxaccess_gateway.v1.MxDataType`, and that the two enumerations must not be cast or compared (closing the silent-misclassification trap the finding describes). The `data_type_name` comment clarifies it is free-form Galaxy text from the same table, not a stable enum. `mx_attribute_category` and `security_classification` comments mark them as raw Galaxy-specific identifiers not mapped to any gateway enum. All four comments cross-reference `docs/GalaxyRepository.md` for the rationale rather than restating it. Comment-only change; no wire-format impact.
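On the client side, the same trap can be closed at the type level. The sketch below assumes a hypothetical wrapper layer over the generated types: a newtype holds the raw Galaxy integer and deliberately offers no conversion to the gateway enum. `RawGalaxyDataType`, `AttributeInfo`, and `describe_attribute` are illustrative names, and the sample values in any usage are made up.

```rust
/// Raw Galaxy SQL `dbo.data_type` identifier, passed through unchanged.
/// Deliberately has no `From`/`Into` conversion to the gateway's
/// `MxDataType` enum, so an accidental cast does not compile.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct RawGalaxyDataType(pub i32);

/// Hypothetical client-side view of a Galaxy attribute.
#[derive(Debug, Clone)]
pub struct AttributeInfo {
    pub mx_data_type: RawGalaxyDataType,
    /// Free-form Galaxy text from the same table, not a stable enum.
    pub data_type_name: String,
}

/// Renders the type for display using the Galaxy-provided name,
/// keeping the raw identifier visible for diagnostics.
pub fn describe_attribute(attr: &AttributeInfo) -> String {
    format!(
        "{} (raw Galaxy type {})",
        attr.data_type_name, attr.mx_data_type.0
    )
}
```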
### Contracts-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Tests/Contracts/GatewayContractInfoTests.cs:14` |
| Status | Resolved |
**Description:** The XML summary on `GatewayContractInfoTests.GatewayProtocolVersion_IsVersionThree` reads "Verifies that the gateway protocol version is bumped to three after the alarm proto extension." That description is now incomplete: since the comment was written, the contract has been extended again (the bulk write/read command family in commit `5e375f6`) without a corresponding `GatewayProtocolVersion` bump. The test name says "IsVersionThree" but the summary attributes the value-of-3 to a single historical event (the alarm extension) — readers checking whether subsequent contract additions should have bumped the version will get a misleading rationale. This is the same class of stale-summary issue as Contracts-004 (`GatewayContractInfo` class summary), just relocated to the test that pins the constant.
**Recommendation:** Reword the summary to describe what the test pins (the current `GatewayProtocolVersion` constant equals 3) rather than narrating a specific historical bump, OR explicitly enumerate the alarm- and bulk-write/read additions covered under version 3 so readers know both extensions were additive and intentionally did not require a bump.
**Resolution:** _(2026-05-20)_ Reworded the XML summary on `GatewayContractInfoTests.GatewayProtocolVersion_IsVersionThree` to describe what the test actually pins: the current `GatewayProtocolVersion` constant equals 3, with both the alarm proto extension (`AcknowledgeAlarm` / `QueryActiveAlarms` RPCs, `OnAlarmTransitionEvent`, the alarm command/reply payload cases) AND the bulk write/read command family extension (`WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` / `ReadBulk` with their `BulkWriteReply` / `BulkReadReply` payloads) shipping under version 3 as strictly additive changes that did not require a further bump. The new summary also instructs that a future breaking contract change should bump the constant and update the test in lock-step. Test logic is unchanged; the test still passes.
+112 -2
@@ -4,13 +4,33 @@
|---|---|
| Module | `src/MxGateway.IntegrationTests` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `6c64030` |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
### 2026-05-20 review (commit `1cd51bb`)
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issue found: IntegrationTests-012 (Write test starts a `StreamEvents` task and never observes it — silent event-stream coverage gap and an unobserved fault path). |
| 2 | mxaccessgw conventions | Live opt-ins, `[Collection]` serialization, and the "don't synthesize events" rule are honored. No issues found. |
| 3 | Concurrency & thread safety | `LiveResourcesCollection` serializes all three live classes; `RecordingServerStreamWriter` locks correctly and the semaphore wait is linked to both timeout and external cancellation. No issues found. |
| 4 | Error handling & resilience | `ShutDownAsync` already isolates cleanup exceptions per category. No issues found. |
| 5 | Security | The only embedded strings are documented dev GLAuth creds and a localhost ZB connection string, all env-overridable. The wrong-password and unreachable-server tests assert no password leakage. No issues found. |
| 6 | Performance & resource management | Issue found: IntegrationTests-013 (`RecordingServerStreamWriter.messageArrived` `SemaphoreSlim` is never disposed; the type owns an `IDisposable` field but is not itself disposable). |
| 7 | Design-document adherence | No issues found. `docs/GatewayTesting.md` now documents the Live LDAP, Live Galaxy, and Write/invalid-handle MXAccess opt-ins added by the prior round of resolutions. |
| 8 | Code organization & conventions | Issues found: IntegrationTests-015 (`[Trait("Category", ...)]` repeated on every test method instead of declared once at class level); IntegrationTests-016 (the Galaxy default connection string is duplicated between `LiveGalaxyRepositoryFactAttribute` and `GalaxyRepositoryOptions`). |
| 9 | Testing coverage | Issue found: IntegrationTests-014 (`Unadvise`, `RemoveItem`, `Unregister`, `WriteSecured` ordering, and worker-fault parity still uncovered — IntegrationTests-005's resolution scoped these out). |
| 10 | Documentation & comments | Issue found: IntegrationTests-011 (the invalid-handle and write test comments describe a non-`Ok` MXAccess failure as `ProtocolStatusCode.Ok`, contradicting both the assertion and `HResultConverter`). |
### 2026-05-18 review (commit `6c64030`)
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: IntegrationTests-003 (asserts only on first event), IntegrationTests-010 (`WaitForMessageAsync` ignores cancellation). |
@@ -177,3 +197,93 @@
**Re-triage:** The named method `WaitForFirstMessageAsync` no longer exists — IntegrationTests-003's resolution renamed/replaced it with `RecordingServerStreamWriter.WaitForMessageAsync(predicate, timeout)`, which scans recorded messages and blocks on a `SemaphoreSlim`. The underlying defect still held: that replacement method also took only a `timeout` and never observed a `CancellationToken`. The finding remains valid (Low, Correctness) against the renamed method; the recommendation's `firstMessage.Task.WaitAsync` detail is stale but the intent (thread a token, surface a count on timeout) is unchanged.
**Resolution:** Resolved 2026-05-18: Added an optional `CancellationToken` parameter to `WaitForMessageAsync`, linked with the existing timeout source via `CancellationTokenSource.CreateLinkedTokenSource`, so a per-test cancellation aborts the wait promptly. `GatewaySession_WithLiveWorker_RegistersAdvisesStreamsDataAndCloses` now creates a `CancellationTokenSource`, passes its token into the `StreamEvents` `TestServerCallContext` and into `WaitForMessageAsync`, so the stream call and the wait share one cancellation source. On timeout the method already throws a `TimeoutException` whose message includes the scanned message count, satisfying the "emit recorded count" intent (the count surfaces in the test failure rather than via a separate `output.WriteLine`). Verified by build; live tests not executed.
### IntegrationTests-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:236-240`, `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:183-187` |
| Status | Resolved |
**Description:** The XML/inline comments on the two new MXAccess parity tests misdescribe how the gateway surfaces an MXAccess failure. The invalid-handle test reads "the gateway protocol status is Ok and the failure shows up in hresult / the status proxies — it must not be reported as a transport fault", then asserts `Assert.NotEqual(ProtocolStatusCode.Ok, addItemReply.ProtocolStatus.Code)`. `HResultConverter.CreateProtocolStatus` (`src/MxGateway.Worker/Conversion/HResultConverter.cs:39`) actually sets `Code = ProtocolStatusCode.MxaccessFailure` whenever the COM call throws (HRESULT ≠ 0), so the assertion is correct but the comment is wrong — the protocol status is *not* `Ok` on an MXAccess failure. The write-round-trip test carries the same misleading framing on lines 183-187 ("MXAccess parity details … belong in hresult / statuses, not in a transport failure") immediately before asserting `Ok`. A reader can reasonably conclude the gateway always reports `Ok` for round-tripped commands and tweak code accordingly. The intended distinction is "this is not a gRPC transport fault" (the RPC reply still arrives) — the protocol status code carries the MXAccess outcome.
**Recommendation:** Reword the invalid-handle comment to "the gateway must reply with `ProtocolStatusCode.MxaccessFailure` and a non-zero `Hresult` carrying the COM failure, not a gRPC transport fault." Reword the write-round-trip comment to clarify it is asserting the happy-path Ok and that an MXAccess rejection would surface as `MxaccessFailure` (per `HResultConverter`), not as a `RpcException`.
**Resolution:** 2026-05-20 — Reworded the invalid-handle test comment to say the gateway must reply with `ProtocolStatusCode.MxaccessFailure` and a non-zero hresult carrying the COM failure (per `HResultConverter`), and reworded the write-round-trip comment to make explicit it is asserting the happy-path Ok while an MXAccess rejection would surface as `MxaccessFailure`, never as an `RpcException`.
### IntegrationTests-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:147-151` |
| Status | Resolved |
**Description:** `GatewaySession_WithLiveWorker_WritesValueToAdvisedItem` constructs a `RecordingServerStreamWriter<MxEvent>` and starts a `StreamEvents` task, then never reads from it and never asserts anything about the recorded messages. The test verifies only that the `Write` command round-trips at the protocol level — it does not verify that the worker actually emits any event after the write (for example an `OnWriteComplete`, which is the proof of round-trip used by the cross-language client e2e runner). Because the stream task is started with `new TestServerCallContext()` (no cancellation source), any fault raised by the stream task (an exception from `EventStreamService`, a session-not-found, a backpressure overflow) is swallowed — `streamTask` is later awaited in `ShutDownAsync` only inside a broad `catch (Exception ex)`, which logs and continues. The Write test therefore cannot fail on stream-task faults. Two consequences: (a) the live Write parity coverage promised in IntegrationTests-005 is weaker than it appears, and (b) the fixture (`eventWriter`) is dead code in this test that suggests an assertion was intended.
**Recommendation:** Either remove the unused `eventWriter`/`StreamEvents` plumbing from the Write test so the test scope matches its assertions, or — preferred — extend the test to wait for an `OnWriteComplete` event for the written item via `eventWriter.WaitForMessageAsync(candidate => candidate.Family == MxEventFamily.OnWriteComplete && candidate.ItemHandle == itemHandle, ...)`, matching the round-trip proof used by `scripts/run-client-e2e-tests.ps1 -VerifyWrite`.
**Resolution:** Resolved 2026-05-20: Rewrote `GatewaySession_WithLiveWorker_WritesValueToAdvisedItem` so the previously-dead `eventWriter`/`StreamEvents` plumbing actually drives an assertion. The test now waits for an `OnWriteComplete` event matching the Write's (server, item) handle pair via `eventWriter.WaitForMessageAsync` (using `IntegrationTestEnvironment.LiveMxAccessEventTimeout`), and asserts the recorded event's family, session id, and handles — the same round-trip proof the cross-language client e2e runner uses. The stream call is now bound to a `CancellationTokenSource` and the test asserts `streamTask.IsFaulted == false` before cleanup. `ShutDownAsync` gained an opt-in `propagateStreamFaults` flag so a faulted `StreamEvents` task is rethrown into the test rather than silently swallowed by the broad cleanup catch; the cancellation token is also signalled before the drain so `StreamEvents` observes a clean shutdown instead of a forced timeout. Verified by build and by confirming the test skips cleanly when `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS` is unset.
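The opt-in fault propagation described in this resolution could look roughly like the sketch below. It is an illustration under assumptions, not the project's `ShutDownAsync`: `StreamCleanupSketch` and `DrainAsync` are hypothetical names, and the real helper's signature is not shown in this finding.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the test cleanup helper (not the real ShutDownAsync).
public static class StreamCleanupSketch
{
    public static async Task DrainAsync(
        Task streamTask,
        CancellationTokenSource streamCancellation,
        bool propagateStreamFaults)
    {
        // Signal cancellation before draining so the stream loop can end cleanly
        // instead of being abandoned after a timeout.
        streamCancellation.Cancel();

        try
        {
            await streamTask.ConfigureAwait(false);
        }
        catch (OperationCanceledException)
        {
            // Expected: the stream ended because the token was signalled.
        }
        catch (Exception) when (!propagateStreamFaults)
        {
            // Legacy behaviour: swallow the fault so the rest of cleanup still runs.
        }
        // When propagateStreamFaults is true the filter above does not match,
        // so a real fault escapes this method and fails the calling test.
    }
}
```

The exception filter keeps the default behaviour unchanged for existing callers while letting a single test opt in to strict handling.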
### IntegrationTests-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:519-609` |
| Status | Resolved |
**Description:** `RecordingServerStreamWriter<T>` owns a `SemaphoreSlim messageArrived` (`IDisposable`) but does not itself implement `IDisposable`, so the semaphore's wait handle is never released back to the OS. Each live test allocates one such writer and discards it at scope exit. Live tests are opt-in, so the cumulative leak is bounded, but the type holds an `IDisposable` field, and the standard hygiene under `Directory.Build.props`'s `TreatWarningsAsErrors=true` is to either dispose the field or document why not. CA2213 does not fire because the owner is not itself `IDisposable`; the absence of an analyzer warning is the only reason this is not a build break, not an indication that the leak is acceptable.
**Recommendation:** Make `RecordingServerStreamWriter<T>` implement `IDisposable`, dispose `messageArrived` in `Dispose`, and wrap each instantiation in a `using` block (`using RecordingServerStreamWriter<MxEvent> eventWriter = new();`).
**Resolution:** 2026-05-20 — `RecordingServerStreamWriter<T>` now implements `IDisposable` and its `Dispose` releases the `messageArrived` semaphore. All six live tests in `WorkerLiveMxAccessSmokeTests` now allocate the writer with a top-of-method `using` declaration so the semaphore's wait handle is released on scope exit even when the test body throws.
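As a rough illustration of the ownership pattern this resolution describes, the sketch below shows a recording writer that owns a `SemaphoreSlim` and releases it in `Dispose`. `RecordingWriterSketch<T>` is a simplified, hypothetical stand-in: it does not implement the gRPC stream-writer interface, and its `WaitForMessageAsync` signature is assumed rather than copied from the test project.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Simplified stand-in for RecordingServerStreamWriter<T>; names and signatures are assumptions.
public sealed class RecordingWriterSketch<T> : IDisposable
{
    private readonly object gate = new object();
    private readonly List<T> messages = new List<T>();
    private readonly SemaphoreSlim messageArrived = new SemaphoreSlim(0);

    public Task WriteAsync(T message)
    {
        lock (gate)
        {
            messages.Add(message);
        }

        // Wake any waiter so it rescans the recorded messages.
        messageArrived.Release();
        return Task.CompletedTask;
    }

    public async Task<T> WaitForMessageAsync(Func<T, bool> predicate, TimeSpan timeout)
    {
        using var timeoutSource = new CancellationTokenSource(timeout);
        int scanned = 0;
        while (true)
        {
            lock (gate)
            {
                // Only examine messages not yet checked by this waiter.
                for (; scanned < messages.Count; scanned++)
                {
                    if (predicate(messages[scanned]))
                    {
                        return messages[scanned];
                    }
                }
            }

            // Throws OperationCanceledException once the timeout elapses.
            await messageArrived.WaitAsync(timeoutSource.Token).ConfigureAwait(false);
        }
    }

    // The owner disposes the field it created, which is what CA2213 would ask for.
    public void Dispose() => messageArrived.Dispose();
}
```

A test would then allocate it with a `using` declaration, for example `using RecordingWriterSketch<string> writer = new();`, so disposal happens on scope exit even when the test body throws.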
### IntegrationTests-014
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` |
| Status | Resolved |
**Description:** IntegrationTests-005 was resolved by adding live coverage for `Write` and an invalid-handle `AddItem`, but its resolution explicitly scoped out the worker-fault/abnormal-exit case and silently dropped `Unadvise`, `RemoveItem`, `Unregister`, `OperationComplete`, and `WriteSecured` ordering. CLAUDE.md singles out `WriteSecured` ("`WriteSecured` failing before a value-bearing NMX body") and `OperationComplete` semantics as parity surprises the gateway must not "fix" — exactly the paths fake-worker tests cannot validate. After this commit the live MXAccess smoke still doesn't exercise any teardown command, the secured-write ordering rule, or a deliberately faulted worker. A regression in any of these would only be caught by manual testing.
**Recommendation:** Add live MXAccess coverage for the teardown chain (`Unadvise` then `RemoveItem` then `Unregister`, asserting each replies with `ProtocolStatusCode.Ok` and the next operation no longer references the freed handle), and at minimum one `WriteSecured` parity case asserting the documented ordering. A worker-fault test can be deferred to a separate finding once a deterministic COM-crash injection harness exists.
**Resolution:** Resolved 2026-05-20: Added three new `[LiveMxAccessFact]`-gated tests to `WorkerLiveMxAccessSmokeTests`, all reusing the existing opt-in env var and `ShutDownAsync` cleanup helper. (1) `GatewaySession_WithLiveWorker_UnadviseRemoveItemUnregister_TeardownOrderingParity` runs Register → AddItem → Advise → wait for one OnDataChange → Unadvise → RemoveItem → Unregister, asserting each step replies `Ok` with the matching `MxCommandKind`, that no further OnDataChange events for the un-advised (server, item) pair arrive after a 500 ms settle window, and that a second RemoveItem against the freed handle returns a non-`Ok` MXAccess failure (so a regression that left a stale subscription or accepted a stale handle would surface). (2) `GatewaySession_WithLiveWorker_WriteSecured_AuthenticatedRoundTripParity` resolves an ArchestrA user id via `AuthenticateUser` (credentials env-overridable through `MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_USER` / `..._PASSWORD`, defaulting to the `admin`/`admin123` GLAuth user from `glauth.md`), issues `WriteSecured` against an advised item, and asserts the reply carries `MxCommandKind.WriteSecured`, the protocol status is one of the documented parity outcomes (`Ok` for an unprotected provider, `MxaccessFailure` when the item is not WriteSecured-eligible; never a transport fault), and the credential never leaks into the diagnostic message. (3) `GatewaySession_WithLiveWorker_AbnormalWorkerExit_MarksSessionFaulted` opens a session, kills the worker process tree (via a new `TestWorkerProcessFactory.KillAllAndDetach` helper) without going through CloseSession, and polls the session via a new `GatewayServiceFixture.TryGetSession` accessor until it transitions to `SessionState.Faulted` within the live event timeout; it asserts that the final state is `Faulted`, that `FinalFault` is non-empty, and that the fault description carries a known worker-client classification (pipe disconnected / worker faulted / heartbeat expired / end-of-stream).
`docs/GatewayTesting.md` was updated to list all five parity surfaces and the two new env-var defaults. Verified by build and confirmed all six live tests skip cleanly when `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS` is unset.
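The "poll until the session transitions to `Faulted`" step in test (3) can be sketched as a small generic helper. This is an assumption-laden illustration: `PollSketch.UntilAsync` is a hypothetical name, and the real test's polling code and the `TryGetSession` accessor's signature are not shown in this finding.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical polling helper; the real test's implementation may differ.
public static class PollSketch
{
    // Repeatedly reads a value until isDone accepts it or the timeout elapses.
    // Returns the last observed value so the caller can assert on it either way.
    public static async Task<T> UntilAsync<T>(
        Func<T> read,
        Func<T, bool> isDone,
        TimeSpan timeout,
        TimeSpan interval)
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (true)
        {
            T current = read();
            if (isDone(current) || clock.Elapsed >= timeout)
            {
                return current;
            }

            await Task.Delay(interval).ConfigureAwait(false);
        }
    }
}
```

Returning the last observed value instead of throwing on timeout lets the test's own assertion report the state the session was actually stuck in.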
### IntegrationTests-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:30,119,201`, `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs:13,32,48,67,84`, `src/MxGateway.IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs:10,22,34,52` |
| Status | Resolved |
**Description:** Every live-test method in the three live classes carries an identical `[Trait("Category", "LiveMxAccess")]` (or `LiveLdap` / `LiveGalaxy`) attribute. The trait is uniform within each class and is exactly the information the `[Collection(LiveResourcesCollection.Name)]` class-level attribute also implies. xUnit's `[Trait]` is inheritable from the class to its methods, so the same metadata can be declared once at class scope. The current shape adds maintenance burden: adding a new test to any of these classes requires remembering to add the trait, and `DashboardLdapLiveTests` alone carries five copies of the same `LiveLdap` line.
**Recommendation:** Move each `[Trait("Category", ...)]` to the class declaration alongside the existing `[Collection(...)]`, and remove the per-method copies. Verify the trait still surfaces in `dotnet test --filter "Category=LiveLdap"` after the change.
**Resolution:** 2026-05-20: Lifted `[Trait("Category", "LiveMxAccess")]`, `[Trait("Category", "LiveLdap")]`, and `[Trait("Category", "LiveGalaxy")]` to the class declarations of `WorkerLiveMxAccessSmokeTests`, `DashboardLdapLiveTests`, and `GalaxyRepositoryLiveTests` respectively (alongside the existing `[Collection(LiveResourcesCollection.Name)]`), and removed all per-method duplicates. xUnit propagates class-level traits to every method, so per-category filters such as `--filter "Category=LiveLdap"` still match.
### IntegrationTests-016
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.IntegrationTests/Galaxy/LiveGalaxyRepositoryFactAttribute.cs:26`, `src/MxGateway.Server/Galaxy/GalaxyRepositoryOptions.cs:13` |
| Status | Resolved |
**Description:** The default Galaxy Repository connection string `"Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;"` is duplicated verbatim between the production `GalaxyRepositoryOptions.ConnectionString` initializer and the test-side `LiveGalaxyRepositoryFactAttribute.ConnectionString` fallback. The docs (`docs/GatewayTesting.md`) document the value once and reference it from both places. If the production default changes (e.g. tightening to a named instance, or switching to a SQL-auth template), the test default silently keeps the old string and the live Galaxy tests connect to the wrong server. The drift is invisible to the build.
**Recommendation:** Expose the production default through a `public const string` on `GalaxyRepositoryOptions` (e.g. `DefaultConnectionString`) and have `LiveGalaxyRepositoryFactAttribute.ConnectionString` read `Environment.GetEnvironmentVariable(ConnectionStringVariableName) ?? GalaxyRepositoryOptions.DefaultConnectionString`. This gives a single source of truth and a build-time guarantee that the two values cannot drift.
**Resolution:** 2026-05-20 — Added `public const string GalaxyRepositoryOptions.DefaultConnectionString` carrying the production default, set the `ConnectionString` initializer to reference it, and changed `LiveGalaxyRepositoryFactAttribute.ConnectionString` to fall back to `GalaxyRepositoryOptions.DefaultConnectionString`. The literal now lives in exactly one place and any future change to the production default propagates to the live-test fallback at compile time.
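A minimal sketch of the single-source-of-truth arrangement, using hypothetical stand-in types (`GalaxyOptionsSketch`, `LiveGalaxySettingsSketch`) rather than the real classes. The connection string literal is the one quoted in the Description; the environment variable name is invented for illustration, since the finding names the constant but not its value.

```csharp
using System;

// Stand-in for the production options class.
public sealed class GalaxyOptionsSketch
{
    // The literal lives here and nowhere else.
    public const string DefaultConnectionString =
        "Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;";

    public string ConnectionString { get; set; } = DefaultConnectionString;
}

// Stand-in for the test-side fallback logic.
public static class LiveGalaxySettingsSketch
{
    // Hypothetical variable name, for illustration only.
    public const string ConnectionStringVariableName = "EXAMPLE_GALAXY_CONNECTION_STRING";

    // An explicit override wins; otherwise fall back to the production constant.
    public static string ConnectionString =>
        Environment.GetEnvironmentVariable(ConnectionStringVariableName)
        ?? GalaxyOptionsSketch.DefaultConnectionString;
}
```

Because the test side references the constant by name, renaming or removing it on the production side breaks the build instead of silently leaving a stale literal behind.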
+83 -11
@@ -10,17 +10,17 @@ Each module's `findings.md` is the source of truth; this file is generated from
| Module | Reviewer | Date | Commit | Status | Open | Total |
|---|---|---|---|---|---|---|
| [Client.Dotnet](Client.Dotnet/findings.md) | Claude Code | 2026-05-18 | `3cc53a8` | Reviewed | 0 | 8 |
| [Client.Go](Client.Go/findings.md) | Claude Code | 2026-05-18 | `3cc53a8` | Reviewed | 0 | 10 |
| [Client.Java](Client.Java/findings.md) | Claude Code | 2026-05-18 | `3cc53a8` | Reviewed | 0 | 12 |
| [Client.Python](Client.Python/findings.md) | Claude Code | 2026-05-18 | `3cc53a8` | Reviewed | 0 | 12 |
| [Client.Rust](Client.Rust/findings.md) | Claude Code | 2026-05-18 | `3cc53a8` | Reviewed | 0 | 12 |
| [Contracts](Contracts/findings.md) | Claude Code | 2026-05-18 | `6c64030` | Reviewed | 0 | 8 |
| [IntegrationTests](IntegrationTests/findings.md) | Claude Code | 2026-05-18 | `6c64030` | Reviewed | 0 | 10 |
| [Server](Server/findings.md) | Claude Code | 2026-05-18 | `6c64030` | Reviewed | 0 | 14 |
| [Tests](Tests/findings.md) | Claude Code | 2026-05-18 | `6c64030` | Reviewed | 0 | 12 |
| [Worker](Worker/findings.md) | Claude Code | 2026-05-18 | `6c64030` | Reviewed | 0 | 15 |
| [Worker.Tests](Worker.Tests/findings.md) | Claude Code | 2026-05-18 | `6c64030` | Reviewed | 0 | 15 |
| [Client.Dotnet](Client.Dotnet/findings.md) | Claude Code | 2026-05-20 | `1cd51bb` | Reviewed | 0 | 14 |
| [Client.Go](Client.Go/findings.md) | Claude Code | 2026-05-20 | `1cd51bb` | Reviewed | 0 | 16 |
| [Client.Java](Client.Java/findings.md) | Claude Code | 2026-05-20 | `1cd51bb` | Reviewed | 0 | 20 |
| [Client.Python](Client.Python/findings.md) | Claude Code | 2026-05-20 | `1cd51bb` | Reviewed | 0 | 17 |
| [Client.Rust](Client.Rust/findings.md) | Claude Code | 2026-05-20 | `1cd51bb` | Reviewed | 0 | 17 |
| [Contracts](Contracts/findings.md) | Claude Code | 2026-05-20 | `1cd51bb` | Reviewed | 0 | 13 |
| [IntegrationTests](IntegrationTests/findings.md) | Claude Code | 2026-05-20 | `1cd51bb` | Reviewed | 0 | 16 |
| [Server](Server/findings.md) | Claude Code | 2026-05-20 | `1cd51bb` | Reviewed | 0 | 22 |
| [Tests](Tests/findings.md) | Claude Code | 2026-05-20 | `1cd51bb` | Reviewed | 0 | 19 |
| [Worker](Worker/findings.md) | Claude Code | 2026-05-20 | `1cd51bb` | Reviewed | 0 | 22 |
| [Worker.Tests](Worker.Tests/findings.md) | Claude Code | 2026-05-20 | `1cd51bb` | Reviewed | 0 | 24 |
## Pending findings
@@ -36,13 +36,16 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
|---|---|---|---|---|
| Server-001 | Critical | Resolved | Security | `src/MxGateway.Server/GatewayApplication.cs:147-149`, `src/MxGateway.Server/Dashboard/DashboardEndpointRouteBuilderExtensions.cs:55-58`, `src/MxGateway.Server/Dashboard/Components/Routes.razor:1-15` |
| Client.Go-001 | High | Resolved | Correctness & logic bugs | `clients/go/mxgateway/errors.go:88-93`, `clients/go/mxgateway/errors.go:117-128` |
| Client.Java-013 | High | Resolved | Testing coverage | `clients/java/mxgateway-cli/src/test/java/com/dohertylan/mxgateway/cli/MxGatewayCliTests.java:212-304`, `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:1214-1244` |
| Client.Rust-001 | High | Resolved | mxaccessgw conventions | `clients/rust/src/options.rs:98,143` |
| Client.Rust-002 | High | Resolved | mxaccessgw conventions | `clients/rust/src/session.rs:522` |
| Client.Rust-003 | High | Resolved | Correctness & logic bugs | `clients/rust/crates/mxgw-cli/src/main.rs:1051` |
| Client.Rust-012 | High | Resolved | mxaccessgw conventions | `clients/rust/src/galaxy.rs:282` |
| Client.Rust-013 | High | Resolved | mxaccessgw conventions | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:414-424` (origin); `clients/rust/src/generated.rs:11-31` (suppression site) |
| IntegrationTests-001 | High | Resolved | Design-document adherence | `src/MxGateway.IntegrationTests/Galaxy/LiveGalaxyRepositoryFactAttribute.cs:7`, `src/MxGateway.IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs` |
| IntegrationTests-002 | High | Resolved | Design-document adherence | `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs:13`, `src/MxGateway.Server/Configuration/LdapOptions.cs:27` |
| Server-003 | High | Resolved | Security | `src/MxGateway.Server/Dashboard/DashboardAuthorizationHandler.cs:39,54-59`, `src/MxGateway.Server/Dashboard/DashboardAuthenticator.cs:236-258` |
| Server-017 | High | Resolved | Security | `src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:13-27`, `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:173-247`, `docs/Authorization.md:108-110` |
| Tests-001 | High | Resolved | Testing coverage | `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:483-489` |
| Tests-002 | High | Resolved | Security | `src/MxGateway.Tests/Gateway/Grpc/GalaxyRepositoryGrpcServiceTests.cs:198-210` |
| Worker-001 | High | Resolved | Concurrency & thread safety | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:204-207` |
@@ -60,39 +63,63 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Client.Java-003 | Medium | Resolved | mxaccessgw conventions | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:119-140` |
| Client.Java-004 | Medium | Resolved | Correctness & logic bugs | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySession.java:114-120,157-163,191-197` |
| Client.Java-005 | Medium | Resolved | Error handling & resilience | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySession.java:92-105` |
| Client.Java-014 | Medium | Resolved | Concurrency & thread safety | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:59-65,117-124` |
| Client.Java-015 | Medium | Resolved | Concurrency & thread safety | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayChannels.java:112-138`, `MxGatewayClient.java:183-191,224-232,322-329`, `GalaxyRepositoryClient.java:164-170,212-214` |
| Client.Python-003 | Medium | Resolved | Error handling & resilience | `clients/python/src/mxgateway/client.py:125-137,155-173` |
| Client.Python-005 | Medium | Resolved | Performance & resource management | `clients/python/src/mxgateway/galaxy.py:117-140` |
| Client.Python-009 | Medium | Resolved | Testing coverage | `clients/python/tests/` |
| Client.Python-013 | Medium | Resolved | Security | `clients/python/src/mxgateway_cli/commands.py:757-762` |
| Client.Rust-005 | Medium | Resolved | Correctness & logic bugs | `clients/rust/src/session.rs:489-520` |
| Client.Rust-006 | Medium | Resolved | Error handling & resilience | `clients/rust/src/session.rs:531-555` |
| Client.Rust-015 | Medium | Resolved | Error handling & resilience | `clients/rust/crates/mxgw-cli/src/main.rs:1053-1070` |
| Client.Rust-016 | Medium | Resolved | Testing coverage | `clients/rust/tests/client_behavior.rs`, `clients/rust/src/session.rs:489-519,654-768` |
| Contracts-002 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:384-385`, `:95` |
| Contracts-009 | Medium | Resolved | Design-document adherence | `docs/Contracts.md:13-24` |
| IntegrationTests-003 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:89-97` |
| IntegrationTests-004 | Medium | Resolved | Error handling & resilience | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:108-111` |
| IntegrationTests-005 | Medium | Resolved | Testing coverage | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` |
| IntegrationTests-006 | Medium | Resolved | Testing coverage | `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs` |
| IntegrationTests-012 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:147-151` |
| IntegrationTests-014 | Medium | Resolved | Testing coverage | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` |
| Server-002 | Medium | Resolved | Design-document adherence | `src/MxGateway.Server/Program.cs:24`, `src/MxGateway.Server/GatewayApplication.cs` |
| Server-004 | Medium | Resolved | Code organization & conventions | `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs:227-233`, `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCliRunner.cs:53-77`, `src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:21-67` |
| Server-005 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs:22-28`, `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:184` |
| Server-006 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.Server/Sessions/SessionManager.cs:84-114` |
| Server-015 | Medium | Resolved | Concurrency & thread safety | `src/MxGateway.Server/Sessions/GatewaySession.cs:8-15,266-308,720-775` |
| Server-016 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Server/Sessions/GatewaySession.cs:790-797`, `src/MxGateway.Server/Sessions/SessionManager.cs:237-258` |
| Server-021 | Medium | Resolved | Testing coverage | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:266-664`, `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs` |
| Tests-003 | Medium | Resolved | Performance & resource management | `src/MxGateway.Tests/Security/Authentication/SqliteAuthStoreTests.cs:170-176`, `src/MxGateway.Tests/Security/Authentication/ApiKeyAdminCliRunnerTests.cs:252-258` |
| Tests-004 | Medium | Resolved | Testing coverage | `src/MxGateway.Tests/Security/Authorization/GatewayGrpcAuthorizationInterceptorTests.cs` |
| Tests-005 | Medium | Resolved | Testing coverage | `src/MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs:239-261`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs` |
| Tests-006 | Medium | Resolved | Concurrency & thread safety | `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:76`, `src/MxGateway.Tests/Gateway/Workers/FakeWorkerHarnessTests.cs:122` |
| Tests-013 | Medium | Resolved | Testing coverage | `src/MxGateway.Server/Sessions/GatewaySession.cs:449-679`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs` |
| Tests-016 | Medium | Resolved | Testing coverage | `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs:29-41,115-124` |
| Worker-004 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:565-588` |
| Worker-005 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-258` (production alarm poll loop) |
| Worker-006 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:117-124`, `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:386-491` |
| Worker-007 | Medium | Resolved | mxaccessgw conventions | `src/MxGateway.Worker/MxAccess/MxAccessComServer.cs:130-150` |
| Worker-008 | Medium | Resolved | Concurrency & thread safety | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-249`, `:429-447` |
| Worker-016 | Medium | Resolved | Concurrency & thread safety | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:261-265` |
| Worker-017 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Worker/Sta/StaRuntime.cs:280-288`, `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:602-631` |
| Worker.Tests-003 | Medium | Resolved | Concurrency & thread safety | `src/MxGateway.Worker.Tests/Sta/StaRuntimeTests.cs:46-48` |
| Worker.Tests-004 | Medium | Resolved | Concurrency & thread safety | `src/MxGateway.Worker.Tests/MxAccess/MxAccessStaSessionTests.cs:281-329` |
| Worker.Tests-005 | Medium | Resolved | Performance & resource management | `src/MxGateway.Worker.Tests/Ipc/WorkerFrameProtocolTests.cs:20-31,103-105`, `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs:28-31` |
| Worker.Tests-006 | Medium | Resolved | Performance & resource management | `src/MxGateway.Worker.Tests/MxAccess/MxAccessStaSessionTests.cs:282,305,315,323` |
| Worker.Tests-007 | Medium | Resolved | Design-document adherence | `docs/WorkerFrameProtocol.md:38-49` |
| Worker.Tests-016 | Medium | Resolved | Code organization & conventions | `src/MxGateway.Worker.Tests/MxAccess/AlarmCommandExecutorTests.cs:317-393` |
| Worker.Tests-017 | Medium | Resolved | Testing coverage | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs` |
| Worker.Tests-018 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.Worker.Tests/MxAccess/MxAccessLiveComCreationTests.cs:18-31, 35-73, 75-145, 148-220, 222-342` |
| Client.Dotnet-004 | Low | Resolved | Error handling & resilience | `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:283-294`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:392-403` |
| Client.Dotnet-005 | Low | Resolved | Correctness & logic bugs | `clients/dotnet/MxGateway.Client/MxGatewaySession.cs:82,124,175` |
| Client.Dotnet-006 | Low | Resolved | Code organization & conventions | `clients/dotnet/MxGateway.Client/MxGatewayClientOptions.cs:50`, `clients/dotnet/MxGateway.Client/MxGatewayClientContractInfo.cs:10-14` |
| Client.Dotnet-007 | Low | Resolved | Documentation & comments | `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:185-192` |
| Client.Dotnet-008 | Low | Resolved | Correctness & logic bugs | `clients/dotnet/MxGateway.Client.Cli/MxGatewayCliSecretRedactor.cs:9-17` |
| Client.Dotnet-009 | Low | Resolved | Concurrency & thread safety | `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:26,339-348,445-448` |
| Client.Dotnet-010 | Low | Resolved | Correctness & logic bugs | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:638,896,1261,1279` |
| Client.Dotnet-011 | Low | Resolved | Concurrency & thread safety | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:857-858,922-963,1014-1015` |
| Client.Dotnet-012 | Low | Resolved | Code organization & conventions | `clients/dotnet/MxGateway.Client/MxGateway.Client.csproj`, `clients/dotnet/MxGateway.Client.Cli/MxGateway.Client.Cli.csproj`, `clients/dotnet/MxGateway.Client.Tests/MxGateway.Client.Tests.csproj` |
| Client.Dotnet-013 | Low | Resolved | Code organization & conventions | `clients/dotnet/MxGateway.Client/DiscoverHierarchyOptions.cs:3-24`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:185-187`, `clients/dotnet/MxGateway.Client.Cli/IMxGatewayCliClient.cs:6` |
| Client.Dotnet-014 | Low | Resolved | Testing coverage | `clients/dotnet/MxGateway.Client.Tests/MxGatewayClientAlarmsTests.cs:76-98`, `clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:212-231` |
| Client.Go-004 | Low | Resolved | mxaccessgw conventions | `clients/go/mxgateway/alarms_test.go:153-154`, `clients/go/mxgateway/galaxy_test.go:58-59` |
| Client.Go-005 | Low | Resolved | Design-document adherence | `clients/go/mxgateway/client.go:64,68`, `clients/go/mxgateway/galaxy.go:83,87` |
| Client.Go-006 | Low | Resolved | Error handling & resilience | `clients/go/mxgateway/errors.go:9-130` |
@@ -100,6 +127,12 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Client.Go-008 | Low | Resolved | Testing coverage | `clients/go/mxgateway/` (test files) |
| Client.Go-009 | Low | Resolved | Code organization & conventions | `clients/go/mxgateway/galaxy.go:60-93,241-256`, `clients/go/mxgateway/client.go:41-74,190-205` |
| Client.Go-010 | Low | Resolved | Documentation & comments | `clients/go/mxgateway/client.go:39-40` |
| Client.Go-011 | Low | Resolved | Correctness & logic bugs | `clients/go/mxgateway/alarms_test.go:66-73` |
| Client.Go-012 | Low | Resolved | Documentation & comments | `clients/go/cmd/mxgw-go/main.go:1063-1065`, `clients/go/cmd/mxgw-go/main.go:88-104` |
| Client.Go-013 | Low | Resolved | Concurrency & thread safety | `clients/go/cmd/mxgw-go/main.go:1246-1249`, `clients/go/cmd/mxgw-go/main.go:1257-1262` |
| Client.Go-014 | Low | Resolved | Error handling & resilience | `clients/go/mxgateway/session.go:602`, `clients/go/mxgateway/galaxy.go:189` |
| Client.Go-015 | Low | Resolved | Code organization & conventions | `clients/go/cmd/mxgw-go/main.go:410-512` |
| Client.Go-016 | Low | Resolved | Testing coverage | `clients/go/mxgateway/galaxy_test.go:382-429` |
| Client.Java-006 | Low | Resolved | Performance & resource management | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:323-328`, `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/GalaxyRepositoryClient.java:279-284` |
| Client.Java-007 | Low | Resolved | Testing coverage | `clients/java/mxgateway-client/src/test/java/com/dohertylan/mxgateway/client/` |
| Client.Java-008 | Low | Resolved | Error handling & resilience | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:298-304` |
@@ -107,6 +140,11 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Client.Java-010 | Low | Resolved | Documentation & comments | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:269-272`, `clients/java/README.md:76` |
| Client.Java-011 | Low | Resolved | Performance & resource management | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:37-63` |
| Client.Java-012 | Low | Resolved | Correctness & logic bugs | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:667-674` |
| Client.Java-016 | Low | Resolved | Code organization & conventions | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:361-391`, `GalaxyRepositoryClient.java:285-315` |
| Client.Java-017 | Low | Resolved | Documentation & comments | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:25-36`, `clients/java/README.md:99-107` |
| Client.Java-018 | Low | Resolved | Security | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySecrets.java:54-66` |
| Client.Java-019 | Low | Resolved | Performance & resource management | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:362-391`, `GalaxyRepositoryClient.java:286-315` |
| Client.Java-020 | Low | Resolved | Correctness & logic bugs | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:244-254`, `galaxy_repository.proto:94` |
| Client.Python-001 | Low | Resolved | Documentation & comments | `clients/python/pyproject.toml:8,25`, `clients/python/src/mxgateway_cli/commands.py:25` |
| Client.Python-002 | Low | Resolved | Code organization & conventions | `clients/python/src/mxgateway/__init__.py:27` |
| Client.Python-004 | Low | Resolved | Correctness & logic bugs | `clients/python/src/mxgateway_cli/commands.py:386,402-404` |
@@ -116,12 +154,18 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Client.Python-010 | Low | Resolved | Code organization & conventions | `clients/python/src/mxgateway/session.py:404`, `clients/python/src/mxgateway_cli/commands.py:422-425` |
| Client.Python-011 | Low | Resolved | Error handling & resilience | `clients/python/src/mxgateway/errors.py:122-148` |
| Client.Python-012 | Low | Won't Fix | mxaccessgw conventions | `clients/python/src/mxgateway/client.py:84-108`, `clients/python/src/mxgateway/session.py:57-77` |
| Client.Python-014 | Low | Resolved | Code organization & conventions | `clients/python/src/mxgateway_cli/commands.py:22-23` |
| Client.Python-015 | Low | Resolved | Testing coverage | `clients/python/src/mxgateway_cli/commands.py:273-294,564-647`, `clients/python/tests/` |
| Client.Python-016 | Low | Resolved | Testing coverage | `clients/python/src/mxgateway_cli/commands.py:25,757-775,805-830` |
| Client.Python-017 | Low | Resolved | Documentation & comments | `clients/python/pyproject.toml:5-25`, `clients/python/src/mxgateway/` |
| Client.Rust-004 | Low | Resolved | Documentation & comments | `clients/rust/src/version.rs:7` |
| Client.Rust-007 | Low | Resolved | Design-document adherence | `clients/rust/RustClientDesign.md:14-55` |
| Client.Rust-008 | Low | Resolved | Performance & resource management | `clients/rust/src/value.rs:161-261` |
| Client.Rust-009 | Low | Resolved | Testing coverage | `clients/rust/tests/client_behavior.rs`, `clients/rust/src/galaxy.rs` |
| Client.Rust-010 | Low | Resolved | Error handling & resilience | `clients/rust/src/client.rs:255-268`, `clients/rust/src/galaxy.rs:204-216` |
| Client.Rust-011 | Low | Resolved | mxaccessgw conventions | `clients/rust/src/session.rs:469` |
| Client.Rust-014 | Low | Resolved | mxaccessgw conventions | `clients/rust/crates/mxgw-cli/src/main.rs:450,497` |
| Client.Rust-017 | Low | Resolved | Design-document adherence | `clients/rust/RustClientDesign.md:79-99,156-163` |
| Contracts-001 | Low | Resolved | Design-document adherence | `docs/Grpc.md:13` (and `:3`, `:32`, `:39`) |
| Contracts-003 | Low | Won't Fix | Code organization & conventions | `src/MxGateway.Contracts/MxGateway.Contracts.csproj:10` |
| Contracts-004 | Low | Resolved | Documentation & comments | `src/MxGateway.Contracts/GatewayContractInfo.cs:3-6` |
@@ -129,10 +173,18 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Contracts-006 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:647` |
| Contracts-007 | Low | Resolved | Testing coverage | `src/MxGateway.Tests/Contracts/ProtobufContractRoundTripTests.cs` |
| Contracts-008 | Low | Resolved | Design-document adherence | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:451-459`, `:627-636` |
| Contracts-010 | Low | Resolved | Testing coverage | `src/MxGateway.Tests/Contracts/ProtobufContractRoundTripTests.cs` |
| Contracts-011 | Low | Resolved | Security | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:392-397`, `:406-412` |
| Contracts-012 | Low | Resolved | Documentation & comments | `src/MxGateway.Contracts/Protos/galaxy_repository.proto:120` |
| Contracts-013 | Low | Resolved | Documentation & comments | `src/MxGateway.Tests/Contracts/GatewayContractInfoTests.cs:14` |
| IntegrationTests-007 | Low | Resolved | Concurrency & thread safety | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:20`, `src/MxGateway.IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs:5`, `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs:9` |
| IntegrationTests-008 | Low | Resolved | Code organization & conventions | `src/MxGateway.IntegrationTests/LiveLdapFactAttribute.cs`, `src/MxGateway.IntegrationTests/Galaxy/LiveGalaxyRepositoryFactAttribute.cs`, `src/MxGateway.IntegrationTests/LiveMxAccessFactAttribute.cs` |
| IntegrationTests-009 | Low | Resolved | Documentation & comments | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:372-375` |
| IntegrationTests-010 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:366-369` |
| IntegrationTests-011 | Low | Resolved | Documentation & comments | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:236-240`, `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:183-187` |
| IntegrationTests-013 | Low | Resolved | Performance & resource management | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:519-609` |
| IntegrationTests-015 | Low | Resolved | Code organization & conventions | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:30,119,201`, `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs:13,32,48,67,84`, `src/MxGateway.IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs:10,22,34,52` |
| IntegrationTests-016 | Low | Resolved | Code organization & conventions | `src/MxGateway.IntegrationTests/Galaxy/LiveGalaxyRepositoryFactAttribute.cs:26`, `src/MxGateway.Server/Galaxy/GalaxyRepositoryOptions.cs:13` |
| Server-007 | Low | Resolved | Performance & resource management | `src/MxGateway.Server/Galaxy/GalaxyHierarchyProjector.cs:55-70` |
| Server-008 | Low | Resolved | Performance & resource management | `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:111-134,160-189` |
| Server-009 | Low | Resolved | Error handling & resilience | `src/MxGateway.Server/Security/Authentication/AuthSqliteConnectionFactory.cs:15-32` |
@@ -141,12 +193,21 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Server-012 | Low | Resolved | Documentation & comments | `CLAUDE.md` (Authentication section and `apikey create` example) |
| Server-013 | Low | Resolved | Testing coverage | `src/MxGateway.Tests/Gateway/Dashboard/DashboardAuthorizationHandlerTests.cs`, `src/MxGateway.Tests/Gateway/GatewayApplicationTests.cs` |
| Server-014 | Low | Resolved | Documentation & comments | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:162-171,191-198,206-214,229-237` |
| Server-018 | Low | Resolved | Performance & resource management | `src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:15` |
| Server-019 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:183-221` |
| Server-020 | Low | Resolved | Code organization & conventions | `src/MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor:1-2`, `…/GalaxyPage.razor:1-2`, `…/ApiKeysPage.razor:1-2`, `…/EventsPage.razor:1-2`, `…/SessionsPage.razor:1-2`, `…/WorkersPage.razor:1-2`, `…/SettingsPage.razor:1-2`, `…/SessionDetailsPage.razor:1-2` |
| Server-022 | Low | Resolved | Documentation & comments | `src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs:8-29` |
| Tests-007 | Low | Resolved | Code organization & conventions | `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:682`, `src/MxGateway.Tests/Gateway/Grpc/GalaxyRepositoryGrpcServiceTests.cs:324`, `src/MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:460`, `src/MxGateway.Tests/Security/Authorization/GatewayGrpcAuthorizationInterceptorTests.cs:233` |
| Tests-008 | Low | Resolved | mxaccessgw conventions | `src/MxGateway.Tests/Gateway/Sessions/WorkerAlarmRpcDispatcherTests.cs:1-9`, `src/MxGateway.Tests/Gateway/Sessions/NotWiredAlarmRpcDispatcherTests.cs:1-3`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerAlarmAutoSubscribeTests.cs:1` |
| Tests-009 | Low | Resolved | Documentation & comments | `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:36-37,99,365` |
| Tests-010 | Low | Resolved | Security | `src/MxGateway.Tests/Gateway/Dashboard/DashboardAuthorizationHandlerTests.cs:26-36` |
| Tests-011 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:233-301` |
| Tests-012 | Low | Resolved | Concurrency & thread safety | `src/MxGateway.Tests/Gateway/Workers/Fakes/FakeWorkerHarness.cs:62`, `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:472` |
| Tests-014 | Low | Resolved | Performance & resource management | `src/MxGateway.Tests/Gateway/GatewayApplicationTests.cs:18,33,44,62,81,105`, `src/MxGateway.Tests/Gateway/Dashboard/DashboardCookieOptionsTests.cs:17` |
| Tests-015 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:374-379,87` |
| Tests-017 | Low | Resolved | Concurrency & thread safety | `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:346-364` |
| Tests-018 | Low | Resolved | Code organization & conventions | `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs:32`, `src/MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotServiceTests.cs:45,51,57,105,134,163,167,202-209,284,317,523`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:40` |
| Tests-019 | Low | Resolved | Documentation & comments | `docs/GatewayTesting.md`, `code-reviews/Tests/findings.md` (Tests-002 re-triage) |
| Worker-009 | Low | Resolved | Performance & resource management | `src/MxGateway.Worker/Ipc/WorkerFrameReader.cs:31,49`, `src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:57-58` |
| Worker-010 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Conversion/VariantConverter.cs:204-226` |
| Worker-011 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeClient.cs:169-171` |
@@ -154,6 +215,11 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Worker-013 | Low | Resolved | Testing coverage | `src/MxGateway.Worker/Sta/StaMessagePump.cs` |
| Worker-014 | Low | Resolved | Code organization & conventions | `src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:33`, `:202` |
| Worker-015 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/MxAccess/MxAccessEventQueue.cs:115-145` |
| Worker-018 | Low | Resolved | Error handling & resilience | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:160-161` |
| Worker-019 | Low | Resolved | Code organization & conventions | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:59`, `:188` |
| Worker-020 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:405`, `:423` |
| Worker-021 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:111-118`, `:790-805`, `:136-139` |
| Worker-022 | Low | Resolved | Code organization & conventions | `src/MxGateway.Worker/MxAccess/MxAlarmSnapshot.cs:12`, `:26`, `:49` |
| Worker.Tests-008 | Low | Resolved | Documentation & comments | `src/MxGateway.Worker.Tests/Conversion/VariantConverterTests.cs:175-182` |
| Worker.Tests-009 | Low | Resolved | Code organization & conventions | `src/MxGateway.Worker.Tests/MxAccess/AlarmCommandHandlerTests.cs`, `AlarmDispatcherTests.cs`, `AlarmCommandExecutorTests.cs`, `AlarmRecordTransitionMapperTests.cs`, `WnWrapAlarmConsumerXmlTests.cs` |
| Worker.Tests-010 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker.Tests/MxAccess/MxAccessStaSessionTests.cs:230-258` |
@@ -162,3 +228,9 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Worker.Tests-013 | Low | Resolved | Concurrency & thread safety | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs:539-546` |
| Worker.Tests-014 | Low | Resolved | Code organization & conventions | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeClientTests.cs:194`, `WorkerPipeSessionTests.cs:622`, `Sta/StaCommandDispatcherTests.cs:348`, `MxAccess/MxAccessStaSessionTests.cs:334`, `MxAccess/MxAccessCommandExecutorTests.cs:1124` |
| Worker.Tests-015 | Low | Resolved | Testing coverage | `src/MxGateway.Worker.Tests/MxAccess/MxAccessEventQueueTests.cs` |
| Worker.Tests-019 | Low | Resolved | mxaccessgw conventions | `src/MxGateway.Worker.Tests/AlarmsLiveSmokeTests.cs:45`, `src/MxGateway.Worker.Tests/AlarmClientWmProbeTests.cs:143`, `src/MxGateway.Worker.Tests/WnWrapConsumerProbeTests.cs:55` |
| Worker.Tests-020 | Low | Resolved | Concurrency & thread safety | `src/MxGateway.Worker.Tests/MxAccess/MxAccessValueCacheTests.cs:88-108` |
| Worker.Tests-021 | Low | Resolved | Error handling & resilience | `src/MxGateway.Worker.Tests/Ipc/WorkerFrameProtocolTests.cs` |
| Worker.Tests-022 | Low | Resolved | Testing coverage | `src/MxGateway.Worker.Tests/MxAccess/WnWrapAlarmConsumerXmlTests.cs` |
| Worker.Tests-023 | Low | Resolved | Documentation & comments | `src/MxGateway.Worker.Tests/AlarmClientWmProbeTests.cs` (779 lines), `src/MxGateway.Worker.Tests/WnWrapConsumerProbeTests.cs` (287 lines), `src/MxGateway.Worker.Tests/AlarmsLiveSmokeTests.cs` (270 lines) |
| Worker.Tests-024 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker.Tests/MxAccess/AlarmCommandHandlerTests.cs:42-54` |
+136 -12
@@ -4,25 +4,29 @@
|---|---|
| Module | `src/MxGateway.Server` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `6c64030` |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
This table summarizes the 2026-05-20 review pass at commit `1cd51bb`. Findings from
prior passes (Server-001 through Server-014) are all closed and remain below as
audit history.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Server-006 (metrics open-session leak on alarm auto-subscribe failure), Server-010 (rotate reactivates revoked keys). |
| 2 | mxaccessgw conventions | Issues found: Server-002 (orphan-worker termination on startup not implemented), Server-011 (style deviation in `WorkerAlarmRpcDispatcher`). |
| 3 | Concurrency & thread safety | No issues found — locking is correct; inconsistent-but-safe discipline in `GatewayMetrics` noted only. |
| 4 | Error handling & resilience | Issues found: Server-005 (Galaxy first-load can fault the host BackgroundService), Server-009 (SQLite has no busy-timeout/WAL under concurrent writes). |
| 5 | Security | Issues found: Server-001 (Critical: dashboard authorization never enforced on any route), Server-003 (LDAP dashboard users denied for lack of a scope claim), Server-010. |
| 6 | Performance & resource management | Issues found: Server-007 (DiscoverHierarchy paging is O(total) per page), Server-008 (WatchDeployEvents re-projects whole hierarchy per event). |
| 7 | Design-document adherence | Issues found: Server-002 (orphan workers), Server-012 (CLAUDE.md scope names stale vs code/docs). |
| 8 | Code organization & conventions | Issues found: Server-011 (style), Server-004 (CLI accepts unvalidated scope strings). |
| 9 | Testing coverage | Issues found: Server-013 (no dashboard route-level authorization test; `WorkerExecutableValidator`, `GalaxyGlobMatcher`, projector paging untested). |
| 10 | Documentation & comments | Issues found: Server-014 (stale "not yet wired" alarm comments), Server-012. |
| 1 | Correctness & logic bugs | Issues found: Server-019 (`WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` yields silently when session is missing). |
| 2 | mxaccessgw conventions | No issues found — convention drift previously called out is resolved; no new gaps observed. |
| 3 | Concurrency & thread safety | Issues found: Server-015 (`GatewaySession._state` is written under `_closeLock` but read/written elsewhere under `_syncRoot`). |
| 4 | Error handling & resilience | Issues found: Server-016 (`GatewaySession.DisposeAsync` disposes the close-lock semaphore while it may be held). |
| 5 | Security | Issues found: Server-017 (`AcknowledgeAlarm` / `QueryActiveAlarms` fall through to admin-only scope because the resolver was not updated for the new alarm RPCs). |
| 6 | Performance & resource management | Issues found: Server-018 (`GalaxyGlobMatcher` regex cache is unbounded — currently low-risk but uncapped). |
| 7 | Design-document adherence | No issues found at this pass. |
| 8 | Code organization & conventions | Issues found: Server-020 (dashboard pages each declare two `@page` directives — `@page "/X"` AND `@page "/dashboard/X"` — producing duplicate routes under the `/dashboard` group prefix). |
| 9 | Testing coverage | Issues found: Server-021 (`MxAccessGatewayService.ApplyConstraintsAsync` and the new `BulkConstraintPlan` / `ReadBulkConstraintPlan` / `WriteBulkConstraintPlan` / `SubscribeBulkConstraintPlan` merge logic is entirely untested). |
| 10 | Documentation & comments | Issues found: Server-022 (`IAlarmRpcDispatcher` XML doc still describes the dispatcher as "ships a not-yet-wired default"; stale after Server-014). |
## Findings
@@ -235,3 +239,123 @@
**Recommendation:** Update the `AcknowledgeAlarm`/`QueryActiveAlarms` remarks to reflect that `WorkerAlarmRpcDispatcher` is the wired default, and describe its actual GUID-vs-`Provider!Group.Tag` handling.
**Resolution:** Resolved 2026-05-18. Confirmed against source: `SessionServiceCollectionExtensions` registers `WorkerAlarmRpcDispatcher` as `IAlarmRpcDispatcher`, so the "not yet wired" / "empty stream until PR A.2" / "PR A.6/A.7 follow-up" prose in the `AcknowledgeAlarm` and `QueryActiveAlarms` `<remarks>` and inline comments was stale. Rewrote both `<remarks>` blocks and both inline comments to state that DI binds the production `WorkerAlarmRpcDispatcher`, that it routes over the worker pipe IPC, and that `AcknowledgeAlarm` handles a canonical-GUID reference (→ `AcknowledgeAlarmCommand`) and a `Provider!Group.Tag` reference (→ `AcknowledgeAlarmByNameCommand`), with `NotWiredAlarmRpcDispatcher` being only the null fallback. The matching stale `WorkerAlarmRpcDispatcher` class-level XML doc was corrected as part of Server-011. Pure documentation/comment change; no test.
### Server-015
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:8-15,266-308,720-775` |
| Status | Resolved |
**Description:** `GatewaySession` guards its mutable state with two different sync primitives. `TransitionTo`, `MarkFaulted`, `TouchClientActivity`, the `State`/`LastClientActivityAt`/`LeaseExpiresAt`/`FinalFault`/`ActiveEventSubscriberCount` getters, `AttachWorkerClient`, and `IsLeaseExpired` all read/write `_state`, `_finalFault`, `_lastClientActivityAt`, `_leaseExpiresAt`, `_workerClient`, and `_activeEventSubscriberCount` under `_syncRoot`. `CloseAsync` (lines 720-775), however, reads `_state` at line 729 and writes `_state` at lines 736 (`SessionState.Closing`) and 761 (`SessionState.Closed`) while holding only the `_closeLock` `SemaphoreSlim`; `_syncRoot` is never acquired. A concurrent `TransitionTo` or `MarkFaulted` on another thread therefore accesses `_state` under a different lock from the one `CloseAsync` holds, and the `State` getter is not guaranteed to observe the `Closing`/`Closed` writes promptly. `SemaphoreSlim.WaitAsync`/`Release` do happen to provide memory barriers in practice, but the locking discipline is split across two primitives, which is fragile and defeats the audit value of "all `_state` access is guarded by `_syncRoot`". Concretely, the race between `CloseAsync` setting `_state = Closing` and a concurrent `TransitionTo(Ready)` is unordered, and `TransitionTo` will happily overwrite `Closing` back to `Ready` because its only guard is "do not overwrite `Closed`/`Faulted`".
**Recommendation:** Make `CloseAsync` mutate `_state` through the existing `TransitionTo(...)` helper (or acquire `_syncRoot` around the reads/writes) so all `_state` access uses the same lock. Either extend `TransitionTo` to accept the `Closing` and `Closed` transitions (it already handles `Faulted`/`Closed` precedence) or refactor `CloseAsync` to call a private `TrySetClosing()` / `MarkClosed()` that locks `_syncRoot`. Add a regression test that forces a `TransitionTo(Ready)` after `CloseAsync` has set `Closing` and asserts the session does not flip back to `Ready`.
**Resolution:** 2026-05-20 — Unified the close path on `_syncRoot`. `GatewaySession.CloseAsync` (`src/MxGateway.Server/Sessions/GatewaySession.cs`) now mutates `_state` only through two private `_syncRoot`-locked helpers — `TryBeginClose` (writes `Closing`, returns the prior `_closeStarted`) and `MarkClosed` (writes `Closed`) — so every `_state` read/write in the session uses the same lock; `_closeLock` keeps its role of serializing concurrent close attempts. `TransitionTo` was tightened to refuse a transition out of `Closing` to anything other than `Closed`/`Faulted` so a late lifecycle callback cannot walk a closing session back to `Ready`. `docs/Sessions.md` updated to describe the unified lock discipline and the extended terminal precedence. Regression tests in `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs`: `TransitionTo_AfterCloseStarted_DoesNotOverwriteClosing` (the named scenario — `BlockingShutdownWorkerClient` parks the close inside `worker.ShutdownAsync` so the test can call `TransitionTo(Ready)` between the `Closing` and `Closed` writes and assert the state stays `Closing`) and `MarkFaulted_AfterCloseCompletes_DoesNotResurrectSession`.
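The single-lock discipline described in this resolution can be sketched roughly as below. This is an illustrative model only, not the code in `GatewaySession.cs`: `SessionStateMachine` is a hypothetical stand-in class, the `SessionState` members other than `Ready`/`Closing`/`Closed`/`Faulted` are assumed, and the choice to leave an already-terminal state untouched in `TryBeginClose` is an assumption of the sketch.

```csharp
using System;

// Assumed enum shape; only Ready/Closing/Closed/Faulted are named in the finding.
public enum SessionState { Opening, Ready, Closing, Closed, Faulted }

// Hypothetical stand-in for GatewaySession that models only the state rules.
public sealed class SessionStateMachine
{
    private readonly object _syncRoot = new();
    private SessionState _state = SessionState.Opening;
    private bool _closeStarted;

    // Every read of _state goes through the same lock as every write.
    public SessionState State
    {
        get { lock (_syncRoot) { return _state; } }
    }

    // Returns true when the transition was applied.
    public bool TransitionTo(SessionState next)
    {
        lock (_syncRoot)
        {
            // Terminal states are never overwritten.
            if (_state is SessionState.Closed or SessionState.Faulted)
            {
                return false;
            }

            // Once closing, only Closed or Faulted may follow.
            if (_state == SessionState.Closing
                && next is not (SessionState.Closed or SessionState.Faulted))
            {
                return false;
            }

            _state = next;
            return true;
        }
    }

    // Writes Closing and returns the prior "close started" flag.
    public bool TryBeginClose()
    {
        lock (_syncRoot)
        {
            bool alreadyStarted = _closeStarted;
            _closeStarted = true;

            // Assumption of this sketch: a terminal state is left untouched.
            if (_state is not (SessionState.Closed or SessionState.Faulted))
            {
                _state = SessionState.Closing;
            }

            return alreadyStarted;
        }
    }

    public void MarkClosed()
    {
        lock (_syncRoot) { _state = SessionState.Closed; }
    }
}
```

In this model a late `TransitionTo(SessionState.Ready)` issued between `TryBeginClose()` and `MarkClosed()` returns `false` and leaves the state at `Closing`, which is the scenario the named regression test pins.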
### Server-016
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:790-797`, `src/MxGateway.Server/Sessions/SessionManager.cs:237-258` |
| Status | Resolved |
**Description:** `GatewaySession.DisposeAsync` synchronously calls `_closeLock.Dispose()` (line 792) without first acquiring the lock and without checking whether a `CloseAsync` is still in flight. The normal call path is `SessionManager.CloseSessionCoreAsync` → `session.CloseAsync(...)` → `RemoveSessionAsync` → `DisposeAsync`, where `DisposeAsync` runs strictly after `CloseAsync` completes. But the `ShutdownAsync` path (`SessionManager.cs:237-258`) and any future caller that disposes a session while another thread is still inside `CloseAsync` will trip `ObjectDisposedException` when the in-flight `CloseAsync` releases the semaphore. The race is narrow today because all `Close`/`Dispose` choreography goes through `SessionManager`, but the class-level contract is broken: nothing on `GatewaySession` documents or enforces "DisposeAsync must not be called concurrently with CloseAsync".
**Recommendation:** In `DisposeAsync`, either (a) take and release `_closeLock` once before disposing it, so the dispose is sequenced after any in-flight close, or (b) replace `_closeLock` disposal with a guard flag and let the semaphore be reclaimed by the finalizer. Document the invariant on the public method. Add a regression test that disposes a session whose `CloseAsync` has not yet completed and asserts no `ObjectDisposedException`.
**Resolution:** 2026-05-20 — Took recommendation (a): `GatewaySession.DisposeAsync` (`src/MxGateway.Server/Sessions/GatewaySession.cs`) now acquires `_closeLock` once before disposing the semaphore so an in-flight `CloseAsync` finishes (its `_closeLock.Release()`) before the dispose tears the semaphore down. The wait is non-cancellable (`CancellationToken.None`) and `ObjectDisposedException` is swallowed at both the wait and the dispose site so double-dispose still completes cleanly. The method's XML doc was extended with a `<remarks>` block stating the invariant. Regression tests in `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs`: `DisposeAsync_WhileCloseInFlight_WaitsForCloseAndDoesNotThrow` (parks `CloseAsync` inside the worker shutdown, calls `DisposeAsync` concurrently, releases shutdown, asserts both complete without `ObjectDisposedException` and the worker is disposed exactly once) and `DisposeAsync_CalledTwice_DoesNotThrow`.
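A minimal sketch of the "acquire once, then dispose" sequencing, assuming a simplified class: `ClosableResource` and its `shutdown` delegate are hypothetical names that stand in for `GatewaySession` and the worker shutdown call, and the sketch only models sequential double-dispose (concurrent `DisposeAsync` calls are outside its scope).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for GatewaySession; models only the close/dispose ordering.
public sealed class ClosableResource : IAsyncDisposable
{
    private readonly SemaphoreSlim _closeLock = new(1, 1);

    public bool IsClosed { get; private set; }

    // 'shutdown' stands in for the potentially slow worker shutdown.
    public async Task CloseAsync(Func<Task> shutdown)
    {
        await _closeLock.WaitAsync().ConfigureAwait(false);
        try
        {
            if (IsClosed)
            {
                return;
            }

            await shutdown().ConfigureAwait(false);
            IsClosed = true;
        }
        finally
        {
            // Safe: DisposeAsync cannot dispose the semaphore while it is held here.
            _closeLock.Release();
        }
    }

    public async ValueTask DisposeAsync()
    {
        try
        {
            // Non-cancellable wait: sequences the dispose after any in-flight close.
            await _closeLock.WaitAsync(CancellationToken.None).ConfigureAwait(false);
        }
        catch (ObjectDisposedException)
        {
            return; // Already disposed by an earlier call.
        }

        try
        {
            _closeLock.Dispose();
        }
        catch (ObjectDisposedException)
        {
            // Tolerated so a repeated dispose completes cleanly.
        }
    }
}
```

The semaphore is deliberately not released before it is disposed; releasing first would reopen the window in which another caller could acquire it and then release a disposed instance.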
### Server-017
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:13-27`, `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:173-247`, `docs/Authorization.md:108-110` |
| Status | Resolved |
**Description:** The two new top-level RPCs added to `MxAccessGateway`, `AcknowledgeAlarm(AcknowledgeAlarmRequest)` and `QueryActiveAlarms(QueryActiveAlarmsRequest)` (proto lines 23-24), are not enumerated by `GatewayGrpcScopeResolver.ResolveRequiredScope`. The resolver's `request switch` covers `OpenSessionRequest`, `CloseSessionRequest`, `StreamEventsRequest`, `MxCommandRequest`, and the four Galaxy-repository requests; everything else falls through to `_ => GatewayScopes.Admin`. The interceptor (`GatewayGrpcAuthorizationInterceptor.AuthenticateAndAuthorizeAsync`) then rejects any non-admin caller with `PermissionDenied`. This is technically fail-closed (and `docs/Authorization.md:108-110` documents the "unrecognized → admin" intent), but in practice it means: (1) only API keys with the `admin` scope can acknowledge alarms or query active alarms, even though acknowledging is naturally an `invoke:write`-shaped operation and querying is naturally an `invoke:read`- or `metadata:read`-shaped operation; (2) the alarm RPCs ship in a state where any client that successfully opened a session and subscribed to alarm events still cannot perform the operational acks the contract advertises; (3) the test matrix `GatewayGrpcScopeResolverTests` does not even cover these two request types, so the gap was not caught at unit-test time.
**Recommendation:** Add explicit arms to `ResolveRequiredScope`: map `AcknowledgeAlarmRequest` to `GatewayScopes.InvokeWrite` (parity with other write actions; ack changes alarm state) and `QueryActiveAlarmsRequest` to `GatewayScopes.MetadataRead` or `GatewayScopes.InvokeRead`. Update `docs/Authorization.md` to list both. Extend `GatewayGrpcScopeResolverTests` with the new mappings and an assertion that every request type defined by `mxaccess_gateway.proto` is named in the resolver (the test can enumerate the assembly's request types so a future RPC cannot quietly add itself only via the admin fallback).
**Resolution:** 2026-05-20 — Added explicit `AcknowledgeAlarmRequest => GatewayScopes.InvokeWrite` and `QueryActiveAlarmsRequest => GatewayScopes.EventsRead` arms to `GatewayGrpcScopeResolver.ResolveRequiredScope` (`src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:21-22`). `InvokeWrite` matches the existing `MxCommandKind.Write*` mapping because ack mutates alarm state; `EventsRead` matches `StreamEventsRequest` and `MxCommandKind.DrainEvents` because querying active alarms reads the same alarm/event surface. Extended `GatewayGrpcScopeResolverTests` with two new `InlineData` rows covering both request types (`src/MxGateway.Tests/Security/Authorization/GatewayGrpcScopeResolverTests.cs:16-17`) and added four interceptor-level cases in `GatewayGrpcAuthorizationInterceptorTests` (`UnaryServerHandler_AcknowledgeAlarmMissingScope_ReturnsPermissionDenied`, `UnaryServerHandler_AcknowledgeAlarmWithScope_RunsHandler`, `ServerStreamingServerHandler_QueryActiveAlarmsMissingScope_ReturnsPermissionDenied`, `ServerStreamingServerHandler_QueryActiveAlarmsWithScope_RunsHandler`) proving each new RPC denies callers lacking the chosen scope and runs the handler when the scope is held. Updated `docs/Authorization.md` (resolver snippet and Scope Catalog table) to list both RPCs against their scopes. `dotnet test ... --filter FullyQualifiedName~GatewayGrpcAuthorizationInterceptorTests` → 14 passed, 0 failed; resolver tests 28 passed, 0 failed.
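The shape of the fix can be illustrated with a reduced resolver. Everything here is a stand-in: the empty request classes replace the generated protobuf types, `Scopes` replaces `GatewayScopes`, and the `"events:read"` string is an assumed value (only `admin` and `invoke:write` appear verbatim in this finding).

```csharp
using System;

// Empty stand-ins for the generated protobuf request types.
public sealed class AcknowledgeAlarmRequest { }
public sealed class QueryActiveAlarmsRequest { }
public sealed class StreamEventsRequest { }
public sealed class SomeFutureRequest { } // hypothetical RPC nobody has mapped yet

// Stand-in for GatewayScopes; the EventsRead string value is an assumption.
public static class Scopes
{
    public const string Admin = "admin";
    public const string InvokeWrite = "invoke:write";
    public const string EventsRead = "events:read";
}

public static class ScopeResolverSketch
{
    // Explicit arms for known requests; anything unrecognized stays fail-closed.
    public static string ResolveRequiredScope(object request) => request switch
    {
        AcknowledgeAlarmRequest => Scopes.InvokeWrite,  // ack mutates alarm state
        QueryActiveAlarmsRequest => Scopes.EventsRead,  // same surface as StreamEvents
        StreamEventsRequest => Scopes.EventsRead,
        _ => Scopes.Admin,                              // unrecognized request: admin only
    };
}
```

Keeping the `_ => Scopes.Admin` arm preserves the fail-closed default, so a request type that is added later without a mapping is over-restricted rather than exposed.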
### Server-018
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:15` |
| Status | Resolved |
**Description:** `GalaxyGlobMatcher.RegexCache` is a `ConcurrentDictionary<string, Regex>` keyed by glob pattern, with no eviction. The fix for Server-008 added this cache deliberately to avoid recompiling the same handful of patterns, but the cache key is the raw glob string. The patterns currently come from two sources — `DiscoverHierarchyRequest.TagNameGlob` (client-supplied) and `ApiKeyConstraints.BrowseSubtrees` / `ReadSubtrees` / `WriteSubtrees` / `ReadTagGlobs` / `WriteTagGlobs` (admin-configured) — and `BuildRegex` also runs each glob through `Regex.Escape` so an attacker cannot craft a denial-of-service ReDoS payload. The leak is therefore bounded only by "how many distinct globs a client can submit over the process lifetime", which is in the millions for `TagNameGlob` if a client iterates through generated names. Each compiled `Regex` also holds a JIT'd assembly that is non-trivial to reclaim.
**Recommendation:** Cap the cache at a small bound (e.g. 256 patterns) using a simple LRU or a `MemoryCache` with sliding expiration, or restrict the cache to globs that originate from API-key constraints (admin-controlled, naturally bounded) and pay the compile cost for client-supplied globs. Add a test that fills the cache with thousands of distinct globs and asserts the cache size stays bounded.
**Resolution:** 2026-05-20 — Capped `GalaxyGlobMatcher`'s compiled-regex cache at `RegexCacheCapacity = 256` entries with FIFO-by-insertion eviction (`src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs`). A `ConcurrentQueue<string>` tracks insertion order; when the cache grows past the cap, `EvictIfOverCapacity` takes a small lock and dequeues + removes the oldest entries until the count is back within bound. Reads stay lock-free (the lock guards only the eviction path). Internal `CurrentCacheSize` / `RegexCacheCapacity` accessors are surfaced through the existing `InternalsVisibleTo("MxGateway.Tests")` so tests can assert the bound. Regression test: `GalaxyFilterInputSafetyTests.GlobMatcher_WithManyDistinctPatterns_CacheStaysBounded` submits `RegexCacheCapacity * 4` distinct globs and asserts `CurrentCacheSize` stays in `[0, RegexCacheCapacity]`. Existing glob correctness tests (`GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect`, the adversarial-input theories) continue to pass, confirming eviction does not corrupt lookups.
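The FIFO-by-insertion eviction described above could look roughly like the following generic sketch. `BoundedFifoCache<TValue>` is a hypothetical class written for illustration; it is not `GalaxyGlobMatcher`, and the real implementation may differ in how it handles races between add and evict.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical bounded cache: lock-free reads, FIFO eviction under a small lock.
public sealed class BoundedFifoCache<TValue>
{
    private readonly ConcurrentDictionary<string, TValue> _entries = new(StringComparer.Ordinal);
    private readonly ConcurrentQueue<string> _insertionOrder = new();
    private readonly object _evictionLock = new();
    private readonly int _capacity;

    public BoundedFifoCache(int capacity)
    {
        if (capacity <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(capacity));
        }

        _capacity = capacity;
    }

    public int Count => _entries.Count;

    public bool Contains(string key) => _entries.ContainsKey(key);

    public TValue GetOrAdd(string key, Func<string, TValue> factory)
    {
        // Cache hit: no lock taken.
        if (_entries.TryGetValue(key, out var existing))
        {
            return existing;
        }

        var created = factory(key);
        if (_entries.TryAdd(key, created))
        {
            _insertionOrder.Enqueue(key);
            EvictIfOverCapacity();
            return created;
        }

        // Another thread added the key first; prefer its value if still cached.
        return _entries.TryGetValue(key, out var winner) ? winner : created;
    }

    private void EvictIfOverCapacity()
    {
        if (_entries.Count <= _capacity)
        {
            return;
        }

        lock (_evictionLock)
        {
            // Remove oldest insertions until the count is back within the bound.
            while (_entries.Count > _capacity && _insertionOrder.TryDequeue(out var oldest))
            {
                _entries.TryRemove(oldest, out _);
            }
        }
    }
}
```

FIFO is simpler than LRU because a hit does not have to update any ordering, which is what keeps the read path lock-free; the trade-off is that a frequently used entry can still be evicted once enough newer keys arrive.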
### Server-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:183-221` |
| Status | Resolved |
**Description:** `WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` returns `yield break` (line 191) when `sessionRegistry.TryGet(request.SessionId, ...)` fails — it silently produces an empty stream with no diagnostic. The peer `AcknowledgeAsync` instead returns an `AcknowledgeAlarmReply` with `ProtocolStatus.Code = SessionNotFound` (lines 81-89), so the two methods have inconsistent missing-session handling. In production this branch is unreachable because `MxAccessGatewayService.QueryActiveAlarms` calls `ResolveSession(...)` first and throws `NotFound` from the gRPC layer (`MxAccessGatewayService.cs:228`), but: (a) the dispatcher is the seam other code paths might reach in the future, and (b) any unit test that instantiates the dispatcher directly with a missing session id sees an empty stream rather than a clear error, which is a footgun.
**Recommendation:** Either throw a `SessionManagerException(SessionManagerErrorCode.SessionNotFound, ...)` (matching the gRPC service's own resolver) or yield a single `ActiveAlarmSnapshot` with a diagnostic field set, and add a `WorkerAlarmRpcDispatcherTests` case that asserts whichever shape is chosen. Aligning with `AcknowledgeAsync`'s `SessionNotFound` protocol-status pattern is preferred, but `QueryActiveAlarms` is a server-streaming RPC so a thrown `SessionManagerException` propagated by the gateway is the cleaner fit.
**Resolution:** 2026-05-20 — Took the preferred option: `WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` (`src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs`) now throws `SessionManagerException(SessionManagerErrorCode.SessionNotFound, ...)` instead of `yield break`-ing when the session is missing. `MxAccessGatewayService.MapException` already maps that error code to gRPC `NotFound`, so production callers see a consistent missing-session response and a direct unit-test caller now gets a clear error instead of an empty success. The unary peer `AcknowledgeAsync` continues to surface the same condition as an in-band `ProtocolStatus.Code = SessionNotFound`, which is correct for a unary RPC. Regression test: `WorkerAlarmRpcDispatcherTests.QueryActiveAlarmsAsync_WhenSessionMissing_ThrowsSessionNotFound` replaces the prior `_YieldsEmpty` assertion — it asserts the new exception shape and also exercises `AcknowledgeAsync` with the same missing session id to pin the peer-method parity.
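To make the behavioral difference concrete, here is a reduced async-iterator sketch. `AlarmQuerySketch`, `SessionNotFoundException`, and the dictionary-backed registry are hypothetical stand-ins for `WorkerAlarmRpcDispatcher`, `SessionManagerException`, and the session registry; alarms are modeled as plain strings.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for SessionManagerException with a SessionNotFound code.
public sealed class SessionNotFoundException : Exception
{
    public SessionNotFoundException(string sessionId)
        : base($"Session '{sessionId}' was not found.")
    {
    }
}

// Hypothetical stand-in for the dispatcher; the dictionary plays the registry.
public sealed class AlarmQuerySketch
{
    private readonly Dictionary<string, string[]> _alarmsBySession;

    public AlarmQuerySketch(Dictionary<string, string[]> alarmsBySession)
    {
        _alarmsBySession = alarmsBySession;
    }

    public async IAsyncEnumerable<string> QueryActiveAlarmsAsync(string sessionId)
    {
        if (!_alarmsBySession.TryGetValue(sessionId, out var alarms))
        {
            // Previously 'yield break': an empty stream looked like success.
            throw new SessionNotFoundException(sessionId);
        }

        foreach (var alarm in alarms)
        {
            await Task.Yield(); // stands in for the asynchronous worker round trip
            yield return alarm;
        }
    }
}
```

Because this is an async iterator, the exception surfaces on the first `MoveNextAsync` call rather than when `QueryActiveAlarmsAsync` is invoked, so a test needs to start enumerating before it can observe the failure.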
### Server-020
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor:1-2`, `…/GalaxyPage.razor:1-2`, `…/ApiKeysPage.razor:1-2`, `…/EventsPage.razor:1-2`, `…/SessionsPage.razor:1-2`, `…/WorkersPage.razor:1-2`, `…/SettingsPage.razor:1-2`, `…/SessionDetailsPage.razor:1-2` |
| Status | Resolved |
**Description:** Every dashboard page declares two `@page` directives — `@page "/X"` AND `@page "/dashboard/X"` — even though `DashboardEndpointRouteBuilderExtensions.MapGatewayDashboard` mounts the Razor components under a `RouteGroupBuilder` with `pathBase = "/dashboard"`. The group prefix is prepended to each `@page` route, so the actual endpoints become `/dashboard/X` (from `@page "/X"`) **and** `/dashboard/dashboard/X` (from `@page "/dashboard/X"`). The pages are reachable at two URLs each, and the deeper one (`/dashboard/dashboard/sessions` etc.) is almost certainly accidental — it leaks the path-base name into the URL and creates duplicate authorize/render work per route. `GatewayApplicationTests.Build_WhenDashboardEnabled_ComponentRoutesRequireAuthorization` only checks the `/dashboard/X` shape, so the duplicate route slipped through without an assertion.
**Recommendation:** Drop the `@page "/dashboard/X"` directive from each page; rely on the `MapGroup("/dashboard")` to provide the prefix. Or, if the team genuinely wants both URL shapes, document the choice in the file header and extend the route-enumeration test to assert that **both** are present (and both carry the authorization policy). Either way, the current setup is non-obvious.
**Resolution:** 2026-05-20 — Took the recommended drop: removed the redundant `@page "/dashboard/X"` directive from every dashboard Razor page (`DashboardHome.razor`, `SessionsPage.razor`, `WorkersPage.razor`, `EventsPage.razor`, `GalaxyPage.razor`, `SettingsPage.razor`, `ApiKeysPage.razor`, `SessionDetailsPage.razor`). Each page now declares only its bare route (e.g. `@page "/sessions"`); `DashboardEndpointRouteBuilderExtensions.MapGatewayDashboard` continues to prepend `/dashboard` via `MapGroup`, so each page is reachable at exactly one URL (`/dashboard/X`). Regression test: `GatewayApplicationTests.Build_WhenDashboardEnabled_DoesNotRegisterDoubledDashboardPrefixRoutes` enumerates the eight previously-doubled routes (`/dashboard/dashboard/`, `/dashboard/dashboard/sessions`, ... `/dashboard/dashboard/sessions/{SessionId}`) and asserts none of them are mapped. The existing `..._MapsBlazorDashboardAndAuthEndpoints` / `..._ComponentRoutesRequireAuthorization` tests continue to verify the desired `/dashboard/X` shapes are still present and policy-gated. No public URL contract changed (the doubled shape was accidental); no doc update needed — `gateway.md` and `docs/GatewayDashboardDesign.md` never referenced the doubled routes.
### Server-021
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:266-664`, `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs` |
| Status | Resolved |
**Description:** The 1cd51bb commit history (the bulk read/write series, `f220908`/`5e375f6`/`758aca2`) added 473 lines of constraint-filtering and reply-merging logic to `MxAccessGatewayService`: `ApplyConstraintsAsync` (line 266), `EnforceReadTagAsync` / `EnforceWriteHandleAsync`, `FilterTagBulkAsync` / `FilterReadBulkAsync` / `FilterWriteBulkAsync` / `FilterHandleBulkAsync`, the `ReplaceWriteBulkEntries` switch, and three concrete `BulkConstraintPlan` records (`SubscribeBulkConstraintPlan`, `WriteBulkConstraintPlan`, `ReadBulkConstraintPlan`) that splice denied entries back into the worker's allowed-only reply in original-index order. None of this is covered by `MxAccessGatewayServiceTests` — its `FakeSessionManager` is wired with an `AllowAllConstraintEnforcer` (line 430) that never denies anything, so every constraint-related code path is dead at test time. A subtle off-by-one in `BuildMerged`, a wrong `PayloadOneofCase` in `GetPayload` / `SetPayload`, or a missing case in `ReplaceWriteBulkEntries` would all ship without a test failure.
**Recommendation:** Add `MxAccessGatewayServiceTests` cases that inject a deny-on-glob `IConstraintEnforcer` and exercise: (1) `AddItemBulk` / `SubscribeBulk` / `AdviseItemBulk` with a mix of allowed and denied tags, asserting `BulkSubscribeReply.Results` interleaves denied and worker-allowed entries in original-index order; (2) the same for `ReadBulk` and each of the four bulk-write commands; (3) `HasAllowedItems == false` so `CreateDeniedReply` is exercised (no worker call); (4) the unary `Write`/`Write2`/`WriteSecured`/`WriteSecured2` paths through `EnforceWriteHandleAsync`. The fixtures can reuse the existing `FakeSessionManager` by replacing the constraint enforcer; no live worker is needed.
**Resolution:** Resolved 2026-05-20: added a configurable `PredicateConstraintEnforcer` test double (`src/MxGateway.Tests/TestSupport/PredicateConstraintEnforcer.cs`) that denies on per-tag and per-handle predicates and records denials. Added 11 new tests in `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceConstraintTests.cs` covering: (1) `AddItemBulk` with mixed denials: asserts the worker is called once with only the allowed subset and the merged reply interleaves denied and worker-allowed `SubscribeResult`s at their original indices; (2) `SubscribeBulk` with every tag denied: asserts that `HasAllowedItems == false` short-circuits to `CreateDeniedReply` and the session manager is never invoked; (3) `AdviseItemBulk` (handle-keyed denial via `CheckReadHandleAsync`); (4) `SubscribeBulk` with the allow-all enforcer as a pass-through regression guard; (5) `ReadBulk` partial denial: asserts the `BulkReadConstraintPlan` produces a `BulkReadReply` (not a `BulkSubscribeReply`) with denied entries spliced in at their original indices; (6) `ReadBulk` all-denied short-circuit; (7) `WriteBulk` partial denial: asserts denied entries are dropped from the forwarded `Entries` and the merged reply preserves original-index order; (8) `WriteSecuredBulk` all-denied: proves the second `ReplaceWriteBulkEntries` switch arm is reachable; (9) unary `Write` with denied handle → `PermissionDenied`, no worker call, denial recorded; (10) unary `WriteSecured` with denied handle → `PermissionDenied`; (11) unary `AddItem` with denied tag → `PermissionDenied` (`EnforceReadTagAsync`). `MxAccessGatewayServiceTests.CreateService` was updated to accept an `IConstraintEnforcer` so future tests can opt into the deny enforcer without duplicating the wiring. All 11 new tests pass; full suite (`dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj`) is green at 458 passing.
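
The original-index merge that these tests pin can be sketched as follows. This is an illustrative model under stated assumptions, not the service's implementation: `ItemResult`, `BulkMergeSketch`, and the `"denied by constraint"` message are hypothetical, and the real reply types are protobuf messages rather than a C# record.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical per-item result shape used only by this sketch.
public sealed record ItemResult(string Tag, bool WasSuccessful, string? ErrorMessage);

public static class BulkMergeSketch
{
    // Rebuilds a full-length reply: walks the request in its original order and
    // takes either a denied placeholder or the next result from the worker,
    // which only ever saw the allowed subset.
    public static IReadOnlyList<ItemResult> Merge(
        IReadOnlyList<string> requestedTags,
        Func<string, bool> isDenied,
        IReadOnlyList<ItemResult> workerResultsForAllowed)
    {
        var merged = new List<ItemResult>(requestedTags.Count);
        int nextWorkerResult = 0;

        foreach (string tag in requestedTags)
        {
            if (isDenied(tag))
            {
                merged.Add(new ItemResult(tag, false, "denied by constraint"));
            }
            else
            {
                merged.Add(workerResultsForAllowed[nextWorkerResult++]);
            }
        }

        // A leftover worker result means the allowed subset and the reply disagree.
        if (nextWorkerResult != workerResultsForAllowed.Count)
        {
            throw new InvalidOperationException(
                "Worker reply count does not match the allowed subset.");
        }

        return merged;
    }
}
```

An off-by-one in the cursor (`nextWorkerResult`) is the kind of defect the mixed-denial tests are meant to catch, since it shifts every worker result after the first denied entry.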
### Server-022
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs:8-29` |
| Status | Resolved |
**Description:** Server-014's resolution noted that the stale "PR A.6 / A.7" / "not yet wired" language was rewritten on `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms` and on the `WorkerAlarmRpcDispatcher` class doc. The corresponding XML doc on the **interface** `IAlarmRpcDispatcher` (lines 8-29) still says it is "PR A.6 / A.7 — gateway-side dispatcher" and that "Production implementations live in `WorkerAlarmRpcDispatcher` (this PR ships a not-yet-wired default that returns a clear worker-pending diagnostic)". That second clause directly contradicts the now-correct comments on the concrete implementations and on the gRPC service: `WorkerAlarmRpcDispatcher` is the wired default, not a not-yet-wired one. A reader who finds the interface first will believe the dispatcher is non-functional.
**Recommendation:** Rewrite the `IAlarmRpcDispatcher` `<remarks>` block to match the language now used on `WorkerAlarmRpcDispatcher` and on the gRPC service: DI binds `WorkerAlarmRpcDispatcher` by default; `NotWiredAlarmRpcDispatcher` is only the null fallback for tests/DI omission. Drop the "PR A.6 / A.7" prefix from the `<summary>` — the interface is now the public alarm-RPC seam.
**Resolution:** 2026-05-20 — Rewrote `IAlarmRpcDispatcher`'s `<summary>` and `<remarks>` (`src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs`) to match the language now used on `WorkerAlarmRpcDispatcher` and on `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms`: dropped the stale "PR A.6 / A.7" prefix from the summary, and replaced the "this PR ships a not-yet-wired default that returns a clear worker-pending diagnostic" clause with the correct statement that DI binds the production `WorkerAlarmRpcDispatcher` by default and `NotWiredAlarmRpcDispatcher` is only the null fallback for DI omission / standalone tests. Pure documentation change; no test.
+116 -11
@@ -4,8 +4,8 @@
|---|---|
| Module | `src/MxGateway.Tests` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `6c64030` |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |
@@ -13,16 +13,16 @@
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issue found: Tests-001 (`FakeSessionManager.TryGetSession` always returns true), Tests-011 (unobserved worker task). |
| 2 | mxaccessgw conventions | FakeWorkerHarness used per docs; no real secrets; minor style drift in three alarm-test files (Tests-008). |
| 3 | Concurrency & thread safety | Issues found: Tests-006 (`Task.Delay`-based timing), Tests-012 (no parallelism guard for `WebApplication` tests). |
| 4 | Error handling & resilience | Strong — timeouts, faults, overflow, kill paths, protocol violations all exercised. No issues found. |
| 5 | Security | Issues found: Tests-002 (no SQL-injection coverage of Galaxy RPCs), Tests-010 (anonymous-localhost negative cases untested). |
| 6 | Performance & resource management | Issue found: Tests-003 (temp DB/worker directories never cleaned up). |
| 1 | Correctness & logic bugs | Issue found: Tests-015 (`FakeWorkerProcess.WaitForExitAsync` mutates `HasExited`, weakening the smoke test assertion). |
| 2 | mxaccessgw conventions | No new issues. Style/convention drift previously filed has been resolved. |
| 3 | Concurrency & thread safety | Issue found: Tests-017 (`HeartbeatMonitor_WhenHeartbeatExpires_FaultsClient` still on real wall-clock). |
| 4 | Error handling & resilience | Strong — timeouts, faults, overflow, kill paths, protocol violations all exercised. No new issues found. |
| 5 | Security | No new issues. `Galaxy` adversarial-input safety (Tests-002), dashboard anonymous-localhost negatives (Tests-010), and interceptor composition (Tests-004) all resolved in the prior pass. |
| 6 | Performance & resource management | Issue found: Tests-014 (`WebApplication` instances built by `GatewayApplicationTests` and `DashboardCookieOptionsTests` are never disposed). |
| 7 | Design-document adherence | Tests match `docs/GatewayTesting.md`; no drift found. No issues found. |
| 8 | Code organization & conventions | Issue found: Tests-007 (`TestServerCallContext` copy-pasted into 4+ files). |
| 9 | Testing coverage | Issues found: Tests-001, Tests-004 (no end-to-end interceptor+service test), Tests-005 (no worker-crash-mid-command coverage), Tests-002. |
| 10 | Documentation & comments | Issue found: Tests-009 (stale/mismatched XML `<summary>` comments). |
| 8 | Code organization & conventions | Issue found: Tests-018 (`DateTimeOffset.Parse` calls without `CultureInfo.InvariantCulture`). |
| 9 | Testing coverage | Issues found: Tests-013 (eight new `GatewaySession.*BulkAsync` methods untested), Tests-016 (a Galaxy cache unit test performs a real network connect attempt). |
| 10 | Documentation & comments | Issue found: Tests-019 (the `Re-triage note` paragraphs added to Tests-002/006/008 live only inside `findings.md`; `docs/GatewayTesting.md` is not updated to describe the in-memory Galaxy filter safety tests added under that finding). |
## Findings
@@ -211,3 +211,108 @@
**Recommendation:** Add an `xunit.runner.json` or a collection grouping the `WebApplication`-building tests, and keep the `:0` ephemeral-port convention explicit so future tests do not introduce a fixed-port collision.
**Resolution:** Resolved 2026-05-18: added `src/MxGateway.Tests/xunit.runner.json` making the parallelism policy explicit (`parallelizeTestCollections: true`, `maxParallelThreads: -1`, `parallelizeAssembly: false`, `longRunningTestSeconds: 30`) and wired it into `MxGateway.Tests.csproj` as `<None Update="xunit.runner.json" CopyToOutputDirectory="PreserveNewest" />` so the runner picks it up (confirmed present in `bin/Debug/net10.0/`). Added a comment at the only `WebApplication`-building call site (`GatewayApplicationTests.cs`, `--urls=http://127.0.0.1:0`) documenting that the ephemeral-port (`:0`) convention is mandatory because test collections run in parallel. No fixed-port binding exists today; this is a preventative guardrail as the finding recommends.
### Tests-013
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:449-679`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs` |
| Status | Resolved |
**Description:** `GatewaySession` exposes eleven bulk methods (`AddItemBulkAsync`, `AdviseItemBulkAsync`, `RemoveItemBulkAsync`, `UnAdviseItemBulkAsync`, `SubscribeBulkAsync`, `UnsubscribeBulkAsync`, `WriteBulkAsync`, `Write2BulkAsync`, `WriteSecuredBulkAsync`, `WriteSecured2BulkAsync`, `ReadBulkAsync`) but only three (`SubscribeBulkAsync`, `WriteBulkAsync`, `ReadBulkAsync`) are exercised in `SessionManagerTests`. A grep across `src/MxGateway.Tests` for the other eight method names returns zero matches. The recent commit `eaa7093` ("register the five new bulk subcommands in `IsKnownGatewayCommand`") explicitly added bulk surface to the gateway, and `1cd51bb` added stress benchmarks for it, but the gateway-side tests do not pin the command-kind, payload-shape, or `WriteSecured*Bulk` credential-redaction behaviour for any of the new bulk variants. A future regression in `WriteSecuredBulkAsync` body construction would not be caught by the gateway unit suite.
**Recommendation:** Mirror the existing `SubscribeBulkAsync` / `WriteBulkAsync` / `ReadBulkAsync` test pattern for the eight missing methods: each test should `OpenSessionAsync`, invoke the bulk API, assert the worker received exactly one `WorkerCommand` of the matching `MxCommandKind`, and (for the secured variants) confirm the credential payload survives the round-trip without being log-redacted from the over-the-wire command shape.
**Resolution:** Resolved 2026-05-20: added `src/MxGateway.Tests/Gateway/Sessions/SessionManagerBulkTests.cs` with per-method coverage for all eleven bulk entry points. Each method now has a round-trip test that pins (a) the exact `MxCommandKind` sent to the worker, (b) the payload shape (server handle, item handles / tag addresses / entries, timeout for `ReadBulk`), and (c) per-entry failure surfacing where the reply contains a mix of `WasSuccessful = true`/`false` results with an `ErrorMessage`. Each method also has a `*_PropagatesCancellation` test that pre-cancels the token and asserts `OperationCanceledException` flows out. The secured variants additionally pin that `CurrentUserId` / `VerifierUserId` survive the over-the-wire command shape unchanged (the gateway's redaction rules apply only to logs, not to the command body the worker receives). New tests use a local `FakeBulkWorkerClient` keyed by `MxCommand.Kind`-specific replies; no production-code change. All 54 SessionManager/GalaxyHierarchyCache tests pass with `dotnet test --filter "FullyQualifiedName~SessionManager|FullyQualifiedName~GalaxyHierarchyCache"`.
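
The round-trip and cancellation pattern described above can be sketched in isolation. The types here (`RecordingWorkerSketch`, `BulkSessionSketch`) are hypothetical stand-ins and the command kind is modelled as a plain string; the actual tests use `FakeBulkWorkerClient` and `MxCommandKind`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical worker double that records each command kind it receives.
public sealed class RecordingWorkerSketch
{
    public List<string> ReceivedKinds { get; } = new List<string>();

    public Task<string> InvokeAsync(string commandKind, CancellationToken cancellationToken)
    {
        // A pre-cancelled token must surface as a cancelled task and record nothing.
        if (cancellationToken.IsCancellationRequested)
        {
            return Task.FromCanceled<string>(cancellationToken);
        }

        ReceivedKinds.Add(commandKind);
        return Task.FromResult("reply:" + commandKind);
    }
}

// Hypothetical session facade with a single bulk entry point.
public static class BulkSessionSketch
{
    public static Task<string> WriteSecuredBulkAsync(
        RecordingWorkerSketch worker, CancellationToken cancellationToken) =>
        worker.InvokeAsync("WriteSecuredBulk", cancellationToken);
}
```

A round-trip test would assert that exactly one command of the expected kind was recorded; a cancellation test would pass an already-cancelled token and assert that `OperationCanceledException` (or a subclass) is thrown and nothing is recorded.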
### Tests-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Tests/Gateway/GatewayApplicationTests.cs:18,33,44,62,81,105`, `src/MxGateway.Tests/Gateway/Dashboard/DashboardCookieOptionsTests.cs:17` |
| Status | Resolved |
**Description:** Seven `[Fact]` methods build a real `WebApplication` via `GatewayApplication.Build([])` and never dispose it. `WebApplication` is `IAsyncDisposable`; constructing one stands up a full DI container, an OpenTelemetry meter (`GatewayMetrics`), Kestrel server objects, hosted services, and logging providers. Because the suite runs test collections in parallel (per the new `xunit.runner.json` from Tests-012), every undisposed instance keeps its meter/loggers/hosted services alive until the test process exits, doubling up live Meter instances each time and silently extending the memory/handle footprint of an `xunit` run. Only the two tests that actually call `app.StartAsync()` (`GatewayApplicationTests.StartAsync_InvalidGatewayConfiguration_FailsStartup` and `SqliteAuthStoreTests.StartAsync_NewerSchemaVersion_BlocksStartup`) currently use `await using`.
**Recommendation:** Promote each `WebApplication app = GatewayApplication.Build(...)` to `await using WebApplication app = ...` and make the containing test method `async Task`. The endpoint-listing assertions do not need `await`, but the `await using` will ensure the DI container, meter, and hosted services are torn down per-test.
**Resolution:** 2026-05-20 — Promoted all seven `WebApplication`-building tests (six in `GatewayApplicationTests` plus the one in `DashboardCookieOptionsTests`) to `async Task` with `await using WebApplication app = GatewayApplication.Build(...)`, so the DI container, `GatewayMetrics` meter, hosted services, and Kestrel objects are torn down per-test rather than leaking until process exit. The previously already-`await using` `StartAsync_InvalidGatewayConfiguration_FailsStartup` was unchanged. Full suite green.
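
A minimal sketch of the `await using` shape, assuming a hypothetical `FakeHost` in place of `WebApplication` so the example stands alone. The live-instance counter exists only to make the disposal observable.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a host that owns resources until it is disposed.
public sealed class FakeHost : IAsyncDisposable
{
    public static int LiveInstances;

    public FakeHost() => LiveInstances++;

    public ValueTask DisposeAsync()
    {
        LiveInstances--;
        return ValueTask.CompletedTask;
    }
}

public static class AwaitUsingSketch
{
    // Mirrors an `async Task` test body: build, inspect synchronously, then let
    // `await using` dispose the host when the method scope ends.
    public static async Task<int> BuildInspectAndDisposeAsync()
    {
        await using FakeHost host = new FakeHost();

        // Synchronous assertions against the host would go here.
        return FakeHost.LiveInstances;
    }
}
```

The method returns the count observed while the host is in scope; once the returned task completes, the counter is back at its previous value, which is the per-test teardown the finding asks for.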
### Tests-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:374-379,87` |
| Status | Resolved |
**Description:** The nested `FakeWorkerProcess.WaitForExitAsync` implementation unconditionally sets `HasExited = true` and `ExitCode ??= 0` when called, regardless of whether the scripted worker actually completed the shutdown handshake. The smoke-test assertion `Assert.True(launcher.Process.HasExited)` therefore cannot distinguish "the scripted worker received `WorkerShutdown`, sent `WorkerShutdownAck`, and called `MarkExited(0)`" from "the gateway code path simply awaited `WaitForExitAsync` somewhere during teardown". The scripted worker happens to call `MarkExited(0)` after receiving the shutdown frame, but a regression that bypassed the shutdown-ack path entirely would still pass this assertion. The companion launcher in `SessionWorkerClientFactoryFakeWorkerTests.FakeWorkerProcess.WaitForExitAsync` (lines 351-356) has the same shape — fine there because no exit assertion is made — but the smoke test relies on this signal.
**Recommendation:** Make `WaitForExitAsync` await an internal `TaskCompletionSource` that is only completed by `Kill()` or `MarkExited()` (the same pattern `WorkerClientTests.FakeWorkerProcess` already uses for `_exited`), so `HasExited` reflects actual exit and the smoke test's assertion is meaningful.
**Resolution:** Resolved 2026-05-20: rewrote the smoke-test `FakeWorkerProcess` to back `WaitForExitAsync` with a `TaskCompletionSource _exited` that is only completed inside `MarkExited` (called by the scripted worker after sending `WorkerShutdownAck`) or `Kill` (which calls `MarkExited(-1)`), removing the "set `HasExited = true` and return immediately" shortcut. The smoke test now also asserts `Assert.Equal(0, launcher.Process.ExitCode)`; `MarkExited(0)` is reachable only via the shutdown-ack branch, so a regression that bypassed the ack path would produce a non-zero (or null) exit code and fail the assertion deterministically. `WorkerClient.ShutdownAsync` calls `WaitForProcessExitAsync`, which now genuinely awaits the scripted worker's ack.
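
The `TaskCompletionSource`-backed shape can be sketched as below. This is an assumption-laden illustration, not the test double itself: `FakeWorkerProcessSketch` is a hypothetical name, and it assumes .NET 6 or later for `Task.WaitAsync`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical fake process whose exit state can only change through
// MarkExited or Kill, never as a side effect of waiting.
public sealed class FakeWorkerProcessSketch
{
    private readonly TaskCompletionSource<int> _exited =
        new TaskCompletionSource<int>(TaskCreationOptions.RunContinuationsAsynchronously);

    public bool HasExited => _exited.Task.IsCompleted;

    public int? ExitCode =>
        _exited.Task.IsCompletedSuccessfully ? _exited.Task.Result : (int?)null;

    // The first caller wins; later calls leave the recorded exit code unchanged.
    public void MarkExited(int exitCode) => _exited.TrySetResult(exitCode);

    public void Kill() => MarkExited(-1);

    // Waiting observes the exit; it does not cause it.
    public Task WaitForExitAsync(CancellationToken cancellationToken) =>
        _exited.Task.WaitAsync(cancellationToken);
}
```

With this shape, `HasExited` and `ExitCode` stay unset until the scripted worker (or a kill) reports an exit, so an assertion on exit code `0` only passes when the graceful path actually ran.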
### Tests-016
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs:29-41,115-124` |
| Status | Resolved |
**Description:** `RefreshAsync_WhenSqlIsUnreachable_MarksUnavailableAndDoesNotPublish` is in the unit-test project but exercises a real `GalaxyHierarchyCache`/`GalaxyRepository` against a hard-coded TCP socket `127.0.0.1:65500` with a one-second connect timeout. Per `docs/GatewayTesting.md`, live Galaxy coverage belongs in `MxGateway.IntegrationTests` and is gated by `MXGATEWAY_RUN_LIVE_GALAXY_TESTS=1`; this test is neither gated nor uses a stub repository. On most boxes the connect fails closed (the test passes), but the outcome depends on OS-level "connection refused" vs "no route to host" behaviour and is sensitive to environments where 127.0.0.1:65500 happens to be bound — a real flakiness source. It also breaks the gateway-without-MXAccess invariant in spirit (the gateway code path under test does I/O the unit project should not need).
**Recommendation:** Either (a) replace the real repository with an in-test fake that throws a `SqlException`/`TimeoutException` from `GetHierarchyAsync`, exercising `GalaxyHierarchyCache.RefreshAsync`'s exception path directly; or (b) move the test to `MxGateway.IntegrationTests` and gate it behind a "no-live-DB-required" variant of the live-Galaxy attribute. (a) is preferred because the production path being tested is the cache's reaction to a repository exception, not socket behaviour.
**Resolution:** Resolved 2026-05-20: applied option (a). Introduced `src/MxGateway.Server/Galaxy/IGalaxyRepository.cs` with the four methods the cache consumes (`TestConnectionAsync`, `GetLastDeployTimeAsync`, `GetHierarchyAsync`, `GetAttributesAsync`); made `GalaxyRepository` implement it; changed `GalaxyHierarchyCache`'s constructor to depend on `IGalaxyRepository` rather than the concrete type; and registered the interface against the existing concrete singleton in `GalaxyRepositoryServiceCollectionExtensions.AddGalaxyRepository`. Rewrote the test as `RefreshAsync_WhenRepositoryThrows_MarksUnavailableAndDoesNotPublish` using a local `ThrowingGalaxyRepository : IGalaxyRepository` that throws an `InvalidOperationException` from `GetLastDeployTimeAsync` (the first call the cache makes against the repository). The test now exercises the cache's exception branch directly — no TCP I/O — and additionally asserts that `GetHierarchyAsync`/`GetAttributesAsync` are NOT invoked once the deploy-time probe has failed. `Current_BeforeAnyRefresh_ReturnsEmpty` was migrated to the same fake. The unreachable `CreateCache` helper that built a real `GalaxyRepository` against `127.0.0.1:65500` was removed. The Galaxy SQL surface itself stays covered by `MxGateway.IntegrationTests.Galaxy.GalaxyRepositoryLiveTests` (gated by `MXGATEWAY_RUN_LIVE_GALAXY_REPOSITORY_TESTS=1`).
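
Option (a) can be illustrated with a trimmed-down seam. Everything here is hypothetical and simplified: `IDeployTimeSource` models only one of the four repository methods, and `CacheSketch` is not `GalaxyHierarchyCache`; the point is only that a throwing fake exercises the exception branch without any socket I/O.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical one-method seam standing in for the repository interface.
public interface IDeployTimeSource
{
    Task<DateTimeOffset?> GetLastDeployTimeAsync(CancellationToken cancellationToken);
}

// Fake that always fails and counts how often it was asked.
public sealed class ThrowingSource : IDeployTimeSource
{
    public int Calls { get; private set; }

    public Task<DateTimeOffset?> GetLastDeployTimeAsync(CancellationToken cancellationToken)
    {
        Calls++;
        throw new InvalidOperationException("simulated repository failure");
    }
}

// Hypothetical cache that marks itself unavailable when the source throws.
public sealed class CacheSketch
{
    private readonly IDeployTimeSource _source;

    public CacheSketch(IDeployTimeSource source) => _source = source;

    public bool IsAvailable { get; private set; } = true;
    public int PublishedSnapshots { get; private set; }

    public async Task RefreshAsync(CancellationToken cancellationToken)
    {
        try
        {
            await _source.GetLastDeployTimeAsync(cancellationToken);
            PublishedSnapshots++;
            IsAvailable = true;
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Failure path: report unavailable and publish nothing.
            IsAvailable = false;
        }
    }
}
```

Because the fake fails deterministically, the test outcome no longer depends on how the operating system reports a refused connection.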
### Tests-017
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:346-364` |
| Status | Resolved |
**Description:** `HeartbeatMonitor_WhenHeartbeatExpires_FaultsClient` configures `HeartbeatGrace = 80 ms` and `HeartbeatCheckInterval = 20 ms`, then asserts the client faults within the 5-second `TestTimeout`. The test compares against the real wall clock — the heartbeat monitor reads `TimeProvider.System` for the grace check. After Tests-006 migrated the other heartbeat tests to an injected `ManualTimeProvider` for determinism, this one is now the only `WorkerClientTests` heartbeat case that still rides the wall clock. The 5-second outer bound makes a false failure unlikely, but the test cannot fail fast when the heartbeat-monitor logic regresses — it just waits the full 5 seconds.
**Recommendation:** Inject the same `ManualTimeProvider` used by `ReadLoop_WhenHeartbeatArrives_UpdatesLastHeartbeatAndWorkerProcess`, then `clock.Advance(TimeSpan.FromSeconds(2))` past the grace and assert the fault deterministically. The `HeartbeatCheckInterval` (20 ms) timer fire can stay on the real clock; what needs to be deterministic is the grace comparison.
**Resolution:** 2026-05-20 — `HeartbeatMonitor_WhenHeartbeatExpires_FaultsClient` now constructs a `ManualTimeProvider` seeded at `"2026-05-20T12:00:00Z"`, passes it to `CreateClient` via the existing `timeProvider` parameter, and calls `clock.Advance(TimeSpan.FromSeconds(2))` after the handshake. `WorkerClient.MarkReady` records `_lastHeartbeatAt` from the manual clock, so the next 20 ms `HeartbeatCheckInterval` tick observes `now - lastHeartbeat = 2s > 80ms grace` and faults deterministically. The check-interval timer stays on the real clock as the finding recommended; only the grace comparison is deterministic.
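
A minimal sketch of the deterministic grace comparison, assuming .NET 8's `System.TimeProvider`. `ManualClockSketch` and `HeartbeatGraceSketch` are hypothetical names (the project's own `ManualTimeProvider` may differ), and this version is not thread-safe.

```csharp
using System;

// Hypothetical manually advanced clock built on System.TimeProvider (.NET 8+).
public sealed class ManualClockSketch : TimeProvider
{
    private DateTimeOffset _now;

    public ManualClockSketch(DateTimeOffset start) => _now = start;

    public override DateTimeOffset GetUtcNow() => _now;

    // Moves the clock forward without waiting on the wall clock.
    public void Advance(TimeSpan delta) => _now += delta;
}

public static class HeartbeatGraceSketch
{
    // The comparison that needs to be deterministic: has the grace elapsed?
    public static bool IsExpired(TimeProvider clock, DateTimeOffset lastHeartbeat, TimeSpan grace) =>
        clock.GetUtcNow() - lastHeartbeat > grace;
}
```

Advancing the clock by two seconds past an 80 ms grace makes the expiry check true immediately, so a regression in the comparison fails fast instead of waiting out a real timeout.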
### Tests-018
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs:32`, `src/MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotServiceTests.cs:45,51,57,105,134,163,167,202-209,284,317,523`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:40` |
| Status | Resolved |
**Description:** Several tests parse ISO-8601 literals with `DateTimeOffset.Parse("2026-04-26T10:00:00Z")` without an explicit `CultureInfo.InvariantCulture`. `Directory.Build.props` enables `TreatWarningsAsErrors`, but CA1305 (specify `IFormatProvider`) is not currently raised because the tests don't trigger it; nevertheless, `DateTimeOffset.Parse` without a culture takes `CurrentCulture`, and on a locale whose `DateTimeFormatInfo` rejects the `Z` suffix or uses non-Gregorian calendar conventions, these parses can throw at test time. `WorkerClientTests.cs:327` and `FakeWorkerHarnessTests.cs:121` already added `System.Globalization.CultureInfo.InvariantCulture` in the Tests-006 fix; the other ~15 call sites did not get the same treatment.
**Recommendation:** Add `CultureInfo.InvariantCulture` to every `DateTimeOffset.Parse(...)` call in `MxGateway.Tests`, or replace with `DateTimeOffset.ParseExact` against the literal `"O"` round-trip format. A single-line `using System.Globalization;` per file keeps the call sites concise.
**Resolution:** 2026-05-20 — Added `CultureInfo.InvariantCulture` to every `DateTimeOffset.Parse` site in `MxGateway.Tests` that lacked it: 16 call sites in `DashboardSnapshotServiceTests.cs` (a new `using System.Globalization;` was added so the call sites stay concise) and one in `SessionManagerTests.cs` (using the fully-qualified `System.Globalization.CultureInfo.InvariantCulture` to match the in-file style of the existing `ManualTimeProvider` parse sites). `GalaxyHierarchyCacheTests.cs:36` was already correct from the Tests-016 rewrite. A final grep confirms every `DateTimeOffset.Parse`/`DateTime.Parse` call in `src/MxGateway.Tests` now passes `CultureInfo.InvariantCulture`.
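
For reference, the culture-independent parse looks like the following. `InvariantParseSketch` is a hypothetical helper; the tests call `DateTimeOffset.Parse` inline rather than through a wrapper.

```csharp
using System;
using System.Globalization;

public static class InvariantParseSketch
{
    // Passing InvariantCulture keeps the parse independent of the machine's
    // CurrentCulture and its calendar conventions.
    public static DateTimeOffset ParseUtcLiteral(string iso8601) =>
        DateTimeOffset.Parse(iso8601, CultureInfo.InvariantCulture);
}
```

For a literal ending in `Z`, the parsed value carries a zero offset, so comparisons against UTC instants behave the same on every test machine.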
### Tests-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `docs/GatewayTesting.md`, `code-reviews/Tests/findings.md` (Tests-002 re-triage) |
| Status | Resolved |
**Description:** The Tests-002 re-triage (2026-05-18) confirmed there is no SQL-injection surface in `GalaxyRepository` because filters are applied in memory by `GalaxyHierarchyProjector`/`GalaxyGlobMatcher` against the cached snapshot, and added 10 adversarial-input tests in `src/MxGateway.Tests/Galaxy/GalaxyFilterInputSafetyTests.cs`. That explanation lives only in the findings file; `docs/GatewayTesting.md` does not mention `GalaxyFilterInputSafetyTests`, the in-memory filter model, or the adversarial-input matrix. A future reader of the test docs will not know which tests pin the literal-filter behaviour or why the Galaxy SQL layer is not unit-tested for parameterisation. Per `CLAUDE.md` ("Update docs in the same change as the source. When public APIs, contracts, configuration, build steps, security behavior, event shapes, value conversion, status mapping, or lifecycle rules change, the affected docs must change in the same commit"), the Galaxy security-behaviour decision warrants a paragraph in `GatewayTesting.md`.
**Recommendation:** Add a short subsection to `docs/GatewayTesting.md` (probably under "Focused Commands" or a new "Galaxy Filter Safety" section) that names `GalaxyFilterInputSafetyTests`, explains that Galaxy filtering happens in memory against the cached hierarchy (so the SQL surface is constant), and lists the adversarial-input invariants the suite pins (`%`, `_`, `'`, `;`, `[abc]` are literals; the glob regex has a 100 ms timeout against pathological input).
**Resolution:** 2026-05-20 — Added a "Galaxy Filter Safety" section to `docs/GatewayTesting.md` (immediately after "Live Galaxy Repository", before "Live LDAP") that names `GalaxyFilterInputSafetyTests`, re-frames the Tests-002 finding (the Galaxy SQL surface is constant — `HierarchySql`, `AttributesSql`, `SELECT 1`, `SELECT time_of_last_deploy FROM galaxy`), explains that all filters are applied in memory by `GalaxyHierarchyProjector` / `GalaxyGlobMatcher`, lists the adversarial-input matrix (`'`, `' OR '1'='1`, `'; DROP TABLE gobject;--`, `%`, `_`, `100%_off`, `[abc]`, `Pump'001`), and enumerates the invariants the suite pins (SQL metacharacters are opaque literals, only `*`/`?` are glob wildcards, the matcher has a 100 ms regex timeout against pathological input, the projector returns zero matches / `NotFound` rather than the whole hierarchy, and the `DiscoverHierarchy` RPC end-to-end returns zero matches for adversarial globs).
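
The invariants listed above can be illustrated with a small glob matcher. This is a sketch under assumptions, not `GalaxyGlobMatcher`: `GlobSketch` is a hypothetical name, matching is case-sensitive here, and treating a regex timeout as "no match" is a choice made for the example.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class GlobSketch
{
    // Only '*' and '?' are wildcards; every other character is escaped so that
    // SQL metacharacters and regex syntax are matched literally.
    public static bool IsMatch(string glob, string candidate)
    {
        var pattern = new StringBuilder(@"\A");
        foreach (char c in glob)
        {
            pattern.Append(c switch
            {
                '*' => ".*",
                '?' => ".",
                _ => Regex.Escape(c.ToString()),
            });
        }
        pattern.Append(@"\z");

        try
        {
            // The 100 ms timeout bounds the cost of pathological patterns.
            return Regex.IsMatch(
                candidate,
                pattern.ToString(),
                RegexOptions.CultureInvariant | RegexOptions.Singleline,
                TimeSpan.FromMilliseconds(100));
        }
        catch (RegexMatchTimeoutException)
        {
            // Assumption for this sketch: a timed-out match counts as no match.
            return false;
        }
    }
}
```

With this construction, `%`, `_`, `'`, `;`, and `[abc]` only match themselves, while `Pump*` and `Pump?01` behave as ordinary globs.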
+154 -2
@@ -4,13 +4,15 @@
|---|---|
| Module | `src/MxGateway.Worker.Tests` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `6c64030` |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
### 2026-05-18 review (commit `6c64030`)
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Worker.Tests-010 (weak substring assertion), Worker.Tests-011 (test name overstates what it proves). |
@@ -24,6 +26,21 @@
| 9 | Testing coverage | Issues found: Worker.Tests-001 (`StaMessagePump` untested), Worker.Tests-002 (COM-event delivery untested), Worker.Tests-012 (frame-validation gaps). |
| 10 | Documentation & comments | Issues found: Worker.Tests-008 (misplaced redaction test), Worker.Tests-011 (misleading test name). |
### 2026-05-20 re-review (commit `1cd51bb`)
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Worker.Tests-018 (silent-skip masquerades as passing tests), Worker.Tests-024 (`Subscribe_WhenUnderlyingSubscribeThrows_DisposesConsumer` swallows the real exception type). |
| 2 | mxaccessgw conventions | Issues found: Worker.Tests-019 (`AlarmsLiveSmokeTests` uses `snake_case` outside the alarm-method scope Worker.Tests-009 corrected); pre-existing `LiveMxAccessFactAttribute` is not consumed by `MxAccessLiveComCreationTests` (Worker.Tests-018). |
| 3 | Concurrency & thread safety | Issues found: Worker.Tests-020 (`MxAccessValueCacheTests.TryWaitForUpdate_ReturnsFalseAfterDeadline_WhenNoSetOccurs` asserts wall-clock floor and pump-call lower bound). |
| 4 | Error handling & resilience | Issues found: Worker.Tests-021 (`WorkerFrameProtocolErrorCode.EndOfStream` and the writer-side `MessageTooLarge`/`InvalidEnvelope` branches are uncovered). |
| 5 | Security | Redaction coverage is sound; no new issues. |
| 6 | Performance & resource management | No new issues — `MemoryStream`/session-disposal hygiene fixes from the prior pass hold; `WorkerFrameReader` `ArrayPool` rent/return path is now regression-tested. |
| 7 | Design-document adherence | No new issues. |
| 8 | Code organization & conventions | Issues found: Worker.Tests-016 (the now-shared `MxAccessSession` reflection construction in `AlarmCommandExecutorTests` duplicates the testable surface the consolidated TestSupport folder was meant to host). |
| 9 | Testing coverage | Issues found: Worker.Tests-017 (`WorkerCancel` envelope-dispatch path untested), Worker.Tests-022 (`WnWrapAlarmConsumer.PollOnce` transition-delta computation untested at the snapshot-to-transitions level). |
| 10 | Documentation & comments | Issues found: Worker.Tests-023 (`AlarmClientWmProbeTests` and `WnWrapConsumerProbeTests` are unit-test classes carrying 1000+ lines of probe-only code; their `[Fact(Skip=...)]` status is documented but the probe scaffolding is mixed into the same test assembly as regression tests). |
## Findings
### Worker.Tests-001
@@ -250,3 +267,138 @@
**Recommendation:** Add a `Drain(0)` drain-all test and an empty-queue drain test.
**Resolution:** 2026-05-18 — Added three tests to `MxAccessEventQueueTests`. `Drain_WithZeroMaxEvents_DrainsAllEvents` covers the `maxEvents == 0` drain-all branch in `MxAccessEventQueue.Drain` (verified at `src/MxGateway.Worker/MxAccess/MxAccessEventQueue.cs:174`) — three events enqueued, `Drain(0)` returns all three in order and empties the queue. `Drain_WhenQueueIsEmpty_ReturnsEmptyList` covers the `drainCount == 0` early-return branch for both `Drain(0)` and `Drain(5)` on an empty queue. `Enqueue_AfterRecordFault_ThrowsInvalidOperationException` covers the backpressure contract gap the finding flagged — after a manual `RecordFault`, `Enqueue` throws `InvalidOperationException` ("outbound event queue is faulted") and the event is not queued.
### Worker.Tests-016
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Worker.Tests/MxAccess/AlarmCommandExecutorTests.cs:317-393` |
| Status | Resolved |
**Description:** `AlarmCommandExecutorTests` reaches into `MxAccessSession` via reflection (`typeof(MxAccessSession).GetConstructor(BindingFlags.NonPublic | BindingFlags.Instance, ..., new[] { typeof(object), typeof(IMxAccessServer), typeof(IMxAccessEventSink), typeof(MxAccessHandleRegistry), typeof(MxAccessValueCache), typeof(int) }, ...)`) and provides an inline `NullMxAccessServer` no-op implementing every `IMxAccessServer` method. The XML doc admits the reflection-based path is fragile (`"MxAccessSession private ctor signature changed; update the test seam."`). The same `NullMxAccessServer` shape is reinventable wherever an executor is exercised in isolation; the consolidated `TestSupport` namespace introduced in Worker.Tests-014 was the natural home for it, but the no-op server lives in a single test file's private nested class instead. A future change to the private ctor signature breaks this one test in a way that requires re-reading the reflection call to diagnose, and a second test that wants the same no-op surface will reflectively duplicate it.
**Recommendation:** Either (a) add a non-reflective seam — a constructor or static factory marked `internal`-with-`InternalsVisibleTo` that takes `IMxAccessServer` + the existing dependencies, removing the reflection — or (b) move the `NullMxAccessServer` no-op and the reflection helper into `TestSupport/NoopMxAccessSession.cs` so any future test can share it and a ctor change is fixed in one place.
**Resolution:** 2026-05-20 — Took option (a) plus option (b). Added a non-reflective `internal static MxAccessSession.CreateForTesting(IMxAccessServer, IMxAccessEventSink, MxAccessHandleRegistry?, MxAccessValueCache?, int?)` factory in `src/MxGateway.Worker/MxAccess/MxAccessSession.cs` (lines 61-88), gated through the pre-existing `<InternalsVisibleTo Include="MxGateway.Worker.Tests" />` in `src/MxGateway.Worker/MxGateway.Worker.csproj`. `AlarmCommandExecutorTests.NewExecutor` now calls `MxAccessSession.CreateForTesting(new NoopMxAccessServer(), new NoopEventSink())` — no `GetConstructor`/`Invoke`/`BindingFlags` anywhere in the file. The previously per-file `NullMxAccessServer` no-op was extracted to the shared `src/MxGateway.Worker.Tests/TestSupport/NoopMxAccessServer.cs` (matching the `TestSupport` consolidation introduced in Worker.Tests-014); the XML doc on the new file explicitly cites Worker.Tests-016 for the rationale. A future change to the `MxAccessSession` private ctor signature now updates `CreateForTesting` in one place; the test file does not need to be edited.
### Worker.Tests-017
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs` |
| Status | Resolved |
**Description:** `WorkerPipeSession.DispatchGatewayEnvelopeAsync` (`src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:365-385`) has three documented branches: `WorkerCommand`, `WorkerShutdown`, and `WorkerCancel`. `WorkerPipeSessionTests` exercises the first two but never sends a `WorkerCancel` envelope, so the `_runtimeSession?.CancelCommand(envelope.CorrelationId)` path and the contract that the session forwards a cancel without faulting the pipe are uncovered. The `default:` arm (`UnexpectedEnvelopeBody` exception) is also uncovered — a gateway sending the wrong body case (e.g. another `GatewayHello` after the handshake) should produce a `ProtocolViolation` fault but no test asserts this.
**Recommendation:** Add two tests: one that writes a `WorkerCancel` envelope with a known correlation id and asserts `FakeRuntimeSession.CancelCommand` was called with that id (extend the shared `FakeRuntimeSession` to record cancel-correlation-ids); one that writes a post-handshake `GatewayHello` envelope and asserts the session writes a `WorkerFault` with category `ProtocolViolation` and exits the message loop.
**Resolution:** 2026-05-20 — Added two `[Fact]`s to `WorkerPipeSessionTests` and the supporting state to the shared `FakeRuntimeSession`. (1) `RunAsync_WhenGatewaySendsWorkerCancel_ForwardsCorrelationIdToRuntimeSession` writes a `WorkerCancel` envelope with correlation id `"cancel-correlation-1"` after the handshake, then drives a normal shutdown via `SendShutdownAndWaitAsync` — observing the shutdown ack proves the message loop kept running (no fault, no exit) and `Assert.Contains("cancel-correlation-1", runtime.CancelledCorrelationIds)` proves the cancel reached `IWorkerRuntimeSession.CancelCommand`. The shared `FakeRuntimeSession` was extended with a `CancelledCorrelationIds` snapshot list and an optional `CancelCommandReturnValue` (defaulting to `false`, preserving the prior behaviour). (2) `RunAsync_WhenGatewaySendsUnexpectedEnvelopeBodyAfterHandshake_ThrowsAndExitsMessageLoop` writes a second `GatewayHello` envelope post-handshake — valid envelope, invalid body case for the message-loop state — and asserts `Assert.ThrowsAsync<WorkerFrameProtocolException>(async () => await runTask)` with `ErrorCode == WorkerFrameProtocolErrorCode.UnexpectedEnvelopeBody`. Re-triage: the original recommendation said "the session writes a `WorkerFault` with category `ProtocolViolation`", but the source at `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:380-384` shows the `default:` arm throws `WorkerFrameProtocolException`; `RunMessageLoopAsync` has no fault-writing catch (only `CompleteStartupHandshakeAsync` writes faults during the handshake). The test XML doc records this — the contract pinned is the exception type/error-code and the message-loop exit, not a fault frame.
### Worker.Tests-018
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker.Tests/MxAccess/MxAccessLiveComCreationTests.cs:18-31, 35-73, 75-145, 148-220, 222-342` |
| Status | Resolved |
**Description:** Every `[Fact]` in `MxAccessLiveComCreationTests` gates on `RunLiveMxAccessTests()` and `return`s silently when the opt-in env var is not set. xUnit reports a `Fact` that returns normally as **passed**, so a CI run without `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1` shows five green "live MXAccess" tests that did not run a single line of MXAccess code. `docs/GatewayTesting.md` and the `IntegrationTests` project already provide the correct pattern — `LiveMxAccessFactAttribute` (in `src/MxGateway.IntegrationTests/LiveMxAccessFactAttribute.cs`) emits xUnit's native `Skipped` status when the env var is absent — but `MxAccessLiveComCreationTests` does not consume it, so the gate is invisible in test output. The first test (`StartAsync_WhenOptedIn_CreatesInstalledMxAccessComObjectOnSta`) additionally inlines the env-var check (`string.Equals(Environment.GetEnvironmentVariable(...), "1", StringComparison.Ordinal)`) instead of using the local `RunLiveMxAccessTests()` helper, so the convention is inconsistent even within the same file.
**Recommendation:** Move `LiveMxAccessFactAttribute` into a shared location both projects can reference (e.g. `MxGateway.Contracts.TestSupport` or a new `MxGateway.TestSupport` shared project), and decorate the five `MxAccessLiveComCreationTests` methods with `[LiveMxAccessFact]` instead of `[Fact]`. Drop the inline env-var checks. Skipped runs will then report `Skipped` rather than `Passed`, and CI will distinguish "live MXAccess unavailable" from "live MXAccess opted in, succeeded".
**Resolution:** 2026-05-20 — Added a self-contained `LiveMxAccessFactAttribute` at `src/MxGateway.Worker.Tests/TestSupport/LiveMxAccessFactAttribute.cs` (namespace `MxGateway.Worker.Tests.TestSupport`) that mirrors the `MxGateway.IntegrationTests` attribute: when `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS` is not `1`, the attribute sets `Skip` so xUnit emits a native `Skipped` result rather than a misleading `Passed`. All five `MxAccessLiveComCreationTests` methods now use `[LiveMxAccessFact]`; the inline env-var check at the top of `StartAsync_WhenOptedIn_CreatesInstalledMxAccessComObjectOnSta` and the per-method `if (!RunLiveMxAccessTests()) return;` silent-returns were deleted. The worker tests target net48/x86 and the integration tests target net10.0, so introducing a cross-project shared assembly was not practical; the Worker.Tests attribute is a near-duplicate of the IntegrationTests attribute and the XML doc on the new file calls this out so the next reviewer understands why two copies exist. xUnit output now reports the five live tests as `[SKIP]` when the env var is absent — `dotnet test ...` shows `Skipped: 9, Total: 274`, with the five `MxAccessLiveComCreationTests` correctly counted as skipped rather than passed.
### Worker.Tests-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | mxaccessgw conventions |
| Location | `src/MxGateway.Worker.Tests/AlarmsLiveSmokeTests.cs:45`, `src/MxGateway.Worker.Tests/AlarmClientWmProbeTests.cs:143`, `src/MxGateway.Worker.Tests/WnWrapConsumerProbeTests.cs:55` |
| Status | Resolved |
**Description:** Worker.Tests-009 renamed every `snake_case` alarm-test method to the project's `Method_Scenario_Expectation` convention, but the rename missed the dev-rig probe and live-smoke `[Fact]`s in the `MxGateway.Worker.Tests` root (not under `MxAccess/`): `AlarmsLiveSmokeTests.Alarms_full_pipeline_round_trip`, `AlarmClientWmProbeTests.Probe_AlarmClient_for_alarm_messages` (and its helpers), and `WnWrapConsumerProbeTests.ProbeWnWrapConsumer`. These are `[Fact(Skip=...)]` so they never execute in normal CI, but they still drift from `docs/style-guides/CSharpStyleGuide.md` and contradict the resolution claim in Worker.Tests-009 that "every `[Fact]`/`[Theory]` method in the five alarm test files" was renamed.
**Recommendation:** Rename `Alarms_full_pipeline_round_trip` → `Alarms_FullPipelineRoundTrip_RaisesAndAcknowledges` (or similar `Method_Scenario_Expectation` form) and apply the same convention to the two probe methods. xUnit discovers by attribute, not name, so renames are behaviour-neutral.
**Resolution:** 2026-05-20 — Renamed the three `snake_case` probe/smoke `[Fact]` methods to the project's `Method_Scenario_Expectation` PascalCase convention: `Alarms_full_pipeline_round_trip` → `Alarms_FullPipelineRoundTrip_RaisesAndAcknowledges` (in `Probes/AlarmsLiveSmokeTests.cs`), `ProbeAlarmClientWmMessages` → `ProbeAlarmClient_OnDevRig_LogsAlarmWindowMessages` (in `Probes/AlarmClientWmProbeTests.cs`), and `ProbeWnWrapConsumer` → `ProbeWnWrapConsumer_OnDevRig_LogsXmlAlarmStream` (in `Probes/WnWrapConsumerProbeTests.cs`). The three files have moved to `Probes/` as part of Worker.Tests-023; the location columns above predate that move. xUnit discovers tests by attribute, so the renames are behaviour-neutral and the `Skip` strings still apply unchanged.
### Worker.Tests-020
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Worker.Tests/MxAccess/MxAccessValueCacheTests.cs:88-108` |
| Status | Resolved |
**Description:** `TryWaitForUpdate_ReturnsFalseAfterDeadline_WhenNoSetOccurs` asserts both a lower wall-clock bound (`stopwatch.ElapsedMilliseconds >= 60`, deadline was 80ms) and `pumpCalls > 1`. The 60ms floor is the same class of timing race Worker.Tests-003/004/013 corrected elsewhere: on a loaded CI agent a `Task.Run` scheduling delay can push the wait's start past the deadline so the loop runs zero or one iteration, the wait returns slightly before the 60ms floor, and the test fails through no fault of the production code. The `pumpCalls > 1` check additionally races against the same scheduler — if the agent stalls the wait thread, `pumpStep` might fire only once before the deadline. The test's purpose (verifying the timeout is honoured and pump-step is invoked) is sound, but the assertions are wall-clock floors rather than deterministic checks.
**Recommendation:** Drop the elapsed-time floor and the `pumpCalls > 1` assertion; verify only that `result` is false, `value` is default, and `pumpCalls >= 1` (the pump must fire at least once, but not "more than once"). The fact that `TryWaitForUpdate` returned false after the deadline is the contract the test exists to pin; the timing strictness is incidental.
**Resolution:** 2026-05-20 — Eliminated the wall-clock dependency entirely (the equivalent of a manual time source for the `DateTime.UtcNow`-based deadline). The test now passes `DateTime.UtcNow.AddMilliseconds(-1)` — a deadline already in the past — so `TryWaitForUpdate`'s loop pumps once, immediately observes the elapsed deadline, and returns false with zero `Thread.Sleep`. The `Stopwatch`/`stopwatch.ElapsedMilliseconds >= 60` floor and the `pumpCalls > 1` strict-inequality assertions are gone. With an already-expired deadline the contract is deterministic: exactly one pump call (the loop must pump before checking the deadline so MXAccess messages can dispatch on the calling thread even when the deadline has just expired), `result == false`, `value` is default. Matches the pattern Worker.Tests-003/004/013 used — drop wall-clock floor checks in favour of a deterministic signal.
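The already-expired-deadline technique can be sketched as follows. This is a simplified stand-in, not the real `MxAccessValueCache.TryWaitForUpdate`: the class `WaitSketch`, its signature, and the `hasValue` callback are hypothetical, and the only behaviour taken from the resolution is that the loop pumps before it checks the deadline.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the wait loop; not the real MxAccessValueCache code.
public static class WaitSketch
{
    // Pumps first, then looks for a value, then checks the deadline.
    // With a deadline already in the past this pumps exactly once and never sleeps.
    public static bool TryWaitForUpdate(Func<bool> hasValue, DateTime deadlineUtc, Action pumpStep)
    {
        while (true)
        {
            pumpStep();                                        // let pending messages dispatch
            if (hasValue()) return true;                       // a value arrived
            if (DateTime.UtcNow >= deadlineUtc) return false;  // timed out
            Thread.Sleep(1);                                   // only reached while the deadline is still ahead
        }
    }
}
```

Calling it with `DateTime.UtcNow.AddMilliseconds(-1)` and a `hasValue` that always returns false gives a result that does not depend on scheduler timing: one pump call and a `false` return.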
### Worker.Tests-021
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Worker.Tests/Ipc/WorkerFrameProtocolTests.cs` |
| Status | Resolved |
**Description:** `WorkerFrameProtocolTests` covers `MalformedLength`, `MessageTooLarge` (read-side, added in Worker.Tests-012), `ProtocolVersionMismatch`, `SessionMismatch`, and `InvalidEnvelope` on `WorkerFrameReader`. Three documented protocol-error branches remain uncovered: (1) `WorkerFrameProtocolErrorCode.EndOfStream` from `WorkerFrameReader.ReadExactlyOrThrowAsync` (`src/MxGateway.Worker/Ipc/WorkerFrameReader.cs:106`) when the stream closes mid-frame — important because the gateway closing its end of the pipe during a partial read is the most common production transport failure; (2) `WorkerFrameWriter` rejecting an envelope whose `CalculateSize()` returns 0 with `WorkerFrameProtocolErrorCode.InvalidEnvelope` (`WorkerFrameWriter.cs:46`); (3) `WorkerFrameWriter` rejecting an envelope larger than `MaxMessageBytes` with `WorkerFrameProtocolErrorCode.MessageTooLarge` (`WorkerFrameWriter.cs:53`). The writer-side checks defend against a session that constructs a too-large envelope before sending it down the pipe — completely separate from the reader-side bounds the existing tests pin.
**Recommendation:** Add three tests: (a) `ReadAsync_WhenStreamEndsMidFrame_ThrowsEndOfStream` — feed a 4-byte length prefix declaring 100 bytes followed by only 50 bytes, assert `EndOfStream`; (b) `WriteAsync_WithEnvelopeAboveConfiguredMaximum_ThrowsMessageTooLarge` — construct `WorkerFrameProtocolOptions` with a small `MaxMessageBytes` and an envelope whose serialised size exceeds it, assert `MessageTooLarge`; (c) since `WorkerEnvelope.CalculateSize()` never returns 0 for a valid envelope (the protocol version field alone serializes), the `InvalidEnvelope` writer branch is genuinely unreachable in normal operation — either document this as defensive code that is intentionally untestable, or drop the check.
**Resolution:** 2026-05-20 — Added three `[Fact]`s to `WorkerFrameProtocolTests.cs` for the three uncovered protocol-error branches. (a) `ReadAsync_WhenStreamEndsMidFrame_ThrowsEndOfStream` builds a 4-byte length prefix declaring 100 bytes followed by only 50 bytes, drives `WorkerFrameReader.ReadAsync` against it, and asserts `WorkerFrameProtocolErrorCode.EndOfStream` — pins the gateway-closes-mid-read transport failure. (b) `WriteAsync_WithEnvelopeAboveConfiguredMaximum_ThrowsMessageTooLarge` constructs `WorkerFrameProtocolOptions` with `MaxMessageBytes=64`, builds a `GatewayHello` envelope whose `GatewayVersion` is padded to 1024 bytes, asserts `WorkerFrameProtocolErrorCode.MessageTooLarge` and that the stream stayed empty (zero bytes written). (c) `WriteAsync_WithEmptyEnvelope_ThrowsInvalidEnvelopeFromValidator` exercises the body-less path — `WorkerEnvelopeValidator.Validate` runs first and rejects an envelope whose `BodyCase` is `None` with `InvalidEnvelope`, so the `CalculateSize()==0` branch is intercepted before it fires; the XML doc explicitly documents that the defensive zero-length branch is unreachable through public API but is left in place as a one-comparison safety net against future serialisation regressions. Net change: three new tests, all green; the reader-side `EndOfStream` plus writer-side `MessageTooLarge`/`InvalidEnvelope` rejections are now regression-protected.
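The mid-frame case in (a) can be illustrated with a minimal synchronous sketch. It is not the real `WorkerFrameReader`, which is asynchronous and validates more; `FrameSketch`, `EndOfStreamSketchException`, and the little-endian byte order of the 4-byte prefix are assumptions made for illustration only.

```csharp
using System;
using System.IO;

// Hypothetical exception standing in for the EndOfStream protocol error.
public sealed class EndOfStreamSketchException : Exception
{
    public EndOfStreamSketchException(string message) : base(message) { }
}

public static class FrameSketch
{
    // Fills the whole buffer or throws when the stream closes early.
    private static void ReadExactlyOrThrow(Stream stream, byte[] buffer)
    {
        int offset = 0;
        while (offset < buffer.Length)
        {
            int read = stream.Read(buffer, offset, buffer.Length - offset);
            if (read == 0)
            {
                throw new EndOfStreamSketchException(
                    "stream ended after " + offset + " of " + buffer.Length + " bytes");
            }
            offset += read;
        }
    }

    // Assumed layout: 4-byte little-endian length prefix, then the payload.
    public static byte[] ReadFrame(Stream stream)
    {
        byte[] prefix = new byte[4];
        ReadExactlyOrThrow(stream, prefix);
        int length = prefix[0] | (prefix[1] << 8) | (prefix[2] << 16) | (prefix[3] << 24);
        if (length < 0) throw new InvalidDataException("negative frame length");
        byte[] payload = new byte[length];
        ReadExactlyOrThrow(stream, payload);
        return payload;
    }
}
```

Feeding it a `MemoryStream` whose prefix declares 100 bytes but which carries only 50 reproduces the truncated-frame scenario: the second `ReadExactlyOrThrow` call throws once `Stream.Read` returns 0.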
### Worker.Tests-022
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/MxGateway.Worker.Tests/MxAccess/WnWrapAlarmConsumerXmlTests.cs` |
| Status | Resolved |
**Description:** `WnWrapAlarmConsumerXmlTests` covers `ParseSnapshotXml` and `TryParseHexGuid` directly — the pure-helper layer — and pins the no-internal-timer Worker-001 invariant via reflection. The `PollOnce` transition-delta logic (`src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:289-337`) is what actually turns "snapshot N to snapshot N+1" into `MxAlarmTransitionEvent` instances, and is the only place the consumer makes state-management decisions: skip-when-state-unchanged, fire-with-previous-state-Unspecified for first sighting, and (implicitly) drop entries that vanished from the new snapshot. None of these branches are exercised — the live-smoke `AlarmsLiveSmokeTests` covers the end-to-end pipeline but is `[Fact(Skip=...)]` against the dev rig, so there is no in-CI coverage of "snapshot delta computation produces the right transitions" at all. A regression that, for example, emits a transition every poll regardless of state-change would slip through.
**Recommendation:** Refactor `PollOnce`'s snapshot-diff loop into a pure `internal static IReadOnlyList<MxAlarmTransitionEvent> ComputeTransitions(Dictionary<Guid,MxAlarmSnapshotRecord> previous, Dictionary<Guid,MxAlarmSnapshotRecord> next)` and add direct unit tests: (a) new entry produces `PreviousState=Unspecified`; (b) state-unchanged produces no transition; (c) state-changed produces a transition with the prior state; (d) entry vanished from `next` produces no transition (an alarm cleared from the active set; the snapshot just no longer mentions it). `MxAccessStaSession` already drives the COM-side polling, so the diff is genuinely independent of any COM dependency.
**Resolution:** 2026-05-20 — Extracted the snapshot-diff loop from `WnWrapAlarmConsumer.PollOnce` into a pure `internal static IReadOnlyList<MxAlarmTransitionEvent> ComputeTransitions(Dictionary<Guid,MxAlarmSnapshotRecord> previous, Dictionary<Guid,MxAlarmSnapshotRecord> next)` in `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs`. `PollOnce` now calls `ComputeTransitions` under the same `syncRoot` lock; the diff rules are unchanged. Added five `[Fact]`s in `WnWrapAlarmConsumerXmlTests.cs` exercising all four branches plus a multi-alarm fan-out case: `ComputeTransitions_WhenAlarmIsNewInNextSnapshot_EmitsTransitionWithUnspecifiedPreviousState`, `ComputeTransitions_WhenAlarmStateUnchanged_EmitsNoTransition`, `ComputeTransitions_WhenAlarmStateChanged_EmitsTransitionWithPriorState`, `ComputeTransitions_WhenAlarmDroppedFromActiveSet_EmitsNoTransition`, and `ComputeTransitions_WithMixedDelta_EmitsOnlyNewAndChangedTransitions`. Each test drives the function with `Dictionary<Guid,MxAlarmSnapshotRecord>` snapshots built from a `NewRecord` helper — no COM, no STA. A regression that emits a transition every poll regardless of state, swaps the previous/next ordering, or treats a dropped alarm as a transition now fails in-CI.
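The four diff rules can be sketched in a few lines. The types below (`SketchAlarmState`, `SketchTransition`, `TransitionSketch`) are hypothetical simplifications: the real `MxAlarmSnapshotRecord` and `MxAlarmTransitionEvent` carry more fields, and the snapshot here maps an alarm id directly to a state.

```csharp
using System;
using System.Collections.Generic;

public enum SketchAlarmState { Unspecified, Active, Acknowledged }

// Hypothetical, reduced stand-in for a transition event.
public sealed class SketchTransition
{
    public SketchTransition(Guid alarmId, SketchAlarmState previous, SketchAlarmState current)
    {
        AlarmId = alarmId;
        Previous = previous;
        Current = current;
    }

    public Guid AlarmId { get; }
    public SketchAlarmState Previous { get; }
    public SketchAlarmState Current { get; }
}

public static class TransitionSketch
{
    // Pure function: compares two snapshots and returns the transitions between them.
    public static IReadOnlyList<SketchTransition> ComputeTransitions(
        Dictionary<Guid, SketchAlarmState> previous,
        Dictionary<Guid, SketchAlarmState> next)
    {
        var transitions = new List<SketchTransition>();
        foreach (KeyValuePair<Guid, SketchAlarmState> entry in next)
        {
            if (!previous.TryGetValue(entry.Key, out SketchAlarmState prior))
            {
                // First sighting: the previous state is reported as Unspecified.
                transitions.Add(new SketchTransition(entry.Key, SketchAlarmState.Unspecified, entry.Value));
            }
            else if (prior != entry.Value)
            {
                // State changed: report the prior state alongside the new one.
                transitions.Add(new SketchTransition(entry.Key, prior, entry.Value));
            }
            // Unchanged state: nothing is emitted.
        }

        // Alarms present only in 'previous' are never visited, so they emit nothing.
        return transitions;
    }
}
```

Because the function touches only its two arguments, tests can build the dictionaries inline and need neither COM nor an STA thread.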
### Worker.Tests-023
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Worker.Tests/AlarmClientWmProbeTests.cs` (779 lines), `src/MxGateway.Worker.Tests/WnWrapConsumerProbeTests.cs` (287 lines), `src/MxGateway.Worker.Tests/AlarmsLiveSmokeTests.cs` (270 lines) |
| Status | Resolved |
**Description:** Three large dev-rig "probe" files are mixed into the worker unit-test project but are not unit tests in the usual sense: each is a `[Fact(Skip="Runtime probe — flip Skip=null on the dev rig (AVEVA installed)...")]` driver that runs hundreds of seconds, opens real Galaxy subscriptions, posts Windows messages on STA threads, captures alarm payloads to `ITestOutputHelper`, and exists to document AVEVA COM behaviour rather than gate it. `AlarmClientWmProbeTests` alone is 779 lines — larger than every genuine unit-test file in the project. Together these files contribute 1300+ lines of probe scaffolding that anyone inspecting the project to learn what `Worker.Tests` is for has to wade through. The Skip-attribute strings document why they exist, but a colocated `docs/AlarmProbes.md` (or moving the probes to a separate `MxGateway.Worker.Probes` non-test assembly) would make the distinction explicit and stop the probe files from inflating `Worker.Tests`' build/test surface.
**Recommendation:** Either (a) carve the three probe files out into `src/MxGateway.Worker.Probes/` (a separate project the dev-rig user opts into; the assembly references stay the same), or (b) move them into a `Probes/` subfolder inside `MxGateway.Worker.Tests` and add a one-paragraph header in `docs/GatewayTesting.md` describing the probe surface. Option (a) is cleaner because the live-smoke `AlarmsLiveSmokeTests` already references `WnWrapAlarmConsumer` directly and would naturally cohabit with the other AVEVA-COM probes.
**Resolution:** 2026-05-20 — Took option (b): moved `AlarmClientWmProbeTests.cs`, `WnWrapConsumerProbeTests.cs`, and `AlarmsLiveSmokeTests.cs` from `src/MxGateway.Worker.Tests/` into a new `src/MxGateway.Worker.Tests/Probes/` subfolder. The files keep their existing namespace (`MxGateway.Worker.Tests`) and their `[Fact(Skip=...)]` gating; the SDK-style project picks them up under the new path without a `.csproj` change. Option (b) was chosen over (a) because the probes still rely on the same test-project package references (`xunit`, `Microsoft.NET.Test.Sdk`, `Xunit.Abstractions`) plus the `Interop.WNWRAPCONSUMERLib`/`ArchestrA.MxAccess`/`aaAlarmManagedClient`/`IAlarmMgrDataProvider` references already declared in `MxGateway.Worker.Tests.csproj`; a separate `MxGateway.Worker.Probes` project would have to duplicate every one of these. The probes remain runnable on the dev rig by flipping `Skip=null` exactly as before. The `Worker.Tests` root listing now contains only genuine unit-test/regression files; probe scaffolding is visibly partitioned by directory.
### Worker.Tests-024
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker.Tests/MxAccess/AlarmCommandHandlerTests.cs:42-54` |
| Status | Resolved |
**Description:** `Subscribe_WhenUnderlyingSubscribeThrows_DisposesConsumer` asserts that an exception during `IMxAccessAlarmConsumer.Subscribe` triggers consumer disposal. The fake throws `new InvalidOperationException("simulated wnwrap subscribe failure")` and the test asserts `Assert.Throws<InvalidOperationException>(() => handler.Subscribe(...))`. But `AlarmCommandHandler.Subscribe` (`src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:65-93`) wraps the underlying call and re-throws, so an `InvalidOperationException` from any code path inside `Subscribe` (e.g. its own "already subscribed" guard at line 73) would also satisfy the type assertion. The disposal behaviour itself is pinned: if `AlarmCommandHandler` regressed to throw before reaching the consumer, `consumer.Disposed` would be false and the test's `consumer.Disposed` assertion would fail. The genuine weakness is that the test does not pin the exception message ("simulated wnwrap subscribe failure"), so an unexpected `InvalidOperationException` from a different branch with a misleading message would pass without anyone noticing the handler swallowed the real failure cause.
**Recommendation:** Strengthen to `InvalidOperationException exception = Assert.Throws<InvalidOperationException>(...); Assert.Contains("simulated wnwrap subscribe failure", exception.Message)` — pin both the type and the originating message so a regression that throws a *different* `InvalidOperationException` from inside `AlarmCommandHandler` fails the test.
**Resolution:** 2026-05-20 — `Subscribe_WhenUnderlyingSubscribeThrows_DisposesConsumer` now captures the thrown exception and asserts `Assert.Contains("simulated wnwrap subscribe failure", exception.Message)` against the fake's exact thrown message. A regression that throws a *different* `InvalidOperationException` from inside `AlarmCommandHandler` (for example its own "already subscribed" guard at line 73 of `AlarmCommandHandler.cs`) now fails the message-contains assertion — the original test's type-only `Assert.Throws<InvalidOperationException>` would have passed silently while hiding the swallowed failure cause. The disposal assertion (`consumer.Disposed == true`) is unchanged; the test now pins both the disposal contract and the origin of the propagated exception. XML doc on the test method documents the regression scenario.
+121 -12
@@ -4,25 +4,27 @@
|---|---|
| Module | `src/MxGateway.Worker` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `6c64030` |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
This table reflects the 2026-05-20 re-review at commit `1cd51bb`. Worker-001..015 are all closed; each row summarises only the new findings filed against this branch.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: heartbeat loop sleeps before first beat (Worker-002), `ProcessCommandAsync` state race drops replies (Worker-003), watchdog/heartbeat state inconsistency (Worker-004), double-dispose path (Worker-006), plus Worker-010/011/015. |
| 2 | mxaccessgw conventions | Issue found: Worker-007 (reflection-based COM invocation bypasses the typed interface contract). |
| 3 | Concurrency & thread safety | Issues found: Worker-001 (`WnWrapAlarmConsumer` timer fires COM off the STA), Worker-008 (consumer factory STA-affinity not enforced). |
| 4 | Error handling & resilience | Issue found: Worker-005 (`OnPoll` silently swallows all poll failures). |
| 5 | Security | No secret logging (redaction applied); inbound frame validation reasonable. No issues found. |
| 6 | Performance & resource management | Issue found: Worker-009 (per-frame `byte[]` allocations on the hot event path). COM release is correct. |
| 7 | Design-document adherence | Code matches `WorkerSta.md`/`WorkerFrameProtocol.md`; stale alarm-path docs (Worker-012). |
| 8 | Code organization & conventions | Issue found: Worker-014 (`AlarmCommandHandler.cs` declares two public types in one file). |
| 9 | Testing coverage | Issue found: Worker-013 (`StaMessagePump` has no direct tests; poll-loop lifecycle untested). |
| 10 | Documentation & comments | Issue found: Worker-012 (stale "future PR / A.3" comments now describe shipped code). |
| 1 | Correctness & logic bugs | Issues found: Worker-018 (`SetXmlAlarmQuery` return code ignored), Worker-019 (`subscriptionExpression` is write-only dead state), Worker-020 (dead `ExecutingCommand` arm in `ProcessCommandAsync` state check), Worker-021 (`InitializeMxAccessAsync` can overwrite an already-set `_runtimeSession`). |
| 2 | mxaccessgw conventions | Issue found: Worker-022 (`MxAlarmSnapshot.cs` declares three public types in one file). |
| 3 | Concurrency & thread safety | Issue found: Worker-016 (`RunAlarmPollLoopAsync` swallows the `EnsureOnAlarmConsumerThread` assertion as part of its generic `InvalidOperationException` catch, defeating Worker-008's invariant). |
| 4 | Error handling & resilience | Issue found: Worker-017 (long-running commands like `ReadBulk` cannot mark STA activity, so the heartbeat watchdog can fire `StaHung` while a command is legitimately executing — `CurrentCommandCorrelationId` is non-empty in the heartbeat but ignored by the watchdog). |
| 5 | Security | No secret logging (redaction applied); inbound frame validation reasonable; secured-write user IDs do not leak through reply diagnostics. No new issues found. |
| 6 | Performance & resource management | Frame I/O uses pooled buffers (Worker-009 resolved); STA ownership and COM final-release are correct. No new issues found. |
| 7 | Design-document adherence | Code matches `gateway.md` / `MxAccessWorkerInstanceDesign.md` / `WorkerFrameProtocol.md`. No new design drift. |
| 8 | Code organization & conventions | Issue found: Worker-022 (see row 2). |
| 9 | Testing coverage | `RunAlarmPollLoop_WhenPollOnceThrows_RecordsFaultOnEventQueue` exists but uses a `COMException`; the `InvalidOperationException` arm raised by Worker-016 is not exercised. No standalone finding (subsumed by Worker-016's recommendation to add a regression test). |
| 10 | Documentation & comments | `RunAlarmPollLoopAsync`'s "STA runtime shutting down — stop the loop gracefully" comment is misleading once Worker-016 is considered (the catch also swallows STA-affinity violations). Noted in Worker-016. |
## Findings
@@ -258,3 +260,110 @@
**Recommendation:** Add a brief comment in `EnqueueEvent` clarifying that an overflow exception is expected and already self-records its fault, so the catch is intentionally a near no-op.
**Resolution:** 2026-05-18 — Added a comment in `MxAccessBaseEventSink.EnqueueEvent`'s catch block (per the finding's recommendation) explaining that two distinct fail-fast failures land there: a conversion failure from `createEvent()` (recorded here as an `MxaccessEventConversionFailed` fault) and an `MxAccessEventQueueOverflowException` from `Enqueue` at capacity, which — per the fail-fast backpressure design in `docs/DesignDecisions.md` — drops the event and has *already* self-recorded a `QueueOverflow` fault inside `Enqueue`. Because `MxAccessEventQueue.RecordFault` keeps only the first fault, the catch's `RecordFault` call is then a deliberate near no-op rather than a second, conflicting fault. Pure comment change as recommended — no behavior altered. `docs/DesignDecisions.md` already documents the fail-fast event backpressure rule, so no doc change was required.
### Worker-016
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:261-265` |
| Status | Resolved |
**Description:** `RunAlarmPollLoopAsync` catches `InvalidOperationException` and silently returns with the rationale "STA runtime shutting down — stop the loop gracefully". The same catch arm, however, also swallows the `InvalidOperationException` thrown by `EnsureOnAlarmConsumerThread()` / `AssertOnAlarmConsumerThread()` — the STA-affinity guard added under Worker-008. If the alarm poll ever ran on the wrong thread (a regression of the STA-affinity invariant), the assertion would fire, the loop would silently stop, no fault would be recorded, and the only observable symptom would be alarms no longer flowing. The assertion exists to catch a programming error early; this catch defeats it.
**Recommendation:** Either tighten the `InvalidOperationException` catch so it only swallows the STA-runtime-shutting-down sentinel (e.g. match on the exception message produced by `StaRuntime.InvokeAsync`, or have the STA runtime throw a dedicated exception type for shutdown), or rethrow / record-a-fault for `InvalidOperationException`s whose message does not match the shutdown sentinel. Add a regression test that drives `RunAlarmPollLoopAsync` with a handler that throws `InvalidOperationException` from `PollOnce` and asserts the loop records a fault rather than silently exiting.
**Resolution:** 2026-05-20 — Introduced a dedicated `StaRuntimeShutdownException` (`src/MxGateway.Worker/Sta/StaRuntimeShutdownException.cs`) that `StaRuntime.InvokeAsync` and the queue-enqueue path now throw in place of a generic `InvalidOperationException` when `shutdownRequested` is set. `RunAlarmPollLoopAsync` in `MxAccessStaSession.cs:258-291` now catches `StaRuntimeShutdownException` (graceful stop, returns silently) separately from the generic `Exception` arm, which records the fault on the event queue. An STA-affinity `InvalidOperationException` from `EnsureOnAlarmConsumerThread` therefore now falls through to the fault path and becomes observable on the IPC fault path instead of silently terminating alarm delivery. Verified: `dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86` clean (0 warnings). Regression coverage in `MxAccessStaSessionTests.cs` exercises both the graceful-shutdown and the affinity-violation paths.
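The split between a graceful stop and a recorded fault can be sketched as below. The names `ShutdownSketchException` and `PollLoopSketch` are hypothetical, the loop is synchronous for brevity, and a plain `List<string>` stands in for the event queue's fault recording; none of this is the actual `MxAccessStaSession` code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a dedicated shutdown exception type.
public sealed class ShutdownSketchException : Exception
{
    public ShutdownSketchException() : base("STA runtime is shutting down") { }
}

public static class PollLoopSketch
{
    // Runs pollOnce until it throws. A shutdown exception ends the loop quietly;
    // any other exception is recorded as a fault instead of being swallowed.
    public static void Run(Action pollOnce, List<string> faults)
    {
        try
        {
            while (true)
            {
                pollOnce();
            }
        }
        catch (ShutdownSketchException)
        {
            // Graceful stop: no fault recorded.
        }
        catch (Exception ex)
        {
            // Includes InvalidOperationException from a thread-affinity guard.
            faults.Add(ex.GetType().Name + ": " + ex.Message);
        }
    }
}
```

The point of the dedicated type is that the first `catch` can no longer match an unrelated `InvalidOperationException`, so an affinity violation reaches the second arm and becomes visible.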
### Worker-017
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Worker/Sta/StaRuntime.cs:280-288`, `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:602-631` |
| Status | Resolved |
**Description:** `StaRuntime.ProcessQueuedCommands` calls `MarkActivity()` only before and after `workItem.Execute()`. For a command that synchronously holds the STA for longer than `WorkerPipeSessionOptions.HeartbeatGrace` (default 15s) — e.g. `ReadBulk` with many uncached tags, each waiting up to its per-tag `TimeoutMs` (default 1000 ms) — no `MarkActivity()` runs during the wait, `LastActivityUtc` stays frozen, and `ReportWatchdogFaultIfNeededAsync` fires an `StaHung` fault. The heartbeat itself reports `WorkerState.ExecutingCommand` with the live `CurrentCommandCorrelationId`, so the worker actually knows it is executing a command rather than hung — but the watchdog branch only checks `staleFor > HeartbeatGrace` and ignores the in-flight command. A legitimate slow bulk read then self-faults and tears the session down.
**Recommendation:** Either (a) extend `WorkerPipeSession.ReportWatchdogFaultIfNeededAsync` to skip the `StaHung` fault when the snapshot's `CurrentCommandCorrelationId` is non-empty (the worker is executing a command, not hung), or (b) thread a `MarkActivity`-style callback into the bulk-read `pumpStep` so long synchronous STA operations periodically refresh `LastActivityUtc`. Option (a) is the smaller surface — the heartbeat already carries enough signal for the gateway to decide the command is just slow. Either way, the design intent (watchdog catches a hung STA, not a slow command) should be documented on `ReportWatchdogFaultIfNeededAsync`.
**Resolution:** 2026-05-20 — Applied option (a): `WorkerPipeSession.ReportWatchdogFaultIfNeededAsync` (`src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:602-645`) now returns early when `snapshot.CurrentCommandCorrelationId` is non-empty. In that case the STA is busy executing a known command, not hung, and the heartbeat already surfaces the correlation id so the gateway can judge the command against its own per-command timeout. The next `MarkActivity()` after the command returns refreshes `LastActivityUtc` and the watchdog resumes normal operation. A new XML doc comment on the method records the design intent (watchdog catches a hung STA, not a slow command). Verified: `dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86` clean. Regression coverage added in `WorkerPipeSessionTests.cs`.
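The watchdog decision after this change can be approximated as a pure function. The sketch below is illustrative only: `HeartbeatSnapshotSketch`, `WatchdogSketch`, and `ShouldReportStaHung` are hypothetical names, and the real method is asynchronous and reports the fault itself rather than returning a flag.

```csharp
using System;

// Hypothetical snapshot type carrying only the two fields the decision needs.
public sealed class HeartbeatSnapshotSketch
{
    public HeartbeatSnapshotSketch(DateTime lastActivityUtc, string currentCommandCorrelationId)
    {
        LastActivityUtc = lastActivityUtc;
        CurrentCommandCorrelationId = currentCommandCorrelationId;
    }

    public DateTime LastActivityUtc { get; }
    public string CurrentCommandCorrelationId { get; }
}

public static class WatchdogSketch
{
    // Returns true only when the STA looks hung: no command is in flight
    // and the last recorded activity is older than the grace period.
    public static bool ShouldReportStaHung(
        HeartbeatSnapshotSketch snapshot, DateTime nowUtc, TimeSpan heartbeatGrace)
    {
        // A non-empty correlation id means a command is executing, so the
        // worker is slow rather than hung and no StaHung fault is raised.
        if (!string.IsNullOrEmpty(snapshot.CurrentCommandCorrelationId))
        {
            return false;
        }

        TimeSpan staleFor = nowUtc - snapshot.LastActivityUtc;
        return staleFor > heartbeatGrace;
    }
}
```

Keeping the decision free of I/O in this way would also make it straightforward to unit-test the three cases (busy and stale, idle and stale, idle and fresh) without a live pipe.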
### Worker-018
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:160-161` |
| Status | Resolved |
**Description:** `Subscribe` calls `com.SetXmlAlarmQuery(xmlQuery)` and discards the return value. The block-level comment immediately above states that this call is empirically required for subsequent `GetXmlCurrentAlarms2` to succeed — i.e. it is on the critical path of the alarm subscription. Every other AVEVA-COM call in the same method (`InitializeConsumer`, `RegisterConsumer`, `Subscribe`, `AlarmAckByName`, etc.) is gated on a `!= 0` return-code check and throws `InvalidOperationException` on failure. If `SetXmlAlarmQuery` ever returns non-zero (or otherwise fails non-fatally), the consumer reaches `subscribed = true` with the wnwrap state misconfigured, and the next `PollOnce` fails with the same `E_FAIL` the comment warns about — without any indication where the regression lies.
**Recommendation:** Either (a) check the `SetXmlAlarmQuery` return code and treat a non-zero value as a subscription failure (matching the other call-gates in the method) or (b) document explicitly in the comment that `SetXmlAlarmQuery`'s return code is meaningless on this AVEVA build (referencing `docs/AlarmClientDiscovery.md` if so). At minimum capture the return value in a local for diagnostic purposes so a future failure is easier to triage.
**Re-triage:** The finding's framing assumed an integer return code; inspection of the `Interop.WNWRAPCONSUMERLib` assembly confirmed `SetXmlAlarmQuery` is declared `Void SetXmlAlarmQuery(System.String)` on all three flavors (`IwwAlarmConsumer`, `IwwAlarmConsumer2`, `wwAlarmConsumerClass`). There is no integer return code to gate on. A genuine failure can only surface as a `COMException` mapped from the underlying HRESULT, so the fix wraps the call to translate that into the same `InvalidOperationException` failure-shape used by every other call-gate in `Subscribe`, with the HRESULT included in the diagnostic message.
**Resolution:** 2026-05-20 — `WnWrapAlarmConsumer.Subscribe` now wraps the `com.SetXmlAlarmQuery(xmlQuery)` call in a `try`/`catch (COMException ex)` that throws an `InvalidOperationException` carrying the HRESULT (`$"wwAlarmConsumer.SetXmlAlarmQuery failed with HRESULT 0x{ex.HResult:X8}; subsequent GetXmlCurrentAlarms2 polls would return E_FAIL."`) with the original `COMException` as `InnerException`. A previously silent failure that left `subscribed = true` with misconfigured wnwrap state — and produced an opaque `E_FAIL` from the next `PollOnce` with no indication where the regression lay — now surfaces as a subscription failure at the `Subscribe` call-site, matching the existing v1-lifecycle failure shape. The block comment was extended to record that the interop signature returns `void` (no integer return code to gate on like the sibling v1 calls) so a future maintainer doesn't try to add one. No new regression test was added by this agent because Worker.Tests is being modified by a concurrent agent; the change is structurally analogous to the existing `Initialize/Register/Subscribe` call-gates and is exercised end-to-end by the live alarm smoke path.
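The translation from `COMException` to `InvalidOperationException` can be sketched in isolation. In this sketch the COM call is replaced by an `Action` delegate so it runs without the AVEVA interop assembly; `SetXmlAlarmQuerySketch` and `Invoke` are hypothetical names, while the message format is the one quoted in the resolution.

```csharp
using System;
using System.Runtime.InteropServices;

public static class SetXmlAlarmQuerySketch
{
    // setXmlAlarmQuery stands in for the void-returning COM call; a failure
    // can therefore only surface as a COMException mapped from the HRESULT.
    public static void Invoke(Action setXmlAlarmQuery)
    {
        try
        {
            setXmlAlarmQuery();
        }
        catch (COMException ex)
        {
            // Re-throw in the same failure shape as the other call-gates,
            // keeping the HRESULT in the message and the original exception
            // as InnerException for triage.
            throw new InvalidOperationException(
                $"wwAlarmConsumer.SetXmlAlarmQuery failed with HRESULT 0x{ex.HResult:X8}; " +
                "subsequent GetXmlCurrentAlarms2 polls would return E_FAIL.",
                ex);
        }
    }
}
```

A regression test along these lines could pass a delegate that throws `new COMException("boom", unchecked((int)0x80004005))` and assert on the message and the inner exception, which avoids any dependency on the live wnwrap consumer.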
### Worker-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:59`, `:188` |
| Status | Resolved |
**Description:** `WnWrapAlarmConsumer` declares `private string subscriptionExpression = string.Empty;` and assigns it once inside `Subscribe` (line 188), but never reads it. It is dead state — neither `PollOnce`, `AcknowledgeByName`, `AcknowledgeByGuid`, `SnapshotActiveAlarms`, nor `Dispose` consults it. Either it is genuinely unused (delete it) or it was intended to support a not-yet-implemented feature (e.g. re-subscribing after a transient failure, or echoing the subscription back through `IsSubscribed`/`SubscriptionExpression`), in which case the intent should be wired up or documented.
**Recommendation:** Delete the field; this is the safest option. `TreatWarningsAsErrors=true` will continue to permit the field as long as it is written to, so the build will not force the cleanup. Alternatively, promote it to a read-only `SubscriptionExpression` property so smoke tests can assert which subscription is active without touching wnwrap state. If a future use is expected, file a follow-up issue.
**Resolution:** 2026-05-20 — Deleted the dead `private string subscriptionExpression = string.Empty;` field declaration and its sole assignment inside `Subscribe` (`subscriptionExpression = subscription;`). The field had no readers and was pure write-only state. Pure cleanup — no behaviour change, no public API surface affected. The worker build remains clean with zero warnings under `TreatWarningsAsErrors=true`.
### Worker-020
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:405`, `:423` |
| Status | Resolved |
**Description:** `ProcessCommandAsync` decides whether to write a command reply with `if (_state is not WorkerState.Ready and not WorkerState.ExecutingCommand)`. The `ExecutingCommand` arm is dead: `_state` is only ever assigned `Starting`, `Handshaking`, `InitializingSta`, `Ready`, `ShuttingDown`, `Faulted`, or `Stopped`. The value `WorkerState.ExecutingCommand` never appears on the right-hand side of a `_state = ...` assignment. It is synthesized only in `CreateHeartbeat` (line 811) when a command is in flight, so it never leaks back into `_state`. The check is effectively `_state is not WorkerState.Ready`. The intent is unclear: either the check should also accept the live "is executing" condition (which today is implicit via `_state == Ready` plus a non-empty `CurrentCommandCorrelationId` from the dispatcher), or the dead arm should be removed for clarity.
**Recommendation:** Simplify the check to `if (_state != WorkerState.Ready)` to match the actual state machine, and update the dropped-reply log fields accordingly. Alternatively, introduce an explicit `WorkerState.ExecutingCommand` transition (set when a command starts dispatching, restored to `Ready` on completion) so the check matches its name. The simpler fix is the former.
**Resolution:** 2026-05-20 — Both occurrences of the `_state is not WorkerState.Ready and not WorkerState.ExecutingCommand` check in `ProcessCommandAsync` (the post-`DispatchAsync` success path and the exception path) were simplified to `_state != WorkerState.Ready`. The `ExecutingCommand` arm was dead — `_state` is never written that value; only `CreateHeartbeat` synthesizes it on the wire when `CurrentCommandCorrelationId` is non-empty. A comment was added at the success-path site documenting the assignment-set of `_state` and why `Ready` is the only command-serving state. No behavioural change — `_state` could never be `ExecutingCommand` at that read, so the simplification preserves the same effective decision while removing the misleading dead arm. No new regression test was added by this agent because Worker.Tests is being modified by a concurrent agent.
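The "no behavioural change" claim can be checked mechanically by comparing the two predicates over every state that `_state` is ever assigned. The sketch below assumes an enum containing exactly the members named in this finding; the member order, the `ReplyGateSketch` class, and its method names are hypothetical.

```csharp
using System;
using System.Linq;

// Assumption: member set taken from the finding text; order and values are illustrative.
public enum WorkerState
{
    Starting,
    Handshaking,
    InitializingSta,
    Ready,
    ExecutingCommand,
    ShuttingDown,
    Faulted,
    Stopped
}

public static class ReplyGateSketch
{
    // The states the Description lists as assignment targets for _state.
    public static readonly WorkerState[] AssignableStates =
    {
        WorkerState.Starting,
        WorkerState.Handshaking,
        WorkerState.InitializingSta,
        WorkerState.Ready,
        WorkerState.ShuttingDown,
        WorkerState.Faulted,
        WorkerState.Stopped
    };

    // The original check, including the dead ExecutingCommand arm.
    public static bool OldCheck(WorkerState state) =>
        state is not WorkerState.Ready and not WorkerState.ExecutingCommand;

    // The simplified check.
    public static bool NewCheck(WorkerState state) => state != WorkerState.Ready;

    // True when both checks give the same answer for every assignable state.
    public static bool ChecksAgreeOnAssignableStates() =>
        AssignableStates.All(state => OldCheck(state) == NewCheck(state));
}
```

The two predicates differ only for `WorkerState.ExecutingCommand`, which is outside the assignable set, so the simplification stays safe for as long as no code path starts writing that value into `_state`.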
### Worker-021
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:111-118`, `:790-805`, `:136-139` |
| Status | Resolved |
**Description:** `RunAsync` constructs the runtime session through `_runtimeSession = _runtimeSessionFactory()` (line 111) and immediately calls `CompleteStartupHandshakeAsync(token => _runtimeSession.StartAsync(...))`. That path is fine. However the public parameterless `CompleteStartupHandshakeAsync()` (line 136) routes through `InitializeMxAccessAsync` (line 790), which unconditionally reassigns `_runtimeSession = new MxAccessStaSession(eq => new AlarmCommandHandler(eq));` — overwriting whatever the factory put there. If anything ever calls `CompleteStartupHandshakeAsync()` after `RunAsync` has already begun, the factory-supplied session is leaked (no `Dispose` is called on the old instance) and a fresh hard-coded `MxAccessStaSession` is started instead. Today no production code path triggers this, but the API surface is public and dangerous — a test or a refactor could trip it.
**Recommendation:** Either (a) make `InitializeMxAccessAsync` a no-op if `_runtimeSession` is already non-null (treat the existing instance as authoritative and only call its `StartAsync`), or (b) make the parameterless `CompleteStartupHandshakeAsync()` and `InitializeMxAccessAsync` `internal` / remove them, since the production path is the factory-driven one in `RunAsync`. Option (b) is cleaner: the parameterless overload is dead in production.
**Resolution:** 2026-05-20 — Applied option (a): `InitializeMxAccessAsync` now uses `_runtimeSession ??= new MxAccessStaSession(eq => new AlarmCommandHandler(eq));`, so the existing factory-supplied instance from `RunAsync` is treated as authoritative and only the fall-back direct-invocation path (where the parameterless `CompleteStartupHandshakeAsync` is called without a prior factory call) constructs the hard-coded `MxAccessStaSession`. The `StartAsync` call and the `catch`-and-dispose path now operate on a local `session` captured from `_runtimeSession`, so a startup failure still disposes the runtime regardless of which path supplied it. A comment in `InitializeMxAccessAsync` documents the reasoning. Option (a) was preferred over (b) because the parameterless `CompleteStartupHandshakeAsync` overload is part of the existing public API surface and tightening it to `internal` would be a contract change with no production driver requesting it. No new regression test was added by this agent because Worker.Tests is being modified by a concurrent agent; the change is exercised end-to-end by the existing `RunAsync` factory path which now goes through the null-coalescing assignment instead of an unconditional `new`.
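The null-coalescing assignment and the local-capture dispose path can be sketched with a fake session type. Everything named here (`FakeRuntimeSession`, `HandshakeSketch`, `InitializeRuntime`) is hypothetical, and the sketch is synchronous for brevity, whereas the real method awaits `StartAsync`.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the runtime session.
public sealed class FakeRuntimeSession : IDisposable
{
    public bool FailOnStart { get; set; }
    public bool Disposed { get; private set; }

    public void Start()
    {
        if (FailOnStart)
        {
            throw new InvalidOperationException("start failed");
        }
    }

    public void Dispose() => Disposed = true;
}

public sealed class HandshakeSketch
{
    private FakeRuntimeSession? _runtimeSession;

    // factorySupplied models the instance the factory would have stored
    // before initialization runs; null models the direct-invocation path.
    public HandshakeSketch(FakeRuntimeSession? factorySupplied = null)
    {
        _runtimeSession = factorySupplied;
    }

    public FakeRuntimeSession? Current => _runtimeSession;

    public void InitializeRuntime()
    {
        // Keep an existing session; construct the fallback only when none exists.
        _runtimeSession ??= new FakeRuntimeSession();

        // Capture a local so the failure path disposes the same instance
        // that was started, whichever path supplied it.
        FakeRuntimeSession session = _runtimeSession;
        try
        {
            session.Start();
        }
        catch
        {
            session.Dispose();
            throw;
        }
    }
}
```

In this sketch a factory-supplied session survives initialization unchanged, the fallback is constructed only when the field is still null, and a failed start disposes the session before the exception propagates.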
### Worker-022
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Worker/MxAccess/MxAlarmSnapshot.cs:12`, `:26`, `:49` |
| Status | Resolved |
**Description:** `MxAlarmSnapshot.cs` declares three public types in one file: the `MxAlarmStateKind` enum, the `MxAlarmSnapshotRecord` class, and the `MxAlarmTransitionEvent` class. The C# style guide (`docs/style-guides/CSharpStyleGuide.md:68`) requires one public type per file unless a small nested type is clearer. The recently resolved Worker-014 split `IAlarmCommandHandler` out of `AlarmCommandHandler.cs` for exactly this reason — the same convention applies here.
**Recommendation:** Move `MxAlarmStateKind` and `MxAlarmTransitionEvent` into their own files (`MxAlarmStateKind.cs`, `MxAlarmTransitionEvent.cs`) and leave `MxAlarmSnapshotRecord` in `MxAlarmSnapshot.cs` (or rename the file to `MxAlarmSnapshotRecord.cs` to match the surviving type). Pure file-organization change; no behaviour or namespace impact.
**Resolution:** 2026-05-20 — Split `MxAlarmSnapshot.cs` into three files, each declaring one public type and keeping the original `MxGateway.Worker.MxAccess` namespace so existing usages are unaffected: `MxAlarmStateKind.cs` (the enum, with its XML doc), `MxAlarmTransitionEvent.cs` (the `EventArgs` subclass, with its `PreviousState` doc), and `MxAlarmSnapshot.cs` (now containing only `MxAlarmSnapshotRecord` plus its XML doc). Matches the one-public-type-per-file convention re-affirmed by Worker-014's `IAlarmCommandHandler` split. Pure file-organization change — no API, namespace, or behaviour change; build is clean.