From d2d2e5f68f3b48bb266828bbfe20cc1a8d926d79 Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Sun, 24 May 2026 02:34:30 -0400 Subject: [PATCH] code-review 2026-05-24: re-review at d692232 across all 11 modules MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Restores the `code-reviews/` tree (was unwritten on this working copy) and re-reviews every module per `REVIEW-PROCESS.md` against HEAD `d692232`. The diff in scope is the five commits since the last sweep: `dc9c0c9` (ZB.MOM.WW gateway-side rename + slnx migrate), `397d3c5` (client SDK rename + the missing alarm-RPC proto types and the .NET DiscoverHierarchyOptions POCO), `27ed651` (role-based LDAP auth + HubToken bearer, drop PathBase), `6594359` (sidebar layout + three SignalR push hubs), and `d692232` (EventsHub publisher + doc refresh). Module status | Module | Open | Total | Delta this pass | |---|---|---|---| | Server | 8 | 43 | +6 | | Contracts | 2 | 17 | +2 | | Tests | 2 | 26 | +2 | | IntegrationTests | 3 | 24 | +3 | | Client.Java | 5 | 31 | +5 | | Client.Rust | 1 | 21 | +1 | | Worker | 0 | 25 | 0 (rename-only diff, clean) | | Worker.Tests | 0 | 30 | 0 (rename-only diff, clean) | | Client.Dotnet | 0 | 17 | 0 (rename + alarm-fix diff, clean) | | Client.Python | 0 | 21 | 0 (rename + alarm-fix diff, clean) | | Client.Go | 0 | 21 | 0 (rename + alarm-fix diff, clean) | Total new findings: 19. Severity breakdown: 1 Medium-security (Server-038), 3 Medium-documentation/design/coverage (Tests-026, Client.Java-027, Client.Java-028), 15 Low. New findings * Server-038 (Medium / Security) — EventsHub.SubscribeSession accepts any session id from any Viewer; no per-session ACL guards the EventsHub group fan-out. * Server-039 (Low / Error handling) — HubTokenService.Validate accepts a payload with null Name/NameIdentifier. * Server-040 (Low / Conventions) — MapGroupsToRoles undocumented full-vs-RDN lookup precedence. 
* Server-041 (Low / Design adherence) — EventStreamService calls IDashboardEventBroadcaster.Publish without a try/catch — fragile seam relying on the never-throw contract. * Server-042 (Low / Performance) — DashboardSnapshotPublisher tight retry loop with no backoff (vs AlarmsHubPublisher 5s delay). * Server-043 (Low / Documentation) — HubTokenService singleton sharing across login + hub-token validation undocumented. * Contracts-016 (Low / Conventions) — QueryActiveAlarmsRequest.session_id reserved-for-future-use ambiguity. * Contracts-017 (Low / Documentation) — rpc QueryActiveAlarms doc omits the alarm_filter_prefix filter description. * Tests-025 (Low / Conventions) — duplicate NullDashboardEventBroadcaster fakes in EventStreamServiceTests and GatewayEndToEndFakeWorkerSmokeTests. * Tests-026 (Medium / Testing coverage) — no test proves EventStreamService actually calls IDashboardEventBroadcaster.Publish. * IntegrationTests-022 (Low / Conventions) — ResolveRepositoryRoot silent fallback to Directory.GetCurrentDirectory(). * IntegrationTests-023 (Low / Testing coverage) — DashboardLdapLiveTests success-path asserts ldap_group but not the Role claim. * IntegrationTests-024 (Low / Conventions) — inline NullDashboardEventBroadcaster fake duplicates Tests-side copies. * Client.Java-027 (Medium / Documentation) — README + JavaClientDesign Gradle task names still use the old short project names. * Client.Java-028 (Medium / Design adherence) — JavaClientDesign build-layout shows the old `com/dohertylan/mxgateway/` package paths. * Client.Java-029 (Low / Documentation) — README installDist path cites the wrong directory. * Client.Java-030 (Low / Testing coverage) — no Java test exercises the regenerated QueryActiveAlarmsRequest RPC. * Client.Java-031 (Low / Conventions) — README prose uses old short project names instead of canonical prefixed ones. 
* Client.Rust-021 (Low / Design adherence) — RustClientDesign.md "Crate layout" shows an aspirational nested `crates/zb-mom-ww-mxgateway-client/` that does not exist; actual layout is the flat top-level crate. Two pre-existing pending findings (Server-031 lock-contention, Server-032 bounded event channel) remain unchanged — neither was touched by this wave of commits. Process notes - The `code-reviews/` tree was not in this working copy's git history (the local extract pre-dates the divergent branch that carried the reviews). Restored from `dd7ca16` via `git checkout dd7ca16 -- code-reviews/` before the re-review. - Some "Resolved" entries in the restored findings.md reference fixes that landed on the divergent branch (the same one that carried the reviews) and are not present on the current main lineage. The re-review treats those statuses as historical; the new pass only files findings against HEAD's actual state. - `python code-reviews/regen-readme.py --check` is green. Co-Authored-By: Claude Opus 4.7 (1M context) --- code-reviews/Client.Dotnet/findings.md | 338 +++++++++ code-reviews/Client.Go/findings.md | 434 ++++++++++++ code-reviews/Client.Java/findings.md | 520 ++++++++++++++ code-reviews/Client.Python/findings.md | 797 ++++++++++++++++++++++ code-reviews/Client.Rust/findings.md | 429 ++++++++++++ code-reviews/Contracts/findings.md | 308 +++++++++ code-reviews/IntegrationTests/findings.md | 463 +++++++++++++ code-reviews/README.md | 313 +++++++++ code-reviews/Server/findings.md | 780 +++++++++++++++++++++ code-reviews/Tests/findings.md | 453 ++++++++++++ code-reviews/Worker.Tests/findings.md | 531 ++++++++++++++ code-reviews/Worker/findings.md | 462 +++++++++++++ code-reviews/_template/findings.md | 53 ++ code-reviews/prompt.md | 76 +++ code-reviews/regen-readme.py | 236 +++++++ code-reviews/test_regen_readme.py | 158 +++++ 16 files changed, 6351 insertions(+) create mode 100644 code-reviews/Client.Dotnet/findings.md create mode 100644 
code-reviews/Client.Go/findings.md create mode 100644 code-reviews/Client.Java/findings.md create mode 100644 code-reviews/Client.Python/findings.md create mode 100644 code-reviews/Client.Rust/findings.md create mode 100644 code-reviews/Contracts/findings.md create mode 100644 code-reviews/IntegrationTests/findings.md create mode 100644 code-reviews/README.md create mode 100644 code-reviews/Server/findings.md create mode 100644 code-reviews/Tests/findings.md create mode 100644 code-reviews/Worker.Tests/findings.md create mode 100644 code-reviews/Worker/findings.md create mode 100644 code-reviews/_template/findings.md create mode 100644 code-reviews/prompt.md create mode 100644 code-reviews/regen-readme.py create mode 100644 code-reviews/test_regen_readme.py diff --git a/code-reviews/Client.Dotnet/findings.md b/code-reviews/Client.Dotnet/findings.md new file mode 100644 index 0000000..bb40652 --- /dev/null +++ b/code-reviews/Client.Dotnet/findings.md @@ -0,0 +1,338 @@ +# Code Review — Client.Dotnet + +| Field | Value | +|---|---| +| Module | `clients/dotnet` | +| Reviewer | Claude Code | +| Review date | 2026-05-24 | +| Commit reviewed | `d692232` | +| Status | Reviewed | +| Open findings | 0 | + +## Checklist coverage + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issue found (this review): the global CLI `--timeout` defaults to 30 s and is used both as the gRPC `DefaultCallTimeout` and as the outer `CancelAfter` budget — but `bench-read-bulk` / `bench-stream-events` default to `--duration-seconds=30 --warmup-seconds=3 (+ stagger)`, so direct manual invocation cancels the bench mid-window before the steady-state ends (Client.Dotnet-015). The `scripts/bench-read-bulk.ps1` driver works around this by raising `--timeout`, but `bench-stream-events` has no driver script. 
| +| 2 | mxaccessgw conventions | Good — consumes the shared contracts project, no forked proto, `authorization: Bearer` metadata correct, parity preserved via split `EnsureProtocolSuccess`/`EnsureMxAccessSuccess`. The new `clients/dotnet/Directory.Build.props` mirrors `src/Directory.Build.props` exactly (same six properties, identical values) so the enforcement floor is back in scope. | +| 3 | Concurrency & thread safety | Issue found (this review): `BenchStreamEventsAsync`'s per-session `RunStreamAsync` hands the inner `Task.Run` stream loop a reference (`streamTask`) that becomes unobserved whenever the outer `cancellationToken` cancels during the bench's `await Task.Delay` — the `await streamTask` recovery path never runs, so any inner OCE / `RpcException` raised after cancellation surfaces as a `TaskScheduler.UnobservedTaskException` (Client.Dotnet-016). The Client.Dotnet-009 / 011 fixes from the previous pass are correctly applied. | +| 4 | Error handling & resilience | No new issues found this review (Client.Dotnet-001 and Client.Dotnet-004 remain resolved; `RpcExceptionMapper` is consistently called from both gateway and Galaxy transports incl. `AcknowledgeAlarmAsync` after Client.Dotnet-014). | +| 5 | Security | Good — API key never logged by the library, CLI redacts effective key (both `--api-key` and `--api-key-env` sourced) after Client.Dotnet-008, TLS custom-root validation correct, secured-write payloads never logged. | +| 6 | Performance & resource management | No issues found — channels and streaming calls disposed correctly, retry pipeline shares one timeout budget per safe-unary op. | +| 7 | Design-document adherence | No issues found — matches `DotnetClientDesign.md` and `ClientLibrariesDesign.md`. 
| +| 8 | Code organization & conventions | No new issues — Client.Dotnet-012 (Directory.Build.props) and Client.Dotnet-013 (missing XML docs on `DiscoverHierarchyOptions`, the second `DiscoverHierarchyAsync` overload, and `IMxGatewayCliClient`) are both fully resolved; the new props file is a faithful mirror of the production one. | +| 9 | Testing coverage | No new issues — Client.Dotnet-014 closed the alarm-side `Translate` gap. The new bench paths (`bench-read-bulk`, `bench-stream-events`) have no unit-test coverage, but they are stress harnesses driven by `scripts/bench-read-bulk.ps1`, not SDK API surface, so this is not flagged. | +| 10 | Documentation & comments | No new issues this review (Client.Dotnet-007's alarm-ack `admin`-scope correction holds; `DefaultCallTimeout` doc accurately reflects the shared-budget semantics from Client.Dotnet-004). | + +### 2026-05-24 review (commit d692232) + +Re-review pass at `d692232`. Diff against `a020350` consists of the `ZB.MOM.WW` +client prefix rename in commit `397d3c5` (folders, csprojs, sln→slnx, every +namespace and using) plus the hand-written `DiscoverHierarchyOptions.cs` POCO +and the dropped retired `SessionId =` lines from alarm-related test fixtures. +The rename was applied via a case-insensitive regex sweep; no over-rename +artifacts found. The `mxgw_*` API-key wire prefix, `MXGATEWAY_*` environment +variables, and the `MxGatewayClient` / `MxGatewaySession` type names are +unchanged. Build and tests are green at HEAD. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. | +| 2 | mxaccessgw conventions | No issues found — rename hygiene clean; wire identifiers preserved. | +| 3 | Concurrency & thread safety | No issues found in this diff. | +| 4 | Error handling & resilience | No issues found in this diff. | +| 5 | Security | No issues found in this diff. | +| 6 | Performance & resource management | No issues found in this diff. 
| +| 7 | Design-document adherence | No issues found in this diff — `DotnetClientDesign.md` reflects the new layout. | +| 8 | Code organization & conventions | No issues found in this diff. | +| 9 | Testing coverage | No issues found in this diff — `MxGatewayClientAlarmsTests` fixtures correctly drop `SessionId` from `AcknowledgeAlarmRequest`/`Reply` and retain it on `QueryActiveAlarmsRequest`. | +| 10 | Documentation & comments | No issues found in this diff. | + +## Findings + +### Client.Dotnet-001 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `clients/dotnet/MxGateway.Client/GrpcMxGatewayClientTransport.cs:190-199`, `clients/dotnet/MxGateway.Client/GrpcGalaxyRepositoryClientTransport.cs:131-140` | +| Status | Resolved | + +**Description:** `MapRpcException` only produces typed exceptions for `Unauthenticated` and `PermissionDenied`. Every other gRPC status — `NotFound`, `InvalidArgument`, `ResourceExhausted`, `FailedPrecondition`, `Unavailable`, `Internal` — collapses into the base `MxGatewayException` with no surfaced `StatusCode`. Callers cannot programmatically distinguish a transient outage from a permanent bad-argument error without reflecting into `InnerException` and downcasting to `RpcException`. + +**Recommendation:** Carry the gRPC `StatusCode` on `MxGatewayException` (e.g. a `StatusCode` property) and/or add typed subclasses for at least `NotFound`, `InvalidArgument`, and `Unavailable`. Populate it from `exception.StatusCode` in `MapRpcException`. + +**Resolution:** (2026-05-18) Confirmed against source: both transports had a duplicated private `MapRpcException` that only typed two statuses and discarded the gRPC code for the rest. 
Added a nullable `StatusCode` property (`Grpc.Core.StatusCode?`) to `MxGatewayException` plus constructors that carry it, threaded it through `MxGatewayAuthenticationException`/`MxGatewayAuthorizationException`, and extracted the two duplicated mappers into a single shared internal `RpcExceptionMapper` (`RpcExceptionMapper.cs`) that populates `StatusCode` from `exception.StatusCode` for every status. Callers can now distinguish transient from permanent failures without downcasting `InnerException`. Documented in `clients/dotnet/README.md`. Regression test: `RpcExceptionMapperTests` (8 cases incl. the `[Theory]` over `NotFound`/`InvalidArgument`/`ResourceExhausted`/`FailedPrecondition`/`Unavailable`/`Internal`). + +### Client.Dotnet-002 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:145-148`, `clients/dotnet/MxGateway.Client.Tests/MxGatewayClientSessionTests.cs:236-256` | +| Status | Resolved | + +**Description:** The retry predicate `MxGatewayClientRetryPolicy.IsTransientGrpcFailure` handles two shapes: a raw `RpcException` and an `MxGatewayException { InnerException: RpcException }`. In production the transport always maps `RpcException` → `MxGatewayException` before it reaches the retry pipeline, so only the wrapped-`MxGatewayException` branch ever runs in production. But `FakeGatewayTransport` throws the raw `RpcException` and never maps it, so every retry test exercises only the raw-`RpcException` branch — the branch that never occurs in production. The production retry behaviour is effectively untested. + +**Recommendation:** Add a fake/transport mode that maps `RpcException` to `MxGatewayException` the way `GrpcMxGatewayClientTransport` does (or add tests that enqueue a pre-wrapped `MxGatewayException`), so the actually-used predicate branch is covered. 
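The two predicate shapes described in Client.Dotnet-002 can be illustrated with a minimal sketch. The exception types below are hypothetical stand-ins (the real code uses `RpcException` and `MxGatewayException`), and the `IsTransient` flag is an assumption that replaces the real status-code check; the point is only that a test suite needs to exercise both arms, and especially the wrapped one.

```csharp
using System;

// Hypothetical stand-in for Grpc.Core.RpcException (assumption, not the real type).
public sealed class SketchRpcException : Exception
{
    public SketchRpcException(bool isTransient) : base("rpc failure") => IsTransient = isTransient;

    // Stand-in for "the gRPC status code is one of the transient ones".
    public bool IsTransient { get; }
}

// Hypothetical stand-in for MxGatewayException wrapping the transport error.
public sealed class SketchGatewayException : Exception
{
    public SketchGatewayException(string message, Exception inner) : base(message, inner) { }
}

public static class SketchRetryPolicy
{
    // Handles both shapes: the raw transport exception and the wrapped one.
    public static bool IsTransientFailure(Exception exception) => exception switch
    {
        SketchRpcException rpc => rpc.IsTransient,
        SketchGatewayException { InnerException: SketchRpcException rpc } => rpc.IsTransient,
        _ => false,
    };
}
```

A fake transport that throws only the raw shape never reaches the second arm, which is the gap this finding describes.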
+ +**Resolution:** (2026-05-18) Confirmed against source: `FakeGatewayTransport` threw queued exceptions verbatim, so the existing retry tests only ever hit the raw-`RpcException` predicate branch. Added a `MapTransportExceptions` flag to `FakeGatewayTransport` that, when set, runs thrown `RpcException`s through the same shared `RpcExceptionMapper` the production gRPC transport uses, producing the wrapped `MxGatewayException` shape. Added regression test `MxGatewayClientSessionTests.InvokeAsync_RetriesSafeDiagnosticCommand_WhenTransportMapsRpcException`, which exercises the previously-untested production predicate branch. Verified red: removing the `MxGatewayException { InnerException: RpcException }` case from `IsTransientGrpcFailure` fails the new test while the pre-existing raw-`RpcException` test still passes. + +### Client.Dotnet-003 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `clients/dotnet/MxGateway.Client/MxGatewaySession.cs:659-663`, `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:230-240` | +| Status | Resolved | + +**Description:** `DisposeAsync` calls `CloseAsync()` (no token) then unconditionally `_closeLock.Dispose()`. If another thread is concurrently awaiting `CloseAsync(token)` — legal, since the type exposes public async methods and no single-threaded contract — disposing the `SemaphoreSlim` while a `WaitAsync` is pending throws `ObjectDisposedException` into that caller. The `_disposed` flags in both clients are also plain unsynchronised `bool` reads/writes; `ThrowIfDisposed` racing `DisposeAsync` can observe a stale value. + +**Recommendation:** Either document `MxGatewaySession`/`MxGatewayClient` as not thread-safe for concurrent dispose, or guard `_disposed` with `Interlocked`/`volatile` and avoid disposing `_closeLock` until all in-flight `CloseAsync` calls complete. 
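The `Interlocked`/`Volatile` guard suggested in the recommendation above could look roughly like this. It is a sketch under stated assumptions: `DisposableClientSketch` and `Ping` are hypothetical names, not the real `MxGatewayClient` surface, and the close-lock drain for `MxGatewaySession` is deliberately left out.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a client type (assumption, not the real MxGatewayClient).
public sealed class DisposableClientSketch : IAsyncDisposable
{
    // 0 = live, 1 = disposed. An int so Interlocked and Volatile can operate on it.
    private int _disposed;

    public ValueTask DisposeAsync()
    {
        // Only the first caller observes 0 and proceeds to cleanup.
        if (Interlocked.Exchange(ref _disposed, 1) != 0)
        {
            return ValueTask.CompletedTask;
        }

        // Owned resources (channel, transport, ...) would be released here.
        return ValueTask.CompletedTask;
    }

    // Hypothetical public operation that must fail once the client is disposed.
    public void Ping()
    {
        ThrowIfDisposed();
    }

    private void ThrowIfDisposed()
    {
        // Volatile.Read prevents a stale cached value from being observed.
        if (Volatile.Read(ref _disposed) != 0)
        {
            throw new ObjectDisposedException(nameof(DisposableClientSketch));
        }
    }
}
```

This only makes the disposed flag safe to race; it does not by itself keep a semaphore alive for pending waiters.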
+ +**Resolution:** (2026-05-18) Confirmed against source: `MxGatewaySession.DisposeAsync` disposed `_closeLock` unconditionally, racing concurrent `CloseAsync` callers; `MxGatewayClient._disposed` was a plain `bool`. Fixed `MxGatewaySession` by tracking in-flight `CloseAsync` callers with an `_activeCloseCount` guarded by a dedicated `_disposeGate` lock and a `_closeLockDisposed` flag: `CloseAsync` registers under the gate (and throws `ObjectDisposedException` if disposal already won) before awaiting `_closeLock.WaitAsync`, and `DisposeAsync` drains `_activeCloseCount` to zero before disposing the semaphore, so the close lock provably outlives every pending `WaitAsync`. Fixed `MxGatewayClient` by changing `_disposed` to an `int` accessed via `Interlocked.Exchange`/`Volatile.Read`. Regression test `MxGatewayClientSessionTests.DisposeAsync_DoesNotRaceConcurrentCloseAsync` runs 100 iterations with one close holding the lock and one parked behind it while `DisposeAsync` runs concurrently; verified red against the original `DisposeAsync` (fails with `ObjectDisposedException`), green after the fix. + +### Client.Dotnet-004 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:283-294`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:392-403` | +| Status | Resolved | + +**Description:** `ExecuteSafeUnaryAsync` wraps the whole Polly retry pipeline in a single linked CTS cancelled after `Options.DefaultCallTimeout`, while `CreateCallOptions` also stamps each individual call with a `DefaultCallTimeout` gRPC deadline. The retry pipeline therefore shares one `DefaultCallTimeout` budget across the initial attempt plus all retries plus backoff delays. The README/XML docs describe `DefaultCallTimeout` as a per-call timeout, which misrepresents this. 
`DeadlineExceeded` is also classified as transient, so an attempt that exhausts the shared budget is retried only to immediately fail again. + +**Recommendation:** Decide whether `DefaultCallTimeout` is per-attempt or per-operation and make code and docs consistent — e.g. a separate per-attempt deadline and a distinct overall-operation timeout. Reconsider retrying on `DeadlineExceeded` when the deadline was client-imposed. + +**Resolution:** (2026-05-18) Confirmed against source: the shared linked-CTS budget plus per-call deadline both use `DefaultCallTimeout`, and `IsTransientStatus` listed `DeadlineExceeded`. Resolved as a per-operation budget (the simpler, non-breaking choice): the `DefaultCallTimeout` XML doc in `MxGatewayClientOptions.cs` now states it is both the per-attempt gRPC deadline and the overall budget shared across the initial attempt, every retry, and the backoff delays — an upper bound on total wall-clock time, not a fresh per-retry allowance. Removed `DeadlineExceeded` from `MxGatewayClientRetryPolicy.IsTransientStatus`: every unary deadline is client-imposed (`CreateCallOptions` stamps the shared budget), so a `DeadlineExceeded` means the budget is exhausted and an immediate retry can only fail again. Regression test `MxGatewayClientSessionTests.InvokeAsync_DoesNotRetrySafeDiagnosticCommand_OnDeadlineExceeded` asserts the safe diagnostic command (`Ping`) is attempted exactly once and the failure surfaces; verified red against the original transient set (the call retried and succeeded). + +### Client.Dotnet-005 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/dotnet/MxGateway.Client/MxGatewaySession.cs:82,124,175` | +| Status | Resolved | + +**Description:** `RegisterAsync`/`AddItemAsync`/`AddItem2Async` return the typed payload's handle with a fallback, e.g. `reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value`. 
After `EnsureMxAccessSuccess()` passes, a missing typed payload silently falls back to `ReturnValue.Int32Value`, which for a reply carrying no return value is `0`. A caller then uses `0` as a `ServerHandle`/`ItemHandle`, producing a confusing downstream invalid-handle failure rather than a clear "gateway reply missing payload" error. + +**Recommendation:** If the typed sub-message is the contract for these commands, treat its absence on an otherwise-successful reply as an error (throw a descriptive `MxGatewayException`) rather than falling through to `ReturnValue.Int32Value`. + +**Resolution:** (2026-05-18) Confirmed against source and `mxaccess_gateway.proto`: `register`/`add_item`/`add_item2` are members of the `MxCommandReply.payload` oneof, so the typed accessor is `null` whenever the worker did not set that case — and the fallback returned `ReturnValue.Int32Value` (0 for a reply with no return value). The typed sub-message is the contract for these handle-returning commands, so its absence on an otherwise-successful reply is now an error: `RegisterAsync`/`AddItemAsync`/`AddItem2Async` throw via a new private `MxGatewaySession.CreateMissingPayloadException` helper that builds a descriptive `MxGatewayException` naming the missing payload, kind, session, and correlation id. Regression tests `MxGatewayClientSessionTests.RegisterAsync_Throws_WhenSuccessfulReplyMissingPayload` and `AddItemAsync_Throws_WhenSuccessfulReplyMissingPayload` enqueue an `Ok` reply with no typed payload and assert the descriptive throw; verified red against the original fallback (returned `0` instead of throwing). 
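The missing-payload check described in Client.Dotnet-005 can be sketched as follows. The reply records are simplified, hypothetical shapes (the real types are generated from `mxaccess_gateway.proto`), and `InvalidOperationException` stands in for the descriptive `MxGatewayException`; treat it as an illustration of the fail-loudly pattern rather than the actual helper.

```csharp
#nullable enable
using System;

// Hypothetical, simplified reply shapes (assumption: the real ones are proto-generated).
public sealed record RegisterPayloadSketch(int ServerHandle);

public sealed record CommandReplySketch(RegisterPayloadSketch? Register, int ReturnValue);

public static class HandleExtractionSketch
{
    // Returns the typed handle, or throws when a successful reply has no typed payload.
    public static int RequireServerHandle(CommandReplySketch reply)
    {
        if (reply.Register is null)
        {
            // Stand-in for the descriptive MxGatewayException in the real client.
            throw new InvalidOperationException(
                "Gateway reply reported success but carried no 'register' payload.");
        }

        return reply.Register.ServerHandle;
    }
}
```

The design choice is that a zero handle is never manufactured from `ReturnValue`: the caller gets either a handle the worker actually set or an error that names the missing payload.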
+ +### Client.Dotnet-006 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `clients/dotnet/MxGateway.Client/MxGatewayClientOptions.cs:50`, `clients/dotnet/MxGateway.Client/MxGatewayClientContractInfo.cs:10-14` | +| Status | Resolved | + +**Description:** `MxGatewayClientOptions.MaxGrpcMessageBytes` and the two `const`s in `MxGatewayClientContractInfo` are public members with no XML doc comments, inconsistent with every other public member in the assembly and with the repo's documented C# style emphasis on a documented public surface. + +**Recommendation:** Add `<summary>` doc comments to `MaxGrpcMessageBytes`, `GatewayProtocolVersion`, and `WorkerProtocolVersion`. + +**Resolution:** (2026-05-18) Confirmed: all three public members lacked XML docs while every other public member in the assembly is documented. Added `<summary>` comments to `MxGatewayClientOptions.MaxGrpcMessageBytes` (describing the 16 MiB default applied to both send and receive limits), and to `MxGatewayClientContractInfo.GatewayProtocolVersion` and `WorkerProtocolVersion` (describing their wire-compatibility / diagnostics purpose). Pure documentation change — no test needed; build remains warning-clean. + +### Client.Dotnet-007 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:185-192` | +| Status | Resolved | + +**Description:** The `AcknowledgeAlarmAsync` XML comment states the gateway authenticates against an `invoke:alarm-ack` scope, but `CLAUDE.md` documents the scope set without any `invoke:alarm-ack` sub-scope. The comment may describe an intended finer-grained scope that does not exist, misleading integrators about what API key they need. 
+ +**Recommendation:** Reconcile the comment with the actual server-side scope check, or update the scope documentation if sub-scopes were genuinely added; keep client doc and gateway auth model in sync. + +**Resolution:** (2026-05-18) Confirmed against the server-side authorization model: `GatewayGrpcScopeResolver.ResolveRequiredScope` has no arm for `AcknowledgeAlarmRequest`, so it falls to the `_ => GatewayScopes.Admin` default — the RPC actually requires the `admin` scope. No `invoke:alarm-ack` sub-scope exists anywhere in `GatewayScopes`. The client XML comment on `AcknowledgeAlarmAsync` was wrong, not the docs. Corrected the comment to state the gateway authorizes `AcknowledgeAlarmRequest` against the API key's `admin` scope and that there is no finer-grained alarm-ack sub-scope. Pure documentation change — no test needed. + +### Client.Dotnet-008 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayCliSecretRedactor.cs:9-17` | +| Status | Resolved | + +**Description:** The CLI redactor only removes the API key string when it was supplied via `--api-key`; `RunCoreAsync` passes `arguments.GetOptional("api-key")` to `Redact`. When the key comes from an environment variable (`--api-key-env`, the documented default path), `apiKey` is `null` and no redaction occurs. If a gRPC/transport error message ever echoes the bearer token, it would be printed unredacted. + +**Recommendation:** Resolve the effective API key (same logic as `ResolveApiKey`) before redacting, so the env-var-sourced key is also stripped from error output. + +**Resolution:** (2026-05-18) Confirmed against source: `MxGatewayClientCli.RunCoreAsync`'s catch block redacted only `arguments.GetOptional("api-key")`, so an env-var-sourced key (`--api-key-env`, default `MXGATEWAY_API_KEY`) was never stripped. 
Note `MxGatewayCliSecretRedactor` itself is correct — the defect was the caller passing the wrong value. Extracted a non-throwing `TryResolveApiKey` helper (used by both the existing `ResolveApiKey` and the catch block) that resolves `--api-key` then the `--api-key-env` environment variable; the catch block now redacts that effective key. Updated `clients/dotnet/README.md` (`smoke` paragraph) to state the CLI redacts the effective key whether from `--api-key` or `--api-key-env`. Regression test `MxGatewayClientCliTests.RunAsync_ErrorOutput_RedactsApiKey_WhenSourcedFromEnvironmentVariable` sets a test env var, forces a transport error echoing the key, and asserts the key is absent and `[redacted]` is present; verified red against the original `GetOptional("api-key")`-only redaction (key printed unredacted). + +### Client.Dotnet-009 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:26,339-348,445-448` | +| Status | Resolved | + +**Description:** Client.Dotnet-003 upgraded `MxGatewayClient._disposed` to an `int` accessed via `Interlocked.Exchange` / `Volatile.Read` so a concurrent `ThrowIfDisposed` cannot observe a stale value. The symmetric `GalaxyRepositoryClient._disposed` is still a plain unsynchronised `bool`: `DisposeAsync` reads `if (_disposed)` then writes `_disposed = true` without `Interlocked` or `Volatile`, and `ThrowIfDisposed` does an unsynchronised read. The Galaxy client is publicly `IAsyncDisposable` and exposes `TestConnectionAsync` / `GetLastDeployTimeAsync` / `DiscoverHierarchyAsync` / `WatchDeployEventsAsync` as legal-to-call-concurrently public APIs, so a concurrent dispose can produce the same torn-read race the gateway client fix prevented. The two clients also exhibit the same shape (gRPC channel + transport + retry pipeline), so the divergence is an accidental inconsistency. 
+ +**Recommendation:** Mirror Client.Dotnet-003 on `GalaxyRepositoryClient`: change `_disposed` to an `int`, use `Interlocked.Exchange(ref _disposed, 1) != 0` in `DisposeAsync`, and `Volatile.Read(ref _disposed) != 0` in `ThrowIfDisposed`. A duplicated `MxGatewaySession`-style close-lock drain is unnecessary because `GalaxyRepositoryClient` does not own a per-call `SemaphoreSlim`. + +**Resolution:** 2026-05-20 — Changed `GalaxyRepositoryClient._disposed` from `bool` to `int`; `DisposeAsync` now uses `Interlocked.Exchange(ref _disposed, 1) != 0` for the once-only guard and `ThrowIfDisposed` uses `Volatile.Read(ref _disposed) != 0`, mirroring the Client.Dotnet-003 fix on `MxGatewayClient`. + +### Client.Dotnet-010 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:638,896,1261,1279` | +| Status | Resolved | + +**Description:** Client.Dotnet-005 fixed the silent `Register` / `AddItem` / `AddItem2` handle-fallback to `reply.ReturnValue.Int32Value` inside `MxGatewaySession`, but the same fallback pattern was left in the CLI and is now also present in two new bench commands shipped after that fix. `BenchReadBulkAsync` (line 638) and `BenchStreamEventsAsync` (line 896) both do `int serverHandle = registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value;` after a register call, and `SmokeAsync` (lines 1261 and 1279) passes `reply => reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value` and the equivalent `AddItem?.ItemHandle` selector to `InvokeForHandleAsync`. After `EnsureProtocolSuccess` + `EnsureMxAccessSuccess` pass but the worker did not set the typed `register` / `add_item` oneof case, all four call sites silently produce a zero handle and proceed to drive the rest of the smoke / bench against an invalid handle — exactly the failure mode the SDK-level fix prevents. 
+ +**Recommendation:** Either delegate to the SDK helpers (`MxGatewaySession.RegisterAsync` / `AddItemAsync`) which already throw the descriptive `MxGatewayException` via `CreateMissingPayloadException`, or replicate the same null-check explicitly in `InvokeForHandleAsync` and the two bench commands. A unit test that enqueues an `Ok` reply with no typed payload through `FakeCliClient` and asserts the smoke / bench commands fail loudly would prevent regression. + +**Resolution:** 2026-05-20 — Added private CLI helpers `RequireRegisterServerHandle` and `RequireAddItemItemHandle` (with a shared `CreateMissingPayloadException` mirroring the SDK-level `MxGatewaySession` helper) that throw a descriptive `MxGatewayException` when the typed `register` / `add_item` payload is absent on an otherwise-successful reply. Replaced all four `?? reply.ReturnValue.Int32Value` fallback sites — `BenchReadBulkAsync` (line 638), `BenchStreamEventsAsync` (line 896), and both `SmokeAsync` selectors (lines 1261, 1279) — with these helpers, so the CLI now fails loudly with the same shape as the SDK helpers rather than silently driving the rest of the command against a zero handle. + +### Client.Dotnet-011 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:857-858,922-963,1014-1015` | +| Status | Resolved | + +**Description:** The new `bench-stream-events` command (added in commit `1cd51bb`) supports `--session-count > 1` and runs each session's `StreamEvents` reader in parallel via `openedSessions.Select(RunStreamAsync).ToArray()` then `Task.WhenAll`. 
Inside the per-session lambda the inner `Task.Run`-spawned event loop updates two shared `DateTime?` fields without synchronisation: + +```csharp +if (firstSteadyEventUtc is null) +{ + firstSteadyEventUtc = nowUtc; +} +lastSteadyEventUtc = nowUtc; +``` + +The integer counters next to them (`steadyEvents`, `steadyDataChangeEvents`, `warmupEvents`) use `Interlocked.Increment`, and the latency list uses an explicit `lock (latencyLock)`, so the rest of the loop is data-race-free — but these two `DateTime?` updates are not. With N parallel sessions a torn read on `firstSteadyEventUtc` produces a non-deterministic "first event time" and the final `steadyElapsedSeconds = (lastSteadyEventUtc.Value - firstSteadyEventUtc.Value).TotalSeconds` can compute a slightly wrong window. The user-visible impact is bench-only (skewed `eventsPerSecond` / `dataChangeEventsPerSecond` numbers), and on x64 the 64-bit `DateTime` field read/write happens to be atomic, so this is Low — but the pattern is inconsistent with the rest of the same loop. + +**Recommendation:** Either guard the two `DateTime?` updates with the existing `latencyLock` (cheapest), use `Interlocked.CompareExchange` for `firstSteadyEventUtc` and `Volatile.Write` for `lastSteadyEventUtc`, or aggregate per-session in local variables and reduce after `Task.WhenAll`. The reduce-after approach also fixes a related issue: today a faster session can stomp `firstSteadyEventUtc` after a slower one already set it. + +**Resolution:** 2026-05-20 — Guarded the `firstSteadyEventUtc` / `lastSteadyEventUtc` reads and writes inside the per-session event loop with the existing `latencyLock`. `firstSteadyEventUtc` now uses the null-coalescing assignment `firstSteadyEventUtc ??= nowUtc;` under the lock so a slower session can't stomp an earlier already-set value. The lock is already held by the latency-list append a few lines below, so the extra cost is one uncontended acquisition per event. 
The final read in the stats block runs after `Task.WhenAll` (happens-before applies) and stays lock-free. + +### Client.Dotnet-012 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `clients/dotnet/MxGateway.Client/MxGateway.Client.csproj`, `clients/dotnet/MxGateway.Client.Cli/MxGateway.Client.Cli.csproj`, `clients/dotnet/MxGateway.Client.Tests/MxGateway.Client.Tests.csproj` | +| Status | Resolved | + +**Description:** `src/Directory.Build.props` enforces `TreatWarningsAsErrors=true`, `EnforceCodeStyleInBuild=true`, `AnalysisLevel=latest`, and `Deterministic=true` for every gateway / worker / contracts project, and `CLAUDE.md` calls this out as a baseline build property. The .NET client projects live under `clients/dotnet/` and there is no `Directory.Build.props` at `clients/` or `clients/dotnet/` — so none of those properties apply to `MxGateway.Client`, `MxGateway.Client.Cli`, or `MxGateway.Client.Tests`. New warnings in the client do not break the build, and code-style violations are not blocked at build time. The `CSharpStyleGuide.md` baseline ("Treat compiler warnings as actionable") and the `CLAUDE.md` table under "Source Update Workflow" both apply equally to `.NET client` ("`dotnet build clients/dotnet/MxGateway.Client.sln`"), but the enforcement floor is missing. + +**Recommendation:** Add `clients/dotnet/Directory.Build.props` (or `clients/Directory.Build.props` covering Rust-Cargo siblings is N/A — only `clients/dotnet/`) carrying the same property set: `TreatWarningsAsErrors=true`, `EnforceCodeStyleInBuild=true`, `AnalysisLevel=latest`, `Deterministic=true`. Excluding generated code (which already lives under `src/MxGateway.Contracts/Generated`) is automatic because the client only references the contracts project. Build the client locally after adding it to confirm no warnings already snuck in. 
+ +**Resolution:** 2026-05-20 — Added `clients/dotnet/Directory.Build.props` mirroring `src/Directory.Build.props`: `LangVersion=latest`, `Nullable=enable`, `ImplicitUsings=enable`, `TreatWarningsAsErrors=true`, `AnalysisLevel=latest`, `EnforceCodeStyleInBuild=true`, `Deterministic=true`. The three client `.csproj` files inherit from it automatically. Re-ran `dotnet build clients/dotnet/MxGateway.Client.sln` and confirmed 0 warnings / 0 errors — no pre-existing warnings were silently being tolerated. + +### Client.Dotnet-013 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `clients/dotnet/MxGateway.Client/DiscoverHierarchyOptions.cs:3-24`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:185-187`, `clients/dotnet/MxGateway.Client.Cli/IMxGatewayCliClient.cs:6` | +| Status | Resolved | + +**Description:** Client.Dotnet-006 fixed three undocumented public members. Three more remain undocumented in code paths the prior review didn't visit: + +- `DiscoverHierarchyOptions` (the public record) has no `<summary>` on the type and no XML doc on any of its ten public properties (`RootGobjectId`, `RootTagName`, `RootContainedPath`, `MaxDepth`, `CategoryIds`, `TemplateChainContains`, `TagNameGlob`, `IncludeAttributes`, `AlarmBearingOnly`, `HistorizedOnly`). +- The second `DiscoverHierarchyAsync(DiscoverHierarchyOptions, CancellationToken)` overload on `GalaxyRepositoryClient` is `public` with no XML doc, while the parameterless overload one method above it carries a full XML doc block. +- `IMxGatewayCliClient` is a public interface in the CLI project with no `<summary>` on the type (the member docs are present). + +This is the same convention-violation shape Client.Dotnet-006 closed; CLAUDE.md style guidance describes XML docs on the public surface as the baseline expectation. + +**Recommendation:** Add `<summary>` docs to each undocumented member. 
For `DiscoverHierarchyOptions`, the property names map cleanly to the underlying `DiscoverHierarchyRequest` proto fields — a one-line summary per property and a type-level summary tying the record to the Galaxy hierarchy browse is enough. The CLI interface only needs a type-level summary; the members already document themselves. + +**Resolution:** 2026-05-20 — Added XML docs to all three call sites: a type-level summary plus a one-line summary per property on `DiscoverHierarchyOptions` (ten properties, mapped to the underlying `DiscoverHierarchyRequest` proto fields and noting the root-precedence rule); an XML doc block on the second `DiscoverHierarchyAsync(DiscoverHierarchyOptions, CancellationToken)` overload describing its filter semantics and transparent pagination; and a type-level `<summary>` on the public `IMxGatewayCliClient` interface explaining its CLI-only transport role and the production binding. + +### Client.Dotnet-014 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `clients/dotnet/MxGateway.Client.Tests/MxGatewayClientAlarmsTests.cs:76-98`, `clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:212-231` | +| Status | Resolved | + +**Description:** Client.Dotnet-002 closed a coverage gap where the production retry path (`RpcException` → `MxGatewayException` mapping by `RpcExceptionMapper.Map`) was never exercised, by adding a `MapTransportExceptions` flag to `FakeGatewayTransport` and a regression test that runs through the wrapped-exception branch. That flag is wired through `Translate(...)` in `OpenSessionAsync` / `CloseSessionAsync` / `InvokeAsync`, but the new alarm test path is not: `FakeGatewayTransport.AcknowledgeAlarmAsync` throws the queued exception verbatim (line 219), bypassing `Translate`. 
The accompanying `MxGatewayClientAlarmsTests.AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException` test acknowledges this in a comment ("Note: the FakeGatewayTransport surfaces RpcException directly … the SDK-level test pins the pass-through shape so a future migration to direct mapping won't silently change observable behaviour") and asserts `Assert.ThrowsAsync<RpcException>` — but the production path through `GrpcMxGatewayClientTransport.AcknowledgeAlarmAsync` (lines 120-134) already calls `RpcExceptionMapper.Map`, so production callers see `MxGatewayAuthenticationException` and not `RpcException`. The test name advertises mapping that the SDK-level harness doesn't exercise, so the alarm-ack mapping reached through `MxGatewayClient.AcknowledgeAlarmAsync` can regress without anybody noticing. + +**Recommendation:** Either route `FakeGatewayTransport.AcknowledgeAlarmAsync` through the same `Translate` helper the other RPCs use and add a regression test that enables `MapTransportExceptions = true` and asserts `MxGatewayAuthenticationException`; or rename the existing test to make the pass-through shape explicit (e.g. `…_SurfacesRpcExceptionFromFakeTransportVerbatim`) and add a second test exercising the production mapping. Either fix closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for `Invoke`. + +**Resolution:** 2026-05-20 — Applied both halves of the recommendation. Routed `FakeGatewayTransport.AcknowledgeAlarmAsync` through the same `Translate` helper the other RPCs use, so when `MapTransportExceptions = true` thrown `RpcException`s now run through the production `RpcExceptionMapper.Map`. 
Renamed the existing pass-through test to `AcknowledgeAlarmAsync_SurfacesRpcExceptionFromFakeTransportVerbatim_WhenMappingDisabled` (with an updated comment pinning that this shape only applies when mapping is off), and added a new test `AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException` that enables mapping and asserts the production-parity `MxGatewayAuthenticationException` with `StatusCode.Unauthenticated`. Closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for `Invoke`. + +### Client.Dotnet-015 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:221-236`, `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:596-1065` | +| Status | Resolved | + +**Description:** `CreateCancellation(arguments, command)` calls `cancellation.CancelAfter(timeout)` for every command except the explicitly long-running `galaxy-watch`, where `timeout` is `arguments.GetDuration("timeout", TimeSpan.FromSeconds(30))`. That same `--timeout` value is also fed into `CreateOptions` as `DefaultCallTimeout`, so the CLI uses one knob for two distinct things: per-call gRPC deadline and overall wall-clock cancellation budget. Both `bench-read-bulk` and `bench-stream-events` (introduced in `7db4bff` and `1cd51bb`) default to `--duration-seconds=30 --warmup-seconds=3`, which already exceeds the 30 s wall-clock budget; `bench-stream-events --session-count=N` adds another `750 ms × (N-1)` of `sessionStartStaggerMs` before the measurement window even opens. + +A manual invocation such as `dotnet run --project clients/dotnet/MxGateway.Client.Cli -- bench-stream-events --endpoint ... 
--api-key ...` therefore cancels mid-window every time: the outer `CancellationTokenSource` trips at 30 s and the bench's inner `await Task.Delay(steadyEnd - warmupStart, cancellationToken)` throws an `OperationCanceledException` before `firstSteadyEventUtc`/`lastSteadyEventUtc` are even populated, producing a zero `steadyElapsedSeconds` / `0 eventsPerSecond` JSON payload that looks like a backend failure but is a self-inflicted CLI cancellation. + +`scripts/bench-read-bulk.ps1` already works around this for `bench-read-bulk` by computing `$callTimeoutSeconds = [Math]::Max(60, $DurationSeconds + $WarmupSeconds + 30)` and passing `--timeout ${callTimeoutSeconds}s` (line 125), so the driver flow is correct. But there is no PowerShell wrapper for `bench-stream-events`, and the bench is documented (in its own XML summary on line 792) as a single-client harness intended to be run directly. The trap is silent: no error is printed, just suspiciously-small numbers. + +**Recommendation:** Either (a) extend the `isLongRunning` set in `CreateCancellation` to include `bench-read-bulk` and `bench-stream-events`, so manual invocation defers to caller-supplied `--timeout` and otherwise runs until the bench finishes; (b) compute an automatic minimum-floor `--timeout` for the bench commands from `duration-seconds + warmup-seconds + headroom` the way the PS driver does; or (c) split the `--timeout` knob into a distinct per-call `--call-timeout` and outer `--wall-clock-timeout` and document the two roles. Option (a) is the smallest change and matches the existing `galaxy-watch` precedent. Add a CLI test that runs `bench-read-bulk` with `--duration-seconds=2 --warmup-seconds=0 --timeout=1s` and asserts the bench either errors loudly or completes (today it silently emits zeros). 
+ +**Resolution:** 2026-05-20 — Applied option (a): extended the `isLongRunning` set in `CreateCancellation` from `command is "galaxy-watch"` to `command is "galaxy-watch" or "bench-read-bulk" or "bench-stream-events"`, so the two bench commands now run until they finish (or Ctrl+C) by default and only apply a wall-clock budget when the caller explicitly supplies `--timeout`. A caller-supplied `--timeout` still flows through to `DefaultCallTimeout` for per-attempt gRPC deadlines on the unary calls these benches make. Matches the existing `galaxy-watch` precedent and removes the silent zero-throughput failure mode without breaking the `scripts/bench-read-bulk.ps1` driver path (which explicitly raises `--timeout`). + +### Client.Dotnet-016 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:922-976` | +| Status | Resolved | + +**Description:** `BenchStreamEventsAsync.RunStreamAsync` launches the per-session stream reader inside a `Task.Run(async () => { ... }, streamCts.Token)` and stores the returned task in the local `streamTask`. The recovery block + +```csharp +await Task.Delay(steadyEnd - warmupStart, cancellationToken).ConfigureAwait(false); +streamCts.Cancel(); +try { await streamTask.ConfigureAwait(false); } +catch (OperationCanceledException) { } +catch (Grpc.Core.RpcException ex) when (ex.StatusCode is Grpc.Core.StatusCode.Cancelled) { } +``` + +only awaits `streamTask` (and therefore only observes its exception) when `Task.Delay` returns normally. When the outer `cancellationToken` cancels during the delay — exactly the case Client.Dotnet-015 makes likely — `Task.Delay` throws `OperationCanceledException` and skips both `streamCts.Cancel()` and the `await streamTask`. The inner stream task is still alive at that point. 
The `using CancellationTokenSource streamCts = ...` on line 924 disposes the linked CTS, which propagates cancellation to the inner stream (so it eventually exits), but the resulting `OperationCanceledException` / mapped `MxGatewayException` is never observed. The local `streamTask` reference is dropped as `RunStreamAsync` unwinds, leaving the task object eligible for garbage collection with an unobserved fault — a `TaskScheduler.UnobservedTaskException`. + +The secondary `Grpc.Core.RpcException` catch on line 975 is also dead in this code path: the production `GrpcMxGatewayClientTransport.StreamEventsAsync` always wraps `RpcException` via `RpcExceptionMapper.Map`, which returns `OperationCanceledException` for `StatusCode.Cancelled` (mapper line 31). So the inner task's cancellation exception is always `OperationCanceledException`, not `RpcException`. Harmless when the recovery block runs, but it underscores that the cancellation path was only tested for the happy case. + +**Recommendation:** Restructure `RunStreamAsync` so the inner `streamTask` is always observed. A `try { await Task.Delay(...) } finally { streamCts.Cancel(); try { await streamTask } catch (OperationCanceledException) {} catch (MxGatewayException) {} }` shape works (the `finally` runs even on outer cancellation). Alternatively, hoist `streamTask` into a local that the outer method's `try`/`finally` always awaits before exiting, so the per-session loop becomes `await Task.WhenAny(streamTask, Task.Delay(...))` then a guaranteed `await streamTask`. Drop the now-redundant `Grpc.Core.RpcException` catch or convert it to catch `MxGatewayException` for the wrapped shape (and document that it should never fire in production). + +**Resolution:** 2026-05-20 — Restructured `RunStreamAsync` to wrap the `Task.Delay` in `try { await Task.Delay(...) 
} finally { streamCts.Cancel(); try { await streamTask } catch (OperationCanceledException) {} catch (MxGatewayException) {} }`, so the inner stream task is observed on every path — including when the outer `cancellationToken` cancels during the delay. Dropped the dead `catch (Grpc.Core.RpcException ex) when (ex.StatusCode is Grpc.Core.StatusCode.Cancelled)` clause (the production `GrpcMxGatewayClientTransport.StreamEventsAsync` routes through `RpcExceptionMapper.Map`, which returns `OperationCanceledException` for `StatusCode.Cancelled`, so an `RpcException` never reaches here) and replaced it with `catch (MxGatewayException)` to absorb the wrapped shape for any non-cancellation mapper output. Added an inline comment naming the finding and documenting why the new catch shape is correct. Eliminates the latent `TaskScheduler.UnobservedTaskException` whenever the outer cancellation fires mid-measurement-window. + +### Client.Dotnet-017 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:1190-1262` | +| Status | Resolved | + +**Description:** Surfaced during the 2026-05-20 cross-language e2e matrix run: `dotnet run --project clients/dotnet/MxGateway.Client.Cli -- stream-events --endpoint http://localhost:5120 --api-key-env MXGATEWAY_API_KEY --timeout 60s --json --session-id session-... --max-events 200` exited with `-532462766` (unhandled-exception exit code) and propagated `System.OperationCanceledException: Call canceled by the client.` mapped from `Status(StatusCode="Cancelled", …)`. The CLI's `StreamEventsAsync` does `await foreach (... in client.StreamEventsAsync(...).WithCancellation(cancellationToken))` and never catches `OperationCanceledException`. 
When the caller's `--timeout` (driven by `CreateCancellation`'s `CancelAfter`) fires before `--max-events` is reached — the common case for a finite-window event collector against a quiet test rig — the foreach throws, the exception bubbles up, the process exits non-zero, and any `--json` aggregate output is never written. The other client CLIs (Go, Rust, Python, Java) all exit 0 in this case (e2e clients g/r/p ran clean). The bug is also a strict regression of the CLI's contract: callers can't tell "stream collected 0–N events then the budget closed" apart from "the call genuinely failed". + +**Recommendation:** Wrap the `await foreach` in `try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { /* graceful */ }`. The `when` clause ensures only the supplied cancellation token (which covers `--timeout`, Ctrl+C, and parent-CTS cancellation — all three of which are graceful completion modes for a finite-window collector) gets absorbed; a server-side cancellation propagated through a different token still surfaces. Keep the existing aggregate-JSON emission below the catch so the events that arrived before the budget closed are still emitted. Add a regression test that drives the CLI with `--timeout 1s` against a fake that yields a couple of events then parks on the cancellation token; assert exit 0, no stderr, and the JSON output contains both yielded events. + +**Resolution:** 2026-05-20 — Wrapped the `await foreach` in `try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { }` so the CLI exits 0 and emits the aggregate `{ "events": [...] }` JSON when the supplied token cancels (the `--timeout`, Ctrl+C, and parent-CTS paths all flow through that same token). The catch's `when` clause ensures non-token-driven cancellation still propagates. 
Added regression test `MxGatewayClientCliTests.RunAsync_StreamEvents_WhenTimeoutFiresAfterEvents_EmitsCollectedEventsAndExitsZero` that yields two events, parks on the cancellation token via a new `FakeCliClient.StreamHangAfterEvents` hook, runs the CLI with `--timeout 1s --json --max-events 200`, and asserts exit code 0, empty stderr, and both events present in the emitted aggregate JSON. Brings .NET stream-events behavior into parity with the Go, Rust, Python, and Java CLIs which all exit 0 on equivalent timeouts. diff --git a/code-reviews/Client.Go/findings.md b/code-reviews/Client.Go/findings.md new file mode 100644 index 0000000..4fcedb6 --- /dev/null +++ b/code-reviews/Client.Go/findings.md @@ -0,0 +1,434 @@ +# Code Review — Client.Go + +| Field | Value | +|---|---| +| Module | `clients/go` | +| Reviewer | Claude Code | +| Review date | 2026-05-24 | +| Commit reviewed | `d692232` | +| Status | Reviewed | +| Open findings | 0 | + +## Checklist coverage + +A re-review of commit `a020350` (which resolved Client.Go-011..016). `gofmt -l .`, +`go vet ./...`, `go build ./...`, and `go test ./... -count=1` are all clean. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Prior Client.Go-001/003/007/011 remain resolved. No new correctness bugs found. | +| 2 | mxaccessgw conventions | `gofmt -l .` and `go vet ./...` clean; Client.Go-004 stays resolved. No new issues. | +| 3 | Concurrency & thread safety | Client.Go-013 resolved. New issue: `runBenchReadBulk`'s warm-up + steady-state wall-clock loops ignore `ctx` cancellation, so a Ctrl+C or parent-cancel keeps spinning ReadBulk calls until the wall-clock deadline (Client.Go-018). | +| 4 | Error handling & resilience | Client.Go-014 resolved. 
New issue: `parseValue` returns bare `strconv` errors with no `%w` wrap and no CLI-context, so a typo like `-type int32 -value foo` surfaces as `strconv.ParseInt: parsing "foo": invalid syntax` without naming the flag — out of line with the GoStyleGuide "wrap errors with useful context using `%w`" rule (Client.Go-017). | +| 5 | Security | No issues found — TLS-by-default with TLS 1.2 floor, API-key redaction in CLI JSON output, no secret logging. | +| 6 | Performance & resource management | No issues found — `defer client.Close()` / `defer subscription.Close()` applied consistently; bench-read-bulk preallocates the latency slice. | +| 7 | Design-document adherence | No new issues. Lazy `grpc.NewClient` + readiness probe (Client.Go-005) and the shared `dial` helper (Client.Go-009) are applied uniformly across `Dial` and `DialGalaxy`. | +| 8 | Code organization & conventions | Client.Go-015 resolved. New issue: `runStreamEvents` does not install a signal handler (Ctrl+C kills the process abruptly), while `runGalaxyWatch` does — the two long-running stream commands have divergent shutdown UX (Client.Go-020). | +| 9 | Testing coverage | Client.Go-008/016 resolved. New issue: the six new bulk and bench subcommands (`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, `bench-read-bulk`) have no CLI-level unit tests — in particular the Client.Go-015 secured-flag-gating fix has no regression test (Client.Go-021). | +| 10 | Documentation & comments | Client.Go-010/012 resolved. New issue: `runGalaxyWatch` parses `-last-seen-deploy-time` with `time.RFC3339` (no fractional seconds), while `parseRfc3339Timestamp` for `-timestamp-value` accepts `time.RFC3339Nano` — the CLI advertises "RFC 3339" for both but quietly differs on sub-second support (Client.Go-019). | + +### 2026-05-24 review (commit d692232) + +Re-review pass at `d692232`. 
Two commits touched `clients/go/`: `397d3c5` +updated the proto root in `generate-proto.ps1` to `src/ZB.MOM.WW.MxGateway.Contracts/Protos`; +`d692232` dropped stale `SessionId =` lines from `AcknowledgeAlarmRequest` / +`AcknowledgeAlarmReply` fixtures in `alarms_test.go` after the proto retired +the field, and substituted `CorrelationId: req.GetClientCorrelationId()` for +the fake server's reply. Module path (`gitea.dohertylan.com/dohertj2/mxaccessgw/clients/go`) +and subpackage `mxgateway` intentionally unchanged per Go convention. +`go build ./...` and `go test ./...` are green at HEAD. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. | +| 2 | mxaccessgw conventions | No issues found — Go convention preserved (short lowercase package name, gitea-scoped module path). The generated `_pb.go` descriptor still carries `MxGateway.Contracts.Proto` in its csharp_namespace bytes; wire-level metadata not used by Go, intentionally not regenerated. | +| 3 | Concurrency & thread safety | No issues found in this diff. | +| 4 | Error handling & resilience | No issues found in this diff. | +| 5 | Security | No issues found in this diff. | +| 6 | Performance & resource management | No issues found in this diff. | +| 7 | Design-document adherence | No issues found in this diff. | +| 8 | Code organization & conventions | No issues found in this diff. | +| 9 | Testing coverage | No issues found in this diff — `alarms_test.go` correctly drops retired `session_id` from `AcknowledgeAlarmRequest` and retains it on `QueryActiveAlarmsRequest`. | +| 10 | Documentation & comments | No issues found in this diff. 
| + +## Findings + +### Client.Go-001 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Correctness & logic bugs | +| Location | `clients/go/mxgateway/errors.go:88-93`, `clients/go/mxgateway/errors.go:117-128` | +| Status | Resolved | + +**Description:** `MxAccessError.Unwrap` returns `e.Command` directly. `EnsureMxAccessSuccess` constructs `&MxAccessError{Reply: reply}` with `Command` left nil (the HRESULT / failing-`MxStatusProxy` path). When `Command` is a nil `*CommandError`, `Unwrap()` returns a non-nil `error` interface wrapping a nil pointer. Consequently `errors.As(err, &ce)` for `*CommandError` returns `true` while setting `ce` to nil — a caller writing the idiomatic `if errors.As(err, &commandErr) { use commandErr.Status }` nil-dereferences and panics. Verified empirically; the existing test only exercises the populated-`Command` path. + +**Recommendation:** Make `Unwrap` return an untyped nil when `Command` is nil: `if e == nil || e.Command == nil { return nil }; return e.Command`. Add a test for the HRESULT-only `MxAccessError` asserting `errors.As(err, &ce)` is `false`. + +**Resolution:** Resolved 2026-05-18: `MxAccessError.Unwrap` now returns an untyped nil when `Command` is nil, so `errors.As` no longer binds a typed-nil `*CommandError`; added `errors_test.go` regression coverage for the HRESULT-only and populated-`Command` paths. + +### Client.Go-002 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `clients/go/mxgateway/session.go:440-516` | +| Status | Resolved | + +**Description:** For the `Events`/`EventsAfter` compatibility API (`cancelWhenResultBufferFull == true`), when the 16-slot `results` channel is full `sendEventResult` cancels and returns `false`; the goroutine returns and `close(results)` runs — the consumer sees the channel close with **no `EventResult{Err: ...}` ever delivered**. 
A slow consumer cannot distinguish "stream ended normally" from "events were silently dropped." This contradicts the design doc's "libraries should not reorder, coalesce, or drop events by default", and a test currently pins this lossy behaviour. + +**Recommendation:** Before cancelling on a full buffer, deliver a terminal `EventResult` carrying an explicit error (e.g. `ErrEventBufferOverflow`). Document the behaviour on `Session.Events`; steer callers to `SubscribeEvents` (which blocks instead of dropping). + +**Resolution:** Resolved 2026-05-18: confirmed against source — on a full bounded buffer the compatibility path cancelled and closed `results` with no terminal result. Added the exported sentinel `ErrEventBufferOverflow` (`errors.go`); `sendEventResult` now, on a full buffer, cancels the stream then calls the new `deliverTerminalResult` helper, which evicts one of the oldest buffered events to make room and places `EventResult{Err: ErrEventBufferOverflow}` so it becomes the consumer's last item before the channel closes. The previously lossy regression test (`TestEventsAfterCancelsStreamWhenCompatibilityChannelIsAbandoned`) was re-pointed to assert the terminal `ErrEventBufferOverflow` result is delivered. `clients/go/README.md` now documents the bounded-buffer/overflow behaviour and steers no-loss callers to `SubscribeEvents`. + +### Client.Go-003 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Location | `clients/go/cmd/mxgw-go/main.go:517-532` | +| Status | Resolved | + +**Description:** `parseInt32List` calls `panic(err)` when an `item-handles` token fails to parse as an int32. The CLI is a documented user-facing tool; a typo like `-item-handles 1,foo` crashes the process with an unrecovered panic and stack trace instead of returning a clean error and exit code 2 like every other validation path in `main.go`. 
+ +**Recommendation:** Change `parseInt32List` to return `([]int32, error)` and have `runUnsubscribeBulk` propagate the error, matching `parseValue`'s pattern. + +**Resolution:** Resolved 2026-05-18: confirmed against source — `parseInt32List` called `panic(err)` on a malformed token. It now returns `([]int32, error)`, wrapping the bad token (`invalid item handle %q: %w`); `runUnsubscribeBulk` parses item handles before dialing and returns the error, so a typo flows through `runWithIO` to `os.Exit(2)` like other validation paths. Regression tests `TestParseInt32ListParsesValidTokens` and `TestParseInt32ListReturnsErrorOnMalformedToken` added to `cmd/mxgw-go/main_test.go`. + +### Client.Go-004 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | mxaccessgw conventions | +| Location | `clients/go/mxgateway/alarms_test.go:153-154`, `clients/go/mxgateway/galaxy_test.go:58-59` | +| Status | Resolved | + +**Description:** `gofmt -l` flags `alarms_test.go` and `galaxy_test.go` for misaligned struct-literal field padding. The Go client README lists `gofmt` as part of the workflow and the repo enforces style; unformatted committed code breaks `gofmt`-gated checks and CI. + +**Recommendation:** Run `gofmt -w mxgateway/alarms_test.go mxgateway/galaxy_test.go`. + +**Resolution:** Resolved 2026-05-18: confirmed `gofmt -l .` flagged both files for misaligned struct-literal padding. Ran `gofmt -w` on `mxgateway/alarms_test.go` and `mxgateway/galaxy_test.go`; `gofmt -l .` is now clean for the whole module. + +### Client.Go-005 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Design-document adherence | +| Location | `clients/go/mxgateway/client.go:64,68`, `clients/go/mxgateway/galaxy.go:83,87` | +| Status | Resolved | + +**Description:** The client uses `grpc.DialContext` with `grpc.WithBlock()`. In current grpc-go both are deprecated in favour of `grpc.NewClient` (lazy connection). 
`WithBlock` also changes failure semantics: a transient gateway-unavailable at dial time becomes a hard `Dial` error rather than a connection that recovers when the gateway comes up, working against the design doc's resilience intent. + +**Recommendation:** Migrate to `grpc.NewClient`; if a fail-fast connect probe is still wanted, do an explicit readiness wait bounded by `DialTimeout`, and update the doc comment. + +**Resolution:** Resolved 2026-05-18: confirmed `Dial`/`DialGalaxy` used the deprecated `grpc.DialContext` + `grpc.WithBlock` pair. Migrated both to the shared `dial(ctx, opts)` helper, which now builds a lazy connection with `grpc.NewClient` and runs an explicit `waitForReady` readiness probe (`Connect` + `WaitForStateChange` until `connectivity.Ready`) bounded by `DialTimeout` — preserving fail-fast behavior while letting an otherwise lazy connection recover when the gateway is briefly down. Note: `grpc.NewClient` defaults the target scheme to `dns`, so the bufconn test harnesses (`client_session_test.go`, `alarms_test.go`, `galaxy_test.go`) were updated to use `passthrough:///bufnet` so the fake target reaches the context dialer. New tests `TestDialFailsFastWhenGatewayUnreachable` and `TestDialReadinessProbeReachesReady` cover the probe; `go vet` reports no deprecation. `clients/go/README.md` documents the lazy-connect + readiness-probe semantics. + +### Client.Go-006 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `clients/go/mxgateway/errors.go:9-130` | +| Status | Resolved | + +**Description:** `docs/ClientLibrariesDesign.md` recommends a high-level error taxonomy (`TransportError`, `AuthenticationError`, `TimeoutError`, etc.). The Go client collapses all transport/gRPC failures into a single `GatewayError` with no way to classify transient (`Unavailable`, `DeadlineExceeded`) vs permanent (`Unauthenticated`, `InvalidArgument`) without manually unwrapping and calling `status.Code`. 
+ +**Recommendation:** Add a helper (e.g. `IsTransient(err) bool`) or expose the gRPC `codes.Code` on `GatewayError`, so retry/timeout/auth handling can be written without re-parsing the wrapped error. + +**Resolution:** Resolved 2026-05-18: implemented the recommended classification surface in `errors.go` rather than a full parallel type hierarchy (the existing `GatewayError`/`CommandError`/`MxAccessError` chain already separates transport from protocol from MXAccess failures). Added `GatewayError.Code()` (returns the wrapped gRPC `codes.Code`, `OK` for nil, `Unknown` for a non-status error) and the free function `IsTransient(err error) bool`, which unwraps through `*GatewayError` and any gRPC-status chain and reports `true` for `Unavailable`, `DeadlineExceeded`, `ResourceExhausted`, and `Aborted`. Tests `TestGatewayErrorCode` and `TestIsTransient` cover the matrix; `clients/go/README.md` documents both for retry/timeout/auth handling. + +### Client.Go-007 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/go/mxgateway/session.go:526-532` | +| Status | Resolved | + +**Description:** `newCorrelationID` returns an empty string when `crypto/rand.Read` fails, silently producing an `MxCommandRequest` with no correlation id. `rand.Read` failure is rare, but the failure mode (untraceable command, no error surfaced) is worse than failing loud, and the empty-id path is untested. + +**Recommendation:** Either propagate the error up through `invokeCommand`, or fall back to a time/counter-based id rather than an empty string. + +**Resolution:** Resolved 2026-05-18: confirmed `newCorrelationID` returned `""` on a `rand.Read` failure. It now falls back to a non-empty `"fallback--"` id built from `time.Now().UnixNano()` and a process-wide `atomic.Uint64` monotonic counter, so every command stays traceable even without entropy. 
The `crypto/rand` call was routed through a `randRead` package variable so the failure path is testable; `TestNewCorrelationIDFallsBackOnRandFailure` simulates a `rand.Read` failure and asserts the fallback id is non-empty, `fallback-` prefixed, and unique, and `TestNewCorrelationIDUsesRandEntropy` pins the happy path. + +### Client.Go-008 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `clients/go/mxgateway/` (test files) | +| Status | Resolved | + +**Description:** Several critical paths are untested: TLS credential resolution in `resolveTransportCredentials` (only the `Plaintext` path is exercised); the `callContext` deadline-shortening logic (`client.go:198-204`) including the negative-timeout disable case; and `NativeValue`/`NativeArray` for the array, raw-bytes, null, and unsupported-kind branches. + +**Recommendation:** Add unit tests for `resolveTransportCredentials` precedence, `callContext` deadline arithmetic, and `NativeValue`/`NativeArray` round-trips for every kind. + +**Resolution:** Resolved 2026-05-18: added `clients/go/mxgateway/coverage_test.go`. `TestResolveTransportCredentialsPrecedence` exercises every branch (explicit `TransportCredentials`, `Plaintext`, missing `CACertFile` error, `TLSConfig` + `ServerNameOverride`, default TLS floor) and `TestResolveTransportCredentialsDoesNotMutateTLSConfig` confirms the supplied `*tls.Config` is cloned. `TestCallContextDeadlineArithmetic` covers zero/default, negative-disable, positive timeout, caller-deadline-sooner-kept, and caller-deadline-later-shortened. `TestNativeValueEdgeKinds`, `TestNativeArrayEdgeKinds`, and `TestNativeValueUnsupportedKind` cover the null, raw-bytes (including the no-alias copy), array, timestamp-with-nil, and unsupported-kind branches. 
+ +### Client.Go-009 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `clients/go/mxgateway/galaxy.go:60-93,241-256`, `clients/go/mxgateway/client.go:41-74,190-205` | +| Status | Resolved | + +**Description:** `DialGalaxy`/`Dial` and `GalaxyClient.callContext`/`Client.callContext` are near-identical duplicates (dial-context setup, credential resolution, dial-option assembly, deadline arithmetic). A fix to one (e.g. the Client.Go-005 dial migration) must be applied twice and can drift. + +**Recommendation:** Extract a shared unexported `dial(ctx, opts)` and a free `callContext(opts, ctx)` function, and have both client constructors call them. + +**Resolution:** Resolved 2026-05-18: extracted the shared unexported `dial(ctx, opts) (*grpc.ClientConn, error)` (credential resolution, dial-option assembly, `grpc.NewClient`, readiness probe) and the free `callContext(ctx, callTimeout) (context.Context, context.CancelFunc)` into `client.go`. `Dial`/`DialGalaxy` and both `(*Client).callContext`/`(*GalaxyClient).callContext` methods now delegate to them; the duplicated dial and deadline code in `galaxy.go` was removed (its now-unused `errors` import dropped). This was done together with the Client.Go-005 migration so the `grpc.NewClient` change lives in exactly one place. + +### Client.Go-010 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `clients/go/mxgateway/client.go:39-40` | +| Status | Resolved | + +**Description:** The `Dial` doc comment states it configures "blocking dial cancellation from ctx." This describes the deprecated `WithBlock` behaviour; once Client.Go-005 is addressed the comment is misleading about how connection establishment and cancellation work. + +**Recommendation:** Reword to describe the actual connect/timeout semantics after resolving Client.Go-005, and clarify that `DialTimeout` bounds the initial connect attempt. 
+ +**Resolution:** Resolved 2026-05-18: alongside the Client.Go-005 migration, the `Dial` doc comment was rewritten to describe the lazy `grpc.NewClient` connection, the `DialTimeout`-bounded (default 10s, or ctx deadline when sooner) readiness probe, that a briefly-unavailable gateway recovers instead of producing a hard error, and that cancelling `ctx` aborts the probe. `DialGalaxy` and the new `dial`/`waitForReady`/`callContext` helpers carry matching doc comments. + +### Client.Go-011 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/go/mxgateway/alarms_test.go:66-73` | +| Status | Resolved | + +**Description:** `TestAcknowledgeAlarmRejectsNilRequest` contains a no-op `if` with an empty body whose intent is documented in a comment ("Accept either: the helper returned the literal sentinel, or the generic transport error — both prove nil was rejected"). The condition + +```go +if err == nil || !errors.Is(err, errors.Unwrap(err)) && err.Error() != "mxgateway: acknowledge alarm request is required" { + // ... +} +``` + +evaluates expressions for side effects only and asserts nothing — Go's `&&` binds tighter than `||`, the body is empty, and the actual nil check happens on the very next `if err == nil`. The block is effectively dead code masquerading as a check. It also evaluates `errors.Unwrap(err)` regardless of `err`'s shape, and would call `err.Error()` even when err might be a wrapped status error whose message wording the gateway is free to change — making the apparent assertion brittle on top of being dead. + +**Recommendation:** Drop the empty-body `if` entirely (the subsequent `if err == nil { t.Fatalf(...) }` already enforces the contract), or, if the intent is to additionally pin the literal error message for the sentinel path, replace it with a real assertion (`if err.Error() != "mxgateway: acknowledge alarm request is required" { t.Fatalf(...) 
}`) and remove the spurious `errors.Is(err, errors.Unwrap(err))` clause. + +**Resolution:** 2026-05-20 — Removed the empty-body `if` in `TestAcknowledgeAlarmRejectsNilRequest`; the subsequent `if err == nil { t.Fatalf(...) }` already enforces the nil-rejection contract without the dead, brittle compound predicate. + +### Client.Go-012 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `clients/go/cmd/mxgw-go/main.go:1063-1065`, `clients/go/cmd/mxgw-go/main.go:88-104` | +| Status | Resolved | + +**Description:** `writeUsage` lists the available subcommands as `version|open-session|close-session|register|add-item|advise|subscribe-bulk|unsubscribe-bulk|write|stream-events|smoke|galaxy-test-connection|galaxy-last-deploy|galaxy-discover|galaxy-watch`. Six subcommands wired into `run` are missing from this list: `read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, and `bench-read-bulk`. A user invoking `mxgw-go` with no args or an unknown command (the two paths that print this banner) sees an incomplete CLI surface and may believe the bulk-write / read-bulk families are not implemented. The README does document them, but the inline usage banner is the first source of truth a CLI user consults. + +**Recommendation:** Extend the usage string to include every command registered in the `switch args[0]` in `run`, or generate it from a single source-of-truth slice keyed on command name → handler so the two cannot drift again. + +**Resolution:** 2026-05-20 — `writeUsage` now lists the previously missing `read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, and `bench-read-bulk` subcommands alongside the original surface, so the no-args / unknown-command banner reflects every command wired into `run`. 
+ +### Client.Go-013 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `clients/go/cmd/mxgw-go/main.go:1246-1249`, `clients/go/cmd/mxgw-go/main.go:1257-1262` | +| Status | Resolved | + +**Description:** In `runGalaxyWatch`, the signal-cancellation branch carefully drains the buffered `events` channel after `cancelStream()` so the `WatchDeployEvents` goroutine can exit (`for range events { }`). The limit-reached branch (`if *limit > 0 && count >= *limit { cancelStream(); return nil }`) skips that drain and returns immediately. After the function returns, `defer client.Close()` runs and tears down the gRPC connection; in the gap before the connection close propagates, the WatchDeployEvents goroutine may still be blocked on `case events <- event:` (the channel is buffered to 16 but a slow producer can refill it) — the goroutine then exits via `<-ctx.Done()` because `streamCtx` was cancelled, so it isn't a permanent leak, but the two cancellation paths behave inconsistently and the limit-reached path can briefly hold a goroutine plus the gRPC stream while the client tears down underneath it. + +**Recommendation:** Factor the drain into a helper and use it from both branches, e.g. after `cancelStream()` always `for range events { }` (and let the surrounding `select`/`for` re-evaluate `<-errs` if a terminal error was already buffered). Alternatively, drop the explicit drain in both branches and rely on `defer cancelStream()` plus `defer client.Close()` — but pick one model and apply it consistently. + +**Resolution:** 2026-05-20 — The limit-reached branch in `runGalaxyWatch` now drains the buffered `events` channel (`for range events { }`) after `cancelStream()`, matching the signal-cancel branch. Both cancellation paths now wait for the `WatchDeployEvents` goroutine to exit before `defer client.Close()` tears the gRPC connection down. 
+ +### Client.Go-014 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `clients/go/mxgateway/session.go:602`, `clients/go/mxgateway/galaxy.go:189` | +| Status | Resolved | + +**Description:** Two stream Recv loops compare end-of-stream with `err == io.EOF` directly: + +- `session.go:602` — `if err == io.EOF || status.Code(err) == codes.Canceled || streamCtx.Err() != nil { return }` +- `galaxy.go:189` — `if recvErr == io.EOF { return }` + +gRPC's generated `Recv()` does return the `io.EOF` sentinel directly today, so the comparisons work in practice. However, the Go idiom (and the project's `docs/style-guides/GoStyleGuide.md`) is to use `errors.Is(err, io.EOF)` so future wrapping (e.g. an interceptor decorating Recv errors) does not silently flip the loop from "stream finished normally" to "stream produced an error". The mxgateway client itself wraps non-EOF Recv errors in `*GatewayError`, which `errors.Is` already supports — using `errors.Is` keeps both paths consistent. + +**Recommendation:** Replace `recvErr == io.EOF` / `err == io.EOF` with `errors.Is(err, io.EOF)` (the `errors` package is already imported in both files). + +**Resolution:** 2026-05-20 — Both stream Recv loops now use `errors.Is(err, io.EOF)`: `session.go` already imported `errors`, and `galaxy.go` gained the missing `errors` import alongside the `recvErr == io.EOF` → `errors.Is(recvErr, io.EOF)` change, keeping EOF detection robust against any future Recv-error wrapping. 
+ +### Client.Go-015 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `clients/go/cmd/mxgw-go/main.go:410-512` | +| Status | Resolved | + +**Description:** `runWriteBulkVariant(ctx, args, stdout, stderr, command, withTimestamp, secured bool)` accepts `secured` but never uses it — the routing is keyed on `command` (the string `"write-bulk"` / `"write2-bulk"` / `"write-secured-bulk"` / `"write-secured2-bulk"`). The function ends with `_ = secured // currently only used for routing above; reserved for future per-variant validation`, which is misleading because `secured` is not in fact used for routing. The four wrapper functions (`runWriteBulk`, `runWrite2Bulk`, `runWriteSecuredBulk`, `runWriteSecured2Bulk`) all pass a `secured` argument that has no effect. The two CLI options `-current-user-id` and `-verifier-user-id` are unconditionally registered on every variant, including the non-secured ones, so a `write-bulk` invocation that passes `-current-user-id 42` silently does nothing. Either remove `secured` and the dead `_ = secured` comment, or use it to gate the registration of secured-only flags so wrong combinations are rejected with a clean error. + +**Recommendation:** Drop the `secured` parameter (the `command` switch already distinguishes the four variants) and the misleading `_ = secured` line; or, if validation is the goal, branch flag registration on `secured` so secured-only flags are unavailable for the non-secured variants and emit a clean usage error if they appear. + +**Resolution:** 2026-05-20 — Dropped the unused `secured` parameter from `runWriteBulkVariant` (the `command` switch already distinguishes the four variants) and removed the misleading `_ = secured` line.
The variant is now derived locally from `command` and used to gate flag registration: `-current-user-id` / `-verifier-user-id` are only registered for the secured variants and `-user-id` only for Write/Write2, so a wrong-variant flag now fails with a clean `flag provided but not defined` usage error instead of silently no-op'ing. The four `runWrite*Bulk` wrappers were updated to match the new signature. + +### Client.Go-016 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `clients/go/mxgateway/galaxy_test.go:382-429` | +| Status | Resolved | + +**Description:** `fakeGalaxyServer.watchSendInterval` is declared on the test fake and consulted inside `WatchDeployEvents` (`if s.watchSendInterval > 0 { ... }`) but no test in the package sets a non-zero value. The dead field plus its branch were presumably added to support a backpressure / pacing test that was never landed, and now the only effect is reader confusion ("which test uses this?") and a pointlessly larger fake. Backpressure on the bootstrap-plus-events sequence is also genuinely worth testing, given that `WatchDeployEvents` writes to a 16-deep buffered channel. + +**Recommendation:** Either delete the unused `watchSendInterval` field and its branch in `WatchDeployEvents`, or add the test it was added for — e.g. one that pumps more than 16 events with a small interval and asserts the consumer keeps up without losing or reordering events. Linking the field to a `// for TestX` comment if it stays would also help. + +**Resolution:** 2026-05-20 — Removed the unused `watchSendInterval` field from `fakeGalaxyServer` and the corresponding `if s.watchSendInterval > 0 { ... }` branch in `WatchDeployEvents`; no test set the field, so the dead code path is gone and the fake is leaner. `gofmt -w` reflowed the struct to drop the no-longer-needed field-name padding. 
+ +### Client.Go-017 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `clients/go/cmd/mxgw-go/main.go:954-991` | +| Status | Resolved | + +**Description:** `parseValue` returns the raw `strconv.ParseBool` / `strconv.ParseInt` / `strconv.ParseFloat` error verbatim — no wrap with `%w` and no indication of which CLI flag was the source. A user running `mxgw-go write -type int32 -value foo` sees + +``` +strconv.ParseInt: parsing "foo": invalid syntax +``` + +with no mention of `-value`, `-type`, or which subcommand failed. The same pattern hits every typed branch (bool, int32, int64, float, double). Compare with the sibling helpers in the same file: `parseInt32List` wraps with `"invalid item handle %q: %w"` (Client.Go-003 resolution) and `parseRfc3339Timestamp` wraps with `"invalid RFC 3339 timestamp %q: %w"`. `parseValue` was missed and is inconsistent with those two. The GoStyleGuide (`docs/style-guides/GoStyleGuide.md`, "Errors" section) requires "Wrap errors with useful context using `%w`." + +**Recommendation:** Wrap each `strconv` error with the offending input and type, e.g. `return nil, fmt.Errorf("invalid %s value %q: %w", valueType, valueText, err)`. The wrapper handles all five typed branches uniformly without a per-branch change. + +**Resolution:** 2026-05-20 — Each typed branch of `parseValue` now wraps the bare `strconv` error with `%w` and names the offending flag and value (`"invalid -value for -type %s: %q: %w"`), so `mxgw-go write -type int32 -value foo` surfaces the source flag, the requested type, and the bad token while still letting `errors.Is/As` reach the underlying `strconv` sentinel. The new `TestParseValueWrapsStrconvErrorWithFlagContext` table-test pins all five typed branches (bool, int32, int64, float, double) to the new wrapper shape. 
+ +### Client.Go-018 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `clients/go/cmd/mxgw-go/main.go:593-623` | +| Status | Resolved | + +**Description:** `runBenchReadBulk`'s warm-up and steady-state loops are wall-clock-only: + +```go +for time.Now().Before(warmupDeadline) { + _, _ = session.ReadBulk(ctx, serverHandle, tags, timeout) +} +... +for time.Now().Before(steadyDeadline) { + callStart := time.Now() + results, err := session.ReadBulk(ctx, serverHandle, tags, timeout) + ... +} +``` + +Neither loop checks `ctx.Done()` / `ctx.Err()`. If the parent context is cancelled (e.g. the operator Ctrl+Cs the benchmark, or the cross-language bench driver `scripts/bench-read-bulk.ps1` times out and kills the child early), the loops keep iterating until their wall-clock deadlines elapse. Each `ReadBulk` call inside fails fast (the gRPC call inherits the cancelled context and returns `context.Canceled`), but the steady-state loop counts those as `failedCalls++` and keeps spinning — wasting CPU and inflating the `failedCalls` and `latencyMs.max` figures the PowerShell driver collates across all five clients. The .NET, Rust, Python, and Java bench drivers should be checked for the same shape, but the Go one is the only one being reviewed here. Note that `runBenchReadBulk` is the only Go CLI command that does NOT register its own signal handler (compare with `runGalaxyWatch` which does via `signal.NotifyContext`). + +**Recommendation:** Drop out of both loops as soon as `ctx.Err() != nil`. Concretely, change the loop conditions to `for time.Now().Before(warmupDeadline) && ctx.Err() == nil` (and the same on `steadyDeadline`), or use a `select { case <-ctx.Done(): break loop; default: }` guard at the top of each iteration. The cross-language bench shape (`durationMs`, `totalCalls`, `failedCalls`, `latencyMs`) stays the same — the bench just exits sooner and reports the truncated window faithfully. 
+ +**Resolution:** 2026-05-20 — Both the warm-up and steady-state loops in `runBenchReadBulk` now carry an `&& ctx.Err() == nil` guard alongside the wall-clock check, so a cancelled parent context (Ctrl+C, or the cross-language bench driver killing the child early) breaks the loop instead of spinning failing `ReadBulk` calls until the deadline elapses. The cross-language bench JSON shape is unchanged — the truncated window is just reported faithfully via `durationMs` / `totalCalls`. + +### Client.Go-019 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `clients/go/cmd/mxgw-go/main.go:710-716`, `clients/go/cmd/mxgw-go/main.go:1204,1213` | +| Status | Resolved | + +**Description:** The CLI advertises two timestamp flags as "RFC3339" but parses them with different layouts: + +- `-timestamp-value` (write2/write-secured2 bulk): `parseRfc3339Timestamp` uses `time.RFC3339Nano`, which accepts both `2026-04-28T10:00:00Z` and `2026-04-28T10:00:00.123456789Z`. +- `-last-seen-deploy-time` (galaxy-watch): `time.Parse(time.RFC3339, ...)`, which rejects fractional seconds. + +A user copy-pasting an `ObservedAt` timestamp from `galaxy-watch -json` (which is emitted as `RFC3339Nano` by `formatDeployEvent`) directly into `-last-seen-deploy-time` will get a parse error if the source value carried a fractional component, even though both flag descriptions say "RFC3339". The flag help string at `main.go:1204` literally says "RFC3339 timestamp", and the README example uses `2026-04-28T10:00:00Z` (whole seconds only), so the issue is silent until a fractional timestamp comes from the gateway. + +**Recommendation:** Switch the `galaxy-watch` parse to `time.RFC3339Nano` to match `parseRfc3339Timestamp` (and the gateway's own emit format). One line change at `main.go:1213`. While there, update the flag help string and the README example to say "RFC 3339 (with optional fractional seconds)" so the two flags are documented uniformly. 
+ +**Resolution:** 2026-05-20 — `runGalaxyWatch` now parses `-last-seen-deploy-time` with `time.RFC3339Nano`, matching `parseRfc3339Timestamp` and the gateway's own `formatDeployEvent` emit format; the layout is strictly broader than the previous `time.RFC3339` (whole-second values still parse). The flag help string changed to "RFC 3339 timestamp (with optional fractional seconds)" and the `clients/go/README.md` example was extended with an explicit fractional-seconds line so the two flags advertise the same surface. + +### Client.Go-020 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `clients/go/cmd/mxgw-go/main.go:753-802`, `clients/go/cmd/mxgw-go/main.go:1199-1275` | +| Status | Resolved | + +**Description:** The two long-running stream commands have divergent Ctrl+C UX: + +- `runGalaxyWatch` registers a signal handler: + + ```go + signalCtx, stopSignals := signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM) + defer stopSignals() + streamCtx, cancelStream := context.WithCancel(signalCtx) + ``` + + so Ctrl+C drains buffered events and returns cleanly. + +- `runStreamEvents` does not register any signal handler — its parent context is `context.Background()` from `runWithIO`, so Ctrl+C abruptly kills the process. The deferred `subscription.Close()` and `client.Close()` never run, leaving the server-side stream to fault out on a torn TCP connection rather than a clean cancel. + +The two commands are otherwise structurally identical (subscribe + loop until limit or external stop) — the inconsistency is one half of a pair that was missed when `galaxy-watch` was added. Worth flagging because it directly affects what an integrator who Ctrl+Cs `stream-events` sees in the gateway's logs (a transport reset rather than a `codes.Canceled`). 
+ +**Recommendation:** Mirror the `runGalaxyWatch` pattern in `runStreamEvents`: wrap `ctx` in `signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM)`, derive `streamCtx` from it, and let `defer subscription.Close()` / `defer cancelStream()` tear the stream down on signal. The change is roughly six lines and brings the two stream commands into parity. Optionally factor a shared `withSignals(ctx) (context.Context, context.CancelFunc)` helper if a third stream command lands. + +**Resolution:** 2026-05-20 — `runStreamEvents` now installs `signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM)` (with a deferred `stopSignals()`) and derives `streamCtx` from the resulting signal-aware context, mirroring `runGalaxyWatch`. Ctrl+C now cancels the gRPC stream cleanly — the gateway sees `codes.Canceled` instead of a torn TCP connection — and the deferred `subscription.Close()` / `client.Close()` actually run on signal. The two long-running stream commands now share the same shutdown UX. + +### Client.Go-021 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `clients/go/cmd/mxgw-go/main_test.go`, `clients/go/cmd/mxgw-go/main.go:363-520,522-655` | +| Status | Resolved | + +**Description:** The six bulk / bench subcommands wired into `run` (`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, `bench-read-bulk`) have **no CLI-level unit tests** in `main_test.go`. In particular, the Client.Go-015 resolution claims: + +> `-current-user-id` / `-verifier-user-id` are only registered for the secured variants and `-user-id` only for Write/Write2, so a wrong-variant flag now fails with a clean `flag provided but not defined` usage error instead of silently no-op'ing. + +But there is no test asserting that, e.g., `mxgw-go write-bulk -current-user-id 1 ...` returns a "flag provided but not defined" error, or that `mxgw-go write-secured-bulk -user-id 1 ...` does the same. 
A future refactor of `runWriteBulkVariant` (notably one that re-introduced the `secured` parameter) could silently re-permit the wrong flags without breaking any test. The same gap applies to: parameter validation in `runReadBulk` (bulk size, empty session/items rejection), the value-count vs handle-count mismatch error in `runWriteBulkVariant:447`, and `runBenchReadBulk`'s `bulk-size`/`duration-seconds` positivity checks. + +`mxgateway/client_session_test.go` already covers the library-level happy paths (`TestWriteBulkBuildsOneBulkCommandAndReturnsPerEntryResults`, `TestReadBulkForwardsTimeoutAndUnpacksCachedFlag`, `TestSubscribeBulkBuildsOneBulkCommandAndReturnsResults`), so this finding is about CLI surface area only. + +**Recommendation:** Add table-driven tests in `cmd/mxgw-go/main_test.go` along the existing `TestParseInt32List*` and `TestParseValueBuildsTypedValue` style: + +- `TestRunWriteBulkVariantGatesSecuredFlags`: invoke `runWithIO` with `write-bulk -current-user-id 1 ...` and `write-secured-bulk -user-id 1 ...`, assert each returns an error matching `flag provided but not defined`. +- `TestRunReadBulkRejectsMissingArgs`: invoke `runWithIO` with `read-bulk` (no `-session-id`), assert the documented "session-id and items are required" error. +- `TestRunBenchReadBulkRejectsNonPositiveBulkSize` / `TestRunBenchReadBulkRejectsNonPositiveDuration`: pin the positivity checks at `main.go:544-549`. +- `TestRunWriteBulkVariantRejectsMismatchedHandlesAndValues`: pin the `len(handles) != len(valueTexts)` error at `main.go:447`. + +Each is a few lines and routes through the existing `runWithIO` entry point, so it does not need a bufconn fake. 
+ +**Resolution:** 2026-05-20 — Added CLI-level table-driven regression tests in `cmd/mxgw-go/main_test.go` routed through `runWithIO`, so they need no bufconn fake: `TestRunWriteBulkVariantGatesSecuredFlags` pins Client.Go-015 by asserting `write-bulk -current-user-id`, `write-bulk -verifier-user-id`, `write2-bulk -current-user-id`, `write-secured-bulk -user-id`, and `write-secured2-bulk -user-id` all surface `flag provided but not defined`; `TestRunReadBulkRejectsMissingArgs` pins the "session-id and items are required" check across no-flags / missing-items / missing-session-id; `TestRunBenchReadBulkRejectsNonPositiveBulkSize` and `TestRunBenchReadBulkRejectsNonPositiveDuration` pin the positivity checks; `TestRunWriteBulkVariantRejectsMismatchedHandlesAndValues` pins the explicit `item-handles count ... does not match values count ...` error. `go test ./...` passes. diff --git a/code-reviews/Client.Java/findings.md b/code-reviews/Client.Java/findings.md new file mode 100644 index 0000000..612723f --- /dev/null +++ b/code-reviews/Client.Java/findings.md @@ -0,0 +1,520 @@ +# Code Review — Client.Java + +| Field | Value | +|---|---| +| Module | `clients/java` | +| Reviewer | Claude Code | +| Review date | 2026-05-24 | +| Commit reviewed | `d692232` | +| Status | Reviewed | +| Open findings | 5 | + +## Checklist coverage + +A third-pass review against commit `a020350` (the sweep that resolved +Client.Java-013 through Client.Java-020). Prior findings are unchanged; new +findings raised in this pass are numbered Client.Java-021 onward. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issues found: `stream-events` CLI text path still prints the proto `uint64 worker_sequence` with `%d` (Client.Java-023), the same bug Client.Java-020 fixed for `galaxy-watch`; `bench-read-bulk` includes failed-call durations in its success-latency histogram (Client.Java-024), mirroring the bug Client.Rust-015 fixed in Rust. 
| +| 2 | mxaccessgw conventions | No new issues found in this pass. | +| 3 | Concurrency & thread safety | Issue found: `DeployEventStream` did not receive the deterministic terminal-state serialisation that Client.Java-002 added to `MxEventStream`, so a concurrent queue-overflow + `close()` race can still erase the overflow signal (Client.Java-021). | +| 4 | Error handling & resilience | No new issues found in this pass. | +| 5 | Security | No new issues found in this pass. The Client.Java-018 regex correctly handles colon/comma/quote/paren/URL embeddings and is verified by the existing fixture tests. | +| 6 | Performance & resource management | No new issues found in this pass. `shutdownTimeout` is consistently honoured everywhere `ownedChannel.shutdown()` is called — both clients delegate to the shared `MxGatewayChannels.shutdown` / `shutdownAndAwaitTermination` helpers. | +| 7 | Design-document adherence | No new issues found in this pass. | +| 8 | Code organization & conventions | Issue found: the CLI `CommonOptions.toClientOptions()` does not propagate `shutdownTimeout` to the underlying `MxGatewayClientOptions`, so CLI users have no way to override the new option introduced by Client.Java-019 (Client.Java-025). | +| 9 | Testing coverage | Issue found: there is no CLI-level test coverage for the `read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, or `bench-read-bulk` subcommands — Client.Java-013 noted this as out-of-scope but never filed a follow-up (Client.Java-026). | +| 10 | Documentation & comments | Issue found: `MxGatewayChannels.toCompletable` Javadoc claims chained `thenApply` futures forward `cancel()` upstream to `CancellingCompletableFuture`, which is not true of `CompletableFuture.thenApply`; the implementation works only because all validator chains are inlined into the new `toCompletable(source, operation, validator)` overload (Client.Java-022). 
| + +### 2026-05-24 review (commit d692232) + +Re-review pass at `d692232`. Diff against `a020350` is commit `397d3c5`: +gradle subprojects renamed `mxgateway-client` → `zb-mom-ww-mxgateway-client`, +`mxgateway-cli` → `zb-mom-ww-mxgateway-cli`. Java package change +`com.dohertylan.mxgateway.*` → `com.zb.mom.ww.mxgateway.*` (source directories +moved from `com/dohertylan/mxgateway/` to `com/zb/mom/ww/mxgateway/`). +`settings.gradle` and root `build.gradle` updated for the new project / +group names. CLI mainClass updated. The `gradle build` task at HEAD is green; +documentation updates lag — see the doc-side findings below. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. | +| 2 | mxaccessgw conventions | Issues found: Client.Java-031 (README prose still uses the old short project names in plain text — the actual subproject names that drive Gradle / IDE imports are the prefixed ones). | +| 3 | Concurrency & thread safety | No issues found in this diff. | +| 4 | Error handling & resilience | No issues found in this diff. | +| 5 | Security | No issues found in this diff. | +| 6 | Performance & resource management | No issues found in this diff. | +| 7 | Design-document adherence | Issues found: Client.Java-028 (`JavaClientDesign.md` build-layout example still cites the old `com/dohertylan/mxgateway/` package paths). | +| 8 | Code organization & conventions | Issues found: Client.Java-030 (no new test exercises the regenerated `QueryActiveAlarmsRequest` RPC path). | +| 9 | Testing coverage | Cross-reference to Client.Java-030; the bulk-command gaps tracked under Client.Java-026 remain. 
| +| 10 | Documentation & comments | Issues found: Client.Java-027 (Gradle task names in README and JavaClientDesign still reference the old `:mxgateway-client:` and `:mxgateway-cli:` paths — every command in the README breaks if copy-pasted); Client.Java-029 (`README.md:209` cites `zb-mom-ww-mxgateway-cli/build/install/mxgateway-cli` but the actual install path contains a doubled directory `zb-mom-ww-mxgateway-cli/build/install/zb-mom-ww-mxgateway-cli/`). | + +## Findings + +### Client.Java-001 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Security | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySecrets.java:30-32` | +| Status | Resolved | + +**Description:** `redactApiKey` preserves the leading and trailing four characters of the key. A gateway API key has the form `mxgw__`; the last four characters belong to the secret portion, so the "redacted" form leaks 4 characters of the actual secret into logs, CLI JSON output (`CommonOptions.redactedJsonMap`), and `MxGatewayClientOptions.toString()`. CLAUDE.md states API keys must never reach logs. + +**Recommendation:** Redact the secret entirely. Show only a stable non-secret prefix (e.g. the `mxgw__` portion) and mask everything after it, or emit a fixed `mxgw_***` form. Do not echo any trailing characters of the secret. + +**Resolution:** (2026-05-18) Confirmed against source: the old `substring(0,4) + stars + substring(len-4)` echoed the last four secret characters. `redactApiKey` now masks the secret entirely: for gateway-shaped keys it returns the non-secret `mxgw__` prefix followed by `***` (locating the secret separator as the first `_` after `mxgw_`); any non-gateway-shaped token returns ``. No leading/trailing secret characters are ever emitted. The pre-existing `MxGatewayCliTests.openSessionJsonRedactsApiKey` assertion that hardcoded the leaky `mxgw***********cret` form was corrected to assert the masked `mxgw_visible_***` form. 
Regression tests: `MxGatewayMediumFindingsTests.redactApiKeyDoesNotLeakAnyCharacterOfTheSecret`, `redactApiKeyForNonGatewayShapedKeyRevealsNothing`, `redactApiKeyStillHandlesNullAndShortInput`. + +### Client.Java-002 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:31,66-92` | +| Status | Resolved | + +**Description:** The `next` field is a plain (non-volatile) instance field, and `MxEventStream` exposes no thread-confinement guarantee. More concretely, a queue-overflow `offer()` and a `close()` `offer(END)` can interleave so the overflow exception is enqueued after `END` and never observed — the contract that "next() throws after overflow" is not guaranteed once `close()` has been called. + +**Recommendation:** Document single-consumer-thread usage explicitly in the Javadoc, and serialise terminal state transitions (overflow vs END vs close) behind a single guarded flag so the first terminal condition wins deterministically. + +**Resolution:** (2026-05-18) Confirmed against source: the old `offer()` END-branch did `queue.clear(); queue.offer(END)` when full, so a `close()` arriving after an overflow wiped the already-enqueued overflow exception, leaving the consumer with a clean end-of-stream and the overflow silently lost. Terminal transitions are now serialised through a single `terminate(MxGatewayException)` method guarded by a `terminated` flag and a `terminalLock`; the first terminal condition wins and a later `close()`/`END` cannot overwrite a published overflow fault. The Javadoc now explicitly documents that the iterator methods are single-consumer-only while `close()` is safe from any thread. Regression tests: `MxGatewayMediumFindingsTests.eventStreamOverflowExceptionSurvivesASubsequentClose` (deterministic) and `eventStreamConcurrentOverflowAndCloseAlwaysTerminate` (300-iteration race stress). 
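
**Illustration (not project code):** A minimal sketch of the "first terminal condition wins" rule described for Client.Java-002, assuming a simplified stand-in for `MxEventStream`. The class `TerminalGate`, its `poll()` method, and the use of `IllegalStateException` in place of `MxGatewayException` are hypothetical; only the `terminated` flag and `terminalLock` names are taken from the resolution text.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in for the stream's terminal-state handling. */
final class TerminalGate {
    /** Sentinel that marks a clean end-of-stream. */
    static final Object END = new Object();

    private final Object terminalLock = new Object();
    private final BlockingQueue<Object> queue;
    private boolean terminated; // guarded by terminalLock

    TerminalGate(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Enqueues an event; a full queue is converted into an overflow fault. */
    boolean offer(Object event) {
        synchronized (terminalLock) {
            if (terminated) {
                return false;
            }
            if (queue.offer(event)) {
                return true;
            }
            // Still holding the lock (monitors are reentrant), so no close()
            // can slip in between detecting the overflow and publishing it.
            terminate(new IllegalStateException("event queue overflow"));
            return false;
        }
    }

    /**
     * Publishes the terminal marker. Only the first caller succeeds;
     * a null fault means a clean END.
     */
    boolean terminate(RuntimeException fault) {
        synchronized (terminalLock) {
            if (terminated) {
                return false; // an earlier terminal condition already won
            }
            terminated = true;
            Object marker = (fault != null) ? fault : END;
            if (!queue.offer(marker)) {
                queue.clear(); // make room only when the buffer is full
                queue.offer(marker);
            }
            return true;
        }
    }

    /** Safe from any thread; never overwrites a published fault. */
    void close() {
        terminate(null);
    }

    /** Single-consumer read; returns null when nothing is buffered. */
    Object poll() {
        return queue.poll();
    }
}
```

In this sketch a `close()` that arrives after an overflow is a no-op, so the consumer still observes the fault rather than a clean end-of-stream.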
+ +### Client.Java-003 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | mxaccessgw conventions | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:119-140` | +| Status | Resolved | + +**Description:** `OpenSessionReply` carries `gateway_protocol_version` (proto field 8), and `MxGatewayClientVersion.GATEWAY_PROTOCOL_VERSION` exists so the client can reject incompatible generated-code inputs. The client never reads `reply.getGatewayProtocolVersion()` nor compares it against the compiled-in version. A client built against an older/newer contract issues commands blindly and fails with confusing downstream errors instead of a clear version-mismatch failure. + +**Recommendation:** In `openSession`/`openSessionRaw`, compare `reply.getGatewayProtocolVersion()` with `MxGatewayClientVersion.gatewayProtocolVersion()` and throw a typed `MxGatewayException` on mismatch. + +**Resolution:** (2026-05-18) Confirmed against source: neither `openSessionRaw` nor `openSessionAsync` read `getGatewayProtocolVersion()`. Added a private `ensureGatewayProtocolCompatible` helper, called from both `openSessionRaw` and `openSessionAsync`, that throws `MxGatewayException` with a clear mismatch message when the gateway reports a non-zero version differing from `MxGatewayClientVersion.gatewayProtocolVersion()`. A gateway that leaves the field unset (value 0, e.g. an older gateway) is accepted unchanged for backward compatibility. `clients/java/README.md` documents the new fail-fast check. Regression tests: `MxGatewayMediumFindingsTests.openSessionRejectsIncompatibleGatewayProtocolVersion` and `openSessionAcceptsMatchingOrUnsetGatewayProtocolVersion`. 
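
**Illustration (not project code):** The fail-fast check described for Client.Java-003 can be sketched as below. This assumes a plain `int` version and uses `IllegalStateException` where the client throws `MxGatewayException`; the class name, the constant value `3`, and the message text are hypothetical.

```java
/** Hypothetical sketch of a gateway protocol-version guard. */
final class ProtocolVersionCheck {
    /** Assumed compiled-in client version; the real value is not stated here. */
    static final int CLIENT_PROTOCOL_VERSION = 3;

    /**
     * Accepts an unset version (0, e.g. an older gateway) and an exact match;
     * rejects any other value with a clear mismatch message.
     */
    static void ensureGatewayProtocolCompatible(int reportedVersion) {
        if (reportedVersion == 0) {
            return; // field left unset: accepted for backward compatibility
        }
        if (reportedVersion != CLIENT_PROTOCOL_VERSION) {
            throw new IllegalStateException(
                "gateway protocol version " + reportedVersion
                    + " does not match client version " + CLIENT_PROTOCOL_VERSION);
        }
    }
}
```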
+ +### Client.Java-004 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySession.java:114-120,157-163,191-197` | +| Status | Resolved | + +**Description:** `register`, `addItem`, and `addItem2` check `reply.hasRegister()`/`hasAddItem()` and otherwise fall back to `reply.getReturnValue().getInt32Value()`. If the gateway returns a reply with neither the typed payload nor a `return_value` set, the method silently returns `0` — indistinguishable from a legitimate handle of 0. This masks a contract violation rather than surfacing it. + +**Recommendation:** If the expected typed payload is absent and no `return_value` is present, throw `MxGatewayException` (protocol violation) instead of returning `0`. + +**Resolution:** (2026-05-18) Confirmed against source: all three methods returned `reply.getReturnValue().getInt32Value()` (which yields `0` for an unset message field) when the typed payload was absent. Each method now guards the fallback with `reply.hasReturnValue()` and throws `MxGatewayException` describing the protocol violation when neither the typed payload nor a `return_value` is present. The legitimate `return_value` fallback is preserved. Regression tests: `MxGatewayMediumFindingsTests.registerThrowsWhenReplyHasNeitherTypedPayloadNorReturnValue`, `addItemThrowsWhenReplyHasNeitherTypedPayloadNorReturnValue`, and `addItemStillHonoursReturnValueFallback`. + +### Client.Java-005 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySession.java:92-105` | +| Status | Resolved | + +**Description:** `close()` delegates to `closeRaw()`, which performs a network RPC. When `MxGatewaySession` is used in try-with-resources and the body throws, a failure inside `closeSession` (e.g. 
`WORKER_UNAVAILABLE`) throws from `close()` and replaces the original exception as the propagated throwable (the body exception becomes a suppressed exception) — a known try-with-resources footgun for I/O-performing `close()`. + +**Recommendation:** Either make `close()` swallow/log close-time failures (keeping `closeRaw()` for callers who want the result), or document clearly that `close()` performs a network call that can throw. + +**Resolution:** (2026-05-18) Confirmed against source: `close()` called `closeRaw()` directly, so a `CloseSession` RPC failure propagated out of try-with-resources and replaced the body exception. `close()` now catches `MxGatewayException` from `closeRaw()` and logs it at WARNING via `System.Logger` instead of rethrowing, so a close-time failure never masks the body exception. `closeRaw()` is unchanged and still throws for callers who want to observe the close result. The behavior change and the recommendation to use `closeRaw()` for explicit close handling are documented in `clients/java/README.md` and the `close()` Javadoc. Regression tests: `MxGatewayMediumFindingsTests.closeSuppressesCloseTimeFailureInsteadOfMaskingBodyException` and `closeRawStillSurfacesCloseTimeFailureForCallersWhoWantIt`. + +### Client.Java-006 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Performance & resource management | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:323-328`, `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/GalaxyRepositoryClient.java:279-284` | +| Status | Resolved | + +**Description:** `close()` (the `AutoCloseable` method invoked by try-with-resources) calls only `ownedChannel.shutdown()` and returns immediately without awaiting termination. In-flight calls and Netty event-loop threads may still be running when the caller assumes the resource is released. 
`closeAndAwaitTermination()` does it correctly but is not the method try-with-resources uses, and the README examples all rely on try-with-resources. + +**Recommendation:** Have `close()` await termination for a bounded time and `shutdownNow()` on timeout (the logic already in `closeAndAwaitTermination()`), or document that try-with-resources callers should call `closeAndAwaitTermination()`. + +**Resolution:** (2026-05-18) Confirmed against source: both `MxGatewayClient.close()` and `GalaxyRepositoryClient.close()` called only `ownedChannel.shutdown()`. `close()` in both clients now performs the bounded-wait logic previously only in `closeAndAwaitTermination()`: it shuts the channel down, waits up to the configured connect timeout for graceful termination, and calls `shutdownNow()` on timeout. Because `close()` cannot throw a checked exception, an `InterruptedException` while awaiting is handled by forcibly shutting the channel down and restoring the thread interrupt flag. `closeAndAwaitTermination()` is retained unchanged for callers who want the checked, blocking-aware variant. `clients/java/README.md` documents the new try-with-resources `close()` semantics. + +### Client.Java-007 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `clients/java/mxgateway-client/src/test/java/com/dohertylan/mxgateway/client/` | +| Status | Resolved | + +**Description:** The alarm surface — `acknowledgeAlarm`/`acknowledgeAlarmAsync`/`queryActiveAlarms` and `MxGatewayActiveAlarmsSubscription` — has zero test coverage. TLS channel construction, the async `streamEventsAsync` path, `MxGatewayEventSubscription` pre-start cancellation, and `MxEventStream` queue overflow are likewise untested. `JavaClientDesign.md` explicitly lists async stream-observer cancellation and status/error mapping as required tests. 
+ +**Recommendation:** Add in-process gRPC tests for the alarm RPCs, the async streaming/subscription cancellation paths, and at least one TLS-config construction test. + +**Resolution:** (2026-05-18) Confirmed against source: no test referenced `acknowledgeAlarm`, `queryActiveAlarms`, `streamEventsAsync`, TLS construction, or `MxEventStream` overflow. Added `MxGatewayLowFindingsTests` (12 tests) covering: `acknowledgeAlarm`/`acknowledgeAlarmAsync` (success, typed protocol-failure, async transport-failure normalisation), `queryActiveAlarms` observer delivery, `MxGatewayActiveAlarmsSubscription` and `MxGatewayEventSubscription` pre-start cancellation, `streamEventsAsync` observer delivery, `MxEventStream` queue overflow surfacing `MxGatewayException`, TLS channel construction (missing CA file rejected with a typed exception, system-trust path builds cleanly), and the Client.Java-008 async-validator normalisation. While writing the TLS test a latent bug was found: a missing/unreadable CA file makes `GrpcSslContexts` throw `IllegalArgumentException` (not `SSLException`), which the old `catch (SSLException)` let escape unwrapped — the catch in the shared channel builder was broadened to also wrap `RuntimeException` so callers always see one typed `MxGatewayException`. + +### Client.Java-008 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:298-304` | +| Status | Resolved | + +**Description:** `acknowledgeAlarmAsync` and `openSessionAsync` apply `ensureProtocolSuccess` inside `thenApply`. If that validator throws a non-`MxGatewayException` `RuntimeException` it is wrapped by `CompletionException` with no `fromGrpc` normalisation, unlike the synchronous paths which normalise via `try/catch`. The async and sync error surfaces are therefore inconsistent. 
+ +**Recommendation:** Wrap the `thenApply` body so any non-`MxGatewayException` is routed through `MxGatewayErrors.fromGrpc`, matching the synchronous methods. + +**Resolution:** (2026-05-18) Confirmed against source: the `thenApply` validators in `openSessionAsync`, `invokeAsync`, and `acknowledgeAlarmAsync` were not normalised — in practice the gateway's own validators (`ensureProtocolSuccess`, `ensureMxAccessSuccess`, `ensureGatewayProtocolCompatible`) only ever throw `MxGatewayException`, but a stray non-`MxGatewayException` `RuntimeException` (e.g. an NPE from a malformed reply) would surface raw inside `CompletionException`. Added `MxGatewayChannels.normalisingValidator(operation, fn)`: it rethrows `MxGatewayException` unchanged and routes any other `RuntimeException` through `MxGatewayErrors.fromGrpc`, matching the synchronous `try/catch` paths. All three async `thenApply` sites now use it. Regression test: `MxGatewayLowFindingsTests.openSessionAsyncNormalisesNonGatewayRuntimeExceptionFromValidator`. + +### Client.Java-009 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/GalaxyRepositoryClient.java:310-391`, `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:346-413` | +| Status | Resolved | + +**Description:** `createChannel`, `withDeadline`, `withStreamDeadline`, and `toCompletable` are duplicated nearly verbatim across `MxGatewayClient` and `GalaxyRepositoryClient` (~80 lines). A fix to one will not propagate to the other. + +**Recommendation:** Extract the channel-builder and future-adaptor helpers into a shared package-private utility class. + +**Resolution:** (2026-05-18) Confirmed against source: the four helpers were duplicated near-verbatim. 
Added a package-private `MxGatewayChannels` utility class holding `createChannel(options, tlsErrorPrefix)`, `withDeadline(stub, options)`, `withStreamDeadline(stub, options)`, `toCompletable(future, operation)`, and the new `normalisingValidator` helper (Client.Java-008). Both `MxGatewayClient` and `GalaxyRepositoryClient` now delegate to it and their private copies were deleted, so a future fix lives in one place. Behavior is unchanged except the operation-name carried into `MxGatewayErrors.fromGrpc` is now the specific RPC name instead of the generic `"async call"`/`"galaxy async call"`. Verified by the full existing async test suite plus the new `MxGatewayLowFindingsTests`. + +### Client.Java-010 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:269-272`, `clients/java/README.md:76` | +| Status | Resolved | + +**Description:** The `acknowledgeAlarm` Javadoc states the gateway authenticates against an `invoke:alarm-ack` scope, and the README states the Galaxy Repository requires a `metadata:read` scope. CLAUDE.md's documented scope set names neither — the Javadoc/README assert a scope contract the project's own auth documentation does not corroborate. + +**Recommendation:** Reconcile the scope names with `src/MxGateway.Server/Security/` and CLAUDE.md; correct the Javadoc/README to the actual scope strings, or fix CLAUDE.md if sub-scopes were genuinely added. + +**Resolution:** (2026-05-18) Partially re-triaged. Verified against `src/MxGateway.Server/Security/Authorization/GatewayScopes.cs` and `GatewayGrpcScopeResolver.cs`: the canonical scope catalog is `session:open`, `session:close`, `invoke:read`, `invoke:write`, `invoke:secure`, `events:read`, `metadata:read`, `admin`. 
(a) The README's `metadata:read` for the Galaxy Repository is **correct** — `TestConnectionRequest`/`GetLastDeployTimeRequest`/`DiscoverHierarchyRequest`/`WatchDeployEventsRequest` all resolve to `GatewayScopes.MetadataRead`; no change needed. CLAUDE.md's prose lists only coarse scope groups, but the canonical resolver does define `metadata:read`. (b) The `acknowledgeAlarm` Javadoc's `invoke:alarm-ack` is **wrong** — no such scope exists. `AcknowledgeAlarmRequest` and `QueryActiveAlarmsRequest` are not special-cased in `GatewayGrpcScopeResolver`, so they fall through the `_ => GatewayScopes.Admin` default and require the `admin` scope. The Javadoc was corrected to state the `admin` scope; `queryActiveAlarms` did not assert a scope and was left unchanged. The README does not mention alarms, so no README change was required. + +### Client.Java-011 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Performance & resource management | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:37-63` | +| Status | Resolved | + +**Description:** The event stream relies on default gRPC auto-inbound flow control: the async stub auto-requests messages, so the server can push faster than the 16-element bounded queue drains. A momentarily slow consumer triggers queue overflow and an immediate stream-fault cancel. This is consistent with the documented fail-fast event-backpressure design, but the client never applies real flow control, so even brief consumer stalls kill the subscription. + +**Recommendation:** Confirm fail-fast is intended (it appears to be); if so, document it on `MxEventStream` so callers know a slow consumer terminates the stream. Optionally expose the queue capacity or opt-in flow control. 
+ +**Resolution:** (2026-05-18) Confirmed fail-fast is intended — CLAUDE.md ("fail-fast event backpressure") and `docs/DesignDecisions.md` make a slow consumer losing its subscription a deliberate v1 design choice, so this is documentation-only, not a behavior bug. Added an explicit "Backpressure (fail-fast)" section to the `MxEventStream` class Javadoc explaining that the adaptor uses gRPC auto-inbound flow control with a fixed 16-element buffer and no client flow control, that a consumer stall long enough to fill the buffer triggers an overflow that cancels the subscription and surfaces an `MxGatewayException`, and that consumers must drain promptly and be ready to resubscribe with a resume cursor. `clients/java/README.md` carries the same caveat. The queue capacity was intentionally left non-configurable to keep the v1 surface aligned with the gateway design; overflow behavior is covered by `MxGatewayLowFindingsTests.eventStreamQueueOverflowSurfacesExceptionFromNext`. + +### Client.Java-012 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:667-674` | +| Status | Resolved | + +**Description:** `CommonOptions.resolved()` mutates `this` (`resolvedApiKey`, `resolvedTimeout`) and returns `this`, but `toClientOptions()` and `redactedJsonMap()` read those mutated fields. If `redactedJsonMap()` is ever called before `resolved()`, it silently emits empty-string defaults. The "return this after mutating" pattern is fragile and surprising. + +**Recommendation:** Make `resolved()` return an immutable resolved value object, or compute `resolvedApiKey`/`resolvedTimeout` lazily in their getters so call ordering cannot produce stale output. 
+ +**Resolution:** (2026-05-18) Confirmed against source: `resolved()` populated the `resolvedApiKey`/`resolvedTimeout` mutable fields and `toClientOptions()`/`redactedJsonMap()` read them, so calling either before `resolved()` emitted stale empty/30s defaults. The two mutable fields were removed and replaced with side-effect-free accessor methods `resolvedApiKey()` and `resolvedTimeout()` that compute their value on each call (API key from `--api-key` or the `--api-key-env` variable; timeout via `parseDuration`). `toClientOptions()` and `redactedJsonMap()` now call those accessors directly, so call ordering can no longer produce stale output. `resolved()` is retained as a no-op returning `this` purely for call-site readability (`common.resolved()`), with its Javadoc updated to state resolution is now lazy. Pure-refactor with no runtime-behavior change for the existing call order, so no new test was added; covered by the existing `MxGatewayCliTests` JSON-redaction and option-parsing tests. + +### Client.Java-013 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Testing coverage | +| Location | `clients/java/mxgateway-cli/src/test/java/com/dohertylan/mxgateway/cli/MxGatewayCliTests.java:212-304`, `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:1214-1244` | +| Status | Resolved | + +**Description:** `MxGatewayCliSession` in `MxGatewayCli.java:1214` was extended in commit `f220908` (the "bulk read/write CLI subcommands" change) with five new abstract methods — `readBulk`, `writeBulk`, `write2Bulk`, `writeSecuredBulk`, `writeSecured2Bulk`. The test-only `FakeSession` in `MxGatewayCliTests.java:212` still only implements the original set (register/addItem/advise/writeRaw/subscribeBulk/unsubscribeBulk/streamEventsAfter) and is declared a concrete (non-abstract) class. 
A clean compile of `mxgateway-cli`'s test source set therefore fails: a concrete implementer that omits abstract interface methods is a compile error. The stale `.class` files under `build/classes/java/test/` predate the interface change (dated 2026-05-20 03:38 vs CLI source dated 2026-05-20 05:06), which is why the issue is not visible until the next clean build. `gradle test` (or any CI pipeline that does not retain incremental state) will fail to build the CLI test module. The `CLAUDE.md` source-update workflow row "When source code changes, build and test the affected component" was not honoured for this CLI contract change. + +**Recommendation:** Add the five missing `@Override` implementations to `FakeSession` (stubs returning empty lists are fine — only `subscribeBulk`/`unsubscribeBulk` are exercised by the existing tests, and the new bulk subcommands have no dedicated CLI tests yet). Optionally also add at least one CLI-level test for `read-bulk`, `write-bulk`, and the `bench-read-bulk` subcommands to keep parity with the .NET / Go / Rust CLI smoke matrix. + +**Resolution:** 2026-05-20 — Added the five missing `@Override` stubs (`readBulk`, `writeBulk`, `write2Bulk`, `writeSecuredBulk`, `writeSecured2Bulk`) to `FakeSession` in `clients/java/mxgateway-cli/src/test/java/com/dohertylan/mxgateway/cli/MxGatewayCliTests.java`, each returning an empty `ArrayList<>` to match the interface return types (`List` / `List`) without throwing. Imported `BulkReadResult`, `BulkWriteResult`, `WriteBulkEntry`, `Write2BulkEntry`, `WriteSecuredBulkEntry`, `WriteSecured2BulkEntry` from `mxaccess_gateway.v1.MxaccessGateway`. `GrpcMxGatewayCliSession` in `MxGatewayCli.java` is the only other implementer and already provides the methods (the source change that introduced the contract added them there). Verified with `gradle clean` followed by `gradle :mxgateway-cli:compileTestJava` and `gradle :mxgateway-cli:test` from `clients/java`, both BUILD SUCCESSFUL. 
No new CLI-level tests for the bulk subcommands were added — that follow-up is tracked separately and out of scope for this unblock-compilation fix. + +### Client.Java-014 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:59-65,117-124` | +| Status | Resolved | + +**Description:** `MxEventStream.observer().beforeStart` simply assigns `requestStream` without checking the `closed` flag, while `close()` reads `requestStream` after setting `closed = true`. If `close()` runs *before* the gRPC call has attached its `ClientCallStreamObserver` (a real race when callers cancel immediately after subscribing — e.g. construct, then close in a `finally` block when an unrelated setup step throws), then at close time `requestStream` is `null`, so `stream.cancel(...)` is skipped. `beforeStart` then fires later, stores the live `requestStream`, and never observes `closed` — the underlying gRPC call leaks open and continues delivering events to a `MxEventStream` whose consumer has stopped iterating. The sibling `DeployEventStream.beforeStart` already does the correct thing (`if (closed.get()) { requestStream.cancel(...); }`); the two adaptors should behave identically. + +**Recommendation:** Mirror `DeployEventStream`'s pattern in `MxEventStream.beforeStart`: after storing `requestStream`, check the `closed` flag and cancel the stream eagerly if a prior `close()` has already fired. Add a regression test analogous to `GalaxyRepositoryClientTests.deployEventStreamCloseBeforeBeforeStartCancelsStream` to lock in the behavior. 
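
**Illustration (not project code):** A minimal sketch of the check-after-store pattern recommended for Client.Java-014, assuming a hypothetical `Cancellable` interface in place of gRPC's `ClientCallStreamObserver`. The class name `LateAttachStream` is invented for this example.

```java
/** Hypothetical model of a stream whose call handle may attach after close(). */
final class LateAttachStream {
    /** Stand-in for the gRPC call handle; cancel is assumed to be idempotent. */
    interface Cancellable {
        void cancel(String message);
    }

    private volatile boolean closed;
    private volatile Cancellable requestStream;

    /** Invoked when the underlying call attaches its handle. */
    void beforeStart(Cancellable stream) {
        requestStream = stream;
        // Store first, then check: if close() already ran it saw a null
        // handle and skipped the cancel, so it has to happen here.
        if (closed) {
            stream.cancel("client cancelled event stream");
        }
    }

    /** Safe from any thread. */
    void close() {
        closed = true;
        Cancellable stream = requestStream;
        if (stream != null) {
            stream.cancel("client cancelled event stream");
        }
    }
}
```

Both fields are `volatile`, and each side writes its own field before reading the other's, so at least one of the two methods observes the other's write. When the two calls overlap, both may cancel, which is why the handle's `cancel` is assumed to be idempotent.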
+ +**Resolution:** 2026-05-20 — Mirrored `DeployEventStream.beforeStart` in `MxEventStream.beforeStart`: after storing the `ClientCallStreamObserver`, the observer now reads the `closed` flag and calls `requestStream.cancel("client cancelled event stream", null)` when a prior `close()` already fired, closing the close/beforeStart race that previously leaked the underlying gRPC call. The fix uses the existing `volatile boolean closed` field (already established as a happens-before publisher by `close()` setting it before reading `requestStream`); no field shape changes were needed. `clients/java/README.md` documents the new safe-close-before-beforeStart contract. Regression test: `MxGatewayMediumFindingsTests.mxEventStreamCloseBeforeBeforeStartCancelsStream` (mirrors `GalaxyRepositoryClientTests.deployEventStreamCloseBeforeBeforeStartCancelsStream`). + +### Client.Java-015 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayChannels.java:112-138`, `MxGatewayClient.java:183-191,224-232,322-329`, `GalaxyRepositoryClient.java:164-170,212-214` | +| Status | Resolved | + +**Description:** `MxGatewayChannels.toCompletable` registers a `whenComplete` on the local `target` future to forward cancellation to the source gRPC `ListenableFuture`. Every caller — `openSessionAsync`, `invokeAsync`, `acknowledgeAlarmAsync`, `discoverHierarchyPageAsync`, `getLastDeployTimeAsync` — then chains `.thenApply(normalisingValidator(...))` or `.thenApply(::getOk)` and returns the *chained* future to the user. `CompletableFuture.thenApply` returns a new future whose cancellation does **not** propagate back to the source `target`. Cancelling the user-facing future therefore never sets `target.isCancelled() == true`, so `source.cancel(true)` is never invoked and the underlying gRPC call continues until its deadline expires. 
The `JavaClientDesign.md` "Streaming" section explicitly says "Stream cancellation should call `ClientCall.cancel`" — the same expectation reasonably applies to the unary `*Async` surface. + +**Recommendation:** Either return `target` directly from each `*Async` method (and inline the validator into the `FutureCallback.onSuccess` path so no `thenApply` is needed), or attach the cancellation listener to the *final* returned future. The cleanest fix is to have `MxGatewayChannels.toCompletable` return a future that wraps the validator internally and registers `whenComplete` on the final future. Add a regression test that cancels the user-facing future and verifies the gRPC call was cancelled (e.g. via a `ServerCallStreamObserver.setOnCancelHandler` latch). + +**Resolution:** 2026-05-20 — Fixed by inlining the reply validator into `MxGatewayChannels.toCompletable` so the user-visible future is the same future cancellation is bound to: added a new `toCompletable(source, operation, validator)` overload that runs the validator inside the `FutureCallback.onSuccess` path (normalising non-`MxGatewayException` `RuntimeException`s through `MxGatewayErrors.fromGrpc`, matching the existing synchronous `try/catch`). Replaced the previous `whenComplete`-based cancellation listener with a small `CancellingCompletableFuture` subclass whose `cancel(boolean)` forwards to the source `ListenableFuture.cancel(...)` unconditionally, so even the no-validator overload propagates cancellation deterministically (the `whenComplete` listener only fired when `target.isCancelled()` was already true, which is exactly the case `thenApply` broke). Updated `MxGatewayClient.openSessionAsync`, `MxGatewayClient.invokeAsync`, `MxGatewayClient.acknowledgeAlarmAsync`, `GalaxyRepositoryClient.testConnectionAsync`, and `GalaxyRepositoryClient.getLastDeployTimeAsync` to use the new validator overload directly (no `.thenApply` chain). 
`GalaxyRepositoryClient.discoverHierarchyAsync` is paged via `thenCompose`, so it now publishes the current in-flight page future via an `AtomicReference` and returns a top-level `CompletableFuture` whose overridden `cancel(boolean)` cancels whichever page is currently outstanding. `clients/java/README.md` documents the new cancellation contract: cancelling any `*Async` future aborts the underlying gRPC call. Regression tests: `MxGatewayMediumFindingsTests.invokeAsyncCancellationCancelsUnderlyingGrpcCall` (full in-process gRPC test using `ServerCallStreamObserver.setOnCancelHandler` to latch when the server observes RPC cancellation), `toCompletableValidatorOverloadForwardsCancellationToSource`, and `toCompletableNoValidatorOverloadForwardsCancellationToSource` (unit-level proofs that both `MxGatewayChannels.toCompletable` overloads forward `cancel(true)` to the source `ListenableFuture`). + +### Client.Java-016 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:361-391`, `GalaxyRepositoryClient.java:285-315` | +| Status | Resolved | + +**Description:** Client.Java-009 introduced `MxGatewayChannels` to deduplicate `createChannel`, `withDeadline`, `withStreamDeadline`, and `toCompletable`. The two `close()` / `closeAndAwaitTermination()` methods — added shortly after to fix Client.Java-006 — were not extracted along with them. The 30-line bodies of `MxGatewayClient.close()` + `closeAndAwaitTermination()` and `GalaxyRepositoryClient.close()` + `closeAndAwaitTermination()` are now duplicated verbatim, including the `awaitTermination(connectTimeout)` semantic (see Client.Java-019), the `InterruptedException` handling, and the `ownedChannel == null` guard. A fix to one path (e.g. introducing a dedicated `shutdownTimeout` option) will silently miss the other. 
+ +**Recommendation:** Move the shutdown logic into `MxGatewayChannels.shutdown(ManagedChannel channel, MxGatewayClientOptions options)` and `MxGatewayChannels.shutdownAndAwaitTermination(...)`. Have both clients delegate to it. Same recommendation applies to the duplicated `MxGatewayAuthInterceptor` construction in the two constructors (`MxGatewayClient(Channel, ...)` and `GalaxyRepositoryClient(Channel, ...)`). + +**Resolution:** 2026-05-20 — Extracted the duplicated shutdown logic into `MxGatewayChannels.shutdown(ManagedChannel, MxGatewayClientOptions)` and `MxGatewayChannels.shutdownAndAwaitTermination(ManagedChannel, MxGatewayClientOptions)`. Both helpers handle the `ownedChannel == null` no-op, the orderly-shutdown / `awaitTermination` / `shutdownNow`-on-timeout escalation, and the `InterruptedException`-restoring-the-interrupt-flag path. `MxGatewayClient.close()`/`closeAndAwaitTermination()` and `GalaxyRepositoryClient.close()`/`closeAndAwaitTermination()` are now one-liners that delegate to the shared helpers, so a future change (such as Client.Java-019's `shutdownTimeout`) lives in one place. Unused `java.util.concurrent.TimeUnit` imports were removed from both clients. The constructor-level `MxGatewayAuthInterceptor` duplication noted in the recommendation was left in place — it is a single intercept call per constructor (2 lines) versus the 30-line shutdown duplication that was the actual maintenance hazard. Regression tests: `MxGatewayLowFindingsIITests.sharedShutdownHelperIsNoOpForNullChannel` (covers the null-channel guard), `shutdownAndAwaitTerminationHonoursShutdownTimeoutNotConnectTimeout`, and `shutdownEscalatesToShutdownNowWhenTimeoutExceeded` (cover the shared shutdown semantics; the second is also the Client.Java-019 regression). 
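
**Illustration (not project code):** The shared shutdown escalation described for Client.Java-016 can be sketched with `java.util.concurrent.ExecutorService` standing in for gRPC's `ManagedChannel`, since both expose `shutdown()`, `awaitTermination(long, TimeUnit)`, and `shutdownNow()`. The class name `ShutdownSketch`, the boolean return value, and the explicit `timeout` parameter are assumptions made for this example.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a bounded, escalating shutdown helper. */
final class ShutdownSketch {
    /**
     * Returns true when the resource terminated gracefully (or was null),
     * false when shutdown had to be forced.
     */
    static boolean shutdown(ExecutorService resource, Duration timeout) {
        if (resource == null) {
            return true; // nothing owned, nothing to release
        }
        resource.shutdown(); // orderly: stop accepting new work
        try {
            if (resource.awaitTermination(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
                return true;
            }
            resource.shutdownNow(); // escalate once the bounded wait expires
            return false;
        } catch (InterruptedException e) {
            resource.shutdownNow();
            // close() cannot throw a checked exception, so restore the flag.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Keeping the escalation in one helper means a later change to the wait bound only has to be made once.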
+ +### Client.Java-017 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:25-36`, `clients/java/README.md:99-107` | +| Status | Resolved | + +**Description:** `MxEventStream.streamEvents` was recently widened from a 16-element buffer to a 1024-element buffer (`MxGatewayClient.streamEvents` at line 268: `new MxEventStream(1024)`). The class-level Javadoc on `MxEventStream` still says "the gateway can push events faster than the consumer drains the bounded 16-element buffer", and `clients/java/README.md` line 103 says "uses gRPC's default auto-inbound flow control with a fixed 16-element buffer". The fail-fast event-backpressure contract (Client.Java-011 resolution) was written against the older capacity. The `MxGatewayClient.streamEvents` inline comment even acknowledges the change ("A small queue overflows on any moderately active session; 1024 covers a realistic backlog"). Users of this surface will reason about realistic backpressure budgets using the wrong number. + +**Recommendation:** Update the `MxEventStream` Javadoc and the README to say "1024-element buffer" (or, since the capacity is a passed parameter, document it as a parameter rather than a constant). Consider exposing the capacity through `MxGatewayClientOptions` so callers can tune it per session. + +**Resolution:** 2026-05-20 — Updated the `MxEventStream` class Javadoc and `clients/java/README.md` so both say "1024-element buffer" instead of the obsolete "16-element buffer". The Javadoc also notes that capacity is a constructor parameter and that the production caller (`MxGatewayClient.streamEvents`) passes `1024` to absorb the session-backlog replay burst, so readers understand the value is a deliberate choice rather than a constant. 
Exposing the capacity through `MxGatewayClientOptions` was intentionally left out of scope — the v1 design keeps the event-stream surface minimal and `MxGatewayClient.streamEvents` is the only caller; if a tuning need arises in v2 the existing constructor already accepts the capacity. + +### Client.Java-018 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Security | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySecrets.java:54-66` | +| Status | Resolved | + +**Description:** `redactCredentials(value)` splits its input on `\\s+` (whitespace) and only redacts whitespace-delimited tokens that start with `mxgw_` or equal `bearer` (case-insensitive). gRPC `Status.getDescription()` strings, log lines, and proto error messages can carry credentials separated by colons (`Bearer:mxgw_id_secret`), commas (`token=mxgw_id_secret,scope=...`), single quotes (`'mxgw_id_secret'`), parentheses (`(mxgw_id_secret)`), or embedded in URLs/paths — all of which leave the `mxgw_` token attached to a non-whitespace neighbour and survive redaction. `MxGatewayErrors.fromGrpc` is the primary consumer; a gateway error description like `authentication failed: 'mxgw_id_secret'` would round-trip the secret into the resulting `MxGatewayAuthenticationException` message. + +**Recommendation:** Replace the whitespace-split scrub with a regex-based pass that matches `mxgw_[A-Za-z0-9_-]+` anywhere in the string and substitutes a redaction placeholder; also redact `Bearer\s+\S+` as a unit so the token after `Bearer` is masked regardless of the surrounding punctuation. Cover with a fixture-style test alongside `MxGatewayFixtureTests.grpcAuthErrorsAreClassifiedAndRedacted` that asserts a quoted or comma-delimited credential is fully masked. 
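
A regex-based scrub along the lines of this recommendation might look like the sketch below. It uses the two patterns quoted above; the class name `RedactionSketch`, the method name `redact`, and the `[REDACTED]` placeholder text are assumptions made for illustration, since the actual substitution string is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the MxGatewaySecrets implementation.
final class RedactionSketch {
    // Placeholder text is an assumption for this sketch.
    private static final String MASK = "[REDACTED]";
    // Matches a gateway-shaped credential anywhere, regardless of neighbouring punctuation.
    private static final Pattern GATEWAY_KEY = Pattern.compile("mxgw_[A-Za-z0-9_-]+");
    // Matches "Bearer <token>" as a unit, case-insensitively, keeping the scheme word.
    private static final Pattern BEARER = Pattern.compile("(?i)(bearer)\\s+\\S+");

    private RedactionSketch() {}

    static String redact(String value) {
        if (value == null) {
            return null; // nothing to scrub
        }
        String masked = GATEWAY_KEY.matcher(value).replaceAll(Matcher.quoteReplacement(MASK));
        // Runs second, so an already-masked "Bearer [REDACTED]" is rewritten to itself.
        return BEARER.matcher(masked).replaceAll("$1 " + Matcher.quoteReplacement(MASK));
    }

    public static void main(String[] args) {
        System.out.println(redact("authentication failed: 'mxgw_id_secret'"));
        System.out.println(redact("token=mxgw_id_secret,scope=read"));
        System.out.println(redact("Bearer abc.def"));
    }
}
```

Because the character class stops at the first character outside `[A-Za-z0-9_-]`, quotes, commas, colons, and parentheses around a credential are left in place while the credential itself is masked.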
+ +**Resolution:** 2026-05-20 — Replaced the whitespace-split scrub with two compiled `Pattern` regexes: `mxgw_[A-Za-z0-9_-]+` matches any gateway-shaped credential anywhere in the string regardless of surrounding punctuation, and `(?i)bearer\s+\S+` masks an authorization-header style `Bearer` credential (the scheme word plus the token after it) as a unit so a non-mxgw bearer token cannot leak either. The mxgw pass runs first, so in the common combined case the bearer pass sees a `Bearer` header whose token is already masked and re-masks it idempotently. Regression tests in `MxGatewayFixtureTests`: `redactCredentialsHandlesNonWhitespaceDelimitedTokens` exercises single-quoted, double-quoted, comma-delimited, colon-delimited, parenthesised, URL-embedded, and bearer-header credentials; `redactCredentialsLeavesBenignContentAlone` confirms strings without credentials and a `null` input are unchanged. + +### Client.Java-019 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Performance & resource management | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:362-391`, `GalaxyRepositoryClient.java:286-315` | +| Status | Resolved | + +**Description:** Both clients' `close()` / `closeAndAwaitTermination()` use `options.connectTimeout()` as the upper bound on `awaitTermination`. The `connectTimeout` semantically describes how long the client will wait to *establish* the channel, not how long it should wait for in-flight calls and the Netty event loop to drain after `shutdown()`. With the default 10s connect timeout, shutting down a client with a long-running unary call already in flight will silently escalate to `shutdownNow()` and forcibly cancel it before the call's own deadline expires, defeating the deadline contract on `withDeadline`. Conversely, a caller who sets a small `connectTimeout` (e.g. 500 ms for a health probe) inherits an aggressively short shutdown deadline they probably did not intend. 
+ +**Recommendation:** Introduce a dedicated `shutdownTimeout` on `MxGatewayClientOptions` (defaulting to e.g. 5–10 s independent of `connectTimeout`) and use it in `close()` and `closeAndAwaitTermination()`. Document the precedence in the Javadoc. This pairs naturally with the Client.Java-016 deduplication fix. + +**Resolution:** 2026-05-20 — Added a dedicated `shutdownTimeout` `Duration` on `MxGatewayClientOptions` (builder method `shutdownTimeout(Duration)`, accessor `shutdownTimeout()`, default 10 s), independent of `connectTimeout`. Both shared shutdown helpers introduced for Client.Java-016 (`MxGatewayChannels.shutdown` and `shutdownAndAwaitTermination`) call `options.shutdownTimeout()` as the `awaitTermination` upper bound, so a small `connectTimeout` (e.g. a 500 ms health-probe timeout) no longer forces a premature `shutdownNow()` on in-flight calls. The new option is reflected in `toString()` and documented on both helpers and the `close()`/`closeAndAwaitTermination()` Javadoc on both clients; `clients/java/README.md` notes the default and the independence from `connectTimeout`. Regression tests in `MxGatewayLowFindingsIITests`: `shutdownAndAwaitTerminationHonoursShutdownTimeoutNotConnectTimeout` (a 50 ms connect timeout + 1 s shutdown timeout + 200 ms graceful-termination channel never escalates to `shutdownNow()`), `shutdownEscalatesToShutdownNowWhenTimeoutExceeded` (a stuck channel beyond the shutdown timeout is forcibly shut down), and `shutdownTimeoutDefaultIsTenSecondsIndependentOfConnectTimeout` (the default holds even when `connectTimeout` is small). 
+ +### Client.Java-020 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:244-254`, `galaxy_repository.proto:94` | +| Status | Resolved | + +**Description:** `galaxy_repository.proto` defines `DeployEvent.sequence` as `uint64`; the protobuf Java mapping projects that to a signed `long`. The CLI's text-mode `galaxy-watch` output prints it as `"seq=%d ..."`, which interprets the value as signed. For genuine wraparound this is implausible (deploy sequences will not reach `2^63`), but the broader pattern is brittle: any unsigned proto field printed via `%d` will display incorrectly past the signed boundary. The JSON path uses `protoJson(event)` which formats unsigned longs as numeric strings via `JsonFormat`, so JSON output is correct; only the text mode is at risk. + +**Recommendation:** Print the sequence with `Long.toUnsignedString(event.getSequence())` (or switch the text format to `%s` and pass the unsigned-string conversion). The same rule should apply to any other `uint64` proto fields that surface in CLI text output. + +**Resolution:** 2026-05-20 — Updated the `galaxy-watch` text-mode `out.printf` in `MxGatewayCli.GalaxyWatchCommand.call()` to use `%s` for the sequence field and pass `Long.toUnsignedString(event.getSequence())`, so deploy sequences past `2^63` render as their correct unsigned decimal string instead of a negative signed long. The JSON path through `protoJson(event)` was already correct (proto `JsonFormat` emits unsigned longs as decimal strings) and was left unchanged. An inline comment near the printf documents the unsigned-uint64 contract so the next person editing the format string knows not to switch back to `%d`. 
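
The difference between the two format strings is easy to reproduce in isolation. The sketch below is not the CLI code; `UnsignedFormatSketch` and its two helper methods are hypothetical names, and only `Long.toUnsignedString` and the `%d` / `%s` conversions come from the text above.

```java
import java.util.Locale;

// Hypothetical demo of printing a proto uint64 that Java carries in a signed long.
final class UnsignedFormatSketch {
    private UnsignedFormatSketch() {}

    /** The brittle form: %d interprets the 64 bits as a signed value. */
    static String signed(long sequence) {
        return String.format(Locale.ROOT, "seq=%d", sequence);
    }

    /** The corrected form: convert to an unsigned decimal string and print it with %s. */
    static String unsigned(long sequence) {
        return String.format(Locale.ROOT, "seq=%s", Long.toUnsignedString(sequence));
    }

    public static void main(String[] args) {
        long maxUint64 = -1L; // all 64 bits set, i.e. 2^64 - 1 when read as unsigned
        System.out.println(signed(maxUint64));   // seq=-1
        System.out.println(unsigned(maxUint64)); // seq=18446744073709551615
    }
}
```

Values below `2^63` print identically through both paths, which is why the problem stays latent until the signed boundary is crossed.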
Regression test: `MxGatewayCliTests.deployEventSequenceRendersAsUnsignedForHighUint64` exercises the format string with the max-uint64 bit pattern (`-1L`) and asserts the output contains `seq=18446744073709551615` and does not contain `seq=-1`. + +### Client.Java-021 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/DeployEventStream.java:96-135` | +| Status | Resolved | + +**Description:** Client.Java-002 fixed a deterministic terminal-state race in `MxEventStream` by introducing a `terminate(MxGatewayException)` method, a `terminalLock`, and a `terminated` flag so a `close()` arriving after a queue-overflow `offer()` cannot wipe the overflow exception. `DeployEventStream` — added later and structurally a copy of `MxEventStream` — never received the same fix. Its current `close()` does `closed.set(true); stream.cancel(...); offer(END);`, and its `offer()` overflow branch does `queue.clear(); queue.offer(new MxGatewayException("...queue overflowed")); queue.offer(END);` (lines 117-135). With these two paths running concurrently, the same sequence Client.Java-002 documented can repeat: the overflow branch enqueues `[overflowException, END]`, `close()` then calls `offer(END)` which sees the queue full and falls into the END branch (`queue.clear(); queue.offer(value);`), wiping the overflow exception and leaving a clean end-of-stream. The CLI `galaxy-watch` (and any `WatchDeployEvents` consumer) loses the overflow signal it was supposed to surface, defeating the fail-fast backpressure contract. In practice overflow is far less likely on `DeployEventStream` than on `MxEventStream`, despite its smaller 16-element buffer, but the race is identical. 
+ +**Recommendation:** Mirror the `MxEventStream` fix: add a `terminated` flag and `terminalLock`, route `close()`, `onCompleted`, and the overflow branch through a single `terminate(MxGatewayException)` method that wins on first arrival, and add the regression analogous to `MxGatewayMediumFindingsTests.eventStreamOverflowExceptionSurvivesASubsequentClose`. Given the two stream classes are now structural copies of each other, consider extracting the queue/terminate plumbing into a shared base or helper so the next fix lands once. + +**Resolution:** 2026-05-20 — Mirrored the `MxEventStream` terminal-state serialisation in `DeployEventStream`: replaced the `AtomicBoolean closed` field with a `volatile boolean closed`, added a `terminalLock`/`terminated` pair, and routed all terminal paths (`close()`, `onCompleted()`, the overflow branch in `offer()`) through a single private `terminate(MxGatewayException fault)` method guarded by `synchronized (terminalLock) { if (terminated) return; terminated = true; ... }`. The first terminal condition wins: an overflow that publishes `[exception, END]` is no longer wiped by a subsequent `close()`/`onCompleted()` that previously took the "queue full → clear + offer(END)" branch. The class-level Javadoc now documents the single-consumer-thread iterator contract and the deterministic terminal transition, matching `MxEventStream`. Behavior outside the terminal path is unchanged: `beforeStart` still resolves the close-before-beforeStart race (Client.Java-014's deploy-stream counterpart, already in place), `take()` still surfaces interrupts, and the request stream is still cancelled on overflow/close. 
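
The "first terminal condition wins" guard can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the `DeployEventStream` code: the class `TerminalStateSketch` is hypothetical, events are plain strings, `IllegalStateException` stands in for `MxGatewayException`, and every queue mutation is done under the lock to keep the sketch short.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical single-consumer stream with a deterministic terminal transition.
final class TerminalStateSketch {
    private static final Object END = new Object(); // end-of-stream sentinel
    private final BlockingQueue<Object> queue;
    private final Object terminalLock = new Object();
    private boolean terminated; // guarded by terminalLock

    TerminalStateSketch(int capacity) {
        if (capacity < 2) {
            throw new IllegalArgumentException("capacity must hold a fault plus END");
        }
        queue = new ArrayBlockingQueue<>(capacity);
    }

    void offer(String event) {
        synchronized (terminalLock) {
            if (terminated) {
                return; // late events are dropped after the terminal transition
            }
            if (!queue.offer(event)) {
                terminate(new IllegalStateException("event queue overflowed"));
            }
        }
    }

    void close() {
        synchronized (terminalLock) {
            terminate(null);
        }
    }

    // Caller holds terminalLock. Only the first call has any effect.
    private void terminate(RuntimeException fault) {
        if (terminated) {
            return;
        }
        terminated = true;
        if (fault != null) {
            queue.clear();
            queue.offer(fault);
            queue.offer(END);
        } else if (!queue.offer(END)) {
            queue.clear();
            queue.offer(END);
        }
    }

    /** Returns the next event, or null at a clean end of stream; rethrows a recorded fault. */
    String next() {
        final Object item;
        try {
            item = queue.take();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", ex);
        }
        if (item == END) {
            return null;
        }
        if (item instanceof RuntimeException) {
            throw (RuntimeException) item;
        }
        return (String) item;
    }

    public static void main(String[] args) {
        TerminalStateSketch stream = new TerminalStateSketch(2);
        stream.offer("a");
        stream.close();
        System.out.println(stream.next()); // a
        System.out.println(stream.next()); // null (clean end of stream)
    }
}
```

With the guard in place, a `close()` that arrives after an overflow finds `terminated` already set and leaves the queued fault untouched.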
Regression tests in `GalaxyRepositoryClientTests`: `deployEventStreamOverflowExceptionSurvivesASubsequentClose` (deterministic — capacity-2 stream, force overflow, then close, assert the overflow exception is surfaced) and `deployEventStreamConcurrentOverflowAndCloseAlwaysTerminate` (300-iteration concurrent race stress, mirrors `MxGatewayMediumFindingsTests.eventStreamConcurrentOverflowAndCloseAlwaysTerminate`). + +### Client.Java-022 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayChannels.java:161-172` | +| Status | Resolved | + +**Description:** The Javadoc on the no-validator `toCompletable(source, operation)` overload claims: "calling `cancel(true)` on either the direct return value or the user-facing chained future ultimately invokes `source.cancel(true)` (chained futures forward to the upstream stage they were derived from, which is this future)." This is not how `CompletableFuture.thenApply` (or `thenCompose`, `whenComplete`, etc.) actually behaves: a downstream stage's `cancel()` only marks that derived stage as cancelled, it does NOT propagate cancellation upstream to the originating `CancellingCompletableFuture`. The Client.Java-015 resolution actually fixes the bug by inlining the validator into the new `toCompletable(source, operation, validator)` overload (lines 224-252) so users never need a downstream stage, and by `GalaxyRepositoryClient.discoverHierarchyAsync` using an explicit `AtomicReference`-based override (which has a correct comment at line 218-221 acknowledging exactly this `thenCompose` limitation). The contradiction between the two adjacent comments will mislead the next maintainer who decides to add a convenience `.thenApply` on top of a `*Async` return value — they will assume cancellation still flows through and re-introduce the Client.Java-015 leak. 
+ +**Recommendation:** Rewrite the `toCompletable` Javadoc to state the actual contract: `cancel(...)` on the direct return value (the `CancellingCompletableFuture` instance) forwards to the source RPC, but `cancel(...)` on a `thenApply`/`thenCompose`/`thenAccept` *of* that future does not — the cancellation is captured at the derived stage and the upstream RPC continues until its deadline. Callers that need cancellation through a chained pipeline must follow the `discoverHierarchyAsync` pattern (custom `CompletableFuture` subclass tracking the current in-flight stage). The underlying `CancellingCompletableFuture` class doc (lines 254-258) is already correct; only the `toCompletable` paragraph is misleading. + +**Resolution:** 2026-05-20 — Rewrote the `toCompletable(source, operation)` Javadoc in `MxGatewayChannels` to reflect the actual `CompletableFuture` contract. The doc now states unambiguously: cancelling the direct return value (the `CancellingCompletableFuture`) forwards to the source `ListenableFuture` and aborts the underlying gRPC call (the Client.Java-015 fix), but cancelling a derived `thenApply`/`thenCompose`/`thenAccept`/`whenComplete` stage of that future does NOT propagate cancellation upstream — the derived stage is marked cancelled while the source RPC continues until its deadline. The Javadoc explicitly directs callers that need cancellation through a chained pipeline to either the `toCompletable(source, operation, validator)` overload (which inlines the validator into the `FutureCallback.onSuccess` path so the user-visible future is the same future cancellation is bound to) or the `GalaxyRepositoryClient.discoverHierarchyAsync` `AtomicReference`-based pattern (for `thenCompose` across paged calls). The `CancellingCompletableFuture` class Javadoc was already correct and is unchanged. Doc-only change; no behavior change and no new test required. 
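
The `CompletableFuture` behaviour this finding turns on can be checked with the standard library alone. In the sketch below, `DerivedCancelSketch` and `ForwardingFuture` are hypothetical names; `ForwardingFuture` only illustrates the general idea of a future whose `cancel` forwards to a source, and is not the `CancellingCompletableFuture` class itself.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

final class DerivedCancelSketch {
    private DerivedCancelSketch() {}

    // Hypothetical future that forwards cancel() to the source it wraps.
    static final class ForwardingFuture<T> extends CompletableFuture<T> {
        private final Future<?> source;

        ForwardingFuture(Future<?> source) {
            this.source = source;
        }

        @Override
        public boolean cancel(boolean mayInterruptIfRunning) {
            source.cancel(mayInterruptIfRunning); // abort the underlying work first
            return super.cancel(mayInterruptIfRunning);
        }
    }

    /** Cancels a thenApply stage and reports whether the upstream future was cancelled too. */
    static boolean upstreamCancelledByDerivedStage() {
        CompletableFuture<String> upstream = new CompletableFuture<>();
        CompletableFuture<Integer> derived = upstream.thenApply(String::length);
        derived.cancel(true);
        return upstream.isCancelled();
    }

    /** Cancels the direct return value and reports whether the source was cancelled. */
    static boolean sourceCancelledByDirectCancel() {
        CompletableFuture<String> source = new CompletableFuture<>();
        ForwardingFuture<String> direct = new ForwardingFuture<>(source);
        direct.cancel(true);
        return source.isCancelled();
    }

    public static void main(String[] args) {
        System.out.println("derived cancel reaches upstream: " + upstreamCancelledByDerivedStage()); // false
        System.out.println("direct cancel reaches source:    " + sourceCancelledByDirectCancel());   // true
    }
}
```

The first result is the reason a convenience `.thenApply` on top of an `*Async` return value silently loses cancellation.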
+ +### Client.Java-023 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:1054`, `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:634` | +| Status | Resolved | + +**Description:** `MxEvent.worker_sequence` is a proto `uint64` (line 634 of `mxaccess_gateway.proto`). The `stream-events` CLI text path prints it with `%d` (`client.out().printf("%d %s%n", event.getWorkerSequence(), event.getFamily());`), which interprets the underlying signed `long` value — sequences past `2^63` would render as a negative number. This is the exact same `uint64`-with-`%d` bug that Client.Java-020 fixed for the `galaxy-watch` `DeployEvent.sequence` field; the resolution's stated rule ("The same rule should apply to any other `uint64` proto fields that surface in CLI text output") was never extended to this site. In practice worker sequences will not reach `2^63` so this is latent rather than active, but the same fix and the same regression-test pattern apply. + +**Recommendation:** Replace the `%d` with `%s` plus `Long.toUnsignedString(event.getWorkerSequence())` (matching the Client.Java-020 fix in `GalaxyWatchCommand`), and add a regression test analogous to `MxGatewayCliTests.deployEventSequenceRendersAsUnsignedForHighUint64` covering the `stream-events` text-mode format string with `-1L`. The `--after-worker-sequence` CLI option (line 1035) is also typed as a `long`, which means the user cannot pass an unsigned value above `2^63 - 1` from the command line; that is a related but separate ergonomic gap worth noting in the same change. + +**Resolution:** 2026-05-20 — Updated the `stream-events` text-mode `client.out().printf` in `MxGatewayCli.StreamEventsCommand.call()` to use `%s` for the sequence and pass `Long.toUnsignedString(event.getWorkerSequence())`, mirroring the Client.Java-020 fix in `GalaxyWatchCommand`. 
Worker sequences past `2^63` now render as their correct unsigned decimal string instead of a negative signed long. An inline comment near the `printf` documents the unsigned-uint64 contract so the next person editing the format string knows not to switch back to `%d`. The JSON path through `protoJson(event)` was already correct (proto `JsonFormat` emits unsigned longs as decimal strings) and is unchanged. The `--after-worker-sequence` `long` ergonomic gap is a separate v2 concern and intentionally out of scope. Regression test: `MxGatewayCliTests.streamEventsWorkerSequenceRendersAsUnsignedForHighUint64` exercises the format string with the max-uint64 bit pattern (`-1L`) and asserts the output starts with `18446744073709551615 ` and does not start with `-1 `, mirroring `deployEventSequenceRendersAsUnsignedForHighUint64`. + +### Client.Java-024 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:855-883` | +| Status | Resolved | + +**Description:** `BenchReadBulkCommand` records per-call latency in `latenciesNanos[latencyCount++] = elapsed;` inside *both* the success branch (line 865) and the `catch (Exception ex)` failure branch (line 880). The failed-call durations are then fed into the `percentileSummaryMs` p50/p95/p99 calculation alongside successful calls, producing misleading latency stats when even a few transport errors occur during the bench window. Client.Rust-015 fixed exactly this pattern in `clients/rust/src/bin/bench-read-bulk.rs` ("stop bench-read-bulk from polluting success-latency histograms with failed-call durations"); the equivalent fix was not applied to the Java implementation. The cross-language matrix runner (`scripts/run-client-e2e-tests.ps1`) compares numbers across all five clients, so the Java numbers will be silently inconsistent with the Rust numbers on the same fault profile. 
+ +**Recommendation:** Drop the failure-branch latency record (only count `failed++`), or alternately maintain a separate `failedLatenciesNanos` array and report it as a distinct stat in the JSON output — but the success histogram must not include failed-call latencies. Cross-check the .NET, Go, and Python `bench-read-bulk` drivers in the same change to make sure all five clients use the same success-latency definition; the cross-language matrix is only useful if the metric is uniform. + +**Resolution:** 2026-05-20 — Dropped the failure-branch latency record in `BenchReadBulkCommand.call()`: the `catch (Exception ex)` block no longer appends `elapsed` to `latenciesNanos` and no longer grows the array — it only increments `failed++`. The success-latency histogram fed into `percentileSummaryMs` (p50/p95/p99/max/mean) is now success-call-only, matching the Client.Rust-015 fix. The JSON output still surfaces `failedCalls` as a distinct top-level count so observers see fault rates separately from latency. An inline comment on the catch block documents the contract so the next maintainer doesn't reinstate the record. New CLI test `MxGatewayCliTests.benchReadBulkCommandEmitsJsonSchemaKeys` (added under Client.Java-026 below) covers the JSON schema produced by the corrected path. The .NET / Go / Python bench drivers were intentionally left out of scope for this Java-focused finding — that cross-client audit is its own follow-up and tracked separately. 
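
A success-only latency recorder in the spirit of this fix could be sketched as below. The class `BenchLatencySketch` and its methods are hypothetical, and the nearest-rank percentile used here is an assumption; the definition used by `percentileSummaryMs` is not shown in this review.

```java
import java.util.Arrays;

// Hypothetical recorder: failures are counted, but never enter the latency histogram.
final class BenchLatencySketch {
    private long[] successLatenciesNanos = new long[16];
    private int successCount;
    private int failed;

    void recordSuccess(long elapsedNanos) {
        if (successCount == successLatenciesNanos.length) {
            successLatenciesNanos = Arrays.copyOf(successLatenciesNanos, successCount * 2);
        }
        successLatenciesNanos[successCount++] = elapsedNanos;
    }

    void recordFailure() {
        failed++; // deliberately no latency sample for a failed call
    }

    int successfulCalls() {
        return successCount;
    }

    int failedCalls() {
        return failed;
    }

    /** Nearest-rank percentile over successful calls only, in milliseconds. */
    double percentileMs(double percentile) {
        if (successCount == 0) {
            return Double.NaN;
        }
        long[] sorted = Arrays.copyOf(successLatenciesNanos, successCount);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        int index = Math.min(Math.max(rank - 1, 0), sorted.length - 1);
        return sorted[index] / 1_000_000.0;
    }

    public static void main(String[] args) {
        BenchLatencySketch bench = new BenchLatencySketch();
        for (long ms = 1; ms <= 4; ms++) {
            bench.recordSuccess(ms * 1_000_000L);
        }
        bench.recordFailure();
        System.out.println("p50 ms = " + bench.percentileMs(50));
        System.out.println("failed = " + bench.failedCalls());
    }
}
```

Reporting the failure count next to the percentiles keeps fault rate visible without letting fast-failing or timed-out calls skew the latency figures.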
+ +### Client.Java-025 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:1176-1185` | +| Status | Resolved | + +**Description:** `CommonOptions.toClientOptions()` populates the `MxGatewayClientOptions` builder with `endpoint`, `apiKey`, `plaintext`, `caCertificatePath`, `serverNameOverride`, and `callTimeout`, but never sets `shutdownTimeout` even though Client.Java-019 introduced it as a first-class option. CLI users therefore always inherit the 10-second default and have no way to override it from the command line, which makes the new option effectively client-library-only. CLI users running long-lived operations (a big `discover-hierarchy` page-chain, a streaming `galaxy-watch` session that needs to drain on Ctrl+C) cannot tune the shutdown deadline up; users running short health probes who want a small `connectTimeout` *and* a small `shutdownTimeout` to keep the CLI snappy on failure also cannot. + +**Recommendation:** Add a `--shutdown-timeout` option to `CommonOptions` (parsed via the existing `parseDuration` helper, default unset → use the 10-second library default) and propagate it into `toClientOptions()` so the CLI surface tracks the library surface. Include the resolved value in `redactedJsonMap()` so `--json` output shows the effective shutdown deadline. + +**Resolution:** 2026-05-20 — Added a `--shutdown-timeout` option to `CommonOptions` in `MxGatewayCli.java`, parsed via the existing `parseDuration` helper (so it accepts `10s`, `500ms`, ISO-8601 `PT10S`, etc.). A new lazy accessor `resolvedShutdownTimeout()` returns the parsed `Duration` when the user passed `--shutdown-timeout`, or `null` when unset so the `MxGatewayClientOptions` builder default (10s, established by Client.Java-019) applies. 
`toClientOptions()` now conditionally calls `builder.shutdownTimeout(resolvedShutdownTimeout)` only when the user opted in, preserving the library default for the common case. `redactedJsonMap()` includes the resolved value under key `"shutdownTimeout"` (empty string when unset) so `--json` output shows the effective shutdown deadline. The CLI surface now tracks the library surface so a user running a long page-chain can pass `--shutdown-timeout 60s`, and a user running a short health probe can pair `--timeout 500ms` with `--shutdown-timeout 500ms` to keep the CLI snappy on failure. Behavior for callers who do not pass the new flag is unchanged. + +### Client.Java-026 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `clients/java/mxgateway-cli/src/test/java/com/dohertylan/mxgateway/cli/MxGatewayCliTests.java` | +| Status | Resolved | + +**Description:** Client.Java-013 explicitly deferred adding CLI-level test coverage for the `read-bulk`, `write-bulk`, and `bench-read-bulk` subcommands ("Optionally also add at least one CLI-level test for `read-bulk`, `write-bulk`, and the `bench-read-bulk` subcommands to keep parity with the .NET / Go / Rust CLI smoke matrix"), and the resolution explicitly stated that "follow-up is tracked separately and out of scope for this unblock-compilation fix." That follow-up was never filed. The current `MxGatewayCliTests` only covers `version`, `open-session` (JSON redaction), `write`, `smoke`, `subscribe-bulk`, `unsubscribe-bulk`, and the Client.Java-020 unsigned-uint64 format string — six of the thirteen non-trivial subcommands the CLI ships are completely untested at the CLI layer (`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, `bench-read-bulk`), as are `stream-events`, the four `galaxy-*` commands, and `close-session`. 
The `FakeSession` stubs all return empty lists, so an end-to-end CLI test would catch JSON-shape regressions, argument-parsing bugs, and option contract breaks that the bulk Session unit tests on the library side do not exercise. This same coverage gap is what made Client.Java-013 itself only surface on a clean Gradle build. + +**Recommendation:** Add at least one round-trip CLI test per bulk subcommand (`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`) that exercises the JSON output shape and the value parser (`parseValue(type, text)` is shared across all five and the only `write*-bulk` path that catches typos in the type switch). Extending the `FakeSession` stubs to return at least one result row makes the assertions meaningful. The `bench-read-bulk` test can run with a 1-second `--duration-seconds` and a 0-second `--warmup-seconds` and assert the JSON schema keys (`totalCalls`, `latencyMs.p50`, `callsPerSecond`) rather than the numeric values. + +**Resolution:** 2026-05-20 — Added round-trip CLI tests for all six bulk-family subcommands plus the new Client.Java-023 unsigned-uint64 regression to `MxGatewayCliTests`. The `FakeSession` stubs were upgraded from empty-list returns to per-call recorders that publish the parsed entries (e.g. `lastWriteBulkEntries`, `lastReadBulkTimeoutMs`) and synthesise one `BulkReadResult`/`BulkWriteResult` per requested handle so the JSON output assertions exercise the `bulkReadResultMap` and `bulkWriteResultMap` serialisers. 
New tests: (a) `readBulkCommandForwardsTimeoutAndPrintsResults` — asserts `--timeout-ms 750` reaches the session and the JSON output carries the per-tag `tagAddress`, `itemHandle`, `wasCached`, and `quality` fields; (b) `writeBulkCommandParsesTypedValuesAndPrintsResults` — asserts `--type int32 --values 111,222 --user-id 5` parses through the shared `parseValue` switch and the entries are constructed with the expected typed `MxValue` and `userId`; (c) `write2BulkCommandForwardsTimestampAndPrintsResults` — asserts the `--timestamp 2026-05-20T00:00:00Z` reaches the entry as a `timestampValue` (`hasTimestampValue()` is true); (d) `writeSecuredBulkCommandForwardsUserIdsAndPrintsResults` — asserts `--current-user-id 7 --verifier-user-id 8` are both propagated; (e) `writeSecured2BulkCommandForwardsTimestampAndUserIdsAndPrintsResults` — combination of (c) and (d); (f) `benchReadBulkCommandEmitsJsonSchemaKeys` — runs the bench in a 1s steady / 0s warmup window and asserts the JSON output contains the cross-language schema keys (`language=java`, `command=bench-read-bulk`, `bulkSize=2`, `totalCalls`, `successfulCalls`, `failedCalls`, `callsPerSecond`, `latencyMs.p50/p95/p99`, `tags` including the synthesised `TestMachine_001.TestChangingInt`/`TestMachine_002.TestChangingInt` pair); (g) `streamEventsWorkerSequenceRendersAsUnsignedForHighUint64` — Client.Java-023 regression. The recommendation's stream-events and galaxy-* CLI tests were intentionally not added in this round — they require either an in-process gateway/galaxy server or package-private `MxEventStream`/`DeployEventStream` constructor access from the CLI test module, which is its own infrastructure work; the library-side tests in `GalaxyRepositoryClientTests` already cover the streaming wire behaviour. 
+ +### Client.Java-027 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Documentation & comments | +| Location | `clients/java/README.md:36,107-175,185,205,220`, `clients/java/JavaClientDesign.md:195-211` | +| Status | Open | + +**Description:** Commit `397d3c5` renamed the gradle subprojects to `zb-mom-ww-mxgateway-client` and `zb-mom-ww-mxgateway-cli` in `settings.gradle`, but did not propagate that rename into the README's documented gradle commands or into `JavaClientDesign.md`. Every documented gradle invocation still uses the old short names — `gradle :mxgateway-client:generateProto`, `gradle :mxgateway-cli:run --args=...`, `gradle :mxgateway-client:jar :mxgateway-cli:installDist` — and every one fails with `project 'mxgateway-client' not found in root project 'zb-mom-ww-mxaccessgw-java'`. A user copy-pasting from the README or design doc will hit this on the very first command. + +**Recommendation:** Find-and-replace `:mxgateway-client:` → `:zb-mom-ww-mxgateway-client:` and `:mxgateway-cli:` → `:zb-mom-ww-mxgateway-cli:` across `clients/java/README.md` and `clients/java/JavaClientDesign.md`. (Roughly 17 occurrences in the README, 3 in the design doc.) Test by running one updated command end-to-end. + +**Resolution:** _(empty until closed)_ + +### Client.Java-028 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Documentation & comments | +| Location | `clients/java/JavaClientDesign.md:23-27` | +| Status | Open | + +**Description:** The build-layout block in `JavaClientDesign.md` still shows the old Java package paths `com/dohertylan/mxgateway/client/` and `com/dohertylan/mxgateway/cli/`. The actual source tree was moved to `com/zb/mom/ww/mxgateway/{client,cli}/` in commit `397d3c5`. Anyone using the design doc to locate or navigate code will look in the wrong place. + +**Recommendation:** Update the layout block to reflect the new paths: `com/zb/mom/ww/mxgateway/client/` and `com/zb/mom/ww/mxgateway/cli/`. 
Comment-only change. + +**Resolution:** _(empty until closed)_ + +### Client.Java-029 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `clients/java/README.md:208-209` | +| Status | Open | + +**Description:** The packaging section states "The library jar is under `zb-mom-ww-mxgateway-client/build/libs`. The installed CLI distribution is under `zb-mom-ww-mxgateway-cli/build/install/mxgateway-cli`." The library-jar path is correct, but the install-distribution path is wrong — gradle's `installDist` produces a directory whose name matches the project name, not the (now-retired) short name, so the actual path is `zb-mom-ww-mxgateway-cli/build/install/zb-mom-ww-mxgateway-cli/`. The e2e script (`scripts/run-client-e2e-tests.ps1`) uses the correct path, so the script works; only the README is wrong. + +**Recommendation:** Correct the README to `zb-mom-ww-mxgateway-cli/build/install/zb-mom-ww-mxgateway-cli`. Comment-only. + +**Resolution:** _(empty until closed)_ + +### Client.Java-030 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `clients/java/zb-mom-ww-mxgateway-client/src/test/java/com/zb/mom/ww/mxgateway/client/` | +| Status | Open | + +**Description:** Commit `397d3c5` added the missing `QueryActiveAlarmsRequest` proto message and the corresponding `rpc QueryActiveAlarms` to `mxaccess_gateway.proto`. The Java client now generates the request type and the gRPC stub method, and `MxGatewayClient.queryActiveAlarms` correctly references both. No unit test exercises the new RPC end-to-end on the Java side — the proto compiles, the import resolves, the method is callable, but the absence of a `queryActiveAlarmsForwardsRequestAndStreamsSnapshots` (or similar) fixture means a serialisation regression in `QueryActiveAlarmsRequest` or the streaming reply would not surface in the Java unit tests. 
+ +**Recommendation:** Add a unit test in `MxGatewayFixtureTests` (or wherever the alarm fixtures live) that pushes a `QueryActiveAlarmsRequest` through a `FakeMxAccessGateway` and asserts the `ActiveAlarmSnapshot` stream is consumed correctly — mirror the existing `acknowledgeAlarm` test shape. + +**Resolution:** _(empty until closed)_ + +### Client.Java-031 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | mxaccessgw conventions | +| Location | `clients/java/README.md:13,17,26` | +| Status | Open | + +**Description:** The README prose at lines 13–26 introduces the subprojects as `mxgateway-client` and `mxgateway-cli` (the old short names) when discussing the layout. Those are no longer the actual subproject names — `settings.gradle` declares `zb-mom-ww-mxgateway-client` / `zb-mom-ww-mxgateway-cli`. The prose works as a naming mnemonic, but it confuses anyone trying to map README descriptions to actual gradle output, IDE project trees, or the e2e script. + +**Recommendation:** Either (a) update the prose to the full prefixed names, or (b) clarify in a one-line note: "The subprojects are `zb-mom-ww-mxgateway-client` and `zb-mom-ww-mxgateway-cli`; this README refers to them by their short suffixes below for readability." (a) is more honest; (b) preserves readability at the cost of one extra concept. + +**Resolution:** _(empty until closed)_ diff --git a/code-reviews/Client.Python/findings.md b/code-reviews/Client.Python/findings.md new file mode 100644 index 0000000..c48cbba --- /dev/null +++ b/code-reviews/Client.Python/findings.md @@ -0,0 +1,797 @@ +# Code Review — Client.Python + +| Field | Value | +|---|---| +| Module | `clients/python` | +| Reviewer | Claude Code | +| Review date | 2026-05-24 | +| Commit reviewed | `d692232` | +| Status | Reviewed | +| Open findings | 0 | + +## Checklist coverage + +A re-review at commit `a020350` over the same module. 
Prior findings +(Client.Python-001 — Client.Python-017) remain closed and are kept as +history. This section reflects categories evaluated in this pass. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No new issues found — TLS-by-default fix in Client.Python-013 verified; no test fixture accidentally relies on plaintext defaults. | +| 2 | mxaccessgw conventions | No new issues found — secrets redacted, MXAccess parity preserved, generated code untouched. | +| 3 | Concurrency & thread safety | No new issues found — close-idempotency and shared cancel-on-cancel iterator still in place. | +| 4 | Error handling & resilience | No new issues found. | +| 5 | Security | No new issues found — `_use_plaintext` now requires explicit `--plaintext` opt-in (Client.Python-013 resolution verified). The `--api-key` flag is also still redacted from the option repr and CLI errors. | +| 6 | Performance & resource management | No new issues found. | +| 7 | Design-document adherence | No new issues found — `PythonClientDesign.md` is consistent with the implemented surface. | +| 8 | Code organization & conventions | Issue found: `mxgateway_cli` is shipped in the wheel but has no PEP 561 `py.typed` marker (Client.Python-019), so the CLI module's inline type hints are invisible to downstream `mypy` runs. | +| 9 | Testing coverage | Issue found: no test exercises the wheel-build / editable-install flow; the broken `pyproject.toml` (Client.Python-018) was not caught at commit time because the test suite runs from `src/` via `pytest pythonpath` (Client.Python-020). | +| 10 | Documentation & comments | Issue found: cross-client CLI parity gap — the Python CLI ships none of the Galaxy subcommands (`galaxy-test-connection`, `galaxy-last-deploy`, `galaxy-discover`, `galaxy-watch`) the .NET / Go / Rust / Java CLIs all expose, and lacks the new `.NET`-only `bench-stream-events`. README does not flag the gap (Client.Python-021). 
| + +### 2026-05-24 review (commit d692232) + +Re-review pass at `d692232`. Diff against `a020350` is commit `397d3c5`: +package directories renamed (`src/mxgateway` → `src/zb_mom_ww_mxgateway`, +`src/mxgateway_cli` → `src/zb_mom_ww_mxgateway_cli`), distribution name +changed to `zb-mom-ww-mxaccess-gateway-client`, console-script +`mxgw-py` retained, every `from mxgateway` / `import mxgateway` updated. +A first-pass case-insensitive regex sweep corrupted the binary descriptor +bytes in the generated `_pb2.py` files; the fix was to restore the +original `_pb2.py` artifacts from the pre-rename directory before +deleting it, so the csharp_namespace bytes still carry the old string — +this is documented as wire-level metadata not used by Python at runtime. +Hostname / cert / temp-dir example identifiers (`mxgateway.example.local`, +`mxgateway-ca.pem`, `mxgateway-python-wheel`) were intentionally preserved. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. | +| 2 | mxaccessgw conventions | No issues found — wire identifiers preserved. | +| 3 | Concurrency & thread safety | No issues found in this diff. | +| 4 | Error handling & resilience | No issues found in this diff. | +| 5 | Security | No issues found in this diff. | +| 6 | Performance & resource management | No issues found in this diff. | +| 7 | Design-document adherence | No issues found — `PythonClientDesign.md` reflects new paths. | +| 8 | Code organization & conventions | No issues found in this diff. | +| 9 | Testing coverage | No issues found in this diff — alarm test fixtures correctly drop retired `session_id` from `AcknowledgeAlarmRequest` while retaining it on `QueryActiveAlarmsRequest`. | +| 10 | Documentation & comments | No issues found in this diff. 
| + +## Findings + +### Client.Python-001 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `clients/python/pyproject.toml:8,25`, `clients/python/src/mxgateway_cli/commands.py:25` | +| Status | Resolved | + +**Description:** The package `description` in `pyproject.toml` still says "Async Python client *scaffold*" even though the client is fully implemented. Stale "scaffold" wording misrepresents maturity to anyone reading PyPI metadata. (The `mxgw-py` console-script name is itself consistent between `pyproject.toml` and the README.) + +**Recommendation:** Update the `pyproject.toml` description to drop "scaffold"; keep README CLI examples in sync with the actual `mxgw-py` entry point. + +**Resolution:** 2026-05-18 — Confirmed: `pyproject.toml:8` `description` read "Async Python client scaffold for MXAccess Gateway." Changed to "Async Python client for MXAccess Gateway." The `mxgw-py` console-script name was already consistent with the README, so no README change was needed. Pure metadata fix — no test required. + +### Client.Python-002 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `clients/python/src/mxgateway/__init__.py:27` | +| Status | Resolved | + +**Description:** `MxGatewayCommandError` is imported into `__init__.py` and is a documented public exception, but it is missing from `__all__`. It is the parent of `MxAccessError` and a meaningful catch target, so omitting it from the public surface is inconsistent — `from mxgateway import *` will not expose it and tooling that respects `__all__` treats it as private. + +**Recommendation:** Add `"MxGatewayCommandError"` to the `__all__` list. + +**Resolution:** 2026-05-18 — Re-triaged: this finding is stale against the reviewed source. `clients/python/src/mxgateway/__init__.py` already imports `MxGatewayCommandError` (line 16) **and** lists `"MxGatewayCommandError"` in `__all__` (line 38). 
`from mxgateway import *` exposes it correctly. Verified at runtime (`'MxGatewayCommandError' in mxgateway.__all__` is `True`). No source change required — the defect described no longer exists. + +### Client.Python-003 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `clients/python/src/mxgateway/client.py:125-137,155-173` | +| Status | Resolved | + +**Description:** `stream_events_raw` and `query_active_alarms` call the stub directly with a `timeout` kwarg when `stream_timeout` is set, with no `TypeError` fallback. `galaxy.py:watch_deploy_events` and `_unary` *do* have a fallback that strips `timeout` if the callable rejects it. This asymmetry means a fake/older stub that does not accept `timeout` crashes for gateway streams but not Galaxy streams. It is only masked today because `stream_timeout` defaults to `None`. + +**Recommendation:** Apply the same `try/except TypeError` timeout-fallback pattern to `stream_events_raw` and `query_active_alarms`, or remove the fallback everywhere and standardise on a single behaviour. + +**Resolution:** 2026-05-18 — Confirmed: both stream methods in `client.py` called the stub with `timeout` unconditionally and had no `TypeError` fallback, unlike `_unary` and `galaxy.watch_deploy_events`. Added a shared `_open_stream` helper in `client.py` that opens a server-streaming call and strips the `timeout` kwarg when the stub raises `TypeError: ... unexpected keyword argument 'timeout'`, then routed both `stream_events_raw` and `query_active_alarms` through it. Regression tests in `tests/test_stream_timeout_fallback.py` (`test_stream_events_raw_falls_back_when_stub_rejects_timeout`, `test_query_active_alarms_falls_back_when_stub_rejects_timeout`, `test_stream_events_raw_still_passes_timeout_to_capable_stub`) failed before the fix and pass after. No public behaviour change for real gRPC stubs, so no README update needed. 
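
A minimal sketch of the timeout fallback described in the Client.Python-003 resolution, assuming a stub method is any callable that accepts the request plus keyword arguments. The name `open_stream` and its signature are illustrative only; this is not the actual `_open_stream` helper in `client.py`.

```python
from typing import Any, Callable, Optional


def open_stream(
    method: Callable[..., Any],
    request: Any,
    *,
    timeout: Optional[float] = None,
    **kwargs: Any,
) -> Any:
    """Open a streaming call, retrying without `timeout` if the stub rejects it.

    Hypothetical helper: `method` stands in for a gRPC stub method or a test fake.
    """
    if timeout is None:
        # Nothing to strip, so call the stub exactly once.
        return method(request, **kwargs)
    try:
        return method(request, timeout=timeout, **kwargs)
    except TypeError as exc:
        # Only swallow the specific "rejects the timeout kwarg" failure;
        # any other TypeError is a real bug and must propagate.
        if "unexpected keyword argument 'timeout'" not in str(exc):
            raise
        return method(request, **kwargs)
```

Matching on the message keeps unrelated `TypeError`s visible. The trade-off is a dependency on CPython's error wording, which is one reason the finding also offers removing the fallback everywhere as an alternative.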
+ +### Client.Python-004 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/python/src/mxgateway_cli/commands.py:386,402-404` | +| Status | Resolved | + +**Description:** In `_smoke`, the local variable `closed` is set to `False` and never reassigned; the `finally` block's `if not closed:` is therefore always true. This is dead/misleading code suggesting a removed early-close path. + +**Recommendation:** Remove the `closed` variable and the `if not closed:` guard; call `await session.close()` directly in the `finally` block (or use `async with session:`). + +**Resolution:** 2026-05-18 — Confirmed: `closed = False` was set and never reassigned, making `if not closed:` dead code. Replaced the `try/finally` with `async with session:` so the session is closed via the documented async context manager — `Session` already implements `__aexit__` → `close()`. Behaviour is unchanged (the session is still closed on every exit path); no test needed for the dead-code removal — exercised by the existing CLI smoke test. + +### Client.Python-005 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Performance & resource management | +| Location | `clients/python/src/mxgateway/galaxy.py:117-140` | +| Status | Resolved | + +**Description:** `discover_hierarchy` pages through the entire Galaxy object hierarchy and accumulates every `GalaxyObject` (each carrying its full attribute list) into a single in-memory `list` before returning. For a large Galaxy this is a very large allocation with no streaming alternative and no caller-side bound. + +**Recommendation:** Offer an async-generator variant (e.g. `iter_hierarchy()`) that yields objects/pages as they arrive, keeping `discover_hierarchy()` as a convenience wrapper. At minimum document the memory characteristic. + +**Resolution:** 2026-05-18 — Confirmed: `discover_hierarchy` buffered the entire hierarchy with no streaming alternative. 
Added `GalaxyRepositoryClient.iter_hierarchy`, an async generator that fetches one `DiscoverHierarchyRequest` page at a time and yields each `GalaxyObject` as it arrives, so peak memory is bounded by a single page (`_DISCOVER_HIERARCHY_PAGE_SIZE`). Pages are fetched lazily — the next page is only requested after the current page is fully consumed. `discover_hierarchy` is now a thin convenience wrapper (`[obj async for obj in self.iter_hierarchy()]`) that preserves its `list[GalaxyObject]` contract, including the repeated-page-token guard. Regression tests in `tests/test_galaxy_iter_hierarchy.py` (`test_iter_hierarchy_yields_objects_across_pages`, `test_iter_hierarchy_is_lazy_and_does_not_prefetch_next_page`, `test_iter_hierarchy_rejects_repeated_page_token`, `test_discover_hierarchy_still_returns_full_list`) failed before the fix and pass after. `clients/python/README.md` updated with the `iter_hierarchy` usage and memory guidance since this adds a new public method. + +### Client.Python-006 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `clients/python/src/mxgateway/client.py:74-82`, `clients/python/src/mxgateway/galaxy.py:85-93`, `clients/python/src/mxgateway/session.py:38-55` | +| Status | Resolved | + +**Description:** `close()` on the clients and `Session.close()` use a plain `self._closed` check-then-set with an `await` between, with no lock. If two coroutines call `close()` concurrently both can pass the guard before either sets it, causing a double `channel.close()` / double `CloseSession` RPC. Single-task usage is the documented contract, so impact is low, but the idempotency guarantee asserted in docstrings only holds for sequential calls. + +**Recommendation:** Set `self._closed = True` before the `await`, or guard with an `asyncio.Lock`, so the idempotency claim holds under concurrent close. + +**Resolution:** 2026-05-18 — Confirmed the check-then-set window. 
Fixed `GatewayClient.close`, `GalaxyRepositoryClient.close`, and `Session.close` to set `self._closed = True` *before* the `await` (channel close / `CloseSession` RPC). A second coroutine entering `close()` while the first is still awaiting now hits the early-return guard and does not issue a second `channel.close()` / `CloseSession`. Docstrings updated to state the idempotency holds under concurrent calls. TDD: regression tests in `tests/test_low_severity_findings.py` (`test_gateway_client_concurrent_close_closes_channel_once`, `test_galaxy_client_concurrent_close_closes_channel_once`, `test_session_concurrent_close_sends_one_close_session_rpc`) — each uses a fake channel/client that stalls inside `close`/`close_session_raw` so two concurrent `close()` calls interleave at the exact race window; they failed before the fix and pass after. + +### Client.Python-007 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `clients/python/src/mxgateway/client.py:204-213` | +| Status | Resolved | + +**Description:** `_canceling_iterator` (gateway event stream) does not catch `asyncio.CancelledError` to invoke `call.cancel()` explicitly — it relies on the `finally` block. `galaxy.py:_canceling_iterator` *does* explicitly catch `CancelledError`, cancel, and re-raise. The two are functionally equivalent today, but the inconsistency between near-identical helpers invites future divergence. + +**Recommendation:** Make the two `_canceling_iterator` helpers identical, ideally by factoring a single shared helper. + +**Resolution:** 2026-05-18 — Confirmed the divergence. Factored a single shared helper: `client._canceling_iterator(call, operation)` now takes the `map_rpc_error` operation string as a parameter, explicitly catches `asyncio.CancelledError` (cancels the call, re-raises) and `grpc.RpcError`, and repeats the cancel in `finally`. 
This replaces both the gateway `_canceling_iterator` and the gateway `_canceling_active_alarms_iterator`; `galaxy.py` now imports and delegates to the same helper instead of defining its own, so the gateway and Galaxy stream helpers are byte-for-byte identical. TDD: `tests/test_low_severity_findings.py::test_gateway_stream_iterator_cancels_call_on_task_cancellation` drives a cancellable fake stream and asserts the gateway iterator cancels the underlying call on task cancellation. All existing stream-cancellation tests still pass. + +### Client.Python-008 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/python/src/mxgateway/values.py:62-67,83-88` | +| Status | Resolved | + +**Description:** `to_mx_value` maps any Python `float` to `VT_R8`/`MX_DATA_TYPE_DOUBLE` with no handling for `nan`/`inf`, which are serialised and forwarded to MXAccess which may reject or mis-handle them. `bytes` is mapped to `VT_RECORD`/`MX_DATA_TYPE_UNKNOWN`, a questionable default. The `data_type` keyword exists but `Session.write` never forwards it. + +**Recommendation:** Document the float/bytes mapping assumptions, optionally validate finiteness, and consider plumbing the `data_type` keyword through `Session.write`/`write2`. + +**Resolution:** 2026-05-18 — Confirmed the non-finite-float hazard. Added an `_ensure_finite` guard in `values.py`: `to_mx_value` now raises `ValueError` for `nan`/`inf`/`-inf`, both for a scalar `float` and for a non-finite element inside a float sequence — MXAccess has no defined wire representation for non-finite doubles, so rejecting client-side is the correct fail-fast. The `float`/`bytes` mapping assumptions (finite-only doubles; `bytes` as an opaque `VT_RECORD` pass-through) are now documented in the `values.py` module docstring and `clients/python/README.md`. 
Plumbing `data_type` through `Session.write`/`write2` was deliberately *not* done: it is a larger public-API surface change the finding only marks as "consider", and the documented MXAccess-parity convention is type-by-Python-value; the `data_type` keyword stays available on `to_mx_value` for callers that build the `MxValue` directly. TDD: `tests/test_low_severity_findings.py` adds `test_to_mx_value_rejects_nan`, `test_to_mx_value_rejects_positive_infinity`, `test_to_mx_value_rejects_negative_infinity`, `test_to_mx_value_rejects_non_finite_float_in_sequence`, and `test_to_mx_value_accepts_finite_float`. README updated since `to_mx_value` (used by `Session.write`/`write2`) now rejects an input it previously accepted. + +### Client.Python-009 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `clients/python/tests/` | +| Status | Resolved | + +**Description:** Several non-trivial public paths are untested: `Session.write2`/`add_item2` request construction; the bulk-size limit `_ensure_bulk_size`/`MAX_BULK_ITEMS` guard; the `None`-argument `TypeError` guards in bulk methods; the TLS `ca_file` read path in `create_channel`; most CLI command bodies; and `map_rpc_error`'s default (non-auth) branch. + +**Recommendation:** Add tests for `write2`/`add_item2` request shape, the bulk-size `ValueError`, the `ca_file` TLS branch, the generic `map_rpc_error` fallthrough, and at least one happy-path CLI command using a fake stub. + +**Resolution:** 2026-05-18 — Confirmed coverage gap against the existing `tests/` files. 
Added `tests/test_coverage_gaps.py` covering every path the finding lists: `test_add_item2_sends_item_context_and_returns_handle` and `test_write2_sends_value_and_timestamp_value` (request shape + `MxValue` oneof), `test_subscribe_bulk_rejects_oversized_request` and `test_add_item_bulk_at_limit_is_allowed` (the `MAX_BULK_ITEMS` `_ensure_bulk_size` boundary), `test_advise_item_bulk_rejects_none_argument` (the `None`-argument `TypeError` guard), `test_create_channel_reads_ca_file` and `test_create_channel_missing_ca_file_raises` (the TLS `ca_file` read path), `test_map_rpc_error_generic_branch_returns_transport_error` and `test_map_rpc_error_handles_error_without_code` (the non-auth `map_rpc_error` fallthrough and the no-`code` path), and `test_cli_register_happy_path_emits_server_handle` (a happy-path CLI command body driven end to end through `CliRunner` with a fake stub via a monkeypatched `_connect`). All 10 new tests pass. No source change required — this is a pure coverage finding. + +### Client.Python-010 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `clients/python/src/mxgateway/session.py:404`, `clients/python/src/mxgateway_cli/commands.py:422-425` | +| Status | Resolved | + +**Description:** `session.py` ends with a module-level late import `from .client import GatewayClient # noqa: E402` purely to satisfy a string type hint, and `commands.py:_session` does a function-local import. Both work around a circular dependency that `from __future__ import annotations` (already in effect) makes unnecessary. `_session` also lacks a return type annotation. + +**Recommendation:** Drop the runtime late import in `session.py` and use a `TYPE_CHECKING`-guarded import for the hint; add the `-> Session` return annotation to `commands.py:_session`. 
+ +**Resolution:** 2026-05-18 — Confirmed: with `from __future__ import annotations` in effect all annotations are strings, so the runtime late import was unnecessary. Removed the trailing `from .client import GatewayClient # noqa: E402` in `session.py` and replaced it with a top-of-file `if TYPE_CHECKING:` import that satisfies the `GatewayClient` hint without a runtime dependency (no import cycle: `client.py` does not import `session` at module scope). In `commands.py`, hoisted the function-local `from mxgateway.session import Session` to a module-level import and added the `-> Session` return annotation to `_session`. Verified `import mxgateway` and `import mxgateway_cli.commands` succeed with no circular-import error. Pure refactor — covered by the existing import and CLI tests; no new test needed. + +### Client.Python-011 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `clients/python/src/mxgateway/errors.py:122-148` | +| Status | Resolved | + +**Description:** `ensure_mxaccess_success` raises `MxAccessError` if any `mx_status.success == 0`. This treats `success == 0` as the failure sentinel, but `0` is also the proto3 scalar default for an unset `MxStatusProxy`. If the gateway ever returns a reply with an unpopulated status entry (e.g. a partially-filled bulk result), the client raises `MxAccessError` even though no real failure occurred. + +**Recommendation:** Confirm against the proto/gateway contract whether `success` is guaranteed populated for every `statuses` entry; if not, key the failure decision on an explicit failure field rather than the `success == 0` default. + +**Resolution:** 2026-05-18 — Confirmed against the gateway contract: `success` is **not** guaranteed populated for every `statuses` entry. 
`src/MxGateway.Worker/Conversion/MxStatusProxyConverter.cs::ConvertMany` emits a placeholder `MxStatusProxy` for a null `MXSTATUS_PROXY` COM array entry, setting `Category`/`DetectedBy` to `Unknown` but **leaving `Success` at its proto3 default of 0**. A fully-default proto entry likewise has `success == 0`. Under the old client logic either placeholder would falsely raise `MxAccessError`. Fixed `ensure_mxaccess_success` to key the per-status failure decision on a new `_is_mxaccess_status_failure` helper that requires `success == 0` **and** a populated, non-OK `category` — a status with `category` of `MX_STATUS_CATEGORY_UNSPECIFIED` (default proto) or `MX_STATUS_CATEGORY_UNKNOWN` (the null-entry placeholder) is treated as unpopulated and ignored. `MX_STATUS_CATEGORY_OK` is also excluded so a genuine success entry never raises. Real failures (categories `WARNING` and the error categories, raw value ≥ 2) still raise as before — the existing `write.mxaccess-failure` fixture (`SECURITY_ERROR`/`OPERATIONAL_ERROR` statuses) and the `MXACCESS_FAILURE` protocol-status path are unaffected. TDD: `tests/test_low_severity_findings.py` adds `test_ensure_mxaccess_success_ignores_unpopulated_status_entry` (default + null-placeholder entries, no raise), `test_ensure_mxaccess_success_raises_on_populated_failure_status` (populated `COMMUNICATION_ERROR`, raises), and `test_ensure_mxaccess_success_passes_when_status_reports_success`. No public-behaviour change for genuine replies, so no README update. 
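
The per-status decision described in the Client.Python-011 resolution can be sketched roughly as below. `StatusCategory`, `Status`, and `is_status_failure` are hypothetical stand-ins for the generated proto types and the `_is_mxaccess_status_failure` helper. The enum lists only categories named in this finding, and its values are placeholders rather than the real wire numbers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class StatusCategory(Enum):
    # Placeholder values: the real proto enum numbers are not reproduced here.
    UNSPECIFIED = "unspecified"      # default proto entry
    UNKNOWN = "unknown"              # null-entry placeholder from the worker
    OK = "ok"
    WARNING = "warning"
    COMMUNICATION_ERROR = "communication_error"


@dataclass(frozen=True)
class Status:
    success: int = 0
    category: StatusCategory = StatusCategory.UNSPECIFIED


# Categories that never count as a failure, even when success == 0.
_NON_FAILURE = frozenset(
    {StatusCategory.UNSPECIFIED, StatusCategory.UNKNOWN, StatusCategory.OK}
)


def is_status_failure(status: Status) -> bool:
    """A failure needs success == 0 AND a populated, non-OK category."""
    return status.success == 0 and status.category not in _NON_FAILURE


def first_failure(statuses: Iterable[Status]) -> Optional[Status]:
    """Return the first genuine failure, or None if every entry is benign."""
    return next((s for s in statuses if is_status_failure(s)), None)
```

Keying on the category as well as `success` is what lets an unpopulated entry pass through without raising, while a populated error category still does.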
+ +### Client.Python-012 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | mxaccessgw conventions | +| Location | `clients/python/src/mxgateway/client.py:84-108`, `clients/python/src/mxgateway/session.py:57-77` | +| Status | Won't Fix | + +**Description:** `Session.invoke_raw` does not run `ensure_mxaccess_success` while `Session.invoke` does, so a caller using `invoke_raw` for parity tests gets a reply where an MXAccess HRESULT failure is silently embedded with no exception. This is by design but under-documented — the README's "preserve raw replies" sentence does not state that `*_raw` methods skip MXAccess-failure detection entirely. + +**Recommendation:** Document explicitly (README + docstring) that `*_raw` methods surface MXAccess HRESULT/status failures only inside the reply and do not raise `MxAccessError`, so parity-test callers know to inspect `protocol_status`/`hresult`/`statuses` themselves. + +**Resolution:** 2026-05-18 — Won't Fix (no behaviour change). Confirmed this is intentional, correct parity behaviour: the `*_raw` methods exist precisely so parity-test callers can inspect an unmodified gateway reply, including embedded MXAccess HRESULT/status failures, without an exception masking them. Changing `invoke_raw` to raise `MxAccessError` would defeat its purpose and duplicate `Session.invoke`. The finding's only actionable point is the documentation gap, which has been addressed: `clients/python/README.md` now states explicitly that `*_raw` methods enforce gateway protocol success only and do **not** run MXAccess-failure detection, and the docstrings of `GatewayClient.invoke_raw` and `Session.invoke_raw` say the same and point callers to inspect `protocol_status`/`hresult`/`statuses` (and to `Session.invoke` for the checked variant). No code/test change — the runtime contract is unchanged and correct. 
+ +### Client.Python-013 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Security | +| Location | `clients/python/src/mxgateway_cli/commands.py:757-762` | +| Status | Resolved | + +**Description:** `_use_plaintext` silently returns `True` whenever the endpoint +string starts with `localhost:` or `127.0.0.1:`, even if neither `--plaintext` +nor `--tls` is supplied on the command line. Any CLI subcommand (e.g. +`mxgw-py open-session --endpoint localhost:5001 --api-key mxgw_`) then +attaches the API key to a plaintext gRPC channel without warning. This is a +silent security downgrade: a user who deliberately ran the gateway behind TLS +on loopback (e.g. for testing a production-shaped TLS config locally) and who +passes `--api-key` expecting the secret to be transport-protected gets a +plaintext bearer token instead. The auto-downgrade is also undocumented — +`README.md` and the CLI `--help` text both describe `--plaintext` and `--tls` +as the controls, with no mention that endpoint-prefix matching can override +either. The other client CLIs do not auto-downgrade: the .NET CLI uses +`https://`-prefix detection on a URI scheme (an explicit signal), Go and Java +require an explicit `--plaintext`/`--tls` choice, and Rust defaults to +plaintext only when `plaintext = true` is set on the options struct. + +**Recommendation:** Drop the localhost-prefix auto-plaintext branch and +require the user to pass `--plaintext` or `--tls` (or default to TLS to match +the rest of the matrix). If the implicit-localhost behaviour is kept for +ergonomics, document it prominently in both `README.md` and `--help`, emit a +stderr warning when `--api-key` is combined with the auto-downgrade path, and +add a CLI test asserting the auto-downgrade is in fact active so it is not +silently lost in a future refactor. + +**Resolution:** 2026-05-20 — Removed the silent `localhost:` / `127.0.0.1:` +auto-plaintext branch from `_use_plaintext`. 
The new contract matches the Go +and Java CLIs: **TLS is the default**, `--plaintext` is the only way to opt +in to an unencrypted channel, and `--tls` is accepted as a redundant, explicit +affirmation of the default (mutually exclusive with `--plaintext`, which now +raises `click.UsageError`). The `--plaintext` / `--tls` `--help` text and +`clients/python/README.md` both call out the new behaviour. Added six +regression tests in `clients/python/tests/test_cli.py` covering: (a) a +`localhost:` endpoint with no flags resolves to TLS, (b) a `127.0.0.1:` +endpoint with no flags resolves to TLS, (c) `--plaintext` opts in to plaintext, +(d) `--tls` is accepted and idempotent with the default, (e) `--plaintext` +combined with `--tls` is rejected, and (f) an end-to-end CliRunner test +asserting `ClientOptions.plaintext == False` flows through to +`GatewayClient.connect` when no flag is supplied against a `localhost:` +endpoint. **Behaviour change for callers:** scripts that previously relied on +`mxgw-py … --endpoint localhost:5000 …` selecting plaintext silently must now +add an explicit `--plaintext` flag (or set up TLS on the gateway). Calling +`mxgw-py` with an `--api-key` against a plaintext-only gateway without +`--plaintext` will now fail to connect rather than silently leaking the bearer +token. + +### Client.Python-014 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `clients/python/src/mxgateway_cli/commands.py:22-23` | +| Status | Resolved | + +**Description:** `commands.py` has two consecutive `from mxgateway.values +import` lines: + +```python +from mxgateway.values import to_mx_value +from mxgateway.values import MxValueInput +``` + +These import from the same module and should be combined into a single +`from mxgateway.values import MxValueInput, to_mx_value`. 
The split form is +inconsistent with the rest of the file (every other module is imported in a +single statement) and would be flagged by `ruff`/`isort` if any linter were +configured. Pure style, no behavioural impact. + +**Recommendation:** Collapse the two imports into one statement, ordered to +match the conventional alphabetical-within-module pattern: +`from mxgateway.values import MxValueInput, to_mx_value`. + +**Resolution:** 2026-05-20 — Collapsed the two consecutive +`from mxgateway.values import to_mx_value` / `from mxgateway.values import MxValueInput` +lines in `clients/python/src/mxgateway_cli/commands.py` into a single +`from mxgateway.values import MxValueInput, to_mx_value` statement, matching +the alphabetical-within-module pattern used elsewhere in the file. Pure style +fix — no behavioural impact, covered by the existing CLI tests. + +### Client.Python-015 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `clients/python/src/mxgateway_cli/commands.py:273-294,564-647`, `clients/python/tests/` | +| Status | Resolved | + +**Description:** `_bench_read_bulk` is a ~80-line CLI body that opens its own +session, registers, subscribe_bulks, runs a warm-up loop, a measurement loop, +collects per-call latencies, computes a percentile summary, and emits the +shared cross-language JSON schema. It is the largest untested CLI command in +the module — `tests/` has no `bench_read_bulk` test, fake-stub-driven or +otherwise. A drift in the schema field names (`callsPerSecond`, +`cachedReadResults`, `latencyMs.p50`, …) would break the cross-language +`scripts/bench-read-bulk.ps1` aggregation silently. `_percentile_summary` and +`_percentile` are also untested — the boundary cases (`n == 0`, `n == 1`, +quantile interpolation) would benefit from a small unit test since the +identical algorithm is duplicated in the .NET / Go / Rust / Java drivers and +a divergence would corrupt cross-language comparisons. 
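
For reference, the interpolation rule that these boundary cases would pin down can be sketched as follows. The resolution of this finding states the rule as `rank = q * (n - 1)` with linear interpolation between adjacent ranks. The function name and the internal sort are assumptions made for illustration, not the CLI's actual `_percentile`.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile with rank = q * (n - 1).

    Hypothetical sketch: an empty sample yields 0.0 and a single
    element yields that element, matching the boundary cases named
    in the finding.
    """
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        return 0.0
    if n == 1:
        return float(ordered[0])
    rank = q * (n - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, n - 1)  # clamp so q == 1.0 stays in range
    fraction = rank - lower
    # Interpolate between the two neighbouring order statistics.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction
```

Sorting inside the function keeps the sketch safe for unsorted latency lists. A real benchmark driver might sort once and reuse the result across several quantiles.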
+ +**Recommendation:** Add a fake-stub-driven `bench_read_bulk` test that drives +a short `--duration-seconds 0 --warmup-seconds 0` run through `CliRunner` and +asserts the JSON schema (`language == "python"`, the full key set, +`latencyMs.p50/p95/p99/max/mean` present). Add unit tests for `_percentile` +covering `n == 0`, `n == 1`, and a known-good interpolated value at p95 so +the implementation cannot silently drift from the other clients. + +**Resolution:** 2026-05-20 — Added `clients/python/tests/test_cli_bench_and_helpers.py` +with three layers of coverage. (1) `_percentile` unit tests pin the +cross-language algorithm (`rank = q * (n - 1)`, linear interpolation between +adjacent ranks): empty sample returns `0.0`, single element returns that +element, exact-rank queries return the sample value (p50 of `[10,20,30,40,50]` +is `30.0`), and the interpolated p95/p99 values (`48.0` / `49.6` for that same +five-element sample) are locked down so any drift from the .NET / Go / Rust / +Java drivers fails fast. (2) `_percentile_summary` tests assert the full +`{p50, p95, p99, max, mean}` dict shape, the zero-sample placeholder, and the +3-decimal rounding contract. (3) A `bench-read-bulk` smoke test +(`test_bench_read_bulk_emits_cross_language_schema`) drives the CLI through +`CliRunner` with `--duration-seconds 0 --warmup-seconds 0` against a fake stub +that handles `OpenSession`, `Register`, `SubscribeBulk`, `ReadBulk`, and +`UnsubscribeBulk`, then asserts the emitted JSON has exactly the 16 +cross-language schema keys (`language`, `command`, `endpoint`, `clientName`, +`bulkSize`, `durationSeconds`, `warmupSeconds`, `durationMs`, `tags`, +`totalCalls`, `successfulCalls`, `failedCalls`, `totalReadResults`, +`cachedReadResults`, `callsPerSecond`, `latencyMs`) and that `latencyMs` is a +`{p50, p95, p99, max, mean}` sub-object — guarding against silent breakage of +`scripts/bench-read-bulk.ps1`'s cross-language aggregation. 
No source change — +this is a pure coverage finding. + +### Client.Python-016 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `clients/python/src/mxgateway_cli/commands.py:25,757-775,805-830` | +| Status | Resolved | + +**Description:** Three CLI helper paths are not covered by `tests/`: + +1. `_use_plaintext` localhost auto-downgrade (line 762) — the + `endpoint.startswith("localhost:") or endpoint.startswith("127.0.0.1:")` + branch (see also Client.Python-013) is untested; no test asserts that an + endpoint without `--plaintext` and without `--tls` resolves to plaintext. +2. `_collect_events` `MAX_AGGREGATE_EVENTS` guard (line 811-815) — passing + `--max-events` greater than `MAX_AGGREGATE_EVENTS` raises + `click.BadParameter`, but no test exercises the guard. A silent removal of + the constant or the comparison would not be caught. +3. `_api_key_from_env` (line 765-768) — only the implicit path through + `_secrets` is exercised; there is no test that verifies an env-var name + resolves to a value and that an unset env var produces `None`. + +These are all small, fake-stub-driven CLI behaviours rather than end-to-end +paths. The previous coverage finding (Client.Python-009) closed without +adding tests for these specific paths. + +**Recommendation:** Add three small `CliRunner` / unit tests: one asserting +the localhost auto-plaintext (or its replacement, if Client.Python-013 is +fixed), one asserting `--max-events 10001` exits non-zero with the +`MAX_AGGREGATE_EVENTS` error message, and one asserting +`_api_key_from_env("MXGATEWAY_API_KEY")` returns the env value and `None` for +an unset variable. 
+
+**Resolution:** 2026-05-20 — Scope adjusted: Client.Python-013 has since
+removed the `_use_plaintext` localhost auto-plaintext branch, so item (1) is
+no longer a real code path — the
+`test_use_plaintext_requires_explicit_flag_for_localhost_endpoint` and
+`test_cli_localhost_endpoint_defaults_to_tls_via_open_session` regressions
+added under Client.Python-013 already pin the new TLS-by-default contract.
+The remaining two helpers are now covered in
+`clients/python/tests/test_cli_bench_and_helpers.py`. (2)
+`MAX_AGGREGATE_EVENTS` cap:
+`test_collect_events_rejects_max_events_above_aggregate_cap` drives
+`stream-events` with `--max-events 10001` through `CliRunner` against
+stubbed `_connect` / `_session` fakes and asserts the CLI exits non-zero with
+the documented `less than or equal to 10000` message;
+`test_collect_events_accepts_max_events_at_aggregate_cap_boundary` confirms
+`--max-events 10000` is accepted at the boundary and returns an empty event
+list. (3) `_api_key_from_env`:
+`test_api_key_from_env_resolves_value_when_variable_is_set` (env-var
+populated → returned),
+`test_api_key_from_env_returns_none_when_variable_is_unset` (env-var unset
+→ `None`), `test_api_key_from_env_returns_none_when_name_is_none` (the
+`name is None` early-return), and
+`test_api_key_from_env_returns_none_when_name_is_empty_string` (the
+`if not name` truthiness guard). No source change — pure coverage finding.
+
+### Client.Python-017
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Documentation & comments |
+| Location | `clients/python/pyproject.toml:5-25`, `clients/python/src/mxgateway/` |
+| Status | Resolved |
+
+**Description:** The package metadata in `pyproject.toml` is minimal for a
+published wheel:
+
+* No `authors` field. PyPI / `pip show` will display no author.
+* No `license` field, no `license-files` field, and no `LICENSE` file is
+  referenced from the project. The repo as a whole has no top-level
+  `LICENSE` either, but other client packages (Java has a license entry, the
+  .NET package has a license expression in the `csproj`) tend to set this.
+* No `classifiers` (no `Programming Language :: Python :: 3.12`,
+  `Operating System :: Microsoft :: Windows`, `Topic :: …`, no
+  development-status classifier). Without these the PyPI search facets are
+  empty and tooling like `pip` cannot tell whether the package is
+  alpha/beta/stable.
+* No `keywords`, no `[project.urls]` (no homepage / source / issue link
+  pointing back to the repo).
+* The package ships no PEP 561 `py.typed` marker file in
+  `src/mxgateway/`. Type hints are written throughout the module
+  (`from __future__ import annotations`, full annotations on every public
+  function), but downstream consumers running `mypy` on `mxaccess-gateway-client`
+  will not see those hints — PEP 561 requires the marker file to opt the
+  package into type-stub distribution.
+
+**Recommendation:** Add `authors`, `license = ""`, `keywords`, and
+`[project.urls]` to `pyproject.toml`; add at least the standard `classifiers`
+trio (`Development Status`, `Programming Language :: Python :: 3.12`,
+`Intended Audience`); create an empty `src/mxgateway/py.typed` file and
+include it in the wheel via `[tool.setuptools.package-data]` so consumers
+running `mypy` against an installed wheel pick up the type information.
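One way to confirm that a built wheel really carries the PEP 561 marker is to list the archive contents, since a wheel is a zip file. The sketch below is illustrative only: `packages_missing_py_typed` and `build_demo_wheel` are hypothetical helper names, and the in-memory archive stands in for a real wheel so the example runs without a build step.

```python
import io
import zipfile


def packages_missing_py_typed(wheel_file, packages):
    """Return the packages whose `py.typed` marker is absent from the wheel."""
    with zipfile.ZipFile(wheel_file) as wheel:
        names = set(wheel.namelist())
    return [pkg for pkg in packages if f"{pkg}/py.typed" not in names]


def build_demo_wheel(with_marker):
    """Build a tiny in-memory archive shaped like a wheel (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as wheel:
        wheel.writestr("mxgateway/__init__.py", "")
        if with_marker:
            # PEP 561 markers are empty files inside the package directory.
            wheel.writestr("mxgateway/py.typed", "")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(packages_missing_py_typed(build_demo_wheel(True), ["mxgateway"]))
    print(packages_missing_py_typed(build_demo_wheel(False), ["mxgateway"]))
```

Against a real build, the first argument would be the path of the wheel produced by `python -m pip wheel .`, and the package list would name every importable package the distribution ships.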
+
+**Resolution:** 2026-05-20 — Filled out `clients/python/pyproject.toml`
+with the missing PyPI metadata: `authors = [{ name = "MXAccess Gateway
+Authors" }]`, `license = "Proprietary"` (the repo has no top-level
+`LICENSE` file and no other client publishes under an OSS licence, so the
+SPDX `Proprietary` expression matches the de-facto status), the standard
+classifier set (`Development Status :: 4 - Beta`, `Intended Audience ::
+Developers` / `Information Technology`, `Operating System :: Microsoft ::
+Windows` and `:: POSIX`, `Programming Language :: Python` /
+`Python :: 3` / `Python :: 3.12`, `Topic :: Software Development ::
+Libraries :: Python Modules`, `Topic :: System :: Distributed Computing`,
+and `Typing :: Typed`), a `keywords` list
+(`mxaccess`, `archestra`, `gateway`, `grpc`, `industrial`, `scada`), and
+`[project.urls]` with `Homepage` / `Source` / `Issues` pointing at the
+Gitea repo. Added the PEP 561 marker file
+`clients/python/src/mxgateway/py.typed` (empty, as the spec requires) and
+declared it in `[tool.setuptools.package-data] mxgateway = ["py.typed"]`
+so the wheel ships the marker and downstream `mypy` users see the
+inline type hints. Pure metadata / packaging change — `python -m pytest -q`
+still passes (91 tests).
+
+### Client.Python-018
+
+| Field | Value |
+|---|---|
+| Severity | High |
+| Category | Code organization & conventions |
+| Location | `clients/python/pyproject.toml:11` |
+| Status | Resolved |
+
+**Description:** The Client.Python-017 resolution set
+`license = "Proprietary"` as a top-level string. Under PEP 639 (enforced
+by `setuptools >= 77`, and active in the installed `setuptools 82.0.1`),
+the `project.license` string form must be a valid SPDX expression.
+`"Proprietary"` is not a registered SPDX identifier, so the configured
+build backend (`setuptools.build_meta`) refuses the file outright. Both
+`python -m pip wheel . --no-deps --wheel-dir …` and
+`python -m pip install -e .` — the exact commands documented in
+`clients/python/README.md` ("Build And Test", "Packaging") and the
+"build wheel" instruction in `docs/ClientPackaging.md` — now fail before
+any source is compiled with:
+
+```
+ValueError: invalid pyproject.toml config: `project.license`.
+configuration error: `project.license` must be valid exactly by one definition (0 matches found):
+    - {type: string, format: 'SPDX'}
+    - type: table keys: 'file': … required: ['file']
+    - type: table keys: 'text': … required: ['text']
+```
+
+`python -m pytest` still runs because `[tool.pytest.ini_options]
+pythonpath = ["src"]` lets pytest import the package without an install
+— which masked the regression at commit time and explains how the
+Client.Python-017 resolution comment was able to assert "`python -m
+pytest -q` still passes (91 tests)" while shipping a wheel build that
+cannot start. The Client.Python-017 resolution comment that "the SPDX
+`Proprietary` expression matches the de-facto status" is incorrect:
+`Proprietary` is *not* a registered SPDX identifier; only entries on the
+SPDX licence list (e.g. `MIT`, `Apache-2.0`, `BSD-3-Clause`) or
+`LicenseRef-*` custom identifiers satisfy the
+`{ type: string, format: 'SPDX' }` rule. PEP 639 added the
+`LicenseRef-…` escape hatch precisely for proprietary / unlisted
+licences.
+
+This is a regression of the developer-onboarding workflow introduced by
+the very commit being reviewed. A fresh checkout cannot run
+`python -m pip install -e ".[dev]"` (the command in `CLAUDE.md`'s
+"Clients" section) without first patching `pyproject.toml`.
+
+**Recommendation:** Fix the `license` value so the build backend
+accepts it. Three concrete options, in order of preference:
+
+1. Use a `LicenseRef-*` SPDX-compatible custom identifier:
+   `license = "LicenseRef-Proprietary"`. Requires no additional
+   `LICENSE` file and is honoured by setuptools / pip / PyPI as a
+   proprietary marker.
+2. Add a top-level `LICENSE` file (or `clients/python/LICENSE`) and
+   point at it via the table form:
+   `license = { file = "LICENSE" }`. This also documents the proprietary
+   terms.
+3. Drop the `license` key entirely and convey the same intent via the
+   classifier `"License :: Other/Proprietary License"` (already part of
+   the classifier set), reverting the PEP-639 string field that the
+   build backend now insists must be SPDX.
+
+Add a CI / pre-commit check that runs `python -m pip wheel . --no-deps`
+(or `python -m build`) on `clients/python` so a future
+`pyproject.toml` regression is caught at commit time rather than at
+first install on a clean machine. See also Client.Python-020.
+
+**Resolution:** 2026-05-20 — Dropped the invalid top-level
+`license = "Proprietary"` string from `clients/python/pyproject.toml`
+and added the existing `License :: Other/Proprietary License` trove
+classifier to convey the same intent without violating PEP 639's SPDX
+rule. No `LICENSE` file exists at the repo root or under
+`clients/python/`, so the `license = { file = "LICENSE" }` table form
+was not used; relying on the classifier is the option (3) variant
+called out in the recommendation. Verified by running
+`python -m pip wheel . --no-deps -w ./.test-wheel-output` from
+`clients/python`: the build now succeeds and emits
+`mxaccess_gateway_client-0.1.0-py3-none-any.whl` (47 KB) where
+previously it failed with the `project.license must be valid exactly
+by one definition` `ValueError`. The CI / pre-commit recommendation is
+addressed by Client.Python-020.
+
+### Client.Python-019
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Location | `clients/python/pyproject.toml:60-61`, `clients/python/src/mxgateway_cli/` |
+| Status | Resolved |
+
+**Description:** Client.Python-017 added the PEP 561 marker file
+`clients/python/src/mxgateway/py.typed` and declared it in
+`[tool.setuptools.package-data] mxgateway = ["py.typed"]`. The wheel
+therefore advertises `mxgateway` as typed. However the same wheel
+also ships the **`mxgateway_cli`** package (`setuptools.packages.find`
+with `where = ["src"]` discovers both `mxgateway` and `mxgateway_cli`,
+confirmed via `find_packages` in this review), and `mxgateway_cli`:
+
+* is shipped in the wheel and is the package the `mxgw-py` console
+  script entry point resolves into (`[project.scripts] mxgw-py =
+  "mxgateway_cli.commands:main"`),
+* is fully type-annotated (every function in `commands.py` has full
+  parameter and return annotations; `from __future__ import annotations`
+  is in effect),
+* but has no `py.typed` file and is not listed in
+  `[tool.setuptools.package-data]`.
+
+PEP 561 requires the marker file inside **each** importable package the
+distribution wants to expose to type checkers — the `mxgateway` marker
+does not transfer to `mxgateway_cli`. A downstream consumer that imports
+or composes against `mxgateway_cli` (e.g. wrapping it as a programmatic
+CLI library) will see all symbols as `Untyped` under `mypy` despite the
+hints being present in source.
+
+This is a follow-up to Client.Python-017 — the fix is small and pure
+packaging.
+
+**Recommendation:** Create
+`clients/python/src/mxgateway_cli/py.typed` (empty file, as PEP 561
+requires) and extend the existing package-data declaration so the
+wheel ships it:
+
+```toml
+[tool.setuptools.package-data]
+mxgateway = ["py.typed"]
+mxgateway_cli = ["py.typed"]
+```
+
+No source change in either package; verify by building a wheel
+(once Client.Python-018 is fixed) and inspecting that both
+`mxgateway/py.typed` and `mxgateway_cli/py.typed` appear in the wheel
+contents.
+
+**Resolution:** 2026-05-20 — Created the empty PEP 561 marker file
+`clients/python/src/mxgateway_cli/py.typed` and added
+`mxgateway_cli = ["py.typed"]` under
+`[tool.setuptools.package-data]` in `clients/python/pyproject.toml`
+alongside the existing `mxgateway = ["py.typed"]` line. Verified by
+inspecting the built wheel
+(`mxaccess_gateway_client-0.1.0-py3-none-any.whl`): the archive now
+contains both `mxgateway/py.typed` and `mxgateway_cli/py.typed`, so
+downstream `mypy` consumers see the inline type hints in both
+packages. Pure packaging change — no source modifications.
+
+### Client.Python-020
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Testing coverage |
+| Location | `clients/python/tests/`, `scripts/` |
+| Status | Resolved |
+
+**Description:** Client.Python-018 is invisible to the existing test
+suite: `python -m pytest` passes because `[tool.pytest.ini_options]
+pythonpath = ["src"]` lets pytest import the package without going
+through `setuptools.build_meta`. None of the 91 tests build the wheel,
+do an editable install, or otherwise exercise the
+`setuptools.build_meta` configuration validator. As a result, a
+`pyproject.toml` regression that breaks `pip install -e .` /
+`pip wheel .` — the exact commands documented in the Python client
+README and `CLAUDE.md` — passes the test suite green. The other
+language clients have parallel coverage gaps (no CI-level "the package
+installs" smoke test for Python in
+`scripts/run-client-e2e-tests.ps1`, which only runs the live e2e
+matrix and assumes the editable install already worked), but Python
+is the only one whose published install command is currently broken.
+
+**Recommendation:** Add a thin pytest module (e.g.
+`tests/test_packaging.py`) that runs
+
+```python
+import subprocess, sys, pathlib
+def test_pyproject_validates_against_setuptools_build_meta(tmp_path):
+    here = pathlib.Path(__file__).resolve().parent.parent
+    result = subprocess.run(
+        [sys.executable, "-m", "pip", "wheel", ".",
+         "--no-deps", "--no-build-isolation",
+         "--wheel-dir", str(tmp_path)],
+        cwd=here, capture_output=True, text=True,
+    )
+    assert result.returncode == 0, result.stderr
+```
+
+(or any equivalent that invokes
+`setuptools.config.pyprojecttoml.read_configuration`). Mark the test
+with `@pytest.mark.slow` if the wheel build is too heavy for the
+default suite, and document the test in the README. Alternatively
+add a CI step to `scripts/run-client-e2e-tests.ps1` (or a new
+`scripts/check-python-package.ps1`) that fails the build when the
+wheel build fails. Either approach would have surfaced
+Client.Python-018 at commit time.
+
+**Resolution:** 2026-05-20 — Added
+`clients/python/tests/test_packaging.py::test_pip_wheel_build_succeeds`.
+The test invokes `python -m pip wheel . --no-deps --wheel-dir `
+against the package root via `subprocess` and asserts (a) exit code
+zero and (b) an `mxaccess_gateway_client-*.whl` file is produced in
+the temp directory, capturing stdout/stderr in the assertion message
+on failure so any future PEP 639 / SPDX violation or other
+`setuptools.build_meta` configuration error is reported with the
+build backend's own error text. Verified the test would have caught
+Client.Python-018: with the old `license = "Proprietary"` string in
+place the test fails with the `project.license must be valid exactly
+by one definition` `ValueError`. The pytest module is the simpler
+half of the recommendation; no PowerShell wrapper script was added
+since pytest already runs in the same `python -m pytest` invocation
+the README documents. Test suite is now 92 tests (was 91), all
+passing.
+
+### Client.Python-021
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Documentation & comments |
+| Location | `clients/python/src/mxgateway_cli/commands.py`, `clients/python/README.md:235-258` |
+| Status | Resolved |
+
+**Description:** Cross-client CLI parity check (one of the things the
+review prompt asks for): the `mxgw-py` CLI subcommand set has drifted
+from every other client CLI in the matrix.
+
+Subcommand inventory at this commit:
+
+| Subcommand | .NET (`mxgw`) | Go (`mxgw-go`) | Rust (`mxgw`) | Java (`mxgw-java`) | Python (`mxgw-py`) |
+|---|---|---|---|---|---|
+| `version` | yes | yes | yes | yes | yes |
+| `ping` | yes | (no) | yes | (no) | yes |
+| `open-session` / `close-session` | yes | yes | yes | yes | yes |
+| `register` / `add-item` / `advise` | yes | yes | yes | yes | yes |
+| `subscribe-bulk` / `unsubscribe-bulk` / `read-bulk` | yes | yes | yes | yes | yes |
+| `write-bulk` / `write2-bulk` / `write-secured-bulk` / `write-secured2-bulk` | yes | yes | yes | yes | yes |
+| `write` / `write2` | yes / (varies) | yes / (no) | yes / yes | yes / (no) | yes / yes |
+| `stream-events` | yes | yes | yes | yes | yes |
+| `smoke` | yes | yes | yes | yes | yes |
+| `bench-read-bulk` | yes | yes | yes | yes | yes |
+| `bench-stream-events` | **yes** | (no) | (no) | (no) | (no) |
+| `galaxy-test-connection` (or alias) | **yes** | **yes** | **yes** | **yes** | **(no)** |
+| `galaxy-last-deploy` / `galaxy-deploy-time` | **yes** | **yes** | **yes** | **yes** | **(no)** |
+| `galaxy-discover` | **yes** | **yes** | **yes** | **yes** | **(no)** |
+| `galaxy-watch` | **yes** | **yes** | **yes** | **yes** | **(no)** |
+
+Two parity gaps remain after Client.Python-013/017:
+
+1. The Python CLI ships **no Galaxy subcommands at all** even though
+   the `GalaxyRepositoryClient` library wrapper is fully implemented
+   and exercised by `tests/test_galaxy.py` /
+   `tests/test_galaxy_iter_hierarchy.py`. The README acknowledges the
+   `watch-deploy-events` gap inline ("The CLI does not currently
+   expose a streaming `watch-deploy-events` subcommand — use the
+   library API directly when subscribing to deploy events from
+   Python.") but does not call out that **the other three Galaxy
+   subcommands are also missing** — and the .NET / Go / Rust / Java
+   CLIs all expose them. A user running the cross-language smoke
+   matrix who expects Python to behave like the other clients sees a
+   silent "command not found" on `mxgw-py galaxy-test-connection`.
+2. The new `bench-stream-events` subcommand (added to the .NET CLI in
+   the previous commit `1cd51bb`) is .NET-only today; the Python CLI
+   is consistent with Go / Rust / Java on this point. Worth flagging
+   as a forward-looking parity gap that will need filling if the
+   cross-language benchmark matrix grows a stream-events driver in
+   `scripts/`.
+
+Severity is Low because the existing `scripts/bench-read-bulk.ps1`
+matrix only invokes `bench-read-bulk` and does not break, and the
+Python `GalaxyRepositoryClient` library is fully functional — the gap
+is purely in the test CLI surface. But cross-client parity is an
+explicit review check and the gap is not documented.
+
+**Recommendation:** Either (a) add `galaxy-test-connection`,
+`galaxy-last-deploy`, `galaxy-discover`, and `galaxy-watch`
+subcommands to `mxgateway_cli/commands.py` (each is a thin wrapper
+over `GalaxyRepositoryClient`, mirroring the existing four-language
+implementation), or (b) update `clients/python/README.md`'s "CLI"
+section with an explicit "CLI parity gaps" subsection that lists the
+missing subcommands and recommends the library API. Option (a) is
+preferable for cross-language matrix testing. Also document the
+`bench-stream-events` gap symmetrically once a cross-language stream
+benchmark driver is added under `scripts/`.
+
+**Resolution:** 2026-05-20 — Scoped this finding to a
+documentation-only fix; the full Galaxy CLI parity implementation
+(four new subcommands wired to `GalaxyRepositoryClient`) is a larger
+piece of work and will be tracked as a separate follow-up finding.
+Added a new "CLI Parity Gaps" subsection to
+`clients/python/README.md` immediately under the existing CLI
+section that explicitly enumerates the four missing
+`mxgw-py` Galaxy subcommands (`galaxy-test-connection`,
+`galaxy-last-deploy`, `galaxy-discover`, `galaxy-watch`), names the
+sibling CLIs that already expose them (.NET `mxgw`, Go `mxgw-go`,
+Rust `mxgw`, Java `mxgw-java`), points readers at the library API
+(`GalaxyRepositoryClient`, already documented under "Galaxy
+Repository Browse") as the supported Python entry point in the
+interim, and also flags the .NET-only `bench-stream-events` gap so
+the cross-language benchmark matrix has a record of the asymmetry.
+No CLI source change; the implementation of the four Galaxy
+subcommands is deferred. Resolved as a doc note rather than a full
+parity fix.
diff --git a/code-reviews/Client.Rust/findings.md b/code-reviews/Client.Rust/findings.md
new file mode 100644
index 0000000..e5d5bf3
--- /dev/null
+++ b/code-reviews/Client.Rust/findings.md
@@ -0,0 +1,429 @@
+# Code Review — Client.Rust
+
+| Field | Value |
+|---|---|
+| Module | `clients/rust` |
+| Reviewer | Claude Code |
+| Review date | 2026-05-24 |
+| Commit reviewed | `d692232` |
+| Status | Reviewed |
+| Open findings | 1 |
+
+## Checklist coverage
+
+This re-review (`a020350`) covers the resolution work for Client.Rust-013 through 017 (scoped `doc_lazy_continuation` allow on generated submodules, `pub` `next_correlation_id` shared with the CLI, success/failure split in `bench-read-bulk`, eight new tests, design-doc resync). The pass spot-checked the items called out in the request: stability of the newly-`pub` `next_correlation_id`, the `bench-read-bulk` JSON shape vs the PowerShell driver, presence of `unsafe`, and the scope of `#![allow(clippy::doc_lazy_continuation)]`. `cargo clippy --workspace --all-targets -- -D warnings` and `cargo test --workspace` both pass on this commit.
+
+| # | Category | Result |
+|---|---|---|
+| 1 | Correctness & logic bugs | No issues found — the five new `MalformedReply` paths and the `read_bulk` mismatched-payload branch each have a dedicated test; `BenchReadBulkStats` correctly partitions success vs failure latency. |
+| 2 | mxaccessgw conventions | No issues found — `cargo clippy --workspace --all-targets -- -D warnings` and `cargo test --workspace` both pass on this commit; the `#![allow(clippy::doc_lazy_continuation)]` allow is scoped narrowly to each generated v1 inner module so hand-written code is unaffected; CLI `Ping`/`CloseSession` now call `session::next_correlation_id`. |
+| 3 | Concurrency & thread safety | No issues found — `CORRELATION_SEQUENCE` is `AtomicU64` with `Relaxed`, correct for monotonic id generation; no `unsafe` anywhere in `src/` or `crates/`. |
+| 4 | Error handling & resilience | Issue found: the `bench-read-bulk` fix for Client.Rust-015 has fixed Rust's own histogram honestly but the change makes Rust's `latencyMs` semantically incompatible with the four other clients' `latencyMs` field that the cross-language PowerShell driver collates side-by-side (Client.Rust-018). |
+| 5 | Security | No issues found — API keys still redacted in `Debug`/`Display`, status messages scrubbed, `first_failure` records `Error::Display` (which already redacts `mxgw_*` tokens) so secure-write values cannot leak into the bench JSON. |
+| 6 | Performance & resource management | No issues found in the reviewed delta. |
+| 7 | Design-document adherence | Issue found: `RustClientDesign.md` Session signatures for the four bulk-write helpers and `read_bulk` do not match the actual implementation — the design lists trailing `user_id` / `timestamp` / `current_user_id` / `verifier_user_id` parameters and a `Vec` return that the impl does not have (all of those move per-entry into `WriteBulkEntry` etc.) (Client.Rust-019). |
+| 8 | Code organization & conventions | No new issues — `BenchReadBulkStats` is cleanly factored out and tested. |
+| 9 | Testing coverage | No new issues — the malformed-reply paths and unary `Error::Unavailable` are now covered, and the four bulk-write families each have round-trip smoke. |
+| 10 | Documentation & comments | Issue found: `next_correlation_id` is now `pub` and its doc comment commits the SDK to the literal `"rust-client-{label}-{N}"` string format, but neither the doc nor `lib.rs` re-exports it or declares any stability stance, leaving the public surface ambiguous (Client.Rust-020). |
+
+### 2026-05-24 review (commit d692232)
+
+Re-review pass at `d692232`. Diff against `a020350` is commit `397d3c5`:
+top-level crate renamed `mxgateway-client` → `zb-mom-ww-mxgateway-client`,
+`build.rs` proto path updated, every `use mxgateway_client::` sweep-replaced
+to `use zb_mom_ww_mxgateway_client::`, `tests/client_behavior.rs` updated
+for the retired `session_id` and a new `stream_alarms` impl on the fake
+gateway, and the protocol-version assertion bumped to `3`. Workspace
+member layout unchanged. `cargo test --workspace` and `cargo clippy
+--workspace --all-targets -- -D warnings` both clean at HEAD.
+
+| # | Category | Result |
+|---|---|---|
+| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. |
+| 2 | mxaccessgw conventions | No issues found — wire identifiers (CLI binary `mxgw`, `MXGATEWAY_*` env vars, `mxaccess_gateway.v1` proto packages) unchanged per commit message. |
+| 3 | Concurrency & thread safety | No issues found in this diff. |
+| 4 | Error handling & resilience | No issues found in this diff. |
+| 5 | Security | No issues found in this diff. |
+| 6 | Performance & resource management | No issues found in this diff. |
+| 7 | Design-document adherence | Issues found: Client.Rust-021 (`RustClientDesign.md` "Crate layout" section still shows an aspirational nested `crates/zb-mom-ww-mxgateway-client/` directory that does not exist; actual layout is the flat top-level crate at `clients/rust/`). |
+| 8 | Code organization & conventions | No issues found in this diff. |
+| 9 | Testing coverage | No issues found in this diff — the fake gateway's `stream_alarms` impl is a minimal stub, but no production behaviour relies on it being exercised by a test. |
+| 10 | Documentation & comments | Issues found: Client.Rust-021 (design-doc drift). |
+
+## Findings
+
+### Client.Rust-001
+
+| Field | Value |
+|---|---|
+| Severity | High |
+| Category | mxaccessgw conventions |
+| Location | `clients/rust/src/options.rs:98,143` |
+| Status | Resolved |
+
+**Description:** `with_max_grpc_message_bytes` and `max_grpc_message_bytes` have no `///` doc comments. The crate sets `#![warn(missing_docs)]` and CLAUDE.md mandates that `cargo clippy --workspace --all-targets -- -D warnings` pass. Under `-D warnings` these become hard errors, so clippy fails to compile the crate — breaking the documented build/test workflow for the module.
+
+**Recommendation:** Add doc comments to both methods, e.g. `/// Maximum encoded/decoded gRPC message size in bytes (default 16 MiB).`
+
+**Resolution:** Resolved in `0d8a28d` (2026-05-18): doc comments added to both methods.
+
+### Client.Rust-002
+
+| Field | Value |
+|---|---|
+| Severity | High |
+| Category | mxaccessgw conventions |
+| Location | `clients/rust/src/session.rs:522` |
+| Status | Resolved |
+
+**Description:** The `BulkReplyKind` enum's variants (`AddItemBulk`, `AdviseItemBulk`, `RemoveItemBulk`, `UnAdviseItemBulk`, `SubscribeBulk`, `UnsubscribeBulk`) all share the `Bulk` suffix, tripping `clippy::enum_variant_names`. Under `-D warnings` this is a compile error, so `cargo clippy --workspace --all-targets -- -D warnings` fails — a violation of the CLAUDE.md requirement that clippy pass cleanly.
+
+**Recommendation:** Rename the variants to drop the common suffix (e.g. `AddItem`, `AdviseItem`, …) or add a narrowly-scoped `#[allow(clippy::enum_variant_names)]` with a reason comment.
+
+**Resolution:** Resolved in `0d8a28d` (2026-05-18): variants renamed to `AddItem`/`AdviseItem`/`RemoveItem`/`UnAdviseItem`/`Subscribe`/`Unsubscribe`, which no longer share a common suffix.
+
+### Client.Rust-003
+
+| Field | Value |
+|---|---|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Location | `clients/rust/crates/mxgw-cli/src/main.rs:1051` |
+| Status | Resolved |
+
+**Description:** The unit test `version_json_output_has_protocol_versions` asserts `value["gatewayProtocolVersion"] == 2`, but `GATEWAY_PROTOCOL_VERSION` is `3` (version.rs:10), matching the authoritative server constant `GatewayContractInfo.GatewayProtocolVersion = 3`. The test fails, so `cargo test --workspace` (the documented test step) does not pass — the test was not updated when the protocol version was bumped.
+
+**Recommendation:** Update the assertion to `3`, or better, assert against `GATEWAY_PROTOCOL_VERSION` so it cannot drift again.
+
+**Resolution:** Resolved in `0d8a28d` (2026-05-18): the test now asserts against the `GATEWAY_PROTOCOL_VERSION` / `WORKER_PROTOCOL_VERSION` constants, so it cannot drift again.
+
+### Client.Rust-004
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Documentation & comments |
+| Location | `clients/rust/src/version.rs:7` |
+| Status | Resolved |
+
+**Description:** `CLIENT_VERSION` is `"0.1.0-dev"` and its doc comment claims "Mirrors `Cargo.toml`", but `Cargo.toml` declares `version = "0.1.0"` (no `-dev` suffix). The comment is misleading and the value is not actually kept in sync with the manifest.
+
+**Recommendation:** Either set `CLIENT_VERSION` from the build via `env!("CARGO_PKG_VERSION")`, or correct the constant to `"0.1.0"` and drop the "Mirrors Cargo.toml" claim.
+
+**Resolution:** Resolved in `0d8a28d` (2026-05-18): `CLIENT_VERSION` is now `env!("CARGO_PKG_VERSION")`, taken from `Cargo.toml` at compile time so the two cannot drift.
+
+### Client.Rust-005
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Location | `clients/rust/src/session.rs:489-520` |
+| Status | Resolved |
+
+**Description:** `register_server_handle`, `add_item_handle`, and `add_item2_handle` fall through to `reply.return_value … .unwrap_or_default()`, returning `0` when the reply carries neither the expected typed payload nor an `Int32` `return_value`. Because `Session::invoke` has already confirmed `protocol_status == Ok`, a malformed-but-OK reply silently yields handle `0`, which the caller then uses as a real handle against the worker.
+
+**Recommendation:** Return `Err(Error::ProtocolStatus { … })` (or a dedicated `Error::MalformedReply`) when an OK reply lacks an extractable handle, instead of defaulting to `0`.
+
+**Resolution:** Resolved in `0d8a28d` (2026-05-18): the three handle extractors now return `Result` and yield the new `Error::MalformedReply` when an OK reply carries no usable handle.
+
+### Client.Rust-006
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Location | `clients/rust/src/session.rs:531-555` |
+| Status | Resolved |
+
+**Description:** `bulk_results` returns `Vec::new()` for any `(payload, kind)` combination that does not match the expected arm — including an OK reply carrying the wrong or no payload. A caller of `subscribe_bulk`/`add_item_bulk` then sees an empty result vector and cannot distinguish "zero items processed" from "gateway returned a shapeless reply".
+
+**Recommendation:** Treat a missing/mismatched bulk payload on an OK reply as an error rather than an empty vector, or document the empty-vec fallback explicitly and log it.
+
+**Resolution:** Resolved in `0d8a28d` (2026-05-18): `bulk_results` now returns `Result, Error>` and yields `Error::MalformedReply` on a mismatched or absent bulk payload.
+
+### Client.Rust-007
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Design-document adherence |
+| Location | `clients/rust/RustClientDesign.md:14-55` |
+| Status | Resolved |
+
+**Description:** `RustClientDesign.md` is stale relative to the implemented code. It documents a nested `crates/mxgateway-client/` layout (the real crate root is `clients/rust/` with a flat `src/`), and lists `tracing` among "Expected dependencies", but `tracing` appears in no `Cargo.toml`. CLAUDE.md requires docs to change with the source.
+
+**Recommendation:** Update `RustClientDesign.md` to the actual flat layout and remove `tracing` from the dependency list (or add `tracing` if structured logging is genuinely intended).
+
+**Resolution:** Resolved in `0d8a28d` (2026-05-18): the "Crate Layout" section now shows the actual flat layout (`mxgateway-client` as the workspace-root crate, `mxgw-cli` as a member) and the unused `tracing` entry was removed from the dependency list.
+
+### Client.Rust-008
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Performance & resource management |
+| Location | `clients/rust/src/value.rs:161-261` |
+| Status | Resolved |
+
+**Description:** `MxValueProjection::from_proto` and `MxArrayProjection::from_proto` deep-clone every element out of the wire message while `MxValue`/`MxArrayValue` also retain the original `raw` message. Every `MxValue` therefore holds two copies of its payload, wasteful for large string arrays or raw blobs arriving on the event stream.
+
+**Recommendation:** Compute the projection lazily on demand, or have the projection borrow from `raw`, so array/raw payloads are not duplicated for every wrapped value.
+
+**Resolution:** Resolved in `0d8a28d` (2026-05-18): `MxValue` and `MxArrayValue` no longer cache a `projection` field — `projection()` computes the typed view on demand from `raw`. A value built only to be sent over the wire now holds a single copy of its payload and pays no projection cost.
+
+### Client.Rust-009
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Testing coverage |
+| Location | `clients/rust/tests/client_behavior.rs`, `clients/rust/src/galaxy.rs` |
+| Status | Resolved |
+
+**Description:** Several critical paths are untested: TLS channel setup (`with_plaintext(false)` / CA-file loading), mid-stream `tonic::Status` fault propagation through `EventStream`/`DeployEventStream` (tests only send `Ok` items), and the bulk-size cap (`ensure_bulk_size` rejecting >1000 items).
+
+**Recommendation:** Add tests that (a) feed an `Err(Status)` into the event/deploy streams and assert it surfaces as the mapped `Error`, (b) assert `add_item_bulk` with 1001 items returns `Error::InvalidArgument`, and (c) exercise the CA-file/`InvalidEndpoint` error path.
+
+**Resolution:** Resolved in `0d8a28d` (2026-05-18): added `add_item_bulk_rejects_input_above_the_thousand_item_cap`, `event_stream_surfaces_a_mid_stream_status_fault` (the fake gateway now optionally emits a mid-stream `Status::unavailable`), and `connect_with_unreadable_ca_file_reports_invalid_endpoint`.
+
+### Client.Rust-010
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Location | `clients/rust/src/client.rs:255-268`, `clients/rust/src/galaxy.rs:204-216` |
+| Status | Resolved |
+
+**Description:** The client applies only a per-call deadline via `Request::set_timeout` and has no retry, reconnect, or transient-vs-permanent classification. A transient `Unavailable` (e.g. a gateway restart) maps to the catch-all `Error::Status` and is indistinguishable from a permanent failure. This is an acceptable v1 stance but is undocumented.
+
+**Recommendation:** Either add a documented `Error::Unavailable` variant classifying `Code::Unavailable`/`Code::ResourceExhausted`, or explicitly document in the README that the client performs no retries and that transient failures arrive as `Error::Status`.
+
+**Resolution:** Resolved in `0d8a28d` (2026-05-18): added the `Error::Unavailable` variant; `From` maps `Code::Unavailable` and `Code::ResourceExhausted` to it, so callers can classify transient failures without unwrapping the raw status.
+
+### Client.Rust-011
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | mxaccessgw conventions |
+| Location | `clients/rust/src/session.rs:469` |
+| Status | Resolved |
+
+**Description:** `command_request` hard-codes `client_correlation_id` as `format!("rust-client-{}", kind.as_str_name())`. Every invocation of the same command kind on a session uses an identical correlation id, so the id cannot correlate a specific request/reply pair in gateway logs or among concurrent in-flight calls. MXAccess parity diagnostics rely on correlation ids being unique per call.
+
+**Recommendation:** Append a per-call unique suffix (monotonic counter or UUID) to the correlation id, or expose a way for the caller to supply one.
+
+**Resolution:** Resolved in `0d8a28d` (2026-05-18): correlation ids are built by `next_correlation_id`, which appends a process-wide atomic sequence number; `Session::close` uses it too.
+
+### Client.Rust-012
+
+| Field | Value |
+|---|---|
+| Severity | High |
+| Category | mxaccessgw conventions |
+| Location | `clients/rust/src/galaxy.rs:282` |
+| Status | Resolved |
+
+**Description:** Found while verifying the fix for Client.Rust-001/002: `cargo clippy --workspace --all-targets -- -D warnings` reported a third violation the original review missed.
The `get_last_deploy_time` test fake calls `.clone()` on a `MutexGuard>`, and `Option` is `Copy` (`clippy::clone_on_copy`). Under `-D warnings` this is a compile error, so clippy still did not pass after Client.Rust-001/002 alone. + +**Recommendation:** Dereference instead of cloning: `*self.state.last_deploy.lock().unwrap()`. + +**Resolution:** Resolved in `0d8a28d` (2026-05-18): replaced `.clone()` with a deref. `cargo clippy --workspace --all-targets -- -D warnings` now passes cleanly. + +### Client.Rust-013 + +| Field | Value | +|---|---| +| Severity | High | +| Category | mxaccessgw conventions | +| Location | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:414-424` (origin); `clients/rust/src/generated.rs:11-31` (suppression site) | +| Status | Resolved | + +**Description:** `cargo clippy --workspace --all-targets -- -D warnings` fails again on this commit, this time on a `clippy::doc_lazy_continuation` violation in generated code: + +``` +error: doc list item without indentation + --> .../mxaccess_gateway.v1.rs:526:5 + | +526 | /// `timeout_ms == 0` uses the gateway-configured default (1000 ms). + | ^ +``` + +The lint fires because the `ReadBulkCommand` proto comment (added with the bulk Read feature in commit `5e375f6`) writes a bulleted list and then a trailing paragraph without the required blank line. prost-build forwards the proto comment verbatim into Rust doc comments, and the Rust client compiles those generated modules with crate-default lints. The crate already opts out of `clippy::large_enum_variant` in `src/generated.rs` for exactly this kind of generator-style problem, but `doc_lazy_continuation` is not on the allow-list, so the lint reaches `-D warnings` and breaks the documented `cargo clippy --workspace --all-targets -- -D warnings` invocation that CLAUDE.md mandates pass. 
The Rust client review was previously closed as clippy-clean (Client.Rust-001/002/012); this is the third clippy-clean regression caused by generated code in this module and warrants a more durable fix. + +**Recommendation:** Add `#![allow(clippy::doc_lazy_continuation)]` to each generated submodule in `clients/rust/src/generated.rs` alongside `clippy::large_enum_variant`, so generated doc comments — which the client cannot edit — cannot break the `-D warnings` build. Independently, fix the upstream proto comment to insert a blank line before the trailing paragraph so the C# / Go / Python / Java generators do not carry the same flaky text. + +**Resolution:** 2026-05-20 — Added `#![allow(clippy::doc_lazy_continuation)]` to each generated submodule in `clients/rust/src/generated.rs` next to the existing `clippy::large_enum_variant` allow, and reformatted the `ReadBulkCommand` proto comment in `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto` to surround the bulleted list with blank lines so doc-comment generators in every language see a properly-terminated list. `cargo clippy --workspace --all-targets -- -D warnings` and `cargo test --workspace` now pass, and `dotnet build src/MxGateway.Contracts/MxGateway.Contracts.csproj` reports 0 warnings. + +### Client.Rust-014 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | mxaccessgw conventions | +| Location | `clients/rust/crates/mxgw-cli/src/main.rs:450,497` | +| Status | Resolved | + +**Description:** Client.Rust-011 made `Session` build unique correlation ids per call, but the `mxgw` CLI's `Ping` and `CloseSession` subcommands still hard-code `client_correlation_id: "rust-cli-ping".to_owned()` and `"rust-cli-close-session".to_owned()`. Both go through `client.invoke(…)` / `client.close_session_raw(…)` rather than the `Session` helpers, so the library's id generator does not run. 
The CLI is the cross-language e2e driver — when the same machine runs concurrent CLI smokes, every `ping`/`close-session` request collides on the same correlation id in gateway logs, defeating the diagnostic value the library fix unlocked. + +**Recommendation:** Either (a) expose `session::next_correlation_id` as a `pub(crate)` or library-level helper and have the CLI call it from `Ping`/`CloseSession`, or (b) replace these RPCs with the higher-level `Session` helpers (`Session::close`, and a thin `Session::ping` wrapper) so the CLI shares the library's correlation-id discipline by construction. + +**Resolution:** 2026-05-20 — Promoted `session::next_correlation_id` from a module-private helper to a `pub` library-level function (it already lived in the `pub mod session`) and updated the `mxgw` CLI's `Ping` and `CloseSession` subcommands to call `mxgateway_client::session::next_correlation_id("cli-ping")` / `next_correlation_id("cli-close-session")` instead of the hard-coded `"rust-cli-ping"` / `"rust-cli-close-session"` strings. Concurrent CLI smokes now produce a unique correlation id per call, driven by the same process-wide `CORRELATION_SEQUENCE` `AtomicU64` the library uses, so gateway logs can tell concurrent requests apart again. `cargo fmt`, `cargo build --workspace`, `cargo clippy --workspace --all-targets -- -D warnings`, and `cargo test --workspace` all pass. 
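
The correlation-id discipline described in this resolution can be sketched as follows. This is a minimal illustration, not the code in `clients/rust/src/session.rs`: the names `CORRELATION_SEQUENCE` and `next_correlation_id` come from the finding, while the starting value of the counter, the memory ordering, and the exact string layout are assumptions made for the example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide sequence shared by every caller in the process.
/// Starting at 1 is an assumption made for this sketch.
static CORRELATION_SEQUENCE: AtomicU64 = AtomicU64::new(1);

/// Returns an id that embeds `label` plus a process-unique sequence number.
/// The textual layout is illustrative; callers should treat the id as opaque.
pub fn next_correlation_id(label: &str) -> String {
    // `fetch_add` hands each caller a distinct value even under contention.
    // Relaxed ordering is enough here because only the uniqueness of the
    // counter value matters, not its ordering against other memory accesses.
    let sequence = CORRELATION_SEQUENCE.fetch_add(1, Ordering::Relaxed);
    format!("rust-client-{label}-{sequence}")
}
```

Two calls with the same label, for example `next_correlation_id("cli-ping")` twice, yield different strings, which is the property the gateway logs rely on. Uniqueness holds only within one process; separate CLI processes each start their own sequence.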
+ +### Client.Rust-015 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `clients/rust/crates/mxgw-cli/src/main.rs:1053-1070` | +| Status | Resolved | + +**Description:** The new cross-language benchmark `bench-read-bulk` pushes the elapsed time of every `read_bulk` call into `latencies_ms` regardless of whether the call returned `Ok` or `Err`: + +```rust +let outcome = session.read_bulk(server_handle, &tags, timeout_ms).await; +let elapsed_ms = call_start.elapsed().as_secs_f64() * 1000.0; +latencies_ms.push(elapsed_ms); +match outcome { + Ok(results) => { successful_calls += 1; … } + Err(_) => failed_calls += 1, +} +``` + +A failed `read_bulk` (transient `Unavailable`, deadline-exceeded mid-call, etc.) typically returns *later* than a successful one — it includes the full per-call timeout that the success path never waits for. The histogram therefore conflates "p99 cached-read latency" with "p99 of (cached-read + timed-out call)", and the JSON document the PowerShell driver collates publishes `latencyMs.p99` / `latencyMs.max` that no longer represent successful-call latency. Worse, the failure category is silently dropped (`Err(_) => failed_calls += 1`) so a benchmark run that fails on every call still emits a coherent-looking JSON without ever surfacing why. This is misleading for a benchmark whose JSON shape is the cross-language comparison contract. + +**Recommendation:** Only push elapsed time into `latencies_ms` on `Ok`, or split into two histograms (`successLatencyMs` and `failureLatencyMs`) and log the first failure's error string into the stats record so a partial-failure run is visible at the report layer. + +**Resolution:** 2026-05-20 — Extracted the per-iteration accounting in `bench-read-bulk` into a `BenchReadBulkStats` helper with explicit `record_success`/`record_failure` methods. 
Successful `read_bulk` calls now flow into `success_latencies_ms` (driving the cross-language `latencyMs.p99`/`max` JSON contract), failures flow into a separate `failure_latencies_ms` histogram surfaced as `failureLatencyMs`, and the first failure's redacted error string is stashed as `firstFailure` so a partial-failure run is visible at the report layer instead of producing a coherent-looking JSON that hides every error. Added a unit test (`bench_read_bulk_stats_keeps_failures_out_of_success_latency_histogram`) that records two fast successes plus a deliberately slow failure and asserts the success histogram never sees the failure latency, plus a smaller smoke test for the zero-duration calls-per-second path. + +### Client.Rust-016 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `clients/rust/tests/client_behavior.rs`, `clients/rust/src/session.rs:489-519,654-768` | +| Status | Resolved | + +**Description:** The fixes for Client.Rust-005 / 006 added five new `Error::MalformedReply` paths to `session.rs` (`register_server_handle`, `add_item_handle`, `add_item2_handle`, `bulk_results`, `bulk_write_results`) plus the inline branch in `read_bulk`. None of them are exercised by tests — every test in `client_behavior.rs` feeds the matching payload back to the client, so the malformed-reply branches are dead code from the test suite's perspective. The new bulk-write helpers (`write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk`) only have a single happy-path assertion via `write_bulk`, leaving the three other variants and every per-entry-failure shape untested. The bench-read-bulk flow has no test (the driver script is the only consumer). The `Error::Unavailable` variant from Client.Rust-010 is covered by `event_stream_surfaces_a_mid_stream_status_fault`, but the same variant on a unary `Code::Unavailable` is not. 
+ +**Recommendation:** Add three light tests against the existing `FakeGateway`: + +1. Have the fake reply to `AddItem` (or `Register` / `AddItem2`) with `protocol_status = Ok` and no payload, and assert the client surfaces `Error::MalformedReply`. +2. Have the fake reply to `WriteBulk` with `protocol_status = Ok` and the wrong payload arm (e.g. an `AddItemReply` body), and assert `Error::MalformedReply`. +3. Have the fake fail the unary `Invoke` with `Status::unavailable(...)` and assert `Error::Unavailable`. + +Optionally add Write2Bulk / WriteSecuredBulk / WriteSecured2Bulk smoke assertions so all four bulk-write families have at least one round-trip test. + +**Resolution:** 2026-05-20 — Added eight new integration tests in `clients/rust/tests/client_behavior.rs`. Each new `Error::MalformedReply` site is exercised via a test-only `InvokeOverride` injected into `FakeState` that lets a single test pin the fake gateway's `Invoke` handler to one of three malformed shapes (OK reply with no payload, OK reply with the wrong payload arm for `read_bulk`, OK reply with the wrong payload arm for the other bulk / bulk-write families): `register_returns_malformed_reply_when_ok_reply_has_no_payload`, `add_item_returns_malformed_reply_when_ok_reply_has_no_payload`, `add_item2_returns_malformed_reply_when_ok_reply_has_no_payload`, `subscribe_bulk_returns_malformed_reply_on_mismatched_payload_arm`, `write_bulk_returns_malformed_reply_on_mismatched_payload_arm`, and `read_bulk_returns_malformed_reply_on_mismatched_payload_arm`. The unary `Error::Unavailable` path is covered by `unary_invoke_maps_status_unavailable_to_error_unavailable` (the override returns `Status::unavailable(...)`). 
The remaining three bulk-write families gained round-trip smoke tests — `write2_bulk_round_trips_through_the_fake_gateway`, `write_secured_bulk_round_trips_through_the_fake_gateway`, `write_secured2_bulk_round_trips_through_the_fake_gateway` — extending the fake gateway's dispatcher with happy-path replies for `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk`. The `bench-read-bulk` flow gets a `BenchReadBulkStats` unit test in `crates/mxgw-cli/src/main.rs` (see Client.Rust-015) that asserts the latency-tracking change keeps failed-call durations out of `latencyMs`. + +### Client.Rust-017 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Design-document adherence | +| Location | `clients/rust/RustClientDesign.md:79-99,156-163` | +| Status | Resolved | + +**Description:** CLAUDE.md requires docs to change with the source. `RustClientDesign.md` was refreshed to fix the layout/`tracing` drift (Client.Rust-007), but the Session API surface in the design (`Library API` block, lines 79-99) still lists only the original six bulk helpers — `add_item_bulk`, `advise_item_bulk`, `remove_item_bulk`, `un_advise_item_bulk`, `subscribe_bulk`, `unsubscribe_bulk` — and is missing five newer helpers, namely the four bulk-write helpers plus `read_bulk` (`write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk`, `read_bulk`), that landed in commits `5e375f6` / `f220908` / `61644e6`. The `Error Handling` block (lines 130-146) still enumerates `Transport`, `Status`, `Authentication`, `Authorization`, `Session`, `Worker`, `Command`, `MxAccess`, `Timeout`, `Cancelled` — but not `MalformedReply`, `Unavailable`, or `InvalidEndpoint`, all of which are now public variants of the crate's `Error` enum. The `Test CLI` block (lines 158-163) lists `version` / `smoke` / `stream-events` / `write` but is missing every new subcommand (`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, `bench-read-bulk`, `galaxy watch`). 
+ +**Recommendation:** Bring the design doc back in sync: extend the `Session` API code block to enumerate the bulk-write/read methods, expand the `Error` enum to match `clients/rust/src/error.rs`, and add the missing CLI subcommands. The README is already up to date, so this is design-doc-only churn. + +**Resolution:** 2026-05-20 — Brought `clients/rust/RustClientDesign.md` back in sync with the implementation. The `Session` block now lists the five new bulk helpers (`write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk`, `read_bulk`) alongside the original six and notes that `session::next_correlation_id` is `pub` for raw-RPC consumers (the CLI). The `Error` enum block now matches `clients/rust/src/error.rs` — `InvalidEndpoint`, `InvalidArgument`, `Transport`, `Authentication`, `Authorization`, `Timeout`, `Cancelled`, `Unavailable`, `Status`, `Command`, `ProtocolStatus`, `MalformedReply` — with a short paragraph explaining what `Unavailable`, `MalformedReply`, and `InvalidEndpoint` classify. The `Test CLI` block enumerates every subcommand the binary exposes today: `version`, `ping`, `open-session`, `close-session`, `register`, `add-item`, `advise`, `subscribe-bulk`, `unsubscribe-bulk`, `read-bulk`, `write`, `write2`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, `stream-events`, `bench-read-bulk`, `smoke`, and the `galaxy {test-connection,last-deploy-time,discover-hierarchy,watch}` subtree. 
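
To illustrate what the `Unavailable` variant classifies (see Client.Rust-010), the sketch below separates transient statuses from the catch-all `Status`. It is a reduced model under stated assumptions: `Code` is a hypothetical stand-in for the gRPC status code type, the variant payloads are invented for the example, and `classify` / `is_transient` are illustrative helpers, not functions from `clients/rust/src/error.rs`.

```rust
/// Stand-in for the handful of gRPC status codes this sketch needs
/// (hypothetical; a real client would use its gRPC library's code type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Code {
    Unavailable,
    ResourceExhausted,
    PermissionDenied,
    Internal,
}

/// Reduced error enum with only the two variants relevant to classification.
/// The payload shapes are assumptions made for illustration.
#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    /// Transient failure, such as a gateway restart or a saturated server.
    Unavailable(String),
    /// Catch-all for every other status.
    Status(Code, String),
}

/// Maps a status code and message onto the classified error.
pub fn classify(code: Code, message: &str) -> Error {
    match code {
        // Both codes are routed to the transient variant, as the finding describes.
        Code::Unavailable | Code::ResourceExhausted => Error::Unavailable(message.to_owned()),
        other => Error::Status(other, message.to_owned()),
    }
}

/// Predicate a caller-side retry policy could branch on.
pub fn is_transient(error: &Error) -> bool {
    matches!(error, Error::Unavailable(_))
}
```

Because the client itself performs no retries, a predicate like `is_transient` would sit in caller code: the caller matches on the variant rather than unwrapping a raw status to inspect its code.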
+ +### Client.Rust-018 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `clients/rust/crates/mxgw-cli/src/main.rs:1098-1170`; `scripts/bench-read-bulk.ps1:347-365`; siblings: `clients/go/cmd/mxgw-go/main.go:600-648`, `clients/python/src/mxgateway_cli/commands.py:614-662`, `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:685-770`, `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:855-940` | +| Status | Resolved | + +**Description:** Client.Rust-015's resolution split Rust's bench histogram so `latencyMs` records only successful `read_bulk` calls and a new `failureLatencyMs` field holds failed-call durations. The local logic is right, the unit test (`bench_read_bulk_stats_keeps_failures_out_of_success_latency_histogram`) is right, and the JSON shape stays additively compatible with `scripts/bench-read-bulk.ps1` (the collator reads `$s.latencyMs.p50`/`p95`/`p99`/`max`/`mean` and these keys still exist on the Rust output). The problem is cross-language: the .NET, Go, Python, and Java bench implementations still push every call's elapsed time into a single `latenciesMs` / `latencies_ms` / `latencyMillis` array regardless of success or failure (e.g. `clients/go/cmd/mxgw-go/main.go:611` appends before the success/failure branch; `clients/python/src/mxgateway_cli/commands.py:624,626` appends in both `except` and the happy path; `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:701,705` adds in both `catch` and the OK path; `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:865,880` records in both branches). 
The PS driver's side-by-side comparison table (lines 348-360) pulls `latencyMs.p50/p95/p99/max/mean` from every client and prints them in one row, so a partial-failure run now shows Rust's p99 measured over successes only and the other four clients' p99 measured over (success + per-call timeout) — the numbers are silently no longer comparable. This re-introduces the original Client.Rust-015 problem at the cross-language layer that the fix was meant to remove. + +**Recommendation:** Make the contract uniform. Either (a) revert Rust's `latencyMs` to the all-calls histogram for backwards/cross-language compatibility and keep `failureLatencyMs` as an additive Rust-only enrichment, or (b) push the same success-only / failure-separated split into the .NET, Go, Python, and Java bench commands so every language emits the honest pair (`latencyMs` = success, `failureLatencyMs` = failure, plus `firstFailure`) and update the PS driver's table column to make the success-only semantics explicit (`p99 ok ms`). Option (b) is the better long-term posture but it is a cross-client change; option (a) restores comparability immediately. + +**Resolution:** 2026-05-20 — Took option (a) to restore cross-language comparability immediately. Reverted Rust's `latencyMs` to the all-calls histogram so it matches the .NET/Go/Python/Java bench shape that `scripts/bench-read-bulk.ps1` collates side-by-side: `BenchReadBulkStats::record_success` and `record_failure` now both push elapsed time into a single `latencies_ms` vector, and `record_failure` additionally pushes into `failure_latencies_ms` and stashes the first failure's redacted error string in `first_failure`. The JSON output keeps `failureLatencyMs` and `firstFailure` as Rust-only additive enrichment so a partial-failure run is still visible at the report layer without breaking the side-by-side table. 
Renamed the unit test to `bench_read_bulk_stats_tracks_all_calls_in_latency_and_failures_separately`; it now asserts `latencyMs.max == 1500.0` (the slow failure is included in the cross-language `latencyMs` contract) while `failureLatencyMs.max == 1500.0` and `firstFailure` still surface the failure separately for diagnostics. Pushing the success-only / failure-separated split into the other four clients (option (b)) is the better long-term posture but is deliberately out of scope here. + +### Client.Rust-019 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Design-document adherence | +| Location | `clients/rust/RustClientDesign.md:96-100` | +| Status | Resolved | + +**Description:** Client.Rust-017 was closed by adding the new bulk-write/read entries to the design doc, but the signatures shown in the code block do not match the implementation. The doc declares: + +```rust +pub async fn write_bulk(&self, server_handle: i32, entries: Vec, user_id: i32) -> Result, Error>; +pub async fn write2_bulk(&self, server_handle: i32, entries: Vec, timestamp: prost_types::Timestamp, user_id: i32) -> Result, Error>; +pub async fn write_secured_bulk(&self, server_handle: i32, entries: Vec, current_user_id: i32, verifier_user_id: i32) -> Result, Error>; +pub async fn write_secured2_bulk(&self, server_handle: i32, entries: Vec, timestamp: prost_types::Timestamp, current_user_id: i32, verifier_user_id: i32) -> Result, Error>; +pub async fn read_bulk(&self, server_handle: i32, tags: &[String], timeout_ms: u32) -> Result, Error>; +``` + +The actual implementations in `clients/rust/src/session.rs:385-526` take only `(server_handle, entries)` — `user_id` is per-entry on `WriteBulkEntry`/`Write2BulkEntry`, `timestamp_value` is per-entry on `Write2BulkEntry`/`WriteSecured2BulkEntry`, and `current_user_id`/`verifier_user_id` are per-entry on `WriteSecured{,2}BulkEntry`. 
The protobuf in `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:364-416` confirms this — there is no top-level `user_id` on these commands. The doc also returns `Vec` but the generated type is `BulkReadResult` (the gateway's `BulkReadReply` carries `repeated BulkReadResult`), and the actual signature is `read_bulk>(..., tag_addresses: &[S], ...) -> Vec` — generic over `AsRef` so callers can pass either `Vec` or `[&str]`. + +The drift is small but the design doc was the explicit subject of Client.Rust-017's resolution, so it warrants a follow-up. CLAUDE.md requires docs to change with the source. + +**Recommendation:** Replace the five signatures in `RustClientDesign.md:96-100` with the ones actually in `session.rs`: + +```rust +pub async fn write_bulk(&self, server_handle: i32, entries: Vec) -> Result, Error>; +pub async fn write2_bulk(&self, server_handle: i32, entries: Vec) -> Result, Error>; +pub async fn write_secured_bulk(&self, server_handle: i32, entries: Vec) -> Result, Error>; +pub async fn write_secured2_bulk(&self, server_handle: i32, entries: Vec) -> Result, Error>; +pub async fn read_bulk>(&self, server_handle: i32, tag_addresses: &[S], timeout_ms: u32) -> Result, Error>; +``` + +and add a one-line note that the per-entry fields (`user_id`, `timestamp_value`, `current_user_id`, `verifier_user_id`) live on the entry structs themselves. + +**Resolution:** 2026-05-20 — Replaced the five drifted signatures in `RustClientDesign.md` with the ones that actually live in `clients/rust/src/session.rs`: `write_bulk` / `write2_bulk` / `write_secured_bulk` / `write_secured2_bulk` take only `(server_handle, entries)`, and `read_bulk>` takes a borrowed `&[S]` and returns `Vec` (not `Vec`). 
Added a follow-up paragraph noting that the per-entry fields `user_id` / `timestamp_value` / `current_user_id` / `verifier_user_id` live on `WriteBulkEntry` / `Write2BulkEntry` / `WriteSecuredBulkEntry` / `WriteSecured2BulkEntry` themselves rather than as trailing positional arguments, matching the protobuf shapes in `mxaccess_gateway.proto`, and that `read_bulk` is generic over `AsRef` so callers can pass `&[String]` or `&[&str]` without cloning at the call site. + +### Client.Rust-020 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `clients/rust/src/session.rs:31-46`; `clients/rust/src/lib.rs:14-39` | +| Status | Resolved | + +**Description:** Client.Rust-014's resolution promoted `next_correlation_id` from a module-private helper to a `pub` function so the `mxgw` CLI's raw-RPC paths can share the library's correlation-id discipline. The doc comment commits the library to a literal string format — `"rust-client-{label}-{N}"` — that external code can now depend on. Two concerns: + +1. The function is not re-exported at the crate root in `lib.rs` (it only ships through the `pub mod session` namespace), so the in-tree caller writes the long `mxgateway_client::session::next_correlation_id("cli-ping")` path. Either re-export it via `#[doc(inline)] pub use session::next_correlation_id;` or leave it where it is and add a short note in the doc — but the current state straddles "public API" and "lib-internal helper" without saying which. + +2. The doc comment does not declare a stability stance (no `#[doc(hidden)]`, no "experimental" note, no `__priv` naming). As written it promises the literal format `"rust-client-{label}-{N}"` to any out-of-tree consumer; a future change that renames the prefix (for example to drop the `rust-` after a multi-client reformat) would be a behavioural break. 
The `RustClientDesign.md` resolution of Client.Rust-017 ("`session::next_correlation_id` is `pub`") reads similarly — it does not say whether the format is stable. + +The combination — `pub`, format-committing doc, no stability note, no crate-root re-export — leaves the public surface ambiguous. The same review category (Documentation & comments) is where Client.Rust-014's CLI-side fix is now visible, so this is the natural place to clean it up. + +**Recommendation:** Pick one of: + +- Treat `next_correlation_id` as part of the SDK's public API. Re-export it from `lib.rs` (`#[doc(inline)] pub use session::next_correlation_id;`), rewrite the doc comment to *not* promise the literal `"rust-client-{label}-{N}"` format (just the properties "monotonic, unique within a process, includes the supplied label"), and call that out in `RustClientDesign.md`. +- Treat it as internal-only. Mark it `#[doc(hidden)] pub` and add a `` // Internal helper exposed for the in-tree `mxgw` CLI; not part of the public SDK contract. `` comment so out-of-tree consumers do not build against a format that the SDK is free to change. + +The CLI integration in Client.Rust-014 works either way; this is solely about declaring intent so the SDK's public surface is unambiguous. + +**Resolution:** 2026-05-20 — Took the "treat as public SDK API" branch. Re-exported `next_correlation_id` at the crate root in `clients/rust/src/lib.rs` (`#[doc(inline)] pub use session::{next_correlation_id, Session};`) so in-tree and external callers can write the short `mxgateway_client::next_correlation_id(...)` path. Updated the in-tree `mxgw` CLI (`Ping` and `CloseSession` subcommands) to call through the crate-root re-export instead of `mxgateway_client::session::next_correlation_id`. 
Rewrote the doc comment to drop the format promise: the returned id is now documented as an opaque token with three guaranteed properties (embeds the supplied `label`, unique within a process via an internal monotonic atomic sequence, carries no embedded secret beyond `label`), and the doc explicitly states that the textual format `rust-client-{label}-{N}` is *not* part of the public contract and that callers must not parse it. Cross-referenced the crate-root re-export from the function-level doc. Updated `RustClientDesign.md` to call out that `next_correlation_id` is part of the public SDK surface, re-exported at the crate root, and that its textual format is intentionally not part of the contract. + +### Client.Rust-021 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Design-document adherence | +| Location | `clients/rust/RustClientDesign.md:14-33` | +| Status | Open | + +**Description:** The crate-name change in commit `397d3c5` (top-level `mxgateway-client` → `zb-mom-ww-mxgateway-client`) is reflected in `Cargo.toml`, `Cargo.lock`, every `use zb_mom_ww_mxgateway_client::` import, and `build.rs`. The "Recommended layout" block in `RustClientDesign.md:21-33` shows a nested structure with a top-level `crates/zb-mom-ww-mxgateway-client/` subdirectory containing `src/lib.rs`, `src/client.rs`, etc. — but the actual layout on disk is flat: the top-level crate lives at `clients/rust/` directly (with `src/`, `Cargo.toml`, and `build.rs` at the workspace root) and `crates/mxgw-cli/` is the only nested member. A reader consulting the design to understand the layout will be misled into looking for `crates/zb-mom-ww-mxgateway-client/` that does not exist. `CLAUDE.md` requires design docs to track code changes. + +**Recommendation:** Update `RustClientDesign.md:14-33` to describe the actual flat layout: workspace root at `clients/rust/`, top-level crate `zb-mom-ww-mxgateway-client` (declared in `Cargo.toml`) with `src/lib.rs`, `src/client.rs`, etc. 
directly under it; `crates/mxgw-cli/` is the single member subcrate. Alternatively label the block as "Aspirational nested layout (not currently adopted)" and add a separate "Current layout" section. + +**Resolution:** _(empty until closed)_ diff --git a/code-reviews/Contracts/findings.md b/code-reviews/Contracts/findings.md new file mode 100644 index 0000000..9536a6c --- /dev/null +++ b/code-reviews/Contracts/findings.md @@ -0,0 +1,308 @@ +# Code Review — Contracts + +| Field | Value | +|---|---| +| Module | `src/ZB.MOM.WW.MxGateway.Contracts` | +| Reviewer | Claude Code | +| Review date | 2026-05-24 | +| Commit reviewed | `d692232` | +| Status | Reviewed | +| Open findings | 2 | + +## Checklist coverage + +This re-review covers the Contracts module at `a020350`, after Contracts-009 through Contracts-013 (plus Client.Rust-013's proto comment reformat on `ReadBulkCommand`) were resolved against `1cd51bb`. The Contracts source under review is unchanged from `1cd51bb` apart from the documentation-only updates introduced by `a020350`; the pass re-checks every category on the bulk write/read family, the alarm reply surface, and the GalaxyRepository contract that were the target of those resolutions. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Bulk command kinds, `BulkWriteResult`, and `BulkReadResult` align with the worker executor (`MxAccessSession.ReadBulk` / `ExecuteBulkWriteEntry`), the gateway server-side filter (`MxAccessGatewayService.ReplaceWriteBulkEntries`), the validator (`MxAccessGrpcRequestValidator.ExpectedPayload`, covering every new kind), and the round-trip tests added under Contracts-010. Field numbering across all three protos remains additive and contiguous — `MxCommand.payload` 10-43 + 100-104, `MxCommandReply.payload` 20-40 + 100-102, `MxCommandKind` 0-34 + 100-104, `WorkerEnvelope.body` 10-20 — with no number reused or repurposed. No new functional bugs. 
| +| 2 | mxaccessgw conventions | Wire-compatibility policy comment blocks (Contracts-005 resolution) are present at the top of all three `.proto` files and the bulk additions honour them — every change since the prior review is additive. Generated code under `Generated/` is untouched. Naming, `snake_case` field names, `PascalCase` messages, enum-prefix discipline, oneof usage for command/reply/value/event/envelope, and the credential-sensitivity comments per the ProtobufStyleGuide are all consistent. No new violations. | +| 3 | Concurrency & thread safety | N/A — pure contract definitions plus a static constants class (`GatewayContractInfo`) with no shared mutable state. | +| 4 | Error handling & resilience | `BulkWriteResult` carries `was_successful` + `optional int32 hresult` + `repeated MxStatusProxy statuses` + `error_message` per entry; `BulkReadResult` carries `was_successful` + `was_cached` + per-entry `value`/`quality`/`source_timestamp`/statuses/`error_message`. The deliberate absence of `hresult` on `BulkReadResult` is pinned by `ProtobufContractRoundTripTests.BulkReadReply_RoundTripsCachedAndSnapshotResults` (descriptor assertion) and matches the documented "ReadBulk outcomes are timeout / cache / lifecycle states, not MXAccess COM return codes" rationale. The `AcknowledgeAlarmReply.status` reservation comment (Contracts-008) and the by-name ack reuse comment (Contracts-002) keep ack outcome handling unambiguous. No new issues. | +| 5 | Security | The single-item and bulk `WriteSecured` / `WriteSecured2` paths now carry the credential-sensitivity comment on both the outer command (`WriteSecuredBulkCommand` / `WriteSecured2BulkCommand`) and each entry's `value` field (Contracts-011 resolution). `AuthenticateUserCommand.verify_user_password` carries the same redaction note. No new secret-leak surfaces. 
| +| 6 | Performance & resource management | `ReadBulk` is still the only command without a 1:1 MXAccess analogue; the per-tag `timeout_ms` cap and `was_cached` short-circuit prevent disturbing existing subscriptions. `BulkWriteReply` / `BulkReadReply` are flat repeated lists with no nested pagination machinery, matching the "one round-trip per batch" Bulk Command Family decision. No bloat issues. | +| 7 | Design-document adherence | `gateway.md`, `docs/Contracts.md` (Contracts-009 resolution), `docs/DesignDecisions.md` (Bulk Command Family), and `docs/AlarmClientDiscovery.md` (Contracts-002 / Contracts-008 resolutions) describe the contracts now in force. The `MX_COMMAND_KIND_WRITE2_BULK` / `MX_COMMAND_KIND_WRITE_SECURED2_BULK` enum-value names use the `2_BULK` suffix order while the public reply oneof case names use `Write2Bulk` / `WriteSecured2Bulk` (the `2` precedes `Bulk` in PascalCase); both match the corresponding command-message names — no design-doc divergence. The proto comment on `BulkWriteResult` describes a "gateway's tag-allowlist filter" that does not exist by that name in source or docs — see new finding Contracts-014. | +| 8 | Code organization & conventions | Package / namespace / file layout correct; `csharp_namespace` options remain consistent; the worker proto continues to import `mxaccess_gateway.proto` rather than duplicate the command/reply/event/value/status surface. Additive-only contract evolution observed; field numbers continuous and isolated by 100+ from diagnostic/control commands. No new issues. | +| 9 | Testing coverage | `ProtobufContractRoundTripTests` now exercises all five new bulk write/read commands, both new reply types (with `HasHresult == true` / `HasHresult == false` arms for the proto3 optional, and a descriptor-level assertion that `BulkReadResult` has no `hresult` field), every new `MxCommandReply.payload` oneof case (parameterised `[Theory]`), and the existing alarm / Galaxy / worker-envelope cases. 
`GatewayContractInfoTests` pins the `GatewayProtocolVersion = 3` constant for both the alarm and bulk write/read additions. No new gaps observed at the contracts surface. | +| 10 | Documentation & comments | The bulk additions all carry per-message documentation comments (`WriteBulkCommand`, `Write2BulkCommand`, `WriteSecuredBulkCommand`, `WriteSecured2BulkCommand`, `ReadBulkCommand`) and per-field credential-sensitivity comments on `WriteSecured*BulkEntry.value`. `GalaxyAttribute.mx_data_type` / `data_type_name` / `mx_attribute_category` / `security_classification` carry the parity-detail comments added under Contracts-012. Two residual gaps remain — the misleading "tag-allowlist filter" wording on `BulkWriteResult` (new finding Contracts-014), and the absence of a comment on `BulkReadResult.value` / `quality` / `source_timestamp` / `statuses` describing what they carry when `was_successful = false` (new finding Contracts-015). | + +### 2026-05-24 review (commit d692232) + +Re-review pass at `d692232` scoped to the contract changes since +`a020350`: the `ZB.MOM.WW` rename (`dc9c0c9`) updated `csharp_namespace` +on every `.proto` and regenerated the `Generated/*.cs` artifacts; the +`397d3c5` commit added the missing public `QueryActiveAlarmsRequest` +message and `rpc QueryActiveAlarms(QueryActiveAlarmsRequest) returns +(stream ActiveAlarmSnapshot)`. Field numbers (`session_id=1`, +`client_correlation_id=2`, `alarm_filter_prefix=3`) match the legacy +Python and Go descriptors. No fields renumbered or repurposed. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. | +| 2 | mxaccessgw conventions | No issues found — `csharp_namespace` updated uniformly to `ZB.MOM.WW.MxGateway.Contracts.Proto`; protobuf `package` lines (`mxaccess_gateway.v1`, `mxaccess_worker.v1`, `galaxy_repository.v1`) are wire identifiers and intentionally unchanged. 
| +| 3 | Concurrency & thread safety | N/A — pure contract changes. | +| 4 | Error handling & resilience | No issues found. | +| 5 | Security | No issues found — no new credential-bearing fields. | +| 6 | Performance & resource management | No issues found. | +| 7 | Design-document adherence | No drift; the new RPC and message are documented in `gateway.md`, `docs/GatewayDashboardDesign.md`, and the proto file itself. | +| 8 | Code organization & conventions | Issues found: Contracts-016 (`QueryActiveAlarmsRequest.session_id` reserved-for-future-use ambiguity — is it required, optional, or ignored?). | +| 9 | Testing coverage | No issues found — `ProtobufContractRoundTripTests` and `GatewayContractInfoTests` continue to pin the protocol version; new `QueryActiveAlarmsRequest` lacks a round-trip test but the RPC type is generated and exercised end-to-end by the gRPC client tests in each language. | +| 10 | Documentation & comments | Issues found: Contracts-017 (the `rpc QueryActiveAlarms` comment block does not mention the `alarm_filter_prefix` request field). | + +## Findings + +### Contracts-001 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Design-document adherence | +| Location | `docs/Grpc.md:13` (and `:3`, `:32`, `:39`) | +| Status | Resolved | + +**Description:** `mxaccess_gateway.proto` now declares six RPCs on `MxAccessGateway` (`OpenSession`, `CloseSession`, `Invoke`, `StreamEvents`, `AcknowledgeAlarm`, `QueryActiveAlarms`). `docs/Grpc.md` still describes "the four `MxAccessGateway` RPCs" in its type table and omits `AcknowledgeAlarm`/`QueryActiveAlarms` from the Validation Rules table. CLAUDE.md requires docs to change in the same commit as the contract; the alarm RPC commits left this doc stale and misleading about the public surface. 
+ +**Recommendation:** Update `docs/Grpc.md` to enumerate all six RPCs and add `AcknowledgeAlarm`/`QueryActiveAlarms` to the type/handler and validation tables, or explicitly cross-reference `AlarmClientDiscovery.md`. + +**Resolution:** _(2026-05-18)_ Confirmed against `mxaccess_gateway.proto` — six RPCs declared, doc said "four". Updated `docs/Grpc.md`: the collaborator table now says "six `MxAccessGateway` RPCs", the RPC Handlers intro enumerates all six, added dedicated `AcknowledgeAlarm` and `QueryActiveAlarms` handler subsections (noting the alarm surface routes through `IAlarmRpcDispatcher` and is validated inline rather than via `MxAccessGrpcRequestValidator`, with a cross-reference to `AlarmClientDiscovery.md`), and added both alarm RPCs to the Validation Rules table. + +### Contracts-002 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:384-385`, `:95` | +| Status | Resolved | + +**Description:** `MxCommandKind` includes `MX_COMMAND_KIND_ACKNOWLEDGE_ALARM_BY_NAME = 29` and `MxCommand.payload` carries `AcknowledgeAlarmByNameCommand acknowledge_alarm_by_name_command = 38`, but `MxCommandReply.payload` has only `acknowledge_alarm = 34` and `query_active_alarms = 35` — there is no by-name reply case. The by-name ack must reuse `AcknowledgeAlarmReplyPayload` or rely on the top-level `hresult`. The command/reply payload asymmetry is undocumented and easy to dispatch incorrectly. + +**Recommendation:** Either add an explicit comment to `MxCommandReply` stating that by-name ack reuses the `acknowledge_alarm` payload case, or add a dedicated payload case for symmetry, and document the chosen contract in `docs/Contracts.md` / `AlarmClientDiscovery.md`. + +**Resolution:** _(2026-05-18)_ Verified against both the `.proto` and the dispatch code. 
The asymmetry is intentional and the code is correct: the worker's `MxAccessCommandExecutor.ExecuteAcknowledgeAlarmByName` builds `reply.AcknowledgeAlarm = new AcknowledgeAlarmReplyPayload { NativeStatus = rc }` — deliberately reusing the `acknowledge_alarm` payload case — and the gateway's `WorkerAlarmRpcDispatcher.AcknowledgeAsync` only reads the top-level `hresult`/`protocol_status`, so both ack arms work. The gap was documentation only. Took the finding's preferred option (a) — comment-only, no wire-format or generated-type change: added explicit comments to the `acknowledge_alarm` reply-payload case and to the `AcknowledgeAlarmReplyPayload` message in `mxaccess_gateway.proto` stating both ack kinds reuse this case and consumers must dispatch on `MxCommandReply.kind`, and documented the contract in `docs/AlarmClientDiscovery.md` section 4. Added regression test `ProtobufContractRoundTripTests.MxCommandReply_AcknowledgeAlarmByName_ReusesAcknowledgeAlarmPayloadCase` pinning the by-name-ack → `acknowledge_alarm` reuse and asserting no by-name-specific reply oneof case exists. + +### Contracts-003 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Contracts/MxGateway.Contracts.csproj:10` | +| Status | Won't Fix | + +**Description:** The `<Protobuf>` item for `mxaccess_worker.proto` omits `ProtoRoot="Protos"`, while the items for `mxaccess_gateway.proto` (line 9) and `galaxy_repository.proto` (line 11) both set it. `mxaccess_worker.proto` does `import "mxaccess_gateway.proto"`, which resolves only because Grpc.Tools adds the importing file's own directory to the proto path. The inconsistency is fragile — tooling changes to ProtoRoot handling could break import resolution. + +**Recommendation:** Add `ProtoRoot="Protos"` to the `mxaccess_worker.proto` `<Protobuf>` item so all three entries are consistent. + +**Resolution:** _(2026-05-18)_ Re-triaged as not-a-defect: the finding's premise is factually wrong.
Line 10 of `MxGateway.Contracts.csproj` already carries `ProtoRoot="Protos"` — all three `<Protobuf>` items are already consistent. `git show 6c64030:src/MxGateway.Contracts/MxGateway.Contracts.csproj` (the reviewed commit) confirms the attribute was present at review time too; the csproj has not been touched since `133c830`. No code change made. Status set to Won't Fix because there is nothing to fix. + +### Contracts-004 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Contracts/GatewayContractInfo.cs:3-6` | +| Status | Resolved | + +**Description:** The XML summary says the class exposes version metadata "before generated protobuf contracts are introduced." Generated protobuf contracts have long been introduced and are consumed across the solution. The comment is stale; the class now holds the authoritative `GatewayProtocolVersion`/`WorkerProtocolVersion` advertised in `OpenSessionReply` and used to validate `WorkerEnvelope` framing. + +**Recommendation:** Reword the summary to describe the current purpose — version constants advertised in `OpenSessionReply` and used to validate `WorkerEnvelope` protocol framing. + +**Resolution:** _(2026-05-18)_ Confirmed stale — the class is consumed by `GatewayApplication`/`OpenSessionReply` and `WorkerEnvelope` framing checks across the solution. Reworded the XML summary on `GatewayContractInfo` to describe the actual current purpose: `GatewayProtocolVersion` is advertised to clients in `OpenSessionReply`, and `WorkerProtocolVersion` validates `WorkerEnvelope` protocol framing on the gateway↔worker pipe. + +### Contracts-005 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | mxaccessgw conventions | +| Location | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto`, `src/MxGateway.Contracts/Protos/mxaccess_worker.proto` | +| Status | Resolved | + +**Description:** The ProtobufStyleGuide mandates reserving removed field numbers / enum values.
Evolution to date has been purely additive, so this is not a current violation — but none of the `.proto` files contain any `reserved` declarations, leaving no in-file guardrail for the first removal. This is a latent maintainability gap. + +**Recommendation:** When any field or enum value is eventually removed, add a `reserved` range/name in the same change. Consider a short comment block in each message documenting the policy so future editors apply `reserved` rather than reusing tags. + +**Resolution:** _(2026-05-18)_ Confirmed: no field or enum value has ever been removed, so adding `reserved` ranges now would be incorrect (there are no retired tags to reserve, and inventing ranges for never-used numbers would itself violate the contract). Took the finding's least-invasive option — added a short wire-compatibility policy comment block at the top of all three `.proto` files (`mxaccess_gateway.proto`, `mxaccess_worker.proto`, `galaxy_repository.proto`) stating the additive-only rule and instructing future editors to add a `reserved` range + name in the same change as any removal. Comment-only, no wire-format or generated-type change. The `reserved` declarations themselves remain correctly deferred to the first actual removal. + +### Contracts-006 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:647` | +| Status | Resolved | + +**Description:** `MxStatusProxy.success` is declared `int32 success = 1` with no comment. The name reads like a boolean flag but the type is a 32-bit integer (mirroring MXAccess `MXSTATUS_PROXY`, which stores a numeric success/HResult-like value). Without a comment a client author can reasonably misinterpret the field (treat non-1 as failure, or expect only 0/1). 
+ +**Recommendation:** Add a comment clarifying the semantic — what range of values it carries and how 0 vs non-zero map to MXAccess status — per the style guide rule to comment fields carrying raw MXAccess status detail. + +**Resolution:** _(2026-05-18)_ Confirmed: `int32 success = 1` had no comment. Cross-checked against the worker `MxStatusProxyConverter`, which reads the COM struct's `success` field verbatim (a 16-bit signed value) without reinterpretation, and against the MXAccess analysis (`MXAccess-Public-API.md`: `MxStatus`/`MXSTATUS_PROXY` are identical structs with a `short success` member). Added a field comment to `MxStatusProxy.success` stating it mirrors the COM struct's numeric `success` member (NOT a boolean), is carried verbatim for diagnostics, and that clients should branch on `category` (`MX_STATUS_CATEGORY_OK` marks success) — deliberately avoiding an over-specified 0-vs-1 claim, since the gateway never maps `success` to an outcome and `category` is the authoritative field. Comment-only change. + +### Contracts-007 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/MxGateway.Tests/Contracts/ProtobufContractRoundTripTests.cs` | +| Status | Resolved | + +**Description:** `ProtobufContractRoundTripTests` covers gateway command/reply/event, alarm transition, alarm ack request/reply, active-alarm snapshot, and the worker envelope. It has no coverage for: (a) any `galaxy_repository.proto` message (`DiscoverHierarchy*`, `GalaxyObject`, `GalaxyAttribute`, `DeployEvent`, the `root` oneof, wrapper-typed fields); (b) `BulkSubscribeReply`/`SubscribeResult` and the bulk command kinds; (c) `MxValue`/`MxArray` `raw_value`/`RawArray` (`bytes`) paths and the `WorkerFault`/`WorkerHeartbeat` IPC bodies. + +**Recommendation:** Add round-trip tests for the Galaxy Repository messages (including the `root` oneof and proto wrapper fields), the bulk-subscribe reply, and the remaining `WorkerEnvelope` body cases. 
+ +**Resolution:** _(2026-05-18)_ Confirmed the listed gaps and added round-trip tests to `ProtobufContractRoundTripTests` covering all three areas: (a) Galaxy Repository — `GalaxyRepositoryDescriptor_ContainsBrowseServiceMethods`, `DiscoverHierarchyRequest_RoundTripsRootOneofAndWrapperFields` (a `[Theory]` exercising all three `root` oneof arms plus the `Int32Value` wrapper `max_depth`), `DiscoverHierarchyReply_RoundTripsObjectAndAttributeGraph`, `DeployEvent_RoundTripsTimestampAndCounters`, `GalaxyConnectionReplies_RoundTrip`; (b) `BulkSubscribeReply_RoundTripsSubscribeResults` and `MxCommandReply_RoundTripsBulkSubscribePayload` (bulk-subscribe command kind + payload case); (c) `MxValue_RoundTripsRawValueBytesPayload`, `MxArray_RoundTripsRawArrayPayload`, `WorkerEnvelope_RoundTripsWorkerFaultBody`, `WorkerEnvelope_RoundTripsWorkerHeartbeatBody`. All new tests pass; the full `ProtobufContractRoundTripTests` class is 27 tests green. + +### Contracts-008 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Design-document adherence | +| Location | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:451-459`, `:627-636` | +| Status | Resolved | + +**Description:** The worker-side `AcknowledgeAlarmReplyPayload` carries the alarm-ack outcome as `int32 native_status`, while the public `AcknowledgeAlarmReply` carries it as `MxStatusProxy status` plus `optional int32 hresult`. The comment explains the worker echoes `native_status` into `AcknowledgeAlarmReply.hresult`, but the two outcome shapes (raw `int32` vs structured `MxStatusProxy`) are not reconciled in `docs/Contracts.md` / `AlarmClientDiscovery.md`. A reader cannot tell whether `MxStatusProxy status` is always populated or only on COM-layer failure. + +**Recommendation:** Document in `docs/Contracts.md` (or `AlarmClientDiscovery.md`) how the worker `native_status` maps onto the public reply's `status`/`hresult` pair so client authors know which field is authoritative. 
+ +**Resolution:** _(2026-05-18)_ Verified against `WorkerAlarmRpcDispatcher.AcknowledgeAsync`. The asymmetry is larger than the finding implies: the dispatcher copies the worker `MxCommandReply.hresult` into `AcknowledgeAlarmReply.hresult` but **never** assigns `AcknowledgeAlarmReply.status` — the `MxStatusProxy status` field is left UNSET on every reply. The proto comment on `status` ("Native MxAccess status describing the outcome of the ack") was therefore actively misleading. Fixed: (1) reworded the `mxaccess_gateway.proto` comments on `AcknowledgeAlarmReply.hresult` (now identifies it as the authoritative native-return-code field) and `AcknowledgeAlarmReply.status` (now states it is reserved/unset and clients must not depend on it); (2) extended `docs/AlarmClientDiscovery.md` section 4 with a "Worker `native_status` → public `AcknowledgeAlarmReply` mapping" subsection spelling out that `hresult` is authoritative (`0` = success) and `status` is always unset, and that clients should branch on `protocol_status` then `hresult`, never `status`. + +### Contracts-009 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Design-document adherence | +| Location | `docs/Contracts.md:13-24` | +| Status | Resolved | + +**Description:** Commit `5e375f6` ("Add bulk read/write command family across worker, gateway, and clients") added five new command kinds — `WriteBulk`, `Write2Bulk`, `WriteSecuredBulk`, `WriteSecured2Bulk`, `ReadBulk` — plus the `BulkWriteReply` / `BulkWriteResult` and `BulkReadReply` / `BulkReadResult` shapes to `mxaccess_gateway.proto`. `gateway.md` (lines 299-322) was updated in that commit, but `docs/Contracts.md` was not. It still describes only the older bulk subscription family (`AddItemBulk`, `AdviseItemBulk`, `RemoveItemBulk`, `UnAdviseItemBulk`, `SubscribeBulk`, `UnsubscribeBulk`) returning `BulkSubscribeReply` with no mention of the bulk write/read commands or their per-entry result types. 
The CLAUDE.md rule "Update docs in the same change as the source. When public APIs, contracts, configuration, build steps, security behavior, event shapes, value conversion, status mapping, or lifecycle rules change, the affected docs … must change in the same commit" was violated for this addition. The result is that the canonical contracts document undercounts the public bulk surface by five commands. + +**Recommendation:** Extend the bulk-commands paragraph in `docs/Contracts.md` to list the new `WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` / `ReadBulk` command kinds, the per-entry request shape (`WriteBulkEntry` etc.), and the new reply types (`BulkWriteReply` carrying `BulkWriteResult`; `BulkReadReply` carrying `BulkReadResult`). Cross-reference `gateway.md` for the cached-vs-snapshot `ReadBulk` lifecycle and `docs/DesignDecisions.md` "Bulk Command Family" for the per-entry-result rationale rather than re-stating those details. + +**Resolution:** _(2026-05-20)_ Confirmed `docs/Contracts.md` documented only the older bulk subscription family and never mentioned the bulk write/read additions from commit `5e375f6`. Cross-checked against `mxaccess_gateway.proto` (`MxCommand.payload` cases 39-43, `MxCommandKind` 30-34, the `Write*BulkCommand` / `Write*BulkEntry` shapes, `ReadBulkCommand` with `tag_addresses` + `timeout_ms`, `MxCommandReply.payload` cases 36-40, and the `BulkWriteReply`/`BulkWriteResult` + `BulkReadReply`/`BulkReadResult` messages). 
Extended the "Files" section of `docs/Contracts.md` with a new paragraph listing the five command kinds, the per-entry request shape for each `Write*Bulk` family (with the credential-sensitive redaction rule carried through to `WriteSecuredBulkEntry`/`WriteSecured2BulkEntry`), the `BulkWriteReply` + `BulkWriteResult` reply (including the `optional int32 hresult` field and the no-raise per-entry failure contract), and the `ReadBulkCommand` → `BulkReadReply` + `BulkReadResult` reply with the cached-vs-snapshot dual-mode semantics and the deliberate absence of `hresult` on `BulkReadResult`. Cross-references to `gateway.md` (lifecycle + scopes) and `docs/DesignDecisions.md` "Bulk Command Family" (rationale) added rather than re-stating those details. + +### Contracts-010 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/MxGateway.Tests/Contracts/ProtobufContractRoundTripTests.cs` | +| Status | Resolved | + +**Description:** Contracts-007 (closed 2026-05-18) added Galaxy Repository, bulk-subscribe, `MxValue.raw_value` / `MxArray.raw_values`, and `WorkerFault`/`WorkerHeartbeat` round-trip coverage. The bulk write/read messages added in commit `5e375f6` were never given equivalent coverage. `ProtobufContractRoundTripTests` has no test that exercises any of: `WriteBulkCommand` / `Write2BulkCommand` / `WriteSecuredBulkCommand` / `WriteSecured2BulkCommand` / `ReadBulkCommand`; `BulkWriteReply` / `BulkWriteResult`; `BulkReadReply` / `BulkReadResult`; the new `MxCommandReply.payload` oneof cases (`write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk`, `read_bulk`). The asymmetry that `BulkWriteResult` carries `hresult` and `BulkReadResult` does not, and the `optional int32 hresult` semantics on `BulkWriteResult`, are exactly the kind of wire-shape details prior contract tests have been written to pin. 
+ +**Recommendation:** Add `ProtobufContractRoundTripTests` cases mirroring the existing `BulkSubscribeReply_RoundTripsSubscribeResults` / `MxCommandReply_RoundTripsBulkSubscribePayload` pattern: at minimum one round-trip per new request-side message (`WriteBulkCommand` covers the entry-list case; one secured variant proves the credential-sensitive shape; `ReadBulkCommand` covers `timeout_ms`), one round-trip for each new reply payload (`BulkWriteReply` carrying `BulkWriteResult` with `hresult` set + unset to exercise the proto3 `optional` presence; `BulkReadReply` carrying a `was_cached = true` and a `was_cached = false` entry), and at least one `MxCommandReply` test pinning a new payload-oneof case (e.g. `MxCommandReply.PayloadCase == PayloadOneofCase.ReadBulk` for `MxCommandKind.ReadBulk`). + +**Resolution:** _(2026-05-20)_ Added round-trip tests in `ProtobufContractRoundTripTests` covering every gap listed: per-request `WriteBulkCommand_RoundTripsEntries`, `Write2BulkCommand_RoundTripsEntriesWithTimestampValue`, `WriteSecuredBulkCommand_RoundTripsCredentialBearingEntries`, `WriteSecured2BulkCommand_RoundTripsCredentialBearingEntriesWithTimestamp`, `ReadBulkCommand_RoundTripsTagAddressesAndTimeout`; per-reply `BulkWriteReply_RoundTripsResultsWithOptionalHresultPresence` (asserts both `HasHresult == true` and `HasHresult == false` arms of the proto3 `optional int32 hresult`) and `BulkReadReply_RoundTripsCachedAndSnapshotResults` (covers `was_cached = true`, `was_cached = false`, and a per-entry failure with `error_message`; additionally pins the deliberate absence of an `hresult` field on `BulkReadResult` via the descriptor); and `MxCommandReply` oneof-case pinning via `MxCommandReply_RoundTripsBulkWritePayloadCases` (a `[Theory]` exercising the four bulk-write payload-oneof cases) plus `MxCommandReply_RoundTripsReadBulkPayload`. All new tests pass; the full `ProtobufContractRoundTripTests` + `GatewayContractInfoTests` filter is 42 tests green. 
+ +### Contracts-011 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Security | +| Location | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:392-397`, `:406-412` | +| Status | Resolved | + +**Description:** The single-item `WriteSecuredCommand` (line 234-242) and `WriteSecured2Command` (line 244-253) put the credential-sensitivity redaction note on the `value` field directly ("Credential-sensitive write value. Implementations must not log this field unless an explicit redacted value-logging path is enabled."). The bulk equivalents move the note to the outer message instead — `WriteSecuredBulkCommand` (line 383-386) and `WriteSecured2BulkCommand` (line 399-400) carry it as a header comment — and the inner `WriteSecuredBulkEntry.value` (line 396) and `WriteSecured2BulkEntry.value` (line 410) are left without per-field comments. A future editor reading just `WriteSecuredBulkEntry` to add a new field or change the entry shape will not see the redaction rule. The ProtobufStyleGuide explicitly requires "Mark credential-bearing request fields clearly in comments"; the single-item path follows that rule, the bulk path does not. + +**Recommendation:** Add per-field credential-sensitivity comments to `WriteSecuredBulkEntry.value` and `WriteSecured2BulkEntry.value` matching the wording on `WriteSecuredCommand.value` / `WriteSecured2Command.value`. Comment-only change with no wire-format or generated-type impact. + +**Resolution:** _(2026-05-20)_ Added per-field credential-sensitivity comments to `WriteSecuredBulkEntry.value` and `WriteSecured2BulkEntry.value` in `mxaccess_gateway.proto`, mirroring verbatim the wording carried on `WriteSecuredCommand.value` / `WriteSecured2Command.value` ("Credential-sensitive write value. Implementations must not log this field unless an explicit redacted value-logging path is enabled."). 
The outer-message header redaction comment on `WriteSecuredBulkCommand` / `WriteSecured2BulkCommand` is retained so the rule is visible at both scopes. Comment-only change; no wire-format or generated-type impact (the `MxGateway.Contracts` build is clean against the regenerated code). + +### Contracts-012 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Contracts/Protos/galaxy_repository.proto:120` | +| Status | Resolved | + +**Description:** `GalaxyAttribute.mx_data_type` is declared as `int32` with no in-proto comment. The field carries the raw Galaxy SQL DB type identifier (from `dbo.data_type`), which deliberately does NOT correspond to the public `MxDataType` enum in `mxaccess_gateway.proto`; `docs/Contracts.md` calls this out ("The service is metadata-only and does not share types with mxaccess_gateway.proto") and `docs/GalaxyRepository.md:190` documents the choice ("`mx_data_type` is returned as the raw Galaxy integer rather than mapped to a language-neutral enum"), but the proto file itself gives the reader no signal. A client author looking at the .proto without those docs is likely to assume the field is an `MxDataType` value and write a `(MxDataType)` cast that silently misclassifies most attributes. The ProtobufStyleGuide rule "Comment fields that carry MXAccess parity details, raw HRESULT/status information, or compatibility constraints" applies — this is exactly a parity-detail / compatibility-constraint field where the int32 has non-obvious semantics. The accompanying `data_type_name` text field and the `mx_attribute_category` and `security_classification` int fields share the same gap.
+ +**Recommendation:** Add a short comment on `GalaxyAttribute.mx_data_type` (and ideally on `mx_attribute_category` and `security_classification`) clarifying that the value is a raw Galaxy SQL identifier passed through unchanged, NOT a member of the `mxaccess_gateway.v1.MxDataType` enum, with a pointer to `docs/GalaxyRepository.md`. Comment-only change; no wire-format impact. + +**Resolution:** _(2026-05-20)_ Added in-proto comments to `GalaxyAttribute.mx_data_type`, `data_type_name`, `mx_attribute_category`, and `security_classification` in `galaxy_repository.proto`. The `mx_data_type` comment explicitly calls out that the value is a raw Galaxy SQL `dbo.data_type` identifier passed through unchanged, that it is NOT a member of `mxaccess_gateway.v1.MxDataType`, and that the two enumerations must not be cast or compared (closing the silent-misclassification trap the finding describes). The `data_type_name` comment clarifies it is free-form Galaxy text from the same table, not a stable enum. `mx_attribute_category` and `security_classification` comments mark them as raw Galaxy-specific identifiers not mapped to any gateway enum. All four comments cross-reference `docs/GalaxyRepository.md` for the rationale rather than restating it. Comment-only change; no wire-format impact. + +### Contracts-013 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Tests/Contracts/GatewayContractInfoTests.cs:14` | +| Status | Resolved | + +**Description:** The XML summary on `GatewayContractInfoTests.GatewayProtocolVersion_IsVersionThree` reads "Verifies that the gateway protocol version is bumped to three after the alarm proto extension." That description is now incomplete: since the comment was written, the contract has been extended again (the bulk write/read command family in commit `5e375f6`) without a corresponding `GatewayProtocolVersion` bump. 
The test name says "IsVersionThree" but the summary attributes the value-of-3 to a single historical event (the alarm extension) — readers checking whether subsequent contract additions should have bumped the version will get a misleading rationale. This is the same class of stale-summary issue as Contracts-004 (`GatewayContractInfo` class summary), just relocated to the test that pins the constant. + +**Recommendation:** Reword the summary to describe what the test pins (the current `GatewayProtocolVersion` constant equals 3) rather than narrating a specific historical bump, OR explicitly enumerate the alarm- and bulk-write/read additions covered under version 3 so readers know both extensions were additive and intentionally did not require a bump. + +**Resolution:** _(2026-05-20)_ Reworded the XML summary on `GatewayContractInfoTests.GatewayProtocolVersion_IsVersionThree` to describe what the test actually pins: the current `GatewayProtocolVersion` constant equals 3, with both the alarm proto extension (`AcknowledgeAlarm` / `QueryActiveAlarms` RPCs, `OnAlarmTransitionEvent`, the alarm command/reply payload cases) AND the bulk write/read command family extension (`WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` / `ReadBulk` with their `BulkWriteReply` / `BulkReadReply` payloads) shipping under version 3 as strictly additive changes that did not require a further bump. The new summary also instructs that a future breaking contract change should bump the constant and update the test in lock-step. Test logic is unchanged; the test still passes. 
+ +### Contracts-014 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:549-553` | +| Status | Resolved | + +**Description:** The `BulkWriteResult` header comment says `item_handle` mirrors the request entry "so callers can correlate inputs to outputs even when the gateway's tag-allowlist filter dropped some entries before reaching the worker." No "tag-allowlist filter" exists by that name anywhere in `src/`, `gateway.md`, `docs/`, or `docs/style-guides/` — a full-tree search returns matches only inside this proto comment and the prior-pass code-reviews. The real gateway-side bulk-write filter is `MxAccessGatewayService` calling `IConstraintEnforcer.CheckWriteHandleAsync` per entry (see `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:565-585` and `src/MxGateway.Server/Security/Authorization/IConstraintEnforcer.cs`); failures populate a synthetic `BulkWriteResult` with `was_successful = false` and the constraint's `ErrorMessage` is recorded via `constraintEnforcer.RecordDenialAsync`. The mechanism is a per-API-key constraint enforcer that can reject by handle (not a "tag" list), and the failure path covers any `ConstraintFailure` reason (write-handle scope, audit policy, etc.) — not a single inclusive tag allowlist. A future reader of the proto will search for "tag-allowlist" and find nothing, or worse, build a non-existent feature against the misleading name. The contract concept the comment is trying to communicate (item-level correlation matters because the gateway can drop entries before the worker sees them) is correct and worth keeping. 
+ +**Recommendation:** Reword the `BulkWriteResult` header comment to identify the actual mechanism — for example: "...so callers can correlate inputs to outputs even when the gateway's per-entry `IConstraintEnforcer.CheckWriteHandleAsync` filter (see `docs/Authorization.md`) dropped some entries before reaching the worker." Comment-only change with no wire-format impact. + +**Resolution:** _(2026-05-20)_ Reworded the `BulkWriteResult` header comment in `mxaccess_gateway.proto` to identify the real gateway-side per-entry filter — `IConstraintEnforcer.CheckWriteHandleAsync` invoked by `MxAccessGatewayService.ReplaceWriteBulkEntries` — and cross-referenced `docs/Authorization.md` for the rationale. The contract concept (item-level correlation matters because the gateway can drop entries before the worker sees them) is preserved; the misleading "tag-allowlist filter" name is removed so future readers will not search for or build against a non-existent feature. The "Per-item failures populate `error_message` + `hresult` and never raise" sentence is retained verbatim. Comment-only change; `dotnet build src/MxGateway.Contracts/MxGateway.Contracts.csproj` succeeded with 0 warnings / 0 errors on both `net48` and `net10.0`. + +### Contracts-015 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:571-582` | +| Status | Resolved | + +**Description:** `BulkReadResult` carries seven payload-bearing fields beyond the carrier flags — `value`, `quality`, `source_timestamp`, `statuses`, `error_message`, plus `item_handle` and `tag_address` — and the header comment only documents the `was_cached` arm. There is no in-proto statement of which fields carry data on `was_successful = true` versus `was_successful = false`. 
Cross-checked against the worker: `MxAccessSession.FailedRead` (lines 940-956) populates only `ServerHandle`, `TagAddress`, `ItemHandle`, `WasSuccessful = false`, `WasCached`, and `ErrorMessage` — `value`, `quality`, `source_timestamp`, and `statuses` are all left at their proto3 defaults (null / 0 / null / empty). `SucceededRead` populates the value/quality/source_timestamp/statuses from the cached or snapshotted `OnDataChange`. A client reading `BulkReadResult` from the proto alone has no way to know that `value == null` and `quality == 0` on failure are deliberate "absent" markers rather than "value is null with quality bad" data — both interpretations are wire-equivalent. `BulkWriteResult` has the same shape gap for `statuses` / `hresult` on failed entries, but its header comment at least says "Per-item failures populate `error_message` + `hresult` and never raise"; `BulkReadResult` has no equivalent statement. + +**Recommendation:** Extend the `BulkReadResult` header comment (or add per-field comments on `value` / `quality` / `source_timestamp` / `statuses` / `error_message`) to state explicitly which fields are populated on success and which are left at their proto3 defaults on failure — e.g. "On `was_successful = false`, only `server_handle`, `tag_address`, `item_handle` (when allocated), `was_cached`, and `error_message` are populated; `value`, `quality`, `source_timestamp`, and `statuses` are left at their proto3 defaults and must not be read as data." Comment-only change with no wire-format impact. + +**Resolution:** _(2026-05-20)_ Extended the `BulkReadResult` header comment in `mxaccess_gateway.proto` with explicit per-arm documentation, mirroring the level of detail the existing `BulkWriteResult` header carries. On `was_successful = true` the comment now states `value` / `quality` / `source_timestamp` / `statuses` carry the read data (from the cached subscription or the snapshot lifecycle, depending on `was_cached`) and `error_message` is empty. 
On `was_successful = false` the comment lists exactly which fields are populated (`server_handle`, `tag_address`, `item_handle` when allocated, `was_cached`, `error_message`) and warns that `value` / `quality` / `source_timestamp` / `statuses` are left at their proto3 defaults and must not be read as data — explicitly noting they are wire-indistinguishable from "value is null with quality bad" data so a future reader cannot make that mistake. The comment also pins the deliberate absence of an `hresult` field on `BulkReadResult` (cross-referencing `docs/DesignDecisions.md` "Bulk Command Family" for the rationale) and the "Per-tag failures populate `error_message` and never raise" semantic that parallels `BulkWriteResult`. Comment-only change; `dotnet build src/MxGateway.Contracts/MxGateway.Contracts.csproj` succeeded with 0 warnings / 0 errors on both `net48` and `net10.0`. + +### Contracts-016 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_gateway.proto:31-41` (`QueryActiveAlarmsRequest`) | +| Status | Open | + +**Description:** The new public message `QueryActiveAlarmsRequest` (added in commit `397d3c5`) has `session_id = 1` with a comment "session_id is currently unused (the snapshot is session-less) but reserved so a future per-session view can be added without a wire break." The field is not marked `reserved` and clients are not told whether to populate it. As shipped today the server-side implementation (`MxAccessGatewayService.QueryActiveAlarms`) ignores it, but a Java/Rust/Go/Python client author reading the proto alone can't tell whether to leave it empty or write the caller's session id. + +**Recommendation:** Either (a) tighten the comment to "Clients may leave `session_id` empty; the gateway currently ignores it and serves the session-less central-monitor cache. A future version may use it to scope the snapshot to one session." 
— making the "currently-ignored" semantic unambiguous — or (b) remove `session_id` and use `reserved 1; reserved "session_id";` until the per-session view actually exists. Option (a) is cheaper and preserves a forward-compat hint. + +**Resolution:** _(empty until closed)_ + +### Contracts-017 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_gateway.proto:23-29` (the `rpc QueryActiveAlarms` block) | +| Status | Open | + +**Description:** The RPC comment on `QueryActiveAlarms` describes the stream order ("Point-in-time snapshot of the currently-active alarm set served from the gateway's always-on alarm monitor cache") and the session-less semantic, but does not mention that `QueryActiveAlarmsRequest.alarm_filter_prefix` narrows the snapshot by a `StartsWith(reference)` match on `alarm_full_reference`. A client author reading the RPC comment alone cannot discover the filter capability without inspecting the request message. + +**Recommendation:** Extend the RPC comment with one line: "`QueryActiveAlarmsRequest.alarm_filter_prefix` optionally narrows the snapshot to alarms whose `alarm_full_reference` starts with the given prefix; an empty prefix returns the full set." Comment-only. 
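
The filter semantic the recommendation describes can be sketched in a few lines. This is an illustration only: `AlarmPrefixFilter` is a hypothetical helper, the ordinal comparison is an assumption (the finding only says the match is a `StartsWith` on `alarm_full_reference`), and the real gateway operates on alarm messages rather than bare strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating the documented prefix semantic.
public static class AlarmPrefixFilter
{
    // Returns the references that start with the prefix. A null or empty
    // prefix matches everything, which mirrors "an empty prefix returns
    // the full set". Ordinal comparison is an assumption of this sketch.
    public static IReadOnlyList<string> Apply(IEnumerable<string> alarmFullReferences, string prefix)
    {
        var effectivePrefix = prefix ?? string.Empty;
        return alarmFullReferences
            .Where(reference => reference.StartsWith(effectivePrefix, StringComparison.Ordinal))
            .ToList();
    }
}
```

For example, calling `AlarmPrefixFilter.Apply(references, "Area1.")` keeps only references under `Area1.`, while `AlarmPrefixFilter.Apply(references, "")` returns every reference unchanged. Stating that second case in the RPC comment spares client authors from guessing whether an unset prefix means "all" or "none".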
+ +**Resolution:** _(empty until closed)_ diff --git a/code-reviews/IntegrationTests/findings.md b/code-reviews/IntegrationTests/findings.md new file mode 100644 index 0000000..3cd1efc --- /dev/null +++ b/code-reviews/IntegrationTests/findings.md @@ -0,0 +1,463 @@ +# Code Review — IntegrationTests + +| Field | Value | +|---|---| +| Module | `src/ZB.MOM.WW.MxGateway.IntegrationTests` | +| Reviewer | Claude Code | +| Review date | 2026-05-24 | +| Commit reviewed | `d692232` | +| Status | Reviewed | +| Open findings | 3 | + +## Checklist coverage + +A comprehensive review completes every category, recording "No issues found" where +a category produced nothing rather than leaving it blank. + +### 2026-05-20 re-review (commit `a020350`) + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issues found: IntegrationTests-017 (teardown-parity test's "no further OnDataChange after UnAdvise" assertion races against in-flight events the provider already published); IntegrationTests-020 (abnormal-exit test's fault-classification keyword list accepts the substring `"worker"`, which matches almost any plausible fault message and dilutes the check to "the description is non-empty"). | +| 2 | mxaccessgw conventions | No issues found. The five new tests honor live opt-in gating, `[Collection]` serialization, "no synthesized events", and the credential-redaction contract for the assertions they make. | +| 3 | Concurrency & thread safety | No issues found. `GatewaySession.State`/`FinalFault` access in the abnormal-exit poll loop goes through `_syncRoot`; `RecordingServerStreamWriter.Messages` returns a locked snapshot copy. | +| 4 | Error handling & resilience | No issues found. `ShutDownAsync`'s opt-in `propagateStreamFaults` correctly threads silent stream-task faults into the Write parity test without re-masking the IntegrationTests-004 path. 
| +| 5 | Security | Issue found: IntegrationTests-019 (WriteSecured live test asserts the password is absent from `DiagnosticMessage` only; it does not assert the credential is absent from the accumulated test output, where the worker `stderr`/`stdout` and the gateway log are echoed). | +| 6 | Performance & resource management | No issues found. All six `RecordingServerStreamWriter` instantiations use `using` declarations; `using CancellationTokenSource` is the consistent pattern. | +| 7 | Design-document adherence | No issues found. `docs/GatewayTesting.md` documents all five new parity surfaces and the two new env-var defaults (`MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_USER`/`_PASSWORD`). | +| 8 | Code organization & conventions | Issue found: IntegrationTests-018 (`GatewayServiceFixture.TryGetSession` declares `out GatewaySession session` non-nullable while the caller binds it as `out GatewaySession? session`; the null-forgiving operator inside `SessionRegistry.TryGet` propagates a misleading non-null annotation). | +| 9 | Testing coverage | Issue found: IntegrationTests-021 (abnormal-exit test does not assert the active `StreamEvents` task observed the worker fault; relies entirely on the session-state poll and would silently pass if `MarkFaulted` were ever moved off the stream-consumption path). | +| 10 | Documentation & comments | No issues found. Test XML comments now match what each assertion verifies (the IntegrationTests-011 fix is intact across both the Write and invalid-handle cases). | + +### 2026-05-20 review (commit `1cd51bb`) + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issue found: IntegrationTests-012 (Write test starts a `StreamEvents` task and never observes it — silent event-stream coverage gap and an unobserved fault path). | +| 2 | mxaccessgw conventions | Live opt-ins, `[Collection]` serialization, and the "don't synthesize events" rule are honored. No issues found. 
| +| 3 | Concurrency & thread safety | `LiveResourcesCollection` serializes all three live classes; `RecordingServerStreamWriter` locks correctly and the semaphore wait is linked to both timeout and external cancellation. No issues found. | +| 4 | Error handling & resilience | `ShutDownAsync` already isolates cleanup exceptions per category. No issues found. | +| 5 | Security | The only embedded strings are documented dev GLAuth creds and a localhost ZB connection string, all env-overridable. The wrong-password and unreachable-server tests assert no password leakage. No issues found. | +| 6 | Performance & resource management | Issue found: IntegrationTests-013 (`RecordingServerStreamWriter.messageArrived` `SemaphoreSlim` is never disposed; the type owns an `IDisposable` field but is not itself disposable). | +| 7 | Design-document adherence | No issues found. `docs/GatewayTesting.md` now documents the Live LDAP, Live Galaxy, and Write/invalid-handle MXAccess opt-ins added by the prior round of resolutions. | +| 8 | Code organization & conventions | Issues found: IntegrationTests-015 (`[Trait("Category", ...)]` repeated on every test method instead of declared once at class level); IntegrationTests-016 (the Galaxy default connection string is duplicated between `LiveGalaxyRepositoryFactAttribute` and `GalaxyRepositoryOptions`). | +| 9 | Testing coverage | Issue found: IntegrationTests-014 (`Unadvise`, `RemoveItem`, `Unregister`, `WriteSecured` ordering, and worker-fault parity still uncovered — IntegrationTests-005's resolution scoped these out). | +| 10 | Documentation & comments | Issue found: IntegrationTests-011 (the invalid-handle and write test comments describe a non-`Ok` MXAccess failure as `ProtocolStatusCode.Ok`, contradicting both the assertion and `HResultConverter`). 
| + +### 2026-05-18 review (commit `6c64030`) + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issues found: IntegrationTests-003 (asserts only on first event), IntegrationTests-010 (`WaitForMessageAsync` ignores cancellation). | +| 2 | mxaccessgw conventions | Live tests correctly gated and skip (not fail) when prerequisites are absent; `LiveGalaxyRepositoryFactAttribute` undocumented in the opt-in matrix. | +| 3 | Concurrency & thread safety | Issue found: IntegrationTests-007 (no `[Collection]`/parallelism guard for shared MXAccess/ZB/GLAuth). | +| 4 | Error handling & resilience | Issue found: IntegrationTests-004 (cleanup `WaitAsync` can mask the original failure). | +| 5 | Security | No production secrets; only documented dev GLAuth creds and a localhost ZB connection string, all env-overridable. No issues found. | +| 6 | Performance & resource management | Worker process disposed transitively via session disposal; no leaked pipes/COM/processes. No issues found. | +| 7 | Design-document adherence | Issues found: IntegrationTests-001 (Galaxy live suite absent from the opt-in matrix), IntegrationTests-002 (`GwAdmin` LDAP prerequisite undocumented). | +| 8 | Code organization & conventions | Issue found: IntegrationTests-008 (three near-identical fact attributes). | +| 9 | Testing coverage | Issues found: IntegrationTests-005 (thin MXAccess parity coverage), IntegrationTests-006 (thin LDAP failure-path coverage). | +| 10 | Documentation & comments | Issue found: IntegrationTests-009 (`TestServerCallContext` mislabelled "Mock"). 
| + +### 2026-05-24 re-review (commit `d692232`) + +Re-review pass at `d692232` scoped to: the rename (`dc9c0c9`) of the +namespaces, csproj, and the `IntegrationTestEnvironment.ResolveRepositoryRoot` +walker; the `DashboardLdapLiveTests` update to inject a `GroupToRole` +mapping in place of the dropped `LdapOptions.RequiredGroup` default +(`27ed651`); and the inline `NullDashboardEventBroadcaster` fake added +to `WorkerLiveMxAccessSmokeTests` for the new `EventStreamService` constructor +parameter (`d692232`). + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found in the `a020350..d692232` diff. | +| 2 | mxaccessgw conventions | No issues found — namespaces, env-var names, and live-test attribute discipline preserved. | +| 3 | Concurrency & thread safety | No issues found in this diff. | +| 4 | Error handling & resilience | Confirmed: the `RpcException(StatusCode.Cancelled)` sibling catch added in `WorkerLiveMxAccessSmokeTests.ShutDownAsync` covers the only site where a gRPC-mapped `OperationCanceledException` can surface — no other call sites need widening. | +| 5 | Security | No issues found. | +| 6 | Performance & resource management | No issues found. | +| 7 | Design-document adherence | No issues found — `docs/GatewayTesting.md` was updated in the same wave to describe the `GroupToRole`-fixture pattern. | +| 8 | Code organization & conventions | Issues found: IntegrationTests-022 (`ResolveRepositoryRoot` silently falls back to `Directory.GetCurrentDirectory()` on exhausted walk, which can hide misconfiguration), IntegrationTests-024 (inline `NullDashboardEventBroadcaster` fake is the only such use today but worth watching for duplication once a second consumer arrives). | +| 9 | Testing coverage | Issue found: IntegrationTests-023 (`DashboardLdapLiveTests.AuthenticateAsync_AdminInGwAdminGroup_Succeeds` asserts the `ldap_group` claim but does not assert the emitted `Role: Admin` claim, leaving the role-mapping path untested). 
| +| 10 | Documentation & comments | No issues found. | + +## Findings + +### IntegrationTests-001 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Design-document adherence | +| Location | `src/MxGateway.IntegrationTests/Galaxy/LiveGalaxyRepositoryFactAttribute.cs:7`, `src/MxGateway.IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs` | +| Status | Resolved | + +**Description:** The Galaxy Repository live test suite and its gating env var `MXGATEWAY_RUN_LIVE_GALAXY_TESTS` (plus connection-string override `MXGATEWAY_LIVE_GALAXY_CONN`) are completely absent from `docs/GatewayTesting.md`. CLAUDE.md mandates updating docs in the same change as the source. The opt-in matrix documents only the MXAccess and LDAP env vars, so an operator running the documented matrix has no way to know these tests exist or how to enable them. + +**Recommendation:** Add a "Live Galaxy Repository" section to `docs/GatewayTesting.md` documenting `MXGATEWAY_RUN_LIVE_GALAXY_TESTS=1`, `MXGATEWAY_LIVE_GALAXY_CONN`, the `ZB` database prerequisite, and the covered RPCs, mirroring the existing "Live MXAccess Smoke" section. + +**Resolution:** Resolved 2026-05-18: Added a "Live Galaxy Repository" section to `docs/GatewayTesting.md` documenting `MXGATEWAY_RUN_LIVE_GALAXY_TESTS`, `MXGATEWAY_LIVE_GALAXY_CONN`, the deployed-`ZB` prerequisite, and the covered `GalaxyRepository` RPCs. + +### IntegrationTests-002 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Design-document adherence | +| Location | `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs:13`, `src/MxGateway.Server/Configuration/LdapOptions.cs:27` | +| Status | Resolved | + +**Description:** `DashboardLdapLiveTests` builds the authenticator with `new GatewayOptions()`, so it relies on `LdapOptions.RequiredGroup` defaulting to `GwAdmin` and asserts the `admin` user is a member of a `GwAdmin` LDAP group. 
`glauth.md` does not list `GwAdmin` as a provisioned group — it lists `admin` only in the five role groups and describes `GwAdmin` as a group to add "when reuse isn't enough." If GLAuth has only the documented baseline groups, `AuthenticateAsync_AdminInGwAdminGroup_Succeeds` fails (not skips) on any box where the env var is set. This is an undocumented hard prerequisite beyond "LDAP is up." + +**Recommendation:** Either document the required `GwAdmin` GLAuth provisioning step in `glauth.md` and `GatewayTesting.md`, or have the test set `RequiredGroup` to a baseline group `glauth.md` guarantees `admin` belongs to (e.g. `WriteOperate`). + +**Resolution:** Resolved 2026-05-18: Took the documentation fix — promoted the `glauth.md` "Adding a gw-specific group" section into a concrete "Provisioning the GwAdmin group" step that grants `GwAdmin` to `admin`, cross-referenced it from the groups/verification sections, and added a "Live LDAP" section to `docs/GatewayTesting.md` calling out `GwAdmin` as a hard prerequisite. Alternative considered: weaken the test to a baseline group (`WriteOperate`) — rejected because `GwAdmin` is the real default `LdapOptions.RequiredGroup` and the test should exercise it. + +### IntegrationTests-003 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:89-97` | +| Status | Resolved | + +**Description:** The test asserts only on the first `MxEvent` recorded by `RecordingServerStreamWriter`. A live MXAccess provider can deliver an initial state/quality event whose family or handles differ from the expected `OnDataChange` (e.g. a registration-state or bad-quality bootstrap event). Because `WaitForFirstMessageAsync` returns whatever arrives first, a genuine ordering/family defect could fail spuriously or leave later wrong events unverified. 
+ +**Recommendation:** Filter for the first event with `Family == OnDataChange` (with a bounded retry/poll) or assert the full recorded sequence, so the test verifies the event the worker is supposed to emit. + +**Resolution:** Resolved 2026-05-18: Confirmed against source — `WaitForFirstMessageAsync` completed a `TaskCompletionSource` on the very first `WriteAsync`. Replaced it with `RecordingServerStreamWriter.WaitForMessageAsync(predicate, timeout)`, which scans recorded messages, skips earlier non-matching events, and blocks on a `SemaphoreSlim` until a matching one arrives or the timeout elapses (throwing a `TimeoutException` that reports the scanned count). `GatewaySession_WithLiveWorker_RegistersAdvisesStreamsDataAndCloses` now waits for the first `Family == OnDataChange` event. Live execution was not possible in this environment (no MXAccess COM); verified by build. + +### IntegrationTests-004 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:108-111` | +| Status | Resolved | + +**Description:** In the `finally` block, after `CloseSessionAsync`, the test does `await streamTask.WaitAsync(StreamShutdownTimeout)`. If closing the session does not promptly complete the stream (or `StreamEvents` itself faults), this throws `TimeoutException` from inside `finally`, which replaces/masks any original assertion failure from the `try` block. The diagnostic value of the real failure is lost. + +**Recommendation:** Wrap the `streamTask.WaitAsync` (and ideally `WaitForProcessesAsync`) in a try/catch that logs the cleanup exception via `output.WriteLine` instead of letting it propagate. + +**Resolution:** Resolved 2026-05-18: Confirmed — the `finally` block awaited `streamTask.WaitAsync` and `WaitForProcessesAsync` with no exception handling. 
Extracted a shared `ShutDownAsync` helper that wraps the session-close + stream-drain in one try/catch and the worker-process wait in a second try/catch, logging each cleanup exception via `output.WriteLine` instead of throwing. All three live tests now route shutdown through it, so a cleanup timeout can no longer mask an assertion failure. Live execution was not possible in this environment; verified by build. + +### IntegrationTests-005 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` | +| Status | Resolved | + +**Description:** The only live MXAccess test covers the Register→AddItem→Advise→one-OnDataChange→Close happy path. CLAUDE.md stresses that MXAccess parity is the contract and calls out non-obvious behaviors (`WriteSecured` ordering, `OperationComplete` semantics, invalid-handle exceptions). None of `Write`, `WriteSecured`, `Unadvise`, `RemoveItem`, `Unregister`, `OperationComplete`, an invalid-handle command, or a worker-fault path is exercised against live COM — exactly the paths fake-worker tests cannot validate. + +**Recommendation:** Add live coverage for at least a `Write` round-trip and an invalid-handle command, plus a worker-fault/abnormal-exit scenario, even if behind additional opt-in env vars. + +**Resolution:** Resolved 2026-05-18: Added two `[LiveMxAccessFact]`-gated tests to `WorkerLiveMxAccessSmokeTests`. `GatewaySession_WithLiveWorker_WritesValueToAdvisedItem` registers/adds/advises then issues a `Write` of an integer value, asserting the command round-trips with `ProtocolStatusCode.Ok` and `MxCommandKind.Write`. `GatewaySession_WithLiveWorker_InvalidHandleCommand_SurfacesFailureWithoutTransportFault` issues `AddItem` against `int.MaxValue` as the server handle (never issued by MXAccess) and asserts the failure surfaces in the command reply without a usable item handle. 
Both reuse the existing opt-in env var and the `ShutDownAsync` cleanup helper. A worker-fault/abnormal-exit case was deliberately scoped out — it needs a controlled COM crash injection beyond what the existing harness supports; the two added cases cover the `Write` round-trip and invalid-handle paths the recommendation calls out. Live execution was not possible in this environment; verified by build. + +### IntegrationTests-006 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs` | +| Status | Resolved | + +**Description:** LDAP live coverage is two cases: admin succeeds, readonly is denied for missing group. There is no coverage of a wrong password for a valid user, an unknown username, or the LDAP-server-unreachable path — all of which `DashboardAuthenticator` has distinct branches for (the `LdapException` catch, the `candidate is null` branch). The negative test only proves group-membership denial, not credential rejection. + +**Recommendation:** Add a live test for `admin` with a wrong password asserting `Succeeded == false` and that the password is not leaked into `FailureMessage`, and a test for an unknown username. + +**Resolution:** Resolved 2026-05-18: Added three `[LiveLdapFact]`-gated tests to `DashboardLdapLiveTests`. `AuthenticateAsync_AdminWithWrongPassword_FailsWithoutLeakingPassword` exercises the `LdapException` catch via a rejected candidate bind and asserts the wrong password never reaches `FailureMessage`. `AuthenticateAsync_UnknownUsername_Fails` exercises the `candidate is null` branch. `AuthenticateAsync_ServerUnreachable_FailsWithoutThrowing` builds the authenticator with `LdapOptions.Port = 1` (a reserved port no LDAP server listens on) and asserts the connect failure is absorbed into a failed result rather than thrown — covering the generic `catch (Exception)` branch. 
All three are gated by the existing `MXGATEWAY_RUN_LIVE_LDAP_TESTS` opt-in, so they never run by default. Live execution was not possible in this environment (no live LDAP); verified by build. + +### IntegrationTests-007 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:20`, `src/MxGateway.IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs:5`, `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs:9` | +| Status | Resolved | + +**Description:** The live test classes contend for genuinely shared singletons — one MXAccess COM provider, one ZB SQL database, one GLAuth instance with a 3-fail/10-minute per-IP lockout. No `[Collection]` annotation or `DisableTestParallelization` is declared, so xUnit's default cross-class parallelism could run the Galaxy tests concurrently or interleave an LDAP failure burst that trips the GLAuth lockout. + +**Recommendation:** Place the live test classes in a shared `[Collection]`, or set `[assembly: CollectionBehavior(DisableTestParallelization = true)]` for this opt-in project, so live external resources are accessed serially. + +**Resolution:** Resolved 2026-05-18: Confirmed — no `[Collection]` or assembly-level `CollectionBehavior` existed. Added `LiveResourcesCollection.cs` with a `[CollectionDefinition(Name, DisableParallelization = true)]` and applied `[Collection(LiveResourcesCollection.Name)]` to `WorkerLiveMxAccessSmokeTests`, `GalaxyRepositoryLiveTests`, and `DashboardLdapLiveTests`. A named collection (rather than an assembly-wide `DisableTestParallelization`) was chosen so the live classes serialize against each other and within themselves while non-live tests (`IntegrationTestEnvironmentTests`) keep parallelizing. Verified by build; live tests not executed (no MXAccess COM / live LDAP in this environment). 
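
The named-collection pattern from the resolution above can be sketched as follows, assuming xUnit 2.x. The collection name string and the empty class bodies are placeholders for this sketch; only the attribute usage mirrors what the resolution describes.

```csharp
using Xunit;

// Marker class that defines the collection. DisableParallelization = true
// keeps this collection from running in parallel with other collections.
[CollectionDefinition(LiveResourcesCollection.Name, DisableParallelization = true)]
public sealed class LiveResourcesCollection
{
    // The literal value is an assumption; only the shared constant matters.
    public const string Name = "Live external resources";
}

// Classes in the same collection never run in parallel with each other,
// so the shared MXAccess, ZB, and GLAuth resources are accessed serially.
[Collection(LiveResourcesCollection.Name)]
public sealed class WorkerLiveMxAccessSmokeTests
{
}

[Collection(LiveResourcesCollection.Name)]
public sealed class GalaxyRepositoryLiveTests
{
}

[Collection(LiveResourcesCollection.Name)]
public sealed class DashboardLdapLiveTests
{
}
```

Test classes without the `[Collection]` attribute keep xUnit's default behavior, which is why this choice leaves the non-live tests free to parallelize.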
+ +### IntegrationTests-008 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.IntegrationTests/LiveLdapFactAttribute.cs`, `src/MxGateway.IntegrationTests/Galaxy/LiveGalaxyRepositoryFactAttribute.cs`, `src/MxGateway.IntegrationTests/LiveMxAccessFactAttribute.cs` | +| Status | Resolved | + +**Description:** Three near-identical fact attributes each re-implement the same "compare env var to `1` with `StringComparison.Ordinal`, set `Skip` otherwise" pattern. `LiveMxAccessFactAttribute` delegates to `IntegrationTestEnvironment` while the other two inline the logic, so the project has two divergent styles for the same concern. + +**Recommendation:** Extract a shared helper (e.g. `IntegrationTestEnvironment.IsEnabled(string variableName)`) and have all three attributes call it. + +**Resolution:** Resolved 2026-05-18: Confirmed — `LiveLdapFactAttribute.Enabled` and `LiveGalaxyRepositoryFactAttribute.Enabled` each inlined the ordinal `== "1"` comparison while `LiveMxAccessFactAttribute` delegated to `IntegrationTestEnvironment`. Added `IntegrationTestEnvironment.IsEnabled(string variableName)` as the single implementation; `LiveMxAccessTestsEnabled`, `LiveLdapFactAttribute.Enabled`, and `LiveGalaxyRepositoryFactAttribute.Enabled` now all call it. Verified by build. + +### IntegrationTests-009 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:372-375` | +| Status | Resolved | + +**Description:** `TestServerCallContext` is XML-documented as a "Mock server call context," but it is a hand-written stub/fake with no mocking framework and no verification behavior. Per the style guides (accurate naming; explain why not what), calling it a mock misleads readers who may expect verifiable interactions. 
+ +**Recommendation:** Reword the summary to "test stub" / "minimal `ServerCallContext` implementation for in-process gRPC calls." + +**Resolution:** Resolved 2026-05-18: Confirmed — the summary read "Mock server call context for testing gRPC calls." Reworded to "Minimal `ServerCallContext` stub for invoking the gRPC service in-process," noting it is a hand-written fake with no verification behavior. No mocking framework is involved; this is a documentation-only fix. Verified by build. + +### IntegrationTests-010 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:366-369` | +| Status | Resolved | + +**Description:** `WaitForFirstMessageAsync` accepts only a `timeout` and never observes a `CancellationToken`. There is no per-test cancellation propagation, so if the gateway/worker hangs without writing an event the test relies solely on the 15s `WaitAsync` timeout and gives no contextual diagnostics. Combined with IntegrationTests-004, a hung live worker produces a bare `TimeoutException`. + +**Recommendation:** Accept a `CancellationToken` (linked to `TestServerCallContext`'s token), pass it to `firstMessage.Task.WaitAsync(timeout, token)`, and on timeout emit the recorded `Messages` count via `output.WriteLine` before throwing. + +**Re-triage:** The named method `WaitForFirstMessageAsync` no longer exists — IntegrationTests-003's resolution renamed/replaced it with `RecordingServerStreamWriter.WaitForMessageAsync(predicate, timeout)`, which scans recorded messages and blocks on a `SemaphoreSlim`. The underlying defect still held: that replacement method also took only a `timeout` and never observed a `CancellationToken`. The finding remains valid (Low, Correctness) against the renamed method; the recommendation's `firstMessage.Task.WaitAsync` detail is stale but the intent (thread a token, surface a count on timeout) is unchanged. 
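
The intent described in the re-triage (scan recorded messages, block on a `SemaphoreSlim`, observe a caller token, and report the scanned count on timeout) can be sketched like this. `RecordingWriter<T>` and its `Record` method are hypothetical simplifications; the real `RecordingServerStreamWriter` implements the gRPC stream-writer interface and may differ in detail.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical, simplified recorder used only to illustrate the wait logic.
public sealed class RecordingWriter<T> : IDisposable
{
    private readonly object _gate = new object();
    private readonly List<T> _messages = new List<T>();
    private readonly SemaphoreSlim _messageArrived = new SemaphoreSlim(0);

    public void Record(T message)
    {
        lock (_gate)
        {
            _messages.Add(message);
        }

        _messageArrived.Release();
    }

    public async Task<T> WaitForMessageAsync(
        Func<T, bool> predicate,
        TimeSpan timeout,
        CancellationToken cancellationToken = default)
    {
        using var timeoutSource = new CancellationTokenSource(timeout);
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(
            timeoutSource.Token, cancellationToken);

        var scanned = 0;
        while (true)
        {
            // Scan only messages not yet inspected, skipping non-matching ones.
            lock (_gate)
            {
                for (; scanned < _messages.Count; scanned++)
                {
                    if (predicate(_messages[scanned]))
                    {
                        return _messages[scanned];
                    }
                }
            }

            try
            {
                await _messageArrived.WaitAsync(linked.Token).ConfigureAwait(false);
            }
            catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
            {
                // Only the timeout fired: surface the scanned count for diagnostics.
                throw new TimeoutException(
                    $"No matching message within {timeout}; scanned {scanned} message(s).");
            }
        }
    }

    public void Dispose() => _messageArrived.Dispose();
}
```

A caller would pass the same token it gave to the stream call, so cancelling the test aborts both the stream and the wait, while a plain timeout still produces a `TimeoutException` that says how many events were seen.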
+ +**Resolution:** Resolved 2026-05-18: Added an optional `CancellationToken` parameter to `WaitForMessageAsync`, linked with the existing timeout source via `CancellationTokenSource.CreateLinkedTokenSource`, so a per-test cancellation aborts the wait promptly. `GatewaySession_WithLiveWorker_RegistersAdvisesStreamsDataAndCloses` now creates a `CancellationTokenSource`, passes its token into the `StreamEvents` `TestServerCallContext` and into `WaitForMessageAsync`, so the stream call and the wait share one cancellation source. On timeout the method already throws a `TimeoutException` whose message includes the scanned message count, satisfying the "emit recorded count" intent (the count surfaces in the test failure rather than via a separate `output.WriteLine`). Verified by build; live tests not executed. + +### IntegrationTests-011 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:236-240`, `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:183-187` | +| Status | Resolved | + +**Description:** The XML/inline comments on the two new MXAccess parity tests misdescribe how the gateway surfaces an MXAccess failure. The invalid-handle test reads "the gateway protocol status is Ok and the failure shows up in hresult / the status proxies — it must not be reported as a transport fault", then asserts `Assert.NotEqual(ProtocolStatusCode.Ok, addItemReply.ProtocolStatus.Code)`. `HResultConverter.CreateProtocolStatus` (`src/MxGateway.Worker/Conversion/HResultConverter.cs:39`) actually sets `Code = ProtocolStatusCode.MxaccessFailure` whenever the COM call throws (HRESULT ≠ 0), so the assertion is correct but the comment is wrong — the protocol status is *not* `Ok` on an MXAccess failure. 
The write-round-trip test carries the same misleading framing on lines 183-187 ("MXAccess parity details … belong in hresult / statuses, not in a transport failure") immediately before asserting `Ok`. A reader can reasonably conclude the gateway always reports `Ok` for round-tripped commands and tweak code accordingly. The intended distinction is "this is not a gRPC transport fault" (the RPC reply still arrives) — the protocol status code carries the MXAccess outcome. + +**Recommendation:** Reword the invalid-handle comment to "the gateway must reply with `ProtocolStatusCode.MxaccessFailure` and a non-zero `Hresult` carrying the COM failure, not a gRPC transport fault." Reword the write-round-trip comment to clarify it is asserting the happy-path Ok and that an MXAccess rejection would surface as `MxaccessFailure` (per `HResultConverter`), not as a `RpcException`. + +**Resolution:** 2026-05-20 — Reworded the invalid-handle test comment to say the gateway must reply with `ProtocolStatusCode.MxaccessFailure` and a non-zero hresult carrying the COM failure (per `HResultConverter`), and reworded the write-round-trip comment to make explicit it is asserting the happy-path Ok while an MXAccess rejection would surface as `MxaccessFailure`, never as an `RpcException`. + +### IntegrationTests-012 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:147-151` | +| Status | Resolved | + +**Description:** `GatewaySession_WithLiveWorker_WritesValueToAdvisedItem` constructs a `RecordingServerStreamWriter` and starts a `StreamEvents` task, then never reads from it and never asserts anything about the recorded messages. 
The test verifies only that the `Write` command round-trips at the protocol level — it does not verify that the worker actually emits any event after the write (for example an `OnWriteComplete`, which is the proof of round-trip used by the cross-language client e2e runner). Because the stream task is started with `new TestServerCallContext()` (no cancellation source), any fault raised by the stream task (an exception from `EventStreamService`, a session-not-found, a backpressure overflow) is swallowed — `streamTask` is later awaited in `ShutDownAsync` only inside a broad `catch (Exception ex)`, which logs and continues. The Write test therefore cannot fail on stream-task faults. Two consequences: (a) the live Write parity coverage promised in IntegrationTests-005 is weaker than it appears, and (b) the fixture (`eventWriter`) is dead code in this test that suggests an assertion was intended. + +**Recommendation:** Either remove the unused `eventWriter`/`StreamEvents` plumbing from the Write test so the test scope matches its assertions, or — preferred — extend the test to wait for an `OnWriteComplete` event for the written item via `eventWriter.WaitForMessageAsync(candidate => candidate.Family == MxEventFamily.OnWriteComplete && candidate.ItemHandle == itemHandle, ...)`, matching the round-trip proof used by `scripts/run-client-e2e-tests.ps1 -VerifyWrite`. + +**Resolution:** Resolved 2026-05-20: Rewrote `GatewaySession_WithLiveWorker_WritesValueToAdvisedItem` so the previously-dead `eventWriter`/`StreamEvents` plumbing actually drives an assertion. The test now waits for an `OnWriteComplete` event matching the Write's (server, item) handle pair via `eventWriter.WaitForMessageAsync` (using `IntegrationTestEnvironment.LiveMxAccessEventTimeout`), and asserts the recorded event's family, session id, and handles — the same round-trip proof the cross-language client e2e runner uses. 
The stream call is now bound to a `CancellationTokenSource` and the test asserts `streamTask.IsFaulted == false` before cleanup. `ShutDownAsync` gained an opt-in `propagateStreamFaults` flag so a faulted `StreamEvents` task is rethrown into the test rather than silently swallowed by the broad cleanup catch; the cancellation token is also signalled before the drain so `StreamEvents` observes a clean shutdown instead of a forced timeout. Verified by build and by confirming the test skips cleanly when `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS` is unset. + +### IntegrationTests-013 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Performance & resource management | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:519-609` | +| Status | Resolved | + +**Description:** `RecordingServerStreamWriter` owns a `SemaphoreSlim messageArrived` (`IDisposable`) but does not itself implement `IDisposable`, so the semaphore is never disposed deterministically and any wait handle it allocates is left for finalization. Each live test allocates one such writer and discards it at scope exit. Live tests are opt-in, so the cumulative leak is bounded, but the type holds an `IDisposable` field, and the standard hygiene under `Directory.Build.props`'s `TreatWarningsAsErrors=true` is to either dispose the field or document why not. CA2213 does not fire because the owner is not itself `IDisposable`; the absence of that analyzer warning is the only reason this is not a build break, and it does not indicate that the leak is acceptable. + +**Recommendation:** Make `RecordingServerStreamWriter` implement `IDisposable`, dispose `messageArrived` in `Dispose`, and create each instance with a `using` declaration (`using RecordingServerStreamWriter eventWriter = new();`). + +**Resolution:** 2026-05-20: `RecordingServerStreamWriter` now implements `IDisposable` and its `Dispose` releases the `messageArrived` semaphore.
All six live tests in `WorkerLiveMxAccessSmokeTests` now allocate the writer with a top-of-method `using` declaration so the semaphore's wait handle is released on scope exit even when the test body throws. + +### IntegrationTests-014 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` | +| Status | Resolved | + +**Description:** IntegrationTests-005 was resolved by adding live coverage for `Write` and an invalid-handle `AddItem`, but its resolution explicitly scoped out the worker-fault/abnormal-exit case and silently dropped `Unadvise`, `RemoveItem`, `Unregister`, `OperationComplete`, and `WriteSecured` ordering. CLAUDE.md singles out `WriteSecured` ("`WriteSecured` failing before a value-bearing NMX body") and `OperationComplete` semantics as parity surprises the gateway must not "fix" — exactly the paths fake-worker tests cannot validate. After this commit the live MXAccess smoke still doesn't exercise any teardown command, the secured-write ordering rule, or a deliberately faulted worker. A regression in any of these would only be caught by manual testing. + +**Recommendation:** Add live MXAccess coverage for the teardown chain (`Unadvise` then `RemoveItem` then `Unregister`, asserting each replies with `ProtocolStatusCode.Ok` and the next operation no longer references the freed handle), and at minimum one `WriteSecured` parity case asserting the documented ordering. A worker-fault test can be deferred to a separate finding once a deterministic COM-crash injection harness exists. + +**Resolution:** Resolved 2026-05-20: Added three new `[LiveMxAccessFact]`-gated tests to `WorkerLiveMxAccessSmokeTests`, all reusing the existing opt-in env var and `ShutDownAsync` cleanup helper. 
(1) `GatewaySession_WithLiveWorker_UnadviseRemoveItemUnregister_TeardownOrderingParity` runs Register → AddItem → Advise → wait for one OnDataChange → UnAdvise → RemoveItem → Unregister, asserting each step replies `Ok` with the matching `MxCommandKind`, that no further OnDataChange events for the un-advised (server, item) pair arrive after a 500 ms settle window, and that a second RemoveItem against the freed handle returns a non-`Ok` MXAccess failure (so a regression that left a stale subscription or accepted a stale handle would surface). (2) `GatewaySession_WithLiveWorker_WriteSecured_AuthenticatedRoundTripParity` resolves an ArchestrA user id via `AuthenticateUser` (credentials env-overridable through `MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_USER` / `..._PASSWORD`, defaulting to the `admin`/`admin123` GLAuth user from `glauth.md`), issues `WriteSecured` against an advised item, and asserts the reply carries `MxCommandKind.WriteSecured`, the protocol status is one of the documented parity outcomes (`Ok` for an unprotected provider, `MxaccessFailure` when the item is not WriteSecured-eligible — never a transport fault), and the credential never leaks into the diagnostic message. (3) `GatewaySession_WithLiveWorker_AbnormalWorkerExit_MarksSessionFaulted` opens a session, kills the worker process tree (via a new `TestWorkerProcessFactory.KillAllAndDetach` helper) without going through CloseSession, and polls the session via a new `GatewayServiceFixture.TryGetSession` accessor until it transitions to `SessionState.Faulted` within the live event timeout; asserts the final state is `Faulted`, that `FinalFault` is non-empty, and that the fault description carries a known worker-client classification (pipe disconnected / worker faulted / heartbeat expired / end-of-stream). `docs/GatewayTesting.md` was updated to list all five parity surfaces and the two new env-var defaults. 
Verified by build and confirmed all six live tests skip cleanly when `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS` is unset. + +### IntegrationTests-015 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:30,119,201`, `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs:13,32,48,67,84`, `src/MxGateway.IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs:10,22,34,52` | +| Status | Resolved | + +**Description:** Every live-test method in the three live classes carries an identical `[Trait("Category", "LiveMxAccess")]` (or `LiveLdap` / `LiveGalaxy`) attribute. The trait is uniform within each class and is exactly the information the `[Collection(LiveResourcesCollection.Name)]` class-level attribute also implies. xUnit applies a class-level `[Trait]` to every test method in the class, so the same metadata can be declared once at class scope. The current shape adds maintenance burden: adding a new test in any of these classes requires remembering to add the trait, and `DashboardLdapLiveTests` alone carries five copies of the same `LiveLdap` line. + +**Recommendation:** Move each `[Trait("Category", ...)]` to the class declaration alongside the existing `[Collection(...)]`, and remove the per-method copies. Verify the trait still surfaces in `dotnet test --filter "Category=LiveLdap"` after the change. + +**Resolution:** 2026-05-20: Lifted `[Trait("Category", "LiveMxAccess")]`, `[Trait("Category", "LiveLdap")]`, and `[Trait("Category", "LiveGalaxy")]` to the class declarations of `WorkerLiveMxAccessSmokeTests`, `DashboardLdapLiveTests`, and `GalaxyRepositoryLiveTests` respectively (alongside the existing `[Collection(LiveResourcesCollection.Name)]`), and removed all per-method duplicates. xUnit propagates class-level traits to every method, so `--filter "Category=LiveLdap"`-style trait filters still match.
+ +### IntegrationTests-016 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.IntegrationTests/Galaxy/LiveGalaxyRepositoryFactAttribute.cs:26`, `src/MxGateway.Server/Galaxy/GalaxyRepositoryOptions.cs:13` | +| Status | Resolved | + +**Description:** The default Galaxy Repository connection string `"Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;"` is duplicated verbatim between the production `GalaxyRepositoryOptions.ConnectionString` initializer and the test-side `LiveGalaxyRepositoryFactAttribute.ConnectionString` fallback. The docs (`docs/GatewayTesting.md`) document the value once and reference it from both places. If the production default changes (e.g. tightening to a named instance, or switching to a SQL-auth template), the test default silently keeps the old string and the live Galaxy tests connect to the wrong server. The drift is invisible to the build. + +**Recommendation:** Expose the production default through a `public const string` on `GalaxyRepositoryOptions` (e.g. `DefaultConnectionString`) and have `LiveGalaxyRepositoryFactAttribute.ConnectionString` read `Environment.GetEnvironmentVariable(ConnectionStringVariableName) ?? GalaxyRepositoryOptions.DefaultConnectionString`. Single source of truth, build-time guarantee they cannot drift. + +**Resolution:** 2026-05-20 — Added `public const string GalaxyRepositoryOptions.DefaultConnectionString` carrying the production default, set the `ConnectionString` initializer to reference it, and changed `LiveGalaxyRepositoryFactAttribute.ConnectionString` to fall back to `GalaxyRepositoryOptions.DefaultConnectionString`. The literal now lives in exactly one place and any future change to the production default propagates to the live-test fallback at compile time. 
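
The single-source-of-truth shape described in this resolution can be sketched as below. This is an illustration under assumptions, not the repository's code: the classes are trimmed to the one relevant member, `LiveGalaxyTestSettings` is a hypothetical stand-in for `LiveGalaxyRepositoryFactAttribute`, and the environment-variable name is invented for the example.

```csharp
using System;

// Trimmed stand-in for the production options class.
public sealed class GalaxyRepositoryOptions
{
    // The literal lives here and nowhere else.
    public const string DefaultConnectionString =
        "Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;";

    public string ConnectionString { get; set; } = DefaultConnectionString;
}

// Hypothetical test-side consumer of the production default.
public static class LiveGalaxyTestSettings
{
    // Assumed name for illustration; the real variable name is not shown in this finding.
    public const string ConnectionStringVariableName = "EXAMPLE_GALAXY_CONNECTION_STRING";

    // An environment override wins; otherwise fall back to the production constant.
    public static string ConnectionString =>
        Environment.GetEnvironmentVariable(ConnectionStringVariableName)
        ?? GalaxyRepositoryOptions.DefaultConnectionString;
}
```

One design note: C# `const` values are inlined into referencing assemblies at compile time, so the test project picks up a changed default only when it is rebuilt. That is the normal case when both projects build from the same solution.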
+ +### IntegrationTests-017 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:350-407` | +| Status | Resolved | + +**Description:** `GatewaySession_WithLiveWorker_UnadviseRemoveItemUnregister_TeardownOrderingParity` proves the subscription is live by waiting for one matching `OnDataChange`, snapshots `dataChangeCountBeforeUnadvise`, then sends `UnAdvise`, waits 500 ms, snapshots `dataChangeCountAfterTeardown`, and asserts strict equality. The assertion races against the natural cadence of the live MXAccess provider: + +1. After `WaitForMessageAsync` returns the first match, any additional `OnDataChange` for the same `(serverHandle, itemHandle)` published by the provider before the worker processes `UnAdvise` is delivered into the recording writer. +2. The snapshot at line 362 is taken *immediately before* the `UnAdvise` command is sent (line 370). Events that arrive in the window between that snapshot and the worker processing `UnAdvise` (network round-trip + STA dispatch + worker pipe send + gateway channel write) are racing in — they are not "after UnAdvise" but they will be in the post-teardown snapshot. +3. `MXAccess` providers can publish `OnDataChange` at sub-second cadence; the strict-equality assertion has no slack for in-flight events. + +The test passes today only because the chosen test item (`TestChildObject.TestInt`) likely changes value rarely. Against a more active item — or on a slower machine where the round-trip widens — the assertion would flap. The intent ("no further events *after the worker stops the subscription*") would be better expressed by capturing the snapshot after `UnAdvise` returns `Ok` rather than before it is issued. 
+ +**Recommendation:** Move the "before" snapshot to immediately *after* `UnAdvise` returns `Ok` (the point past which the parity rule applies), or weaken the assertion to "no events with a `WorkerSequence` strictly greater than the last sequence observed within `dataChangeCountBeforeUnadvise + N` events arrived in the post-teardown drain" where `N` accounts for the documented in-flight window. Either change moves the test from racing on provider cadence to verifying the actual parity rule. + +**Resolution:** 2026-05-20 — Removed the `dataChangeCountBeforeUnadvise` snapshot taken just before the `UnAdvise` command (the source of the race) and replaced the strict-equality assertion against a pre-teardown count with a two-window stability check taken *after* the teardown chain completes. The test now waits one 500 ms settle window for in-flight `OnDataChange` events (which the provider already published before the worker acknowledged `UnAdvise`) to drain, captures `dataChangeCountAfterFirstSettle`, waits another 500 ms, and asserts the count is unchanged in `dataChangeCountAfterSecondSettle`. The parity rule under test ("no further `OnDataChange` after the worker stops the subscription") is now expressed as steadiness across the post-teardown window rather than equality with a count snapshotted during the round-trip race, so a slower machine or a more active test item no longer flaps the assertion while a genuine regression (a stale subscription continuing to fire) still surfaces as a count drift between the two settles. 
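
The two-window stability check from this resolution might look roughly like the helper below. It is a sketch under assumptions: `SettleCheck` and `IsStableAfterSettleAsync` are hypothetical names, and `readCount` stands in for however the test counts recorded `OnDataChange` events.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper illustrating the post-teardown stability check.
public static class SettleCheck
{
    // Returns true when the counter does not move during the second window.
    public static async Task<bool> IsStableAfterSettleAsync(
        Func<int> readCount, TimeSpan settleWindow)
    {
        // First window: let events that were already in flight drain.
        await Task.Delay(settleWindow).ConfigureAwait(false);
        int afterFirstSettle = readCount();

        // Second window: a live subscription would keep incrementing the count.
        await Task.Delay(settleWindow).ConfigureAwait(false);
        int afterSecondSettle = readCount();

        return afterFirstSettle == afterSecondSettle;
    }
}
```

The check trades precision for robustness. It no longer depends on when the pre-teardown snapshot is taken, but a stale subscription on an item that changes less often than one settle window could still go unnoticed, so the window should be chosen with the test item's update cadence in mind.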
+ +### IntegrationTests-018 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:1037`, `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:595` | +| Status | Resolved | + +**Description:** `GatewayServiceFixture.TryGetSession(string sessionId, out GatewaySession session)` declares `session` as a non-nullable `GatewaySession`, but its implementation is `_registry.TryGet(sessionId, out session)`, which (in `SessionRegistry.cs:43`) uses `session!` to silence the nullability warning on `Dictionary.TryGetValue`. On a `false` return the `out` parameter is null, contradicting the non-nullable annotation. The caller at line 595 binds it as `out GatewaySession? session`, which compiles only because non-nullable-to-nullable variance is permitted — but no callsite tooling will warn that a `false` return yields a null value through what the fixture's contract describes as non-nullable. The repo enforces `Nullable=enable;TreatWarningsAsErrors=true` (`src/Directory.Build.props`), so the convention is for `TryX` patterns to either annotate the out as `T?` or to mirror BCL `Dictionary.TryGetValue` (which uses `[MaybeNullWhen(false)] out TValue`). + +**Recommendation:** Change the signature to `public bool TryGetSession(string sessionId, [MaybeNullWhen(false)] out GatewaySession session)` (mirroring BCL `TryGetValue`) and propagate the same annotation down through `ISessionRegistry.TryGet` / `SessionRegistry.TryGet` so the `session!` fudge can be removed. The call sites already treat the parameter as nullable; aligning the declaration removes the silent contract gap. + +**Resolution:** 2026-05-20 — Propagated `[MaybeNullWhen(false)]` through the entire `TryGet*` chain. 
`GatewayServiceFixture.TryGetSession`, `ISessionManager.TryGetSession` / `SessionManager.TryGetSession`, and `ISessionRegistry.TryGet`/`TryRemove` plus their `SessionRegistry` implementations now carry the BCL `Dictionary.TryGetValue`-style annotation, and the `session!` null-forgiving operator inside `SessionRegistry.TryGet` / `TryRemove` was removed because the annotation makes it redundant. Existing in-tree callers (`SessionManagerTests.cs` line 28) were updated to `out GatewaySession?` to match. The compiler now warns at callsites that read `session` without checking the boolean return, closing the silent contract gap. Verified by `dotnet build src/MxGateway.IntegrationTests/...` (0 warnings), `dotnet build src/MxGateway.Tests/...` (0 warnings), and `dotnet test src/MxGateway.Tests/...` (479 passed). + +### IntegrationTests-019 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Security | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:497-534` | +| Status | Resolved | + +**Description:** `GatewaySession_WithLiveWorker_WriteSecured_AuthenticatedRoundTripParity` resolves a credential pair via `ResolveLiveMxAccessSecuredCredentials`, passes the password into `AuthenticateUser` and `WriteSecured`, and asserts the password is absent from `writeSecuredReply.DiagnosticMessage`. CLAUDE.md's secret-handling rule is broader: "API keys, passwords, `WriteSecured` payloads, and `AuthenticateUser` credentials must never reach logs." The test's assertion covers only one of the surfaces the rule protects: + +- The `TestOutputLoggerProvider` writes every gateway-side `ILogger` entry to `ITestOutputHelper`. A regression that logged the request body (or the `WorkerCommandRequest` envelope) would put the password into test output without failing this test. +- `WriteWorkerOutput` echoes worker `stdout`/`stderr` lines to `ITestOutputHelper`. A worker-side regression that printed the credential (e.g. 
a debug log added to `MxAccessCommandExecutor`) would land in test output without failing this test. +- `output.WriteLine(...)` calls in the test body (`AuthenticateUser status=... user_id=...` and `LogReply("WriteSecured", ...)`) currently don't include the request body, but a future maintenance change that printed `command.WriteSecured.Value` or a similar struct dump would silently leak the credential past the existing assertion. + +Because `ITestOutputHelper` doesn't expose its accumulated text to the test, the assertion can only be made by buffering output through a recording sink the test owns. + +**Recommendation:** Replace the bare `ITestOutputHelper` injection (just for the WriteSecured test, or for all live MXAccess tests) with a recording wrapper that mirrors writes both to the xUnit output and to a `StringBuilder`. At the end of the test, assert the buffer does not contain `verifyPassword`. This makes the credential-redaction contract a property of the entire test run, not just the one explicit field. Alternative: route the test through `GatewayLogRedactor` (`src/MxGateway.Server/Diagnostics/GatewayLogRedactor.cs`) so the credential-bearing commands are redacted at the logger sink the test sees. + +**Resolution:** 2026-05-20 — Added a `RecordingTestOutputHelper` private class that implements `ITestOutputHelper`, mirrors every line to the wrapped xUnit sink, and accumulates the same text into a `StringBuilder` exposed via a `Captured` property. The WriteSecured parity test now constructs this wrapper, passes it to both `TestWorkerProcessFactory` (so worker `stdout`/`stderr` lines flow through it) and `GatewayServiceFixture` (so the `TestOutputLoggerProvider`'s gateway-`ILogger` entries flow through it), and uses it for every direct `WriteLine`. A new `LogReplyTo(ITestOutputHelper sink, …)` static helper underpins the existing `LogReply` instance method so the test body can route reply logging through the recording wrapper. 
After the cleanup `finally` block completes, the test asserts `recordedOutput.Captured` does not contain the verify password. The credential-redaction contract is now enforced across the gateway-logger sink, worker stdout/stderr echo, and every test-body `WriteLine` — a future regression that dumped the request body, the `WorkerCommandRequest` envelope, or the `WriteSecured` payload would land in the buffer and fail the assertion. + +### IntegrationTests-020 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:616-622` | +| Status | Resolved | + +**Description:** `GatewaySession_WithLiveWorker_AbnormalWorkerExit_MarksSessionFaulted` asserts the `FinalFault` description contains at least one of these substrings (case-insensitive): `disconnect`, `pipe`, `heartbeat`, `worker`, `end of stream`. The intent (per the IntegrationTests-014 resolution prose) is to verify the gateway surfaces "a known worker-client classification". The `"worker"` substring defeats that intent — the gateway routes through `SetFaulted` with messages like *"Worker pipe disconnected."*, *"Worker shutdown timed out."*, *"Worker was killed by the gateway: …"*, *"Worker heartbeat expired. …"*, *"Worker event channel rejected an event."*, *"Worker pipe write failed."*, *"Worker read loop failed."*, *"Worker sent unexpected envelope body …"* — every classification message begins with the word "Worker". A regression that introduced an entirely new fault path with a generic message containing the word *Worker* would still pass this test. + +CLAUDE.md singles out "abnormal exit" as one of the parity surfaces (`SessionState.Faulted` with an actionable cause), so the test's documented value is verifying *which* of the WorkerClient error codes drove the transition. Today the assertion is effectively `Assert.NotEmpty(observedFault)`. 
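
To make the weakness concrete, the sketch below runs two of the fault messages quoted in this finding through the original keyword set and through a narrower set of the kind the recommendation for this finding proposes. `FaultKeywordCheck` is a hypothetical helper, not the test's actual assertion code, and the example assumes a modern .NET target where `string.Contains(string, StringComparison)` is available.

```csharp
using System;
using System.Linq;

// Hypothetical helper mirroring a case-insensitive "contains any keyword" assertion.
public static class FaultKeywordCheck
{
    public static bool ContainsAny(string fault, params string[] keywords) =>
        keywords.Any(k => fault.Contains(k, StringComparison.OrdinalIgnoreCase));
}

public static class FaultKeywordDemo
{
    public static void Main()
    {
        string[] loose = { "disconnect", "pipe", "heartbeat", "worker", "end of stream" };
        string[] tight = { "pipe disconnected", "end of stream" };

        string unrelatedFault = "Worker shutdown timed out."; // not an abnormal-exit classification
        string expectedFault = "Worker pipe disconnected.";   // what the kill path should produce

        Console.WriteLine(FaultKeywordCheck.ContainsAny(unrelatedFault, loose)); // True  (matches "worker")
        Console.WriteLine(FaultKeywordCheck.ContainsAny(unrelatedFault, tight)); // False
        Console.WriteLine(FaultKeywordCheck.ContainsAny(expectedFault, tight));  // True
    }
}
```

With the loose set, any message that begins with "Worker" satisfies the check, which is why the assertion cannot tell fault classifications apart.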
+ +**Recommendation:** Tighten the keyword set to the specific classifications the abnormal-exit (kill-the-process) path actually drives — `PipeDisconnected` ("pipe", "disconnect") and `EndOfStream` ("end of stream"). Drop the broad `"worker"` term, and drop `"heartbeat"` unless the test deliberately covers the heartbeat path too (it does not — `HeartbeatGraceSeconds = 15` and the poll deadline is `StreamShutdownTimeout = 10` seconds, so a heartbeat-expired transition is impossible inside the wait window). If a more exhaustive matrix is wanted, assert `FinalFault.StartsWith("Worker pipe disconnected")` against the message constant in `WorkerClient.cs:380` so a rename surfaces as a compile-time / test-time failure. + +**Resolution:** 2026-05-20 — Tightened the keyword set to the specific classifications the kill-the-process path actually drives. The assertion now requires the `FinalFault` description to contain `"pipe disconnected"` (matching the `WorkerClient.cs:378-381` `WorkerFrameProtocolErrorCode.EndOfStream` → `WorkerClientErrorCode.PipeDisconnected` → `"Worker pipe disconnected."` message) or `"end of stream"`, dropping the broad `"worker"` term that previously matched every `WorkerClient` fault message (all of which begin with "Worker"), and dropping `"heartbeat"` because the test's `StreamShutdownTimeout` (10 s) is below `HeartbeatGraceSeconds` (15 s) so a heartbeat-expired transition cannot occur inside the poll window. A regression that routed an unrelated fault classification through the abnormal-exit path would now fail loudly instead of silently passing. + +### IntegrationTests-021 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:579-622` | +| Status | Resolved | + +**Description:** The abnormal-exit test only polls `session.State` and `session.FinalFault`. It does not assert anything about `streamTask` after the kill. 
The chain that puts the session into `Faulted` is: the read loop hits EOS → `SetFaulted(PipeDisconnected, …)` → `_events.Writer.TryComplete(fault)` → `ReadEventsAsync` propagates the `WorkerClientException` → `EventStreamService.ProduceEventsAsync`'s `catch (Exception exception) when exception is WorkerClientException` calls `session.MarkFaulted(exception.Message)`. The test verifies the *end state* of that chain but not that the `StreamEvents` call is what produced the transition. If a future change moved the `MarkFaulted` call somewhere else (a session-manager background watcher, for example), the test would still pass — but the stream task could now silently swallow the fault. A direct assertion that `streamTask.IsFaulted` (or that awaiting it throws a `WorkerClientException`) would protect that contract. + +The Write parity test (IntegrationTests-012's resolution) added exactly this assertion (`Assert.False(streamTask.IsFaulted, …)`). The abnormal-exit test should add the inverse: the stream task *must* be faulted (or at least completed with a `WorkerClientException`) after the kill. + +**Recommendation:** After the session-state poll succeeds, assert `streamTask.IsCompleted` (the channel has terminated) and inspect `streamTask.Exception?.InnerException` for a `WorkerClientException` (or assert `streamTask.IsFaulted` and await with `ShouldThrowAsync`). This couples the test to the actual fault-propagation path and prevents a future refactor that bypasses the stream from quietly weakening the coverage. Compare to the existing `Assert.False(streamTask.IsFaulted, …)` on line 217 — the abnormal-exit case wants the opposite assertion. + +**Resolution:** 2026-05-20 — After the session-state poll loop confirms `SessionState.Faulted`, the test now awaits `streamTask.WaitAsync(StreamShutdownTimeout)` (with a try/catch that logs the surfaced exception type/message), then asserts `streamTask.IsCompleted` and `streamTask.IsFaulted`. 
This couples the test to the actual fault-propagation chain — read loop hits EndOfStream → `WorkerClient.SetFaulted(PipeDisconnected, …)` → `ReadEventsAsync` propagates the fault → `EventStreamService` calls `session.MarkFaulted` → `MxAccessGatewayService.StreamEvents` re-throws the mapped `RpcException`. A future refactor that moved `MarkFaulted` off the stream-consumption path would leave `streamTask` completing cleanly, which the new `IsFaulted` assertion would now catch (inverse of the existing `Assert.False(streamTask.IsFaulted, …)` in the Write parity test on line 217). The inner-exception type assertion was deliberately omitted because the gateway maps `WorkerClientException` to `RpcException` at the public boundary (`MxAccessGatewayService.MapWorkerClientException`); asserting on the surface type alone would be brittle, while the `IsFaulted` check directly tests the contract the recommendation is protecting. + +### IntegrationTests-022 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/ZB.MOM.WW.MxGateway.IntegrationTests/IntegrationTestEnvironment.cs:103-138` (`ResolveRepositoryRoot` / `IsRepositoryRoot`) | +| Status | Open | + +**Description:** The walker introduced in `dc9c0c9` searches parents for a directory containing `src/` plus either `.git` or a `*.slnx`/`*.sln` file, falling back to `Directory.GetCurrentDirectory()` when nothing matches. The fallback masks misconfiguration: a test run from a subdirectory of an unpacked tree without a `.git` or `.slnx` marker silently uses the wrong root and then the live `MXGATEWAY_LIVE_MXACCESS_WORKER_EXE` lookup produces a misleading "worker exe not found" error pointing at a fabricated path. The current `bin/` layout makes this unlikely in practice (the test host sets a stable working directory), but the failure mode would be confusing if it triggered. 
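
A fail-fast variant of the walker, in the spirit of the recommendation for this finding, could be sketched as follows. `RepositoryRootLocator` is a hypothetical name and the marker logic is reconstructed from the description above (a `src` directory plus `.git` or a `*.slnx`/`*.sln` file). Treating `.git` as either a directory or a file is an added assumption to cover worktrees.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical fail-fast replacement for a silent current-directory fallback.
public static class RepositoryRootLocator
{
    public static string Resolve(string startDirectory)
    {
        var searched = new List<string>();

        // Walk from the start directory up to the filesystem root.
        for (var dir = new DirectoryInfo(startDirectory); dir != null; dir = dir.Parent)
        {
            searched.Add(dir.FullName);
            if (IsRepositoryRoot(dir.FullName)) return dir.FullName;
        }

        // No fallback: name what was expected and where the walk looked.
        throw new InvalidOperationException(
            "Repository root not found. Expected a directory containing 'src' plus "
            + "'.git' or a *.slnx/*.sln file. Searched: " + string.Join(", ", searched));
    }

    private static bool IsRepositoryRoot(string path) =>
        Directory.Exists(Path.Combine(path, "src"))
        && (Directory.Exists(Path.Combine(path, ".git"))
            || File.Exists(Path.Combine(path, ".git")) // assumption: worktrees use a .git file
            || Directory.EnumerateFiles(path, "*.slnx").Any()
            || Directory.EnumerateFiles(path, "*.sln").Any());
}
```

Listing the searched directories in the exception message turns a misleading downstream "worker exe not found" error into a direct statement of the misconfiguration.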
+ +**Recommendation:** Throw a clear `InvalidOperationException` from `ResolveRepositoryRoot` when the walk reaches the filesystem root without finding a repository root, naming the directories searched and the markers expected. The opt-in env var (`MXGATEWAY_LIVE_MXACCESS_WORKER_EXE`) remains the escape hatch for unusual deployments. + +**Resolution:** _(empty until closed)_ + +### IntegrationTests-023 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/ZB.MOM.WW.MxGateway.IntegrationTests/DashboardLdapLiveTests.cs:14-29` | +| Status | Open | + +**Description:** `AuthenticateAsync_AdminInGwAdminGroup_Succeeds` asserts the principal carries an `LdapGroupClaimType` claim containing `GwAdmin` (lines 26-28). After the `27ed651` refactor, `DashboardAuthenticator.CreatePrincipal` also emits a `ClaimTypes.Role` claim with the mapped role (`Admin` for the seeded `GwAdmin` → `Admin` mapping). The test asserts the LDAP-group claim but not the role claim, so a regression that stopped emitting `Role: Admin` (e.g. a future refactor of `MapGroupsToRoles` that returned an empty list) would go undetected. + +**Recommendation:** Add an `Assert.Contains` asserting the principal holds a `ClaimTypes.Role` claim with value `DashboardRoles.Admin` (or `"Admin"`). Mirror the style of the existing `LdapGroupClaimType` assertion. + +**Resolution:** _(empty until closed)_ + +### IntegrationTests-024 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/ZB.MOM.WW.MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` (`NullDashboardEventBroadcaster` private class at end of file) | +| Status | Open | + +**Description:** The inline `NullDashboardEventBroadcaster` private class is the third copy in the repository (the other two live in `EventStreamServiceTests` and `GatewayEndToEndFakeWorkerSmokeTests` under the Tests module; see Tests-025).
Each carries a singleton `Instance` field and a no-op `Publish`. While the integration-tests project doesn't directly share test-support code with the Tests project, a duplicate-everywhere pattern reads as a code smell. + +**Recommendation:** When Tests-025 lands the shared `TestSupport/NullDashboardEventBroadcaster.cs`, either reference the same shared helper from this project (a project reference if practical) or record the duplication as a deliberate isolation between the unit-test and integration-test trees. Either choice is fine; what should not persist is the current undocumented duplication. + +**Resolution:** _(empty until closed)_ diff --git a/code-reviews/README.md b/code-reviews/README.md new file mode 100644 index 0000000..1f975dc --- /dev/null +++ b/code-reviews/README.md @@ -0,0 +1,313 @@ +# Code Reviews + + + +Cross-module code review index for the `mxaccessgw` codebase. The review process is defined in [../REVIEW-PROCESS.md](../REVIEW-PROCESS.md). + +Each module's `findings.md` is the source of truth; this file is generated from them by `regen-readme.py` and must not be edited by hand.
+
+## Module status
+
+| Module | Reviewer | Date | Commit | Status | Open | Total |
+|---|---|---|---|---|---|---|
+| [Client.Dotnet](Client.Dotnet/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 17 |
+| [Client.Go](Client.Go/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 21 |
+| [Client.Java](Client.Java/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 5 | 31 |
+| [Client.Python](Client.Python/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 21 |
+| [Client.Rust](Client.Rust/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 1 | 21 |
+| [Contracts](Contracts/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 2 | 17 |
+| [IntegrationTests](IntegrationTests/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 3 | 24 |
+| [Server](Server/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 8 | 43 |
+| [Tests](Tests/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 2 | 26 |
+| [Worker](Worker/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 25 |
+| [Worker.Tests](Worker.Tests/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 30 |
+
+## Pending findings
+
+Findings with status `Open` or `In Progress`, ordered by severity.
+
+| ID | Severity | Category | Location | Description |
+|---|---|---|---|---|
+| Client.Java-027 | Medium | Documentation & comments | `clients/java/README.md:36,107-175,185,205,220`, `clients/java/JavaClientDesign.md:195-211` | Commit `397d3c5` renamed the gradle subprojects to `zb-mom-ww-mxgateway-client` and `zb-mom-ww-mxgateway-cli` in `settings.gradle`, but did not propagate that rename into the README's documented gradle commands or into `JavaClientDesign.md… |
+| Client.Java-028 | Medium | Documentation & comments | `clients/java/JavaClientDesign.md:23-27` | The build-layout block in `JavaClientDesign.md` still shows the old Java package paths `com/dohertylan/mxgateway/client/` and `com/dohertylan/mxgateway/cli/`. The actual source tree was moved to `com/zb/mom/ww/mxgateway/{client,cli}/` in c… |
+| Server-031 | Medium | Concurrency & thread safety | `src/MxGateway.Server/Workers/WorkerClient.cs:392-422` (gateway-side heartbeat watchdog); `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:588-617` (worker-side heartbeat loop); `src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:14,67-76` (shared `_writeLock`) | Surfaced during the 2026-05-20 cross-language e2e re-run against gateway `b794c46`. The .NET phase succeeded through `open-session`/`register`/`bulk-subscribe`/`bulk-read`/`bulk-unsubscribe`/`stream-events`/`write` but then failed on its t… |
+| Server-032 | Medium | Error handling & resilience | `src/MxGateway.Server/Workers/WorkerClient.cs:70-77,463-484` (gateway-side `_events` channel); `src/MxGateway.Server/Configuration/EventOptions.cs:8` (default capacity 10,000); `src/MxGateway.Server/Grpc/EventStreamService.cs` (consumer) | Surfaced during the 2026-05-20 cross-language e2e re-run against gateway `b794c46`. The Java phase advised ~55 items (`item-handle 63`) before failing on the next `advise` call with the Server-030 diagnostic `Session ... is not ready. Sess… |
+| Server-038 | Medium | Security | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs:23-44` | `EventsHub` is gated by `[Authorize(Policy = DashboardAuthenticationDefaults.HubClientsPolicy)]`, which checks only that the caller carries a dashboard role (Admin or Viewer). `SubscribeSession(sessionId)` accepts any non-empty session id… |
+| Tests-026 | Medium | Testing coverage | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs`, `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:123-126` | The new `IDashboardEventBroadcaster` is wired into `EventStreamService` at line 123 (commit `d692232`) and the broadcaster's `Publish` is the only path that mirrors per-session events into the dashboard `EventsHub`. The unit tests inject `… |
+| Client.Java-029 | Low | Documentation & comments | `clients/java/README.md:208-209` | The packaging section states "The library jar is under `zb-mom-ww-mxgateway-client/build/libs`. The installed CLI distribution is under `zb-mom-ww-mxgateway-cli/build/install/mxgateway-cli`." The library-jar path is correct, but the instal… |
+| Client.Java-030 | Low | Testing coverage | `clients/java/zb-mom-ww-mxgateway-client/src/test/java/com/zb/mom/ww/mxgateway/client/` | Commit `397d3c5` added the missing `QueryActiveAlarmsRequest` proto message and the corresponding `rpc QueryActiveAlarms` to `mxaccess_gateway.proto`. The Java client now generates the request type and the gRPC stub method, and `MxGatewayC… |
+| Client.Java-031 | Low | mxaccessgw conventions | `clients/java/README.md:13,17,26` | The README prose at lines 13–26 introduces the subprojects as `mxgateway-client` and `mxgateway-cli` (the old short names) when discussing the layout. Those are no longer the actual subproject names — `settings.gradle` declares `zb-mom-ww-… |
+| Client.Rust-021 | Low | Design-document adherence | `clients/rust/RustClientDesign.md:14-33` | The crate-name change in commit `397d3c5` (top-level `mxgateway-client` → `zb-mom-ww-mxgateway-client`) is reflected in `Cargo.toml`, `Cargo.lock`, every `use zb_mom_ww_mxgateway_client::` import, and `build.rs`. The "Recommended layout" b… |
+| Contracts-016 | Low | Code organization & conventions | `src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_gateway.proto:31-41` (`QueryActiveAlarmsRequest`) | The new public message `QueryActiveAlarmsRequest` (added in commit `397d3c5`) has `session_id = 1` with a comment "session_id is currently unused (the snapshot is session-less) but reserved so a future per-session view can be added without… |
+| Contracts-017 | Low | Documentation & comments | `src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_gateway.proto:23-29` (the `rpc QueryActiveAlarms` block) | The RPC comment on `QueryActiveAlarms` describes the stream order ("Point-in-time snapshot of the currently-active alarm set served from the gateway's always-on alarm monitor cache") and the session-less semantic, but does not mention that… |
+| IntegrationTests-022 | Low | Code organization & conventions | `src/ZB.MOM.WW.MxGateway.IntegrationTests/IntegrationTestEnvironment.cs:103-138` (`ResolveRepositoryRoot` / `IsRepositoryRoot`) | The walker introduced in `dc9c0c9` searches parents for a directory containing `src/` plus either `.git` or a `*.slnx`/`*.sln` file, falling back to `Directory.GetCurrentDirectory()` when nothing matches. The fallback masks misconfiguratio… |
+| IntegrationTests-023 | Low | Testing coverage | `src/ZB.MOM.WW.MxGateway.IntegrationTests/DashboardLdapLiveTests.cs:14-29` | `AuthenticateAsync_AdminInGwAdminGroup_Succeeds` asserts the principal carries an `LdapGroupClaimType` claim containing `GwAdmin` (line 26-28). After the `27ed651` refactor, `DashboardAuthenticator.CreatePrincipal` also emits a `ClaimTypes… |
+| IntegrationTests-024 | Low | Code organization & conventions | `src/ZB.MOM.WW.MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` (`NullDashboardEventBroadcaster` private class at end of file) | The inline `NullDashboardEventBroadcaster` private class is the third copy in the repository (the other two live in `EventStreamServiceTests` and `GatewayEndToEndFakeWorkerSmokeTests` under the Tests module — see Tests-025). Each carries a… |
+| Server-039 | Low | Error handling & resilience | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs:37-58` | `HubTokenService.Validate` deserializes the protected JSON payload and trusts `payload.Roles` even when `payload.Name` and `payload.NameIdentifier` are both `null`. The resulting `ClaimsPrincipal` has the `MxGateway.Dashboard.HubToken` sch… |
+| Server-040 | Low | Code organization & conventions | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardAuthenticator.cs:140-160` (`MapGroupsToRoles`) | `MapGroupsToRoles` checks each LDAP group against the role map twice — first by the full group string, then by `ExtractFirstRdnValue(group)` — and `TryGetValue` short-circuits on the first hit. The precedence ("full match wins over RDN mat… |
+| Server-041 | Low | Design-document adherence | `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:123-126`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/IDashboardEventBroadcaster.cs:6-10` | `IDashboardEventBroadcaster.Publish` is documented as "Implementations must never throw — broadcast failures are best-effort and must not disrupt the source gRPC stream." `EventStreamService` honors that contract by passing the call throug… |
+| Server-042 | Low | Performance & resource management | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardSnapshotPublisher.cs:18-41` | `DashboardSnapshotPublisher.ExecuteAsync` reads from `IDashboardSnapshotService.WatchSnapshotsAsync` inside an outer `try` that catches `OperationCanceledException` only. A failure inside `WatchSnapshotsAsync` (e.g. the snapshot service th… |
+| Server-043 | Low | Documentation & comments | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs:1`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardServiceCollectionExtensions.cs:24` | `HubTokenService` is registered as a singleton (good — data protection providers are thread-safe and a single protector instance is correct) and shared by both `DashboardHubConnectionFactory` (per-circuit scoped, mints fresh tokens from th… |
+| Tests-025 | Low | Code organization & conventions | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs:285-289`, `src/ZB.MOM.WW.MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:417-421` | Commit `d692232` widened the `EventStreamService` constructor with an `IDashboardEventBroadcaster` parameter. Two test files now carry an identical `private sealed class NullDashboardEventBroadcaster : IDashboardEventBroadcaster` with a si… |
+
+## Closed findings
+
+Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
+
+| ID | Severity | Status | Category | Location |
+|---|---|---|---|---|
+| Server-001 | Critical | Resolved | Security | `src/MxGateway.Server/GatewayApplication.cs:147-149`, `src/MxGateway.Server/Dashboard/DashboardEndpointRouteBuilderExtensions.cs:55-58`, `src/MxGateway.Server/Dashboard/Components/Routes.razor:1-15` |
+| Client.Go-001 | High | Resolved | Correctness & logic bugs | `clients/go/mxgateway/errors.go:88-93`, `clients/go/mxgateway/errors.go:117-128` |
+| Client.Java-013 | High | Resolved | Testing coverage | `clients/java/mxgateway-cli/src/test/java/com/dohertylan/mxgateway/cli/MxGatewayCliTests.java:212-304`, `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:1214-1244` |
+| Client.Python-018 | High | Resolved | Code organization & conventions | `clients/python/pyproject.toml:11` |
+| Client.Rust-001 | High | Resolved | mxaccessgw conventions | `clients/rust/src/options.rs:98,143` |
+| Client.Rust-002 | High | Resolved | mxaccessgw conventions | `clients/rust/src/session.rs:522` |
+| Client.Rust-003 | High | Resolved | Correctness & logic bugs | `clients/rust/crates/mxgw-cli/src/main.rs:1051` |
+| Client.Rust-012 | High | Resolved | mxaccessgw conventions | `clients/rust/src/galaxy.rs:282` |
+| Client.Rust-013 | High | Resolved | mxaccessgw conventions | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:414-424` (origin); `clients/rust/src/generated.rs:11-31` (suppression site) |
+| IntegrationTests-001 | High | Resolved | Design-document adherence | `src/MxGateway.IntegrationTests/Galaxy/LiveGalaxyRepositoryFactAttribute.cs:7`, `src/MxGateway.IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs` |
+| IntegrationTests-002 | High | Resolved | Design-document adherence | `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs:13`, `src/MxGateway.Server/Configuration/LdapOptions.cs:27` |
+| Server-003 | High | Resolved | Security | `src/MxGateway.Server/Dashboard/DashboardAuthorizationHandler.cs:39,54-59`, `src/MxGateway.Server/Dashboard/DashboardAuthenticator.cs:236-258` |
+| Server-017 | High | Resolved | Security | `src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:13-27`, `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:173-247`, `docs/Authorization.md:108-110` |
+| Tests-001 | High | Resolved | Testing coverage | `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:483-489` |
+| Tests-002 | High | Resolved | Security | `src/MxGateway.Tests/Gateway/Grpc/GalaxyRepositoryGrpcServiceTests.cs:198-210` |
+| Worker-001 | High | Resolved | Concurrency & thread safety | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:204-207` |
+| Worker-002 | High | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:545-549` |
+| Worker-003 | High | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:399-403`, `:416-419` |
+| Worker.Tests-001 | High | Resolved | Testing coverage | `src/MxGateway.Worker.Tests/Sta/` (no `StaMessagePumpTests.cs`) |
+| Worker.Tests-002 | High | Resolved | Testing coverage | `src/MxGateway.Worker.Tests/MxAccess/MxAccessStaSessionTests.cs`, `src/MxGateway.Worker.Tests/MxAccess/MxAccessEventMapperTests.cs` |
+| Client.Dotnet-001 | Medium | Resolved | Error handling & resilience | `clients/dotnet/MxGateway.Client/GrpcMxGatewayClientTransport.cs:190-199`, `clients/dotnet/MxGateway.Client/GrpcGalaxyRepositoryClientTransport.cs:131-140` |
+| Client.Dotnet-002 | Medium | Resolved | Testing coverage | `clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:145-148`, `clients/dotnet/MxGateway.Client.Tests/MxGatewayClientSessionTests.cs:236-256` |
+| Client.Dotnet-003 | Medium | Resolved | Concurrency & thread safety | `clients/dotnet/MxGateway.Client/MxGatewaySession.cs:659-663`, `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:230-240` |
+| Client.Go-002 | Medium | Resolved | Error handling & resilience | `clients/go/mxgateway/session.go:440-516` |
+| Client.Go-003 | Medium | Resolved | Correctness & logic bugs | `clients/go/cmd/mxgw-go/main.go:517-532` |
+| Client.Java-001 | Medium | Resolved | Security | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySecrets.java:30-32` |
+| Client.Java-002 | Medium | Resolved | Concurrency & thread safety | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:31,66-92` |
+| Client.Java-003 | Medium | Resolved | mxaccessgw conventions | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:119-140` |
+| Client.Java-004 | Medium | Resolved | Correctness & logic bugs | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySession.java:114-120,157-163,191-197` |
+| Client.Java-005 | Medium | Resolved | Error handling & resilience | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySession.java:92-105` |
+| Client.Java-014 | Medium | Resolved | Concurrency & thread safety | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:59-65,117-124` |
+| Client.Java-015 | Medium | Resolved | Concurrency & thread safety | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayChannels.java:112-138`, `MxGatewayClient.java:183-191,224-232,322-329`, `GalaxyRepositoryClient.java:164-170,212-214` |
+| Client.Java-021 | Medium | Resolved | Concurrency & thread safety | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/DeployEventStream.java:96-135` |
+| Client.Python-003 | Medium | Resolved | Error handling & resilience | `clients/python/src/mxgateway/client.py:125-137,155-173` |
+| Client.Python-005 | Medium | Resolved | Performance & resource management | `clients/python/src/mxgateway/galaxy.py:117-140` |
+| Client.Python-009 | Medium | Resolved | Testing coverage | `clients/python/tests/` |
+| Client.Python-013 | Medium | Resolved | Security | `clients/python/src/mxgateway_cli/commands.py:757-762` |
+| Client.Rust-005 | Medium | Resolved | Correctness & logic bugs | `clients/rust/src/session.rs:489-520` |
+| Client.Rust-006 | Medium | Resolved | Error handling & resilience | `clients/rust/src/session.rs:531-555` |
+| Client.Rust-015 | Medium | Resolved | Error handling & resilience | `clients/rust/crates/mxgw-cli/src/main.rs:1053-1070` |
+| Client.Rust-016 | Medium | Resolved | Testing coverage | `clients/rust/tests/client_behavior.rs`, `clients/rust/src/session.rs:489-519,654-768` |
+| Client.Rust-018 | Medium | Resolved | Error handling & resilience | `clients/rust/crates/mxgw-cli/src/main.rs:1098-1170`; `scripts/bench-read-bulk.ps1:347-365`; siblings: `clients/go/cmd/mxgw-go/main.go:600-648`, `clients/python/src/mxgateway_cli/commands.py:614-662`, `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:685-770`, `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:855-940` |
+| Contracts-002 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:384-385`, `:95` |
+| Contracts-009 | Medium | Resolved | Design-document adherence | `docs/Contracts.md:13-24` |
+| IntegrationTests-003 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:89-97` |
+| IntegrationTests-004 | Medium | Resolved | Error handling & resilience | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:108-111` |
+| IntegrationTests-005 | Medium | Resolved | Testing coverage | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` |
+| IntegrationTests-006 | Medium | Resolved | Testing coverage | `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs` |
+| IntegrationTests-012 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:147-151` |
+| IntegrationTests-014 | Medium | Resolved | Testing coverage | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` |
+| IntegrationTests-017 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:350-407` |
+| IntegrationTests-019 | Medium | Resolved | Security | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:497-534` |
+| Server-002 | Medium | Resolved | Design-document adherence | `src/MxGateway.Server/Program.cs:24`, `src/MxGateway.Server/GatewayApplication.cs` |
+| Server-004 | Medium | Resolved | Code organization & conventions | `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs:227-233`, `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCliRunner.cs:53-77`, `src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:21-67` |
+| Server-005 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs:22-28`, `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:184` |
+| Server-006 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.Server/Sessions/SessionManager.cs:84-114` |
+| Server-015 | Medium | Resolved | Concurrency & thread safety | `src/MxGateway.Server/Sessions/GatewaySession.cs:8-15,266-308,720-775` |
+| Server-016 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Server/Sessions/GatewaySession.cs:790-797`, `src/MxGateway.Server/Sessions/SessionManager.cs:237-258` |
+| Server-021 | Medium | Resolved | Testing coverage | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:266-664`, `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs` |
+| Server-030 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Server/Sessions/GatewaySession.cs:952-980` |
+| Server-033 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:265-323` (`TryRestoreFromDiskAsync`), `:84-99` (`_firstLoad` / `WaitForFirstLoadAsync`); `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:141-163` (`WaitForCacheBootstrap`) |
+| Tests-003 | Medium | Resolved | Performance & resource management | `src/MxGateway.Tests/Security/Authentication/SqliteAuthStoreTests.cs:170-176`, `src/MxGateway.Tests/Security/Authentication/ApiKeyAdminCliRunnerTests.cs:252-258` |
+| Tests-004 | Medium | Resolved | Testing coverage | `src/MxGateway.Tests/Security/Authorization/GatewayGrpcAuthorizationInterceptorTests.cs` |
+| Tests-005 | Medium | Resolved | Testing coverage | `src/MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs:239-261`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs` |
+| Tests-006 | Medium | Resolved | Concurrency & thread safety | `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:76`, `src/MxGateway.Tests/Gateway/Workers/FakeWorkerHarnessTests.cs:122` |
+| Tests-013 | Medium | Resolved | Testing coverage | `src/MxGateway.Server/Sessions/GatewaySession.cs:449-679`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs` |
+| Tests-016 | Medium | Resolved | Testing coverage | `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs:29-41,115-124` |
+| Tests-020 | Medium | Resolved | Testing coverage | `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceConstraintTests.cs:275-347`, `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:803-829` |
+| Worker-004 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:565-588` |
+| Worker-005 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-258` (production alarm poll loop) |
+| Worker-006 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:117-124`, `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:386-491` |
+| Worker-007 | Medium | Resolved | mxaccessgw conventions | `src/MxGateway.Worker/MxAccess/MxAccessComServer.cs:130-150` |
+| Worker-008 | Medium | Resolved | Concurrency & thread safety | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-249`, `:429-447` |
+| Worker-016 | Medium | Resolved | Concurrency & thread safety | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:261-265` |
+| Worker-017 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Worker/Sta/StaRuntime.cs:280-288`, `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:602-631` |
+| Worker-023 | Medium | Resolved | Error handling & resilience | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:610-668`, `src/MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:124-153` |
+| Worker.Tests-003 | Medium | Resolved | Concurrency & thread safety | `src/MxGateway.Worker.Tests/Sta/StaRuntimeTests.cs:46-48` |
+| Worker.Tests-004 | Medium | Resolved | Concurrency & thread safety | `src/MxGateway.Worker.Tests/MxAccess/MxAccessStaSessionTests.cs:281-329` |
+| Worker.Tests-005 | Medium | Resolved | Performance & resource management | `src/MxGateway.Worker.Tests/Ipc/WorkerFrameProtocolTests.cs:20-31,103-105`, `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs:28-31` |
+| Worker.Tests-006 | Medium | Resolved | Performance & resource management | `src/MxGateway.Worker.Tests/MxAccess/MxAccessStaSessionTests.cs:282,305,315,323` |
+| Worker.Tests-007 | Medium | Resolved | Design-document adherence | `docs/WorkerFrameProtocol.md:38-49` |
+| Worker.Tests-016 | Medium | Resolved | Code organization & conventions | `src/MxGateway.Worker.Tests/MxAccess/AlarmCommandExecutorTests.cs:317-393` |
+| Worker.Tests-017 | Medium | Resolved | Testing coverage | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs` |
+| Worker.Tests-018 | Medium | Resolved | Correctness & logic bugs | `src/MxGateway.Worker.Tests/MxAccess/MxAccessLiveComCreationTests.cs:18-31, 35-73, 75-145, 148-220, 222-342` |
+| Client.Dotnet-004 | Low | Resolved | Error handling & resilience | `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:283-294`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:392-403` |
+| Client.Dotnet-005 | Low | Resolved | Correctness & logic bugs | `clients/dotnet/MxGateway.Client/MxGatewaySession.cs:82,124,175` |
+| Client.Dotnet-006 | Low | Resolved | Code organization & conventions | `clients/dotnet/MxGateway.Client/MxGatewayClientOptions.cs:50`, `clients/dotnet/MxGateway.Client/MxGatewayClientContractInfo.cs:10-14` |
+| Client.Dotnet-007 | Low | Resolved | Documentation & comments | `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:185-192` |
+| Client.Dotnet-008 | Low | Resolved | Correctness & logic bugs | `clients/dotnet/MxGateway.Client.Cli/MxGatewayCliSecretRedactor.cs:9-17` |
+| Client.Dotnet-009 | Low | Resolved | Concurrency & thread safety | `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:26,339-348,445-448` |
+| Client.Dotnet-010 | Low | Resolved | Correctness & logic bugs | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:638,896,1261,1279` |
+| Client.Dotnet-011 | Low | Resolved | Concurrency & thread safety | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:857-858,922-963,1014-1015` |
+| Client.Dotnet-012 | Low | Resolved | Code organization & conventions | `clients/dotnet/MxGateway.Client/MxGateway.Client.csproj`, `clients/dotnet/MxGateway.Client.Cli/MxGateway.Client.Cli.csproj`, `clients/dotnet/MxGateway.Client.Tests/MxGateway.Client.Tests.csproj` |
+| Client.Dotnet-013 | Low | Resolved | Code organization & conventions | `clients/dotnet/MxGateway.Client/DiscoverHierarchyOptions.cs:3-24`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:185-187`, `clients/dotnet/MxGateway.Client.Cli/IMxGatewayCliClient.cs:6` |
+| Client.Dotnet-014 | Low | Resolved | Testing coverage | `clients/dotnet/MxGateway.Client.Tests/MxGatewayClientAlarmsTests.cs:76-98`, `clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:212-231` |
+| Client.Dotnet-015 | Low | Resolved | Correctness & logic bugs | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:221-236`, `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:596-1065` |
+| Client.Dotnet-016 | Low | Resolved | Concurrency & thread safety | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:922-976` |
+| Client.Dotnet-017 | Low | Resolved | Error handling & resilience | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:1190-1262` |
+| Client.Go-004 | Low | Resolved | mxaccessgw conventions | `clients/go/mxgateway/alarms_test.go:153-154`, `clients/go/mxgateway/galaxy_test.go:58-59` |
+| Client.Go-005 | Low | Resolved | Design-document adherence | `clients/go/mxgateway/client.go:64,68`, `clients/go/mxgateway/galaxy.go:83,87` |
+| Client.Go-006 | Low | Resolved | Error handling & resilience | `clients/go/mxgateway/errors.go:9-130` |
+| Client.Go-007 | Low | Resolved | Correctness & logic bugs | `clients/go/mxgateway/session.go:526-532` |
+| Client.Go-008 | Low | Resolved | Testing coverage | `clients/go/mxgateway/` (test files) |
+| Client.Go-009 | Low | Resolved | Code organization & conventions | `clients/go/mxgateway/galaxy.go:60-93,241-256`, `clients/go/mxgateway/client.go:41-74,190-205` |
+| Client.Go-010 | Low | Resolved | Documentation & comments | `clients/go/mxgateway/client.go:39-40` |
+| Client.Go-011 | Low | Resolved | Correctness & logic bugs | `clients/go/mxgateway/alarms_test.go:66-73` |
+| Client.Go-012 | Low | Resolved | Documentation & comments | `clients/go/cmd/mxgw-go/main.go:1063-1065`, `clients/go/cmd/mxgw-go/main.go:88-104` |
+| Client.Go-013 | Low | Resolved | Concurrency & thread safety | `clients/go/cmd/mxgw-go/main.go:1246-1249`, `clients/go/cmd/mxgw-go/main.go:1257-1262` |
+| Client.Go-014 | Low | Resolved | Error handling & resilience | `clients/go/mxgateway/session.go:602`, `clients/go/mxgateway/galaxy.go:189` |
+| Client.Go-015 | Low | Resolved | Code organization & conventions | `clients/go/cmd/mxgw-go/main.go:410-512` |
+| Client.Go-016 | Low | Resolved | Testing coverage | `clients/go/mxgateway/galaxy_test.go:382-429` |
+| Client.Go-017 | Low | Resolved | Error handling & resilience | `clients/go/cmd/mxgw-go/main.go:954-991` |
+| Client.Go-018 | Low | Resolved | Concurrency & thread safety | `clients/go/cmd/mxgw-go/main.go:593-623` |
+| Client.Go-019 | Low | Resolved | Documentation & comments | `clients/go/cmd/mxgw-go/main.go:710-716`, `clients/go/cmd/mxgw-go/main.go:1204,1213` |
+| Client.Go-020 | Low | Resolved | Code organization & conventions | `clients/go/cmd/mxgw-go/main.go:753-802`, `clients/go/cmd/mxgw-go/main.go:1199-1275` |
+| Client.Go-021 | Low | Resolved | Testing coverage | `clients/go/cmd/mxgw-go/main_test.go`, `clients/go/cmd/mxgw-go/main.go:363-520,522-655` |
+| Client.Java-006 | Low | Resolved | Performance & resource management | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:323-328`, `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/GalaxyRepositoryClient.java:279-284` |
+| Client.Java-007 | Low | Resolved | Testing coverage | `clients/java/mxgateway-client/src/test/java/com/dohertylan/mxgateway/client/` |
+| Client.Java-008 | Low | Resolved | Error handling & resilience | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:298-304` |
+| Client.Java-009 | Low | Resolved | Code organization & conventions | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/GalaxyRepositoryClient.java:310-391`, `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:346-413` |
+| Client.Java-010 | Low | Resolved | Documentation & comments | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:269-272`, `clients/java/README.md:76` |
+| Client.Java-011 | Low | Resolved | Performance & resource management | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:37-63` |
+| Client.Java-012 | Low | Resolved | Correctness & logic bugs | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:667-674` |
+| Client.Java-016 | Low | Resolved | Code organization & conventions | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:361-391`, `GalaxyRepositoryClient.java:285-315` |
+| Client.Java-017 | Low | Resolved | Documentation & comments | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:25-36`, `clients/java/README.md:99-107` |
+| Client.Java-018 | Low | Resolved | Security | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySecrets.java:54-66` |
+| Client.Java-019 | Low | Resolved | Performance & resource management | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:362-391`, `GalaxyRepositoryClient.java:286-315` |
+| Client.Java-020 | Low | Resolved | Correctness & logic bugs | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:244-254`, `galaxy_repository.proto:94` |
+| Client.Java-022 | Low | Resolved | Documentation & comments | `clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayChannels.java:161-172` |
+| Client.Java-023 | Low | Resolved | Correctness & logic bugs | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:1054`, `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:634` |
+| Client.Java-024 | Low | Resolved | Correctness & logic bugs | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:855-883` |
+| Client.Java-025 | Low | Resolved | Code organization & conventions | `clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:1176-1185` |
+| Client.Java-026 | Low | Resolved | Testing coverage | `clients/java/mxgateway-cli/src/test/java/com/dohertylan/mxgateway/cli/MxGatewayCliTests.java` |
+| Client.Python-001 | Low | Resolved | Documentation & comments | `clients/python/pyproject.toml:8,25`, `clients/python/src/mxgateway_cli/commands.py:25` |
+| Client.Python-002 | Low | Resolved | Code organization & conventions | `clients/python/src/mxgateway/__init__.py:27` |
+| Client.Python-004 | Low | Resolved | Correctness & logic bugs | `clients/python/src/mxgateway_cli/commands.py:386,402-404` |
+| Client.Python-006 | Low | Resolved | Concurrency & thread safety | `clients/python/src/mxgateway/client.py:74-82`, `clients/python/src/mxgateway/galaxy.py:85-93`, `clients/python/src/mxgateway/session.py:38-55` |
+| Client.Python-007 | Low | Resolved | Error handling & resilience | `clients/python/src/mxgateway/client.py:204-213` |
+| Client.Python-008 | Low | Resolved | Correctness & logic bugs | `clients/python/src/mxgateway/values.py:62-67,83-88` |
+| Client.Python-010 | Low | Resolved | Code organization & conventions | `clients/python/src/mxgateway/session.py:404`, `clients/python/src/mxgateway_cli/commands.py:422-425` |
+| Client.Python-011 | Low | Resolved | Error handling & resilience | `clients/python/src/mxgateway/errors.py:122-148` |
+| Client.Python-012 | Low | Won't Fix | mxaccessgw conventions | `clients/python/src/mxgateway/client.py:84-108`, `clients/python/src/mxgateway/session.py:57-77` |
+| Client.Python-014 | Low | Resolved | Code organization & conventions | `clients/python/src/mxgateway_cli/commands.py:22-23` |
+| Client.Python-015 | Low | Resolved | Testing coverage | `clients/python/src/mxgateway_cli/commands.py:273-294,564-647`, `clients/python/tests/` |
+| Client.Python-016 | Low | Resolved | Testing coverage | `clients/python/src/mxgateway_cli/commands.py:25,757-775,805-830` |
+| Client.Python-017 | Low | Resolved | Documentation & comments | `clients/python/pyproject.toml:5-25`, `clients/python/src/mxgateway/` |
+| Client.Python-019 | Low | Resolved | Code organization & conventions | `clients/python/pyproject.toml:60-61`, `clients/python/src/mxgateway_cli/` |
+| Client.Python-020 | Low | Resolved | Testing coverage | `clients/python/tests/`, `scripts/` |
+| Client.Python-021 | Low | Resolved | Documentation & comments | `clients/python/src/mxgateway_cli/commands.py`, `clients/python/README.md:235-258` |
+| Client.Rust-004 | Low | Resolved | Documentation & comments | `clients/rust/src/version.rs:7` |
+| Client.Rust-007 | Low | Resolved | Design-document adherence | `clients/rust/RustClientDesign.md:14-55` |
+| Client.Rust-008 | Low | Resolved | Performance & resource management | `clients/rust/src/value.rs:161-261` |
+| Client.Rust-009 | Low | Resolved | Testing coverage | `clients/rust/tests/client_behavior.rs`, `clients/rust/src/galaxy.rs` |
+| Client.Rust-010 | Low | Resolved | Error handling & resilience | `clients/rust/src/client.rs:255-268`, `clients/rust/src/galaxy.rs:204-216` |
+| Client.Rust-011 | Low | Resolved | mxaccessgw conventions | `clients/rust/src/session.rs:469` |
+| Client.Rust-014 | Low | Resolved | mxaccessgw conventions | `clients/rust/crates/mxgw-cli/src/main.rs:450,497` |
+| Client.Rust-017 | Low | Resolved | Design-document adherence | `clients/rust/RustClientDesign.md:79-99,156-163` |
+| Client.Rust-019 | Low | Resolved | Design-document adherence | `clients/rust/RustClientDesign.md:96-100` |
+| Client.Rust-020 | Low | Resolved | Documentation & comments | `clients/rust/src/session.rs:31-46`; `clients/rust/src/lib.rs:14-39` |
+| Contracts-001 | Low | Resolved | Design-document adherence | `docs/Grpc.md:13` (and `:3`, `:32`, `:39`) |
+| Contracts-003 | Low | Won't Fix | Code organization & conventions | `src/MxGateway.Contracts/MxGateway.Contracts.csproj:10` |
+| Contracts-004 | Low | Resolved | Documentation & comments | `src/MxGateway.Contracts/GatewayContractInfo.cs:3-6` |
+| Contracts-005 | Low | Resolved | mxaccessgw conventions | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto`, `src/MxGateway.Contracts/Protos/mxaccess_worker.proto` |
+| Contracts-006 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:647` |
+| Contracts-007 | Low | Resolved | Testing coverage | `src/MxGateway.Tests/Contracts/ProtobufContractRoundTripTests.cs` |
+| Contracts-008 | Low | Resolved | Design-document adherence | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:451-459`, `:627-636` |
+| Contracts-010 | Low | Resolved | Testing coverage | `src/MxGateway.Tests/Contracts/ProtobufContractRoundTripTests.cs` |
+| Contracts-011 | Low | Resolved | Security | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:392-397`, `:406-412` |
+| Contracts-012 | Low | Resolved | Documentation & comments | `src/MxGateway.Contracts/Protos/galaxy_repository.proto:120` |
+| Contracts-013 | Low | Resolved | Documentation & comments | `src/MxGateway.Tests/Contracts/GatewayContractInfoTests.cs:14` |
+| Contracts-014 | Low | Resolved | Documentation & comments | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:549-553` |
+| Contracts-015 | Low | Resolved | Documentation & comments | `src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:571-582` |
+| IntegrationTests-007 | Low | Resolved | Concurrency & thread safety | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:20`, `src/MxGateway.IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs:5`, `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs:9` |
+| IntegrationTests-008 | Low | Resolved | Code organization & conventions | `src/MxGateway.IntegrationTests/LiveLdapFactAttribute.cs`, `src/MxGateway.IntegrationTests/Galaxy/LiveGalaxyRepositoryFactAttribute.cs`, `src/MxGateway.IntegrationTests/LiveMxAccessFactAttribute.cs` |
+| IntegrationTests-009 | Low | Resolved | 
Documentation & comments | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:372-375` | +| IntegrationTests-010 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:366-369` | +| IntegrationTests-011 | Low | Resolved | Documentation & comments | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:236-240`, `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:183-187` | +| IntegrationTests-013 | Low | Resolved | Performance & resource management | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:519-609` | +| IntegrationTests-015 | Low | Resolved | Code organization & conventions | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:30,119,201`, `src/MxGateway.IntegrationTests/DashboardLdapLiveTests.cs:13,32,48,67,84`, `src/MxGateway.IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs:10,22,34,52` | +| IntegrationTests-016 | Low | Resolved | Code organization & conventions | `src/MxGateway.IntegrationTests/Galaxy/LiveGalaxyRepositoryFactAttribute.cs:26`, `src/MxGateway.Server/Galaxy/GalaxyRepositoryOptions.cs:13` | +| IntegrationTests-018 | Low | Resolved | Code organization & conventions | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:1037`, `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:595` | +| IntegrationTests-020 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:616-622` | +| IntegrationTests-021 | Low | Resolved | Testing coverage | `src/MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs:579-622` | +| Server-007 | Low | Resolved | Performance & resource management | `src/MxGateway.Server/Galaxy/GalaxyHierarchyProjector.cs:55-70` | +| Server-008 | Low | Resolved | Performance & resource management | `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:111-134,160-189` | +| Server-009 | Low | Resolved | Error handling & resilience | 
`src/MxGateway.Server/Security/Authentication/AuthSqliteConnectionFactory.cs:15-32` | +| Server-010 | Low | Resolved | Security | `src/MxGateway.Server/Security/Authentication/SqliteApiKeyAdminStore.cs:91-114`, `src/MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:168-172` | +| Server-011 | Low | Resolved | Code organization & conventions | `src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:1-46` | +| Server-012 | Low | Resolved | Documentation & comments | `CLAUDE.md` (Authentication section and `apikey create` example) | +| Server-013 | Low | Resolved | Testing coverage | `src/MxGateway.Tests/Gateway/Dashboard/DashboardAuthorizationHandlerTests.cs`, `src/MxGateway.Tests/Gateway/GatewayApplicationTests.cs` | +| Server-014 | Low | Resolved | Documentation & comments | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:162-171,191-198,206-214,229-237` | +| Server-018 | Low | Resolved | Performance & resource management | `src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:15` | +| Server-019 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:183-221` | +| Server-020 | Low | Resolved | Code organization & conventions | `src/MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor:1-2`, `…/GalaxyPage.razor:1-2`, `…/ApiKeysPage.razor:1-2`, `…/EventsPage.razor:1-2`, `…/SessionsPage.razor:1-2`, `…/WorkersPage.razor:1-2`, `…/SettingsPage.razor:1-2`, `…/SessionDetailsPage.razor:1-2` | +| Server-022 | Low | Resolved | Documentation & comments | `src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs:8-29` | +| Server-023 | Low | Resolved | Documentation & comments | `src/MxGateway.Server/Sessions/NotWiredAlarmRpcDispatcher.cs:10-26` | +| Server-024 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:56-77` | +| Server-025 | Low | Resolved | Code organization & conventions | `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:19-25`, 
`src/MxGateway.Server/Galaxy/IGalaxyRepository.cs` | +| Server-026 | Low | Resolved | Error handling & resilience | `src/MxGateway.Server/Configuration/GatewayOptionsValidator.cs:17-32`, `src/MxGateway.Server/Configuration/AlarmsOptions.cs` | +| Server-027 | Low | Resolved | Design-document adherence | `docs/Authorization.md:120-141,176-181` | +| Server-028 | Low | Resolved | Testing coverage | `src/MxGateway.Tests/Security/Authorization/GatewayGrpcScopeResolverTests.cs:13-20`, `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs` | +| Server-029 | Low | Resolved | Documentation & comments | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:52-58` | +| Server-034 | Low | Resolved | Error handling & resilience | `src/MxGateway.Server/Galaxy/GalaxyHierarchySnapshotStore.cs:87-115` (`TryLoadAsync`) | +| Server-035 | Low | Resolved | Performance & resource management | `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:176` (call site), `:327-352` (`PersistSnapshotAsync`) | +| Server-036 | Low | Resolved | Error handling & resilience | `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:345-348` (`PersistSnapshotAsync` catch) | +| Server-037 | Low | Resolved | Testing coverage | `src/MxGateway.Tests/Galaxy/GalaxyHierarchySnapshotStoreTests.cs`, `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs` | +| Tests-007 | Low | Resolved | Code organization & conventions | `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:682`, `src/MxGateway.Tests/Gateway/Grpc/GalaxyRepositoryGrpcServiceTests.cs:324`, `src/MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:460`, `src/MxGateway.Tests/Security/Authorization/GatewayGrpcAuthorizationInterceptorTests.cs:233` | +| Tests-008 | Low | Resolved | mxaccessgw conventions | `src/MxGateway.Tests/Gateway/Sessions/WorkerAlarmRpcDispatcherTests.cs:1-9`, `src/MxGateway.Tests/Gateway/Sessions/NotWiredAlarmRpcDispatcherTests.cs:1-3`, 
`src/MxGateway.Tests/Gateway/Sessions/SessionManagerAlarmAutoSubscribeTests.cs:1` | +| Tests-009 | Low | Resolved | Documentation & comments | `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:36-37,99,365` | +| Tests-010 | Low | Resolved | Security | `src/MxGateway.Tests/Gateway/Dashboard/DashboardAuthorizationHandlerTests.cs:26-36` | +| Tests-011 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:233-301` | +| Tests-012 | Low | Resolved | Concurrency & thread safety | `src/MxGateway.Tests/Gateway/Workers/Fakes/FakeWorkerHarness.cs:62`, `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:472` | +| Tests-014 | Low | Resolved | Performance & resource management | `src/MxGateway.Tests/Gateway/GatewayApplicationTests.cs:18,33,44,62,81,105`, `src/MxGateway.Tests/Gateway/Dashboard/DashboardCookieOptionsTests.cs:17` | +| Tests-015 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:374-379,87` | +| Tests-017 | Low | Resolved | Concurrency & thread safety | `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:346-364` | +| Tests-018 | Low | Resolved | Code organization & conventions | `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs:32`, `src/MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotServiceTests.cs:45,51,57,105,134,163,167,202-209,284,317,523`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:40` | +| Tests-019 | Low | Resolved | Documentation & comments | `docs/GatewayTesting.md`, `code-reviews/Tests/findings.md` (Tests-002 re-triage) | +| Tests-021 | Low | Resolved | Code organization & conventions | `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs:159-171`, `src/MxGateway.Tests/Gateway/Workers/FakeWorkerHarnessTests.cs:226-236`, `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:620-630`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:766-…` | +| Tests-022 | Low | 
Resolved | Testing coverage | `src/MxGateway.Tests/Gateway/Sessions/SessionManagerBulkTests.cs:52-61,90-99,126-135,163-172,202-211,238-247,282-294,339-360,413-434,484-506,553-567,663-688` | +| Tests-023 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Tests/Gateway/Sessions/SessionWorkerClientFactoryFakeWorkerTests.cs:334-374` | +| Tests-024 | Low | Resolved | Testing coverage | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:713-730,784-801,859-876`, `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceConstraintTests.cs` | +| Worker-009 | Low | Resolved | Performance & resource management | `src/MxGateway.Worker/Ipc/WorkerFrameReader.cs:31,49`, `src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:57-58` | +| Worker-010 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Conversion/VariantConverter.cs:204-226` | +| Worker-011 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeClient.cs:169-171` | +| Worker-012 | Low | Resolved | Documentation & comments | `src/MxGateway.Worker/MxAccess/MxAccessAlarmEventSink.cs:44-55`, `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:38-43`, `src/MxGateway.Worker/MxAccess/MxAccessEventMapper.cs:106-112` | +| Worker-013 | Low | Resolved | Testing coverage | `src/MxGateway.Worker/Sta/StaMessagePump.cs` | +| Worker-014 | Low | Resolved | Code organization & conventions | `src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:33`, `:202` | +| Worker-015 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/MxAccess/MxAccessEventQueue.cs:115-145` | +| Worker-018 | Low | Resolved | Error handling & resilience | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:160-161` | +| Worker-019 | Low | Resolved | Code organization & conventions | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:59`, `:188` | +| Worker-020 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:405`, `:423` | +| Worker-021 | Low | Resolved | 
Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:111-118`, `:790-805`, `:136-139` | +| Worker-022 | Low | Resolved | Code organization & conventions | `src/MxGateway.Worker/MxAccess/MxAlarmSnapshot.cs:12`, `:26`, `:49` | +| Worker-024 | Low | Resolved | Concurrency & thread safety | `src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:63-187`, `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:191-323` | +| Worker-025 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:111-117` | +| Worker.Tests-008 | Low | Resolved | Documentation & comments | `src/MxGateway.Worker.Tests/Conversion/VariantConverterTests.cs:175-182` | +| Worker.Tests-009 | Low | Resolved | Code organization & conventions | `src/MxGateway.Worker.Tests/MxAccess/AlarmCommandHandlerTests.cs`, `AlarmDispatcherTests.cs`, `AlarmCommandExecutorTests.cs`, `AlarmRecordTransitionMapperTests.cs`, `WnWrapAlarmConsumerXmlTests.cs` | +| Worker.Tests-010 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker.Tests/MxAccess/MxAccessStaSessionTests.cs:230-258` | +| Worker.Tests-011 | Low | Resolved | Documentation & comments | `src/MxGateway.Worker.Tests/Sta/StaCommandDispatcherTests.cs:92-112` | +| Worker.Tests-012 | Low | Resolved | Testing coverage | `src/MxGateway.Worker.Tests/Ipc/WorkerFrameProtocolTests.cs` | +| Worker.Tests-013 | Low | Resolved | Concurrency & thread safety | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs:539-546` | +| Worker.Tests-014 | Low | Resolved | Code organization & conventions | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeClientTests.cs:194`, `WorkerPipeSessionTests.cs:622`, `Sta/StaCommandDispatcherTests.cs:348`, `MxAccess/MxAccessStaSessionTests.cs:334`, `MxAccess/MxAccessCommandExecutorTests.cs:1124` | +| Worker.Tests-015 | Low | Resolved | Testing coverage | `src/MxGateway.Worker.Tests/MxAccess/MxAccessEventQueueTests.cs` | +| Worker.Tests-019 | Low | Resolved | mxaccessgw conventions | 
`src/MxGateway.Worker.Tests/AlarmsLiveSmokeTests.cs:45`, `src/MxGateway.Worker.Tests/AlarmClientWmProbeTests.cs:143`, `src/MxGateway.Worker.Tests/WnWrapConsumerProbeTests.cs:55` | +| Worker.Tests-020 | Low | Resolved | Concurrency & thread safety | `src/MxGateway.Worker.Tests/MxAccess/MxAccessValueCacheTests.cs:88-108` | +| Worker.Tests-021 | Low | Resolved | Error handling & resilience | `src/MxGateway.Worker.Tests/Ipc/WorkerFrameProtocolTests.cs` | +| Worker.Tests-022 | Low | Resolved | Testing coverage | `src/MxGateway.Worker.Tests/MxAccess/WnWrapAlarmConsumerXmlTests.cs` | +| Worker.Tests-023 | Low | Resolved | Documentation & comments | `src/MxGateway.Worker.Tests/AlarmClientWmProbeTests.cs` (779 lines), `src/MxGateway.Worker.Tests/WnWrapConsumerProbeTests.cs` (287 lines), `src/MxGateway.Worker.Tests/AlarmsLiveSmokeTests.cs` (270 lines) | +| Worker.Tests-024 | Low | Resolved | Correctness & logic bugs | `src/MxGateway.Worker.Tests/MxAccess/AlarmCommandHandlerTests.cs:42-54` | +| Worker.Tests-025 | Low | Resolved | mxaccessgw conventions | `src/MxGateway.Worker.Tests/TestSupport/LiveMxAccessFactAttribute.cs:23`, `src/MxGateway.IntegrationTests/IntegrationTestEnvironment.cs:5`, `src/MxGateway.IntegrationTests/LiveMxAccessFactAttribute.cs:9-12` | +| Worker.Tests-026 | Low | Resolved | Code organization & conventions | `src/MxGateway.Worker/MxAccess/MxAccessSession.cs:74-88` | +| Worker.Tests-027 | Low | Resolved | Concurrency & thread safety | `src/MxGateway.Worker.Tests/TestSupport/FakeRuntimeSession.cs:174, 179-187` | +| Worker.Tests-028 | Low | Resolved | Design-document adherence | `docs/GatewayTesting.md`, `src/MxGateway.Worker.Tests/Probes/` | +| Worker.Tests-029 | Low | Resolved | Code organization & conventions | `src/MxGateway.Worker.Tests/Probes/AlarmsLiveSmokeTests.cs:9`, `src/MxGateway.Worker.Tests/Probes/AlarmClientWmProbeTests.cs:14`, `src/MxGateway.Worker.Tests/Probes/WnWrapConsumerProbeTests.cs:10` | +| Worker.Tests-030 | Low | Resolved | 
Documentation & comments | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs:862-890` | diff --git a/code-reviews/Server/findings.md b/code-reviews/Server/findings.md new file mode 100644 index 0000000..e7b4617 --- /dev/null +++ b/code-reviews/Server/findings.md @@ -0,0 +1,780 @@ +# Code Review — Server + +| Field | Value | +|---|---| +| Module | `src/ZB.MOM.WW.MxGateway.Server` | +| Reviewer | Claude Code | +| Review date | 2026-05-24 | +| Commit reviewed | `d692232` | +| Status | Reviewed | +| Open findings | 8 | + +## Checklist coverage + +### 2026-05-20 review (commit 1cd51bb) + +This row summarizes the 2026-05-20 review pass at commit `1cd51bb`. Findings from +prior passes (Server-001 through Server-014) are all closed and remain below as +audit history. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issues found: Server-019 (`WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` yields silently when session is missing). | +| 2 | mxaccessgw conventions | No issues found — convention drift previously called out is resolved; no new gaps observed. | +| 3 | Concurrency & thread safety | Issues found: Server-015 (`GatewaySession._state` is written under `_closeLock` but read/written elsewhere under `_syncRoot`). | +| 4 | Error handling & resilience | Issues found: Server-016 (`GatewaySession.DisposeAsync` disposes the close-lock semaphore while it may be held). | +| 5 | Security | Issues found: Server-017 (`AcknowledgeAlarm` / `QueryActiveAlarms` fall through to admin-only scope because the resolver was not updated for the new alarm RPCs). | +| 6 | Performance & resource management | Issues found: Server-018 (`GalaxyGlobMatcher` regex cache is unbounded — currently low-risk but uncapped). | +| 7 | Design-document adherence | No issues found at this pass. 
| +| 8 | Code organization & conventions | Issues found: Server-020 (dashboard pages each declare two `@page` directives — `@page "/X"` AND `@page "/dashboard/X"` — producing duplicate routes under the `/dashboard` group prefix). | +| 9 | Testing coverage | Issues found: Server-021 (`MxAccessGatewayService.ApplyConstraintsAsync` and the new `BulkConstraintPlan` / `ReadBulkConstraintPlan` / `WriteBulkConstraintPlan` / `SubscribeBulkConstraintPlan` merge logic is entirely untested). | +| 10 | Documentation & comments | Issues found: Server-022 (`IAlarmRpcDispatcher` XML doc still describes the dispatcher as "ships a not-yet-wired default"; stale after Server-014). | + +### 2026-05-20 review (commit a020350) + +Re-review pass at `a020350` — the cross-module sweep that resolved Server-015 through Server-022. Verified each fix is sound (lock discipline now uniform on `_syncRoot`; `DisposeAsync` gates on `_closeLock`; alarm RPCs map to `InvokeWrite`/`EventsRead`; glob cache is bounded; alarm dispatcher SessionNotFound flows through `MxAccessGatewayService.MapException` → gRPC `NotFound`; dashboard pages emit a single `@page`; 11 new `MxAccessGatewayServiceConstraintTests` cover the bulk-constraint plans). New findings filed against this pass. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issues found: Server-024 (`GalaxyGlobMatcher.GetOrCreateRegex` indexer access after `TryAdd` fails can throw `KeyNotFoundException` under contention near the cap). | +| 2 | mxaccessgw conventions | No issues found. | +| 3 | Concurrency & thread safety | No new issues found — Server-015/016 fixes verified sound. | +| 4 | Error handling & resilience | Issues found: Server-026 (`AlarmsOptions` is bound but not validated by `GatewayOptionsValidator`). | +| 5 | Security | No issues found — Server-017 mapping (`InvokeWrite` / `EventsRead`) is defensible and exercised by both resolver and interceptor tests. 
| +| 6 | Performance & resource management | No issues found — Server-018 cap is in place and tested. | +| 7 | Design-document adherence | Issues found: Server-027 (`docs/Authorization.md` `ResolveCommandScope` code snippet and Constraint Enforcement section omit the bulk read/write command families). | +| 8 | Code organization & conventions | Issues found: Server-025 (`GalaxyRepositoryGrpcService` still consumes the concrete `GalaxyRepository` after `IGalaxyRepository` was introduced for testability — inconsistent with `GalaxyHierarchyCache`). | +| 9 | Testing coverage | Issues found: Server-028 (`GatewayGrpcScopeResolverTests` does not exercise `WatchDeployEventsRequest` or `MxCommandKind.ReadBulk`; no `GatewaySessionTests` case asserts a `MarkFaulted` during in-flight Close). | +| 10 | Documentation & comments | Issues found: Server-023 (`NotWiredAlarmRpcDispatcher` class XML doc still says "PR A.6/A.7 — default … shipped while the worker-side AlarmClient event subscription is gated on dev-rig validation"; contradicts the cleanup that Server-014/Server-022 applied to the interface, gateway service, and `WorkerAlarmRpcDispatcher`). Issues found: Server-029 (`OpenSession` capability list advertises `bulk-subscribe-commands` but not the now-shipping bulk-read or bulk-write families — clients that gate on capability strings have no signal that those families exist). | + +### 2026-05-22 review (commit fa491c7) + +Re-review pass at `fa491c7`, scoped to the Galaxy hierarchy snapshot-persistence +change: the new `GalaxyHierarchySnapshot`, `IGalaxyHierarchySnapshotStore` / +`GalaxyHierarchySnapshotStore`, the restore / persist paths added to +`GalaxyHierarchyCache`, the two new `GalaxyRepositoryOptions`, and the +`docs/GalaxyRepository.md` / `docs/GatewayConfiguration.md` updates. Prior +findings (Server-001 through Server-032) are unchanged by this pass. 
+ +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found — restore/save sequencing and the shared `BuildEntry` materialization are sound. | +| 2 | mxaccessgw conventions | No issues found — file-scoped namespaces, `sealed`, `Async` suffixes, Options pattern, and XML docs all conform; the snapshot persists Galaxy metadata (names/types), not tag values or secrets. | +| 3 | Concurrency & thread safety | No issues found — `_restoreAttempted` and `_current` are touched only under `_refreshGate`; `_current` is published via `Volatile.Write`; the store serializes its file I/O on a private `SemaphoreSlim`. | +| 4 | Error handling & resilience | Issues found: Server-033 (restore never completes `_firstLoad`, so a cold-start browse waits the full 5s bootstrap budget), Server-034 (`TryLoadAsync` throws on a corrupt file despite the `Try` prefix), Server-036 (a save cancelled at shutdown logs a misleading warning). | +| 5 | Security | No issues found — the snapshot holds non-secret Galaxy metadata, is written under `C:\ProgramData\MxGateway` alongside the auth DB, and restored rows flow the same materialization path as live SQL with no injection surface. | +| 6 | Performance & resource management | Issues found: Server-035 (the snapshot write is awaited on the refresh critical path under `_refreshGate` with no timeout). | +| 7 | Design-document adherence | No issues found — `docs/GalaxyRepository.md` and `docs/GatewayConfiguration.md` were updated in the same commit; `docs/DesignDecisions.md` already defers to `GalaxyRepository.md` as the Galaxy authority. | +| 8 | Code organization & conventions | No issues found — the new options live on `GalaxyRepositoryOptions`, the store is a registered singleton, and the on-disk envelope (`PersistedFile`) is a private nested record. | +| 9 | Testing coverage | Issues found: Server-037 (no test for the corrupt-snapshot restore path or for `PersistSnapshot = false` at the cache level). 
| +| 10 | Documentation & comments | No issues found — XML docs match behavior; the `GalaxyRepository.md` "On-disk snapshot" section documents the Stale-on-restore lifecycle. | + +### 2026-05-24 review (commit d692232) + +Re-review pass at `d692232` scoped to the dashboard refactor wave: the +`ZB.MOM.WW` project rename (`dc9c0c9`), the `QueryActiveAlarms` public RPC +implementation (`397d3c5`), the LDAP role-mapping + HubToken bearer auth +(`27ed651`), the sidebar layout + three SignalR push hubs (`6594359`), and the +EventsHub broadcaster + doc refresh (`d692232`). Server-031 and Server-032 +remain open and untouched — neither the gateway-side `_writeLock` heartbeat +contention nor the bounded `_events` channel saw any changes in this wave. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found in the fa491c7..d692232 diff. | +| 2 | mxaccessgw conventions | No issues found — rename hygiene clean, external runtime identifiers (`MeterName`, `MxGateway.Dashboard` scheme, `MxGateway.Request` logger, `MxGateway.Worker.STA` thread name) intentionally unprefixed per commit message. | +| 3 | Concurrency & thread safety | No issues found — Server-031 (`_writeLock` heartbeat watchdog contention) remains open and unchanged. | +| 4 | Error handling & resilience | Issues found: Server-039 (`HubTokenService.Validate` accepts a payload with null Name/NameIdentifier), Server-041 (`EventStreamService` calls the broadcaster without a try/catch — fragile seam), Server-042 (`DashboardSnapshotPublisher` tight retry loop with no backoff vs `AlarmsHubPublisher` 5-second delay). | +| 5 | Security | Issues found: Server-038 (`EventsHub.SubscribeSession` accepts any session id from any Viewer; no per-session ACL). | +| 6 | Performance & resource management | Issues found: Server-042 (`DashboardSnapshotPublisher` lacks reconnect backoff). 
| +| 7 | Design-document adherence | Issues found: Server-041 (broadcaster's never-throw contract documented in the interface but not enforced by the caller). | +| 8 | Code organization & conventions | Issues found: Server-040 (undocumented lookup-order precedence in `MapGroupsToRoles`), Server-043 (singleton sharing of `HubTokenService` undocumented). | +| 9 | Testing coverage | No issues found in this module — see Tests-026 in the Tests module for the missing EventsHub broadcast coverage. | +| 10 | Documentation & comments | Issues found: Server-040, Server-043 (both documentation gaps). | + +## Findings + +### Server-001 + +| Field | Value | +|---|---| +| Severity | Critical | +| Category | Security | +| Location | `src/MxGateway.Server/GatewayApplication.cs:147-149`, `src/MxGateway.Server/Dashboard/DashboardEndpointRouteBuilderExtensions.cs:55-58`, `src/MxGateway.Server/Dashboard/Components/Routes.razor:1-15` | +| Status | Resolved | + +**Description:** The dashboard authorization policy (`DashboardAuthenticationDefaults.AuthorizationPolicy`), `DashboardAuthorizationRequirement`, and `DashboardAuthorizationHandler` are registered in DI but never applied to any endpoint. `MapRazorComponents()` has no `.RequireAuthorization(...)`, the `<Router>` in `Routes.razor` uses plain `RouteView` (not `AuthorizeRouteView`), and no dashboard page carries `[Authorize]` — a module-wide grep finds zero `RequireAuthorization`/`[Authorize]`/`AuthorizeRouteView` usages. Every dashboard page (Sessions, Workers, Events, Galaxy, Settings, and the API Keys list exposing key IDs, scopes, and constraints) is reachable by any unauthenticated remote client regardless of `Dashboard:AllowAnonymousLocalhost` or `Dashboard:RequireAdminScope`. Only the API-key mutation operations remain protected, via the separate `DashboardApiKeyManagementService.CanManage` check. 
+ +**Recommendation:** Apply the policy at the route level — `endpoints.MapRazorComponents().AddInteractiveServerRenderMode().RequireAuthorization(DashboardAuthenticationDefaults.AuthorizationPolicy)` — and/or switch `Routes.razor` to `AuthorizeRouteView` with a `[Authorize]` fallback policy plus a `NotAuthorized` redirect to the login page. Add an integration test that GETs a dashboard page anonymously and asserts 302-to-login / 401. + +**Resolution:** Resolved in `a8aafdf` (2026-05-18): `MapRazorComponents()` now calls `.RequireAuthorization(DashboardAuthenticationDefaults.AuthorizationPolicy)`, so an unauthenticated request to any dashboard component route is challenged by the cookie scheme and redirected to the login page. `GatewayApplicationTests` gained `ComponentRoutesRequireAuthorization` (component routes carry the policy) and `AuthEndpointsAllowAnonymousAccess`, replacing the prior test that asserted the insecure behavior. + +### Server-002 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Design-document adherence | +| Location | `src/MxGateway.Server/Program.cs:24`, `src/MxGateway.Server/GatewayApplication.cs` | +| Status | Resolved | + +**Description:** `gateway.md:583` and CLAUDE.md state the first version "terminates orphaned workers on startup." No code in MxGateway.Server enumerates or kills leftover `MxGateway.Worker.exe` processes at startup — a grep for `orphan`/`reattach`/`terminate` finds nothing. After an unclean gateway crash, x86 worker processes (each holding an MXAccess COM instance) leak and survive indefinitely, and a restarted gateway does not reclaim or kill them. + +**Recommendation:** Add a startup hosted service that finds and kills stale worker processes (by executable path / a well-known argument or environment marker) before the server accepts sessions, or update the design docs if reattachment/cleanup is deliberately deferred. + +**Resolution:** Resolved 2026-05-18. 
Confirmed against source: no code path enumerated or killed leftover workers. Added `IRunningProcessInspector` / `SystemRunningProcessInspector` (a testable seam over `Process.GetProcessesByName`/`Kill`), `OrphanWorkerTerminator` (kills processes matched by the configured worker executable path, or by image name when the x64 gateway cannot introspect the x86 worker's `MainModule`, skipping the current process and tolerating per-process kill failures), and `OrphanWorkerCleanupHostedService` (best-effort `IHostedService`). The hosted service is registered in `AddWorkerProcessLauncher` ahead of `AddGatewaySessions` so cleanup runs before the server accepts sessions. `gateway.md` updated to describe the implemented behavior. Regression tests: `OrphanWorkerTerminatorTests` (`KillsWorkerProcessesMatchingConfiguredExecutablePath`, `KillsImageNameMatchWhenExecutablePathUnreadable`, `DoesNotKillUnrelatedProcessSharingImageName`, `DoesNotKillCurrentProcess`, `ContinuesWhenOneKillThrows`). + +### Server-003 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Security | +| Location | `src/MxGateway.Server/Dashboard/DashboardAuthorizationHandler.cs:39,54-59`, `src/MxGateway.Server/Dashboard/DashboardAuthenticator.cs:236-258` | +| Status | Resolved | + +**Description:** When `Dashboard:RequireAdminScope` is true (the default) and the request is not loopback, `DashboardAuthorizationHandler` succeeds only if `HasAdminScope` finds a claim of type `"scope"` with value `"admin"`. But `DashboardAuthenticator.CreatePrincipal` issues only `NameIdentifier`, `Name`, and `LdapGroupClaimType` claims — never a `scope`/`admin` claim. So a correctly LDAP-authenticated user who passed the required-group check is still denied dashboard access on any non-loopback connection. The bug is currently masked by the missing route-level enforcement (Server-001) and by `AllowAnonymousLocalhost`; fixing Server-001 would make the dashboard unusable for all real LDAP logins. 
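The claim mismatch described above can be reproduced in isolation with only `System.Security.Claims`. The following is a minimal sketch, not the gateway's actual code: the `Sketch` class, the `"ldap_group"` claim-type value, the user name, and the group name are all assumptions made for illustration. Only the `scope`/`admin` claim pair and the three claim kinds issued by `CreatePrincipal` come from the finding.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Claims;

// Hypothetical user and group, used only to drive the sketch.
var groups = new[] { "GatewayAdmins" };

// Pre-fix shape: identity and LDAP group claims, no scope claim.
var before = Sketch.CreatePrincipal("jdoe", groups, emitAdminScope: false);
// Post-fix shape: the same claims plus scope=admin.
var after = Sketch.CreatePrincipal("jdoe", groups, emitAdminScope: true);

Console.WriteLine(Sketch.HasAdminScope(before)); // False: group-validated user is still denied
Console.WriteLine(Sketch.HasAdminScope(after));  // True

static class Sketch
{
    // Assumed claim-type value; the real LdapGroupClaimType constant is not shown in this review.
    private const string LdapGroupClaimType = "ldap_group";

    public static ClaimsPrincipal CreatePrincipal(string user, string[] groups, bool emitAdminScope)
    {
        var claims = new List<Claim>
        {
            new Claim(ClaimTypes.NameIdentifier, user),
            new Claim(ClaimTypes.Name, user),
        };

        foreach (var group in groups)
        {
            claims.Add(new Claim(LdapGroupClaimType, group));
        }

        if (emitAdminScope)
        {
            claims.Add(new Claim("scope", "admin"));
        }

        return new ClaimsPrincipal(new ClaimsIdentity(claims, "Sketch"));
    }

    // Stand-in for the handler's check: a claim of type "scope" with value "admin".
    public static bool HasAdminScope(ClaimsPrincipal principal) =>
        principal.HasClaim("scope", "admin");
}
```

The sketch shows why passing the required-group check alone was not enough: the authorization side only ever looked at the `scope` claim, which the authentication side never issued.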
+ +**Recommendation:** Either have `DashboardAuthenticator.CreatePrincipal` add a `scope=admin` claim when the user is in the required group, or change `DashboardAuthorizationHandler.HasAdminScope` to evaluate LDAP group membership (reuse `IsMemberOfRequiredGroup` against the `LdapGroupClaimType` claims, as `DashboardApiKeyAuthorization.CanManage` already does). + +**Resolution:** Resolved in `a8aafdf` (2026-05-18): `DashboardAuthenticator.CreatePrincipal` — reached only after the required-group check passes — now emits the `scope=admin` claim that `DashboardAuthorizationHandler` checks, so group-validated LDAP users pass `RequireAdminScope` once route-level authorization (Server-001) is enforced. + +### Server-004 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs:227-233`, `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCliRunner.cs:53-77`, `src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:21-67` | +| Status | Resolved | + +**Description:** `ParseScopes` accepts any comma-separated strings and `CreateKeyAsync` persists them verbatim; neither the CLI nor the dashboard create path validates scopes against `GatewayScopes`. A typo or non-canonical name (e.g. CLAUDE.md's example `--scopes session,invoke,event,metadata,admin`, which does not match the resolver's `session:open`/`invoke:read`/etc.) silently creates a key whose scope strings the authorization resolver never checks for — the key is unusable for those RPCs with no error at creation time. + +**Recommendation:** Validate every requested scope against the `GatewayScopes` catalog at create time in both the CLI parser/runner and `DashboardApiKeyManagementService.ValidateCreateRequest`, rejecting unknown scope strings. + +**Resolution:** Resolved 2026-05-18. 
Confirmed against source: `ParseScopes` split unvalidated strings into the create command and `ValidateCreateRequest` checked only key id and display name. Added `GatewayScopes.All` (the canonical scope catalog) and `GatewayScopes.IsKnown(string)`. `ApiKeyAdminCommandLineParser.Parse` now runs `ValidateScopes` for create-key commands and fails the parse listing the unknown scope(s) and valid set; `DashboardApiKeyManagementService.ValidateCreateRequest` rejects requests carrying any non-canonical scope. Revoke/rotate paths are unaffected (no scope input). Regression tests: `ApiKeyAdminCommandLineParserTests.Parse_CreateKeyCommand_RejectsUnknownScope`, `Parse_CreateKeyCommand_AcceptsAllCanonicalScopes`, and `DashboardApiKeyManagementServiceTests.CreateAsync_UnknownScope_DoesNotCallStore`. + +### Server-005 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs:22-28`, `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:184` | +| Status | Resolved | + +**Description:** `GalaxyHierarchyCache.RefreshCoreAsync` only catches `SqlException` and `InvalidOperationException`. The initial `cache.RefreshAsync` call in `GalaxyHierarchyRefreshService.ExecuteAsync` is wrapped only for `OperationCanceledException`. A transient non-`SqlException` failure on the first refresh (e.g. a `Win32Exception`/`TimeoutException` from connection establishment, or another `DbException` subtype) escapes both layers, faults the `BackgroundService`, and — with default host behavior — stops the whole gateway. The periodic-tick loop does catch general exceptions, so only the first load is exposed. + +**Recommendation:** Broaden the `catch` in `RefreshCoreAsync` to all non-cancellation exceptions (record `Unavailable`/`Stale` and still complete `_firstLoad`), or wrap the initial `RefreshAsync` in `GalaxyHierarchyRefreshService` with the same general `catch` the tick loop uses. 
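A minimal sketch of the broadened catch, assuming a hypothetical `RefreshAsync` stand-in that fails the way a connection timeout would. The exception filter lets cancellation propagate while every other failure is recorded as degraded state instead of faulting the background service. The names and the string states here are illustrative, not the gateway's types.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the first cache refresh; it fails with a non-SQL exception.
static Task RefreshAsync(CancellationToken ct) =>
    Task.FromException(new TimeoutException("connection establishment timed out"));

static async Task<string> RunInitialLoadAsync(CancellationToken ct)
{
    try
    {
        await RefreshAsync(ct);
        return "Fresh";
    }
    catch (Exception ex) when (ex is not OperationCanceledException)
    {
        // Record the degraded state and let a later periodic tick retry.
        // OperationCanceledException does not match the filter and still propagates.
        Console.WriteLine($"initial refresh failed: {ex.GetType().Name}");
        return "Unavailable";
    }
}

Console.WriteLine(await RunInitialLoadAsync(CancellationToken.None));
```

Filtering on `is not OperationCanceledException` also covers `TaskCanceledException`, which derives from it, so host shutdown keeps its normal cancellation path.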
+ +**Resolution:** Resolved 2026-05-18. Confirmed against source: the initial `RefreshAsync` in `ExecuteAsync` was guarded only for `OperationCanceledException`, and `RefreshCoreAsync` filtered its catch to `SqlException or InvalidOperationException`. Both recommended layers applied: `GalaxyHierarchyRefreshService.ExecuteAsync` now catches every non-cancellation exception on the initial load (logs a warning; the periodic tick retries), and `GalaxyHierarchyCache.RefreshCoreAsync` broadens its catch to all non-cancellation exceptions so the cache still records `Stale`/`Unavailable` and completes `_firstLoad`. The now-unused `Microsoft.Data.SqlClient` using was removed. Regression test: `GalaxyHierarchyRefreshServiceTests.ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFaultBackgroundService`. + +### Server-006 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Server/Sessions/SessionManager.cs:84-114` | +| Status | Resolved | + +**Description:** In `OpenSessionAsync`, `_metrics.SessionOpened()` (line 89) increments the `_openSessions` gauge before `TryAutoSubscribeAlarmsAsync` runs. If auto-subscribe throws (which it does when `Alarms.RequireSubscribeOnOpen` is true and the worker rejects the subscription), the `catch` block disposes and removes the session and records `_metrics.Fault(...)` but never calls `SessionClosed`/`SessionRemoved`. The `mxgateway.sessions.open` gauge permanently over-counts by one for every such failed open. + +**Recommendation:** In the `catch` block, when the session had reached the point where `SessionOpened()` was recorded, also call `_metrics.SessionRemoved()` — or move the `SessionOpened()` call to after auto-subscribe succeeds. + +**Resolution:** Resolved 2026-05-18. 
Confirmed against source: the `catch` block in `OpenSessionAsync` recorded `Fault(...)` and removed the session but never decremented the open-session gauge after `SessionOpened()` had run. Added a `sessionOpenedRecorded` flag set immediately after `_metrics.SessionOpened()`; the `catch` block now calls `_metrics.SessionRemoved()` when that flag is set, restoring the gauge for a post-`SessionOpened()` failure (e.g. an auto-subscribe rejection with `RequireSubscribeOnOpen=true`). Regression test: `SessionManagerAlarmAutoSubscribeTests.OpenSessionAsync_DoesNotLeakOpenSessionGauge_WhenAutoSubscribeFailsWithRequireOn`. + +### Server-007 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Performance & resource management | +| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyProjector.cs:55-70` | +| Status | Resolved | + +**Description:** `Project` always iterates the full `entry.Index.ObjectViews` collection and re-applies all filters to skip `offset` matched items before collecting a page. Paging through a large Galaxy hierarchy is therefore O(total) per page and O(total²/pageSize) end-to-end. The cache is in-memory so impact is bounded, but for large galaxies repeated `DiscoverHierarchy` pagination wastes CPU. + +**Recommendation:** Precompute and cache the filtered, ordered view list per `(filterSignature, sequence)` so subsequent pages are an O(pageSize) slice; the existing filter signature already keys page tokens. + +**Resolution:** Resolved 2026-05-18. Confirmed against source: `Project` re-scanned and re-filtered the whole `ObjectViews` list on every page. Added a `ConditionalWeakTable` memo in `GalaxyHierarchyProjector`: the first projection of a given filter signature builds the filtered, ordered view list; subsequent pages take an O(pageSize) slice via index arithmetic.
The memo is keyed on the immutable cache-entry instance, so when the cache publishes a new entry the stale memo becomes unreachable and is reclaimed with it — no explicit invalidation. `ResolveRoot` still runs before the memo lookup so a missing root surfaces `NotFound` consistently. Regression tests: `GalaxyHierarchyProjectorTests` (`Project_PagedAcrossEntireHierarchy_ReturnsEveryObjectExactlyOnce`, `Project_DistinctFiltersOnSameEntry_DoNotShareMemoizedViewList`, `Project_SameFilterRepeated_ReturnsIdenticalTotals`, `Project_DistinctCacheEntries_ProjectAgainstTheirOwnData`); existing `GalaxyRepositoryGrpcServiceTests` paging tests continue to pass unchanged. + +### Server-008 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Performance & resource management | +| Location | `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:111-134,160-189` | +| Status | Resolved | + +**Description:** `WatchDeployEvents` calls `ResolveBrowseSubtrees()` on every streamed event, and `MapDeployEvent` re-runs `GalaxyHierarchyProjector.Project` over the entire cached hierarchy (and `Sum`s attribute counts) for every event of every constrained subscriber. `GalaxyGlobMatcher.IsMatch` also rebuilds the glob regex on each call. With many constrained subscribers and frequent deploys this is avoidable work. + +**Recommendation:** Hoist `ResolveBrowseSubtrees()` out of the loop; compute scoped object/attribute counts once per deploy sequence and cache by `(sequence, browseSubtrees)`; cache compiled glob `Regex` instances in `GalaxyGlobMatcher`. + +**Resolution:** Resolved 2026-05-18. Confirmed against source. Three changes: (1) `WatchDeployEvents` now resolves `ResolveBrowseSubtrees()` once before the streaming loop — the caller's identity and constraints are fixed for the stream lifetime, so per-event resolution was pure waste. 
(2) `GalaxyGlobMatcher` now caches compiled `Regex` instances in a `ConcurrentDictionary` keyed by glob pattern (with `RegexOptions.Compiled`), so the same handful of globs are translated once instead of on every `IsMatch` call. (3) The per-event `MapDeployEvent` re-projection is no longer a separate hot path: with finding Server-007 resolved, `GalaxyHierarchyProjector.Project` memoizes the filtered view list per `(cache entry, filter signature)`, so the scoped-count projection in `MapDeployEvent` for a constrained subscriber is O(matched-slice) after the first event of a given deploy sequence rather than a full re-scan — this subsumes the recommendation's `(sequence, browseSubtrees)` cache (the memo is keyed on the per-sequence cache-entry instance and the browse-subtree-bearing filter signature). Regression tests: `GalaxyFilterInputSafetyTests.GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect` (glob cache correctness); existing `WatchDeployEvents` and `GalaxyFilterInputSafetyTests` coverage continues to pass. + +### Server-009 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Server/Security/Authentication/AuthSqliteConnectionFactory.cs:15-32` | +| Status | Resolved | + +**Description:** Each auth-store operation opens a fresh `SqliteConnection` with no busy timeout, no WAL journal mode, and default journaling. `MarkKeyUsedAsync` runs on every authenticated request and `SqliteApiKeyAuditStore` appends on every denial; under concurrent load these writers can collide and surface `SQLITE_BUSY` as a hard failure on the request path. + +**Recommendation:** Set `Pooling`, a non-zero `DefaultTimeout`/`busy_timeout`, and enable WAL (`PRAGMA journal_mode=WAL`) once at startup so concurrent readers/writers degrade gracefully. + +**Resolution:** Resolved 2026-05-18. Confirmed against source: the connection string set only `DataSource` and `Mode`. 
`AuthSqliteConnectionFactory.CreateConnection` now also sets `Pooling = true` and a non-zero `DefaultTimeout`. A new `OpenConnectionAsync(CancellationToken)` opens the connection and applies `PRAGMA journal_mode=WAL` and `PRAGMA busy_timeout` (5 s); WAL is a persistent database-level setting so re-applying it per connection is a cheap no-op, while `busy_timeout` is per-connection state. All nine auth-store call sites (`SqliteApiKeyAdminStore`, `SqliteApiKeyAuditStore`, `SqliteApiKeyStore`, `SqliteAuthStoreMigrator`) were switched from `CreateConnection()` + `OpenAsync()` to `OpenConnectionAsync()`. `docs/Authentication.md` updated to describe the WAL/busy-timeout behavior. Regression test: `SqliteAuthStoreTests.OpenConnectionAsync_EnablesWalJournalModeAndBusyTimeout`. + +### Server-010 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Security | +| Location | `src/MxGateway.Server/Security/Authentication/SqliteApiKeyAdminStore.cs:91-114`, `src/MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:168-172` | +| Status | Resolved | + +**Description:** `RotateAsync` sets `revoked_utc = NULL`, so rotating a previously revoked key silently reactivates it. This is documented intentional behavior in `docs/Authentication.md:167`, but the dashboard renders the "Rotate" button unconditionally — including for keys whose status badge says "Revoked" — so an operator can un-revoke a deliberately disabled key without an explicit warning. + +**Recommendation:** Either hide/disable the Rotate action for revoked keys in `ApiKeysPage.razor`, require an explicit confirmation, or have `RotateAsync` preserve `revoked_utc` and add a separate explicit "reactivate" operation. + +**Resolution:** Resolved 2026-05-18. Confirmed against source: `ApiKeysPage.razor` rendered the Rotate button unconditionally while Revoke was already gated on `key.RevokedUtc is null`. 
Took the lowest-risk recommended option — the dashboard now renders the Rotate (and Revoke) actions only for keys whose status is `Active`; a revoked key shows a "No actions" placeholder, so an operator cannot un-revoke a deliberately disabled key as a side effect of a rotation. `RotateAsync`'s store-level behavior is unchanged (rotation by `key_id` still clears `revoked_utc`, which the CLI relies on); `docs/Authentication.md` updated to document both the store behavior and the dashboard restriction. No automated test added: the change is pure conditional Razor rendering and the test project has no bUnit component-rendering harness; the underlying `DashboardApiKeyManagementService` is already unit-tested. + +### Server-011 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:1-46` | +| Status | Resolved | + +**Description:** `WorkerAlarmRpcDispatcher` deviates from the module's conventions: it fully-qualifies `System.Guid`, `System.ArgumentNullException`, and `System.Threading` types inline instead of relying on `using` directives, and uses an explicit constructor with `this.`-qualified field assignment while the rest of the module (e.g. `ConstraintEnforcer`, `MxAccessGatewayService`, `GalaxyRepositoryGrpcService`) uses primary constructors. `docs/style-guides/CSharpStyleGuide.md` is authoritative for gateway code. + +**Recommendation:** Add the needed `using` directives, drop the inline fully-qualified names, and convert to a primary constructor for consistency. + +**Resolution:** Resolved 2026-05-18. Confirmed against source. Converted `WorkerAlarmRpcDispatcher` to a primary constructor with the standard `?? 
throw new ArgumentNullException(...)` field-initializer guard; dropped the inline `System.Guid` / `System.ArgumentNullException` qualifications (using implicit `using System;`); removed redundant `using System.Collections.Generic;` / `System.Threading` / `System.Threading.Tasks;` directives (covered by `ImplicitUsings`); replaced the two `if (... is null) throw new System.ArgumentNullException(...)` checks with `ArgumentNullException.ThrowIfNull`. The stale class-level XML doc comments ("Replaces NotWiredAlarmRpcDispatcher once ... wired in", "partially wired", "returns an Unimplemented diagnostic") were corrected to describe the actual GUID-vs-`Provider!Group.Tag` handling, overlapping with Server-014. No behavior change, so no new test; existing `WorkerAlarmRpcDispatcherTests` continue to pass and the project builds warning-free under `TreatWarningsAsErrors`. + +### Server-012 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `CLAUDE.md` (Authentication section and `apikey create` example) | +| Status | Resolved | + +**Description:** CLAUDE.md describes scopes as `session`, `invoke`, `event`, `metadata`, `admin` and shows `apikey create --scopes session,invoke,event,metadata,admin`. The actual canonical scope strings (used by `GatewayScopes`, `GatewayGrpcScopeResolver`, and `docs/Authorization.md`) are `session:open`, `session:close`, `invoke:read`, `invoke:write`, `invoke:secure`, `events:read`, `metadata:read`, `admin`. A key created per the CLAUDE.md example carries scopes the resolver never matches. + +**Recommendation:** Update CLAUDE.md's scope list and the `apikey` example to the canonical `*:*` scope strings, per CLAUDE.md's own rule that docs change with the code. + +**Resolution:** Resolved 2026-05-18. Confirmed against `GatewayScopes` (`session:open`, `session:close`, `invoke:read`, `invoke:write`, `invoke:secure`, `events:read`, `metadata:read`, `admin`).
CLAUDE.md's Build/Test/Run `apikey create` example and the Authentication-section scope list were both updated to the canonical `*:*` strings. (Note: since finding Server-004 was resolved, the old example would now be actively rejected at create time rather than silently creating an unusable key, making the doc correction load-bearing.) Pure documentation change; no test. + +### Server-013 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/MxGateway.Tests/Gateway/Dashboard/DashboardAuthorizationHandlerTests.cs`, `src/MxGateway.Tests/Gateway/GatewayApplicationTests.cs` | +| Status | Resolved | + +**Description:** `DashboardAuthorizationHandler` is unit-tested in isolation, but no test exercises the dashboard routes end-to-end to confirm the policy is actually enforced — which is why Server-001 (policy registered but never wired) went uncaught. There are also no tests for `WorkerExecutableValidator` (PE-header architecture parsing), `GalaxyGlobMatcher` (anchoring/escaping/empty-glob fail-open), or `GalaxyHierarchyProjector` pagination/page-token behavior. + +**Recommendation:** Add a `WebApplicationFactory` integration test that requests a dashboard page unauthenticated and asserts the redirect/401, plus unit tests for `WorkerExecutableValidator`, `GalaxyGlobMatcher`, and projector paging. + +**Resolution:** Resolved 2026-05-18. Re-triaged against the current test suite: three of the four named gaps were already closed. (1) The dashboard route-level enforcement test exists — `GatewayApplicationTests.Build_WhenDashboardEnabled_ComponentRoutesRequireAuthorization` (and `..._AuthEndpointsAllowAnonymousAccess`), added when Server-001 was fixed. 
(2) `GalaxyGlobMatcher` anchoring/escaping/empty-glob behavior is covered by `GalaxyFilterInputSafetyTests` (`GlobMatcher_TreatsSqlMetacharactersAsLiterals`, `GlobMatcher_DoesNotTreatLikeWildcardsAsWildcards`, `GlobMatcher_WithPathologicalInput_DoesNotHang`), now extended with `GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect`. (3) Projector pagination/page-token behavior is covered end-to-end by `GalaxyRepositoryGrpcServiceTests` and now directly by the new `GalaxyHierarchyProjectorTests`. The one genuine remaining gap — `WorkerExecutableValidator` PE-header parsing — was closed with the new `WorkerExecutableValidatorTests` (7 cases: matching/mismatched x86 and x64, missing `MZ` header, file too small, missing `PE` signature), exercising the validator against synthesized minimal PE fixtures. + +### Server-014 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:162-171,191-198,206-214,229-237` | +| Status | Resolved | + +**Description:** The XML `<remarks>` and inline comments on `AcknowledgeAlarm` and `QueryActiveAlarms` describe the alarm path as not yet wired and say `NotWiredAlarmRpcDispatcher` is the default ("Clients calling this method today receive an OK reply with a 'worker alarm path not yet wired' diagnostic", "an empty stream until PR A.2"). In fact `SessionServiceCollectionExtensions.AddGatewaySessions` registers `WorkerAlarmRpcDispatcher` as `IAlarmRpcDispatcher`, so DI always injects the production dispatcher; `NotWiredAlarmRpcDispatcher` is only the null fallback. The comments are stale and misleading. + +**Recommendation:** Update the `AcknowledgeAlarm`/`QueryActiveAlarms` remarks to reflect that `WorkerAlarmRpcDispatcher` is the wired default, and describe its actual GUID-vs-`Provider!Group.Tag` handling. + +**Resolution:** Resolved 2026-05-18.
Confirmed against source: `SessionServiceCollectionExtensions` registers `WorkerAlarmRpcDispatcher` as `IAlarmRpcDispatcher`, so the "not yet wired" / "empty stream until PR A.2" / "PR A.6/A.7 follow-up" prose in the `AcknowledgeAlarm` and `QueryActiveAlarms` `<remarks>` and inline comments was stale. Rewrote both `<remarks>` blocks and both inline comments to state that DI binds the production `WorkerAlarmRpcDispatcher`, that it routes over the worker pipe IPC, and that `AcknowledgeAlarm` handles a canonical-GUID reference (→ `AcknowledgeAlarmCommand`) and a `Provider!Group.Tag` reference (→ `AcknowledgeAlarmByNameCommand`), with `NotWiredAlarmRpcDispatcher` being only the null fallback. The matching stale `WorkerAlarmRpcDispatcher` class-level XML doc was corrected as part of Server-011. Pure documentation/comment change; no test. + +### Server-015 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:8-15,266-308,720-775` | +| Status | Resolved | + +**Description:** `GatewaySession` guards its mutable state with two different sync primitives. `TransitionTo`, `MarkFaulted`, `TouchClientActivity`, the `State`/`LastClientActivityAt`/`LeaseExpiresAt`/`FinalFault`/`ActiveEventSubscriberCount` getters, `AttachWorkerClient`, and `IsLeaseExpired` all read/write `_state`, `_finalFault`, `_lastClientActivityAt`, `_leaseExpiresAt`, `_workerClient`, and `_activeEventSubscriberCount` under `_syncRoot`. `CloseAsync` (lines 720-775), however, reads `_state` at line 729 and writes `_state` at lines 736 (`SessionState.Closing`) and 761 (`SessionState.Closed`) while only holding the `_closeLock` `SemaphoreSlim` — `_syncRoot` is never acquired. A concurrent `TransitionTo` or `MarkFaulted` on another thread therefore races with writes to `_state` made outside the lock that protects it, and the `State` getter is not guaranteed to observe the `Closing`/`Closed` writes promptly.
`SemaphoreSlim.WaitAsync`/`Release` do happen to provide memory barriers in practice, but the locking discipline is split across two primitives, which is fragile and defeats the audit value of "all `_state` access is guarded by `_syncRoot`". Concretely, the race between `CloseAsync` setting `_state = Closing` and a concurrent `TransitionTo(Ready)` is unordered — and `TransitionTo` will happily overwrite `Closing` back to `Ready` because its only guard is "do not overwrite `Closed`/`Faulted`". + +**Recommendation:** Make `CloseAsync` mutate `_state` through the existing `TransitionTo(...)` helper (or acquire `_syncRoot` around the reads/writes) so all `_state` access uses the same lock. Either extend `TransitionTo` to accept the `Closing` and `Closed` transitions (it already handles `Faulted`/`Closed` precedence) or refactor `CloseAsync` to call a private `TrySetClosing()` / `MarkClosed()` that locks `_syncRoot`. Add a regression test that forces a `TransitionTo(Ready)` after `CloseAsync` has set `Closing` and asserts the session does not flip back to `Ready`. + +**Resolution:** 2026-05-20 — Unified the close path on `_syncRoot`. `GatewaySession.CloseAsync` (`src/MxGateway.Server/Sessions/GatewaySession.cs`) now mutates `_state` only through two private `_syncRoot`-locked helpers — `TryBeginClose` (writes `Closing`, returns the prior `_closeStarted`) and `MarkClosed` (writes `Closed`) — so every `_state` read/write in the session uses the same lock; `_closeLock` keeps its role of serializing concurrent close attempts. `TransitionTo` was tightened to refuse a transition out of `Closing` to anything other than `Closed`/`Faulted` so a late lifecycle callback cannot walk a closing session back to `Ready`. `docs/Sessions.md` updated to describe the unified lock discipline and the extended terminal precedence. 
Regression tests in `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs`: `TransitionTo_AfterCloseStarted_DoesNotOverwriteClosing` (the named scenario — `BlockingShutdownWorkerClient` parks the close inside `worker.ShutdownAsync` so the test can call `TransitionTo(Ready)` between the `Closing` and `Closed` writes and assert the state stays `Closing`) and `MarkFaulted_AfterCloseCompletes_DoesNotResurrectSession`. + +### Server-016 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:790-797`, `src/MxGateway.Server/Sessions/SessionManager.cs:237-258` | +| Status | Resolved | + +**Description:** `GatewaySession.DisposeAsync` synchronously calls `_closeLock.Dispose()` (line 792) without first acquiring the lock and without checking whether a `CloseAsync` is still in flight. The normal call path is `SessionManager.CloseSessionCoreAsync` → `session.CloseAsync(...)` → `RemoveSessionAsync` → `DisposeAsync`, where `DisposeAsync` runs strictly after `CloseAsync` completes. But the `ShutdownAsync` path (`SessionManager.cs:237-258`) and any future caller that disposes a session while another thread is still inside `CloseAsync` will trip `ObjectDisposedException` when the in-flight `CloseAsync` releases the semaphore. The race is narrow today because all `Close`/`Dispose` choreography goes through `SessionManager`, but the class-level contract is broken: nothing on `GatewaySession` documents or enforces "DisposeAsync must not be called concurrently with CloseAsync". + +**Recommendation:** In `DisposeAsync`, either (a) take and release `_closeLock` once before disposing it, so the dispose is sequenced after any in-flight close, or (b) replace `_closeLock` disposal with a guard flag and let the semaphore be reclaimed by the finalizer. Document the invariant on the public method. 
Add a regression test that disposes a session whose `CloseAsync` has not yet completed and asserts no `ObjectDisposedException`. + +**Resolution:** 2026-05-20 — Took recommendation (a): `GatewaySession.DisposeAsync` (`src/MxGateway.Server/Sessions/GatewaySession.cs`) now acquires `_closeLock` once before disposing the semaphore so an in-flight `CloseAsync` finishes (its `_closeLock.Release()`) before the dispose tears the semaphore down. The wait is non-cancellable (`CancellationToken.None`) and `ObjectDisposedException` is swallowed at both the wait and the dispose site so double-dispose still completes cleanly. The method's XML doc was extended with a block stating the invariant. Regression tests in `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs`: `DisposeAsync_WhileCloseInFlight_WaitsForCloseAndDoesNotThrow` (parks `CloseAsync` inside the worker shutdown, calls `DisposeAsync` concurrently, releases shutdown, asserts both complete without `ObjectDisposedException` and the worker is disposed exactly once) and `DisposeAsync_CalledTwice_DoesNotThrow`. + +### Server-017 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Security | +| Location | `src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:13-27`, `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:173-247`, `docs/Authorization.md:108-110` | +| Status | Resolved | + +**Description:** The two new top-level RPCs added to `MxAccessGateway` — `AcknowledgeAlarm(AcknowledgeAlarmRequest)` and `QueryActiveAlarms(QueryActiveAlarmsRequest)` (proto lines 23-24) — are not enumerated by `GatewayGrpcScopeResolver.ResolveRequiredScope`. The resolver's `request switch` covers `OpenSessionRequest`, `CloseSessionRequest`, `StreamEventsRequest`, `MxCommandRequest`, and the four Galaxy-repository requests; everything else falls through to `_ => GatewayScopes.Admin`.
The interceptor (`GatewayGrpcAuthorizationInterceptor.AuthenticateAndAuthorizeAsync`) then rejects any non-admin caller with `PermissionDenied`. This is technically fail-closed (and `docs/Authorization.md:108-110` documents the "unrecognized → admin" intent), but in practice it means: (1) only API keys with the `admin` scope can acknowledge alarms or query active alarms, even though acknowledging is naturally an `invoke:write`-shaped operation and querying is naturally an `invoke:read`- or `metadata:read`-shaped operation; (2) the alarm RPCs ship in a state where any client that successfully opened a session and subscribed to alarm events still cannot perform the operational acks the contract advertises; (3) the test matrix `GatewayGrpcScopeResolverTests` does not even cover these two request types, so the gap was not caught at unit-test time. + +**Recommendation:** Add explicit arms to `ResolveRequiredScope`: map `AcknowledgeAlarmRequest` to `GatewayScopes.InvokeWrite` (parity with other write actions; ack changes alarm state) and `QueryActiveAlarmsRequest` to `GatewayScopes.MetadataRead` or `GatewayScopes.InvokeRead`. Update `docs/Authorization.md` to list both. Extend `GatewayGrpcScopeResolverTests` with the new mappings and an assertion that every request type defined by `mxaccess_gateway.proto` is named in the resolver (the test can enumerate the assembly's request types so a future RPC cannot quietly add itself only via the admin fallback). + +**Resolution:** 2026-05-20 — Added explicit `AcknowledgeAlarmRequest => GatewayScopes.InvokeWrite` and `QueryActiveAlarmsRequest => GatewayScopes.EventsRead` arms to `GatewayGrpcScopeResolver.ResolveRequiredScope` (`src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:21-22`). 
`InvokeWrite` matches the existing `MxCommandKind.Write*` mapping because ack mutates alarm state; `EventsRead` matches `StreamEventsRequest` and `MxCommandKind.DrainEvents` because querying active alarms reads the same alarm/event surface. Extended `GatewayGrpcScopeResolverTests` with two new `InlineData` rows covering both request types (`src/MxGateway.Tests/Security/Authorization/GatewayGrpcScopeResolverTests.cs:16-17`) and added four interceptor-level cases in `GatewayGrpcAuthorizationInterceptorTests` (`UnaryServerHandler_AcknowledgeAlarmMissingScope_ReturnsPermissionDenied`, `UnaryServerHandler_AcknowledgeAlarmWithScope_RunsHandler`, `ServerStreamingServerHandler_QueryActiveAlarmsMissingScope_ReturnsPermissionDenied`, `ServerStreamingServerHandler_QueryActiveAlarmsWithScope_RunsHandler`) proving each new RPC denies callers lacking the chosen scope and runs the handler when the scope is held. Updated `docs/Authorization.md` (resolver snippet and Scope Catalog table) to list both RPCs against their scopes. `dotnet test ... --filter FullyQualifiedName~GatewayGrpcAuthorizationInterceptorTests` → 14 passed, 0 failed; resolver tests 28 passed, 0 failed. + +### Server-018 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Performance & resource management | +| Location | `src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:15` | +| Status | Resolved | + +**Description:** `GalaxyGlobMatcher.RegexCache` is a `ConcurrentDictionary` keyed by glob pattern, with no eviction. The fix for Server-008 added this cache deliberately to avoid recompiling the same handful of patterns, but the cache key is the raw glob string. 
The patterns currently come from two sources — `DiscoverHierarchyRequest.TagNameGlob` (client-supplied) and `ApiKeyConstraints.BrowseSubtrees` / `ReadSubtrees` / `WriteSubtrees` / `ReadTagGlobs` / `WriteTagGlobs` (admin-configured) — and `BuildRegex` also runs each glob through `Regex.Escape` so an attacker cannot craft a denial-of-service ReDoS payload. The leak is therefore bounded only by "how many distinct globs a client can submit over the process lifetime", which is in the millions for `TagNameGlob` if a client iterates through generated names. Each compiled `Regex` also holds a JIT'd assembly that is non-trivial to reclaim. + +**Recommendation:** Cap the cache at a small bound (e.g. 256 patterns) using a simple LRU or a `MemoryCache` with sliding expiration, or restrict the cache to globs that originate from API-key constraints (admin-controlled, naturally bounded) and pay the compile cost for client-supplied globs. Add a test that fills the cache with thousands of distinct globs and asserts the cache size stays bounded. + +**Resolution:** 2026-05-20 — Capped `GalaxyGlobMatcher`'s compiled-regex cache at `RegexCacheCapacity = 256` entries with FIFO-by-insertion eviction (`src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs`). A `ConcurrentQueue` tracks insertion order; when the cache grows past the cap, `EvictIfOverCapacity` takes a small lock and dequeues + removes the oldest entries until the count is back within bound. Reads stay lock-free (the lock guards only the eviction path). Internal `CurrentCacheSize` / `RegexCacheCapacity` accessors are surfaced through the existing `InternalsVisibleTo("MxGateway.Tests")` so tests can assert the bound. Regression test: `GalaxyFilterInputSafetyTests.GlobMatcher_WithManyDistinctPatterns_CacheStaysBounded` submits `RegexCacheCapacity * 4` distinct globs and asserts `CurrentCacheSize` stays in `[0, RegexCacheCapacity]`. 
Existing glob correctness tests (`GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect`, the adversarial-input theories) continue to pass, confirming eviction does not corrupt lookups. + +### Server-019 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:183-221` | +| Status | Resolved | + +**Description:** `WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` returns `yield break` (line 191) when `sessionRegistry.TryGet(request.SessionId, ...)` fails — it silently produces an empty stream with no diagnostic. The peer `AcknowledgeAsync` instead returns an `AcknowledgeAlarmReply` with `ProtocolStatus.Code = SessionNotFound` (lines 81-89), so the two methods have inconsistent missing-session handling. In production this branch is unreachable because `MxAccessGatewayService.QueryActiveAlarms` calls `ResolveSession(...)` first and throws `NotFound` from the gRPC layer (`MxAccessGatewayService.cs:228`), but: (a) the dispatcher is the seam other code paths might reach in the future, and (b) any unit test that instantiates the dispatcher directly with a missing session id sees an empty stream rather than a clear error, which is a footgun. + +**Recommendation:** Either throw a `SessionManagerException(SessionManagerErrorCode.SessionNotFound, ...)` (matching the gRPC service's own resolver) or yield a single `ActiveAlarmSnapshot` with a diagnostic field set, and add a `WorkerAlarmRpcDispatcherTests` case that asserts whichever shape is chosen. Aligning with `AcknowledgeAsync`'s `SessionNotFound` protocol-status pattern is preferred, but `QueryActiveAlarms` is a server-streaming RPC so a thrown `SessionManagerException` propagated by the gateway is the cleaner fit. 
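
The two missing-session shapes discussed in the Server-019 recommendation can be contrasted in a small, self-contained sketch. It is not the gateway's code: `SessionNotFoundSketchException`, `MissingSessionSketch`, and the dictionary-backed session lookup are hypothetical stand-ins, assumed only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for SessionManagerException(SessionNotFound, ...).
public sealed class SessionNotFoundSketchException : Exception
{
    public SessionNotFoundSketchException(string sessionId)
        : base($"Session {sessionId} was not found.")
    {
    }
}

public static class MissingSessionSketch
{
    // Silent shape: an unknown session is indistinguishable from a known
    // session that simply has no active alarms.
    public static async IAsyncEnumerable<string> QuerySilentAsync(
        IReadOnlyDictionary<string, string[]> sessions,
        string sessionId)
    {
        if (!sessions.TryGetValue(sessionId, out var alarms))
        {
            yield break;
        }

        foreach (var alarm in alarms)
        {
            await Task.Yield();
            yield return alarm;
        }
    }

    // Throwing shape: the caller gets an explicit error for a missing session.
    public static async IAsyncEnumerable<string> QueryThrowingAsync(
        IReadOnlyDictionary<string, string[]> sessions,
        string sessionId)
    {
        if (!sessions.TryGetValue(sessionId, out var alarms))
        {
            throw new SessionNotFoundSketchException(sessionId);
        }

        foreach (var alarm in alarms)
        {
            await Task.Yield();
            yield return alarm;
        }
    }

    // Drains a stream into a list so a test can inspect the result.
    public static async Task<List<string>> DrainAsync(IAsyncEnumerable<string> stream)
    {
        var items = new List<string>();
        await foreach (var item in stream)
        {
            items.Add(item);
        }

        return items;
    }
}
```

Because C# async iterators are lazy, the throwing variant raises its exception on the first `MoveNextAsync`, not when the method is called, so a test has to enumerate the stream to observe the error.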
+ +**Resolution:** 2026-05-20 — Took the thrown-exception option: `WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` (`src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs`) now throws `SessionManagerException(SessionManagerErrorCode.SessionNotFound, ...)` instead of `yield break`-ing when the session is missing. `MxAccessGatewayService.MapException` already maps that error code to gRPC `NotFound`, so production callers see a consistent missing-session response and a direct unit-test caller now gets a clear error instead of an empty success. The unary peer `AcknowledgeAsync` continues to surface the same condition as an in-band `ProtocolStatus.Code = SessionNotFound`, which is correct for a unary RPC. Regression test: `WorkerAlarmRpcDispatcherTests.QueryActiveAlarmsAsync_WhenSessionMissing_ThrowsSessionNotFound` replaces the prior `_YieldsEmpty` assertion — it asserts the new exception shape and also exercises `AcknowledgeAsync` with the same missing session id to pin the peer method's in-band behaviour. + +### Server-020 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor:1-2`, `…/GalaxyPage.razor:1-2`, `…/ApiKeysPage.razor:1-2`, `…/EventsPage.razor:1-2`, `…/SessionsPage.razor:1-2`, `…/WorkersPage.razor:1-2`, `…/SettingsPage.razor:1-2`, `…/SessionDetailsPage.razor:1-2` | +| Status | Resolved | + +**Description:** Every dashboard page declares two `@page` directives — `@page "/X"` AND `@page "/dashboard/X"` — even though `DashboardEndpointRouteBuilderExtensions.MapGatewayDashboard` mounts the Razor components under a `RouteGroupBuilder` with `pathBase = "/dashboard"`. The group prefix is prepended to each `@page` route, so the actual endpoints become `/dashboard/X` (from `@page "/X"`) **and** `/dashboard/dashboard/X` (from `@page "/dashboard/X"`). The pages are reachable at two URLs each, and the deeper one (`/dashboard/dashboard/sessions` etc.)
is almost certainly accidental — it leaks the path-base name into the URL and creates duplicate authorize/render work per route. `GatewayApplicationTests.Build_WhenDashboardEnabled_ComponentRoutesRequireAuthorization` only checks the `/dashboard/X` shape, so the duplicate route slipped through without an assertion. + +**Recommendation:** Drop the `@page "/dashboard/X"` directive from each page; rely on the `MapGroup("/dashboard")` to provide the prefix. Or, if the team genuinely wants both URL shapes, document the choice in the file header and extend the route-enumeration test to assert that **both** are present (and both carry the authorization policy). Either way, the current setup is non-obvious. + +**Resolution:** 2026-05-20 — Took the recommended drop: removed the redundant `@page "/dashboard/X"` directive from every dashboard Razor page (`DashboardHome.razor`, `SessionsPage.razor`, `WorkersPage.razor`, `EventsPage.razor`, `GalaxyPage.razor`, `SettingsPage.razor`, `ApiKeysPage.razor`, `SessionDetailsPage.razor`). Each page now declares only its bare route (e.g. `@page "/sessions"`); `DashboardEndpointRouteBuilderExtensions.MapGatewayDashboard` continues to prepend `/dashboard` via `MapGroup`, so each page is reachable at exactly one URL (`/dashboard/X`). Regression test: `GatewayApplicationTests.Build_WhenDashboardEnabled_DoesNotRegisterDoubledDashboardPrefixRoutes` enumerates the eight previously-doubled routes (`/dashboard/dashboard/`, `/dashboard/dashboard/sessions`, ... `/dashboard/dashboard/sessions/{SessionId}`) and asserts none of them are mapped. The existing `..._MapsBlazorDashboardAndAuthEndpoints` / `..._ComponentRoutesRequireAuthorization` tests continue to verify the desired `/dashboard/X` shapes are still present and policy-gated. No public URL contract changed (the doubled shape was accidental); no doc update needed — `gateway.md` and `docs/GatewayDashboardDesign.md` never referenced the doubled routes. 
+ +### Server-021 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:266-664`, `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs` | +| Status | Resolved | + +**Description:** The 1cd51bb commit history (the bulk read/write series, `f220908`/`5e375f6`/`758aca2`) added 473 lines of constraint-filtering and reply-merging logic to `MxAccessGatewayService`: `ApplyConstraintsAsync` (line 266), `EnforceReadTagAsync` / `EnforceWriteHandleAsync`, `FilterTagBulkAsync` / `FilterReadBulkAsync` / `FilterWriteBulkAsync` / `FilterHandleBulkAsync`, the `ReplaceWriteBulkEntries` switch, and three concrete `BulkConstraintPlan` records (`SubscribeBulkConstraintPlan`, `WriteBulkConstraintPlan`, `ReadBulkConstraintPlan`) that splice denied entries back into the worker's allowed-only reply in original-index order. None of this is covered by `MxAccessGatewayServiceTests` — its `FakeSessionManager` is wired with an `AllowAllConstraintEnforcer` (line 430) that never denies anything, so every constraint-related code path is dead at test time. A subtle off-by-one in `BuildMerged`, a wrong `PayloadOneofCase` in `GetPayload` / `SetPayload`, or a missing case in `ReplaceWriteBulkEntries` would all ship without a test failure. + +**Recommendation:** Add `MxAccessGatewayServiceTests` cases that inject a deny-on-glob `IConstraintEnforcer` and exercise: (1) `AddItemBulk` / `SubscribeBulk` / `AdviseItemBulk` with a mix of allowed and denied tags, asserting `BulkSubscribeReply.Results` interleaves denied and worker-allowed entries in original-index order; (2) the same for `ReadBulk` and each of the four bulk-write commands; (3) `HasAllowedItems == false` so `CreateDeniedReply` is exercised (no worker call); (4) the unary `Write`/`Write2`/`WriteSecured`/`WriteSecured2` paths through `EnforceWriteHandleAsync`. 
The fixtures can reuse the existing `FakeSessionManager` by replacing the constraint enforcer; no live worker is needed. + +**Resolution:** 2026-05-20 — Added a configurable `PredicateConstraintEnforcer` test double (`src/MxGateway.Tests/TestSupport/PredicateConstraintEnforcer.cs`) that denies on per-tag and per-handle predicates and records denials. Added 11 new tests in `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceConstraintTests.cs` covering: (1) `AddItemBulk` with mixed denials — asserts the worker is called once with only the allowed subset and the merged reply interleaves denied and worker-allowed `SubscribeResult`s at their original indices; (2) `SubscribeBulk` with every tag denied — asserts `HasAllowedItems` short-circuits `CreateDeniedReply` and the session manager is never invoked; (3) `AdviseItemBulk` (handle-keyed denial via `CheckReadHandleAsync`); (4) `SubscribeBulk` with the allow-all enforcer — pass-through regression guard; (5) `ReadBulk` partial denial — asserts the `BulkReadConstraintPlan` produces a `BulkReadReply` (not a `BulkSubscribeReply`) with denied entries spliced in at their original indices; (6) `ReadBulk` all-denied short-circuit; (7) `WriteBulk` partial denial — asserts denied entries are dropped from the forwarded `Entries` and the merged reply preserves original-index order; (8) `WriteSecuredBulk` all-denied — proves the second `ReplaceWriteBulkEntries` switch arm is reachable; (9) unary `Write` with denied handle → `PermissionDenied`, no worker call, denial recorded; (10) unary `WriteSecured` with denied handle → `PermissionDenied`; (11) unary `AddItem` with denied tag → `PermissionDenied` (`EnforceReadTagAsync`). `MxAccessGatewayServiceTests.CreateService` updated to accept an `IConstraintEnforcer` so future tests can opt into the deny enforcer without duplicating the wiring. All 11 new tests pass; full suite (`dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj`) is green at 458 passing. 
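
The index-ordered merge that the Server-021 tests pin can be sketched as follows. This is an illustration under assumptions, not the service's `BulkConstraintPlan` code: `BulkMergeSketch`, `Partition`, and `Merge` are hypothetical names, and plain strings stand in for the real request and result messages.

```csharp
using System;
using System.Collections.Generic;

public static class BulkMergeSketch
{
    // Splits a bulk request into the allowed subset (what would be forwarded
    // to the worker) and the original indices of the denied entries.
    public static (List<string> Allowed, SortedSet<int> DeniedIndices) Partition(
        IReadOnlyList<string> tags,
        Func<string, bool> isAllowed)
    {
        var allowed = new List<string>();
        var denied = new SortedSet<int>();
        for (var i = 0; i < tags.Count; i++)
        {
            if (isAllowed(tags[i]))
            {
                allowed.Add(tags[i]);
            }
            else
            {
                denied.Add(i);
            }
        }

        return (allowed, denied);
    }

    // Rebuilds a reply of the original length: denied slots get a synthetic
    // result, and worker results fill the remaining slots in order.
    public static List<string> Merge(
        int originalCount,
        ISet<int> deniedIndices,
        IReadOnlyList<string> workerResults,
        Func<int, string> deniedResult)
    {
        if (workerResults.Count != originalCount - deniedIndices.Count)
        {
            throw new ArgumentException("Worker reply does not match the allowed subset.");
        }

        var merged = new List<string>(originalCount);
        var nextWorkerResult = 0;
        for (var i = 0; i < originalCount; i++)
        {
            merged.Add(deniedIndices.Contains(i)
                ? deniedResult(i)
                : workerResults[nextWorkerResult++]);
        }

        return merged;
    }
}
```

The length check in `Merge` is the kind of guard that would catch an off-by-one between the forwarded subset and the worker reply; whether the real plan performs such a check is not stated in the source.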
+ +### Server-022 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs:8-29` | +| Status | Resolved | + +**Description:** Server-014's resolution noted that the stale "PR A.6 / A.7" / "not yet wired" language was rewritten on `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms` and on the `WorkerAlarmRpcDispatcher` class doc. The corresponding XML doc on the **interface** `IAlarmRpcDispatcher` (lines 8-29) still says it is "PR A.6 / A.7 — gateway-side dispatcher" and that "Production implementations live in `WorkerAlarmRpcDispatcher` (this PR ships a not-yet-wired default that returns a clear worker-pending diagnostic)". That second clause directly contradicts the now-correct comments on the concrete implementations and on the gRPC service: `WorkerAlarmRpcDispatcher` is the wired default, not a not-yet-wired one. A reader who finds the interface first will believe the dispatcher is non-functional. + +**Recommendation:** Rewrite the `IAlarmRpcDispatcher` `<remarks>` block to match the language now used on `WorkerAlarmRpcDispatcher` and on the gRPC service: DI binds `WorkerAlarmRpcDispatcher` by default; `NotWiredAlarmRpcDispatcher` is only the null fallback for tests/DI omission. Drop the "PR A.6 / A.7" prefix from the `<summary>` — the interface is now the public alarm-RPC seam.
+ +**Resolution:** 2026-05-20 — Rewrote `IAlarmRpcDispatcher`'s `<summary>` and `<remarks>` (`src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs`) to match the language now used on `WorkerAlarmRpcDispatcher` and on `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms`: dropped the stale "PR A.6 / A.7" prefix from the summary, and replaced the "this PR ships a not-yet-wired default that returns a clear worker-pending diagnostic" clause with the correct statement that DI binds the production `WorkerAlarmRpcDispatcher` by default and `NotWiredAlarmRpcDispatcher` is only the null fallback for DI omission / standalone tests. Pure documentation change; no test. + +### Server-023 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Server/Sessions/NotWiredAlarmRpcDispatcher.cs:10-26` | +| Status | Resolved | + +**Description:** Server-014 and Server-022 swept the stale "PR A.6 / A.7" / "not-yet-wired" / "worker-pending" language off `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms`, `WorkerAlarmRpcDispatcher`, and `IAlarmRpcDispatcher`. The concrete `NotWiredAlarmRpcDispatcher` class XML doc was not updated as part of either fix and still reads: *"PR A.6 / A.7 — default `IAlarmRpcDispatcher` shipped while the worker-side AlarmClient event subscription is gated on dev-rig validation"* and *"When the worker dispatcher (PR A.6/A.7 dev-rig follow-up) lands, `WorkerAlarmRpcDispatcher` replaces this implementation in the DI container"*. That is the exact prose the other sweeps removed, and it directly contradicts the now-current narrative everywhere else: `SessionServiceCollectionExtensions.AddGatewaySessions` registers `WorkerAlarmRpcDispatcher` as the default `IAlarmRpcDispatcher`; `NotWiredAlarmRpcDispatcher` is only the null fallback used when no dispatcher is registered (DI omission / standalone tests).
The diagnostic string returned by `AcknowledgeAsync` (line 39) — `"the worker-side AlarmClient consumer (PR A.5) is in place but the dispatcher hookup is gated on validating the AVEVA alarm-provider event subscription on the dev rig"` — is also stale; the dispatcher hookup landed and any client that actually sees that diagnostic today is hitting the null-fallback path, not the dev-rig gate it describes. + +**Recommendation:** Replace the `<summary>` and `<remarks>` on `NotWiredAlarmRpcDispatcher` with text that matches the language now used on the interface and `WorkerAlarmRpcDispatcher` — "null fallback `IAlarmRpcDispatcher` used when no dispatcher is registered (DI omission / standalone tests); production wires `WorkerAlarmRpcDispatcher`." Either drop the `AcknowledgeAsync` diagnostic string's dev-rig framing entirely or shorten it to "alarm dispatcher is not registered." `#pragma warning disable CS1998` on `QueryActiveAlarmsAsync` is correct here (empty stream is intentional for the null fallback) and should stay. + +**Resolution:** 2026-05-20 — Rewrote `NotWiredAlarmRpcDispatcher` summary/remarks as the null-fallback dispatcher and shortened the `AcknowledgeAsync` diagnostic to "Alarm dispatcher is not registered."; updated the two tests that asserted the old "worker"-prefixed diagnostic. + +### Server-024 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:56-77` | +| Status | Resolved | + +**Description:** `GetOrCreateRegex`'s race-loser branch reads `RegexCache[glob]` with an indexer (line 76) after `TryAdd` returned `false`. The indexer throws `KeyNotFoundException` if the key is missing.
Under the new bounded cache (Server-018), there is a real — if narrow — race where the key vanishes between the failing `TryAdd` and the indexer read: thread A and thread B both compile a `Regex` for `glob`; A's `TryAdd` succeeds, A enqueues + enters `EvictIfOverCapacity`, the eviction loop dequeues `glob` (because some other thread had already enqueued + evicted enough that `glob` is now the oldest entry) and removes it; thread B's `TryAdd` then returns false, B reads `RegexCache[glob]`, and the indexer throws. The window is tiny but nonzero — eviction is approximate FIFO, and a hot pattern that is repeatedly re-added near the cap is the natural trigger. The same pre-Server-018 code used `GetOrAdd`, which had no such race because the dictionary handled the rebuild atomically. + +**Recommendation:** Replace the `TryAdd` + indexer pair with `RegexCache.GetOrAdd(glob, _ => compiled)` so the dictionary atomically returns whichever instance won. Track the new insertion only when `GetOrAdd` returns the locally-compiled instance (`ReferenceEquals(result, compiled)`), then enqueue + evict. Alternatively, swap the trailing indexer read for `TryGetValue` + recursive recompile on miss. Add a stress test that mixes repeated reads of a single hot pattern with a flood of unique patterns near the cap and asserts no exception escapes `IsMatch`. + +**Resolution:** 2026-05-20 — Replaced the `TryAdd` + indexer pair with `RegexCache.GetOrAdd(glob, compiled)`; FIFO enqueue + eviction now run only when `ReferenceEquals(result, compiled)` (i.e. our caller was the inserter), eliminating the post-eviction `KeyNotFoundException` window. 
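
A minimal sketch of the pattern the Server-018 and Server-024 resolutions describe: a capped cache with FIFO-by-insertion eviction, where only the thread whose instance won `GetOrAdd` enqueues and evicts. It is not the `GalaxyGlobMatcher` source. `BoundedGlobCacheSketch`, the `*` / `?` glob translation, and the one-second match timeout are assumptions made for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

public static class BoundedGlobCacheSketch
{
    public const int Capacity = 256;

    private static readonly ConcurrentDictionary<string, Regex> Cache =
        new ConcurrentDictionary<string, Regex>(StringComparer.Ordinal);
    private static readonly ConcurrentQueue<string> InsertionOrder =
        new ConcurrentQueue<string>();
    private static readonly object EvictionLock = new object();

    public static int CurrentSize => Cache.Count;

    public static bool IsMatch(string glob, string input) =>
        GetOrCreate(glob).IsMatch(input);

    private static Regex GetOrCreate(string glob)
    {
        // Fast path: cache hits take no lock.
        if (Cache.TryGetValue(glob, out var cached))
        {
            return cached;
        }

        var compiled = new Regex(
            ToPattern(glob),
            RegexOptions.CultureInvariant,
            TimeSpan.FromSeconds(1));

        // GetOrAdd returns whichever instance won the race, so there is no
        // follow-up indexer read that could miss after an eviction.
        var result = Cache.GetOrAdd(glob, compiled);
        if (ReferenceEquals(result, compiled))
        {
            // Only the inserting caller records the key and triggers eviction.
            InsertionOrder.Enqueue(glob);
            EvictIfOverCapacity();
        }

        return result;
    }

    private static void EvictIfOverCapacity()
    {
        if (Cache.Count <= Capacity)
        {
            return;
        }

        lock (EvictionLock)
        {
            while (Cache.Count > Capacity && InsertionOrder.TryDequeue(out var oldest))
            {
                Cache.TryRemove(oldest, out _);
            }
        }
    }

    // Assumed glob syntax: '*' matches any run, '?' matches one character.
    // Everything else is escaped so it is matched literally.
    private static string ToPattern(string glob) =>
        "^" + Regex.Escape(glob).Replace("\\*", ".*").Replace("\\?", ".") + "$";
}
```

A caller that loses the race still gets a usable `Regex` even if the key is evicted immediately afterwards, because it holds the instance returned by `GetOrAdd` and never re-reads the dictionary.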
+ +### Server-025 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:19-25`, `src/MxGateway.Server/Galaxy/IGalaxyRepository.cs` | +| Status | Resolved | + +**Description:** The Tests-016 fix introduced `IGalaxyRepository` so `GalaxyHierarchyCache` could be unit-tested against an in-memory fake, and `GalaxyHierarchyCache` was updated to depend on the interface. `GalaxyRepositoryGrpcService` was not updated and still receives the concrete `GalaxyDb.GalaxyRepository` via its primary constructor. Functionally this is fine — DI registers the concrete singleton and a thin `sp.GetRequiredService()` forwarder for the interface — but the seam is now half-applied: a future caller that wants to test or stub the gRPC service's `TestConnection` path has to construct a real `GalaxyRepository` against a SQL connection string, defeating the abstraction `IGalaxyRepository` was introduced for. The pattern also creates an inconsistency for new readers — two consumers in the same namespace, one on the interface and one on the concrete. + +**Recommendation:** Change `GalaxyRepositoryGrpcService`'s `repository` parameter to `IGalaxyRepository`. No DI change is needed (both forwarders already resolve to the same singleton). Optionally drop the concrete singleton registration and register the interface directly. + +**Resolution:** 2026-05-20 — Changed `GalaxyRepositoryGrpcService`'s `repository` primary-constructor parameter from the concrete `GalaxyRepository` to `IGalaxyRepository`; existing DI registration in `GalaxyRepositoryServiceCollectionExtensions` already resolves both the concrete and interface to the same singleton. 
+ +### Server-026 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Server/Configuration/GatewayOptionsValidator.cs:17-32`, `src/MxGateway.Server/Configuration/AlarmsOptions.cs` | +| Status | Resolved | + +**Description:** `GatewayOptions.Alarms` is bound from `MxGateway:Alarms` and consumed by `SessionManager.TryAutoSubscribeAlarmsAsync` (per-session SubscribeAlarms on Ready). `GatewayOptionsValidator.Validate` validates every other section (`Authentication`, `Ldap`, `Worker`, `Sessions`, `Events`, `Dashboard`, `Protocol`) but has no `ValidateAlarms` arm — `AlarmsOptions` is silently accepted regardless of contents. The runtime mitigates this by logging a warning when `Enabled = true` but neither `SubscriptionExpression` nor `DefaultArea` is set, then either faulting open-session (`RequireSubscribeOnOpen = true`) or skipping auto-subscribe — a configuration error therefore surfaces per-session at runtime instead of at startup. Other sections fail-fast at `ValidateOnStart()`, so the inconsistency makes alarm misconfiguration discoverable only after a client hits the gateway. A misformatted `SubscriptionExpression` (no `\\\Galaxy!` shape) likewise passes validation; the worker rejects it later. + +**Recommendation:** Add a `ValidateAlarms(options.Alarms, failures)` arm in `GatewayOptionsValidator`. When `Enabled = true`, require either a non-blank `SubscriptionExpression` or a non-blank `DefaultArea`; when `SubscriptionExpression` is provided, sanity-check that it starts with `\\` (the AVEVA UNC subscription shape) — or document that the shape is left to the worker to validate. Either way, treat the configuration as part of the validated surface. 
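
The validation rule recommended for Server-026 could look roughly like the sketch below. `AlarmsOptionsSketch` and `AlarmsValidationSketch` are hypothetical mirrors of the options described in the finding, the failure messages are invented, and skipping all checks when `Enabled` is false is an assumption rather than something the source states.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical mirror of the alarm options named in the finding.
public sealed class AlarmsOptionsSketch
{
    public bool Enabled { get; set; }
    public string? SubscriptionExpression { get; set; }
    public string? DefaultArea { get; set; }
}

public static class AlarmsValidationSketch
{
    public static void ValidateAlarms(AlarmsOptionsSketch alarms, ICollection<string> failures)
    {
        // Assumption: a disabled alarms section is not validated at all.
        if (!alarms.Enabled)
        {
            return;
        }

        var hasExpression = !string.IsNullOrWhiteSpace(alarms.SubscriptionExpression);
        var hasArea = !string.IsNullOrWhiteSpace(alarms.DefaultArea);

        if (!hasExpression && !hasArea)
        {
            failures.Add("Alarms: set SubscriptionExpression or DefaultArea when Enabled is true.");
        }

        // The finding describes the subscription shape as starting with "\\".
        if (hasExpression
            && !alarms.SubscriptionExpression!.StartsWith(@"\\", StringComparison.Ordinal))
        {
            failures.Add(@"Alarms: SubscriptionExpression must start with '\\'.");
        }
    }
}
```

Collecting failures into a list instead of throwing on the first one lets a validator report every problem in a single startup error.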
+ +**Resolution:** 2026-05-20 — Added `ValidateAlarms` to `GatewayOptionsValidator`: when `Enabled = true`, requires a non-blank `SubscriptionExpression` or `DefaultArea`, and when `SubscriptionExpression` is provided, requires it to start with `\\` (canonical UNC subscription shape). Alarm misconfiguration now fails fast at startup instead of per-session. + +### Server-027 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Design-document adherence | +| Location | `docs/Authorization.md:120-141,176-181` | +| Status | Resolved | + +**Description:** Two parts of `docs/Authorization.md` drifted from `GatewayGrpcScopeResolver.ResolveCommandScope` and from `MxAccessGatewayService.ApplyConstraintsAsync` over the bulk-read/bulk-write series (`f220908`/`5e375f6`/`758aca2`) and were not updated by the Server-017 / Server-021 fixes: + +1. The `ResolveCommandScope` code snippet at lines 120-141 still shows only `Write`/`Write2` against `InvokeWrite` and `WriteSecured`/`WriteSecured2`/`AuthenticateUser` against `InvokeSecure`. The actual resolver also maps `MxCommandKind.WriteBulk`, `MxCommandKind.Write2Bulk`, `MxCommandKind.WriteSecuredBulk`, and `MxCommandKind.WriteSecured2Bulk`. A reader believing the snippet would conclude the bulk-write families inherit the fail-closed admin scope, when in fact they correctly map to `InvokeWrite` / `InvokeSecure` (the Scope Catalog table at lines 199-200 lists them). +2. The Constraint Enforcement section (lines 176-181) says: *"The service checks read constraints for `AddItem`, `AddItem2`, `AddItemBulk`, `SubscribeBulk`, and `AdviseItemBulk`. It checks write constraints for `Write`, `Write2`, `WriteSecured`, and `WriteSecured2`."* The actual `ApplyConstraintsAsync` switch also enforces constraints for `ReadBulk` (read scope), `WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` (write scope, per-entry filtering with index-order merge). 
Server-021 added test coverage for all of these without touching the doc. + +**Recommendation:** Update the `ResolveCommandScope` snippet to include the four bulk-write arms. Update the Constraint Enforcement prose to enumerate the bulk read/write commands that are actually filtered, and reference the per-entry index-ordered merge that `BulkConstraintPlan.MergeDeniedInto` performs. Adding `ReadBulk` to the `InvokeRead` row of the Scope Catalog would also be useful — the table currently lists `Register`/`AddItem`/`Advise` against `InvokeRead` but not `ReadBulk`. + +**Resolution:** 2026-05-20 — Updated the `ResolveCommandScope` snippet in `docs/Authorization.md` to enumerate the four bulk-write arms (`WriteBulk`/`Write2Bulk` against `InvokeWrite`, `WriteSecuredBulk`/`WriteSecured2Bulk` against `InvokeSecure`); expanded the Constraint Enforcement prose to list `ReadBulk` and all four bulk-write commands and to call out `BulkConstraintPlan.MergeDeniedInto`'s index-ordered merge; added `ReadBulk` to the `InvokeRead` row of the Scope Catalog. + +### Server-028 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/MxGateway.Tests/Security/Authorization/GatewayGrpcScopeResolverTests.cs:13-20`, `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs` | +| Status | Resolved | + +**Description:** Two narrow test gaps were not closed by Server-017 / Server-015: + +1. `GatewayGrpcScopeResolverTests.ResolveRequiredScope_KnownRpcRequest_ReturnsExpectedScope` enumerates `OpenSessionRequest`, `CloseSessionRequest`, `StreamEventsRequest`, `AcknowledgeAlarmRequest`, `QueryActiveAlarmsRequest`, `TestConnectionRequest`, `GetLastDeployTimeRequest`, and `DiscoverHierarchyRequest`. `WatchDeployEventsRequest` is missing even though it is named in the resolver's metadata-read arm and listed in the Scope Catalog. 
Similarly, the `ResolveRequiredScope_InvokeCommand_ReturnsExpectedScope` matrix covers every other write/secure/bulk command but omits `MxCommandKind.ReadBulk`, which is the only bulk family that falls into the `_ => GatewayScopes.InvokeRead` default arm. A regression that drops `WatchDeployEvents` from the request switch or that adds a new mapping for `ReadBulk` would not be caught. +2. `GatewaySessionTests` (added under Server-015 / Server-016) covers the `TransitionTo(Ready)` and `MarkFaulted(post-Close)` cases but does not cover the third edge that Server-015's tightened state machine permits: `MarkFaulted` issued while `CloseAsync` is parked between `TryBeginClose` (Closing) and `MarkClosed` (Closed). The current `MarkFaulted` (`GatewaySession.cs:314-326`) checks only for `Closed`, so it overwrites `Closing` → `Faulted`; the subsequent `MarkClosed` then overwrites `Faulted` → `Closed` while `_finalFault` is preserved. The behaviour is consistent with the docs ("Closing only allows a transition to Closed or Faulted") but the test bundle does not pin it, and a future tightening of `MarkFaulted` could silently regress. + +**Recommendation:** Extend `GatewayGrpcScopeResolverTests.ResolveRequiredScope_KnownRpcRequest_ReturnsExpectedScope` with `[InlineData(typeof(WatchDeployEventsRequest), GatewayScopes.MetadataRead)]` and extend the command theory with `[InlineData(MxCommandKind.ReadBulk, GatewayScopes.InvokeRead)]`. Add a `GatewaySessionTests.MarkFaulted_DuringInFlightClose_PreservesFaultButYieldsToClose` case using `BlockingShutdownWorkerClient` to park `CloseAsync`, call `MarkFaulted` while parked, release the worker, and assert `State == Closed && FinalFault == ""`. 
+ +**Resolution:** 2026-05-20 — Added `[InlineData(typeof(WatchDeployEventsRequest), GatewayScopes.MetadataRead)]` to `GatewayGrpcScopeResolverTests.ResolveRequiredScope_KnownRpcRequest_ReturnsExpectedScope` (the `ReadBulk` arm was already present); added `GatewaySessionTests.MarkFaulted_DuringInFlightClose_PreservesFaultButYieldsToClose` covering the parked-close + `MarkFaulted` interleave and asserting the post-release state is `Closed` with `FinalFault = "concurrent-fault"`. + +### Server-029 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:52-58` | +| Status | Resolved | + +**Description:** `OpenSession` advertises capabilities the gateway supports so clients can branch on them. The current list is `unary-open-session`, `unary-close-session`, `unary-invoke`, `server-stream-events`, `bulk-subscribe-commands`, `unary-acknowledge-alarm`, `server-stream-active-alarms`. The `bulk-subscribe-commands` token was added for the `AddItemBulk` / `AdviseItemBulk` / `RemoveItemBulk` / `UnAdviseItemBulk` / `SubscribeBulk` / `UnsubscribeBulk` family. The subsequent `ReadBulk` and `WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` families landed without a corresponding capability token — the contract advertises bulk-subscribe support but is silent on bulk-read and bulk-write. A defensive client that gates on `bulk-write-commands` before issuing a `WriteBulk` has no signal that the family is supported; current clients sidestep this by ignoring the list entirely, but that just shifts the failure mode (an old client against a new server, or vice versa, will see `Unimplemented` instead of a structured `Capabilities` mismatch). 
+ +**Recommendation:** Either (a) extend the advertised list with `bulk-read-commands` and `bulk-write-commands` (`WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` collectively), or (b) document in `gateway.md` and `docs/Contracts.md` that `Capabilities` is informational only and not the contract version. Option (a) is the simplest forward-compatible fix and keeps the capability token shape clients are already familiar with. + +**Resolution:** 2026-05-20 — Extended the `OpenSession` capabilities list with `bulk-read-commands` and `bulk-write-commands` alongside the existing `bulk-subscribe-commands` token, so clients that gate on capability strings have an explicit signal for the bulk-read and bulk-write families. + +### Server-030 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:952-980` | +| Status | Resolved | + +**Description:** Surfaced during the 2026-05-20 cross-language e2e run against a redeployed gateway (`a020350`). The Java client got 55 of 120 `AddItem` calls in, then `Advise` returned `Session session-de7728a290bd41028ad6fec81e233144 is not ready. Current state is Ready.` — a self-contradictory diagnostic. The check in `GetReadyWorkerClient` (`GatewaySession.cs:956`) is `_state != SessionState.Ready || _workerClient?.State != WorkerClientState.Ready`, but the formatted message only includes `_state`. When the gateway-side session state is `Ready` but the worker client's own `WorkerClientState` has transitioned (heartbeat watchdog firing, pipe disconnect detected by the read loop, etc.) before the session-level reaction observes it, the in-flight RPC fails fast here — and the operator sees a message that doesn't tell them which side of the gate the failure is on.
The two-state gap itself is a real race (the worker-side state can shift independently of the gateway-driven session state) but a clear diagnostic is the prerequisite for diagnosing it; without it, a future investigation will start from "it says Ready but it's not Ready" instead of "the worker is Handshaking / Closing / Faulted while the session is still Ready". + +**Recommendation:** Format both states into the exception message — `Session {SessionId} is not ready. Session state is {_state}; worker state is {workerClientState}.` (or `""` when `_workerClient` is null). Document on the method that the two states can diverge under load and that this branch is the fail-fast for that case. Add a regression test that flips `FakeWorkerClient.State` to a non-Ready value (e.g. `Handshaking`) while the session is `Ready` and asserts both pieces of state appear in the thrown `SessionManagerException.Message`. The deeper race investigation (should the gateway briefly wait for worker-Ready before failing? when does `WorkerClient.State` legitimately shift while the session is still `Ready`?) is out of scope for this finding but is worth a follow-up. + +**Resolution:** 2026-05-20 — Rewrote `GetReadyWorkerClient` so the `SessionManagerException` message includes both `_state` and `_workerClient.State` (or `""` for the null case): `"Session {SessionId} is not ready. Session state is {_state}; worker state is {workerState}."`. Added XML doc on the method explaining the two-state contract and that this branch is the fail-fast for a state-divergence race. Added regression test `SessionManagerTests.InvokeAsync_WhenWorkerNotReadyButSessionReady_DiagnosticIncludesBothStates` that sets `FakeWorkerClient.State = WorkerClientState.Handshaking` while the session is `Ready` and asserts both `"Session state is Ready"` and `"worker state is Handshaking"` appear in the message; the test also pins `InvokeCount == 0` so the worker isn't called. 
The deeper race (should `GetReadyWorkerClient` retry briefly when state has just diverged?) remains open for follow-up. + +### Server-031 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Server/Workers/WorkerClient.cs:392-422` (gateway-side heartbeat watchdog); `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:588-617` (worker-side heartbeat loop); `src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:14,67-76` (shared `_writeLock`) | +| Status | Open | + +**Description:** Surfaced during the 2026-05-20 cross-language e2e re-run against gateway `b794c46`. The .NET phase succeeded through `open-session`/`register`/`bulk-subscribe`/`bulk-read`/`bulk-unsubscribe`/`stream-events`/`write` but then failed on its third `advise` call with the Server-030 diagnostic `Session ... is not ready. Session state is Ready; worker state is Faulted.` The gateway stdout log records the underlying cause: **`Worker client faulted for session session-01a1a07fa59c489983a719821fa46e72: Worker heartbeat expired. Last heartbeat was at 2026-05-20T17:20:39.+00:00.`** — a real 15s+ gap with no `WorkerHeartbeat` envelope arriving from the worker. + +Investigation paths: + +1. **Shared `_writeLock` on the worker side.** `WorkerFrameWriter` serializes every pipe write (heartbeats, command replies, events, faults) through a single `SemaphoreSlim _writeLock` (`WorkerFrameWriter.cs:14`, `:67-76`). `RunEventDrainLoopAsync` (`WorkerPipeSession.cs:336-372`) writes events one at a time inside a `foreach`, each call to `_writer.WriteAsync` re-acquiring `_writeLock`. If the gateway-side read drains slowly and the OS-level named-pipe buffer fills, `_stream.WriteAsync` (`WorkerFrameWriter.cs:70`) blocks. The event-drain loop blocks holding the lock. `RunHeartbeatLoopAsync` (`WorkerPipeSession.cs:611-613`) then can't acquire `_writeLock` to send its 5s heartbeat. 
Heartbeats stall past the gateway's `HeartbeatGrace` (15s default) and `WorkerClient.HeartbeatLoopAsync` faults the session. + +2. **No prioritization between heartbeats and events.** Even without OS-level back-pressure, a backlog of events in the worker's `MxAccessEventQueue` (drained in batches of `EventDrainBatchSize`) can keep the writer lock held for many milliseconds at a time. Heartbeats can be delayed (though normally not past `HeartbeatGrace` unless something else is wrong). + +3. **Gateway-side heartbeat watchdog ignores in-flight commands.** `WorkerClient.HeartbeatLoopAsync` (`WorkerClient.cs:392-422`) checks only `_state == Ready` and `now - lastHeartbeatAt > HeartbeatGrace`. It does not check whether a command is in flight on the gateway↔worker pipe. The mirror of Worker-017's fix (worker-side watchdog skips `StaHung` while a command is in flight) does not exist on the gateway side. + +The .NET test pattern stresses the issue uniquely because each `dotnet run --project` rebuild between subcommands introduces multi-second client-side gaps; the worker's heartbeat path should still be alive (heartbeats are emitted by `RunHeartbeatLoopAsync` independently of gateway activity), but if the gateway is also blocked draining events from the channel into a non-existent `StreamEvents` consumer, the back-pressure-into-heartbeat chain bites first. + +**Recommendation:** Two changes worth landing together: + +1. **Decouple heartbeat writes from the event/reply lock.** Either (a) give heartbeats their own pipe `Stream` (likely impractical — one pipe per session), (b) introduce a priority queue in front of `WorkerFrameWriter` so heartbeats hop the line, or (c) interleave heartbeat checks inside `RunEventDrainLoopAsync` (e.g., after each event-batch write, post a heartbeat envelope if one is due). Option (c) is the smallest change. + +2. 
**Mirror Worker-017's "skip-while-command-in-flight" guard on the gateway side.** In `WorkerClient.HeartbeatLoopAsync`, when `_pendingCommands.Count > 0` and the oldest pending command is younger than some ceiling (e.g., 5× `HeartbeatGrace`), skip the fault. The worker may be busy executing a slow STA command and the heartbeat write may be queued behind a long event burst — neither indicates the worker is actually hung. + +Add a regression test that floods the worker's outbound event channel (e.g., via a high-rate STA fixture or a mock event source emitting at > 1000 events/s for several seconds) and asserts the worker is not faulted while the gateway has no `StreamEvents` consumer attached. + +**Resolution:** _(empty until closed; on close, record the fixing commit SHA, the date, and a one-line description of the fix)_ + +### Server-032 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Server/Workers/WorkerClient.cs:70-77,463-484` (gateway-side `_events` channel); `src/MxGateway.Server/Configuration/EventOptions.cs:8` (default capacity 10,000); `src/MxGateway.Server/Grpc/EventStreamService.cs` (consumer) | +| Status | Open | + +**Description:** Surfaced during the 2026-05-20 cross-language e2e re-run against gateway `b794c46`. The Java phase advised ~55 items (`item-handle 63`) before failing on the next `advise` call with the Server-030 diagnostic `Session ... is not ready. Session state is Ready; worker state is Faulted.`. The gateway stdout log records: **`Worker client faulted for session session-adfcc808da974808947e87db060c2b03: Worker event channel rejected an event.`** — the gateway-side per-session bounded event channel filled up and `Channel.Writer.TryWrite` returned `false`, triggering the fail-fast path in `EnqueueWorkerEventAsync` (`WorkerClient.cs:467-484`). + +The channel is configured as `Channel.CreateBounded(new BoundedChannelOptions(EventChannelCapacity) { ... 
FullMode = BoundedChannelFullMode.Wait ... })` (capacity defaults to `EventOptions.QueueCapacity = 10_000`). But `EnqueueWorkerEventAsync` uses **`TryWrite`** (non-blocking), so the configured `Wait` mode is moot — the writer always fails fast when full. This is consistent with `docs/DesignDecisions.md`'s "fail-fast event backpressure" policy (one subscriber per session, no producer-side queuing beyond the channel), but two facts make it sharp in practice: + +1. The e2e flow (and any realistic client) `advise`s many items BEFORE opening a long-running `StreamEvents` consumer. With no consumer, events accumulate at the in-rate (driven by the SCADA tags' change frequency). For `TestMachine_*.TestChangingInt` × ~55 advised items, the rig can fill 10,000 in well under a minute. + +2. The fail-fast threshold is "exactly at capacity." There is no overflow grace window. A momentary lull on the consumer side that lasts long enough for one extra event to arrive after the channel is full results in worker fault and session teardown. + +This is design-as-intended in the v1 sense, but it surfaces a behavioral contract that is **not currently documented**: clients must open `StreamEvents` BEFORE issuing `advise` against high-rate tags, or pace their `advise` calls below the (non-published) accumulation budget. None of the current docs (`gateway.md`, `docs/DesignDecisions.md`, the client READMEs) enforce or surface this requirement, and four of the five client CLIs (`go`, `python`, `rust`, `java`) hit it gracelessly in `scripts/run-client-e2e-tests.ps1`. + +The diagnostic `"Worker event channel rejected an event."` also does not name the actual channel (it says "Worker event channel" but the channel is gateway-owned), the current depth, or the capacity — only that it overflowed. Operators can't tell whether the threshold needs lifting or whether the consumer is genuinely missing. 
+ +**Recommendation:** Three escalating options, pick at least the first and consider one of the others: + +1. **Document the contract.** In `gateway.md` and `docs/DesignDecisions.md`, state explicitly that `advise` produces events into the gateway-side per-session channel and that a `StreamEvents` consumer must be attached to drain it. Add the bound (`MxGateway:Events:QueueCapacity`, default 10,000) and the fault behavior (the worker is faulted; the session ends). Update `clients/*/README.md` to call out the requirement in the "advise" / "subscribe" sections. + +2. **Improve the diagnostic.** Format the channel depth and capacity into the fault message: `"Worker event channel rejected an event after {capacity} unconsumed events accumulated. Attach a StreamEvents consumer or increase MxGateway:Events:QueueCapacity."` + +3. **Add an overflow grace window.** Instead of fail-fast on the first `TryWrite == false`, count overflow events and only fault if N consecutive overflows happen within T ms (or, equivalently, switch to `WriteAsync` with a short timeout). This trades a tiny memory bump for resilience to consumer hiccups. Out of scope if v1 explicitly chose fail-fast for parity reasons — but worth raising for v2. + +Add a regression test that advises N items without an active `StreamEvents` consumer, lets the channel fill, and asserts the produced fault message contains the channel-depth diagnostic (#2) — gated so that #3 is not required. 
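
A minimal sketch of option 2, assuming a bounded `System.Threading.Channels` channel written with `TryWrite`, as the description states. The `TryEnqueue` helper, the `int` payload, the capacity of 3, and the exact message wording are illustrative assumptions, not the gateway's actual code.

```csharp
using System;
using System.Threading.Channels;

// Hypothetical stand-in for the gateway-side per-session event channel.
const int capacity = 3;
var events = Channel.CreateBounded<int>(new BoundedChannelOptions(capacity)
{
    // Wait only affects WriteAsync; TryWrite never waits and returns false when full.
    FullMode = BoundedChannelFullMode.Wait,
    SingleReader = true,
});

// Returns null when the event was queued, or a diagnostic naming depth and capacity when rejected.
string? TryEnqueue(int evt)
{
    if (events.Writer.TryWrite(evt))
    {
        return null;
    }

    return $"Gateway event channel rejected an event after {events.Reader.Count} of {capacity} " +
           "unconsumed events accumulated. Attach a StreamEvents consumer or increase " +
           "MxGateway:Events:QueueCapacity.";
}

// No consumer is attached, so the write after the channel fills is rejected.
string? fault = null;
for (var i = 0; i < capacity + 1 && fault is null; i++)
{
    fault = TryEnqueue(i);
}

Console.WriteLine(fault);
```

With the depth and capacity in the message, an operator can tell a missing consumer (depth equals capacity, no reader attached) from a threshold that is simply too low for the tag rate.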
+ +**Resolution:** _(empty until closed; on close, record the fixing commit SHA, the date, and a one-line description of the fix)_ + +### Server-033 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:265-323` (`TryRestoreFromDiskAsync`), `:84-99` (`_firstLoad` / `WaitForFirstLoadAsync`); `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:141-163` (`WaitForCacheBootstrap`) | +| Status | Resolved | + +**Description:** `TryRestoreFromDiskAsync` populates `_current` with the on-disk snapshot (status `Stale`, `HasData == true`) but never completes the `_firstLoad` `TaskCompletionSource` — only the live-query paths (cheap / heavy / catch) in `RefreshCoreAsync` do. A `DiscoverHierarchy` or `GetLastDeployTime` call that arrives after gateway start but before the first refresh tick finishes sees `cache.Current` as `Empty` (status `Unknown`) when `WaitForCacheBootstrap` runs its initial check, so it falls through to `await WaitForFirstLoadAsync` with a 5-second budget. Restore then completes within milliseconds and makes the data available, but `_firstLoad` stays pending until the live query returns or fails. When the Galaxy database is unreachable — the exact scenario the snapshot feature exists for — the SQL connect attempt outlasts the 5s budget, so the caller waits the full 5 seconds before the budget elapses and the handler falls through to read the (already-restored) data. The result is correct, but the first browse calls after a cold offline start incur a needless ~5s latency, undercutting the feature's purpose. + +**Recommendation:** Call `_firstLoad.TrySetResult()` at the end of `TryRestoreFromDiskAsync` once the restored entry is published — restored data is a valid completed first load. 
Add a regression test: a cache with a throwing repository plus a populated snapshot store should have `WaitForFirstLoadAsync` complete promptly after `RefreshAsync`, not block on the live query. + +**Resolution:** Resolved in `bdccdbf` (2026-05-22): `TryRestoreFromDiskAsync` calls `_firstLoad.TrySetResult()` immediately after publishing the restored entry, so a restored snapshot satisfies the bootstrap gate without waiting on the live query. New test `GalaxyHierarchyCacheTests.RefreshAsync_RestoredSnapshotCompletesFirstLoadBeforeLiveQueryReturns` blocks the repository's deploy-time query and asserts `WaitForFirstLoadAsync` still completes from the snapshot. + +### Server-034 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchySnapshotStore.cs:87-115` (`TryLoadAsync`) | +| Status | Resolved | + +**Description:** `TryLoadAsync` carries the `Try` prefix and its XML doc says it returns `null` "when none exists, persistence is disabled, or the on-disk file uses an unrecognized schema version." But a corrupt or partially written JSON file makes `JsonSerializer.DeserializeAsync` throw `JsonException`, and an unreadable file (locked, denied ACL) throws `IOException` / `UnauthorizedAccessException` — none of which the method catches. End-to-end behavior is still safe because the sole caller, `GalaxyHierarchyCache.TryRestoreFromDiskAsync`, wraps the call in a `catch (Exception)`; but the store's own `Try`-prefixed contract is violated, and any future caller would be surprised by the throw. + +**Recommendation:** Catch `JsonException`, `IOException`, and `UnauthorizedAccessException` (the last derives from `SystemException`, not `IOException`, so it must be named explicitly) inside `TryLoadAsync`, log a warning, and return `null` — consistent with the unrecognized-schema-version branch already present and with the `Try` naming. A corrupt cache file is an expected failure mode for a disk cache. 
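
The recommended shape can be sketched as follows. This is a simplified, hypothetical loader (the `Dictionary<string, int>` payload, the temp-file name, and the console logging are assumptions for illustration), not the real `GalaxyHierarchySnapshotStore`. Note that `UnauthorizedAccessException` is not an `IOException` subclass, so the filter lists it explicitly.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical simplified loader: returns null instead of throwing on expected disk-cache failures.
static async Task<Dictionary<string, int>?> TryLoadAsync(string path)
{
    if (!File.Exists(path))
    {
        return null;
    }

    try
    {
        await using var stream = File.OpenRead(path);
        return await JsonSerializer.DeserializeAsync<Dictionary<string, int>>(stream);
    }
    catch (Exception ex) when (ex is JsonException or IOException or UnauthorizedAccessException)
    {
        // A corrupt or unreadable cache file is an expected failure mode: warn and report "no snapshot".
        Console.Error.WriteLine($"Ignoring unreadable snapshot '{path}': {ex.GetType().Name}");
        return null;
    }
}

// Simulate a partially written snapshot file.
var path = Path.Combine(Path.GetTempPath(), $"snapshot-{Guid.NewGuid():N}.json");
await File.WriteAllTextAsync(path, "{ \"truncated\": ");
var loaded = await TryLoadAsync(path);
File.Delete(path);

Console.WriteLine(loaded is null ? "corrupt snapshot ignored" : "snapshot loaded");
```

The exception filter keeps unexpected exception types propagating, so only the failure modes named in the finding are swallowed.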
+ +**Resolution:** Resolved in `bdccdbf` (2026-05-22): `TryLoadAsync` now has a `catch (Exception) when (exception is JsonException or IOException or UnauthorizedAccessException)` that logs a warning and returns `null`. New test `GalaxyHierarchySnapshotStoreTests.TryLoadAsync_WhenFileIsCorruptJson_ReturnsNull`. + +### Server-035 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Performance & resource management | +| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:176` (call site), `:327-352` (`PersistSnapshotAsync`) | +| Status | Resolved | + +**Description:** After a heavy refresh, `RefreshCoreAsync` `await`s `PersistSnapshotAsync` while still holding `_refreshGate`, and the `SaveAsync` write has no timeout. The only caller of `RefreshAsync` is the sequential `GalaxyHierarchyRefreshService` loop, so a write that hangs — e.g. a `SnapshotCachePath` pointed at an unresponsive network share — blocks the gate and stalls all subsequent cache refreshes until gateway shutdown. Impact is bounded: clients keep being served the last entry (which flips to `Stale` after the 5-minute threshold), so this is a degradation rather than an outage, and the default `C:\ProgramData` path is local disk where a hang is unlikely. + +**Recommendation:** Bound the snapshot write with a timeout — a linked `CancellationTokenSource` cancelling after, say, the SQL `CommandTimeoutSeconds` budget — so a stuck write fails fast and logs rather than pinning the refresh loop. Moving the write off the gate is an alternative but would need its own write-serialization. + +**Resolution:** Resolved in `bdccdbf` (2026-05-22): `SaveAsync` wraps the write in a `CancellationTokenSource.CreateLinkedTokenSource(cancellationToken)` cancelled after `Math.Max(1, CommandTimeoutSeconds)` seconds, so a stuck write fails fast instead of pinning the refresh loop. The timeout-expiry path itself is not unit-tested — exercising it would require a genuinely hanging filesystem. 
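
A rough sketch of the bounded-write pattern described in the Server-035 resolution, assuming the write accepts a `CancellationToken`. The helper name `TrySaveWithTimeoutAsync`, its `bool` result, and the 50 ms budget are hypothetical; the real `SaveAsync` derives its budget from `CommandTimeoutSeconds`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Runs `write` under the caller's token plus a timeout; returns false when the timeout fired.
static async Task<bool> TrySaveWithTimeoutAsync(
    Func<CancellationToken, Task> write,
    TimeSpan timeout,
    CancellationToken cancellationToken)
{
    using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    timeoutCts.CancelAfter(timeout);

    try
    {
        await write(timeoutCts.Token);
        return true;
    }
    catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
    {
        // The caller's token is still live, so the linked source was cancelled by the timeout.
        return false;
    }
}

// A write that never completes on its own stands in for an unresponsive network share.
var hung = await TrySaveWithTimeoutAsync(
    ct => Task.Delay(Timeout.Infinite, ct), TimeSpan.FromMilliseconds(50), CancellationToken.None);

// A write that completes immediately is unaffected by the budget.
var quick = await TrySaveWithTimeoutAsync(
    _ => Task.CompletedTask, TimeSpan.FromSeconds(5), CancellationToken.None);

Console.WriteLine($"hung write completed: {hung}; quick write completed: {quick}");
```

Simulating the hang with a cancellable `Task.Delay` only covers writes that honor the token; a filesystem call that ignores cancellation would still block, which is consistent with the note that the expiry path is hard to unit-test.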
+ +### Server-036 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:345-348` (`PersistSnapshotAsync` catch) | +| Status | Resolved | + +**Description:** `PersistSnapshotAsync` passes the refresh `CancellationToken` to `SaveAsync` and catches every exception — including the `OperationCanceledException` thrown when that token is cancelled at gateway shutdown — in its general `catch (Exception)`, logging it as `Warning: "Failed to persist the Galaxy hierarchy snapshot to disk."`. A snapshot write interrupted by a normal shutdown is not a failure, but it surfaces as a misleading warning every time the gateway stops mid-write. + +**Recommendation:** Let a cancellation-driven `OperationCanceledException` pass without the warning — e.g. add `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { }` before the general catch — matching the cancellation handling already used in `RefreshCoreAsync` and `TryRestoreFromDiskAsync`. + +**Resolution:** Resolved in `bdccdbf` (2026-05-22): `PersistSnapshotAsync` has a `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)` ahead of the general catch, so a save aborted by gateway shutdown is silent while a genuine failure (including a write timeout) still logs. New test `GalaxyHierarchyCacheTests.RefreshAsync_WhenSnapshotSaveCancelledAtShutdown_DoesNotLogPersistFailure`. 
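
The catch ordering from the Server-036 recommendation can be illustrated with a small sketch. The `PersistAsync` wrapper and the in-memory `warnings` list are stand-ins invented for this example (the real method logs through the gateway logger).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for the logger: collects warning messages so the behavior is observable.
var warnings = new List<string>();

async Task PersistAsync(Func<CancellationToken, Task> save, CancellationToken cancellationToken)
{
    try
    {
        await save(cancellationToken);
    }
    catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
    {
        // Shutdown interrupted the write: not a failure, so no warning.
    }
    catch (Exception ex)
    {
        // Genuine failures (including a timeout on a linked token) still surface.
        warnings.Add($"Failed to persist the snapshot: {ex.Message}");
    }
}

// Case 1: the caller's token is already cancelled, as at gateway shutdown.
using var shutdown = new CancellationTokenSource();
shutdown.Cancel();
await PersistAsync(ct => Task.FromCanceled(ct), shutdown.Token);

// Case 2: a real I/O failure while the token is still live.
await PersistAsync(_ => throw new IOException("disk full"), CancellationToken.None);

Console.WriteLine($"warnings logged: {warnings.Count}");
```

The `when` filter matters: an `OperationCanceledException` raised while the caller's token is not cancelled falls through to the general catch and is still logged.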
+ +### Server-037 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/MxGateway.Tests/Galaxy/GalaxyHierarchySnapshotStoreTests.cs`, `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs` | +| Status | Resolved | + +**Description:** The new snapshot tests cover the round-trip, missing-file, persistence-disabled, unrecognized-schema, and overwrite cases for the store, and the persist / restore-when-unreachable / promote-on-matching-deploy cases for the cache. Two resilience paths are untested: (1) `GalaxyHierarchyCache.TryRestoreFromDiskAsync`'s `catch` path when the snapshot file is corrupt — the cache must come up `Unavailable` rather than throwing; (2) the cache restore path when `PersistSnapshot = false` (the store yields `null` and the cache stays `Unavailable`). Both are the failure modes most likely to matter operationally. + +**Recommendation:** Add a cache test that writes a corrupt snapshot file and asserts `RefreshAsync` with an unreachable repository leaves the cache `Unavailable` without throwing, and a test that confirms a `PersistSnapshot = false` store neither restores nor persists. If Server-034 is fixed, the corrupt-file test also pins the store's null-return. + +**Resolution:** Resolved in `bdccdbf` (2026-05-22): added `GalaxyHierarchyCacheTests.RefreshAsync_WhenSnapshotFileCorrupt_ComesUpUnavailableWithoutThrowing` and `RefreshAsync_WhenPersistDisabled_DoesNotRestoreFromDisk`, plus the `TryLoadAsync_WhenFileIsCorruptJson_ReturnsNull` store test added for Server-034. + +### Server-038 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Security | +| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs:23-44` | +| Status | Open | + +**Description:** `EventsHub` is gated by `[Authorize(Policy = DashboardAuthenticationDefaults.HubClientsPolicy)]`, which checks only that the caller carries a dashboard role (Admin or Viewer). 
`SubscribeSession(sessionId)` accepts any non-empty session id and joins the caller to `session:{id}`. A Viewer who knows or guesses a session id can therefore subscribe to any session's MxEvent stream once `DashboardEventBroadcaster` is broadcasting (which it now is, per `d692232`). The per-session ACL that gates the gRPC `StreamEvents` RPC is not replicated. + +**Recommendation:** Before the EventsHub is exercised by Admin-only sessions or session-scoped Viewer roles, gate `SubscribeSession` on a session-access check — either via a per-session role check in the hub method itself, or by storing a per-user allowed-session-id set in the connection's `Context.Items` at connect time and rejecting subscribes outside that set. The current dashboard surfaces only a per-page Session Details view that the page can prove it's authorized for, but as soon as a Viewer role exists the gap matters. + +**Resolution:** _(empty until closed)_ + +### Server-039 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs:37-58` | +| Status | Open | + +**Description:** `HubTokenService.Validate` deserializes the protected JSON payload and trusts `payload.Roles` even when `payload.Name` and `payload.NameIdentifier` are both `null`. The resulting `ClaimsPrincipal` has the `MxGateway.Dashboard.HubToken` scheme as its `AuthenticationType` and the role claims, but no identity claims. `Identity?.IsAuthenticated` returns `true` because the auth type is non-empty, so the principal satisfies `IsAuthenticated` checks and `IsInRole` checks even though it has no caller identity. A token forged from a corrupted data-protection store could pass authorization without an associated user. + +**Recommendation:** Mark `HubTokenPayload.Name` and `HubTokenPayload.NameIdentifier` as required (e.g. 
with `[JsonRequired]` once the project standardizes the JSON binder, or by validating non-null explicitly after deserialization) and reject the token if either is missing. Alternatively, document on `IDashboardAuthorizationHandler` consumers that they must check `Identity?.Name` is non-null before honoring role claims from this scheme. + +**Resolution:** _(empty until closed)_ + +### Server-040 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardAuthenticator.cs:140-160` (`MapGroupsToRoles`) | +| Status | Open | + +**Description:** `MapGroupsToRoles` checks each LDAP group against the role map twice — first by the full group string, then by `ExtractFirstRdnValue(group)` — and `TryGetValue` short-circuits on the first hit. The precedence ("full match wins over RDN match") is correct because the map's key set is operator-controlled and matches should resolve deterministically, but the lookup ordering is not documented. A future maintainer reading the code can't tell whether "fall through to RDN" is intentional or a leftover from refactoring `IsMemberOfRequiredGroup`. + +**Recommendation:** Add a one-line comment above the loop explaining the precedence: full DN/CN literal first, leading-RDN fallback second. Mention the case-insensitive map comparer (`OrdinalIgnoreCase`) so the next reader doesn't ask why `"GwAdmin"` matches `"gwadmin"`. 
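
To make the Server-040 precedence concrete, here is a hedged sketch of the two-step lookup. The role map contents, the `MapGroupToRole` helper, and this simplified `ExtractFirstRdnValue` (which ignores escaped commas) are assumptions for illustration and may differ from `DashboardAuthenticator`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical role map. OrdinalIgnoreCase makes "GwAdmin" and "gwadmin" the same key.
var roleMap = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    ["GwAdmin"] = "Admin",
    ["CN=GwViewers,OU=Groups,DC=example,DC=com"] = "Viewer",
};

// Simplified: value of the leading RDN, e.g. "CN=GwAdmin,OU=Groups" -> "GwAdmin".
static string ExtractFirstRdnValue(string group)
{
    var first = group.Split(',')[0];
    var eq = first.IndexOf('=');
    return eq < 0 ? first : first.Substring(eq + 1);
}

string? MapGroupToRole(string group)
{
    // Precedence: full DN/CN literal first, leading-RDN fallback second.
    if (roleMap.TryGetValue(group, out var role))
    {
        return role;
    }

    return roleMap.TryGetValue(ExtractFirstRdnValue(group), out role) ? role : null;
}

var byRdn = MapGroupToRole("cn=gwadmin,ou=Groups,dc=example,dc=com");
var byFullDn = MapGroupToRole("CN=GwViewers,OU=Groups,DC=example,DC=com");
var unmapped = MapGroupToRole("CN=Other,OU=Groups,DC=example,DC=com");

Console.WriteLine($"{byRdn}, {byFullDn}, {unmapped ?? "(none)"}");
```

A comment like the one inside `MapGroupToRole` is the one-line documentation the recommendation asks for.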
+ +**Resolution:** _(empty until closed)_ + +### Server-041 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Design-document adherence | +| Location | `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:123-126`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/IDashboardEventBroadcaster.cs:6-10` | +| Status | Open | + +**Description:** `IDashboardEventBroadcaster.Publish` is documented as "Implementations must never throw — broadcast failures are best-effort and must not disrupt the source gRPC stream." `EventStreamService` honors that contract by passing the call through without a try/catch. The current `DashboardEventBroadcaster` implementation observes the `SendAsync` task's continuation but does not raise synchronously, so the seam is safe today. A future implementation that adds synchronous validation or a serializer hop could throw, faulting the producer loop and ending the gRPC stream. + +**Recommendation:** Either wrap the `Publish` call in a `try/catch (Exception ex)` that logs at debug and continues (matching the `DashboardSnapshotPublisher` pattern), or add a code-review checklist note enforcing the never-throw contract on implementations. The wrap is safer because it doesn't depend on convention. + +**Resolution:** _(empty until closed)_ + +### Server-042 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Performance & resource management | +| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardSnapshotPublisher.cs:18-41` | +| Status | Open | + +**Description:** `DashboardSnapshotPublisher.ExecuteAsync` reads from `IDashboardSnapshotService.WatchSnapshotsAsync` inside an outer `try` that catches `OperationCanceledException` only. A failure inside `WatchSnapshotsAsync` (e.g. the snapshot service throws after a transient SQL failure for the Galaxy summary projection) escapes the outer try and ends the BackgroundService — no automatic reconnect. 
The sibling `AlarmsHubPublisher` (lines 55-61) wraps its `StreamAsync` consumer in a 5-second reconnect loop with `catch (Exception ex)` and continues. The snapshot publisher should follow the same shape. + +**Recommendation:** Wrap the `await foreach` in a `while (!stoppingToken.IsCancellationRequested)` loop with a `catch (Exception ex)` plus a 5-second `Task.Delay`, mirroring `AlarmsHubPublisher`. Today's snapshot service rarely throws on the watch path, but a one-time logger-init failure or transient `IGatewayConfigurationProvider` exception would silently take the dashboard offline. + +**Resolution:** _(empty until closed)_ + +### Server-043 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs:1`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardServiceCollectionExtensions.cs:24` | +| Status | Open | + +**Description:** `HubTokenService` is registered as a singleton (good — data protection providers are thread-safe and a single protector instance is correct) and shared by both `DashboardHubConnectionFactory` (per-circuit scoped, mints fresh tokens from the cookie principal) and `HubTokenAuthenticationHandler` (per-request transient, validates inbound tokens). The class-level docs describe what the service does but not that it is intentionally a singleton with two consumer scopes, so a future maintainer rewriting the DI registration may pick the wrong lifetime. + +**Recommendation:** Add a `` block to `HubTokenService` noting "Registered as a singleton in `AddGatewayDashboard`; the underlying `ITimeLimitedDataProtector` is thread-safe and shared across hub-token issuance and validation." Optionally add a comment near the DI registration explaining the lifetime contract. 
+ +**Resolution:** _(empty until closed)_ diff --git a/code-reviews/Tests/findings.md b/code-reviews/Tests/findings.md new file mode 100644 index 0000000..5239a70 --- /dev/null +++ b/code-reviews/Tests/findings.md @@ -0,0 +1,453 @@ +# Code Review — Tests + +| Field | Value | +|---|---| +| Module | `src/ZB.MOM.WW.MxGateway.Tests` | +| Reviewer | Claude Code | +| Review date | 2026-05-24 | +| Commit reviewed | `d692232` | +| Status | Reviewed | +| Open findings | 2 | + +## Checklist coverage + +This pass (commit `a020350`) re-reviews the module after the Tests-013–019 batch was resolved alongside Server-017, Server-021, and Contracts-010. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issue found: Tests-023 (the companion `FakeWorkerProcess.WaitForExitAsync` in `SessionWorkerClientFactoryFakeWorkerTests.cs` still uses the Tests-015 cheating pattern — `HasExited = true; ExitCode = 0;` regardless of whether the worker actually exited — and is a latent regression vector if any future exit assertion is added to that file). Tests-015 was only applied to the smoke-test copy. | +| 2 | mxaccessgw conventions | No new issues. Style/convention drift previously filed (Tests-008) remains resolved at `a020350`. | +| 3 | Concurrency & thread safety | No new issues. The remaining wall-clock dependencies (`InvokeAsync_WhenSessionReady_RefreshesLease` uses `UtcNow` at both ends of a ~1 hour delta, dwarfing clock resolution; `CloseExpiredLeasesAsync_*` reads `UtcNow` once and uses it consistently for both sides) are intrinsic to the production paths and not flake sources. The Tests-017 fix is in place at `WorkerClientTests.cs:354`. | +| 4 | Error handling & resilience | No new issues. Tests-013 closed the bulk-method coverage gap end-to-end (per-entry failure surfaces, protocol-status failures, and cancellation propagation are all exercised). Pipe-disconnect / worker-fault / kill paths all covered. | +| 5 | Security | No new issues. 
Adversarial-input safety (Tests-002), anonymous-localhost negatives (Tests-010), interceptor-service composition (Tests-004), constraint partial-denial merging (Server-021 — `PredicateConstraintEnforcer` + `MxAccessGatewayServiceConstraintTests`), and unmapped-RPC fail-closed (Server-017) all covered. | +| 6 | Performance & resource management | No new issues. Tests-014 (`await using WebApplication`) is applied to all seven `GatewayApplication.Build(...)` sites. Tests-003 (`TempDatabaseDirectory`) cleanup is in place. | +| 7 | Design-document adherence | Tests match `docs/GatewayTesting.md`; the new "Galaxy Filter Safety" subsection added under Tests-019 names `GalaxyFilterInputSafetyTests`. No drift found. | +| 8 | Code organization & conventions | Issue found: Tests-021 (`ManualTimeProvider` is duplicated as a `private sealed class` in four test files — `WorkerClientTests`, `FakeWorkerHarnessTests`, `SessionManagerTests`, `GalaxyHierarchyCacheTests` — and should follow the Tests-007 `TestSupport/` consolidation pattern). | +| 9 | Testing coverage | Issues found: Tests-020 (`MxAccessGatewayServiceConstraintTests` covers only 2 of 4 `WriteBulkConstraintPlan` switch arms — `Write2Bulk`/`WriteSecured2Bulk` `GetPayload`/`SetPayload` would silently break with no failing test), Tests-022 (the eleven `SessionManagerBulkTests.*_PropagatesCancellation` tests pre-cancel the token, so the fake's first-line `ThrowIfCancellationRequested` handles it before `InvokeBulkInternalAsync` even runs — they do not exercise mid-flight cancellation), Tests-024 (`BulkConstraintPlan.MergeDeniedInto` silently drops or under-fills if the worker reply count diverges from the allowed-count — no test pins this protocol-mismatch edge case). | +| 10 | Documentation & comments | No new issues. 
Tests-019's `docs/GatewayTesting.md` addition is in place; new test files (`SessionManagerBulkTests`, `MxAccessGatewayServiceConstraintTests`, `PredicateConstraintEnforcer`) all have orienting class-level summaries. | + +### 2026-05-24 review (commit d692232) + +Re-review pass at `d692232` scoped to the test-side fixture churn from +the dashboard refactor wave: the rename touched every namespace declaration +and `using`; the dashboard auth refactor rewrote three dashboard test files +(`DashboardApiKeyAuthorizationTests`, `DashboardAuthorizationHandlerTests`, +`DashboardAuthenticatorTests`); `GatewayApplicationTests` was updated for +root-mounted routes and the new `ViewerPolicy`; `DashboardCookieOptionsTests` +expects root-relative login/logout; a new `DashboardHubsRegistrationTests` +pins the three hub `/negotiate` endpoints and the DI shape; and the +`EventStreamService` ctor expansion drove inline `NullDashboardEventBroadcaster` +fakes in two test files. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. | +| 2 | mxaccessgw conventions | No issues found — namespaces updated cleanly, the fixture-helper consolidation pattern (`TestSupport/`) is intact. | +| 3 | Concurrency & thread safety | No issues found in this diff. | +| 4 | Error handling & resilience | No issues found in this diff. | +| 5 | Security | No issues found — `DashboardAuthorizationHandlerTests` covers both Viewer and Admin role paths and the loopback bypass. | +| 6 | Performance & resource management | No issues found in this diff. | +| 7 | Design-document adherence | No issues found in this diff. | +| 8 | Code organization & conventions | Issues found: Tests-025 (duplicate `NullDashboardEventBroadcaster` private classes in `EventStreamServiceTests` and `GatewayEndToEndFakeWorkerSmokeTests`; follow Tests-007 / Tests-021 consolidation pattern). 
| +| 9 | Testing coverage | Issues found: Tests-026 (no test proves `EventStreamService` actually calls `IDashboardEventBroadcaster.Publish` for each event — the only consumers in tests are `Null` fakes). | +| 10 | Documentation & comments | No issues found in this diff. | + +## Findings + +### Tests-001 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Testing coverage | +| Location | `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:483-489` | +| Status | Resolved | + +**Description:** `FakeSessionManager.TryGetSession` unconditionally returns `true` and synthesizes a session for any id. As a result, `Invoke_WhenSessionMissing_ThrowsNotFound` (line 52) only passes because `InvokeException` is pre-seeded — it does not verify that the gateway service maps a genuinely missing session to `NotFound`. No test exercises the real gateway path where `TryGetSession` returns `false` (for `StreamEvents`, `CloseSession`, alarm RPCs). A regression dropping the missing-session check would not be caught. + +**Recommendation:** Make `FakeSessionManager.TryGetSession` return `false` for unknown ids (return only seeded sessions), then assert `NotFound`/`InvalidArgument` is produced by the service's own lookup logic rather than an injected exception. + +**Resolution:** Resolved 2026-05-18: confirmed root cause — added `ResolveOnlySeededSessions`/`SeedSession` to `FakeSessionManager` so `TryGetSession` returns `false` for unseeded ids, rewrote `Invoke_WhenSessionMissing_ThrowsNotFound` to drop the injected `InvokeException` and exercise the service's own `ResolveSession` lookup (asserts `InvokeCount == 0`), and added `Invoke_WhenSessionSeeded_ResolvesAndInvokes`, `AcknowledgeAlarm_WhenSessionMissing_ThrowsNotFound`, and `QueryActiveAlarms_WhenSessionMissing_ThrowsNotFound`. 
+ +### Tests-002 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Security | +| Location | `src/MxGateway.Tests/Gateway/Grpc/GalaxyRepositoryGrpcServiceTests.cs:198-210` | +| Status | Resolved | + +**Description:** The Galaxy Repository RPCs browse a SQL Server database (`ZB`). Every test injects a `StubGalaxyHierarchyCache`, so actual SQL query construction, parameterization, and filter/glob translation are never exercised. No test demonstrates that `TagNameGlob`, `RootTagName`, `AlarmFilterPrefix`, etc. are passed as parameters rather than concatenated into SQL. SQL-injection resistance of the Galaxy layer has zero coverage. + +**Recommendation:** Add tests for the `GalaxyRepository` query-building layer (against SQLite or an in-memory abstraction, or by asserting parameter objects), covering glob/prefix inputs containing `'`, `%`, `_`, and `;`. At minimum add a unit test over the SQL `LIKE`-pattern escaping helper. + +**Re-triage note:** The finding's premise is partly misframed. `GalaxyRepository` issues only four *constant* SQL statements (`HierarchySql`, `AttributesSql`, `SELECT 1`, `SELECT time_of_last_deploy FROM galaxy`) — no `DiscoverHierarchyRequest` field is ever concatenated into SQL, so there is no dynamic SQL-injection surface and no `LIKE`-escaping helper to test. `AlarmFilterPrefix` belongs to the worker alarm path, not the Galaxy SQL layer. All filters (`TagNameGlob`, `RootTagName`, template-chain, category, contained-path) are applied **in memory** by `GalaxyHierarchyProjector`/`GalaxyGlobMatcher` against the cached snapshot. The genuine, testable concern — that adversarial filter strings are treated as opaque literals (no wildcard behaviour, no ReDoS, no exceptions) — remains valid and was previously uncovered. Severity left at High: an unsafe in-memory filter would still be a real security gap. 
+ +**Resolution:** Resolved 2026-05-18: added `src/MxGateway.Tests/Galaxy/GalaxyFilterInputSafetyTests.cs` (10 test methods, mostly `[Theory]` over adversarial inputs `'`, `' OR '1'='1`, `'; DROP TABLE gobject;--`, `%`, `_`, `100%_off`, `[abc]`, `Pump'001`) covering `GalaxyGlobMatcher` literal-treatment / `LIKE`-wildcard / pathological-input (ReDoS) behaviour and `GalaxyHierarchyProjector` + `DiscoverHierarchy` RPC handling of adversarial `TagNameGlob`, `RootTagName`, and `TemplateChainContains`. No product bug found — the in-memory filter layer treats all metacharacters as literals; the passing tests resolve the coverage gap. + +### Tests-003 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Performance & resource management | +| Location | `src/MxGateway.Tests/Security/Authentication/SqliteAuthStoreTests.cs:170-176`, `src/MxGateway.Tests/Security/Authentication/ApiKeyAdminCliRunnerTests.cs:252-258` | +| Status | Resolved | + +**Description:** `CreateTempDatabasePath` creates a fresh directory under `%TEMP%\mxgateway-auth-tests\` (and `...-cli-tests`) for every test but nothing ever deletes it. `WorkerProcessLauncherTests.TestDirectory` correctly implements `IDisposable` and cleans up; these two do not. SQLite connection pooling can also keep the `.db` handle open after the test. Over many CI runs this leaks temp files and open handles. + +**Recommendation:** Wrap the temp directory in an `IDisposable`/`IAsyncDisposable` helper (as `WorkerProcessLauncherTests` does) and call `SqliteConnection.ClearAllPools()` before deletion, or use `Microsoft.Data.Sqlite` in-memory mode where a real file is not needed. + +**Resolution:** Resolved 2026-05-18: confirmed root cause — both `CreateTempDatabasePath` helpers created `%TEMP%` directories with no cleanup, and `Microsoft.Data.Sqlite` pools connections by default so the `.db` handle outlives the test. 
Added a shared `TempDatabaseDirectory` (`src/MxGateway.Tests/Security/Authentication/TempDatabaseDirectory.cs`) `IDisposable` helper that calls `SqliteConnection.ClearAllPools()` and recursively deletes its directory. `SqliteAuthStoreTests` and `ApiKeyAdminCliRunnerTests` now implement `IDisposable`, track every directory created via `CreateTempDatabasePath`, and dispose them after each test. All affected tests still pass. + +### Tests-004 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `src/MxGateway.Tests/Security/Authorization/GatewayGrpcAuthorizationInterceptorTests.cs` | +| Status | Resolved | + +**Description:** The authorization interceptor and `MxAccessGatewayService` are each tested in isolation, but no test composes the interceptor in front of the real service to confirm scope enforcement gates real RPCs end-to-end. A wiring mistake — interceptor not registered, or a new RPC added without a scope mapping in `GatewayGrpcScopeResolver` — would pass every existing test. `GatewayGrpcScopeResolverTests` also only checks an enumerated allow-list; it never asserts an unmapped request type fails closed. + +**Recommendation:** Add an end-to-end test that runs `OpenSession`/`Invoke` through the interceptor+service composition with insufficient scope and asserts `PermissionDenied`; add a `GatewayGrpcScopeResolver` test asserting an unknown/unmapped request type throws or denies rather than returning a permissive default. + +**Resolution:** Resolved 2026-05-18: confirmed the coverage gap. 
Added three interceptor+service composition tests to `GatewayGrpcAuthorizationInterceptorTests` that run the real `GatewayGrpcAuthorizationInterceptor` continuation into a real `MxAccessGatewayService`: `InterceptorComposedWithService_OpenSessionMissingScope_DeniesBeforeServiceRuns` (asserts `PermissionDenied` and `OpenSessionCount == 0`), `InterceptorComposedWithService_OpenSessionWithScope_RunsServiceWithIdentity` (service runs and observes the interceptor-pushed identity), and `InterceptorComposedWithService_InvokeWriteCommandWithReadScope_DeniesBeforeServiceRuns` (a `Write` command with only `invoke:read` is denied). Added two `GatewayGrpcScopeResolverTests`: `ResolveRequiredScope_UnmappedRequestType_FailsClosedToAdminScope` confirms an unmapped request type resolves to the most-restrictive `Admin` scope (the resolver's `_ => GatewayScopes.Admin` default already fails closed — no product bug), and `ResolveRequiredScope_UnknownInvokeCommandKind_ReturnsInvokeReadScope` confirms an unknown command kind does not silently grant write/admin access. + +### Tests-005 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `src/MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs:239-261`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs` | +| Status | Resolved | + +**Description:** Worker-crash handling is only tested as a clean terminal exception from `ReadEventsAsync` or a pre-set `ShutdownException`. There is no test for a worker that faults mid-command — an `InvokeAsync` in flight when the pipe/worker dies — which is a core fault-handling path of the two-process design. `WorkerClientTests` covers pipe-disconnect faulting the read loop, but not the interaction where a pending `InvokeAsync` task observes the fault and surfaces a meaningful error code. 
+ +**Recommendation:** Add a `WorkerClient`/`SessionManager` test that disposes the worker pipe (or emits a `WorkerFault`) while an `InvokeAsync` is pending, and assert the invoke task fails with a `WorkerClientException`/`SessionManagerException` carrying the worker-faulted error code. + +**Resolution:** Resolved 2026-05-18: confirmed the coverage gap and confirmed the product path already handles it correctly (`WorkerClient.ReadLoopAsync` → `SetFaulted` → `CompletePendingCommands(fault)` fails every pending command with the fault exception). Added two `WorkerClientTests`: `InvokeAsync_WhenPipeDisconnectsMidCommand_FailsPendingInvokeWithPipeDisconnected` (worker reads the command then disposes its pipe side; the pending invoke task fails with `WorkerClientErrorCode.PipeDisconnected`) and `InvokeAsync_WhenWorkerFaultsMidCommand_FailsPendingInvokeWithWorkerFaulted` (worker emits a `WorkerFault` envelope while the invoke is pending; the task fails with `WorkerClientErrorCode.WorkerFaulted`). Both also assert the client transitions to `Faulted`. No product change needed. + +### Tests-006 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:76`, `src/MxGateway.Tests/Gateway/Workers/FakeWorkerHarnessTests.cs:122` | +| Status | Resolved | + +**Description:** Several tests rely on fixed `Task.Delay` values: `WorkerClientTests.InvokeAsync_WithLateReply…` waits a hard-coded 50 ms after writing a late reply before issuing the second command, and the heartbeat tests use a 20 ms delay to make timestamps strictly increase. On a slow CI agent the 50 ms delay can be insufficient, and `DateTimeOffset.UtcNow` resolution can make the 20 ms heartbeat-advance assertion flaky. 
+ +**Recommendation:** Replace fixed delays with the existing `WaitUntilAsync` condition polling, and inject a controllable `TimeProvider` for heartbeat-timestamp comparisons instead of relying on wall-clock advance. + +**Re-triage note:** The brief flagged `ReadLoop_WhenClientFaults_KillsOwnedWorkerProcess` as "a real `WorkerClient` fault→kill bug". On inspection it is **not a product bug** — it is a test race. `WorkerClient.SetFaulted` publishes the `Faulted` state under lock *before* calling `KillOwnedProcess`, so the old test's `WaitUntilAsync(() => client.State == Faulted)` could return between those two statements and observe `process.KillCount == 0`. The kill itself always runs synchronously inside `SetFaulted`, and `ShutdownAsync`/`DisposeAsync` re-issue an idempotent kill, so no real consumer relies on "state==Faulted implies process dead". The fix is therefore a test-quality fix (correctly Medium / Concurrency), not a product fix. + +**Resolution:** Resolved 2026-05-18: (1) Made `ReadLoop_WhenClientFaults_KillsOwnedWorkerProcess` deterministic — it now `await`s `FakeWorkerProcess.WaitForExitAsync` (the `TaskCompletionSource` completed inside `Kill()`), which completes exactly when the kill runs, eliminating the state-polling race; verified by running it five times in isolation (5/5 pass). (2) Removed the fixed 50 ms `Task.Delay` from `InvokeAsync_WithLateReply_IgnoresLateReplyAndKeepsClientReady` — the stale reply and the second reply are now sent in pipe (FIFO) order, so the read loop discards the stale reply before the second reply with no timing window. (3) Replaced the 20 ms `Task.Delay` heartbeat-advance hacks in `WorkerClientTests.ReadLoop_WhenHeartbeatArrives_UpdatesLastHeartbeatAndWorkerProcess` and `FakeWorkerHarnessTests.SendHeartbeatAsync_UpdatesClientHeartbeatState` with an injected `ManualTimeProvider` advanced by a fixed `TimeSpan`; both tests now assert the exact post-advance timestamp instead of `>` against wall-clock drift. 
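**Illustrative sketch:** The injected-clock approach from the Tests-006 resolution can be sketched on top of `System.TimeProvider` (.NET 8 and later). `ManualClock` is a hypothetical name and shape; the project's `ManualTimeProvider` is not shown in this review and may differ.

```csharp
using System;

// Usage: time moves only when the test says so, so exact timestamps can be asserted.
var clock = new ManualClock(new DateTimeOffset(2026, 5, 18, 12, 0, 0, TimeSpan.Zero));
DateTimeOffset before = clock.GetUtcNow();
clock.Advance(TimeSpan.FromSeconds(2));
Console.WriteLine(clock.GetUtcNow() - before); // 00:00:02

// Hypothetical manual clock. Only GetUtcNow is overridden, so timers created
// through this provider still run on the real system clock.
public sealed class ManualClock : TimeProvider
{
    private readonly object _gate = new();
    private DateTimeOffset _utcNow;

    public ManualClock(DateTimeOffset start) => _utcNow = start;

    public override DateTimeOffset GetUtcNow()
    {
        lock (_gate) { return _utcNow; }
    }

    public void Advance(TimeSpan delta)
    {
        if (delta < TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(delta));
        lock (_gate) { _utcNow += delta; }
    }
}
```

Because the clock never drifts, a heartbeat test can assert equality against the post-advance instant rather than a `>` comparison against wall-clock time.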
+ +### Tests-007 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:682`, `src/MxGateway.Tests/Gateway/Grpc/GalaxyRepositoryGrpcServiceTests.cs:324`, `src/MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:460`, `src/MxGateway.Tests/Security/Authorization/GatewayGrpcAuthorizationInterceptorTests.cs:233` | +| Status | Resolved | + +**Description:** A near-identical `TestServerCallContext` implementation is copy-pasted into at least four test files (and `AllowAllConstraintEnforcer` / `TestServerStreamWriter` / `RecordingStreamWriter` into several). Duplication risks the copies drifting and bloats each file. + +**Recommendation:** Extract a shared `TestServerCallContext`, `RecordingServerStreamWriter`, and `AllowAllConstraintEnforcer` into a common test-support folder/namespace. + +**Resolution:** Resolved 2026-05-18: confirmed five duplicated copies (the brief's four plus a fifth in `Galaxy/GalaxyFilterInputSafetyTests.cs`). Added a shared `MxGateway.Tests.TestSupport` namespace under `src/MxGateway.Tests/TestSupport/`: `TestServerCallContext.cs` (single class with an optional `Metadata? requestHeaders` constructor parameter that subsumes both the no-arg and headers-bearing variants), `RecordingServerStreamWriter.cs` (thread-safe writer with `Messages` and `WaitForFirstMessageAsync`, replacing `TestServerStreamWriter`/`RecordingStreamWriter`/`RecordingServerStreamWriter`), and `AllowAllConstraintEnforcer.cs`. Deleted all five `TestServerCallContext` copies, both `AllowAllConstraintEnforcer` copies, and the three stream-writer copies; updated the five test files to `using MxGateway.Tests.TestSupport;` and renamed `.Items` call sites to `.Messages`. Removed the now-unused `Grpc.Core` using from `GatewayEndToEndFakeWorkerSmokeTests.cs`. Build clean (0 warnings) and suite green. 
+ +### Tests-008 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | mxaccessgw conventions | +| Location | `src/MxGateway.Tests/Gateway/Sessions/WorkerAlarmRpcDispatcherTests.cs:1-9`, `src/MxGateway.Tests/Gateway/Sessions/NotWiredAlarmRpcDispatcherTests.cs:1-3`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerAlarmAutoSubscribeTests.cs:1` | +| Status | Resolved | + +**Description:** The alarm test files diverge from the project's C# style and the rest of the suite: snake_case test method names instead of the PascalCase `Method_Condition_Result` pattern; redundant explicit `using System;`/`System.Threading;` imports despite implicit global usings; and explicit-type `new` instead of target-typed `new()` used elsewhere. There is also a typo in fixture data (`"wnwrap subscribe failed"`). + +**Recommendation:** Rename the alarm tests to the house `Method_Condition_Result` convention, drop redundant `System.*` usings, align `new` usage, and fix the `wnwrap` typo. + +**Re-triage note:** Two of the finding's claims are incorrect. (1) `"wnwrap subscribe failed"` is **not a typo** — `WnWrap` is the real name of the worker's `WnWrapAlarmConsumer` MXAccess component (`src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs`); the fixture string deliberately references it, so it was left unchanged. (2) `SessionManagerAlarmAutoSubscribeTests.cs` already uses PascalCase `Method_Condition_Result` names and target-typed `new()`, and its lone `using System.Runtime.CompilerServices;` is **required** for `[EnumeratorCancellation]` (not a global using) — it is not redundant. That file needed no change. The genuine style drift was confined to `WorkerAlarmRpcDispatcherTests.cs` and `NotWiredAlarmRpcDispatcherTests.cs`. 
+ +**Resolution:** Resolved 2026-05-18: renamed all ten `WorkerAlarmRpcDispatcherTests` methods and both `NotWiredAlarmRpcDispatcherTests` methods from snake_case to the house `Method_Condition_Result` PascalCase convention; dropped the redundant `System`/`System.Collections.Generic`/`System.Linq`/`System.Threading`/`System.Threading.Tasks` usings from `WorkerAlarmRpcDispatcherTests.cs` and `System.Threading`/`System.Threading.Tasks` from `NotWiredAlarmRpcDispatcherTests.cs` (all are implicit global usings), keeping the required `System.Runtime.CompilerServices`; converted explicit-type `new SessionRegistry()`/`new WorkerAlarmRpcDispatcher(...)`/`new FakeAlarmWorkerClient`/`new List<...>()`/`new GatewaySession(...)` to target-typed `new()`; and replaced the fully-qualified `System.StringComparison` with `StringComparison`. See the re-triage note for the two claims not actioned. Suite green. + +### Tests-009 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:36-37,99,365` | +| Status | Resolved | + +**Description:** Several XML `<summary>` comments are copy-paste mismatches: the comment above `OpenSessionAsync_SetsInitialDefaultLease` describes correlation-ID generation; the comment above `GatewaySessionSubscribeBulkAsync_ForwardsOneBulkCommand…` describes lease refresh; the comment above `CloseExpiredLeasesAsync_DoesNotCloseActiveEventSubscriber` describes shutdown closing all sessions. Misleading test docs hinder triage. + +**Recommendation:** Correct the `<summary>` text to match each test's actual behavior, or remove the redundant comments since the test names already describe the behavior. + +**Resolution:** Resolved 2026-05-18: confirmed three copy-paste `<summary>` mismatches. The mislabelled comments were the summaries of the *following* tests left attached to the wrong method (the test below each then had no summary).
Corrected all three: `OpenSessionAsync_SetsInitialDefaultLease` now describes setting the initial lease expiry; the comment above `InvokeAsync_WhenSessionReady_RefreshesLease` (the finding mis-cited the method name as `GatewaySessionSubscribeBulkAsync_…`) now describes lease refresh on invoke; and `CloseExpiredLeasesAsync_DoesNotCloseActiveEventSubscriber` now describes the expired-lease sweep leaving an active-event-subscriber session open. No behavior change. + +### Tests-010 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Security | +| Location | `src/MxGateway.Tests/Gateway/Dashboard/DashboardAuthorizationHandlerTests.cs:26-36` | +| Status | Resolved | + +**Description:** The anonymous-localhost bypass is tested only for the success case (`allowAnonymousLocalhost: true` + loopback succeeds) and the remote-unauthenticated denial. There is no test for the security-critical negatives: anonymous + loopback when `AllowAnonymousLocalhost` is `false` must be denied, and anonymous + non-loopback when the flag is `true` must still be denied (the bypass is scoped strictly to loopback). Those are the misconfiguration cases that would expose the dashboard. + +**Recommendation:** Add tests: anonymous + loopback + `allowAnonymousLocalhost: false` → not succeeded; anonymous + non-loopback + `allowAnonymousLocalhost: true` → not succeeded. + +**Resolution:** Resolved 2026-05-18: confirmed the coverage gap and confirmed `DashboardAuthorizationHandler` already gates the bypass correctly on `AllowAnonymousLocalhost && IsLoopbackRequest()` (no product bug). Added two `DashboardAuthorizationHandlerTests`: `HandleAsync_AnonymousLocalhostDisallowed_DoesNotSucceed` (anonymous + loopback + `allowAnonymousLocalhost: false` → not succeeded) and `HandleAsync_AnonymousLocalhostAllowedFromRemoteAddress_DoesNotSucceed` (anonymous + non-loopback + `allowAnonymousLocalhost: true` → not succeeded, proving the bypass stays scoped to loopback). Both pass. 
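**Illustrative sketch:** The gate that Tests-010 pins (`AllowAnonymousLocalhost && IsLoopbackRequest()`) reduces to a small truth table. The helper below is hypothetical and is not the project's `DashboardAuthorizationHandler`; treating a missing remote address as "not loopback" is an assumption made for the example.

```csharp
using System;
using System.Net;

// The four anonymous cases; only flag-on plus loopback is allowed.
IPAddress remote = IPAddress.Parse("203.0.113.10"); // documentation-range address, not loopback
Console.WriteLine(AnonymousBypass.IsAllowed(true, IPAddress.Loopback));  // True
Console.WriteLine(AnonymousBypass.IsAllowed(false, IPAddress.Loopback)); // False
Console.WriteLine(AnonymousBypass.IsAllowed(true, remote));              // False
Console.WriteLine(AnonymousBypass.IsAllowed(false, remote));             // False

// Hypothetical helper mirroring the conjunction described in the finding.
public static class AnonymousBypass
{
    public static bool IsAllowed(bool allowAnonymousLocalhost, IPAddress? remoteAddress)
        => allowAnonymousLocalhost
           && remoteAddress is not null          // assumption: unknown origin is denied
           && IPAddress.IsLoopback(remoteAddress);
}
```

The second and third rows are the two negatives the finding asked for: they are the misconfiguration cases that would expose the dashboard if either half of the conjunction were dropped.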
+ +### Tests-011 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:233-301` | +| Status | Resolved | + +**Description:** `GatewayEndToEndFakeWorkerSmokeTests` correctly stores and awaits `launcher.WorkerTask`, but `SessionWorkerClientFactoryFakeWorkerTests` uses `_ = RunWorkerAsync(...)` with no stored task (lines 152, 184, 220). An unhandled exception in the scripted worker becomes an unobserved task exception that can surface as a process-level failure in an unrelated later test rather than failing the owning test. + +**Recommendation:** Store the worker task and either await it during disposal or attach a continuation that fails the test on fault, mirroring `GatewayEndToEndFakeWorkerSmokeTests`. + +**Resolution:** Resolved 2026-05-18: confirmed all three scripted launchers in `SessionWorkerClientFactoryFakeWorkerTests` discarded the worker task. Added an `IWorkerTaskLauncher` interface (each launcher now stores its scripted task in a `WorkerTask` property and exposes `ObserveWorkerTaskAsync`); the test class now implements `IAsyncDisposable`, tracks every launcher it creates via a `Track` helper, and in `DisposeAsync` awaits each `WorkerTask` (within `TestTimeout`) so a scripted-worker fault fails the owning test instead of leaking as an unobserved task exception (raised later through `TaskScheduler.UnobservedTaskException`). `OperationCanceledException` and `IOException` (the expected outcomes of the worker client tearing the pipe down) are swallowed; anything else rethrows. `NeverReadyWorkerProcessLauncher` (which parks on an infinite `Task.Delay`) was given its own `CancellationTokenSource` so disposal can cancel and observe the parked task. Suite green.
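**Illustrative sketch:** The "store the task and observe it during disposal" pattern from Tests-011 might look roughly like this. `WorkerTaskObserver` and `ObserveAsync` are hypothetical names, not the project's `IWorkerTaskLauncher.ObserveWorkerTaskAsync`; only the swallow-list (`OperationCanceledException`, `IOException`) is taken from the resolution.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// A scripted-worker bug now fails at the await instead of going unobserved.
Task faulted = Task.FromException(new InvalidOperationException("scripted worker bug"));
try
{
    await WorkerTaskObserver.ObserveAsync(faulted, TimeSpan.FromSeconds(5));
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message); // scripted worker bug
}

// Pipe teardown is an expected outcome and completes without throwing.
await WorkerTaskObserver.ObserveAsync(
    Task.FromException(new IOException("pipe closed")), TimeSpan.FromSeconds(5));

public static class WorkerTaskObserver
{
    // Awaits a stored background task during teardown. Expected teardown
    // outcomes are swallowed; any other fault (or a timeout) propagates.
    public static async Task ObserveAsync(Task workerTask, TimeSpan timeout)
    {
        try
        {
            await workerTask.WaitAsync(timeout);
        }
        catch (OperationCanceledException)
        {
            // Expected when the test cancels the scripted worker.
        }
        catch (IOException)
        {
            // Expected when the client tears the pipe down first.
        }
    }
}
```

Bounding the wait with a timeout keeps a parked worker from hanging disposal; in this sketch a timeout surfaces as `TimeoutException` and therefore also fails the owning test.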
+ +### Tests-012 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Tests/Gateway/Workers/Fakes/FakeWorkerHarness.cs:62`, `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:472` | +| Status | Resolved | + +**Description:** Pipe names are uniquified per test with a GUID (good), but xUnit runs test classes in parallel by default and there is no `xunit.runner.json` or collection configuration. Tests that build a full `WebApplication` bind ephemeral ports (`--urls=http://127.0.0.1:0`, fine) but spin up DI containers and hosted services concurrently. Currently safe, but a future test binding a fixed port would silently collide. + +**Recommendation:** Add an `xunit.runner.json` or a collection grouping the `WebApplication`-building tests, and keep the `:0` ephemeral-port convention explicit so future tests do not introduce a fixed-port collision. + +**Resolution:** Resolved 2026-05-18: added `src/MxGateway.Tests/xunit.runner.json` making the parallelism policy explicit (`parallelizeTestCollections: true`, `maxParallelThreads: -1`, `parallelizeAssembly: false`, `longRunningTestSeconds: 30`) and wired it into `MxGateway.Tests.csproj` so the runner picks it up (confirmed present in `bin/Debug/net10.0/`). Added a comment at the only `WebApplication`-building call site (`GatewayApplicationTests.cs`, `--urls=http://127.0.0.1:0`) documenting that the ephemeral-port (`:0`) convention is mandatory because test collections run in parallel. No fixed-port binding exists today; this is a preventative guardrail as the finding recommends.
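**Illustrative sketch:** Based only on the four settings quoted in the Tests-012 resolution, the runner file would look something like the fragment below. The key order and the absence of any other entries (such as a `$schema` line) are assumptions; the actual file is not reproduced in this review.

```json
{
  "parallelizeAssembly": false,
  "parallelizeTestCollections": true,
  "maxParallelThreads": -1,
  "longRunningTestSeconds": 30
}
```

Spelling out the policy this way is what makes the ephemeral-port (`:0`) convention mandatory: collections are stated to run concurrently, so any fixed port would be contended.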
+ +### Tests-013 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:449-679`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs` | +| Status | Resolved | + +**Description:** `GatewaySession` exposes eleven bulk methods (`AddItemBulkAsync`, `AdviseItemBulkAsync`, `RemoveItemBulkAsync`, `UnAdviseItemBulkAsync`, `SubscribeBulkAsync`, `UnsubscribeBulkAsync`, `WriteBulkAsync`, `Write2BulkAsync`, `WriteSecuredBulkAsync`, `WriteSecured2BulkAsync`, `ReadBulkAsync`) but only three (`SubscribeBulkAsync`, `WriteBulkAsync`, `ReadBulkAsync`) are exercised in `SessionManagerTests`. A grep across `src/MxGateway.Tests` for the other eight method names returns zero matches. The recent commit `eaa7093` ("register the five new bulk subcommands in `IsKnownGatewayCommand`") explicitly added bulk surface to the gateway, and `1cd51bb` added stress benchmarks for it, but the gateway-side tests do not pin the command-kind, payload-shape, or `WriteSecured*Bulk` credential-redaction behaviour for any of the new bulk variants. A future regression in `WriteSecuredBulkAsync` body construction would not be caught by the gateway unit suite. + +**Recommendation:** Mirror the existing `SubscribeBulkAsync` / `WriteBulkAsync` / `ReadBulkAsync` test pattern for the eight missing methods: each test should `OpenSessionAsync`, invoke the bulk API, assert the worker received exactly one `WorkerCommand` of the matching `MxCommandKind`, and (for the secured variants) confirm the credential payload survives the round-trip without being log-redacted from the over-the-wire command shape. + +**Resolution:** Resolved 2026-05-20: added `src/MxGateway.Tests/Gateway/Sessions/SessionManagerBulkTests.cs` with per-method coverage for all eleven bulk entry points. 
Each method now has a round-trip test that pins (a) the exact `MxCommandKind` sent to the worker, (b) the payload shape (server handle, item handles / tag addresses / entries, timeout for `ReadBulk`), and (c) per-entry failure surfacing where the reply contains a mix of `WasSuccessful = true`/`false` results with an `ErrorMessage`. Each method also has a `*_PropagatesCancellation` test that pre-cancels the token and asserts `OperationCanceledException` flows out. The secured variants additionally pin that `CurrentUserId` / `VerifierUserId` survive the over-the-wire command shape unchanged (the gateway's redaction rules apply only to logs, not to the command body the worker receives). New tests use a local `FakeBulkWorkerClient` keyed by `MxCommand.Kind`-specific replies; no production-code change. All 54 SessionManager/GalaxyHierarchyCache tests pass with `dotnet test --filter "FullyQualifiedName~SessionManager|FullyQualifiedName~GalaxyHierarchyCache"`. + +### Tests-014 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Performance & resource management | +| Location | `src/MxGateway.Tests/Gateway/GatewayApplicationTests.cs:18,33,44,62,81,105`, `src/MxGateway.Tests/Gateway/Dashboard/DashboardCookieOptionsTests.cs:17` | +| Status | Resolved | + +**Description:** Seven `[Fact]` methods build a real `WebApplication` via `GatewayApplication.Build([])` and never dispose it. `WebApplication` is `IAsyncDisposable`; constructing one stands up a full DI container, an OpenTelemetry meter (`GatewayMetrics`), Kestrel server objects, hosted services, and logging providers. Because the suite runs test collections in parallel (per the new `xunit.runner.json` from Tests-012), every undisposed instance keeps its meter/loggers/hosted services alive until the test process exits, doubling up live Meter instances each time and silently extending the memory/handle footprint of an `xunit` run. 
Only the two tests that actually call `app.StartAsync()` (`GatewayApplicationTests.StartAsync_InvalidGatewayConfiguration_FailsStartup` and `SqliteAuthStoreTests.StartAsync_NewerSchemaVersion_BlocksStartup`) currently use `await using`. + +**Recommendation:** Promote each `WebApplication app = GatewayApplication.Build(...)` to `await using WebApplication app = ...` and make the containing test method `async Task`. The endpoint-listing assertions do not need `await`, but the `await using` will ensure the DI container, meter, and hosted services are torn down per-test. + +**Resolution:** 2026-05-20 — Promoted all seven `WebApplication`-building tests (six in `GatewayApplicationTests` plus the one in `DashboardCookieOptionsTests`) to `async Task` with `await using WebApplication app = GatewayApplication.Build(...)`, so the DI container, `GatewayMetrics` meter, hosted services, and Kestrel objects are torn down per-test rather than leaking until process exit. The previously already-`await using` `StartAsync_InvalidGatewayConfiguration_FailsStartup` was unchanged. Full suite green. + +### Tests-015 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:374-379,87` | +| Status | Resolved | + +**Description:** The nested `FakeWorkerProcess.WaitForExitAsync` implementation unconditionally sets `HasExited = true` and `ExitCode ??= 0` when called, regardless of whether the scripted worker actually completed the shutdown handshake. The smoke-test assertion `Assert.True(launcher.Process.HasExited)` therefore cannot distinguish "the scripted worker received `WorkerShutdown`, sent `WorkerShutdownAck`, and called `MarkExited(0)`" from "the gateway code path simply awaited `WaitForExitAsync` somewhere during teardown". 
The scripted worker happens to call `MarkExited(0)` after receiving the shutdown frame, but a regression that bypassed the shutdown-ack path entirely would still pass this assertion. The companion launcher in `SessionWorkerClientFactoryFakeWorkerTests.FakeWorkerProcess.WaitForExitAsync` (lines 351-356) has the same shape — fine there because no exit assertion is made — but the smoke test relies on this signal. + +**Recommendation:** Make `WaitForExitAsync` await an internal `TaskCompletionSource` that is only completed by `Kill()` or `MarkExited()` (the same pattern `WorkerClientTests.FakeWorkerProcess` already uses for `_exited`), so `HasExited` reflects actual exit and the smoke test's assertion is meaningful. + +**Resolution:** 2026-05-20 — Rewrote the smoke-test `FakeWorkerProcess` to back `WaitForExitAsync` with a `TaskCompletionSource _exited` that is only completed inside `MarkExited` (called by the scripted worker after sending `WorkerShutdownAck`) or `Kill` (which calls `MarkExited(-1)`), removing the "set `HasExited = true` and return immediately" cheat. The smoke test now also asserts `Assert.Equal(0, launcher.Process.ExitCode)` — `MarkExited(0)` is reachable only via the shutdown-ack branch, so a regression that bypassed the ack path would produce a non-zero (or null) exit code and fail the assertion deterministically. `WorkerClient.ShutdownAsync` calls `WaitForProcessExitAsync`, which now genuinely awaits the scripted worker's ack. + +### Tests-016 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs:29-41,115-124` | +| Status | Resolved | + +**Description:** `RefreshAsync_WhenSqlIsUnreachable_MarksUnavailableAndDoesNotPublish` is in the unit-test project but exercises a real `GalaxyHierarchyCache`/`GalaxyRepository` against a hard-coded TCP socket `127.0.0.1:65500` with a one-second connect timeout. 
Per `docs/GatewayTesting.md`, live Galaxy coverage belongs in `MxGateway.IntegrationTests` and is gated by `MXGATEWAY_RUN_LIVE_GALAXY_TESTS=1`; this test is neither gated nor uses a stub repository. On most boxes the connect fails closed (the test passes), but the outcome depends on OS-level "connection refused" vs "no route to host" behaviour and is sensitive to environments where 127.0.0.1:65500 happens to be bound — a real flakiness source. It also breaks the gateway-without-MXAccess invariant in spirit (the gateway code path under test does I/O the unit project should not need). + +**Recommendation:** Either (a) replace the real repository with an in-test fake that throws a `SqlException`/`TimeoutException` from `GetHierarchyAsync`, exercising `GalaxyHierarchyCache.RefreshAsync`'s exception path directly; or (b) move the test to `MxGateway.IntegrationTests` and gate it behind a "no-live-DB-required" variant of the live-Galaxy attribute. (a) is preferred because the production path being tested is the cache's reaction to a repository exception, not socket behaviour. + +**Resolution:** Resolved 2026-05-20: applied option (a). Introduced `src/MxGateway.Server/Galaxy/IGalaxyRepository.cs` with the four methods the cache consumes (`TestConnectionAsync`, `GetLastDeployTimeAsync`, `GetHierarchyAsync`, `GetAttributesAsync`); made `GalaxyRepository` implement it; changed `GalaxyHierarchyCache`'s constructor to depend on `IGalaxyRepository` rather than the concrete type; and registered the interface against the existing concrete singleton in `GalaxyRepositoryServiceCollectionExtensions.AddGalaxyRepository`. Rewrote the test as `RefreshAsync_WhenRepositoryThrows_MarksUnavailableAndDoesNotPublish` using a local `ThrowingGalaxyRepository : IGalaxyRepository` that throws an `InvalidOperationException` from `GetLastDeployTimeAsync` (the first call the cache makes against the repository). 
The test now exercises the cache's exception branch directly — no TCP I/O — and additionally asserts that `GetHierarchyAsync`/`GetAttributesAsync` are NOT invoked once the deploy-time probe has failed. `Current_BeforeAnyRefresh_ReturnsEmpty` was migrated to the same fake. The unreachable `CreateCache` helper that built a real `GalaxyRepository` against `127.0.0.1:65500` was removed. The Galaxy SQL surface itself stays covered by `MxGateway.IntegrationTests.Galaxy.GalaxyRepositoryLiveTests` (gated by `MXGATEWAY_RUN_LIVE_GALAXY_REPOSITORY_TESTS=1`). + +### Tests-017 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:346-364` | +| Status | Resolved | + +**Description:** `HeartbeatMonitor_WhenHeartbeatExpires_FaultsClient` configures `HeartbeatGrace = 80 ms` and `HeartbeatCheckInterval = 20 ms`, then asserts the client faults within the 5-second `TestTimeout`. The test compares against the real wall clock — the heartbeat monitor reads `TimeProvider.System` for the grace check. After Tests-006 migrated the other heartbeat tests to an injected `ManualTimeProvider` for determinism, this one is now the only `WorkerClientTests` heartbeat case that still rides the wall clock. The 5-second outer bound makes a false failure unlikely, but the test cannot fail fast when the heartbeat-monitor logic regresses — it just waits the full 5 seconds. + +**Recommendation:** Inject the same `ManualTimeProvider` used by `ReadLoop_WhenHeartbeatArrives_UpdatesLastHeartbeatAndWorkerProcess`, then `clock.Advance(TimeSpan.FromSeconds(2))` past the grace and assert the fault deterministically. The `HeartbeatCheckInterval` (20 ms) timer fire can stay on the real clock; what needs to be deterministic is the grace comparison. 
+ +**Resolution:** 2026-05-20 — `HeartbeatMonitor_WhenHeartbeatExpires_FaultsClient` now constructs a `ManualTimeProvider` seeded at `"2026-05-20T12:00:00Z"`, passes it to `CreateClient` via the existing `timeProvider` parameter, and calls `clock.Advance(TimeSpan.FromSeconds(2))` after the handshake. `WorkerClient.MarkReady` records `_lastHeartbeatAt` from the manual clock, so the next 20 ms `HeartbeatCheckInterval` tick observes `now - lastHeartbeat = 2s > 80ms grace` and faults deterministically. The check-interval timer stays on the real clock as the finding recommended; only the grace comparison is deterministic. + +### Tests-018 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs:32`, `src/MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotServiceTests.cs:45,51,57,105,134,163,167,202-209,284,317,523`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:40` | +| Status | Resolved | + +**Description:** Several tests parse ISO-8601 literals with `DateTimeOffset.Parse("2026-04-26T10:00:00Z")` without an explicit `CultureInfo.InvariantCulture`. `Directory.Build.props` enables `TreatWarningsAsErrors`, but CA1305 (specify `IFormatProvider`) is not currently raised because the tests don't trigger it; nevertheless, `DateTimeOffset.Parse` without a culture takes `CurrentCulture`, and on a locale whose `DateTimeFormatInfo` rejects the `Z` suffix or uses non-Gregorian calendar conventions, these parses can throw at test time. `WorkerClientTests.cs:327` and `FakeWorkerHarnessTests.cs:121` already added `System.Globalization.CultureInfo.InvariantCulture` in the Tests-006 fix; the other ~15 call sites did not get the same treatment. + +**Recommendation:** Add `CultureInfo.InvariantCulture` to every `DateTimeOffset.Parse(...)` call in `MxGateway.Tests`, or replace with `DateTimeOffset.ParseExact` against the literal `"O"` round-trip format. 
A single-line `using System.Globalization;` per file keeps the call sites concise. + +**Resolution:** 2026-05-20 — Added `CultureInfo.InvariantCulture` to every `DateTimeOffset.Parse` site in `MxGateway.Tests` that lacked it: 16 call sites in `DashboardSnapshotServiceTests.cs` (a new `using System.Globalization;` was added so the call sites stay concise) and one in `SessionManagerTests.cs` (using the fully-qualified `System.Globalization.CultureInfo.InvariantCulture` to match the in-file style of the existing `ManualTimeProvider` parse sites). `GalaxyHierarchyCacheTests.cs:36` was already correct from the Tests-016 rewrite. A final grep confirms every `DateTimeOffset.Parse`/`DateTime.Parse` call in `src/MxGateway.Tests` now passes `CultureInfo.InvariantCulture`. + +### Tests-019 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `docs/GatewayTesting.md`, `code-reviews/Tests/findings.md` (Tests-002 re-triage) | +| Status | Resolved | + +**Description:** The Tests-002 re-triage (2026-05-18) confirmed there is no SQL-injection surface in `GalaxyRepository` because filters are applied in memory by `GalaxyHierarchyProjector`/`GalaxyGlobMatcher` against the cached snapshot, and added 10 adversarial-input tests in `src/MxGateway.Tests/Galaxy/GalaxyFilterInputSafetyTests.cs`. That explanation lives only in the findings file; `docs/GatewayTesting.md` does not mention `GalaxyFilterInputSafetyTests`, the in-memory filter model, or the adversarial-input matrix. A future reader of the test docs will not know which tests pin the literal-filter behaviour or why the Galaxy SQL layer is not unit-tested for parameterisation. Per `CLAUDE.md` ("Update docs in the same change as the source. 
When public APIs, contracts, configuration, build steps, security behavior, event shapes, value conversion, status mapping, or lifecycle rules change, the affected docs must change in the same commit"), the Galaxy security-behaviour decision warrants a paragraph in `GatewayTesting.md`. + +**Recommendation:** Add a short subsection to `docs/GatewayTesting.md` (probably under "Focused Commands" or a new "Galaxy Filter Safety" section) that names `GalaxyFilterInputSafetyTests`, explains that Galaxy filtering happens in memory against the cached hierarchy (so the SQL surface is constant), and lists the adversarial-input invariants the suite pins (`%`, `_`, `'`, `;`, `[abc]` are literals; the glob regex has a 100 ms timeout against pathological input). + +**Resolution:** 2026-05-20 — Added a "Galaxy Filter Safety" section to `docs/GatewayTesting.md` (immediately after "Live Galaxy Repository", before "Live LDAP") that names `GalaxyFilterInputSafetyTests`, re-frames the Tests-002 finding (the Galaxy SQL surface is constant — `HierarchySql`, `AttributesSql`, `SELECT 1`, `SELECT time_of_last_deploy FROM galaxy`), explains that all filters are applied in memory by `GalaxyHierarchyProjector` / `GalaxyGlobMatcher`, lists the adversarial-input matrix (`'`, `' OR '1'='1`, `'; DROP TABLE gobject;--`, `%`, `_`, `100%_off`, `[abc]`, `Pump'001`), and enumerates the invariants the suite pins (SQL metacharacters are opaque literals, only `*`/`?` are glob wildcards, the matcher has a 100 ms regex timeout against pathological input, the projector returns zero matches / `NotFound` rather than the whole hierarchy, and the `DiscoverHierarchy` RPC end-to-end returns zero matches for adversarial globs). 
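The in-memory filter model described in Tests-019 can be illustrated with a small sketch. This is not the real `GalaxyGlobMatcher`: the `GlobSketch` type and its `IsMatch` signature are hypothetical, and the only facts taken from the finding are that `*` and `?` are the sole wildcards, that SQL metacharacters are treated as literals, and that matching is bounded by a 100 ms regex timeout. Returning `false` on timeout is an assumption of this sketch.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical stand-in for illustration only; the real matcher may differ.
public static class GlobSketch
{
    // Assumed to mirror the 100 ms bound named in the finding.
    private static readonly TimeSpan MatchTimeout = TimeSpan.FromMilliseconds(100);

    public static bool IsMatch(string candidate, string glob)
    {
        // Translate the glob character by character: only '*' and '?' become
        // regex wildcards; everything else is escaped to an opaque literal.
        var pattern = new StringBuilder("^");
        foreach (char c in glob)
        {
            switch (c)
            {
                case '*': pattern.Append(".*"); break;
                case '?': pattern.Append('.'); break;
                default: pattern.Append(Regex.Escape(c.ToString())); break;
            }
        }
        pattern.Append("\\z");

        try
        {
            return Regex.IsMatch(
                candidate,
                pattern.ToString(),
                RegexOptions.CultureInvariant | RegexOptions.Singleline,
                MatchTimeout);
        }
        catch (RegexMatchTimeoutException)
        {
            // Sketch policy (an assumption): a pathological input is a non-match.
            return false;
        }
    }
}
```

Because the filter never reaches SQL text, an input such as `' OR '1'='1` is just a string that fails to match any tag name, which is the property the adversarial-input matrix is meant to pin.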
+ +### Tests-020 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceConstraintTests.cs:275-347`, `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:803-829` | +| Status | Resolved | + +**Description:** Server-021 added `MxAccessGatewayServiceConstraintTests` to exercise `BulkConstraintPlan.MergeDeniedInto` / `CreateDeniedReply` against a non-allow-all enforcer. `WriteBulkConstraintPlan` has a four-arm `GetPayload`/`SetPayload` switch covering `WriteBulk`, `Write2Bulk`, `WriteSecuredBulk`, and `WriteSecured2Bulk`, but the new fixtures cover only two of those four arms: `Invoke_WriteBulk_WithDeniedHandle_DropsEntryFromWorkerCallAndMergesDenialIntoReply` (the `WriteBulk` arm) and `Invoke_WriteSecuredBulk_WhenAllHandlesDenied_ShortCircuitsWithDeniedOnlyReply` (the `WriteSecuredBulk` arm). The other two arms (`Write2Bulk` and `WriteSecured2Bulk`) have no fixture. The parallel `SubscribeBulkConstraintPlan` has a related gap: its `SetPayload` switch (service code lines 742-753) covers only three kinds (`AddItemBulk`, `AdviseItemBulk`, `SubscribeBulk`), and the constraint test covers all three, but the *unsubscribe-shaped* bulk routes (`RemoveItemBulk`/`UnAdviseItemBulk`/`UnsubscribeBulk`) are also dispatched into denial paths through `FilterHandleBulkAsync` and have no constraint-test coverage either. A regression that wires a new bulk kind to the wrong reply slot, or drops a `case` arm during refactor, would compile clean and pass every existing test. The comment in `Invoke_WriteSecuredBulk_WhenAllHandlesDenied_…` ("The merge logic is shared, so a full denial here is enough to prove the secured-bulk routing") concedes the gap explicitly, but the *routing* (the per-kind `SetPayload` switch) is exactly what is *not* shared and is not exercised for `Write2Bulk` / `WriteSecured2Bulk`. 
+ +**Recommendation:** Add two short fixtures: `Invoke_Write2Bulk_WhenAllHandlesDenied_ShortCircuitsWithDeniedOnlyReply` and `Invoke_WriteSecured2Bulk_WhenAllHandlesDenied_ShortCircuitsWithDeniedOnlyReply`, mirroring the existing `WriteSecuredBulk` denial test but asserting `reply.Write2Bulk` / `reply.WriteSecured2Bulk` is populated (proving the `SetPayload` arm fires). The all-denied path is enough; the merge-with-allowed path is genuinely shared. Optionally also add denied-tag tests for `RemoveItemBulk` / `UnsubscribeBulk` to cover the handle-input variants of the SubscribeBulkConstraintPlan switch. + +**Resolution:** 2026-05-20 — Added `Invoke_Write2Bulk_WhenAllHandlesDenied_ShortCircuitsWithDeniedOnlyReply` and `Invoke_WriteSecured2Bulk_WhenAllHandlesDenied_ShortCircuitsWithDeniedOnlyReply` to `MxAccessGatewayServiceConstraintTests`, plus matching `CreateWrite2BulkRequest`/`CreateWriteSecured2BulkRequest` helpers. Each new fixture asserts the worker is never called (`InvokeCount == 0`), `reply.Kind` matches the requested kind, the matching `reply.{Write2Bulk,WriteSecured2Bulk}.Results` slot is populated with denied entries, and the three sibling reply slots remain empty — pinning that the `SetPayload` switch fired for the correct arm and not for one of the other three `Write*Bulk` kinds. This closes the `Write2Bulk`/`WriteSecured2Bulk` arms of the four-arm `GetPayload`/`SetPayload` switch in `WriteBulkConstraintPlan` (`MxAccessGatewayService.cs:803-829`). 
+ +### Tests-021 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs:159-171`, `src/MxGateway.Tests/Gateway/Workers/FakeWorkerHarnessTests.cs:226-236`, `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:620-630`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:766-…` | +| Status | Resolved | + +**Description:** Tests-006 / Tests-017 / Tests-018 introduced an injectable `ManualTimeProvider` to make heartbeat-timestamp / lease / cache tests deterministic. The class is now duplicated as a `private sealed class ManualTimeProvider(DateTimeOffset start...) : TimeProvider` in four test files (`GalaxyHierarchyCacheTests.cs`, `FakeWorkerHarnessTests.cs`, `WorkerClientTests.cs`, `SessionManagerTests.cs`). Each copy has the same three-line implementation (`_now` field, `GetUtcNow()` override, `Advance(TimeSpan)` method). One copy (`GalaxyHierarchyCacheTests.cs:159`) accepts a `default` `DateTimeOffset` and seeds with `UtcNow`; the other three require an explicit start — a small but real semantic divergence. Tests-007 consolidated the same kind of duplication for `TestServerCallContext` / `RecordingServerStreamWriter` / `AllowAllConstraintEnforcer` into `src/MxGateway.Tests/TestSupport/`; this is the same drift pattern. + +**Recommendation:** Add `src/MxGateway.Tests/TestSupport/ManualTimeProvider.cs` with a single implementation (default-arg `DateTimeOffset start = default` resolving to a deterministic seed like `DateTimeOffset.UnixEpoch` or `UtcNow`, plus the `Advance` helper) and delete the four nested copies in favour of `using MxGateway.Tests.TestSupport;`. Same pattern as the Tests-007 resolution. 
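A shared clock of the shape this recommendation describes might look like the sketch below. It is illustrative only: the actual `TestSupport` type may differ, and seeding a `default` start with `DateTimeOffset.UnixEpoch` is just one of the two options the recommendation names, chosen here so the default stays deterministic. The sketch is not thread-safe, which is an assumption about how single-threaded tests would use it.

```csharp
using System;

// Hypothetical sketch of a consolidated test clock built on System.TimeProvider (.NET 8+).
public sealed class ManualTimeProvider : TimeProvider
{
    private DateTimeOffset _now;

    public ManualTimeProvider(DateTimeOffset start = default)
    {
        // Assumption: a default start resolves to a fixed seed rather than the wall clock.
        _now = start == default(DateTimeOffset) ? DateTimeOffset.UnixEpoch : start;
    }

    // Code under test reads the current time through this override only.
    public override DateTimeOffset GetUtcNow() => _now;

    // Moves the clock forward without sleeping.
    public void Advance(TimeSpan delta) => _now += delta;
}
```

A grace check such as `clock.GetUtcNow() - lastHeartbeat > grace` then becomes deterministic: record `lastHeartbeat` from the clock, call `Advance(TimeSpan.FromSeconds(2))`, and the comparison against an 80 ms grace holds regardless of machine load.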
+ +**Resolution:** 2026-05-20 — Added `src/MxGateway.Tests/TestSupport/ManualTimeProvider.cs` with the unified signature `ManualTimeProvider(DateTimeOffset start = default)` (a `default` start seeds from `DateTimeOffset.UtcNow` for the `GalaxyHierarchyCacheTests` call site that previously relied on that behaviour) plus the `Advance(TimeSpan)` helper. Deleted the four duplicated `private sealed class ManualTimeProvider` definitions from `GalaxyHierarchyCacheTests.cs`, `FakeWorkerHarnessTests.cs`, `WorkerClientTests.cs`, and `SessionManagerTests.cs`; each file now imports `MxGateway.Tests.TestSupport`. The `SessionManagerTests` copy previously lacked `Advance` — folding it onto the shared type does not regress because that file never called `Advance`. Same consolidation pattern as Tests-007. + +### Tests-022 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/MxGateway.Tests/Gateway/Sessions/SessionManagerBulkTests.cs:52-61,90-99,126-135,163-172,202-211,238-247,282-294,339-360,413-434,484-506,553-567,663-688` | +| Status | Resolved | + +**Description:** Tests-013 added eleven `*_PropagatesCancellation` tests that pre-cancel the token (`cts.CancelAsync()` before calling `session.*BulkAsync(..., cts.Token)`) and assert `OperationCanceledException`. The fakes' `FakeBulkWorkerClient.InvokeAsync` calls `cancellationToken.ThrowIfCancellationRequested()` as the *first* statement — so the exception is thrown synchronously inside the fake before any of `GatewaySession.InvokeBulkInternalAsync` → `InvokeAsync` → bulk-result projection runs. This verifies that the token reaches the worker client (a regression that swapped in `CancellationToken.None` between layers would fail the test), but it does not exercise mid-flight cancellation: a token that becomes cancelled while the worker is `await`-suspended waiting on a reply. 
Mid-flight cancellation is the more interesting path (it's what a real client closing its stream looks like) and is not pinned for any of the eleven bulk methods. + +The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise the mid-flight path (the `FakeWorkerClient` returns `Task.FromCanceled` style via real pipe disconnection); only the gateway-side bulk tests are shallow. + +**Recommendation:** For at least one representative bulk method (e.g. `WriteSecuredBulkAsync` — the highest-value gateway path), replace the pre-cancellation pattern with a fake whose `InvokeAsync` returns a `TaskCompletionSource`-backed task that never completes until cancelled, then `cts.CancelAsync()` *after* `session.WriteSecuredBulkAsync(...)` has been awaited far enough to register a continuation. Assert the resulting `OperationCanceledException`'s `CancellationToken` matches `cts.Token`. The existing pre-cancel pattern is a reasonable cheap-coverage default for the other ten methods. + +**Resolution:** 2026-05-20 — Added `WriteSecuredBulkAsync_WhenCancelledMidFlight_ThrowsOperationCanceledForRequestToken` to `SessionManagerBulkTests` backed by a new `MidFlightBulkWorkerClient` fake whose `InvokeAsync` registers a cancellation continuation on the caller's token, signals `InvokeStarted`, and parks on a `TaskCompletionSource` that completes only when the token fires (or shutdown / kill / dispose tears it down). The test awaits `InvokeStarted.Task`, asserts the write task is still incomplete (proving the cancellation lands on an in-flight await rather than the synchronous fast-path), then calls `cts.CancelAsync()` and asserts the resulting `OperationCanceledException.CancellationToken == cts.Token` and `InvokeCount == 1`. The other ten `*_PropagatesCancellation` tests remain on the cheaper pre-cancel pattern per the finding's recommendation. 
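The difference between the pre-cancel pattern and a mid-flight cancellation can be sketched with a fake that parks on a `TaskCompletionSource`. The `ParkedInvokeFake` type below is hypothetical and much smaller than the `MidFlightBulkWorkerClient` the resolution describes; it returns a `string` only to keep the example self-contained.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical fake for illustration; not the fake used in SessionManagerBulkTests.
public sealed class ParkedInvokeFake
{
    private int _invokeCount;

    // Completes once InvokeAsync has registered its cancellation callback.
    public TaskCompletionSource InvokeStarted { get; } =
        new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);

    public int InvokeCount => Volatile.Read(ref _invokeCount);

    public async Task<string> InvokeAsync(CancellationToken cancellationToken)
    {
        Interlocked.Increment(ref _invokeCount);

        // The reply never arrives; only cancellation can complete this task.
        var reply = new TaskCompletionSource<string>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        using CancellationTokenRegistration registration =
            cancellationToken.Register(() => reply.TrySetCanceled(cancellationToken));

        InvokeStarted.TrySetResult();
        return await reply.Task.ConfigureAwait(false);
    }
}
```

A test built on this shape awaits `InvokeStarted.Task`, checks that the pending call is still incomplete, cancels the source, and then asserts that the caught `OperationCanceledException` carries the request token. That ordering is what distinguishes it from the pre-cancel tests, where the exception is thrown before any await is reached.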
+ +### Tests-023 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Tests/Gateway/Sessions/SessionWorkerClientFactoryFakeWorkerTests.cs:334-374` | +| Status | Resolved | + +**Description:** Tests-015 corrected the smoke-test `FakeWorkerProcess.WaitForExitAsync` (in `GatewayEndToEndFakeWorkerSmokeTests.cs`) so it now awaits a `TaskCompletionSource` only completed by `Kill`/`MarkExited`, removing the "set `HasExited = true` and return immediately" cheat. The companion `FakeWorkerProcess` in `SessionWorkerClientFactoryFakeWorkerTests.cs:351-356` was *not* updated and still has the same cheat: `WaitForExitAsync` unconditionally sets `HasExited = true; ExitCode = 0; return ValueTask.CompletedTask;`. The original Tests-006 re-triage noted this companion was "fine there because no exit assertion is made"; the file at `a020350` does not yet assert `HasExited` or `ExitCode`, so this is not a current bug — but it is a latent regression vector: a future test in the same file that asserts `Assert.True(launcher.Process.HasExited)` after triggering shutdown would pass spuriously, exactly the failure mode Tests-015 just closed in the smoke-test copy. Two near-identical fakes in the same project with diverging semantics is brittle. + +**Recommendation:** Apply the same `TaskCompletionSource _exited` pattern to `SessionWorkerClientFactoryFakeWorkerTests.FakeWorkerProcess`: `WaitForExitAsync` awaits `_exited.Task`, `Kill` calls `MarkExited(-1)`, and add a `MarkExited(int)` helper that completes the TCS. The scripted launchers in this file already call `Kill()` through the disposal path Tests-011 added, so the change is mechanical and preserves all current behaviour. + +**Resolution:** 2026-05-20 — Brought the companion `FakeWorkerProcess` in `SessionWorkerClientFactoryFakeWorkerTests.cs` into parity with the Tests-015 smoke-test fake. 
`WaitForExitAsync` now awaits a `TaskCompletionSource _exited` (wrapped in `WaitAsync(cancellationToken)` for cooperative cancel) instead of unconditionally setting `HasExited = true; ExitCode = 0`. `Kill(bool)` increments `KillCount` and delegates to a new `MarkExited(int exitCode)` helper that sets `HasExited`, `ExitCode`, and completes the TCS. `KillCount` is still observable and pre-existing tests that assert `KillCount > 0` continue to pass. The latent regression vector — that a future `Assert.True(launcher.Process.HasExited)` in this file would pass spuriously — is closed. + +### Tests-024 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:713-730,784-801,859-876`, `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceConstraintTests.cs` | +| Status | Resolved | + +**Description:** Every `BulkConstraintPlan.MergeDeniedInto` implementation builds its merged reply by walking `OriginalCount` indices and dequeueing from the worker's `allowedResults` queue at each non-denied slot. `TryDequeue` silently returns `false` when the queue is empty, so if the worker returns *fewer* allowed results than the gateway forwarded (because of a protocol mismatch, a worker bug truncating the bulk reply, or a future change to per-entry result reporting), the merged reply will be shorter than `OriginalCount` — the gap is not filled with a synthetic failure result. Conversely, if the worker returns *more* allowed results than requested, the extras are silently dropped. Neither case is covered by `MxAccessGatewayServiceConstraintTests`: every fixture's `sessionManager.InvokeReply` returns exactly the same count as the number of allowed entries forwarded. A regression in worker bulk-reply construction or a contract drift could produce a silently-truncated public reply (clients observing fewer results than entries submitted, with no error) and no gateway-side test would fail. 
+ +**Recommendation:** Add two fixtures to `MxAccessGatewayServiceConstraintTests`: `Invoke_WriteBulk_WhenWorkerReturnsFewerResultsThanAllowed_ProducesPartialReplyOrSyntheticFailure` (worker reply has N-1 results for N allowed entries; assert either the merged reply has `OriginalCount` entries with a synthetic-failure tail, or — if the gateway's current policy is "truncate" — pin that behaviour explicitly and document the expectation in a comment), and `Invoke_WriteBulk_WhenWorkerReturnsExtraResults_IgnoresExtras` (worker returns N+2 for N allowed; assert merged reply has exactly `OriginalCount`). Whichever current behaviour is correct should be made explicit by the test — the goal is preventing a silent change. + +**Resolution:** 2026-05-20 — Pinned the current `BulkConstraintPlan.MergeDeniedInto` behaviour for worker reply-count divergence. Added two fixtures to `MxAccessGatewayServiceConstraintTests`: `Invoke_WriteBulk_WhenWorkerReturnsFewerResultsThanAllowed_MergedReplyIsTruncated` (gateway forwards 2 allowed handles, worker returns 1 result; merged reply has 2 entries total — the worker result at the first non-denied slot and the denied entry at its original index — and the trailing under-supplied slot is silently dropped via `Queue.TryDequeue` returning `false`) and `Invoke_WriteBulk_WhenWorkerReturnsExtraResults_IgnoresExtras` (gateway forwards 2 allowed handles, worker returns 4; merged reply has exactly `OriginalCount == 3` entries; the two extras are bounded out by the `for index < OriginalCount` loop). The fixtures explicitly pin "truncate / discard extras" as the current contract — a future change to synthesise failure tails or surface extras must update the test, preventing a silent behavioural change. 
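The truncate-and-discard behaviour pinned by these fixtures follows directly from the loop shape the finding describes. The sketch below is a simplified model, not the gateway's `BulkConstraintPlan.MergeDeniedInto`: the `MergeSketch` type, its parameters, and the use of `string` results are all assumptions made to keep it self-contained.

```csharp
using System.Collections.Generic;

// Hypothetical model of the merge loop; result entries are plain strings here.
public static class MergeSketch
{
    public static List<string> MergeDeniedInto(
        int originalCount,
        IReadOnlyDictionary<int, string> deniedByIndex,
        IEnumerable<string> allowedResults)
    {
        var queue = new Queue<string>(allowedResults);
        var merged = new List<string>(originalCount);

        // Walk the original slots; the loop bound is what discards extra worker results.
        for (int index = 0; index < originalCount; index++)
        {
            if (deniedByIndex.TryGetValue(index, out var denied))
            {
                merged.Add(denied);
            }
            else if (queue.TryDequeue(out var result))
            {
                merged.Add(result);
            }
            // Otherwise the worker under-supplied: the slot is skipped, not back-filled.
        }

        return merged;
    }
}
```

With three original entries, one denied at index 1, and a worker that returns a single result, the merged list has two entries; with four worker results it has exactly three. Any future change to synthesise a failure for the skipped slot would alter the first count, which is why the fixtures assert it explicitly.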
+ +### Tests-025 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs:285-289`, `src/ZB.MOM.WW.MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:417-421` | +| Status | Open | + +**Description:** Commit `d692232` widened the `EventStreamService` constructor with an `IDashboardEventBroadcaster` parameter. Two test files now carry an identical `private sealed class NullDashboardEventBroadcaster : IDashboardEventBroadcaster` with a singleton `Instance` field and a no-op `Publish`. This mirrors the duplication pattern Tests-007 and Tests-021 already consolidated for `TestServerCallContext` / `RecordingServerStreamWriter` / `AllowAllConstraintEnforcer` / `ManualTimeProvider` into `src/MxGateway.Tests/TestSupport/`; the same pattern should apply here. + +**Recommendation:** Extract `NullDashboardEventBroadcaster` to `src/ZB.MOM.WW.MxGateway.Tests/TestSupport/NullDashboardEventBroadcaster.cs` (or a single `DashboardTestDoubles.cs`), delete both nested copies, and update the two `using`-bearing files to import from `TestSupport`. + +**Resolution:** _(empty until closed)_ + +### Tests-026 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs`, `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:123-126` | +| Status | Open | + +**Description:** The new `IDashboardEventBroadcaster` is wired into `EventStreamService` at line 123 (commit `d692232`) and the broadcaster's `Publish` is the only path that mirrors per-session events into the dashboard `EventsHub`. The unit tests inject `NullDashboardEventBroadcaster.Instance`, so the broadcaster invocation is never observed — a regression that silently dropped the `Publish` call (e.g. 
an `if` accidentally added around it, or removing the broadcaster ctor parameter) would not be caught by any test in this module. The hub-registration tests (`DashboardHubsRegistrationTests`) verify the endpoints exist but not the producer-side hook. + +**Recommendation:** Add a fixture to `EventStreamServiceTests` named e.g. `StreamEventsAsync_PublishesEachEventToDashboardBroadcaster`: inject a recording fake that captures `(sessionId, mxEvent)` calls, push two events through the fake session, and assert the broadcaster received both with the correct session id and matching `WorkerSequence`. This pins the broadcast hook and proves the dashboard event mirror is not a no-op. + +**Resolution:** _(empty until closed)_ diff --git a/code-reviews/Worker.Tests/findings.md b/code-reviews/Worker.Tests/findings.md new file mode 100644 index 0000000..9745799 --- /dev/null +++ b/code-reviews/Worker.Tests/findings.md @@ -0,0 +1,531 @@ +# Code Review — Worker.Tests + +| Field | Value | +|---|---| +| Module | `src/ZB.MOM.WW.MxGateway.Worker.Tests` | +| Reviewer | Claude Code | +| Review date | 2026-05-24 | +| Commit reviewed | `d692232` | +| Status | Reviewed | +| Open findings | 0 | + +## Checklist coverage + +### 2026-05-18 review (commit `6c64030`) + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issues found: Worker.Tests-010 (weak substring assertion), Worker.Tests-011 (test name overstates what it proves). | +| 2 | mxaccessgw conventions | Tests respect STA-affinity and the WorkerEnvelope frame protocol; naming-convention drift only (Worker.Tests-009). | +| 3 | Concurrency & thread safety | Issues found: Worker.Tests-003/004/013 (wall-clock and fixed-delay timing assertions). | +| 4 | Error handling & resilience | COMException/HResult, pipe-never-appears, malformed frames, shutdown-during-command, watchdog all covered; queue branch gap (Worker.Tests-015). | +| 5 | Security | No real secrets; redaction explicitly tested. No issues found. 
| +| 6 | Performance & resource management | Issues found: Worker.Tests-005 (`MemoryStream` not disposed), Worker.Tests-006 (`MxAccessStaSession` leak on assertion failure). | +| 7 | Design-document adherence | Tests match `docs/Worker*.md`; `docs/WorkerFrameProtocol.md` is stale (Worker.Tests-007). | +| 8 | Code organization & conventions | Issues found: Worker.Tests-009 (two naming conventions), Worker.Tests-014 (duplicated test doubles). | +| 9 | Testing coverage | Issues found: Worker.Tests-001 (`StaMessagePump` untested), Worker.Tests-002 (COM-event delivery untested), Worker.Tests-012 (frame-validation gaps). | +| 10 | Documentation & comments | Issues found: Worker.Tests-008 (misplaced redaction test), Worker.Tests-011 (misleading test name). | + +### 2026-05-20 re-review (commit `1cd51bb`) + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issues found: Worker.Tests-018 (silent-skip masquerades as passing tests), Worker.Tests-024 (`Subscribe_WhenUnderlyingSubscribeThrows_DisposesConsumer` swallows the real exception type). | +| 2 | mxaccessgw conventions | Issues found: Worker.Tests-019 (`AlarmsLiveSmokeTests` uses `snake_case` outside the alarm-method scope Worker.Tests-009 corrected); pre-existing `LiveMxAccessFactAttribute` is not consumed by `MxAccessLiveComCreationTests` (Worker.Tests-018). | +| 3 | Concurrency & thread safety | Issues found: Worker.Tests-020 (`MxAccessValueCacheTests.TryWaitForUpdate_ReturnsFalseAfterDeadline_WhenNoSetOccurs` asserts wall-clock floor and pump-call lower bound). | +| 4 | Error handling & resilience | Issues found: Worker.Tests-021 (`WorkerFrameProtocolErrorCode.EndOfStream` and the writer-side `MessageTooLarge`/`InvalidEnvelope` branches are uncovered). | +| 5 | Security | Redaction coverage is sound; no new issues. 
| +| 6 | Performance & resource management | No new issues — `MemoryStream`/session-disposal hygiene fixes from the prior pass hold; `WorkerFrameReader` `ArrayPool` rent/return path is now regression-tested. | +| 7 | Design-document adherence | No new issues. | +| 8 | Code organization & conventions | Issues found: Worker.Tests-016 (the now-shared `MxAccessSession` reflection construction in `AlarmCommandExecutorTests` duplicates the testable surface the consolidated TestSupport folder was meant to host). | +| 9 | Testing coverage | Issues found: Worker.Tests-017 (`WorkerCancel` envelope-dispatch path untested), Worker.Tests-022 (`WnWrapAlarmConsumer.PollOnce` transition-delta computation untested at the snapshot-to-transitions level). | +| 10 | Documentation & comments | Issues found: Worker.Tests-023 (`AlarmClientWmProbeTests` and `WnWrapConsumerProbeTests` are unit-test classes carrying 1000+ lines of probe-only code; their `[Fact(Skip=...)]` status is documented but the probe scaffolding is mixed into the same test assembly as regression tests). | + +### 2026-05-20 re-review (commit `a020350`) + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No new issues — Worker.Tests-018/024 fixes hold; the new `WriteAsync_WithEmptyEnvelope_ThrowsInvalidEnvelopeFromValidator` correctly documents that the writer-side defensive zero-length branch is intercepted by `WorkerEnvelopeValidator.Validate`. | +| 2 | mxaccessgw conventions | Issues found: Worker.Tests-025 (`LiveMxAccessFactAttribute` duplicated in Worker.Tests and IntegrationTests with no shared constant — divergent-by-drift risk). | +| 3 | Concurrency & thread safety | Issues found: Worker.Tests-027 (`FakeRuntimeSession.CancelCommandReturnValue` mutated without the same `gate` lock that protects `cancelledCorrelationIds`/`snapshot`/`events`). | +| 4 | Error handling & resilience | No new issues — Worker.Tests-021 closed all three uncovered protocol branches. 
| +| 5 | Security | No new issues. | +| 6 | Performance & resource management | No new issues. | +| 7 | Design-document adherence | Issues found: Worker.Tests-028 (the Worker.Tests-023 resolution promised a `docs/GatewayTesting.md` paragraph describing the probe surface; the doc was never updated, so the partition is invisible outside the source tree). | +| 8 | Code organization & conventions | Issues found: Worker.Tests-026 (`MxAccessSession.CreateForTesting` has no runtime guard preventing accidental production use; only the `internal` modifier plus `InternalsVisibleTo` separates it from the live `Create` path); Worker.Tests-029 (probes moved to a `Probes/` folder but kept the unit-test `MxGateway.Worker.Tests` namespace, so a namespace-based filter cannot distinguish probes from regression tests). | +| 9 | Testing coverage | No new issues: the five `LiveMxAccessFact`-gated tests in `MxAccessLiveComCreationTests` and the `ComputeTransitions` unit tests close the previously identified gaps. | +| 10 | Documentation & comments | Issues found: Worker.Tests-030 (`CreateCancelEnvelope` uses `Sequence = 4` while the immediately-following `CreateShutdownEnvelope` uses `Sequence = 3`; the cancel test writes them in 4-then-3 order, which works because the worker has no inbound sequence-monotonicity check, but the numbering is misleading to a future reader and contradicts the gateway-side monotonic-sequence convention `gateway.md` documents for outbound). | + +### 2026-05-24 re-review (commit `d692232`) + +Re-review pass at `d692232`. Diff against `a020350` is the rename-only +namespace/csproj update (commit `dc9c0c9`). The `InternalsVisibleTo` on the +Worker project points at the new `ZB.MOM.WW.MxGateway.Worker.Tests` assembly +name; the live-test gating attribute still reads the unchanged +`MXGATEWAY_RUN_LIVE_MXACCESS_TESTS` env var. No behavioural changes; the prior +findings (Worker.Tests-001 through -030) are unaffected. 
+ +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. | +| 2 | mxaccessgw conventions | No issues found — namespaces updated; env-var names unchanged. | +| 3 | Concurrency & thread safety | No issues found in this diff. | +| 4 | Error handling & resilience | No issues found in this diff. | +| 5 | Security | No issues found in this diff. | +| 6 | Performance & resource management | No issues found in this diff. | +| 7 | Design-document adherence | No issues found in this diff. | +| 8 | Code organization & conventions | No issues found in this diff. | +| 9 | Testing coverage | No issues found in this diff. | +| 10 | Documentation & comments | No issues found in this diff. | + +## Findings + +### Worker.Tests-001 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Testing coverage | +| Location | `src/MxGateway.Worker.Tests/Sta/` (no `StaMessagePumpTests.cs`) | +| Status | Resolved | + +**Description:** `StaMessagePump` — whose entire reason for existing is pumping Windows messages so MXAccess COM event sink calls deliver onto the STA — has no direct unit test. `WaitForWorkOrMessages` (timeout conversion, the `MsgWaitForMultipleObjectsEx` failure path) and `PumpPendingMessages` (drain count) are exercised only indirectly via `StaRuntime`, which never asserts the pump returns/throws correctly. The `MsgWaitFailed` error branch and `ToTimeoutMilliseconds` edge cases (`InfiniteTimeSpan`, `<= Zero`, `>= uint.MaxValue`) are completely uncovered. + +**Recommendation:** Add `StaMessagePumpTests` that post a Windows message to the STA thread and assert `PumpPendingMessages` returns the expected count; cover `WaitForWorkOrMessages` waking on a signaled event vs timeout; cover `ToTimeoutMilliseconds` boundaries through an internals-visible seam. + +**Resolution:** 2026-05-18 — Added `src/MxGateway.Worker.Tests/Sta/StaMessagePumpTests.cs` (8 `[Fact]` tests, run on dedicated STA threads). 
Covers `WaitForWorkOrMessages` null-argument validation, returning immediately when the wake event is pre-signalled, waking when the event is signalled mid-wait, returning on timeout when never signalled, the `TimeSpan.Zero` (`<= Zero`) conversion branch, and waking on a `WM_NULL` Windows message posted to the STA thread (the `QS_ALLINPUT` path). `PumpPendingMessages` is covered for both an empty queue (returns 0) and three posted messages (returns 3). Boundary noted in the file: the `MsgWaitFailed` branch is not exercised because forcing `MsgWaitForMultipleObjectsEx` to fail needs a deliberately invalid native handle, which is unsafe to construct in-process; `ToTimeoutMilliseconds` is `private static` and is covered indirectly through wait-latency assertions rather than reflection. + +### Worker.Tests-002 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Testing coverage | +| Location | `src/MxGateway.Worker.Tests/MxAccess/MxAccessStaSessionTests.cs`, `src/MxGateway.Worker.Tests/MxAccess/MxAccessEventMapperTests.cs` | +| Status | Resolved | + +**Description:** No test verifies that a COM event raised on the STA thread is converted to protobuf and lands in the `MxAccessEventQueue`. `MxAccessEventMapperTests` exercises the mapper directly with hand-built fakes, and `AlarmDispatcherTests` covers the alarm sink, but the non-alarm COM-event path (`MxAccessBaseEventSink`/`MxAccessComServer` event handlers → `MxAccessEventMapper` → queue, triggered by an actual sink callback) is never end-to-end tested. Given the worker's core purpose is to convert COM events to protobuf, this is a significant gap. + +**Recommendation:** Add a test that invokes the base event sink's data-change handler (via an internal seam or a fake COM event source) and asserts a converted `WorkerEvent` with correct family/sequence appears in the queue. + +**Resolution:** 2026-05-18 — Added `src/MxGateway.Worker.Tests/MxAccess/MxAccessBaseEventSinkTests.cs` (5 `[Fact]` tests). 
The four `MxAccessBaseEventSink` COM event handlers (`OnDataChange`, `OnWriteComplete`, `OperationComplete`, `OnBufferedDataChange`) — the exact delegate targets the MXAccess COM runtime invokes — were widened from `private` to `internal` (with XML-doc notes that this is a unit-test seam), and `[assembly: InternalsVisibleTo("MxGateway.Worker.Tests")]` was added to `MxGateway.Worker.csproj`. The tests construct a real `MxAccessBaseEventSink` over a real `MxAccessEventMapper` and `MxAccessEventQueue`, invoke each handler with COM-style arguments, and assert a correctly-converted protobuf `WorkerEvent` (family, body case, server/item handle, value, quality, source timestamp, monotonic `WorkerSequence`) lands in the queue. Boundary noted in the file: the COM `+=` wire-up in `Attach`/`Detach` casts to the sealed `LMXProxyServerClass` RCW and cannot run without a live MXAccess COM object, so it is not exercised; invoking the handlers directly reproduces an STA-thread COM callback and exercises the genuine conversion + enqueue path. + +### Worker.Tests-003 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Worker.Tests/Sta/StaRuntimeTests.cs:46-48` | +| Status | Resolved | + +**Description:** `InvokeAsync_WakesIdlePumpForQueuedCommand` asserts `stopwatch.Elapsed < TimeSpan.FromSeconds(2)` — a wall-clock assertion that on a loaded CI agent can exceed 2s, producing a false failure. The test also does not actually prove the wake event (vs the 50 ms idle pump) caused the dispatch. + +**Recommendation:** Remove the wall-clock assertion (the awaited result already proves the command ran), or raise the budget substantially with a comment that it is a coarse smoke check. + +**Resolution:** 2026-05-18 — Removed the `Stopwatch` and the `stopwatch.Elapsed < TimeSpan.FromSeconds(2)` wall-clock assertion from `InvokeAsync_WakesIdlePumpForQueuedCommand`. 
The test already constructs the `StaRuntime` with a 30-second idle pump period, so the awaited `InvokeAsync` completing at all proves the command wake event — not the idle pump tick — drove the dispatch; no timing budget is needed. The XML-doc comment now states this explicitly. The now-unused `using System.Diagnostics;` was removed (`TreatWarningsAsErrors`). + +### Worker.Tests-004 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Worker.Tests/MxAccess/MxAccessStaSessionTests.cs:281-329` | +| Status | Resolved | + +**Description:** `StartAsync_WithAlarmCommandHandlerFactory_PollOnceCalledViaSta` and `Dispose_StopsAlarmPollLoop` use poll-until loops, and `Dispose_StopsAlarmPollLoop` additionally does `await Task.Delay(1000)` then asserts `PollCount` is unchanged. The 1s "no further polls" window is a timing race: a poll scheduled just before disposal could increment the counter afterward, and a slow agent could simply not run a poll in the window even without correct stop logic. + +**Recommendation:** Make the poll loop deterministically observable — expose a "poll loop stopped" signal or have `Dispose` join the poll task — then assert on that rather than on elapsed-time silence. + +**Resolution:** 2026-05-18 — `MxAccessStaSession.Dispose` now joins the alarm poll task (`pollTaskToJoin.Wait(TimeSpan.FromSeconds(5))`) after cancelling the poll CTS, instead of setting `alarmPollTask = null` and discarding it. Once `Dispose` returns, the poll loop has provably exited and no `PollOnce` call can still be in flight. `Dispose_StopsAlarmPollLoop` was rewritten to drop the `await Task.Delay(1000)` "no further polls" window: it now captures `PollCount` immediately after `Dispose()` returns and re-asserts equality after a bare `await Task.Yield()` — a deterministic frozen-count check rather than an elapsed-time race. 
The success-direction poll-until loop in `PollOnceCalledViaSta` was left as-is: waiting for an event to *occur* is sound; only waiting for an event to *not* occur is the race, and that pattern is now eliminated. Note: `ShutdownGracefullyAsync` already joined the poll task, so this change makes `Dispose` consistent with the graceful path. + +### Worker.Tests-005 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Performance & resource management | +| Location | `src/MxGateway.Worker.Tests/Ipc/WorkerFrameProtocolTests.cs:20-31,103-105`, `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs:28-31` | +| Status | Resolved | + +**Description:** `MemoryStream` instances are created and never disposed across the frame-protocol and pipe-session tests (`MemoryStream stream = new();` with no `using`). Disposal is cheap so impact is low, but it is inconsistent with the rest of the suite (which carefully `using`s `CancellationTokenSource`, `StaRuntime`, `PipePair`). `WorkerFrameWriter`/`WorkerFrameReader` are also constructed without disposal. + +**Recommendation:** Wrap `MemoryStream` (and reader/writer if they are `IDisposable`) in `using` declarations for consistency. + +**Resolution:** 2026-05-18 — All six `MemoryStream` test-body declarations in `WorkerFrameProtocolTests.cs` and the five `inbound`/`outbound` `MemoryStream` declarations in the `WorkerPipeSessionTests.cs` handshake tests were converted to `using` declarations, matching how the rest of the suite handles `CancellationTokenSource`/`StaRuntime`/`PipePair`. Re-triage of the parenthetical: `WorkerFrameWriter` and `WorkerFrameReader` are **not** `IDisposable` (`sealed class` with no `IDisposable` and no `Dispose` member — verified in `src/MxGateway.Worker/Ipc/`), so the finding's "reader/writer if they are `IDisposable`" suggestion does not apply and no change was made there. 
The shared `MemoryStream` instances inside the `WorkerPipeSessionTests` harness/helper classes (`ReadWrittenFrames` parameter, the `PipePair`/harness fields) are out of the cited line scope and were left untouched. + +### Worker.Tests-006 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Performance & resource management | +| Location | `src/MxGateway.Worker.Tests/MxAccess/MxAccessStaSessionTests.cs:282,305,315,323` | +| Status | Resolved | + +**Description:** `Dispose_StopsAlarmPollLoop` constructs `MxAccessStaSession session` without `using` (unlike every sibling test) and relies on an explicit `session.Dispose()`. If an assertion between `StartAsync` and `Dispose()` throws, the session — its STA thread and poll loop — leaks for the rest of the run. The `StaRuntime` is `using`d so the thread is eventually reclaimed, but the alarm poll loop and handler are not. + +**Recommendation:** Use `using MxAccessStaSession session = ...` and drop the manual `Dispose()`, or wrap the body in try/finally. + +**Resolution:** 2026-05-18 — `Dispose_StopsAlarmPollLoop` now declares its `MxAccessStaSession` with a `using` declaration. The manual `session.Dispose()` is kept because the test's purpose is to observe poll behaviour across disposal — but `MxAccessStaSession.Dispose` is idempotent (guarded by the `disposed` field), so the explicit mid-test call and the `using`-scope call do not conflict. An assertion thrown anywhere in the body now still tears the session (STA poll loop + alarm handler) down. The cited line numbers in the finding were imprecise — they straddle `PollOnceCalledViaSta` and `Dispose_StopsAlarmPollLoop` — but the described root cause (one `MxAccessStaSession` constructed without `using`) was singular and is the one in `Dispose_StopsAlarmPollLoop`; the sibling tests `PollOnceCalledViaSta` and `RunAlarmPollLoop_WhenPollOnceThrows_RecordsFaultOnEventQueue` already used `using` and needed no change. 
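
The Worker.Tests-006 resolution relies on a `using` declaration and an explicit mid-test `Dispose()` coexisting safely. A minimal sketch of that interaction follows, assuming a hypothetical `IdempotentSession` stand-in rather than the real `MxAccessStaSession` (whose internals are not shown here); only the `disposed` guard idea comes from the resolution text.

```csharp
using System;

// Hypothetical stand-in: teardown work runs once, later Dispose calls are no-ops.
public sealed class IdempotentSession : IDisposable
{
    private bool disposed;

    // Counts how many times real teardown work ran (illustration only).
    public int TeardownCount { get; private set; }

    public void Dispose()
    {
        if (disposed)
        {
            return; // second and later calls do nothing
        }

        disposed = true;
        TeardownCount++; // stands in for stopping a poll loop, releasing handlers, etc.
    }
}

public static class Program
{
    public static void Main()
    {
        IdempotentSession observed;
        {
            using IdempotentSession session = new IdempotentSession();
            observed = session;

            session.Dispose(); // explicit mid-test call, as the test keeps doing
        } // the using declaration calls Dispose again here; the guard makes it a no-op

        Console.WriteLine(observed.TeardownCount); // prints 1
    }
}
```

The sketch is single-threaded; a guard that must tolerate concurrent `Dispose` calls would need an interlocked check, which is outside what the finding describes.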
+ +### Worker.Tests-007 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Design-document adherence | +| Location | `docs/WorkerFrameProtocol.md:38-49` | +| Status | Resolved | + +**Description:** `docs/WorkerFrameProtocol.md` instructs running `dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter WorkerFrameProtocolTests` and states the frame protocol "is part of `MxGateway.Server`". The frame protocol actually lives in `MxGateway.Worker.Ipc` and is tested by `src/MxGateway.Worker.Tests/Ipc/WorkerFrameProtocolTests.cs`. The doc's verification command points at the wrong project and build, so anyone following it after changing the worker frame protocol will not run the relevant tests. + +**Recommendation:** Update `docs/WorkerFrameProtocol.md` to reference `src/MxGateway.Worker.Tests` and the x86 worker build (`-p:Platform=x86`). + +**Resolution:** 2026-05-18 — Rewrote the `## Verification` section of `docs/WorkerFrameProtocol.md`. The test command now targets `src/MxGateway.Worker.Tests/MxGateway.Worker.Tests.csproj -p:Platform=x86 --filter WorkerFrameProtocolTests`; the build command now targets `src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86`. The prose now states the frame protocol lives in `MxGateway.Worker.Ipc` (naming `WorkerFrameReader`/`WorkerFrameWriter`/`WorkerFrameProtocolOptions` and the `WorkerFrameProtocolTests.cs` test file) and notes the worker is an x86 process. Verified against the source: the frame-protocol types are confirmed under `src/MxGateway.Worker/Ipc/` and the tests under `src/MxGateway.Worker.Tests/Ipc/`, so the original doc was wrong on both project and component. Fenced code blocks were also relabelled `powershell` (the build/test commands are run from PowerShell on this Windows dev box). 
+ +### Worker.Tests-008 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Worker.Tests/Conversion/VariantConverterTests.cs:175-182` | +| Status | Resolved | + +**Description:** `Redactor_WithCredentialBearingValueFields_RedactsBeforeLogging` lives in `VariantConverterTests` but asserts on `WorkerLogRedactor.RedactValue`, which has nothing to do with `VariantConverter`. It is also a near-duplicate of coverage in `WorkerLogRedactorTests`. Placing redaction coverage inside the variant-converter class is misleading. + +**Recommendation:** Move this test into `Bootstrap/WorkerLogRedactorTests.cs` (which already exists and tests `RedactFields`). + +**Resolution:** 2026-05-18 — The misplaced redaction test was removed from `VariantConverterTests.cs` and re-added to `Bootstrap/WorkerLogRedactorTests.cs` as `RedactValue_WithCredentialBearingFieldNames_ReturnsRedactedValue` — alongside the existing `RedactFields` coverage, where redaction tests belong. Confirmed root cause: the old test asserted only on `WorkerLogRedactor.RedactValue` and never touched `VariantConverter`. The now-orphaned `using MxGateway.Worker.Bootstrap;` was removed from `VariantConverterTests.cs` (`TreatWarningsAsErrors`). The new home is `RedactValue` per-field coverage; `WorkerLogRedactorTests.RedactFields_...` already covers the dictionary path, so the two are complementary rather than duplicates. + +### Worker.Tests-009 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Worker.Tests/MxAccess/AlarmCommandHandlerTests.cs`, `AlarmDispatcherTests.cs`, `AlarmCommandExecutorTests.cs`, `AlarmRecordTransitionMapperTests.cs`, `WnWrapAlarmConsumerXmlTests.cs` | +| Status | Resolved | + +**Description:** The alarm-related test files use `snake_case` method names while the rest of the project uses the `Method_State_Result` PascalCase convention. 
`docs/style-guides/CSharpStyleGuide.md` and the surrounding code establish PascalCase as the project convention; the alarm files diverge. + +**Recommendation:** Rename alarm-test methods to the `Method_Scenario_Expectation` PascalCase form for one consistent convention. + +**Resolution:** 2026-05-18 — Renamed every `[Fact]`/`[Theory]` method in the five alarm test files from `snake_case` to the project's `Method_Scenario_Expectation` PascalCase form (46 test methods total: 10 in `AlarmCommandHandlerTests`, 8 in `AlarmDispatcherTests`, 12 in `AlarmCommandExecutorTests`, 8 in `AlarmRecordTransitionMapperTests`, 9 in `WnWrapAlarmConsumerXmlTests` minus the existing PascalCase probe methods). Only test methods were renamed — `snake_case` is not present; the method names that *look* like helpers (`Subscribe`, `PollOnce`, `Dispose` on the fake doubles) are interface implementations of `IAlarmCommandHandler`/`IAlarmTransitionConsumer`/`IDisposable` and were correctly left unchanged. The suite stays green; xUnit discovers tests by attribute, not name, so the renames are behaviour-neutral. + +### Worker.Tests-010 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker.Tests/MxAccess/MxAccessStaSessionTests.cs:230-258` | +| Status | Resolved | + +**Description:** `StartAsync_WithoutAlarmCommandHandlerFactory_SubscribeAlarmsReturnsInvalidRequest` asserts `Assert.Contains("alarm", reply.DiagnosticMessage, StringComparison.OrdinalIgnoreCase)`. The XML doc claims it verifies the diagnostic says "alarm consumer not configured", but the assertion only checks the substring "alarm" — which would also match an unrelated message like "invalid alarm GUID". The assertion is weaker than the documented intent. + +**Recommendation:** Assert the full diagnostic phrase so the test fails if the diagnostic regresses to a misleading message. 
+ +**Resolution:** 2026-05-18 — The weak `Assert.Contains("alarm", ...)` was replaced with an exact `Assert.Equal` against the diagnostic the executor actually emits. Re-triage: the test's XML doc claimed the phrase was "alarm consumer not configured", but `MxAccessCommandExecutor.ExecuteSubscribeAlarms` (verified in `src/MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:310-315`) produces "SubscribeAlarms requires an alarm command handler; the worker was constructed without one." — the doc was wrong, so both the assertion and the XML doc were corrected to the real phrase. The test now fails if the diagnostic regresses to any other message. + +### Worker.Tests-011 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Worker.Tests/Sta/StaCommandDispatcherTests.cs:92-112` | +| Status | Resolved | + +**Description:** `DispatchAsync_WhenCanceledAfterExecutionStarts_StillReturnsLateReply` is named and documented as if it proves cancellation arrived after execution began. The test does `Started.Wait(...)` then `cancellation.Cancel()`, which proves execution started, but because the executor is already running on the STA the cancellation is inherently a no-op — the test cannot distinguish "cancel was observed and ignored" from "cancel was never checked". The name overstates what is proven. + +**Recommendation:** Either tighten the test (assert the dispatcher's cancel path was reached and declined) or rename/comment it to "cancellation cannot abort an in-flight STA command", matching `gateway.md`'s stated behavior. + +**Resolution:** 2026-05-18 — Took the rename/re-document option. 
The test is renamed `DispatchAsync_WhenCanceledWhileExecuting_DoesNotAbortInFlightCommand` and its XML doc rewritten to state exactly what it proves — an in-flight STA command is *not* aborted by cancellation — and to state explicitly that the test cannot and does not distinguish "cancel observed and ignored" from "cancel never checked". The doc now cites `gateway.md`'s wording ("cannot safely abort an in-flight COM call on the STA"). The test body is unchanged: it already asserts the command runs to completion and returns its normal `Ok` reply, which is the genuine behaviour. No runtime behaviour changed. + +### Worker.Tests-012 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/MxGateway.Worker.Tests/Ipc/WorkerFrameProtocolTests.cs` | +| Status | Resolved | + +**Description:** `docs/WorkerFrameProtocol.md` states the reader "rejects zero-length payloads and payloads larger than the configured maximum (default 16 MiB) before allocating the payload buffer." `WorkerFrameProtocolTests` covers malformed-length, wrong protocol version, wrong session, and malformed payload, but has no test for the zero-length-payload rejection or the oversized-frame rejection — both explicit security-relevant input-validation paths. + +**Recommendation:** Add tests feeding a frame with `payload_length == 0` and one with `payload_length` above the configured maximum, asserting the corresponding `WorkerFrameProtocolErrorCode`. + +**Resolution:** 2026-05-18 — Re-triage of the zero-length half: the finding's "no test for the zero-length-payload rejection" is partly inaccurate. The pre-existing `ReadAsync_WithMalformedLength_ThrowsMalformedLength` fed a four-zero-byte stream — which is exactly a frame declaring `payload_length == 0` — so the zero-length path *was* already covered, just under a misleading name (the length prefix itself is well-formed; only the declared length is zero). 
That test was renamed `ReadAsync_WithZeroLengthPayload_ThrowsMalformedLength` with an XML doc explaining the four-zero-byte construction, rather than adding a duplicate. The oversized half was a genuine gap: a new `ReadAsync_WithPayloadAboveConfiguredMaximum_ThrowsMessageTooLarge` constructs `WorkerFrameProtocolOptions` with a 64-byte maximum, feeds a length prefix of 65, and asserts `WorkerFrameProtocolErrorCode.MessageTooLarge` — verified against `WorkerFrameReader.ReadAsync`, both checks fire before the payload buffer is rented. The small configured maximum keeps the test from allocating a multi-megabyte buffer. + +### Worker.Tests-013 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs:539-546` | +| Status | Resolved | + +**Description:** `ThrowIfCompletedAsync` does an unconditional `await Task.Delay(TimeSpan.FromMilliseconds(100))` then checks `task.IsCompleted`. This adds a fixed 100 ms to the test and only catches a `RunAsync` that fails within that arbitrary window; a session that faults after 100 ms slips past undetected. + +**Recommendation:** Replace with a deterministic race: `await Task.WhenAny(runTask, )` and assert the run task did not win. + +**Resolution:** 2026-05-18 — `ThrowIfCompletedAsync` was deleted (it had a single call site, in `RunAsync_SendsHeartbeatPayloadFromRuntimeSnapshot`). That test now races `runTask` against the first-heartbeat `ReadUntilAsync` with `Task.WhenAny`; if `runTask` wins it is awaited to surface the underlying fault and the test fails via `Assert.Fail`. The fixed 100 ms delay is gone — the check is now deterministic: a `RunAsync` faulting at *any* time before the first heartbeat is caught, and a healthy run completes as soon as the heartbeat arrives instead of always paying 100 ms. 
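
The Worker.Tests-013 resolution replaces a fixed delay with a `Task.WhenAny` race. The shape of that check can be sketched as below; the helper name `AwaitSignalOrSurfaceRunFaultAsync` and the `TaskCompletionSource` stand-ins are assumptions for illustration, and the sketch throws `InvalidOperationException` where the real test is described as calling `Assert.Fail`, so that it has no xUnit dependency.

```csharp
using System;
using System.Threading.Tasks;

public static class RunTaskRace
{
    // Completes normally only if expectedSignal finishes before runTask.
    // If runTask finishes first, its fault (if any) is rethrown; otherwise the early exit is reported.
    public static async Task AwaitSignalOrSurfaceRunFaultAsync(Task runTask, Task expectedSignal)
    {
        Task winner = await Task.WhenAny(runTask, expectedSignal);
        if (ReferenceEquals(winner, runTask))
        {
            await runTask; // rethrows the original exception when the run task faulted
            throw new InvalidOperationException("Run task completed before the expected signal.");
        }

        await expectedSignal; // observe the signal's own result or exception
    }

    public static async Task Main()
    {
        // Stand-ins: a run task that never completes and a signal that is already complete.
        var run = new TaskCompletionSource<bool>();
        var heartbeat = new TaskCompletionSource<bool>();
        heartbeat.SetResult(true);

        await AwaitSignalOrSurfaceRunFaultAsync(run.Task, heartbeat.Task);
        Console.WriteLine("signal won");
    }
}
```

Because the helper waits on whichever task finishes first, a healthy run pays no fixed delay and a run that faults at any point before the signal is surfaced with its original exception.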
+ +### Worker.Tests-014 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeClientTests.cs:194`, `WorkerPipeSessionTests.cs:622`, `Sta/StaCommandDispatcherTests.cs:348`, `MxAccess/MxAccessStaSessionTests.cs:334`, `MxAccess/MxAccessCommandExecutorTests.cs:1124` | +| Status | Resolved | + +**Description:** `FakeRuntimeSession`, `NoopComApartmentInitializer`, `NoopEventSink`/`NullEventSink`, and the `CreateFrame`/`WriteUInt32LittleEndian` helpers are re-implemented independently in multiple test files. The two `FakeRuntimeSession` implementations have already diverged (one supports `BlockDispatch`/event enqueue, one does not), and `NoopComApartmentInitializer` is defined four times. + +**Recommendation:** Extract shared test doubles (`NoopComApartmentInitializer`, frame helpers, a single configurable `FakeRuntimeSession`) into a `TestSupport` folder/namespace consumed by all test classes. + +**Resolution:** 2026-05-18 — Added a `src/MxGateway.Worker.Tests/TestSupport/` folder (namespace `MxGateway.Worker.Tests.TestSupport`) with four shared doubles: `NoopComApartmentInitializer`, `NoopEventSink`, `WorkerFrameTestHelpers` (`CreateFrame`/`WriteUInt32LittleEndian`), and a single configurable `FakeRuntimeSession`. The consolidated `FakeRuntimeSession` is the richer of the two divergent copies (it supports `BlockDispatch`, event enqueue, shutdown-timeout, and throw-after-release); the minimal `WorkerPipeClientTests` caller simply leaves the options unset. The per-file copies were deleted from `WorkerPipeClientTests`, `WorkerPipeSessionTests`, `StaCommandDispatcherTests`, `MxAccessStaSessionTests`, `MxAccessCommandExecutorTests`, and `WorkerFrameProtocolTests`, and the orphaned `NullEventSink` in `AlarmCommandExecutorTests` was replaced with the shared `NoopEventSink`. 
Re-triage: the finding says `NoopComApartmentInitializer` "is defined four times" — it was defined **three** times (`StaCommandDispatcherTests`, `MxAccessStaSessionTests`, `MxAccessCommandExecutorTests`); the fourth alarm-area `IStaComApartmentInitializer` implementation is `StaRuntimeTests.RecordingComApartmentInitializer`, which is a *recording* double (asserts init/uninit ordering), not a no-op, so it was deliberately left in place rather than folded into the shared no-op. Unused `using` directives left behind by the removals were stripped (`TreatWarningsAsErrors`). + +### Worker.Tests-015 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/MxGateway.Worker.Tests/MxAccess/MxAccessEventQueueTests.cs` | +| Status | Resolved | + +**Description:** `MxAccessEventQueueTests` covers monotonic sequencing, drain, capacity overflow, and first-fault-wins, but does not cover `Drain` with `maxEvents: 0` (drain-all) — a branch `FakeRuntimeSession.DrainEvents` even special-cases — nor draining an empty queue, nor enqueue after a manual `RecordFault`. These are minor branches but the overflow/fault interaction is the worker's backpressure contract. + +**Recommendation:** Add a `Drain(0)` drain-all test and an empty-queue drain test. + +**Resolution:** 2026-05-18 — Added three tests to `MxAccessEventQueueTests`. `Drain_WithZeroMaxEvents_DrainsAllEvents` covers the `maxEvents == 0` drain-all branch in `MxAccessEventQueue.Drain` (verified at `src/MxGateway.Worker/MxAccess/MxAccessEventQueue.cs:174`) — three events enqueued, `Drain(0)` returns all three in order and empties the queue. `Drain_WhenQueueIsEmpty_ReturnsEmptyList` covers the `drainCount == 0` early-return branch for both `Drain(0)` and `Drain(5)` on an empty queue. 
`Enqueue_AfterRecordFault_ThrowsInvalidOperationException` covers the backpressure contract gap the finding flagged — after a manual `RecordFault`, `Enqueue` throws `InvalidOperationException` ("outbound event queue is faulted") and the event is not queued. + +### Worker.Tests-016 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Worker.Tests/MxAccess/AlarmCommandExecutorTests.cs:317-393` | +| Status | Resolved | + +**Description:** `AlarmCommandExecutorTests` reaches into `MxAccessSession` via reflection (`typeof(MxAccessSession).GetConstructor(BindingFlags.NonPublic | BindingFlags.Instance, ..., new[] { typeof(object), typeof(IMxAccessServer), typeof(IMxAccessEventSink), typeof(MxAccessHandleRegistry), typeof(MxAccessValueCache), typeof(int) }, ...)`) and provides an inline `NullMxAccessServer` no-op implementing every `IMxAccessServer` method. The XML doc admits the reflection-based path is fragile (`"MxAccessSession private ctor signature changed; update the test seam."`). The same `NullMxAccessServer` shape is reinventable wherever an executor is exercised in isolation; the consolidated `TestSupport` namespace introduced in Worker.Tests-014 was the natural home for it, but the no-op server lives in a single test file's private nested class instead. A future change to the private ctor signature breaks this one test in a way that requires re-reading the reflection call to diagnose, and a second test that wants the same no-op surface will reflectively duplicate it. + +**Recommendation:** Either (a) add a non-reflective seam — a constructor or static factory marked `internal`-with-`InternalsVisibleTo` that takes `IMxAccessServer` + the existing dependencies, removing the reflection — or (b) move the `NullMxAccessServer` no-op and the reflection helper into `TestSupport/NoopMxAccessSession.cs` so any future test can share it and a ctor change is fixed in one place. 
+ +**Resolution:** 2026-05-20 — Took option (a) plus option (b). Added a non-reflective `internal static MxAccessSession.CreateForTesting(IMxAccessServer, IMxAccessEventSink, MxAccessHandleRegistry?, MxAccessValueCache?, int?)` factory in `src/MxGateway.Worker/MxAccess/MxAccessSession.cs` (lines 61-88), gated through the pre-existing `InternalsVisibleTo` entry in `src/MxGateway.Worker/MxGateway.Worker.csproj`. `AlarmCommandExecutorTests.NewExecutor` now calls `MxAccessSession.CreateForTesting(new NoopMxAccessServer(), new NoopEventSink())` — no `GetConstructor`/`Invoke`/`BindingFlags` anywhere in the file. The previously per-file `NullMxAccessServer` no-op was extracted to the shared `src/MxGateway.Worker.Tests/TestSupport/NoopMxAccessServer.cs` (matching the `TestSupport` consolidation introduced in Worker.Tests-014); the XML doc on the new file explicitly cites Worker.Tests-016 for the rationale. A future change to the `MxAccessSession` private ctor signature now only requires updating `CreateForTesting`; the test file does not need to be edited. + +### Worker.Tests-017 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs` | +| Status | Resolved | + +**Description:** `WorkerPipeSession.DispatchGatewayEnvelopeAsync` (`src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:365-385`) has three documented branches: `WorkerCommand`, `WorkerShutdown`, and `WorkerCancel`. `WorkerPipeSessionTests` exercises the first two but never sends a `WorkerCancel` envelope, so the `_runtimeSession?.CancelCommand(envelope.CorrelationId)` path and the contract that the session forwards a cancel without faulting the pipe are uncovered. The `default:` arm (`UnexpectedEnvelopeBody` exception) is also uncovered — a gateway sending the wrong body case (e.g. another `GatewayHello` after the handshake) should produce a `ProtocolViolation` fault but no test asserts this. 
+ +**Recommendation:** Add two tests: one that writes a `WorkerCancel` envelope with a known correlation id and asserts `FakeRuntimeSession.CancelCommand` was called with that id (extend the shared `FakeRuntimeSession` to record cancel-correlation-ids); one that writes a post-handshake `GatewayHello` envelope and asserts the session writes a `WorkerFault` with category `ProtocolViolation` and exits the message loop. + +**Resolution:** 2026-05-20 — Added two `[Fact]`s to `WorkerPipeSessionTests` and the supporting state to the shared `FakeRuntimeSession`. (1) `RunAsync_WhenGatewaySendsWorkerCancel_ForwardsCorrelationIdToRuntimeSession` writes a `WorkerCancel` envelope with correlation id `"cancel-correlation-1"` after the handshake, then drives a normal shutdown via `SendShutdownAndWaitAsync` — observing the shutdown ack proves the message loop kept running (no fault, no exit) and `Assert.Contains("cancel-correlation-1", runtime.CancelledCorrelationIds)` proves the cancel reached `IWorkerRuntimeSession.CancelCommand`. The shared `FakeRuntimeSession` was extended with a `CancelledCorrelationIds` snapshot list and an optional `CancelCommandReturnValue` (defaulting to `false`, preserving the prior behaviour). (2) `RunAsync_WhenGatewaySendsUnexpectedEnvelopeBodyAfterHandshake_ThrowsAndExitsMessageLoop` writes a second `GatewayHello` envelope post-handshake — valid envelope, invalid body case for the message-loop state — and asserts `Assert.ThrowsAsync<WorkerFrameProtocolException>(async () => await runTask)` with `ErrorCode == WorkerFrameProtocolErrorCode.UnexpectedEnvelopeBody`. Re-triage: the original recommendation said "the session writes a `WorkerFault` with category `ProtocolViolation`", but the source at `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:380-384` shows the `default:` arm throws `WorkerFrameProtocolException`; `RunMessageLoopAsync` has no fault-writing catch (only `CompleteStartupHandshakeAsync` writes faults during the handshake). 
The test XML doc records this — the contract pinned is the exception type/error-code and the message-loop exit, not a fault frame. + +### Worker.Tests-018 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker.Tests/MxAccess/MxAccessLiveComCreationTests.cs:18-31, 35-73, 75-145, 148-220, 222-342` | +| Status | Resolved | + +**Description:** Every `[Fact]` in `MxAccessLiveComCreationTests` gates on `RunLiveMxAccessTests()` and `return`s silently when the opt-in env var is not set. xUnit reports a `Fact` that returns normally as **passed**, so a CI run without `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1` shows five green "live MXAccess" tests that did not run a single line of MXAccess code. `docs/GatewayTesting.md` and the `IntegrationTests` project already provide the correct pattern — `LiveMxAccessFactAttribute` (in `src/MxGateway.IntegrationTests/LiveMxAccessFactAttribute.cs`) emits xUnit's native `Skipped` status when the env var is absent — but `MxAccessLiveComCreationTests` does not consume it, so the gate is invisible in test output. The first test (`StartAsync_WhenOptedIn_CreatesInstalledMxAccessComObjectOnSta`) additionally inlines the env-var check (`string.Equals(Environment.GetEnvironmentVariable(...), "1", StringComparison.Ordinal)`) instead of using the local `RunLiveMxAccessTests()` helper, so the convention is inconsistent even within the same file. + +**Recommendation:** Move `LiveMxAccessFactAttribute` into a shared location both projects can reference (e.g. `MxGateway.Contracts.TestSupport` or a new `MxGateway.TestSupport` shared project), and decorate the five `MxAccessLiveComCreationTests` methods with `[LiveMxAccessFact]` instead of `[Fact]`. Drop the inline env-var checks. Skipped runs will then report `Skipped` rather than `Passed`, and CI will distinguish "live MXAccess unavailable" from "live MXAccess opted in, succeeded". 
+ +**Resolution:** 2026-05-20 — Added a self-contained `LiveMxAccessFactAttribute` at `src/MxGateway.Worker.Tests/TestSupport/LiveMxAccessFactAttribute.cs` (namespace `MxGateway.Worker.Tests.TestSupport`) that mirrors the `MxGateway.IntegrationTests` attribute: when `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS` is not `1`, the attribute sets `Skip` so xUnit emits a native `Skipped` result rather than a misleading `Passed`. All five `MxAccessLiveComCreationTests` methods now use `[LiveMxAccessFact]`; the inline env-var check at the top of `StartAsync_WhenOptedIn_CreatesInstalledMxAccessComObjectOnSta` and the per-method `if (!RunLiveMxAccessTests()) return;` silent-returns were deleted. The worker tests target net48/x86 and the integration tests target net10.0, so introducing a cross-project shared assembly was not practical; the Worker.Tests attribute is a near-duplicate of the IntegrationTests attribute and the XML doc on the new file calls this out so the next reviewer understands why two copies exist. xUnit output now reports the five live tests as `[SKIP]` when the env var is absent — `dotnet test ...` shows `Skipped: 9, Total: 274`, with the five `MxAccessLiveComCreationTests` correctly counted as skipped rather than passed. 
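
For readers unfamiliar with the skip-by-attribute pattern described in the Worker.Tests-018 resolution, a minimal sketch is shown here. It is not the file that was added: the class name `LiveGateFactAttribute` and the skip message are assumptions, and only the environment variable name and the "set `Skip` when the variable is not `1`" behaviour come from the resolution text. It needs a reference to xUnit to compile.

```csharp
using System;
using Xunit;

// Hypothetical name; illustrates an opt-in gate that reports Skipped instead of Passed.
[AttributeUsage(AttributeTargets.Method, AllowMultiple = false)]
public sealed class LiveGateFactAttribute : FactAttribute
{
    private const string OptInVariable = "MXGATEWAY_RUN_LIVE_MXACCESS_TESTS";

    public LiveGateFactAttribute()
    {
        string value = Environment.GetEnvironmentVariable(OptInVariable);

        // A non-null Skip reason makes xUnit report the test as skipped without running it.
        if (!string.Equals(value, "1", StringComparison.Ordinal))
        {
            Skip = "Set " + OptInVariable + "=1 to run live MXAccess tests.";
        }
    }
}
```

A test decorated with such an attribute needs no `if (...) return;` guard in its body, because the gate is evaluated at discovery time and shows up in the runner output.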
+ +### Worker.Tests-019 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | mxaccessgw conventions | +| Location | `src/MxGateway.Worker.Tests/AlarmsLiveSmokeTests.cs:45`, `src/MxGateway.Worker.Tests/AlarmClientWmProbeTests.cs:143`, `src/MxGateway.Worker.Tests/WnWrapConsumerProbeTests.cs:55` | +| Status | Resolved | + +**Description:** Worker.Tests-009 renamed every `snake_case` alarm-test method to the project's `Method_Scenario_Expectation` convention, but the rename missed the dev-rig probe and live-smoke `[Fact]`s in the `MxGateway.Worker.Tests` root (not under `MxAccess/`): `AlarmsLiveSmokeTests.Alarms_full_pipeline_round_trip`, `AlarmClientWmProbeTests.Probe_AlarmClient_for_alarm_messages` (and its helpers), and `WnWrapConsumerProbeTests.ProbeWnWrapConsumer`. These are `[Fact(Skip=...)]` so they never execute in normal CI, but they still drift from `docs/style-guides/CSharpStyleGuide.md` and contradict the resolution claim in Worker.Tests-009 that "every `[Fact]`/`[Theory]` method in the five alarm test files" was renamed. + +**Recommendation:** Rename `Alarms_full_pipeline_round_trip` → `Alarms_FullPipelineRoundTrip_RaisesAndAcknowledges` (or similar `Method_Scenario_Expectation` form) and apply the same convention to the two probe methods. xUnit discovers by attribute, not name, so renames are behaviour-neutral. + +**Resolution:** 2026-05-20 — Renamed the three `snake_case` probe/smoke `[Fact]` methods to the project's `Method_Scenario_Expectation` PascalCase convention: `Alarms_full_pipeline_round_trip` → `Alarms_FullPipelineRoundTrip_RaisesAndAcknowledges` (in `Probes/AlarmsLiveSmokeTests.cs`), `ProbeAlarmClientWmMessages` → `ProbeAlarmClient_OnDevRig_LogsAlarmWindowMessages` (in `Probes/AlarmClientWmProbeTests.cs`), and `ProbeWnWrapConsumer` → `ProbeWnWrapConsumer_OnDevRig_LogsXmlAlarmStream` (in `Probes/WnWrapConsumerProbeTests.cs`). 
The three files have moved to `Probes/` as part of Worker.Tests-023; the location columns above predate that move. xUnit discovers tests by attribute, so the renames are behaviour-neutral and the `Skip` strings still apply unchanged. + +### Worker.Tests-020 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Worker.Tests/MxAccess/MxAccessValueCacheTests.cs:88-108` | +| Status | Resolved | + +**Description:** `TryWaitForUpdate_ReturnsFalseAfterDeadline_WhenNoSetOccurs` asserts both a lower wall-clock bound (`stopwatch.ElapsedMilliseconds >= 60`, deadline was 80ms) and `pumpCalls > 1`. The 60ms floor is the same class of timing race Worker.Tests-003/004/013 corrected elsewhere: on a loaded CI agent a `Task.Run` scheduling delay can push the wait's start past the deadline so the loop runs zero or one iteration, the wait returns slightly *early* of the 60ms floor, and the test fails through no fault of the production code. The `pumpCalls > 1` check additionally races against the same scheduler — if the agent stalls the wait thread, `pumpStep` might fire only once before the deadline. The test purpose (verifying the timeout is honoured and pump-step is invoked) is sound but the assertions are wall-clock floors rather than deterministic checks. + +**Recommendation:** Drop the elapsed-time floor and the `pumpCalls > 1` assertion; verify only that `result` is false, `value` is default, and `pumpCalls >= 1` (the pump must fire at least once, but not "more than once"). The fact that `TryWaitForUpdate` returned false after the deadline is the contract the test exists to pin; the timing strictness is incidental. + +**Resolution:** 2026-05-20 — Eliminated the wall-clock dependency entirely (the equivalent of a manual time source for the `DateTime.UtcNow`-based deadline). 
The test now passes `DateTime.UtcNow.AddMilliseconds(-1)` — a deadline already in the past — so `TryWaitForUpdate`'s loop pumps once, immediately observes the elapsed deadline, and returns false with zero `Thread.Sleep`. The `Stopwatch`/`stopwatch.ElapsedMilliseconds >= 60` floor and the `pumpCalls > 1` strict-inequality assertions are gone. With an already-expired deadline the contract is deterministic: exactly one pump call (the loop must pump before checking the deadline so MXAccess messages can dispatch on the calling thread even when the deadline has just expired), `result == false`, `value` is default. Matches the pattern Worker.Tests-003/004/013 used — drop wall-clock floor checks in favour of a deterministic signal. + +### Worker.Tests-021 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Worker.Tests/Ipc/WorkerFrameProtocolTests.cs` | +| Status | Resolved | + +**Description:** `WorkerFrameProtocolTests` covers `MalformedLength`, `MessageTooLarge` (read-side, added in Worker.Tests-012), `ProtocolVersionMismatch`, `SessionMismatch`, and `InvalidEnvelope` on `WorkerFrameReader`. Three documented protocol-error branches remain uncovered: (1) `WorkerFrameProtocolErrorCode.EndOfStream` from `WorkerFrameReader.ReadExactlyOrThrowAsync` (`src/MxGateway.Worker/Ipc/WorkerFrameReader.cs:106`) when the stream closes mid-frame — important because the gateway closing its end of the pipe during a partial read is the most common production transport failure; (2) `WorkerFrameWriter` rejecting an envelope whose `CalculateSize()` returns 0 with `WorkerFrameProtocolErrorCode.InvalidEnvelope` (`WorkerFrameWriter.cs:46`); (3) `WorkerFrameWriter` rejecting an envelope larger than `MaxMessageBytes` with `WorkerFrameProtocolErrorCode.MessageTooLarge` (`WorkerFrameWriter.cs:53`). 
The writer-side checks defend against a session that constructs a too-large envelope before sending it down the pipe — completely separate from the reader-side bounds the existing tests pin. + +**Recommendation:** Add three tests: (a) `ReadAsync_WhenStreamEndsMidFrame_ThrowsEndOfStream` — feed a 4-byte length prefix declaring 100 bytes followed by only 50 bytes, assert `EndOfStream`; (b) `WriteAsync_WithEnvelopeAboveConfiguredMaximum_ThrowsMessageTooLarge` — construct `WorkerFrameProtocolOptions` with a small `MaxMessageBytes` and an envelope whose serialised size exceeds it, assert `MessageTooLarge`; (c) since `WorkerEnvelope.CalculateSize()` never returns 0 for a valid envelope (the protocol version field alone serializes), the `InvalidEnvelope` writer branch is genuinely unreachable in normal operation — either document this as defensive code that is intentionally untestable, or drop the check. + +**Resolution:** 2026-05-20 — Added three `[Fact]`s to `WorkerFrameProtocolTests.cs` for the three uncovered protocol-error branches. (a) `ReadAsync_WhenStreamEndsMidFrame_ThrowsEndOfStream` builds a 4-byte length prefix declaring 100 bytes followed by only 50 bytes, drives `WorkerFrameReader.ReadAsync` against it, and asserts `WorkerFrameProtocolErrorCode.EndOfStream` — pins the gateway-closes-mid-read transport failure. (b) `WriteAsync_WithEnvelopeAboveConfiguredMaximum_ThrowsMessageTooLarge` constructs `WorkerFrameProtocolOptions` with `MaxMessageBytes=64`, builds a `GatewayHello` envelope whose `GatewayVersion` is padded to 1024 bytes, asserts `WorkerFrameProtocolErrorCode.MessageTooLarge` and that the stream stayed empty (zero bytes written). 
(c) `WriteAsync_WithEmptyEnvelope_ThrowsInvalidEnvelopeFromValidator` exercises the body-less path — `WorkerEnvelopeValidator.Validate` runs first and rejects an envelope whose `BodyCase` is `None` with `InvalidEnvelope`, so the `CalculateSize()==0` branch is intercepted before it fires; the XML doc explicitly documents that the defensive zero-length branch is unreachable through public API but is left in place as a one-comparison safety net against future serialisation regressions. Net change: three new tests, all green; the reader-side `EndOfStream` plus writer-side `MessageTooLarge`/`InvalidEnvelope` rejections are now regression-protected. + +### Worker.Tests-022 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/MxGateway.Worker.Tests/MxAccess/WnWrapAlarmConsumerXmlTests.cs` | +| Status | Resolved | + +**Description:** `WnWrapAlarmConsumerXmlTests` covers `ParseSnapshotXml` and `TryParseHexGuid` directly — the pure-helper layer — and pins the no-internal-timer Worker-001 invariant via reflection. The `PollOnce` transition-delta logic (`src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:289-337`) is what actually turns "snapshot N to snapshot N+1" into `MxAlarmTransitionEvent` instances, and is the only place the consumer makes state-management decisions: skip-when-state-unchanged, fire-with-previous-state-Unspecified for first sighting, and (implicitly) drop entries that vanished from the new snapshot. None of these branches are exercised — the live-smoke `AlarmsLiveSmokeTests` covers the end-to-end pipeline but is `[Fact(Skip=...)]` against the dev rig, so there is no in-CI coverage of "snapshot delta computation produces the right transitions" at all. A regression that, for example, emits a transition every poll regardless of state-change would slip through. 
+ +**Recommendation:** Refactor `PollOnce`'s snapshot-diff loop into a pure `internal static IReadOnlyList ComputeTransitions(Dictionary previous, Dictionary next)` and add direct unit tests: (a) new entry produces `PreviousState=Unspecified`; (b) state-unchanged produces no transition; (c) state-changed produces a transition with the prior state; (d) entry vanished from `next` produces no transition (an alarm cleared from the active set; the snapshot just no longer mentions it). `MxAccessStaSession` already drives the COM-side polling, so the diff is genuinely independent of any COM dependency. + +**Resolution:** 2026-05-20 — Extracted the snapshot-diff loop from `WnWrapAlarmConsumer.PollOnce` into a pure `internal static IReadOnlyList ComputeTransitions(Dictionary previous, Dictionary next)` in `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs`. `PollOnce` now calls `ComputeTransitions` under the same `syncRoot` lock; the diff rules are unchanged. Added five `[Fact]`s in `WnWrapAlarmConsumerXmlTests.cs` exercising all four branches plus a multi-alarm fan-out case: `ComputeTransitions_WhenAlarmIsNewInNextSnapshot_EmitsTransitionWithUnspecifiedPreviousState`, `ComputeTransitions_WhenAlarmStateUnchanged_EmitsNoTransition`, `ComputeTransitions_WhenAlarmStateChanged_EmitsTransitionWithPriorState`, `ComputeTransitions_WhenAlarmDroppedFromActiveSet_EmitsNoTransition`, and `ComputeTransitions_WithMixedDelta_EmitsOnlyNewAndChangedTransitions`. Each test drives the function with `Dictionary` snapshots built from a `NewRecord` helper — no COM, no STA. A regression that emits a transition every poll regardless of state, swaps the previous/next ordering, or treats a dropped alarm as a transition now fails in-CI. 
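The four diff rules described above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the worker's implementation: `AlarmState`, `AlarmRecord`, and `Transition` are hypothetical stand-ins for the real alarm types, and the string dictionary key is assumed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the worker's alarm types; names and shapes are assumptions.
public enum AlarmState { Unspecified, UnackedActive, AckedActive }

public sealed class AlarmRecord
{
    public AlarmRecord(string key, AlarmState state) { Key = key; State = state; }
    public string Key { get; }
    public AlarmState State { get; }
}

public sealed class Transition
{
    public Transition(string key, AlarmState previous, AlarmState current)
    {
        Key = key; Previous = previous; Current = current;
    }
    public string Key { get; }
    public AlarmState Previous { get; }
    public AlarmState Current { get; }
}

public static class SnapshotDiff
{
    // Pure function: no COM, no STA, no locking. The caller owns synchronisation.
    public static IReadOnlyList<Transition> ComputeTransitions(
        Dictionary<string, AlarmRecord> previous,
        Dictionary<string, AlarmRecord> next)
    {
        var transitions = new List<Transition>();
        foreach (KeyValuePair<string, AlarmRecord> entry in next)
        {
            if (!previous.TryGetValue(entry.Key, out AlarmRecord prior))
            {
                // (a) First sighting: the previous state is Unspecified.
                transitions.Add(new Transition(entry.Key, AlarmState.Unspecified, entry.Value.State));
            }
            else if (prior.State != entry.Value.State)
            {
                // (c) State changed: carry the prior state.
                transitions.Add(new Transition(entry.Key, prior.State, entry.Value.State));
            }
            // (b) State unchanged: emit nothing.
        }
        // (d) Keys present only in `previous` are never visited, so a dropped alarm emits nothing.
        return transitions;
    }
}
```

Because the function only reads its two arguments, each branch can be driven from a unit test with hand-built dictionaries, which is the property the finding relies on.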
+ +### Worker.Tests-023 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Worker.Tests/AlarmClientWmProbeTests.cs` (779 lines), `src/MxGateway.Worker.Tests/WnWrapConsumerProbeTests.cs` (287 lines), `src/MxGateway.Worker.Tests/AlarmsLiveSmokeTests.cs` (270 lines) | +| Status | Resolved | + +**Description:** Three large dev-rig "probe" files are mixed into the worker unit-test project but are not unit tests in the usual sense: each is a `[Fact(Skip="Runtime probe — flip Skip=null on the dev rig (AVEVA installed)...")]` driver that runs for hundreds of seconds, opens real Galaxy subscriptions, posts Windows messages on STA threads, captures alarm payloads to `ITestOutputHelper`, and exists to document AVEVA COM behaviour rather than gate it. `AlarmClientWmProbeTests` alone is 779 lines — larger than every genuine unit-test file in the project. At build time these files contribute 1300+ lines of probe scaffolding that anyone inspecting the project to answer "what is `Worker.Tests` for?" has to wade through. The Skip-attribute strings document why they exist, but a colocated `docs/AlarmProbes.md` (or moving the probes to a separate `MxGateway.Worker.Probes` non-test assembly) would make the distinction explicit and stop the probe files from inflating `Worker.Tests`' build/test surface. + +**Recommendation:** Either (a) carve the three probe files out into `src/MxGateway.Worker.Probes/` (a separate project the dev-rig user opts into; the assembly references stay the same), or (b) move them into a `Probes/` subfolder inside `MxGateway.Worker.Tests` and add a one-paragraph header in `docs/GatewayTesting.md` describing the probe surface. Option (a) is cleaner because the live-smoke `AlarmsLiveSmokeTests` already references `WnWrapAlarmConsumer` directly and would naturally cohabit with the other AVEVA-COM probes. 
+ +**Resolution:** 2026-05-20 — Took option (b): moved `AlarmClientWmProbeTests.cs`, `WnWrapConsumerProbeTests.cs`, and `AlarmsLiveSmokeTests.cs` from `src/MxGateway.Worker.Tests/` into a new `src/MxGateway.Worker.Tests/Probes/` subfolder. The files keep their existing namespace (`MxGateway.Worker.Tests`) and their `[Fact(Skip=...)]` gating; the SDK-style project picks them up under the new path without a `.csproj` change. Option (b) was chosen over (a) because the probes still rely on the same test-project package references (`xunit`, `Microsoft.NET.Test.Sdk`, `Xunit.Abstractions`) plus the `Interop.WNWRAPCONSUMERLib`/`ArchestrA.MxAccess`/`aaAlarmManagedClient`/`IAlarmMgrDataProvider` references already declared in `MxGateway.Worker.Tests.csproj`; a separate `MxGateway.Worker.Probes` project would have to duplicate every one of these. The probes remain runnable on the dev rig by flipping `Skip=null` exactly as before. The `Worker.Tests` root listing now contains only genuine unit-test/regression files; probe scaffolding is visibly partitioned by directory. + +### Worker.Tests-024 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker.Tests/MxAccess/AlarmCommandHandlerTests.cs:42-54` | +| Status | Resolved | + +**Description:** `Subscribe_WhenUnderlyingSubscribeThrows_DisposesConsumer` asserts that an exception during `IMxAccessAlarmConsumer.Subscribe` triggers consumer disposal. The fake throws `new InvalidOperationException("simulated wnwrap subscribe failure")` and the test asserts `Assert.Throws<InvalidOperationException>(() => handler.Subscribe(...))`. But `AlarmCommandHandler.Subscribe` (`src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:65-93`) wraps the underlying call and re-throws — so an `InvalidOperationException` from any code path inside `Subscribe` (e.g. its own "already subscribed" guard at line 73) would also satisfy the assertion. 
The test does not pin that the *thrown* exception is the one from the fake. A regression that made `AlarmCommandHandler` throw before reaching the consumer would still be caught, because the test also asserts that `consumer.Disposed` is true and that assertion would fail; the disposal behaviour is therefore pinned. The genuine weakness is that the assertion does not pin the exception message ("simulated wnwrap subscribe failure"), so an unexpected `InvalidOperationException` from a different branch with a misleading message would pass without anyone noticing the handler swallowed the real failure cause. + +**Recommendation:** Strengthen to `InvalidOperationException exception = Assert.Throws<InvalidOperationException>(...); Assert.Contains("simulated wnwrap subscribe failure", exception.Message)` — pin both the type and the originating message so a regression that throws a *different* `InvalidOperationException` from inside `AlarmCommandHandler` fails the test. + +**Resolution:** 2026-05-20 — `Subscribe_WhenUnderlyingSubscribeThrows_DisposesConsumer` now captures the thrown exception and asserts `Assert.Contains("simulated wnwrap subscribe failure", exception.Message)` against the fake's exact thrown message. A regression that throws a *different* `InvalidOperationException` from inside `AlarmCommandHandler` (for example its own "already subscribed" guard at line 73 of `AlarmCommandHandler.cs`) now fails the message-contains assertion — the original test's type-only `Assert.Throws` would have passed silently while hiding the swallowed failure cause. The disposal assertion (`consumer.Disposed == true`) is unchanged; the test now pins both the disposal contract and the origin of the propagated exception. XML doc on the test method documents the regression scenario. 
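The strengthened assertion can be sketched as follows. This is an illustration only: `FakeConsumer` and `Handler` are hypothetical reductions of the real fake and of `AlarmCommandHandler`, and the sketch assumes the xUnit package the test project already references.

```csharp
using System;
using Xunit;

// Hypothetical stand-ins; the real handler and consumer live in MxGateway.Worker.
public sealed class FakeConsumer : IDisposable
{
    public bool Disposed { get; private set; }

    public void Subscribe() =>
        throw new InvalidOperationException("simulated wnwrap subscribe failure");

    public void Dispose() => Disposed = true;
}

public sealed class Handler
{
    private readonly FakeConsumer consumer;

    public Handler(FakeConsumer consumer) { this.consumer = consumer; }

    public void Subscribe()
    {
        try
        {
            consumer.Subscribe();
        }
        catch
        {
            // Dispose on failure, then let the original exception propagate unchanged.
            consumer.Dispose();
            throw;
        }
    }
}

public class HandlerTests
{
    [Fact]
    public void Subscribe_WhenUnderlyingSubscribeThrows_DisposesConsumer()
    {
        var consumer = new FakeConsumer();
        var handler = new Handler(consumer);

        // Assert.Throws<T> returns the caught exception, so its message can be inspected.
        InvalidOperationException exception =
            Assert.Throws<InvalidOperationException>(() => handler.Subscribe());

        // Pins the origin of the exception, not just its type.
        Assert.Contains("simulated wnwrap subscribe failure", exception.Message);
        Assert.True(consumer.Disposed);
    }
}
```

A different `InvalidOperationException` raised inside the handler would pass a type-only check but fail the message assertion here.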
+ +### Worker.Tests-025 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | mxaccessgw conventions | +| Location | `src/MxGateway.Worker.Tests/TestSupport/LiveMxAccessFactAttribute.cs:23`, `src/MxGateway.IntegrationTests/IntegrationTestEnvironment.cs:5`, `src/MxGateway.IntegrationTests/LiveMxAccessFactAttribute.cs:9-12` | +| Status | Resolved | + +**Description:** Worker.Tests-018 resolved the silent-skip issue by adding a Worker.Tests-local `LiveMxAccessFactAttribute`. The resolution called out that "introducing a cross-project shared assembly was not practical" because Worker.Tests targets net48/x86 and IntegrationTests targets net10.0. The two copies are correct today but the contract is held only by convention — both define `LiveMxAccessVariableName = "MXGATEWAY_RUN_LIVE_MXACCESS_TESTS"` as separate `public const string` literals, with the same `=="1"` `StringComparison.Ordinal` check duplicated. The IntegrationTests copy delegates to `IntegrationTestEnvironment.LiveMxAccessTestsEnabled`/`IsEnabled`, so any future opt-in tweak (e.g. accepting `"true"` as well, or honouring a different env-var name) made in `IntegrationTestEnvironment` will silently leave Worker.Tests behind. The XML doc on the Worker.Tests copy acknowledges this risk in prose but the divergence is invisible at compile time — there's no test or assertion that pins the two opt-in checks return the same answer. + +**Recommendation:** Either (a) lift the env-var-name string into `MxGateway.Contracts` (which already multi-targets `net10.0;net48`) as a `public const string`, then both `LiveMxAccessFactAttribute` copies reference the same constant; (b) add a single unit test in Worker.Tests that pins `LiveMxAccessFactAttribute.LiveMxAccessVariableName == "MXGATEWAY_RUN_LIVE_MXACCESS_TESTS"` to make the contract literal-visible to any reviewer changing the name; (c) document the synchronization requirement in `docs/GatewayTesting.md` alongside the existing live-opt-in section. 
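Option (a) above amounts to declaring the literal once and having both attribute copies read it. A minimal sketch, assuming a hypothetical holder class named `LiveTestOptIn` (the class and member names are illustrative; only the env-var literal and the ordinal `"1"` check come from the finding):

```csharp
using System;

// Hypothetical shared holder; the finding suggests MxGateway.Contracts as its home
// because that assembly already targets both frameworks.
public static class LiveTestOptIn
{
    // The literal quoted in the finding, declared in exactly one place.
    public const string VariableName = "MXGATEWAY_RUN_LIVE_MXACCESS_TESTS";

    // Same check both copies perform today: an exact ordinal match against "1".
    // An unset variable yields null, which compares unequal and returns false.
    public static bool IsEnabled()
    {
        return string.Equals(
            Environment.GetEnvironmentVariable(VariableName),
            "1",
            StringComparison.Ordinal);
    }
}
```

With the check centralised, a later change such as also accepting `"true"` would reach both test projects at once, which is the drift the finding is concerned about.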
+ +**Resolution:** 2026-05-20 — Added `GatewayContractInfo.LiveMxAccessOptInVariableName` to `MxGateway.Contracts` (net10.0/net48-multi-targeted) and routed both `LiveMxAccessFactAttribute` copies plus `IntegrationTestEnvironment.LiveMxAccessVariableName` through that single constant; the env-var literal now lives in one place. + +### Worker.Tests-026 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Worker/MxAccess/MxAccessSession.cs:74-88` | +| Status | Resolved | + +**Description:** `MxAccessSession.CreateForTesting` (added in Worker.Tests-016) is declared `internal static`, gated only by the `InternalsVisibleTo` entry in `MxGateway.Worker.csproj`. The XML doc states "production code must use the `Create` factory", but there is no runtime enforcement. The protection rests on (1) the `internal` modifier — which silently widens if any future `InternalsVisibleTo` directive is added (e.g. for an integration-test shim, a benchmark project, or an `InternalsVisibleTo`-using analyzer); and (2) reviewer attention. Worker.Tests itself contains real STA-running test code (the live tests, the probes), so a future test in Worker.Tests could call `CreateForTesting` from a context that has a real MXAccess COM object and the `new object()` placeholder would silently substitute. The factory hands out a session with `mxAccessComObject = new object()` so any code that later goes through `Marshal.IsComObject` or `Marshal.FinalReleaseComObject` on it would simply return false / no-op, masking lifetime regressions. + +**Recommendation:** Add a one-line runtime guard. `[Conditional("DEBUG")]` is not appropriate (the worker also ships Release builds), but the factory could check that `eventSink` is *not* an `MxAccessBaseEventSink` (the production sink), throwing `InvalidOperationException("CreateForTesting must not be used with the production MxAccessBaseEventSink")`. 
Production code never passes that sink to a "for testing" factory; the asymmetry is the cheapest signal. Alternatively, gate the factory with `[Obsolete("Test seam — never call from production code", error: false)]` so any production call surfaces as a build warning (and `TreatWarningsAsErrors` would turn that into a build break). + +**Resolution:** 2026-05-20 — Added a runtime guard to `MxAccessSession.CreateForTesting` that throws `ArgumentException` when the supplied `eventSink` is an `MxAccessBaseEventSink` (the production sink), so any future caller wiring the live sink into the test factory fails fast instead of silently bypassing `Marshal.IsComObject` on the `new object()` placeholder. + +### Worker.Tests-027 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Worker.Tests/TestSupport/FakeRuntimeSession.cs:174, 179-187` | +| Status | Resolved | + +**Description:** The consolidated `FakeRuntimeSession` (introduced by Worker.Tests-014, extended for Worker.Tests-017) reads/writes `cancelledCorrelationIds`, `snapshot`, and `events` under `lock(gate)`. The new `CancelCommandReturnValue` (a `bool` set by the test) is mutated outside any lock and read inside `CancelCommand` outside the lock as well (`return CancelCommandReturnValue;` after the locked `cancelledCorrelationIds.Add`). For a plain `bool` set before the worker's message-loop runs this is harmless on x86 (atomic-on-aligned-write), but it contradicts the rest of the file's locking convention and a future test that flips `CancelCommandReturnValue` mid-dispatch from a different thread would see an undocumented race. The same applies to `BlockDispatch`, `ThrowAfterDispatchReleased`, `ThrowTimeoutOnShutdown`, and `Disposed` — all are `bool`/auto-property without the `gate` lock — but those existed before Worker.Tests-017 and the finding flags only the consistency drift the new property introduces. 
+ +**Recommendation:** Either (a) hold `lock(gate)` when reading `CancelCommandReturnValue` inside `CancelCommand`, matching the surrounding locked statement; (b) mark `CancelCommandReturnValue` with `volatile` to document the cross-thread visibility; or (c) add an XML-doc note stating the property must be set before `RunAsync` begins and is not safe to mutate mid-test. Option (c) is cheapest and matches how `BlockDispatch` is used today. + +**Resolution:** 2026-05-20 — Converted `CancelCommandReturnValue` to a private-backing-field property whose get/set both hold `lock(gate)`, and folded the return statement of `CancelCommand` inside the existing locked block, so the property now respects the same locking convention as `cancelledCorrelationIds`, `snapshot`, and `events`. + +### Worker.Tests-028 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Design-document adherence | +| Location | `docs/GatewayTesting.md`, `src/MxGateway.Worker.Tests/Probes/` | +| Status | Resolved | + +**Description:** The Worker.Tests-023 resolution (commit `a020350`) stated that option (b) was taken — moving the three probe files to `Probes/` — but the recommendation for option (b) was "move them into a `Probes/` subfolder inside `MxGateway.Worker.Tests` **and** add a one-paragraph header in `docs/GatewayTesting.md` describing the probe surface." The folder move was made; the documentation addition was not. `docs/GatewayTesting.md` has no mention of `Probes/`, `AlarmClientWmProbeTests`, `WnWrapConsumerProbeTests`, or `AlarmsLiveSmokeTests` (verified with `Grep` against the doc). A reader navigating `docs/GatewayTesting.md` to understand the testing surface cannot tell the probes exist, what they pin, or how to flip `Skip=null` on the dev rig — the only documentation is the in-source `Skip=...` strings and the per-probe XML doc. 
+ +**Recommendation:** Add a `## Dev-rig probes` (or similar) section to `docs/GatewayTesting.md` that names the three probe files, explains the probe contract (live AVEVA COM, `Skip=null` flip, no in-CI coverage), and points to the source location `src/MxGateway.Worker.Tests/Probes/`. One paragraph is enough; the existing `[Fact(Skip=...)]` strings carry the rest of the detail. + +**Resolution:** 2026-05-20 — Added a `## Dev-rig Probes` section to `docs/GatewayTesting.md` between the Live MXAccess Smoke and Live Galaxy Repository sections; the new section names the three probe files (`AlarmsLiveSmokeTests`, `AlarmClientWmProbeTests`, `WnWrapConsumerProbeTests`), explains the probe contract (live AVEVA COM, `Skip=null` flip on the dev rig, not part of the regression contract), and points to the source location `src/MxGateway.Worker.Tests/Probes/`. + +### Worker.Tests-029 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Worker.Tests/Probes/AlarmsLiveSmokeTests.cs:9`, `src/MxGateway.Worker.Tests/Probes/AlarmClientWmProbeTests.cs:14`, `src/MxGateway.Worker.Tests/Probes/WnWrapConsumerProbeTests.cs:10` | +| Status | Resolved | + +**Description:** Worker.Tests-023 partitioned the probes by directory (`Probes/` subfolder) but kept their original namespace `namespace MxGateway.Worker.Tests;` rather than moving them to `namespace MxGateway.Worker.Tests.Probes;`. The folder/namespace mismatch is a minor C# convention drift (the project's other subfolder-grouped tests — `Bootstrap/`, `Conversion/`, `MxAccess/`, `Sta/`, `Ipc/`, `TestSupport/`, `Contracts/`, `ProjectStructure/` — all use a `MxGateway.Worker.Tests.<Folder>` namespace matching the directory). 
It also means an xUnit test filter like `--filter FullyQualifiedName~MxGateway.Worker.Tests.Probes` will discover zero tests, so the partition is invisible to the runner: any CI-side rule that wants to exclude probes still has to enumerate file/class names individually rather than match by namespace. + +**Recommendation:** Move the three probe files to `namespace MxGateway.Worker.Tests.Probes;`. xUnit discovers by attribute, not by namespace, so the rename is behaviour-neutral and lets a `FullyQualifiedName~Probes` filter trivially target them. The two other consolidations introduced in this sweep (`TestSupport/` → `MxGateway.Worker.Tests.TestSupport`) already follow this pattern. + +**Resolution:** 2026-05-20 — Moved `AlarmsLiveSmokeTests`, `AlarmClientWmProbeTests`, and `WnWrapConsumerProbeTests` to `namespace MxGateway.Worker.Tests.Probes;` so the folder and namespace match the project's other subfolder-grouped tests; a `FullyQualifiedName~MxGateway.Worker.Tests.Probes` filter now targets exactly the three probe classes. Verified by xUnit discovery output: the three probes appear under their new namespace as `[SKIP]`. + +### Worker.Tests-030 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Worker.Tests/Ipc/WorkerPipeSessionTests.cs:862-890` | +| Status | Resolved | + +**Description:** Within `WorkerPipeSessionTests`, the inbound-envelope helpers assign `Sequence` values that are inconsistent with the order in which the tests send them: `CreateGatewayHelloEnvelope` is `Sequence = 1`, `CreateCommandEnvelope` is `Sequence = 2`, `CreateShutdownEnvelope` is `Sequence = 3`, and `CreateCancelEnvelope` is `Sequence = 4`. The Worker.Tests-017 cancel test sends the cancel (`Sequence = 4`) **before** the shutdown (`Sequence = 3`) — a future reader inspecting the wire trace will see decreasing sequence numbers. 
The test still passes because the worker has no inbound sequence-monotonicity check (verified by `Grep`ing `Ipc/` for `ValidateSequence`/`monotonic`/sequence-comparison patterns — none exist). But `gateway.md` documents monotonic sequence numbers on the outbound side, and the test's literal sequence values suggest a convention that isn't enforced and can mislead a debugger correlating a frame dump to test intent. + +**Recommendation:** Either (a) reassign `CreateCancelEnvelope` to a sequence value `>` shutdown (or pass the sequence as a parameter, matching `CreateGatewayHelloEnvelope`'s parameter style), so the wire trace reads in ascending order; (b) add an XML-doc note on the cancel test stating that the worker has no inbound monotonicity check and the test ignores envelope sequence ordering; (c) parameterise all four helper methods so each test passes its desired sequence and the literal numbers stop carrying implicit meaning. Option (c) is the cleanest because `CreateGatewayHelloEnvelope` is already parameter-driven for nonce/version. + +**Resolution:** 2026-05-20 — Took option (c): parameterised `CreateGatewayHelloEnvelope`/`CreateCommandEnvelope`/`CreateCancelEnvelope`/`CreateShutdownEnvelope` with a `ulong sequence` argument (defaults 1/2/2/3 respectively, matching the typical Hello/Command/Cancel/Shutdown ordering), so the literal sequence values no longer carry implicit meaning. Updated the cancel-correlation test's wire trace to ascend (Hello=1, Cancel=2, Shutdown=3) and added a comment noting that the worker has no inbound monotonicity check — the parameter exists so multi-frame tests can pin the trace ordering explicitly when needed. 
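The parameterised-helper approach from option (c) can be sketched as below. The `Envelope` type and `EnvelopeFactory` names are hypothetical stand-ins for the generated envelope and the private test helpers; only the default values (1/2/2/3) and the ascending Hello, Cancel, Shutdown trace are taken from the resolution text.

```csharp
using System;

// Hypothetical envelope stand-in; the real tests build the generated worker envelope type.
public sealed class Envelope
{
    public Envelope(string kind, ulong sequence) { Kind = kind; Sequence = sequence; }
    public string Kind { get; }
    public ulong Sequence { get; }
}

public static class EnvelopeFactory
{
    // Defaults follow the typical Hello/Command/Cancel/Shutdown ordering; a test that
    // needs a specific wire trace passes explicit values instead.
    public static Envelope CreateGatewayHello(ulong sequence = 1) => new Envelope("Hello", sequence);
    public static Envelope CreateCommand(ulong sequence = 2) => new Envelope("Command", sequence);
    public static Envelope CreateCancel(ulong sequence = 2) => new Envelope("Cancel", sequence);
    public static Envelope CreateShutdown(ulong sequence = 3) => new Envelope("Shutdown", sequence);
}

public static class Program
{
    public static void Main()
    {
        // Cancel-correlation trace: Hello=1, Cancel=2, Shutdown=3.
        Envelope[] trace =
        {
            EnvelopeFactory.CreateGatewayHello(),
            EnvelopeFactory.CreateCancel(),
            EnvelopeFactory.CreateShutdown(),
        };

        // Check that each frame's sequence is strictly greater than the one before it.
        for (int i = 1; i < trace.Length; i++)
        {
            if (trace[i].Sequence <= trace[i - 1].Sequence)
            {
                throw new InvalidOperationException("Trace is not ascending at index " + i);
            }
        }

        Console.WriteLine("ascending");
    }
}
```

Optional parameters keep existing call sites unchanged while letting multi-frame tests state their ordering explicitly.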
diff --git a/code-reviews/Worker/findings.md b/code-reviews/Worker/findings.md new file mode 100644 index 0000000..73b37c6 --- /dev/null +++ b/code-reviews/Worker/findings.md @@ -0,0 +1,462 @@ +# Code Review — Worker + +| Field | Value | +|---|---| +| Module | `src/ZB.MOM.WW.MxGateway.Worker` | +| Reviewer | Claude Code | +| Review date | 2026-05-24 | +| Commit reviewed | `d692232` | +| Status | Reviewed | +| Open findings | 0 | + +## Checklist coverage + +This row reflects the 2026-05-20 re-review at commit `a020350`. Worker-001..022 are all closed; the row only summarises new findings filed against this commit. The prior pass's fixes for Worker-016..022 were verified sound: + +- **Worker-016**: `StaRuntimeShutdownException` exists, `MxAccessStaSession.cs:261` is the only `catch (StaRuntimeShutdownException)` site in the module. No accidental catch elsewhere (grep verified). The graceful-shutdown vs. STA-affinity-violation distinction holds. +- **Worker-017**: `ReportWatchdogFaultIfNeededAsync` returns early when `CurrentCommandCorrelationId` is non-empty. Sound for the slow-but-progressing case; but see **Worker-023** — there is no defensive ceiling, so a truly stuck command (synchronous COM call hung against a dead MXAccess provider) leaves `CurrentCommandCorrelationId` non-empty forever and the worker-side watchdog is permanently suppressed. +- **Worker-018**: `SetXmlAlarmQuery` is now wrapped in `try/catch (COMException)` and re-thrown as `InvalidOperationException` carrying the HRESULT. Sound. +- **Worker-019**: `subscriptionExpression` field is gone. +- **Worker-020**: `_state is not WorkerState.Ready and not WorkerState.ExecutingCommand` simplified to `_state != WorkerState.Ready`. Confirmed `_state` is never assigned `ExecutingCommand`; volatile reads are atomic. +- **Worker-021**: `_runtimeSession ??=` in `InitializeMxAccessAsync` preserves a factory-supplied session. 
Confirmed `RunAsync` path bypasses `InitializeMxAccessAsync` entirely (it passes its own factory-driven lambda), so the `??=` only runs on the legacy parameterless-`CompleteStartupHandshakeAsync` direct-invocation path. +- **Worker-022**: `MxAlarmSnapshot.cs` (now containing only `MxAlarmSnapshotRecord`), `MxAlarmStateKind.cs`, `MxAlarmTransitionEvent.cs` — filenames match their single public type; all three keep the `MxGateway.Worker.MxAccess` namespace. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issue found: Worker-025 (`RunAsync` does not null-check the result of `_runtimeSessionFactory()`; a null factory return would NRE on `_runtimeSession.StartAsync(...)` rather than throw a diagnostic exception). | +| 2 | mxaccessgw conventions | No issues found. The split alarm-snapshot files match the one-public-type-per-file convention; namespace consistency verified. | +| 3 | Concurrency & thread safety | Issue found: Worker-024 (the alarm command path — `Subscribe`/`Acknowledge`/`AcknowledgeByName`/`QueryActive`/`Unsubscribe` — has no STA-affinity assertion equivalent to Worker-008's `EnsureOnAlarmConsumerThread` guard; only the alarm *poll* path enforces affinity, leaving a latent gap if a future refactor lets alarm commands run off-STA). | +| 4 | Error handling & resilience | Issue found: Worker-023 (Worker-017's watchdog skip has no defensive ceiling; a truly stuck command — synchronous COM hung against a dead MXAccess provider — keeps `CurrentCommandCorrelationId` non-empty indefinitely, and the worker-side `StaHung` watchdog never fires. Gateway-side `CommandTimeout` is the only safety net). | +| 5 | Security | No issues found. No secret logging on the alarm path; the dropped-reply diagnostic Worker-003 added logs only the correlation id and command method, not the command payload. | +| 6 | Performance & resource management | No new issues found. 
Frame I/O still uses pooled buffers (Worker-009); STA join timeouts in `Dispose` are bounded. | +| 7 | Design-document adherence | No new design drift. The split alarm files preserve the documented public API surface. Worker-017's resolution comment documents the watchdog design intent — though see Worker-023 for the documentation gap on truly-stuck commands. | +| 8 | Code organization & conventions | No issues found. Worker-022 was the last file-organization issue. | +| 9 | Testing coverage | Worker-016 and Worker-017 each have direct regression tests (`RunAlarmPollLoop_WhenPollOnceThrowsInvalidOperation_RecordsFaultOnEventQueue`, `RunAsync_WhenStaActivityIsStaleWithCommandInFlight_DoesNotWriteWatchdogFault`). Worker-018, -020, -021's resolution notes state "no new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent" — Worker-018's `SetXmlAlarmQuery` failure-translation and Worker-020's simplified `_state != Ready` check have no regression test in this branch yet. No standalone finding — these are documented gaps in the resolution notes of the prior pass. | +| 10 | Documentation & comments | No new issues. Worker-017's XML doc on `ReportWatchdogFaultIfNeededAsync` documents the design intent clearly; the `_runtimeSession ??=` reasoning is documented inline; Worker-016's graceful-vs-affinity distinction is documented at both catch sites. | + +### 2026-05-24 review (commit d692232) + +Re-review pass at `d692232`. The only diff against the Worker source since +`a020350` is the `ZB.MOM.WW` namespace/csproj/folder rename (commit `dc9c0c9`) +— no behavioural changes. The four external runtime identifiers documented +as intentionally unprefixed (including `Name = "MxGateway.Worker.STA"` on +`StaRuntime`) are confirmed still in their original form. The `_writeLock` +contention with the gateway-side watchdog (Server-031) is unchanged. 
+ +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. | +| 2 | mxaccessgw conventions | No issues found — rename clean; STA thread name and MXAccess COM target (`LMXProxyServerClass` via `ArchestrA.MXAccess.dll`) unchanged. | +| 3 | Concurrency & thread safety | No issues found in this diff. | +| 4 | Error handling & resilience | No issues found in this diff. | +| 5 | Security | No issues found in this diff. | +| 6 | Performance & resource management | No issues found in this diff. | +| 7 | Design-document adherence | No issues found in this diff. | +| 8 | Code organization & conventions | No issues found in this diff. | +| 9 | Testing coverage | No issues found in this diff. | +| 10 | Documentation & comments | No issues found in this diff. | + +## Findings + +### Worker-001 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:204-207` | +| Status | Resolved | + +**Description:** When constructed with `pollIntervalMilliseconds > 0`, `Subscribe` starts a `System.Threading.Timer` whose `OnPoll` callback runs `PollOnce()` — which calls `wwAlarmConsumerClass.GetXmlCurrentAlarms2` — on a thread-pool thread. The wnwrap CLSID is registered `ThreadingModel=Apartment`; calling its methods off the owning STA violates the hard rule that all COM calls happen on the dedicated STA thread, and can deadlock on cross-apartment marshaling when the STA is not pumping. The production path (default constructor, interval 0) is safe, but the public 3-arg constructor leaves this footgun callable, and tests/live-smoke use it. + +**Recommendation:** Remove the internal `Timer` entirely (production already drives `PollOnce` from the STA), or document and gate it so it can only be used from an STA thread. At minimum, make the timer-driven mode unreachable from any production wiring. 
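If the "gate it" alternative were preferred over removing the timer, a thread-affinity check is one way to make an off-STA call fail fast. The sketch below is illustrative only: `OwnerThreadGuard` is a hypothetical helper, not a type in the worker, and it checks the identity of the creating thread rather than the COM apartment state.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of the worker): remembers the thread that created it
// and rejects calls arriving on any other thread.
public sealed class OwnerThreadGuard
{
    private readonly int ownerThreadId = Environment.CurrentManagedThreadId;

    public void EnsureOnOwnerThread(string operation)
    {
        int current = Environment.CurrentManagedThreadId;
        if (current != ownerThreadId)
        {
            throw new InvalidOperationException(
                operation + " must run on thread " + ownerThreadId +
                " but was called on thread " + current + ".");
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var guard = new OwnerThreadGuard();

        // Same thread that created the guard: the check passes silently.
        guard.EnsureOnOwnerThread("PollOnce");

        // A second thread stands in for a thread-pool timer callback.
        Exception observed = null;
        var other = new Thread(() =>
        {
            try { guard.EnsureOnOwnerThread("PollOnce"); }
            catch (InvalidOperationException ex) { observed = ex; }
        });
        other.Start();
        other.Join();

        Console.WriteLine(observed != null ? "rejected off-thread call" : "unexpectedly allowed");
    }
}
```

A guard of this kind turns a potential cross-apartment deadlock into an immediate, diagnosable exception, at the cost of one integer comparison per call.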
+ +**Resolution:** 2026-05-18 — Removed the off-STA timer infrastructure from `WnWrapAlarmConsumer`: the `Timer? pollTimer` and `pollIntervalMs` fields, the `DefaultPollIntervalMilliseconds` constant, the `OnPoll` callback, the timer-arming arm in `Subscribe`, and the timer disposal block in `Dispose`. The `pollIntervalMilliseconds` parameter is gone from both public constructors (the test-seam ctor is now 2-arg: `wwAlarmConsumerClass` + `maxAlarmsPerFetch`), so the off-STA footgun is structurally unreachable. `PollOnce()` remains the public STA-driven entry point. The stale "poll … on a timer below" comment was corrected. Verified by the regression tests `WnWrapAlarmConsumer_has_no_internal_timer_field` and `WnWrapAlarmConsumer_exposes_no_poll_interval_constructor_parameter`; the `AlarmsLiveSmokeTests` call site was updated to the 2-arg constructor. + +### Worker-002 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:545-549` | +| Status | Resolved | + +**Description:** `RunHeartbeatLoopAsync` calls `await Task.Delay(_sessionOptions.HeartbeatInterval, ...)` before sending the first heartbeat. The gateway therefore receives no heartbeat for the first full interval (default 5s) after the worker reaches `Ready`. If the gateway's liveness watchdog expects a heartbeat sooner, a healthy worker can be misclassified as hung at startup. + +**Recommendation:** Send an initial heartbeat immediately on entering the loop, or move the `Task.Delay` to the end of the loop body. + +**Resolution:** 2026-05-18 — Restructured `RunHeartbeatLoopAsync` so the `Task.Delay(HeartbeatInterval)` is applied between beats only, not before the first. A `firstBeat` guard skips the delay on the initial iteration, so the gateway sees a heartbeat as soon as the worker is `Ready`; cancellation behavior is preserved (the loop still observes the token and the delay still throws on cancellation). 
Verified by the regression test `RunAsync_SendsFirstHeartbeatImmediatelyOnEnteringLoop`. Three pre-existing tests (`WorkerPipeClientTests.RunAsync_ConnectsToPipeAndCompletesHandshake`, `WorkerPipeClientTests.RunAsync_RetriesUntilPipeServerAppears`, `WorkerPipeSessionTests.RunAsync_WhenCommandThrowsAfterShutdown_DropsLateFaultAndWritesShutdownAck`) assumed strict frame ordering and were updated to skip the now-interleaved first heartbeat while still asserting the same shutdown-ack behavior. + +### Worker-003 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:399-403`, `:416-419` | +| Status | Resolved | + +**Description:** `ProcessCommandAsync` checks `_state` after `DispatchAsync` completes and silently `return`s without writing a `WorkerCommandReply` (or fault) when `_state` is not `Ready`/`ExecutingCommand`. `_state` is a plain field mutated from multiple tasks (heartbeat loop, event-drain loop, shutdown). A command that completes successfully while `_state` has transitioned will have its reply dropped with no diagnostic, and the gateway's correlation-id wait then hangs until its own timeout. The `_state` read is also not synchronized. + +**Recommendation:** Always attempt to write the reply/fault for an in-flight command, or explicitly reject in-flight commands with a `Canceled`/`WorkerUnavailable` reply during state transitions. Make `_state` access thread-safe (volatile or locked). + +**Resolution:** 2026-05-18 — Both silent-drop `return` sites in `ProcessCommandAsync` (the post-`DispatchAsync` success path and the exception path) now call a new `LogCommandResultDropped` helper before returning. The helper logs an Information event named `WorkerCommandResultDropped` via the session's `IWorkerLogger`, carrying the command's `correlation_id` plus `command_method` and `worker_state`, so a stuck gateway correlation-id wait is now traceable. 
The `_state` field was made `volatile` (`WorkerState` is an int-backed protobuf enum, so volatile is valid) so cross-thread reads observe the latest value without tearing; this is a low-risk, non-behavioral change and did not destabilize any test. Verified by the regression test `RunAsync_WhenReplyIsDroppedAfterShutdown_LogsDiagnostic`. + +### Worker-004 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:565-588` | +| Status | Resolved | + +**Description:** After `ReportWatchdogFaultIfNeededAsync` sends an `StaHung` fault, the heartbeat loop continues sending normal heartbeats with `State` derived from `_state`, which the watchdog path never sets to `Faulted`. The heartbeat then keeps reporting a non-faulted state that contradicts the fault just sent. + +**Recommendation:** Set `_state = WorkerState.Faulted` (thread-safely) when the watchdog fault fires so heartbeat state and fault stay consistent. + +**Resolution:** 2026-05-18 — `ReportWatchdogFaultIfNeededAsync` now sets `_state = WorkerState.Faulted` immediately after `_watchdogFaultSent = true` and before the `StaHung` fault is written, so the next heartbeat reports `Faulted` instead of contradicting the fault. `_state` is already `volatile` (Worker-003), so the cross-thread write from the heartbeat loop is observed correctly by the heartbeat's own `CreateHeartbeat` read; no further locking is required. Verified by the regression test `WorkerPipeSessionTests.RunAsync_AfterWatchdogFault_HeartbeatReportsFaultedState`, which uses a stale-activity snapshot with an empty current-command correlation id so the heartbeat `State` is derived from `_state` rather than forced to `ExecutingCommand`. 
+ +### Worker-005 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-258` (production alarm poll loop) | +| Status | Resolved | + +**Description:** `OnPoll` catches every exception from `PollOnce()` and discards it (`_ = ex;`). The production poll path (`MxAccessStaSession.RunAlarmPollLoopAsync` → `AlarmCommandHandler.PollOnce` → `AlarmDispatcher.PollOnce` → `consumer.PollOnce()`) has no fault recording either. A permanently failing alarm provider (e.g. `GetXmlCurrentAlarms2` returning `E_FAIL`, malformed XML throwing in `XmlDocument.LoadXml`) is therefore completely silent — no fault on the event queue, no log. + +**Recommendation:** Route poll failures to `MxAccessEventQueue.RecordFault` (or a logger) so a broken alarm subscription becomes observable. Update the now-stale comment. + +**Re-triage:** The cited location `WnWrapAlarmConsumer.cs:297-313` and the `OnPoll` callback no longer exist as of this branch — Worker-001 removed the off-STA `Timer` and its `OnPoll` callback entirely. The substantive concern still held, however: the **production** poll path in `MxAccessStaSession.RunAlarmPollLoopAsync` caught only `OperationCanceledException`, `ObjectDisposedException`, and `InvalidOperationException`. A genuine poll failure (`COMException` from `GetXmlCurrentAlarms2`, a malformed-XML `XmlException`) escaped uncaught, faulted the never-awaited `Task.Run` poll task, and was silently lost — exactly the silent-failure the finding describes. The finding was re-pointed at the live location and fixed there rather than at the removed `OnPoll`. + +**Resolution:** 2026-05-18 — `RunAlarmPollLoopAsync` gained a trailing `catch (Exception exception)` arm after the three graceful-stop catches. 
A real alarm-poll failure is now converted to a `WorkerFault` (category `MxaccessEventConversionFailed`, carrying the exception type and, for a `COMException`, its `HResult`) by the new `CreateAlarmPollFault` helper and recorded on the session's `MxAccessEventQueue` via `RecordFault`. The worker's event-drain loop drains that fault and forwards it to the gateway, so a broken alarm subscription is now observable on the IPC fault path instead of vanishing. The poll loop still stops after the failure (the subscription is dead). No new proto enum value was added — `MxaccessEventConversionFailed` is the closest existing alarm-path category, avoiding a contracts regeneration across all clients. Verified by the regression test `MxAccessStaSessionTests.RunAlarmPollLoop_WhenPollOnceThrows_RecordsFaultOnEventQueue`. + +### Worker-006 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:117-124`, `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:386-491` | +| Status | Resolved | + +**Description:** `RunAsync`'s `finally` calls `_runtimeSession?.Dispose()` unless `_shutdownTimedOut`. On the normal path `ShutdownGracefullyAsync` already disposed the STA runtime, so re-entering `Dispose()` is a harmless no-op only because `ShutdownGracefullyAsync` reached its end and set `disposed = true`. If `ShutdownGracefullyAsync` throws `TimeoutException` after partial teardown with `_shutdownTimedOut` set, the session is never disposed at all — the `finally` skips it — leaking the STA thread and COM object, leaving cleanup to rely solely on process exit. + +**Recommendation:** Make the dispose decision explicit and confirm process exit always follows a timed-out shutdown; otherwise dispose defensively. At minimum document why disposal is deliberately skipped on timeout. 
+ +**Resolution:** 2026-05-18 — `RunAsync`'s `finally` now always calls `_runtimeSession?.Dispose()`; the `if (!_shutdownTimedOut)` guard and the `_shutdownTimedOut` field (which had become write-only) were removed. `MxAccessStaSession.Dispose` is idempotent (`if (disposed) return`) and bounded — each STA join is capped with `Wait(TimeSpan.FromSeconds(2))` — so re-entering it on the normal path (where `ShutdownGracefullyAsync` already disposed the runtime) is a harmless no-op, while on the timed-out path it is now the only thing that reclaims the STA thread and releases the MXAccess COM object. The previous behaviour leaked both on a shutdown timeout and relied solely on process exit. A code comment in the `finally` block documents the reasoning. Verified by the regression test `WorkerPipeSessionTests.RunAsync_WhenShutdownTimesOut_StillDisposesRuntimeSession`, which forces a `TimeoutException` from `ShutdownGracefullyAsync` and asserts the runtime session is disposed before `RunAsync` rethrows. + +### Worker-007 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | mxaccessgw conventions | +| Location | `src/MxGateway.Worker/MxAccess/MxAccessComServer.cs:130-150` | +| Status | Resolved | + +**Description:** `Invoke` uses late-bound `Type.InvokeMember` reflection as a fallback when the COM object does not cast to `ILMXProxyServer*`. In production the object is always `LMXProxyServerClass`, so the reflection path exists only for test doubles — it is dead/untested code on the production path and obscures the interface contract. `params object[] arguments` also boxes value-type handles on every call. + +**Recommendation:** Drop the reflection fallback and require the COM object to implement the interface (tests can supply a typed fake), or clearly mark the fallback as test-only. 
+ +**Re-triage:** The finding's claim that the reflection path is "dead/untested code" is partly inaccurate — it was in fact the path exercised by the entire `MxAccessCommandExecutorTests` suite, whose `FakeMxAccessComObject` did not implement any typed interface. So the reflection fallback was test-only but *not* untested. The convention concern (bypassing the typed interface contract, boxing value-type handles) is valid, so the fix follows the recommendation's first option. + +**Resolution:** 2026-05-18 — The late-bound `Type.InvokeMember` reflection fallback and its `params object[]`-boxing `Invoke` helper were removed from `MxAccessComServer`. Each adapter method now takes one of two typed paths: an `is IMxAccessServer` fast path (test fakes implement `IMxAccessServer` directly) and the production path that casts to the typed `ILMXProxyServer` / `ILMXProxyServer3` / `ILMXProxyServer4` COM interfaces via new `AsProxyServer*` helpers. A COM object implementing neither now fails fast with a clear `InvalidOperationException` naming the missing interface, instead of an opaque late-bound call. The test seam was migrated accordingly: `MxAccessCommandExecutorTests.FakeMxAccessComObject` now declares `: IMxAccessServer` (its method signatures already matched the interface exactly, so no behavioural change). Verified by the new `MxAccessComServerTests` (typed-server routing, untyped-object rejection, original-exception propagation — no more `TargetInvocationException` wrapping) plus the unchanged, still-passing `MxAccessCommandExecutorTests` suite which now exercises the typed `IMxAccessServer` path. 
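
The two typed paths described in the Worker-007 resolution can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped adapter: `ITypedServer`, `IComProxyServer`, `ComServerAdapter`, `FakeServer`, and the `Register` method are hypothetical stand-ins for `IMxAccessServer`, the `ILMXProxyServer*` interfaces, `MxAccessComServer`, and their members.

```csharp
using System;

// Hypothetical stand-ins: a typed test seam and a typed COM interface.
public interface ITypedServer { int Register(string clientName); }
public interface IComProxyServer { int Register(string clientName); }

public sealed class ComServerAdapter
{
    private readonly object _comObject;

    public ComServerAdapter(object comObject) =>
        _comObject = comObject ?? throw new ArgumentNullException(nameof(comObject));

    public int Register(string clientName)
    {
        // Fast path: test fakes implement the typed seam directly.
        if (_comObject is ITypedServer typed)
        {
            return typed.Register(clientName);
        }

        // Production path: cast to the typed COM interface, never late-bound reflection.
        return AsProxyServer().Register(clientName);
    }

    // Fails fast with a message naming the missing interface.
    private IComProxyServer AsProxyServer() =>
        _comObject as IComProxyServer
        ?? throw new InvalidOperationException(
            $"COM object of type {_comObject.GetType().FullName} does not implement {nameof(IComProxyServer)}.");
}

internal sealed class FakeServer : ITypedServer
{
    public int Register(string clientName) => 42;
}

internal static class Program
{
    private static void Main()
    {
        Console.WriteLine(new ComServerAdapter(new FakeServer()).Register("demo"));

        try
        {
            new ComServerAdapter(new object()).Register("demo");
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```

Because no call goes through `Type.InvokeMember`, an exception thrown by the fake reaches the caller unwrapped, which is the behaviour the resolution attributes to the new `MxAccessComServerTests`.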
+ +### Worker-008 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:205-249`, `:429-447` | +| Status | Resolved | + +**Description:** `RunAlarmPollLoopAsync` correctly marshals `handler.PollOnce()` onto the STA via `staRuntime.InvokeAsync`, and the cancel/await/dispose ordering in `ShutdownGracefullyAsync` is sound. However, nothing enforces that the `consumerFactory` and all `IMxAccessAlarmConsumer` calls run on the STA thread; a future caller could break STA affinity silently. + +**Recommendation:** Add an assertion or documented invariant that the consumer factory and all `IMxAccessAlarmConsumer` calls run on the STA thread, mirroring the existing `MxAccessSession.CreationThreadId` pattern. + +**Resolution:** 2026-05-18 — `MxAccessStaSession` now records the STA thread id (`alarmConsumerThreadId`) at the point the alarm-command-handler factory is invoked — which already runs inside `staRuntime.InvokeAsync` during `StartAsync`, mirroring the `MxAccessSession.CreationThreadId` capture. `RunAlarmPollLoopAsync`'s marshalled poll lambda now calls `EnsureOnAlarmConsumerThread()` before `handler.PollOnce()`, asserting the poll runs on the recorded STA thread. The check is delegated to a new `internal static` guard `AssertOnAlarmConsumerThread(int? expected, int actual)` that throws a descriptive `InvalidOperationException` on an affinity violation and is a no-op when the consumer thread is unrecorded (no alarm handler configured). Making the guard `static` and `internal` keeps it directly unit-testable. The STA-affinity invariant is documented in the guard's XML doc. Verified by the regression tests `MxAccessStaSessionTests.AssertOnAlarmConsumerThread_WhenOffOwningThread_Throws` and `AssertOnAlarmConsumerThread_OnOwningThreadOrUnset_DoesNotThrow`. 
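
A minimal sketch of the guard shape described in the Worker-008 resolution. Only the `AssertOnAlarmConsumerThread(int? expected, int actual)` signature and its throw/no-op semantics come from the resolution text; the `AffinityGuard` container, the message wording, and the `Main` driver are assumptions for illustration.

```csharp
using System;
using System.Threading;

// Hypothetical container; per the resolution the real guard lives on MxAccessStaSession.
internal static class AffinityGuard
{
    // No-op when no consumer thread was recorded; throws on an affinity violation.
    internal static void AssertOnAlarmConsumerThread(int? expected, int actual)
    {
        if (expected is null || expected.Value == actual)
        {
            return;
        }

        throw new InvalidOperationException(
            $"Alarm consumer call ran on thread {actual}; expected owning STA thread {expected.Value}.");
    }
}

internal static class Program
{
    private static void Main()
    {
        int owner = Environment.CurrentManagedThreadId;

        // Silent on the owning thread, and silent when nothing was recorded.
        AffinityGuard.AssertOnAlarmConsumerThread(owner, Environment.CurrentManagedThreadId);
        AffinityGuard.AssertOnAlarmConsumerThread(null, Environment.CurrentManagedThreadId);

        // A call from any other thread is rejected.
        Exception caught = null;
        var other = new Thread(() =>
        {
            try
            {
                AffinityGuard.AssertOnAlarmConsumerThread(owner, Environment.CurrentManagedThreadId);
            }
            catch (InvalidOperationException ex)
            {
                caught = ex;
            }
        });
        other.Start();
        other.Join();

        Console.WriteLine(caught != null ? "violation detected" : "no violation");
    }
}
```

Taking both thread ids as parameters, rather than reading the current thread inside the guard, is what lets a unit test exercise the mismatch case without starting a second thread.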
+ +### Worker-009 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Performance & resource management | +| Location | `src/MxGateway.Worker/Ipc/WorkerFrameReader.cs:31,49`, `src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:57-58` | +| Status | Resolved | + +**Description:** Every frame read allocates a fresh 4-byte length buffer and a payload `byte[]`; every write allocates `ToByteArray()` plus a 4-byte prefix. On the hot event-drain path (batches of up to 128 `WorkerEvent` frames every 25 ms) this produces steady gen-0 garbage. `WorkerFrameWriter` also effectively serializes twice (`CalculateSize()` then `ToByteArray()`). + +**Recommendation:** Reuse a pooled buffer / `ArrayPool` for the length prefix and payload, and write directly into a pooled buffer using `CodedOutputStream`. Low priority unless event throughput is high. + +**Resolution:** 2026-05-18 — `WorkerFrameWriter.WriteAsync` now serializes the envelope exactly once into a single frame buffer that carries the 4-byte length prefix followed by the payload, via `envelope.WriteTo(new Span<byte>(frame, sizeof(uint), payloadLength))`. This eliminates the redundant second serialization pass (`ToByteArray()` re-runs `CalculateSize()` internally), the separate length-prefix array, and the separate prefix `WriteAsync`/extra `FlushAsync` round. `WorkerFrameReader.ReadAsync` now rents its payload buffer from `ArrayPool<byte>.Shared` and returns it in a `finally` once `WorkerEnvelope.Parser.ParseFrom(payload, 0, length)` has copied what it needs; `ReadExactlyOrThrowAsync` gained an explicit `count` parameter so it honours the logical frame length rather than the (possibly larger) rented buffer length. The 4-byte length-prefix buffer is left as a per-call stack-sized allocation — pooling a 4-byte array is not worthwhile.
Verified by the new regression test `WorkerFrameProtocolTests.ReadAsync_WithVaryingFrameSizes_ParsesEachFrameExactly`, which reads a large frame followed by a small frame through one reader to prove the pooled buffer is sliced to each frame's own length and never leaks stale trailing bytes; the existing round-trip, malformed-payload, and concurrent-write tests continue to pass. + +### Worker-010 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker/Conversion/VariantConverter.cs:204-226` | +| Status | Resolved | + +**Description:** `ConvertInt64Scalar` is reached for `TypeCode.UInt32` and `TypeCode.Int64`. For a `uint` with `expectedDataType == MxDataType.Time`, the value is treated as a Windows `FILETIME` via `DateTime.FromFileTimeUtc(longValue)`; a 32-bit FILETIME is never a valid full FILETIME, so this silently produces a near-epoch timestamp rather than a raw/diagnostic value. Unlikely in practice but a silent misconversion. + +**Recommendation:** Only apply the `MxDataType.Time` FILETIME projection for 64-bit source types; for `uint` fall through to integer or raw. + +**Resolution:** 2026-05-18 — `ConvertInt64Scalar`'s `MxDataType.Time` FILETIME projection is now gated on `value is long`. A genuine 64-bit `long` still projects to a `Timestamp` via `DateTime.FromFileTimeUtc`; a 32-bit `uint` — which can only hold the low half of a FILETIME — now falls through to the integer projection (`DataType = Integer`, `Int64Value`) instead of silently producing a bogus near-1601 timestamp. Verified by the regression test `VariantConverterTests.Convert_WithUInt32AndExpectedTime_DoesNotProjectFileTime`; the existing `Convert_WithFileTimeAndExpectedTime_ProjectsTimestamp` (a `long` FILETIME) continues to pass, confirming the 64-bit path is unchanged. 
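
The Worker-010 gate can be illustrated with a small sketch. `FileTimeProjection.TryProjectFileTime` is a hypothetical helper, not the real `ConvertInt64Scalar`; it only shows why the projection is conditioned on `value is long`.

```csharp
using System;

internal static class FileTimeProjection
{
    // Hypothetical helper: projects to a UTC timestamp only for a 64-bit source value.
    internal static DateTime? TryProjectFileTime(object value)
    {
        // A uint can hold only the low 32 bits of a FILETIME, so it is never projected.
        if (value is long fileTime)
        {
            return DateTime.FromFileTimeUtc(fileTime);
        }

        return null; // caller falls through to the integer projection
    }
}

internal static class Program
{
    private static void Main()
    {
        long full = new DateTime(2026, 5, 18, 0, 0, 0, DateTimeKind.Utc).ToFileTimeUtc();

        Console.WriteLine(FileTimeProjection.TryProjectFileTime(full).HasValue);          // long: projected
        Console.WriteLine(FileTimeProjection.TryProjectFileTime(uint.MaxValue).HasValue); // uint: not projected

        // For contrast, widening even the largest uint lands a few minutes after 1601-01-01.
        Console.WriteLine(DateTime.FromFileTimeUtc(uint.MaxValue).Year);
    }
}
```

The largest `uint` is about 4.29 billion 100-nanosecond ticks, roughly 429 seconds, which is why an ungated projection could only ever yield a near-1601 timestamp.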
+ +### Worker-011 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker/Ipc/WorkerPipeClient.cs:169-171` | +| Status | Resolved | + +**Description:** `retryAttempts` is computed as `(connectTimeout / min(connectTimeout, attemptTimeout)) - 1`. With defaults (30000 / 2000) this yields 14 retries, but each retry also incurs Polly exponential backoff. The overall `connectDeadline` (`CancelAfter(connectTimeout)`) is the real bound, so the computed attempt count can be larger or smaller than the time budget allows, and the formula is opaque. + +**Recommendation:** Drive retries purely off the `connectDeadline` token (Polly stops when cancelled) and drop the fragile attempt-count arithmetic, or add a comment explaining the intent. + +**Resolution:** 2026-05-18 — The opaque `retryAttempts` arithmetic in `ConnectWithRetryAsync` was removed. `MaxRetryAttempts` is now `int.MaxValue`, so the retry loop is bounded solely by the `connectDeadline` linked token (`CancelAfter(_connectTimeoutMilliseconds)`): Polly stops retrying the moment that token is cancelled, making the overall connect timeout the single source of truth and correctly accounting for the exponential backoff between attempts (which the old formula ignored). A comment documents the intent. No new test was added — the change does not alter observable behavior (the deadline was always the real bound; the old formula always permitted more attempts than fit the budget), and the existing `WorkerPipeClientTests.RunAsync_RetriesUntilPipeServerAppears` (server appears mid-retry) and `RunAsync_WhenPipeNeverAppears_ThrowsTimeoutException` (deadline ends the loop) already cover both retry-until-success and deadline-bounded termination. 
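
The deadline-as-single-bound idea from the Worker-011 resolution can be sketched without Polly. This swaps in a hand-written loop purely for illustration; the shipped code uses a Polly retry with `MaxRetryAttempts = int.MaxValue`. `DeadlineRetry.ConnectAsync`, the backoff constants, and the driver are assumptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

internal static class DeadlineRetry
{
    // Retries connectOnce until it reports success or the overall deadline elapses.
    // Returns the number of attempts made. No attempt count is configured anywhere.
    internal static async Task<int> ConnectAsync(
        Func<CancellationToken, Task<bool>> connectOnce,
        TimeSpan connectTimeout,
        CancellationToken cancellationToken = default)
    {
        using (var deadline = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken))
        {
            deadline.CancelAfter(connectTimeout);
            var backoff = TimeSpan.FromMilliseconds(10);
            try
            {
                for (var attempt = 1; ; attempt++)
                {
                    if (await connectOnce(deadline.Token).ConfigureAwait(false))
                    {
                        return attempt;
                    }

                    // The delay observes the deadline, so backoff time counts against the budget.
                    await Task.Delay(backoff, deadline.Token).ConfigureAwait(false);
                    backoff = TimeSpan.FromMilliseconds(Math.Min(backoff.TotalMilliseconds * 2, 500));
                }
            }
            catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
            {
                // Only the deadline fired: surface it as a timeout.
                throw new TimeoutException($"Connect did not succeed within {connectTimeout}.");
            }
        }
    }
}

internal static class Program
{
    private static async Task Main()
    {
        var calls = 0;
        var attempts = await DeadlineRetry.ConnectAsync(
            _ => Task.FromResult(++calls >= 3), TimeSpan.FromSeconds(5));
        Console.WriteLine($"connected after {attempts} attempts");

        try
        {
            await DeadlineRetry.ConnectAsync(_ => Task.FromResult(false), TimeSpan.FromMilliseconds(200));
        }
        catch (TimeoutException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```

Because both the attempt and the backoff delay observe the same linked token, the time spent backing off is charged to the connect budget automatically, which the removed attempt-count formula did not account for.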
+ +### Worker-012 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/MxGateway.Worker/MxAccess/MxAccessAlarmEventSink.cs:44-55`, `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:38-43`, `src/MxGateway.Worker/MxAccess/MxAccessEventMapper.cs:106-112` | +| Status | Resolved | + +**Description:** Multiple comments describe the alarm path as not-yet-wired future work ("PR A.2 — COM-side subscription scaffold … the worker advertises no alarm subscription", "the worker bootstrap will gain a thin 'run-on-STA' wrapper as part of A.3"). As of commit 6c64030 the alarm command handler, STA poll loop, and `SubscribeAlarms`/`AcknowledgeAlarm`/`QueryActiveAlarms` are all wired. These comments are stale and misleading. + +**Recommendation:** Update the XML docs/comments to describe the shipped behavior; remove the "future PR" framing. + +**Re-triage:** The `WnWrapAlarmConsumer.cs:38-43` citation is inaccurate — those lines were rewritten by Worker-001 and already describe the shipped no-internal-timer threading model correctly; nothing stale there. Conversely, two stale comments the finding did *not* cite were found on the same alarm path and fixed under the same root cause: `AlarmDispatcher.cs`'s XML doc comment still framed the dispatcher as "the in-process slice of A.3" with a "companion follow-up PR" adding the (now-shipped) `SubscribeAlarmsCommand`/`AcknowledgeAlarmCommand`/`QueryActiveAlarmsCommand`, and stated the consumer "polls on a `System.Threading.Timer` thread today" — a claim made false by Worker-001's removal of that timer; and `AlarmCommandHandler.cs`'s XML doc comment likewise asserted "the wnwrap consumer's polling timer fires on a thread-pool thread".
The discovery document `docs/AlarmClientDiscovery.md` (referenced by the source comments) was deliberately left untouched: it is a historical research log of the investigation that chose the shipped design, not API/contract/lifecycle prose, and the source comments cite only its still-accurate "Option A — captured" payload schema. + +**Resolution:** 2026-05-18 — Rewrote the stale alarm-path comments to describe shipped behavior with no "future PR / A.2 / A.3" framing. `MxAccessAlarmEventSink`: the class XML doc comment and the `Attach` comment now explain that `AlarmDispatcher` owns the consumer→sink→queue wire-up and that `Attach` carries only the session id (no COM-event subscription is needed because the polled wnwrap consumer raises transition events itself). `MxAccessEventMapper.CreateOnAlarmTransition`'s XML summary now states the worker drives it from `MxAccessAlarmEventSink.EnqueueTransition` once `AlarmDispatcher` decodes a wnwrap transition. The `AlarmDispatcher` and `AlarmCommandHandler` XML doc comments were corrected to describe the shipped command surface and the no-internal-timer / STA-driven polling model (the `System.Threading.Timer` claims were factually wrong post-Worker-001). Pure documentation change — no behavior altered, no test needed; the build stays green. + +### Worker-013 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/MxGateway.Worker/Sta/StaMessagePump.cs` | +| Status | Resolved | + +**Description:** `StaMessagePump` — the heart of COM event delivery (`MsgWaitForMultipleObjectsEx` + `PeekMessage`/`DispatchMessage`) — has no direct unit tests. `StaRuntimeTests` exercises it indirectly for command wake-up but never verifies that a posted Windows message actually wakes the wait and is dispatched, nor that `PumpPendingMessages` returns a correct count. The alarm poll-loop lifecycle in `MxAccessStaSession` (start/cancel/await on shutdown) also has no test. These are the most failure-sensitive paths in the module.
+ +**Recommendation:** Add tests that post a message to the STA thread and assert it is pumped, and tests covering alarm poll-loop start/stop and shutdown ordering. + +**Re-triage:** This finding is stale as of the reviewed branch — the coverage it asks for already exists. `src/MxGateway.Worker.Tests/Sta/StaMessagePumpTests.cs` contains direct `StaMessagePump` tests covering null-argument validation, waking on a signalled event, returning on timeout, the zero-timeout conversion branch, `PumpPendingMessages` returning the correct count for messages posted to the STA thread (`PumpPendingMessages_MessagesPostedToStaThread_ReturnsCountProcessed`, `PumpPendingMessages_NoMessagesPosted_ReturnsZero`), and `WaitForWorkOrMessages` waking on a posted Windows message (`WaitForWorkOrMessages_WindowsMessagePosted_ReturnsForInputAvailable`) — exactly the "post a message and assert it is pumped" test the recommendation asks for. The alarm poll-loop lifecycle is covered by `MxAccessStaSessionTests.StartAsync_WithAlarmCommandHandlerFactory_PollOnceCalledViaSta` (start → poll runs on the STA) and `Dispose_StopsAlarmPollLoop` (Dispose joins the poll task; no further polls). The finding was raised against a stale view of the test project; no source or test change is required. Re-triaged as already resolved rather than fixed. + +**Resolution:** 2026-05-18 — No code change. Re-triaged: the requested direct `StaMessagePump` tests (including posted-message dispatch and pump count) and the alarm poll-loop start/stop lifecycle tests already exist in `StaMessagePumpTests.cs` and `MxAccessStaSessionTests.cs`. See the re-triage note above for the specific test names. 
+ +### Worker-014 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:33`, `:202` | +| Status | Resolved | + +**Description:** The file declares two public types — the `AlarmCommandHandler` class and the `IAlarmCommandHandler` interface. The C# style guide and the rest of the module follow one-public-type-per-file (e.g. interfaces in their own `I*.cs` files like `IMxAccessAlarmConsumer.cs`). + +**Recommendation:** Move `IAlarmCommandHandler` to its own `IAlarmCommandHandler.cs` for consistency. + +**Resolution:** 2026-05-18 — The `IAlarmCommandHandler` interface (with its XML docs) was moved verbatim out of `AlarmCommandHandler.cs` into a new `src/MxGateway.Worker/MxAccess/IAlarmCommandHandler.cs`, with its own `using` directives (`System`, `System.Collections.Generic`, `MxGateway.Contracts.Proto`). `AlarmCommandHandler.cs` now declares one public type, matching the module's one-public-type-per-file convention (cf. `IMxAccessAlarmConsumer.cs`). Pure file-organization change — no API surface, behavior, or namespace changed; no test needed. The worker build is clean with zero warnings (no unused usings left behind in `AlarmCommandHandler.cs`). + +### Worker-015 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker/MxAccess/MxAccessEventQueue.cs:115-145` | +| Status | Resolved | + +**Description:** On overflow, `Enqueue` records the overflow fault and throws `MxAccessEventQueueOverflowException`; `MxAccessBaseEventSink.EnqueueEvent` catches it and calls `RecordFault` again. `RecordFault` is a no-op when a fault already exists, so the second call is harmless — but the intent is muddled, and there is no test asserting the dropped-event behavior. This is acceptable per the fail-fast design but undocumented at the call site. 
+ +**Recommendation:** Add a brief comment in `EnqueueEvent` clarifying that an overflow exception is expected and already self-records its fault, so the catch is intentionally a near no-op. + +**Resolution:** 2026-05-18 — Added a comment in `MxAccessBaseEventSink.EnqueueEvent`'s catch block (per the finding's recommendation) explaining that two distinct fail-fast failures land there: a conversion failure from `createEvent()` (recorded here as an `MxaccessEventConversionFailed` fault) and an `MxAccessEventQueueOverflowException` from `Enqueue` at capacity, which — per the fail-fast backpressure design in `docs/DesignDecisions.md` — drops the event and has *already* self-recorded a `QueueOverflow` fault inside `Enqueue`. Because `MxAccessEventQueue.RecordFault` keeps only the first fault, the catch's `RecordFault` call is then a deliberate near no-op rather than a second, conflicting fault. Pure comment change as recommended — no behavior altered. `docs/DesignDecisions.md` already documents the fail-fast event backpressure rule, so no doc change was required. + +### Worker-016 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:261-265` | +| Status | Resolved | + +**Description:** `RunAlarmPollLoopAsync` catches `InvalidOperationException` and silently returns with the rationale "STA runtime shutting down — stop the loop gracefully". The same catch arm, however, also swallows the `InvalidOperationException` thrown by `EnsureOnAlarmConsumerThread()` / `AssertOnAlarmConsumerThread()` — the STA-affinity guard added under Worker-008. If the alarm poll ever ran on the wrong thread (a regression of the STA-affinity invariant), the assertion would fire, the loop would silently stop, no fault would be recorded, and the only observable symptom would be alarms no longer flowing. The assertion exists to catch a programming error early; this catch defeats it. 
+ +**Recommendation:** Either tighten the `InvalidOperationException` catch so it only swallows the STA-runtime-shutting-down sentinel (e.g. match on the exception message produced by `StaRuntime.InvokeAsync`, or have the STA runtime throw a dedicated exception type for shutdown), or rethrow / record-a-fault for `InvalidOperationException`s whose message does not match the shutdown sentinel. Add a regression test that drives `RunAlarmPollLoopAsync` with a handler that throws `InvalidOperationException` from `PollOnce` and asserts the loop records a fault rather than silently exiting. + +**Resolution:** 2026-05-20 — Introduced a dedicated `StaRuntimeShutdownException` (`src/MxGateway.Worker/Sta/StaRuntimeShutdownException.cs`) that `StaRuntime.InvokeAsync` and the queue-enqueue path now throw in place of a generic `InvalidOperationException` when `shutdownRequested` is set. `RunAlarmPollLoopAsync` in `MxAccessStaSession.cs:258-291` now catches `StaRuntimeShutdownException` (graceful stop, returns silently) separately from the generic `Exception` arm, which records the fault on the event queue. An STA-affinity `InvalidOperationException` from `EnsureOnAlarmConsumerThread` therefore now falls through to the fault path and becomes observable on the IPC fault path instead of silently terminating alarm delivery. Verified: `dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86` clean (0 warnings). Regression coverage in `MxAccessStaSessionTests.cs` exercises both the graceful-shutdown and the affinity-violation paths. + +### Worker-017 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Worker/Sta/StaRuntime.cs:280-288`, `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:602-631` | +| Status | Resolved | + +**Description:** `StaRuntime.ProcessQueuedCommands` calls `MarkActivity()` only before and after `workItem.Execute()`. 
For a command that synchronously holds the STA for longer than `WorkerPipeSessionOptions.HeartbeatGrace` (default 15s) — e.g. `ReadBulk` with many uncached tags, each waiting up to its per-tag `TimeoutMs` (default 1000 ms) — no `MarkActivity()` runs during the wait, `LastActivityUtc` stays frozen, and `ReportWatchdogFaultIfNeededAsync` fires an `StaHung` fault. The heartbeat itself reports `WorkerState.ExecutingCommand` with the live `CurrentCommandCorrelationId`, so the worker actually knows it is executing a command rather than hung — but the watchdog branch only checks `staleFor > HeartbeatGrace` and ignores the in-flight command. A legitimate slow bulk read then self-faults and tears the session down. + +**Recommendation:** Either (a) extend `WorkerPipeSession.ReportWatchdogFaultIfNeededAsync` to skip the `StaHung` fault when the snapshot's `CurrentCommandCorrelationId` is non-empty (the worker is executing a command, not hung), or (b) thread a `MarkActivity`-style callback into the bulk-read `pumpStep` so long synchronous STA operations periodically refresh `LastActivityUtc`. Option (a) is the smaller surface — the heartbeat already carries enough signal for the gateway to decide the command is just slow. Either way, the design intent (watchdog catches a hung STA, not a slow command) should be documented on `ReportWatchdogFaultIfNeededAsync`. + +**Resolution:** 2026-05-20 — Applied option (a): `WorkerPipeSession.ReportWatchdogFaultIfNeededAsync` (`src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:602-645`) now returns early when `snapshot.CurrentCommandCorrelationId` is non-empty — the STA is busy executing a known command, not hung, and the heartbeat already surfaces the correlation id so the gateway can decide whether the command is too slow against its own per-command timeout. The next `MarkActivity()` after the command returns lifts `LastActivityUtc` and the watchdog resumes normal operation. 
A new XML doc comment on the method records the design intent (watchdog catches a hung STA, not a slow command). Verified: `dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86` clean. Regression coverage added in `WorkerPipeSessionTests.cs`. + +### Worker-018 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:160-161` | +| Status | Resolved | + +**Description:** `Subscribe` calls `com.SetXmlAlarmQuery(xmlQuery)` and discards the return value. The block-level comment immediately above states that this call is empirically required for subsequent `GetXmlCurrentAlarms2` to succeed — i.e. it is on the critical path of the alarm subscription. Every other AVEVA-COM call in the same method (`InitializeConsumer`, `RegisterConsumer`, `Subscribe`, `AlarmAckByName`, etc.) is gated on a `!= 0` return-code check and throws `InvalidOperationException` on failure. If `SetXmlAlarmQuery` ever returns non-zero (or otherwise fails non-fatally), the consumer reaches `subscribed = true` with the wnwrap state misconfigured, and the next `PollOnce` fails with the same `E_FAIL` the comment warns about — without any indication where the regression lies. + +**Recommendation:** Either (a) check the `SetXmlAlarmQuery` return code and treat a non-zero value as a subscription failure (matching the other call-gates in the method) or (b) document explicitly in the comment that `SetXmlAlarmQuery`'s return code is meaningless on this AVEVA build (referencing `docs/AlarmClientDiscovery.md` if so). At minimum capture the return value in a local for diagnostic purposes so a future failure is easier to triage. 
+ +**Re-triage:** The finding's framing assumed an integer return code; inspection of the `Interop.WNWRAPCONSUMERLib` assembly confirmed `SetXmlAlarmQuery` is declared `Void SetXmlAlarmQuery(System.String)` on all three flavors (`IwwAlarmConsumer`, `IwwAlarmConsumer2`, `wwAlarmConsumerClass`). There is no integer return code to gate on. A genuine failure can only surface as a `COMException` mapped from the underlying HRESULT, so the fix wraps the call to translate that into the same `InvalidOperationException` failure-shape used by every other call-gate in `Subscribe`, with the HRESULT included in the diagnostic message. + +**Resolution:** 2026-05-20 — `WnWrapAlarmConsumer.Subscribe` now wraps the `com.SetXmlAlarmQuery(xmlQuery)` call in a `try`/`catch (COMException ex)` that throws an `InvalidOperationException` carrying the HRESULT (`$"wwAlarmConsumer.SetXmlAlarmQuery failed with HRESULT 0x{ex.HResult:X8}; subsequent GetXmlCurrentAlarms2 polls would return E_FAIL."`) with the original `COMException` as `InnerException`. A previously silent failure that left `subscribed = true` with misconfigured wnwrap state — and produced an opaque `E_FAIL` from the next `PollOnce` with no indication where the regression lay — now surfaces as a subscription failure at the `Subscribe` call-site, matching the existing v1-lifecycle failure shape. The block comment was extended to record that the interop signature returns `void` (no integer return code to gate on like the sibling v1 calls) so a future maintainer doesn't try to add one. No new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent; the change is structurally analogous to the existing `Initialize/Register/Subscribe` call-gates and is exercised end-to-end by the live alarm smoke path. 
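The failure-translation shape described in the Worker-018 resolution can be modelled outside C#. The sketch below is a minimal Python illustration, not the worker's code: `ComError`, `SubscriptionError`, and `set_query_or_fail` are hypothetical stand-ins for `COMException`, `InvalidOperationException`, and the wrapped call in `Subscribe`. It assumes the HRESULT arrives as a signed 32-bit integer, as it does in .NET.

```python
class ComError(Exception):
    """Hypothetical stand-in for .NET's COMException; carries a signed HRESULT."""

    def __init__(self, hresult: int) -> None:
        super().__init__(f"COM call failed (HRESULT {hresult})")
        self.hresult = hresult


class SubscriptionError(Exception):
    """Hypothetical stand-in for the InvalidOperationException failure shape."""


def set_query_or_fail(set_xml_alarm_query, xml_query: str) -> None:
    """Call a void COM-style method and translate its failure.

    The callee returns nothing, so there is no return code to gate on;
    a failure can only arrive as an exception. It is re-raised in the same
    shape as the other call-gates, with the original error kept as the cause.
    """
    try:
        set_xml_alarm_query(xml_query)
    except ComError as ex:
        # Mask to 32 bits so a negative HRESULT prints as unsigned hex.
        raise SubscriptionError(
            f"SetXmlAlarmQuery failed with HRESULT 0x{ex.hresult & 0xFFFFFFFF:08X}"
        ) from ex
```

`raise ... from ex` plays the role of `InnerException` here: the low-level error stays attached for triage while callers only handle one failure type.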
 + +### Worker-019 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:59`, `:188` | +| Status | Resolved | + +**Description:** `WnWrapAlarmConsumer` declares `private string subscriptionExpression = string.Empty;` and assigns it once inside `Subscribe` (line 188), but never reads it. It is dead state — neither `PollOnce`, `AcknowledgeByName`, `AcknowledgeByGuid`, `SnapshotActiveAlarms`, nor `Dispose` consults it. Either it is genuinely unused (delete it) or it was intended to support a not-yet-implemented feature (e.g. re-subscribing after a transient failure, or echoing the subscription back through `IsSubscribed`/`SubscriptionExpression`), in which case the intent should be wired up or documented. + +**Recommendation:** Delete the field, which is the safest option (`TreatWarningsAsErrors=true` will continue to permit the field as long as it is assigned somewhere, so nothing forces the cleanup). Alternatively, promote it to a read-only `SubscriptionExpression` property so smoke tests can assert which subscription is active without touching wnwrap state. If a future use is expected, file a follow-up issue. + +**Resolution:** 2026-05-20 — Deleted the dead `private string subscriptionExpression = string.Empty;` field declaration and its sole assignment inside `Subscribe` (`subscriptionExpression = subscription;`). The field had no readers and was pure write-only state. Pure cleanup — no behaviour change, no public API surface affected. The worker build remains clean with zero warnings under `TreatWarningsAsErrors=true`. + +### Worker-020 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:405`, `:423` | +| Status | Resolved | + +**Description:** `ProcessCommandAsync` decides whether to write a command reply with `if (_state is not WorkerState.Ready and not WorkerState.ExecutingCommand)`. 
The `ExecutingCommand` arm is dead: `_state` is only ever assigned `Starting`, `Handshaking`, `InitializingSta`, `Ready`, `ShuttingDown`, `Faulted`, or `Stopped`. The string `WorkerState.ExecutingCommand` appears nowhere as a target of `_state = ...`. The `WorkerState.ExecutingCommand` value is synthesized only in `CreateHeartbeat` (line 811) when a command is in flight, so it never leaks back into `_state`. The check is effectively `_state is not WorkerState.Ready`. The intent is unclear: either the check should also accept the live "is executing" condition (which today is implicit via `_state == Ready` plus a non-empty `CurrentCommandCorrelationId` from the dispatcher), or the dead arm should be removed for clarity. + +**Recommendation:** Simplify the check to `if (_state != WorkerState.Ready)` to match the actual state machine, and update the dropped-reply log fields accordingly. Alternatively, introduce an explicit `WorkerState.ExecutingCommand` transition (set when a command starts dispatching, restored to `Ready` on completion) so the check matches its name. The simpler fix is the former. + +**Resolution:** 2026-05-20 — Both occurrences of the `_state is not WorkerState.Ready and not WorkerState.ExecutingCommand` check in `ProcessCommandAsync` (the post-`DispatchAsync` success path and the exception path) were simplified to `_state != WorkerState.Ready`. The `ExecutingCommand` arm was dead — `_state` is never written that value; only `CreateHeartbeat` synthesizes it on the wire when `CurrentCommandCorrelationId` is non-empty. A comment was added at the success-path site documenting the assignment-set of `_state` and why `Ready` is the only command-serving state. No behavioural change — `_state` could never be `ExecutingCommand` at that read, so the simplification preserves the same effective decision while removing the misleading dead arm. No new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent. 
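The distinction Worker-020 relies on, a stored state versus a state synthesized only for the heartbeat, can be shown with a small model. This is a hedged Python sketch rather than the worker's C#: the `Session` class and its method names are hypothetical, and the rule that the heartbeat reports `EXECUTING_COMMAND` whenever a correlation id is present is an assumption drawn from the finding text.

```python
from enum import Enum, auto


class WorkerState(Enum):
    STARTING = auto()
    HANDSHAKING = auto()
    INITIALIZING_STA = auto()
    READY = auto()
    EXECUTING_COMMAND = auto()  # never stored; only reported on the wire
    SHUTTING_DOWN = auto()
    FAULTED = auto()
    STOPPED = auto()


class Session:
    """Hypothetical model of the stored state and the heartbeat view."""

    def __init__(self) -> None:
        self.state = WorkerState.READY
        self.current_command_id = ""

    def heartbeat_state(self) -> WorkerState:
        # Assumption: an in-flight command is reported as EXECUTING_COMMAND
        # without ever assigning that value to self.state.
        if self.current_command_id:
            return WorkerState.EXECUTING_COMMAND
        return self.state

    def may_write_reply(self) -> bool:
        # READY is the only stored state that serves commands, so checking
        # for EXECUTING_COMMAND here would be a dead arm.
        return self.state is WorkerState.READY
```

Because `EXECUTING_COMMAND` exists only in the heartbeat view, the reply gate needs to test the stored state against `READY` alone.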
+ +### Worker-021 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:111-118`, `:790-805`, `:136-139` | +| Status | Resolved | + +**Description:** `RunAsync` constructs the runtime session through `_runtimeSession = _runtimeSessionFactory()` (line 111) and immediately calls `CompleteStartupHandshakeAsync(token => _runtimeSession.StartAsync(...))`. That path is fine. However the public parameterless `CompleteStartupHandshakeAsync()` (line 136) routes through `InitializeMxAccessAsync` (line 790), which unconditionally reassigns `_runtimeSession = new MxAccessStaSession(eq => new AlarmCommandHandler(eq));` — overwriting whatever the factory put there. If anything ever calls `CompleteStartupHandshakeAsync()` after `RunAsync` has already begun, the factory-supplied session is leaked (no `Dispose` is called on the old instance) and a fresh hard-coded `MxAccessStaSession` is started instead. Today no production code path triggers this, but the API surface is public and dangerous — a test or a refactor could trip it. + +**Recommendation:** Either (a) make `InitializeMxAccessAsync` a no-op if `_runtimeSession` is already non-null (treat the existing instance as authoritative and only call its `StartAsync`), or (b) make the parameterless `CompleteStartupHandshakeAsync()` and `InitializeMxAccessAsync` `internal` / remove them, since the production path is the factory-driven one in `RunAsync`. Option (b) is cleaner: the parameterless overload is dead in production. 
+ +**Resolution:** 2026-05-20 — Applied option (a): `InitializeMxAccessAsync` now uses `_runtimeSession ??= new MxAccessStaSession(eq => new AlarmCommandHandler(eq));`, so the existing factory-supplied instance from `RunAsync` is treated as authoritative and only the fall-back direct-invocation path (where the parameterless `CompleteStartupHandshakeAsync` is called without a prior factory call) constructs the hard-coded `MxAccessStaSession`. The `StartAsync` call and the `catch`-and-dispose path now operate on a local `session` captured from `_runtimeSession`, so a startup failure still disposes the runtime regardless of which path supplied it. A comment in `InitializeMxAccessAsync` documents the reasoning. Option (a) was preferred over (b) because the parameterless `CompleteStartupHandshakeAsync` overload is part of the existing public API surface and tightening it to `internal` would be a contract change with no production driver requesting it. No new regression test was added in this agent because Worker.Tests is being modified by a concurrent agent; the change is exercised end-to-end by the existing `RunAsync` factory path which now goes through the null-coalescing assignment instead of an unconditional `new`. + +### Worker-022 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/MxGateway.Worker/MxAccess/MxAlarmSnapshot.cs:12`, `:26`, `:49` | +| Status | Resolved | + +**Description:** `MxAlarmSnapshot.cs` declares three public types in one file: the `MxAlarmStateKind` enum, the `MxAlarmSnapshotRecord` class, and the `MxAlarmTransitionEvent` class. The C# style guide (`docs/style-guides/CSharpStyleGuide.md:68`) requires one public type per file unless a small nested type is clearer. The recently resolved Worker-014 split `IAlarmCommandHandler` out of `AlarmCommandHandler.cs` for exactly this reason — the same convention applies here. 
+ +**Recommendation:** Move `MxAlarmStateKind` and `MxAlarmTransitionEvent` into their own files (`MxAlarmStateKind.cs`, `MxAlarmTransitionEvent.cs`) and leave `MxAlarmSnapshotRecord` in `MxAlarmSnapshot.cs` (or rename the file to `MxAlarmSnapshotRecord.cs` to match the surviving type). Pure file-organization change; no behaviour or namespace impact. + +**Resolution:** 2026-05-20 — Split `MxAlarmSnapshot.cs` into three files, each declaring one public type and keeping the original `MxGateway.Worker.MxAccess` namespace so existing usages are unaffected: `MxAlarmStateKind.cs` (the enum, with its XML doc), `MxAlarmTransitionEvent.cs` (the `EventArgs` subclass, with its `PreviousState` doc), and `MxAlarmSnapshot.cs` (now containing only `MxAlarmSnapshotRecord` plus its XML doc). Matches the one-public-type-per-file convention re-affirmed by Worker-014's `IAlarmCommandHandler` split. Pure file-organization change — no API, namespace, or behaviour change; build is clean. + +### Worker-023 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:610-668`, `src/MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:124-153` | +| Status | Resolved | + +**Description:** Worker-017 (resolved at `a020350`) suppresses the `StaHung` watchdog when `CurrentCommandCorrelationId` is non-empty: "the STA is busy executing a command, not hung." The fix is correct for the motivating case (legitimately slow `ReadBulk` against many uncached tags) — gateway-side per-command timeouts (`WorkerClient.InvokeAsync`'s `timeout` parameter, see `src/MxGateway.Server/Workers/WorkerClient.cs:189-218`) eventually fail the command and may kill the worker. 
**But the suppression has no defensive ceiling.** Most MXAccess commands in `MxAccessCommandExecutor` — `Register`, `AddItem`, `Advise`, `Write`, `WriteSecured`, and their bulk variants — call directly into the MXAccess COM object **with no internal deadline**. If a COM call hangs (e.g. the MXAccess provider crashed and the cross-apartment marshaler is permanently blocked, or a write completion never fires), `StaRuntime.ProcessQueuedCommands` is stuck inside `workItem.Execute()`, `StaCommandDispatcher.currentCommandCorrelationId` stays non-empty forever, and `ReportWatchdogFaultIfNeededAsync` will short-circuit on every heartbeat. The worker-side `StaHung` watchdog — the only signal that distinguishes a hung STA from a slow gateway response from inside the worker — is permanently defeated for that session. Gateway-side `CommandTimeout` is the safety net, but it depends on the gateway operator picking a sensible per-command timeout (some bulk operations legitimately set this to many minutes), and it does not surface a worker-originated diagnostic (`StaHung` fault category, `LastStaActivityUtc` value) to the gateway audit trail. + +**Recommendation:** Add a defensive upper bound, distinct from `HeartbeatGrace`, after which the watchdog fires even when a command is in flight — e.g. `HeartbeatStuckCeiling` (default 5× `HeartbeatGrace` = 75s, or align with the longest reasonable per-command timeout). Pseudocode for the in-flight branch: + +```csharp +if (!string.IsNullOrEmpty(snapshot.CurrentCommandCorrelationId) + && staleFor <= _sessionOptions.HeartbeatStuckCeiling) +{ + return; // slow command — gateway will time out if needed +} +// staleFor > ceiling OR no command in flight — fire StaHung +``` + +Document the ceiling in `MxAccessWorkerInstanceDesign.md`'s watchdog section. Add a regression test that drives `RunAsync` with `CurrentCommandCorrelationId` non-empty and `LastStaActivityUtc` stale beyond the ceiling, asserting `WorkerFaultCategory.StaHung` is emitted. 
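The bounded suppression proposed above can also be read as a pure decision function, which makes the boundary cases easy to check. The following is a minimal Python model under stated assumptions, not the worker's implementation: `should_report_sta_hung` is a hypothetical name, and the defaults mirror the 15s `HeartbeatGrace` and the 75s `HeartbeatStuckCeiling` quoted in this finding.

```python
from datetime import timedelta


def should_report_sta_hung(
    stale_for: timedelta,
    current_command_id: str,
    heartbeat_grace: timedelta = timedelta(seconds=15),
    stuck_ceiling: timedelta = timedelta(seconds=75),
) -> bool:
    """Return True when the watchdog should emit a StaHung fault."""
    if stale_for <= heartbeat_grace:
        # Recent STA activity: nothing to report.
        return False
    if current_command_id and stale_for <= stuck_ceiling:
        # A command is in flight and still under the ceiling: treat it as
        # slow, and leave it to the gateway's per-command timeout.
        return False
    # Either no command is in flight, or the in-flight command has
    # exceeded the ceiling and is treated as a hung STA.
    return True
```

With these defaults, a command that has been silent for 20 seconds is still suppressed, while one silent for 80 seconds fires the fault; an idle STA fires as soon as the grace period is exceeded.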
+ +**Resolution:** 2026-05-20 — Added `WorkerPipeSessionOptions.HeartbeatStuckCeiling` (default 75s = 5 × `HeartbeatGrace`) and extended `WorkerPipeSession.ReportWatchdogFaultIfNeededAsync` so the in-flight-command suppression is bounded by the ceiling: once `staleFor > HeartbeatStuckCeiling` the watchdog fires `StaHung` even with `CurrentCommandCorrelationId` non-empty. A truly stuck synchronous COM call (dead provider, blocked marshaler) no longer permanently defeats the worker-side watchdog. The ceiling is validated at startup (`> 0` and `> HeartbeatGrace`). Documented in the new XML doc on `HeartbeatStuckCeiling` and in `docs/MxAccessWorkerInstanceDesign.md`'s "Heartbeat And Watchdog" section. Regression test `WorkerPipeSessionTests.RunAsync_WhenStaActivityIsStaleBeyondCeilingWithCommandInFlight_WritesWatchdogFault` drives `RunAsync` with a non-empty current-command id and stale activity beyond the ceiling, asserting `WorkerFaultCategory.StaHung` is emitted. The existing `RunAsync_WhenStaActivityIsStaleWithCommandInFlight_DoesNotWriteWatchdogFault` test (5s stale, default 75s ceiling) continues to pass, confirming the suppression still works within the ceiling. + +### Worker-024 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `src/MxGateway.Worker/MxAccess/AlarmCommandHandler.cs:63-187`, `src/MxGateway.Worker/MxAccess/MxAccessStaSession.cs:191-323` | +| Status | Resolved | + +**Description:** Worker-008 (resolved 2026-05-18) introduced `MxAccessStaSession.AssertOnAlarmConsumerThread(int?, int)`, called from `EnsureOnAlarmConsumerThread()` in the marshalled poll lambda at `RunAlarmPollLoopAsync` (`MxAccessStaSession.cs:247`). The assertion catches a regression that runs `IMxAccessAlarmConsumer.PollOnce()` off the STA — exactly the deadlock-on-cross-apartment-marshaling risk the `ThreadingModel=Apartment` wnwrap consumer demands. 
**However, the assertion guards only the poll path.** `AlarmCommandHandler.Subscribe`, `Acknowledge`, `AcknowledgeByName`, `QueryActive`, and `Unsubscribe` — each of which calls into the same `IMxAccessAlarmConsumer` and ultimately the COM object — have no equivalent guard. Today they are reached only through `MxAccessCommandExecutor.Execute` → `StaCommandDispatcher.ExecuteQueuedCommandAsync` → `staRuntime.InvokeAsync(...)`, so they do run on the STA in production. But the invariant is enforced only by *convention* (the same convention Worker-008 made explicit for `PollOnce`); a future refactor that lets a test or a refactored fast-path call into the handler off-STA would silently break the same apartment rule, and the wnwrap COM call would block on marshaling rather than fail loudly. + +**Recommendation:** Add an `EnsureOnAlarmConsumerThread()`-equivalent assertion at the entry of each `AlarmCommandHandler` operation that touches the consumer (`Subscribe` is the highest-value site because it constructs the consumer; `Acknowledge*` and `QueryActive` next). Reuse `MxAccessStaSession.AssertOnAlarmConsumerThread` so the affinity invariant has a single canonical guard. Wire the expected thread id through the handler's constructor (today `AlarmCommandHandler` does not know the STA thread id — `MxAccessStaSession` captures it at line 191 but does not pass it). One implementation shape: hand the handler a small `IThreadAffinityGuard` whose `Verify()` is called at each entry, constructed by `MxAccessStaSession` once `alarmConsumerThreadId` is captured. + +**Resolution:** 2026-05-20 — Extended `AlarmCommandHandler` with a third constructor that takes an optional `Action? threadAffinityCheck`, and invoked the guard at the entry of every method that touches the underlying `IMxAccessAlarmConsumer`: `Subscribe`, `Unsubscribe`, `Acknowledge`, `AcknowledgeByName`, `QueryActive`, and `PollOnce`. 
The factory signature on `MxAccessStaSession` was widened from `Func` to `Func`, so `MxAccessStaSession` (which captures `alarmConsumerThreadId` at the factory call site, already running inside `staRuntime.InvokeAsync`) can pass its existing `EnsureOnAlarmConsumerThread` as the guard — keeping the affinity invariant on a single canonical check, `AssertOnAlarmConsumerThread`. `WorkerPipeSession`'s three factory wiring sites were updated to `(eq, affinity) => new AlarmCommandHandler(eq, () => new WnWrapAlarmConsumer(), affinity)`. The previous two-arg `AlarmCommandHandler` constructor remains (now delegating with `threadAffinityCheck: null`) so existing `AlarmCommandHandlerTests` continue to exercise the handler on a single thread without configuring a guard. Regression tests `AlarmCommandHandlerTests.EveryCommandPathEntry_InvokesThreadAffinityGuard` (counts invocations across all six entry points) and `EveryCommandPathEntry_PropagatesAffinityGuardException` (a throwing guard propagates from every entry point) verify the wiring. + +### Worker-025 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:111-117` | +| Status | Resolved | + +**Description:** `RunAsync` assigns `_runtimeSession = _runtimeSessionFactory()` (line 111) and immediately dereferences `_runtimeSession.StartAsync(...)` inside the lambda at line 115. If the supplied factory ever returns `null`, the lambda will throw `NullReferenceException` rather than a diagnostic exception, and the `finally` block at line 128 (`_runtimeSession?.Dispose()`) silently no-ops. The production factories (`() => new MxAccessStaSession(...)` in the two convenience constructors) never return null, but the factory delegate type `Func` admits null returns and the constructor's `runtimeSessionFactory ?? throw` null-check at line 102 only validates the delegate itself, not its return value. 
The `InitializeMxAccessAsync` direct-invocation path uses `_runtimeSession ??= new MxAccessStaSession(...)` (line 840), so a null factory return there would be replaced with a default instance — different behavior from the `RunAsync` path. + +**Recommendation:** Promote the null check to the call site: + +```csharp +_runtimeSession = _runtimeSessionFactory() + ?? throw new InvalidOperationException("Worker runtime session factory returned null."); +``` + +Match the pattern `AlarmCommandHandler.Subscribe` already uses for `consumerFactory()` (`AlarmCommandHandler.cs:76-77`). + +**Resolution:** 2026-05-20 — `WorkerPipeSession.RunAsync` now uses `_runtimeSession = _runtimeSessionFactory() ?? throw new InvalidOperationException("Worker runtime session factory returned null.");`, matching the pattern `AlarmCommandHandler.Subscribe` uses for its `consumerFactory()`. A null factory return now produces a clear diagnostic exception at the call site instead of NRE-ing on the next dereference (and the `finally` block's `_runtimeSession?.Dispose()` silently no-oping on a half-initialized session). Regression test `WorkerPipeSessionTests.RunAsync_WhenRuntimeSessionFactoryReturnsNull_ThrowsDiagnosticException` drives `RunAsync` with `() => null!` and asserts the diagnostic `InvalidOperationException` is thrown with the expected message. diff --git a/code-reviews/_template/findings.md b/code-reviews/_template/findings.md new file mode 100644 index 0000000..0fe54a9 --- /dev/null +++ b/code-reviews/_template/findings.md @@ -0,0 +1,53 @@ +# Code Review — <Module> + + + +| Field | Value | +|---|---| +| Module | `src/MxGateway.` | +| Reviewer | | +| Review date | | +| Commit reviewed | `` | +| Status | Not started | +| Open findings | 0 | + +## Checklist coverage + +A comprehensive review completes every category, recording "No issues found" where +a category produced nothing rather than leaving it blank. 
+ +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | _pending_ | +| 2 | mxaccessgw conventions | _pending_ | +| 3 | Concurrency & thread safety | _pending_ | +| 4 | Error handling & resilience | _pending_ | +| 5 | Security | _pending_ | +| 6 | Performance & resource management | _pending_ | +| 7 | Design-document adherence | _pending_ | +| 8 | Code organization & conventions | _pending_ | +| 9 | Testing coverage | _pending_ | +| 10 | Documentation & comments | _pending_ | + +## Findings + + + +### -001 + +| Field | Value | +|---|---| +| Severity | Critical / High / Medium / Low | +| Category | one of the 10 checklist categories | +| Location | `path/to/File.cs:NN` | +| Status | Open / In Progress / Resolved / Won't Fix / Deferred | + +**Description:** What is wrong and why it matters. + +**Recommendation:** Concrete suggested fix. + +**Resolution:** _(empty until closed; on close, record the fixing commit SHA, the date, and a one-line description of the fix)_ diff --git a/code-reviews/prompt.md b/code-reviews/prompt.md new file mode 100644 index 0000000..1316504 --- /dev/null +++ b/code-reviews/prompt.md @@ -0,0 +1,76 @@ +# Prompt — resolve open code-review findings + +Reusable orchestration prompt for clearing the `code-reviews/` backlog. Paste it +to a fresh agent when you want the remaining findings worked through. + +--- + +Resolve all open code-review findings (every severity), following the same +workflow already used to resolve the Critical dashboard finding and the +Client.Rust module (see git commits `a8aafdf`, `0d8a28d`, `9082e50`). + +## Setup + +- Read `code-reviews/README.md` for the open findings and `REVIEW-PROCESS.md` + for the workflow. Group the open findings by module. +- A module is one folder under `code-reviews/` — a `src/MxGateway.*` project or + a `clients/` language client. 
The module→source mapping and the per-module + build/test commands are in `CLAUDE.md` (the "Source Update Workflow" table and + the per-client commands). + +## Dispatch — one general-purpose subagent per module, in batches of ~5 modules + +Each subagent, for every open finding in its assigned module, must: + +- Verify the finding's root cause against the actual source. Do NOT trust the + finding text — if it is wrong or misclassified, re-triage it (correct the + severity/description in that module's `findings.md`) instead of forcing a fix. +- Use real TDD: write the regression test FIRST and run it to confirm it fails, + THEN implement the root-cause fix, THEN confirm it passes. (Do not use + `git stash` — parallel agents would race on the shared stash stack.) +- Run that module's full build and test suite with the module-appropriate + toolchain and confirm it is green: + - `src/MxGateway.*` .NET projects — `dotnet build` + `dotnet test` for the + project; the Worker must build x86 (`-p:Platform=x86`). + - `clients/dotnet` — `dotnet build clients/dotnet/MxGateway.Client.sln` and its tests. + - `clients/go` — `gofmt`, `go build ./...`, `go test ./...`. + - `clients/rust` — `cargo fmt`, `cargo test --workspace`, + `cargo clippy --workspace --all-targets -- -D warnings`. + - `clients/python` — `python -m pytest`. + - `clients/java` — `gradle test`. +- A regression test for a gateway-server finding belongs in `src/MxGateway.Tests`; + for a worker finding, in `src/MxGateway.Worker.Tests`. Adding a test there is + permitted even though it is a different module's source tree. +- Update only that module's `code-reviews//findings.md`: set each + resolved finding's Status to `Resolved` with a Resolution note describing the + fix (the orchestrator appends the fixing commit SHA), and update the header + "Open findings" count. +- CONSTRAINTS: edit only the source and test files needed for the assigned + module's findings, plus that module's own `findings.md`. 
Do NOT edit + `code-reviews/README.md`. Do NOT commit. Do NOT touch another module's + `findings.md`. +- Report a summary: each finding — root-cause confirmation, the fix, test names, + and any re-triage. + +Batch so that no two subagents in the same batch write to the same test project +— e.g. do not run the `Server` and `Contracts` agents together, since both add +regression tests under `src/MxGateway.Tests`. + +## After each batch returns (orchestrator does this — keep your own context lean) + +- Build and test every component the batch touched, using the `CLAUDE.md` + commands; confirm clean. For any .NET change, `dotnet build src/MxGateway.sln`. +- Commit per module — one commit per module, message referencing the finding + IDs. Record the fixing commit SHA in each finding's Resolution. +- Regenerate the index: `python code-reviews/regen-readme.py`, then + `python code-reviews/regen-readme.py --check` to confirm it is consistent; + stage `code-reviews/README.md`. (Use `python` — the bare `python3` alias on + this box resolves to the Windows Store stub and fails.) You may stage + `README.md` with each module's commit, or commit it once per batch after the + script runs. +- Push. + +## Continue + +Continue batch by batch until all findings are Resolved or re-triaged. If a +finding needs a design decision, skip it and surface it rather than guessing. diff --git a/code-reviews/regen-readme.py b/code-reviews/regen-readme.py new file mode 100644 index 0000000..44a89f8 --- /dev/null +++ b/code-reviews/regen-readme.py @@ -0,0 +1,236 @@ +#!/usr/bin/env python3 +"""Regenerate code-reviews/README.md from the per-module findings.md files. + +The per-module findings.md files are the source of truth. This script aggregates +them into the single cross-module README.md (module status + pending/closed +finding tables). 
+ +Usage: + python code-reviews/regen-readme.py # rewrite README.md + python code-reviews/regen-readme.py --check # exit 1 if stale or inconsistent + +`--check` fails when README.md is out of date OR when a module's header +`Open findings` count disagrees with its finding statuses, or a finding +carries an unrecognised Status value. +""" +from __future__ import annotations + +import re +import sys +from pathlib import Path + +ROOT = Path(__file__).resolve().parent +README = ROOT / "README.md" + +PENDING_STATUSES = {"Open", "In Progress"} +KNOWN_STATUSES = {"Open", "In Progress", "Resolved", "Won't Fix", "Deferred"} +SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3} + +GENERATED_NOTE = ( + "" +) + + +def cell(value: str) -> str: + """Escape a value for safe inclusion in a markdown table cell.""" + return value.replace("|", "\\|").strip() + + +def summarize(value: str, limit: int = 240) -> str: + """Trim a long description to a single-cell-friendly summary.""" + value = value.strip() + if len(value) <= limit: + return value + return value[: limit - 1].rstrip() + "…" + + +def first_table(text: str) -> dict[str, str]: + """Parse the first contiguous block of '| key | value |' rows into a dict.""" + rows: dict[str, str] = {} + started = False + for line in text.splitlines(): + stripped = line.strip() + if stripped.startswith("|"): + started = True + cells = [c.strip() for c in stripped.strip("|").split("|")] + if len(cells) >= 2: + key, value = cells[0], cells[1] + if key and not set(key) <= {"-", ":"} and key != "Field": + rows[key] = value + elif started: + break + return rows + + +def parse_module(findings_path: Path) -> dict: + """Parse one module's findings.md into its header and finding list.""" + text = findings_path.read_text(encoding="utf-8") + module = findings_path.parent.name + parts = re.split(r"^##\s+Findings\s*$", text, maxsplit=1, flags=re.M) + header = first_table(parts[0]) + findings: list[dict] = [] + if len(parts) > 1: + for chunk in 
re.split(r"^###\s+", parts[1], flags=re.M)[1:]: + fid = chunk.splitlines()[0].strip() + tbl = first_table(chunk) + desc_m = re.search( + r"\*\*Description:\*\*\s*(.*?)(?=\n\*\*|\Z)", chunk, re.S + ) + desc = re.sub(r"\s+", " ", desc_m.group(1)).strip() if desc_m else "" + findings.append( + { + "id": fid, + "severity": tbl.get("Severity", ""), + "category": tbl.get("Category", ""), + "location": tbl.get("Location", ""), + "status": tbl.get("Status", ""), + "description": desc, + } + ) + return {"module": module, "header": header, "findings": findings} + + +def build_readme(modules: list[dict]) -> str: + modules = sorted(modules, key=lambda m: m["module"]) + all_findings = [ + dict(f, module=m["module"]) for m in modules for f in m["findings"] + ] + pending = [f for f in all_findings if f["status"] in PENDING_STATUSES] + closed = [ + f + for f in all_findings + if f["status"] and f["status"] not in PENDING_STATUSES + ] + + def sev_key(f: dict) -> tuple: + return (SEVERITY_ORDER.get(f["severity"], 9), f["id"]) + + pending.sort(key=sev_key) + closed.sort(key=sev_key) + + out: list[str] = [ + "# Code Reviews", + "", + GENERATED_NOTE, + "", + "Cross-module code review index for the `mxaccessgw` codebase. 
The review " + "process is defined in [../REVIEW-PROCESS.md](../REVIEW-PROCESS.md).", + "", + "Each module's `findings.md` is the source of truth; this file is generated " + "from them by `regen-readme.py` and must not be edited by hand.", + "", + "## Module status", + "", + "| Module | Reviewer | Date | Commit | Status | Open | Total |", + "|---|---|---|---|---|---|---|", + ] + for m in modules: + h = m["header"] + open_n = sum( + 1 for f in m["findings"] if f["status"] in PENDING_STATUSES + ) + out.append( + f"| [{m['module']}]({m['module']}/findings.md) " + f"| {cell(h.get('Reviewer', ''))} " + f"| {cell(h.get('Review date', ''))} " + f"| {cell(h.get('Commit reviewed', ''))} " + f"| {cell(h.get('Status', ''))} " + f"| {open_n} | {len(m['findings'])} |" + ) + + out += ["", "## Pending findings", ""] + out.append( + "Findings with status `Open` or `In Progress`, ordered by severity." + ) + out.append("") + if pending: + out.append("| ID | Severity | Category | Location | Description |") + out.append("|---|---|---|---|---|") + for f in pending: + out.append( + f"| {cell(f['id'])} | {cell(f['severity'])} " + f"| {cell(f['category'])} | {cell(f['location'])} " + f"| {cell(summarize(f['description']))} |" + ) + else: + out.append("_No pending findings._") + + out += ["", "## Closed findings", ""] + out.append("Findings with status `Resolved`, `Won't Fix`, or `Deferred`.") + out.append("") + if closed: + out.append("| ID | Severity | Status | Category | Location |") + out.append("|---|---|---|---|---|") + for f in closed: + out.append( + f"| {cell(f['id'])} | {cell(f['severity'])} " + f"| {cell(f['status'])} | {cell(f['category'])} " + f"| {cell(f['location'])} |" + ) + else: + out.append("_No closed findings._") + + return "\n".join(out) + "\n" + + +def find_inconsistencies(modules: list[dict]) -> list[str]: + """Return human-readable problems in the per-module findings.md files. 
+
+    Checks that each module header's `Open findings` count agrees with its
+    finding statuses, and that every finding carries a known Status value.
+    """
+    issues: list[str] = []
+    for m in modules:
+        open_n = sum(
+            1 for f in m["findings"] if f["status"] in PENDING_STATUSES
+        )
+        declared = m["header"].get("Open findings", "").strip()
+        if declared != str(open_n):
+            issues.append(
+                f"{m['module']}: header 'Open findings' = '{declared}' but "
+                f"{open_n} finding(s) are Open/In Progress"
+            )
+        for f in m["findings"]:
+            if f["status"] not in KNOWN_STATUSES:
+                issues.append(
+                    f"{m['module']}: finding {f['id']} has unrecognised "
+                    f"Status '{f['status']}'"
+                )
+    return issues
+
+
+def main(argv: list[str]) -> int:
+    check = "--check" in argv[1:]
+    module_dirs = sorted(
+        d
+        for d in ROOT.iterdir()
+        if d.is_dir() and d.name != "_template" and (d / "findings.md").is_file()
+    )
+    modules = [parse_module(d / "findings.md") for d in module_dirs]
+    content = build_readme(modules)
+    issues = find_inconsistencies(modules)
+    if check:
+        stale = (
+            README.read_text(encoding="utf-8") if README.exists() else ""
+        ) != content
+        for issue in issues:
+            print(f"inconsistent: {issue}", file=sys.stderr)
+        if stale:
+            print(
+                "code-reviews/README.md is stale - run regen-readme.py",
+                file=sys.stderr,
+            )
+        if stale or issues:
+            return 1
+        print("code-reviews/README.md is up to date and consistent.")
+        return 0
+    for issue in issues:
+        print(f"warning: {issue}", file=sys.stderr)
+    README.write_text(content, encoding="utf-8", newline="\n")
+    print(f"Wrote {README} ({len(modules)} modules).")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main(sys.argv))
diff --git a/code-reviews/test_regen_readme.py b/code-reviews/test_regen_readme.py
new file mode 100644
index 0000000..b7c3055
--- /dev/null
+++ b/code-reviews/test_regen_readme.py
@@ -0,0 +1,158 @@
+#!/usr/bin/env python3
+"""Tests for regen-readme.py.
+
+Dependency-free: run with `python code-reviews/test_regen_readme.py`.
+Exits 0 if all tests pass, 1 otherwise.
+"""
+from __future__ import annotations
+
+import importlib.util
+import tempfile
+import traceback
+from pathlib import Path
+
+HERE = Path(__file__).resolve().parent
+
+# regen-readme.py is not an importable module name (hyphen), so load it by path.
+_spec = importlib.util.spec_from_file_location("regen_readme", HERE / "regen-readme.py")
+regen = importlib.util.module_from_spec(_spec)
+_spec.loader.exec_module(regen)
+
+FIXTURE = """# Code Review — Demo
+
+| Field | Value |
+|---|---|
+| Module | `src/Demo` |
+| Reviewer | Tester |
+| Review date | 2026-05-18 |
+| Commit reviewed | `abc1234` |
+| Status | Reviewed |
+| Open findings | 1 |
+
+## Findings
+
+### Demo-001
+
+| Field | Value |
+|---|---|
+| Severity | High |
+| Category | Security |
+| Location | `src/Demo/File.cs:10` |
+| Status | Open |
+
+**Description:** A first problem that matters.
+
+**Recommendation:** Fix it.
+
+**Resolution:** _(open)_
+
+### Demo-002
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Documentation & comments |
+| Location | `src/Demo/File.cs:20` |
+| Status | Resolved |
+
+**Description:** A second, minor problem.
+
+**Recommendation:** Tidy it.
+
+**Resolution:** Fixed in def5678 on 2026-05-18.
+"""
+
+
+def _parse_fixture() -> dict:
+    """Write FIXTURE to a temp Demo/findings.md and parse it."""
+    with tempfile.TemporaryDirectory() as tmp:
+        path = Path(tmp) / "Demo" / "findings.md"
+        path.parent.mkdir()
+        path.write_text(FIXTURE, encoding="utf-8")
+        return regen.parse_module(path)
+
+
+def test_first_table_skips_separator_and_field_header():
+    table = regen.first_table("| Field | Value |\n|---|---|\n| Severity | High |\n")
+    assert table == {"Severity": "High"}, table
+
+
+def test_parse_module_header():
+    m = _parse_fixture()
+    assert m["module"] == "Demo", m["module"]
+    assert m["header"]["Reviewer"] == "Tester"
+    assert m["header"]["Status"] == "Reviewed"
+    assert m["header"]["Open findings"] == "1"
+
+
+def test_parse_module_findings():
+    m = _parse_fixture()
+    assert len(m["findings"]) == 2, len(m["findings"])
+    first = m["findings"][0]
+    assert first["id"] == "Demo-001"
+    assert first["severity"] == "High"
+    assert first["category"] == "Security"
+    assert first["location"] == "`src/Demo/File.cs:10`"
+    assert first["status"] == "Open"
+    assert first["description"] == "A first problem that matters."
+    assert m["findings"][1]["status"] == "Resolved"
+
+
+def test_build_readme_splits_pending_and_closed():
+    readme = regen.build_readme([_parse_fixture()])
+    assert "## Pending findings" in readme
+    assert "## Closed findings" in readme
+    pending, closed = readme.split("## Closed findings", 1)
+    assert "Demo-001" in pending  # Open -> pending
+    assert "Demo-001" not in closed
+    assert "Demo-002" in closed  # Resolved -> closed
+    assert "_No pending findings._" not in pending
+
+
+def test_find_inconsistencies_clean_fixture():
+    assert regen.find_inconsistencies([_parse_fixture()]) == []
+
+
+def test_find_inconsistencies_detects_wrong_open_count():
+    m = _parse_fixture()
+    m["header"]["Open findings"] = "7"
+    issues = regen.find_inconsistencies([m])
+    assert len(issues) == 1 and "Open findings" in issues[0], issues
+
+
+def test_find_inconsistencies_detects_unknown_status():
+    m = _parse_fixture()
+    m["findings"][0]["status"] = "Bogus"
+    issues = regen.find_inconsistencies([m])
+    # Wrong status also shifts the open count, so expect the status issue present.
+    assert any("unrecognised Status" in i for i in issues), issues
+
+
+def test_summarize_truncates_long_text():
+    long = "x" * 500
+    out = regen.summarize(long)
+    assert len(out) <= 240 and out.endswith("…"), len(out)
+    assert regen.summarize("short") == "short"
+
+
+def main() -> int:
+    tests = sorted(
+        (name, fn)
+        for name, fn in globals().items()
+        if name.startswith("test_") and callable(fn)
+    )
+    failed = 0
+    for name, fn in tests:
+        try:
+            fn()
+            print(f"PASS {name}")
+        except Exception:  # noqa: BLE001 - test runner reports all failures
+            failed += 1
+            print(f"FAIL {name}")
+            traceback.print_exc()
+    print(f"\n{len(tests) - failed}/{len(tests)} passed.")
+    return 1 if failed else 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())