Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules

Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and
fixed them in three priority waves (3 High, 17 Medium, 52 Low).

Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
  GatewayGrpcScopeResolver so non-admin keys can use them; document
  the mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the
  CLI FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression
  in generated tonic code by reformatting the ReadBulkCommand proto
  comment and scoping a #![allow(...)] to the generated submodules.

Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and
  make DisposeAsync race-safe against in-flight CloseAsync (-016);
  add constraint-enforcement test coverage for the bulk-plan path
  (-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop
  can distinguish graceful shutdown from a real STA-affinity
  violation (-016); have the watchdog skip StaHung while
  CurrentCommandCorrelationId is non-empty so a legitimate slow
  ReadBulk no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the
  11 GatewaySession bulk methods (-013); replace the real TCP probe
  in GalaxyHierarchyCacheTests with an IGalaxyRepository fake
  (-016).
- IntegrationTests: drive the StreamEvents writer in the live Write
  test and assert OnWriteComplete (-012); add live tests for
  Unadvise/RemoveItem/Unregister ordering, WriteSecured, and
  abnormal worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
  CreateForTesting factory (-016); cover WorkerCancel and
  unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races
  beforeStart() (-014); return a CancellingCompletableFuture that
  actually forwards cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in
  the CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
  histograms with failed-call durations (-015); add coverage for
  the five MalformedReply paths, the bulk-write helpers, the
  Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write
  command family (-009).

Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
  WorkerAlarmRpcDispatcher missing-session handling; drop the
  duplicate dashboard @page routes; refresh IAlarmRpcDispatcher
  XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
  subscriptionExpression / ExecutingCommand arms; preserve
  factory-supplied runtime sessions; split MxAlarmSnapshot.cs into
  three files.
- Tests: dispose the WebApplication in seven test classes; rebuild
  FakeWorkerProcess.WaitForExitAsync against a real
  TaskCompletionSource; switch the heartbeat-expires test to
  ManualTimeProvider;
  add InvariantCulture to the remaining DateTimeOffset.Parse sites;
  document GalaxyFilterInputSafetyTests in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
  IDisposable, class-level [Trait], single-source ZB default
  connection string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact
  so absent env vars SKIP not pass; PascalCase rename of probe
  [Fact]s; deterministic deadline test; new frame-protocol error
  tests; ComputeTransitions diff-coverage; relocate dev-rig probes
  to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
  Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
  TreatWarningsAsErrors / analysers apply; document
  DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
  bulk-read handles in CLI; surface AcknowledgeAlarm transport
  faults through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
  runWriteBulkVariant; document the six new subcommands in
  writeUsage; drain galaxy-watch events on limit; switch io.EOF
  comparisons to errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout
  option; regex-based credential redaction; Long.toUnsignedString
  for uint64 sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
  _percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
  _api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close
  stop hard-coding correlation IDs; resync RustClientDesign.md
  with the current Session / Error surface and CLI subcommand set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-20 09:46:47 -04:00
parent 1cd51bbda3
commit a0203503a7
122 changed files with 8723 additions and 757 deletions
+10 -4
@@ -102,12 +102,18 @@ public string ResolveRequiredScope(object request)
CloseSessionRequest => GatewayScopes.SessionClose,
StreamEventsRequest => GatewayScopes.EventsRead,
MxCommandRequest commandRequest => ResolveCommandScope(commandRequest.Command?.Kind ?? MxCommandKind.Unspecified),
AcknowledgeAlarmRequest => GatewayScopes.InvokeWrite,
QueryActiveAlarmsRequest => GatewayScopes.EventsRead,
TestConnectionRequest or
GetLastDeployTimeRequest or
DiscoverHierarchyRequest or
WatchDeployEventsRequest => GatewayScopes.MetadataRead,
_ => GatewayScopes.Admin
};
}
```
The `_ => GatewayScopes.Admin` fallback is intentional: any future request type that the resolver does not recognize fails closed, requiring the strongest scope until the resolver is updated.
The `_ => GatewayScopes.Admin` fallback is intentional: any future request type that the resolver does not recognize fails closed, requiring the strongest scope until the resolver is updated. `AcknowledgeAlarm` is treated as a write — it mutates alarm state, mirroring `MxCommandKind.Write*` — and `QueryActiveAlarms` shares the alarm/event surface with `StreamEvents` and `MxCommandKind.DrainEvents`, so it carries `events:read`.
`MxCommandRequest` is special because it multiplexes many MxAccess operations through a single RPC. The resolver inspects the embedded `MxCommandKind` so each operation gets its own scope:
@@ -188,10 +194,10 @@ blocking constraint; secured values and raw credentials are never logged.
|----------|-------|--------------|
| `SessionOpen` | `session:open` | `OpenSessionRequest` |
| `SessionClose` | `session:close` | `CloseSessionRequest` |
| `EventsRead` | `events:read` | `StreamEventsRequest`, `MxCommandKind.DrainEvents` |
| `EventsRead` | `events:read` | `StreamEventsRequest`, `QueryActiveAlarmsRequest`, `MxCommandKind.DrainEvents` |
| `InvokeRead` | `invoke:read` | `MxCommandRequest` for read-style command kinds (`Register`, `AddItem`, `Advise`, and any kind not otherwise mapped) |
| `InvokeWrite` | `invoke:write` | `MxCommandKind.Write`, `MxCommandKind.Write2` |
| `InvokeSecure` | `invoke:secure` | `MxCommandKind.WriteSecured`, `MxCommandKind.WriteSecured2`, `MxCommandKind.AuthenticateUser` |
| `InvokeWrite` | `invoke:write` | `AcknowledgeAlarmRequest`, `MxCommandKind.Write`, `MxCommandKind.Write2`, `MxCommandKind.WriteBulk`, `MxCommandKind.Write2Bulk` |
| `InvokeSecure` | `invoke:secure` | `MxCommandKind.WriteSecured`, `MxCommandKind.WriteSecured2`, `MxCommandKind.WriteSecuredBulk`, `MxCommandKind.WriteSecured2Bulk`, `MxCommandKind.AuthenticateUser` |
| `MetadataRead` | `metadata:read` | `MxCommandKind.ArchestraUserToId`, `MxCommandKind.GetSessionState`, `MxCommandKind.GetWorkerInfo`, `GalaxyRepository.TestConnection`, `GalaxyRepository.GetLastDeployTime`, `GalaxyRepository.DiscoverHierarchy`, `GalaxyRepository.WatchDeployEvents` |
| `Admin` | `admin` | `MxCommandKind.ShutdownWorker`, the default for any unrecognized request type, and the dashboard authorization policy |
+42
@@ -23,6 +23,48 @@ the corresponding MXAccess `AddItem`, `Advise`, `UnAdvise`, and `RemoveItem`
calls sequentially on the session STA and preserves input order in the result
list.
The command model also includes bulk write/read command kinds:
`WriteBulk`, `Write2Bulk`, `WriteSecuredBulk`, `WriteSecured2Bulk`, and
`ReadBulk`. They are unary `Invoke` payloads on the same `MxAccessGateway`
surface (not separate gRPC methods) and exist so a caller can submit one list
of items per round trip while preserving MXAccess parity per entry.
- `WriteBulkCommand` / `Write2BulkCommand` / `WriteSecuredBulkCommand` /
`WriteSecured2BulkCommand` each carry a `server_handle` and a `repeated`
list of entries (`WriteBulkEntry`, `Write2BulkEntry`,
`WriteSecuredBulkEntry`, `WriteSecured2BulkEntry`). Each entry mirrors the
single-item command shape — `item_handle` + `value` (+ `timestamp_value` on
the `*2` variants, + `current_user_id` / `verifier_user_id` on the secured
variants). All four replies use `BulkWriteReply`, which carries
`repeated BulkWriteResult`. A `BulkWriteResult` has `server_handle`,
`item_handle`, `was_successful`, `optional int32 hresult`, `repeated
MxStatusProxy statuses`, and `error_message`. Per-entry failures populate
`error_message` + `hresult` and never raise — callers iterate and inspect
each entry. The credential-sensitive redaction rules for `WriteSecured` /
`WriteSecured2` apply to every `value` inside `WriteSecuredBulkEntry` and
`WriteSecured2BulkEntry`.
- `ReadBulkCommand` carries `server_handle`, `repeated string tag_addresses`,
and `uint32 timeout_ms` (0 means use the gateway-configured default). The
reply is `BulkReadReply` carrying `repeated BulkReadResult`. A
`BulkReadResult` has `server_handle`, `tag_address`, `item_handle`,
`was_successful`, `was_cached`, `value`, `quality`, `source_timestamp`,
`repeated MxStatusProxy statuses`, and `error_message`. MXAccess has no
synchronous `Read`, so `ReadBulk` is dual-mode per entry: when a tag is
already advised in the session the worker returns the cached
`OnDataChange` payload without touching the subscription
(`was_cached = true`); otherwise the worker runs a full
`AddItem` + `Advise` + wait-for-first-`OnDataChange` + `UnAdvise` +
`RemoveItem` snapshot lifecycle and returns the result
(`was_cached = false`). The asymmetry that `BulkReadResult` has no
`hresult` field is intentional — `ReadBulk` outcomes are timeout / cache
/ lifecycle states rather than MXAccess COM return codes.
See `gateway.md` for the full cached-vs-snapshot `ReadBulk` lifecycle and the
per-command scope requirements, and `docs/DesignDecisions.md` "Bulk Command
Family" for the rationale behind the per-entry result shape (independent
success tracking, input-order preservation, no partial-failure exceptions).
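
The "iterate and inspect each entry" contract described above can be sketched as follows. This is a minimal illustration, not the generated client code: `BulkWriteResultSketch` and `BulkWriteResultInspector` are hypothetical names, and the record only mirrors the `BulkWriteResult` fields listed above (the `statuses` list is omitted for brevity).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the generated BulkWriteResult message. Field names
// follow the proto description above; the C# shape itself is an assumption.
public sealed record BulkWriteResultSketch(
    int ServerHandle,
    int ItemHandle,
    bool WasSuccessful,
    int? HResult,
    string ErrorMessage);

public static class BulkWriteResultInspector
{
    // Returns one diagnostic line per failed entry, preserving input order.
    // A per-entry failure is reported as data and never thrown, so the caller
    // decides whether to retry, log, or abort.
    public static IReadOnlyList<string> DescribeFailures(
        IEnumerable<BulkWriteResultSketch> results)
    {
        return results
            .Where(r => !r.WasSuccessful)
            .Select(r => r.HResult is int hr
                ? $"item {r.ItemHandle}: {r.ErrorMessage} (hresult 0x{hr:X8})"
                : $"item {r.ItemHandle}: {r.ErrorMessage}")
            .ToList();
    }
}
```

Because `hresult` is optional on the wire, the sketch treats a missing value as "no COM return code available" rather than as zero.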
`src/MxGateway.Contracts/Protos/mxaccess_worker.proto` defines the named-pipe
worker IPC envelope and control messages. It imports
`mxaccess_gateway.proto` so the worker and gateway use the same command, reply,
+57 -5
@@ -51,14 +51,29 @@ shutdown request even when a command or event assertion fails. Cleanup failures
in that `finally` block are logged rather than thrown, so a real assertion
failure is never masked by a shutdown timeout.
`WorkerLiveMxAccessSmokeTests` additionally covers two MXAccess parity paths the
`WorkerLiveMxAccessSmokeTests` additionally covers five MXAccess parity paths the
fake-worker tests cannot validate:
- a `Write` round-trip against an advised item, and
- a `Write` round-trip against an advised item, asserting both that the reply is
`Ok` / `MxCommandKind.Write` *and* that the worker emits a matching
`OnWriteComplete` event for the targeted (server, item) handle pair — the
same round-trip proof used by `scripts/run-client-e2e-tests.ps1`,
- an `AddItem` against an invalid server handle, asserting the MXAccess failure
surfaces in the command reply without faulting the gateway transport.
surfaces in the command reply without faulting the gateway transport,
- the `UnAdvise` → `RemoveItem` → `Unregister` teardown chain, asserting each
step replies `Ok` with the matching `MxCommandKind`, that no further
`OnDataChange` events arrive for the un-advised pair, and that a second
`RemoveItem` against the freed handle relays a non-`Ok` MXAccess failure,
- a `WriteSecured` round-trip after `AuthenticateUser`, asserting the reply
carries `MxCommandKind.WriteSecured` and the credential password never
appears in the diagnostic message (parity for both the secured-write
ordering rule and the "do not log secrets" contract), and
- an abnormal worker exit (the worker process is killed mid-session) where the
gateway must transition the session to `SessionState.Faulted` with a
non-empty fault description carrying a known worker-client classification
(pipe disconnected / worker faulted / end-of-stream / heartbeat expired).
All three tests are gated by the same `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`
All six tests are gated by the same `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`
opt-in variable.
Build the worker before running the smoke:
@@ -81,7 +96,9 @@ Optional live smoke variables:
| `MXGATEWAY_LIVE_MXACCESS_WORKER_EXE` | First existing `MxGateway.Worker.exe` under `src/MxGateway.Worker/bin/...` | Worker executable path. Set this when running against a packaged worker or a non-default build output. |
| `MXGATEWAY_LIVE_MXACCESS_ITEM` | `TestChildObject.TestInt` | MXAccess item reference used by `AddItem`. |
| `MXGATEWAY_LIVE_MXACCESS_CLIENT_NAME` | `MxGateway.IntegrationTests` | Client name passed to `Register`. |
| `MXGATEWAY_LIVE_MXACCESS_EVENT_TIMEOUT_SECONDS` | `15` | Maximum wait for the first `OnDataChange`. |
| `MXGATEWAY_LIVE_MXACCESS_EVENT_TIMEOUT_SECONDS` | `15` | Maximum wait for the first `OnDataChange` (also used for the `OnWriteComplete` round-trip and the abnormal-exit fault transition). |
| `MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_USER` | `admin` | ArchestrA user name passed to `AuthenticateUser` before the `WriteSecured` parity step. |
| `MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_PASSWORD` | `admin123` | Password paired with the user above. Never logged; the test asserts the value does not appear in the WriteSecured diagnostic message. |
The test output includes session id, worker process id, command status,
HRESULT/status diagnostics, event sequence and handles, close status, and worker
@@ -116,6 +133,41 @@ Optional live Galaxy variables:
The default connection string targets `ZB` on `localhost` with Windows
authentication, which matches the Galaxy Repository conventions in CLAUDE.md.
## Galaxy Filter Safety
`GalaxyFilterInputSafetyTests` in `src/MxGateway.Tests/Galaxy/` covers adversarial
input handling for the Galaxy Repository browse filter layer. It runs in the
unit-test project (no live SQL needed) and complements the live SQL coverage in
`GalaxyRepositoryLiveTests`.
The test class re-frames the original "Galaxy SQL injection" concern (Tests-002 in
`code-reviews/Tests/findings.md`). `GalaxyRepository` issues only four *constant*
SQL statements (`HierarchySql`, `AttributesSql`, `SELECT 1`,
`SELECT time_of_last_deploy FROM galaxy`) — no `DiscoverHierarchyRequest` field
is ever concatenated into a SQL string, so there is no dynamic SQL surface and no
`LIKE`-escaping helper to test. All filters (`TagNameGlob`, `RootTagName`,
template-chain, category, contained-path) are applied **in memory** by
`GalaxyHierarchyProjector` / `GalaxyGlobMatcher` against the cached snapshot.
The adversarial-input matrix (`'`, `' OR '1'='1`, `'; DROP TABLE gobject;--`,
`%`, `_`, `100%_off`, `[abc]`, `Pump'001`) pins the following invariants:
- SQL metacharacters (`'`, `;`) and `LIKE`-wildcards (`%`, `_`) are treated as
opaque literals by `GalaxyGlobMatcher`: they never act as wildcards and never
spuriously match unrelated text.
- Only `*` and `?` are glob wildcards.
- `GalaxyGlobMatcher` applies a 100 ms regex timeout so a pathological glob
(e.g. 5 000 `a` characters plus a literal `!`) completes promptly rather than
catastrophically backtracking.
- `GalaxyHierarchyProjector` returns zero matches (rather than the whole
hierarchy) for an adversarial `TagNameGlob` or `TemplateChainContains`, and
surfaces `NotFound` for an adversarial `RootTagName`.
- The `DiscoverHierarchy` RPC end-to-end returns zero matches for adversarial
`TagNameGlob` rather than faulting.
These invariants are the real security surface of the Galaxy browse path; the
SQL-injection framing does not apply to a constant-query layer.
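
The invariants above can be illustrated with a small glob-to-regex translation. This is a sketch under stated assumptions, not the actual `GalaxyGlobMatcher`: the `GlobSketch` class is hypothetical, matching is assumed to be case-sensitive, and treating a regex timeout as "no match" is an assumption about the desired failure mode.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class GlobSketch
{
    // Assumed to mirror the 100 ms regex timeout described above.
    private static readonly TimeSpan MatchTimeout = TimeSpan.FromMilliseconds(100);

    // Translates a glob into an anchored regex. Only '*' and '?' become
    // wildcards; every other character is escaped and matched literally.
    public static Regex ToRegex(string glob)
    {
        var pattern = new StringBuilder("^");
        foreach (char c in glob)
        {
            pattern.Append(c switch
            {
                '*' => ".*",
                '?' => ".",
                _ => Regex.Escape(c.ToString()),
            });
        }

        pattern.Append("\\z");
        return new Regex(
            pattern.ToString(),
            RegexOptions.CultureInvariant | RegexOptions.Singleline,
            MatchTimeout);
    }

    // Returns false instead of throwing when the match exceeds the timeout
    // (an assumption: fail closed by matching nothing).
    public static bool IsMatch(string glob, string candidate)
    {
        try
        {
            return ToRegex(glob).IsMatch(candidate);
        }
        catch (RegexMatchTimeoutException)
        {
            return false;
        }
    }
}
```

With this translation, `GlobSketch.IsMatch("100%_off", "100Xyoff")` is false because `%` and `_` are literals, while `GlobSketch.IsMatch("Pump*", "Pump'001")` is true because `*` is a wildcard and the quote is ordinary text.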
## Live LDAP
`DashboardLdapLiveTests` in `src/MxGateway.IntegrationTests/` exercises
+16 -6
@@ -655,12 +655,22 @@ the event queue implementation owns those counters.
The STA watchdog currently emits a `WorkerFault` with
`WorkerFaultCategory.StaHung` when `LastStaActivityUtc` is older than
`WorkerPipeSessionOptions.HeartbeatGrace`. The fault includes the current
command correlation id when a command is active. Command duration and high event
queue depth remain observable through heartbeat fields until dedicated
thresholds own those warnings. The worker reports stale STA activity, but the
gateway owns the final kill decision through its existing heartbeat and worker
lifecycle policy.
`WorkerPipeSessionOptions.HeartbeatGrace` **and no command is in flight**.
`StaRuntime.ProcessQueuedCommands` calls `MarkActivity()` only immediately
before and after each work item, so a synchronously long-running STA command
(for example a `ReadBulk` waiting `timeout_ms` for the first `OnDataChange`)
legitimately freezes `LastStaActivityUtc` for the duration of the wait while
the worker is healthy. The watchdog is therefore suppressed while the
heartbeat snapshot's `CurrentCommandCorrelationId` is non-empty: the worker is
busy executing a command, not hung, and the heartbeat already surfaces the
in-flight correlation id so the gateway can apply its own per-command timeout
if it considers the command too slow. The fault still fires on a truly hung
STA (no command in flight and no activity for longer than `HeartbeatGrace`),
which is the only case the watchdog can usefully distinguish from a slow
command. Command duration and high event queue depth remain observable through
heartbeat fields until dedicated thresholds own those warnings. The worker
reports stale STA activity, but the gateway owns the final kill decision
through its existing heartbeat and worker lifecycle policy.
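
The suppression rule described above reduces to a small predicate. The sketch below is illustrative only: `StaWatchdogSketch` and its parameter names are hypothetical, and the real watchdog reads these values from the heartbeat snapshot and `WorkerPipeSessionOptions` rather than taking them as arguments.

```csharp
using System;

public static class StaWatchdogSketch
{
    // Decides whether the watchdog should report StaHung.
    // All inputs are passed explicitly so the rule can be tested in isolation.
    public static bool ShouldReportStaHung(
        DateTimeOffset nowUtc,
        DateTimeOffset lastStaActivityUtc,
        TimeSpan heartbeatGrace,
        string currentCommandCorrelationId)
    {
        // A command in flight means the STA is busy, not hung, so the
        // watchdog stays quiet however old the last activity mark is.
        if (!string.IsNullOrEmpty(currentCommandCorrelationId))
        {
            return false;
        }

        // No command in flight: report only when activity is older than
        // the grace period.
        return nowUtc - lastStaActivityUtc > heartbeatGrace;
    }
}
```

Keeping the decision a pure function of its inputs makes the slow-`ReadBulk` case easy to pin in a unit test without a live STA thread.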
## Shutdown
+9 -2
@@ -33,12 +33,19 @@ public void TransitionTo(SessionState nextState)
return;
}
if (_state is SessionState.Closing
&& nextState is not SessionState.Closed
&& nextState is not SessionState.Faulted)
{
return;
}
_state = nextState;
}
}
```
`Closed` is terminal and `Faulted` only allows a transition to `Closed`. This guards against late callbacks (worker exit, heartbeat timeout) re-animating a session that is already torn down.
`Closed` is terminal, `Faulted` only allows a transition to `Closed`, and `Closing` only allows a transition to `Closed` or `Faulted`. This guards against late callbacks (worker exit, heartbeat timeout) re-animating a session that is already tearing down or torn down — once `CloseAsync` has set `Closing` under `_syncRoot`, no `TransitionTo(Ready)` from another thread can walk the session back to `Ready`. Both close-related writes (`Closing` and `Closed`) go through `_syncRoot` exactly like every other state write; `_closeLock` only serializes concurrent close attempts.
### SessionManager (ISessionManager)
@@ -184,7 +191,7 @@ Sessions open with `MxGateway:Sessions:DefaultLeaseSeconds` (default 1800) added
### Close
`GatewaySession.CloseAsync` is serialized by a per-session `SemaphoreSlim` (`_closeLock`). It transitions to `Closing`, asks the worker client to shut down within `ShutdownTimeout`, and on success transitions to `Closed`. If `WorkerClient.ShutdownAsync` throws, the session falls back to `IWorkerClient.Kill` (forced close):
`GatewaySession.CloseAsync` is serialized by a per-session `SemaphoreSlim` (`_closeLock`) so only one close runs at a time, but every read/write of `_state` still passes through `_syncRoot` (via `TryBeginClose` and `MarkClosed`). The close path therefore obeys the same lock discipline as `TransitionTo` / `MarkFaulted`: it transitions to `Closing`, asks the worker client to shut down within `ShutdownTimeout`, and on success transitions to `Closed`. `DisposeAsync` waits on `_closeLock` once before disposing the semaphore so an in-flight close's `Release()` cannot race against the dispose. If `WorkerClient.ShutdownAsync` throws, the session falls back to `IWorkerClient.Kill` (forced close):
```csharp
if (_workerClient is not null)