Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules

Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and
fixed them in three priority waves (3 High, 17 Medium, 52 Low).

Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
GatewayGrpcScopeResolver so non-admin keys can use them; document
the mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the
CLI FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression
in generated tonic code by reformatting the ReadBulkCommand proto
comment and scoping a #![allow(...)] to the generated submodules.

Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and
make DisposeAsync race-safe against in-flight CloseAsync (-016);
add constraint-enforcement test coverage for the bulk-plan path
(-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop
can distinguish graceful shutdown from a real STA-affinity
violation (-016); have the watchdog skip StaHung while
CurrentCommandCorrelationId is non-empty so a legitimate slow
ReadBulk no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the
11 GatewaySession bulk methods (-013); replace the real TCP probe
in GalaxyHierarchyCacheTests with an IGalaxyRepository fake
(-016).
- IntegrationTests: drive the StreamEvents writer in the live Write
test and assert OnWriteComplete (-012); add live tests for
Unadvise/RemoveItem/Unregister ordering, WriteSecured, and
abnormal worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
CreateForTesting factory (-016); cover WorkerCancel and
unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races
beforeStart() (-014); return a CancellingCompletableFuture that
actually forwards cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in
the CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
histograms with failed-call durations (-015); add coverage for
the five MalformedReply paths, the bulk-write helpers, the
Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write
command family (-009).

Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
WorkerAlarmRpcDispatcher missing-session handling; drop the
duplicate dashboard @page routes; refresh IAlarmRpcDispatcher
XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
subscriptionExpression / ExecutingCommand arms; preserve
factory-supplied runtime sessions; split MxAlarmSnapshot.cs into
three files.
- Tests: dispose the WebApplication in seven test classes; rebuild
FakeWorkerProcess.WaitForExitAsync against a real
TaskCompletionSource; switch the heartbeat-expires test to ManualTimeProvider;
add InvariantCulture to the remaining DateTimeOffset.Parse sites;
document GalaxyFilterInputSafetyTests in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
IDisposable, class-level [Trait], single-source ZB default
connection string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact
so absent env vars SKIP not pass; PascalCase rename of probe
[Fact]s; deterministic deadline test; new frame-protocol error
tests; ComputeTransitions diff-coverage; relocate dev-rig probes
to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
TreatWarningsAsErrors / analysers apply; document
DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
bulk-read handles in CLI; surface AcknowledgeAlarm transport
faults through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
runWriteBulkVariant; document the six new subcommands in
writeUsage; drain galaxy-watch events on limit; switch io.EOF
comparisons to errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout
option; regex-based credential redaction; Long.toUnsignedString
for uint64 sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
_percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
_api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close
stop hard-coding correlation IDs; resync RustClientDesign.md
with the current Session / Error surface and CLI subcommand set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+10
-4
@@ -102,12 +102,18 @@ public string ResolveRequiredScope(object request)
         CloseSessionRequest => GatewayScopes.SessionClose,
         StreamEventsRequest => GatewayScopes.EventsRead,
         MxCommandRequest commandRequest => ResolveCommandScope(commandRequest.Command?.Kind ?? MxCommandKind.Unspecified),
+        AcknowledgeAlarmRequest => GatewayScopes.InvokeWrite,
+        QueryActiveAlarmsRequest => GatewayScopes.EventsRead,
+        TestConnectionRequest or
+        GetLastDeployTimeRequest or
+        DiscoverHierarchyRequest or
+        WatchDeployEventsRequest => GatewayScopes.MetadataRead,
         _ => GatewayScopes.Admin
     };
 }
 ```

-The `_ => GatewayScopes.Admin` fallback is intentional: any future request type that the resolver does not recognize fails closed, requiring the strongest scope until the resolver is updated.
+The `_ => GatewayScopes.Admin` fallback is intentional: any future request type that the resolver does not recognize fails closed, requiring the strongest scope until the resolver is updated. `AcknowledgeAlarm` is treated as a write — it mutates alarm state, mirroring `MxCommandKind.Write*` — and `QueryActiveAlarms` shares the alarm/event surface with `StreamEvents` and `MxCommandKind.DrainEvents`, so it carries `events:read`.

 `MxCommandRequest` is special because it multiplexes many MxAccess operations through a single RPC. The resolver inspects the embedded `MxCommandKind` so each operation gets its own scope:

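The fail-closed mapping in the hunk above can be illustrated with a minimal sketch. It is written in Python rather than the gateway's C# only to keep it short. The request classes are hypothetical stand-ins for the generated messages; only the scope strings and the request-to-scope pairs are taken from the documentation being changed.

```python
# Hypothetical stand-ins for the generated gRPC request messages.
class CloseSessionRequest: ...
class StreamEventsRequest: ...
class AcknowledgeAlarmRequest: ...
class QueryActiveAlarmsRequest: ...
class FutureUnmappedRequest: ...  # a request type the resolver has never seen

# Scope strings as listed in the authorization table.
_REQUEST_SCOPES = {
    CloseSessionRequest: "session:close",
    StreamEventsRequest: "events:read",
    AcknowledgeAlarmRequest: "invoke:write",  # mutates alarm state, so a write
    QueryActiveAlarmsRequest: "events:read",  # shares the alarm/event surface
}


def resolve_required_scope(request: object) -> str:
    """Return the scope a request needs; unknown types fail closed to admin."""
    return _REQUEST_SCOPES.get(type(request), "admin")
```

Because the lookup defaults to `admin`, forgetting to register a new request type locks non-admin keys out instead of silently granting access, which is the failure mode Server-017 fixed for the two alarm RPCs.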
@@ -188,10 +194,10 @@ blocking constraint; secured values and raw credentials are never logged.
 |----------|-------|--------------|
 | `SessionOpen` | `session:open` | `OpenSessionRequest` |
 | `SessionClose` | `session:close` | `CloseSessionRequest` |
-| `EventsRead` | `events:read` | `StreamEventsRequest`, `MxCommandKind.DrainEvents` |
+| `EventsRead` | `events:read` | `StreamEventsRequest`, `QueryActiveAlarmsRequest`, `MxCommandKind.DrainEvents` |
 | `InvokeRead` | `invoke:read` | `MxCommandRequest` for read-style command kinds (`Register`, `AddItem`, `Advise`, and any kind not otherwise mapped) |
-| `InvokeWrite` | `invoke:write` | `MxCommandKind.Write`, `MxCommandKind.Write2` |
-| `InvokeSecure` | `invoke:secure` | `MxCommandKind.WriteSecured`, `MxCommandKind.WriteSecured2`, `MxCommandKind.AuthenticateUser` |
+| `InvokeWrite` | `invoke:write` | `AcknowledgeAlarmRequest`, `MxCommandKind.Write`, `MxCommandKind.Write2`, `MxCommandKind.WriteBulk`, `MxCommandKind.Write2Bulk` |
+| `InvokeSecure` | `invoke:secure` | `MxCommandKind.WriteSecured`, `MxCommandKind.WriteSecured2`, `MxCommandKind.WriteSecuredBulk`, `MxCommandKind.WriteSecured2Bulk`, `MxCommandKind.AuthenticateUser` |
 | `MetadataRead` | `metadata:read` | `MxCommandKind.ArchestraUserToId`, `MxCommandKind.GetSessionState`, `MxCommandKind.GetWorkerInfo`, `GalaxyRepository.TestConnection`, `GalaxyRepository.GetLastDeployTime`, `GalaxyRepository.DiscoverHierarchy`, `GalaxyRepository.WatchDeployEvents` |
 | `Admin` | `admin` | `MxCommandKind.ShutdownWorker`, the default for any unrecognized request type, and the dashboard authorization policy |

@@ -23,6 +23,48 @@ the corresponding MXAccess `AddItem`, `Advise`, `UnAdvise`, and `RemoveItem`
 calls sequentially on the session STA and preserves input order in the result
 list.

+The command model also includes bulk write/read command kinds:
+`WriteBulk`, `Write2Bulk`, `WriteSecuredBulk`, `WriteSecured2Bulk`, and
+`ReadBulk`. They are unary `Invoke` payloads on the same `MxAccessGateway`
+surface (not separate gRPC methods) and exist so a caller can submit one list
+of items per round trip while preserving MXAccess parity per entry.
+
+- `WriteBulkCommand` / `Write2BulkCommand` / `WriteSecuredBulkCommand` /
+  `WriteSecured2BulkCommand` each carry a `server_handle` and a `repeated`
+  list of entries (`WriteBulkEntry`, `Write2BulkEntry`,
+  `WriteSecuredBulkEntry`, `WriteSecured2BulkEntry`). Each entry mirrors the
+  single-item command shape — `item_handle` + `value` (+ `timestamp_value` on
+  the `*2` variants, + `current_user_id` / `verifier_user_id` on the secured
+  variants). All four replies use `BulkWriteReply`, which carries
+  `repeated BulkWriteResult`. A `BulkWriteResult` has `server_handle`,
+  `item_handle`, `was_successful`, `optional int32 hresult`, `repeated
+  MxStatusProxy statuses`, and `error_message`. Per-entry failures populate
+  `error_message` + `hresult` and never raise — callers iterate and inspect
+  each entry. The credential-sensitive redaction rules for `WriteSecured` /
+  `WriteSecured2` apply to every `value` inside `WriteSecuredBulkEntry` and
+  `WriteSecured2BulkEntry`.
+
+- `ReadBulkCommand` carries `server_handle`, `repeated string tag_addresses`,
+  and `uint32 timeout_ms` (0 means use the gateway-configured default). The
+  reply is `BulkReadReply` carrying `repeated BulkReadResult`. A
+  `BulkReadResult` has `server_handle`, `tag_address`, `item_handle`,
+  `was_successful`, `was_cached`, `value`, `quality`, `source_timestamp`,
+  `repeated MxStatusProxy statuses`, and `error_message`. MXAccess has no
+  synchronous `Read`, so `ReadBulk` is dual-mode per entry: when a tag is
+  already advised in the session the worker returns the cached
+  `OnDataChange` payload without touching the subscription
+  (`was_cached = true`); otherwise the worker takes a full
+  `AddItem` + `Advise` + wait-for-first-`OnDataChange` + `UnAdvise` +
+  `RemoveItem` snapshot lifecycle and returns the result
+  (`was_cached = false`). The asymmetry that `BulkReadResult` has no
+  `hresult` field is intentional — `ReadBulk` outcomes are timeout / cache
+  / lifecycle states rather than MXAccess COM return codes.
+
+See `gateway.md` for the full cached-vs-snapshot `ReadBulk` lifecycle and the
+per-command scope requirements, and `docs/DesignDecisions.md` "Bulk Command
+Family" for the rationale behind the per-entry result shape (independent
+success tracking, input-order preservation, no partial-failure exceptions).
+
 `src/MxGateway.Contracts/Protos/mxaccess_worker.proto` defines the named-pipe
 worker IPC envelope and control messages. It imports
 `mxaccess_gateway.proto` so the worker and gateway use the same command, reply,
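The dual-mode `ReadBulk` behaviour described in the hunk above can be sketched as follows. This is a simplified Python illustration and not the worker's implementation. The `BulkReadResult` dataclass carries only a subset of the documented fields. The `advised_cache` dictionary, the `snapshot_read` callable, and the tag names are hypothetical stand-ins for the session's advised items and the `AddItem` + `Advise` + wait + `UnAdvise` + `RemoveItem` lifecycle, and mapping a timeout to a failed entry is an assumption based on the "no partial-failure exceptions" rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class BulkReadResult:
    # Subset of the documented fields; names follow the proto description.
    tag_address: str
    was_successful: bool
    was_cached: bool
    value: Optional[object] = None
    error_message: str = ""


def read_bulk(
    tag_addresses: List[str],
    advised_cache: Dict[str, object],
    snapshot_read: Callable[[str], object],
) -> List[BulkReadResult]:
    """Resolve each tag independently, preserving input order."""
    results: List[BulkReadResult] = []
    for tag in tag_addresses:
        if tag in advised_cache:
            # Already advised: return the cached payload, leave the subscription alone.
            results.append(BulkReadResult(tag, True, True, advised_cache[tag]))
            continue
        try:
            # Not advised: run the full snapshot lifecycle (hypothetical callable).
            value = snapshot_read(tag)
        except TimeoutError as exc:
            # Per-entry failure is recorded in the result, never raised.
            results.append(BulkReadResult(tag, False, False, None, str(exc)))
        else:
            results.append(BulkReadResult(tag, True, False, value))
    return results


def _fake_snapshot(tag: str) -> object:
    # Made-up behaviour for the demo: one tag never produces OnDataChange.
    if tag == "Pump2.Flow":
        raise TimeoutError("no OnDataChange within timeout_ms")
    return 7


results = read_bulk(
    ["Pump1.Level", "Pump2.Flow", "Pump3.Temp"],
    {"Pump1.Level": 42},
    _fake_snapshot,
)
```

Callers then iterate `results` and check `was_successful` per entry; a timeout on one tag does not affect the entries before or after it.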
+57
-5
@@ -51,14 +51,29 @@ shutdown request even when a command or event assertion fails. Cleanup failures
 in that `finally` block are logged rather than thrown, so a real assertion
 failure is never masked by a shutdown timeout.

-`WorkerLiveMxAccessSmokeTests` additionally covers two MXAccess parity paths the
+`WorkerLiveMxAccessSmokeTests` additionally covers five MXAccess parity paths the
 fake-worker tests cannot validate:

-- a `Write` round-trip against an advised item, and
+- a `Write` round-trip against an advised item, asserting both that the reply is
+  `Ok` / `MxCommandKind.Write` *and* that the worker emits a matching
+  `OnWriteComplete` event for the targeted (server, item) handle pair — the
+  same round-trip proof used by `scripts/run-client-e2e-tests.ps1`,
 - an `AddItem` against an invalid server handle, asserting the MXAccess failure
-  surfaces in the command reply without faulting the gateway transport.
+  surfaces in the command reply without faulting the gateway transport,
+- the `UnAdvise` → `RemoveItem` → `Unregister` teardown chain, asserting each
+  step replies `Ok` with the matching `MxCommandKind`, that no further
+  `OnDataChange` events arrive for the un-advised pair, and that a second
+  `RemoveItem` against the freed handle relays a non-`Ok` MXAccess failure,
+- a `WriteSecured` round-trip after `AuthenticateUser`, asserting the reply
+  carries `MxCommandKind.WriteSecured` and the credential password never
+  appears in the diagnostic message (parity for both the secured-write
+  ordering rule and the "do not log secrets" contract), and
+- an abnormal worker exit (the worker process is killed mid-session) where the
+  gateway must transition the session to `SessionState.Faulted` with a
+  non-empty fault description carrying a known worker-client classification
+  (pipe disconnected / worker faulted / end-of-stream / heartbeat expired).

-All three tests are gated by the same `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`
+All six tests are gated by the same `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`
 opt-in variable.

 Build the worker before running the smoke:
@@ -81,7 +96,9 @@ Optional live smoke variables:
 | `MXGATEWAY_LIVE_MXACCESS_WORKER_EXE` | First existing `MxGateway.Worker.exe` under `src/MxGateway.Worker/bin/...` | Worker executable path. Set this when running against a packaged worker or a non-default build output. |
 | `MXGATEWAY_LIVE_MXACCESS_ITEM` | `TestChildObject.TestInt` | MXAccess item reference used by `AddItem`. |
 | `MXGATEWAY_LIVE_MXACCESS_CLIENT_NAME` | `MxGateway.IntegrationTests` | Client name passed to `Register`. |
-| `MXGATEWAY_LIVE_MXACCESS_EVENT_TIMEOUT_SECONDS` | `15` | Maximum wait for the first `OnDataChange`. |
+| `MXGATEWAY_LIVE_MXACCESS_EVENT_TIMEOUT_SECONDS` | `15` | Maximum wait for the first `OnDataChange` (also used for the `OnWriteComplete` round-trip and the abnormal-exit fault transition). |
+| `MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_USER` | `admin` | ArchestrA user name passed to `AuthenticateUser` before the `WriteSecured` parity step. |
+| `MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_PASSWORD` | `admin123` | Password paired with the user above. Never logged; the test asserts the value does not appear in the WriteSecured diagnostic message. |

 The test output includes session id, worker process id, command status,
 HRESULT/status diagnostics, event sequence and handles, close status, and worker
@@ -116,6 +133,41 @@ Optional live Galaxy variables:
 The default connection string targets `ZB` on `localhost` with Windows
 authentication, which matches the Galaxy Repository conventions in CLAUDE.md.

+## Galaxy Filter Safety
+
+`GalaxyFilterInputSafetyTests` in `src/MxGateway.Tests/Galaxy/` covers adversarial
+input handling for the Galaxy Repository browse filter layer. It runs in the
+unit-test project (no live SQL needed) and complements the live SQL coverage in
+`GalaxyRepositoryLiveTests`.
+
+The test class re-frames the original "Galaxy SQL injection" concern (Tests-002 in
+`code-reviews/Tests/findings.md`). `GalaxyRepository` issues only four *constant*
+SQL statements (`HierarchySql`, `AttributesSql`, `SELECT 1`,
+`SELECT time_of_last_deploy FROM galaxy`) — no `DiscoverHierarchyRequest` field
+is ever concatenated into a SQL string, so there is no dynamic SQL surface and no
+`LIKE`-escaping helper to test. All filters (`TagNameGlob`, `RootTagName`,
+template-chain, category, contained-path) are applied **in memory** by
+`GalaxyHierarchyProjector` / `GalaxyGlobMatcher` against the cached snapshot.
+
+The adversarial-input matrix (`'`, `' OR '1'='1`, `'; DROP TABLE gobject;--`,
+`%`, `_`, `100%_off`, `[abc]`, `Pump'001`) pins the following invariants:
+
+- SQL metacharacters (`'`, `;`) and `LIKE`-wildcards (`%`, `_`) are treated as
+  opaque literals by `GalaxyGlobMatcher` — they never act as wildcards, never
+  spuriously match unrelated text.
+- Only `*` and `?` are glob wildcards.
+- `GalaxyGlobMatcher` applies a 100 ms regex timeout so a pathological glob
+  (e.g. 5 000 `a` characters plus a literal `!`) completes promptly rather than
+  catastrophically backtracking.
+- `GalaxyHierarchyProjector` returns zero matches (rather than the whole
+  hierarchy) for an adversarial `TagNameGlob` or `TemplateChainContains`, and
+  surfaces `NotFound` for an adversarial `RootTagName`.
+- The `DiscoverHierarchy` RPC end-to-end returns zero matches for adversarial
+  `TagNameGlob` rather than faulting.
+
+These invariants are the real security surface of the Galaxy browse path; the
+SQL-injection framing does not apply to a constant-query layer.
+
 ## Live LDAP

 `DashboardLdapLiveTests` in `src/MxGateway.IntegrationTests/` exercises
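The "only `*` and `?` are wildcards" invariant from the Galaxy Filter Safety section above can be sketched in a few lines. This is a Python illustration, not the `GalaxyGlobMatcher` source: the function names are hypothetical, case-sensitive matching is an assumption, and the 100 ms regex timeout is not reproduced because Python's standard `re` module has no match-timeout parameter.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a glob where only '*' and '?' are special into a regex."""
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append(".*")  # any run of characters, including none
        elif ch == "?":
            parts.append(".")   # exactly one character
        else:
            # Everything else, including ', ;, %, _, [ and ], is a literal.
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def glob_match(glob: str, text: str) -> bool:
    """Return True when the whole text matches the glob."""
    return glob_to_regex(glob).fullmatch(text) is not None
```

Escaping every non-wildcard character is what keeps `%`, `_`, and `[abc]` from acting as `LIKE` wildcards or regex character classes: `glob_match("[abc]", "a")` is false, while `glob_match("[abc]", "[abc]")` is true.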
@@ -655,12 +655,22 @@ the event queue implementation owns those counters.

 The STA watchdog currently emits a `WorkerFault` with
 `WorkerFaultCategory.StaHung` when `LastStaActivityUtc` is older than
-`WorkerPipeSessionOptions.HeartbeatGrace`. The fault includes the current
-command correlation id when a command is active. Command duration and high event
-queue depth remain observable through heartbeat fields until dedicated
-thresholds own those warnings. The worker reports stale STA activity, but the
-gateway owns the final kill decision through its existing heartbeat and worker
-lifecycle policy.
+`WorkerPipeSessionOptions.HeartbeatGrace` **and no command is in flight**.
+`StaRuntime.ProcessQueuedCommands` calls `MarkActivity()` only immediately
+before and after each work item, so a synchronously long-running STA command
+(for example a `ReadBulk` waiting `timeout_ms` for the first `OnDataChange`)
+legitimately freezes `LastStaActivityUtc` for the duration of the wait while
+the worker is healthy. The watchdog is therefore suppressed while the
+heartbeat snapshot's `CurrentCommandCorrelationId` is non-empty: the worker is
+busy executing a command, not hung, and the heartbeat already surfaces the
+in-flight correlation id so the gateway can apply its own per-command timeout
+if it considers the command too slow. The fault still fires on a truly hung
+STA — no command in flight and no activity for longer than `HeartbeatGrace` —
+which is the only case the watchdog can usefully distinguish from a slow
+command. Command duration and high event queue depth remain observable through
+heartbeat fields until dedicated thresholds own those warnings. The worker
+reports stale STA activity, but the gateway owns the final kill decision
+through its existing heartbeat and worker lifecycle policy.

 ## Shutdown

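The revised watchdog rule in the hunk above reduces to a two-condition check. The following Python sketch shows that decision only; the function name and parameter names are hypothetical (they mirror `LastStaActivityUtc`, `HeartbeatGrace`, and `CurrentCommandCorrelationId`), and treating "older than" as a strict comparison is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_report_sta_hung(
    now: datetime,
    last_sta_activity_utc: datetime,
    heartbeat_grace: timedelta,
    current_command_correlation_id: Optional[str],
) -> bool:
    """Decide whether the watchdog should emit a StaHung fault."""
    if current_command_correlation_id:
        # A command is in flight: the STA is busy, not hung, so stay quiet.
        return False
    # No command in flight: stale activity now really means a hung STA.
    return (now - last_sta_activity_utc) > heartbeat_grace
```

With a 30-second grace and 60 seconds of inactivity, the check returns `False` while a correlation id is set (for example during a slow `ReadBulk`) and `True` once the id is empty.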
+9
-2
@@ -33,12 +33,19 @@ public void TransitionTo(SessionState nextState)
             return;
         }

+        if (_state is SessionState.Closing
+            && nextState is not SessionState.Closed
+            && nextState is not SessionState.Faulted)
+        {
+            return;
+        }
+
         _state = nextState;
     }
 }
 ```

-`Closed` is terminal and `Faulted` only allows a transition to `Closed`. This guards against late callbacks (worker exit, heartbeat timeout) re-animating a session that is already torn down.
+`Closed` is terminal, `Faulted` only allows a transition to `Closed`, and `Closing` only allows a transition to `Closed` or `Faulted`. This guards against late callbacks (worker exit, heartbeat timeout) re-animating a session that is already tearing down or torn down — once `CloseAsync` has set `Closing` under `_syncRoot`, no `TransitionTo(Ready)` from another thread can walk the session back to `Ready`. Both close-related writes (`Closing` and `Closed`) go through `_syncRoot` exactly like every other state write; `_closeLock` only serializes concurrent close attempts.

 ### SessionManager (ISessionManager)

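The transition rules stated in the `TransitionTo` hunk above (`Closed` is terminal, `Faulted` may only move to `Closed`, `Closing` may only move to `Closed` or `Faulted`) can be sketched as a small guarded state holder. This Python version is illustrative only: the enum lists just the four states named in the text (the real `SessionState` may have more), and the `Session` class is a hypothetical stand-in for `GatewaySession`.

```python
import threading
from enum import Enum, auto


class SessionState(Enum):
    # Only the states named in the documentation; the real enum may differ.
    READY = auto()
    CLOSING = auto()
    CLOSED = auto()
    FAULTED = auto()


class Session:
    def __init__(self) -> None:
        self._sync_root = threading.Lock()  # guards every read/write of _state
        self._state = SessionState.READY

    @property
    def state(self) -> SessionState:
        with self._sync_root:
            return self._state

    def transition_to(self, next_state: SessionState) -> None:
        with self._sync_root:
            if self._state is SessionState.CLOSED:
                return  # terminal: nothing can revive a closed session
            if (self._state is SessionState.FAULTED
                    and next_state is not SessionState.CLOSED):
                return  # a faulted session can only be closed
            if (self._state is SessionState.CLOSING
                    and next_state not in (SessionState.CLOSED, SessionState.FAULTED)):
                return  # a late callback cannot walk the session back to READY
            self._state = next_state
```

Rejected transitions are silent no-ops rather than exceptions, which matches the early `return` statements in the C# guard: a late worker-exit or heartbeat callback simply has no effect.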
@@ -184,7 +191,7 @@ Sessions open with `MxGateway:Sessions:DefaultLeaseSeconds` (default 1800) added

 ### Close

-`GatewaySession.CloseAsync` is serialized by a per-session `SemaphoreSlim` (`_closeLock`). It transitions to `Closing`, asks the worker client to shut down within `ShutdownTimeout`, and on success transitions to `Closed`. If `WorkerClient.ShutdownAsync` throws, the session falls back to `IWorkerClient.Kill` (forced close):
+`GatewaySession.CloseAsync` is serialized by a per-session `SemaphoreSlim` (`_closeLock`) so only one close runs at a time, but every read/write of `_state` still passes through `_syncRoot` (via `TryBeginClose` and `MarkClosed`). The close path therefore obeys the same lock discipline as `TransitionTo` / `MarkFaulted`: it transitions to `Closing`, asks the worker client to shut down within `ShutdownTimeout`, and on success transitions to `Closed`. `DisposeAsync` waits on `_closeLock` once before disposing the semaphore so an in-flight close's `Release()` cannot race against the dispose. If `WorkerClient.ShutdownAsync` throws, the session falls back to `IWorkerClient.Kill` (forced close):

 ```csharp
 if (_workerClient is not null)