Code-review 2026-05-20 sweep #2: re-review at a020350, resolve 48 findings

Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.

High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
  pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
  string (it must be a valid SPDX expression), so `pip wheel .` and
  `pip install -e .` both fail before any source compiles. Tests
  still pass because pytest bypasses the build backend via
  `pythonpath`. Dropped the invalid license string, kept the
  `License :: Other/Proprietary License` classifier, and added
  `tests/test_packaging.py` so a future regression of the same shape
  is caught in CI.

Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
  on WorkerPipeSessionOptions bounds the in-flight-command watchdog
  suppression so a truly stuck COM call still triggers StaHung
  instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
  cross-language bench comparison is apples-to-apples again;
  `failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
  serialisation pattern to DeployEventStream so close() arriving
  after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
  stability check after UnAdvise instead of strict equality against
  the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
  log sink the WriteSecured live test owns (worker stdout/stderr,
  gateway logs, direct WriteLine) so the credential is proven
  absent from the full output buffer, not just the diagnostic
  message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
  for the previously-uncovered Write2Bulk and WriteSecured2Bulk
  arms of WriteBulkConstraintPlan.SetPayload.

Lows (41 — highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
  GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
  AlarmsOptions validated at startup (Server-026); Authorization.md
  Constraint Enforcement snippet/prose enumerate the bulk write/read
  family (Server-027); bulk-read-commands and bulk-write-commands
  capability tokens added to OpenSession (Server-029);
  NotWiredAlarmRpcDispatcher XML doc and missing scope-resolver and
  state-machine tests cleaned up (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
  guard the poll path uses, at every command entry (Worker-024);
  RunAsync null-checks the runtime-session factory result
  (Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
  GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
  rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
  CancelCommandReturnValue serialised under lock (Worker.Tests-027);
  Probes namespace lifted to MxGateway.Worker.Tests.Probes
  (Worker.Tests-029); cancel-envelope sequence numbers monotonised
  (Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
  section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
  (Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
  test backed by a TaskCompletionSource fake (Tests-022); companion
  FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
  (Tests-023); constraint plan reply-count divergence pinned
  (Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
  end-to-end (IntegrationTests-018); abnormal-exit keyword set
  tightened to pipe-disconnected/end-of-stream and the test now
  asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
  default 30s wall-clock budget doesn't kill them (015);
  BenchStreamEventsAsync observes the inner stream task on every
  exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
  %w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
  RFC3339Nano with fractional seconds (019); runStreamEvents installs
  signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
  table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
  cancellation contract Client.Java-015 established (022); stream-events
  text path uses Long.toUnsignedString for worker_sequence (023);
  bench-read-bulk no longer pollutes success-latency histogram with
  failure durations (024); --shutdown-timeout CLI option propagates
  through to ClientOptions (025); seven new MxGatewayCliTests cover
  the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
  wheel-build smoke test added under tests/test_packaging.py (020);
  README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
  document the AsRef<str> read_bulk genericism (019);
  next_correlation_id re-exported at the crate root, with a
  property-style doc contract and an explicit disclaimer that the
  literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
  IConstraintEnforcer mechanism instead of "tag-allowlist filter"
  (014); BulkReadResult gains explicit per-arm payload-population
  documentation for the success vs failure cases (015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-20 10:28:54 -04:00
parent a0203503a7
commit 1aafd6bde4
74 changed files with 3349 additions and 395 deletions
+19 -8
View File
@@ -123,10 +123,14 @@ private static string ResolveCommandScope(MxCommandKind kind)
return kind switch
{
MxCommandKind.Write or
MxCommandKind.Write2 => GatewayScopes.InvokeWrite,
MxCommandKind.Write2 or
MxCommandKind.WriteBulk or
MxCommandKind.Write2Bulk => GatewayScopes.InvokeWrite,
MxCommandKind.WriteSecured or
MxCommandKind.WriteSecured2 or
MxCommandKind.WriteSecuredBulk or
MxCommandKind.WriteSecured2Bulk or
MxCommandKind.AuthenticateUser => GatewayScopes.InvokeSecure,
MxCommandKind.ArchestraUserToId or
@@ -141,7 +145,7 @@ private static string ResolveCommandScope(MxCommandKind kind)
}
```
Reads (`Register`, `AddItem`, `Advise`, and any other unspecified kind) fall through to `InvokeRead`, which keeps the matrix small while still separating reads from writes, secured writes, metadata lookups, event drains, and worker shutdown.
Reads (`Register`, `AddItem`, `Advise`, `ReadBulk`, and any other unspecified kind) fall through to `InvokeRead`, which keeps the matrix small while still separating reads from writes, secured writes, metadata lookups, event drains, and worker shutdown. The four bulk-write families (`WriteBulk`, `Write2Bulk`, `WriteSecuredBulk`, `WriteSecured2Bulk`) are mapped explicitly so a missing arm cannot silently demote a bulk write to a read scope.
## Constraint Enforcement
@@ -174,11 +178,18 @@ page create dialog (see
dashboard API Keys page also renders each key's effective constraints.
The service checks read constraints for `AddItem`, `AddItem2`, `AddItemBulk`,
`SubscribeBulk`, and `AdviseItemBulk`. It checks write constraints for
`Write`, `Write2`, `WriteSecured`, and `WriteSecured2`. Successful item
registrations are tracked per session so later item-handle commands resolve
back to the original tag address. If a constrained key presents an unknown item
handle, the gateway fails closed.
`SubscribeBulk`, `AdviseItemBulk`, and `ReadBulk`. It checks write constraints
for `Write`, `Write2`, `WriteSecured`, `WriteSecured2`, `WriteBulk`,
`Write2Bulk`, `WriteSecuredBulk`, and `WriteSecured2Bulk`. Bulk commands run
through `BulkConstraintPlan` (`ReadBulkConstraintPlan`,
`WriteBulkConstraintPlan`, `SubscribeBulkConstraintPlan`), which preserves the
caller's input order: each entry is evaluated against the constraint surface,
and `BulkConstraintPlan.MergeDeniedInto` re-merges denied entries back into
their original index positions so the reply slot at `entries[i]` always
corresponds to the request slot at `entries[i]`. Successful item registrations
are tracked per session so later item-handle commands resolve back to the
original tag address. If a constrained key presents an unknown item handle,
the gateway fails closed.
Non-bulk constraint failures return gRPC `PermissionDenied`. Bulk read
commands preserve input order and return a failed `SubscribeResult` for each
@@ -195,7 +206,7 @@ blocking constraint; secured values and raw credentials are never logged.
| `SessionOpen` | `session:open` | `OpenSessionRequest` |
| `SessionClose` | `session:close` | `CloseSessionRequest` |
| `EventsRead` | `events:read` | `StreamEventsRequest`, `QueryActiveAlarmsRequest`, `MxCommandKind.DrainEvents` |
| `InvokeRead` | `invoke:read` | `MxCommandRequest` for read-style command kinds (`Register`, `AddItem`, `Advise`, and any kind not otherwise mapped) |
| `InvokeRead` | `invoke:read` | `MxCommandRequest` for read-style command kinds (`Register`, `AddItem`, `Advise`, `ReadBulk`, and any kind not otherwise mapped) |
| `InvokeWrite` | `invoke:write` | `AcknowledgeAlarmRequest`, `MxCommandKind.Write`, `MxCommandKind.Write2`, `MxCommandKind.WriteBulk`, `MxCommandKind.Write2Bulk` |
| `InvokeSecure` | `invoke:secure` | `MxCommandKind.WriteSecured`, `MxCommandKind.WriteSecured2`, `MxCommandKind.WriteSecuredBulk`, `MxCommandKind.WriteSecured2Bulk`, `MxCommandKind.AuthenticateUser` |
| `MetadataRead` | `metadata:read` | `MxCommandKind.ArchestraUserToId`, `MxCommandKind.GetSessionState`, `MxCommandKind.GetWorkerInfo`, `GalaxyRepository.TestConnection`, `GalaxyRepository.GetLastDeployTime`, `GalaxyRepository.DiscoverHierarchy`, `GalaxyRepository.WatchDeployEvents` |
+30
View File
@@ -104,6 +104,36 @@ The test output includes session id, worker process id, command status,
HRESULT/status diagnostics, event sequence and handles, close status, and worker
stdout/stderr lines emitted during the run.
## Dev-rig Probes
`src/MxGateway.Worker.Tests/Probes/` partitions runtime probes from the regular
Worker.Tests regression suite. The folder is its own
`MxGateway.Worker.Tests.Probes` namespace so a discovery filter (e.g. `dotnet
test --filter FullyQualifiedName~MxGateway.Worker.Tests.Probes`) can target or
exclude them without enumerating individual class names. The probes are
`[Fact(Skip = "...")]` by default and exist to characterize live AVEVA
behavior on the dev rig, not to gate CI — flip `Skip = null` on the dev box
with installed MXAccess + a running Galaxy provider when running them:
- `AlarmsLiveSmokeTests` — end-to-end smoke for the alarms-over-gateway
pipeline (`WnWrapAlarmConsumer` + `AlarmDispatcher` +
`MxAccessAlarmEventSink`) against `\\<machine>\Galaxy!DEV` with the dev rig's
10-second flip script writing `TestMachine_001.TestAlarm001`.
- `AlarmClientWmProbeTests` — registers as an `AlarmClient` consumer on a real
hidden message-only window and logs every Win32 message that arrives during
a fixed pump window. Used to identify the `WM_APP` /
`RegisterWindowMessage` IDs alarm callbacks use.
- `WnWrapConsumerProbeTests` — instantiates AVEVA's standalone `wnwrapConsumer`
COM class, subscribes to the dev rig's `\\<machine>\Galaxy!DEV` provider,
and polls `GetXmlCurrentAlarms2`. The XML payload bypasses the
`FILETIME→DateTime` auto-marshaling that crashes
`aaAlarmManagedClient.AlarmClient.GetHighPriAlarm` on this rig.
The probes share the Worker.Tests project (so they can use its `net48`/`x86`
configuration and the installed `ArchestrA.MxAccess` / `aaAlarmManagedClient`
references), but they are not part of the regression contract — a Worker.Tests
run with `Skip` left in place passes them as skipped.
## Live Galaxy Repository
`GalaxyRepositoryLiveTests` in `src/MxGateway.IntegrationTests/Galaxy/` exercises
+17
View File
@@ -672,6 +672,23 @@ heartbeat fields until dedicated thresholds own those warnings. The worker
reports stale STA activity, but the gateway owns the final kill decision
through its existing heartbeat and worker lifecycle policy.
The in-flight-command suppression itself is bounded by
`WorkerPipeSessionOptions.HeartbeatStuckCeiling` (default 75 seconds = 5 ×
`HeartbeatGrace`). The motivating case for the suppression is a legitimately
slow synchronous command — but a genuinely stuck COM call (for example
against a dead MXAccess provider whose cross-apartment marshaler is
permanently blocked, or a write completion that never fires) leaves
`CurrentCommandCorrelationId` non-empty indefinitely. Without an upper bound
the worker-side `StaHung` watchdog would be permanently defeated for that
session and only the gateway's per-command timeout would catch the hang —
losing the worker-originated diagnostic (`StaHung` fault category, the
stale-by interval) from the gateway audit trail. Once `LastStaActivityUtc`
has been stale for longer than `HeartbeatStuckCeiling`, the watchdog fires
`StaHung` regardless of whether a command is in flight, on the assumption
that no legitimate STA command should run that long without periodically
refreshing activity. Deployments that legitimately run very long bulk
operations should raise the ceiling rather than disable it.
## Shutdown
Graceful shutdown sequence: