Regenerate the Python protobuf stubs and replace query_active_alarms
with stream_alarms, an AsyncIterator over AlarmFeedMessage served by
the gateway's central alarm monitor (snapshot, snapshot_complete, then
live transitions). Drops session_id from the acknowledge surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cross-language e2e matrix spawned one CLI process per operation —
~250 per client — paying a process (and, for the Java CLI, a full JVM)
cold-start every time. The Java leg alone ran ~16 minutes.
Each client CLI (dotnet, go, rust, python, java) gains a `batch`
subcommand: a single process that reads one command line from stdin,
runs it through the normal subcommand dispatch, writes the JSON result,
then a line containing exactly `__MXGW_BATCH_EOR__`. A failing command
writes its `{"error":...}` envelope and the loop continues.
run-client-e2e-tests.ps1 now launches one batch process per client and
pings every operation through its stdin/stdout, so startup is paid once
per client. The orchestration and assertions are unchanged; the parity
and auth phases now read the `{"error":...}` envelope instead of a
process exit code.
Full 5-client matrix with -VerifyWrite: ~15 min, down from ~35; the Java
leg dropped from ~16 min to ~2-3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.
High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
string (it must be a valid SPDX expression), so `pip wheel .` and
`pip install -e .` both fail before any source compiles. Tests
still pass because pytest bypasses the build backend via
`pythonpath`. Dropped the invalid license string, kept the
`License :: Other/Proprietary License` classifier, and added
`tests/test_packaging.py` so a future regression of the same shape
is caught in CI.
Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
on WorkerPipeSessionOptions bounds the in-flight-command watchdog
suppression so a truly stuck COM call still triggers StaHung
instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
cross-language bench comparison is apples-to-apples again;
`failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
serialisation pattern to DeployEventStream so close() arriving
after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
stability check after UnAdvise instead of strict equality against
the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
log sink the WriteSecured live test owns (worker stdout/stderr,
gateway logs, direct WriteLine) so the credential is proven
absent from the full output buffer, not just the diagnostic
message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
for the previously-uncovered Write2Bulk and WriteSecured2Bulk
arms of WriteBulkConstraintPlan.SetPayload.
Lows (41 — highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
AlarmsOptions validated at startup (Server-026); Authorization.md
Constraint Enforcement snippet/prose enumerate the bulk write/read
family (Server-027); bulk-read-commands and bulk-write-commands
capability tokens added to OpenSession (Server-029);
NotWiredAlarmRpcDispatcher XML doc and missing scope-resolver and
state-machine tests cleaned up (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
guard the poll path uses, at every command entry (Worker-024);
RunAsync null-checks the runtime-session factory result
(Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
CancelCommandReturnValue serialised under lock (Worker.Tests-027);
Probes namespace lifted to MxGateway.Worker.Tests.Probes
(Worker.Tests-029); cancel-envelope sequence numbers monotonised
(Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
(Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
test backed by a TaskCompletionSource fake (Tests-022); companion
FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
(Tests-023); constraint plan reply-count divergence pinned
(Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
end-to-end (IntegrationTests-018); abnormal-exit keyword set
tightened to pipe-disconnected/end-of-stream and the test now
asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
default 30s wall-clock budget doesn't kill them (015);
BenchStreamEventsAsync observes the inner stream task on every
exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
%w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
RFC3339Nano with fractional seconds (019); runStreamEvents installs
signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
cancellation contract Client.Java-015 established (022); stream-events
text path uses Long.toUnsignedString for worker_sequence (023);
bench-read-bulk no longer pollutes success-latency histogram with
failure durations (024); --shutdown-timeout CLI option propagates
through to ClientOptions (025); seven new MxGatewayCliTests cover
the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
wheel-build smoke test added under tests/test_packaging.py (020);
README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
document the AsRef<str> read_bulk genericism (019);
next_correlation_id re-exported at the crate root, with a
property-style doc contract and an explicit disclaimer that the
literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
IConstraintEnforcer mechanism instead of "tag-allowlist filter"
(014); BulkReadResult gains explicit per-arm payload-population
documentation for the success vs failure cases (015).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and
fixed them in three priority waves (3 High, 17 Medium, 52 Low).
Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
GatewayGrpcScopeResolver so non-admin keys can use them; document
the mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the
CLI FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression
in generated tonic code by reformatting the ReadBulkCommand proto
comment and scoping a #![allow(...)] to the generated submodules.
Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and
make DisposeAsync race-safe against in-flight CloseAsync (-016);
add constraint-enforcement test coverage for the bulk-plan path
(-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop
can distinguish graceful shutdown from a real STA-affinity
violation (-016); have the watchdog skip StaHung while
CurrentCommandCorrelationId is non-empty so a legitimate slow
ReadBulk no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the
11 GatewaySession bulk methods (-013); replace the real TCP probe
in GalaxyHierarchyCacheTests with an IGalaxyRepository fake
(-016).
- IntegrationTests: drive the StreamEvents writer in the live Write
test and assert OnWriteComplete (-012); add live tests for
Unadvise/RemoveItem/Unregister ordering, WriteSecured, and
abnormal worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
CreateForTesting factory (-016); cover WorkerCancel and
unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races
beforeStart() (-014); return a CancellingCompletableFuture that
actually forwards cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in
the CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
histograms with failed-call durations (-015); add coverage for
the five MalformedReply paths, the bulk-write helpers, the
Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write
command family (-009).
Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
WorkerAlarmRpcDispatcher missing-session handling; drop the
duplicate dashboard @page routes; refresh IAlarmRpcDispatcher
XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
subscriptionExpression / ExecutingCommand arms; preserve
factory-supplied runtime sessions; split MxAlarmSnapshot.cs into
three files.
- Tests: dispose the WebApplication in seven test classes; rebuild
FakeWorkerProcess.WaitForExitAsync against a real TaskCompletion
source; switch the heartbeat-expires test to ManualTimeProvider;
add InvariantCulture to the remaining DateTimeOffset.Parse sites;
document GalaxyFilterInputSafetyTests in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
IDisposable, class-level [Trait], single-source ZB default
connection string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact
so absent env vars SKIP not pass; PascalCase rename of probe
[Fact]s; deterministic deadline test; new frame-protocol error
tests; ComputeTransitions diff-coverage; relocate dev-rig probes
to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
TreatWarningsAsErrors / analysers apply; document
DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
bulk-read handles in CLI; surface AcknowledgeAlarm transport
faults through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
runWriteBulkVariant; document the six new subcommands in
writeUsage; drain galaxy-watch events on limit; switch io.EOF
comparisons to errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout
option; regex-based credential redaction; Long.toUnsignedString
for uint64 sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
_percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
_api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close
stop hard-coding correlation IDs; resync RustClientDesign.md
with the current Session / Error surface and CLI subcommand set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a bench-read-bulk subcommand to every client CLI (.NET, Go, Rust,
Python, Java) and a PowerShell driver that runs all five concurrently
against the deployed gateway and prints a side-by-side comparison.
Each CLI''s bench:
- Opens its own session, registers, subscribes to bulk-size tags so the
worker''s MxAccessValueCache populates from real OnDataChange events.
- Runs a warmup-seconds-long pre-loop with identical calls so JIT /
connection-pool / first-call overhead is amortised before the
measurement window.
- Runs ReadBulk in a tight in-process loop for duration-seconds with
per-call high-resolution latency capture (Stopwatch in .NET,
time.Now in Go, std::time::Instant in Rust, time.perf_counter in
Python, System.nanoTime in Java).
- Unsubscribes + closes the session, then emits one JSON object with
the shared schema: { language, durationMs, totalCalls, successfulCalls,
failedCalls, totalReadResults, cachedReadResults, callsPerSecond,
latencyMs: { p50, p95, p99, max, mean } }.
The PS driver (scripts/bench-read-bulk.ps1) launches one detached process
per client, waits for all to finish, parses the trailing JSON object from
each stdout, prints a comparison table, and persists the combined report
under artifacts/bench/. Quoting around Java''s `gradle --args="..."` is
handled by writing a one-shot .bat that cmd.exe runs; the .NET CLI''s
per-call gRPC timeout is auto-scaled to (Duration + Warmup + 30s) so the
channel-wide timeout doesn''t cancel the bench mid-loop.
Live 30-second steady-state run against the deployed gateway, all five
clients hitting the same six TestMachine_001..006.TestChangingInt tags:
client calls/sec cached/total p50 ms p95 ms p99 ms max ms
dotnet 171.78 30924/30924 3.84 14.06 40.41 542.48
go 175.46 31590/31590 3.93 13.52 41.26 243.00
rust 123.26 22188/22188 5.52 15.78 48.11 544.41
python 145.79 26244/26244 4.86 14.85 41.65 645.84
java 181.12 32604/32604 3.80 10.59 33.37 344.27
143,550 ReadBulk results across all five clients during the 30s window;
100% were was_cached = true (the worker''s cache fast-path never fell
through to the snapshot lifecycle). Aggregate read throughput ~800
calls/sec against five concurrent sessions sharing the same cached tags.
A second variant with bulk-size 20 sustained the same per-client call
rate while delivering 3.3x more values per call (~37,000 cached reads/sec
aggregate across the five concurrent sessions), confirming the linear
per-tag cache lookup inside one call is not a bottleneck at this scale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit added the bulk read/write library surface in every
client; this commit makes that surface reachable from each client's CLI
and exercises it through scripts/run-client-e2e-tests.ps1.
Five new subcommands in every client CLI (.NET / Go / Rust / Python /
Java): read-bulk, write-bulk, write2-bulk, write-secured-bulk, and
write-secured2-bulk. Each follows the existing subscribe-bulk shape:
- read-bulk takes --server-handle, --items <csv tag list>, and
--timeout-ms (0 = worker default). JSON output carries the
BulkReadResult fields, including was_cached so the e2e matrix can
verify the cached-path semantics.
- The four bulk-write families take --server-handle, --item-handles
<csv>, --type, --values <csv>. write2-bulk and write-secured2-bulk
add a single --timestamp applied to every entry; the secured
variants take --current-user-id and --verifier-user-id. All four
output BulkWriteResult JSON.
A new -SkipReadWriteBulk switch on the matrix script (default OFF)
controls two new e2e phases:
- After the existing subscribe-bulk phase leaves tags advised, the
script runs read-bulk against the same tag list and asserts most
results return was_cached = true. This is the only e2e coverage of
the cache-then-snapshot fork — the unit + gateway tests verify the
semantics with a fake worker, but only the live cross-language
matrix proves the cache populates from real OnDataChange events and
survives the round-trip through every client''s JSON parser.
- When -VerifyWrite is set, the write phase now also runs a single-
entry write-bulk against the same writable item handle (using a
distinct sentinel value) and asserts a per-entry success. Confirms
the BulkWriteResult wire format end-to-end without complicating
the OnWriteComplete echo assertion the single-item phase already
verifies.
Dry-run validation passes for all five clients: each emits the correct
read-bulk and write-bulk CLI invocations with the right flags.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds five new MXAccess command kinds (WriteBulk, Write2Bulk,
WriteSecuredBulk, WriteSecured2Bulk, ReadBulk) that ride the existing
"one round-trip, per-entry results" bulk shape used by AddItemBulk and
SubscribeBulk today. MXAccess COM has no native bulk API; the worker
runs each bulk operation as a sequential loop on its STA, returning
one BulkWriteResult / BulkReadResult per requested entry so per-item
MXAccess failures surface as was_successful=false rather than throwing.
ReadBulk has no MXAccess analogue. The worker satisfies it by:
- Returning the last cached OnDataChange payload (was_cached=true)
when the requested tag is already in the session''s item registry
AND advised — the existing subscription is NOT touched, since the
caller did not create it.
- Otherwise taking the AddItem + Advise + wait-for-OnDataChange +
UnAdvise + RemoveItem snapshot lifecycle itself (was_cached=false)
and leaving the session exactly as it was. The wait pumps Windows
messages on the STA so the inbound MXAccess event can dispatch
while the executor still holds the thread.
The new MxAccessValueCache lives on each MxAccessSession, shared with
MxAccessBaseEventSink which populates it on every OnDataChange after
the event clears the outbound queue. Eviction on RemoveItem keeps
reused MXAccess handles from serving stale values from a previous
lifetime.
Gateway-side authorization wires WriteBulk/Write2Bulk to invoke:write,
WriteSecuredBulk/WriteSecured2Bulk to invoke:secure, ReadBulk to
invoke:read. The constraint-filter pipeline is refactored from a single
BulkConstraintPlan record into an abstract base plus three concretes
(SubscribeBulk, WriteBulk, ReadBulk), each owning its own denied-entry
merge so the dispatch site never branches on reply shape. A new
FilterWriteBulkAsync<TEntry> generic over the four write-entry shapes
runs CheckWriteHandleAsync per entry; denied entries surface as the
BulkWriteResult shape, preserving original-index order.
All five language clients (.NET, Go, Rust, Python, Java) gained the
five new methods following their existing bulk pattern, with regenerated
protobufs.
Tests added:
- MxAccessValueCacheTests (6 cases) — Set/TryGet, Remove resets the
version, TryWaitForUpdate signals on Set, pump step fires each poll.
- MxAccessBaseEventSinkTests — OnDataChange populates the cache,
ValueCache property exposes the bound instance.
- MxAccessCommandExecutorTests — four bulk-write variants (per-entry
success/failure, value+timestamp forwarding, secured user ids),
ReadBulk snapshot lifecycle on uncached tag (timeout surfaces as
was_successful=false), invalid-payload reply.
- GatewayGrpcScopeResolverTests — five new MxCommandKind cases.
- SessionManagerTests — WriteBulk and ReadBulk forwarding through
FakeWorkerHarness; ReadBulk forwards timeout_ms.
- Per-client (.NET, Go, Rust, Python, Java) — WriteBulk builds the
right command and returns per-entry results, ReadBulk forwards the
timeout and unpacks the was_cached flag.
Cross-language e2e CLI subcommands for the new bulks are deliberately
scoped out of this change (each of the five client CLIs would need
five new subcommands plus matching phases in
scripts/run-client-e2e-tests.ps1); coverage equivalent to the existing
bulk-subscribe coverage is provided by worker + gateway + per-client
unit tests.
Docs updated in the same commit: gateway.md (Public MXAccess Command
Surface), docs/DesignDecisions.md (new "Bulk Command Family" section
with the ReadBulk cache-then-snapshot rationale), and every client
README.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Client.Python-001: dropped "scaffold" from the stale pyproject description.
Client.Python-002 (re-triaged): stale finding — MxGatewayCommandError is
already exported and in __all__; no change needed.
Client.Python-004: removed the dead `closed` variable in _smoke; the CLI
smoke now uses `async with session`.
Client.Python-006: close() on both clients and Session had an unlocked
check-then-set race; `_closed` is now set before the await.
Client.Python-007: gateway stream iterators now share one helper that
explicitly catches CancelledError and cancels the call.
Client.Python-008: to_mx_value now rejects nan/inf; float/bytes mapping
documented.
Client.Python-010: removed the circular-import-workaround late imports in
favour of TYPE_CHECKING / module-scope imports.
Client.Python-011: ensure_mxaccess_success no longer treats a proto3-default
success==0 with an unset category as a failure.
Client.Python-012 (Won't Fix): invoke_raw deliberately skips MXAccess-failure
detection for parity tests; documented the contract instead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Client.Python-003: stream_events_raw and query_active_alarms passed `timeout`
to the stub with no TypeError fallback, unlike _unary. Both now route through
a shared _open_stream helper that strips `timeout` on TypeError.
Client.Python-005: discover_hierarchy buffered the entire Galaxy hierarchy in
memory. Added GalaxyRepositoryClient.iter_hierarchy, a lazy async generator
yielding objects page-by-page; discover_hierarchy is now a thin wrapper that
preserves its list contract. README documents iter_hierarchy.
Client.Python-009: added regression coverage for previously untested paths —
write2/add_item2 request shape, the MAX_BULK_ITEMS boundary, the None-argument
TypeError guards, TLS ca_file reading, and the non-auth map_rpc_error fallthrough.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure mechanical regen following PR A.1 (alarm-transition event family
+ AcknowledgeAlarm / QueryActiveAlarms public RPCs). Ran:
- clients/python/generate-proto.ps1 → mxaccess_gateway_pb2.py +
mxaccess_gateway_pb2_grpc.py.
- clients/go/generate-proto.ps1 → mxaccess_gateway.pb.go +
mxaccess_gateway_grpc.pb.go + galaxy_repository.pb.go (whitespace
diff from upstream protoc minor version).
The .NET binding regenerates on csproj rebuild via Grpc.Tools — its
artifact (Generated/MxaccessGateway*.cs) was already updated as part
of A.1's commit. Java + Rust regen happens at build time via the
gradle plugin / build.rs respectively, with no committed output to
update.
Smoke-imported the regenerated Python descriptors:
OnAlarmTransitionEvent.DESCRIPTOR.fields → alarm_full_reference,
alarm_type_name, category, current_value, description, ...
AcknowledgeAlarmRequest.DESCRIPTOR.fields → alarm_full_reference,
client_correlation_id, comment, operator_user, session_id
ActiveAlarmSnapshot.DESCRIPTOR.fields → alarm_full_reference,
alarm_type_name, category, current_state, current_value, ...
PRs E.2 - E.6 layer hand-written SDK methods on top of the regenerated
types — those land per-language as separate PRs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Resolve 14 conflicts from popping local stash on top of origin's
eed1e88 + 8d3352f doc-comment additions (11 mechanical, plus
version.rs, DashboardAuthenticatorTests.cs, DashboardGalaxyProjector.cs)
- Fix 4 test files that used AGENTS.md as the repo-root sentinel
(now use CLAUDE.md, since AGENTS.md was removed in 4731ab5)
- Redirect 10 doc citations from AGENTS.md to the matching gateway.md
sections (Value Model, Status Model, Security, STA Worker Thread
Model, gRPC Layer rule, cancellation rule)
Verified: solution build clean, x86 worker build clean, 266/266
gateway tests passing, 121/121 worker tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename 16 kebab-case docs to PascalCase per StyleGuide
- Move per-language client design docs from docs/ to clients/<lang>/
alongside their READMEs
- Add ## Related Documentation sections to 15 docs that lacked one
- Fix sentence-case violations in H3 headings (StyleGuide rule)
- Update cross-references in gateway.md, client READMEs, scripts,
and generate-proto.ps1 helpers to follow the new paths
- Add CLAUDE.md with build/test commands, the source-update
verification matrix, the parity-first contract, and pointers
to MXAccess and Galaxy Repository analysis sources
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>