The gateway now monitors alarms continuously, independent of any client
session, and fans the feed out to every client.
GatewayAlarmMonitor is an always-on hosted service that owns one
gateway-managed worker session dedicated to alarms: it subscribes the
configured provider, caches the active-alarm set from the worker's
transition events (reconciled periodically against the worker's
authoritative snapshot), re-opens the session if the worker faults,
and broadcasts to all subscribers.
The new session-less StreamAlarms RPC opens with the current
active-alarm snapshot, then streams live transitions; any number of
clients fan out from the single monitor without opening a worker
session. AcknowledgeAlarm is now session-less and routes through the
monitor. The session-scoped QueryActiveAlarms RPC and the per-session
alarm auto-subscribe hook are removed, along with the now-dead
IAlarmRpcDispatcher trio; the dashboard Alarms tab reads the monitor's
in-process cache directly.
This intentionally reverses the v1 "no multi-subscriber fan-out"
decision for the alarm subsystem.
Contracts regenerated; gateway, dashboard and tests build clean,
94 alarm-affected tests pass, and the monitor is verified live.
Language-client stubs are regenerated in a follow-up change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Server-032 change made event-channel overflow wait
EventChannelFullModeTimeout before faulting, instead of faulting
instantly. Two pre-existing overflow tests were not updated and left
EventChannelFullModeTimeout at its 5s default, which races the 5s
TestTimeout: ReadLoop_WhenEventQueueOverflows_FaultsClient and
ReadLoop_WhenClientFaults_KillsOwnedWorkerProcess. Pin it to 50ms in
both so overflow faults promptly.
EnqueueWorkerEvent_WhenChannelFullPastTimeout_FaultsWithRichDiagnostic
wrote 6 events into a 4-slot channel, but the worker client faults
while reading the 5th and its read loop then stops — the 6th event is
never drained and the test's pipe write for it blocks forever on a
full OS pipe buffer, hanging the test host. Write exactly 5 (4 to fill
plus 1 to overflow) as the test comment already intends, and bound the
post-fault event drain with TestTimeout so a future regression fails
instead of hanging.
No production change: the Server-031/032 WorkerClient logic is correct
— these were test-only defects.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The subscription panel showed "(array)" for any array-valued tag and
"Unspecified" for its type. A scalar MxValue carries its type in
MxValue.DataType, but an array leaves that Unspecified and carries the
element type and dimensions on the MxArray itself.
DashboardMxValueFormatter.FormatValue now joins the typed MxArray
elements (e.g. "[1.5, 2.25, 3]"), capped at 24 elements with a
"... N total" suffix, and FormatDataType reads the element type and
dimensions off the array (e.g. "Double[10]").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server-031: HeartbeatLoopAsync now skips the HeartbeatExpired fault
while a command is in flight on the gateway-worker pipe, up to
WorkerClientOptions.HeartbeatStuckCeiling (75s default) — a heartbeat
gap caused by a slow STA command or an event-drain write burst no
longer faults a healthy worker. Mirrors the worker-side Worker-023
guard. A command older than the ceiling still faults so a genuinely
stuck COM call cannot hide the worker indefinitely.
Server-032: EnqueueWorkerEventAsync now honors the configured
EventChannelFullModeTimeout by awaiting WriteAsync against the
wait-mode channel, instead of faulting on the first missed slot with
the non-blocking TryWrite. A transient consumer hiccup is absorbed up
to the timeout; the overflow diagnostic names the channel depth,
capacity, and the actionable fix.
Adds the Server-031 and Server-032 findings entries and WorkerClient
regression tests covering both.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MXAccess fires OnDataChange with pftItemTimeStamp marshaled as a
VT_BSTR string (e.g. "3/26/2026 1:38:22.907 PM"), not a FILETIME or
VT_DATE. VariantConverter classified it as a plain string, so
ApplySourceTimestamp never set MxEvent.SourceTimestamp — every
OnDataChange event and every cached ReadBulk result carried no
source timestamp for all gRPC clients.
Parse the string and set SourceTimestamp. MXAccess formats it in the
worker host's local time (verified empirically: a fast-changing tag's
timestamp landed exactly the host UTC offset behind wall-clock UTC),
so it is parsed as local and converted to UTC.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Browse renders the Galaxy hierarchy tree from IGalaxyHierarchyCache:
expandable areas/objects with attribute name, data type and the
alarm/historized flags, plus a name/reference filter. Right-click or
double-click an attribute to add it to a subscription panel that polls
live value, quality and source timestamp every two seconds.
Alarms lists the worker's currently-active alarm set via
IAlarmRpcDispatcher, defaulting to unacknowledged Active alarms with
filters for acknowledged alarms, area, severity range and text. It is
read-only and warns when alarm auto-subscribe is disabled.
Both tabs read live MXAccess data through a new singleton
DashboardLiveDataService that owns one shared, lazily-opened gateway
session (one worker) for the whole dashboard, re-opened transparently
if it faults or its lease expires.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Set MxGateway:Alarms with Enabled=true and the dev-rig subscription
expression \DESKTOP-6JL3KKO\Galaxy!DEV. Worker sessions now subscribe
their wnwrap alarm consumer on open; without this the QueryActiveAlarms
RPC returns an empty stream because no session is ever alarm-subscribed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cross-language e2e matrix spawned one CLI process per operation —
~250 per client — paying a process (and, for the Java CLI, a full JVM)
cold-start every time. The Java leg alone ran ~16 minutes.
Each client CLI (dotnet, go, rust, python, java) gains a `batch`
subcommand: a single process that reads one command line from stdin,
runs it through the normal subcommand dispatch, writes the JSON result,
then a line containing exactly `__MXGW_BATCH_EOR__`. A failing command
writes its `{"error":...}` envelope and the loop continues.
run-client-e2e-tests.ps1 now launches one batch process per client and
pings every operation through its stdin/stdout, so startup is paid once
per client. The orchestration and assertions are unchanged; the parity
and auth phases now read the `{"error":...}` envelope instead of a
process exit code.
Full 5-client matrix with -VerifyWrite: ~15 min, down from ~35; the Java
leg dropped from ~16 min to ~2-3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cross-language client e2e matrix failed for dotnet and Java. Both
failures were in the harness, not the client code.
1. Per-call toolchain cold-start. The matrix issues ~250 CLI calls per
client; it invoked `dotnet run` / `gradle :mxgateway-cli:run` every
time, rebuilding and cold-starting the toolchain per call. Build each
CLI once up front (`dotnet build`, `gradle :mxgateway-cli:installDist`)
and invoke the compiled artifact directly. This alone fixes dotnet.
2. Worker event-channel overflow. The per-tag advise loop advises every
discovered tag with no StreamEvents consumer attached, so change
events accumulate in the worker event channel
(MxGateway:Events:QueueCapacity) until FailFast faults the worker.
dotnet's faster loop slipped under the window; the Java CLI's
process-per-call JVM cold-start did not. Every -DrainEveryTags advised
tags (default 15) the loop connects a short StreamEvents drain; the
gateway's per-stream producer empties the channel the instant a
subscriber attaches, so a small bounded read suffices.
Full 5-client matrix (dotnet, go, rust, python, java) now passes with
-VerifyWrite against a live gateway.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both findings surfaced when running the cross-language e2e matrix
(scripts/run-client-e2e-tests.ps1) against the redeployed gateway at
commit 84d36b7. Filed in code-reviews/Server/findings.md and
code-reviews/Client.Dotnet/findings.md and fixed in the same change.
Server-030 (Medium / Error handling): GatewaySession.GetReadyWorkerClient
gated on `_state == Ready && _workerClient.State == Ready` but only
formatted `_state` into the SessionManagerException message. Under load
the gateway-driven `_state` and the worker-driven `WorkerClient.State`
can diverge, producing a self-contradictory diagnostic ("Session ... is
not ready. Current state is Ready."). The Java e2e client hit this on
the 56th item after 55 successful add-items. Rewrote the message to
include both states ("Session state is X; worker state is Y"), added
an XML doc explaining the two-state contract and that this branch is
the fail-fast for a divergence race, and added regression test
SessionManagerTests.InvokeAsync_WhenWorkerNotReadyButSessionReady_DiagnosticIncludesBothStates
that pins both states appear in the message. The deeper race (should
the gateway briefly wait for worker-Ready before failing?) remains
open as a follow-up.
Client.Dotnet-017 (Low / Error handling): stream-events CLI threw
OperationCanceledException as an unhandled exception when the user's
--timeout expired before --max-events was reached. Exit code
-532462766, no aggregate JSON. The other client CLIs (Go, Rust, Python,
Java) exit 0 in this case. Wrapped the `await foreach` in
`catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`
so the supplied token's cancellation (--timeout, Ctrl+C, or parent
CTS) becomes graceful completion; the aggregate `{ "events": [...] }`
JSON still runs after the catch. Added regression test
RunAsync_StreamEvents_WhenTimeoutFiresAfterEvents_EmitsCollectedEventsAndExitsZero
backed by a new FakeCliClient.StreamHangAfterEvents hook that yields
the configured events then parks on the cancellation token.
Side cleanup: the GatewayApplicationTests test added under Server-020
was asserting an invariant (`/dashboard/dashboard/X` doesn't exist)
that I broke by reverting Server-020 in 84d36b7. The doubled endpoint
shapes do exist now (MapGroup("/dashboard") prefixing an already
"/dashboard/X" @page directive) but they're harmless — no client
requests `/dashboard/dashboard/X`. Replaced the test with a positive
assertion (`/dashboard/X` routes ARE registered) and rewrote the XML
doc to record the actual contract.
Verified: dotnet test src/MxGateway.Tests passes 480/480, dotnet test
clients/dotnet/MxGateway.Client.Tests passes 77/77, gateway redeployed
at this commit and GET http://localhost:5130/dashboard returns 200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server-020 incorrectly removed the duplicate `@page "/dashboard/X"` directive
from each dashboard Razor page on the assumption that `MapGroup("/dashboard")`
would prepend the prefix to Blazor SSR route matching. It does not — Blazor's
`@page` template matcher operates on the full URL path, not relative to a
MapGroup. The removal left the dashboard returning HTTP 500 with
"Unable to find the provided template '/dashboard/'" from
RouteTableFactory.CreateEntry on every page.
Restored the eight `@page "/dashboard/X"` directives. The accompanying
regression test still passes (it asserts the genuinely-double-prefixed shape
`/dashboard/dashboard/X` never appears — it never did, since the original
duplicates were `"/"` + `"/dashboard/"`, not `"/dashboard/"` repeated). XML
doc on the test rewritten to record what was actually wrong with Server-020.
Verified: gateway redeploy + `GET http://localhost:5130/dashboard` returns
HTTP 200 7.3 KB; `/dashboard/sessions` also 200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.
High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
string (it must be a valid SPDX expression), so `pip wheel .` and
`pip install -e .` both fail before any source compiles. Tests
still pass because pytest bypasses the build backend via
`pythonpath`. Dropped the invalid license string, kept the
`License :: Other/Proprietary License` classifier, and added
`tests/test_packaging.py` so a future regression of the same shape
is caught in CI.
Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
on WorkerPipeSessionOptions bounds the in-flight-command watchdog
suppression so a truly stuck COM call still triggers StaHung
instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
cross-language bench comparison is apples-to-apples again;
`failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
serialisation pattern to DeployEventStream so close() arriving
after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
stability check after UnAdvise instead of strict equality against
the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
log sink the WriteSecured live test owns (worker stdout/stderr,
gateway logs, direct WriteLine) so the credential is proven
absent from the full output buffer, not just the diagnostic
message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
for the previously-uncovered Write2Bulk and WriteSecured2Bulk
arms of WriteBulkConstraintPlan.SetPayload.
Lows (41 — highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
AlarmsOptions validated at startup (Server-026); Authorization.md
Constraint Enforcement snippet/prose enumerate the bulk write/read
family (Server-027); bulk-read-commands and bulk-write-commands
capability tokens added to OpenSession (Server-029);
NotWiredAlarmRpcDispatcher XML doc and missing scope-resolver and
state-machine tests cleaned up (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
guard the poll path uses, at every command entry (Worker-024);
RunAsync null-checks the runtime-session factory result
(Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
CancelCommandReturnValue serialised under lock (Worker.Tests-027);
Probes namespace lifted to MxGateway.Worker.Tests.Probes
(Worker.Tests-029); cancel-envelope sequence numbers monotonised
(Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
(Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
test backed by a TaskCompletionSource fake (Tests-022); companion
FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
(Tests-023); constraint plan reply-count divergence pinned
(Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
end-to-end (IntegrationTests-018); abnormal-exit keyword set
tightened to pipe-disconnected/end-of-stream and the test now
asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
default 30s wall-clock budget doesn't kill them (015);
BenchStreamEventsAsync observes the inner stream task on every
exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
%w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
RFC3339Nano with fractional seconds (019); runStreamEvents installs
signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
cancellation contract Client.Java-015 established (022); stream-events
text path uses Long.toUnsignedString for worker_sequence (023);
bench-read-bulk no longer pollutes success-latency histogram with
failure durations (024); --shutdown-timeout CLI option propagates
through to ClientOptions (025); seven new MxGatewayCliTests cover
the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
wheel-build smoke test added under tests/test_packaging.py (020);
README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
document the AsRef<str> read_bulk genericism (019);
next_correlation_id re-exported at the crate root, with a
property-style doc contract and an explicit disclaimer that the
literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
IConstraintEnforcer mechanism instead of "tag-allowlist filter"
(014); BulkReadResult gains explicit per-arm payload-population
documentation for the success vs failure cases (015).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and
fixed them in three priority waves (3 High, 17 Medium, 52 Low).
Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
GatewayGrpcScopeResolver so non-admin keys can use them; document
the mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the
CLI FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression
in generated tonic code by reformatting the ReadBulkCommand proto
comment and scoping a #![allow(...)] to the generated submodules.
Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and
make DisposeAsync race-safe against in-flight CloseAsync (-016);
add constraint-enforcement test coverage for the bulk-plan path
(-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop
can distinguish graceful shutdown from a real STA-affinity
violation (-016); have the watchdog skip StaHung while
CurrentCommandCorrelationId is non-empty so a legitimate slow
ReadBulk no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the
11 GatewaySession bulk methods (-013); replace the real TCP probe
in GalaxyHierarchyCacheTests with an IGalaxyRepository fake
(-016).
- IntegrationTests: drive the StreamEvents writer in the live Write
test and assert OnWriteComplete (-012); add live tests for
Unadvise/RemoveItem/Unregister ordering, WriteSecured, and
abnormal worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
CreateForTesting factory (-016); cover WorkerCancel and
unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races
beforeStart() (-014); return a CancellingCompletableFuture that
actually forwards cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in
the CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
histograms with failed-call durations (-015); add coverage for
the five MalformedReply paths, the bulk-write helpers, the
Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write
command family (-009).
Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
WorkerAlarmRpcDispatcher missing-session handling; drop the
duplicate dashboard @page routes; refresh IAlarmRpcDispatcher
XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
subscriptionExpression / ExecutingCommand arms; preserve
factory-supplied runtime sessions; split MxAlarmSnapshot.cs into
three files.
- Tests: dispose the WebApplication in seven test classes; rebuild
FakeWorkerProcess.WaitForExitAsync against a real TaskCompletion
source; switch the heartbeat-expires test to ManualTimeProvider;
add InvariantCulture to the remaining DateTimeOffset.Parse sites;
document GalaxyFilterInputSafetyTests in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
IDisposable, class-level [Trait], single-source ZB default
connection string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact
so absent env vars SKIP not pass; PascalCase rename of probe
[Fact]s; deterministic deadline test; new frame-protocol error
tests; ComputeTransitions diff-coverage; relocate dev-rig probes
to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
TreatWarningsAsErrors / analysers apply; document
DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
bulk-read handles in CLI; surface AcknowledgeAlarm transport
faults through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
runWriteBulkVariant; document the six new subcommands in
writeUsage; drain galaxy-watch events on limit; switch io.EOF
comparisons to errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout
option; regex-based credential redaction; Long.toUnsignedString
for uint64 sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
_percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
_api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close
stop hard-coding correlation IDs; resync RustClientDesign.md
with the current Session / Error surface and CLI subcommand set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New subcommand drives the gateway''s StreamEvents server-stream as fast
as it can from a single client process. Subscribes to --bulk-size tags
(rotating through all six TestMachine attributes by default) and counts
events received over a --duration-seconds steady-state window. Tracks
events/sec, end-to-end latency (now - event.worker_timestamp), and any
worker faults observed via a post-run DrainEvents probe.
--session-count opens N independent gateway sessions from the same
client process — each session is independent at the gateway (own
worker, own event subscriber, own item handles) so this measures how
the gateway multiplexes concurrent event streams without needing
multiple client processes. Sessions are staggered open by default
(--session-start-stagger-ms 750) because firing N concurrent
OpenSession calls forces N concurrent worker x86 spawns, and on a dev
rig that exceeds the gateway''s 30-second worker startup timeout
around N >= 6-8. The stagger gives each worker headroom to init its
COM apartment + attach the event sink before the next one starts.
Phase 1 of the bench opens + subscribes every session sequentially;
phase 2 opens the steady-state window once everyone is advised, so
the measurement isn''t skewed by late-arriving sessions still in
warmup. The latency sample is shared across sessions (locked
List<double>); event counts use Interlocked.
Initial sweep at --bulk-size 120 against the dev galaxy (20 machines
x 6 attributes = 120 unique tags) showed:
- Linear throughput scaling with subscribed-tag count: N=6→2 ev/s,
N=24→8 ev/s, N=60→20 ev/s, N=120→41 ev/s. The dev galaxy is
producer-bound at ~0.34 events/sec per advised tag — gateway has
plenty of headroom.
- Latency stayed at p50 ≈17ms, p95 ≈34ms across the entire range —
no degradation with subscribed-tag count.
- Zero queue-overflow faults; gateway 10k-event buffer never came
close to filling at this producer rate.
- Linear scaling with session count too (staggered open): 1→44, 2→81,
4→130, 8→324 events/sec at p50 16ms across all session counts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Changes the public signature from `Vec<String>` (owned) to
`&[impl AsRef<str>]` so callers can re-issue the same call repeatedly
without cloning at the call site. The bench''s steady-state loop now
passes `&tags` instead of `tags.clone()`; the CLI subcommand passes the
parsed `&items`; the integration test passes `&["Area001.Pump001.Speed"]`
straight from a string literal slice.
Honest perf note: this is an ergonomics change, not a measurable speedup.
The method still has to materialise an owned `Vec<String>` internally
because prost''s generated `ReadBulkCommand` field requires it, so the
total heap traffic per call is unchanged. Across two 30-second, 5-way
concurrent bench runs at bulkSize=6:
pre-fix (.clone() at caller): 145.35 calls/sec, p99 62.31 ms
post-fix run 1 (&tags): 165.98 calls/sec, p99 40.65 ms
post-fix run 2 (&tags): 146.19 calls/sec, p99 60.04 ms
Run-to-run variance (145-166) dominates any signal from the fix. Solo
Rust release stayed at 261-267 calls/sec across both API shapes,
confirming the bench is gateway-bound under 5-way contention rather
than client-allocation-bound. The change is kept because the borrowed
slice is the idiomatic Rust API shape for "list of items the callee
does not need to take ownership of", and it cleans up the explicit
clone from the bench's inner loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rust''s debug profile costs the bench ~45% of solo throughput and ~3x of
p99 latency vs release (267 vs 184 solo calls/sec, p99 5.7 vs 16ms).
Debug disables inlining, runs overflow checks on every arithmetic op,
keeps Future state machines un-collapsed, and lets every Vec allocation
through unoptimized. Other compiled clients in the matrix don''t see
this gap: Go always builds optimized, Python is interpreted, and the
JIT-tiered runtimes (HotSpot for Java, CoreCLR Tier 1 for .NET) close
most of the gap during the warmup window.
The driver now requests `cargo run --release` for Rust and `dotnet run
-c Release --no-build` for .NET, so the two compiled-AOT clients race
under their production-equivalent profiles. Callers must `cargo build
--release -p mxgw-cli` and `dotnet build ... -c Release` once before
running the bench; `--no-build` then keeps each measurement window
free of compilation overhead.
Live re-run (5-way concurrent, 30s, bulkSize 6) after the switch:
rust: 145.35 calls/sec (was 123.26 in debug; 18% gain under contention)
go: 185.59 calls/sec
java: 171.80 calls/sec
dotnet:172.31 calls/sec
python:140.52 calls/sec
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a bench-read-bulk subcommand to every client CLI (.NET, Go, Rust,
Python, Java) and a PowerShell driver that runs all five concurrently
against the deployed gateway and prints a side-by-side comparison.
Each CLI''s bench:
- Opens its own session, registers, subscribes to bulk-size tags so the
worker''s MxAccessValueCache populates from real OnDataChange events.
- Runs a warmup-seconds-long pre-loop with identical calls so JIT /
connection-pool / first-call overhead is amortised before the
measurement window.
- Runs ReadBulk in a tight in-process loop for duration-seconds with
per-call high-resolution latency capture (Stopwatch in .NET,
time.Now in Go, std::time::Instant in Rust, time.perf_counter in
Python, System.nanoTime in Java).
- Unsubscribes + closes the session, then emits one JSON object with
the shared schema: { language, durationMs, totalCalls, successfulCalls,
failedCalls, totalReadResults, cachedReadResults, callsPerSecond,
latencyMs: { p50, p95, p99, max, mean } }.
The PS driver (scripts/bench-read-bulk.ps1) launches one detached process
per client, waits for all to finish, parses the trailing JSON object from
each stdout, prints a comparison table, and persists the combined report
under artifacts/bench/. Quoting around Java''s `gradle --args="..."` is
handled by writing a one-shot .bat that cmd.exe runs; the .NET CLI''s
per-call gRPC timeout is auto-scaled to (Duration + Warmup + 30s) so the
channel-wide timeout doesn''t cancel the bench mid-loop.
Live 30-second steady-state run against the deployed gateway, all five
clients hitting the same six TestMachine_001..006.TestChangingInt tags:
client calls/sec cached/total p50 ms p95 ms p99 ms max ms
dotnet 171.78 30924/30924 3.84 14.06 40.41 542.48
go 175.46 31590/31590 3.93 13.52 41.26 243.00
rust 123.26 22188/22188 5.52 15.78 48.11 544.41
python 145.79 26244/26244 4.86 14.85 41.65 645.84
java 181.12 32604/32604 3.80 10.59 33.37 344.27
143,550 ReadBulk results across all five clients during the 30s window;
100% were was_cached = true (the worker''s cache fast-path never fell
through to the snapshot lifecycle). Aggregate read throughput ~800
calls/sec against five concurrent sessions sharing the same cached tags.
A second variant with bulk-size 20 sustained the same per-client call
rate while delivering 3.3x more values per call (~37,000 cached reads/sec
aggregate across the five concurrent sessions), confirming the linear
per-tag cache lookup inside one call is not a bottleneck at this scale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit added read-bulk / write-bulk / write2-bulk /
write-secured-bulk / write-secured2-bulk dispatch cases to RunCoreAsync
but left them out of IsKnownGatewayCommand, so the .NET CLI rejected
them at the pre-dispatch gate and printed the usage banner instead of
running the new code paths. Surfaced when the live e2e exercised the
read-bulk phase against the deployed gateway — the call routed through
the unknown-command path before reaching the protobuf builder.
Also extends WriteUsage with one line per new subcommand so the banner
documents the new surface.
Live e2e against the deployed gateway now passes for all five clients
(dotnet, go, rust, python, java) with 4/4 tags returning was_cached=true
after the subscribe-bulk + read-bulk path, confirming the worker
MxAccessValueCache populates from real MXAccess OnDataChange events and
round-trips through every client''s JSON parser.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit added the bulk read/write library surface in every
client; this commit makes that surface reachable from each client's CLI
and exercises it through scripts/run-client-e2e-tests.ps1.
Five new subcommands in every client CLI (.NET / Go / Rust / Python /
Java): read-bulk, write-bulk, write2-bulk, write-secured-bulk, and
write-secured2-bulk. Each follows the existing subscribe-bulk shape:
- read-bulk takes --server-handle, --items <csv tag list>, and
--timeout-ms (0 = worker default). JSON output carries the
BulkReadResult fields, including was_cached so the e2e matrix can
verify the cached-path semantics.
- The four bulk-write families take --server-handle, --item-handles
<csv>, --type, --values <csv>. write2-bulk and write-secured2-bulk
add a single --timestamp applied to every entry; the secured
variants take --current-user-id and --verifier-user-id. All four
output BulkWriteResult JSON.
A new -SkipReadWriteBulk switch on the matrix script (default OFF)
controls two new e2e phases:
- After the existing subscribe-bulk phase leaves tags advised, the
script runs read-bulk against the same tag list and asserts most
results return was_cached = true. This is the only e2e coverage of
the cache-then-snapshot fork — the unit + gateway tests verify the
semantics with a fake worker, but only the live cross-language
matrix proves the cache populates from real OnDataChange events and
survives the round-trip through every client''s JSON parser.
- When -VerifyWrite is set, the write phase now also runs a single-
entry write-bulk against the same writable item handle (using a
distinct sentinel value) and asserts a per-entry success. Confirms
the BulkWriteResult wire format end-to-end without complicating
the OnWriteComplete echo assertion the single-item phase already
verifies.
Dry-run validation passes for all five clients: each emits the correct
read-bulk and write-bulk CLI invocations with the right flags.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds five new MXAccess command kinds (WriteBulk, Write2Bulk,
WriteSecuredBulk, WriteSecured2Bulk, ReadBulk) that ride the existing
"one round-trip, per-entry results" bulk shape used by AddItemBulk and
SubscribeBulk today. MXAccess COM has no native bulk API; the worker
runs each bulk operation as a sequential loop on its STA, returning
one BulkWriteResult / BulkReadResult per requested entry so per-item
MXAccess failures surface as was_successful=false rather than throwing.
ReadBulk has no MXAccess analogue. The worker satisfies it by:
- Returning the last cached OnDataChange payload (was_cached=true)
when the requested tag is already in the session''s item registry
AND advised — the existing subscription is NOT touched, since the
caller did not create it.
- Otherwise taking the AddItem + Advise + wait-for-OnDataChange +
UnAdvise + RemoveItem snapshot lifecycle itself (was_cached=false)
and leaving the session exactly as it was. The wait pumps Windows
messages on the STA so the inbound MXAccess event can dispatch
while the executor still holds the thread.
The new MxAccessValueCache lives on each MxAccessSession, shared with
MxAccessBaseEventSink which populates it on every OnDataChange after
the event clears the outbound queue. Eviction on RemoveItem keeps
reused MXAccess handles from serving stale values from a previous
lifetime.
Gateway-side authorization wires WriteBulk/Write2Bulk to invoke:write,
WriteSecuredBulk/WriteSecured2Bulk to invoke:secure, ReadBulk to
invoke:read. The constraint-filter pipeline is refactored from a single
BulkConstraintPlan record into an abstract base plus three concretes
(SubscribeBulk, WriteBulk, ReadBulk), each owning its own denied-entry
merge so the dispatch site never branches on reply shape. A new
FilterWriteBulkAsync<TEntry> generic over the four write-entry shapes
runs CheckWriteHandleAsync per entry; denied entries surface as the
BulkWriteResult shape, preserving original-index order.
All five language clients (.NET, Go, Rust, Python, Java) gained the
five new methods following their existing bulk pattern, with regenerated
protobufs.
Tests added:
- MxAccessValueCacheTests (6 cases) — Set/TryGet, Remove resets the
version, TryWaitForUpdate signals on Set, pump step fires each poll.
- MxAccessBaseEventSinkTests — OnDataChange populates the cache,
ValueCache property exposes the bound instance.
- MxAccessCommandExecutorTests — four bulk-write variants (per-entry
success/failure, value+timestamp forwarding, secured user ids),
ReadBulk snapshot lifecycle on uncached tag (timeout surfaces as
was_successful=false), invalid-payload reply.
- GatewayGrpcScopeResolverTests — five new MxCommandKind cases.
- SessionManagerTests — WriteBulk and ReadBulk forwarding through
FakeWorkerHarness; ReadBulk forwards timeout_ms.
- Per-client (.NET, Go, Rust, Python, Java) — WriteBulk builds the
right command and returns per-entry results, ReadBulk forwards the
timeout and unpacks the was_cached flag.
Cross-language e2e CLI subcommands for the new bulks are deliberately
scoped out of this change (each of the five client CLIs would need
five new subcommands plus matching phases in
scripts/run-client-e2e-tests.ps1); coverage equivalent to the existing
bulk-subscribe coverage is provided by worker + gateway + per-client
unit tests.
Docs updated in the same commit: gateway.md (Public MXAccess Command
Surface), docs/DesignDecisions.md (new "Bulk Command Family" section
with the ReadBulk cache-then-snapshot rationale), and every client
README.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Running the matrix against a live gateway surfaced several issues:
- The write phase is now opt-in (-VerifyWrite, was -SkipWrite). It runs
right after register so only a small event backlog precedes the write,
and asserts the reliable OnWriteComplete signal (the written value is
not echoed back by a provider-driven attribute like TestChangingInt, so
the value compare is best-effort).
- Java was launched as bare "gradle", which .NET's Process.Start cannot
exec (it is gradle.bat) — resolve the launcher and run it via cmd.exe.
- The Java client's MxEventStream queue capacity was 16, which overflows
on any active session's backlog-replay burst; raised to 1024.
- The Rust stream-events CLI now renders the event family as the proto
enum name, matching the protobuf-JSON the other four clients emit.
Update docs/GatewayTesting.md for the reworked write phase.
Verified live: the full five-client matrix passes with -VerifyWrite.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The .proto contract and MxCommandKind already defined Write, Write2,
WriteSecured, and WriteSecured2, but the worker's MxAccessCommandExecutor
had no case for any of them — every write kind fell through to
CreateInvalidRequestReply ("Unsupported MXAccess command kind Write").
Implement all four:
- VariantConverter.ConvertToComValue projects an MxValue into a
COM-marshalable object (scalars, arrays, null) — the inverse of the
existing COM-to-MxValue projection.
- IMxAccessServer / MxAccessComServer gain Write/Write2/WriteSecured/
WriteSecured2, routed to ILMXProxyServer / ILMXProxyServer4.
- MxAccessSession and MxAccessCommandExecutor add the four write paths,
following the existing ExecuteAdvise pattern; the reply is a plain OK
reply and the outcome surfaces later as an OnWriteComplete event.
Verified live: a Write now returns PROTOCOL_STATUS_CODE_OK and produces
an OnWriteComplete event where it previously returned InvalidRequest.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Close the notable gaps in scripts/run-client-e2e-tests.ps1:
- Write round-trip: write a per-client sentinel value to a configurable
writable attribute, then assert it is echoed back through the event
stream. Extends the Rust mxgw-cli stream-events output with full
per-event JSON (itemHandle + protojson-shaped value) so all five
language clients run an identical value compare.
- Parity: assert an invalid item handle and an unknown session id are
rejected rather than silently succeeding.
- Auth rejection: assert open-session is rejected with a missing API key
and, when -RejectScopeApiKeyEnv is supplied, with an insufficient-scope
key.
- Parallel: -Parallel runs each language client as an isolated child
process and merges their JSON reports.
Update docs/GatewayTesting.md for the new phases and flags.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The checked-in generated Java sources under clients/java/src/main/generated/
were out of sync with both the .proto contracts and the configured
protobuf 4.33.1 toolchain: they were missing the alarm command kinds
(MX_COMMAND_KIND_SUBSCRIBE_ALARMS..ACKNOWLEDGE_ALARM_BY_NAME, 25-29), the
alarm/galaxy message additions, and the protobuf 4.x generated-code layout.
Regenerated via `gradle generateProto`; `gradle test` passes against the
refreshed sources. No hand edits — pure protoc/protoc-gen-grpc-java output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MxAccessInteropReference_ExistsOnlyInWorkerProject asserted the MXAccess COM
interop was referenced only by MxGateway.Worker. The worker test project now
legitimately references ArchestrA.MxAccess and Interop.WNWRAPCONSUMERLib so it
can exercise the COM-facing worker code (WnWrapAlarmConsumer, the alarm
tests). Renamed to ..._ExistsOnlyInWorkerAndWorkerTestProjects, updated the
assertion to expect both projects, and made it order-independent. The
architecture invariant the test protects — the gateway/contracts never
reference MXAccess COM — still holds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFault
BackgroundService cancelled the service immediately after StartAsync, so
under parallel load the first RefreshAsync could be skipped (RefreshCallCount
0) and `await executeTask` rethrew TaskCanceledException before the IsFaulted
assertion. The test now waits for a TaskCompletionSource signal that the
first refresh was attempted before cancelling, and uses Task.WhenAny so a
Canceled ExecuteTask does not rethrow. Confirmed stable across full-suite
runs (408/408).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reflects resolution of Contracts-001/004/005/006/007/008 (and Contracts-003
re-triaged Won't Fix). All code-review findings across every module are now
closed. Also normalizes the Contracts-003 Status to the canonical
`Won't Fix` value the index generator expects.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contracts-001: docs/Grpc.md still described "four MxAccessGateway RPCs" —
updated to the actual six (adding AcknowledgeAlarm and QueryActiveAlarms to
the handler and validation-rule sections).
Contracts-003 (Won't Fix): the finding is factually wrong — the <Protobuf>
item for mxaccess_worker.proto already sets ProtoRoot="Protos"; all three
items are consistent (confirmed back to the reviewed commit).
Contracts-004: corrected the stale GatewayContractInfo XML summary
("before generated protobuf contracts are introduced").
Contracts-005: no proto field/enum value was ever removed, so no reserved
ranges were invented. Added a wire-compatibility policy comment to all three
.proto files instructing future editors to reserve removed numbers.
Contracts-006: documented MxStatusProxy.success — it mirrors the COM
MXSTATUS_PROXY numeric success member, is not a boolean, and clients should
branch on category.
Contracts-007: added 13 round-trip tests covering galaxy_repository.proto
messages, bulk-subscribe payloads, and raw-value/IPC worker bodies.
Contracts-008: WorkerAlarmRpcDispatcher never assigns AcknowledgeAlarmReply.
status, so the old "native status" proto comment was misleading. Corrected
the hresult/status proto comments and documented the worker native_status →
public reply mapping in AlarmClientDiscovery.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reflects resolution of Tests-007..012, Worker.Tests-008..015,
IntegrationTests-007..010, Client.Python-001/002/004/006/007/008/010/011/012.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Client.Python-001: dropped "scaffold" from the stale pyproject description.
Client.Python-002 (re-triaged): stale finding — MxGatewayCommandError is
already exported and in __all__; no change needed.
Client.Python-004: removed the dead `closed` variable in _smoke; the CLI
smoke now uses `async with session`.
Client.Python-006: close() on both clients and Session had an unlocked
check-then-set race; `_closed` is now set before the await.
Client.Python-007: gateway stream iterators now share one helper that
explicitly catches CancelledError and cancels the call.
Client.Python-008: to_mx_value now rejects nan/inf; float/bytes mapping
documented.
Client.Python-010: removed the circular-import-workaround late imports in
favour of TYPE_CHECKING / module-scope imports.
Client.Python-011: ensure_mxaccess_success no longer treats a proto3-default
success==0 with an unset category as a failure.
Client.Python-012 (Won't Fix): invoke_raw deliberately skips MXAccess-failure
detection for parity tests; documented the contract instead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IntegrationTests-007: the three live test classes contend for shared
singletons (one MXAccess COM, one ZB SQL DB, one GLAuth). Added
LiveResourcesCollection with DisableParallelization and applied it to all
three so they no longer run concurrently.
IntegrationTests-008: the three live fact attributes each re-implemented the
env-var check. Added IntegrationTestEnvironment.IsEnabled and all three now
delegate to it.
IntegrationTests-009: reworded the misleading "Mock server call context" XML
doc — it is a hand-written stub with no verification behavior.
IntegrationTests-010: WaitForMessageAsync ignored cancellation. It now takes
an optional CancellationToken linked with the timeout; the smoke test shares
one cancellation source with the StreamEvents call context.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker.Tests-008: moved the misplaced WorkerLogRedactor test out of
VariantConverterTests into Bootstrap/WorkerLogRedactorTests.
Worker.Tests-009: renamed 46 snake_case alarm-test methods to PascalCase
Method_Scenario_Expectation.
Worker.Tests-010: replaced a weak Assert.Contains with an exact assertion
against the real diagnostic message and corrected the XML doc.
Worker.Tests-011: renamed and re-documented a cancellation test that
overstated what it proved.
Worker.Tests-012: added an oversized-frame (MessageTooLarge) test; renamed
the mislabeled zero-length-payload test.
Worker.Tests-013: removed the fixed-100ms ThrowIfCompletedAsync helper; the
caller now races runTask deterministically.
Worker.Tests-014: consolidated duplicated test fakes/helpers
(FakeRuntimeSession, NoopComApartmentInitializer, NoopEventSink, frame
helpers) into a shared TestSupport namespace.
Worker.Tests-015: added MxAccessEventQueue coverage for drain-all (maxEvents
0), empty-queue drain, and enqueue-after-fault.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests-007: TestServerCallContext and stream-writer/constraint helpers were
copy-pasted across five test files. Consolidated into a shared
MxGateway.Tests.TestSupport namespace; duplicates deleted.
Tests-008: renamed snake_case alarm-test methods to PascalCase
Method_Condition_Result and dropped redundant usings. Re-triaged two
inaccurate sub-claims (the "wnwrap" name and a required CompilerServices
using).
Tests-009: corrected three copy-paste-mismatched XML <summary> comments in
SessionManagerTests.
Tests-010: added the missing anonymous-localhost security negatives —
bypass disallowed, and loopback-allowed from a remote address.
Tests-011: SessionWorkerClientFactoryFakeWorkerTests discarded worker tasks.
The test class now tracks each launcher and observes its task in DisposeAsync.
Tests-012: added xunit.runner.json pinning collection parallelism and
documented the ephemeral-port convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reflects resolution of Server-007..014, Worker-009..015,
Client.Dotnet-004..008, Client.Go-004..010, Client.Java-006..012.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Client.Java-006: close() on both clients only called shutdown(). It now
awaits termination up to the connect timeout and shutdownNow()s on timeout.
Client.Java-007: added MxGatewayLowFindingsTests covering the alarm surface,
async streaming, MxEventStream overflow, and TLS channel construction. A
latent bug surfaced: a missing CA file throws IllegalArgumentException, not
SSLException — the channel-builder catch was broadened accordingly.
Client.Java-008: async thenApply sites now route stray RuntimeExceptions
through MxGatewayErrors.fromGrpc via a normalising validator.
Client.Java-009: extracted ~80 duplicated lines (createChannel, withDeadline,
toCompletable, ...) into a shared MxGatewayChannels; both clients delegate.
Client.Java-010 (re-triaged): the README's metadata:read scope was correct;
the acknowledgeAlarm Javadoc's invoke:alarm-ack was wrong — corrected to the
admin scope.
Client.Java-011: documented the intentional fail-fast event-stream
backpressure in Javadoc and the README.
Client.Java-012: replaced CommonOptions.resolved()'s mutate-and-return-this
with side-effect-free resolvedApiKey()/resolvedTimeout() accessors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Client.Go-004: ran gofmt on alarms_test.go and galaxy_test.go; the tree is
now gofmt-clean.
Client.Go-005/009/010: migrated Dial/DialGalaxy off the deprecated
grpc.DialContext/WithBlock to grpc.NewClient via a shared dial helper, with
a DialTimeout-bounded readiness probe to keep fail-fast semantics; shared
callContext deadline arithmetic; updated the stale Dial doc comment. Test
harnesses use passthrough:///bufnet for the NewClient default-scheme change.
Client.Go-006: added GatewayError.Code() and an IsTransient(err) helper so
callers can classify transient gRPC failures.
Client.Go-007: newCorrelationID no longer returns an empty id when
crypto/rand fails — it falls back to a non-empty time+counter id.
Client.Go-008: added coverage_test.go for transport-credential resolution,
callContext deadline arithmetic, and native value/array edge kinds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Client.Dotnet-004: documented DefaultCallTimeout as both the per-attempt
deadline and the shared retry budget, and removed DeadlineExceeded from the
transient-retry set (a client-imposed deadline cannot be helped by retrying).
Client.Dotnet-005: RegisterAsync/AddItemAsync/AddItem2Async silently returned
0 when a successful reply lacked the typed payload. They now throw a
descriptive MxGatewayException.
Client.Dotnet-006: added XML docs to the previously undocumented public
members MaxGrpcMessageBytes, GatewayProtocolVersion, WorkerProtocolVersion.
Client.Dotnet-007: corrected the AcknowledgeAlarmAsync XML comment — the RPC
requires the admin scope, not a non-existent invoke:alarm-ack sub-scope.
Client.Dotnet-008: the CLI redactor missed env-var-sourced keys because the
caller passed only the --api-key option. Redaction now uses the same
resolver, stripping env-var keys too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker-009: WorkerFrameWriter serialized twice and WorkerFrameReader
allocated a payload byte[] per frame. The writer now serializes once into a
single prefix+payload buffer; the reader rents the payload buffer from
ArrayPool and honors the logical frame length.
Worker-010: VariantConverter projected a uint+Time value as a full FILETIME,
producing a near-1601 timestamp. The FILETIME projection is now gated on
`value is long`; uint falls through to the integer projection.
Worker-011: replaced the opaque retryAttempts formula in WorkerPipeClient
with MaxRetryAttempts = int.MaxValue, leaving the connect deadline as the
sole bound.
Worker-012: rewrote stale "future PR / polls on a Timer" comments in
AlarmDispatcher, AlarmCommandHandler, MxAccessAlarmEventSink and
MxAccessEventMapper to match the shipped, post-Worker-001 behavior.
Worker-013 (re-triaged): already resolved — StaMessagePumpTests and
MxAccessStaSessionTests cover the pump and poll loop directly.
Worker-014: moved IAlarmCommandHandler into its own file so
AlarmCommandHandler.cs declares one public type.
Worker-015: clarified the MxAccessBaseEventSink.EnqueueEvent overflow-catch
comment explaining the deliberate double RecordFault no-op.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server-007: GalaxyHierarchyProjector re-filtered the whole hierarchy per
page (O(total) paging). It now memoizes the filtered list per cache-entry +
filter signature so subsequent pages are an O(pageSize) slice.
Server-008: WatchDeployEvents re-resolved browse subtrees and rebuilt globs
per streamed event. ResolveBrowseSubtrees is hoisted out of the loop and
GalaxyGlobMatcher caches compiled Regex instances per pattern.
Server-009: auth-store connections used no busy timeout or WAL. A new
OpenConnectionAsync applies journal_mode=WAL and a busy_timeout; all auth
call sites use it. docs/Authentication.md updated.
Server-010: the dashboard rendered Rotate/Revoke for revoked keys, where
Rotate silently reactivates them. ApiKeysPage now shows actions only for
Active keys. docs/Authentication.md updated.
Server-011: WorkerAlarmRpcDispatcher converted to a primary constructor and
brought in line with module conventions.
Server-012: CLAUDE.md corrected to the canonical *:* scope strings.
Server-013 (partly re-triaged): three named coverage gaps were already
closed; the genuine gap (WorkerExecutableValidator) is now covered.
Server-014: rewrote stale "alarm path not yet wired" comments in
MxAccessGatewayService to describe the production WorkerAlarmRpcDispatcher.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MxCommandReply.payload has no by-name ack case: MX_COMMAND_KIND_ACKNOWLEDGE_
ALARM_BY_NAME reuses the acknowledge_alarm reply payload. Verified the worker
(MxAccessCommandExecutor.ExecuteAcknowledgeAlarmByName) and gateway
(WorkerAlarmRpcDispatcher) already implement this correctly — the gap was
purely undocumented contract asymmetry. Documented the reuse on the proto
oneof case and the AcknowledgeAlarmReplyPayload message comment (regenerating
the .NET contract), and in docs/AlarmClientDiscovery.md. Added
ProtobufContractRoundTripTests.MxCommandReply_AcknowledgeAlarmByName_Reuses
AcknowledgeAlarmPayloadCase to pin the contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reflects resolution of Tests-003..006, Worker.Tests-003..007,
IntegrationTests-003..006, Client.Python-003/005/009.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Client.Python-003: stream_events_raw and query_active_alarms passed `timeout`
to the stub with no TypeError fallback, unlike _unary. Both now route through
a shared _open_stream helper that strips `timeout` on TypeError.
Client.Python-005: discover_hierarchy buffered the entire Galaxy hierarchy in
memory. Added GalaxyRepositoryClient.iter_hierarchy, a lazy async generator
yielding objects page-by-page; discover_hierarchy is now a thin wrapper that
preserves its list contract. README documents iter_hierarchy.
Client.Python-009: added regression coverage for previously untested paths —
write2/add_item2 request shape, the MAX_BULK_ITEMS boundary, the None-argument
TypeError guards, TLS ca_file reading, and the non-auth map_rpc_error fallthrough.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IntegrationTests-003: the live MXAccess smoke test asserted on the first
streamed event, which a registration/quality bootstrap event could occupy.
The recording writer now waits for the first event matching a predicate
(Family == OnDataChange).
IntegrationTests-004: the cleanup `finally` could throw and mask an original
assertion failure. Shutdown now routes through a helper that logs cleanup
exceptions instead of propagating them.
IntegrationTests-005: added live MXAccess parity tests — a Write round-trip
to an advised item, and an invalid-handle command surfacing the MXAccess
failure without a transport fault.
IntegrationTests-006: added live LDAP failure-path tests — wrong password
(no password leak), unknown username, and server-unreachable.
docs/GatewayTesting.md updated to describe the new cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker.Tests-003: removed the wall-clock `Elapsed < 2s` assertion from
InvokeAsync_WakesIdlePumpForQueuedCommand; the awaited completion against a
30s idle period already proves the wake event drove dispatch.
Worker.Tests-004: MxAccessStaSession.Dispose now joins the alarm poll task
after cancelling the CTS (consistent with ShutdownGracefullyAsync), and
Dispose_StopsAlarmPollLoop asserts deterministically instead of via Task.Delay.
Worker.Tests-005: undisposed MemoryStream instances across the frame-protocol
and pipe-session tests are now `using` declarations.
Worker.Tests-006: Dispose_StopsAlarmPollLoop now constructs MxAccessStaSession
with `using` so a failed assertion cannot leak the STA poll loop.
Worker.Tests-007: docs/WorkerFrameProtocol.md verification section corrected
to target MxGateway.Worker.Tests / MxGateway.Worker with -p:Platform=x86.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests-003: temp auth-DB directories leaked under %TEMP%. Added the
TempDatabaseDirectory IDisposable helper (clears the Sqlite connection pool,
then recursively deletes); SqliteAuthStoreTests and ApiKeyAdminCliRunnerTests
now dispose every directory they create.
Tests-004: added end-to-end coverage composing the real authorization
interceptor in front of the real MxAccessGatewayService, plus scope-resolver
tests confirming an unmapped request type fails closed to the admin scope.
Tests-005: added coverage for a worker faulting mid-command — a pipe
disconnect and a worker fault while an InvokeAsync is in flight both fail the
pending invoke. No product change needed.
Tests-006 (re-triaged): the flaky ReadLoop_WhenClientFaults_KillsOwnedWorkerProcess
is a test race, not a product bug — the kill runs synchronously inside
SetFaulted. Rewrote it to await FakeWorkerProcess exit deterministically, and
replaced fixed Task.Delay timing in the late-reply and heartbeat tests with
FIFO ordering and an injected ManualTimeProvider.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reflects resolution of Server-002/004/005/006, Worker-004..008,
Client.Dotnet-001/002/003, Client.Go-002/003, Client.Java-001..005.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Client.Java-001: redactApiKey echoed the last 4 secret characters. It now
keeps only the non-secret mxgw_<key-id>_ prefix plus ***; non-gateway-shaped
tokens return <redacted>.
Client.Java-002: a close() after a queue-overflow could wipe the enqueued
overflow exception. Terminal transitions are now serialized through a single
guarded terminate() — first terminal condition wins.
Client.Java-003: openSession never read gateway_protocol_version. Both
openSession paths now call ensureGatewayProtocolCompatible, rejecting a
non-zero mismatch and accepting unset (0) for older gateways.
Client.Java-004: register/addItem/addItem2 fell back to a return_value that
silently yields 0 when unset. The fallback is now guarded by hasReturnValue()
and throws on a protocol violation.
Client.Java-005: close() in try-with-resources could mask the body exception
when the CloseSession RPC failed. close() now catches and logs the
close-time failure; closeRaw() still surfaces it for callers that want it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Client.Go-002: the Events/EventsAfter compatibility path silently dropped
events when the 16-slot results channel filled — it cancelled the stream and
closed the channel with no error delivered. sendEventResult now evicts an
old buffered event and delivers a terminal EventResult carrying the new
exported ErrEventBufferOverflow before close, so the overflow is observable.
Client.Go-003: parseInt32List panicked on a malformed -item-handles token,
crashing the CLI with a stack trace. It now returns an error that
runUnsubscribeBulk propagates, exiting 2 with a clean message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Client.Dotnet-001: MapRpcException typed only Unauthenticated and
PermissionDenied; every other gRPC status collapsed to an untyped exception
with the status code discarded. Added a nullable StatusCode to
MxGatewayException, extracted the duplicated mappers into a shared
RpcExceptionMapper that records the code for every status, and documented it.
Client.Dotnet-002: the production retry branch (MxGatewayException wrapping
RpcException) was never exercised. FakeGatewayTransport gained a
MapTransportExceptions mode that runs thrown RpcExceptions through
RpcExceptionMapper exactly as the production transport does.
Client.Dotnet-003: MxGatewaySession.DisposeAsync disposed _closeLock while a
concurrent CloseAsync could be parked in WaitAsync. DisposeAsync now drains
in-flight CloseAsync callers before disposing the semaphore; the client's
_disposed flag is accessed via Interlocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>