Commit Graph

67 Commits

Author SHA1 Message Date
Joseph Doherty 6777d49030 Point the Java client at the StreamAlarms alarm feed
Regenerate the Java protobuf stubs and replace queryActiveAlarms with
streamAlarms, returning a MxGatewayAlarmFeedSubscription over
AlarmFeedMessage served by the gateway's central alarm monitor
(snapshot, snapshot_complete, then live transitions). Drops session_id
from the acknowledge surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 16:55:20 -04:00
Joseph Doherty 1b6ca07bb5 Point the Rust client at the StreamAlarms alarm feed
Replace GatewayClient::query_active_alarms with stream_alarms, an
AlarmFeedStream over AlarmFeedMessage served by the gateway's central
alarm monitor (snapshot, snapshot_complete, then live transitions).
Drops session_id from the acknowledge surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 16:49:26 -04:00
Joseph Doherty 1ad0be8276 Point the Python client at the StreamAlarms alarm feed
Regenerate the Python protobuf stubs and replace query_active_alarms
with stream_alarms, an AsyncIterator over AlarmFeedMessage served by
the gateway's central alarm monitor (snapshot, snapshot_complete, then
live transitions). Drops session_id from the acknowledge surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 16:45:53 -04:00
Joseph Doherty 9328c4f657 Point the Go client at the StreamAlarms alarm feed
Regenerate the Go protobuf stubs and replace the session-scoped
QueryActiveAlarms surface with the session-less StreamAlarms feed:
snapshot-then-live AlarmFeedMessage fan-out served by the gateway's
central alarm monitor. Drops session_id from the acknowledge surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 16:45:47 -04:00
Joseph Doherty ac12c150c3 Point the .NET client at the StreamAlarms alarm feed
Replace the client's QueryActiveAlarmsAsync with StreamAlarmsAsync —
a session-less subscription to the gateway's central alarm feed that
yields the active-alarm snapshot followed by live transitions.
AcknowledgeAlarm is session-less (AcknowledgeAlarmRequest no longer
carries a session id). Updates the transport interface, the gRPC
transport, the test fake, and the alarm tests; the .NET client
solution builds and its alarm tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 16:30:50 -04:00
Joseph Doherty 6126099cdb e2e: drive each client CLI through one long-lived batch process
The cross-language e2e matrix spawned one CLI process per operation —
~250 per client — paying a process (and, for the Java CLI, a full JVM)
cold-start every time. The Java leg alone ran ~16 minutes.

Each client CLI (dotnet, go, rust, python, java) gains a `batch`
subcommand: a single process that reads one command line from stdin,
runs it through the normal subcommand dispatch, writes the JSON result,
then a line containing exactly `__MXGW_BATCH_EOR__`. A failing command
writes its `{"error":...}` envelope and the loop continues.

run-client-e2e-tests.ps1 now launches one batch process per client and
pings every operation through its stdin/stdout, so startup is paid once
per client. The orchestration and assertions are unchanged; the parity
and auth phases now read the `{"error":...}` envelope instead of a
process exit code.

Full 5-client matrix with -VerifyWrite: ~15 min, down from ~35; the Java
leg dropped from ~16 min to ~2-3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 06:20:13 -04:00
Joseph Doherty b794c46bc7 File and fix Server-030 and Client.Dotnet-017 from e2e surfacing
Both findings surfaced when running the cross-language e2e matrix
(scripts/run-client-e2e-tests.ps1) against the redeployed gateway at
commit 84d36b7. Filed in code-reviews/Server/findings.md and
code-reviews/Client.Dotnet/findings.md and fixed in the same change.

Server-030 (Medium / Error handling): GatewaySession.GetReadyWorkerClient
gated on `_state == Ready && _workerClient.State == Ready` but only
formatted `_state` into the SessionManagerException message. Under load
the gateway-driven `_state` and the worker-driven `WorkerClient.State`
can diverge, producing a self-contradictory diagnostic ("Session ... is
not ready. Current state is Ready."). The Java e2e client hit this on
the 56th item after 55 successful add-items. Rewrote the message to
include both states ("Session state is X; worker state is Y"), added
an XML doc explaining the two-state contract and that this branch is
the fail-fast for a divergence race, and added regression test
SessionManagerTests.InvokeAsync_WhenWorkerNotReadyButSessionReady_DiagnosticIncludesBothStates
that pins both states appear in the message. The deeper race (should
the gateway briefly wait for worker-Ready before failing?) remains
open as a follow-up.

Client.Dotnet-017 (Low / Error handling): stream-events CLI threw
OperationCanceledException as an unhandled exception when the user's
--timeout expired before --max-events was reached. Exit code
-532462766, no aggregate JSON. The other client CLIs (Go, Rust, Python,
Java) exit 0 in this case. Wrapped the `await foreach` in
`catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`
so the supplied token's cancellation (--timeout, Ctrl+C, or parent
CTS) becomes graceful completion; the aggregate `{ "events": [...] }`
JSON still runs after the catch. Added regression test
RunAsync_StreamEvents_WhenTimeoutFiresAfterEvents_EmitsCollectedEventsAndExitsZero
backed by a new FakeCliClient.StreamHangAfterEvents hook that yields
the configured events then parks on the cancellation token.

Side cleanup: the GatewayApplicationTests test added under Server-020
was asserting an invariant (`/dashboard/dashboard/X` doesn't exist)
that I broke by reverting Server-020 in 84d36b7. The doubled endpoint
shapes do exist now (MapGroup("/dashboard") prefixing an already
"/dashboard/X" @page directive) but they're harmless — no client
requests `/dashboard/dashboard/X`. Replaced the test with a positive
assertion (`/dashboard/X` routes ARE registered) and rewrote the XML
doc to record the actual contract.

Verified: dotnet test src/MxGateway.Tests passes 480/480, dotnet test
clients/dotnet/MxGateway.Client.Tests passes 77/77, gateway redeployed
at this commit and GET http://localhost:5130/dashboard returns 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:07:39 -04:00
Joseph Doherty 1aafd6bde4 Code-review 2026-05-20 sweep #2: re-review at a020350, resolve 48 findings
Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.

High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
  pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
  string (it must be a valid SPDX expression), so `pip wheel .` and
  `pip install -e .` both fail before any source compiles. Tests
  still pass because pytest bypasses the build backend via
  `pythonpath`. Dropped the invalid license string, kept the
  `License :: Other/Proprietary License` classifier, and added
  `tests/test_packaging.py` so a future regression of the same shape
  is caught in CI.

Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
  on WorkerPipeSessionOptions bounds the in-flight-command watchdog
  suppression so a truly stuck COM call still triggers StaHung
  instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
  cross-language bench comparison is apples-to-apples again;
  `failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
  serialisation pattern to DeployEventStream so close() arriving
  after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
  stability check after UnAdvise instead of strict equality against
  the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
  log sink the WriteSecured live test owns (worker stdout/stderr,
  gateway logs, direct WriteLine) so the credential is proven
  absent from the full output buffer, not just the diagnostic
  message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
  for the previously-uncovered Write2Bulk and WriteSecured2Bulk
  arms of WriteBulkConstraintPlan.SetPayload.

Lows (41 — highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
  GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
  AlarmsOptions validated at startup (Server-026); Authorization.md
  Constraint Enforcement snippet/prose enumerate the bulk write/read
  family (Server-027); bulk-read-commands and bulk-write-commands
  capability tokens added to OpenSession (Server-029);
  NotWiredAlarmRpcDispatcher XML doc and missing scope-resolver and
  state-machine tests cleaned up (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
  guard the poll path uses, at every command entry (Worker-024);
  RunAsync null-checks the runtime-session factory result
  (Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
  GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
  rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
  CancelCommandReturnValue serialised under lock (Worker.Tests-027);
  Probes namespace lifted to MxGateway.Worker.Tests.Probes
  (Worker.Tests-029); cancel-envelope sequence numbers monotonised
  (Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
  section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
  (Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
  test backed by a TaskCompletionSource fake (Tests-022); companion
  FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
  (Tests-023); constraint plan reply-count divergence pinned
  (Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
  end-to-end (IntegrationTests-018); abnormal-exit keyword set
  tightened to pipe-disconnected/end-of-stream and the test now
  asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
  default 30s wall-clock budget doesn't kill them (015);
  BenchStreamEventsAsync observes the inner stream task on every
  exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
  %w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
  RFC3339Nano with fractional seconds (019); runStreamEvents installs
  signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
  table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
  cancellation contract Client.Java-015 established (022); stream-events
  text path uses Long.toUnsignedString for worker_sequence (023);
  bench-read-bulk no longer pollutes success-latency histogram with
  failure durations (024); --shutdown-timeout CLI option propagates
  through to ClientOptions (025); seven new MxGatewayCliTests cover
  the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
  wheel-build smoke test added under tests/test_packaging.py (020);
  README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
  document the AsRef<str> read_bulk genericism (019);
  next_correlation_id re-exported at the crate root, with a
  property-style doc contract and an explicit disclaimer that the
  literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
  IConstraintEnforcer mechanism instead of "tag-allowlist filter"
  (014); BulkReadResult gains explicit per-arm payload-population
  documentation for the success vs failure cases (015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:28:54 -04:00
Joseph Doherty a0203503a7 Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules
Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and
fixed them in three priority waves (3 High, 17 Medium, 52 Low).

Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
  GatewayGrpcScopeResolver so non-admin keys can use them; document
  the mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the
  CLI FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression
  in generated tonic code by reformatting the ReadBulkCommand proto
  comment and scoping a #![allow(...)] to the generated submodules.

Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and
  make DisposeAsync race-safe against in-flight CloseAsync (-016);
  add constraint-enforcement test coverage for the bulk-plan path
  (-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop
  can distinguish graceful shutdown from a real STA-affinity
  violation (-016); have the watchdog skip StaHung while
  CurrentCommandCorrelationId is non-empty so a legitimate slow
  ReadBulk no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the
  11 GatewaySession bulk methods (-013); replace the real TCP probe
  in GalaxyHierarchyCacheTests with an IGalaxyRepository fake
  (-016).
- IntegrationTests: drive the StreamEvents writer in the live Write
  test and assert OnWriteComplete (-012); add live tests for
  Unadvise/RemoveItem/Unregister ordering, WriteSecured, and
  abnormal worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
  CreateForTesting factory (-016); cover WorkerCancel and
  unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races
  beforeStart() (-014); return a CancellingCompletableFuture that
  actually forwards cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in
  the CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
  histograms with failed-call durations (-015); add coverage for
  the five MalformedReply paths, the bulk-write helpers, the
  Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write
  command family (-009).

Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
  WorkerAlarmRpcDispatcher missing-session handling; drop the
  duplicate dashboard @page routes; refresh IAlarmRpcDispatcher
  XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
  subscriptionExpression / ExecutingCommand arms; preserve
  factory-supplied runtime sessions; split MxAlarmSnapshot.cs into
  three files.
- Tests: dispose the WebApplication in seven test classes; rebuild
  FakeWorkerProcess.WaitForExitAsync against a real TaskCompletion
  source; switch the heartbeat-expires test to ManualTimeProvider;
  add InvariantCulture to the remaining DateTimeOffset.Parse sites;
  document GalaxyFilterInputSafetyTests in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
  IDisposable, class-level [Trait], single-source ZB default
  connection string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact
  so absent env vars SKIP not pass; PascalCase rename of probe
  [Fact]s; deterministic deadline test; new frame-protocol error
  tests; ComputeTransitions diff-coverage; relocate dev-rig probes
  to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
  Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
  TreatWarningsAsErrors / analysers apply; document
  DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
  bulk-read handles in CLI; surface AcknowledgeAlarm transport
  faults through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
  runWriteBulkVariant; document the six new subcommands in
  writeUsage; drain galaxy-watch events on limit; switch io.EOF
  comparisons to errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout
  option; regex-based credential redaction; Long.toUnsignedString
  for uint64 sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
  _percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
  _api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close
  stop hard-coding correlation IDs; resync RustClientDesign.md
  with the current Session / Error surface and CLI subcommand set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:46:47 -04:00
Joseph Doherty 1cd51bbda3 .NET CLI: bench-stream-events for max event-throughput characterization
New subcommand drives the gateway''s StreamEvents server-stream as fast
as it can from a single client process. Subscribes to --bulk-size tags
(rotating through all six TestMachine attributes by default) and counts
events received over a --duration-seconds steady-state window. Tracks
events/sec, end-to-end latency (now - event.worker_timestamp), and any
worker faults observed via a post-run DrainEvents probe.

--session-count opens N independent gateway sessions from the same
client process — each session is independent at the gateway (own
worker, own event subscriber, own item handles) so this measures how
the gateway multiplexes concurrent event streams without needing
multiple client processes. Sessions are staggered open by default
(--session-start-stagger-ms 750) because firing N concurrent
OpenSession calls forces N concurrent worker x86 spawns, and on a dev
rig that exceeds the gateway''s 30-second worker startup timeout
around N >= 6-8. The stagger gives each worker headroom to init its
COM apartment + attach the event sink before the next one starts.

Phase 1 of the bench opens + subscribes every session sequentially;
phase 2 opens the steady-state window once everyone is advised, so
the measurement isn''t skewed by late-arriving sessions still in
warmup. The latency sample is shared across sessions (locked
List<double>); event counts use Interlocked.

Initial sweep at --bulk-size 120 against the dev galaxy (20 machines
x 6 attributes = 120 unique tags) showed:

  - Linear throughput scaling with subscribed-tag count: N=6→2 ev/s,
    N=24→8 ev/s, N=60→20 ev/s, N=120→41 ev/s. The dev galaxy is
    producer-bound at ~0.34 events/sec per advised tag — gateway has
    plenty of headroom.
  - Latency stayed at p50 ≈17ms, p95 ≈34ms across the entire range —
    no degradation with subscribed-tag count.
  - Zero queue-overflow faults; gateway 10k-event buffer never came
    close to filling at this producer rate.
  - Linear scaling with session count too (staggered open): 1→44, 2→81,
    4→130, 8→324 events/sec at p50 16ms across all session counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 06:30:24 -04:00
Joseph Doherty 61644e63fb Rust: take Session::read_bulk tag list by borrowed slice
Changes the public signature from `Vec<String>` (owned) to
`&[impl AsRef<str>]` so callers can re-issue the same call repeatedly
without cloning at the call site. The bench''s steady-state loop now
passes `&tags` instead of `tags.clone()`; the CLI subcommand passes the
parsed `&items`; the integration test passes `&["Area001.Pump001.Speed"]`
straight from a string literal slice.

Honest perf note: this is an ergonomics change, not a measurable speedup.
The method still has to materialise an owned `Vec<String>` internally
because prost''s generated `ReadBulkCommand` field requires it, so the
total heap traffic per call is unchanged. Across two 30-second, 5-way
concurrent bench runs at bulkSize=6:

  pre-fix  (.clone() at caller): 145.35 calls/sec, p99  62.31 ms
  post-fix run 1 (&tags):        165.98 calls/sec, p99  40.65 ms
  post-fix run 2 (&tags):        146.19 calls/sec, p99  60.04 ms

Run-to-run variance (145-166) dominates any signal from the fix. Solo
Rust release stayed at 261-267 calls/sec across both API shapes,
confirming the bench is gateway-bound under 5-way contention rather
than client-allocation-bound. The change is kept because the borrowed
slice is the idiomatic Rust API shape for "list of items the callee
does not need to take ownership of", and it cleans up the explicit
clone from the bench's inner loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 05:49:33 -04:00
Joseph Doherty 93633ce99c Cross-language ReadBulk stress benchmark
Adds a bench-read-bulk subcommand to every client CLI (.NET, Go, Rust,
Python, Java) and a PowerShell driver that runs all five concurrently
against the deployed gateway and prints a side-by-side comparison.

Each CLI''s bench:

  - Opens its own session, registers, subscribes to bulk-size tags so the
    worker''s MxAccessValueCache populates from real OnDataChange events.
  - Runs a warmup-seconds-long pre-loop with identical calls so JIT /
    connection-pool / first-call overhead is amortised before the
    measurement window.
  - Runs ReadBulk in a tight in-process loop for duration-seconds with
    per-call high-resolution latency capture (Stopwatch in .NET,
    time.Now in Go, std::time::Instant in Rust, time.perf_counter in
    Python, System.nanoTime in Java).
  - Unsubscribes + closes the session, then emits one JSON object with
    the shared schema: { language, durationMs, totalCalls, successfulCalls,
    failedCalls, totalReadResults, cachedReadResults, callsPerSecond,
    latencyMs: { p50, p95, p99, max, mean } }.

The PS driver (scripts/bench-read-bulk.ps1) launches one detached process
per client, waits for all to finish, parses the trailing JSON object from
each stdout, prints a comparison table, and persists the combined report
under artifacts/bench/. Quoting around Java''s `gradle --args="..."` is
handled by writing a one-shot .bat that cmd.exe runs; the .NET CLI''s
per-call gRPC timeout is auto-scaled to (Duration + Warmup + 30s) so the
channel-wide timeout doesn''t cancel the bench mid-loop.

Live 30-second steady-state run against the deployed gateway, all five
clients hitting the same six TestMachine_001..006.TestChangingInt tags:

  client    calls/sec  cached/total    p50 ms  p95 ms  p99 ms  max ms
  dotnet      171.78   30924/30924      3.84   14.06   40.41  542.48
  go          175.46   31590/31590      3.93   13.52   41.26  243.00
  rust        123.26   22188/22188      5.52   15.78   48.11  544.41
  python      145.79   26244/26244      4.86   14.85   41.65  645.84
  java        181.12   32604/32604      3.80   10.59   33.37  344.27

143,550 ReadBulk results across all five clients during the 30s window;
100% were was_cached = true (the worker''s cache fast-path never fell
through to the snapshot lifecycle). Aggregate read throughput ~800
calls/sec against five concurrent sessions sharing the same cached tags.

A second variant with bulk-size 20 sustained the same per-client call
rate while delivering 3.3x more values per call (~37,000 cached reads/sec
aggregate across the five concurrent sessions), confirming the linear
per-tag cache lookup inside one call is not a bottleneck at this scale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 05:17:08 -04:00
Joseph Doherty eaa7093cd6 .NET CLI: register the five new bulk subcommands in IsKnownGatewayCommand
The previous commit added read-bulk / write-bulk / write2-bulk /
write-secured-bulk / write-secured2-bulk dispatch cases to RunCoreAsync
but left them out of IsKnownGatewayCommand, so the .NET CLI rejected
them at the pre-dispatch gate and printed the usage banner instead of
running the new code paths. Surfaced when the live e2e exercised the
read-bulk phase against the deployed gateway — the call routed through
the unknown-command path before reaching the protobuf builder.

Also extends WriteUsage with one line per new subcommand so the banner
documents the new surface.

Live e2e against the deployed gateway now passes for all five clients
(dotnet, go, rust, python, java) with 4/4 tags returning was_cached=true
after the subscribe-bulk + read-bulk path, confirming the worker
MxAccessValueCache populates from real MXAccess OnDataChange events and
round-trips through every client''s JSON parser.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 04:48:55 -04:00
Joseph Doherty f220908f3f Add bulk read/write CLI subcommands and e2e matrix coverage
The previous commit added the bulk read/write library surface in every
client; this commit makes that surface reachable from each client's CLI
and exercises it through scripts/run-client-e2e-tests.ps1.

Five new subcommands in every client CLI (.NET / Go / Rust / Python /
Java): read-bulk, write-bulk, write2-bulk, write-secured-bulk, and
write-secured2-bulk. Each follows the existing subscribe-bulk shape:

  - read-bulk takes --server-handle, --items <csv tag list>, and
    --timeout-ms (0 = worker default). JSON output carries the
    BulkReadResult fields, including was_cached so the e2e matrix can
    verify the cached-path semantics.
  - The four bulk-write families take --server-handle, --item-handles
    <csv>, --type, --values <csv>. write2-bulk and write-secured2-bulk
    add a single --timestamp applied to every entry; the secured
    variants take --current-user-id and --verifier-user-id. All four
    output BulkWriteResult JSON.

A new -SkipReadWriteBulk switch on the matrix script (default OFF)
controls two new e2e phases:

  - After the existing subscribe-bulk phase leaves tags advised, the
    script runs read-bulk against the same tag list and asserts most
    results return was_cached = true. This is the only e2e coverage of
    the cache-then-snapshot fork — the unit + gateway tests verify the
    semantics with a fake worker, but only the live cross-language
    matrix proves the cache populates from real OnDataChange events and
    survives the round-trip through every client''s JSON parser.
  - When -VerifyWrite is set, the write phase now also runs a single-
    entry write-bulk against the same writable item handle (using a
    distinct sentinel value) and asserts a per-entry success. Confirms
    the BulkWriteResult wire format end-to-end without complicating
    the OnWriteComplete echo assertion the single-item phase already
    verifies.

Dry-run validation passes for all five clients: each emits the correct
read-bulk and write-bulk CLI invocations with the right flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 04:06:14 -04:00
Joseph Doherty 5e375f6d3d Add bulk read/write command family across worker, gateway, and clients
Adds five new MXAccess command kinds (WriteBulk, Write2Bulk,
WriteSecuredBulk, WriteSecured2Bulk, ReadBulk) that ride the existing
"one round-trip, per-entry results" bulk shape used by AddItemBulk and
SubscribeBulk today. MXAccess COM has no native bulk API; the worker
runs each bulk operation as a sequential loop on its STA, returning
one BulkWriteResult / BulkReadResult per requested entry so per-item
MXAccess failures surface as was_successful=false rather than throwing.

ReadBulk has no MXAccess analogue. The worker satisfies it by:

  - Returning the last cached OnDataChange payload (was_cached=true)
    when the requested tag is already in the session''s item registry
    AND advised — the existing subscription is NOT touched, since the
    caller did not create it.
  - Otherwise taking the AddItem + Advise + wait-for-OnDataChange +
    UnAdvise + RemoveItem snapshot lifecycle itself (was_cached=false)
    and leaving the session exactly as it was. The wait pumps Windows
    messages on the STA so the inbound MXAccess event can dispatch
    while the executor still holds the thread.

The new MxAccessValueCache lives on each MxAccessSession, shared with
MxAccessBaseEventSink which populates it on every OnDataChange after
the event clears the outbound queue. Eviction on RemoveItem keeps
reused MXAccess handles from serving stale values from a previous
lifetime.

Gateway-side authorization wires WriteBulk/Write2Bulk to invoke:write,
WriteSecuredBulk/WriteSecured2Bulk to invoke:secure, ReadBulk to
invoke:read. The constraint-filter pipeline is refactored from a single
BulkConstraintPlan record into an abstract base plus three concretes
(SubscribeBulk, WriteBulk, ReadBulk), each owning its own denied-entry
merge so the dispatch site never branches on reply shape. A new
FilterWriteBulkAsync<TEntry> generic over the four write-entry shapes
runs CheckWriteHandleAsync per entry; denied entries surface as the
BulkWriteResult shape, preserving original-index order.

All five language clients (.NET, Go, Rust, Python, Java) gained the
five new methods following their existing bulk pattern, with regenerated
protobufs.

Tests added:
  - MxAccessValueCacheTests (6 cases) — Set/TryGet, Remove resets the
    version, TryWaitForUpdate signals on Set, pump step fires each poll.
  - MxAccessBaseEventSinkTests — OnDataChange populates the cache,
    ValueCache property exposes the bound instance.
  - MxAccessCommandExecutorTests — four bulk-write variants (per-entry
    success/failure, value+timestamp forwarding, secured user ids),
    ReadBulk snapshot lifecycle on uncached tag (timeout surfaces as
    was_successful=false), invalid-payload reply.
  - GatewayGrpcScopeResolverTests — five new MxCommandKind cases.
  - SessionManagerTests — WriteBulk and ReadBulk forwarding through
    FakeWorkerHarness; ReadBulk forwards timeout_ms.
  - Per-client (.NET, Go, Rust, Python, Java) — WriteBulk builds the
    right command and returns per-entry results, ReadBulk forwards the
    timeout and unpacks the was_cached flag.

Cross-language e2e CLI subcommands for the new bulks are deliberately
scoped out of this change (each of the five client CLIs would need
five new subcommands plus matching phases in
scripts/run-client-e2e-tests.ps1); coverage equivalent to the existing
bulk-subscribe coverage is provided by worker + gateway + per-client
unit tests.

Docs updated in the same commit: gateway.md (Public MXAccess Command
Surface), docs/DesignDecisions.md (new "Bulk Command Family" section
with the ReadBulk cache-then-snapshot rationale), and every client
README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 03:42:38 -04:00
Joseph Doherty 758aca2355 Make the e2e write phase work live across all five clients
Running the matrix against a live gateway surfaced several issues:

- The write phase is now opt-in (-VerifyWrite, was -SkipWrite). It runs
  right after register so only a small event backlog precedes the write,
  and asserts the reliable OnWriteComplete signal (the written value is
  not echoed back by a provider-driven attribute like TestChangingInt, so
  the value compare is best-effort).
- Java was launched as bare "gradle", which .NET's Process.Start cannot
  exec (it is gradle.bat) — resolve the launcher and run it via cmd.exe.
- The Java client's MxEventStream queue capacity was 16, which overflows
  on any active session's backlog-replay burst; raised to 1024.
- The Rust stream-events CLI now renders the event family as the proto
  enum name, matching the protobuf-JSON the other four clients emit.

Update docs/GatewayTesting.md for the reworked write phase.

Verified live: the full five-client matrix passes with -VerifyWrite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:45:47 -04:00
Joseph Doherty e355a7674b Add write, parity, auth, and parallel coverage to client e2e matrix
Close the notable gaps in scripts/run-client-e2e-tests.ps1:

- Write round-trip: write a per-client sentinel value to a configurable
  writable attribute, then assert it is echoed back through the event
  stream. Extends the Rust mxgw-cli stream-events output with full
  per-event JSON (itemHandle + protojson-shaped value) so all five
  language clients run an identical value compare.
- Parity: assert an invalid item handle and an unknown session id are
  rejected rather than silently succeeding.
- Auth rejection: assert open-session is rejected with a missing API key
  and, when -RejectScopeApiKeyEnv is supplied, with an insufficient-scope
  key.
- Parallel: -Parallel runs each language client as an isolated child
  process and merges their JSON reports.

Update docs/GatewayTesting.md for the new phases and flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:55:51 -04:00
Joseph Doherty cd92048f4e Regenerate stale Java client protobuf code
The checked-in generated Java sources under clients/java/src/main/generated/
were out of sync with both the .proto contracts and the configured
protobuf 4.33.1 toolchain: they were missing the alarm command kinds
(MX_COMMAND_KIND_SUBSCRIBE_ALARMS..ACKNOWLEDGE_ALARM_BY_NAME, 25-29), the
alarm/galaxy message additions, and the protobuf 4.x generated-code layout.
Regenerated via `gradle generateProto`; `gradle test` passes against the
refreshed sources. No hand edits — pure protoc/protoc-gen-grpc-java output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:21:13 -04:00
Joseph Doherty a7bf1ef95d Resolve Client.Python-001/002/004/006/007/008/010/011/012 findings
Client.Python-001: dropped "scaffold" from the stale pyproject description.
Client.Python-002 (re-triaged): stale finding — MxGatewayCommandError is
already exported and in __all__; no change needed.
Client.Python-004: removed the dead `closed` variable in _smoke; the CLI
smoke now uses `async with session`.
Client.Python-006: close() on both clients and Session had an unlocked
check-then-set race; `_closed` is now set before the await.
Client.Python-007: gateway stream iterators now share one helper that
explicitly catches CancelledError and cancels the call.
Client.Python-008: to_mx_value now rejects nan/inf; float/bytes mapping
documented.
Client.Python-010: removed the circular-import-workaround late imports in
favour of TYPE_CHECKING / module-scope imports.
Client.Python-011: ensure_mxaccess_success no longer treats a proto3-default
success==0 with an unset category as a failure.
Client.Python-012 (Won't Fix): invoke_raw deliberately skips MXAccess-failure
detection for parity tests; documented the contract instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:59:24 -04:00
Joseph Doherty 6eb9ea9105 Resolve Client.Java-006..012 code-review findings
Client.Java-006: close() on both clients only called shutdown(). It now
awaits termination up to the connect timeout and shutdownNow()s on timeout.

Client.Java-007: added MxGatewayLowFindingsTests covering the alarm surface,
async streaming, MxEventStream overflow, and TLS channel construction. A
latent bug surfaced: a missing CA file throws IllegalArgumentException, not
SSLException — the channel-builder catch was broadened accordingly.

Client.Java-008: async thenApply sites now route stray RuntimeExceptions
through MxGatewayErrors.fromGrpc via a normalising validator.

Client.Java-009: extracted ~80 duplicated lines (createChannel, withDeadline,
toCompletable, ...) into a shared MxGatewayChannels; both clients delegate.

Client.Java-010 (re-triaged): the README's metadata:read scope was correct;
the acknowledgeAlarm Javadoc's invoke:alarm-ack was wrong — corrected to the
admin scope.

Client.Java-011: documented the intentional fail-fast event-stream
backpressure in Javadoc and the README.

Client.Java-012: replaced CommonOptions.resolved()'s mutate-and-return-this
with side-effect-free resolvedApiKey()/resolvedTimeout() accessors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:42:51 -04:00
Joseph Doherty 555fe4c0ba Resolve Client.Go-004..010 code-review findings
Client.Go-004: ran gofmt on alarms_test.go and galaxy_test.go; the tree is
now gofmt-clean.

Client.Go-005/009/010: migrated Dial/DialGalaxy off the deprecated
grpc.DialContext/WithBlock to grpc.NewClient via a shared dial helper, with
a DialTimeout-bounded readiness probe to keep fail-fast semantics; shared
callContext deadline arithmetic; updated the stale Dial doc comment. Test
harnesses use passthrough:///bufnet for the NewClient default-scheme change.

Client.Go-006: added GatewayError.Code() and an IsTransient(err) helper so
callers can classify transient gRPC failures.

Client.Go-007: newCorrelationID no longer returns an empty id when
crypto/rand fails — it falls back to a non-empty time+counter id.

Client.Go-008: added coverage_test.go for transport-credential resolution,
callContext deadline arithmetic, and native value/array edge kinds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:42:33 -04:00
Joseph Doherty 89043cb2b6 Resolve Client.Dotnet-004..008 code-review findings
Client.Dotnet-004: documented DefaultCallTimeout as both the per-attempt
deadline and the shared retry budget, and removed DeadlineExceeded from the
transient-retry set (a client-imposed deadline cannot be helped by retrying).

Client.Dotnet-005: RegisterAsync/AddItemAsync/AddItem2Async silently returned
0 when a successful reply lacked the typed payload. They now throw a
descriptive MxGatewayException.

Client.Dotnet-006: added XML docs to the previously undocumented public
members MaxGrpcMessageBytes, GatewayProtocolVersion, WorkerProtocolVersion.

Client.Dotnet-007: corrected the AcknowledgeAlarmAsync XML comment — the RPC
requires the admin scope, not a non-existent invoke:alarm-ack sub-scope.

Client.Dotnet-008: the CLI redactor missed env-var-sourced keys because the
caller passed only the --api-key option. Redaction now uses the same
resolver, stripping env-var keys too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:42:27 -04:00
Joseph Doherty e4fbbb541a Resolve Client.Python-003, -005, -009 code-review findings
Client.Python-003: stream_events_raw and query_active_alarms passed `timeout`
to the stub with no TypeError fallback, unlike _unary. Both now route through
a shared _open_stream helper that strips `timeout` on TypeError.

Client.Python-005: discover_hierarchy buffered the entire Galaxy hierarchy in
memory. Added GalaxyRepositoryClient.iter_hierarchy, a lazy async generator
yielding objects page-by-page; discover_hierarchy is now a thin wrapper that
preserves its list contract. README documents iter_hierarchy.

Client.Python-009: added regression coverage for previously untested paths —
write2/add_item2 request shape, the MAX_BULK_ITEMS boundary, the None-argument
TypeError guards, TLS ca_file reading, and the non-auth map_rpc_error fallthrough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:45:16 -04:00
Joseph Doherty ff41556b9a Resolve Client.Java-001..005 code-review findings
Client.Java-001: redactApiKey echoed the last 4 secret characters. It now
keeps only the non-secret mxgw_<key-id>_ prefix plus ***; non-gateway-shaped
tokens return <redacted>.

Client.Java-002: a close() after a queue-overflow could wipe the enqueued
overflow exception. Terminal transitions are now serialized through a single
guarded terminate() — first terminal condition wins.

Client.Java-003: openSession never read gateway_protocol_version. Both
openSession paths now call ensureGatewayProtocolCompatible, rejecting a
non-zero mismatch and accepting unset (0) for older gateways.

Client.Java-004: register/addItem/addItem2 fell back to a return_value that
silently yields 0 when unset. The fallback is now guarded by hasReturnValue()
and throws on a protocol violation.

Client.Java-005: close() in try-with-resources could mask the body exception
when the CloseSession RPC failed. close() now catches and logs the
close-time failure; closeRaw() still surfaces it for callers that want it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:31:46 -04:00
Joseph Doherty f88a029ecc Resolve Client.Go-002, -003 code-review findings
Client.Go-002: the Events/EventsAfter compatibility path silently dropped
events when the 16-slot results channel filled — it cancelled the stream and
closed the channel with no error delivered. sendEventResult now evicts an
old buffered event and delivers a terminal EventResult carrying the new
exported ErrEventBufferOverflow before close, so the overflow is observable.

Client.Go-003: parseInt32List panicked on a malformed -item-handles token,
crashing the CLI with a stack trace. It now returns an error that
runUnsubscribeBulk propagates, exiting 2 with a clean message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:31:36 -04:00
Joseph Doherty 8023eccfa6 Resolve Client.Dotnet-001, -002, -003 code-review findings
Client.Dotnet-001: MapRpcException typed only Unauthenticated and
PermissionDenied; every other gRPC status collapsed to an untyped exception
with the status code discarded. Added a nullable StatusCode to
MxGatewayException, extracted the duplicated mappers into a shared
RpcExceptionMapper that records the code for every status, and documented it.

Client.Dotnet-002: the production retry branch (MxGatewayException wrapping
RpcException) was never exercised. FakeGatewayTransport gained a
MapTransportExceptions mode that runs thrown RpcExceptions through
RpcExceptionMapper exactly as the production transport does.

Client.Dotnet-003: MxGatewaySession.DisposeAsync disposed _closeLock while a
concurrent CloseAsync could be parked in WaitAsync. DisposeAsync now drains
in-flight CloseAsync callers before disposing the semaphore; the client's
_disposed flag is accessed via Interlocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:31:33 -04:00
Joseph Doherty e967e85973 Resolve Client.Go-001 code-review finding
MxAccessError.Unwrap returned e.Command directly; on the HRESULT-only path
Command is a nil *CommandError, so Unwrap returned a non-nil error wrapping
a typed nil and errors.As bound a nil *CommandError. Unwrap now returns an
untyped nil when Command is nil. Added errors_test.go regression coverage
for the HRESULT-only and populated-Command paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:46:12 -04:00
Joseph Doherty 0d8a28d2fe Fix all MxGateway.Client.Rust code-review findings
Resolves Client.Rust-001 through Client.Rust-011.

Build/test/clippy gate (Client.Rust-001/002/003):
- options.rs: doc comments on with_max_grpc_message_bytes /
  max_grpc_message_bytes (#![warn(missing_docs)])
- session.rs: rename BulkReplyKind variants to drop the shared `Bulk`
  suffix (clippy::enum_variant_names)
- galaxy.rs: deref instead of clone on Option<Timestamp>
  (clippy::clone_on_copy — an extra violation the gate also hit)
- mxgw-cli: assert version_json against GATEWAY/WORKER_PROTOCOL_VERSION
  constants instead of the stale literal 2
`cargo clippy --workspace --all-targets -- -D warnings` now passes.

Correctness / error handling:
- version.rs: CLIENT_VERSION = env!("CARGO_PKG_VERSION") (Client.Rust-004)
- session.rs: register/add_item/add_item2 handle extractors and
  bulk_results now return Err(Error::MalformedReply) instead of a
  silent 0 / empty vec on a shapeless OK reply (Client.Rust-005/006)
- error.rs: new Error::Unavailable classifies Code::Unavailable /
  ResourceExhausted as transient (Client.Rust-010)
- session.rs: per-call unique correlation ids via an atomic counter
  (Client.Rust-011)

Other:
- value.rs: MxValue/MxArrayValue compute the projection on demand
  instead of caching it, so a wire-only value pays no projection cost
  (Client.Rust-008)
- RustClientDesign.md: correct the crate layout, drop the unused
  `tracing` dependency (Client.Rust-007)
- client_behavior.rs: tests for the bulk-size cap, a mid-stream status
  fault, and the unreadable-CA-file path (Client.Rust-009)

cargo fmt / test --workspace (27 tests) / clippy all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:08:55 -04:00
Joseph Doherty 96bea1d478 Apply technical-light design system to the gateway dashboard
Restyles the Blazor dashboard onto a portable token-based theme so it
reads like an instrument panel: warm-paper background, hairline-ruled
panels, IBM Plex type, monospace tabular numerics, and status carried by
colour chips. Vendors theme.css + IBM Plex fonts, rewrites dashboard.css
as a thin token-driven view layer, and swaps the Bootstrap navbar and
status badges for the design-system app bar and chips.

Also includes pending API-key management, Galaxy hierarchy projection,
and constraint-enforcement work with their tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:31:04 -04:00
Joseph Doherty fe19c478c0 clients/rust: SDK methods for AcknowledgeAlarm + QueryActiveAlarms (PR E.6)
Eleventh PR of the alarms-over-gateway epic
(docs/plans/alarms-over-gateway.md). Mirrors PR E.2's .NET surface
on the Rust async SDK. Depends on PR E.1 (regen, merged).

- GatewayClient::acknowledge_alarm — async unary call. Uses the
  existing unary_request helper (call timeout) and routes failures
  through Error mapping; non-OK protocol status promotes to
  Error::ProtocolStatus via ensure_protocol_success.
- GatewayClient::query_active_alarms — async server-streaming call
  returning a new ActiveAlarmStream type alias (parallel to
  EventStream). Errors are pre-mapped from tonic::Status; dropping
  the stream cancels the call cooperatively.
- GATEWAY_PROTOCOL_VERSION bumped 2 → 3 to match the .NET contract.
- FakeGateway test impl extends to satisfy the new trait methods so
  client_behavior.rs builds. Two new integration tests cover the
  new SDK methods.

Tests:
- 12 unit + 10 client_behavior + 4 proto_fixtures = 26 tests, all
  pass under cargo test (Rust 1.x via existing toolchain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:06:13 -04:00
Joseph Doherty 730fdc93e0 clients/java: SDK methods for AcknowledgeAlarm + QueryActiveAlarms (PR E.5)
Tenth PR of the alarms-over-gateway epic
(docs/plans/alarms-over-gateway.md). Mirrors PR E.2's .NET surface
on the Java SDK. Depends on PR E.1 (regen, merged).

- MxGatewayClient.acknowledgeAlarm — blocking unary call, validates
  protocol status via the existing MxGatewayErrors helper. Wraps
  RuntimeException through MxGatewayErrors.fromGrpc for typed
  failure mapping.
- MxGatewayClient.acknowledgeAlarmAsync — CompletableFuture variant
  using the future stub.
- MxGatewayClient.queryActiveAlarms — async server-streaming RPC
  observed via a new MxGatewayActiveAlarmsSubscription handle
  (parallel to MxGatewayEventSubscription; the existing
  subscription class is hard-typed to MxEvent so a parallel type
  was simpler than retrofitting generics).
- MxGatewayClientVersion bumps GATEWAY_PROTOCOL_VERSION 2 → 3 to
  match the .NET contract; CLI version-string assertions updated
  to match.

Java SDK build green via Gradle 9.4.1 (mxgateway-client + mxgateway-cli).
17 tasks, all tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:01:35 -04:00
Joseph Doherty b4016e738c clients/go: SDK methods for AcknowledgeAlarm + QueryActiveAlarms (PR E.4)
Ninth PR of the alarms-over-gateway epic
(docs/plans/alarms-over-gateway.md). Mirrors PR E.2's .NET surface
on the Go SDK. Depends on PR E.1 (regen, merged).

- Client.AcknowledgeAlarm — context-aware unary call routed through
  the existing callContext helper (default 30s timeout). Failures
  wrap into *GatewayError; protocol-status non-OK promotes to typed
  protocol errors via EnsureProtocolSuccess.
- Client.QueryActiveAlarms — context-streaming wrapper around the
  generated MxAccessGateway_QueryActiveAlarmsClient. Caller drives
  the stream via Recv(); cancelling ctx releases it.
- types.go re-exports the four new generated types
  (AcknowledgeAlarmRequest/Reply, QueryActiveAlarmsRequest,
  ActiveAlarmSnapshot) plus the AlarmTransitionKind /
  AlarmConditionState enums and the
  QueryActiveAlarmsClient stream alias.
- version.go bumps GatewayProtocolVersion 1 → 3 to match the .NET
  contract; the const was previously stale and the bump fixes the
  pre-existing TestOpenSessionFixtureProtocolVersions failure that
  was masked because the fixture had not been regenerated until A.1.

Tests:
- 4 new tests in alarms_test.go — request shape + auth metadata,
  nil-request rejection, Unauthenticated mapping, snapshot
  streaming over bufconn, filter-prefix passthrough.
- All Go test suites green: cmd/mxgw-go + mxgateway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:54:22 -04:00
Joseph Doherty 168bb9a39a clients/python: SDK methods for AcknowledgeAlarm + QueryActiveAlarms (PR E.3)
Eighth PR of the alarms-over-gateway epic
(docs/plans/alarms-over-gateway.md). Mirrors PR E.2's .NET surface
on the Python async SDK. Depends on PR E.1 (regen, merged).

- GatewayClient.acknowledge_alarm — async unary call routed through
  the existing _unary helper. ensure_protocol_success raises typed
  gateway errors for non-OK protocol statuses; map_rpc_error wraps
  RpcError → MxGatewayAuthenticationError /
  MxGatewayAuthorizationError on Unauthenticated /
  PermissionDenied responses.
- GatewayClient.query_active_alarms — async iterator over
  ActiveAlarmSnapshot. Mirrors stream_events_raw's cancel-on-close
  pattern via a dedicated _canceling_active_alarms_iterator (typed
  for ActiveAlarmSnapshot).

Tests:
- 6 new tests in test_alarms.py — request shape, Unauthenticated +
  PermissionDenied mapping, snapshot streaming, filter prefix
  passthrough, cancel-on-aclose semantics.
- Full Python test suite: 39 passed (was 33; 6 new).

CLI verb (alarms subscribe / acknowledge / query-active) deferred —
the SDK surface is what lmxopcua consumes; CLI follow-up shares the
JSON output shape with E.2's .NET CLI for cross-language tooling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:50:31 -04:00
Joseph Doherty 0765eb4de3 clients/dotnet: SDK methods for AcknowledgeAlarm + QueryActiveAlarms (PR E.2)
Seventh PR of the alarms-over-gateway epic
(docs/plans/alarms-over-gateway.md). Depends on PR A.1 (proto, merged)
and E.1 (regen, merged).

Hand-written .NET SDK methods on top of the regenerated proto types:

- MxGatewayClient.AcknowledgeAlarmAsync — routes through the existing
  safe-unary retry pipeline (Acks are idempotent at MxAccess), maps
  Unauthenticated/PermissionDenied RpcExceptions to typed
  MxGatewayAuthenticationException / MxGatewayAuthorizationException
  via GrpcMxGatewayClientTransport.MapRpcException.
- MxGatewayClient.QueryActiveAlarmsAsync — server-streaming
  IAsyncEnumerable<ActiveAlarmSnapshot> mirroring the StreamEvents
  pattern.
- IMxGatewayClientTransport extended; GrpcMxGatewayClientTransport
  implements both methods using the regenerated grpc client.
- FakeGatewayTransport extended with capture lists, exception queue,
  and reply / snapshot enqueue helpers.

CLI version-string assertions updated for the GatewayProtocolVersion
2 → 3 bump from A.1.

The CLI alarms verb (subscribe / acknowledge / query-active) is
deferred to a follow-up — keeping this PR focused on the SDK surface
that lmxopcua's GalaxyDriver consumes in PR B.2. The other-language
SDKs (E.3-E.6) layer the same shape on the regen.

Tests:
- 6 new MxGatewayClientAlarmsTests — request shape, cancellation
  honor (linked-token via retry pipeline), Unauthenticated mapping,
  streaming snapshot enumeration, filter prefix passthrough,
  cancellation during enumeration.
- Full client test suite: 57 passed (was 51; 6 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:37:12 -04:00
Joseph Doherty 65d83b1400 clients: regenerate Python + Go protos for alarm RPCs (PR E.1)
Pure mechanical regen following PR A.1 (alarm-transition event family
+ AcknowledgeAlarm / QueryActiveAlarms public RPCs). Ran:

- clients/python/generate-proto.ps1 → mxaccess_gateway_pb2.py +
  mxaccess_gateway_pb2_grpc.py.
- clients/go/generate-proto.ps1 → mxaccess_gateway.pb.go +
  mxaccess_gateway_grpc.pb.go + galaxy_repository.pb.go (whitespace
  diff from upstream protoc minor version).

The .NET binding regenerates on csproj rebuild via Grpc.Tools — its
artifact (Generated/MxaccessGateway*.cs) was already updated as part
of A.1's commit. Java + Rust regen happens at build time via the
gradle plugin / build.rs respectively, with no committed output to
update.

Smoke-imported the regenerated Python descriptors:
  OnAlarmTransitionEvent.DESCRIPTOR.fields → alarm_full_reference,
    alarm_type_name, category, current_value, description, ...
  AcknowledgeAlarmRequest.DESCRIPTOR.fields → alarm_full_reference,
    client_correlation_id, comment, operator_user, session_id
  ActiveAlarmSnapshot.DESCRIPTOR.fields → alarm_full_reference,
    alarm_type_name, category, current_state, current_value, ...

PRs E.2 - E.6 layer hand-written SDK methods on top of the regenerated
types — those land per-language as separate PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:44:42 -04:00
Joseph Doherty 0f88a953d7 proto: add alarm-transition event family + ack/query RPCs (PR A.1)
First PR of the alarms-over-gateway epic
(docs/plans/alarms-over-gateway.md in lmxopcua). Pure contract-surface
change — no functional wiring yet. Worker-side subscription (A.2),
gateway-side dispatch + ack handler (A.3), and ConditionRefresh
(A.4) follow.

mxaccess_gateway.proto:

- Extend MxEventFamily with MX_EVENT_FAMILY_ON_ALARM_TRANSITION = 5.
- Extend MxEvent.body oneof with OnAlarmTransitionEvent on_alarm_transition = 24.
- Add OnAlarmTransitionEvent message carrying the full MxAccess alarm
  payload (full reference, source object, alarm-type-name, transition
  kind, raw severity, original raise timestamp, transition timestamp,
  operator user/comment, category, description, current/limit value).
  Mapping to OPC UA 0-1000 severity ladder happens server-side in
  lmxopcua's MxAccessSeverityMapper (B.1) — gateway preserves the
  native MxAccess scale.
- Add AlarmTransitionKind enum (Raise / Acknowledge / Clear / Retrigger).
- Add ActiveAlarmSnapshot + AlarmConditionState for the
  ConditionRefresh stream.
- Add public RPCs AcknowledgeAlarm (unary) and QueryActiveAlarms
  (server-streaming) on MxAccessGateway service.
- Add AcknowledgeAlarmRequest/Reply + QueryActiveAlarmsRequest.

GatewayContractInfo.GatewayProtocolVersion bumps 2 -> 3. Fixture
manifests (proto-inputs, behavior, parity, golden OpenSessionReply)
and protoset descriptor regenerated.

Tests: round-trip serialization for the new messages with
all-fields-populated and empty-optional-fields cases; oneof
last-write-wins guard between OnDataChange and OnAlarmTransition;
descriptor service-method enumeration includes the two new RPCs.
All 273 existing tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:34:35 -04:00
Joseph Doherty ddad573b75 Merge origin/main with local pending work and update AGENTS.md references
- Resolve 14 conflicts from popping local stash on top of origin's
  eed1e88 + 8d3352f doc-comment additions (11 mechanical, plus
  version.rs, DashboardAuthenticatorTests.cs, DashboardGalaxyProjector.cs)
- Fix 4 test files that used AGENTS.md as the repo-root sentinel
  (now use CLAUDE.md, since AGENTS.md was removed in 4731ab5)
- Redirect 10 doc citations from AGENTS.md to the matching gateway.md
  sections (Value Model, Status Model, Security, STA Worker Thread
  Model, gRPC Layer rule, cancellation rule)

Verified: solution build clean, x86 worker build clean, 266/266
gateway tests passing, 121/121 worker tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:13:33 -04:00
Joseph Doherty 8d3352f2c6 Add idiomatic documentation to Go, Java, Python, and Rust clients 2026-04-30 12:04:46 -04:00
Joseph Doherty eed1e88a37 Add XML documentation across gateway, worker, and .NET client 2026-04-30 11:49:58 -04:00
Joseph Doherty 51a9dadf62 Align docs with StyleGuide and add CLAUDE.md
- Rename 16 kebab-case docs to PascalCase per StyleGuide
- Move per-language client design docs from docs/ to clients/<lang>/
  alongside their READMEs
- Add ## Related Documentation sections to 15 docs that lacked one
- Fix sentence-case violations in H3 headings (StyleGuide rule)
- Update cross-references in gateway.md, client READMEs, scripts,
  and generate-proto.ps1 helpers to follow the new paths
- Add CLAUDE.md with build/test commands, the source-update
  verification matrix, the parity-first contract, and pointers
  to MXAccess and Galaxy Repository analysis sources

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:19:22 -04:00
Joseph Doherty 133c83029b Add Galaxy repository API and clients 2026-04-29 07:27:00 -04:00
Joseph Doherty b0041c5d18 Fix reliability findings 2026-04-28 06:27:01 -04:00
Joseph Doherty 907aa49aea Improve gateway reliability and client e2e coverage 2026-04-28 06:11:18 -04:00
Joseph Doherty 4fc355b357 Improve gateway reliability and dashboard docs 2026-04-28 00:13:22 -04:00
Joseph Doherty bd4a09a35e Add Polly resilience policies 2026-04-27 15:37:56 -04:00
Joseph Doherty 3d11ac3316 Add bulk MXAccess subscription commands 2026-04-26 22:29:27 -04:00
Joseph Doherty 4ea2c4fd86 Issue #50: clarify packaging API key placeholders 2026-04-26 21:26:28 -04:00
Joseph Doherty 41a2d70f8f Merge remote-tracking branch 'origin/main' into agent-2/issue-50-client-packaging-documentation 2026-04-26 21:23:06 -04:00
Joseph Doherty 79f73e04fd Issue #49: add cross-language smoke matrix 2026-04-26 21:21:49 -04:00
Joseph Doherty f2118f7028 Issue #50: document client packaging 2026-04-26 21:20:43 -04:00