Compare commits

...

47 Commits

Author SHA1 Message Date
Joseph Doherty 7b12eebbd1 docs: add MaxEventSubscribersPerSession to config shape example; assert its default
Add MaxEventSubscribersPerSession (value 8) to the Sessions block of the
Configuration Shape JSON example in GatewayConfiguration.md, matching the
appsettings.json default the options table already documents. Assert both
MaxEventSubscribersPerSession (8) and MaxPendingCommandsPerSession (128)
defaults in GatewayOptionsTests.OptionsBinding_UsesDesignDefaults.
2026-06-15 15:16:44 -04:00
Joseph Doherty bd190ab012 feat(config): allow multiple event subscribers + add MaxEventSubscribersPerSession cap
Remove the hard-rejection of AllowMultipleEventSubscribers=true in GatewayOptionsValidator
(fan-out is now implemented via SessionEventDistributor). Add MaxEventSubscribersPerSession
(default 8, must be >= 1) to SessionOptions, validate it, expose it in
EffectiveSessionConfiguration / GatewayConfigurationProvider, document it in
GatewayConfiguration.md and appsettings.json. Tests cover the no-error path for
AllowMultipleEventSubscribers=true, rejection of 0 and -1, a positive value passing, and the default passing.
2026-06-15 15:13:21 -04:00
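The validation rule this commit describes can be sketched in a few lines. This is a Python illustration, not the gateway's C# GatewayOptionsValidator. The dict-shaped `sessions` input and the function name `validate_sessions` are hypothetical. The only facts taken from the commit are the default of 8, the ">= 1" bound, and that AllowMultipleEventSubscribers=true is no longer an error.

```python
# Hypothetical sketch of the documented rule; the real check is in the C# GatewayOptionsValidator.
DEFAULT_MAX_EVENT_SUBSCRIBERS = 8  # documented default


def validate_sessions(sessions):
    """Return validation errors for a Sessions options dict (sketch)."""
    errors = []
    # AllowMultipleEventSubscribers=true is no longer rejected now that fan-out
    # exists, so this sketch does not inspect that flag at all.
    cap = sessions.get("MaxEventSubscribersPerSession", DEFAULT_MAX_EVENT_SUBSCRIBERS)
    if cap < 1:
        errors.append("MaxEventSubscribersPerSession must be >= 1.")
    return errors


print(validate_sessions({"AllowMultipleEventSubscribers": True}))  # []
```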
Joseph Doherty 2ead9bc200 fix(dashboard): close StartDashboardMirror/DisposeAsync race; internal-overflow test + metric label
(1) GatewaySession.StartDashboardMirror: publish _dashboardMirrorLease and _dashboardMirrorTask
    atomically under one _syncRoot section; if the session is already Closing/Closed/Faulted,
    dispose the just-created lease and return without starting the mirror task so nothing is orphaned.
(2) WaitUntilAsync test helper: catch OperationCanceledException and call Assert.Fail with the
    timeout duration and predicate source text instead of letting the exception propagate raw.
(3) New SessionEventDistributorTests.InternalSubscriberOverflow_HandlerSeesIsOnlySubscriberFalse:
    verifies CountExternalSubscribers excludes the internal subscriber, so isOnlySubscriber==false
    even when the internal subscriber is the only registered subscriber.
(4) SubscriberOverflowHandler delegate gains isInternal parameter; overflow metric label is
    "dashboard-mirror" for internal subscribers and "grpc-event-stream" for external ones.
(5) DashboardEventBroadcaster.Publish: wrap SendAsync Task acquisition in try/catch so a
    synchronous throw cannot escape the never-throw Publish interface contract.
2026-06-15 15:02:36 -04:00
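Fix (1) publishes the lease and the mirror task in one locked section and backs out if the session is already closing. A rough Python analogue of that pattern follows. `Lease`, `Session`, `close()` and the placeholder object standing in for the C# Task are all hypothetical. Only the state names and the "dispose the just-created lease and return" behaviour come from the commit.

```python
import threading

TERMINAL_STATES = {"Closing", "Closed", "Faulted"}


class Lease:
    """Stand-in for the distributor subscriber lease."""

    def __init__(self):
        self.disposed = False

    def dispose(self):
        self.disposed = True


class Session:
    def __init__(self):
        self._sync_root = threading.Lock()
        self.state = "Open"
        self.mirror_lease = None
        self.mirror_task = None  # placeholder for the mirror Task

    def start_dashboard_mirror(self, lease):
        with self._sync_root:
            if self.state in TERMINAL_STATES:
                # Already shutting down: dispose the new lease so nothing is orphaned.
                lease.dispose()
                return False
            # Publish lease and task in one critical section, so close() sees both or neither.
            self.mirror_lease = lease
            self.mirror_task = object()
            return True

    def close(self):
        with self._sync_root:
            self.state = "Closed"
            lease, self.mirror_lease = self.mirror_lease, None
            self.mirror_task = None
        if lease is not None:
            lease.dispose()
```

Because the state check and the publish share one lock, `close()` can never run between them. Without that, the mirror could start after close had already decided there was nothing to dispose.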
Joseph Doherty 1ea08c3b10 feat(dashboard): mirror events via SessionEventDistributor subscriber (fixes dark feed without gRPC client) 2026-06-15 14:42:32 -04:00
Joseph Doherty 4f43733b96 test(sessions): document overflow race safety + close backpressure coverage gaps
- Issue 1: document the isOnlySubscriber snapshot race-safety assumption in
  OnSubscriberOverflow; flags the Task 7/8 revisit point explicitly.
- Issue 2: pin StreamDisconnects==1 in the FailFast overflow test so a
  regression dropping the StreamDisconnected("Detached") finally call is caught.
- Issue 3: replace plain int/bool? reads in SlowSubscriberOverflow test with
  Volatile.Read/Write + Interlocked.Increment stores to close the C# memory
  model data race on overflowCalls and observedIsOnlySubscriber.
- Issue 4: add SlowSubscriberOverflow_WithMultipleSubscribers_... distributor
  test pinning that isOnlySubscriber==false disables the session-fault path;
  includes TODO(Task 8) note for the GatewaySession-level assertion.
- Issue 5: reword SubscriberOverflowHandler XML doc to make explicit that the
  handler must NOT complete the subscriber's channel; the distributor owns that.
2026-06-15 13:46:37 -04:00
Joseph Doherty 039111ca05 feat(sessions): per-subscriber backpressure isolation in SessionEventDistributor 2026-06-15 13:39:25 -04:00
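Per-subscriber backpressure isolation means one slow subscriber overflowing its own bounded queue does not stall the others. A minimal Python sketch follows, assuming a fixed per-subscriber capacity and a synchronous handler. The class and field names are hypothetical; the real C# distributor is channel-based.

The sketch borrows three rules from the later commits above:
- The handler receives `is_only_subscriber`.
- Internal subscribers are excluded from that count.
- The distributor, not the handler, completes the overflowing subscriber.

```python
from collections import deque


class Subscriber:
    def __init__(self, name, capacity, internal=False):
        self.name = name
        self.capacity = capacity
        self.internal = internal  # e.g. the dashboard mirror
        self.queue = deque()
        self.completed = False


class Distributor:
    def __init__(self, on_overflow):
        self.subscribers = []
        self.on_overflow = on_overflow  # handler(subscriber, is_only_subscriber)

    def add(self, subscriber):
        self.subscribers.append(subscriber)

    def publish(self, event):
        for sub in list(self.subscribers):  # copy: we may remove while iterating
            if len(sub.queue) < sub.capacity:
                sub.queue.append(event)
                continue
            # Only external subscribers count toward "is the only subscriber".
            externals = sum(1 for s in self.subscribers if not s.internal)
            is_only = not sub.internal and externals == 1
            self.on_overflow(sub, is_only)
            # The distributor owns completion; the handler must not do it.
            sub.completed = True
            self.subscribers.remove(sub)
```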
Joseph Doherty 61627fc5b0 fix(sessions): make EventSubscriberLease dispose atomic; dedupe lease dispose
Issue 1: replace plain bool _disposed in EventSubscriberLease with an
Interlocked.Exchange int (_leaseDisposed) matching the SubscriberLease
pattern in SessionEventDistributor. Concurrent stream-completion +
client-cancellation racing Dispose() now decrements _activeEventSubscriberCount
exactly once, never to -1.

Issue 5: remove the `using` declaration on the subscriber lease in
EventStreamService.StreamEventsAsync; the finally block already disposes it
alongside the reader, so the using was a redundant second dispose on the
same code path.

Issue 2: add an inline comment at the StartAsync().GetAwaiter().GetResult()
call documenting the sync-over-async invariant (StartAsync only schedules via
Task.Run and is synchronous; do not make it truly async without changing
this call site).

Issue 10: remove the redundant .WithCancellation(cancellationToken) chained
on ReadEventsAsync(cancellationToken) in MapWorkerEventsAsync; the
[EnumeratorCancellation] token already flows through the direct argument.

Issue 9: add EventSubscriberLease_ConcurrentDispose_DecrementsCountExactlyOnce
to GatewaySessionTests — 16 concurrent Dispose() calls on the same lease for
200 iterations; asserts count is exactly 0 after each race and a subsequent
single-subscriber AttachEventSubscriber succeeds.
2026-06-15 13:29:27 -04:00
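Issue 1 relies on an atomic exchange: whichever caller flips the flag from 0 to 1 performs the decrement, and every other caller sees 1 and does nothing. Python has no Interlocked, so this sketch emulates the exchange with a lock. The class names are hypothetical. `race_dispose` mirrors the shape of the 16-way concurrent-dispose test described in Issue 9.

```python
import threading


class EventSubscriberRegistry:
    """Hypothetical owner of the active event-subscriber count."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active_event_subscribers = 0

    def attach(self):
        with self._lock:
            self.active_event_subscribers += 1
        return EventSubscriberLease(self)

    def release(self):
        with self._lock:
            self.active_event_subscribers -= 1


class EventSubscriberLease:
    def __init__(self, owner):
        self._owner = owner
        self._flag_lock = threading.Lock()
        self._lease_disposed = 0

    def _exchange_disposed(self, value):
        # Stand-in for Interlocked.Exchange: store value, return the previous one, atomically.
        with self._flag_lock:
            previous, self._lease_disposed = self._lease_disposed, value
            return previous

    def dispose(self):
        # Only the caller that observes the 0 -> 1 transition decrements the count.
        if self._exchange_disposed(1) == 0:
            self._owner.release()


def race_dispose(lease, workers=16):
    """Call lease.dispose() from several threads released at the same moment."""
    barrier = threading.Barrier(workers)

    def run():
        barrier.wait()
        lease.dispose()

    threads = [threading.Thread(target=run) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```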
Joseph Doherty 7f1018bac1 feat(sessions): route event streaming through SessionEventDistributor 2026-06-15 13:18:28 -04:00
Joseph Doherty c2c518862f fix(sessions): replay-buffer gap edge cases, effective-config exposure, capacity-0 tests
#2: Replace `afterSequence+1 < oldestRetained` with the overflow-safe `oldestRetained > 0 && afterSequence < oldestRetained - 1`, so a ulong wrap at MaxValue can no longer falsely report gap=true.
#3: Add ReplayBufferCapacity and ReplayRetentionSeconds to EffectiveEventConfiguration and populate from EventOptions in GatewayConfigurationProvider.
#4: Add four new SessionEventDistributorTests covering capacity=0 gap/no-gap paths and the ulong.MaxValue boundary case.
#5: Update class-level <remarks> to describe the Task 3 replay ring buffer (capacity + age eviction, TryGetReplayFrom) rather than its absence.
#6: Add O(n)-is-acceptable comment at TryGetReplayFrom linear scan.
#8: Narrow no-replay 4-arg ctor to internal; InternalsVisibleTo already covers the test project.
2026-06-15 12:48:11 -04:00
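The #2 fix is easiest to see with numbers. A resume request after `afterSequence` has a gap when the oldest retained sequence is more than one past it, meaning at least one event in between was evicted. Python integers never overflow, so the sketch below masks to 64 bits to reproduce C# `ulong` wraparound. Both function names are illustrative.

```python
ULONG_MAX = 2**64 - 1


def naive_gap(after_sequence, oldest_retained):
    # Original check: afterSequence + 1 < oldestRetained, with ulong wraparound emulated.
    return ((after_sequence + 1) & ULONG_MAX) < oldest_retained


def safe_gap(after_sequence, oldest_retained):
    # Fixed check: never adds to afterSequence, so nothing can wrap at ulong.MaxValue.
    return oldest_retained > 0 and after_sequence < oldest_retained - 1


print(naive_gap(ULONG_MAX, 5))  # True: MaxValue + 1 wrapped to 0, a false gap
print(safe_gap(ULONG_MAX, 5))   # False
```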
Joseph Doherty e962737d2c feat(sessions): add bounded replay ring buffer to SessionEventDistributor 2026-06-15 12:42:15 -04:00
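The replay ring buffer evicts by both capacity and age, as the #5 remarks describe. A Python sketch of that idea follows. It reuses the overflow-safe gap check from the fix above, and its linear scan mirrors the "O(n) is acceptable" note in #6.

Two parts of the sketch are assumptions, not taken from the gateway:
- `ReplayBuffer`, `try_get_replay_from` and the `last_sequence` field are hypothetical.
- The rule for an empty buffer (a gap only if something newer was ever published) is invented for the sketch.

```python
from collections import deque


class ReplayBuffer:
    def __init__(self, capacity, retention_seconds):
        self.capacity = capacity
        self.retention_seconds = retention_seconds
        self._items = deque()  # (sequence, timestamp, event), oldest first
        self.last_sequence = 0  # highest sequence ever appended (0 = none yet)

    def append(self, sequence, timestamp, event):
        self.last_sequence = sequence
        if self.capacity > 0:
            self._items.append((sequence, timestamp, event))
        self._evict(timestamp)

    def _evict(self, now):
        # Capacity eviction first, then drop anything older than the retention window.
        while len(self._items) > self.capacity:
            self._items.popleft()
        while self._items and now - self._items[0][1] > self.retention_seconds:
            self._items.popleft()

    def try_get_replay_from(self, after_sequence, now):
        """Return (gap, events): events newer than after_sequence, and whether any were lost."""
        self._evict(now)
        if not self._items:
            # Assumption: with nothing retained, it is a gap only if something newer was published.
            return after_sequence < self.last_sequence, []
        oldest = self._items[0][0]
        gap = oldest > 0 and after_sequence < oldest - 1
        # Linear scan; acceptable for a small, bounded buffer.
        return gap, [e for (s, _, e) in self._items if s > after_sequence]
```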
Joseph Doherty 7773bdebbd fix(sessions): close SessionEventDistributor dispose/register races + add overflow logging 2026-06-15 12:37:39 -04:00
Joseph Doherty c79b292968 feat(sessions): add SessionEventDistributor (pump + per-subscriber fan-out skeleton) 2026-06-15 12:32:13 -04:00
Joseph Doherty a43b2ee6af test(sessions): cover OwnerKeyId service-layer forwarding; doc 11-param ctor
Add LastOwnerKeyId capture to FakeSessionManager and assert it equals
"operator01" in OpenSession_WithValidRequest_ReturnsSessionDetails, closing
the gap where OwnerKeyId threading through the service layer had no test
coverage. Add a <remarks> to the 11-param GatewaySession convenience ctor
documenting that OwnerKeyId is null there and authenticated call sites must
use the 12-param overload.
2026-06-15 12:29:16 -04:00
Joseph Doherty f5479f3ca3 feat(sessions): record OwnerKeyId on session creation
Add a nullable string? OwnerKeyId property to GatewaySession that captures
the API key identifier (KeyId) of the authenticated caller that opened the
session. Wire it through ISessionManager.OpenSessionAsync → SessionManager
→ GatewaySession constructor. The gRPC service passes
identityAccessor.Current?.KeyId; internal callers (GatewayAlarmMonitor, DashboardLiveDataService)
pass null. Covers the positive and null cases with two new TDD-first tests.
2026-06-15 12:24:29 -04:00
Joseph Doherty 00c849e63b docs: session-resilience implementation plan (28 tasks, 5 phases) 2026-06-15 12:15:34 -04:00
Joseph Doherty 3fc6ccad30 docs: session-resilience epic design (fan-out, reconnect, ACL, reattach) 2026-06-15 12:11:46 -04:00
Joseph Doherty 0e4843612b feat(python): add --parent drill-down to galaxy-browse for 5/5 CLI parity
Add --parent-gobject-id (integer) to the galaxy-browse CLI command so the
Python client matches the Go (-parent) and Rust (--parent-gobject-id) CLIs.
When set, drives BrowseChildren paging via browse_children_raw (page size 500,
repeated-token guard) and renders the same JSON node shape (flattened object
fields + hasChildrenHint + empty children array) and indented-text tree as the
root-walk path. --depth is ignored on the parent path with a one-line stderr
warning, matching the Go/Rust behaviour. Tests added in TDD order.
2026-06-15 11:45:44 -04:00
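The --parent path pages through BrowseChildren until the server stops returning a next-page token. The "repeated-token guard" stops the loop if the server ever hands back a token it already issued. A minimal sketch follows. `fetch_page` is a hypothetical stand-in for the client's raw BrowseChildren call; its signature is assumed, not taken from `browse_children_raw`. The page size of 500 comes from the commit.

```python
PAGE_SIZE = 500  # page size named in the commit


def browse_all_children(fetch_page, parent_gobject_id, page_size=PAGE_SIZE):
    """Collect every child of one parent by following next-page tokens.

    fetch_page(parent_gobject_id, page_size, page_token) -> (children, next_page_token)
    is a hypothetical stand-in for the raw BrowseChildren call.
    """
    children = []
    token = ""
    seen_tokens = set()
    while True:
        page, next_token = fetch_page(parent_gobject_id, page_size, token)
        children.extend(page)
        if not next_token:
            return children
        if next_token in seen_tokens:
            # A server that repeats a token would otherwise make this loop forever.
            raise RuntimeError(f"BrowseChildren returned a repeated page token: {next_token!r}")
        seen_tokens.add(next_token)
        token = next_token
```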
Joseph Doherty a56ce0ddbd docs: refresh stillpending.md for completed work; record residuals (§7/E2)
Mark §1.1 (11 worker commands), §1.2 (audit CorrelationId), and §4 client
CLI/helper parity as Resolved with commit refs; correct §4.4 (dotnet version
already worked). Record open residuals: §1.3 live failover counter, §3.2
multi-sample buffered conversion, §1.4 vendor-stub ack, DrainEvents snapshot
semantics.
2026-06-15 11:35:48 -04:00
Joseph Doherty f7ada90359 test(integration): harden B8 live assertions (ArchestrAUserToId user_id, bootstrap arrival, split guard)
Fix 1 (Important): assert ArchestrAUserToId Ok-path payload carries a non-zero user_id, mirroring the AuthenticateUser pattern.
Fix 2 (Important): assert bootstrapBufferedEvents > 0 before the residual return so the "empty NoData bootstrap arrives" claim is verified, not just assumed.
Fix 3 (Minor): change SplitLiveItemForBuffered guard from lastDot <= 0 to lastDot < 0 so a leading-dot reference ".TestInt" (lastDot==0) is not mis-handled as undotted.
2026-06-15 11:30:15 -04:00
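Fix 3 is an off-by-one on the index of the last dot. A leading-dot reference has its dot at index 0, so the old `<= 0` guard treated it as undotted. In this Python sketch, `split_reference` and its return shape are hypothetical and not the test's actual SplitLiveItemForBuffered helper. `str.rfind` returns -1 when there is no dot, which matches the `< 0` guard.

```python
def split_reference(reference):
    """Split "Object.Attribute" at the last dot; return None when there is no dot.

    With the old `last_dot <= 0` guard, ".TestInt" (last_dot == 0) was
    wrongly treated as undotted.
    """
    last_dot = reference.rfind(".")
    if last_dot < 0:
        return None
    return reference[:last_dot], reference[last_dot + 1:]


print(split_reference(".TestInt"))  # ('', 'TestInt')
```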
Joseph Doherty efd99718d7 test(integration): live COM command smoke + buffered capture (B8) 2026-06-15 11:19:12 -04:00
Joseph Doherty b298ca74be fix(java): picocli ParameterException for browse --depth; warn on --parent 0
Replaces the raw IllegalArgumentException thrown by GalaxyBrowseCommand for
--depth < 0 with a CommandLine.ParameterException so picocli surfaces a clean
single-line error instead of an unhandled stack trace. Adds an upper bound of
50 (matching the Python client) so --depth > 50 is also rejected cleanly.

Emits a stderr warning when --parent 0 is supplied explicitly, matching
Go/Rust client behaviour, because gobject id 0 is the server's root-walk
sentinel and passing it via --parent is almost always a mistake.

Adds three new tests: negative depth, depth > 50, and the --parent 0 warning path.
2026-06-15 11:08:07 -04:00
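Across the Java, Python, Go and Rust commits, the galaxy-browse clients converge on three argument checks:
- `--depth` must lie in [0, 50].
- `--parent 0` draws a warning, because 0 is the root-walk sentinel.
- `--depth` is ignored, with a warning, when `--parent` is set.

A Python sketch of those checks follows. `check_browse_args` and the exact message wording are hypothetical.

```python
import sys

MAX_DEPTH = 50


def check_browse_args(depth, parent=None, stream=None):
    """Validate galaxy-browse style arguments; returns the warnings it printed."""
    if not 0 <= depth <= MAX_DEPTH:
        raise ValueError(f"--depth must be between 0 and {MAX_DEPTH}, got {depth}")
    warnings = []
    if parent == 0:
        warnings.append("warning: --parent 0 is the root-walk sentinel; omit --parent to browse from the root")
    if parent is not None and depth > 0:
        warnings.append("warning: --depth is ignored when --parent is set")
    for w in warnings:
        print(w, file=stream or sys.stderr)
    return warnings
```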
Joseph Doherty 0d5b488c11 feat(java): add ping + galaxy-browse CLI subcommands and galaxy command aliases
- D4: add 'ping' subcommand (MX_COMMAND_KIND_PING / PingCommand{message}),
  accepting --session-id and optional --message (default "ping"); prints the
  worker's echoed diagnostic message.
- D8-java: add 'galaxy-browse' subcommand over browse()/LazyBrowseNode.expand()
  and raw BrowseChildren paging for --parent. JSON node shape matches the
  cross-client surface (flattened object fields + hasChildrenHint + nested
  children array).
- D9-java: make galaxy-test-connection / galaxy-last-deploy the primary names,
  keeping galaxy-test / galaxy-deploy-time as deprecated picocli aliases.
- Tests for ping, galaxy-browse JSON hasChildrenHint key, and alias resolution.
- README updated for the new/renamed subcommands.
2026-06-15 10:58:04 -04:00
Joseph Doherty bb5139fec2 test(gateway): fake worker responds to control commands (A6)
Add RespondToControlCommandAsync to FakeWorkerHarness so scripted fake
workers can auto-reply to the five control command kinds (Ping,
GetSessionState, GetWorkerInfo, DrainEvents, ShutdownWorker) with canned
replies whose shapes match the real WorkerPipeSession helpers.

Add five unit tests in FakeWorkerHarnessTests covering each control
command kind through the WorkerClient→pipe roundtrip, and one gateway
E2E test (GatewayService_WithFakeWorker_ControlCommandsRoundtripThroughGateway)
that exercises Ping, GetWorkerInfo, and DrainEvents through the full
gRPC→SessionManager→WorkerClient→named-pipe path using a scripted
ControlCommandFakeWorkerProcessLauncher.
2026-06-15 10:56:56 -04:00
Joseph Doherty dde9934e60 test(worker): silence CS0649 on reflection-only FakeMxStatus fields 2026-06-15 10:42:59 -04:00
Joseph Doherty 29399325d5 feat(worker): implement 6 MXAccess COM commands in executor
Wire up the previously-unimplemented Suspend, Activate, AuthenticateUser,
ArchestrAUserToId, AddBufferedItem, and SetBufferedUpdateInterval command
kinds in MxAccessCommandExecutor. These are real COM calls and run on the
STA via the executor.

- IMxAccessServer gains the 6 methods; MxAccessComServer routes them to the
  right interface version (Suspend/Activate -> ILMXProxyServer4 out MxStatus,
  AuthenticateUser -> base ILMXProxyServer, ArchestrAUserToId ->
  ILMXProxyServer2, AddBufferedItem/SetBufferedUpdateInterval ->
  ILMXProxyServer5).
- Suspend/Activate surface the native MxStatus, converted to MxStatusProxy
  via the existing MxStatusProxyConverter.
- AuthenticateUser hands the credential straight to MXAccess and never logs
  it; native HResult failures propagate via the dispatcher.
- MxAccessSession gains matching pass-throughs; AddBufferedItem registers
  the item handle in the handle registry.
- Unit tests (fake IMxAccessServer / fake COM object) cover each arm plus a
  password-non-leak assertion; existing IMxAccessServer fakes updated.

No proto changes (all request/reply messages already exist).
2026-06-15 10:41:22 -04:00
Joseph Doherty f94c206489 test(worker): use Register (a real STA command) for STA-dispatch race tests
Ping is now intercepted as a worker control command and answered on the
message-loop thread, so the dispatch/heartbeat/shutdown-race tests must use a
genuine STA-dispatched command kind to keep exercising DispatchAsync.
2026-06-15 10:25:34 -04:00
Joseph Doherty 72e1aca716 test(worker): fix control-command test helpers (CreatePipeSession overload, drop ConfigureAwait) 2026-06-15 10:23:45 -04:00
Joseph Doherty bf72cd8961 feat(worker): implement Ping/GetSessionState/GetWorkerInfo/DrainEvents/ShutdownWorker control commands
Answer the five worker control/lifecycle commands at the WorkerPipeSession
message-loop layer instead of the STA-bound MxAccessCommandExecutor. These
replies are built from process-level state (worker pid, assembly version,
worker lifecycle, the runtime session's event queue) the executor cannot see,
and ShutdownWorker must emit its OK reply before the graceful shutdown joins
the STA thread; dispatching it onto the STA would deadlock.

- Ping: OK reply, echoes message into diagnostic_message.
- GetSessionState: maps WorkerState to proto SessionState.
- GetWorkerInfo: pid, worker version, MXAccess ProgID/CLSID.
- DrainEvents: drains the runtime event queue into DrainEventsReply.
- ShutdownWorker: OK reply, then graceful shutdown, then stops the loop.

Tests added in WorkerPipeSessionTests; FakeRuntimeSession gains a
batch-size drain suppressor so DrainEvents does not race the background
drain loop.
2026-06-15 10:20:51 -04:00
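The commit routes the five control kinds at the message-loop layer, while everything else still goes to the STA. ShutdownWorker must send its reply before shutting down. A simplified Python sketch of that routing follows. The `MessageLoop` class, the dict-shaped commands and replies, and the callables passed in are hypothetical. The real loop speaks protobuf `WorkerEnvelope` frames over a named pipe.

```python
import os

CONTROL_KINDS = {"Ping", "GetSessionState", "GetWorkerInfo", "DrainEvents", "ShutdownWorker"}


class MessageLoop:
    """Hypothetical sketch: answer control commands on the loop, send the rest to the STA."""

    def __init__(self, sta_dispatch, send_reply, event_queue):
        self.sta_dispatch = sta_dispatch  # callable(command) -> reply, runs on the STA
        self.send_reply = send_reply
        self.event_queue = event_queue
        self.running = True
        self.state = "Ready"

    def handle(self, command):
        kind = command["kind"]
        if kind not in CONTROL_KINDS:
            self.send_reply(self.sta_dispatch(command))
            return
        if kind == "Ping":
            self.send_reply({"status": "OK", "diagnostic_message": command.get("message", "")})
        elif kind == "GetSessionState":
            self.send_reply({"status": "OK", "state": self.state})
        elif kind == "GetWorkerInfo":
            self.send_reply({"status": "OK", "pid": os.getpid()})
        elif kind == "DrainEvents":
            drained = list(self.event_queue)
            self.event_queue.clear()
            self.send_reply({"status": "OK", "events": drained})
        else:  # ShutdownWorker
            # Reply first: the real graceful shutdown joins the STA thread, which is
            # why this command is never dispatched onto the STA.
            self.send_reply({"status": "OK"})
            self.shutdown()
            self.running = False

    def shutdown(self):
        self.state = "Stopped"
```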
Joseph Doherty 5a7f8ace77 fix(go): use hasChildrenHint JSON key for browse parity; warn on -parent 0
Rename the browse JSON key from hasChildren to hasChildrenHint to match the
Rust and Python CLIs and the library property name (HasChildrenHint). Update
the text-output label to match. Add a one-line stderr warning when -parent 0
is supplied, since 0 is the server root sentinel and omitting -parent is the
intended way to walk from the root.
2026-06-15 10:09:38 -04:00
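The cross-client JSON node shape from these commits is: object fields flattened into the node, a `hasChildrenHint` key (not `hasChildren`), and a `children` array. A small Python sketch that builds it follows. The object field names `gobjectId` and `tagName` are hypothetical examples, not the proto's actual fields.

```python
import json


def to_json_node(object_fields, has_children_hint, children=()):
    """Build one browse node in the cross-client JSON shape (sketch)."""
    node = dict(object_fields)                   # object fields are flattened into the node
    node["hasChildrenHint"] = has_children_hint  # renamed from "hasChildren"
    node["children"] = list(children)            # already-built child nodes, or empty
    return node


tank = to_json_node({"gobjectId": 42, "tagName": "Tank01"}, True)
area = to_json_node({"gobjectId": 7, "tagName": "Area1"}, True, [tank])
print(json.dumps(area, indent=2))
```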
Joseph Doherty c10faa2ee5 fix(dotnet): use hasChildrenHint JSON key for browse cross-client parity
Rename the anonymous-object member `hasChildren` → `hasChildrenHint` so the
serialized JSON key matches the Rust and Python CLIs and the library property
HasChildrenHint. Also update the text-output suffix to `hasChildrenHint=` for
consistency.
2026-06-15 10:09:36 -04:00
Joseph Doherty 7975b09325 fix(python): bound galaxy-browse --depth; assert no _text leak in JSON
Guard _galaxy_browse against unbounded recursion by rejecting --depth
values outside [0, 50] with a descriptive BadParameter. Add test coverage
for --depth 99 and --depth -1 rejection, and assert _text is never
present in the JSON output from galaxy-browse.
2026-06-15 10:09:30 -04:00
Joseph Doherty d7e2a8b3cf feat(dotnet): add galaxy-browse CLI (§4.6); chore: verify version subcommand (§4.4) 2026-06-15 10:07:24 -04:00
Joseph Doherty 39ec2a3275 feat(python): add galaxy-browse CLI subcommand (§4.6) 2026-06-15 10:00:52 -04:00
Joseph Doherty 8cb416ba30 feat(go): add galaxy-browse CLI subcommand (§4.6) 2026-06-15 10:00:36 -04:00
Joseph Doherty 55526d5e56 fix(gateway): preserve raw client correlation id in denial audit DetailsJson + add wiring test (§1.2) 2026-06-15 09:56:24 -04:00
Joseph Doherty a59fc998e3 fix(python): UTC-normalize galaxy-last-deploy output, add deploy-event collector, help text, test 2026-06-15 09:53:01 -04:00
Joseph Doherty 539e6ef2de fix(rust): warn when browse --depth ignored, extract page-size const, tidy clones
Warn on stderr when --parent-gobject-id and --depth>0 are both supplied
since depth is silently ignored in the single-level parent path. Also
updates the --depth arg doc to document this. Extracts BROWSE_PAGE_SIZE
const (500) with a cross-reference to galaxy.rs instead of a bare literal.
Removes three redundant .clone() calls in BrowseChildrenOptions construction
since the originals are not used after the struct is built.
2026-06-15 09:52:24 -04:00
Joseph Doherty 742ced7970 test(go): assert ping echo in JSON output; comment ping fallback
TestRunPingJSON now verifies the fake gateway's echoed text appears in
the serialised reply body, catching any future wiring regression that
maps PingRaw to the wrong proto field.  runPing gains a one-line comment
explaining why DiagnosticMessage carries the echo, why the kind-string
fallback exists, and why writeCommandOutput is not reused on the
plain-text path.
2026-06-15 09:52:13 -04:00
Joseph Doherty bd46ba1270 fix(test): drop removed logger arg from GalaxyRepositoryGrpcService test call sites; docs: STA phrasing
Remove the trailing NullLogger<GalaxyRepositoryGrpcService>.Instance argument
from all four CreateService/inline constructions in GalaxyRepositoryGrpcServiceTests
and GalaxyFilterInputSafetyTests, matching the now-4-param constructor after the
dead logger parameter was removed in 0032d2d. Also drop the now-unused
Microsoft.Extensions.Logging.Abstractions using from both files.

Rephrase the §5 STA blurb in docs/AlarmClientDiscovery.md: GatewayAlarmMonitor
routes polling *through* the worker's StaRuntime (which owns the STA pump) rather
than owning the pump itself.
2026-06-15 09:52:07 -04:00
Joseph Doherty 0032d2dc44 docs+chore: fix stale prose, project names, remove dead MapSqlException (§7)
- docs/plans/2026-06-14-deferred-followups.md: mark D1 as executed
  (commit 4af24b9; metric emitted at DashboardSnapshotService.cs:198);
  note D2 resolved as no-op; D3-D5 remain pending
- docs/AlarmClientDiscovery.md §5: rewrite STA "production fix needed"
  to past tense — alarms now route through GatewayAlarmMonitor/worker STA
- EventsHub.cs: replace stale "publisher side is a future follow-up"
  comment; DashboardEventBroadcaster is live and DI-registered
- CLAUDE.md: fix all project-name drift (src/MxGateway.* →
  src/ZB.MOM.WW.MxGateway.*; MxGateway.sln → ZB.MOM.WW.MxGateway.slnx;
  clients/dotnet/MxGateway.Client.sln → ZB.MOM.WW.MxGateway.Client.slnx)
- GalaxyRepositoryGrpcService.cs: remove dead MapSqlException method and
  its IDE0051 suppression pragma; drop now-unused ILogger ctor param and
  Microsoft.Data.SqlClient using; build confirmed 0 warnings/errors
2026-06-15 09:43:00 -04:00
Joseph Doherty 8415f35abd feat(gateway): thread ClientCorrelationId into constraint-denial audit (§1.2) 2026-06-15 09:42:40 -04:00
Joseph Doherty 639e36b1bc feat(rust): add browse CLI subcommand (§4.6) 2026-06-15 09:42:16 -04:00
Joseph Doherty 90529dce6e feat(go): add ping CLI subcommand (§4.3)
Add PingRaw to Session (session.go), runPing to the CLI dispatch
(main.go), and three tests covering plain-text echo, JSON output,
and missing-session-id validation (main_test.go). Default message
is "ping"; gateway echo is read from DiagnosticMessage, falling
back to the kind string if absent.
2026-06-15 09:41:40 -04:00
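The ping subcommand's core logic reduces to three rules:
- Require a session id.
- Default the message to "ping".
- Print the gateway's echo from the diagnostic message, falling back to the kind string when it is absent.

A Python sketch follows. `run_ping` and the dict-shaped reply are hypothetical, and `ping` stands in for the Go client's PingRaw.

```python
def run_ping(session_id, ping, message="ping"):
    """Sketch of the ping subcommand's core logic.

    `ping(session_id, message)` is a hypothetical stand-in for Session.PingRaw and
    returns a dict with "kind" and, optionally, "diagnostic_message" entries.
    """
    if not session_id:
        raise ValueError("--session-id is required")
    reply = ping(session_id, message)
    # The gateway carries the echo in the diagnostic message; fall back to the kind string.
    return reply.get("diagnostic_message") or reply.get("kind", "")
```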
Joseph Doherty a211faefed feat(python): add galaxy-* CLI commands (§4.2) 2026-06-15 09:40:55 -04:00
Joseph Doherty 849f1d2f6d feat(go): add single-shot Write2 session helper (§4.1)
Add Write2/Write2Raw to the Go client Session, mirroring the existing
Write/WriteRaw pair, so all five language clients now expose write2.
Includes three TDD tests covering payload propagation, raw-reply return,
and nil-value rejection.
2026-06-15 09:40:15 -04:00
Joseph Doherty 883557fc8a docs: implementation plan for stillpending.md completion
28 tasks across 5 workstreams (A worker control cmds, B worker COM cmds,
C audit CorrelationId, D client CLI parity, E docs). Zero proto changes;
worker net48/x86 + Java on windev, rest local.
2026-06-15 09:35:50 -04:00
Joseph Doherty 4a00b1bdc1 docs: design for completing stillpending.md actionable items
Covers the 11 worker command kinds (§1.1), audit CorrelationId threading
(§1.2), client CLI/helper parity (§4), and doc hygiene (§7). Key finding:
all 11 commands already have proto/validation/scope/routing, so this is a
worker-executor + COM-wrapper + client-CLI effort with zero contract changes.
2026-06-15 09:32:01 -04:00
85 changed files with 9303 additions and 424 deletions
+20 -20
@@ -8,10 +8,10 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 The architecture is a two-process design — read `gateway.md` before making structural changes:
-- **Gateway** (`src/MxGateway.Server`, .NET 10, x64): ASP.NET Core gRPC server. Owns the public API, sessions, auth, the Blazor dashboard, and the Galaxy Repository SQL browse RPCs. **Never instantiates MXAccess COM directly.**
-- **Worker** (`src/MxGateway.Worker`, .NET Framework 4.8, **x86**): one process per session. Owns one MXAccess COM instance on a dedicated STA, pumps Windows messages, and converts COM events to protobuf.
+- **Gateway** (`src/ZB.MOM.WW.MxGateway.Server`, .NET 10, x64): ASP.NET Core gRPC server. Owns the public API, sessions, auth, the Blazor dashboard, and the Galaxy Repository SQL browse RPCs. **Never instantiates MXAccess COM directly.**
+- **Worker** (`src/ZB.MOM.WW.MxGateway.Worker`, .NET Framework 4.8, **x86**): one process per session. Owns one MXAccess COM instance on a dedicated STA, pumps Windows messages, and converts COM events to protobuf.
 - **IPC**: gateway↔worker uses one bidirectional named pipe per worker (`mxaccess-gateway-{gatewayPid}-{sessionId}`) with length-prefixed `WorkerEnvelope` protobuf frames. Gateway hosts the pipe server and launches the worker. **gRPC is not used inside the worker** — .NET Framework 4.8 doesn't have a first-class gRPC stack.
-- **Contracts** (`src/MxGateway.Contracts`): multi-targets `net10.0;net48` and owns the `.proto` files (`mxaccess_gateway.proto`, `mxaccess_worker.proto`, `galaxy_repository.proto`). All other projects consume the generated types from here. Do not hand-edit anything under `Generated/`.
+- **Contracts** (`src/ZB.MOM.WW.MxGateway.Contracts`): multi-targets `net10.0;net48` and owns the `.proto` files (`mxaccess_gateway.proto`, `mxaccess_worker.proto`, `galaxy_repository.proto`). All other projects consume the generated types from here. Do not hand-edit anything under `Generated/`.
 The worker must do all MXAccess COM calls on its dedicated STA thread, and the STA loop must pump Windows messages (`MsgWaitForMultipleObjectsEx` + `PeekMessage`/`DispatchMessage`) so MXAccess events deliver. A plain blocking queue on an STA is not enough.
@@ -19,42 +19,42 @@ The worker must do all MXAccess COM calls on its dedicated STA thread, and the S
 ```powershell
 # Full solution build (gateway, worker, contracts, tests)
-dotnet build src/MxGateway.sln
+dotnet build src/ZB.MOM.WW.MxGateway.slnx
-# Worker must be built x86 — the gateway looks for MxGateway.Worker.exe under bin\x86
-dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86
+# Worker must be built x86 — the gateway looks for ZB.MOM.WW.MxGateway.Worker.exe under bin\x86
+dotnet build src/ZB.MOM.WW.MxGateway.Worker/ZB.MOM.WW.MxGateway.Worker.csproj -p:Platform=x86
 # Gateway tests (no MXAccess required — uses FakeWorkerHarness)
-dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj
-dotnet test src/MxGateway.Worker.Tests/MxGateway.Worker.Tests.csproj -p:Platform=x86
+dotnet test src/ZB.MOM.WW.MxGateway.Tests/ZB.MOM.WW.MxGateway.Tests.csproj
+dotnet test src/ZB.MOM.WW.MxGateway.Worker.Tests/ZB.MOM.WW.MxGateway.Worker.Tests.csproj -p:Platform=x86
-# Run gateway locally (defaults bound under MxGateway:* in src/MxGateway.Server/appsettings.json)
-dotnet run --project src/MxGateway.Server/MxGateway.Server.csproj
+# Run gateway locally (defaults bound under MxGateway:* in src/ZB.MOM.WW.MxGateway.Server/appsettings.json)
+dotnet run --project src/ZB.MOM.WW.MxGateway.Server/ZB.MOM.WW.MxGateway.Server.csproj
 # API-key admin CLI (same exe, "apikey" subcommand)
-dotnet run --project src/MxGateway.Server/MxGateway.Server.csproj -- apikey create --display-name "dev" --scopes session,invoke,event,metadata,admin
+dotnet run --project src/ZB.MOM.WW.MxGateway.Server/ZB.MOM.WW.MxGateway.Server.csproj -- apikey create --display-name "dev" --scopes session,invoke,event,metadata,admin
 ```
 Single test by name (xUnit `--filter`):
 ```powershell
-dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~GatewayEndToEndFakeWorkerSmokeTests
+dotnet test src/ZB.MOM.WW.MxGateway.Tests/ZB.MOM.WW.MxGateway.Tests.csproj --filter FullyQualifiedName~GatewayEndToEndFakeWorkerSmokeTests
 ```
 Live MXAccess integration tests are **opt-in** because they need installed MXAccess COM and live provider state:
 ```powershell
 $env:MXGATEWAY_RUN_LIVE_MXACCESS_TESTS = "1"
-dotnet test src/MxGateway.IntegrationTests/MxGateway.IntegrationTests.csproj --filter FullyQualifiedName~WorkerLiveMxAccessSmokeTests
+dotnet test src/ZB.MOM.WW.MxGateway.IntegrationTests/ZB.MOM.WW.MxGateway.IntegrationTests.csproj --filter FullyQualifiedName~WorkerLiveMxAccessSmokeTests
 ```
 Live LDAP tests use `MXGATEWAY_RUN_LIVE_LDAP_TESTS=1`. See `docs/GatewayTesting.md` for the full opt-in matrix and `LiveMxAccessFactAttribute` / `LiveLdapFactAttribute` for the gating logic.
 ## Clients
-Each language client is in `clients/<lang>/` with its own README. They all consume the shared `.proto` files in `src/MxGateway.Contracts/Protos`:
+Each language client is in `clients/<lang>/` with its own README. They all consume the shared `.proto` files in `src/ZB.MOM.WW.MxGateway.Contracts/Protos`:
-- `clients/dotnet`: `dotnet build clients/dotnet/MxGateway.Client.sln`
+- `clients/dotnet`: `dotnet build clients/dotnet/ZB.MOM.WW.MxGateway.Client.slnx`
 - `clients/python`: `python -m pip install -e ".[dev]"; python -m pytest`
 - `clients/rust`: `cargo test --workspace; cargo clippy --workspace --all-targets -- -D warnings`
 - `clients/java`: `gradle test` (Java 21)
@@ -77,7 +77,7 @@ powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1
 - **Gateway restart does not reattach orphan workers.** The first version terminates orphaned workers on startup; do not design code paths that assume reattachment.
 - **No Blazor UI component libraries.** Dashboard uses local Bootstrap CSS/JS only — do not introduce MudBlazor, Radzen, FluentUI, etc.
 - **Don't log secrets or full tag values by default.** API keys, passwords, `WriteSecured` payloads, and `AuthenticateUser` credentials must never reach logs. Value logging is opt-in and redacted.
-- **Generated code** under `src/MxGateway.Contracts/Generated/`, `clients/*/generated*/`, `clients/python/src/mxgateway/generated/`, etc., is build output. Don't hand-edit. To regenerate, build the contracts project (`dotnet build src/MxGateway.Contracts/MxGateway.Contracts.csproj`) or run the per-client generation step in that client's README.
+- **Generated code** under `src/ZB.MOM.WW.MxGateway.Contracts/Generated/`, `clients/*/generated*/`, `clients/python/src/mxgateway/generated/`, etc., is build output. Don't hand-edit. To regenerate, build the contracts project (`dotnet build src/ZB.MOM.WW.MxGateway.Contracts/ZB.MOM.WW.MxGateway.Contracts.csproj`) or run the per-client generation step in that client's README.
 - **Documentation style** (`StyleGuide.md`): PascalCase filenames, no marketing language, present tense, explain *why* not *what*.
 - **Update docs in the same change as the source.** When public APIs, contracts, configuration, build steps, security behavior, event shapes, value conversion, status mapping, or lifecycle rules change, the affected docs (`gateway.md`, `docs/`, client READMEs, design docs) must change in the same commit. Don't leave stale prose describing old behavior.
@@ -88,9 +88,9 @@ When source code changes, build and test the affected component before reporting
 | Changed area | Required verification |
 |---|---|
 | Contracts or `.proto` files | regenerate generated code, then build gateway, worker, and every generated client touched by the contract |
-| Gateway server, sessions, workers, gRPC, dashboard, metrics | `dotnet build src/MxGateway.Server` and run affected gateway / fake-worker tests |
-| Worker IPC, STA, MXAccess, conversion | `dotnet build src/MxGateway.Worker -p:Platform=x86` and run worker tests |
-| .NET client | `dotnet build clients/dotnet/MxGateway.Client.sln` and run its tests |
+| Gateway server, sessions, workers, gRPC, dashboard, metrics | `dotnet build src/ZB.MOM.WW.MxGateway.Server` and run affected gateway / fake-worker tests |
+| Worker IPC, STA, MXAccess, conversion | `dotnet build src/ZB.MOM.WW.MxGateway.Worker -p:Platform=x86` and run worker tests |
+| .NET client | `dotnet build clients/dotnet/ZB.MOM.WW.MxGateway.Client.slnx` and run its tests |
 | Go client | `gofmt`, `go build ./...`, `go test ./...` from `clients/go` |
 | Rust client | `cargo fmt`, `cargo check --workspace`, `cargo test --workspace`, `cargo clippy --all-targets -- -D warnings` from `clients/rust` |
 | Python client | `python -m pytest` from `clients/python` |
@@ -114,7 +114,7 @@ External analysis sources referenced by design docs:
 ## Authentication
-Gateway gRPC clients authenticate with an API key in metadata: `authorization: Bearer mxgw_<key-id>_<secret>`. Keys are stored hashed (with a peppered SHA) in a gateway-owned SQLite DB (default `C:\ProgramData\MxGateway\gateway-auth.db`). Scopes (`session`, `invoke`, `event`, `metadata`, `admin`) gate specific RPCs; missing → `Unauthenticated`, insufficient → `PermissionDenied`. The `apikey` subcommand on the server exe manages keys; see `src/MxGateway.Server/Security/Authentication/`.
+Gateway gRPC clients authenticate with an API key in metadata: `authorization: Bearer mxgw_<key-id>_<secret>`. Keys are stored hashed (with a peppered SHA) in a gateway-owned SQLite DB (default `C:\ProgramData\MxGateway\gateway-auth.db`). Scopes (`session`, `invoke`, `event`, `metadata`, `admin`) gate specific RPCs; missing → `Unauthenticated`, insufficient → `PermissionDenied`. The `apikey` subcommand on the server exe manages keys; see `src/ZB.MOM.WW.MxGateway.Server/Security/Authentication/`.
 Dashboard auth is LDAP-backed (separate from the gRPC API-key model). `/login` binds against `MxGateway:Ldap` and maps the user's LDAP groups to `Admin` or `Viewer` via `MxGateway:Dashboard:GroupToRole`, then issues an HTTP-only secure `__Host-MxGatewayDashboard` cookie. SignalR hubs at `/hubs/{snapshot,alarms,events}` accept either the cookie or a 30-minute bearer minted at `/hubs/token`. `Dashboard:AllowAnonymousLocalhost` bypasses auth on loopback when enabled.
+13
@@ -244,6 +244,19 @@ foreach (LazyBrowseNode root in roots)
and is safe under concurrent callers. To refresh after a Galaxy redeploy, call
`BrowseAsync` again from the root.
The CLI counterpart is `galaxy-browse`. Without `--parent` it walks the root
objects and eagerly expands `--depth` further levels into an indented tree; with
`--parent <gobject-id>` it fetches exactly one level of children for that object
(`--depth` is ignored there, with a warning). Filter flags map onto `BrowseChildrenOptions`:
`--category-ids` and `--template-contains` take comma-separated lists,
`--tag-name-glob` takes a single glob, `--alarm-bearing-only` and `--historized-only`
are boolean switches, and `--include-attributes`, when present, forces attribute
population on; when it is omitted, the server default applies.
```powershell
dotnet run --project clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli -- galaxy-browse --endpoint http://localhost:5000 --api-key-env MXGATEWAY_API_KEY --depth 1
dotnet run --project clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli -- galaxy-browse --endpoint http://localhost:5000 --api-key-env MXGATEWAY_API_KEY --parent 42 --json
```
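The .NET CLI implements the "only when present" rule for `--include-attributes` by mapping presence to `true` and absence to `null`. The Go CLI in this change gets the same tri-state from `flag.Visit`, which visits only the flags that were actually set on the command line. Below is a minimal Go sketch of that parsing. The `browseFlags` struct and `parseBrowseFlags` are hypothetical and not part of either client.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// browseFlags is a hypothetical subset of the galaxy-browse flags.
type browseFlags struct {
	Parent            int   // < 0 means "walk the root objects"
	Depth             int   // extra levels to expand under each root
	IncludeAttributes *bool // nil means "let the server default apply"
}

func parseBrowseFlags(args []string) (browseFlags, error) {
	fs := flag.NewFlagSet("galaxy-browse", flag.ContinueOnError)
	fs.SetOutput(io.Discard)
	parent := fs.Int("parent", -1, "parent gobject id; omit for the root walk")
	depth := fs.Int("depth", 0, "extra levels to expand; ignored with -parent")
	include := fs.Bool("include-attributes", false, "populate attributes")
	if err := fs.Parse(args); err != nil {
		return browseFlags{}, err
	}
	out := browseFlags{Parent: *parent, Depth: *depth}
	// Visit walks only flags set on the command line, so the pointer stays
	// nil unless -include-attributes (or --include-attributes) was passed.
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "include-attributes" {
			v := *include
			out.IncludeAttributes = &v
		}
	})
	return out, nil
}

func main() {
	a, _ := parseBrowseFlags([]string{"--depth", "1"})
	b, _ := parseBrowseFlags([]string{"--parent", "42", "--include-attributes"})
	fmt.Println(a.IncludeAttributes == nil, *b.IncludeAttributes, b.Parent)
}
```

Go's `flag` package accepts both `-name` and `--name`, so the same invocations work in either style. One difference from the .NET CLI: in Go, `--include-attributes=false` yields a non-nil pointer to `false`, which the presence-only .NET switch cannot express.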
### Watching deploy events
`WatchDeployEventsAsync` opens the `WatchDeployEvents` server-streaming RPC. The
@@ -105,4 +105,16 @@ public interface IMxGatewayCliClient : IAsyncDisposable
IAsyncEnumerable<DeployEvent> GalaxyWatchDeployEventsAsync(
WatchDeployEventsRequest request,
CancellationToken cancellationToken);
/// <summary>
/// Fetches one page of direct children of a Galaxy parent (or the root
/// objects when the parent selector is unset), the primitive that backs the
/// lazy-browse helper.
/// </summary>
/// <param name="request">The browse-children request.</param>
/// <param name="cancellationToken">Cancellation token for the operation.</param>
/// <returns>The browse-children reply.</returns>
Task<BrowseChildrenReply> GalaxyBrowseChildrenAsync(
BrowseChildrenRequest request,
CancellationToken cancellationToken);
}
@@ -100,6 +100,14 @@ internal sealed class MxGatewayCliClientAdapter : IMxGatewayCliClient
return _galaxyClient.Value.WatchDeployEventsRawAsync(request, cancellationToken);
}
/// <inheritdoc />
public Task<BrowseChildrenReply> GalaxyBrowseChildrenAsync(
BrowseChildrenRequest request,
CancellationToken cancellationToken)
{
return _galaxyClient.Value.BrowseChildrenRawAsync(request, cancellationToken);
}
/// <inheritdoc />
public async ValueTask DisposeAsync()
{
@@ -144,6 +144,8 @@ public static class MxGatewayClientCli
.ConfigureAwait(false),
"galaxy-discover" => await GalaxyDiscoverAsync(arguments, client, standardOutput, cancellation.Token)
.ConfigureAwait(false),
"galaxy-browse" => await GalaxyBrowseAsync(arguments, client, standardOutput, standardError, cancellation.Token)
.ConfigureAwait(false),
"galaxy-watch" => await GalaxyWatchAsync(arguments, client, standardOutput, cancellation.Token)
.ConfigureAwait(false),
_ => WriteUnknownCommand(command, standardError),
@@ -1607,6 +1609,270 @@ public static class MxGatewayClientCli
return aggregate;
}
/// <summary>
/// Per-request page size for the galaxy-browse single-level walks. Mirrors
/// the library's <c>BrowseChildrenPageSize</c> so the CLI and the
/// lazy-browse helper page identically.
/// </summary>
private const int BrowseChildrenCliPageSize = 500;
/// <summary>
/// Drives the lazy-browse Galaxy surface from the CLI. Without
/// <c>--parent</c> it walks the root objects and eagerly expands
/// <c>--depth</c> further levels (each level reuses the same
/// <see cref="BrowseChildrenOptions"/>, like the library helper). With
/// <c>--parent</c> it fetches exactly one level of children for that
/// gobject id via a parent-scoped BrowseChildren request; <c>--depth</c>
/// is not meaningful there and a warning is emitted if combined, mirroring
/// the Go/Rust CLIs.
/// </summary>
private static async Task<int> GalaxyBrowseAsync(
CliArguments arguments,
IMxGatewayCliClient client,
TextWriter output,
TextWriter standardError,
CancellationToken cancellationToken)
{
BrowseChildrenOptions options = ParseBrowseChildrenOptions(arguments);
bool json = arguments.HasFlag("json");
int parent = arguments.GetInt32("parent", -1);
int depth = arguments.GetInt32("depth", 0);
// A specific parent → one level of children via the parent-scoped RPC.
if (parent >= 0)
{
if (depth > 0)
{
standardError.WriteLine("warning: --depth is ignored when --parent is specified.");
}
IReadOnlyList<GalaxyObject> children = await BrowseOneLevelAsync(
client,
options,
parent,
cancellationToken)
.ConfigureAwait(false);
if (json)
{
output.WriteLine(JsonSerializer.Serialize(
new
{
command = "galaxy-browse",
parentId = parent,
children = children.Select(GalaxyObjectToJsonElement).ToArray(),
},
JsonOptions));
return 0;
}
output.WriteLine(children.Count.ToString(CultureInfo.InvariantCulture));
foreach (GalaxyObject child in children)
{
output.WriteLine(FormatGalaxyObject(child, level: 0, hasChildrenHint: null));
}
return 0;
}
// No parent → walk the root objects, eagerly expanding --depth levels.
IReadOnlyList<BrowseTreeNode> roots = await BrowseTreeAsync(
client,
options,
parentGobjectId: 0,
remainingDepth: depth,
cancellationToken)
.ConfigureAwait(false);
if (json)
{
output.WriteLine(JsonSerializer.Serialize(
new
{
command = "galaxy-browse",
nodes = roots.Select(BrowseTreeNodeToJson).ToArray(),
},
JsonOptions));
return 0;
}
output.WriteLine(roots.Count.ToString(CultureInfo.InvariantCulture));
foreach (BrowseTreeNode node in roots)
{
WriteBrowseTreeNode(output, node, level: 0);
}
return 0;
}
/// <summary>
/// One node in the eagerly-expanded galaxy-browse tree: the Galaxy object,
/// the server's has-children hint, and any children fetched up to the
/// requested depth.
/// </summary>
private sealed record BrowseTreeNode(
GalaxyObject Object,
bool HasChildrenHint,
IReadOnlyList<BrowseTreeNode> Children);
/// <summary>
/// Fetches the direct children of <paramref name="parentGobjectId"/>
/// (0 = root) and recursively expands <paramref name="remainingDepth"/>
/// further levels. Paging is followed to completion at each level.
/// </summary>
private static async Task<IReadOnlyList<BrowseTreeNode>> BrowseTreeAsync(
IMxGatewayCliClient client,
BrowseChildrenOptions options,
int parentGobjectId,
int remainingDepth,
CancellationToken cancellationToken)
{
List<BrowseTreeNode> nodes = [];
string pageToken = string.Empty;
HashSet<string> seenPageTokens = new(StringComparer.Ordinal);
do
{
BrowseChildrenRequest request = BuildBrowseChildrenRequest(options, parentGobjectId, pageToken);
BrowseChildrenReply reply = await client.GalaxyBrowseChildrenAsync(request, cancellationToken)
.ConfigureAwait(false);
for (int i = 0; i < reply.Children.Count; i++)
{
GalaxyObject child = reply.Children[i];
bool hint = i < reply.ChildHasChildren.Count && reply.ChildHasChildren[i];
IReadOnlyList<BrowseTreeNode> grandChildren = remainingDepth > 0
? await BrowseTreeAsync(client, options, child.GobjectId, remainingDepth - 1, cancellationToken)
.ConfigureAwait(false)
: [];
nodes.Add(new BrowseTreeNode(child, hint, grandChildren));
}
pageToken = reply.NextPageToken;
if (!string.IsNullOrWhiteSpace(pageToken) && !seenPageTokens.Add(pageToken))
{
throw new MxGatewayException(
$"Galaxy BrowseChildren returned a repeated page token '{pageToken}'.");
}
}
while (!string.IsNullOrWhiteSpace(pageToken));
return nodes;
}
/// <summary>Fetches exactly one level of children for a parent gobject id, paging to completion.</summary>
private static async Task<IReadOnlyList<GalaxyObject>> BrowseOneLevelAsync(
IMxGatewayCliClient client,
BrowseChildrenOptions options,
int parentGobjectId,
CancellationToken cancellationToken)
{
List<GalaxyObject> children = [];
string pageToken = string.Empty;
HashSet<string> seenPageTokens = new(StringComparer.Ordinal);
do
{
BrowseChildrenRequest request = BuildBrowseChildrenRequest(options, parentGobjectId, pageToken);
BrowseChildrenReply reply = await client.GalaxyBrowseChildrenAsync(request, cancellationToken)
.ConfigureAwait(false);
children.AddRange(reply.Children);
pageToken = reply.NextPageToken;
if (!string.IsNullOrWhiteSpace(pageToken) && !seenPageTokens.Add(pageToken))
{
throw new MxGatewayException(
$"Galaxy BrowseChildren returned a repeated page token '{pageToken}'.");
}
}
while (!string.IsNullOrWhiteSpace(pageToken));
return children;
}
private static BrowseChildrenOptions ParseBrowseChildrenOptions(CliArguments arguments)
{
return new BrowseChildrenOptions
{
CategoryIds = ParseOptionalInt32List(arguments.GetOptional("category-ids")),
TemplateChainContains = ParseOptionalStringList(arguments.GetOptional("template-contains")),
TagNameGlob = arguments.GetOptional("tag-name-glob"),
AlarmBearingOnly = arguments.HasFlag("alarm-bearing-only"),
HistorizedOnly = arguments.HasFlag("historized-only"),
// Tri-state: only override the server default when the flag is present.
IncludeAttributes = arguments.HasFlag("include-attributes") ? true : null,
};
}
private static BrowseChildrenRequest BuildBrowseChildrenRequest(
BrowseChildrenOptions options,
int parentGobjectId,
string pageToken)
{
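BrowseChildrenRequest request = new()
{
PageSize = BrowseChildrenCliPageSize,
PageToken = pageToken,
AlarmBearingOnly = options.AlarmBearingOnly,
HistorizedOnly = options.HistorizedOnly,
};
// Leave the parent oneof unset for the root walk (0): assigning the field,
// even to 0, selects the ParentGobjectId case instead of "root objects".
if (parentGobjectId > 0)
{
request.ParentGobjectId = parentGobjectId;
}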
request.CategoryIds.Add(options.CategoryIds);
request.TemplateChainContains.Add(options.TemplateChainContains);
if (!string.IsNullOrWhiteSpace(options.TagNameGlob))
{
request.TagNameGlob = options.TagNameGlob;
}
if (options.IncludeAttributes.HasValue)
{
request.IncludeAttributes = options.IncludeAttributes.Value;
}
return request;
}
private static void WriteBrowseTreeNode(TextWriter output, BrowseTreeNode node, int level)
{
output.WriteLine(FormatGalaxyObject(node.Object, level, node.HasChildrenHint));
foreach (BrowseTreeNode child in node.Children)
{
WriteBrowseTreeNode(output, child, level + 1);
}
}
private static string FormatGalaxyObject(GalaxyObject galaxyObject, int level, bool? hasChildrenHint)
{
string indent = new(' ', level * 2);
string suffix = hasChildrenHint is null
? $"(attrs={galaxyObject.Attributes.Count})"
: $"(attrs={galaxyObject.Attributes.Count}, hasChildrenHint={hasChildrenHint.Value})";
return $"{indent}{galaxyObject.GobjectId}\t{galaxyObject.TagName}\t{galaxyObject.BrowseName}\t{suffix}";
}
private static object BrowseTreeNodeToJson(BrowseTreeNode node)
{
return new
{
@object = GalaxyObjectToJsonElement(node.Object),
hasChildrenHint = node.HasChildrenHint,
children = node.Children.Select(BrowseTreeNodeToJson).ToArray(),
};
}
private static JsonElement GalaxyObjectToJsonElement(GalaxyObject galaxyObject)
{
return JsonDocument.Parse(ProtobufJsonFormatter.Format(galaxyObject)).RootElement.Clone();
}
private static IReadOnlyList<int> ParseOptionalInt32List(string? value)
{
return string.IsNullOrWhiteSpace(value) ? [] : ParseInt32List(value);
}
private static IReadOnlyList<string> ParseOptionalStringList(string? value)
{
return string.IsNullOrWhiteSpace(value) ? [] : ParseStringList(value);
}
private static async Task<int> GalaxyWatchAsync(
CliArguments arguments,
IMxGatewayCliClient client,
@@ -1736,6 +2002,7 @@ public static class MxGatewayClientCli
or "galaxy-test-connection"
or "galaxy-last-deploy"
or "galaxy-discover"
or "galaxy-browse"
or "galaxy-watch";
}
@@ -1797,6 +2064,7 @@ public static class MxGatewayClientCli
writer.WriteLine("mxgw-dotnet galaxy-test-connection [--json]");
writer.WriteLine("mxgw-dotnet galaxy-last-deploy [--json]");
writer.WriteLine("mxgw-dotnet galaxy-discover [--json]");
writer.WriteLine("mxgw-dotnet galaxy-browse [--parent <gobject-id>] [--depth <n>] [--category-ids <n,n>] [--template-contains <s,s>] [--tag-name-glob <glob>] [--alarm-bearing-only] [--historized-only] [--include-attributes] [--json]");
writer.WriteLine("mxgw-dotnet galaxy-watch [--last-seen-deploy-time <iso8601>] [--max-events <n>] [--json]");
}
}
@@ -360,6 +360,146 @@ public sealed class MxGatewayClientCliTests
Assert.Equal(string.Empty, error.ToString());
}
/// <summary>
/// Verifies galaxy-browse walks root objects and eagerly expands one further
/// level when --depth 1 is passed, printing an indented tree.
/// </summary>
[Fact]
public async Task RunAsync_GalaxyBrowse_TextTreeExpandsToDepth()
{
using var output = new StringWriter();
using var error = new StringWriter();
FakeCliClient fakeClient = new();
// Root level (parent 0): one area with a child hint.
fakeClient.GalaxyBrowseChildrenReplies[0] = new Queue<BrowseChildrenReply>(
[
new BrowseChildrenReply
{
Children = { new GalaxyObject { GobjectId = 10, TagName = "Area_001", BrowseName = "Area" } },
ChildHasChildren = { true },
},
]);
// Children of gobject 10.
fakeClient.GalaxyBrowseChildrenReplies[10] = new Queue<BrowseChildrenReply>(
[
new BrowseChildrenReply
{
Children = { new GalaxyObject { GobjectId = 20, TagName = "Tank_001", BrowseName = "Tank" } },
},
]);
int exitCode = await MxGatewayClientCli.RunAsync(
[
"galaxy-browse",
"--endpoint",
"http://localhost:5000",
"--api-key",
"test-api-key",
"--depth",
"1",
],
output,
error,
_ => fakeClient);
Assert.Equal(0, exitCode);
string text = output.ToString();
Assert.Contains("Area_001", text);
Assert.Contains("Tank_001", text);
// Children are indented beneath their parent (two-space indent per level).
Assert.Matches(@"\n  \d+\tTank_001", text);
// Root fetched with the parent oneof unset; child fetch used parent 10.
Assert.Contains(
fakeClient.GalaxyBrowseChildrenRequests,
request => request.ParentCase == BrowseChildrenRequest.ParentOneofCase.ParentGobjectId
&& request.ParentGobjectId == 10);
Assert.Equal(string.Empty, error.ToString());
}
/// <summary>
/// Verifies galaxy-browse --json emits a nested JSON document and forwards
/// the filter flags onto the BrowseChildren request.
/// </summary>
[Fact]
public async Task RunAsync_GalaxyBrowse_JsonForwardsFilters()
{
using var output = new StringWriter();
using var error = new StringWriter();
FakeCliClient fakeClient = new();
fakeClient.GalaxyBrowseChildrenReplies[0] = new Queue<BrowseChildrenReply>(
[
new BrowseChildrenReply
{
Children = { new GalaxyObject { GobjectId = 10, TagName = "Area_001", BrowseName = "Area" } },
},
]);
int exitCode = await MxGatewayClientCli.RunAsync(
[
"galaxy-browse",
"--endpoint",
"http://localhost:5000",
"--api-key",
"test-api-key",
"--tag-name-glob",
"Area*",
"--alarm-bearing-only",
"--json",
],
output,
error,
_ => fakeClient);
Assert.Equal(0, exitCode);
using System.Text.Json.JsonDocument document = System.Text.Json.JsonDocument.Parse(output.ToString());
Assert.Equal("galaxy-browse", document.RootElement.GetProperty("command").GetString());
Assert.True(document.RootElement.GetProperty("nodes").GetArrayLength() >= 1);
BrowseChildrenRequest request = Assert.Single(fakeClient.GalaxyBrowseChildrenRequests);
Assert.Equal("Area*", request.TagNameGlob);
Assert.True(request.AlarmBearingOnly);
Assert.Equal(string.Empty, error.ToString());
}
/// <summary>
/// Verifies galaxy-browse --parent fetches exactly one level of children for
/// the supplied gobject id via a parent-scoped BrowseChildren request.
/// </summary>
[Fact]
public async Task RunAsync_GalaxyBrowse_ParentFetchesSingleLevel()
{
using var output = new StringWriter();
using var error = new StringWriter();
FakeCliClient fakeClient = new();
fakeClient.GalaxyBrowseChildrenReplies[10] = new Queue<BrowseChildrenReply>(
[
new BrowseChildrenReply
{
Children = { new GalaxyObject { GobjectId = 20, TagName = "Tank_001", BrowseName = "Tank" } },
},
]);
int exitCode = await MxGatewayClientCli.RunAsync(
[
"galaxy-browse",
"--endpoint",
"http://localhost:5000",
"--api-key",
"test-api-key",
"--parent",
"10",
],
output,
error,
_ => fakeClient);
Assert.Equal(0, exitCode);
Assert.Contains("Tank_001", output.ToString());
BrowseChildrenRequest request = Assert.Single(fakeClient.GalaxyBrowseChildrenRequests);
Assert.Equal(BrowseChildrenRequest.ParentOneofCase.ParentGobjectId, request.ParentCase);
Assert.Equal(10, request.ParentGobjectId);
Assert.Equal(string.Empty, error.ToString());
}
/// <summary>Verifies that galaxy-watch command prints text output for deploy events.</summary>
[Fact]
public async Task RunAsync_GalaxyWatch_PrintsTextOutputForEvents()
@@ -1051,5 +1191,33 @@ public sealed class MxGatewayClientCliTests
yield return deployEvent;
}
}
/// <summary>List of received galaxy browse-children requests, in call order.</summary>
public List<BrowseChildrenRequest> GalaxyBrowseChildrenRequests { get; } = [];
/// <summary>
/// Per-parent browse-children replies keyed by <c>parent_gobject_id</c>
/// (0 = root). Each parent's queue is dequeued in page order; an absent
/// or exhausted queue yields an empty reply.
/// </summary>
public Dictionary<int, Queue<BrowseChildrenReply>> GalaxyBrowseChildrenReplies { get; } = [];
/// <inheritdoc />
public Task<BrowseChildrenReply> GalaxyBrowseChildrenAsync(
BrowseChildrenRequest request,
CancellationToken cancellationToken)
{
GalaxyBrowseChildrenRequests.Add(request);
int parentId = request.ParentCase == BrowseChildrenRequest.ParentOneofCase.ParentGobjectId
? request.ParentGobjectId
: 0;
if (GalaxyBrowseChildrenReplies.TryGetValue(parentId, out Queue<BrowseChildrenReply>? queue)
&& queue.TryDequeue(out BrowseChildrenReply? reply))
{
return Task.FromResult(reply);
}
return Task.FromResult(new BrowseChildrenReply());
}
}
}
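Every paging loop in this change (`BrowseTreeAsync`, `BrowseOneLevelAsync`, and the Go `runGalaxyBrowseParent`) guards against a server that returns a page token it has already issued. Without that guard, such a server would make the loop spin forever. The guard generalises to a small helper, sketched below in Go. `collectAll` is a hypothetical name, and the sketch assumes an empty token marks the last page, as the loops here do.

```go
package main

import "fmt"

// collectAll drains a token-paged API. fetch receives the token of the page
// to read ("" for the first page) and returns that page's items plus the next
// token ("" when done). A token seen twice means the server is cycling, so the
// loop fails instead of spinning forever.
func collectAll[T any](fetch func(token string) ([]T, string, error)) ([]T, error) {
	var all []T
	seen := make(map[string]struct{})
	token := ""
	for {
		items, next, err := fetch(token)
		if err != nil {
			return nil, err
		}
		all = append(all, items...)
		if next == "" {
			return all, nil
		}
		if _, dup := seen[next]; dup {
			return nil, fmt.Errorf("repeated page token %q", next)
		}
		seen[next] = struct{}{}
		token = next
	}
}

func main() {
	pages := map[string]struct {
		items []int
		next  string
	}{
		"":   {[]int{1, 2}, "p2"},
		"p2": {[]int{3}, ""},
	}
	got, err := collectAll(func(tok string) ([]int, string, error) {
		p := pages[tok]
		return p.items, p.next, nil
	})
	fmt.Println(got, err)

	// A server that keeps handing back "p1" is caught on the second page.
	_, err = collectAll(func(string) ([]int, string, error) { return []int{0}, "p1", nil })
	fmt.Println(err)
}
```

Tracking every token seen, rather than only the previous one, also catches longer cycles such as `p1 → p2 → p1`, at the cost of one map entry per page.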
@@ -121,6 +121,10 @@ func runWithIO(ctx context.Context, args []string, stdout, stderr io.Writer) err
return runGalaxyDiscover(ctx, args[1:], stdout, stderr)
case "galaxy-watch":
return runGalaxyWatch(ctx, args[1:], stdout, stderr)
case "galaxy-browse":
return runGalaxyBrowse(ctx, args[1:], stdout, stderr)
case "ping":
return runPing(ctx, args[1:], stdout, stderr)
case "batch":
return runBatch(ctx, os.Stdin, stdout, stderr)
default:
@@ -228,6 +232,52 @@ func runCloseSession(ctx context.Context, args []string, stdout, stderr io.Write
return nil
}
func runPing(ctx context.Context, args []string, stdout, stderr io.Writer) error {
flags := flag.NewFlagSet("ping", flag.ContinueOnError)
flags.SetOutput(stderr)
common := bindCommonFlags(flags)
jsonOutput := flags.Bool("json", false, "write JSON output")
sessionID := flags.String("session-id", "", "gateway session id")
message := flags.String("message", "ping", "ping payload message")
if err := flags.Parse(args); err != nil {
return err
}
if *sessionID == "" {
return errors.New("session-id is required")
}
client, options, err := dialForCommand(ctx, common)
if err != nil {
return err
}
defer client.Close()
session := mxgateway.NewSessionForID(client, *sessionID)
reply, err := session.PingRaw(ctx, *message)
if err != nil {
return err
}
if *jsonOutput {
return writeJSON(stdout, commandReplyOutput{
Command: "ping",
Options: options,
Reply: mustMarshalProto(reply),
})
}
// DiagnosticMessage carries the echoed ping text set by the gateway.
// Fall back to the kind string when the gateway returns an empty message
// (forward-compat guard for future gateway versions). writeCommandOutput
// is not reused here because it would print the opaque Kind enum rather
// than the human-readable echo.
echo := reply.GetDiagnosticMessage()
if echo == "" {
echo = reply.GetKind().String()
}
fmt.Fprintln(stdout, echo)
return nil
}
func runRegister(ctx context.Context, args []string, stdout, stderr io.Writer) error {
flags := flag.NewFlagSet("register", flag.ContinueOnError)
flags.SetOutput(stderr)
@@ -1196,7 +1246,7 @@ type protojsonMessage interface {
}
func writeUsage(writer io.Writer) {
fmt.Fprintln(writer, "usage: mxgw-go <version|open-session|close-session|register|add-item|advise|subscribe-bulk|unsubscribe-bulk|read-bulk|write-bulk|write2-bulk|write-secured-bulk|write-secured2-bulk|bench-read-bulk|write|stream-events|stream-alarms|acknowledge-alarm|smoke|galaxy-test-connection|galaxy-last-deploy|galaxy-discover|galaxy-watch|batch>")
fmt.Fprintln(writer, "usage: mxgw-go <version|open-session|close-session|ping|register|add-item|advise|subscribe-bulk|unsubscribe-bulk|read-bulk|write-bulk|write2-bulk|write-secured-bulk|write-secured2-bulk|bench-read-bulk|write|stream-events|stream-alarms|acknowledge-alarm|smoke|galaxy-test-connection|galaxy-last-deploy|galaxy-discover|galaxy-watch|galaxy-browse|batch>")
}
// batchEOR is the end-of-result sentinel emitted to stdout after every command
@@ -1459,6 +1509,234 @@ func runGalaxyWatch(ctx context.Context, args []string, stdout, stderr io.Writer
}
}
// runGalaxyBrowse drives the lazy-browse Galaxy helper from the CLI. Without
// -parent it walks the root objects via GalaxyClient.Browse and eagerly expands
// -depth further levels (each level reuses the same BrowseChildrenOptions, like
// the library helper). With -parent it fetches exactly one level of children for
// that gobject id via a parent-scoped BrowseChildren request; -depth is not
// meaningful there and a warning is emitted if combined, mirroring the Rust CLI.
//
// Filter flags map onto BrowseChildrenOptions: -category-ids and
// -template-contains are comma-separated lists (matching this CLI's other
// list-valued flags), -tag-name-glob / -alarm-bearing-only / -historized-only
// are scalar, and -include-attributes is a tri-state pointer (left nil unless
// the flag is provided so the server default applies).
func runGalaxyBrowse(ctx context.Context, args []string, stdout, stderr io.Writer) error {
flags := flag.NewFlagSet("galaxy-browse", flag.ContinueOnError)
flags.SetOutput(stderr)
common := bindCommonFlags(flags)
jsonOutput := flags.Bool("json", false, "write JSON output")
parent := flags.Int("parent", -1, "parent gobject id whose children to browse; omit (or <0) for root objects")
depth := flags.Int("depth", 0, "additional levels to eagerly expand beneath each root node; ignored with -parent")
categoryIDs := flags.String("category-ids", "", "comma-separated Galaxy category ids to restrict results")
templateContains := flags.String("template-contains", "", "comma-separated template tag names the chain must contain")
tagNameGlob := flags.String("tag-name-glob", "", "restrict to objects whose tag name matches this glob")
alarmBearingOnly := flags.Bool("alarm-bearing-only", false, "restrict to alarm-bearing objects")
historizedOnly := flags.Bool("historized-only", false, "restrict to historized objects")
includeAttributes := flags.Bool("include-attributes", false, "populate attributes on returned objects (overrides server default)")
if err := flags.Parse(args); err != nil {
return err
}
categoryList, err := parseInt32List(*categoryIDs)
if err != nil {
return err
}
opts := &mxgateway.BrowseChildrenOptions{
CategoryIds: categoryList,
TemplateChainContains: parseStringList(*templateContains),
TagNameGlob: *tagNameGlob,
AlarmBearingOnly: *alarmBearingOnly,
HistorizedOnly: *historizedOnly,
}
// Only override the server default when the flag was actually set; the
// pointer form mirrors the proto's optional field.
flags.Visit(func(f *flag.Flag) {
if f.Name == "include-attributes" {
value := *includeAttributes
opts.IncludeAttributes = &value
}
})
client, options, err := dialGalaxyForCommand(ctx, common)
if err != nil {
return err
}
defer client.Close()
// A specific parent → one level of children via the raw parent-scoped RPC.
if *parent >= 0 {
if *parent == 0 {
fmt.Fprintln(stderr, "warning: -parent 0 is the server root sentinel; omit -parent for the root walk, or use -parent <id> >= 1")
}
if *depth > 0 {
fmt.Fprintln(stderr, "warning: -depth is ignored when -parent is specified")
}
return runGalaxyBrowseParent(ctx, client, int32(*parent), opts, stdout, *jsonOutput, options)
}
// No parent → walk the lazy-browse tree from the root objects, eagerly
// expanding -depth further levels so the print walks cached children
// without re-issuing RPCs.
nodes, err := client.Browse(ctx, opts)
if err != nil {
return err
}
for _, node := range nodes {
if err := expandToDepth(ctx, node, *depth); err != nil {
return err
}
}
if *jsonOutput {
jsonNodes := make([]map[string]any, 0, len(nodes))
for _, node := range nodes {
jsonNodes = append(jsonNodes, lazyNodeToJSON(node))
}
return writeJSON(stdout, map[string]any{
"command": "galaxy-browse",
"options": options,
"nodes": jsonNodes,
})
}
fmt.Fprintln(stdout, len(nodes))
for _, node := range nodes {
printLazyNode(stdout, node, 0)
}
return nil
}
// runGalaxyBrowseParent fetches exactly one level of children for parentID via a
// parent-scoped BrowseChildren request, paging until the server stops. It does
// not lazily wrap the children in nodes; the single level is rendered directly.
func runGalaxyBrowseParent(
ctx context.Context,
client *mxgateway.GalaxyClient,
parentID int32,
opts *mxgateway.BrowseChildrenOptions,
stdout io.Writer,
jsonOutput bool,
options commonOptions,
) error {
var children []*mxgateway.GalaxyObject
pageToken := ""
seen := map[string]struct{}{}
for {
req := &mxgateway.BrowseChildrenRequest{
PageSize: browseChildrenCLIPageSize,
PageToken: pageToken,
CategoryIds: opts.CategoryIds,
TemplateChainContains: opts.TemplateChainContains,
TagNameGlob: opts.TagNameGlob,
AlarmBearingOnly: opts.AlarmBearingOnly,
HistorizedOnly: opts.HistorizedOnly,
IncludeAttributes: opts.IncludeAttributes,
Parent: &mxgateway.BrowseChildrenRequest_ParentGobjectId{ParentGobjectId: parentID},
}
reply, err := client.BrowseChildrenRaw(ctx, req)
if err != nil {
return err
}
children = append(children, reply.GetChildren()...)
pageToken = reply.GetNextPageToken()
if pageToken == "" {
break
}
if _, dup := seen[pageToken]; dup {
return fmt.Errorf("galaxy browse children returned repeated page token %q", pageToken)
}
seen[pageToken] = struct{}{}
}
if jsonOutput {
jsonChildren := make([]map[string]any, 0, len(children))
for _, child := range children {
jsonChildren = append(jsonChildren, galaxyObjectToJSON(child))
}
return writeJSON(stdout, map[string]any{
"command": "galaxy-browse",
"options": options,
"parentId": parentID,
"children": jsonChildren,
})
}
fmt.Fprintln(stdout, len(children))
for _, child := range children {
fmt.Fprintf(stdout, "%d\t%s\t%s\t(attrs=%d)\n", child.GetGobjectId(), child.GetTagName(), child.GetBrowseName(), len(child.GetAttributes()))
}
return nil
}
// browseChildrenCLIPageSize is the per-request page size for the -parent
// single-level walk. It mirrors the library's browseChildrenPageSize so the
// CLI and the lazy-browse helper page identically.
const browseChildrenCLIPageSize = 500
// expandToDepth eagerly expands node and remaining further levels beneath it so
// a subsequent print walk reads cached children without re-issuing RPCs. A
// remaining of 0 leaves the node unexpanded (only the requested level prints).
func expandToDepth(ctx context.Context, node *mxgateway.LazyBrowseNode, remaining int) error {
if remaining <= 0 {
return nil
}
if err := node.Expand(ctx); err != nil {
return err
}
for _, child := range node.Children() {
if err := expandToDepth(ctx, child, remaining-1); err != nil {
return err
}
}
return nil
}
// printLazyNode renders one node and its already-expanded children as an
// indent-per-level tree. Only children loaded by a prior Expand are walked.
func printLazyNode(stdout io.Writer, node *mxgateway.LazyBrowseNode, level int) {
indent := strings.Repeat(" ", level)
obj := node.Object()
fmt.Fprintf(stdout, "%s%d\t%s\t%s\t(attrs=%d, hasChildrenHint=%t)\n",
indent, obj.GetGobjectId(), obj.GetTagName(), obj.GetBrowseName(), len(obj.GetAttributes()), node.HasChildrenHint())
for _, child := range node.Children() {
printLazyNode(stdout, child, level+1)
}
}
// lazyNodeToJSON renders one lazy node and its already-expanded children as a
// nested JSON object.
func lazyNodeToJSON(node *mxgateway.LazyBrowseNode) map[string]any {
out := galaxyObjectToJSON(node.Object())
out["hasChildrenHint"] = node.HasChildrenHint()
children := node.Children()
jsonChildren := make([]map[string]any, 0, len(children))
for _, child := range children {
jsonChildren = append(jsonChildren, lazyNodeToJSON(child))
}
out["children"] = jsonChildren
return out
}
// galaxyObjectToJSON renders the scalar fields of a GalaxyObject for the
// browse JSON output. Attributes are summarised by count to keep the tree
// compact; -include-attributes still drives whether the server populates them.
func galaxyObjectToJSON(obj *mxgateway.GalaxyObject) map[string]any {
return map[string]any{
"gobjectId": obj.GetGobjectId(),
"tagName": obj.GetTagName(),
"containedName": obj.GetContainedName(),
"browseName": obj.GetBrowseName(),
"parentGobjectId": obj.GetParentGobjectId(),
"isArea": obj.GetIsArea(),
"categoryId": obj.GetCategoryId(),
"templateChain": obj.GetTemplateChain(),
"attributeCount": len(obj.GetAttributes()),
}
}
func formatDeployEvent(event *mxgateway.DeployEvent) string {
observed := ""
if ts := event.GetObservedAt(); ts != nil {
@@ -190,6 +190,109 @@ func TestRunBenchReadBulkRespectsContextCancellation(t *testing.T) {
}
}
// TestRunPingPlainText verifies the ping subcommand round-trips through the
// fake gateway and prints the echo (diagnostic_message) in plain-text mode.
func TestRunPingPlainText(t *testing.T) {
listener, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
t.Fatalf("listen: %v", err)
}
server := grpc.NewServer()
fake := &pingFakeGateway{}
pb.RegisterMxAccessGatewayServer(server, fake)
go func() { _ = server.Serve(listener) }()
defer server.Stop()
defer listener.Close()
var stdout, stderr bytes.Buffer
args := []string{
"ping",
"-endpoint", listener.Addr().String(),
"-plaintext",
"-api-key", "test",
"-session-id", "test-session",
"-message", "hello",
}
if err := runWithIO(t.Context(), args, &stdout, &stderr); err != nil {
t.Fatalf("runWithIO() error = %v; stderr = %s", err, stderr.String())
}
got := strings.TrimSpace(stdout.String())
if got != "pong:hello" {
t.Fatalf("ping plain-text output = %q, want %q", got, "pong:hello")
}
}
// TestRunPingJSON verifies the ping subcommand emits valid JSON in --json mode.
func TestRunPingJSON(t *testing.T) {
listener, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
t.Fatalf("listen: %v", err)
}
server := grpc.NewServer()
fake := &pingFakeGateway{}
pb.RegisterMxAccessGatewayServer(server, fake)
go func() { _ = server.Serve(listener) }()
defer server.Stop()
defer listener.Close()
var stdout, stderr bytes.Buffer
args := []string{
"ping",
"-endpoint", listener.Addr().String(),
"-plaintext",
"-api-key", "test",
"-session-id", "test-session",
"-message", "hello",
"-json",
}
if err := runWithIO(t.Context(), args, &stdout, &stderr); err != nil {
t.Fatalf("runWithIO() error = %v; stderr = %s", err, stderr.String())
}
var out commandReplyOutput
if err := json.Unmarshal(stdout.Bytes(), &out); err != nil {
t.Fatalf("parse JSON: %v\noutput: %s", err, stdout.String())
}
if out.Command != "ping" {
t.Fatalf("command = %q, want %q", out.Command, "ping")
}
// The fake gateway echoes "pong:<message>" in diagnostic_message; verify the
// echo appears in the serialised reply so a future regression that wired
// PingRaw to the wrong proto field would be caught here.
replyStr := string(out.Reply)
if !strings.Contains(replyStr, "pong:hello") {
t.Fatalf("ping JSON reply missing echoed message %q; reply = %s", "pong:hello", replyStr)
}
}
// TestRunPingRequiresSessionID verifies the ping subcommand rejects missing session-id.
func TestRunPingRequiresSessionID(t *testing.T) {
var stdout, stderr bytes.Buffer
err := runWithIO(t.Context(), []string{"ping", "-plaintext", "-api-key", "test"}, &stdout, &stderr)
if err == nil {
t.Fatalf("runWithIO(ping without --session-id) returned no error")
}
if !strings.Contains(err.Error(), "session-id is required") {
t.Fatalf("error = %v; want 'session-id is required'", err)
}
}
// pingFakeGateway handles Invoke for MX_COMMAND_KIND_PING by echoing the
// message back in the diagnostic_message field so the CLI plain-text path
// has a deterministic, non-empty string to assert on.
type pingFakeGateway struct {
pb.UnimplementedMxAccessGatewayServer
}
func (g *pingFakeGateway) Invoke(_ context.Context, req *pb.MxCommandRequest) (*pb.MxCommandReply, error) {
echo := "pong:" + req.GetCommand().GetPing().GetMessage()
return &pb.MxCommandReply{
SessionId: req.GetSessionId(),
Kind: pb.MxCommandKind_MX_COMMAND_KIND_PING,
DiagnosticMessage: echo,
ProtocolStatus: &pb.ProtocolStatus{Code: pb.ProtocolStatusCode_PROTOCOL_STATUS_CODE_OK},
}, nil
}
// benchFakeGateway is a minimal MxAccessGatewayServer that satisfies the
// bench-read-bulk session-setup sequence (OpenSession + Invoke for Register
// / SubscribeBulk / ReadBulk / UnsubscribeBulk / CloseSession).
@@ -245,6 +348,146 @@ func TestRunBenchReadBulkRejectsNonPositiveBulkSize(t *testing.T) {
}
}
// browseFakeGalaxy implements BrowseChildren for the galaxy-browse subcommand
// tests. It returns two root objects when no parent is supplied (the first
// flagged as having children), and one child when the first root's gobject id
// is supplied as the parent. The recorded last request lets a test assert the
// CLI forwarded the parent and filter fields onto the wire.
type browseFakeGalaxy struct {
pb.UnimplementedGalaxyRepositoryServer
lastRequest *pb.BrowseChildrenRequest
}
func (g *browseFakeGalaxy) BrowseChildren(_ context.Context, req *pb.BrowseChildrenRequest) (*pb.BrowseChildrenReply, error) {
g.lastRequest = req
if req.GetParentGobjectId() == 10 {
return &pb.BrowseChildrenReply{
Children: []*pb.GalaxyObject{
{GobjectId: 11, TagName: "Area1.Tank", BrowseName: "Tank"},
},
ChildHasChildren: []bool{false},
}, nil
}
return &pb.BrowseChildrenReply{
Children: []*pb.GalaxyObject{
{GobjectId: 10, TagName: "Area1", BrowseName: "Area1"},
{GobjectId: 20, TagName: "Area2", BrowseName: "Area2"},
},
ChildHasChildren: []bool{true, false},
}, nil
}
func startBrowseFakeGalaxy(t *testing.T) (addr string, fake *browseFakeGalaxy) {
t.Helper()
listener, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
t.Fatalf("listen: %v", err)
}
server := grpc.NewServer()
fake = &browseFakeGalaxy{}
pb.RegisterGalaxyRepositoryServer(server, fake)
go func() { _ = server.Serve(listener) }()
t.Cleanup(func() {
server.Stop()
_ = listener.Close()
})
return listener.Addr().String(), fake
}
// TestRunGalaxyBrowseTextTree verifies the galaxy-browse subcommand issues
// BrowseChildren for the root walk, eagerly expands one level when --depth is
// set, and renders an indented tree.
func TestRunGalaxyBrowseTextTree(t *testing.T) {
addr, _ := startBrowseFakeGalaxy(t)
var stdout, stderr bytes.Buffer
args := []string{
"galaxy-browse",
"-endpoint", addr,
"-plaintext",
"-api-key", "test",
"-depth", "1",
}
if err := runWithIO(t.Context(), args, &stdout, &stderr); err != nil {
t.Fatalf("runWithIO() error = %v; stderr = %s", err, stderr.String())
}
out := stdout.String()
// Both roots present; the first root's eagerly-expanded child appears
// indented beneath it.
for _, want := range []string{"Area1", "Area2", "Tank"} {
if !strings.Contains(out, want) {
t.Fatalf("galaxy-browse text output missing %q; got:\n%s", want, out)
}
}
if !strings.Contains(out, " ") {
t.Fatalf("galaxy-browse text output not indented for children; got:\n%s", out)
}
}
// TestRunGalaxyBrowseJSON verifies the galaxy-browse subcommand emits valid
// nested JSON and forwards filter options onto the BrowseChildren request.
func TestRunGalaxyBrowseJSON(t *testing.T) {
addr, fake := startBrowseFakeGalaxy(t)
var stdout, stderr bytes.Buffer
args := []string{
"galaxy-browse",
"-endpoint", addr,
"-plaintext",
"-api-key", "test",
"-depth", "1",
"-tag-name-glob", "Area%",
"-alarm-bearing-only",
"-json",
}
if err := runWithIO(t.Context(), args, &stdout, &stderr); err != nil {
t.Fatalf("runWithIO() error = %v; stderr = %s", err, stderr.String())
}
var payload map[string]any
if err := json.Unmarshal(stdout.Bytes(), &payload); err != nil {
t.Fatalf("parse JSON: %v\noutput: %s", err, stdout.String())
}
if payload["command"] != "galaxy-browse" {
t.Fatalf("command = %v, want galaxy-browse", payload["command"])
}
nodes, ok := payload["nodes"].([]any)
if !ok || len(nodes) != 2 {
t.Fatalf("nodes = %v, want 2 root nodes", payload["nodes"])
}
// Filter fields must have reached the wire.
if got := fake.lastRequest.GetTagNameGlob(); got != "Area%" {
t.Fatalf("BrowseChildren TagNameGlob = %q, want %q", got, "Area%")
}
if !fake.lastRequest.GetAlarmBearingOnly() {
t.Fatalf("BrowseChildren AlarmBearingOnly = false, want true")
}
}
// TestRunGalaxyBrowseParentSingleLevel verifies that passing --parent fetches a
// single level of children for that parent via the parent-scoped request.
func TestRunGalaxyBrowseParentSingleLevel(t *testing.T) {
addr, fake := startBrowseFakeGalaxy(t)
var stdout, stderr bytes.Buffer
args := []string{
"galaxy-browse",
"-endpoint", addr,
"-plaintext",
"-api-key", "test",
"-parent", "10",
}
if err := runWithIO(t.Context(), args, &stdout, &stderr); err != nil {
t.Fatalf("runWithIO() error = %v; stderr = %s", err, stderr.String())
}
if !strings.Contains(stdout.String(), "Tank") {
t.Fatalf("galaxy-browse -parent output missing child %q; got:\n%s", "Tank", stdout.String())
}
if got := fake.lastRequest.GetParentGobjectId(); got != 10 {
t.Fatalf("BrowseChildren ParentGobjectId = %d, want 10", got)
}
}
// TestRunBatchSkipsBlankLinesAndContinuesUntilEOF pins the Client.Go-027 fix:
// a blank line in the middle of a batch session must NOT terminate the loop —
// only stdin EOF ends the session.
@@ -363,6 +363,89 @@ func TestBulkMethodsShortCircuitOnEmptySliceWithoutRoundTrip(t *testing.T) {
}
}
func TestWrite2BuildsCommandWithTimestampAndReturnsNoError(t *testing.T) {
fake := &fakeGatewayServer{
invokeReply: &pb.MxCommandReply{
SessionId: "session-1",
Kind: pb.MxCommandKind_MX_COMMAND_KIND_WRITE2,
ProtocolStatus: &pb.ProtocolStatus{
Code: pb.ProtocolStatusCode_PROTOCOL_STATUS_CODE_OK,
},
},
}
client, cleanup := newBufconnClient(t, fake)
defer cleanup()
session := NewSessionForID(client, "session-1")
val := Int32Value(99)
ts := Int32Value(77)
err := session.Write2(context.Background(), 12, 34, val, ts, 100)
if err != nil {
t.Fatalf("Write2() error = %v", err)
}
req := fake.invokeRequest
if req.GetCommand().GetKind() != pb.MxCommandKind_MX_COMMAND_KIND_WRITE2 {
t.Fatalf("command kind = %s, want WRITE2", req.GetCommand().GetKind())
}
w2 := req.GetCommand().GetWrite2()
if w2.GetServerHandle() != 12 {
t.Fatalf("server handle = %d, want 12", w2.GetServerHandle())
}
if w2.GetItemHandle() != 34 {
t.Fatalf("item handle = %d, want 34", w2.GetItemHandle())
}
if w2.GetValue().GetInt32Value() != 99 {
t.Fatalf("value int32 = %d, want 99", w2.GetValue().GetInt32Value())
}
if w2.GetTimestampValue().GetInt32Value() != 77 {
t.Fatalf("timestamp value int32 = %d, want 77", w2.GetTimestampValue().GetInt32Value())
}
if w2.GetUserId() != 100 {
t.Fatalf("user id = %d, want 100", w2.GetUserId())
}
}
func TestWrite2RawReturnsRawReply(t *testing.T) {
fake := &fakeGatewayServer{
invokeReply: &pb.MxCommandReply{
SessionId: "session-1",
Kind: pb.MxCommandKind_MX_COMMAND_KIND_WRITE2,
ProtocolStatus: &pb.ProtocolStatus{
Code: pb.ProtocolStatusCode_PROTOCOL_STATUS_CODE_OK,
},
},
}
client, cleanup := newBufconnClient(t, fake)
defer cleanup()
session := NewSessionForID(client, "session-1")
reply, err := session.Write2Raw(context.Background(), 12, 34, Int32Value(1), Int32Value(0), 0)
if err != nil {
t.Fatalf("Write2Raw() error = %v", err)
}
if reply == nil {
t.Fatal("Write2Raw() returned nil reply")
}
if reply.GetKind() != pb.MxCommandKind_MX_COMMAND_KIND_WRITE2 {
t.Fatalf("reply kind = %s, want WRITE2", reply.GetKind())
}
}
func TestWrite2RejectsNilValue(t *testing.T) {
fake := &fakeGatewayServer{}
client, cleanup := newBufconnClient(t, fake)
defer cleanup()
session := NewSessionForID(client, "session-1")
if err := session.Write2(context.Background(), 12, 34, nil, Int32Value(0), 0); err == nil {
t.Fatal("Write2(nil value) returned no error")
}
if err := session.Write2(context.Background(), 12, 34, Int32Value(1), nil, 0); err == nil {
t.Fatal("Write2(nil timestampValue) returned no error")
}
}
func TestReadBulkForwardsTimeoutAndUnpacksCachedFlag(t *testing.T) {
fake := &fakeGatewayServer{
invokeReply: &pb.MxCommandReply{
@@ -54,6 +54,11 @@ type (
BrowseChildrenRequest = pb.BrowseChildrenRequest
// BrowseChildrenReply is the reply for BrowseChildren.
BrowseChildrenReply = pb.BrowseChildrenReply
// BrowseChildrenRequest_ParentGobjectId selects the parent-by-gobject-id
// variant of the BrowseChildrenRequest parent oneof. Exposed so callers
// (e.g. the mxgw-go CLI) can issue a parent-scoped single-level browse
// without reaching into the generated package.
BrowseChildrenRequest_ParentGobjectId = pb.BrowseChildrenRequest_ParentGobjectId //nolint:revive,staticcheck // mirrors generated proto oneof name
)
// RawDeployEventStream is the generated WatchDeployEvents client stream.
@@ -580,6 +580,46 @@ func (s *Session) WriteRaw(ctx context.Context, serverHandle, itemHandle int32,
})
}
// PingRaw sends a diagnostic PING command and returns the raw reply.
// The message is echoed back by the gateway in the reply's DiagnosticMessage field.
func (s *Session) PingRaw(ctx context.Context, message string) (*MxCommandReply, error) {
return s.invokeCommand(ctx, &pb.MxCommand{
Kind: pb.MxCommandKind_MX_COMMAND_KIND_PING,
Payload: &pb.MxCommand_Ping{
Ping: &pb.PingCommand{Message: message},
},
})
}
// Write2 invokes MXAccess Write2 (timestamped single-item write).
func (s *Session) Write2(ctx context.Context, serverHandle, itemHandle int32, value, timestampValue *MxValue, userID int32) error {
_, err := s.Write2Raw(ctx, serverHandle, itemHandle, value, timestampValue, userID)
return err
}
// Write2Raw invokes MXAccess Write2 (timestamped single-item write) and returns the raw reply.
func (s *Session) Write2Raw(ctx context.Context, serverHandle, itemHandle int32, value, timestampValue *MxValue, userID int32) (*MxCommandReply, error) {
if value == nil {
return nil, errors.New("mxgateway: write2 value is required")
}
if timestampValue == nil {
return nil, errors.New("mxgateway: write2 timestamp value is required")
}
return s.invokeCommand(ctx, &pb.MxCommand{
Kind: pb.MxCommandKind_MX_COMMAND_KIND_WRITE2,
Payload: &pb.MxCommand_Write2{
Write2: &pb.Write2Command{
ServerHandle: serverHandle,
ItemHandle: itemHandle,
Value: value,
TimestampValue: timestampValue,
UserId: userID,
},
},
})
}
// Events streams ordered session events until the server ends the stream,
// context cancellation stops Recv, or a terminal error is sent.
func (s *Session) Events(ctx context.Context) (<-chan EventResult, error) {
@@ -115,17 +115,33 @@ try (GalaxyRepositoryClient galaxy = GalaxyRepositoryClient.connect(options)) {
messages directly so callers can read all fields (including the nested
`GalaxyAttribute` list) without an extra DTO layer.
The CLI exposes matching subcommands: `galaxy-test`, `galaxy-deploy-time`,
`galaxy-discover`, and `galaxy-watch`. They take the same `--endpoint`,
`--api-key-env`, `--plaintext`, `--ca-file`, `--server-name-override`,
`--timeout`, and `--json` options as the gateway commands.
The CLI exposes matching subcommands: `galaxy-test-connection`,
`galaxy-last-deploy`, `galaxy-discover`, `galaxy-browse`, and `galaxy-watch`.
The short names `galaxy-test` and `galaxy-deploy-time` remain as deprecated
aliases for `galaxy-test-connection` and `galaxy-last-deploy` so existing
scripts keep working. Note that `--json` output always reports the canonical
name in its `command` field, even when a command is invoked through an alias.
All of these subcommands take the same `--endpoint`, `--api-key-env`,
`--plaintext`, `--ca-file`, `--server-name-override`, `--timeout`, and `--json`
options as the gateway commands.
```powershell
gradle :zb-mom-ww-mxgateway-cli:run --args="galaxy-test --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --json"
gradle :zb-mom-ww-mxgateway-cli:run --args="galaxy-deploy-time --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --json"
gradle :zb-mom-ww-mxgateway-cli:run --args="galaxy-test-connection --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --json"
gradle :zb-mom-ww-mxgateway-cli:run --args="galaxy-last-deploy --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --json"
gradle :zb-mom-ww-mxgateway-cli:run --args="galaxy-discover --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --json"
```
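The CLI declares the aliases with picocli's `@Command(aliases = ...)` on the canonical command classes. Both spellings therefore run the same command class. The sketch below is a rough, stdlib-only model of the lookup this amounts to. The `GalaxyCommandAliases` class and its methods are hypothetical and are not part of the client library.

```java
import java.util.Map;

/** Hypothetical sketch of deprecated-alias resolution; the real CLI relies on picocli aliases. */
final class GalaxyCommandAliases {
    // Deprecated short name -> canonical subcommand name.
    private static final Map<String, String> DEPRECATED_ALIASES = Map.of(
            "galaxy-test", "galaxy-test-connection",
            "galaxy-deploy-time", "galaxy-last-deploy");

    private GalaxyCommandAliases() {
    }

    /** Returns the canonical subcommand name, or {@code name} unchanged when it is not an alias. */
    static String canonicalName(String name) {
        return DEPRECATED_ALIASES.getOrDefault(name, name);
    }

    /** True when {@code name} is one of the deprecated short aliases. */
    static boolean isDeprecatedAlias(String name) {
        return DEPRECATED_ALIASES.containsKey(name);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"galaxy-test", "galaxy-deploy-time", "galaxy-browse"}) {
            String note = isDeprecatedAlias(name) ? " (deprecated alias)" : "";
            System.out.println(name + " -> " + canonicalName(name) + note);
        }
    }
}
```

Registering the old names as aliases, rather than as separate subcommands, means the two spellings cannot drift apart in behaviour.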
`galaxy-browse` walks the hierarchy via `BrowseChildren`. Without `--parent` it
returns the root nodes and eagerly expands `--depth` further levels (0 to 50,
default 0). Only nodes whose `hasChildrenHint` is set are expanded. With
`--parent <gobject-id>` it pages through exactly one level of children for that
parent and ignores `--depth`, printing a warning if it was given. Gobject id 0
is the server's sentinel for roots, so omit `--parent` to walk roots. The filter
flags (`--category-ids`, `--template-contains`, `--tag-name-glob`,
`--alarm-bearing-only`, `--historized-only`, `--include-attributes`) match
`galaxy-discover`. The `--json` node shape is the cross-client browse surface:
the flattened object fields plus a `hasChildrenHint` flag and a nested
`children` array.
```powershell
gradle :zb-mom-ww-mxgateway-cli:run --args="galaxy-browse --depth 1 --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --json"
```
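The following minimal, stdlib-only sketch makes the `--depth` semantics and the JSON node shape concrete. It rests on several assumptions:

- `BrowseDepthSketch`, its `Child` record and the `fakeGalaxy` data are hypothetical. The fake data mirrors the Go CLI test fake.
- `fetchChildren` stands in for one `BrowseChildren` level keyed by parent gobject id, with 0 used as the roots sentinel.
- Only two of the flattened object fields are shown.

This is not the CLI's implementation. The CLI expands `LazyBrowseNode` instances returned by `GalaxyRepositoryClient.browse`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical sketch of depth-limited Galaxy browsing; not the CLI's implementation. */
final class BrowseDepthSketch {
    /** Parent id this sketch uses to request root nodes (the documented server sentinel). */
    static final int ROOTS = 0;

    /** Hypothetical stand-in for a GalaxyObject plus its server-supplied has-children hint. */
    record Child(int gobjectId, String tagName, boolean hasChildrenHint) {
    }

    private BrowseDepthSketch() {
    }

    /** Fetches the roots, then eagerly expands each one by {@code depth} further levels. */
    static List<Map<String, Object>> walk(int depth, IntFunction<List<Child>> fetchChildren) {
        if (depth < 0 || depth > 50) {
            throw new IllegalArgumentException("depth must be between 0 and 50");
        }
        List<Map<String, Object>> nodes = new ArrayList<>();
        for (Child root : fetchChildren.apply(ROOTS)) {
            nodes.add(node(root, depth, fetchChildren));
        }
        return nodes;
    }

    /** Builds one node map: flattened fields, hasChildrenHint, and a nested children list. */
    static Map<String, Object> node(Child child, int depth, IntFunction<List<Child>> fetchChildren) {
        List<Map<String, Object>> children = new ArrayList<>();
        // Only fetch when depth remains and the server hints that children exist.
        if (depth > 0 && child.hasChildrenHint()) {
            for (Child grandchild : fetchChildren.apply(child.gobjectId())) {
                children.add(node(grandchild, depth - 1, fetchChildren));
            }
        }
        Map<String, Object> values = new LinkedHashMap<>();
        values.put("gobjectId", child.gobjectId());
        values.put("tagName", child.tagName());
        values.put("hasChildrenHint", child.hasChildrenHint());
        values.put("children", children);
        return values;
    }

    /** Fake hierarchy: Area1 (containing Area1.Tank) and the leaf Area2. */
    static List<Child> fakeGalaxy(int parentGobjectId) {
        return switch (parentGobjectId) {
            case ROOTS -> List.of(new Child(10, "Area1", true), new Child(20, "Area2", false));
            case 10 -> List.of(new Child(11, "Area1.Tank", false));
            default -> List.of();
        };
    }

    public static void main(String[] args) {
        for (Map<String, Object> root : walk(1, BrowseDepthSketch::fakeGalaxy)) {
            System.out.println(root);
        }
    }
}
```

With `depth` 1, `Area1` carries one nested child and `Area2` keeps an empty `children` list. With `depth` 0, every root is left unexpanded.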
### Browsing lazily
For UI trees or OPC UA bridges, use `browseChildrenRaw` to walk one level at a
@@ -239,6 +255,7 @@ Run the CLI through Gradle:
```powershell
gradle :zb-mom-ww-mxgateway-cli:run --args="version --json"
gradle :zb-mom-ww-mxgateway-cli:run --args="open-session --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --client-session-name java-cli --json"
gradle :zb-mom-ww-mxgateway-cli:run --args="ping --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --session-id <id> --message hello --json"
gradle :zb-mom-ww-mxgateway-cli:run --args="register --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --session-id <id> --client-name java-cli --json"
gradle :zb-mom-ww-mxgateway-cli:run --args="add-item --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --session-id <id> --server-handle 1 --item TestObject.TestInt --json"
gradle :zb-mom-ww-mxgateway-cli:run --args="advise --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --session-id <id> --server-handle 1 --item-handle 1 --json"
@@ -1,7 +1,9 @@
package com.zb.mom.ww.mxgateway.cli;
import com.zb.mom.ww.mxgateway.client.BrowseChildrenOptions;
import com.zb.mom.ww.mxgateway.client.DeployEventStream;
import com.zb.mom.ww.mxgateway.client.GalaxyRepositoryClient;
import com.zb.mom.ww.mxgateway.client.LazyBrowseNode;
import com.zb.mom.ww.mxgateway.client.MxEventStream;
import com.zb.mom.ww.mxgateway.client.MxGatewayAlarmFeedSubscription;
import com.zb.mom.ww.mxgateway.client.MxGatewayClient;
@@ -10,6 +12,8 @@ import com.zb.mom.ww.mxgateway.client.MxGatewayClientVersion;
import com.zb.mom.ww.mxgateway.client.MxGatewaySecrets;
import com.zb.mom.ww.mxgateway.client.MxGatewaySession;
import com.zb.mom.ww.mxgateway.client.MxValues;
import galaxy_repository.v1.GalaxyRepositoryOuterClass.BrowseChildrenReply;
import galaxy_repository.v1.GalaxyRepositoryOuterClass.BrowseChildrenRequest;
import galaxy_repository.v1.GalaxyRepositoryOuterClass.DeployEvent;
import galaxy_repository.v1.GalaxyRepositoryOuterClass.GalaxyAttribute;
import galaxy_repository.v1.GalaxyRepositoryOuterClass.GalaxyObject;
@@ -26,8 +30,10 @@ import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
@@ -42,11 +48,14 @@ import mxaccess_gateway.v1.MxaccessGateway.AlarmFeedMessage;
import mxaccess_gateway.v1.MxaccessGateway.BulkReadResult;
import mxaccess_gateway.v1.MxaccessGateway.BulkWriteResult;
import mxaccess_gateway.v1.MxaccessGateway.CloseSessionRequest;
import mxaccess_gateway.v1.MxaccessGateway.MxCommand;
import mxaccess_gateway.v1.MxaccessGateway.MxCommandKind;
import mxaccess_gateway.v1.MxaccessGateway.MxCommandReply;
import mxaccess_gateway.v1.MxaccessGateway.MxEvent;
import mxaccess_gateway.v1.MxaccessGateway.MxValue;
import mxaccess_gateway.v1.MxaccessGateway.OnAlarmTransitionEvent;
import mxaccess_gateway.v1.MxaccessGateway.OpenSessionRequest;
import mxaccess_gateway.v1.MxaccessGateway.PingCommand;
import mxaccess_gateway.v1.MxaccessGateway.StreamAlarmsRequest;
import mxaccess_gateway.v1.MxaccessGateway.SubscribeResult;
import mxaccess_gateway.v1.MxaccessGateway.Write2BulkEntry;
@@ -126,6 +135,7 @@ public final class MxGatewayCli implements Callable<Integer> {
commandLine.addSubcommand("version", new VersionCommand());
commandLine.addSubcommand("open-session", new OpenSessionCommand(clientFactory));
commandLine.addSubcommand("close-session", new CloseSessionCommand(clientFactory));
commandLine.addSubcommand("ping", new PingCommandLine(clientFactory));
commandLine.addSubcommand("register", new RegisterCommand(clientFactory));
commandLine.addSubcommand("add-item", new AddItemCommand(clientFactory));
commandLine.addSubcommand("advise", new AdviseCommand(clientFactory));
@@ -142,9 +152,10 @@ public final class MxGatewayCli implements Callable<Integer> {
commandLine.addSubcommand("stream-alarms", new StreamAlarmsCommand(clientFactory));
commandLine.addSubcommand("acknowledge-alarm", new AcknowledgeAlarmCommand(clientFactory));
commandLine.addSubcommand("smoke", new SmokeCommand(clientFactory));
commandLine.addSubcommand("galaxy-test", new GalaxyTestConnectionCommand());
commandLine.addSubcommand("galaxy-deploy-time", new GalaxyDeployTimeCommand());
commandLine.addSubcommand("galaxy-test-connection", new GalaxyTestConnectionCommand());
commandLine.addSubcommand("galaxy-last-deploy", new GalaxyDeployTimeCommand());
commandLine.addSubcommand("galaxy-discover", new GalaxyDiscoverCommand());
commandLine.addSubcommand("galaxy-browse", new GalaxyBrowseCommand());
commandLine.addSubcommand("galaxy-watch", new GalaxyWatchCommand());
commandLine.addSubcommand("batch", new BatchCommand(clientFactory));
return commandLine;
@@ -359,7 +370,10 @@ public final class MxGatewayCli implements Callable<Integer> {
}
}
@Command(name = "galaxy-test", description = "Calls GalaxyRepository.TestConnection.")
@Command(
name = "galaxy-test-connection",
aliases = {"galaxy-test"},
description = "Calls GalaxyRepository.TestConnection.")
static final class GalaxyTestConnectionCommand extends GalaxyCommand {
@Override
public Integer call() {
@@ -368,7 +382,7 @@ public final class MxGatewayCli implements Callable<Integer> {
PrintWriter out = common.spec.commandLine().getOut();
if (json) {
Map<String, Object> output = new LinkedHashMap<>();
output.put("command", "galaxy-test");
output.put("command", "galaxy-test-connection");
output.put("options", common.redactedJsonMap());
output.put("ok", ok);
out.println(jsonObject(output));
@@ -380,7 +394,10 @@ public final class MxGatewayCli implements Callable<Integer> {
}
}
@Command(name = "galaxy-deploy-time", description = "Calls GalaxyRepository.GetLastDeployTime.")
@Command(
name = "galaxy-last-deploy",
aliases = {"galaxy-deploy-time"},
description = "Calls GalaxyRepository.GetLastDeployTime.")
static final class GalaxyDeployTimeCommand extends GalaxyCommand {
@Override
public Integer call() {
@@ -389,7 +406,7 @@ public final class MxGatewayCli implements Callable<Integer> {
PrintWriter out = common.spec.commandLine().getOut();
if (json) {
Map<String, Object> output = new LinkedHashMap<>();
output.put("command", "galaxy-deploy-time");
output.put("command", "galaxy-last-deploy");
output.put("options", common.redactedJsonMap());
output.put("present", result.isPresent());
output.put("timeOfLastDeploy", result.map(Instant::toString).orElse(""));
@@ -429,6 +446,274 @@ public final class MxGatewayCli implements Callable<Integer> {
}
}
/**
* Page size used for the raw {@code BrowseChildren} paging loop driven by
* the {@code --parent} one-level path. Mirrors {@code BROWSE_CHILDREN_PAGE_SIZE}
* in the client library's lazy-browse helper and the other clients' CLI page
* size so paging behaviour is consistent across languages.
*/
private static final int BROWSE_CHILDREN_CLI_PAGE_SIZE = 500;
@Command(
name = "galaxy-browse",
description = "Browses the Galaxy hierarchy via GalaxyRepository.BrowseChildren.")
static final class GalaxyBrowseCommand extends GalaxyCommand {
@Spec
private CommandSpec spec;
@Option(
names = "--parent",
defaultValue = "-1",
description =
"Parent gobject id whose direct children are listed (one level only)."
+ " Omit to walk root nodes instead;"
+ " gobject id 0 is the server's sentinel for roots.")
int parent;
@Option(
names = "--depth",
defaultValue = "0",
description =
"When walking roots, eagerly expand this many further levels before printing."
+ " Must be between 0 and 50 inclusive.")
int depth;
@Option(names = "--category-ids", description = "Comma-separated category ids to include.")
String categoryIds;
@Option(names = "--template-contains", description = "Comma-separated template names each child's chain must contain.")
String templateContains;
@Option(names = "--tag-name-glob", description = "SQL-LIKE-style glob applied to tag_name.")
String tagNameGlob;
@Option(names = "--alarm-bearing-only", description = "Restrict to alarm-bearing objects.")
boolean alarmBearingOnly;
@Option(names = "--historized-only", description = "Restrict to objects with at least one historized attribute.")
boolean historizedOnly;
@Option(names = "--include-attributes", description = "Request attribute population on each returned object.")
boolean includeAttributes;
@Override
public Integer call() {
if (depth < 0) {
throw new CommandLine.ParameterException(spec.commandLine(), "--depth must be non-negative");
}
if (depth > 50) {
throw new CommandLine.ParameterException(spec.commandLine(), "--depth must be at most 50");
}
BrowseChildrenOptions options = buildOptions();
PrintWriter out = common.spec.commandLine().getOut();
PrintWriter err = common.spec.commandLine().getErr();
if (parent == 0) {
err.println("warning: --parent 0 is the server sentinel for root nodes; omit --parent to walk roots instead.");
}
try (GalaxyRepositoryClient client = connect()) {
if (parent >= 0) {
if (depth > 0) {
err.println("warning: --depth is ignored when --parent is specified.");
}
List<BrowseChild> children = browseOneLevel(client, parent, options);
if (json) {
List<Map<String, Object>> nodes = new ArrayList<>(children.size());
for (BrowseChild child : children) {
nodes.add(browseNodeMap(child.object(), child.hasChildrenHint(), List.of()));
}
Map<String, Object> output = new LinkedHashMap<>();
output.put("command", "galaxy-browse");
output.put("options", common.redactedJsonMap());
output.put("parentId", parent);
output.put("nodes", nodes);
out.println(jsonObject(output));
} else {
out.println(children.size());
for (BrowseChild child : children) {
printBrowseChild(out, child);
}
}
return 0;
}
List<LazyBrowseNode> roots = client.browse(options);
for (LazyBrowseNode root : roots) {
expandToDepth(root, depth);
}
if (json) {
List<Map<String, Object>> nodes = new ArrayList<>(roots.size());
for (LazyBrowseNode root : roots) {
nodes.add(lazyNodeMap(root));
}
Map<String, Object> output = new LinkedHashMap<>();
output.put("command", "galaxy-browse");
output.put("options", common.redactedJsonMap());
output.put("nodes", nodes);
out.println(jsonObject(output));
} else {
out.println(roots.size());
for (LazyBrowseNode root : roots) {
printLazyNode(out, root, 0);
}
}
}
return 0;
}
private BrowseChildrenOptions buildOptions() {
return BrowseChildrenOptions.builder()
.categoryIds(parseOptionalIntList(categoryIds))
.templateChainContains(parseOptionalStringList(templateContains))
.tagNameGlob(tagNameGlob == null ? "" : tagNameGlob)
// Tri-state: only override the server default when the flag is present.
.includeAttributes(includeAttributes ? Boolean.TRUE : null)
.alarmBearingOnly(alarmBearingOnly)
.historizedOnly(historizedOnly)
.build();
}
}
/** One raw {@code BrowseChildren} child paired with its server-supplied has-children hint. */
private record BrowseChild(GalaxyObject object, boolean hasChildrenHint) {
}
/**
* Drives the raw {@code BrowseChildren} paging loop for a single parent and
* returns the flattened one-level child list. Used by the {@code --parent}
* path, which surfaces a single level rather than the lazy root-tree walk.
*/
private static List<BrowseChild> browseOneLevel(
GalaxyRepositoryClient client, int parentGobjectId, BrowseChildrenOptions options) {
List<BrowseChild> children = new ArrayList<>();
Set<String> seenPageTokens = new HashSet<>();
String pageToken = "";
while (true) {
BrowseChildrenRequest.Builder builder = BrowseChildrenRequest.newBuilder()
.setPageSize(BROWSE_CHILDREN_CLI_PAGE_SIZE)
.setPageToken(pageToken)
.setParentGobjectId(parentGobjectId)
.setAlarmBearingOnly(options.isAlarmBearingOnly())
.setHistorizedOnly(options.isHistorizedOnly());
if (!options.getCategoryIds().isEmpty()) {
builder.addAllCategoryIds(options.getCategoryIds());
}
if (!options.getTemplateChainContains().isEmpty()) {
builder.addAllTemplateChainContains(options.getTemplateChainContains());
}
if (!options.getTagNameGlob().isEmpty()) {
builder.setTagNameGlob(options.getTagNameGlob());
}
if (options.getIncludeAttributes() != null) {
builder.setIncludeAttributes(options.getIncludeAttributes());
}
BrowseChildrenReply reply = client.browseChildrenRaw(builder.build());
for (int i = 0; i < reply.getChildrenCount(); i++) {
boolean hint = i < reply.getChildHasChildrenCount() && reply.getChildHasChildren(i);
children.add(new BrowseChild(reply.getChildren(i), hint));
}
pageToken = reply.getNextPageToken();
if (pageToken == null || pageToken.isEmpty()) {
return children;
}
if (!seenPageTokens.add(pageToken)) {
throw new IllegalStateException(
"galaxy browse children returned repeated page token: " + pageToken);
}
}
}
/**
* Recursively expands a {@link LazyBrowseNode} up to {@code depth} further
* levels. A {@code depth} of 0 leaves the node unexpanded so callers print
* only the requested level. Nodes the server reports as childless are not
* expanded.
*/
private static void expandToDepth(LazyBrowseNode node, int depth) {
if (depth <= 0) {
return;
}
if (node.hasChildrenHint()) {
node.expand();
}
for (LazyBrowseNode child : node.getChildren()) {
expandToDepth(child, depth - 1);
}
}
/**
* Renders one {@link LazyBrowseNode} (and any already-expanded descendants)
* as a JSON map. Mirrors the {@code galaxy-discover} object shape with an
* added {@code hasChildrenHint} flag and a nested {@code children} array,
* matching the cross-client browse JSON surface.
*/
private static Map<String, Object> lazyNodeMap(LazyBrowseNode node) {
List<Map<String, Object>> children = new ArrayList<>();
if (node.isExpanded()) {
for (LazyBrowseNode child : node.getChildren()) {
children.add(lazyNodeMap(child));
}
}
return browseNodeMap(node.getObject(), node.hasChildrenHint(), children);
}
/**
* Builds the per-node browse JSON map: the flattened Galaxy object fields,
* the {@code hasChildrenHint} flag, and a nested {@code children} array.
* The {@code hasChildrenHint} key is the cross-client standard (Rust /
* Python / .NET / Go all use the same key and node shape).
*/
static Map<String, Object> browseNodeMap(
GalaxyObject object, boolean hasChildrenHint, List<Map<String, Object>> children) {
Map<String, Object> values = galaxyObjectMap(object);
values.put("hasChildrenHint", hasChildrenHint);
values.put("children", children);
return values;
}
private static void printLazyNode(PrintWriter out, LazyBrowseNode node, int level) {
GalaxyObject obj = node.getObject();
out.printf(
"%s%d\t%s\t%s\t(attrs=%d, hasChildrenHint=%b)%n",
" ".repeat(level),
obj.getGobjectId(),
obj.getTagName(),
obj.getBrowseName(),
obj.getAttributesCount(),
node.hasChildrenHint());
if (node.isExpanded()) {
for (LazyBrowseNode child : node.getChildren()) {
printLazyNode(out, child, level + 1);
}
}
}
private static void printBrowseChild(PrintWriter out, BrowseChild child) {
GalaxyObject obj = child.object();
out.printf(
"%d\t%s\t%s\t(attrs=%d, hasChildrenHint=%b)%n",
obj.getGobjectId(),
obj.getTagName(),
obj.getBrowseName(),
obj.getAttributesCount(),
child.hasChildrenHint());
}
private static List<Integer> parseOptionalIntList(String value) {
if (value == null || value.isBlank()) {
return List.of();
}
return parseIntList(value);
}
private static List<String> parseOptionalStringList(String value) {
if (value == null || value.isBlank()) {
return List.of();
}
return parseStringList(value);
}
@Command(
name = "galaxy-watch",
description = "Streams GalaxyRepository.WatchDeployEvents until cancelled.")
@@ -622,6 +907,31 @@ public final class MxGatewayCli implements Callable<Integer> {
}
}
@Command(name = "ping", description = "Sends a diagnostic ping command to the session worker.")
static final class PingCommandLine extends GatewayCommand {
@Option(names = "--session-id", required = true, description = "Gateway session id.")
String sessionId;
@Option(names = "--message", defaultValue = "ping", description = "Message echoed back in the reply.")
String message;
PingCommandLine(MxGatewayCliClientFactory clientFactory) {
super(clientFactory);
}
@Override
public Integer call() {
try (MxGatewayCliClient client = clientFactory.connect(common.resolved())) {
MxCommandReply reply = client.session(sessionId).pingRaw(message);
// The worker echoes the message in the diagnostic message field;
// there is no dedicated ping reply payload, so the plain-text path
// surfaces that field.
writeOutput("ping", common, json, reply, reply::getDiagnosticMessage);
}
return 0;
}
}
@Command(name = "register", description = "Invokes MXAccess Register.")
static final class RegisterCommand extends GatewayCommand {
@Option(names = "--session-id", required = true, description = "Gateway session id.")
@@ -1438,6 +1748,8 @@ public final class MxGatewayCli implements Callable<Integer> {
}
interface MxGatewayCliSession {
MxCommandReply pingRaw(String message);
int register(String clientName);
MxCommandReply registerRaw(String clientName);
@@ -1523,6 +1835,14 @@ public final class MxGatewayCli implements Callable<Integer> {
}
record GrpcMxGatewayCliSession(MxGatewaySession session) implements MxGatewayCliSession {
@Override
public MxCommandReply pingRaw(String message) {
return session.invokeCommand(MxCommand.newBuilder()
.setKind(MxCommandKind.MX_COMMAND_KIND_PING)
.setPing(PingCommand.newBuilder().setMessage(message))
.build());
}
@Override
public int register(String clientName) {
return session.register(clientName);
@@ -6,6 +6,7 @@ import static org.junit.jupiter.api.Assertions.assertTrue;
import com.zb.mom.ww.mxgateway.client.MxGatewayAlarmFeedSubscription;
import com.zb.mom.ww.mxgateway.client.MxGatewayClientOptions;
import galaxy_repository.v1.GalaxyRepositoryOuterClass.GalaxyObject;
import io.grpc.stub.StreamObserver;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
@@ -15,6 +16,7 @@ import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import mxaccess_gateway.v1.MxaccessGateway.AcknowledgeAlarmReply;
import mxaccess_gateway.v1.MxaccessGateway.AcknowledgeAlarmRequest;
import mxaccess_gateway.v1.MxaccessGateway.ActiveAlarmSnapshot;
@@ -124,6 +126,166 @@ final class MxGatewayCliTests {
assertTrue(run.output().contains("\"itemHandle\":7"));
}
// ---- ping subcommand (D4) ----
@Test
void pingCommandForwardsMessageAndPrintsEcho() {
FakeClientFactory factory = new FakeClientFactory();
CliRun run = execute(
factory, "ping", "--session-id", "session-cli", "--message", "hello-mxgw");
assertEquals(0, run.exitCode());
assertEquals("hello-mxgw", factory.client.session.lastPingMessage);
// The worker echoes the message in the diagnostic message field; the
// plain-text path surfaces exactly that echoed value.
assertEquals("hello-mxgw", run.output().trim());
}
@Test
void pingCommandDefaultsMessageToPing() {
FakeClientFactory factory = new FakeClientFactory();
CliRun run = execute(factory, "ping", "--session-id", "session-cli");
assertEquals(0, run.exitCode());
assertEquals("ping", factory.client.session.lastPingMessage);
}
@Test
void pingCommandJsonIncludesPingKindAndDiagnosticMessage() {
FakeClientFactory factory = new FakeClientFactory();
CliRun run = execute(
factory, "ping", "--session-id", "session-cli", "--message", "diag-1", "--json");
assertEquals(0, run.exitCode());
String out = run.output();
assertTrue(out.contains("\"command\":\"ping\""), out);
assertTrue(out.contains("\"kind\":\"MX_COMMAND_KIND_PING\""), out);
assertTrue(out.contains("diag-1"), out);
}
// ---- galaxy-browse subcommand (D8-java) ----
@Test
void galaxyBrowseNodeJsonUsesHasChildrenHintKeyAndFlattensObjectFields() {
GalaxyObject object = GalaxyObject.newBuilder()
.setGobjectId(101)
.setTagName("Area001")
.setBrowseName("Area001")
.build();
Map<String, Object> leaf = MxGatewayCli.browseNodeMap(
GalaxyObject.newBuilder().setGobjectId(202).setTagName("Pump001").build(),
false,
List.of());
Map<String, Object> node = MxGatewayCli.browseNodeMap(object, true, List.of(leaf));
// Cross-client JSON parity: the per-node "has children" flag MUST use the
// key hasChildrenHint (Rust / Python / .NET / Go all standardized on it).
assertTrue(node.containsKey("hasChildrenHint"), node.toString());
assertEquals(Boolean.TRUE, node.get("hasChildrenHint"));
// Object fields are flattened directly into the node (matching the
// galaxy-discover object shape), not nested under an "object" key.
assertFalse(node.containsKey("object"), node.toString());
assertEquals(101L, ((Number) node.get("gobjectId")).longValue());
assertEquals("Area001", node.get("tagName"));
// Nested children array carries the same node shape recursively.
@SuppressWarnings("unchecked")
List<Map<String, Object>> children = (List<Map<String, Object>>) node.get("children");
assertEquals(1, children.size());
assertTrue(children.get(0).containsKey("hasChildrenHint"));
assertEquals(Boolean.FALSE, children.get(0).get("hasChildrenHint"));
}
@Test
void galaxyBrowseInvocationsParseCleanly() {
// galaxy-browse connects via GalaxyRepositoryClient.connect (a static),
// so the full surface is exercised only by the cross-language matrix
// against a live gateway. Here we assert the option surface parses.
assertReadmeExampleParses(new String[] {"galaxy-browse", "--json"});
assertReadmeExampleParses(new String[] {"galaxy-browse", "--parent", "42", "--json"});
assertReadmeExampleParses(new String[] {
"galaxy-browse",
"--depth", "2",
"--category-ids", "1,2",
"--template-contains", "$Pump",
"--tag-name-glob", "Area%",
"--alarm-bearing-only",
"--historized-only",
"--include-attributes",
"--json"
});
}
@Test
void galaxyBrowseNegativeDepthYieldsNonZeroExitViaParameterException() {
// --depth validation must surface as a picocli ParameterException
// (clean error line on stderr) rather than an unhandled IllegalArgumentException
// stack trace. Picocli maps ParameterException to exit code 2.
CliRun run = execute(new FakeClientFactory(), "galaxy-browse", "--depth", "-1");
assertFalse(run.exitCode() == 0, "expected non-zero exit for --depth -1");
// Picocli writes ParameterException messages to the error writer.
assertTrue(run.errors().contains("--depth"), "expected --depth in error output: " + run.errors());
}
@Test
void galaxyBrowseDepthAbove50YieldsNonZeroExit() {
CliRun run = execute(new FakeClientFactory(), "galaxy-browse", "--depth", "51");
assertFalse(run.exitCode() == 0, "expected non-zero exit for --depth 51");
assertTrue(run.errors().contains("--depth"), "expected --depth in error output: " + run.errors());
}
@Test
void galaxyBrowseParentZeroEmitsWarningToStderr() {
// --parent 0 is the server sentinel for roots; passing it explicitly is
// almost certainly a mistake. The CLI must print a warning to stderr
// (matching Go/Rust client behaviour) but still attempt the call.
//
// GalaxyBrowseCommand connects to a real GalaxyRepositoryClient, so with
// no reachable gateway the call fails after the warning is printed. The
// exit code therefore depends on the environment and is not asserted;
// only the stderr warning is.
StringWriter output = new StringWriter();
StringWriter errors = new StringWriter();
MxGatewayCli.execute(
new FakeClientFactory(),
new PrintWriter(output, true),
new PrintWriter(errors, true),
"galaxy-browse", "--parent", "0", "--depth", "1");
assertTrue(
errors.toString().contains("--parent 0"),
"expected '--parent 0' warning on stderr; got: " + errors);
}
// ---- galaxy command-name aliases (D9-java) ----
@Test
void galaxyTestConnectionCanonicalAndDeprecatedAliasResolve() {
picocli.CommandLine commandLine = MxGatewayCli.commandLine(new FakeClientFactory());
// Both the canonical dash-separated name and the deprecated short alias
// must resolve to the same subcommand so existing scripts keep working.
assertTrue(commandLine.getSubcommands().containsKey("galaxy-test-connection"));
assertTrue(commandLine.getSubcommands().containsKey("galaxy-test"));
assertEquals(
commandLine.getSubcommands().get("galaxy-test-connection"),
commandLine.getSubcommands().get("galaxy-test"));
}
@Test
void galaxyLastDeployCanonicalAndDeprecatedAliasResolve() {
picocli.CommandLine commandLine = MxGatewayCli.commandLine(new FakeClientFactory());
assertTrue(commandLine.getSubcommands().containsKey("galaxy-last-deploy"));
assertTrue(commandLine.getSubcommands().containsKey("galaxy-deploy-time"));
assertEquals(
commandLine.getSubcommands().get("galaxy-last-deploy"),
commandLine.getSubcommands().get("galaxy-deploy-time"));
}
@Test
void subscribeBulkCommandPrintsResults() {
CliRun run = execute(
@@ -652,6 +814,19 @@ final class MxGatewayCliTests {
private boolean addItemCalled;
private boolean adviseCalled;
private MxValue lastWriteValue;
private String lastPingMessage;
@Override
public MxCommandReply pingRaw(String message) {
lastPingMessage = message;
// The worker echoes the request message in the diagnostic message
// field; there is no dedicated ping reply payload.
return MxCommandReply.newBuilder()
.setKind(MxCommandKind.MX_COMMAND_KIND_PING)
.setProtocolStatus(ok())
.setDiagnosticMessage(message)
.build();
}
@Override
public int register(String clientName) {
@@ -214,9 +214,33 @@ The method returns an async iterator yielding the generated `DeployEvent`
proto. Breaking out of the loop, calling `aclose()` on the iterator, or
cancelling the surrounding task closes the underlying gRPC stream
cleanly. The streaming RPC requires the same `metadata:read` scope as
the other Galaxy methods.
The CLI exposes the Galaxy Repository RPCs through five subcommands that
mirror the other clients:
```bash
mxgw-py galaxy-test-connection --plaintext --json
mxgw-py galaxy-last-deploy --plaintext --json
mxgw-py galaxy-discover --plaintext --json
mxgw-py galaxy-browse --plaintext --json
mxgw-py galaxy-watch --plaintext --json
```
`galaxy-watch` is bounded by `--max-events` (default `1`) and a per-event
`--timeout` (seconds, default `5`) so it always terminates; pass `--last-seen-deploy-time` (an
ISO-8601 timestamp) to suppress the bootstrap event when it matches the
current cached deploy time.
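The bounded collection behind `--max-events` and `--timeout` can be sketched as a loop that wraps each `__anext__()` call in `asyncio.wait_for`. `collect_bounded` is a hypothetical helper for illustration, not part of the package; the CLI's own collector additionally rejects `--max-events` values above an aggregate cap.

```python
import asyncio
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


async def collect_bounded(events: AsyncIterator[T], *, max_events: int, timeout: float) -> list[T]:
    """Collect at most max_events items, stopping early if any single wait exceeds timeout."""
    collected: list[T] = []
    try:
        while len(collected) < max_events:
            collected.append(await asyncio.wait_for(events.__anext__(), timeout=timeout))
    except (StopAsyncIteration, asyncio.TimeoutError):
        pass  # stream ended or went quiet: return what we have so far
    finally:
        # Close the underlying stream whether we hit the bound, the timeout, or the end.
        aclose = getattr(events, "aclose", None)
        if aclose is not None:
            await aclose()
    return collected


async def demo(max_events: int) -> list[int]:
    async def stream():
        yield 1
        yield 2
        await asyncio.sleep(10)  # simulates a stream that goes quiet
        yield 3

    return await collect_bounded(stream(), max_events=max_events, timeout=0.05)


print(asyncio.run(demo(5)))  # [1, 2]
```

Because the timeout applies per event, a stream that keeps producing events slowly can still run for up to roughly `max_events * timeout` seconds in total.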
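`galaxy-browse` wraps the lazy `LazyBrowseNode` walker. Without `--depth`
it lists only the root objects; `--depth N` (0 to 50) eagerly expands `N`
further levels before printing. `--parent-gobject-id ID` instead fetches a
single level of that parent's direct children via `BrowseChildren` and
ignores `--depth`. Text output is a node count followed by an indented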
tree (`+`/`-` marks the server's has-children hint); `--json` emits nested
`{..., "hasChildrenHint": bool, "children": [...]}` nodes that match the
`galaxy-discover` object shape. The `BrowseChildrenRequest` filters are
exposed as `--category-id` (repeatable), `--template-chain-contains`
(repeatable), `--tag-name-glob`, `--include-attributes`,
`--alarm-bearing-only`, and `--historized-only`, all AND-combined.
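A minimal sketch of the depth-limited expansion and the text rendering, assuming a hypothetical in-memory `HIERARCHY` map in place of `BrowseChildren` and a simplified `LazyNode` in place of `LazyBrowseNode` (one name per object instead of tag and browse names). A node is expanded only when its has-children hint is set, and each expanded level indents its children by two spaces.

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical hierarchy standing in for BrowseChildren: parent id -> child ids.
HIERARCHY = {1: [2, 3], 2: [4], 3: [], 4: []}
NAMES = {1: "Area001", 2: "Pump001", 3: "Valve001", 4: "Motor001"}


@dataclass
class LazyNode:
    """Hypothetical stand-in for LazyBrowseNode: children load only on expand()."""
    gobject_id: int
    children: list["LazyNode"] = field(default_factory=list)
    is_expanded: bool = False

    @property
    def has_children_hint(self) -> bool:
        return bool(HIERARCHY[self.gobject_id])

    async def expand(self) -> None:
        self.children = [LazyNode(child) for child in HIERARCHY[self.gobject_id]]
        self.is_expanded = True


async def expand_to_depth(node: LazyNode, depth: int) -> None:
    """Expand `depth` further levels; depth 0 leaves the node collapsed."""
    if depth <= 0 or not node.has_children_hint:
        return
    await node.expand()
    for child in node.children:
        await expand_to_depth(child, depth - 1)


def render(roots: list[LazyNode]) -> list[str]:
    """Node count first, then one '+'/'-' line per node, indented two spaces per level."""
    lines = [str(len(roots))]

    def walk(node: LazyNode, indent: int) -> None:
        marker = "+" if node.has_children_hint else "-"
        lines.append(f"{' ' * indent}{marker} {NAMES[node.gobject_id]} (gobject {node.gobject_id})")
        for child in (node.children if node.is_expanded else []):
            walk(child, indent + 2)

    for root in roots:
        walk(root, 0)
    return lines


async def browse(depth: int) -> list[str]:
    roots = [LazyNode(1)]
    for root in roots:
        await expand_to_depth(root, depth)
    return render(roots)


print("\n".join(asyncio.run(browse(2))))
```

`browse(0)` yields only the root line, while `browse(2)` also loads `Pump001`'s child `Motor001`; `Valve001` is never expanded because its hint is false.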
## Authentication And TLS
@@ -21,8 +21,10 @@ from zb_mom_ww_mxgateway import __version__
from zb_mom_ww_mxgateway.auth import redact_secret
from zb_mom_ww_mxgateway.client import GatewayClient
from zb_mom_ww_mxgateway.errors import MxGatewayError
from zb_mom_ww_mxgateway.galaxy import GalaxyRepositoryClient
from zb_mom_ww_mxgateway.generated import galaxy_repository_pb2 as galaxy_pb
from zb_mom_ww_mxgateway.generated import mxaccess_gateway_pb2 as pb
from zb_mom_ww_mxgateway.options import BrowseChildrenOptions, ClientOptions
from zb_mom_ww_mxgateway.values import MxValueInput, to_mx_value
logger = logging.getLogger(__name__)
@@ -512,6 +514,148 @@ def smoke(**kwargs: Any) -> None:
_run(_smoke(**kwargs), output_json=kwargs["output_json"], secrets=_secrets(kwargs))
@main.command("galaxy-test-connection")
@gateway_options
@click.option("--json", "output_json", is_flag=True, help="Emit JSON output.")
def galaxy_test_connection(**kwargs: Any) -> None:
"""Test whether the gateway can reach the Galaxy Repository DB."""
_run(
_galaxy_test_connection(**kwargs),
output_json=kwargs["output_json"],
secrets=_secrets(kwargs),
)
@main.command("galaxy-last-deploy")
@gateway_options
@click.option("--json", "output_json", is_flag=True, help="Emit JSON output.")
def galaxy_last_deploy(**kwargs: Any) -> None:
"""Read the last Galaxy deploy timestamp."""
_run(
_galaxy_last_deploy(**kwargs),
output_json=kwargs["output_json"],
secrets=_secrets(kwargs),
)
@main.command("galaxy-discover")
@gateway_options
@click.option("--json", "output_json", is_flag=True, help="Emit JSON output.")
def galaxy_discover(**kwargs: Any) -> None:
"""Enumerate the deployed Galaxy object hierarchy."""
_run(
_galaxy_discover(**kwargs),
output_json=kwargs["output_json"],
secrets=_secrets(kwargs),
)
@main.command("galaxy-browse")
@gateway_options
@click.option(
"--parent-gobject-id",
"parent_gobject_id",
default=None,
type=int,
help=(
"Fetch one level of this parent's direct children via BrowseChildren "
"instead of the lazy root walk. Pass a gobject id >= 1. "
"(gobject-id 0 is the server root sentinel — omit the flag to list root objects.) "
"--depth is ignored when this option is set."
),
)
@click.option(
"--depth",
default=0,
type=int,
show_default=True,
help="Eagerly expand the root nodes this many further levels (0 to 50) before printing.",
)
@click.option(
"--category-id",
"category_ids",
multiple=True,
type=int,
help="Restrict to objects whose category_id matches one of these ids (repeatable).",
)
@click.option(
"--template-chain-contains",
"template_chain_contains",
multiple=True,
help="Restrict to objects whose template chain contains this entry (repeatable).",
)
@click.option(
"--tag-name-glob",
"tag_name_glob",
default=None,
help="Restrict to objects whose tag name matches this glob.",
)
@click.option(
"--include-attributes",
"include_attributes",
is_flag=True,
help="Include each object's attribute metadata in the browse results.",
)
@click.option(
"--alarm-bearing-only",
"alarm_bearing_only",
is_flag=True,
help="Only return objects that own at least one alarm-bearing attribute.",
)
@click.option(
"--historized-only",
"historized_only",
is_flag=True,
help="Only return objects that own at least one historized attribute.",
)
@click.option("--json", "output_json", is_flag=True, help="Emit JSON output.")
def galaxy_browse(**kwargs: Any) -> None:
"""Browse the deployed Galaxy object hierarchy as a lazy-expanded tree."""
_run(
_galaxy_browse(**kwargs),
output_json=kwargs["output_json"],
secrets=_secrets(kwargs),
)
@main.command("galaxy-watch")
@gateway_options
@click.option(
"--last-seen-deploy-time",
"last_seen_deploy_time",
default=None,
help="ISO-8601 timestamp; when it matches the current cached deploy time the "
"bootstrap event is suppressed.",
)
@click.option(
"--max-events",
default=1,
type=int,
show_default=True,
help="Stop after collecting this many deploy events.",
)
@click.option(
"--timeout",
default=5.0,
type=float,
show_default=True,
help="Seconds to wait for each event before stopping.",
)
@click.option("--json", "output_json", is_flag=True, help="Emit JSON output.")
def galaxy_watch(**kwargs: Any) -> None:
"""Stream a bounded number of Galaxy deploy events."""
_run(
_galaxy_watch(**kwargs),
output_json=kwargs["output_json"],
secrets=_secrets(kwargs),
)
async def _open_session(**kwargs: Any) -> dict[str, Any]:
async with await _connect(kwargs) as client:
reply = await client.open_session_raw(
@@ -922,6 +1066,215 @@ async def _smoke(**kwargs: Any) -> dict[str, Any]:
await session.close()
async def _galaxy_test_connection(**kwargs: Any) -> dict[str, Any]:
async with await _connect_galaxy(kwargs) as galaxy:
ok = await galaxy.test_connection()
return {"command": "galaxy-test-connection", "ok": ok}
async def _galaxy_last_deploy(**kwargs: Any) -> dict[str, Any]:
async with await _connect_galaxy(kwargs) as galaxy:
last_deploy = await galaxy.get_last_deploy_time()
payload: dict[str, Any] = {
"command": "galaxy-last-deploy",
"present": last_deploy is not None,
}
if last_deploy is not None:
# galaxy.py returns a timezone-NAIVE UTC datetime (protobuf ToDatetime()).
# Stamp it as UTC so the emitted ISO-8601 carries an unambiguous offset,
# matching the Go client's "...Z" output.
payload["timeOfLastDeploy"] = last_deploy.replace(tzinfo=timezone.utc).isoformat()
return payload
async def _galaxy_discover(**kwargs: Any) -> dict[str, Any]:
async with await _connect_galaxy(kwargs) as galaxy:
objects = await galaxy.discover_hierarchy()
return {
"command": "galaxy-discover",
"objects": [_message_dict(obj) for obj in objects],
}
async def _galaxy_browse(**kwargs: Any) -> dict[str, Any]:
depth = int(kwargs["depth"])
if depth < 0 or depth > 50:
raise click.BadParameter("--depth must be between 0 and 50", param_hint="--depth")
parent_gobject_id: int | None = kwargs.get("parent_gobject_id")
options = BrowseChildrenOptions(
category_ids=tuple(kwargs.get("category_ids") or ()),
template_chain_contains=tuple(kwargs.get("template_chain_contains") or ()),
tag_name_glob=kwargs.get("tag_name_glob"),
include_attributes=True if kwargs.get("include_attributes") else None,
alarm_bearing_only=bool(kwargs.get("alarm_bearing_only")),
historized_only=bool(kwargs.get("historized_only")),
)
async with await _connect_galaxy(kwargs) as galaxy:
if parent_gobject_id is not None:
# Single-level parent drill-down: drive BrowseChildren paging by hand
# and return the children as a flat list. --depth is not meaningful
# here; warn if the caller set it so they know it was ignored.
if depth > 0:
click.echo(
"warning: --depth is ignored when --parent-gobject-id is specified",
err=True,
)
children = await _browse_children_one_level(galaxy, parent_gobject_id, options)
return {
"command": "galaxy-browse",
"nodes": [_browse_child_dict(obj, hint) for obj, hint in children],
"_text": _render_browse_children(children),
}
roots = await galaxy.browse(options)
for root in roots:
await _expand_to_depth(root, depth)
return {
"command": "galaxy-browse",
"nodes": [_browse_node_dict(node) for node in roots],
"_text": _render_browse_tree(roots),
}
async def _expand_to_depth(node: Any, depth: int) -> None:
"""Recursively expand a LazyBrowseNode up to ``depth`` further levels.
``depth == 0`` leaves the node unexpanded so only the requested level is
printed; each level beyond fetches and recurses into the loaded children.
"""
if depth <= 0:
return
if node.has_children_hint:
await node.expand()
for child in node.children:
await _expand_to_depth(child, depth - 1)
def _browse_node_dict(node: Any) -> dict[str, Any]:
"""Render one LazyBrowseNode (and any already-expanded descendants).
Mirrors the ``galaxy-discover`` object shape with an added
``hasChildrenHint`` flag and a nested ``children`` array, matching the
cross-client browse JSON surface.
"""
payload = _message_dict(node.object)
payload["hasChildrenHint"] = bool(node.has_children_hint)
payload["children"] = (
[_browse_node_dict(child) for child in node.children] if node.is_expanded else []
)
return payload
def _render_browse_tree(roots: list[Any]) -> str:
"""Render the lazy-browse roots as a node count plus an indented tree."""
lines: list[str] = [str(len(roots))]
for root in roots:
_append_browse_node_lines(root, 0, lines)
return "\n".join(lines)
def _append_browse_node_lines(node: Any, indent: int, lines: list[str]) -> None:
obj = node.object
marker = "+" if node.has_children_hint else "-"
pad = " " * indent
lines.append(f"{pad}{marker} {obj.tag_name} {obj.browse_name} (gobject {obj.gobject_id})")
if node.is_expanded:
for child in node.children:
_append_browse_node_lines(child, indent + 2, lines)
_BROWSE_CHILDREN_PAGE_SIZE = 500
async def _browse_children_one_level(
galaxy: Any,
parent_gobject_id: int,
options: BrowseChildrenOptions,
) -> list[tuple[Any, bool]]:
"""Page through BrowseChildren for ``parent_gobject_id`` and return (object, hint) pairs.
Uses page size 500 (matching the library constant) and guards against a
repeated page token to prevent an infinite loop if the server misbehaves.
"""
results: list[tuple[Any, bool]] = []
seen_page_tokens: set[str] = set()
page_token = ""
while True:
request = galaxy_pb.BrowseChildrenRequest(
parent_gobject_id=parent_gobject_id,
page_size=_BROWSE_CHILDREN_PAGE_SIZE,
page_token=page_token,
alarm_bearing_only=options.alarm_bearing_only,
historized_only=options.historized_only,
)
if options.category_ids:
request.category_ids.extend(options.category_ids)
if options.template_chain_contains:
request.template_chain_contains.extend(options.template_chain_contains)
if options.tag_name_glob:
request.tag_name_glob = options.tag_name_glob
if options.include_attributes is not None:
request.include_attributes = options.include_attributes
reply = await galaxy.browse_children_raw(request)
for index, obj in enumerate(reply.children):
hint = index < len(reply.child_has_children) and bool(reply.child_has_children[index])
results.append((obj, hint))
page_token = reply.next_page_token
if not page_token:
return results
if page_token in seen_page_tokens:
raise MxGatewayError(
f"galaxy browse children returned repeated page token {page_token!r}"
)
seen_page_tokens.add(page_token)
def _browse_child_dict(obj: Any, has_children_hint: bool) -> dict[str, Any]:
"""Render one raw browse child as a node dict matching the lazy-browse shape.
The ``children`` array is always empty: the parent drill-down path returns
a flat single-level listing without recursive expansion.
"""
payload = _message_dict(obj)
payload["hasChildrenHint"] = has_children_hint
payload["children"] = []
return payload
def _render_browse_children(children: list[tuple[Any, bool]]) -> str:
"""Render a flat one-level child list as a count line plus marker lines."""
lines: list[str] = [str(len(children))]
for obj, has_children_hint in children:
marker = "+" if has_children_hint else "-"
lines.append(f"{marker} {obj.tag_name} {obj.browse_name} (gobject {obj.gobject_id})")
return "\n".join(lines)
async def _galaxy_watch(**kwargs: Any) -> dict[str, Any]:
last_seen = kwargs.get("last_seen_deploy_time")
last_seen_dt = _parse_datetime(last_seen) if last_seen else None
async with await _connect_galaxy(kwargs) as galaxy:
events = await _collect_deploy_events(
galaxy.watch_deploy_events(last_seen_dt),
max_events=kwargs["max_events"],
timeout=kwargs["timeout"],
)
return {
"command": "galaxy-watch",
"events": [_message_dict(event) for event in events],
}
async def _connect(kwargs: dict[str, Any]) -> GatewayClient:
api_key = kwargs.get("api_key") or _api_key_from_env(kwargs.get("api_key_env"))
return await GatewayClient.connect(
@@ -938,6 +1291,22 @@ async def _connect(kwargs: dict[str, Any]) -> GatewayClient:
)
async def _connect_galaxy(kwargs: dict[str, Any]) -> GalaxyRepositoryClient:
api_key = kwargs.get("api_key") or _api_key_from_env(kwargs.get("api_key_env"))
return await GalaxyRepositoryClient.connect(
ClientOptions(
endpoint=kwargs["endpoint"],
api_key=api_key,
plaintext=_use_plaintext(kwargs),
ca_file=kwargs.get("ca_file"),
require_certificate_validation=bool(kwargs.get("require_certificate_validation")),
server_name_override=kwargs.get("server_name_override"),
call_timeout=kwargs.get("call_timeout"),
stream_timeout=kwargs.get("stream_timeout"),
),
)
def _session(client: GatewayClient, session_id: str):
from zb_mom_ww_mxgateway.session import Session
@@ -995,11 +1364,17 @@ def _emit(
output_json: bool,
text: str | None = None,
) -> None:
# A payload may carry a pre-rendered text representation under the private
# "_text" key (used by commands like galaxy-browse whose text output is a
# custom indented tree rather than the default JSON dump). Strip it so it
# never leaks into the JSON branch.
rendered_text = payload.pop("_text", None) if isinstance(payload, dict) else None
if output_json:
click.echo(json.dumps(payload, sort_keys=True))
return
click.echo(text or rendered_text or json.dumps(payload, sort_keys=True))
async def _collect_events(
@@ -1058,6 +1433,34 @@ async def _collect_alarm_messages(
return collected
async def _collect_deploy_events(
events: Any,
*,
max_events: int,
timeout: float,
) -> list[galaxy_pb.DeployEvent]:
if max_events > MAX_AGGREGATE_EVENTS:
raise click.BadParameter(
f"must be less than or equal to {MAX_AGGREGATE_EVENTS}",
param_hint="--max-events",
)
collected: list[galaxy_pb.DeployEvent] = []
iterator = events.__aiter__()
try:
while len(collected) < max_events:
collected.append(await asyncio.wait_for(iterator.__anext__(), timeout=timeout))
except StopAsyncIteration:
pass
except asyncio.TimeoutError:
pass
finally:
close = getattr(iterator, "aclose", None)
if close is not None:
await close()
return collected
def _parse_value(raw_value: str, value_type: str) -> MxValueInput:
normalized = value_type.lower()
if normalized == "bool":
@@ -211,3 +211,437 @@ def test_batch_continues_after_error_line() -> None:
# Second block: successful version JSON.
version_payload = json.loads(blocks[1].strip())
assert version_payload["version"] == __version__
class _FakeGalaxyClient:
"""Minimal async-context-manager fake satisfying the galaxy command bodies."""
def __init__(
self,
*,
ok: bool = True,
objects=None,
last_deploy=None,
events=None,
browse_roots=None,
browse_children_pages=None,
) -> None:
self._ok = ok
self._objects = objects or []
self._last_deploy = last_deploy
self._events = events or []
self._browse_roots = browse_roots or []
# List of BrowseChildrenReply-like objects to serve in order (paged).
self._browse_children_pages = browse_children_pages or []
self._browse_children_calls: list = []
self.browse_options = None
async def __aenter__(self) -> "_FakeGalaxyClient":
return self
async def __aexit__(self, *_exc: object) -> None:
return None
async def test_connection(self) -> bool:
return self._ok
async def discover_hierarchy(self):
return self._objects
async def browse(self, options=None):
self.browse_options = options
return self._browse_roots
async def browse_children_raw(self, request):
"""Return the next queued BrowseChildrenReply page; raises if queue empty."""
self._browse_children_calls.append(request)
if not self._browse_children_pages:
raise AssertionError("browse_children_raw called but no pages queued")
return self._browse_children_pages.pop(0)
async def get_last_deploy_time(self):
# Mirrors galaxy.py: protobuf ToDatetime() yields a timezone-NAIVE UTC datetime.
return self._last_deploy
def watch_deploy_events(self, _last_seen_deploy_time=None):
events = self._events
async def _iter():
for event in events:
yield event
return _iter()
def _patch_galaxy_connect(monkeypatch: pytest.MonkeyPatch, fake: _FakeGalaxyClient) -> None:
async def fake_connect(options, **_kwargs):
return fake
monkeypatch.setattr(commands_module.GalaxyRepositoryClient, "connect", fake_connect)
def test_galaxy_test_connection_emits_ok(monkeypatch: pytest.MonkeyPatch) -> None:
_patch_galaxy_connect(monkeypatch, _FakeGalaxyClient(ok=True))
result = CliRunner().invoke(
main,
["galaxy-test-connection", "--plaintext", "--json"],
)
assert result.exit_code == 0, result.output
payload = json.loads(result.output)
assert payload == {"command": "galaxy-test-connection", "ok": True}
def test_galaxy_discover_serializes_objects(monkeypatch: pytest.MonkeyPatch) -> None:
from zb_mom_ww_mxgateway.generated import galaxy_repository_pb2 as galaxy_pb
objects = [
galaxy_pb.GalaxyObject(gobject_id=7, tag_name="Area001", contained_name="Area001"),
galaxy_pb.GalaxyObject(gobject_id=8, tag_name="Pump001", contained_name="Pump001"),
]
_patch_galaxy_connect(monkeypatch, _FakeGalaxyClient(objects=objects))
result = CliRunner().invoke(
main,
["galaxy-discover", "--plaintext", "--json"],
)
assert result.exit_code == 0, result.output
payload = json.loads(result.output)
assert payload["command"] == "galaxy-discover"
assert len(payload["objects"]) == 2
assert payload["objects"][0]["tagName"] == "Area001"
assert payload["objects"][1]["gobjectId"] == 8
def test_galaxy_last_deploy_emits_utc_iso(monkeypatch: pytest.MonkeyPatch) -> None:
"""The naive-UTC deploy time from the library must be emitted as unambiguous UTC ISO-8601."""
from datetime import datetime
naive_utc = datetime(2025, 6, 15, 12, 0, 0) # noqa: DTZ001 -- mirrors protobuf ToDatetime()
_patch_galaxy_connect(monkeypatch, _FakeGalaxyClient(last_deploy=naive_utc))
result = CliRunner().invoke(
main,
["galaxy-last-deploy", "--plaintext", "--json"],
)
assert result.exit_code == 0, result.output
payload = json.loads(result.output)
assert payload["command"] == "galaxy-last-deploy"
assert payload["present"] is True
assert payload["timeOfLastDeploy"].endswith(("+00:00", "Z"))
def test_galaxy_watch_serializes_deploy_events(monkeypatch: pytest.MonkeyPatch) -> None:
from zb_mom_ww_mxgateway.generated import galaxy_repository_pb2 as galaxy_pb
events = [galaxy_pb.DeployEvent(sequence=1)]
_patch_galaxy_connect(monkeypatch, _FakeGalaxyClient(events=events))
result = CliRunner().invoke(
main,
["galaxy-watch", "--plaintext", "--max-events", "1", "--json"],
)
assert result.exit_code == 0, result.output
payload = json.loads(result.output)
assert payload["command"] == "galaxy-watch"
assert len(payload["events"]) == 1
class _FakeBrowseNode:
"""Minimal stand-in for LazyBrowseNode covering the CLI render path."""
def __init__(self, obj, *, has_children_hint=False, children=None) -> None:
self._object = obj
self._has_children_hint = has_children_hint
self._children = list(children or [])
self._is_expanded = bool(children)
self.expand_calls = 0
@property
def object(self):
return self._object
@property
def has_children_hint(self) -> bool:
return self._has_children_hint
@property
def children(self):
return list(self._children)
@property
def is_expanded(self) -> bool:
return self._is_expanded
async def expand(self) -> None:
self.expand_calls += 1
self._is_expanded = True
def test_galaxy_browse_serializes_nested_nodes(monkeypatch: pytest.MonkeyPatch) -> None:
from zb_mom_ww_mxgateway.generated import galaxy_repository_pb2 as galaxy_pb
child = _FakeBrowseNode(
galaxy_pb.GalaxyObject(gobject_id=8, tag_name="Pump001", contained_name="Pump001"),
has_children_hint=False,
)
root = _FakeBrowseNode(
galaxy_pb.GalaxyObject(gobject_id=7, tag_name="Area001", contained_name="Area001"),
has_children_hint=True,
children=[child],
)
_patch_galaxy_connect(monkeypatch, _FakeGalaxyClient(browse_roots=[root]))
result = CliRunner().invoke(
main,
["galaxy-browse", "--plaintext", "--json"],
)
assert result.exit_code == 0, result.output
payload = json.loads(result.output)
assert "_text" not in payload
assert payload["command"] == "galaxy-browse"
assert len(payload["nodes"]) == 1
node = payload["nodes"][0]
assert node["tagName"] == "Area001"
assert node["hasChildrenHint"] is True
assert len(node["children"]) == 1
assert node["children"][0]["gobjectId"] == 8
assert node["children"][0]["children"] == []
def test_galaxy_browse_renders_indented_text_tree(monkeypatch: pytest.MonkeyPatch) -> None:
from zb_mom_ww_mxgateway.generated import galaxy_repository_pb2 as galaxy_pb
child = _FakeBrowseNode(
galaxy_pb.GalaxyObject(gobject_id=8, tag_name="Pump001", browse_name="Pump001"),
)
root = _FakeBrowseNode(
galaxy_pb.GalaxyObject(gobject_id=7, tag_name="Area001", browse_name="Area001"),
has_children_hint=True,
children=[child],
)
_patch_galaxy_connect(monkeypatch, _FakeGalaxyClient(browse_roots=[root]))
result = CliRunner().invoke(
main,
["galaxy-browse", "--plaintext"],
)
assert result.exit_code == 0, result.output
lines = result.output.splitlines()
assert lines[0] == "1"
assert lines[1] == "+ Area001 Area001 (gobject 7)"
assert lines[2] == "  - Pump001 Pump001 (gobject 8)"
def test_galaxy_browse_forwards_filter_options(monkeypatch: pytest.MonkeyPatch) -> None:
fake = _FakeGalaxyClient(browse_roots=[])
_patch_galaxy_connect(monkeypatch, fake)
result = CliRunner().invoke(
main,
[
"galaxy-browse",
"--plaintext",
"--category-id",
"10",
"--category-id",
"12",
"--template-chain-contains",
"$Pump",
"--tag-name-glob",
"Area*",
"--include-attributes",
"--alarm-bearing-only",
"--historized-only",
"--json",
],
)
assert result.exit_code == 0, result.output
options = fake.browse_options
assert tuple(options.category_ids) == (10, 12)
assert tuple(options.template_chain_contains) == ("$Pump",)
assert options.tag_name_glob == "Area*"
assert options.include_attributes is True
assert options.alarm_bearing_only is True
assert options.historized_only is True
def test_galaxy_browse_expands_to_depth(monkeypatch: pytest.MonkeyPatch) -> None:
from zb_mom_ww_mxgateway.generated import galaxy_repository_pb2 as galaxy_pb
root = _FakeBrowseNode(
galaxy_pb.GalaxyObject(gobject_id=7, tag_name="Area001"),
has_children_hint=True,
)
_patch_galaxy_connect(monkeypatch, _FakeGalaxyClient(browse_roots=[root]))
result = CliRunner().invoke(
main,
["galaxy-browse", "--plaintext", "--depth", "2", "--json"],
)
assert result.exit_code == 0, result.output
assert root.expand_calls == 1
def test_galaxy_commands_are_registered() -> None:
runner = CliRunner()
for command in (
"galaxy-test-connection",
"galaxy-last-deploy",
"galaxy-discover",
"galaxy-watch",
"galaxy-browse",
):
result = runner.invoke(main, [command, "--help"])
assert result.exit_code == 0, result.output
assert "--endpoint" in result.output
@pytest.mark.parametrize("depth_arg", ["99", "-1"])
def test_galaxy_browse_rejects_out_of_range_depth(
monkeypatch: pytest.MonkeyPatch,
depth_arg: str,
) -> None:
"""--depth values outside [0, 50] must be rejected with a non-zero exit."""
_patch_galaxy_connect(monkeypatch, _FakeGalaxyClient(browse_roots=[]))
result = CliRunner().invoke(
main,
["galaxy-browse", "--plaintext", "--depth", depth_arg, "--json"],
)
assert result.exit_code != 0
assert "--depth must be between 0 and 50" in result.output
# ---------------------------------------------------------------------------
# --parent-gobject-id drill-down tests
# ---------------------------------------------------------------------------
def _fake_browse_children_reply(children_and_hints, *, next_page_token=""):
"""Build a minimal fake BrowseChildrenReply-like object."""
from zb_mom_ww_mxgateway.generated import galaxy_repository_pb2 as galaxy_pb
reply = galaxy_pb.BrowseChildrenReply()
for obj, hint in children_and_hints:
reply.children.append(obj)
reply.child_has_children.append(hint)
reply.next_page_token = next_page_token
return reply
def test_galaxy_browse_parent_fetches_one_level_json(monkeypatch: pytest.MonkeyPatch) -> None:
    """--parent-gobject-id N calls browse_children_raw and renders one-level JSON."""
    from zb_mom_ww_mxgateway.generated import galaxy_repository_pb2 as galaxy_pb

    child_a = galaxy_pb.GalaxyObject(gobject_id=10, tag_name="PumpA", browse_name="PumpA")
    child_b = galaxy_pb.GalaxyObject(gobject_id=11, tag_name="PumpB", browse_name="PumpB")
    page = _fake_browse_children_reply([(child_a, True), (child_b, False)])
    fake = _FakeGalaxyClient(browse_children_pages=[page])
    _patch_galaxy_connect(monkeypatch, fake)
    result = CliRunner().invoke(
        main,
        ["galaxy-browse", "--plaintext", "--parent-gobject-id", "7", "--json"],
    )
    assert result.exit_code == 0, result.output
    payload = json.loads(result.output)
    # One BrowseChildren RPC was issued with the correct parent id.
    assert len(fake._browse_children_calls) == 1
    call_req = fake._browse_children_calls[0]
    assert call_req.parent_gobject_id == 7
    # JSON shape mirrors the lazy-browse node shape.
    assert payload["command"] == "galaxy-browse"
    nodes = payload["nodes"]
    assert len(nodes) == 2
    assert nodes[0]["tagName"] == "PumpA"
    assert nodes[0]["hasChildrenHint"] is True
    assert nodes[0]["children"] == []
    assert nodes[1]["gobjectId"] == 11
    assert nodes[1]["hasChildrenHint"] is False
    assert nodes[1]["children"] == []


def test_galaxy_browse_parent_renders_text_tree(monkeypatch: pytest.MonkeyPatch) -> None:
    """--parent-gobject-id N text output: count line then marker lines (no indent)."""
    from zb_mom_ww_mxgateway.generated import galaxy_repository_pb2 as galaxy_pb

    child = galaxy_pb.GalaxyObject(gobject_id=10, tag_name="PumpA", browse_name="PumpA")
    page = _fake_browse_children_reply([(child, False)])
    fake = _FakeGalaxyClient(browse_children_pages=[page])
    _patch_galaxy_connect(monkeypatch, fake)
    result = CliRunner().invoke(
        main,
        ["galaxy-browse", "--plaintext", "--parent-gobject-id", "7"],
    )
    assert result.exit_code == 0, result.output
    lines = result.output.splitlines()
    assert lines[0] == "1"
    assert lines[1] == "- PumpA PumpA (gobject 10)"


def test_galaxy_browse_parent_pages_correctly(monkeypatch: pytest.MonkeyPatch) -> None:
    """--parent-gobject-id loops on next_page_token until exhausted."""
    from zb_mom_ww_mxgateway.generated import galaxy_repository_pb2 as galaxy_pb

    child_a = galaxy_pb.GalaxyObject(gobject_id=10, tag_name="PumpA", browse_name="PumpA")
    child_b = galaxy_pb.GalaxyObject(gobject_id=11, tag_name="PumpB", browse_name="PumpB")
    page1 = _fake_browse_children_reply([(child_a, False)], next_page_token="tok1")
    page2 = _fake_browse_children_reply([(child_b, True)])
    fake = _FakeGalaxyClient(browse_children_pages=[page1, page2])
    _patch_galaxy_connect(monkeypatch, fake)
    result = CliRunner().invoke(
        main,
        ["galaxy-browse", "--plaintext", "--parent-gobject-id", "7", "--json"],
    )
    assert result.exit_code == 0, result.output
    assert len(fake._browse_children_calls) == 2
    # Second call must carry the page token from the first reply.
    assert fake._browse_children_calls[1].page_token == "tok1"
    payload = json.loads(result.output)
    assert len(payload["nodes"]) == 2


def test_galaxy_browse_parent_warns_when_depth_also_set(
    monkeypatch: pytest.MonkeyPatch,
) -> None:
    """When both --parent-gobject-id and --depth>0 are supplied a warning is emitted."""
    from zb_mom_ww_mxgateway.generated import galaxy_repository_pb2 as galaxy_pb

    child = galaxy_pb.GalaxyObject(gobject_id=10, tag_name="PumpA", browse_name="PumpA")
    page = _fake_browse_children_reply([(child, False)])
    fake = _FakeGalaxyClient(browse_children_pages=[page])
    _patch_galaxy_connect(monkeypatch, fake)
    # CliRunner mixes stderr into output in this Click version.
    result = CliRunner().invoke(
        main,
        ["galaxy-browse", "--plaintext", "--parent-gobject-id", "7", "--depth", "2", "--json"],
    )
    assert result.exit_code == 0, result.output
    assert "--depth is ignored" in result.output


def test_galaxy_browse_help_shows_parent_gobject_id() -> None:
    """--parent-gobject-id appears in the galaxy-browse --help output."""
    result = CliRunner().invoke(main, ["galaxy-browse", "--help"])
    assert result.exit_code == 0
    assert "--parent-gobject-id" in result.output
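The `--parent-gobject-id` tests above pin down a paging contract: follow `next_page_token` until it is empty, pair each child with its `child_has_children` hint (treating a missing hint as `False`), and refuse a token the server has already issued. The same contract is implemented by the CLI's Rust `browse_children_one_level` helper. The minimal sketch below models it in isolation. `Page` and `browse_one_level` are hypothetical stand-ins, not the client library's API, and plain strings replace `GalaxyObject` messages.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Set, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for one BrowseChildrenReply page."""

    children: Sequence[str]
    child_has_children: Sequence[bool] = field(default_factory=list)
    next_page_token: str = ""


def browse_one_level(fetch: Callable[[str], Page]) -> List[Tuple[str, bool]]:
    """Follow next_page_token until it is empty, pairing each child with its hint."""
    children: List[Tuple[str, bool]] = []
    seen: Set[str] = set()
    token = ""
    while True:
        page = fetch(token)
        hints = page.child_has_children
        for index, child in enumerate(page.children):
            # A missing hint entry defaults to False.
            hint = hints[index] if index < len(hints) else False
            children.append((child, hint))
        token = page.next_page_token
        if not token:
            return children
        # A server that repeats a token would otherwise make this loop forever.
        if token in seen:
            raise ValueError(f"repeated page token {token!r}")
        seen.add(token)


pages = {
    "": Page(["PumpA"], [False], next_page_token="tok1"),
    "tok1": Page(["PumpB"], [True]),
}
print(browse_one_level(pages.__getitem__))  # [('PumpA', False), ('PumpB', True)]
```

The repeated-token guard returns an error rather than silently stopping. That way, a misbehaving server surfaces as a failure instead of a truncated listing.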
@@ -18,6 +18,7 @@ use clap::{Args, Parser, Subcommand, ValueEnum};
use futures_util::StreamExt;
use serde_json::json;
use serde_json::Value;
use zb_mom_ww_mxgateway_client::galaxy::{BrowseChildrenOptions, LazyBrowseNode};
use zb_mom_ww_mxgateway_client::generated::galaxy_repository::v1::DeployEvent;
use zb_mom_ww_mxgateway_client::generated::mxaccess_gateway::v1::{
    alarm_feed_message, AcknowledgeAlarmRequest, AlarmFeedMessage, CloseSessionRequest, MxCommand,
@@ -387,6 +388,47 @@ enum GalaxyCommand {
        #[arg(long)]
        json: bool,
    },
    /// Lazily browse the Galaxy hierarchy through `BrowseChildren`.
    ///
    /// With no `--parent-gobject-id` the root objects are listed, and
    /// `--depth` controls how many further levels are eagerly expanded
    /// (0 = the requested level only). Pass a parent id to list exactly one
    /// level, that object's direct children; `--depth` is ignored in that mode.
    /// The filter flags map onto `BrowseChildrenOptions` and are reused at
    /// every expanded level, mirroring the lazy-browse library helper.
    Browse {
        #[command(flatten)]
        connection: ConnectionArgs,
        /// Parent gobject id whose children to browse. Omit for root objects.
        #[arg(long)]
        parent_gobject_id: Option<i32>,
        /// Restrict to objects whose `category_id` matches one of these ids.
        /// Repeatable.
        #[arg(long = "category-id")]
        category_ids: Vec<i32>,
        /// Restrict to objects whose template chain contains this entry.
        /// Repeatable (combined with AND).
        #[arg(long = "template-contains")]
        template_chain_contains: Vec<String>,
        /// Restrict to objects whose tag name matches this SQL `LIKE`-style glob.
        #[arg(long)]
        tag_name_glob: Option<String>,
        /// Populate `attributes` on the returned objects.
        #[arg(long)]
        include_attributes: bool,
        /// Only return objects that own at least one alarm-bearing attribute.
        #[arg(long)]
        alarm_bearing_only: bool,
        /// Only return objects that own at least one historized attribute.
        #[arg(long)]
        historized_only: bool,
        /// Number of additional levels to eagerly expand beneath each returned
        /// node. 0 (the default) prints only the requested level.
        /// Ignored when --parent-gobject-id is specified.
        #[arg(long, default_value_t = 0)]
        depth: usize,
        #[arg(long)]
        json: bool,
    },
    /// Subscribe to the WatchDeployEvents server stream.
    ///
    /// Prints one line per received event (or one JSON object with `--json`).
@@ -1103,10 +1145,271 @@ async fn run_galaxy(command: GalaxyCommand) -> Result<(), Error> {
                }
            }
        }
        GalaxyCommand::Browse {
            connection,
            parent_gobject_id,
            category_ids,
            template_chain_contains,
            tag_name_glob,
            include_attributes,
            alarm_bearing_only,
            historized_only,
            depth,
            json,
        } => {
            if parent_gobject_id.is_some() && depth > 0 {
                eprintln!("warning: --depth is ignored when --parent-gobject-id is specified");
            }
            let mut client = connect_galaxy(connection).await?;
            let options = BrowseChildrenOptions {
                category_ids,
                template_chain_contains,
                tag_name_glob,
                include_attributes: include_attributes.then_some(true),
                alarm_bearing_only,
                historized_only,
            };
            match parent_gobject_id {
                // No parent → walk the lazy-browse tree from the root objects,
                // eagerly expanding `depth` further levels so the print walks
                // cached children without re-issuing RPCs.
                None => {
                    let nodes = client.browse(Some(options)).await?;
                    for node in &nodes {
                        expand_to_depth(node, depth).await?;
                    }
                    if json {
                        let mut payload = Vec::with_capacity(nodes.len());
                        for node in &nodes {
                            payload.push(lazy_node_to_json(node).await);
                        }
                        println!("{}", json!({ "nodes": payload }));
                    } else {
                        println!("{}", nodes.len());
                        for node in &nodes {
                            print_lazy_node(node, 0).await;
                        }
                    }
                }
                // A specific parent → fetch exactly one level of children via
                // the raw paged RPC. `--depth` is not meaningful here; the
                // single-level children are returned as-is.
                Some(parent) => {
                    let children = browse_children_one_level(&mut client, parent, &options).await?;
                    print_browse_children(&children, json);
                }
            }
        }
    }
    Ok(())
}

/// Page size used for the raw `BrowseChildren` RPC when fetching a single
/// level via `--parent-gobject-id`. Mirrors `BROWSE_CHILDREN_PAGE_SIZE` in
/// `galaxy.rs` (the library's lazy-browse helper uses the same value).
const BROWSE_PAGE_SIZE: i32 = 500;

/// Drive `BrowseChildren` paging by hand for a single parent and return the
/// flattened child list. Used by the `browse --parent-gobject-id` path, which
/// surfaces one level of children rather than the lazy root-tree walk.
async fn browse_children_one_level(
    client: &mut GalaxyClient,
    parent_gobject_id: i32,
    options: &BrowseChildrenOptions,
) -> Result<Vec<GalaxyBrowseChild>, Error> {
    use std::collections::HashSet;
    use zb_mom_ww_mxgateway_client::generated::galaxy_repository::v1::{
        browse_children_request, BrowseChildrenRequest,
    };

    let mut children = Vec::new();
    let mut page_token = String::new();
    let mut seen: HashSet<String> = HashSet::new();
    loop {
        let request = BrowseChildrenRequest {
            page_size: BROWSE_PAGE_SIZE,
            page_token: page_token.clone(),
            category_ids: options.category_ids.clone(),
            template_chain_contains: options.template_chain_contains.clone(),
            tag_name_glob: options.tag_name_glob.clone().unwrap_or_default(),
            include_attributes: options.include_attributes,
            alarm_bearing_only: options.alarm_bearing_only,
            historized_only: options.historized_only,
            parent: Some(browse_children_request::Parent::ParentGobjectId(
                parent_gobject_id,
            )),
        };
        let reply = client.browse_children_raw(request).await?;
        let hints = reply.child_has_children;
        for (index, object) in reply.children.into_iter().enumerate() {
            let has_children_hint = hints.get(index).copied().unwrap_or(false);
            children.push(GalaxyBrowseChild {
                object,
                has_children_hint,
            });
        }
        page_token = reply.next_page_token;
        if page_token.is_empty() {
            return Ok(children);
        }
        if !seen.insert(page_token.clone()) {
            return Err(Error::InvalidArgument {
                name: "page_token".to_owned(),
                detail: format!(
                    "galaxy browse children returned repeated page token `{page_token}`"
                ),
            });
        }
    }
}

/// A single child returned by the raw `BrowseChildren` paging path, paired
/// with its server-supplied `child_has_children` hint.
struct GalaxyBrowseChild {
    object: zb_mom_ww_mxgateway_client::generated::galaxy_repository::v1::GalaxyObject,
    has_children_hint: bool,
}

/// Print the one-level children of a browsed parent, mirroring the JSON node
/// shape used by the root-tree walk (minus the recursive `children` array).
fn print_browse_children(children: &[GalaxyBrowseChild], use_json: bool) {
    if use_json {
        let payload: Vec<_> = children.iter().map(browse_child_to_json).collect();
        println!("{}", json!({ "nodes": payload }));
    } else {
        println!("{}", children.len());
        for child in children {
            let object = &child.object;
            let marker = if child.has_children_hint { "+" } else { "-" };
            println!(
                "{marker} {} {} (gobject {})",
                object.tag_name, object.browse_name, object.gobject_id,
            );
        }
    }
}

/// Render one raw browse child as a JSON object whose key set matches the
/// lazy-node renderer (with an empty `children` array).
fn browse_child_to_json(child: &GalaxyBrowseChild) -> Value {
    let object = &child.object;
    json!({
        "gobjectId": object.gobject_id,
        "tagName": object.tag_name,
        "containedName": object.contained_name,
        "browseName": object.browse_name,
        "parentGobjectId": object.parent_gobject_id,
        "isArea": object.is_area,
        "categoryId": object.category_id,
        "hostedByGobjectId": object.hosted_by_gobject_id,
        "templateChain": object.template_chain,
        "hasChildrenHint": child.has_children_hint,
        "attributes": object.attributes.iter().map(|attribute| json!({
            "attributeName": attribute.attribute_name,
            "fullTagReference": attribute.full_tag_reference,
            "mxDataType": attribute.mx_data_type,
            "dataTypeName": attribute.data_type_name,
            "isArray": attribute.is_array,
            "arrayDimension": attribute.array_dimension,
            "arrayDimensionPresent": attribute.array_dimension_present,
            "mxAttributeCategory": attribute.mx_attribute_category,
            "securityClassification": attribute.security_classification,
            "isHistorized": attribute.is_historized,
            "isAlarm": attribute.is_alarm,
        })).collect::<Vec<_>>(),
        "children": Vec::<Value>::new(),
    })
}

/// Recursively expand a [`LazyBrowseNode`] up to `depth` further levels. A
/// `depth` of 0 leaves the node unexpanded so the caller prints only the
/// requested level.
fn expand_to_depth(
    node: &LazyBrowseNode,
    depth: usize,
) -> std::pin::Pin<Box<dyn std::future::Future<Output = Result<(), Error>> + Send + '_>> {
    Box::pin(async move {
        if depth == 0 {
            return Ok(());
        }
        node.expand().await?;
        for child in node.children().await {
            expand_to_depth(&child, depth - 1).await?;
        }
        Ok(())
    })
}

/// Print a [`LazyBrowseNode`] and any already-expanded descendants as an
/// indented tree. Indentation is two spaces per level.
fn print_lazy_node(
    node: &LazyBrowseNode,
    indent: usize,
) -> std::pin::Pin<Box<dyn std::future::Future<Output = ()> + Send + '_>> {
    Box::pin(async move {
        let object = node.object();
        let marker = if node.has_children_hint() { "+" } else { "-" };
        println!(
            "{:indent$}{marker} {} {} (gobject {})",
            "",
            object.tag_name,
            object.browse_name,
            object.gobject_id,
            indent = indent,
        );
        if node.is_expanded().await {
            for child in node.children().await {
                print_lazy_node(&child, indent + 2).await;
            }
        }
    })
}

/// Render a [`LazyBrowseNode`] (and its already-expanded children) as a JSON
/// object. Mirrors the `discover-hierarchy` object shape with an added
/// `hasChildrenHint` flag and a nested `children` array.
fn lazy_node_to_json(
    node: &LazyBrowseNode,
) -> std::pin::Pin<Box<dyn std::future::Future<Output = Value> + Send + '_>> {
    Box::pin(async move {
        let object = node.object();
        let mut children = Vec::new();
        if node.is_expanded().await {
            for child in node.children().await {
                children.push(lazy_node_to_json(&child).await);
            }
        }
        json!({
            "gobjectId": object.gobject_id,
            "tagName": object.tag_name,
            "containedName": object.contained_name,
            "browseName": object.browse_name,
            "parentGobjectId": object.parent_gobject_id,
            "isArea": object.is_area,
            "categoryId": object.category_id,
            "hostedByGobjectId": object.hosted_by_gobject_id,
            "templateChain": object.template_chain,
            "hasChildrenHint": node.has_children_hint(),
            "attributes": object.attributes.iter().map(|attribute| json!({
                "attributeName": attribute.attribute_name,
                "fullTagReference": attribute.full_tag_reference,
                "mxDataType": attribute.mx_data_type,
                "dataTypeName": attribute.data_type_name,
                "isArray": attribute.is_array,
                "arrayDimension": attribute.array_dimension,
                "arrayDimensionPresent": attribute.array_dimension_present,
                "mxAttributeCategory": attribute.mx_attribute_category,
                "securityClassification": attribute.security_classification,
                "isHistorized": attribute.is_historized,
                "isAlarm": attribute.is_alarm,
            })).collect::<Vec<_>>(),
            "children": children,
        })
    })
}

async fn session_for(
    connection: ConnectionArgs,
    session_id: String,
@@ -2131,6 +2434,37 @@ mod tests {
        assert!(parsed.is_ok(), "parse failed: {parsed:?}");
    }

    #[test]
    fn parses_galaxy_browse_command_with_filters_and_depth() {
        let parsed = Cli::try_parse_from([
            "mxgw",
            "galaxy",
            "browse",
            "--parent-gobject-id",
            "42",
            "--category-id",
            "3",
            "--category-id",
            "5",
            "--template-contains",
            "$DelmiaReceiver",
            "--tag-name-glob",
            "Recv_*",
            "--include-attributes",
            "--alarm-bearing-only",
            "--depth",
            "2",
            "--json",
        ]);
        assert!(parsed.is_ok(), "parse failed: {parsed:?}");
    }

    #[test]
    fn parses_galaxy_browse_command_with_defaults() {
        let parsed = Cli::try_parse_from(["mxgw", "galaxy", "browse"]);
        assert!(parsed.is_ok(), "parse failed: {parsed:?}");
    }

    #[test]
    fn parses_batch_command() {
        let parsed = Cli::try_parse_from(["mxgw", "batch"]);
@@ -762,16 +762,17 @@ in the codebase for the forward-compat shape, but the gateway-side
`AcknowledgeAlarmByName` when the public RPC supplies a recognizable
`Provider!Group.Tag` reference.
### 5. STA / threading — production fix needed
### 5. STA / threading — resolved
The wnwrap COM is `ThreadingModel=Apartment`. The consumer's
internal `Timer` fires on threadpool threads and would block forever
on cross-apartment marshaling unless the host STA pumps Win32
messages. The smoke test sidesteps this by setting
`pollIntervalMilliseconds=0` (Timer disabled) and driving `PollOnce`
manually from the test's STA. Production hosting will route polls
through the worker's `StaRuntime` in a follow-up — the consumer's
`PollOnce` is `public` and idempotent so the wire-up is mechanical.
manually from the test's STA. Production alarm polling was wired up
through `GatewayAlarmMonitor`, which routes polling through the
worker's `StaRuntime` (the STA pump owner) via the worker IPC path. This item is resolved; the wnwrap consumer's `PollOnce`
is no longer invoked directly in production.
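Routing every poll through the STA pump owner applies a thread-affinity pattern: callers on any thread enqueue work, and the single owning thread runs it while the caller waits. The minimal Python sketch below models only that "one owning thread" rule. It assumes an ordinary thread as a stand-in for the STA and does not model COM apartments or a Win32 message pump. `SingleThreadRuntime` and `invoke` are hypothetical names, not the worker's `StaRuntime` API.

```python
import queue
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional, Tuple


class SingleThreadRuntime:
    """Runs every submitted callable on one dedicated owner thread (hypothetical)."""

    def __init__(self) -> None:
        self._work: "queue.Queue[Optional[Tuple[Callable[[], Any], Future]]]" = queue.Queue()
        self._thread = threading.Thread(target=self._pump, name="owner", daemon=True)
        self._thread.start()

    def _pump(self) -> None:
        while True:
            item = self._work.get()
            if item is None:  # shutdown sentinel
                return
            fn, future = item
            try:
                future.set_result(fn())
            except Exception as exc:  # surface failures to the waiting caller
                future.set_exception(exc)

    def invoke(self, fn: Callable[[], Any]) -> Any:
        """Marshal fn onto the owner thread and block until it completes."""
        future: Future = Future()
        self._work.put((fn, future))
        return future.result()

    def shutdown(self) -> None:
        self._work.put(None)
        self._thread.join()


runtime = SingleThreadRuntime()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Eight "polls" issued from pool threads; each reports the thread that ran it.
    thread_ids = set(pool.map(lambda _: runtime.invoke(threading.get_ident), range(8)))
runtime.shutdown()
print(len(thread_ids))  # 1
```

Calling `invoke` from the owner thread itself would deadlock, because that call would wait on work queued behind itself. A production runtime would need to detect re-entry and run the call inline.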
### Capture summary
@@ -37,11 +37,14 @@ paths, timeouts, queue sizes, enum values, or protocol values are invalid.
      "MaxPendingCommandsPerSession": 128,
      "DefaultLeaseSeconds": 1800,
      "LeaseSweepIntervalSeconds": 30,
      "AllowMultipleEventSubscribers": false
      "AllowMultipleEventSubscribers": false,
      "MaxEventSubscribersPerSession": 8
    },
    "Events": {
      "QueueCapacity": 10000,
      "BackpressurePolicy": "FailFast"
      "BackpressurePolicy": "FailFast",
      "ReplayBufferCapacity": 1024,
      "ReplayRetentionSeconds": 300
    },
    "Dashboard": {
      "Enabled": true,
@@ -123,23 +126,35 @@ to avoid accidental large allocations from malformed or oversized frames.
| `MxGateway:Sessions:MaxPendingCommandsPerSession` | `128` | Maximum number of pending worker commands for one session. Excess commands fail fast instead of queueing indefinitely. |
| `MxGateway:Sessions:DefaultLeaseSeconds` | `1800` | Initial session lease and refresh duration. Unary client activity extends the lease by this duration. |
| `MxGateway:Sessions:LeaseSweepIntervalSeconds` | `30` | Hosted monitor interval for closing expired leases. Active event-stream subscribers keep a session from expiring while the stream remains attached. |
| `MxGateway:Sessions:AllowMultipleEventSubscribers` | `false` | Controls whether multiple `StreamEvents` subscribers may attach to one session. `true` is rejected until event fan-out is implemented. |
| `MxGateway:Sessions:AllowMultipleEventSubscribers` | `false` | Controls whether multiple `StreamEvents` subscribers may attach to one session. When `false` the session refuses a second subscriber with `AlreadyExists`. Set to `true` to enable fan-out via the `SessionEventDistributor`. |
| `MxGateway:Sessions:MaxEventSubscribersPerSession` | `8` | Maximum number of concurrent `StreamEvents` subscribers per session when `AllowMultipleEventSubscribers` is `true`. Effectively 1 when `AllowMultipleEventSubscribers` is `false`. Must be greater than zero. |
All numeric session options must be greater than zero. The current event stream
implementation supports one active subscriber per session; this preserves event
ordering and avoids competing consumers.
All numeric session options must be greater than zero.
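The interaction between `AllowMultipleEventSubscribers` and `MaxEventSubscribersPerSession` reduces to an effective cap, plus a choice of rejection. The minimal Python sketch below shows that logic. The exception classes borrow the `AlreadyExists` and `EventSubscriberLimitReached` names used in this documentation, but the classes, `SessionOptions` fields, and `check_attach` are hypothetical. The sketch also does not model whether the internal dashboard-mirror subscriber counts toward the cap.

```python
from dataclasses import dataclass


class AlreadyExists(Exception):
    """Second subscriber while fan-out is disabled."""


class EventSubscriberLimitReached(Exception):
    """Attach beyond MaxEventSubscribersPerSession."""


@dataclass(frozen=True)
class SessionOptions:
    allow_multiple_event_subscribers: bool = False
    max_event_subscribers_per_session: int = 8

    def validate(self) -> None:
        if self.max_event_subscribers_per_session < 1:
            raise ValueError("MaxEventSubscribersPerSession must be greater than zero")

    @property
    def effective_cap(self) -> int:
        # With fan-out disabled the session behaves as if the cap were 1.
        if not self.allow_multiple_event_subscribers:
            return 1
        return self.max_event_subscribers_per_session


def check_attach(options: SessionOptions, active_subscribers: int) -> None:
    """Raise if attaching one more subscriber would exceed the effective cap."""
    if active_subscribers < options.effective_cap:
        return
    if not options.allow_multiple_event_subscribers:
        raise AlreadyExists("session already has an event subscriber")
    raise EventSubscriberLimitReached(
        f"session already has {active_subscribers} of {options.effective_cap} subscribers"
    )
```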
## Event Options
| Option | Default | Description |
|--------|---------|-------------|
| `MxGateway:Events:QueueCapacity` | `10000` | Capacity for bounded per-session event queues used by the gateway worker event channel and the public gRPC event stream queue. |
| `MxGateway:Events:BackpressurePolicy` | `FailFast` | Event backpressure behavior. `FailFast` faults the session on public stream queue overflow. `DisconnectSubscriber` disconnects only the slow stream. |
| `MxGateway:Events:BackpressurePolicy` | `FailFast` | Per-subscriber event backpressure behavior when a subscriber's bounded event channel overflows. Overflow is isolated to the offending subscriber: it is always disconnected with an `EventQueueOverflow` fault while the session pump and other subscribers keep running. `FailFast` additionally faults the whole session only in the legacy single-subscriber case (the current default mode); with multiple subscribers it degrades to a per-subscriber disconnect so one slow consumer never faults a shared session. `DisconnectSubscriber` disconnects only the slow subscriber in all cases. |
| `MxGateway:Events:ReplayBufferCapacity` | `1024` | Maximum number of events retained per session in the replay ring buffer, used to re-deliver events a returning subscriber missed (reconnect/reattach). The oldest retained event is evicted once this count is exceeded. `0` disables replay retention. |
| `MxGateway:Events:ReplayRetentionSeconds` | `300` | Maximum age, in seconds, of an event retained in the replay ring buffer. Entries older than this are evicted regardless of capacity. `0` disables age-based eviction. |
`QueueCapacity` must be greater than zero. With `FailFast`, queue overflow
faults the affected worker or session instead of silently dropping MXAccess
events. With `DisconnectSubscriber`, public gRPC stream overflow terminates only
the affected stream while the MXAccess session remains active.
`QueueCapacity` must be greater than zero; it bounds each per-subscriber event
channel fed by the session's single event pump. A slow subscriber overflows only
its own channel and is always disconnected with an `EventQueueOverflow` fault
rather than silently dropping MXAccess events — the pump, the session, and other
subscribers are unaffected. With `FailFast` in the single-subscriber case (the
default mode), that overflow additionally faults the whole session; with multiple
subscribers `FailFast` degrades to a per-subscriber disconnect, matching
`DisconnectSubscriber`, so one slow consumer cannot fault a session shared by
healthy subscribers. With `DisconnectSubscriber`, overflow terminates only the
affected stream while the MXAccess session remains active.
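The overflow rules above form a small decision table. The minimal Python sketch below encodes it, assuming that "single-subscriber case" means the overflowing stream is the only subscriber attached when the overflow happens. The `is_only_subscriber` flag, `OverflowOutcome`, and `overflow_outcome` are hypothetical names for illustration.

```python
from enum import Enum
from typing import NamedTuple


class BackpressurePolicy(Enum):
    FAIL_FAST = "FailFast"
    DISCONNECT_SUBSCRIBER = "DisconnectSubscriber"


class OverflowOutcome(NamedTuple):
    disconnect_subscriber: bool  # complete the slow subscriber's channel with EventQueueOverflow
    fault_session: bool  # additionally mark the whole session Faulted


def overflow_outcome(policy: BackpressurePolicy, is_only_subscriber: bool) -> OverflowOutcome:
    # The overflowing subscriber is always disconnected; the pump and peers keep running.
    # Only FailFast with a lone subscriber escalates to faulting the session.
    escalate = policy is BackpressurePolicy.FAIL_FAST and is_only_subscriber
    return OverflowOutcome(disconnect_subscriber=True, fault_session=escalate)


for policy in BackpressurePolicy:
    for alone in (True, False):
        print(policy.value, "alone" if alone else "shared", overflow_outcome(policy, alone))
```

Only one of the four combinations faults the session, which is why a slow consumer on a shared session can never take down its healthy peers.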
`ReplayBufferCapacity` and `ReplayRetentionSeconds` must each be greater than or
equal to zero (either dimension can be disabled with `0`). A returning subscriber
that asks for events older than the oldest still-retained event is told it missed
events (a "gap") and must re-snapshot; whatever is still retained is replayed.
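A replay ring bounded by both count and age, with gap detection, can be sketched in a few lines of Python. The sketch makes three assumptions:

- Worker sequences are contiguous integers.
- The clock is passed in explicitly as `now` so that the behavior is deterministic.
- A "gap" means the first event the caller has not yet seen has already been evicted.

`ReplayBuffer` and `replay_after` are hypothetical names, not the gateway's types.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass(frozen=True)
class Retained:
    sequence: int
    received_at: float
    payload: str


class ReplayBuffer:
    """Per-session replay ring bounded by count and by age (0 disables either)."""

    def __init__(self, capacity: int, retention_seconds: float) -> None:
        if capacity < 0 or retention_seconds < 0:
            raise ValueError("capacity and retention must be >= 0")
        self.capacity = capacity
        self.retention_seconds = retention_seconds
        self._items: Deque[Retained] = deque()
        self._last_sequence = 0  # highest sequence ever appended

    def append(self, sequence: int, payload: str, now: float) -> None:
        self._last_sequence = max(self._last_sequence, sequence)
        if self.capacity == 0:
            return  # replay retention disabled
        self._items.append(Retained(sequence, now, payload))
        while len(self._items) > self.capacity:
            self._items.popleft()  # evict the oldest once the count is exceeded
        self._evict_expired(now)

    def _evict_expired(self, now: float) -> None:
        if self.retention_seconds == 0:
            return  # age-based eviction disabled
        while self._items and now - self._items[0].received_at > self.retention_seconds:
            self._items.popleft()

    def replay_after(self, after_sequence: int, now: float) -> Tuple[bool, List[str]]:
        """Return (gap, payloads) for retained events with sequence > after_sequence."""
        self._evict_expired(now)
        missed_any = after_sequence < self._last_sequence
        oldest = self._items[0].sequence if self._items else None
        # Gap: the first event the caller has not seen was already evicted.
        gap = missed_any and (oldest is None or oldest > after_sequence + 1)
        payloads = [item.payload for item in self._items if item.sequence > after_sequence]
        return gap, payloads


buffer = ReplayBuffer(capacity=3, retention_seconds=300)
for seq in range(1, 6):
    buffer.append(seq, f"e{seq}", now=float(seq))
print(buffer.replay_after(3, now=6.0))  # (False, ['e4', 'e5'])
print(buffer.replay_after(1, now=6.0))  # (True, ['e3', 'e4', 'e5'])
```

In the second call, the caller still receives whatever is retained. The `True` gap flag tells it that events 2 and earlier are gone and that it must re-snapshot.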
## Dashboard Options
@@ -167,7 +167,7 @@ bearer). Each hub class is `[Authorize(Policy = HubClientsPolicy)]`.
|---|---|---|---|---|
| `DashboardSnapshotHub` | `/hubs/snapshot` | `DashboardSnapshotPublisher` (BackgroundService consuming `IDashboardSnapshotService.WatchSnapshotsAsync`) | `DashboardSnapshot` | Sent to all connected clients on every snapshot tick; new connections receive the current snapshot synchronously in `OnConnectedAsync`. |
| `AlarmsHub` | `/hubs/alarms` | `AlarmsHubPublisher` (BackgroundService consuming `IGatewayAlarmService.StreamAsync(filter: null)`) | `AlarmFeedMessage` (`active_alarm` / `snapshot_complete` / `transition`) | Connected clients auto-join `__alarms__`; all clients receive every message. Publisher auto-reconnects every 5s on stream faults. |
| `EventsHub` | `/hubs/events` | `DashboardEventBroadcaster` invoked by `EventStreamService` for each event it forwards to a gRPC client | `MxEvent` | Clients call `SubscribeSession(sessionId)` to join `session:{id}`. Events appear only while a gRPC client is also consuming that session's events — the dashboard is a passive mirror, not a separate worker subscriber. |
| `EventsHub` | `/hubs/events` | `DashboardEventBroadcaster` invoked by each session's internal dashboard-mirror subscriber on its `SessionEventDistributor` (registered when the session becomes Ready) | `MxEvent` | Clients call `SubscribeSession(sessionId)` to join `session:{id}`. The dashboard is a first-class distributor subscriber, so it receives the session's events whether or not a gRPC client is streaming. It sees RAW session events — not the per-gRPC-subscriber `AfterWorkerSequence` filtering that `EventStreamService` applies at its own boundary — because the dashboard is a separate LDAP-authenticated monitoring view meant to show the session's full event activity (per-session dashboard ACL is tracked separately). |
`DashboardPageBase` opens a `DashboardSnapshotHub` connection via the connection
factory in `OnInitializedAsync`, seeds `Snapshot` synchronously from
@@ -184,7 +184,8 @@ Default cadences:
- snapshot service produces one snapshot per
`MxGateway:Dashboard:SnapshotIntervalMilliseconds` (default 1s);
- alarm publisher emits on each transition observed by the central monitor;
- event publisher emits per event forwarded by `StreamEvents`.
- event publisher emits per event fanned by the session's `SessionEventDistributor`
to its internal dashboard-mirror subscriber (independent of any gRPC `StreamEvents`).
Avoid pushing every MXAccess data-change event into a wider broadcast group.
The current design routes events strictly through `session:{id}` groups; the
@@ -1,7 +1,7 @@
# Deferred Follow-ups Implementation Plan
**Date:** 2026-06-14
**Status:** Plan only — NOT yet executed. Saved for review.
**Status:** D1 executed (commit 4af24b9: `mxgateway.alarms.provider_switches` emitted in `DashboardSnapshotService.cs:198`). D2 resolved as no-op (see resolution section below). D3 through D5 remain pending (ops/validation, no code).
**Context:** After the alarm-subtag-fallback cleanup (merged `5976770`) and its redeploy to
windev (10.100.0.48), five items remain deferred. This plan handles all five. They are
independent; execute in any order, or cherry-pick. Items D1 and D2 are code (branch off `main`);
@@ -268,3 +268,34 @@ the deployed instance. **No source change made** (no-op).
deployed instance (the only path that exercises routing past the unauthenticated 302-to-`/login`).
Recommend a spot-check of authenticated `GET /` after the next Server redeploy; if it returns 200
(not 500), this item can be fully closed.
---
## Recorded residuals after `feat/stillpending-completion` (2026-06-15)
The stillpending.md actionable items were implemented on branch `feat/stillpending-completion`
(see `docs/plans/2026-06-15-stillpending-completion.md`). These environment/vendor-gated residuals
remain explicitly open — none are code defects:
- **`provider_switches{from,to,reason}` counter — live exercise still pending.** The metric is
emitted on the alarm failover/failback path and unit-tested, but the dev rig's object-driven
alarms can't be made to fail a real alarmmgr→subtag switch from outside, so the `reason` tagging
is unproven against a live failover. Re-verify when a rig (or capture) can drive an actual
alarmmgr fault. (stillpending §1.3.)
- **`DrainEvents` is a diagnostic snapshot, not an exhaustive drain.** The worker now answers
`DrainEvents` (handled in `WorkerPipeSession`, off-STA), but it pulls from the same event queue
that the ~25 ms background stream-drain loop continuously empties. With an active event stream a
`DrainEvents` caller therefore receives only events not yet pushed by the stream loop — there is
no loss or double-delivery (the queue drain is lock-protected and destructive), but the result is
a non-deterministic snapshot. Documented here so the semantics aren't mistaken for a bug.
- **Buffered multi-sample conversion (`OnBufferedDataChange`) — unverified live.** `AddBufferedItem`
/ `SetBufferedUpdateInterval` are implemented and live-confirmed; the empty `NoData` bootstrap
event converts cleanly live (`f7ada90`). A real parallel quality/timestamp sample batch
(length > 1) is undrivable on the current rig, so the multi-sample path is exercised only by unit
tests against a fake `IMxAccessServer`. (stillpending §3.2.)
- **8-arg alarm ack operator `domain`/`full_name` — vendor-blocked.** The AVEVA `IwwAlarmConsumer2`
8-arg `AlarmAckByName` returns 55 (stub) and `AlarmAckByGUID` is `E_NOTIMPL` on this build, so the
two fields stay forward-compat-only on the wire. (stillpending §1.4 / §3.4 / §3.5.)
@@ -0,0 +1,233 @@
# Session Resilience Epic — Design
**Date:** 2026-06-15
**Branch:** `feat/session-resilience`
**Source:** `stillpending.md` §2 (intentional v1-deferred items), scoped into a real feature design.
**Status:** Design approved; implementation plan to follow.
## Goal
Lift four deliberately-deferred v1 limitations into supported features, built on
one shared foundation:
1. **Multi-event-subscriber fan-out** (§2: plumbed but validator-blocked).
2. **Reconnectable sessions** (§2: 1:1 session↔connection today).
3. **Per-session dashboard ACL** (§2 / EventsHub `TODO(per-session-acl)`).
4. **Orphan-worker reattach on gateway restart** (§2 — **overturns a hard
CLAUDE.md rule**, see "Documented-rule changes").
These are not peers: fan-out is the keystone, reconnect and reattach reuse its
machinery, and three of the four need a new session-ownership concept.
## Documented-rule changes (explicit, owner-approved)
This epic deliberately reverses three documented v1 decisions. Each reversal is a
required deliverable in the same change as the code:
- **CLAUDE.md:77** "Gateway restart does not reattach orphan workers… do not
design code paths that assume reattachment." → reattach becomes supported,
bounded, and opt-in.
- **`docs/DesignDecisions.md:63-73`** "no reconnectable sessions for v1." →
reconnect becomes supported within a bounded detach-grace window.
- **`docs/DesignDecisions.md:75-80`** single event subscriber per session. →
multi-subscriber fan-out becomes supported, capped.
The owner explicitly accepted overturning the reattach rule during design.
## Current-state seams (verified by recon, with citations)
- `GatewaySession.AttachEventSubscriber(bool allowMultipleSubscribers)`
(`Sessions/GatewaySession.cs:386-408`) guards on a single int
`_activeEventSubscriberCount` (`:16`) under `_syncRoot`; a second subscriber
throws `EventSubscriberAlreadyActive` (`:398`).
- `GatewayOptionsValidator.cs:181-185` hard-rejects
`AllowMultipleEventSubscribers` ("not supported until event fan-out is
implemented"); option bound at `SessionOptions.cs:26-29`.
- `EventStreamService.StreamEventsAsync` (`Grpc/EventStreamService.cs:27-101`)
creates **a new bounded `Channel<MxEvent>` per RPC call** (`:43-50`) and
`ProduceEventsAsync` drains `session.ReadEventsAsync()` directly — a
**destructive, single-consumer drain**. Two RPCs would fight over one queue.
- Backpressure: `ProduceEventsAsync` uses non-blocking `TryWrite`; on overflow
with `EventBackpressurePolicy.FailFast` (default, `EventOptions`) it calls
`session.MarkFaulted` (`EventStreamService.cs:143-162`) — faulting the **whole
session**, not just the slow consumer.
- `DashboardEventBroadcaster.Publish` (`Dashboard/Hubs/DashboardEventBroadcaster.cs:13-44`)
is called **inside** the per-RPC producer loop (`EventStreamService.cs:131-141`)
— so the dashboard only mirrors events while a gRPC client is streaming. Latent
bug: no gRPC subscriber ⇒ dashboard feed is dark.
- Pipe name `mxaccess-gateway-{Environment.ProcessId}-{sessionId}`
(`SessionManager.cs:433`); session id `session-{Guid:N}` (`:479`), in-memory
`SessionRegistry` only (`SessionRegistry.cs:12`), **not persisted**.
- `OrphanWorkerTerminator` (`Workers/OrphanWorkerTerminator.cs:49-112`) discovers
orphans by executable name/path (x64 gateway cannot introspect the x86 worker
module → image-name fallback) and **terminates** them; rationale comment at
`:9-16`.
- Pipe fault → `WorkerClient` read loop detects `EndOfStream`, session →
`Faulted` (`WorkerClient.cs:376-381`); no reattach. Worker launch passes the
per-session nonce via `MXGATEWAY_WORKER_NONCE` env var
(`WorkerProcessLauncher.cs:180-182`).
- Sessions store `ClientIdentity` (informational only, `GatewaySession.cs:114`);
**no `OwnerKeyId`, no per-session ACL.** gRPC `StreamEvents` enforces per-item
read constraints but **no session-level access gate** — any caller who knows a
session id can stream it.
- `EventsHub.SubscribeSession(string)` (`Dashboard/Hubs/EventsHub.cs:46-54`) joins
group `session:{id}`; only hub-level `[Authorize(HubClientsPolicy)]` gates it,
so **any** Admin/Viewer can subscribe to **any** session. `TODO(per-session-acl)`
at `:39-43`. `SnapshotHub`/`AlarmsHub` broadcast to all. Hub bearer
(`HubTokenService`, 30-min) carries name + roles only, **no session scope**.
- `StreamEventsRequest.AfterWorkerSequence` already exists on the wire (the
reconnect replay contract is half-built).
## Shared foundation
### A. `SessionEventDistributor` (one pump, N per-subscriber channels)
Per `GatewaySession`, replace the per-RPC direct drain with a single owned
distributor:
- One background **pump task** drains `ReadEventsAsync()` exactly once.
- Each event is (1) stamped with its worker sequence, (2) appended to a **bounded
replay ring buffer** (retain last `ReplayBufferCapacity` events or
`ReplayRetentionSeconds`, whichever first), and (3) `TryWrite`-fanned to every
registered subscriber's own bounded `Channel<MxEvent>`.
- **Per-subscriber backpressure isolation:** overflow completes only that
subscriber's channel (policy `DisconnectSubscriber`); the session and peers are
untouched. `FailFast` → `MarkFaulted` is retained only for the legacy
single-subscriber config path, for backward compatibility.
- **Constraint filtering stays per-subscriber:** the pump fans *raw* events; each
subscriber's read loop applies its own API-key read subtree/glob filter exactly
as today. No change to constraint semantics.
- `AttachEventSubscriber` returns a lease carrying that subscriber's channel
reader + its start sequence (for replay). `EventStreamService` reads the lease
channel instead of creating its own channel and draining the session.
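
The fan-out and overflow rules above can be sketched in a few lines of Python (the gateway itself is C#). This is a single-threaded illustration under stated assumptions: `Distributor`, `Subscriber`, `register`, and `publish` are hypothetical names, a `deque` with a length check stands in for a bounded `Channel<MxEvent>`, and the real registry must also be thread-safe because leases attach and dispose while the pump runs.

```python
from collections import deque


class Subscriber:
    """One subscriber's bounded queue (stand-in for a bounded Channel<MxEvent>)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.closed = False
        self.close_reason = None

    def try_write(self, event):
        # Non-blocking write: refuse instead of waiting when the queue is full.
        if self.closed or len(self.queue) >= self.capacity:
            return False
        self.queue.append(event)
        return True


class Distributor:
    """Fans each drained event out to every registered subscriber."""

    def __init__(self):
        self.subscribers = []

    def register(self, capacity):
        sub = Subscriber(capacity)
        self.subscribers.append(sub)
        return sub

    def publish(self, event):
        # Called once per event by the single pump; never blocks on a slow reader.
        for sub in list(self.subscribers):
            if not sub.try_write(event):
                # DisconnectSubscriber policy: complete only this subscriber.
                sub.closed = True
                sub.close_reason = "EventQueueOverflow"
                self.subscribers.remove(sub)
```

With one subscriber of capacity 10 and another of capacity 1, publishing three events leaves the first with all three while the second is closed on the second event; the session and the fast peer are unaffected.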
### B. Session ownership
Record an authoritative **`OwnerKeyId`** (the creating API key id) on the session
at `OpenSession`, alongside the existing informational `ClientIdentity`. This one
field underpins ACL, reconnect re-validation, and reattach adoption.
## Feature designs
### 1. Multi-subscriber fan-out
- Remove the `GatewayOptionsValidator.cs:181-185` rejection; keep the option but
allow `true`.
- `_activeEventSubscriberCount` → a subscriber-lease collection on the
distributor. New cap `MaxEventSubscribersPerSession` (default 8) → reject the
N+1 attach with `EventSubscriberLimitReached`.
- Dashboard broadcaster registers as a distributor subscriber (removing the inline
tap), fixing the dashboard-dark-without-gRPC bug.
- **No proto change.**
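
A minimal sketch of the cap check, assuming a hypothetical `SubscriberLeases` helper and a Python exception standing in for the new `EventSubscriberLimitReached` error code. Legacy single-subscriber mode is modelled here as a cap of one; the real code may keep its existing rejection code for that path.

```python
class EventSubscriberLimitReached(Exception):
    """Stand-in for SessionManagerErrorCode.EventSubscriberLimitReached."""


class SubscriberLeases:
    """Tracks attached subscriber leases for one session (hypothetical helper)."""

    def __init__(self, allow_multiple, max_per_session=8):
        if max_per_session < 1:
            raise ValueError("MaxEventSubscribersPerSession must be >= 1")
        # Single-subscriber mode behaves like a cap of one.
        self.limit = max_per_session if allow_multiple else 1
        self.active = set()
        self._next_id = 0

    def attach(self):
        # Reject the N+1 attach instead of silently sharing a channel.
        if len(self.active) >= self.limit:
            raise EventSubscriberLimitReached(
                f"session already has {len(self.active)} subscriber(s)"
            )
        self._next_id += 1
        self.active.add(self._next_id)
        return self._next_id

    def release(self, lease_id):
        # Disposing a lease frees a slot for the next attach.
        self.active.discard(lease_id)
```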
### 2. Reconnectable sessions
- On stream drop, a session in **detach-grace** mode is retained (not closed) for
`DetachGraceSeconds` (separate from the session lease). New session
disconnect-policy value `DetachGrace`.
- On reconnect: client calls `StreamEvents` with the same session id +
`AfterWorkerSequence = lastSeen`. The distributor replays ring-buffer events
with `sequence > AfterWorkerSequence`, then resumes live.
- If the requested sequence is older than the ring's oldest retained event (gone
too long / ring overflowed), the server signals **`ReplayGap`** so the client
re-snapshots. **Contract addition** (a `ReplayGap` status / response marker) →
codegen ripple across all 5 clients.
- Reconnect re-validates caller `OwnerKeyId` == session owner → else
`PermissionDenied`.
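
The replay-or-gap decision can be sketched as follows. `ReplayRing` and `replay_from` are hypothetical stand-ins for the ring buffer and `TryGetReplayFrom`; timestamps are plain seconds supplied by the caller so the example stays deterministic. The sketch assumes worker sequences are contiguous integers; if they were not, the check would at worst report a spurious gap, never hide a real one.

```python
from collections import deque


class ReplayRing:
    """Bounded replay buffer keyed by worker sequence (illustrative only)."""

    def __init__(self, capacity, retention_seconds):
        self.capacity = capacity
        self.retention = retention_seconds
        self.items = deque()  # (sequence, timestamp, event), oldest first
        self.last_sequence = 0

    def append(self, sequence, event, now):
        self.items.append((sequence, now, event))
        self.last_sequence = sequence
        self._evict(now)

    def _evict(self, now):
        # Count limit and age limit: whichever is hit first removes the event.
        while len(self.items) > self.capacity:
            self.items.popleft()
        while self.items and now - self.items[0][1] > self.retention:
            self.items.popleft()

    def replay_from(self, after_sequence, now):
        """Return (events with sequence > after_sequence, gap flag)."""
        self._evict(now)
        if after_sequence >= self.last_sequence:
            return [], False  # client is already up to date
        oldest = self.items[0][0] if self.items else self.last_sequence + 1
        # The client next needs after_sequence + 1; if it was evicted, signal a gap.
        gap = after_sequence + 1 < oldest
        events = [e for (s, _, e) in self.items if s > after_sequence]
        return events, gap
```

On a gap the sketch still returns whatever newer events survive, matching the plan's "send the `ReplayGap` marker first" ordering.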
### 3. Per-session ACL
- **gRPC (real security win, no proto change):** `Invoke` / `StreamEvents` /
`CloseSession` gated to the owning API key, OR a key holding a new
all-sessions admin scope → else `PermissionDenied`. Enforced in
`MxAccessGatewayService` against session `OwnerKeyId`.
- **Dashboard:** identity-domain mismatch (LDAP Admin/Viewer users vs API-key
sessions) means no natural owner link.
- **Decision required (flagged, not hard-coded):** default proposal —
**Admin** sees all sessions; **Viewer** scoped via config
`Dashboard:GroupToSessionTag` matched against an optional session `Tag`.
Enforced at `EventsHub.SubscribeSession` and in the `/hubs/token` mint
(token gains an allowed-session-tag claim). The owner may instead choose a
strict default (Viewers see nothing unless granted).
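
Both gates reduce to small predicates. This sketch follows the default proposal (Admin sees all, Viewer by tag), not the strict variant; the scope string `sessions:all`, the plain-string role names, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

ALL_SESSIONS_SCOPE = "sessions:all"  # hypothetical name for the admin scope


@dataclass(frozen=True)
class Session:
    owner_key_id: str
    tag: Optional[str] = None


def grpc_may_access(session, caller_key_id, caller_scopes):
    """Owner gate for Invoke / StreamEvents / CloseSession; False maps to PermissionDenied."""
    return caller_key_id == session.owner_key_id or ALL_SESSIONS_SCOPE in caller_scopes


def dashboard_may_subscribe(session, roles, allowed_tags):
    """EventsHub.SubscribeSession check under the default proposal."""
    if "Admin" in roles:
        return True
    if "Viewer" in roles:
        # allowed_tags is derived from Dashboard:GroupToSessionTag for this user.
        return session.tag is not None and session.tag in allowed_tags
    return False
```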
### 4. Orphan-worker reattach
- **Stable pipe naming:** drop `{gatewayPid}`; use a persisted stable
gateway-instance id. Replaces the pid's collision-avoidance role.
- **Adoption manifest:** persist a minimal record per live session
(`sessionId → workerPid, nonce, ownerKeyId, pipeName`) in the existing SQLite
store. This is the *only* persisted session state; COM/advise state stays in
the worker.
- **Worker phones home:** the worker runs a reconnect loop with bounded backoff;
the restarted gateway re-opens pipe servers for manifest entries and the
surviving worker re-attaches, presenting its **nonce**. Gateway validates the
nonce against the manifest and **rejects impostors / foreign workers**.
- **Resync, not replay:** the in-memory ring buffer is lost on restart, so a
reattached session's subscribers get `ReplayGap` and re-snapshot. Gateway
resyncs worker view via the now-implemented `GetSessionState` / `GetWorkerInfo`
commands.
- **Safety net retained:** workers self-terminate after `MaxOrphanLifetime` with
no re-adoption; `OrphanWorkerTerminator` stays as the fallback for un-adoptable
or foreign workers. Reattach is opt-in (`Workers:EnableOrphanReattach`,
default off) so the documented-safe behavior remains the default.
- **Pipe protocol:** add an **adopt/reconnect frame** to `mxaccess_worker.proto`
→ worker codegen regen + commit `Generated/` (net48 regen rule applies).
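
A minimal sketch of the gateway-side adoption check, assuming the manifest keeps a SHA-256 digest of the nonce rather than the nonce itself (the design only says the nonce is persisted; hashing it is an assumption in keeping with the no-secrets rule). `ManifestEntry` and `validate_adopt` are hypothetical names. A `None` result means reject and fall back to `OrphanWorkerTerminator`.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """One adoption-manifest record (hypothetical shape; nonce kept as a digest)."""

    session_id: str
    worker_pid: int
    nonce_sha256: str
    owner_key_id: str
    pipe_name: str


def validate_adopt(manifest, session_id, presented_nonce):
    """Return the entry to adopt, or None to reject and fall back to termination."""
    entry = manifest.get(session_id)
    if entry is None:
        return None  # no manifest record: foreign or stale worker
    presented = hashlib.sha256(presented_nonce.encode("utf-8")).hexdigest()
    # Constant-time comparison so a mismatch does not leak timing information.
    if not hmac.compare_digest(entry.nonce_sha256, presented):
        return None  # impostor: wrong nonce for this session
    return entry
```

A plain SHA-256 is adequate only because the nonce is random and high-entropy; a human-chosen secret would need a slow key-derivation function instead.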
## Contract / codegen impact
Unlike the prior epic, this is **not zero-proto**:
- `mxaccess_gateway.proto`: `ReplayGap` signal for reconnect (Feature 2).
- `mxaccess_worker.proto`: adopt/reconnect frame (Feature 4).
Per the repo rule: regenerate `Generated/`, commit it, rebuild gateway + worker +
every generated client touched, and update affected docs in the same change.
## Error handling
- Per-subscriber overflow → disconnect that subscriber only; session survives.
- Reconnect past the ring horizon → `ReplayGap`, client re-snapshots (no silent
loss).
- Reattach nonce mismatch → reject + fall back to termination.
- ACL denial → `PermissionDenied` (gRPC) / hub subscribe refused (dashboard).
- All worker COM/STA interactions keep MXAccess parity — no synthesized events,
no "fixing" surprising returns.
## Testing & cross-platform verification
| Area | Test | Host |
|---|---|---|
| Distributor fan-out, per-sub backpressure, replay ring | unit, `FakeWorkerHarness` | local (macOS) |
| Reconnect replay + `ReplayGap` | unit + fake-worker integration | local |
| Session ownership / gRPC ACL | unit + gateway integration | local |
| Dashboard per-session ACL | LDAP test users (`multi-role`/`gw-viewer`) | local + live LDAP |
| Worker adopt frame, reattach handshake | worker unit (net48/x86) | **windev** |
| Gateway-restart reattach round-trip | integration | **windev** + live worker |
| Client `ReplayGap` handling | per-client tests; Java on macOS JDK 21 | local |

TDD throughout; per-task commits; `Generated/` regenerated and committed on proto
changes; docs (incl. the three documented-rule reversals) updated in the same
change as source.
## Delivery order (dependency stack)
Each phase is independently shippable:
1. **Foundation**: `SessionEventDistributor` + replay ring + session `OwnerKeyId`
(refactor with no external behavior change; dashboard-dark bug fixed).
2. **Fan-out**: remove validator block, subscriber-lease list, cap, dashboard as
subscriber.
3. **Reconnect**: detach-grace, replay-on-reconnect, `ReplayGap` contract +
client handling.
4. **Per-session ACL**: gRPC owner gate + dashboard scoping.
5. **Reattach**: stable pipe naming, adoption manifest, worker phone-home +
adopt frame, resync, safety net; documented-rule reversals.
## Out of scope
- Cross-gateway / clustered session sharing (single gateway instance only).
- Event persistence beyond the in-memory ring (no durable event log).
- Reconnect across a gateway *restart* with zero event gap (restart always yields
`ReplayGap` by design — the ring is in-memory).
- Per-session ACL on `SnapshotHub` / `AlarmsHub` (they broadcast aggregate state;
only `EventsHub` is session-scoped).
@@ -0,0 +1,417 @@
# Session Resilience Epic — Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:subagent-driven-development (same session) or executing-plans (parallel session) to implement this plan task-by-task.
**Goal:** Lift four deferred v1 limitations — multi-subscriber fan-out, reconnectable sessions, per-session ACL, orphan-worker reattach — onto one shared event-distribution foundation.
**Architecture:** A per-session `SessionEventDistributor` (one pump → N per-subscriber bounded channels + a bounded replay ring) replaces today's per-RPC destructive drain. Session ownership (`OwnerKeyId`) underpins ACL, reconnect re-validation, and reattach adoption. See `docs/plans/2026-06-15-session-resilience-design.md`.
**Tech Stack:** .NET 10 gateway (x64), .NET Framework 4.8 worker (x86, windev), SQLite auth/manifest store, gRPC + protobuf contracts (net10.0;net48), 5 language clients, Blazor/SignalR dashboard, LDAP dashboard auth.
**Cross-platform:** Gateway, dotnet/Go/Rust/Python clients, and the Java client build/test locally on macOS (JDK 21 at `~/.local/jdks/jdk-21.0.11+10/Contents/Home`). The net48/x86 worker and worker tests build/test on **windev** (ssh alias, PowerShell). Proto changes: regenerate `Generated/`, commit it, rebuild every touched component.
**Standing rules (from CLAUDE.md):** never log secrets/credentials/values; MXAccess parity (no synthesized events, no "fixing" surprising returns); no init-only props/positional records in net48 worker; update docs in the same change as source; branch already created (`feat/session-resilience`); per-task commits; build+test affected components before marking done.
---
## Phase 1 — Foundation (refactor; no external behavior change except the dashboard-dark fix)
### Task 1: Add `OwnerKeyId` to the session
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** none (other phase-1 tasks build on the session type)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (add `OwnerKeyId` readonly prop near `ClientIdentity:114`)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs` (set `OwnerKeyId` from the request identity at `OpenSession`, near `CreateSessionId:479`)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`
**Steps:** TDD — failing test asserting an opened session records the creating API key id → add the property + assignment from `IGatewayRequestIdentityAccessor.Current` → green → `dotnet build src/ZB.MOM.WW.MxGateway.Server` + run session tests → commit.
### Task 2: `SessionEventDistributor` skeleton (single pump, subscriber registry)
**Classification:** high-risk (concurrency / actor model)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Create: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs`
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionEventDistributorTests.cs`
**Design:** One background pump `Task` draining `session.ReadEventsAsync()` exactly once; a thread-safe subscriber collection where each subscriber owns a bounded `Channel<MxEvent>` (`SingleReader=true`, `FullMode=Wait` for the per-sub channel, but writes use non-blocking `TryWrite`). `Register(startSequence)` returns a lease (channel reader + dispose). Pump fans each drained event to all subscriber channels via `TryWrite`.
**Steps:** Failing test: two registered subscribers both receive the same fanned event; disposing one stops its delivery without affecting the other. Implement pump + registry. Green. Build + test. Commit.
### Task 3: Bounded replay ring buffer
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none (extends Task 2)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs`
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/EventOptions.cs` (add `ReplayBufferCapacity`, `ReplayRetentionSeconds`)
- Test: `SessionEventDistributorTests.cs`
**Design:** Append each fanned event to a ring keyed by worker sequence, evicting by count (`ReplayBufferCapacity`) or age (`ReplayRetentionSeconds`), whichever first. Expose `TryGetReplayFrom(afterSequence, out events, out gap)`.
**Steps:** Failing test: events evicted past capacity; `TryGetReplayFrom` returns `gap=true` when requested sequence is older than the oldest retained. Implement. Green. Build+test. Commit.
### Task 4: Rewire `AttachEventSubscriber` + `EventStreamService` onto the distributor
**Classification:** high-risk (changes the live event path)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:386-408` (own a `SessionEventDistributor`; `AttachEventSubscriber` returns a lease wrapping `distributor.Register(...)`)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:27-101` (read the lease's channel instead of creating a per-RPC channel and draining the session directly; remove the per-RPC `Channel.CreateBounded` at `:43-50`)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs`
**Steps:** Failing test: a single subscriber still streams events end-to-end through the distributor (regression parity with today). Rewire. Keep per-item constraint filtering in the subscriber read loop. Green. Build + run gateway event-stream tests. Commit.
### Task 5: Per-subscriber backpressure isolation
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none (extends Tasks 2/4)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs`
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs` (overflow path `:143-162`)
- Test: `SessionEventDistributorTests.cs`
**Design:** On a subscriber channel `TryWrite` failure, complete only that subscriber's channel with `EventQueueOverflow` (policy `DisconnectSubscriber`). Retain `FailFast` → `MarkFaulted` only when the session is in legacy single-subscriber mode (back-compat).
**Steps:** Failing test: a slow subscriber overflows and is disconnected while a second subscriber keeps receiving and the session stays `Ready`. Implement. Green. Build+test. Commit.
### Task 6: Dashboard broadcaster becomes a distributor subscriber
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:131-141` (remove the inline `dashboardEventBroadcaster.Publish` tap)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (register the dashboard broadcaster as a distributor subscriber on session start)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardEventBroadcaster.cs` (consume from a distributor lease)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardEventBroadcasterTests.cs`
**Steps:** Failing test: dashboard receives session events even with **no** active gRPC subscriber (fixes the latent dark-feed bug). Implement. Green. Build + dashboard tests. Commit.
---
## Phase 2 — Multi-subscriber fan-out
### Task 7: Remove the validator block + add the subscriber cap option
**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** none (Task 8 must follow this task)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/GatewayOptionsValidator.cs:181-185` (delete the rejection)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs` (add `MaxEventSubscribersPerSession`, default 8)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Configuration/GatewayOptionsValidatorTests.cs`
**Steps:** Failing test: `AllowMultipleEventSubscribers=true` now validates clean. Remove rule, add option. Green. Build+test. Commit.
### Task 8: Subscriber-lease collection + cap enforcement
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (replace `_activeEventSubscriberCount:16` with a lease collection; honor `allowMultipleSubscribers`; reject N+1 with new `SessionManagerErrorCode.EventSubscriberLimitReached`)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManagerErrorCode.cs`
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs`
**Steps:** Failing tests: N subscribers attach concurrently up to the cap; N+1 throws `EventSubscriberLimitReached`; single-subscriber mode still rejects the 2nd. Implement. Green. Build+test. Commit.
### Task 9: Multi-subscriber end-to-end test via FakeWorkerHarness
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs`
**Steps:** Two concurrent `StreamEvents` RPCs on one session both receive every worker event; one cancels, the other continues. Build + full fake-worker suite. Commit.
---
## Phase 3 — Reconnectable sessions
### Task 10: Proto — `ReplayGap` signal (contract change)
**Classification:** high-risk (contracts → all clients)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_gateway.proto` (add a `ReplayGap` marker — a `replay_gap` bool + `oldest_available_sequence` on the stream response, or a dedicated leading status frame)
- Regenerate: `dotnet build src/ZB.MOM.WW.MxGateway.Contracts/ZB.MOM.WW.MxGateway.Contracts.csproj`; **commit** `src/ZB.MOM.WW.MxGateway.Contracts/Generated/*` (net48 regen rule — see `project_proto_codegen_regen`)
- Test: contracts build both TFMs (net10.0;net48)
**Steps:** Add field(s), regen, `del Generated/*.cs` if needed to force regen, commit generated. Build contracts both TFMs. Commit. **This unblocks Task 11 and Task 14.**
### Task 11: Detach-grace session retention
**Classification:** high-risk (session lifecycle)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (add `DetachGrace` retention: on last-subscriber-drop, keep session alive for `DetachGraceSeconds` instead of closing)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs` (`DetachGraceSeconds`; new disconnect-policy value `DetachGrace`)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionLeaseMonitorHostedService.cs` (sweep expired detach-grace windows)
- Test: `GatewaySessionTests.cs`
**Steps:** Failing test: subscriber drop under `DetachGrace` keeps the session `Ready` until the window expires, then closes. Implement. Green. Build + session/lease tests. Commit.
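
The retention bookkeeping could look like the following sketch, assuming a hypothetical `DetachGraceTracker` that `SessionLeaseMonitorHostedService` would poll through `sweep`. Times are plain seconds passed in by the caller, and the window is tracked separately from the session lease, as the design requires.

```python
class DetachGraceTracker:
    """Hypothetical bookkeeping for DetachGrace retention (times in seconds)."""

    def __init__(self, grace_seconds):
        self.grace = grace_seconds
        self.deadlines = {}  # session_id -> time at which the session closes

    def on_last_subscriber_dropped(self, session_id, now):
        # Keep the session Ready; start (or restart) its grace window.
        self.deadlines[session_id] = now + self.grace

    def on_subscriber_attached(self, session_id):
        # A reconnect inside the window cancels the pending close.
        self.deadlines.pop(session_id, None)

    def sweep(self, now):
        """Return session ids whose window expired; the caller closes them."""
        expired = [sid for sid, deadline in self.deadlines.items() if now >= deadline]
        for sid in expired:
            del self.deadlines[sid]
        return expired
```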
### Task 12: Replay-on-reconnect + emit `ReplayGap`
**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs` (on attach with `AfterWorkerSequence`, call `distributor.TryGetReplayFrom`; replay buffered events then resume live; if `gap`, send the `ReplayGap` marker first)
- Test: `EventStreamServiceTests.cs`
**Steps:** Failing tests: reconnect with a known sequence replays only newer events; reconnect past the ring horizon yields `ReplayGap`. Implement. Green. Build + test. Commit.
### Task 13: Owner re-validation on reconnect
**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** none (same service as Task 12 with a different assertion; sequence it after Task 12)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs` (reconnect requires caller `OwnerKeyId` == session owner → `PermissionDenied`)
- Test: `EventStreamServiceTests.cs`
**Steps:** Failing test: a different API key cannot resume someone else's session. Implement. Green. Build+test. Commit.
### Task 14: Client `ReplayGap` handling — all 5 clients
**Classification:** standard
**Estimated implement time:** ~5 min each (dispatch as 5 parallel sub-tasks; disjoint files)
**Parallelizable with:** each other (14a-14e)
**Files (one client each):**
- 14a dotnet: `clients/dotnet/.../` stream consumer + test
- 14b Go: `clients/go/mxgateway/` + `go test`
- 14c Python: `clients/python/src/.../` + `pytest`
- 14d Rust: `clients/rust/crates/.../` + `cargo test`/clippy
- 14e Java: `clients/java/.../` + `gradle test` (macOS JDK 21; **revert generated `MxaccessGateway.java` churn** per `project_java_generated_churn`)
**Steps (each):** Regenerate client stubs from the updated proto; surface `ReplayGap` to the caller (callback/return marker) so apps know to re-snapshot; test the gap path. Build+test that client. Commit per client.
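
For the Python client, gap handling might look like this sketch. `StreamItem`, its fields, and `consume` are hypothetical: the sketch assumes the "dedicated leading status frame" variant from Task 10, so a gap arrives as its own item with `replay_gap=True` and no payload. The returned sequence is what the caller would send as `AfterWorkerSequence` on its next reconnect.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamItem:
    """Simplified stand-in for a stream response (field names hypothetical)."""

    worker_sequence: int = 0
    replay_gap: bool = False
    payload: Optional[str] = None


def consume(stream: Iterable[StreamItem], on_event: Callable, on_gap: Callable,
            last_seen: int = 0) -> int:
    """Deliver events in order; on a gap marker, tell the app to re-snapshot."""
    for item in stream:
        if item.replay_gap:
            on_gap()  # events were lost; the app must re-read current state
            continue
        if item.worker_sequence <= last_seen:
            continue  # already delivered before the drop
        on_event(item)
        last_seen = item.worker_sequence
    return last_seen
```

The duplicate check is defensive; the server is expected to replay only events newer than `AfterWorkerSequence`.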
### Task 15: Reconnect integration test (fake worker)
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs`
**Steps:** Stream, drop, reconnect within grace with last sequence → no gap; reconnect after ring overflow → `ReplayGap`. Build + suite. Commit.
---
## Phase 4 — Per-session ACL
### Task 16: gRPC session-owner gate + all-sessions admin scope
**Classification:** high-risk (security)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/MxAccessGatewayService.cs` (`Invoke`/`StreamEvents`/`CloseSession` require caller key == session `OwnerKeyId`, or a key bearing a new all-sessions scope)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Security/Authorization/` (define the all-sessions scope; map it)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs`
**Steps:** Failing tests: foreign key gets `PermissionDenied` on another key's session; owner and all-sessions-scoped key succeed. Implement. Green. Build + gateway tests. Commit.
### Task 17: Session `Tag` + dashboard group→tag config
**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** Task 16 (disjoint files)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (+`SessionManager` to set an optional `Tag` from the open request)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/` Dashboard options (`GroupToSessionTag` map)
- Test: config-binding test
**Steps:** Failing test: a session carries its tag; config map binds. Implement. Green. Build+test. Commit.
### Task 18: EventsHub per-session ACL + hub-token session-tag claim
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** none (depends on 17)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs:39-54` (replace `TODO(per-session-acl)`: Admin sees all; Viewer allowed only if the session's tag is in the user's `GroupToSessionTag`-derived allowed set)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs` (mint an allowed-session-tag claim)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenAuthenticationHandler.cs` (carry the claim back)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/EventsHubTests.cs`
> **Decision flagged in the design doc:** default is "Admin all / Viewer by tag map." If the owner chose the strict variant (Viewers see nothing unless granted), invert the default here — the executor must confirm which before implementing.

**Steps:** Failing tests: Viewer without the tag is refused `SubscribeSession`; Admin allowed; Viewer with the mapped tag allowed. Implement. Green. Build + dashboard tests. Commit.
### Task 19: ACL tests incl. live LDAP users
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Test: `src/ZB.MOM.WW.MxGateway.IntegrationTests/DashboardLdapLiveTests.cs` (extend; gated `MXGATEWAY_RUN_LIVE_LDAP_TESTS=1`)
**Steps:** With `multi-role` (Admin) vs `gw-viewer` (Viewer), assert subscribe authorization differs by session tag. Document if skipped (no live LDAP). Commit.
---
## Phase 5 — Orphan-worker reattach (overturns the CLAUDE.md rule)
### Task 20: Stable gateway-instance id + stable pipe naming
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:433` (pipe name uses a persisted stable gateway-instance id instead of `Environment.ProcessId`)
- Create: gateway-instance-id persistence (small file/SQLite row under `C:\ProgramData\MxGateway\`)
- Test: `SessionManagerTests.cs` / a new instance-id test
**Steps:** Failing test: pipe name is stable across simulated restarts (same instance id). Implement. Green. Build + tests. Commit.
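
A sketch of the persistence step, assuming the id is a random UUID kept in a small text file (the plan leaves file vs. SQLite row open). `load_or_create_instance_id` and the pipe-name format in `worker_pipe_name` are hypothetical; the real format is whatever `SessionManager.cs:433` builds today, minus the pid. Concurrent first runs and atomic writes are not handled here.

```python
import os
import uuid


def load_or_create_instance_id(path):
    """Return the persisted gateway-instance id, creating it on first run."""
    try:
        with open(path, encoding="utf-8") as f:
            value = f.read().strip()
        if value:
            return value
    except FileNotFoundError:
        pass
    value = uuid.uuid4().hex
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(value)
    return value


def worker_pipe_name(instance_id, session_id):
    # Hypothetical format: the stable instance id replaces the gateway pid segment.
    return f"mxgateway-{instance_id}-{session_id}"
```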
### Task 21: Adoption manifest store (SQLite)
**Classification:** high-risk (persistence, security material)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Create: `src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerAdoptionManifest.cs` (persist `sessionId → workerPid, nonce, ownerKeyId, pipeName`; upsert on launch, delete on clean close)
- Modify: gateway-auth SQLite schema/migration
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/WorkerAdoptionManifestTests.cs`
> Nonce is security material — store it like other secrets (no plaintext logging; standing rule).

**Steps:** Failing test: manifest round-trips an entry; clean close removes it. Implement. Green. Build + tests. Commit.
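
A sketch of the manifest table and its two write paths using Python's built-in `sqlite3`, assuming a hypothetical `worker_adoption_manifest` table that stores a SHA-256 digest of the nonce instead of the nonce (one way to honor the note above). `INSERT OR REPLACE` gives upsert-on-launch semantics keyed by session id.

```python
import hashlib
import sqlite3

# Hypothetical table; the real migration lives in the gateway-auth SQLite schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS worker_adoption_manifest (
    session_id   TEXT PRIMARY KEY,
    worker_pid   INTEGER NOT NULL,
    nonce_sha256 TEXT NOT NULL,
    owner_key_id TEXT NOT NULL,
    pipe_name    TEXT NOT NULL
)
"""


def nonce_digest(nonce):
    # Persist only a digest of the random nonce, never the nonce itself.
    return hashlib.sha256(nonce.encode("utf-8")).hexdigest()


def upsert_on_launch(conn, session_id, worker_pid, nonce, owner_key_id, pipe_name):
    conn.execute(
        "INSERT OR REPLACE INTO worker_adoption_manifest "
        "(session_id, worker_pid, nonce_sha256, owner_key_id, pipe_name) "
        "VALUES (?, ?, ?, ?, ?)",
        (session_id, worker_pid, nonce_digest(nonce), owner_key_id, pipe_name),
    )
    conn.commit()


def delete_on_clean_close(conn, session_id):
    conn.execute(
        "DELETE FROM worker_adoption_manifest WHERE session_id = ?", (session_id,)
    )
    conn.commit()
```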
### Task 22: Proto — worker adopt/reconnect frame (contract change)
**Classification:** high-risk (contracts → worker)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_worker.proto` (add an adopt/reconnect `WorkerEnvelope` frame: worker presents `sessionId` + `nonce`; gateway ACK/NACK)
- Regenerate + **commit** `Generated/*` (net48 rule)
- Test: contracts build both TFMs
**Steps:** Add frame, regen, commit generated, build both TFMs. Commit. **Unblocks Tasks 24-25.**
### Task 23: Worker phone-home reconnect loop + self-terminate
**Classification:** high-risk (worker, net48/x86 — **windev**)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Worker/Ipc/WorkerPipeClient.cs` (on pipe drop: reconnect loop with bounded backoff to the stable pipe name; present the adopt frame)
- Modify: `src/ZB.MOM.WW.MxGateway.Worker/` runtime (self-terminate after `MaxOrphanLifetime` with no adoption)
- Test: `src/ZB.MOM.WW.MxGateway.Worker.Tests/` (net48/x86 on windev)
**Steps:** Failing test (fake pipe server): worker retries and adopts; gives up + self-terminates past the lifetime. Build x86 + worker tests on **windev**. Commit. *(net48: no init-only/positional records.)*
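
The retry policy could be sketched as a delay schedule plus a loop, assuming exponential backoff (the plan says only "bounded backoff") and that `MaxOrphanLifetime` counts total seconds spent waiting. `reconnect_schedule`, `run_phone_home`, and the injected `try_adopt`/`sleep` callables are hypothetical; the real loop is net48 C# in `WorkerPipeClient`.

```python
def reconnect_schedule(initial, maximum, max_orphan_lifetime):
    """Yield reconnect delays in seconds until the orphan lifetime is spent.

    Delays double from `initial` up to `maximum`; the last delay is trimmed so
    the total never exceeds `max_orphan_lifetime`.
    """
    if initial <= 0 or maximum < initial:
        raise ValueError("need 0 < initial <= maximum")
    elapsed = 0
    delay = initial
    while elapsed < max_orphan_lifetime:
        wait = min(delay, max_orphan_lifetime - elapsed)
        yield wait
        elapsed += wait
        delay = min(delay * 2, maximum)


def run_phone_home(try_adopt, schedule, sleep):
    """Return True once adopted, or False when the schedule runs out (self-terminate)."""
    for wait in schedule:
        sleep(wait)
        if try_adopt():
            return True
    return False
```

With `initial=1`, `maximum=8`, and a 30-second lifetime the delays are 1, 2, 4, 8, 8, 7; if no attempt is adopted, `run_phone_home` returns `False` and the worker exits.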
### Task 24: Gateway adoption — re-open pipes, nonce-validate, reject impostors
**Classification:** high-risk (security, lifecycle — **windev** for live)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Create: `src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerAdopter.cs` (startup: read manifest, re-open pipe servers, accept adopt frames, validate nonce → adopt or reject)
- Modify: gateway startup hosted-service order (adopter runs **before** `OrphanWorkerTerminator`; terminator handles only un-adoptable/foreign workers)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/OrphanWorkerAdopterTests.cs`
**Steps:** Failing tests: matching nonce adopts and rebuilds the session; mismatched nonce is rejected and the worker terminated. Implement. Green. Build + tests. Commit.
### Task 25: Resync adopted worker + `ReplayGap` to subscribers
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerAdopter.cs` (after adoption, `GetSessionState`/`GetWorkerInfo` to resync; reattached subscribers get `ReplayGap` since the ring is gone)
- Test: `OrphanWorkerAdopterTests.cs`
**Steps:** Failing test: adopted session reports resynced state; a resuming subscriber receives `ReplayGap`. Implement. Green. Build + tests. Commit.
### Task 26: `EnableOrphanReattach` flag (default off) + terminator fallback
**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/WorkerOptions.cs` (`EnableOrphanReattach`, default `false`)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerTerminator.cs` (unchanged default behavior when reattach disabled)
- Test: `OrphanWorkerTerminatorTests.cs` / adopter test
**Steps:** Failing test: with the flag off, startup terminates (today's behavior); on, it adopts. Implement. Green. Build + tests. Commit.
### Task 27: Gateway-restart reattach round-trip (integration, **windev** + live worker)
**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Test: `src/ZB.MOM.WW.MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` (gated `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`)
**Steps:** Open session → simulate gateway restart → adopter re-adopts the surviving worker → session usable → subscriber gets `ReplayGap` then live events. Run on **windev** with live MXAccess. Document if skipped.
### Task 28: Documented-rule reversals + stillpending refresh
**Classification:** trivial (doc-only)
**Estimated implement time:** ~3 min
**Parallelizable with:** none (final)
**Files:**
- Modify: `CLAUDE.md` (line ~77 — reattach now supported, opt-in/bounded)
- Modify: `docs/DesignDecisions.md` (`:63-73` reconnect, `:75-80` multi-subscriber, reattach rationale → mark superseded with this design's date/commit)
- Modify: `gateway.md` (post-v1 revisit items — reflect what shipped)
- Modify: `stillpending.md` (§2 items: mark fan-out/reconnect/ACL/reattach Resolved with commit refs)
- Modify: `docs/GatewayConfiguration.md` (new options: `MaxEventSubscribersPerSession`, `ReplayBufferCapacity`, `ReplayRetentionSeconds`, `DetachGraceSeconds`, `GroupToSessionTag`, `EnableOrphanReattach`, `MaxOrphanLifetime`)
**Steps:** Edit docs to match shipped behavior. Commit.
---
## Verification matrix
| Phase | Build/test | Host |
|---|---|---|
| 1-4 (gateway, clients) | `dotnet build` + gateway/fake-worker tests; per-client `go/pytest/cargo/gradle/dotnet test` | local (macOS) |
| 3/5 proto changes | regen + commit `Generated/`; build contracts both TFMs; rebuild touched clients | local |
| 5 worker (net48/x86) | `dotnet build -p:Platform=x86` + `Worker.Tests` | **windev** |
| 5 live reattach + Phase-4 LDAP | opt-in gated integration tests | **windev** / live LDAP |
## Final integration review
After all tasks: dispatch a final integration reviewer over `git diff main..HEAD` focusing on the live event path, concurrency in `SessionEventDistributor`, security gates (ACL + nonce adoption), and the three documented-rule reversals. Then use superpowers-extended-cc:finishing-a-development-branch.
@@ -0,0 +1,34 @@
{
"planPath": "docs/plans/2026-06-15-session-resilience.md",
"tasks": [
{"id": 108, "subject": "Task 1: Add OwnerKeyId to the session", "status": "pending"},
{"id": 109, "subject": "Task 2: SessionEventDistributor skeleton", "status": "pending", "blockedBy": [108]},
{"id": 110, "subject": "Task 3: Bounded replay ring buffer", "status": "pending", "blockedBy": [109]},
{"id": 111, "subject": "Task 4: Rewire AttachEventSubscriber + EventStreamService onto distributor", "status": "pending", "blockedBy": [110]},
{"id": 112, "subject": "Task 5: Per-subscriber backpressure isolation", "status": "pending", "blockedBy": [111]},
{"id": 113, "subject": "Task 6: Dashboard broadcaster becomes a distributor subscriber", "status": "pending", "blockedBy": [111]},
{"id": 114, "subject": "Task 7: Remove validator block + add subscriber cap option", "status": "pending", "blockedBy": [112]},
{"id": 115, "subject": "Task 8: Subscriber-lease collection + cap enforcement", "status": "pending", "blockedBy": [114]},
{"id": 116, "subject": "Task 9: Multi-subscriber end-to-end test (FakeWorkerHarness)", "status": "pending", "blockedBy": [115]},
{"id": 117, "subject": "Task 10: Proto - ReplayGap signal", "status": "pending", "blockedBy": [116]},
{"id": 118, "subject": "Task 11: Detach-grace session retention", "status": "pending", "blockedBy": [117]},
{"id": 119, "subject": "Task 12: Replay-on-reconnect + emit ReplayGap", "status": "pending", "blockedBy": [118, 110]},
{"id": 120, "subject": "Task 13: Owner re-validation on reconnect", "status": "pending", "blockedBy": [119, 108]},
{"id": 121, "subject": "Task 14: Client ReplayGap handling - all 5 clients", "status": "pending", "blockedBy": [117]},
{"id": 122, "subject": "Task 15: Reconnect integration test (fake worker)", "status": "pending", "blockedBy": [119]},
{"id": 123, "subject": "Task 16: gRPC session-owner gate + all-sessions admin scope", "status": "pending", "blockedBy": [116, 108]},
{"id": 124, "subject": "Task 17: Session Tag + dashboard group-to-tag config", "status": "pending", "blockedBy": [116]},
{"id": 125, "subject": "Task 18: EventsHub per-session ACL + hub-token tag claim", "status": "pending", "blockedBy": [124]},
{"id": 126, "subject": "Task 19: ACL tests incl. live LDAP users", "status": "pending", "blockedBy": [125]},
{"id": 127, "subject": "Task 20: Stable gateway-instance id + stable pipe naming", "status": "pending", "blockedBy": [126]},
{"id": 128, "subject": "Task 21: Adoption manifest store (SQLite)", "status": "pending", "blockedBy": [127]},
{"id": 129, "subject": "Task 22: Proto - worker adopt/reconnect frame", "status": "pending", "blockedBy": [128]},
{"id": 130, "subject": "Task 23: Worker phone-home reconnect loop + self-terminate", "status": "pending", "blockedBy": [129]},
{"id": 131, "subject": "Task 24: Gateway adoption - re-open pipes, nonce-validate, reject impostors", "status": "pending", "blockedBy": [130]},
{"id": 132, "subject": "Task 25: Resync adopted worker + ReplayGap to subscribers", "status": "pending", "blockedBy": [131, 119]},
{"id": 133, "subject": "Task 26: EnableOrphanReattach flag (default off) + terminator fallback", "status": "pending", "blockedBy": [131]},
{"id": 134, "subject": "Task 27: Gateway-restart reattach round-trip (WINDEV + live worker)", "status": "pending", "blockedBy": [132, 133]},
{"id": 135, "subject": "Task 28: Documented-rule reversals + stillpending refresh", "status": "pending", "blockedBy": [134]}
],
"lastUpdated": "2026-06-15"
}
@@ -0,0 +1,152 @@
# Still-Pending Completion — Design
**Date:** 2026-06-15
**Source:** `stillpending.md` (audit at commit `c7f754c`)
**Branch:** `feat/stillpending-completion`
**Status:** Design approved; implementation plan to follow.
## Goal
Close the genuinely actionable items in `stillpending.md`: the 11 unimplemented
worker command kinds (§1.1), audit-record CorrelationId threading (§1.2), the
client CLI/helper parity gaps (§4), and the documentation/residual-recording
hygiene (§7). Items that are deliberate v1 scope (§2), vendor- or rig-gated
(§1.3, §1.4, §3), or opt-in verification gates (§5) are documented, not built.
## Key discovery that shapes the work
All 11 worker commands **already** have proto request *and* reply messages,
gateway-side validation (`MxAccessGrpcRequestValidator.cs:86-111`), scope
mapping (`GatewayGrpcScopeResolver.cs:45-54`), and generic pass-through routing
(`MxAccessGatewayService.Invoke`). Therefore:
- **No `.proto` changes** — no codegen, no net48 `CS0246` regen risk.
- **No gateway routing changes** for the 11 commands.
- The real code surface is: worker executor arms, six new COM-wrapper methods,
the constraint-path CorrelationId thread, and client CLI/helper additions.
8 of the 11 have dedicated reply messages; 3 (`SetBufferedUpdateInterval`,
`Ping`, `ShutdownWorker`) intentionally return the base OK reply.
## Architecture
Two-process design is unchanged. Work lands in five independent workstreams.
### Workstream A — Worker control/lifecycle commands (5)
`Ping`, `GetSessionState`, `GetWorkerInfo`, `DrainEvents`, `ShutdownWorker`.
No COM involved. New arms in `MxAccessCommandExecutor.Execute`
(`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:97-128`).
The executor currently holds only COM collaborators, so it gains the runtime
state these commands report on:
- `DrainEvents` → `MxAccessEventQueue.Drain(maxEvents)` (already exists;
`maxEvents == 0` drains all) → `DrainEventsReply { repeated MxEvent events }`.
- `Ping` → echoes `PingCommand.message`; base OK reply.
- `GetSessionState` → `SessionStateReply { SessionState state }` from the
runtime/heartbeat snapshot.
- `GetWorkerInfo` → `WorkerInfoReply { worker_process_id, worker_version,
mxaccess_progid, mxaccess_clsid }` from `MxAccessInteropInfo` + process info.
- `ShutdownWorker` → honor `grace_period` then signal `StaRuntime` shutdown;
base OK reply.
Where the control arms physically live (inside `MxAccessCommandExecutor` with
injected collaborators, vs. intercepted one layer up where the runtime context
already exists) is an implementation decision for the plan; the contract surface
is identical either way.
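The five control arms can be sketched as a small executor that receives its collaborators through the constructor, which is the preferred seam. This is a minimal Python sketch, not the worker's C# code. `EventQueue`, `ControlCommandExecutor`, the string command kinds and the dict-shaped replies are hypothetical stand-ins for `MxAccessEventQueue`, `MxAccessCommandExecutor`, `MxCommandKind` and the proto replies.

```python
from collections import deque


class EventQueue:
    """Hypothetical stand-in for MxAccessEventQueue: a FIFO of pending events."""

    def __init__(self, events=()):
        self._items = deque(events)

    @property
    def count(self):
        return len(self._items)

    def drain(self, max_events):
        # max_events == 0 means "drain everything currently queued".
        # (The worker's Drain takes a uint, so negatives are not modeled.)
        n = len(self._items) if max_events == 0 else min(max_events, len(self._items))
        return [self._items.popleft() for _ in range(n)]


class ControlCommandExecutor:
    """Hypothetical executor: collaborators are injected, no COM is involved."""

    def __init__(self, events, session_state, worker_info, request_shutdown):
        self._events = events                      # EventQueue
        self._session_state = session_state        # callable -> current state name
        self._worker_info = worker_info            # dict: pid, version, progid, clsid
        self._request_shutdown = request_shutdown  # callable(grace_period)

    def execute(self, kind, command):
        if kind == "Ping":
            # Base OK reply: Ping has no dedicated reply message.
            return {"status": "OK"}
        if kind == "DrainEvents":
            return {"status": "OK", "events": self._events.drain(command.get("max_events", 0))}
        if kind == "GetSessionState":
            return {"status": "OK", "state": self._session_state()}
        if kind == "GetWorkerInfo":
            return {"status": "OK", **self._worker_info}
        if kind == "ShutdownWorker":
            # Signal only; the pump winds down after this reply is returned.
            self._request_shutdown(command.get("grace_period", 0))
            return {"status": "OK"}
        # Unknown kinds fall through to an invalid-request reply, never an exception.
        return {"status": "INVALID_REQUEST"}
```

Whichever layer ends up owning these arms, the reply contract stays the same; the sketch only shows why the executor needs the queue, a state source, identity data and a shutdown signal.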
### Workstream B — Worker MXAccess COM commands (6)
`Suspend`, `Activate`, `AuthenticateUser`, `ArchestrAUserToId`,
`AddBufferedItem`, `SetBufferedUpdateInterval`. Each needs:
1. A wrapper method on `IMxAccessServer`
(`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/IMxAccessServer.cs`) and dispatch in
`MxAccessComServer` (`MxAccessComServer.cs`). **Open question, to be resolved
on windev:** which native interface (`ILMXProxyServer` / `…3` / `…4`) exposes
each method — confirm against the interop and `docs/MXAccess-Public-API.md`
before writing the dispatch.
2. An executor arm mapping request → wrapper call → reply
(`SuspendReply`/`ActivateReply` carry `MxStatusProxy`; `AuthenticateUserReply`/
`ArchestrAUserToIdReply` carry `user_id`; `AddBufferedItemReply` carries
`item_handle`; `SetBufferedUpdateInterval` returns base OK).
`AuthenticateUser` credentials must never reach logs (standing rule).
`AddBufferedItem`/`SetBufferedUpdateInterval` enable the already-wired
`OnBufferedDataChange` event path (`MxAccessEventMapper.cs:231-254`); per the
approved decision we **verify the buffered round-trip live on windev**,
capturing a real multi-sample batch to validate the §3.2 conversion path
(`Conversion/VariantConverter.cs`).
### Workstream C — §1.2 audit CorrelationId
Thread `request.ClientCorrelationId` from `MxAccessGatewayService.Invoke`
→ `ApplyConstraintsAsync` → the six filter helpers (`EnforceReadTagAsync`,
`EnforceWriteHandleAsync`, `FilterTagBulkAsync`, `FilterReadBulkAsync`,
`FilterWriteBulkAsync`, `FilterHandleBulkAsync`) →
`IConstraintEnforcer.RecordDenialAsync`, which gains a `correlationId`
parameter.
**Type detail:** `AuditEvent.CorrelationId` is `Guid?`, but `ClientCorrelationId`
is a free-form string. The enforcer does `Guid.TryParse` and stores the value
when parseable, else null. No audit-schema or contract change. Remove the
`TODO(Task 2.3)` at `ConstraintEnforcer.cs:134-136`.
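The parse-or-null rule can be sketched with Python's standard `uuid` module. `parse_correlation_id` is a hypothetical helper illustrating the policy only: `uuid.UUID` accepts a somewhat different set of string formats than .NET's `Guid.TryParse`, so it is not a drop-in model of the gateway's parser.

```python
import uuid
from typing import Optional


def parse_correlation_id(client_correlation_id: Optional[str]) -> Optional[uuid.UUID]:
    """Return the GUID when the free-form client id parses as one, else None.

    Mirrors the policy: parseable -> stored on the audit record, anything else
    -> null. No audit-schema change is needed because the field is already a
    nullable GUID.
    """
    if not client_correlation_id:
        return None
    try:
        return uuid.UUID(client_correlation_id)
    except ValueError:
        # Free-form ids such as "order-42" are legal for clients but are not
        # persisted as a CorrelationId.
        return None
```

A non-GUID correlation id is therefore not an error for the caller; it simply produces an audit record without a `CorrelationId`.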
### Workstream D — Client CLI/helper parity (5 clients)
- **Go** `Write2` single-session helper, modeled on `Write` (`session.go:559`).
- **Python CLI** — add the 4 `galaxy-*` commands wrapping existing `galaxy.py`
library methods (`test_connection`, `get_last_deploy_time`,
`discover_hierarchy`, `watch_deploy_events`).
- **`ping` CLI** — add to Go (`cmd/mxgw-go/main.go`) and Java (`MxGatewayCli.java`).
- **`browse` CLI** — add to **all 5**, wrapping each client's existing
`LazyBrowseNode`/`Browse` helper → 0/5 → 5/5 parity.
- **Galaxy name unification** — add canonical `galaxy-test-connection` /
`galaxy-last-deploy` to Java, keeping `galaxy-test` / `galaxy-deploy-time` as
**deprecated aliases** (no script breakage).
- **`version` in dotnet** — the explorer found a `version` path at
`MxGatewayClientCli.cs:85` that conflicts with the audit's §4.4. Treat as
**verify-then-fix-only-if-genuinely-missing**.
### Workstream E — Docs/hygiene (§7) + residual recording
- D1 plan header (`docs/plans/2026-06-14-deferred-followups.md:4`).
- Stale STA "production fix needed" prose (`docs/AlarmClientDiscovery.md:765-774`).
- Stale EventsHub follow-up comment (`Dashboard/Hubs/EventsHub.cs:9-17`).
- CLAUDE.md project-name drift (`MxGateway.*` → `ZB.MOM.WW.MxGateway.*`).
- Remove dead `MapSqlException` (`GalaxyRepositoryGrpcService.cs:350-360`).
- Record §1.3 (live failover counter unproven on rig) and §1.4 (8-arg ack drops
operator domain/full-name; vendor stub) as explicitly-documented residuals.
- Update `stillpending.md` to reflect what closed.
## Data flow / error handling specifics
- Control commands return base/typed replies with `ProtocolStatusCode.Ok`;
failures map to the existing `CreateInvalidRequestReply` /
`CreateAlarmFailureReply` helpers, never a thrown exception across the STA.
- COM commands surface the native `HResult` on the reply exactly as the other
COM arms do — MXAccess parity means we do not "fix" surprising returns
(e.g. `AuthenticateUser` is allowed to fail; the live test tolerates it).
- `ShutdownWorker` must not deadlock the STA: signal shutdown, return the reply,
then let the pump drain — sequencing is a plan-level concern.
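The signal-then-reply ordering can be sketched with a single-threaded work pump. `StaPump` and `execute_shutdown_worker` are hypothetical Python stand-ins for `StaRuntime` and the executor arm, and the grace period is not modeled. The handler runs on the pump thread itself, so it only sets a flag and enqueues a stop sentinel; waiting for the pump to exit from inside that handler could never complete.

```python
import queue
import threading


class StaPump:
    """Hypothetical single-threaded work pump standing in for StaRuntime."""

    def __init__(self):
        self._work = queue.Queue()
        self._lock = threading.Lock()
        self._stopping = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, fn):
        """Queue fn for the pump thread; returns (done_event, result_slot)."""
        done, slot = threading.Event(), {}

        def item():
            try:
                slot["result"] = fn()
            except Exception as exc:  # surface failures instead of killing the pump
                slot["error"] = exc
            finally:
                done.set()

        with self._lock:
            if self._stopping:
                raise RuntimeError("pump is shutting down")
            self._work.put(item)
        return done, slot

    def request_shutdown(self):
        """Signal only, so it is safe to call from the pump thread itself."""
        with self._lock:
            if not self._stopping:
                self._stopping = True
                self._work.put(None)  # sentinel lands behind all accepted work

    def join(self, timeout=None):
        self._thread.join(timeout)
        return not self._thread.is_alive()

    def _run(self):
        while True:
            item = self._work.get()
            if item is None:
                return  # everything queued before the sentinel has already run
            item()


def execute_shutdown_worker(pump):
    """Runs ON the pump thread: signal shutdown, then return the reply."""
    pump.request_shutdown()
    # Calling pump.join() here would wait for the very thread running this
    # function, so the reply must be returned first and the pump exits after.
    return {"status": "OK"}
```

Work accepted before the request still runs because the sentinel is queued behind it, and later submissions are rejected under the same lock, so nothing can slip in after the sentinel.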
## Testing & cross-platform verification
| Workstream | Build/test | Host |
|---|---|---|
| A, B (worker) | `Worker.Tests` xUnit with a fake `IMxAccessServer` asserting each arm calls the right wrapper and maps the reply | **windev** (net48/x86) |
| B (live) | `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1` smoke incl. buffered capture | **windev** + live MXAccess |
| C (gateway) | unit test: denied op persists parsed `CorrelationId` | local |
| D (clients) | `go test`, `pytest`, `cargo test`+clippy, `gradle test`, `dotnet test` | local; Java on windev |
| E (docs) | doc-only; `--check` regen where applicable | local |
TDD throughout, per-task commits, docs updated in the same change as source.
## Out of scope (documented, not built)
§1.3 live-drive (rig can't drive a real failover), §1.4 actual delivery +
§3.4/§3.5 (AVEVA `AlarmAckByName` stub returns -55, `AlarmAckByGUID` is
`E_NOTIMPL`), §3.1/§3.2-conversion-fix/§3.3/§3.6/§3.7 (await live captures),
all of §2 (deliberate v1 scope, incl. validator-blocked multi-subscriber), §5
(opt-in verification gates), §7.6 (`Won't Fix` review findings).
@@ -0,0 +1,334 @@
# Still-Pending Completion Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.
**Goal:** Close the actionable items in `stillpending.md` — 11 unimplemented worker command kinds (§1.1), audit CorrelationId threading (§1.2), client CLI/helper parity (§4), and doc hygiene (§7).
**Architecture:** Two-process gateway/worker design is unchanged. All 11 worker commands already have proto request+reply messages, gateway validation, scope mapping, and generic pass-through routing — so the work is **worker executor arms + 6 new COM-wrapper methods + a gateway constraint-path CorrelationId thread + client CLI additions**. **Zero `.proto` changes**, therefore no codegen and no net48 regen risk.
**Tech Stack:** .NET 10 (gateway, x64), .NET Framework 4.8 (worker, x86, MXAccess COM on STA), Go/Python/Rust/Java/.NET clients. Worker net48/x86 + Java client build/test on Windows host `windev` (10.100.0.48, passwordless ssh, PowerShell); everything else builds locally on macOS.
**Design source:** `docs/plans/2026-06-15-stillpending-completion-design.md`.
**Branch:** `feat/stillpending-completion` (already created).
---
## Cross-platform build reference (read before any worker/Java task)
- **Worker (net48/x86) + Worker.Tests + Java client** do NOT build on macOS. Build/test them on `windev`:
- Copy the working tree to `windev` (or use the existing build-worktree pattern), then run `git fetch && git reset --hard origin/<branch>` in the build worktree (NEVER trust a stale local `main`; see memory `project_deploy_mechanics`).
- Build: `dotnet build src/ZB.MOM.WW.MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86`
- Test: `dotnet test src/ZB.MOM.WW.MxGateway.Worker.Tests/MxGateway.Worker.Tests.csproj -p:Platform=x86`
- Live MXAccess: set `$env:MXGATEWAY_RUN_LIVE_MXACCESS_TESTS = "1"` then run the IntegrationTests filter.
- Nested ssh→PowerShell mangles quotes; scp a `.ps1` and run `powershell -NoProfile -ExecutionPolicy Bypass -File`. Wrap git in `cmd /c "git ... 2>&1"`.
- **net48 worker C#:** no init-only props / positional records (no `IsExternalInit`); use `{ get; set; }` or ctors (memory `project_net48_worker_csharp`).
- **Gateway, .NET client, Go, Rust, Python** build+test locally on macOS.
---
## Workstream A — Worker control/lifecycle commands (5)
These add arms to `MxAccessCommandExecutor.Execute` (`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:90-129`). They need runtime state the executor does not currently hold. **Task A0 establishes how the executor reaches that state; do it first.**
### Task A0: Decide & wire control-command collaborators into the executor
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** none (A1-A5 depend on it)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs` (constructor + fields)
- Read first: `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessEventQueue.cs` (`Drain(uint)`:168, `Count`:58), `WorkerRuntimeHeartbeatSnapshot.cs`, `src/ZB.MOM.WW.MxGateway.Worker/Sta/StaRuntime.cs` (`IsRunning`:80, `Shutdown()`), `MxAccessInteropInfo.cs` (progid/clsid), the executor's existing construction site (grep `new MxAccessCommandExecutor(`)
**What to do:** The 5 control commands need: the event queue (DrainEvents), a session-state source (GetSessionState), worker identity — pid/version/progid/clsid (GetWorkerInfo), and a shutdown signal (ShutdownWorker). Determine the cleanest seam:
- **Preferred:** inject the collaborators the executor lacks (event queue reference, a `Func<SessionState>` or the session object, `MxAccessInteropInfo`, and a shutdown delegate/`Action`) via the constructor, matching how its existing COM collaborator is passed.
- If the executor's construction site shows control commands are better intercepted one layer up (where `StaRuntime`/session context already lives), surface that to the controller before proceeding — do NOT silently relocate the dispatch.
**Acceptance:** executor compiles on windev with new collaborators available to A1-A5; no behavior change yet (arms still fall through). Commit.
> Note: A1-A5 are sequential edits to the same `Execute` switch + helper region of one file, so they are NOT parallelizable with each other. Bundle their review.
### Task A1: Ping
**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** none (same file as A0/A2-A5)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs` (add `MxCommandKind.Ping` arm + `ExecutePing`)
- Test: `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/MxAccessCommandExecutorTests.cs` (or the existing executor test file — grep to confirm name)
**Step 1 — failing test:** assert that `Execute` with a `Ping` command (`PingCommand { Message = "hi" }`) returns `ProtocolStatusCode.Ok` and `Hresult == 0`. `Ping` has no dedicated reply message, so the OK status is the required assertion; assert the echoed message only if the base reply carries it (e.g. in a diagnostic field). Build/test on windev.
**Step 2 — run, expect FAIL** (currently INVALID_REQUEST).
**Step 3 — implement:** add `MxCommandKind.Ping => ExecutePing(command),` to the switch (`:99-126` region). `ExecutePing` returns `CreateOkReply(command)` (helper at `:784`).
**Step 4 — run, expect PASS** on windev.
**Step 5 — commit:** `feat(worker): implement Ping command`
### Task A2: DrainEvents
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Modify: `MxAccessCommandExecutor.cs` (`DrainEvents` arm + `ExecuteDrainEvents`)
- Test: executor test file
**Steps (TDD):** test that `DrainEvents { MaxEvents = N }` drains up to N from the injected `MxAccessEventQueue` and returns `DrainEventsReply { events = [...] }` (reply field 102). `MaxEvents == 0` drains all. Map each `WorkerEvent` → `MxEvent` using the existing event-mapping path (grep how the live event loop converts `WorkerEvent` → `MxEvent`; reuse, do not duplicate). Build/test windev. Commit `feat(worker): implement DrainEvents command`.
### Task A3: GetSessionState
**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** none
**Files:** `MxAccessCommandExecutor.cs` + executor test.
**Steps:** test that `GetSessionState` returns `SessionStateReply { State = <current> }` (reply field 100) mapping the worker's lifecycle to the proto `SessionState` enum (READY when the STA is running). Build/test windev. Commit `feat(worker): implement GetSessionState command`.
### Task A4: GetWorkerInfo
**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** none
**Files:** `MxAccessCommandExecutor.cs` + executor test.
**Steps:** test that `GetWorkerInfo` returns `WorkerInfoReply { WorkerProcessId, WorkerVersion, MxaccessProgid, MxaccessClsid }` (reply field 101) sourced from `Process.GetCurrentProcess().Id`, the worker assembly version, and `MxAccessInteropInfo` (progid `LMXProxy.LMXProxyServer.1`, clsid `{C30B52F5-2CB5-4760-AF0A-3A344A7EB5DC}`). Build/test windev. Commit `feat(worker): implement GetWorkerInfo command`.
### Task A5: ShutdownWorker
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:** `MxAccessCommandExecutor.cs` + executor test.
**Steps:** test that `ShutdownWorker { GracePeriod }` returns a base OK reply and triggers the injected shutdown signal **after** the reply is produced (must not deadlock the STA — signal shutdown, return reply, let the pump drain). Verify the grace period is honored (or documented as best-effort). Build/test windev. Commit `feat(worker): implement ShutdownWorker command`.
### Task A6: Make FakeWorkerHarness respond to control commands
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** none (depends on A1-A5 reply shapes)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/Fakes/FakeWorkerHarness.cs`
- Test: a gateway-side test that invokes `Ping`/`GetWorkerInfo`/`DrainEvents` through the harness and asserts the reply (builds locally on macOS).
**Why:** the audit (§1.1) flagged that control kinds were "exercised only through `FakeWorkerHarness`" but the harness is a passive relay that does not auto-respond — so gateway tests could not actually cover them. Add canned responses so the gateway↔worker round-trip for these commands is verified in the default (no-MXAccess) suite. Commit `test(gateway): fake worker responds to control commands`.
---
## Workstream B — Worker MXAccess COM commands (6)
`Suspend`, `Activate`, `AuthenticateUser`, `ArchestrAUserToId`, `AddBufferedItem`, `SetBufferedUpdateInterval`. **Task B0 (windev interop inspection) MUST run first** — the native interface exposing each method is unknown until inspected.
### Task B0: Resolve native COM signatures on windev
**Classification:** standard
**Estimated implement time:** ~5 min (investigation)
**Parallelizable with:** A-workstream tasks (different files/host activity)
**Files:**
- Read on windev: the generated interop for `ArchestrA.MXAccess.dll` (the `ILMXProxyServer` / `ILMXProxyServer3` / `ILMXProxyServer4` RCW definitions), `C:\Users\dohertj2\Desktop\mxaccess\docs\MXAccess-Public-API.md` (method list/signatures).
- Output: a short note appended to this plan (or a comment block) recording, for each of the 6 methods, which interface version exposes it and its exact signature.
**What to do:** Confirm the exact native signatures for `Suspend(int serverHandle, int itemHandle)`, `Activate(int serverHandle, int itemHandle)`, `AuthenticateUser(int serverHandle, string verifyUser, string verifyUserPassword)` → user id, `ArchestrAUserToId(int serverHandle, string userIdGuid)` → user id, `AddBufferedItem(int serverHandle, string itemDefinition, string itemContext)` → item handle, `SetBufferedUpdateInterval(int serverHandle, int intervalMs)`. If any method is **not** present on the installed interop (mirroring the §3.4/§3.5 vendor-stub pattern for alarms), STOP and surface it — implement only the available ones and record the rest as vendor-gated residuals. Commit the note.
### Task B1: Add 6 wrapper methods to IMxAccessServer + MxAccessComServer
**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** none (blocked by B0)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/IMxAccessServer.cs` (add 6 method declarations)
- Modify: `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessComServer.cs` (dispatch to the interface version resolved in B0, mirroring existing methods like `Write2`:173, `AddItem2`:84)
- Modify: any fake/test `IMxAccessServer` implementation (grep `: IMxAccessServer`) to add the 6 methods (return canned values).
- Test: `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/...` for `MxAccessComServer` if one exists.
**Steps (TDD):** add the 6 declarations; implement dispatch following the existing version-selection pattern; update fakes so the solution compiles. Build on windev `-p:Platform=x86`. Commit `feat(worker): add MXAccess COM wrappers for suspend/activate/auth/buffered`.
> B2-B7 are sequential edits to the same `Execute` switch; not parallelizable with each other. Bundle review.
### Task B2: Suspend arm
**Classification:** small · **~3 min** · **Parallelizable with:** none
**Files:** `MxAccessCommandExecutor.cs` + executor test.
TDD: `Suspend { ServerHandle, ItemHandle }` calls the wrapper and returns `SuspendReply { Status = MxStatusProxy }` (reply field 24). Use a fake `IMxAccessServer` asserting the call. Build/test windev. Commit `feat(worker): implement Suspend command`.
### Task B3: Activate arm
**Classification:** small · **~3 min** · **Parallelizable with:** none
**Files:** `MxAccessCommandExecutor.cs` + executor test.
TDD: `Activate { ServerHandle, ItemHandle }` → `ActivateReply { Status }` (field 25). Build/test windev. Commit `feat(worker): implement Activate command`.
### Task B4: AuthenticateUser arm
**Classification:** standard · **~4 min** · **Parallelizable with:** none
**Files:** `MxAccessCommandExecutor.cs` + executor test.
TDD: `AuthenticateUser { ServerHandle, VerifyUser, VerifyUserPassword }` → `AuthenticateUserReply { UserId }` (field 26). **Credentials must never be logged** (standing rule); assert no log statement includes the password. AuthenticateUser is allowed to fail (surface the native HResult, do not throw). Build/test windev. Commit `feat(worker): implement AuthenticateUser command`.
### Task B5: ArchestrAUserToId arm
**Classification:** small · **~3 min** · **Parallelizable with:** none
**Files:** `MxAccessCommandExecutor.cs` + executor test.
TDD: `ArchestrAUserToId { ServerHandle, UserIdGuid }` → `ArchestrAUserToIdReply { UserId }` (field 27). Build/test windev. Commit `feat(worker): implement ArchestrAUserToId command`.
### Task B6: AddBufferedItem arm
**Classification:** standard · **~4 min** · **Parallelizable with:** none
**Files:** `MxAccessCommandExecutor.cs` + executor test.
TDD: `AddBufferedItem { ServerHandle, ItemDefinition, ItemContext }` → `AddBufferedItemReply { ItemHandle }` (field 23). Build/test windev. Commit `feat(worker): implement AddBufferedItem command`.
### Task B7: SetBufferedUpdateInterval arm
**Classification:** small · **~3 min** · **Parallelizable with:** none
**Files:** `MxAccessCommandExecutor.cs` + executor test.
TDD: `SetBufferedUpdateInterval { ServerHandle, UpdateIntervalMilliseconds }` → base OK reply (no dedicated reply message). Build/test windev. Commit `feat(worker): implement SetBufferedUpdateInterval command`.
### Task B8: Live COM smoke + buffered capture on windev
**Classification:** high-risk
**Estimated implement time:** ~5 min (authoring; live run is manual)
**Parallelizable with:** none (blocked by B1-B7)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` (the existing AuthenticateUser send at ~line 919/931 should now get an OK/typed reply instead of INVALID_REQUEST; add Suspend/Activate/AddBufferedItem+SetBufferedUpdateInterval sends).
- Possibly: `src/ZB.MOM.WW.MxGateway.Worker.Tests/Probes/` for a buffered-capture probe.
**Steps:** Under `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1` on windev: verify the 4 unambiguous COM commands round-trip; then `AddBufferedItem` + `SetBufferedUpdateInterval` on a real tag and **capture a multi-sample `OnBufferedDataChange` batch** to validate the §3.2 `VariantConverter` path. If the buffered conversion proves correct, record it; if it surfaces a conversion bug, STOP and report (do not silently ship). If a live buffered sample cannot be elicited on the rig, record buffered round-trip as the documented residual (close the command gap, leave §3.2 open). Commit `test(integration): live COM command + buffered capture smoke`.
---
## Workstream C — §1.2 gateway audit CorrelationId
### Task C1: Thread ClientCorrelationId into constraint-denial audit records
**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** all A/B/D tasks (gateway-only files, builds locally)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Security/Authorization/IConstraintEnforcer.cs` (add `string? correlationId` param to `RecordDenialAsync`, signature at `:49-54`)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Security/Authorization/ConstraintEnforcer.cs` (`:124-158`): accept the param, `Guid.TryParse` it into the `Guid? CorrelationId` audit field (was hardcoded `null` at `:147`); remove `TODO(Task 2.3)` at `:134-136`.
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/MxAccessGatewayService.cs`: thread `request.ClientCorrelationId` from `Invoke` (`:96`) → `ApplyConstraintsAsync` (`:279`) → the 6 filter helpers (`EnforceReadTagAsync`:427, `EnforceWriteHandleAsync`:448, `FilterTagBulkAsync`:474, `FilterReadBulkAsync`:529, `FilterWriteBulkAsync`:584, `FilterHandleBulkAsync`:656) → `RecordDenialAsync`.
- Test: `src/ZB.MOM.WW.MxGateway.Tests/...` constraint-enforcer / gateway-service test.
**Step 1 — failing test:** a denied operation with `ClientCorrelationId = "<a real GUID>"` persists an audit record whose `CorrelationId` equals that GUID; a non-GUID correlation id persists `null` (documented behavior). Run locally: `dotnet test src/ZB.MOM.WW.MxGateway.Tests/MxGateway.Tests.csproj --filter <name>`.
**Step 2 — FAIL** (currently always null).
**Step 3 — implement** the threading + `Guid.TryParse`.
**Step 4 — PASS** locally + full gateway suite green.
**Step 5 — commit:** `feat(gateway): thread ClientCorrelationId into constraint-denial audit (§1.2)`
---
## Workstream D — Client CLI/helper parity (5 clients)
All D tasks touch disjoint client trees and are parallelizable across languages. Each builds/tests on its own toolchain (Java on windev; the rest local).
### Task D1: Go single-shot Write2 helper
**Classification:** small · **~3 min** · **Parallelizable with:** D2-D9
**Files:**
- Modify: `clients/go/mxgateway/session.go` (add `Write2`/`Write2Raw` after `Write`:559, modeled on `Write` + the `Write2Bulk`:427 payload shape)
- Test: `clients/go/mxgateway/session_test.go` (or nearest)
TDD: `Write2(ctx, serverHandle, itemHandle, value, timestampValue *MxValue, userID int32) error` issues `MX_COMMAND_KIND_WRITE2` with `Write2Command{ServerHandle,ItemHandle,Value,TimestampValue,UserId}`. Verify: `gofmt`, `go build ./...`, `go test ./...` from `clients/go`. Commit `feat(go): add single-shot Write2 session helper (§4.1)`.
### Task D2: Python galaxy-* CLI commands (4)
**Classification:** standard · **~5 min** · **Parallelizable with:** D1, D3-D9
**Files:**
- Modify: `clients/python/src/zb_mom_ww_mxgateway_cli/commands.py` (add `galaxy-test-connection`, `galaxy-last-deploy`, `galaxy-discover`, `galaxy-watch` Click commands wrapping `galaxy.py` `test_connection`/`get_last_deploy_time`/`discover_hierarchy`/`watch_deploy_events`; mirror the existing `ping` command structure at `:221`)
- Modify: `clients/python/README.md:217` (correct the understated galaxy CLI claim)
- Test: `clients/python/tests/` CLI test
TDD then `python -m pytest` from `clients/python`. Commit `feat(python): add galaxy-* CLI commands (§4.2)`.
### Task D3: ping CLI in Go
**Classification:** small · **~3 min** · **Parallelizable with:** others
**Files:** `clients/go/cmd/mxgw-go/main.go` (add `ping` case to the switch ~`:77-130`/`:1199`, modeled on an existing simple command) + test.
TDD; `gofmt`, `go build ./...`, `go test ./...`. Commit `feat(go): add ping CLI subcommand (§4.3)`.
### Task D4: ping CLI in Java
**Classification:** small · **~3 min** · **Parallelizable with:** others — **build on windev**
**Files:** `clients/java/zb-mom-ww-mxgateway-cli/src/main/java/com/zb/mom/ww/mxgateway/cli/MxGatewayCli.java` (register a `ping` subcommand ~`:126-149`) + test.
TDD; `gradle test` on windev. Commit `feat(java): add ping CLI subcommand (§4.3)`.
### Task D5: browse CLI — Go
**Classification:** standard · **~4 min** · **Parallelizable with:** others
**Files:** `clients/go/cmd/mxgw-go/main.go` (new `browse` command wrapping `GalaxyClient.Browse`:398 / `LazyBrowseNode.Expand`:337) + test. `go build/test`. Commit `feat(go): add browse CLI (§4.6)`.
### Task D6: browse CLI — Python
**Classification:** standard · **~4 min** · **Parallelizable with:** others
**Files:** `clients/python/src/zb_mom_ww_mxgateway_cli/commands.py` (new `browse` command wrapping `galaxy.py` `browse`:163) + test. `pytest`. Commit `feat(python): add browse CLI (§4.6)`.
### Task D7: browse CLI — Rust
**Classification:** standard · **~4 min** · **Parallelizable with:** others
**Files:** `clients/rust/crates/mxgw-cli/src/main.rs` (new `Browse` command variant wrapping the galaxy browse helper in `galaxy.rs`) + test. `cargo fmt`, `cargo test --workspace`, `cargo clippy --all-targets -- -D warnings`. Commit `feat(rust): add browse CLI (§4.6)`.
### Task D8: browse CLI — Java + dotnet
**Classification:** standard · **~5 min** · **Parallelizable with:** others — **Java builds on windev**
**Files:** `clients/java/.../MxGatewayCli.java` (browse subcommand wrapping `GalaxyRepositoryClient.browse`) + `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs` (`browse` command wrapping `LazyBrowseNode.ExpandAsync`:63) + tests. `gradle test` (windev), `dotnet test` (local). Commit `feat(dotnet,java): add browse CLI (§4.6) — 5/5 parity`.
### Task D9: Java galaxy-name aliases + verify dotnet version
**Classification:** small · **~4 min** · **Parallelizable with:** others — **Java builds on windev**
**Files:**
- Modify: `clients/java/.../MxGatewayCli.java:145-146` — add canonical `galaxy-test-connection`/`galaxy-last-deploy` as the primary names; keep `galaxy-test`/`galaxy-deploy-time` as **deprecated aliases** (picocli `@Command(name=..., aliases={...})` or equivalent).
- Verify: `clients/dotnet/.../MxGatewayClientCli.cs` — the explorer found a `version` path at `:85` that conflicts with audit §4.4. **Read it**: if a `version` subcommand genuinely works, no change (note it in the §7 update); if it's only a `--version` flag and `IsKnownGatewayCommand` lacks `version`, add the subcommand. Do not add what already exists.
- Test: Java CLI test asserting both names resolve.
`gradle test` (windev), `dotnet build/test` (local). Commit `feat(java): galaxy command aliases; chore(dotnet): verify version subcommand (§4.4,§4.5)`.
---
## Workstream E — Docs/hygiene + residual recording
### Task E1: Doc hygiene + dead-code removal
**Classification:** small · **~5 min** · **Parallelizable with:** all (mostly doc-only; one code deletion)
**Files:**
- `docs/plans/2026-06-14-deferred-followups.md:4` — change "Plan only — NOT yet executed" to reflect D1 done (`4af24b9`).
- `docs/AlarmClientDiscovery.md:765-774` — rewrite stale STA "production fix needed" prose (alarms now run through worker STA / `GatewayAlarmMonitor`).
- `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs:9-17` — remove/update stale "publisher side is a follow-up" comment (broadcaster shipped).
- `CLAUDE.md` — fix project-name drift `src/MxGateway.*` → `src/ZB.MOM.WW.MxGateway.*` throughout.
- `src/ZB.MOM.WW.MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:350-360` — remove dead IDE0051-suppressed `MapSqlException`.
**Verify:** `dotnet build src/ZB.MOM.WW.MxGateway.Server` locally (the only code change is the deletion). Commit `docs+chore: fix stale prose, project names, remove dead MapSqlException (§7)`.
### Task E2: Record §1.3 and §1.4 residuals + refresh stillpending.md
**Classification:** trivial · **~3 min** · **Parallelizable with:** all (doc-only)
**Files:**
- `docs/plans/2026-06-14-deferred-followups.md` — record §1.3 (provider_switches counter live-exercise unproven; rig can't drive a real failover) as an explicit documented residual.
- Add a short note (in the worker alarm code's existing comment near `WnWrapAlarmConsumer.cs:261` or the design doc) that §1.4's 8-arg ack drops domain/full-name because the AVEVA `AlarmAckByName` v2 is a vendor stub (-55) — already partly noted; make it explicit and cross-referenced.
- `stillpending.md` — mark §1.1, §1.2, §4.1/§4.2/§4.3/§4.6 (and §4.4/§4.5 per outcome) as Resolved with commit refs; keep the documented residuals.
Commit `docs: record §1.3/§1.4 residuals and refresh stillpending.md (§7)`.
---
## Final integration review
After all workstreams: run the full local suite (`dotnet test` gateway + `.NET` client, `go test`, `pytest`, `cargo test`+clippy) and the windev suite (worker net48/x86 + Java + live MXAccess smoke). Then use **superpowers-extended-cc:finishing-a-development-branch**.
## Dependency summary
- A0 → A1..A5 → A6
- B0 → B1 → B2..B7 → B8
- C1 independent (gateway-only, local)
- D1..D9 independent of A/B/C and of each other (disjoint client trees)
- E1, E2 last (reflect what closed); E1 mostly independent
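The summary describes a dependency DAG, so any topological order is a valid execution order. Below is a minimal sketch using Kahn's algorithm. It assumes a hypothetical `TaskOrder` helper and a `blockedBy` map keyed by the task labels above. The map mirrors the `blockedBy` field in the tracking JSON, but the code does not read that file.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TaskOrder {
    private TaskOrder() {}

    /**
     * Kahn's algorithm: repeatedly emit a task whose prerequisites have all been
     * emitted. Throws IllegalStateException if a cycle leaves some task never ready.
     */
    static List<String> topologicalOrder(Map<String, List<String>> blockedBy) {
        Map<String, Integer> remaining = new LinkedHashMap<>(); // task -> unmet prerequisites
        Map<String, List<String>> unblocks = new HashMap<>();   // prerequisite -> dependents
        for (Map.Entry<String, List<String>> entry : blockedBy.entrySet()) {
            remaining.putIfAbsent(entry.getKey(), 0);
            for (String prereq : entry.getValue()) {
                remaining.putIfAbsent(prereq, 0);
                remaining.merge(entry.getKey(), 1, Integer::sum);
                unblocks.computeIfAbsent(prereq, k -> new ArrayList<>()).add(entry.getKey());
            }
        }

        // Seed with every task that has no prerequisites (e.g. A0, B0, C1).
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String task = ready.poll();
            order.add(task);
            // Completing a task releases every dependent whose last prerequisite it was.
            for (String next : unblocks.getOrDefault(task, List.of())) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != remaining.size()) {
            throw new IllegalStateException("dependency cycle detected");
        }
        return order;
    }
}
```

Tasks with no remaining prerequisites are released together. That is what lets C1 and D1..D9 run in parallel with workstreams A and B. A cycle leaves some task permanently blocked, and the sketch reports it as an error instead of returning a partial order.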
@@ -0,0 +1,34 @@
{
"planPath": "docs/plans/2026-06-15-stillpending-completion.md",
"tasks": [
{"id": 80, "subject": "Task A0: Wire control-command collaborators into executor", "status": "pending"},
{"id": 81, "subject": "Task A1: Implement Ping command", "status": "pending", "blockedBy": [80]},
{"id": 82, "subject": "Task A2: Implement DrainEvents command", "status": "pending", "blockedBy": [80]},
{"id": 83, "subject": "Task A3: Implement GetSessionState command", "status": "pending", "blockedBy": [80]},
{"id": 84, "subject": "Task A4: Implement GetWorkerInfo command", "status": "pending", "blockedBy": [80]},
{"id": 85, "subject": "Task A5: Implement ShutdownWorker command", "status": "pending", "blockedBy": [80]},
{"id": 86, "subject": "Task A6: FakeWorkerHarness responds to control commands", "status": "pending", "blockedBy": [81, 82, 84]},
{"id": 87, "subject": "Task B0: Resolve native COM signatures on windev", "status": "pending"},
{"id": 88, "subject": "Task B1: Add 6 COM wrapper methods (IMxAccessServer + MxAccessComServer)", "status": "pending", "blockedBy": [87]},
{"id": 89, "subject": "Task B2: Implement Suspend arm", "status": "pending", "blockedBy": [88]},
{"id": 90, "subject": "Task B3: Implement Activate arm", "status": "pending", "blockedBy": [88]},
{"id": 91, "subject": "Task B4: Implement AuthenticateUser arm", "status": "pending", "blockedBy": [88]},
{"id": 92, "subject": "Task B5: Implement ArchestrAUserToId arm", "status": "pending", "blockedBy": [88]},
{"id": 93, "subject": "Task B6: Implement AddBufferedItem arm", "status": "pending", "blockedBy": [88]},
{"id": 94, "subject": "Task B7: Implement SetBufferedUpdateInterval arm", "status": "pending", "blockedBy": [88]},
{"id": 95, "subject": "Task B8: Live COM smoke + buffered capture on windev", "status": "pending", "blockedBy": [89, 90, 91, 92, 93, 94]},
{"id": 96, "subject": "Task C1: Thread ClientCorrelationId into denial audit (§1.2)", "status": "pending"},
{"id": 97, "subject": "Task D1: Go single-shot Write2 helper (§4.1)", "status": "pending"},
{"id": 98, "subject": "Task D2: Python galaxy-* CLI commands (§4.2)", "status": "pending"},
{"id": 99, "subject": "Task D3: ping CLI in Go (§4.3)", "status": "pending"},
{"id": 100, "subject": "Task D4: ping CLI in Java (§4.3)", "status": "pending"},
{"id": 101, "subject": "Task D5: browse CLI — Go (§4.6)", "status": "pending"},
{"id": 102, "subject": "Task D6: browse CLI — Python (§4.6)", "status": "pending"},
{"id": 103, "subject": "Task D7: browse CLI — Rust (§4.6)", "status": "pending"},
{"id": 104, "subject": "Task D8: browse CLI — Java + dotnet (§4.6)", "status": "pending"},
{"id": 105, "subject": "Task D9: Java galaxy aliases + verify dotnet version (§4.4,§4.5)", "status": "pending"},
{"id": 106, "subject": "Task E1: Doc hygiene + dead-code removal (§7)", "status": "pending"},
{"id": 107, "subject": "Task E2: Record §1.3/§1.4 residuals + refresh stillpending.md (§7)", "status": "pending", "blockedBy": [81, 82, 83, 84, 85, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 105]}
],
"lastUpdated": "2026-06-15"
}
@@ -562,6 +562,344 @@ public sealed class WorkerLiveMxAccessSmokeTests(ITestOutputHelper output)
Assert.DoesNotContain(verifyPassword, recordedOutput.Captured, StringComparison.Ordinal);
}
/// <summary>
/// B8 live verification, against a real MXAccess worker, of the COM commands the
/// B-bundle added (previously exercised only against a fake IMxAccessServer):
/// <c>AuthenticateUser</c>, <c>ArchestrAUserToId</c>, <c>Suspend</c>,
/// and <c>Activate</c>. The contract being proven is that each command round-trips
/// to the worker and back carrying a real MXAccess outcome (Ok / an MxStatusProxy /
/// a non-zero HResult) and is NOT short-circuited to <c>INVALID_REQUEST</c> the way an
/// unimplemented command would be. MXAccess-level rejections (a wrong item class for
/// Suspend/Activate commonly returns 0x80070057) are parity, not test failures — we
/// assert the reply kind plus a non-INVALID_REQUEST protocol status, and log the
/// HResult for the record.
/// </summary>
[LiveMxAccessFact]
public async Task GatewaySession_WithLiveWorker_NewComCommands_RoundTripWithRealReplies()
{
string workerExecutablePath = IntegrationTestEnvironment.ResolveLiveMxAccessWorkerExecutablePath();
Assert.True(
File.Exists(workerExecutablePath),
$"Live MXAccess worker executable was not found at {workerExecutablePath}. Build the worker or set {IntegrationTestEnvironment.LiveMxAccessWorkerExecutableVariableName}.");
// Credential-redaction: AuthenticateUser carries a password. Route every test
// log surface through the buffering helper so the post-run assertion proves the
// password never reached the gateway logger, worker stdout/stderr, or any
// WriteLine the test body issued (same pattern as the WriteSecured parity test).
RecordingTestOutputHelper recordedOutput = new(output);
TestWorkerProcessFactory processFactory = new(recordedOutput);
await using GatewayServiceFixture fixture = new(workerExecutablePath, processFactory, recordedOutput);
string? sessionId = null;
(string verifyUser, string verifyPassword) = ResolveLiveMxAccessSecuredCredentials();
try
{
OpenSessionReply openReply = await fixture.Service.OpenSession(
new OpenSessionRequest
{
ClientSessionName = "live-mxaccess-new-com-commands",
ClientCorrelationId = "live-open-new-com",
CommandTimeout = Duration.FromTimeSpan(CommandTimeout),
},
new TestServerCallContext()).ConfigureAwait(false);
sessionId = openReply.SessionId;
Assert.Equal(ProtocolStatusCode.Ok, openReply.ProtocolStatus.Code);
MxCommandReply registerReply = await fixture.Service.Invoke(
CreateRegisterRequest(sessionId),
new TestServerCallContext()).ConfigureAwait(false);
LogReplyTo(recordedOutput, "Register", registerReply);
Assert.Equal(ProtocolStatusCode.Ok, registerReply.ProtocolStatus.Code);
int serverHandle = registerReply.Register.ServerHandle;
MxCommandReply addItemReply = await fixture.Service.Invoke(
CreateAddItemRequest(sessionId, serverHandle),
new TestServerCallContext()).ConfigureAwait(false);
LogReplyTo(recordedOutput, "AddItem", addItemReply);
Assert.Equal(ProtocolStatusCode.Ok, addItemReply.ProtocolStatus.Code);
int itemHandle = addItemReply.AddItem.ItemHandle;
// AuthenticateUser — the B-bundle command under live verification. Before the
// B-bundle this command was unimplemented and the worker short-circuited it to
// INVALID_REQUEST. It must now produce a real reply (Ok with a user id when the
// provider accepts the credential, or a real MXAccess HResult when it does not).
MxCommandReply authReply = await fixture.Service.Invoke(
CreateAuthenticateUserRequest(sessionId, serverHandle, verifyUser, verifyPassword),
new TestServerCallContext()).ConfigureAwait(false);
recordedOutput.WriteLine(
$"AuthenticateUser status={authReply.ProtocolStatus.Code} hresult={authReply.Hresult} user_id={authReply.AuthenticateUser?.UserId}");
Assert.Equal(MxCommandKind.AuthenticateUser, authReply.Kind);
Assert.NotEqual(ProtocolStatusCode.InvalidRequest, authReply.ProtocolStatus.Code);
Assert.True(
authReply.ProtocolStatus.Code is ProtocolStatusCode.Ok or ProtocolStatusCode.MxaccessFailure,
$"AuthenticateUser must surface a real MXAccess outcome, got {authReply.ProtocolStatus.Code}.");
int authenticatedUserId =
authReply.ProtocolStatus.Code == ProtocolStatusCode.Ok && authReply.AuthenticateUser is not null
? authReply.AuthenticateUser.UserId
: 0;
if (authReply.ProtocolStatus.Code == ProtocolStatusCode.Ok)
{
// On the dev rig AuthenticateUser("Administrator","") resolves to user id 1.
// Don't pin the exact value (provider/user-store dependent) — just prove a
// success carried a usable, non-zero ArchestrA user id through the reply.
Assert.NotEqual(0, authenticatedUserId);
}
// ArchestrAUserToId — resolves an ArchestrA user GUID to an integer user id.
// We feed an empty/placeholder GUID: the value is provider-dependent, so the
// assertion is the parity one (real reply, never INVALID_REQUEST). A non-zero
// HResult here is the expected MXAccess rejection of an unknown GUID.
MxCommandReply userToIdReply = await fixture.Service.Invoke(
CreateArchestrAUserToIdRequest(sessionId, serverHandle, userIdGuid: string.Empty),
new TestServerCallContext()).ConfigureAwait(false);
LogReplyTo(recordedOutput, "ArchestrAUserToId", userToIdReply);
recordedOutput.WriteLine($"ArchestrAUserToId user_id={userToIdReply.ArchestraUserToId?.UserId}");
Assert.Equal(MxCommandKind.ArchestraUserToId, userToIdReply.Kind);
Assert.NotEqual(ProtocolStatusCode.InvalidRequest, userToIdReply.ProtocolStatus.Code);
Assert.True(
userToIdReply.ProtocolStatus.Code is ProtocolStatusCode.Ok or ProtocolStatusCode.MxaccessFailure,
$"ArchestrAUserToId must surface a real MXAccess outcome, got {userToIdReply.ProtocolStatus.Code}.");
if (userToIdReply.ProtocolStatus.Code == ProtocolStatusCode.Ok)
{
// On the dev rig ArchestrAUserToId with a valid GUID resolves to user_id=1.
// Don't pin the exact id (provider-dependent) — just prove the Ok path carried
// a usable non-zero ArchestrA user id through the reply payload.
Assert.NotNull(userToIdReply.ArchestraUserToId);
Assert.NotEqual(0, userToIdReply.ArchestraUserToId.UserId);
}
// Suspend / Activate against the advised item. The dev-rig TestInt item class
// may not be suspendable (MXAccess returns 0x80070057 / E_INVALIDARG for a
// wrong item class — see B8 notes). That is MXAccess parity: assert the reply
// kind and a non-INVALID_REQUEST status, surface the HResult and MxStatusProxy
// for the record, and do NOT treat a provider-side rejection as a test failure.
MxCommandReply suspendReply = await fixture.Service.Invoke(
CreateSuspendRequest(sessionId, serverHandle, itemHandle),
new TestServerCallContext()).ConfigureAwait(false);
LogReplyTo(recordedOutput, "Suspend", suspendReply);
recordedOutput.WriteLine(
$"Suspend status_proxy success={suspendReply.Suspend?.Status?.Success} hresult=0x{(uint)suspendReply.Hresult:X8}");
Assert.Equal(MxCommandKind.Suspend, suspendReply.Kind);
Assert.NotEqual(ProtocolStatusCode.InvalidRequest, suspendReply.ProtocolStatus.Code);
Assert.True(
suspendReply.ProtocolStatus.Code is ProtocolStatusCode.Ok or ProtocolStatusCode.MxaccessFailure,
$"Suspend must surface a real MXAccess outcome, got {suspendReply.ProtocolStatus.Code}.");
MxCommandReply activateReply = await fixture.Service.Invoke(
CreateActivateRequest(sessionId, serverHandle, itemHandle),
new TestServerCallContext()).ConfigureAwait(false);
LogReplyTo(recordedOutput, "Activate", activateReply);
recordedOutput.WriteLine(
$"Activate status_proxy success={activateReply.Activate?.Status?.Success} hresult=0x{(uint)activateReply.Hresult:X8}");
Assert.Equal(MxCommandKind.Activate, activateReply.Kind);
Assert.NotEqual(ProtocolStatusCode.InvalidRequest, activateReply.ProtocolStatus.Code);
Assert.True(
activateReply.ProtocolStatus.Code is ProtocolStatusCode.Ok or ProtocolStatusCode.MxaccessFailure,
$"Activate must surface a real MXAccess outcome, got {activateReply.ProtocolStatus.Code}.");
}
finally
{
await ShutDownAsync(fixture, processFactory, sessionId, streamTask: null).ConfigureAwait(false);
}
// Credential contract: the AuthenticateUser password must never reach any log
// surface (gateway logger, worker stdout/stderr, or test WriteLine).
Assert.DoesNotContain(verifyPassword, recordedOutput.Captured, StringComparison.Ordinal);
}
/// <summary>
/// B8 §3.2 buffered-data path: adds a BUFFERED item (<c>AddBufferedItem</c>), sets the
/// buffered update interval (<c>SetBufferedUpdateInterval</c>), advises it, then attempts
/// to observe an <see cref="MxEventFamily.OnBufferedDataChange"/> event carrying multiple
/// samples so the worker's multi-sample conversion (VariantConverter →
/// OnBufferedDataChangeEvent quality/timestamp arrays) can be validated live.
/// <para>
/// The AddBufferedItem + SetBufferedUpdateInterval round-trips are asserted unconditionally
/// (they are the B-bundle commands under verification). The buffered EVENT capture is
/// best-effort: if the rig's object logic does not drive a buffered batch within the live
/// event timeout (the same environmental limitation seen with the externally-undrivable
/// alarm rig), the test records the buffered conversion as an unverified residual rather
/// than failing — the command path is proven, the live multi-sample conversion is not.
/// When a batch IS captured, the converted value and quality/timestamp arrays are asserted
/// to be non-empty and internally consistent (no crash, no dropped payload).
/// </para>
/// </summary>
[LiveMxAccessFact]
public async Task GatewaySession_WithLiveWorker_BufferedItem_AddsSetsIntervalAndAttemptsCapture()
{
string workerExecutablePath = IntegrationTestEnvironment.ResolveLiveMxAccessWorkerExecutablePath();
Assert.True(
File.Exists(workerExecutablePath),
$"Live MXAccess worker executable was not found at {workerExecutablePath}. Build the worker or set {IntegrationTestEnvironment.LiveMxAccessWorkerExecutableVariableName}.");
TestWorkerProcessFactory processFactory = new(output);
await using GatewayServiceFixture fixture = new(workerExecutablePath, processFactory, output);
using RecordingServerStreamWriter<MxEvent> eventWriter = new();
string? sessionId = null;
Task? streamTask = null;
using CancellationTokenSource streamCancellation = new();
// AddBufferedItem takes (item_definition, item_context) like AddItem2. The dev rig
// exposes TestChildObject.TestInt; the buffered form is item="TestInt",
// context="TestChildObject" (per B8 notes). Split the configured live item so a
// custom MXGATEWAY_LIVE_MXACCESS_ITEM override still works.
(string bufferedItem, string bufferedContext) = SplitLiveItemForBuffered(IntegrationTestEnvironment.LiveMxAccessItem);
try
{
OpenSessionReply openReply = await fixture.Service.OpenSession(
new OpenSessionRequest
{
ClientSessionName = "live-mxaccess-buffered",
ClientCorrelationId = "live-open-buffered",
CommandTimeout = Duration.FromTimeSpan(CommandTimeout),
},
new TestServerCallContext()).ConfigureAwait(false);
sessionId = openReply.SessionId;
Assert.Equal(ProtocolStatusCode.Ok, openReply.ProtocolStatus.Code);
streamTask = fixture.Service.StreamEvents(
new StreamEventsRequest { SessionId = sessionId },
eventWriter,
new TestServerCallContext(streamCancellation.Token));
MxCommandReply registerReply = await fixture.Service.Invoke(
CreateRegisterRequest(sessionId),
new TestServerCallContext()).ConfigureAwait(false);
LogReply("Register", registerReply);
Assert.Equal(ProtocolStatusCode.Ok, registerReply.ProtocolStatus.Code);
int serverHandle = registerReply.Register.ServerHandle;
// SetBufferedUpdateInterval first so the buffered cadence is established before
// the item is added/advised. MXAccess rounds to 100ms units and rejects < 1.
MxCommandReply intervalReply = await fixture.Service.Invoke(
CreateSetBufferedUpdateIntervalRequest(sessionId, serverHandle, updateIntervalMilliseconds: 1000),
new TestServerCallContext()).ConfigureAwait(false);
LogReply("SetBufferedUpdateInterval", intervalReply);
Assert.Equal(MxCommandKind.SetBufferedUpdateInterval, intervalReply.Kind);
Assert.NotEqual(ProtocolStatusCode.InvalidRequest, intervalReply.ProtocolStatus.Code);
Assert.Equal(ProtocolStatusCode.Ok, intervalReply.ProtocolStatus.Code);
// AddBufferedItem — must return a real item handle (the dev rig yields handle 1).
MxCommandReply addBufferedReply = await fixture.Service.Invoke(
CreateAddBufferedItemRequest(sessionId, serverHandle, bufferedItem, bufferedContext),
new TestServerCallContext()).ConfigureAwait(false);
LogReply("AddBufferedItem", addBufferedReply);
Assert.Equal(MxCommandKind.AddBufferedItem, addBufferedReply.Kind);
Assert.NotEqual(ProtocolStatusCode.InvalidRequest, addBufferedReply.ProtocolStatus.Code);
Assert.Equal(ProtocolStatusCode.Ok, addBufferedReply.ProtocolStatus.Code);
Assert.NotNull(addBufferedReply.AddBufferedItem);
int bufferedItemHandle = addBufferedReply.AddBufferedItem.ItemHandle;
Assert.True(bufferedItemHandle > 0, "AddBufferedItem must yield a usable item handle.");
MxCommandReply adviseReply = await fixture.Service.Invoke(
CreateAdviseRequest(sessionId, serverHandle, bufferedItemHandle),
new TestServerCallContext()).ConfigureAwait(false);
LogReply("Advise(buffered)", adviseReply);
Assert.Equal(ProtocolStatusCode.Ok, adviseReply.ProtocolStatus.Code);
// Best-effort capture of a SAMPLE-BEARING buffered batch.
//
// Live observation (B8): immediately after Advise the provider delivers an
// initial OnBufferedDataChange with data_type=NoData / raw_data_type=0 and zero
// quality+timestamp samples — the buffered analogue of the bad-quality/
// registration-state bootstrap event the OnDataChange tests skip with their
// family-match predicate. That empty bootstrap is parity, NOT a dropped payload:
// the converter ran without crashing and there were simply no samples to carry.
// We therefore match only a batch that actually carries samples, so a real
// multi-sample conversion can be validated and the empty bootstrap is skipped
// rather than mistaken for a defect.
MxEvent? bufferedBatch = null;
try
{
bufferedBatch = await eventWriter
.WaitForMessageAsync(
candidate => candidate.Family == MxEventFamily.OnBufferedDataChange
&& candidate.ServerHandle == serverHandle
&& candidate.ItemHandle == bufferedItemHandle
&& candidate.OnBufferedDataChange is not null
&& (CountArrayElements(candidate.OnBufferedDataChange.QualityValues) > 0
|| CountArrayElements(candidate.OnBufferedDataChange.TimestampValues) > 0),
IntegrationTestEnvironment.LiveMxAccessEventTimeout,
streamCancellation.Token)
.ConfigureAwait(false);
}
catch (TimeoutException)
{
bufferedBatch = null;
}
// Whether or not a sample-bearing batch arrived, record what buffered events the
// rig DID deliver (typically just the empty NoData bootstrap) for the record.
int bootstrapBufferedEvents = CountMatchingEvents(
eventWriter,
e => e.Family == MxEventFamily.OnBufferedDataChange
&& e.ServerHandle == serverHandle
&& e.ItemHandle == bufferedItemHandle);
if (bufferedBatch is null)
{
// RESIDUAL (documented): the command path (AddBufferedItem +
// SetBufferedUpdateInterval + Advise) is proven and the buffered EVENT plumbing
// is live (the empty NoData bootstrap arrives and converts without crashing),
// but the rig did not drive a sample-bearing buffered batch within the timeout
// — the same environmental limitation as the externally-undrivable alarm rig.
// The §3.2 OnBufferedDataChange MULTI-SAMPLE conversion therefore remains
// unverified live. This is environmental, not a defect — let the test pass.
output.WriteLine(
"B8 RESIDUAL: AddBufferedItem/SetBufferedUpdateInterval/Advise round-tripped and "
+ $"{bootstrapBufferedEvents} OnBufferedDataChange event(s) arrived (empty NoData "
+ "bootstrap, converted without crash/drop), but no sample-bearing buffered batch "
+ $"was observed within {IntegrationTestEnvironment.LiveMxAccessEventTimeout}. Live "
+ "§3.2 multi-sample conversion remains unverified (rig object logic may not drive "
+ "buffered samples on demand).");
// The residual claim ("empty NoData bootstrap arrives and converts without crashing")
// is only meaningful if at least one OnBufferedDataChange event arrived. Assert that
// the buffered subscription registered and the worker's event plumbing fired at all.
Assert.True(
bootstrapBufferedEvents > 0,
"No OnBufferedDataChange event arrived at all after Advise; the buffered subscription may not have registered.");
return;
}
// A SAMPLE-BEARING buffered batch was captured — validate the §3.2 conversion.
LogEvent(bufferedBatch);
OnBufferedDataChangeEvent body = bufferedBatch.OnBufferedDataChange;
Assert.NotNull(body);
int qualityCount = CountArrayElements(body.QualityValues);
int timestampCount = CountArrayElements(body.TimestampValues);
output.WriteLine(
$"B8 CAPTURED buffered batch: data_type={body.DataType} raw_data_type={body.RawDataType} "
+ $"quality_samples={qualityCount} timestamp_samples={timestampCount} "
+ $"value_kind={bufferedBatch.Value?.KindCase}");
// The predicate guaranteed at least one sample; the converted aggregate value
// must also exist (no crash, no dropped payload).
Assert.True(
qualityCount > 0 || timestampCount > 0,
"Sample-bearing OnBufferedDataChange lost its samples after the predicate matched.");
Assert.NotNull(bufferedBatch.Value);
// When MXAccess delivers parallel quality + timestamp arrays the converted
// arrays must agree in length; a mismatch is a real conversion defect (a sample
// was dropped on one side).
if (qualityCount > 0 && timestampCount > 0)
{
Assert.Equal(qualityCount, timestampCount);
}
}
finally
{
streamCancellation.Cancel();
await ShutDownAsync(fixture, processFactory, sessionId, streamTask).ConfigureAwait(false);
}
}
/// <summary>
/// Verifies that killing the worker process marks the session
/// <see cref="SessionState.Faulted"/> with a clean fault classification — the gateway
@@ -939,6 +1277,113 @@ public sealed class WorkerLiveMxAccessSmokeTests(ITestOutputHelper output)
};
}
private static MxCommandRequest CreateArchestrAUserToIdRequest(
string sessionId,
int serverHandle,
string userIdGuid)
{
return new MxCommandRequest
{
SessionId = sessionId,
ClientCorrelationId = "live-archestra-user-to-id",
Command = new MxCommand
{
Kind = MxCommandKind.ArchestraUserToId,
ArchestraUserToId = new ArchestrAUserToIdCommand
{
ServerHandle = serverHandle,
UserIdGuid = userIdGuid,
},
},
};
}
private static MxCommandRequest CreateAddBufferedItemRequest(
string sessionId,
int serverHandle,
string itemDefinition,
string itemContext)
{
return new MxCommandRequest
{
SessionId = sessionId,
ClientCorrelationId = "live-add-buffered-item",
Command = new MxCommand
{
Kind = MxCommandKind.AddBufferedItem,
AddBufferedItem = new AddBufferedItemCommand
{
ServerHandle = serverHandle,
ItemDefinition = itemDefinition,
ItemContext = itemContext,
},
},
};
}
private static MxCommandRequest CreateSetBufferedUpdateIntervalRequest(
string sessionId,
int serverHandle,
int updateIntervalMilliseconds)
{
return new MxCommandRequest
{
SessionId = sessionId,
ClientCorrelationId = "live-set-buffered-update-interval",
Command = new MxCommand
{
Kind = MxCommandKind.SetBufferedUpdateInterval,
SetBufferedUpdateInterval = new SetBufferedUpdateIntervalCommand
{
ServerHandle = serverHandle,
UpdateIntervalMilliseconds = updateIntervalMilliseconds,
},
},
};
}
private static MxCommandRequest CreateSuspendRequest(
string sessionId,
int serverHandle,
int itemHandle)
{
return new MxCommandRequest
{
SessionId = sessionId,
ClientCorrelationId = "live-suspend",
Command = new MxCommand
{
Kind = MxCommandKind.Suspend,
Suspend = new SuspendCommand
{
ServerHandle = serverHandle,
ItemHandle = itemHandle,
},
},
};
}
private static MxCommandRequest CreateActivateRequest(
string sessionId,
int serverHandle,
int itemHandle)
{
return new MxCommandRequest
{
SessionId = sessionId,
ClientCorrelationId = "live-activate",
Command = new MxCommand
{
Kind = MxCommandKind.Activate,
Activate = new ActivateCommand
{
ServerHandle = serverHandle,
ItemHandle = itemHandle,
},
},
};
}
private static MxCommandRequest CreateWriteSecuredRequest(
string sessionId,
int serverHandle,
@@ -978,6 +1423,50 @@ public sealed class WorkerLiveMxAccessSmokeTests(ITestOutputHelper output)
return (verifyUser, verifyPassword);
}
/// <summary>
/// Splits a dotted MXAccess reference (e.g. "TestChildObject.TestInt") into the
/// (item_definition, item_context) pair AddBufferedItem expects — attribute name and
/// owning object. An undotted reference is passed through with an empty context.
/// </summary>
private static (string Item, string Context) SplitLiveItemForBuffered(string liveItem)
{
int lastDot = liveItem.LastIndexOf('.');
if (lastDot < 0 || lastDot >= liveItem.Length - 1)
{
return (liveItem, string.Empty);
}
string context = liveItem[..lastDot];
string item = liveItem[(lastDot + 1)..];
return (item, context);
}
/// <summary>
/// Counts the elements in a converted buffered <see cref="MxArray"/> across whichever
/// typed-array oneof case the VariantConverter populated, so the buffered-capture
/// assertions are independent of the rig item's element type.
/// </summary>
private static int CountArrayElements(MxArray? array)
{
if (array is null)
{
return 0;
}
return array.ValuesCase switch
{
MxArray.ValuesOneofCase.BoolValues => array.BoolValues.Values.Count,
MxArray.ValuesOneofCase.Int32Values => array.Int32Values.Values.Count,
MxArray.ValuesOneofCase.Int64Values => array.Int64Values.Values.Count,
MxArray.ValuesOneofCase.FloatValues => array.FloatValues.Values.Count,
MxArray.ValuesOneofCase.DoubleValues => array.DoubleValues.Values.Count,
MxArray.ValuesOneofCase.StringValues => array.StringValues.Values.Count,
MxArray.ValuesOneofCase.TimestampValues => array.TimestampValues.Values.Count,
MxArray.ValuesOneofCase.RawValues => array.RawValues.Values.Count,
_ => 0,
};
}
private static int CountMatchingEvents(
RecordingServerStreamWriter<MxEvent> writer,
Func<MxEvent, bool> predicate)
@@ -1607,6 +2096,7 @@ public sealed class WorkerLiveMxAccessSmokeTests(ITestOutputHelper output)
string commandKind,
string target,
ConstraintFailure failure,
string? correlationId,
CancellationToken cancellationToken) => Task.CompletedTask;
}
@@ -203,6 +203,7 @@ public sealed class GatewayAlarmMonitor : BackgroundService, IGatewayAlarmServic
GatewaySession session = await _sessionManager.OpenSessionAsync(
new SessionOpenRequest(BackendName, MonitorClientName, Guid.NewGuid().ToString("N"), CommandTimeout: null),
MonitorClientName,
ownerKeyId: null,
stoppingToken)
.ConfigureAwait(false);
lock (_sync) { _session = session; }
@@ -2,4 +2,6 @@ namespace ZB.MOM.WW.MxGateway.Server.Configuration;
public sealed record EffectiveEventConfiguration(
int QueueCapacity,
string BackpressurePolicy);
string BackpressurePolicy,
int ReplayBufferCapacity,
double ReplayRetentionSeconds);
@@ -6,4 +6,5 @@ public sealed record EffectiveSessionConfiguration(
int MaxPendingCommandsPerSession,
int DefaultLeaseSeconds,
int LeaseSweepIntervalSeconds,
bool AllowMultipleEventSubscribers);
bool AllowMultipleEventSubscribers,
int MaxEventSubscribersPerSession);
@@ -11,4 +11,20 @@ public sealed class EventOptions
/// Gets the backpressure policy for event queue overflow.
/// </summary>
public EventBackpressurePolicy BackpressurePolicy { get; init; } = EventBackpressurePolicy.FailFast;
/// <summary>
/// Gets the maximum number of events retained in the per-session replay ring buffer
/// used to re-deliver events a returning subscriber missed (reconnect/reattach).
/// When the buffer exceeds this count the oldest retained events are evicted first.
/// A value of <c>0</c> disables replay retention entirely.
/// </summary>
public int ReplayBufferCapacity { get; init; } = 1024;
/// <summary>
/// Gets the maximum age, in seconds, of an event retained in the per-session replay
/// ring buffer. Entries older than this are evicted regardless of capacity. A value
/// of <c>0</c> disables age-based eviction (only <see cref="ReplayBufferCapacity"/>
/// bounds the buffer).
/// </summary>
public double ReplayRetentionSeconds { get; init; } = 300;
}
@@ -46,10 +46,13 @@ public sealed class GatewayConfigurationProvider(IOptions<GatewayOptions> option
MaxPendingCommandsPerSession: value.Sessions.MaxPendingCommandsPerSession,
DefaultLeaseSeconds: value.Sessions.DefaultLeaseSeconds,
LeaseSweepIntervalSeconds: value.Sessions.LeaseSweepIntervalSeconds,
AllowMultipleEventSubscribers: value.Sessions.AllowMultipleEventSubscribers),
AllowMultipleEventSubscribers: value.Sessions.AllowMultipleEventSubscribers,
MaxEventSubscribersPerSession: value.Sessions.MaxEventSubscribersPerSession),
Events: new EffectiveEventConfiguration(
QueueCapacity: value.Events.QueueCapacity,
BackpressurePolicy: value.Events.BackpressurePolicy.ToString()),
BackpressurePolicy: value.Events.BackpressurePolicy.ToString(),
ReplayBufferCapacity: value.Events.ReplayBufferCapacity,
ReplayRetentionSeconds: value.Events.ReplayRetentionSeconds),
Dashboard: new EffectiveDashboardConfiguration(
Enabled: value.Dashboard.Enabled,
AllowAnonymousLocalhost: value.Dashboard.AllowAnonymousLocalhost,
@@ -177,12 +177,10 @@ public sealed class GatewayOptionsValidator : OptionsValidatorBase<GatewayOption
options.LeaseSweepIntervalSeconds,
"MxGateway:Sessions:LeaseSweepIntervalSeconds must be greater than zero.",
builder);
if (options.AllowMultipleEventSubscribers)
{
builder.Add(
"MxGateway:Sessions:AllowMultipleEventSubscribers is not supported until event fan-out is implemented.");
}
AddIfNotPositive(
options.MaxEventSubscribersPerSession,
"MxGateway:Sessions:MaxEventSubscribersPerSession must be greater than zero.",
builder);
}
private static void ValidateEvents(EventOptions options, ValidationBuilder builder)
@@ -193,6 +191,16 @@ public sealed class GatewayOptionsValidator : OptionsValidatorBase<GatewayOption
{
builder.Add("MxGateway:Events:BackpressurePolicy must be a supported backpressure policy.");
}
// ReplayBufferCapacity and ReplayRetentionSeconds are bounds on the replay ring
// buffer; 0 is a valid value (disables that dimension), so only negatives fail.
AddIfNegative(
options.ReplayBufferCapacity,
"MxGateway:Events:ReplayBufferCapacity must be greater than or equal to zero.",
builder);
builder.RequireThat(
options.ReplayRetentionSeconds >= 0,
"MxGateway:Events:ReplayRetentionSeconds must be greater than or equal to zero.");
}
private static void ValidateDashboard(DashboardOptions options, ValidationBuilder builder)
@@ -27,4 +27,11 @@ public sealed class SessionOptions
/// Gets a value indicating whether multiple event subscribers are allowed per session.
/// </summary>
public bool AllowMultipleEventSubscribers { get; init; }
/// <summary>
/// Gets the maximum number of concurrent event subscribers per session.
/// Applies when <see cref="AllowMultipleEventSubscribers"/> is <see langword="true"/>;
/// effectively 1 when it is <see langword="false"/>. Must be greater than zero.
/// </summary>
public int MaxEventSubscribersPerSession { get; init; } = 8;
}
@@ -138,6 +138,7 @@ public sealed class DashboardLiveDataService : IDashboardLiveDataService, IAsync
GatewaySession session = await _sessionManager.OpenSessionAsync(
new SessionOpenRequest(BackendName, ClientName, Guid.NewGuid().ToString("N"), CommandTimeout: null),
ClientName,
ownerKeyId: null,
cancellationToken)
.ConfigureAwait(false);
@@ -24,9 +24,21 @@ public sealed class DashboardEventBroadcaster(
return;
}
Task send = hubContext.Clients
.Group(EventsHub.GroupName(sessionId))
.SendAsync(EventsHub.EventMessage, mxEvent);
// Wrap the Task acquisition in a try/catch so a hypothetical synchronous throw
// from SendAsync (e.g. an implementation that throws before returning the Task)
// cannot escape Publish. The interface contract is never-throw; fire-and-forget.
Task send;
try
{
send = hubContext.Clients
.Group(EventsHub.GroupName(sessionId))
.SendAsync(EventsHub.EventMessage, mxEvent);
}
catch (Exception ex)
{
logger.LogDebug(ex, "Dashboard event mirror to session {SessionId} threw synchronously.", sessionId);
return;
}
if (!send.IsCompletedSuccessfully)
{
@@ -6,15 +6,9 @@ namespace ZB.MOM.WW.MxGateway.Server.Dashboard.Hubs;
/// <summary>
/// SignalR hub for per-session MxEvent push. Clients call
/// <see cref="SubscribeSession"/> to join the group for a specific
/// session; the dashboard's MxEvent broadcaster (a future hook on
/// <c>EventStreamService</c>) sends messages to <c>session:{id}</c>.
/// session; <see cref="DashboardEventBroadcaster"/> sends messages to
/// <c>session:{id}</c> as events arrive from the live gRPC stream.
/// </summary>
/// <remarks>
/// The publisher side is intentionally a follow-up. Today the dashboard's
/// per-session event view is fed by the snapshot hub, which carries the
/// rolling recent-events list. Once a dedicated MxEvent broadcaster
/// lands, this hub's group convention is what it will publish to.
/// </remarks>
[Authorize(Policy = DashboardAuthenticationDefaults.HubClientsPolicy)]
public sealed class EventsHub : Hub
{
@@ -1,9 +1,7 @@
using System.Runtime.CompilerServices;
using System.Threading.Channels;
using Microsoft.Extensions.Options;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Server.Configuration;
using ZB.MOM.WW.MxGateway.Server.Dashboard.Hubs;
using ZB.MOM.WW.MxGateway.Server.Metrics;
using ZB.MOM.WW.MxGateway.Server.Sessions;
using ZB.MOM.WW.MxGateway.Server.Workers;
@@ -13,14 +11,46 @@ namespace ZB.MOM.WW.MxGateway.Server.Grpc;
public sealed class EventStreamService(
ISessionManager sessionManager,
IOptions<GatewayOptions> options,
MxAccessGrpcMapper mapper,
GatewayMetrics metrics,
IDashboardEventBroadcaster dashboardEventBroadcaster,
ILogger<EventStreamService> logger) : IEventStreamService
GatewayMetrics metrics) : IEventStreamService
{
/// <summary>
/// Streams events from a session to the client asynchronously.
/// Streams events from a session to the client asynchronously.
/// </summary>
/// <remarks>
/// <para>
/// Task 4 replaced the per-RPC channel that drained the session directly: this method
/// now reads the subscriber lease's channel, which is fed by the session's single
/// <see cref="SessionEventDistributor"/> pump. The pump owns the single drain of
/// the worker event stream and the worker→public mapping (mirroring the former
/// <c>ProduceEventsAsync</c>); this loop is the per-subscriber boundary that
/// applies the per-RPC filter (<c>AfterWorkerSequence</c>) and queue-depth metrics,
/// and surfaces the overflow fault raised by the distributor's backpressure policy.
/// </para>
/// <para>
/// Task 6 moved the dashboard mirror OFF this per-RPC loop. The dashboard is now a
/// first-class internal subscriber on the session's
/// <see cref="SessionEventDistributor"/> (see <c>GatewaySession.StartDashboardMirror</c>),
/// so it receives session events even when no gRPC client is streaming. This loop no
/// longer mirrors to the dashboard. One deliberate consequence: the dashboard now sees
/// RAW session events, not the per-gRPC-subscriber <c>AfterWorkerSequence</c>-filtered
/// view this loop applies — the dashboard is a separate LDAP-authenticated monitoring
/// view that should see the session's full event activity (per-session dashboard ACL is
/// the separate Task 18).
/// </para>
/// <para>
/// Overflow handling (Task 5): the distributor's per-subscriber channel is bounded
/// and the pump writes non-blocking. When this subscriber's channel is full the pump
/// applies the per-subscriber backpressure policy and completes this subscriber's
/// channel with a <see cref="SessionManagerException"/>
/// (<see cref="SessionManagerErrorCode.EventQueueOverflow"/>). That terminal fault
/// surfaces here when the reader's <c>MoveNextAsync</c> throws, and — like the
/// pre-epic per-RPC overflow — it propagates to the gRPC client unchanged. The
/// overflow metric, and (in the legacy single-subscriber FailFast case) the session
/// fault + fault metric, are recorded by the distributor's overflow handler so the
/// session, the pump, and other subscribers are isolated from this subscriber's
/// slowness.
/// </para>
/// </remarks>
/// <param name="request">Stream events request.</param>
/// <param name="cancellationToken">Cancellation token.</param>
/// <returns>Async enumerable of MX events.</returns>
@@ -35,151 +65,80 @@ public sealed class EventStreamService(
$"Session {request.SessionId} was not found.");
}
using IDisposable subscriber = session.AttachEventSubscriber(
// No `using` here — subscriber.Dispose() is called exactly once in the finally
// block below, which also disposes the reader. A `using` declaration would add a
// second Dispose on the same path and double-decrement the session subscriber count.
IEventSubscriberLease subscriber = session.AttachEventSubscriber(
options.Value.Sessions.AllowMultipleEventSubscribers);
using CancellationTokenSource streamCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
int streamQueueDepth = 0;
Channel<MxEvent> eventQueue = Channel.CreateBounded<MxEvent>(
new BoundedChannelOptions(options.Value.Events.QueueCapacity)
{
SingleReader = true,
SingleWriter = true,
FullMode = BoundedChannelFullMode.Wait,
AllowSynchronousContinuations = false,
});
Task producerTask = ProduceEventsAsync(
session,
request.AfterWorkerSequence,
eventQueue.Writer,
() =>
{
Interlocked.Increment(ref streamQueueDepth);
metrics.AdjustGrpcEventStreamQueueDepth(1);
},
streamCts.Token);
ulong afterWorkerSequence = request.AfterWorkerSequence;
IAsyncEnumerator<MxEvent> reader = subscriber.Reader
.ReadAllAsync(cancellationToken)
.GetAsyncEnumerator(cancellationToken);
try
{
await foreach (MxEvent mxEvent in eventQueue.Reader.ReadAllAsync(cancellationToken).ConfigureAwait(false))
while (true)
{
Interlocked.Decrement(ref streamQueueDepth);
metrics.AdjustGrpcEventStreamQueueDepth(-1);
MxEvent mxEvent;
try
{
if (!await reader.MoveNextAsync().ConfigureAwait(false))
{
break;
}
mxEvent = reader.Current;
}
catch (WorkerClientException workerException)
{
// The distributor pump completes every subscriber channel with the source
// fault when the worker event stream terminates abnormally; that surfaces
// here. Mirror the pre-Task-4 ProduceEventsAsync behavior: fault the
// session and record the metric, then propagate the terminal fault to the
// gRPC client.
session.MarkFaulted(workerException.Message);
metrics.Fault(WorkerClientErrorCode.WorkerFaulted.ToString());
throw;
}
// Per-RPC filter stays at the subscriber boundary: each request may resume
// from a different AfterWorkerSequence, so the shared pump fans raw events and
// this loop drops the ones at or below the caller's watermark.
if (mxEvent.WorkerSequence <= afterWorkerSequence)
{
continue;
}
// Queue-depth gauge tracks events the pump has fanned into this subscriber's
// channel but the client has not yet consumed — the same "buffered, not yet
// delivered" quantity the pre-Task-4 per-RPC channel reported. The bounded
// subscriber channel supports counting, so reconcile the gauge to the current
// backlog, falling back to a no-op delta if a channel ever cannot count.
int backlog = subscriber.Reader.CanCount ? subscriber.Reader.Count : streamQueueDepth;
int delta = backlog - streamQueueDepth;
if (delta != 0)
{
streamQueueDepth = backlog;
metrics.AdjustGrpcEventStreamQueueDepth(delta);
}
yield return mxEvent;
}
await producerTask.ConfigureAwait(false);
}
finally
{
await streamCts.CancelAsync().ConfigureAwait(false);
await reader.DisposeAsync().ConfigureAwait(false);
subscriber.Dispose();
try
if (streamQueueDepth != 0)
{
await producerTask.ConfigureAwait(false);
}
catch (OperationCanceledException) when (streamCts.IsCancellationRequested)
{
}
catch (Exception exception)
{
logger.LogDebug(
exception,
"Event stream producer stopped for session {SessionId}.",
request.SessionId);
}
int remainingDepth = Interlocked.Exchange(ref streamQueueDepth, 0);
if (remainingDepth > 0)
{
metrics.AdjustGrpcEventStreamQueueDepth(-remainingDepth);
metrics.AdjustGrpcEventStreamQueueDepth(-streamQueueDepth);
streamQueueDepth = 0;
}
metrics.StreamDisconnected("Detached");
}
}
private async Task ProduceEventsAsync(
GatewaySession session,
ulong afterWorkerSequence,
ChannelWriter<MxEvent> writer,
Action eventQueued,
CancellationToken cancellationToken)
{
try
{
await foreach (WorkerEvent workerEvent in session
.ReadEventsAsync(cancellationToken)
.WithCancellation(cancellationToken)
.ConfigureAwait(false))
{
MxEvent publicEvent = mapper.MapEvent(workerEvent);
if (publicEvent.WorkerSequence <= afterWorkerSequence)
{
continue;
}
// Mirror the event to the dashboard EventsHub group for this
// session. Fire-and-forget — broadcast errors must not affect
// the source gRPC stream. Server-041: the
// IDashboardEventBroadcaster contract documents Publish as
// never-throw, but we enforce that at the seam too, so a
// future implementation that adds synchronous validation or
// a serializer hop cannot fault the producer loop and end
// this client's gRPC stream.
try
{
dashboardEventBroadcaster.Publish(session.SessionId, publicEvent);
}
catch (Exception ex)
{
logger.LogDebug(
ex,
"Dashboard event mirror threw for session {SessionId}; continuing.",
session.SessionId);
}
if (!writer.TryWrite(publicEvent))
{
string message = $"Session {session.SessionId} event stream queue overflowed.";
metrics.QueueOverflow("grpc-event-stream");
if (options.Value.Events.BackpressurePolicy == EventBackpressurePolicy.FailFast)
{
session.MarkFaulted(message);
metrics.Fault(SessionManagerErrorCode.EventQueueOverflow.ToString());
}
else
{
logger.LogDebug(
"Disconnecting event stream for session {SessionId} after queue overflow.",
session.SessionId);
}
writer.TryComplete(new SessionManagerException(
SessionManagerErrorCode.EventQueueOverflow,
message));
return;
}
eventQueued();
}
writer.TryComplete();
}
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
{
writer.TryComplete();
}
catch (Exception exception)
{
if (exception is WorkerClientException)
{
session.MarkFaulted(exception.Message);
metrics.Fault(WorkerClientErrorCode.WorkerFaulted.ToString());
}
writer.TryComplete(exception);
}
}
}
@@ -1,6 +1,5 @@
using Google.Protobuf.WellKnownTypes;
using Grpc.Core;
using Microsoft.Data.SqlClient;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
using GalaxyDb = ZB.MOM.WW.MxGateway.Server.Galaxy;
using ZB.MOM.WW.MxGateway.Server.Security.Authentication;
@@ -20,8 +19,7 @@ public sealed class GalaxyRepositoryGrpcService(
GalaxyDb.IGalaxyRepository repository,
GalaxyDb.IGalaxyHierarchyCache cache,
GalaxyDb.IGalaxyDeployNotifier notifier,
IGatewayRequestIdentityAccessor identityAccessor,
ILogger<GalaxyRepositoryGrpcService> logger) : ProtoGalaxyRepository.GalaxyRepositoryBase
IGatewayRequestIdentityAccessor identityAccessor) : ProtoGalaxyRepository.GalaxyRepositoryBase
{
private static readonly TimeSpan FirstLoadWaitBudget = TimeSpan.FromSeconds(5);
private const int DefaultDiscoverPageSize = 1000;
@@ -347,15 +345,4 @@ public sealed class GalaxyRepositoryGrpcService(
private sealed record PageToken(long Sequence, string FilterSignature, int Offset);
[System.Diagnostics.CodeAnalysis.SuppressMessage(
"Style",
"IDE0051:Remove unused private members",
Justification = "Kept for parity with prior SQL exception mapping; future direct-SQL paths reuse it.")]
private RpcException MapSqlException(SqlException exception)
{
logger.LogWarning(exception, "Galaxy repository query failed.");
return new RpcException(new Status(
StatusCode.Unavailable,
"Galaxy repository is unavailable."));
}
}
@@ -36,6 +36,7 @@ public sealed class MxAccessGatewayService(
.OpenSessionAsync(
SessionOpenRequest.FromContract(request),
ResolveClientIdentity(),
identityAccessor.Current?.KeyId,
context.CancellationToken)
.ConfigureAwait(false);
@@ -105,6 +106,7 @@ public sealed class MxAccessGatewayService(
BulkConstraintPlan? bulkConstraintPlan = await ApplyConstraintsAsync(
session,
command,
request.ClientCorrelationId,
context.CancellationToken)
.ConfigureAwait(false);
@@ -279,17 +281,18 @@ public sealed class MxAccessGatewayService(
private async Task<BulkConstraintPlan?> ApplyConstraintsAsync(
GatewaySession session,
MxCommand command,
string? correlationId,
CancellationToken cancellationToken)
{
ApiKeyIdentity? identity = identityAccessor.Current;
switch (command.Kind)
{
case MxCommandKind.AddItem:
await EnforceReadTagAsync(identity, command.Kind, command.AddItem.ItemDefinition, cancellationToken)
await EnforceReadTagAsync(identity, command.Kind, command.AddItem.ItemDefinition, correlationId, cancellationToken)
.ConfigureAwait(false);
return null;
case MxCommandKind.AddItem2:
await EnforceReadTagAsync(identity, command.Kind, command.AddItem2.ItemDefinition, cancellationToken)
await EnforceReadTagAsync(identity, command.Kind, command.AddItem2.ItemDefinition, correlationId, cancellationToken)
.ConfigureAwait(false);
return null;
case MxCommandKind.AddItemBulk:
@@ -298,6 +301,7 @@ public sealed class MxAccessGatewayService(
command,
command.AddItemBulk.ServerHandle,
command.AddItemBulk.TagAddresses,
correlationId,
cancellationToken)
.ConfigureAwait(false);
case MxCommandKind.SubscribeBulk:
@@ -306,6 +310,7 @@ public sealed class MxAccessGatewayService(
command,
command.SubscribeBulk.ServerHandle,
command.SubscribeBulk.TagAddresses,
correlationId,
cancellationToken)
.ConfigureAwait(false);
case MxCommandKind.AdviseItemBulk:
@@ -315,6 +320,7 @@ public sealed class MxAccessGatewayService(
command,
command.AdviseItemBulk.ServerHandle,
command.AdviseItemBulk.ItemHandles,
correlationId,
cancellationToken)
.ConfigureAwait(false);
case MxCommandKind.ReadBulk:
@@ -323,6 +329,7 @@ public sealed class MxAccessGatewayService(
command,
command.ReadBulk.ServerHandle,
command.ReadBulk.TagAddresses,
correlationId,
cancellationToken)
.ConfigureAwait(false);
case MxCommandKind.WriteBulk:
@@ -333,6 +340,7 @@ public sealed class MxAccessGatewayService(
command.WriteBulk.ServerHandle,
command.WriteBulk.Entries,
entry => entry.ItemHandle,
correlationId,
cancellationToken)
.ConfigureAwait(false);
case MxCommandKind.Write2Bulk:
@@ -343,6 +351,7 @@ public sealed class MxAccessGatewayService(
command.Write2Bulk.ServerHandle,
command.Write2Bulk.Entries,
entry => entry.ItemHandle,
correlationId,
cancellationToken)
.ConfigureAwait(false);
case MxCommandKind.WriteSecuredBulk:
@@ -353,6 +362,7 @@ public sealed class MxAccessGatewayService(
command.WriteSecuredBulk.ServerHandle,
command.WriteSecuredBulk.Entries,
entry => entry.ItemHandle,
correlationId,
cancellationToken)
.ConfigureAwait(false);
case MxCommandKind.WriteSecured2Bulk:
@@ -363,6 +373,7 @@ public sealed class MxAccessGatewayService(
command.WriteSecured2Bulk.ServerHandle,
command.WriteSecured2Bulk.Entries,
entry => entry.ItemHandle,
correlationId,
cancellationToken)
.ConfigureAwait(false);
case MxCommandKind.Write:
@@ -372,6 +383,7 @@ public sealed class MxAccessGatewayService(
command.Kind,
command.Write.ServerHandle,
command.Write.ItemHandle,
correlationId,
cancellationToken)
.ConfigureAwait(false);
return null;
@@ -382,6 +394,7 @@ public sealed class MxAccessGatewayService(
command.Kind,
command.Write2.ServerHandle,
command.Write2.ItemHandle,
correlationId,
cancellationToken)
.ConfigureAwait(false);
return null;
@@ -392,6 +405,7 @@ public sealed class MxAccessGatewayService(
command.Kind,
command.WriteSecured.ServerHandle,
command.WriteSecured.ItemHandle,
correlationId,
cancellationToken)
.ConfigureAwait(false);
return null;
@@ -402,6 +416,7 @@ public sealed class MxAccessGatewayService(
command.Kind,
command.WriteSecured2.ServerHandle,
command.WriteSecured2.ItemHandle,
correlationId,
cancellationToken)
.ConfigureAwait(false);
return null;
@@ -414,6 +429,7 @@ public sealed class MxAccessGatewayService(
ApiKeyIdentity? identity,
MxCommandKind commandKind,
string tagAddress,
string? correlationId,
CancellationToken cancellationToken)
{
ConstraintFailure? failure = await constraintEnforcer
@@ -424,7 +440,7 @@ public sealed class MxAccessGatewayService(
return;
}
await constraintEnforcer.RecordDenialAsync(identity, commandKind.ToString(), tagAddress, failure, cancellationToken)
await constraintEnforcer.RecordDenialAsync(identity, commandKind.ToString(), tagAddress, failure, correlationId, cancellationToken)
.ConfigureAwait(false);
throw new RpcException(new Status(StatusCode.PermissionDenied, failure.Message));
}
@@ -435,6 +451,7 @@ public sealed class MxAccessGatewayService(
MxCommandKind commandKind,
int serverHandle,
int itemHandle,
string? correlationId,
CancellationToken cancellationToken)
{
ConstraintFailure? failure = await constraintEnforcer
@@ -445,7 +462,7 @@ public sealed class MxAccessGatewayService(
return;
}
await constraintEnforcer.RecordDenialAsync(identity, commandKind.ToString(), itemHandle.ToString(System.Globalization.CultureInfo.InvariantCulture), failure, cancellationToken)
await constraintEnforcer.RecordDenialAsync(identity, commandKind.ToString(), itemHandle.ToString(System.Globalization.CultureInfo.InvariantCulture), failure, correlationId, cancellationToken)
.ConfigureAwait(false);
throw new RpcException(new Status(StatusCode.PermissionDenied, failure.Message));
}
@@ -455,6 +472,7 @@ public sealed class MxAccessGatewayService(
MxCommand command,
int serverHandle,
IReadOnlyList<string> tagAddresses,
string? correlationId,
CancellationToken cancellationToken)
{
Dictionary<int, SubscribeResult> denied = [];
@@ -471,7 +489,7 @@ public sealed class MxAccessGatewayService(
continue;
}
await constraintEnforcer.RecordDenialAsync(identity, command.Kind.ToString(), tagAddress, failure, cancellationToken)
await constraintEnforcer.RecordDenialAsync(identity, command.Kind.ToString(), tagAddress, failure, correlationId, cancellationToken)
.ConfigureAwait(false);
denied[index] = new SubscribeResult
{
@@ -507,6 +525,7 @@ public sealed class MxAccessGatewayService(
MxCommand command,
int serverHandle,
IReadOnlyList<string> tagAddresses,
string? correlationId,
CancellationToken cancellationToken)
{
// Mirrors FilterTagBulkAsync but produces BulkReadResult denial entries
@@ -526,7 +545,7 @@ public sealed class MxAccessGatewayService(
continue;
}
await constraintEnforcer.RecordDenialAsync(identity, command.Kind.ToString(), tagAddress, failure, cancellationToken)
await constraintEnforcer.RecordDenialAsync(identity, command.Kind.ToString(), tagAddress, failure, correlationId, cancellationToken)
.ConfigureAwait(false);
denied[index] = new BulkReadResult
{
@@ -557,6 +576,7 @@ public sealed class MxAccessGatewayService(
int serverHandle,
Google.Protobuf.Collections.RepeatedField<TEntry> entries,
Func<TEntry, int> getItemHandle,
string? correlationId,
CancellationToken cancellationToken) where TEntry : class
{
// The four bulk-write families each carry a different per-entry message
@@ -586,6 +606,7 @@ public sealed class MxAccessGatewayService(
command.Kind.ToString(),
itemHandle.ToString(System.Globalization.CultureInfo.InvariantCulture),
failure,
correlationId,
cancellationToken)
.ConfigureAwait(false);
denied[index] = new BulkWriteResult
@@ -637,6 +658,7 @@ public sealed class MxAccessGatewayService(
MxCommand command,
int serverHandle,
IReadOnlyList<int> itemHandles,
string? correlationId,
CancellationToken cancellationToken)
{
Dictionary<int, SubscribeResult> denied = [];
@@ -653,7 +675,7 @@ public sealed class MxAccessGatewayService(
continue;
}
await constraintEnforcer.RecordDenialAsync(identity, command.Kind.ToString(), itemHandle.ToString(System.Globalization.CultureInfo.InvariantCulture), failure, cancellationToken)
await constraintEnforcer.RecordDenialAsync(identity, command.Kind.ToString(), itemHandle.ToString(System.Globalization.CultureInfo.InvariantCulture), failure, correlationId, cancellationToken)
.ConfigureAwait(false);
denied[index] = new SubscribeResult
{
@@ -120,20 +120,24 @@ public sealed class ConstraintEnforcer(
/// <param name="commandKind">The command type (e.g., read, write).</param>
/// <param name="target">The target being accessed (tag address or handle).</param>
/// <param name="failure">The constraint failure details.</param>
/// <param name="correlationId">
/// The per-request client correlation id, if any. Persisted as the audit record's typed
/// <c>CorrelationId</c> when it parses as a GUID; a non-GUID value leaves that column null.
/// The raw string is always preserved in <c>DetailsJson["clientCorrelationId"]</c> so a
/// non-GUID id (e.g. from Rust/Python/Java clients) is never silently lost.
/// </param>
/// <param name="cancellationToken">Token to observe for cancellation.</param>
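/// <example>
/// Illustrative outcomes as implemented by this method (the input values are
/// hypothetical):
/// <code>
/// // correlationId                            CorrelationId   DetailsJson["clientCorrelationId"]
/// // "3f2504e0-4f89-11d3-9a0c-0305e82c3301" -> that Guid      "3f2504e0-4f89-11d3-9a0c-0305e82c3301"
/// // "req-42"                               -> null           "req-42"
/// // null                                   -> null           ""
/// </code>
/// </example>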
public async Task RecordDenialAsync(
ApiKeyIdentity? identity,
string commandKind,
string target,
ConstraintFailure failure,
string? correlationId,
CancellationToken cancellationToken)
{
// Emit a canonical Denied AuditEvent directly through the best-effort IAuditWriter
// (Task 2.3 #6): structured Target ("<commandKind>:<target>") and a richer DetailsJson
// envelope carrying constraint/message/commandKind/target.
// TODO(Task 2.3): CorrelationId is left null here. Threading the per-request
// ClientCorrelationId down to RecordDenialAsync would require an invasive IConstraintEnforcer
// signature change across the gRPC call path; that is deferred to a follow-up.
AuditEvent auditEvent = new()
{
EventId = Guid.NewGuid(),
@@ -144,13 +148,18 @@ public sealed class ConstraintEnforcer(
Category = CanonicalForwardingApiKeyAuditStore.ApiKeyCategory,
Target = $"{commandKind}:{target}",
SourceNode = null,
CorrelationId = null,
CorrelationId = Guid.TryParse(correlationId, out var cid) ? cid : (Guid?)null,
DetailsJson = JsonSerializer.Serialize(new Dictionary<string, string>
{
["constraint"] = failure.ConstraintName,
["message"] = failure.Message,
["commandKind"] = commandKind,
["target"] = target,
// Always preserve the raw client correlation id here so it is never silently
// lost: the typed CorrelationId column only retains GUID-parseable ids, but
// clients (Rust/Python/Java) commonly send non-GUID or empty trace ids. The
// raw id is a client trace id, not a secret, so storing it is fine.
["clientCorrelationId"] = correlationId ?? "",
}),
};
@@ -45,11 +45,16 @@ public interface IConstraintEnforcer
/// <param name="commandKind">The kind of command denied.</param>
/// <param name="target">The target of the denied command.</param>
/// <param name="failure">The constraint failure details.</param>
/// <param name="correlationId">
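/// The per-request client correlation id, if any. Stored on the audit record's
/// <c>CorrelationId</c> when it parses as a GUID; otherwise that field is left null.
/// <c>ConstraintEnforcer</c> additionally keeps the raw string in
/// <c>DetailsJson["clientCorrelationId"]</c> so non-GUID ids are not lost.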
/// </param>
/// <param name="cancellationToken">Token to observe for cancellation.</param>
Task RecordDenialAsync(
ApiKeyIdentity? identity,
string commandKind,
string target,
ConstraintFailure failure,
string? correlationId,
CancellationToken cancellationToken);
}
@@ -1,4 +1,10 @@
using System.Runtime.CompilerServices;
using Microsoft.Extensions.Logging;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Server.Configuration;
using ZB.MOM.WW.MxGateway.Server.Dashboard.Hubs;
using ZB.MOM.WW.MxGateway.Server.Grpc;
using ZB.MOM.WW.MxGateway.Server.Metrics;
using ZB.MOM.WW.MxGateway.Server.Workers;
namespace ZB.MOM.WW.MxGateway.Server.Sessions;
@@ -7,6 +13,7 @@ public sealed class GatewaySession
{
private readonly object _syncRoot = new();
private readonly SemaphoreSlim _closeLock = new(1, 1);
private readonly SessionEventStreaming _eventStreaming;
private IWorkerClient? _workerClient;
private SessionState _state = SessionState.Creating;
private string? _finalFault;
@@ -14,6 +21,12 @@ public sealed class GatewaySession
private DateTimeOffset? _leaseExpiresAt;
private bool _closeStarted;
private int _activeEventSubscriberCount;
private SessionEventDistributor? _eventDistributor;
private bool _eventDistributorStarted;
private bool _dashboardMirrorStarted;
private IEventSubscriberLease? _dashboardMirrorLease;
private Task? _dashboardMirrorTask;
private CancellationTokenSource? _dashboardMirrorCts;
private readonly Dictionary<(int ServerHandle, int ItemHandle), SessionItemRegistration> _items = [];
/// <summary>
@@ -30,6 +43,11 @@ public sealed class GatewaySession
/// <param name="startupTimeout">Timeout for worker process startup.</param>
/// <param name="shutdownTimeout">Timeout for worker process shutdown.</param>
/// <param name="openedAt">Timestamp when the session opened.</param>
/// <remarks>
/// Constructs a session with no owner key (<see cref="OwnerKeyId"/> will be null).
/// Authenticated call sites that have a resolved API key identity must use the
/// overload that accepts <c>ownerKeyId</c> and pass the caller's key id explicitly.
/// </remarks>
public GatewaySession(
string sessionId,
string backendName,
@@ -48,6 +66,7 @@ public sealed class GatewaySession
pipeName,
nonce,
clientIdentity,
ownerKeyId: null,
clientSessionName,
clientCorrelationId,
commandTimeout,
@@ -66,6 +85,7 @@ public sealed class GatewaySession
/// <param name="pipeName">Name of the named pipe for gateway-worker IPC.</param>
/// <param name="nonce">Security nonce for worker validation.</param>
/// <param name="clientIdentity">Client identity from the authentication context.</param>
/// <param name="ownerKeyId">API key identifier of the caller that created this session.</param>
/// <param name="clientSessionName">Client-supplied session name.</param>
/// <param name="clientCorrelationId">Client-supplied correlation identifier.</param>
/// <param name="commandTimeout">Timeout for command invocation.</param>
@@ -73,19 +93,30 @@ public sealed class GatewaySession
/// <param name="shutdownTimeout">Timeout for worker process shutdown.</param>
/// <param name="leaseDuration">Duration of the session lease.</param>
/// <param name="openedAt">Timestamp when the session opened.</param>
/// <param name="eventStreaming">
/// Dependencies the session uses to construct and own its
/// <see cref="SessionEventDistributor"/> (the single per-session worker-event pump
/// that fans raw mapped <see cref="MxEvent"/>s to every subscriber lease). When
/// <see langword="null"/>, defaults are used (no replay logger, system clock, a
/// fresh mapper, and default <see cref="EventOptions"/>) so unit tests that build a
/// session directly still get a working distributor. Production passes the
/// DI-resolved dependencies.
/// </param>
public GatewaySession(
string sessionId,
string backendName,
string pipeName,
string nonce,
string? clientIdentity,
string? ownerKeyId,
string? clientSessionName,
string? clientCorrelationId,
TimeSpan commandTimeout,
TimeSpan startupTimeout,
TimeSpan shutdownTimeout,
TimeSpan leaseDuration,
DateTimeOffset openedAt)
DateTimeOffset openedAt,
SessionEventStreaming? eventStreaming = null)
{
if (string.IsNullOrWhiteSpace(sessionId))
{
@@ -112,6 +143,7 @@ public sealed class GatewaySession
PipeName = pipeName;
Nonce = nonce;
ClientIdentity = clientIdentity;
OwnerKeyId = ownerKeyId;
ClientSessionName = clientSessionName;
ClientCorrelationId = clientCorrelationId;
CommandTimeout = commandTimeout;
@@ -121,6 +153,7 @@ public sealed class GatewaySession
OpenedAt = openedAt;
_lastClientActivityAt = openedAt;
_leaseExpiresAt = openedAt + leaseDuration;
_eventStreaming = eventStreaming ?? SessionEventStreaming.Default;
}
/// <summary>
@@ -148,6 +181,11 @@ public sealed class GatewaySession
/// </summary>
public string? ClientIdentity { get; }
/// <summary>
/// Gets the API key identifier of the caller that created this session.
/// </summary>
public string? OwnerKeyId { get; }
/// <summary>
/// Gets the client-supplied session name.
/// </summary>
@@ -318,9 +356,268 @@ public sealed class GatewaySession
/// <summary>
/// Transitions the session to the Ready state.
/// </summary>
/// <remarks>
/// On becoming Ready the session starts its internal dashboard mirror (Task 6) when a
/// dashboard broadcaster was supplied. The mirror registers an internal subscriber on
/// the distributor and starts the pump <em>before</em> any gRPC client attaches, so the
/// dashboard EventsHub receives session events even with no gRPC subscriber streaming —
/// fixing the "dark feed" where the dashboard only saw events while a gRPC client was
/// actively streaming. Registering the internal subscriber BEFORE
/// <see cref="SessionEventDistributor.StartAsync"/> also avoids the Task 4 hazard where
/// starting the pump at Ready with zero subscribers drained a fast-completing worker
/// stream into nothing and left a later subscriber hanging: there is now always a
/// subscriber (the dashboard one) registered before the pump starts.
/// </remarks>
public void MarkReady()
{
TransitionTo(SessionState.Ready);
StartDashboardMirror();
}
// Constructs the distributor exactly once and registers the subscriber BEFORE starting
// the pump, so no event the pump fans can be missed between start and register.
// Without a dashboard broadcaster (unit tests), the pump is started lazily on the FIRST
// AttachEventSubscriber rather than at MarkReady, preserving the "events start flowing on
// subscribe" behavior and avoiding draining a fast-completing source into the void before
// any subscriber exists. With a broadcaster, StartDashboardMirror normally starts the pump
// at MarkReady with the internal dashboard subscriber registered, so a later call here only
// registers; EnsureDistributorCreated guarantees a single start either way. The source
// factory mirrors the mapping/ordering/start that EventStreamService.ProduceEventsAsync
// used before Task 4: it drains the worker event stream in source order and maps each
// WorkerEvent to the public MxEvent with the same mapper, with no skip/filter; per-RPC
// filtering (e.g. AfterWorkerSequence) stays at the subscriber boundary in
// EventStreamService. If this caller starts the pump, its lease sees the stream from its beginning.
private IEventSubscriberLease StartDistributorAndRegister()
{
SessionEventDistributor distributor = EnsureDistributorCreated(out bool startNow);
// Register BEFORE starting the pump so a subscriber is present when the pump begins
// draining — no event is fanned to an empty subscriber set and then missed by this
// first subscriber. StartAsync only schedules the pump task; it never blocks.
IEventSubscriberLease lease = distributor.Register();
StartPumpIfRequested(distributor, startNow);
return lease;
}
// Constructs the distributor exactly once and reports whether THIS caller is the one
// that should start the pump (i.e. it observed the unstarted state and claimed the
// start). Both the construction and the started-flag flip happen under _syncRoot so two
// concurrent callers (e.g. MarkReady's dashboard mirror and a racing first
// AttachEventSubscriber) agree on a single distributor and a single start.
private SessionEventDistributor EnsureDistributorCreated(out bool startNow)
{
lock (_syncRoot)
{
if (_eventDistributor is null)
{
EventOptions eventOptions = _eventStreaming.EventOptions;
_eventDistributor = new SessionEventDistributor(
SessionId,
MapWorkerEventsAsync,
eventOptions.QueueCapacity,
eventOptions.ReplayBufferCapacity,
eventOptions.ReplayRetentionSeconds,
_eventStreaming.DistributorLogger,
_eventStreaming.TimeProvider,
CreateOverflowHandler(eventOptions.BackpressurePolicy));
}
startNow = false;
if (!_eventDistributorStarted)
{
_eventDistributorStarted = true;
startNow = true;
}
return _eventDistributor;
}
}
private static void StartPumpIfRequested(SessionEventDistributor distributor, bool startNow)
{
if (!startNow)
{
return;
}
// StartAsync only schedules the pump via Task.Run and returns a completed task;
// it does not perform any async I/O itself. The sync-over-async call here is
// therefore safe and will not deadlock. Do not make StartAsync truly async
// (i.e., await real I/O before returning) without also changing this call site.
distributor.StartAsync(CancellationToken.None).GetAwaiter().GetResult();
}
// Registers the gateway-owned internal dashboard subscriber on the distributor and starts
// a background loop that mirrors every fanned event to the dashboard broadcaster. Called
// once when the session becomes Ready (idempotent). The internal subscriber is registered
// BEFORE the pump starts (see StartDistributorAndRegister / EnsureDistributorCreated), so
// a subscriber is always present at pump start — the dashboard receives events with no
// gRPC subscriber attached, and the Task 4 "zero-subscriber drain into the void" hang
// cannot occur. No-op when no dashboard broadcaster was supplied (unit tests).
//
// Race-safety (Issue 1): _dashboardMirrorLease and _dashboardMirrorTask are published
// atomically under a SINGLE second lock section, and DisposeAsync reads/nulls them under
// that same lock. EnsureDistributorCreated, Register, and StartPumpIfRequested are called
// without this method holding _syncRoot (avoiding lock inversion with the distributor's own
// lifecycle lock); afterwards we re-enter _syncRoot and check for concurrent disposal. If the
// session is already Closing/Closed/
// Faulted at that point, we dispose the just-created lease immediately and do NOT start
// the mirror task, so nothing is orphaned.
private void StartDashboardMirror()
{
IDashboardEventBroadcaster? broadcaster = _eventStreaming.DashboardBroadcaster;
if (broadcaster is null)
{
return;
}
CancellationToken loopToken;
lock (_syncRoot)
{
if (_dashboardMirrorStarted || _state is SessionState.Closing or SessionState.Closed or SessionState.Faulted)
{
return;
}
_dashboardMirrorStarted = true;
_dashboardMirrorCts = new CancellationTokenSource();
loopToken = _dashboardMirrorCts.Token;
}
// Create the distributor (claiming the start if we are first) and register the
// internal subscriber BEFORE starting the pump. isInternal: true keeps the dashboard
// subscriber out of the single-subscriber overflow accounting, so a slow/broken
// dashboard mirror only disconnects itself and never faults the session.
// These three calls are OUTSIDE _syncRoot to avoid holding it across
// EnsureDistributorCreated's own lock and StartAsync's Task.Run.
SessionEventDistributor distributor = EnsureDistributorCreated(out bool startNow);
IEventSubscriberLease lease = distributor.Register(isInternal: true);
StartPumpIfRequested(distributor, startNow);
// Publish BOTH the lease and the task atomically under one lock section so
// DisposeAsync always sees them in a consistent state: either both are set or
// both are null. If the session already started disposal before we got here,
// dispose the lease immediately instead of orphaning it.
lock (_syncRoot)
{
if (_state is SessionState.Closing or SessionState.Closed or SessionState.Faulted)
{
// Disposal already ran (or is in progress) — discard the just-created
// lease now so it is not orphaned. Do NOT launch the mirror task.
lease.Dispose();
return;
}
_dashboardMirrorLease = lease;
_dashboardMirrorTask = Task.Run(
() => RunDashboardMirrorAsync(broadcaster, lease, loopToken),
CancellationToken.None);
}
}
// Reads the internal dashboard subscriber's channel and publishes each RAW fanned event
// to the dashboard broadcaster. The dashboard is a first-class distributor subscriber
// (Task 6), so it sees the session's full raw event activity — NOT the per-gRPC-subscriber
// AfterWorkerSequence filtering that EventStreamService applies at its own boundary. This
// is intentional: the dashboard is a separate LDAP-authenticated monitoring view (per-
// session dashboard ACL is the separate Task 18). Publish is best-effort / never-throw, so
// a slow or broken dashboard cannot fault the session or stall the pump; the bounded
// internal subscriber channel (Task 5 per-subscriber isolation) only disconnects THIS
// mirror on overflow, leaving the session and other subscribers untouched.
private async Task RunDashboardMirrorAsync(
IDashboardEventBroadcaster broadcaster,
IEventSubscriberLease lease,
CancellationToken cancellationToken)
{
try
{
await foreach (MxEvent mxEvent in lease.Reader
.ReadAllAsync(cancellationToken)
.ConfigureAwait(false))
{
try
{
broadcaster.Publish(SessionId, mxEvent);
}
catch (Exception exception)
{
// Publish is documented never-throw, but enforce it here too so a future
// implementation cannot fault the mirror loop. Logs identifiers only.
_eventStreaming.DistributorLogger.LogDebug(
exception,
"Dashboard event mirror threw for session {SessionId}; continuing.",
SessionId);
}
}
}
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
{
// Teardown path: the session is shutting down the mirror.
}
catch (SessionManagerException)
{
// The internal subscriber's channel overflowed and the distributor disconnected
// it with a terminal overflow fault. That disconnects only the dashboard mirror;
// the session, pump, and any gRPC subscriber are unaffected. Stop mirroring.
}
catch (Exception exception)
{
// Source-fault completion (worker event stream terminated abnormally) surfaces
// here. The session's own fault handling runs via the gRPC path / lifecycle; the
// mirror just stops. Logs identifiers only.
_eventStreaming.DistributorLogger.LogDebug(
exception,
"Dashboard event mirror loop ended for session {SessionId}.",
SessionId);
}
}
// Builds the per-subscriber backpressure handler the distributor invokes when a
// subscriber's bounded channel overflows. The distributor always disconnects the
// offending subscriber with an EventQueueOverflow fault; this handler adds the
// observable side effects, preserving exactly what the pre-epic per-RPC overflow path
// emitted:
// - always record the queue-overflow metric, labeled by subscriber kind;
// - FailFast in the legacy single-subscriber case (isOnlySubscriber): fault the whole
// session and record the fault metric, matching back-compat behavior;
// - FailFast with multiple subscribers, or DisconnectSubscriber in any case: do NOT
// fault the session — the distributor's disconnect of the one slow subscriber is the
// whole remedy, so other subscribers and the pump are unaffected. Multi-subscriber
// FailFast deliberately degrades to a disconnect because faulting a shared session on
// one slow consumer would punish healthy subscribers.
// The delegate now carries isInternal directly (Issue 4), so the metric label is chosen
// without any heuristic: "dashboard-mirror" for internal, "grpc-event-stream" for external.
private SubscriberOverflowHandler CreateOverflowHandler(EventBackpressurePolicy policy)
{
GatewayMetrics metrics = _eventStreaming.Metrics;
string sessionId = SessionId;
return (isOnlySubscriber, isInternal) =>
{
// Label the overflow metric by subscriber kind. The distributor passes isInternal
// directly, so no heuristic is needed to distinguish an internal overflow (the
// gateway-owned dashboard mirror) from an external one (a gRPC streaming client).
string label = isInternal ? "dashboard-mirror" : "grpc-event-stream";
metrics.QueueOverflow(label);
if (policy == EventBackpressurePolicy.FailFast && isOnlySubscriber)
{
MarkFaulted($"Session {sessionId} event stream queue overflowed.");
metrics.Fault(SessionManagerErrorCode.EventQueueOverflow.ToString());
}
};
}
// The distributor's single event source. Drains the worker event stream once (the
// distributor guarantees a single consumer) and maps each frame to the public MxEvent,
// preserving worker order. Mirrors the former ProduceEventsAsync mapping exactly.
private async IAsyncEnumerable<MxEvent> MapWorkerEventsAsync(
[EnumeratorCancellation] CancellationToken cancellationToken)
{
MxAccessGrpcMapper mapper = _eventStreaming.Mapper;
await foreach (WorkerEvent workerEvent in ReadEventsAsync(cancellationToken)
.ConfigureAwait(false))
{
yield return mapper.MapEvent(workerEvent);
}
}
/// <summary>
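The single-start handshake in `EnsureDistributorCreated` (construct once, and let exactly one caller claim the pump start, both under one lock) can be illustrated outside the gateway. The following is a minimal C# sketch, not the gateway's code. `gate`, `instance`, and `started` are hypothetical stand-ins for `_syncRoot`, `_eventDistributor`, and `_eventDistributorStarted`, and a plain `object` replaces the distributor. Sixteen racing callers (an arbitrary number) all receive the same instance, and only one of them is told to start.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins for _syncRoot, _eventDistributor and _eventDistributorStarted.
object gate = new();
object? instance = null;
bool started = false;
int constructions = 0;
int starts = 0;

// Construct at most once and let exactly one caller claim the start, all under one lock.
(object Instance, bool StartNow) EnsureCreated()
{
    lock (gate)
    {
        if (instance is null)
        {
            instance = new object();
            constructions++;
        }

        bool startNow = !started;
        started = true;
        return (instance, startNow);
    }
}

// Race several callers (think: dashboard mirror vs. a first AttachEventSubscriber).
object?[] seen = new object?[16];
Task[] callers = new Task[seen.Length];
for (int i = 0; i < callers.Length; i++)
{
    int slot = i;
    callers[slot] = Task.Run(() =>
    {
        (object inst, bool startNow) = EnsureCreated();
        seen[slot] = inst;
        if (startNow)
        {
            Interlocked.Increment(ref starts); // only the claiming caller would start the pump
        }
    });
}

Task.WaitAll(callers);
Console.WriteLine($"constructions={constructions}, starts={starts}");
```

Because the start claim is decided inside the same lock that publishes the instance, no caller can see a constructed-but-unclaimed state and start a second pump.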
@@ -381,10 +678,15 @@ public sealed class GatewaySession
}
/// <summary>
/// Attaches an event subscriber and returns a disposable lease.
/// Attaches an event subscriber and returns a lease whose
/// <see cref="IEventSubscriberLease.Reader"/> reads the fanned public
/// <see cref="MxEvent"/>s for this subscriber. The single-subscriber guard
/// (Tasks 7/8 relax it) is unchanged: with multi-subscriber disabled a second
/// attach is rejected. The returned lease, when disposed, unregisters the
/// distributor subscriber AND decrements the active-subscriber count.
/// </summary>
/// <param name="allowMultipleSubscribers">If true, allows multiple concurrent event subscribers.</param>
public IDisposable AttachEventSubscriber(bool allowMultipleSubscribers)
public IEventSubscriberLease AttachEventSubscriber(bool allowMultipleSubscribers)
{
lock (_syncRoot)
{
@@ -403,7 +705,20 @@ public sealed class GatewaySession
}
_activeEventSubscriberCount++;
return new EventSubscriberLease(this);
}
// Construct/start the distributor and register this subscriber. Done outside the
// guard lock (StartDistributorAndRegister takes _syncRoot itself for construction).
// On any failure roll back the count we just took so the guard stays consistent.
try
{
IEventSubscriberLease distributorLease = StartDistributorAndRegister();
return new EventSubscriberLease(this, distributorLease);
}
catch
{
DetachEventSubscriber();
throw;
}
}
@@ -960,6 +1275,63 @@ public sealed class GatewaySession
{
}
// Stop the internal dashboard mirror first: cancel its loop, dispose its lease (which
// unregisters its internal distributor subscriber and completes its channel), and
// await the loop task. This runs BEFORE the distributor is disposed, just as the
// distributor is disposed before the worker client, so the mirror is no longer
// reading from the pump when the pump and its source (the worker client) tear down.
IEventSubscriberLease? dashboardLease;
Task? dashboardTask;
CancellationTokenSource? dashboardCts;
lock (_syncRoot)
{
dashboardLease = _dashboardMirrorLease;
dashboardTask = _dashboardMirrorTask;
dashboardCts = _dashboardMirrorCts;
_dashboardMirrorLease = null;
_dashboardMirrorTask = null;
_dashboardMirrorCts = null;
}
if (dashboardCts is not null)
{
await dashboardCts.CancelAsync().ConfigureAwait(false);
}
dashboardLease?.Dispose();
if (dashboardTask is not null)
{
try
{
await dashboardTask.ConfigureAwait(false);
}
catch (Exception)
{
// The mirror loop swallows its own faults; any escape here must not block
// disposal. The loop has stopped, which is all teardown requires.
}
}
dashboardCts?.Dispose();
// Stop the event pump and complete every subscriber channel before tearing down the
// worker client (the pump's source). DisposeAsync is the single session teardown
// point (SessionManager.RemoveSessionAsync awaits it after close), so awaiting it
// here guarantees the distributor's pump task is observed and subscribers are
// completed rather than left dangling.
SessionEventDistributor? distributor;
lock (_syncRoot)
{
distributor = _eventDistributor;
_eventDistributor = null;
}
if (distributor is not null)
{
await distributor.DisposeAsync().ConfigureAwait(false);
}
if (_workerClient is not null)
{
await _workerClient.DisposeAsync().ConfigureAwait(false);
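`SessionEventDistributor.DisposeAsync`, which this teardown awaits, bounds its wait for the pump by racing the pump task against a `Task.Delay`, so a source that ignores cancellation cannot hang session disposal. A minimal C# sketch of that pattern follows. `StopsWithin` is a hypothetical helper, and the two `Task.Delay(Timeout.Infinite, ...)` tasks stand in for a pump that honors its token and one that ignores it. It assumes .NET 8 or later for `CancellationTokenSource.CancelAsync`, which the session code already uses.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper mirroring the distributor's bounded wait: true if the pump stopped
// within the timeout, false if it was abandoned (the caller then completes subscribers anyway).
static async Task<bool> StopsWithin(Task pump, TimeSpan timeout)
{
    Task completed = await Task.WhenAny(pump, Task.Delay(timeout));
    if (!ReferenceEquals(completed, pump))
    {
        return false;
    }

    try
    {
        await pump; // observe the pump's outcome; cancellation is the expected result
    }
    catch (OperationCanceledException)
    {
        // Normal shutdown path.
    }

    return true;
}

using CancellationTokenSource shutdown = new();
Task cooperativePump = Task.Delay(Timeout.Infinite, shutdown.Token); // honors cancellation
Task stubbornPump = Task.Delay(Timeout.Infinite);                    // ignores cancellation

await shutdown.CancelAsync();

TimeSpan shutdownTimeout = TimeSpan.FromMilliseconds(200);
bool cooperativeStopped = await StopsWithin(cooperativePump, shutdownTimeout);
bool stubbornStopped = await StopsWithin(stubbornPump, shutdownTimeout);
Console.WriteLine($"cooperative stopped: {cooperativeStopped}, stubborn stopped: {stubbornStopped}");
```

The timeout only limits how long disposal waits; an abandoned pump keeps running in the background, which is why the distributor logs a warning on that path.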
@@ -1101,22 +1473,30 @@ public sealed class GatewaySession
}
}
private sealed class EventSubscriberLease(GatewaySession session) : IDisposable
private sealed class EventSubscriberLease(GatewaySession session, IEventSubscriberLease distributorLease)
: IEventSubscriberLease
{
private bool _disposed;
// 0 = live, 1 = disposed. Interlocked so concurrent stream-completion +
// client-cancellation paths cannot both call DetachEventSubscriber and
// double-decrement _activeEventSubscriberCount to -1.
private int _leaseDisposed;
/// <inheritdoc />
public System.Threading.Channels.ChannelReader<MxEvent> Reader => distributorLease.Reader;
/// <summary>
/// Disposes the lease and detaches the event subscriber.
/// Disposes the lease: unregisters this subscriber from the distributor (completing
/// its channel) and decrements the session's active-subscriber count. Ordering is
/// not significant — the count guard and the distributor registration are
/// independent — but both must run exactly once.
/// </summary>
public void Dispose()
{
if (_disposed)
if (Interlocked.Exchange(ref _leaseDisposed, 1) == 0)
{
return;
distributorLease.Dispose();
session.DetachEventSubscriber();
}
session.DetachEventSubscriber();
_disposed = true;
}
}
}
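The `EventSubscriberLease.Dispose` change replaces a plain `bool` flag with `Interlocked.Exchange`, so two racing disposers (stream completion and client cancellation) cannot both detach and drive the subscriber count negative. A minimal C# sketch of the idiom, where `activeSubscribers` is a hypothetical stand-in for `_activeEventSubscriberCount` and `Parallel.For` simulates many concurrent disposal paths:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

int activeSubscribers = 1; // hypothetical stand-in for _activeEventSubscriberCount
int leaseDisposed = 0;     // 0 = live, 1 = disposed

void DisposeLease()
{
    // Only the caller that swaps 0 -> 1 runs the teardown; every later caller sees 1 and skips it.
    // A plain `if (!disposed) { ...; disposed = true; }` lets two threads both pass the check.
    if (Interlocked.Exchange(ref leaseDisposed, 1) == 0)
    {
        Interlocked.Decrement(ref activeSubscribers);
    }
}

// Simulate many disposal paths racing on the same lease.
Parallel.For(0, 64, _ => DisposeLease());
Console.WriteLine($"active subscribers: {activeSubscribers}");
```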
@@ -0,0 +1,18 @@
using System.Threading.Channels;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
namespace ZB.MOM.WW.MxGateway.Server.Sessions;
/// <summary>
/// A registration lease into a <see cref="SessionEventDistributor"/>. Exposes the
/// subscriber's own <see cref="ChannelReader{T}"/> of fanned events. Disposing the
/// lease unregisters the subscriber and completes its channel without disturbing the
/// pump or other subscribers.
/// </summary>
public interface IEventSubscriberLease : IDisposable
{
/// <summary>
/// Gets the reader for this subscriber's fanned event channel.
/// </summary>
ChannelReader<MxEvent> Reader { get; }
}
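The lease's `Reader` is backed by a bounded channel that the distributor's `Register` creates with `BoundedChannelFullMode.Wait` and that the pump writes to only with `TryWrite`. The comment in `Register` argues that this combination is what makes overflow observable without ever blocking the pump. A minimal C# sketch using only `System.Threading.Channels` shows the difference from a `Drop*` mode; the capacity of 2 is an arbitrary illustration value.

```csharp
using System;
using System.Threading.Channels;

// Same options shape as the distributor's per-subscriber channel, with a tiny capacity.
Channel<int> waitChannel = Channel.CreateBounded<int>(new BoundedChannelOptions(2)
{
    SingleReader = true,
    SingleWriter = true,
    FullMode = BoundedChannelFullMode.Wait,
    AllowSynchronousContinuations = false,
});

bool first = waitChannel.Writer.TryWrite(1);
bool second = waitChannel.Writer.TryWrite(2);
bool third = waitChannel.Writer.TryWrite(3); // channel full: returns false immediately, never blocks

// A Drop* mode hides the overflow: TryWrite reports success while discarding the new item.
Channel<int> dropChannel = Channel.CreateBounded<int>(new BoundedChannelOptions(2)
{
    FullMode = BoundedChannelFullMode.DropWrite,
});
dropChannel.Writer.TryWrite(1);
dropChannel.Writer.TryWrite(2);
bool droppedButTrue = dropChannel.Writer.TryWrite(3);

Console.WriteLine($"wait: {first} {second} {third}; dropWrite: {droppedButTrue}");
```

Only the `Wait` channel gives the pump a `false` it can turn into a per-subscriber overflow decision; the `DropWrite` channel silently loses event 3.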
@@ -8,11 +8,13 @@ public interface ISessionManager
/// <summary>Opens a new gateway session and launches a worker process.</summary>
/// <param name="request">Request payload.</param>
/// <param name="clientIdentity">Client identity string.</param>
/// <param name="ownerKeyId">API key identifier of the caller creating the session.</param>
/// <param name="cancellationToken">Token to cancel the asynchronous operation.</param>
/// <returns>The newly opened session.</returns>
Task<GatewaySession> OpenSessionAsync(
SessionOpenRequest request,
string? clientIdentity,
string? ownerKeyId,
CancellationToken cancellationToken);
/// <summary>Attempts to retrieve a session by ID.</summary>
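The new `SessionEventDistributor` file below retains events in a replay ring buffer with two eviction rules: drop the oldest entry once the count exceeds the capacity, and drop entries older than the retention window on the next append or query. As a sketch of those rules only (the bodies of `AppendToReplayBuffer` and `TryGetReplayFrom` are not shown here), the C# below uses a hypothetical `(Sequence, At)` entry and a manually advanced clock in place of `ReplayEntry` and `TimeProvider`; capacity 3 and a 10-second retention are illustration values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical parameters and clock; the distributor takes these from EventOptions and a TimeProvider.
const int capacity = 3;
TimeSpan retention = TimeSpan.FromSeconds(10);
DateTimeOffset now = DateTimeOffset.UnixEpoch;

// Entries are kept in ascending sequence order, so the oldest is always at the front.
var buffer = new LinkedList<(ulong Sequence, DateTimeOffset At)>();

void EvictExpired()
{
    // Age eviction: drop entries strictly older than the retention window.
    while (buffer.First is { } oldest && now - oldest.Value.At > retention)
    {
        buffer.RemoveFirst();
    }
}

void Append(ulong sequence)
{
    buffer.AddLast((sequence, now));
    if (buffer.Count > capacity)
    {
        buffer.RemoveFirst(); // capacity eviction: drop the oldest entry
    }
    EvictExpired(); // age eviction on append
}

for (ulong seq = 1; seq <= 5; seq++)
{
    Append(seq);
    now += TimeSpan.FromSeconds(1);
}
string afterCapacity = string.Join(",", buffer.Select(e => e.Sequence));

now += TimeSpan.FromSeconds(9);
EvictExpired(); // age eviction on query
string afterAge = string.Join(",", buffer.Select(e => e.Sequence));

Console.WriteLine(afterCapacity);
Console.WriteLine(afterAge);
```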
@@ -0,0 +1,666 @@
using System.Collections.Concurrent;
using System.Threading.Channels;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
namespace ZB.MOM.WW.MxGateway.Server.Sessions;
/// <summary>
/// Invoked by the pump (on the pump thread) when a subscriber's bounded channel is full
/// and the event cannot be written. The handler applies policy side-effects only:
/// it records the overflow metric and, in the legacy single-subscriber FailFast case,
/// faults the owning session. The handler MUST NOT complete the subscriber's channel —
/// the distributor performs the disconnect and channel-completion unconditionally,
/// regardless of what the handler does.
/// </summary>
/// <param name="isOnlySubscriber">
/// <see langword="true"/> when the overflowing subscriber is the sole registered
/// subscriber at the moment of overflow (legacy single-subscriber mode). FailFast faults
/// the session only in this case; with multiple subscribers FailFast degrades to a
/// per-subscriber disconnect so one slow consumer never faults a session shared by others.
/// Always <see langword="false"/> for internal subscribers (the dashboard mirror) because
/// <see cref="SessionEventDistributor"/> excludes them from the external-subscriber count.
/// </param>
/// <param name="isInternal">
/// <see langword="true"/> when the overflowing subscriber is the gateway-owned internal
/// dashboard mirror subscriber. The handler uses this to choose the correct metric label
/// (<c>"dashboard-mirror"</c> vs <c>"grpc-event-stream"</c>).
/// </param>
public delegate void SubscriberOverflowHandler(bool isOnlySubscriber, bool isInternal);
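The `SubscriberOverflowHandler` contract and `OnSubscriberOverflow` combine into a small decision table. As a sketch under assumptions (subscribers modeled as hypothetical `(Id, IsInternal)` tuples, `failFast` standing in for `EventBackpressurePolicy.FailFast`, and the session fault reduced to a boolean), the C# below reproduces the rules stated in these docs and in `CreateOverflowHandler`: internal subscribers never count as the sole subscriber, the metric label follows `isInternal`, and FailFast faults the session only for a lone external subscriber.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of the registered set: (Id, IsInternal).
var subscribers = new List<(long Id, bool IsInternal)>
{
    (1, true),  // gateway-owned dashboard mirror
    (2, false), // a single gRPC streaming client
};

// Mirrors the documented rules; internal subscribers are excluded from the count.
(bool IsOnlySubscriber, string Label, bool FaultSession) Decide((long Id, bool IsInternal) overflowing, bool failFast)
{
    int externalCount = subscribers.Count(s => !s.IsInternal);
    bool isOnlySubscriber = !overflowing.IsInternal && externalCount == 1;
    string label = overflowing.IsInternal ? "dashboard-mirror" : "grpc-event-stream";
    return (isOnlySubscriber, label, failFast && isOnlySubscriber);
}

var loneGrpc = Decide(subscribers[1], failFast: true);   // legacy case: faults the session
var dashboard = Decide(subscribers[0], failFast: true);  // internal: never faults the session

subscribers.Add((3, false));                             // a second external client attaches
var sharedGrpc = Decide(subscribers[1], failFast: true); // degrades to a plain disconnect

Console.WriteLine(loneGrpc);
Console.WriteLine(dashboard);
Console.WriteLine(sharedGrpc);
```

In every row the overflowing subscriber is still disconnected by the distributor; the decision only controls the metric label and whether the session itself is faulted.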
/// <summary>
/// Per-session event pump and fan-out. A single background task drains the
/// session's event source <em>exactly once</em> and fans each event out to
/// every currently-registered subscriber's own bounded channel.
/// </summary>
/// <remarks>
/// <para>
/// Introduced by Task 2 of the Session Resilience epic. Task 3 added the bounded
/// replay ring buffer, Task 4 wired the distributor into <c>GatewaySession</c> and
/// <c>EventStreamService</c>, and Task 5 added the per-subscriber
/// backpressure-isolation policy implemented here: a slow subscriber overflows only
/// its own bounded channel, and the pump applies the policy to that subscriber alone
/// (see <see cref="SubscriberOverflowHandler"/> and <c>OnSubscriberOverflow</c>),
/// leaving the pump, the session, and other subscribers running. This class does not
/// lift the single-subscriber guard; that is left to Tasks 7/8. The ring buffer
/// supports capacity eviction (the oldest entry is dropped when the count exceeds
/// <c>replayBufferCapacity</c>) and age eviction (entries older than
/// <c>replayRetentionSeconds</c> are dropped on the next append or query), and
/// reconnecting subscribers query it via <see cref="TryGetReplayFrom"/>.
/// </para>
/// <para>
/// <b>Source seam.</b> The event source is injected as a
/// <see cref="Func{T, TResult}"/> that, given a <see cref="CancellationToken"/>,
/// produces an <see cref="IAsyncEnumerable{T}"/> of already-mapped public
/// <see cref="MxEvent"/>s. Production wiring (Task 4) passes
/// <c>GatewaySession.MapWorkerEventsAsync</c>, which maps each worker frame with the
/// gRPC mapper; unit tests pass a plain channel reader's <c>ReadAllAsync</c> with no
/// real session. The pump owns the single consumption of this enumerable; fan-out
/// happens on the public <see cref="MxEvent"/> after mapping, preserving the ordering
/// of the former <c>EventStreamService.ProduceEventsAsync</c>.
/// </para>
/// <para>
/// <b>Concurrency.</b> The subscriber set is a
/// <see cref="ConcurrentDictionary{TKey, TValue}"/> keyed by a monotonic id.
/// The pump iterates it with a snapshot-free enumerator (which never throws on
/// concurrent add/remove), and <see cref="Register"/> / lease disposal mutate it
/// without any lock held across an <c>await</c>. Each subscriber channel has a
/// single writer — the pump — so per-channel writes never race. MXAccess parity:
/// events are fanned in the order received; the pump never reorders or
/// synthesizes events.
/// </para>
/// </remarks>
public sealed class SessionEventDistributor : IAsyncDisposable
{
/// <summary>
/// Bounded wait for the pump to stop during disposal. A source factory that
/// ignores cancellation must not hang dispose forever; after this window the
/// pump is abandoned and subscribers are completed anyway.
/// </summary>
private static readonly TimeSpan DefaultShutdownTimeout = TimeSpan.FromSeconds(5);
private readonly string _sessionId;
private readonly Func<CancellationToken, IAsyncEnumerable<MxEvent>> _eventSourceFactory;
private readonly int _subscriberQueueCapacity;
private readonly SubscriberOverflowHandler? _overflowHandler;
private readonly TimeSpan _shutdownTimeout;
private readonly ILogger<SessionEventDistributor> _logger;
private readonly TimeProvider _timeProvider;
private readonly ConcurrentDictionary<long, Subscriber> _subscribers = new();
private readonly CancellationTokenSource _shutdownCts = new();
private readonly object _lifecycleLock = new();
// Replay ring buffer. Appended on the pump thread and queried from arbitrary
// threads via TryGetReplayFrom, so every access is under _replayLock. The deque
// keeps events in ascending WorkerSequence order (the pump fans in source order),
// so the oldest retained event is always at the front. Capacity == 0 disables
// retention; a retention of 0 seconds disables age-based eviction (the constructor
// rejects negative values).
private readonly int _replayBufferCapacity;
private readonly TimeSpan _replayRetention;
private readonly bool _ageEvictionEnabled;
private readonly LinkedList<ReplayEntry> _replayBuffer = new();
private readonly object _replayLock = new();
private bool _anyEventSeen;
private ulong _highestSequenceSeen;
private long _nextSubscriberId;
private Task? _pumpTask;
private bool _started;
private bool _disposed;
/// <summary>
/// Initializes a per-session event distributor.
/// </summary>
/// <param name="sessionId">Owning session id, used only for logging context.</param>
/// <param name="eventSourceFactory">
/// Factory producing the session's event stream given a cancellation token.
/// The pump consumes this exactly once. See the type remarks for the seam Task 4
/// plugs into.
/// </param>
/// <param name="subscriberQueueCapacity">
/// Bounded capacity of each per-subscriber channel. Mirrors the gRPC event-stream
/// queue capacity shape used today.
/// </param>
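/// <param name="logger">Logger for pump lifecycle diagnostics.</param>
/// <param name="overflowHandler">
/// Optional per-subscriber backpressure handler; see the replay-enabled overload for its
/// contract. When <see langword="null"/>, overflowing subscribers are still disconnected
/// but no metric/fault side effect runs.
/// </param>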
/// <remarks>
/// This overload disables the replay ring buffer (capacity 0). Use the overload
/// taking replay parameters to retain events for reconnect/reattach replay.
/// Kept <c>internal</c> so production wiring (Task 4) cannot accidentally use
/// the no-replay path; tests reach it via <c>InternalsVisibleTo</c>.
/// </remarks>
internal SessionEventDistributor(
string sessionId,
Func<CancellationToken, IAsyncEnumerable<MxEvent>> eventSourceFactory,
int subscriberQueueCapacity,
ILogger<SessionEventDistributor> logger,
SubscriberOverflowHandler? overflowHandler = null)
: this(
sessionId,
eventSourceFactory,
subscriberQueueCapacity,
replayBufferCapacity: 0,
replayRetentionSeconds: 0,
logger,
TimeProvider.System,
overflowHandler)
{
}
/// <summary>
/// Initializes a per-session event distributor with a bounded replay ring buffer.
/// </summary>
/// <param name="sessionId">Owning session id, used only for logging context.</param>
/// <param name="eventSourceFactory">
/// Factory producing the session's event stream given a cancellation token.
/// The pump consumes this exactly once. See the type remarks for the seam Task 4
/// plugs into.
/// </param>
/// <param name="subscriberQueueCapacity">
/// Bounded capacity of each per-subscriber channel. Mirrors the gRPC event-stream
/// queue capacity shape used today.
/// </param>
/// <param name="replayBufferCapacity">
/// Maximum number of events retained for replay. The oldest retained event is
/// evicted once this count is exceeded. <c>0</c> disables retention entirely.
/// </param>
/// <param name="replayRetentionSeconds">
/// Maximum age, in seconds, of a retained event. Entries older than this are
/// evicted regardless of capacity. <c>0</c> disables age-based eviction; negative
/// values are rejected with <see cref="ArgumentOutOfRangeException"/>.
/// </param>
/// <param name="logger">Logger for pump lifecycle diagnostics.</param>
/// <param name="timeProvider">
/// Clock used to timestamp and age-evict replay entries. Inject a fake to make
/// age-eviction deterministic in tests.
/// </param>
/// <param name="overflowHandler">
/// Optional per-subscriber backpressure handler invoked when a subscriber's bounded
/// channel is full. It records the overflow metric and, for the legacy
/// single-subscriber FailFast case, faults the owning session. The distributor always
/// disconnects the offending subscriber with an overflow fault regardless of the
/// handler. When <see langword="null"/> (unit/skeleton use) the offending subscriber is
/// still disconnected but no metric/fault side effect runs.
/// </param>
public SessionEventDistributor(
string sessionId,
Func<CancellationToken, IAsyncEnumerable<MxEvent>> eventSourceFactory,
int subscriberQueueCapacity,
int replayBufferCapacity,
double replayRetentionSeconds,
ILogger<SessionEventDistributor> logger,
TimeProvider timeProvider,
SubscriberOverflowHandler? overflowHandler = null)
{
ArgumentException.ThrowIfNullOrWhiteSpace(sessionId);
ArgumentNullException.ThrowIfNull(eventSourceFactory);
ArgumentOutOfRangeException.ThrowIfLessThan(subscriberQueueCapacity, 1);
ArgumentOutOfRangeException.ThrowIfNegative(replayBufferCapacity);
ArgumentOutOfRangeException.ThrowIfNegative(replayRetentionSeconds);
ArgumentNullException.ThrowIfNull(logger);
ArgumentNullException.ThrowIfNull(timeProvider);
_sessionId = sessionId;
_eventSourceFactory = eventSourceFactory;
_subscriberQueueCapacity = subscriberQueueCapacity;
_overflowHandler = overflowHandler;
_shutdownTimeout = DefaultShutdownTimeout;
_replayBufferCapacity = replayBufferCapacity;
_ageEvictionEnabled = replayRetentionSeconds > 0;
_replayRetention = _ageEvictionEnabled
? TimeSpan.FromSeconds(replayRetentionSeconds)
: TimeSpan.Zero;
_logger = logger;
_timeProvider = timeProvider;
}
/// <summary>
/// Gets the count of currently-registered subscribers, including internal
/// (dashboard mirror) subscribers.
/// </summary>
public int SubscriberCount => _subscribers.Count;
/// <summary>
/// Starts the background pump. Idempotent — a second call is a no-op.
/// </summary>
/// <param name="cancellationToken">Token observed only while starting.</param>
public Task StartAsync(CancellationToken cancellationToken)
{
cancellationToken.ThrowIfCancellationRequested();
lock (_lifecycleLock)
{
ObjectDisposedException.ThrowIf(_disposed, this);
if (_started)
{
return Task.CompletedTask;
}
_started = true;
_pumpTask = Task.Run(() => PumpAsync(_shutdownCts.Token), CancellationToken.None);
}
return Task.CompletedTask;
}
/// <summary>
/// Registers a new subscriber and returns its lease. The lease exposes the
/// subscriber's <see cref="ChannelReader{T}"/> and, when disposed, unregisters the
/// subscriber and completes its channel without disturbing the pump or other
/// subscribers.
/// </summary>
/// <param name="isInternal">
/// <see langword="true"/> for a gateway-owned internal subscriber (Task 6: the
/// session's dashboard mirror) that must NOT participate in the single-subscriber
/// overflow accounting. An internal subscriber is excluded from the
/// <c>isOnlySubscriber</c> count, so a lone external gRPC subscriber still reports
/// <c>isOnlySubscriber == true</c> (preserving legacy FailFast session-fault
/// behavior) even while the dashboard subscriber is attached; and an internal
/// subscriber that itself overflows always reports <c>isOnlySubscriber == false</c>,
/// so a slow/broken dashboard can never fault the session — it is merely
/// disconnected from the mirror. Defaults to <see langword="false"/> (external
/// subscriber) so every existing call site is unchanged.
/// </param>
public IEventSubscriberLease Register(bool isInternal = false)
{
// The pump is the single writer for this channel; readers are single-consumer
// (one gRPC stream / dashboard subscriber). Synchronous continuations are
// disabled so a slow reader can never stall the pump on its completion.
//
// The pump MUST stay non-blocking: it writes with the non-blocking TryWrite so one
// slow reader can never stall the single pump that feeds every subscriber. FullMode
// is deliberately Wait — NOT because the pump ever blocks (it never calls the blocking
// WriteAsync overload), but because Wait is the only BoundedChannelFullMode under
// which TryWrite returns false when the channel is full. That false return IS the
// overflow signal the pump needs to apply the per-subscriber backpressure policy. The
// Drop* modes would make TryWrite silently succeed-and-drop, hiding overflow and
// re-introducing the silent data loss this task removes. So: Wait mode + TryWrite =
// a non-blocking pump that still detects a full subscriber channel.
Channel<MxEvent> channel = Channel.CreateBounded<MxEvent>(
new BoundedChannelOptions(_subscriberQueueCapacity)
{
SingleReader = true,
SingleWriter = true,
FullMode = BoundedChannelFullMode.Wait,
AllowSynchronousContinuations = false,
});
long id = Interlocked.Increment(ref _nextSubscriberId);
Subscriber subscriber = new(id, channel, isInternal);
// The disposed check AND the map add happen under the same lock with no await
// in between. DisposeAsync sets _disposed=true under this same lock before it
// calls CompleteAllSubscribers, so once disposal has begun no further subscriber
// can be added — closing the Register-after-DisposeAsync window that would
// otherwise leave a subscriber's channel never completed.
lock (_lifecycleLock)
{
ObjectDisposedException.ThrowIf(_disposed, this);
_subscribers[id] = subscriber;
}
return new SubscriberLease(this, subscriber);
}
/// <summary>
/// Stops the pump and completes all subscriber channels. Idempotent.
/// </summary>
public async ValueTask DisposeAsync()
{
Task? pumpTask;
lock (_lifecycleLock)
{
if (_disposed)
{
return;
}
_disposed = true;
pumpTask = _pumpTask;
}
// Signal the pump to stop. It must not block on a non-reading subscriber:
// it writes with non-blocking TryWrite, so cancellation tears it down promptly.
await _shutdownCts.CancelAsync().ConfigureAwait(false);
if (pumpTask is not null)
{
// Bound the wait: a source factory that ignores cancellation would otherwise
// hang dispose forever. If the pump does not stop in time we log and proceed
// to complete subscribers anyway; DisposeAsync must not throw on this path.
Task completed = await Task.WhenAny(pumpTask, Task.Delay(_shutdownTimeout)).ConfigureAwait(false);
if (!ReferenceEquals(completed, pumpTask))
{
_logger.LogWarning(
"Event distributor pump did not stop within {ShutdownTimeoutSeconds}s for session {SessionId}; completing subscribers and abandoning the pump.",
_shutdownTimeout.TotalSeconds,
_sessionId);
}
else
{
try
{
await pumpTask.ConfigureAwait(false);
}
catch (OperationCanceledException)
{
}
catch (Exception exception)
{
_logger.LogDebug(
exception,
"Event distributor pump faulted during shutdown for session {SessionId}.",
_sessionId);
}
}
}
CompleteAllSubscribers(error: null);
_shutdownCts.Dispose();
}
private async Task PumpAsync(CancellationToken cancellationToken)
{
try
{
await foreach (MxEvent mxEvent in _eventSourceFactory(cancellationToken)
.WithCancellation(cancellationToken)
.ConfigureAwait(false))
{
// Retain for replay BEFORE fan-out so a reconnecting subscriber that
// queries between fan-out and its own read still sees this event. Order
// is preserved: the pump is the single appender and events arrive in
// source order.
AppendToReplayBuffer(mxEvent);
// Enumerating a ConcurrentDictionary's Values never throws on concurrent
// add/remove; a subscriber registered mid-iteration may miss this event,
// which matches "late subscribers see events after they register".
foreach (Subscriber subscriber in _subscribers.Values)
{
// Non-blocking write: TryWrite never blocks the pump on a slow reader.
// A false return means this subscriber's bounded channel is full — the
// per-subscriber overflow signal. We apply the backpressure policy to
// THIS subscriber only; the pump, the session, and every other subscriber
// keep running. Logs identifiers (worker sequence, subscriber id, session)
// only, never the event payload or tag values.
if (!subscriber.Channel.Writer.TryWrite(mxEvent))
{
OnSubscriberOverflow(subscriber, mxEvent.WorkerSequence);
}
}
}
CompleteAllSubscribers(error: null);
}
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
{
// Shutdown path: DisposeAsync completes subscribers.
}
catch (Exception exception)
{
// Unexpected source fault (not the shutdown-cancellation path above). Logged at
// Error so a silently dying event stream is visible by default, not lost in Debug noise.
_logger.LogError(
exception,
"Event distributor source faulted for session {SessionId}.",
_sessionId);
CompleteAllSubscribers(exception);
}
}
// Applies the per-subscriber backpressure policy when a subscriber's bounded channel is
// full. Runs on the pump thread. The offending subscriber is ALWAYS disconnected with an
// overflow fault and unregistered, so it can never wedge the pump again; the overflow
// handler decides the observable side effects (overflow metric, and — for legacy
// single-subscriber FailFast — faulting the owning session). Multi-subscriber FailFast
// intentionally degrades to a plain disconnect (see SubscriberOverflowHandler docs): one
// slow consumer must not fault a session shared by other healthy subscribers.
private void OnSubscriberOverflow(Subscriber subscriber, ulong workerSequence)
{
// Snapshot whether this is the sole subscriber BEFORE we unregister it. This drives
// the FailFast-fault-session-vs-disconnect decision: FailFast only faults the session
// when the overflowing subscriber is the sole subscriber.
//
// This snapshot is safe in v1 because AllowMultipleEventSubscribers=false is enforced
// by the validator and the single-subscriber guard in AttachEventSubscriber — a
// concurrent second registration is impossible, so the false-FailFast race (two
// subscribers, one overflows, Count reads as 1 after the other concurrently unregisters,
// FailFast wrongly faults the session) cannot occur today.
//
// REVISIT (Task 7/8): once multi-subscriber is enabled and the guard is removed, this
// window opens: the external-subscriber count read here can be stale relative to a
// concurrent registration or unregistration, so FailFast may fault a session that is
// shared by another subscriber. Resolve before enabling multi-subscriber.
//
// Task 6: the gateway-owned internal dashboard subscriber is excluded from this
// accounting. (a) An internal subscriber that overflows is NEVER the "only subscriber"
// — a slow/broken dashboard must never fault the session, only disconnect its own
// mirror. (b) Internal subscribers are excluded from the count, so a lone external
// gRPC subscriber still reports isOnlySubscriber==true and preserves the legacy
// FailFast session-fault behavior even while the dashboard mirror is attached.
bool isOnlySubscriber = !subscriber.IsInternal && CountExternalSubscribers() == 1;
_logger.LogDebug(
"Event distributor disconnecting subscriber {SubscriberId} in session {SessionId} after queue overflow (worker sequence {WorkerSequence}).",
subscriber.Id,
_sessionId,
workerSequence);
// Observability + session-fault decision. Errors here must not stall the pump or
// leave the subscriber attached, so the disconnect below runs regardless.
// Pass subscriber.IsInternal so the handler can choose the correct metric label.
try
{
_overflowHandler?.Invoke(isOnlySubscriber, subscriber.IsInternal);
}
catch (Exception exception)
{
_logger.LogError(
exception,
"Event distributor overflow handler threw for session {SessionId}; disconnecting subscriber {SubscriberId} anyway.",
_sessionId,
subscriber.Id);
}
// Disconnect ONLY this subscriber: complete its channel with the overflow fault and
// remove it from the fan-out set. Its gRPC reader's MoveNextAsync then throws the
// SessionManagerException, which EventStreamService surfaces to the client exactly as
// the pre-epic per-RPC overflow did. The pump and every other subscriber are untouched.
if (_subscribers.TryRemove(subscriber.Id, out _))
{
subscriber.Channel.Writer.TryComplete(new SessionManagerException(
SessionManagerErrorCode.EventQueueOverflow,
$"Session {_sessionId} event stream queue overflowed."));
}
}
// Counts external (non-internal) subscribers. Drives the isOnlySubscriber FailFast
// decision so the gateway-owned internal dashboard subscriber never inflates the count.
private int CountExternalSubscribers()
{
int count = 0;
foreach (Subscriber subscriber in _subscribers.Values)
{
if (!subscriber.IsInternal)
{
count++;
}
}
return count;
}
private void CompleteAllSubscribers(Exception? error)
{
foreach (Subscriber subscriber in _subscribers.Values)
{
subscriber.Channel.Writer.TryComplete(error);
}
}
private void Unregister(Subscriber subscriber)
{
if (_subscribers.TryRemove(subscriber.Id, out _))
{
subscriber.Channel.Writer.TryComplete();
}
}
/// <summary>
/// Returns the retained events with <see cref="MxEvent.WorkerSequence"/> strictly
/// greater than <paramref name="afterSequence"/>, in ascending sequence order, so a
/// reconnecting or reattaching subscriber can replay what it missed.
/// </summary>
/// <param name="afterSequence">
/// The last worker sequence the caller already observed. Only events newer than this
/// are returned.
/// </param>
/// <param name="events">
/// The retained events newer than <paramref name="afterSequence"/>, in order. Never
/// null; empty when nothing newer is retained.
/// </param>
/// <param name="gap">
/// <see langword="true"/> when events between <paramref name="afterSequence"/> and the
/// oldest retained event were already evicted (by capacity or age), meaning the caller
/// missed events that can no longer be replayed and must re-snapshot. When
/// <see langword="true"/>, whatever IS still retained is still returned via
/// <paramref name="events"/>.
/// </param>
/// <returns>
/// Always <see langword="true"/>; the out parameters fully describe the result. The
/// return value exists so callers can use the conventional <c>Try</c>-method call
/// shape, and it leaves room for future extension.
/// </returns>
/// <remarks>
/// <para>Gap semantics, by buffer state:</para>
/// <list type="bullet">
/// <item>
/// Buffer non-empty: <paramref name="gap"/> is <see langword="true"/> iff
/// <paramref name="afterSequence"/> is below the oldest retained sequence minus
/// one (i.e. at least one event newer than <paramref name="afterSequence"/> but
/// older than the oldest retained event was evicted). When
/// <paramref name="afterSequence"/> equals or exceeds the newest retained
/// sequence, the caller is fully caught up: empty list, no gap.
/// </item>
/// <item>
/// Buffer empty (retention disabled, nothing seen yet, or everything evicted):
/// empty list, and <paramref name="gap"/> is <see langword="true"/> iff
/// <paramref name="afterSequence"/> is below the highest sequence ever seen —
/// i.e. the caller is behind but nothing is retained to replay. If no event has
/// ever been seen, or the caller is already at/ahead of the highest seen, there
/// is nothing to miss: no gap.
/// </item>
/// </list>
/// </remarks>
public bool TryGetReplayFrom(ulong afterSequence, out IReadOnlyList<MxEvent> events, out bool gap)
{
lock (_replayLock)
{
EvictAged();
if (_replayBuffer.Count == 0)
{
events = [];
// Nothing retained. The caller missed events only if it is behind the
// highest sequence ever seen (and we have seen at least one event).
gap = _anyEventSeen && afterSequence < _highestSequenceSeen;
return true;
}
ulong oldestRetained = _replayBuffer.First!.Value.Event.WorkerSequence;
// A gap exists when at least one event newer than afterSequence was evicted,
// i.e. afterSequence sits below the oldest-retained-minus-one boundary.
// Written as (oldestRetained > 0 && afterSequence < oldestRetained - 1) to
// avoid wrapping when afterSequence == ulong.MaxValue (afterSequence + 1
// would overflow to 0, falsely reporting a gap).
gap = oldestRetained > 0 && afterSequence < oldestRetained - 1;
// O(n) scan over the retained buffer — acceptable because TryGetReplayFrom
// is only called on subscriber reconnect, never on the hot fan-out path.
List<MxEvent> newer = [];
foreach (ReplayEntry entry in _replayBuffer)
{
if (entry.Event.WorkerSequence > afterSequence)
{
newer.Add(entry.Event);
}
}
events = newer;
return true;
}
}
private void AppendToReplayBuffer(MxEvent mxEvent)
{
lock (_replayLock)
{
_anyEventSeen = true;
if (mxEvent.WorkerSequence > _highestSequenceSeen)
{
_highestSequenceSeen = mxEvent.WorkerSequence;
}
// Capacity 0 disables retention: track the highest-seen sequence (so replay
// can still report a gap) but keep no events.
if (_replayBufferCapacity == 0)
{
return;
}
_replayBuffer.AddLast(new ReplayEntry(mxEvent, _timeProvider.GetUtcNow()));
// Capacity eviction: drop oldest until within bound.
while (_replayBuffer.Count > _replayBufferCapacity)
{
_replayBuffer.RemoveFirst();
}
EvictAged();
}
}
// Must be called under _replayLock. Drops entries older than the retention window.
private void EvictAged()
{
if (!_ageEvictionEnabled || _replayBuffer.Count == 0)
{
return;
}
DateTimeOffset cutoff = _timeProvider.GetUtcNow() - _replayRetention;
while (_replayBuffer.First is { } first && first.Value.RetainedAt < cutoff)
{
_replayBuffer.RemoveFirst();
}
}
private readonly record struct ReplayEntry(MxEvent Event, DateTimeOffset RetainedAt);
private sealed class Subscriber(long id, Channel<MxEvent> channel, bool isInternal)
{
public long Id { get; } = id;
public Channel<MxEvent> Channel { get; } = channel;
// True for the gateway-owned internal dashboard subscriber. Excluded from the
// single-subscriber overflow accounting so it cannot fault the session.
public bool IsInternal { get; } = isInternal;
}
private sealed class SubscriberLease(SessionEventDistributor distributor, Subscriber subscriber)
: IEventSubscriberLease
{
private int _leaseDisposed;
public ChannelReader<MxEvent> Reader => subscriber.Channel.Reader;
public void Dispose()
{
// Atomic check-and-set so concurrent Dispose calls unregister at most once.
if (Interlocked.Exchange(ref _leaseDisposed, 1) == 0)
{
distributor.Unregister(subscriber);
}
}
}
}
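// Illustrative sketch (hypothetical helper, not part of the gateway): the replay gap rule
// documented on TryGetReplayFrom, reduced to a pure function so that both buffer states and
// the overflow-safe comparison can be checked in isolation. ReplayGapSketch, HasGap, and the
// nullable "oldestRetained" parameter are assumptions made for this example. The real method
// reads these values from its ring buffer under _replayLock and also returns the retained events.
internal static class ReplayGapSketch
{
    // afterSequence: last worker sequence the caller already observed.
    // oldestRetained: WorkerSequence of the oldest buffered event, or null when the buffer is empty.
    // anyEventSeen / highestSequenceSeen: bookkeeping that is kept even when retention is disabled.
    public static bool HasGap(
        ulong afterSequence,
        ulong? oldestRetained,
        bool anyEventSeen,
        ulong highestSequenceSeen)
    {
        if (oldestRetained is not ulong oldest)
        {
            // Empty buffer: the caller missed events only if it is behind the highest
            // sequence ever seen, and at least one event has been seen.
            return anyEventSeen && afterSequence < highestSequenceSeen;
        }

        // Non-empty buffer: a gap means some event in (afterSequence, oldest) was evicted.
        // Comparing against oldest - 1, guarded by oldest > 0, avoids the wrap that
        // afterSequence + 1 would hit when afterSequence == ulong.MaxValue.
        return oldest > 0 && afterSequence < oldest - 1;
    }
}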
@@ -0,0 +1,60 @@
using Microsoft.Extensions.Logging.Abstractions;
using ZB.MOM.WW.MxGateway.Server.Configuration;
using ZB.MOM.WW.MxGateway.Server.Dashboard.Hubs;
using ZB.MOM.WW.MxGateway.Server.Grpc;
using ZB.MOM.WW.MxGateway.Server.Metrics;
namespace ZB.MOM.WW.MxGateway.Server.Sessions;
/// <summary>
/// Dependencies a <see cref="GatewaySession"/> needs to construct and own its
/// <see cref="SessionEventDistributor"/>. Bundled so the session constructor stays a
/// single optional parameter rather than one parameter per dependency, and so unit tests that build a session
/// directly get a working distributor from <see cref="Default"/> without wiring DI.
/// </summary>
/// <param name="Mapper">
/// Maps worker IPC <c>WorkerEvent</c> frames to public <c>MxEvent</c>s. The distributor
/// pump applies this once per event in worker order, mirroring the mapping
/// <c>EventStreamService.ProduceEventsAsync</c> used before Task 4.
/// </param>
/// <param name="EventOptions">
/// Supplies the distributor's per-subscriber queue capacity and replay ring-buffer
/// bounds (<see cref="EventOptions.QueueCapacity"/>,
/// <see cref="EventOptions.ReplayBufferCapacity"/>,
/// <see cref="EventOptions.ReplayRetentionSeconds"/>).
/// </param>
/// <param name="DistributorLogger">Logger for the distributor pump lifecycle.</param>
/// <param name="TimeProvider">Clock used to timestamp and age-evict replay entries.</param>
/// <param name="Metrics">
/// Gateway metrics sink used by the session's per-subscriber overflow handler to record
/// the queue-overflow counter and, for legacy single-subscriber FailFast, the session
/// fault. Carrying it here keeps the distributor decoupled from the metrics type while
/// preserving the observability the pre-epic per-RPC overflow path emitted.
/// </param>
/// <param name="DashboardBroadcaster">
/// Sink to which the session's internal dashboard mirror loop (Task 6) publishes raw
/// session <c>MxEvent</c>s. When non-null, the session registers an internal distributor
/// subscriber on becoming Ready and mirrors every fanned event to the dashboard
/// EventsHub group regardless of whether a gRPC client is streaming. When null
/// (unit tests that don't exercise the dashboard mirror) no mirror is started.
/// </param>
public sealed record SessionEventStreaming(
MxAccessGrpcMapper Mapper,
EventOptions EventOptions,
ILogger<SessionEventDistributor> DistributorLogger,
TimeProvider TimeProvider,
GatewayMetrics Metrics,
IDashboardEventBroadcaster? DashboardBroadcaster = null)
{
/// <summary>
/// Defaults used when a session is constructed without explicit streaming
/// dependencies (unit tests). Uses a fresh mapper, default event options, a no-op
/// logger, the system clock, a fresh metrics sink, and no dashboard mirror.
/// </summary>
public static SessionEventStreaming Default { get; } = new(
new MxAccessGrpcMapper(),
new EventOptions(),
NullLogger<SessionEventDistributor>.Instance,
TimeProvider.System,
new GatewayMetrics());
}
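// Illustrative sketch (hypothetical, not the session's actual overflow handler): one way a
// handler that receives (isOnlySubscriber, isInternal) from SessionEventDistributor's
// OnSubscriberOverflow could choose its side effects under the rules described in the
// distributor comments. The distributor always disconnects the overflowing subscriber
// itself; this only decides whether the session is faulted too. OverflowPolicySketch,
// Policy, Outcome, and Decide are names invented for this example; Policy stands in for
// EventBackpressurePolicy.
internal static class OverflowPolicySketch
{
    public enum Policy
    {
        FailFast,
        DisconnectSubscriber,
    }

    public enum Outcome
    {
        // Only the overflowing subscriber is disconnected; the session stays Ready.
        DisconnectOnly,

        // Legacy single-subscriber FailFast: the session is faulted as well.
        DisconnectAndFaultSession,
    }

    public static Outcome Decide(Policy policy, bool isOnlySubscriber, bool isInternal)
    {
        // The internal dashboard mirror never faults the session. The distributor already
        // passes isOnlySubscriber == false for it; this check documents that invariant.
        if (isInternal)
        {
            return Outcome.DisconnectOnly;
        }

        // FailFast faults the session only when the overflowing subscriber is the sole
        // external one; with other healthy subscribers it degrades to a plain disconnect.
        return policy == Policy.FailFast && isOnlySubscriber
            ? Outcome.DisconnectAndFaultSession
            : Outcome.DisconnectOnly;
    }
}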
@@ -25,6 +25,9 @@ public sealed class SessionManager : ISessionManager
private readonly ILogger<SessionManager> _logger;
private readonly GatewayOptions _options;
private readonly SemaphoreSlim _sessionSlots;
private readonly Grpc.MxAccessGrpcMapper _eventMapper;
private readonly ILogger<SessionEventDistributor> _distributorLogger;
private readonly Dashboard.Hubs.IDashboardEventBroadcaster? _dashboardEventBroadcaster;
/// <summary>
/// Initializes a new instance of <see cref="SessionManager"/>.
@@ -35,13 +38,24 @@ public sealed class SessionManager : ISessionManager
/// <param name="metrics">Gateway metrics.</param>
/// <param name="timeProvider">Time provider for timestamps.</param>
/// <param name="logger">Logger.</param>
/// <param name="eventMapper">Mapper used by each session's event distributor to map worker events to public events.</param>
/// <param name="distributorLogger">Logger passed to each session's event distributor pump.</param>
/// <param name="dashboardEventBroadcaster">
/// Dashboard SignalR fan-out sink. Each session registers an internal distributor
/// subscriber (Task 6) that mirrors raw session events to this broadcaster, so the
/// dashboard receives events regardless of whether a gRPC client is streaming. Null in
/// unit tests that do not exercise the dashboard mirror.
/// </param>
public SessionManager(
ISessionRegistry registry,
ISessionWorkerClientFactory workerClientFactory,
IOptions<GatewayOptions> options,
GatewayMetrics metrics,
TimeProvider? timeProvider = null,
ILogger<SessionManager>? logger = null)
ILogger<SessionManager>? logger = null,
Grpc.MxAccessGrpcMapper? eventMapper = null,
ILogger<SessionEventDistributor>? distributorLogger = null,
Dashboard.Hubs.IDashboardEventBroadcaster? dashboardEventBroadcaster = null)
{
_registry = registry ?? throw new ArgumentNullException(nameof(registry));
_workerClientFactory = workerClientFactory ?? throw new ArgumentNullException(nameof(workerClientFactory));
@@ -49,6 +63,9 @@ public sealed class SessionManager : ISessionManager
_metrics = metrics ?? throw new ArgumentNullException(nameof(metrics));
_timeProvider = timeProvider ?? TimeProvider.System;
_logger = logger ?? NullLogger<SessionManager>.Instance;
_eventMapper = eventMapper ?? new Grpc.MxAccessGrpcMapper();
_distributorLogger = distributorLogger ?? NullLogger<SessionEventDistributor>.Instance;
_dashboardEventBroadcaster = dashboardEventBroadcaster;
_options = options.Value;
_sessionSlots = new SemaphoreSlim(_options.Sessions.MaxSessions, _options.Sessions.MaxSessions);
}
@@ -58,11 +75,13 @@ public sealed class SessionManager : ISessionManager
/// </summary>
/// <param name="request">Session open request.</param>
/// <param name="clientIdentity">Client authentication identity.</param>
/// <param name="ownerKeyId">API key identifier of the caller creating the session.</param>
/// <param name="cancellationToken">Cancellation token.</param>
/// <returns>Opened gateway session.</returns>
public async Task<GatewaySession> OpenSessionAsync(
SessionOpenRequest request,
string? clientIdentity,
string? ownerKeyId,
CancellationToken cancellationToken)
{
ArgumentNullException.ThrowIfNull(request);
@@ -72,7 +91,7 @@ public sealed class SessionManager : ISessionManager
bool sessionOpenedRecorded = false;
try
{
session = CreateSession(request, clientIdentity);
session = CreateSession(request, clientIdentity, ownerKeyId);
if (!_registry.TryAdd(session))
{
throw new SessionManagerException(
@@ -420,7 +439,8 @@ public sealed class SessionManager : ISessionManager
private GatewaySession CreateSession(
SessionOpenRequest request,
string? clientIdentity)
string? clientIdentity,
string? ownerKeyId)
{
string sessionId = CreateSessionId();
string backendName = string.IsNullOrWhiteSpace(request.RequestedBackend)
@@ -435,19 +455,29 @@ public sealed class SessionManager : ISessionManager
DateTimeOffset openedAt = _timeProvider.GetUtcNow();
string clientCorrelationId = CreateClientCorrelationId(request.ClientSessionName, sessionId);
SessionEventStreaming eventStreaming = new(
_eventMapper,
_options.Events,
_distributorLogger,
_timeProvider,
_metrics,
_dashboardEventBroadcaster);
return new GatewaySession(
sessionId,
backendName,
pipeName,
nonce,
clientIdentity,
ownerKeyId,
request.ClientSessionName,
clientCorrelationId,
commandTimeout,
startupTimeout,
shutdownTimeout,
leaseDuration,
openedAt);
openedAt,
eventStreaming);
}
private static string CreateClientCorrelationId(
@@ -46,11 +46,14 @@
"MaxPendingCommandsPerSession": 128,
"DefaultLeaseSeconds": 1800,
"LeaseSweepIntervalSeconds": 30,
"AllowMultipleEventSubscribers": false
"AllowMultipleEventSubscribers": false,
"MaxEventSubscribersPerSession": 8
},
"Events": {
"QueueCapacity": 10000,
"BackpressurePolicy": "FailFast"
"BackpressurePolicy": "FailFast",
"ReplayBufferCapacity": 1024,
"ReplayRetentionSeconds": 300
},
"Dashboard": {
"Enabled": true,
@@ -410,6 +410,7 @@ public sealed class AlarmFailoverEndToEndTests
public Task<GatewaySession> OpenSessionAsync(
SessionOpenRequest request,
string? clientIdentity,
string? ownerKeyId,
CancellationToken cancellationToken)
{
GatewaySession session = new(
@@ -711,6 +711,7 @@ public sealed class GatewayAlarmMonitorProviderModeTests
public Task<GatewaySession> OpenSessionAsync(
SessionOpenRequest request,
string? clientIdentity,
string? ownerKeyId,
CancellationToken cancellationToken)
{
GatewaySession session = new(
@@ -31,9 +31,11 @@ public sealed class GatewayOptionsTests
Assert.Equal(30, options.Sessions.DefaultCommandTimeoutSeconds);
Assert.Equal(64, options.Sessions.MaxSessions);
Assert.Equal(128, options.Sessions.MaxPendingCommandsPerSession);
Assert.Equal(1800, options.Sessions.DefaultLeaseSeconds);
Assert.Equal(30, options.Sessions.LeaseSweepIntervalSeconds);
Assert.False(options.Sessions.AllowMultipleEventSubscribers);
Assert.Equal(8, options.Sessions.MaxEventSubscribersPerSession);
Assert.Equal(10_000, options.Events.QueueCapacity);
Assert.Equal(EventBackpressurePolicy.FailFast, options.Events.BackpressurePolicy);
@@ -289,4 +289,71 @@ public sealed class GatewayOptionsValidatorTests
Assert.True(result.Failed);
Assert.Contains(result.Failures!, f => f.Contains(keyPart));
}
// -------------------------------------------------------------------------
// AllowMultipleEventSubscribers / MaxEventSubscribersPerSession validation
// -------------------------------------------------------------------------
private static GatewayOptions CloneWithSessions(GatewayOptions source, SessionOptions sessions)
=> new()
{
Authentication = source.Authentication,
Ldap = source.Ldap,
Worker = source.Worker,
Sessions = sessions,
Events = source.Events,
Dashboard = source.Dashboard,
Protocol = source.Protocol,
Alarms = source.Alarms,
Tls = source.Tls,
};
[Fact]
public void Validate_Succeeds_WhenAllowMultipleEventSubscribersIsTrue()
{
// AllowMultipleEventSubscribers=true must now validate cleanly (no longer rejected).
GatewayOptions options = CloneWithSessions(
ValidOptions(),
new SessionOptions { AllowMultipleEventSubscribers = true });
ValidateOptionsResult result = new GatewayOptionsValidator().Validate(null, options);
Assert.True(result.Succeeded);
}
[Theory]
[InlineData(0)]
[InlineData(-1)]
public void Validate_Fails_WhenMaxEventSubscribersPerSessionBelowOne(int value)
{
GatewayOptions options = CloneWithSessions(
ValidOptions(),
new SessionOptions { MaxEventSubscribersPerSession = value });
ValidateOptionsResult result = new GatewayOptionsValidator().Validate(null, options);
Assert.True(result.Failed);
Assert.Contains(
result.Failures!,
f => f.Contains("MxGateway:Sessions:MaxEventSubscribersPerSession"));
}
[Theory]
[InlineData(1)]
[InlineData(8)]
[InlineData(32)]
public void Validate_Succeeds_WhenMaxEventSubscribersPerSessionIsPositive(int value)
{
GatewayOptions options = CloneWithSessions(
ValidOptions(),
new SessionOptions { MaxEventSubscribersPerSession = value });
ValidateOptionsResult result = new GatewayOptionsValidator().Validate(null, options);
Assert.True(result.Succeeded);
}
[Fact]
public void Validate_Succeeds_WithDefaultSessionOptions()
{
// Default SessionOptions (AllowMultipleEventSubscribers=false, MaxEventSubscribersPerSession=8)
// must validate cleanly.
GatewayOptions options = CloneWithSessions(ValidOptions(), new SessionOptions());
ValidateOptionsResult result = new GatewayOptionsValidator().Validate(null, options);
Assert.True(result.Succeeded);
}
}
@@ -1,6 +1,5 @@
using System.Diagnostics;
using Grpc.Core;
using Microsoft.Extensions.Logging.Abstractions;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Server.Dashboard;
using ZB.MOM.WW.MxGateway.Server.Galaxy;
@@ -266,8 +265,7 @@ public sealed class GalaxyFilterInputSafetyTests
new ZB.MOM.WW.MxGateway.Server.Galaxy.GalaxyRepository(options),
new StubGalaxyHierarchyCache(entry),
new GalaxyDeployNotifier(),
new GatewayRequestIdentityAccessor(),
NullLogger<GalaxyRepositoryGrpcService>.Instance);
new GatewayRequestIdentityAccessor());
}
private static GalaxyHierarchyCacheEntry CreateEntry(IReadOnlyList<GalaxyObject> objects)
@@ -244,6 +244,7 @@ public sealed class DashboardSessionAdminServiceTests
public Task<GatewaySession> OpenSessionAsync(
SessionOpenRequest request,
string? clientIdentity,
string? ownerKeyId,
CancellationToken cancellationToken)
{
throw new NotSupportedException();
@@ -94,6 +94,103 @@ public sealed class GatewayEndToEndFakeWorkerSmokeTests
launcher.CommandKinds);
}
/// <summary>
/// Verifies that the gateway forwards control commands (Ping, GetWorkerInfo, DrainEvents)
/// through the full gRPC→WorkerClient→pipe roundtrip when the fake worker responds
/// with canned replies via RespondToControlCommandAsync.
/// </summary>
[Fact]
public async Task GatewayService_WithFakeWorker_ControlCommandsRoundtripThroughGateway()
{
ControlCommandFakeWorkerProcessLauncher launcher = new();
await using GatewayServiceFixture fixture = new(launcher);
OpenSessionReply openReply = await fixture.Service.OpenSession(
new OpenSessionRequest
{
ClientSessionName = "control-cmd-e2e",
ClientCorrelationId = "control-open-correlation",
CommandTimeout = Duration.FromTimeSpan(TestTimeout),
},
new TestServerCallContext());
Assert.Equal(ProtocolStatusCode.Ok, openReply.ProtocolStatus.Code);
string sessionId = openReply.SessionId;
// Ping — the scripted worker echoes back the message.
Task<MxCommandReply> pingTask = fixture.Service.Invoke(
new MxCommandRequest
{
SessionId = sessionId,
ClientCorrelationId = "ping-correlation",
Command = new MxCommand
{
Kind = MxCommandKind.Ping,
Ping = new PingCommand { Message = "e2e-ping" },
},
},
new TestServerCallContext());
await launcher.WaitForNextControlCommandAsync(TestTimeout);
MxCommandReply pingReply = await pingTask.WaitAsync(TestTimeout);
Assert.Equal(ProtocolStatusCode.Ok, pingReply.ProtocolStatus.Code);
Assert.Equal(MxCommandKind.Ping, pingReply.Kind);
Assert.Equal("e2e-ping", pingReply.DiagnosticMessage);
// GetWorkerInfo — the scripted worker returns canned info.
Task<MxCommandReply> infoTask = fixture.Service.Invoke(
new MxCommandRequest
{
SessionId = sessionId,
ClientCorrelationId = "info-correlation",
Command = new MxCommand
{
Kind = MxCommandKind.GetWorkerInfo,
GetWorkerInfo = new GetWorkerInfoCommand(),
},
},
new TestServerCallContext());
await launcher.WaitForNextControlCommandAsync(TestTimeout);
MxCommandReply infoReply = await infoTask.WaitAsync(TestTimeout);
Assert.Equal(ProtocolStatusCode.Ok, infoReply.ProtocolStatus.Code);
Assert.Equal(MxCommandKind.GetWorkerInfo, infoReply.Kind);
Assert.NotNull(infoReply.WorkerInfo);
Assert.Equal(FakeWorkerHarness.DefaultWorkerProcessId, infoReply.WorkerInfo.WorkerProcessId);
Assert.False(string.IsNullOrEmpty(infoReply.WorkerInfo.MxaccessProgid));
// DrainEvents — the scripted worker returns an empty drain reply.
Task<MxCommandReply> drainTask = fixture.Service.Invoke(
new MxCommandRequest
{
SessionId = sessionId,
ClientCorrelationId = "drain-correlation",
Command = new MxCommand
{
Kind = MxCommandKind.DrainEvents,
DrainEvents = new DrainEventsCommand { MaxEvents = 16 },
},
},
new TestServerCallContext());
await launcher.WaitForNextControlCommandAsync(TestTimeout);
MxCommandReply drainReply = await drainTask.WaitAsync(TestTimeout);
Assert.Equal(ProtocolStatusCode.Ok, drainReply.ProtocolStatus.Code);
Assert.Equal(MxCommandKind.DrainEvents, drainReply.Kind);
Assert.NotNull(drainReply.DrainEvents);
Assert.Empty(drainReply.DrainEvents.Events);
// Tear down cleanly.
await fixture.Service.CloseSession(
new CloseSessionRequest
{
SessionId = sessionId,
ClientCorrelationId = "control-close-correlation",
},
new TestServerCallContext());
await launcher.WorkerTask.WaitAsync(TestTimeout);
}
private static MxCommandRequest CreateRegisterRequest(string sessionId)
{
return new MxCommandRequest
@@ -171,15 +268,13 @@ public sealed class GatewayEndToEndFakeWorkerSmokeTests
workerClientFactory,
options,
_metrics,
logger: NullLogger<SessionManager>.Instance);
logger: NullLogger<SessionManager>.Instance,
dashboardEventBroadcaster: NullDashboardEventBroadcaster.Instance);
MxAccessGrpcMapper mapper = new();
EventStreamService eventStreamService = new(
sessionManager,
options,
mapper,
_metrics,
NullDashboardEventBroadcaster.Instance,
NullLogger<EventStreamService>.Instance);
_metrics);
Service = new MxAccessGatewayService(
sessionManager,
@@ -355,6 +450,89 @@ public sealed class GatewayEndToEndFakeWorkerSmokeTests
}
}
/// <summary>
/// A fake worker launcher whose scripted worker automatically responds to control
/// commands (Ping, GetWorkerInfo, DrainEvents) using <see cref="FakeWorkerHarness.RespondToControlCommandAsync"/>
/// and sends a shutdown ack when the gateway closes the session. Exposes
/// <see cref="WaitForNextControlCommandAsync"/> so the test can drive the interaction
/// one command at a time without races.
/// </summary>
private sealed class ControlCommandFakeWorkerProcessLauncher : IWorkerProcessLauncher
{
public const int ProcessId = 5590;
private readonly FakeWorkerProcess _process = new(ProcessId);
private readonly SemaphoreSlim _commandHandled = new(0);
/// <summary>Gets the task backing the scripted worker loop.</summary>
public Task WorkerTask { get; private set; } = Task.CompletedTask;
/// <inheritdoc />
public Task<WorkerProcessHandle> LaunchAsync(
WorkerProcessLaunchRequest request,
CancellationToken cancellationToken = default)
{
WorkerTask = RunWorkerAsync(request, cancellationToken);
return Task.FromResult(new WorkerProcessHandle(
_process,
new WorkerProcessCommandLine("fake-control-worker.exe", []),
DateTimeOffset.UtcNow));
}
/// <summary>Waits until the scripted worker has responded to one control command.</summary>
/// <param name="timeout">Maximum time to wait.</param>
public async Task WaitForNextControlCommandAsync(TimeSpan timeout)
{
using CancellationTokenSource cts = new(timeout);
await _commandHandled.WaitAsync(cts.Token).ConfigureAwait(false);
}
private async Task RunWorkerAsync(
WorkerProcessLaunchRequest request,
CancellationToken cancellationToken)
{
await using FakeWorkerHarness harness = await FakeWorkerHarness.ConnectToGatewayPipeAsync(
request.SessionId,
request.Nonce,
request.PipeName,
request.ProtocolVersion,
cancellationToken: cancellationToken).ConfigureAwait(false);
await harness.CompleteStartupAsync(ProcessId, cancellationToken: cancellationToken).ConfigureAwait(false);
while (!cancellationToken.IsCancellationRequested)
{
WorkerEnvelope envelope = await harness
.ReadGatewayEnvelopeAsync(cancellationToken)
.ConfigureAwait(false);
if (envelope.BodyCase == WorkerEnvelope.BodyOneofCase.WorkerShutdown)
{
await harness.SendShutdownAckAsync(cancellationToken: cancellationToken).ConfigureAwait(false);
_process.MarkExited(0);
return;
}
if (envelope.BodyCase == WorkerEnvelope.BodyOneofCase.WorkerCommand)
{
MxCommandKind kind = envelope.WorkerCommand?.Command?.Kind ?? MxCommandKind.Unspecified;
if (kind is MxCommandKind.Ping or MxCommandKind.GetSessionState
or MxCommandKind.GetWorkerInfo or MxCommandKind.DrainEvents
or MxCommandKind.ShutdownWorker)
{
await harness.RespondToControlCommandAsync(envelope, cancellationToken)
.ConfigureAwait(false);
_commandHandled.Release();
continue;
}
}
throw new InvalidOperationException(
$"ControlCommandFakeWorkerProcessLauncher received unexpected envelope {envelope.BodyCase}.");
}
}
}
private sealed class FakeWorkerProcess(int processId) : IWorkerProcess
{
private readonly TaskCompletionSource _exited = new(TaskCreationOptions.RunContinuationsAsynchronously);
@@ -9,7 +9,6 @@ using ZB.MOM.WW.MxGateway.Server.Grpc;
using ZB.MOM.WW.MxGateway.Server.Metrics;
using ZB.MOM.WW.MxGateway.Server.Sessions;
using ZB.MOM.WW.MxGateway.Server.Workers;
using ZB.MOM.WW.MxGateway.Tests.TestSupport;
namespace ZB.MOM.WW.MxGateway.Tests.Gateway.Grpc;
@@ -157,59 +156,99 @@ public sealed class EventStreamServiceTests
await WaitUntilAsync(() => metrics.GetSnapshot().GrpcEventStreamQueueDepth == 0);
}
/// <summary>Verifies that event queue overflow faults the session and reports the overflow metric.</summary>
/// <summary>
/// Re-targeted in Task 5: a per-subscriber channel overflow in the session's
/// <see cref="SessionEventDistributor"/> faults the whole session under the legacy
/// single-subscriber FailFast policy (the default, single-subscriber mode) and records
/// the overflow + fault metrics. The distributor completes this subscriber's channel
/// with the overflow fault, which surfaces here as the same
/// <see cref="SessionManagerErrorCode.EventQueueOverflow"/> the pre-epic per-RPC
/// overflow produced.
/// </summary>
[Fact]
public async Task StreamEventsAsync_WhenStreamQueueOverflows_FaultsSessionAndReportsOverflow()
{
FakeWorkerClient workerClient = new();
GatewaySession session = CreateReadySession(workerClient);
using GatewayMetrics metrics = new();
GatewaySession session = CreateReadySession(
workerClient,
queueCapacity: 1,
metrics: metrics,
backpressurePolicy: EventBackpressurePolicy.FailFast);
EventStreamService service = CreateService(
new FakeSessionManager(session),
metrics,
queueCapacity: 1);
workerClient.Events.Add(CreateWorkerEvent(sequence: 1, MxEventFamily.OnDataChange));
workerClient.Events.Add(CreateWorkerEvent(sequence: 2, MxEventFamily.OnDataChange));
workerClient.Events.Add(CreateWorkerEvent(sequence: 3, MxEventFamily.OnDataChange));
for (ulong sequence = 1; sequence <= 50; sequence++)
{
workerClient.Events.Add(CreateWorkerEvent(sequence, MxEventFamily.OnDataChange));
}
workerClient.CompleteAfterConfiguredEvents = true;
await using IAsyncEnumerator<MxEvent> subscriber = service
.StreamEventsAsync(CreateRequest(session.SessionId), CancellationToken.None)
.GetAsyncEnumerator();
Assert.True(await subscriber.MoveNextAsync().AsTask().WaitAsync(TestTimeout));
await WaitUntilAsync(() => session.State == SessionState.Faulted);
// The pump fans 50 events into a capacity-1 subscriber channel faster than this
// single reader drains, so one of the reads observes the terminal overflow fault.
SessionManagerException exception = await Assert.ThrowsAsync<SessionManagerException>(
async () => await subscriber.MoveNextAsync().AsTask().WaitAsync(TestTimeout));
async () =>
{
while (await subscriber.MoveNextAsync().AsTask().WaitAsync(TestTimeout))
{
}
});
Assert.Equal(SessionManagerErrorCode.EventQueueOverflow, exception.ErrorCode);
await WaitUntilAsync(() => session.State == SessionState.Faulted);
Assert.Equal(SessionState.Faulted, session.State);
Assert.Equal(1, metrics.GetSnapshot().QueueOverflows);
Assert.Equal(1, metrics.GetSnapshot().Faults);
GatewayMetricsSnapshot snapshot = metrics.GetSnapshot();
Assert.Equal(1, snapshot.QueueOverflows);
Assert.Equal(1, snapshot.Faults);
// The finally block in StreamEventsAsync calls StreamDisconnected("Detached") on the
// overflow+fault path too; pin it here so a regression removing that call is caught.
Assert.Equal(1, snapshot.StreamDisconnects);
}
/// <summary>Verifies that the disconnect backpressure policy disconnects the subscriber without faulting the session.</summary>
/// <summary>
/// Re-targeted in Task 5: under the DisconnectSubscriber policy a per-subscriber
/// channel overflow disconnects only that subscriber's stream (terminal
/// <see cref="SessionManagerErrorCode.EventQueueOverflow"/>) and records the overflow
/// metric, but leaves the session <see cref="SessionState.Ready"/> and records no
/// fault. The session, pump, and any other subscribers are unaffected.
/// </summary>
[Fact]
public async Task StreamEventsAsync_WhenStreamQueueOverflowsWithDisconnectPolicy_LeavesSessionReady()
{
FakeWorkerClient workerClient = new();
GatewaySession session = CreateReadySession(workerClient);
using GatewayMetrics metrics = new();
GatewaySession session = CreateReadySession(
workerClient,
queueCapacity: 1,
metrics: metrics,
backpressurePolicy: EventBackpressurePolicy.DisconnectSubscriber);
EventStreamService service = CreateService(
new FakeSessionManager(session),
metrics,
queueCapacity: 1,
backpressurePolicy: EventBackpressurePolicy.DisconnectSubscriber);
workerClient.Events.Add(CreateWorkerEvent(sequence: 1, MxEventFamily.OnDataChange));
workerClient.Events.Add(CreateWorkerEvent(sequence: 2, MxEventFamily.OnDataChange));
workerClient.Events.Add(CreateWorkerEvent(sequence: 3, MxEventFamily.OnDataChange));
for (ulong sequence = 1; sequence <= 50; sequence++)
{
workerClient.Events.Add(CreateWorkerEvent(sequence, MxEventFamily.OnDataChange));
}
workerClient.CompleteAfterConfiguredEvents = true;
await using IAsyncEnumerator<MxEvent> subscriber = service
.StreamEventsAsync(CreateRequest(session.SessionId), CancellationToken.None)
.GetAsyncEnumerator();
Assert.True(await subscriber.MoveNextAsync().AsTask().WaitAsync(TestTimeout));
SessionManagerException exception = await Assert.ThrowsAsync<SessionManagerException>(
async () => await subscriber.MoveNextAsync().AsTask().WaitAsync(TestTimeout));
async () =>
{
while (await subscriber.MoveNextAsync().AsTask().WaitAsync(TestTimeout))
{
}
});
Assert.Equal(SessionManagerErrorCode.EventQueueOverflow, exception.ErrorCode);
Assert.Equal(SessionState.Ready, session.State);
@@ -261,81 +300,11 @@ public sealed class EventStreamServiceTests
Assert.Equal(1, metrics.GetSnapshot().Faults);
}
/// <summary>
/// Tests-026 regression: <see cref="EventStreamService.StreamEventsAsync"/>
/// must mirror every yielded event to the
/// <see cref="ZB.MOM.WW.MxGateway.Server.Dashboard.Hubs.IDashboardEventBroadcaster"/>
/// seam (the only path that fans events out to dashboard SignalR clients).
/// A regression that silently dropped the <c>Publish</c> call — e.g. an
/// <c>if</c> accidentally added around it, or the broadcaster ctor
/// parameter being removed — would have produced no failing test before
/// this fixture existed. The recording fake captures every call and we
/// assert one publish per yielded event, with the correct session id and
/// preserved <c>WorkerSequence</c>.
/// </summary>
[Fact]
public async Task StreamEventsAsync_PublishesEachEventToDashboardBroadcaster()
{
FakeWorkerClient workerClient = new();
GatewaySession session = CreateReadySession(workerClient);
RecordingDashboardEventBroadcaster recordingBroadcaster = new();
EventStreamService service = CreateService(
new FakeSessionManager(session),
dashboardEventBroadcaster: recordingBroadcaster);
workerClient.Events.Add(CreateWorkerEvent(sequence: 7, MxEventFamily.OnDataChange));
workerClient.Events.Add(CreateWorkerEvent(sequence: 8, MxEventFamily.OnWriteComplete));
workerClient.CompleteAfterConfiguredEvents = true;
List<MxEvent> events = await CollectEventsAsync(service, session.SessionId);
Assert.Equal([7UL, 8UL], events.Select(mxEvent => mxEvent.WorkerSequence).ToArray());
IReadOnlyList<DashboardEventCapture> captures = recordingBroadcaster.Captures;
Assert.Equal(2, captures.Count);
Assert.All(captures, capture => Assert.Equal(session.SessionId, capture.SessionId));
Assert.Equal([7UL, 8UL], captures.Select(capture => capture.MxEvent.WorkerSequence).ToArray());
Assert.Equal(MxEventFamily.OnDataChange, captures[0].MxEvent.Family);
Assert.Equal(MxEventFamily.OnWriteComplete, captures[1].MxEvent.Family);
}
/// <summary>
/// Server-041 regression: <see cref="EventStreamService"/> must not
/// abort the gRPC stream when the dashboard broadcaster throws.
/// <c>IDashboardEventBroadcaster.Publish</c> is documented as
/// best-effort and never-throw, but the gRPC consumer cannot rely on
/// implementation discipline alone — the seam itself swallows the
/// fault and logs at debug, mirroring the broadcaster's own
/// continuation handler. Without the wrap, the producer loop would
/// surface the exception and the client would see a faulted stream
/// for a dashboard-mirror failure.
/// </summary>
[Fact]
public async Task StreamEventsAsync_WhenDashboardBroadcasterThrows_StillYieldsEventsAndDoesNotFaultSession()
{
FakeWorkerClient workerClient = new();
GatewaySession session = CreateReadySession(workerClient);
using GatewayMetrics metrics = new();
ThrowingDashboardEventBroadcaster throwingBroadcaster = new();
EventStreamService service = CreateService(
new FakeSessionManager(session),
metrics,
dashboardEventBroadcaster: throwingBroadcaster);
workerClient.Events.Add(CreateWorkerEvent(sequence: 1, MxEventFamily.OnDataChange));
workerClient.Events.Add(CreateWorkerEvent(sequence: 2, MxEventFamily.OnDataChange));
workerClient.CompleteAfterConfiguredEvents = true;
List<MxEvent> events = await CollectEventsAsync(service, session.SessionId);
Assert.Equal([1UL, 2UL], events.Select(mxEvent => mxEvent.WorkerSequence).ToArray());
Assert.Equal(2, throwingBroadcaster.PublishAttempts);
Assert.NotEqual(SessionState.Faulted, session.State);
}
private static EventStreamService CreateService(
FakeSessionManager sessionManager,
GatewayMetrics? metrics = null,
int queueCapacity = 8,
EventBackpressurePolicy backpressurePolicy = EventBackpressurePolicy.FailFast,
ZB.MOM.WW.MxGateway.Server.Dashboard.Hubs.IDashboardEventBroadcaster? dashboardEventBroadcaster = null)
EventBackpressurePolicy backpressurePolicy = EventBackpressurePolicy.FailFast)
{
return new EventStreamService(
sessionManager,
@@ -347,25 +316,7 @@ public sealed class EventStreamServiceTests
BackpressurePolicy = backpressurePolicy,
},
}),
new MxAccessGrpcMapper(),
metrics ?? new GatewayMetrics(),
dashboardEventBroadcaster ?? NullDashboardEventBroadcaster.Instance,
NullLogger<EventStreamService>.Instance);
}
private sealed class ThrowingDashboardEventBroadcaster : ZB.MOM.WW.MxGateway.Server.Dashboard.Hubs.IDashboardEventBroadcaster
{
/// <summary>Gets the count of publish attempts.</summary>
public int PublishAttempts { get; private set; }
/// <summary>Increments the attempt count and throws a simulated failure.</summary>
/// <param name="sessionId">The session identifier.</param>
/// <param name="mxEvent">The event to publish.</param>
public void Publish(string sessionId, MxEvent mxEvent)
{
PublishAttempts++;
throw new InvalidOperationException("simulated dashboard broadcaster failure");
}
metrics ?? new GatewayMetrics());
}
private static async Task<List<MxEvent>> CollectEventsAsync(
@@ -393,20 +344,39 @@ public sealed class EventStreamServiceTests
private static GatewaySession CreateReadySession(
FakeWorkerClient workerClient,
string sessionId = "session-events")
string sessionId = "session-events",
int queueCapacity = 8,
GatewayMetrics? metrics = null,
EventBackpressurePolicy backpressurePolicy = EventBackpressurePolicy.FailFast)
{
// The per-subscriber overflow policy now lives in the session's
// SessionEventDistributor, so the session must share the same metrics sink and
// backpressure policy the overflow assertions observe. queueCapacity flows into the
// distributor's per-subscriber channel bound, which is what overflows.
GatewaySession session = new(
sessionId,
GatewayContractInfo.DefaultBackendName,
"pipe",
"nonce",
"client",
ownerKeyId: null,
"client-session",
"client-correlation",
TimeSpan.FromSeconds(30),
TimeSpan.FromSeconds(30),
TimeSpan.FromSeconds(10),
DateTimeOffset.UtcNow);
TimeSpan.FromMinutes(30),
DateTimeOffset.UtcNow,
new SessionEventStreaming(
new MxAccessGrpcMapper(),
new EventOptions
{
QueueCapacity = queueCapacity,
BackpressurePolicy = backpressurePolicy,
},
NullLogger<SessionEventDistributor>.Instance,
TimeProvider.System,
metrics ?? new GatewayMetrics()));
session.AttachWorkerClient(workerClient);
session.MarkReady();
@@ -471,6 +441,7 @@ public sealed class EventStreamServiceTests
public Task<GatewaySession> OpenSessionAsync(
SessionOpenRequest request,
string? clientIdentity,
string? ownerKeyId,
CancellationToken cancellationToken)
{
return Task.FromResult(_sessions.Values.First());
@@ -1,5 +1,4 @@
using Grpc.Core;
using Microsoft.Extensions.Logging.Abstractions;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Server.Dashboard;
using ZB.MOM.WW.MxGateway.Server.Galaxy;
@@ -217,8 +216,7 @@ public sealed class GalaxyRepositoryGrpcServiceTests
new global::ZB.MOM.WW.MxGateway.Server.Galaxy.GalaxyRepository(options),
new StubGalaxyHierarchyCache(entry),
new GalaxyDeployNotifier(),
new GatewayRequestIdentityAccessor(),
NullLogger<GalaxyRepositoryGrpcService>.Instance);
new GatewayRequestIdentityAccessor());
}
private static GalaxyHierarchyCacheEntry CreateEntry(IReadOnlyList<GalaxyObject> objects)
@@ -366,8 +364,7 @@ public sealed class GalaxyRepositoryGrpcServiceTests
new global::ZB.MOM.WW.MxGateway.Server.Galaxy.GalaxyRepository(options),
new NeverLoadsHierarchyCache(),
new GalaxyDeployNotifier(),
new GatewayRequestIdentityAccessor(),
NullLogger<GalaxyRepositoryGrpcService>.Instance);
new GatewayRequestIdentityAccessor());
// No caller-supplied CT so WaitForCacheBootstrap exits via its 5s internal budget
// (instead of re-throwing OperationCanceledException from the caller's CT). The
@@ -448,8 +445,7 @@ public sealed class GalaxyRepositoryGrpcServiceTests
new global::ZB.MOM.WW.MxGateway.Server.Galaxy.GalaxyRepository(options),
new StubGalaxyHierarchyCache(CreateEntry(CreateFilterObjects())),
new GalaxyDeployNotifier(),
identityAccessor,
NullLogger<GalaxyRepositoryGrpcService>.Instance);
identityAccessor);
// Sanity: with no identity pushed, both Pump and Valve come back under Line3 (id=2).
BrowseChildrenReply unconstrained = await service.BrowseChildren(
@@ -548,6 +548,33 @@ public sealed class MxAccessGatewayServiceConstraintTests
Assert.Equal("42", enforcer.RecordedDenials[0].Target);
}
/// <summary>
/// End-to-end wiring (M-2): the per-request <c>ClientCorrelationId</c> must propagate
/// all the way through <c>Invoke</c> -> <c>ApplyConstraintsAsync</c> -> the unary write
/// enforce helper -> <c>RecordDenialAsync</c>, so the recorded denial carries the exact
/// id the client sent (including non-GUID trace ids used by Rust/Python/Java clients).
/// </summary>
[Fact]
public async Task Invoke_Write_WithDeniedHandle_ThreadsClientCorrelationIdIntoRecordedDenial()
{
const string CorrelationId = "rust-client-Write-7";
PredicateConstraintEnforcer enforcer = new()
{
DenyWriteHandle = (serverHandle, itemHandle) => serverHandle == 7 && itemHandle == 42,
};
FakeSessionManager sessionManager = CreateSessionManagerWithSeed();
MxAccessGatewayService service = CreateService(sessionManager, enforcer);
MxCommandRequest request = CreateWriteRequest(serverHandle: 7, itemHandle: 42);
request.ClientCorrelationId = CorrelationId;
await Assert.ThrowsAsync<RpcException>(
async () => await service.Invoke(request, new TestServerCallContext()));
Assert.Single(enforcer.RecordedDenials);
Assert.Equal(CorrelationId, enforcer.RecordedDenials[0].CorrelationId);
}
/// <summary>
/// Unary <c>WriteSecured</c> against a denied handle takes the same enforce path
/// and rejects identically — proving the four-arm switch in
@@ -857,10 +884,12 @@ public sealed class MxAccessGatewayServiceConstraintTests
/// <summary>Opens a test session asynchronously.</summary>
/// <param name="request">The session open request.</param>
/// <param name="clientIdentity">The client identity, if any.</param>
/// <param name="ownerKeyId">The API key identifier of the caller, if any.</param>
/// <param name="cancellationToken">Token to observe for cancellation.</param>
public Task<GatewaySession> OpenSessionAsync(
SessionOpenRequest request,
string? clientIdentity,
string? ownerKeyId,
CancellationToken cancellationToken) =>
Task.FromResult(seededSessions.Values.First());
@@ -45,6 +45,7 @@ public sealed class MxAccessGatewayServiceTests
Assert.Equal(ProtocolStatusCode.Ok, reply.ProtocolStatus.Code);
Assert.Contains("unary-invoke", reply.Capabilities);
Assert.Equal("Operator Key", sessionManager.LastClientIdentity);
Assert.Equal("operator01", sessionManager.LastOwnerKeyId);
Assert.Equal("operator-session", sessionManager.LastOpenRequest?.ClientSessionName);
}
@@ -508,6 +509,9 @@ public sealed class MxAccessGatewayServiceTests
/// <summary>The last client identity passed to OpenSessionAsync.</summary>
public string? LastClientIdentity { get; private set; }
/// <summary>The last owner key id passed to OpenSessionAsync.</summary>
public string? LastOwnerKeyId { get; private set; }
/// <summary>The last session ID passed to ReadEventsAsync.</summary>
public string? LastReadEventsSessionId { get; private set; }
@@ -545,10 +549,12 @@ public sealed class MxAccessGatewayServiceTests
public Task<GatewaySession> OpenSessionAsync(
SessionOpenRequest request,
string? clientIdentity,
string? ownerKeyId,
CancellationToken cancellationToken)
{
LastOpenRequest = request;
LastClientIdentity = clientIdentity;
LastOwnerKeyId = ownerKeyId;
return Task.FromResult(OpenSessionResult ?? CreateSession("session-1", processId: 1234));
}
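The dashboard-mirror tests that follow rely on a per-session distributor. In that design one pump drains the worker stream exactly once and copies each event into a bounded channel per subscriber. The dashboard is registered as an internal subscriber before the pump starts. Below is a minimal sketch of that shape, assuming `System.Threading.Channels` and a disconnect-on-overflow policy. `FanOutSketch<T>`, `Register`, `Start`, and `ExternalSubscriberCount` are hypothetical names, not the gateway's `SessionEventDistributor` API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical sketch (not the gateway's SessionEventDistributor): one pump task
// drains the source once and copies each item into every subscriber's bounded channel.
public sealed class FanOutSketch<T>
{
    private readonly ChannelReader<T> _source;
    private readonly int _capacity;
    private readonly object _gate = new();
    private readonly List<(Channel<T> Sink, bool IsInternal)> _subscribers = new();

    public FanOutSketch(ChannelReader<T> source, int capacity)
    {
        _source = source;
        _capacity = capacity;
    }

    // Internal subscribers (such as a dashboard mirror) are excluded, so they never
    // count against a single-external-subscriber guard.
    public int ExternalSubscriberCount
    {
        get { lock (_gate) { return _subscribers.Count(s => !s.IsInternal); } }
    }

    public ChannelReader<T> Register(bool isInternal = false)
    {
        Channel<T> sink = Channel.CreateBounded<T>(
            new BoundedChannelOptions(_capacity) { SingleReader = true, SingleWriter = true });
        lock (_gate) { _subscribers.Add((sink, isInternal)); }
        return sink.Reader;
    }

    // Register the internal subscriber before calling Start: a source that completes
    // immediately then still delivers every item to it instead of draining into nothing.
    public Task Start(CancellationToken cancellationToken) => Task.Run(async () =>
    {
        try
        {
            await foreach (T item in _source.ReadAllAsync(cancellationToken))
            {
                (Channel<T> Sink, bool IsInternal)[] snapshot;
                lock (_gate) { snapshot = _subscribers.ToArray(); }
                foreach (var subscriber in snapshot)
                {
                    // TryWrite returns false when this subscriber's channel is full; the
                    // sketch disconnects only that subscriber and keeps the pump running.
                    if (!subscriber.Sink.Writer.TryWrite(item))
                    {
                        subscriber.Sink.Writer.TryComplete(
                            new InvalidOperationException("subscriber queue overflow"));
                        lock (_gate) { _subscribers.Remove(subscriber); }
                    }
                }
            }
        }
        finally
        {
            // Source finished (or was cancelled): complete every remaining subscriber.
            lock (_gate)
            {
                foreach (var subscriber in _subscribers) subscriber.Sink.Writer.TryComplete();
            }
        }
    });
}
```

The pump snapshots the subscriber list under the lock before each fan-out. This keeps `Register` and overflow removal safe while events are being copied, at the cost of one array copy per event.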
@@ -0,0 +1,323 @@
using System.Runtime.CompilerServices;
using Microsoft.Extensions.Logging.Abstractions;
using Microsoft.Extensions.Options;
using ZB.MOM.WW.MxGateway.Contracts;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Server.Configuration;
using ZB.MOM.WW.MxGateway.Server.Dashboard.Hubs;
using ZB.MOM.WW.MxGateway.Server.Grpc;
using ZB.MOM.WW.MxGateway.Server.Metrics;
using ZB.MOM.WW.MxGateway.Server.Sessions;
using ZB.MOM.WW.MxGateway.Server.Workers;
using ZB.MOM.WW.MxGateway.Tests.TestSupport;
namespace ZB.MOM.WW.MxGateway.Tests.Gateway.Sessions;
/// <summary>
/// Task 6 regression tests for the internal dashboard mirror. The dashboard is a
/// first-class subscriber on the session's <see cref="SessionEventDistributor"/>, so it
/// receives session events whether or not a gRPC client is streaming — fixing the
/// "dark feed" where the dashboard only saw events while a gRPC client was actively
/// streaming (the inline per-RPC tap removed by this task).
/// </summary>
public sealed class GatewaySessionDashboardMirrorTests
{
private static readonly TimeSpan TestTimeout = TimeSpan.FromSeconds(5);
/// <summary>
/// The KEY bug-fix test: the dashboard broadcaster receives session events even when
/// NO gRPC <c>StreamEvents</c> subscriber is attached. The session is driven to Ready
/// with a fake worker emitting events; only the internal dashboard subscriber exists.
/// Before Task 6 the mirror lived inside the per-RPC gRPC loop, so with no gRPC
/// subscriber the dashboard saw nothing.
/// </summary>
[Fact]
public async Task DashboardMirror_ReceivesEvents_WithNoGrpcSubscriber()
{
FakeWorkerClient workerClient = new();
workerClient.Events.Add(CreateWorkerEvent(10, MxEventFamily.OnDataChange));
workerClient.Events.Add(CreateWorkerEvent(11, MxEventFamily.OnWriteComplete));
workerClient.CompleteAfterConfiguredEvents = true;
RecordingDashboardEventBroadcaster broadcaster = new();
await using GatewaySession session = CreateSession(workerClient, broadcaster);
session.AttachWorkerClient(workerClient);
// MarkReady starts the internal dashboard mirror; no gRPC subscriber is ever attached.
session.MarkReady();
await WaitUntilAsync(() => broadcaster.Captures.Count == 2);
IReadOnlyList<DashboardEventCapture> captures = broadcaster.Captures;
Assert.Equal(0, session.ActiveEventSubscriberCount);
Assert.Equal([10UL, 11UL], captures.Select(capture => capture.MxEvent.WorkerSequence).ToArray());
Assert.All(captures, capture => Assert.Equal(session.SessionId, capture.SessionId));
}
/// <summary>
/// A gRPC subscriber and the dashboard both receive every event concurrently. The
/// gRPC path is no longer the dashboard's source — both read independent leases fed by
/// the single distributor pump.
/// </summary>
[Fact]
public async Task DashboardMirror_AndGrpcSubscriber_BothReceiveEvents()
{
FakeWorkerClient workerClient = new();
workerClient.Events.Add(CreateWorkerEvent(1, MxEventFamily.OnDataChange));
workerClient.Events.Add(CreateWorkerEvent(2, MxEventFamily.OnDataChange));
workerClient.Events.Add(CreateWorkerEvent(3, MxEventFamily.OnWriteComplete));
workerClient.CompleteAfterConfiguredEvents = true;
RecordingDashboardEventBroadcaster broadcaster = new();
await using GatewaySession session = CreateSession(workerClient, broadcaster);
session.AttachWorkerClient(workerClient);
session.MarkReady();
EventStreamService service = new(
new SingleSessionManager(session),
Options.Create(new GatewayOptions { Events = new EventOptions { QueueCapacity = 8 } }),
new GatewayMetrics());
List<MxEvent> grpcEvents = [];
await foreach (MxEvent mxEvent in service
.StreamEventsAsync(new StreamEventsRequest { SessionId = session.SessionId }, CancellationToken.None)
.WithCancellation(CancellationToken.None))
{
grpcEvents.Add(mxEvent);
}
await WaitUntilAsync(() => broadcaster.Captures.Count == 3);
Assert.Equal([1UL, 2UL, 3UL], grpcEvents.Select(mxEvent => mxEvent.WorkerSequence).ToArray());
Assert.Equal([1UL, 2UL, 3UL], broadcaster.Captures.Select(capture => capture.MxEvent.WorkerSequence).ToArray());
}
/// <summary>
/// Task 4 hazard guard: starting the pump at Ready with a fast-completing worker stream
/// and zero subscribers used to drain into nothing and leave a later subscriber hanging.
/// Now the dashboard subscriber is registered BEFORE the pump starts, so even a worker
/// stream that completes immediately delivers every event to the dashboard with no hang.
/// </summary>
[Fact]
public async Task DashboardMirror_FastCompletingWorkerStream_DeliversAllEventsWithoutHang()
{
FakeWorkerClient workerClient = new();
workerClient.Events.Add(CreateWorkerEvent(1, MxEventFamily.OnDataChange));
workerClient.Events.Add(CreateWorkerEvent(2, MxEventFamily.OnDataChange));
workerClient.CompleteAfterConfiguredEvents = true;
RecordingDashboardEventBroadcaster broadcaster = new();
await using GatewaySession session = CreateSession(workerClient, broadcaster);
session.AttachWorkerClient(workerClient);
session.MarkReady();
await WaitUntilAsync(() => broadcaster.Captures.Count == 2);
Assert.Equal([1UL, 2UL], broadcaster.Captures.Select(capture => capture.MxEvent.WorkerSequence).ToArray());
}
/// <summary>
/// Publishing to the dashboard must also be never-throw at the seam: a throwing broadcaster
/// must neither fault the session nor stop the mirror from processing later events.
/// </summary>
[Fact]
public async Task DashboardMirror_WhenBroadcasterThrows_DoesNotFaultSessionAndKeepsMirroring()
{
FakeWorkerClient workerClient = new();
workerClient.Events.Add(CreateWorkerEvent(1, MxEventFamily.OnDataChange));
workerClient.Events.Add(CreateWorkerEvent(2, MxEventFamily.OnDataChange));
workerClient.CompleteAfterConfiguredEvents = true;
ThrowingDashboardEventBroadcaster broadcaster = new();
await using GatewaySession session = CreateSession(workerClient, broadcaster);
session.AttachWorkerClient(workerClient);
session.MarkReady();
await WaitUntilAsync(() => broadcaster.PublishAttempts == 2);
Assert.NotEqual(SessionState.Faulted, session.State);
}
/// <summary>
/// The internal dashboard subscriber must NOT count against the single-subscriber
/// guard: a gRPC subscriber can still attach while the dashboard mirror is running.
/// </summary>
[Fact]
public async Task DashboardMirror_DoesNotCountAgainstSingleSubscriberGuard()
{
FakeWorkerClient workerClient = new();
RecordingDashboardEventBroadcaster broadcaster = new();
await using GatewaySession session = CreateSession(workerClient, broadcaster);
session.AttachWorkerClient(workerClient);
session.MarkReady();
Assert.Equal(0, session.ActiveEventSubscriberCount);
using IEventSubscriberLease lease = session.AttachEventSubscriber(allowMultipleSubscribers: false);
Assert.Equal(1, session.ActiveEventSubscriberCount);
}
private static GatewaySession CreateSession(
IWorkerClient workerClient,
IDashboardEventBroadcaster broadcaster)
{
return new GatewaySession(
sessionId: "session-dashboard-mirror",
backendName: GatewayContractInfo.DefaultBackendName,
pipeName: "mxaccess-gateway-1-session-dashboard-mirror",
nonce: "nonce",
clientIdentity: "client-1",
ownerKeyId: null,
clientSessionName: "test-session",
clientCorrelationId: "client-correlation-1",
commandTimeout: TimeSpan.FromSeconds(5),
startupTimeout: TimeSpan.FromSeconds(5),
shutdownTimeout: TimeSpan.FromSeconds(5),
leaseDuration: TimeSpan.FromMinutes(30),
openedAt: DateTimeOffset.UtcNow,
eventStreaming: new SessionEventStreaming(
new MxAccessGrpcMapper(),
new EventOptions { QueueCapacity = 8 },
NullLogger<SessionEventDistributor>.Instance,
TimeProvider.System,
new GatewayMetrics(),
broadcaster));
}
private static WorkerEvent CreateWorkerEvent(ulong sequence, MxEventFamily family)
{
MxEvent mxEvent = new()
{
SessionId = "session-dashboard-mirror",
Family = family,
WorkerSequence = sequence,
};
switch (family)
{
case MxEventFamily.OnDataChange:
mxEvent.OnDataChange = new OnDataChangeEvent();
break;
case MxEventFamily.OnWriteComplete:
mxEvent.OnWriteComplete = new OnWriteCompleteEvent();
break;
}
return new WorkerEvent { Event = mxEvent };
}
private static async Task WaitUntilAsync(Func<bool> predicate, [CallerArgumentExpression(nameof(predicate))] string? condition = null)
{
using CancellationTokenSource cancellationTokenSource = new(TestTimeout);
try
{
while (!predicate())
{
await Task.Delay(TimeSpan.FromMilliseconds(10), cancellationTokenSource.Token);
}
}
catch (OperationCanceledException)
{
Assert.Fail($"Timed out after {TestTimeout.TotalSeconds}s waiting for: {condition}");
}
}
private sealed class ThrowingDashboardEventBroadcaster : IDashboardEventBroadcaster
{
private int _publishAttempts;
public int PublishAttempts => Volatile.Read(ref _publishAttempts);
public void Publish(string sessionId, MxEvent mxEvent)
{
Interlocked.Increment(ref _publishAttempts);
throw new InvalidOperationException("simulated dashboard broadcaster failure");
}
}
private sealed class SingleSessionManager(GatewaySession session) : ISessionManager
{
public Task<GatewaySession> OpenSessionAsync(
SessionOpenRequest request,
string? clientIdentity,
string? ownerKeyId,
CancellationToken cancellationToken) => Task.FromResult(session);
public bool TryGetSession(string sessionId, out GatewaySession gatewaySession)
{
gatewaySession = session;
return string.Equals(sessionId, session.SessionId, StringComparison.Ordinal);
}
public Task<WorkerCommandReply> InvokeAsync(
string sessionId,
WorkerCommand command,
CancellationToken cancellationToken) => Task.FromResult(new WorkerCommandReply());
public IAsyncEnumerable<WorkerEvent> ReadEventsAsync(
string sessionId,
CancellationToken cancellationToken) => session.ReadEventsAsync(cancellationToken);
public Task<SessionCloseResult> CloseSessionAsync(
string sessionId,
CancellationToken cancellationToken) =>
Task.FromResult(new SessionCloseResult(sessionId, SessionState.Closed, AlreadyClosed: false));
public Task<SessionCloseResult> KillWorkerAsync(
string sessionId,
string reason,
CancellationToken cancellationToken) =>
Task.FromResult(new SessionCloseResult(sessionId, SessionState.Closed, AlreadyClosed: false));
public Task<int> CloseExpiredLeasesAsync(
DateTimeOffset now,
CancellationToken cancellationToken) => Task.FromResult(0);
public Task ShutdownAsync(CancellationToken cancellationToken) => Task.CompletedTask;
}
private sealed class FakeWorkerClient : IWorkerClient
{
public List<WorkerEvent> Events { get; } = [];
public bool CompleteAfterConfiguredEvents { get; set; }
public string SessionId { get; } = "session-dashboard-mirror";
public int? ProcessId { get; } = 1234;
public WorkerClientState State { get; } = WorkerClientState.Ready;
public DateTimeOffset LastHeartbeatAt { get; } = DateTimeOffset.UtcNow;
public Task StartAsync(CancellationToken cancellationToken) => Task.CompletedTask;
public Task<WorkerCommandReply> InvokeAsync(
WorkerCommand command,
TimeSpan timeout,
CancellationToken cancellationToken) => Task.FromResult(new WorkerCommandReply());
public async IAsyncEnumerable<WorkerEvent> ReadEventsAsync(
[EnumeratorCancellation] CancellationToken cancellationToken)
{
foreach (WorkerEvent workerEvent in Events)
{
cancellationToken.ThrowIfCancellationRequested();
yield return workerEvent;
}
if (CompleteAfterConfiguredEvents)
{
yield break;
}
await Task.Delay(Timeout.InfiniteTimeSpan, cancellationToken);
}
public Task ShutdownAsync(TimeSpan timeout, CancellationToken cancellationToken) => Task.CompletedTask;
public void Kill(string reason)
{
}
public ValueTask DisposeAsync() => ValueTask.CompletedTask;
}
}
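The Issue-1 regression in the `GatewaySessionTests` hunk that follows races 16 `Dispose()` calls on one lease. It expects the active-subscriber count to drop exactly once. One common way to get that guarantee is an atomic flag swap with `Interlocked.Exchange`. The sketch below assumes that approach. `SubscriberCounter` and `CountedLeaseSketch` are hypothetical types, not the gateway's `IEventSubscriberLease` implementation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a session's active-subscriber count and its
// single-subscriber guard; not the gateway's GatewaySession.
public sealed class SubscriberCounter
{
    private int _active;

    public int Active => Volatile.Read(ref _active);

    public CountedLeaseSketch Attach(bool allowMultiple)
    {
        if (allowMultiple)
        {
            Interlocked.Increment(ref _active);
        }
        else if (Interlocked.CompareExchange(ref _active, 1, 0) != 0)
        {
            // Atomic 0 -> 1 transition; any nonzero current value rejects the attach.
            throw new InvalidOperationException("An event subscriber is already attached.");
        }

        return new CountedLeaseSketch(this);
    }

    internal void Detach() => Interlocked.Decrement(ref _active);
}

public sealed class CountedLeaseSketch : IDisposable
{
    private readonly SubscriberCounter _owner;
    private int _disposed; // 0 = live, 1 = disposed

    internal CountedLeaseSketch(SubscriberCounter owner) => _owner = owner;

    public void Dispose()
    {
        // Only the caller that flips the flag from 0 to 1 detaches; concurrent or
        // repeated calls observe 1 and return, so the count drops exactly once.
        if (Interlocked.Exchange(ref _disposed, 1) == 0)
        {
            _owner.Detach();
        }
    }
}
```

The flag lives on the lease rather than being inferred from the count. As a result, a second `Dispose` from a different code path (stream completion versus client cancellation) is a cheap no-op.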
@@ -1,5 +1,9 @@
using System.Runtime.CompilerServices;
using Microsoft.Extensions.Logging.Abstractions;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Server.Configuration;
using ZB.MOM.WW.MxGateway.Server.Grpc;
using ZB.MOM.WW.MxGateway.Server.Metrics;
using ZB.MOM.WW.MxGateway.Server.Sessions;
using ZB.MOM.WW.MxGateway.Server.Workers;
@@ -156,6 +160,66 @@ public sealed class GatewaySessionTests
await session.DisposeAsync();
}
/// <summary>
/// Issue-1 regression. Concurrent <c>Dispose()</c> calls on the same
/// <see cref="IEventSubscriberLease"/>, as can happen when a gRPC stream
/// completion and a client cancellation fire at the same time, must
/// decrement <c>_activeEventSubscriberCount</c> exactly once and never drive it
/// below zero. A negative count corrupts the single-subscriber guard, because
/// <c>AttachEventSubscriber(allowMultipleSubscribers:false)</c> gates on
/// <c>_activeEventSubscriberCount > 0</c> and that check then no longer reflects
/// whether a subscriber is actually attached. After all racing disposes finish,
/// the count must be exactly 0 and a subsequent single-subscriber attach must
/// succeed.
[Fact]
public async Task EventSubscriberLease_ConcurrentDispose_DecrementsCountExactlyOnce()
{
const int Concurrency = 16;
const int Iterations = 200;
TimeSpan testTimeout = TimeSpan.FromSeconds(10);
FakeWorkerClient workerClient = new();
GatewaySession session = CreateReadySessionWithEventStreaming(workerClient);
for (int i = 0; i < Iterations; i++)
{
// Attach one subscriber; this increments _activeEventSubscriberCount to 1.
IEventSubscriberLease lease = session.AttachEventSubscriber(
allowMultipleSubscribers: false);
// Race Concurrency threads all calling Dispose() on the same lease.
// Only one must actually run DetachEventSubscriber.
using SemaphoreSlim gate = new(0);
Task[] tasks = new Task[Concurrency];
for (int t = 0; t < Concurrency; t++)
{
tasks[t] = Task.Run(async () =>
{
// All threads wait at the gate so they start as simultaneously
// as the scheduler allows, maximising the race window.
await gate.WaitAsync(testTimeout);
lease.Dispose();
});
}
gate.Release(Concurrency);
await Task.WhenAll(tasks).WaitAsync(testTimeout);
// Count must be exactly 0 — not negative — after all disposes.
Assert.Equal(0, session.ActiveEventSubscriberCount);
// Observable contract: a fresh single subscriber must now be attachable
// (i.e., the guard _activeEventSubscriberCount > 0 is false).
IEventSubscriberLease next = session.AttachEventSubscriber(
allowMultipleSubscribers: false);
next.Dispose();
Assert.Equal(0, session.ActiveEventSubscriberCount);
}
await session.CloseAsync("test-done", CancellationToken.None);
await session.DisposeAsync();
}
private static GatewaySession CreateReadySession(IWorkerClient workerClient)
{
GatewaySession session = new(
@@ -164,6 +228,7 @@ public sealed class GatewaySessionTests
pipeName: "mxaccess-gateway-1-session-test",
nonce: "nonce",
clientIdentity: "client-1",
ownerKeyId: null,
clientSessionName: "test-session",
clientCorrelationId: "client-correlation-1",
commandTimeout: TimeSpan.FromSeconds(5),
@@ -176,6 +241,33 @@ public sealed class GatewaySessionTests
return session;
}
private static GatewaySession CreateReadySessionWithEventStreaming(IWorkerClient workerClient)
{
GatewaySession session = new(
sessionId: "session-test-concurrent",
backendName: "mxaccess",
pipeName: "mxaccess-gateway-1-session-test-concurrent",
nonce: "nonce",
clientIdentity: "client-1",
ownerKeyId: null,
clientSessionName: "test-session",
clientCorrelationId: "client-correlation-1",
commandTimeout: TimeSpan.FromSeconds(5),
startupTimeout: TimeSpan.FromSeconds(5),
shutdownTimeout: TimeSpan.FromSeconds(5),
leaseDuration: TimeSpan.FromMinutes(30),
openedAt: DateTimeOffset.UtcNow,
eventStreaming: new SessionEventStreaming(
new MxAccessGrpcMapper(),
new EventOptions { QueueCapacity = 8 },
NullLogger<SessionEventDistributor>.Instance,
TimeProvider.System,
new GatewayMetrics()));
session.AttachWorkerClient(workerClient);
session.MarkReady();
return session;
}
/// <summary>
/// Minimal worker client that parks <see cref="ShutdownAsync"/> until the test
/// explicitly releases it. Used to keep <see cref="GatewaySession.CloseAsync"/>
@@ -0,0 +1,569 @@
using System.Threading.Channels;
using Microsoft.Extensions.Logging.Abstractions;
using Microsoft.Extensions.Time.Testing;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Server.Sessions;
namespace ZB.MOM.WW.MxGateway.Tests.Gateway.Sessions;
/// <summary>
/// Concurrency and fan-out tests for <see cref="SessionEventDistributor"/>, the
/// Session Resilience epic's per-session event pump. One pump drains the source
/// exactly once and fans every event to N independent per-subscriber channels.
/// Every async wait is bounded so a fan-out or shutdown deadlock fails fast.
/// </summary>
public sealed class SessionEventDistributorTests
{
private static readonly TimeSpan ReadTimeout = TimeSpan.FromSeconds(5);
[Fact]
public async Task TwoSubscribers_BothReceiveFannedEventsInOrder()
{
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
await using SessionEventDistributor distributor = CreateDistributor(source.Reader);
await distributor.StartAsync(CancellationToken.None);
using IEventSubscriberLease leaseA = distributor.Register();
using IEventSubscriberLease leaseB = distributor.Register();
source.Writer.TryWrite(Event(1));
source.Writer.TryWrite(Event(2));
MxEvent a1 = await ReadOneAsync(leaseA.Reader);
MxEvent a2 = await ReadOneAsync(leaseA.Reader);
MxEvent b1 = await ReadOneAsync(leaseB.Reader);
MxEvent b2 = await ReadOneAsync(leaseB.Reader);
Assert.Equal(1ul, a1.WorkerSequence);
Assert.Equal(2ul, a2.WorkerSequence);
Assert.Equal(1ul, b1.WorkerSequence);
Assert.Equal(2ul, b2.WorkerSequence);
}
[Fact]
public async Task DisposingOneLease_StopsItsDelivery_OtherKeepsReceiving()
{
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
await using SessionEventDistributor distributor = CreateDistributor(source.Reader);
await distributor.StartAsync(CancellationToken.None);
IEventSubscriberLease leaseA = distributor.Register();
using IEventSubscriberLease leaseB = distributor.Register();
source.Writer.TryWrite(Event(1));
_ = await ReadOneAsync(leaseA.Reader);
_ = await ReadOneAsync(leaseB.Reader);
leaseA.Dispose();
// A's reader must complete (no more delivery) after dispose.
await AssertCompletedAsync(leaseA.Reader);
// B still receives subsequent events.
source.Writer.TryWrite(Event(2));
MxEvent b2 = await ReadOneAsync(leaseB.Reader);
Assert.Equal(2ul, b2.WorkerSequence);
}
[Fact]
public async Task SubscriberRegisteredAfterStart_ReceivesEventsEmittedAfterRegistration()
{
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
await using SessionEventDistributor distributor = CreateDistributor(source.Reader);
await distributor.StartAsync(CancellationToken.None);
using IEventSubscriberLease leaseA = distributor.Register();
source.Writer.TryWrite(Event(1));
_ = await ReadOneAsync(leaseA.Reader);
// Late subscriber: only sees events emitted after it registered.
using IEventSubscriberLease leaseB = distributor.Register();
source.Writer.TryWrite(Event(2));
MxEvent b = await ReadOneAsync(leaseB.Reader);
Assert.Equal(2ul, b.WorkerSequence);
}
[Fact]
public async Task DisposingDistributor_CompletesAllSubscriberChannels_AndStopsPump()
{
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
SessionEventDistributor distributor = CreateDistributor(source.Reader);
await distributor.StartAsync(CancellationToken.None);
using IEventSubscriberLease leaseA = distributor.Register();
using IEventSubscriberLease leaseB = distributor.Register();
// Bounded so a shutdown hang fails fast.
await distributor.DisposeAsync().AsTask().WaitAsync(ReadTimeout);
await AssertCompletedAsync(leaseA.Reader);
await AssertCompletedAsync(leaseB.Reader);
}
[Fact]
public async Task Register_AfterDispose_ThrowsObjectDisposedException()
{
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
SessionEventDistributor distributor = CreateDistributor(source.Reader);
await distributor.StartAsync(CancellationToken.None);
await distributor.DisposeAsync().AsTask().WaitAsync(ReadTimeout);
Assert.Throws<ObjectDisposedException>(() => distributor.Register());
}
[Fact]
public async Task ReplayBuffer_OverCapacity_EvictsOldestFirst_AndReportsGap()
{
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
await using SessionEventDistributor distributor = CreateDistributor(
source.Reader,
replayBufferCapacity: 3,
replayRetentionSeconds: 0);
await distributor.StartAsync(CancellationToken.None);
// A live subscriber forces the pump to fan (and thereby retain) each event,
// and gives us a deterministic point to know the pump has processed event 5.
using IEventSubscriberLease lease = distributor.Register();
for (ulong sequence = 1; sequence <= 5; sequence++)
{
source.Writer.TryWrite(Event(sequence));
}
for (ulong sequence = 1; sequence <= 5; sequence++)
{
MxEvent e = await ReadOneAsync(lease.Reader);
Assert.Equal(sequence, e.WorkerSequence);
}
// Capacity 3 retains only the newest three: sequences 3, 4, 5. Events 1 and 2
// were evicted, so a caller asking from 0 missed events => gap=true, and it
// gets only the retained tail.
bool found = distributor.TryGetReplayFrom(0, out IReadOnlyList<MxEvent> replay, out bool gap);
Assert.True(found);
Assert.True(gap);
Assert.Equal(new ulong[] { 3, 4, 5 }, replay.Select(e => e.WorkerSequence));
}
[Fact]
public async Task ReplayBuffer_WithinRetainedWindow_ReturnsNewerEvents_NoGap()
{
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
await using SessionEventDistributor distributor = CreateDistributor(
source.Reader,
replayBufferCapacity: 10,
replayRetentionSeconds: 0);
await distributor.StartAsync(CancellationToken.None);
using IEventSubscriberLease lease = distributor.Register();
for (ulong sequence = 1; sequence <= 5; sequence++)
{
source.Writer.TryWrite(Event(sequence));
_ = await ReadOneAsync(lease.Reader);
}
// afterSequence 2 is still inside the retained window [1..5], so no gap and
// exactly the newer events 3, 4, 5 come back.
bool found = distributor.TryGetReplayFrom(2, out IReadOnlyList<MxEvent> replay, out bool gap);
Assert.True(found);
Assert.False(gap);
Assert.Equal(new ulong[] { 3, 4, 5 }, replay.Select(e => e.WorkerSequence));
}
[Fact]
public async Task ReplayBuffer_AgedEntries_AreEvictedAfterRetentionElapses()
{
FakeTimeProvider time = new(new DateTimeOffset(2026, 1, 1, 0, 0, 0, TimeSpan.Zero));
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
await using SessionEventDistributor distributor = CreateDistributor(
source.Reader,
replayBufferCapacity: 100,
replayRetentionSeconds: 30,
timeProvider: time);
await distributor.StartAsync(CancellationToken.None);
using IEventSubscriberLease lease = distributor.Register();
// Two old events, then advance the clock well past the retention window.
source.Writer.TryWrite(Event(1));
source.Writer.TryWrite(Event(2));
_ = await ReadOneAsync(lease.Reader);
_ = await ReadOneAsync(lease.Reader);
time.Advance(TimeSpan.FromSeconds(60));
// A fresh event triggers age-eviction of the now-stale entries 1 and 2.
source.Writer.TryWrite(Event(3));
_ = await ReadOneAsync(lease.Reader);
bool found = distributor.TryGetReplayFrom(0, out IReadOnlyList<MxEvent> replay, out bool gap);
Assert.True(found);
// Events 1 and 2 aged out and only 3 remains. afterSequence 0 predates the oldest
// retained event (3), so the caller missed events => gap=true.
Assert.Equal(new ulong[] { 3 }, replay.Select(e => e.WorkerSequence));
Assert.True(gap);
}
[Fact]
public async Task ReplayBuffer_AfterSequenceNewerThanAllRetained_ReturnsEmpty_NoGap()
{
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
await using SessionEventDistributor distributor = CreateDistributor(
source.Reader,
replayBufferCapacity: 10,
replayRetentionSeconds: 0);
await distributor.StartAsync(CancellationToken.None);
using IEventSubscriberLease lease = distributor.Register();
for (ulong sequence = 1; sequence <= 3; sequence++)
{
source.Writer.TryWrite(Event(sequence));
_ = await ReadOneAsync(lease.Reader);
}
// afterSequence 3 is at/after the newest retained; nothing newer, and the
// caller is fully caught up => empty list, gap=false.
bool found = distributor.TryGetReplayFrom(3, out IReadOnlyList<MxEvent> replay, out bool gap);
Assert.True(found);
Assert.False(gap);
Assert.Empty(replay);
}
[Fact]
public async Task ReplayBuffer_Capacity0_AfterSequenceBelowHighestSeen_ReportsGap_NoEvents()
{
// Disabled buffer: events are tracked for the highest-seen counter but not
// retained. A caller behind the highest-seen sequence must be told to re-snapshot.
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
await using SessionEventDistributor distributor = CreateDistributor(
source.Reader,
replayBufferCapacity: 0,
replayRetentionSeconds: 0);
await distributor.StartAsync(CancellationToken.None);
using IEventSubscriberLease lease = distributor.Register();
for (ulong sequence = 1; sequence <= 3; sequence++)
{
source.Writer.TryWrite(Event(sequence));
_ = await ReadOneAsync(lease.Reader);
}
// afterSequence=1 is below highestSeen=3 — gap, nothing to replay.
bool found = distributor.TryGetReplayFrom(1, out IReadOnlyList<MxEvent> replay, out bool gap);
Assert.True(found);
Assert.True(gap);
Assert.Empty(replay);
}
[Fact]
public async Task ReplayBuffer_Capacity0_AfterSequenceAtOrAboveHighestSeen_NoGap_NoEvents()
{
// Disabled buffer: caller is already caught up — no gap, nothing to replay.
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
await using SessionEventDistributor distributor = CreateDistributor(
source.Reader,
replayBufferCapacity: 0,
replayRetentionSeconds: 0);
await distributor.StartAsync(CancellationToken.None);
using IEventSubscriberLease lease = distributor.Register();
for (ulong sequence = 1; sequence <= 3; sequence++)
{
source.Writer.TryWrite(Event(sequence));
_ = await ReadOneAsync(lease.Reader);
}
// afterSequence=3 equals highestSeen — caller is fully caught up.
bool found = distributor.TryGetReplayFrom(3, out IReadOnlyList<MxEvent> replay, out bool gap);
Assert.True(found);
Assert.False(gap);
Assert.Empty(replay);
}
[Fact]
public async Task ReplayBuffer_NoEventsSeen_AnyAfterSequence_NoGap_NoEvents()
{
// No events ever seen: nothing can have been missed, so gap must be false.
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
await using SessionEventDistributor distributor = CreateDistributor(
source.Reader,
replayBufferCapacity: 0,
replayRetentionSeconds: 0);
// Pump not started — no events arrive.
bool found = distributor.TryGetReplayFrom(0, out IReadOnlyList<MxEvent> replay, out bool gap);
Assert.True(found);
Assert.False(gap);
Assert.Empty(replay);
}
[Fact]
public async Task ReplayBuffer_AfterSequenceMaxValue_WithRetainedEvents_NoGap_NoNewEvents()
{
// ulong.MaxValue as afterSequence: afterSequence + 1 would wrap to 0, which the
// old code compared against oldestRetained, falsely reporting gap=true.
// The corrected formula must yield gap=false and an empty replay list.
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
await using SessionEventDistributor distributor = CreateDistributor(
source.Reader,
replayBufferCapacity: 10,
replayRetentionSeconds: 0);
await distributor.StartAsync(CancellationToken.None);
using IEventSubscriberLease lease = distributor.Register();
source.Writer.TryWrite(Event(1));
_ = await ReadOneAsync(lease.Reader);
bool found = distributor.TryGetReplayFrom(ulong.MaxValue, out IReadOnlyList<MxEvent> replay, out bool gap);
Assert.True(found);
Assert.False(gap);
Assert.Empty(replay);
}
[Fact]
public async Task SlowSubscriberOverflow_DisconnectsOnlyThatSubscriber_PumpAndOtherKeepRunning()
{
// Per-subscriber backpressure isolation (Task 5): one subscriber stops reading and
// overflows its own tiny channel; it is disconnected with an EventQueueOverflow fault
// while a second, healthy subscriber keeps receiving and the pump keeps pumping.
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
int overflowCalls = 0;
// Track the observed bool and a separate "set" flag as two locals so both can be
// accessed through Volatile.Read/Write (Volatile has no overload for bool?).
// Interlocked.Increment on the pump thread publishes overflowCalls;
// Volatile.Read/Write order the observedIsOnlySubscriber* accesses.
int observedIsOnlySubscriberSet = 0;
bool observedIsOnlySubscriberValue = false;
await using SessionEventDistributor distributor = new(
"session-test",
ct => source.Reader.ReadAllAsync(ct),
subscriberQueueCapacity: 2,
replayBufferCapacity: 1024,
replayRetentionSeconds: 0,
NullLogger<SessionEventDistributor>.Instance,
TimeProvider.System,
(isOnlySubscriber, _) =>
{
Interlocked.Increment(ref overflowCalls);
Volatile.Write(ref observedIsOnlySubscriberValue, isOnlySubscriber);
Volatile.Write(ref observedIsOnlySubscriberSet, 1);
});
await distributor.StartAsync(CancellationToken.None);
// Slow subscriber: registered but never read, so its capacity-2 channel fills.
using IEventSubscriberLease slow = distributor.Register();
// Healthy subscriber: drains promptly throughout.
using IEventSubscriberLease healthy = distributor.Register();
// Push more events than the slow subscriber's channel can hold while the healthy one
// keeps up. The slow channel overflows; the healthy channel does not.
for (ulong sequence = 1; sequence <= 10; sequence++)
{
source.Writer.TryWrite(Event(sequence));
MxEvent received = await ReadOneAsync(healthy.Reader);
Assert.Equal(sequence, received.WorkerSequence);
}
// The slow subscriber is disconnected with the overflow fault.
SessionManagerException fault = await Assert.ThrowsAsync<SessionManagerException>(
async () => await DrainUntilFaultAsync(slow.Reader));
Assert.Equal(SessionManagerErrorCode.EventQueueOverflow, fault.ErrorCode);
// Two subscribers were registered at overflow time, so isOnlySubscriber is false.
// Volatile.Read orders the test-thread reads after the pump-thread writes,
// avoiding a data race under the C# memory model.
Assert.Equal(1, Volatile.Read(ref overflowCalls));
Assert.Equal(1, Volatile.Read(ref observedIsOnlySubscriberSet));
Assert.False(Volatile.Read(ref observedIsOnlySubscriberValue));
Assert.Equal(1, distributor.SubscriberCount);
// The pump is still running and the healthy subscriber still receives new events.
source.Writer.TryWrite(Event(11));
MxEvent afterOverflow = await ReadOneAsync(healthy.Reader);
Assert.Equal(11ul, afterOverflow.WorkerSequence);
}
[Fact]
public async Task SlowSubscriberOverflow_WithMultipleSubscribers_HandlerSeesIsOnlySubscriberFalse_OtherKeepsReceiving()
{
// Distributor-level pin for "FailFast with multiple subscribers degrades to
// disconnect-only (no session fault)": when the overflowing subscriber is NOT the
// sole subscriber, isOnlySubscriber is false, so a FailFast-wired handler must NOT
// fault the session. This test drives the distributor directly (without GatewaySession)
// with two subscribers and a FailFast-style overflow handler seam, overflows the slow
// one, and asserts (a) isOnlySubscriber==false, (b) the other subscriber keeps
// receiving, and (c) the pump keeps running.
//
// TODO(Task 8): add a GatewaySession-level "session stays Ready" assertion once
// multi-subscriber config is enabled by the Tasks 7/8 validator/guard change.
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
bool handlerFiredWithFalse = false;
bool sessionFaultWouldBeCalled = false; // tracks whether the FailFast path would fault the session
await using SessionEventDistributor distributor = new(
"session-multi-sub",
ct => source.Reader.ReadAllAsync(ct),
subscriberQueueCapacity: 2,
replayBufferCapacity: 0,
replayRetentionSeconds: 0,
NullLogger<SessionEventDistributor>.Instance,
TimeProvider.System,
(isOnlySubscriber, _) =>
{
if (!isOnlySubscriber)
{
// Multi-subscriber: FailFast degrades to disconnect-only.
Volatile.Write(ref handlerFiredWithFalse, true);
}
else
{
// Single-subscriber: FailFast would fault the session — must not happen here.
Volatile.Write(ref sessionFaultWouldBeCalled, true);
}
});
await distributor.StartAsync(CancellationToken.None);
// Slow subscriber: never reads, so capacity-2 channel overflows quickly.
using IEventSubscriberLease slow = distributor.Register();
// Healthy subscriber: drains every event promptly.
using IEventSubscriberLease healthy = distributor.Register();
// Drive enough events to overflow the slow subscriber's channel.
for (ulong sequence = 1; sequence <= 10; sequence++)
{
source.Writer.TryWrite(Event(sequence));
_ = await ReadOneAsync(healthy.Reader);
}
// Slow subscriber is disconnected with the overflow fault.
SessionManagerException fault = await Assert.ThrowsAsync<SessionManagerException>(
async () => await DrainUntilFaultAsync(slow.Reader));
Assert.Equal(SessionManagerErrorCode.EventQueueOverflow, fault.ErrorCode);
// The handler saw isOnlySubscriber==false (multi-subscriber degradation path).
Assert.True(Volatile.Read(ref handlerFiredWithFalse));
// The FailFast session-fault branch was NOT taken (the distributor-level equivalent of the session staying Ready).
Assert.False(Volatile.Read(ref sessionFaultWouldBeCalled));
// The pump and healthy subscriber are unaffected.
source.Writer.TryWrite(Event(11));
MxEvent afterOverflow = await ReadOneAsync(healthy.Reader);
Assert.Equal(11ul, afterOverflow.WorkerSequence);
}
[Fact]
public async Task InternalSubscriberOverflow_HandlerSeesIsOnlySubscriberFalse_ProvingCountExcludesInternal()
{
// Issue 3: verifies that CountExternalSubscribers() excludes the internal dashboard
// subscriber, so a FailFast policy would NOT fault the session even when the internal
// subscriber is the ONLY registered subscriber. The overflow handler receives
// isOnlySubscriber==false (not true) because the overflowing subscriber is internal
// and is therefore excluded from the external-subscriber count.
Channel<MxEvent> source = Channel.CreateUnbounded<MxEvent>();
int observedIsOnlySubscriberSet = 0;
bool observedIsOnlySubscriberValue = false;
bool observedIsInternalValue = false;
await using SessionEventDistributor distributor = new(
"session-internal-overflow",
ct => source.Reader.ReadAllAsync(ct),
subscriberQueueCapacity: 2,
replayBufferCapacity: 0,
replayRetentionSeconds: 0,
NullLogger<SessionEventDistributor>.Instance,
TimeProvider.System,
(isOnlySubscriber, isInternal) =>
{
Volatile.Write(ref observedIsOnlySubscriberValue, isOnlySubscriber);
Volatile.Write(ref observedIsInternalValue, isInternal);
Volatile.Write(ref observedIsOnlySubscriberSet, 1);
});
await distributor.StartAsync(CancellationToken.None);
// Register ONLY an internal subscriber — no external subscriber is attached.
using IEventSubscriberLease internalLease = distributor.Register(isInternal: true);
// Push enough events to overflow the capacity-2 internal subscriber channel.
for (ulong sequence = 1; sequence <= 10; sequence++)
{
source.Writer.TryWrite(Event(sequence));
}
// The internal subscriber is disconnected with the overflow fault.
SessionManagerException fault = await Assert.ThrowsAsync<SessionManagerException>(
async () => await DrainUntilFaultAsync(internalLease.Reader));
Assert.Equal(SessionManagerErrorCode.EventQueueOverflow, fault.ErrorCode);
// Wait (bounded) for the handler to fire; it runs on the pump thread.
using CancellationTokenSource handlerWait = new(ReadTimeout);
try
{
while (Volatile.Read(ref observedIsOnlySubscriberSet) == 0)
{
await Task.Delay(10, handlerWait.Token);
}
}
catch (OperationCanceledException)
{
Assert.Fail($"Overflow handler did not fire within {ReadTimeout}.");
}
// isOnlySubscriber must be FALSE even though the internal subscriber was the ONLY
// subscriber — CountExternalSubscribers excludes it, so a FailFast policy on the
// external count would NOT fault the session.
Assert.True(Volatile.Read(ref observedIsOnlySubscriberSet) == 1, "Overflow handler should have fired.");
Assert.False(Volatile.Read(ref observedIsOnlySubscriberValue),
"isOnlySubscriber must be false for an internal subscriber (CountExternalSubscribers excludes it).");
Assert.True(Volatile.Read(ref observedIsInternalValue),
"isInternal must be true for a subscriber registered with isInternal: true.");
}
private static async Task DrainUntilFaultAsync(ChannelReader<MxEvent> reader)
{
// Drains any buffered events until the channel completes. If the channel was
// completed with a fault, WaitToReadAsync rethrows it. If it completed normally,
// WaitToReadAsync returns false and we return, so the caller's Assert.ThrowsAsync
// fails instead of spinning forever.
while (await reader.WaitToReadAsync().AsTask().WaitAsync(ReadTimeout))
{
while (reader.TryRead(out _))
{
}
}
}
private static SessionEventDistributor CreateDistributor(ChannelReader<MxEvent> source)
=> CreateDistributor(source, replayBufferCapacity: 1024, replayRetentionSeconds: 300);
private static SessionEventDistributor CreateDistributor(
ChannelReader<MxEvent> source,
int replayBufferCapacity,
double replayRetentionSeconds,
TimeProvider? timeProvider = null)
=> new(
"session-test",
ct => source.ReadAllAsync(ct),
subscriberQueueCapacity: 64,
replayBufferCapacity: replayBufferCapacity,
replayRetentionSeconds: replayRetentionSeconds,
NullLogger<SessionEventDistributor>.Instance,
timeProvider ?? TimeProvider.System);
private static MxEvent Event(ulong sequence)
=> new() { SessionId = "session-test", WorkerSequence = sequence };
private static async Task<MxEvent> ReadOneAsync(ChannelReader<MxEvent> reader)
{
await reader.WaitToReadAsync().AsTask().WaitAsync(ReadTimeout);
Assert.True(reader.TryRead(out MxEvent? value));
return value!;
}
private static async Task AssertCompletedAsync(ChannelReader<MxEvent> reader)
{
// Assert the channel is completed with nothing left buffered: Completion only
// finishes once the writer is done and every item has been read. Bounded so a
// never-completing channel fails fast.
await reader.Completion.WaitAsync(ReadTimeout);
}
}
@@ -663,7 +663,7 @@ public sealed class SessionManagerBulkTests
private static async Task<GatewaySession> OpenSessionAsync(IWorkerClient workerClient)
{
SessionManager manager = CreateManager(workerClient);
return await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
return await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
}
private static SessionManager CreateManager(IWorkerClient workerClient)
@@ -23,7 +23,7 @@ public sealed class SessionManagerTests
using GatewayMetrics metrics = new();
SessionManager manager = CreateManager(factory, metrics: metrics);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
Assert.True(manager.TryGetSession(session.SessionId, out GatewaySession? registered));
Assert.Same(session, registered);
@@ -34,6 +34,36 @@ public sealed class SessionManagerTests
Assert.Equal(1, metrics.GetSnapshot().SessionsOpened);
}
/// <summary>Verifies that a session opened by an authenticated caller records that caller's API key id in OwnerKeyId.</summary>
[Fact]
public async Task OpenSessionAsync_WithOwnerKeyId_RecordsOwnerKeyIdOnSession()
{
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(new FakeWorkerClient()));
GatewaySession session = await manager.OpenSessionAsync(
CreateOpenRequest(),
clientIdentity: "MyKey Display",
ownerKeyId: "key-abc123",
CancellationToken.None);
Assert.Equal("key-abc123", session.OwnerKeyId);
}
/// <summary>Verifies that a session opened without an owner key id records null in OwnerKeyId.</summary>
[Fact]
public async Task OpenSessionAsync_WithNullOwnerKeyId_RecordsNullOwnerKeyIdOnSession()
{
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(new FakeWorkerClient()));
GatewaySession session = await manager.OpenSessionAsync(
CreateOpenRequest(),
clientIdentity: null,
ownerKeyId: null,
CancellationToken.None);
Assert.Null(session.OwnerKeyId);
}
/// <summary>Verifies that opening a session sets the initial lease expiry from the configured default lease.</summary>
[Fact]
public async Task OpenSessionAsync_SetsInitialDefaultLease()
@@ -45,7 +75,7 @@ public sealed class SessionManagerTests
options: options,
timeProvider: clock);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
Assert.Equal(clock.GetUtcNow() + TimeSpan.FromMinutes(30), session.LeaseExpiresAt);
}
@@ -61,7 +91,7 @@ public sealed class SessionManagerTests
};
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(new FakeWorkerClient()));
GatewaySession session = await manager.OpenSessionAsync(request, "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(request, "client-1", ownerKeyId: null, CancellationToken.None);
Assert.Equal($"rust-load-client-{session.SessionId}", session.ClientCorrelationId);
}
@@ -76,7 +106,7 @@ public sealed class SessionManagerTests
};
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(new FakeWorkerClient()));
GatewaySession session = await manager.OpenSessionAsync(request, "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(request, "client-1", ownerKeyId: null, CancellationToken.None);
Assert.Equal($"client-{session.SessionId}", session.ClientCorrelationId);
}
@@ -87,7 +117,7 @@ public sealed class SessionManagerTests
{
FakeWorkerClient workerClient = new();
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient));
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
WorkerCommandReply reply = await manager.InvokeAsync(
session.SessionId,
@@ -108,6 +138,7 @@ public sealed class SessionManagerTests
"mxaccess-gateway-1-session-lease-refresh",
"nonce",
"client-1",
null,
"test-session",
"client-correlation-1",
TimeSpan.FromSeconds(30),
@@ -156,7 +187,7 @@ public sealed class SessionManagerTests
},
};
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient));
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
IReadOnlyList<SubscribeResult> results = await session.SubscribeBulkAsync(
12,
@@ -207,7 +238,7 @@ public sealed class SessionManagerTests
},
};
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient));
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
IReadOnlyList<BulkWriteResult> results = await session.WriteBulkAsync(
12,
@@ -268,7 +299,7 @@ public sealed class SessionManagerTests
},
};
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient));
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
IReadOnlyList<BulkReadResult> results = await session.ReadBulkAsync(
12,
@@ -291,7 +322,7 @@ public sealed class SessionManagerTests
{
FakeWorkerClient workerClient = new();
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient));
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
session.MarkFaulted("test fault");
SessionManagerException exception = await Assert.ThrowsAsync<SessionManagerException>(
@@ -316,7 +347,7 @@ public sealed class SessionManagerTests
{
FakeWorkerClient workerClient = new();
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient));
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
// Force a state mismatch: session stays Ready, worker transitions out.
workerClient.State = WorkerClientState.Handshaking;
@@ -341,7 +372,7 @@ public sealed class SessionManagerTests
FakeWorkerClient workerClient = new();
using GatewayMetrics metrics = new();
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient), metrics: metrics);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
SessionCloseResult firstClose = await manager.CloseSessionAsync(session.SessionId, CancellationToken.None);
SessionManagerException secondClose = await Assert.ThrowsAsync<SessionManagerException>(
@@ -366,7 +397,7 @@ public sealed class SessionManagerTests
"Worker shutdown timed out."),
};
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient));
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
SessionManagerException exception = await Assert.ThrowsAsync<SessionManagerException>(
async () => await manager.CloseSessionAsync(session.SessionId, CancellationToken.None));
@@ -397,6 +428,7 @@ public sealed class SessionManagerTests
GatewaySession firstSession = await manager.OpenSessionAsync(
CreateOpenRequest(),
"client-1",
ownerKeyId: null,
CancellationToken.None);
metrics.EventReceived(firstSession.SessionId, MxEventFamily.OnDataChange.ToString());
@@ -405,6 +437,7 @@ public sealed class SessionManagerTests
GatewaySession secondSession = await manager.OpenSessionAsync(
CreateOpenRequest(),
"client-2",
ownerKeyId: null,
CancellationToken.None);
Assert.Equal(SessionManagerErrorCode.CloseFailed, exception.ErrorCode);
@@ -440,6 +473,7 @@ public sealed class SessionManagerTests
GatewaySession session = await manager.OpenSessionAsync(
CreateOpenRequest(),
"client-1",
ownerKeyId: null,
CancellationToken.None);
Task<SessionCloseResult> firstClose = manager.CloseSessionAsync(session.SessionId, CancellationToken.None);
@@ -482,7 +516,7 @@ public sealed class SessionManagerTests
FakeWorkerClient workerClient = new();
using GatewayMetrics metrics = new();
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient), metrics: metrics);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
SessionCloseResult result = await manager.KillWorkerAsync(session.SessionId, "test-kill", CancellationToken.None);
@@ -510,7 +544,7 @@ public sealed class SessionManagerTests
{
FakeWorkerClient workerClient = new();
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient));
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
await Assert.ThrowsAsync<ArgumentException>(
async () => await manager.KillWorkerAsync(session.SessionId, blankReason, CancellationToken.None));
@@ -529,7 +563,7 @@ public sealed class SessionManagerTests
{
FakeWorkerClient workerClient = new();
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient));
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
await Assert.ThrowsAsync<ArgumentNullException>(
async () => await manager.KillWorkerAsync(session.SessionId, null!, CancellationToken.None));
@@ -569,6 +603,7 @@ public sealed class SessionManagerTests
GatewaySession session = await manager.OpenSessionAsync(
CreateOpenRequest(),
"client-1",
ownerKeyId: null,
CancellationToken.None);
Assert.Equal(1, metrics.GetSnapshot().OpenSessions);
@@ -598,6 +633,7 @@ public sealed class SessionManagerTests
GatewaySession session = await manager.OpenSessionAsync(
CreateOpenRequest(),
"client-1",
ownerKeyId: null,
CancellationToken.None);
Task<SessionCloseResult> first = manager.KillWorkerAsync(session.SessionId, "kill-a", CancellationToken.None);
@@ -641,6 +677,7 @@ public sealed class SessionManagerTests
GatewaySession session = await manager.OpenSessionAsync(
CreateOpenRequest(),
"client-1",
ownerKeyId: null,
CancellationToken.None);
Assert.Equal(1, metrics.GetSnapshot().OpenSessions);
@@ -666,7 +703,7 @@ public sealed class SessionManagerTests
metrics);
SessionManagerException exception = await Assert.ThrowsAsync<SessionManagerException>(
async () => await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None));
async () => await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None));
Assert.Equal(SessionManagerErrorCode.OpenFailed, exception.ErrorCode);
Assert.Equal(0, registry.Count);
@@ -682,8 +719,8 @@ public sealed class SessionManagerTests
FakeWorkerClient activeClient = new();
QueueingSessionWorkerClientFactory factory = new(expiredClient, activeClient);
SessionManager manager = CreateManager(factory);
GatewaySession expiredSession = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession activeSession = await manager.OpenSessionAsync(CreateOpenRequest(), "client-2", CancellationToken.None);
GatewaySession expiredSession = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
GatewaySession activeSession = await manager.OpenSessionAsync(CreateOpenRequest(), "client-2", ownerKeyId: null, CancellationToken.None);
DateTimeOffset now = DateTimeOffset.UtcNow;
expiredSession.ExtendLease(now.AddSeconds(-1));
activeSession.ExtendLease(now.AddMinutes(5));
@@ -703,7 +740,7 @@ public sealed class SessionManagerTests
{
FakeWorkerClient workerClient = new();
SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient));
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
DateTimeOffset now = DateTimeOffset.UtcNow;
session.ExtendLease(now.AddSeconds(-1));
using IDisposable eventSubscriber = session.AttachEventSubscriber(allowMultipleSubscribers: false);
@@ -724,8 +761,8 @@ public sealed class SessionManagerTests
QueueingSessionWorkerClientFactory factory = new(firstClient, secondClient);
using GatewayMetrics metrics = new();
SessionManager manager = CreateManager(factory, metrics: metrics);
GatewaySession firstSession = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None);
GatewaySession secondSession = await manager.OpenSessionAsync(CreateOpenRequest(), "client-2", CancellationToken.None);
GatewaySession firstSession = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", ownerKeyId: null, CancellationToken.None);
GatewaySession secondSession = await manager.OpenSessionAsync(CreateOpenRequest(), "client-2", ownerKeyId: null, CancellationToken.None);
await manager.ShutdownAsync(CancellationToken.None);
@@ -192,6 +192,140 @@ public sealed class FakeWorkerHarnessTests
Assert.Equal(WorkerClientState.Closed, client.State);
}
/// <summary>
/// Verifies that RespondToControlCommandAsync echoes the Ping message back
/// in the DiagnosticMessage field, matching the real worker's ping reply shape.
/// </summary>
[Fact]
public async Task RespondToControlCommandAsync_Ping_EchoesMessageInDiagnostic()
{
await using FakeWorkerHarness fakeWorker = await FakeWorkerHarness.CreateConnectedPairAsync();
await using WorkerClient client = fakeWorker.CreateClient();
await StartClientAsync(fakeWorker, client);
Task<WorkerCommandReply> invokeTask = client.InvokeAsync(
CreateCommand(MxCommandKind.Ping, cmd => cmd.Ping = new PingCommand { Message = "hello-ping" }),
TestTimeout,
CancellationToken.None);
await fakeWorker.RespondToControlCommandAsync().WaitAsync(TestTimeout);
WorkerCommandReply reply = await invokeTask.WaitAsync(TestTimeout);
Assert.Equal(MxCommandKind.Ping, reply.Reply.Kind);
Assert.Equal(ProtocolStatusCode.Ok, reply.Reply.ProtocolStatus.Code);
Assert.Equal("hello-ping", reply.Reply.DiagnosticMessage);
}
/// <summary>
/// Verifies that RespondToControlCommandAsync returns a SessionStateReply
/// with state Ready for a GetSessionState command.
/// </summary>
[Fact]
public async Task RespondToControlCommandAsync_GetSessionState_ReturnsReadyState()
{
await using FakeWorkerHarness fakeWorker = await FakeWorkerHarness.CreateConnectedPairAsync();
await using WorkerClient client = fakeWorker.CreateClient();
await StartClientAsync(fakeWorker, client);
Task<WorkerCommandReply> invokeTask = client.InvokeAsync(
CreateCommand(MxCommandKind.GetSessionState),
TestTimeout,
CancellationToken.None);
await fakeWorker.RespondToControlCommandAsync().WaitAsync(TestTimeout);
WorkerCommandReply reply = await invokeTask.WaitAsync(TestTimeout);
Assert.Equal(MxCommandKind.GetSessionState, reply.Reply.Kind);
Assert.Equal(ProtocolStatusCode.Ok, reply.Reply.ProtocolStatus.Code);
Assert.NotNull(reply.Reply.SessionState);
Assert.Equal(SessionState.Ready, reply.Reply.SessionState.State);
}
/// <summary>
/// Verifies that RespondToControlCommandAsync returns a WorkerInfoReply
/// with the fake worker's process ID, version, and MXAccess identifiers.
/// </summary>
[Fact]
public async Task RespondToControlCommandAsync_GetWorkerInfo_ReturnsFakeWorkerInfo()
{
await using FakeWorkerHarness fakeWorker = await FakeWorkerHarness.CreateConnectedPairAsync();
await using WorkerClient client = fakeWorker.CreateClient();
await StartClientAsync(fakeWorker, client);
Task<WorkerCommandReply> invokeTask = client.InvokeAsync(
CreateCommand(MxCommandKind.GetWorkerInfo),
TestTimeout,
CancellationToken.None);
await fakeWorker.RespondToControlCommandAsync().WaitAsync(TestTimeout);
WorkerCommandReply reply = await invokeTask.WaitAsync(TestTimeout);
Assert.Equal(MxCommandKind.GetWorkerInfo, reply.Reply.Kind);
Assert.Equal(ProtocolStatusCode.Ok, reply.Reply.ProtocolStatus.Code);
Assert.NotNull(reply.Reply.WorkerInfo);
Assert.Equal(FakeWorkerHarness.DefaultWorkerProcessId, reply.Reply.WorkerInfo.WorkerProcessId);
Assert.Equal("LMXProxy.LMXProxyServer.1", reply.Reply.WorkerInfo.MxaccessProgid);
Assert.Equal("{C30B52F5-2CB5-4760-AF0A-3A344A7EB5DC}", reply.Reply.WorkerInfo.MxaccessClsid);
Assert.Equal("fake-worker", reply.Reply.WorkerInfo.WorkerVersion);
}
/// <summary>
/// Verifies that RespondToControlCommandAsync returns an empty DrainEventsReply
/// for a DrainEvents command (the fake harness has no queued events).
/// </summary>
[Fact]
public async Task RespondToControlCommandAsync_DrainEvents_ReturnsEmptyReply()
{
await using FakeWorkerHarness fakeWorker = await FakeWorkerHarness.CreateConnectedPairAsync();
await using WorkerClient client = fakeWorker.CreateClient();
await StartClientAsync(fakeWorker, client);
Task<WorkerCommandReply> invokeTask = client.InvokeAsync(
CreateCommand(MxCommandKind.DrainEvents, cmd => cmd.DrainEvents = new DrainEventsCommand { MaxEvents = 32 }),
TestTimeout,
CancellationToken.None);
await fakeWorker.RespondToControlCommandAsync().WaitAsync(TestTimeout);
WorkerCommandReply reply = await invokeTask.WaitAsync(TestTimeout);
Assert.Equal(MxCommandKind.DrainEvents, reply.Reply.Kind);
Assert.Equal(ProtocolStatusCode.Ok, reply.Reply.ProtocolStatus.Code);
Assert.NotNull(reply.Reply.DrainEvents);
Assert.Empty(reply.Reply.DrainEvents.Events);
}
/// <summary>
/// Verifies that RespondToControlCommandAsync for ShutdownWorker sends an OK
/// reply followed by a WorkerShutdownAck, which closes the client.
/// </summary>
[Fact]
public async Task RespondToControlCommandAsync_ShutdownWorker_SendsReplyThenAck()
{
await using FakeWorkerHarness fakeWorker = await FakeWorkerHarness.CreateConnectedPairAsync();
await using WorkerClient client = fakeWorker.CreateClient();
await StartClientAsync(fakeWorker, client);
// ShutdownAsync sends a WorkerShutdown envelope (not a WorkerCommand),
// so this test invokes ShutdownWorker directly as a control command via InvokeAsync.
Task<WorkerCommandReply> invokeTask = client.InvokeAsync(
CreateCommand(MxCommandKind.ShutdownWorker, cmd => cmd.ShutdownWorker = new ShutdownWorkerCommand()),
TestTimeout,
CancellationToken.None);
// The harness reads the ShutdownWorker WorkerCommand and replies with
// OK + ShutdownAck — the WorkerClient's read loop processes the ack and
// transitions to Closed.
await fakeWorker.RespondToControlCommandAsync().WaitAsync(TestTimeout);
WorkerCommandReply reply = await invokeTask.WaitAsync(TestTimeout);
Assert.Equal(MxCommandKind.ShutdownWorker, reply.Reply.Kind);
Assert.Equal(ProtocolStatusCode.Ok, reply.Reply.ProtocolStatus.Code);
await WaitUntilAsync(() => client.State == WorkerClientState.Closed, TestTimeout);
Assert.Equal(WorkerClientState.Closed, client.State);
}
private static async Task StartClientAsync(
FakeWorkerHarness fakeWorker,
WorkerClient client)
@@ -201,15 +335,13 @@ public sealed class FakeWorkerHarnessTests
await startTask.WaitAsync(TestTimeout).ConfigureAwait(false);
}
private static WorkerCommand CreateCommand(MxCommandKind kind)
private static WorkerCommand CreateCommand(
MxCommandKind kind,
Action<MxCommand>? configure = null)
{
return new WorkerCommand
{
Command = new MxCommand
{
Kind = kind,
},
};
MxCommand command = new() { Kind = kind };
configure?.Invoke(command);
return new WorkerCommand { Command = command };
}
private static async Task WaitUntilAsync(
@@ -391,6 +391,118 @@ public sealed class FakeWorkerHarness : IAsyncDisposable
cancellationToken).ConfigureAwait(false);
}
/// <summary>
/// Reads one incoming command envelope and, if it is one of the five
/// control command kinds (Ping, GetSessionState, GetWorkerInfo, DrainEvents,
/// ShutdownWorker), writes a canned reply that mirrors the real worker's
/// reply shape. For ShutdownWorker the method additionally sends a
/// <see cref="WorkerShutdownAck"/> after the OK reply, matching the real
/// worker's shutdown flow.
/// </summary>
/// <param name="cancellationToken">Token to cancel the asynchronous operation.</param>
/// <returns>The command envelope that was handled.</returns>
/// <exception cref="InvalidOperationException">
/// Thrown when the next envelope is not a <c>WorkerCommand</c> or contains a
/// non-control command kind.
/// </exception>
public async Task<WorkerEnvelope> RespondToControlCommandAsync(
CancellationToken cancellationToken = default)
{
WorkerEnvelope commandEnvelope = await ReadCommandAsync(cancellationToken).ConfigureAwait(false);
return await RespondToControlCommandAsync(commandEnvelope, cancellationToken).ConfigureAwait(false);
}
/// <summary>
/// Accepts an already-read command envelope and, if it is one of the five control
/// command kinds (Ping, GetSessionState, GetWorkerInfo, DrainEvents, ShutdownWorker),
/// writes a canned reply that mirrors the real worker's reply shape. For ShutdownWorker
/// the method additionally sends a <see cref="WorkerShutdownAck"/> after the OK reply.
/// Use this overload when the caller has already consumed the envelope from the pipe
/// (e.g., to inspect the kind before routing) to avoid re-reading.
/// </summary>
/// <param name="commandEnvelope">The already-read command envelope to respond to.</param>
/// <param name="cancellationToken">Token to cancel the asynchronous operation.</param>
/// <returns>The command envelope that was handled.</returns>
/// <exception cref="ArgumentException">
/// Thrown when <paramref name="commandEnvelope"/> does not contain a <c>WorkerCommand</c>.
/// </exception>
/// <exception cref="InvalidOperationException">
/// Thrown when the command kind is not one of the five control command kinds.
/// </exception>
public async Task<WorkerEnvelope> RespondToControlCommandAsync(
WorkerEnvelope commandEnvelope,
CancellationToken cancellationToken = default)
{
if (commandEnvelope.BodyCase != WorkerEnvelope.BodyOneofCase.WorkerCommand)
{
throw new ArgumentException(
$"Expected WorkerCommand envelope but received {commandEnvelope.BodyCase}.",
nameof(commandEnvelope));
}
MxCommand command = commandEnvelope.WorkerCommand.Command;
switch (command.Kind)
{
case MxCommandKind.Ping:
await ReplyToCommandAsync(
commandEnvelope,
configureReply: reply =>
{
string? message = command.Ping?.Message;
if (!string.IsNullOrEmpty(message))
{
reply.DiagnosticMessage = message;
}
},
cancellationToken: cancellationToken).ConfigureAwait(false);
break;
case MxCommandKind.GetSessionState:
await ReplyToCommandAsync(
commandEnvelope,
configureReply: reply => reply.SessionState = new SessionStateReply
{
State = SessionState.Ready,
},
cancellationToken: cancellationToken).ConfigureAwait(false);
break;
case MxCommandKind.GetWorkerInfo:
await ReplyToCommandAsync(
commandEnvelope,
configureReply: reply => reply.WorkerInfo = new WorkerInfoReply
{
WorkerProcessId = DefaultWorkerProcessId,
WorkerVersion = "fake-worker",
MxaccessProgid = "LMXProxy.LMXProxyServer.1",
MxaccessClsid = "{C30B52F5-2CB5-4760-AF0A-3A344A7EB5DC}",
},
cancellationToken: cancellationToken).ConfigureAwait(false);
break;
case MxCommandKind.DrainEvents:
await ReplyToCommandAsync(
commandEnvelope,
configureReply: reply => reply.DrainEvents = new DrainEventsReply(),
cancellationToken: cancellationToken).ConfigureAwait(false);
break;
case MxCommandKind.ShutdownWorker:
await ReplyToCommandAsync(
commandEnvelope,
cancellationToken: cancellationToken).ConfigureAwait(false);
await SendShutdownAckAsync(cancellationToken: cancellationToken).ConfigureAwait(false);
break;
default:
throw new InvalidOperationException(
$"RespondToControlCommandAsync only handles control command kinds; received {command.Kind}.");
}
return commandEnvelope;
}
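// Hypothetical sketch (an assumption, not part of this commit's FakeWorkerHarness):
// a predicate over the five kinds RespondToControlCommandAsync accepts. A test that
// reads envelopes itself could check the kind first and pass only control commands
// to the envelope overload, rather than relying on InvalidOperationException.
public static bool IsControlCommandKindSketch(MxCommandKind kind) =>
    kind is MxCommandKind.Ping
        or MxCommandKind.GetSessionState
        or MxCommandKind.GetWorkerInfo
        or MxCommandKind.DrainEvents
        or MxCommandKind.ShutdownWorker;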
/// <summary>Writes a malformed payload directly to the worker stream.</summary>
/// <param name="payload">Malformed payload bytes to write.</param>
/// <param name="cancellationToken">Token to cancel the asynchronous operation.</param>
@@ -1,3 +1,4 @@
using System.Text.Json;
using ZB.MOM.WW.Audit;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
@@ -69,7 +70,7 @@ public sealed class ConstraintEnforcerTests
CancellationToken.None);
Assert.NotNull(failure);
await enforcer.RecordDenialAsync(identity, "Write", "42", failure, CancellationToken.None);
await enforcer.RecordDenialAsync(identity, "Write", "42", failure, correlationId: null, CancellationToken.None);
AuditEvent auditEvent = Assert.Single(auditWriter.Events);
Assert.Equal("operator01", auditEvent.Actor);
@@ -83,6 +84,52 @@ public sealed class ConstraintEnforcerTests
Assert.Null(auditEvent.CorrelationId);
}
/// <summary>A denial whose correlation id parses as a GUID stores it as the typed audit correlation id.</summary>
[Fact]
public async Task RecordDenialAsync_WithGuidCorrelationId_StoresCorrelationId()
{
ConstraintEnforcer enforcer = CreateEnforcer(out FakeAuditWriter auditWriter);
Guid correlationId = Guid.NewGuid();
await enforcer.RecordDenialAsync(
identity: null,
"Read",
"Secret.Tag",
new ConstraintFailure("read_scope", "Tag is outside the API key read scope."),
correlationId.ToString(),
CancellationToken.None);
AuditEvent auditEvent = Assert.Single(auditWriter.Events);
Assert.Equal(correlationId, auditEvent.CorrelationId);
}
/// <summary>
/// A denial with a non-GUID correlation id leaves the typed audit correlation id null but
/// still preserves the raw client correlation id in DetailsJson so it is not lost.
/// </summary>
[Fact]
public async Task RecordDenialAsync_WithNonGuidCorrelationId_LeavesCorrelationIdNullButPreservesRawInDetails()
{
ConstraintEnforcer enforcer = CreateEnforcer(out FakeAuditWriter auditWriter);
await enforcer.RecordDenialAsync(
identity: null,
"Read",
"Secret.Tag",
new ConstraintFailure("read_scope", "Tag is outside the API key read scope."),
"rust-client-Write-7",
CancellationToken.None);
AuditEvent auditEvent = Assert.Single(auditWriter.Events);
Assert.Null(auditEvent.CorrelationId);
Assert.NotNull(auditEvent.DetailsJson);
Dictionary<string, string>? details =
JsonSerializer.Deserialize<Dictionary<string, string>>(auditEvent.DetailsJson);
Assert.NotNull(details);
Assert.Equal("rust-client-Write-7", details["clientCorrelationId"]);
}
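// Hypothetical sketch (an assumption, not the ConstraintEnforcer implementation):
// one way to split a client correlation id so the two tests above hold. An id that
// parses as a GUID becomes the typed audit CorrelationId; any other non-empty id is
// kept verbatim under "clientCorrelationId" in the serialized details JSON.
private static (Guid? CorrelationId, string? DetailsJson) SplitCorrelationIdSketch(string? correlationId)
{
    if (string.IsNullOrEmpty(correlationId))
    {
        // Nothing to record: no typed id and no details payload.
        return (null, null);
    }
    if (Guid.TryParse(correlationId, out Guid parsed))
    {
        // GUID-shaped ids map straight onto the typed audit column.
        return (parsed, null);
    }
    // Non-GUID ids cannot be typed, so preserve the raw value in the details JSON.
    Dictionary<string, string> details = new() { ["clientCorrelationId"] = correlationId };
    return (null, JsonSerializer.Serialize(details));
}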
/// <summary>A denial with no identity records the canonical "anonymous" actor.</summary>
[Fact]
public async Task RecordDenialAsync_WithoutIdentity_UsesAnonymousActor()
@@ -94,6 +141,7 @@ public sealed class ConstraintEnforcerTests
"Read",
"Secret.Tag",
new ConstraintFailure("read_scope", "Tag is outside the API key read scope."),
correlationId: null,
CancellationToken.None);
AuditEvent auditEvent = Assert.Single(auditWriter.Events);
@@ -416,6 +416,7 @@ public sealed class GatewayGrpcAuthorizationInterceptorTests
public Task<GatewaySession> OpenSessionAsync(
SessionOpenRequest request,
string? clientIdentity,
string? ownerKeyId,
CancellationToken cancellationToken)
{
OpenSessionCount++;
@@ -38,5 +38,6 @@ public sealed class AllowAllConstraintEnforcer : IConstraintEnforcer
string commandKind,
string target,
ConstraintFailure failure,
string? correlationId,
CancellationToken cancellationToken) => Task.CompletedTask;
}
@@ -23,8 +23,8 @@ public sealed class PredicateConstraintEnforcer : IConstraintEnforcer
/// <summary>Deny predicate keyed on (serverHandle, itemHandle) (returns true to deny).</summary>
public Func<int, int, bool> DenyWriteHandle { get; init; } = (_, _) => false;
/// <summary>Recorded denial messages — (commandKind, target) tuples.</summary>
public List<(string CommandKind, string Target)> RecordedDenials { get; } = [];
/// <summary>Recorded denial messages — (commandKind, target, correlationId) tuples.</summary>
public List<(string CommandKind, string Target, string? CorrelationId)> RecordedDenials { get; } = [];
/// <inheritdoc />
public Task<ConstraintFailure?> CheckReadTagAsync(
@@ -81,9 +81,10 @@ public sealed class PredicateConstraintEnforcer : IConstraintEnforcer
string commandKind,
string target,
ConstraintFailure failure,
string? correlationId,
CancellationToken cancellationToken)
{
RecordedDenials.Add((commandKind, target));
RecordedDenials.Add((commandKind, target, correlationId));
return Task.CompletedTask;
}
}
@@ -288,6 +288,206 @@ public sealed class WorkerPipeSessionTests
await SendShutdownAndWaitAsync(pipePair, runTask, cancellation.Token);
}
/// <summary>
/// Verifies that a Ping control command is answered on the worker side
/// (not dispatched to the STA) with an OK reply that echoes the ping
/// message into the reply's diagnostic field.
/// </summary>
[Fact]
public async Task RunAsync_PingControlCommand_RepliesOkAndEchoesMessage()
{
using CancellationTokenSource cancellation = new(TimeSpan.FromSeconds(5));
using PipePair pipePair = await PipePair.CreateAsync(cancellation.Token);
FakeRuntimeSession runtime = new();
WorkerPipeSession session = CreatePipeSession(pipePair.WorkerStream, runtime);
Task runTask = session.RunAsync(cancellation.Token);
await CompleteGatewayHandshakeAsync(pipePair, cancellation.Token);
await pipePair.GatewayWriter
.WriteAsync(CreatePingCommandEnvelope("ping-1", "hello-worker"), cancellation.Token);
WorkerEnvelope replyEnvelope = await ReadUntilAsync(
pipePair.GatewayReader,
WorkerEnvelope.BodyOneofCase.WorkerCommandReply,
cancellation.Token);
MxCommandReply reply = replyEnvelope.WorkerCommandReply.Reply;
Assert.Equal("ping-1", reply.CorrelationId);
Assert.Equal(MxCommandKind.Ping, reply.Kind);
Assert.Equal(ProtocolStatusCode.Ok, reply.ProtocolStatus.Code);
Assert.Equal("hello-worker", reply.DiagnosticMessage);
await SendShutdownAndWaitAsync(pipePair, runTask, cancellation.Token);
}
/// <summary>
/// Verifies that GetSessionState reports the worker's lifecycle as the
/// proto SessionState — READY while the message loop is serving.
/// </summary>
[Fact]
public async Task RunAsync_GetSessionStateControlCommand_RepliesReady()
{
using CancellationTokenSource cancellation = new(TimeSpan.FromSeconds(5));
using PipePair pipePair = await PipePair.CreateAsync(cancellation.Token);
FakeRuntimeSession runtime = new();
WorkerPipeSession session = CreatePipeSession(pipePair.WorkerStream, runtime);
Task runTask = session.RunAsync(cancellation.Token);
await CompleteGatewayHandshakeAsync(pipePair, cancellation.Token);
await pipePair.GatewayWriter
.WriteAsync(
CreateControlCommandEnvelope(
"state-1",
MxCommandKind.GetSessionState,
command => command.GetSessionState = new GetSessionStateCommand()),
cancellation.Token);
WorkerEnvelope replyEnvelope = await ReadUntilAsync(
pipePair.GatewayReader,
WorkerEnvelope.BodyOneofCase.WorkerCommandReply,
cancellation.Token);
MxCommandReply reply = replyEnvelope.WorkerCommandReply.Reply;
Assert.Equal(MxCommandKind.GetSessionState, reply.Kind);
Assert.Equal(ProtocolStatusCode.Ok, reply.ProtocolStatus.Code);
Assert.Equal(SessionState.Ready, reply.SessionState.State);
await SendShutdownAndWaitAsync(pipePair, runTask, cancellation.Token);
}
/// <summary>
/// Verifies that GetWorkerInfo populates the worker process id, version,
/// and MXAccess ProgID/CLSID from the worker's own metadata.
/// </summary>
[Fact]
public async Task RunAsync_GetWorkerInfoControlCommand_PopulatesWorkerInfoFields()
{
using CancellationTokenSource cancellation = new(TimeSpan.FromSeconds(5));
using PipePair pipePair = await PipePair.CreateAsync(cancellation.Token);
FakeRuntimeSession runtime = new();
WorkerPipeSession session = CreatePipeSession(pipePair.WorkerStream, runtime);
Task runTask = session.RunAsync(cancellation.Token);
await CompleteGatewayHandshakeAsync(pipePair, cancellation.Token);
await pipePair.GatewayWriter
.WriteAsync(
CreateControlCommandEnvelope(
"info-1",
MxCommandKind.GetWorkerInfo,
command => command.GetWorkerInfo = new GetWorkerInfoCommand()),
cancellation.Token);
WorkerEnvelope replyEnvelope = await ReadUntilAsync(
pipePair.GatewayReader,
WorkerEnvelope.BodyOneofCase.WorkerCommandReply,
cancellation.Token);
MxCommandReply reply = replyEnvelope.WorkerCommandReply.Reply;
Assert.Equal(MxCommandKind.GetWorkerInfo, reply.Kind);
Assert.Equal(ProtocolStatusCode.Ok, reply.ProtocolStatus.Code);
WorkerInfoReply info = reply.WorkerInfo;
Assert.Equal(1234, info.WorkerProcessId);
Assert.False(string.IsNullOrEmpty(info.WorkerVersion));
Assert.Equal(MxAccessInteropInfo.ProgId, info.MxaccessProgid);
Assert.Equal(MxAccessInteropInfo.Clsid, info.MxaccessClsid);
await SendShutdownAndWaitAsync(pipePair, runTask, cancellation.Token);
}
/// <summary>
/// Verifies that DrainEvents drains the runtime session's queued events
/// into the reply rather than streaming them as WorkerEvent envelopes.
/// </summary>
[Fact]
public async Task RunAsync_DrainEventsControlCommand_ReturnsQueuedEvents()
{
using CancellationTokenSource cancellation = new(TimeSpan.FromSeconds(5));
using PipePair pipePair = await PipePair.CreateAsync(cancellation.Token);
// Suppress the background drain loop's fixed-batch drains so the
// queued events survive for the explicit DrainEvents command (which
// drains all via max_events == 0). 128 mirrors
// WorkerPipeSession.EventDrainBatchSize.
FakeRuntimeSession runtime = new() { SuppressDrainForBatchSize = 128 };
WorkerPipeSession session = CreatePipeSession(pipePair.WorkerStream, runtime);
runtime.EnqueueEvent(CreateWorkerEvent(sequence: 11));
runtime.EnqueueEvent(CreateWorkerEvent(sequence: 12));
Task runTask = session.RunAsync(cancellation.Token);
await CompleteGatewayHandshakeAsync(pipePair, cancellation.Token);
await pipePair.GatewayWriter
.WriteAsync(
CreateControlCommandEnvelope(
"drain-1",
MxCommandKind.DrainEvents,
command => command.DrainEvents = new DrainEventsCommand { MaxEvents = 0 }),
cancellation.Token);
WorkerEnvelope replyEnvelope = await ReadUntilAsync(
pipePair.GatewayReader,
WorkerEnvelope.BodyOneofCase.WorkerCommandReply,
cancellation.Token);
MxCommandReply reply = replyEnvelope.WorkerCommandReply.Reply;
Assert.Equal(MxCommandKind.DrainEvents, reply.Kind);
Assert.Equal(ProtocolStatusCode.Ok, reply.ProtocolStatus.Code);
Assert.Equal(2, reply.DrainEvents.Events.Count);
Assert.Contains(reply.DrainEvents.Events, e => e.WorkerSequence == 11UL);
Assert.Contains(reply.DrainEvents.Events, e => e.WorkerSequence == 12UL);
await SendShutdownAndWaitAsync(pipePair, runTask, cancellation.Token);
}
/// <summary>
/// Verifies that ShutdownWorker returns its OK reply BEFORE the graceful
/// shutdown runs and disposes the runtime session, and that the message
/// loop then stops.
/// </summary>
[Fact]
public async Task RunAsync_ShutdownWorkerControlCommand_RepliesOkThenShutsDown()
{
using CancellationTokenSource cancellation = new(TimeSpan.FromSeconds(5));
using PipePair pipePair = await PipePair.CreateAsync(cancellation.Token);
FakeRuntimeSession runtime = new();
WorkerPipeSession session = CreatePipeSession(pipePair.WorkerStream, runtime);
Task runTask = session.RunAsync(cancellation.Token);
await CompleteGatewayHandshakeAsync(pipePair, cancellation.Token);
await pipePair.GatewayWriter
.WriteAsync(
CreateControlCommandEnvelope(
"shutdown-1",
MxCommandKind.ShutdownWorker,
command => command.ShutdownWorker = new ShutdownWorkerCommand
{
GracePeriod = Duration.FromTimeSpan(TimeSpan.FromSeconds(1)),
}),
cancellation.Token);
WorkerEnvelope replyEnvelope = await ReadUntilAsync(
pipePair.GatewayReader,
WorkerEnvelope.BodyOneofCase.WorkerCommandReply,
cancellation.Token);
MxCommandReply reply = replyEnvelope.WorkerCommandReply.Reply;
Assert.Equal("shutdown-1", reply.CorrelationId);
Assert.Equal(MxCommandKind.ShutdownWorker, reply.Kind);
Assert.Equal(ProtocolStatusCode.Ok, reply.ProtocolStatus.Code);
// The OK reply is followed by a shutdown ack, then the loop stops and
// the runtime session is disposed.
WorkerEnvelope ack = await ReadUntilAsync(
pipePair.GatewayReader,
WorkerEnvelope.BodyOneofCase.WorkerShutdownAck,
cancellation.Token);
Assert.Equal(ProtocolStatusCode.Ok, ack.WorkerShutdownAck.Status.Code);
Task completedTask = await Task
.WhenAny(runTask, Task.Delay(TimeSpan.FromSeconds(5), cancellation.Token));
Assert.Same(runTask, completedTask);
await runTask;
Assert.True(runtime.Disposed, "ShutdownWorker must dispose the runtime session.");
}
/// <summary>
/// Verifies that stale STA activity with no command in flight triggers
@@ -867,6 +1067,20 @@ public sealed class WorkerPipeSessionTests
() => 1234);
}
private static WorkerPipeSession CreatePipeSession(
Stream stream,
FakeRuntimeSession runtime)
{
return CreatePipeSession(
stream,
runtime,
new WorkerPipeSessionOptions
{
HeartbeatInterval = TimeSpan.FromMilliseconds(100),
HeartbeatGrace = TimeSpan.FromSeconds(5),
});
}
private static WorkerPipeSession CreatePipeSession(
Stream stream,
FakeRuntimeSession runtime,
@@ -916,6 +1130,11 @@ public sealed class WorkerPipeSessionTests
};
}
// A generic STA-dispatched command used by the dispatch/heartbeat/
// shutdown-race tests. Register is a real MXAccess command kind (not a
// worker control command), so it flows through
// IWorkerRuntimeSession.DispatchAsync, unlike Ping/GetSessionState/etc.,
// which are answered on the message-loop thread without touching the STA.
private static WorkerEnvelope CreateCommandEnvelope(string correlationId, ulong sequence = 2)
{
return new WorkerEnvelope
@@ -928,10 +1147,10 @@ public sealed class WorkerPipeSessionTests
{
Command = new MxCommand
{
Kind = MxCommandKind.Ping,
Ping = new PingCommand
Kind = MxCommandKind.Register,
Register = new RegisterCommand
{
Message = "ping",
ClientName = "test-client",
},
},
EnqueueTimestamp = Timestamp.FromDateTimeOffset(DateTimeOffset.UtcNow),
@@ -939,6 +1158,40 @@ public sealed class WorkerPipeSessionTests
};
}
private static WorkerEnvelope CreatePingCommandEnvelope(
string correlationId,
string message,
ulong sequence = 2)
{
return CreateControlCommandEnvelope(
correlationId,
MxCommandKind.Ping,
command => command.Ping = new PingCommand { Message = message },
sequence);
}
private static WorkerEnvelope CreateControlCommandEnvelope(
string correlationId,
MxCommandKind kind,
Action<MxCommand> configurePayload,
ulong sequence = 2)
{
MxCommand command = new() { Kind = kind };
configurePayload(command);
return new WorkerEnvelope
{
ProtocolVersion = GatewayContractInfo.WorkerProtocolVersion,
SessionId = SessionId,
Sequence = sequence,
CorrelationId = correlationId,
WorkerCommand = new WorkerCommand
{
Command = command,
EnqueueTimestamp = Timestamp.FromDateTimeOffset(DateTimeOffset.UtcNow),
},
};
}
private static WorkerEnvelope CreateCancelEnvelope(string correlationId, ulong sequence = 2)
{
return new WorkerEnvelope
@@ -259,6 +259,21 @@ public sealed class LmxSubtagAlarmSourceTests
{
}
public object Suspend(int serverHandle, int itemHandle) => new object();
public object Activate(int serverHandle, int itemHandle) => new object();
public int AuthenticateUser(int serverHandle, string verifyUser, string verifyUserPassword) => 0;
public int ArchestrAUserToId(int serverHandle, string userIdGuid) => 0;
public int AddBufferedItem(int serverHandle, string itemDefinition, string itemContext)
=> AddItem(serverHandle, itemDefinition);
public void SetBufferedUpdateInterval(int serverHandle, int updateIntervalMilliseconds)
{
}
internal sealed class WriteRecord
{
public WriteRecord(int serverHandle, int itemHandle, object? value, int userId)
@@ -33,6 +33,39 @@ public sealed class MxAccessComServerTests
Assert.Equal(new[] { "Register:client-a", "Advise:77:9", "Unregister:77" }, typed.Calls);
}
/// <summary>
/// The MXAccess command methods added in the worker COM commands bundle
/// (Suspend/Activate/AuthenticateUser/ArchestrAUserToId/AddBufferedItem/
/// SetBufferedUpdateInterval) route through the typed interface with their
/// arguments preserved, and the credential is never echoed back.
/// </summary>
[Fact]
public void CommandMethods_WithTypedServer_RouteThroughTypedInterface()
{
RecordingMxAccessServer typed = new(registerHandle: 5);
MxAccessComServer adapter = new(typed);
adapter.Suspend(serverHandle: 5, itemHandle: 11);
adapter.Activate(serverHandle: 5, itemHandle: 12);
adapter.AuthenticateUser(serverHandle: 5, verifyUser: "Administrator", verifyUserPassword: "s3cret");
adapter.ArchestrAUserToId(serverHandle: 5, userIdGuid: "guid-1");
adapter.AddBufferedItem(serverHandle: 5, itemDefinition: "TestInt", itemContext: "TestChildObject");
adapter.SetBufferedUpdateInterval(serverHandle: 5, updateIntervalMilliseconds: 250);
Assert.Equal(
new[]
{
"Suspend:5:11",
"Activate:5:12",
"AuthenticateUser:5:Administrator",
"ArchestrAUserToId:5:guid-1",
"AddBufferedItem:5:TestInt:TestChildObject",
"SetBufferedUpdateInterval:5:250",
},
typed.Calls);
Assert.DoesNotContain(typed.Calls, call => call.Contains("s3cret", StringComparison.Ordinal));
}
/// <summary>
/// A COM object that implements neither the typed COM interface family
/// nor <see cref="IMxAccessServer"/> fails fast with a clear
@@ -207,5 +240,60 @@ public sealed class MxAccessComServerTests
{
calls.Add($"WriteSecured2:{serverHandle}:{itemHandle}:{currentUserId}:{verifierUserId}:{value}:{timestamp}");
}
/// <summary>Records a Suspend call and returns a canned status.</summary>
/// <param name="serverHandle">The MXAccess server handle.</param>
/// <param name="itemHandle">The MXAccess item handle.</param>
public object Suspend(int serverHandle, int itemHandle)
{
calls.Add($"Suspend:{serverHandle}:{itemHandle}");
return new object();
}
/// <summary>Records an Activate call and returns a canned status.</summary>
/// <param name="serverHandle">The MXAccess server handle.</param>
/// <param name="itemHandle">The MXAccess item handle.</param>
public object Activate(int serverHandle, int itemHandle)
{
calls.Add($"Activate:{serverHandle}:{itemHandle}");
return new object();
}
/// <summary>Records an AuthenticateUser call and returns zero.</summary>
/// <param name="serverHandle">The MXAccess server handle.</param>
/// <param name="verifyUser">The user name to authenticate.</param>
/// <param name="verifyUserPassword">The credential; intentionally not recorded, so it is never echoed back.</param>
public int AuthenticateUser(int serverHandle, string verifyUser, string verifyUserPassword)
{
calls.Add($"AuthenticateUser:{serverHandle}:{verifyUser}");
return 0;
}
/// <summary>Records an ArchestrAUserToId call and returns zero.</summary>
/// <param name="serverHandle">The MXAccess server handle.</param>
/// <param name="userIdGuid">The ArchestrA user GUID to resolve.</param>
public int ArchestrAUserToId(int serverHandle, string userIdGuid)
{
calls.Add($"ArchestrAUserToId:{serverHandle}:{userIdGuid}");
return 0;
}
/// <summary>Records an AddBufferedItem call and returns zero.</summary>
/// <param name="serverHandle">The MXAccess server handle.</param>
/// <param name="itemDefinition">The item definition string to record.</param>
/// <param name="itemContext">The item context string to record.</param>
public int AddBufferedItem(int serverHandle, string itemDefinition, string itemContext)
{
calls.Add($"AddBufferedItem:{serverHandle}:{itemDefinition}:{itemContext}");
return 0;
}
/// <summary>Records a SetBufferedUpdateInterval call.</summary>
/// <param name="serverHandle">The MXAccess server handle.</param>
/// <param name="updateIntervalMilliseconds">The buffered update interval in milliseconds.</param>
public void SetBufferedUpdateInterval(int serverHandle, int updateIntervalMilliseconds)
{
calls.Add($"SetBufferedUpdateInterval:{serverHandle}:{updateIntervalMilliseconds}");
}
}
}
@@ -952,6 +952,295 @@ public sealed class MxAccessCommandExecutorTests
Assert.Null(fakeComObject.WriteServerHandle);
}
/// <summary>Verifies Suspend calls MXAccess on the STA and maps the native status to MxStatusProxy.</summary>
[Fact]
public async Task DispatchAsync_Suspend_CallsMxAccessOnStaAndMapsStatus()
{
FakeMxAccessComObject fakeComObject = new(registerHandle: 200);
FakeMxAccessComObjectFactory factory = new(fakeComObject);
using StaRuntime runtime = CreateRuntime();
using MxAccessStaSession session = new(runtime, factory, new NoopEventSink());
await session.StartAsync(workerProcessId: 1234);
await session.DispatchAsync(CreateRegisterCommand("register-before-suspend", "client-a"));
MxCommandReply reply = await session.DispatchAsync(CreateSuspendCommand("suspend-1", 200, 21));
Assert.Equal(ProtocolStatusCode.Ok, reply.ProtocolStatus.Code);
Assert.Equal(0, reply.Hresult);
Assert.NotNull(reply.Suspend);
Assert.NotNull(reply.Suspend.Status);
Assert.Equal(1, reply.Suspend.Status.Success);
Assert.Equal(MxStatusCategory.Ok, reply.Suspend.Status.Category);
Assert.Equal(200, fakeComObject.SuspendServerHandle);
Assert.Equal(21, fakeComObject.SuspendItemHandle);
Assert.Equal(runtime.StaThreadId, fakeComObject.SuspendThreadId);
}
/// <summary>Verifies Activate calls MXAccess on the STA and maps the native status to MxStatusProxy.</summary>
[Fact]
public async Task DispatchAsync_Activate_CallsMxAccessOnStaAndMapsStatus()
{
FakeMxAccessComObject fakeComObject = new(registerHandle: 201);
FakeMxAccessComObjectFactory factory = new(fakeComObject);
using StaRuntime runtime = CreateRuntime();
using MxAccessStaSession session = new(runtime, factory, new NoopEventSink());
await session.StartAsync(workerProcessId: 1234);
await session.DispatchAsync(CreateRegisterCommand("register-before-activate", "client-a"));
MxCommandReply reply = await session.DispatchAsync(CreateActivateCommand("activate-1", 201, 22));
Assert.Equal(ProtocolStatusCode.Ok, reply.ProtocolStatus.Code);
Assert.NotNull(reply.Activate);
Assert.NotNull(reply.Activate.Status);
Assert.Equal(1, reply.Activate.Status.Success);
Assert.Equal(MxStatusCategory.Ok, reply.Activate.Status.Category);
Assert.Equal(201, fakeComObject.ActivateServerHandle);
Assert.Equal(22, fakeComObject.ActivateItemHandle);
Assert.Equal(runtime.StaThreadId, fakeComObject.ActivateThreadId);
}
/// <summary>Verifies AuthenticateUser passes credentials to MXAccess on the STA and returns the user id.</summary>
[Fact]
public async Task DispatchAsync_AuthenticateUser_CallsMxAccessOnStaAndReturnsUserId()
{
FakeMxAccessComObject fakeComObject = new(registerHandle: 202);
FakeMxAccessComObjectFactory factory = new(fakeComObject);
using StaRuntime runtime = CreateRuntime();
using MxAccessStaSession session = new(runtime, factory, new NoopEventSink());
await session.StartAsync(workerProcessId: 1234);
await session.DispatchAsync(CreateRegisterCommand("register-before-auth", "client-a"));
MxCommandReply reply = await session.DispatchAsync(
CreateAuthenticateUserCommand("auth-1", 202, "Administrator", string.Empty));
Assert.Equal(ProtocolStatusCode.Ok, reply.ProtocolStatus.Code);
Assert.NotNull(reply.AuthenticateUser);
Assert.Equal(1, reply.AuthenticateUser.UserId);
Assert.Equal(202, fakeComObject.AuthenticateServerHandle);
Assert.Equal("Administrator", fakeComObject.AuthenticateUserName);
Assert.Equal(runtime.StaThreadId, fakeComObject.AuthenticateThreadId);
}
/// <summary>
/// Verifies the AuthenticateUser path never surfaces the credential in the
/// command reply (diagnostic or status message) or in the fake's recorded
/// operation log; the password is only ever handed straight to the MXAccess
/// wrapper.
/// </summary>
[Fact]
public async Task DispatchAsync_AuthenticateUser_DoesNotLeakPassword()
{
const string secret = "sup3r-secret-pw";
FakeMxAccessComObject fakeComObject = new(registerHandle: 203);
FakeMxAccessComObjectFactory factory = new(fakeComObject);
using StaRuntime runtime = CreateRuntime();
using MxAccessStaSession session = new(runtime, factory, new NoopEventSink());
await session.StartAsync(workerProcessId: 1234);
await session.DispatchAsync(CreateRegisterCommand("register-before-auth-leak", "client-a"));
MxCommandReply reply = await session.DispatchAsync(
CreateAuthenticateUserCommand("auth-leak", 203, "Administrator", secret));
// The wrapper still receives the credential verbatim...
Assert.Equal(secret, fakeComObject.AuthenticatePassword);
// ...but the reply (diagnostics, status text) and the fake's operation
// log must never contain it.
Assert.DoesNotContain(secret, reply.DiagnosticMessage ?? string.Empty, StringComparison.Ordinal);
Assert.DoesNotContain(secret, reply.ProtocolStatus.Message ?? string.Empty, StringComparison.Ordinal);
Assert.DoesNotContain(fakeComObject.OperationNames, name => name.Contains(secret, StringComparison.Ordinal));
}
/// <summary>Verifies ArchestrAUserToId calls MXAccess on the STA and returns the resolved user id.</summary>
[Fact]
public async Task DispatchAsync_ArchestrAUserToId_CallsMxAccessOnStaAndReturnsUserId()
{
FakeMxAccessComObject fakeComObject = new(registerHandle: 204);
FakeMxAccessComObjectFactory factory = new(fakeComObject);
using StaRuntime runtime = CreateRuntime();
using MxAccessStaSession session = new(runtime, factory, new NoopEventSink());
await session.StartAsync(workerProcessId: 1234);
await session.DispatchAsync(CreateRegisterCommand("register-before-user-to-id", "client-a"));
MxCommandReply reply = await session.DispatchAsync(
CreateArchestrAUserToIdCommand("user-to-id-1", 204, "11112222-3333-4444-5555-666677778888"));
Assert.Equal(ProtocolStatusCode.Ok, reply.ProtocolStatus.Code);
Assert.NotNull(reply.ArchestraUserToId);
Assert.Equal(7, reply.ArchestraUserToId.UserId);
Assert.Equal(204, fakeComObject.ArchestrAUserToIdServerHandle);
Assert.Equal("11112222-3333-4444-5555-666677778888", fakeComObject.ArchestrAUserToIdGuid);
}
/// <summary>Verifies AddBufferedItem calls MXAccess on the STA and tracks the buffered item handle.</summary>
[Fact]
public async Task DispatchAsync_AddBufferedItem_CallsMxAccessOnStaAndTracksItemHandle()
{
FakeMxAccessComObject fakeComObject = new(registerHandle: 205);
FakeMxAccessComObjectFactory factory = new(fakeComObject);
using StaRuntime runtime = CreateRuntime();
using MxAccessStaSession session = new(runtime, factory, new NoopEventSink());
await session.StartAsync(workerProcessId: 1234);
await session.DispatchAsync(CreateRegisterCommand("register-before-buffered", "client-a"));
MxCommandReply reply = await session.DispatchAsync(
CreateAddBufferedItemCommand("buffered-1", 205, "TestInt", "TestChildObject"));
Assert.Equal(ProtocolStatusCode.Ok, reply.ProtocolStatus.Code);
Assert.NotNull(reply.AddBufferedItem);
Assert.Equal(1, reply.AddBufferedItem.ItemHandle);
Assert.Equal(MxDataType.Integer, reply.ReturnValue.DataType);
Assert.Equal(1, reply.ReturnValue.Int32Value);
Assert.Equal(205, fakeComObject.AddBufferedItemServerHandle);
Assert.Equal("TestInt", fakeComObject.AddBufferedItemDefinition);
Assert.Equal("TestChildObject", fakeComObject.AddBufferedItemContext);
RegisteredItemHandle registeredItemHandle = Assert.Single(
await session.GetRegisteredItemHandlesAsync());
Assert.Equal(205, registeredItemHandle.ServerHandle);
Assert.Equal(1, registeredItemHandle.ItemHandle);
Assert.Equal("TestInt", registeredItemHandle.ItemDefinition);
Assert.Equal("TestChildObject", registeredItemHandle.ItemContext);
Assert.True(registeredItemHandle.HasItemContext);
}
/// <summary>Verifies SetBufferedUpdateInterval calls MXAccess on the STA and returns a base OK reply.</summary>
[Fact]
public async Task DispatchAsync_SetBufferedUpdateInterval_CallsMxAccessOnStaAndReturnsOk()
{
FakeMxAccessComObject fakeComObject = new(registerHandle: 206);
FakeMxAccessComObjectFactory factory = new(fakeComObject);
using StaRuntime runtime = CreateRuntime();
using MxAccessStaSession session = new(runtime, factory, new NoopEventSink());
await session.StartAsync(workerProcessId: 1234);
await session.DispatchAsync(CreateRegisterCommand("register-before-interval", "client-a"));
MxCommandReply reply = await session.DispatchAsync(
CreateSetBufferedUpdateIntervalCommand("interval-1", 206, 500));
Assert.Equal(ProtocolStatusCode.Ok, reply.ProtocolStatus.Code);
Assert.Equal(0, reply.Hresult);
Assert.Equal(206, fakeComObject.SetBufferedUpdateIntervalServerHandle);
Assert.Equal(500, fakeComObject.SetBufferedUpdateIntervalValue);
}
private static StaCommand CreateSuspendCommand(
string correlationId,
int serverHandle,
int itemHandle)
{
return new StaCommand(
"session-1",
correlationId,
new MxCommand
{
Kind = MxCommandKind.Suspend,
Suspend = new SuspendCommand
{
ServerHandle = serverHandle,
ItemHandle = itemHandle,
},
});
}
private static StaCommand CreateActivateCommand(
string correlationId,
int serverHandle,
int itemHandle)
{
return new StaCommand(
"session-1",
correlationId,
new MxCommand
{
Kind = MxCommandKind.Activate,
Activate = new ActivateCommand
{
ServerHandle = serverHandle,
ItemHandle = itemHandle,
},
});
}
private static StaCommand CreateAuthenticateUserCommand(
string correlationId,
int serverHandle,
string verifyUser,
string verifyUserPassword)
{
return new StaCommand(
"session-1",
correlationId,
new MxCommand
{
Kind = MxCommandKind.AuthenticateUser,
AuthenticateUser = new AuthenticateUserCommand
{
ServerHandle = serverHandle,
VerifyUser = verifyUser,
VerifyUserPassword = verifyUserPassword,
},
});
}
private static StaCommand CreateArchestrAUserToIdCommand(
string correlationId,
int serverHandle,
string userIdGuid)
{
return new StaCommand(
"session-1",
correlationId,
new MxCommand
{
Kind = MxCommandKind.ArchestraUserToId,
ArchestraUserToId = new ArchestrAUserToIdCommand
{
ServerHandle = serverHandle,
UserIdGuid = userIdGuid,
},
});
}
private static StaCommand CreateAddBufferedItemCommand(
string correlationId,
int serverHandle,
string itemDefinition,
string itemContext)
{
return new StaCommand(
"session-1",
correlationId,
new MxCommand
{
Kind = MxCommandKind.AddBufferedItem,
AddBufferedItem = new AddBufferedItemCommand
{
ServerHandle = serverHandle,
ItemDefinition = itemDefinition,
ItemContext = itemContext,
},
});
}
private static StaCommand CreateSetBufferedUpdateIntervalCommand(
string correlationId,
int serverHandle,
int updateIntervalMilliseconds)
{
return new StaCommand(
"session-1",
correlationId,
new MxCommand
{
Kind = MxCommandKind.SetBufferedUpdateInterval,
SetBufferedUpdateInterval = new SetBufferedUpdateIntervalCommand
{
ServerHandle = serverHandle,
UpdateIntervalMilliseconds = updateIntervalMilliseconds,
},
});
}
private static StaCommand CreateRegisterCommand(
string correlationId,
string clientName)
@@ -1810,6 +2099,151 @@ public sealed class MxAccessCommandExecutorTests
throw exception;
}
}
/// <summary>Gets the server handle passed to Suspend, if called.</summary>
public int? SuspendServerHandle { get; private set; }
/// <summary>Gets the item handle passed to Suspend, if called.</summary>
public int? SuspendItemHandle { get; private set; }
/// <summary>Gets the thread ID on which Suspend was called.</summary>
public int? SuspendThreadId { get; private set; }
/// <summary>Gets the server handle passed to Activate, if called.</summary>
public int? ActivateServerHandle { get; private set; }
/// <summary>Gets the item handle passed to Activate, if called.</summary>
public int? ActivateItemHandle { get; private set; }
/// <summary>Gets the thread ID on which Activate was called.</summary>
public int? ActivateThreadId { get; private set; }
/// <summary>Gets the server handle passed to AuthenticateUser, if called.</summary>
public int? AuthenticateServerHandle { get; private set; }
/// <summary>Gets the user name passed to AuthenticateUser, if called.</summary>
public string? AuthenticateUserName { get; private set; }
/// <summary>Gets the credential passed to AuthenticateUser, if called. Used only to prove non-logging.</summary>
public string? AuthenticatePassword { get; private set; }
/// <summary>Gets the thread ID on which AuthenticateUser was called.</summary>
public int? AuthenticateThreadId { get; private set; }
/// <summary>Gets the server handle passed to ArchestrAUserToId, if called.</summary>
public int? ArchestrAUserToIdServerHandle { get; private set; }
/// <summary>Gets the GUID passed to ArchestrAUserToId, if called.</summary>
public string? ArchestrAUserToIdGuid { get; private set; }
/// <summary>Gets the server handle passed to AddBufferedItem, if called.</summary>
public int? AddBufferedItemServerHandle { get; private set; }
/// <summary>Gets the item definition passed to AddBufferedItem, if called.</summary>
public string? AddBufferedItemDefinition { get; private set; }
/// <summary>Gets the item context passed to AddBufferedItem, if called.</summary>
public string? AddBufferedItemContext { get; private set; }
/// <summary>Gets the server handle passed to SetBufferedUpdateInterval, if called.</summary>
public int? SetBufferedUpdateIntervalServerHandle { get; private set; }
/// <summary>Gets the interval passed to SetBufferedUpdateInterval, if called.</summary>
public int? SetBufferedUpdateIntervalValue { get; private set; }
/// <summary>Suspends an item and returns a canned status whose fields drive MxStatusProxy conversion.</summary>
/// <param name="serverHandle">Server handle for the suspend.</param>
/// <param name="itemHandle">Item handle to suspend.</param>
/// <returns>A status stand-in with all-OK fields.</returns>
public object Suspend(int serverHandle, int itemHandle)
{
operationNames.Add($"Suspend:{serverHandle}:{itemHandle}");
SuspendServerHandle = serverHandle;
SuspendItemHandle = itemHandle;
SuspendThreadId = Environment.CurrentManagedThreadId;
return new FakeMxStatus { success = 1, category = 0, detectedBy = 0, detail = 0 };
}
/// <summary>Activates an item and returns a canned status whose fields drive MxStatusProxy conversion.</summary>
/// <param name="serverHandle">Server handle for the activate.</param>
/// <param name="itemHandle">Item handle to activate.</param>
/// <returns>A status stand-in with all-OK fields.</returns>
public object Activate(int serverHandle, int itemHandle)
{
operationNames.Add($"Activate:{serverHandle}:{itemHandle}");
ActivateServerHandle = serverHandle;
ActivateItemHandle = itemHandle;
ActivateThreadId = Environment.CurrentManagedThreadId;
return new FakeMxStatus { success = 1, category = 0, detectedBy = 0, detail = 0 };
}
/// <summary>Authenticates a user and returns a canned user id.</summary>
/// <param name="serverHandle">Server handle for the authentication.</param>
/// <param name="verifyUser">User name to authenticate.</param>
/// <param name="verifyUserPassword">Credential; recorded only to assert it is never logged.</param>
/// <returns>The canned MXAccess user id (1).</returns>
public int AuthenticateUser(int serverHandle, string verifyUser, string verifyUserPassword)
{
// Deliberately does NOT include the password in the operation log.
operationNames.Add($"AuthenticateUser:{serverHandle}:{verifyUser}");
AuthenticateServerHandle = serverHandle;
AuthenticateUserName = verifyUser;
AuthenticatePassword = verifyUserPassword;
AuthenticateThreadId = Environment.CurrentManagedThreadId;
return 1;
}
/// <summary>Resolves an ArchestrA user GUID and returns a canned user id.</summary>
/// <param name="serverHandle">Server handle for the resolution.</param>
/// <param name="userIdGuid">ArchestrA user GUID to resolve.</param>
/// <returns>The canned MXAccess user id (7).</returns>
public int ArchestrAUserToId(int serverHandle, string userIdGuid)
{
operationNames.Add($"ArchestrAUserToId:{serverHandle}:{userIdGuid}");
ArchestrAUserToIdServerHandle = serverHandle;
ArchestrAUserToIdGuid = userIdGuid;
return 7;
}
/// <summary>Adds a buffered item and returns a canned item handle.</summary>
/// <param name="serverHandle">Server handle to add the item to.</param>
/// <param name="itemDefinition">Item definition string.</param>
/// <param name="itemContext">Item context string.</param>
/// <returns>The canned buffered item handle (1).</returns>
public int AddBufferedItem(int serverHandle, string itemDefinition, string itemContext)
{
operationNames.Add($"AddBufferedItem:{serverHandle}:{itemDefinition}:{itemContext}");
AddBufferedItemServerHandle = serverHandle;
AddBufferedItemDefinition = itemDefinition;
AddBufferedItemContext = itemContext;
return 1;
}
/// <summary>Sets the buffered update interval and tracks the operation.</summary>
/// <param name="serverHandle">Server handle for the interval change.</param>
/// <param name="updateIntervalMilliseconds">Buffered update interval in milliseconds.</param>
public void SetBufferedUpdateInterval(int serverHandle, int updateIntervalMilliseconds)
{
operationNames.Add($"SetBufferedUpdateInterval:{serverHandle}:{updateIntervalMilliseconds}");
SetBufferedUpdateIntervalServerHandle = serverHandle;
SetBufferedUpdateIntervalValue = updateIntervalMilliseconds;
}
/// <summary>Status stand-in reflected over by the worker's MxStatusProxy converter.</summary>
internal sealed class FakeMxStatus
{
/// <summary>Success indicator read by the status converter.</summary>
public int success;
/// <summary>Status category read by the status converter.</summary>
public int category;
/// <summary>Status detected-by read by the status converter.</summary>
public int detectedBy;
/// <summary>Status detail read by the status converter.</summary>
public int detail;
}
}
/// <summary>Factory for creating fake MXAccess COM objects in tests.</summary>
@@ -122,11 +122,26 @@ internal sealed class FakeRuntimeSession : IWorkerRuntimeSession
}
}
/// <summary>
/// When set, <see cref="DrainEvents"/> returns no events whenever
/// <c>maxEvents</c> equals this value. It is meant to mirror
/// <c>WorkerPipeSession.EventDrainBatchSize</c>, so the WorkerPipeSession
/// background drain loop's fixed-size batches come back empty. That lets an
/// explicit DrainEvents control command (which drains everything via
/// <c>maxEvents == 0</c>) claim the queued events deterministically,
/// without racing the 25 ms background loop.
/// </summary>
public uint? SuppressDrainForBatchSize { get; set; }
/// <summary>Drains queued events up to the specified limit.</summary>
/// <param name="maxEvents">Maximum events to drain; 0 drains all.</param>
/// <returns>The drained events.</returns>
public IReadOnlyList<WorkerEvent> DrainEvents(uint maxEvents)
{
if (SuppressDrainForBatchSize is uint suppressed && maxEvents == suppressed)
{
return Array.Empty<WorkerEvent>();
}
lock (gate)
{
int drainCount = maxEvents == 0
@@ -1,3 +1,4 @@
using ZB.MOM.WW.MxGateway.Worker.Conversion;
using ZB.MOM.WW.MxGateway.Worker.MxAccess;
namespace ZB.MOM.WW.MxGateway.Worker.Tests.TestSupport;
@@ -55,14 +56,10 @@ internal sealed class NoopMxAccessServer : IMxAccessServer
}
/// <inheritdoc />
public void Suspend(int serverHandle, int itemHandle)
{
}
public object Suspend(int serverHandle, int itemHandle) => new FakeMxStatus();
/// <inheritdoc />
public void Activate(int serverHandle, int itemHandle)
{
}
public object Activate(int serverHandle, int itemHandle) => new FakeMxStatus();
/// <inheritdoc />
public void Write(int serverHandle, int itemHandle, object? value, int userId)
@@ -85,8 +82,34 @@ internal sealed class NoopMxAccessServer : IMxAccessServer
}
/// <inheritdoc />
public int AuthenticateUser(string userName, string password) => 0;
public int AuthenticateUser(int serverHandle, string verifyUser, string verifyUserPassword) => 0;
/// <inheritdoc />
public int ArchestrAUserToId(string userName) => 0;
public int ArchestrAUserToId(int serverHandle, string userIdGuid) => 0;
}
/// <summary>
/// Minimal stand-in for the native <c>ArchestrA.MxAccess.MxStatus</c> struct.
/// <see cref="MxStatusProxyConverter"/> reflects over the public
/// <c>success</c>, <c>category</c>, <c>detectedBy</c>, and <c>detail</c>
/// fields, so this fake exposes the same field shape. Every field keeps its
/// default value of zero.
/// </summary>
internal sealed class FakeMxStatus
{
// These public fields exist solely so MxStatusProxyConverter can reflect
// over them by name. They are never assigned in code (NoopMxAccessServer
// returns a default instance) and are read only through reflection, so the
// compiler reports CS0649 ("field is never assigned"); suppress it here.
#pragma warning disable CS0649
/// <summary>Success indicator field read by the status converter.</summary>
public int success;
/// <summary>Status category field read by the status converter.</summary>
public int category;
/// <summary>Status detected-by field read by the status converter.</summary>
public int detectedBy;
/// <summary>Status detail field read by the status converter.</summary>
public int detail;
#pragma warning restore CS0649
}
@@ -378,6 +378,22 @@ public sealed class WorkerPipeSession
switch (envelope.BodyCase)
{
case WorkerEnvelope.BodyOneofCase.WorkerCommand:
// Worker control/lifecycle commands (Ping, GetSessionState,
// GetWorkerInfo, DrainEvents, ShutdownWorker) are answered here
// on the message-loop thread instead of being dispatched onto
// the STA. Their replies are built from process-level state
// (worker process id, assembly version, _state, the runtime
// session's event queue) that the STA-bound
// MxAccessCommandExecutor cannot see, and ShutdownWorker must
// return its OK reply BEFORE the graceful shutdown joins the
// STA thread — running it on the STA would deadlock. Returning
// false from the ShutdownWorker arm stops the read loop exactly
// as a WorkerShutdown envelope would.
if (IsControlCommand(envelope.WorkerCommand?.Command?.Kind ?? MxCommandKind.Unspecified))
{
return await HandleControlCommandAsync(envelope, cancellationToken).ConfigureAwait(false);
}
TryStartCommandTask(envelope, cancellationToken);
return true;
case WorkerEnvelope.BodyOneofCase.WorkerShutdown:
@@ -393,6 +409,175 @@ public sealed class WorkerPipeSession
}
}
private static bool IsControlCommand(MxCommandKind kind)
{
return kind switch
{
MxCommandKind.Ping => true,
MxCommandKind.GetSessionState => true,
MxCommandKind.GetWorkerInfo => true,
MxCommandKind.DrainEvents => true,
MxCommandKind.ShutdownWorker => true,
_ => false,
};
}
/// <summary>
/// Answers a worker control/lifecycle command on the message-loop
/// thread (never on the STA). Returns <c>false</c> only for
/// <see cref="MxCommandKind.ShutdownWorker"/>: after writing its OK
/// reply, it drives the same graceful-shutdown path a
/// <c>WorkerShutdown</c> envelope would, then signals the read loop to
/// stop. All other control commands return <c>true</c> to keep reading.
/// </summary>
private async Task<bool> HandleControlCommandAsync(
WorkerEnvelope envelope,
CancellationToken cancellationToken)
{
WorkerCommand workerCommand = envelope.WorkerCommand;
MxCommand command = workerCommand.Command;
string correlationId = envelope.CorrelationId;
if (command.Kind == MxCommandKind.ShutdownWorker)
{
// Build and emit the OK reply BEFORE triggering shutdown so the
// gateway's correlation-id wait is satisfied even though the
// graceful shutdown below tears the session (and pipe) down.
MxCommandReply shutdownReply = CreateControlOkReply(correlationId, command.Kind);
await WriteControlReplyAsync(shutdownReply, cancellationToken).ConfigureAwait(false);
WorkerShutdown shutdown = new();
if (command.ShutdownWorker?.GracePeriod is not null)
{
shutdown.GracePeriod = command.ShutdownWorker.GracePeriod;
}
shutdown.Reason = "ShutdownWorker command";
await ShutdownAsync(shutdown, cancellationToken).ConfigureAwait(false);
return false;
}
MxCommandReply reply = command.Kind switch
{
MxCommandKind.Ping => CreatePingReply(correlationId, command),
MxCommandKind.GetSessionState => CreateSessionStateReply(correlationId, command.Kind),
MxCommandKind.GetWorkerInfo => CreateWorkerInfoReply(correlationId, command.Kind),
MxCommandKind.DrainEvents => CreateDrainEventsReply(correlationId, command),
_ => CreateControlOkReply(correlationId, command.Kind),
};
await WriteControlReplyAsync(reply, cancellationToken).ConfigureAwait(false);
return true;
}
private Task WriteControlReplyAsync(
MxCommandReply reply,
CancellationToken cancellationToken)
{
return _writer.WriteAsync(
CreateEnvelope(new WorkerCommandReply
{
Reply = reply,
CompletedTimestamp = Timestamp.FromDateTime(DateTime.UtcNow),
}),
cancellationToken);
}
private MxCommandReply CreatePingReply(string correlationId, MxCommand command)
{
MxCommandReply reply = CreateControlOkReply(correlationId, command.Kind);
// Echo the ping message back through the base reply's diagnostic
// message field (there is no dedicated PingReply payload). An empty
// message leaves the diagnostic field at its proto3 default.
string? message = command.Ping?.Message;
if (!string.IsNullOrEmpty(message))
{
reply.DiagnosticMessage = message;
}
return reply;
}
private MxCommandReply CreateSessionStateReply(string correlationId, MxCommandKind kind)
{
MxCommandReply reply = CreateControlOkReply(correlationId, kind);
reply.SessionState = new SessionStateReply
{
State = MapWorkerStateToSessionState(_state),
};
return reply;
}
private MxCommandReply CreateWorkerInfoReply(string correlationId, MxCommandKind kind)
{
MxCommandReply reply = CreateControlOkReply(correlationId, kind);
reply.WorkerInfo = new WorkerInfoReply
{
WorkerProcessId = _processIdProvider(),
WorkerVersion = typeof(WorkerPipeSession).Assembly.GetName().Version?.ToString() ?? string.Empty,
MxaccessProgid = MxAccessInteropInfo.ProgId,
MxaccessClsid = MxAccessInteropInfo.Clsid,
};
return reply;
}
private MxCommandReply CreateDrainEventsReply(string correlationId, MxCommand command)
{
MxCommandReply reply = CreateControlOkReply(correlationId, command.Kind);
DrainEventsReply drainReply = new();
IWorkerRuntimeSession? runtimeSession = _runtimeSession;
if (runtimeSession is not null)
{
uint maxEvents = command.DrainEvents?.MaxEvents ?? 0;
foreach (WorkerEvent workerEvent in runtimeSession.DrainEvents(maxEvents))
{
if (workerEvent.Event is not null)
{
drainReply.Events.Add(workerEvent.Event);
}
}
}
reply.DrainEvents = drainReply;
return reply;
}
private MxCommandReply CreateControlOkReply(string correlationId, MxCommandKind kind)
{
return new MxCommandReply
{
SessionId = _options.SessionId,
CorrelationId = correlationId,
Kind = kind,
Hresult = 0,
ProtocolStatus = new ProtocolStatus
{
Code = ProtocolStatusCode.Ok,
Message = "OK",
},
};
}
private static SessionState MapWorkerStateToSessionState(WorkerState state)
{
return state switch
{
WorkerState.Starting => SessionState.StartingWorker,
WorkerState.Handshaking => SessionState.Handshaking,
WorkerState.InitializingSta => SessionState.InitializingWorker,
WorkerState.Ready => SessionState.Ready,
// A control command is being served, so the STA is alive and ready; the
// busy state is incidental, not a distinct lifecycle state.
WorkerState.ExecutingCommand => SessionState.Ready,
WorkerState.ShuttingDown => SessionState.Closing,
WorkerState.Stopped => SessionState.Closed,
WorkerState.Faulted => SessionState.Faulted,
_ => SessionState.Unspecified,
};
}
private async Task ProcessCommandAsync(
WorkerEnvelope envelope,
CancellationToken cancellationToken)
@@ -57,6 +57,68 @@ public interface IMxAccessServer
int serverHandle,
int itemHandle);
/// <summary>Suspends data acquisition for an advised item (ILMXProxyServer4).</summary>
/// <param name="serverHandle">Server handle identifying the registration.</param>
/// <param name="itemHandle">Item handle to suspend.</param>
/// <returns>
/// The native MXAccess <c>MxStatus</c> value (boxed) produced by the call.
/// Callers convert it to a protobuf <c>MxStatusProxy</c> via the worker's
/// status converter; the underlying type is reflected over, not cast.
/// </returns>
object Suspend(
int serverHandle,
int itemHandle);
/// <summary>Reactivates data acquisition for a suspended item (ILMXProxyServer4).</summary>
/// <param name="serverHandle">Server handle identifying the registration.</param>
/// <param name="itemHandle">Item handle to activate.</param>
/// <returns>
/// The native MXAccess <c>MxStatus</c> value (boxed) produced by the call.
/// Callers convert it to a protobuf <c>MxStatusProxy</c> via the worker's
/// status converter; the underlying type is reflected over, not cast.
/// </returns>
object Activate(
int serverHandle,
int itemHandle);
/// <summary>Authenticates an MXAccess user and returns its user id (base ILMXProxyServer).</summary>
/// <param name="serverHandle">Server handle identifying the registration.</param>
/// <param name="verifyUser">MXAccess user name to authenticate.</param>
/// <param name="verifyUserPassword">
/// Raw MXAccess credential. Implementations must keep this value out of
/// logs, metrics, command lines, and diagnostics.
/// </param>
/// <returns>The MXAccess user id for the authenticated user.</returns>
int AuthenticateUser(
int serverHandle,
string verifyUser,
string verifyUserPassword);
/// <summary>Resolves an ArchestrA user GUID to an MXAccess user id (ILMXProxyServer2).</summary>
/// <param name="serverHandle">Server handle identifying the registration.</param>
/// <param name="userIdGuid">ArchestrA user GUID to resolve.</param>
/// <returns>The MXAccess user id for the resolved user.</returns>
int ArchestrAUserToId(
int serverHandle,
string userIdGuid);
/// <summary>Adds a buffered item to a server and returns an item handle (ILMXProxyServer5).</summary>
/// <param name="serverHandle">Server handle identifying the registration.</param>
/// <param name="itemDefinition">Item definition string.</param>
/// <param name="itemContext">Item context string.</param>
/// <returns>Item handle for the added buffered item.</returns>
int AddBufferedItem(
int serverHandle,
string itemDefinition,
string itemContext);
/// <summary>Sets the buffered-update interval for a server (ILMXProxyServer5).</summary>
/// <param name="serverHandle">Server handle identifying the registration.</param>
/// <param name="updateIntervalMilliseconds">Buffered update interval in milliseconds.</param>
void SetBufferedUpdateInterval(
int serverHandle,
int updateIntervalMilliseconds);
/// <summary>Writes a value to an item.</summary>
/// <param name="serverHandle">Server handle identifying the registration.</param>
/// <param name="itemHandle">Item handle to write to.</param>
@@ -140,6 +140,89 @@ public sealed class MxAccessComServer : IMxAccessServer
AsProxyServer4().AdviseSupervisory(serverHandle, itemHandle);
}
/// <inheritdoc />
public object Suspend(
int serverHandle,
int itemHandle)
{
if (mxAccessComObject is IMxAccessServer typedFake)
{
return typedFake.Suspend(serverHandle, itemHandle);
}
AsProxyServer4().Suspend(serverHandle, itemHandle, out MxStatus status);
return status;
}
/// <inheritdoc />
public object Activate(
int serverHandle,
int itemHandle)
{
if (mxAccessComObject is IMxAccessServer typedFake)
{
return typedFake.Activate(serverHandle, itemHandle);
}
AsProxyServer4().Activate(serverHandle, itemHandle, out MxStatus status);
return status;
}
/// <inheritdoc />
public int AuthenticateUser(
int serverHandle,
string verifyUser,
string verifyUserPassword)
{
if (mxAccessComObject is IMxAccessServer typedFake)
{
return typedFake.AuthenticateUser(serverHandle, verifyUser, verifyUserPassword);
}
return AsProxyServer().AuthenticateUser(serverHandle, verifyUser, verifyUserPassword);
}
/// <inheritdoc />
public int ArchestrAUserToId(
int serverHandle,
string userIdGuid)
{
if (mxAccessComObject is IMxAccessServer typedFake)
{
return typedFake.ArchestrAUserToId(serverHandle, userIdGuid);
}
return AsProxyServer2().ArchestrAUserToId(serverHandle, userIdGuid);
}
/// <inheritdoc />
public int AddBufferedItem(
int serverHandle,
string itemDefinition,
string itemContext)
{
if (mxAccessComObject is IMxAccessServer typedFake)
{
return typedFake.AddBufferedItem(serverHandle, itemDefinition, itemContext);
}
return AsProxyServer5().AddBufferedItem(serverHandle, itemDefinition, itemContext);
}
/// <inheritdoc />
public void SetBufferedUpdateInterval(
int serverHandle,
int updateIntervalMilliseconds)
{
if (mxAccessComObject is IMxAccessServer typedFake)
{
typedFake.SetBufferedUpdateInterval(serverHandle, updateIntervalMilliseconds);
return;
}
AsProxyServer5().SetBufferedUpdateInterval(serverHandle, updateIntervalMilliseconds);
}
/// <inheritdoc />
public void Write(
int serverHandle,
@@ -216,6 +299,14 @@ public sealed class MxAccessComServer : IMxAccessServer
+ $"{nameof(ILMXProxyServer)} or {nameof(IMxAccessServer)}.");
}
private ILMXProxyServer2 AsProxyServer2()
{
return mxAccessComObject as ILMXProxyServer2
?? throw new InvalidOperationException(
$"MXAccess COM object of type '{mxAccessComObject.GetType().FullName}' does not implement "
+ $"{nameof(ILMXProxyServer2)} or {nameof(IMxAccessServer)}.");
}
private ILMXProxyServer3 AsProxyServer3()
{
return mxAccessComObject as ILMXProxyServer3
@@ -231,4 +322,12 @@ public sealed class MxAccessComServer : IMxAccessServer
$"MXAccess COM object of type '{mxAccessComObject.GetType().FullName}' does not implement "
+ $"{nameof(ILMXProxyServer4)} or {nameof(IMxAccessServer)}.");
}
private ILMXProxyServer5 AsProxyServer5()
{
return mxAccessComObject as ILMXProxyServer5
?? throw new InvalidOperationException(
$"MXAccess COM object of type '{mxAccessComObject.GetType().FullName}' does not implement "
+ $"{nameof(ILMXProxyServer5)} or {nameof(IMxAccessServer)}.");
}
}
@@ -16,6 +16,7 @@ public sealed class MxAccessCommandExecutor : IStaCommandExecutor
private readonly MxAccessSession session;
private readonly VariantConverter variantConverter;
private readonly MxStatusProxyConverter statusProxyConverter;
private readonly IAlarmCommandHandler? alarmCommandHandler;
private readonly Action pumpStep;
@@ -78,6 +79,7 @@ public sealed class MxAccessCommandExecutor : IStaCommandExecutor
{
this.session = session ?? throw new ArgumentNullException(nameof(session));
this.variantConverter = variantConverter ?? throw new ArgumentNullException(nameof(variantConverter));
this.statusProxyConverter = new MxStatusProxyConverter();
this.alarmCommandHandler = alarmCommandHandler;
this.pumpStep = pumpStep ?? (static () => { });
}
@@ -104,6 +106,12 @@ public sealed class MxAccessCommandExecutor : IStaCommandExecutor
MxCommandKind.Advise => ExecuteAdvise(command),
MxCommandKind.UnAdvise => ExecuteUnAdvise(command),
MxCommandKind.AdviseSupervisory => ExecuteAdviseSupervisory(command),
MxCommandKind.Suspend => ExecuteSuspend(command),
MxCommandKind.Activate => ExecuteActivate(command),
MxCommandKind.AuthenticateUser => ExecuteAuthenticateUser(command),
MxCommandKind.ArchestraUserToId => ExecuteArchestrAUserToId(command),
MxCommandKind.AddBufferedItem => ExecuteAddBufferedItem(command),
MxCommandKind.SetBufferedUpdateInterval => ExecuteSetBufferedUpdateInterval(command),
MxCommandKind.Write => ExecuteWrite(command),
MxCommandKind.Write2 => ExecuteWrite2(command),
MxCommandKind.WriteSecured => ExecuteWriteSecured(command),
@@ -262,6 +270,134 @@ public sealed class MxAccessCommandExecutor : IStaCommandExecutor
return CreateOkReply(command);
}
private MxCommandReply ExecuteSuspend(StaCommand command)
{
if (command.Command.PayloadCase != MxCommand.PayloadOneofCase.Suspend)
{
return CreateInvalidRequestReply(command, "Suspend command payload is required.");
}
SuspendCommand suspendCommand = command.Command.Suspend;
object nativeStatus = session.Suspend(
suspendCommand.ServerHandle,
suspendCommand.ItemHandle);
MxCommandReply reply = CreateOkReply(command);
reply.Suspend = new SuspendReply
{
Status = statusProxyConverter.Convert(nativeStatus),
};
return reply;
}
private MxCommandReply ExecuteActivate(StaCommand command)
{
if (command.Command.PayloadCase != MxCommand.PayloadOneofCase.Activate)
{
return CreateInvalidRequestReply(command, "Activate command payload is required.");
}
ActivateCommand activateCommand = command.Command.Activate;
object nativeStatus = session.Activate(
activateCommand.ServerHandle,
activateCommand.ItemHandle);
MxCommandReply reply = CreateOkReply(command);
reply.Activate = new ActivateReply
{
Status = statusProxyConverter.Convert(nativeStatus),
};
return reply;
}
private MxCommandReply ExecuteAuthenticateUser(StaCommand command)
{
if (command.Command.PayloadCase != MxCommand.PayloadOneofCase.AuthenticateUser)
{
return CreateInvalidRequestReply(command, "AuthenticateUser command payload is required.");
}
AuthenticateUserCommand authenticateUserCommand = command.Command.AuthenticateUser;
// The credential (verify_user_password) is passed straight to MXAccess
// and is never written to logs, diagnostics, or the reply. MXAccess is
// allowed to fail authentication; the native HResult is surfaced by the
// dispatcher's exception path.
int userId = session.AuthenticateUser(
authenticateUserCommand.ServerHandle,
authenticateUserCommand.VerifyUser,
authenticateUserCommand.VerifyUserPassword);
MxCommandReply reply = CreateOkReply(command);
reply.AuthenticateUser = new AuthenticateUserReply
{
UserId = userId,
};
return reply;
}
private MxCommandReply ExecuteArchestrAUserToId(StaCommand command)
{
if (command.Command.PayloadCase != MxCommand.PayloadOneofCase.ArchestraUserToId)
{
return CreateInvalidRequestReply(command, "ArchestrAUserToId command payload is required.");
}
ArchestrAUserToIdCommand archestrAUserToIdCommand = command.Command.ArchestraUserToId;
int userId = session.ArchestrAUserToId(
archestrAUserToIdCommand.ServerHandle,
archestrAUserToIdCommand.UserIdGuid);
MxCommandReply reply = CreateOkReply(command);
reply.ArchestraUserToId = new ArchestrAUserToIdReply
{
UserId = userId,
};
return reply;
}
private MxCommandReply ExecuteAddBufferedItem(StaCommand command)
{
if (command.Command.PayloadCase != MxCommand.PayloadOneofCase.AddBufferedItem)
{
return CreateInvalidRequestReply(command, "AddBufferedItem command payload is required.");
}
AddBufferedItemCommand addBufferedItemCommand = command.Command.AddBufferedItem;
int itemHandle = session.AddBufferedItem(
addBufferedItemCommand.ServerHandle,
addBufferedItemCommand.ItemDefinition,
addBufferedItemCommand.ItemContext);
MxCommandReply reply = CreateOkReply(command);
reply.ReturnValue = variantConverter.Convert(itemHandle);
reply.AddBufferedItem = new AddBufferedItemReply
{
ItemHandle = itemHandle,
};
return reply;
}
private MxCommandReply ExecuteSetBufferedUpdateInterval(StaCommand command)
{
if (command.Command.PayloadCase != MxCommand.PayloadOneofCase.SetBufferedUpdateInterval)
{
return CreateInvalidRequestReply(command, "SetBufferedUpdateInterval command payload is required.");
}
SetBufferedUpdateIntervalCommand setBufferedUpdateIntervalCommand = command.Command.SetBufferedUpdateInterval;
session.SetBufferedUpdateInterval(
setBufferedUpdateIntervalCommand.ServerHandle,
setBufferedUpdateIntervalCommand.UpdateIntervalMilliseconds);
return CreateOkReply(command);
}
private MxCommandReply ExecuteWrite(StaCommand command)
{
if (command.Command.PayloadCase != MxCommand.PayloadOneofCase.Write)
@@ -300,6 +300,94 @@ public sealed class MxAccessSession : IDisposable
MxAccessAdviceKind.Supervisory);
}
/// <summary>Suspends data acquisition for an advised item and returns the native MXAccess status.</summary>
/// <param name="serverHandle">Handle returned by the worker.</param>
/// <param name="itemHandle">Handle returned by the worker.</param>
/// <returns>The boxed native MXAccess status produced by the call.</returns>
public object Suspend(
int serverHandle,
int itemHandle)
{
ThrowIfDisposed();
return mxAccessServer.Suspend(serverHandle, itemHandle);
}
/// <summary>Reactivates data acquisition for a suspended item and returns the native MXAccess status.</summary>
/// <param name="serverHandle">Handle returned by the worker.</param>
/// <param name="itemHandle">Handle returned by the worker.</param>
/// <returns>The boxed native MXAccess status produced by the call.</returns>
public object Activate(
int serverHandle,
int itemHandle)
{
ThrowIfDisposed();
return mxAccessServer.Activate(serverHandle, itemHandle);
}
/// <summary>Authenticates an MXAccess user and returns its user id.</summary>
/// <param name="serverHandle">Handle returned by the worker.</param>
/// <param name="verifyUser">MXAccess user name to authenticate.</param>
/// <param name="verifyUserPassword">Raw MXAccess credential; never logged.</param>
/// <returns>The MXAccess user id for the authenticated user.</returns>
public int AuthenticateUser(
int serverHandle,
string verifyUser,
string verifyUserPassword)
{
ThrowIfDisposed();
return mxAccessServer.AuthenticateUser(serverHandle, verifyUser, verifyUserPassword);
}
/// <summary>Resolves an ArchestrA user GUID to an MXAccess user id.</summary>
/// <param name="serverHandle">Handle returned by the worker.</param>
/// <param name="userIdGuid">ArchestrA user GUID to resolve.</param>
/// <returns>The MXAccess user id for the resolved user.</returns>
public int ArchestrAUserToId(
int serverHandle,
string userIdGuid)
{
ThrowIfDisposed();
return mxAccessServer.ArchestrAUserToId(serverHandle, userIdGuid);
}
/// <summary>Adds a buffered item to an MXAccess server and returns the item handle.</summary>
/// <param name="serverHandle">Handle returned by the worker.</param>
/// <param name="itemDefinition">Definition or address of the item to add.</param>
/// <param name="itemContext">Context string for the item.</param>
/// <returns>The MXAccess item handle assigned to the added buffered item.</returns>
public int AddBufferedItem(
int serverHandle,
string itemDefinition,
string itemContext)
{
ThrowIfDisposed();
int itemHandle = mxAccessServer.AddBufferedItem(serverHandle, itemDefinition, itemContext);
handleRegistry.RegisterItemHandle(
serverHandle,
itemHandle,
itemDefinition,
itemContext,
hasItemContext: true);
return itemHandle;
}
/// <summary>Sets the buffered-update interval for an MXAccess server.</summary>
/// <param name="serverHandle">Handle returned by the worker.</param>
/// <param name="updateIntervalMilliseconds">Buffered update interval in milliseconds.</param>
public void SetBufferedUpdateInterval(
int serverHandle,
int updateIntervalMilliseconds)
{
ThrowIfDisposed();
mxAccessServer.SetBufferedUpdateInterval(serverHandle, updateIntervalMilliseconds);
}
/// <summary>Writes a value to an item.</summary>
/// <param name="serverHandle">Handle returned by the worker.</param>
/// <param name="itemHandle">Handle returned by the worker.</param>
@@ -0,0 +1,162 @@
# Still Pending — Deferred / Partial / Unfinished / Missing Functionality
**Generated:** 2026-06-15 · **Commit:** `c7f754c` (main) · **Method:** six parallel read-only audits (Server, Worker, Contracts/proto, all five clients, docs/design/plans, tests + review backlog). Every item cites a verified `file:line`.
> **Resolution update (2026-06-15, branch `feat/stillpending-completion`):** The actionable items were implemented and verified per `docs/plans/2026-06-15-stillpending-completion.md`. **§1.1** (all 11 worker command kinds), **§1.2** (audit CorrelationId), and the **§4** client CLI/helper parity gaps are **Resolved** — see per-item annotations below. Worker COM commands are live-verified on the dev rig (`efd9971`, `f7ada90`). Remaining open items are the documented residuals (**§1.3**, **§1.4**, the **§3** vendor/capture-gated questions incl. the new **§3.2** multi-sample buffered residual) and the deliberate v1 scope of **§2**. Zero `.proto` changes were needed (all reply messages already existed).
## How to read this
Items are graded by what they actually are, because most "pending" surface in this codebase is **deliberate v1 scope**, not accidental:
- 🔴 **Genuine gap** — real unfinished/missing functionality with user-visible impact; a candidate to actually build.
- 🟠 **Parity hole** — declared in-scope (proto/design) but not wired through; breaks "MXAccess parity" or cross-client parity.
- 🟡 **Open question / vendor-gated** — intentionally incomplete, awaiting a live MXAccess capture or an AVEVA fix; raw data preserved meanwhile.
- 🔵 **Intentional v1 scope** — deliberately deferred and documented; listed so it's catalogued, not because it's broken.
- ⚪ **Verification gap** — code exists but is unverified by default (opt-in/live tests).
- 📄 **Stale doc / dead code** — prose or code that lags reality.
---
## 1. Genuine gaps (real unfinished functionality)
### ✅ 1.1 — Worker does not implement 11 declared command kinds *(biggest real gap)* — **RESOLVED**
- **Resolved:** all 11 are now implemented. The **5 control commands** (`Ping`, `GetSessionState`, `GetWorkerInfo`, `DrainEvents`, `ShutdownWorker`) are handled off the STA in `WorkerPipeSession` (`bf72cd8`): running `ShutdownWorker` on the STA would deadlock, and these commands read pipe-session state. The **6 COM commands** (`Suspend`, `Activate`, `AuthenticateUser`, `ArchestrAUserToId`, `AddBufferedItem`, `SetBufferedUpdateInterval`) are implemented in `MxAccessCommandExecutor` (STA-dispatched) via new `IMxAccessServer`/`MxAccessComServer` wrappers that select `ILMXProxyServer2/4/5` (`2939932`). Live-verified on the dev rig (`efd9971`, `f7ada90`): `ArchestrAUserToId`→Ok(user_id=1), `AddBufferedItem`/`SetBufferedUpdateInterval`→Ok; `AuthenticateUser`/`Suspend`/`Activate`→real `MxaccessFailure`/HResult (parity, not `INVALID_REQUEST`). `FakeWorkerHarness` now answers the control commands, so the default gateway suite covers them (`bb5139f`). Note: `DrainEvents` is a *diagnostic snapshot*. It competes with the 25 ms background stream-drain loop, so with an active event stream it returns only events not yet pushed. Nothing is lost or drained twice, because the queue drain is lock-protected and destructive.
- **Original finding below (for history):**
- **Location:** `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessCommandExecutor.cs:97-128` (the `Execute` switch; everything else falls to `_ => CreateInvalidRequestReply(...)`).
- **What's missing:** the proto `MxCommandKind` enum defines these, the gateway *validates* (`Grpc/MxAccessGrpcRequestValidator.cs:86-95`), *scopes* (`Security/Authorization/GatewayGrpcScopeResolver.cs:45-47`), and *routes* them, and reply messages exist — but the worker answers `INVALID_REQUEST`:
- **MXAccess COM commands (parity-critical):** `AddBufferedItem`, `SetBufferedUpdateInterval`, `Suspend`, `Activate`, `AuthenticateUser`, `ArchestrAUserToId` (`mxaccess_gateway.proto:107-116,151-160`).
- **Worker control/lifecycle:** `Ping`, `GetSessionState`, `GetWorkerInfo`, `DrainEvents`, `ShutdownWorker` (`mxaccess_gateway.proto:133-137,177-181`).
- **Why it matters:** `gateway.md:890-899`, `docs/MxAccessWorkerInstanceDesign.md:424-433`, and `docs/DesignDecisions.md:169-173` list all six COM commands under **"Phase 4: Full Command Surface"** with the exit criterion *"gRPC command surface covers the installed MXAccess public method set."* So this is a declared-in-scope phase that isn't finished, not a v1 cut.
- **Masked by tests:** the control kinds (`Ping`/`GetWorkerInfo`/`DrainEvents`/`ShutdownWorker`) are exercised only through `FakeWorkerHarness`, so the unit suite is green while the *real* worker can't answer them. The live integration test `WorkerLiveMxAccessSmokeTests.cs:931` even sends `AuthenticateUser`, which would get an invalid-request reply today.
- **Note:** `OnBufferedDataChange` *event* mapping IS fully wired (`Conversion`/`MxAccessEventMapper.cs:231-254`) — but with `AddBufferedItem`/`SetBufferedUpdateInterval` unimplemented there is no way to set buffered eventing up.
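The `DrainEvents` behavior described in the resolution note depends on both drainers sharing one destructive, lock-protected drain. A minimal sketch of that property, assuming a simple in-memory queue; `WorkerEventQueue<T>` and `DrainAll` are hypothetical names, not the worker's actual types:

```csharp
using System.Collections.Generic;

// Hypothetical sketch (not the worker's real queue): the drain is destructive
// and lock-protected, so two competing drainers (the background stream-drain
// loop and an on-demand DrainEvents command) never receive the same event.
public sealed class WorkerEventQueue<T>
{
    private readonly object gate = new object();
    private readonly Queue<T> pending = new Queue<T>();

    public void Enqueue(T item)
    {
        lock (gate)
        {
            pending.Enqueue(item);
        }
    }

    // Removes and returns every event queued so far, in FIFO order.
    // Whichever caller takes the lock first receives those events; later
    // callers see only events that arrived after that drain.
    public List<T> DrainAll()
    {
        lock (gate)
        {
            var drained = new List<T>(pending);
            pending.Clear();
            return drained;
        }
    }
}
```

Because each drain removes exactly what it returns under the same lock, a snapshot taken while the background loop is running can only contain events the loop has not pushed yet.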
### ✅ 1.2 — Constraint-denial audit events drop CorrelationId — **RESOLVED**
- **Resolved (`8415f35`, `55526d5`):** `request.ClientCorrelationId` is threaded from `Invoke` → `ApplyConstraintsAsync` → the six filter helpers → `IConstraintEnforcer.RecordDenialAsync` (new `string? correlationId` param); the `TODO(Task 2.3)` is gone. The audit `CorrelationId` column is `Guid?`, so a GUID-parseable id is stored typed, **and** the raw string is always preserved in the audit record's `DetailsJson["clientCorrelationId"]`. This matters because the Rust client sends non-GUID ids (`rust-client-<op>-<n>`) on all traffic and Python/Java default to empty, so the typed column alone would stay null for those clients. An end-to-end test asserts the value propagates through `Invoke`.
- **Original finding:** denied-operation audit records always wrote `CorrelationId = null` (`ConstraintEnforcer.cs:134-136,147`); threading needed an `IConstraintEnforcer` signature change.
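The storage rule in the §1.2 resolution (typed column only for GUID-parseable ids, raw string always kept in `DetailsJson`) can be sketched as follows. `AuditCorrelation`, `ToTypedColumn`, and `ToDetails` are hypothetical names, not the gateway's `ConstraintEnforcer` API, and the sketch assumes the raw value is copied verbatim even when it is empty:

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the §1.2 storage rule; names are illustrative.
public static class AuditCorrelation
{
    // Value for the typed Guid? audit column: only ids that parse as a GUID are
    // stored there; anything else (e.g. "rust-client-read-7") or null yields null.
    public static Guid? ToTypedColumn(string? correlationId) =>
        Guid.TryParse(correlationId, out Guid parsed) ? parsed : (Guid?)null;

    // Assumption: the raw client-supplied string is always copied into the
    // details payload, so non-GUID ids stay recoverable when the column is null.
    public static Dictionary<string, string?> ToDetails(string? correlationId) =>
        new Dictionary<string, string?> { ["clientCorrelationId"] = correlationId };
}
```

With this split, an illustrative Rust-style id such as `rust-client-read-7` lands only in the details payload, while a GUID id populates both the typed column and the details payload.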
### 🔴 1.3 — `provider_switches{from,to,reason}` counter never exercised live
- **Location:** metric emitted in `Alarms/GatewayAlarmMonitor.cs` (failover/failback path); residual recorded in `docs/plans/2026-06-14-deferred-followups.md:124-125`.
- **Evidence:** *"that counter's live exercise remains the one gap; record it explicitly rather than claiming coverage."* Unit-tested (Tests-032) but the dev rig can't drive a real alarmmgr→subtag failover (`project_rig_alarms_object_driven`), so the counter's `reason` tagging is unproven in production.
### 🟠 1.4 — Worker 8-arg alarm ack silently discards operator domain / full name
- **Location:** `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs:261-278` (`_ = ackOperatorDomain; _ = ackOperatorFullName;`).
- **Evidence:** *"the IwwAlarmConsumer2 8-arg AlarmAckByName returns -55 on this AVEVA build (looks like a stub) … fields are accepted by the proto for forward-compat but are not propagated to AVEVA today."*
- **Impact:** two contract fields are accepted on the wire and silently not delivered. Root cause is the vendor stub (see 3.5), but the drop is currently invisible to callers.
---
## 2. Intentional v1 scope decisions (deliberately deferred — catalog)
These are documented, deliberate, and mostly enforced. Listed so the deferred surface is in one place — **none are bugs.** Canonical register: `docs/DesignDecisions.md:466-474` ("Later Revisit Items") + `gateway.md` "Post-v1 revisit items".
- 🔵 **Reconnectable sessions** — not in v1. `docs/DesignDecisions.md:63-73`, `gateway.md:1087,1101`.
- 🔵 **Multi-event-subscriber fan-out**: *plumbed but blocked.* The option flows all the way to `Sessions/GatewaySession.cs:387-408 AttachEventSubscriber(allowMultipleSubscribers)`, but `Configuration/GatewayOptionsValidator.cs:181-185` hard-rejects the only enabling value: *"AllowMultipleEventSubscribers is not supported until event fan-out is implemented."* So the fan-out code path never runs. `docs/DesignDecisions.md:75-80`.
- 🔵 **Gateway restart does not reattach orphan workers** — terminates them on startup. `docs/DesignDecisions.md:65-69`, `CLAUDE.md`.
- 🔵 **Workers run as the gateway service identity** — restricted service account is a reserved extension point. `docs/DesignDecisions.md:179-184`.
- 🔵 **Fail-fast event backpressure, no coalescing** — opt-in coalescing is post-v1. `docs/DesignDecisions.md:187-203`.
- 🔵 **No public command batching**: `docs/DesignDecisions.md:206-212`.
- 🔵 **API-key admin is a local CLI only** — no public admin RPC. `docs/DesignDecisions.md:308-323`.
- 🔵 **No Blazor UI component libraries** — hard constraint. `docs/DesignDecisions.md:342-358`.
- 🔵 **Lazy browse is wire-only** — no lazy SQL / cache loading. `docs/DesignDecisions.md:365-376`, `docs/plans/2026-05-28-lazy-browse-design.md:30`.
- 🔵 **No server-side / streaming browse search**: `docs/plans/2026-05-28-lazy-browse-design.md:208`.
- 🔵 **Alarm command surface is ack + query only** — no Clear/Disable/Enable/Silence/Shelve/Inhibit; matches the MXAccess alarm-client set. `Worker/MxAccess/AlarmCommandHandler.cs`, shelve/suppress out of scope per `docs/AlarmClientDiscovery.md:60-66`.
- 🔵 **Dashboard EventsHub has no per-session ACL** — any authenticated dashboard user may subscribe to any session group. `Dashboard/Hubs/EventsHub.cs:36-50` (`TODO(per-session-acl)`); only relevant once a per-session role model exists.
---
## 3. MXAccess parity — open questions & vendor-gated items
Intentionally incomplete, awaiting a live capture or an AVEVA fix; raw payload/metadata is preserved in the meantime (no synthesis).
- 🟡 **3.1 `OperationComplete` native trigger condition unknown** — modeled and emitted only from the real event (no synthesis), but the runtime condition that fires it isn't captured. `docs/DesignDecisions.md:280-289`, `gateway.md:1094`, `docs/MxAccessWorkerInstanceDesign.md:341,366`.
- 🟡 **3.2 `OnBufferedDataChange` multi-sample conversion unvalidated (STILL OPEN, residual after B8)**: `AddBufferedItem`/`SetBufferedUpdateInterval` are now implemented and live-confirmed (§1.1), and the live B8 test (`f7ada90`) confirms the worker receives and cleanly converts the empty `NoData` bootstrap `OnBufferedDataChange` (no crash, no dropped payload). However, the rig's object logic does not drive a multi-sample buffered batch on demand (the same limitation as the alarm rig), so a real parallel quality/timestamp sample array (length > 1) has never been observed live; it is exercised only by the B-bundle unit tests against a fake `IMxAccessServer`. To close this, re-run `GatewaySession_WithLiveWorker_BufferedItem_*` against a fast-changing simulated tag. `docs/DesignDecisions.md:291-297`.
- 🟡 **3.3 Completion-only status → `MXSTATUS_PROXY[]` mapping unproven** — completion-only operation-status bytes are kept as raw diagnostic metadata until the analysis proves an exact mapping. `docs/DesignDecisions.md:299-306`.
- 🟡 **3.4 `AlarmAckByGUID` is `E_NOTIMPL` on this AVEVA build** — throws `NotImplementedException`; all acks route through `AlarmAckByName`. Proto/worker keep the path for forward-compat but it is dead today. `docs/AlarmClientDiscovery.md:750-763`.
- 🟡 **3.5 8-arg `AlarmAckByName` v2 is a vendor stub (returns -55)** — worker uses the 6-arg method; the 8-arg `domain`/`full_name` fields are carried for forward-compat only (see 1.4). `docs/AlarmClientDiscovery.md:743-748`.
- 🟡 **3.6 Subtag degraded-mode fidelity limits**: `category`, `description`, `alarm_type_name`, operator fields, and `retrigger` are not populated or synthesized in subtag fallback (no subtag exists for them). Documented, by design. `docs/AlarmClientDiscovery.md:913-931`, `docs/plans/2026-06-13-alarm-subtag-fallback-design.md:292-298`.
- 🟡 **3.7 Subtag `Clear` transition unvalidated live** — Raise/Ack/AckMsg are live-confirmed; Clear is externally undrivable on the rig (object logic owns alarm state). Environmental, not code. (`project_alarm_subtag_fallback`, `project_rig_alarms_object_driven`.)
---
## 4. Clients — gaps & cross-client parity
Library RPC surface is at **full parity**: all gateway + GalaxyRepository RPCs and the `LazyBrowseNode` helper exist in all five clients, with **no** TODO/stub/not-implemented markers in production code. The CLI/helper gaps below are **RESOLVED**.
| Capability | Dotnet | Go | Python | Rust | Java |
|---|---|---|---|---|---|
| `Write2` single session helper | ✅ | ✅ `849f1d2` | ✅ | ✅ | ✅ |
| `ping` CLI subcommand | ✅ | ✅ `90529dc` | ✅ | ✅ | ✅ `0d5b488` |
| `version` CLI subcommand | ✅ *(already worked)* | ✅ | ✅ | ✅ | ✅ |
| `galaxy-*` CLI commands (4) | ✅ | ✅ | ✅ `a211fae` | ✅ | ✅ |
| `galaxy-browse` / BrowseChildren CLI | ✅ | ✅ | ✅ | ✅ | ✅ **(5/5)** |
- ✅ **4.1 Go single `Write2` helper — RESOLVED** (`849f1d2`): added `Write2`/`Write2Raw` to `clients/go/mxgateway/session.go`, matching the other four clients' signature.
- ✅ **4.2 Python `galaxy-*` CLI commands — RESOLVED** (`a211fae`, `a59fc99`): added `galaxy-test-connection`/`galaxy-last-deploy`/`galaxy-discover`/`galaxy-watch` Click commands wrapping `galaxy.py`; README corrected. (Fixed a UTC-offset bug in last-deploy output during review.)
- ✅ **4.3 `ping` CLI added to Go + Java — RESOLVED** (Go `90529dc`/`742ced7`, Java `0d5b488`).
- ✅ **4.4 `version` CLI in Dotnet: NOT MISSING (audit correction).** The dotnet `version` subcommand already worked (`MxGatewayClientCli.cs:85` → `WriteVersion`, which prints the gateway/worker protocol versions); the original audit was wrong. Minor: unlike Go, dotnet's `version` omits a client-*package*-version line (`MxGatewayClientContractInfo` exposes only the two protocol versions). This is cosmetic and not tracked.
- ✅ **4.5 Galaxy CLI command-name divergence — RESOLVED** (Java `0d5b488`): `galaxy-test-connection`/`galaxy-last-deploy` are now the canonical Java names, with `galaxy-test`/`galaxy-deploy-time` kept as **deprecated picocli aliases** so existing scripts don't break. (Rust keeps its `galaxy <subcommand>` group style — a clap structural choice, not a name divergence.)
- ✅ **4.6 `browse`/`BrowseChildren` CLI — RESOLVED, 0/5 → 5/5** (Rust `639e36b`, Go `8cb416b`, Python `39ec2a3`, dotnet `d7e2a8b`, Java `0d5b488`). All five emit the per-node JSON key `hasChildrenHint` (unified during review). Minor residual divergence: dotnet *nests* the Galaxy object fields under an `object` key while Go/Rust/Python/Java *flatten* them — both carry `hasChildrenHint` + a nested `children` array; harmonizing the object nesting is a cosmetic follow-up, not tracked.
- ⚪ **4.7 No typed wrappers for the rarer commands**: `AuthenticateUser`, `ArchestrAUserToId`, `AddBufferedItem`, `Suspend`/`Activate`, and `GetSessionState`/`GetWorkerInfo`/`DrainEvents`/`ShutdownWorker` remain reachable via the generic `Invoke`/`invoke_raw` escape hatch in all five clients. This is consistent and deliberate: the worker-side commands are now implemented (§1.1), but no client adds dedicated typed wrappers. That is out of scope; the CLIs that needed them got `ping`/`browse` subcommands.
---
## 5. Verification gaps (code exists, unverified by default)
All live/integration paths are opt-in; the default unit suites do not exercise them.
- ⚪ **Live MXAccess COM + STA + message pump**: `Worker.Tests/MxAccess/MxAccessLiveComCreationTests.cs` (5 `[LiveMxAccessFact]`), gated `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`.
- ⚪ **Live gateway↔worker↔MXAccess round-trip**: `IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` (6 `[LiveMxAccessFact]`).
- ⚪ **Live Galaxy Repository SQL**: `IntegrationTests/Galaxy/GalaxyRepositoryLiveTests.cs` (4 `[LiveGalaxyRepositoryFact]`), gated `MXGATEWAY_RUN_LIVE_GALAXY_TESTS=1`.
- ⚪ **Live LDAP dashboard auth**: `IntegrationTests/DashboardLdapLiveTests.cs` (5 `[LiveLdapFact]`), gated `MXGATEWAY_RUN_LIVE_LDAP_TESTS=1`.
- ⚪ **Alarm runtime/discovery probes (dev-rig)**: `Worker.Tests/Probes/{WnWrapConsumerProbeTests,AlarmClientWmProbeTests}.cs`, `AlarmClientDiscoveryTests.cs`; hard `[Fact(Skip=...)]`.
- ⚪ **Live alarm + subtag-fallback smoke**: `Worker.Tests/Probes/{AlarmSubtagLiveSmokeTests,AlarmsLiveSmokeTests}.cs` use `Skip` plus one `[LiveMxAccessFact]`; the Clear path remains undrivable even when enabled.
- ⚪ **Python loopback TLS**: `clients/python/tests/test_tls.py:111-112`, gated `MXGATEWAY_RUN_TLS_TESTS=1` + openssl; only cert-config parsing runs by default.
- ⚪ **.NET client live browse smoke**: `clients/dotnet/.../BrowseChildrenSmokeTests.cs:17-18`; hard `[Fact(Skip=...)]`.
- ⚪ **Cross-language client↔gateway wire behavior**: no per-client integration unit tests; only `scripts/run-client-e2e-tests.ps1` against a live gateway (`MXGATEWAY_INTEGRATION=1`). All client wire behavior is unverified in default unit runs.
No placeholder/empty/`Assert.True(true)` tests were found anywhere.
---
## 6. Config-gated functional gaps (work only after configuration)
- 🟠 **6.1 Alarm ack in subtag mode requires the `AckComment` subtag to be configured**: it is empty by default, and ack fails in subtag mode until it is set. Names must be validated against live MXAccess, not guessed. `docs/DesignDecisions.md:454-458`. (`AckCommentSubtag` is write-only; `Worker/MxAccess/SubtagAlarmStateMachine.cs:21`.)
- 🔵 **6.2 Multi-subscriber**: see §2 (option exists, validator-blocked).
---
## 7. Stale docs, dead code, accepted gaps
- 📄 **7.1 D1 plan header stale**: `docs/plans/2026-06-14-deferred-followups.md:4` still says *"Plan only — NOT yet executed,"* but D1 is **done** (`Dashboard/DashboardSnapshotService.cs:198`, commit `4af24b9`). Update the plan status.
- 📄 **7.2 `AlarmClientDiscovery.md` STA "production fix needed" prose is stale**: `docs/AlarmClientDiscovery.md:765-774` reads as a pending follow-up, but alarms now run through the worker STA / `GatewayAlarmMonitor` (merged). Re-check against current code.
- 📄 **7.3 EventsHub "publisher side is a follow-up" comment is stale**: `Dashboard/Hubs/EventsHub.cs:9-17`; the `DashboardEventBroadcaster` exists, is DI-registered (`Dashboard/DashboardServiceCollectionExtensions.cs:47`), runs in the live loop (`Grpc/EventStreamService.cs:133`), and `SessionDetailsPage.razor` renders the feed.
- 📄 **7.4 CLAUDE.md project-name drift** — CLAUDE.md uses `src/MxGateway.Server`/`MxGateway.Tests`; the actual tree is `src/ZB.MOM.WW.MxGateway.*`. Misleads path-based work.
- ⚪ **7.5 Dead `MapSqlException` helper**: `Grpc/GalaxyRepositoryGrpcService.cs:350-360`, IDE0051-suppressed, kept for a hypothetical direct-SQL path that doesn't exist.
- **7.6 Accepted code-review gaps (`Won't Fix`, by design):**
- `Client.Python-012`: `Session.invoke_raw` deliberately skips `ensure_mxaccess_success`, so an embedded MXAccess HRESULT failure is returned silently rather than raised (raw-parity inspection). `code-reviews/Client.Python/findings.md:290`.
- `Contracts-003` — closed as not-a-defect. `code-reviews/Contracts/findings.md`.
- *(All 351 review findings are otherwise Resolved; none Open or Deferred.)*
---
## 8. Deferred test-coverage follow-ups (noted in resolutions, never filed as findings)
- **Java CLI bulk-subcommand coverage** — 6 of 13 non-trivial subcommands untested: `read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, `bench-read-bulk` (plus `stream-events`, the four `galaxy-*`, `close-session`). `code-reviews/Client.Java/findings.md:495` (Client.Java-026).
- **Per-session-ACL TODO** at `Server/Dashboard/Hubs/EventsHub.cs` (`code-reviews/Server/findings.md:765`).
- **Worker-Ready retry race** noted at `code-reviews/Server/findings.md:611`.
- **Duplicated `FakeWorkerProcess` harness** flagged as a latent regression vector — `code-reviews/Tests/findings.md:463`.
---
## Bottom line
**Status after `feat/stillpending-completion` (2026-06-15):** the net-new functionality is **done**: **§1.1** (all 11 worker command kinds, COM half live-verified on the rig), **§1.2** (audit CorrelationId, with a raw-string fallback for non-GUID clients), and the entire **§4** client CLI/helper parity surface (`Write2`, `ping`, `galaxy-*`, `galaxy-browse` 5/5, name aliases). Doc hygiene **§7** is done (`0032d2d`, `bd46ba1`). Zero `.proto` changes were required.
**Still open (all deliberate or environment/vendor-gated):**
- **§1.3**: the `provider_switches` counter is still only unit-tested; the dev rig can't drive a real alarmmgr→subtag failover, so live `reason`-tag coverage remains a recorded residual.
- **§1.4 / §3.4 / §3.5**: the AVEVA 8-arg `AlarmAckByName` is a vendor stub (returns -55) and `AlarmAckByGUID` is `E_NOTIMPL`; the `domain`/`full_name` fields stay forward-compat-only until AVEVA implements them.
- **§3.2**: buffered commands work and the empty bootstrap converts cleanly live, but a multi-sample buffered batch is undrivable on the rig (unit-tested only).
- **§3.1 / §3.3 / §3.6 / §3.7**: await live MXAccess captures.
- **§2**: deliberate v1 scope. **§5**: opt-in verification gates. **§7.6**: accepted `Won't Fix` review findings.
MXAccess **event/data/value/write** mapping, the **Galaxy** RPC surface, and now the **full command surface** are complete; no `NotImplementedException`s, stubbed RPC bodies, or empty tests remain in the production paths.