fix: resolve code-review findings (locally verified)

Server-054/055/056, Contracts-020/021/022, Tests-036/038/039,
IntegrationTests-030/031/032 (+033 deferred to live rig),
Client.Dotnet-026/028/029 (+027 won't-fix), Client.Go-030..034,
Client.Python-032..036, Client.Rust-033..038.

Key fix: SessionEventDistributor orphaned a subscriber that registered after
the pump completed but before disposal (Server-056). The register paths now
complete late registrants under _lifecycleLock, and a regression test was
added. The racy dashboard-mirror gRPC test is now deterministic (Tests-039).

Verified green locally: gateway Tests targeted classes (GatewaySession,
SessionEventDistributor, GatewayOptionsValidator, ProtobufContractRoundTrip,
GatewaySessionDashboardMirror) + dotnet/go/python/rust client suites.
Joseph Doherty
2026-06-17 05:23:14 -04:00
parent 25d04ec37e
commit 6b5fe6aa82
37 changed files with 1049 additions and 211 deletions
+60 -21
@@ -62,37 +62,67 @@ Implementation guidance:
## Session Reconnect
Decision: no reconnectable sessions for v1.
Reconnectable sessions with event replay are shipped and config-gated. The
original "no reconnectable sessions" constraint is superseded.
One `OpenSession` creates one gateway session and one worker process. The
session ends on `CloseSession`, client disconnect policy, lease expiry, worker
fault, or gateway shutdown.
fault, gateway shutdown, or — when `DetachGraceSeconds > 0` — detach-grace
expiry after the last external event subscriber drops.
Rationale: reconnectable sessions require event replay, orphan ownership,
security checks, and more complicated worker lifetime rules. They are not needed
for the first parity slice.
`MxGateway:Sessions:DetachGraceSeconds` (default `30`) controls the retention
window. When positive, a session whose last external gRPC event-stream
subscriber drops stays `Ready` for that many seconds so a client can reconnect
to the same session instead of triggering a new `OpenSession` → worker spawn.
Setting it to `0` reverts to closing only on normal lease expiry.
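The detach-grace rule can be read as a small predicate. The following is a minimal sketch in Python, not the gateway's implementation: the function name, its parameters, and the inclusive `>=` comparison at the deadline are all assumptions made for illustration.

```python
from typing import Optional


def detach_grace_expired(
    grace_seconds: float,
    external_subscribers: int,
    last_detach_at: Optional[float],
    now: float,
) -> bool:
    """Return True when a session should end because detach grace ran out.

    Hypothetical helper: all names here are illustrative, not gateway APIs.
    """
    if grace_seconds <= 0:
        # Grace disabled: detach never ends the session by itself, so only
        # lease expiry and the other end conditions apply.
        return False
    if external_subscribers > 0 or last_detach_at is None:
        # A subscriber is attached (or none has ever dropped): no countdown.
        return False
    # Assumption: the deadline is inclusive.
    return now - last_detach_at >= grace_seconds
```

With the default of 30 seconds, a session whose last subscriber dropped at `t = 100` is still retained at `t = 129` and eligible to close at `t = 130`, provided no client reattached in between.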
A reconnecting client issues `StreamEvents` with `after_worker_sequence` set to
the last sequence it observed; the gateway replays retained events newer than
that watermark (capped by `MxGateway:Events:ReplayBufferCapacity` and
`MxGateway:Events:ReplayRetentionSeconds`) then transitions seamlessly to live
delivery. If the requested position precedes the oldest retained event, a
`ReplayGap` sentinel signals the client to re-snapshot. The replay→live handoff
is atomic (no gap, no duplicate). See [Sessions](./Sessions.md) for the full
reconnect and replay protocol.
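To make the watermark and gap rules concrete, here is a rough sketch of a capacity-bounded replay buffer in Python. It is an illustration under stated assumptions, not the gateway's code: `ReplayBuffer`, `Event`, and `replay_after` are hypothetical names, worker sequences are assumed to be contiguous integers, and time-based retention and the atomic replay-to-live handoff are not modeled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass(frozen=True)
class Event:
    worker_sequence: int
    payload: str


class ReplayBuffer:
    """Keeps the most recent `capacity` events (hypothetical sketch)."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) silently evicts the oldest entry when full.
        self._events: Deque[Event] = deque(maxlen=capacity)

    def append(self, event: Event) -> None:
        self._events.append(event)

    def replay_after(self, after_worker_sequence: int) -> Tuple[bool, List[Event]]:
        """Return (gap, events newer than the client's watermark).

        `gap` is True when events between the watermark and the oldest
        retained event were evicted, which is when the client would be
        told to re-snapshot. Assumes contiguous sequence numbers.
        """
        if not self._events:
            return False, []
        oldest = self._events[0].worker_sequence
        gap = after_worker_sequence < oldest - 1
        newer = [e for e in self._events if e.worker_sequence > after_worker_sequence]
        return gap, newer
```

In this sketch a client that last saw sequence 3 against a buffer retaining 3 through 5 receives 4 and 5 with no gap, while a client that last saw sequence 1 is flagged with a gap because sequence 2 has already been evicted.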
## Event Subscribers
Decision: one active `StreamEvents` subscriber per session for v1.
Multi-subscriber fan-out for data-side `StreamEvents` is shipped and
config-gated. The original "one active subscriber per session" constraint is
superseded for deployments that opt in.
A second subscriber should be rejected with a clear session error. Multi-client
fan-out may be added later with explicit backpressure semantics.
`MxGateway:Sessions:AllowMultipleEventSubscribers` (default `false`) controls
the mode. When `false` the session still rejects a second `StreamEvents`
subscriber with `EventSubscriberAlreadyActive`, preserving the original
single-subscriber behavior. When `true`, up to
`MxGateway:Sessions:MaxEventSubscribersPerSession` (default `8`) concurrent
external subscribers may attach; a new attach that would exceed the cap is
rejected with `EventSubscriberLimitReached`. The count-check-and-increment is
atomic under the session lock.
Rationale: one subscriber preserves simple event ordering and failure behavior
while parity is being proven.
Failure semantics differ by mode: in single-subscriber mode a slow consumer's
channel overflow faults the whole session (`FailFast` backpressure); in
multi-subscriber mode the same condition disconnects only that subscriber so one
slow consumer never faults a session shared by others. The mode is fixed at
session construction and is not changed by a live subscriber-count snapshot.
### Alarms — superseded for the alarm subsystem
The gateway-owned internal dashboard mirror subscribes directly on the
distributor with `isInternal: true` and is not counted toward the cap or the
detach-grace subscriber-count in either mode.
The single-subscriber rule above no longer applies to alarms. The gateway runs
an always-on central alarm monitor (`GatewayAlarmMonitor`) that owns one
See [Sessions](./Sessions.md) for the full event-distributor and backpressure
design.
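The admission and overflow rules described above can be sketched as follows. This is a simplified Python model under assumptions, not the gateway's implementation: `SubscriberGate`, `SubscriberRejected`, and the returned action strings are hypothetical, and only the two error names and the defaults (`false`, `8`) come from the text.

```python
import threading


class SubscriberRejected(Exception):
    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


class SubscriberGate:
    """Counts external subscribers for one session (hypothetical sketch)."""

    def __init__(self, allow_multiple: bool = False, max_subscribers: int = 8) -> None:
        self._lock = threading.Lock()
        # The mode is fixed at construction and never re-derived from counts.
        self._allow_multiple = allow_multiple
        self._max = max_subscribers
        self._external = 0

    def attach(self, is_internal: bool = False) -> None:
        if is_internal:
            return  # internal dashboard mirror is never counted
        with self._lock:  # check and increment happen under one lock
            if not self._allow_multiple:
                if self._external >= 1:
                    raise SubscriberRejected("EventSubscriberAlreadyActive")
            elif self._external >= self._max:
                raise SubscriberRejected("EventSubscriberLimitReached")
            self._external += 1

    def detach(self) -> None:
        with self._lock:
            if self._external > 0:
                self._external -= 1

    def external_count(self) -> int:
        with self._lock:
            return self._external

    def overflow_action(self) -> str:
        """What a slow consumer's channel overflow does in this mode."""
        return "disconnect_subscriber" if self._allow_multiple else "fault_session"
```

Holding one lock across the check and the increment is what keeps two concurrent attaches from both passing the cap check.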
### Alarms — separate fan-out architecture
The single-subscriber rule never applied to alarms. The gateway runs an
always-on central alarm monitor (`GatewayAlarmMonitor`) that owns one
gateway-managed worker session, caches the active-alarm set, and fans it out to
any number of clients through the session-less `StreamAlarms` RPC. Per-session
alarm auto-subscribe is removed; `AcknowledgeAlarm` is session-less and routes
through the monitor. Data-side `StreamEvents` remains one subscriber per
session. Rationale: alarm state is gateway-wide, not session-scoped — every
client wants the same current set plus updates, and forcing each to own a
worker would multiply AVEVA polling load for no benefit.
any number of clients through the session-less `StreamAlarms` RPC.
`AcknowledgeAlarm` is session-less and routes through the monitor. Rationale:
alarm state is gateway-wide, not session-scoped — every client wants the same
current set plus updates, and forcing each to own a worker would multiply AVEVA
polling load for no benefit.
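The snapshot-then-updates shape of this fan-out can be illustrated with a small Python sketch. It is not the `GatewayAlarmMonitor` implementation: the class name `AlarmMonitor`, the callback signature, and the `"active"` / `"cleared"` state strings are assumptions chosen for the example.

```python
from typing import Callable, Dict, List

# Callback receives (kind, alarm_id, state); kind is "snapshot" or "update".
AlarmCallback = Callable[[str, str, str], None]


class AlarmMonitor:
    """Caches the active-alarm set and fans it out (hypothetical sketch)."""

    def __init__(self) -> None:
        self._active: Dict[str, str] = {}
        self._subscribers: List[AlarmCallback] = []

    def subscribe(self, on_alarm: AlarmCallback) -> None:
        # A new client first receives the cached current set...
        for alarm_id, state in sorted(self._active.items()):
            on_alarm("snapshot", alarm_id, state)
        # ...and then every later update, without owning a worker itself.
        self._subscribers.append(on_alarm)

    def publish(self, alarm_id: str, state: str) -> None:
        if state == "cleared":
            self._active.pop(alarm_id, None)
        else:
            self._active[alarm_id] = state
        for on_alarm in list(self._subscribers):
            on_alarm("update", alarm_id, state)

    def active_alarms(self) -> Dict[str, str]:
        return dict(self._active)
```

Because the cache lives in one place, adding a client costs one callback registration rather than another worker polling the same alarm source.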
## Authentication
@@ -467,12 +497,21 @@ against the live MXAccess attribute set.
These are explicit post-v1 revisit items, not open blockers:
- reconnectable sessions,
- multiple event subscribers per session,
- restricted worker service account,
- production coalescing by item handle,
- command batching for high-volume tag setup.
The following items were previously listed here and have since shipped:
- **Reconnectable sessions with replay** — shipped, config-gated via
`MxGateway:Sessions:DetachGraceSeconds` and
`MxGateway:Events:ReplayBufferCapacity` / `ReplayRetentionSeconds`.
See [Session Reconnect](#session-reconnect) above and [Sessions](./Sessions.md).
- **Multiple event subscribers per session** — shipped, config-gated via
`MxGateway:Sessions:AllowMultipleEventSubscribers` and
`MxGateway:Sessions:MaxEventSubscribersPerSession`.
See [Event Subscribers](#event-subscribers) above and [Sessions](./Sessions.md).
## Related Documentation
- [Gateway Process Detailed Design](./GatewayProcessDesign.md)
+12 -4
@@ -51,7 +51,7 @@ shutdown request even when a command or event assertion fails. Cleanup failures
in that `finally` block are logged rather than thrown, so a real assertion
failure is never masked by a shutdown timeout.
`WorkerLiveMxAccessSmokeTests` additionally covers five MXAccess parity paths the
`WorkerLiveMxAccessSmokeTests` additionally covers seven MXAccess parity paths the
fake-worker tests cannot validate:
- a `Write` round-trip against an advised item, asserting both that the reply is
@@ -67,13 +67,21 @@ fake-worker tests cannot validate:
- a `WriteSecured` round-trip after `AuthenticateUser`, asserting the reply
carries `MxCommandKind.WriteSecured` and the credential password never
appears in the diagnostic message (parity for both the secured-write
ordering rule and the "do not log secrets" contract), and
ordering rule and the "do not log secrets" contract),
- an abnormal worker exit (the worker process is killed mid-session) where the
gateway must transition the session to `SessionState.Faulted` with a
non-empty fault description carrying a known worker-client classification
(pipe disconnected / worker faulted / end-of-stream / heartbeat expired).
(pipe disconnected / worker faulted / end-of-stream / heartbeat expired),
- the B8 new COM commands — `AuthenticateUser`, `ArchestrAUserToId`, `Suspend`,
and `Activate` — each asserting a real MXAccess reply (not `InvalidRequest`)
is returned against an added-but-not-advised item, and
- the buffered-data path — `AddBufferedItem` and `SetBufferedUpdateInterval`
asserting the commands round-trip and that the worker delivers at least one
`OnBufferedDataChange` event (the empty NoData bootstrap) without crashing
or dropping frames; live §3.2 multi-sample conversion is noted as a residual
when the rig does not drive sample-bearing buffered batches on demand.
All six tests are gated by the same `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`
All eight tests are gated by the same `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`
opt-in variable.
Build the worker before running the smoke: