fix: resolve code-review findings (locally verified)

Server-054/055/056, Contracts-020/021/022, Tests-036/038/039,
IntegrationTests-030/031/032 (+033 deferred to live rig),
Client.Dotnet-026/028/029 (+027 won't-fix), Client.Go-030..034,
Client.Python-032..036, Client.Rust-033..038.

Key fix: SessionEventDistributor orphaned a subscriber that registered after
the pump completed but before disposal (Server-056) -> register paths now
complete late registrants under _lifecycleLock; regression test added. The
racy dashboard-mirror gRPC test is now deterministic (Tests-039).

Verified green locally: gateway Tests targeted classes (GatewaySession,
SessionEventDistributor, GatewayOptionsValidator, ProtobufContractRoundTrip,
GatewaySessionDashboardMirror) + dotnet/go/python/rust client suites.
Joseph Doherty
2026-06-17 05:23:14 -04:00
parent 25d04ec37e
commit 6b5fe6aa82
37 changed files with 1049 additions and 211 deletions
+60 -21
@@ -62,37 +62,67 @@ Implementation guidance:
## Session Reconnect
Decision: no reconnectable sessions for v1.
Reconnectable sessions with event replay are shipped and config-gated. The
original "no reconnectable sessions" constraint is superseded.
One `OpenSession` creates one gateway session and one worker process. The
session ends on `CloseSession`, client disconnect policy, lease expiry, worker
fault, or gateway shutdown.
fault, gateway shutdown, or — when `DetachGraceSeconds > 0` — detach-grace
expiry after the last external event subscriber drops.
Rationale: reconnectable sessions require event replay, orphan ownership,
security checks, and more complicated worker lifetime rules. They are not needed
for the first parity slice.
`MxGateway:Sessions:DetachGraceSeconds` (default `30`) controls the retention
window. When positive, a session whose last external gRPC event-stream
subscriber drops stays `Ready` for that many seconds so a client can reconnect
to the same session instead of triggering a new `OpenSession` → worker spawn.
Setting it to `0` reverts to closing only on normal lease expiry.
A reconnecting client issues `StreamEvents` with `after_worker_sequence` set to
the last sequence it observed; the gateway replays retained events newer than
that watermark (capped by `MxGateway:Events:ReplayBufferCapacity` and
`MxGateway:Events:ReplayRetentionSeconds`) then transitions seamlessly to live
delivery. If the requested position precedes the oldest retained event, a
`ReplayGap` sentinel signals the client to re-snapshot. The replay→live handoff
is atomic (no gap, no duplicate). See [Sessions](./Sessions.md) for the full
reconnect and replay protocol.
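The watermark-based replay described above can be illustrated with a minimal sketch. It is not the gateway's implementation: the `ReplayBuffer`, `Event`, and `ReplayGap` classes here are hypothetical Python stand-ins. The sketch assumes worker sequences are contiguous integers, models only the capacity cap (not `ReplayRetentionSeconds`), and assumes a gap yields the sentinel alone.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    worker_sequence: int
    payload: str


@dataclass(frozen=True)
class ReplayGap:
    """Hypothetical sentinel: the client should re-snapshot."""
    requested_after: int
    oldest_retained: int


class ReplayBuffer:
    """Bounded buffer of recent events (capacity cap only, an assumption)."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) silently evicts the oldest entry when full.
        self._events = deque(maxlen=capacity)

    def append(self, event: Event) -> None:
        self._events.append(event)

    def replay_after(self, after_worker_sequence: int) -> list:
        """Return retained events newer than the watermark, or a gap sentinel."""
        if not self._events:
            return []
        oldest = self._events[0].worker_sequence
        # Assumes contiguous sequences: if the event right after the
        # watermark was already evicted, the client has missed data.
        if after_worker_sequence + 1 < oldest:
            return [ReplayGap(after_worker_sequence, oldest)]
        return [e for e in self._events
                if e.worker_sequence > after_worker_sequence]


buf = ReplayBuffer(capacity=3)
for seq in range(1, 6):
    buf.append(Event(seq, f"value-{seq}"))

# Sequences 3, 4, 5 are retained; 1 and 2 were evicted.
resumed = buf.replay_after(3)   # events 4 and 5
gapped = buf.replay_after(1)    # a single ReplayGap sentinel
```

In this sketch a client that last saw sequence 2 still resumes cleanly, because sequence 3 is the oldest retained event and nothing between the watermark and the buffer is missing.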
## Event Subscribers
Decision: one active `StreamEvents` subscriber per session for v1.
Multi-subscriber fan-out for data-side `StreamEvents` is shipped and
config-gated. The original "one active subscriber per session" constraint is
superseded for deployments that opt in.
A second subscriber should be rejected with a clear session error. Multi-client
fan-out may be added later with explicit backpressure semantics.
`MxGateway:Sessions:AllowMultipleEventSubscribers` (default `false`) controls
the mode. When `false` the session still rejects a second `StreamEvents`
subscriber with `EventSubscriberAlreadyActive`, preserving the original
single-subscriber behavior. When `true`, up to
`MxGateway:Sessions:MaxEventSubscribersPerSession` (default `8`) concurrent
external subscribers may attach; a new attach that would exceed the cap is
rejected with `EventSubscriberLimitReached`. The count-check-and-increment is
atomic under the session lock.
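The admission rule above can be sketched as follows. This is an illustrative Python model rather than the gateway's code: `SubscriberGate` and `SessionError` are hypothetical names, and only the two error codes and the defaults (`false`, `8`) come from the text. The point is that the count check and the increment happen under one lock, so two concurrent attaches cannot both pass the cap.

```python
import threading


class SessionError(Exception):
    """Hypothetical error type carrying a session error code."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


class SubscriberGate:
    """Models the per-session subscriber admission check (assumed shape)."""

    def __init__(self, allow_multiple: bool = False,
                 max_subscribers: int = 8) -> None:
        self._allow_multiple = allow_multiple
        self._max = max_subscribers
        self._count = 0
        self._lock = threading.Lock()

    def attach(self) -> int:
        # Check and increment under the same lock so the pair is atomic.
        with self._lock:
            if not self._allow_multiple and self._count >= 1:
                raise SessionError("EventSubscriberAlreadyActive")
            if self._allow_multiple and self._count >= self._max:
                raise SessionError("EventSubscriberLimitReached")
            self._count += 1
            return self._count

    def detach(self) -> int:
        with self._lock:
            if self._count > 0:
                self._count -= 1
            return self._count


def try_attach(gate: SubscriberGate) -> str:
    """Return 'ok' or the rejection code for one attach attempt."""
    try:
        gate.attach()
        return "ok"
    except SessionError as err:
        return err.code


single = SubscriberGate()  # defaults: single-subscriber mode
single_results = [try_attach(single), try_attach(single)]

multi = SubscriberGate(allow_multiple=True, max_subscribers=2)
multi_results = [try_attach(multi) for _ in range(3)]
```

Internal subscribers such as the dashboard mirror are left out of this model, since the text states they are not counted toward the cap.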
Rationale: one subscriber preserves simple event ordering and failure behavior
while parity is being proven.
Failure semantics differ by mode: in single-subscriber mode a slow consumer's
channel overflow faults the whole session (`FailFast` backpressure); in
multi-subscriber mode the same condition disconnects only that subscriber so one
slow consumer never faults a session shared by others. The mode is fixed at
session construction and is not changed by a live subscriber-count snapshot.
### Alarms — superseded for the alarm subsystem
The gateway-owned internal dashboard mirror subscribes directly on the
distributor with `isInternal: true` and is not counted toward the cap or the
detach-grace subscriber-count in either mode.
The single-subscriber rule above no longer applies to alarms. The gateway runs
an always-on central alarm monitor (`GatewayAlarmMonitor`) that owns one
See [Sessions](./Sessions.md) for the full event-distributor and backpressure
design.
### Alarms — separate fan-out architecture
The single-subscriber rule never applied to alarms. The gateway runs an
always-on central alarm monitor (`GatewayAlarmMonitor`) that owns one
gateway-managed worker session, caches the active-alarm set, and fans it out to
any number of clients through the session-less `StreamAlarms` RPC. Per-session
alarm auto-subscribe is removed; `AcknowledgeAlarm` is session-less and routes
through the monitor. Data-side `StreamEvents` remains one subscriber per
session. Rationale: alarm state is gateway-wide, not session-scoped — every
client wants the same current set plus updates, and forcing each to own a
worker would multiply AVEVA polling load for no benefit.
any number of clients through the session-less `StreamAlarms` RPC.
`AcknowledgeAlarm` is session-less and routes through the monitor. Rationale:
alarm state is gateway-wide, not session-scoped — every client wants the same
current set plus updates, and forcing each to own a worker would multiply AVEVA
polling load for no benefit.
## Authentication
@@ -467,12 +497,21 @@ against the live MXAccess attribute set.
These are explicit post-v1 revisit items, not open blockers:
- reconnectable sessions,
- multiple event subscribers per session,
- restricted worker service account,
- production coalescing by item handle,
- command batching for high-volume tag setup.
The following items were previously listed here and have since shipped:
- **Reconnectable sessions with replay** — shipped, config-gated via
`MxGateway:Sessions:DetachGraceSeconds` and
`MxGateway:Events:ReplayBufferCapacity` / `ReplayRetentionSeconds`.
See [Session Reconnect](#session-reconnect) above and [Sessions](./Sessions.md).
- **Multiple event subscribers per session** — shipped, config-gated via
`MxGateway:Sessions:AllowMultipleEventSubscribers` and
`MxGateway:Sessions:MaxEventSubscribersPerSession`.
See [Event Subscribers](#event-subscribers) above and [Sessions](./Sessions.md).
## Related Documentation
- [Gateway Process Detailed Design](./GatewayProcessDesign.md)