fix: resolve code-review findings (locally verified)
Findings resolved: Server-054/055/056, Contracts-020/021/022, Tests-036/038/039, IntegrationTests-030/031/032 (+033 deferred to live rig), Client.Dotnet-026/028/029 (+027 won't-fix), Client.Go-030..034, Client.Python-032..036, Client.Rust-033..038.

Key fix: SessionEventDistributor orphaned a subscriber that registered after the pump completed but before disposal (Server-056). The register paths now complete late registrants under _lifecycleLock, and a regression test was added.

The racy dashboard-mirror gRPC test was made deterministic (Tests-039).

Verified green locally: gateway Tests targeted classes (GatewaySession, SessionEventDistributor, GatewayOptionsValidator, ProtobufContractRoundTrip, GatewaySessionDashboardMirror) plus the dotnet/go/python/rust client suites.
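
As a rough illustration of the Server-056 fix described above, the following Python sketch shows one way a register path can complete a late registrant under a lifecycle lock instead of orphaning it. It is not the gateway's code: `EventDistributorSketch`, `SubscriberSketch`, `register`, and `complete_pump` are hypothetical names, and the real `SessionEventDistributor` may differ in every detail.

```python
import threading


class SubscriberSketch:
    """Hypothetical subscriber; `completed` is set once no more events will arrive."""

    def __init__(self):
        self.completed = threading.Event()

    def complete(self):
        self.completed.set()


class EventDistributorSketch:
    """Hypothetical stand-in for a distributor with a pump and a lifecycle lock."""

    def __init__(self):
        self._lifecycle_lock = threading.Lock()
        self._subscribers = []
        self._pump_completed = False

    def register(self, subscriber):
        # The completed-check and the list append happen under one lock, so a
        # registrant can never slip in between "pump finished" and "disposed".
        with self._lifecycle_lock:
            if self._pump_completed:
                # Late registrant: complete it right away instead of orphaning it.
                subscriber.complete()
                return False
            self._subscribers.append(subscriber)
            return True

    def complete_pump(self):
        # Mark the pump finished and take ownership of the current subscribers.
        with self._lifecycle_lock:
            self._pump_completed = True
            pending, self._subscribers = self._subscribers, []
        for subscriber in pending:
            subscriber.complete()
```

In this sketch a subscriber registered before `complete_pump()` is completed by the pump, and one registered afterwards is completed by `register` itself, so neither waits forever.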
+60
-21
@@ -62,37 +62,67 @@ Implementation guidance:
 
 ## Session Reconnect
 
-Decision: no reconnectable sessions for v1.
+Reconnectable sessions with event replay are shipped and config-gated. The
+original "no reconnectable sessions" constraint is superseded.
 
 One `OpenSession` creates one gateway session and one worker process. The
 session ends on `CloseSession`, client disconnect policy, lease expiry, worker
-fault, or gateway shutdown.
+fault, gateway shutdown, or — when `DetachGraceSeconds > 0` — detach-grace
+expiry after the last external event subscriber drops.
 
-Rationale: reconnectable sessions require event replay, orphan ownership,
-security checks, and more complicated worker lifetime rules. They are not needed
-for the first parity slice.
+`MxGateway:Sessions:DetachGraceSeconds` (default `30`) controls the retention
+window. When positive, a session whose last external gRPC event-stream
+subscriber drops stays `Ready` for that many seconds so a client can reconnect
+to the same session instead of triggering a new `OpenSession` → worker spawn.
+Setting it to `0` reverts to closing only on normal lease expiry.
+
+A reconnecting client issues `StreamEvents` with `after_worker_sequence` set to
+the last sequence it observed; the gateway replays retained events newer than
+that watermark (capped by `MxGateway:Events:ReplayBufferCapacity` and
+`MxGateway:Events:ReplayRetentionSeconds`) then transitions seamlessly to live
+delivery. If the requested position precedes the oldest retained event, a
+`ReplayGap` sentinel signals the client to re-snapshot. The replay→live handoff
+is atomic (no gap, no duplicate). See [Sessions](./Sessions.md) for the full
+reconnect and replay protocol.
 
 ## Event Subscribers
 
-Decision: one active `StreamEvents` subscriber per session for v1.
+Multi-subscriber fan-out for data-side `StreamEvents` is shipped and
+config-gated. The original "one active subscriber per session" constraint is
+superseded for deployments that opt in.
 
-A second subscriber should be rejected with a clear session error. Multi-client
-fan-out may be added later with explicit backpressure semantics.
+`MxGateway:Sessions:AllowMultipleEventSubscribers` (default `false`) controls
+the mode. When `false` the session still rejects a second `StreamEvents`
+subscriber with `EventSubscriberAlreadyActive`, preserving the original
+single-subscriber behavior. When `true`, up to
+`MxGateway:Sessions:MaxEventSubscribersPerSession` (default `8`) concurrent
+external subscribers may attach; a new attach that would exceed the cap is
+rejected with `EventSubscriberLimitReached`. The count-check-and-increment is
+atomic under the session lock.
 
-Rationale: one subscriber preserves simple event ordering and failure behavior
-while parity is being proven.
+Failure semantics differ by mode: in single-subscriber mode a slow consumer's
+channel overflow faults the whole session (`FailFast` backpressure); in
+multi-subscriber mode the same condition disconnects only that subscriber so one
+slow consumer never faults a session shared by others. The mode is fixed at
+session construction and is not changed by a live subscriber-count snapshot.
 
-### Alarms — superseded for the alarm subsystem
+The gateway-owned internal dashboard mirror subscribes directly on the
+distributor with `isInternal: true` and is not counted toward the cap or the
+detach-grace subscriber-count in either mode.
 
-The single-subscriber rule above no longer applies to alarms. The gateway runs
-an always-on central alarm monitor (`GatewayAlarmMonitor`) that owns one
+See [Sessions](./Sessions.md) for the full event-distributor and backpressure
+design.
+
+### Alarms — separate fan-out architecture
+
+The single-subscriber rule never applied to alarms. The gateway runs an
+always-on central alarm monitor (`GatewayAlarmMonitor`) that owns one
 gateway-managed worker session, caches the active-alarm set, and fans it out to
-any number of clients through the session-less `StreamAlarms` RPC. Per-session
-alarm auto-subscribe is removed; `AcknowledgeAlarm` is session-less and routes
-through the monitor. Data-side `StreamEvents` remains one subscriber per
-session. Rationale: alarm state is gateway-wide, not session-scoped — every
-client wants the same current set plus updates, and forcing each to own a
-worker would multiply AVEVA polling load for no benefit.
+any number of clients through the session-less `StreamAlarms` RPC.
+`AcknowledgeAlarm` is session-less and routes through the monitor. Rationale:
+alarm state is gateway-wide, not session-scoped — every client wants the same
+current set plus updates, and forcing each to own a worker would multiply AVEVA
+polling load for no benefit.
 
 ## Authentication
 
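
The replay rule in the Session Reconnect text of this hunk (replay retained events newer than `after_worker_sequence`, or signal `ReplayGap` when the requested position precedes the oldest retained event) can be sketched roughly as below. This is an illustration in Python, not the gateway's implementation: `ReplayBufferSketch`, `Event`, and `replay_after` are hypothetical names, only a capacity cap is modelled (time-based retention is ignored), sequences are assumed to be contiguous integers, and returning only the sentinel on a gap is an assumption.

```python
from collections import deque
from dataclasses import dataclass

REPLAY_GAP = "ReplayGap"  # sentinel name from the doc text; its wire form is not shown there


@dataclass(frozen=True)
class Event:
    worker_sequence: int
    payload: str


class ReplayBufferSketch:
    """Hypothetical bounded buffer; the oldest events are evicted first."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)

    def append(self, event):
        self._events.append(event)

    def replay_after(self, after_worker_sequence):
        # Assumption: sequences are contiguous, so a gap exists when the event
        # right after the client's watermark has already been evicted.
        if self._events and after_worker_sequence + 1 < self._events[0].worker_sequence:
            return [REPLAY_GAP]
        # Otherwise return every retained event newer than the watermark.
        return [e for e in self._events if e.worker_sequence > after_worker_sequence]
```

With a capacity of 3 and events 1 through 5 appended, events 3, 4, and 5 are retained: a client that last saw sequence 3 gets 4 and 5, while a client that last saw sequence 1 gets the gap sentinel and would re-snapshot.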
@@ -467,12 +497,21 @@ against the live MXAccess attribute set.
 
 These are explicit post-v1 revisit items, not open blockers:
 
-- reconnectable sessions,
-- multiple event subscribers per session,
 - restricted worker service account,
 - production coalescing by item handle,
 - command batching for high-volume tag setup.
 
+The following items were previously listed here and have since shipped:
+
+- **Reconnectable sessions with replay** — shipped, config-gated via
+`MxGateway:Sessions:DetachGraceSeconds` and
+`MxGateway:Events:ReplayBufferCapacity` / `ReplayRetentionSeconds`.
+See [Session Reconnect](#session-reconnect) above and [Sessions](./Sessions.md).
+- **Multiple event subscribers per session** — shipped, config-gated via
+`MxGateway:Sessions:AllowMultipleEventSubscribers` and
+`MxGateway:Sessions:MaxEventSubscribersPerSession`.
+See [Event Subscribers](#event-subscribers) above and [Sessions](./Sessions.md).
+
 ## Related Documentation
 
 - [Gateway Process Detailed Design](./GatewayProcessDesign.md)
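
The subscriber admission rules documented in this file (`AllowMultipleEventSubscribers`, `MaxEventSubscribersPerSession`, and the uncounted internal dashboard mirror) might be modelled along the following lines. This is only a Python sketch under stated assumptions: `SubscriberAdmissionSketch`, `try_attach`, and `detach` are hypothetical names, the error codes are returned as plain strings, and the real session code may be structured differently.

```python
import threading


class SubscriberAdmissionSketch:
    """Hypothetical model of per-session subscriber admission."""

    def __init__(self, allow_multiple=False, max_subscribers=8):
        # Defaults mirror the documented defaults (false and 8).
        self._lock = threading.Lock()
        self._allow_multiple = allow_multiple  # fixed at construction
        self._max_subscribers = max_subscribers
        self._external_count = 0

    def try_attach(self, is_internal=False):
        """Return None on success, or an error-code string on rejection."""
        # The count check and the increment happen under one lock, so two
        # concurrent attaches cannot both pass the check.
        with self._lock:
            if is_internal:
                return None  # internal mirror is not counted toward the cap
            if not self._allow_multiple:
                if self._external_count >= 1:
                    return "EventSubscriberAlreadyActive"
            elif self._external_count >= self._max_subscribers:
                return "EventSubscriberLimitReached"
            self._external_count += 1
            return None

    def detach(self):
        with self._lock:
            if self._external_count > 0:
                self._external_count -= 1
```

Holding the lock across both the check and the increment is what makes the cap exact; a check followed by a separate increment could admit one subscriber too many under contention.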
+12
-4
@@ -51,7 +51,7 @@ shutdown request even when a command or event assertion fails. Cleanup failures
 in that `finally` block are logged rather than thrown, so a real assertion
 failure is never masked by a shutdown timeout.
 
-`WorkerLiveMxAccessSmokeTests` additionally covers five MXAccess parity paths the
+`WorkerLiveMxAccessSmokeTests` additionally covers seven MXAccess parity paths the
 fake-worker tests cannot validate:
 
 - a `Write` round-trip against an advised item, asserting both that the reply is
@@ -67,13 +67,21 @@ fake-worker tests cannot validate:
 - a `WriteSecured` round-trip after `AuthenticateUser`, asserting the reply
 carries `MxCommandKind.WriteSecured` and the credential password never
 appears in the diagnostic message (parity for both the secured-write
-ordering rule and the "do not log secrets" contract), and
+ordering rule and the "do not log secrets" contract),
 - an abnormal worker exit (the worker process is killed mid-session) where the
 gateway must transition the session to `SessionState.Faulted` with a
 non-empty fault description carrying a known worker-client classification
-(pipe disconnected / worker faulted / end-of-stream / heartbeat expired).
+(pipe disconnected / worker faulted / end-of-stream / heartbeat expired),
+- the B8 new COM commands — `AuthenticateUser`, `ArchestrAUserToId`, `Suspend`,
+and `Activate` — each asserting a real MXAccess reply (not `InvalidRequest`)
+is returned against an added-but-not-advised item, and
+- the buffered-data path — `AddBufferedItem` and `SetBufferedUpdateInterval` —
+asserting the commands round-trip and that the worker delivers at least one
+`OnBufferedDataChange` event (the empty NoData bootstrap) without crashing
+or dropping frames; live §3.2 multi-sample conversion is noted as a residual
+when the rig does not drive sample-bearing buffered batches on demand.
 
-All six tests are gated by the same `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`
+All eight tests are gated by the same `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`
 opt-in variable.
 
 Build the worker before running the smoke:
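
The opt-in gating mentioned in this hunk (live tests run only when `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`) follows a common pattern that can be sketched as below. The gateway's own tests are .NET; this Python version only illustrates the idea, and `live_tests_enabled` and `LiveSmokeSketch` are hypothetical names.

```python
import os
import unittest

LIVE_FLAG = "MXGATEWAY_RUN_LIVE_MXACCESS_TESTS"


def live_tests_enabled(environ=None):
    """Return True only when the opt-in variable is exactly "1"."""
    env = os.environ if environ is None else environ
    return env.get(LIVE_FLAG) == "1"


@unittest.skipUnless(live_tests_enabled(), "set %s=1 to run live tests" % LIVE_FLAG)
class LiveSmokeSketch(unittest.TestCase):
    """Placeholder test class; it is skipped unless the flag is set."""

    def test_flag_is_set(self):
        self.assertTrue(live_tests_enabled())
```

Sharing one flag across every live test keeps a default run free of rig dependencies while letting a single environment setting enable the whole live suite.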