docs(audit): apply per-cluster judgment fixes across living docs

Resolve audit findings: correct WorkerEnvelope proto/route/metric/session
facts; rewrite auth (ZB.MOM.WW.Auth migration), dashboard (ZB.MOM.WW.Theme),
and StyleGuide (foreign-project copy-paste); document alarm subsystem, Ldap
options, and gateway alarm broker; fix client CLI flags and package paths.
Joseph Doherty
2026-06-03 16:01:28 -04:00
parent f84e0c3474
commit e541339c07
29 changed files with 1102 additions and 432 deletions
+45 -11
@@ -4,9 +4,9 @@ The sessions subsystem owns the in-memory representation of an active gateway-to
## Overview
-A session is the gateway-side handle that callers use to invoke worker commands, stream worker events, and tear the worker down. The subsystem is split between the per-session state machine (`GatewaySession`), an in-memory directory (`SessionRegistry`), the orchestrator that opens and closes sessions (`SessionManager`), the worker construction step (`SessionWorkerClientFactory`), and a hosted service that drains sessions during host shutdown (`SessionShutdownHostedService`).
+A session is the gateway-side handle that callers use to invoke worker commands, stream worker events, and tear the worker down. The subsystem is split between the per-session state machine (`GatewaySession`), an in-memory directory (`SessionRegistry`), the orchestrator that opens and closes sessions (`SessionManager`), the worker construction step (`SessionWorkerClientFactory`), a hosted service that sweeps expired leases (`SessionLeaseMonitorHostedService`), and a hosted service that drains sessions during host shutdown (`SessionShutdownHostedService`).
-All four interfaces (`ISessionManager`, `ISessionRegistry`, `ISessionWorkerClientFactory`) plus `SessionShutdownHostedService` are wired as singletons by `SessionServiceCollectionExtensions.AddGatewaySessions`.
+The three interfaces (`ISessionManager`, `ISessionRegistry`, `ISessionWorkerClientFactory`) are wired as singletons, and both hosted services (`SessionLeaseMonitorHostedService`, `SessionShutdownHostedService`) are registered, by `SessionServiceCollectionExtensions.AddGatewaySessions`. The startup orphan-worker cleanup that runs before any session opens lives in the worker subsystem (`OrphanWorkerCleanupHostedService`); see [Gateway Restart and Orphan Cleanup](#gateway-restart-and-orphan-cleanup).
## Key Types
@@ -18,6 +18,8 @@ The session id is an opaque string in the form `session-{guid:N}` and the per-se
`SessionState` itself is the protobuf-generated enum from `ZB.MOM.WW.MxGateway.Contracts.Proto`, so it is shared between the gateway and clients on the wire.
+`GatewaySession` also keeps an `_items` dictionary keyed by `(ServerHandle, ItemHandle)` mapping each subscribed item to its `SessionItemRegistration` (server handle, item handle, tag address). It is the gateway-side shadow of the items the worker has added, populated as `AddItem`-style commands succeed and pruned on `RemoveItem`. The shadow exists so the gateway can answer item lookups and clean up subscriptions without round-tripping the worker; the worker remains authoritative for the handles themselves (see [gateway.md](../gateway.md)).
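The shadow described above can be pictured with a small sketch. Everything here is hypothetical: the type names, the `int` handle types, and the choice of `ConcurrentDictionary` are assumptions for illustration, not the gateway's actual `_items` implementation.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for SessionItemRegistration; handle types are assumed to be int.
public sealed record ItemRegistrationSketch(int ServerHandle, int ItemHandle, string TagAddress);

// Hypothetical gateway-side shadow of worker items, keyed by (ServerHandle, ItemHandle).
public sealed class ItemShadowSketch
{
    private readonly ConcurrentDictionary<(int ServerHandle, int ItemHandle), ItemRegistrationSketch> _items = new();

    // Record an item only after the worker has reported that the add succeeded.
    public void RecordAdded(ItemRegistrationSketch registration) =>
        _items[(registration.ServerHandle, registration.ItemHandle)] = registration;

    // Prune an item after removal; returns false when the item was not tracked.
    public bool RecordRemoved(int serverHandle, int itemHandle) =>
        _items.TryRemove((serverHandle, itemHandle), out _);

    // Answer a lookup from the shadow, with no worker round trip.
    public bool TryGet(int serverHandle, int itemHandle, out ItemRegistrationSketch? registration) =>
        _items.TryGetValue((serverHandle, itemHandle), out registration);
}
```

The sketch only mirrors what the worker has already confirmed, which matches the note that the worker stays authoritative for the handles.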
```csharp
public void TransitionTo(SessionState nextState)
{
@@ -54,7 +56,7 @@ public void TransitionTo(SessionState nextState)
`CloseSessionAsync` and `KillWorkerAsync` are both end-of-life paths but differ in what they offer the worker:
- `CloseSessionAsync` is the graceful path: it calls `GatewaySession.CloseAsync`, which asks the worker to shut down via `IWorkerClient.ShutdownAsync` and only kills the process as a fallback if shutdown fails.
-- `KillWorkerAsync` is the forceful path used by the dashboard's admin Kill button: it calls `GatewaySession.KillWorker` directly, which kills the worker process immediately with no graceful-shutdown attempt and transitions the session to `Closed`.
+- `KillWorkerAsync` is the forceful path used by the dashboard's admin Kill button: it calls `GatewaySession.KillWorkerWithCloseGateAsync`, which kills the worker process immediately with no graceful-shutdown attempt and transitions the session to `Closed`. Routing through `KillWorkerWithCloseGateAsync` (rather than the bare `GatewaySession.KillWorker`) acquires the per-session `_closeLock` so a kill and an in-flight graceful close serialize on the same "was the session already closed" observation that drives metric accounting; the method returns that observation so `KillWorkerAsync` increments `mxgateway.sessions.closed` at most once across concurrent callers.
Both paths converge on the same registry/metrics cleanup, so the open-session slot is released and `mxgateway.sessions.closed` is incremented either way.
@@ -99,6 +101,8 @@ if (exception is OperationCanceledException
The named pipe is created with `maxNumberOfServerInstances: 1` so a second worker cannot connect to the same pipe name even if the first launch is still pending. Combined with the per-session nonce passed to the worker, this is the gateway's defense against a foreign process answering a pipe.
+The factory also seeds the worker client's `MaxPendingCommands` from `MxGateway:Sessions:MaxPendingCommandsPerSession` (default 128, validated `> 0` at startup). This caps how many commands can be in flight to a single worker at once; the `WorkerClient` rejects an enqueue past the cap and records `mxgateway.queues.overflows` tagged `worker-pending-commands`. The bound exists because the worker executes commands serially on one STA — an unbounded backlog would only grow memory and latency, not throughput.
### SessionShutdownHostedService
`SessionShutdownHostedService` is an `IHostedService` whose only job is to call `ISessionManager.ShutdownAsync` from `StopAsync`. It catches `OperationCanceledException` triggered by the host shutdown timeout and logs a warning so that an over-running shutdown does not surface as an unhandled exception.
@@ -172,6 +176,14 @@ catch (Exception exception)
await session.DisposeAsync().ConfigureAwait(false);
}
+// If SessionOpened() already incremented the open-session gauge,
+// a failure after that point (e.g. auto-subscribe rejection) must
+// decrement it again so mxgateway.sessions.open does not leak.
+if (sessionOpenedRecorded)
+{
+_metrics.SessionRemoved();
+}
ReleaseSessionSlot();
_metrics.Fault(SessionManagerErrorCode.OpenFailed.ToString());
_logger.LogWarning(
@@ -186,7 +198,7 @@ catch (Exception exception)
}
```
-The order — fault, deregister, dispose, release slot, record metric, log, rethrow — matters because releasing the semaphore before disposal would let the next open race the worker process tear-down on the same machine.
+The order — fault, deregister, dispose, conditionally decrement the open-session gauge, release slot, record fault metric, log, rethrow — matters because releasing the semaphore before disposal would let the next open race the worker process tear-down on the same machine. The `SessionRemoved()` call is conditional on `sessionOpenedRecorded` (Server-006): a failure *after* `SessionOpened()` already incremented `mxgateway.sessions.open` (for example, an auto-subscribe rejection) must decrement the gauge so it does not leak, but a failure before that point must not.
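To make the conditional decrement concrete, here is a minimal sketch under stated assumptions. `OpenGaugeSketch` and `OpenFlowSketch` are hypothetical stand-ins, and the two boolean parameters only simulate where the open pipeline throws; the real `SessionManager` open path does more (deregister, dispose, release the slot, log).

```csharp
using System;

// Hypothetical stand-in for the gateway metrics: an open-session gauge and a fault counter.
public sealed class OpenGaugeSketch
{
    public int Open { get; private set; }
    public int Faults { get; private set; }
    public void SessionOpened() => Open++;
    public void SessionRemoved() => Open--;
    public void Fault() => Faults++;
}

public static class OpenFlowSketch
{
    // failBeforeOpened / failAfterOpened simulate a failure on either side of SessionOpened().
    public static void Open(OpenGaugeSketch metrics, bool failBeforeOpened, bool failAfterOpened)
    {
        bool sessionOpenedRecorded = false;
        try
        {
            if (failBeforeOpened) throw new InvalidOperationException("worker launch failed");

            metrics.SessionOpened();
            sessionOpenedRecorded = true;

            if (failAfterOpened) throw new InvalidOperationException("auto-subscribe rejected");
        }
        catch (Exception)
        {
            // Undo the gauge increment only if it actually happened.
            if (sessionOpenedRecorded)
            {
                metrics.SessionRemoved();
            }

            metrics.Fault();
            throw;
        }
    }
}
```

In this sketch a failure on either side of `SessionOpened()` leaves the gauge where it started, while a successful open leaves it incremented by one.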
### Run
@@ -194,6 +206,8 @@ While `Ready`, callers reach the worker through `SessionManager.InvokeAsync` or
Event streaming uses `AttachEventSubscriber` which returns a disposable lease. When `allowMultipleSubscribers` is false the second attach throws `EventSubscriberAlreadyActive`; this prevents two gRPC streams from racing on the same worker event channel. Active event subscribers keep the session lease from expiring until the stream is disposed.
+The single-subscriber rule is enforced at startup, not just at runtime: setting `MxGateway:Sessions:AllowMultipleEventSubscribers` to `true` is refused by `GatewayOptionsValidator` with "AllowMultipleEventSubscribers is not supported until event fan-out is implemented," so the gateway fails fast rather than booting in a configuration the event path cannot honor. Multi-subscriber fan-out is explicitly out of scope for v1 (see [Design Decisions](./DesignDecisions.md)).
Sessions open with `MxGateway:Sessions:DefaultLeaseSeconds` (default 1800) added to the open timestamp. Unary client activity refreshes the lease by the same duration. `ExtendLease` and `IsLeaseExpired` cooperate with `SessionManager.CloseExpiredLeasesAsync`, which iterates a registry snapshot and closes any session whose lease has expired with `LeaseExpiredReason`. `SessionLeaseMonitorHostedService` runs that sweep every `MxGateway:Sessions:LeaseSweepIntervalSeconds` seconds (default 30).
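The lease arithmetic can be sketched as follows. This is an illustration only: the method names echo `ExtendLease` and `IsLeaseExpired`, but their signatures, the `>=` expiry comparison, and the `FindExpired` helper are assumptions, and the sketch ignores the rule that active event subscribers keep a lease alive.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lease model: the expiry is always "reference time + fixed duration".
public sealed class LeaseSketch
{
    private readonly TimeSpan _duration;

    public DateTimeOffset ExpiresAt { get; private set; }

    public LeaseSketch(DateTimeOffset openedAt, TimeSpan duration)
    {
        _duration = duration;
        ExpiresAt = openedAt + duration;
    }

    // Client activity pushes the expiry out by the same duration, measured from "now".
    public void ExtendLease(DateTimeOffset now) => ExpiresAt = now + _duration;

    // Assumption: a lease counts as expired once "now" reaches the expiry instant.
    public bool IsLeaseExpired(DateTimeOffset now) => now >= ExpiresAt;
}

public static class LeaseSweepSketch
{
    // Walks a snapshot (not the live registry) and returns the ids whose lease has expired.
    public static List<string> FindExpired(IReadOnlyDictionary<string, LeaseSketch> snapshot, DateTimeOffset now)
    {
        var expired = new List<string>();
        foreach (var pair in snapshot)
        {
            if (pair.Value.IsLeaseExpired(now))
            {
                expired.Add(pair.Key);
            }
        }

        return expired;
    }
}
```

With the default of 1800 seconds, a session opened at `t0` and never touched becomes eligible for the sweep at `t0 + 1800s`; a refresh at `t0 + 1000s` moves that to `t0 + 2800s`.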
### Close
@@ -227,11 +241,11 @@ if (_workerClient is not null)
If both graceful shutdown and the kill fall-back fail, the original and kill exceptions are bundled into an `AggregateException` and surfaced as `SessionCloseStartedException`. `SessionManager.CloseSessionCoreAsync` then translates that into a `SessionManagerException` with `CloseFailed` and removes the session.
-`GatewaySession.KillWorker` is the unconditional forced-close path used by shutdown when graceful close itself throws, and also by `SessionManager.KillWorkerAsync` — the explicit kill path that the dashboard's admin Kill button invokes. `KillWorkerAsync` skips `WorkerClient.ShutdownAsync` entirely, so `KillCount` increments while `ShutdownCount` does not; the session is then removed from the registry and the open-session slot is released, identical to the cleanup that follows a successful `CloseSessionAsync`.
+`GatewaySession.KillWorker` is the unconditional forced-close path. `SessionManager.KillWorkerAsync` — the explicit kill path that the dashboard's admin Kill button invokes — no longer calls it directly; it routes through `GatewaySession.KillWorkerWithCloseGateAsync` so the kill takes the per-session `_closeLock`. That method skips `WorkerClient.ShutdownAsync` entirely and forces the worker process down via `IWorkerClient.Kill`, which records the `mxgateway.workers.killed` counter through `GatewayMetrics.WorkerKilled(reason)`. The session is then removed from the registry and the open-session slot is released, identical to the cleanup that follows a successful `CloseSessionAsync` (which increments `mxgateway.sessions.closed`). There is no separate `KillCount` / `ShutdownCount`: worker terminations are counted by `mxgateway.workers.killed` (tagged with the kill reason), and session closes by `mxgateway.sessions.closed`.
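The close-gate idea can be sketched roughly as below. The class names, the use of `SemaphoreSlim` for `_closeLock`, and the choice to return "was already closed" as a `bool` are all assumptions for illustration; the source only states that the method returns the observation that drives metric accounting.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical session with a per-session close gate.
public sealed class CloseGateSketch
{
    private readonly SemaphoreSlim _closeLock = new(1, 1);
    private bool _closed;

    // Counts how many times the (simulated) worker kill actually ran.
    public int KillCalls { get; private set; }

    // Returns true when the session was already closed before this call took the gate.
    public async Task<bool> KillWorkerWithCloseGateAsync(CancellationToken cancellationToken = default)
    {
        await _closeLock.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            bool wasAlreadyClosed = _closed;
            if (!wasAlreadyClosed)
            {
                KillCalls++;      // stands in for forcing the worker process down
                _closed = true;   // stands in for the transition to Closed
            }

            return wasAlreadyClosed;
        }
        finally
        {
            _closeLock.Release();
        }
    }
}

// Hypothetical manager-side accounting for the sessions-closed counter.
public sealed class KillAccountingSketch
{
    private int _sessionsClosed;

    public int SessionsClosed => Volatile.Read(ref _sessionsClosed);

    public async Task KillWorkerAsync(CloseGateSketch session)
    {
        bool wasAlreadyClosed = await session.KillWorkerWithCloseGateAsync().ConfigureAwait(false);
        if (!wasAlreadyClosed)
        {
            Interlocked.Increment(ref _sessionsClosed);
        }
    }
}
```

Because every caller reads the closed flag while holding the gate, only the first caller observes "not yet closed", so the counter moves once however many kills race.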
## Shutdown Coordination
-`SessionShutdownHostedService.StopAsync` calls `SessionManager.ShutdownAsync`, which closes every registered session with `GatewayShutdownReason`. The shutdown loop catches per-session exceptions, calls `KillWorker`, and removes the session so that one stuck worker cannot block the rest of the host:
+`SessionShutdownHostedService.StopAsync` calls `SessionManager.ShutdownAsync`, which closes every registered session with `GatewayShutdownReason`. The shutdown loop catches per-session exceptions and falls back to a forced kill so that one stuck worker cannot block the rest of the host. The fallback routes through `KillWorkerAsync` (not a bare `session.KillWorker`) so the kill takes the same close-gate and metric bookkeeping as the dashboard kill path (Server-046):
```csharp
public async Task ShutdownAsync(CancellationToken cancellationToken)
@@ -248,21 +262,40 @@ public async Task ShutdownAsync(CancellationToken cancellationToken)
exception,
"Graceful shutdown failed for session {SessionId}; killing worker.",
session.SessionId);
+// CloseSessionCoreAsync's inner SessionCloseStartedException catch normally
+// removes and accounts the session; this fallback only fires for sessions
+// still in the registry, and reuses KillWorkerAsync for identical bookkeeping.
if (_registry.TryGet(session.SessionId, out _))
{
-session.KillWorker(GatewayShutdownReason);
-await RemoveSessionAsync(session).ConfigureAwait(false);
+try
+{
+await KillWorkerAsync(session.SessionId, GatewayShutdownReason, cancellationToken).ConfigureAwait(false);
+}
+catch (SessionManagerException killException)
+{
+_logger.LogWarning(
+killException,
+"Worker kill fallback failed for session {SessionId}.",
+session.SessionId);
+}
}
}
}
}
```
-Iterating over `Snapshot` rather than the live dictionary lets `RemoveSessionAsync` mutate the registry inside the loop without throwing.
+Iterating over `Snapshot` rather than the live dictionary lets the registry mutate inside the loop without throwing.
+## Gateway Restart and Orphan Cleanup
+A graceful shutdown drains sessions through `ShutdownAsync`, but a gateway crash or `Kill` leaves no chance to tear workers down. Those orphaned worker processes outlive the gateway that launched them, still holding their MXAccess COM instance and their named pipe. Because the pipe name encodes the *old* gateway PID, a fresh gateway will never reconnect to them — v1 deliberately does not reattach orphan workers (see [Design Decisions](./DesignDecisions.md)).
+Instead, `OrphanWorkerCleanupHostedService` runs once on startup, before any session opens, and calls `OrphanWorkerTerminator.TerminateOrphans`. The terminator enumerates running processes matching the configured worker executable name, skips the current process, and kills any that it identifies as a leftover worker (matched against the configured executable path). Each kill records `mxgateway.workers.killed` tagged `OrphanStartupCleanup` and logs a warning. The sweep is best-effort: a failure to kill any one orphan (it may have already exited, or be inaccessible) is logged and swallowed so it cannot block gateway startup. This service lives in the worker subsystem, not the session subsystem, because it operates on OS processes rather than `GatewaySession` state.
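The best-effort sweep can be sketched without touching real OS processes. The types and the delegate-based shape below are hypothetical, and the case-insensitive path comparison is an assumption (plausible for Windows paths); the actual `OrphanWorkerTerminator.TerminateOrphans` signature is not shown in the source.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical view of a running process: its id and, when readable, its executable path.
public sealed record ProcessInfoSketch(int Pid, string? ExecutablePath);

public static class OrphanSweepSketch
{
    // Kills every candidate that matches the configured worker path, skipping the current process.
    // A failed kill is reported through logFailure and swallowed so the sweep keeps going.
    public static int TerminateOrphans(
        IEnumerable<ProcessInfoSketch> candidates,
        int currentPid,
        string workerExecutablePath,
        Action<int> kill,
        Action<int, Exception> logFailure)
    {
        int killed = 0;
        foreach (var candidate in candidates)
        {
            if (candidate.Pid == currentPid)
            {
                continue; // never terminate the gateway itself
            }

            if (!string.Equals(candidate.ExecutablePath, workerExecutablePath, StringComparison.OrdinalIgnoreCase))
            {
                continue; // same name but different path: not one of our workers
            }

            try
            {
                kill(candidate.Pid);
                killed++;
            }
            catch (Exception exception)
            {
                logFailure(candidate.Pid, exception); // best-effort: log and move on
            }
        }

        return killed;
    }
}
```

Passing the kill and log actions as delegates keeps the selection logic testable; a real implementation would presumably wrap `System.Diagnostics.Process` and record the kill metric inside the `kill` action.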
## Dependency Injection
-`SessionServiceCollectionExtensions.AddGatewaySessions` registers the four singletons and the hosted service:
+`SessionServiceCollectionExtensions.AddGatewaySessions` registers the three singletons and the two hosted services:
```csharp
public static IServiceCollection AddGatewaySessions(this IServiceCollection services)
@@ -270,13 +303,14 @@ public static IServiceCollection AddGatewaySessions(this IServiceCollection serv
services.AddSingleton<ISessionRegistry, SessionRegistry>();
services.AddSingleton<ISessionWorkerClientFactory, SessionWorkerClientFactory>();
services.AddSingleton<ISessionManager, SessionManager>();
+services.AddHostedService<SessionLeaseMonitorHostedService>();
services.AddHostedService<SessionShutdownHostedService>();
return services;
}
```
-The registry must be a singleton because its `ConcurrentDictionary` is the source of truth for session state across the gRPC service, the lease sweeper, the dashboard, and the shutdown hosted service. Registering `SessionShutdownHostedService` last ensures it is constructed after `ISessionManager` and therefore drains sessions during host stop.
+The registry must be a singleton because its `ConcurrentDictionary` is the source of truth for session state across the gRPC service, the lease sweeper, the dashboard, and the shutdown hosted service. `SessionLeaseMonitorHostedService` runs the periodic expired-lease sweep; `SessionShutdownHostedService` drains sessions during host stop. Both are registered after `ISessionManager` so they resolve the same singleton manager when the host starts; `SessionShutdownHostedService` is registered last so it is the latter of the two to be constructed and is available to drain sessions on stop.
## Related Documentation