Cluster 03 — Sessions/Runtime

Auditor: automated (claude-sonnet-4-6)
Date: 2026-06-03
Source doc: docs/Sessions.md
Verified against: src/ZB.MOM.WW.MxGateway.Server/Sessions/, src/ZB.MOM.WW.MxGateway.Server/Workers/


DOC / LINES / 9 CLAIM: "All four interfaces (ISessionManager, ISessionRegistry, ISessionWorkerClientFactory) plus SessionShutdownHostedService are wired as singletons by SessionServiceCollectionExtensions.AddGatewaySessions." CLAIM_TYPE: term VERDICT: wrong EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionServiceCollectionExtensions.cs:9-18. Only three interfaces exist (confirmed by listing I*.cs in Sessions/); the doc says "four interfaces" but names only three. The same method also registers SessionLeaseMonitorHostedService as a hosted service, which this sentence omits. CODE_AREA: session.di SEVERITY: medium PROPOSED_FIX: Change "All four interfaces" to "All three interfaces". Separately, note that two hosted services are registered: SessionLeaseMonitorHostedService and SessionShutdownHostedService.


DOC / LINES / 265-276 CLAIM: Code snippet for AddGatewaySessions shows only SessionShutdownHostedService registered; SessionLeaseMonitorHostedService is absent from the snippet. CLAIM_TYPE: behavior-rule VERDICT: stale EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionServiceCollectionExtensions.cs:14-15 — actual code registers both AddHostedService<SessionLeaseMonitorHostedService>() and AddHostedService<SessionShutdownHostedService>(). The snippet in the doc is missing the lease-monitor line. CODE_AREA: session.di SEVERITY: medium PROPOSED_FIX: Add services.AddHostedService<SessionLeaseMonitorHostedService>(); to the code snippet (between the ISessionManager singleton line and the shutdown service line).


DOC / LINES / 232-259 CLAIM: The ShutdownAsync code snippet shown calls session.KillWorker(GatewayShutdownReason) and await RemoveSessionAsync(session) directly in the catch block. CLAIM_TYPE: behavior-rule VERDICT: stale EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:296-331 — the actual ShutdownAsync fallback calls await KillWorkerAsync(session.SessionId, GatewayShutdownReason, cancellationToken) (which routes through KillWorkerWithCloseGateAsync and then RemoveSessionAsync), not a direct session.KillWorker + RemoveSessionAsync. The old snippet predates the Server-045/Server-046 refactor that unified the kill path through KillWorkerAsync. CODE_AREA: session.shutdown SEVERITY: medium PROPOSED_FIX: Replace the ShutdownAsync snippet with the current implementation, which checks _registry.TryGet then calls KillWorkerAsync (wrapped in its own try/catch) instead of directly calling session.KillWorker and RemoveSessionAsync.


DOC / LINES / 55-59 CLAIM: "KillWorkerAsync is the forceful path used by the dashboard's admin Kill button: it calls GatewaySession.KillWorker directly, which kills the worker process immediately with no graceful-shutdown attempt and transitions the session to Closed." CLAIM_TYPE: behavior-rule VERDICT: stale EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-264 — KillWorkerAsync now calls session.KillWorkerWithCloseGateAsync (not GatewaySession.KillWorker directly). The KillWorkerWithCloseGateAsync method acquires _closeLock before killing, serializing concurrent close/kill attempts (Server-045 fix). The old description of a direct KillWorker call is stale. CODE_AREA: session.lifecycle SEVERITY: medium PROPOSED_FIX: Update description to state that KillWorkerAsync calls session.KillWorkerWithCloseGateAsync, which acquires the per-session close lock before killing the worker, so concurrent close and kill callers serialize.
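
Illustration for the finding above (not gateway code): a minimal sketch of how a shared close lock serializes a graceful close and a forced kill. `GatedSession`, `CloseAsync`, `KillWithCloseGateAsync`, and `KillCalls` are hypothetical stand-ins. The audit does not say whether the real `KillWorkerWithCloseGateAsync` short-circuits when the session is already closed; the sketch assumes it does.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for GatewaySession; only the locking shape is shown.
public sealed class GatedSession
{
    // One permit: close and kill callers pass through one at a time.
    private readonly SemaphoreSlim _closeLock = new SemaphoreSlim(1, 1);
    private bool _closed;

    // Counts how often the terminal kill action actually ran.
    public int KillCalls { get; private set; }

    // Graceful path. Returns true only for the caller that performed the close.
    public async Task<bool> CloseAsync(CancellationToken cancellationToken = default)
    {
        await _closeLock.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            if (_closed) return false;
            _closed = true;
            return true;
        }
        finally
        {
            _closeLock.Release();
        }
    }

    // Forced path. Takes the same lock before running the terminal kill action.
    public async Task<bool> KillWithCloseGateAsync(CancellationToken cancellationToken = default)
    {
        await _closeLock.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            if (_closed) return false; // assumption: a completed close wins
            KillCalls++;               // stands in for the bare KillWorker call
            _closed = true;
            return true;
        }
        finally
        {
            _closeLock.Release();
        }
    }
}
```

Because both methods wait on the same semaphore, exactly one caller performs the terminal transition; the other sees the closed flag and returns without touching the worker.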


DOC / LINES / 59 CLAIM: "Both paths converge on the same registry/metrics cleanup, so the open-session slot is released and mxgateway.sessions.closed is incremented either way." CLAIM_TYPE: config-key VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs:59 — counter name mxgateway.sessions.closed confirmed. Both CloseSessionCoreAsync and KillWorkerAsync call _metrics.SessionClosed() and RemoveSessionAsync (which calls ReleaseSessionSlot). CODE_AREA: session.metrics SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 60-72 CLAIM: Code snippet for EnsureSessionCapacity throws SessionManagerException with SessionLimitExceeded; open requests that exceed the bound "throw ... rather than queuing". CLAIM_TYPE: behavior-rule VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:388-396 — _sessionSlots.Wait(0) (zero timeout = non-blocking) confirms the no-queue, immediate-throw behavior. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 61 CLAIM: "Concurrency is bounded by a SemaphoreSlim initialized to GatewayOptions.Sessions.MaxSessions." CLAIM_TYPE: config-key VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:53 — new SemaphoreSlim(_options.Sessions.MaxSessions, _options.Sessions.MaxSessions). CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only
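
Illustration for the two capacity findings above (not gateway code): a minimal sketch of the non-blocking slot check using `SemaphoreSlim.Wait(0)`. `SessionSlots` and `SessionLimitExceededException` are hypothetical names; the gateway throws `SessionManagerException` with `SessionLimitExceeded` instead.

```csharp
using System;
using System.Threading;

// Hypothetical exception type used only by this sketch.
public sealed class SessionLimitExceededException : Exception
{
    public SessionLimitExceededException(int maxSessions)
        : base($"Session limit of {maxSessions} reached.") { }
}

public sealed class SessionSlots : IDisposable
{
    private readonly SemaphoreSlim _slots;
    private readonly int _maxSessions;

    public SessionSlots(int maxSessions)
    {
        _maxSessions = maxSessions;
        // initialCount == maxCount: every slot starts free, and an extra
        // Release beyond the cap throws SemaphoreFullException.
        _slots = new SemaphoreSlim(maxSessions, maxSessions);
    }

    // Wait(0) makes a single attempt and returns false immediately when no
    // slot is free, so callers are rejected instead of queued.
    public void EnsureCapacity()
    {
        if (!_slots.Wait(0))
        {
            throw new SessionLimitExceededException(_maxSessions);
        }
    }

    // Returns a slot after a session is removed.
    public void ReleaseSlot() => _slots.Release();

    public void Dispose() => _slots.Dispose();
}
```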


DOC / LINES / 75 CLAIM: "three close-reason constants — DefaultCloseReason (\"client-close\"), GatewayShutdownReason (\"gateway-shutdown\"), and LeaseExpiredReason (\"lease-expired\")" CLAIM_TYPE: term VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:17-19 — all three constants confirmed with exact string values. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 79-81 CLAIM: "SessionRegistry is a thin wrapper over a ConcurrentDictionary<string, GatewaySession> keyed by session id with StringComparer.Ordinal." CLAIM_TYPE: term VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionRegistry.cs:12 — new ConcurrentDictionary<string, GatewaySession>(StringComparer.Ordinal) confirmed. CODE_AREA: session.registry SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 81 CLAIM: "ActiveCount filters out sessions whose state is Closed" CLAIM_TYPE: behavior-rule VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionRegistry.cs:22 — _sessions.Values.Count(session => session.State is not SessionState.Closed) confirmed. CODE_AREA: session.registry SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 15-19 CLAIM: "The session id is an opaque string in the form session-{guid:N} and the per-session pipe name is mxaccess-gateway-{ProcessId}-{SessionId}." CLAIM_TYPE: term VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:433 (pipeName = $"mxaccess-gateway-{Environment.ProcessId}-{sessionId}") and :479 ($"session-{Guid.NewGuid():N}"). CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only
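
Illustration for the finding above (not gateway code): the two interpolated strings quoted in the evidence, wrapped in a hypothetical `SessionNames` helper so the resulting shapes are easy to inspect.

```csharp
using System;

// Hypothetical helper; the gateway builds these strings inside SessionManager.
public static class SessionNames
{
    // "N" formats the GUID as 32 hexadecimal digits with no hyphens or braces.
    public static string NewSessionId() => $"session-{Guid.NewGuid():N}";

    // The gateway process id keeps pipe names from two gateway processes apart.
    public static string PipeNameFor(string sessionId) =>
        $"mxaccess-gateway-{Environment.ProcessId}-{sessionId}";
}
```

A generated id is 40 characters long: the 8-character `session-` prefix plus 32 hexadecimal digits.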


DOC / LINES / 19 CLAIM: "SessionState itself is the protobuf-generated enum from ZB.MOM.WW.MxGateway.Contracts.Proto" CLAIM_TYPE: term VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:1 — using ZB.MOM.WW.MxGateway.Contracts.Proto; and the state field is typed SessionState. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 85-87 CLAIM: "SessionWorkerClientFactory.CreateAsync … drives the session through the protobuf SessionState substates in order: StartingWorker, WaitingForPipe, Handshaking, InitializingWorker." CLAIM_TYPE: behavior-rule VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionWorkerClientFactory.cs:60-105 calls TransitionTo(SessionState.StartingWorker), TransitionTo(SessionState.WaitingForPipe), TransitionTo(SessionState.Handshaking), and TransitionTo(SessionState.InitializingWorker) in sequence. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 87-98 CLAIM: Startup timeout wrapped as TimeoutException with the exact catch pattern shown — OperationCanceledException where startupCancellation.IsCancellationRequested and !cancellationToken.IsCancellationRequested. CLAIM_TYPE: behavior-rule VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionWorkerClientFactory.cs:145-153 — identical predicate confirmed. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only
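
Illustration for the finding above (not gateway code): a minimal sketch of the predicate the evidence describes, where a linked token source separates "startup timed out" from "the caller cancelled". `StartupTimeout.RunAsync` and its parameters are hypothetical; only the exception filter mirrors the audited code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class StartupTimeout
{
    // Runs startup under a timeout layered on top of the caller's token.
    public static async Task RunAsync(
        Func<CancellationToken, Task> startup,
        TimeSpan timeout,
        CancellationToken cancellationToken)
    {
        // The linked source cancels when the caller cancels or the timer fires.
        using var startupCancellation =
            CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        startupCancellation.CancelAfter(timeout);

        try
        {
            await startup(startupCancellation.Token).ConfigureAwait(false);
        }
        // Linked source cancelled while the caller's token is not: only the
        // timeout can explain that, so the cancellation is reported as one.
        catch (OperationCanceledException) when (
            startupCancellation.IsCancellationRequested &&
            !cancellationToken.IsCancellationRequested)
        {
            throw new TimeoutException($"Worker startup exceeded {timeout}.");
        }
    }
}
```

When the caller's own token is cancelled, the filter is false and the original `OperationCanceledException` propagates unchanged.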


DOC / LINES / 100 CLAIM: "The named pipe is created with maxNumberOfServerInstances: 1" CLAIM_TYPE: behavior-rule VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionWorkerClientFactory.cs:166 — maxNumberOfServerInstances: 1 confirmed. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 104 CLAIM: "SessionShutdownHostedService … catches OperationCanceledException triggered by the host shutdown timeout and logs a warning so that an over-running shutdown does not surface as an unhandled exception." CLAIM_TYPE: behavior-rule VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionShutdownHostedService.cs:18-28 — exact catch confirmed. CODE_AREA: session.shutdown SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 109-127 CLAIM: SessionOpenRequest is a sealed record with fields RequestedBackend, ClientSessionName, ClientCorrelationId, CommandTimeout, and a FromContract factory. CLAIM_TYPE: term VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionOpenRequest.cs:6-24 — confirmed. Note: the doc snippet includes a ClientCorrelationId field in the record definition, but the actual SessionManager.CreateSession derives clientCorrelationId internally rather than forwarding the field from the request. This is a minor mismatch between what the record holds vs. how it is used, but does not constitute an error in the doc's description of the record type itself. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 134-139 CLAIM: SessionCloseResult is a sealed record with SessionId, FinalState, AlreadyClosed. CLAIM_TYPE: term VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionCloseResult.cs:5-8 — confirmed. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 143 CLAIM: "SessionCloseStartedException is internal" CLAIM_TYPE: term VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionCloseStartedException.cs:3 — internal sealed class SessionCloseStartedException confirmed. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 148-157 CLAIM: Error code table for SessionManagerException — seven codes listed: SessionNotFound, SessionNotReady, EventSubscriberAlreadyActive, EventQueueOverflow, SessionLimitExceeded, OpenFailed, CloseFailed. CLAIM_TYPE: term VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManagerErrorCode.cs:1-12 — all seven members confirmed in order. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 163-188 CLAIM: Open failure rollback order: "fault, deregister, dispose, release slot, record metric, log, rethrow". CLAIM_TYPE: behavior-rule VERDICT: stale EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:97-123 — actual order is: MarkFaulted → TryRemove (deregister) → DisposeAsync → (conditionally) SessionRemoved metric if sessionOpenedRecorded → ReleaseSessionSlot → Fault metric → LogWarning → rethrow. The doc omits the sessionOpenedRecorded conditional SessionRemoved() call that was added in the Server-006 fix, making the described order incomplete. The doc text says "release slot, record metric" but the actual code calls SessionRemoved before ReleaseSessionSlot when sessionOpenedRecorded is true. CODE_AREA: session.lifecycle SEVERITY: medium PROPOSED_FIX: Update the rollback description to note the conditional SessionRemoved() metric call that precedes ReleaseSessionSlot when SessionOpened() was already recorded (guards against mxgateway.sessions.open gauge leak on late failures such as auto-subscribe rejection).


DOC / LINES / 193-195 CLAIM: "GatewaySession also exposes typed bulk helpers (AddItemBulkAsync, SubscribeBulkAsync, etc.) that wrap WorkerCommand round-trips and translate non-Ok ProtocolStatus replies into SessionManagerException with SessionNotReady." CLAIM_TYPE: term VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:490, 590 (AddItemBulkAsync, SubscribeBulkAsync) and :1017-1023 (ProtocolStatusCode.Ok guard throwing SessionManagerException(SessionNotReady)). CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 195-197 CLAIM: "Event streaming uses AttachEventSubscriber which returns a disposable lease. When allowMultipleSubscribers is false the second attach throws EventSubscriberAlreadyActive; this prevents two gRPC streams from racing on the same worker event channel. Active event subscribers keep the session lease from expiring until the stream is disposed." CLAIM_TYPE: behavior-rule VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:387-407 (AttachEventSubscriber guard and lease) and :373-380 (IsLeaseExpired checks _activeEventSubscriberCount == 0). CODE_AREA: session.subscriber SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 197 CLAIM: "Sessions open with MxGateway:Sessions:DefaultLeaseSeconds (default 1800)" CLAIM_TYPE: config-key VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs:21 — public int DefaultLeaseSeconds { get; init; } = 1800. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only


DOC / LINES / 197 CLAIM: "SessionLeaseMonitorHostedService runs that sweep every MxGateway:Sessions:LeaseSweepIntervalSeconds seconds (default 30)." CLAIM_TYPE: config-key VERDICT: accurate EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs:24 — public int LeaseSweepIntervalSeconds { get; init; } = 30; src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionLeaseMonitorHostedService.cs:19 — TimeSpan.FromSeconds(Math.Max(1, options.Value.Sessions.LeaseSweepIntervalSeconds)). CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: flag only
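
Illustration for the lease findings above (not gateway code): a minimal sketch of the sweep-interval clamp quoted in the evidence, plus an expiry check that treats active event subscribers as keeping the lease alive. `LeaseMath` is a hypothetical helper, and the `now >= leaseExpiresAt` comparison is an assumption; the audit confirms only that expiry requires a subscriber count of zero.

```csharp
using System;

public static class LeaseMath
{
    // Defaults confirmed by the audit in SessionOptions.
    public const int DefaultLeaseSeconds = 1800;
    public const int DefaultLeaseSweepIntervalSeconds = 30;

    // Math.Max(1, ...) guards against a zero or negative configured value,
    // so the sweep never runs more often than once per second.
    public static TimeSpan SweepInterval(int leaseSweepIntervalSeconds) =>
        TimeSpan.FromSeconds(Math.Max(1, leaseSweepIntervalSeconds));

    // A session with an attached event subscriber is never reported as
    // expired, regardless of the deadline.
    public static bool IsLeaseExpired(
        DateTimeOffset leaseExpiresAt,
        DateTimeOffset now,
        int activeEventSubscriberCount) =>
        activeEventSubscriberCount == 0 && now >= leaseExpiresAt;
}
```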


DOC / LINES / 230 CLAIM: "GatewaySession.KillWorker is the unconditional forced-close path used by shutdown when graceful close itself throws, and also by SessionManager.KillWorkerAsync — the explicit kill path that the dashboard's admin Kill button invokes." CLAIM_TYPE: behavior-rule VERDICT: stale EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:233 — KillWorkerAsync now calls session.KillWorkerWithCloseGateAsync (not session.KillWorker). The shutdown fallback (line 319) also routes through KillWorkerAsync rather than calling session.KillWorker + RemoveSessionAsync directly. GatewaySession.KillWorker is still present (line 874) but is no longer the entry point from SessionManager.KillWorkerAsync. CODE_AREA: session.lifecycle SEVERITY: medium PROPOSED_FIX: Update to reflect that SessionManager.KillWorkerAsync delegates to session.KillWorkerWithCloseGateAsync (which serializes concurrent kill/close via _closeLock — Server-045 fix) and that GatewaySession.KillWorker is now only the internal terminal action inside KillWorkerWithCloseGateAsync.


DOC / LINES / 230 CLAIM: "KillCount increments while ShutdownCount does not" CLAIM_TYPE: term VERDICT: wrong EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs:56-79 — no metrics named KillCount or ShutdownCount exist. The actual worker-kill metric is mxgateway.workers.killed (counter). The doc invents non-existent metric names. CODE_AREA: session.metrics SEVERITY: high PROPOSED_FIX: Replace "KillCount increments while ShutdownCount does not" with "the mxgateway.workers.killed counter is incremented (via GatewayMetrics.WorkerKilled) while the graceful-shutdown path does not increment it".


DOC / LINES / 265 CLAIM: "registers the four singletons and the hosted service" (singular "the hosted service") CLAIM_TYPE: behavior-rule VERDICT: wrong EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionServiceCollectionExtensions.cs:14-15 — two hosted services are registered: SessionLeaseMonitorHostedService and SessionShutdownHostedService. CODE_AREA: session.di SEVERITY: medium PROPOSED_FIX: Change "registers the four singletons and the hosted service" to "registers the three singletons and two hosted services (SessionLeaseMonitorHostedService, SessionShutdownHostedService)".


DOC / LINES / 279 CLAIM: "Registering SessionShutdownHostedService last ensures it is constructed after ISessionManager and therefore drains sessions during host stop." CLAIM_TYPE: behavior-rule VERDICT: stale EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionServiceCollectionExtensions.cs:14-15. SessionLeaseMonitorHostedService is now registered before SessionShutdownHostedService, so the shutdown service is still the last of the two hosted services. The doc's reasoning no longer fully applies: registration order does not decide when ISessionManager is constructed, because the DI container creates the singleton when it is first resolved as a dependency. What registration order does influence is hosted-service sequencing; by default the .NET generic host starts hosted services in registration order and stops them in reverse. CODE_AREA: session.di SEVERITY: low PROPOSED_FIX: Update to note that two hosted services are registered in order (lease monitor first, shutdown second) and that both depend on ISessionManager which is registered as a singleton.


DOC / LINES / (none — gap) CLAIM: (gap) GatewaySession holds an item registration dictionary (_items, keyed by (ServerHandle, ItemHandle)) tracking all successfully added/subscribed items. The session tracks and prunes these registrations via TrackCommandReply, TryGetItemRegistration, and the per-command TrackItem/RemoveItems helpers. This bookkeeping is undocumented. CLAIM_TYPE: behavior-rule VERDICT: gap EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:17 (_items field), :425-481 (TrackCommandReply), :1059-1090 (TrackItem, TrackBulkItems, RemoveItems). src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionItemRegistration.cs:3 (SessionItemRegistration record). CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: Add a subsection or paragraph noting that GatewaySession maintains an in-session item registry keyed by (ServerHandle, ItemHandle), updated after successful AddItem, AddItem2, AddBufferedItem, AddItemBulk, SubscribeBulk, RemoveItem, RemoveItemBulk, and UnsubscribeBulk replies.
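
Illustration for the gap above (not gateway code): a minimal sketch of a registry keyed by a `(ServerHandle, ItemHandle)` tuple. `ItemRegistry`, `ItemRegistration`, the `int` handle types, the `ItemName` field, and the choice of `ConcurrentDictionary` are all assumptions; the audit names only the `_items` field, its key shape, and the `SessionItemRegistration` record.

```csharp
#nullable enable
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical record shape; the real SessionItemRegistration may differ.
public sealed record ItemRegistration(int ServerHandle, int ItemHandle, string ItemName);

public sealed class ItemRegistry
{
    // Value tuples compare by component, so they work directly as keys.
    private readonly ConcurrentDictionary<(int ServerHandle, int ItemHandle), ItemRegistration> _items =
        new ConcurrentDictionary<(int ServerHandle, int ItemHandle), ItemRegistration>();

    public int Count => _items.Count;

    // Called after a successful add or subscribe reply; last write wins.
    public void Track(ItemRegistration registration) =>
        _items[(registration.ServerHandle, registration.ItemHandle)] = registration;

    public bool TryGet(int serverHandle, int itemHandle, out ItemRegistration? registration) =>
        _items.TryGetValue((serverHandle, itemHandle), out registration);

    // Called after a successful remove or unsubscribe reply; unknown handles
    // are ignored rather than treated as errors.
    public void Remove(int serverHandle, IEnumerable<int> itemHandles)
    {
        foreach (int itemHandle in itemHandles)
        {
            _items.TryRemove((serverHandle, itemHandle), out _);
        }
    }
}
```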


DOC / LINES / (none — gap) CLAIM: (gap) SessionOptions exposes AllowMultipleEventSubscribers (default false). Setting it true is rejected at startup by GatewayOptionsValidator with the message "AllowMultipleEventSubscribers is not supported until event fan-out is implemented." This validator-level enforcement of the v1 constraint is undocumented. CLAIM_TYPE: config-key VERDICT: gap EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs:29 and src/ZB.MOM.WW.MxGateway.Server/Configuration/GatewayOptionsValidator.cs:181-184. CODE_AREA: session.subscriber SEVERITY: medium PROPOSED_FIX: Add a note to the "Run" section explaining that MxGateway:Sessions:AllowMultipleEventSubscribers exists but is actively refused by the validator in v1; operators who set it to true will see a startup validation failure, not a runtime error.
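
Illustration for the gap above (not gateway code): a minimal sketch of a startup-time check that rejects the flag before any session is opened. `DemoSessionOptions` and `DemoOptionsValidator` are hypothetical and deliberately avoid the options-validation interfaces so the block stays dependency-free; only the failure message is taken from the audited validator.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the relevant part of SessionOptions.
public sealed class DemoSessionOptions
{
    public bool AllowMultipleEventSubscribers { get; init; }
}

public static class DemoOptionsValidator
{
    // Returns every validation failure; an empty list means the options pass.
    public static IReadOnlyList<string> Validate(DemoSessionOptions options)
    {
        var failures = new List<string>();

        // The setting exists but is refused until event fan-out is available,
        // so a misconfiguration surfaces at startup rather than at runtime.
        if (options.AllowMultipleEventSubscribers)
        {
            failures.Add("AllowMultipleEventSubscribers is not supported until event fan-out is implemented.");
        }

        return failures;
    }
}
```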


DOC / LINES / (none — gap) CLAIM: (gap) Gateway-restart orphan cleanup is performed by OrphanWorkerCleanupHostedService (wrapping OrphanWorkerTerminator.TerminateOrphans) on StartAsync, before the gateway accepts sessions. Cleanup is best-effort (a failure logs a warning but does not block startup). The Sessions.md doc does not mention this, yet it directly affects the "gateway restart does not reattach orphan workers" contract. CLAIM_TYPE: behavior-rule VERDICT: gap EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerCleanupHostedService.cs:7-30; src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerTerminator.cs:49-95; src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerServiceCollectionExtensions.cs:19. CODE_AREA: session.orphan SEVERITY: high PROPOSED_FIX: Add a "Gateway Restart / Orphan Cleanup" section to Sessions.md (or cross-reference from Shutdown Coordination) noting that OrphanWorkerCleanupHostedService runs OrphanWorkerTerminator.TerminateOrphans on startup, kills any running worker executables matching the configured MxGateway:Worker:ExecutablePath, and that failures are non-fatal to startup.
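
Illustration for the gap above (not gateway code): a minimal sketch of the best-effort shape the evidence describes, where a cleanup failure is logged as a warning and startup continues. `BestEffortStartup.TryRun` and its delegate parameters are hypothetical; the real service wraps `OrphanWorkerTerminator.TerminateOrphans` inside `StartAsync`.

```csharp
using System;

public static class BestEffortStartup
{
    // Runs cleanup and reports whether it completed. A failure is handed to
    // logWarning and swallowed, so the caller can carry on starting up.
    public static bool TryRun(Action cleanup, Action<Exception> logWarning)
    {
        try
        {
            cleanup();
            return true;
        }
        catch (Exception ex)
        {
            logWarning(ex);
            return false;
        }
    }
}
```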


DOC / LINES / (none — gap) CLAIM: (gap) SessionOptions.MaxPendingCommandsPerSession (default 128) is passed to WorkerClientOptions.MaxPendingCommands during session construction. This per-session command concurrency cap is not documented in Sessions.md. CLAIM_TYPE: config-key VERDICT: gap EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs:18; src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionWorkerClientFactory.cs:92. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: Add a note in the "Key Types — SessionManager" or "Run" section that each session is bounded to MxGateway:Sessions:MaxPendingCommandsPerSession (default 128) concurrent in-flight worker commands.


DOC / LINES / (none — gap) CLAIM: (gap) GatewaySession exposes a KillWorkerWithCloseGateAsync method that acquires _closeLock before killing, introduced to serialize concurrent close/kill callers (Server-045). This method is not mentioned; the doc describes only KillWorker as the unconditional kill path from SessionManager. CLAIM_TYPE: term VERDICT: gap EVIDENCE: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:896-917; src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:233. CODE_AREA: session.lifecycle SEVERITY: low PROPOSED_FIX: Mention KillWorkerWithCloseGateAsync in the "Close" section as the locked kill path now used by SessionManager.KillWorkerAsync, distinguishing it from the bare KillWorker still used as the internal terminal action.