Code Review — Server
| Field | Value |
|---|---|
| Module | src/ZB.MOM.WW.MxGateway.Server |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | 42b0037 |
| Status | Re-reviewed |
| Open findings | 7 |
Checklist coverage
2026-05-20 review (commit 1cd51bb)
This table summarizes the 2026-05-20 review pass at commit 1cd51bb. Findings from
prior passes (Server-001 through Server-014) are all closed and remain below as
audit history.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Server-019 (WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync yields silently when session is missing). |
| 2 | mxaccessgw conventions | No issues found — convention drift previously called out is resolved; no new gaps observed. |
| 3 | Concurrency & thread safety | Issues found: Server-015 (GatewaySession._state is written under _closeLock but read/written elsewhere under _syncRoot). |
| 4 | Error handling & resilience | Issues found: Server-016 (GatewaySession.DisposeAsync disposes the close-lock semaphore while it may be held). |
| 5 | Security | Issues found: Server-017 (AcknowledgeAlarm / QueryActiveAlarms fall through to admin-only scope because the resolver was not updated for the new alarm RPCs). |
| 6 | Performance & resource management | Issues found: Server-018 (GalaxyGlobMatcher regex cache is unbounded — currently low-risk but uncapped). |
| 7 | Design-document adherence | No issues found at this pass. |
| 8 | Code organization & conventions | Issues found: Server-020 (dashboard pages each declare two @page directives — @page "/X" AND @page "/dashboard/X" — producing duplicate routes under the /dashboard group prefix). |
| 9 | Testing coverage | Issues found: Server-021 (MxAccessGatewayService.ApplyConstraintsAsync and the new BulkConstraintPlan / ReadBulkConstraintPlan / WriteBulkConstraintPlan / SubscribeBulkConstraintPlan merge logic is entirely untested). |
| 10 | Documentation & comments | Issues found: Server-022 (IAlarmRpcDispatcher XML doc still describes the dispatcher as "ships a not-yet-wired default"; stale after Server-014). |
2026-05-20 review (commit a020350)
Re-review pass at a020350 — the cross-module sweep that resolved Server-015 through Server-022. Verified each fix is sound (lock discipline now uniform on _syncRoot; DisposeAsync gates on _closeLock; alarm RPCs map to InvokeWrite/EventsRead; glob cache is bounded; alarm dispatcher SessionNotFound flows through MxAccessGatewayService.MapException → gRPC NotFound; dashboard pages emit a single @page; 11 new MxAccessGatewayServiceConstraintTests cover the bulk-constraint plans). New findings filed against this pass.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Server-024 (GalaxyGlobMatcher.GetOrCreateRegex indexer access after TryAdd fails can throw KeyNotFoundException under contention near the cap). |
| 2 | mxaccessgw conventions | No issues found. |
| 3 | Concurrency & thread safety | No new issues found — Server-015/016 fixes verified sound. |
| 4 | Error handling & resilience | Issues found: Server-026 (AlarmsOptions is bound but not validated by GatewayOptionsValidator). |
| 5 | Security | No issues found — Server-017 mapping (InvokeWrite / EventsRead) is defensible and exercised by both resolver and interceptor tests. |
| 6 | Performance & resource management | No issues found — Server-018 cap is in place and tested. |
| 7 | Design-document adherence | Issues found: Server-027 (docs/Authorization.md ResolveCommandScope code snippet and Constraint Enforcement section omit the bulk read/write command families). |
| 8 | Code organization & conventions | Issues found: Server-025 (GalaxyRepositoryGrpcService still consumes the concrete GalaxyRepository after IGalaxyRepository was introduced for testability — inconsistent with GalaxyHierarchyCache). |
| 9 | Testing coverage | Issues found: Server-028 (GatewayGrpcScopeResolverTests does not exercise WatchDeployEventsRequest or MxCommandKind.ReadBulk; no GatewaySessionTests case asserts a MarkFaulted during in-flight Close). |
| 10 | Documentation & comments | Issues found: Server-023 (NotWiredAlarmRpcDispatcher class XML doc still says "PR A.6/A.7 — default … shipped while the worker-side AlarmClient event subscription is gated on dev-rig validation"; contradicts the cleanup that Server-014/Server-022 applied to the interface, gateway service, and WorkerAlarmRpcDispatcher), Server-029 (OpenSession capability list advertises bulk-subscribe-commands but not the now-shipping bulk-read or bulk-write families — clients that gate on capability strings have no signal that those families exist). |
2026-05-22 review (commit fa491c7)
Re-review pass at fa491c7, scoped to the Galaxy hierarchy snapshot-persistence
change: the new GalaxyHierarchySnapshot, IGalaxyHierarchySnapshotStore /
GalaxyHierarchySnapshotStore, the restore / persist paths added to
GalaxyHierarchyCache, the two new GalaxyRepositoryOptions, and the
docs/GalaxyRepository.md / docs/GatewayConfiguration.md updates. Prior
findings (Server-001 through Server-032) are unchanged by this pass.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found — restore/save sequencing and the shared BuildEntry materialization are sound. |
| 2 | mxaccessgw conventions | No issues found — file-scoped namespaces, sealed, Async suffixes, Options pattern, and XML docs all conform; the snapshot persists Galaxy metadata (names/types), not tag values or secrets. |
| 3 | Concurrency & thread safety | No issues found — _restoreAttempted and _current are touched only under _refreshGate; _current is published via Volatile.Write; the store serializes its file I/O on a private SemaphoreSlim. |
| 4 | Error handling & resilience | Issues found: Server-033 (restore never completes _firstLoad, so a cold-start browse waits the full 5s bootstrap budget), Server-034 (TryLoadAsync throws on a corrupt file despite the Try prefix), Server-036 (a save cancelled at shutdown logs a misleading warning). |
| 5 | Security | No issues found — the snapshot holds non-secret Galaxy metadata, is written under C:\ProgramData\MxGateway alongside the auth DB, and restored rows flow the same materialization path as live SQL with no injection surface. |
| 6 | Performance & resource management | Issues found: Server-035 (the snapshot write is awaited on the refresh critical path under _refreshGate with no timeout). |
| 7 | Design-document adherence | No issues found — docs/GalaxyRepository.md and docs/GatewayConfiguration.md were updated in the same commit; docs/DesignDecisions.md already defers to GalaxyRepository.md as the Galaxy authority. |
| 8 | Code organization & conventions | No issues found — the new options live on GalaxyRepositoryOptions, the store is a registered singleton, and the on-disk envelope (PersistedFile) is a private nested record. |
| 9 | Testing coverage | Issues found: Server-037 (no test for the corrupt-snapshot restore path or for PersistSnapshot = false at the cache level). |
| 10 | Documentation & comments | No issues found — XML docs match behavior; the GalaxyRepository.md "On-disk snapshot" section documents the Stale-on-restore lifecycle. |
2026-05-24 re-review (commit 42b0037)
Re-review pass at 42b0037 scoped to the dashboard destructive-action wave on
top of d692232. Seven commits in range: c5e7479/0e56b5b add the admin-only
Close/Kill flow (new ISessionManager.KillWorkerAsync, new
IDashboardSessionAdminService, the shared ConfirmDialog.razor); 24cc5fd
adds IApiKeyAdminStore.DeleteAsync + the dashboard Delete action on revoked
keys; c5153d6 chains base.OnInitializedAsync() on ApiKeysPage; de7639a
removes the legacy MapGet("/", ...) redirect that was colliding with the
Blazor @page "/" (a real 500); 42b0037 is the cosmetic @using switch on
GalaxyPage/SessionDetailsPage. Tests added: DashboardSessionAdminServiceTests,
DashboardSnapshotPublisherTests, HubTokenServiceTests, two new
SessionManagerTests.KillWorkerAsync_* cases, plus the API-key-management
delete-path tests.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Server-044 (SessionManager.KillWorkerAsync catch-path leaks the mxgateway.sessions.open gauge — mirror of Server-006 but for the kill path). |
| 2 | mxaccessgw conventions | No issues found — file-scoped namespaces, sealed by default, Async suffix, primary constructors used by new types (DashboardSessionAdminService); no Blazor UI component libraries pulled in (ConfirmDialog.razor uses local Bootstrap classes only); no secrets logged. |
| 3 | Concurrency & thread safety | Issues found: Server-045 (KillWorkerAsync reads session.State without synchronization; two concurrent kills can both observe wasClosed=false and double-increment the sessions.closed counter). |
| 4 | Error handling & resilience | Issues found: Server-046 (SessionManager.ShutdownAsync's KillWorker fallback doesn't call _metrics.SessionClosed() — gauge leaks for every session whose graceful close throws), Server-050 (DashboardSessionAdminService only catches SessionManagerException; any other exception from RemoveSessionAsync/session.DisposeAsync propagates raw into Blazor). |
| 5 | Security | No issues found — DashboardSessionAdminService.CanManage requires DashboardRoles.Admin and the Razor pages gate the Close/Kill buttons on it; audit-log events dashboard-close-session / dashboard-kill-worker / dashboard-delete-key write through the existing IApiKeyAuditStore pipeline; IApiKeyAdminStore.DeleteAsync SQL guards on revoked_utc IS NOT NULL. |
| 6 | Performance & resource management | No issues found in the changed code. |
| 7 | Design-document adherence | No issues found in the changed code. |
| 8 | Code organization & conventions | Issues found: Server-047 (ApiKeysPage.razor ConfirmPendingAsync clears PendingAction before awaiting the action while SessionsPage/SessionDetailsPage clear it in the finally; the three consumers of the new ConfirmDialog differ on a small but visible UX point). |
| 9 | Testing coverage | Issues found: Server-048 (no test for KillWorkerAsync catch-path metric leak, wasClosed=true short-circuit, or concurrent-kill double-count). |
| 10 | Documentation & comments | Issues found: Server-049 (IDashboardSessionAdminService interface has no XML docs on any of its three members; DashboardSessionAdminService has none either — the IDashboardApiKeyManagementService peer carries the same gap but the convention is to document new public surfaces per CLAUDE.md "update docs in the same change as the source"). |
2026-05-24 review (commit d692232)
Re-review pass at d692232 scoped to the dashboard refactor wave: the
ZB.MOM.WW project rename (dc9c0c9), the QueryActiveAlarms public RPC
implementation (397d3c5), the LDAP role-mapping + HubToken bearer auth
(27ed651), the sidebar layout + three SignalR push hubs (6594359), and the
EventsHub broadcaster + doc refresh (d692232). Server-031 and Server-032
remain open and untouched — neither the gateway-side _writeLock heartbeat
contention nor the bounded _events channel saw any changes in this wave.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found in the fa491c7..d692232 diff. |
| 2 | mxaccessgw conventions | No issues found — rename hygiene clean, external runtime identifiers (MeterName, MxGateway.Dashboard scheme, MxGateway.Request logger, MxGateway.Worker.STA thread name) intentionally unprefixed per commit message. |
| 3 | Concurrency & thread safety | No issues found — Server-031 (_writeLock heartbeat watchdog contention) remains open and unchanged. |
| 4 | Error handling & resilience | Issues found: Server-039 (HubTokenService.Validate accepts a payload with null Name/NameIdentifier), Server-041 (EventStreamService calls the broadcaster without a try/catch — fragile seam), Server-042 (DashboardSnapshotPublisher tight retry loop with no backoff vs AlarmsHubPublisher 5-second delay). |
| 5 | Security | Issues found: Server-038 (EventsHub.SubscribeSession accepts any session id from any Viewer; no per-session ACL). |
| 6 | Performance & resource management | Issues found: Server-042 (DashboardSnapshotPublisher lacks reconnect backoff). |
| 7 | Design-document adherence | Issues found: Server-041 (broadcaster's never-throw contract documented in the interface but not enforced by the caller). |
| 8 | Code organization & conventions | Issues found: Server-040 (undocumented lookup-order precedence in MapGroupsToRoles), Server-043 (singleton sharing of HubTokenService undocumented). |
| 9 | Testing coverage | No issues found in this module — see Tests-026 in the Tests module for the missing EventsHub broadcast coverage. |
| 10 | Documentation & comments | Issues found: Server-040, Server-043 (both documentation gaps). |
Findings
Server-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Security |
| Location | src/MxGateway.Server/GatewayApplication.cs:147-149, src/MxGateway.Server/Dashboard/DashboardEndpointRouteBuilderExtensions.cs:55-58, src/MxGateway.Server/Dashboard/Components/Routes.razor:1-15 |
| Status | Resolved |
Description: The dashboard authorization policy (DashboardAuthenticationDefaults.AuthorizationPolicy), DashboardAuthorizationRequirement, and DashboardAuthorizationHandler are registered in DI but never applied to any endpoint. MapRazorComponents<App>() has no .RequireAuthorization(...), the <Router> in Routes.razor uses plain RouteView (not AuthorizeRouteView), and no dashboard page carries [Authorize] — a module-wide grep finds zero RequireAuthorization/[Authorize]/AuthorizeRouteView usages. Every dashboard page (Sessions, Workers, Events, Galaxy, Settings, and the API Keys list exposing key IDs, scopes, and constraints) is reachable by any unauthenticated remote client regardless of Dashboard:AllowAnonymousLocalhost or Dashboard:RequireAdminScope. Only the API-key mutation operations remain protected, via the separate DashboardApiKeyManagementService.CanManage check.
Recommendation: Apply the policy at the route level — endpoints.MapRazorComponents<App>().AddInteractiveServerRenderMode().RequireAuthorization(DashboardAuthenticationDefaults.AuthorizationPolicy) — and/or switch Routes.razor to AuthorizeRouteView with an [Authorize] fallback policy plus a NotAuthorized redirect to the login page. Add an integration test that GETs a dashboard page anonymously and asserts a 302 redirect to login or a 401.
Resolution: Resolved in a8aafdf (2026-05-18): MapRazorComponents<App>() now calls .RequireAuthorization(DashboardAuthenticationDefaults.AuthorizationPolicy), so an unauthenticated request to any dashboard component route is challenged by the cookie scheme and redirected to the login page. GatewayApplicationTests gained ComponentRoutesRequireAuthorization (component routes carry the policy) and AuthEndpointsAllowAnonymousAccess, replacing the prior test that asserted the insecure behavior.
Server-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | src/MxGateway.Server/Program.cs:24, src/MxGateway.Server/GatewayApplication.cs |
| Status | Resolved |
Description: gateway.md:583 and CLAUDE.md state the first version "terminates orphaned workers on startup." No code in MxGateway.Server enumerates or kills leftover MxGateway.Worker.exe processes at startup — a grep for orphan/reattach/terminate finds nothing. After an unclean gateway crash, x86 worker processes (each holding an MXAccess COM instance) leak and survive indefinitely, and a restarted gateway does not reclaim or kill them.
Recommendation: Add a startup hosted service that finds and kills stale worker processes (by executable path / a well-known argument or environment marker) before the server accepts sessions, or update the design docs if reattachment/cleanup is deliberately deferred.
Resolution: Resolved 2026-05-18. Confirmed against source: no code path enumerated or killed leftover workers. Added IRunningProcessInspector / SystemRunningProcessInspector (a testable seam over Process.GetProcessesByName/Kill), OrphanWorkerTerminator (kills processes matched by the configured worker executable path, or by image name when the x64 gateway cannot introspect the x86 worker's MainModule, skipping the current process and tolerating per-process kill failures), and OrphanWorkerCleanupHostedService (best-effort IHostedService). The hosted service is registered in AddWorkerProcessLauncher ahead of AddGatewaySessions so cleanup runs before the server accepts sessions. gateway.md updated to describe the implemented behavior. Regression tests: OrphanWorkerTerminatorTests (KillsWorkerProcessesMatchingConfiguredExecutablePath, KillsImageNameMatchWhenExecutablePathUnreadable, DoesNotKillUnrelatedProcessSharingImageName, DoesNotKillCurrentProcess, ContinuesWhenOneKillThrows).
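The matching rule implied by the resolution and its regression-test names can be sketched as below. This is a minimal illustration under stated assumptions, not the gateway's code: `ProcessInfo`, `IProcessInspector`, and `OrphanTerminatorSketch` are hypothetical stand-ins for `IRunningProcessInspector` and `OrphanWorkerTerminator`, whose real signatures are not shown in this review, and the case-insensitive path comparison is assumed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical process description: ExecutablePath is null when the path
// cannot be read (for example an x64 gateway inspecting an x86 worker).
public sealed record ProcessInfo(int Id, string? ExecutablePath);

// Hypothetical seam over process enumeration and termination.
public interface IProcessInspector
{
    IReadOnlyList<ProcessInfo> FindByImageName(string imageName);
    void Kill(int processId);
}

public static class OrphanTerminatorSketch
{
    // Kills stale workers and returns the ids that were actually killed.
    public static List<int> Terminate(
        IProcessInspector inspector,
        string imageName,
        string workerExecutablePath,
        int currentProcessId)
    {
        var killed = new List<int>();
        foreach (var process in inspector.FindByImageName(imageName))
        {
            // Never terminate the process running the sweep.
            if (process.Id == currentProcessId) continue;

            // A readable path must match the configured worker executable;
            // an unreadable path falls back to the image-name match alone.
            var pathMatches = process.ExecutablePath is null
                || string.Equals(process.ExecutablePath, workerExecutablePath,
                    StringComparison.OrdinalIgnoreCase);
            if (!pathMatches) continue;

            try
            {
                inspector.Kill(process.Id);
                killed.Add(process.Id);
            }
            catch (Exception)
            {
                // Best effort: one failed kill must not stop the sweep.
            }
        }

        return killed;
    }
}
```

Putting enumeration and `Kill` behind an interface is what lets each of the five listed regression cases be driven by a fake inspector with no real processes involved.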
Server-003
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | src/MxGateway.Server/Dashboard/DashboardAuthorizationHandler.cs:39,54-59, src/MxGateway.Server/Dashboard/DashboardAuthenticator.cs:236-258 |
| Status | Resolved |
Description: When Dashboard:RequireAdminScope is true (the default) and the request is not loopback, DashboardAuthorizationHandler succeeds only if HasAdminScope finds a claim of type "scope" with value "admin". But DashboardAuthenticator.CreatePrincipal issues only NameIdentifier, Name, and LdapGroupClaimType claims — never a scope/admin claim. So a correctly LDAP-authenticated user who passed the required-group check is still denied dashboard access on any non-loopback connection. The bug is currently masked by the missing route-level enforcement (Server-001) and by AllowAnonymousLocalhost; fixing Server-001 would make the dashboard unusable for all real LDAP logins.
Recommendation: Either have DashboardAuthenticator.CreatePrincipal add a scope=admin claim when the user is in the required group, or change DashboardAuthorizationHandler.HasAdminScope to evaluate LDAP group membership (reuse IsMemberOfRequiredGroup against the LdapGroupClaimType claims, as DashboardApiKeyAuthorization.CanManage already does).
Resolution: Resolved in a8aafdf (2026-05-18): DashboardAuthenticator.CreatePrincipal — reached only after the required-group check passes — now emits the scope=admin claim that DashboardAuthorizationHandler checks, so group-validated LDAP users pass RequireAdminScope once route-level authorization (Server-001) is enforced.
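The claim mismatch and its fix can be illustrated with plain `System.Security.Claims` types. The sketch below is an assumption-laden illustration: the `"scope"`/`"admin"` pair comes from the description above, while `DashboardClaimsSketch`, the `"ldap-group"` claim type, and the `"DashboardCookie"` authentication type are hypothetical placeholders rather than the module's actual values.

```csharp
using System;
using System.Security.Claims;

public static class DashboardClaimsSketch
{
    public const string ScopeClaimType = "scope";
    public const string AdminScope = "admin";

    // Hypothetical value: the real LdapGroupClaimType is not shown in this review.
    public const string LdapGroupClaimType = "ldap-group";

    // Assumed to be called only after the required-group check has passed.
    public static ClaimsPrincipal CreatePrincipal(string userName, params string[] groups)
    {
        var identity = new ClaimsIdentity(authenticationType: "DashboardCookie");
        identity.AddClaim(new Claim(ClaimTypes.NameIdentifier, userName));
        identity.AddClaim(new Claim(ClaimTypes.Name, userName));
        foreach (var group in groups)
        {
            identity.AddClaim(new Claim(LdapGroupClaimType, group));
        }

        // The claim the authorization handler looks for; before the fix this
        // line had no equivalent, so HasAdminScope always returned false.
        identity.AddClaim(new Claim(ScopeClaimType, AdminScope));
        return new ClaimsPrincipal(identity);
    }

    // Mirrors the handler-side check described in the finding.
    public static bool HasAdminScope(ClaimsPrincipal user) =>
        user.HasClaim(ScopeClaimType, AdminScope);
}
```

With the extra claim in place, `HasAdminScope(CreatePrincipal("operator", "GatewayAdmins"))` returns `true`; a principal that carries only group claims still fails the check.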
Server-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Code organization & conventions |
| Location | src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs:227-233, src/MxGateway.Server/Security/Authentication/ApiKeyAdminCliRunner.cs:53-77, src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:21-67 |
| Status | Resolved |
Description: ParseScopes accepts any comma-separated strings and CreateKeyAsync persists them verbatim; neither the CLI nor the dashboard create path validates scopes against GatewayScopes. A typo or non-canonical name (e.g. CLAUDE.md's example --scopes session,invoke,event,metadata,admin, which does not match the resolver's session:open/invoke:read/etc.) silently creates a key whose scope strings the authorization resolver never checks for — the key is unusable for those RPCs with no error at creation time.
Recommendation: Validate every requested scope against the GatewayScopes catalog at create time in both the CLI parser/runner and DashboardApiKeyManagementService.ValidateCreateRequest, rejecting unknown scope strings.
Resolution: Resolved 2026-05-18. Confirmed against source: ParseScopes split unvalidated strings into the create command and ValidateCreateRequest checked only key id and display name. Added GatewayScopes.All (the canonical scope catalog) and GatewayScopes.IsKnown(string). ApiKeyAdminCommandLineParser.Parse now runs ValidateScopes for create-key commands and fails the parse listing the unknown scope(s) and valid set; DashboardApiKeyManagementService.ValidateCreateRequest rejects requests carrying any non-canonical scope. Revoke/rotate paths are unaffected (no scope input). Regression tests: ApiKeyAdminCommandLineParserTests.Parse_CreateKeyCommand_RejectsUnknownScope, Parse_CreateKeyCommand_AcceptsAllCanonicalScopes, and DashboardApiKeyManagementServiceTests.CreateAsync_UnknownScope_DoesNotCallStore.
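A create-time scope check of the kind described can be sketched as follows. The scope strings are the canonical set listed under Server-012; the class name `GatewayScopesSketch`, the `FindUnknown` helper, and the ordinal (case-sensitive) comparison are assumptions made for illustration, not the shipped `GatewayScopes` API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GatewayScopesSketch
{
    // Canonical scope catalog (ordinal comparison is an assumption).
    public static readonly IReadOnlySet<string> All = new HashSet<string>(StringComparer.Ordinal)
    {
        "session:open", "session:close", "invoke:read", "invoke:write",
        "invoke:secure", "events:read", "metadata:read", "admin",
    };

    public static bool IsKnown(string scope) => All.Contains(scope);

    // Hypothetical helper: splits a --scopes style argument and returns every
    // entry that is not in the catalog, in input order.
    public static IReadOnlyList<string> FindUnknown(string commaSeparatedScopes) =>
        commaSeparatedScopes
            .Split(',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
            .Where(scope => !IsKnown(scope))
            .ToList();
}
```

Under these assumptions, `FindUnknown("session,invoke,event,metadata,admin")` returns `session`, `invoke`, `event`, and `metadata`, so the old CLAUDE.md example would fail at parse time instead of producing an unusable key.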
Server-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs:22-28, src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:184 |
| Status | Resolved |
Description: GalaxyHierarchyCache.RefreshCoreAsync only catches SqlException and InvalidOperationException. The initial cache.RefreshAsync call in GalaxyHierarchyRefreshService.ExecuteAsync is wrapped only for OperationCanceledException. A transient non-SqlException failure on the first refresh (e.g. a Win32Exception/TimeoutException from connection establishment, or another DbException subtype) escapes both layers, faults the BackgroundService, and — with default host behavior — stops the whole gateway. The periodic-tick loop does catch general exceptions, so only the first load is exposed.
Recommendation: Broaden the catch in RefreshCoreAsync to all non-cancellation exceptions (record Unavailable/Stale and still complete _firstLoad), or wrap the initial RefreshAsync in GalaxyHierarchyRefreshService with the same general catch the tick loop uses.
Resolution: Resolved 2026-05-18. Confirmed against source: the initial RefreshAsync in ExecuteAsync was guarded only for OperationCanceledException, and RefreshCoreAsync filtered its catch to SqlException or InvalidOperationException. Both recommended layers applied: GalaxyHierarchyRefreshService.ExecuteAsync now catches every non-cancellation exception on the initial load (logs a warning; the periodic tick retries), and GalaxyHierarchyCache.RefreshCoreAsync broadens its catch to all non-cancellation exceptions so the cache still records Stale/Unavailable and completes _firstLoad. The now-unused Microsoft.Data.SqlClient using was removed. Regression test: GalaxyHierarchyRefreshServiceTests.ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFaultBackgroundService.
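The broadened guard amounts to an exception filter that excludes only cancellation. A minimal sketch follows, assuming a delegate-based shape so it stays self-contained; `FirstLoadSketch` and `TryInitialRefreshAsync` are hypothetical names, and the real services log through `ILogger` rather than an `Action<Exception>`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class FirstLoadSketch
{
    // Returns true when the refresh succeeded and false when a
    // non-cancellation failure was swallowed so a later tick can retry.
    // OperationCanceledException (and its subclasses) still propagates.
    public static async Task<bool> TryInitialRefreshAsync(
        Func<CancellationToken, Task> refreshAsync,
        Action<Exception> logWarning,
        CancellationToken cancellationToken)
    {
        try
        {
            await refreshAsync(cancellationToken).ConfigureAwait(false);
            return true;
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Covers TimeoutException, Win32Exception, DbException subtypes
            // and anything else that is not a cancellation.
            logWarning(ex);
            return false;
        }
    }
}
```

Filtering with `when` rather than catching and rethrowing keeps the original stack intact for cancellation, which the host still needs to observe during shutdown.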
Server-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Server/Sessions/SessionManager.cs:84-114 |
| Status | Resolved |
Description: In OpenSessionAsync, _metrics.SessionOpened() (line 89) increments the _openSessions gauge before TryAutoSubscribeAlarmsAsync runs. If auto-subscribe throws (which it does when Alarms.RequireSubscribeOnOpen is true and the worker rejects the subscription), the catch block disposes and removes the session and records _metrics.Fault(...) but never calls SessionClosed/SessionRemoved. The mxgateway.sessions.open gauge permanently over-counts by one for every such failed open.
Recommendation: In the catch block, when the session had reached the point where SessionOpened() was recorded, also call _metrics.SessionRemoved() — or move the SessionOpened() call to after auto-subscribe succeeds.
Resolution: Resolved 2026-05-18. Confirmed against source: the catch block in OpenSessionAsync recorded Fault(...) and removed the session but never decremented the open-session gauge after SessionOpened() had run. Added a sessionOpenedRecorded flag set immediately after _metrics.SessionOpened(); the catch block now calls _metrics.SessionRemoved() when that flag is set, restoring the gauge for a post-SessionOpened() failure (e.g. an auto-subscribe rejection with RequireSubscribeOnOpen=true). Regression test: SessionManagerAlarmAutoSubscribeTests.OpenSessionAsync_DoesNotLeakOpenSessionGauge_WhenAutoSubscribeFailsWithRequireOn.
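The flag-guarded compensation can be sketched in isolation. This is an illustration only: `SessionMetricsSketch` and `OpenSessionSketch` are hypothetical reductions of the real metrics and session manager types, with the auto-subscribe step modelled as a delegate.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical reduction of the gateway metrics to a single open-session gauge.
public sealed class SessionMetricsSketch
{
    private int _openSessions;

    public int OpenSessions => Volatile.Read(ref _openSessions);

    public void SessionOpened() => Interlocked.Increment(ref _openSessions);

    public void SessionRemoved() => Interlocked.Decrement(ref _openSessions);
}

public static class OpenSessionSketch
{
    public static async Task OpenAsync(SessionMetricsSketch metrics, Func<Task> autoSubscribeAsync)
    {
        var sessionOpenedRecorded = false;
        try
        {
            metrics.SessionOpened();
            sessionOpenedRecorded = true;

            // May throw, e.g. when the worker rejects a required subscription.
            await autoSubscribeAsync().ConfigureAwait(false);
        }
        catch
        {
            // Undo the gauge increment only if it actually happened.
            if (sessionOpenedRecorded)
            {
                metrics.SessionRemoved();
            }

            throw;
        }
    }
}
```

The flag matters because the real method can also fail before `SessionOpened()` runs; decrementing unconditionally in the catch block would under-count instead of over-count.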
Server-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | src/MxGateway.Server/Galaxy/GalaxyHierarchyProjector.cs:55-70 |
| Status | Resolved |
Description: Project always iterates the full entry.Index.ObjectViews collection and re-applies all filters in order to skip the first offset matching items before collecting a page. Paging through a large Galaxy hierarchy is therefore O(total) per page and O(total²/pageSize) end-to-end. The cache is in-memory so the impact is bounded, but for large galaxies repeated DiscoverHierarchy pagination wastes CPU.
Recommendation: Precompute and cache the filtered, ordered view list per (filterSignature, sequence) so subsequent pages are an O(pageSize) slice; the existing filter signature already keys page tokens.
Resolution: Resolved 2026-05-18. Confirmed against source: Project re-scanned and re-filtered the whole ObjectViews list on every page. Added a ConditionalWeakTable<GalaxyHierarchyCacheEntry, ConcurrentDictionary<string, IReadOnlyList<GalaxyObjectView>>> memo in GalaxyHierarchyProjector: the first projection of a given filter signature builds the filtered, ordered view list; subsequent pages take an O(pageSize) slice via index arithmetic. The memo is keyed on the immutable cache-entry instance, so when the cache publishes a new entry the stale memo becomes unreachable and is reclaimed with it — no explicit invalidation. ResolveRoot still runs before the memo lookup so a missing root surfaces NotFound consistently. Regression tests: GalaxyHierarchyProjectorTests (Project_PagedAcrossEntireHierarchy_ReturnsEveryObjectExactlyOnce, Project_DistinctFiltersOnSameEntry_DoNotShareMemoizedViewList, Project_SameFilterRepeated_ReturnsIdenticalTotals, Project_DistinctCacheEntries_ProjectAgainstTheirOwnData); existing GalaxyRepositoryGrpcServiceTests paging tests continue to pass unchanged.
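The memo shape described in the resolution can be sketched with a simplified model. The types here are hypothetical: `CacheEntrySketch` stands in for `GalaxyHierarchyCacheEntry`, object views are reduced to strings, and a name prefix plays the role of the filter signature.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical stand-in for the immutable cache entry.
public sealed class CacheEntrySketch
{
    public CacheEntrySketch(IReadOnlyList<string> objectNames) => ObjectNames = objectNames;

    public IReadOnlyList<string> ObjectNames { get; }
}

public sealed class ProjectorSketch
{
    // Keyed on the entry instance: when the cache publishes a new entry and the
    // old one becomes unreachable, its memo is collected with it.
    private readonly ConditionalWeakTable<CacheEntrySketch, ConcurrentDictionary<string, IReadOnlyList<string>>> _memo = new();
    private int _filterBuilds;

    // Number of times a filtered list was built (for demonstration only).
    public int FilterBuilds => Volatile.Read(ref _filterBuilds);

    public IReadOnlyList<string> Page(CacheEntrySketch entry, string prefixFilter, int offset, int pageSize)
    {
        var perEntry = _memo.GetValue(
            entry,
            _ => new ConcurrentDictionary<string, IReadOnlyList<string>>(StringComparer.Ordinal));

        // Built once per (entry, filter); GetOrAdd may run the factory more
        // than once under contention, which is harmless for a pure function.
        var filtered = perEntry.GetOrAdd(prefixFilter, filter =>
        {
            Interlocked.Increment(ref _filterBuilds);
            return entry.ObjectNames
                .Where(name => name.StartsWith(filter, StringComparison.Ordinal))
                .OrderBy(name => name, StringComparer.Ordinal)
                .ToList();
        });

        // O(pageSize) slice by index; offset is assumed to be non-negative.
        var end = Math.Min(filtered.Count, offset + pageSize);
        var page = new List<string>(Math.Max(0, end - offset));
        for (var i = offset; i < end; i++)
        {
            page.Add(filtered[i]);
        }

        return page;
    }
}
```

Paging twice over the same entry and filter leaves `FilterBuilds` at 1, while a different filter or a new entry instance builds its own list, which is the property the `DoNotShareMemoizedViewList` and `DistinctCacheEntries` tests appear to pin down.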
Server-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:111-134,160-189 |
| Status | Resolved |
Description: WatchDeployEvents calls ResolveBrowseSubtrees() on every streamed event, and MapDeployEvent re-runs GalaxyHierarchyProjector.Project over the entire cached hierarchy (and sums attribute counts) for every event of every constrained subscriber. GalaxyGlobMatcher.IsMatch also rebuilds the glob regex on each call. With many constrained subscribers and frequent deploys this is avoidable work.
Recommendation: Hoist ResolveBrowseSubtrees() out of the loop; compute scoped object/attribute counts once per deploy sequence and cache by (sequence, browseSubtrees); cache compiled glob Regex instances in GalaxyGlobMatcher.
Resolution: Resolved 2026-05-18. Confirmed against source. Three changes: (1) WatchDeployEvents now resolves ResolveBrowseSubtrees() once before the streaming loop — the caller's identity and constraints are fixed for the stream lifetime, so per-event resolution was pure waste. (2) GalaxyGlobMatcher now caches compiled Regex instances in a ConcurrentDictionary keyed by glob pattern (with RegexOptions.Compiled), so the same handful of globs are translated once instead of on every IsMatch call. (3) The per-event MapDeployEvent re-projection is no longer a separate hot path: with finding Server-007 resolved, GalaxyHierarchyProjector.Project memoizes the filtered view list per (cache entry, filter signature), so the scoped-count projection in MapDeployEvent for a constrained subscriber is O(matched-slice) after the first event of a given deploy sequence rather than a full re-scan — this subsumes the recommendation's (sequence, browseSubtrees) cache (the memo is keyed on the per-sequence cache-entry instance and the browse-subtree-bearing filter signature). Regression tests: GalaxyFilterInputSafetyTests.GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect (glob cache correctness); existing WatchDeployEvents and GalaxyFilterInputSafetyTests coverage continues to pass.
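The compiled-regex cache from change (2) can be sketched as below. The glob dialect (`*` for any run of characters, `?` for exactly one), the case-sensitive matching, and the name `GlobMatcherSketch` are assumptions; the sketch also omits the size cap that the later Server-018 finding added.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

public static class GlobMatcherSketch
{
    // One compiled Regex per distinct glob pattern. Unbounded in this sketch.
    private static readonly ConcurrentDictionary<string, Regex> Cache = new(StringComparer.Ordinal);

    public static int CachedPatternCount => Cache.Count;

    public static bool IsMatch(string glob, string input) =>
        Cache.GetOrAdd(glob, pattern => BuildRegex(pattern)).IsMatch(input);

    // Escapes the glob, then re-enables the two wildcard characters.
    private static Regex BuildRegex(string glob)
    {
        var pattern = "^"
            + Regex.Escape(glob).Replace("\\*", ".*").Replace("\\?", ".")
            + "\\z";
        return new Regex(
            pattern,
            RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.Singleline);
    }
}
```

`RegexOptions.Compiled` trades a higher one-time construction cost for faster matching, so it only pays off when instances are reused as they are here; constructing a compiled regex on every `IsMatch` call would be slower than the interpreted default.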
Server-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | src/MxGateway.Server/Security/Authentication/AuthSqliteConnectionFactory.cs:15-32 |
| Status | Resolved |
Description: Each auth-store operation opens a fresh SqliteConnection with no busy timeout and the default (non-WAL) journal mode. MarkKeyUsedAsync runs on every authenticated request and SqliteApiKeyAuditStore appends on every denial; under concurrent load these writers can collide and surface SQLITE_BUSY as a hard failure on the request path.
Recommendation: Enable Pooling, set a non-zero DefaultTimeout/busy_timeout, and enable WAL (PRAGMA journal_mode=WAL) once at startup so concurrent readers/writers degrade gracefully.
Resolution: Resolved 2026-05-18. Confirmed against source: the connection string set only DataSource and Mode. AuthSqliteConnectionFactory.CreateConnection now also sets Pooling = true and a non-zero DefaultTimeout. A new OpenConnectionAsync(CancellationToken) opens the connection and applies PRAGMA journal_mode=WAL and PRAGMA busy_timeout (5 s); WAL is a persistent database-level setting so re-applying it per connection is a cheap no-op, while busy_timeout is per-connection state. All nine auth-store call sites (SqliteApiKeyAdminStore, SqliteApiKeyAuditStore, SqliteApiKeyStore, SqliteAuthStoreMigrator) were switched from CreateConnection() + OpenAsync() to OpenConnectionAsync(). docs/Authentication.md updated to describe the WAL/busy-timeout behavior. Regression test: SqliteAuthStoreTests.OpenConnectionAsync_EnablesWalJournalModeAndBusyTimeout.
Server-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Security |
| Location | src/MxGateway.Server/Security/Authentication/SqliteApiKeyAdminStore.cs:91-114, src/MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:168-172 |
| Status | Resolved |
Description: RotateAsync sets revoked_utc = NULL, so rotating a previously revoked key silently reactivates it. This is documented intentional behavior in docs/Authentication.md:167, but the dashboard renders the "Rotate" button unconditionally — including for keys whose status badge says "Revoked" — so an operator can un-revoke a deliberately disabled key without an explicit warning.
Recommendation: Either hide/disable the Rotate action for revoked keys in ApiKeysPage.razor, require an explicit confirmation, or have RotateAsync preserve revoked_utc and add a separate explicit "reactivate" operation.
Resolution: Resolved 2026-05-18. Confirmed against source: ApiKeysPage.razor rendered the Rotate button unconditionally while Revoke was already gated on key.RevokedUtc is null. Took the lowest-risk recommended option — the dashboard now renders the Rotate (and Revoke) actions only for keys whose status is Active; a revoked key shows a "No actions" placeholder, so an operator cannot un-revoke a deliberately disabled key as a side effect of a rotation. RotateAsync's store-level behavior is unchanged (rotation by key_id still clears revoked_utc, which the CLI relies on); docs/Authentication.md updated to document both the store behavior and the dashboard restriction. No automated test added: the change is pure conditional Razor rendering and the test project has no bUnit component-rendering harness; the underlying DashboardApiKeyManagementService is already unit-tested.
Server-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:1-46 |
| Status | Resolved |
Description: WorkerAlarmRpcDispatcher deviates from the module's conventions: it fully-qualifies System.Guid, System.ArgumentNullException, and System.Threading types inline instead of relying on using directives, and uses an explicit constructor with this.-qualified field assignment while the rest of the module (e.g. ConstraintEnforcer, MxAccessGatewayService, GalaxyRepositoryGrpcService) uses primary constructors. docs/style-guides/CSharpStyleGuide.md is authoritative for gateway code.
Recommendation: Add the needed using directives, drop the inline fully-qualified names, and convert to a primary constructor for consistency.
Resolution: Resolved 2026-05-18. Confirmed against source. Converted WorkerAlarmRpcDispatcher to a primary constructor with the standard ?? throw new ArgumentNullException(...) field-initializer guard; dropped the inline System.Guid / System.ArgumentNullException qualifications (using implicit using System;); removed redundant using System.Collections.Generic; / System.Threading / System.Threading.Tasks; directives (covered by ImplicitUsings); replaced the two if (... is null) throw new System.ArgumentNullException(...) checks with ArgumentNullException.ThrowIfNull. The stale class-level <summary>/<remarks> ("Replaces NotWiredAlarmRpcDispatcher once ... wired in", "partially wired", "returns an Unimplemented diagnostic") were corrected to describe the actual GUID-vs-Provider!Group.Tag handling — overlapping with Server-014. No behavior change, so no new test; existing WorkerAlarmRpcDispatcherTests continue to pass and the project builds warning-free under TreatWarningsAsErrors.
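Illustrative sketch of the resolved shape (not the shipped code): a primary constructor whose parameter is captured into a guarded field, plus ArgumentNullException.ThrowIfNull for method arguments. ISessionLookup and AlarmDispatcherSketch are hypothetical stand-ins; the real dispatcher's dependencies are not shown in this review.

```csharp
using System;

// Hypothetical dependency standing in for the real session registry.
public interface ISessionLookup
{
    bool Exists(string sessionId);
}

// Primary constructor (C# 12): the parameter is captured once into a
// readonly field, with the null guard applied in the field initializer.
public sealed class AlarmDispatcherSketch(ISessionLookup sessions)
{
    private readonly ISessionLookup _sessions =
        sessions ?? throw new ArgumentNullException(nameof(sessions));

    public bool CanDispatch(string sessionId)
    {
        // Replaces the older `if (x is null) throw new System.ArgumentNullException(...)` form.
        ArgumentNullException.ThrowIfNull(sessionId);
        return _sessions.Exists(sessionId);
    }
}
```

Because the guard sits in the field initializer, a null dependency fails at construction time rather than at first use.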
Server-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | CLAUDE.md (Authentication section and apikey create example) |
| Status | Resolved |
Description: CLAUDE.md describes scopes as session, invoke, event, metadata, admin and shows apikey create --scopes session,invoke,event,metadata,admin. The actual canonical scope strings (used by GatewayScopes, GatewayGrpcScopeResolver, and docs/Authorization.md) are session:open, session:close, invoke:read, invoke:write, invoke:secure, events:read, metadata:read, admin. A key created per the CLAUDE.md example carries scopes the resolver never matches.
Recommendation: Update CLAUDE.md's scope list and the apikey example to the canonical *:* scope strings, per CLAUDE.md's own rule that docs change with the code.
Resolution: Resolved 2026-05-18. Confirmed against GatewayScopes (session:open, session:close, invoke:read, invoke:write, invoke:secure, events:read, metadata:read, admin). CLAUDE.md's Build/Test/Run apikey create example and the Authentication-section scope list were both updated to the canonical *:* strings. (Note: since finding Server-004 was resolved, the old example would now be actively rejected at create time rather than silently creating an unusable key, making the doc correction load-bearing.) Pure documentation change; no test.
Server-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | src/MxGateway.Tests/Gateway/Dashboard/DashboardAuthorizationHandlerTests.cs, src/MxGateway.Tests/Gateway/GatewayApplicationTests.cs |
| Status | Resolved |
Description: DashboardAuthorizationHandler is unit-tested in isolation, but no test exercises the dashboard routes end-to-end to confirm the policy is actually enforced — which is why Server-001 (policy registered but never wired) went uncaught. There are also no tests for WorkerExecutableValidator (PE-header architecture parsing), GalaxyGlobMatcher (anchoring/escaping/empty-glob fail-open), or GalaxyHierarchyProjector pagination/page-token behavior.
Recommendation: Add a WebApplicationFactory integration test that requests a dashboard page unauthenticated and asserts the redirect/401, plus unit tests for WorkerExecutableValidator, GalaxyGlobMatcher, and projector paging.
Resolution: Resolved 2026-05-18. Re-triaged against the current test suite: three of the four named gaps were already closed. (1) The dashboard route-level enforcement test exists — GatewayApplicationTests.Build_WhenDashboardEnabled_ComponentRoutesRequireAuthorization (and ..._AuthEndpointsAllowAnonymousAccess), added when Server-001 was fixed. (2) GalaxyGlobMatcher anchoring/escaping/empty-glob behavior is covered by GalaxyFilterInputSafetyTests (GlobMatcher_TreatsSqlMetacharactersAsLiterals, GlobMatcher_DoesNotTreatLikeWildcardsAsWildcards, GlobMatcher_WithPathologicalInput_DoesNotHang), now extended with GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect. (3) Projector pagination/page-token behavior is covered end-to-end by GalaxyRepositoryGrpcServiceTests and now directly by the new GalaxyHierarchyProjectorTests. The one genuine remaining gap — WorkerExecutableValidator PE-header parsing — was closed with the new WorkerExecutableValidatorTests (7 cases: matching/mismatched x86 and x64, missing MZ header, file too small, missing PE signature), exercising the validator against synthesized minimal PE fixtures.
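Illustrative sketch of PE-header architecture parsing against a synthesized minimal fixture, in the spirit of the WorkerExecutableValidatorTests cases above. PeHeaderSketch, TryReadArchitecture, and BuildMinimalImage are hypothetical names; only the PE layout facts (MZ magic, e_lfanew at offset 0x3C, the "PE\0\0" signature, and the Machine values 0x014C and 0x8664) come from the PE/COFF format itself. The real validator may be structured differently.

```csharp
using System;
using System.Buffers.Binary;

public enum PeArchitecture { Unknown, X86, X64 }

public static class PeHeaderSketch
{
    private const ushort MachineI386 = 0x014C;   // IMAGE_FILE_MACHINE_I386
    private const ushort MachineAmd64 = 0x8664;  // IMAGE_FILE_MACHINE_AMD64

    // Returns false when the bytes are not a well-formed PE image prefix.
    public static bool TryReadArchitecture(ReadOnlySpan<byte> image, out PeArchitecture arch)
    {
        arch = PeArchitecture.Unknown;

        // The DOS header is 64 bytes; e_lfanew lives at offset 0x3C.
        if (image.Length < 0x40) return false;                              // file too small
        if (image[0] != (byte)'M' || image[1] != (byte)'Z') return false;   // missing MZ header

        int peOffset = BinaryPrimitives.ReadInt32LittleEndian(image.Slice(0x3C, 4));

        // Need room for the 4-byte signature plus the 2-byte Machine field.
        if (peOffset < 0 || peOffset > image.Length - 6) return false;

        ReadOnlySpan<byte> signature = image.Slice(peOffset, 4);
        if (signature[0] != (byte)'P' || signature[1] != (byte)'E' ||
            signature[2] != 0 || signature[3] != 0)
        {
            return false;                                                   // missing PE signature
        }

        ushort machine = BinaryPrimitives.ReadUInt16LittleEndian(image.Slice(peOffset + 4, 2));
        arch = machine switch
        {
            MachineI386 => PeArchitecture.X86,
            MachineAmd64 => PeArchitecture.X64,
            _ => PeArchitecture.Unknown,
        };
        return true;
    }

    // Builds the smallest byte array the parser above accepts.
    public static byte[] BuildMinimalImage(ushort machine)
    {
        var image = new byte[0x80];
        image[0] = (byte)'M';
        image[1] = (byte)'Z';
        BinaryPrimitives.WriteInt32LittleEndian(image.AsSpan(0x3C, 4), 0x40);
        image[0x40] = (byte)'P';
        image[0x41] = (byte)'E';   // bytes 0x42 and 0x43 stay zero
        BinaryPrimitives.WriteUInt16LittleEndian(image.AsSpan(0x44, 2), machine);
        return image;
    }
}
```

Synthesized fixtures keep the tests independent of any real worker binary on disk.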
Server-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:162-171,191-198,206-214,229-237 |
| Status | Resolved |
Description: The XML <remarks> and inline comments on AcknowledgeAlarm and QueryActiveAlarms describe the alarm path as not yet wired and say NotWiredAlarmRpcDispatcher is the default ("Clients calling this method today receive an OK reply with a 'worker alarm path not yet wired' diagnostic", "an empty stream until PR A.2"). In fact SessionServiceCollectionExtensions.AddGatewaySessions registers WorkerAlarmRpcDispatcher as IAlarmRpcDispatcher, so DI always injects the production dispatcher; NotWiredAlarmRpcDispatcher is only the null fallback. The comments are stale and misleading.
Recommendation: Update the AcknowledgeAlarm/QueryActiveAlarms remarks to reflect that WorkerAlarmRpcDispatcher is the wired default, and describe its actual GUID-vs-Provider!Group.Tag handling.
Resolution: Resolved 2026-05-18. Confirmed against source: SessionServiceCollectionExtensions registers WorkerAlarmRpcDispatcher as IAlarmRpcDispatcher, so the "not yet wired" / "empty stream until PR A.2" / "PR A.6/A.7 follow-up" prose in the AcknowledgeAlarm and QueryActiveAlarms <remarks> and inline comments was stale. Rewrote both <remarks> blocks and both inline comments to state that DI binds the production WorkerAlarmRpcDispatcher, that it routes over the worker pipe IPC, and that AcknowledgeAlarm handles a canonical-GUID reference (→ AcknowledgeAlarmCommand) and a Provider!Group.Tag reference (→ AcknowledgeAlarmByNameCommand), with NotWiredAlarmRpcDispatcher being only the null fallback. The matching stale WorkerAlarmRpcDispatcher class-level XML doc was corrected as part of Server-011. Pure documentation/comment change; no test.
Server-015
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | src/MxGateway.Server/Sessions/GatewaySession.cs:8-15,266-308,720-775 |
| Status | Resolved |
Description: GatewaySession guards its mutable state with two different sync primitives. TransitionTo, MarkFaulted, TouchClientActivity, the State/LastClientActivityAt/LeaseExpiresAt/FinalFault/ActiveEventSubscriberCount getters, AttachWorkerClient, and IsLeaseExpired all read/write _state, _finalFault, _lastClientActivityAt, _leaseExpiresAt, _workerClient, and _activeEventSubscriberCount under _syncRoot. CloseAsync (lines 720-775), however, reads _state at line 729 and writes _state at lines 736 (SessionState.Closing) and 761 (SessionState.Closed) while only holding the _closeLock SemaphoreSlim — _syncRoot is never acquired. A concurrent TransitionTo or MarkFaulted from another thread sees _state outside the lock that protects it, and the State getter is not guaranteed to observe the Closing/Closed writes promptly. SemaphoreSlim.WaitAsync/Release do happen to provide memory barriers in practice, but the locking discipline is split across two primitives, which is fragile and defeats the audit value of "all _state access is guarded by _syncRoot". Concretely, the race between CloseAsync setting _state = Closing and a concurrent TransitionTo(Ready) is unordered — and TransitionTo will happily overwrite Closing back to Ready because its only guard is "do not overwrite Closed/Faulted".
Recommendation: Make CloseAsync mutate _state through the existing TransitionTo(...) helper (or acquire _syncRoot around the reads/writes) so all _state access uses the same lock. Either extend TransitionTo to accept the Closing and Closed transitions (it already handles Faulted/Closed precedence) or refactor CloseAsync to call a private TrySetClosing() / MarkClosed() that locks _syncRoot. Add a regression test that forces a TransitionTo(Ready) after CloseAsync has set Closing and asserts the session does not flip back to Ready.
Resolution: 2026-05-20 — Unified the close path on _syncRoot. GatewaySession.CloseAsync (src/MxGateway.Server/Sessions/GatewaySession.cs) now mutates _state only through two private _syncRoot-locked helpers — TryBeginClose (writes Closing, returns the prior _closeStarted) and MarkClosed (writes Closed) — so every _state read/write in the session uses the same lock; _closeLock keeps its role of serializing concurrent close attempts. TransitionTo was tightened to refuse a transition out of Closing to anything other than Closed/Faulted so a late lifecycle callback cannot walk a closing session back to Ready. docs/Sessions.md updated to describe the unified lock discipline and the extended terminal precedence. Regression tests in src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs: TransitionTo_AfterCloseStarted_DoesNotOverwriteClosing (the named scenario — BlockingShutdownWorkerClient parks the close inside worker.ShutdownAsync so the test can call TransitionTo(Ready) between the Closing and Closed writes and assert the state stays Closing) and MarkFaulted_AfterCloseCompletes_DoesNotResurrectSession.
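Illustrative sketch of the unified lock discipline described above (not the shipped GatewaySession): every read and write of _state goes through _syncRoot, and TransitionTo refuses to leave Closing for anything other than Closed or Faulted. SessionStateSketch and the Created state are hypothetical, and how the real TryBeginClose treats an already-faulted session is an assumption here.

```csharp
using System;

// Created is a hypothetical initial state for this sketch.
public enum SessionState { Created, Ready, Closing, Closed, Faulted }

public sealed class SessionStateSketch
{
    private readonly object _syncRoot = new();
    private SessionState _state = SessionState.Created;
    private bool _closeStarted;

    public SessionState State
    {
        get { lock (_syncRoot) { return _state; } }
    }

    // Returns false when the transition is refused.
    public bool TransitionTo(SessionState next)
    {
        lock (_syncRoot)
        {
            // Terminal states are never overwritten.
            if (_state is SessionState.Closed or SessionState.Faulted) return false;

            // A closing session may only move on to Closed or Faulted.
            if (_state == SessionState.Closing &&
                next is not (SessionState.Closed or SessionState.Faulted))
            {
                return false;
            }

            _state = next;
            return true;
        }
    }

    // Writes Closing and returns the prior value of _closeStarted.
    public bool TryBeginClose()
    {
        lock (_syncRoot)
        {
            bool alreadyStarted = _closeStarted;
            _closeStarted = true;

            // Assumption: a terminal state is left untouched by a late close.
            if (!alreadyStarted && _state is not (SessionState.Closed or SessionState.Faulted))
            {
                _state = SessionState.Closing;
            }

            return alreadyStarted;
        }
    }

    public void MarkClosed()
    {
        lock (_syncRoot) { _state = SessionState.Closed; }
    }
}
```

With this shape, a late TransitionTo(SessionState.Ready) issued between TryBeginClose and MarkClosed returns false and leaves the state at Closing, which is the scenario the named regression test pins.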
Server-016
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | src/MxGateway.Server/Sessions/GatewaySession.cs:790-797, src/MxGateway.Server/Sessions/SessionManager.cs:237-258 |
| Status | Resolved |
Description: GatewaySession.DisposeAsync synchronously calls _closeLock.Dispose() (line 792) without first acquiring the lock and without checking whether a CloseAsync is still in flight. The normal call path is SessionManager.CloseSessionCoreAsync → session.CloseAsync(...) → RemoveSessionAsync → DisposeAsync, where DisposeAsync runs strictly after CloseAsync completes. But the ShutdownAsync path (SessionManager.cs:237-258) and any future caller that disposes a session while another thread is still inside CloseAsync will trip ObjectDisposedException when the in-flight CloseAsync releases the semaphore. The race is narrow today because all Close/Dispose choreography goes through SessionManager, but the class-level contract is broken: nothing on GatewaySession documents or enforces "DisposeAsync must not be called concurrently with CloseAsync".
Recommendation: In DisposeAsync, either (a) take and release _closeLock once before disposing it, so the dispose is sequenced after any in-flight close, or (b) replace _closeLock disposal with a guard flag and let the semaphore be reclaimed by the finalizer. Document the invariant on the public method. Add a regression test that disposes a session whose CloseAsync has not yet completed and asserts no ObjectDisposedException.
Resolution: 2026-05-20 — Took recommendation (a): GatewaySession.DisposeAsync (src/MxGateway.Server/Sessions/GatewaySession.cs) now acquires _closeLock once before disposing the semaphore so an in-flight CloseAsync finishes (its _closeLock.Release()) before the dispose tears the semaphore down. The wait is non-cancellable (CancellationToken.None) and ObjectDisposedException is swallowed at both the wait and the dispose site so double-dispose still completes cleanly. The method's XML doc was extended with a <remarks> block stating the invariant. Regression tests in src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs: DisposeAsync_WhileCloseInFlight_WaitsForCloseAndDoesNotThrow (parks CloseAsync inside the worker shutdown, calls DisposeAsync concurrently, releases shutdown, asserts both complete without ObjectDisposedException and the worker is disposed exactly once) and DisposeAsync_CalledTwice_DoesNotThrow.
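Illustrative sketch of recommendation (a), assuming a SemaphoreSlim close lock with an initial count of one: DisposeAsync acquires the lock once, so any in-flight close releases it first, and only then disposes it. CloseLockDisposalSketch and the shutdown delegate are hypothetical; the real session also disposes its worker client, which is omitted here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class CloseLockDisposalSketch : IAsyncDisposable
{
    private readonly SemaphoreSlim _closeLock = new(1, 1);

    // Hypothetical stand-in for CloseAsync: holds the lock while shutdown runs.
    public async Task CloseAsync(Func<Task> shutdown)
    {
        await _closeLock.WaitAsync().ConfigureAwait(false);
        try
        {
            await shutdown().ConfigureAwait(false);
        }
        finally
        {
            _closeLock.Release();
        }
    }

    public async ValueTask DisposeAsync()
    {
        try
        {
            // Sequence the dispose after any in-flight close.
            // Deliberately non-cancellable.
            await _closeLock.WaitAsync(CancellationToken.None).ConfigureAwait(false);
        }
        catch (ObjectDisposedException)
        {
            return; // already disposed by an earlier call
        }

        try
        {
            _closeLock.Dispose();
        }
        catch (ObjectDisposedException)
        {
            // Tolerate a racing second dispose.
        }
    }
}
```

The semaphore is never released after the final acquire, so no later CloseAsync can slip in between the wait and the dispose.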
Server-017
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:13-27, src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:173-247, docs/Authorization.md:108-110 |
| Status | Resolved |
Description: The two new top-level RPCs added to MxAccessGateway — AcknowledgeAlarm(AcknowledgeAlarmRequest) and QueryActiveAlarms(QueryActiveAlarmsRequest) (proto lines 23-24) — are not enumerated by GatewayGrpcScopeResolver.ResolveRequiredScope. The resolver's request switch covers OpenSessionRequest, CloseSessionRequest, StreamEventsRequest, MxCommandRequest, and the four Galaxy-repository requests; everything else falls through to _ => GatewayScopes.Admin. The interceptor (GatewayGrpcAuthorizationInterceptor.AuthenticateAndAuthorizeAsync) then rejects any non-admin caller with PermissionDenied. This is technically fail-closed (and docs/Authorization.md:108-110 documents the "unrecognized → admin" intent), but in practice it means: (1) only API keys with the admin scope can acknowledge alarms or query active alarms, even though acknowledging is naturally an invoke:write-shaped operation and querying is naturally an invoke:read- or metadata:read-shaped operation; (2) the alarm RPCs ship in a state where any client that successfully opened a session and subscribed to alarm events still cannot perform the operational acks the contract advertises; (3) the test matrix GatewayGrpcScopeResolverTests does not even cover these two request types, so the gap was not caught at unit-test time.
Recommendation: Add explicit arms to ResolveRequiredScope: map AcknowledgeAlarmRequest to GatewayScopes.InvokeWrite (parity with other write actions; ack changes alarm state) and QueryActiveAlarmsRequest to GatewayScopes.MetadataRead or GatewayScopes.InvokeRead. Update docs/Authorization.md to list both. Extend GatewayGrpcScopeResolverTests with the new mappings and an assertion that every request type defined by mxaccess_gateway.proto is named in the resolver (the test can enumerate the assembly's request types so a future RPC cannot quietly add itself only via the admin fallback).
Resolution: 2026-05-20 — Added explicit AcknowledgeAlarmRequest => GatewayScopes.InvokeWrite and QueryActiveAlarmsRequest => GatewayScopes.EventsRead arms to GatewayGrpcScopeResolver.ResolveRequiredScope (src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:21-22). InvokeWrite matches the existing MxCommandKind.Write* mapping because ack mutates alarm state; EventsRead matches StreamEventsRequest and MxCommandKind.DrainEvents because querying active alarms reads the same alarm/event surface. Extended GatewayGrpcScopeResolverTests with two new InlineData rows covering both request types (src/MxGateway.Tests/Security/Authorization/GatewayGrpcScopeResolverTests.cs:16-17) and added four interceptor-level cases in GatewayGrpcAuthorizationInterceptorTests (UnaryServerHandler_AcknowledgeAlarmMissingScope_ReturnsPermissionDenied, UnaryServerHandler_AcknowledgeAlarmWithScope_RunsHandler, ServerStreamingServerHandler_QueryActiveAlarmsMissingScope_ReturnsPermissionDenied, ServerStreamingServerHandler_QueryActiveAlarmsWithScope_RunsHandler) proving each new RPC denies callers lacking the chosen scope and runs the handler when the scope is held. Updated docs/Authorization.md (resolver snippet and Scope Catalog table) to list both RPCs against their scopes. dotnet test ... --filter FullyQualifiedName~GatewayGrpcAuthorizationInterceptorTests → 14 passed, 0 failed; resolver tests 28 passed, 0 failed.
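Illustrative sketch of the resolver shape after the fix, reduced to the arms this finding names. The empty request classes are hypothetical stand-ins for the generated proto types, and ScopeResolverSketch is not the real GatewayGrpcScopeResolver; the scope strings are the canonical ones listed under Server-012.

```csharp
using System;

// Hypothetical stand-ins for the generated proto request types.
public sealed class StreamEventsRequest { }
public sealed class AcknowledgeAlarmRequest { }
public sealed class QueryActiveAlarmsRequest { }

public static class ScopeResolverSketch
{
    public const string InvokeWrite = "invoke:write";
    public const string EventsRead = "events:read";
    public const string Admin = "admin";

    public static string ResolveRequiredScope(object request) => request switch
    {
        StreamEventsRequest => EventsRead,
        AcknowledgeAlarmRequest => InvokeWrite,   // ack mutates alarm state
        QueryActiveAlarmsRequest => EventsRead,   // reads the alarm/event surface
        _ => Admin,                               // fail closed for anything unrecognized
    };
}
```

The fallback arm stays in place, so a future RPC that is not named in the switch still requires the admin scope.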
Server-018
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:15 |
| Status | Resolved |
Description: GalaxyGlobMatcher.RegexCache is a ConcurrentDictionary<string, Regex> keyed by glob pattern, with no eviction. The fix for Server-008 added this cache deliberately to avoid recompiling the same handful of patterns, but the cache key is the raw glob string. The patterns currently come from two sources — DiscoverHierarchyRequest.TagNameGlob (client-supplied) and ApiKeyConstraints.BrowseSubtrees / ReadSubtrees / WriteSubtrees / ReadTagGlobs / WriteTagGlobs (admin-configured) — and BuildRegex also runs each glob through Regex.Escape so an attacker cannot craft a denial-of-service ReDoS payload. The leak is therefore bounded only by "how many distinct globs a client can submit over the process lifetime", which is in the millions for TagNameGlob if a client iterates through generated names. Each compiled Regex also holds a JIT'd assembly that is non-trivial to reclaim.
Recommendation: Cap the cache at a small bound (e.g. 256 patterns) using a simple LRU or a MemoryCache with sliding expiration, or restrict the cache to globs that originate from API-key constraints (admin-controlled, naturally bounded) and pay the compile cost for client-supplied globs. Add a test that fills the cache with thousands of distinct globs and asserts the cache size stays bounded.
Resolution: 2026-05-20 — Capped GalaxyGlobMatcher's compiled-regex cache at RegexCacheCapacity = 256 entries with FIFO-by-insertion eviction (src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs). A ConcurrentQueue<string> tracks insertion order; when the cache grows past the cap, EvictIfOverCapacity takes a small lock and dequeues + removes the oldest entries until the count is back within bound. Reads stay lock-free (the lock guards only the eviction path). Internal CurrentCacheSize / RegexCacheCapacity accessors are surfaced through the existing InternalsVisibleTo("MxGateway.Tests") so tests can assert the bound. Regression test: GalaxyFilterInputSafetyTests.GlobMatcher_WithManyDistinctPatterns_CacheStaysBounded submits RegexCacheCapacity * 4 distinct globs and asserts CurrentCacheSize stays in [0, RegexCacheCapacity]. Existing glob correctness tests (GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect, the adversarial-input theories) continue to pass, confirming eviction does not corrupt lookups.
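Illustrative sketch of a capped cache with FIFO-by-insertion eviction, assuming a ConcurrentDictionary for lock-free reads and a ConcurrentQueue for insertion order. BoundedFifoCacheSketch is a hypothetical generic helper, not the GalaxyGlobMatcher code; it inserts through GetOrAdd and tracks only the entries it actually inserted.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class BoundedFifoCacheSketch<TValue> where TValue : class
{
    private readonly ConcurrentDictionary<string, TValue> _cache = new(StringComparer.Ordinal);
    private readonly ConcurrentQueue<string> _insertionOrder = new();
    private readonly object _evictionLock = new();
    private readonly int _capacity;

    public BoundedFifoCacheSketch(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    public int Count => _cache.Count;

    public TValue GetOrCreate(string key, Func<string, TValue> factory)
    {
        // Fast path: reads never take the eviction lock.
        if (_cache.TryGetValue(key, out var existing)) return existing;

        TValue created = factory(key);

        // GetOrAdd returns whichever instance won a concurrent insert.
        TValue winner = _cache.GetOrAdd(key, created);

        // Only the thread whose instance was stored tracks the insertion.
        if (ReferenceEquals(winner, created))
        {
            _insertionOrder.Enqueue(key);
            EvictIfOverCapacity();
        }

        return winner;
    }

    private void EvictIfOverCapacity()
    {
        if (_cache.Count <= _capacity) return;

        lock (_evictionLock)
        {
            // Drop the oldest insertions until the count is back within the cap.
            while (_cache.Count > _capacity && _insertionOrder.TryDequeue(out var oldest))
            {
                _cache.TryRemove(oldest, out _);
            }
        }
    }
}
```

Eviction here is approximate FIFO: an evicted key that is requested again is simply rebuilt and re-enqueued, which costs a recompile but never a wrong lookup.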
Server-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:183-221 |
| Status | Resolved |
Description: WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync returns yield break (line 191) when sessionRegistry.TryGet(request.SessionId, ...) fails — it silently produces an empty stream with no diagnostic. The peer AcknowledgeAsync instead returns an AcknowledgeAlarmReply with ProtocolStatus.Code = SessionNotFound (lines 81-89), so the two methods have inconsistent missing-session handling. In production this branch is unreachable because MxAccessGatewayService.QueryActiveAlarms calls ResolveSession(...) first and throws NotFound from the gRPC layer (MxAccessGatewayService.cs:228), but: (a) the dispatcher is the seam other code paths might reach in the future, and (b) any unit test that instantiates the dispatcher directly with a missing session id sees an empty stream rather than a clear error, which is a footgun.
Recommendation: Either throw a SessionManagerException(SessionManagerErrorCode.SessionNotFound, ...) (matching the gRPC service's own resolver) or yield a single ActiveAlarmSnapshot with a diagnostic field set, and add a WorkerAlarmRpcDispatcherTests case that asserts whichever shape is chosen. Aligning with AcknowledgeAsync's SessionNotFound protocol-status pattern is preferred, but QueryActiveAlarms is a server-streaming RPC so a thrown SessionManagerException propagated by the gateway is the cleaner fit.
Resolution: 2026-05-20. Took the option the recommendation called the cleaner fit for a server-streaming RPC: WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync (src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs) now throws SessionManagerException(SessionManagerErrorCode.SessionNotFound, ...) instead of yield break-ing when the session is missing. MxAccessGatewayService.MapException already maps that error code to gRPC NotFound, so production callers see a consistent missing-session response and a direct unit-test caller now gets a clear error instead of an empty success. The unary peer AcknowledgeAsync continues to surface the same condition as an in-band ProtocolStatus.Code = SessionNotFound, which is correct for a unary RPC. Regression test: WorkerAlarmRpcDispatcherTests.QueryActiveAlarmsAsync_WhenSessionMissing_ThrowsSessionNotFound replaces the prior _YieldsEmpty assertion; it asserts the new exception shape and also exercises AcknowledgeAsync with the same missing session id to pin the peer-method parity.
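Illustrative sketch of the difference between yield break and throwing from an async iterator. SessionNotFoundException and the dictionary-backed lookup are hypothetical simplifications; the real dispatcher throws SessionManagerException and resolves sessions through its registry.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical exception standing in for SessionManagerException.
public sealed class SessionNotFoundException : Exception
{
    public SessionNotFoundException(string sessionId)
        : base($"Session '{sessionId}' was not found.")
    {
    }
}

public static class AlarmQuerySketch
{
    public static async IAsyncEnumerable<string> QueryActiveAlarmsAsync(
        IReadOnlyDictionary<string, IReadOnlyList<string>> alarmsBySession,
        string sessionId,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        if (!alarmsBySession.TryGetValue(sessionId, out var alarms))
        {
            // Previously `yield break;`, which looked like an empty success.
            throw new SessionNotFoundException(sessionId);
        }

        foreach (var alarm in alarms)
        {
            cancellationToken.ThrowIfCancellationRequested();
            yield return alarm;
            await Task.Yield();
        }
    }
}
```

Note that an async iterator body runs lazily: the exception surfaces when the caller first awaits MoveNextAsync (for example, on entering await foreach), not when QueryActiveAlarmsAsync is called.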
Server-020
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | src/MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor:1-2, …/GalaxyPage.razor:1-2, …/ApiKeysPage.razor:1-2, …/EventsPage.razor:1-2, …/SessionsPage.razor:1-2, …/WorkersPage.razor:1-2, …/SettingsPage.razor:1-2, …/SessionDetailsPage.razor:1-2 |
| Status | Resolved |
Description: Every dashboard page declares two @page directives — @page "/X" AND @page "/dashboard/X" — even though DashboardEndpointRouteBuilderExtensions.MapGatewayDashboard mounts the Razor components under a RouteGroupBuilder with pathBase = "/dashboard". The group prefix is prepended to each @page route, so the actual endpoints become /dashboard/X (from @page "/X") and /dashboard/dashboard/X (from @page "/dashboard/X"). The pages are reachable at two URLs each, and the deeper one (/dashboard/dashboard/sessions etc.) is almost certainly accidental — it leaks the path-base name into the URL and creates duplicate authorize/render work per route. GatewayApplicationTests.Build_WhenDashboardEnabled_ComponentRoutesRequireAuthorization only checks the /dashboard/X shape, so the duplicate route slipped through without an assertion.
Recommendation: Drop the @page "/dashboard/X" directive from each page; rely on the MapGroup("/dashboard") to provide the prefix. Or, if the team genuinely wants both URL shapes, document the choice in the file header and extend the route-enumeration test to assert that both are present (and both carry the authorization policy). Either way, the current setup is non-obvious.
Resolution: 2026-05-20 — Took the recommended drop: removed the redundant @page "/dashboard/X" directive from every dashboard Razor page (DashboardHome.razor, SessionsPage.razor, WorkersPage.razor, EventsPage.razor, GalaxyPage.razor, SettingsPage.razor, ApiKeysPage.razor, SessionDetailsPage.razor). Each page now declares only its bare route (e.g. @page "/sessions"); DashboardEndpointRouteBuilderExtensions.MapGatewayDashboard continues to prepend /dashboard via MapGroup, so each page is reachable at exactly one URL (/dashboard/X). Regression test: GatewayApplicationTests.Build_WhenDashboardEnabled_DoesNotRegisterDoubledDashboardPrefixRoutes enumerates the eight previously-doubled routes (/dashboard/dashboard/, /dashboard/dashboard/sessions, ... /dashboard/dashboard/sessions/{SessionId}) and asserts none of them are mapped. The existing ..._MapsBlazorDashboardAndAuthEndpoints / ..._ComponentRoutesRequireAuthorization tests continue to verify the desired /dashboard/X shapes are still present and policy-gated. No public URL contract changed (the doubled shape was accidental); no doc update needed — gateway.md and docs/GatewayDashboardDesign.md never referenced the doubled routes.
Server-021
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:266-664, src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs |
| Status | Resolved |
Description: The 1cd51bb commit history (the bulk read/write series, f220908/5e375f6/758aca2) added 473 lines of constraint-filtering and reply-merging logic to MxAccessGatewayService: ApplyConstraintsAsync (line 266), EnforceReadTagAsync / EnforceWriteHandleAsync, FilterTagBulkAsync / FilterReadBulkAsync / FilterWriteBulkAsync / FilterHandleBulkAsync, the ReplaceWriteBulkEntries switch, and three concrete BulkConstraintPlan records (SubscribeBulkConstraintPlan, WriteBulkConstraintPlan, ReadBulkConstraintPlan) that splice denied entries back into the worker's allowed-only reply in original-index order. None of this is covered by MxAccessGatewayServiceTests — its FakeSessionManager is wired with an AllowAllConstraintEnforcer (line 430) that never denies anything, so every constraint-related code path is dead at test time. A subtle off-by-one in BuildMerged, a wrong PayloadOneofCase in GetPayload / SetPayload, or a missing case in ReplaceWriteBulkEntries would all ship without a test failure.
Recommendation: Add MxAccessGatewayServiceTests cases that inject a deny-on-glob IConstraintEnforcer and exercise: (1) AddItemBulk / SubscribeBulk / AdviseItemBulk with a mix of allowed and denied tags, asserting BulkSubscribeReply.Results interleaves denied and worker-allowed entries in original-index order; (2) the same for ReadBulk and each of the four bulk-write commands; (3) HasAllowedItems == false so CreateDeniedReply is exercised (no worker call); (4) the unary Write/Write2/WriteSecured/WriteSecured2 paths through EnforceWriteHandleAsync. The fixtures can reuse the existing FakeSessionManager by replacing the constraint enforcer; no live worker is needed.
Resolution: 2026-05-20. Added a configurable PredicateConstraintEnforcer test double (src/MxGateway.Tests/TestSupport/PredicateConstraintEnforcer.cs) that denies on per-tag and per-handle predicates and records denials. Added 11 new tests in src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceConstraintTests.cs covering: (1) AddItemBulk with mixed denials: asserts the worker is called once with only the allowed subset and the merged reply interleaves denied and worker-allowed SubscribeResults at their original indices; (2) SubscribeBulk with every tag denied: asserts that HasAllowedItems == false short-circuits to CreateDeniedReply and the session manager is never invoked; (3) AdviseItemBulk (handle-keyed denial via CheckReadHandleAsync); (4) SubscribeBulk with the allow-all enforcer as a pass-through regression guard; (5) ReadBulk partial denial: asserts the ReadBulkConstraintPlan produces a BulkReadReply (not a BulkSubscribeReply) with denied entries spliced in at their original indices; (6) ReadBulk all-denied short-circuit; (7) WriteBulk partial denial: asserts denied entries are dropped from the forwarded Entries and the merged reply preserves original-index order; (8) WriteSecuredBulk all-denied: proves the second ReplaceWriteBulkEntries switch arm is reachable; (9) unary Write with denied handle → PermissionDenied, no worker call, denial recorded; (10) unary WriteSecured with denied handle → PermissionDenied; (11) unary AddItem with denied tag → PermissionDenied (EnforceReadTagAsync). MxAccessGatewayServiceTests.CreateService updated to accept an IConstraintEnforcer so future tests can opt into the deny enforcer without duplicating the wiring. All 11 new tests pass; full suite (dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj) is green at 458 passing.
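Illustrative sketch of the filter, forward, and merge flow these tests exercise: denied entries are answered locally, only the allowed subset reaches the worker, and the merged reply keeps original-index order. BulkResult, BulkMergeSketch, and the delegate-based worker call are hypothetical simplifications, not the real BulkConstraintPlan records.

```csharp
using System;
using System.Collections.Generic;

public sealed record BulkResult(string Tag, bool Allowed, string Detail);

public static class BulkMergeSketch
{
    public static IReadOnlyList<BulkResult> FilterForwardMerge(
        IReadOnlyList<string> tags,
        Func<string, bool> isAllowed,
        Func<IReadOnlyList<string>, IReadOnlyList<BulkResult>> callWorker)
    {
        var merged = new BulkResult[tags.Count];
        var allowedTags = new List<string>();
        var allowedIndices = new List<int>();

        for (int i = 0; i < tags.Count; i++)
        {
            if (isAllowed(tags[i]))
            {
                allowedTags.Add(tags[i]);
                allowedIndices.Add(i);   // remember where this entry came from
            }
            else
            {
                merged[i] = new BulkResult(tags[i], false, "PermissionDenied");
            }
        }

        // All denied: answer locally and never call the worker.
        if (allowedTags.Count == 0) return merged;

        IReadOnlyList<BulkResult> workerResults = callWorker(allowedTags);
        if (workerResults.Count != allowedTags.Count)
        {
            throw new InvalidOperationException("Worker reply size does not match the forwarded subset.");
        }

        // Splice each worker result back into its original slot.
        for (int j = 0; j < workerResults.Count; j++)
        {
            merged[allowedIndices[j]] = workerResults[j];
        }

        return merged;
    }
}
```

Recording the original index at filter time is what keeps the merge free of off-by-one errors when denied and allowed entries interleave.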
Server-022
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs:8-29 |
| Status | Resolved |
Description: Server-014's resolution noted that the stale "PR A.6 / A.7" / "not yet wired" language was rewritten on MxAccessGatewayService.AcknowledgeAlarm / QueryActiveAlarms and on the WorkerAlarmRpcDispatcher class doc. The corresponding XML doc on the interface IAlarmRpcDispatcher (lines 8-29) still says it is "PR A.6 / A.7 — gateway-side dispatcher" and that "Production implementations live in WorkerAlarmRpcDispatcher (this PR ships a not-yet-wired default that returns a clear worker-pending diagnostic)". That second clause directly contradicts the now-correct comments on the concrete implementations and on the gRPC service: WorkerAlarmRpcDispatcher is the wired default, not a not-yet-wired one. A reader who finds the interface first will believe the dispatcher is non-functional.
Recommendation: Rewrite the IAlarmRpcDispatcher <remarks> block to match the language now used on WorkerAlarmRpcDispatcher and on the gRPC service: DI binds WorkerAlarmRpcDispatcher by default; NotWiredAlarmRpcDispatcher is only the null fallback for tests/DI omission. Drop the "PR A.6 / A.7" prefix from the <summary> — the interface is now the public alarm-RPC seam.
Resolution: 2026-05-20 — Rewrote IAlarmRpcDispatcher's <summary> and <remarks> (src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs) to match the language now used on WorkerAlarmRpcDispatcher and on MxAccessGatewayService.AcknowledgeAlarm / QueryActiveAlarms: dropped the stale "PR A.6 / A.7" prefix from the summary, and replaced the "this PR ships a not-yet-wired default that returns a clear worker-pending diagnostic" clause with the correct statement that DI binds the production WorkerAlarmRpcDispatcher by default and NotWiredAlarmRpcDispatcher is only the null fallback for DI omission / standalone tests. Pure documentation change; no test.
Server-023
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | src/MxGateway.Server/Sessions/NotWiredAlarmRpcDispatcher.cs:10-26 |
| Status | Resolved |
Description: Server-014 and Server-022 swept the stale "PR A.6 / A.7" / "not-yet-wired" / "worker-pending" language off MxAccessGatewayService.AcknowledgeAlarm / QueryActiveAlarms, WorkerAlarmRpcDispatcher, and IAlarmRpcDispatcher. The concrete NotWiredAlarmRpcDispatcher class XML doc was not updated as part of either fix and still reads: "PR A.6 / A.7 — default IAlarmRpcDispatcher shipped while the worker-side AlarmClient event subscription is gated on dev-rig validation" and "When the worker dispatcher (PR A.6/A.7 dev-rig follow-up) lands, WorkerAlarmRpcDispatcher replaces this implementation in the DI container". That is the exact prose the other sweeps removed, and it directly contradicts the now-current narrative everywhere else: SessionServiceCollectionExtensions.AddGatewaySessions registers WorkerAlarmRpcDispatcher as the default IAlarmRpcDispatcher; NotWiredAlarmRpcDispatcher is only the null fallback used when no dispatcher is registered (DI omission / standalone tests). The diagnostic string returned by AcknowledgeAsync (line 39) — "the worker-side AlarmClient consumer (PR A.5) is in place but the dispatcher hookup is gated on validating the AVEVA alarm-provider event subscription on the dev rig" — is also stale; the dispatcher hookup landed and any client that actually sees that diagnostic today is hitting the null-fallback path, not the dev-rig gate it describes.
Recommendation: Replace the <summary> and <remarks> on NotWiredAlarmRpcDispatcher with text that matches the language now used on the interface and WorkerAlarmRpcDispatcher: "null fallback IAlarmRpcDispatcher used when no dispatcher is registered (DI omission / standalone tests); production wires WorkerAlarmRpcDispatcher." Either drop the AcknowledgeAsync diagnostic string's dev-rig framing entirely or shorten it to "alarm dispatcher is not registered." The #pragma warning disable CS1998 on QueryActiveAlarmsAsync is correct here (an empty stream is intentional for the null fallback) and should stay.
Resolution: 2026-05-20 — Rewrote NotWiredAlarmRpcDispatcher summary/remarks as the null-fallback dispatcher and shortened the AcknowledgeAsync diagnostic to "Alarm dispatcher is not registered."; updated the two tests that asserted the old "worker"-prefixed diagnostic.
Server-024
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:56-77 |
| Status | Resolved |
Description: GetOrCreateRegex's race-loser branch reads RegexCache[glob] with an indexer (line 76) after TryAdd returned false. The indexer throws KeyNotFoundException if the key is missing. Under the new bounded cache (Server-018), there is a real, if narrow, race where the key vanishes between the failing TryAdd and the indexer read. Thread A and thread B both compile a Regex for glob; A's TryAdd succeeds, so B's TryAdd returns false. Before B reaches the indexer, A enqueues glob and enters EvictIfOverCapacity, and the eviction loop dequeues glob (because other threads have already enqueued and evicted enough that glob is now the oldest entry) and removes it. B then reads RegexCache[glob] and the indexer throws. The window is tiny but nonzero: eviction is approximate FIFO, and a hot pattern that is repeatedly re-added near the cap is the natural trigger. The pre-Server-018 code used GetOrAdd, which had no such race because a single call both inserts and returns the winning instance.
Recommendation: Replace the TryAdd + indexer pair with RegexCache.GetOrAdd(glob, _ => compiled) so the dictionary atomically returns whichever instance won. Track the new insertion only when GetOrAdd returns the locally-compiled instance (ReferenceEquals(result, compiled)), then enqueue + evict. Alternatively, swap the trailing indexer read for TryGetValue + recursive recompile on miss. Add a stress test that mixes repeated reads of a single hot pattern with a flood of unique patterns near the cap and asserts no exception escapes IsMatch.
Resolution: 2026-05-20 — Replaced the TryAdd + indexer pair with RegexCache.GetOrAdd(glob, compiled); FIFO enqueue + eviction now run only when ReferenceEquals(result, compiled) (i.e. our caller was the inserter), eliminating the post-eviction KeyNotFoundException window.
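Illustrative sketch of the GetOrAdd + ReferenceEquals pattern described in this finding. It is not the shipped GalaxyGlobMatcher code: the class name BoundedRegexCache, the capacity of 4, the queue field, and the glob-to-regex translation are all assumptions made for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the matcher's cache; names and capacity are assumptions.
public static class BoundedRegexCache
{
    private const int Capacity = 4; // illustrative cap, not the real setting
    private static readonly ConcurrentDictionary<string, Regex> Cache = new ConcurrentDictionary<string, Regex>();
    private static readonly ConcurrentQueue<string> InsertionOrder = new ConcurrentQueue<string>();

    public static int Count => Cache.Count;

    public static Regex GetOrCreate(string glob)
    {
        if (Cache.TryGetValue(glob, out var cached))
        {
            return cached;
        }

        // Simplified glob translation: '*' matches any run, '?' matches one character.
        var pattern = "^" + Regex.Escape(glob).Replace("\\*", ".*").Replace("\\?", ".") + "$";
        var compiled = new Regex(pattern, RegexOptions.CultureInvariant);

        // GetOrAdd hands back whichever instance is stored, so no second lookup can miss.
        var result = Cache.GetOrAdd(glob, compiled);

        // Only the caller whose instance was stored records the insertion and evicts.
        if (ReferenceEquals(result, compiled))
        {
            InsertionOrder.Enqueue(glob);
            while (Cache.Count > Capacity && InsertionOrder.TryDequeue(out var oldest))
            {
                Cache.TryRemove(oldest, out _);
            }
        }

        return result;
    }
}
```

Even if eviction removes the entry immediately after insertion, the caller still holds a usable Regex, which is the property the indexer-based version lacked.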
Server-025
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:19-25, src/MxGateway.Server/Galaxy/IGalaxyRepository.cs |
| Status | Resolved |
Description: The Tests-016 fix introduced IGalaxyRepository so GalaxyHierarchyCache could be unit-tested against an in-memory fake, and GalaxyHierarchyCache was updated to depend on the interface. GalaxyRepositoryGrpcService was not updated and still receives the concrete GalaxyDb.GalaxyRepository via its primary constructor. Functionally this is fine — DI registers the concrete singleton and a thin sp.GetRequiredService<GalaxyRepository>() forwarder for the interface — but the seam is now half-applied: a future caller that wants to test or stub the gRPC service's TestConnection path has to construct a real GalaxyRepository against a SQL connection string, defeating the abstraction IGalaxyRepository was introduced for. The pattern also creates an inconsistency for new readers — two consumers in the same namespace, one on the interface and one on the concrete.
Recommendation: Change GalaxyRepositoryGrpcService's repository parameter to IGalaxyRepository. No DI change is needed (both forwarders already resolve to the same singleton). Optionally drop the concrete singleton registration and register the interface directly.
Resolution: 2026-05-20 — Changed GalaxyRepositoryGrpcService's repository primary-constructor parameter from the concrete GalaxyRepository to IGalaxyRepository; existing DI registration in GalaxyRepositoryServiceCollectionExtensions already resolves both the concrete and interface to the same singleton.
Server-026
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | src/MxGateway.Server/Configuration/GatewayOptionsValidator.cs:17-32, src/MxGateway.Server/Configuration/AlarmsOptions.cs |
| Status | Resolved |
Description: GatewayOptions.Alarms is bound from MxGateway:Alarms and consumed by SessionManager.TryAutoSubscribeAlarmsAsync (per-session SubscribeAlarms on Ready). GatewayOptionsValidator.Validate validates every other section (Authentication, Ldap, Worker, Sessions, Events, Dashboard, Protocol) but has no ValidateAlarms arm — AlarmsOptions is silently accepted regardless of contents. The runtime mitigates this by logging a warning when Enabled = true but neither SubscriptionExpression nor DefaultArea is set, then either faulting open-session (RequireSubscribeOnOpen = true) or skipping auto-subscribe — a configuration error therefore surfaces per-session at runtime instead of at startup. Other sections fail-fast at ValidateOnStart(), so the inconsistency makes alarm misconfiguration discoverable only after a client hits the gateway. A misformatted SubscriptionExpression (no \\<host>\Galaxy!<area> shape) likewise passes validation; the worker rejects it later.
Recommendation: Add a ValidateAlarms(options.Alarms, failures) arm in GatewayOptionsValidator. When Enabled = true, require either a non-blank SubscriptionExpression or a non-blank DefaultArea; when SubscriptionExpression is provided, sanity-check that it starts with \\ (the AVEVA UNC subscription shape) — or document that the shape is left to the worker to validate. Either way, treat the configuration as part of the validated surface.
Resolution: 2026-05-20 — Added ValidateAlarms to GatewayOptionsValidator: when Enabled = true, requires a non-blank SubscriptionExpression or DefaultArea, and when SubscriptionExpression is provided, requires it to start with \\ (canonical UNC subscription shape). Alarm misconfiguration now fails fast at startup instead of per-session.
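A minimal sketch of the validation rule described above, for readers who want the shape without opening the validator. The AlarmsOptions class here models only the three properties named in the finding, the failure strings are invented for illustration, and nesting the UNC-prefix check under Enabled is an assumption; the real GatewayOptionsValidator may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical subset of AlarmsOptions; only the properties named in the finding are modelled.
public sealed class AlarmsOptions
{
    public bool Enabled { get; set; }
    public string SubscriptionExpression { get; set; } = "";
    public string DefaultArea { get; set; } = "";
}

public static class AlarmsValidation
{
    // Appends one message per problem so all misconfiguration is reported at startup.
    public static void ValidateAlarms(AlarmsOptions alarms, List<string> failures)
    {
        if (!alarms.Enabled)
        {
            return; // assumption: a disabled section is not validated further
        }

        var hasExpression = !string.IsNullOrWhiteSpace(alarms.SubscriptionExpression);
        var hasArea = !string.IsNullOrWhiteSpace(alarms.DefaultArea);

        if (!hasExpression && !hasArea)
        {
            failures.Add("MxGateway:Alarms requires SubscriptionExpression or DefaultArea when Enabled is true.");
        }

        // Canonical subscription shape starts with a UNC-style double backslash.
        if (hasExpression && !alarms.SubscriptionExpression.StartsWith(@"\\", StringComparison.Ordinal))
        {
            failures.Add(@"MxGateway:Alarms:SubscriptionExpression must start with \\.");
        }
    }
}
```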
Server-027
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | docs/Authorization.md:120-141,176-181 |
| Status | Resolved |
Description: Two parts of docs/Authorization.md drifted from GatewayGrpcScopeResolver.ResolveCommandScope and from MxAccessGatewayService.ApplyConstraintsAsync over the bulk-read/bulk-write series (f220908/5e375f6/758aca2) and were not updated by the Server-017 / Server-021 fixes:
- The ResolveCommandScope code snippet at lines 120-141 still shows only Write/Write2 against InvokeWrite and WriteSecured/WriteSecured2/AuthenticateUser against InvokeSecure. The actual resolver also maps MxCommandKind.WriteBulk, MxCommandKind.Write2Bulk, MxCommandKind.WriteSecuredBulk, and MxCommandKind.WriteSecured2Bulk. A reader believing the snippet would conclude the bulk-write families inherit the fail-closed admin scope, when in fact they correctly map to InvokeWrite/InvokeSecure (the Scope Catalog table at lines 199-200 lists them).
- The Constraint Enforcement section (lines 176-181) says: "The service checks read constraints for AddItem, AddItem2, AddItemBulk, SubscribeBulk, and AdviseItemBulk. It checks write constraints for Write, Write2, WriteSecured, and WriteSecured2." The actual ApplyConstraintsAsync switch also enforces constraints for ReadBulk (read scope) and for WriteBulk/Write2Bulk/WriteSecuredBulk/WriteSecured2Bulk (write scope, per-entry filtering with index-order merge). Server-021 added test coverage for all of these without touching the doc.
Recommendation: Update the ResolveCommandScope snippet to include the four bulk-write arms. Update the Constraint Enforcement prose to enumerate the bulk read/write commands that are actually filtered, and reference the per-entry index-ordered merge that BulkConstraintPlan.MergeDeniedInto performs. Adding ReadBulk to the InvokeRead row of the Scope Catalog would also be useful — the table currently lists Register/AddItem/Advise against InvokeRead but not ReadBulk.
Resolution: 2026-05-20 — Updated the ResolveCommandScope snippet in docs/Authorization.md to enumerate the four bulk-write arms (WriteBulk/Write2Bulk against InvokeWrite, WriteSecuredBulk/WriteSecured2Bulk against InvokeSecure); expanded the Constraint Enforcement prose to list ReadBulk and all four bulk-write commands and to call out BulkConstraintPlan.MergeDeniedInto's index-ordered merge; added ReadBulk to the InvokeRead row of the Scope Catalog.
Server-028
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | src/MxGateway.Tests/Security/Authorization/GatewayGrpcScopeResolverTests.cs:13-20, src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs |
| Status | Resolved |
Description: Two narrow test gaps were not closed by Server-017 / Server-015:
- GatewayGrpcScopeResolverTests.ResolveRequiredScope_KnownRpcRequest_ReturnsExpectedScope enumerates OpenSessionRequest, CloseSessionRequest, StreamEventsRequest, AcknowledgeAlarmRequest, QueryActiveAlarmsRequest, TestConnectionRequest, GetLastDeployTimeRequest, and DiscoverHierarchyRequest. WatchDeployEventsRequest is missing even though it is named in the resolver's metadata-read arm and listed in the Scope Catalog. Similarly, the ResolveRequiredScope_InvokeCommand_ReturnsExpectedScope matrix covers every other write/secure/bulk command but omits MxCommandKind.ReadBulk, which is the only bulk family that falls into the _ => GatewayScopes.InvokeRead default arm. A regression that drops WatchDeployEvents from the request switch or that adds a new mapping for ReadBulk would not be caught.
- GatewaySessionTests (added under Server-015 / Server-016) covers the TransitionTo(Ready) and MarkFaulted(post-Close) cases but does not cover the third edge that Server-015's tightened state machine permits: MarkFaulted issued while CloseAsync is parked between TryBeginClose (Closing) and MarkClosed (Closed). The current MarkFaulted (GatewaySession.cs:314-326) checks only for Closed, so it overwrites Closing→Faulted; the subsequent MarkClosed then overwrites Faulted→Closed while _finalFault is preserved. The behaviour is consistent with the docs ("Closing only allows a transition to Closed or Faulted") but the test bundle does not pin it, and a future tightening of MarkFaulted could silently regress.
Recommendation: Extend GatewayGrpcScopeResolverTests.ResolveRequiredScope_KnownRpcRequest_ReturnsExpectedScope with [InlineData(typeof(WatchDeployEventsRequest), GatewayScopes.MetadataRead)] and extend the command theory with [InlineData(MxCommandKind.ReadBulk, GatewayScopes.InvokeRead)]. Add a GatewaySessionTests.MarkFaulted_DuringInFlightClose_PreservesFaultButYieldsToClose case using BlockingShutdownWorkerClient to park CloseAsync, call MarkFaulted while parked, release the worker, and assert State == Closed && FinalFault == "<the fault reason>".
Resolution: 2026-05-20 — Added [InlineData(typeof(WatchDeployEventsRequest), GatewayScopes.MetadataRead)] to GatewayGrpcScopeResolverTests.ResolveRequiredScope_KnownRpcRequest_ReturnsExpectedScope (the ReadBulk arm was already present); added GatewaySessionTests.MarkFaulted_DuringInFlightClose_PreservesFaultButYieldsToClose covering the parked-close + MarkFaulted interleave and asserting the post-release state is Closed with FinalFault = "concurrent-fault".
Server-029
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:52-58 |
| Status | Resolved |
Description: OpenSession advertises capabilities the gateway supports so clients can branch on them. The current list is unary-open-session, unary-close-session, unary-invoke, server-stream-events, bulk-subscribe-commands, unary-acknowledge-alarm, server-stream-active-alarms. The bulk-subscribe-commands token was added for the AddItemBulk / AdviseItemBulk / RemoveItemBulk / UnAdviseItemBulk / SubscribeBulk / UnsubscribeBulk family. The subsequent ReadBulk and WriteBulk / Write2Bulk / WriteSecuredBulk / WriteSecured2Bulk families landed without a corresponding capability token — the contract advertises bulk-subscribe support but is silent on bulk-read and bulk-write. A defensive client that gates on bulk-write-commands before issuing a WriteBulk has no signal that the family is supported; current clients sidestep this by ignoring the list entirely, but that just shifts the failure mode (an old client against a new server, or vice versa, will see Unimplemented instead of a structured Capabilities mismatch).
Recommendation: Either (a) extend the advertised list with bulk-read-commands and bulk-write-commands (WriteBulk / Write2Bulk / WriteSecuredBulk / WriteSecured2Bulk collectively), or (b) document in gateway.md and docs/Contracts.md that Capabilities is informational only and not the contract version. Option (a) is the simplest forward-compatible fix and keeps the capability token shape clients are already familiar with.
Resolution: 2026-05-20 — Extended the OpenSession capabilities list with bulk-read-commands and bulk-write-commands alongside the existing bulk-subscribe-commands token, so clients that gate on capability strings have an explicit signal for the bulk-read and bulk-write families.
Server-030
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | src/MxGateway.Server/Sessions/GatewaySession.cs:952-980 |
| Status | Resolved |
Description: Surfaced during the 2026-05-20 cross-language e2e run against a redeployed gateway (a020350). The Java client got 55 of 120 AddItem calls in, then Advise returned Session session-de7728a290bd41028ad6fec81e233144 is not ready. Current state is Ready. — a self-contradictory diagnostic. The check in GetReadyWorkerClient (GatewaySession.cs:956) is _state != SessionState.Ready || _workerClient?.State != WorkerClientState.Ready, but the formatted message only includes _state. When the gateway-side session state is Ready but the worker client's own WorkerClientState has transitioned (heartbeat watchdog firing, pipe disconnect detected by the read loop, etc.) before the session-level reaction observes it, the in-flight RPC fails fast here — and the operator sees a message that doesn't tell them which side of the gate the failure is on. The two-state gap itself is a real race (the worker-side state can shift independently of the gateway-driven session state) but a clear diagnostic is the prerequisite for diagnosing it; without it, a future investigation will start from "it says Ready but it's not Ready" instead of "the worker is Handshaking / Closing / Faulted while the session is still Ready".
Recommendation: Format both states into the exception message — Session {SessionId} is not ready. Session state is {_state}; worker state is {workerClientState}. (or "<no worker>" when _workerClient is null). Document on the method that the two states can diverge under load and that this branch is the fail-fast for that case. Add a regression test that flips FakeWorkerClient.State to a non-Ready value (e.g. Handshaking) while the session is Ready and asserts both pieces of state appear in the thrown SessionManagerException.Message. The deeper race investigation (should the gateway briefly wait for worker-Ready before failing? when does WorkerClient.State legitimately shift while the session is still Ready?) is out of scope for this finding but is worth a follow-up.
Resolution: 2026-05-20 — Rewrote GetReadyWorkerClient so the SessionManagerException message includes both _state and _workerClient.State (or "<no worker>" for the null case): "Session {SessionId} is not ready. Session state is {_state}; worker state is {workerState}.". Added XML doc on the method explaining the two-state contract and that this branch is the fail-fast for a state-divergence race. Added regression test SessionManagerTests.InvokeAsync_WhenWorkerNotReadyButSessionReady_DiagnosticIncludesBothStates that sets FakeWorkerClient.State = WorkerClientState.Handshaking while the session is Ready and asserts both "Session state is Ready" and "worker state is Handshaking" appear in the message; the test also pins InvokeCount == 0 so the worker isn't called. The deeper race (should GetReadyWorkerClient retry briefly when state has just diverged?) remains open for follow-up.
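For reference, a small sketch of the two-state diagnostic. The message format is the one quoted in the resolution; the enum members, the ReadinessDiagnostics class, and the FormatNotReady helper are hypothetical and exist only to make the example self-contained.

```csharp
// Hypothetical enums; the real SessionState / WorkerClientState may have other members.
public enum SessionState { Ready, Closing, Closed, Faulted }
public enum WorkerClientState { Handshaking, Ready, Closing, Faulted }

public static class ReadinessDiagnostics
{
    // Both sides of the readiness gate are reported so the operator can tell which one failed.
    public static bool IsReady(SessionState sessionState, WorkerClientState? workerState)
    {
        return sessionState == SessionState.Ready && workerState == WorkerClientState.Ready;
    }

    // workerState is null when no worker client is attached to the session.
    public static string FormatNotReady(string sessionId, SessionState sessionState, WorkerClientState? workerState)
    {
        var workerText = workerState.HasValue ? workerState.Value.ToString() : "<no worker>";
        return $"Session {sessionId} is not ready. Session state is {sessionState}; worker state is {workerText}.";
    }
}
```

With these inputs, a session that is Ready while its worker is still Handshaking yields a message naming both states instead of the self-contradictory "Current state is Ready."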
Server-031
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClient.cs:392-443 (gateway-side heartbeat watchdog); src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClientOptions.cs:14-67 (new HeartbeatStuckCeiling option) |
| Status | Resolved |
Description: Surfaced during the 2026-05-20 cross-language e2e re-run against gateway b794c46. The .NET phase succeeded through open-session/register/bulk-subscribe/bulk-read/bulk-unsubscribe/stream-events/write but then failed on its third advise call with the Server-030 diagnostic Session ... is not ready. Session state is Ready; worker state is Faulted. The gateway stdout log records the underlying cause: Worker client faulted for session session-01a1a07fa59c489983a719821fa46e72: Worker heartbeat expired. Last heartbeat was at 2026-05-20T17:20:39.+00:00. — a real 15s+ gap with no WorkerHeartbeat envelope arriving from the worker.
Investigation paths:
1. Shared _writeLock on the worker side. WorkerFrameWriter serializes every pipe write (heartbeats, command replies, events, faults) through a single SemaphoreSlim _writeLock (WorkerFrameWriter.cs:14,:67-76). RunEventDrainLoopAsync (WorkerPipeSession.cs:336-372) writes events one at a time inside a foreach, each call to _writer.WriteAsync re-acquiring _writeLock. If the gateway-side read drains slowly and the OS-level named-pipe buffer fills, _stream.WriteAsync (WorkerFrameWriter.cs:70) blocks. The event-drain loop blocks holding the lock. RunHeartbeatLoopAsync (WorkerPipeSession.cs:611-613) then can't acquire _writeLock to send its 5s heartbeat. Heartbeats stall past the gateway's HeartbeatGrace (15s default) and WorkerClient.HeartbeatLoopAsync faults the session.
2. No prioritization between heartbeats and events. Even without OS-level back-pressure, a backlog of events in the worker's MxAccessEventQueue (drained in batches of EventDrainBatchSize) can keep the writer lock held for many milliseconds at a time. Heartbeats can be delayed (though normally not past HeartbeatGrace unless something else is wrong).
3. Gateway-side heartbeat watchdog ignores in-flight commands. WorkerClient.HeartbeatLoopAsync (WorkerClient.cs:392-422) checks only _state == Ready and now - lastHeartbeatAt > HeartbeatGrace. It does not check whether a command is in flight on the gateway↔worker pipe. The mirror of Worker-017's fix (worker-side watchdog skips StaHung while a command is in flight) does not exist on the gateway side.
The .NET test pattern stresses the issue uniquely because each dotnet run --project rebuild between subcommands introduces multi-second client-side gaps; the worker's heartbeat path should still be alive (heartbeats are emitted by RunHeartbeatLoopAsync independently of gateway activity), but if the gateway is also blocked draining events from the channel into a non-existent StreamEvents consumer, the back-pressure-into-heartbeat chain bites first.
Recommendation: Two changes worth landing together:
1. Decouple heartbeat writes from the event/reply lock. Either (a) give heartbeats their own pipe Stream (likely impractical: one pipe per session), (b) introduce a priority queue in front of WorkerFrameWriter so heartbeats hop the line, or (c) interleave heartbeat checks inside RunEventDrainLoopAsync (e.g., after each event-batch write, post a heartbeat envelope if one is due). Option (c) is the smallest change.
2. Mirror Worker-017's "skip-while-command-in-flight" guard on the gateway side. In WorkerClient.HeartbeatLoopAsync, when _pendingCommands.Count > 0 and the oldest pending command is younger than some ceiling (e.g., 5×HeartbeatGrace), skip the fault. The worker may be busy executing a slow STA command and the heartbeat write may be queued behind a long event burst; neither indicates the worker is actually hung.
Add a regression test that floods the worker's outbound event channel (e.g., via a high-rate STA fixture or a mock event source emitting at > 1000 events/s for several seconds) and asserts the worker is not faulted while the gateway has no StreamEvents consumer attached.
Resolution: 2026-05-24 — Re-triaged at HEAD d2d2e5f: the gateway-side "skip-while-command-in-flight" guard (recommendation #2) is already implemented and verified against source. WorkerClient.HeartbeatLoopAsync (src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClient.cs:403-443) now skips the HeartbeatExpired fault while TryGetOldestPendingCommandAge reports an in-flight command younger than WorkerClientOptions.HeartbeatStuckCeiling (default 75s = 5× HeartbeatGrace). Once the oldest pending command exceeds the ceiling the watchdog fires anyway, so a truly stuck COM call doesn't hide the worker forever. The new HeartbeatStuckCeiling option is documented inline with a back-reference to Worker-023, the worker-side mirror. Regression tests in src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs: HeartbeatMonitor_WhenCommandInFlightWithinCeiling_DoesNotFaultOnExpiredHeartbeat (the named scenario — parks an unanswered InvokeAsync past HeartbeatGrace but well within HeartbeatStuckCeiling and asserts the client stays Ready) and HeartbeatMonitor_WhenPendingCommandExceedsStuckCeiling_FaultsClient (advances past the ceiling and asserts the watchdog still fires). Recommendation #1 (decoupling the worker-side _writeLock) is the worker module's concern and is tracked by Worker-017 / Worker-023 — out of scope for the Server module here.
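The watchdog decision described in this resolution can be reduced to a pure function, sketched below. HeartbeatWatchdog and ShouldFault are hypothetical names; the real logic lives inside WorkerClient.HeartbeatLoopAsync and reads the pending-command age via TryGetOldestPendingCommandAge. The sketch assumes a null age means no command is in flight.

```csharp
using System;

public static class HeartbeatWatchdog
{
    // oldestPendingCommandAge is null when no command is in flight (assumption for this sketch).
    public static bool ShouldFault(
        TimeSpan sinceLastHeartbeat,
        TimeSpan heartbeatGrace,
        TimeSpan? oldestPendingCommandAge,
        TimeSpan heartbeatStuckCeiling)
    {
        if (sinceLastHeartbeat <= heartbeatGrace)
        {
            return false; // heartbeat is still fresh
        }

        // A command younger than the ceiling suggests the worker is busy, not hung.
        if (oldestPendingCommandAge.HasValue && oldestPendingCommandAge.Value < heartbeatStuckCeiling)
        {
            return false;
        }

        // No command in flight, or the oldest one has outlived the ceiling: fault the worker.
        return true;
    }
}
```

Keeping the ceiling check means a genuinely stuck command delays the fault by at most HeartbeatStuckCeiling rather than masking it indefinitely.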
Server-032
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClient.cs:510-569 (gateway-side _events channel); src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClientOptions.cs:45-53 (EventChannelFullModeTimeout) |
| Status | Resolved |
Description: Surfaced during the 2026-05-20 cross-language e2e re-run against gateway b794c46. The Java phase advised ~55 items (item-handle 63) before failing on the next advise call with the Server-030 diagnostic Session ... is not ready. Session state is Ready; worker state is Faulted.. The gateway stdout log records: Worker client faulted for session session-adfcc808da974808947e87db060c2b03: Worker event channel rejected an event. — the gateway-side per-session bounded event channel filled up and Channel.Writer.TryWrite returned false, triggering the fail-fast path in EnqueueWorkerEventAsync (WorkerClient.cs:467-484).
The channel is configured as Channel.CreateBounded<WorkerEvent>(new BoundedChannelOptions(EventChannelCapacity) { ... FullMode = BoundedChannelFullMode.Wait ... }) (capacity defaults to EventOptions.QueueCapacity = 10_000). But EnqueueWorkerEventAsync uses TryWrite (non-blocking), so the configured Wait mode is moot — the writer always fails fast when full. This is consistent with docs/DesignDecisions.md's "fail-fast event backpressure" policy (one subscriber per session, no producer-side queuing beyond the channel), but two facts make it sharp in practice:
1. The e2e flow (and any realistic client) advises many items BEFORE opening a long-running StreamEvents consumer. With no consumer, events accumulate at the in-rate (driven by the SCADA tags' change frequency). For TestMachine_*.TestChangingInt × ~55 advised items, the rig can fill 10,000 in well under a minute.
2. The fail-fast threshold is "exactly at capacity." There is no overflow grace window. A momentary lull on the consumer side that lasts long enough for one extra event to arrive after the channel is full results in worker fault and session teardown.
This is design-as-intended in the v1 sense, but it surfaces a behavioral contract that is not currently documented: clients must open StreamEvents BEFORE issuing advise against high-rate tags, or pace their advise calls below the (non-published) accumulation budget. None of the current docs (gateway.md, docs/DesignDecisions.md, the client READMEs) enforce or surface this requirement, and four of the five client CLIs (go, python, rust, java) hit it gracelessly in scripts/run-client-e2e-tests.ps1.
The diagnostic "Worker event channel rejected an event." also does not name the actual channel (it says "Worker event channel" but the channel is gateway-owned), the current depth, or the capacity — only that it overflowed. Operators can't tell whether the threshold needs lifting or whether the consumer is genuinely missing.
Recommendation: Three escalating options, pick at least the first and consider one of the others:
1. Document the contract. In gateway.md and docs/DesignDecisions.md, state explicitly that advise produces events into the gateway-side per-session channel and that a StreamEvents consumer must be attached to drain it. Add the bound (MxGateway:Events:QueueCapacity, default 10,000) and the fault behavior (the worker is faulted; the session ends). Update clients/*/README.md to call out the requirement in the "advise" / "subscribe" sections.
2. Improve the diagnostic. Format the channel depth and capacity into the fault message: "Worker event channel rejected an event after {capacity} unconsumed events accumulated. Attach a StreamEvents consumer or increase MxGateway:Events:QueueCapacity."
3. Add an overflow grace window. Instead of fail-fast on the first TryWrite == false, count overflow events and only fault if N consecutive overflows happen within T ms (or, equivalently, switch to WriteAsync with a short timeout). This trades a tiny memory bump for resilience to consumer hiccups. Out of scope if v1 explicitly chose fail-fast for parity reasons, but worth raising for v2.
Add a regression test that advises N items without an active StreamEvents consumer, lets the channel fill, and asserts the produced fault message contains the channel-depth diagnostic (#2) — gated so that #3 is not required.
Resolution: 2026-05-24 — Re-triaged at HEAD d2d2e5f: recommendation #2 (improved diagnostic) is already implemented and verified against source. WorkerClient.EnqueueWorkerEventAsync (src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClient.cs:525-569) now (a) attempts TryWrite first for the fast path, (b) on full-channel falls through to WriteAsync with a linked CancellationTokenSource cancelled after WorkerClientOptions.EventChannelFullModeTimeout (default 5s), so a transient consumer hiccup is absorbed instead of fail-fast on the first overflow event, and (c) on real overflow records QueueOverflow("worker-events") and faults with the rich diagnostic message naming the wait timeout, the current channel depth, the channel capacity, and the actionable remediation ("Attach a StreamEvents consumer or raise MxGateway:Events:QueueCapacity."). Regression test: WorkerClientTests.EnqueueWorkerEvent_WhenChannelFullPastTimeout_FaultsWithRichDiagnostic (src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:473-521) fills a 4-slot channel + one overflow, asserts the worker is faulted, then drains the propagated WorkerClientException and pins the diagnostic string contains "Worker event channel rejected", "of 4 capacity", "StreamEvents", and "MxGateway:Events:QueueCapacity". Recommendation #1 (the prose contract in gateway.md / docs/DesignDecisions.md / client READMEs) is out of scope for this pass — the prompt restricts edits to src/ZB.MOM.WW.MxGateway.Server/**, src/ZB.MOM.WW.MxGateway.Tests/**, and this findings file; the documentation update needs to land in a follow-up that has docs access. Recommendation #3 (overflow grace window) was already implemented in spirit by the WriteAsync + timeout switch — the channel now absorbs a transient burst up to the configured wait timeout, satisfying #3's "consumer hiccup resilience" goal without requiring a separate counter.
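A hedged sketch of the TryWrite-then-WriteAsync-with-timeout flow described above, using System.Threading.Channels. EventChannelWriter and TryEnqueueAsync are hypothetical names, and the boolean return is a simplification: the real EnqueueWorkerEventAsync records a metric and faults the worker with a diagnostic instead.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class EventChannelWriter
{
    // Returns true when the item was queued, false when the channel stayed full for the whole timeout.
    public static async Task<bool> TryEnqueueAsync<T>(
        ChannelWriter<T> writer,
        T item,
        TimeSpan fullModeTimeout,
        CancellationToken cancellationToken)
    {
        // Fast path: space is available, no allocation of a timeout source.
        if (writer.TryWrite(item))
        {
            return true;
        }

        // Slow path: wait for the consumer, bounded by the configured timeout.
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        timeoutCts.CancelAfter(fullModeTimeout);
        try
        {
            await writer.WriteAsync(item, timeoutCts.Token).ConfigureAwait(false);
            return true;
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // Only the timeout fired; the caller decides how to fault and what to log.
            return false;
        }
    }
}
```

Cancellation of the caller's own token is deliberately left to propagate, so shutdown is not mistaken for overflow.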
Server-033
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:265-323 (TryRestoreFromDiskAsync), :84-99 (_firstLoad / WaitForFirstLoadAsync); src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:141-163 (WaitForCacheBootstrap) |
| Status | Resolved |
Description: TryRestoreFromDiskAsync populates _current with the on-disk snapshot (status Stale, HasData == true) but never completes the _firstLoad TaskCompletionSource — only the live-query paths (cheap / heavy / catch) in RefreshCoreAsync do. A DiscoverHierarchy or GetLastDeployTime call that arrives after gateway start but before the first refresh tick finishes sees cache.Current as Empty (status Unknown) when WaitForCacheBootstrap runs its initial check, so it falls through to await WaitForFirstLoadAsync with a 5-second budget. Restore then completes within milliseconds and makes the data available, but _firstLoad stays pending until the live query returns or fails. When the Galaxy database is unreachable — the exact scenario the snapshot feature exists for — the SQL connect attempt outlasts the 5s budget, so the caller waits the full 5 seconds before the budget elapses and the handler falls through to read the (already-restored) data. The result is correct, but the first browse calls after a cold offline start incur a needless ~5s latency, undercutting the feature's purpose.
Recommendation: Call _firstLoad.TrySetResult() at the end of TryRestoreFromDiskAsync once the restored entry is published — restored data is a valid completed first load. Add a regression test: a cache with a throwing repository plus a populated snapshot store should have WaitForFirstLoadAsync complete promptly after RefreshAsync, not block on the live query.
Resolution: Resolved in bdccdbf (2026-05-22): TryRestoreFromDiskAsync calls _firstLoad.TrySetResult() immediately after publishing the restored entry, so a restored snapshot satisfies the bootstrap gate without waiting on the live query. New test GalaxyHierarchyCacheTests.RefreshAsync_RestoredSnapshotCompletesFirstLoadBeforeLiveQueryReturns blocks the repository's deploy-time query and asserts WaitForFirstLoadAsync still completes from the snapshot.
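The first-load gate can be pictured as below. This is a sketch only: FirstLoadGate and its On... methods are hypothetical, standing in for the _firstLoad field that GalaxyHierarchyCache completes from both the restore path and the live-query paths.

```csharp
using System.Threading.Tasks;

// Hypothetical reduction of the cache's first-load signalling.
public sealed class FirstLoadGate
{
    private readonly TaskCompletionSource _firstLoad =
        new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);

    public Task WaitForFirstLoadAsync() => _firstLoad.Task;

    // Restored snapshot data counts as a completed first load.
    public void OnSnapshotRestored() => _firstLoad.TrySetResult();

    // The live query completes the same gate; TrySetResult is a no-op if already completed.
    public void OnLiveQueryFinished() => _firstLoad.TrySetResult();
}
```

Because TrySetResult is idempotent, the restore path and the later live query can both signal without coordinating.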
Server-034
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | src/MxGateway.Server/Galaxy/GalaxyHierarchySnapshotStore.cs:87-115 (TryLoadAsync) |
| Status | Resolved |
Description: TryLoadAsync carries the Try prefix and its XML doc says it returns null "when none exists, persistence is disabled, or the on-disk file uses an unrecognized schema version." But a corrupt or partially written JSON file makes JsonSerializer.DeserializeAsync throw JsonException, and an unreadable file (locked, denied ACL) throws IOException / UnauthorizedAccessException — none of which the method catches. End-to-end behavior is still safe because the sole caller, GalaxyHierarchyCache.TryRestoreFromDiskAsync, wraps the call in a catch (Exception); but the store's own Try-prefixed contract is violated, and any future caller would be surprised by the throw.
Recommendation: Catch JsonException, IOException, and UnauthorizedAccessException inside TryLoadAsync (UnauthorizedAccessException does not derive from IOException, so it needs to be named explicitly), log a warning, and return null, consistent with the unrecognized-schema-version branch already present and with the Try naming. A corrupt cache file is an expected failure mode for a disk cache.
Resolution: Resolved in bdccdbf (2026-05-22): TryLoadAsync now has a catch (Exception) when (exception is JsonException or IOException or UnauthorizedAccessException) that logs a warning and returns null. New test GalaxyHierarchySnapshotStoreTests.TryLoadAsync_WhenFileIsCorruptJson_ReturnsNull.
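A sketch of the exception-filter shape quoted in the resolution, applied to a toy loader. The Snapshot class, SnapshotLoader, and the console warning are assumptions for illustration; the real store has its own snapshot type, schema-version check, and logger.

```csharp
using System;
using System.IO;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical snapshot payload; the real schema is not shown in this finding.
public sealed class Snapshot
{
    public int SchemaVersion { get; set; }
}

public static class SnapshotLoader
{
    // Returns null for a missing, unreadable, or corrupt file instead of throwing.
    public static async Task<Snapshot> TryLoadAsync(string path, CancellationToken cancellationToken = default)
    {
        if (!File.Exists(path))
        {
            return null;
        }

        try
        {
            await using var stream = File.OpenRead(path);
            return await JsonSerializer.DeserializeAsync<Snapshot>(stream, cancellationToken: cancellationToken);
        }
        catch (Exception exception) when (exception is JsonException or IOException or UnauthorizedAccessException)
        {
            // Expected failure modes for a disk cache: warn and report "no snapshot".
            Console.Error.WriteLine($"Ignoring unreadable snapshot '{path}': {exception.Message}");
            return null;
        }
    }
}
```

Other exception types, including cancellation, still propagate to the caller.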
Server-035
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:176 (call site), :327-352 (PersistSnapshotAsync) |
| Status | Resolved |
Description: After a heavy refresh, RefreshCoreAsync awaits PersistSnapshotAsync while still holding _refreshGate, and the SaveAsync write has no timeout. The only caller of RefreshAsync is the sequential GalaxyHierarchyRefreshService loop, so a write that hangs — e.g. a SnapshotCachePath pointed at an unresponsive network share — blocks the gate and stalls all subsequent cache refreshes until gateway shutdown. Impact is bounded: clients keep being served the last entry (which flips to Stale after the 5-minute threshold), so this is a degradation rather than an outage, and the default C:\ProgramData path is local disk where a hang is unlikely.
Recommendation: Bound the snapshot write with a timeout — a linked CancellationTokenSource cancelling after, say, the SQL CommandTimeoutSeconds budget — so a stuck write fails fast and logs rather than pinning the refresh loop. Moving the write off the gate is an alternative but would need its own write-serialization.
Resolution: Resolved in bdccdbf (2026-05-22): SaveAsync wraps the write in a CancellationTokenSource.CreateLinkedTokenSource(cancellationToken) cancelled after Math.Max(1, CommandTimeoutSeconds) seconds, so a stuck write fails fast instead of pinning the refresh loop. The timeout-expiry path itself is not unit-tested — exercising it would require a genuinely hanging filesystem.
Server-036
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:345-348 (PersistSnapshotAsync catch) |
| Status | Resolved |
Description: PersistSnapshotAsync passes the refresh CancellationToken to SaveAsync and catches every exception — including the OperationCanceledException thrown when that token is cancelled at gateway shutdown — in its general catch (Exception), logging it as Warning: "Failed to persist the Galaxy hierarchy snapshot to disk.". A snapshot write interrupted by a normal shutdown is not a failure, but it surfaces as a misleading warning every time the gateway stops mid-write.
Recommendation: Let a cancellation-driven OperationCanceledException pass without the warning — e.g. add catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { } before the general catch — matching the cancellation handling already used in RefreshCoreAsync and TryRestoreFromDiskAsync.
Resolution: Resolved in bdccdbf (2026-05-22): PersistSnapshotAsync has a catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) ahead of the general catch, so a save aborted by gateway shutdown is silent while a genuine failure (including a write timeout) still logs. New test GalaxyHierarchyCacheTests.RefreshAsync_WhenSnapshotSaveCancelledAtShutdown_DoesNotLogPersistFailure.
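The catch ordering can be sketched as follows. SnapshotPersistenceSketch is a hypothetical stand-in, and the returned strings replace the real logger calls so the two branches are observable; the actual PersistSnapshotAsync body is assumed, not quoted.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in: the returned label replaces the real ILogger calls.
public static class SnapshotPersistenceSketch
{
    public static async Task<string> PersistAsync(
        Func<CancellationToken, Task> save,
        CancellationToken cancellationToken)
    {
        try
        {
            await save(cancellationToken).ConfigureAwait(false);
            return "saved";
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // Shutdown-driven cancellation: not a failure, so stay silent.
            return "cancelled-silently";
        }
        catch (Exception)
        {
            // Genuine failures land here, including a write timeout, because
            // the caller's token is not cancelled in that case.
            return "warning-logged";
        }
    }
}
```

The exception filter is what keeps a timeout-driven OperationCanceledException on the warning path: the filter tests the caller's token, not the exception type alone.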
Server-037
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | src/MxGateway.Tests/Galaxy/GalaxyHierarchySnapshotStoreTests.cs, src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs |
| Status | Resolved |
Description: The new snapshot tests cover the round-trip, missing-file, persistence-disabled, unrecognized-schema, and overwrite cases for the store, and the persist / restore-when-unreachable / promote-on-matching-deploy cases for the cache. Two resilience paths are untested: (1) GalaxyHierarchyCache.TryRestoreFromDiskAsync's catch path when the snapshot file is corrupt — the cache must come up Unavailable rather than throwing; (2) the cache restore path when PersistSnapshot = false (the store yields null and the cache stays Unavailable). Both are the failure modes most likely to matter operationally.
Recommendation: Add a cache test that writes a corrupt snapshot file and asserts RefreshAsync with an unreachable repository leaves the cache Unavailable without throwing, and a test that confirms a PersistSnapshot = false store neither restores nor persists. If Server-034 is fixed, the corrupt-file test also pins the store's null-return.
Resolution: Resolved in bdccdbf (2026-05-22): added GalaxyHierarchyCacheTests.RefreshAsync_WhenSnapshotFileCorrupt_ComesUpUnavailableWithoutThrowing and RefreshAsync_WhenPersistDisabled_DoesNotRestoreFromDisk, plus the TryLoadAsync_WhenFileIsCorruptJson_ReturnsNull store test added for Server-034.
Server-038
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs:23-44 |
| Status | Resolved |
Description: EventsHub is gated by [Authorize(Policy = DashboardAuthenticationDefaults.HubClientsPolicy)], which checks only that the caller carries a dashboard role (Admin or Viewer). SubscribeSession(sessionId) accepts any non-empty session id and joins the caller to session:{id}. A Viewer who knows or guesses a session id can therefore subscribe to any session's MxEvent stream once DashboardEventBroadcaster is broadcasting (which it now is, per d692232). The per-session ACL that gates the gRPC StreamEvents RPC is not replicated.
Recommendation: Before the EventsHub is exercised by Admin-only sessions or session-scoped Viewer roles, gate SubscribeSession on a session-access check — either via a per-session role check in the hub method itself, or by storing a per-user allowed-session-id set in the connection's Context.Items at connect time and rejecting subscribes outside that set. The current dashboard surfaces only a per-page Session Details view that the page can prove it's authorized for, but as soon as a Viewer role exists the gap matters.
Resolution: 2026-05-24 — Documented the v1 acceptance per the prompt's "practical fix for v1" direction. Added a detailed <remarks> block to EventsHub.SubscribeSession (src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs) stating that (a) in v1 the hub-level HubClientsPolicy only requires one of the dashboard roles (Admin or Viewer) and both may subscribe to any session id, (b) this is acceptable today because the dashboard's per-session views show non-secret session metadata any authenticated user can already see and value logging is gated by the same redaction policy, and (c) the per-session ACL that gates the gRPC StreamEvents RPC is intentionally not yet mirrored here. Added an explicit TODO(per-session-acl) describing the future enforcement seam — once a role/scope is introduced that scopes a Viewer to a specific session or tenant, add a session-access check at this method (inline on Context.User claims/Context.Items, or via a dedicated authorization policy applied to the hub method). No code-behavior change in this pass; the per-session ACL data model design is out of scope for the resolution window. No new regression test (the change is documentation-only).
Server-039
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs:37-58 |
| Status | Resolved |
Description: HubTokenService.Validate deserializes the protected JSON payload and trusts payload.Roles even when payload.Name and payload.NameIdentifier are both null. The resulting ClaimsPrincipal has the MxGateway.Dashboard.HubToken scheme as its AuthenticationType and the role claims, but no identity claims. Identity?.IsAuthenticated returns true because the auth type is non-empty, so the principal satisfies IsAuthenticated checks and IsInRole checks even though it has no caller identity. A token forged from a corrupted data-protection store could pass authorization without an associated user.
Recommendation: Mark HubTokenPayload.Name and HubTokenPayload.NameIdentifier as required (e.g. with [JsonRequired] once the project standardizes the JSON binder, or by validating non-null explicitly after deserialization) and reject the token if either is missing. Alternatively, document on IDashboardAuthorizationHandler consumers that they must check Identity?.Name is non-null before honoring role claims from this scheme.
Resolution: 2026-05-24 — HubTokenService.Validate (src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs) now rejects a deserialized payload where both Name and NameIdentifier are null/empty — returning null rather than emitting a principal with role claims but no caller identity. The check sits immediately after the protector unprotect and the null-payload guard, with a comment back-referencing Server-039. Either field is sufficient (a token minted with only a NameIdentifier still validates), matching the existing Issue path where the cookie principal may carry just one of them. New test file src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/HubTokenServiceTests.cs with six tests: Validate_TokenWithNullNameAndNullNameIdentifier_ReturnsNull (the named regression — confirmed to fail before the fix and pass after), Validate_TokenWithName_ReturnsAuthenticatedPrincipal, Validate_TokenWithOnlyNameIdentifier_ReturnsPrincipal, plus null/empty/garbage-token sanity checks. Verified by passing tests.
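A sketch of the identity guard, under stated assumptions: the HubTokenPayload shape below is inferred from the three members the finding names, and HubTokenGuard is a hypothetical helper that omits the data-protection unprotect step entirely.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Security.Claims;

// Assumed payload shape; only Name, NameIdentifier and Roles are named in the finding.
public sealed class HubTokenPayload
{
    public string? Name { get; set; }
    public string? NameIdentifier { get; set; }
    public string[] Roles { get; set; } = new string[0];
}

// Hypothetical helper; the real check lives inside HubTokenService.Validate.
public static class HubTokenGuard
{
    public const string Scheme = "MxGateway.Dashboard.HubToken";

    public static ClaimsPrincipal? ToPrincipal(HubTokenPayload? payload)
    {
        if (payload is null)
        {
            return null;
        }

        // Reject role-only payloads: either identity field is sufficient,
        // but at least one must be present.
        if (string.IsNullOrEmpty(payload.Name) && string.IsNullOrEmpty(payload.NameIdentifier))
        {
            return null;
        }

        var claims = new List<Claim>();
        if (!string.IsNullOrEmpty(payload.Name))
        {
            claims.Add(new Claim(ClaimTypes.Name, payload.Name));
        }
        if (!string.IsNullOrEmpty(payload.NameIdentifier))
        {
            claims.Add(new Claim(ClaimTypes.NameIdentifier, payload.NameIdentifier));
        }
        foreach (var role in payload.Roles)
        {
            claims.Add(new Claim(ClaimTypes.Role, role));
        }

        return new ClaimsPrincipal(new ClaimsIdentity(claims, Scheme));
    }
}
```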
Server-040
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardAuthenticator.cs:140-160 (MapGroupsToRoles) |
| Status | Resolved |
Description: MapGroupsToRoles checks each LDAP group against the role map twice — first by the full group string, then by ExtractFirstRdnValue(group) — and TryGetValue short-circuits on the first hit. The precedence ("full match wins over RDN match") is correct because the map's key set is operator-controlled and matches should resolve deterministically, but the lookup ordering is not documented. A future maintainer reading the code can't tell whether "fall through to RDN" is intentional or a leftover from refactoring IsMemberOfRequiredGroup.
Recommendation: Add a one-line comment above the loop explaining the precedence: full DN/CN literal first, leading-RDN fallback second. Mention the case-insensitive map comparer (OrdinalIgnoreCase) so the next reader doesn't ask why "GwAdmin" matches "gwadmin".
Resolution: 2026-05-24 — Added a precedence comment block above the lookup in MapGroupsToRoles (src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardAuthenticator.cs:156-163) explaining that the full literal group string is tried first and the leading-RDN value (e.g. GwAdmin extracted from ou=GwAdmin,ou=groups,...) is the fallback, and back-referencing DashboardOptions.GroupToRole as the source of the OrdinalIgnoreCase comparer so a maintainer sees why "GwAdmin" matches "gwadmin". No code change — existing DashboardAuthenticatorTests.MapGroupsToRoles_ResolvesByShortNameAndDistinguishedName already pins both the full-match and RDN-fallback paths and the case-insensitive lookup; pure documentation-only resolution, no new test.
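The documented precedence can be illustrated with a small sketch. GroupRoleMapperSketch is hypothetical, and its ExtractFirstRdnValue is a simplified stand-in that ignores DN escaping; the real method in DashboardAuthenticator may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the lookup order; not the gateway's implementation.
public static class GroupRoleMapperSketch
{
    public static HashSet<string> MapGroupsToRoles(
        IEnumerable<string> groups,
        IReadOnlyDictionary<string, string> groupToRole)
    {
        var roles = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (var group in groups)
        {
            // Precedence: full literal group string first, leading-RDN value second.
            // Case-insensitivity comes from the comparer the map was built with.
            if (groupToRole.TryGetValue(group, out var role)
                || groupToRole.TryGetValue(ExtractFirstRdnValue(group), out role))
            {
                roles.Add(role);
            }
        }
        return roles;
    }

    // Simplified: "ou=GwAdmin,ou=groups" yields "GwAdmin"; escaped commas are not handled.
    private static string ExtractFirstRdnValue(string group)
    {
        var firstRdn = group.Split(',')[0];
        var separator = firstRdn.IndexOf('=');
        return separator >= 0 ? firstRdn.Substring(separator + 1).Trim() : firstRdn.Trim();
    }
}
```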
Server-041
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:123-126, src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/IDashboardEventBroadcaster.cs:6-10 |
| Status | Resolved |
Description: IDashboardEventBroadcaster.Publish is documented as "Implementations must never throw — broadcast failures are best-effort and must not disrupt the source gRPC stream." EventStreamService honors that contract by passing the call through without a try/catch. The current DashboardEventBroadcaster implementation observes the SendAsync task's continuation but does not raise synchronously, so the seam is safe today. A future implementation that adds synchronous validation or a serializer hop could throw, faulting the producer loop and ending the gRPC stream.
Recommendation: Either wrap the Publish call in a try/catch (Exception ex) that logs at debug and continues (matching the DashboardSnapshotPublisher pattern), or add a code-review checklist note enforcing the never-throw contract on implementations. The wrap is safer because it doesn't depend on convention.
Resolution: 2026-05-24 — Took the safer wrap. EventStreamService.ProduceEventsAsync (src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:123-137) now wraps the dashboardEventBroadcaster.Publish(...) call in a try / catch (Exception ex) that logs at debug and continues. The producer loop and the gRPC stream are no longer at the mercy of the broadcaster's never-throw discipline — a future implementation that adds synchronous validation or a serializer hop cannot fault the stream. Regression test in src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs: StreamEventsAsync_WhenDashboardBroadcasterThrows_StillYieldsEventsAndDoesNotFaultSession (new) injects a ThrowingDashboardEventBroadcaster test double that throws InvalidOperationException on every Publish, then asserts (a) the gRPC stream still yields both events in order, (b) the broadcaster's Publish is attempted for every event (so the catch is exercised per-event rather than aborting the loop), and (c) the session does not transition to Faulted. Confirmed to fail before the fix (the producer loop surfaced the simulated InvalidOperationException) and pass after. Verified by passing tests.
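A reduced sketch of the wrap, assuming simplified stand-in types: IEventBroadcaster, ThrowingBroadcaster and ProducerLoopSketch are hypothetical, and the real producer loop is asynchronous and yields MxEvent messages rather than strings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for IDashboardEventBroadcaster.
public interface IEventBroadcaster
{
    void Publish(string sessionId, string payload);
}

// Test double that violates the never-throw contract on every call.
public sealed class ThrowingBroadcaster : IEventBroadcaster
{
    public int Attempts { get; private set; }

    public void Publish(string sessionId, string payload)
    {
        Attempts++;
        throw new InvalidOperationException("simulated broadcaster failure");
    }
}

public static class ProducerLoopSketch
{
    public static List<string> Produce(
        string sessionId,
        IEnumerable<string> events,
        IEventBroadcaster broadcaster,
        Action<Exception> logDebug)
    {
        var delivered = new List<string>();
        foreach (var payload in events)
        {
            try
            {
                broadcaster.Publish(sessionId, payload);
            }
            catch (Exception ex)
            {
                // Best-effort fan-out: log and keep the source stream alive.
                logDebug(ex);
            }

            delivered.Add(payload);
        }
        return delivered;
    }
}
```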
Server-042
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardSnapshotPublisher.cs:18-41 |
| Status | Resolved |
Description: DashboardSnapshotPublisher.ExecuteAsync reads from IDashboardSnapshotService.WatchSnapshotsAsync inside an outer try that catches OperationCanceledException only. A failure inside WatchSnapshotsAsync (e.g. the snapshot service throws after a transient SQL failure for the Galaxy summary projection) escapes the outer try and ends the BackgroundService — no automatic reconnect. The sibling AlarmsHubPublisher (lines 55-61) wraps its StreamAsync consumer in a 5-second reconnect loop with catch (Exception ex) and continues. The snapshot publisher should follow the same shape.
Recommendation: Wrap the await foreach in a while (!stoppingToken.IsCancellationRequested) loop with a catch (Exception ex) plus a 5-second Task.Delay, mirroring AlarmsHubPublisher. Today's snapshot service rarely throws on the watch path, but a one-time logger-init failure or transient IGatewayConfigurationProvider exception would silently take the dashboard offline.
Resolution: 2026-05-24 — Mirrored AlarmsHubPublisher's reconnect loop. DashboardSnapshotPublisher.ExecuteAsync (src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardSnapshotPublisher.cs) now wraps the await foreach in a while (!stoppingToken.IsCancellationRequested) loop and catches general exceptions with a logged warning + Task.Delay(reconnectDelay, stoppingToken). The 5-second DefaultReconnectDelay is preserved for production via the public constructor; an internal overload injects a shorter delay for the regression test (with [InternalsVisibleTo("ZB.MOM.WW.MxGateway.Tests")] already in place). Also tightened cancellation handling: the inner OperationCanceledException returns instead of merely catching, so a normal shutdown exits cleanly rather than re-looping on the cancelled token. New test file src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs with two cases: ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay (the named regression — ThrowOnceThenYieldSnapshotService throws on the first WatchSnapshotsAsync call and yields a snapshot on the second; the publisher must reconnect, broadcast the snapshot, and the gap between throw and reconnect must respect the configured delay) and ExecuteAsync_WhenSnapshotServiceCompletes_ReconnectsAfterDelay (sanity case: a normal yield break also triggers the reconnect loop). Confirmed both tests fail against the original single-try implementation (the BackgroundService exits and SubscribeCount stays at 1) and pass after the fix. Verified by passing tests.
Server-043
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs:1, src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardServiceCollectionExtensions.cs:24 |
| Status | Resolved |
Description: HubTokenService is registered as a singleton (good — data protection providers are thread-safe and a single protector instance is correct) and shared by both DashboardHubConnectionFactory (per-circuit scoped, mints fresh tokens from the cookie principal) and HubTokenAuthenticationHandler (per-request transient, validates inbound tokens). The class-level docs describe what the service does but not that it is intentionally a singleton with two consumer scopes, so a future maintainer rewriting the DI registration may pick the wrong lifetime.
Recommendation: Add a <remarks> block to HubTokenService noting "Registered as a singleton in AddGatewayDashboard; the underlying ITimeLimitedDataProtector is thread-safe and shared across hub-token issuance and validation." Optionally add a comment near the DI registration explaining the lifetime contract.
Resolution: 2026-05-24 — Added a <remarks> block to HubTokenService (src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs) documenting that the service is registered as a singleton in DashboardServiceCollectionExtensions.AddGatewayDashboard and is shared by two consumer scopes — DashboardHubConnectionFactory (scoped, per-circuit; calls Issue from the cookie-authenticated dashboard) and HubTokenAuthenticationHandler (transient, per-request; calls Validate from the SignalR negotiate / connection path). Notes that the underlying ITimeLimitedDataProtector is thread-safe so concurrent mint/validate from any number of callers is safe, and explicitly asks future maintainers to preserve the singleton lifetime to keep the protector instance stable. Pure documentation change; no test.
Re-review 2026-05-24 (commit 42b0037)
Server-044
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-254 |
| Status | Open |
Description: KillWorkerAsync is the mirror of CloseSessionCoreAsync for the new admin-only Kill flow, but its catch path leaks the mxgateway.sessions.open gauge, the exact bug that Server-006 closed for OpenSessionAsync. The happy path calls _metrics.SessionClosed() once after session.KillWorker(reason) returns (line 244), which decrements _openSessions. The catch path, however, records _metrics.Fault(...), calls session.MarkFaulted(...), and then awaits RemoveSessionAsync(session), but never calls _metrics.SessionClosed() (nor SessionRemoved()), so a kill that throws from session.KillWorker leaves the open-session gauge permanently one too high. RemoveSessionAsync only calls _metrics.RemoveSessionEvents(...) and ReleaseSessionSlot(); neither touches _openSessions. Server-006's fix pattern (track whether the open-counter was recorded, and decrement on the failing path) was applied to OpenSessionAsync but not propagated to this new write path.
In practice the trigger is narrow — GatewaySession.KillWorker calls _workerClient?.Kill(reason) and TransitionTo(SessionState.Closed); the worker client's Kill method on a faulted client or a worker process that has already exited could throw, and the catch path then leaks the gauge. Sustained operator use of the dashboard Kill action on misbehaving workers would gradually inflate mxgateway.sessions.open and corrupt the metric exposed by /health/metrics and any Grafana panel keying off it.
Recommendation: Mirror Server-006's fix: track whether the session was counted as opened (it always is in KillWorkerAsync, because GetRequiredSession only succeeds for sessions in the registry, all of which had SessionOpened() called), and decrement on the failing path. Concretely, add _metrics.SessionClosed() (or _metrics.SessionRemoved() if the kill is being treated as an unclean removal) inside the catch block before RemoveSessionAsync(session). Two shapes work: record SessionClosed() once at the top of the method (under a flag) and re-record only if the happy path actually transitions, or add _metrics.SessionClosed() in the catch right after MarkFaulted. Add a SessionManagerTests.KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge regression test that sets FakeWorkerClient.KillThrows = true to exercise the catch.
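A sketch of the second shape (decrement inside the catch). OpenSessionGauge and KillFlowSketch are hypothetical stand-ins, not the gateway's GatewayMetrics or SessionManager; the wasClosed check, the Fault metric and MarkFaulted are left out to keep the accounting visible.

```csharp
using System;

// Simplified stand-in for the open-session gauge kept by GatewayMetrics.
public sealed class OpenSessionGauge
{
    private readonly object _gate = new object();
    private int _openSessions;

    public int OpenSessions
    {
        get { lock (_gate) { return _openSessions; } }
    }

    public void SessionOpened()
    {
        lock (_gate) { _openSessions++; }
    }

    public void SessionClosed()
    {
        // Same floor-at-zero guard the review attributes to GatewayMetrics.SessionClosed.
        lock (_gate) { if (_openSessions > 0) { _openSessions--; } }
    }
}

public static class KillFlowSketch
{
    // killWorker and removeSession stand in for session.KillWorker(reason)
    // and RemoveSessionAsync(session).
    public static void Kill(OpenSessionGauge metrics, Action killWorker, Action removeSession)
    {
        try
        {
            killWorker();
            metrics.SessionClosed();
        }
        catch (Exception)
        {
            // Fault recording and MarkFaulted are omitted from this sketch.
            metrics.SessionClosed(); // the decrement the finding reports as missing
        }

        removeSession();
    }
}
```

Whether the real catch rethrows after cleanup is not stated in the finding; the sketch swallows the exception only to keep the example short.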
Server-045
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:225,242-245, src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:837-841 |
| Status | Open |
Description: KillWorkerAsync reads session.State once into a local bool wasClosed (line 225) before calling session.KillWorker(reason). The read itself is safe, because State is a getter that takes _syncRoot internally, but no lock spans the sequence "read state, call KillWorker, conditionally record metric." Two concurrent KillWorkerAsync calls on the same session (e.g. one operator clicking Kill on the Sessions page and another clicking Kill on the Session Details page within the same render tick) can both observe wasClosed = false, then both call session.KillWorker(...) (the second is effectively a no-op because TransitionTo refuses to overwrite Closed), and both call _metrics.SessionClosed() at line 244. The _openSessions gauge is bounded at 0 by GatewayMetrics.SessionClosed's if (_openSessions > 0) guard, but the _sessionsClosed counter (and the mxgateway.sessions.closed counter exported by the meter) is double-incremented. _metrics.Fault is not used on this path, so the only mitigation is the SessionsRegistry race: the second call's GetRequiredSession misses if the first call has already removed the session via RemoveSessionAsync, which holds only when the second call arrives after that removal completes. The window is small but exists, and the same race applies to "Kill from one tab while the lease-expired sweep is closing the session." CloseSessionCoreAsync has the same shape, so this is not a regression specific to the kill change, but the new path widens the surface where the issue can fire.
Recommendation: Either (a) gate KillWorkerAsync on a per-session lock — extending the _closeLock pattern that GatewaySession.CloseAsync already uses, or introducing a new _killLock and accepting that close + kill don't serialize against each other — or (b) accept the metric double-count as harmless and document it on KillWorkerAsync's XML doc. Option (a) is the more defensible long-term fix; option (b) is acceptable for v1 if the metric is purely informational. Adding a test that issues concurrent kills against the same session id and asserts _sessionsClosed == 1 would pin the chosen behavior either way.
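Option (a) can be sketched with a per-session SemaphoreSlim. KillableSessionSketch and KillOnceAsync are hypothetical names; the sketch does not model how a kill lock would interact with the existing _closeLock.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical per-session lock; illustrates "check state, kill, record metric"
// as one serialized unit.
public sealed class KillableSessionSketch
{
    private readonly SemaphoreSlim _killLock = new SemaphoreSlim(1, 1);
    private bool _closed;

    // Returns true only for the caller that actually performed the transition.
    public async Task<bool> KillOnceAsync(Action recordClosed)
    {
        await _killLock.WaitAsync().ConfigureAwait(false);
        try
        {
            if (_closed)
            {
                return false; // a concurrent or earlier kill already closed the session
            }

            _closed = true;   // stands in for KillWorker + TransitionTo(Closed)
            recordClosed();   // stands in for _metrics.SessionClosed()
            return true;
        }
        finally
        {
            _killLock.Release();
        }
    }
}
```

The same shape doubles as the concurrent-kill test the recommendation asks for: issue several kills at once and assert the close was recorded exactly once.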
Server-046
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:286-307 |
| Status | Open |
Description: ShutdownAsync was updated to fall back to KillWorker when CloseSessionCoreAsync throws (lines 294-305), a useful resilience improvement on its own. But the fallback's bookkeeping is wrong: session.KillWorker(GatewayShutdownReason) is called and RemoveSessionAsync(session) is awaited, but _metrics.SessionClosed() is never invoked, so a session whose graceful close throws can leave the mxgateway.sessions.open gauge incremented after shutdown completes. The exact outcome depends on the exception type. CloseSessionCoreAsync's SessionCloseStartedException catch (line 330) already records _metrics.SessionRemoved() (lines 334-336) before re-throwing, so for that exception the gauge is decremented inside the inner catch and the outer fallback does not double-decrement (good), but _metrics.SessionClosed() is never called and the _sessionsClosed counter under-counts by one. For any other exception (the more common case), neither inner catch records anything, so both _sessionsClosed and _openSessions end up wrong: the gauge is left high and the counter is left low.
Recommendation: Inside the ShutdownAsync fallback (after the KillWorker call but before/inside the RemoveSessionAsync), call _metrics.SessionClosed() unless the inner catch already recorded the close. The simplest shape is to propagate a wasClosed flag out of CloseSessionCoreAsync (or replace the fallback's manual choreography with a single call into KillWorkerAsync(...), which has the right metric path once Server-044 is fixed). The latter is the cleanest — ShutdownAsync becomes "try graceful, fall back to KillWorkerAsync," and there's exactly one accounting path for each session. Add a SessionManagerTests.ShutdownAsync_WhenCloseThrows_StillDecrementsOpenSessionGauge test using a session whose CloseAsync throws (e.g. a BlockingShutdownWorkerClient configured to throw on ShutdownAsync).
Server-047
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:324-334, src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionsPage.razor:171-195, src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionDetailsPage.razor:231-255 |
| Status | Open |
Description: The shared ConfirmDialog.razor (added in 0e56b5b / 24cc5fd) is wired by three pages, but the pages handle PendingAction cleanup inconsistently:
- ApiKeysPage.ConfirmPendingAsync captures the action, sets PendingAction = null synchronously, then awaits the action via RunManagementActionAsync. The dialog disappears the moment Confirm is clicked, and the user sees no busy indication on the dialog itself (the busy state lives on RunManagementActionAsync's IsBusy = true).
- SessionsPage.ConfirmPendingAsync and SessionDetailsPage.ConfirmPendingAsync keep PendingAction set, set IsBusy = true, await the action, then clear PendingAction = null in the finally. The dialog stays open during the call and visibly disables its buttons via IsBusy.
The user-visible difference: rotating/revoking/deleting a key vs closing/killing a session uses two different dialog-lifecycle patterns. Neither is broken, but ConfirmDialog is the shared component and its IsBusy parameter exists precisely to render the in-flight state — ApiKeysPage discards that signal by closing the dialog before the action runs.
Recommendation: Align ApiKeysPage.ConfirmPendingAsync with the sessions pages: hold PendingAction, set IsBusy = true, run the action, then clear PendingAction in the finally. The current ApiKeysPage shape was inherited from before the dialog existed (when the confirmation was a confirm() JS call); the dialog component change can flatten the difference now. As a smaller alternative, document the divergence on the component's XML doc — but the shared component should ideally be used consistently.
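The sessions-page lifecycle can be sketched outside Blazor as a plain state holder. ConfirmDialogStateSketch is hypothetical; the real pages keep PendingAction and IsBusy as component fields, and re-rendering (StateHasChanged) is not modeled here.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Hypothetical state holder mirroring the pattern the sessions pages use.
public sealed class ConfirmDialogStateSketch
{
    public Func<Task>? PendingAction { get; private set; }
    public bool IsBusy { get; private set; }

    public void Request(Func<Task> action)
    {
        PendingAction = action;
    }

    public async Task ConfirmPendingAsync()
    {
        var action = PendingAction;
        if (action is null || IsBusy)
        {
            return;
        }

        IsBusy = true; // dialog stays open and can disable its buttons
        try
        {
            await action();
        }
        finally
        {
            IsBusy = false;
            PendingAction = null; // dialog closes only after the action finishes
        }
    }
}
```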
Server-048
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:463-498 |
| Status | Open |
Description: The two new KillWorkerAsync_* tests cover the happy path (KillWorkerAsync_KillsWorkerAndRemovesSession) and the missing-session error (KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound). Three behaviorally distinct cases are missing:
- Catch path: session.KillWorker throws. Today no test exercises the failure branch, so the Server-044 gauge leak ships without coverage. A FakeWorkerClient.ThrowOnKill = true (or equivalent) would let WorkerClient.Kill throw; the test would assert the session is removed, _metrics.Fault is recorded, and the open-session gauge is decremented.
- wasClosed = true path: kill on an already-Closed session must not re-increment mxgateway.sessions.closed. No assertion pins this.
- Concurrent kill: two KillWorkerAsync calls for the same session id; Server-045's double-increment lives or dies on whether this is tested.
Recommendation: Add the three tests above. The fakes in MxGateway.Tests/TestSupport/ already cover most of the moving parts; FakeWorkerClient needs a single ThrowOnKill flag (or the existing KillThrowing if any).
Server-049
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs:5-18, src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:8-25 |
| Status | Open |
Description: IDashboardSessionAdminService declares three members — CanManage, CloseSessionAsync, KillWorkerAsync — none of which carry XML documentation. DashboardSessionAdminService.CanManage and the two operation methods are also undocumented (only the constructor parameters are named). The C# style guide requires public-surface XML docs and CLAUDE.md mandates that "docs change with the code." The peer IDashboardApiKeyManagementService is also undocumented, so this isn't unique — but the new interface is a fresh public surface being landed in c5e7479, and the contract subtleties (CanManage returns false for non-Admin; missing-session paths surface as Succeeded = false not as a thrown exception; KillReason is fixed at "dashboard-admin-kill" and that value reaches the audit log) are exactly what XML docs are for.
Recommendation: Add <summary> blocks to IDashboardSessionAdminService.CanManage (states the Admin-role gate), CloseSessionAsync and KillWorkerAsync (state that missing sessions return DashboardSessionAdminResult.Fail(...) rather than throwing, and that the audit log captures actor + remote IP). Add <param> and <returns> for the request/response shape. The same sweep can pick up the longstanding gap on IDashboardApiKeyManagementService if the team wants — but the new file is the load-bearing one.
Server-050
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:42-75,92-125 |
| Status | Open |
Description: CloseSessionAsync and KillWorkerAsync catch only SessionManagerException (the SessionNotFound filter, then a general SessionManagerException catch). Anything else propagates raw to the Blazor circuit. The propagation paths exist:
- SessionManager.CloseSessionAsync → CloseSessionCoreAsync catches OperationCanceledException and SessionCloseStartedException; any other exception (e.g. an IOException from worker pipe teardown surfacing through session.CloseAsync → _workerClient.ShutdownAsync) propagates raw.
- SessionManager.KillWorkerAsync wraps only the session.KillWorker(reason) call in a try/catch. Exceptions from RemoveSessionAsync → session.DisposeAsync (the new _closeLock.WaitAsync / dispose choreography from Server-016), particularly a TaskCanceledException if the caller's CancellationToken fires mid-dispose, or an aggregate exception from concurrent disposal, also propagate raw.
Today neither call site has a Blazor error boundary, so an unhandled exception lands as a generic Blazor circuit error page. The friendlier-error contract that Server-044's commit message advertises ("audit-logs, friendly errors") is incomplete: only SessionManagerException gets a friendly error.
Recommendation: Add a general catch (Exception exception) after the SessionManagerException catch in both CloseSessionAsync and KillWorkerAsync, log a warning (matching the SessionManagerException pattern), and return DashboardSessionAdminResult.Fail($"{operation} failed unexpectedly. See the gateway log for details."). This makes the result type truly the only output the page sees. Add a regression test using a ThrowingSessionManager that throws e.g. InvalidOperationException from KillWorkerAsync and asserts the service returns a failing result rather than propagating.
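A sketch of the catch ordering recommended above. AdminResultSketch, SessionManagerExceptionSketch and AdminOperationSketch are hypothetical stand-ins for DashboardSessionAdminResult, SessionManagerException and the service methods; the logWarning delegate replaces the real logger.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Hypothetical stand-ins for the gateway's exception and result types.
public sealed class SessionManagerExceptionSketch : Exception
{
    public SessionManagerExceptionSketch(string message) : base(message) { }
}

public sealed class AdminResultSketch
{
    private AdminResultSketch(bool succeeded, string? message)
    {
        Succeeded = succeeded;
        Message = message;
    }

    public bool Succeeded { get; }
    public string? Message { get; }

    public static AdminResultSketch Ok() => new AdminResultSketch(true, null);
    public static AdminResultSketch Fail(string message) => new AdminResultSketch(false, message);
}

public static class AdminOperationSketch
{
    public static async Task<AdminResultSketch> RunAsync(
        string operation,
        Func<Task> action,
        Action<Exception> logWarning)
    {
        try
        {
            await action();
            return AdminResultSketch.Ok();
        }
        catch (SessionManagerExceptionSketch exception)
        {
            logWarning(exception);
            return AdminResultSketch.Fail(exception.Message);
        }
        catch (Exception exception)
        {
            // Catch-all so the page only ever sees a result, never a raw exception.
            logWarning(exception);
            return AdminResultSketch.Fail(
                $"{operation} failed unexpectedly. See the gateway log for details.");
        }
    }
}
```

Whether caller-driven cancellation should also be folded into a failing result, or allowed to propagate, is a design choice the sketch does not settle.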