# Code Review — Server | Field | Value | |---|---| | Module | `src/ZB.MOM.WW.MxGateway.Server` | | Reviewer | Claude Code | | Review date | 2026-05-24 | | Commit reviewed | `d692232` | | Status | Reviewed | | Open findings | 0 | ## Checklist coverage ### 2026-05-20 review (commit 1cd51bb) This row summarizes the 2026-05-20 review pass at commit `1cd51bb`. Findings from prior passes (Server-001 through Server-014) are all closed and remain below as audit history. | # | Category | Result | |---|---|---| | 1 | Correctness & logic bugs | Issues found: Server-019 (`WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` yields silently when session is missing). | | 2 | mxaccessgw conventions | No issues found — convention drift previously called out is resolved; no new gaps observed. | | 3 | Concurrency & thread safety | Issues found: Server-015 (`GatewaySession._state` is written under `_closeLock` but read/written elsewhere under `_syncRoot`). | | 4 | Error handling & resilience | Issues found: Server-016 (`GatewaySession.DisposeAsync` disposes the close-lock semaphore while it may be held). | | 5 | Security | Issues found: Server-017 (`AcknowledgeAlarm` / `QueryActiveAlarms` fall through to admin-only scope because the resolver was not updated for the new alarm RPCs). | | 6 | Performance & resource management | Issues found: Server-018 (`GalaxyGlobMatcher` regex cache is unbounded — currently low-risk but uncapped). | | 7 | Design-document adherence | No issues found at this pass. | | 8 | Code organization & conventions | Issues found: Server-020 (dashboard pages each declare two `@page` directives — `@page "/X"` AND `@page "/dashboard/X"` — producing duplicate routes under the `/dashboard` group prefix). | | 9 | Testing coverage | Issues found: Server-021 (`MxAccessGatewayService.ApplyConstraintsAsync` and the new `BulkConstraintPlan` / `ReadBulkConstraintPlan` / `WriteBulkConstraintPlan` / `SubscribeBulkConstraintPlan` merge logic is entirely untested). | | 10 | Documentation & comments | Issues found: Server-022 (`IAlarmRpcDispatcher` XML doc still describes the dispatcher as "ships a not-yet-wired default"; stale after Server-014). | ### 2026-05-20 review (commit a020350) Re-review pass at `a020350` — the cross-module sweep that resolved Server-015 through Server-022. Verified each fix is sound (lock discipline now uniform on `_syncRoot`; `DisposeAsync` gates on `_closeLock`; alarm RPCs map to `InvokeWrite`/`EventsRead`; glob cache is bounded; alarm dispatcher SessionNotFound flows through `MxAccessGatewayService.MapException` → gRPC `NotFound`; dashboard pages emit a single `@page`; 11 new `MxAccessGatewayServiceConstraintTests` cover the bulk-constraint plans). New findings filed against this pass. | # | Category | Result | |---|---|---| | 1 | Correctness & logic bugs | Issues found: Server-024 (`GalaxyGlobMatcher.GetOrCreateRegex` indexer access after `TryAdd` fails can throw `KeyNotFoundException` under contention near the cap). | | 2 | mxaccessgw conventions | No issues found. | | 3 | Concurrency & thread safety | No new issues found — Server-015/016 fixes verified sound. | | 4 | Error handling & resilience | Issues found: Server-026 (`AlarmsOptions` is bound but not validated by `GatewayOptionsValidator`). | | 5 | Security | No issues found — Server-017 mapping (`InvokeWrite` / `EventsRead`) is defensible and exercised by both resolver and interceptor tests. | | 6 | Performance & resource management | No issues found — Server-018 cap is in place and tested. | | 7 | Design-document adherence | Issues found: Server-027 (`docs/Authorization.md` `ResolveCommandScope` code snippet and Constraint Enforcement section omit the bulk read/write command families). | | 8 | Code organization & conventions | Issues found: Server-025 (`GalaxyRepositoryGrpcService` still consumes the concrete `GalaxyRepository` after `IGalaxyRepository` was introduced for testability — inconsistent with `GalaxyHierarchyCache`). | | 9 | Testing coverage | Issues found: Server-028 (`GatewayGrpcScopeResolverTests` does not exercise `WatchDeployEventsRequest` or `MxCommandKind.ReadBulk`; no `GatewaySessionTests` case asserts a `MarkFaulted` during in-flight Close). | | 10 | Documentation & comments | Issues found: Server-023 (`NotWiredAlarmRpcDispatcher` class XML doc still says "PR A.6/A.7 — default … shipped while the worker-side AlarmClient event subscription is gated on dev-rig validation"; contradicts the cleanup that Server-014/Server-022 applied to the interface, gateway service, and `WorkerAlarmRpcDispatcher`). Issues found: Server-029 (`OpenSession` capability list advertises `bulk-subscribe-commands` but not the now-shipping bulk-read or bulk-write families — clients that gate on capability strings have no signal that those families exist). | ### 2026-05-22 review (commit fa491c7) Re-review pass at `fa491c7`, scoped to the Galaxy hierarchy snapshot-persistence change: the new `GalaxyHierarchySnapshot`, `IGalaxyHierarchySnapshotStore` / `GalaxyHierarchySnapshotStore`, the restore / persist paths added to `GalaxyHierarchyCache`, the two new `GalaxyRepositoryOptions`, and the `docs/GalaxyRepository.md` / `docs/GatewayConfiguration.md` updates. Prior findings (Server-001 through Server-032) are unchanged by this pass. | # | Category | Result | |---|---|---| | 1 | Correctness & logic bugs | No issues found — restore/save sequencing and the shared `BuildEntry` materialization are sound. | | 2 | mxaccessgw conventions | No issues found — file-scoped namespaces, `sealed`, `Async` suffixes, Options pattern, and XML docs all conform; the snapshot persists Galaxy metadata (names/types), not tag values or secrets. | | 3 | Concurrency & thread safety | No issues found — `_restoreAttempted` and `_current` are touched only under `_refreshGate`; `_current` is published via `Volatile.Write`; the store serializes its file I/O on a private `SemaphoreSlim`. | | 4 | Error handling & resilience | Issues found: Server-033 (restore never completes `_firstLoad`, so a cold-start browse waits the full 5s bootstrap budget), Server-034 (`TryLoadAsync` throws on a corrupt file despite the `Try` prefix), Server-036 (a save cancelled at shutdown logs a misleading warning). | | 5 | Security | No issues found — the snapshot holds non-secret Galaxy metadata, is written under `C:\ProgramData\MxGateway` alongside the auth DB, and restored rows flow the same materialization path as live SQL with no injection surface. | | 6 | Performance & resource management | Issues found: Server-035 (the snapshot write is awaited on the refresh critical path under `_refreshGate` with no timeout). | | 7 | Design-document adherence | No issues found — `docs/GalaxyRepository.md` and `docs/GatewayConfiguration.md` were updated in the same commit; `docs/DesignDecisions.md` already defers to `GalaxyRepository.md` as the Galaxy authority. | | 8 | Code organization & conventions | No issues found — the new options live on `GalaxyRepositoryOptions`, the store is a registered singleton, and the on-disk envelope (`PersistedFile`) is a private nested record. | | 9 | Testing coverage | Issues found: Server-037 (no test for the corrupt-snapshot restore path or for `PersistSnapshot = false` at the cache level). | | 10 | Documentation & comments | No issues found — XML docs match behavior; the `GalaxyRepository.md` "On-disk snapshot" section documents the Stale-on-restore lifecycle. | ### 2026-05-24 review (commit d692232) Re-review pass at `d692232` scoped to the dashboard refactor wave: the `ZB.MOM.WW` project rename (`dc9c0c9`), the `QueryActiveAlarms` public RPC implementation (`397d3c5`), the LDAP role-mapping + HubToken bearer auth (`27ed651`), the sidebar layout + three SignalR push hubs (`6594359`), and the EventsHub broadcaster + doc refresh (`d692232`). Server-031 and Server-032 remain open and untouched — neither the gateway-side `_writeLock` heartbeat contention nor the bounded `_events` channel saw any changes in this wave. | # | Category | Result | |---|---|---| | 1 | Correctness & logic bugs | No issues found in the fa491c7..d692232 diff. | | 2 | mxaccessgw conventions | No issues found — rename hygiene clean, external runtime identifiers (`MeterName`, `MxGateway.Dashboard` scheme, `MxGateway.Request` logger, `MxGateway.Worker.STA` thread name) intentionally unprefixed per commit message. | | 3 | Concurrency & thread safety | No issues found — Server-031 (`_writeLock` heartbeat watchdog contention) remains open and unchanged. | | 4 | Error handling & resilience | Issues found: Server-039 (`HubTokenService.Validate` accepts a payload with null Name/NameIdentifier), Server-041 (`EventStreamService` calls the broadcaster without a try/catch — fragile seam), Server-042 (`DashboardSnapshotPublisher` tight retry loop with no backoff vs `AlarmsHubPublisher` 5-second delay). | | 5 | Security | Issues found: Server-038 (`EventsHub.SubscribeSession` accepts any session id from any Viewer; no per-session ACL). | | 6 | Performance & resource management | Issues found: Server-042 (`DashboardSnapshotPublisher` lacks reconnect backoff). | | 7 | Design-document adherence | Issues found: Server-041 (broadcaster's never-throw contract documented in the interface but not enforced by the caller). | | 8 | Code organization & conventions | Issues found: Server-040 (undocumented lookup-order precedence in `MapGroupsToRoles`), Server-043 (singleton sharing of `HubTokenService` undocumented). | | 9 | Testing coverage | No issues found in this module — see Tests-026 in the Tests module for the missing EventsHub broadcast coverage. | | 10 | Documentation & comments | Issues found: Server-040, Server-043 (both documentation gaps). | ## Findings ### Server-001 | Field | Value | |---|---| | Severity | Critical | | Category | Security | | Location | `src/MxGateway.Server/GatewayApplication.cs:147-149`, `src/MxGateway.Server/Dashboard/DashboardEndpointRouteBuilderExtensions.cs:55-58`, `src/MxGateway.Server/Dashboard/Components/Routes.razor:1-15` | | Status | Resolved | **Description:** The dashboard authorization policy (`DashboardAuthenticationDefaults.AuthorizationPolicy`), `DashboardAuthorizationRequirement`, and `DashboardAuthorizationHandler` are registered in DI but never applied to any endpoint. `MapRazorComponents()` has no `.RequireAuthorization(...)`, the `` in `Routes.razor` uses plain `RouteView` (not `AuthorizeRouteView`), and no dashboard page carries `[Authorize]` — a module-wide grep finds zero `RequireAuthorization`/`[Authorize]`/`AuthorizeRouteView` usages. Every dashboard page (Sessions, Workers, Events, Galaxy, Settings, and the API Keys list exposing key IDs, scopes, and constraints) is reachable by any unauthenticated remote client regardless of `Dashboard:AllowAnonymousLocalhost` or `Dashboard:RequireAdminScope`. Only the API-key mutation operations remain protected, via the separate `DashboardApiKeyManagementService.CanManage` check. **Recommendation:** Apply the policy at the route level — `endpoints.MapRazorComponents().AddInteractiveServerRenderMode().RequireAuthorization(DashboardAuthenticationDefaults.AuthorizationPolicy)` — and/or switch `Routes.razor` to `AuthorizeRouteView` with a `[Authorize]` fallback policy plus a `NotAuthorized` redirect to the login page. Add an integration test that GETs a dashboard page anonymously and asserts 302-to-login / 401. **Resolution:** Resolved in `a8aafdf` (2026-05-18): `MapRazorComponents()` now calls `.RequireAuthorization(DashboardAuthenticationDefaults.AuthorizationPolicy)`, so an unauthenticated request to any dashboard component route is challenged by the cookie scheme and redirected to the login page. `GatewayApplicationTests` gained `ComponentRoutesRequireAuthorization` (component routes carry the policy) and `AuthEndpointsAllowAnonymousAccess`, replacing the prior test that asserted the insecure behavior. ### Server-002 | Field | Value | |---|---| | Severity | Medium | | Category | Design-document adherence | | Location | `src/MxGateway.Server/Program.cs:24`, `src/MxGateway.Server/GatewayApplication.cs` | | Status | Resolved | **Description:** `gateway.md:583` and CLAUDE.md state the first version "terminates orphaned workers on startup." No code in MxGateway.Server enumerates or kills leftover `MxGateway.Worker.exe` processes at startup — a grep for `orphan`/`reattach`/`terminate` finds nothing. After an unclean gateway crash, x86 worker processes (each holding an MXAccess COM instance) leak and survive indefinitely, and a restarted gateway does not reclaim or kill them. **Recommendation:** Add a startup hosted service that finds and kills stale worker processes (by executable path / a well-known argument or environment marker) before the server accepts sessions, or update the design docs if reattachment/cleanup is deliberately deferred. **Resolution:** Resolved 2026-05-18. Confirmed against source: no code path enumerated or killed leftover workers. Added `IRunningProcessInspector` / `SystemRunningProcessInspector` (a testable seam over `Process.GetProcessesByName`/`Kill`), `OrphanWorkerTerminator` (kills processes matched by the configured worker executable path, or by image name when the x64 gateway cannot introspect the x86 worker's `MainModule`, skipping the current process and tolerating per-process kill failures), and `OrphanWorkerCleanupHostedService` (best-effort `IHostedService`). The hosted service is registered in `AddWorkerProcessLauncher` ahead of `AddGatewaySessions` so cleanup runs before the server accepts sessions. `gateway.md` updated to describe the implemented behavior. Regression tests: `OrphanWorkerTerminatorTests` (`KillsWorkerProcessesMatchingConfiguredExecutablePath`, `KillsImageNameMatchWhenExecutablePathUnreadable`, `DoesNotKillUnrelatedProcessSharingImageName`, `DoesNotKillCurrentProcess`, `ContinuesWhenOneKillThrows`). ### Server-003 | Field | Value | |---|---| | Severity | High | | Category | Security | | Location | `src/MxGateway.Server/Dashboard/DashboardAuthorizationHandler.cs:39,54-59`, `src/MxGateway.Server/Dashboard/DashboardAuthenticator.cs:236-258` | | Status | Resolved | **Description:** When `Dashboard:RequireAdminScope` is true (the default) and the request is not loopback, `DashboardAuthorizationHandler` succeeds only if `HasAdminScope` finds a claim of type `"scope"` with value `"admin"`. But `DashboardAuthenticator.CreatePrincipal` issues only `NameIdentifier`, `Name`, and `LdapGroupClaimType` claims — never a `scope`/`admin` claim. So a correctly LDAP-authenticated user who passed the required-group check is still denied dashboard access on any non-loopback connection. The bug is currently masked by the missing route-level enforcement (Server-001) and by `AllowAnonymousLocalhost`; fixing Server-001 would make the dashboard unusable for all real LDAP logins. **Recommendation:** Either have `DashboardAuthenticator.CreatePrincipal` add a `scope=admin` claim when the user is in the required group, or change `DashboardAuthorizationHandler.HasAdminScope` to evaluate LDAP group membership (reuse `IsMemberOfRequiredGroup` against the `LdapGroupClaimType` claims, as `DashboardApiKeyAuthorization.CanManage` already does). **Resolution:** Resolved in `a8aafdf` (2026-05-18): `DashboardAuthenticator.CreatePrincipal` — reached only after the required-group check passes — now emits the `scope=admin` claim that `DashboardAuthorizationHandler` checks, so group-validated LDAP users pass `RequireAdminScope` once route-level authorization (Server-001) is enforced. ### Server-004 | Field | Value | |---|---| | Severity | Medium | | Category | Code organization & conventions | | Location | `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs:227-233`, `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCliRunner.cs:53-77`, `src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:21-67` | | Status | Resolved | **Description:** `ParseScopes` accepts any comma-separated strings and `CreateKeyAsync` persists them verbatim; neither the CLI nor the dashboard create path validates scopes against `GatewayScopes`. A typo or non-canonical name (e.g. CLAUDE.md's example `--scopes session,invoke,event,metadata,admin`, which does not match the resolver's `session:open`/`invoke:read`/etc.) silently creates a key whose scope strings the authorization resolver never checks for — the key is unusable for those RPCs with no error at creation time. **Recommendation:** Validate every requested scope against the `GatewayScopes` catalog at create time in both the CLI parser/runner and `DashboardApiKeyManagementService.ValidateCreateRequest`, rejecting unknown scope strings. **Resolution:** Resolved 2026-05-18. Confirmed against source: `ParseScopes` split unvalidated strings into the create command and `ValidateCreateRequest` checked only key id and display name. Added `GatewayScopes.All` (the canonical scope catalog) and `GatewayScopes.IsKnown(string)`. `ApiKeyAdminCommandLineParser.Parse` now runs `ValidateScopes` for create-key commands and fails the parse listing the unknown scope(s) and valid set; `DashboardApiKeyManagementService.ValidateCreateRequest` rejects requests carrying any non-canonical scope. Revoke/rotate paths are unaffected (no scope input). Regression tests: `ApiKeyAdminCommandLineParserTests.Parse_CreateKeyCommand_RejectsUnknownScope`, `Parse_CreateKeyCommand_AcceptsAllCanonicalScopes`, and `DashboardApiKeyManagementServiceTests.CreateAsync_UnknownScope_DoesNotCallStore`. ### Server-005 | Field | Value | |---|---| | Severity | Medium | | Category | Error handling & resilience | | Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs:22-28`, `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:184` | | Status | Resolved | **Description:** `GalaxyHierarchyCache.RefreshCoreAsync` only catches `SqlException` and `InvalidOperationException`. The initial `cache.RefreshAsync` call in `GalaxyHierarchyRefreshService.ExecuteAsync` is wrapped only for `OperationCanceledException`. A transient non-`SqlException` failure on the first refresh (e.g. a `Win32Exception`/`TimeoutException` from connection establishment, or another `DbException` subtype) escapes both layers, faults the `BackgroundService`, and — with default host behavior — stops the whole gateway. The periodic-tick loop does catch general exceptions, so only the first load is exposed. **Recommendation:** Broaden the `catch` in `RefreshCoreAsync` to all non-cancellation exceptions (record `Unavailable`/`Stale` and still complete `_firstLoad`), or wrap the initial `RefreshAsync` in `GalaxyHierarchyRefreshService` with the same general `catch` the tick loop uses. **Resolution:** Resolved 2026-05-18. Confirmed against source: the initial `RefreshAsync` in `ExecuteAsync` was guarded only for `OperationCanceledException`, and `RefreshCoreAsync` filtered its catch to `SqlException or InvalidOperationException`. Both recommended layers applied: `GalaxyHierarchyRefreshService.ExecuteAsync` now catches every non-cancellation exception on the initial load (logs a warning; the periodic tick retries), and `GalaxyHierarchyCache.RefreshCoreAsync` broadens its catch to all non-cancellation exceptions so the cache still records `Stale`/`Unavailable` and completes `_firstLoad`. The now-unused `Microsoft.Data.SqlClient` using was removed. Regression test: `GalaxyHierarchyRefreshServiceTests.ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFaultBackgroundService`. ### Server-006 | Field | Value | |---|---| | Severity | Medium | | Category | Correctness & logic bugs | | Location | `src/MxGateway.Server/Sessions/SessionManager.cs:84-114` | | Status | Resolved | **Description:** In `OpenSessionAsync`, `_metrics.SessionOpened()` (line 89) increments the `_openSessions` gauge before `TryAutoSubscribeAlarmsAsync` runs. If auto-subscribe throws (which it does when `Alarms.RequireSubscribeOnOpen` is true and the worker rejects the subscription), the `catch` block disposes and removes the session and records `_metrics.Fault(...)` but never calls `SessionClosed`/`SessionRemoved`. The `mxgateway.sessions.open` gauge permanently over-counts by one for every such failed open. **Recommendation:** In the `catch` block, when the session had reached the point where `SessionOpened()` was recorded, also call `_metrics.SessionRemoved()` — or move the `SessionOpened()` call to after auto-subscribe succeeds. **Resolution:** Resolved 2026-05-18. Confirmed against source: the `catch` block in `OpenSessionAsync` recorded `Fault(...)` and removed the session but never decremented the open-session gauge after `SessionOpened()` had run. Added a `sessionOpenedRecorded` flag set immediately after `_metrics.SessionOpened()`; the `catch` block now calls `_metrics.SessionRemoved()` when that flag is set, restoring the gauge for a post-`SessionOpened()` failure (e.g. an auto-subscribe rejection with `RequireSubscribeOnOpen=true`). Regression test: `SessionManagerAlarmAutoSubscribeTests.OpenSessionAsync_DoesNotLeakOpenSessionGauge_WhenAutoSubscribeFailsWithRequireOn`. ### Server-007 | Field | Value | |---|---| | Severity | Low | | Category | Performance & resource management | | Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyProjector.cs:55-70` | | Status | Resolved | **Description:** `Project` always iterates the full `entry.Index.ObjectViews` collection and re-applies all filters to skip `offset` matched items before collecting a page. Paging through a large Galaxy hierarchy is therefore O(total) per page and O(total²/pageSize) end-to-end. The cache is in-memory so impact is bounded, but for large galaxies repeated `DiscoverHierarchy` pagination wastes CPU. **Recommendation:** Precompute and cache the filtered, ordered view list per `(filterSignature, sequence)` so subsequent pages are an O(pageSize) slice; the existing filter signature already keys page tokens. **Resolution:** Resolved 2026-05-18. Confirmed against source: `Project` re-scanned and re-filtered the whole `ObjectViews` list on every page. Added a `ConditionalWeakTable>>` memo in `GalaxyHierarchyProjector`: the first projection of a given filter signature builds the filtered, ordered view list; subsequent pages take an O(pageSize) slice via index arithmetic. The memo is keyed on the immutable cache-entry instance, so when the cache publishes a new entry the stale memo becomes unreachable and is reclaimed with it — no explicit invalidation. `ResolveRoot` still runs before the memo lookup so a missing root surfaces `NotFound` consistently. Regression tests: `GalaxyHierarchyProjectorTests` (`Project_PagedAcrossEntireHierarchy_ReturnsEveryObjectExactlyOnce`, `Project_DistinctFiltersOnSameEntry_DoNotShareMemoizedViewList`, `Project_SameFilterRepeated_ReturnsIdenticalTotals`, `Project_DistinctCacheEntries_ProjectAgainstTheirOwnData`); existing `GalaxyRepositoryGrpcServiceTests` paging tests continue to pass unchanged. ### Server-008 | Field | Value | |---|---| | Severity | Low | | Category | Performance & resource management | | Location | `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:111-134,160-189` | | Status | Resolved | **Description:** `WatchDeployEvents` calls `ResolveBrowseSubtrees()` on every streamed event, and `MapDeployEvent` re-runs `GalaxyHierarchyProjector.Project` over the entire cached hierarchy (and `Sum`s attribute counts) for every event of every constrained subscriber. `GalaxyGlobMatcher.IsMatch` also rebuilds the glob regex on each call. With many constrained subscribers and frequent deploys this is avoidable work. **Recommendation:** Hoist `ResolveBrowseSubtrees()` out of the loop; compute scoped object/attribute counts once per deploy sequence and cache by `(sequence, browseSubtrees)`; cache compiled glob `Regex` instances in `GalaxyGlobMatcher`. **Resolution:** Resolved 2026-05-18. Confirmed against source. Three changes: (1) `WatchDeployEvents` now resolves `ResolveBrowseSubtrees()` once before the streaming loop — the caller's identity and constraints are fixed for the stream lifetime, so per-event resolution was pure waste. (2) `GalaxyGlobMatcher` now caches compiled `Regex` instances in a `ConcurrentDictionary` keyed by glob pattern (with `RegexOptions.Compiled`), so the same handful of globs are translated once instead of on every `IsMatch` call. (3) The per-event `MapDeployEvent` re-projection is no longer a separate hot path: with finding Server-007 resolved, `GalaxyHierarchyProjector.Project` memoizes the filtered view list per `(cache entry, filter signature)`, so the scoped-count projection in `MapDeployEvent` for a constrained subscriber is O(matched-slice) after the first event of a given deploy sequence rather than a full re-scan — this subsumes the recommendation's `(sequence, browseSubtrees)` cache (the memo is keyed on the per-sequence cache-entry instance and the browse-subtree-bearing filter signature). Regression tests: `GalaxyFilterInputSafetyTests.GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect` (glob cache correctness); existing `WatchDeployEvents` and `GalaxyFilterInputSafetyTests` coverage continues to pass. ### Server-009 | Field | Value | |---|---| | Severity | Low | | Category | Error handling & resilience | | Location | `src/MxGateway.Server/Security/Authentication/AuthSqliteConnectionFactory.cs:15-32` | | Status | Resolved | **Description:** Each auth-store operation opens a fresh `SqliteConnection` with no busy timeout, no WAL journal mode, and default journaling. `MarkKeyUsedAsync` runs on every authenticated request and `SqliteApiKeyAuditStore` appends on every denial; under concurrent load these writers can collide and surface `SQLITE_BUSY` as a hard failure on the request path. **Recommendation:** Set `Pooling`, a non-zero `DefaultTimeout`/`busy_timeout`, and enable WAL (`PRAGMA journal_mode=WAL`) once at startup so concurrent readers/writers degrade gracefully. **Resolution:** Resolved 2026-05-18. Confirmed against source: the connection string set only `DataSource` and `Mode`. `AuthSqliteConnectionFactory.CreateConnection` now also sets `Pooling = true` and a non-zero `DefaultTimeout`. A new `OpenConnectionAsync(CancellationToken)` opens the connection and applies `PRAGMA journal_mode=WAL` and `PRAGMA busy_timeout` (5 s); WAL is a persistent database-level setting so re-applying it per connection is a cheap no-op, while `busy_timeout` is per-connection state. All nine auth-store call sites (`SqliteApiKeyAdminStore`, `SqliteApiKeyAuditStore`, `SqliteApiKeyStore`, `SqliteAuthStoreMigrator`) were switched from `CreateConnection()` + `OpenAsync()` to `OpenConnectionAsync()`. `docs/Authentication.md` updated to describe the WAL/busy-timeout behavior. Regression test: `SqliteAuthStoreTests.OpenConnectionAsync_EnablesWalJournalModeAndBusyTimeout`. ### Server-010 | Field | Value | |---|---| | Severity | Low | | Category | Security | | Location | `src/MxGateway.Server/Security/Authentication/SqliteApiKeyAdminStore.cs:91-114`, `src/MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:168-172` | | Status | Resolved | **Description:** `RotateAsync` sets `revoked_utc = NULL`, so rotating a previously revoked key silently reactivates it. This is documented intentional behavior in `docs/Authentication.md:167`, but the dashboard renders the "Rotate" button unconditionally — including for keys whose status badge says "Revoked" — so an operator can un-revoke a deliberately disabled key without an explicit warning. **Recommendation:** Either hide/disable the Rotate action for revoked keys in `ApiKeysPage.razor`, require an explicit confirmation, or have `RotateAsync` preserve `revoked_utc` and add a separate explicit "reactivate" operation. **Resolution:** Resolved 2026-05-18. Confirmed against source: `ApiKeysPage.razor` rendered the Rotate button unconditionally while Revoke was already gated on `key.RevokedUtc is null`. Took the lowest-risk recommended option — the dashboard now renders the Rotate (and Revoke) actions only for keys whose status is `Active`; a revoked key shows a "No actions" placeholder, so an operator cannot un-revoke a deliberately disabled key as a side effect of a rotation. `RotateAsync`'s store-level behavior is unchanged (rotation by `key_id` still clears `revoked_utc`, which the CLI relies on); `docs/Authentication.md` updated to document both the store behavior and the dashboard restriction. No automated test added: the change is pure conditional Razor rendering and the test project has no bUnit component-rendering harness; the underlying `DashboardApiKeyManagementService` is already unit-tested. ### Server-011 | Field | Value | |---|---| | Severity | Low | | Category | Code organization & conventions | | Location | `src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:1-46` | | Status | Resolved | **Description:** `WorkerAlarmRpcDispatcher` deviates from the module's conventions: it fully-qualifies `System.Guid`, `System.ArgumentNullException`, and `System.Threading` types inline instead of relying on `using` directives, and uses an explicit constructor with `this.`-qualified field assignment while the rest of the module (e.g. `ConstraintEnforcer`, `MxAccessGatewayService`, `GalaxyRepositoryGrpcService`) uses primary constructors. `docs/style-guides/CSharpStyleGuide.md` is authoritative for gateway code. **Recommendation:** Add the needed `using` directives, drop the inline fully-qualified names, and convert to a primary constructor for consistency. **Resolution:** Resolved 2026-05-18. Confirmed against source. Converted `WorkerAlarmRpcDispatcher` to a primary constructor with the standard `?? throw new ArgumentNullException(...)` field-initializer guard; dropped the inline `System.Guid` / `System.ArgumentNullException` qualifications (using implicit `using System;`); removed redundant `using System.Collections.Generic;` / `System.Threading` / `System.Threading.Tasks;` directives (covered by `ImplicitUsings`); replaced the two `if (... is null) throw new System.ArgumentNullException(...)` checks with `ArgumentNullException.ThrowIfNull`. The stale class-level ``/`` ("Replaces NotWiredAlarmRpcDispatcher once ... wired in", "partially wired", "returns an Unimplemented diagnostic") were corrected to describe the actual GUID-vs-`Provider!Group.Tag` handling — overlapping with Server-014. No behavior change, so no new test; existing `WorkerAlarmRpcDispatcherTests` continue to pass and the project builds warning-free under `TreatWarningsAsErrors`. ### Server-012 | Field | Value | |---|---| | Severity | Low | | Category | Documentation & comments | | Location | `CLAUDE.md` (Authentication section and `apikey create` example) | | Status | Resolved | **Description:** CLAUDE.md describes scopes as `session`, `invoke`, `event`, `metadata`, `admin` and shows `apikey create --scopes session,invoke,event,metadata,admin`. The actual canonical scope strings (used by `GatewayScopes`, `GatewayGrpcScopeResolver`, and `docs/Authorization.md`) are `session:open`, `session:close`, `invoke:read`, `invoke:write`, `invoke:secure`, `events:read`, `metadata:read`, `admin`. A key created per the CLAUDE.md example carries scopes the resolver never matches. **Recommendation:** Update CLAUDE.md's scope list and the `apikey` example to the canonical `*:*` scope strings, per CLAUDE.md's own rule that docs change with the code. **Resolution:** Resolved 2026-05-18. Confirmed against `GatewayScopes` (`session:open`, `session:close`, `invoke:read`, `invoke:write`, `invoke:secure`, `events:read`, `metadata:read`, `admin`). CLAUDE.md's Build/Test/Run `apikey create` example and the Authentication-section scope list were both updated to the canonical `*:*` strings. (Note: since finding Server-004 was resolved, the old example would now be actively rejected at create time rather than silently creating an unusable key, making the doc correction load-bearing.) Pure documentation change; no test. ### Server-013 | Field | Value | |---|---| | Severity | Low | | Category | Testing coverage | | Location | `src/MxGateway.Tests/Gateway/Dashboard/DashboardAuthorizationHandlerTests.cs`, `src/MxGateway.Tests/Gateway/GatewayApplicationTests.cs` | | Status | Resolved | **Description:** `DashboardAuthorizationHandler` is unit-tested in isolation, but no test exercises the dashboard routes end-to-end to confirm the policy is actually enforced — which is why Server-001 (policy registered but never wired) went uncaught. There are also no tests for `WorkerExecutableValidator` (PE-header architecture parsing), `GalaxyGlobMatcher` (anchoring/escaping/empty-glob fail-open), or `GalaxyHierarchyProjector` pagination/page-token behavior. **Recommendation:** Add a `WebApplicationFactory` integration test that requests a dashboard page unauthenticated and asserts the redirect/401, plus unit tests for `WorkerExecutableValidator`, `GalaxyGlobMatcher`, and projector paging. **Resolution:** Resolved 2026-05-18. Re-triaged against the current test suite: three of the four named gaps were already closed. (1) The dashboard route-level enforcement test exists — `GatewayApplicationTests.Build_WhenDashboardEnabled_ComponentRoutesRequireAuthorization` (and `..._AuthEndpointsAllowAnonymousAccess`), added when Server-001 was fixed. (2) `GalaxyGlobMatcher` anchoring/escaping/empty-glob behavior is covered by `GalaxyFilterInputSafetyTests` (`GlobMatcher_TreatsSqlMetacharactersAsLiterals`, `GlobMatcher_DoesNotTreatLikeWildcardsAsWildcards`, `GlobMatcher_WithPathologicalInput_DoesNotHang`), now extended with `GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect`. (3) Projector pagination/page-token behavior is covered end-to-end by `GalaxyRepositoryGrpcServiceTests` and now directly by the new `GalaxyHierarchyProjectorTests`. The one genuine remaining gap — `WorkerExecutableValidator` PE-header parsing — was closed with the new `WorkerExecutableValidatorTests` (7 cases: matching/mismatched x86 and x64, missing `MZ` header, file too small, missing `PE` signature), exercising the validator against synthesized minimal PE fixtures. ### Server-014 | Field | Value | |---|---| | Severity | Low | | Category | Documentation & comments | | Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:162-171,191-198,206-214,229-237` | | Status | Resolved | **Description:** The XML `` and inline comments on `AcknowledgeAlarm` and `QueryActiveAlarms` describe the alarm path as not yet wired and say `NotWiredAlarmRpcDispatcher` is the default ("Clients calling this method today receive an OK reply with a 'worker alarm path not yet wired' diagnostic", "an empty stream until PR A.2"). In fact `SessionServiceCollectionExtensions.AddGatewaySessions` registers `WorkerAlarmRpcDispatcher` as `IAlarmRpcDispatcher`, so DI always injects the production dispatcher; `NotWiredAlarmRpcDispatcher` is only the null fallback. The comments are stale and misleading. **Recommendation:** Update the `AcknowledgeAlarm`/`QueryActiveAlarms` remarks to reflect that `WorkerAlarmRpcDispatcher` is the wired default, and describe its actual GUID-vs-`Provider!Group.Tag` handling. **Resolution:** Resolved 2026-05-18. Confirmed against source: `SessionServiceCollectionExtensions` registers `WorkerAlarmRpcDispatcher` as `IAlarmRpcDispatcher`, so the "not yet wired" / "empty stream until PR A.2" / "PR A.6/A.7 follow-up" prose in the `AcknowledgeAlarm` and `QueryActiveAlarms` `` and inline comments was stale. Rewrote both `` blocks and both inline comments to state that DI binds the production `WorkerAlarmRpcDispatcher`, that it routes over the worker pipe IPC, and that `AcknowledgeAlarm` handles a canonical-GUID reference (→ `AcknowledgeAlarmCommand`) and a `Provider!Group.Tag` reference (→ `AcknowledgeAlarmByNameCommand`), with `NotWiredAlarmRpcDispatcher` being only the null fallback. The matching stale `WorkerAlarmRpcDispatcher` class-level XML doc was corrected as part of Server-011. Pure documentation/comment change; no test. ### Server-015 | Field | Value | |---|---| | Severity | Medium | | Category | Concurrency & thread safety | | Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:8-15,266-308,720-775` | | Status | Resolved | **Description:** `GatewaySession` guards its mutable state with two different sync primitives. `TransitionTo`, `MarkFaulted`, `TouchClientActivity`, the `State`/`LastClientActivityAt`/`LeaseExpiresAt`/`FinalFault`/`ActiveEventSubscriberCount` getters, `AttachWorkerClient`, and `IsLeaseExpired` all read/write `_state`, `_finalFault`, `_lastClientActivityAt`, `_leaseExpiresAt`, `_workerClient`, and `_activeEventSubscriberCount` under `_syncRoot`. `CloseAsync` (lines 720-775), however, reads `_state` at line 729 and writes `_state` at lines 736 (`SessionState.Closing`) and 761 (`SessionState.Closed`) while only holding the `_closeLock` `SemaphoreSlim` — `_syncRoot` is never acquired. A concurrent `TransitionTo` or `MarkFaulted` from another thread sees `_state` outside the lock that protects it, and the `State` getter is not guaranteed to observe the `Closing`/`Closed` writes promptly. `SemaphoreSlim.WaitAsync`/`Release` do happen to provide memory barriers in practice, but the locking discipline is split across two primitives, which is fragile and defeats the audit value of "all `_state` access is guarded by `_syncRoot`". Concretely, the race between `CloseAsync` setting `_state = Closing` and a concurrent `TransitionTo(Ready)` is unordered — and `TransitionTo` will happily overwrite `Closing` back to `Ready` because its only guard is "do not overwrite `Closed`/`Faulted`". **Recommendation:** Make `CloseAsync` mutate `_state` through the existing `TransitionTo(...)` helper (or acquire `_syncRoot` around the reads/writes) so all `_state` access uses the same lock. Either extend `TransitionTo` to accept the `Closing` and `Closed` transitions (it already handles `Faulted`/`Closed` precedence) or refactor `CloseAsync` to call a private `TrySetClosing()` / `MarkClosed()` that locks `_syncRoot`. Add a regression test that forces a `TransitionTo(Ready)` after `CloseAsync` has set `Closing` and asserts the session does not flip back to `Ready`. **Resolution:** 2026-05-20 — Unified the close path on `_syncRoot`. `GatewaySession.CloseAsync` (`src/MxGateway.Server/Sessions/GatewaySession.cs`) now mutates `_state` only through two private `_syncRoot`-locked helpers — `TryBeginClose` (writes `Closing`, returns the prior `_closeStarted`) and `MarkClosed` (writes `Closed`) — so every `_state` read/write in the session uses the same lock; `_closeLock` keeps its role of serializing concurrent close attempts. `TransitionTo` was tightened to refuse a transition out of `Closing` to anything other than `Closed`/`Faulted` so a late lifecycle callback cannot walk a closing session back to `Ready`. `docs/Sessions.md` updated to describe the unified lock discipline and the extended terminal precedence. Regression tests in `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs`: `TransitionTo_AfterCloseStarted_DoesNotOverwriteClosing` (the named scenario — `BlockingShutdownWorkerClient` parks the close inside `worker.ShutdownAsync` so the test can call `TransitionTo(Ready)` between the `Closing` and `Closed` writes and assert the state stays `Closing`) and `MarkFaulted_AfterCloseCompletes_DoesNotResurrectSession`. ### Server-016 | Field | Value | |---|---| | Severity | Medium | | Category | Error handling & resilience | | Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:790-797`, `src/MxGateway.Server/Sessions/SessionManager.cs:237-258` | | Status | Resolved | **Description:** `GatewaySession.DisposeAsync` synchronously calls `_closeLock.Dispose()` (line 792) without first acquiring the lock and without checking whether a `CloseAsync` is still in flight. The normal call path is `SessionManager.CloseSessionCoreAsync` → `session.CloseAsync(...)` → `RemoveSessionAsync` → `DisposeAsync`, where `DisposeAsync` runs strictly after `CloseAsync` completes. But the `ShutdownAsync` path (`SessionManager.cs:237-258`) and any future caller that disposes a session while another thread is still inside `CloseAsync` will trip `ObjectDisposedException` when the in-flight `CloseAsync` releases the semaphore. The race is narrow today because all `Close`/`Dispose` choreography goes through `SessionManager`, but the class-level contract is broken: nothing on `GatewaySession` documents or enforces "DisposeAsync must not be called concurrently with CloseAsync". **Recommendation:** In `DisposeAsync`, either (a) take and release `_closeLock` once before disposing it, so the dispose is sequenced after any in-flight close, or (b) replace `_closeLock` disposal with a guard flag and let the semaphore be reclaimed by the finalizer. Document the invariant on the public method. Add a regression test that disposes a session whose `CloseAsync` has not yet completed and asserts no `ObjectDisposedException`. **Resolution:** 2026-05-20 — Took recommendation (a): `GatewaySession.DisposeAsync` (`src/MxGateway.Server/Sessions/GatewaySession.cs`) now acquires `_closeLock` once before disposing the semaphore so an in-flight `CloseAsync` finishes (its `_closeLock.Release()`) before the dispose tears the semaphore down. The wait is non-cancellable (`CancellationToken.None`) and `ObjectDisposedException` is swallowed at both the wait and the dispose site so double-dispose still completes cleanly. The method's XML doc was extended with a `` block stating the invariant. Regression tests in `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs`: `DisposeAsync_WhileCloseInFlight_WaitsForCloseAndDoesNotThrow` (parks `CloseAsync` inside the worker shutdown, calls `DisposeAsync` concurrently, releases shutdown, asserts both complete without `ObjectDisposedException` and the worker is disposed exactly once) and `DisposeAsync_CalledTwice_DoesNotThrow`. ### Server-017 | Field | Value | |---|---| | Severity | High | | Category | Security | | Location | `src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:13-27`, `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:173-247`, `docs/Authorization.md:108-110` | | Status | Resolved | **Description:** The two new top-level RPCs added to `MxAccessGateway` — `AcknowledgeAlarm(AcknowledgeAlarmRequest)` and `QueryActiveAlarms(QueryActiveAlarmsRequest)` (proto lines 23-24) — are not enumerated by `GatewayGrpcScopeResolver.ResolveRequiredScope`. The resolver's `request switch` covers `OpenSessionRequest`, `CloseSessionRequest`, `StreamEventsRequest`, `MxCommandRequest`, and the four Galaxy-repository requests; everything else falls through to `_ => GatewayScopes.Admin`. The interceptor (`GatewayGrpcAuthorizationInterceptor.AuthenticateAndAuthorizeAsync`) then rejects any non-admin caller with `PermissionDenied`. This is technically fail-closed (and `docs/Authorization.md:108-110` documents the "unrecognized → admin" intent), but in practice it means: (1) only API keys with the `admin` scope can acknowledge alarms or query active alarms, even though acknowledging is naturally an `invoke:write`-shaped operation and querying is naturally an `invoke:read`- or `metadata:read`-shaped operation; (2) the alarm RPCs ship in a state where any client that successfully opened a session and subscribed to alarm events still cannot perform the operational acks the contract advertises; (3) the test matrix `GatewayGrpcScopeResolverTests` does not even cover these two request types, so the gap was not caught at unit-test time. **Recommendation:** Add explicit arms to `ResolveRequiredScope`: map `AcknowledgeAlarmRequest` to `GatewayScopes.InvokeWrite` (parity with other write actions; ack changes alarm state) and `QueryActiveAlarmsRequest` to `GatewayScopes.MetadataRead` or `GatewayScopes.InvokeRead`. Update `docs/Authorization.md` to list both. Extend `GatewayGrpcScopeResolverTests` with the new mappings and an assertion that every request type defined by `mxaccess_gateway.proto` is named in the resolver (the test can enumerate the assembly's request types so a future RPC cannot quietly add itself only via the admin fallback). **Resolution:** 2026-05-20 — Added explicit `AcknowledgeAlarmRequest => GatewayScopes.InvokeWrite` and `QueryActiveAlarmsRequest => GatewayScopes.EventsRead` arms to `GatewayGrpcScopeResolver.ResolveRequiredScope` (`src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:21-22`). `InvokeWrite` matches the existing `MxCommandKind.Write*` mapping because ack mutates alarm state; `EventsRead` matches `StreamEventsRequest` and `MxCommandKind.DrainEvents` because querying active alarms reads the same alarm/event surface. Extended `GatewayGrpcScopeResolverTests` with two new `InlineData` rows covering both request types (`src/MxGateway.Tests/Security/Authorization/GatewayGrpcScopeResolverTests.cs:16-17`) and added four interceptor-level cases in `GatewayGrpcAuthorizationInterceptorTests` (`UnaryServerHandler_AcknowledgeAlarmMissingScope_ReturnsPermissionDenied`, `UnaryServerHandler_AcknowledgeAlarmWithScope_RunsHandler`, `ServerStreamingServerHandler_QueryActiveAlarmsMissingScope_ReturnsPermissionDenied`, `ServerStreamingServerHandler_QueryActiveAlarmsWithScope_RunsHandler`) proving each new RPC denies callers lacking the chosen scope and runs the handler when the scope is held. Updated `docs/Authorization.md` (resolver snippet and Scope Catalog table) to list both RPCs against their scopes. `dotnet test ... --filter FullyQualifiedName~GatewayGrpcAuthorizationInterceptorTests` → 14 passed, 0 failed; resolver tests 28 passed, 0 failed. ### Server-018 | Field | Value | |---|---| | Severity | Low | | Category | Performance & resource management | | Location | `src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:15` | | Status | Resolved | **Description:** `GalaxyGlobMatcher.RegexCache` is a `ConcurrentDictionary` keyed by glob pattern, with no eviction. The fix for Server-008 added this cache deliberately to avoid recompiling the same handful of patterns, but the cache key is the raw glob string. The patterns currently come from two sources — `DiscoverHierarchyRequest.TagNameGlob` (client-supplied) and `ApiKeyConstraints.BrowseSubtrees` / `ReadSubtrees` / `WriteSubtrees` / `ReadTagGlobs` / `WriteTagGlobs` (admin-configured) — and `BuildRegex` also runs each glob through `Regex.Escape` so an attacker cannot craft a denial-of-service ReDoS payload. The leak is therefore bounded only by "how many distinct globs a client can submit over the process lifetime", which is in the millions for `TagNameGlob` if a client iterates through generated names. Each compiled `Regex` also holds a JIT'd assembly that is non-trivial to reclaim. **Recommendation:** Cap the cache at a small bound (e.g. 256 patterns) using a simple LRU or a `MemoryCache` with sliding expiration, or restrict the cache to globs that originate from API-key constraints (admin-controlled, naturally bounded) and pay the compile cost for client-supplied globs. Add a test that fills the cache with thousands of distinct globs and asserts the cache size stays bounded. **Resolution:** 2026-05-20 — Capped `GalaxyGlobMatcher`'s compiled-regex cache at `RegexCacheCapacity = 256` entries with FIFO-by-insertion eviction (`src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs`). A `ConcurrentQueue` tracks insertion order; when the cache grows past the cap, `EvictIfOverCapacity` takes a small lock and dequeues + removes the oldest entries until the count is back within bound. Reads stay lock-free (the lock guards only the eviction path). Internal `CurrentCacheSize` / `RegexCacheCapacity` accessors are surfaced through the existing `InternalsVisibleTo("MxGateway.Tests")` so tests can assert the bound. Regression test: `GalaxyFilterInputSafetyTests.GlobMatcher_WithManyDistinctPatterns_CacheStaysBounded` submits `RegexCacheCapacity * 4` distinct globs and asserts `CurrentCacheSize` stays in `[0, RegexCacheCapacity]`. Existing glob correctness tests (`GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect`, the adversarial-input theories) continue to pass, confirming eviction does not corrupt lookups. ### Server-019 | Field | Value | |---|---| | Severity | Low | | Category | Correctness & logic bugs | | Location | `src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:183-221` | | Status | Resolved | **Description:** `WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` returns `yield break` (line 191) when `sessionRegistry.TryGet(request.SessionId, ...)` fails — it silently produces an empty stream with no diagnostic. The peer `AcknowledgeAsync` instead returns an `AcknowledgeAlarmReply` with `ProtocolStatus.Code = SessionNotFound` (lines 81-89), so the two methods have inconsistent missing-session handling. In production this branch is unreachable because `MxAccessGatewayService.QueryActiveAlarms` calls `ResolveSession(...)` first and throws `NotFound` from the gRPC layer (`MxAccessGatewayService.cs:228`), but: (a) the dispatcher is the seam other code paths might reach in the future, and (b) any unit test that instantiates the dispatcher directly with a missing session id sees an empty stream rather than a clear error, which is a footgun. **Recommendation:** Either throw a `SessionManagerException(SessionManagerErrorCode.SessionNotFound, ...)` (matching the gRPC service's own resolver) or yield a single `ActiveAlarmSnapshot` with a diagnostic field set, and add a `WorkerAlarmRpcDispatcherTests` case that asserts whichever shape is chosen. Aligning with `AcknowledgeAsync`'s `SessionNotFound` protocol-status pattern is preferred, but `QueryActiveAlarms` is a server-streaming RPC so a thrown `SessionManagerException` propagated by the gateway is the cleaner fit. **Resolution:** 2026-05-20 — Took the preferred option: `WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` (`src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs`) now throws `SessionManagerException(SessionManagerErrorCode.SessionNotFound, ...)` instead of `yield break`-ing when the session is missing. `MxAccessGatewayService.MapException` already maps that error code to gRPC `NotFound`, so production callers see a consistent missing-session response and a direct unit-test caller now gets a clear error instead of an empty success. The unary peer `AcknowledgeAsync` continues to surface the same condition as an in-band `ProtocolStatus.Code = SessionNotFound`, which is correct for a unary RPC. Regression test: `WorkerAlarmRpcDispatcherTests.QueryActiveAlarmsAsync_WhenSessionMissing_ThrowsSessionNotFound` replaces the prior `_YieldsEmpty` assertion — it asserts the new exception shape and also exercises `AcknowledgeAsync` with the same missing session id to pin the peer-method parity. ### Server-020 | Field | Value | |---|---| | Severity | Low | | Category | Code organization & conventions | | Location | `src/MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor:1-2`, `…/GalaxyPage.razor:1-2`, `…/ApiKeysPage.razor:1-2`, `…/EventsPage.razor:1-2`, `…/SessionsPage.razor:1-2`, `…/WorkersPage.razor:1-2`, `…/SettingsPage.razor:1-2`, `…/SessionDetailsPage.razor:1-2` | | Status | Resolved | **Description:** Every dashboard page declares two `@page` directives — `@page "/X"` AND `@page "/dashboard/X"` — even though `DashboardEndpointRouteBuilderExtensions.MapGatewayDashboard` mounts the Razor components under a `RouteGroupBuilder` with `pathBase = "/dashboard"`. The group prefix is prepended to each `@page` route, so the actual endpoints become `/dashboard/X` (from `@page "/X"`) **and** `/dashboard/dashboard/X` (from `@page "/dashboard/X"`). The pages are reachable at two URLs each, and the deeper one (`/dashboard/dashboard/sessions` etc.) is almost certainly accidental — it leaks the path-base name into the URL and creates duplicate authorize/render work per route. `GatewayApplicationTests.Build_WhenDashboardEnabled_ComponentRoutesRequireAuthorization` only checks the `/dashboard/X` shape, so the duplicate route slipped through without an assertion. **Recommendation:** Drop the `@page "/dashboard/X"` directive from each page; rely on the `MapGroup("/dashboard")` to provide the prefix. Or, if the team genuinely wants both URL shapes, document the choice in the file header and extend the route-enumeration test to assert that **both** are present (and both carry the authorization policy). Either way, the current setup is non-obvious. **Resolution:** 2026-05-20 — Took the recommended drop: removed the redundant `@page "/dashboard/X"` directive from every dashboard Razor page (`DashboardHome.razor`, `SessionsPage.razor`, `WorkersPage.razor`, `EventsPage.razor`, `GalaxyPage.razor`, `SettingsPage.razor`, `ApiKeysPage.razor`, `SessionDetailsPage.razor`). Each page now declares only its bare route (e.g. `@page "/sessions"`); `DashboardEndpointRouteBuilderExtensions.MapGatewayDashboard` continues to prepend `/dashboard` via `MapGroup`, so each page is reachable at exactly one URL (`/dashboard/X`). Regression test: `GatewayApplicationTests.Build_WhenDashboardEnabled_DoesNotRegisterDoubledDashboardPrefixRoutes` enumerates the eight previously-doubled routes (`/dashboard/dashboard/`, `/dashboard/dashboard/sessions`, ... `/dashboard/dashboard/sessions/{SessionId}`) and asserts none of them are mapped. The existing `..._MapsBlazorDashboardAndAuthEndpoints` / `..._ComponentRoutesRequireAuthorization` tests continue to verify the desired `/dashboard/X` shapes are still present and policy-gated. No public URL contract changed (the doubled shape was accidental); no doc update needed — `gateway.md` and `docs/GatewayDashboardDesign.md` never referenced the doubled routes. ### Server-021 | Field | Value | |---|---| | Severity | Medium | | Category | Testing coverage | | Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:266-664`, `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs` | | Status | Resolved | **Description:** The 1cd51bb commit history (the bulk read/write series, `f220908`/`5e375f6`/`758aca2`) added 473 lines of constraint-filtering and reply-merging logic to `MxAccessGatewayService`: `ApplyConstraintsAsync` (line 266), `EnforceReadTagAsync` / `EnforceWriteHandleAsync`, `FilterTagBulkAsync` / `FilterReadBulkAsync` / `FilterWriteBulkAsync` / `FilterHandleBulkAsync`, the `ReplaceWriteBulkEntries` switch, and three concrete `BulkConstraintPlan` records (`SubscribeBulkConstraintPlan`, `WriteBulkConstraintPlan`, `ReadBulkConstraintPlan`) that splice denied entries back into the worker's allowed-only reply in original-index order. None of this is covered by `MxAccessGatewayServiceTests` — its `FakeSessionManager` is wired with an `AllowAllConstraintEnforcer` (line 430) that never denies anything, so every constraint-related code path is dead at test time. A subtle off-by-one in `BuildMerged`, a wrong `PayloadOneofCase` in `GetPayload` / `SetPayload`, or a missing case in `ReplaceWriteBulkEntries` would all ship without a test failure. **Recommendation:** Add `MxAccessGatewayServiceTests` cases that inject a deny-on-glob `IConstraintEnforcer` and exercise: (1) `AddItemBulk` / `SubscribeBulk` / `AdviseItemBulk` with a mix of allowed and denied tags, asserting `BulkSubscribeReply.Results` interleaves denied and worker-allowed entries in original-index order; (2) the same for `ReadBulk` and each of the four bulk-write commands; (3) `HasAllowedItems == false` so `CreateDeniedReply` is exercised (no worker call); (4) the unary `Write`/`Write2`/`WriteSecured`/`WriteSecured2` paths through `EnforceWriteHandleAsync`. The fixtures can reuse the existing `FakeSessionManager` by replacing the constraint enforcer; no live worker is needed. **Resolution:** 2026-05-20 — Added a configurable `PredicateConstraintEnforcer` test double (`src/MxGateway.Tests/TestSupport/PredicateConstraintEnforcer.cs`) that denies on per-tag and per-handle predicates and records denials. Added 11 new tests in `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceConstraintTests.cs` covering: (1) `AddItemBulk` with mixed denials — asserts the worker is called once with only the allowed subset and the merged reply interleaves denied and worker-allowed `SubscribeResult`s at their original indices; (2) `SubscribeBulk` with every tag denied — asserts `HasAllowedItems` short-circuits `CreateDeniedReply` and the session manager is never invoked; (3) `AdviseItemBulk` (handle-keyed denial via `CheckReadHandleAsync`); (4) `SubscribeBulk` with the allow-all enforcer — pass-through regression guard; (5) `ReadBulk` partial denial — asserts the `BulkReadConstraintPlan` produces a `BulkReadReply` (not a `BulkSubscribeReply`) with denied entries spliced in at their original indices; (6) `ReadBulk` all-denied short-circuit; (7) `WriteBulk` partial denial — asserts denied entries are dropped from the forwarded `Entries` and the merged reply preserves original-index order; (8) `WriteSecuredBulk` all-denied — proves the second `ReplaceWriteBulkEntries` switch arm is reachable; (9) unary `Write` with denied handle → `PermissionDenied`, no worker call, denial recorded; (10) unary `WriteSecured` with denied handle → `PermissionDenied`; (11) unary `AddItem` with denied tag → `PermissionDenied` (`EnforceReadTagAsync`). `MxAccessGatewayServiceTests.CreateService` updated to accept an `IConstraintEnforcer` so future tests can opt into the deny enforcer without duplicating the wiring. All 11 new tests pass; full suite (`dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj`) is green at 458 passing. ### Server-022 | Field | Value | |---|---| | Severity | Low | | Category | Documentation & comments | | Location | `src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs:8-29` | | Status | Resolved | **Description:** Server-014's resolution noted that the stale "PR A.6 / A.7" / "not yet wired" language was rewritten on `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms` and on the `WorkerAlarmRpcDispatcher` class doc. The corresponding XML doc on the **interface** `IAlarmRpcDispatcher` (lines 8-29) still says it is "PR A.6 / A.7 — gateway-side dispatcher" and that "Production implementations live in `WorkerAlarmRpcDispatcher` (this PR ships a not-yet-wired default that returns a clear worker-pending diagnostic)". That second clause directly contradicts the now-correct comments on the concrete implementations and on the gRPC service: `WorkerAlarmRpcDispatcher` is the wired default, not a not-yet-wired one. A reader who finds the interface first will believe the dispatcher is non-functional. **Recommendation:** Rewrite the `IAlarmRpcDispatcher` `` block to match the language now used on `WorkerAlarmRpcDispatcher` and on the gRPC service: DI binds `WorkerAlarmRpcDispatcher` by default; `NotWiredAlarmRpcDispatcher` is only the null fallback for tests/DI omission. Drop the "PR A.6 / A.7" prefix from the `` — the interface is now the public alarm-RPC seam. **Resolution:** 2026-05-20 — Rewrote `IAlarmRpcDispatcher`'s `` and `` (`src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs`) to match the language now used on `WorkerAlarmRpcDispatcher` and on `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms`: dropped the stale "PR A.6 / A.7" prefix from the summary, and replaced the "this PR ships a not-yet-wired default that returns a clear worker-pending diagnostic" clause with the correct statement that DI binds the production `WorkerAlarmRpcDispatcher` by default and `NotWiredAlarmRpcDispatcher` is only the null fallback for DI omission / standalone tests. Pure documentation change; no test. ### Server-023 | Field | Value | |---|---| | Severity | Low | | Category | Documentation & comments | | Location | `src/MxGateway.Server/Sessions/NotWiredAlarmRpcDispatcher.cs:10-26` | | Status | Resolved | **Description:** Server-014 and Server-022 swept the stale "PR A.6 / A.7" / "not-yet-wired" / "worker-pending" language off `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms`, `WorkerAlarmRpcDispatcher`, and `IAlarmRpcDispatcher`. The concrete `NotWiredAlarmRpcDispatcher` class XML doc was not updated as part of either fix and still reads: *"PR A.6 / A.7 — default `IAlarmRpcDispatcher` shipped while the worker-side AlarmClient event subscription is gated on dev-rig validation"* and *"When the worker dispatcher (PR A.6/A.7 dev-rig follow-up) lands, `WorkerAlarmRpcDispatcher` replaces this implementation in the DI container"*. That is the exact prose the other sweeps removed, and it directly contradicts the now-current narrative everywhere else: `SessionServiceCollectionExtensions.AddGatewaySessions` registers `WorkerAlarmRpcDispatcher` as the default `IAlarmRpcDispatcher`; `NotWiredAlarmRpcDispatcher` is only the null fallback used when no dispatcher is registered (DI omission / standalone tests). The diagnostic string returned by `AcknowledgeAsync` (line 39) — `"the worker-side AlarmClient consumer (PR A.5) is in place but the dispatcher hookup is gated on validating the AVEVA alarm-provider event subscription on the dev rig"` — is also stale; the dispatcher hookup landed and any client that actually sees that diagnostic today is hitting the null-fallback path, not the dev-rig gate it describes. **Recommendation:** Replace the `` and `` on `NotWiredAlarmRpcDispatcher` with text that matches the language now used on the interface and `WorkerAlarmRpcDispatcher` — "null fallback `IAlarmRpcDispatcher` used when no dispatcher is registered (DI omission / standalone tests); production wires `WorkerAlarmRpcDispatcher`." Either drop the `AcknowledgeAsync` diagnostic string's dev-rig framing entirely or shorten it to "alarm dispatcher is not registered." `#pragma warning disable CS1998` on `QueryActiveAlarmsAsync` is correct here (empty stream is intentional for the null fallback) and should stay. **Resolution:** 2026-05-20 — Rewrote `NotWiredAlarmRpcDispatcher` summary/remarks as the null-fallback dispatcher and shortened the `AcknowledgeAsync` diagnostic to "Alarm dispatcher is not registered."; updated the two tests that asserted the old "worker"-prefixed diagnostic. ### Server-024 | Field | Value | |---|---| | Severity | Low | | Category | Correctness & logic bugs | | Location | `src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:56-77` | | Status | Resolved | **Description:** `GetOrCreateRegex`'s race-loser branch reads `RegexCache[glob]` with an indexer (line 76) after `TryAdd` returned `false`. The indexer throws `KeyNotFoundException` if the key is missing. Under the new bounded cache (Server-018), there is a real — if narrow — race where the key vanishes between the failing `TryAdd` and the indexer read: thread A and thread B both compile a `Regex` for `glob`; A's `TryAdd` succeeds, A enqueues + enters `EvictIfOverCapacity`, the eviction loop dequeues `glob` (because some other thread had already enqueued + evicted enough that `glob` is now the oldest entry) and removes it; thread B's `TryAdd` then returns false, B reads `RegexCache[glob]`, and the indexer throws. The window is tiny but nonzero — eviction is approximate FIFO, and a hot pattern that is repeatedly re-added near the cap is the natural trigger. The same pre-Server-018 code used `GetOrAdd`, which had no such race because the dictionary handled the rebuild atomically. **Recommendation:** Replace the `TryAdd` + indexer pair with `RegexCache.GetOrAdd(glob, _ => compiled)` so the dictionary atomically returns whichever instance won. Track the new insertion only when `GetOrAdd` returns the locally-compiled instance (`ReferenceEquals(result, compiled)`), then enqueue + evict. Alternatively, swap the trailing indexer read for `TryGetValue` + recursive recompile on miss. Add a stress test that mixes repeated reads of a single hot pattern with a flood of unique patterns near the cap and asserts no exception escapes `IsMatch`. **Resolution:** 2026-05-20 — Replaced the `TryAdd` + indexer pair with `RegexCache.GetOrAdd(glob, compiled)`; FIFO enqueue + eviction now run only when `ReferenceEquals(result, compiled)` (i.e. our caller was the inserter), eliminating the post-eviction `KeyNotFoundException` window. ### Server-025 | Field | Value | |---|---| | Severity | Low | | Category | Code organization & conventions | | Location | `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:19-25`, `src/MxGateway.Server/Galaxy/IGalaxyRepository.cs` | | Status | Resolved | **Description:** The Tests-016 fix introduced `IGalaxyRepository` so `GalaxyHierarchyCache` could be unit-tested against an in-memory fake, and `GalaxyHierarchyCache` was updated to depend on the interface. `GalaxyRepositoryGrpcService` was not updated and still receives the concrete `GalaxyDb.GalaxyRepository` via its primary constructor. Functionally this is fine — DI registers the concrete singleton and a thin `sp.GetRequiredService()` forwarder for the interface — but the seam is now half-applied: a future caller that wants to test or stub the gRPC service's `TestConnection` path has to construct a real `GalaxyRepository` against a SQL connection string, defeating the abstraction `IGalaxyRepository` was introduced for. The pattern also creates an inconsistency for new readers — two consumers in the same namespace, one on the interface and one on the concrete. **Recommendation:** Change `GalaxyRepositoryGrpcService`'s `repository` parameter to `IGalaxyRepository`. No DI change is needed (both forwarders already resolve to the same singleton). Optionally drop the concrete singleton registration and register the interface directly. **Resolution:** 2026-05-20 — Changed `GalaxyRepositoryGrpcService`'s `repository` primary-constructor parameter from the concrete `GalaxyRepository` to `IGalaxyRepository`; existing DI registration in `GalaxyRepositoryServiceCollectionExtensions` already resolves both the concrete and interface to the same singleton. ### Server-026 | Field | Value | |---|---| | Severity | Low | | Category | Error handling & resilience | | Location | `src/MxGateway.Server/Configuration/GatewayOptionsValidator.cs:17-32`, `src/MxGateway.Server/Configuration/AlarmsOptions.cs` | | Status | Resolved | **Description:** `GatewayOptions.Alarms` is bound from `MxGateway:Alarms` and consumed by `SessionManager.TryAutoSubscribeAlarmsAsync` (per-session SubscribeAlarms on Ready). `GatewayOptionsValidator.Validate` validates every other section (`Authentication`, `Ldap`, `Worker`, `Sessions`, `Events`, `Dashboard`, `Protocol`) but has no `ValidateAlarms` arm — `AlarmsOptions` is silently accepted regardless of contents. The runtime mitigates this by logging a warning when `Enabled = true` but neither `SubscriptionExpression` nor `DefaultArea` is set, then either faulting open-session (`RequireSubscribeOnOpen = true`) or skipping auto-subscribe — a configuration error therefore surfaces per-session at runtime instead of at startup. Other sections fail-fast at `ValidateOnStart()`, so the inconsistency makes alarm misconfiguration discoverable only after a client hits the gateway. A misformatted `SubscriptionExpression` (no `\\\Galaxy!` shape) likewise passes validation; the worker rejects it later. **Recommendation:** Add a `ValidateAlarms(options.Alarms, failures)` arm in `GatewayOptionsValidator`. When `Enabled = true`, require either a non-blank `SubscriptionExpression` or a non-blank `DefaultArea`; when `SubscriptionExpression` is provided, sanity-check that it starts with `\\` (the AVEVA UNC subscription shape) — or document that the shape is left to the worker to validate. Either way, treat the configuration as part of the validated surface. **Resolution:** 2026-05-20 — Added `ValidateAlarms` to `GatewayOptionsValidator`: when `Enabled = true`, requires a non-blank `SubscriptionExpression` or `DefaultArea`, and when `SubscriptionExpression` is provided, requires it to start with `\\` (canonical UNC subscription shape). Alarm misconfiguration now fails fast at startup instead of per-session. ### Server-027 | Field | Value | |---|---| | Severity | Low | | Category | Design-document adherence | | Location | `docs/Authorization.md:120-141,176-181` | | Status | Resolved | **Description:** Two parts of `docs/Authorization.md` drifted from `GatewayGrpcScopeResolver.ResolveCommandScope` and from `MxAccessGatewayService.ApplyConstraintsAsync` over the bulk-read/bulk-write series (`f220908`/`5e375f6`/`758aca2`) and were not updated by the Server-017 / Server-021 fixes: 1. The `ResolveCommandScope` code snippet at lines 120-141 still shows only `Write`/`Write2` against `InvokeWrite` and `WriteSecured`/`WriteSecured2`/`AuthenticateUser` against `InvokeSecure`. The actual resolver also maps `MxCommandKind.WriteBulk`, `MxCommandKind.Write2Bulk`, `MxCommandKind.WriteSecuredBulk`, and `MxCommandKind.WriteSecured2Bulk`. A reader believing the snippet would conclude the bulk-write families inherit the fail-closed admin scope, when in fact they correctly map to `InvokeWrite` / `InvokeSecure` (the Scope Catalog table at lines 199-200 lists them). 2. The Constraint Enforcement section (lines 176-181) says: *"The service checks read constraints for `AddItem`, `AddItem2`, `AddItemBulk`, `SubscribeBulk`, and `AdviseItemBulk`. It checks write constraints for `Write`, `Write2`, `WriteSecured`, and `WriteSecured2`."* The actual `ApplyConstraintsAsync` switch also enforces constraints for `ReadBulk` (read scope), `WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` (write scope, per-entry filtering with index-order merge). Server-021 added test coverage for all of these without touching the doc. **Recommendation:** Update the `ResolveCommandScope` snippet to include the four bulk-write arms. Update the Constraint Enforcement prose to enumerate the bulk read/write commands that are actually filtered, and reference the per-entry index-ordered merge that `BulkConstraintPlan.MergeDeniedInto` performs. Adding `ReadBulk` to the `InvokeRead` row of the Scope Catalog would also be useful — the table currently lists `Register`/`AddItem`/`Advise` against `InvokeRead` but not `ReadBulk`. **Resolution:** 2026-05-20 — Updated the `ResolveCommandScope` snippet in `docs/Authorization.md` to enumerate the four bulk-write arms (`WriteBulk`/`Write2Bulk` against `InvokeWrite`, `WriteSecuredBulk`/`WriteSecured2Bulk` against `InvokeSecure`); expanded the Constraint Enforcement prose to list `ReadBulk` and all four bulk-write commands and to call out `BulkConstraintPlan.MergeDeniedInto`'s index-ordered merge; added `ReadBulk` to the `InvokeRead` row of the Scope Catalog. ### Server-028 | Field | Value | |---|---| | Severity | Low | | Category | Testing coverage | | Location | `src/MxGateway.Tests/Security/Authorization/GatewayGrpcScopeResolverTests.cs:13-20`, `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs` | | Status | Resolved | **Description:** Two narrow test gaps were not closed by Server-017 / Server-015: 1. `GatewayGrpcScopeResolverTests.ResolveRequiredScope_KnownRpcRequest_ReturnsExpectedScope` enumerates `OpenSessionRequest`, `CloseSessionRequest`, `StreamEventsRequest`, `AcknowledgeAlarmRequest`, `QueryActiveAlarmsRequest`, `TestConnectionRequest`, `GetLastDeployTimeRequest`, and `DiscoverHierarchyRequest`. `WatchDeployEventsRequest` is missing even though it is named in the resolver's metadata-read arm and listed in the Scope Catalog. Similarly, the `ResolveRequiredScope_InvokeCommand_ReturnsExpectedScope` matrix covers every other write/secure/bulk command but omits `MxCommandKind.ReadBulk`, which is the only bulk family that falls into the `_ => GatewayScopes.InvokeRead` default arm. A regression that drops `WatchDeployEvents` from the request switch or that adds a new mapping for `ReadBulk` would not be caught. 2. `GatewaySessionTests` (added under Server-015 / Server-016) covers the `TransitionTo(Ready)` and `MarkFaulted(post-Close)` cases but does not cover the third edge that Server-015's tightened state machine permits: `MarkFaulted` issued while `CloseAsync` is parked between `TryBeginClose` (Closing) and `MarkClosed` (Closed). The current `MarkFaulted` (`GatewaySession.cs:314-326`) checks only for `Closed`, so it overwrites `Closing` → `Faulted`; the subsequent `MarkClosed` then overwrites `Faulted` → `Closed` while `_finalFault` is preserved. The behaviour is consistent with the docs ("Closing only allows a transition to Closed or Faulted") but the test bundle does not pin it, and a future tightening of `MarkFaulted` could silently regress. **Recommendation:** Extend `GatewayGrpcScopeResolverTests.ResolveRequiredScope_KnownRpcRequest_ReturnsExpectedScope` with `[InlineData(typeof(WatchDeployEventsRequest), GatewayScopes.MetadataRead)]` and extend the command theory with `[InlineData(MxCommandKind.ReadBulk, GatewayScopes.InvokeRead)]`. Add a `GatewaySessionTests.MarkFaulted_DuringInFlightClose_PreservesFaultButYieldsToClose` case using `BlockingShutdownWorkerClient` to park `CloseAsync`, call `MarkFaulted` while parked, release the worker, and assert `State == Closed && FinalFault == ""`. **Resolution:** 2026-05-20 — Added `[InlineData(typeof(WatchDeployEventsRequest), GatewayScopes.MetadataRead)]` to `GatewayGrpcScopeResolverTests.ResolveRequiredScope_KnownRpcRequest_ReturnsExpectedScope` (the `ReadBulk` arm was already present); added `GatewaySessionTests.MarkFaulted_DuringInFlightClose_PreservesFaultButYieldsToClose` covering the parked-close + `MarkFaulted` interleave and asserting the post-release state is `Closed` with `FinalFault = "concurrent-fault"`. ### Server-029 | Field | Value | |---|---| | Severity | Low | | Category | Documentation & comments | | Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:52-58` | | Status | Resolved | **Description:** `OpenSession` advertises capabilities the gateway supports so clients can branch on them. The current list is `unary-open-session`, `unary-close-session`, `unary-invoke`, `server-stream-events`, `bulk-subscribe-commands`, `unary-acknowledge-alarm`, `server-stream-active-alarms`. The `bulk-subscribe-commands` token was added for the `AddItemBulk` / `AdviseItemBulk` / `RemoveItemBulk` / `UnAdviseItemBulk` / `SubscribeBulk` / `UnsubscribeBulk` family. The subsequent `ReadBulk` and `WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` families landed without a corresponding capability token — the contract advertises bulk-subscribe support but is silent on bulk-read and bulk-write. A defensive client that gates on `bulk-write-commands` before issuing a `WriteBulk` has no signal that the family is supported; current clients sidestep this by ignoring the list entirely, but that just shifts the failure mode (an old client against a new server, or vice versa, will see `Unimplemented` instead of a structured `Capabilities` mismatch). **Recommendation:** Either (a) extend the advertised list with `bulk-read-command` and `bulk-write-commands` (`WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` collectively), or (b) document in `gateway.md` and `docs/Contracts.md` that `Capabilities` is informational only and not the contract version. Option (a) is the simplest forward-compatible fix and keeps the capability token shape clients are already familiar with. **Resolution:** 2026-05-20 — Extended the `OpenSession` capabilities list with `bulk-read-commands` and `bulk-write-commands` alongside the existing `bulk-subscribe-commands` token, so clients that gate on capability strings have an explicit signal for the bulk-read and bulk-write families. ### Server-030 | Field | Value | |---|---| | Severity | Medium | | Category | Error handling & resilience | | Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:952-980` | | Status | Resolved | **Description:** Surfaced during the 2026-05-20 cross-language e2e run against a redeployed gateway (`a020350`). The Java client got 55 of 120 `AddItem` calls in, then `Advise` returned `Session session-de7728a290bd41028ad6fec81e233144 is not ready. Current state is Ready.` — a self-contradictory diagnostic. The check in `GetReadyWorkerClient` (`GatewaySession.cs:956`) is `_state != SessionState.Ready || _workerClient?.State != WorkerClientState.Ready`, but the formatted message only includes `_state`. When the gateway-side session state is `Ready` but the worker client's own `WorkerClientState` has transitioned (heartbeat watchdog firing, pipe disconnect detected by the read loop, etc.) before the session-level reaction observes it, the in-flight RPC fails fast here — and the operator sees a message that doesn't tell them which side of the gate the failure is on. The two-state gap itself is a real race (the worker-side state can shift independently of the gateway-driven session state) but a clear diagnostic is the prerequisite for diagnosing it; without it, a future investigation will start from "it says Ready but it's not Ready" instead of "the worker is Handshaking / Closing / Faulted while the session is still Ready". **Recommendation:** Format both states into the exception message — `Session {SessionId} is not ready. Session state is {_state}; worker state is {workerClientState}.` (or `""` when `_workerClient` is null). Document on the method that the two states can diverge under load and that this branch is the fail-fast for that case. Add a regression test that flips `FakeWorkerClient.State` to a non-Ready value (e.g. `Handshaking`) while the session is `Ready` and asserts both pieces of state appear in the thrown `SessionManagerException.Message`. The deeper race investigation (should the gateway briefly wait for worker-Ready before failing? when does `WorkerClient.State` legitimately shift while the session is still `Ready`?) is out of scope for this finding but is worth a follow-up. **Resolution:** 2026-05-20 — Rewrote `GetReadyWorkerClient` so the `SessionManagerException` message includes both `_state` and `_workerClient.State` (or `""` for the null case): `"Session {SessionId} is not ready. Session state is {_state}; worker state is {workerState}."`. Added XML doc on the method explaining the two-state contract and that this branch is the fail-fast for a state-divergence race. Added regression test `SessionManagerTests.InvokeAsync_WhenWorkerNotReadyButSessionReady_DiagnosticIncludesBothStates` that sets `FakeWorkerClient.State = WorkerClientState.Handshaking` while the session is `Ready` and asserts both `"Session state is Ready"` and `"worker state is Handshaking"` appear in the message; the test also pins `InvokeCount == 0` so the worker isn't called. The deeper race (should `GetReadyWorkerClient` retry briefly when state has just diverged?) remains open for follow-up. ### Server-031 | Field | Value | |---|---| | Severity | Medium | | Category | Concurrency & thread safety | | Location | `src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClient.cs:392-443` (gateway-side heartbeat watchdog); `src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClientOptions.cs:14-67` (new `HeartbeatStuckCeiling` option) | | Status | Resolved | **Description:** Surfaced during the 2026-05-20 cross-language e2e re-run against gateway `b794c46`. The .NET phase succeeded through `open-session`/`register`/`bulk-subscribe`/`bulk-read`/`bulk-unsubscribe`/`stream-events`/`write` but then failed on its third `advise` call with the Server-030 diagnostic `Session ... is not ready. Session state is Ready; worker state is Faulted.` The gateway stdout log records the underlying cause: **`Worker client faulted for session session-01a1a07fa59c489983a719821fa46e72: Worker heartbeat expired. Last heartbeat was at 2026-05-20T17:20:39.+00:00.`** — a real 15s+ gap with no `WorkerHeartbeat` envelope arriving from the worker. Investigation paths: 1. **Shared `_writeLock` on the worker side.** `WorkerFrameWriter` serializes every pipe write (heartbeats, command replies, events, faults) through a single `SemaphoreSlim _writeLock` (`WorkerFrameWriter.cs:14`, `:67-76`). `RunEventDrainLoopAsync` (`WorkerPipeSession.cs:336-372`) writes events one at a time inside a `foreach`, each call to `_writer.WriteAsync` re-acquiring `_writeLock`. If the gateway-side read drains slowly and the OS-level named-pipe buffer fills, `_stream.WriteAsync` (`WorkerFrameWriter.cs:70`) blocks. The event-drain loop blocks holding the lock. `RunHeartbeatLoopAsync` (`WorkerPipeSession.cs:611-613`) then can't acquire `_writeLock` to send its 5s heartbeat. Heartbeats stall past the gateway's `HeartbeatGrace` (15s default) and `WorkerClient.HeartbeatLoopAsync` faults the session. 2. **No prioritization between heartbeats and events.** Even without OS-level back-pressure, a backlog of events in the worker's `MxAccessEventQueue` (drained in batches of `EventDrainBatchSize`) can keep the writer lock held for many milliseconds at a time. Heartbeats can be delayed (though normally not past `HeartbeatGrace` unless something else is wrong). 3. **Gateway-side heartbeat watchdog ignores in-flight commands.** `WorkerClient.HeartbeatLoopAsync` (`WorkerClient.cs:392-422`) checks only `_state == Ready` and `now - lastHeartbeatAt > HeartbeatGrace`. It does not check whether a command is in flight on the gateway↔worker pipe. The mirror of Worker-017's fix (worker-side watchdog skips `StaHung` while a command is in flight) does not exist on the gateway side. The .NET test pattern stresses the issue uniquely because each `dotnet run --project` rebuild between subcommands introduces multi-second client-side gaps; the worker's heartbeat path should still be alive (heartbeats are emitted by `RunHeartbeatLoopAsync` independently of gateway activity), but if the gateway is also blocked draining events from the channel into a non-existent `StreamEvents` consumer, the back-pressure-into-heartbeat chain bites first. **Recommendation:** Two changes worth landing together: 1. **Decouple heartbeat writes from the event/reply lock.** Either (a) give heartbeats their own pipe `Stream` (likely impractical — one pipe per session), (b) introduce a priority queue in front of `WorkerFrameWriter` so heartbeats hop the line, or (c) interleave heartbeat checks inside `RunEventDrainLoopAsync` (e.g., after each event-batch write, post a heartbeat envelope if one is due). Option (c) is the smallest change. 2. **Mirror Worker-017's "skip-while-command-in-flight" guard on the gateway side.** In `WorkerClient.HeartbeatLoopAsync`, when `_pendingCommands.Count > 0` and the oldest pending command is younger than some ceiling (e.g., 5× `HeartbeatGrace`), skip the fault. The worker may be busy executing a slow STA command and the heartbeat write may be queued behind a long event burst — neither indicates the worker is actually hung. Add a regression test that floods the worker's outbound event channel (e.g., via a high-rate STA fixture or a mock event source emitting at > 1000 events/s for several seconds) and asserts the worker is not faulted while the gateway has no `StreamEvents` consumer attached. **Resolution:** 2026-05-24 — Re-triaged at HEAD `d2d2e5f`: the gateway-side "skip-while-command-in-flight" guard (recommendation #2) is already implemented and verified against source. `WorkerClient.HeartbeatLoopAsync` (`src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClient.cs:403-443`) now skips the `HeartbeatExpired` fault while `TryGetOldestPendingCommandAge` reports an in-flight command younger than `WorkerClientOptions.HeartbeatStuckCeiling` (default 75s = 5× `HeartbeatGrace`). Once the oldest pending command exceeds the ceiling the watchdog fires anyway, so a truly stuck COM call doesn't hide the worker forever. The new `HeartbeatStuckCeiling` option is documented inline with a back-reference to Worker-023, the worker-side mirror. Regression tests in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs`: `HeartbeatMonitor_WhenCommandInFlightWithinCeiling_DoesNotFaultOnExpiredHeartbeat` (the named scenario — parks an unanswered `InvokeAsync` past `HeartbeatGrace` but well within `HeartbeatStuckCeiling` and asserts the client stays `Ready`) and `HeartbeatMonitor_WhenPendingCommandExceedsStuckCeiling_FaultsClient` (advances past the ceiling and asserts the watchdog still fires). Recommendation #1 (decoupling the worker-side `_writeLock`) is the worker module's concern and is tracked by Worker-017 / Worker-023 — out of scope for the Server module here. ### Server-032 | Field | Value | |---|---| | Severity | Medium | | Category | Error handling & resilience | | Location | `src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClient.cs:510-569` (gateway-side `_events` channel); `src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClientOptions.cs:45-53` (`EventChannelFullModeTimeout`) | | Status | Resolved | **Description:** Surfaced during the 2026-05-20 cross-language e2e re-run against gateway `b794c46`. The Java phase advised ~55 items (`item-handle 63`) before failing on the next `advise` call with the Server-030 diagnostic `Session ... is not ready. Session state is Ready; worker state is Faulted.`. The gateway stdout log records: **`Worker client faulted for session session-adfcc808da974808947e87db060c2b03: Worker event channel rejected an event.`** — the gateway-side per-session bounded event channel filled up and `Channel.Writer.TryWrite` returned `false`, triggering the fail-fast path in `EnqueueWorkerEventAsync` (`WorkerClient.cs:467-484`). The channel is configured as `Channel.CreateBounded(new BoundedChannelOptions(EventChannelCapacity) { ... FullMode = BoundedChannelFullMode.Wait ... })` (capacity defaults to `EventOptions.QueueCapacity = 10_000`). But `EnqueueWorkerEventAsync` uses **`TryWrite`** (non-blocking), so the configured `Wait` mode is moot — the writer always fails fast when full. This is consistent with `docs/DesignDecisions.md`'s "fail-fast event backpressure" policy (one subscriber per session, no producer-side queuing beyond the channel), but two facts make it sharp in practice: 1. The e2e flow (and any realistic client) `advise`s many items BEFORE opening a long-running `StreamEvents` consumer. With no consumer, events accumulate at the in-rate (driven by the SCADA tags' change frequency). For `TestMachine_*.TestChangingInt` × ~55 advised items, the rig can fill 10,000 in well under a minute. 2. The fail-fast threshold is "exactly at capacity." There is no overflow grace window. A momentary lull on the consumer side that lasts long enough for one extra event to arrive after the channel is full results in worker fault and session teardown. This is design-as-intended in the v1 sense, but it surfaces a behavioral contract that is **not currently documented**: clients must open `StreamEvents` BEFORE issuing `advise` against high-rate tags, or pace their `advise` calls below the (non-published) accumulation budget. None of the current docs (`gateway.md`, `docs/DesignDecisions.md`, the client READMEs) enforce or surface this requirement, and four of the five client CLIs (`go`, `python`, `rust`, `java`) hit it gracelessly in `scripts/run-client-e2e-tests.ps1`. The diagnostic `"Worker event channel rejected an event."` also does not name the actual channel (it says "Worker event channel" but the channel is gateway-owned), the current depth, or the capacity — only that it overflowed. Operators can't tell whether the threshold needs lifting or whether the consumer is genuinely missing. **Recommendation:** Three escalating options, pick at least the first and consider one of the others: 1. **Document the contract.** In `gateway.md` and `docs/DesignDecisions.md`, state explicitly that `advise` produces events into the gateway-side per-session channel and that a `StreamEvents` consumer must be attached to drain it. Add the bound (`MxGateway:Events:QueueCapacity`, default 10,000) and the fault behavior (the worker is faulted; the session ends). Update `clients/*/README.md` to call out the requirement in the "advise" / "subscribe" sections. 2. **Improve the diagnostic.** Format the channel depth and capacity into the fault message: `"Worker event channel rejected an event after {capacity} unconsumed events accumulated. Attach a StreamEvents consumer or increase MxGateway:Events:QueueCapacity."` 3. **Add an overflow grace window.** Instead of fail-fast on the first `TryWrite == false`, count overflow events and only fault if N consecutive overflows happen within T ms (or, equivalently, switch to `WriteAsync` with a short timeout). This trades a tiny memory bump for resilience to consumer hiccups. Out of scope if v1 explicitly chose fail-fast for parity reasons — but worth raising for v2. Add a regression test that advises N items without an active `StreamEvents` consumer, lets the channel fill, and asserts the produced fault message contains the channel-depth diagnostic (#2) — gated so that #3 is not required. **Resolution:** 2026-05-24 — Re-triaged at HEAD `d2d2e5f`: recommendation #2 (improved diagnostic) is already implemented and verified against source. `WorkerClient.EnqueueWorkerEventAsync` (`src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClient.cs:525-569`) now (a) attempts `TryWrite` first for the fast path, (b) on full-channel falls through to `WriteAsync` with a linked `CancellationTokenSource` cancelled after `WorkerClientOptions.EventChannelFullModeTimeout` (default 5s), so a transient consumer hiccup is absorbed instead of fail-fast on the first overflow event, and (c) on real overflow records `QueueOverflow("worker-events")` and faults with the rich diagnostic message naming the wait timeout, the current channel depth, the channel capacity, and the actionable remediation ("Attach a StreamEvents consumer or raise MxGateway:Events:QueueCapacity."). Regression test: `WorkerClientTests.EnqueueWorkerEvent_WhenChannelFullPastTimeout_FaultsWithRichDiagnostic` (`src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:473-521`) fills a 4-slot channel + one overflow, asserts the worker is faulted, then drains the propagated `WorkerClientException` and pins the diagnostic string contains "Worker event channel rejected", "of 4 capacity", "StreamEvents", and "MxGateway:Events:QueueCapacity". Recommendation #1 (the prose contract in `gateway.md` / `docs/DesignDecisions.md` / client READMEs) is out of scope for this pass — the prompt restricts edits to `src/ZB.MOM.WW.MxGateway.Server/**`, `src/ZB.MOM.WW.MxGateway.Tests/**`, and this findings file; the documentation update needs to land in a follow-up that has docs access. Recommendation #3 (overflow grace window) was already implemented in spirit by the `WriteAsync` + timeout switch — the channel now absorbs a transient burst up to the configured wait timeout, satisfying #3's "consumer hiccup resilience" goal without requiring a separate counter. ### Server-033 | Field | Value | |---|---| | Severity | Medium | | Category | Error handling & resilience | | Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:265-323` (`TryRestoreFromDiskAsync`), `:84-99` (`_firstLoad` / `WaitForFirstLoadAsync`); `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:141-163` (`WaitForCacheBootstrap`) | | Status | Resolved | **Description:** `TryRestoreFromDiskAsync` populates `_current` with the on-disk snapshot (status `Stale`, `HasData == true`) but never completes the `_firstLoad` `TaskCompletionSource` — only the live-query paths (cheap / heavy / catch) in `RefreshCoreAsync` do. A `DiscoverHierarchy` or `GetLastDeployTime` call that arrives after gateway start but before the first refresh tick finishes sees `cache.Current` as `Empty` (status `Unknown`) when `WaitForCacheBootstrap` runs its initial check, so it falls through to `await WaitForFirstLoadAsync` with a 5-second budget. Restore then completes within milliseconds and makes the data available, but `_firstLoad` stays pending until the live query returns or fails. When the Galaxy database is unreachable — the exact scenario the snapshot feature exists for — the SQL connect attempt outlasts the 5s budget, so the caller waits the full 5 seconds before the budget elapses and the handler falls through to read the (already-restored) data. The result is correct, but the first browse calls after a cold offline start incur a needless ~5s latency, undercutting the feature's purpose. **Recommendation:** Call `_firstLoad.TrySetResult()` at the end of `TryRestoreFromDiskAsync` once the restored entry is published — restored data is a valid completed first load. Add a regression test: a cache with a throwing repository plus a populated snapshot store should have `WaitForFirstLoadAsync` complete promptly after `RefreshAsync`, not block on the live query. **Resolution:** Resolved in `bdccdbf` (2026-05-22): `TryRestoreFromDiskAsync` calls `_firstLoad.TrySetResult()` immediately after publishing the restored entry, so a restored snapshot satisfies the bootstrap gate without waiting on the live query. New test `GalaxyHierarchyCacheTests.RefreshAsync_RestoredSnapshotCompletesFirstLoadBeforeLiveQueryReturns` blocks the repository's deploy-time query and asserts `WaitForFirstLoadAsync` still completes from the snapshot. ### Server-034 | Field | Value | |---|---| | Severity | Low | | Category | Error handling & resilience | | Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchySnapshotStore.cs:87-115` (`TryLoadAsync`) | | Status | Resolved | **Description:** `TryLoadAsync` carries the `Try` prefix and its XML doc says it returns `null` "when none exists, persistence is disabled, or the on-disk file uses an unrecognized schema version." But a corrupt or partially written JSON file makes `JsonSerializer.DeserializeAsync` throw `JsonException`, and an unreadable file (locked, denied ACL) throws `IOException` / `UnauthorizedAccessException` — none of which the method catches. End-to-end behavior is still safe because the sole caller, `GalaxyHierarchyCache.TryRestoreFromDiskAsync`, wraps the call in a `catch (Exception)`; but the store's own `Try`-prefixed contract is violated, and any future caller would be surprised by the throw. **Recommendation:** Catch `JsonException` and `IOException` (the latter covers the `UnauthorizedAccessException` family) inside `TryLoadAsync`, log a warning, and return `null` — consistent with the unrecognized-schema-version branch already present and with the `Try` naming. A corrupt cache file is an expected failure mode for a disk cache. **Resolution:** Resolved in `bdccdbf` (2026-05-22): `TryLoadAsync` now has a `catch (Exception) when (exception is JsonException or IOException or UnauthorizedAccessException)` that logs a warning and returns `null`. New test `GalaxyHierarchySnapshotStoreTests.TryLoadAsync_WhenFileIsCorruptJson_ReturnsNull`. ### Server-035 | Field | Value | |---|---| | Severity | Low | | Category | Performance & resource management | | Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:176` (call site), `:327-352` (`PersistSnapshotAsync`) | | Status | Resolved | **Description:** After a heavy refresh, `RefreshCoreAsync` `await`s `PersistSnapshotAsync` while still holding `_refreshGate`, and the `SaveAsync` write has no timeout. The only caller of `RefreshAsync` is the sequential `GalaxyHierarchyRefreshService` loop, so a write that hangs — e.g. a `SnapshotCachePath` pointed at an unresponsive network share — blocks the gate and stalls all subsequent cache refreshes until gateway shutdown. Impact is bounded: clients keep being served the last entry (which flips to `Stale` after the 5-minute threshold), so this is a degradation rather than an outage, and the default `C:\ProgramData` path is local disk where a hang is unlikely. **Recommendation:** Bound the snapshot write with a timeout — a linked `CancellationTokenSource` cancelling after, say, the SQL `CommandTimeoutSeconds` budget — so a stuck write fails fast and logs rather than pinning the refresh loop. Moving the write off the gate is an alternative but would need its own write-serialization. **Resolution:** Resolved in `bdccdbf` (2026-05-22): `SaveAsync` wraps the write in a `CancellationTokenSource.CreateLinkedTokenSource(cancellationToken)` cancelled after `Math.Max(1, CommandTimeoutSeconds)` seconds, so a stuck write fails fast instead of pinning the refresh loop. The timeout-expiry path itself is not unit-tested — exercising it would require a genuinely hanging filesystem. ### Server-036 | Field | Value | |---|---| | Severity | Low | | Category | Error handling & resilience | | Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:345-348` (`PersistSnapshotAsync` catch) | | Status | Resolved | **Description:** `PersistSnapshotAsync` passes the refresh `CancellationToken` to `SaveAsync` and catches every exception — including the `OperationCanceledException` thrown when that token is cancelled at gateway shutdown — in its general `catch (Exception)`, logging it as `Warning: "Failed to persist the Galaxy hierarchy snapshot to disk."`. A snapshot write interrupted by a normal shutdown is not a failure, but it surfaces as a misleading warning every time the gateway stops mid-write. **Recommendation:** Let a cancellation-driven `OperationCanceledException` pass without the warning — e.g. add `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { }` before the general catch — matching the cancellation handling already used in `RefreshCoreAsync` and `TryRestoreFromDiskAsync`. **Resolution:** Resolved in `bdccdbf` (2026-05-22): `PersistSnapshotAsync` has a `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)` ahead of the general catch, so a save aborted by gateway shutdown is silent while a genuine failure (including a write timeout) still logs. New test `GalaxyHierarchyCacheTests.RefreshAsync_WhenSnapshotSaveCancelledAtShutdown_DoesNotLogPersistFailure`. ### Server-037 | Field | Value | |---|---| | Severity | Low | | Category | Testing coverage | | Location | `src/MxGateway.Tests/Galaxy/GalaxyHierarchySnapshotStoreTests.cs`, `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs` | | Status | Resolved | **Description:** The new snapshot tests cover the round-trip, missing-file, persistence-disabled, unrecognized-schema, and overwrite cases for the store, and the persist / restore-when-unreachable / promote-on-matching-deploy cases for the cache. Two resilience paths are untested: (1) `GalaxyHierarchyCache.TryRestoreFromDiskAsync`'s `catch` path when the snapshot file is corrupt — the cache must come up `Unavailable` rather than throwing; (2) the cache restore path when `PersistSnapshot = false` (the store yields `null` and the cache stays `Unavailable`). Both are the failure modes most likely to matter operationally. **Recommendation:** Add a cache test that writes a corrupt snapshot file and asserts `RefreshAsync` with an unreachable repository leaves the cache `Unavailable` without throwing, and a test that confirms a `PersistSnapshot = false` store neither restores nor persists. If Server-034 is fixed, the corrupt-file test also pins the store's null-return. **Resolution:** Resolved in `bdccdbf` (2026-05-22): added `GalaxyHierarchyCacheTests.RefreshAsync_WhenSnapshotFileCorrupt_ComesUpUnavailableWithoutThrowing` and `RefreshAsync_WhenPersistDisabled_DoesNotRestoreFromDisk`, plus the `TryLoadAsync_WhenFileIsCorruptJson_ReturnsNull` store test added for Server-034. ### Server-038 | Field | Value | |---|---| | Severity | Medium | | Category | Security | | Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs:23-44` | | Status | Resolved | **Description:** `EventsHub` is gated by `[Authorize(Policy = DashboardAuthenticationDefaults.HubClientsPolicy)]`, which checks only that the caller carries a dashboard role (Admin or Viewer). `SubscribeSession(sessionId)` accepts any non-empty session id and joins the caller to `session:{id}`. A Viewer who knows or guesses a session id can therefore subscribe to any session's MxEvent stream once `DashboardEventBroadcaster` is broadcasting (which it now is, per `d692232`). The per-session ACL that gates the gRPC `StreamEvents` RPC is not replicated. **Recommendation:** Before the EventsHub is exercised by Admin-only sessions or session-scoped Viewer roles, gate `SubscribeSession` on a session-access check — either via a per-session role check in the hub method itself, or by storing a per-user allowed-session-id set in the connection's `Context.Items` at connect time and rejecting subscribes outside that set. The current dashboard surfaces only a per-page Session Details view that the page can prove it's authorized for, but as soon as a Viewer role exists the gap matters. **Resolution:** 2026-05-24 — Documented the v1 acceptance per the prompt's "practical fix for v1" direction. Added a detailed `` block to `EventsHub.SubscribeSession` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs`) stating that (a) in v1 the hub-level `HubClientsPolicy` only requires one of the dashboard roles (Admin or Viewer) and both may subscribe to any session id, (b) this is acceptable today because the dashboard's per-session views show non-secret session metadata any authenticated user can already see and value logging is gated by the same redaction policy, and (c) the per-session ACL that gates the gRPC `StreamEvents` RPC is intentionally not yet mirrored here. Added an explicit `TODO(per-session-acl)` describing the future enforcement seam — once a role/scope is introduced that scopes a Viewer to a specific session or tenant, add a session-access check at this method (inline on `Context.User` claims/`Context.Items`, or via a dedicated authorization policy applied to the hub method). No code-behavior change in this pass; the per-session ACL data model design is out of scope for the resolution window. No new regression test (the change is documentation-only). ### Server-039 | Field | Value | |---|---| | Severity | Low | | Category | Error handling & resilience | | Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs:37-58` | | Status | Resolved | **Description:** `HubTokenService.Validate` deserializes the protected JSON payload and trusts `payload.Roles` even when `payload.Name` and `payload.NameIdentifier` are both `null`. The resulting `ClaimsPrincipal` has the `MxGateway.Dashboard.HubToken` scheme as its `AuthenticationType` and the role claims, but no identity claims. `Identity?.IsAuthenticated` returns `true` because the auth type is non-empty, so the principal satisfies `IsAuthenticated` checks and `IsInRole` checks even though it has no caller identity. A token forged from a corrupted data-protection store could pass authorization without an associated user. **Recommendation:** Mark `HubTokenPayload.Name` and `HubTokenPayload.NameIdentifier` as required (e.g. with `[JsonRequired]` once the project standardizes the JSON binder, or by validating non-null explicitly after deserialization) and reject the token if either is missing. Alternatively, document on `IDashboardAuthorizationHandler` consumers that they must check `Identity?.Name` is non-null before honoring role claims from this scheme. **Resolution:** 2026-05-24 — `HubTokenService.Validate` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs`) now rejects a deserialized payload where both `Name` and `NameIdentifier` are null/empty — returning `null` rather than emitting a principal with role claims but no caller identity. The check sits immediately after the protector unprotect and the null-payload guard, with a comment back-referencing Server-039. Either field is sufficient (a token minted with only a `NameIdentifier` still validates), matching the existing `Issue` path where the cookie principal may carry just one of them. New test file `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/HubTokenServiceTests.cs` with six tests: `Validate_TokenWithNullNameAndNullNameIdentifier_ReturnsNull` (the named regression — confirmed to fail before the fix and pass after), `Validate_TokenWithName_ReturnsAuthenticatedPrincipal`, `Validate_TokenWithOnlyNameIdentifier_ReturnsPrincipal`, plus null/empty/garbage-token sanity checks. Verified by passing tests. ### Server-040 | Field | Value | |---|---| | Severity | Low | | Category | Code organization & conventions | | Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardAuthenticator.cs:140-160` (`MapGroupsToRoles`) | | Status | Resolved | **Description:** `MapGroupsToRoles` checks each LDAP group against the role map twice — first by the full group string, then by `ExtractFirstRdnValue(group)` — and `TryGetValue` short-circuits on the first hit. The precedence ("full match wins over RDN match") is correct because the map's key set is operator-controlled and matches should resolve deterministically, but the lookup ordering is not documented. A future maintainer reading the code can't tell whether "fall through to RDN" is intentional or a leftover from refactoring `IsMemberOfRequiredGroup`. **Recommendation:** Add a one-line comment above the loop explaining the precedence: full DN/CN literal first, leading-RDN fallback second. Mention the case-insensitive map comparer (`OrdinalIgnoreCase`) so the next reader doesn't ask why `"GwAdmin"` matches `"gwadmin"`. **Resolution:** 2026-05-24 — Added a precedence comment block above the lookup in `MapGroupsToRoles` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardAuthenticator.cs:156-163`) explaining that the full literal group string is tried first and the leading-RDN value (e.g. `GwAdmin` extracted from `ou=GwAdmin,ou=groups,...`) is the fallback, and back-referencing `DashboardOptions.GroupToRole` as the source of the `OrdinalIgnoreCase` comparer so a maintainer sees why `"GwAdmin"` matches `"gwadmin"`. No code change — existing `DashboardAuthenticatorTests.MapGroupsToRoles_ResolvesByShortNameAndDistinguishedName` already pins both the full-match and RDN-fallback paths and the case-insensitive lookup; pure documentation-only resolution, no new test. ### Server-041 | Field | Value | |---|---| | Severity | Low | | Category | Design-document adherence | | Location | `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:123-126`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/IDashboardEventBroadcaster.cs:6-10` | | Status | Resolved | **Description:** `IDashboardEventBroadcaster.Publish` is documented as "Implementations must never throw — broadcast failures are best-effort and must not disrupt the source gRPC stream." `EventStreamService` honors that contract by passing the call through without a try/catch. The current `DashboardEventBroadcaster` implementation observes the `SendAsync` task's continuation but does not raise synchronously, so the seam is safe today. A future implementation that adds synchronous validation or a serializer hop could throw, faulting the producer loop and ending the gRPC stream. **Recommendation:** Either wrap the `Publish` call in a `try/catch (Exception ex)` that logs at debug and continues (matching the `DashboardSnapshotPublisher` pattern), or add a code-review checklist note enforcing the never-throw contract on implementations. The wrap is safer because it doesn't depend on convention. **Resolution:** 2026-05-24 — Took the safer wrap. `EventStreamService.ProduceEventsAsync` (`src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:123-137`) now wraps the `dashboardEventBroadcaster.Publish(...)` call in a `try / catch (Exception ex)` that logs at debug and continues. The producer loop and the gRPC stream are no longer at the mercy of the broadcaster's never-throw discipline — a future implementation that adds synchronous validation or a serializer hop cannot fault the stream. Regression test in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs`: `StreamEventsAsync_WhenDashboardBroadcasterThrows_StillYieldsEventsAndDoesNotFaultSession` (new) injects a `ThrowingDashboardEventBroadcaster` test double that throws `InvalidOperationException` on every `Publish`, then asserts (a) the gRPC stream still yields both events in order, (b) the broadcaster's `Publish` is attempted for every event (so the catch is exercised per-event rather than aborting the loop), and (c) the session does not transition to `Faulted`. Confirmed to fail before the fix (the producer loop surfaced the simulated `InvalidOperationException`) and pass after. Verified by passing tests. ### Server-042 | Field | Value | |---|---| | Severity | Low | | Category | Performance & resource management | | Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardSnapshotPublisher.cs:18-41` | | Status | Resolved | **Description:** `DashboardSnapshotPublisher.ExecuteAsync` reads from `IDashboardSnapshotService.WatchSnapshotsAsync` inside an outer `try` that catches `OperationCanceledException` only. A failure inside `WatchSnapshotsAsync` (e.g. the snapshot service throws after a transient SQL failure for the Galaxy summary projection) escapes the outer try and ends the BackgroundService — no automatic reconnect. The sibling `AlarmsHubPublisher` (lines 55-61) wraps its `StreamAsync` consumer in a 5-second reconnect loop with `catch (Exception ex)` and continues. The snapshot publisher should follow the same shape. **Recommendation:** Wrap the `await foreach` in a `while (!stoppingToken.IsCancellationRequested)` loop with a `catch (Exception ex)` plus a 5-second `Task.Delay`, mirroring `AlarmsHubPublisher`. Today's snapshot service rarely throws on the watch path, but a one-time logger-init failure or transient `IGatewayConfigurationProvider` exception would silently take the dashboard offline. **Resolution:** 2026-05-24 — Mirrored `AlarmsHubPublisher`'s reconnect loop. `DashboardSnapshotPublisher.ExecuteAsync` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardSnapshotPublisher.cs`) now wraps the `await foreach` in a `while (!stoppingToken.IsCancellationRequested)` loop and catches general exceptions with a logged warning + `Task.Delay(reconnectDelay, stoppingToken)`. The 5-second `DefaultReconnectDelay` is preserved for production via the public constructor; an `internal` overload injects a shorter delay for the regression test (with `[InternalsVisibleTo("ZB.MOM.WW.MxGateway.Tests")]` already in place). Also tightened cancellation handling: the inner `OperationCanceledException` `return`s instead of merely catching, so a normal shutdown exits cleanly rather than re-looping on the cancelled token. New test file `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs` with two cases: `ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` (the named regression — `ThrowOnceThenYieldSnapshotService` throws on the first `WatchSnapshotsAsync` call and yields a snapshot on the second; the publisher must reconnect, broadcast the snapshot, and the gap between throw and reconnect must respect the configured delay) and `ExecuteAsync_WhenSnapshotServiceCompletes_ReconnectsAfterDelay` (sanity case: a normal `yield break` also triggers the reconnect loop). Confirmed both tests fail against the original single-try implementation (the BackgroundService exits and `SubscribeCount` stays at 1) and pass after the fix. Verified by passing tests. ### Server-043 | Field | Value | |---|---| | Severity | Low | | Category | Documentation & comments | | Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs:1`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardServiceCollectionExtensions.cs:24` | | Status | Resolved | **Description:** `HubTokenService` is registered as a singleton (good — data protection providers are thread-safe and a single protector instance is correct) and shared by both `DashboardHubConnectionFactory` (per-circuit scoped, mints fresh tokens from the cookie principal) and `HubTokenAuthenticationHandler` (per-request transient, validates inbound tokens). The class-level docs describe what the service does but not that it is intentionally a singleton with two consumer scopes, so a future maintainer rewriting the DI registration may pick the wrong lifetime. **Recommendation:** Add a `` block to `HubTokenService` noting "Registered as a singleton in `AddGatewayDashboard`; the underlying `ITimeLimitedDataProtector` is thread-safe and shared across hub-token issuance and validation." Optionally add a comment near the DI registration explaining the lifetime contract. **Resolution:** 2026-05-24 — Added a `` block to `HubTokenService` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs`) documenting that the service is registered as a singleton in `DashboardServiceCollectionExtensions.AddGatewayDashboard` and is shared by two consumer scopes — `DashboardHubConnectionFactory` (scoped, per-circuit; calls `Issue` from the cookie-authenticated dashboard) and `HubTokenAuthenticationHandler` (transient, per-request; calls `Validate` from the SignalR negotiate / connection path). Notes that the underlying `ITimeLimitedDataProtector` is thread-safe so concurrent mint/validate from any number of callers is safe, and explicitly asks future maintainers to preserve the singleton lifetime to keep the protector instance stable. Pure documentation change; no test.