327e9c5f94
Server-031: re-triaged. The recommended gateway-side "skip-while-command-in-flight" guard is already in place at WorkerClient.HeartbeatLoopAsync via WorkerClientOptions.HeartbeatStuckCeiling (default 75s = 5× HeartbeatGrace). Two regression tests pin the behaviour. Recommendation #1 (decouple worker-side _writeLock) is a Worker-module concern (Worker-017 / Worker-023) and out of scope here.

Server-032: re-triaged. Recommendation #2 (rich diagnostic) is already in EnqueueWorkerEventAsync, with #3 (overflow grace) absorbed by the TryWrite → WriteAsync-with-timeout fall-through. Test EnqueueWorkerEvent_WhenChannelFullPastTimeout_FaultsWithRichDiagnostic pins the diagnostic string. Recommendation #1 (prose contract in gateway.md / docs) is deferred — outside this pass's edit scope.

Server-038 (Security): EventsHub.SubscribeSession's missing per-session ACL is documented with a TODO(per-session-acl) and a <remarks> block explaining the v1 acceptance (any dashboard role can subscribe to any session — non-secret metadata, redacted value logging). The per-session ACL design lands in a follow-up once a session-scoped role exists.

Server-039 (Error handling): HubTokenService.Validate now rejects a deserialized payload where both Name and NameIdentifier are null/empty. New test file HubTokenServiceTests.cs covers the regression and five sanity cases. TDD confirmed.

Server-040 (Conventions): MapGroupsToRoles gains a precedence comment explaining "full literal match first, leading-RDN fallback; OrdinalIgnoreCase via DashboardOptions.GroupToRole". Documentation-only.

Server-041 (Design adherence): EventStreamService.ProduceEventsAsync wraps the broadcaster.Publish call in try/catch (Exception). The producer loop and gRPC stream are no longer at the mercy of the broadcaster's never-throw discipline. New regression test StreamEventsAsync_WhenDashboardBroadcasterThrows_StillYieldsEventsAndDoesNotFaultSession.

Server-042 (Performance): DashboardSnapshotPublisher.ExecuteAsync now mirrors AlarmsHubPublisher's reconnect loop — wraps the await foreach in a while-not-cancelled, catches general exceptions, and Task.Delays 5s before retrying. An internal ctor accepts a shorter delay for the test. New test file DashboardSnapshotPublisherTests.cs covers the throw-then-yield reconnect path and the normal-completion case.

Server-043 (Documentation): HubTokenService class XML doc gains a <remarks> describing the singleton lifetime, the two consumer scopes (DashboardHubConnectionFactory scoped, HubTokenAuthenticationHandler transient), and the thread-safety contract.

Verification: dotnet build src/ZB.MOM.WW.MxGateway.slnx clean (0 warnings / 0 errors); src/ZB.MOM.WW.MxGateway.Tests 486/486 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Code Review — Server

| Field | Value |
|---|---|
| Module | `src/ZB.MOM.WW.MxGateway.Server` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Open findings | 0 |

## Checklist coverage

### 2026-05-20 review (commit 1cd51bb)

This row summarizes the 2026-05-20 review pass at commit `1cd51bb`. Findings from
prior passes (Server-001 through Server-014) are all closed and remain below as
audit history.

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Server-019 (`WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` yields silently when session is missing). |
| 2 | mxaccessgw conventions | No issues found — convention drift previously called out is resolved; no new gaps observed. |
| 3 | Concurrency & thread safety | Issues found: Server-015 (`GatewaySession._state` is written under `_closeLock` but read/written elsewhere under `_syncRoot`). |
| 4 | Error handling & resilience | Issues found: Server-016 (`GatewaySession.DisposeAsync` disposes the close-lock semaphore while it may be held). |
| 5 | Security | Issues found: Server-017 (`AcknowledgeAlarm` / `QueryActiveAlarms` fall through to admin-only scope because the resolver was not updated for the new alarm RPCs). |
| 6 | Performance & resource management | Issues found: Server-018 (`GalaxyGlobMatcher` regex cache is unbounded — currently low-risk but uncapped). |
| 7 | Design-document adherence | No issues found at this pass. |
| 8 | Code organization & conventions | Issues found: Server-020 (dashboard pages each declare two `@page` directives — `@page "/X"` AND `@page "/dashboard/X"` — producing duplicate routes under the `/dashboard` group prefix). |
| 9 | Testing coverage | Issues found: Server-021 (`MxAccessGatewayService.ApplyConstraintsAsync` and the new `BulkConstraintPlan` / `ReadBulkConstraintPlan` / `WriteBulkConstraintPlan` / `SubscribeBulkConstraintPlan` merge logic is entirely untested). |
| 10 | Documentation & comments | Issues found: Server-022 (`IAlarmRpcDispatcher` XML doc still describes the dispatcher as "ships a not-yet-wired default"; stale after Server-014). |

### 2026-05-20 review (commit a020350)

Re-review pass at `a020350` — the cross-module sweep that resolved Server-015 through Server-022. Verified each fix is sound (lock discipline now uniform on `_syncRoot`; `DisposeAsync` gates on `_closeLock`; alarm RPCs map to `InvokeWrite`/`EventsRead`; glob cache is bounded; alarm dispatcher SessionNotFound flows through `MxAccessGatewayService.MapException` → gRPC `NotFound`; dashboard pages emit a single `@page`; 11 new `MxAccessGatewayServiceConstraintTests` cover the bulk-constraint plans). New findings filed against this pass.

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Server-024 (`GalaxyGlobMatcher.GetOrCreateRegex` indexer access after `TryAdd` fails can throw `KeyNotFoundException` under contention near the cap). |
| 2 | mxaccessgw conventions | No issues found. |
| 3 | Concurrency & thread safety | No new issues found — Server-015/016 fixes verified sound. |
| 4 | Error handling & resilience | Issues found: Server-026 (`AlarmsOptions` is bound but not validated by `GatewayOptionsValidator`). |
| 5 | Security | No issues found — Server-017 mapping (`InvokeWrite` / `EventsRead`) is defensible and exercised by both resolver and interceptor tests. |
| 6 | Performance & resource management | No issues found — Server-018 cap is in place and tested. |
| 7 | Design-document adherence | Issues found: Server-027 (`docs/Authorization.md` `ResolveCommandScope` code snippet and Constraint Enforcement section omit the bulk read/write command families). |
| 8 | Code organization & conventions | Issues found: Server-025 (`GalaxyRepositoryGrpcService` still consumes the concrete `GalaxyRepository` after `IGalaxyRepository` was introduced for testability — inconsistent with `GalaxyHierarchyCache`). |
| 9 | Testing coverage | Issues found: Server-028 (`GatewayGrpcScopeResolverTests` does not exercise `WatchDeployEventsRequest` or `MxCommandKind.ReadBulk`; no `GatewaySessionTests` case asserts a `MarkFaulted` during in-flight Close). |
| 10 | Documentation & comments | Issues found: Server-023 (`NotWiredAlarmRpcDispatcher` class XML doc still says "PR A.6/A.7 — default … shipped while the worker-side AlarmClient event subscription is gated on dev-rig validation"; contradicts the cleanup that Server-014/Server-022 applied to the interface, gateway service, and `WorkerAlarmRpcDispatcher`). Issues found: Server-029 (`OpenSession` capability list advertises `bulk-subscribe-commands` but not the now-shipping bulk-read or bulk-write families — clients that gate on capability strings have no signal that those families exist). |

### 2026-05-22 review (commit fa491c7)

Re-review pass at `fa491c7`, scoped to the Galaxy hierarchy snapshot-persistence
change: the new `GalaxyHierarchySnapshot`, `IGalaxyHierarchySnapshotStore` /
`GalaxyHierarchySnapshotStore`, the restore / persist paths added to
`GalaxyHierarchyCache`, the two new `GalaxyRepositoryOptions`, and the
`docs/GalaxyRepository.md` / `docs/GatewayConfiguration.md` updates. Prior
findings (Server-001 through Server-032) are unchanged by this pass.

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found — restore/save sequencing and the shared `BuildEntry` materialization are sound. |
| 2 | mxaccessgw conventions | No issues found — file-scoped namespaces, `sealed`, `Async` suffixes, Options pattern, and XML docs all conform; the snapshot persists Galaxy metadata (names/types), not tag values or secrets. |
| 3 | Concurrency & thread safety | No issues found — `_restoreAttempted` and `_current` are touched only under `_refreshGate`; `_current` is published via `Volatile.Write`; the store serializes its file I/O on a private `SemaphoreSlim`. |
| 4 | Error handling & resilience | Issues found: Server-033 (restore never completes `_firstLoad`, so a cold-start browse waits the full 5s bootstrap budget), Server-034 (`TryLoadAsync` throws on a corrupt file despite the `Try` prefix), Server-036 (a save cancelled at shutdown logs a misleading warning). |
| 5 | Security | No issues found — the snapshot holds non-secret Galaxy metadata, is written under `C:\ProgramData\MxGateway` alongside the auth DB, and restored rows flow the same materialization path as live SQL with no injection surface. |
| 6 | Performance & resource management | Issues found: Server-035 (the snapshot write is awaited on the refresh critical path under `_refreshGate` with no timeout). |
| 7 | Design-document adherence | No issues found — `docs/GalaxyRepository.md` and `docs/GatewayConfiguration.md` were updated in the same commit; `docs/DesignDecisions.md` already defers to `GalaxyRepository.md` as the Galaxy authority. |
| 8 | Code organization & conventions | No issues found — the new options live on `GalaxyRepositoryOptions`, the store is a registered singleton, and the on-disk envelope (`PersistedFile`) is a private nested record. |
| 9 | Testing coverage | Issues found: Server-037 (no test for the corrupt-snapshot restore path or for `PersistSnapshot = false` at the cache level). |
| 10 | Documentation & comments | No issues found — XML docs match behavior; the `GalaxyRepository.md` "On-disk snapshot" section documents the Stale-on-restore lifecycle. |

### 2026-05-24 review (commit d692232)

Re-review pass at `d692232` scoped to the dashboard refactor wave: the
`ZB.MOM.WW` project rename (`dc9c0c9`), the `QueryActiveAlarms` public RPC
implementation (`397d3c5`), the LDAP role-mapping + HubToken bearer auth
(`27ed651`), the sidebar layout + three SignalR push hubs (`6594359`), and the
EventsHub broadcaster + doc refresh (`d692232`). Server-031 and Server-032
remain open and untouched — neither the gateway-side `_writeLock` heartbeat
contention nor the bounded `_events` channel saw any changes in this wave.

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found in the fa491c7..d692232 diff. |
| 2 | mxaccessgw conventions | No issues found — rename hygiene clean, external runtime identifiers (`MeterName`, `MxGateway.Dashboard` scheme, `MxGateway.Request` logger, `MxGateway.Worker.STA` thread name) intentionally unprefixed per commit message. |
| 3 | Concurrency & thread safety | No issues found — Server-031 (`_writeLock` heartbeat watchdog contention) remains open and unchanged. |
| 4 | Error handling & resilience | Issues found: Server-039 (`HubTokenService.Validate` accepts a payload with null Name/NameIdentifier), Server-041 (`EventStreamService` calls the broadcaster without a try/catch — fragile seam), Server-042 (`DashboardSnapshotPublisher` tight retry loop with no backoff vs `AlarmsHubPublisher` 5-second delay). |
| 5 | Security | Issues found: Server-038 (`EventsHub.SubscribeSession` accepts any session id from any Viewer; no per-session ACL). |
| 6 | Performance & resource management | Issues found: Server-042 (`DashboardSnapshotPublisher` lacks reconnect backoff). |
| 7 | Design-document adherence | Issues found: Server-041 (broadcaster's never-throw contract documented in the interface but not enforced by the caller). |
| 8 | Code organization & conventions | Issues found: Server-040 (undocumented lookup-order precedence in `MapGroupsToRoles`), Server-043 (singleton sharing of `HubTokenService` undocumented). |
| 9 | Testing coverage | No issues found in this module — see Tests-026 in the Tests module for the missing EventsHub broadcast coverage. |
| 10 | Documentation & comments | Issues found: Server-040, Server-043 (both documentation gaps). |

## Findings

### Server-001

| Field | Value |
|---|---|
| Severity | Critical |
| Category | Security |
| Location | `src/MxGateway.Server/GatewayApplication.cs:147-149`, `src/MxGateway.Server/Dashboard/DashboardEndpointRouteBuilderExtensions.cs:55-58`, `src/MxGateway.Server/Dashboard/Components/Routes.razor:1-15` |
| Status | Resolved |

**Description:** The dashboard authorization policy (`DashboardAuthenticationDefaults.AuthorizationPolicy`), `DashboardAuthorizationRequirement`, and `DashboardAuthorizationHandler` are registered in DI but never applied to any endpoint. `MapRazorComponents<App>()` has no `.RequireAuthorization(...)`, the `<Router>` in `Routes.razor` uses plain `RouteView` (not `AuthorizeRouteView`), and no dashboard page carries `[Authorize]` — a module-wide grep finds zero `RequireAuthorization`/`[Authorize]`/`AuthorizeRouteView` usages. Every dashboard page (Sessions, Workers, Events, Galaxy, Settings, and the API Keys list exposing key IDs, scopes, and constraints) is reachable by any unauthenticated remote client regardless of `Dashboard:AllowAnonymousLocalhost` or `Dashboard:RequireAdminScope`. Only the API-key mutation operations remain protected, via the separate `DashboardApiKeyManagementService.CanManage` check.

**Recommendation:** Apply the policy at the route level — `endpoints.MapRazorComponents<App>().AddInteractiveServerRenderMode().RequireAuthorization(DashboardAuthenticationDefaults.AuthorizationPolicy)` — and/or switch `Routes.razor` to `AuthorizeRouteView` with a `[Authorize]` fallback policy plus a `NotAuthorized` redirect to the login page. Add an integration test that GETs a dashboard page anonymously and asserts 302-to-login / 401.

**Resolution:** Resolved in `a8aafdf` (2026-05-18): `MapRazorComponents<App>()` now calls `.RequireAuthorization(DashboardAuthenticationDefaults.AuthorizationPolicy)`, so an unauthenticated request to any dashboard component route is challenged by the cookie scheme and redirected to the login page. `GatewayApplicationTests` gained `ComponentRoutesRequireAuthorization` (component routes carry the policy) and `AuthEndpointsAllowAnonymousAccess`, replacing the prior test that asserted the insecure behavior.

### Server-002

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/MxGateway.Server/Program.cs:24`, `src/MxGateway.Server/GatewayApplication.cs` |
| Status | Resolved |

**Description:** `gateway.md:583` and CLAUDE.md state the first version "terminates orphaned workers on startup." No code in MxGateway.Server enumerates or kills leftover `MxGateway.Worker.exe` processes at startup — a grep for `orphan`/`reattach`/`terminate` finds nothing. After an unclean gateway crash, x86 worker processes (each holding an MXAccess COM instance) leak and survive indefinitely, and a restarted gateway does not reclaim or kill them.

**Recommendation:** Add a startup hosted service that finds and kills stale worker processes (by executable path / a well-known argument or environment marker) before the server accepts sessions, or update the design docs if reattachment/cleanup is deliberately deferred.

**Resolution:** Resolved 2026-05-18. Confirmed against source: no code path enumerated or killed leftover workers. Added `IRunningProcessInspector` / `SystemRunningProcessInspector` (a testable seam over `Process.GetProcessesByName`/`Kill`), `OrphanWorkerTerminator` (kills processes matched by the configured worker executable path, or by image name when the x64 gateway cannot introspect the x86 worker's `MainModule`, skipping the current process and tolerating per-process kill failures), and `OrphanWorkerCleanupHostedService` (best-effort `IHostedService`). The hosted service is registered in `AddWorkerProcessLauncher` ahead of `AddGatewaySessions` so cleanup runs before the server accepts sessions. `gateway.md` updated to describe the implemented behavior. Regression tests: `OrphanWorkerTerminatorTests` (`KillsWorkerProcessesMatchingConfiguredExecutablePath`, `KillsImageNameMatchWhenExecutablePathUnreadable`, `DoesNotKillUnrelatedProcessSharingImageName`, `DoesNotKillCurrentProcess`, `ContinuesWhenOneKillThrows`).

### Server-003

| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/MxGateway.Server/Dashboard/DashboardAuthorizationHandler.cs:39,54-59`, `src/MxGateway.Server/Dashboard/DashboardAuthenticator.cs:236-258` |
| Status | Resolved |

**Description:** When `Dashboard:RequireAdminScope` is true (the default) and the request is not loopback, `DashboardAuthorizationHandler` succeeds only if `HasAdminScope` finds a claim of type `"scope"` with value `"admin"`. But `DashboardAuthenticator.CreatePrincipal` issues only `NameIdentifier`, `Name`, and `LdapGroupClaimType` claims — never a `scope`/`admin` claim. So a correctly LDAP-authenticated user who passed the required-group check is still denied dashboard access on any non-loopback connection. The bug is currently masked by the missing route-level enforcement (Server-001) and by `AllowAnonymousLocalhost`; fixing Server-001 would make the dashboard unusable for all real LDAP logins.

**Recommendation:** Either have `DashboardAuthenticator.CreatePrincipal` add a `scope=admin` claim when the user is in the required group, or change `DashboardAuthorizationHandler.HasAdminScope` to evaluate LDAP group membership (reuse `IsMemberOfRequiredGroup` against the `LdapGroupClaimType` claims, as `DashboardApiKeyAuthorization.CanManage` already does).

**Resolution:** Resolved in `a8aafdf` (2026-05-18): `DashboardAuthenticator.CreatePrincipal` — reached only after the required-group check passes — now emits the `scope=admin` claim that `DashboardAuthorizationHandler` checks, so group-validated LDAP users pass `RequireAdminScope` once route-level authorization (Server-001) is enforced.

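
The claim mismatch described in Server-003 can be illustrated with a minimal sketch, assuming the claim type `"scope"` and value `"admin"` quoted in the finding. The class name, the `"ldap-group"` claim type (a stand-in for `LdapGroupClaimType`), and the method shapes are hypothetical and are not taken from the gateway source.

```csharp
using System.Collections.Generic;
using System.Security.Claims;

public static class DashboardClaimsSketch
{
    // Claim type and value as quoted in the finding text.
    public const string ScopeClaimType = "scope";
    public const string AdminScope = "admin";

    // Hypothetical stand-in for the real LdapGroupClaimType constant.
    public const string GroupClaimType = "ldap-group";

    // Assumed to be called only after the required-group check has passed.
    public static ClaimsPrincipal CreatePrincipal(string userName, IEnumerable<string> groups)
    {
        var claims = new List<Claim>
        {
            new(ClaimTypes.NameIdentifier, userName),
            new(ClaimTypes.Name, userName),
            // The claim the authorization handler looks for; without it the check below fails.
            new(ScopeClaimType, AdminScope),
        };

        foreach (string group in groups)
        {
            claims.Add(new Claim(GroupClaimType, group));
        }

        return new ClaimsPrincipal(new ClaimsIdentity(claims, authenticationType: "Sketch"));
    }

    // Mirrors the handler-side check: a claim of type "scope" with value "admin".
    public static bool HasAdminScope(ClaimsPrincipal principal) =>
        principal.HasClaim(ScopeClaimType, AdminScope);
}
```

In this sketch a principal built by `CreatePrincipal` passes `HasAdminScope`, while a principal carrying only name and group claims does not, which is the pre-fix behavior the finding describes.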
### Server-004

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs:227-233`, `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCliRunner.cs:53-77`, `src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:21-67` |
| Status | Resolved |

**Description:** `ParseScopes` accepts any comma-separated strings and `CreateKeyAsync` persists them verbatim; neither the CLI nor the dashboard create path validates scopes against `GatewayScopes`. A typo or non-canonical name (e.g. CLAUDE.md's example `--scopes session,invoke,event,metadata,admin`, which does not match the resolver's `session:open`/`invoke:read`/etc.) silently creates a key whose scope strings the authorization resolver never checks for — the key is unusable for those RPCs with no error at creation time.

**Recommendation:** Validate every requested scope against the `GatewayScopes` catalog at create time in both the CLI parser/runner and `DashboardApiKeyManagementService.ValidateCreateRequest`, rejecting unknown scope strings.

**Resolution:** Resolved 2026-05-18. Confirmed against source: `ParseScopes` split unvalidated strings into the create command and `ValidateCreateRequest` checked only key id and display name. Added `GatewayScopes.All` (the canonical scope catalog) and `GatewayScopes.IsKnown(string)`. `ApiKeyAdminCommandLineParser.Parse` now runs `ValidateScopes` for create-key commands and fails the parse listing the unknown scope(s) and valid set; `DashboardApiKeyManagementService.ValidateCreateRequest` rejects requests carrying any non-canonical scope. Revoke/rotate paths are unaffected (no scope input). Regression tests: `ApiKeyAdminCommandLineParserTests.Parse_CreateKeyCommand_RejectsUnknownScope`, `Parse_CreateKeyCommand_AcceptsAllCanonicalScopes`, and `DashboardApiKeyManagementServiceTests.CreateAsync_UnknownScope_DoesNotCallStore`.

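
A create-time catalog check of the kind Server-004 recommends might look like the sketch below. Only `session:open` and `invoke:read` are named in the finding; the remaining catalog entries, the class name, and the `FindUnknown` helper are assumptions made for illustration, not the real `GatewayScopes` contents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GatewayScopesSketch
{
    // Assumed catalog: "session:open" and "invoke:read" come from the finding text,
    // the other entries are placeholders.
    public static readonly IReadOnlySet<string> All = new HashSet<string>(StringComparer.Ordinal)
    {
        "session:open", "invoke:read", "invoke:write", "events:read", "admin",
    };

    public static bool IsKnown(string scope) => All.Contains(scope);

    // Splits a comma-separated --scopes argument and returns every entry not in the catalog.
    public static IReadOnlyList<string> FindUnknown(string commaSeparatedScopes) =>
        commaSeparatedScopes
            .Split(',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
            .Where(scope => !IsKnown(scope))
            .ToList();
}
```

A caller would reject the create request whenever `FindUnknown` returns a non-empty list and report both the offending strings and the valid set, so a typo fails at creation time instead of producing an unusable key.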
### Server-005

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs:22-28`, `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:184` |
| Status | Resolved |

**Description:** `GalaxyHierarchyCache.RefreshCoreAsync` only catches `SqlException` and `InvalidOperationException`. The initial `cache.RefreshAsync` call in `GalaxyHierarchyRefreshService.ExecuteAsync` is wrapped only for `OperationCanceledException`. A transient non-`SqlException` failure on the first refresh (e.g. a `Win32Exception`/`TimeoutException` from connection establishment, or another `DbException` subtype) escapes both layers, faults the `BackgroundService`, and — with default host behavior — stops the whole gateway. The periodic-tick loop does catch general exceptions, so only the first load is exposed.

**Recommendation:** Broaden the `catch` in `RefreshCoreAsync` to all non-cancellation exceptions (record `Unavailable`/`Stale` and still complete `_firstLoad`), or wrap the initial `RefreshAsync` in `GalaxyHierarchyRefreshService` with the same general `catch` the tick loop uses.

**Resolution:** Resolved 2026-05-18. Confirmed against source: the initial `RefreshAsync` in `ExecuteAsync` was guarded only for `OperationCanceledException`, and `RefreshCoreAsync` filtered its catch to `SqlException or InvalidOperationException`. Both recommended layers applied: `GalaxyHierarchyRefreshService.ExecuteAsync` now catches every non-cancellation exception on the initial load (logs a warning; the periodic tick retries), and `GalaxyHierarchyCache.RefreshCoreAsync` broadens its catch to all non-cancellation exceptions so the cache still records `Stale`/`Unavailable` and completes `_firstLoad`. The now-unused `Microsoft.Data.SqlClient` using was removed. Regression test: `GalaxyHierarchyRefreshServiceTests.ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFaultBackgroundService`.

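
The "catch every non-cancellation exception" shape from Server-005 can be expressed with a C# exception filter. The sketch below is illustrative only: `FirstLoadSketch`, `TryInitialRefreshAsync`, and the delegate parameters are hypothetical stand-ins for the real refresh service and logger.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class FirstLoadSketch
{
    // Runs the initial refresh once. Returns false (after logging) on any failure
    // other than cancellation, so a periodic tick can retry later.
    public static async Task<bool> TryInitialRefreshAsync(
        Func<CancellationToken, Task> refreshAsync,
        Action<Exception> logWarning,
        CancellationToken stoppingToken)
    {
        try
        {
            await refreshAsync(stoppingToken).ConfigureAwait(false);
            return true;
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // TimeoutException, Win32Exception, DbException subtypes, and so on land here.
            logWarning(ex);
            return false;
        }
    }
}
```

The filter leaves `OperationCanceledException` (and its `TaskCanceledException` subtype) uncaught, so shutdown still propagates normally while transient connection failures no longer fault the background service.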
### Server-006

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Server/Sessions/SessionManager.cs:84-114` |
| Status | Resolved |

**Description:** In `OpenSessionAsync`, `_metrics.SessionOpened()` (line 89) increments the `_openSessions` gauge before `TryAutoSubscribeAlarmsAsync` runs. If auto-subscribe throws (which it does when `Alarms.RequireSubscribeOnOpen` is true and the worker rejects the subscription), the `catch` block disposes and removes the session and records `_metrics.Fault(...)` but never calls `SessionClosed`/`SessionRemoved`. The `mxgateway.sessions.open` gauge permanently over-counts by one for every such failed open.

**Recommendation:** In the `catch` block, when the session had reached the point where `SessionOpened()` was recorded, also call `_metrics.SessionRemoved()` — or move the `SessionOpened()` call to after auto-subscribe succeeds.

**Resolution:** Resolved 2026-05-18. Confirmed against source: the `catch` block in `OpenSessionAsync` recorded `Fault(...)` and removed the session but never decremented the open-session gauge after `SessionOpened()` had run. Added a `sessionOpenedRecorded` flag set immediately after `_metrics.SessionOpened()`; the `catch` block now calls `_metrics.SessionRemoved()` when that flag is set, restoring the gauge for a post-`SessionOpened()` failure (e.g. an auto-subscribe rejection with `RequireSubscribeOnOpen=true`). Regression test: `SessionManagerAlarmAutoSubscribeTests.OpenSessionAsync_DoesNotLeakOpenSessionGauge_WhenAutoSubscribeFailsWithRequireOn`.

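
The gauge-balancing flag described in the Server-006 resolution can be sketched as follows. The flag name `sessionOpenedRecorded` comes from the resolution; `SessionMetricsSketch`, `OpenSessionSketch`, and the delegate standing in for auto-subscribe are hypothetical, and the real method does more work (session disposal, fault recording) than is shown here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the gateway metrics type; only the gauge is modeled.
public sealed class SessionMetricsSketch
{
    private int _openSessions;

    public int OpenSessions => Volatile.Read(ref _openSessions);

    public void SessionOpened() => Interlocked.Increment(ref _openSessions);

    public void SessionRemoved() => Interlocked.Decrement(ref _openSessions);
}

public static class OpenSessionSketch
{
    public static async Task OpenAsync(SessionMetricsSketch metrics, Func<Task> autoSubscribeAsync)
    {
        bool sessionOpenedRecorded = false;
        try
        {
            metrics.SessionOpened();
            sessionOpenedRecorded = true;

            // May throw, for example when the worker rejects a required subscription.
            await autoSubscribeAsync().ConfigureAwait(false);
        }
        catch
        {
            // Undo the increment only if it actually happened, then surface the failure.
            if (sessionOpenedRecorded)
            {
                metrics.SessionRemoved();
            }

            throw;
        }
    }
}
```

The flag matters when earlier steps in the `try` block can fail before the increment: decrementing unconditionally in the `catch` would then under-count instead of over-count.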
### Server-007

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyProjector.cs:55-70` |
| Status | Resolved |

**Description:** `Project` always iterates the full `entry.Index.ObjectViews` collection and re-applies all filters to skip `offset` matched items before collecting a page. Paging through a large Galaxy hierarchy is therefore O(total) per page and O(total²/pageSize) end-to-end. The cache is in-memory so impact is bounded, but for large galaxies repeated `DiscoverHierarchy` pagination wastes CPU.

**Recommendation:** Precompute and cache the filtered, ordered view list per `(filterSignature, sequence)` so subsequent pages are an O(pageSize) slice; the existing filter signature already keys page tokens.

**Resolution:** Resolved 2026-05-18. Confirmed against source: `Project` re-scanned and re-filtered the whole `ObjectViews` list on every page. Added a `ConditionalWeakTable<GalaxyHierarchyCacheEntry, ConcurrentDictionary<string, IReadOnlyList<GalaxyObjectView>>>` memo in `GalaxyHierarchyProjector`: the first projection of a given filter signature builds the filtered, ordered view list; subsequent pages take an O(pageSize) slice via index arithmetic. The memo is keyed on the immutable cache-entry instance, so when the cache publishes a new entry the stale memo becomes unreachable and is reclaimed with it — no explicit invalidation. `ResolveRoot` still runs before the memo lookup so a missing root surfaces `NotFound` consistently. Regression tests: `GalaxyHierarchyProjectorTests` (`Project_PagedAcrossEntireHierarchy_ReturnsEveryObjectExactlyOnce`, `Project_DistinctFiltersOnSameEntry_DoNotShareMemoizedViewList`, `Project_SameFilterRepeated_ReturnsIdenticalTotals`, `Project_DistinctCacheEntries_ProjectAgainstTheirOwnData`); existing `GalaxyRepositoryGrpcServiceTests` paging tests continue to pass unchanged.

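
The memoization pattern named in the Server-007 resolution (a `ConditionalWeakTable` keyed on the cache entry, holding a per-filter `ConcurrentDictionary`) can be sketched in reduced form. Everything below is a simplification under stated assumptions: `CacheEntrySketch` holds plain strings instead of `GalaxyObjectView`, and a name prefix stands in for the real filter signature.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;

// Hypothetical immutable cache entry; the real entry carries richer object views.
public sealed class CacheEntrySketch
{
    public CacheEntrySketch(IReadOnlyList<string> objectNames) => ObjectNames = objectNames;

    public IReadOnlyList<string> ObjectNames { get; }
}

public static class PagedProjectorSketch
{
    // Keyed on the entry instance: when an entry becomes unreachable, its memo can be collected with it.
    private static readonly ConditionalWeakTable<CacheEntrySketch, ConcurrentDictionary<string, IReadOnlyList<string>>> Memo = new();

    public static IReadOnlyList<string> Page(CacheEntrySketch entry, string prefixFilter, int offset, int pageSize)
    {
        ConcurrentDictionary<string, IReadOnlyList<string>> perEntry =
            Memo.GetValue(entry, _ => new ConcurrentDictionary<string, IReadOnlyList<string>>(StringComparer.Ordinal));

        // First call for a filter pays the full scan; later pages reuse the filtered, ordered list.
        IReadOnlyList<string> filtered = perEntry.GetOrAdd(
            prefixFilter,
            filter => entry.ObjectNames
                .Where(name => name.StartsWith(filter, StringComparison.Ordinal))
                .OrderBy(name => name, StringComparer.Ordinal)
                .ToList());

        // Slice by index arithmetic: cost is proportional to the page size once the list is memoized.
        int start = Math.Clamp(offset, 0, filtered.Count);
        int count = Math.Clamp(pageSize, 0, filtered.Count - start);
        var page = new List<string>(count);
        for (int i = start; i < start + count; i++)
        {
            page.Add(filtered[i]);
        }

        return page;
    }
}
```

Because the memoized lists hold only the filtered items and not the entry itself, the values do not keep their key alive, which is what allows the weak table to drop a superseded entry without explicit invalidation.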
### Server-008

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:111-134,160-189` |
| Status | Resolved |

**Description:** `WatchDeployEvents` calls `ResolveBrowseSubtrees()` on every streamed event, and `MapDeployEvent` re-runs `GalaxyHierarchyProjector.Project` over the entire cached hierarchy (and `Sum`s attribute counts) for every event of every constrained subscriber. `GalaxyGlobMatcher.IsMatch` also rebuilds the glob regex on each call. With many constrained subscribers and frequent deploys this is avoidable work.

**Recommendation:** Hoist `ResolveBrowseSubtrees()` out of the loop; compute scoped object/attribute counts once per deploy sequence and cache by `(sequence, browseSubtrees)`; cache compiled glob `Regex` instances in `GalaxyGlobMatcher`.

**Resolution:** Resolved 2026-05-18. Confirmed against source. Three changes: (1) `WatchDeployEvents` now resolves `ResolveBrowseSubtrees()` once before the streaming loop — the caller's identity and constraints are fixed for the stream lifetime, so per-event resolution was pure waste. (2) `GalaxyGlobMatcher` now caches compiled `Regex` instances in a `ConcurrentDictionary` keyed by glob pattern (with `RegexOptions.Compiled`), so the same handful of globs are translated once instead of on every `IsMatch` call. (3) The per-event `MapDeployEvent` re-projection is no longer a separate hot path: with finding Server-007 resolved, `GalaxyHierarchyProjector.Project` memoizes the filtered view list per `(cache entry, filter signature)`, so the scoped-count projection in `MapDeployEvent` for a constrained subscriber is O(matched-slice) after the first event of a given deploy sequence rather than a full re-scan — this subsumes the recommendation's `(sequence, browseSubtrees)` cache (the memo is keyed on the per-sequence cache-entry instance and the browse-subtree-bearing filter signature). Regression tests: `GalaxyFilterInputSafetyTests.GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect` (glob cache correctness); existing `WatchDeployEvents` and `GalaxyFilterInputSafetyTests` coverage continues to pass.

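
The compiled-regex cache from change (2) of the Server-008 resolution can be sketched as below. The glob dialect (`*` for any run of characters, `?` for exactly one), the case-sensitive matching, and the class name are assumptions; the real `GalaxyGlobMatcher` may translate patterns differently, and this sketch deliberately omits the size cap that Server-018 later added.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

public static class GlobMatcherSketch
{
    // One compiled Regex per distinct glob pattern. Unbounded in this sketch.
    private static readonly ConcurrentDictionary<string, Regex> Cache = new(StringComparer.Ordinal);

    public static bool IsMatch(string glob, string candidate) =>
        Cache.GetOrAdd(glob, Compile).IsMatch(candidate);

    private static Regex Compile(string glob)
    {
        // Escape everything first so regex metacharacters in the glob stay literal,
        // then re-enable the two wildcard forms (assumed dialect).
        string pattern = "^" + Regex.Escape(glob).Replace("\\*", ".*").Replace("\\?", ".") + "$";
        return new Regex(pattern, RegexOptions.Compiled | RegexOptions.CultureInvariant);
    }
}
```

`GetOrAdd` returns whichever instance ends up stored for the key, so the caller never performs a separate lookup after a failed add. Under contention the factory may run more than once for the same pattern, which costs a redundant compile but does not affect correctness.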
### Server-009

| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Security/Authentication/AuthSqliteConnectionFactory.cs:15-32` |
| Status | Resolved |

**Description:** Each auth-store operation opens a fresh `SqliteConnection` with no busy timeout, no WAL journal mode, and default journaling. `MarkKeyUsedAsync` runs on every authenticated request and `SqliteApiKeyAuditStore` appends on every denial; under concurrent load these writers can collide and surface `SQLITE_BUSY` as a hard failure on the request path.

**Recommendation:** Set `Pooling`, a non-zero `DefaultTimeout`/`busy_timeout`, and enable WAL (`PRAGMA journal_mode=WAL`) once at startup so concurrent readers/writers degrade gracefully.

**Resolution:** Resolved 2026-05-18. Confirmed against source: the connection string set only `DataSource` and `Mode`. `AuthSqliteConnectionFactory.CreateConnection` now also sets `Pooling = true` and a non-zero `DefaultTimeout`. A new `OpenConnectionAsync(CancellationToken)` opens the connection and applies `PRAGMA journal_mode=WAL` and `PRAGMA busy_timeout` (5 s); WAL is a persistent database-level setting so re-applying it per connection is a cheap no-op, while `busy_timeout` is per-connection state. All nine auth-store call sites (`SqliteApiKeyAdminStore`, `SqliteApiKeyAuditStore`, `SqliteApiKeyStore`, `SqliteAuthStoreMigrator`) were switched from `CreateConnection()` + `OpenAsync()` to `OpenConnectionAsync()`. `docs/Authentication.md` updated to describe the WAL/busy-timeout behavior. Regression test: `SqliteAuthStoreTests.OpenConnectionAsync_EnablesWalJournalModeAndBusyTimeout`.

### Server-010
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Security |
|
||
| Location | `src/MxGateway.Server/Security/Authentication/SqliteApiKeyAdminStore.cs:91-114`, `src/MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:168-172` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** `RotateAsync` sets `revoked_utc = NULL`, so rotating a previously revoked key silently reactivates it. This is documented intentional behavior in `docs/Authentication.md:167`, but the dashboard renders the "Rotate" button unconditionally — including for keys whose status badge says "Revoked" — so an operator can un-revoke a deliberately disabled key without an explicit warning.
|
||
|
||
**Recommendation:** Either hide/disable the Rotate action for revoked keys in `ApiKeysPage.razor`, require an explicit confirmation, or have `RotateAsync` preserve `revoked_utc` and add a separate explicit "reactivate" operation.
|
||
|
||
**Resolution:** Resolved 2026-05-18. Confirmed against source: `ApiKeysPage.razor` rendered the Rotate button unconditionally while Revoke was already gated on `key.RevokedUtc is null`. Took the lowest-risk recommended option — the dashboard now renders the Rotate (and Revoke) actions only for keys whose status is `Active`; a revoked key shows a "No actions" placeholder, so an operator cannot un-revoke a deliberately disabled key as a side effect of a rotation. `RotateAsync`'s store-level behavior is unchanged (rotation by `key_id` still clears `revoked_utc`, which the CLI relies on); `docs/Authentication.md` updated to document both the store behavior and the dashboard restriction. No automated test added: the change is pure conditional Razor rendering and the test project has no bUnit component-rendering harness; the underlying `DashboardApiKeyManagementService` is already unit-tested.

### Server-011

| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:1-46` |
| Status | Resolved |

**Description:** `WorkerAlarmRpcDispatcher` deviates from the module's conventions: it fully-qualifies `System.Guid`, `System.ArgumentNullException`, and `System.Threading` types inline instead of relying on `using` directives, and uses an explicit constructor with `this.`-qualified field assignment while the rest of the module (e.g. `ConstraintEnforcer`, `MxAccessGatewayService`, `GalaxyRepositoryGrpcService`) uses primary constructors. `docs/style-guides/CSharpStyleGuide.md` is authoritative for gateway code.
**Recommendation:** Add the needed `using` directives, drop the inline fully-qualified names, and convert to a primary constructor for consistency.
**Resolution:** Resolved 2026-05-18. Confirmed against source. Converted `WorkerAlarmRpcDispatcher` to a primary constructor with the standard `?? throw new ArgumentNullException(...)` field-initializer guard; dropped the inline `System.Guid` / `System.ArgumentNullException` qualifications (using implicit `using System;`); removed redundant `using System.Collections.Generic;` / `System.Threading` / `System.Threading.Tasks;` directives (covered by `ImplicitUsings`); replaced the two `if (... is null) throw new System.ArgumentNullException(...)` checks with `ArgumentNullException.ThrowIfNull`. The stale class-level `<summary>`/`<remarks>` ("Replaces NotWiredAlarmRpcDispatcher once ... wired in", "partially wired", "returns an Unimplemented diagnostic") were corrected to describe the actual GUID-vs-`Provider!Group.Tag` handling — overlapping with Server-014. No behavior change, so no new test; existing `WorkerAlarmRpcDispatcherTests` continue to pass and the project builds warning-free under `TreatWarningsAsErrors`.
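
The two guard shapes named in this resolution look roughly like the sketch below. The types `AlarmDispatcherSketch`, `ISessionRegistrySketch`, and `InMemoryRegistry` are hypothetical stand-ins; the real dispatcher's dependencies are not shown in this review.

```csharp
using System;

var dispatcher = new AlarmDispatcherSketch(new InMemoryRegistry());
Console.WriteLine(dispatcher.Describe("session-1"));

try
{
    _ = new AlarmDispatcherSketch(null!);
}
catch (ArgumentNullException ex)
{
    Console.WriteLine($"Rejected null dependency: {ex.ParamName}");
}

// Hypothetical dependency, standing in for whatever the dispatcher injects.
public interface ISessionRegistrySketch
{
    bool Contains(string sessionId);
}

public sealed class InMemoryRegistry : ISessionRegistrySketch
{
    public bool Contains(string sessionId) => sessionId == "session-1";
}

// Primary constructor (C# 12): the parameter feeds a guarded field initializer,
// replacing the explicit constructor with `this.`-qualified assignment.
public sealed class AlarmDispatcherSketch(ISessionRegistrySketch registry)
{
    private readonly ISessionRegistrySketch _registry =
        registry ?? throw new ArgumentNullException(nameof(registry));

    public string Describe(string sessionId)
    {
        // Replaces `if (x is null) throw new System.ArgumentNullException(...)`.
        ArgumentNullException.ThrowIfNull(sessionId);
        return _registry.Contains(sessionId) ? "known" : "unknown";
    }
}
```

Because the parameter is only read in the field initializer, the compiler does not also capture it as hidden state, so the class has a single guarded copy of the dependency.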

### Server-012

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `CLAUDE.md` (Authentication section and `apikey create` example) |
| Status | Resolved |

**Description:** CLAUDE.md describes scopes as `session`, `invoke`, `event`, `metadata`, `admin` and shows `apikey create --scopes session,invoke,event,metadata,admin`. The actual canonical scope strings (used by `GatewayScopes`, `GatewayGrpcScopeResolver`, and `docs/Authorization.md`) are `session:open`, `session:close`, `invoke:read`, `invoke:write`, `invoke:secure`, `events:read`, `metadata:read`, `admin`. A key created per the CLAUDE.md example carries scopes the resolver never matches.
**Recommendation:** Update CLAUDE.md's scope list and the `apikey` example to the canonical `*:*` scope strings, per CLAUDE.md's own rule that docs change with the code.
**Resolution:** Resolved 2026-05-18. Confirmed against `GatewayScopes` (`session:open`, `session:close`, `invoke:read`, `invoke:write`, `invoke:secure`, `events:read`, `metadata:read`, `admin`). CLAUDE.md's Build/Test/Run `apikey create` example and the Authentication-section scope list were both updated to the canonical `*:*` strings. (Note: since finding Server-004 was resolved, the old example would now be actively rejected at create time rather than silently creating an unusable key, making the doc correction load-bearing.) Pure documentation change; no test.

### Server-013

| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/MxGateway.Tests/Gateway/Dashboard/DashboardAuthorizationHandlerTests.cs`, `src/MxGateway.Tests/Gateway/GatewayApplicationTests.cs` |
| Status | Resolved |

**Description:** `DashboardAuthorizationHandler` is unit-tested in isolation, but no test exercises the dashboard routes end-to-end to confirm the policy is actually enforced — which is why Server-001 (policy registered but never wired) went uncaught. There are also no tests for `WorkerExecutableValidator` (PE-header architecture parsing), `GalaxyGlobMatcher` (anchoring/escaping/empty-glob fail-open), or `GalaxyHierarchyProjector` pagination/page-token behavior.
**Recommendation:** Add a `WebApplicationFactory` integration test that requests a dashboard page unauthenticated and asserts the redirect/401, plus unit tests for `WorkerExecutableValidator`, `GalaxyGlobMatcher`, and projector paging.
**Resolution:** Resolved 2026-05-18. Re-triaged against the current test suite: three of the four named gaps were already closed. (1) The dashboard route-level enforcement test exists — `GatewayApplicationTests.Build_WhenDashboardEnabled_ComponentRoutesRequireAuthorization` (and `..._AuthEndpointsAllowAnonymousAccess`), added when Server-001 was fixed. (2) `GalaxyGlobMatcher` anchoring/escaping/empty-glob behavior is covered by `GalaxyFilterInputSafetyTests` (`GlobMatcher_TreatsSqlMetacharactersAsLiterals`, `GlobMatcher_DoesNotTreatLikeWildcardsAsWildcards`, `GlobMatcher_WithPathologicalInput_DoesNotHang`), now extended with `GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect`. (3) Projector pagination/page-token behavior is covered end-to-end by `GalaxyRepositoryGrpcServiceTests` and now directly by the new `GalaxyHierarchyProjectorTests`. The one genuine remaining gap — `WorkerExecutableValidator` PE-header parsing — was closed with the new `WorkerExecutableValidatorTests` (7 cases: matching/mismatched x86 and x64, missing `MZ` header, file too small, missing `PE` signature), exercising the validator against synthesized minimal PE fixtures.
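
For readers unfamiliar with the checks those seven cases exercise, here is a sketch of reading the architecture from a PE header and of synthesizing a minimal fixture. `PeSketch` and its members are hypothetical names, not the API of `WorkerExecutableValidator`; the offsets and machine constants come from the PE/COFF format itself.

```csharp
using System;
using System.Buffers.Binary;

byte[] x64 = PeSketch.BuildMinimalImage(PeSketch.MachineAmd64);
byte[] x86 = PeSketch.BuildMinimalImage(PeSketch.MachineI386);

Console.WriteLine(PeSketch.TryReadMachine(x64, out ushort m1) && m1 == PeSketch.MachineAmd64);
Console.WriteLine(PeSketch.TryReadMachine(x86, out ushort m2) && m2 == PeSketch.MachineI386);
Console.WriteLine(PeSketch.TryReadMachine(new byte[8], out _)); // file too small

public static class PeSketch
{
    public const ushort MachineI386 = 0x014C;  // IMAGE_FILE_MACHINE_I386
    public const ushort MachineAmd64 = 0x8664; // IMAGE_FILE_MACHINE_AMD64
    private const int LfanewOffset = 0x3C;     // DOS header field holding the PE header offset

    public static bool TryReadMachine(ReadOnlySpan<byte> image, out ushort machine)
    {
        machine = 0;

        // Too small to hold the DOS header's e_lfanew field.
        if (image.Length < LfanewOffset + 4) return false;

        // Missing "MZ" DOS signature.
        if (image[0] != (byte)'M' || image[1] != (byte)'Z') return false;

        int peOffset = BinaryPrimitives.ReadInt32LittleEndian(image.Slice(LfanewOffset, 4));

        // Need 4 signature bytes plus the 2-byte Machine field.
        if (peOffset < 0 || peOffset > image.Length - 6) return false;

        // Missing "PE\0\0" signature.
        if (image[peOffset] != (byte)'P' || image[peOffset + 1] != (byte)'E'
            || image[peOffset + 2] != 0 || image[peOffset + 3] != 0)
        {
            return false;
        }

        machine = BinaryPrimitives.ReadUInt16LittleEndian(image.Slice(peOffset + 4, 2));
        return true;
    }

    // Builds the smallest byte array the reader above accepts; not a loadable executable.
    public static byte[] BuildMinimalImage(ushort machine)
    {
        const int peOffset = 0x40;
        var image = new byte[peOffset + 6];
        image[0] = (byte)'M';
        image[1] = (byte)'Z';
        BinaryPrimitives.WriteInt32LittleEndian(image.AsSpan(LfanewOffset, 4), peOffset);
        image[peOffset] = (byte)'P';
        image[peOffset + 1] = (byte)'E';
        BinaryPrimitives.WriteUInt16LittleEndian(image.AsSpan(peOffset + 4, 2), machine);
        return image;
    }
}
```

Synthesized fixtures of this kind keep the tests independent of any real worker binary on the build machine.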

### Server-014

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:162-171,191-198,206-214,229-237` |
| Status | Resolved |

**Description:** The XML `<remarks>` and inline comments on `AcknowledgeAlarm` and `QueryActiveAlarms` describe the alarm path as not yet wired and say `NotWiredAlarmRpcDispatcher` is the default ("Clients calling this method today receive an OK reply with a 'worker alarm path not yet wired' diagnostic", "an empty stream until PR A.2"). In fact `SessionServiceCollectionExtensions.AddGatewaySessions` registers `WorkerAlarmRpcDispatcher` as `IAlarmRpcDispatcher`, so DI always injects the production dispatcher; `NotWiredAlarmRpcDispatcher` is only the null fallback. The comments are stale and misleading.
**Recommendation:** Update the `AcknowledgeAlarm`/`QueryActiveAlarms` remarks to reflect that `WorkerAlarmRpcDispatcher` is the wired default, and describe its actual GUID-vs-`Provider!Group.Tag` handling.
**Resolution:** Resolved 2026-05-18. Confirmed against source: `SessionServiceCollectionExtensions` registers `WorkerAlarmRpcDispatcher` as `IAlarmRpcDispatcher`, so the "not yet wired" / "empty stream until PR A.2" / "PR A.6/A.7 follow-up" prose in the `AcknowledgeAlarm` and `QueryActiveAlarms` `<remarks>` and inline comments was stale. Rewrote both `<remarks>` blocks and both inline comments to state that DI binds the production `WorkerAlarmRpcDispatcher`, that it routes over the worker pipe IPC, and that `AcknowledgeAlarm` handles a canonical-GUID reference (→ `AcknowledgeAlarmCommand`) and a `Provider!Group.Tag` reference (→ `AcknowledgeAlarmByNameCommand`), with `NotWiredAlarmRpcDispatcher` being only the null fallback. The matching stale `WorkerAlarmRpcDispatcher` class-level XML doc was corrected as part of Server-011. Pure documentation/comment change; no test.

### Server-015

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:8-15,266-308,720-775` |
| Status | Resolved |

**Description:** `GatewaySession` guards its mutable state with two different sync primitives. `TransitionTo`, `MarkFaulted`, `TouchClientActivity`, the `State`/`LastClientActivityAt`/`LeaseExpiresAt`/`FinalFault`/`ActiveEventSubscriberCount` getters, `AttachWorkerClient`, and `IsLeaseExpired` all read/write `_state`, `_finalFault`, `_lastClientActivityAt`, `_leaseExpiresAt`, `_workerClient`, and `_activeEventSubscriberCount` under `_syncRoot`. `CloseAsync` (lines 720-775), however, reads `_state` at line 729 and writes `_state` at lines 736 (`SessionState.Closing`) and 761 (`SessionState.Closed`) while only holding the `_closeLock` `SemaphoreSlim` — `_syncRoot` is never acquired. A concurrent `TransitionTo` or `MarkFaulted` from another thread sees `_state` outside the lock that protects it, and the `State` getter is not guaranteed to observe the `Closing`/`Closed` writes promptly. `SemaphoreSlim.WaitAsync`/`Release` do happen to provide memory barriers in practice, but the locking discipline is split across two primitives, which is fragile and defeats the audit value of "all `_state` access is guarded by `_syncRoot`". Concretely, the race between `CloseAsync` setting `_state = Closing` and a concurrent `TransitionTo(Ready)` is unordered — and `TransitionTo` will happily overwrite `Closing` back to `Ready` because its only guard is "do not overwrite `Closed`/`Faulted`".
**Recommendation:** Make `CloseAsync` mutate `_state` through the existing `TransitionTo(...)` helper (or acquire `_syncRoot` around the reads/writes) so all `_state` access uses the same lock. Either extend `TransitionTo` to accept the `Closing` and `Closed` transitions (it already handles `Faulted`/`Closed` precedence) or refactor `CloseAsync` to call a private `TrySetClosing()` / `MarkClosed()` that locks `_syncRoot`. Add a regression test that forces a `TransitionTo(Ready)` after `CloseAsync` has set `Closing` and asserts the session does not flip back to `Ready`.
**Resolution:** 2026-05-20 — Unified the close path on `_syncRoot`. `GatewaySession.CloseAsync` (`src/MxGateway.Server/Sessions/GatewaySession.cs`) now mutates `_state` only through two private `_syncRoot`-locked helpers — `TryBeginClose` (writes `Closing`, returns the prior `_closeStarted`) and `MarkClosed` (writes `Closed`) — so every `_state` read/write in the session uses the same lock; `_closeLock` keeps its role of serializing concurrent close attempts. `TransitionTo` was tightened to refuse a transition out of `Closing` to anything other than `Closed`/`Faulted` so a late lifecycle callback cannot walk a closing session back to `Ready`. `docs/Sessions.md` updated to describe the unified lock discipline and the extended terminal precedence. Regression tests in `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs`: `TransitionTo_AfterCloseStarted_DoesNotOverwriteClosing` (the named scenario — `BlockingShutdownWorkerClient` parks the close inside `worker.ShutdownAsync` so the test can call `TransitionTo(Ready)` between the `Closing` and `Closed` writes and assert the state stays `Closing`) and `MarkFaulted_AfterCloseCompletes_DoesNotResurrectSession`.
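
The unified lock discipline can be sketched as a small state holder. This is an illustration under assumptions, not `GatewaySession` itself: the class name `SessionStateSketch`, the `Opening` enum member, and the terminal-state check inside `TryBeginClose` are hypothetical.

```csharp
using System;

var session = new SessionStateSketch();
session.TransitionTo(SessionState.Ready);
session.TryBeginClose();
session.TransitionTo(SessionState.Ready); // late lifecycle callback: refused
Console.WriteLine(session.State);
session.MarkClosed();
session.MarkFaulted();                    // after close completes: refused
Console.WriteLine(session.State);

// `Opening` is an assumed initial state; the review names only the other four.
public enum SessionState { Opening, Ready, Closing, Closed, Faulted }

public sealed class SessionStateSketch
{
    private readonly object _syncRoot = new();
    private SessionState _state = SessionState.Opening;
    private bool _closeStarted;

    public SessionState State
    {
        get { lock (_syncRoot) { return _state; } }
    }

    public void TransitionTo(SessionState next)
    {
        lock (_syncRoot)
        {
            // Terminal states are never overwritten.
            if (_state is SessionState.Closed or SessionState.Faulted) return;

            // A closing session may only move on to Closed or Faulted.
            if (_state == SessionState.Closing
                && next is not (SessionState.Closed or SessionState.Faulted))
            {
                return;
            }

            _state = next;
        }
    }

    // Writes Closing and returns whether a close had already started.
    public bool TryBeginClose()
    {
        lock (_syncRoot)
        {
            bool alreadyStarted = _closeStarted;
            _closeStarted = true;

            // Assumption: an already-terminal session is left as it is.
            if (_state is not (SessionState.Closed or SessionState.Faulted))
            {
                _state = SessionState.Closing;
            }

            return alreadyStarted;
        }
    }

    public void MarkClosed()
    {
        lock (_syncRoot) { _state = SessionState.Closed; }
    }

    public void MarkFaulted() => TransitionTo(SessionState.Faulted);
}
```

Every read and write of `_state` goes through `_syncRoot`, which restores the "one lock guards the state" audit property the finding describes.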

### Server-016

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:790-797`, `src/MxGateway.Server/Sessions/SessionManager.cs:237-258` |
| Status | Resolved |

**Description:** `GatewaySession.DisposeAsync` synchronously calls `_closeLock.Dispose()` (line 792) without first acquiring the lock and without checking whether a `CloseAsync` is still in flight. The normal call path is `SessionManager.CloseSessionCoreAsync` → `session.CloseAsync(...)` → `RemoveSessionAsync` → `DisposeAsync`, where `DisposeAsync` runs strictly after `CloseAsync` completes. But the `ShutdownAsync` path (`SessionManager.cs:237-258`) and any future caller that disposes a session while another thread is still inside `CloseAsync` will trip `ObjectDisposedException` when the in-flight `CloseAsync` releases the semaphore. The race is narrow today because all `Close`/`Dispose` choreography goes through `SessionManager`, but the class-level contract is broken: nothing on `GatewaySession` documents or enforces "DisposeAsync must not be called concurrently with CloseAsync".
**Recommendation:** In `DisposeAsync`, either (a) take and release `_closeLock` once before disposing it, so the dispose is sequenced after any in-flight close, or (b) replace `_closeLock` disposal with a guard flag and let the semaphore be reclaimed by the finalizer. Document the invariant on the public method. Add a regression test that disposes a session whose `CloseAsync` has not yet completed and asserts no `ObjectDisposedException`.
**Resolution:** 2026-05-20 — Took recommendation (a): `GatewaySession.DisposeAsync` (`src/MxGateway.Server/Sessions/GatewaySession.cs`) now acquires `_closeLock` once before disposing the semaphore so an in-flight `CloseAsync` finishes (its `_closeLock.Release()`) before the dispose tears the semaphore down. The wait is non-cancellable (`CancellationToken.None`) and `ObjectDisposedException` is swallowed at both the wait and the dispose site so double-dispose still completes cleanly. The method's XML doc was extended with a `<remarks>` block stating the invariant. Regression tests in `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs`: `DisposeAsync_WhileCloseInFlight_WaitsForCloseAndDoesNotThrow` (parks `CloseAsync` inside the worker shutdown, calls `DisposeAsync` concurrently, releases shutdown, asserts both complete without `ObjectDisposedException` and the worker is disposed exactly once) and `DisposeAsync_CalledTwice_DoesNotThrow`.
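
A sketch of recommendation (a), sequencing the dispose after any in-flight close by acquiring the semaphore first. `CloseLockSketch` and the `shutdownWorker` delegate are hypothetical; the real session closes a worker client rather than awaiting a delegate.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var session = new CloseLockSketch();
var shutdownGate = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);

Task close = session.CloseAsync(() => shutdownGate.Task); // parks inside the "worker shutdown"
Task dispose = session.DisposeAsync().AsTask();           // queues behind the in-flight close
Console.WriteLine($"Dispose finished early: {dispose.IsCompleted}");

shutdownGate.SetResult();
await Task.WhenAll(close, dispose);
await session.DisposeAsync();                             // double dispose completes cleanly
Console.WriteLine("Closed and disposed without ObjectDisposedException");

public sealed class CloseLockSketch : IAsyncDisposable
{
    private readonly SemaphoreSlim _closeLock = new(1, 1);

    public async Task CloseAsync(Func<Task> shutdownWorker)
    {
        await _closeLock.WaitAsync();
        try
        {
            await shutdownWorker();
        }
        finally
        {
            // This Release is what would throw if the semaphore were disposed mid-close.
            _closeLock.Release();
        }
    }

    public async ValueTask DisposeAsync()
    {
        try
        {
            // Non-cancellable wait: an in-flight CloseAsync must release first.
            await _closeLock.WaitAsync(CancellationToken.None);
        }
        catch (ObjectDisposedException)
        {
            return; // already disposed by an earlier call
        }

        try
        {
            _closeLock.Dispose();
        }
        catch (ObjectDisposedException)
        {
            // Tolerate a racing second dispose.
        }
    }
}
```

The semaphore is intentionally not released after the final acquire: it is about to be torn down, and holding it prevents a new close from starting in between.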

### Server-017

| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:13-27`, `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:173-247`, `docs/Authorization.md:108-110` |
| Status | Resolved |

**Description:** The two new top-level RPCs added to `MxAccessGateway` — `AcknowledgeAlarm(AcknowledgeAlarmRequest)` and `QueryActiveAlarms(QueryActiveAlarmsRequest)` (proto lines 23-24) — are not enumerated by `GatewayGrpcScopeResolver.ResolveRequiredScope`. The resolver's `request switch` covers `OpenSessionRequest`, `CloseSessionRequest`, `StreamEventsRequest`, `MxCommandRequest`, and the four Galaxy-repository requests; everything else falls through to `_ => GatewayScopes.Admin`. The interceptor (`GatewayGrpcAuthorizationInterceptor.AuthenticateAndAuthorizeAsync`) then rejects any non-admin caller with `PermissionDenied`. This is technically fail-closed (and `docs/Authorization.md:108-110` documents the "unrecognized → admin" intent), but in practice it means: (1) only API keys with the `admin` scope can acknowledge alarms or query active alarms, even though acknowledging is naturally an `invoke:write`-shaped operation and querying is naturally an `invoke:read`- or `metadata:read`-shaped operation; (2) the alarm RPCs ship in a state where any client that successfully opened a session and subscribed to alarm events still cannot perform the operational acks the contract advertises; (3) the test matrix `GatewayGrpcScopeResolverTests` does not even cover these two request types, so the gap was not caught at unit-test time.
**Recommendation:** Add explicit arms to `ResolveRequiredScope`: map `AcknowledgeAlarmRequest` to `GatewayScopes.InvokeWrite` (parity with other write actions; ack changes alarm state) and `QueryActiveAlarmsRequest` to `GatewayScopes.MetadataRead` or `GatewayScopes.InvokeRead`. Update `docs/Authorization.md` to list both. Extend `GatewayGrpcScopeResolverTests` with the new mappings and an assertion that every request type defined by `mxaccess_gateway.proto` is named in the resolver (the test can enumerate the assembly's request types so a future RPC cannot quietly add itself only via the admin fallback).
**Resolution:** 2026-05-20 — Added explicit `AcknowledgeAlarmRequest => GatewayScopes.InvokeWrite` and `QueryActiveAlarmsRequest => GatewayScopes.EventsRead` arms to `GatewayGrpcScopeResolver.ResolveRequiredScope` (`src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:21-22`). `InvokeWrite` matches the existing `MxCommandKind.Write*` mapping because ack mutates alarm state; `EventsRead` matches `StreamEventsRequest` and `MxCommandKind.DrainEvents` because querying active alarms reads the same alarm/event surface. Extended `GatewayGrpcScopeResolverTests` with two new `InlineData` rows covering both request types (`src/MxGateway.Tests/Security/Authorization/GatewayGrpcScopeResolverTests.cs:16-17`) and added four interceptor-level cases in `GatewayGrpcAuthorizationInterceptorTests` (`UnaryServerHandler_AcknowledgeAlarmMissingScope_ReturnsPermissionDenied`, `UnaryServerHandler_AcknowledgeAlarmWithScope_RunsHandler`, `ServerStreamingServerHandler_QueryActiveAlarmsMissingScope_ReturnsPermissionDenied`, `ServerStreamingServerHandler_QueryActiveAlarmsWithScope_RunsHandler`) proving each new RPC denies callers lacking the chosen scope and runs the handler when the scope is held. Updated `docs/Authorization.md` (resolver snippet and Scope Catalog table) to list both RPCs against their scopes. `dotnet test ... --filter FullyQualifiedName~GatewayGrpcAuthorizationInterceptorTests` → 14 passed, 0 failed; resolver tests 28 passed, 0 failed.
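
The shape of the resolver after this fix is roughly the switch below. It is a reduced sketch: the empty record types stand in for the generated protobuf messages, `ScopeResolverSketch` and `GatewayScopesSketch` are hypothetical names, and only the arms this finding discusses are shown.

```csharp
using System;

Console.WriteLine(ScopeResolverSketch.ResolveRequiredScope(new AcknowledgeAlarmRequest()));
Console.WriteLine(ScopeResolverSketch.ResolveRequiredScope(new QueryActiveAlarmsRequest()));
Console.WriteLine(ScopeResolverSketch.ResolveRequiredScope(new object()));

// Stand-ins for the generated request messages.
public sealed record StreamEventsRequest;
public sealed record AcknowledgeAlarmRequest;
public sealed record QueryActiveAlarmsRequest;

public static class GatewayScopesSketch
{
    public const string InvokeWrite = "invoke:write";
    public const string EventsRead = "events:read";
    public const string Admin = "admin";
}

public static class ScopeResolverSketch
{
    public static string ResolveRequiredScope(object request) => request switch
    {
        StreamEventsRequest => GatewayScopesSketch.EventsRead,

        // Ack mutates alarm state, so it sits with the write scope.
        AcknowledgeAlarmRequest => GatewayScopesSketch.InvokeWrite,

        // Querying active alarms reads the same alarm/event surface as streaming.
        QueryActiveAlarmsRequest => GatewayScopesSketch.EventsRead,

        // Fail closed: anything the resolver does not name requires admin.
        _ => GatewayScopesSketch.Admin,
    };
}
```

The fail-closed fallback is what hid the original gap: an unlisted RPC still "works" for admin keys, so only an explicit per-request test row reveals that a mapping is missing.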

### Server-018

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:15` |
| Status | Resolved |

**Description:** `GalaxyGlobMatcher.RegexCache` is a `ConcurrentDictionary<string, Regex>` keyed by glob pattern, with no eviction. The fix for Server-008 added this cache deliberately to avoid recompiling the same handful of patterns, but the cache key is the raw glob string. The patterns currently come from two sources — `DiscoverHierarchyRequest.TagNameGlob` (client-supplied) and `ApiKeyConstraints.BrowseSubtrees` / `ReadSubtrees` / `WriteSubtrees` / `ReadTagGlobs` / `WriteTagGlobs` (admin-configured) — and `BuildRegex` also runs each glob through `Regex.Escape` so an attacker cannot craft a denial-of-service ReDoS payload. The leak is therefore bounded only by "how many distinct globs a client can submit over the process lifetime", which is in the millions for `TagNameGlob` if a client iterates through generated names. Each compiled `Regex` also holds a JIT'd assembly that is non-trivial to reclaim.
**Recommendation:** Cap the cache at a small bound (e.g. 256 patterns) using a simple LRU or a `MemoryCache` with sliding expiration, or restrict the cache to globs that originate from API-key constraints (admin-controlled, naturally bounded) and pay the compile cost for client-supplied globs. Add a test that fills the cache with thousands of distinct globs and asserts the cache size stays bounded.
**Resolution:** 2026-05-20 — Capped `GalaxyGlobMatcher`'s compiled-regex cache at `RegexCacheCapacity = 256` entries with FIFO-by-insertion eviction (`src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs`). A `ConcurrentQueue<string>` tracks insertion order; when the cache grows past the cap, `EvictIfOverCapacity` takes a small lock and dequeues + removes the oldest entries until the count is back within bound. Reads stay lock-free (the lock guards only the eviction path). Internal `CurrentCacheSize` / `RegexCacheCapacity` accessors are surfaced through the existing `InternalsVisibleTo("MxGateway.Tests")` so tests can assert the bound. Regression test: `GalaxyFilterInputSafetyTests.GlobMatcher_WithManyDistinctPatterns_CacheStaysBounded` submits `RegexCacheCapacity * 4` distinct globs and asserts `CurrentCacheSize` stays in `[0, RegexCacheCapacity]`. Existing glob correctness tests (`GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect`, the adversarial-input theories) continue to pass, confirming eviction does not corrupt lookups.
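
The FIFO-by-insertion bound can be sketched as below. `BoundedGlobCacheSketch` is a hypothetical class, and its `BuildRegex` assumes a glob syntax where only `*` is a wildcard; the real matcher's syntax is not spelled out in this review.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

// Submit four times the capacity in distinct globs, as the regression test does.
for (int i = 0; i < BoundedGlobCacheSketch.Capacity * 4; i++)
{
    BoundedGlobCacheSketch.IsMatch($"Tag{i}*", $"Tag{i}_Value");
}

Console.WriteLine(BoundedGlobCacheSketch.CurrentCacheSize <= BoundedGlobCacheSketch.Capacity);
Console.WriteLine(BoundedGlobCacheSketch.IsMatch("Area1.Pump*", "Area1.Pump_07"));
Console.WriteLine(BoundedGlobCacheSketch.IsMatch("Area1.Pump*", "Area1XPump_07"));

public static class BoundedGlobCacheSketch
{
    public const int Capacity = 256;

    private static readonly ConcurrentDictionary<string, Regex> Cache = new(StringComparer.Ordinal);
    private static readonly ConcurrentQueue<string> InsertionOrder = new();
    private static readonly object EvictionLock = new();

    public static int CurrentCacheSize => Cache.Count;

    public static bool IsMatch(string glob, string candidate)
    {
        // Lock-free read path: a cache hit never touches the eviction lock.
        if (!Cache.TryGetValue(glob, out var regex))
        {
            regex = BuildRegex(glob);
            if (Cache.TryAdd(glob, regex))
            {
                InsertionOrder.Enqueue(glob);
                EvictIfOverCapacity();
            }
        }

        return regex.IsMatch(candidate);
    }

    private static void EvictIfOverCapacity()
    {
        if (Cache.Count <= Capacity) return;

        lock (EvictionLock)
        {
            // Drop the oldest insertions until the cache is back within bound.
            while (Cache.Count > Capacity && InsertionOrder.TryDequeue(out var oldest))
            {
                Cache.TryRemove(oldest, out _);
            }
        }
    }

    private static Regex BuildRegex(string glob)
    {
        // Escape everything, then re-enable only the `*` wildcard (assumed syntax).
        string pattern = "^" + Regex.Escape(glob).Replace("\\*", ".*") + "$";
        return new Regex(pattern, RegexOptions.CultureInvariant);
    }
}
```

FIFO is cruder than LRU, since a hot pattern can be evicted and recompiled, but it keeps the hit path free of bookkeeping writes.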

### Server-019

| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:183-221` |
| Status | Resolved |

**Description:** `WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` returns `yield break` (line 191) when `sessionRegistry.TryGet(request.SessionId, ...)` fails — it silently produces an empty stream with no diagnostic. The peer `AcknowledgeAsync` instead returns an `AcknowledgeAlarmReply` with `ProtocolStatus.Code = SessionNotFound` (lines 81-89), so the two methods have inconsistent missing-session handling. In production this branch is unreachable because `MxAccessGatewayService.QueryActiveAlarms` calls `ResolveSession(...)` first and throws `NotFound` from the gRPC layer (`MxAccessGatewayService.cs:228`), but: (a) the dispatcher is the seam other code paths might reach in the future, and (b) any unit test that instantiates the dispatcher directly with a missing session id sees an empty stream rather than a clear error, which is a footgun.
**Recommendation:** Either throw a `SessionManagerException(SessionManagerErrorCode.SessionNotFound, ...)` (matching the gRPC service's own resolver) or yield a single `ActiveAlarmSnapshot` with a diagnostic field set, and add a `WorkerAlarmRpcDispatcherTests` case that asserts whichever shape is chosen. Aligning with `AcknowledgeAsync`'s `SessionNotFound` protocol-status pattern is preferred, but `QueryActiveAlarms` is a server-streaming RPC so a thrown `SessionManagerException` propagated by the gateway is the cleaner fit.
**Resolution:** 2026-05-20 — Took the preferred option: `WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` (`src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs`) now throws `SessionManagerException(SessionManagerErrorCode.SessionNotFound, ...)` instead of `yield break`-ing when the session is missing. `MxAccessGatewayService.MapException` already maps that error code to gRPC `NotFound`, so production callers see a consistent missing-session response and a direct unit-test caller now gets a clear error instead of an empty success. The unary peer `AcknowledgeAsync` continues to surface the same condition as an in-band `ProtocolStatus.Code = SessionNotFound`, which is correct for a unary RPC. Regression test: `WorkerAlarmRpcDispatcherTests.QueryActiveAlarmsAsync_WhenSessionMissing_ThrowsSessionNotFound` replaces the prior `_YieldsEmpty` assertion — it asserts the new exception shape and also exercises `AcknowledgeAsync` with the same missing session id to pin the peer-method parity.
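
The difference between an empty stream and a thrown error in an async iterator can be sketched as follows. `AlarmQuerySketch` and `SessionNotFoundSketchException` are hypothetical stand-ins for the dispatcher and `SessionManagerException`, and the yielded string is a placeholder for an alarm snapshot.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var knownSessions = new HashSet<string> { "session-1" };

await foreach (string alarm in AlarmQuerySketch.QueryActiveAlarmsAsync(knownSessions, "session-1"))
{
    Console.WriteLine(alarm);
}

try
{
    await foreach (string alarm in AlarmQuerySketch.QueryActiveAlarmsAsync(knownSessions, "missing"))
    {
        Console.WriteLine(alarm);
    }
}
catch (SessionNotFoundSketchException ex)
{
    Console.WriteLine(ex.Message);
}

// Stand-in for SessionManagerException with the SessionNotFound error code.
public sealed class SessionNotFoundSketchException(string sessionId)
    : Exception($"Session '{sessionId}' was not found.");

public static class AlarmQuerySketch
{
    public static async IAsyncEnumerable<string> QueryActiveAlarmsAsync(
        IReadOnlySet<string> sessions, string sessionId)
    {
        if (!sessions.Contains(sessionId))
        {
            // Previously `yield break;`, which looked like "no active alarms".
            throw new SessionNotFoundSketchException(sessionId);
        }

        await Task.Yield();
        yield return "Provider!Group.Tag1";
    }
}
```

Note that an async iterator defers its body until enumeration starts, so the exception surfaces on the first `MoveNextAsync`, not at the call site. A test must enumerate the stream to observe it.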

### Server-020

| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor:1-2`, `…/GalaxyPage.razor:1-2`, `…/ApiKeysPage.razor:1-2`, `…/EventsPage.razor:1-2`, `…/SessionsPage.razor:1-2`, `…/WorkersPage.razor:1-2`, `…/SettingsPage.razor:1-2`, `…/SessionDetailsPage.razor:1-2` |
| Status | Resolved |

**Description:** Every dashboard page declares two `@page` directives — `@page "/X"` AND `@page "/dashboard/X"` — even though `DashboardEndpointRouteBuilderExtensions.MapGatewayDashboard` mounts the Razor components under a `RouteGroupBuilder` with `pathBase = "/dashboard"`. The group prefix is prepended to each `@page` route, so the actual endpoints become `/dashboard/X` (from `@page "/X"`) **and** `/dashboard/dashboard/X` (from `@page "/dashboard/X"`). The pages are reachable at two URLs each, and the deeper one (`/dashboard/dashboard/sessions` etc.) is almost certainly accidental — it leaks the path-base name into the URL and creates duplicate authorize/render work per route. `GatewayApplicationTests.Build_WhenDashboardEnabled_ComponentRoutesRequireAuthorization` only checks the `/dashboard/X` shape, so the duplicate route slipped through without an assertion.
**Recommendation:** Drop the `@page "/dashboard/X"` directive from each page; rely on the `MapGroup("/dashboard")` to provide the prefix. Or, if the team genuinely wants both URL shapes, document the choice in the file header and extend the route-enumeration test to assert that **both** are present (and both carry the authorization policy). Either way, the current setup is non-obvious.
**Resolution:** 2026-05-20 — Took the recommended drop: removed the redundant `@page "/dashboard/X"` directive from every dashboard Razor page (`DashboardHome.razor`, `SessionsPage.razor`, `WorkersPage.razor`, `EventsPage.razor`, `GalaxyPage.razor`, `SettingsPage.razor`, `ApiKeysPage.razor`, `SessionDetailsPage.razor`). Each page now declares only its bare route (e.g. `@page "/sessions"`); `DashboardEndpointRouteBuilderExtensions.MapGatewayDashboard` continues to prepend `/dashboard` via `MapGroup`, so each page is reachable at exactly one URL (`/dashboard/X`). Regression test: `GatewayApplicationTests.Build_WhenDashboardEnabled_DoesNotRegisterDoubledDashboardPrefixRoutes` enumerates the eight previously-doubled routes (`/dashboard/dashboard/`, `/dashboard/dashboard/sessions`, ... `/dashboard/dashboard/sessions/{SessionId}`) and asserts none of them are mapped. The existing `..._MapsBlazorDashboardAndAuthEndpoints` / `..._ComponentRoutesRequireAuthorization` tests continue to verify the desired `/dashboard/X` shapes are still present and policy-gated. No public URL contract changed (the doubled shape was accidental); no doc update needed — `gateway.md` and `docs/GatewayDashboardDesign.md` never referenced the doubled routes.

### Server-021

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:266-664`, `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs` |
| Status | Resolved |

**Description:** The 1cd51bb commit history (the bulk read/write series, `f220908`/`5e375f6`/`758aca2`) added 473 lines of constraint-filtering and reply-merging logic to `MxAccessGatewayService`: `ApplyConstraintsAsync` (line 266), `EnforceReadTagAsync` / `EnforceWriteHandleAsync`, `FilterTagBulkAsync` / `FilterReadBulkAsync` / `FilterWriteBulkAsync` / `FilterHandleBulkAsync`, the `ReplaceWriteBulkEntries` switch, and three concrete `BulkConstraintPlan` records (`SubscribeBulkConstraintPlan`, `WriteBulkConstraintPlan`, `ReadBulkConstraintPlan`) that splice denied entries back into the worker's allowed-only reply in original-index order. None of this is covered by `MxAccessGatewayServiceTests` — its `FakeSessionManager` is wired with an `AllowAllConstraintEnforcer` (line 430) that never denies anything, so every constraint-related code path is dead at test time. A subtle off-by-one in `BuildMerged`, a wrong `PayloadOneofCase` in `GetPayload` / `SetPayload`, or a missing case in `ReplaceWriteBulkEntries` would all ship without a test failure.
**Recommendation:** Add `MxAccessGatewayServiceTests` cases that inject a deny-on-glob `IConstraintEnforcer` and exercise: (1) `AddItemBulk` / `SubscribeBulk` / `AdviseItemBulk` with a mix of allowed and denied tags, asserting `BulkSubscribeReply.Results` interleaves denied and worker-allowed entries in original-index order; (2) the same for `ReadBulk` and each of the four bulk-write commands; (3) `HasAllowedItems == false` so `CreateDeniedReply` is exercised (no worker call); (4) the unary `Write`/`Write2`/`WriteSecured`/`WriteSecured2` paths through `EnforceWriteHandleAsync`. The fixtures can reuse the existing `FakeSessionManager` by replacing the constraint enforcer; no live worker is needed.
**Resolution:** 2026-05-20 — Added a configurable `PredicateConstraintEnforcer` test double (`src/MxGateway.Tests/TestSupport/PredicateConstraintEnforcer.cs`) that denies on per-tag and per-handle predicates and records denials. Added 11 new tests in `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceConstraintTests.cs` covering: (1) `AddItemBulk` with mixed denials — asserts the worker is called once with only the allowed subset and the merged reply interleaves denied and worker-allowed `SubscribeResult`s at their original indices; (2) `SubscribeBulk` with every tag denied — asserts `HasAllowedItems` short-circuits `CreateDeniedReply` and the session manager is never invoked; (3) `AdviseItemBulk` (handle-keyed denial via `CheckReadHandleAsync`); (4) `SubscribeBulk` with the allow-all enforcer — pass-through regression guard; (5) `ReadBulk` partial denial — asserts the `BulkReadConstraintPlan` produces a `BulkReadReply` (not a `BulkSubscribeReply`) with denied entries spliced in at their original indices; (6) `ReadBulk` all-denied short-circuit; (7) `WriteBulk` partial denial — asserts denied entries are dropped from the forwarded `Entries` and the merged reply preserves original-index order; (8) `WriteSecuredBulk` all-denied — proves the second `ReplaceWriteBulkEntries` switch arm is reachable; (9) unary `Write` with denied handle → `PermissionDenied`, no worker call, denial recorded; (10) unary `WriteSecured` with denied handle → `PermissionDenied`; (11) unary `AddItem` with denied tag → `PermissionDenied` (`EnforceReadTagAsync`). `MxAccessGatewayServiceTests.CreateService` updated to accept an `IConstraintEnforcer` so future tests can opt into the deny enforcer without duplicating the wiring. All 11 new tests pass; full suite (`dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj`) is green at 458 passing.
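
The splice these tests pin down, with denied entries reinserted into the worker's allowed-only reply at their original indices, can be sketched with plain strings. `BulkPlanSketch` is a hypothetical reduction of the `BulkConstraintPlan` records; the real plans operate on protobuf result messages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] tags = { "Area1.Pump", "Secret.Tag", "Area1.Valve", "Secret.Other" };
var plan = BulkPlanSketch.Create(tags, tag => !tag.StartsWith("Secret.", StringComparison.Ordinal));

// Pretend worker: replies only for the allowed subset, in the order it was sent.
List<string> workerReplies = plan.AllowedTags.Select(tag => $"ok:{tag}").ToList();

foreach (string result in plan.Merge(workerReplies))
{
    Console.WriteLine(result);
}

public sealed class BulkPlanSketch
{
    private readonly string[] _tags;
    private readonly bool[] _allowed;

    private BulkPlanSketch(string[] tags, bool[] allowed)
    {
        _tags = tags;
        _allowed = allowed;
    }

    public static BulkPlanSketch Create(IReadOnlyList<string> tags, Func<string, bool> isAllowed)
    {
        string[] copy = tags.ToArray();
        return new BulkPlanSketch(copy, copy.Select(isAllowed).ToArray());
    }

    // When false, the caller can build an all-denied reply without calling the worker.
    public bool HasAllowedItems => _allowed.Any(allowed => allowed);

    // The subset forwarded to the worker, in original relative order.
    public IReadOnlyList<string> AllowedTags => _tags.Where((_, index) => _allowed[index]).ToList();

    public IReadOnlyList<string> Merge(IReadOnlyList<string> workerReplies)
    {
        if (workerReplies.Count != _allowed.Count(allowed => allowed))
        {
            throw new InvalidOperationException("Worker reply count does not match the allowed subset.");
        }

        var merged = new List<string>(_tags.Length);
        int nextReply = 0;
        for (int index = 0; index < _tags.Length; index++)
        {
            // Allowed slots consume worker replies in order; denied slots are synthesized.
            merged.Add(_allowed[index] ? workerReplies[nextReply++] : $"denied:{_tags[index]}");
        }

        return merged;
    }
}
```

An off-by-one in the `nextReply` cursor is exactly the kind of defect an allow-all enforcer can never expose, because with no denials the merged reply is the worker reply unchanged.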

### Server-022

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs:8-29` |
| Status | Resolved |

**Description:** Server-014's resolution noted that the stale "PR A.6 / A.7" / "not yet wired" language was rewritten on `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms` and on the `WorkerAlarmRpcDispatcher` class doc. The corresponding XML doc on the **interface** `IAlarmRpcDispatcher` (lines 8-29) still says it is "PR A.6 / A.7 — gateway-side dispatcher" and that "Production implementations live in `WorkerAlarmRpcDispatcher` (this PR ships a not-yet-wired default that returns a clear worker-pending diagnostic)". That second clause directly contradicts the now-correct comments on the concrete implementations and on the gRPC service: `WorkerAlarmRpcDispatcher` is the wired default, not a not-yet-wired one. A reader who finds the interface first will believe the dispatcher is non-functional.
**Recommendation:** Rewrite the `IAlarmRpcDispatcher` `<remarks>` block to match the language now used on `WorkerAlarmRpcDispatcher` and on the gRPC service: DI binds `WorkerAlarmRpcDispatcher` by default; `NotWiredAlarmRpcDispatcher` is only the null fallback for tests/DI omission. Drop the "PR A.6 / A.7" prefix from the `<summary>` — the interface is now the public alarm-RPC seam.
**Resolution:** 2026-05-20 — Rewrote `IAlarmRpcDispatcher`'s `<summary>` and `<remarks>` (`src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs`) to match the language now used on `WorkerAlarmRpcDispatcher` and on `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms`: dropped the stale "PR A.6 / A.7" prefix from the summary, and replaced the "this PR ships a not-yet-wired default that returns a clear worker-pending diagnostic" clause with the correct statement that DI binds the production `WorkerAlarmRpcDispatcher` by default and `NotWiredAlarmRpcDispatcher` is only the null fallback for DI omission / standalone tests. Pure documentation change; no test.

### Server-023

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Server/Sessions/NotWiredAlarmRpcDispatcher.cs:10-26` |
| Status | Resolved |

**Description:** Server-014 and Server-022 swept the stale "PR A.6 / A.7" / "not-yet-wired" / "worker-pending" language off `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms`, `WorkerAlarmRpcDispatcher`, and `IAlarmRpcDispatcher`. The concrete `NotWiredAlarmRpcDispatcher` class XML doc was not updated as part of either fix and still reads: *"PR A.6 / A.7 — default `IAlarmRpcDispatcher` shipped while the worker-side AlarmClient event subscription is gated on dev-rig validation"* and *"When the worker dispatcher (PR A.6/A.7 dev-rig follow-up) lands, `WorkerAlarmRpcDispatcher` replaces this implementation in the DI container"*. That is the exact prose the other sweeps removed, and it directly contradicts the now-current narrative everywhere else: `SessionServiceCollectionExtensions.AddGatewaySessions` registers `WorkerAlarmRpcDispatcher` as the default `IAlarmRpcDispatcher`; `NotWiredAlarmRpcDispatcher` is only the null fallback used when no dispatcher is registered (DI omission / standalone tests). The diagnostic string returned by `AcknowledgeAsync` (line 39) — `"the worker-side AlarmClient consumer (PR A.5) is in place but the dispatcher hookup is gated on validating the AVEVA alarm-provider event subscription on the dev rig"` — is also stale; the dispatcher hookup landed and any client that actually sees that diagnostic today is hitting the null-fallback path, not the dev-rig gate it describes.
|
||
|
||
**Recommendation:** Replace the `<summary>` and `<remarks>` on `NotWiredAlarmRpcDispatcher` with text that matches the language now used on the interface and `WorkerAlarmRpcDispatcher` — "null fallback `IAlarmRpcDispatcher` used when no dispatcher is registered (DI omission / standalone tests); production wires `WorkerAlarmRpcDispatcher`." Either drop the `AcknowledgeAsync` diagnostic string's dev-rig framing entirely or shorten it to "alarm dispatcher is not registered." `#pragma warning disable CS1998` on `QueryActiveAlarmsAsync` is correct here (empty stream is intentional for the null fallback) and should stay.
|
||
|
||
**Resolution:** 2026-05-20 — Rewrote `NotWiredAlarmRpcDispatcher` summary/remarks as the null-fallback dispatcher and shortened the `AcknowledgeAsync` diagnostic to "Alarm dispatcher is not registered."; updated the two tests that asserted the old "worker"-prefixed diagnostic.

### Server-024

| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:56-77` |
| Status | Resolved |

**Description:** `GetOrCreateRegex`'s race-loser branch reads `RegexCache[glob]` with an indexer (line 76) after `TryAdd` returned `false`. The indexer throws `KeyNotFoundException` if the key is missing. Under the new bounded cache (Server-018), there is a real — if narrow — race where the key vanishes between the failing `TryAdd` and the indexer read: thread A and thread B both compile a `Regex` for `glob`; A's `TryAdd` succeeds, A enqueues + enters `EvictIfOverCapacity`, the eviction loop dequeues `glob` (because some other thread had already enqueued + evicted enough that `glob` is now the oldest entry) and removes it; thread B's `TryAdd` then returns false, B reads `RegexCache[glob]`, and the indexer throws. The window is tiny but nonzero — eviction is approximate FIFO, and a hot pattern that is repeatedly re-added near the cap is the natural trigger. The pre-Server-018 code used `GetOrAdd`, which had no such race because the dictionary handled the rebuild atomically.

**Recommendation:** Replace the `TryAdd` + indexer pair with `RegexCache.GetOrAdd(glob, _ => compiled)` so the dictionary atomically returns whichever instance won. Track the new insertion only when `GetOrAdd` returns the locally-compiled instance (`ReferenceEquals(result, compiled)`), then enqueue + evict. Alternatively, swap the trailing indexer read for `TryGetValue` + recursive recompile on miss. Add a stress test that mixes repeated reads of a single hot pattern with a flood of unique patterns near the cap and asserts no exception escapes `IsMatch`.

**Resolution:** 2026-05-20 — Replaced the `TryAdd` + indexer pair with `RegexCache.GetOrAdd(glob, compiled)`; FIFO enqueue + eviction now run only when `ReferenceEquals(result, compiled)` (i.e. our caller was the inserter), eliminating the post-eviction `KeyNotFoundException` window.
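
The resolved pattern can be sketched roughly as follows. This is a minimal illustration, not the actual `GalaxyGlobMatcher` code: the class name `GlobRegexCache`, the cap of 4, the `InsertionOrder` queue, and the glob-to-regex translation are all assumptions made for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the bounded regex cache described in this finding.
public static class GlobRegexCache
{
    private const int MaxEntries = 4; // assumed cap, kept tiny for illustration
    private static readonly ConcurrentDictionary<string, Regex> Cache = new(StringComparer.Ordinal);
    private static readonly ConcurrentQueue<string> InsertionOrder = new();

    public static int Count => Cache.Count;

    public static Regex GetOrCreate(string glob)
    {
        if (Cache.TryGetValue(glob, out var cached))
        {
            return cached;
        }

        // Assumed glob grammar: '*' matches any run, '?' matches one character.
        string pattern = "^" + Regex.Escape(glob).Replace("\\*", ".*").Replace("\\?", ".") + "$";
        var compiled = new Regex(pattern, RegexOptions.CultureInvariant);

        // GetOrAdd returns whichever instance is stored, so a race loser never
        // performs a second lookup that could miss after an eviction.
        Regex result = Cache.GetOrAdd(glob, compiled);

        // Only the caller whose instance was stored records the insertion and evicts.
        if (ReferenceEquals(result, compiled))
        {
            InsertionOrder.Enqueue(glob);
            while (Cache.Count > MaxEntries && InsertionOrder.TryDequeue(out var oldest))
            {
                Cache.TryRemove(oldest, out _);
            }
        }

        return result;
    }
}
```

Even if the entry is evicted immediately after `GetOrAdd` returns, the caller already holds a usable `Regex`, so no lookup can throw.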

### Server-025

| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:19-25`, `src/MxGateway.Server/Galaxy/IGalaxyRepository.cs` |
| Status | Resolved |

**Description:** The Tests-016 fix introduced `IGalaxyRepository` so `GalaxyHierarchyCache` could be unit-tested against an in-memory fake, and `GalaxyHierarchyCache` was updated to depend on the interface. `GalaxyRepositoryGrpcService` was not updated and still receives the concrete `GalaxyDb.GalaxyRepository` via its primary constructor. Functionally this is fine — DI registers the concrete singleton and a thin `sp.GetRequiredService<GalaxyRepository>()` forwarder for the interface — but the seam is now half-applied: a future caller that wants to test or stub the gRPC service's `TestConnection` path has to construct a real `GalaxyRepository` against a SQL connection string, defeating the abstraction `IGalaxyRepository` was introduced for. The pattern also creates an inconsistency for new readers — two consumers in the same namespace, one on the interface and one on the concrete.

**Recommendation:** Change `GalaxyRepositoryGrpcService`'s `repository` parameter to `IGalaxyRepository`. No DI change is needed (both forwarders already resolve to the same singleton). Optionally drop the concrete singleton registration and register the interface directly.

**Resolution:** 2026-05-20 — Changed `GalaxyRepositoryGrpcService`'s `repository` primary-constructor parameter from the concrete `GalaxyRepository` to `IGalaxyRepository`; existing DI registration in `GalaxyRepositoryServiceCollectionExtensions` already resolves both the concrete and interface to the same singleton.

### Server-026

| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Configuration/GatewayOptionsValidator.cs:17-32`, `src/MxGateway.Server/Configuration/AlarmsOptions.cs` |
| Status | Resolved |

**Description:** `GatewayOptions.Alarms` is bound from `MxGateway:Alarms` and consumed by `SessionManager.TryAutoSubscribeAlarmsAsync` (per-session SubscribeAlarms on Ready). `GatewayOptionsValidator.Validate` validates every other section (`Authentication`, `Ldap`, `Worker`, `Sessions`, `Events`, `Dashboard`, `Protocol`) but has no `ValidateAlarms` arm — `AlarmsOptions` is silently accepted regardless of contents. The runtime mitigates this by logging a warning when `Enabled = true` but neither `SubscriptionExpression` nor `DefaultArea` is set, then either faulting open-session (`RequireSubscribeOnOpen = true`) or skipping auto-subscribe — a configuration error therefore surfaces per-session at runtime instead of at startup. Other sections fail-fast at `ValidateOnStart()`, so the inconsistency makes alarm misconfiguration discoverable only after a client hits the gateway. A misformatted `SubscriptionExpression` (no `\\<host>\Galaxy!<area>` shape) likewise passes validation; the worker rejects it later.

**Recommendation:** Add a `ValidateAlarms(options.Alarms, failures)` arm in `GatewayOptionsValidator`. When `Enabled = true`, require either a non-blank `SubscriptionExpression` or a non-blank `DefaultArea`; when `SubscriptionExpression` is provided, sanity-check that it starts with `\\` (the AVEVA UNC subscription shape) — or document that the shape is left to the worker to validate. Either way, treat the configuration as part of the validated surface.

**Resolution:** 2026-05-20 — Added `ValidateAlarms` to `GatewayOptionsValidator`: when `Enabled = true`, requires a non-blank `SubscriptionExpression` or `DefaultArea`, and when `SubscriptionExpression` is provided, requires it to start with `\\` (canonical UNC subscription shape). Alarm misconfiguration now fails fast at startup instead of per-session.
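
A minimal sketch of the two rules described above, under stated assumptions: the `AlarmsOptions` type here is a cut-down stand-in with only the three properties this finding names, the failure messages are invented for illustration, and gating the prefix check on `Enabled` is a guess rather than confirmed behaviour.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical cut-down mirror of AlarmsOptions (only the properties named in this finding).
public sealed class AlarmsOptions
{
    public bool Enabled { get; set; }
    public string? SubscriptionExpression { get; set; }
    public string? DefaultArea { get; set; }
}

public static class AlarmsValidationSketch
{
    public static void ValidateAlarms(AlarmsOptions alarms, List<string> failures)
    {
        // Assumption: a disabled section is not inspected at all.
        if (!alarms.Enabled)
        {
            return;
        }

        bool hasExpression = !string.IsNullOrWhiteSpace(alarms.SubscriptionExpression);
        bool hasArea = !string.IsNullOrWhiteSpace(alarms.DefaultArea);

        // Rule 1: an enabled section needs something to subscribe to.
        if (!hasExpression && !hasArea)
        {
            failures.Add("MxGateway:Alarms needs SubscriptionExpression or DefaultArea when Enabled is true.");
        }

        // Rule 2: a provided expression must start with the UNC-style "\\" prefix.
        if (hasExpression && !alarms.SubscriptionExpression!.StartsWith(@"\\", StringComparison.Ordinal))
        {
            failures.Add(@"MxGateway:Alarms:SubscriptionExpression must start with \\.");
        }
    }
}
```

Collecting failures into a list, rather than throwing on the first one, matches the `ValidateAlarms(options.Alarms, failures)` shape in the recommendation and lets startup report every problem at once.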

### Server-027

| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `docs/Authorization.md:120-141,176-181` |
| Status | Resolved |

**Description:** Two parts of `docs/Authorization.md` drifted from `GatewayGrpcScopeResolver.ResolveCommandScope` and from `MxAccessGatewayService.ApplyConstraintsAsync` over the bulk-read/bulk-write series (`f220908`/`5e375f6`/`758aca2`) and were not updated by the Server-017 / Server-021 fixes:

1. The `ResolveCommandScope` code snippet at lines 120-141 still shows only `Write`/`Write2` against `InvokeWrite` and `WriteSecured`/`WriteSecured2`/`AuthenticateUser` against `InvokeSecure`. The actual resolver also maps `MxCommandKind.WriteBulk`, `MxCommandKind.Write2Bulk`, `MxCommandKind.WriteSecuredBulk`, and `MxCommandKind.WriteSecured2Bulk`. A reader believing the snippet would conclude the bulk-write families inherit the fail-closed admin scope, when in fact they correctly map to `InvokeWrite` / `InvokeSecure` (the Scope Catalog table at lines 199-200 lists them).
2. The Constraint Enforcement section (lines 176-181) says: *"The service checks read constraints for `AddItem`, `AddItem2`, `AddItemBulk`, `SubscribeBulk`, and `AdviseItemBulk`. It checks write constraints for `Write`, `Write2`, `WriteSecured`, and `WriteSecured2`."* The actual `ApplyConstraintsAsync` switch also enforces constraints for `ReadBulk` (read scope), `WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` (write scope, per-entry filtering with index-order merge). Server-021 added test coverage for all of these without touching the doc.

**Recommendation:** Update the `ResolveCommandScope` snippet to include the four bulk-write arms. Update the Constraint Enforcement prose to enumerate the bulk read/write commands that are actually filtered, and reference the per-entry index-ordered merge that `BulkConstraintPlan.MergeDeniedInto` performs. Adding `ReadBulk` to the `InvokeRead` row of the Scope Catalog would also be useful — the table currently lists `Register`/`AddItem`/`Advise` against `InvokeRead` but not `ReadBulk`.

**Resolution:** 2026-05-20 — Updated the `ResolveCommandScope` snippet in `docs/Authorization.md` to enumerate the four bulk-write arms (`WriteBulk`/`Write2Bulk` against `InvokeWrite`, `WriteSecuredBulk`/`WriteSecured2Bulk` against `InvokeSecure`); expanded the Constraint Enforcement prose to list `ReadBulk` and all four bulk-write commands and to call out `BulkConstraintPlan.MergeDeniedInto`'s index-ordered merge; added `ReadBulk` to the `InvokeRead` row of the Scope Catalog.
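
For orientation, the command-to-scope mapping this finding describes has roughly the following shape. It is a sketch under assumptions, not the resolver or the doc snippet itself: the enum lists only the command kinds named in these findings, the scope string values are placeholders, and the `InvokeRead` default arm follows the description in Server-028.

```csharp
using System;

// Only the command kinds named in these findings; the real enum is larger.
public enum MxCommandKind
{
    Register, AddItem, Advise, ReadBulk,
    Write, Write2, WriteBulk, Write2Bulk,
    WriteSecured, WriteSecured2, WriteSecuredBulk, WriteSecured2Bulk,
    AuthenticateUser,
}

public static class GatewayScopes
{
    // Placeholder values: the real scope strings are not given in this finding.
    public const string InvokeRead = "invoke-read";
    public const string InvokeWrite = "invoke-write";
    public const string InvokeSecure = "invoke-secure";
}

public static class ScopeResolverSketch
{
    public static string ResolveCommandScope(MxCommandKind kind) => kind switch
    {
        // Plain and bulk writes share the write scope.
        MxCommandKind.Write or MxCommandKind.Write2
            or MxCommandKind.WriteBulk or MxCommandKind.Write2Bulk => GatewayScopes.InvokeWrite,

        // Secured writes (plain and bulk) and AuthenticateUser share the secure scope.
        MxCommandKind.WriteSecured or MxCommandKind.WriteSecured2
            or MxCommandKind.WriteSecuredBulk or MxCommandKind.WriteSecured2Bulk
            or MxCommandKind.AuthenticateUser => GatewayScopes.InvokeSecure,

        // Everything else, including ReadBulk, falls through to the read scope.
        _ => GatewayScopes.InvokeRead,
    };
}
```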

### Server-028

| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/MxGateway.Tests/Security/Authorization/GatewayGrpcScopeResolverTests.cs:13-20`, `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs` |
| Status | Resolved |

**Description:** Two narrow test gaps were not closed by Server-017 / Server-015:

1. `GatewayGrpcScopeResolverTests.ResolveRequiredScope_KnownRpcRequest_ReturnsExpectedScope` enumerates `OpenSessionRequest`, `CloseSessionRequest`, `StreamEventsRequest`, `AcknowledgeAlarmRequest`, `QueryActiveAlarmsRequest`, `TestConnectionRequest`, `GetLastDeployTimeRequest`, and `DiscoverHierarchyRequest`. `WatchDeployEventsRequest` is missing even though it is named in the resolver's metadata-read arm and listed in the Scope Catalog. Similarly, the `ResolveRequiredScope_InvokeCommand_ReturnsExpectedScope` matrix covers every other write/secure/bulk command but omits `MxCommandKind.ReadBulk`, which is the only bulk family that falls into the `_ => GatewayScopes.InvokeRead` default arm. A regression that drops `WatchDeployEvents` from the request switch or that adds a new mapping for `ReadBulk` would not be caught.
2. `GatewaySessionTests` (added under Server-015 / Server-016) covers the `TransitionTo(Ready)` and `MarkFaulted(post-Close)` cases but does not cover the third edge that Server-015's tightened state machine permits: `MarkFaulted` issued while `CloseAsync` is parked between `TryBeginClose` (Closing) and `MarkClosed` (Closed). The current `MarkFaulted` (`GatewaySession.cs:314-326`) checks only for `Closed`, so it overwrites `Closing` → `Faulted`; the subsequent `MarkClosed` then overwrites `Faulted` → `Closed` while `_finalFault` is preserved. The behaviour is consistent with the docs ("Closing only allows a transition to Closed or Faulted") but the test bundle does not pin it, and a future tightening of `MarkFaulted` could silently regress.

**Recommendation:** Extend `GatewayGrpcScopeResolverTests.ResolveRequiredScope_KnownRpcRequest_ReturnsExpectedScope` with `[InlineData(typeof(WatchDeployEventsRequest), GatewayScopes.MetadataRead)]` and extend the command theory with `[InlineData(MxCommandKind.ReadBulk, GatewayScopes.InvokeRead)]`. Add a `GatewaySessionTests.MarkFaulted_DuringInFlightClose_PreservesFaultButYieldsToClose` case using `BlockingShutdownWorkerClient` to park `CloseAsync`, call `MarkFaulted` while parked, release the worker, and assert `State == Closed && FinalFault == "<the fault reason>"`.

**Resolution:** 2026-05-20 — Added `[InlineData(typeof(WatchDeployEventsRequest), GatewayScopes.MetadataRead)]` to `GatewayGrpcScopeResolverTests.ResolveRequiredScope_KnownRpcRequest_ReturnsExpectedScope` (the `ReadBulk` arm was already present); added `GatewaySessionTests.MarkFaulted_DuringInFlightClose_PreservesFaultButYieldsToClose` covering the parked-close + `MarkFaulted` interleave and asserting the post-release state is `Closed` with `FinalFault = "concurrent-fault"`.
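
The interleave pinned by the new session test can be modelled with a small state machine. This is an illustrative sketch only: `SessionStateSketch` is a hypothetical reduction of `GatewaySession` to the three transitions named above, and the `TryBeginClose` guard is an assumption.

```csharp
#nullable enable
using System;

public enum SessionState { Ready, Closing, Closed, Faulted }

// Hypothetical reduction of GatewaySession to the transitions discussed in this finding.
public sealed class SessionStateSketch
{
    private readonly object _gate = new();
    private SessionState _state = SessionState.Ready;
    private string? _finalFault;

    public SessionState State { get { lock (_gate) { return _state; } } }
    public string? FinalFault { get { lock (_gate) { return _finalFault; } } }

    public bool TryBeginClose()
    {
        lock (_gate)
        {
            // Assumed guard: only one caller may start a close.
            if (_state is SessionState.Closing or SessionState.Closed) return false;
            _state = SessionState.Closing;
            return true;
        }
    }

    public void MarkFaulted(string reason)
    {
        lock (_gate)
        {
            // Only Closed is terminal for faults, so Closing may be overwritten.
            if (_state == SessionState.Closed) return;
            _state = SessionState.Faulted;
            _finalFault = reason;
        }
    }

    public void MarkClosed()
    {
        // Closed wins over Faulted, but the recorded fault reason is kept.
        lock (_gate) { _state = SessionState.Closed; }
    }
}
```

Calling `TryBeginClose()`, then `MarkFaulted("concurrent-fault")`, then `MarkClosed()` on this sketch ends in `Closed` with the fault reason still readable, which is the outcome the regression test asserts against the real session.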

### Server-029

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:52-58` |
| Status | Resolved |

**Description:** `OpenSession` advertises capabilities the gateway supports so clients can branch on them. The current list is `unary-open-session`, `unary-close-session`, `unary-invoke`, `server-stream-events`, `bulk-subscribe-commands`, `unary-acknowledge-alarm`, `server-stream-active-alarms`. The `bulk-subscribe-commands` token was added for the `AddItemBulk` / `AdviseItemBulk` / `RemoveItemBulk` / `UnAdviseItemBulk` / `SubscribeBulk` / `UnsubscribeBulk` family. The subsequent `ReadBulk` and `WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` families landed without a corresponding capability token — the contract advertises bulk-subscribe support but is silent on bulk-read and bulk-write. A defensive client that gates on `bulk-write-commands` before issuing a `WriteBulk` has no signal that the family is supported; current clients sidestep this by ignoring the list entirely, but that just shifts the failure mode (an old client against a new server, or vice versa, will see `Unimplemented` instead of a structured `Capabilities` mismatch).

**Recommendation:** Either (a) extend the advertised list with `bulk-read-commands` and `bulk-write-commands` (`WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` collectively), or (b) document in `gateway.md` and `docs/Contracts.md` that `Capabilities` is informational only and not the contract version. Option (a) is the simplest forward-compatible fix and keeps the capability token shape clients are already familiar with.

**Resolution:** 2026-05-20 — Extended the `OpenSession` capabilities list with `bulk-read-commands` and `bulk-write-commands` alongside the existing `bulk-subscribe-commands` token, so clients that gate on capability strings have an explicit signal for the bulk-read and bulk-write families.
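
A defensive client could gate on the new tokens along these lines. The sketch is hypothetical: `CapabilityGate` is not part of any client library, and the order of the tokens in the advertised list is assumed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical client-side helper; not part of the gateway or its client libraries.
public static class CapabilityGate
{
    // Tokens named in this finding after the fix; their order here is assumed.
    public static readonly string[] Advertised =
    {
        "unary-open-session", "unary-close-session", "unary-invoke",
        "server-stream-events", "bulk-subscribe-commands",
        "bulk-read-commands", "bulk-write-commands",
        "unary-acknowledge-alarm", "server-stream-active-alarms",
    };

    public static bool Supports(IEnumerable<string> capabilities, string token)
    {
        // Capability tokens are treated as exact, case-sensitive strings.
        var set = new HashSet<string>(capabilities, StringComparer.Ordinal);
        return set.Contains(token);
    }
}
```

A client would check `Supports(reply.Capabilities, "bulk-write-commands")` before issuing `WriteBulk` and fall back to per-item writes otherwise, where `reply` stands for whatever object carries the `OpenSession` response.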

### Server-030

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:952-980` |
| Status | Resolved |

**Description:** Surfaced during the 2026-05-20 cross-language e2e run against a redeployed gateway (`a020350`). The Java client got 55 of 120 `AddItem` calls in, then `Advise` returned `Session session-de7728a290bd41028ad6fec81e233144 is not ready. Current state is Ready.` — a self-contradictory diagnostic. The check in `GetReadyWorkerClient` (`GatewaySession.cs:956`) is `_state != SessionState.Ready || _workerClient?.State != WorkerClientState.Ready`, but the formatted message only includes `_state`. When the gateway-side session state is `Ready` but the worker client's own `WorkerClientState` has transitioned (heartbeat watchdog firing, pipe disconnect detected by the read loop, etc.) before the session-level reaction observes it, the in-flight RPC fails fast here — and the operator sees a message that doesn't tell them which side of the gate the failure is on. The two-state gap itself is a real race (the worker-side state can shift independently of the gateway-driven session state) but a clear diagnostic is the prerequisite for diagnosing it; without it, a future investigation will start from "it says Ready but it's not Ready" instead of "the worker is Handshaking / Closing / Faulted while the session is still Ready".

**Recommendation:** Format both states into the exception message — `Session {SessionId} is not ready. Session state is {_state}; worker state is {workerClientState}.` (or `"<no worker>"` when `_workerClient` is null). Document on the method that the two states can diverge under load and that this branch is the fail-fast for that case. Add a regression test that flips `FakeWorkerClient.State` to a non-Ready value (e.g. `Handshaking`) while the session is `Ready` and asserts both pieces of state appear in the thrown `SessionManagerException.Message`. The deeper race investigation (should the gateway briefly wait for worker-Ready before failing? when does `WorkerClient.State` legitimately shift while the session is still `Ready`?) is out of scope for this finding but is worth a follow-up.

**Resolution:** 2026-05-20 — Rewrote `GetReadyWorkerClient` so the `SessionManagerException` message includes both `_state` and `_workerClient.State` (or `"<no worker>"` for the null case): `"Session {SessionId} is not ready. Session state is {_state}; worker state is {workerState}."`. Added XML doc on the method explaining the two-state contract and that this branch is the fail-fast for a state-divergence race. Added regression test `SessionManagerTests.InvokeAsync_WhenWorkerNotReadyButSessionReady_DiagnosticIncludesBothStates` that sets `FakeWorkerClient.State = WorkerClientState.Handshaking` while the session is `Ready` and asserts both `"Session state is Ready"` and `"worker state is Handshaking"` appear in the message; the test also pins `InvokeCount == 0` so the worker isn't called. The deeper race (should `GetReadyWorkerClient` retry briefly when state has just diverged?) remains open for follow-up.
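
The two-state gate and its diagnostic can be sketched as below. The enums list only the states named in this finding, `ReadyGateSketch` is a hypothetical helper, and `InvalidOperationException` stands in for `SessionManagerException`, whose constructor is not shown here.

```csharp
#nullable enable
using System;

// Only the states named in this finding; the real enums may have more members.
public enum SessionState { Ready, Closing, Closed, Faulted }
public enum WorkerClientState { Handshaking, Ready, Closing, Faulted }

public static class ReadyGateSketch
{
    public static void EnsureReady(string sessionId, SessionState sessionState, WorkerClientState? workerState)
    {
        // Both sides must be Ready; a null worker state models a missing worker client.
        if (sessionState == SessionState.Ready && workerState == WorkerClientState.Ready)
        {
            return;
        }

        string worker = workerState?.ToString() ?? "<no worker>";

        // Stand-in exception type; the gateway throws SessionManagerException here.
        throw new InvalidOperationException(
            $"Session {sessionId} is not ready. Session state is {sessionState}; worker state is {worker}.");
    }
}
```

With both states in the message, the divergence case reads as "Session state is Ready; worker state is Handshaking" rather than the earlier self-contradictory text.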

### Server-031

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClient.cs:392-443` (gateway-side heartbeat watchdog); `src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClientOptions.cs:14-67` (new `HeartbeatStuckCeiling` option) |
| Status | Resolved |

**Description:** Surfaced during the 2026-05-20 cross-language e2e re-run against gateway `b794c46`. The .NET phase succeeded through `open-session`/`register`/`bulk-subscribe`/`bulk-read`/`bulk-unsubscribe`/`stream-events`/`write` but then failed on its third `advise` call with the Server-030 diagnostic `Session ... is not ready. Session state is Ready; worker state is Faulted.` The gateway stdout log records the underlying cause: **`Worker client faulted for session session-01a1a07fa59c489983a719821fa46e72: Worker heartbeat expired. Last heartbeat was at 2026-05-20T17:20:39.+00:00.`** — a real 15s+ gap with no `WorkerHeartbeat` envelope arriving from the worker.

Investigation paths:

1. **Shared `_writeLock` on the worker side.** `WorkerFrameWriter` serializes every pipe write (heartbeats, command replies, events, faults) through a single `SemaphoreSlim _writeLock` (`WorkerFrameWriter.cs:14`, `:67-76`). `RunEventDrainLoopAsync` (`WorkerPipeSession.cs:336-372`) writes events one at a time inside a `foreach`, each call to `_writer.WriteAsync` re-acquiring `_writeLock`. If the gateway-side read drains slowly and the OS-level named-pipe buffer fills, `_stream.WriteAsync` (`WorkerFrameWriter.cs:70`) blocks. The event-drain loop blocks holding the lock. `RunHeartbeatLoopAsync` (`WorkerPipeSession.cs:611-613`) then can't acquire `_writeLock` to send its 5s heartbeat. Heartbeats stall past the gateway's `HeartbeatGrace` (15s default) and `WorkerClient.HeartbeatLoopAsync` faults the session.

2. **No prioritization between heartbeats and events.** Even without OS-level back-pressure, a backlog of events in the worker's `MxAccessEventQueue` (drained in batches of `EventDrainBatchSize`) can keep the writer lock held for many milliseconds at a time. Heartbeats can be delayed (though normally not past `HeartbeatGrace` unless something else is wrong).

3. **Gateway-side heartbeat watchdog ignores in-flight commands.** `WorkerClient.HeartbeatLoopAsync` (`WorkerClient.cs:392-422`) checks only `_state == Ready` and `now - lastHeartbeatAt > HeartbeatGrace`. It does not check whether a command is in flight on the gateway↔worker pipe. The mirror of Worker-017's fix (worker-side watchdog skips `StaHung` while a command is in flight) does not exist on the gateway side.

The .NET test pattern stresses the issue uniquely because each `dotnet run --project` rebuild between subcommands introduces multi-second client-side gaps; the worker's heartbeat path should still be alive (heartbeats are emitted by `RunHeartbeatLoopAsync` independently of gateway activity), but if the gateway is also blocked draining events from the channel into a non-existent `StreamEvents` consumer, the back-pressure-into-heartbeat chain bites first.

**Recommendation:** Two changes worth landing together:

1. **Decouple heartbeat writes from the event/reply lock.** Either (a) give heartbeats their own pipe `Stream` (likely impractical — one pipe per session), (b) introduce a priority queue in front of `WorkerFrameWriter` so heartbeats hop the line, or (c) interleave heartbeat checks inside `RunEventDrainLoopAsync` (e.g., after each event-batch write, post a heartbeat envelope if one is due). Option (c) is the smallest change.

2. **Mirror Worker-017's "skip-while-command-in-flight" guard on the gateway side.** In `WorkerClient.HeartbeatLoopAsync`, when `_pendingCommands.Count > 0` and the oldest pending command is younger than some ceiling (e.g., 5× `HeartbeatGrace`), skip the fault. The worker may be busy executing a slow STA command and the heartbeat write may be queued behind a long event burst — neither indicates the worker is actually hung.

Add a regression test that floods the worker's outbound event channel (e.g., via a high-rate STA fixture or a mock event source emitting at > 1000 events/s for several seconds) and asserts the worker is not faulted while the gateway has no `StreamEvents` consumer attached.

**Resolution:** 2026-05-24 — Re-triaged at HEAD `d2d2e5f`: the gateway-side "skip-while-command-in-flight" guard (recommendation #2) is already implemented and verified against source. `WorkerClient.HeartbeatLoopAsync` (`src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClient.cs:403-443`) now skips the `HeartbeatExpired` fault while `TryGetOldestPendingCommandAge` reports an in-flight command younger than `WorkerClientOptions.HeartbeatStuckCeiling` (default 75s = 5× `HeartbeatGrace`). Once the oldest pending command exceeds the ceiling the watchdog fires anyway, so a truly stuck COM call doesn't hide the worker forever. The new `HeartbeatStuckCeiling` option is documented inline with a back-reference to Worker-023, the worker-side mirror. Regression tests in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs`: `HeartbeatMonitor_WhenCommandInFlightWithinCeiling_DoesNotFaultOnExpiredHeartbeat` (the named scenario — parks an unanswered `InvokeAsync` past `HeartbeatGrace` but well within `HeartbeatStuckCeiling` and asserts the client stays `Ready`) and `HeartbeatMonitor_WhenPendingCommandExceedsStuckCeiling_FaultsClient` (advances past the ceiling and asserts the watchdog still fires). Recommendation #1 (decoupling the worker-side `_writeLock`) is the worker module's concern and is tracked by Worker-017 / Worker-023 — out of scope for the Server module here.
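
The watchdog decision described in the resolution reduces to a small pure function. The sketch below is an assumption-laden illustration, not the `HeartbeatLoopAsync` body: `HeartbeatWatchdogSketch.ShouldFault` is a hypothetical name, a `null` age stands for "no command in flight", and the behaviour at exactly the ceiling is a guess.

```csharp
using System;

// Hypothetical extraction of the fault decision; the real loop also checks client state.
public static class HeartbeatWatchdogSketch
{
    public static bool ShouldFault(
        TimeSpan sinceLastHeartbeat,
        TimeSpan? oldestPendingCommandAge, // null models "no command in flight"
        TimeSpan heartbeatGrace,
        TimeSpan heartbeatStuckCeiling)
    {
        // Heartbeat is still fresh: nothing to do.
        if (sinceLastHeartbeat <= heartbeatGrace)
        {
            return false;
        }

        // Heartbeat expired, but a command younger than the ceiling is in flight:
        // the worker may just be busy, so skip the fault for now.
        if (oldestPendingCommandAge is TimeSpan age && age < heartbeatStuckCeiling)
        {
            return false;
        }

        // Either nothing is in flight, or the oldest command has exceeded the ceiling.
        return true;
    }
}
```

With the documented defaults (15s grace, 75s ceiling), a 20s heartbeat gap faults an idle worker, is tolerated while a 30s-old command is pending, and faults again once that command is 80s old.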

### Server-032

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClient.cs:510-569` (gateway-side `_events` channel); `src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClientOptions.cs:45-53` (`EventChannelFullModeTimeout`) |
| Status | Resolved |

**Description:** Surfaced during the 2026-05-20 cross-language e2e re-run against gateway `b794c46`. The Java phase advised ~55 items (`item-handle 63`) before failing on the next `advise` call with the Server-030 diagnostic `Session ... is not ready. Session state is Ready; worker state is Faulted.`. The gateway stdout log records: **`Worker client faulted for session session-adfcc808da974808947e87db060c2b03: Worker event channel rejected an event.`** — the gateway-side per-session bounded event channel filled up and `Channel.Writer.TryWrite` returned `false`, triggering the fail-fast path in `EnqueueWorkerEventAsync` (`WorkerClient.cs:467-484`).

The channel is configured as `Channel.CreateBounded<WorkerEvent>(new BoundedChannelOptions(EventChannelCapacity) { ... FullMode = BoundedChannelFullMode.Wait ... })` (capacity defaults to `EventOptions.QueueCapacity = 10_000`). But `EnqueueWorkerEventAsync` uses **`TryWrite`** (non-blocking), so the configured `Wait` mode is moot — the writer always fails fast when full. This is consistent with `docs/DesignDecisions.md`'s "fail-fast event backpressure" policy (one subscriber per session, no producer-side queuing beyond the channel), but two facts make it sharp in practice:

1. The e2e flow (and any realistic client) `advise`s many items BEFORE opening a long-running `StreamEvents` consumer. With no consumer, events accumulate at the in-rate (driven by the SCADA tags' change frequency). For `TestMachine_*.TestChangingInt` × ~55 advised items, the rig can fill 10,000 in well under a minute.

2. The fail-fast threshold is "exactly at capacity." There is no overflow grace window. A momentary lull on the consumer side that lasts long enough for one extra event to arrive after the channel is full results in worker fault and session teardown.

This is design-as-intended in the v1 sense, but it surfaces a behavioral contract that is **not currently documented**: clients must open `StreamEvents` BEFORE issuing `advise` against high-rate tags, or pace their `advise` calls below the (non-published) accumulation budget. None of the current docs (`gateway.md`, `docs/DesignDecisions.md`, the client READMEs) enforce or surface this requirement, and four of the five client CLIs (`go`, `python`, `rust`, `java`) hit it gracelessly in `scripts/run-client-e2e-tests.ps1`.

The diagnostic `"Worker event channel rejected an event."` also does not name the actual channel (it says "Worker event channel" but the channel is gateway-owned), the current depth, or the capacity — only that it overflowed. Operators can't tell whether the threshold needs lifting or whether the consumer is genuinely missing.

**Recommendation:** Three escalating options, pick at least the first and consider one of the others:

1. **Document the contract.** In `gateway.md` and `docs/DesignDecisions.md`, state explicitly that `advise` produces events into the gateway-side per-session channel and that a `StreamEvents` consumer must be attached to drain it. Add the bound (`MxGateway:Events:QueueCapacity`, default 10,000) and the fault behavior (the worker is faulted; the session ends). Update `clients/*/README.md` to call out the requirement in the "advise" / "subscribe" sections.

2. **Improve the diagnostic.** Format the channel depth and capacity into the fault message: `"Worker event channel rejected an event after {capacity} unconsumed events accumulated. Attach a StreamEvents consumer or increase MxGateway:Events:QueueCapacity."`

3. **Add an overflow grace window.** Instead of fail-fast on the first `TryWrite == false`, count overflow events and only fault if N consecutive overflows happen within T ms (or, equivalently, switch to `WriteAsync` with a short timeout). This trades a tiny memory bump for resilience to consumer hiccups. Out of scope if v1 explicitly chose fail-fast for parity reasons — but worth raising for v2.

Add a regression test that advises N items without an active `StreamEvents` consumer, lets the channel fill, and asserts the produced fault message contains the channel-depth diagnostic (#2) — gated so that #3 is not required.

**Resolution:** 2026-05-24 — Re-triaged at HEAD `d2d2e5f`: recommendation #2 (improved diagnostic) is already implemented and verified against source. `WorkerClient.EnqueueWorkerEventAsync` (`src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerClient.cs:525-569`) now (a) attempts `TryWrite` first for the fast path, (b) on full-channel falls through to `WriteAsync` with a linked `CancellationTokenSource` cancelled after `WorkerClientOptions.EventChannelFullModeTimeout` (default 5s), so a transient consumer hiccup is absorbed instead of fail-fast on the first overflow event, and (c) on real overflow records `QueueOverflow("worker-events")` and faults with the rich diagnostic message naming the wait timeout, the current channel depth, the channel capacity, and the actionable remediation ("Attach a StreamEvents consumer or raise MxGateway:Events:QueueCapacity."). Regression test: `WorkerClientTests.EnqueueWorkerEvent_WhenChannelFullPastTimeout_FaultsWithRichDiagnostic` (`src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:473-521`) fills a 4-slot channel + one overflow, asserts the worker is faulted, then drains the propagated `WorkerClientException` and pins the diagnostic string contains "Worker event channel rejected", "of 4 capacity", "StreamEvents", and "MxGateway:Events:QueueCapacity". Recommendation #1 (the prose contract in `gateway.md` / `docs/DesignDecisions.md` / client READMEs) is out of scope for this pass, whose edit scope is restricted to `src/ZB.MOM.WW.MxGateway.Server/**`, `src/ZB.MOM.WW.MxGateway.Tests/**`, and this findings file; the documentation update needs to land in a follow-up that has docs access. Recommendation #3 (overflow grace window) was already implemented in spirit by the `WriteAsync` + timeout switch — the channel now absorbs a transient burst up to the configured wait timeout, satisfying #3's "consumer hiccup resilience" goal without requiring a separate counter.
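
The `TryWrite` then `WriteAsync`-with-timeout fall-through can be sketched with `System.Threading.Channels` as below. This is a minimal illustration under assumptions: `EventEnqueueSketch.TryEnqueueAsync` is a hypothetical helper that returns `false` where the gateway would record the overflow and fault with its diagnostic, and the metrics and message formatting are left out.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical helper; the gateway's EnqueueWorkerEventAsync also records metrics and faults.
public static class EventEnqueueSketch
{
    public static async Task<bool> TryEnqueueAsync<T>(
        Channel<T> channel, T item, TimeSpan fullModeTimeout, CancellationToken cancellationToken)
    {
        // Fast path: succeeds immediately whenever the bounded channel has room.
        if (channel.Writer.TryWrite(item))
        {
            return true;
        }

        // Slow path: wait for a free slot, but only up to the configured timeout.
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        timeoutCts.CancelAfter(fullModeTimeout);
        try
        {
            await channel.Writer.WriteAsync(item, timeoutCts.Token);
            return true;
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // The timeout elapsed with the channel still full: report real overflow.
            return false;
        }
    }
}
```

The slow path depends on the channel being created with `BoundedChannelFullMode.Wait`; with a drop mode, `WriteAsync` would complete immediately and silently discard events instead of waiting.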

### Server-033

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:265-323` (`TryRestoreFromDiskAsync`), `:84-99` (`_firstLoad` / `WaitForFirstLoadAsync`); `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:141-163` (`WaitForCacheBootstrap`) |
| Status | Resolved |

**Description:** `TryRestoreFromDiskAsync` populates `_current` with the on-disk snapshot (status `Stale`, `HasData == true`) but never completes the `_firstLoad` `TaskCompletionSource`; only the live-query paths (cheap / heavy / catch) in `RefreshCoreAsync` do. A `DiscoverHierarchy` or `GetLastDeployTime` call that arrives after gateway start but before the first refresh tick finishes sees `cache.Current` as `Empty` (status `Unknown`) when `WaitForCacheBootstrap` runs its initial check, so it falls through to `await WaitForFirstLoadAsync` with a 5-second budget. Restore then completes within milliseconds and makes the data available, but `_firstLoad` stays pending until the live query returns or fails. When the Galaxy database is unreachable (the exact scenario the snapshot feature exists for), the SQL connect attempt outlasts the 5s budget, so the caller waits out the full 5 seconds before the handler falls through to read the (already-restored) data. The result is correct, but the first browse calls after a cold offline start incur a needless ~5s latency, undercutting the feature's purpose.

**Recommendation:** Call `_firstLoad.TrySetResult()` at the end of `TryRestoreFromDiskAsync` once the restored entry is published; restored data is a valid completed first load. Add a regression test: a cache with a throwing repository plus a populated snapshot store should have `WaitForFirstLoadAsync` complete promptly after `RefreshAsync`, not block on the live query.

**Resolution:** Resolved in `bdccdbf` (2026-05-22): `TryRestoreFromDiskAsync` calls `_firstLoad.TrySetResult()` immediately after publishing the restored entry, so a restored snapshot satisfies the bootstrap gate without waiting on the live query. New test `GalaxyHierarchyCacheTests.RefreshAsync_RestoredSnapshotCompletesFirstLoadBeforeLiveQueryReturns` blocks the repository's deploy-time query and asserts `WaitForFirstLoadAsync` still completes from the snapshot.
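
The gate behaviour can be illustrated with a reduced model. The sketch assumes a non-generic `TaskCompletionSource` and `Task.WaitAsync` (.NET 6 or later); `FirstLoadGate` and its members are hypothetical stand-ins for the cache, with a plain string in place of the hierarchy entry.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical reduction of the cache's first-load gate.
public sealed class FirstLoadGate
{
    private readonly TaskCompletionSource _firstLoad =
        new(TaskCreationOptions.RunContinuationsAsynchronously);
    private volatile string? _current;

    public string? Current => _current;

    // Disk restore: publish the entry, then release anyone waiting on the gate.
    public void PublishRestored(string snapshot)
    {
        _current = snapshot;
        _firstLoad.TrySetResult();
    }

    // Live query: TrySetResult is a no-op if the restore already completed the gate.
    public void PublishLive(string entry)
    {
        _current = entry;
        _firstLoad.TrySetResult();
    }

    // Returns true once any first load has been published, false if the budget elapses.
    public async Task<bool> WaitForFirstLoadAsync(TimeSpan budget, CancellationToken ct = default)
    {
        try
        {
            await _firstLoad.Task.WaitAsync(budget, ct);
            return true;
        }
        catch (TimeoutException)
        {
            return false;
        }
    }
}
```

With the `TrySetResult()` call removed from `PublishRestored`, a waiter would sit out the whole budget even though `Current` is already populated, which is the latency the finding describes.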

### Server-034

| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchySnapshotStore.cs:87-115` (`TryLoadAsync`) |
| Status | Resolved |

**Description:** `TryLoadAsync` carries the `Try` prefix and its XML doc says it returns `null` "when none exists, persistence is disabled, or the on-disk file uses an unrecognized schema version." But a corrupt or partially written JSON file makes `JsonSerializer.DeserializeAsync` throw `JsonException`, and an unreadable file (locked, denied ACL) throws `IOException` / `UnauthorizedAccessException`, none of which the method catches. End-to-end behavior is still safe because the sole caller, `GalaxyHierarchyCache.TryRestoreFromDiskAsync`, wraps the call in a `catch (Exception)`; but the store's own `Try`-prefixed contract is violated, and any future caller would be surprised by the throw.

**Recommendation:** Catch `JsonException`, `IOException`, and `UnauthorizedAccessException` (which does not derive from `IOException` and so must be listed explicitly) inside `TryLoadAsync`, log a warning, and return `null`, consistent with the unrecognized-schema-version branch already present and with the `Try` naming. A corrupt cache file is an expected failure mode for a disk cache.

**Resolution:** Resolved in `bdccdbf` (2026-05-22): `TryLoadAsync` now has a `catch (Exception) when (exception is JsonException or IOException or UnauthorizedAccessException)` that logs a warning and returns `null`. New test `GalaxyHierarchySnapshotStoreTests.TryLoadAsync_WhenFileIsCorruptJson_ReturnsNull`.
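
A minimal sketch of the exception filter follows. It is not the store's implementation: `SnapshotLoader`, the `Snapshot` record, and the console warning are assumptions made for illustration, and the schema-version check is omitted.

```csharp
using System;
using System.IO;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical snapshot shape, for illustration only.
public sealed record Snapshot(int SchemaVersion, string Payload);

public static class SnapshotLoader
{
    // Returns null for a missing, corrupt, or unreadable file instead of throwing.
    public static async Task<Snapshot?> TryLoadAsync(string path, CancellationToken ct = default)
    {
        if (!File.Exists(path))
        {
            return null;
        }

        try
        {
            await using var stream = File.OpenRead(path);
            return await JsonSerializer.DeserializeAsync<Snapshot>(stream, cancellationToken: ct);
        }
        catch (Exception ex) when (ex is JsonException or IOException or UnauthorizedAccessException)
        {
            // Expected disk-cache failure modes: warn and report "no snapshot".
            Console.Error.WriteLine($"Ignoring unreadable snapshot '{path}': {ex.Message}");
            return null;
        }
    }
}
```

Because the filter names only the three expected exception types, cancellation and programming errors still propagate to the caller.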

### Server-035

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:176` (call site), `:327-352` (`PersistSnapshotAsync`) |
| Status | Resolved |

**Description:** After a heavy refresh, `RefreshCoreAsync` `await`s `PersistSnapshotAsync` while still holding `_refreshGate`, and the `SaveAsync` write has no timeout. The only caller of `RefreshAsync` is the sequential `GalaxyHierarchyRefreshService` loop, so a write that hangs (e.g. a `SnapshotCachePath` pointed at an unresponsive network share) blocks the gate and stalls all subsequent cache refreshes until gateway shutdown. Impact is bounded: clients keep being served the last entry (which flips to `Stale` after the 5-minute threshold), so this is a degradation rather than an outage, and the default `C:\ProgramData` path is local disk where a hang is unlikely.

**Recommendation:** Bound the snapshot write with a timeout (a linked `CancellationTokenSource` cancelling after, say, the SQL `CommandTimeoutSeconds` budget) so that a stuck write fails fast and logs rather than pinning the refresh loop. Moving the write off the gate is an alternative but would need its own write-serialization.

**Resolution:** Resolved in `bdccdbf` (2026-05-22): `SaveAsync` wraps the write in a `CancellationTokenSource.CreateLinkedTokenSource(cancellationToken)` cancelled after `Math.Max(1, CommandTimeoutSeconds)` seconds, so a stuck write fails fast instead of pinning the refresh loop. The timeout-expiry path itself is not unit-tested; exercising it would require a genuinely hanging filesystem.

### Server-036

| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:345-348` (`PersistSnapshotAsync` catch) |
| Status | Resolved |

**Description:** `PersistSnapshotAsync` passes the refresh `CancellationToken` to `SaveAsync` and catches every exception (including the `OperationCanceledException` thrown when that token is cancelled at gateway shutdown) in its general `catch (Exception)`, logging it as `Warning: "Failed to persist the Galaxy hierarchy snapshot to disk."`. A snapshot write interrupted by a normal shutdown is not a failure, but it surfaces as a misleading warning every time the gateway stops mid-write.

**Recommendation:** Let a cancellation-driven `OperationCanceledException` pass without the warning (e.g. add `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { }` before the general catch), matching the cancellation handling already used in `RefreshCoreAsync` and `TryRestoreFromDiskAsync`.

**Resolution:** Resolved in `bdccdbf` (2026-05-22): `PersistSnapshotAsync` has a `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)` ahead of the general catch, so a save aborted by gateway shutdown is silent while a genuine failure (including a write timeout) still logs. New test `GalaxyHierarchyCacheTests.RefreshAsync_WhenSnapshotSaveCancelledAtShutdown_DoesNotLogPersistFailure`.
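
The interplay between the shutdown filter and the Server-035 write timeout can be sketched as follows. `SnapshotPersister`, the `save` delegate, and the in-memory `Warnings` list are hypothetical; the real code logs through its logger rather than collecting strings.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical reduction of the persist path; warnings are collected for inspection.
public sealed class SnapshotPersister
{
    public List<string> Warnings { get; } = new();

    public async Task PersistAsync(
        Func<CancellationToken, Task> save, TimeSpan writeTimeout, CancellationToken shutdown)
    {
        try
        {
            // Bound the write: the linked token fires on shutdown or on timeout.
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(shutdown);
            cts.CancelAfter(writeTimeout);
            await save(cts.Token);
        }
        catch (OperationCanceledException) when (shutdown.IsCancellationRequested)
        {
            // Normal shutdown interrupted the write: not a failure, stay silent.
        }
        catch (Exception ex)
        {
            // Genuine failure, including a write timeout (shutdown was not requested).
            Warnings.Add($"Failed to persist snapshot: {ex.GetType().Name}");
        }
    }
}
```

The filter inspects the outer shutdown token rather than the linked one, which is what lets a timeout fall through to the general catch and still produce a warning.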

### Server-037

| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/MxGateway.Tests/Galaxy/GalaxyHierarchySnapshotStoreTests.cs`, `src/MxGateway.Tests/Galaxy/GalaxyHierarchyCacheTests.cs` |
| Status | Resolved |

**Description:** The new snapshot tests cover the round-trip, missing-file, persistence-disabled, unrecognized-schema, and overwrite cases for the store, and the persist / restore-when-unreachable / promote-on-matching-deploy cases for the cache. Two resilience paths are untested: (1) `GalaxyHierarchyCache.TryRestoreFromDiskAsync`'s `catch` path when the snapshot file is corrupt (the cache must come up `Unavailable` rather than throwing); (2) the cache restore path when `PersistSnapshot = false` (the store yields `null` and the cache stays `Unavailable`). Both are the failure modes most likely to matter operationally.

**Recommendation:** Add a cache test that writes a corrupt snapshot file and asserts `RefreshAsync` with an unreachable repository leaves the cache `Unavailable` without throwing, and a test that confirms a `PersistSnapshot = false` store neither restores nor persists. If Server-034 is fixed, the corrupt-file test also pins the store's null-return.

**Resolution:** Resolved in `bdccdbf` (2026-05-22): added `GalaxyHierarchyCacheTests.RefreshAsync_WhenSnapshotFileCorrupt_ComesUpUnavailableWithoutThrowing` and `RefreshAsync_WhenPersistDisabled_DoesNotRestoreFromDisk`, plus the `TryLoadAsync_WhenFileIsCorruptJson_ReturnsNull` store test added for Server-034.

### Server-038

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs:23-44` |
| Status | Resolved |

**Description:** `EventsHub` is gated by `[Authorize(Policy = DashboardAuthenticationDefaults.HubClientsPolicy)]`, which checks only that the caller carries a dashboard role (Admin or Viewer). `SubscribeSession(sessionId)` accepts any non-empty session id and joins the caller to `session:{id}`. A Viewer who knows or guesses a session id can therefore subscribe to any session's MxEvent stream once `DashboardEventBroadcaster` is broadcasting (which it now is, per `d692232`). The per-session ACL that gates the gRPC `StreamEvents` RPC is not replicated.

**Recommendation:** Before the EventsHub is exercised by Admin-only sessions or session-scoped Viewer roles, gate `SubscribeSession` on a session-access check, either via a per-session role check in the hub method itself, or by storing a per-user allowed-session-id set in the connection's `Context.Items` at connect time and rejecting subscribes outside that set. The current dashboard surfaces only a per-page Session Details view that the page can prove it's authorized for, but as soon as a session-scoped Viewer role exists the gap matters.

**Resolution:** Resolved 2026-05-24: documented the v1 acceptance per the prompt's "practical fix for v1" direction. Added a detailed `<remarks>` block to `EventsHub.SubscribeSession` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs`) stating that (a) in v1 the hub-level `HubClientsPolicy` only requires one of the dashboard roles (Admin or Viewer) and both may subscribe to any session id, (b) this is acceptable today because the dashboard's per-session views show non-secret session metadata any authenticated user can already see and value logging is gated by the same redaction policy, and (c) the per-session ACL that gates the gRPC `StreamEvents` RPC is intentionally not yet mirrored here. Added an explicit `TODO(per-session-acl)` describing the future enforcement seam: once a role/scope is introduced that scopes a Viewer to a specific session or tenant, add a session-access check at this method (inline on `Context.User` claims/`Context.Items`, or via a dedicated authorization policy applied to the hub method). No code-behavior change in this pass; the per-session ACL data model design is out of scope for the resolution window. No new regression test (the change is documentation-only).

### Server-039

| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs:37-58` |
| Status | Resolved |

**Description:** `HubTokenService.Validate` deserializes the protected JSON payload and trusts `payload.Roles` even when `payload.Name` and `payload.NameIdentifier` are both `null`. The resulting `ClaimsPrincipal` has the `MxGateway.Dashboard.HubToken` scheme as its `AuthenticationType` and the role claims, but no identity claims. `Identity?.IsAuthenticated` returns `true` because the auth type is non-empty, so the principal satisfies `IsAuthenticated` checks and `IsInRole` checks even though it has no caller identity. A token forged from a corrupted data-protection store could pass authorization without an associated user.

**Recommendation:** Mark `HubTokenPayload.Name` and `HubTokenPayload.NameIdentifier` as required (e.g. with `[JsonRequired]` once the project standardizes the JSON binder, or by validating non-null explicitly after deserialization) and reject the token if either is missing. Alternatively, document on `IDashboardAuthorizationHandler` consumers that they must check `Identity?.Name` is non-null before honoring role claims from this scheme.

**Resolution:** Resolved 2026-05-24: `HubTokenService.Validate` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs`) now rejects a deserialized payload where both `Name` and `NameIdentifier` are null/empty, returning `null` rather than emitting a principal with role claims but no caller identity. The check sits immediately after the protector unprotect and the null-payload guard, with a comment back-referencing Server-039. Either field is sufficient (a token minted with only a `NameIdentifier` still validates), matching the existing `Issue` path where the cookie principal may carry just one of them. New test file `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/HubTokenServiceTests.cs` with six tests: `Validate_TokenWithNullNameAndNullNameIdentifier_ReturnsNull` (the named regression, confirmed to fail before the fix and pass after), `Validate_TokenWithName_ReturnsAuthenticatedPrincipal`, `Validate_TokenWithOnlyNameIdentifier_ReturnsPrincipal`, plus null/empty/garbage-token sanity checks. Verified by passing tests.
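
The identity guard can be sketched in isolation from data protection. The payload record shape and the `HubTokenPrincipal.FromPayload` helper below are assumptions for illustration; only the scheme name and the "either field is sufficient" rule come from the finding.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Claims;

// Hypothetical payload shape; the real HubTokenPayload may differ.
public sealed record HubTokenPayload(string? Name, string? NameIdentifier, string[]? Roles);

public static class HubTokenPrincipal
{
    public const string Scheme = "MxGateway.Dashboard.HubToken";

    // Returns null when the payload carries no caller identity at all.
    public static ClaimsPrincipal? FromPayload(HubTokenPayload? payload)
    {
        if (payload is null)
        {
            return null;
        }

        // Reject role claims that are not attached to any identity.
        if (string.IsNullOrEmpty(payload.Name) && string.IsNullOrEmpty(payload.NameIdentifier))
        {
            return null;
        }

        var claims = new List<Claim>();
        if (!string.IsNullOrEmpty(payload.Name))
        {
            claims.Add(new Claim(ClaimTypes.Name, payload.Name));
        }
        if (!string.IsNullOrEmpty(payload.NameIdentifier))
        {
            claims.Add(new Claim(ClaimTypes.NameIdentifier, payload.NameIdentifier));
        }
        foreach (var role in payload.Roles ?? Array.Empty<string>())
        {
            claims.Add(new Claim(ClaimTypes.Role, role));
        }

        // A non-empty authentication type makes IsAuthenticated true, hence the guard above.
        return new ClaimsPrincipal(new ClaimsIdentity(claims, Scheme));
    }
}
```

Without the guard, a payload holding only roles would still produce an authenticated principal that passes `IsInRole`, which is the gap the finding describes.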

### Server-040

| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardAuthenticator.cs:140-160` (`MapGroupsToRoles`) |
| Status | Resolved |

**Description:** `MapGroupsToRoles` checks each LDAP group against the role map twice (first by the full group string, then by `ExtractFirstRdnValue(group)`), and `TryGetValue` short-circuits on the first hit. The precedence ("full match wins over RDN match") is correct because the map's key set is operator-controlled and matches should resolve deterministically, but the lookup ordering is not documented. A future maintainer reading the code can't tell whether "fall through to RDN" is intentional or a leftover from refactoring `IsMemberOfRequiredGroup`.

**Recommendation:** Add a one-line comment above the loop explaining the precedence: full DN/CN literal first, leading-RDN fallback second. Mention the case-insensitive map comparer (`OrdinalIgnoreCase`) so the next reader doesn't ask why `"GwAdmin"` matches `"gwadmin"`.

**Resolution:** Resolved 2026-05-24: added a precedence comment block above the lookup in `MapGroupsToRoles` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardAuthenticator.cs:156-163`) explaining that the full literal group string is tried first and the leading-RDN value (e.g. `GwAdmin` extracted from `ou=GwAdmin,ou=groups,...`) is the fallback, and back-referencing `DashboardOptions.GroupToRole` as the source of the `OrdinalIgnoreCase` comparer so a maintainer sees why `"GwAdmin"` matches `"gwadmin"`. No code change: the existing `DashboardAuthenticatorTests.MapGroupsToRoles_ResolvesByShortNameAndDistinguishedName` test already pins both the full-match and RDN-fallback paths and the case-insensitive lookup, so this is a documentation-only resolution with no new test.
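
The lookup precedence can be shown with a small stand-alone sketch. `GroupRoleMapper` is hypothetical, and its `ExtractFirstRdnValue` is a simplified stand-in that splits on the first comma and does not handle escaped commas in a DN; the case-insensitivity comes entirely from the comparer of the dictionary the caller passes in.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the authenticator's group-to-role mapping.
public static class GroupRoleMapper
{
    public static HashSet<string> MapGroupsToRoles(
        IEnumerable<string> ldapGroups, IReadOnlyDictionary<string, string> groupToRole)
    {
        var roles = new HashSet<string>(StringComparer.Ordinal);
        foreach (var ldapGroup in ldapGroups)
        {
            // Precedence: the full literal group string first, then the
            // leading-RDN value as a fallback. The first hit wins.
            if (groupToRole.TryGetValue(ldapGroup, out var role) ||
                groupToRole.TryGetValue(ExtractFirstRdnValue(ldapGroup), out role))
            {
                roles.Add(role);
            }
        }
        return roles;
    }

    // Simplified: returns the value of the first RDN, or the input if it is a bare name.
    public static string ExtractFirstRdnValue(string ldapGroup)
    {
        var firstRdn = ldapGroup.Split(',', 2)[0];
        var equalsIndex = firstRdn.IndexOf('=');
        return equalsIndex >= 0 ? firstRdn[(equalsIndex + 1)..].Trim() : firstRdn.Trim();
    }
}
```

With a map built on `StringComparer.OrdinalIgnoreCase` that holds both `GwAdmin` and a full DN key, a group matching the DN key resolves through the full-literal entry, and any other DN whose first RDN is `gwadmin` falls back to the short-name entry.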

### Server-041

| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:123-126`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/IDashboardEventBroadcaster.cs:6-10` |
| Status | Resolved |

**Description:** `IDashboardEventBroadcaster.Publish` is documented with "Implementations must never throw", on the grounds that broadcast failures "are best-effort and must not disrupt the source gRPC stream." `EventStreamService` relies on that contract by passing the call through without a try/catch. The current `DashboardEventBroadcaster` implementation observes the `SendAsync` task's continuation but does not raise synchronously, so the seam is safe today. A future implementation that adds synchronous validation or a serializer hop could throw, faulting the producer loop and ending the gRPC stream.

**Recommendation:** Either wrap the `Publish` call in a `try/catch (Exception ex)` that logs at debug and continues (matching the `DashboardSnapshotPublisher` pattern), or add a code-review checklist note enforcing the never-throw contract on implementations. The wrap is safer because it doesn't depend on convention.

**Resolution:** Resolved 2026-05-24 by taking the safer wrap. `EventStreamService.ProduceEventsAsync` (`src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:123-137`) now wraps the `dashboardEventBroadcaster.Publish(...)` call in a `try / catch (Exception ex)` that logs at debug and continues. The producer loop and the gRPC stream are no longer at the mercy of the broadcaster's never-throw discipline: a future implementation that adds synchronous validation or a serializer hop cannot fault the stream. Regression test in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs`: `StreamEventsAsync_WhenDashboardBroadcasterThrows_StillYieldsEventsAndDoesNotFaultSession` (new) injects a `ThrowingDashboardEventBroadcaster` test double that throws `InvalidOperationException` on every `Publish`, then asserts (a) the gRPC stream still yields both events in order, (b) the broadcaster's `Publish` is attempted for every event (so the catch is exercised per-event rather than aborting the loop), and (c) the session does not transition to `Faulted`. Confirmed to fail before the fix (the producer loop surfaced the simulated `InvalidOperationException`) and pass after. Verified by passing tests.
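
A synchronous reduction of the wrap is sketched below. The real producer is asynchronous and gRPC-backed; `IEventBroadcaster`, `EventProducer`, and `ThrowingBroadcaster` are hypothetical names used only to show that a throwing best-effort side channel does not interrupt the main stream.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical best-effort side channel.
public interface IEventBroadcaster
{
    void Publish(string evt);
}

public static class EventProducer
{
    // Yields every source event even when the broadcaster throws.
    public static IEnumerable<string> Produce(
        IEnumerable<string> source, IEventBroadcaster broadcaster, Action<Exception> logDebug)
    {
        foreach (var evt in source)
        {
            try
            {
                broadcaster.Publish(evt);
            }
            catch (Exception ex)
            {
                // Best-effort: record the failure and keep the stream alive.
                logDebug(ex);
            }

            // Outside the try block: C# does not allow yield inside try/catch.
            yield return evt;
        }
    }
}

// Test double that fails on every call and counts the attempts.
public sealed class ThrowingBroadcaster : IEventBroadcaster
{
    public int Attempts { get; private set; }

    public void Publish(string evt)
    {
        Attempts++;
        throw new InvalidOperationException("simulated broadcaster failure");
    }
}
```

Keeping the catch inside the loop body is what makes the failure per-event: each event gets its own publish attempt, and none of them can abort the enumeration.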

### Server-042

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardSnapshotPublisher.cs:18-41` |
| Status | Resolved |

**Description:** `DashboardSnapshotPublisher.ExecuteAsync` reads from `IDashboardSnapshotService.WatchSnapshotsAsync` inside an outer `try` that catches `OperationCanceledException` only. A failure inside `WatchSnapshotsAsync` (e.g. the snapshot service throws after a transient SQL failure for the Galaxy summary projection) escapes the outer try and ends the BackgroundService with no automatic reconnect. The sibling `AlarmsHubPublisher` (lines 55-61) wraps its `StreamAsync` consumer in a 5-second reconnect loop with `catch (Exception ex)` and continues. The snapshot publisher should follow the same shape.

**Recommendation:** Wrap the `await foreach` in a `while (!stoppingToken.IsCancellationRequested)` loop with a `catch (Exception ex)` plus a 5-second `Task.Delay`, mirroring `AlarmsHubPublisher`. Today's snapshot service rarely throws on the watch path, but a one-time logger-init failure or transient `IGatewayConfigurationProvider` exception would silently take the dashboard offline.

**Resolution:** Resolved 2026-05-24 by mirroring `AlarmsHubPublisher`'s reconnect loop. `DashboardSnapshotPublisher.ExecuteAsync` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardSnapshotPublisher.cs`) now wraps the `await foreach` in a `while (!stoppingToken.IsCancellationRequested)` loop and catches general exceptions with a logged warning + `Task.Delay(reconnectDelay, stoppingToken)`. The 5-second `DefaultReconnectDelay` is preserved for production via the public constructor; an `internal` overload injects a shorter delay for the regression test (with `[InternalsVisibleTo("ZB.MOM.WW.MxGateway.Tests")]` already in place). Also tightened cancellation handling: the inner `OperationCanceledException` handler now `return`s instead of merely swallowing the exception, so a normal shutdown exits cleanly rather than re-looping on the cancelled token. New test file `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs` with two cases: `ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` (the named regression: `ThrowOnceThenYieldSnapshotService` throws on the first `WatchSnapshotsAsync` call and yields a snapshot on the second; the publisher must reconnect, broadcast the snapshot, and the gap between throw and reconnect must respect the configured delay) and `ExecuteAsync_WhenSnapshotServiceCompletes_ReconnectsAfterDelay` (sanity case: a normal `yield break` also triggers the reconnect loop). Confirmed both tests fail against the original single-try implementation (the BackgroundService exits and `SubscribeCount` stays at 1) and pass after the fix. Verified by passing tests.
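
The reconnect shape can be sketched without the hosting infrastructure. `ReconnectLoop.RunAsync` and its delegate parameters are hypothetical; the real publisher derives from `BackgroundService` and logs through its logger rather than the console.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical reduction of the publisher's ExecuteAsync loop.
public static class ReconnectLoop
{
    public static async Task RunAsync<T>(
        Func<CancellationToken, IAsyncEnumerable<T>> watch,
        Func<T, Task> broadcast,
        TimeSpan reconnectDelay,
        CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                await foreach (var item in watch(stoppingToken).WithCancellation(stoppingToken))
                {
                    await broadcast(item);
                }
            }
            catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
            {
                // Normal shutdown: exit instead of re-looping on the cancelled token.
                return;
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine($"Watch failed; retrying in {reconnectDelay}: {ex.Message}");
            }

            // Reached after a failure or a normal end of stream: pause, then resubscribe.
            try
            {
                await Task.Delay(reconnectDelay, stoppingToken);
            }
            catch (OperationCanceledException)
            {
                return;
            }
        }
    }
}
```

Passing the delay as a parameter mirrors the resolution's injectable reconnect delay: production would pass 5 seconds, while a test can pass a few milliseconds.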

### Server-043

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs:1`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardServiceCollectionExtensions.cs:24` |
| Status | Resolved |

**Description:** `HubTokenService` is registered as a singleton (good: data protection providers are thread-safe and a single protector instance is correct) and shared by both `DashboardHubConnectionFactory` (per-circuit scoped, mints fresh tokens from the cookie principal) and `HubTokenAuthenticationHandler` (per-request transient, validates inbound tokens). The class-level docs describe what the service does but not that it is intentionally a singleton with two consumer scopes, so a future maintainer rewriting the DI registration may pick the wrong lifetime.

**Recommendation:** Add a `<remarks>` block to `HubTokenService` noting "Registered as a singleton in `AddGatewayDashboard`; the underlying `ITimeLimitedDataProtector` is thread-safe and shared across hub-token issuance and validation." Optionally add a comment near the DI registration explaining the lifetime contract.

**Resolution:** Resolved 2026-05-24: added a `<remarks>` block to `HubTokenService` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs`) documenting that the service is registered as a singleton in `DashboardServiceCollectionExtensions.AddGatewayDashboard` and is shared by two consumer scopes: `DashboardHubConnectionFactory` (scoped, per-circuit; calls `Issue` from the cookie-authenticated dashboard) and `HubTokenAuthenticationHandler` (transient, per-request; calls `Validate` from the SignalR negotiate / connection path). The block notes that the underlying `ITimeLimitedDataProtector` is thread-safe, so concurrent mint/validate from any number of callers is safe, and explicitly asks future maintainers to preserve the singleton lifetime to keep the protector instance stable. Pure documentation change; no test.