Files
mxaccessgw/code-reviews/Server/findings.md
T
Joseph Doherty 1d9e3afadd Resolve Server-002, -004, -005, -006 code-review findings
Server-002: the gateway never terminated leftover MxGateway.Worker.exe
processes at startup, contradicting gateway.md and CLAUDE.md. Added
IRunningProcessInspector/SystemRunningProcessInspector, OrphanWorkerTerminator,
and OrphanWorkerCleanupHostedService (best-effort, runs before sessions are
accepted); updated gateway.md to describe the implemented behavior.

Server-004: API-key scopes were persisted verbatim with no validation. Added
GatewayScopes.All/IsKnown; the CLI parser and dashboard create path now
reject unknown scope strings.

Server-005: a non-SqlException/InvalidOperationException fault on the initial
Galaxy hierarchy load faulted the BackgroundService. ExecuteAsync now catches
all non-cancellation exceptions on first load and RefreshCoreAsync broadens
its catch so the cache records Stale/Unavailable instead.

Server-006: OpenSessionAsync incremented the open-sessions gauge before
alarm auto-subscribe; an auto-subscribe failure leaked the gauge. The catch
path now calls SessionRemoved() when the gauge was incremented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:31:10 -04:00

238 lines
20 KiB
Markdown

# Code Review — Server
| Field | Value |
|---|---|
| Module | `src/MxGateway.Server` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `6c64030` |
| Status | Reviewed |
| Open findings | 8 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Server-006 (metrics open-session leak on alarm auto-subscribe failure), Server-010 (rotate reactivates revoked keys). |
| 2 | mxaccessgw conventions | Issues found: Server-002 (orphan-worker termination on startup not implemented), Server-011 (style deviation in `WorkerAlarmRpcDispatcher`). |
| 3 | Concurrency & thread safety | No issues found — locking is correct; inconsistent-but-safe discipline in `GatewayMetrics` noted only. |
| 4 | Error handling & resilience | Issues found: Server-005 (Galaxy first-load can fault the host BackgroundService), Server-009 (SQLite has no busy-timeout/WAL under concurrent writes). |
| 5 | Security | Issues found: Server-001 (Critical: dashboard authorization never enforced on any route), Server-003 (LDAP dashboard users denied for lack of a scope claim), Server-010. |
| 6 | Performance & resource management | Issues found: Server-007 (DiscoverHierarchy paging is O(total) per page), Server-008 (WatchDeployEvents re-projects whole hierarchy per event). |
| 7 | Design-document adherence | Issues found: Server-002 (orphan workers), Server-012 (CLAUDE.md scope names stale vs code/docs). |
| 8 | Code organization & conventions | Issues found: Server-011 (style), Server-004 (CLI accepts unvalidated scope strings). |
| 9 | Testing coverage | Issues found: Server-013 (no dashboard route-level authorization test; `WorkerExecutableValidator`, `GalaxyGlobMatcher`, projector paging untested). |
| 10 | Documentation & comments | Issues found: Server-014 (stale "not yet wired" alarm comments), Server-012. |
## Findings
### Server-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Security |
| Location | `src/MxGateway.Server/GatewayApplication.cs:147-149`, `src/MxGateway.Server/Dashboard/DashboardEndpointRouteBuilderExtensions.cs:55-58`, `src/MxGateway.Server/Dashboard/Components/Routes.razor:1-15` |
| Status | Resolved |
**Description:** The dashboard authorization policy (`DashboardAuthenticationDefaults.AuthorizationPolicy`), `DashboardAuthorizationRequirement`, and `DashboardAuthorizationHandler` are registered in DI but never applied to any endpoint. `MapRazorComponents<App>()` has no `.RequireAuthorization(...)`, the `<Router>` in `Routes.razor` uses plain `RouteView` (not `AuthorizeRouteView`), and no dashboard page carries `[Authorize]` — a module-wide grep finds zero `RequireAuthorization`/`[Authorize]`/`AuthorizeRouteView` usages. Every dashboard page (Sessions, Workers, Events, Galaxy, Settings, and the API Keys list exposing key IDs, scopes, and constraints) is reachable by any unauthenticated remote client regardless of `Dashboard:AllowAnonymousLocalhost` or `Dashboard:RequireAdminScope`. Only the API-key mutation operations remain protected, via the separate `DashboardApiKeyManagementService.CanManage` check.
**Recommendation:** Apply the policy at the route level — `endpoints.MapRazorComponents<App>().AddInteractiveServerRenderMode().RequireAuthorization(DashboardAuthenticationDefaults.AuthorizationPolicy)` — and/or switch `Routes.razor` to `AuthorizeRouteView` with a `[Authorize]` fallback policy plus a `NotAuthorized` redirect to the login page. Add an integration test that GETs a dashboard page anonymously and asserts 302-to-login / 401.
**Resolution:** Resolved in `a8aafdf` (2026-05-18): `MapRazorComponents<App>()` now calls `.RequireAuthorization(DashboardAuthenticationDefaults.AuthorizationPolicy)`, so an unauthenticated request to any dashboard component route is challenged by the cookie scheme and redirected to the login page. `GatewayApplicationTests` gained `ComponentRoutesRequireAuthorization` (component routes carry the policy) and `AuthEndpointsAllowAnonymousAccess`, replacing the prior test that asserted the insecure behavior.
### Server-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/MxGateway.Server/Program.cs:24`, `src/MxGateway.Server/GatewayApplication.cs` |
| Status | Resolved |
**Description:** `gateway.md:583` and CLAUDE.md state the first version "terminates orphaned workers on startup." No code in MxGateway.Server enumerates or kills leftover `MxGateway.Worker.exe` processes at startup — a grep for `orphan`/`reattach`/`terminate` finds nothing. After an unclean gateway crash, x86 worker processes (each holding an MXAccess COM instance) leak and survive indefinitely, and a restarted gateway does not reclaim or kill them.
**Recommendation:** Add a startup hosted service that finds and kills stale worker processes (by executable path / a well-known argument or environment marker) before the server accepts sessions, or update the design docs if reattachment/cleanup is deliberately deferred.
**Resolution:** Resolved 2026-05-18. Confirmed against source: no code path enumerated or killed leftover workers. Added `IRunningProcessInspector` / `SystemRunningProcessInspector` (a testable seam over `Process.GetProcessesByName`/`Kill`), `OrphanWorkerTerminator` (kills processes matched by the configured worker executable path, or by image name when the x64 gateway cannot introspect the x86 worker's `MainModule`, skipping the current process and tolerating per-process kill failures), and `OrphanWorkerCleanupHostedService` (best-effort `IHostedService`). The hosted service is registered in `AddWorkerProcessLauncher` ahead of `AddGatewaySessions` so cleanup runs before the server accepts sessions. `gateway.md` updated to describe the implemented behavior. Regression tests: `OrphanWorkerTerminatorTests` (`KillsWorkerProcessesMatchingConfiguredExecutablePath`, `KillsImageNameMatchWhenExecutablePathUnreadable`, `DoesNotKillUnrelatedProcessSharingImageName`, `DoesNotKillCurrentProcess`, `ContinuesWhenOneKillThrows`).
### Server-003
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/MxGateway.Server/Dashboard/DashboardAuthorizationHandler.cs:39,54-59`, `src/MxGateway.Server/Dashboard/DashboardAuthenticator.cs:236-258` |
| Status | Resolved |
**Description:** When `Dashboard:RequireAdminScope` is true (the default) and the request is not loopback, `DashboardAuthorizationHandler` succeeds only if `HasAdminScope` finds a claim of type `"scope"` with value `"admin"`. But `DashboardAuthenticator.CreatePrincipal` issues only `NameIdentifier`, `Name`, and `LdapGroupClaimType` claims — never a `scope`/`admin` claim. So a correctly LDAP-authenticated user who passed the required-group check is still denied dashboard access on any non-loopback connection. The bug is currently masked by the missing route-level enforcement (Server-001) and by `AllowAnonymousLocalhost`; fixing Server-001 would make the dashboard unusable for all real LDAP logins.
**Recommendation:** Either have `DashboardAuthenticator.CreatePrincipal` add a `scope=admin` claim when the user is in the required group, or change `DashboardAuthorizationHandler.HasAdminScope` to evaluate LDAP group membership (reuse `IsMemberOfRequiredGroup` against the `LdapGroupClaimType` claims, as `DashboardApiKeyAuthorization.CanManage` already does).
**Resolution:** Resolved in `a8aafdf` (2026-05-18): `DashboardAuthenticator.CreatePrincipal` — reached only after the required-group check passes — now emits the `scope=admin` claim that `DashboardAuthorizationHandler` checks, so group-validated LDAP users pass `RequireAdminScope` once route-level authorization (Server-001) is enforced.
### Server-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs:227-233`, `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCliRunner.cs:53-77`, `src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:21-67` |
| Status | Resolved |
**Description:** `ParseScopes` accepts any comma-separated strings and `CreateKeyAsync` persists them verbatim; neither the CLI nor the dashboard create path validates scopes against `GatewayScopes`. A typo or non-canonical name (e.g. CLAUDE.md's example `--scopes session,invoke,event,metadata,admin`, which does not match the resolver's `session:open`/`invoke:read`/etc.) silently creates a key whose scope strings the authorization resolver never checks for — the key is unusable for those RPCs with no error at creation time.
**Recommendation:** Validate every requested scope against the `GatewayScopes` catalog at create time in both the CLI parser/runner and `DashboardApiKeyManagementService.ValidateCreateRequest`, rejecting unknown scope strings.
**Resolution:** Resolved 2026-05-18. Confirmed against source: `ParseScopes` split unvalidated strings into the create command and `ValidateCreateRequest` checked only key id and display name. Added `GatewayScopes.All` (the canonical scope catalog) and `GatewayScopes.IsKnown(string)`. `ApiKeyAdminCommandLineParser.Parse` now runs `ValidateScopes` for create-key commands and fails the parse listing the unknown scope(s) and valid set; `DashboardApiKeyManagementService.ValidateCreateRequest` rejects requests carrying any non-canonical scope. Revoke/rotate paths are unaffected (no scope input). Regression tests: `ApiKeyAdminCommandLineParserTests.Parse_CreateKeyCommand_RejectsUnknownScope`, `Parse_CreateKeyCommand_AcceptsAllCanonicalScopes`, and `DashboardApiKeyManagementServiceTests.CreateAsync_UnknownScope_DoesNotCallStore`.
### Server-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs:22-28`, `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:184` |
| Status | Resolved |
**Description:** `GalaxyHierarchyCache.RefreshCoreAsync` only catches `SqlException` and `InvalidOperationException`. The initial `cache.RefreshAsync` call in `GalaxyHierarchyRefreshService.ExecuteAsync` is wrapped only for `OperationCanceledException`. A transient non-`SqlException` failure on the first refresh (e.g. a `Win32Exception`/`TimeoutException` from connection establishment, or another `DbException` subtype) escapes both layers, faults the `BackgroundService`, and — with default host behavior — stops the whole gateway. The periodic-tick loop does catch general exceptions, so only the first load is exposed.
**Recommendation:** Broaden the `catch` in `RefreshCoreAsync` to all non-cancellation exceptions (record `Unavailable`/`Stale` and still complete `_firstLoad`), or wrap the initial `RefreshAsync` in `GalaxyHierarchyRefreshService` with the same general `catch` the tick loop uses.
**Resolution:** Resolved 2026-05-18. Confirmed against source: the initial `RefreshAsync` in `ExecuteAsync` was guarded only for `OperationCanceledException`, and `RefreshCoreAsync` filtered its catch to `SqlException or InvalidOperationException`. Both recommended layers applied: `GalaxyHierarchyRefreshService.ExecuteAsync` now catches every non-cancellation exception on the initial load (logs a warning; the periodic tick retries), and `GalaxyHierarchyCache.RefreshCoreAsync` broadens its catch to all non-cancellation exceptions so the cache still records `Stale`/`Unavailable` and completes `_firstLoad`. The now-unused `Microsoft.Data.SqlClient` using was removed. Regression test: `GalaxyHierarchyRefreshServiceTests.ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFaultBackgroundService`.
### Server-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Server/Sessions/SessionManager.cs:84-114` |
| Status | Resolved |
**Description:** In `OpenSessionAsync`, `_metrics.SessionOpened()` (line 89) increments the `_openSessions` gauge before `TryAutoSubscribeAlarmsAsync` runs. If auto-subscribe throws (which it does when `Alarms.RequireSubscribeOnOpen` is true and the worker rejects the subscription), the `catch` block disposes and removes the session and records `_metrics.Fault(...)` but never calls `SessionClosed`/`SessionRemoved`. The `mxgateway.sessions.open` gauge permanently over-counts by one for every such failed open.
**Recommendation:** In the `catch` block, when the session had reached the point where `SessionOpened()` was recorded, also call `_metrics.SessionRemoved()` — or move the `SessionOpened()` call to after auto-subscribe succeeds.
**Resolution:** Resolved 2026-05-18. Confirmed against source: the `catch` block in `OpenSessionAsync` recorded `Fault(...)` and removed the session but never decremented the open-session gauge after `SessionOpened()` had run. Added a `sessionOpenedRecorded` flag set immediately after `_metrics.SessionOpened()`; the `catch` block now calls `_metrics.SessionRemoved()` when that flag is set, restoring the gauge for a post-`SessionOpened()` failure (e.g. an auto-subscribe rejection with `RequireSubscribeOnOpen=true`). Regression test: `SessionManagerAlarmAutoSubscribeTests.OpenSessionAsync_DoesNotLeakOpenSessionGauge_WhenAutoSubscribeFailsWithRequireOn`.
### Server-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyProjector.cs:55-70` |
| Status | Open |
**Description:** `Project` always iterates the full `entry.Index.ObjectViews` collection and re-applies all filters to skip `offset` matched items before collecting a page. Paging through a large Galaxy hierarchy is therefore O(total) per page and O(total²/pageSize) end-to-end. The cache is in-memory so impact is bounded, but for large galaxies repeated `DiscoverHierarchy` pagination wastes CPU.
**Recommendation:** Precompute and cache the filtered, ordered view list per `(filterSignature, sequence)` so subsequent pages are an O(pageSize) slice; the existing filter signature already keys page tokens.
**Resolution:** _(open)_
### Server-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Server/Grpc/GalaxyRepositoryGrpcService.cs:111-134,160-189` |
| Status | Open |
**Description:** `WatchDeployEvents` calls `ResolveBrowseSubtrees()` on every streamed event, and `MapDeployEvent` re-runs `GalaxyHierarchyProjector.Project` over the entire cached hierarchy (and `Sum`s attribute counts) for every event of every constrained subscriber. `GalaxyGlobMatcher.IsMatch` also rebuilds the glob regex on each call. With many constrained subscribers and frequent deploys this is avoidable work.
**Recommendation:** Hoist `ResolveBrowseSubtrees()` out of the loop; compute scoped object/attribute counts once per deploy sequence and cache by `(sequence, browseSubtrees)`; cache compiled glob `Regex` instances in `GalaxyGlobMatcher`.
**Resolution:** _(open)_
### Server-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Security/Authentication/AuthSqliteConnectionFactory.cs:15-32` |
| Status | Open |
**Description:** Each auth-store operation opens a fresh `SqliteConnection` with no busy timeout, no WAL journal mode, and default journaling. `MarkKeyUsedAsync` runs on every authenticated request and `SqliteApiKeyAuditStore` appends on every denial; under concurrent load these writers can collide and surface `SQLITE_BUSY` as a hard failure on the request path.
**Recommendation:** Set `Pooling`, a non-zero `DefaultTimeout`/`busy_timeout`, and enable WAL (`PRAGMA journal_mode=WAL`) once at startup so concurrent readers/writers degrade gracefully.
**Resolution:** _(open)_
### Server-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Security |
| Location | `src/MxGateway.Server/Security/Authentication/SqliteApiKeyAdminStore.cs:91-114`, `src/MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:168-172` |
| Status | Open |
**Description:** `RotateAsync` sets `revoked_utc = NULL`, so rotating a previously revoked key silently reactivates it. This is documented intentional behavior in `docs/Authentication.md:167`, but the dashboard renders the "Rotate" button unconditionally — including for keys whose status badge says "Revoked" — so an operator can un-revoke a deliberately disabled key without an explicit warning.
**Recommendation:** Either hide/disable the Rotate action for revoked keys in `ApiKeysPage.razor`, require an explicit confirmation, or have `RotateAsync` preserve `revoked_utc` and add a separate explicit "reactivate" operation.
**Resolution:** _(open)_
### Server-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:1-46` |
| Status | Open |
**Description:** `WorkerAlarmRpcDispatcher` deviates from the module's conventions: it fully-qualifies `System.Guid`, `System.ArgumentNullException`, and `System.Threading` types inline instead of relying on `using` directives, and uses an explicit constructor with `this.`-qualified field assignment while the rest of the module (e.g. `ConstraintEnforcer`, `MxAccessGatewayService`, `GalaxyRepositoryGrpcService`) uses primary constructors. `docs/style-guides/CSharpStyleGuide.md` is authoritative for gateway code.
**Recommendation:** Add the needed `using` directives, drop the inline fully-qualified names, and convert to a primary constructor for consistency.
**Resolution:** _(open)_
### Server-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `CLAUDE.md` (Authentication section and `apikey create` example) |
| Status | Open |
**Description:** CLAUDE.md describes scopes as `session`, `invoke`, `event`, `metadata`, `admin` and shows `apikey create --scopes session,invoke,event,metadata,admin`. The actual canonical scope strings (used by `GatewayScopes`, `GatewayGrpcScopeResolver`, and `docs/Authorization.md`) are `session:open`, `session:close`, `invoke:read`, `invoke:write`, `invoke:secure`, `events:read`, `metadata:read`, `admin`. A key created per the CLAUDE.md example carries scopes the resolver never matches.
**Recommendation:** Update CLAUDE.md's scope list and the `apikey` example to the canonical `*:*` scope strings, per CLAUDE.md's own rule that docs change with the code.
**Resolution:** _(open)_
### Server-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/MxGateway.Tests/Gateway/Dashboard/DashboardAuthorizationHandlerTests.cs`, `src/MxGateway.Tests/Gateway/GatewayApplicationTests.cs` |
| Status | Open |
**Description:** `DashboardAuthorizationHandler` is unit-tested in isolation, but no test exercises the dashboard routes end-to-end to confirm the policy is actually enforced — which is why Server-001 (policy registered but never wired) went uncaught. There are also no tests for `WorkerExecutableValidator` (PE-header architecture parsing), `GalaxyGlobMatcher` (anchoring/escaping/empty-glob fail-open), or `GalaxyHierarchyProjector` pagination/page-token behavior.
**Recommendation:** Add a `WebApplicationFactory` integration test that requests a dashboard page unauthenticated and asserts the redirect/401, plus unit tests for `WorkerExecutableValidator`, `GalaxyGlobMatcher`, and projector paging.
**Resolution:** _(open)_
### Server-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:162-171,191-198,206-214,229-237` |
| Status | Open |
**Description:** The XML `<remarks>` and inline comments on `AcknowledgeAlarm` and `QueryActiveAlarms` describe the alarm path as not yet wired and say `NotWiredAlarmRpcDispatcher` is the default ("Clients calling this method today receive an OK reply with a 'worker alarm path not yet wired' diagnostic", "an empty stream until PR A.2"). In fact `SessionServiceCollectionExtensions.AddGatewaySessions` registers `WorkerAlarmRpcDispatcher` as `IAlarmRpcDispatcher`, so DI always injects the production dispatcher; `NotWiredAlarmRpcDispatcher` is only the null fallback. The comments are stale and misleading.
**Recommendation:** Update the `AcknowledgeAlarm`/`QueryActiveAlarms` remarks to reflect that `WorkerAlarmRpcDispatcher` is the wired default, and describe its actual GUID-vs-`Provider!Group.Tag` handling.
**Resolution:** _(open)_