Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules

Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and
fixed them in three priority waves (3 High, 17 Medium, 52 Low).

Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
GatewayGrpcScopeResolver so non-admin keys can use them; document
the mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the
CLI FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression
in generated tonic code by reformatting the ReadBulkCommand proto
comment and scoping a #![allow(...)] to the generated submodules.

Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and
make DisposeAsync race-safe against in-flight CloseAsync (-016);
add constraint-enforcement test coverage for the bulk-plan path
(-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop
can distinguish graceful shutdown from a real STA-affinity
violation (-016); have the watchdog skip StaHung while
CurrentCommandCorrelationId is non-empty so a legitimate slow
ReadBulk no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the
11 GatewaySession bulk methods (-013); replace the real TCP probe
in GalaxyHierarchyCacheTests with an IGalaxyRepository fake
(-016).
- IntegrationTests: drive the StreamEvents writer in the live Write
test and assert OnWriteComplete (-012); add live tests for
Unadvise/RemoveItem/Unregister ordering, WriteSecured, and
abnormal worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
CreateForTesting factory (-016); cover WorkerCancel and
unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races
beforeStart() (-014); return a CancellingCompletableFuture that
actually forwards cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in
the CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
histograms with failed-call durations (-015); add coverage for
the five MalformedReply paths, the bulk-write helpers, the
Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write
command family (-009).

Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
WorkerAlarmRpcDispatcher missing-session handling; drop the
duplicate dashboard @page routes; refresh IAlarmRpcDispatcher
XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
subscriptionExpression / ExecutingCommand arms; preserve
factory-supplied runtime sessions; split MxAlarmSnapshot.cs into
three files.
- Tests: dispose the WebApplication in seven test classes; rebuild
FakeWorkerProcess.WaitForExitAsync against a real
TaskCompletionSource; switch the heartbeat-expires test to
ManualTimeProvider;
add InvariantCulture to the remaining DateTimeOffset.Parse sites;
document GalaxyFilterInputSafetyTests in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
IDisposable, class-level [Trait], single-source ZB default
connection string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact
so absent env vars SKIP not pass; PascalCase rename of probe
[Fact]s; deterministic deadline test; new frame-protocol error
tests; ComputeTransitions diff-coverage; relocate dev-rig probes
to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
TreatWarningsAsErrors / analysers apply; document
DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
bulk-read handles in CLI; surface AcknowledgeAlarm transport
faults through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
runWriteBulkVariant; document the six new subcommands in
writeUsage; drain galaxy-watch events on limit; switch io.EOF
comparisons to errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout
option; regex-based credential redaction; Long.toUnsignedString
for uint64 sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
_percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
_api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close
stop hard-coding correlation IDs; resync RustClientDesign.md
with the current Session / Error surface and CLI subcommand set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+136
-12
@@ -4,25 +4,29 @@
 |---|---|
 | Module | `src/MxGateway.Server` |
 | Reviewer | Claude Code |
-| Review date | 2026-05-18 |
-| Commit reviewed | `6c64030` |
+| Review date | 2026-05-20 |
+| Commit reviewed | `1cd51bb` |
 | Status | Reviewed |
 | Open findings | 0 |

 ## Checklist coverage

+This row summarizes the 2026-05-20 review pass at commit `1cd51bb`. Findings from
+prior passes (Server-001 through Server-014) are all closed and remain below as
+audit history.
+
 | # | Category | Result |
 |---|---|---|
-| 1 | Correctness & logic bugs | Issues found: Server-006 (metrics open-session leak on alarm auto-subscribe failure), Server-010 (rotate reactivates revoked keys). |
-| 2 | mxaccessgw conventions | Issues found: Server-002 (orphan-worker termination on startup not implemented), Server-011 (style deviation in `WorkerAlarmRpcDispatcher`). |
-| 3 | Concurrency & thread safety | No issues found — locking is correct; inconsistent-but-safe discipline in `GatewayMetrics` noted only. |
-| 4 | Error handling & resilience | Issues found: Server-005 (Galaxy first-load can fault the host BackgroundService), Server-009 (SQLite has no busy-timeout/WAL under concurrent writes). |
-| 5 | Security | Issues found: Server-001 (Critical: dashboard authorization never enforced on any route), Server-003 (LDAP dashboard users denied for lack of a scope claim), Server-010. |
-| 6 | Performance & resource management | Issues found: Server-007 (DiscoverHierarchy paging is O(total) per page), Server-008 (WatchDeployEvents re-projects whole hierarchy per event). |
-| 7 | Design-document adherence | Issues found: Server-002 (orphan workers), Server-012 (CLAUDE.md scope names stale vs code/docs). |
-| 8 | Code organization & conventions | Issues found: Server-011 (style), Server-004 (CLI accepts unvalidated scope strings). |
-| 9 | Testing coverage | Issues found: Server-013 (no dashboard route-level authorization test; `WorkerExecutableValidator`, `GalaxyGlobMatcher`, projector paging untested). |
-| 10 | Documentation & comments | Issues found: Server-014 (stale "not yet wired" alarm comments), Server-012. |
+| 1 | Correctness & logic bugs | Issues found: Server-019 (`WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` yields silently when session is missing). |
+| 2 | mxaccessgw conventions | No issues found — convention drift previously called out is resolved; no new gaps observed. |
+| 3 | Concurrency & thread safety | Issues found: Server-015 (`GatewaySession._state` is written under `_closeLock` but read/written elsewhere under `_syncRoot`). |
+| 4 | Error handling & resilience | Issues found: Server-016 (`GatewaySession.DisposeAsync` disposes the close-lock semaphore while it may be held). |
+| 5 | Security | Issues found: Server-017 (`AcknowledgeAlarm` / `QueryActiveAlarms` fall through to admin-only scope because the resolver was not updated for the new alarm RPCs). |
+| 6 | Performance & resource management | Issues found: Server-018 (`GalaxyGlobMatcher` regex cache is unbounded — currently low-risk but uncapped). |
+| 7 | Design-document adherence | No issues found at this pass. |
+| 8 | Code organization & conventions | Issues found: Server-020 (dashboard pages each declare two `@page` directives — `@page "/X"` AND `@page "/dashboard/X"` — producing duplicate routes under the `/dashboard` group prefix). |
+| 9 | Testing coverage | Issues found: Server-021 (`MxAccessGatewayService.ApplyConstraintsAsync` and the new `BulkConstraintPlan` / `ReadBulkConstraintPlan` / `WriteBulkConstraintPlan` / `SubscribeBulkConstraintPlan` merge logic is entirely untested). |
+| 10 | Documentation & comments | Issues found: Server-022 (`IAlarmRpcDispatcher` XML doc still describes the dispatcher as "ships a not-yet-wired default"; stale after Server-014). |

 ## Findings
@@ -235,3 +239,123 @@
**Recommendation:** Update the `AcknowledgeAlarm`/`QueryActiveAlarms` remarks to reflect that `WorkerAlarmRpcDispatcher` is the wired default, and describe its actual GUID-vs-`Provider!Group.Tag` handling.
**Resolution:** Resolved 2026-05-18. Confirmed against source: `SessionServiceCollectionExtensions` registers `WorkerAlarmRpcDispatcher` as `IAlarmRpcDispatcher`, so the "not yet wired" / "empty stream until PR A.2" / "PR A.6/A.7 follow-up" prose in the `AcknowledgeAlarm` and `QueryActiveAlarms` `<remarks>` and inline comments was stale. Rewrote both `<remarks>` blocks and both inline comments to state that DI binds the production `WorkerAlarmRpcDispatcher`, that it routes over the worker pipe IPC, and that `AcknowledgeAlarm` handles a canonical-GUID reference (→ `AcknowledgeAlarmCommand`) and a `Provider!Group.Tag` reference (→ `AcknowledgeAlarmByNameCommand`), with `NotWiredAlarmRpcDispatcher` being only the null fallback. The matching stale `WorkerAlarmRpcDispatcher` class-level XML doc was corrected as part of Server-011. Pure documentation/comment change; no test.

### Server-015

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:8-15,266-308,720-775` |
| Status | Resolved |

**Description:** `GatewaySession` guards its mutable state with two different sync primitives. `TransitionTo`, `MarkFaulted`, `TouchClientActivity`, the `State`/`LastClientActivityAt`/`LeaseExpiresAt`/`FinalFault`/`ActiveEventSubscriberCount` getters, `AttachWorkerClient`, and `IsLeaseExpired` all read/write `_state`, `_finalFault`, `_lastClientActivityAt`, `_leaseExpiresAt`, `_workerClient`, and `_activeEventSubscriberCount` under `_syncRoot`. `CloseAsync` (lines 720-775), however, reads `_state` at line 729 and writes `_state` at lines 736 (`SessionState.Closing`) and 761 (`SessionState.Closed`) while only holding the `_closeLock` `SemaphoreSlim` — `_syncRoot` is never acquired. A concurrent `TransitionTo` or `MarkFaulted` from another thread sees `_state` outside the lock that protects it, and the `State` getter is not guaranteed to observe the `Closing`/`Closed` writes promptly. `SemaphoreSlim.WaitAsync`/`Release` do happen to provide memory barriers in practice, but the locking discipline is split across two primitives, which is fragile and defeats the audit value of "all `_state` access is guarded by `_syncRoot`". Concretely, the race between `CloseAsync` setting `_state = Closing` and a concurrent `TransitionTo(Ready)` is unordered — and `TransitionTo` will happily overwrite `Closing` back to `Ready` because its only guard is "do not overwrite `Closed`/`Faulted`".
**Recommendation:** Make `CloseAsync` mutate `_state` through the existing `TransitionTo(...)` helper (or acquire `_syncRoot` around the reads/writes) so all `_state` access uses the same lock. Either extend `TransitionTo` to accept the `Closing` and `Closed` transitions (it already handles `Faulted`/`Closed` precedence) or refactor `CloseAsync` to call a private `TrySetClosing()` / `MarkClosed()` that locks `_syncRoot`. Add a regression test that forces a `TransitionTo(Ready)` after `CloseAsync` has set `Closing` and asserts the session does not flip back to `Ready`.
**Resolution:** 2026-05-20 — Unified the close path on `_syncRoot`. `GatewaySession.CloseAsync` (`src/MxGateway.Server/Sessions/GatewaySession.cs`) now mutates `_state` only through two private `_syncRoot`-locked helpers — `TryBeginClose` (writes `Closing`, returns the prior `_closeStarted`) and `MarkClosed` (writes `Closed`) — so every `_state` read/write in the session uses the same lock; `_closeLock` keeps its role of serializing concurrent close attempts. `TransitionTo` was tightened to refuse a transition out of `Closing` to anything other than `Closed`/`Faulted` so a late lifecycle callback cannot walk a closing session back to `Ready`. `docs/Sessions.md` updated to describe the unified lock discipline and the extended terminal precedence. Regression tests in `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs`: `TransitionTo_AfterCloseStarted_DoesNotOverwriteClosing` (the named scenario — `BlockingShutdownWorkerClient` parks the close inside `worker.ShutdownAsync` so the test can call `TransitionTo(Ready)` between the `Closing` and `Closed` writes and assert the state stays `Closing`) and `MarkFaulted_AfterCloseCompletes_DoesNotResurrectSession`.
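
Illustration only, not repository code: a minimal Python sketch of the single-lock discipline described in this resolution. `SessionSketch` and its method names are hypothetical stand-ins for `GatewaySession`, `TransitionTo`, `TryBeginClose`, and `MarkClosed`. The rule that `Closing` may only advance to `Closed` or `Faulted` is taken from the resolution text; everything else is an assumption made for the example.

```python
import threading
from enum import Enum


class SessionState(Enum):
    READY = "Ready"
    CLOSING = "Closing"
    CLOSED = "Closed"
    FAULTED = "Faulted"


class SessionSketch:
    """Hypothetical stand-in for GatewaySession; not the real class."""

    _TERMINAL = {SessionState.CLOSED, SessionState.FAULTED}

    def __init__(self):
        self._sync_root = threading.Lock()  # the only lock that guards _state
        self._state = SessionState.READY
        self._close_started = False

    @property
    def state(self):
        with self._sync_root:
            return self._state

    def transition_to(self, new_state):
        """Apply a lifecycle transition; return True if it was accepted."""
        with self._sync_root:
            if self._state in self._TERMINAL:
                return False  # Closed/Faulted are never overwritten
            if self._state is SessionState.CLOSING and new_state not in self._TERMINAL:
                return False  # Closing may only advance to Closed or Faulted
            self._state = new_state
            return True

    def try_begin_close(self):
        """Write Closing under the lock; return whether a close had already started."""
        with self._sync_root:
            already_started = self._close_started
            if not already_started:
                self._close_started = True
                self._state = SessionState.CLOSING
            return already_started

    def mark_closed(self):
        """Write Closed under the same lock (assumed unconditional in this sketch)."""
        with self._sync_root:
            self._state = SessionState.CLOSED
```

A late `transition_to(SessionState.READY)` after `try_begin_close()` returns `False` and leaves the state at `Closing`, which is the scenario the named regression test pins down.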

### Server-016

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:790-797`, `src/MxGateway.Server/Sessions/SessionManager.cs:237-258` |
| Status | Resolved |

**Description:** `GatewaySession.DisposeAsync` synchronously calls `_closeLock.Dispose()` (line 792) without first acquiring the lock and without checking whether a `CloseAsync` is still in flight. The normal call path is `SessionManager.CloseSessionCoreAsync` → `session.CloseAsync(...)` → `RemoveSessionAsync` → `DisposeAsync`, where `DisposeAsync` runs strictly after `CloseAsync` completes. But the `ShutdownAsync` path (`SessionManager.cs:237-258`) and any future caller that disposes a session while another thread is still inside `CloseAsync` will trip `ObjectDisposedException` when the in-flight `CloseAsync` releases the semaphore. The race is narrow today because all `Close`/`Dispose` choreography goes through `SessionManager`, but the class-level contract is broken: nothing on `GatewaySession` documents or enforces "DisposeAsync must not be called concurrently with CloseAsync".
**Recommendation:** In `DisposeAsync`, either (a) take and release `_closeLock` once before disposing it, so the dispose is sequenced after any in-flight close, or (b) replace `_closeLock` disposal with a guard flag and let the semaphore be reclaimed by the finalizer. Document the invariant on the public method. Add a regression test that disposes a session whose `CloseAsync` has not yet completed and asserts no `ObjectDisposedException`.
**Resolution:** 2026-05-20 — Took recommendation (a): `GatewaySession.DisposeAsync` (`src/MxGateway.Server/Sessions/GatewaySession.cs`) now acquires `_closeLock` once before disposing the semaphore so an in-flight `CloseAsync` finishes (its `_closeLock.Release()`) before the dispose tears the semaphore down. The wait is non-cancellable (`CancellationToken.None`) and `ObjectDisposedException` is swallowed at both the wait and the dispose site so double-dispose still completes cleanly. The method's XML doc was extended with a `<remarks>` block stating the invariant. Regression tests in `src/MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs`: `DisposeAsync_WhileCloseInFlight_WaitsForCloseAndDoesNotThrow` (parks `CloseAsync` inside the worker shutdown, calls `DisposeAsync` concurrently, releases shutdown, asserts both complete without `ObjectDisposedException` and the worker is disposed exactly once) and `DisposeAsync_CalledTwice_DoesNotThrow`.
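
Illustration only, not repository code: a minimal Python sketch of option (a), where dispose takes the close lock once so it is sequenced after any in-flight close. `DisposableSession` and `demo` are hypothetical. A `threading.Lock` stands in for the C# `SemaphoreSlim`, and a flag stands in for tearing the semaphore down, since Python locks have no dispose step.

```python
import threading


class DisposableSession:
    """Hypothetical session; the lock stands in for the C# SemaphoreSlim."""

    def __init__(self):
        self._close_lock = threading.Lock()
        self._disposed = False
        self.events = []  # records ordering for the demonstration

    def close(self, shutdown_gate, entered):
        with self._close_lock:
            entered.set()         # signal that close is now in flight
            shutdown_gate.wait()  # simulate a slow worker shutdown
            self.events.append("close finished")

    def dispose(self):
        if self._disposed:
            return  # a second dispose is a no-op
        # Take the close lock once so any in-flight close() releases it first.
        with self._close_lock:
            self._disposed = True
            self.events.append("disposed")


def demo():
    session = DisposableSession()
    gate, entered = threading.Event(), threading.Event()
    closer = threading.Thread(target=session.close, args=(gate, entered))
    closer.start()
    entered.wait()    # close() now holds the lock
    disposer = threading.Thread(target=session.dispose)
    disposer.start()  # blocks until close() releases the lock
    gate.set()        # let the simulated shutdown finish
    closer.join()
    disposer.join()
    session.dispose()  # double-dispose completes cleanly
    return session.events
```

The recorded order shows the dispose step landing strictly after the in-flight close, mirroring what `DisposeAsync_WhileCloseInFlight_WaitsForCloseAndDoesNotThrow` is described as asserting.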

### Server-017

| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:13-27`, `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:173-247`, `docs/Authorization.md:108-110` |
| Status | Resolved |

**Description:** The two new top-level RPCs added to `MxAccessGateway` — `AcknowledgeAlarm(AcknowledgeAlarmRequest)` and `QueryActiveAlarms(QueryActiveAlarmsRequest)` (proto lines 23-24) — are not enumerated by `GatewayGrpcScopeResolver.ResolveRequiredScope`. The resolver's `request switch` covers `OpenSessionRequest`, `CloseSessionRequest`, `StreamEventsRequest`, `MxCommandRequest`, and the four Galaxy-repository requests; everything else falls through to `_ => GatewayScopes.Admin`. The interceptor (`GatewayGrpcAuthorizationInterceptor.AuthenticateAndAuthorizeAsync`) then rejects any non-admin caller with `PermissionDenied`. This is technically fail-closed (and `docs/Authorization.md:108-110` documents the "unrecognized → admin" intent), but in practice it means: (1) only API keys with the `admin` scope can acknowledge alarms or query active alarms, even though acknowledging is naturally an `invoke:write`-shaped operation and querying is naturally an `invoke:read`- or `metadata:read`-shaped operation; (2) the alarm RPCs ship in a state where any client that successfully opened a session and subscribed to alarm events still cannot perform the operational acks the contract advertises; (3) the test matrix `GatewayGrpcScopeResolverTests` does not even cover these two request types, so the gap was not caught at unit-test time.
**Recommendation:** Add explicit arms to `ResolveRequiredScope`: map `AcknowledgeAlarmRequest` to `GatewayScopes.InvokeWrite` (parity with other write actions; ack changes alarm state) and `QueryActiveAlarmsRequest` to `GatewayScopes.MetadataRead` or `GatewayScopes.InvokeRead`. Update `docs/Authorization.md` to list both. Extend `GatewayGrpcScopeResolverTests` with the new mappings and an assertion that every request type defined by `mxaccess_gateway.proto` is named in the resolver (the test can enumerate the assembly's request types so a future RPC cannot quietly add itself only via the admin fallback).
**Resolution:** 2026-05-20 — Added explicit `AcknowledgeAlarmRequest => GatewayScopes.InvokeWrite` and `QueryActiveAlarmsRequest => GatewayScopes.EventsRead` arms to `GatewayGrpcScopeResolver.ResolveRequiredScope` (`src/MxGateway.Server/Security/Authorization/GatewayGrpcScopeResolver.cs:21-22`). `InvokeWrite` matches the existing `MxCommandKind.Write*` mapping because ack mutates alarm state; `EventsRead` matches `StreamEventsRequest` and `MxCommandKind.DrainEvents` because querying active alarms reads the same alarm/event surface. Extended `GatewayGrpcScopeResolverTests` with two new `InlineData` rows covering both request types (`src/MxGateway.Tests/Security/Authorization/GatewayGrpcScopeResolverTests.cs:16-17`) and added four interceptor-level cases in `GatewayGrpcAuthorizationInterceptorTests` (`UnaryServerHandler_AcknowledgeAlarmMissingScope_ReturnsPermissionDenied`, `UnaryServerHandler_AcknowledgeAlarmWithScope_RunsHandler`, `ServerStreamingServerHandler_QueryActiveAlarmsMissingScope_ReturnsPermissionDenied`, `ServerStreamingServerHandler_QueryActiveAlarmsWithScope_RunsHandler`) proving each new RPC denies callers lacking the chosen scope and runs the handler when the scope is held. Updated `docs/Authorization.md` (resolver snippet and Scope Catalog table) to list both RPCs against their scopes. `dotnet test ... --filter FullyQualifiedName~GatewayGrpcAuthorizationInterceptorTests` → 14 passed, 0 failed; resolver tests 28 passed, 0 failed.
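
Illustration only, not repository code: a minimal Python sketch of a fail-closed scope resolver with explicit arms for the alarm RPCs, plus the completeness check suggested in the recommendation. The request classes are empty placeholders, the scope strings are assumed values rather than the gateway's literal constants, and `FutureRpcRequest` is a made-up RPC used to show the admin fallback.

```python
# Placeholder request types; the real ones are generated from the proto.
class StreamEventsRequest: ...
class AcknowledgeAlarmRequest: ...
class QueryActiveAlarmsRequest: ...
class FutureRpcRequest: ...  # hypothetical RPC that nobody has mapped yet


# Assumed scope strings for the sketch.
ADMIN = "admin"
INVOKE_WRITE = "invoke:write"
EVENTS_READ = "events:read"

SCOPE_BY_REQUEST = {
    StreamEventsRequest: EVENTS_READ,
    AcknowledgeAlarmRequest: INVOKE_WRITE,   # ack mutates alarm state
    QueryActiveAlarmsRequest: EVENTS_READ,   # reads the alarm/event surface
}


def resolve_required_scope(request):
    """Return the scope a caller needs; unknown request types fail closed to admin."""
    return SCOPE_BY_REQUEST.get(type(request), ADMIN)


def unmapped_request_types(all_request_types):
    """List request types that would only be reachable through the admin fallback."""
    return [t.__name__ for t in all_request_types if t not in SCOPE_BY_REQUEST]
```

A test that feeds every known request type to `unmapped_request_types` and expects an empty list would flag a new RPC that was added without a resolver arm, which is the gap this finding describes.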

### Server-018

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs:15` |
| Status | Resolved |

**Description:** `GalaxyGlobMatcher.RegexCache` is a `ConcurrentDictionary<string, Regex>` keyed by glob pattern, with no eviction. The fix for Server-008 added this cache deliberately to avoid recompiling the same handful of patterns, but the cache key is the raw glob string. The patterns currently come from two sources — `DiscoverHierarchyRequest.TagNameGlob` (client-supplied) and `ApiKeyConstraints.BrowseSubtrees` / `ReadSubtrees` / `WriteSubtrees` / `ReadTagGlobs` / `WriteTagGlobs` (admin-configured) — and `BuildRegex` also runs each glob through `Regex.Escape` so an attacker cannot craft a denial-of-service ReDoS payload. The leak is therefore bounded only by "how many distinct globs a client can submit over the process lifetime", which is in the millions for `TagNameGlob` if a client iterates through generated names. Each compiled `Regex` also holds a JIT'd assembly that is non-trivial to reclaim.
**Recommendation:** Cap the cache at a small bound (e.g. 256 patterns) using a simple LRU or a `MemoryCache` with sliding expiration, or restrict the cache to globs that originate from API-key constraints (admin-controlled, naturally bounded) and pay the compile cost for client-supplied globs. Add a test that fills the cache with thousands of distinct globs and asserts the cache size stays bounded.
**Resolution:** 2026-05-20 — Capped `GalaxyGlobMatcher`'s compiled-regex cache at `RegexCacheCapacity = 256` entries with FIFO-by-insertion eviction (`src/MxGateway.Server/Galaxy/GalaxyGlobMatcher.cs`). A `ConcurrentQueue<string>` tracks insertion order; when the cache grows past the cap, `EvictIfOverCapacity` takes a small lock and dequeues + removes the oldest entries until the count is back within bound. Reads stay lock-free (the lock guards only the eviction path). Internal `CurrentCacheSize` / `RegexCacheCapacity` accessors are surfaced through the existing `InternalsVisibleTo("MxGateway.Tests")` so tests can assert the bound. Regression test: `GalaxyFilterInputSafetyTests.GlobMatcher_WithManyDistinctPatterns_CacheStaysBounded` submits `RegexCacheCapacity * 4` distinct globs and asserts `CurrentCacheSize` stays in `[0, RegexCacheCapacity]`. Existing glob correctness tests (`GlobMatcher_RepeatedAndInterleavedPatterns_StayCorrect`, the adversarial-input theories) continue to pass, confirming eviction does not corrupt lookups.
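
Illustration only, not repository code: a minimal Python sketch of a compiled-pattern cache capped with FIFO-by-insertion eviction, in the spirit of this resolution. `BoundedGlobCache` is a hypothetical name, and `fnmatch.translate` is used as a stand-in glob-to-regex step, so its matching rules are not necessarily those of `GalaxyGlobMatcher`.

```python
import re
import threading
from collections import deque
from fnmatch import translate


class BoundedGlobCache:
    """Hypothetical cache: at most `capacity` compiled patterns, oldest evicted first."""

    def __init__(self, capacity=256):
        self._capacity = capacity
        self._cache = {}
        self._order = deque()  # insertion order, oldest on the left
        self._lock = threading.Lock()

    def __len__(self):
        return len(self._cache)

    def get(self, glob):
        compiled = self._cache.get(glob)  # fast path: no lock taken on a cache hit
        if compiled is not None:
            return compiled
        compiled = re.compile(translate(glob))
        with self._lock:  # the lock guards insertion and eviction only
            if glob not in self._cache:
                self._cache[glob] = compiled
                self._order.append(glob)
                while len(self._cache) > self._capacity:
                    oldest = self._order.popleft()
                    self._cache.pop(oldest, None)
            return self._cache.get(glob, compiled)
```

An evicted pattern is simply recompiled on its next use, so eviction affects cost, not correctness.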

### Server-019

| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs:183-221` |
| Status | Resolved |

**Description:** `WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` returns `yield break` (line 191) when `sessionRegistry.TryGet(request.SessionId, ...)` fails — it silently produces an empty stream with no diagnostic. The peer `AcknowledgeAsync` instead returns an `AcknowledgeAlarmReply` with `ProtocolStatus.Code = SessionNotFound` (lines 81-89), so the two methods have inconsistent missing-session handling. In production this branch is unreachable because `MxAccessGatewayService.QueryActiveAlarms` calls `ResolveSession(...)` first and throws `NotFound` from the gRPC layer (`MxAccessGatewayService.cs:228`), but: (a) the dispatcher is the seam other code paths might reach in the future, and (b) any unit test that instantiates the dispatcher directly with a missing session id sees an empty stream rather than a clear error, which is a footgun.
**Recommendation:** Either throw a `SessionManagerException(SessionManagerErrorCode.SessionNotFound, ...)` (matching the gRPC service's own resolver) or yield a single `ActiveAlarmSnapshot` with a diagnostic field set, and add a `WorkerAlarmRpcDispatcherTests` case that asserts whichever shape is chosen. Aligning with `AcknowledgeAsync`'s `SessionNotFound` protocol-status pattern is preferred, but `QueryActiveAlarms` is a server-streaming RPC so a thrown `SessionManagerException` propagated by the gateway is the cleaner fit.
**Resolution:** 2026-05-20 — Took the preferred option: `WorkerAlarmRpcDispatcher.QueryActiveAlarmsAsync` (`src/MxGateway.Server/Sessions/WorkerAlarmRpcDispatcher.cs`) now throws `SessionManagerException(SessionManagerErrorCode.SessionNotFound, ...)` instead of `yield break`-ing when the session is missing. `MxAccessGatewayService.MapException` already maps that error code to gRPC `NotFound`, so production callers see a consistent missing-session response and a direct unit-test caller now gets a clear error instead of an empty success. The unary peer `AcknowledgeAsync` continues to surface the same condition as an in-band `ProtocolStatus.Code = SessionNotFound`, which is correct for a unary RPC. Regression test: `WorkerAlarmRpcDispatcherTests.QueryActiveAlarmsAsync_WhenSessionMissing_ThrowsSessionNotFound` replaces the prior `_YieldsEmpty` assertion — it asserts the new exception shape and also exercises `AcknowledgeAsync` with the same missing session id to pin the peer-method parity.
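
Illustration only, not repository code: a minimal Python sketch of why a streaming method should raise on a missing session rather than end the stream. `SessionNotFoundError`, `query_active_alarms`, and `collect` are hypothetical names; the dictionary stands in for the session registry.

```python
class SessionNotFoundError(LookupError):
    """Hypothetical analogue of SessionManagerException(SessionNotFound)."""


def query_active_alarms(sessions, session_id):
    """Yield the active alarms for a session, or fail loudly if it is unknown."""
    if session_id not in sessions:
        # Before the fix, the equivalent branch simply ended the stream.
        raise SessionNotFoundError(session_id)
    yield from sessions[session_id]


def collect(sessions, session_id):
    """Drain the stream, reporting a missing session distinctly from 'no alarms'."""
    try:
        return ("ok", list(query_active_alarms(sessions, session_id)))
    except SessionNotFoundError:
        return ("session-not-found", [])


sessions = {"s1": ["TankHigh"], "s2": []}
```

With the raise in place, a session that exists but has no active alarms (`"s2"`) and a session that does not exist at all produce different outcomes; with a silent early exit the two would be indistinguishable. In Python the error surfaces when the generator is first iterated, not when it is created.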

### Server-020

| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor:1-2`, `…/GalaxyPage.razor:1-2`, `…/ApiKeysPage.razor:1-2`, `…/EventsPage.razor:1-2`, `…/SessionsPage.razor:1-2`, `…/WorkersPage.razor:1-2`, `…/SettingsPage.razor:1-2`, `…/SessionDetailsPage.razor:1-2` |
| Status | Resolved |

**Description:** Every dashboard page declares two `@page` directives — `@page "/X"` AND `@page "/dashboard/X"` — even though `DashboardEndpointRouteBuilderExtensions.MapGatewayDashboard` mounts the Razor components under a `RouteGroupBuilder` with `pathBase = "/dashboard"`. The group prefix is prepended to each `@page` route, so the actual endpoints become `/dashboard/X` (from `@page "/X"`) **and** `/dashboard/dashboard/X` (from `@page "/dashboard/X"`). The pages are reachable at two URLs each, and the deeper one (`/dashboard/dashboard/sessions` etc.) is almost certainly accidental — it leaks the path-base name into the URL and creates duplicate authorize/render work per route. `GatewayApplicationTests.Build_WhenDashboardEnabled_ComponentRoutesRequireAuthorization` only checks the `/dashboard/X` shape, so the duplicate route slipped through without an assertion.
**Recommendation:** Drop the `@page "/dashboard/X"` directive from each page; rely on the `MapGroup("/dashboard")` to provide the prefix. Or, if the team genuinely wants both URL shapes, document the choice in the file header and extend the route-enumeration test to assert that **both** are present (and both carry the authorization policy). Either way, the current setup is non-obvious.
**Resolution:** 2026-05-20 — Took the recommended drop: removed the redundant `@page "/dashboard/X"` directive from every dashboard Razor page (`DashboardHome.razor`, `SessionsPage.razor`, `WorkersPage.razor`, `EventsPage.razor`, `GalaxyPage.razor`, `SettingsPage.razor`, `ApiKeysPage.razor`, `SessionDetailsPage.razor`). Each page now declares only its bare route (e.g. `@page "/sessions"`); `DashboardEndpointRouteBuilderExtensions.MapGatewayDashboard` continues to prepend `/dashboard` via `MapGroup`, so each page is reachable at exactly one URL (`/dashboard/X`). Regression test: `GatewayApplicationTests.Build_WhenDashboardEnabled_DoesNotRegisterDoubledDashboardPrefixRoutes` enumerates the eight previously-doubled routes (`/dashboard/dashboard/`, `/dashboard/dashboard/sessions`, ... `/dashboard/dashboard/sessions/{SessionId}`) and asserts none of them are mapped. The existing `..._MapsBlazorDashboardAndAuthEndpoints` / `..._ComponentRoutesRequireAuthorization` tests continue to verify the desired `/dashboard/X` shapes are still present and policy-gated. No public URL contract changed (the doubled shape was accidental); no doc update needed — `gateway.md` and `docs/GatewayDashboardDesign.md` never referenced the doubled routes.

### Server-021

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/MxGateway.Server/Grpc/MxAccessGatewayService.cs:266-664`, `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs` |
| Status | Resolved |

**Description:** The 1cd51bb commit history (the bulk read/write series, `f220908`/`5e375f6`/`758aca2`) added 473 lines of constraint-filtering and reply-merging logic to `MxAccessGatewayService`: `ApplyConstraintsAsync` (line 266), `EnforceReadTagAsync` / `EnforceWriteHandleAsync`, `FilterTagBulkAsync` / `FilterReadBulkAsync` / `FilterWriteBulkAsync` / `FilterHandleBulkAsync`, the `ReplaceWriteBulkEntries` switch, and three concrete `BulkConstraintPlan` records (`SubscribeBulkConstraintPlan`, `WriteBulkConstraintPlan`, `ReadBulkConstraintPlan`) that splice denied entries back into the worker's allowed-only reply in original-index order. None of this is covered by `MxAccessGatewayServiceTests` — its `FakeSessionManager` is wired with an `AllowAllConstraintEnforcer` (line 430) that never denies anything, so every constraint-related code path is dead at test time. A subtle off-by-one in `BuildMerged`, a wrong `PayloadOneofCase` in `GetPayload` / `SetPayload`, or a missing case in `ReplaceWriteBulkEntries` would all ship without a test failure.
**Recommendation:** Add `MxAccessGatewayServiceTests` cases that inject a deny-on-glob `IConstraintEnforcer` and exercise: (1) `AddItemBulk` / `SubscribeBulk` / `AdviseItemBulk` with a mix of allowed and denied tags, asserting `BulkSubscribeReply.Results` interleaves denied and worker-allowed entries in original-index order; (2) the same for `ReadBulk` and each of the four bulk-write commands; (3) `HasAllowedItems == false` so `CreateDeniedReply` is exercised (no worker call); (4) the unary `Write`/`Write2`/`WriteSecured`/`WriteSecured2` paths through `EnforceWriteHandleAsync`. The fixtures can reuse the existing `FakeSessionManager` by replacing the constraint enforcer; no live worker is needed.
**Resolution:** 2026-05-20 — Added a configurable `PredicateConstraintEnforcer` test double (`src/MxGateway.Tests/TestSupport/PredicateConstraintEnforcer.cs`) that denies on per-tag and per-handle predicates and records denials. Added 11 new tests in `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceConstraintTests.cs` covering: (1) `AddItemBulk` with mixed denials — asserts the worker is called once with only the allowed subset and the merged reply interleaves denied and worker-allowed `SubscribeResult`s at their original indices; (2) `SubscribeBulk` with every tag denied — asserts `HasAllowedItems == false` short-circuits to `CreateDeniedReply` and the session manager is never invoked; (3) `AdviseItemBulk` (handle-keyed denial via `CheckReadHandleAsync`); (4) `SubscribeBulk` with the allow-all enforcer — pass-through regression guard; (5) `ReadBulk` partial denial — asserts the `ReadBulkConstraintPlan` produces a `BulkReadReply` (not a `BulkSubscribeReply`) with denied entries spliced in at their original indices; (6) `ReadBulk` all-denied short-circuit; (7) `WriteBulk` partial denial — asserts denied entries are dropped from the forwarded `Entries` and the merged reply preserves original-index order; (8) `WriteSecuredBulk` all-denied — proves the second `ReplaceWriteBulkEntries` switch arm is reachable; (9) unary `Write` with denied handle → `PermissionDenied`, no worker call, denial recorded; (10) unary `WriteSecured` with denied handle → `PermissionDenied`; (11) unary `AddItem` with denied tag → `PermissionDenied` (`EnforceReadTagAsync`). `MxAccessGatewayServiceTests.CreateService` updated to accept an `IConstraintEnforcer` so future tests can opt into the deny enforcer without duplicating the wiring. All 11 new tests pass; full suite (`dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj`) is green at 458 passing.
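
Illustration only, not repository code: a minimal Python sketch of the split-and-merge behaviour these tests exercise, where denied entries are withheld from the worker and then spliced back into the reply at their original indices. `split_by_constraint` and `merge_results` are hypothetical helpers, and the tag names and result strings are made up for the example.

```python
def split_by_constraint(tags, is_allowed):
    """Partition tags into allowed and denied lists, keeping each original index."""
    allowed = [(i, t) for i, t in enumerate(tags) if is_allowed(t)]
    denied = [(i, t) for i, t in enumerate(tags) if not is_allowed(t)]
    return allowed, denied


def merge_results(total, allowed, worker_results, denied, make_denied):
    """Rebuild a reply of length `total` in original-index order."""
    if len(worker_results) != len(allowed):
        raise ValueError("worker must return exactly one result per allowed entry")
    merged = [None] * total
    for (index, _tag), result in zip(allowed, worker_results):
        merged[index] = result  # worker results arrive in allowed-only order
    for index, tag in denied:
        merged[index] = make_denied(tag)  # denied entries never reached the worker
    return merged


tags = ["Tank1.Level", "Secret.Key", "Tank2.Level", "Secret.Pin"]
allowed, denied = split_by_constraint(tags, lambda t: not t.startswith("Secret."))

# Stand-in for the worker call: it only ever sees the allowed subset.
worker_results = [f"ok:{tag}" for _index, tag in allowed]

merged = merge_results(len(tags), allowed, worker_results, denied,
                       lambda tag: f"denied:{tag}")
```

When `allowed` is empty the worker call can be skipped entirely and the reply built from the denied list alone, which corresponds to the all-denied short-circuit cases listed above.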

### Server-022

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs:8-29` |
| Status | Resolved |

**Description:** Server-014's resolution noted that the stale "PR A.6 / A.7" / "not yet wired" language was rewritten on `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms` and on the `WorkerAlarmRpcDispatcher` class doc. The corresponding XML doc on the **interface** `IAlarmRpcDispatcher` (lines 8-29) still says it is "PR A.6 / A.7 — gateway-side dispatcher" and that "Production implementations live in `WorkerAlarmRpcDispatcher` (this PR ships a not-yet-wired default that returns a clear worker-pending diagnostic)". That second clause directly contradicts the now-correct comments on the concrete implementations and on the gRPC service: `WorkerAlarmRpcDispatcher` is the wired default, not a not-yet-wired one. A reader who finds the interface first will believe the dispatcher is non-functional.
**Recommendation:** Rewrite the `IAlarmRpcDispatcher` `<remarks>` block to match the language now used on `WorkerAlarmRpcDispatcher` and on the gRPC service: DI binds `WorkerAlarmRpcDispatcher` by default; `NotWiredAlarmRpcDispatcher` is only the null fallback for tests/DI omission. Drop the "PR A.6 / A.7" prefix from the `<summary>` — the interface is now the public alarm-RPC seam.
**Resolution:** 2026-05-20 — Rewrote `IAlarmRpcDispatcher`'s `<summary>` and `<remarks>` (`src/MxGateway.Server/Sessions/IAlarmRpcDispatcher.cs`) to match the language now used on `WorkerAlarmRpcDispatcher` and on `MxAccessGatewayService.AcknowledgeAlarm` / `QueryActiveAlarms`: dropped the stale "PR A.6 / A.7" prefix from the summary, and replaced the "this PR ships a not-yet-wired default that returns a clear worker-pending diagnostic" clause with the correct statement that DI binds the production `WorkerAlarmRpcDispatcher` by default and `NotWiredAlarmRpcDispatcher` is only the null fallback for DI omission / standalone tests. Pure documentation change; no test.