diff --git a/code-reviews/Server/findings.md b/code-reviews/Server/findings.md index 5087c06..4083d7b 100644 --- a/code-reviews/Server/findings.md +++ b/code-reviews/Server/findings.md @@ -7,7 +7,7 @@ | Review date | 2026-05-18 | | Commit reviewed | `6c64030` | | Status | Reviewed | -| Open findings | 12 | +| Open findings | 8 | ## Checklist coverage @@ -48,13 +48,13 @@ | Severity | Medium | | Category | Design-document adherence | | Location | `src/MxGateway.Server/Program.cs:24`, `src/MxGateway.Server/GatewayApplication.cs` | -| Status | Open | +| Status | Resolved | **Description:** `gateway.md:583` and CLAUDE.md state the first version "terminates orphaned workers on startup." No code in MxGateway.Server enumerates or kills leftover `MxGateway.Worker.exe` processes at startup — a grep for `orphan`/`reattach`/`terminate` finds nothing. After an unclean gateway crash, x86 worker processes (each holding an MXAccess COM instance) leak and survive indefinitely, and a restarted gateway does not reclaim or kill them. **Recommendation:** Add a startup hosted service that finds and kills stale worker processes (by executable path / a well-known argument or environment marker) before the server accepts sessions, or update the design docs if reattachment/cleanup is deliberately deferred. -**Resolution:** _(open)_ +**Resolution:** Resolved 2026-05-18. Confirmed against source: no code path enumerated or killed leftover workers. Added `IRunningProcessInspector` / `SystemRunningProcessInspector` (a testable seam over `Process.GetProcessesByName`/`Kill`), `OrphanWorkerTerminator` (kills processes matched by the configured worker executable path, or by image name when the x64 gateway cannot introspect the x86 worker's `MainModule`, skipping the current process and tolerating per-process kill failures), and `OrphanWorkerCleanupHostedService` (best-effort `IHostedService`). The hosted service is registered in `AddWorkerProcessLauncher` ahead of `AddGatewaySessions` so cleanup runs before the server accepts sessions. `gateway.md` updated to describe the implemented behavior. Regression tests: `OrphanWorkerTerminatorTests` (`KillsWorkerProcessesMatchingConfiguredExecutablePath`, `KillsImageNameMatchWhenExecutablePathUnreadable`, `DoesNotKillUnrelatedProcessSharingImageName`, `DoesNotKillCurrentProcess`, `ContinuesWhenOneKillThrows`). ### Server-003 @@ -78,13 +78,13 @@ | Severity | Medium | | Category | Code organization & conventions | | Location | `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs:227-233`, `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCliRunner.cs:53-77`, `src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:21-67` | -| Status | Open | +| Status | Resolved | **Description:** `ParseScopes` accepts any comma-separated strings and `CreateKeyAsync` persists them verbatim; neither the CLI nor the dashboard create path validates scopes against `GatewayScopes`. A typo or non-canonical name (e.g. CLAUDE.md's example `--scopes session,invoke,event,metadata,admin`, which does not match the resolver's `session:open`/`invoke:read`/etc.) silently creates a key whose scope strings the authorization resolver never checks for — the key is unusable for those RPCs with no error at creation time. **Recommendation:** Validate every requested scope against the `GatewayScopes` catalog at create time in both the CLI parser/runner and `DashboardApiKeyManagementService.ValidateCreateRequest`, rejecting unknown scope strings. -**Resolution:** _(open)_ +**Resolution:** Resolved 2026-05-18. Confirmed against source: `ParseScopes` split unvalidated strings into the create command and `ValidateCreateRequest` checked only key id and display name. Added `GatewayScopes.All` (the canonical scope catalog) and `GatewayScopes.IsKnown(string)`. `ApiKeyAdminCommandLineParser.Parse` now runs `ValidateScopes` for create-key commands and fails the parse listing the unknown scope(s) and valid set; `DashboardApiKeyManagementService.ValidateCreateRequest` rejects requests carrying any non-canonical scope. Revoke/rotate paths are unaffected (no scope input). Regression tests: `ApiKeyAdminCommandLineParserTests.Parse_CreateKeyCommand_RejectsUnknownScope`, `Parse_CreateKeyCommand_AcceptsAllCanonicalScopes`, and `DashboardApiKeyManagementServiceTests.CreateAsync_UnknownScope_DoesNotCallStore`. ### Server-005 @@ -93,13 +93,13 @@ | Severity | Medium | | Category | Error handling & resilience | | Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs:22-28`, `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:184` | -| Status | Open | +| Status | Resolved | **Description:** `GalaxyHierarchyCache.RefreshCoreAsync` only catches `SqlException` and `InvalidOperationException`. The initial `cache.RefreshAsync` call in `GalaxyHierarchyRefreshService.ExecuteAsync` is wrapped only for `OperationCanceledException`. A transient non-`SqlException` failure on the first refresh (e.g. a `Win32Exception`/`TimeoutException` from connection establishment, or another `DbException` subtype) escapes both layers, faults the `BackgroundService`, and — with default host behavior — stops the whole gateway. The periodic-tick loop does catch general exceptions, so only the first load is exposed. **Recommendation:** Broaden the `catch` in `RefreshCoreAsync` to all non-cancellation exceptions (record `Unavailable`/`Stale` and still complete `_firstLoad`), or wrap the initial `RefreshAsync` in `GalaxyHierarchyRefreshService` with the same general `catch` the tick loop uses. -**Resolution:** _(open)_ +**Resolution:** Resolved 2026-05-18. Confirmed against source: the initial `RefreshAsync` in `ExecuteAsync` was guarded only for `OperationCanceledException`, and `RefreshCoreAsync` filtered its catch to `SqlException or InvalidOperationException`. Both recommended layers applied: `GalaxyHierarchyRefreshService.ExecuteAsync` now catches every non-cancellation exception on the initial load (logs a warning; the periodic tick retries), and `GalaxyHierarchyCache.RefreshCoreAsync` broadens its catch to all non-cancellation exceptions so the cache still records `Stale`/`Unavailable` and completes `_firstLoad`. The now-unused `Microsoft.Data.SqlClient` using was removed. Regression test: `GalaxyHierarchyRefreshServiceTests.ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFaultBackgroundService`. ### Server-006 @@ -108,13 +108,13 @@ | Severity | Medium | | Category | Correctness & logic bugs | | Location | `src/MxGateway.Server/Sessions/SessionManager.cs:84-114` | -| Status | Open | +| Status | Resolved | **Description:** In `OpenSessionAsync`, `_metrics.SessionOpened()` (line 89) increments the `_openSessions` gauge before `TryAutoSubscribeAlarmsAsync` runs. If auto-subscribe throws (which it does when `Alarms.RequireSubscribeOnOpen` is true and the worker rejects the subscription), the `catch` block disposes and removes the session and records `_metrics.Fault(...)` but never calls `SessionClosed`/`SessionRemoved`. The `mxgateway.sessions.open` gauge permanently over-counts by one for every such failed open. **Recommendation:** In the `catch` block, when the session had reached the point where `SessionOpened()` was recorded, also call `_metrics.SessionRemoved()` — or move the `SessionOpened()` call to after auto-subscribe succeeds. -**Resolution:** _(open)_ +**Resolution:** Resolved 2026-05-18. Confirmed against source: the `catch` block in `OpenSessionAsync` recorded `Fault(...)` and removed the session but never decremented the open-session gauge after `SessionOpened()` had run. Added a `sessionOpenedRecorded` flag set immediately after `_metrics.SessionOpened()`; the `catch` block now calls `_metrics.SessionRemoved()` when that flag is set, restoring the gauge for a post-`SessionOpened()` failure (e.g. an auto-subscribe rejection with `RequireSubscribeOnOpen=true`). Regression test: `SessionManagerAlarmAutoSubscribeTests.OpenSessionAsync_DoesNotLeakOpenSessionGauge_WhenAutoSubscribeFailsWithRequireOn`. ### Server-007 diff --git a/gateway.md b/gateway.md index 931e0ee..a22300c 100644 --- a/gateway.md +++ b/gateway.md @@ -579,8 +579,11 @@ Policy: - command exceptions return structured command fault with HRESULT if known, - stale sessions are closed by lease timeout, - stuck workers are killed by process id, -- gateway restart should not attempt to reattach old workers unless explicitly - designed; first version should terminate orphaned workers on startup. +- gateway restart does not reattach old workers; `OrphanWorkerCleanupHostedService` + runs `OrphanWorkerTerminator` once on startup — before the server accepts + sessions — to kill leftover `MxGateway.Worker.exe` processes (matched by the + configured worker executable path, or by image name when the x64 gateway cannot + introspect the x86 worker's module) left behind by a previous unclean run. Because each client owns one worker, a crash or leak affects only that session. diff --git a/src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs b/src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs index 50758cf..485952a 100644 --- a/src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs +++ b/src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs @@ -1,6 +1,7 @@ using System.Security.Claims; using Microsoft.Data.Sqlite; using MxGateway.Server.Security.Authentication; +using MxGateway.Server.Security.Authorization; namespace MxGateway.Server.Dashboard; @@ -171,6 +172,15 @@ public sealed class DashboardApiKeyManagementService( return "Display name is required."; } + string[] unknownScopes = request.Scopes + .Where(scope => !GatewayScopes.IsKnown(scope)) + .ToArray(); + if (unknownScopes.Length > 0) + { + return $"Unknown scope(s): {string.Join(", ", unknownScopes)}. " + + $"Valid scopes are: {string.Join(", ", GatewayScopes.All)}."; + } + return null; } diff --git a/src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs b/src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs index afc1a06..fe2db2b 100644 --- a/src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs +++ b/src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs @@ -1,5 +1,4 @@ using Google.Protobuf.WellKnownTypes; -using Microsoft.Data.SqlClient; using Microsoft.Extensions.Logging; using MxGateway.Contracts.Proto.Galaxy; using MxGateway.Server.Dashboard; @@ -181,8 +180,13 @@ public sealed class GalaxyHierarchyCache : IGalaxyHierarchyCache { throw; } - catch (Exception exception) when (exception is SqlException or InvalidOperationException) + catch (Exception exception) { + // Catch every non-cancellation failure — not just SqlException / + // InvalidOperationException. A TimeoutException or Win32Exception + // from connection establishment, or another DbException subtype, + // must still degrade gracefully to Stale/Unavailable and complete + // _firstLoad rather than escape and fault the refresh BackgroundService. _logger?.LogWarning(exception, "Galaxy hierarchy cache refresh failed."); GalaxyHierarchyCacheEntry failed = previous with { diff --git a/src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs b/src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs index 629a601..b629eb3 100644 --- a/src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs +++ b/src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs @@ -26,6 +26,15 @@ public sealed class GalaxyHierarchyRefreshService( { return; } + catch (Exception exception) + { + // A transient first-load failure (e.g. a TimeoutException or + // Win32Exception from connection establishment, or a DbException + // subtype the cache does not catch) must not fault this + // BackgroundService and stop the whole gateway. The cache records + // its own Unavailable/Stale status; the periodic tick below retries. + logger.LogWarning(exception, "Initial Galaxy hierarchy cache load failed; will retry on the refresh interval."); + } using PeriodicTimer timer = new(interval, _timeProvider); try diff --git a/src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs b/src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs index 235edbb..3eb3073 100644 --- a/src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs +++ b/src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs @@ -1,3 +1,5 @@ +using MxGateway.Server.Security.Authorization; + namespace MxGateway.Server.Security.Authentication; public static class ApiKeyAdminCommandLineParser @@ -95,6 +97,12 @@ public static class ApiKeyAdminCommandLineParser return ApiKeyAdminParseResult.Fail(validationError); } + string? scopeError = ValidateScopes(kind, scopes); + if (scopeError is not null) + { + return ApiKeyAdminParseResult.Fail(scopeError); + } + return ApiKeyAdminParseResult.Success(new ApiKeyAdminCommand( Kind: kind, Json: json, @@ -152,6 +160,23 @@ public static class ApiKeyAdminCommandLineParser return null; } + private static string? ValidateScopes(ApiKeyAdminCommandKind kind, IReadOnlySet scopes) + { + if (kind != ApiKeyAdminCommandKind.CreateKey) + { + return null; + } + + string[] unknown = scopes.Where(scope => !GatewayScopes.IsKnown(scope)).ToArray(); + if (unknown.Length == 0) + { + return null; + } + + return $"Unknown scope(s): {string.Join(", ", unknown)}. " + + $"Valid scopes are: {string.Join(", ", GatewayScopes.All)}."; + } + private static string KindName(ApiKeyAdminCommandKind kind) { return kind switch diff --git a/src/MxGateway.Server/Security/Authorization/GatewayScopes.cs b/src/MxGateway.Server/Security/Authorization/GatewayScopes.cs index f2205da..3c7be03 100644 --- a/src/MxGateway.Server/Security/Authorization/GatewayScopes.cs +++ b/src/MxGateway.Server/Security/Authorization/GatewayScopes.cs @@ -10,4 +10,28 @@ public static class GatewayScopes public const string EventsRead = "events:read"; public const string MetadataRead = "metadata:read"; public const string Admin = "admin"; + + /// + /// The complete catalog of canonical scope strings the gateway authorization + /// resolver recognizes. Key-creation paths (CLI and dashboard) validate requested + /// scopes against this set so a typo or non-canonical name cannot persist a key + /// whose scope strings the resolver never matches. + /// + public static readonly IReadOnlySet All = new HashSet( + [ + SessionOpen, + SessionClose, + InvokeRead, + InvokeWrite, + InvokeSecure, + EventsRead, + MetadataRead, + Admin, + ], + System.StringComparer.Ordinal); + + /// Determines whether the supplied scope string is a recognized canonical scope. + /// Scope string to check. + /// when the scope is canonical; otherwise . + public static bool IsKnown(string scope) => All.Contains(scope); } diff --git a/src/MxGateway.Server/Sessions/SessionManager.cs b/src/MxGateway.Server/Sessions/SessionManager.cs index 3ea5a3f..7fbd22d 100644 --- a/src/MxGateway.Server/Sessions/SessionManager.cs +++ b/src/MxGateway.Server/Sessions/SessionManager.cs @@ -68,6 +68,7 @@ public sealed class SessionManager : ISessionManager EnsureSessionCapacity(); GatewaySession? session = null; + bool sessionOpenedRecorded = false; try { session = CreateSession(request, clientIdentity); @@ -86,6 +87,7 @@ public sealed class SessionManager : ISessionManager session.AttachWorkerClient(workerClient); session.MarkReady(); _metrics.SessionOpened(); + sessionOpenedRecorded = true; await TryAutoSubscribeAlarmsAsync(session, cancellationToken).ConfigureAwait(false); @@ -100,6 +102,14 @@ public sealed class SessionManager : ISessionManager await session.DisposeAsync().ConfigureAwait(false); } + // If SessionOpened() already incremented the open-session gauge, + // a failure after that point (e.g. auto-subscribe rejection) must + // decrement it again so mxgateway.sessions.open does not leak. + if (sessionOpenedRecorded) + { + _metrics.SessionRemoved(); + } + ReleaseSessionSlot(); _metrics.Fault(SessionManagerErrorCode.OpenFailed.ToString()); _logger.LogWarning( diff --git a/src/MxGateway.Server/Workers/IRunningProcessInspector.cs b/src/MxGateway.Server/Workers/IRunningProcessInspector.cs new file mode 100644 index 0000000..f12bf76 --- /dev/null +++ b/src/MxGateway.Server/Workers/IRunningProcessInspector.cs @@ -0,0 +1,29 @@ +namespace MxGateway.Server.Workers; + +/// +/// Abstraction over OS process enumeration and termination. Exists so the +/// orphan-worker cleanup logic can be unit-tested without spawning real +/// processes. +/// +public interface IRunningProcessInspector +{ + /// + /// Enumerates currently running processes whose image name (without the + /// .exe extension) matches . + /// + /// Process image name to match, without extension. + /// The matching running processes. + IReadOnlyList GetProcessesByName(string processName); + + /// Forcibly terminates the process with the given identifier. + /// Identifier of the process to terminate. + void Kill(int processId); +} + +/// Identifying information for a running process candidate. +/// Operating-system process identifier. +/// +/// Fully-qualified path to the process main module, or +/// when it could not be read (e.g. access denied). +/// +public sealed record RunningProcessInfo(int ProcessId, string? ExecutablePath); diff --git a/src/MxGateway.Server/Workers/OrphanWorkerCleanupHostedService.cs b/src/MxGateway.Server/Workers/OrphanWorkerCleanupHostedService.cs new file mode 100644 index 0000000..8b71d11 --- /dev/null +++ b/src/MxGateway.Server/Workers/OrphanWorkerCleanupHostedService.cs @@ -0,0 +1,30 @@ +namespace MxGateway.Server.Workers; + +/// +/// Hosted service that terminates leftover MXAccess worker processes once on +/// gateway startup, before the server begins accepting sessions. +/// +public sealed class OrphanWorkerCleanupHostedService( + OrphanWorkerTerminator terminator, + ILogger logger) : IHostedService +{ + /// + public Task StartAsync(CancellationToken cancellationToken) + { + try + { + terminator.TerminateOrphans(); + } + catch (Exception exception) + { + // Orphan cleanup is best-effort; a failure here must not prevent + // the gateway from starting. + logger.LogWarning(exception, "Orphan worker cleanup failed on startup."); + } + + return Task.CompletedTask; + } + + /// + public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask; +} diff --git a/src/MxGateway.Server/Workers/OrphanWorkerTerminator.cs b/src/MxGateway.Server/Workers/OrphanWorkerTerminator.cs new file mode 100644 index 0000000..a344e8b --- /dev/null +++ b/src/MxGateway.Server/Workers/OrphanWorkerTerminator.cs @@ -0,0 +1,138 @@ +using Microsoft.Extensions.Logging.Abstractions; +using Microsoft.Extensions.Options; +using MxGateway.Server.Configuration; +using MxGateway.Server.Metrics; + +namespace MxGateway.Server.Workers; + +/// +/// Terminates leftover MXAccess worker processes on gateway startup. +/// +/// Per gateway.md ("first version should terminate orphaned workers +/// on startup") and CLAUDE.md, a gateway restart does not reattach old +/// workers. After an unclean gateway crash, x86 worker processes — each +/// holding an MXAccess COM instance on an STA — survive indefinitely. This +/// terminator finds those processes by executable name/path and kills them +/// before the restarted gateway accepts sessions. +/// +/// +public sealed class OrphanWorkerTerminator +{ + private readonly IRunningProcessInspector _inspector; + private readonly GatewayMetrics _metrics; + private readonly WorkerOptions _workerOptions; + private readonly ILogger _logger; + + /// Initializes a new instance of the class. + /// Gateway configuration options. + /// Running-process inspector. + /// Gateway metrics collector. + /// Optional logger for diagnostic output. + public OrphanWorkerTerminator( + IOptions gatewayOptions, + IRunningProcessInspector inspector, + GatewayMetrics metrics, + ILogger? logger = null) + { + ArgumentNullException.ThrowIfNull(gatewayOptions); + _inspector = inspector ?? throw new ArgumentNullException(nameof(inspector)); + _metrics = metrics ?? throw new ArgumentNullException(nameof(metrics)); + _workerOptions = gatewayOptions.Value.Worker; + _logger = logger ?? NullLogger.Instance; + } + + /// + /// Finds and kills every leftover worker process. Safe to call once at + /// startup before any session-owned worker is launched. + /// + /// The number of orphan worker processes that were terminated. + public int TerminateOrphans() + { + string? configuredPath = ResolveConfiguredExecutablePath(); + string processName = ResolveProcessName(configuredPath); + int currentProcessId = Environment.ProcessId; + + int terminated = 0; + foreach (RunningProcessInfo candidate in _inspector.GetProcessesByName(processName)) + { + if (candidate.ProcessId == currentProcessId) + { + continue; + } + + if (!IsOrphanWorker(candidate, configuredPath)) + { + continue; + } + + try + { + _inspector.Kill(candidate.ProcessId); + _metrics.WorkerKilled("OrphanStartupCleanup"); + terminated++; + _logger.LogWarning( + "Terminated orphan worker process {ProcessId} ({ExecutablePath}) left over from a previous gateway run.", + candidate.ProcessId, + candidate.ExecutablePath ?? processName); + } + catch (Exception exception) + { + // The process may have already exited, or be inaccessible. + // A failure to kill one orphan must not block gateway startup. + _logger.LogWarning( + exception, + "Failed to terminate orphan worker process {ProcessId}.", + candidate.ProcessId); + } + } + + if (terminated > 0) + { + _logger.LogInformation("Terminated {Count} orphan worker process(es) on startup.", terminated); + } + + return terminated; + } + + private static bool IsOrphanWorker(RunningProcessInfo candidate, string? configuredPath) + { + // When the executable path is readable, require an exact match against + // the configured worker path so unrelated processes that merely share + // the image name are never killed. + if (candidate.ExecutablePath is { } path) + { + return configuredPath is not null + && string.Equals(path, configuredPath, StringComparison.OrdinalIgnoreCase); + } + + // A null path means the x64 gateway could not introspect the module — + // the expected case for the x86 worker. Image-name match is the only + // signal available; treat it as an orphan. + return true; + } + + private string? ResolveConfiguredExecutablePath() + { + try + { + return Path.GetFullPath(_workerOptions.ExecutablePath); + } + catch (Exception exception) when (exception is ArgumentException + or NotSupportedException + or PathTooLongException) + { + _logger.LogWarning( + exception, + "Configured worker executable path '{ExecutablePath}' is not a valid filesystem path; " + + "orphan cleanup will match by image name only.", + _workerOptions.ExecutablePath); + return null; + } + } + + private static string ResolveProcessName(string? configuredPath) + { + string source = configuredPath ?? "MxGateway.Worker.exe"; + return Path.GetFileNameWithoutExtension(source); + } +} diff --git a/src/MxGateway.Server/Workers/SystemRunningProcessInspector.cs b/src/MxGateway.Server/Workers/SystemRunningProcessInspector.cs new file mode 100644 index 0000000..de5ed61 --- /dev/null +++ b/src/MxGateway.Server/Workers/SystemRunningProcessInspector.cs @@ -0,0 +1,55 @@ +using System.Diagnostics; + +namespace MxGateway.Server.Workers; + +/// +/// backed by . +/// +public sealed class SystemRunningProcessInspector : IRunningProcessInspector +{ + /// + public IReadOnlyList GetProcessesByName(string processName) + { + List results = []; + Process[] processes = Process.GetProcessesByName(processName); + try + { + foreach (Process process in processes) + { + results.Add(new RunningProcessInfo(process.Id, TryGetExecutablePath(process))); + } + } + finally + { + foreach (Process process in processes) + { + process.Dispose(); + } + } + + return results; + } + + /// + public void Kill(int processId) + { + using Process process = Process.GetProcessById(processId); + process.Kill(entireProcessTree: true); + } + + private static string? TryGetExecutablePath(Process process) + { + try + { + return process.MainModule?.FileName; + } + catch (Exception exception) when (exception is InvalidOperationException + or System.ComponentModel.Win32Exception + or NotSupportedException) + { + // Access to the main module can be denied (e.g. a 64-bit gateway + // querying a 32-bit worker, or a process owned by another user). + return null; + } + } +} diff --git a/src/MxGateway.Server/Workers/WorkerServiceCollectionExtensions.cs b/src/MxGateway.Server/Workers/WorkerServiceCollectionExtensions.cs index d387573..062cb40 100644 --- a/src/MxGateway.Server/Workers/WorkerServiceCollectionExtensions.cs +++ b/src/MxGateway.Server/Workers/WorkerServiceCollectionExtensions.cs @@ -11,6 +11,13 @@ public static class WorkerServiceCollectionExtensions services.AddSingleton(); services.AddSingleton(); + // Terminate workers leaked by a previous unclean gateway run before the + // server accepts sessions. Registered ahead of AddGatewaySessions so the + // cleanup hosted service starts before the session subsystem. + services.AddSingleton(); + services.AddSingleton(); + services.AddHostedService(); + return services; } } diff --git a/src/MxGateway.Tests/Galaxy/GalaxyHierarchyRefreshServiceTests.cs b/src/MxGateway.Tests/Galaxy/GalaxyHierarchyRefreshServiceTests.cs new file mode 100644 index 0000000..f0759f2 --- /dev/null +++ b/src/MxGateway.Tests/Galaxy/GalaxyHierarchyRefreshServiceTests.cs @@ -0,0 +1,64 @@ +using Microsoft.Extensions.Logging.Abstractions; +using Microsoft.Extensions.Options; +using MxGateway.Server.Galaxy; + +namespace MxGateway.Tests.Galaxy; + +/// +/// Server-005 regression: the initial RefreshAsync call in +/// must not let a transient, +/// non-cancellation first-load failure (e.g. a +/// or from connection +/// establishment) escape and fault the host BackgroundService. +/// +public sealed class GalaxyHierarchyRefreshServiceTests +{ + [Fact] + public async Task ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFaultBackgroundService() + { + ThrowingCache cache = new(new TimeoutException("connection establishment timed out")); + GalaxyHierarchyRefreshService service = CreateService(cache); + + using CancellationTokenSource cts = new(); + + await service.StartAsync(cts.Token); + await cts.CancelAsync(); + + // The background loop must have stopped cleanly: ExecuteTask completes + // (RanToCompletion or Canceled) rather than faulting on the first refresh. + Task? executeTask = service.ExecuteTask; + Assert.NotNull(executeTask); + await executeTask; + Assert.False(executeTask.IsFaulted); + Assert.Equal(1, cache.RefreshCallCount); + + await service.StopAsync(CancellationToken.None); + } + + private static GalaxyHierarchyRefreshService CreateService(IGalaxyHierarchyCache cache) + { + GalaxyRepositoryOptions options = new() + { + DashboardRefreshIntervalSeconds = 3600, + }; + return new GalaxyHierarchyRefreshService( + cache, + Options.Create(options), + NullLogger.Instance); + } + + private sealed class ThrowingCache(Exception toThrow) : IGalaxyHierarchyCache + { + public int RefreshCallCount { get; private set; } + + public GalaxyHierarchyCacheEntry Current => GalaxyHierarchyCacheEntry.Empty; + + public Task RefreshAsync(CancellationToken cancellationToken) + { + RefreshCallCount++; + throw toThrow; + } + + public Task WaitForFirstLoadAsync(CancellationToken cancellationToken) => Task.CompletedTask; + } +} diff --git a/src/MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs b/src/MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs index 818b481..f69eeeb 100644 --- a/src/MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs +++ b/src/MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs @@ -112,6 +112,33 @@ public sealed class DashboardApiKeyManagementServiceTests && entry.Details == "rotated"); } + /// + /// Server-004 regression: the dashboard create path must reject a request + /// carrying a non-canonical scope string rather than persisting a key whose + /// scope the authorization resolver never matches. + /// + [Fact] + public async Task CreateAsync_UnknownScope_DoesNotCallStore() + { + FakeApiKeyAdminStore adminStore = new(); + DashboardApiKeyManagementService service = CreateService(adminStore); + + DashboardApiKeyManagementRequest request = CreateRequest() with + { + Scopes = new HashSet( + [GatewayScopes.SessionOpen, "invoke", "metadata"], + StringComparer.Ordinal), + }; + + DashboardApiKeyManagementResult result = await service.CreateAsync( + CreateAuthorizedUser(), + request, + CancellationToken.None); + + Assert.False(result.Succeeded); + Assert.Equal(0, adminStore.CreateCount); + } + private static DashboardApiKeyManagementService CreateService( FakeApiKeyAdminStore? adminStore = null, FakeApiKeyAuditStore? auditStore = null, diff --git a/src/MxGateway.Tests/Gateway/Sessions/SessionManagerAlarmAutoSubscribeTests.cs b/src/MxGateway.Tests/Gateway/Sessions/SessionManagerAlarmAutoSubscribeTests.cs index 89dfc84..e4c3175 100644 --- a/src/MxGateway.Tests/Gateway/Sessions/SessionManagerAlarmAutoSubscribeTests.cs +++ b/src/MxGateway.Tests/Gateway/Sessions/SessionManagerAlarmAutoSubscribeTests.cs @@ -125,6 +125,44 @@ public sealed class SessionManagerAlarmAutoSubscribeTests CreateOpenRequest(), "client-1", CancellationToken.None)); } + /// + /// Server-006 regression: when auto-subscribe throws after + /// SessionOpened() incremented the open-session gauge, the failed + /// open must not leave mxgateway.sessions.open over-counted. + /// + [Fact] + public async Task OpenSessionAsync_DoesNotLeakOpenSessionGauge_WhenAutoSubscribeFailsWithRequireOn() + { + AlarmAutoSubscribeWorkerClient worker = new() + { + SubscribeAlarmsReplyFactory = _ => new MxCommandReply + { + Kind = MxCommandKind.SubscribeAlarms, + ProtocolStatus = new ProtocolStatus + { + Code = ProtocolStatusCode.MxaccessFailure, + Message = "wnwrap subscribe failed", + }, + }, + }; + using GatewayMetrics metrics = new(); + SessionManager manager = NewManager( + worker, + alarms: new AlarmsOptions + { + Enabled = true, + SubscriptionExpression = @"\\HOST\Galaxy!Area1", + RequireSubscribeOnOpen = true, + }, + metrics: metrics); + + await Assert.ThrowsAsync( + async () => await manager.OpenSessionAsync( + CreateOpenRequest(), "client-1", CancellationToken.None)); + + Assert.Equal(0, metrics.GetSnapshot().OpenSessions); + } + [Fact] public async Task OpenSessionAsync_Throws_WhenEnabledButNoExpressionAndRequireOn() { @@ -161,7 +199,8 @@ public sealed class SessionManagerAlarmAutoSubscribeTests private static SessionManager NewManager( AlarmAutoSubscribeWorkerClient worker, - AlarmsOptions alarms) + AlarmsOptions alarms, + GatewayMetrics? metrics = null) { FakeSessionWorkerClientFactory factory = new(worker); GatewayOptions options = new GatewayOptions @@ -183,7 +222,7 @@ public sealed class SessionManagerAlarmAutoSubscribeTests new SessionRegistry(), factory, Options.Create(options), - new GatewayMetrics()); + metrics ?? new GatewayMetrics()); } private static SessionOpenRequest CreateOpenRequest() diff --git a/src/MxGateway.Tests/Gateway/Workers/OrphanWorkerTerminatorTests.cs b/src/MxGateway.Tests/Gateway/Workers/OrphanWorkerTerminatorTests.cs new file mode 100644 index 0000000..4f68dfd --- /dev/null +++ b/src/MxGateway.Tests/Gateway/Workers/OrphanWorkerTerminatorTests.cs @@ -0,0 +1,137 @@ +using Microsoft.Extensions.Options; +using MxGateway.Server.Configuration; +using MxGateway.Server.Metrics; +using MxGateway.Server.Workers; + +namespace MxGateway.Tests.Gateway.Workers; + +/// +/// Server-002 regression: per gateway.md the gateway must terminate +/// orphaned worker processes on startup. These tests pin that the terminator +/// kills leftover workers (matched by executable path, or by image name when +/// the path is unreadable) without touching unrelated processes or itself. +/// +public sealed class OrphanWorkerTerminatorTests +{ + private const string WorkerExecutablePath = @"C:\app\src\MxGateway.Worker\bin\x86\Release\MxGateway.Worker.exe"; + + [Fact] + public void TerminateOrphans_KillsWorkerProcessesMatchingConfiguredExecutablePath() + { + FakeProcessInspector inspector = new( + [ + new RunningProcessInfo(101, WorkerExecutablePath), + new RunningProcessInfo(102, WorkerExecutablePath), + ]); + OrphanWorkerTerminator terminator = CreateTerminator(inspector); + + int killed = terminator.TerminateOrphans(); + + Assert.Equal(2, killed); + Assert.Equal([101, 102], inspector.KilledProcessIds.Order()); + } + + [Fact] + public void TerminateOrphans_KillsImageNameMatchWhenExecutablePathUnreadable() + { + // The x64 gateway cannot introspect the x86 worker's main module, so the + // path comes back null. Image-name match is the only signal — and it is + // exactly the orphan worker case, so the process must still be killed. + FakeProcessInspector inspector = new( + [ + new RunningProcessInfo(201, ExecutablePath: null), + ]); + OrphanWorkerTerminator terminator = CreateTerminator(inspector); + + int killed = terminator.TerminateOrphans(); + + Assert.Equal(1, killed); + Assert.Equal([201], inspector.KilledProcessIds); + } + + [Fact] + public void TerminateOrphans_DoesNotKillUnrelatedProcessSharingImageName() + { + // A process with the same image name but a different executable path is + // not our worker and must be left alone. + FakeProcessInspector inspector = new( + [ + new RunningProcessInfo(301, @"C:\other\place\MxGateway.Worker.exe"), + ]); + OrphanWorkerTerminator terminator = CreateTerminator(inspector); + + int killed = terminator.TerminateOrphans(); + + Assert.Equal(0, killed); + Assert.Empty(inspector.KilledProcessIds); + } + + [Fact] + public void TerminateOrphans_DoesNotKillCurrentProcess() + { + FakeProcessInspector inspector = new( + [ + new RunningProcessInfo(Environment.ProcessId, WorkerExecutablePath), + ]); + OrphanWorkerTerminator terminator = CreateTerminator(inspector); + + int killed = terminator.TerminateOrphans(); + + Assert.Equal(0, killed); + Assert.Empty(inspector.KilledProcessIds); + } + + [Fact] + public void TerminateOrphans_ContinuesWhenOneKillThrows() + { + FakeProcessInspector inspector = new( + [ + new RunningProcessInfo(401, WorkerExecutablePath), + new RunningProcessInfo(402, WorkerExecutablePath), + ]) + { + ThrowOnKillProcessId = 401, + }; + OrphanWorkerTerminator terminator = CreateTerminator(inspector); + + int killed = terminator.TerminateOrphans(); + + Assert.Equal(1, killed); + Assert.Contains(402, inspector.KilledProcessIds); + } + + private static OrphanWorkerTerminator CreateTerminator(IRunningProcessInspector inspector) + { + GatewayOptions options = new() + { + Worker = new WorkerOptions + { + ExecutablePath = WorkerExecutablePath, + }, + }; + return new OrphanWorkerTerminator( + Options.Create(options), + inspector, + new GatewayMetrics()); + } + + private sealed class FakeProcessInspector(IReadOnlyList processes) + : IRunningProcessInspector + { + public List KilledProcessIds { get; } = []; + + public int? ThrowOnKillProcessId { get; init; } + + public IReadOnlyList GetProcessesByName(string processName) => processes; + + public void Kill(int processId) + { + if (ThrowOnKillProcessId == processId) + { + throw new InvalidOperationException("Process has already exited."); + } + + KilledProcessIds.Add(processId); + } + } +} diff --git a/src/MxGateway.Tests/Security/Authentication/ApiKeyAdminCommandLineParserTests.cs b/src/MxGateway.Tests/Security/Authentication/ApiKeyAdminCommandLineParserTests.cs index de43c31..99e3e47 100644 --- a/src/MxGateway.Tests/Security/Authentication/ApiKeyAdminCommandLineParserTests.cs +++ b/src/MxGateway.Tests/Security/Authentication/ApiKeyAdminCommandLineParserTests.cs @@ -52,6 +52,56 @@ public sealed class ApiKeyAdminCommandLineParserTests Assert.Contains("events:read", result.Command.Scopes); } + /// + /// Server-004 regression: a create-key command with a non-canonical scope + /// string (e.g. CLAUDE.md's stale invoke instead of invoke:read) + /// must be rejected at parse time rather than silently persisting an + /// unusable scope the authorization resolver never matches. + /// + [Fact] + public void Parse_CreateKeyCommand_RejectsUnknownScope() + { + ApiKeyAdminParseResult result = ApiKeyAdminCommandLineParser.Parse( + [ + "apikey", + "create-key", + "--key-id", + "operator01", + "--display-name", + "Operator", + "--scopes", + "session:open,invoke,metadata", + ]); + + Assert.True(result.IsApiKeyCommand); + Assert.Null(result.Command); + Assert.NotNull(result.Error); + Assert.Contains("invoke", result.Error, StringComparison.Ordinal); + Assert.Contains("metadata", result.Error, StringComparison.Ordinal); + } + + /// Verifies a create-key command with only canonical scopes parses successfully. + [Fact] + public void Parse_CreateKeyCommand_AcceptsAllCanonicalScopes() + { + ApiKeyAdminParseResult result = ApiKeyAdminCommandLineParser.Parse( + [ + "apikey", + "create-key", + "--key-id", + "operator01", + "--display-name", + "Operator", + "--scopes", + "session:open,session:close,invoke:read,invoke:write,invoke:secure,events:read,metadata:read,admin", + ]); + + Assert.True(result.IsApiKeyCommand); + Assert.Null(result.Error); + Assert.NotNull(result.Command); + Assert.Equal(8, result.Command.Scopes.Count); + } + /// /// Verifies create key without display name returns error. ///