Resolve Server-002, -004, -005, -006 code-review findings

Server-002: the gateway never terminated leftover MxGateway.Worker.exe
processes at startup, contradicting gateway.md and CLAUDE.md. Added
IRunningProcessInspector/SystemRunningProcessInspector, OrphanWorkerTerminator,
and OrphanWorkerCleanupHostedService (best-effort, runs before sessions are
accepted); updated gateway.md to describe the implemented behavior.

Server-004: API-key scopes were persisted verbatim with no validation. Added
GatewayScopes.All/IsKnown; the CLI parser and dashboard create path now
reject unknown scope strings.

Server-005: a non-SqlException/InvalidOperationException fault on the initial
Galaxy hierarchy load faulted the BackgroundService. ExecuteAsync now catches
all non-cancellation exceptions on first load and RefreshCoreAsync broadens
its catch so the cache records Stale/Unavailable instead.

Server-006: OpenSessionAsync incremented the open-sessions gauge before
alarm auto-subscribe; an auto-subscribe failure leaked the gauge. The catch
path now calls SessionRemoved() when the gauge was incremented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-18 21:31:10 -04:00
parent 5e795aeeb8
commit 1d9e3afadd
18 changed files with 676 additions and 15 deletions
+9 -9
View File
@@ -7,7 +7,7 @@
| Review date | 2026-05-18 |
| Commit reviewed | `6c64030` |
| Status | Reviewed |
| Open findings | 12 |
| Open findings | 8 |
## Checklist coverage
@@ -48,13 +48,13 @@
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/MxGateway.Server/Program.cs:24`, `src/MxGateway.Server/GatewayApplication.cs` |
| Status | Open |
| Status | Resolved |
**Description:** `gateway.md:583` and CLAUDE.md state the first version "terminates orphaned workers on startup." No code in MxGateway.Server enumerates or kills leftover `MxGateway.Worker.exe` processes at startup — a grep for `orphan`/`reattach`/`terminate` finds nothing. After an unclean gateway crash, x86 worker processes (each holding an MXAccess COM instance) leak and survive indefinitely, and a restarted gateway does not reclaim or kill them.
**Recommendation:** Add a startup hosted service that finds and kills stale worker processes (by executable path / a well-known argument or environment marker) before the server accepts sessions, or update the design docs if reattachment/cleanup is deliberately deferred.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-18. Confirmed against source: no code path enumerated or killed leftover workers. Added `IRunningProcessInspector` / `SystemRunningProcessInspector` (a testable seam over `Process.GetProcessesByName`/`Kill`), `OrphanWorkerTerminator` (kills processes matched by the configured worker executable path, or by image name when the x64 gateway cannot introspect the x86 worker's `MainModule`, skipping the current process and tolerating per-process kill failures), and `OrphanWorkerCleanupHostedService` (best-effort `IHostedService`). The hosted service is registered in `AddWorkerProcessLauncher` ahead of `AddGatewaySessions` so cleanup runs before the server accepts sessions. `gateway.md` updated to describe the implemented behavior. Regression tests: `OrphanWorkerTerminatorTests` (`KillsWorkerProcessesMatchingConfiguredExecutablePath`, `KillsImageNameMatchWhenExecutablePathUnreadable`, `DoesNotKillUnrelatedProcessSharingImageName`, `DoesNotKillCurrentProcess`, `ContinuesWhenOneKillThrows`).
### Server-003
@@ -78,13 +78,13 @@
| Severity | Medium |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs:227-233`, `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCliRunner.cs:53-77`, `src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:21-67` |
| Status | Open |
| Status | Resolved |
**Description:** `ParseScopes` accepts any comma-separated strings and `CreateKeyAsync` persists them verbatim; neither the CLI nor the dashboard create path validates scopes against `GatewayScopes`. A typo or non-canonical name (e.g. CLAUDE.md's example `--scopes session,invoke,event,metadata,admin`, which does not match the resolver's `session:open`/`invoke:read`/etc.) silently creates a key whose scope strings the authorization resolver never checks for — the key is unusable for those RPCs with no error at creation time.
**Recommendation:** Validate every requested scope against the `GatewayScopes` catalog at create time in both the CLI parser/runner and `DashboardApiKeyManagementService.ValidateCreateRequest`, rejecting unknown scope strings.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-18. Confirmed against source: `ParseScopes` split unvalidated strings into the create command and `ValidateCreateRequest` checked only key id and display name. Added `GatewayScopes.All` (the canonical scope catalog) and `GatewayScopes.IsKnown(string)`. `ApiKeyAdminCommandLineParser.Parse` now runs `ValidateScopes` for create-key commands and fails the parse listing the unknown scope(s) and valid set; `DashboardApiKeyManagementService.ValidateCreateRequest` rejects requests carrying any non-canonical scope. Revoke/rotate paths are unaffected (no scope input). Regression tests: `ApiKeyAdminCommandLineParserTests.Parse_CreateKeyCommand_RejectsUnknownScope`, `Parse_CreateKeyCommand_AcceptsAllCanonicalScopes`, and `DashboardApiKeyManagementServiceTests.CreateAsync_UnknownScope_DoesNotCallStore`.
### Server-005
@@ -93,13 +93,13 @@
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs:22-28`, `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:184` |
| Status | Open |
| Status | Resolved |
**Description:** `GalaxyHierarchyCache.RefreshCoreAsync` only catches `SqlException` and `InvalidOperationException`. The initial `cache.RefreshAsync` call in `GalaxyHierarchyRefreshService.ExecuteAsync` is wrapped only for `OperationCanceledException`. A transient non-`SqlException` failure on the first refresh (e.g. a `Win32Exception`/`TimeoutException` from connection establishment, or another `DbException` subtype) escapes both layers, faults the `BackgroundService`, and — with default host behavior — stops the whole gateway. The periodic-tick loop does catch general exceptions, so only the first load is exposed.
**Recommendation:** Broaden the `catch` in `RefreshCoreAsync` to all non-cancellation exceptions (record `Unavailable`/`Stale` and still complete `_firstLoad`), or wrap the initial `RefreshAsync` in `GalaxyHierarchyRefreshService` with the same general `catch` the tick loop uses.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-18. Confirmed against source: the initial `RefreshAsync` in `ExecuteAsync` was guarded only for `OperationCanceledException`, and `RefreshCoreAsync` filtered its catch to `SqlException or InvalidOperationException`. Both recommended layers applied: `GalaxyHierarchyRefreshService.ExecuteAsync` now catches every non-cancellation exception on the initial load (logs a warning; the periodic tick retries), and `GalaxyHierarchyCache.RefreshCoreAsync` broadens its catch to all non-cancellation exceptions so the cache still records `Stale`/`Unavailable` and completes `_firstLoad`. The now-unused `Microsoft.Data.SqlClient` using was removed. Regression test: `GalaxyHierarchyRefreshServiceTests.ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFaultBackgroundService`.
### Server-006
@@ -108,13 +108,13 @@
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Server/Sessions/SessionManager.cs:84-114` |
| Status | Open |
| Status | Resolved |
**Description:** In `OpenSessionAsync`, `_metrics.SessionOpened()` (line 89) increments the `_openSessions` gauge before `TryAutoSubscribeAlarmsAsync` runs. If auto-subscribe throws (which it does when `Alarms.RequireSubscribeOnOpen` is true and the worker rejects the subscription), the `catch` block disposes and removes the session and records `_metrics.Fault(...)` but never calls `SessionClosed`/`SessionRemoved`. The `mxgateway.sessions.open` gauge permanently over-counts by one for every such failed open.
**Recommendation:** In the `catch` block, when the session had reached the point where `SessionOpened()` was recorded, also call `_metrics.SessionRemoved()` — or move the `SessionOpened()` call to after auto-subscribe succeeds.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-18. Confirmed against source: the `catch` block in `OpenSessionAsync` recorded `Fault(...)` and removed the session but never decremented the open-session gauge after `SessionOpened()` had run. Added a `sessionOpenedRecorded` flag set immediately after `_metrics.SessionOpened()`; the `catch` block now calls `_metrics.SessionRemoved()` when that flag is set, restoring the gauge for a post-`SessionOpened()` failure (e.g. an auto-subscribe rejection with `RequireSubscribeOnOpen=true`). Regression test: `SessionManagerAlarmAutoSubscribeTests.OpenSessionAsync_DoesNotLeakOpenSessionGauge_WhenAutoSubscribeFailsWithRequireOn`.
### Server-007
+5 -2
View File
@@ -579,8 +579,11 @@ Policy:
- command exceptions return structured command fault with HRESULT if known,
- stale sessions are closed by lease timeout,
- stuck workers are killed by process id,
- gateway restart should not attempt to reattach old workers unless explicitly
designed; first version should terminate orphaned workers on startup.
- gateway restart does not reattach old workers; `OrphanWorkerCleanupHostedService`
runs `OrphanWorkerTerminator` once on startup — before the server accepts
sessions — to kill leftover `MxGateway.Worker.exe` processes (matched by the
configured worker executable path, or by image name when the x64 gateway cannot
introspect the x86 worker's module) left behind by a previous unclean run.
Because each client owns one worker, a crash or leak affects only that session.
@@ -1,6 +1,7 @@
using System.Security.Claims;
using Microsoft.Data.Sqlite;
using MxGateway.Server.Security.Authentication;
using MxGateway.Server.Security.Authorization;
namespace MxGateway.Server.Dashboard;
@@ -171,6 +172,15 @@ public sealed class DashboardApiKeyManagementService(
return "Display name is required.";
}
string[] unknownScopes = request.Scopes
.Where(scope => !GatewayScopes.IsKnown(scope))
.ToArray();
if (unknownScopes.Length > 0)
{
return $"Unknown scope(s): {string.Join(", ", unknownScopes)}. "
+ $"Valid scopes are: {string.Join(", ", GatewayScopes.All)}.";
}
return null;
}
@@ -1,5 +1,4 @@
using Google.Protobuf.WellKnownTypes;
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Logging;
using MxGateway.Contracts.Proto.Galaxy;
using MxGateway.Server.Dashboard;
@@ -181,8 +180,13 @@ public sealed class GalaxyHierarchyCache : IGalaxyHierarchyCache
{
throw;
}
catch (Exception exception) when (exception is SqlException or InvalidOperationException)
catch (Exception exception)
{
// Catch every non-cancellation failure — not just SqlException /
// InvalidOperationException. A TimeoutException or Win32Exception
// from connection establishment, or another DbException subtype,
// must still degrade gracefully to Stale/Unavailable and complete
// _firstLoad rather than escape and fault the refresh BackgroundService.
_logger?.LogWarning(exception, "Galaxy hierarchy cache refresh failed.");
GalaxyHierarchyCacheEntry failed = previous with
{
@@ -26,6 +26,15 @@ public sealed class GalaxyHierarchyRefreshService(
{
return;
}
catch (Exception exception)
{
// A transient first-load failure (e.g. a TimeoutException or
// Win32Exception from connection establishment, or a DbException
// subtype the cache does not catch) must not fault this
// BackgroundService and stop the whole gateway. The cache records
// its own Unavailable/Stale status; the periodic tick below retries.
logger.LogWarning(exception, "Initial Galaxy hierarchy cache load failed; will retry on the refresh interval.");
}
using PeriodicTimer timer = new(interval, _timeProvider);
try
@@ -1,3 +1,5 @@
using MxGateway.Server.Security.Authorization;
namespace MxGateway.Server.Security.Authentication;
public static class ApiKeyAdminCommandLineParser
@@ -95,6 +97,12 @@ public static class ApiKeyAdminCommandLineParser
return ApiKeyAdminParseResult.Fail(validationError);
}
string? scopeError = ValidateScopes(kind, scopes);
if (scopeError is not null)
{
return ApiKeyAdminParseResult.Fail(scopeError);
}
return ApiKeyAdminParseResult.Success(new ApiKeyAdminCommand(
Kind: kind,
Json: json,
@@ -152,6 +160,23 @@ public static class ApiKeyAdminCommandLineParser
return null;
}
private static string? ValidateScopes(ApiKeyAdminCommandKind kind, IReadOnlySet<string> scopes)
{
if (kind != ApiKeyAdminCommandKind.CreateKey)
{
return null;
}
string[] unknown = scopes.Where(scope => !GatewayScopes.IsKnown(scope)).ToArray();
if (unknown.Length == 0)
{
return null;
}
return $"Unknown scope(s): {string.Join(", ", unknown)}. "
+ $"Valid scopes are: {string.Join(", ", GatewayScopes.All)}.";
}
private static string KindName(ApiKeyAdminCommandKind kind)
{
return kind switch
@@ -10,4 +10,28 @@ public static class GatewayScopes
public const string EventsRead = "events:read";
public const string MetadataRead = "metadata:read";
public const string Admin = "admin";
/// <summary>
/// The complete catalog of canonical scope strings the gateway authorization
/// resolver recognizes. Key-creation paths (CLI and dashboard) validate requested
/// scopes against this set so a typo or non-canonical name cannot persist a key
/// whose scope strings the resolver never matches.
/// </summary>
public static readonly IReadOnlySet<string> All = new HashSet<string>(
[
SessionOpen,
SessionClose,
InvokeRead,
InvokeWrite,
InvokeSecure,
EventsRead,
MetadataRead,
Admin,
],
System.StringComparer.Ordinal);
/// <summary>Determines whether the supplied scope string is a recognized canonical scope.</summary>
/// <param name="scope">Scope string to check.</param>
/// <returns><see langword="true"/> when the scope is canonical; otherwise <see langword="false"/>.</returns>
public static bool IsKnown(string scope) => All.Contains(scope);
}
@@ -68,6 +68,7 @@ public sealed class SessionManager : ISessionManager
EnsureSessionCapacity();
GatewaySession? session = null;
bool sessionOpenedRecorded = false;
try
{
session = CreateSession(request, clientIdentity);
@@ -86,6 +87,7 @@ public sealed class SessionManager : ISessionManager
session.AttachWorkerClient(workerClient);
session.MarkReady();
_metrics.SessionOpened();
sessionOpenedRecorded = true;
await TryAutoSubscribeAlarmsAsync(session, cancellationToken).ConfigureAwait(false);
@@ -100,6 +102,14 @@ public sealed class SessionManager : ISessionManager
await session.DisposeAsync().ConfigureAwait(false);
}
// If SessionOpened() already incremented the open-session gauge,
// a failure after that point (e.g. auto-subscribe rejection) must
// decrement it again so mxgateway.sessions.open does not leak.
if (sessionOpenedRecorded)
{
_metrics.SessionRemoved();
}
ReleaseSessionSlot();
_metrics.Fault(SessionManagerErrorCode.OpenFailed.ToString());
_logger.LogWarning(
@@ -0,0 +1,29 @@
namespace MxGateway.Server.Workers;
/// <summary>
/// Abstraction over OS process enumeration and termination. Exists so the
/// orphan-worker cleanup logic can be unit-tested without spawning real
/// processes.
/// </summary>
public interface IRunningProcessInspector
{
/// <summary>
/// Enumerates currently running processes whose image name (without the
/// <c>.exe</c> extension) matches <paramref name="processName"/>.
/// </summary>
/// <param name="processName">Process image name to match, without extension.</param>
/// <returns>The matching running processes.</returns>
IReadOnlyList<RunningProcessInfo> GetProcessesByName(string processName);
/// <summary>Forcibly terminates the process with the given identifier.</summary>
/// <param name="processId">Identifier of the process to terminate.</param>
void Kill(int processId);
}
/// <summary>Identifying information for a running process candidate.</summary>
/// <param name="ProcessId">Operating-system process identifier.</param>
/// <param name="ExecutablePath">
/// Fully-qualified path to the process main module, or <see langword="null"/>
/// when it could not be read (e.g. access denied).
/// </param>
public sealed record RunningProcessInfo(int ProcessId, string? ExecutablePath);
@@ -0,0 +1,30 @@
namespace MxGateway.Server.Workers;
/// <summary>
/// Hosted service that terminates leftover MXAccess worker processes once on
/// gateway startup, before the server begins accepting sessions.
/// </summary>
public sealed class OrphanWorkerCleanupHostedService(
OrphanWorkerTerminator terminator,
ILogger<OrphanWorkerCleanupHostedService> logger) : IHostedService
{
/// <inheritdoc />
public Task StartAsync(CancellationToken cancellationToken)
{
try
{
terminator.TerminateOrphans();
}
catch (Exception exception)
{
// Orphan cleanup is best-effort; a failure here must not prevent
// the gateway from starting.
logger.LogWarning(exception, "Orphan worker cleanup failed on startup.");
}
return Task.CompletedTask;
}
/// <inheritdoc />
public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;
}
@@ -0,0 +1,138 @@
using Microsoft.Extensions.Logging.Abstractions;
using Microsoft.Extensions.Options;
using MxGateway.Server.Configuration;
using MxGateway.Server.Metrics;
namespace MxGateway.Server.Workers;
/// <summary>
/// Terminates leftover MXAccess worker processes on gateway startup.
/// <para>
/// Per <c>gateway.md</c> ("first version should terminate orphaned workers
/// on startup") and CLAUDE.md, a gateway restart does not reattach old
/// workers. After an unclean gateway crash, x86 worker processes — each
/// holding an MXAccess COM instance on an STA — survive indefinitely. This
/// terminator finds those processes by executable name/path and kills them
/// before the restarted gateway accepts sessions.
/// </para>
/// </summary>
public sealed class OrphanWorkerTerminator
{
private readonly IRunningProcessInspector _inspector;
private readonly GatewayMetrics _metrics;
private readonly WorkerOptions _workerOptions;
private readonly ILogger<OrphanWorkerTerminator> _logger;
/// <summary>Initializes a new instance of the <see cref="OrphanWorkerTerminator"/> class.</summary>
/// <param name="gatewayOptions">Gateway configuration options.</param>
/// <param name="inspector">Running-process inspector.</param>
/// <param name="metrics">Gateway metrics collector.</param>
/// <param name="logger">Optional logger for diagnostic output.</param>
public OrphanWorkerTerminator(
IOptions<GatewayOptions> gatewayOptions,
IRunningProcessInspector inspector,
GatewayMetrics metrics,
ILogger<OrphanWorkerTerminator>? logger = null)
{
ArgumentNullException.ThrowIfNull(gatewayOptions);
_inspector = inspector ?? throw new ArgumentNullException(nameof(inspector));
_metrics = metrics ?? throw new ArgumentNullException(nameof(metrics));
_workerOptions = gatewayOptions.Value.Worker;
_logger = logger ?? NullLogger<OrphanWorkerTerminator>.Instance;
}
/// <summary>
/// Finds and kills every leftover worker process. Safe to call once at
/// startup before any session-owned worker is launched.
/// </summary>
/// <returns>The number of orphan worker processes that were terminated.</returns>
public int TerminateOrphans()
{
string? configuredPath = ResolveConfiguredExecutablePath();
string processName = ResolveProcessName(configuredPath);
int currentProcessId = Environment.ProcessId;
int terminated = 0;
foreach (RunningProcessInfo candidate in _inspector.GetProcessesByName(processName))
{
if (candidate.ProcessId == currentProcessId)
{
continue;
}
if (!IsOrphanWorker(candidate, configuredPath))
{
continue;
}
try
{
_inspector.Kill(candidate.ProcessId);
_metrics.WorkerKilled("OrphanStartupCleanup");
terminated++;
_logger.LogWarning(
"Terminated orphan worker process {ProcessId} ({ExecutablePath}) left over from a previous gateway run.",
candidate.ProcessId,
candidate.ExecutablePath ?? processName);
}
catch (Exception exception)
{
// The process may have already exited, or be inaccessible.
// A failure to kill one orphan must not block gateway startup.
_logger.LogWarning(
exception,
"Failed to terminate orphan worker process {ProcessId}.",
candidate.ProcessId);
}
}
if (terminated > 0)
{
_logger.LogInformation("Terminated {Count} orphan worker process(es) on startup.", terminated);
}
return terminated;
}
private static bool IsOrphanWorker(RunningProcessInfo candidate, string? configuredPath)
{
// When the executable path is readable, require an exact match against
// the configured worker path so unrelated processes that merely share
// the image name are never killed.
if (candidate.ExecutablePath is { } path)
{
return configuredPath is not null
&& string.Equals(path, configuredPath, StringComparison.OrdinalIgnoreCase);
}
// A null path means the x64 gateway could not introspect the module —
// the expected case for the x86 worker. Image-name match is the only
// signal available; treat it as an orphan.
return true;
}
private string? ResolveConfiguredExecutablePath()
{
try
{
return Path.GetFullPath(_workerOptions.ExecutablePath);
}
catch (Exception exception) when (exception is ArgumentException
or NotSupportedException
or PathTooLongException)
{
_logger.LogWarning(
exception,
"Configured worker executable path '{ExecutablePath}' is not a valid filesystem path; "
+ "orphan cleanup will match by image name only.",
_workerOptions.ExecutablePath);
return null;
}
}
private static string ResolveProcessName(string? configuredPath)
{
string source = configuredPath ?? "MxGateway.Worker.exe";
return Path.GetFileNameWithoutExtension(source);
}
}
@@ -0,0 +1,55 @@
using System.Diagnostics;
namespace MxGateway.Server.Workers;
/// <summary>
/// <see cref="IRunningProcessInspector"/> backed by <see cref="Process"/>.
/// </summary>
public sealed class SystemRunningProcessInspector : IRunningProcessInspector
{
/// <inheritdoc />
public IReadOnlyList<RunningProcessInfo> GetProcessesByName(string processName)
{
List<RunningProcessInfo> results = [];
Process[] processes = Process.GetProcessesByName(processName);
try
{
foreach (Process process in processes)
{
results.Add(new RunningProcessInfo(process.Id, TryGetExecutablePath(process)));
}
}
finally
{
foreach (Process process in processes)
{
process.Dispose();
}
}
return results;
}
/// <inheritdoc />
public void Kill(int processId)
{
using Process process = Process.GetProcessById(processId);
process.Kill(entireProcessTree: true);
}
private static string? TryGetExecutablePath(Process process)
{
try
{
return process.MainModule?.FileName;
}
catch (Exception exception) when (exception is InvalidOperationException
or System.ComponentModel.Win32Exception
or NotSupportedException)
{
// Access to the main module can be denied (e.g. a 64-bit gateway
// querying a 32-bit worker, or a process owned by another user).
return null;
}
}
}
@@ -11,6 +11,13 @@ public static class WorkerServiceCollectionExtensions
services.AddSingleton<IWorkerStartupProbe, WorkerProcessStartedProbe>();
services.AddSingleton<IWorkerProcessLauncher, WorkerProcessLauncher>();
// Terminate workers leaked by a previous unclean gateway run before the
// server accepts sessions. Registered ahead of AddGatewaySessions so the
// cleanup hosted service starts before the session subsystem.
services.AddSingleton<IRunningProcessInspector, SystemRunningProcessInspector>();
services.AddSingleton<OrphanWorkerTerminator>();
services.AddHostedService<OrphanWorkerCleanupHostedService>();
return services;
}
}
@@ -0,0 +1,64 @@
using Microsoft.Extensions.Logging.Abstractions;
using Microsoft.Extensions.Options;
using MxGateway.Server.Galaxy;
namespace MxGateway.Tests.Galaxy;
/// <summary>
/// Server-005 regression: the initial <c>RefreshAsync</c> call in
/// <see cref="GalaxyHierarchyRefreshService"/> must not let a transient,
/// non-cancellation first-load failure (e.g. a <see cref="TimeoutException"/>
/// or <see cref="System.ComponentModel.Win32Exception"/> from connection
/// establishment) escape and fault the host <c>BackgroundService</c>.
/// </summary>
public sealed class GalaxyHierarchyRefreshServiceTests
{
[Fact]
public async Task ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFaultBackgroundService()
{
ThrowingCache cache = new(new TimeoutException("connection establishment timed out"));
GalaxyHierarchyRefreshService service = CreateService(cache);
using CancellationTokenSource cts = new();
await service.StartAsync(cts.Token);
await cts.CancelAsync();
// The background loop must have stopped cleanly: ExecuteTask completes
// (RanToCompletion or Canceled) rather than faulting on the first refresh.
Task? executeTask = service.ExecuteTask;
Assert.NotNull(executeTask);
await executeTask;
Assert.False(executeTask.IsFaulted);
Assert.Equal(1, cache.RefreshCallCount);
await service.StopAsync(CancellationToken.None);
}
private static GalaxyHierarchyRefreshService CreateService(IGalaxyHierarchyCache cache)
{
GalaxyRepositoryOptions options = new()
{
DashboardRefreshIntervalSeconds = 3600,
};
return new GalaxyHierarchyRefreshService(
cache,
Options.Create(options),
NullLogger<GalaxyHierarchyRefreshService>.Instance);
}
private sealed class ThrowingCache(Exception toThrow) : IGalaxyHierarchyCache
{
public int RefreshCallCount { get; private set; }
public GalaxyHierarchyCacheEntry Current => GalaxyHierarchyCacheEntry.Empty;
public Task RefreshAsync(CancellationToken cancellationToken)
{
RefreshCallCount++;
throw toThrow;
}
public Task WaitForFirstLoadAsync(CancellationToken cancellationToken) => Task.CompletedTask;
}
}
@@ -112,6 +112,33 @@ public sealed class DashboardApiKeyManagementServiceTests
&& entry.Details == "rotated");
}
/// <summary>
/// Server-004 regression: the dashboard create path must reject a request
/// carrying a non-canonical scope string rather than persisting a key whose
/// scope the authorization resolver never matches.
/// </summary>
[Fact]
public async Task CreateAsync_UnknownScope_DoesNotCallStore()
{
FakeApiKeyAdminStore adminStore = new();
DashboardApiKeyManagementService service = CreateService(adminStore);
DashboardApiKeyManagementRequest request = CreateRequest() with
{
Scopes = new HashSet<string>(
[GatewayScopes.SessionOpen, "invoke", "metadata"],
StringComparer.Ordinal),
};
DashboardApiKeyManagementResult result = await service.CreateAsync(
CreateAuthorizedUser(),
request,
CancellationToken.None);
Assert.False(result.Succeeded);
Assert.Equal(0, adminStore.CreateCount);
}
private static DashboardApiKeyManagementService CreateService(
FakeApiKeyAdminStore? adminStore = null,
FakeApiKeyAuditStore? auditStore = null,
@@ -125,6 +125,44 @@ public sealed class SessionManagerAlarmAutoSubscribeTests
CreateOpenRequest(), "client-1", CancellationToken.None));
}
/// <summary>
/// Server-006 regression: when auto-subscribe throws after
/// <c>SessionOpened()</c> incremented the open-session gauge, the failed
/// open must not leave <c>mxgateway.sessions.open</c> over-counted.
/// </summary>
[Fact]
public async Task OpenSessionAsync_DoesNotLeakOpenSessionGauge_WhenAutoSubscribeFailsWithRequireOn()
{
AlarmAutoSubscribeWorkerClient worker = new()
{
SubscribeAlarmsReplyFactory = _ => new MxCommandReply
{
Kind = MxCommandKind.SubscribeAlarms,
ProtocolStatus = new ProtocolStatus
{
Code = ProtocolStatusCode.MxaccessFailure,
Message = "wnwrap subscribe failed",
},
},
};
using GatewayMetrics metrics = new();
SessionManager manager = NewManager(
worker,
alarms: new AlarmsOptions
{
Enabled = true,
SubscriptionExpression = @"\\HOST\Galaxy!Area1",
RequireSubscribeOnOpen = true,
},
metrics: metrics);
await Assert.ThrowsAsync<SessionManagerException>(
async () => await manager.OpenSessionAsync(
CreateOpenRequest(), "client-1", CancellationToken.None));
Assert.Equal(0, metrics.GetSnapshot().OpenSessions);
}
[Fact]
public async Task OpenSessionAsync_Throws_WhenEnabledButNoExpressionAndRequireOn()
{
@@ -161,7 +199,8 @@ public sealed class SessionManagerAlarmAutoSubscribeTests
private static SessionManager NewManager(
AlarmAutoSubscribeWorkerClient worker,
AlarmsOptions alarms)
AlarmsOptions alarms,
GatewayMetrics? metrics = null)
{
FakeSessionWorkerClientFactory factory = new(worker);
GatewayOptions options = new GatewayOptions
@@ -183,7 +222,7 @@ public sealed class SessionManagerAlarmAutoSubscribeTests
new SessionRegistry(),
factory,
Options.Create(options),
new GatewayMetrics());
metrics ?? new GatewayMetrics());
}
private static SessionOpenRequest CreateOpenRequest()
@@ -0,0 +1,137 @@
using Microsoft.Extensions.Options;
using MxGateway.Server.Configuration;
using MxGateway.Server.Metrics;
using MxGateway.Server.Workers;
namespace MxGateway.Tests.Gateway.Workers;
/// <summary>
/// Server-002 regression: per <c>gateway.md</c> the gateway must terminate
/// orphaned worker processes on startup. These tests pin that the terminator
/// kills leftover workers (matched by executable path, or by image name when
/// the path is unreadable) without touching unrelated processes or itself.
/// </summary>
public sealed class OrphanWorkerTerminatorTests
{
private const string WorkerExecutablePath = @"C:\app\src\MxGateway.Worker\bin\x86\Release\MxGateway.Worker.exe";
[Fact]
public void TerminateOrphans_KillsWorkerProcessesMatchingConfiguredExecutablePath()
{
FakeProcessInspector inspector = new(
[
new RunningProcessInfo(101, WorkerExecutablePath),
new RunningProcessInfo(102, WorkerExecutablePath),
]);
OrphanWorkerTerminator terminator = CreateTerminator(inspector);
int killed = terminator.TerminateOrphans();
Assert.Equal(2, killed);
Assert.Equal([101, 102], inspector.KilledProcessIds.Order());
}
[Fact]
public void TerminateOrphans_KillsImageNameMatchWhenExecutablePathUnreadable()
{
// The x64 gateway cannot introspect the x86 worker's main module, so the
// path comes back null. Image-name match is the only signal — and it is
// exactly the orphan worker case, so the process must still be killed.
FakeProcessInspector inspector = new(
[
new RunningProcessInfo(201, ExecutablePath: null),
]);
OrphanWorkerTerminator terminator = CreateTerminator(inspector);
int killed = terminator.TerminateOrphans();
Assert.Equal(1, killed);
Assert.Equal([201], inspector.KilledProcessIds);
}
[Fact]
public void TerminateOrphans_DoesNotKillUnrelatedProcessSharingImageName()
{
// A process with the same image name but a different executable path is
// not our worker and must be left alone.
FakeProcessInspector inspector = new(
[
new RunningProcessInfo(301, @"C:\other\place\MxGateway.Worker.exe"),
]);
OrphanWorkerTerminator terminator = CreateTerminator(inspector);
int killed = terminator.TerminateOrphans();
Assert.Equal(0, killed);
Assert.Empty(inspector.KilledProcessIds);
}
[Fact]
public void TerminateOrphans_DoesNotKillCurrentProcess()
{
FakeProcessInspector inspector = new(
[
new RunningProcessInfo(Environment.ProcessId, WorkerExecutablePath),
]);
OrphanWorkerTerminator terminator = CreateTerminator(inspector);
int killed = terminator.TerminateOrphans();
Assert.Equal(0, killed);
Assert.Empty(inspector.KilledProcessIds);
}
[Fact]
public void TerminateOrphans_ContinuesWhenOneKillThrows()
{
FakeProcessInspector inspector = new(
[
new RunningProcessInfo(401, WorkerExecutablePath),
new RunningProcessInfo(402, WorkerExecutablePath),
])
{
ThrowOnKillProcessId = 401,
};
OrphanWorkerTerminator terminator = CreateTerminator(inspector);
int killed = terminator.TerminateOrphans();
Assert.Equal(1, killed);
Assert.Contains(402, inspector.KilledProcessIds);
}
private static OrphanWorkerTerminator CreateTerminator(IRunningProcessInspector inspector)
{
GatewayOptions options = new()
{
Worker = new WorkerOptions
{
ExecutablePath = WorkerExecutablePath,
},
};
return new OrphanWorkerTerminator(
Options.Create(options),
inspector,
new GatewayMetrics());
}
private sealed class FakeProcessInspector(IReadOnlyList<RunningProcessInfo> processes)
: IRunningProcessInspector
{
public List<int> KilledProcessIds { get; } = [];
public int? ThrowOnKillProcessId { get; init; }
public IReadOnlyList<RunningProcessInfo> GetProcessesByName(string processName) => processes;
public void Kill(int processId)
{
if (ThrowOnKillProcessId == processId)
{
throw new InvalidOperationException("Process has already exited.");
}
KilledProcessIds.Add(processId);
}
}
}
@@ -52,6 +52,56 @@ public sealed class ApiKeyAdminCommandLineParserTests
Assert.Contains("events:read", result.Command.Scopes);
}
/// <summary>
/// Server-004 regression: a create-key command with a non-canonical scope
/// string (e.g. CLAUDE.md's stale <c>invoke</c> instead of <c>invoke:read</c>)
/// must be rejected at parse time rather than silently persisting an
/// unusable scope the authorization resolver never matches.
/// </summary>
[Fact]
public void Parse_CreateKeyCommand_RejectsUnknownScope()
{
ApiKeyAdminParseResult result = ApiKeyAdminCommandLineParser.Parse(
[
"apikey",
"create-key",
"--key-id",
"operator01",
"--display-name",
"Operator",
"--scopes",
"session:open,invoke,metadata",
]);
Assert.True(result.IsApiKeyCommand);
Assert.Null(result.Command);
Assert.NotNull(result.Error);
Assert.Contains("invoke", result.Error, StringComparison.Ordinal);
Assert.Contains("metadata", result.Error, StringComparison.Ordinal);
}
/// <summary>Verifies a create-key command with only canonical scopes parses successfully.</summary>
[Fact]
public void Parse_CreateKeyCommand_AcceptsAllCanonicalScopes()
{
ApiKeyAdminParseResult result = ApiKeyAdminCommandLineParser.Parse(
[
"apikey",
"create-key",
"--key-id",
"operator01",
"--display-name",
"Operator",
"--scopes",
"session:open,session:close,invoke:read,invoke:write,invoke:secure,events:read,metadata:read,admin",
]);
Assert.True(result.IsApiKeyCommand);
Assert.Null(result.Error);
Assert.NotNull(result.Command);
Assert.Equal(8, result.Command.Scopes.Count);
}
/// <summary>
/// Verifies create key without display name returns error.
/// </summary>