Resolve Server-002, -004, -005, -006 code-review findings
Server-002: the gateway never terminated leftover MxGateway.Worker.exe processes at startup, contradicting gateway.md and CLAUDE.md. Added IRunningProcessInspector/SystemRunningProcessInspector, OrphanWorkerTerminator, and OrphanWorkerCleanupHostedService (best-effort, runs before sessions are accepted); updated gateway.md to describe the implemented behavior.

Server-004: API-key scopes were persisted verbatim with no validation. Added GatewayScopes.All/IsKnown; the CLI parser and dashboard create path now reject unknown scope strings.

Server-005: a non-SqlException/InvalidOperationException fault on the initial Galaxy hierarchy load faulted the BackgroundService. ExecuteAsync now catches all non-cancellation exceptions on first load and RefreshCoreAsync broadens its catch so the cache records Stale/Unavailable instead.

Server-006: OpenSessionAsync incremented the open-sessions gauge before alarm auto-subscribe; an auto-subscribe failure leaked the gauge. The catch path now calls SessionRemoved() when the gauge was incremented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -7,7 +7,7 @@
| Review date | 2026-05-18 |
| Commit reviewed | `6c64030` |
| Status | Reviewed |
-| Open findings | 12 |
+| Open findings | 8 |
|
||||
## Checklist coverage
|
||||
|
||||
@@ -48,13 +48,13 @@
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/MxGateway.Server/Program.cs:24`, `src/MxGateway.Server/GatewayApplication.cs` |
-| Status | Open |
+| Status | Resolved |
|
||||
**Description:** `gateway.md:583` and CLAUDE.md state the first version "terminates orphaned workers on startup." No code in MxGateway.Server enumerates or kills leftover `MxGateway.Worker.exe` processes at startup — a grep for `orphan`/`reattach`/`terminate` finds nothing. After an unclean gateway crash, x86 worker processes (each holding an MXAccess COM instance) leak and survive indefinitely, and a restarted gateway does not reclaim or kill them.
|
||||
|
||||
**Recommendation:** Add a startup hosted service that finds and kills stale worker processes (by executable path / a well-known argument or environment marker) before the server accepts sessions, or update the design docs if reattachment/cleanup is deliberately deferred.
|
||||
|
||||
-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-18. Confirmed against source: no code path enumerated or killed leftover workers. Added `IRunningProcessInspector` / `SystemRunningProcessInspector` (a testable seam over `Process.GetProcessesByName`/`Kill`), `OrphanWorkerTerminator` (kills processes matched by the configured worker executable path, or by image name when the x64 gateway cannot introspect the x86 worker's `MainModule`, skipping the current process and tolerating per-process kill failures), and `OrphanWorkerCleanupHostedService` (best-effort `IHostedService`). The hosted service is registered in `AddWorkerProcessLauncher` ahead of `AddGatewaySessions` so cleanup runs before the server accepts sessions. `gateway.md` updated to describe the implemented behavior. Regression tests: `OrphanWorkerTerminatorTests` (`KillsWorkerProcessesMatchingConfiguredExecutablePath`, `KillsImageNameMatchWhenExecutablePathUnreadable`, `DoesNotKillUnrelatedProcessSharingImageName`, `DoesNotKillCurrentProcess`, `ContinuesWhenOneKillThrows`).
|
||||
|
||||
### Server-003
|
||||
|
||||
@@ -78,13 +78,13 @@
| Severity | Medium |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCommandLineParser.cs:227-233`, `src/MxGateway.Server/Security/Authentication/ApiKeyAdminCliRunner.cs:53-77`, `src/MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:21-67` |
-| Status | Open |
+| Status | Resolved |
|
||||
**Description:** `ParseScopes` accepts any comma-separated strings and `CreateKeyAsync` persists them verbatim; neither the CLI nor the dashboard create path validates scopes against `GatewayScopes`. A typo or non-canonical name (e.g. CLAUDE.md's example `--scopes session,invoke,event,metadata,admin`, which does not match the resolver's `session:open`/`invoke:read`/etc.) silently creates a key whose scope strings the authorization resolver never checks for — the key is unusable for those RPCs with no error at creation time.
|
||||
|
||||
**Recommendation:** Validate every requested scope against the `GatewayScopes` catalog at create time in both the CLI parser/runner and `DashboardApiKeyManagementService.ValidateCreateRequest`, rejecting unknown scope strings.
|
||||
|
||||
-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-18. Confirmed against source: `ParseScopes` split unvalidated strings into the create command and `ValidateCreateRequest` checked only key id and display name. Added `GatewayScopes.All` (the canonical scope catalog) and `GatewayScopes.IsKnown(string)`. `ApiKeyAdminCommandLineParser.Parse` now runs `ValidateScopes` for create-key commands and fails the parse listing the unknown scope(s) and valid set; `DashboardApiKeyManagementService.ValidateCreateRequest` rejects requests carrying any non-canonical scope. Revoke/rotate paths are unaffected (no scope input). Regression tests: `ApiKeyAdminCommandLineParserTests.Parse_CreateKeyCommand_RejectsUnknownScope`, `Parse_CreateKeyCommand_AcceptsAllCanonicalScopes`, and `DashboardApiKeyManagementServiceTests.CreateAsync_UnknownScope_DoesNotCallStore`.
|
||||
|
||||
### Server-005
|
||||
|
||||
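As a rough illustration of the Server-004 check, the sketch below runs the non-canonical CLAUDE.md example quoted in the finding through an ordinal membership test. The set `canonicalScopes` and the local `IsKnown` function are hypothetical stand-ins for `GatewayScopes.All` and `GatewayScopes.IsKnown`. The values `events:read`, `metadata:read`, and `admin` appear in the diff, and `session:open` and `invoke:read` are named in the finding; `session:close`, `invoke:write`, and `invoke:secure` are assumed for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for GatewayScopes.All. Three of these values
// ("session:close", "invoke:write", "invoke:secure") are assumptions.
IReadOnlySet<string> canonicalScopes = new HashSet<string>(
    new[]
    {
        "session:open", "session:close",
        "invoke:read", "invoke:write", "invoke:secure",
        "events:read", "metadata:read", "admin",
    },
    StringComparer.Ordinal);

// Mirrors the shape of GatewayScopes.IsKnown: ordinal, case-sensitive lookup.
bool IsKnown(string scope) => canonicalScopes.Contains(scope);

// The non-canonical scope list the finding quotes from CLAUDE.md.
string[] requested = "session,invoke,event,metadata,admin".Split(',');
string[] unknown = requested.Where(scope => !IsKnown(scope)).ToArray();

Console.WriteLine(unknown.Length == 0
    ? "All scopes accepted."
    : $"Unknown scope(s): {string.Join(", ", unknown)}.");
// prints "Unknown scope(s): session, invoke, event, metadata."
```

Only `admin` survives the check, which is why a key created from that example was previously unusable for session, invoke, event, and metadata RPCs.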
@@ -93,13 +93,13 @@
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Galaxy/GalaxyHierarchyRefreshService.cs:22-28`, `src/MxGateway.Server/Galaxy/GalaxyHierarchyCache.cs:184` |
-| Status | Open |
+| Status | Resolved |
|
||||
**Description:** `GalaxyHierarchyCache.RefreshCoreAsync` only catches `SqlException` and `InvalidOperationException`. The initial `cache.RefreshAsync` call in `GalaxyHierarchyRefreshService.ExecuteAsync` is wrapped only for `OperationCanceledException`. A transient non-`SqlException` failure on the first refresh (e.g. a `Win32Exception`/`TimeoutException` from connection establishment, or another `DbException` subtype) escapes both layers, faults the `BackgroundService`, and — with default host behavior — stops the whole gateway. The periodic-tick loop does catch general exceptions, so only the first load is exposed.
|
||||
|
||||
**Recommendation:** Broaden the `catch` in `RefreshCoreAsync` to all non-cancellation exceptions (record `Unavailable`/`Stale` and still complete `_firstLoad`), or wrap the initial `RefreshAsync` in `GalaxyHierarchyRefreshService` with the same general `catch` the tick loop uses.
|
||||
|
||||
-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-18. Confirmed against source: the initial `RefreshAsync` in `ExecuteAsync` was guarded only for `OperationCanceledException`, and `RefreshCoreAsync` filtered its catch to `SqlException or InvalidOperationException`. Both recommended layers applied: `GalaxyHierarchyRefreshService.ExecuteAsync` now catches every non-cancellation exception on the initial load (logs a warning; the periodic tick retries), and `GalaxyHierarchyCache.RefreshCoreAsync` broadens its catch to all non-cancellation exceptions so the cache still records `Stale`/`Unavailable` and completes `_firstLoad`. The now-unused `Microsoft.Data.SqlClient` using was removed. Regression test: `GalaxyHierarchyRefreshServiceTests.ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFaultBackgroundService`.
|
||||
|
||||
### Server-006
|
||||
|
||||
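The catch ordering behind the Server-005 fix can be sketched in isolation. This is a minimal, hypothetical stand-in rather than the gateway's code: `TryFirstLoadAsync` and its string return values are invented for illustration, and the `when` filter on the cancellation catch is an assumption about how cancellation is distinguished. It shows only that a cancellation is handled separately while every other exception is absorbed so a retry can follow.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical first-load guard: cancellation ends the load quietly,
// any other failure is logged and reported as "Unavailable".
static async Task<string> TryFirstLoadAsync(
    Func<CancellationToken, Task> refresh,
    CancellationToken token)
{
    try
    {
        await refresh(token);
        return "Loaded";
    }
    catch (OperationCanceledException) when (token.IsCancellationRequested)
    {
        return "Canceled";
    }
    catch (Exception exception)
    {
        // Not limited to SqlException / InvalidOperationException.
        Console.WriteLine($"Initial load failed ({exception.GetType().Name}); will retry.");
        return "Unavailable";
    }
}

// A TimeoutException from connection establishment no longer escapes.
string status = await TryFirstLoadAsync(
    _ => throw new TimeoutException("connection establishment timed out"),
    CancellationToken.None);

Console.WriteLine(status); // prints "Unavailable"
```

In a `BackgroundService`, letting that exception escape `ExecuteAsync` would fault the service, which is the behavior the finding describes.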
@@ -108,13 +108,13 @@
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Server/Sessions/SessionManager.cs:84-114` |
-| Status | Open |
+| Status | Resolved |
|
||||
**Description:** In `OpenSessionAsync`, `_metrics.SessionOpened()` (line 89) increments the `_openSessions` gauge before `TryAutoSubscribeAlarmsAsync` runs. If auto-subscribe throws (which it does when `Alarms.RequireSubscribeOnOpen` is true and the worker rejects the subscription), the `catch` block disposes and removes the session and records `_metrics.Fault(...)` but never calls `SessionClosed`/`SessionRemoved`. The `mxgateway.sessions.open` gauge permanently over-counts by one for every such failed open.
|
||||
|
||||
**Recommendation:** In the `catch` block, when the session had reached the point where `SessionOpened()` was recorded, also call `_metrics.SessionRemoved()` — or move the `SessionOpened()` call to after auto-subscribe succeeds.
|
||||
|
||||
-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-18. Confirmed against source: the `catch` block in `OpenSessionAsync` recorded `Fault(...)` and removed the session but never decremented the open-session gauge after `SessionOpened()` had run. Added a `sessionOpenedRecorded` flag set immediately after `_metrics.SessionOpened()`; the `catch` block now calls `_metrics.SessionRemoved()` when that flag is set, restoring the gauge for a post-`SessionOpened()` failure (e.g. an auto-subscribe rejection with `RequireSubscribeOnOpen=true`). Regression test: `SessionManagerAlarmAutoSubscribeTests.OpenSessionAsync_DoesNotLeakOpenSessionGauge_WhenAutoSubscribeFailsWithRequireOn`.
|
||||
|
||||
### Server-007
|
||||
|
||||
|
||||
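The Server-006 flag pattern can be reduced to a few lines. In this sketch the integer `openSessions` stands in for the `mxgateway.sessions.open` gauge and `OpenSession` is a hypothetical simplification. Unlike `OpenSessionAsync`, which surfaces the failure to the caller, this version swallows the exception to stay short. Only the `sessionOpenedRecorded` name comes from the commit.

```csharp
using System;

// Stand-in for the open-sessions gauge.
int openSessions = 0;

void OpenSession(bool autoSubscribeFails)
{
    bool sessionOpenedRecorded = false;
    try
    {
        openSessions++;                 // plays the role of _metrics.SessionOpened()
        sessionOpenedRecorded = true;

        if (autoSubscribeFails)
        {
            throw new InvalidOperationException("auto-subscribe rejected");
        }
    }
    catch (InvalidOperationException)
    {
        // Undo the increment only if it actually happened.
        if (sessionOpenedRecorded)
        {
            openSessions--;             // plays the role of _metrics.SessionRemoved()
        }
    }
}

OpenSession(autoSubscribeFails: true);   // failed open: gauge returns to 0
OpenSession(autoSubscribeFails: false);  // successful open: gauge becomes 1

Console.WriteLine(openSessions); // prints 1
```

Without the decrement in the catch block the final value would be 2, matching the over-count the finding describes.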
@@ -579,8 +579,11 @@ Policy:
|
||||
- command exceptions return structured command fault with HRESULT if known,
|
||||
- stale sessions are closed by lease timeout,
|
||||
- stuck workers are killed by process id,
|
||||
-- gateway restart should not attempt to reattach old workers unless explicitly
-  designed; first version should terminate orphaned workers on startup.
|
||||
- gateway restart does not reattach old workers; `OrphanWorkerCleanupHostedService`
|
||||
runs `OrphanWorkerTerminator` once on startup — before the server accepts
|
||||
sessions — to kill leftover `MxGateway.Worker.exe` processes (matched by the
|
||||
configured worker executable path, or by image name when the x64 gateway cannot
|
||||
introspect the x86 worker's module) left behind by a previous unclean run.
|
||||
|
||||
Because each client owns one worker, a crash or leak affects only that session.
|
||||
|
||||
|
||||
@@ -1,6 +1,7 @@
|
||||
using System.Security.Claims;
|
||||
using Microsoft.Data.Sqlite;
|
||||
using MxGateway.Server.Security.Authentication;
|
||||
using MxGateway.Server.Security.Authorization;
|
||||
|
||||
namespace MxGateway.Server.Dashboard;
|
||||
|
||||
@@ -171,6 +172,15 @@ public sealed class DashboardApiKeyManagementService(
|
||||
return "Display name is required.";
|
||||
}
|
||||
|
||||
string[] unknownScopes = request.Scopes
|
||||
.Where(scope => !GatewayScopes.IsKnown(scope))
|
||||
.ToArray();
|
||||
if (unknownScopes.Length > 0)
|
||||
{
|
||||
return $"Unknown scope(s): {string.Join(", ", unknownScopes)}. "
|
||||
+ $"Valid scopes are: {string.Join(", ", GatewayScopes.All)}.";
|
||||
}
|
||||
|
||||
return null;
|
||||
}
|
||||
|
||||
|
||||
@@ -1,5 +1,4 @@
|
||||
using Google.Protobuf.WellKnownTypes;
-using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Logging;
using MxGateway.Contracts.Proto.Galaxy;
using MxGateway.Server.Dashboard;
@@ -181,8 +180,13 @@ public sealed class GalaxyHierarchyCache : IGalaxyHierarchyCache
|
||||
{
|
||||
throw;
|
||||
}
|
||||
-catch (Exception exception) when (exception is SqlException or InvalidOperationException)
+catch (Exception exception)
|
||||
{
|
||||
// Catch every non-cancellation failure — not just SqlException /
|
||||
// InvalidOperationException. A TimeoutException or Win32Exception
|
||||
// from connection establishment, or another DbException subtype,
|
||||
// must still degrade gracefully to Stale/Unavailable and complete
|
||||
// _firstLoad rather than escape and fault the refresh BackgroundService.
|
||||
_logger?.LogWarning(exception, "Galaxy hierarchy cache refresh failed.");
|
||||
GalaxyHierarchyCacheEntry failed = previous with
|
||||
{
|
||||
|
||||
@@ -26,6 +26,15 @@ public sealed class GalaxyHierarchyRefreshService(
|
||||
{
|
||||
return;
|
||||
}
|
||||
catch (Exception exception)
|
||||
{
|
||||
// A transient first-load failure (e.g. a TimeoutException or
|
||||
// Win32Exception from connection establishment, or a DbException
|
||||
// subtype the cache does not catch) must not fault this
|
||||
// BackgroundService and stop the whole gateway. The cache records
|
||||
// its own Unavailable/Stale status; the periodic tick below retries.
|
||||
logger.LogWarning(exception, "Initial Galaxy hierarchy cache load failed; will retry on the refresh interval.");
|
||||
}
|
||||
|
||||
using PeriodicTimer timer = new(interval, _timeProvider);
|
||||
try
|
||||
|
||||
@@ -1,3 +1,5 @@
|
||||
using MxGateway.Server.Security.Authorization;
|
||||
|
||||
namespace MxGateway.Server.Security.Authentication;
|
||||
|
||||
public static class ApiKeyAdminCommandLineParser
|
||||
@@ -95,6 +97,12 @@ public static class ApiKeyAdminCommandLineParser
|
||||
return ApiKeyAdminParseResult.Fail(validationError);
|
||||
}
|
||||
|
||||
string? scopeError = ValidateScopes(kind, scopes);
|
||||
if (scopeError is not null)
|
||||
{
|
||||
return ApiKeyAdminParseResult.Fail(scopeError);
|
||||
}
|
||||
|
||||
return ApiKeyAdminParseResult.Success(new ApiKeyAdminCommand(
|
||||
Kind: kind,
|
||||
Json: json,
|
||||
@@ -152,6 +160,23 @@ public static class ApiKeyAdminCommandLineParser
|
||||
return null;
|
||||
}
|
||||
|
||||
private static string? ValidateScopes(ApiKeyAdminCommandKind kind, IReadOnlySet<string> scopes)
|
||||
{
|
||||
if (kind != ApiKeyAdminCommandKind.CreateKey)
|
||||
{
|
||||
return null;
|
||||
}
|
||||
|
||||
string[] unknown = scopes.Where(scope => !GatewayScopes.IsKnown(scope)).ToArray();
|
||||
if (unknown.Length == 0)
|
||||
{
|
||||
return null;
|
||||
}
|
||||
|
||||
return $"Unknown scope(s): {string.Join(", ", unknown)}. "
|
||||
+ $"Valid scopes are: {string.Join(", ", GatewayScopes.All)}.";
|
||||
}
|
||||
|
||||
private static string KindName(ApiKeyAdminCommandKind kind)
|
||||
{
|
||||
return kind switch
|
||||
|
||||
@@ -10,4 +10,28 @@ public static class GatewayScopes
|
||||
public const string EventsRead = "events:read";
|
||||
public const string MetadataRead = "metadata:read";
|
||||
public const string Admin = "admin";
|
||||
|
||||
/// <summary>
|
||||
/// The complete catalog of canonical scope strings the gateway authorization
|
||||
/// resolver recognizes. Key-creation paths (CLI and dashboard) validate requested
|
||||
/// scopes against this set so a typo or non-canonical name cannot persist a key
|
||||
/// whose scope strings the resolver never matches.
|
||||
/// </summary>
|
||||
public static readonly IReadOnlySet<string> All = new HashSet<string>(
|
||||
[
|
||||
SessionOpen,
|
||||
SessionClose,
|
||||
InvokeRead,
|
||||
InvokeWrite,
|
||||
InvokeSecure,
|
||||
EventsRead,
|
||||
MetadataRead,
|
||||
Admin,
|
||||
],
|
||||
System.StringComparer.Ordinal);
|
||||
|
||||
/// <summary>Determines whether the supplied scope string is a recognized canonical scope.</summary>
|
||||
/// <param name="scope">Scope string to check.</param>
|
||||
/// <returns><see langword="true"/> when the scope is canonical; otherwise <see langword="false"/>.</returns>
|
||||
public static bool IsKnown(string scope) => All.Contains(scope);
|
||||
}
|
||||
|
||||
@@ -68,6 +68,7 @@ public sealed class SessionManager : ISessionManager
|
||||
EnsureSessionCapacity();
|
||||
|
||||
GatewaySession? session = null;
|
||||
bool sessionOpenedRecorded = false;
|
||||
try
|
||||
{
|
||||
session = CreateSession(request, clientIdentity);
|
||||
@@ -86,6 +87,7 @@ public sealed class SessionManager : ISessionManager
|
||||
session.AttachWorkerClient(workerClient);
|
||||
session.MarkReady();
|
||||
_metrics.SessionOpened();
|
||||
sessionOpenedRecorded = true;
|
||||
|
||||
await TryAutoSubscribeAlarmsAsync(session, cancellationToken).ConfigureAwait(false);
|
||||
|
||||
@@ -100,6 +102,14 @@ public sealed class SessionManager : ISessionManager
|
||||
await session.DisposeAsync().ConfigureAwait(false);
|
||||
}
|
||||
|
||||
// If SessionOpened() already incremented the open-session gauge,
|
||||
// a failure after that point (e.g. auto-subscribe rejection) must
|
||||
// decrement it again so mxgateway.sessions.open does not leak.
|
||||
if (sessionOpenedRecorded)
|
||||
{
|
||||
_metrics.SessionRemoved();
|
||||
}
|
||||
|
||||
ReleaseSessionSlot();
|
||||
_metrics.Fault(SessionManagerErrorCode.OpenFailed.ToString());
|
||||
_logger.LogWarning(
|
||||
|
||||
@@ -0,0 +1,29 @@
|
||||
namespace MxGateway.Server.Workers;
|
||||
|
||||
/// <summary>
|
||||
/// Abstraction over OS process enumeration and termination. Exists so the
|
||||
/// orphan-worker cleanup logic can be unit-tested without spawning real
|
||||
/// processes.
|
||||
/// </summary>
|
||||
public interface IRunningProcessInspector
|
||||
{
|
||||
/// <summary>
|
||||
/// Enumerates currently running processes whose image name (without the
|
||||
/// <c>.exe</c> extension) matches <paramref name="processName"/>.
|
||||
/// </summary>
|
||||
/// <param name="processName">Process image name to match, without extension.</param>
|
||||
/// <returns>The matching running processes.</returns>
|
||||
IReadOnlyList<RunningProcessInfo> GetProcessesByName(string processName);
|
||||
|
||||
/// <summary>Forcibly terminates the process with the given identifier.</summary>
|
||||
/// <param name="processId">Identifier of the process to terminate.</param>
|
||||
void Kill(int processId);
|
||||
}
|
||||
|
||||
/// <summary>Identifying information for a running process candidate.</summary>
|
||||
/// <param name="ProcessId">Operating-system process identifier.</param>
|
||||
/// <param name="ExecutablePath">
|
||||
/// Fully-qualified path to the process main module, or <see langword="null"/>
|
||||
/// when it could not be read (e.g. access denied).
|
||||
/// </param>
|
||||
public sealed record RunningProcessInfo(int ProcessId, string? ExecutablePath);
|
||||
@@ -0,0 +1,30 @@
|
||||
namespace MxGateway.Server.Workers;
|
||||
|
||||
/// <summary>
|
||||
/// Hosted service that terminates leftover MXAccess worker processes once on
|
||||
/// gateway startup, before the server begins accepting sessions.
|
||||
/// </summary>
|
||||
public sealed class OrphanWorkerCleanupHostedService(
|
||||
OrphanWorkerTerminator terminator,
|
||||
ILogger<OrphanWorkerCleanupHostedService> logger) : IHostedService
|
||||
{
|
||||
/// <inheritdoc />
|
||||
public Task StartAsync(CancellationToken cancellationToken)
|
||||
{
|
||||
try
|
||||
{
|
||||
terminator.TerminateOrphans();
|
||||
}
|
||||
catch (Exception exception)
|
||||
{
|
||||
// Orphan cleanup is best-effort; a failure here must not prevent
|
||||
// the gateway from starting.
|
||||
logger.LogWarning(exception, "Orphan worker cleanup failed on startup.");
|
||||
}
|
||||
|
||||
return Task.CompletedTask;
|
||||
}
|
||||
|
||||
/// <inheritdoc />
|
||||
public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;
|
||||
}
|
||||
@@ -0,0 +1,138 @@
|
||||
using Microsoft.Extensions.Logging.Abstractions;
|
||||
using Microsoft.Extensions.Options;
|
||||
using MxGateway.Server.Configuration;
|
||||
using MxGateway.Server.Metrics;
|
||||
|
||||
namespace MxGateway.Server.Workers;
|
||||
|
||||
/// <summary>
|
||||
/// Terminates leftover MXAccess worker processes on gateway startup.
|
||||
/// <para>
|
||||
/// Per <c>gateway.md</c> ("first version should terminate orphaned workers
|
||||
/// on startup") and CLAUDE.md, a gateway restart does not reattach old
|
||||
/// workers. After an unclean gateway crash, x86 worker processes — each
|
||||
/// holding an MXAccess COM instance on an STA — survive indefinitely. This
|
||||
/// terminator finds those processes by executable name/path and kills them
|
||||
/// before the restarted gateway accepts sessions.
|
||||
/// </para>
|
||||
/// </summary>
|
||||
public sealed class OrphanWorkerTerminator
|
||||
{
|
||||
private readonly IRunningProcessInspector _inspector;
|
||||
private readonly GatewayMetrics _metrics;
|
||||
private readonly WorkerOptions _workerOptions;
|
||||
private readonly ILogger<OrphanWorkerTerminator> _logger;
|
||||
|
||||
/// <summary>Initializes a new instance of the <see cref="OrphanWorkerTerminator"/> class.</summary>
|
||||
/// <param name="gatewayOptions">Gateway configuration options.</param>
|
||||
/// <param name="inspector">Running-process inspector.</param>
|
||||
/// <param name="metrics">Gateway metrics collector.</param>
|
||||
/// <param name="logger">Optional logger for diagnostic output.</param>
|
||||
public OrphanWorkerTerminator(
|
||||
IOptions<GatewayOptions> gatewayOptions,
|
||||
IRunningProcessInspector inspector,
|
||||
GatewayMetrics metrics,
|
||||
ILogger<OrphanWorkerTerminator>? logger = null)
|
||||
{
|
||||
ArgumentNullException.ThrowIfNull(gatewayOptions);
|
||||
_inspector = inspector ?? throw new ArgumentNullException(nameof(inspector));
|
||||
_metrics = metrics ?? throw new ArgumentNullException(nameof(metrics));
|
||||
_workerOptions = gatewayOptions.Value.Worker;
|
||||
_logger = logger ?? NullLogger<OrphanWorkerTerminator>.Instance;
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Finds and kills every leftover worker process. Safe to call once at
|
||||
/// startup before any session-owned worker is launched.
|
||||
/// </summary>
|
||||
/// <returns>The number of orphan worker processes that were terminated.</returns>
|
||||
public int TerminateOrphans()
|
||||
{
|
||||
string? configuredPath = ResolveConfiguredExecutablePath();
|
||||
string processName = ResolveProcessName(configuredPath);
|
||||
int currentProcessId = Environment.ProcessId;
|
||||
|
||||
int terminated = 0;
|
||||
foreach (RunningProcessInfo candidate in _inspector.GetProcessesByName(processName))
|
||||
{
|
||||
if (candidate.ProcessId == currentProcessId)
|
||||
{
|
||||
continue;
|
||||
}
|
||||
|
||||
if (!IsOrphanWorker(candidate, configuredPath))
|
||||
{
|
||||
continue;
|
||||
}
|
||||
|
||||
try
|
||||
{
|
||||
_inspector.Kill(candidate.ProcessId);
|
||||
_metrics.WorkerKilled("OrphanStartupCleanup");
|
||||
terminated++;
|
||||
_logger.LogWarning(
|
||||
"Terminated orphan worker process {ProcessId} ({ExecutablePath}) left over from a previous gateway run.",
|
||||
candidate.ProcessId,
|
||||
candidate.ExecutablePath ?? processName);
|
||||
}
|
||||
catch (Exception exception)
|
||||
{
|
||||
// The process may have already exited, or be inaccessible.
|
||||
// A failure to kill one orphan must not block gateway startup.
|
||||
_logger.LogWarning(
|
||||
exception,
|
||||
"Failed to terminate orphan worker process {ProcessId}.",
|
||||
candidate.ProcessId);
|
||||
}
|
||||
}
|
||||
|
||||
if (terminated > 0)
|
||||
{
|
||||
_logger.LogInformation("Terminated {Count} orphan worker process(es) on startup.", terminated);
|
||||
}
|
||||
|
||||
return terminated;
|
||||
}
|
||||
|
||||
private static bool IsOrphanWorker(RunningProcessInfo candidate, string? configuredPath)
|
||||
{
|
||||
// When the executable path is readable, require an exact match against
|
||||
// the configured worker path so unrelated processes that merely share
|
||||
// the image name are never killed.
|
||||
if (candidate.ExecutablePath is { } path)
|
||||
{
|
||||
return configuredPath is not null
|
||||
&& string.Equals(path, configuredPath, StringComparison.OrdinalIgnoreCase);
|
||||
}
|
||||
|
||||
// A null path means the x64 gateway could not introspect the module —
|
||||
// the expected case for the x86 worker. Image-name match is the only
|
||||
// signal available; treat it as an orphan.
|
||||
return true;
|
||||
}
|
||||
|
||||
private string? ResolveConfiguredExecutablePath()
|
||||
{
|
||||
try
|
||||
{
|
||||
return Path.GetFullPath(_workerOptions.ExecutablePath);
|
||||
}
|
||||
catch (Exception exception) when (exception is ArgumentException
|
||||
or NotSupportedException
|
||||
or PathTooLongException)
|
||||
{
|
||||
_logger.LogWarning(
|
||||
exception,
|
||||
"Configured worker executable path '{ExecutablePath}' is not a valid filesystem path; "
|
||||
+ "orphan cleanup will match by image name only.",
|
||||
_workerOptions.ExecutablePath);
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
private static string ResolveProcessName(string? configuredPath)
|
||||
{
|
||||
string source = configuredPath ?? "MxGateway.Worker.exe";
|
||||
return Path.GetFileNameWithoutExtension(source);
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,55 @@
|
||||
using System.Diagnostics;
|
||||
|
||||
namespace MxGateway.Server.Workers;
|
||||
|
||||
/// <summary>
|
||||
/// <see cref="IRunningProcessInspector"/> backed by <see cref="Process"/>.
|
||||
/// </summary>
|
||||
public sealed class SystemRunningProcessInspector : IRunningProcessInspector
|
||||
{
|
||||
/// <inheritdoc />
|
||||
public IReadOnlyList<RunningProcessInfo> GetProcessesByName(string processName)
|
||||
{
|
||||
List<RunningProcessInfo> results = [];
|
||||
Process[] processes = Process.GetProcessesByName(processName);
|
||||
try
|
||||
{
|
||||
foreach (Process process in processes)
|
||||
{
|
||||
results.Add(new RunningProcessInfo(process.Id, TryGetExecutablePath(process)));
|
||||
}
|
||||
}
|
||||
finally
|
||||
{
|
||||
foreach (Process process in processes)
|
||||
{
|
||||
process.Dispose();
|
||||
}
|
||||
}
|
||||
|
||||
return results;
|
||||
}
|
||||
|
||||
/// <inheritdoc />
|
||||
public void Kill(int processId)
|
||||
{
|
||||
using Process process = Process.GetProcessById(processId);
|
||||
process.Kill(entireProcessTree: true);
|
||||
}
|
||||
|
||||
private static string? TryGetExecutablePath(Process process)
|
||||
{
|
||||
try
|
||||
{
|
||||
return process.MainModule?.FileName;
|
||||
}
|
||||
catch (Exception exception) when (exception is InvalidOperationException
|
||||
or System.ComponentModel.Win32Exception
|
||||
or NotSupportedException)
|
||||
{
|
||||
// Access to the main module can be denied (e.g. a 64-bit gateway
|
||||
// querying a 32-bit worker, or a process owned by another user).
|
||||
return null;
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -11,6 +11,13 @@ public static class WorkerServiceCollectionExtensions
|
||||
services.AddSingleton<IWorkerStartupProbe, WorkerProcessStartedProbe>();
|
||||
services.AddSingleton<IWorkerProcessLauncher, WorkerProcessLauncher>();
|
||||
|
||||
// Terminate workers leaked by a previous unclean gateway run before the
|
||||
// server accepts sessions. Registered ahead of AddGatewaySessions so the
|
||||
// cleanup hosted service starts before the session subsystem.
|
||||
services.AddSingleton<IRunningProcessInspector, SystemRunningProcessInspector>();
|
||||
services.AddSingleton<OrphanWorkerTerminator>();
|
||||
services.AddHostedService<OrphanWorkerCleanupHostedService>();
|
||||
|
||||
return services;
|
||||
}
|
||||
}
|
||||
|
||||
@@ -0,0 +1,64 @@
|
||||
using Microsoft.Extensions.Logging.Abstractions;
|
||||
using Microsoft.Extensions.Options;
|
||||
using MxGateway.Server.Galaxy;
|
||||
|
||||
namespace MxGateway.Tests.Galaxy;
|
||||
|
||||
/// <summary>
|
||||
/// Server-005 regression: the initial <c>RefreshAsync</c> call in
|
||||
/// <see cref="GalaxyHierarchyRefreshService"/> must not let a transient,
|
||||
/// non-cancellation first-load failure (e.g. a <see cref="TimeoutException"/>
|
||||
/// or <see cref="System.ComponentModel.Win32Exception"/> from connection
|
||||
/// establishment) escape and fault the host <c>BackgroundService</c>.
|
||||
/// </summary>
|
||||
public sealed class GalaxyHierarchyRefreshServiceTests
|
||||
{
|
||||
[Fact]
|
||||
public async Task ExecuteAsync_WhenFirstRefreshThrowsNonCancellationException_DoesNotFaultBackgroundService()
|
||||
{
|
||||
ThrowingCache cache = new(new TimeoutException("connection establishment timed out"));
|
||||
GalaxyHierarchyRefreshService service = CreateService(cache);
|
||||
|
||||
using CancellationTokenSource cts = new();
|
||||
|
||||
await service.StartAsync(cts.Token);
|
||||
await cts.CancelAsync();
|
||||
|
||||
// The background loop must have stopped cleanly: ExecuteTask completes
|
||||
// (RanToCompletion or Canceled) rather than faulting on the first refresh.
|
||||
Task? executeTask = service.ExecuteTask;
|
||||
Assert.NotNull(executeTask);
|
||||
await executeTask;
|
||||
Assert.False(executeTask.IsFaulted);
|
||||
Assert.Equal(1, cache.RefreshCallCount);
|
||||
|
||||
await service.StopAsync(CancellationToken.None);
|
||||
}
|
||||
|
||||
private static GalaxyHierarchyRefreshService CreateService(IGalaxyHierarchyCache cache)
|
||||
{
|
||||
GalaxyRepositoryOptions options = new()
|
||||
{
|
||||
DashboardRefreshIntervalSeconds = 3600,
|
||||
};
|
||||
return new GalaxyHierarchyRefreshService(
|
||||
cache,
|
||||
Options.Create(options),
|
||||
NullLogger<GalaxyHierarchyRefreshService>.Instance);
|
||||
}
|
||||
|
||||
private sealed class ThrowingCache(Exception toThrow) : IGalaxyHierarchyCache
|
||||
{
|
||||
public int RefreshCallCount { get; private set; }
|
||||
|
||||
public GalaxyHierarchyCacheEntry Current => GalaxyHierarchyCacheEntry.Empty;
|
||||
|
||||
public Task RefreshAsync(CancellationToken cancellationToken)
|
||||
{
|
||||
RefreshCallCount++;
|
||||
throw toThrow;
|
||||
}
|
||||
|
||||
public Task WaitForFirstLoadAsync(CancellationToken cancellationToken) => Task.CompletedTask;
|
||||
}
|
||||
}
|
||||
@@ -112,6 +112,33 @@ public sealed class DashboardApiKeyManagementServiceTests
|
||||
&& entry.Details == "rotated");
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Server-004 regression: the dashboard create path must reject a request
|
||||
/// carrying a non-canonical scope string rather than persisting a key whose
|
||||
/// scope the authorization resolver never matches.
|
||||
/// </summary>
|
||||
[Fact]
|
||||
public async Task CreateAsync_UnknownScope_DoesNotCallStore()
|
||||
{
|
||||
FakeApiKeyAdminStore adminStore = new();
|
||||
DashboardApiKeyManagementService service = CreateService(adminStore);
|
||||
|
||||
DashboardApiKeyManagementRequest request = CreateRequest() with
|
||||
{
|
||||
Scopes = new HashSet<string>(
|
||||
[GatewayScopes.SessionOpen, "invoke", "metadata"],
|
||||
StringComparer.Ordinal),
|
||||
};
|
||||
|
||||
DashboardApiKeyManagementResult result = await service.CreateAsync(
|
||||
CreateAuthorizedUser(),
|
||||
request,
|
||||
CancellationToken.None);
|
||||
|
||||
Assert.False(result.Succeeded);
|
||||
Assert.Equal(0, adminStore.CreateCount);
|
||||
}
|
||||
|
||||
private static DashboardApiKeyManagementService CreateService(
|
||||
FakeApiKeyAdminStore? adminStore = null,
|
||||
FakeApiKeyAuditStore? auditStore = null,
|
||||
|
||||
@@ -125,6 +125,44 @@ public sealed class SessionManagerAlarmAutoSubscribeTests
                 CreateOpenRequest(), "client-1", CancellationToken.None));
     }

+    /// <summary>
+    /// Server-006 regression: when auto-subscribe throws after
+    /// <c>SessionOpened()</c> incremented the open-session gauge, the failed
+    /// open must not leave <c>mxgateway.sessions.open</c> over-counted.
+    /// </summary>
+    [Fact]
+    public async Task OpenSessionAsync_DoesNotLeakOpenSessionGauge_WhenAutoSubscribeFailsWithRequireOn()
+    {
+        AlarmAutoSubscribeWorkerClient worker = new()
+        {
+            SubscribeAlarmsReplyFactory = _ => new MxCommandReply
+            {
+                Kind = MxCommandKind.SubscribeAlarms,
+                ProtocolStatus = new ProtocolStatus
+                {
+                    Code = ProtocolStatusCode.MxaccessFailure,
+                    Message = "wnwrap subscribe failed",
+                },
+            },
+        };
+        using GatewayMetrics metrics = new();
+        SessionManager manager = NewManager(
+            worker,
+            alarms: new AlarmsOptions
+            {
+                Enabled = true,
+                SubscriptionExpression = @"\\HOST\Galaxy!Area1",
+                RequireSubscribeOnOpen = true,
+            },
+            metrics: metrics);
+
+        await Assert.ThrowsAsync<SessionManagerException>(
+            async () => await manager.OpenSessionAsync(
+                CreateOpenRequest(), "client-1", CancellationToken.None));
+
+        Assert.Equal(0, metrics.GetSnapshot().OpenSessions);
+    }
+
     [Fact]
     public async Task OpenSessionAsync_Throws_WhenEnabledButNoExpressionAndRequireOn()
     {
@@ -161,7 +199,8 @@ public sealed class SessionManagerAlarmAutoSubscribeTests

     private static SessionManager NewManager(
         AlarmAutoSubscribeWorkerClient worker,
-        AlarmsOptions alarms)
+        AlarmsOptions alarms,
+        GatewayMetrics? metrics = null)
     {
         FakeSessionWorkerClientFactory factory = new(worker);
         GatewayOptions options = new GatewayOptions
@@ -183,7 +222,7 @@ public sealed class SessionManagerAlarmAutoSubscribeTests
             new SessionRegistry(),
             factory,
             Options.Create(options),
-            new GatewayMetrics());
+            metrics ?? new GatewayMetrics());
     }

     private static SessionOpenRequest CreateOpenRequest()
@@ -0,0 +1,137 @@
+using Microsoft.Extensions.Options;
+using MxGateway.Server.Configuration;
+using MxGateway.Server.Metrics;
+using MxGateway.Server.Workers;
+
+namespace MxGateway.Tests.Gateway.Workers;
+
+/// <summary>
+/// Server-002 regression: per <c>gateway.md</c> the gateway must terminate
+/// orphaned worker processes on startup. These tests pin that the terminator
+/// kills leftover workers (matched by executable path, or by image name when
+/// the path is unreadable) without touching unrelated processes or itself.
+/// </summary>
+public sealed class OrphanWorkerTerminatorTests
+{
+    private const string WorkerExecutablePath = @"C:\app\src\MxGateway.Worker\bin\x86\Release\MxGateway.Worker.exe";
+
+    [Fact]
+    public void TerminateOrphans_KillsWorkerProcessesMatchingConfiguredExecutablePath()
+    {
+        FakeProcessInspector inspector = new(
+        [
+            new RunningProcessInfo(101, WorkerExecutablePath),
+            new RunningProcessInfo(102, WorkerExecutablePath),
+        ]);
+        OrphanWorkerTerminator terminator = CreateTerminator(inspector);
+
+        int killed = terminator.TerminateOrphans();
+
+        Assert.Equal(2, killed);
+        Assert.Equal([101, 102], inspector.KilledProcessIds.Order());
+    }
+
+    [Fact]
+    public void TerminateOrphans_KillsImageNameMatchWhenExecutablePathUnreadable()
+    {
+        // The x64 gateway cannot introspect the x86 worker's main module, so the
+        // path comes back null. Image-name match is the only signal — and it is
+        // exactly the orphan worker case, so the process must still be killed.
+        FakeProcessInspector inspector = new(
+        [
+            new RunningProcessInfo(201, ExecutablePath: null),
+        ]);
+        OrphanWorkerTerminator terminator = CreateTerminator(inspector);
+
+        int killed = terminator.TerminateOrphans();
+
+        Assert.Equal(1, killed);
+        Assert.Equal([201], inspector.KilledProcessIds);
+    }
+
+    [Fact]
+    public void TerminateOrphans_DoesNotKillUnrelatedProcessSharingImageName()
+    {
+        // A process with the same image name but a different executable path is
+        // not our worker and must be left alone.
+        FakeProcessInspector inspector = new(
+        [
+            new RunningProcessInfo(301, @"C:\other\place\MxGateway.Worker.exe"),
+        ]);
+        OrphanWorkerTerminator terminator = CreateTerminator(inspector);
+
+        int killed = terminator.TerminateOrphans();
+
+        Assert.Equal(0, killed);
+        Assert.Empty(inspector.KilledProcessIds);
+    }
+
+    [Fact]
+    public void TerminateOrphans_DoesNotKillCurrentProcess()
+    {
+        FakeProcessInspector inspector = new(
+        [
+            new RunningProcessInfo(Environment.ProcessId, WorkerExecutablePath),
+        ]);
+        OrphanWorkerTerminator terminator = CreateTerminator(inspector);
+
+        int killed = terminator.TerminateOrphans();
+
+        Assert.Equal(0, killed);
+        Assert.Empty(inspector.KilledProcessIds);
+    }
+
+    [Fact]
+    public void TerminateOrphans_ContinuesWhenOneKillThrows()
+    {
+        FakeProcessInspector inspector = new(
+        [
+            new RunningProcessInfo(401, WorkerExecutablePath),
+            new RunningProcessInfo(402, WorkerExecutablePath),
+        ])
+        {
+            ThrowOnKillProcessId = 401,
+        };
+        OrphanWorkerTerminator terminator = CreateTerminator(inspector);
+
+        int killed = terminator.TerminateOrphans();
+
+        Assert.Equal(1, killed);
+        Assert.Contains(402, inspector.KilledProcessIds);
+    }
+
+    private static OrphanWorkerTerminator CreateTerminator(IRunningProcessInspector inspector)
+    {
+        GatewayOptions options = new()
+        {
+            Worker = new WorkerOptions
+            {
+                ExecutablePath = WorkerExecutablePath,
+            },
+        };
+        return new OrphanWorkerTerminator(
+            Options.Create(options),
+            inspector,
+            new GatewayMetrics());
+    }
+
+    private sealed class FakeProcessInspector(IReadOnlyList<RunningProcessInfo> processes)
+        : IRunningProcessInspector
+    {
+        public List<int> KilledProcessIds { get; } = [];
+
+        public int? ThrowOnKillProcessId { get; init; }
+
+        public IReadOnlyList<RunningProcessInfo> GetProcessesByName(string processName) => processes;
+
+        public void Kill(int processId)
+        {
+            if (ThrowOnKillProcessId == processId)
+            {
+                throw new InvalidOperationException("Process has already exited.");
+            }
+
+            KilledProcessIds.Add(processId);
+        }
+    }
+}
@@ -52,6 +52,56 @@ public sealed class ApiKeyAdminCommandLineParserTests
         Assert.Contains("events:read", result.Command.Scopes);
     }

+    /// <summary>
+    /// Server-004 regression: a create-key command with a non-canonical scope
+    /// string (e.g. CLAUDE.md's stale <c>invoke</c> instead of <c>invoke:read</c>)
+    /// must be rejected at parse time rather than silently persisting an
+    /// unusable scope the authorization resolver never matches.
+    /// </summary>
+    [Fact]
+    public void Parse_CreateKeyCommand_RejectsUnknownScope()
+    {
+        ApiKeyAdminParseResult result = ApiKeyAdminCommandLineParser.Parse(
+        [
+            "apikey",
+            "create-key",
+            "--key-id",
+            "operator01",
+            "--display-name",
+            "Operator",
+            "--scopes",
+            "session:open,invoke,metadata",
+        ]);
+
+        Assert.True(result.IsApiKeyCommand);
+        Assert.Null(result.Command);
+        Assert.NotNull(result.Error);
+        Assert.Contains("invoke", result.Error, StringComparison.Ordinal);
+        Assert.Contains("metadata", result.Error, StringComparison.Ordinal);
+    }
+
+    /// <summary>Verifies a create-key command with only canonical scopes parses successfully.</summary>
+    [Fact]
+    public void Parse_CreateKeyCommand_AcceptsAllCanonicalScopes()
+    {
+        ApiKeyAdminParseResult result = ApiKeyAdminCommandLineParser.Parse(
+        [
+            "apikey",
+            "create-key",
+            "--key-id",
+            "operator01",
+            "--display-name",
+            "Operator",
+            "--scopes",
+            "session:open,session:close,invoke:read,invoke:write,invoke:secure,events:read,metadata:read,admin",
+        ]);
+
+        Assert.True(result.IsApiKeyCommand);
+        Assert.Null(result.Error);
+        Assert.NotNull(result.Command);
+        Assert.Equal(8, result.Command.Scopes.Count);
+    }
+
     /// <summary>
     /// Verifies create key without display name returns error.
     /// </summary>