Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules

Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and
fixed them in three priority waves (3 High, 17 Medium, 52 Low).

Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
  GatewayGrpcScopeResolver so non-admin keys can use them; document
  the mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the
  CLI FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression
  in generated tonic code by reformatting the ReadBulkCommand proto
  comment and scoping a #![allow(...)] to the generated submodules.

Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and
  make DisposeAsync race-safe against in-flight CloseAsync (-016);
  add constraint-enforcement test coverage for the bulk-plan path
  (-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop
  can distinguish graceful shutdown from a real STA-affinity
  violation (-016); have the watchdog skip StaHung while
  CurrentCommandCorrelationId is non-empty so a legitimate slow
  ReadBulk no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the
  11 GatewaySession bulk methods (-013); replace the real TCP probe
  in GalaxyHierarchyCacheTests with an IGalaxyRepository fake
  (-016).
- IntegrationTests: drive the StreamEvents writer in the live Write
  test and assert OnWriteComplete (-012); add live tests for
  Unadvise/RemoveItem/Unregister ordering, WriteSecured, and
  abnormal worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
  CreateForTesting factory (-016); cover WorkerCancel and
  unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races
  beforeStart() (-014); return a CancellingCompletableFuture that
  actually forwards cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in
  the CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
  histograms with failed-call durations (-015); add coverage for
  the five MalformedReply paths, the bulk-write helpers, the
  Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write
  command family (-009).

Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
  WorkerAlarmRpcDispatcher missing-session handling; drop the
  duplicate dashboard @page routes; refresh IAlarmRpcDispatcher
  XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
  subscriptionExpression / ExecutingCommand arms; preserve
  factory-supplied runtime sessions; split MxAlarmSnapshot.cs into
  three files.
- Tests: dispose the WebApplication in seven test classes; rebuild
  FakeWorkerProcess.WaitForExitAsync against a real
  TaskCompletionSource; switch the heartbeat-expires test to
  ManualTimeProvider;
  add InvariantCulture to the remaining DateTimeOffset.Parse sites;
  document GalaxyFilterInputSafetyTests in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
  IDisposable, class-level [Trait], single-source ZB default
  connection string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact
  so absent env vars SKIP rather than pass; PascalCase rename of probe
  [Fact]s; deterministic deadline test; new frame-protocol error
  tests; ComputeTransitions diff-coverage; relocate dev-rig probes
  to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
  Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
  TreatWarningsAsErrors / analysers apply; document
  DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
  bulk-read handles in CLI; surface AcknowledgeAlarm transport
  faults through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
  runWriteBulkVariant; document the six new subcommands in
  writeUsage; drain galaxy-watch events on limit; switch io.EOF
  comparisons to errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout
  option; regex-based credential redaction; Long.toUnsignedString
  for uint64 sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
  _percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
  _api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close
  stop hard-coding correlation IDs; resync RustClientDesign.md
  with the current Session / Error surface and CLI subcommand set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-20 09:46:47 -04:00
parent 1cd51bbda3
commit a0203503a7
122 changed files with 8723 additions and 757 deletions
@@ -1,5 +1,4 @@
@page "/apikeys"
@page "/dashboard/apikeys"
@inherits DashboardPageBase
@inject AuthenticationStateProvider AuthenticationStateProvider
@inject IDashboardApiKeyManagementService ApiKeyManagementService
@@ -1,5 +1,4 @@
@page "/"
@page "/dashboard/"
@inherits DashboardPageBase
<PageTitle>MXAccess Gateway Dashboard</PageTitle>
@@ -1,5 +1,4 @@
@page "/events"
@page "/dashboard/events"
@inherits DashboardPageBase
<PageTitle>Dashboard Events</PageTitle>
@@ -1,5 +1,4 @@
@page "/galaxy"
@page "/dashboard/galaxy"
@inherits DashboardPageBase
<PageTitle>Dashboard Galaxy</PageTitle>
@@ -1,5 +1,4 @@
@page "/sessions/{SessionId}"
@page "/dashboard/sessions/{SessionId}"
@inherits DashboardPageBase
<PageTitle>Dashboard Session</PageTitle>
@@ -1,5 +1,4 @@
@page "/sessions"
@page "/dashboard/sessions"
@inherits DashboardPageBase
<PageTitle>Dashboard Sessions</PageTitle>
@@ -1,5 +1,4 @@
@page "/settings"
@page "/dashboard/settings"
@inherits DashboardPageBase
<PageTitle>Dashboard Settings</PageTitle>
@@ -1,5 +1,4 @@
@page "/workers"
@page "/dashboard/workers"
@inherits DashboardPageBase
<PageTitle>Dashboard Workers</PageTitle>
@@ -7,13 +7,42 @@ namespace MxGateway.Server.Galaxy;
public static class GalaxyGlobMatcher
{
/// <summary>
/// Compiled-regex cache keyed by glob pattern. <c>IsMatch</c> is called once per
/// object per <c>DiscoverHierarchy</c>/<c>WatchDeployEvents</c> evaluation, so the
/// same handful of glob patterns are translated repeatedly; caching avoids
/// rebuilding and recompiling the regex on every call.
/// Maximum number of compiled-regex entries retained in <see cref="RegexCache"/>.
/// The cache is keyed by glob pattern and patterns flow in from two sources:
/// admin-controlled API-key constraints (naturally bounded) and the
/// client-supplied <c>DiscoverHierarchyRequest.TagNameGlob</c> (unbounded — a
/// client can iterate through generated names and create millions of distinct
/// globs over the process lifetime). Capping the cache bounds memory while
/// keeping the hot working set hit-cached.
/// </summary>
internal const int RegexCacheCapacity = 256;
/// <summary>
/// Bounded compiled-regex cache keyed by glob pattern. <c>IsMatch</c> is called
/// once per object per <c>DiscoverHierarchy</c>/<c>WatchDeployEvents</c>
/// evaluation, so the same handful of glob patterns are translated
/// repeatedly; caching avoids rebuilding and recompiling the regex on every
/// call. Beyond <see cref="RegexCacheCapacity"/> entries the oldest insertion
/// is evicted so a client cannot grow the cache without bound by submitting
/// unique patterns. Eviction is approximate (FIFO over insertion order, not
/// true LRU) because we only need the bound, not exact recency tracking.
/// </summary>
private static readonly ConcurrentDictionary<string, Regex> RegexCache = new(StringComparer.Ordinal);
/// <summary>
/// Insertion-order queue used to evict the oldest cache entry when the cache
/// exceeds <see cref="RegexCacheCapacity"/>. A separate queue keeps the
/// <see cref="RegexCache"/> reads lock-free; the lock below only guards the
/// eviction path.
/// </summary>
private static readonly ConcurrentQueue<string> InsertionOrder = new();
private static readonly object EvictionLock = new();
/// <summary>
/// Current cache size, exposed for tests asserting the cap is honoured.
/// </summary>
internal static int CurrentCacheSize => RegexCache.Count;
public static bool IsMatch(string value, string glob)
{
if (string.IsNullOrWhiteSpace(glob))
@@ -26,10 +55,42 @@ public static class GalaxyGlobMatcher
private static Regex GetOrCreateRegex(string glob)
{
return RegexCache.GetOrAdd(glob, static pattern => new Regex(
BuildRegex(pattern),
if (RegexCache.TryGetValue(glob, out Regex? existing))
{
return existing;
}
Regex compiled = new(
BuildRegex(glob),
RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled,
TimeSpan.FromMilliseconds(100)));
TimeSpan.FromMilliseconds(100));
if (RegexCache.TryAdd(glob, compiled))
{
InsertionOrder.Enqueue(glob);
EvictIfOverCapacity();
return compiled;
}
// Another thread won the race — use its compiled regex.
return RegexCache[glob];
}
private static void EvictIfOverCapacity()
{
if (RegexCache.Count <= RegexCacheCapacity)
{
return;
}
// Serialize eviction so two threads do not race past the cap together.
lock (EvictionLock)
{
while (RegexCache.Count > RegexCacheCapacity && InsertionOrder.TryDequeue(out string? oldest))
{
RegexCache.TryRemove(oldest, out _);
}
}
}
private static string BuildRegex(string glob)
@@ -17,7 +17,7 @@ public sealed class GalaxyHierarchyCache : IGalaxyHierarchyCache
{
private static readonly TimeSpan StaleThreshold = TimeSpan.FromMinutes(5);
private readonly GalaxyRepository _repository;
private readonly IGalaxyRepository _repository;
private readonly IGalaxyDeployNotifier _notifier;
private readonly TimeProvider _timeProvider;
private readonly ILogger<GalaxyHierarchyCache>? _logger;
@@ -31,7 +31,7 @@ public sealed class GalaxyHierarchyCache : IGalaxyHierarchyCache
/// <param name="timeProvider">Provider for current time; defaults to system time.</param>
/// <param name="logger">Optional logger for diagnostic output.</param>
public GalaxyHierarchyCache(
GalaxyRepository repository,
IGalaxyRepository repository,
IGalaxyDeployNotifier notifier,
TimeProvider? timeProvider = null,
ILogger<GalaxyHierarchyCache>? logger = null)
@@ -8,7 +8,7 @@ namespace MxGateway.Server.Galaxy;
/// consumers — the same SQL drives the OPC UA server's address space and this gateway's
/// gRPC browse surface.
/// </summary>
public sealed class GalaxyRepository(GalaxyRepositoryOptions options)
public sealed class GalaxyRepository(GalaxyRepositoryOptions options) : IGalaxyRepository
{
/// <summary>Tests the connection to the Galaxy Repository database.</summary>
/// <param name="ct">Token to cancel the asynchronous operation.</param>
@@ -8,10 +8,17 @@ public sealed class GalaxyRepositoryOptions
{
public const string SectionName = "MxGateway:Galaxy";
/// <summary>The SQL Server connection string for the Galaxy Repository database.</summary>
public string ConnectionString { get; init; } =
/// <summary>
/// Default SQL Server connection string for the Galaxy Repository database.
/// Single source of truth shared with the integration-test fallback so the
/// production default and the live-test default cannot drift.
/// </summary>
public const string DefaultConnectionString =
"Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;";
/// <summary>The SQL Server connection string for the Galaxy Repository database.</summary>
public string ConnectionString { get; init; } = DefaultConnectionString;
/// <summary>The timeout in seconds for SQL commands executed against the Galaxy Repository.</summary>
public int CommandTimeoutSeconds { get; init; } = 60;
@@ -16,6 +16,7 @@ public static class GalaxyRepositoryServiceCollectionExtensions
services.AddSingleton(sp =>
new GalaxyRepository(sp.GetRequiredService<IOptions<GalaxyRepositoryOptions>>().Value));
services.AddSingleton<IGalaxyRepository>(sp => sp.GetRequiredService<GalaxyRepository>());
services.AddSingleton<IGalaxyDeployNotifier, GalaxyDeployNotifier>();
services.AddSingleton<IGalaxyHierarchyCache, GalaxyHierarchyCache>();
@@ -0,0 +1,30 @@
namespace MxGateway.Server.Galaxy;
/// <summary>
/// Abstraction over <see cref="GalaxyRepository"/> consumed by
/// <see cref="GalaxyHierarchyCache"/>. Exists so the cache can be unit-tested
/// against an in-memory fake that throws a <see cref="System.Exception"/>
/// from <see cref="GetLastDeployTimeAsync"/> (the unavailable-backend code
/// path) without standing up a real <c>Microsoft.Data.SqlClient</c>
/// <c>SqlConnection</c> against a bogus host/port. The production gateway
/// wires the concrete <see cref="GalaxyRepository"/>; the SQL surface itself
/// stays covered by <c>MxGateway.IntegrationTests.Galaxy.GalaxyRepositoryLiveTests</c>.
/// </summary>
public interface IGalaxyRepository
{
/// <summary>Tests the connection to the Galaxy Repository database.</summary>
/// <param name="ct">Token to cancel the asynchronous operation.</param>
Task<bool> TestConnectionAsync(CancellationToken ct = default);
/// <summary>Retrieves the last deployment time from the Galaxy Repository.</summary>
/// <param name="ct">Token to cancel the asynchronous operation.</param>
Task<DateTime?> GetLastDeployTimeAsync(CancellationToken ct = default);
/// <summary>Retrieves the complete hierarchy of Galaxy objects from the repository.</summary>
/// <param name="ct">Token to cancel the asynchronous operation.</param>
Task<List<GalaxyHierarchyRow>> GetHierarchyAsync(CancellationToken ct = default);
/// <summary>Retrieves all attributes for Galaxy objects from the repository.</summary>
/// <param name="ct">Token to cancel the asynchronous operation.</param>
Task<List<GalaxyAttributeRow>> GetAttributesAsync(CancellationToken ct = default);
}
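Reviewer note: the in-memory fake that the interface summary describes could look roughly like the sketch below. It is not the fake shipped in the test project. `ThrowingGalaxyRepository` is a hypothetical name, the choice of `InvalidOperationException` is an assumption (the summary only says the fake throws a `System.Exception`), and the two row classes are empty stand-ins declared only so the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Empty stand-ins (assumption): the real row types live in MxGateway.Server.Galaxy.
public sealed class GalaxyHierarchyRow { }
public sealed class GalaxyAttributeRow { }

// Same four members as the interface introduced in this commit.
public interface IGalaxyRepository
{
    Task<bool> TestConnectionAsync(CancellationToken ct = default);
    Task<DateTime?> GetLastDeployTimeAsync(CancellationToken ct = default);
    Task<List<GalaxyHierarchyRow>> GetHierarchyAsync(CancellationToken ct = default);
    Task<List<GalaxyAttributeRow>> GetAttributesAsync(CancellationToken ct = default);
}

// Hypothetical fake: simulates an unavailable backend without opening a socket.
public sealed class ThrowingGalaxyRepository : IGalaxyRepository
{
    private int _deployTimeCalls;

    // Lets a test assert how often the cache probed the backend.
    public int DeployTimeCalls => Volatile.Read(ref _deployTimeCalls);

    public Task<bool> TestConnectionAsync(CancellationToken ct = default) =>
        Task.FromResult(false);

    public Task<DateTime?> GetLastDeployTimeAsync(CancellationToken ct = default)
    {
        Interlocked.Increment(ref _deployTimeCalls);
        // Faulted task: the exception surfaces when the caller awaits.
        return Task.FromException<DateTime?>(
            new InvalidOperationException("Simulated: Galaxy Repository unavailable."));
    }

    public Task<List<GalaxyHierarchyRow>> GetHierarchyAsync(CancellationToken ct = default) =>
        Task.FromResult(new List<GalaxyHierarchyRow>());

    public Task<List<GalaxyAttributeRow>> GetAttributesAsync(CancellationToken ct = default) =>
        Task.FromResult(new List<GalaxyAttributeRow>());
}
```

A cache test would pass an instance of the fake wherever the constructor now accepts `IGalaxyRepository`, then assert on the cache's unavailable-backend behaviour and on `DeployTimeCalls`.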
@@ -18,6 +18,8 @@ public sealed class GatewayGrpcScopeResolver
CloseSessionRequest => GatewayScopes.SessionClose,
StreamEventsRequest => GatewayScopes.EventsRead,
MxCommandRequest commandRequest => ResolveCommandScope(commandRequest.Command?.Kind ?? MxCommandKind.Unspecified),
AcknowledgeAlarmRequest => GatewayScopes.InvokeWrite,
QueryActiveAlarmsRequest => GatewayScopes.EventsRead,
TestConnectionRequest or
GetLastDeployTimeRequest or
DiscoverHierarchyRequest or
@@ -263,6 +263,17 @@ public sealed class GatewaySession
/// Transitions the session to a new state with constraints for terminal states.
/// </summary>
/// <param name="nextState">Next session state to transition to.</param>
/// <remarks>
/// <see cref="SessionState.Closed"/> is terminal. <see cref="SessionState.Faulted"/>
/// only allows a transition to <see cref="SessionState.Closed"/>.
/// <see cref="SessionState.Closing"/> only allows a transition to
/// <see cref="SessionState.Closed"/> (or <see cref="SessionState.Faulted"/>) — once
/// <see cref="CloseAsync"/> has started, no late lifecycle callback can revive the
/// session by walking it back to <see cref="SessionState.Ready"/> or any earlier
/// state. Both close-related writes (<c>Closing</c> and <c>Closed</c>) go through
/// <c>_syncRoot</c> just like every other state read/write, closing the split-lock
/// race called out in Server-015.
/// </remarks>
public void TransitionTo(SessionState nextState)
{
lock (_syncRoot)
@@ -277,6 +288,13 @@ public sealed class GatewaySession
return;
}
if (_state is SessionState.Closing
&& nextState is not SessionState.Closed
&& nextState is not SessionState.Faulted)
{
return;
}
_state = nextState;
}
}
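Reviewer note: the transition rules stated in the remarks can be read as a small table. The sketch below restates them as a pure function for illustration only. `SessionTransitions.IsAllowed` is a hypothetical helper, and the enum is a stand-in listing only the four states named in this hunk. The real `TransitionTo` silently returns on a disallowed transition instead of reporting a result.

```csharp
using System;

// Stand-in enum (assumption): only the states named in the remarks are listed.
public enum SessionState { Ready, Closing, Closed, Faulted }

public static class SessionTransitions
{
    // Hypothetical helper mirroring the documented rules:
    //   Closed  -> nothing (terminal)
    //   Faulted -> Closed only
    //   Closing -> Closed or Faulted
    //   other   -> anything
    public static bool IsAllowed(SessionState current, SessionState next) => current switch
    {
        SessionState.Closed => false,
        SessionState.Faulted => next is SessionState.Closed,
        SessionState.Closing => next is SessionState.Closed or SessionState.Faulted,
        _ => true,
    };
}
```

Under these rules a late lifecycle callback asking for `Ready` after close has started is rejected, because `IsAllowed(SessionState.Closing, SessionState.Ready)` evaluates to `false`.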
@@ -717,6 +735,14 @@ public sealed class GatewaySession
/// </summary>
/// <param name="reason">Reason for closing the session.</param>
/// <param name="cancellationToken">Token to cancel the asynchronous operation.</param>
/// <remarks>
/// Concurrent close attempts are serialized by <c>_closeLock</c> so only one close
/// runs at a time, but every read/write of <c>_state</c> still passes through
/// <c>_syncRoot</c> (via <see cref="TryBeginClose"/> and <see cref="MarkClosed"/>) —
/// the close path therefore obeys the same lock discipline as
/// <see cref="TransitionTo"/> / <see cref="MarkFaulted"/> and a concurrent
/// <c>TransitionTo(Ready)</c> cannot race past a <c>Closing</c> write.
/// </remarks>
public async Task<SessionCloseResult> CloseAsync(
string reason,
CancellationToken cancellationToken)
@@ -726,15 +752,11 @@ public sealed class GatewaySession
{
try
{
if (_state is SessionState.Closed)
if (!TryBeginClose(out bool alreadyClosing))
{
return new SessionCloseResult(SessionId, SessionState.Closed, AlreadyClosed: true);
}
bool alreadyClosing = _closeStarted;
_closeStarted = true;
_state = SessionState.Closing;
if (_workerClient is not null)
{
try
@@ -758,7 +780,7 @@ public sealed class GatewaySession
}
}
_state = SessionState.Closed;
MarkClosed();
return new SessionCloseResult(SessionId, SessionState.Closed, alreadyClosing);
}
catch (Exception exception) when (exception is not SessionCloseStartedException)
@@ -774,6 +796,40 @@ public sealed class GatewaySession
}
}
// Returns false when the session is already Closed (caller short-circuits with
// AlreadyClosed: true). Otherwise sets _state = Closing under _syncRoot so a
// concurrent TransitionTo(Ready) — which only refuses to overwrite Closed/Faulted
// — cannot flip the session back to Ready after close started. The `alreadyClosing`
// out parameter mirrors the previous `_closeStarted` check so the surface contract
// (a second concurrent close returns AlreadyClosed: alreadyClosing) is preserved.
private bool TryBeginClose(out bool alreadyClosing)
{
lock (_syncRoot)
{
if (_state is SessionState.Closed)
{
alreadyClosing = _closeStarted;
return false;
}
alreadyClosing = _closeStarted;
_closeStarted = true;
_state = SessionState.Closing;
return true;
}
}
// Final terminal transition; under _syncRoot to keep _state writes single-lock.
// Closed is unconditionally terminal — TransitionTo refuses to overwrite it —
// so we don't need to re-check the precondition here.
private void MarkClosed()
{
lock (_syncRoot)
{
_state = SessionState.Closed;
}
}
/// <summary>
/// Terminates the worker process immediately.
/// </summary>
@@ -787,9 +843,47 @@ public sealed class GatewaySession
/// <summary>
/// Disposes the session and frees associated resources.
/// </summary>
/// <remarks>
/// Acquires <c>_closeLock</c> once before disposing so an in-flight
/// <see cref="CloseAsync"/> finishes before the semaphore is released and
/// reclaimed. Without this gate, the in-flight close's <c>_closeLock.Release()</c>
/// would race the dispose and raise <see cref="ObjectDisposedException"/>.
/// The acquire is best-effort: a non-cancellable wait that swallows
/// <see cref="ObjectDisposedException"/> so double-dispose still completes.
/// </remarks>
public async ValueTask DisposeAsync()
{
_closeLock.Dispose();
try
{
// CancellationToken.None — disposal must not be cancelled, and a misbehaving
// close path that never releases would have to be torn down by the worker
// shutdown timeout long before we reach here.
await _closeLock.WaitAsync(CancellationToken.None).ConfigureAwait(false);
try
{
// Hand the slot back so the semaphore's internal counter is consistent
// for any contemporaneous waiter, then dispose. Once disposed, every
// subsequent WaitAsync / Release will throw — but DisposeAsync's contract
// is "no concurrent close after this point", which SessionManager honors.
_closeLock.Release();
}
catch (ObjectDisposedException)
{
}
}
catch (ObjectDisposedException)
{
// Already disposed (e.g. double-dispose); nothing to gate on.
}
try
{
_closeLock.Dispose();
}
catch (ObjectDisposedException)
{
}
if (_workerClient is not null)
{
await _workerClient.DisposeAsync().ConfigureAwait(false);
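Reviewer note: the failure mode the remarks describe comes from documented `SemaphoreSlim` behaviour, where `Release()` on a disposed instance throws `ObjectDisposedException`. The standalone sketch below reproduces that sequence in isolation. `DisposedSemaphoreDemo` is a hypothetical name and nothing here is gateway code.

```csharp
using System;
using System.Threading;

public static class DisposedSemaphoreDemo
{
    // Returns true when Release() on an already-disposed SemaphoreSlim throws.
    public static bool ReleaseAfterDisposeThrows()
    {
        var gate = new SemaphoreSlim(1, 1);

        gate.Wait();     // stands in for an in-flight close holding the lock
        gate.Dispose();  // stands in for a dispose that does not wait for the holder

        try
        {
            gate.Release();  // what the in-flight close's release would attempt
            return false;
        }
        catch (ObjectDisposedException)
        {
            return true;
        }
    }
}
```

Waiting on the semaphore before disposing it, as the new `DisposeAsync` does, lets the current holder release first so this exception is not raised on the close path.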
@@ -6,18 +6,18 @@ using MxGateway.Contracts.Proto;
namespace MxGateway.Server.Sessions;
/// <summary>
/// PR A.6 / A.7 — gateway-side dispatcher for the alarm-RPC surface.
/// Bridges the public <c>AcknowledgeAlarm</c> + <c>QueryActiveAlarms</c>
/// gRPC handlers to the worker process that hosts
/// <c>IMxAccessAlarmConsumer</c>.
/// Gateway-side dispatcher seam for the alarm-RPC surface. Bridges the
/// public <c>AcknowledgeAlarm</c> + <c>QueryActiveAlarms</c> gRPC handlers
/// to the worker process that hosts <c>IMxAccessAlarmConsumer</c>.
/// </summary>
/// <remarks>
/// <para>
/// Production implementations live in <c>WorkerAlarmRpcDispatcher</c>
/// (this PR ships a not-yet-wired default that returns a clear
/// worker-pending diagnostic) and route through the existing
/// worker-pipe IPC. Tests inject a fake to exercise the gateway
/// handler shape without spinning up a worker process.
/// DI binds the production <see cref="WorkerAlarmRpcDispatcher"/> by
/// default; it routes calls through the existing worker-pipe IPC.
/// <c>NotWiredAlarmRpcDispatcher</c> is only the null fallback used
/// when no dispatcher is registered (DI omission / standalone tests).
/// Other tests inject a fake to exercise the gateway handler shape
/// without spinning up a worker process.
/// </para>
/// <para>
/// The dispatcher is session-scoped: every call resolves the
@@ -188,7 +188,14 @@ public sealed class WorkerAlarmRpcDispatcher(
if (!sessionRegistry.TryGet(request.SessionId, out GatewaySession session))
{
yield break;
// Server-019: align with AcknowledgeAsync's missing-session handling and
// surface a SessionNotFound error rather than yielding an empty stream.
// QueryActiveAlarms is server-streaming, so a thrown exception is the
// cleaner fit than an in-band ProtocolStatus; MxAccessGatewayService maps
// SessionManagerException(SessionNotFound) to gRPC NotFound.
throw new SessionManagerException(
SessionManagerErrorCode.SessionNotFound,
$"Session '{request.SessionId}' not found.");
}
WorkerCommand workerCommand = new WorkerCommand