Resolve Server-031..032 (re-triaged) + Server-038..043

Server-031: re-triaged. The recommended gateway-side
"skip-while-command-in-flight" guard is already in place at
WorkerClient.HeartbeatLoopAsync via WorkerClientOptions.HeartbeatStuckCeiling
(default 75s = 5× HeartbeatGrace). Two regression tests pin the
behaviour. Recommendation #1 (decouple worker-side _writeLock) is a
Worker-module concern (Worker-017 / Worker-023) and out of scope here.

Server-032: re-triaged. Recommendation #2 (rich diagnostic) is already
in EnqueueWorkerEventAsync, with #3 (overflow grace) absorbed by the
TryWrite → WriteAsync-with-timeout fall-through. Test
EnqueueWorkerEvent_WhenChannelFullPastTimeout_FaultsWithRichDiagnostic
pins the diagnostic string. Recommendation #1 (prose contract in
gateway.md / docs) is deferred — outside this pass's edit scope.
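
The TryWrite → WriteAsync-with-timeout fall-through mentioned above can be
sketched as follows. This is a minimal illustration, not the gateway's
EnqueueWorkerEventAsync: the BoundedEnqueue class, the overflowGrace parameter,
and the exception message are hypothetical names chosen for the example.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class BoundedEnqueue
{
    // Hypothetical helper (assumption): not the gateway's actual method.
    public static async Task EnqueueAsync<T>(
        ChannelWriter<T> writer, T item, TimeSpan overflowGrace, CancellationToken ct)
    {
        // Fast path: succeeds synchronously while the channel has capacity.
        if (writer.TryWrite(item))
        {
            return;
        }

        // Slow path: wait for space, but only for the grace window.
        using var timeout = CancellationTokenSource.CreateLinkedTokenSource(ct);
        timeout.CancelAfter(overflowGrace);
        try
        {
            await writer.WriteAsync(item, timeout.Token).ConfigureAwait(false);
        }
        catch (OperationCanceledException) when (!ct.IsCancellationRequested)
        {
            // The grace window elapsed (the caller did not cancel), so report
            // the overflow with a descriptive message.
            throw new TimeoutException(
                $"Event channel stayed full for {overflowGrace.TotalMilliseconds:F0} ms.");
        }
    }
}
```

With a bounded channel in its default wait-when-full mode, the fast path covers
the common case and the timeout only applies once the channel is full.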

Server-038 (Security): EventsHub.SubscribeSession's missing per-session
ACL is documented with a TODO(per-session-acl) and a <remarks> block
explaining why v1 accepts it: any dashboard role can subscribe to any
session, but the per-session views expose only non-secret metadata and
value logging is redacted. The per-session ACL design lands in a
follow-up once a session-scoped role exists.

Server-039 (Error handling): HubTokenService.Validate now rejects a
deserialized payload where both Name and NameIdentifier are null/empty.
New test file HubTokenServiceTests.cs covers the regression and five
sanity cases. TDD confirmed.
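
Why the Server-039 check matters can be shown with a small, self-contained
sketch. It is not the HubTokenService code: the AnonymousPrincipalDemo class
and the "HubToken" scheme string are illustrative assumptions. It relies only
on the documented ClaimsIdentity behaviour that IsAuthenticated is true
whenever the authentication type is non-empty.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Claims;

public static class AnonymousPrincipalDemo
{
    // Builds a principal that carries a role claim but no Name or
    // NameIdentifier claim. "HubToken" is an illustrative scheme name.
    public static ClaimsPrincipal BuildWithoutIdentityClaims()
    {
        List<Claim> claims = [new Claim(ClaimTypes.Role, "Admin")];
        return new ClaimsPrincipal(new ClaimsIdentity(claims, "HubToken"));
    }

    // The shape of the guard: at least one identity field must be present.
    public static bool HasCallerIdentity(string? name, string? nameIdentifier) =>
        !(string.IsNullOrEmpty(name) && string.IsNullOrEmpty(nameIdentifier));
}
```

For the principal built here, Identity.IsAuthenticated is true and
IsInRole("Admin") is true even though Identity.Name is null, which is the
situation the new check rejects before any claims are built.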

Server-040 (Conventions): MapGroupsToRoles gains a precedence comment
explaining "full literal match first, leading-RDN fallback;
OrdinalIgnoreCase via DashboardOptions.GroupToRole". Documentation-only.
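
The precedence described above can be sketched in isolation. This is an
illustration under stated assumptions: GroupLookupDemo and FirstRdnValue are
hypothetical stand-ins (the real ExtractFirstRdnValue is not shown in this
commit), and the RDN parsing here deliberately ignores escaped commas.

```csharp
using System;
using System.Collections.Generic;

public static class GroupLookupDemo
{
    // Hypothetical stand-in for ExtractFirstRdnValue (assumption): returns
    // "GwAdmin" for "ou=GwAdmin,ou=groups", or the trimmed input when there
    // is no "=" in the leading component. Escaped commas are not handled.
    public static string FirstRdnValue(string group)
    {
        int comma = group.IndexOf(',');
        string first = comma >= 0 ? group[..comma] : group;
        int eq = first.IndexOf('=');
        return eq >= 0 ? first[(eq + 1)..].Trim() : first.Trim();
    }

    // Full literal match first; leading-RDN value only as a fallback.
    public static string? Resolve(IReadOnlyDictionary<string, string> groupToRole, string group)
    {
        string normalized = group.Trim();
        if (groupToRole.TryGetValue(normalized, out string? mapped)
            || groupToRole.TryGetValue(FirstRdnValue(normalized), out mapped))
        {
            return mapped;
        }

        return null;
    }
}
```

When the dictionary is built with StringComparer.OrdinalIgnoreCase, both
"GwAdmin" and "ou=gwadmin,ou=groups" resolve through the same "GwAdmin" key,
while an entry keyed on the full DN string wins over the RDN fallback.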

Server-041 (Design adherence): EventStreamService.ProduceEventsAsync
wraps the broadcaster.Publish call in try/catch (Exception). The
producer loop and gRPC stream are no longer at the mercy of the
broadcaster's never-throw discipline. New regression test
StreamEventsAsync_WhenDashboardBroadcasterThrows_StillYieldsEventsAndDoesNotFaultSession.

Server-042 (Performance): DashboardSnapshotPublisher.ExecuteAsync now
mirrors AlarmsHubPublisher's reconnect loop: it wraps the await foreach
in a while-not-cancelled loop, catches general exceptions, and waits 5s
(Task.Delay) before retrying. An internal ctor accepts a shorter delay
for tests. New test file DashboardSnapshotPublisherTests.cs covers the
throw-then-yield reconnect path and the normal-completion case.
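
A stripped-down version of the throw-then-yield scenario, for readers who want
to run the pattern outside the gateway. Everything here is hypothetical
(ReconnectLoopDemo, Watch, RunAsync, and the maxItems stop condition are
assumptions made so the demo terminates); the production loop is the one in
the diff for DashboardSnapshotPublisher.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class ReconnectLoopDemo
{
    // Hypothetical source: the first subscription throws, later ones yield
    // a single item and then complete normally.
    public static async IAsyncEnumerable<int> Watch(
        int attempt, [EnumeratorCancellation] CancellationToken ct = default)
    {
        await Task.Yield();
        ct.ThrowIfCancellationRequested();
        if (attempt == 0)
        {
            throw new InvalidOperationException("transient failure");
        }

        yield return attempt;
    }

    // Reopens the subscription after a fault, waiting reconnectDelay first.
    // maxItems exists only so the demo stops; a hosted service would loop
    // until the host cancels the token.
    public static async Task<List<int>> RunAsync(
        TimeSpan reconnectDelay, int maxItems, CancellationToken ct)
    {
        List<int> received = [];
        int attempt = 0;
        while (!ct.IsCancellationRequested && received.Count < maxItems)
        {
            try
            {
                await foreach (int item in Watch(attempt++, ct).ConfigureAwait(false))
                {
                    received.Add(item);
                }
            }
            catch (OperationCanceledException)
            {
                break;
            }
            catch (Exception)
            {
                try
                {
                    await Task.Delay(reconnectDelay, ct).ConfigureAwait(false);
                }
                catch (OperationCanceledException)
                {
                    break;
                }
            }
        }

        return received;
    }
}
```

Passing a delay of a few milliseconds, as the internal ctor allows in the real
class, keeps a test of this path fast.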

Server-043 (Documentation): HubTokenService class XML doc gains a
<remarks> describing the singleton lifetime, the two consumer scopes
(DashboardHubConnectionFactory scoped, HubTokenAuthenticationHandler
transient), and the thread-safety contract.

Verification: dotnet build src/ZB.MOM.WW.MxGateway.slnx clean
(0 warnings / 0 errors); src/ZB.MOM.WW.MxGateway.Tests 486/486 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-24 03:18:52 -04:00
parent d2d2e5f68f
commit 327e9c5f94
9 changed files with 567 additions and 43 deletions
@@ -154,6 +154,13 @@ public sealed class DashboardAuthenticator(
         foreach (string group in groups)
         {
             string normalizedGroup = group.Trim();
+            // Lookup precedence (Server-040): the full literal group string is
+            // tried first; only if that misses do we fall back to the leading
+            // RDN value (e.g. "GwAdmin" extracted from
+            // "ou=GwAdmin,ou=groups,..."). The map's comparer is
+            // OrdinalIgnoreCase (see DashboardOptions.GroupToRole), so e.g.
+            // "GwAdmin" and "gwadmin" both match.
             if (groupToRole.TryGetValue(normalizedGroup, out string? mapped)
                 || groupToRole.TryGetValue(ExtractFirstRdnValue(normalizedGroup), out mapped))
             {
@@ -11,6 +11,18 @@ namespace ZB.MOM.WW.MxGateway.Server.Dashboard;
 /// role claims. Validity is enforced by the data-protection time-limited
 /// protector; no separate signing keys are configured.
 /// </summary>
+/// <remarks>
+/// Server-043: this service is registered as a singleton in
+/// <see cref="DashboardServiceCollectionExtensions.AddGatewayDashboard"/> and
+/// is shared by two consumer scopes: <c>DashboardHubConnectionFactory</c>
+/// (scoped, per-circuit; calls <see cref="Issue"/> from the cookie-authenticated
+/// dashboard) and <c>HubTokenAuthenticationHandler</c> (transient, per-request;
+/// calls <see cref="Validate"/> from the SignalR negotiate / connection path).
+/// The underlying <see cref="ITimeLimitedDataProtector"/> is thread-safe, so
+/// minting and validating concurrently from any number of callers is safe;
+/// future maintainers should preserve the singleton lifetime to keep the
+/// protector instance stable.
+/// </remarks>
 public sealed class HubTokenService
 {
     private const string ProtectorPurpose = "ZB.MOM.WW.MxGateway.Dashboard.HubToken.v1";
@@ -51,6 +63,16 @@ public sealed class HubTokenService
             return null;
         }
+        // Server-039: reject a token whose payload carries no caller
+        // identity. A null/empty Name AND NameIdentifier would otherwise
+        // produce a principal that satisfies IsAuthenticated and IsInRole
+        // checks without any associated user, because the AuthenticationType
+        // (the HubToken scheme) is non-empty.
+        if (string.IsNullOrEmpty(payload.Name) && string.IsNullOrEmpty(payload.NameIdentifier))
+        {
+            return null;
+        }
         List<Claim> claims = [];
         if (!string.IsNullOrEmpty(payload.Name))
         {
@@ -8,34 +8,94 @@ namespace ZB.MOM.WW.MxGateway.Server.Dashboard.Hubs;
 /// <see cref="DashboardSnapshotHub"/> client. There is one publisher per
 /// gateway process; clients listen via the hub.
 /// </summary>
-public sealed class DashboardSnapshotPublisher(
-    IDashboardSnapshotService snapshotService,
-    IHubContext<DashboardSnapshotHub> hubContext,
-    ILogger<DashboardSnapshotPublisher> logger) : BackgroundService
+/// <remarks>
+/// Server-042: <see cref="ExecuteAsync"/> wraps the snapshot subscription in
+/// a reconnect loop with a configurable retry delay (5s by default,
+/// mirroring <see cref="AlarmsHubPublisher"/>). A transient failure inside
+/// <see cref="IDashboardSnapshotService.WatchSnapshotsAsync"/> — e.g. a
+/// one-time logger-init failure or a transient SQL error from the Galaxy
+/// summary projection — would otherwise end the BackgroundService with no
+/// reconnect, taking the dashboard offline until process restart.
+/// </remarks>
+public sealed class DashboardSnapshotPublisher : BackgroundService
 {
+    private static readonly TimeSpan DefaultReconnectDelay = TimeSpan.FromSeconds(5);
+    private readonly IDashboardSnapshotService _snapshotService;
+    private readonly IHubContext<DashboardSnapshotHub> _hubContext;
+    private readonly ILogger<DashboardSnapshotPublisher> _logger;
+    private readonly TimeSpan _reconnectDelay;
+    public DashboardSnapshotPublisher(
+        IDashboardSnapshotService snapshotService,
+        IHubContext<DashboardSnapshotHub> hubContext,
+        ILogger<DashboardSnapshotPublisher> logger)
+        : this(snapshotService, hubContext, logger, DefaultReconnectDelay)
+    {
+    }
+    // Internal hook for the Server-042 regression test: tests inject a
+    // very short reconnect delay so the assertion doesn't wait the full
+    // 5s. Production wiring always uses the 5s default via the public ctor.
+    internal DashboardSnapshotPublisher(
+        IDashboardSnapshotService snapshotService,
+        IHubContext<DashboardSnapshotHub> hubContext,
+        ILogger<DashboardSnapshotPublisher> logger,
+        TimeSpan reconnectDelay)
+    {
+        _snapshotService = snapshotService;
+        _hubContext = hubContext;
+        _logger = logger;
+        _reconnectDelay = reconnectDelay;
+    }
     protected override async Task ExecuteAsync(CancellationToken stoppingToken)
     {
-        try
+        // Loop forever — when WatchSnapshotsAsync completes or throws, reopen
+        // the subscription after a short delay. The hosted-service lifetime
+        // ends only when the host stops. Mirrors AlarmsHubPublisher.
+        while (!stoppingToken.IsCancellationRequested)
         {
-            await foreach (DashboardSnapshot snapshot in snapshotService
-                .WatchSnapshotsAsync(stoppingToken)
-                .ConfigureAwait(false))
+            try
+            {
+                await foreach (DashboardSnapshot snapshot in _snapshotService
+                    .WatchSnapshotsAsync(stoppingToken)
+                    .ConfigureAwait(false))
+                {
+                    if (stoppingToken.IsCancellationRequested)
+                    {
+                        break;
+                    }
+                    try
+                    {
+                        await _hubContext.Clients
+                            .All
+                            .SendAsync(DashboardSnapshotHub.SnapshotMessage, snapshot, stoppingToken)
+                            .ConfigureAwait(false);
+                    }
+                    catch (Exception ex) when (ex is not OperationCanceledException)
+                    {
+                        _logger.LogWarning(ex, "Snapshot broadcast failed; will retry on the next snapshot tick.");
+                    }
+                }
+            }
+            catch (OperationCanceledException)
+            {
+                return;
+            }
+            catch (Exception ex)
             {
+                _logger.LogWarning(ex, "Snapshot subscription faulted; reconnecting in {DelaySeconds:F1}s.", _reconnectDelay.TotalSeconds);
                 try
                 {
-                    await hubContext.Clients
-                        .All
-                        .SendAsync(DashboardSnapshotHub.SnapshotMessage, snapshot, stoppingToken)
-                        .ConfigureAwait(false);
+                    await Task.Delay(_reconnectDelay, stoppingToken).ConfigureAwait(false);
                 }
-                catch (Exception ex) when (ex is not OperationCanceledException)
+                catch (OperationCanceledException)
                 {
-                    logger.LogWarning(ex, "Snapshot broadcast failed; will retry on the next snapshot tick.");
+                    return;
                 }
             }
         }
-        catch (OperationCanceledException)
-        {
-        }
     }
 }
@@ -23,6 +23,29 @@ public sealed class EventsHub : Hub
     public static string GroupName(string sessionId) => $"session:{sessionId}";
+    /// <summary>
+    /// Subscribes the calling SignalR connection to the per-session events
+    /// group, so that events broadcast by
+    /// <see cref="DashboardEventBroadcaster"/> for that session reach this
+    /// client.
+    /// </summary>
+    /// <remarks>
+    /// Server-038: in v1 the hub-level <see cref="AuthorizeAttribute"/>
+    /// (<c>HubClientsPolicy</c>) only checks that the caller carries one of
+    /// the dashboard roles (Admin or Viewer); both roles may subscribe to
+    /// any session id they choose. This is acceptable today because (a) the
+    /// dashboard's per-session views show non-secret session metadata that
+    /// any authenticated dashboard user can already see, and (b) value
+    /// logging in the source gRPC stream is gated by the same redaction
+    /// policy that protects logs. The per-session ACL that gates the gRPC
+    /// <c>StreamEvents</c> RPC is intentionally not yet mirrored here.
+    /// TODO(per-session-acl): once a role/scope is introduced that scopes a
+    /// Viewer to a specific session or tenant, add a session-access check
+    /// at this seam — either inline (consult the per-user allowed-session
+    /// set on <c>Context.User</c> claims / <c>Context.Items</c>) or via a
+    /// dedicated authorization policy applied to the hub method itself.
+    /// </remarks>
+    /// <param name="sessionId">Session id to subscribe the caller to.</param>
     public Task SubscribeSession(string sessionId)
     {
         if (string.IsNullOrWhiteSpace(sessionId))
@@ -122,8 +122,23 @@ public sealed class EventStreamService(
                 // Mirror the event to the dashboard EventsHub group for this
                 // session. Fire-and-forget — broadcast errors must not affect
-                // the source gRPC stream.
-                dashboardEventBroadcaster.Publish(session.SessionId, publicEvent);
+                // the source gRPC stream. Server-041: the
+                // IDashboardEventBroadcaster contract documents Publish as
+                // never-throw, but we enforce that at the seam too, so a
+                // future implementation that adds synchronous validation or
+                // a serializer hop cannot fault the producer loop and end
+                // this client's gRPC stream.
+                try
+                {
+                    dashboardEventBroadcaster.Publish(session.SessionId, publicEvent);
+                }
+                catch (Exception ex)
+                {
+                    logger.LogDebug(
+                        ex,
+                        "Dashboard event mirror threw for session {SessionId}; continuing.",
+                        session.SessionId);
+                }
                 if (!writer.TryWrite(publicEvent))
                 {