fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings

Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the
  captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
  threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId retry
  counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical
  abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
  SPLIT loop: by class-doc construction the catch could only mask real
  failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks via
  OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep with
  a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its own
  try/catch so a throw doesn't leak the relay actor or _activeStreams entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
  (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits stderr one-shot warning for unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
  per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects empty
  LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit
  DisableTimedOut/EnableTimedOut/DeleteTimedOut audit entries (mirrors
  DeployFailed pattern).

Plus reconciled stale per-module Open-findings counters that had drifted from
prior sessions. 20+ new regression tests across 11 test projects; build
clean; affected suites all green. README regenerated: 75 open (was 93).
@@ -96,6 +96,27 @@ public class SiteAuditReconciliationActor : ReceiveActor
/// </summary>
private readonly Dictionary<string, bool> _stalled = new();

/// <summary>
/// AuditLog-004: per-EventId retry counter for rows whose central insert
/// threw. While a row keeps failing AND is below
/// <see cref="MaxPermanentInsertAttempts"/>, the cursor is held back so the
/// next reconciliation tick re-pulls and retries the row. Crossing the
/// threshold logs Critical and permanently abandons the row (cursor
/// advances past it) so a truly broken row cannot block all subsequent
/// progress for a site. The counter is in-memory only: a singleton restart
/// resets it, which is safe because the cursor also resets on restart and
/// the next tick re-pulls everything.
/// </summary>
private readonly Dictionary<Guid, int> _failedInsertAttempts = new();

/// <summary>
/// AuditLog-004: number of consecutive central-insert failures before a row
/// is permanently abandoned with a Critical log entry and the cursor is
/// allowed to advance past it. Five attempts at the 5-minute default tick
/// is ~25 min of retry budget before a stuck row stops blocking progress.
/// </summary>
private const int MaxPermanentInsertAttempts = 5;

private ICancelable? _timer;

/// <summary>
@@ -232,9 +253,11 @@ public class SiteAuditReconciliationActor : ReceiveActor
.ConfigureAwait(false);

var maxOccurred = since;
var hasUnresolvedFailure = false;
var nowUtc = DateTime.UtcNow;
foreach (var evt in response.Events)
{
var advanceForThisRow = false;
try
{
// Idempotent repository write: duplicate EventIds (from a
@@ -243,29 +266,58 @@ public class SiteAuditReconciliationActor : ReceiveActor
// InsertIfNotExistsAsync.
var ingested = evt with { IngestedAtUtc = nowUtc };
await repository.InsertIfNotExistsAsync(ingested).ConfigureAwait(false);
_failedInsertAttempts.Remove(evt.EventId);
advanceForThisRow = true;
}
catch (Exception ex)
{
// Per-row catch so one bad event does not abandon the rest of
// the batch. The cursor still advances based on OccurredAtUtc
// (the row was returned by the site, so the next tick won't
// re-fetch it); if it permanently fails to persist, that's an
// operational concern surfaced by the log, not a hot-loop
// trigger.
_logger.LogError(
ex,
"Reconciliation ingest failed for AuditEvent {EventId} from site {SiteId}.",
evt.EventId,
site.SiteId);
// AuditLog-004: per-row catch so one bad event does not abandon
// the rest of the batch. Track the failure count per EventId:
// below MaxPermanentInsertAttempts the cursor is HELD BACK so
// the next tick re-pulls and retries; at the threshold the row
// is permanently abandoned (LogCritical + cursor advances past)
// to keep a truly broken row from blocking all subsequent
// progress for the site.
var attempts = _failedInsertAttempts.GetValueOrDefault(evt.EventId) + 1;
_failedInsertAttempts[evt.EventId] = attempts;

if (attempts >= MaxPermanentInsertAttempts)
{
_logger.LogCritical(
ex,
"Permanently abandoning AuditEvent {EventId} from site {SiteId} after {Attempts} consecutive insert failures; cursor will advance past it.",
evt.EventId,
site.SiteId,
attempts);
_failedInsertAttempts.Remove(evt.EventId);
advanceForThisRow = true;
}
else
{
_logger.LogError(
ex,
"Reconciliation ingest failed for AuditEvent {EventId} from site {SiteId} (attempt {Attempts}/{Max}); cursor held back for retry.",
evt.EventId,
site.SiteId,
attempts,
MaxPermanentInsertAttempts);
hasUnresolvedFailure = true;
}
}

if (evt.OccurredAtUtc > maxOccurred)
if (advanceForThisRow && evt.OccurredAtUtc > maxOccurred)
{
maxOccurred = evt.OccurredAtUtc;
}
}

_cursors[site.SiteId] = maxOccurred;
// AuditLog-004: only advance the persisted cursor if no event in this
// batch is still being retried. Leaving the cursor at `since` re-pulls
// the whole batch next tick: successful rows are no-ops thanks to
// InsertIfNotExistsAsync's idempotency, and the failing row gets
// another attempt. Once it succeeds (or hits the permanent-abandon
// threshold) the cursor unblocks naturally.
_cursors[site.SiteId] = hasUnresolvedFailure ? since : maxOccurred;

var nonDraining = response.MoreAvailable && response.Events.Count > 0;
UpdateStalledState(site.SiteId, draining: !nonDraining, eventStream);
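
The cursor rule in this hunk can be read in isolation. The block below is a minimal sketch, not the actor's code: `CursorSketch`, `ProcessBatch`, and the `tryInsert` delegate are hypothetical names, and the repository insert is modelled as a synchronous call that returns success or failure instead of throwing.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the actor's per-site reconciliation state.
public sealed class CursorSketch
{
    // Mirrors MaxPermanentInsertAttempts from the diff.
    private const int MaxAttempts = 5;

    // Per-event failure counter; in-memory only, as in the diff.
    private readonly Dictionary<Guid, int> _failed = new();

    // Processes one batch and returns the cursor value to store for the site.
    public DateTime ProcessBatch(
        DateTime since,
        IEnumerable<(Guid Id, DateTime OccurredAtUtc)> events,
        Func<Guid, bool> tryInsert)
    {
        var maxOccurred = since;
        var hasUnresolvedFailure = false;

        foreach (var evt in events)
        {
            bool advance;
            if (tryInsert(evt.Id))
            {
                _failed.Remove(evt.Id);
                advance = true;
            }
            else
            {
                var attempts = _failed.GetValueOrDefault(evt.Id) + 1;
                if (attempts >= MaxAttempts)
                {
                    // Permanently abandon: stop counting, let the cursor pass.
                    _failed.Remove(evt.Id);
                    advance = true;
                }
                else
                {
                    _failed[evt.Id] = attempts;
                    hasUnresolvedFailure = true;
                    advance = false;
                }
            }

            if (advance && evt.OccurredAtUtc > maxOccurred)
            {
                maxOccurred = evt.OccurredAtUtc;
            }
        }

        // Any row still being retried pins the cursor at `since`, so the
        // whole batch is re-pulled on the next tick.
        return hasUnresolvedFailure ? since : maxOccurred;
    }
}
```

Under these assumptions, a batch containing one permanently failing row keeps the cursor at `since` for four ticks and releases it on the fifth, when the row is abandoned.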
@@ -693,10 +693,26 @@ public class SqliteAuditWriter : IAuditWriter, ISiteAuditQueue, IAsyncDisposable
};
}

/// <summary>Disposes the audit writer and releases resources.</summary>
/// <summary>
/// Disposes the audit writer and releases resources.
/// </summary>
/// <remarks>
/// AuditLog-006: prefer <see cref="DisposeAsync"/> when possible (DI honours
/// <see cref="IAsyncDisposable"/> on singletons). The sync path remains for
/// callers that only know about <see cref="IDisposable"/> (e.g. legacy
/// composition roots, <c>using</c> statements without <c>await</c>). To
/// avoid the classic sync-over-async deadlock on a captured
/// <see cref="SynchronizationContext"/> (ASP.NET request thread, Akka
/// dispatcher under some configurations), we hop to the thread pool via
/// <see cref="Task.Run(Func{Task})"/> before blocking on the result. The
/// async continuation inside <see cref="DisposeAsync"/> then resumes on a
/// pool thread with no captured context, so <c>GetResult()</c> never waits
/// on the very thread the continuation needs.
/// </remarks>
public void Dispose()
{
DisposeAsync().AsTask().GetAwaiter().GetResult();
Task.Run(async () => await DisposeAsync().ConfigureAwait(false))
.GetAwaiter().GetResult();
}

/// <summary>Asynchronously disposes the audit writer and releases resources.</summary>
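
The thread-pool hop described in the remarks can be shown on a toy type. This is a sketch only: `AsyncResourceSketch` and its `Disposed` flag are hypothetical, and `Task.Yield()` stands in for whatever asynchronous cleanup the real writer performs.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical resource that supports both disposal interfaces.
public sealed class AsyncResourceSketch : IDisposable, IAsyncDisposable
{
    public bool Disposed { get; private set; }

    public async ValueTask DisposeAsync()
    {
        // Placeholder for real async cleanup (flush, close, ...).
        await Task.Yield();
        Disposed = true;
    }

    public void Dispose()
    {
        // Start DisposeAsync on a pool thread, where no SynchronizationContext
        // is captured, and only then block the calling thread on the result.
        Task.Run(async () => await DisposeAsync().ConfigureAwait(false))
            .GetAwaiter().GetResult();
    }
}
```

The calling thread still blocks until cleanup finishes; the hop only ensures the continuation does not need that blocked thread to complete.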
@@ -42,6 +42,12 @@ public class SiteAuditTelemetryActor : ReceiveActor
private readonly SiteAuditTelemetryOptions _options;
private readonly ILogger<SiteAuditTelemetryActor> _logger;
private ICancelable? _pendingTick;
// AuditLog-010: per-actor lifecycle CTS so an in-flight drain (queue read,
// gRPC push, mark-forwarded write) is actually cancelled when the actor is
// stopped; without it, a stuck IngestAuditEventsAsync would hold the
// continuation through CoordinatedShutdown's actor-system terminate window.
// Cancelled in PostStop; never reset (the actor is single-lifetime).
private readonly CancellationTokenSource _lifecycleCts = new();

/// <summary>Initializes the actor with its drain queue, gRPC client, options, and logger.</summary>
/// <param name="queue">The site-local SQLite audit queue to drain.</param>
@@ -81,15 +87,32 @@ public class SiteAuditTelemetryActor : ReceiveActor
protected override void PostStop()
{
_pendingTick?.Cancel();
// AuditLog-010: cancel any in-flight drain so a stuck queue read or
// gRPC push does not hold the continuation past actor stop.
try
{
_lifecycleCts.Cancel();
}
catch (ObjectDisposedException)
{
// PostStop may run after a prior Dispose path; benign.
}
_lifecycleCts.Dispose();
base.PostStop();
}

private async Task OnDrainAsync()
{
var nextDelay = TimeSpan.FromSeconds(_options.BusyIntervalSeconds);
// AuditLog-010: route every async dependency call through the
// per-actor lifecycle token so PostStop cancellation actually
// propagates into the queue read, the gRPC push, and the
// mark-forwarded write. OperationCanceledException is swallowed by
// the catch-all below.
var ct = _lifecycleCts.Token;
try
{
var pending = await _queue.ReadPendingAsync(_options.BatchSize, CancellationToken.None)
var pending = await _queue.ReadPendingAsync(_options.BatchSize, ct)
.ConfigureAwait(false);
if (pending.Count == 0)
{
@@ -104,7 +127,7 @@ public class SiteAuditTelemetryActor : ReceiveActor
IngestAck ack;
try
{
ack = await _client.IngestAuditEventsAsync(batch, CancellationToken.None)
ack = await _client.IngestAuditEventsAsync(batch, ct)
.ConfigureAwait(false);
}
catch (Exception ex)
@@ -121,7 +144,7 @@ public class SiteAuditTelemetryActor : ReceiveActor
var acceptedIds = ParseAcceptedIds(ack);
if (acceptedIds.Count > 0)
{
await _queue.MarkForwardedAsync(acceptedIds, CancellationToken.None)
await _queue.MarkForwardedAsync(acceptedIds, ct)
.ConfigureAwait(false);
}
}
@@ -133,7 +156,13 @@ public class SiteAuditTelemetryActor : ReceiveActor
}
finally
{
ScheduleNext(nextDelay);
// AuditLog-010: if the actor is already shutting down, do not
// arm another tick: the scheduler would fire after PostStop and
// the message would land in dead letters.
if (!_lifecycleCts.IsCancellationRequested)
{
ScheduleNext(nextDelay);
}
}
}
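
The lifecycle-token pattern used across these hunks can be sketched without Akka. Everything here is illustrative: `DrainLoopSketch`, `DrainOnceAsync`, `Stop`, and the `TicksScheduled` counter are hypothetical stand-ins for the actor, `OnDrainAsync`, `PostStop`, and `ScheduleNext`. The sketch only cancels the source and leaves disposal out.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical, actor-free model of a drain loop with a lifecycle token.
public sealed class DrainLoopSketch
{
    private readonly CancellationTokenSource _lifecycleCts = new();

    // Counts how many follow-up ticks would have been armed.
    public int TicksScheduled { get; private set; }

    public async Task DrainOnceAsync(Func<CancellationToken, Task> work)
    {
        // Every dependency call receives the lifecycle token, never
        // CancellationToken.None.
        var ct = _lifecycleCts.Token;
        try
        {
            await work(ct).ConfigureAwait(false);
        }
        catch (OperationCanceledException)
        {
            // Expected when Stop() cancels an in-flight drain.
        }
        finally
        {
            // Do not arm another tick once shutdown has begun.
            if (!_lifecycleCts.IsCancellationRequested)
            {
                TicksScheduled++;
            }
        }
    }

    // Plays the role of PostStop: cancel whatever is in flight.
    public void Stop() => _lifecycleCts.Cancel();
}
```

A drain that is blocked on `Task.Delay(Timeout.Infinite, ct)` completes promptly once `Stop()` is called, and no further tick is counted.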
@@ -40,15 +40,31 @@ public class CliConfig
".scadalink", "config.json");
if (File.Exists(configPath))
{
var json = File.ReadAllText(configPath);
var fileConfig = JsonSerializer.Deserialize<CliConfigFile>(json,
new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
if (fileConfig != null)
// CLI-021: a malformed (`JsonException`), unreadable
// (`UnauthorizedAccessException`), or otherwise faulted
// (`IOException`) config file must not crash the CLI before any
// command runs; even a command that supplies everything via
// --url/--username/--password/--format on the command line still
// calls Load() and would otherwise inherit the fault. Warn once on
// stderr and fall through to the env-var + command-line precedence
// chain with default settings.
try
{
if (!string.IsNullOrEmpty(fileConfig.ManagementUrl))
config.ManagementUrl = fileConfig.ManagementUrl;
if (!string.IsNullOrEmpty(fileConfig.DefaultFormat))
config.DefaultFormat = fileConfig.DefaultFormat;
var json = File.ReadAllText(configPath);
var fileConfig = JsonSerializer.Deserialize<CliConfigFile>(json,
new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
if (fileConfig != null)
{
if (!string.IsNullOrEmpty(fileConfig.ManagementUrl))
config.ManagementUrl = fileConfig.ManagementUrl;
if (!string.IsNullOrEmpty(fileConfig.DefaultFormat))
config.DefaultFormat = fileConfig.DefaultFormat;
}
}
catch (Exception ex) when (ex is JsonException || ex is IOException || ex is UnauthorizedAccessException)
{
Console.Error.WriteLine(
$"warning: ignoring malformed or unreadable {configPath}: {ex.Message}");
}
}
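
The warn-and-fall-back behaviour can be tried on its own. This is a minimal sketch under stated assumptions: `ConfigSketch` is a hypothetical single-property type, and `Load` takes the path as a parameter, whereas the real `CliConfig.Load` resolves it from the user's home directory.

```csharp
using System;
using System.IO;
using System.Text.Json;

// Hypothetical, cut-down config type with a single setting.
public sealed class ConfigSketch
{
    public string? ManagementUrl { get; set; }

    // Reads the file if present; on a malformed or unreadable file, warns on
    // stderr and returns defaults instead of throwing.
    public static ConfigSketch Load(string path)
    {
        var config = new ConfigSketch();
        if (!File.Exists(path))
        {
            return config;
        }

        try
        {
            var json = File.ReadAllText(path);
            var file = JsonSerializer.Deserialize<ConfigSketch>(json,
                new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
            if (file is not null && !string.IsNullOrEmpty(file.ManagementUrl))
            {
                config.ManagementUrl = file.ManagementUrl;
            }
        }
        catch (Exception ex) when (ex is JsonException or IOException or UnauthorizedAccessException)
        {
            Console.Error.WriteLine(
                $"warning: ignoring malformed or unreadable {path}: {ex.Message}");
        }

        return config;
    }
}
```

The exception filter is deliberately narrow: anything other than these three exception types still propagates, so unrelated bugs are not hidden behind the warning.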
@@ -112,9 +112,11 @@ public static class BundleCommands
Passphrase: passphrase,
SourceEnvironment: sourceEnv);

return await RunBundleCommandAsync(
result, urlOption, usernameOption, passwordOption,
payload, jsonOk =>
return await CommandHelpers.ExecuteCommandAsync(
result, urlOption, formatOption, usernameOption, passwordOption,
payload,
timeout: BundleCommandTimeout,
onSuccess: jsonOk =>
{
using var doc = JsonDocument.Parse(jsonOk);
var base64 = doc.RootElement.GetProperty("base64Bundle").GetString()!;
@@ -165,9 +167,11 @@ public static class BundleCommands
Base64Bundle: Convert.ToBase64String(bytes),
Passphrase: result.GetValue(passphraseOption));

return await RunBundleCommandAsync(
result, urlOption, usernameOption, passwordOption,
payload, jsonOk =>
return await CommandHelpers.ExecuteCommandAsync(
result, urlOption, formatOption, usernameOption, passwordOption,
payload,
timeout: BundleCommandTimeout,
onSuccess: jsonOk =>
{
Console.WriteLine(jsonOk);
return 0;
@@ -220,9 +224,11 @@ public static class BundleCommands
Passphrase: result.GetValue(passphraseOption),
DefaultConflictPolicy: result.GetValue(onConflictOption)!);

return await RunBundleCommandAsync(
result, urlOption, usernameOption, passwordOption,
payload, jsonOk =>
return await CommandHelpers.ExecuteCommandAsync(
result, urlOption, formatOption, usernameOption, passwordOption,
payload,
timeout: BundleCommandTimeout,
onSuccess: jsonOk =>
{
Console.WriteLine(jsonOk);
return 0;
@@ -234,59 +240,15 @@ public static class BundleCommands
// ====================================================================
// Shared HTTP plumbing
// ====================================================================

/// <summary>
/// Same shape as <see cref="CommandHelpers.ExecuteCommandAsync"/> but with
/// a longer per-command timeout (bundles are big) and a caller-supplied
/// success handler so export can capture the base64 payload into a file
/// rather than print the whole envelope to stdout.
/// </summary>
private static async Task<int> RunBundleCommandAsync(
ParseResult result,
Option<string> urlOption,
Option<string> usernameOption,
Option<string> passwordOption,
object payload,
Func<string, int> onSuccess)
{
var config = CliConfig.Load();
var url = result.GetValue(urlOption);
if (string.IsNullOrWhiteSpace(url)) url = config.ManagementUrl;
if (string.IsNullOrWhiteSpace(url))
{
OutputFormatter.WriteError(
"No management URL specified. Use --url or set 'managementUrl' in ~/.scadalink/config.json.",
"NO_URL");
return 1;
}
if (!CommandHelpers.IsValidManagementUrl(url))
{
OutputFormatter.WriteError(
$"Invalid management URL '{url}'.", "INVALID_URL");
return 1;
}
var username = CommandHelpers.ResolveCredential(result.GetValue(usernameOption), config.Username);
var password = CommandHelpers.ResolveCredential(result.GetValue(passwordOption), config.Password);
if (string.IsNullOrWhiteSpace(username) || string.IsNullOrWhiteSpace(password))
{
OutputFormatter.WriteError(
"Credentials required. Use --username/--password or SCADALINK_USERNAME/SCADALINK_PASSWORD.",
"NO_CREDENTIALS");
return 1;
}

var commandName = ManagementCommandRegistry.GetCommandName(payload.GetType());
using var client = new ManagementHttpClient(url, username, password);
var response = await client.SendCommandAsync(commandName, payload, BundleCommandTimeout);

if (response.JsonData is not null)
{
return onSuccess(response.JsonData);
}
OutputFormatter.WriteError(response.Error ?? "Unknown error", response.ErrorCode ?? "ERROR");
if (response.StatusCode == 403) return 2;
return 1;
}
//
// CLI-017: bundle commands previously routed through a private
// RunBundleCommandAsync that re-implemented URL/credential resolution and
// skipped the IsAuthorizationFailure(...) check that ExecuteCommandAsync
// enforces: a server that signalled FORBIDDEN/UNAUTHORIZED via the error
// code on a non-403 status would exit 1 instead of the documented exit 2.
// The bundle path now delegates to CommandHelpers.ExecuteCommandAsync with
// the longer BundleCommandTimeout and a per-command success handler, so the
// exit-code contract is unified across every command group.

private static Option<IReadOnlyList<string>?> NameListOption(string name, string description)
{
@@ -17,13 +17,28 @@ internal static class CommandHelpers
/// <param name="usernameOption">Option that supplies the username override.</param>
/// <param name="passwordOption">Option that supplies the password override.</param>
/// <param name="command">The management command object to send.</param>
/// <param name="timeout">
/// Optional per-command HTTP timeout. Defaults to 30s, matching the management API's
/// own request timeout. Larger payloads (e.g. Transport bundles) should supply a
/// longer value.
/// </param>
/// <param name="onSuccess">
/// Optional success handler. When supplied, the helper invokes it with the success
/// body instead of running the default <see cref="HandleResponse"/> rendering path;
/// useful when the caller needs to capture the response (e.g. write a file) rather
/// than print it. The authorization-failure exit-code contract
/// (<see cref="IsAuthorizationFailure"/>) is preserved on the error path either way,
/// closing CLI-017's regression.
/// </param>
internal static async Task<int> ExecuteCommandAsync(
ParseResult result,
Option<string> urlOption,
Option<string> formatOption,
Option<string> usernameOption,
Option<string> passwordOption,
object command)
object command,
TimeSpan? timeout = null,
Func<string, int>? onSuccess = null)
{
var config = CliConfig.Load();
var format = ResolveFormat(result, formatOption, config);
@@ -67,7 +82,20 @@ internal static class CommandHelpers

// Send via HTTP
using var client = new ManagementHttpClient(url, username, password);
var response = await client.SendCommandAsync(commandName, command, TimeSpan.FromSeconds(30));
var response = await client.SendCommandAsync(commandName, command, timeout ?? TimeSpan.FromSeconds(30));

// Caller-supplied success handler short-circuits the default rendering, but
// the error path still routes through IsAuthorizationFailure so the documented
// exit-2 contract holds whether or not a custom handler is provided
// (CLI-017 unification of the bundle path).
if (onSuccess is not null)
{
if (response.JsonData is not null)
return onSuccess(response.JsonData);

OutputFormatter.WriteError(response.Error ?? "Unknown error", response.ErrorCode ?? "ERROR");
return IsAuthorizationFailure(response) ? 2 : 1;
}

return HandleResponse(response, format);
}
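
The exit-code contract that CLI-017 unifies can be illustrated as follows. This is a sketch built on assumptions: the diff does not show the body of `IsAuthorizationFailure` or `HandleResponse`, so the rule below (HTTP 403, or an error code of FORBIDDEN/UNAUTHORIZED on any status) is inferred from the CLI-017 comment, the default success path is assumed to exit 0, and `ResponseSketch` and `ExitCodeSketch` are hypothetical names.

```csharp
using System;

// Hypothetical, minimal view of a management API response.
public sealed record ResponseSketch(int StatusCode, string? JsonData, string? ErrorCode);

public static class ExitCodeSketch
{
    // Assumed rule: authorization failures are recognised by status code
    // OR by error code, so a FORBIDDEN code on a non-403 status still counts.
    public static bool IsAuthorizationFailure(ResponseSketch response) =>
        response.StatusCode == 403
        || string.Equals(response.ErrorCode, "FORBIDDEN", StringComparison.OrdinalIgnoreCase)
        || string.Equals(response.ErrorCode, "UNAUTHORIZED", StringComparison.OrdinalIgnoreCase);

    // Success goes to the caller's handler when one is supplied; the error
    // path is shared, so exit 2 is returned with or without a handler.
    public static int Resolve(ResponseSketch response, Func<string, int>? onSuccess = null)
    {
        if (response.JsonData is not null)
        {
            return onSuccess is not null ? onSuccess(response.JsonData) : 0;
        }

        return IsAuthorizationFailure(response) ? 2 : 1;
    }
}
```

Under this rule, the removed `RunBundleCommandAsync` check (`StatusCode == 403` only) would have returned 1 for a FORBIDDEN error code delivered on a different status, which is the mismatch the comment describes.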
@@ -75,6 +75,14 @@ public class CentralCommunicationActor : ReceiveActor

private ICancelable? _refreshSchedule;

/// <summary>
/// Communication-019: per-actor lifecycle CTS threaded into the periodic
/// <see cref="LoadSiteAddressesFromDb"/> repository call so a hung MS SQL
/// connection is bounded by actor shutdown rather than holding piped tasks
/// open indefinitely. Cancelled in <see cref="PostStop"/>; never reset.
/// </summary>
private readonly CancellationTokenSource _lifecycleCts = new();

/// <summary>
/// Proxy <see cref="IActorRef"/> for the central NotificationOutboxActor cluster singleton.
/// Set via <see cref="RegisterNotificationOutbox"/>; the Host creates the singleton proxy
@@ -358,11 +366,26 @@ public class CentralCommunicationActor : ReceiveActor
private void LoadSiteAddressesFromDb()
{
var self = Self;
// Communication-019: pass the actor's lifecycle CT into the repository
// call so a hung database query is cancelled when the actor stops
// rather than leaving the piped task to accumulate. Captured locally
// because the lifecycle CTS may have been disposed by PostStop on a
// racing late tick; treat that as "actor gone, give up".
CancellationToken ct;
try
{
ct = _lifecycleCts.Token;
}
catch (ObjectDisposedException)
{
return;
}

Task.Run(async () =>
{
using var scope = _serviceProvider.CreateScope();
var repo = scope.ServiceProvider.GetRequiredService<ISiteRepository>();
var sites = await repo.GetAllSitesAsync();
var sites = await repo.GetAllSitesAsync(ct).ConfigureAwait(false);

var contacts = new Dictionary<string, List<string>>();
foreach (var site in sites)
@@ -495,6 +518,17 @@ public class CentralCommunicationActor : ReceiveActor
{
_log.Info("CentralCommunicationActor stopped");
_refreshSchedule?.Cancel();
// Communication-019: cancel any in-flight LoadSiteAddressesFromDb so a
// hung MS SQL query does not outlive the actor.
try
{
_lifecycleCts.Cancel();
}
catch (ObjectDisposedException)
{
// Double-stop is benign.
}
_lifecycleCts.Dispose();
}
}

@@ -235,7 +235,30 @@ public class SiteStreamGrpcServer : SiteStreamService.SiteStreamServiceBase
Props.Create(typeof(Actors.StreamRelayActor), request.CorrelationId, channel.Writer),
$"stream-relay-{request.CorrelationId}-{actorSeq}");

var subscriptionId = _streamSubscriber.Subscribe(request.InstanceUniqueName, relayActor);
// Communication-021: the previous code called _streamSubscriber.Subscribe
// OUTSIDE the try block that owns relay-actor cleanup. If Subscribe threw
// (stale instance name, index lookup fault, site runtime shutting down),
// the freshly-created relay actor, the _activeStreams entry, the
// StreamEntry.Cts, and the Channel<SiteStreamEvent> all leaked because the
// finally never ran. Wrap Subscribe in its own try so any throw deterministically
// stops the relay actor, removes the activeStreams entry, and completes the
// channel before the RpcException escapes to the caller.
string subscriptionId;
try
{
subscriptionId = _streamSubscriber.Subscribe(request.InstanceUniqueName, relayActor);
}
catch (Exception ex)
{
_logger.LogWarning(ex,
"Subscribe failed for {Instance} (correlation {CorrelationId}); cleaning up relay actor.",
request.InstanceUniqueName, request.CorrelationId);
_actorSystem!.Stop(relayActor);
channel.Writer.TryComplete();
_activeStreams.TryRemove(
new KeyValuePair<string, StreamEntry>(request.CorrelationId, entry));
throw;
}

_logger.LogInformation(
"Stream {CorrelationId} started for {Instance} (subscription {SubscriptionId})",

@@ -1,5 +1,4 @@
using System.Globalization;
using Microsoft.Data.SqlClient;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;
@@ -178,22 +177,20 @@ WHERE pf.name = 'pf_AuditLog_Month';";
ALTER PARTITION SCHEME {PartitionSchemeName} NEXT USED [{TargetFileGroup}];
ALTER PARTITION FUNCTION {PartitionFunctionName}() SPLIT RANGE ('{literal}');";

try
{
await _context.Database.ExecuteSqlRawAsync(sql, ct).ConfigureAwait(false);
added.Add(next);
}
catch (SqlException ex)
{
// Belt-and-braces: even though we read max-boundary first, an
// ALTER from another process could have raced us. Logging at
// Warning rather than Error because the desired end state
// (boundary present) is satisfied by either path.
_logger.LogWarning(
ex,
"EnsureLookaheadAsync: SPLIT RANGE for boundary {Boundary:o} failed; continuing.",
next);
}
// ConfigDB-019: the loop pre-reads max-boundary and only issues
// SPLITs for strictly-greater months, so msg 7708/7711 ("boundary
// already exists") cannot happen by construction. Any OTHER
// SqlException (permission revoked on the role, deadlock victim,
// log full, filegroup full, transient connection drop) means the
// boundary genuinely failed to create. The previous catch-and-
// continue silently moved on to the next month, splitting month
// N+1 successfully and leaving a permanent partition hole for
// month N that blocks partition-switch purge until an operator
// notices and rebuilds. Let SqlException propagate so the daily
// hosted-service tick logs an Error and the next tick retries
// from the same boundary (at-least-once, no holes).
await _context.Database.ExecuteSqlRawAsync(sql, ct).ConfigureAwait(false);
added.Add(next);

next = NextMonthBoundary(next);
}
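
The "no holes" argument can be checked with a small model of the loop. This is a sketch, not the repository code: `LookaheadSketch`, `EnsureLookahead`, and the `split` delegate are hypothetical, the SQL call is replaced by a callback that throws on failure, and `AddMonths(1)` stands in for `NextMonthBoundary`.

```csharp
using System;
using System.Collections.Generic;

public static class LookaheadSketch
{
    // Issues one split per month after maxBoundary. A failing split is NOT
    // caught, so the loop stops at the first failure instead of skipping it.
    public static List<DateTime> EnsureLookahead(
        DateTime maxBoundary, int months, Action<DateTime> split)
    {
        var added = new List<DateTime>();
        var next = maxBoundary.AddMonths(1);
        for (var i = 0; i < months; i++)
        {
            split(next);      // throws on failure; deliberately no catch here
            added.Add(next);
            next = next.AddMonths(1);
        }

        return added;
    }
}
```

Because the exception leaves the loop, the boundaries that exist afterwards are always a contiguous prefix, and a later run that re-reads the maximum boundary resumes at the month that failed.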
@@ -334,6 +334,17 @@ public class DeploymentService
}
catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
{
// DeploymentManager-019: a lifecycle command timeout produced no
// audit row pre-fix: the operator saw a timeout in the UI but
// the audit trail showed nothing happened, contrary to the
// design's "audit logging for all instance lifecycle changes"
// rule. Mirror the DeployFailed pattern: write a "<Action>TimedOut"
// entry with CancellationToken.None so a cancelled outer token
// (the typical reason this catch ran) cannot prevent the
// durable audit write.
await TryLogLifecycleTimeoutAsync(
user, "DisableTimedOut", instanceId, instance.UniqueName, commandId, ex);

_logger.LogWarning(ex, "Disable of instance {Instance} timed out", instance.UniqueName);
return Result<InstanceLifecycleResponse>.Failure(
$"Disable failed: the site did not respond within {_options.LifecycleCommandTimeout}.");
@@ -391,6 +402,12 @@ public class DeploymentService
}
catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
{
// DeploymentManager-019: emit an audit entry on lifecycle timeout
// so the operator's attempted Enable is recorded; see the matching
// comment in DisableInstanceAsync for the full rationale.
await TryLogLifecycleTimeoutAsync(
user, "EnableTimedOut", instanceId, instance.UniqueName, commandId, ex);

_logger.LogWarning(ex, "Enable of instance {Instance} timed out", instance.UniqueName);
return Result<InstanceLifecycleResponse>.Failure(
$"Enable failed: the site did not respond within {_options.LifecycleCommandTimeout}.");
@@ -453,6 +470,12 @@ public class DeploymentService
}
catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
{
// DeploymentManager-019: emit an audit entry on lifecycle timeout
// so the operator's attempted Delete is recorded; see the matching
// comment in DisableInstanceAsync for the full rationale.
await TryLogLifecycleTimeoutAsync(
user, "DeleteTimedOut", instanceId, instance.UniqueName, commandId, ex);

_logger.LogWarning(ex, "Delete of instance {Instance} timed out", instance.UniqueName);
return Result<InstanceLifecycleResponse>.Failure(
$"Delete failed: the site did not respond within {_options.LifecycleCommandTimeout}.");
@@ -794,6 +817,67 @@ public class DeploymentService
         }
     }
 
+    /// <summary>
+    /// DeploymentManager-019: write a "&lt;Action&gt;TimedOut" audit entry on
+    /// behalf of a lifecycle command (Disable / Enable / Delete) whose site
+    /// round-trip exceeded <see cref="DeploymentManagerOptions.LifecycleCommandTimeout"/>.
+    ///
+    /// <para>
+    /// Mirrors the <c>DeployFailed</c> pattern in
+    /// <see cref="DeployInstanceAsync"/>: the audit write uses
+    /// <see cref="CancellationToken.None"/> so the operator's outer cancellation
+    /// (the usual reason this path runs) cannot also prevent the audit row from
+    /// being persisted. The detail object carries the lifecycle command id, the
+    /// timeout that fired, and the original exception message so an operator can
+    /// correlate the audit entry with the UI-surfaced timeout error.
+    /// </para>
+    ///
+    /// <para>
+    /// Wrapped in try/catch — a failed audit write must NOT mask the underlying
+    /// timeout from the caller; it is logged at Warning so the operator can
+    /// reconcile but never thrown.
+    /// </para>
+    /// </summary>
+    /// <param name="user">The username of the operator who initiated the lifecycle command.</param>
+    /// <param name="action">The audit action name (<c>DisableTimedOut</c>, <c>EnableTimedOut</c>, or <c>DeleteTimedOut</c>).</param>
+    /// <param name="instanceId">The numeric instance id, recorded on the audit row.</param>
+    /// <param name="instanceUniqueName">The instance unique name used as the audit target name.</param>
+    /// <param name="commandId">The lifecycle command's correlation id, so the audit entry can be matched to logs.</param>
+    /// <param name="timeoutException">The captured <see cref="TimeoutException"/> or <see cref="OperationCanceledException"/>.</param>
+    private async Task TryLogLifecycleTimeoutAsync(
+        string user,
+        string action,
+        int instanceId,
+        string instanceUniqueName,
+        string commandId,
+        Exception timeoutException)
+    {
+        try
+        {
+            await _auditService.LogAsync(
+                user,
+                action,
+                "Instance",
+                instanceId.ToString(),
+                instanceUniqueName,
+                new
+                {
+                    CommandId = commandId,
+                    Deadline = _options.LifecycleCommandTimeout,
+                    Error = timeoutException.Message,
+                },
+                CancellationToken.None);
+        }
+        catch (Exception auditEx)
+        {
+            // A failed audit write must not bury the timeout for the caller —
+            // just log so an operator can investigate the audit-pipeline issue.
+            _logger.LogWarning(auditEx,
+                "Failed to write {Action} audit entry for instance {Instance} (commandId={CommandId})",
+                action, instanceUniqueName, commandId);
+        }
+    }
+
     private async Task StoreDeployedSnapshotAsync(
         int instanceId,
string deploymentId,
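The best-effort audit write in the DeploymentService hunks above can be sketched in isolation. This is a minimal sketch, not the project's code: `IAuditSink`, `FailingSink`, and `BestEffortAudit` are hypothetical names introduced only for illustration, and the real `_auditService.LogAsync` takes more arguments than shown here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the audit service; not the project's real interface.
public interface IAuditSink
{
    Task LogAsync(string action, object detail, CancellationToken ct);
}

// Hypothetical sink that always fails, used to exercise the failure path.
public sealed class FailingSink : IAuditSink
{
    public Task LogAsync(string action, object detail, CancellationToken ct) =>
        Task.FromException(new InvalidOperationException("audit store unavailable"));
}

public static class BestEffortAudit
{
    // Returns true when the row was written and false when the write failed.
    // Never throws, so the caller can still surface the original timeout.
    public static async Task<bool> TryLogAsync(
        IAuditSink sink, string action, object detail, Action<Exception> onFailure)
    {
        try
        {
            // CancellationToken.None: on this path the caller's token is usually
            // already cancelled and must not cancel the audit write as well.
            await sink.LogAsync(action, detail, CancellationToken.None);
            return true;
        }
        catch (Exception ex)
        {
            // Report the audit failure out-of-band instead of rethrowing.
            onFailure(ex);
            return false;
        }
    }
}
```

Returning a flag rather than throwing keeps the timeout result the caller builds afterwards as the only error the operator sees, which is the property the doc comment above asks for.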
@@ -256,6 +256,24 @@ public class ExternalSystemClient : IExternalSystemClient
 
         var client = _httpClientFactory.CreateClient($"ExternalSystem_{system.Name}");
 
+        // ExternalSystemGateway-019: HttpClient.Timeout defaults to 100 seconds
+        // and is enforced internally by SendAsync via its own private CTS — a
+        // TaskCanceledException raised by that internal CTS does not trip
+        // either the caller's token or the gateway's timeout CTS, so it falls
+        // through the ordered catch filters below into the generic "connection
+        // error" branch and is misclassified. Any operator-configured
+        // DefaultHttpTimeout greater than 100 s would therefore be silently
+        // clipped to 100 s, breaking the design's "timeout applies to the HTTP
+        // request round-trip" guarantee. Disable the framework default so the
+        // linked CancellationTokenSource(DefaultHttpTimeout) below is the sole
+        // timeout source — DefaultHttpTimeout is then honoured verbatim for
+        // every value, including ones well above 100 s. Setting this on the
+        // factory-supplied HttpClient before any request is the safe time:
+        // IHttpClientFactory hands out named clients backed by pooled message
+        // handlers, but the HttpClient instance itself is per-call and the
+        // Timeout property is per-instance.
+        client.Timeout = Timeout.InfiniteTimeSpan;
+
         var url = BuildUrl(system.EndpointUrl, method.Path, parameters, method.HttpMethod);
 
// The request and response own IDisposable resources (StringContent, the
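The "sole timeout source" idea from the ExternalSystemGateway-019 comment above can be illustrated with a small classifier. This is a sketch under stated assumptions: `CallOutcome` and `TimeoutClassifier` are hypothetical names, and the real client's catch filters are not shown in this diff, so the ordering below is only one plausible arrangement.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical outcome type for this sketch only.
public enum CallOutcome { Completed, CallerCancelled, TimedOut }

public static class TimeoutClassifier
{
    public static async Task<CallOutcome> RunAsync(
        Func<CancellationToken, Task> call, TimeSpan timeout, CancellationToken callerToken)
    {
        // One CTS for the configured timeout, linked with the caller's token,
        // so either source cancels the call and each can be told apart afterwards.
        using var timeoutCts = new CancellationTokenSource(timeout);
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(
            callerToken, timeoutCts.Token);
        try
        {
            await call(linked.Token);
            return CallOutcome.Completed;
        }
        catch (OperationCanceledException) when (callerToken.IsCancellationRequested)
        {
            return CallOutcome.CallerCancelled;
        }
        catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested)
        {
            return CallOutcome.TimedOut;
        }
        // A cancellation raised by neither token (for example HttpClient's own
        // internal timeout) matches no filter and propagates unclassified.
    }

    // An HttpClient whose built-in 100-second default is disabled, so the
    // linked token above is the only thing that can time the request out.
    public static HttpClient CreateClientWithoutFrameworkTimeout() =>
        new HttpClient { Timeout = System.Threading.Timeout.InfiniteTimeSpan };
}
```

With the framework timeout left at its default, a call longer than 100 seconds would be cancelled by a token this code does not own and would fall outside both filters.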
@@ -85,10 +85,39 @@ public class CentralHealthReportLoop : BackgroundService
         _collector.SetClusterNodes(_clusterNodeProvider.GetClusterNodes());
 
         var seq = Interlocked.Increment(ref _sequenceNumber);
+
+        // HealthMonitoring-018: CollectReport atomically reads and resets
+        // the per-interval error counters via Interlocked.Exchange. If
+        // ProcessReport throws (or any other failure occurs between the
+        // collect and the publish), those counts would otherwise be
+        // lost — neither in the un-published report nor in the
+        // now-zeroed collector. Snapshot the freshly-collected report
+        // so that on a publish failure we can atomically restore the
+        // counts back into the shared SiteHealthCollector via
+        // Interlocked.Add. Concurrent increments arriving during the
+        // ProcessReport call are preserved on the counter; the restore
+        // Add safely sums with any such concurrent increments. Same
+        // shape as the HealthMonitoring-017 fix in HealthReportSender.
         var report = _collector.CollectReport(CentralSiteId);
         var reportWithSeq = report with { SequenceNumber = seq };
 
-        _aggregator.ProcessReport(reportWithSeq);
+        try
+        {
+            _aggregator.ProcessReport(reportWithSeq);
+        }
+        catch
+        {
+            // Restore the captured per-interval counters atomically so
+            // they roll forward into the next report — see
+            // HealthMonitoring-018.
+            _collector.AddIntervalCounters(
+                scriptErrors: report.ScriptErrorCount,
+                alarmErrors: report.AlarmEvaluationErrorCount,
+                deadLetters: report.DeadLetterCount,
+                siteAuditWriteFailures: report.SiteAuditWriteFailures,
+                auditRedactionFailures: report.AuditRedactionFailure);
+            throw;
+        }
 
         _logger.LogDebug("Generated central health report #{Seq}", seq);
     }
@@ -138,12 +138,42 @@ public class HealthReportSender : BackgroundService
         }
 
         var seq = Interlocked.Increment(ref _sequenceNumber);
+
+        // HealthMonitoring-017: CollectReport atomically reads and resets
+        // the per-interval error counters via Interlocked.Exchange. If
+        // the Send below throws, those counts are otherwise lost
+        // forever — neither in the un-sent report nor in the now-zeroed
+        // collector. Snapshot the freshly-collected report so that on a
+        // transport failure we can atomically restore the counts back
+        // into the collector via Interlocked.Add, so the next
+        // successful report includes them. Concurrent increments
+        // arriving during the Send are preserved on the counter (they
+        // accumulate against zero); the restore Add safely sums with
+        // any such concurrent increments.
         var report = _collector.CollectReport(_siteId);
 
         // Replace the placeholder sequence number with our monotonic one
         var reportWithSeq = report with { SequenceNumber = seq };
 
-        _transport.Send(reportWithSeq);
+        try
+        {
+            _transport.Send(reportWithSeq);
+        }
+        catch
+        {
+            // Restore the captured per-interval counters atomically so
+            // they roll forward into the next report — see
+            // HealthMonitoring-017. Any concurrent increment that
+            // arrived during the failed Send remains on the counter;
+            // Interlocked.Add sums correctly with it.
+            _collector.AddIntervalCounters(
+                scriptErrors: report.ScriptErrorCount,
+                alarmErrors: report.AlarmEvaluationErrorCount,
+                deadLetters: report.DeadLetterCount,
+                siteAuditWriteFailures: report.SiteAuditWriteFailures,
+                auditRedactionFailures: report.AuditRedactionFailure);
+            throw;
+        }
 
         _logger.LogInformation("Sent health report #{Seq} for site {SiteId}", seq, _siteId);
     }
@@ -140,4 +140,33 @@ public interface ISiteHealthCollector
     /// <param name="siteId">The site identifier.</param>
     /// <returns>A health report for the specified site.</returns>
     SiteHealthReport CollectReport(string siteId);
+
+    /// <summary>
+    /// HealthMonitoring-017: atomically add back the given per-interval error
+    /// counts into the collector's accumulators. Called by the report sender
+    /// when transport delivery of a freshly-collected report fails, so the
+    /// counts that <see cref="CollectReport"/> already drained roll forward
+    /// into the next report rather than being silently lost. Concurrent
+    /// increments arriving between the failed Send and this restore are
+    /// preserved — <c>Interlocked.Add</c> sums correctly with them. The
+    /// default interface implementation is a no-op so existing test fakes
+    /// (the only implementations outside <see cref="SiteHealthCollector"/>)
+    /// continue to compile without per-fake updates; production callers see
+    /// the real behaviour via the concrete class.
+    /// </summary>
+    /// <param name="scriptErrors">Script error count to add back.</param>
+    /// <param name="alarmErrors">Alarm evaluation error count to add back.</param>
+    /// <param name="deadLetters">Dead letter count to add back.</param>
+    /// <param name="siteAuditWriteFailures">Site audit write failure count to add back.</param>
+    /// <param name="auditRedactionFailures">Audit redaction failure count to add back.</param>
+    void AddIntervalCounters(
+        int scriptErrors,
+        int alarmErrors,
+        int deadLetters,
+        int siteAuditWriteFailures,
+        int auditRedactionFailures)
+    {
+        // Default no-op so test fakes do not need to be updated. The real
+        // SiteHealthCollector overrides this with the Interlocked.Add restore.
+    }
 }
@@ -142,6 +142,27 @@ public class SiteHealthCollector : ISiteHealthCollector
     /// <inheritdoc />
     public bool IsActiveNode => _isActiveNode;
 
+    /// <inheritdoc />
+    public void AddIntervalCounters(
+        int scriptErrors,
+        int alarmErrors,
+        int deadLetters,
+        int siteAuditWriteFailures,
+        int auditRedactionFailures)
+    {
+        // HealthMonitoring-017: each counter is restored atomically via
+        // Interlocked.Add so an increment that arrived during the failed Send
+        // (and therefore accumulated against the zero left by CollectReport's
+        // Exchange) is correctly summed with the values being put back. No
+        // ordering between the five Adds is required — they target independent
+        // fields.
+        if (scriptErrors != 0) Interlocked.Add(ref _scriptErrorCount, scriptErrors);
+        if (alarmErrors != 0) Interlocked.Add(ref _alarmErrorCount, alarmErrors);
+        if (deadLetters != 0) Interlocked.Add(ref _deadLetterCount, deadLetters);
+        if (siteAuditWriteFailures != 0) Interlocked.Add(ref _siteAuditWriteFailures, siteAuditWriteFailures);
+        if (auditRedactionFailures != 0) Interlocked.Add(ref _auditRedactionFailures, auditRedactionFailures);
+    }
+
     /// <inheritdoc />
     public SiteHealthReport CollectReport(string siteId)
{
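The drain-then-restore pattern shared by the HealthMonitoring-017 and -018 hunks above reduces to a few lines for a single counter. This is a sketch, not the collector's code: `IntervalCounter` and `DrainAndRestore` are hypothetical names, and the real collector restores five independent fields rather than one.

```csharp
using System;
using System.Threading;

// Hypothetical single-counter stand-in for the collector's per-interval fields.
public sealed class IntervalCounter
{
    private int _errors;

    public void Increment() => Interlocked.Increment(ref _errors);

    // Read the current value and reset it to zero in one atomic step.
    public int Drain() => Interlocked.Exchange(ref _errors, 0);

    // Put a previously drained value back. Interlocked.Add sums it with any
    // increments that arrived after the drain, so none of them are overwritten.
    public void Restore(int drained)
    {
        if (drained != 0) Interlocked.Add(ref _errors, drained);
    }

    public int Peek() => Volatile.Read(ref _errors);
}

public static class DrainAndRestore
{
    // Drains the counter, hands the snapshot to 'send', and restores the
    // snapshot before rethrowing if 'send' fails.
    public static void Publish(IntervalCounter counter, Action<int> send)
    {
        var snapshot = counter.Drain();
        try
        {
            send(snapshot);
        }
        catch
        {
            counter.Restore(snapshot);
            throw;
        }
    }
}
```

For example, three increments followed by a failed publish during which one more increment arrives leave the counter at four, and the next successful publish reports all four.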
@@ -46,11 +46,36 @@ public static class LoggerConfigurationFactory
     /// <summary>
     /// Parses a Serilog <see cref="LogEventLevel"/> name, falling back to
     /// <see cref="LogEventLevel.Information"/> for null/blank/unrecognised values.
+    ///
+    /// Host-022: when an operator sets <c>ScadaLink:Logging:MinimumLevel</c> to a
+    /// value that doesn't parse (e.g. the typo "Informaiton"), the helper must NOT
+    /// throw — startup has to succeed so the rest of the system can come up — but
+    /// it MUST make the silent fallback visible. The logger is not yet built at
+    /// this point, so the warning is written directly to <see cref="Console.Error"/>
+    /// using <see cref="WriteParseWarning"/>; non-null/non-blank values that fail
+    /// to parse are reported once, naming the offending value and the fallback.
+    /// Null/blank values are treated as "unset" and silently default — only
+    /// explicit-but-invalid values trigger the warning.
     /// </summary>
-    private static LogEventLevel ParseLevel(string? level)
+    internal static LogEventLevel ParseLevel(string? level)
+        => ParseLevel(level, Console.Error);
+
+    /// <summary>
+    /// Test-visible overload of <see cref="ParseLevel(string?)"/> that routes the
+    /// one-shot warning through a caller-supplied writer (<see cref="Console.Error"/>
+    /// in production) so unit tests can capture the warning output.
+    /// </summary>
+    /// <param name="level">Configured level string, possibly null/blank/invalid.</param>
+    /// <param name="warningWriter">Writer that receives a single warning line if the value is non-blank but unparseable.</param>
+    internal static LogEventLevel ParseLevel(string? level, TextWriter warningWriter)
     {
-        return Enum.TryParse<LogEventLevel>(level, ignoreCase: true, out var parsed)
-            ? parsed
-            : LogEventLevel.Information;
+        if (Enum.TryParse<LogEventLevel>(level, ignoreCase: true, out var parsed))
+            return parsed;
+
+        if (!string.IsNullOrWhiteSpace(level))
+            warningWriter.WriteLine(
+                $"warning: ScadaLink:Logging:MinimumLevel value '{level}' is not a recognised Serilog LogEventLevel; falling back to Information.");
+
+        return LogEventLevel.Information;
     }
}
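The writer-injection approach in the Host-022 hunk above can be exercised without Serilog. This is a sketch under stated assumptions: `Level` is a hypothetical stand-in for `LogEventLevel` so the example has no package dependency, and `LevelParser` is an illustrative name, not part of the project.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for Serilog's LogEventLevel.
public enum Level { Verbose, Debug, Information, Warning, Error, Fatal }

public static class LevelParser
{
    // Parses a level name; warns through the supplied writer only when the
    // value is present but unrecognised, then falls back to Information.
    public static Level Parse(string? value, TextWriter warningWriter)
    {
        if (Enum.TryParse<Level>(value, ignoreCase: true, out var parsed))
            return parsed;

        // Null or blank means "unset": default silently.
        if (!string.IsNullOrWhiteSpace(value))
            warningWriter.WriteLine(
                $"warning: '{value}' is not a recognised level; falling back to Information.");

        return Level.Information;
    }
}
```

A test can pass a `StringWriter` and assert on its contents, while production passes `Console.Error`. One caveat worth keeping in mind: `Enum.TryParse` also accepts numeric strings such as "42" and returns an undefined enum value for them, so rejecting those would need an additional `Enum.IsDefined` check.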
@@ -161,18 +161,23 @@ try
         // exponential backoff before failing fatally.
         // Host-015: only connection-class (transient) faults are retried — a
         // schema-version mismatch is permanent and must fail fast on attempt 1.
+        // Host-019: thread the host's ApplicationStopping token into both the
+        // migration call itself and the inter-attempt Task.Delay so a SIGTERM
+        // during the bounded-retry window (~2 min worst-case) tears down
+        // cleanly instead of being ignored until the loop exhausts.
         await StartupRetry.ExecuteWithRetryAsync(
             "database-migration",
-            async () =>
+            async ct =>
             {
                 using var scope = app.Services.CreateScope();
                 var dbContext = scope.ServiceProvider.GetRequiredService<ScadaLinkDbContext>();
-                await MigrationHelper.ApplyOrValidateMigrationsAsync(dbContext, isDevelopment, migrationLogger);
+                await MigrationHelper.ApplyOrValidateMigrationsAsync(dbContext, isDevelopment, migrationLogger, ct);
             },
             maxAttempts: 8,
             initialDelay: TimeSpan.FromSeconds(2),
             migrationLogger,
-            isTransient: StartupRetry.IsTransientDatabaseFault);
+            isTransient: StartupRetry.IsTransientDatabaseFault,
+            cancellationToken: app.Lifetime.ApplicationStopping);
     }
 
     // Middleware pipeline
@@ -28,7 +28,7 @@ public static class StartupRetry
     /// <param name="logger">Logger for retry warnings.</param>
     /// <param name="isTransient">Optional predicate classifying an exception as transient; null means all exceptions are transient.</param>
     /// <param name="cancellationToken">Cancellation token that aborts the retry loop immediately.</param>
-    public static async Task ExecuteWithRetryAsync(
+    public static Task ExecuteWithRetryAsync(
         string operationName,
         Func<Task> operation,
         int maxAttempts,
@@ -36,6 +36,23 @@ public static class StartupRetry
         ILogger logger,
         Func<Exception, bool>? isTransient = null,
         CancellationToken cancellationToken = default)
+        => ExecuteWithRetryAsync(operationName, _ => operation(), maxAttempts, initialDelay, logger, isTransient, cancellationToken);
+
+    /// <summary>
+    /// Executes an asynchronous operation with bounded exponential backoff, retrying only transient faults.
+    /// This overload forwards the retry-loop cancellation token to the operation itself.
+    /// Host-019: needed so callers (e.g. the database-migration step) can honour
+    /// <c>IHostApplicationLifetime.ApplicationStopping</c> inside the operation as well
+    /// as inside the inter-attempt <c>Task.Delay</c>.
+    /// </summary>
+    public static async Task ExecuteWithRetryAsync(
+        string operationName,
+        Func<CancellationToken, Task> operation,
+        int maxAttempts,
+        TimeSpan initialDelay,
+        ILogger logger,
+        Func<Exception, bool>? isTransient = null,
+        CancellationToken cancellationToken = default)
     {
         // Default: treat every exception as transient (preserves the pre-Host-015
         // behaviour for callers that do not classify faults).
@@ -47,7 +64,7 @@ public static class StartupRetry
             cancellationToken.ThrowIfCancellationRequested();
             try
             {
-                await operation();
+                await operation(cancellationToken);
                 if (attempt > 1)
                     logger.LogInformation(
"Startup operation '{Operation}' succeeded on attempt {Attempt}.",
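The Host-019 change above threads one token through both the operation and the back-off delay. A minimal sketch of that loop shape follows. `RetrySketch` is a hypothetical name, logging is omitted, and the doubling back-off without a cap is an assumption, since the body of the real `StartupRetry` loop is only partly visible in this diff.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RetrySketch
{
    // Runs 'operation' up to 'maxAttempts' times and returns the attempt
    // number that succeeded. Cancellation aborts both the operation and the
    // wait between attempts.
    public static async Task<int> RunAsync(
        Func<CancellationToken, Task> operation,
        int maxAttempts,
        TimeSpan initialDelay,
        Func<Exception, bool> isTransient,
        CancellationToken cancellationToken)
    {
        var delay = initialDelay;
        for (var attempt = 1; ; attempt++)
        {
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                // The same token reaches the operation itself.
                await operation(cancellationToken);
                return attempt;
            }
            catch (Exception ex) when (
                ex is not OperationCanceledException
                && attempt < maxAttempts
                && isTransient(ex))
            {
                // Cancellable wait: a shutdown signal ends the delay at once.
                await Task.Delay(delay, cancellationToken);
                delay = TimeSpan.FromTicks(delay.Ticks * 2); // assumed doubling
            }
        }
    }
}
```

Non-transient faults, cancellations, and the failure on the final attempt fall outside the filter and propagate to the caller unchanged.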
@@ -251,10 +251,16 @@ public sealed class AuditWriteMiddleware
                 ForwardState = null,
             };
 
-            // Fire-and-forget — the writer itself swallows; the additional
-            // try/catch around the fire still protects us if WriteAsync throws
-            // synchronously before returning a task.
-            _ = _auditWriter.WriteAsync(evt);
+            // InboundAPI-018: fire-and-forget the writer so the user-facing
+            // response stays non-blocking (alog.md §13 — audit emission must
+            // NEVER abort or delay the user request), but observe the returned
+            // Task so an asynchronous fault is logged instead of vanishing into
+            // TaskScheduler.UnobservedTaskException. The outer try/catch still
+            // catches a synchronous throw before WriteAsync returns a task; the
+            // ContinueWith only fires on a faulted task and runs off-thread, so
+            // it does not block the response.
+            var writeTask = _auditWriter.WriteAsync(evt);
+            ObserveAuditWriteFault(writeTask, ctx);
         }
         catch (Exception ex)
         {
@@ -265,6 +271,36 @@ public sealed class AuditWriteMiddleware
         }
     }
 
+    /// <summary>
+    /// InboundAPI-018: observe the audit writer's returned <see cref="Task"/>
+    /// so a fault that surfaces ASYNCHRONOUSLY (e.g. a DB timeout deep in the
+    /// central audit pipeline) is logged at Warning rather than dropped into
+    /// <see cref="TaskScheduler.UnobservedTaskException"/>. Stays
+    /// fire-and-forget so the user-facing response is not delayed — the
+    /// continuation runs only on a faulted task and writes a single log line
+    /// off the request thread. A task that has already completed successfully
+    /// takes the fast path with no continuation scheduled.
+    /// </summary>
+    private void ObserveAuditWriteFault(Task writeTask, HttpContext ctx)
+    {
+        if (writeTask.IsCompletedSuccessfully)
+        {
+            return;
+        }
+
+        var method = ctx.Request.Method;
+        var path = ctx.Request.Path;
+        var status = ctx.Response.StatusCode;
+        _ = writeTask.ContinueWith(
+            t => _logger.LogWarning(
+                t.Exception,
+                "AuditWriteMiddleware async audit write faulted for {Method} {Path} (status {Status})",
+                method, path, status),
+            CancellationToken.None,
+            TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
+            TaskScheduler.Default);
+    }
+
     /// <summary>
     /// Reads the buffered request body up to <paramref name="capBytes"/> bytes
/// into a string for the audit copy and rewinds the stream so the
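The fault-only continuation used by `ObserveAuditWriteFault` above can be tried in isolation. This is a sketch: `FaultObserver` is a hypothetical helper, and it returns the continuation task (which the middleware discards) purely so the behaviour can be asserted.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class FaultObserver
{
    // Attaches a continuation that runs only if 'task' faults. Returns the
    // continuation so callers can wait on it in tests, or null on the fast
    // path where the task has already completed successfully.
    public static Task? Observe(Task task, Action<AggregateException> onFault)
    {
        if (task.IsCompletedSuccessfully)
            return null;

        return task.ContinueWith(
            // Reading t.Exception marks the fault as observed.
            t => onFault(t.Exception!),
            CancellationToken.None,
            TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);
    }
}
```

Two details of the framework behaviour are worth noting. If the observed task later completes without faulting, the continuation task ends in the Canceled state, so it should not be awaited unguarded. And with `ExecuteSynchronously`, a task that is already faulted when `ContinueWith` is called runs the continuation on the calling thread, so the callback should stay as cheap as a single log call.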
@@ -0,0 +1,70 @@
+using Microsoft.Extensions.Options;
+
+namespace ScadaLink.Security;
+
+/// <summary>
+/// Security-020: validates <see cref="SecurityOptions"/> at startup so a
+/// missing or empty required LDAP field fails fast at boot with a clear,
+/// key-naming message — rather than surfacing minutes or hours later as a
+/// generic "An unexpected error occurred during authentication" on the first
+/// real login attempt.
+///
+/// <para>
+/// The LDAP-side required fields validated here are <see cref="SecurityOptions.LdapServer"/>
+/// (no sane default — the host must be specified) and
+/// <see cref="SecurityOptions.LdapSearchBase"/> (the DN root every directory
+/// search runs against). A typo in the appsettings section name, a missing
+/// environment-variable substitution, or a misconfigured Docker compose file
+/// leaves both defaulted to <c>string.Empty</c> — without this validator the
+/// process would start cleanly and only fail on the first login when
+/// <c>LdapConnection.Connect("")</c> throws a low-level exception that does
+/// not name the offending config key.
+/// </para>
+///
+/// <para>
+/// <see cref="SecurityOptions.JwtSigningKey"/> is intentionally NOT validated
+/// here — it already fails fast at <see cref="JwtTokenService"/> construction
+/// (Security-003 fix), with a length-aware error message. Centralising it
+/// here would duplicate that guard; leaving it on the constructor keeps the
+/// minimum-byte length contract co-located with the type that enforces it.
+/// </para>
+/// </summary>
+public sealed class SecurityOptionsValidator : IValidateOptions<SecurityOptions>
+{
+    /// <summary>
+    /// The configuration section name <see cref="SecurityOptions"/> is bound
+    /// to (matches the Host's <c>builder.Configuration.GetSection("Security")</c>
+    /// call). Exposed so validation messages can name the full
+    /// <c>Security:Field</c> key the operator would edit, not just the field
+    /// name.
+    /// </summary>
+    public const string ConfigSectionName = "Security";
+
+    /// <inheritdoc />
+    public ValidateOptionsResult Validate(string? name, SecurityOptions options)
+    {
+        ArgumentNullException.ThrowIfNull(options);
+
+        var failures = new List<string>();
+
+        if (string.IsNullOrWhiteSpace(options.LdapServer))
+        {
+            failures.Add(
+                $"{ConfigSectionName}:{nameof(SecurityOptions.LdapServer)} is required " +
+                "but was empty or whitespace — set it to the LDAP server hostname or IP " +
+                "(e.g. \"ldap.example.com\").");
+        }
+
+        if (string.IsNullOrWhiteSpace(options.LdapSearchBase))
+        {
+            failures.Add(
+                $"{ConfigSectionName}:{nameof(SecurityOptions.LdapSearchBase)} is required " +
+                "but was empty or whitespace — set it to the search-base DN " +
+                "(e.g. \"dc=example,dc=com\").");
+        }
+
+        return failures.Count == 0
+            ? ValidateOptionsResult.Success
+            : ValidateOptionsResult.Fail(failures);
+    }
+}
@@ -1,5 +1,6 @@
 using Microsoft.AspNetCore.Authentication.Cookies;
 using Microsoft.Extensions.DependencyInjection;
+using Microsoft.Extensions.DependencyInjection.Extensions;
 using Microsoft.Extensions.Options;
 
 namespace ScadaLink.Security;
@@ -16,6 +17,17 @@ public static class ServiceCollectionExtensions
         services.AddScoped<JwtTokenService>();
         services.AddScoped<RoleMapper>();
 
+        // Security-020: register the IValidateOptions<SecurityOptions> so a
+        // missing/empty LdapServer or LdapSearchBase fails fast at startup
+        // with a clear, key-naming message rather than a generic LDAP error
+        // on the first real login. ValidateOnStart() forces the validation to
+        // run during host startup rather than lazily on the first
+        // IOptions<SecurityOptions> resolve. TryAddEnumerable so multiple
+        // AddSecurity calls (or future additional validators) don't pile up.
+        services.AddOptions<SecurityOptions>().ValidateOnStart();
+        services.TryAddEnumerable(
+            ServiceDescriptor.Singleton<IValidateOptions<SecurityOptions>, SecurityOptionsValidator>());
+
         // Register ASP.NET Core authentication with cookie scheme
         services.AddAuthentication(CookieAuthenticationDefaults.AuthenticationScheme)
             .AddCookie(options =>
@@ -51,6 +51,30 @@ public class StoreAndForwardService
     private Timer? _retryTimer;
     private int _retryInProgress;
 
+    /// <summary>
+    /// StoreAndForward-024: the in-flight retry sweep <see cref="Task"/>, or
+    /// <c>null</c> when no sweep is currently running. Captured when the timer
+    /// callback starts a sweep so <see cref="StopAsync"/> can wait for it to
+    /// finish before the host disposes downstream dependencies
+    /// (<see cref="_storage"/>, <see cref="_replication"/>) that the sweep is
+    /// still touching. Written from the timer thread and read from
+    /// <see cref="StopAsync"/>, so both accesses go through the
+    /// <see cref="Volatile"/> APIs.
+    /// </summary>
+    private Task? _sweepTask;
+
+    /// <summary>
+    /// StoreAndForward-024: how long <see cref="StopAsync"/> waits for an
+    /// in-flight retry sweep to finish before returning. The default — 10 s —
+    /// is generous enough to let a typical sweep over the buffered queue drain,
+    /// but bounded so a hung downstream call (a stuck SQLite write, a
+    /// long-running delivery handler) cannot block host shutdown indefinitely.
+    /// On timeout the wait is abandoned and the timer is still disposed; the
+    /// sweep keeps running but will throw on the next call into a disposed
+    /// dependency — preferred to blocking shutdown forever.
+    /// </summary>
+    private static readonly TimeSpan SweepShutdownWaitTimeout = TimeSpan.FromSeconds(10);
+
     /// <summary>
     /// WP-10: Delivery handler delegate. The return value / exception is interpreted
     /// the same way on both the immediate-delivery path (<see cref="EnqueueAsync"/>)
@@ -120,7 +144,14 @@ public class StoreAndForwardService
     {
         await _storage.InitializeAsync();
         _retryTimer = new Timer(
-            _ => _ = RetryPendingMessagesAsync(),
+            // StoreAndForward-024: capture the sweep Task on each tick so
+            // StopAsync can await any in-flight invocation before the host
+            // disposes _storage/_replication underneath it. The RetryPending
+            // path is self-guarded against overlapping sweeps via the
+            // _retryInProgress Interlocked flag, so unconditionally re-assigning
+            // the field here cannot lose a still-running task (the new tick
+            // will short-circuit if one is already running).
+            _ => Volatile.Write(ref _sweepTask, RetryPendingMessagesAsync()),
             null,
             _options.RetryTimerInterval,
             _options.RetryTimerInterval);
@@ -131,15 +162,58 @@ public class StoreAndForwardService
     }
 
     /// <summary>
-    /// Stops the background retry timer.
+    /// Stops the background retry timer and waits (bounded) for any in-flight
+    /// retry sweep to finish before returning.
+    ///
+    /// StoreAndForward-024: prior to this fix, <see cref="StopAsync"/> only
+    /// disposed the timer — a sweep already inside
+    /// <see cref="RetryPendingMessagesAsync"/> continued running against
+    /// <see cref="_storage"/> and <see cref="_replication"/> after this method
+    /// returned, and could then NRE / throw on a disposed dependency once the
+    /// DI container ran its own shutdown. We now await the captured sweep task
+    /// (with a bounded <see cref="SweepShutdownWaitTimeout"/> so a hung
+    /// dependency cannot block host shutdown indefinitely) before returning.
     /// </summary>
     public async Task StopAsync()
     {
         if (_retryTimer != null)
         {
+            // Stop the periodic callback first so no new sweep starts while we
+            // are waiting for the in-flight one to drain.
             await _retryTimer.DisposeAsync();
             _retryTimer = null;
         }
+
+        var inflight = Volatile.Read(ref _sweepTask);
+        if (inflight is null || inflight.IsCompleted)
+        {
+            return;
+        }
+
+        try
+        {
+            // WaitAsync with a finite timeout: a hung delivery handler /
+            // storage call cannot block host shutdown indefinitely. On timeout
+            // the sweep keeps running but the host is free to proceed with
+            // disposal — preferred to never returning.
+            await inflight.WaitAsync(SweepShutdownWaitTimeout).ConfigureAwait(false);
+        }
+        catch (TimeoutException)
+        {
+            _logger.LogWarning(
+                "Store-and-forward retry sweep did not finish within {Timeout}; " +
+                "shutdown is proceeding while the sweep is still in-flight",
+                SweepShutdownWaitTimeout);
+        }
+        catch (Exception ex)
+        {
+            // The sweep itself already logs at Error on failure (see
+            // RetryPendingMessagesAsync's catch); we only log here so a
+            // surprise fault during shutdown is still visible. Swallow so the
+            // host's shutdown sequence can continue regardless.
+            _logger.LogWarning(ex,
+                "Store-and-forward retry sweep faulted during shutdown wait");
+        }
     }
 
/// <summary>
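The bounded shutdown wait in `StopAsync` above relies on `Task.WaitAsync(TimeSpan)`, available from .NET 6. A minimal sketch of the same shape follows. `BoundedWait` is a hypothetical helper, and the callback stands in for the Warning log calls in the real method.

```csharp
using System;
using System.Threading.Tasks;

public static class BoundedWait
{
    // Returns true if 'inflight' finished (successfully or not) within
    // 'timeout', and false if the wait was abandoned while it was still running.
    public static async Task<bool> WaitForAsync(
        Task? inflight, TimeSpan timeout, Action<Exception> onFault)
    {
        if (inflight is null || inflight.IsCompleted)
            return true;

        try
        {
            // Only the wait is bounded; the underlying task is not cancelled.
            await inflight.WaitAsync(timeout).ConfigureAwait(false);
            return true;
        }
        catch (TimeoutException)
        {
            // Still running: the caller proceeds with shutdown anyway.
            return false;
        }
        catch (Exception ex)
        {
            // The task itself faulted; report it and let shutdown continue.
            onFault(ex);
            return true;
        }
    }
}
```

One limitation of this shape, which applies equally to the catch ordering in the diff: if the awaited task itself faults with a `TimeoutException`, the first catch handles it and the fault is reported as a wait timeout.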