feat(auditlog): AuditLogPurgeActor daily partition-switch purge (#23 M6)

Central singleton (M6-T4 Bundle C) that drives the daily AuditLog
partition purge. On a configurable timer (default 24 hours) the actor:

1. Queries IAuditLogRepository.GetPartitionBoundariesOlderThanAsync for
   monthly boundaries whose latest OccurredAtUtc is older than
   DateTime.UtcNow - AuditLogOptions.RetentionDays.
2. For each eligible boundary calls SwitchOutPartitionAsync, which runs
   the drop-and-rebuild dance around UX_AuditLog_EventId.
3. Publishes AuditLogPurgedEvent(boundary, rowsDeleted, durationMs) on
   the actor-system EventStream so the Bundle E central health collector
   and ops surfaces can subscribe without coupling to this actor.

Co-changes:
* SwitchOutPartitionAsync returns long (rows deleted), sampled BEFORE
  the switch via COUNT_BIG over the per-partition filter so the count
  reflects what the switch removed, not a post-purge scan of a table
  that no longer exists. All stub implementations updated.
* AuditLogPurgeOptions: IntervalHours (default 24), IntervalOverride for
  tests, Interval property resolving either.
* AuditLogPurgedEvent: record with MonthBoundary, RowsDeleted,
  DurationMs.

Behavior:
* Continue-on-error per boundary: one partition that throws does NOT
  abandon the rest of the tick.
* DI scope opened per tick (IAuditLogRepository is a SCOPED EF Core
  service); mirrors SiteAuditReconciliationActor and
  AuditLogIngestActor.
* SupervisorStrategy Resume keeps the singleton alive across leaked
  exceptions.
* EventStream capture BEFORE the first await: Context is unsafe after
  await in async receive handlers (same pattern as Sender-capture in
  AuditLogIngestActor.OnIngestAsync).

Tests:
* Tick_Fires_OnDailyInterval: visible timer side effect.
* Tick_OldPartitions_SwitchedOut: both seeded boundaries purged.
* Tick_NewerPartitions_Untouched: empty enumerator → no switches.
* Tick_PublishesPurgedEvent_WithRowCount: AuditLogPurgedEvent carries
  RowsDeleted and DurationMs.
* Tick_SwitchThrows_OtherPartitionsStillProcessed: continue-on-error.
* Threshold_UsesAuditLogOptionsRetentionDays: non-default 30-day window
  computed from UtcNow - RetentionDays.
* EndToEnd_RealPartition_RowsRemoved_PurgedEventPublished: TestKit +
  MsSqlMigrationFixture: real partitioned table, Jan-2026 row purged,
  Apr-2026 row kept, AuditLogPurgedEvent observed via probe.
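
The continue-on-error behavior described above can be illustrated with a minimal, dependency-free sketch. It assumes plain delegates in place of IAuditLogRepository.SwitchOutPartitionAsync and the Akka EventStream; PurgeSketch, RunTickAsync, and FakeSwitchOutAsync are hypothetical names used only for illustration and are not part of this commit.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

// Stand-in for the EventStream: collects what would have been published.
var published = new List<(DateTime Boundary, long Rows)>();

var boundaries = new[]
{
    new DateTime(2026, 1, 1, 0, 0, 0, DateTimeKind.Utc),
    new DateTime(2026, 2, 1, 0, 0, 0, DateTimeKind.Utc),
    new DateTime(2026, 3, 1, 0, 0, 0, DateTimeKind.Utc),
};

// Hypothetical switch-out: the February partition fails, the others
// report 10 rows removed.
Task<long> FakeSwitchOutAsync(DateTime boundary) =>
    boundary.Month == 2
        ? Task.FromException<long>(new InvalidOperationException("simulated SQL failure"))
        : Task.FromResult(10L);

var failures = await PurgeSketch.RunTickAsync(
    boundaries,
    FakeSwitchOutAsync,
    (boundary, rows, durationMs) => published.Add((boundary, rows)));

Console.WriteLine($"purged={published.Count} failed={failures}");
// prints "purged=2 failed=1"

public static class PurgeSketch
{
    // Processes every boundary; a failure is counted and logged, and the
    // loop moves on to the next boundary instead of aborting the tick.
    public static async Task<int> RunTickAsync(
        IEnumerable<DateTime> boundaries,
        Func<DateTime, Task<long>> switchOutAsync,
        Action<DateTime, long, long> publish)
    {
        var failures = 0;
        foreach (var boundary in boundaries)
        {
            var sw = Stopwatch.StartNew();
            try
            {
                var rows = await switchOutAsync(boundary);
                sw.Stop();
                publish(boundary, rows, sw.ElapsedMilliseconds);
            }
            catch (Exception ex)
            {
                sw.Stop();
                failures++;
                Console.Error.WriteLine(
                    $"purge of {boundary:yyyy-MM-dd} failed: {ex.Message}");
            }
        }
        return failures;
    }
}
```

The January and March boundaries are still published even though February throws, which is the property Tick_SwitchThrows_OtherPartitionsStillProcessed exercises against the real actor.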
2026-05-20 18:36:31 -04:00
parent 6069a20e0f
commit 660fdc4e93
8 changed files with 718 additions and 6 deletions
--- a/src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs
+++ b/src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs
@@ -0,0 +1,214 @@
+using System.Diagnostics;
+using Akka.Actor;
+using Microsoft.Extensions.DependencyInjection;
+using Microsoft.Extensions.Logging;
+using Microsoft.Extensions.Options;
+using ScadaLink.AuditLog.Configuration;
+using ScadaLink.Commons.Interfaces.Repositories;
+
+namespace ScadaLink.AuditLog.Central;
+
+/// <summary>
+/// Central singleton (M6 Bundle C) that drives the daily AuditLog partition
+/// purge. On a configurable timer (default 24 hours) the actor:
+/// <list type="number">
+/// <item>Queries <see cref="IAuditLogRepository.GetPartitionBoundariesOlderThanAsync"/>
+///       for monthly boundaries whose latest <c>OccurredAtUtc</c> is older
+///       than <c>DateTime.UtcNow - RetentionDays</c>.</item>
+/// <item>For each eligible boundary, calls
+///       <see cref="IAuditLogRepository.SwitchOutPartitionAsync"/> which runs
+///       the drop-and-rebuild dance around <c>UX_AuditLog_EventId</c>.</item>
+/// <item>Publishes <see cref="AuditLogPurgedEvent"/> on the actor-system
+///       EventStream so the Bundle E central health collector + ops surfaces
+///       can subscribe without coupling to this actor.</item>
+/// </list>
+/// </summary>
+/// <remarks>
+/// <para>
+/// <b>Daily cadence.</b> Partition switch is metadata-only but the
+/// drop-and-rebuild dance briefly removes <c>UX_AuditLog_EventId</c>; running
+/// more often than necessary trades unique-index rebuild outages for
+/// negligible freshness wins. The default 24-hour interval matches
+/// alog.md §10's retention policy.
+/// </para>
+/// <para>
+/// <b>Continue-on-error.</b> A single boundary that throws (transient SQL
+/// failure, contention with backup, missing object) must NOT prevent the
+/// other eligible boundaries from being purged on the same tick. Per-boundary
+/// work runs inside its own try/catch; the actor's
+/// <see cref="SupervisorStrategy"/> uses Resume so any leaked exception keeps
+/// the singleton alive for the next tick.
+/// </para>
+/// <para>
+/// <b>DI scopes.</b> <see cref="IAuditLogRepository"/> is a scoped EF Core
+/// service registered by <c>AddConfigurationDatabase</c>. The singleton
+/// opens one DI scope per tick and reuses the same repository across every
+/// boundary in that tick — mirrors the
+/// <see cref="SiteAuditReconciliationActor"/> pattern.
+/// </para>
+/// <para>
+/// <b>EventStream.</b> Publishing <see cref="AuditLogPurgedEvent"/> through
+/// the EventStream rather than direct messaging avoids coupling this actor
+/// to its consumers; M6 Bundle E will subscribe a central health-counter
+/// bridge that surfaces purge progress on the central health report.
+/// </para>
+/// </remarks>
+public class AuditLogPurgeActor : ReceiveActor
+{
+    private readonly IServiceProvider _services;
+    private readonly AuditLogPurgeOptions _purgeOptions;
+    private readonly AuditLogOptions _auditOptions;
+    private readonly ILogger<AuditLogPurgeActor> _logger;
+    private ICancelable? _timer;
+
+    public AuditLogPurgeActor(
+        IServiceProvider services,
+        IOptions<AuditLogPurgeOptions> purgeOptions,
+        IOptions<AuditLogOptions> auditOptions,
+        ILogger<AuditLogPurgeActor> logger)
+    {
+        ArgumentNullException.ThrowIfNull(services);
+        ArgumentNullException.ThrowIfNull(purgeOptions);
+        ArgumentNullException.ThrowIfNull(auditOptions);
+        ArgumentNullException.ThrowIfNull(logger);
+
+        _services = services;
+        _purgeOptions = purgeOptions.Value;
+        _auditOptions = auditOptions.Value;
+        _logger = logger;
+
+        ReceiveAsync<PurgeTick>(_ => OnTickAsync());
+    }
+
+    protected override void PreStart()
+    {
+        base.PreStart();
+        var interval = _purgeOptions.Interval;
+        _timer = Context.System.Scheduler.ScheduleTellRepeatedlyCancelable(
+            initialDelay: interval,
+            interval: interval,
+            receiver: Self,
+            message: PurgeTick.Instance,
+            sender: Self);
+    }
+
+    protected override void PostStop()
+    {
+        _timer?.Cancel();
+        base.PostStop();
+    }
+
+    /// <summary>
+    /// Resume keeps the singleton alive across any leaked exception. Restart
+    /// would re-run PreStart and reschedule the timer (harmless but wasteful);
+    /// Stop is wrong because the singleton must keep ticking until shutdown.
+    /// </summary>
+    protected override SupervisorStrategy SupervisorStrategy()
+    {
+        return new OneForOneStrategy(
+            maxNrOfRetries: 0,
+            withinTimeRange: TimeSpan.Zero,
+            decider: Decider.From(Directive.Resume));
+    }
+
+    private async Task OnTickAsync()
+    {
+        // Capture EventStream BEFORE the first await. The awaits below use
+        // ConfigureAwait(false), so their continuations can resume outside
+        // the actor's dispatch context, where ActorBase.Context (and
+        // therefore Context.System) throws "no active ActorContext".
+        // This mirrors the Sender-capture pattern in
+        // AuditLogIngestActor.OnIngestAsync.
+        var eventStream = Context.System.EventStream;
+
+        // Compute the retention threshold from AuditLogOptions.RetentionDays
+        // on every tick so it tracks the current UtcNow. _auditOptions is
+        // the IOptions<T>.Value snapshot taken in the constructor, so a
+        // changed RetentionDays is only picked up when the actor is
+        // recreated; applying it on the next tick would require injecting
+        // IOptionsMonitor<AuditLogOptions> and reading CurrentValue here.
+        var threshold = DateTime.UtcNow - TimeSpan.FromDays(_auditOptions.RetentionDays);
+
+        IServiceScope? scope = null;
+        IAuditLogRepository repository;
+        try
+        {
+            scope = _services.CreateScope();
+            repository = scope.ServiceProvider.GetRequiredService<IAuditLogRepository>();
+        }
+        catch (Exception ex)
+        {
+            _logger.LogError(ex, "Failed to resolve IAuditLogRepository for AuditLog purge tick.");
+            scope?.Dispose();
+            return;
+        }
+
+        try
+        {
+            IReadOnlyList<DateTime> boundaries;
+            try
+            {
+                boundaries = await repository
+                    .GetPartitionBoundariesOlderThanAsync(threshold)
+                    .ConfigureAwait(false);
+            }
+            catch (Exception ex)
+            {
+                _logger.LogError(
+                    ex,
+                    "Failed to enumerate eligible AuditLog partition boundaries (threshold {ThresholdUtc:o}); skipping purge tick.",
+                    threshold);
+                return;
+            }
+
+            if (boundaries.Count == 0)
+            {
+                return;
+            }
+
+            foreach (var boundary in boundaries)
+            {
+                // Per-boundary try/catch: one bad partition (transient SQL
+                // failure, missing object, contention with backup) does NOT
+                // abandon the rest of the tick.
+                var sw = Stopwatch.StartNew();
+                try
+                {
+                    var rowsDeleted = await repository
+                        .SwitchOutPartitionAsync(boundary)
+                        .ConfigureAwait(false);
+                    sw.Stop();
+
+                    eventStream.Publish(
+                        new AuditLogPurgedEvent(boundary, rowsDeleted, sw.ElapsedMilliseconds));
+
+                    _logger.LogInformation(
+                        "Purged AuditLog partition {MonthBoundary:yyyy-MM-dd}; {RowsDeleted} rows in {DurationMs} ms.",
+                        boundary,
+                        rowsDeleted,
+                        sw.ElapsedMilliseconds);
+                }
+                catch (Exception ex)
+                {
+                    sw.Stop();
+                    _logger.LogError(
+                        ex,
+                        "Failed to purge AuditLog partition {MonthBoundary:yyyy-MM-dd}; other partitions continue. Elapsed {DurationMs} ms.",
+                        boundary,
+                        sw.ElapsedMilliseconds);
+                }
+            }
+        }
+        finally
+        {
+            scope.Dispose();
+        }
+    }
+
+    /// <summary>Self-tick triggering a purge pass across all eligible partitions.</summary>
+    internal sealed class PurgeTick
+    {
+        public static readonly PurgeTick Instance = new();
+        private PurgeTick() { }
+    }
+}
--- a/src/ScadaLink.AuditLog/Central/AuditLogPurgeOptions.cs
+++ b/src/ScadaLink.AuditLog/Central/AuditLogPurgeOptions.cs
@@ -0,0 +1,43 @@
+namespace ScadaLink.AuditLog.Central;
+
+/// <summary>
+/// Tuning knobs for the central <see cref="AuditLogPurgeActor"/> singleton.
+/// Default cadence is 24 hours per the M6 plan; the retention window itself
+/// is sourced from <see cref="ScadaLink.AuditLog.Configuration.AuditLogOptions.RetentionDays"/>
+/// (default 365) so operators tune retention from a single section.
+/// </summary>
+/// <remarks>
+/// <para>
+/// The purge actor is a daily-cadence singleton, not a hot-loop, because
+/// partition-switch I/O is metadata-only but the drop-and-rebuild dance
+/// briefly removes the <c>UX_AuditLog_EventId</c> unique index — running
+/// more often than necessary trades index-rebuild outages for marginal
+/// freshness gains. Lower this only when an operator can prove they need
+/// sub-daily purge granularity.
+/// </para>
+/// <para>
+/// <see cref="IntervalOverride"/> exists for tests to drop the cadence to
+/// milliseconds without polluting the production config surface; production
+/// binds <see cref="IntervalHours"/> only.
+/// </para>
+/// </remarks>
+public sealed class AuditLogPurgeOptions
+{
+    /// <summary>Period of the purge tick in hours (default 24).</summary>
+    public int IntervalHours { get; set; } = 24;
+
+    /// <summary>
+    /// Test-only override for finer control over the tick cadence than
+    /// whole-hour resolution allows. When non-null, takes precedence over
+    /// <see cref="IntervalHours"/>. Not bound from config — production
+    /// config exposes <see cref="IntervalHours"/> only.
+    /// </summary>
+    public TimeSpan? IntervalOverride { get; set; }
+
+    /// <summary>
+    /// Resolves the effective tick interval, honouring the test override
+    /// when set. Falls back to <see cref="IntervalHours"/>.
+    /// </summary>
+    public TimeSpan Interval =>
+        IntervalOverride ?? TimeSpan.FromHours(IntervalHours);
+}
--- a/src/ScadaLink.AuditLog/Central/AuditLogPurgedEvent.cs
+++ b/src/ScadaLink.AuditLog/Central/AuditLogPurgedEvent.cs
@@ -0,0 +1,29 @@
+namespace ScadaLink.AuditLog.Central;
+
+/// <summary>
+/// Published on the actor-system EventStream by <see cref="AuditLogPurgeActor"/>
+/// after each successful partition switch-out. Downstream consumers (Bundle E
+/// central health collector, ops dashboards, audit trails) subscribe so a
+/// purge action is observable without the actor needing to know about any
+/// specific subscriber.
+/// </summary>
+/// <param name="MonthBoundary">
+/// The pf_AuditLog_Month lower-bound boundary that was switched out — i.e.
+/// the first instant of the purged month in UTC.
+/// </param>
+/// <param name="RowsDeleted">
+/// Approximate row count purged from the partition, sampled BEFORE the
+/// switch. Exact accounting would require a post-switch scan of the staging
+/// table, which the dance drops immediately, so this is the closest
+/// observable proxy. Zero is a valid value when the actor's enumerator
+/// included a partition the operator subsequently emptied by hand.
+/// </param>
+/// <param name="DurationMs">
+/// Wall-clock time spent inside <c>SwitchOutPartitionAsync</c> for this
+/// boundary, in milliseconds. Useful for spotting the rare slow purge
+/// without spinning up dedicated telemetry.
+/// </param>
+public sealed record AuditLogPurgedEvent(
+    DateTime MonthBoundary,
+    long RowsDeleted,
+    long DurationMs);