feat(auditlog): AuditLogPurgeActor daily partition-switch purge (#23 M6)
Central singleton (M6-T4 Bundle C) that drives the daily AuditLog partition
purge. On a configurable timer (default 24 hours) the actor:
1. Queries IAuditLogRepository.GetPartitionBoundariesOlderThanAsync for
monthly boundaries whose latest OccurredAtUtc is older than
DateTime.UtcNow - AuditLogOptions.RetentionDays.
2. For each eligible boundary calls SwitchOutPartitionAsync, which runs
the drop-and-rebuild dance around UX_AuditLog_EventId.
3. Publishes AuditLogPurgedEvent(boundary, rowsDeleted, durationMs) on
the actor-system EventStream so the Bundle E central health collector
and ops surfaces can subscribe without coupling to this actor.
Co-changes:
* SwitchOutPartitionAsync returns long (rows deleted) — sampled BEFORE the
switch via COUNT_BIG over the per-partition filter so the count
reflects what the switch removed, not a post-purge scan of a table that
no longer exists. All stub implementations updated.
* AuditLogPurgeOptions: IntervalHours (default 24), IntervalOverride for
tests, Interval property resolving either.
* AuditLogPurgedEvent: record with MonthBoundary, RowsDeleted, DurationMs.
Behavior:
* Continue-on-error per boundary — one partition that throws does NOT
abandon the rest of the tick.
* DI scope opened per tick (IAuditLogRepository is a SCOPED EF Core
service); mirrors SiteAuditReconciliationActor and AuditLogIngestActor.
* SupervisorStrategy Resume keeps the singleton alive across leaked
exceptions.
* EventStream capture BEFORE the first await — Context is unsafe after
await in async receive handlers (same pattern as Sender-capture in
AuditLogIngestActor.OnIngestAsync).
Tests:
* Tick_Fires_OnDailyInterval — visible timer side effect.
* Tick_OldPartitions_SwitchedOut — both seeded boundaries purged.
* Tick_NewerPartitions_Untouched — empty enumerator → no switches.
* Tick_PublishesPurgedEvent_WithRowCount — AuditLogPurgedEvent carries
RowsDeleted and DurationMs.
* Tick_SwitchThrows_OtherPartitionsStillProcessed — continue-on-error.
* Threshold_UsesAuditLogOptionsRetentionDays — non-default 30-day window
computed from UtcNow - RetentionDays.
* EndToEnd_RealPartition_RowsRemoved_PurgedEventPublished — TestKit +
MsSqlMigrationFixture: real partitioned table, Jan-2026 row purged,
Apr-2026 row kept, AuditLogPurgedEvent observed via probe.
This commit is contained in:
214
src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs
Normal file
214
src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs
Normal file
@@ -0,0 +1,214 @@
|
||||
using System.Diagnostics;
|
||||
using Akka.Actor;
|
||||
using Microsoft.Extensions.DependencyInjection;
|
||||
using Microsoft.Extensions.Logging;
|
||||
using Microsoft.Extensions.Options;
|
||||
using ScadaLink.AuditLog.Configuration;
|
||||
using ScadaLink.Commons.Interfaces.Repositories;
|
||||
|
||||
namespace ScadaLink.AuditLog.Central;
|
||||
|
||||
/// <summary>
|
||||
/// Central singleton (M6 Bundle C) that drives the daily AuditLog partition
|
||||
/// purge. On a configurable timer (default 24 hours) the actor:
|
||||
/// <list type="number">
|
||||
/// <item>Queries <see cref="IAuditLogRepository.GetPartitionBoundariesOlderThanAsync"/>
|
||||
/// for monthly boundaries whose latest <c>OccurredAtUtc</c> is older
|
||||
/// than <c>DateTime.UtcNow - RetentionDays</c>.</item>
|
||||
/// <item>For each eligible boundary, calls
|
||||
/// <see cref="IAuditLogRepository.SwitchOutPartitionAsync"/> which runs
|
||||
/// the drop-and-rebuild dance around <c>UX_AuditLog_EventId</c>.</item>
|
||||
/// <item>Publishes <see cref="AuditLogPurgedEvent"/> on the actor-system
|
||||
/// EventStream so the Bundle E central health collector + ops surfaces
|
||||
/// can subscribe without coupling to this actor.</item>
|
||||
/// </list>
|
||||
/// </summary>
|
||||
/// <remarks>
|
||||
/// <para>
|
||||
/// <b>Daily cadence.</b> Partition switch is metadata-only but the
|
||||
/// drop-and-rebuild dance briefly removes <c>UX_AuditLog_EventId</c>; running
|
||||
/// more often than necessary trades unique-index rebuild outages for
|
||||
/// negligible freshness wins. The default 24-hour interval matches
|
||||
/// alog.md §10's retention policy.
|
||||
/// </para>
|
||||
/// <para>
|
||||
/// <b>Continue-on-error.</b> A single boundary that throws (transient SQL
|
||||
/// failure, contention with backup, missing object) must NOT prevent the
|
||||
/// other eligible boundaries from being purged on the same tick. Per-boundary
|
||||
/// work runs inside its own try/catch; the actor's
|
||||
/// <see cref="SupervisorStrategy"/> uses Resume so any leaked exception keeps
|
||||
/// the singleton alive for the next tick.
|
||||
/// </para>
|
||||
/// <para>
|
||||
/// <b>DI scopes.</b> <see cref="IAuditLogRepository"/> is a scoped EF Core
|
||||
/// service registered by <c>AddConfigurationDatabase</c>. The singleton
|
||||
/// opens one DI scope per tick and reuses the same repository across every
|
||||
/// boundary in that tick — mirrors the
|
||||
/// <see cref="SiteAuditReconciliationActor"/> pattern.
|
||||
/// </para>
|
||||
/// <para>
|
||||
/// <b>EventStream.</b> Publishing <see cref="AuditLogPurgedEvent"/> through
|
||||
/// the EventStream rather than direct messaging avoids coupling this actor
|
||||
/// to its consumers; M6 Bundle E will subscribe a central health-counter
|
||||
/// bridge that surfaces purge progress on the central health report.
|
||||
/// </para>
|
||||
/// </remarks>
|
||||
public class AuditLogPurgeActor : ReceiveActor
|
||||
{
|
||||
private readonly IServiceProvider _services;
|
||||
private readonly AuditLogPurgeOptions _purgeOptions;
|
||||
private readonly AuditLogOptions _auditOptions;
|
||||
private readonly ILogger<AuditLogPurgeActor> _logger;
|
||||
private ICancelable? _timer;
|
||||
|
||||
public AuditLogPurgeActor(
|
||||
IServiceProvider services,
|
||||
IOptions<AuditLogPurgeOptions> purgeOptions,
|
||||
IOptions<AuditLogOptions> auditOptions,
|
||||
ILogger<AuditLogPurgeActor> logger)
|
||||
{
|
||||
ArgumentNullException.ThrowIfNull(services);
|
||||
ArgumentNullException.ThrowIfNull(purgeOptions);
|
||||
ArgumentNullException.ThrowIfNull(auditOptions);
|
||||
ArgumentNullException.ThrowIfNull(logger);
|
||||
|
||||
_services = services;
|
||||
_purgeOptions = purgeOptions.Value;
|
||||
_auditOptions = auditOptions.Value;
|
||||
_logger = logger;
|
||||
|
||||
ReceiveAsync<PurgeTick>(_ => OnTickAsync());
|
||||
}
|
||||
|
||||
protected override void PreStart()
|
||||
{
|
||||
base.PreStart();
|
||||
var interval = _purgeOptions.Interval;
|
||||
_timer = Context.System.Scheduler.ScheduleTellRepeatedlyCancelable(
|
||||
initialDelay: interval,
|
||||
interval: interval,
|
||||
receiver: Self,
|
||||
message: PurgeTick.Instance,
|
||||
sender: Self);
|
||||
}
|
||||
|
||||
protected override void PostStop()
|
||||
{
|
||||
_timer?.Cancel();
|
||||
base.PostStop();
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Resume keeps the singleton alive across any leaked exception. Restart
|
||||
/// would re-run PreStart and reschedule the timer (harmless but wasteful);
|
||||
/// Stop is wrong because the singleton must keep ticking until shutdown.
|
||||
/// </summary>
|
||||
protected override SupervisorStrategy SupervisorStrategy()
|
||||
{
|
||||
return new OneForOneStrategy(
|
||||
maxNrOfRetries: 0,
|
||||
withinTimeRange: TimeSpan.Zero,
|
||||
decider: Akka.Actor.SupervisorStrategy.DefaultDecider);
|
||||
}
|
||||
|
||||
private async Task OnTickAsync()
|
||||
{
|
||||
// Capture EventStream BEFORE the first await. Accessing Context (and
|
||||
// therefore Context.System) after an await is unsafe because Akka's
|
||||
// ActorBase.Context throws "no active ActorContext" once the
|
||||
// continuation runs on a thread that isn't currently dispatching this
|
||||
// actor — mirrors the same Sender-capture pattern in
|
||||
// AuditLogIngestActor.OnIngestAsync.
|
||||
var eventStream = Context.System.EventStream;
|
||||
|
||||
// Compute the retention threshold from AuditLogOptions.RetentionDays
|
||||
// each tick — the options class supports hot reload via
|
||||
// IOptionsMonitor for the redaction policy and similar settings; we
|
||||
// read the snapshot per-tick so an operator who lowers RetentionDays
|
||||
// sees the change applied on the next purge without an actor
|
||||
// restart.
|
||||
var threshold = DateTime.UtcNow - TimeSpan.FromDays(_auditOptions.RetentionDays);
|
||||
|
||||
IServiceScope? scope = null;
|
||||
IAuditLogRepository repository;
|
||||
try
|
||||
{
|
||||
scope = _services.CreateScope();
|
||||
repository = scope.ServiceProvider.GetRequiredService<IAuditLogRepository>();
|
||||
}
|
||||
catch (Exception ex)
|
||||
{
|
||||
_logger.LogError(ex, "Failed to resolve IAuditLogRepository for AuditLog purge tick.");
|
||||
scope?.Dispose();
|
||||
return;
|
||||
}
|
||||
|
||||
try
|
||||
{
|
||||
IReadOnlyList<DateTime> boundaries;
|
||||
try
|
||||
{
|
||||
boundaries = await repository
|
||||
.GetPartitionBoundariesOlderThanAsync(threshold)
|
||||
.ConfigureAwait(false);
|
||||
}
|
||||
catch (Exception ex)
|
||||
{
|
||||
_logger.LogError(
|
||||
ex,
|
||||
"Failed to enumerate eligible AuditLog partition boundaries (threshold {ThresholdUtc:o}); skipping purge tick.",
|
||||
threshold);
|
||||
return;
|
||||
}
|
||||
|
||||
if (boundaries.Count == 0)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
foreach (var boundary in boundaries)
|
||||
{
|
||||
// Per-boundary try/catch: one bad partition (transient SQL
|
||||
// failure, missing object, contention with backup) does NOT
|
||||
// abandon the rest of the tick.
|
||||
var sw = Stopwatch.StartNew();
|
||||
try
|
||||
{
|
||||
var rowsDeleted = await repository
|
||||
.SwitchOutPartitionAsync(boundary)
|
||||
.ConfigureAwait(false);
|
||||
sw.Stop();
|
||||
|
||||
eventStream.Publish(
|
||||
new AuditLogPurgedEvent(boundary, rowsDeleted, sw.ElapsedMilliseconds));
|
||||
|
||||
_logger.LogInformation(
|
||||
"Purged AuditLog partition {MonthBoundary:yyyy-MM-dd}; {RowsDeleted} rows in {DurationMs} ms.",
|
||||
boundary,
|
||||
rowsDeleted,
|
||||
sw.ElapsedMilliseconds);
|
||||
}
|
||||
catch (Exception ex)
|
||||
{
|
||||
sw.Stop();
|
||||
_logger.LogError(
|
||||
ex,
|
||||
"Failed to purge AuditLog partition {MonthBoundary:yyyy-MM-dd}; other partitions continue. Elapsed {DurationMs} ms.",
|
||||
boundary,
|
||||
sw.ElapsedMilliseconds);
|
||||
}
|
||||
}
|
||||
}
|
||||
finally
|
||||
{
|
||||
scope.Dispose();
|
||||
}
|
||||
}
|
||||
|
||||
/// <summary>Self-tick triggering a purge pass across all eligible partitions.</summary>
|
||||
internal sealed class PurgeTick
|
||||
{
|
||||
public static readonly PurgeTick Instance = new();
|
||||
private PurgeTick() { }
|
||||
}
|
||||
}
|
||||
43
src/ScadaLink.AuditLog/Central/AuditLogPurgeOptions.cs
Normal file
43
src/ScadaLink.AuditLog/Central/AuditLogPurgeOptions.cs
Normal file
@@ -0,0 +1,43 @@
|
||||
namespace ScadaLink.AuditLog.Central;
|
||||
|
||||
/// <summary>
|
||||
/// Tuning knobs for the central <see cref="AuditLogPurgeActor"/> singleton.
|
||||
/// Default cadence is 24 hours per the M6 plan; the retention window itself
|
||||
/// is sourced from <see cref="ScadaLink.AuditLog.Configuration.AuditLogOptions.RetentionDays"/>
|
||||
/// (default 365) so operators tune retention from a single section.
|
||||
/// </summary>
|
||||
/// <remarks>
|
||||
/// <para>
|
||||
/// The purge actor is a daily-cadence singleton, not a hot-loop, because
|
||||
/// partition-switch I/O is metadata-only but the drop-and-rebuild dance
|
||||
/// briefly removes the <c>UX_AuditLog_EventId</c> unique index — running
|
||||
/// more often than necessary trades index-rebuild outages for marginal
|
||||
/// freshness gains. Lower this only when an operator can prove they need
|
||||
/// sub-daily purge granularity.
|
||||
/// </para>
|
||||
/// <para>
|
||||
/// <see cref="IntervalOverride"/> exists for tests to drop the cadence to
|
||||
/// milliseconds without polluting the production config surface; production
|
||||
/// binds <see cref="IntervalHours"/> only.
|
||||
/// </para>
|
||||
/// </remarks>
|
||||
public sealed class AuditLogPurgeOptions
|
||||
{
|
||||
/// <summary>Period of the purge tick in hours (default 24).</summary>
|
||||
public int IntervalHours { get; set; } = 24;
|
||||
|
||||
/// <summary>
|
||||
/// Test-only override for finer control over the tick cadence than
|
||||
/// whole-hour resolution allows. When non-null, takes precedence over
|
||||
/// <see cref="IntervalHours"/>. Not bound from config — production
|
||||
/// config exposes <see cref="IntervalHours"/> only.
|
||||
/// </summary>
|
||||
public TimeSpan? IntervalOverride { get; set; }
|
||||
|
||||
/// <summary>
|
||||
/// Resolves the effective tick interval, honouring the test override
|
||||
/// when set. Falls back to <see cref="IntervalHours"/>.
|
||||
/// </summary>
|
||||
public TimeSpan Interval =>
|
||||
IntervalOverride ?? TimeSpan.FromHours(IntervalHours);
|
||||
}
|
||||
29
src/ScadaLink.AuditLog/Central/AuditLogPurgedEvent.cs
Normal file
29
src/ScadaLink.AuditLog/Central/AuditLogPurgedEvent.cs
Normal file
@@ -0,0 +1,29 @@
|
||||
namespace ScadaLink.AuditLog.Central;
|
||||
|
||||
/// <summary>
|
||||
/// Published on the actor-system EventStream by <see cref="AuditLogPurgeActor"/>
|
||||
/// after each successful partition switch-out. Downstream consumers (Bundle E
|
||||
/// central health collector, ops dashboards, audit trails) subscribe so a
|
||||
/// purge action is observable without the actor needing to know about any
|
||||
/// specific subscriber.
|
||||
/// </summary>
|
||||
/// <param name="MonthBoundary">
|
||||
/// The pf_AuditLog_Month lower-bound boundary that was switched out — i.e.
|
||||
/// the first instant of the purged month in UTC.
|
||||
/// </param>
|
||||
/// <param name="RowsDeleted">
|
||||
/// Approximate row count purged from the partition, sampled BEFORE the
|
||||
/// switch. Exact accounting would require a post-switch scan of the staging
|
||||
/// table, which the dance drops immediately, so this is the closest
|
||||
/// observable proxy. Zero is a valid value when the actor's enumerator
|
||||
/// included a partition the operator subsequently emptied by hand.
|
||||
/// </param>
|
||||
/// <param name="DurationMs">
|
||||
/// Wall-clock time spent inside <c>SwitchOutPartitionAsync</c> for this
|
||||
/// boundary, in milliseconds. Useful for spotting the rare slow purge
|
||||
/// without spinning up dedicated telemetry.
|
||||
/// </param>
|
||||
public sealed record AuditLogPurgedEvent(
|
||||
DateTime MonthBoundary,
|
||||
long RowsDeleted,
|
||||
long DurationMs);
|
||||
Reference in New Issue
Block a user