fix(configdb): replace SwitchOutPartitionAsync stub with drop-and-rebuild dance (#23 M6)

Replaces M1's NotSupportedException stub with the production DROP-INDEX
→ CREATE-staging → SWITCH PARTITION → DROP-staging → CREATE-INDEX dance
documented in alog.md §4. UX_AuditLog_EventId is intentionally non-aligned
with ps_AuditLog_Month so single-column EventId uniqueness can be enforced
cheaply for InsertIfNotExistsAsync; SQL Server rejects ALTER TABLE SWITCH
while a non-aligned unique index is present, so the implementation drops
it, switches the partition data into a GUID-suffixed staging table on
[PRIMARY], drops staging (discarding the rows), and rebuilds the unique
index — all inside an explicit transaction with a CATCH that guarantees
the unique index is rebuilt regardless of failure point.
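
The sequence above can be sketched in T-SQL roughly as follows. This is an
illustration under stated assumptions, not the shipped implementation: the
two-column dbo.AuditLog schema, the PK constraint names, the RANGE RIGHT
boundary values, and the fixed staging name AuditLog_Staging_demo (the real
one is GUID-suffixed) are all hypothetical.

```sql
-- Hypothetical minimal setup; the real AuditLog table has more columns.
CREATE PARTITION FUNCTION pf_AuditLog_Month (datetime2(7))
    AS RANGE RIGHT FOR VALUES ('2026-01-01', '2026-02-01', '2026-03-01');
CREATE PARTITION SCHEME ps_AuditLog_Month
    AS PARTITION pf_AuditLog_Month ALL TO ([PRIMARY]);
CREATE TABLE dbo.AuditLog (
    EventId       uniqueidentifier NOT NULL,
    OccurredAtUtc datetime2(7)     NOT NULL,
    CONSTRAINT PK_AuditLog PRIMARY KEY CLUSTERED (EventId, OccurredAtUtc)
) ON ps_AuditLog_Month (OccurredAtUtc);
-- Non-aligned: the index lives on [PRIMARY], not on the partition scheme.
CREATE UNIQUE NONCLUSTERED INDEX UX_AuditLog_EventId
    ON dbo.AuditLog (EventId) ON [PRIMARY];
GO  -- batch separator (sqlcmd / SSMS)

DECLARE @monthBoundary datetime2(7) = '2026-01-01';
-- With RANGE RIGHT, the boundary value belongs to the partition it opens.
DECLARE @partition int = $PARTITION.pf_AuditLog_Month(@monthBoundary);

SET XACT_ABORT ON;
BEGIN TRY
    BEGIN TRANSACTION;

    -- 1. Drop the non-aligned unique index that blocks SWITCH.
    DROP INDEX UX_AuditLog_EventId ON dbo.AuditLog;

    -- 2. Staging table with the same columns and clustered key, on the
    --    same filegroup as the source partition.
    CREATE TABLE dbo.AuditLog_Staging_demo (
        EventId       uniqueidentifier NOT NULL,
        OccurredAtUtc datetime2(7)     NOT NULL,
        CONSTRAINT PK_AuditLog_Staging_demo
            PRIMARY KEY CLUSTERED (EventId, OccurredAtUtc)
    ) ON [PRIMARY];

    -- 3. Metadata-only move of the partition's rows into staging.
    ALTER TABLE dbo.AuditLog
        SWITCH PARTITION @partition TO dbo.AuditLog_Staging_demo;

    -- 4. Dropping staging discards the switched-out rows.
    DROP TABLE dbo.AuditLog_Staging_demo;

    -- 5. Restore the unique index.
    CREATE UNIQUE NONCLUSTERED INDEX UX_AuditLog_EventId
        ON dbo.AuditLog (EventId) ON [PRIMARY];

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;

    -- Rolling back also undoes the DROP INDEX; this check only covers the
    -- case where the index is still missing afterwards.
    IF NOT EXISTS (SELECT 1 FROM sys.indexes
                   WHERE object_id = OBJECT_ID(N'dbo.AuditLog')
                     AND name = N'UX_AuditLog_EventId')
        CREATE UNIQUE NONCLUSTERED INDEX UX_AuditLog_EventId
            ON dbo.AuditLog (EventId) ON [PRIMARY];

    THROW;  -- surface the original error to the caller
END CATCH;
```

Because a GUID-suffixed staging name is only known at run time, a real
implementation would likely have to build steps 2 to 4 as dynamic SQL; the
fixed name here only keeps the sketch readable.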

Also adds GetPartitionBoundariesOlderThanAsync to IAuditLogRepository: a
CROSS APPLY over sys.partition_range_values + per-partition MAX(OccurredAtUtc)
to enumerate retention-eligible months for the M6 purge actor (next commit).
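
A query of that shape might look like the sketch below. It assumes a table
named dbo.AuditLog partitioned on OccurredAtUtc by a RANGE RIGHT function
pf_AuditLog_Month over datetime2(7); the table name, the column type, the
@threshold value, and the aliases are assumptions, and the committed query
may differ.

```sql
DECLARE @threshold datetime2(7) = '2026-02-15';

SELECT CAST(prv.value AS datetime2(7)) AS LowerBound
FROM sys.partition_functions AS pf
JOIN sys.partition_range_values AS prv
    ON prv.function_id = pf.function_id
CROSS APPLY (
    -- With RANGE RIGHT, boundary N is the lower bound of partition N + 1.
    SELECT MAX(a.OccurredAtUtc) AS MaxOccurredAtUtc
    FROM dbo.AuditLog AS a
    WHERE $PARTITION.pf_AuditLog_Month(a.OccurredAtUtc) = prv.boundary_id + 1
) AS m
WHERE pf.name = N'pf_AuditLog_Month'
  -- MAX over an empty partition is NULL, so the comparison is not true
  -- and empty partitions drop out without a separate check.
  AND m.MaxOccurredAtUtc < @threshold
ORDER BY LowerBound;
```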

Tests verify:
* Old partition's rows are removed; other months untouched
* UX_AuditLog_EventId is rebuilt after a successful switch
* InsertIfNotExistsAsync's first-write-wins idempotency still holds after switch
* On engineered SWITCH failure (inbound FK from a probe table), SqlException
  propagates AND UX_AuditLog_EventId is still present (CATCH branch ran)
* GetPartitionBoundariesOlderThanAsync returns only boundaries whose partition's
  MAX(OccurredAtUtc) is strictly older than the threshold; empty partitions
  excluded
Joseph Doherty
2026-05-20 18:20:55 -04:00
parent c763bd9a04
commit 6069a20e0f
5 changed files with 445 additions and 24 deletions

@@ -45,12 +45,43 @@ public interface IAuditLogRepository
 /// <summary>
 /// Switches out (purges) the monthly partition whose lower bound is
-/// <paramref name="monthBoundary"/>. The honest M1 implementation throws
-/// <see cref="NotSupportedException"/>: the <c>UX_AuditLog_EventId</c> unique
-/// index is non-partition-aligned (lives on <c>[PRIMARY]</c>, not on
-/// <c>ps_AuditLog_Month</c>), so SQL Server rejects
-/// <c>ALTER TABLE … SWITCH PARTITION</c> until the drop-and-rebuild dance
-/// shipped by the M6 purge actor is in place.
+/// <paramref name="monthBoundary"/>.
 /// </summary>
+/// <remarks>
+/// <para>
+/// <b>Drop-and-rebuild dance.</b> <c>UX_AuditLog_EventId</c> is intentionally
+/// non-partition-aligned (it lives on <c>[PRIMARY]</c> so single-column
+/// EventId uniqueness — required by <see cref="InsertIfNotExistsAsync"/> —
+/// can be enforced cheaply). SQL Server rejects
+/// <c>ALTER TABLE … SWITCH PARTITION</c> while a non-aligned unique index
+/// is present, so the M6 implementation drops the index, creates a staging
+/// table with byte-identical schema, switches the partition's data into
+/// staging, drops staging (discarding the rows), and rebuilds the unique
+/// index. The CATCH branch guarantees the index is rebuilt even on partial
+/// failure so the table never returns to live traffic without its
+/// idempotency-supporting index.
+/// </para>
+/// <para>
+/// <b>Outage window.</b> The dance briefly removes the unique index, so
+/// concurrent <see cref="InsertIfNotExistsAsync"/> calls during the switch
+/// could in principle race past the IF NOT EXISTS check without the index
+/// catching the duplicate. This is acceptable for the daily purge cadence
+/// — the inserts that the IF NOT EXISTS check guards are themselves rare
+/// enough that a sub-second collision window is operationally negligible,
+/// and the composite PK still rejects same-(EventId, OccurredAtUtc) rows.
+/// </para>
+/// </remarks>
 Task SwitchOutPartitionAsync(DateTime monthBoundary, CancellationToken ct = default);
+/// <summary>
+/// Returns the set of <c>pf_AuditLog_Month</c> partition lower-bound
+/// boundaries whose partitions contain only rows with
+/// <see cref="AuditEvent.OccurredAtUtc"/> strictly older than
+/// <paramref name="threshold"/>. Boundaries whose partition is empty are
+/// excluded (a no-op switch is wasted work). Used by the M6 purge actor
+/// to enumerate retention-eligible months on every tick.
+/// </summary>
+Task<IReadOnlyList<DateTime>> GetPartitionBoundariesOlderThanAsync(
+DateTime threshold,
+CancellationToken ct = default);
 }