docs(audit): roadmap corrections after M1

- M2 head: honor M1 vocabulary (ApiCall/Delivered), harden InsertIfNotExistsAsync
  (race window — first concurrent writer arrives in M2), add keyset-tiebreaker
  test (Bundle D reviewer's deferred recommendation), reuse MsSqlMigrationFixture
  + Xunit.SkippableFact pattern.
- M6-T4 (AuditLogPurgeActor): replace M1's NotSupportedException stub with the
  drop-and-rebuild dance for the non-aligned UX_AuditLog_EventId unique index;
  acknowledge the small outage window during partition SWITCH.
- M6-T5 (partition maintenance): note M1 ships 24 monthly boundaries (Jan 2026 -
  Dec 2027); service rolls the function forward via SPLIT RANGE.
Joseph Doherty
2026-05-20 11:58:56 -04:00
parent 6064c5c0fc
commit ed442c7c8c


@@ -250,10 +250,17 @@ The design for both is merged on `main` (`alog.md` cached-call tracking section;
## M2 — Site pipeline (sync-only path)
**Goal:** First end-to-end audit emission: a script-initiated `ExternalSystem.Call()` produces an audit row in the central `AuditLog` table. No cached paths yet, no notifications, no inbound API, no UI. Just one channel + kind: `ApiOutbound.SyncCall`. **Goal:** First end-to-end audit emission: a script-initiated `ExternalSystem.Call()` produces an audit row in the central `AuditLog` table. No cached paths yet, no notifications, no inbound API, no UI. Just one channel + kind: `ApiOutbound` / `ApiCall`.
**Affected projects:** `Commons`, `AuditLog` (new), `Communication`, `Host`, `ExternalSystemGateway`, all matching `*.Tests/`, `tests/ScadaLink.IntegrationTests/`.
> **M1 realities to honor:**
> - **Vocabulary**: M1 enums use `AuditKind.ApiCall` (sync) and `AuditStatus.Delivered|Failed`. The original spec's `SyncCall` / `Success` names were superseded; alog.md + Component-AuditLog.md were reconciled in the M1 merge.
> - **Idempotent insert race**: M1's `AuditLogRepository.InsertIfNotExistsAsync` uses non-locking `IF NOT EXISTS … INSERT`. M2 is the first concurrent writer (`AuditLogIngestActor` will receive batches from multiple sites). **Harden the repo before relying on it** — either add `WITH (UPDLOCK, HOLDLOCK)` to the existence check, or catch SqlException numbers 2601/2627 (duplicate key on `UX_AuditLog_EventId`) and swallow. Add a new task at the head of M2 for this fix and its concurrency test.
> - **Keyset tiebreaker test gap**: M1's `QueryAsync_Keyset_NextPageStartsAfterCursor` test uses five rows with distinct `OccurredAtUtc`, so the `Guid.CompareTo` tiebreaker branch is never exercised. Add a same-OccurredAt test in M2 (Bundle D reviewer's deferred recommendation).
> - **Reusable MSSQL fixture**: `tests/ScadaLink.ConfigurationDatabase.Tests/Migrations/MsSqlMigrationFixture.cs` + `[SkippableFact]` + `Skip.IfNot(_fixture.Available, _fixture.SkipReason)` is the established pattern. Consider promoting it to a `[CollectionDefinition]`-shared fixture when M2+ adds more MSSQL-dependent test classes.
> - **Project layout**: `src/ScadaLink.AuditLog/` is wired into the solution with `Configuration/AuditLogOptions.cs` + validator + `ServiceCollectionExtensions.AddAuditLog()`. M2's `Site/` and `Central/` subfolders attach to this project; the DI extension is the registration point.
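
A minimal T-SQL sketch of the hardening described in the idempotent-insert bullet above. It is illustrative only: `dbo.AuditLog_RaceDemo` and its index are hypothetical stand-ins, and only the `EventId` and `OccurredAtUtc` column names come from this roadmap (the `datetime2(7)` type is an assumption). The sketch combines both suggested options, the locking hints on the existence check and the 2601/2627 swallow, although either one alone may be sufficient.

```sql
-- Hypothetical stand-in for dbo.AuditLog; the real column list is not reproduced here.
CREATE TABLE dbo.AuditLog_RaceDemo (
    EventId       uniqueidentifier NOT NULL,
    OccurredAtUtc datetime2(7)     NOT NULL
);
CREATE UNIQUE NONCLUSTERED INDEX UX_AuditLog_RaceDemo_EventId
    ON dbo.AuditLog_RaceDemo (EventId);

DECLARE @EventId uniqueidentifier = NEWID();
DECLARE @OccurredAtUtc datetime2(7) = SYSUTCDATETIME();

BEGIN TRY
    -- The explicit transaction matters: without it the locks taken by the
    -- existence check are released before the INSERT runs.
    BEGIN TRANSACTION;

    -- HOLDLOCK keeps the key range for @EventId locked until the transaction ends;
    -- UPDLOCK makes a second session running the same check wait here.
    IF NOT EXISTS (
        SELECT 1
        FROM dbo.AuditLog_RaceDemo WITH (UPDLOCK, HOLDLOCK)
        WHERE EventId = @EventId
    )
    BEGIN
        INSERT INTO dbo.AuditLog_RaceDemo (EventId, OccurredAtUtc)
        VALUES (@EventId, @OccurredAtUtc);
    END

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

    -- 2601 = duplicate key row in a unique index; 2627 = unique constraint violation.
    -- Both mean another writer got there first, so treat them as a no-op.
    IF ERROR_NUMBER() NOT IN (2601, 2627)
        THROW;
END CATCH;
```

Swallowing the duplicate-key errors keeps the first-write-wins behavior: a losing writer sees the same outcome as if the existence check had found the row.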
**Acceptance criteria:**
- Site-local `IAuditWriter` writes to a per-site SQLite `auditlog.db` on the hot path with `ForwardState = 'Pending'`; durability is sub-millisecond; failures fall back to a bounded in-memory ring and surface a metric.
- `SiteAuditTelemetryActor` drains pending rows in batches via a new `IngestAuditEvents` RPC on the existing `SiteStream` gRPC service; on success flips `ForwardState = 'Forwarded'`.
@@ -1078,29 +1085,36 @@ The design for both is merged on `main` (`alog.md` cached-call tracking section;
#### M6-T4: `AuditLogPurgeActor` — daily partition-switch purge
> **M1 reality**: `IAuditLogRepository.SwitchOutPartitionAsync` ships in M1 as a `NotSupportedException` stub because the non-aligned `UX_AuditLog_EventId` unique index (necessary for first-write-wins idempotency without including `OccurredAtUtc` in the unique key) blocks `ALTER TABLE … SWITCH PARTITION`. **M6 must replace the stub with the drop-and-rebuild dance**: (1) `DROP INDEX UX_AuditLog_EventId ON dbo.AuditLog;` (2) create the staging table on `[PRIMARY]` with identical schema; (3) `ALTER TABLE dbo.AuditLog SWITCH PARTITION <n> TO dbo.<staging>;` (4) `DROP TABLE dbo.<staging>;` (5) `CREATE UNIQUE NONCLUSTERED INDEX UX_AuditLog_EventId ON dbo.AuditLog(EventId) ON [PRIMARY];`. The small unique-index outage window during the switch is acceptable — partition switches are O(seconds) and `InsertIfNotExistsAsync` callers will see a transient retry surface; document this in the actor.
**Files:**
- Create: `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs` — central singleton; daily timer. For each partition whose latest `OccurredAtUtc` is older than `AuditLogOptions.RetentionDays`, call `IAuditLogRepository.SwitchOutPartitionAsync(partitionBoundary)`. Emit an `AuditLogPurged` event (logged + metricked) with partition range, row count, and duration.
- Modify: `src/ScadaLink.ConfigurationDatabase/Repositories/AuditLogRepository.cs` — replace the M1 `NotSupportedException` stub with the drop-and-rebuild dance described above. Wrap in a transaction. Add a regression test asserting the unique index is rebuilt and the data left behind matches the un-switched partitions.
- Create: `tests/ScadaLink.AuditLog.Tests/Central/AuditLogPurgeActorTests.cs`.
**Steps:**
1. Failing test: with retention = 30 days, partitions older than 30 days are switched out; newer partitions are kept.
2. Failing test: purge emits the `AuditLogPurged` event with correct row count.
3. Failing test: partition switch under the `scadalink_audit_purger` role completes successfully. 3. Failing test: partition switch under the `scadalink_audit_purger` role completes successfully (requires the role to ALSO be granted permission to DROP/CREATE the `UX_AuditLog_EventId` index — extend the role grants in this milestone if not in M1's role definition; M1 granted `ALTER ON SCHEMA::dbo` which should cover this).
4. Implement. 4. Failing test: post-switch, `InsertIfNotExistsAsync` continues to enforce first-write-wins (unique index successfully rebuilt).
5. Run: pass. 5. Implement.
6. Commit: `feat(auditlog): AuditLogPurgeActor with partition-switch purge`. 6. Run: pass.
7. Commit: `feat(auditlog): AuditLogPurgeActor with partition-switch purge (drop-and-rebuild around UX_AuditLog_EventId)`.
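
The drop-and-rebuild sequence for M6-T4 can be sketched in T-SQL as follows. This is a sketch under stated assumptions, not the repository implementation: every `*_Demo` object is hypothetical, the two-column schema and the `RANGE RIGHT` choice are assumptions, and the real staging table would need to mirror `dbo.AuditLog` exactly.

```sql
-- Hypothetical demo objects standing in for pf_AuditLog_Month and dbo.AuditLog.
CREATE PARTITION FUNCTION pf_Demo_Month (datetime2(7))
    AS RANGE RIGHT FOR VALUES ('2026-01-01', '2026-02-01');
CREATE PARTITION SCHEME ps_Demo_Month
    AS PARTITION pf_Demo_Month ALL TO ([PRIMARY]);

CREATE TABLE dbo.AuditLog_Demo (
    EventId       uniqueidentifier NOT NULL,
    OccurredAtUtc datetime2(7)     NOT NULL,
    CONSTRAINT PK_AuditLog_Demo PRIMARY KEY CLUSTERED (OccurredAtUtc, EventId)
) ON ps_Demo_Month (OccurredAtUtc);

-- Non-aligned unique index: stored on [PRIMARY] rather than on the partition scheme.
CREATE UNIQUE NONCLUSTERED INDEX UX_AuditLog_Demo_EventId
    ON dbo.AuditLog_Demo (EventId) ON [PRIMARY];

BEGIN TRANSACTION;

-- (1) Drop the non-aligned index that blocks SWITCH PARTITION.
DROP INDEX UX_AuditLog_Demo_EventId ON dbo.AuditLog_Demo;

-- (2) Empty staging table on the same filegroup, with matching columns and clustered key.
CREATE TABLE dbo.AuditLog_Demo_Staging (
    EventId       uniqueidentifier NOT NULL,
    OccurredAtUtc datetime2(7)     NOT NULL,
    CONSTRAINT PK_AuditLog_Demo_Staging PRIMARY KEY CLUSTERED (OccurredAtUtc, EventId)
) ON [PRIMARY];

-- (3) Metadata-only move of the oldest partition into staging.
ALTER TABLE dbo.AuditLog_Demo SWITCH PARTITION 1 TO dbo.AuditLog_Demo_Staging;

-- (4) Discard the purged rows.
DROP TABLE dbo.AuditLog_Demo_Staging;

-- (5) Rebuild the unique index so first-write-wins is enforced again.
CREATE UNIQUE NONCLUSTERED INDEX UX_AuditLog_Demo_EventId
    ON dbo.AuditLog_Demo (EventId) ON [PRIMARY];

COMMIT TRANSACTION;
```

Running all five steps in one transaction means a failure part-way through rolls back to the indexed state. The trade-off is that writers may block for the duration of the index rebuild.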
#### M6-T5: `AuditLogPartitionMaintenanceService` — monthly roll-forward
> **M1 reality**: the partition function `pf_AuditLog_Month` ships with 24 explicit monthly boundaries (Jan 2026 through Dec 2027) on filegroup `[PRIMARY]`. M6's hosted service must keep this rolling — split a new boundary for the upcoming month and (if a separate hot/cold filegroup strategy is adopted later) drop oldest boundaries via MERGE after purge.
**Files:**
- Create: `src/ScadaLink.AuditLog/Central/AuditLogPartitionMaintenanceService.cs` — `IHostedService` that runs on startup AND every month: ensures the next month's partition range exists on `pf_AuditLog_Month` and the partition scheme has a destination filegroup. Implemented via raw SQL (`ALTER PARTITION FUNCTION ... SPLIT RANGE`). - Create: `src/ScadaLink.AuditLog/Central/AuditLogPartitionMaintenanceService.cs` — `IHostedService` that runs on startup AND every month: ensures the next month's partition range exists on `pf_AuditLog_Month` and the partition scheme has a destination filegroup. Implemented via raw SQL (`ALTER PARTITION FUNCTION pf_AuditLog_Month() SPLIT RANGE (<next-month-boundary>)`); ensure the scheme stays `ALL TO ([PRIMARY])` unless production deployment overrides per-filegroup.
- Create: `tests/ScadaLink.AuditLog.Tests/Central/PartitionMaintenanceServiceTests.cs` (integration; runs against a temp DB). - Create: `tests/ScadaLink.AuditLog.Tests/Central/PartitionMaintenanceServiceTests.cs` (integration via `MsSqlMigrationFixture`; runs against a temp DB).
**Steps:**
1. Failing test: after service runs, the partition function has ranges covering "current month + next month". 1. Failing test: against a DB seeded with the M1 migration (covering through Dec 2027), running the service in Apr 2028 splits a Jan 2028 boundary so the function has a range for "current month + at least the next month".
2. Implement.
3. Run: pass. 3. Failing test: subsequent monthly runs add successive future boundaries (idempotent: already-split boundaries are no-ops, not errors).
4. Commit: `feat(auditlog): partition maintenance HostedService`. 4. Run: pass.
5. Commit: `feat(auditlog): partition maintenance HostedService (SPLIT RANGE roll-forward)`.
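
One way the idempotent roll-forward for M6-T5 might look in T-SQL. This is a hedged sketch: `pf_SplitDemo_Month` and `ps_SplitDemo_Month` are hypothetical stand-ins for `pf_AuditLog_Month` and its scheme (whose name this roadmap does not give), and `RANGE RIGHT` with `datetime2(7)` boundaries is an assumption.

```sql
-- Hypothetical demo objects; the last boundary mirrors M1's final one (Dec 2027).
CREATE PARTITION FUNCTION pf_SplitDemo_Month (datetime2(7))
    AS RANGE RIGHT FOR VALUES ('2027-11-01', '2027-12-01');
CREATE PARTITION SCHEME ps_SplitDemo_Month
    AS PARTITION pf_SplitDemo_Month ALL TO ([PRIMARY]);

-- The boundary the service wants to exist; a real run would derive it from the clock.
DECLARE @NextBoundary datetime2(7) = '2028-01-01';

-- Only split when the boundary is missing, so a repeated run is a no-op, not an error.
IF NOT EXISTS (
    SELECT 1
    FROM sys.partition_functions AS pf
    JOIN sys.partition_range_values AS prv
        ON prv.function_id = pf.function_id
    WHERE pf.name = N'pf_SplitDemo_Month'
      AND CAST(prv.value AS datetime2(7)) = @NextBoundary
)
BEGIN
    -- Each split consumes the scheme's NEXT USED filegroup, so set it every time.
    ALTER PARTITION SCHEME ps_SplitDemo_Month NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pf_SplitDemo_Month() SPLIT RANGE (@NextBoundary);
END;
```

Checking `sys.partition_range_values` first avoids relying on the error SQL Server raises when a boundary already exists. A service that has been offline for several months would run the same check once per missing boundary.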
#### M6-T6: Health metric `SiteAuditBacklog`