scadalink-design/docs/plans/2026-05-20-audit-log-code-roadmap.md
Joseph Doherty d3d4a5b13d docs(audit): add 8-milestone code implementation roadmap
Roadmap covering Audit Log (#23) code implementation across 8 milestones
(M1 Foundation → M8 CLI). Reflects the actual state of the codebase —
all 22 prior components have source + tests, but Site Call Audit (#22)
and cached-call tracking are design-only despite being on main; their
minimum surface is inlined into M3.

M1 is laid out at full TDD-level task detail (11 bite-sized tasks).
M2–M8 are at milestone-shape detail (goals, files, task headlines,
acceptance criteria, risk callouts). Per-milestone bite-sized plans
will be generated by brainstorm + writing-plans when each milestone is
about to execute — locking 80 task cards now would mostly be stale by
M5 as M1 reveals codebase realities.

Critical path: M1 → M2 → (M3 ∥ M4 ∥ M5) → M6 → (M7 ∥ M8).

Spec: docs/requirements/Component-AuditLog.md + alog.md (commit
fec0bb1).
2026-05-20 09:22:18 -04:00

# Audit Log (#23) Code Implementation Roadmap
> **For Claude:** REQUIRED SUB-SKILL FLOW per milestone: `brainstorming` → `writing-plans` → `subagent-driven-development`. Use `docs/requirements/Component-AuditLog.md` + `alog.md` as the spec; this document is the roadmap that sequences milestones and locks acceptance criteria for each. **M1 carries full TDD-level task detail; M2–M8 are milestone-shape detail and will be expanded into bite-sized plans by their own writing-plans pass when their turn comes.**
**Goal:** Implement central component #23 Audit Log — append-only forensic + operational record across every script-trust-boundary action — into the existing ScadaLink codebase.
**Architecture:** Layered alongside (not replacing) the future Notifications/SiteCalls operational stores. Site-local SQLite hot-path append + gRPC telemetry batches + reconciliation pulls; central direct-write for Inbound API and Notification Outbox dispatch; monthly-partitioned MS SQL with single global retention; strict append-only enforced via DB roles. See `alog.md` for the locked design decisions and `Component-AuditLog.md` for the component spec.
**Tech Stack:** Akka.NET (clustering, singletons, ClusterClient), EF Core (MS SQL provider, code-first migrations), Microsoft.Data.SqlClient, Microsoft.Data.Sqlite, gRPC (HTTP/2 server-streaming on the existing `SiteStream` channel), ASP.NET Core (Inbound API middleware), Blazor Server + Bootstrap (Central UI), System.CommandLine (CLI), xUnit + Akka.TestKit.Xunit2 + NSubstitute (tests).
**Spec:** `/Users/dohertj2/Desktop/scadalink-design/alog.md` (validated, immutable; commit `fec0bb1`). Component design at `/Users/dohertj2/Desktop/scadalink-design/docs/requirements/Component-AuditLog.md`.
---
## Codebase Reality Check (what already exists)
- **All 22 prior components have source + tests.** Audit Log slots in as a new `src/ScadaLink.AuditLog/` project plus changes to: Commons, ConfigurationDatabase, Communication (proto), Host (DI + actor registration), ExternalSystemGateway, InboundAPI, NotificationOutbox, HealthMonitoring, CentralUI, CLI, SiteRuntime (audit hook surface).
- **Existing patterns to copy from:**
  - Singleton wiring: `src/ScadaLink.Host/Actors/AkkaHostedService.cs:272–280` (NotificationOutboxActor) — `ClusterSingletonManager.Props` + manager/proxy pair.
  - EF migration: `src/ScadaLink.ConfigurationDatabase/Migrations/20260519050659_AddNotificationsTable.cs` — table create + indexes; **no partitioning yet — Audit Log will be the first.**
  - Site SQLite hot-path: `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:28–98` — single connection, write lock, Channel-based background writer.
  - Site-buffer + forwarder: `src/ScadaLink.StoreAndForward/`: `StoreAndForwardStorage` + `NotificationForwarder` show the Pending → Forwarded transition we'll mirror.
  - Actor + repo + test trio: `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs` and `tests/ScadaLink.NotificationOutbox.Tests/NotificationOutboxActorIngestTests.cs:20` — TestKit base class, NSubstitute repo, `Sys.ActorOf`, `ExpectMsg<T>`.
  - gRPC additive: `src/ScadaLink.Communication/Protos/sitestream.proto` — currently carries only `AttributeValueUpdate` and `AlarmStateUpdate` in a `oneof`; we extend it.
  - CLI command shape: `src/ScadaLink.CLI/Commands/AuditLogCommands.cs:1–53` — System.CommandLine pattern; new group will live alongside it (the file's existing commands are for the IAuditService config audit and stay).
  - Blazor listing page: `src/ScadaLink.CentralUI/Components/Pages/Notifications/NotificationReport.razor` — filter bar + keyset paging + status badges idiom.
- **`AuditLog.razor` and `AuditLogCommands.cs` already exist** but they're the **IAuditService config-change viewer**. Per the design pass we renamed them in docs to "Configuration Audit Log Viewer"; in code they'll be renamed (file + URL + command name) so the new operational Audit Log can take the unqualified name.
- **Test framework:** xUnit + Akka.TestKit.Xunit2 + NSubstitute. Integration tests under `tests/ScadaLink.IntegrationTests/`. Playwright UI tests under `tests/ScadaLink.CentralUI.PlaywrightTests/`. A `tests/ScadaLink.PerformanceTests/` exists for load tests.
---
## Prerequisite: Site Call Audit (#22) + cached-call tracking are NOT implemented in code
The design for both is merged on `main` (`alog.md` cached-call tracking section; `Component-SiteCallAudit.md`), but `grep` finds zero references to `TrackedOperationId` or `CachedCallTelemetry` in `src/`. This matters because **M3 (cached operations + dual-write transaction) cannot be built without them**.
**Three ways to handle this — pick before M3:**
1. **Inline into M3 (Recommended):** Implement just enough of Site Call Audit (#22) and cached-call tracking inside M3 — specifically the `CachedCallTelemetry` message, the operational-tracking SQLite table at sites, the `SiteCalls` table + repo + `SiteCallAuditActor` skeleton at central. This makes M3 the biggest milestone but ships a coherent slice (cached calls audited end-to-end).
2. **M0 prerequisite milestone:** Implement #22 and cached-call tracking as a separate slice before M3 starts. Cleanest dependency story; slowest to first-audit-row.
3. **Ship Audit Log sync-only first, retrofit cached path later:** M1, M2, M4 (sync-only emissions), M5, M6 (no cached features), M7, M8 ship as-is; cached audit is a separate follow-up. Lowest first-shippable scope but leaves cached calls unaudited until much later.
**Default choice in this roadmap: (1).** M3 absorbs the minimum #22 + cached-call tracking surface needed to make combined telemetry work; the rest of #22 (full reconciliation, KPIs, Retry/Discard relay) can be a follow-up.
---
## Milestone index
| M | Title | Ships | Touches | Depends on |
|---|---|---|---|---|
| **M1** | Foundation: schema, types, DB roles, partitioning | Migration deployed; Commons types exist; no observable behavior yet. | Commons, ConfigurationDatabase, ConfigurationDatabase.Tests | — |
| **M2** | Site pipeline (sync-only path) | One emission path end-to-end (ESG sync `Call()` audited from script to central row). | Commons, AuditLog (new), Communication (proto), Host, ExternalSystemGateway, all Tests projects, IntegrationTests | M1 |
| **M3** | Cached operations + dual-write transaction | Cached external calls and DB writes audited; SiteCalls table populated alongside; combined telemetry packet contract live. | Commons, AuditLog, SiteCallAudit (new), ConfigurationDatabase, ExternalSystemGateway, StoreAndForward, Host | M2; #22 + cached-call tracking inlined here per the prerequisite section |
| **M4** | Remaining boundary emission | All four channels emitting: sync DB writes/reads, Notify dispatcher attempt/terminal, Inbound API middleware. | ExternalSystemGateway, InboundAPI, NotificationOutbox, SiteRuntime (Database surface) | M2; M3 (NotificationOutbox terminal/attempt uses ICentralAuditWriter pattern) |
| **M5** | Payload + redaction policy | Header redaction, body redactor regex, SQL parameter redaction, safety net, configuration binding. | AuditLog, ExternalSystemGateway, InboundAPI, all emitter projects | M2 |
| **M6** | Reconciliation, purge, partition maintenance, health metrics | Self-healing telemetry, monthly partition switch, the five new health metrics + their dashboard tiles. | AuditLog, ConfigurationDatabase (partition maintenance), HealthMonitoring | M2, M3 |
| **M7** | Central UI — new Audit Log page + drill-ins + KPI tiles | User-visible Audit Log surface; existing `AuditLog.razor` renamed to ConfigurationAuditLog. | CentralUI, CentralUI.Tests, CentralUI.PlaywrightTests | M2, M4, M6 |
| **M8** | CLI — `scadalink audit query / export / verify-chain` | Operator surface for query/export; `verify-chain` is a no-op stub until v1.x hash chain ships. | CLI, ManagementService (HTTP endpoint), CLI.Tests, IntegrationTests | M2 |
**Ship-state at end of each milestone is the shippable slice** — each milestone leaves the system in a working, testable, deployable state (no half-built actors mid-pipeline). M1 ships no user-visible behaviour but produces a clean foundation; from M2 onward each ships an observable audit capability.
**Critical path:** M1 → M2 → (M3 ∥ M4 ∥ M5) → M6 → (M7 ∥ M8). M3, M4, M5 can overlap once M2 is solid. M7 and M8 can overlap once M6 lands.
---
## M1 — Foundation: schema, types, DB roles, partitioning
**Goal:** Land the new `AuditLog` table (partitioned) and DB roles in MS SQL, plus the Commons types every later milestone needs. After M1 the database is ready and types compile; nothing else changes.
**Affected projects:**
- `src/ScadaLink.Commons/` — entity, enums, interfaces, message DTOs.
- `src/ScadaLink.ConfigurationDatabase/` — EF mapping, DbContext registration, migration, DB role script, partition function/scheme, retention options.
- `tests/ScadaLink.Commons.Tests/` — enum + record tests.
- `tests/ScadaLink.ConfigurationDatabase.Tests/` — migration tests, repo tests.
**Acceptance criteria:**
- `dotnet build` of the solution succeeds.
- `dotnet ef database update` against a dev MS SQL applies the migration; `AuditLog` table exists, partitioned monthly on `OccurredAtUtc`, with PK on `EventId` and the five expected indexes.
- `scadalink_audit_writer` and `scadalink_audit_purger` SQL roles exist with the documented grants; a smoke test confirms `UPDATE AuditLog` from the writer role fails.
- `AuditEvent` record, `AuditChannel`/`AuditKind`/`AuditStatus` enums, `IAuditWriter`/`ICentralAuditWriter` interfaces, `AuditTelemetryEnvelope`/`PullAuditEvents` message DTOs all exist in Commons in the right folders.
- `IAuditLogRepository` interface (Commons) and EF implementation (ConfigurationDatabase) exist; the implementation only exposes `InsertIfNotExistsAsync`, paged read, and `SwitchOutPartitionAsync` — no update or row-delete.
- All new tests pass; no existing tests regress.
### M1 — Tasks (TDD-detail)
#### M1-T1: Add audit enums to Commons
**Files:**
- Create: `src/ScadaLink.Commons/Types/Enums/AuditChannel.cs`, `AuditKind.cs`, `AuditStatus.cs`.
- Create: `tests/ScadaLink.Commons.Tests/Types/Enums/AuditEnumTests.cs`.
**Steps:**
1. Write failing test verifying `AuditChannel` has exactly `ApiOutbound | DbOutbound | Notification | ApiInbound` (asserting `Enum.GetValues` length and members).
2. Same for `AuditKind` (10 members per `Component-AuditLog.md`).
3. Same for `AuditStatus` (8 members).
4. Run: tests fail (enums don't exist). Implement the three enums.
5. Run tests: pass.
6. Commit: `feat(commons): add Audit{Channel,Kind,Status} enums for #23`.
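A minimal sketch of the step 1 assertion, assuming a plain `Enum.GetNames` check rather than any particular assertion library. The member list is the one given in step 1; the declaration order and the `AuditEnumChecks` helper are hypothetical, and the real test would be an xUnit `[Fact]` in `AuditEnumTests.cs`.

```csharp
using System;
using System.Linq;

// Members as listed in step 1; the declaration order here is an assumption.
public enum AuditChannel
{
    ApiOutbound,
    DbOutbound,
    Notification,
    ApiInbound
}

// Hypothetical helper standing in for the xUnit test body.
public static class AuditEnumChecks
{
    public static bool HasExactlyExpectedMembers()
    {
        var expected = new[] { "ApiOutbound", "DbOutbound", "Notification", "ApiInbound" };
        var actual = Enum.GetNames(typeof(AuditChannel));

        // Same count and same set of names, independent of declaration order.
        return actual.Length == expected.Length
            && expected.All(name => actual.Contains(name));
    }
}
```

Checking both the count and the names means an accidentally added member fails the test as well as a removed one.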
#### M1-T2: Add AuditEvent record + ForwardState enum
**Files:**
- Create: `src/ScadaLink.Commons/Entities/Audit/AuditEvent.cs` — public record carrying all 20 central columns (per `alog.md` §4) plus a nullable `ForwardState?` for the site-local variant.
- Create: `src/ScadaLink.Commons/Types/Enums/AuditForwardState.cs`: `Pending | Forwarded | Reconciled`.
- Create: `tests/ScadaLink.Commons.Tests/Entities/Audit/AuditEventTests.cs`.
**Steps:**
1. Write failing test that constructs an `AuditEvent`, sets every property, and round-trips via `with` expressions — asserts immutability and required-property behaviour.
2. Run: fail (type doesn't exist). Implement the record.
3. Run: pass.
4. Commit: `feat(commons): add AuditEvent record + ForwardState enum`.
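The shape of the record and its `with` round-trip might look like the sketch below. It carries only a hypothetical subset of properties (the full 20-column list lives in `alog.md` §4 and is not reproduced here), uses a string stand-in for the channel, and assumes C# 11 or later for `required` members.

```csharp
using System;

public enum AuditForwardState { Pending, Forwarded, Reconciled }

// Hypothetical subset of columns, for illustration only.
public sealed record AuditEvent
{
    public required Guid EventId { get; init; }
    public required DateTime OccurredAtUtc { get; init; }

    // String stand-in for the AuditChannel enum so the sketch is self-contained.
    public required string Channel { get; init; }

    // Null on central rows; set only on the site-local variant.
    public AuditForwardState? ForwardState { get; init; }
}

public static class AuditEventDemo
{
    public static (AuditEvent Original, AuditEvent Copy) RoundTrip()
    {
        var original = new AuditEvent
        {
            EventId = Guid.NewGuid(),
            OccurredAtUtc = DateTime.UtcNow,
            Channel = "ApiOutbound",
            ForwardState = AuditForwardState.Pending
        };

        // `with` creates a new instance; the original is left unchanged.
        var copy = original with { ForwardState = AuditForwardState.Forwarded };
        return (original, copy);
    }
}
```

Init-only properties give the immutability the test asserts, and `required` makes the compiler reject an `AuditEvent` built without its key fields.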
#### M1-T3: Add IAuditWriter and ICentralAuditWriter
**Files:**
- Create: `src/ScadaLink.Commons/Interfaces/Services/IAuditWriter.cs`, `ICentralAuditWriter.cs`.
- Create: `tests/ScadaLink.Commons.Tests/Interfaces/Services/AuditWriterContractTests.cs` (smoke — only that the interfaces exist and have the documented signatures).
**Steps:**
1. Write failing reflection-based test asserting both interfaces expose `Task WriteAsync(AuditEvent, CancellationToken)`.
2. Run: fail. Implement both interfaces; document each with XML doc comments naming Audit Log #23 as the owner.
3. Run: pass.
4. Commit: `feat(commons): add IAuditWriter and ICentralAuditWriter`.
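A reflection check along the lines of step 1 could be sketched as follows. The `AuditEvent` record here is a one-property stand-in so the block compiles on its own, and `AuditWriterContractChecks` is a hypothetical helper name; only the `WriteAsync(AuditEvent, CancellationToken)` signature comes from the task text.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for the Commons AuditEvent record.
public sealed record AuditEvent(Guid EventId);

/// <summary>Site-side audit writer contract, owned by Audit Log (#23).</summary>
public interface IAuditWriter
{
    Task WriteAsync(AuditEvent auditEvent, CancellationToken cancellationToken);
}

public static class AuditWriterContractChecks
{
    // True when `type` declares a public Task WriteAsync(AuditEvent, CancellationToken).
    public static bool HasWriteAsync(Type type)
    {
        var method = type.GetMethod(
            "WriteAsync",
            new[] { typeof(AuditEvent), typeof(CancellationToken) });

        return method != null && method.ReturnType == typeof(Task);
    }
}
```

The same helper can be pointed at `ICentralAuditWriter` once it exists, so both interfaces are held to one signature.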
#### M1-T4: Add audit telemetry + pull message DTOs
**Files:**
- Create: `src/ScadaLink.Commons/Messages/Integration/AuditTelemetryEnvelope.cs`, `PullAuditEventsRequest.cs`, `PullAuditEventsResponse.cs`.
- Create: `tests/ScadaLink.Commons.Tests/Messages/Integration/AuditTelemetryMessagesTests.cs`.
**Steps:**
1. Failing test: construct envelope with a batch of 3 events, assert immutability + batch enumerability.
2. Failing test: pull request carries `SinceUtc` + `BatchSize`; response carries events + `MoreAvailable`.
3. Implement.
4. Run: pass.
5. Commit: `feat(commons): add audit telemetry + pull message DTOs`.
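As a rough illustration, the three DTOs could be positional records like the ones below. `SinceUtc`, `BatchSize`, and `MoreAvailable` come from step 2; the `Events` property name and the stand-in `AuditEvent` are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the Commons AuditEvent record.
public sealed record AuditEvent(Guid EventId);

public sealed record AuditTelemetryEnvelope(IReadOnlyList<AuditEvent> Events);
public sealed record PullAuditEventsRequest(DateTime SinceUtc, int BatchSize);
public sealed record PullAuditEventsResponse(IReadOnlyList<AuditEvent> Events, bool MoreAvailable);

public static class AuditTelemetryDemo
{
    public static AuditTelemetryEnvelope BatchOfThree()
    {
        var events = new List<AuditEvent>
        {
            new(Guid.NewGuid()),
            new(Guid.NewGuid()),
            new(Guid.NewGuid())
        };

        // Copy into an array so later changes to the source list cannot leak into the envelope.
        return new AuditTelemetryEnvelope(events.ToArray());
    }
}
```

Exposing the batch as `IReadOnlyList<AuditEvent>` keeps it enumerable and indexable for the ingest side without offering mutation methods.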
#### M1-T5: Extend ScadaLinkDbContext with AuditLogs DbSet + entity config
**Files:**
- Modify: `src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs` — add `public DbSet<AuditEvent> AuditLogs => Set<AuditEvent>();` at the appropriate position (after `Notifications`).
- Create: `src/ScadaLink.ConfigurationDatabase/Entities/AuditLogEntityTypeConfiguration.cs`: `IEntityTypeConfiguration<AuditEvent>` mapping the columns, types, length constraints, and indexes per `alog.md` §4. Note: this is an EF mapping only; the partition function and scheme are created in the SQL migration (next task) since EF Core doesn't model them natively.
- Modify: `OnModelCreating` — apply the new configuration.
- Create: `tests/ScadaLink.ConfigurationDatabase.Tests/Entities/AuditLogEntityTypeConfigurationTests.cs` — use `ModelBuilder` directly to verify the entity is mapped to `AuditLog` table, PK is `EventId`, and the expected columns + indexes are declared.
**Steps:**
1. Failing test asserts mapped table name, PK column, and column count.
2. Implement entity configuration; apply in `OnModelCreating`.
3. Failing test asserts the five expected indexes exist on the model.
4. Add `HasIndex` declarations.
5. Run: pass.
6. Commit: `feat(configdb): map AuditEvent to AuditLog table with PK and indexes`.
#### M1-T6: Generate and customize EF migration for AuditLog
**Files:**
- Create: `src/ScadaLink.ConfigurationDatabase/Migrations/<timestamp>_AddAuditLogTable.cs` via `dotnet ef migrations add AddAuditLogTable --project ScadaLink.ConfigurationDatabase`.
- Modify: the generated `Up()` / `Down()` to:
  - Create the partition function `pf_AuditLog_Month` and partition scheme `ps_AuditLog_Month` (raw SQL via `migrationBuilder.Sql(...)`), tied to a dedicated filegroup (or PRIMARY in dev — configurable via a migration setting).
  - Alter the `CreateTable` call (or follow up with `Sql`) to align the table to `ps_AuditLog_Month(OccurredAtUtc)`.
  - Add the five indexes generated by EF; ensure each is also partition-aligned where appropriate.
- Create: `tests/ScadaLink.ConfigurationDatabase.Tests/Migrations/AddAuditLogTableMigrationTests.cs` — applies the migration to an isolated MS SQL LocalDB instance (existing IntegrationTests harness), asserts table + partition function + scheme + indexes are present.
**Steps:**
1. Run `dotnet ef migrations add AddAuditLogTable`.
2. Failing integration test: apply migration, query `sys.partition_functions` and `sys.partition_schemes` for the expected names.
3. Edit migration to add the partition function + scheme + alignment.
4. Re-run test: pass.
5. Failing test: query `sys.indexes` for the five expected named indexes.
6. Adjust migration if any index name drifts.
7. Run: pass.
8. Commit: `feat(configdb): add AuditLog migration with monthly partitioning`.
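The raw SQL for step 3 could be generated by a small helper such as the one below, whose output would be passed to `migrationBuilder.Sql(...)`. This is a sketch under assumptions: the `datetime2(7)` parameter type (it must match the real `OccurredAtUtc` column type), the `RANGE RIGHT` choice, the number of pre-created months, and the `AuditLogPartitionSql` helper name are all illustrative rather than taken from the spec.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper that builds the T-SQL for the partition function and scheme.
public static class AuditLogPartitionSql
{
    // One boundary per month start, beginning with the month containing firstMonthUtc.
    public static string CreateFunction(DateTime firstMonthUtc, int monthCount)
    {
        var start = new DateTime(firstMonthUtc.Year, firstMonthUtc.Month, 1, 0, 0, 0, DateTimeKind.Utc);
        var boundaries = new List<string>();
        for (var i = 0; i < monthCount; i++)
        {
            var boundary = start.AddMonths(i).ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
            boundaries.Add("'" + boundary + "'");
        }

        // RANGE RIGHT: each boundary value belongs to the partition on its right,
        // so the first instant of a month opens that month's partition.
        return "CREATE PARTITION FUNCTION pf_AuditLog_Month (datetime2(7)) AS RANGE RIGHT FOR VALUES ("
            + string.Join(", ", boundaries) + ");";
    }

    // The filegroup name must come from trusted configuration; it is concatenated, not parameterised.
    public static string CreateScheme(string filegroup) =>
        "CREATE PARTITION SCHEME ps_AuditLog_Month AS PARTITION pf_AuditLog_Month ALL TO ([" + filegroup + "]);";
}
```

In the migration this might be used as `migrationBuilder.Sql(AuditLogPartitionSql.CreateFunction(firstMonth, 12))` followed by `migrationBuilder.Sql(AuditLogPartitionSql.CreateScheme("PRIMARY"))`, with `Down()` dropping the scheme before the function.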
#### M1-T7: Add DB roles in migration
**Files:**
- Modify: the M1-T6 migration `Up()` to also create the `scadalink_audit_writer` (INSERT + SELECT only) and `scadalink_audit_purger` (ALTER PARTITION FUNCTION + ALTER TABLE … SWITCH PARTITION + SELECT) roles via raw SQL. Make role creation idempotent (`IF NOT EXISTS`).
- Modify: `Down()` — drop the roles.
- Create: `tests/ScadaLink.ConfigurationDatabase.Tests/Migrations/AuditLogRoleGrantsTests.cs` — applies migration, then runs `SELECT` on `sys.database_role_members` / `sys.database_permissions` to assert the role grants. Plus a smoke test: connect as a user mapped to `scadalink_audit_writer`, attempt `UPDATE AuditLog SET Status = 'X'` and expect a permission error.
**Steps:**
1. Failing test asserts both roles exist with documented grants.
2. Add `migrationBuilder.Sql(...)` blocks.
3. Run: pass.
4. Failing test: `UPDATE AuditLog` as audit writer → expect SqlException with permission error.
5. Verify that the role cannot UPDATE (it should be unable to by default, since only INSERT + SELECT are granted).
6. Run: pass.
7. Commit: `feat(configdb): add scadalink_audit_writer and scadalink_audit_purger roles`.
#### M1-T8: Add IAuditLogRepository + EF implementation
**Files:**
- Create: `src/ScadaLink.Commons/Interfaces/Repositories/IAuditLogRepository.cs`: `InsertIfNotExistsAsync(AuditEvent, CancellationToken)`, `QueryAsync(filter, paging, CancellationToken)`, `SwitchOutPartitionAsync(monthBoundary, CancellationToken)`. **Deliberately no `UpdateAsync` or row-level `DeleteAsync`.**
- Create: `src/ScadaLink.ConfigurationDatabase/Repositories/AuditLogRepository.cs` — implementation using the DbContext; `InsertIfNotExistsAsync` uses `MERGE` or raw `INSERT … WHERE NOT EXISTS` to satisfy idempotency without throwing on dupes.
- Modify: `ServiceCollectionExtensions.cs` — register `IAuditLogRepository` → `AuditLogRepository` in DI.
- Create: `tests/ScadaLink.ConfigurationDatabase.Tests/Repositories/AuditLogRepositoryTests.cs`.
**Steps:**
1. Failing test: `InsertIfNotExistsAsync` for a fresh `EventId` writes one row; calling again with the same `EventId` is a no-op (no exception, no second row).
2. Implement; use a `MERGE` or `INSERT … WHERE NOT EXISTS` strategy that does NOT rely on EF change tracking.
3. Run: pass.
4. Failing test: paged `QueryAsync` returns rows in `(OccurredAtUtc desc, EventId desc)` order, respecting filter predicates (channel, kind, status, site, target, actor, correlation, time range).
5. Implement filter projection + keyset paging.
6. Run: pass.
7. Failing test: `SwitchOutPartitionAsync` for the oldest partition removes its rows from the live table.
8. Implement via raw SQL issued from the repository (`ALTER TABLE ... SWITCH PARTITION ... TO ...`, in the same style as the migration's `migrationBuilder.Sql(...)` calls), switching into a staging table that the implementation creates and drops within the same transaction.
9. Run: pass.
10. Commit: `feat(configdb): IAuditLogRepository + EF implementation (append-only, partition-switch purge)`.
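The behaviour steps 1 and 4 pin down (idempotent insert, keyset paging in `(OccurredAtUtc desc, EventId desc)` order) can be illustrated with an in-memory stand-in. This is not the EF implementation: `InMemoryAuditLog`, the two-property `AuditEvent`, and the tuple cursor are hypothetical, and .NET's `Guid` ordering differs from SQL Server's `uniqueidentifier` ordering, so the real repository has to let the database define the tie-break.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Stand-in with only the two columns the paging order needs.
public sealed record AuditEvent(Guid EventId, DateTime OccurredAtUtc);

// In-memory model of the repository contract, for illustration only.
public sealed class InMemoryAuditLog
{
    private readonly ConcurrentDictionary<Guid, AuditEvent> _rows = new();

    public int Count => _rows.Count;

    // True when the row was inserted; false (and no exception) when EventId already exists.
    public bool InsertIfNotExists(AuditEvent auditEvent) =>
        _rows.TryAdd(auditEvent.EventId, auditEvent);

    // Keyset page: rows strictly after the cursor in (OccurredAtUtc desc, EventId desc) order.
    public IReadOnlyList<AuditEvent> Query(int pageSize, (DateTime OccurredAtUtc, Guid EventId)? after = null)
    {
        IEnumerable<AuditEvent> rows = _rows.Values;
        if (after.HasValue)
        {
            var cursor = after.Value;
            rows = rows.Where(r =>
                r.OccurredAtUtc < cursor.OccurredAtUtc ||
                (r.OccurredAtUtc == cursor.OccurredAtUtc && r.EventId.CompareTo(cursor.EventId) < 0));
        }

        return rows
            .OrderByDescending(r => r.OccurredAtUtc)
            .ThenByDescending(r => r.EventId)
            .Take(pageSize)
            .ToList();
    }
}
```

A caller pages by passing the last row of the previous page as the cursor, for example `log.Query(50, (last.OccurredAtUtc, last.EventId))`. Unlike offset paging, this stays stable while new rows are appended at the head.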
#### M1-T9: Add AuditLogOptions configuration class + binding
**Files:**
- Create: `src/ScadaLink.AuditLog/Configuration/AuditLogOptions.cs` (new project — see M1-T10) — owns `DefaultCapBytes`, `ErrorCapBytes`, `HeaderRedactList`, `GlobalBodyRedactors`, `PerTargetOverrides`, `RetentionDays`, validation attributes.
- Add: validation on startup (`IValidateOptions<AuditLogOptions>`).
- Test: ensure `appsettings.json` bind round-trips and validation rejects out-of-range `RetentionDays`.
**Steps:**
1. Failing test: bind a valid section → values present.
2. Implement options class + binding.
3. Failing test: bind invalid `RetentionDays` → validator rejects.
4. Implement validator.
5. Run: pass.
6. Commit: `feat(auditlog): add AuditLogOptions config binding`.
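One way the validation attributes could work is sketched below using `System.ComponentModel.DataAnnotations`. Only three of the listed properties are shown. The 8 KB / 64 KB defaults mirror the caps named in the M5 goal, while the `RetentionDays` bounds and default are hypothetical placeholders for whatever range the spec fixes.

```csharp
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

public sealed class AuditLogOptions
{
    [Range(1, int.MaxValue)]
    public int DefaultCapBytes { get; set; } = 8 * 1024;

    [Range(1, int.MaxValue)]
    public int ErrorCapBytes { get; set; } = 64 * 1024;

    // Hypothetical bounds and default; the real range comes from the spec.
    [Range(1, 3650)]
    public int RetentionDays { get; set; } = 365;
}

// Hypothetical helper; an IValidateOptions<AuditLogOptions> could delegate to it.
public static class AuditLogOptionsChecks
{
    // Returns one ValidationResult per violated attribute; empty means valid.
    public static IReadOnlyList<ValidationResult> Validate(AuditLogOptions options)
    {
        var results = new List<ValidationResult>();
        Validator.TryValidateObject(
            options,
            new ValidationContext(options),
            results,
            validateAllProperties: true);
        return results;
    }
}
```

Keeping the rules as attributes on the options class means the bind test and the validator test exercise the same declarations.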
#### M1-T10: Add ScadaLink.AuditLog project skeleton
**Files:**
- Create: `src/ScadaLink.AuditLog/ScadaLink.AuditLog.csproj` — TargetFramework matches the rest of the solution; ProjectReferences to `ScadaLink.Commons` and `ScadaLink.ConfigurationDatabase`.
- Create: `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs`: `AddAuditLog(this IServiceCollection, IConfiguration)` that registers `AuditLogOptions`, `IAuditLogRepository`, plus placeholders that later milestones will fill (writer impls, actors).
- Create: `tests/ScadaLink.AuditLog.Tests/ScadaLink.AuditLog.Tests.csproj` with one smoke test.
- Modify: `ScadaLink.slnx` — add both projects to the solution.
- Modify: `Directory.Packages.props` if any new package versions are needed.
**Steps:**
1. Create projects via `dotnet new classlib` / `dotnet new xunit`; add references; add to slnx.
2. Failing test: smoke-test `AddAuditLog()` populates DI with `IAuditLogRepository` and `IOptions<AuditLogOptions>`.
3. Implement `ServiceCollectionExtensions.AddAuditLog`.
4. Run: pass.
5. Commit: `feat(auditlog): scaffold ScadaLink.AuditLog project`.
#### M1-T11: Update Component-Host.md responsibilities + README component table
**Files:**
- Modify: `docs/requirements/Component-Host.md` — list `ScadaLink.AuditLog` in the central role's registration set.
- Modify: `README.md` — confirm row #23 link reflects the new project (no functional change; this is a paper-trail update).
**Steps:**
1. Edit, verify cross-refs, commit: `docs(audit): register ScadaLink.AuditLog project in Host role`.
---
## M2 — Site pipeline (sync-only path)
**Goal:** First end-to-end audit emission: a script-initiated `ExternalSystem.Call()` produces an audit row in the central `AuditLog` table. No cached paths yet, no notifications, no inbound API, no UI. Just one channel + kind: `ApiOutbound.SyncCall`.
**Affected projects:** `Commons`, `AuditLog` (new), `Communication`, `Host`, `ExternalSystemGateway`, all matching `*.Tests/`, `tests/ScadaLink.IntegrationTests/`.
**Acceptance criteria:**
- Site-local `IAuditWriter` writes to a per-site SQLite `auditlog.db` on the hot path with `ForwardState = 'Pending'`; durability is sub-millisecond; failures fall back to a bounded in-memory ring and surface a metric.
- `SiteAuditTelemetryActor` drains pending rows in batches via a new `IngestAuditEvents` RPC on the existing `SiteStream` gRPC service; on success flips `ForwardState = 'Forwarded'`.
- `AuditLogIngestActor` (central singleton) receives the batch, performs `InsertIfNotExistsAsync` per event, returns ack.
- `ExternalSystem.Call()` emits one `ApiOutbound.SyncCall` row via `IAuditWriter` on every call completion; audit-write failure does NOT abort the script.
- Integration test in `tests/ScadaLink.IntegrationTests/` boots a site + central pair, executes a sync script that calls an external system, and asserts a corresponding row appears in the central `AuditLog` within N seconds.
- No regressions in existing ExternalSystemGateway or Communication tests.
**Task headlines** (each expanded to TDD detail in its own writing-plans pass before execution):
1. Site-local `SqliteAuditWriter` implementing `IAuditWriter` — schema bootstrap, hot-path INSERT, write lock, ring-buffer fallback. Pattern from `SiteEventLogger.cs:28–98`.
2. Bounded in-memory `RingBufferFallback` that drains into the SQLite writer when health returns.
3. `SiteAuditTelemetryActor` actor — periodic drain loop (5s busy / 30s idle), batch INSERT-IF-NOT-EXISTS via gRPC, `ForwardState` transitions.
4. Extend `sitestream.proto`: add `IngestAuditEvents(stream AuditEventBatch) returns (IngestAck)`. Regenerate. Update `SiteStreamGrpcServer.cs` to handle the new RPC.
5. `AuditLogIngestActor` (central singleton) — handles ingest message, calls `IAuditLogRepository.InsertIfNotExistsAsync` per event in a single transaction.
6. Host wiring: register `SiteAuditTelemetryActor` as a site singleton on a **dedicated dispatcher** (per `alog.md` §6.2); register `AuditLogIngestActor` as a central singleton. Reference pattern at `AkkaHostedService.cs:272–280`.
7. ESG sync `Call()` emission hook — add `IAuditWriter` injection; emit `AuditEvent` (channel=ApiOutbound, kind=SyncCall) before returning. Audit-write failures never throw to the script.
8. End-to-end integration test in `IntegrationTests/AuditLog/SyncCallEmissionTests.cs` — site + central wired, script invokes ESG `Call()`, central row appears.
9. Health metric `SiteAuditWriteFailures` (this milestone defines it; M6 surfaces the tile).
10. Update `docker/deploy.sh` / `infra/reseed.sh` if needed so dev clusters can verify locally.
**Risk callouts:**
- Site SQLite write throughput under load — bench against existing SiteEventLogger numbers.
- gRPC additive evolution: the existing proto uses a `oneof`. Adding a new top-level RPC is safe; embedding new oneof variants is also safe. Confirm message-ordering guarantees aren't violated.
- Don't accidentally bind `SiteAuditTelemetryActor` to the same dispatcher used by script blocking I/O; that's a real perf issue (per spec).
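The bounded in-memory fallback from task headline 2 could take roughly this shape. The drop-oldest overflow policy, the `Dropped` counter, and the lock-based synchronisation are assumptions made for the sketch; the milestone brainstorm may settle on a different policy.

```csharp
using System;
using System.Collections.Generic;

// Bounded FIFO: when full, the oldest entry is discarded and counted.
public sealed class RingBufferFallback<T>
{
    private readonly Queue<T> _items = new();
    private readonly object _gate = new();
    private readonly int _capacity;
    private long _dropped;

    public RingBufferFallback(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    // Number of entries discarded because the buffer was full; a candidate for a health metric.
    public long Dropped
    {
        get { lock (_gate) { return _dropped; } }
    }

    public void Add(T item)
    {
        lock (_gate)
        {
            if (_items.Count == _capacity)
            {
                _items.Dequeue();
                _dropped++;
            }
            _items.Enqueue(item);
        }
    }

    // Removes and returns everything buffered, oldest first.
    public IReadOnlyList<T> Drain()
    {
        lock (_gate)
        {
            var drained = _items.ToArray();
            _items.Clear();
            return drained;
        }
    }
}
```

The SQLite writer would call `Add` when an INSERT fails and `Drain` once writes succeed again, replaying the drained events in order.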
---
## M3 — Cached operations + dual-write transaction + (inlined) Site Call Audit foundations
**Goal:** Cached external calls (`ExternalSystem.CachedCall`) and cached DB writes (`Database.CachedWrite`) produce three kinds of audit row per operation (one `CachedEnqueued`, N × `CachedAttempt`, one `CachedTerminal`) AND populate the operational `SiteCalls` table at central — in one transaction at central, from a single combined telemetry packet.
**Affected projects:** `Commons`, `AuditLog`, `SiteCallAudit` (new — minimum-viable surface), `ConfigurationDatabase` (new `SiteCalls` table migration), `ExternalSystemGateway`, `StoreAndForward`, `Host`. Tests across all of them + IntegrationTests.
**Prerequisite call-out:** This milestone implements the minimum-viable Site Call Audit (#22) surface and cached-call tracking pieces — `TrackedOperationId`, site-local operation tracking SQLite, `SiteCalls` table at central, and the `CachedCallTelemetry` message (specified in the docs but absent from code, so it must be created from scratch). Full reconciliation, KPIs, and Retry/Discard relay for #22 are deferred — they're not on the critical path for the audit log's combined telemetry.
**Acceptance criteria:**
- New `SiteCalls` MS SQL table + repo (no partitioning needed; this is operational state, not audit).
- New `CachedCallTelemetry` message in Commons carrying BOTH the cached-call operational fields AND an `AuditEvent` payload.
- Site path: `CachedCall` writes the audit row to site SQLite (`Kind = CachedEnqueued`), creates the site operation-tracking row, and sends a combined telemetry packet.
- Central path: `AuditLogIngestActor` (extended) receives the combined packet, performs **one transaction containing both** the `AuditLog` insert and the `SiteCalls` upsert.
- Retry attempt → `Kind = CachedAttempt` audit row + `SiteCalls` status transition. Terminal → `Kind = CachedTerminal` audit row + `SiteCalls` terminal status.
- Integration test asserts: triggering a `CachedCall` that fails transient-then-succeeds produces 3 AuditLog rows + 1 SiteCalls row with `Status = Delivered`, all sharing the same `TrackedOperationId` correlation key.
**Task headlines:**
1. `TrackedOperationId` GUID newtype in Commons.
2. Site-local SQLite operation-tracking table + repo (matches `alog.md` cached-call tracking design).
3. `CachedCallTelemetry` Commons message carrying both operational fields and `AuditEvent` payload.
4. `SiteCalls` MS SQL table + EF mapping + migration + `ISiteCallAuditRepository` + repo impl.
5. `SiteCallAuditActor` skeleton (singleton, central) — receives telemetry, owns `SiteCalls` upsert via repo.
6. Extend `AuditLogIngestActor` to detect combined telemetry and execute both writes (`AuditLog` insert + `SiteCalls` upsert) in a single `DbContext` transaction.
7. ESG `CachedCall()` emission — produce combined telemetry on every lifecycle transition (enqueue, attempt, terminal).
8. Extend gRPC proto with the combined-telemetry RPC if it's distinct from `IngestAuditEvents`, or fold it into the existing one with a discriminator field (decision in milestone brainstorm).
9. Integration test in `IntegrationTests/AuditLog/CachedCallCombinedTelemetryTests.cs`.
**Risk callouts:**
- Combined telemetry packet evolution: design the packet so future cached audit-kind additions are non-breaking (oneof or open-field map).
- Single transaction at central spans two tables; ensure connection retry behaviour is correct.
- Idempotency: AuditLog dedups on `EventId`; SiteCalls dedups on `TrackedOperationId`. If telemetry retries and AuditLog already has the row, ensure SiteCalls upsert still runs (no short-circuit).
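The no-short-circuit rule in the idempotency callout can be shown with an in-memory model of the central ingest. Everything here is a stand-in: `CombinedTelemetry` and `CombinedIngestSketch` are hypothetical names, statuses are plain strings, and a real implementation would wrap both writes in one database transaction. The sketch also does not guard against an older packet arriving after a newer one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical combined packet: one audit row plus the operational status it implies.
public sealed record CombinedTelemetry(Guid EventId, Guid TrackedOperationId, string Kind, string Status);

public sealed class CombinedIngestSketch
{
    private readonly HashSet<Guid> _auditLog = new();               // dedup key: EventId
    private readonly Dictionary<Guid, string> _siteCalls = new();   // dedup key: TrackedOperationId

    public int AuditRowCount => _auditLog.Count;
    public IReadOnlyDictionary<Guid, string> SiteCalls => _siteCalls;

    public void Ingest(CombinedTelemetry packet)
    {
        // Insert-if-not-exists: a redelivered packet leaves the audit rows unchanged.
        _auditLog.Add(packet.EventId);

        // The upsert runs whether or not the audit insert was a duplicate.
        _siteCalls[packet.TrackedOperationId] = packet.Status;
    }
}
```

If the first delivery committed the audit row but the upsert was lost, a retry that skipped duplicates entirely would leave `SiteCalls` permanently behind, which is why the upsert is unconditional.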
---
## M4 — Remaining boundary emission
**Goal:** Every channel × kind from `Component-AuditLog.md` produces a row when its boundary call fires.
**Affected projects:** `ExternalSystemGateway` (sync DB writes/reads, cached DB writes), `SiteRuntime` (Database surface exposing them), `NotificationOutbox` (central direct-write of `Attempt`/`Terminal`), `InboundAPI` (middleware). Tests across all.
**Acceptance criteria:**
- Sync `Database.Connection().Execute()` → `DbOutbound.SyncWrite` row; `ExecuteReader` → `DbOutbound.SyncRead`. Parameter values captured by default; per-connection redaction opt-in supported.
- `Database.CachedWrite` → three lifecycle rows via the combined telemetry built in M3.
- Notification Outbox dispatcher: every delivery attempt writes `Notification.Attempt`; terminal writes `Notification.Terminal`. Site-emitted `Notification.Enqueued` flows through the standard site→central audit path. Audit-write failure never affects delivery.
- Inbound API middleware writes one `ApiInbound.Completed` row per request, before `await next()` returns. API key NAME captured (never material). Audit-write failure does NOT change the HTTP response.
**Task headlines:**
1. ESG `Database.Connection()` execute hook — wrap `Execute*` / `ExecuteScalar` / `ExecuteReader` to emit before/after audit events.
2. `Database.CachedWrite` combined-telemetry emission (mirror M3's ESG cached path).
3. NotificationOutboxActor extension — inject `ICentralAuditWriter`; write `Notification.Attempt` per dispatcher attempt; write `Notification.Terminal` on terminal transitions; never abort on failure.
4. Site-emitted `Notification.Enqueued` — when a script calls `Notify.To().Send()` (site-side via Store-and-Forward), emit a site audit row (`Notification.Enqueued`); telemetry forwards as usual.
5. Inbound API middleware: new `AuditWriteMiddleware` in `src/ScadaLink.InboundAPI/Middleware/` writing `ApiInbound.Completed` before response flush; register in the ASP.NET pipeline.
6. Tests: emission unit tests per call mode, plus 4 integration tests (one per channel).
**Risk callouts:**
- Inbound API: correlation-id generation needs to be consistent with any upstream tracing headers (W3C `traceparent` if present).
- Notification dispatcher: confirm `ICentralAuditWriter` errors are logged but don't block the dispatch loop.
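The rule that an audit-write failure is logged but never propagated, which recurs for the dispatcher, the middleware, and the script-facing emitters, could be centralised in a small wrapper. `SafeAudit` and its delegate-based shape are hypothetical; the real emitters would call `ICentralAuditWriter` or `IAuditWriter` inside it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SafeAudit
{
    // Runs the audit write and reports failure to the callback instead of throwing.
    public static async Task<bool> TryWriteAsync(
        Func<CancellationToken, Task> write,
        Action<Exception> onFailure,
        CancellationToken cancellationToken = default)
    {
        try
        {
            await write(cancellationToken).ConfigureAwait(false);
            return true;
        }
        catch (Exception ex)
        {
            // Deliberately broad: cancellation and database errors alike must not reach the caller.
            onFailure(ex);
            return false;
        }
    }
}
```

A dispatcher attempt might then read `await SafeAudit.TryWriteAsync(ct => writer.WriteAsync(evt, ct), ex => logger.LogWarning(ex, "audit write failed"))`, where `writer`, `evt`, and `logger` are whatever the calling actor already holds.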
---
## M5 — Payload + redaction policy
**Goal:** Payload capture is bounded (8 KB / 64 KB on error), headers are redacted by default, SQL parameter values are captured by default with per-connection opt-out, body redactor regexes are configurable per target, and the safety net over-redacts on misconfiguration.
**Affected projects:** `AuditLog` (policy engine + options), `ExternalSystemGateway` (HTTP header redactors, SQL param redaction hook), `InboundAPI` (header redactors, body capture), `NotificationOutbox` (subject/body capture follows existing rules). Tests.
**Acceptance criteria:**
- An `IAuditPayloadFilter` service is invoked between event construction and write. Truncates to default cap; raises to error cap on non-`Success` rows; applies header redactors; applies body regex redactors; applies SQL parameter redactors (per-connection); over-redacts on regex error and increments `AuditRedactionFailure`.
- Configuration test: changing `appsettings.json` redactors changes runtime behaviour (no rebuild needed for regex changes).
- Bench: 95th-percentile audit emission latency on the hot path stays under N µs at default cap (target to be set during M5 brainstorm).
**Task headlines:**
1. `IAuditPayloadFilter` + default implementation (header redaction, body regex, SQL parameter redaction, safety net).
2. Wire the filter into the emission paths (M2, M3, M4 emitters all call through the filter before handing the `AuditEvent` to the writer).
3. `appsettings.json` schema for the filter (already prepared in M1-T9; M5 plugs the runtime in).
4. Tests: redaction unit tests with known-bad payloads (passwords in JSON, `Authorization` headers, SQL params named `@apikey`).
5. Performance test in `tests/ScadaLink.PerformanceTests/` for the hot-path latency budget.
**Risk callouts:**
- Regex performance — pre-compile and cache patterns; reject patterns that take too long to compile.
- Ordering: run redaction before truncation. A truncation that cuts a redaction target in half leaves a partial secret that a post-truncation regex no longer matches.
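
The filter behaviour described in this milestone can be sketched as follows. This is a Python illustration under stated assumptions, not the `IAuditPayloadFilter` implementation: the function names, the `[REDACTED]` marker, the sensitive-header list, and the choice to replace the whole body when a redactor is broken are all hypothetical. Only the 8 KB / 64 KB caps and the over-redact-on-error rule come from the text above. Patterns are compiled once up front, and truncation runs last so that a secret is never split before the redactors see it.

```python
import re

DEFAULT_CAP_BYTES = 8 * 1024    # cap for Success rows (from the M5 goal)
ERROR_CAP_BYTES = 64 * 1024     # raised cap for non-Success rows
REDACTED = "[REDACTED]"         # hypothetical marker text
SENSITIVE_HEADERS = {"authorization", "cookie", "x-api-key"}  # assumed defaults


def compile_redactors(patterns):
    """Pre-compile once at config load; an invalid pattern becomes None."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error:
            compiled.append(None)
    return compiled


def filter_payload(status, headers, body, redactors):
    """Return (headers, body, redaction_failures) ready for the audit writer."""
    failures = 0
    safe_headers = {
        name: (REDACTED if name.lower() in SENSITIVE_HEADERS else value)
        for name, value in headers.items()
    }
    for redactor in redactors:
        if redactor is None:
            # Safety net: a broken redactor over-redacts the whole body.
            failures += 1
            body = REDACTED
            break
        body = redactor.sub(REDACTED, body)
    # Truncate only after redaction so a secret is never cut in half.
    cap = DEFAULT_CAP_BYTES if status == "Success" else ERROR_CAP_BYTES
    encoded = body.encode("utf-8")
    if len(encoded) > cap:
        # errors="ignore" drops a multi-byte character split by the cut.
        body = encoded[:cap].decode("utf-8", errors="ignore")
    return safe_headers, body, failures
```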
---
## M6 — Reconciliation, purge, partition maintenance, health metrics
**Goal:** Self-healing telemetry, monthly partition rollover, daily purge, all five new health metrics live and feeding the existing health-report pipeline.
**Affected projects:** `AuditLog` (3 new actors: `SiteAuditReconciliationActor`, `AuditLogPurgeActor`, partition-maintenance worker), `Communication` (the `PullAuditEvents` RPC), `HealthMonitoring` (5 new metrics), `ConfigurationDatabase` (partition-roll-forward SQL helper).
**Acceptance criteria:**
- `SiteAuditReconciliationActor` runs every 5 minutes per site; pulls events the site reports as `Pending`; central performs `InsertIfNotExistsAsync` then signals the site to flip those rows to `Reconciled`.
- `AuditLogPurgeActor` runs daily; for each partition older than `RetentionDays`, switches it out to a staging table and drops the staging table. Emits an `AuditLog:Purged` event with rowcount + duration.
- Partition-maintenance job runs at month boundary to add the next month's partition function range and ensure the scheme has a destination filegroup.
- 5 new health metrics published. Per site: `SiteAuditBacklog` (count + oldest + bytes), `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`. Per central node: `CentralAuditWriteFailures`, `AuditRedactionFailure`.
- Integration test: simulated 5-minute central outage → telemetry catches up after recovery via reconciliation, no rows lost; site backlog metric reflects the queue depth and drops as it drains.
**Task headlines:**
1. `PullAuditEvents` RPC on the existing `SiteStream` gRPC server.
2. `SiteAuditReconciliationActor` actor with timer + per-site `LastReconciledAt` cursor.
3. `AuditLogPurgeActor` actor with daily schedule, partition-switch logic via `IAuditLogRepository.SwitchOutPartitionAsync`.
4. Partition-roll-forward helper (raw SQL `migrationBuilder.Sql` equivalent at runtime — likely a `HostedService` that runs once at startup and once per month).
5. Health metric publishing per emitter; integrate with the existing `SiteHealthState` / `CentralHealthAggregator` plumbing.
6. Integration tests for outage/recovery + purge.
**Risk callouts:**
- Partition switch on an active table — ensure online schema operations don't block ingest; document the window if a brief lock is unavoidable.
- Reconciliation can produce duplicate `Forwarded` → `Reconciled` state flips; ensure idempotency at site SQLite layer.
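
The idempotency requirement on both sides of reconciliation can be sketched with two in-memory SQLite databases. This is a Python illustration only: the table names, column names, and helper functions are hypothetical and heavily simplified, and the central store is MS SQL in the real design, so `INSERT OR IGNORE` stands in for whatever `InsertIfNotExistsAsync` does there. The point of the sketch is that replaying the same batch, or signalling the same state flip twice, changes nothing the second time.

```python
import sqlite3

# Hypothetical, simplified schemas: the real columns come from the M1 design.
central = sqlite3.connect(":memory:")
central.execute("CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, Payload TEXT)")
site = sqlite3.connect(":memory:")
site.execute("CREATE TABLE SiteAudit (EventId TEXT PRIMARY KEY, Payload TEXT, State TEXT)")
site.executemany(
    "INSERT INTO SiteAudit VALUES (?, ?, ?)",
    [("e1", "{}", "Pending"), ("e2", "{}", "Forwarded"), ("e3", "{}", "Reconciled")],
)
site.commit()


def insert_if_not_exists(conn, events):
    """Central side: an event id that was already stored is silently skipped."""
    conn.executemany("INSERT OR IGNORE INTO AuditLog VALUES (?, ?)", events)
    conn.commit()


def mark_reconciled(conn, event_ids):
    """Site side: flip rows to Reconciled and return how many actually changed.

    The State guard makes a duplicate signal a no-op instead of a second flip.
    """
    flipped = 0
    for event_id in event_ids:
        cursor = conn.execute(
            "UPDATE SiteAudit SET State = 'Reconciled' "
            "WHERE EventId = ? AND State <> 'Reconciled'",
            (event_id,),
        )
        flipped += cursor.rowcount
    conn.commit()
    return flipped
```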
---
## M7 — Central UI: new Audit Log page + drill-ins + KPI tiles
**Goal:** User-visible Audit Log: filter bar, results grid (custom Blazor + Bootstrap, no third-party grid), drilldown drawer with cURL / "show all events" / redaction indicators / pretty-printed payloads. 6 drill-in entry points from existing pages. 3 KPI tiles on Health dashboard.
**Affected projects:** `CentralUI`, `CentralUI.Tests`, `CentralUI.PlaywrightTests`.
**Acceptance criteria:**
- New `Components/Pages/Audit/AuditLogPage.razor` exists; new "Audit" nav group sibling to Notifications.
- All 10 filter elements, 10 grid columns, keyset pagination + default page 100, drilldown drawer per `Component-AuditLog.md` §10.
- Existing `Components/Pages/Monitoring/AuditLog.razor` (the IAuditService config-change viewer) **renamed in code** to `ConfigurationAuditLog.razor`, with URL `/audit/configuration` to match the doc-renaming we did. Drill-ins from existing pages (Notifications, Site Calls, External Systems, Inbound API Keys, Sites, Instances) added.
- 3 KPI tiles added to the Health dashboard; data sourced from `HealthMonitoring`.
- Playwright tests cover: filter narrowing, drilldown drawer, "Copy as cURL" on `ApiInbound` rows, drill-in from Notifications to filtered Audit Log.
- Page access is gated on the `OperationalAudit` read permission; the Export button additionally requires `AuditExport`.
**Task headlines:**
1. New `Components/Pages/Audit/AuditLogPage.razor` + matching `.razor.cs` code-behind + `.razor.css`.
2. Custom Blazor `<AuditFilterBar>` component (multi-select chips for Channel/Kind/Status, autocomplete for Instance/Script).
3. Custom Blazor `<AuditResultsGrid>` component — keyset paging via `QueryAsync` repository method (M1-T8).
4. `<AuditDrilldownDrawer>` component — JSON pretty-print, SQL syntax highlight, "Copy as cURL", "Show all events" CorrelationId filter.
5. Rename existing `AuditLog.razor` → `ConfigurationAuditLog.razor` + update routes + update internal links.
6. Drill-in additions to 6 existing pages.
7. 3 KPI tile components on Health dashboard.
8. Server-side CSV export (streaming) with `AuditExport` permission check.
9. Playwright E2E tests.
**Risk callouts:**
- Permission check at the page level needs to align with the existing role/permission infrastructure (Security #10).
- Keyset paging across partitioned table needs the right index; M1's `IX_AuditLog_OccurredAtUtc` is the supporting index.
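
Keyset paging as required by the results grid can be sketched as below. This is a Python and SQLite illustration, not the `QueryAsync` repository method: the two-column table, the `query_page` helper, and the use of `EventId` as the tie-breaker are assumptions made for the example. The cursor is the sort key of the last row on the previous page, so each page is a range seek on the `OccurredAtUtc` index rather than an `OFFSET` scan, and rows sharing a timestamp are neither skipped nor repeated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; the real AuditLog schema is defined in M1.
conn.execute("CREATE TABLE AuditLog (EventId INTEGER PRIMARY KEY, OccurredAtUtc TEXT)")
conn.executemany(
    "INSERT INTO AuditLog VALUES (?, ?)",
    # Integer division gives several rows the same timestamp on purpose.
    [(i, f"2026-05-20T09:00:{i // 2:02d}Z") for i in range(1, 11)],
)
conn.commit()


def query_page(conn, page_size, cursor=None):
    """Return (rows, next_cursor), newest first.

    `cursor` is the (OccurredAtUtc, EventId) of the last row on the previous
    page; EventId breaks ties between rows sharing a timestamp.
    """
    if cursor is None:
        rows = conn.execute(
            "SELECT OccurredAtUtc, EventId FROM AuditLog "
            "ORDER BY OccurredAtUtc DESC, EventId DESC LIMIT ?",
            (page_size,),
        ).fetchall()
    else:
        occurred_at, event_id = cursor
        rows = conn.execute(
            "SELECT OccurredAtUtc, EventId FROM AuditLog "
            "WHERE OccurredAtUtc < ? OR (OccurredAtUtc = ? AND EventId < ?) "
            "ORDER BY OccurredAtUtc DESC, EventId DESC LIMIT ?",
            (occurred_at, occurred_at, event_id, page_size),
        ).fetchall()
    # A short page means the result set is exhausted.
    next_cursor = rows[-1] if len(rows) == page_size else None
    return rows, next_cursor
```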
---
## M8 — CLI: `scadalink audit query | export | verify-chain`
**Goal:** Operator surface for the centralized Audit Log.
**Affected projects:** `CLI`, `CLI.Tests`, `ManagementService` (new HTTP endpoint), `IntegrationTests`.
**Acceptance criteria:**
- `scadalink audit query` mirrors the UI filter set; results stream as JSON (default) or table.
- `scadalink audit export` streams server-side to CSV / JSONL / Parquet; requires `AuditExport` permission.
- `scadalink audit verify-chain --month YYYY-MM` is a no-op stub returning a "hash-chain not yet enabled in this release" message and exit code 0 (per v1.x deferral).
- Existing `audit-log query` (IAuditService config-change viewer) **renamed** in code to `audit-config query` to disambiguate; old name kept as a deprecated alias for one minor version.
- Permissions: `audit query` and `audit verify-chain` require `OperationalAudit`; `audit export` additionally requires `AuditExport`.
**Task headlines:**
1. New `AuditCommands.cs` (separate file from `AuditLogCommands.cs` — the latter stays for the renamed config audit).
2. Build the three subcommands with their flag sets (per CLI doc & `alog.md` §15.1, post-Bundle-D fix).
3. ManagementService HTTP endpoints backing each subcommand.
4. Output formatters (JSON, table) reused from existing CLI patterns.
5. CLI integration tests in `tests/ScadaLink.CLI.Tests/` + `tests/ScadaLink.IntegrationTests/`.
6. Update CLI README + help text.
**Risk callouts:**
- The CLI rename (`audit-log query` → `audit-config query`) breaks any operator scripts; provide a deprecation alias and document the migration.
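
One possible shape for the deprecation alias, sketched in Python rather than in the CLI's own C#. The alias table and `resolve_command` are hypothetical; the only facts taken from this roadmap are the old and new command names. The warning goes to stderr on the assumption that operator scripts parse stdout, so existing JSON consumers keep working during the deprecation window.

```python
import sys

# Hypothetical alias table: deprecated top-level command -> replacement.
DEPRECATED_ALIASES = {"audit-log": "audit-config"}


def resolve_command(argv, warn=sys.stderr):
    """Rewrite a deprecated top-level command and emit a warning.

    Returns a new argument list; the input list is never mutated.
    """
    if argv and argv[0] in DEPRECATED_ALIASES:
        replacement = DEPRECATED_ALIASES[argv[0]]
        print(
            f"warning: '{argv[0]}' is deprecated; use '{replacement}' instead",
            file=warn,
        )
        return [replacement, *argv[1:]]
    return list(argv)
```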
---
## Cross-cutting concerns (apply at every milestone)
- **Branching:** every milestone gets its own `feature/audit-log-mN-<slice>` branch; merged with `--no-ff` to `main` on milestone completion. No pushes without explicit user authorization.
- **Tests:** Every task adds tests first (failing test → impl → passing test). Existing tests must keep passing.
- **Commits:** small and frequent. Bite-sized per writing-plans skill.
- **Reviews:** per the bundling cadence in user memory — group small adjacent tasks into a single implementer dispatch, run one combined spec+quality review per bundle, then a final cross-bundle review at end of milestone.
- **Docs:** if implementation reveals a design gap, fix the design doc FIRST (in `docs/requirements/Component-AuditLog.md` and/or `alog.md`), commit, then implement. Don't let the code and docs drift.
- **Infra:** the 3 `infra/*` working-tree modifications still uncommitted on `main` are unrelated and stay that way unless the user explicitly addresses them. Use explicit `git add <path>` throughout, never `git commit -am`.
---
## Per-milestone execution flow (template)
When a milestone is about to start, run this sequence:
1. **Brainstorm**: short skill invocation to nail any code-level decisions not fixed in the spec (test fixture placement, migration helper choice, etc.).
2. **Writing-plans**: produce a milestone-specific plan with TDD detail per task — saved to `docs/plans/2026-XX-XX-auditlog-mN-<slice>.md` + peer `.tasks.json`.
3. **Subagent-driven execution**: bundle small tasks per cadence preference; per-bundle implementer + combined reviewer; cross-milestone review at end; merge to `main` with `--no-ff`.
The roadmap is the contract for what each milestone ships; the per-milestone plan is the contract for how it gets built.