diff --git a/docs/plans/2026-05-20-audit-log-code-roadmap.md b/docs/plans/2026-05-20-audit-log-code-roadmap.md new file mode 100644 index 0000000..575e83c --- /dev/null +++ b/docs/plans/2026-05-20-audit-log-code-roadmap.md @@ -0,0 +1,471 @@ +# Audit Log (#23) Code Implementation Roadmap + +> **For Claude:** REQUIRED SUB-SKILL FLOW per milestone: `brainstorming` → `writing-plans` → `subagent-driven-development`. Use `docs/requirements/Component-AuditLog.md` + `alog.md` as the spec; this document is the roadmap that sequences milestones and locks acceptance criteria for each. **M1 carries full TDD-level task detail; M2–M8 are milestone-shape detail and will be expanded into bite-sized plans by their own writing-plans pass when their turn comes.** + +**Goal:** Implement central component #23 Audit Log — append-only forensic + operational record across every script-trust-boundary action — into the existing ScadaLink codebase. + +**Architecture:** Layered alongside (not replacing) the future Notifications/SiteCalls operational stores. Site-local SQLite hot-path append + gRPC telemetry batches + reconciliation pulls; central direct-write for Inbound API and Notification Outbox dispatch; monthly-partitioned MS SQL with single global retention; strict append-only enforced via DB roles. See `alog.md` for the locked design decisions and `Component-AuditLog.md` for the component spec. + +**Tech Stack:** Akka.NET (clustering, singletons, ClusterClient), EF Core (MS SQL provider, code-first migrations), Microsoft.Data.SqlClient, Microsoft.Data.Sqlite, gRPC (HTTP/2 server-streaming on the existing `SiteStream` channel), ASP.NET Core (Inbound API middleware), Blazor Server + Bootstrap (Central UI), System.CommandLine (CLI), xUnit + Akka.TestKit.Xunit2 + NSubstitute (tests). + +**Spec:** `/Users/dohertj2/Desktop/scadalink-design/alog.md` (validated, immutable; commit `fec0bb1`). Component design at `/Users/dohertj2/Desktop/scadalink-design/docs/requirements/Component-AuditLog.md`. 
+ +--- + +## Codebase Reality Check (what already exists) + +- **All 22 prior components have source + tests.** Audit Log slots in as a new `src/ScadaLink.AuditLog/` project plus changes to: Commons, ConfigurationDatabase, Communication (proto), Host (DI + actor registration), ExternalSystemGateway, InboundAPI, NotificationOutbox, HealthMonitoring, CentralUI, CLI, SiteRuntime (audit hook surface). +- **Existing patterns to copy from:** + - Singleton wiring: `src/ScadaLink.Host/Actors/AkkaHostedService.cs:272–280` (NotificationOutboxActor) — `ClusterSingletonManager.Props` + manager/proxy pair. + - EF migration: `src/ScadaLink.ConfigurationDatabase/Migrations/20260519050659_AddNotificationsTable.cs` — table create + indexes; **no partitioning yet — Audit Log will be the first.** + - Site SQLite hot-path: `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:28–98` — single connection, write lock, Channel-based background writer. + - Site-buffer + forwarder: `src/ScadaLink.StoreAndForward/` — `StoreAndForwardStorage` + `NotificationForwarder` show the Pending → Forwarded transition we'll mirror. + - Actor + repo + test trio: `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs` and `tests/ScadaLink.NotificationOutbox.Tests/NotificationOutboxActorIngestTests.cs:20` — TestKit base class, NSubstitute repo, `Sys.ActorOf`, `ExpectMsg`. + - gRPC additive: `src/ScadaLink.Communication/Protos/sitestream.proto` — currently carries only `AttributeValueUpdate` and `AlarmStateUpdate` in a `oneof`; we extend it. + - CLI command shape: `src/ScadaLink.CLI/Commands/AuditLogCommands.cs:1–53` — System.CommandLine pattern; new group will live alongside it (the file's existing commands are for the IAuditService config audit and stay). + - Blazor listing page: `src/ScadaLink.CentralUI/Components/Pages/Notifications/NotificationReport.razor` — filter bar + keyset paging + status badges idiom. 
+- **`AuditLog.razor` and `AuditLogCommands.cs` already exist** but they're the **IAuditService config-change viewer**. Per the design pass we renamed them in docs to "Configuration Audit Log Viewer"; in code they'll be renamed (file + URL + command name) so the new operational Audit Log can take the unqualified name. +- **Test framework:** xUnit + Akka.TestKit.Xunit2 + NSubstitute. Integration tests under `tests/ScadaLink.IntegrationTests/`. Playwright UI tests under `tests/ScadaLink.CentralUI.PlaywrightTests/`. A `tests/ScadaLink.PerformanceTests/` exists for load tests. + +--- + +## Prerequisite: Site Call Audit (#22) + cached-call tracking are NOT implemented in code + +The design for both is merged on `main` (`alog.md` cached-call tracking section; `Component-SiteCallAudit.md`), but `grep` finds zero references to `TrackedOperationId` or `CachedCallTelemetry` in `src/`. This matters because **M3 (cached operations + dual-write transaction) cannot be built without them**. + +**Three ways to handle this — pick before M3:** + +1. **Inline into M3 (Recommended):** Implement just enough of Site Call Audit (#22) and cached-call tracking inside M3 — specifically the `CachedCallTelemetry` message, the operational-tracking SQLite table at sites, the `SiteCalls` table + repo + `SiteCallAuditActor` skeleton at central. This makes M3 the biggest milestone but ships a coherent slice (cached calls audited end-to-end). +2. **M0 prerequisite milestone:** Implement #22 and cached-call tracking as a separate slice before M3 starts. Cleanest dependency story; slowest to first-audit-row. +3. **Ship Audit Log sync-only first, retrofit cached path later:** M1, M2, M4 (sync-only emissions), M5, M6 (no cached features), M7, M8 ship as-is; cached audit is a separate follow-up. Lowest first-shippable scope but leaves cached calls unaudited until much later. 
+ +**Default choice in this roadmap: (1).** M3 absorbs the minimum #22 + cached-call tracking surface needed to make combined telemetry work; the rest of #22 (full reconciliation, KPIs, Retry/Discard relay) can be a follow-up. + +--- + +## Milestone index + +| M | Title | Ships | Touches | Depends on | +|---|---|---|---|---| +| **M1** | Foundation: schema, types, DB roles, partitioning | Migration deployed; Commons types exist; no observable behavior yet. | Commons, ConfigurationDatabase, ConfigurationDatabase.Tests | — | +| **M2** | Site pipeline (sync-only path) | One emission path end-to-end (ESG sync `Call()` audited from script to central row). | Commons, AuditLog (new), Communication (proto), Host, ExternalSystemGateway, all Tests projects, IntegrationTests | M1 | +| **M3** | Cached operations + dual-write transaction | Cached external calls and DB writes audited; SiteCalls table populated alongside; combined telemetry packet contract live. | Commons, AuditLog, SiteCallAudit (new), ConfigurationDatabase, ExternalSystemGateway, StoreAndForward, Host | M2; #22 + cached-call tracking inlined here per the prerequisite section | +| **M4** | Remaining boundary emission | All four channels emitting: sync DB writes/reads, Notify dispatcher attempt/terminal, Inbound API middleware. | ExternalSystemGateway, InboundAPI, NotificationOutbox, SiteRuntime (Database surface) | M2; M3 (NotificationOutbox terminal/attempt uses ICentralAuditWriter pattern) | +| **M5** | Payload + redaction policy | Header redaction, body redactor regex, SQL parameter redaction, safety net, configuration binding. | AuditLog, ExternalSystemGateway, InboundAPI, all emitter projects | M2 | +| **M6** | Reconciliation, purge, partition maintenance, health metrics | Self-healing telemetry, monthly partition switch, the five new health metrics + their dashboard tiles. 
| AuditLog, ConfigurationDatabase (partition maintenance), HealthMonitoring | M2, M3 | +| **M7** | Central UI — new Audit Log page + drill-ins + KPI tiles | User-visible Audit Log surface; existing `AuditLog.razor` renamed to ConfigurationAuditLog. | CentralUI, CentralUI.Tests, CentralUI.PlaywrightTests | M2, M4, M6 | +| **M8** | CLI — `scadalink audit query / export / verify-chain` | Operator surface for query/export; `verify-chain` is a no-op stub until v1.x hash chain ships. | CLI, ManagementService (HTTP endpoint), CLI.Tests, IntegrationTests | M2 | + +**Ship-state at end of each milestone is the shippable slice** — each milestone leaves the system in a working, testable, deployable state (no half-built actors mid-pipeline). M1 ships no user-visible behaviour but produces a clean foundation; from M2 onward each ships an observable audit capability. + +**Critical path:** M1 → M2 → (M3 ∥ M4 ∥ M5) → M6 → (M7 ∥ M8). M3, M4, M5 can overlap once M2 is solid. M7 and M8 can overlap once M6 lands. + +--- + +## M1 — Foundation: schema, types, DB roles, partitioning + +**Goal:** Land the new `AuditLog` table (partitioned) and DB roles in MS SQL, plus the Commons types every later milestone needs. After M1 the database is ready and types compile; nothing else changes. + +**Affected projects:** +- `src/ScadaLink.Commons/` — entity, enums, interfaces, message DTOs. +- `src/ScadaLink.ConfigurationDatabase/` — EF mapping, DbContext registration, migration, DB role script, partition function/scheme, retention options. +- `tests/ScadaLink.Commons.Tests/` — enum + record tests. +- `tests/ScadaLink.ConfigurationDatabase.Tests/` — migration tests, repo tests. + +**Acceptance criteria:** +- `dotnet build` of the solution succeeds. +- `dotnet ef database update` against a dev MS SQL applies the migration; `AuditLog` table exists, partitioned monthly on `OccurredAtUtc`, with PK on `EventId` and the five expected indexes. 
+- `scadalink_audit_writer` and `scadalink_audit_purger` SQL roles exist with the documented grants; a smoke test confirms `UPDATE AuditLog` from the writer role fails. +- `AuditEvent` record, `AuditChannel`/`AuditKind`/`AuditStatus` enums, `IAuditWriter`/`ICentralAuditWriter` interfaces, `AuditTelemetryEnvelope`/`PullAuditEvents` message DTOs all exist in Commons in the right folders. +- `IAuditLogRepository` interface (Commons) and EF implementation (ConfigurationDatabase) exist; the implementation only exposes `InsertIfNotExistsAsync`, paged read, and `SwitchOutPartitionAsync` — no update or row-delete. +- All new tests pass; no existing tests regress. + +### M1 — Tasks (TDD-detail) + +#### M1-T1: Add audit enums to Commons + +**Files:** +- Create: `src/ScadaLink.Commons/Types/Enums/AuditChannel.cs`, `AuditKind.cs`, `AuditStatus.cs`. +- Create: `tests/ScadaLink.Commons.Tests/Types/Enums/AuditEnumTests.cs`. + +**Steps:** +1. Write failing test verifying `AuditChannel` has exactly `ApiOutbound | DbOutbound | Notification | ApiInbound` (asserting `Enum.GetValues` length and members). +2. Same for `AuditKind` (10 members per `Component-AuditLog.md`). +3. Same for `AuditStatus` (8 members). +4. Run: tests fail (enums don't exist). Implement the three enums. +5. Run tests: pass. +6. Commit: `feat(commons): add Audit{Channel,Kind,Status} enums for #23`. + +#### M1-T2: Add AuditEvent record + ForwardState enum + +**Files:** +- Create: `src/ScadaLink.Commons/Entities/Audit/AuditEvent.cs` — public record carrying all 20 central columns (per `alog.md` §4) plus a nullable `ForwardState?` for the site-local variant. +- Create: `src/ScadaLink.Commons/Types/Enums/AuditForwardState.cs` — `Pending | Forwarded | Reconciled`. +- Create: `tests/ScadaLink.Commons.Tests/Entities/Audit/AuditEventTests.cs`. + +**Steps:** +1. 
Write failing test that constructs an `AuditEvent`, sets every property, and round-trips via `with` expressions — asserts immutability and required-property behaviour. +2. Run: fail (type doesn't exist). Implement the record. +3. Run: pass. +4. Commit: `feat(commons): add AuditEvent record + ForwardState enum`. + +#### M1-T3: Add IAuditWriter and ICentralAuditWriter + +**Files:** +- Create: `src/ScadaLink.Commons/Interfaces/Services/IAuditWriter.cs`, `ICentralAuditWriter.cs`. +- Create: `tests/ScadaLink.Commons.Tests/Interfaces/Services/AuditWriterContractTests.cs` (smoke — only that the interfaces exist and have the documented signatures). + +**Steps:** +1. Write failing reflection-based test asserting both interfaces expose `Task WriteAsync(AuditEvent, CancellationToken)`. +2. Run: fail. Implement both interfaces; document each with XML doc comments naming Audit Log #23 as the owner. +3. Run: pass. +4. Commit: `feat(commons): add IAuditWriter and ICentralAuditWriter`. + +#### M1-T4: Add audit telemetry + pull message DTOs + +**Files:** +- Create: `src/ScadaLink.Commons/Messages/Integration/AuditTelemetryEnvelope.cs`, `PullAuditEventsRequest.cs`, `PullAuditEventsResponse.cs`. +- Create: `tests/ScadaLink.Commons.Tests/Messages/Integration/AuditTelemetryMessagesTests.cs`. + +**Steps:** +1. Failing test: construct envelope with a batch of 3 events, assert immutability + batch enumerability. +2. Failing test: pull request carries `SinceUtc` + `BatchSize`; response carries events + `MoreAvailable`. +3. Implement. +4. Run: pass. +5. Commit: `feat(commons): add audit telemetry + pull message DTOs`. + +#### M1-T5: Extend ScadaLinkDbContext with AuditLogs DbSet + entity config + +**Files:** +- Modify: `src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs` — add `public DbSet<AuditEvent> AuditLogs => Set<AuditEvent>();` at the appropriate position (after `Notifications`). 
+- Create: `src/ScadaLink.ConfigurationDatabase/Entities/AuditLogEntityTypeConfiguration.cs` — `IEntityTypeConfiguration<AuditEvent>` mapping the columns, types, length constraints, and indexes per `alog.md` §4. Note: this is an EF mapping only; the partition function and scheme are created in the SQL migration (next task) since EF Core doesn't model them natively. +- Modify: `OnModelCreating` — apply the new configuration. +- Create: `tests/ScadaLink.ConfigurationDatabase.Tests/Entities/AuditLogEntityTypeConfigurationTests.cs` — use `ModelBuilder` directly to verify the entity is mapped to the `AuditLog` table, PK is `EventId`, and the expected columns + indexes are declared. + +**Steps:** +1. Failing test asserts mapped table name, PK column, and column count. +2. Implement entity configuration; apply in `OnModelCreating`. +3. Failing test asserts the five expected indexes exist on the model. +4. Add `HasIndex` declarations. +5. Run: pass. +6. Commit: `feat(configdb): map AuditEvent to AuditLog table with PK and indexes`. + +#### M1-T6: Generate and customize EF migration for AuditLog + +**Files:** +- Create: `src/ScadaLink.ConfigurationDatabase/Migrations/<timestamp>_AddAuditLogTable.cs` via `dotnet ef migrations add AddAuditLogTable --project ScadaLink.ConfigurationDatabase`. +- Modify: the generated `Up()` / `Down()` to: + - Create the partition function `pf_AuditLog_Month` and partition scheme `ps_AuditLog_Month` (raw SQL via `migrationBuilder.Sql(...)`), tied to a dedicated filegroup (or PRIMARY in dev — configurable via a migration setting). + - Alter the `CreateTable` call (or follow up with `Sql`) to align the table to `ps_AuditLog_Month(OccurredAtUtc)`. + - Add the five indexes generated by EF; ensure each is also partition-aligned where appropriate. 
+- Create: `tests/ScadaLink.ConfigurationDatabase.Tests/Migrations/AddAuditLogTableMigrationTests.cs` — applies the migration to an isolated MS SQL LocalDB instance (existing IntegrationTests harness), asserts table + partition function + scheme + indexes are present. + +**Steps:** +1. Run `dotnet ef migrations add AddAuditLogTable`. +2. Failing integration test: apply migration, query `sys.partition_functions` and `sys.partition_schemes` for the expected names. +3. Edit migration to add the partition function + scheme + alignment. +4. Re-run test: pass. +5. Failing test: query `sys.indexes` for the five expected named indexes. +6. Adjust migration if any index name drifts. +7. Run: pass. +8. Commit: `feat(configdb): add AuditLog migration with monthly partitioning`. + +#### M1-T7: Add DB roles in migration + +**Files:** +- Modify: the M1-T6 migration `Up()` to also create the `scadalink_audit_writer` (INSERT + SELECT only) and `scadalink_audit_purger` (ALTER PARTITION FUNCTION + ALTER TABLE … SWITCH PARTITION + SELECT) roles via raw SQL. Make role creation idempotent (`IF NOT EXISTS`). +- Modify: `Down()` — drop the roles. +- Create: `tests/ScadaLink.ConfigurationDatabase.Tests/Migrations/AuditLogRoleGrantsTests.cs` — applies migration, then runs `SELECT` on `sys.database_role_members` / `sys.database_permissions` to assert the role grants. Plus a smoke test: connect as a user mapped to `scadalink_audit_writer`, attempt `UPDATE AuditLog SET Status = 'X'` and expect a permission error. + +**Steps:** +1. Failing test asserts both roles exist with documented grants. +2. Add `migrationBuilder.Sql(...)` blocks. +3. Run: pass. +4. Failing test: `UPDATE AuditLog` as audit writer → expect SqlException with permission error. +5. Verify the role's permissions deny UPDATE (they should by default since only INSERT + SELECT granted). +6. Run: pass. +7. Commit: `feat(configdb): add scadalink_audit_writer and scadalink_audit_purger roles`. 
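
A minimal sketch of the idempotent role setup described in M1-T7, shown for the writer role only. The `AuditRoleSql` helper, the `dbo` schema, and the decision to build the script as a string are assumptions for illustration; in the migration the resulting text would be passed to `migrationBuilder.Sql(...)`. The exact purger grants depend on the partition-switch design and are left out.

```csharp
using System;

// Print the script a migration could execute for the writer role.
Console.WriteLine(
    AuditRoleSql.CreateRole("scadalink_audit_writer", "dbo.AuditLog", "INSERT", "SELECT"));

// Hypothetical helper (not part of the codebase): builds idempotent role-creation T-SQL.
// Inputs are trusted compile-time constants, so plain string interpolation is acceptable here.
static class AuditRoleSql
{
    public static string CreateRole(string role, string table, params string[] permissions)
    {
        if (permissions.Length == 0)
            throw new ArgumentException("At least one permission is required.", nameof(permissions));

        // type = 'R' restricts the existence check to database roles.
        return
            $"IF NOT EXISTS (SELECT 1 FROM sys.database_principals WHERE name = N'{role}' AND type = 'R')\n" +
            $"    CREATE ROLE [{role}];\n" +
            $"GRANT {string.Join(", ", permissions)} ON OBJECT::{table} TO [{role}];";
    }
}
```

Because the role is granted only `INSERT` and `SELECT`, an `UPDATE` from a user whose sole membership is this role should fail with a permission error, which is what the M1-T7 smoke test checks.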
+ +#### M1-T8: Add IAuditLogRepository + EF implementation + +**Files:** +- Create: `src/ScadaLink.Commons/Interfaces/Repositories/IAuditLogRepository.cs` — `InsertIfNotExistsAsync(AuditEvent, CancellationToken)`, `QueryAsync(filter, paging, CancellationToken)`, `SwitchOutPartitionAsync(monthBoundary, CancellationToken)`. **Deliberately no `UpdateAsync` or row-level `DeleteAsync`.** +- Create: `src/ScadaLink.ConfigurationDatabase/Repositories/AuditLogRepository.cs` — implementation using the DbContext; `InsertIfNotExistsAsync` uses `MERGE` or raw `INSERT … WHERE NOT EXISTS` to satisfy idempotency without throwing on dupes. +- Modify: `ServiceCollectionExtensions.cs` — register `IAuditLogRepository` → `AuditLogRepository` in DI. +- Create: `tests/ScadaLink.ConfigurationDatabase.Tests/Repositories/AuditLogRepositoryTests.cs`. + +**Steps:** +1. Failing test: `InsertIfNotExistsAsync` for a fresh `EventId` writes one row; calling again with the same `EventId` is a no-op (no exception, no second row). +2. Implement; use a `MERGE` or `INSERT … WHERE NOT EXISTS` strategy that does NOT rely on EF change tracking. +3. Run: pass. +4. Failing test: paged `QueryAsync` returns rows in `(OccurredAtUtc desc, EventId desc)` order, respecting filter predicates (channel, kind, status, site, target, actor, correlation, time range). +5. Implement filter projection + keyset paging. +6. Run: pass. +7. Failing test: `SwitchOutPartitionAsync` for the oldest partition removes its rows from the live table. +8. Implement via `migrationBuilder`-style `Sql("ALTER TABLE ... SWITCH PARTITION ... TO ...")` (against a staging table the implementation creates and drops within the same transaction). +9. Run: pass. +10. Commit: `feat(configdb): IAuditLogRepository + EF implementation (append-only, partition-switch purge)`. 
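
The keyset paging in M1-T8 step 4 can be modelled in memory as below. This is a sketch under stated assumptions: `AuditRow`, `Cursor`, and `KeysetPager` are hypothetical names, and the real repository would express the same predicate in SQL rather than LINQ-to-objects. One caveat: SQL Server orders `uniqueidentifier` values differently from .NET's `Guid.CompareTo`, so the `EventId` tie-break has to be evaluated in the database to stay consistent with the index order.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

var start = new DateTime(2026, 5, 20, 12, 0, 0, DateTimeKind.Utc);
var rows = Enumerable.Range(0, 5)
    .Select(i => new AuditRow(Guid.NewGuid(), start.AddSeconds(i / 2))) // some rows share a timestamp
    .ToList();

// Walk every page; the cursor is the last row of the previous page.
Cursor? cursor = null;
while (true)
{
    var page = KeysetPager.Page(rows, cursor, pageSize: 2);
    if (page.Count == 0) break;
    foreach (var row in page) Console.WriteLine($"{row.OccurredAtUtc:O} {row.EventId}");
    var last = page[^1];
    cursor = new Cursor(last.OccurredAtUtc, last.EventId);
}

record AuditRow(Guid EventId, DateTime OccurredAtUtc);
record Cursor(DateTime OccurredAtUtc, Guid EventId);

static class KeysetPager
{
    // Returns the next page strictly after the cursor in (OccurredAtUtc desc, EventId desc) order.
    public static List<AuditRow> Page(IEnumerable<AuditRow> source, Cursor? after, int pageSize) =>
        source
            .Where(r => after is null
                || r.OccurredAtUtc < after.OccurredAtUtc
                || (r.OccurredAtUtc == after.OccurredAtUtc && r.EventId.CompareTo(after.EventId) < 0))
            .OrderByDescending(r => r.OccurredAtUtc)
            .ThenByDescending(r => r.EventId)
            .Take(pageSize)
            .ToList();
}
```

The compound cursor is what keeps rows with identical timestamps from being skipped or repeated across pages, which an offset or a timestamp-only cursor cannot guarantee while new rows are arriving.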
+ +#### M1-T9: Add AuditLogOptions configuration class + binding + +**Files:** +- Create: `src/ScadaLink.AuditLog/Configuration/AuditLogOptions.cs` (new project; see M1-T10, whose skeleton must exist first) — owns `DefaultCapBytes`, `ErrorCapBytes`, `HeaderRedactList`, `GlobalBodyRedactors`, `PerTargetOverrides`, `RetentionDays`, validation attributes. +- Add: validation on startup (`IValidateOptions<AuditLogOptions>`). +- Test: ensure `appsettings.json` bind round-trips and validation rejects out-of-range `RetentionDays`. + +**Steps:** +1. Failing test: bind a valid section → values present. +2. Implement options class + binding. +3. Failing test: bind invalid `RetentionDays` → validator rejects. +4. Implement validator. +5. Run: pass. +6. Commit: `feat(auditlog): add AuditLogOptions config binding`. + +#### M1-T10: Add ScadaLink.AuditLog project skeleton + +**Files:** +- Create: `src/ScadaLink.AuditLog/ScadaLink.AuditLog.csproj` — TargetFramework matches the rest of the solution; ProjectReferences to `ScadaLink.Commons` and `ScadaLink.ConfigurationDatabase`. +- Create: `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs` — `AddAuditLog(this IServiceCollection, IConfiguration)` that registers `AuditLogOptions`, `IAuditLogRepository`, plus placeholders that later milestones will fill (writer impls, actors). +- Create: `tests/ScadaLink.AuditLog.Tests/ScadaLink.AuditLog.Tests.csproj` with one smoke test. +- Modify: `ScadaLink.slnx` — add both projects to the solution. +- Modify: `Directory.Packages.props` if any new package versions are needed. + +**Steps:** +1. Create projects via `dotnet new classlib` / `dotnet new xunit`; add references; add to slnx. +2. Failing test: smoke-test `AddAuditLog()` populates DI with `IAuditLogRepository` and `IOptions<AuditLogOptions>`. +3. Implement `ServiceCollectionExtensions.AddAuditLog`. +4. Run: pass. +5. Commit: `feat(auditlog): scaffold ScadaLink.AuditLog project`. 
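
For the `RetentionDays` validation in M1-T9, one possible shape is plain data annotations checked with `Validator.TryValidateObject`. This is only a sketch: the 1 to 3650 day bounds, the `365` default, the `Authorization` default entry, and the `OptionsCheck` helper are assumptions, and the remaining properties (`GlobalBodyRedactors`, `PerTargetOverrides`) are omitted. The 8 KB and 64 KB caps are the values named in the M5 goal.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

var valid = new AuditLogOptions { RetentionDays = 365 };
var invalid = new AuditLogOptions { RetentionDays = 0 };

Console.WriteLine($"valid options:   {OptionsCheck.Errors(valid).Count} error(s)");
Console.WriteLine($"invalid options: {OptionsCheck.Errors(invalid).Count} error(s)");

sealed class AuditLogOptions
{
    [Range(1, int.MaxValue)] public int DefaultCapBytes { get; set; } = 8 * 1024;
    [Range(1, int.MaxValue)] public int ErrorCapBytes { get; set; } = 64 * 1024;

    // Assumed bounds; the roadmap only says out-of-range values must be rejected.
    [Range(1, 3650)] public int RetentionDays { get; set; } = 365;

    // Assumed default entry for illustration.
    public List<string> HeaderRedactList { get; set; } = new() { "Authorization" };
}

// Hypothetical helper: runs the attribute checks and returns any failures.
static class OptionsCheck
{
    public static List<ValidationResult> Errors(AuditLogOptions options)
    {
        var results = new List<ValidationResult>();
        Validator.TryValidateObject(
            options, new ValidationContext(options), results, validateAllProperties: true);
        return results;
    }
}
```

In the real project the same attributes could be enforced at startup through the options pipeline (for example `ValidateDataAnnotations()` plus `ValidateOnStart()`), or replaced by a hand-written `IValidateOptions<AuditLogOptions>` if cross-property rules are needed.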
+ +#### M1-T11: Update Component-Host.md responsibilities + README component table + +**Files:** +- Modify: `docs/requirements/Component-Host.md` — list `ScadaLink.AuditLog` in the central role's registration set. +- Modify: `README.md` — confirm row #23 link reflects the new project (no functional change; this is a paper-trail update). + +**Steps:** +1. Edit, verify cross-refs, commit: `docs(audit): register ScadaLink.AuditLog project in Host role`. + +--- + +## M2 — Site pipeline (sync-only path) + +**Goal:** First end-to-end audit emission: a script-initiated `ExternalSystem.Call()` produces an audit row in the central `AuditLog` table. No cached paths yet, no notifications, no inbound API, no UI. Just one channel + kind: `ApiOutbound.SyncCall`. + +**Affected projects:** `Commons`, `AuditLog` (new), `Communication`, `Host`, `ExternalSystemGateway`, all matching `*.Tests/`, `tests/ScadaLink.IntegrationTests/`. + +**Acceptance criteria:** +- Site-local `IAuditWriter` writes to a per-site SQLite `auditlog.db` on the hot path with `ForwardState = 'Pending'`; each write becomes durable within a sub-millisecond budget; failures fall back to a bounded in-memory ring buffer and surface a metric. +- `SiteAuditTelemetryActor` drains pending rows in batches via a new `IngestAuditEvents` RPC on the existing `SiteStream` gRPC service; on success it flips `ForwardState` to `'Forwarded'`. +- `AuditLogIngestActor` (central singleton) receives the batch, performs `InsertIfNotExistsAsync` per event, returns ack. +- `ExternalSystem.Call()` emits one `ApiOutbound.SyncCall` row via `IAuditWriter` on every call completion; audit-write failure does NOT abort the script. +- Integration test in `tests/ScadaLink.IntegrationTests/` boots a site + central pair, executes a sync script that calls an external system, and asserts a corresponding row appears in the central `AuditLog` within N seconds. +- No regressions in existing ExternalSystemGateway or Communication tests. 
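
The bounded in-memory fallback named in the first acceptance criterion might look roughly like this. It is a sketch, not the planned implementation: the class shape, the evict-oldest policy, and the `Dropped` counter are assumptions (the roadmap only says the ring is bounded and that failures surface a metric).

```csharp
using System;
using System.Collections.Generic;

var ring = new RingBufferFallback<string>(capacity: 2);
ring.Add("e1");
ring.Add("e2");
ring.Add("e3"); // buffer is full, so the oldest entry (e1) is evicted

Console.WriteLine($"dropped={ring.Dropped}");
foreach (var item in ring.Drain()) Console.WriteLine(item);

// Hypothetical fallback buffer: keeps the newest `capacity` items and counts evictions.
sealed class RingBufferFallback<T>
{
    private readonly Queue<T> _items = new();
    private readonly object _gate = new();
    private readonly int _capacity;
    private long _dropped;

    public RingBufferFallback(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    // Number of events evicted so far; a candidate input for a write-failure metric.
    public long Dropped { get { lock (_gate) return _dropped; } }

    public void Add(T item)
    {
        lock (_gate)
        {
            if (_items.Count == _capacity)
            {
                _items.Dequeue();
                _dropped++;
            }
            _items.Enqueue(item);
        }
    }

    // Removes and returns everything buffered, oldest first, for replay into the SQLite writer.
    public List<T> Drain()
    {
        lock (_gate)
        {
            var drained = new List<T>(_items);
            _items.Clear();
            return drained;
        }
    }
}
```

Evicting the oldest entry favours recent events during a long SQLite outage; rejecting new entries instead would favour the earliest ones. Which is preferable for forensic purposes is a decision for the M2 brainstorm.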
+ +**Task headlines** (each expanded to TDD detail in its own writing-plans pass before execution): +1. Site-local `SqliteAuditWriter` implementing `IAuditWriter` — schema bootstrap, hot-path INSERT, write lock, ring-buffer fallback. Pattern from `SiteEventLogger.cs:28–98`. +2. Bounded in-memory `RingBufferFallback` that drains into the SQLite writer when health returns. +3. `SiteAuditTelemetryActor` actor — periodic drain loop (5s busy / 30s idle), batch INSERT-IF-NOT-EXISTS via gRPC, `ForwardState` transitions. +4. Extend `sitestream.proto`: add `IngestAuditEvents(stream AuditEventBatch) returns (IngestAck)`. Regenerate. Update `SiteStreamGrpcServer.cs` to handle the new RPC. +5. `AuditLogIngestActor` (central singleton) — handles ingest message, calls `IAuditLogRepository.InsertIfNotExistsAsync` per event in a single transaction. +6. Host wiring: register `SiteAuditTelemetryActor` as a site singleton on a **dedicated dispatcher** (per `alog.md` §6.2); register `AuditLogIngestActor` as a central singleton. Reference pattern at `AkkaHostedService.cs:272–280`. +7. ESG sync `Call()` emission hook — add `IAuditWriter` injection; emit `AuditEvent` (channel=ApiOutbound, kind=SyncCall) before returning. Audit-write failures never throw to the script. +8. End-to-end integration test in `IntegrationTests/AuditLog/SyncCallEmissionTests.cs` — site + central wired, script invokes ESG `Call()`, central row appears. +9. Health metric `SiteAuditWriteFailures` (this milestone defines it; M6 surfaces the tile). +10. Update `docker/deploy.sh` / `infra/reseed.sh` if needed so dev clusters can verify locally. + +**Risk callouts:** +- Site SQLite write throughput under load — bench against existing SiteEventLogger numbers. +- gRPC additive evolution: the existing proto uses a `oneof`. Adding a new top-level RPC is safe; embedding new oneof variants is also safe. Confirm message-ordering guarantees aren't violated. 
+- Don't accidentally bind `SiteAuditTelemetryActor` to the same dispatcher used by script blocking I/O; that's a real perf issue (per spec). + +--- + +## M3 — Cached operations + dual-write transaction + (inlined) Site Call Audit foundations + +**Goal:** Cached external calls (`ExternalSystem.CachedCall`) and cached DB writes (`Database.CachedWrite`) produce three kinds of audit row per operation (`CachedEnqueued`, `CachedAttempt` × N, `CachedTerminal`) AND populate the operational `SiteCalls` table at central — in one transaction at central, from a single combined telemetry packet. + +**Affected projects:** `Commons`, `AuditLog`, `SiteCallAudit` (new — minimum-viable surface), `ConfigurationDatabase` (new `SiteCalls` table migration), `ExternalSystemGateway`, `StoreAndForward`, `Host`. Tests across all of them + IntegrationTests. + +**Prerequisite call-out:** This milestone implements the minimum-viable Site Call Audit (#22) surface and cached-call tracking pieces: `TrackedOperationId`, the site-local operation-tracking SQLite table, the `SiteCalls` table at central, and the `CachedCallTelemetry` message (specified in the docs but absent from code, so it must be created from scratch). Full reconciliation, KPIs, and Retry/Discard relay for #22 are deferred — they're not on the critical path for the audit log's combined telemetry. + +**Acceptance criteria:** +- New `SiteCalls` MS SQL table + repo (no partitioning needed; this is operational state, not audit). +- New `CachedCallTelemetry` message in Commons carrying BOTH the cached-call operational fields AND an `AuditEvent` payload. +- Site path: `CachedCall` writes the audit row to site SQLite (`Kind = CachedEnqueued`), creates the site operation-tracking row, and sends a combined telemetry packet. +- Central path: `AuditLogIngestActor` (extended) receives the combined packet, performs **one transaction containing both** the `AuditLog` insert and the `SiteCalls` upsert. 
+- Retry attempt → `Kind = CachedAttempt` audit row + `SiteCalls` status transition. Terminal → `Kind = CachedTerminal` audit row + `SiteCalls` terminal status. +- Integration test asserts: triggering a `CachedCall` that fails transient-then-succeeds produces 3 AuditLog rows + 1 SiteCalls row with `Status = Delivered`, all sharing the same `TrackedOperationId` correlation key. + +**Task headlines:** +1. `TrackedOperationId` GUID newtype in Commons. +2. Site-local SQLite operation-tracking table + repo (matches `alog.md` cached-call tracking design). +3. `CachedCallTelemetry` Commons message carrying both operational fields and `AuditEvent` payload. +4. `SiteCalls` MS SQL table + EF mapping + migration + `ISiteCallAuditRepository` + repo impl. +5. `SiteCallAuditActor` skeleton (singleton, central) — receives telemetry, owns `SiteCalls` upsert via repo. +6. Extend `AuditLogIngestActor` to detect combined telemetry and execute both writes (`AuditLog` insert + `SiteCalls` upsert) in a single `DbContext` transaction. +7. ESG `CachedCall()` emission — produce combined telemetry on every lifecycle transition (enqueue, attempt, terminal). +8. Extend gRPC proto with the combined-telemetry RPC if it's distinct from `IngestAuditEvents`, or fold it into the existing one with a discriminator field (decision in milestone brainstorm). +9. Integration test in `IntegrationTests/AuditLog/CachedCallCombinedTelemetryTests.cs`. + +**Risk callouts:** +- Combined telemetry packet evolution: design the packet so future cached audit-kind additions are non-breaking (oneof or open-field map). +- Single transaction at central spans two tables; ensure connection retry behaviour is correct. +- Idempotency: AuditLog dedups on `EventId`; SiteCalls dedups on `TrackedOperationId`. If telemetry retries and AuditLog already has the row, ensure SiteCalls upsert still runs (no short-circuit). 
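
The idempotency risk above (the `SiteCalls` upsert must still run when the `AuditLog` row already exists) can be illustrated with an in-memory model. Everything here is hypothetical: `CentralStore` and `CombinedTelemetry` stand in for the real repositories and message, and the dictionary and set stand in for the two tables. The upsert is last-write-wins, so out-of-order packets are not handled in this sketch.

```csharp
using System;
using System.Collections.Generic;

var store = new CentralStore();
var packet = new CombinedTelemetry(Guid.NewGuid(), Guid.NewGuid(), "Delivered");

// The audit row arrives first by another route (for example a reconciliation pull).
store.InsertAuditOnly(packet.EventId);

// The combined packet then arrives, and is retried once.
store.Ingest(packet);
store.Ingest(packet);

Console.WriteLine($"audit rows={store.AuditRows}, site calls={store.SiteCallCount}, " +
                  $"status={store.StatusOf(packet.TrackedOperationId)}");

record CombinedTelemetry(Guid EventId, Guid TrackedOperationId, string Status);

// Hypothetical stand-in for the two central tables.
sealed class CentralStore
{
    private readonly HashSet<Guid> _auditLog = new();              // dedups on EventId
    private readonly Dictionary<Guid, string> _siteCalls = new();  // keyed by TrackedOperationId

    public int AuditRows => _auditLog.Count;
    public int SiteCallCount => _siteCalls.Count;
    public string StatusOf(Guid trackedOperationId) => _siteCalls[trackedOperationId];

    public void InsertAuditOnly(Guid eventId) => _auditLog.Add(eventId);

    public void Ingest(CombinedTelemetry p)
    {
        // Insert-if-not-exists: a duplicate EventId is a silent no-op.
        _auditLog.Add(p.EventId);

        // Deliberately NOT guarded by the result of the audit insert:
        // the upsert runs even when the audit row was already present.
        _siteCalls[p.TrackedOperationId] = p.Status;
    }
}
```

Had `Ingest` returned early when `_auditLog.Add` reported a duplicate, the `SiteCalls` row would never have been written in this scenario, which is exactly the short-circuit the risk callout warns against.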
+ +--- + +## M4 — Remaining boundary emission + +**Goal:** Every channel × kind from `Component-AuditLog.md` produces a row when its boundary call fires. + +**Affected projects:** `ExternalSystemGateway` (sync DB writes/reads, cached DB writes), `SiteRuntime` (Database surface exposing them), `NotificationOutbox` (central direct-write of `Attempt`/`Terminal`), `InboundAPI` (middleware). Tests across all. + +**Acceptance criteria:** +- Sync `Database.Connection().Execute()` → `DbOutbound.SyncWrite` row; `ExecuteReader` → `DbOutbound.SyncRead`. Parameter values captured by default; per-connection redaction opt-in supported. +- `Database.CachedWrite` → three lifecycle rows via the combined telemetry built in M3. +- Notification Outbox dispatcher: every delivery attempt writes `Notification.Attempt`; terminal writes `Notification.Terminal`. Site-emitted `Notification.Enqueued` flows through the standard site→central audit path. Audit-write failure never affects delivery. +- Inbound API middleware writes one `ApiInbound.Completed` row per request, before `await next()` returns. API key NAME captured (never material). Audit-write failure does NOT change the HTTP response. + +**Task headlines:** +1. ESG `Database.Connection()` execute hook — wrap `Execute*` / `ExecuteScalar` / `ExecuteReader` to emit before/after audit events. +2. `Database.CachedWrite` combined-telemetry emission (mirror M3's ESG cached path). +3. NotificationOutboxActor extension — inject `ICentralAuditWriter`; write `Notification.Attempt` per dispatcher attempt; write `Notification.Terminal` on terminal transitions; never abort on failure. +4. Site-emitted `Notification.Enqueued` — when a script calls `Notify.To().Send()` (site-side via Store-and-Forward), emit a site audit row (`Notification.Enqueued`); telemetry forwards as usual. +5. 
Inbound API middleware: new `AuditWriteMiddleware` in `src/ScadaLink.InboundAPI/Middleware/` writing `ApiInbound.Completed` before response flush; register in the ASP.NET pipeline. +6. Tests: emission unit tests per call mode, plus 4 integration tests (one per channel). + +**Risk callouts:** +- Inbound API: correlation-id generation needs to be consistent with any upstream tracing headers (W3C `traceparent` if present). +- Notification dispatcher: confirm `ICentralAuditWriter` errors are logged but don't block the dispatch loop. + +--- + +## M5 — Payload + redaction policy + +**Goal:** Payload capture is bounded (8 KB by default, 64 KB on error), headers are redacted by default, SQL parameter values are captured by default with per-connection opt-out, body redactor regexes are configurable per target, and the safety net over-redacts on misconfiguration. + +**Affected projects:** `AuditLog` (policy engine + options), `ExternalSystemGateway` (HTTP header redactors, SQL param redaction hook), `InboundAPI` (header redactors, body capture), `NotificationOutbox` (subject/body capture follows existing rules). Tests. + +**Acceptance criteria:** +- An `IAuditPayloadFilter` service is invoked between event construction and write. It truncates to the default cap; raises the limit to the error cap on non-`Success` rows; applies header redactors; applies body regex redactors; applies SQL parameter redactors (per-connection); and over-redacts on regex error while incrementing `AuditRedactionFailure`. +- Configuration test: changing `appsettings.json` redactors changes runtime behaviour (no rebuild needed for regex changes). +- Bench: 95th-percentile audit emission latency on the hot path stays under N µs at default cap (target to be set during M5 brainstorm). + +**Task headlines:** +1. `IAuditPayloadFilter` + default implementation (header redaction, body regex, SQL parameter redaction, safety net). +2. 
Wire the filter into the emission paths (M2, M3, M4 emitters all call through the filter before handing the `AuditEvent` to the writer). +3. `appsettings.json` schema for the filter (already prepared in M1-T9; M5 plugs the runtime in). +4. Tests: redaction unit tests with known-bad payloads (passwords in JSON, `Authorization` headers, SQL params named `@apikey`). +5. Performance test in `tests/ScadaLink.PerformanceTests/` for the hot-path latency budget. + +**Risk callouts:** +- Regex performance — pre-compile and cache patterns; reject patterns that take too long to compile. +- Redact before truncating: if truncation cuts a redaction target in half, a redactor that runs afterwards may no longer match the remaining fragment. + +--- + +## M6 — Reconciliation, purge, partition maintenance, health metrics + +**Goal:** Self-healing telemetry, monthly partition rollover, daily purge, all five new health metrics live and feeding the existing health-report pipeline. + +**Affected projects:** `AuditLog` (3 new actors: `SiteAuditReconciliationActor`, `AuditLogPurgeActor`, partition-maintenance worker), `Communication` (the `PullAuditEvents` RPC), `HealthMonitoring` (5 new metrics), `ConfigurationDatabase` (partition-roll-forward SQL helper). + +**Acceptance criteria:** +- `SiteAuditReconciliationActor` runs every 5 minutes per site; pulls events the site reports as `Pending`; central performs `InsertIfNotExistsAsync` then signals the site to flip those rows to `Reconciled`. +- `AuditLogPurgeActor` runs daily; for each partition older than `RetentionDays`, switches it out to a staging table and drops the staging table. Emits an `AuditLog:Purged` event with rowcount + duration. +- Partition-maintenance job runs at month boundary to add the next month's partition function range and ensure the scheme has a destination filegroup. +- 5 new health metrics published: per site, `SiteAuditBacklog` (count + oldest + bytes), `SiteAuditWriteFailures`, and `SiteAuditTelemetryStalled`; per central node, `CentralAuditWriteFailures` and `AuditRedactionFailure`. 
+- Integration test: simulated 5-minute central outage → telemetry catches up after recovery via reconciliation, no rows lost; site backlog metric reflects the queue depth and drops as it drains. + +**Task headlines:** +1. `PullAuditEvents` RPC on the existing `SiteStream` gRPC server. +2. `SiteAuditReconciliationActor` with timer + per-site `LastReconciledAt` cursor. +3. `AuditLogPurgeActor` with daily schedule, partition-switch logic via `IAuditLogRepository.SwitchOutPartitionAsync`. +4. Partition-roll-forward helper (raw SQL `migrationBuilder.Sql` equivalent at runtime — likely a `HostedService` that runs once at startup and once per month). +5. Health metric publishing per emitter; integrate with the existing `SiteHealthState` / `CentralHealthAggregator` plumbing. +6. Integration tests for outage/recovery + purge. + +**Risk callouts:** +- Partition switch on an active table — ensure online schema operations don't block ingest; document the window if a brief lock is unavoidable. +- Reconciliation can produce duplicate `Forwarded` ↔ `Reconciled` state flips; ensure idempotency at the site SQLite layer. + +--- + +## M7 — Central UI: new Audit Log page + drill-ins + KPI tiles + +**Goal:** User-visible Audit Log: filter bar, results grid (custom Blazor + Bootstrap, no third-party grid), drilldown drawer with cURL / "show all events" / redaction indicators / pretty-printed payloads. 6 drill-in entry points from existing pages. 3 KPI tiles on the Health dashboard. + +**Affected projects:** `CentralUI`, `CentralUI.Tests`, `CentralUI.PlaywrightTests`. + +**Acceptance criteria:** +- New `Components/Pages/Audit/AuditLogPage.razor` exists; new "Audit" nav group sibling to Notifications. +- All 10 filter elements, 10 grid columns, keyset pagination + default page size 100, drilldown drawer per `Component-AuditLog.md` §10. 
+- Existing `Components/Pages/Monitoring/AuditLog.razor` (the IAuditService config-change viewer) **renamed in code** to `ConfigurationAuditLog.razor`, with URL `/audit/configuration` to match the rename already made in the docs. Drill-ins from existing pages (Notifications, Site Calls, External Systems, Inbound API Keys, Sites, Instances) added. +- 3 KPI tiles added to the Health dashboard; data sourced from `HealthMonitoring`. +- Playwright tests cover: filter narrowing, drilldown drawer, "Copy as cURL" on `ApiInbound` rows, drill-in from Notifications to filtered Audit Log. +- Page access is gated on the `OperationalAudit` read permission; the Export button additionally requires `AuditExport`. + +**Task headlines:** +1. New `Components/Pages/Audit/AuditLogPage.razor` + matching `.razor.cs` code-behind + `.razor.css`. +2. Custom Blazor `` component (multi-select chips for Channel/Kind/Status, autocomplete for Instance/Script). +3. Custom Blazor `` component — keyset paging via `QueryAsync` repository method (M1-T8). +4. `` component — JSON pretty-print, SQL syntax highlight, "Copy as cURL", "Show all events" CorrelationId filter. +5. Rename existing `AuditLog.razor` → `ConfigurationAuditLog.razor` + update routes + update internal links. +6. Drill-in additions to 6 existing pages. +7. 3 KPI tile components on Health dashboard. +8. Server-side CSV export (streaming) with `AuditExport` permission check. +9. Playwright E2E tests. + +**Risk callouts:** +- Permission check at the page level needs to align with the existing role/permission infrastructure (Security #10). +- Keyset paging across the partitioned table needs the right index; M1's `IX_AuditLog_OccurredAtUtc` is the supporting index. + +--- + +## M8 — CLI: `scadalink audit query | export | verify-chain` + +**Goal:** Operator surface for the centralized Audit Log. + +**Affected projects:** `CLI`, `CLI.Tests`, `ManagementService` (new HTTP endpoint), `IntegrationTests`. 
+ +**Acceptance criteria:** +- `scadalink audit query` mirrors the UI filter set; results stream as JSON (default) or table. +- `scadalink audit export` streams server-side to CSV / JSONL / Parquet; requires `AuditExport` permission. +- `scadalink audit verify-chain --month YYYY-MM` is a no-op stub returning a "hash-chain not yet enabled in this release" message and exit code 0 (per v1.x deferral). +- Existing `audit-log query` (IAuditService config-change viewer) **renamed** in code to `audit-config query` to disambiguate; old name kept as a deprecated alias for one minor version. +- Permissions: `audit query` and `audit verify-chain` require `OperationalAudit`; `audit export` additionally requires `AuditExport`. + +**Task headlines:** +1. New `AuditCommands.cs` (separate file from `AuditLogCommands.cs` — the latter stays for the renamed config audit). +2. Build the three subcommands with their flag sets (per CLI doc & `alog.md` §15.1, post-Bundle-D fix). +3. ManagementService HTTP endpoints backing each subcommand. +4. Output formatters (JSON, table) reused from existing CLI patterns. +5. CLI integration tests in `tests/ScadaLink.CLI.Tests/` + `tests/ScadaLink.IntegrationTests/`. +6. Update CLI README + help text. + +**Risk callouts:** +- The CLI rename (`audit-log query` → `audit-config query`) breaks any operator scripts; provide a deprecation alias and document the migration. + +--- + +## Cross-cutting concerns (apply at every milestone) + +- **Branching:** every milestone gets its own `feature/audit-log-mN-` branch; merged with `--no-ff` to `main` on milestone completion. No pushes without explicit user authorization. +- **Tests:** Every task adds tests first (failing test → impl → passing test). Existing tests must keep passing. +- **Commits:** small and frequent. Bite-sized per writing-plans skill. 
+- **Reviews:** per the bundling cadence in user memory — group small adjacent tasks into a single implementer dispatch, run one combined spec+quality review per bundle, then a final cross-bundle review at end of milestone. +- **Docs:** if implementation reveals a design gap, fix the design doc FIRST (in `docs/requirements/Component-AuditLog.md` and/or `alog.md`), commit, then implement. Don't let the code and docs drift. +- **Infra:** the 3 `infra/*` working-tree modifications still uncommitted on `main` are unrelated and stay that way unless the user explicitly addresses them. Use explicit `git add ` throughout, never `git commit -am`. + +--- + +## Per-milestone execution flow (template) + +When a milestone is about to start, run this sequence: + +1. **Brainstorm**: short skill invocation to nail any code-level decisions not fixed in the spec (test fixture placement, migration helper choice, etc.). +2. **Writing-plans**: produce a milestone-specific plan with TDD detail per task — saved to `docs/plans/2026-XX-XX-auditlog-mN-.md` + peer `.tasks.json`. +3. **Subagent-driven execution**: bundle small tasks per cadence preference; per-bundle implementer + combined reviewer; cross-milestone review at end; merge to `main` with `--no-ff`. + +The roadmap is the contract for what each milestone ships; the per-milestone plan is the contract for how it gets built.
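The branching and staging rules under "Cross-cutting concerns" can be sketched as a shell session. This is an illustrative sketch only, run in a throwaway repository: the branch suffix `example`, the file names, and the commit messages are hypothetical and are not taken from the roadmap. It shows explicit-path staging leaving an unrelated working-tree edit uncommitted, and a `--no-ff` merge producing a real merge commit.

```shell
#!/bin/sh
set -eu

# Throwaway repository so the sketch never touches real work.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # force the unborn branch name to "main"
git config user.name "Example"
git config user.email "example@example.invalid"

echo "baseline" > README.md
git add README.md
git commit -q -m "baseline"

# Hypothetical branch name; the real suffix after "mN-" is chosen per milestone.
branch="feature/audit-log-m4-example"
git checkout -q -b "$branch"

# Two files change, but only one belongs to the milestone.
echo "milestone work" > audit.txt
echo "unrelated local edit" >> README.md

# Stage explicit paths only; `git commit -am` would also sweep in README.md.
git add audit.txt
git commit -q -m "M4: example task"

# Merge with --no-ff so the milestone stays visible as a merge commit.
git checkout -q main
git merge -q --no-ff -m "Merge $branch" "$branch"
```

After the merge, `git status --porcelain` still lists `README.md` as modified, which is the same situation as the uncommitted `infra/*` changes: they ride along in the working tree and never enter a milestone commit.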