scadalink-design/docs/plans/2026-05-20-audit-log-code-roadmap.md
Joseph Doherty d3d4a5b13d docs(audit): add 8-milestone code implementation roadmap
Roadmap covering Audit Log (#23) code implementation across 8 milestones
(M1 Foundation → M8 CLI). Reflects the actual state of the codebase —
all 22 prior components have source + tests, but Site Call Audit (#22)
and cached-call tracking are design-only despite being on main; their
minimum surface is inlined into M3.

M1 is laid out at full TDD-level task detail (11 bite-sized tasks).
M2–M8 are at milestone-shape detail (goals, files, task headlines,
acceptance criteria, risk callouts). Per-milestone bite-sized plans
will be generated by brainstorm + writing-plans when each milestone is
about to execute — locking 80 task cards now would mostly be stale by
M5 as M1 reveals codebase realities.

Critical path: M1 → M2 → (M3 ∥ M4 ∥ M5) → M6 → (M7 ∥ M8).

Spec: docs/requirements/Component-AuditLog.md + alog.md (commit
fec0bb1).
2026-05-20 09:22:18 -04:00


Audit Log (#23) Code Implementation Roadmap

For Claude: REQUIRED SUB-SKILL FLOW per milestone: brainstorming → writing-plans → subagent-driven-development. Use docs/requirements/Component-AuditLog.md + alog.md as the spec; this document is the roadmap that sequences milestones and locks acceptance criteria for each. M1 carries full TDD-level task detail; M2–M8 are milestone-shape detail and will be expanded into bite-sized plans by their own writing-plans pass when their turn comes.

Goal: Implement central component #23 Audit Log — append-only forensic + operational record across every script-trust-boundary action — into the existing ScadaLink codebase.

Architecture: Layered alongside (not replacing) the future Notifications/SiteCalls operational stores. Site-local SQLite hot-path append + gRPC telemetry batches + reconciliation pulls; central direct-write for Inbound API and Notification Outbox dispatch; monthly-partitioned MS SQL with single global retention; strict append-only enforced via DB roles. See alog.md for the locked design decisions and Component-AuditLog.md for the component spec.

Tech Stack: Akka.NET (clustering, singletons, ClusterClient), EF Core (MS SQL provider, code-first migrations), Microsoft.Data.SqlClient, Microsoft.Data.Sqlite, gRPC (HTTP/2 server-streaming on the existing SiteStream channel), ASP.NET Core (Inbound API middleware), Blazor Server + Bootstrap (Central UI), System.CommandLine (CLI), xUnit + Akka.TestKit.Xunit2 + NSubstitute (tests).

Spec: /Users/dohertj2/Desktop/scadalink-design/alog.md (validated, immutable; commit fec0bb1). Component design at /Users/dohertj2/Desktop/scadalink-design/docs/requirements/Component-AuditLog.md.


Codebase Reality Check (what already exists)

  • All 22 prior components have source + tests. Audit Log slots in as a new src/ScadaLink.AuditLog/ project plus changes to: Commons, ConfigurationDatabase, Communication (proto), Host (DI + actor registration), ExternalSystemGateway, InboundAPI, NotificationOutbox, HealthMonitoring, CentralUI, CLI, SiteRuntime (audit hook surface).
  • Existing patterns to copy from:
    • Singleton wiring: src/ScadaLink.Host/Actors/AkkaHostedService.cs:272–280 (NotificationOutboxActor; ClusterSingletonManager.Props + manager/proxy pair).
    • EF migration: src/ScadaLink.ConfigurationDatabase/Migrations/20260519050659_AddNotificationsTable.cs — table create + indexes; no partitioning yet — Audit Log will be the first.
    • Site SQLite hot-path: src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:28–98 (single connection, write lock, Channel-based background writer).
    • Site-buffer + forwarder: src/ScadaLink.StoreAndForward/StoreAndForwardStorage + NotificationForwarder show the Pending → Forwarded transition we'll mirror.
    • Actor + repo + test trio: src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs and tests/ScadaLink.NotificationOutbox.Tests/NotificationOutboxActorIngestTests.cs:20 — TestKit base class, NSubstitute repo, Sys.ActorOf, ExpectMsg<T>.
    • gRPC additive: src/ScadaLink.Communication/Protos/sitestream.proto — currently carries only AttributeValueUpdate and AlarmStateUpdate in a oneof; we extend it.
    • CLI command shape: src/ScadaLink.CLI/Commands/AuditLogCommands.cs:153 — System.CommandLine pattern; new group will live alongside it (the file's existing commands are for the IAuditService config audit and stay).
    • Blazor listing page: src/ScadaLink.CentralUI/Components/Pages/Notifications/NotificationReport.razor — filter bar + keyset paging + status badges idiom.
  • AuditLog.razor and AuditLogCommands.cs already exist but they're the IAuditService config-change viewer. Per the design pass we renamed them in docs to "Configuration Audit Log Viewer"; in code they'll be renamed (file + URL + command name) so the new operational Audit Log can take the unqualified name.
  • Test framework: xUnit + Akka.TestKit.Xunit2 + NSubstitute. Integration tests under tests/ScadaLink.IntegrationTests/. Playwright UI tests under tests/ScadaLink.CentralUI.PlaywrightTests/. A tests/ScadaLink.PerformanceTests/ exists for load tests.

Prerequisite: Site Call Audit (#22) + cached-call tracking are NOT implemented in code

The design for both is merged on main (alog.md cached-call tracking section; Component-SiteCallAudit.md), but grep finds zero references to TrackedOperationId or CachedCallTelemetry in src/. This matters because M3 (cached operations + dual-write transaction) cannot be built without them.

Three ways to handle this — pick before M3:

  1. Inline into M3 (Recommended): Implement just enough of Site Call Audit (#22) and cached-call tracking inside M3 — specifically the CachedCallTelemetry message, the operational-tracking SQLite table at sites, the SiteCalls table + repo + SiteCallAuditActor skeleton at central. This makes M3 the biggest milestone but ships a coherent slice (cached calls audited end-to-end).
  2. M0 prerequisite milestone: Implement #22 and cached-call tracking as a separate slice before M3 starts. Cleanest dependency story; slowest to first-audit-row.
  3. Ship Audit Log sync-only first, retrofit cached path later: M1, M2, M4 (sync-only emissions), M5, M6 (no cached features), M7, M8 ship as-is; cached audit is a separate follow-up. Lowest first-shippable scope but leaves cached calls unaudited until much later.

Default choice in this roadmap: (1). M3 absorbs the minimum #22 + cached-call tracking surface needed to make combined telemetry work; the rest of #22 (full reconciliation, KPIs, Retry/Discard relay) can be a follow-up.


Milestone index

| M | Title | Ships | Touches | Depends on |
|---|---|---|---|---|
| M1 | Foundation: schema, types, DB roles, partitioning | Migration deployed; Commons types exist; no observable behavior yet. | Commons, ConfigurationDatabase, ConfigurationDatabase.Tests | (none) |
| M2 | Site pipeline (sync-only path) | One emission path end-to-end (ESG sync Call() audited from script to central row). | Commons, AuditLog (new), Communication (proto), Host, ExternalSystemGateway, all Tests projects, IntegrationTests | M1 |
| M3 | Cached operations + dual-write transaction | Cached external calls and DB writes audited; SiteCalls table populated alongside; combined telemetry packet contract live. | Commons, AuditLog, SiteCallAudit (new), ConfigurationDatabase, ExternalSystemGateway, StoreAndForward, Host | M2; #22 + cached-call tracking inlined here per the prerequisite section |
| M4 | Remaining boundary emission | All four channels emitting: sync DB writes/reads, Notify dispatcher attempt/terminal, Inbound API middleware. | ExternalSystemGateway, InboundAPI, NotificationOutbox, SiteRuntime (Database surface) | M2; M3 (NotificationOutbox terminal/attempt uses ICentralAuditWriter pattern) |
| M5 | Payload + redaction policy | Header redaction, body redactor regex, SQL parameter redaction, safety net, configuration binding. | AuditLog, ExternalSystemGateway, InboundAPI, all emitter projects | M2 |
| M6 | Reconciliation, purge, partition maintenance, health metrics | Self-healing telemetry, monthly partition switch, the five new health metrics + their dashboard tiles. | AuditLog, ConfigurationDatabase (partition maintenance), HealthMonitoring | M2, M3 |
| M7 | Central UI: new Audit Log page + drill-ins + KPI tiles | User-visible Audit Log surface; existing AuditLog.razor renamed to ConfigurationAuditLog. | CentralUI, CentralUI.Tests, CentralUI.PlaywrightTests | M2, M4, M6 |
| M8 | CLI: scadalink audit query / export / verify-chain | Operator surface for query/export; verify-chain is a no-op stub until v1.x hash chain ships. | CLI, ManagementService (HTTP endpoint), CLI.Tests, IntegrationTests | M2 |

Ship-state at end of each milestone is the shippable slice — each milestone leaves the system in a working, testable, deployable state (no half-built actors mid-pipeline). M1 ships no user-visible behaviour but produces a clean foundation; from M2 onward each ships an observable audit capability.

Critical path: M1 → M2 → (M3 ∥ M4 ∥ M5) → M6 → (M7 ∥ M8). M3, M4, M5 can overlap once M2 is solid. M7 and M8 can overlap once M6 lands.


M1 — Foundation: schema, types, DB roles, partitioning

Goal: Land the new AuditLog table (partitioned) and DB roles in MS SQL, plus the Commons types every later milestone needs. After M1 the database is ready and types compile; nothing else changes.

Affected projects:

  • src/ScadaLink.Commons/ — entity, enums, interfaces, message DTOs.
  • src/ScadaLink.ConfigurationDatabase/ — EF mapping, DbContext registration, migration, DB role script, partition function/scheme, retention options.
  • tests/ScadaLink.Commons.Tests/ — enum + record tests.
  • tests/ScadaLink.ConfigurationDatabase.Tests/ — migration tests, repo tests.

Acceptance criteria:

  • dotnet build of the solution succeeds.
  • dotnet ef database update against a dev MS SQL applies the migration; AuditLog table exists, partitioned monthly on OccurredAtUtc, with PK on EventId and the five expected indexes.
  • scadalink_audit_writer and scadalink_audit_purger SQL roles exist with the documented grants; a smoke test confirms UPDATE AuditLog from the writer role fails.
  • AuditEvent record, AuditChannel/AuditKind/AuditStatus enums, IAuditWriter/ICentralAuditWriter interfaces, AuditTelemetryEnvelope/PullAuditEvents message DTOs all exist in Commons in the right folders.
  • IAuditLogRepository interface (Commons) and EF implementation (ConfigurationDatabase) exist; the implementation only exposes InsertIfNotExistsAsync, paged read, and SwitchOutPartitionAsync — no update or row-delete.
  • All new tests pass; no existing tests regress.

M1 — Tasks (TDD-detail)

M1-T1: Add audit enums to Commons

Files:

  • Create: src/ScadaLink.Commons/Types/Enums/AuditChannel.cs, AuditKind.cs, AuditStatus.cs.
  • Create: tests/ScadaLink.Commons.Tests/Types/Enums/AuditEnumTests.cs.

Steps:

  1. Write failing test verifying AuditChannel has exactly ApiOutbound | DbOutbound | Notification | ApiInbound (asserting Enum.GetValues length and members).
  2. Same for AuditKind (10 members per Component-AuditLog.md).
  3. Same for AuditStatus (8 members).
  4. Run: tests fail (enums don't exist). Implement the three enums.
  5. Run tests: pass.
  6. Commit: feat(commons): add Audit{Channel,Kind,Status} enums for #23.
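A minimal sketch of what the M1-T1 test could pin down for AuditChannel, using the four members named in step 1. The explicit numeric values and the `EnumShape` helper are assumptions for illustration; AuditKind and AuditStatus are left out because their member lists live in Component-AuditLog.md.

```csharp
using System;
using System.Linq;

// The four channels named in step 1. The explicit values are an assumption.
public enum AuditChannel
{
    ApiOutbound = 0,
    DbOutbound = 1,
    Notification = 2,
    ApiInbound = 3,
}

// Hypothetical helper: checks that an enum has exactly the expected member names.
public static class EnumShape
{
    public static bool HasExactly<TEnum>(params string[] expected) where TEnum : struct, Enum
    {
        var actual = Enum.GetNames<TEnum>();
        // Compare as sorted sequences so declaration order does not matter.
        return actual.Length == expected.Length
            && actual.OrderBy(n => n, StringComparer.Ordinal)
                     .SequenceEqual(expected.OrderBy(n => n, StringComparer.Ordinal));
    }
}
```

An xUnit test would then assert `EnumShape.HasExactly<AuditChannel>("ApiOutbound", "DbOutbound", "Notification", "ApiInbound")`, which fails as soon as a member is added or renamed without updating the spec.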

M1-T2: Add AuditEvent record + ForwardState enum

Files:

  • Create: src/ScadaLink.Commons/Entities/Audit/AuditEvent.cs — public record carrying all 20 central columns (per alog.md §4) plus a nullable ForwardState? for the site-local variant.
  • Create: src/ScadaLink.Commons/Types/Enums/AuditForwardState.cs (Pending | Forwarded | Reconciled).
  • Create: tests/ScadaLink.Commons.Tests/Entities/Audit/AuditEventTests.cs.

Steps:

  1. Write failing test that constructs an AuditEvent, sets every property, and round-trips via with expressions — asserts immutability and required-property behaviour.
  2. Run: fail (type doesn't exist). Implement the record.
  3. Run: pass.
  4. Commit: feat(commons): add AuditEvent record + ForwardState enum.
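The round-trip in step 1 could look roughly like the sketch below. It assumes C# 11 (`required` members) and shows only an illustrative subset of properties; the real record carries all 20 columns from alog.md §4, and typing Channel, Kind, and Status as strings here is a simplification. `AuditEventDemo` is a hypothetical helper.

```csharp
using System;

public enum AuditForwardState { Pending, Forwarded, Reconciled }

// Illustrative subset only; property names other than EventId, OccurredAtUtc,
// Status and ForwardState are assumptions.
public sealed record AuditEvent
{
    public required Guid EventId { get; init; }
    public required DateTime OccurredAtUtc { get; init; }
    public required string Channel { get; init; }
    public required string Kind { get; init; }
    public required string Status { get; init; }

    // Null on central rows; populated only for the site-local variant.
    public AuditForwardState? ForwardState { get; init; }
}

public static class AuditEventDemo
{
    // Builds an event, then derives a modified copy with a `with` expression.
    public static (AuditEvent Original, AuditEvent Copy) ForwardedCopy()
    {
        var original = new AuditEvent
        {
            EventId = Guid.NewGuid(),
            OccurredAtUtc = DateTime.UtcNow,
            Channel = "ApiOutbound",
            Kind = "SyncCall",
            Status = "Success",
            ForwardState = AuditForwardState.Pending,
        };
        // `with` copies every property and overrides only the ones listed.
        var copy = original with { ForwardState = AuditForwardState.Forwarded };
        return (original, copy);
    }
}
```

Because the properties are init-only, the original instance is untouched by the `with` expression, which is the immutability the test is meant to assert.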

M1-T3: Add IAuditWriter and ICentralAuditWriter

Files:

  • Create: src/ScadaLink.Commons/Interfaces/Services/IAuditWriter.cs, ICentralAuditWriter.cs.
  • Create: tests/ScadaLink.Commons.Tests/Interfaces/Services/AuditWriterContractTests.cs (smoke — only that the interfaces exist and have the documented signatures).

Steps:

  1. Write failing reflection-based test asserting both interfaces expose Task WriteAsync(AuditEvent, CancellationToken).
  2. Run: fail. Implement both interfaces; document each with XML doc comments naming Audit Log #23 as the owner.
  3. Run: pass.
  4. Commit: feat(commons): add IAuditWriter and ICentralAuditWriter.
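A sketch of the reflection-based contract check from step 1. The `AuditEvent` placeholder and the `AuditWriterContract` helper are stand-ins for illustration; only the `Task WriteAsync(AuditEvent, CancellationToken)` signature comes from the task description.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Placeholder so the sketch compiles on its own.
public sealed record AuditEvent(Guid EventId);

public interface IAuditWriter
{
    Task WriteAsync(AuditEvent auditEvent, CancellationToken cancellationToken);
}

public interface ICentralAuditWriter
{
    Task WriteAsync(AuditEvent auditEvent, CancellationToken cancellationToken);
}

// Hypothetical helper a contract test could call for each interface.
public static class AuditWriterContract
{
    public static bool HasWriteAsync(Type writerInterface)
    {
        // GetMethod returns null when no public method matches the name and parameter types.
        var method = writerInterface.GetMethod(
            "WriteAsync",
            new[] { typeof(AuditEvent), typeof(CancellationToken) });
        return method != null && method.ReturnType == typeof(Task);
    }
}
```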

M1-T4: Add audit telemetry + pull message DTOs

Files:

  • Create: src/ScadaLink.Commons/Messages/Integration/AuditTelemetryEnvelope.cs, PullAuditEventsRequest.cs, PullAuditEventsResponse.cs.
  • Create: tests/ScadaLink.Commons.Tests/Messages/Integration/AuditTelemetryMessagesTests.cs.

Steps:

  1. Failing test: construct envelope with a batch of 3 events, assert immutability + batch enumerability.
  2. Failing test: pull request carries SinceUtc + BatchSize; response carries events + MoreAvailable.
  3. Implement.
  4. Run: pass.
  5. Commit: feat(commons): add audit telemetry + pull message DTOs.
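One possible shape for the three DTOs, sketched under the assumption that the envelope snapshots its batch into an `ImmutableArray` so later changes to the caller's list cannot leak in. The `AuditEvent` placeholder and the `Events` property name are illustrative; `SinceUtc`, `BatchSize`, and `MoreAvailable` come from step 2.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Immutable;

// Placeholder so the sketch compiles on its own.
public sealed record AuditEvent(Guid EventId);

public sealed record AuditTelemetryEnvelope
{
    public ImmutableArray<AuditEvent> Events { get; }

    // Copies the batch at construction time; the envelope never sees later edits.
    public AuditTelemetryEnvelope(IEnumerable<AuditEvent> events)
    {
        Events = events.ToImmutableArray();
    }
}

public sealed record PullAuditEventsRequest(DateTime SinceUtc, int BatchSize);

public sealed record PullAuditEventsResponse(ImmutableArray<AuditEvent> Events, bool MoreAvailable);
```

Note that record equality on an `ImmutableArray` property compares the underlying array reference, not its contents, so tests should compare `Events` element by element.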

M1-T5: Extend ScadaLinkDbContext with AuditLogs DbSet + entity config

Files:

  • Modify: src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs — add public DbSet<AuditEvent> AuditLogs => Set<AuditEvent>(); at the appropriate position (after Notifications).
  • Create: src/ScadaLink.ConfigurationDatabase/Entities/AuditLogEntityTypeConfiguration.cs, an IEntityTypeConfiguration<AuditEvent> mapping the columns, types, length constraints, and indexes per alog.md §4. Note: this is an EF mapping only; the partition function and scheme are created in the SQL migration (next task) since EF Core doesn't model them natively.
  • Modify: OnModelCreating — apply the new configuration.
  • Create: tests/ScadaLink.ConfigurationDatabase.Tests/Entities/AuditLogEntityTypeConfigurationTests.cs — use ModelBuilder directly to verify the entity is mapped to AuditLog table, PK is EventId, and the expected columns + indexes are declared.

Steps:

  1. Failing test asserts mapped table name, PK column, and column count.
  2. Implement entity configuration; apply in OnModelCreating.
  3. Failing test asserts the five expected indexes exist on the model.
  4. Add HasIndex declarations.
  5. Run: pass.
  6. Commit: feat(configdb): map AuditEvent to AuditLog table with PK and indexes.

M1-T6: Generate and customize EF migration for AuditLog

Files:

  • Create: src/ScadaLink.ConfigurationDatabase/Migrations/<timestamp>_AddAuditLogTable.cs via dotnet ef migrations add AddAuditLogTable --project ScadaLink.ConfigurationDatabase.
  • Modify: the generated Up() / Down() to:
    • Create the partition function pf_AuditLog_Month and partition scheme ps_AuditLog_Month (raw SQL via migrationBuilder.Sql(...)), tied to a dedicated filegroup (or PRIMARY in dev — configurable via a migration setting).
    • Alter the CreateTable call (or follow up with Sql) to align the table to ps_AuditLog_Month(OccurredAtUtc).
    • Add the five indexes generated by EF; ensure each is also partition-aligned where appropriate.
  • Create: tests/ScadaLink.ConfigurationDatabase.Tests/Migrations/AddAuditLogTableMigrationTests.cs — applies the migration to an isolated MS SQL LocalDB instance (existing IntegrationTests harness), asserts table + partition function + scheme + indexes are present.

Steps:

  1. Run dotnet ef migrations add AddAuditLogTable.
  2. Failing integration test: apply migration, query sys.partition_functions and sys.partition_schemes for the expected names.
  3. Edit migration to add the partition function + scheme + alignment.
  4. Re-run test: pass.
  5. Failing test: query sys.indexes for the five expected named indexes.
  6. Adjust migration if any index name drifts.
  7. Run: pass.
  8. Commit: feat(configdb): add AuditLog migration with monthly partitioning.
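The raw SQL for step 3 could be generated by a small helper like the one sketched here. The `AuditLogPartitionSql` class, the `datetime2(7)` column type, and the `RANGE RIGHT` choice are assumptions; only the names pf_AuditLog_Month and ps_AuditLog_Month come from the task description.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper that builds the T-SQL passed to migrationBuilder.Sql(...).
public static class AuditLogPartitionSql
{
    // First-of-month boundaries, one per month, starting at the month containing firstMonth.
    public static IReadOnlyList<DateTime> MonthBoundaries(DateTime firstMonth, int count)
    {
        var start = new DateTime(firstMonth.Year, firstMonth.Month, 1, 0, 0, 0, DateTimeKind.Utc);
        return Enumerable.Range(0, count).Select(i => start.AddMonths(i)).ToList();
    }

    // RANGE RIGHT puts each boundary value in the partition to its right,
    // so every partition covers exactly one calendar month.
    public static string CreateFunction(IEnumerable<DateTime> boundaries)
    {
        var values = string.Join(", ", boundaries.Select(
            b => "'" + b.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture) + "'"));
        return "CREATE PARTITION FUNCTION pf_AuditLog_Month (datetime2(7)) "
             + "AS RANGE RIGHT FOR VALUES (" + values + ");";
    }

    // Maps every partition to one filegroup (PRIMARY in dev, per the task description).
    public static string CreateScheme(string filegroup)
    {
        return "CREATE PARTITION SCHEME ps_AuditLog_Month "
             + "AS PARTITION pf_AuditLog_Month ALL TO ([" + filegroup + "]);";
    }
}
```

In `Up()` this would be called as `migrationBuilder.Sql(AuditLogPartitionSql.CreateFunction(...))`. One thing worth checking during this task: SQL Server requires the partitioning column to be part of the key of any partition-aligned unique index, so a primary key on EventId alone may need to include OccurredAtUtc or stay non-aligned.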

M1-T7: Add DB roles in migration

Files:

  • Modify: the M1-T6 migration Up() to also create the scadalink_audit_writer (INSERT + SELECT only) and scadalink_audit_purger (ALTER PARTITION FUNCTION + ALTER TABLE … SWITCH PARTITION + SELECT) roles via raw SQL. Make role creation idempotent (IF NOT EXISTS).
  • Modify: Down() — drop the roles.
  • Create: tests/ScadaLink.ConfigurationDatabase.Tests/Migrations/AuditLogRoleGrantsTests.cs — applies migration, then runs SELECT on sys.database_role_members / sys.database_permissions to assert the role grants. Plus a smoke test: connect as a user mapped to scadalink_audit_writer, attempt UPDATE AuditLog SET Status = 'X' and expect a permission error.

Steps:

  1. Failing test asserts both roles exist with documented grants.
  2. Add migrationBuilder.Sql(...) blocks.
  3. Run: pass.
  4. Failing test: UPDATE AuditLog as audit writer → expect SqlException with permission error.
  5. Verify the role's permissions deny UPDATE (they should by default since only INSERT + SELECT granted).
  6. Run: pass.
  7. Commit: feat(configdb): add scadalink_audit_writer and scadalink_audit_purger roles.
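A sketch of the idempotent role creation for the writer role only. The `AuditLogRoleSql` helper and the `dbo` schema are assumptions; the purger role is omitted because its grants depend on how the partition-switch permissions are scoped.

```csharp
// Hypothetical helper producing the T-SQL for migrationBuilder.Sql(...).
public static class AuditLogRoleSql
{
    // sys.database_principals lists database roles with type = 'R';
    // guarding on it makes re-running the migration harmless.
    public static string CreateRoleIfMissing(string role)
    {
        return "IF NOT EXISTS (SELECT 1 FROM sys.database_principals "
             + "WHERE name = N'" + role + "' AND type = 'R') "
             + "CREATE ROLE [" + role + "];";
    }

    // INSERT + SELECT only. UPDATE and DELETE are never granted, so they are
    // refused unless the user gains them through some other role membership.
    public static string GrantWriter(string role)
    {
        return "GRANT INSERT, SELECT ON dbo.AuditLog TO [" + role + "];";
    }
}
```

The smoke test in step 4 only holds if the test user has no other role (for example db_datawriter) that grants UPDATE.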

M1-T8: Add IAuditLogRepository + EF implementation

Files:

  • Create: src/ScadaLink.Commons/Interfaces/Repositories/IAuditLogRepository.cs with InsertIfNotExistsAsync(AuditEvent, CancellationToken), QueryAsync(filter, paging, CancellationToken), SwitchOutPartitionAsync(monthBoundary, CancellationToken). Deliberately no UpdateAsync or row-level DeleteAsync.
  • Create: src/ScadaLink.ConfigurationDatabase/Repositories/AuditLogRepository.cs — implementation using the DbContext; InsertIfNotExistsAsync uses MERGE or raw INSERT … WHERE NOT EXISTS to satisfy idempotency without throwing on dupes.
  • Modify: ServiceCollectionExtensions.cs to register IAuditLogRepository → AuditLogRepository in DI.
  • Create: tests/ScadaLink.ConfigurationDatabase.Tests/Repositories/AuditLogRepositoryTests.cs.

Steps:

  1. Failing test: InsertIfNotExistsAsync for a fresh EventId writes one row; calling again with the same EventId is a no-op (no exception, no second row).
  2. Implement; use a MERGE or INSERT … WHERE NOT EXISTS strategy that does NOT rely on EF change tracking.
  3. Run: pass.
  4. Failing test: paged QueryAsync returns rows in (OccurredAtUtc desc, EventId desc) order, respecting filter predicates (channel, kind, status, site, target, actor, correlation, time range).
  5. Implement filter projection + keyset paging.
  6. Run: pass.
  7. Failing test: SwitchOutPartitionAsync for the oldest partition removes its rows from the live table.
  8. Implement via migrationBuilder-style Sql("ALTER TABLE ... SWITCH PARTITION ... TO ...") (against a staging table the implementation creates and drops within the same transaction).
  9. Run: pass.
  10. Commit: feat(configdb): IAuditLogRepository + EF implementation (append-only, partition-switch purge).
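The keyset paging from steps 4 and 5 can be illustrated in memory before it is translated to SQL. This is a sketch only: `AuditRow`, `AuditCursor`, and `KeysetPaging` are hypothetical names, and the filter predicates are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record AuditRow(Guid EventId, DateTime OccurredAtUtc);

// The last row of the previous page; pass null to request the first page.
public sealed record AuditCursor(DateTime OccurredAtUtc, Guid EventId);

public static class KeysetPaging
{
    public static IReadOnlyList<AuditRow> NextPage(IEnumerable<AuditRow> rows, AuditCursor after, int pageSize)
    {
        // Order by (OccurredAtUtc desc, EventId desc); EventId breaks timestamp ties.
        IEnumerable<AuditRow> page = rows
            .OrderByDescending(r => r.OccurredAtUtc)
            .ThenByDescending(r => r.EventId);

        if (after != null)
        {
            // Keep only rows strictly after the cursor in that ordering.
            page = page.Where(r =>
                r.OccurredAtUtc < after.OccurredAtUtc
                || (r.OccurredAtUtc == after.OccurredAtUtc
                    && r.EventId.CompareTo(after.EventId) < 0));
        }

        return page.Take(pageSize).ToList();
    }
}
```

Unlike OFFSET paging, the cursor stays stable while new rows are appended at the head. When this moves into SQL, keep in mind that SQL Server orders uniqueidentifier values differently from `Guid.CompareTo`, so the tie-break comparison has to be evaluated on one side only.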

M1-T9: Add AuditLogOptions configuration class + binding

Files:

  • Create: src/ScadaLink.AuditLog/Configuration/AuditLogOptions.cs (new project; see M1-T10). Owns DefaultCapBytes, ErrorCapBytes, HeaderRedactList, GlobalBodyRedactors, PerTargetOverrides, RetentionDays, and validation attributes.
  • Add: validation on startup (IValidateOptions<AuditLogOptions>).
  • Test: ensure appsettings.json bind round-trips and validation rejects out-of-range RetentionDays.

Steps:

  1. Failing test: bind a valid section → values present.
  2. Implement options class + binding.
  3. Failing test: bind invalid RetentionDays → validator rejects.
  4. Implement validator.
  5. Run: pass.
  6. Commit: feat(auditlog): add AuditLogOptions config binding.

M1-T10: Add ScadaLink.AuditLog project skeleton

Files:

  • Create: src/ScadaLink.AuditLog/ScadaLink.AuditLog.csproj — TargetFramework matches the rest of the solution; ProjectReferences to ScadaLink.Commons and ScadaLink.ConfigurationDatabase.
  • Create: src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs with AddAuditLog(this IServiceCollection, IConfiguration), which registers AuditLogOptions, IAuditLogRepository, plus placeholders that later milestones will fill (writer impls, actors).
  • Create: tests/ScadaLink.AuditLog.Tests/ScadaLink.AuditLog.Tests.csproj with one smoke test.
  • Modify: ScadaLink.slnx — add both projects to the solution.
  • Modify: Directory.Packages.props if any new package versions are needed.

Steps:

  1. Create projects via dotnet new classlib / dotnet new xunit; add references; add to slnx.
  2. Failing test: smoke-test AddAuditLog() populates DI with IAuditLogRepository and IOptions<AuditLogOptions>.
  3. Implement ServiceCollectionExtensions.AddAuditLog.
  4. Run: pass.
  5. Commit: feat(auditlog): scaffold ScadaLink.AuditLog project.

M1-T11: Update Component-Host.md responsibilities + README component table

Files:

  • Modify: docs/requirements/Component-Host.md — list ScadaLink.AuditLog in the central role's registration set.
  • Modify: README.md — confirm row #23 link reflects the new project (no functional change; this is a paper-trail update).

Steps:

  1. Edit, verify cross-refs, commit: docs(audit): register ScadaLink.AuditLog project in Host role.

M2 — Site pipeline (sync-only path)

Goal: First end-to-end audit emission: a script-initiated ExternalSystem.Call() produces an audit row in the central AuditLog table. No cached paths yet, no notifications, no inbound API, no UI. Just one channel + kind: ApiOutbound.SyncCall.

Affected projects: Commons, AuditLog (new), Communication, Host, ExternalSystemGateway, all matching *.Tests/, tests/ScadaLink.IntegrationTests/.

Acceptance criteria:

  • Site-local IAuditWriter writes to a per-site SQLite auditlog.db on the hot path with ForwardState = 'Pending'; durability is sub-millisecond; failures fall back to a bounded in-memory ring and surface a metric.
  • SiteAuditTelemetryActor drains pending rows in batches via a new IngestAuditEvents RPC on the existing SiteStream gRPC service; on success flips ForwardState = 'Forwarded'.
  • AuditLogIngestActor (central singleton) receives the batch, performs InsertIfNotExistsAsync per event, returns ack.
  • ExternalSystem.Call() emits one ApiOutbound.SyncCall row via IAuditWriter on every call completion; audit-write failure does NOT abort the script.
  • Integration test in tests/ScadaLink.IntegrationTests/ boots a site + central pair, executes a sync script that calls an external system, and asserts a corresponding row appears in the central AuditLog within N seconds.
  • No regressions in existing ExternalSystemGateway or Communication tests.

Task headlines (each expanded to TDD detail in its own writing-plans pass before execution):

  1. Site-local SqliteAuditWriter implementing IAuditWriter: schema bootstrap, hot-path INSERT, write lock, ring-buffer fallback. Pattern from SiteEventLogger.cs:28–98.
  2. Bounded in-memory RingBufferFallback that drains into the SQLite writer when health returns.
  3. SiteAuditTelemetryActor: periodic drain loop (5s busy / 30s idle), batch INSERT-IF-NOT-EXISTS via gRPC, ForwardState transitions.
  4. Extend sitestream.proto: add IngestAuditEvents(stream AuditEventBatch) returns (IngestAck). Regenerate. Update SiteStreamGrpcServer.cs to handle the new RPC.
  5. AuditLogIngestActor (central singleton) — handles ingest message, calls IAuditLogRepository.InsertIfNotExistsAsync per event in a single transaction.
  6. Host wiring: register SiteAuditTelemetryActor as a site singleton on a dedicated dispatcher (per alog.md §6.2); register AuditLogIngestActor as a central singleton. Reference pattern at AkkaHostedService.cs:272–280.
  7. ESG sync Call() emission hook — add IAuditWriter injection; emit AuditEvent (channel=ApiOutbound, kind=SyncCall) before returning. Audit-write failures never throw to the script.
  8. End-to-end integration test in IntegrationTests/AuditLog/SyncCallEmissionTests.cs — site + central wired, script invokes ESG Call(), central row appears.
  9. Health metric SiteAuditWriteFailures (this milestone defines it; M6 surfaces the tile).
  10. Update docker/deploy.sh / infra/reseed.sh if needed so dev clusters can verify locally.
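The bounded fallback from headline 2 might be as small as the sketch below. The drop-oldest overflow policy and the `Dropped` counter are assumptions; the roadmap only says the ring is bounded, in-memory, and drains into the SQLite writer once it is healthy again.

```csharp
using System;
using System.Collections.Generic;

// Sketch of a bounded in-memory fallback; overflow policy is an assumption.
public sealed class RingBufferFallback<T>
{
    private readonly Queue<T> _items = new Queue<T>();
    private readonly object _gate = new object();
    private readonly int _capacity;

    // Number of items evicted because the ring was full; could feed a health metric.
    public long Dropped { get; private set; }

    public RingBufferFallback(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    public void Add(T item)
    {
        lock (_gate)
        {
            if (_items.Count == _capacity)
            {
                // Full: evict the oldest entry so memory stays bounded.
                _items.Dequeue();
                Dropped++;
            }
            _items.Enqueue(item);
        }
    }

    // Removes and returns everything buffered, oldest first.
    public IReadOnlyList<T> Drain()
    {
        lock (_gate)
        {
            var drained = _items.ToArray();
            _items.Clear();
            return drained;
        }
    }
}
```

The writer would call `Add` when the SQLite INSERT fails and `Drain` once a later write succeeds, replaying the drained events through the normal path.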

Risk callouts:

  • Site SQLite write throughput under load — bench against existing SiteEventLogger numbers.
  • gRPC additive evolution: the existing proto uses a oneof. Adding a new top-level RPC is safe; embedding new oneof variants is also safe. Confirm message-ordering guarantees aren't violated.
  • Don't accidentally bind SiteAuditTelemetryActor to the same dispatcher used by script blocking I/O; that's a real perf issue (per spec).

M3 — Cached operations + dual-write transaction + (inlined) Site Call Audit foundations

Goal: Cached external calls (ExternalSystem.CachedCall) and cached DB writes (Database.CachedWrite) produce three kinds of audit row per operation (CachedEnqueued, CachedAttempt × N, CachedTerminal) AND populate the operational SiteCalls table at central, both written in one central transaction from a single combined telemetry packet.

Affected projects: Commons, AuditLog, SiteCallAudit (new — minimum-viable surface), ConfigurationDatabase (new SiteCalls table migration), ExternalSystemGateway, StoreAndForward, Host. Tests across all of them + IntegrationTests.

Prerequisite call-out: This milestone implements the minimum-viable Site Call Audit (#22) surface and cached-call tracking pieces: TrackedOperationId, the site-local operation-tracking SQLite table, the SiteCalls table at central, and the CachedCallTelemetry message (described in the docs as existing, but it must be created from scratch because it does not exist in code). Full reconciliation, KPIs, and Retry/Discard relay for #22 are deferred; they're not on the critical path for the audit log's combined telemetry.

Acceptance criteria:

  • New SiteCalls MS SQL table + repo (no partitioning needed; this is operational state, not audit).
  • New CachedCallTelemetry message in Commons carrying BOTH the cached-call operational fields AND an AuditEvent payload.
  • Site path: CachedCall writes the audit row to site SQLite (Kind = CachedEnqueued), creates the site operation-tracking row, and sends a combined telemetry packet.
  • Central path: AuditLogIngestActor (extended) receives the combined packet, performs one transaction containing both the AuditLog insert and the SiteCalls upsert.
  • Retry attempt → Kind = CachedAttempt audit row + SiteCalls status transition. Terminal → Kind = CachedTerminal audit row + SiteCalls terminal status.
  • Integration test asserts: triggering a CachedCall that fails transient-then-succeeds produces 3 AuditLog rows + 1 SiteCalls row with Status = Delivered, all sharing the same TrackedOperationId correlation key.

Task headlines:

  1. TrackedOperationId GUID newtype in Commons.
  2. Site-local SQLite operation-tracking table + repo (matches alog.md cached-call tracking design).
  3. CachedCallTelemetry Commons message carrying both operational fields and AuditEvent payload.
  4. SiteCalls MS SQL table + EF mapping + migration + ISiteCallAuditRepository + repo impl.
  5. SiteCallAuditActor skeleton (singleton, central) — receives telemetry, owns SiteCalls upsert via repo.
  6. Extend AuditLogIngestActor to detect combined telemetry and execute both writes (AuditLog insert + SiteCalls upsert) in a single DbContext transaction.
  7. ESG CachedCall() emission — produce combined telemetry on every lifecycle transition (enqueue, attempt, terminal).
  8. Extend gRPC proto with the combined-telemetry RPC if it's distinct from IngestAuditEvents, or fold it into the existing one with a discriminator field (decision in milestone brainstorm).
  9. Integration test in IntegrationTests/AuditLog/CachedCallCombinedTelemetryTests.cs.

Risk callouts:

  • Combined telemetry packet evolution: design the packet so future cached audit-kind additions are non-breaking (oneof or open-field map).
  • Single transaction at central spans two tables; ensure connection retry behaviour is correct.
  • Idempotency: AuditLog dedups on EventId; SiteCalls dedups on TrackedOperationId. If telemetry retries and AuditLog already has the row, ensure SiteCalls upsert still runs (no short-circuit).
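The no-short-circuit rule in the last callout can be shown with an in-memory stand-in for the two tables. This is a sketch under stated assumptions: `CombinedTelemetry` and `InMemoryCombinedIngest` are hypothetical, and the real implementation would run both writes inside one DbContext transaction.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical packet carrying both the audit key and the operational key.
public sealed record CombinedTelemetry(Guid EventId, Guid TrackedOperationId, string Kind, string SiteCallStatus);

public sealed class InMemoryCombinedIngest
{
    private readonly HashSet<Guid> _auditLog = new HashSet<Guid>();
    private readonly Dictionary<Guid, string> _siteCalls = new Dictionary<Guid, string>();

    public int AuditRowCount => _auditLog.Count;

    public string StatusOf(Guid trackedOperationId) => _siteCalls[trackedOperationId];

    // Returns true when a new audit row was inserted.
    // The SiteCalls upsert runs either way, even when the audit row is a duplicate.
    public bool Ingest(CombinedTelemetry packet)
    {
        // AuditLog side: dedup on EventId (HashSet.Add returns false for a repeat).
        bool inserted = _auditLog.Add(packet.EventId);

        // SiteCalls side: upsert keyed on TrackedOperationId, deliberately not
        // guarded by `inserted`.
        _siteCalls[packet.TrackedOperationId] = packet.SiteCallStatus;

        return inserted;
    }
}
```

A blind overwrite like this would let a late, out-of-order packet regress the status; the real upsert probably needs an ordering guard, which is a design question for the milestone brainstorm.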

M4 — Remaining boundary emission

Goal: Every channel × kind from Component-AuditLog.md produces a row when its boundary call fires.

Affected projects: ExternalSystemGateway (sync DB writes/reads, cached DB writes), SiteRuntime (Database surface exposing them), NotificationOutbox (central direct-write of Attempt/Terminal), InboundAPI (middleware). Tests across all.

Acceptance criteria:

  • Sync Database.Connection().Execute() → DbOutbound.SyncWrite row; ExecuteReader → DbOutbound.SyncRead. Parameter values captured by default; per-connection redaction opt-in supported.
  • Database.CachedWrite → three lifecycle rows via the combined telemetry built in M3.
  • Notification Outbox dispatcher: every delivery attempt writes Notification.Attempt; terminal writes Notification.Terminal. Site-emitted Notification.Enqueued flows through the standard site→central audit path. Audit-write failure never affects delivery.
  • Inbound API middleware writes one ApiInbound.Completed row per request, before await next() returns. API key NAME captured (never material). Audit-write failure does NOT change the HTTP response.

Task headlines:

  1. ESG Database.Connection() execute hook — wrap Execute* / ExecuteScalar / ExecuteReader to emit before/after audit events.
  2. Database.CachedWrite combined-telemetry emission (mirror M3's ESG cached path).
  3. NotificationOutboxActor extension — inject ICentralAuditWriter; write Notification.Attempt per dispatcher attempt; write Notification.Terminal on terminal transitions; never abort on failure.
  4. Site-emitted Notification.Enqueued — when a script calls Notify.To().Send() (site-side via Store-and-Forward), emit a site audit row (Notification.Enqueued); telemetry forwards as usual.
  5. Inbound API middleware: new AuditWriteMiddleware in src/ScadaLink.InboundAPI/Middleware/ writing ApiInbound.Completed before response flush; register in the ASP.NET pipeline.
  6. Tests: emission unit tests per call mode, plus 4 integration tests (one per channel).

Risk callouts:

  • Inbound API: correlation-id generation needs to be consistent with any upstream tracing headers (W3C traceparent if present).
  • Notification dispatcher: confirm ICentralAuditWriter errors are logged but don't block the dispatch loop.
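For the Inbound API callout, one way to stay consistent with upstream tracing is to reuse the trace-id field of a W3C `traceparent` header when it is well formed and to generate a fresh id otherwise. The `CorrelationIds` helper is hypothetical, and using the trace-id as the audit correlation id is an assumption, not a decision recorded in the spec.

```csharp
using System;
using System.Linq;

// Hypothetical helper for the Inbound API audit middleware.
public static class CorrelationIds
{
    // traceparent layout: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags.
    public static string FromTraceparent(string traceparent)
    {
        if (!string.IsNullOrWhiteSpace(traceparent))
        {
            var parts = traceparent.Trim().Split('-');
            bool wellFormed =
                parts.Length >= 4
                && parts[0].Length == 2
                && parts[1].Length == 32
                && parts[1].All(IsLowerHex)
                && parts[1].Any(c => c != '0'); // an all-zero trace-id is invalid
            if (wellFormed)
            {
                return parts[1];
            }
        }

        // No usable header: fall back to a new 32-character lowercase hex id.
        return Guid.NewGuid().ToString("N");
    }

    private static bool IsLowerHex(char c)
    {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
    }
}
```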

M5 — Payload + redaction policy

Goal: Payload capture is bounded (8 KB / 64 KB on error), headers are redacted by default, SQL parameter values are captured by default with per-connection opt-out, body redactor regexes are configurable per target, and the safety net over-redacts on misconfiguration.

Affected projects: AuditLog (policy engine + options), ExternalSystemGateway (HTTP header redactors, SQL param redaction hook), InboundAPI (header redactors, body capture), NotificationOutbox (subject/body capture follows existing rules). Tests.

Acceptance criteria:

  • An IAuditPayloadFilter service is invoked between event construction and write. It truncates to the default cap; raises to the error cap on non-Success rows; applies header redactors; applies body regex redactors; applies SQL parameter redactors (per-connection); over-redacts on regex error and increments AuditRedactionFailure.
  • Configuration test: changing appsettings.json redactors changes runtime behaviour (no rebuild needed for regex changes).
  • Bench: 95th-percentile audit emission latency on the hot path stays under N µs at default cap (target to be set during M5 brainstorm).

Task headlines:

  1. IAuditPayloadFilter + default implementation (header redaction, body regex, SQL parameter redaction, safety net).
  2. Wire the filter into the emission paths (M2, M3, M4 emitters all call through the filter before handing the AuditEvent to the writer).
  3. appsettings.json schema for the filter (already prepared in M1-T9; M5 plugs the runtime in).
  4. Tests: redaction unit tests with known-bad payloads (passwords in JSON, Authorization headers, SQL params named @apikey).
  5. Performance test in tests/ScadaLink.PerformanceTests/ for the hot-path latency budget.

Risk callouts:

  • Regex performance — pre-compile and cache patterns; reject patterns that take too long to compile.
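  • Redact before truncating, not after: if truncation cuts a redaction target in half, a redactor that runs afterwards no longer matches it and the surviving fragment leaks into the stored payload.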
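
The ordering risk in the last callout is easiest to see on a small payload. The sketch below is an illustration in Python under assumed names (the pattern, the 28-character cap and the helper functions are hypothetical); it also pre-compiles the pattern once, in the spirit of the first callout.

```python
import re

# Pre-compiled once and reused; the pattern itself is a hypothetical example.
PASSWORD_PATTERN = re.compile(r'"password"\s*:\s*"[^"]*"')


def redact(text):
    """Replace a complete "password": "..." pair with a redaction marker."""
    return PASSWORD_PATTERN.sub('"password":"[REDACTED]"', text)


def truncate(text, cap):
    """Cut the payload to at most `cap` characters."""
    return text[:cap]


payload = '{"user":"a","password":"hunter2"}'
cap = 28  # deliberately lands in the middle of the password value

safe = truncate(redact(payload), cap)    # redact, then truncate
leaky = redact(truncate(payload, cap))   # truncate, then redact

print(safe)
print(leaky)
```

In the second ordering the closing quote of the value has been cut off, so the pattern no longer matches and the first characters of the secret survive.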

M6 — Reconciliation, purge, partition maintenance, health metrics

Goal: Self-healing telemetry, monthly partition rollover, daily purge, all five new health metrics live and feeding the existing health-report pipeline.

Affected projects: AuditLog (3 new actors: SiteAuditReconciliationActor, AuditLogPurgeActor, partition-maintenance worker), Communication (the PullAuditEvents RPC), HealthMonitoring (5 new metrics), ConfigurationDatabase (partition-roll-forward SQL helper).

Acceptance criteria:

  • SiteAuditReconciliationActor runs every 5 minutes per site; pulls events the site reports as Pending; central performs InsertIfNotExistsAsync then signals the site to flip those rows to Reconciled.
  • AuditLogPurgeActor runs daily; for each partition older than RetentionDays, switches it out to a staging table and drops the staging table. Emits an AuditLog:Purged event with rowcount + duration.
  • Partition-maintenance job runs at month boundary to add the next month's partition function range and ensure the scheme has a destination filegroup.
  • 5 new health metrics published. Three per site: SiteAuditBacklog (count + oldest + bytes), SiteAuditWriteFailures, SiteAuditTelemetryStalled. Two per central node: CentralAuditWriteFailures, AuditRedactionFailure.
  • Integration test: simulated 5-minute central outage → telemetry catches up after recovery via reconciliation, no rows lost; site backlog metric reflects the queue depth and drops as it drains.
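
One way to read "each partition older than RetentionDays" in the purge criterion is sketched below in Python. It assumes, as an interpretation not stated in the spec, that a monthly partition is purgeable only once its entire range is older than the cutoff, so a partition that still holds in-retention rows is never switched out. The function names are hypothetical.

```python
from datetime import date, timedelta


def next_month(first_of_month):
    """Return the first day of the month after the given month start."""
    if first_of_month.month == 12:
        return date(first_of_month.year + 1, 1, 1)
    return date(first_of_month.year, first_of_month.month + 1, 1)


def purgeable_partitions(partition_starts, today, retention_days):
    """Select monthly partitions whose upper bound is at or before the cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return [p for p in sorted(partition_starts) if next_month(p) <= cutoff]


partitions = [date(2026, m, 1) for m in range(1, 6)]
print(purgeable_partitions(partitions, date(2026, 5, 20), retention_days=90))
```

With a 90-day retention evaluated on 2026-05-20 the cutoff is 2026-02-19, so only the January partition qualifies; February still contains rows inside the retention window.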

Task headlines:

  1. PullAuditEvents RPC on the existing SiteStream gRPC server.
  2. SiteAuditReconciliationActor with timer + per-site LastReconciledAt cursor.
  3. AuditLogPurgeActor with daily schedule, partition-switch logic via IAuditLogRepository.SwitchOutPartitionAsync.
  4. Partition-roll-forward helper (raw SQL migrationBuilder.Sql equivalent at runtime — likely a HostedService that runs once at startup and once per month).
  5. Health metric publishing per emitter; integrate with the existing SiteHealthState / CentralHealthAggregator plumbing.
  6. Integration tests for outage/recovery + purge.

Risk callouts:

  • Partition switch on an active table — ensure online schema operations don't block ingest; document the window if a brief lock is unavoidable.
  • Reconciliation can produce duplicate Forwarded → Reconciled state flips; ensure idempotency at site SQLite layer.
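
The idempotency concern in the last callout can be illustrated against an in-memory SQLite database. This is a sketch in Python, not the site schema: the table name `AuditEvents`, the columns `EventId` and `ForwardState`, and both helper functions are assumptions. The idea is that the central insert ignores rows it already has, and the site flip only touches rows that are not yet `Reconciled`, so replaying the same batch is harmless.

```python
import sqlite3


def insert_if_not_exists(conn, event_id, payload):
    """Central-side stand-in: a duplicate EventId is silently ignored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO AuditEvents (EventId, Payload, ForwardState) "
        "VALUES (?, ?, 'Reconciled')",
        (event_id, payload),
    )
    return cur.rowcount == 1  # True only when a new row was written


def mark_reconciled(conn, event_ids):
    """Site-side flip; returns how many rows actually changed state."""
    changed = 0
    for event_id in event_ids:
        cur = conn.execute(
            "UPDATE AuditEvents SET ForwardState = 'Reconciled' "
            "WHERE EventId = ? AND ForwardState <> 'Reconciled'",
            (event_id,),
        )
        changed += cur.rowcount
    conn.commit()
    return changed


def new_store():
    """Create an empty in-memory table with the assumed shape."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE AuditEvents ("
        "EventId TEXT PRIMARY KEY, Payload TEXT, ForwardState TEXT NOT NULL)"
    )
    return conn
```

Running `mark_reconciled` twice with the same ids changes rows the first time and zero rows the second time, which is the property the reconciliation actor would rely on when a signal is delivered more than once.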

M7 — Central UI: new Audit Log page + drill-ins + KPI tiles

Goal: User-visible Audit Log: filter bar, results grid (custom Blazor + Bootstrap, no third-party grid), drilldown drawer with cURL / "show all events" / redaction indicators / pretty-printed payloads. 6 drill-in entry points from existing pages. 3 KPI tiles on Health dashboard.

Affected projects: CentralUI, CentralUI.Tests, CentralUI.PlaywrightTests.

Acceptance criteria:

  • New Components/Pages/Audit/AuditLogPage.razor exists; new "Audit" nav group sibling to Notifications.
  • All 10 filter elements, 10 grid columns, keyset pagination + default page 100, drilldown drawer per Component-AuditLog.md §10.
  • Existing Components/Pages/Monitoring/AuditLog.razor (the IAuditService config-change viewer) renamed in code to ConfigurationAuditLog.razor, with URL /audit/configuration to match the doc-renaming we did. Drill-ins from existing pages (Notifications, Site Calls, External Systems, Inbound API Keys, Sites, Instances) added.
  • 3 KPI tiles added to the Health dashboard; data sourced from HealthMonitoring.
  • Playwright tests cover: filter narrowing, drilldown drawer, "Copy as cURL" on ApiInbound rows, drill-in from Notifications to filtered Audit Log.
  • OperationalAudit read permission gating + AuditExport for the Export button.

Task headlines:

  1. New Components/Pages/Audit/AuditLogPage.razor + matching .razor.cs code-behind + .razor.css.
  2. Custom Blazor <AuditFilterBar> component (multi-select chips for Channel/Kind/Status, autocomplete for Instance/Script).
  3. Custom Blazor <AuditResultsGrid> component — keyset paging via QueryAsync repository method (M1-T8).
  4. <AuditDrilldownDrawer> component — JSON pretty-print, SQL syntax highlight, "Copy as cURL", "Show all events" CorrelationId filter.
  5. Rename existing AuditLog.razor → ConfigurationAuditLog.razor + update routes + update internal links.
  6. Drill-in additions to 6 existing pages.
  7. 3 KPI tile components on Health dashboard.
  8. Server-side CSV export (streaming) with AuditExport permission check.
  9. Playwright E2E tests.

Risk callouts:

  • Permission check at the page level needs to align with the existing role/permission infrastructure (Security #10).
  • Keyset paging across partitioned table needs the right index; M1's IX_AuditLog_OccurredAtUtc is the supporting index.
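
Keyset paging, as opposed to OFFSET paging, resumes from the last row seen instead of counting rows to skip. The sketch below shows the technique in Python over SQLite purely for illustration; the real query would run against MS SQL through the repository's QueryAsync method. Using `EventId` as a tiebreaker for rows sharing a timestamp is an assumption made here, as are the table shape and the function name.

```python
import sqlite3

BASE = "SELECT OccurredAtUtc, EventId FROM AuditLog "
ORDER = "ORDER BY OccurredAtUtc DESC, EventId DESC LIMIT ?"


def query_page(conn, page_size=100, after=None):
    """Return (rows, next_cursor); pass next_cursor back in as `after`."""
    if after is None:
        rows = conn.execute(BASE + ORDER, (page_size,)).fetchall()
    else:
        ts, event_id = after
        # Strictly older rows, or same timestamp with a smaller tiebreaker.
        rows = conn.execute(
            BASE
            + "WHERE OccurredAtUtc < ? OR (OccurredAtUtc = ? AND EventId < ?) "
            + ORDER,
            (ts, ts, event_id, page_size),
        ).fetchall()
    # A short page means there is nothing further to fetch.
    next_cursor = rows[-1] if len(rows) == page_size else None
    return rows, next_cursor


def all_pages(conn, page_size):
    """Walk every page in order and collect the rows."""
    collected, cursor = [], None
    while True:
        rows, cursor = query_page(conn, page_size, cursor)
        collected.extend(rows)
        if cursor is None:
            return collected
```

Without a tiebreaker, rows that share an `OccurredAtUtc` value at a page boundary can be skipped or repeated, so whichever index supports the query would need to cover the tiebreaker column as well.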

M8 — CLI

Goal: Operator surface for the centralized Audit Log.

Affected projects: CLI, CLI.Tests, ManagementService (new HTTP endpoint), IntegrationTests.

Acceptance criteria:

  • scadalink audit query mirrors the UI filter set; results stream as JSON (default) or table.
  • scadalink audit export streams server-side to CSV / JSONL / Parquet; requires AuditExport permission.
  • scadalink audit verify-chain --month YYYY-MM is a no-op stub returning a "hash-chain not yet enabled in this release" message and exit code 0 (per v1.x deferral).
  • Existing audit-log query (IAuditService config-change viewer) renamed in code to audit-config query to disambiguate; old name kept as a deprecated alias for one minor version.
  • Permissions: audit query and audit verify-chain require OperationalAudit; audit export additionally requires AuditExport.

Task headlines:

  1. New AuditCommands.cs (separate file from AuditLogCommands.cs — the latter stays for the renamed config audit).
  2. Build the three subcommands with their flag sets (per CLI doc & alog.md §15.1, post-Bundle-D fix).
  3. ManagementService HTTP endpoints backing each subcommand.
  4. Output formatters (JSON, table) reused from existing CLI patterns.
  5. CLI integration tests in tests/ScadaLink.CLI.Tests/ + tests/ScadaLink.IntegrationTests/.
  6. Update CLI README + help text.

Risk callouts:

  • The CLI rename (audit-log query → audit-config query) breaks any operator scripts; provide a deprecation alias and document the migration.
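
A deprecation alias of the kind described in the callout can be as small as a lookup applied before normal command dispatch. The sketch below is in Python for illustration only and does not reflect how the existing CLI parses arguments; the mapping table, the function name and the warning text are assumptions.

```python
# Old command group name -> new name (the rename described in the M8 criteria).
DEPRECATED_ALIASES = {"audit-log": "audit-config"}


def resolve_command(argv):
    """Rewrite a deprecated command name and return (argv, warning or None)."""
    if argv and argv[0] in DEPRECATED_ALIASES:
        new_name = DEPRECATED_ALIASES[argv[0]]
        warning = f"'{argv[0]}' is deprecated; use '{new_name}' instead"
        return [new_name] + list(argv[1:]), warning
    return list(argv), None


resolved, warning = resolve_command(["audit-log", "query", "--site", "plant-1"])
print(resolved)
print(warning)
```

Keeping the warning separate from the rewritten arguments lets the caller decide where to print it (stderr, typically), so scripts that parse JSON output from stdout keep working during the one-minor-version grace period.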

Cross-cutting concerns (apply at every milestone)

  • Branching: every milestone gets its own feature/audit-log-mN-<slice> branch; merged with --no-ff to main on milestone completion. No pushes without explicit user authorization.
  • Tests: Every task adds tests first (failing test → impl → passing test). Existing tests must keep passing.
  • Commits: small and frequent. Bite-sized per writing-plans skill.
  • Reviews: per the bundling cadence in user memory — group small adjacent tasks into a single implementer dispatch, run one combined spec+quality review per bundle, then a final cross-bundle review at end of milestone.
  • Docs: if implementation reveals a design gap, fix the design doc FIRST (in docs/requirements/Component-AuditLog.md and/or alog.md), commit, then implement. Don't let the code and docs drift.
  • Infra: the 3 infra/* working-tree modifications still uncommitted on main are unrelated and stay that way unless the user explicitly addresses them. Use explicit git add <path> throughout, never git commit -am.

Per-milestone execution flow (template)

When a milestone is about to start, run this sequence:

  1. Brainstorm: short skill invocation to nail any code-level decisions not fixed in the spec (test fixture placement, migration helper choice, etc.).
  2. Writing-plans: produce a milestone-specific plan with TDD detail per task — saved to docs/plans/2026-XX-XX-auditlog-mN-<slice>.md + peer .tasks.json.
  3. Subagent-driven execution: bundle small tasks per cadence preference; per-bundle implementer + combined reviewer; cross-milestone review at end; merge to main with --no-ff.

The roadmap is the contract for what each milestone ships; the per-milestone plan is the contract for how it gets built.