docs(components): AuditLog reference doc (pilot exemplar)

2026-06-03 15:29:20 -04:00
parent 0da5d3dd0b
commit b2770764c5
1 changed files with 244 additions and 0 deletions
@@ -0,0 +1,244 @@
+# Audit Log
+
+The Audit Log component records every action a site or central script takes across a trust boundary — outbound API calls, outbound database writes, notification sends, and inbound API requests — into a central append-only `AuditLog` table, with a site SQLite hot-path, gRPC telemetry forwarding, and a reconciliation fallback.
+
+## Overview
+
+Audit Log (#23) is a layered subsystem that runs on both site and central nodes. It exists alongside the operational stores it complements — `Notifications` (Notification Outbox, #21) and `SiteCalls` (Site Call Audit, #22) — rather than replacing them. The operational tables answer "what is the current state of this notification / cached call?"; the `AuditLog` answers "what happened, in what order, who did it, and what crossed the boundary?".
+
+The component code lives in `src/ZB.MOM.WW.ScadaBridge.AuditLog/`, split by role:
+
+- `Site/` — the script-thread write path: `SqliteAuditWriter`, the `FallbackAuditWriter` chain, and the `Site/Telemetry/` drain that pushes rows to central.
+- `Central/` — the central-node ingest singleton (`AuditLogIngestActor`), the direct-write path (`CentralAuditWriter`), the reconciliation puller (`SiteAuditReconciliationActor`), and retention maintenance.
+- `Configuration/`, `Redaction/`, `Payload/` — options, the redactor, and the truncation/redaction primitives.
+
+The same DI entry point, `ServiceCollectionExtensions.AddAuditLog`, registers the writer chain on every host; central nodes additionally call `AddAuditLogCentralMaintenance`, and site nodes call `AddAuditLogHealthMetricsBridge`. Because `AddAuditLog` runs on both site and central composition roots, it never registers a hosted service that would resolve a central-only dependency on a site — central-only registrations are split into their own helper.
+
+## Key Concepts
+
+### Script trust boundary
+
+The audited scope is the script trust boundary, not framework traffic. The four channels are modelled by the `AuditChannel` enum (`ApiOutbound`, `DbOutbound`, `Notification`, `ApiInbound`), and the specific action by `AuditKind` (for example `ApiCall`, `DbWriteCached`, `NotifySend`, `InboundRequest`). Every row is built through `ScadaBridgeAuditEventFactory.Create`, which maps the domain vocabulary onto the canonical record: `Channel`/`Kind`/`Status` become `Action`/`Category`/`Outcome` plus a `DetailsJson` extension bag carrying every other domain field.
+
+### Canonical `AuditEvent` and `DetailsJson`
+
+The transport type is the canonical `ZB.MOM.WW.Audit.AuditEvent` record — ten fields: `EventId`, `OccurredAtUtc`, `Actor`, `Action`, `Outcome`, `Category`, `Target`, `SourceNode`, `CorrelationId`, `DetailsJson`. ScadaBridge domain fields (`ExecutionId`, `ParentExecutionId`, `SourceSiteId`, `RequestSummary`, `IngestedAtUtc`, and so on) ride inside `DetailsJson` as an `AuditDetails` record, serialized by `AuditDetailsCodec`. `AuditRowProjection.Decompose` / `Recompose` move between the canonical record and the domain view.
+
+### `ExecutionId` vs `CorrelationId`
+
+`CorrelationId` is the canonical top-level field and carries the per-operation lifecycle id — for cached calls it is the `TrackedOperationId`, and the cached-telemetry drain reads it back out to join the audit row to its operational tracking row (see `SiteAuditTelemetryActor.OnCachedDrainAsync`). `ExecutionId` is a `DetailsJson` field: the per-run correlation value shared by every row a single script execution or inbound request emits. `ParentExecutionId` (also in `DetailsJson`) is the cross-execution spawn pointer that bridges, for example, an inbound API request to the site script it routes to.
+
+### One row per lifecycle event
+
+Each lifecycle event is one row. A synchronous call produces a single row; a cached call produces several (`Submitted`, `Forwarded`, `Attempted`, then a terminal `Delivered`/`Parked`/`Discarded`). Idempotency is on `EventId`, so the same row arriving twice — from a telemetry retry and from a reconciliation pull — collapses to a no-op everywhere it is written.
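+
+The `EventId` idempotency rule can be illustrated with a minimal in-memory sketch. `InMemoryAuditStore` and its synchronous `InsertIfNotExists` are hypothetical stand-ins for illustration only; the real repository is EF Core backed and asynchronous, and its internals are not shown in this document.
+
+```csharp
+using System;
+using System.Collections.Concurrent;
+
+// Hypothetical stand-in for an EventId-keyed audit store (not the real repository).
+public sealed class InMemoryAuditStore
+{
+    private readonly ConcurrentDictionary<Guid, string> _rows = new();
+
+    // Returns true if the row was inserted, false if the EventId was already present.
+    // A duplicate delivery therefore leaves the stored row untouched.
+    public bool InsertIfNotExists(Guid eventId, string detailsJson)
+        => _rows.TryAdd(eventId, detailsJson);
+
+    public int Count => _rows.Count;
+}
+```
+
+Calling `InsertIfNotExists` twice with the same `EventId`, once for a telemetry retry and once for a reconciliation pull, leaves `Count` at 1.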
+
+## Architecture
+
+### Site write path
+
+`SqliteAuditWriter` is the site hot-path store: a singleton holding one owned `SqliteConnection` behind a write lock, fed by a bounded `Channel<T>` that a background task drains in batches, so script threads never block on disk I/O. It writes two tables — the append-only canonical `audit_event` and a mutable `audit_forward_state` sidecar that tracks the forwarding lifecycle (from `SqliteAuditWriter.InitializeSchema`):
+
+```sql
+CREATE TABLE IF NOT EXISTS audit_event (
+    EventId        TEXT NOT NULL,
+    OccurredAtUtc  TEXT NOT NULL,
+    Actor          TEXT NOT NULL,
+    Action         TEXT NOT NULL,
+    Outcome        TEXT NOT NULL,
+    Category       TEXT NULL,
+    Target         TEXT NULL,
+    SourceNode     TEXT NULL,
+    CorrelationId  TEXT NULL,
+    DetailsJson    TEXT NULL,
+    PRIMARY KEY (EventId)
+);
+```
+
+The sidecar carries `ForwardState` (`Pending`/`Forwarded`/`Reconciled`, per the `AuditForwardState` enum), a duplicated `OccurredAtUtc` for the drain range scan, and a precomputed `IsCachedKind` flag so the cached/non-cached drain split is an integer predicate, not a `DetailsJson` parse on the read hot-path. The site store is ephemeral (roughly 7-day retention, recreated per deployment), so a schema change is an in-place reset rather than a migration.
+
+`SqliteAuditWriter` also implements `ISiteAuditQueue`, the read/mark surface the drain and the reconciliation pull handler consume. The same singleton instance is bound to both `ISiteAuditQueue` and the hot-path `IAuditWriter` so the drain observes exactly the rows the script threads wrote.
+
+### Site fallback chain
+
+The script-facing `IAuditWriter` is `FallbackAuditWriter`, not the raw `SqliteAuditWriter`. It redacts once up front, attempts the primary write, and on any primary failure stashes the (already redacted) row in a drop-oldest `RingBufferFallback` and returns success — a primary outage must never reach the calling script:
+
+```csharp
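+public async Task WriteAsync(AuditEvent evt, CancellationToken ct = default)
+{
+    ArgumentNullException.ThrowIfNull(evt);
+    // Redact once up front so neither the primary nor the ring ever holds an unredacted row.
+    var filtered = _redactor.Apply(evt);
+    try
+    {
+        await _primary.WriteAsync(filtered, ct).ConfigureAwait(false);
+    }
+    catch (Exception ex)
+    {
+        // Primary failed: count it, park the redacted row in the ring, and return normally
+        // so the outage never surfaces in the calling script.
+        _failureCounter.Increment();
+        _logger.LogWarning(ex,
+            "Primary audit writer threw; routing EventId {EventId} to drop-oldest ring.",
+            filtered.EventId);
+        _ring.TryEnqueue(filtered);
+        return;
+    }
+
+    // Primary write succeeded: replay anything parked during an earlier outage.
+    if (_ring.Count > 0)
+    {
+        await TryDrainRingAsync(ct).ConfigureAwait(false);
+    }
+}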
+```
+
+On the next successful primary write the ring drains back through the primary in FIFO order.
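+
+The drop-oldest behaviour can be sketched as a small bounded queue. This is a hypothetical `DropOldestRing<T>` under assumed semantics (evict the oldest entry when full, dequeue in FIFO order); the actual `RingBufferFallback` API beyond `TryEnqueue` and `Count` is not shown in this document.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Diagnostics.CodeAnalysis;
+
+// Hypothetical illustration of a drop-oldest ring; not the real RingBufferFallback.
+public sealed class DropOldestRing<T>
+{
+    private readonly Queue<T> _items = new();
+    private readonly object _gate = new();
+    private readonly int _capacity;
+
+    public DropOldestRing(int capacity)
+    {
+        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
+        _capacity = capacity;
+    }
+
+    public int Count { get { lock (_gate) { return _items.Count; } } }
+
+    // Always accepts the new item. Returns false when the oldest item
+    // had to be evicted to make room (assumed convention for this sketch).
+    public bool TryEnqueue(T item)
+    {
+        lock (_gate)
+        {
+            var evicted = false;
+            if (_items.Count == _capacity)
+            {
+                _items.Dequeue();
+                evicted = true;
+            }
+            _items.Enqueue(item);
+            return !evicted;
+        }
+    }
+
+    // Removes the oldest surviving item, so a drain replays rows in FIFO order.
+    public bool TryDequeue([MaybeNullWhen(false)] out T item)
+    {
+        lock (_gate) { return _items.TryDequeue(out item); }
+    }
+}
+```
+
+Evicting the oldest row keeps memory bounded during a long primary outage while preserving the most recent rows.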
+
+### Telemetry forward and central ingest
+
+`SiteAuditTelemetryActor` drains the local queue and pushes to central over two parallel transports, each self-ticking on its own cadence (`BusyIntervalSeconds` while rows flow, `IdleIntervalSeconds` when empty):
+
+- `IngestAuditEvents` for single-row lifecycle events (sync `ApiCall`/`DbWrite`, `NotifySend`, `InboundRequest`).
+- `IngestCachedTelemetry` for cached-call rows, each joined to its `IOperationTrackingStore` snapshot so central can write `AuditLog` and `SiteCalls` together.
+
+On the central node, `AuditLogIngestActor` is a cluster singleton. It opens a fresh DI scope per message because `IAuditLogRepository` is a scoped EF Core service, stamps `IngestedAtUtc`, and inserts idempotently — one bad row never sinks the batch:
+
+```csharp
+private async Task IngestWithRepositoryAsync(
+    IAuditLogRepository repository,
+    IAuditRedactor? redactor,
+    ICentralAuditWriteFailureCounter? failureCounter,
+    IngestAuditEventsCommand cmd,
+    DateTime nowUtc,
+    List<Guid> accepted)
+{
+    foreach (var evt in cmd.Events)
+    {
+        // Per-row try/catch: a failure on one row must not fail the rest of the batch.
+        try
+        {
+            // Fall back to the safe default so header redaction runs even when no redactor is wired.
+            var safeRedactor = redactor ?? SafeDefaultAuditRedactor.Instance;
+            var filtered = safeRedactor.Apply(evt);
+            var ingested = AuditRowProjection.WithIngestedAtUtc(filtered, nowUtc);
+            await repository.InsertIfNotExistsAsync(ingested).ConfigureAwait(false);
+            // Only rows that were persisted (or already present) are reported as accepted.
+            accepted.Add(evt.EventId);
+        }
+        catch (Exception ex)
+        {
+            try { failureCounter?.Increment(); }
+            catch { /* counter must never throw; defence in depth */ }
+            _logger.LogError(ex,
+                "Failed to persist audit event {EventId} during batch ingest; row will be retried by the site.",
+                evt.EventId);
+        }
+    }
+}
+```
+
+The cached path, `OnCachedTelemetryAsync`, wraps each entry in its own MS SQL transaction and writes the `AuditLog` row and the `SiteCalls` row together, so the two writes commit or roll back as a unit and the audit and operational mirrors cannot drift apart.
+
+### Central direct-write
+
+Events that originate on central — Notification Outbox dispatch and Inbound API — never go through site telemetry. They call `ICentralAuditWriter`, implemented by `CentralAuditWriter`, which redacts, stamps `SourceNode` from `INodeIdentityProvider` when the caller has not, opens a per-call scope, and inserts idempotently. Like every audit path, it swallows and logs failures rather than propagating them.
+
+### Reconciliation and retention
+
+`SiteAuditReconciliationActor` is a central singleton that, on a timer, pulls from each site the rows at or after a per-site cursor and ingests them idempotently; it is the self-healing fallback for telemetry the push path missed. `AuditLogPurgeActor` drives the daily partition-switch purge against the central table, and `AuditLogPartitionMaintenanceService` rolls the monthly partition function forward so inserts never land in an unbounded tail partition.
+
+## Usage
+
+Rows are written through one of two DI seams, never constructed ad hoc. Site boundary code resolves the hot-path `IAuditWriter` — the `FallbackAuditWriter` shown above — and writes without blocking on disk and without ever throwing an audit failure back at the script. Central-originated events (Notification Outbox dispatch, Inbound API) resolve `ICentralAuditWriter` instead; its `CentralAuditWriter` implementation redacts, stamps `SourceNode`, opens a per-call EF Core scope, and inserts idempotently, swallowing failures the same way:
+
+```csharp
+public async Task WriteAsync(AuditEvent evt, CancellationToken ct = default)
+{
+    if (evt is null)
+    {
+        _logger.LogWarning("CentralAuditWriter.WriteAsync received null event; ignoring.");
+        return;
+    }
+
+    try
+    {
+        var filtered = _redactor.Apply(evt);
+        if (filtered.SourceNode is null && _nodeIdentity?.NodeName is { } nodeName)
+        {
+            filtered = filtered with { SourceNode = nodeName };
+        }
+
+        await using var scope = _services.CreateAsyncScope();
+        var repo = scope.ServiceProvider.GetRequiredService<IAuditLogRepository>();
+        var stamped = AuditRowProjection.WithIngestedAtUtc(filtered, DateTime.UtcNow);
+        await repo.InsertIfNotExistsAsync(stamped, ct).ConfigureAwait(false);
+    }
+    catch (Exception ex)
+    {
+        try { _failureCounter.Increment(); }
+        catch { /* counter must never throw — defence in depth */ }
+        _logger.LogWarning(ex,
+            "CentralAuditWriter failed for EventId {EventId} (Action={Action}, Outcome={Outcome})",
+            evt.EventId, evt.Action, evt.Outcome);
+    }
+}
+```
+
+The two writer seams are intentionally distinct DI bindings: `IAuditWriter` is the site/boundary hot-path, `ICentralAuditWriter` is the central direct-write path. Keeping them separate stops a site composition root from accidentally resolving the central writer, which depends on a scoped `IAuditLogRepository` only registered by the Configuration Database.
+
+## Configuration
+
+The top-level options class is `AuditLogOptions`, bound from the `AuditLog` section and validated on startup by `AuditLogOptionsValidator`. The writer and telemetry collaborators bind from nested sections; the constant section names live on `ServiceCollectionExtensions`.
+
+| Section | Key | Default | Description |
+|---------|-----|---------|-------------|
+| `AuditLog` | `DefaultCapBytes` | `8192` | Payload-summary cap in bytes. Must be `> 0`. |
+| `AuditLog` | `ErrorCapBytes` | `65536` | Cap on error rows. Must be `>= DefaultCapBytes`. |
+| `AuditLog` | `InboundMaxBytes` | `1048576` | Per-body ceiling for `ApiInbound` summaries. Range `[8192, 16777216]`. |
+| `AuditLog` | `HeaderRedactList` | `Authorization`, `X-Api-Key`, `Cookie`, `Set-Cookie` | HTTP headers redacted before persistence. |
+| `AuditLog` | `GlobalBodyRedactors` | empty | Body-content redactor regex patterns applied globally. |
+| `AuditLog` | `PerTargetOverrides` | empty | Per-target overrides keyed by target name (`CapBytes`, `AdditionalBodyRedactors`, `RedactSqlParamsMatching`). |
+| `AuditLog` | `RetentionDays` | `365` | Central retention window. Range `[30, 3650]`. |
+| `AuditLog:SiteWriter` | `DatabasePath` | `auditlog.db` | Site SQLite file path. |
+| `AuditLog:SiteWriter` | `ChannelCapacity` | `4096` | Bounded write-queue capacity. |
+| `AuditLog:SiteWriter` | `BatchSize` | `256` | Max events per write transaction. |
+| `AuditLog:SiteTelemetry` | `BatchSize` | `256` | Max rows per gRPC drain batch. |
+| `AuditLog:SiteTelemetry` | `BusyIntervalSeconds` | `5` | Drain delay while rows are flowing. |
+| `AuditLog:SiteTelemetry` | `IdleIntervalSeconds` | `30` | Drain delay when the queue is empty. |
+| `AuditLog:PartitionMaintenance` | `IntervalSeconds` | `86400` | Partition roll-forward cadence. |
+| `AuditLog:PartitionMaintenance` | `LookaheadMonths` | `1` | Future months `pf_AuditLog_Month` must always cover. |
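+
+As an illustration, the table above might map onto an `appsettings.json` fragment like the following. This is a sketch that assumes the host loads JSON configuration and that colon-separated section names correspond to nested JSON objects, as is usual in .NET configuration; the values are the documented defaults, and the empty-by-default keys are omitted.
+
+```json
+{
+  "AuditLog": {
+    "DefaultCapBytes": 8192,
+    "ErrorCapBytes": 65536,
+    "InboundMaxBytes": 1048576,
+    "HeaderRedactList": [ "Authorization", "X-Api-Key", "Cookie", "Set-Cookie" ],
+    "RetentionDays": 365,
+    "SiteWriter": {
+      "DatabasePath": "auditlog.db",
+      "ChannelCapacity": 4096,
+      "BatchSize": 256
+    },
+    "SiteTelemetry": {
+      "BatchSize": 256,
+      "BusyIntervalSeconds": 5,
+      "IdleIntervalSeconds": 30
+    },
+    "PartitionMaintenance": {
+      "IntervalSeconds": 86400,
+      "LookaheadMonths": 1
+    }
+  }
+}
+```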
+
+`PerTargetRedactionOverride` is additive: per-target body redactors append to the global list, and `RedactSqlParamsMatching` is an opt-in case-insensitive regex applied only to `DbOutbound` rows (SQL parameter values are captured verbatim by default). Header redaction always runs — when no redactor is wired, the paths fall back to `SafeDefaultAuditRedactor`, which scrubs the default sensitive headers regardless.
+
+`SiteAuditReconciliationOptions` exposes `ReconciliationIntervalSeconds` (default `300`) and `StalledAfterNonDrainingCycles` (default `2`); `AuditLogPurgeOptions` exposes `IntervalHours` (default `24`), while the purge window itself is sourced from `AuditLogOptions.RetentionDays` so retention is tuned from one place.
+
+## Dependencies & Interactions
+
+- [Commons (#16)](./Commons.md) — owns the canonical `AuditEvent` shape consumed here (via `ZB.MOM.WW.Audit`), the `IAuditWriter` / `ICentralAuditWriter` / `ISiteAuditQueue` interfaces, the `AuditChannel` / `AuditKind` / `AuditForwardState` enums, the `AuditDetails` / `AuditRowProjection` / `ScadaBridgeAuditEventFactory` projection types, and the ingest/pull message contracts.
+- [Configuration Database (#17)](./ConfigurationDatabase.md) — registers the scoped `IAuditLogRepository` (the central `dbo.AuditLog` table, partition boundaries, and `InsertIfNotExistsAsync` idempotency). Central hosts must call `AddConfigurationDatabase` for the ingest, direct-write, reconciliation, and purge paths to resolve their repository.
+- [Central–Site Communication (#5)](./Communication.md) — supplies the gRPC transport: `IngestAuditEvents` / `IngestCachedTelemetry` push and the `PullAuditEvents` reconciliation pull, plus the DTO mappers.
+- [Site Call Audit (#22)](./SiteCallAudit.md) — shares the combined cached-telemetry packet. `AuditLogIngestActor.OnCachedTelemetryAsync` writes the `AuditLog` row and the `SiteCalls` upsert in one transaction; sites remain the source of truth for cached-call status.
+- [Notification Outbox (#21)](./NotificationOutbox.md) — a central direct-write caller of `ICentralAuditWriter` (dispatch lifecycle rows), alongside the Inbound API.
+- [Health Monitoring (#11)](./HealthMonitoring.md) — `AddAuditLogHealthMetricsBridge` replaces the NoOp failure counters with bridges that surface `SiteAuditWriteFailures` and `AuditRedactionFailure` on the site health report; `SiteAuditBacklogReporter` polls the writer backlog. On central, `AuditCentralHealthSnapshot` exposes `CentralAuditWriteFailures`, `AuditRedactionFailure`, and per-site `SiteAuditTelemetryStalled`.
+- Design spec: [Component-AuditLog.md](../requirements/Component-AuditLog.md).
+
+## Troubleshooting
+
+### Telemetry loss self-heals
+
+If the push path misses rows (a gRPC blip, a central restart, a site briefly offline), the site keeps those rows `Pending` and `SiteAuditReconciliationActor` re-pulls them on its next tick. Idempotency on `EventId` makes duplicates from both paths a no-op, so no operator action is required. The reconciliation cursor is in-memory; a singleton restart resets it to `DateTime.MinValue`, which re-pulls everything the site still holds — conservative but correct.
+
+### A row repeatedly fails to ingest
+
+`SiteAuditReconciliationActor` tracks per-`EventId` insert failures. While a row's failure count stays below `MaxPermanentInsertAttempts` (5), the site cursor is held back so the next tick retries it. At the threshold the actor logs `Critical`, permanently abandons that one row, and advances the cursor so a single broken row cannot block all further progress for the site. A `Critical` log line naming an abandoned `EventId` is the signal to investigate that row's payload.
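+
+The retry-then-abandon rule can be sketched as a small per-`EventId` counter. `InsertFailureTracker` is a hypothetical name, and treating the fifth recorded failure as the abandonment point is an assumption about how the threshold is compared; the actor's real bookkeeping is not shown in this document.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Hypothetical illustration of per-EventId failure tracking; not the actor's real code.
+public sealed class InsertFailureTracker
+{
+    // Value taken from the text above; the comparison used here is an assumption.
+    private const int MaxPermanentInsertAttempts = 5;
+    private readonly Dictionary<Guid, int> _failures = new();
+
+    // Returns true when the row should be abandoned and the cursor allowed to advance;
+    // false means the cursor should be held back so the next tick retries the row.
+    public bool RecordFailure(Guid eventId)
+    {
+        _failures.TryGetValue(eventId, out var attempts);
+        attempts++;
+        if (attempts >= MaxPermanentInsertAttempts)
+        {
+            _failures.Remove(eventId);
+            return true;
+        }
+        _failures[eventId] = attempts;
+        return false;
+    }
+
+    // A successful insert clears any failure history for that row.
+    public void RecordSuccess(Guid eventId) => _failures.Remove(eventId);
+}
+```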
+
+### A site shows as stalled
+
+When two consecutive reconciliation cycles both return rows and report `MoreAvailable=true`, the backlog is not draining and the actor latches the site as stalled, publishing `SiteAuditTelemetryStalledChanged` on the EventStream (surfaced as `SiteAuditTelemetryStalled` on the central health snapshot). Only transitions are published, so a stalled site does not flood the health surface.
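+
+The latch-and-publish-on-transition behaviour can be sketched as follows. `StallLatch` is a hypothetical name; the default threshold of 2 mirrors `StalledAfterNonDrainingCycles`, and clearing the latch on the first draining cycle is an assumption, since this document does not state when a stalled site recovers.
+
+```csharp
+// Hypothetical illustration of stall detection; not the actor's real code.
+public sealed class StallLatch
+{
+    private readonly int _threshold;
+    private int _nonDrainingCycles;
+
+    public StallLatch(int threshold = 2) => _threshold = threshold;
+
+    public bool IsStalled { get; private set; }
+
+    // Feed one reconciliation cycle's result. Returns true only when the stalled
+    // state changed, which is the moment a transition event would be published.
+    public bool Observe(int rowsReturned, bool moreAvailable)
+    {
+        // A cycle is non-draining when it returned rows and the site still has more.
+        var nonDraining = rowsReturned > 0 && moreAvailable;
+        _nonDrainingCycles = nonDraining ? _nonDrainingCycles + 1 : 0;
+
+        var stalledNow = _nonDrainingCycles >= _threshold;
+        if (stalledNow == IsStalled)
+        {
+            return false;
+        }
+        IsStalled = stalledNow;
+        return true;
+    }
+}
+```
+
+Because `Observe` reports only changes, a site that stays stalled across many cycles produces a single notification rather than one per tick.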
+
+### Audit writes never abort the action
+
+Every write path is best-effort by contract. A primary SQLite failure routes to the ring buffer; an ingest or direct-write failure is swallowed, logged, and counted on the health surface. The audited action's own success/failure path is authoritative — a missing audit row never means the action failed. The site retention purge enforces the matching invariant: a row is not dropped until it has reached `Forwarded` or `Reconciled`.
+
+## Related Documentation
+
+- [Audit Log design specification](../requirements/Component-AuditLog.md)
+- [Site Call Audit](./SiteCallAudit.md)
+- [Notification Outbox](./NotificationOutbox.md)
+- [Configuration Database](./ConfigurationDatabase.md)
+- [Central–Site Communication](./Communication.md)
+- [Commons](./Commons.md)
+- [Health Monitoring](./HealthMonitoring.md)