docs(audit): current-state OtOpcUa

Joseph Doherty
2026-06-01 06:55:07 -04:00
parent 2dbedce0ac
commit e498bb7c5a
@@ -0,0 +1,140 @@
# Audit — current state: OtOpcUa
Repo: `~/Desktop/OtOpcUa` (Gitea `lmxopcua`). Stack: .NET 10, Akka.NET cluster, EF Core + SQL Server.
All paths below are relative to the repo root. Verified against source on 2026-06-01.
OtOpcUa already has a structured, idempotent audit pipeline: a cluster-broadcast `AuditEvent`
message, a cluster-singleton writer actor that batches and bulk-inserts, and an append-only
`ConfigAuditLog` EF entity with two-layer dedup. There is **also** a second, older write path —
SQL stored procedures that `INSERT dbo.ConfigAuditLog` directly — so the table has two
producers with slightly different column conventions (see §1).
## 1. How it works today
**Record shape** (`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Audit/AuditEvent.cs:9-17`):
a sealed record `AuditEvent(Guid EventId, string Category, string Action, string Actor,
DateTime OccurredAtUtc, string? DetailsJson, NodeId SourceNode, CorrelationId CorrelationId)`.
`NodeId` and `CorrelationId` are Commons value-types — `NodeId` wraps a string (the *logical
cluster node / host name*, explicitly **not** an OPC UA NodeId per its XML doc,
`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Types/NodeId.cs:3-8`); `CorrelationId` wraps a `Guid`
(`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Types/CorrelationId.cs:3`).
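
For orientation, the record shape described above can be sketched as follows. The positional parameters follow the description; how `NodeId` and `CorrelationId` are actually declared (here as `readonly record struct` with a `Value` property) is an assumption, and `AuditEventExample.Create` is a hypothetical helper with illustrative values.

```csharp
#nullable enable
using System;

// Assumed declarations: the source only states that NodeId wraps a string
// and CorrelationId wraps a Guid, each exposing .Value.
public readonly record struct NodeId(string Value);
public readonly record struct CorrelationId(Guid Value);

// Positional parameters as listed for AuditEvent.cs:9-17.
public sealed record AuditEvent(
    Guid EventId,
    string Category,
    string Action,
    string Actor,
    DateTime OccurredAtUtc,
    string? DetailsJson,
    NodeId SourceNode,
    CorrelationId CorrelationId);

public static class AuditEventExample
{
    // Hypothetical helper: builds an event roughly the way an emit site might.
    public static AuditEvent Create(string actor, string host) =>
        new(Guid.NewGuid(), "Config", "Edit", actor, DateTime.UtcNow,
            null, new NodeId(host), new CorrelationId(Guid.NewGuid()));
}
```

Wrapping the host name in its own type is what keeps it from being confused with an OPC UA NodeId at call sites.
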
**Transport**: `AuditEvent` is an Akka message meant to be sent to the `AuditWriterActor`
**cluster singleton** (`AuditEvent.cs:6` describes it as "cluster-broadcast … consumed by the
`AuditWriterActor` singleton"). The singleton is registered through Akka.Hosting at
`src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/ServiceCollectionExtensions.cs:68-75`
(`WithSingleton<AuditWriterActorKey>(AuditWriterSingletonName, …)`). Any cluster member can
emit an `AuditEvent`; the singleton is the one sink that persists it.
**Storage** — EF entity `ConfigAuditLog`
(`src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ConfigAuditLog.cs:7-44`): append-only
("Grants revoked for UPDATE/DELETE on all principals", `ConfigAuditLog.cs:4-5`). Columns:
`AuditId` (identity PK), `Timestamp` (default `SYSUTCDATETIME()`), `Principal`, `EventType`,
`ClusterId?`, `NodeId?`, `GenerationId?`, `DetailsJson?`, `EventId?` (Guid), `CorrelationId?`
(Guid). Mapping/constraints in `OtOpcUaConfigDbContext.cs:429-463`: `DetailsJson` must be valid
JSON (`CK_ConfigAuditLog_DetailsJson_IsJson`, lines 435-436); `Principal`/`EventType`/`ClusterId`/`NodeId`
are length-capped (lines 441-444); supporting indexes are `IX_ConfigAuditLog_Cluster_Time` (lines 449-451)
and `IX_ConfigAuditLog_Generation` (lines 452-454).
**Writer / batching** (`src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/Audit/AuditWriterActor.cs`):
a `ReceiveActor` with `FlushBatchSize = 500` (line 25) and `FlushInterval = 5s` (line 26).
It buffers events in a `Dictionary<Guid, AuditEvent>` keyed by `EventId` (line 30), flushing
when the buffer hits 500 (line 60), when the 5s periodic timer scheduled in `PreStart` (lines 50-53) fires,
or on `PreRestart`/`PostStop` (lines 96-107) so a supervisor restart or coordinated shutdown does
not lose the buffer. `FlushBuffer` (lines 63-93) snapshots and clears the buffer, then for each
event constructs a `ConfigAuditLog` row (lines 75-84): `Timestamp = OccurredAtUtc`,
`Principal = Actor`, `EventType = $"{Category}:{Action}"`, `NodeId = SourceNode.Value`,
`DetailsJson`, `EventId`, `CorrelationId = CorrelationId.Value`. A failed flush is logged and the
batch is **dropped** (`catch` at lines 89-92) — best-effort, no retry/dead-letter.
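
The buffering and flush behaviour described above can be approximated without Akka. The sketch below is not the actor's code: `AuditBuffer`, `AuditRow`, and the `persist` delegate are hypothetical names, the value-types are flattened to `string`/`Guid`, and the early return on an empty buffer is an assumption. It only illustrates keying by `EventId`, the count-threshold flush, the column mapping, and the drop-on-failure policy.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public sealed record AuditEvent(Guid EventId, string Category, string Action, string Actor,
    DateTime OccurredAtUtc, string? DetailsJson, string SourceNode, Guid CorrelationId);

// Only the columns the actor path populates.
public sealed record AuditRow(DateTime Timestamp, string Principal, string EventType,
    string? NodeId, string? DetailsJson, Guid? EventId, Guid? CorrelationId);

// Plain-C# stand-in for the actor's buffer; no Akka types involved.
public sealed class AuditBuffer
{
    private readonly int _flushBatchSize;
    private readonly Action<IReadOnlyList<AuditRow>> _persist; // hypothetical sink
    private readonly Dictionary<Guid, AuditEvent> _buffer = new();

    public AuditBuffer(int flushBatchSize, Action<IReadOnlyList<AuditRow>> persist)
    {
        _flushBatchSize = flushBatchSize;
        _persist = persist;
    }

    public int Count => _buffer.Count;

    public void Add(AuditEvent evt)
    {
        _buffer[evt.EventId] = evt;                 // duplicate EventId: last write wins
        if (_buffer.Count >= _flushBatchSize) Flush();
    }

    // A timer tick or a PreRestart/PostStop hook would call this as well.
    public void Flush()
    {
        if (_buffer.Count == 0) return;
        var rows = new List<AuditRow>(_buffer.Count);
        foreach (var e in _buffer.Values) rows.Add(ToRow(e));
        _buffer.Clear();                            // snapshot taken, buffer cleared before persisting
        try
        {
            _persist(rows);
        }
        catch (Exception ex)
        {
            // Best-effort: the batch is dropped, with no retry or dead-letter.
            Console.Error.WriteLine($"audit flush failed, {rows.Count} rows dropped: {ex.Message}");
        }
    }

    public static AuditRow ToRow(AuditEvent e) => new(
        e.OccurredAtUtc, e.Actor, $"{e.Category}:{e.Action}",
        e.SourceNode, e.DetailsJson, e.EventId, e.CorrelationId);
}
```

Because the buffer is cleared before the sink is called, a throwing sink loses the whole batch, which is the trade-off the paragraph above describes.
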
**Dedup / idempotency (two layers)** — described at `AuditWriterActor.cs:17-21`:
1. *In-buffer* — duplicate `EventId`s within a batch collapse via the dictionary (last-write-wins;
`HandleEvent`, lines 55-61).
2. *Database* — a **filtered unique index** `UX_ConfigAuditLog_EventId` (`OtOpcUaConfigDbContext.cs:459-462`,
`IsUnique()` + `HasFilter("[EventId] IS NOT NULL")`) gives cross-restart safety: a retry of an
already-flushed batch hits the constraint, the duplicate insert is dropped, and the rest of the
batch survives. `EventId`/`CorrelationId` are nullable so legacy/backfill rows (NULL) don't
collide — confirmed in the entity XML (`ConfigAuditLog.cs:33-43`) and migration
`Migrations/20260526105027_AddConfigAuditLogEventIdColumns.cs:27-38`.
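
For reference, a filtered unique index of this kind is typically declared in EF Core roughly as below. This is a configuration sketch, not the project's mapping: the trimmed entity, the `long` key type, the `SketchDbContext` name, and the use of `HasDatabaseName` to set the index name are assumptions. Only `IsUnique()`, the filter string, and the index name come from the description above. It needs a relational EF Core provider package (for example `Microsoft.EntityFrameworkCore.SqlServer`) to compile.

```csharp
#nullable enable
using System;
using Microsoft.EntityFrameworkCore;

// Trimmed entity: only the columns relevant to dedup.
public class ConfigAuditLog
{
    public long AuditId { get; set; }          // identity PK; CLR type assumed
    public Guid? EventId { get; set; }         // NULL for legacy/backfill rows
    public Guid? CorrelationId { get; set; }
}

public class SketchDbContext : DbContext
{
    public DbSet<ConfigAuditLog> ConfigAuditLogs => Set<ConfigAuditLog>();

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.Entity<ConfigAuditLog>(e =>
        {
            e.HasKey(x => x.AuditId);

            // Unique only among non-NULL values, so any number of
            // NULL-EventId rows can coexist without colliding.
            e.HasIndex(x => x.EventId)
             .IsUnique()
             .HasFilter("[EventId] IS NOT NULL")
             .HasDatabaseName("UX_ConfigAuditLog_EventId");
        });
    }
}
```
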
**Scope** — two producers, two conventions:
- **Akka `AuditEvent` path** (the structured one): config writes + authorization checks. The
EventType vocabulary lives in the entity XML doc (`ConfigAuditLog.cs:18`): `DraftCreated |
DraftEdited | Published | RolledBack | NodeApplied | CredentialAdded | CredentialDisabled |
ClusterCreated | NodeAdded | ExternalIdReleased | CrossClusterNamespaceAttempt |
OpcUaAccessDenied | …`. Note the access-denied / cross-cluster entries are authz-check events,
not config writes.
- **SQL stored-procedure path** (older, still present): several SPs `INSERT dbo.ConfigAuditLog`
directly — e.g. `Published`/`RolledBack`/`NodeApplied`/`ExternalIdReleased`/`CrossClusterNamespaceAttempt`
in `Migrations/20260417215224_StoredProcedures.cs:151,217,351,407,504`. These use `SUSER_SNAME()`
as `Principal`, set `ClusterId`/`GenerationId`, write a **bare** `EventType` (no `Category:Action`
split), and leave `EventId`/`CorrelationId` NULL.
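
Any reader of the table therefore has to cope with both `EventType` conventions. A minimal sketch of a tolerant parser, assuming a bare value should be treated as an action with no category (`EventTypeParser` is a hypothetical helper, not something in the repo):

```csharp
#nullable enable
using System;

public static class EventTypeParser
{
    // Splits on the first ':'; a bare value yields a null category.
    public static (string? Category, string Action) Split(string eventType)
    {
        if (eventType is null) throw new ArgumentNullException(nameof(eventType));
        int i = eventType.IndexOf(':');
        return i < 0
            ? ((string?)null, eventType)                      // SP path: "Published"
            : (eventType[..i], eventType[(i + 1)..]);         // actor path: "Config:Edit"
    }
}
```

With this, `Split("Config:Edit")` yields `("Config", "Edit")` and `Split("Published")` yields `(null, "Published")`.
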
**Query / UI** — the only read surface is the Admin UI page
`src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Clusters/ClusterAudit.razor`
(`@page "/clusters/{ClusterId}/audit"`, `[Authorize]`, lines 1-2). It reads the latest
`PageSize = 200` rows (line 69) **filtered by `ClusterId`**, newest-first (`OnInitializedAsync`,
lines 74-82), and renders Timestamp / Principal / Event(Type) / Node / Correlation(first 8 hex) /
Details columns (lines 38-58). The writer actor is tested in
`tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests/AuditWriterActorTests.cs`: count-threshold
flush (lines 26-41), in-buffer dedup of duplicate EventIds (lines 45-62), `PostStop` flush
(lines 66-81), and the column mapping, including `EventType == "Config:Edit"` and `NodeId == "node-a"`
(lines 85-104).
> Load-bearing gotcha: the actor path **never sets `ClusterId`** (`AuditWriterActor.cs:75-84`), but the UI filters
> on `ClusterId` (`ClusterAudit.razor:78`). So today the cluster-scoped view surfaces the
> stored-procedure rows; structured `AuditEvent` rows written by the actor (which carry the host in
> `NodeId`, not `ClusterId`) won't appear under a cluster. Worth flagging during normalization.
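
The effect of that mismatch can be reproduced in memory. The sketch below is an illustration only: `AuditRow` is a trimmed stand-in for the entity and `ClusterAuditQuery.ForCluster` is a hypothetical equivalent of the page's query (filter by cluster, newest first, capped at the page size).

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record AuditRow(DateTime Timestamp, string EventType,
    string? ClusterId, string? NodeId, Guid? EventId);

public static class ClusterAuditQuery
{
    // In-memory equivalent of the page query: one cluster, newest first, capped.
    public static List<AuditRow> ForCluster(
        IEnumerable<AuditRow> rows, string clusterId, int pageSize = 200) =>
        rows.Where(r => r.ClusterId == clusterId)
            .OrderByDescending(r => r.Timestamp)
            .Take(pageSize)
            .ToList();
}
```

Given one stored-procedure-style row (`ClusterId = "c1"`, `EventId = null`) and one actor-style row (`ClusterId = null`, `NodeId = "node-a"`), `ForCluster(rows, "c1")` returns only the first, because a NULL `ClusterId` never matches the filter.
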
## 2. Mapping to the canonical `AuditEvent`
Target = `ZB.MOM.WW.Audit.AuditEvent` (built in parallel). OtOpcUa's existing `AuditEvent` is
already almost field-for-field aligned; `Outcome` is the only field that must be synthesized, while `Target` has no source today and stays null.
| Canonical field | OtOpcUa source | Mapping |
|---|---|---|
| `Guid EventId` | `AuditEvent.EventId` | Direct. Already the idempotency key (buffer key + `UX_ConfigAuditLog_EventId`). |
| `DateTimeOffset OccurredAtUtc` | `AuditEvent.OccurredAtUtc` (`DateTime`) | Direct; widen `DateTime`(UTC) → `DateTimeOffset`. |
| `string Actor` | `AuditEvent.Actor` | Direct (→ `ConfigAuditLog.Principal`). At Auth adoption this becomes the `ZB.MOM.WW.Auth` principal. |
| `string Action` | `AuditEvent.Action` (+ `Category`) | Direct. Today persisted as `"{Category}:{Action}"` in `EventType`; canonical keeps `Action` and `Category` separate. |
| `AuditOutcome Outcome` | *(none)* | **Derived** from the EventType vocabulary, not stored today. `OpcUaAccessDenied`/`CrossClusterNamespaceAttempt` → `Denied`; the config-write verbs → `Success`. No explicit `Failure` value exists yet (a failed flush is dropped, not recorded as an event). |
| `string? Category` | `AuditEvent.Category` | Direct (e.g. `"Config"`). |
| `string? Target` | *(none)* | No dedicated field today; the closest is `SourceNode` → `NodeId` (the acting host) or details. Leave null or carry the affected object in `DetailsJson`. |
| `string? SourceNode` | `AuditEvent.SourceNode` (`NodeId.Value`) | Direct — the logical cluster node / host name (NOT an OPC UA NodeId). Currently lands in `ConfigAuditLog.NodeId`. |
| `Guid? CorrelationId` | `AuditEvent.CorrelationId` (`CorrelationId.Value`) | Direct. |
| `string? DetailsJson` | `AuditEvent.DetailsJson` | Direct; carries everything else (incl. `ClusterId`/`GenerationId`, which today are separate columns on the SP path). |
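
The table can be read as a single mapping function. The sketch below works under stated assumptions: `CanonicalAuditEvent` mirrors the table's field list but is not the real `ZB.MOM.WW.Audit.AuditEvent`, the `AuditOutcome` member set is assumed, the value-types are flattened to primitives, and matching the vocabulary strings against `Action` (rather than the composed `EventType`) is a guess at how emit sites name them.

```csharp
#nullable enable
using System;

public enum AuditOutcome { Success, Denied, Failure }   // member set is an assumption

// Assumed shape of the canonical record, following the table's field list.
public sealed record CanonicalAuditEvent(
    Guid EventId, DateTimeOffset OccurredAtUtc, string Actor, string Action,
    AuditOutcome Outcome, string? Category, string? Target, string? SourceNode,
    Guid? CorrelationId, string? DetailsJson);

// Existing OtOpcUa record, with NodeId/CorrelationId flattened for brevity.
public sealed record LegacyAuditEvent(
    Guid EventId, string Category, string Action, string Actor,
    DateTime OccurredAtUtc, string? DetailsJson, string SourceNode, Guid CorrelationId);

public static class CanonicalMapper
{
    // Authz-check events map to Denied; everything else is treated as Success.
    public static AuditOutcome DeriveOutcome(string action) => action switch
    {
        "OpcUaAccessDenied" or "CrossClusterNamespaceAttempt" => AuditOutcome.Denied,
        _ => AuditOutcome.Success,
    };

    public static CanonicalAuditEvent ToCanonical(LegacyAuditEvent e) => new(
        e.EventId,
        // Force Kind=Utc first: an Unspecified DateTime would otherwise be
        // interpreted as local time by the DateTimeOffset constructor.
        new DateTimeOffset(DateTime.SpecifyKind(e.OccurredAtUtc, DateTimeKind.Utc)),
        e.Actor,
        e.Action,
        DeriveOutcome(e.Action),
        e.Category,
        null,                 // Target: no source field today
        e.SourceNode,
        e.CorrelationId,
        e.DetailsJson);
}
```
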
## 3. Adoption plan → `ZB.MOM.WW.Audit`
**Effort: medium.** OtOpcUa is the *donor* design for the canonical record, so most of the work is
re-pointing types and bridging two persistence conventions, not redesigning the pipeline.
**Replace with the shared library:**
- `Commons/Messages/Audit/AuditEvent.cs` → the canonical `ZB.MOM.WW.Audit.AuditEvent`. Add the new
`Outcome` field (derive it at every emit site from the EventType vocabulary, e.g.
`OpcUaAccessDenied → Denied`); keep `Category`/`Action`/`SourceNode`/`CorrelationId` as-is. Decide
whether `SourceNode`/`CorrelationId` carry the Commons value-types or the canonical primitives at
the seam (likely a thin adapter at construction).
- `AuditWriterActor` → implement the library's `IAuditWriter` (keep the actor as OtOpcUa's
Akka-cluster-singleton transport/batching adapter behind that seam; the 500/5s batching,
PreRestart/PostStop flush, and two-layer dedup stay bespoke per §"left per-project").
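
One way the actor could sit behind the library seam is a thin forwarding adapter. The sketch below is speculative: the real `IAuditWriter` signature is not shown in this audit, so the `WriteAsync` method is hypothetical, `AuditEvent` is a trimmed stand-in, and the `tell` delegate stands in for sending to the singleton's actor reference.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed record AuditEvent(Guid EventId, string Action, string Actor);  // trimmed stand-in

// Hypothetical seam: the real IAuditWriter in ZB.MOM.WW.Audit may differ.
public interface IAuditWriter
{
    ValueTask WriteAsync(AuditEvent evt, CancellationToken ct = default);
}

// Forwards each event to a delegate that stands in for a Tell to the singleton.
public sealed class ActorAuditWriter : IAuditWriter
{
    private readonly Action<AuditEvent> _tell;

    public ActorAuditWriter(Action<AuditEvent> tell) =>
        _tell = tell ?? throw new ArgumentNullException(nameof(tell));

    public ValueTask WriteAsync(AuditEvent evt, CancellationToken ct = default)
    {
        ct.ThrowIfCancellationRequested();
        _tell(evt);                        // fire-and-forget; batching stays in the actor
        return ValueTask.CompletedTask;
    }
}
```

Keeping the adapter this thin leaves batching, flush triggers, and dedup entirely inside the actor, so callers depend only on the library interface.
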
**Keep bespoke (thin adapter only):**
- Transport — the cluster-broadcast → singleton `AuditWriterActor`, batching, and flush triggers.
- Storage — the `ConfigAuditLog` EF entity, indexes, and `UX_ConfigAuditLog_EventId` idempotency
index. Map the canonical record onto the existing columns; add an `Outcome` column (or fold it into
`EventType`/`DetailsJson` if a schema change is undesirable). `ClusterId`/`GenerationId` remain
OtOpcUa-specific columns fed via `DetailsJson` or kept as side columns.
- Domain vocabulary — the EventType strings (`DraftCreated`, `Published`, `OpcUaAccessDenied`, …)
and the `Category:Action` composition convention.
- Query/UI — `ClusterAudit.razor` and its `ClusterId` filter.
**Reconcile, not extract:**
- The **two producers** (Akka `AuditEvent` path vs. SQL stored-procedure `INSERT`s using
`SUSER_SNAME()`). The SP path bypasses the canonical record entirely and writes a different
column convention (bare `EventType`, NULL `EventId`/`CorrelationId`, populated
`ClusterId`/`GenerationId`). Adopting the library does not by itself unify these; either route the
SP events through the actor or accept that SP rows stay non-idempotent and outside the
`EventId` dedup guarantee. Flag for the normalization spec.
- The **`ClusterId`-filter / actor-never-sets-`ClusterId`** mismatch noted in §1 — fix when the
query surface is normalized so structured `AuditEvent` rows are discoverable by cluster.