From e8c75cc38fcfd0c38ed0105923c6e847d80ef5cb Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Fri, 19 Jun 2026 00:23:32 -0400 Subject: [PATCH] docs(plan): design for cert-action audit logging + OpcUaProbeResult dead-letter tidy --- ...-audit-and-probe-deadletter-tidy-design.md | 161 ++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 docs/plans/2026-06-19-cert-audit-and-probe-deadletter-tidy-design.md diff --git a/docs/plans/2026-06-19-cert-audit-and-probe-deadletter-tidy-design.md b/docs/plans/2026-06-19-cert-audit-and-probe-deadletter-tidy-design.md new file mode 100644 index 00000000..9de7b06c --- /dev/null +++ b/docs/plans/2026-06-19-cert-audit-and-probe-deadletter-tidy-design.md @@ -0,0 +1,161 @@ +# Cert-action audit logging + `OpcUaProbeResult` dead-letter tidy — Design + +**Date:** 2026-06-19 +**Status:** Approved (brainstorming) +**Branch:** `feat/cert-audit-and-probe-tidy` (off master `40e8a23e`) + +Two independent, self-contained backlog items, batched into one branch. + +- **Item 1 — Cert-action audit logging** (`stillpending.md` §A.7 deferred follow-up): the + meaty one. Standard complexity. +- **Item 2 — `OpcUaProbeResult` dead-letter tidy** (Phase-4 final-integration-review + follow-up (i)): trivial actor hygiene. + +Hard guardrails (both): **no EF migration, no Commons/proto/interface change, no bUnit.** +Stage by explicit path; never stage the never-stage files (`sql_login.txt`, +`Host/pki/`, `docker-dev/docker-compose.yml`, `pending.md`, `current.md`, +`stillpending.md`). No `--no-verify`, no force-push. Finish = merge to master + push. + +--- + +## Background: H2 was reclassified, then this was chosen + +The session began on the "within-timestamp tie-cluster paging (#400)" follow-up — found +**already shipped** (`2e6c6d3a`/`0f929ae6`, 2026-06-17); reconciled the stale `pending.md` +banners. Then surveyed the backlog: the only genuinely-OPEN §A item, **H2 (OPC UA +`HistoryUpdate` service)**, was reclassified **infra-gated/DEFERRED** (no historian backend +can durably insert/replace/delete — `IHistorianDataSource` is read-only and the Wonderware +sidecar has no update RPC; same boundary as H5b). The two remaining actionable, in-repo, +non-infra items are the two below. + +--- + +## Item 1 — Cert-action audit logging + +### Problem + +When a FleetAdmin **Trusts / Untrusts / Deletes** a peer certificate +(`Certificates.razor` → `CertificateStoreManager`), nothing is written to the audit DB. +Today only certificate *validation* is logged (to Serilog). These are security-relevant +mutations and should leave a durable audit trail. + +### Key discovery that shapes the design + +OtOpcUa has **no live structured `AuditEvent` emit sites yet**. The `Security/AuditActor` +helper is explicitly forward-looking ("tested and ready so future emit sites pick up the +correct Actor automatically"), and `IAuditWriter` is **not** registered as an injectable +DI service anywhere. But the persistence path exists end-to-end: + +- `AuditWriterActor` (an **AdminRole cluster singleton**, `ControlPlane/Audit/`) implements + `IAuditWriter` (from the external `ZB.MOM.WW.Audit` package), batches `AuditEvent`s, and + bulk-inserts into the existing **`ConfigAuditLog`** table (filtered-unique-index dedup on + `EventId`). +- AdminUI already reaches an AdminRole singleton from the `ActorRegistry`: + `AdminOperationsClient` does `registry.Get()`. Same pattern works + for `AuditWriterActorKey`. + +So this is the **first** structured emit site — exactly what the infra was built for. The +`ConfigAuditLog` table and `AuditWriterActor` already exist ⇒ **no EF migration, no Commons +change.** + +### Components + +1. **`ActorAuditWriter : IAuditWriter`** — *new*, `AdminUI/Audit/ActorAuditWriter.cs`. + Holds the `AuditWriterActorKey` singleton-proxy `IActorRef` (resolved from + `ActorRegistry`, mirroring `AdminOperationsClient`); `WriteAsync(evt, ct)` does + `_proxy.Tell(evt)` and returns `Task.CompletedTask` (best-effort, non-blocking — the + actor owns batching/dedup/flush). Registered `AddSingleton()` + in the AdminUI DI extension. + +2. **`CertAuditEvents.Build(action, store, thumbprint, actor, success, error)` → `AuditEvent`** + — *new*, `AdminUI/Audit/CertAuditEvents.cs`. Pure static factory: + - `Category = "Certificate"`, `Action ∈ {Trust, Untrust, Delete}` + - `SourceNode = thumbprint` + - `DetailsJson` = compact JSON `{ "store": …, "thumbprint": …, "error": …? }` + - `Outcome` = Success / Failure (from `success`) + - `Actor = actor`, fresh `EventId`, `OccurredAtUtc = DateTimeOffset.UtcNow` + Pure ⇒ fully unit-testable; the implementer confirms the exact `AuditEvent` required / + optional fields + the `AuditOutcome` enum member names against the `ZB.MOM.WW.Audit` + package. + +3. **`CertificateStoreManager`** — *modify*. Audit at the **choke point** (the thing that + performs the mutation; already has a unit-test class — far more testable than the razor, + which has no bUnit). Each public method gains the actor string and emits the event from + the `CertActionResult` it already computes: + - `Trust(string thumbprint, string actor)` + - `Untrust(string thumbprint, string actor)` + - `Delete(string store, string thumbprint, string actor)` + Production ctor gains `IAuditWriter _audit`; the internal test ctor defaults it to a + null/fake writer so existing tests compile. Each method calls + `_audit.WriteAsync(CertAuditEvents.Build(...))` (non-blocking) — including the + not-found / invalid-thumbprint **failure** paths (audited with `Outcome = Failure`). + +4. **`Certificates.razor`** — *modify*. Pass the authenticated principal name into the + manager calls: `authState.User.Identity?.Name ?? "system"` (already fetched at the + FleetAdmin re-check, ~line 185). Sourcing the actor from `AuthenticationState` (not the + scoped `IAuditActorAccessor`/HttpContext) avoids the Blazor-Server HttpContext-null + pitfall over the SignalR circuit. + +### Data flow + +``` +FleetAdmin clicks Trust + → Certificates.razor RequestAction (FleetAdmin re-checked) + → CertManager.Trust(thumbprint, actor) + → filesystem move (rejected → trusted) [unchanged] + → CertAuditEvents.Build(Trust, store, thumbprint, actor, success, error) + → IAuditWriter.WriteAsync(evt) (= ActorAuditWriter → proxy.Tell) + → AuditWriterActor batches → bulk insert → ConfigAuditLog +``` + +Both success and failure are audited (Outcome reflects `result.Success`). + +### Testing (xUnit + Shouldly; no bUnit) + +- `CertAuditEventsTests` — factory output for trust/untrust/delete × success/failure: + Category, Action, SourceNode, Outcome, DetailsJson shape. +- Extend `CertificateStoreManagerTests` — inject a recording fake `IAuditWriter`; assert + exactly one event per action with the correct Action + Outcome, including the not-found + failure path. +- `ActorAuditWriterTests` — a `TestKit` probe registered under `AuditWriterActorKey` + receives the Tell'd `AuditEvent`. +- Razor wiring proven by `dotnet build` + a live docker-dev `/run` (login disabled on the + local rig — agent drives it; deploy, perform a Trust/Untrust on the Certificates page, + confirm a new `ConfigAuditLog` row). + +### Classification: **standard** (additive DI + first structured emit path; multi-file). + +--- + +## Item 2 — `OpcUaProbeResult` dead-letter tidy + +### Problem + +`DriverHostActor`, `HistorianAdapterActor`, `ScriptedAlarmHostActor` subscribe the +redundancy-state topic (for state changes) but have **no** handler for the co-published +`PeerOpcUaProbeActor.OpcUaProbeResult`, so each INFO-dead-letters that message (benign, +capped at 10 then 5-min-suppressed, but noisy). + +### Change + +Add `Receive(_ => { })` to each of the three actors — +the exact intentional-drop pattern already used by `PeerProbeSupervisor.cs:69` and +`OpcUaPublishActor` (which actually consumes it). ~3 lines + a one-line doc comment each. +The implementer confirms each actor's topic subscription before adding the drop. + +### Testing + +A dead-letter probe test (`EventStream` / `AllDeadLetters`) asserting that an +`OpcUaProbeResult` published to the redundancy topic produces **no** dead-letter for each +actor — mirrors the existing Phase-B dead-letter-probe test idiom. + +### Classification: **trivial → small** (mechanical drop handler + one guard test). + +--- + +## Execution + +Subagent-driven (this session). Item 1 and Item 2 touch **disjoint** files ⇒ their +implementers can be dispatched concurrently. Item 1 gets spec + code review (standard); +Item 2 gets a code review (small). Final integration review before finish. Finish = merge +to master + push.