docs(plan): design for cert-action audit logging + OpcUaProbeResult dead-letter tidy

This commit is contained in:
Joseph Doherty
2026-06-19 00:23:32 -04:00
parent 40e8a23e7c
commit e8c75cc38f
@@ -0,0 +1,161 @@
# Cert-action audit logging + `OpcUaProbeResult` dead-letter tidy — Design
**Date:** 2026-06-19
**Status:** Approved (brainstorming)
**Branch:** `feat/cert-audit-and-probe-tidy` (off master `40e8a23e`)
Two independent, self-contained backlog items, batched into one branch.
- **Item 1 — Cert-action audit logging** (`stillpending.md` §A.7 deferred follow-up): the
meaty one. Standard complexity.
- **Item 2 — `OpcUaProbeResult` dead-letter tidy** (Phase-4 final-integration-review
follow-up (i)): trivial actor hygiene.
Hard guardrails (both): **no EF migration, no Commons/proto/interface change, no bUnit.**
Stage by explicit path; never stage the never-stage files (`sql_login.txt`,
`Host/pki/`, `docker-dev/docker-compose.yml`, `pending.md`, `current.md`,
`stillpending.md`). No `--no-verify`, no force-push. Finish = merge to master + push.
---
## Background: H2 was reclassified, then this was chosen
The session began on the "within-timestamp tie-cluster paging (#400)" follow-up — found
**already shipped** (`2e6c6d3a`/`0f929ae6`, 2026-06-17); reconciled the stale `pending.md`
banners. Then surveyed the backlog: the only genuinely-OPEN §A item, **H2 (OPC UA
`HistoryUpdate` service)**, was reclassified **infra-gated/DEFERRED** (no historian backend
can durably insert/replace/delete — `IHistorianDataSource` is read-only and the Wonderware
sidecar has no update RPC; same boundary as H5b). The two remaining actionable, in-repo,
non-infra items are the two below.
---
## Item 1 — Cert-action audit logging
### Problem
When a FleetAdmin **Trusts / Untrusts / Deletes** a peer certificate
(`Certificates.razor``CertificateStoreManager`), nothing is written to the audit DB.
Today only certificate *validation* is logged (to Serilog). These are security-relevant
mutations and should leave a durable audit trail.
### Key discovery that shapes the design
OtOpcUa has **no live structured `AuditEvent` emit sites yet**. The `Security/AuditActor`
helper is explicitly forward-looking ("tested and ready so future emit sites pick up the
correct Actor automatically"), and `IAuditWriter` is **not** registered as an injectable
DI service anywhere. But the persistence path exists end-to-end:
- `AuditWriterActor` (an **AdminRole cluster singleton**, `ControlPlane/Audit/`) implements
`IAuditWriter` (from the external `ZB.MOM.WW.Audit` package), batches `AuditEvent`s, and
bulk-inserts into the existing **`ConfigAuditLog`** table (filtered-unique-index dedup on
`EventId`).
- AdminUI already reaches an AdminRole singleton from the `ActorRegistry`:
`AdminOperationsClient` does `registry.Get<AdminOperationsActorKey>()`. Same pattern works
for `AuditWriterActorKey`.
So this is the **first** structured emit site — exactly what the infra was built for. The
`ConfigAuditLog` table and `AuditWriterActor` already exist ⇒ **no EF migration, no Commons
change.**
### Components
1. **`ActorAuditWriter : IAuditWriter`** — *new*, `AdminUI/Audit/ActorAuditWriter.cs`.
Holds the `AuditWriterActorKey` singleton-proxy `IActorRef` (resolved from
`ActorRegistry`, mirroring `AdminOperationsClient`); `WriteAsync(evt, ct)` does
`_proxy.Tell(evt)` and returns `Task.CompletedTask` (best-effort, non-blocking — the
actor owns batching/dedup/flush). Registered `AddSingleton<IAuditWriter, ActorAuditWriter>()`
in the AdminUI DI extension.
2. **`CertAuditEvents.Build(action, store, thumbprint, actor, success, error)``AuditEvent`**
*new*, `AdminUI/Audit/CertAuditEvents.cs`. Pure static factory:
- `Category = "Certificate"`, `Action ∈ {Trust, Untrust, Delete}`
- `SourceNode = thumbprint`
- `DetailsJson` = compact JSON `{ "store": …, "thumbprint": …, "error": …? }`
- `Outcome` = Success / Failure (from `success`)
- `Actor = actor`, fresh `EventId`, `OccurredAtUtc = DateTimeOffset.UtcNow`
Pure ⇒ fully unit-testable; the implementer confirms the exact `AuditEvent` required /
optional fields + the `AuditOutcome` enum member names against the `ZB.MOM.WW.Audit`
package.
3. **`CertificateStoreManager`** — *modify*. Audit at the **choke point** (the thing that
performs the mutation; already has a unit-test class — far more testable than the razor,
which has no bUnit). Each public method gains the actor string and emits the event from
the `CertActionResult` it already computes:
- `Trust(string thumbprint, string actor)`
- `Untrust(string thumbprint, string actor)`
- `Delete(string store, string thumbprint, string actor)`
Production ctor gains `IAuditWriter _audit`; the internal test ctor defaults it to a
null/fake writer so existing tests compile. Each method calls
`_audit.WriteAsync(CertAuditEvents.Build(...))` (non-blocking) — including the
not-found / invalid-thumbprint **failure** paths (audited with `Outcome = Failure`).
4. **`Certificates.razor`** — *modify*. Pass the authenticated principal name into the
manager calls: `authState.User.Identity?.Name ?? "system"` (already fetched at the
FleetAdmin re-check, ~line 185). Sourcing the actor from `AuthenticationState` (not the
scoped `IAuditActorAccessor`/HttpContext) avoids the Blazor-Server HttpContext-null
pitfall over the SignalR circuit.
### Data flow
```
FleetAdmin clicks Trust
→ Certificates.razor RequestAction (FleetAdmin re-checked)
→ CertManager.Trust(thumbprint, actor)
→ filesystem move (rejected → trusted) [unchanged]
→ CertAuditEvents.Build(Trust, store, thumbprint, actor, success, error)
→ IAuditWriter.WriteAsync(evt) (= ActorAuditWriter → proxy.Tell)
→ AuditWriterActor batches → bulk insert → ConfigAuditLog
```
Both success and failure are audited (Outcome reflects `result.Success`).
### Testing (xUnit + Shouldly; no bUnit)
- `CertAuditEventsTests` — factory output for trust/untrust/delete × success/failure:
Category, Action, SourceNode, Outcome, DetailsJson shape.
- Extend `CertificateStoreManagerTests` — inject a recording fake `IAuditWriter`; assert
exactly one event per action with the correct Action + Outcome, including the not-found
failure path.
- `ActorAuditWriterTests` — a `TestKit` probe registered under `AuditWriterActorKey`
receives the Tell'd `AuditEvent`.
- Razor wiring proven by `dotnet build` + a live docker-dev `/run` (login disabled on the
local rig — agent drives it; deploy, perform a Trust/Untrust on the Certificates page,
confirm a new `ConfigAuditLog` row).
### Classification: **standard** (additive DI + first structured emit path; multi-file).
---
## Item 2 — `OpcUaProbeResult` dead-letter tidy
### Problem
`DriverHostActor`, `HistorianAdapterActor`, `ScriptedAlarmHostActor` subscribe the
redundancy-state topic (for state changes) but have **no** handler for the co-published
`PeerOpcUaProbeActor.OpcUaProbeResult`, so each INFO-dead-letters that message (benign,
capped at 10 then 5-min-suppressed, but noisy).
### Change
Add `Receive<PeerOpcUaProbeActor.OpcUaProbeResult>(_ => { })` to each of the three actors —
the exact intentional-drop pattern already used by `PeerProbeSupervisor.cs:69` and
`OpcUaPublishActor` (which actually consumes it). ~3 lines + a one-line doc comment each.
The implementer confirms each actor's topic subscription before adding the drop.
### Testing
A dead-letter probe test (`EventStream` / `AllDeadLetters`) asserting that an
`OpcUaProbeResult` published to the redundancy topic produces **no** dead-letter for each
actor — mirrors the existing Phase-B dead-letter-probe test idiom.
### Classification: **trivial → small** (mechanical drop handler + one guard test).
---
## Execution
Subagent-driven (this session). Item 1 and Item 2 touch **disjoint** files ⇒ their
implementers can be dispatched concurrently. Item 1 gets spec + code review (standard);
Item 2 gets a code review (small). Final integration review before finish. Finish = merge
to master + push.