Files
lmxopcua/docs/plans/2026-06-19-cert-audit-and-probe-deadletter-tidy-design.md
T

162 lines
7.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Cert-action audit logging + `OpcUaProbeResult` dead-letter tidy — Design
**Date:** 2026-06-19
**Status:** Approved (brainstorming)
**Branch:** `feat/cert-audit-and-probe-tidy` (off master `40e8a23e`)
Two independent, self-contained backlog items, batched into one branch.
- **Item 1 — Cert-action audit logging** (`stillpending.md` §A.7 deferred follow-up): the
meaty one. Standard complexity.
- **Item 2 — `OpcUaProbeResult` dead-letter tidy** (Phase-4 final-integration-review
follow-up (i)): trivial actor hygiene.
Hard guardrails (both): **no EF migration, no Commons/proto/interface change, no bUnit.**
Stage by explicit path; never stage the never-stage files (`sql_login.txt`,
`Host/pki/`, `docker-dev/docker-compose.yml`, `pending.md`, `current.md`,
`stillpending.md`). No `--no-verify`, no force-push. Finish = merge to master + push.
---
## Background: H2 was reclassified, then this was chosen
The session began on the "within-timestamp tie-cluster paging (#400)" follow-up — found
**already shipped** (`2e6c6d3a`/`0f929ae6`, 2026-06-17); reconciled the stale `pending.md`
banners. Then surveyed the backlog: the only genuinely-OPEN §A item, **H2 (OPC UA
`HistoryUpdate` service)**, was reclassified **infra-gated/DEFERRED** (no historian backend
can durably insert/replace/delete — `IHistorianDataSource` is read-only and the Wonderware
sidecar has no update RPC; same boundary as H5b). The two remaining actionable, in-repo,
non-infra items are the two below.
---
## Item 1 — Cert-action audit logging
### Problem
When a FleetAdmin **Trusts / Untrusts / Deletes** a peer certificate
(`Certificates.razor``CertificateStoreManager`), nothing is written to the audit DB.
Today only certificate *validation* is logged (to Serilog). These are security-relevant
mutations and should leave a durable audit trail.
### Key discovery that shapes the design
OtOpcUa has **no live structured `AuditEvent` emit sites yet**. The `Security/AuditActor`
helper is explicitly forward-looking ("tested and ready so future emit sites pick up the
correct Actor automatically"), and `IAuditWriter` is **not** registered as an injectable
DI service anywhere. But the persistence path exists end-to-end:
- `AuditWriterActor` (an **AdminRole cluster singleton**, `ControlPlane/Audit/`) implements
`IAuditWriter` (from the external `ZB.MOM.WW.Audit` package), batches `AuditEvent`s, and
bulk-inserts into the existing **`ConfigAuditLog`** table (filtered-unique-index dedup on
`EventId`).
- AdminUI already reaches an AdminRole singleton from the `ActorRegistry`:
`AdminOperationsClient` does `registry.Get<AdminOperationsActorKey>()`. Same pattern works
for `AuditWriterActorKey`.
So this is the **first** structured emit site — exactly what the infra was built for. The
`ConfigAuditLog` table and `AuditWriterActor` already exist ⇒ **no EF migration, no Commons
change.**
### Components
1. **`ActorAuditWriter : IAuditWriter`** — *new*, `AdminUI/Audit/ActorAuditWriter.cs`.
Holds the `AuditWriterActorKey` singleton-proxy `IActorRef` (resolved from
`ActorRegistry`, mirroring `AdminOperationsClient`); `WriteAsync(evt, ct)` does
`_proxy.Tell(evt)` and returns `Task.CompletedTask` (best-effort, non-blocking — the
actor owns batching/dedup/flush). Registered `AddSingleton<IAuditWriter, ActorAuditWriter>()`
in the AdminUI DI extension.
2. **`CertAuditEvents.Build(action, store, thumbprint, actor, success, error)``AuditEvent`**
*new*, `AdminUI/Audit/CertAuditEvents.cs`. Pure static factory:
- `Category = "Certificate"`, `Action ∈ {Trust, Untrust, Delete}`
- `SourceNode = thumbprint`
- `DetailsJson` = compact JSON `{ "store": …, "thumbprint": …, "error": …? }`
- `Outcome` = Success / Failure (from `success`)
- `Actor = actor`, fresh `EventId`, `OccurredAtUtc = DateTimeOffset.UtcNow`
Pure ⇒ fully unit-testable; the implementer confirms the exact `AuditEvent` required /
optional fields + the `AuditOutcome` enum member names against the `ZB.MOM.WW.Audit`
package.
3. **`CertificateStoreManager`** — *modify*. Audit at the **choke point** (the thing that
performs the mutation; already has a unit-test class — far more testable than the razor,
which has no bUnit). Each public method gains the actor string and emits the event from
the `CertActionResult` it already computes:
- `Trust(string thumbprint, string actor)`
- `Untrust(string thumbprint, string actor)`
- `Delete(string store, string thumbprint, string actor)`
Production ctor gains `IAuditWriter _audit`; the internal test ctor defaults it to a
null/fake writer so existing tests compile. Each method calls
`_audit.WriteAsync(CertAuditEvents.Build(...))` (non-blocking) — including the
not-found / invalid-thumbprint **failure** paths (audited with `Outcome = Failure`).
4. **`Certificates.razor`** — *modify*. Pass the authenticated principal name into the
manager calls: `authState.User.Identity?.Name ?? "system"` (already fetched at the
FleetAdmin re-check, ~line 185). Sourcing the actor from `AuthenticationState` (not the
scoped `IAuditActorAccessor`/HttpContext) avoids the Blazor-Server HttpContext-null
pitfall over the SignalR circuit.
### Data flow
```
FleetAdmin clicks Trust
→ Certificates.razor RequestAction (FleetAdmin re-checked)
→ CertManager.Trust(thumbprint, actor)
→ filesystem move (rejected → trusted) [unchanged]
→ CertAuditEvents.Build(Trust, store, thumbprint, actor, success, error)
→ IAuditWriter.WriteAsync(evt) (= ActorAuditWriter → proxy.Tell)
→ AuditWriterActor batches → bulk insert → ConfigAuditLog
```
Both success and failure are audited (Outcome reflects `result.Success`).
### Testing (xUnit + Shouldly; no bUnit)
- `CertAuditEventsTests` — factory output for trust/untrust/delete × success/failure:
Category, Action, SourceNode, Outcome, DetailsJson shape.
- Extend `CertificateStoreManagerTests` — inject a recording fake `IAuditWriter`; assert
exactly one event per action with the correct Action + Outcome, including the not-found
failure path.
- `ActorAuditWriterTests` — a `TestKit` probe registered under `AuditWriterActorKey`
receives the Tell'd `AuditEvent`.
- Razor wiring proven by `dotnet build` + a live docker-dev `/run` (login disabled on the
local rig — agent drives it; deploy, perform a Trust/Untrust on the Certificates page,
confirm a new `ConfigAuditLog` row).
### Classification: **standard** (additive DI + first structured emit path; multi-file).
---
## Item 2 — `OpcUaProbeResult` dead-letter tidy
### Problem
`DriverHostActor`, `HistorianAdapterActor`, `ScriptedAlarmHostActor` subscribe the
redundancy-state topic (for state changes) but have **no** handler for the co-published
`PeerOpcUaProbeActor.OpcUaProbeResult`, so each INFO-dead-letters that message (benign,
capped at 10 then 5-min-suppressed, but noisy).
### Change
Add `Receive<PeerOpcUaProbeActor.OpcUaProbeResult>(_ => { })` to each of the three actors —
the exact intentional-drop pattern already used by `PeerProbeSupervisor.cs:69` and
`OpcUaPublishActor` (which actually consumes it). ~3 lines + a one-line doc comment each.
The implementer confirms each actor's topic subscription before adding the drop.
### Testing
A dead-letter probe test (`EventStream` / `AllDeadLetters`) asserting that an
`OpcUaProbeResult` published to the redundancy topic produces **no** dead-letter for each
actor — mirrors the existing Phase-B dead-letter-probe test idiom.
### Classification: **trivial → small** (mechanical drop handler + one guard test).
---
## Execution
Subagent-driven (this session). Item 1 and Item 2 touch **disjoint** files ⇒ their
implementers can be dispatched concurrently. Item 1 gets spec + code review (standard);
Item 2 gets a code review (small). Final integration review before finish. Finish = merge
to master + push.