docs(plan): design for cert-action audit logging + OpcUaProbeResult dead-letter tidy
This commit is contained in:
@@ -0,0 +1,161 @@
|
||||
# Cert-action audit logging + `OpcUaProbeResult` dead-letter tidy — Design
|
||||
|
||||
**Date:** 2026-06-19
|
||||
**Status:** Approved (brainstorming)
|
||||
**Branch:** `feat/cert-audit-and-probe-tidy` (off master `40e8a23e`)
|
||||
|
||||
Two independent, self-contained backlog items, batched into one branch.
|
||||
|
||||
- **Item 1 — Cert-action audit logging** (`stillpending.md` §A.7 deferred follow-up): the
|
||||
meaty one. Standard complexity.
|
||||
- **Item 2 — `OpcUaProbeResult` dead-letter tidy** (Phase-4 final-integration-review
|
||||
follow-up (i)): trivial actor hygiene.
|
||||
|
||||
Hard guardrails (both): **no EF migration, no Commons/proto/interface change, no bUnit.**
|
||||
Stage by explicit path; never stage the never-stage files (`sql_login.txt`,
|
||||
`Host/pki/`, `docker-dev/docker-compose.yml`, `pending.md`, `current.md`,
|
||||
`stillpending.md`). No `--no-verify`, no force-push. Finish = merge to master + push.
|
||||
|
||||
---
|
||||
|
||||
## Background: H2 was reclassified, then this was chosen
|
||||
|
||||
The session began on the "within-timestamp tie-cluster paging (#400)" follow-up — found
|
||||
**already shipped** (`2e6c6d3a`/`0f929ae6`, 2026-06-17); reconciled the stale `pending.md`
|
||||
banners. Then surveyed the backlog: the only genuinely-OPEN §A item, **H2 (OPC UA
|
||||
`HistoryUpdate` service)**, was reclassified **infra-gated/DEFERRED** (no historian backend
|
||||
can durably insert/replace/delete — `IHistorianDataSource` is read-only and the Wonderware
|
||||
sidecar has no update RPC; same boundary as H5b). The two remaining actionable, in-repo,
|
||||
non-infra items are the two below.
|
||||
|
||||
---
|
||||
|
||||
## Item 1 — Cert-action audit logging
|
||||
|
||||
### Problem
|
||||
|
||||
When a FleetAdmin **Trusts / Untrusts / Deletes** a peer certificate
|
||||
(`Certificates.razor` → `CertificateStoreManager`), nothing is written to the audit DB.
|
||||
Today only certificate *validation* is logged (to Serilog). These are security-relevant
|
||||
mutations and should leave a durable audit trail.
|
||||
|
||||
### Key discovery that shapes the design
|
||||
|
||||
OtOpcUa has **no live structured `AuditEvent` emit sites yet**. The `Security/AuditActor`
|
||||
helper is explicitly forward-looking ("tested and ready so future emit sites pick up the
|
||||
correct Actor automatically"), and `IAuditWriter` is **not** registered as an injectable
|
||||
DI service anywhere. But the persistence path exists end-to-end:
|
||||
|
||||
- `AuditWriterActor` (an **AdminRole cluster singleton**, `ControlPlane/Audit/`) implements
|
||||
`IAuditWriter` (from the external `ZB.MOM.WW.Audit` package), batches `AuditEvent`s, and
|
||||
bulk-inserts into the existing **`ConfigAuditLog`** table (filtered-unique-index dedup on
|
||||
`EventId`).
|
||||
- AdminUI already reaches an AdminRole singleton from the `ActorRegistry`:
|
||||
`AdminOperationsClient` does `registry.Get<AdminOperationsActorKey>()`. Same pattern works
|
||||
for `AuditWriterActorKey`.
|
||||
|
||||
So this is the **first** structured emit site — exactly what the infra was built for. The
|
||||
`ConfigAuditLog` table and `AuditWriterActor` already exist ⇒ **no EF migration, no Commons
|
||||
change.**
|
||||
|
||||
### Components
|
||||
|
||||
1. **`ActorAuditWriter : IAuditWriter`** — *new*, `AdminUI/Audit/ActorAuditWriter.cs`.
|
||||
Holds the `AuditWriterActorKey` singleton-proxy `IActorRef` (resolved from
|
||||
`ActorRegistry`, mirroring `AdminOperationsClient`); `WriteAsync(evt, ct)` does
|
||||
`_proxy.Tell(evt)` and returns `Task.CompletedTask` (best-effort, non-blocking — the
|
||||
actor owns batching/dedup/flush). Registered `AddSingleton<IAuditWriter, ActorAuditWriter>()`
|
||||
in the AdminUI DI extension.
|
||||
|
||||
2. **`CertAuditEvents.Build(action, store, thumbprint, actor, success, error)` → `AuditEvent`**
|
||||
— *new*, `AdminUI/Audit/CertAuditEvents.cs`. Pure static factory:
|
||||
- `Category = "Certificate"`, `Action ∈ {Trust, Untrust, Delete}`
|
||||
- `SourceNode = thumbprint`
|
||||
- `DetailsJson` = compact JSON `{ "store": …, "thumbprint": …, "error": …? }`
|
||||
- `Outcome` = Success / Failure (from `success`)
|
||||
- `Actor = actor`, fresh `EventId`, `OccurredAtUtc = DateTimeOffset.UtcNow`
|
||||
Pure ⇒ fully unit-testable; the implementer confirms the exact `AuditEvent` required /
|
||||
optional fields + the `AuditOutcome` enum member names against the `ZB.MOM.WW.Audit`
|
||||
package.
|
||||
|
||||
3. **`CertificateStoreManager`** — *modify*. Audit at the **choke point** (the thing that
|
||||
performs the mutation; already has a unit-test class — far more testable than the razor,
|
||||
which has no bUnit). Each public method gains the actor string and emits the event from
|
||||
the `CertActionResult` it already computes:
|
||||
- `Trust(string thumbprint, string actor)`
|
||||
- `Untrust(string thumbprint, string actor)`
|
||||
- `Delete(string store, string thumbprint, string actor)`
|
||||
Production ctor gains `IAuditWriter _audit`; the internal test ctor defaults it to a
|
||||
null/fake writer so existing tests compile. Each method calls
|
||||
`_audit.WriteAsync(CertAuditEvents.Build(...))` (non-blocking) — including the
|
||||
not-found / invalid-thumbprint **failure** paths (audited with `Outcome = Failure`).
|
||||
|
||||
4. **`Certificates.razor`** — *modify*. Pass the authenticated principal name into the
|
||||
manager calls: `authState.User.Identity?.Name ?? "system"` (already fetched at the
|
||||
FleetAdmin re-check, ~line 185). Sourcing the actor from `AuthenticationState` (not the
|
||||
scoped `IAuditActorAccessor`/HttpContext) avoids the Blazor-Server HttpContext-null
|
||||
pitfall over the SignalR circuit.
|
||||
|
||||
### Data flow
|
||||
|
||||
```
|
||||
FleetAdmin clicks Trust
|
||||
→ Certificates.razor RequestAction (FleetAdmin re-checked)
|
||||
→ CertManager.Trust(thumbprint, actor)
|
||||
→ filesystem move (rejected → trusted) [unchanged]
|
||||
→ CertAuditEvents.Build(Trust, store, thumbprint, actor, success, error)
|
||||
→ IAuditWriter.WriteAsync(evt) (= ActorAuditWriter → proxy.Tell)
|
||||
→ AuditWriterActor batches → bulk insert → ConfigAuditLog
|
||||
```
|
||||
|
||||
Both success and failure are audited (Outcome reflects `result.Success`).
|
||||
|
||||
### Testing (xUnit + Shouldly; no bUnit)
|
||||
|
||||
- `CertAuditEventsTests` — factory output for trust/untrust/delete × success/failure:
|
||||
Category, Action, SourceNode, Outcome, DetailsJson shape.
|
||||
- Extend `CertificateStoreManagerTests` — inject a recording fake `IAuditWriter`; assert
|
||||
exactly one event per action with the correct Action + Outcome, including the not-found
|
||||
failure path.
|
||||
- `ActorAuditWriterTests` — a `TestKit` probe registered under `AuditWriterActorKey`
|
||||
receives the Tell'd `AuditEvent`.
|
||||
- Razor wiring proven by `dotnet build` + a live docker-dev `/run` (login disabled on the
|
||||
local rig — agent drives it; deploy, perform a Trust/Untrust on the Certificates page,
|
||||
confirm a new `ConfigAuditLog` row).
|
||||
|
||||
### Classification: **standard** (additive DI + first structured emit path; multi-file).
|
||||
|
||||
---
|
||||
|
||||
## Item 2 — `OpcUaProbeResult` dead-letter tidy
|
||||
|
||||
### Problem
|
||||
|
||||
`DriverHostActor`, `HistorianAdapterActor`, `ScriptedAlarmHostActor` subscribe the
|
||||
redundancy-state topic (for state changes) but have **no** handler for the co-published
|
||||
`PeerOpcUaProbeActor.OpcUaProbeResult`, so each INFO-dead-letters that message (benign,
|
||||
capped at 10 then 5-min-suppressed, but noisy).
|
||||
|
||||
### Change
|
||||
|
||||
Add `Receive<PeerOpcUaProbeActor.OpcUaProbeResult>(_ => { })` to each of the three actors —
|
||||
the exact intentional-drop pattern already used by `PeerProbeSupervisor.cs:69` and
|
||||
`OpcUaPublishActor` (which actually consumes it). ~3 lines + a one-line doc comment each.
|
||||
The implementer confirms each actor's topic subscription before adding the drop.
|
||||
|
||||
### Testing
|
||||
|
||||
A dead-letter probe test (`EventStream` / `AllDeadLetters`) asserting that an
|
||||
`OpcUaProbeResult` published to the redundancy topic produces **no** dead-letter for each
|
||||
actor — mirrors the existing Phase-B dead-letter-probe test idiom.
|
||||
|
||||
### Classification: **trivial → small** (mechanical drop handler + one guard test).
|
||||
|
||||
---
|
||||
|
||||
## Execution
|
||||
|
||||
Subagent-driven (this session). Item 1 and Item 2 touch **disjoint** files ⇒ their
|
||||
implementers can be dispatched concurrently. Item 1 gets spec + code review (standard);
|
||||
Item 2 gets a code review (small). Final integration review before finish. Finish = merge
|
||||
to master + push.
|
||||
Reference in New Issue
Block a user