Files
lmxopcua/docs/plans/2026-06-19-cert-audit-and-probe-deadletter-tidy-design.md
T

7.8 KiB
Raw Blame History

Cert-action audit logging + OpcUaProbeResult dead-letter tidy — Design

Date: 2026-06-19 Status: Approved (brainstorming) Branch: feat/cert-audit-and-probe-tidy (off master 40e8a23e)

Two independent, self-contained backlog items, batched into one branch.

  • Item 1 — Cert-action audit logging (stillpending.md §A.7 deferred follow-up): the meaty one. Standard complexity.
  • Item 2 — OpcUaProbeResult dead-letter tidy (Phase-4 final-integration-review follow-up (i)): trivial actor hygiene.

Hard guardrails (both): no EF migration, no Commons/proto/interface change, no bUnit. Stage by explicit path; never stage the never-stage files (sql_login.txt, Host/pki/, docker-dev/docker-compose.yml, pending.md, current.md, stillpending.md). No --no-verify, no force-push. Finish = merge to master + push.


Background: H2 was reclassified, then this was chosen

The session began on the "within-timestamp tie-cluster paging (#400)" follow-up — found already shipped (2e6c6d3a/0f929ae6, 2026-06-17); reconciled the stale pending.md banners. Then surveyed the backlog: the only genuinely-OPEN §A item, H2 (OPC UA HistoryUpdate service), was reclassified infra-gated/DEFERRED (no historian backend can durably insert/replace/delete — IHistorianDataSource is read-only and the Wonderware sidecar has no update RPC; same boundary as H5b). The two remaining actionable, in-repo, non-infra items are the two below.


Item 1 — Cert-action audit logging

Problem

When a FleetAdmin Trusts / Untrusts / Deletes a peer certificate (Certificates.razorCertificateStoreManager), nothing is written to the audit DB. Today only certificate validation is logged (to Serilog). These are security-relevant mutations and should leave a durable audit trail.

Key discovery that shapes the design

OtOpcUa has no live structured AuditEvent emit sites yet. The Security/AuditActor helper is explicitly forward-looking ("tested and ready so future emit sites pick up the correct Actor automatically"), and IAuditWriter is not registered as an injectable DI service anywhere. But the persistence path exists end-to-end:

  • AuditWriterActor (an AdminRole cluster singleton, ControlPlane/Audit/) implements IAuditWriter (from the external ZB.MOM.WW.Audit package), batches AuditEvents, and bulk-inserts into the existing ConfigAuditLog table (filtered-unique-index dedup on EventId).
  • AdminUI already reaches an AdminRole singleton from the ActorRegistry: AdminOperationsClient does registry.Get<AdminOperationsActorKey>(). Same pattern works for AuditWriterActorKey.

So this is the first structured emit site — exactly what the infra was built for. The ConfigAuditLog table and AuditWriterActor already exist ⇒ no EF migration, no Commons change.

Components

  1. ActorAuditWriter : IAuditWriternew, AdminUI/Audit/ActorAuditWriter.cs. Holds the AuditWriterActorKey singleton-proxy IActorRef (resolved from ActorRegistry, mirroring AdminOperationsClient); WriteAsync(evt, ct) does _proxy.Tell(evt) and returns Task.CompletedTask (best-effort, non-blocking — the actor owns batching/dedup/flush). Registered AddSingleton<IAuditWriter, ActorAuditWriter>() in the AdminUI DI extension.

  2. CertAuditEvents.Build(action, store, thumbprint, actor, success, error)AuditEventnew, AdminUI/Audit/CertAuditEvents.cs. Pure static factory:

    • Category = "Certificate", Action ∈ {Trust, Untrust, Delete}
    • SourceNode = thumbprint
    • DetailsJson = compact JSON { "store": …, "thumbprint": …, "error": …? }
    • Outcome = Success / Failure (from success)
    • Actor = actor, fresh EventId, OccurredAtUtc = DateTimeOffset.UtcNow Pure ⇒ fully unit-testable; the implementer confirms the exact AuditEvent required / optional fields + the AuditOutcome enum member names against the ZB.MOM.WW.Audit package.
  3. CertificateStoreManagermodify. Audit at the choke point (the thing that performs the mutation; already has a unit-test class — far more testable than the razor, which has no bUnit). Each public method gains the actor string and emits the event from the CertActionResult it already computes:

    • Trust(string thumbprint, string actor)
    • Untrust(string thumbprint, string actor)
    • Delete(string store, string thumbprint, string actor) Production ctor gains IAuditWriter _audit; the internal test ctor defaults it to a null/fake writer so existing tests compile. Each method calls _audit.WriteAsync(CertAuditEvents.Build(...)) (non-blocking) — including the not-found / invalid-thumbprint failure paths (audited with Outcome = Failure).
  4. Certificates.razormodify. Pass the authenticated principal name into the manager calls: authState.User.Identity?.Name ?? "system" (already fetched at the FleetAdmin re-check, ~line 185). Sourcing the actor from AuthenticationState (not the scoped IAuditActorAccessor/HttpContext) avoids the Blazor-Server HttpContext-null pitfall over the SignalR circuit.

Data flow

FleetAdmin clicks Trust
  → Certificates.razor RequestAction  (FleetAdmin re-checked)
  → CertManager.Trust(thumbprint, actor)
      → filesystem move (rejected → trusted)          [unchanged]
      → CertAuditEvents.Build(Trust, store, thumbprint, actor, success, error)
      → IAuditWriter.WriteAsync(evt)  (= ActorAuditWriter → proxy.Tell)
  → AuditWriterActor batches → bulk insert → ConfigAuditLog

Both success and failure are audited (Outcome reflects result.Success).

Testing (xUnit + Shouldly; no bUnit)

  • CertAuditEventsTests — factory output for trust/untrust/delete × success/failure: Category, Action, SourceNode, Outcome, DetailsJson shape.
  • Extend CertificateStoreManagerTests — inject a recording fake IAuditWriter; assert exactly one event per action with the correct Action + Outcome, including the not-found failure path.
  • ActorAuditWriterTests — a TestKit probe registered under AuditWriterActorKey receives the Tell'd AuditEvent.
  • Razor wiring proven by dotnet build + a live docker-dev /run (login disabled on the local rig — agent drives it; deploy, perform a Trust/Untrust on the Certificates page, confirm a new ConfigAuditLog row).

Classification: standard (additive DI + first structured emit path; multi-file).


Item 2 — OpcUaProbeResult dead-letter tidy

Problem

DriverHostActor, HistorianAdapterActor, ScriptedAlarmHostActor subscribe the redundancy-state topic (for state changes) but have no handler for the co-published PeerOpcUaProbeActor.OpcUaProbeResult, so each INFO-dead-letters that message (benign, capped at 10 then 5-min-suppressed, but noisy).

Change

Add Receive<PeerOpcUaProbeActor.OpcUaProbeResult>(_ => { }) to each of the three actors — the exact intentional-drop pattern already used by PeerProbeSupervisor.cs:69 and OpcUaPublishActor (which actually consumes it). ~3 lines + a one-line doc comment each. The implementer confirms each actor's topic subscription before adding the drop.

Testing

A dead-letter probe test (EventStream / AllDeadLetters) asserting that an OpcUaProbeResult published to the redundancy topic produces no dead-letter for each actor — mirrors the existing Phase-B dead-letter-probe test idiom.

Classification: trivial → small (mechanical drop handler + one guard test).


Execution

Subagent-driven (this session). Item 1 and Item 2 touch disjoint files ⇒ their implementers can be dispatched concurrently. Item 1 gets spec + code review (standard); Item 2 gets a code review (small). Final integration review before finish. Finish = merge to master + push.