Cert-action audit logging + OpcUaProbeResult dead-letter tidy — Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:subagent-driven-development to implement this plan task-by-task.

Goal: Record certificate Trust/Untrust/Delete actions to the ConfigAuditLog DB table via the existing IAuditWriter/AuditWriterActor path (the first live structured audit emit site), and silence the benign OpcUaProbeResult dead-letters in three Runtime actors.

Architecture: Item 1 adds an ActorAuditWriter : IAuditWriter AdminUI adapter (Tells the AuditWriterActorKey cluster-singleton proxy, mirroring AdminOperationsClient), a pure CertAuditEvents factory, and audits at the CertificateStoreManager choke point. Item 2 adds an intentional-drop Receive<…OpcUaProbeResult>(_ => {}) to three actors (the PeerProbeSupervisor.cs:69 idiom). No EF migration, no Commons/proto/interface change, no bUnit.

Tech Stack: .NET 10, Akka.NET (Akka.Hosting ActorRegistry), Blazor Server (AdminUI), EF Core (ConfigAuditLog — table already exists), ZB.MOM.WW.Audit package (IAuditWriter/AuditEvent), xUnit + Shouldly, Akka.TestKit.

Design: docs/plans/2026-06-19-cert-audit-and-probe-deadletter-tidy-design.md

Branch: feat/cert-audit-and-probe-tidy (off master 40e8a23e; design committed e8c75cc3)

Standing guardrails: stage by explicit path (never git add .); never stage sql_login.txt, Host/pki/, docker-dev/docker-compose.yml, pending.md, current.md, stillpending.md. No --no-verify, no force-push. dangerouslyDisableSandbox: true for build/test/rig. Finish = merge to master + push.

Dependency graph: {T1 ∥ T3} → T2 → T4. T1 (AdminUI cert-audit core) and T3 (Runtime actor drops) touch disjoint projects ⇒ dispatch their implementers concurrently. T2 depends on T1 (uses CertAuditEvents + IAuditWriter). T4 is the final gate.


Task 0 (do this once, before T1/T2): Confirm the AuditEvent / IAuditWriter / AuditOutcome shape

Not a code task: this is a read-through that the T1/T2 implementer MUST complete first so the factory matches the real package type (it lives in the external ZB.MOM.WW.Audit assembly, not this repo).

  • Go-to-definition / decompile ZB.MOM.WW.Audit.AuditEvent, IAuditWriter, and the enum behind the Outcome property (the code below assumes it is named AuditOutcome). Confirm: the exact property names, and whether AuditEvent is a record with a primary ctor or has init-settable properties; the Outcome enum member names (the design assumes Success / Failure; verify); the CorrelationId property type (Guid? / a CorrelationId struct / nullable).
  • Ground truth for how the existing code populates it: src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/Audit/AuditWriterActor.cs:97-107 reads EventId, OccurredAtUtc, Actor, Category, Action, SourceNode, DetailsJson, CorrelationId, Outcome. Mirror those names exactly.
  • If any assumed field/name differs, adjust the T1 factory code below accordingly (the shape is the contract; the literal property names must match the package).

Task 1: Cert-audit core — ActorAuditWriter + CertAuditEvents factory + DI + unit tests

Classification: standard
Estimated implement time: ~5 min
Parallelizable with: Task 3

Files:

  • Create: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Audit/ActorAuditWriter.cs
  • Create: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Audit/CertAuditEvents.cs
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/EndpointRouteBuilderExtensions.cs (register IAuditWriter near line 55)
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Audit/CertAuditEventsTests.cs
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Audit/ActorAuditWriterTests.cs

Step 1 — Write CertAuditEventsTests (failing). Cover trust/untrust/delete × success/failure:

  • Build("Trust", "trusted", thumb, "alice", success: true, error: null) ⇒ Category=="Certificate", Action=="Trust", SourceNode==thumb, Actor=="alice", Outcome is the success member, DetailsJson contains the store + thumbprint, EventId != Guid.Empty.
  • A failure case (success: false, error: "certificate not found in rejected") ⇒ Outcome is the failure member and DetailsJson includes the error text. Use Shouldly. (Assert DetailsJson by ShouldContain substrings, not exact JSON, to stay robust.)
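
The cases above could be sketched as the xUnit + Shouldly class below. It is only a sketch under stated assumptions: the ZB.MOM.WW.Audit namespace, the AuditOutcome.Success / AuditOutcome.Failure member names, and a Guid-typed EventId all still need the Task 0 confirmation, and the thumbprint constant is a made-up placeholder. It compiles only inside the AdminUI.Tests project once Step 2 has added CertAuditEvents.

```csharp
using System;
using Shouldly;
using Xunit;
using ZB.MOM.WW.Audit;                    // assumed namespace; confirm in Task 0
using ZB.MOM.WW.OtOpcUa.AdminUI.Audit;

namespace ZB.MOM.WW.OtOpcUa.AdminUI.Tests.Audit;

public sealed class CertAuditEventsTests
{
    // Hypothetical thumbprint, used only as a recognisable substring.
    private const string Thumb = "0123456789ABCDEF0123456789ABCDEF01234567";

    [Theory]
    [InlineData("Trust", "trusted")]
    [InlineData("Untrust", "rejected")]
    [InlineData("Delete", "trusted")]
    public void Build_success_populates_canonical_fields(string action, string store)
    {
        var evt = CertAuditEvents.Build(action, store, Thumb, "alice", success: true, error: null);

        evt.Category.ShouldBe("Certificate");
        evt.Action.ShouldBe(action);
        evt.SourceNode.ShouldBe(Thumb);
        evt.Actor.ShouldBe("alice");
        evt.Outcome.ShouldBe(AuditOutcome.Success);   // member name assumed
        evt.DetailsJson.ShouldContain(store);         // substrings, not exact JSON
        evt.DetailsJson.ShouldContain(Thumb);
        evt.EventId.ShouldNotBe(Guid.Empty);          // assumes EventId is a Guid
    }

    [Theory]
    [InlineData("Trust", "trusted")]
    [InlineData("Untrust", "rejected")]
    [InlineData("Delete", "trusted")]
    public void Build_failure_records_outcome_and_error(string action, string store)
    {
        const string error = "certificate not found in rejected";

        var evt = CertAuditEvents.Build(action, store, Thumb, "alice", success: false, error: error);

        evt.Outcome.ShouldBe(AuditOutcome.Failure);   // member name assumed
        evt.DetailsJson.ShouldContain(error);
    }
}
```

Using two theories keeps the trust/untrust/delete × success/failure matrix explicit without six near-identical facts.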

Step 2 — Write CertAuditEvents:

using System.Text.Json;
using ZB.MOM.WW.Audit;   // confirm namespace in Task 0

namespace ZB.MOM.WW.OtOpcUa.AdminUI.Audit;

/// <summary>
/// Pure factory for the canonical <see cref="AuditEvent"/> emitted on a certificate-store
/// mutation (Trust / Untrust / Delete). Separated from <c>CertificateStoreManager</c> so the
/// event shape is unit-testable without a filesystem or an actor system.
/// </summary>
public static class CertAuditEvents
{
    /// <summary>The audit category for all certificate-store actions.</summary>
    public const string Category = "Certificate";

    /// <summary>Builds the audit event for one certificate action and outcome.</summary>
    /// <param name="action">The action verb: <c>Trust</c> / <c>Untrust</c> / <c>Delete</c>.</param>
    /// <param name="store">The PKI sub-store acted on (e.g. <c>trusted</c> / <c>rejected</c>).</param>
    /// <param name="thumbprint">The certificate thumbprint.</param>
    /// <param name="actor">The authenticated principal performing the action.</param>
    /// <param name="success">Whether the action succeeded.</param>
    /// <param name="error">The failure message when <paramref name="success"/> is false; otherwise null.</param>
    /// <returns>A canonical audit event ready for <see cref="IAuditWriter.WriteAsync"/>.</returns>
    public static AuditEvent Build(
        string action, string store, string thumbprint, string actor, bool success, string? error)
    {
        var details = JsonSerializer.Serialize(new
        {
            store,
            thumbprint,
            error = success ? null : error,
        });

        return new AuditEvent
        {
            EventId = Guid.NewGuid(),
            OccurredAtUtc = DateTimeOffset.UtcNow,
            Actor = actor,
            Category = Category,
            Action = action,
            SourceNode = thumbprint,
            DetailsJson = details,
            // CorrelationId / Outcome property types per Task 0. Outcome member names assumed
            // Success / Failure — confirm and adjust.
            Outcome = success ? AuditOutcome.Success : AuditOutcome.Failure,
        };
    }
}

(Adjust property/enum names to the real package per Task 0. If AuditEvent uses a positional record ctor instead of init-setters, switch to the ctor form.)
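
To make the DetailsJson shape concrete, the stand-alone sketch below repeats only the serialization call from Build. DetailsJsonSketch and its Details method are hypothetical names used for illustration; the behaviour shown is that of System.Text.Json with default options, which is what the factory above uses.

```csharp
#nullable enable
using System.Text.Json;

public static class DetailsJsonSketch
{
    // Same anonymous-object shape that CertAuditEvents.Build serializes into DetailsJson.
    public static string Details(string store, string thumbprint, bool success, string? error) =>
        JsonSerializer.Serialize(new
        {
            store,
            thumbprint,
            error = success ? null : error,   // the error text is kept only on failure
        });
}
```

With default options, Details("trusted", "AB12", true, null) yields {"store":"trusted","thumbprint":"AB12","error":null}: the error property is still written, as a JSON null. A failure call such as Details("rejected", "AB12", false, "certificate not found in rejected") carries the text through unchanged. The default encoder escapes non-ASCII characters, so a store label like the "rejected→trusted" passed in Task 2 is stored as rejected\u2192trusted; a ShouldContain("rejected→trusted") assertion on DetailsJson would therefore not match, which is one more reason to assert on plain ASCII substrings.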

Step 3 — Write ActorAuditWriterTests (failing). Use Akka.TestKit: register a TestProbe in an ActorRegistry under AuditWriterActorKey, construct ActorAuditWriter(registry), call WriteAsync(evt), assert the probe ExpectMsg<AuditEvent>() equal to evt. (Mirror how other AdminUI TestKit tests build an ActorRegistry + register a key — search the AdminUI.Tests project.)
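
One possible shape for that test is sketched below. It assumes the Akka.TestKit.Xunit2 base class and that a bare ActorRegistry can be constructed and populated with Register in a test; the existing AdminUI.Tests helpers take precedence if they do it differently. The thumbprint and actor values are placeholders, and the package namespace is the one still to be confirmed in Task 0.

```csharp
using System.Threading.Tasks;
using Akka.Hosting;
using Akka.TestKit.Xunit2;                 // assumed TestKit flavour; match the test project
using Shouldly;
using Xunit;
using ZB.MOM.WW.Audit;                     // assumed namespace; confirm in Task 0
using ZB.MOM.WW.OtOpcUa.AdminUI.Audit;
using ZB.MOM.WW.OtOpcUa.ControlPlane;      // AuditWriterActorKey

namespace ZB.MOM.WW.OtOpcUa.AdminUI.Tests.Audit;

public sealed class ActorAuditWriterTests : TestKit
{
    [Fact]
    public async Task WriteAsync_tells_the_event_to_the_registered_audit_writer()
    {
        // A TestProbe stands in for the AuditWriterActor singleton proxy.
        var probe = CreateTestProbe();
        var registry = new ActorRegistry();
        registry.Register<AuditWriterActorKey>(probe.Ref);

        var writer = new ActorAuditWriter(registry);
        var evt = CertAuditEvents.Build(
            "Trust", "trusted", "ABCDEF", "alice", success: true, error: null);

        await writer.WriteAsync(evt);

        // A local Tell normally hands over the same instance, so this holds
        // whether or not AuditEvent defines value equality.
        probe.ExpectMsg<AuditEvent>().ShouldBe(evt);
    }
}
```

Because WriteAsync returns a completed task, awaiting it adds no delay; the await is there so the test keeps compiling if the signature confirmed in Task 0 turns out to be genuinely asynchronous.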

Step 4 — Write ActorAuditWriter (copy the AdminOperationsClient resolve idiom):

using Akka.Actor;
using Akka.Hosting;
using ZB.MOM.WW.Audit;
using ZB.MOM.WW.OtOpcUa.ControlPlane;   // AuditWriterActorKey

namespace ZB.MOM.WW.OtOpcUa.AdminUI.Audit;

/// <summary>
/// <see cref="IAuditWriter"/> backed by the <c>AuditWriterActor</c> cluster singleton registered
/// in <c>AddOtOpcUaControlPlane</c>. Resolves the singleton proxy from <see cref="ActorRegistry"/>
/// at construction (mirroring <c>AdminOperationsClient</c>); <see cref="WriteAsync"/> is a
/// best-effort, non-blocking <c>Tell</c> — the actor owns batching, dedup, and flush.
/// </summary>
public sealed class ActorAuditWriter : IAuditWriter
{
    private readonly IActorRef _proxy;

    /// <summary>Initializes a new instance resolving the audit-writer singleton proxy.</summary>
    /// <param name="registry">The actor registry to resolve the audit-writer singleton proxy.</param>
    public ActorAuditWriter(ActorRegistry registry) => _proxy = registry.Get<AuditWriterActorKey>();

    /// <inheritdoc />
    public Task WriteAsync(AuditEvent evt, CancellationToken ct = default)
    {
        _proxy.Tell(evt);
        return Task.CompletedTask;
    }
}

(Confirm the exact IAuditWriter.WriteAsync signature in Task 0 and match it.)

Step 5 — Register in DI. In EndpointRouteBuilderExtensions.cs, next to services.AddSingleton<Certificates.CertificateStoreManager>(); (line ~55):

services.AddSingleton<ZB.MOM.WW.Audit.IAuditWriter, Audit.ActorAuditWriter>();

(Confirm ActorRegistry is resolvable in this method's host — it is, Akka.Hosting registers it as a singleton; AdminOperationsClient proves the pattern works in AdminUI.)

Step 6 — Run tests + build: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests --filter "FullyQualifiedName~Audit" → green.

Step 7 — Commit (stage by path): git add src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Audit/ src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/EndpointRouteBuilderExtensions.cs tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Audit/ then git commit -m "feat(adminui): IAuditWriter adapter + cert-action audit-event factory".


Task 2: Wire audit into CertificateStoreManager + Certificates.razor + extend manager tests

Classification: standard
Estimated implement time: ~5 min
Parallelizable with: none (depends on Task 1)

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Certificates/CertificateStoreManager.cs
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Certificates.razor:194-205
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Certificates/CertificateStoreManagerTests.cs

Step 1 — Extend CertificateStoreManagerTests (failing). Add a recording fake IAuditWriter:

private sealed class RecordingAuditWriter : ZB.MOM.WW.Audit.IAuditWriter
{
    public List<ZB.MOM.WW.Audit.AuditEvent> Events { get; } = new();
    public Task WriteAsync(ZB.MOM.WW.Audit.AuditEvent evt, CancellationToken ct = default)
    { Events.Add(evt); return Task.CompletedTask; }
}

Change the test ctor to build _audit = new RecordingAuditWriter(); and _sut = new CertificateStoreManager(_root, _audit);. Add tests:

  • Trust_success_writes_one_success_audit_event: seed a rejected cert, Trust(thumb, "alice"), assert _audit.Events has one event with Action=="Trust", success outcome, SourceNode==thumb.
  • Delete_not_found_writes_one_failure_audit_event: Delete("trusted", unknownThumb, "bob"), assert one event, failure outcome.
  • Confirm the existing (no-audit-arg) assertions still pass via the defaulted test ctor.

Step 2 — Modify CertificateStoreManager:

  • Add field private readonly IAuditWriter _audit;.
  • Production ctor: public CertificateStoreManager(IConfiguration config, IAuditWriter audit) — set _audit = audit; (and keep the pki-root read).
  • Internal test ctor: internal CertificateStoreManager(string pkiRoot, IAuditWriter? audit = null); set _audit = audit ?? NullAuditWriter. (Use a tiny private no-op IAuditWriter if the package has no Null impl; confirm in Task 0.)
  • Add the actor param to the three public methods and emit after computing the result:
public CertActionResult Trust(string thumbprint, string actor)
{
    var r = Move("rejected", "trusted", thumbprint);
    Audit("Trust", "rejected→trusted", thumbprint, actor, r);
    return r;
}
// Untrust(thumbprint, actor) → Move("trusted","rejected",…) + Audit("Untrust", "trusted→rejected", …)
// Delete(store, thumbprint, actor) → existing body + Audit("Delete", store, …, r) at each return

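// Fire-and-forget by design: the discard makes it explicit that the returned Task is not awaited.
private void Audit(string action, string store, string thumbprint, string actor, CertActionResult r)
    => _ = _audit.WriteAsync(CertAuditEvents.Build(action, store, thumbprint, actor, r.Success, r.Error));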

Ensure every return path of Delete (invalid store, invalid thumbprint, not-found, success, IO failure) is audited — simplest is to compute the CertActionResult into a local, Audit(...), then return it. Keep _audit.WriteAsync(...) fire-and-forget (returns a completed task).
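
One way to guarantee that no return path of Delete escapes the audit is to rename the existing body and wrap it, as in the stand-alone sketch below. Everything in it is illustrative: ResultSketch stands in for CertActionResult, DeleteCore is a hypothetical name for the renamed body, an in-memory set replaces the PKI directories, and the Audited list replaces the real Audit(...) call.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Stand-in for CertActionResult; only Success / Error / Fail echo the plan.
public sealed record ResultSketch(bool Success, string? Error)
{
    public static ResultSketch Ok() => new(true, null);
    public static ResultSketch Fail(string error) => new(false, error);
}

public sealed class DeleteAuditSketch
{
    private static readonly HashSet<string> Stores = new(StringComparer.Ordinal) { "trusted", "rejected" };
    private readonly HashSet<string> _known;   // thumbprints present "on disk" (in memory here)

    // Records what would have been handed to the audit writer.
    public List<(string Action, string Store, bool Success)> Audited { get; } = new();

    public DeleteAuditSketch(IEnumerable<string> knownThumbprints) =>
        _known = new HashSet<string>(knownThumbprints, StringComparer.OrdinalIgnoreCase);

    // Single exit point: whatever DeleteCore returns is audited exactly once.
    public ResultSketch Delete(string store, string thumbprint, string actor)
    {
        var r = DeleteCore(store, thumbprint);
        Audited.Add(("Delete", store, r.Success));   // real code: Audit("Delete", store, thumbprint, actor, r)
        return r;
    }

    // The multi-return body, unchanged apart from the rename.
    private ResultSketch DeleteCore(string store, string thumbprint)
    {
        if (!Stores.Contains(store)) return ResultSketch.Fail($"invalid store '{store}'");
        if (string.IsNullOrWhiteSpace(thumbprint)) return ResultSketch.Fail("invalid thumbprint");
        if (!_known.Remove(thumbprint)) return ResultSketch.Fail($"certificate not found in {store}");
        return ResultSketch.Ok();
    }
}
```

The wrapper keeps the early returns readable and makes a missed audit structurally impossible, at the cost of one extra private method. If the real Delete catches IO exceptions and converts them to a failed result, that conversion belongs inside the renamed body so the failure is audited too.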

Step 3 — Modify Certificates.razor ConfirmAction (line 194): capture the actor and thread it:

var actor = authState.User.Identity?.Name ?? "system";
var result = p.Verb switch
{
    "trust"   => CertManager.Trust(p.Thumbprint, actor),
    "untrust" => CertManager.Untrust(p.Thumbprint, actor),
    "delete"  => p.Kind switch
    {
        StoreKind.Trusted  => CertManager.Delete("trusted", p.Thumbprint, actor),
        StoreKind.Rejected => CertManager.Delete("rejected", p.Thumbprint, actor),
        _                  => CertActionResult.Fail($"cannot delete from {p.Kind}"),
    },
    _ => CertActionResult.Fail("unknown action"),
};

(authState is already in scope from the FleetAdmin re-check at line 185.)

Step 4 — Test + build: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests --filter "FullyQualifiedName~CertificateStoreManager" → green; dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI → 0 errors (proves the razor compiles).

Step 5 — Commit: git add src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Certificates/CertificateStoreManager.cs src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Certificates.razor tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Certificates/CertificateStoreManagerTests.cs then git commit -m "feat(adminui): audit cert Trust/Untrust/Delete to ConfigAuditLog".


Task 3: OpcUaProbeResult dead-letter tidy — three actor drop handlers + guard test

Classification: small
Estimated implement time: ~4 min
Parallelizable with: Task 1

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs (handler region ~line 482+)
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs (ctor handlers ~line 86)
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs (ctor handlers ~line 172)
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/… (new guard test; place beside the existing dead-letter-probe test — find it via rg "AllDeadLetters" tests)

Step 1 — Write the guard test (failing without the fix). For at least DriverHostActor (the most-trafficked), subscribe a probe to AllDeadLetters on the EventStream, spin up the actor with its redundancy-topic subscription, publish a PeerOpcUaProbeActor.OpcUaProbeResult to the RedundancyStateTopic, and assert no DeadLetter wrapping an OpcUaProbeResult. (Mirror the Phase-B AllDeadLetters-probe idiom already in Runtime.Tests.) A combined test covering all three is fine if the harness makes that easy; otherwise the DriverHostActor case is the required guard and the other two are covered by inspection + build.

Step 2 — Add the drop to each actor. Each already subscribes the redundancy-state topic in PreStart and imports Akka.Cluster.Tools.PublishSubscribe. Add, next to the existing RedundancyStateChanged / SubscribeAck handlers:

// The redundancy-state topic also carries OpcUaProbeResult (published by OpcUaPublishActor's
// peer-probes). We don't consume it here — drop it so it doesn't dead-letter. Matches
// PeerProbeSupervisor / OpcUaPublishActor. (intentional no-op)
Receive<ZB.MOM.WW.OtOpcUa.Runtime.Health.PeerOpcUaProbeActor.OpcUaProbeResult>(_ => { });
  • HistorianAdapterActor: add in the ctor near Receive<SubscribeAck>(_ => { }); (~line 87).
  • ScriptedAlarmHostActor: add in the ctor near the other Receive<…> registrations (~line 172).
  • DriverHostActor: add in the steady-state behaviour, next to the Receive<RedundancyStateChanged> handler. If DriverHostActor uses Become/multiple behaviours, add the drop to every behaviour that handles RedundancyStateChanged (that is, every behaviour that can be active while the actor is subscribed); otherwise the probe result still dead-letters in the behaviours that lack it.
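
The multi-behaviour caveat can be illustrated with the minimal ReceiveActor below. It is a sketch, not DriverHostActor: the two record types are stand-ins that only borrow the real message names, and the Standby / Active split is invented for the example.

```csharp
using Akka.Actor;

// Stand-ins for the real messages; only the type names echo the plan.
public sealed record RedundancyStateChanged(bool IsPrimary);
public sealed record OpcUaProbeResult(bool Reachable);

public sealed class TwoBehaviourSketchActor : ReceiveActor
{
    public TwoBehaviourSketchActor() => Standby();

    private void Standby()
    {
        Receive<RedundancyStateChanged>(m => { if (m.IsPrimary) Become(Active); });
        DropProbeResults();
    }

    private void Active()
    {
        Receive<RedundancyStateChanged>(m => { if (!m.IsPrimary) Become(Standby); });
        DropProbeResults();
    }

    // Become swaps in a fresh handler set, so the intentional no-op has to be
    // registered again in each behaviour that is active while subscribed.
    private void DropProbeResults() => Receive<OpcUaProbeResult>(_ => { });
}
```

Factoring the drop into one helper keeps the behaviours from drifting apart: a behaviour added later only needs the single DropProbeResults() call.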

Step 3 — Test + build: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~ProbeResult" → green; dotnet build src/Server/ZB.MOM.WW.OtOpcUa.Runtime → 0 errors.

Step 4 — Commit: git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ then git commit -m "fix(runtime): drop OpcUaProbeResult in redundancy-topic subscribers (no dead-letter)".


Task 4: Reconcile + full build/test + live /run + finish

Classification: standard (verification + integration review + finish)
Estimated implement time: ~5 min + live run
Parallelizable with: none (depends on T1, T2, T3)

Step 1 — Full build + solution test: dotnet build ZB.MOM.WW.OtOpcUa.slnx → 0 errors; dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests → green.

Step 2 — Final integration review (code-reviewer subagent over git diff master..HEAD): confirm no Commons/proto/EF change, audit emitted on all cert paths (incl. failures), the AuditEvent shape matches the package, the drop handler is in every behaviour that subscribes the topic, and tests genuinely fail without the fix.

Step 3 — Live /run (agent-driven; docker-dev login is disabled). Rebuild both central nodes to the branch (docker compose -f docker-dev/docker-compose.yml up -d --build central-1 central-2); open the AdminUI Certificates page (http://localhost:9200), perform a Trust or Untrust (or seed a rejected cert first), and confirm a new ConfigAuditLog row exists (query the shared SQL 10.100.0.35,14330 ConfigDb: principal = the dev admin user, EventType = Certificate:Trust, Outcome = success). Confirm server logs show no OpcUaProbeResult dead-letter on either node. Capture the evidence.

Step 4 — Finish. git checkout master && git merge --ff-only feat/cert-audit-and-probe-tidy (fast-forward; the branch is linear off master), then git push. Delete the feature branch. Update pending.md/stillpending.md (on-disk, never-staged) to record both items shipped.