Files
lmxopcua/docs/plans/2026-06-19-cert-audit-and-probe-deadletter-tidy.md
T

337 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Cert-action audit logging + `OpcUaProbeResult` dead-letter tidy — Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:subagent-driven-development to implement this plan task-by-task.
**Goal:** Record certificate Trust/Untrust/Delete actions to the `ConfigAuditLog` DB table
via the existing `IAuditWriter`/`AuditWriterActor` path (the first live structured audit
emit site), and silence the benign `OpcUaProbeResult` dead-letters in three Runtime actors.
**Architecture:** Item 1 adds an `ActorAuditWriter : IAuditWriter` AdminUI adapter (Tells the
`AuditWriterActorKey` cluster-singleton proxy, mirroring `AdminOperationsClient`), a pure
`CertAuditEvents` factory, and audits at the `CertificateStoreManager` choke point. Item 2
adds an intentional-drop `Receive<…OpcUaProbeResult>(_ => {})` to three actors (the
`PeerProbeSupervisor.cs:69` idiom). No EF migration, no Commons/proto/interface change, no bUnit.
**Tech Stack:** .NET 10, Akka.NET (Akka.Hosting `ActorRegistry`), Blazor Server (AdminUI),
EF Core (`ConfigAuditLog` — table already exists), `ZB.MOM.WW.Audit` package
(`IAuditWriter`/`AuditEvent`), xUnit + Shouldly, Akka.TestKit.
**Design:** `docs/plans/2026-06-19-cert-audit-and-probe-deadletter-tidy-design.md`
**Branch:** `feat/cert-audit-and-probe-tidy` (off master `40e8a23e`; design committed `e8c75cc3`)
**Standing guardrails:** stage by explicit path (never `git add .`); never stage
`sql_login.txt`, `Host/pki/`, `docker-dev/docker-compose.yml`, `pending.md`, `current.md`,
`stillpending.md`. No `--no-verify`, no force-push. `dangerouslyDisableSandbox: true` for
build/test/rig. Finish = merge to master + push.
**Dependency graph:** `{T1 ∥ T3} → T2 → T4`. T1 (AdminUI cert-audit core) and T3 (Runtime
actor drops) touch disjoint projects ⇒ dispatch their implementers concurrently. T2 depends
on T1 (uses `CertAuditEvents` + `IAuditWriter`). T4 is the final gate.
---
### Task 0 (do this once, before T1/T2): Confirm the `AuditEvent` / `IAuditWriter` / `AuditOutcome` shape
Not a code task — a **read** the T1/T2 implementer MUST do first so the factory matches the
real package type (it lives in the external `ZB.MOM.WW.Audit` assembly, not this repo).
- Go-to-definition / decompile `ZB.MOM.WW.Audit.AuditEvent`, `IAuditWriter`,
and the `Outcome` enum type. Confirm: exact property names + whether it's a record with a
primary ctor or settable inits; the `Outcome` enum member names (the design assumes
`Success` / `Failure` — verify); the `CorrelationId` property type (Guid? / a CorrelationId
struct / nullable).
- Ground truth for how the existing code populates it:
`src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/Audit/AuditWriterActor.cs:97-107` reads
`EventId, OccurredAtUtc, Actor, Category, Action, SourceNode, DetailsJson, CorrelationId,
Outcome`. Mirror those names exactly.
- If any assumed field/name differs, adjust the T1 factory code below accordingly (the
shape is the contract; the literal property names must match the package).
---
### Task 1: Cert-audit core — `ActorAuditWriter` + `CertAuditEvents` factory + DI + unit tests
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 3
**Files:**
- Create: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Audit/ActorAuditWriter.cs`
- Create: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Audit/CertAuditEvents.cs`
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/EndpointRouteBuilderExtensions.cs` (register `IAuditWriter` near line 55)
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Audit/CertAuditEventsTests.cs`
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Audit/ActorAuditWriterTests.cs`
**Step 1 — Write `CertAuditEventsTests` (failing).** Cover trust/untrust/delete × success/failure:
- `Build("Trust", "trusted", thumb, "alice", success: true, error: null)``Category=="Certificate"`,
`Action=="Trust"`, `SourceNode==thumb`, `Actor=="alice"`, `Outcome` is the success member,
`DetailsJson` contains the store + thumbprint, `EventId != Guid.Empty`.
- A failure case (`success: false, error: "certificate not found in rejected"`) ⇒ `Outcome` is the
failure member and `DetailsJson` includes the error text.
Use Shouldly. (Assert `DetailsJson` by `ShouldContain` substrings, not exact JSON, to stay robust.)
**Step 2 — Write `CertAuditEvents`:**
```csharp
using System.Text.Json;
using ZB.MOM.WW.Audit; // confirm namespace in Task 0
namespace ZB.MOM.WW.OtOpcUa.AdminUI.Audit;
/// <summary>
/// Pure factory for the canonical <see cref="AuditEvent"/> emitted on a certificate-store
/// mutation (Trust / Untrust / Delete). Separated from <c>CertificateStoreManager</c> so the
/// event shape is unit-testable without a filesystem or an actor system.
/// </summary>
public static class CertAuditEvents
{
/// <summary>The audit category for all certificate-store actions.</summary>
public const string Category = "Certificate";
/// <summary>Builds the audit event for one certificate action and outcome.</summary>
/// <param name="action">The action verb: <c>Trust</c> / <c>Untrust</c> / <c>Delete</c>.</param>
/// <param name="store">The PKI sub-store acted on (e.g. <c>trusted</c> / <c>rejected</c>).</param>
/// <param name="thumbprint">The certificate thumbprint.</param>
/// <param name="actor">The authenticated principal performing the action.</param>
/// <param name="success">Whether the action succeeded.</param>
/// <param name="error">The failure message when <paramref name="success"/> is false; otherwise null.</param>
/// <returns>A canonical audit event ready for <see cref="IAuditWriter.WriteAsync"/>.</returns>
public static AuditEvent Build(
string action, string store, string thumbprint, string actor, bool success, string? error)
{
var details = JsonSerializer.Serialize(new
{
store,
thumbprint,
error = success ? null : error,
});
return new AuditEvent
{
EventId = Guid.NewGuid(),
OccurredAtUtc = DateTimeOffset.UtcNow,
Actor = actor,
Category = Category,
Action = action,
SourceNode = thumbprint,
DetailsJson = details,
// CorrelationId / Outcome property types per Task 0. Outcome member names assumed
// Success / Failure — confirm and adjust.
Outcome = success ? AuditOutcome.Success : AuditOutcome.Failure,
};
}
}
```
(Adjust property/enum names to the real package per Task 0. If `AuditEvent` uses a positional
record ctor instead of init-setters, switch to the ctor form.)
**Step 3 — Write `ActorAuditWriterTests` (failing).** Use Akka.TestKit: register a `TestProbe`
in an `ActorRegistry` under `AuditWriterActorKey`, construct `ActorAuditWriter(registry)`, call
`WriteAsync(evt)`, assert the probe `ExpectMsg<AuditEvent>()` equal to `evt`. (Mirror how other
AdminUI TestKit tests build an `ActorRegistry` + register a key — search the AdminUI.Tests project.)
**Step 4 — Write `ActorAuditWriter`** (copy the `AdminOperationsClient` resolve idiom):
```csharp
using Akka.Actor;
using Akka.Hosting;
using ZB.MOM.WW.Audit;
using ZB.MOM.WW.OtOpcUa.ControlPlane; // AuditWriterActorKey
namespace ZB.MOM.WW.OtOpcUa.AdminUI.Audit;
/// <summary>
/// <see cref="IAuditWriter"/> backed by the <c>AuditWriterActor</c> cluster singleton registered
/// in <c>AddOtOpcUaControlPlane</c>. Resolves the singleton proxy from <see cref="ActorRegistry"/>
/// at construction (mirroring <c>AdminOperationsClient</c>); <see cref="WriteAsync"/> is a
/// best-effort, non-blocking <c>Tell</c> — the actor owns batching, dedup, and flush.
/// </summary>
public sealed class ActorAuditWriter : IAuditWriter
{
private readonly IActorRef _proxy;
/// <summary>Initializes a new instance resolving the audit-writer singleton proxy.</summary>
/// <param name="registry">The actor registry to resolve the audit-writer singleton proxy.</param>
public ActorAuditWriter(ActorRegistry registry) => _proxy = registry.Get<AuditWriterActorKey>();
/// <inheritdoc />
public Task WriteAsync(AuditEvent evt, CancellationToken ct = default)
{
_proxy.Tell(evt);
return Task.CompletedTask;
}
}
```
(Confirm the exact `IAuditWriter.WriteAsync` signature in Task 0 and match it.)
**Step 5 — Register in DI.** In `EndpointRouteBuilderExtensions.cs`, next to
`services.AddSingleton<Certificates.CertificateStoreManager>();` (line ~55):
```csharp
services.AddSingleton<ZB.MOM.WW.Audit.IAuditWriter, Audit.ActorAuditWriter>();
```
(Confirm `ActorRegistry` is resolvable in this method's host — it is, Akka.Hosting registers it
as a singleton; `AdminOperationsClient` proves the pattern works in AdminUI.)
**Step 6 — Run tests + build:**
`dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests --filter "FullyQualifiedName~Audit"` → green.
**Step 7 — Commit** (stage by path):
`git add src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Audit/ src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/EndpointRouteBuilderExtensions.cs tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Audit/`
then `git commit -m "feat(adminui): IAuditWriter adapter + cert-action audit-event factory"`.
---
### Task 2: Wire audit into `CertificateStoreManager` + `Certificates.razor` + extend manager tests
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** none (depends on Task 1)
**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Certificates/CertificateStoreManager.cs`
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Certificates.razor:194-205`
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Certificates/CertificateStoreManagerTests.cs`
**Step 1 — Extend `CertificateStoreManagerTests` (failing).** Add a recording fake `IAuditWriter`:
```csharp
private sealed class RecordingAuditWriter : ZB.MOM.WW.Audit.IAuditWriter
{
public List<ZB.MOM.WW.Audit.AuditEvent> Events { get; } = new();
public Task WriteAsync(ZB.MOM.WW.Audit.AuditEvent evt, CancellationToken ct = default)
{ Events.Add(evt); return Task.CompletedTask; }
}
```
Change the test ctor to build `_audit = new RecordingAuditWriter();` and
`_sut = new CertificateStoreManager(_root, _audit);`. Add tests:
- `Trust_success_writes_one_success_audit_event` — seed a rejected cert, `Trust(thumb, "alice")`,
assert `_audit.Events` has one event with `Action=="Trust"`, success outcome, `SourceNode==thumb`.
- `Delete_not_found_writes_one_failure_audit_event``Delete("trusted", unknownThumb, "bob")`,
assert one event, failure outcome.
- Confirm the existing (no-audit-arg) assertions still pass via the defaulted test ctor.
**Step 2 — Modify `CertificateStoreManager`:**
- Add field `private readonly IAuditWriter _audit;`.
- Production ctor: `public CertificateStoreManager(IConfiguration config, IAuditWriter audit)`
set `_audit = audit;` (and keep the pki-root read).
- Internal test ctor: `internal CertificateStoreManager(string pkiRoot, IAuditWriter? audit = null)`
`_audit = audit ?? NullAuditWriter`. (Use a tiny private no-op `IAuditWriter` if the package
has no Null impl — confirm in Task 0.)
- Add the actor param to the three public methods and emit after computing the result:
```csharp
public CertActionResult Trust(string thumbprint, string actor)
{
var r = Move("rejected", "trusted", thumbprint);
Audit("Trust", "rejected→trusted", thumbprint, actor, r);
return r;
}
// Untrust(thumbprint, actor) → Move("trusted","rejected",…) + Audit("Untrust", "trusted→rejected", …)
// Delete(store, thumbprint, actor) → existing body + Audit("Delete", store, …, r) at each return
private void Audit(string action, string store, string thumbprint, string actor, CertActionResult r)
=> _audit.WriteAsync(CertAuditEvents.Build(action, store, thumbprint, actor, r.Success, r.Error));
```
Ensure **every** return path of `Delete` (invalid store, invalid thumbprint, not-found, success,
IO failure) is audited — simplest is to compute the `CertActionResult` into a local, `Audit(...)`,
then return it. Keep `_audit.WriteAsync(...)` fire-and-forget (returns a completed task).
**Step 3 — Modify `Certificates.razor` `ConfirmAction` (line 194):** capture the actor and thread it:
```csharp
var actor = authState.User.Identity?.Name ?? "system";
var result = p.Verb switch
{
"trust" => CertManager.Trust(p.Thumbprint, actor),
"untrust" => CertManager.Untrust(p.Thumbprint, actor),
"delete" => p.Kind switch
{
StoreKind.Trusted => CertManager.Delete("trusted", p.Thumbprint, actor),
StoreKind.Rejected => CertManager.Delete("rejected", p.Thumbprint, actor),
_ => CertActionResult.Fail($"cannot delete from {p.Kind}"),
},
_ => CertActionResult.Fail("unknown action"),
};
```
(`authState` is already in scope from the FleetAdmin re-check at line 185.)
**Step 4 — Test + build:**
`dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests --filter "FullyQualifiedName~CertificateStoreManager"` → green;
`dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI` → 0 errors (proves the razor compiles).
**Step 5 — Commit:**
`git add src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Certificates/CertificateStoreManager.cs src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Certificates.razor tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Certificates/CertificateStoreManagerTests.cs`
then `git commit -m "feat(adminui): audit cert Trust/Untrust/Delete to ConfigAuditLog"`.
---
### Task 3: `OpcUaProbeResult` dead-letter tidy — three actor drop handlers + guard test
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** Task 1
**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs` (handler region ~line 482+)
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs` (ctor handlers ~line 86)
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs` (ctor handlers ~line 172)
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/…` (new guard test; place beside the existing dead-letter-probe test — find it via `rg "AllDeadLetters" tests`)
**Step 1 — Write the guard test (failing without the fix).** For at least `DriverHostActor`
(the most-trafficked), subscribe a probe to `AllDeadLetters` on the EventStream, spin up the
actor with its redundancy-topic subscription, publish a `PeerOpcUaProbeActor.OpcUaProbeResult`
to the `RedundancyStateTopic`, and assert **no** `DeadLetter` wrapping an `OpcUaProbeResult`.
(Mirror the Phase-B `AllDeadLetters`-probe idiom already in Runtime.Tests.) A combined test
covering all three is fine if the harness makes that easy; otherwise the `DriverHostActor`
case is the required guard and the other two are covered by inspection + build.
**Step 2 — Add the drop to each actor.** Each already subscribes the redundancy-state topic in
`PreStart` and imports `Akka.Cluster.Tools.PublishSubscribe`. Add, next to the existing
`RedundancyStateChanged` / `SubscribeAck` handlers:
```csharp
// The redundancy-state topic also carries OpcUaProbeResult (published by OpcUaPublishActor's
// peer-probes). We don't consume it here — drop it so it doesn't dead-letter. Matches
// PeerProbeSupervisor / OpcUaPublishActor. (intentional no-op)
Receive<ZB.MOM.WW.OtOpcUa.Runtime.Health.PeerOpcUaProbeActor.OpcUaProbeResult>(_ => { });
```
- `HistorianAdapterActor`: add in the ctor near `Receive<SubscribeAck>(_ => { });` (~line 87).
- `ScriptedAlarmHostActor`: add in the ctor near the other `Receive<…>` registrations (~line 172).
- `DriverHostActor`: add in the steady-state behaviour where `Receive<…>` handlers are registered
(the `Receive<RedundancyStateChanged>` handler — confirm it exists in each behaviour that is
active while subscribed; add the drop wherever `RedundancyStateChanged` is handled so the probe
result is dropped in the same behaviour(s)).
**Note:** if `DriverHostActor` uses `Become`/multiple behaviours, add the drop to every behaviour
that handles `RedundancyStateChanged`, else the probe still dead-letters in the others.
**Step 3 — Test + build:**
`dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~ProbeResult"` → green;
`dotnet build src/Server/ZB.MOM.WW.OtOpcUa.Runtime` → 0 errors.
**Step 4 — Commit:**
`git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/`
then `git commit -m "fix(runtime): drop OpcUaProbeResult in redundancy-topic subscribers (no dead-letter)"`.
---
### Task 4: Reconcile + full build/test + live `/run` + finish
**Classification:** standard (verification + integration review + finish)
**Estimated implement time:** ~5 min + live run
**Parallelizable with:** none (depends on T1, T2, T3)
**Step 1 — Full build + solution test:**
`dotnet build ZB.MOM.WW.OtOpcUa.slnx` → 0 errors;
`dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests` → green.
**Step 2 — Final integration review** (code-reviewer subagent over `git diff master..HEAD`):
confirm no Commons/proto/EF change, audit emitted on all cert paths (incl. failures), the
`AuditEvent` shape matches the package, the drop handler is in every behaviour that subscribes
the topic, and tests genuinely fail without the fix.
**Step 3 — Live `/run` (agent-driven; docker-dev login is disabled).** Rebuild both central
nodes to the branch (`docker compose -f docker-dev/docker-compose.yml up -d --build central-1
central-2`); open the AdminUI Certificates page (`http://localhost:9200`), perform a Trust or
Untrust (or seed a rejected cert first), and confirm a new `ConfigAuditLog` row exists
(query the shared SQL `10.100.0.35,14330` ConfigDb: principal = the dev admin user,
EventType = `Certificate:Trust`, Outcome = success). Confirm server logs show **no**
`OpcUaProbeResult` dead-letter on either node. Capture the evidence.
**Step 4 — Finish.** `git checkout master && git merge --ff-only feat/cert-audit-and-probe-tidy`
(fast-forward; the branch is linear off master), then `git push`. Delete the feature branch.
Update `pending.md`/`stillpending.md` (on-disk, never-staged) to record both items shipped.