docs(plan): implementation plan + tasks for cert-audit + probe dead-letter tidy

This commit is contained in:
Joseph Doherty
2026-06-19 00:26:30 -04:00
parent e8c75cc38f
commit 21c881161a
2 changed files with 347 additions and 0 deletions
@@ -0,0 +1,336 @@
# Cert-action audit logging + `OpcUaProbeResult` dead-letter tidy — Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:subagent-driven-development to implement this plan task-by-task.
**Goal:** Record certificate Trust/Untrust/Delete actions to the `ConfigAuditLog` DB table
via the existing `IAuditWriter`/`AuditWriterActor` path (the first live structured audit
emit site), and silence the benign `OpcUaProbeResult` dead-letters in three Runtime actors.
**Architecture:** Item 1 adds an `ActorAuditWriter : IAuditWriter` AdminUI adapter (Tells the
`AuditWriterActorKey` cluster-singleton proxy, mirroring `AdminOperationsClient`), a pure
`CertAuditEvents` factory, and audits at the `CertificateStoreManager` choke point. Item 2
adds an intentional-drop `Receive<…OpcUaProbeResult>(_ => {})` to three actors (the
`PeerProbeSupervisor.cs:69` idiom). No EF migration, no Commons/proto/interface change, no bUnit.
**Tech Stack:** .NET 10, Akka.NET (Akka.Hosting `ActorRegistry`), Blazor Server (AdminUI),
EF Core (`ConfigAuditLog` — table already exists), `ZB.MOM.WW.Audit` package
(`IAuditWriter`/`AuditEvent`), xUnit + Shouldly, Akka.TestKit.
**Design:** `docs/plans/2026-06-19-cert-audit-and-probe-deadletter-tidy-design.md`
**Branch:** `feat/cert-audit-and-probe-tidy` (off master `40e8a23e`; design committed `e8c75cc3`)
**Standing guardrails:** stage by explicit path (never `git add .`); never stage
`sql_login.txt`, `Host/pki/`, `docker-dev/docker-compose.yml`, `pending.md`, `current.md`,
`stillpending.md`. No `--no-verify`, no force-push. `dangerouslyDisableSandbox: true` for
build/test/rig. Finish = merge to master + push.
**Dependency graph:** `{T1 ∥ T3} → T2 → T4`. T1 (AdminUI cert-audit core) and T3 (Runtime
actor drops) touch disjoint projects ⇒ dispatch their implementers concurrently. T2 depends
on T1 (uses `CertAuditEvents` + `IAuditWriter`). T4 is the final gate.
---
### Task 0 (do this once, before T1/T2): Confirm the `AuditEvent` / `IAuditWriter` / `AuditOutcome` shape
Not a code task — a **read** the T1/T2 implementer MUST do first so the factory matches the
real package type (it lives in the external `ZB.MOM.WW.Audit` assembly, not this repo).
- Go-to-definition / decompile `ZB.MOM.WW.Audit.AuditEvent`, `IAuditWriter`,
and the `Outcome` enum type. Confirm: exact property names + whether it's a record with a
primary ctor or settable inits; the `Outcome` enum member names (the design assumes
`Success` / `Failure` — verify); the `CorrelationId` property type (Guid? / a CorrelationId
struct / nullable).
- Ground truth for how the existing code populates it:
`src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/Audit/AuditWriterActor.cs:97-107` reads
`EventId, OccurredAtUtc, Actor, Category, Action, SourceNode, DetailsJson, CorrelationId,
Outcome`. Mirror those names exactly.
- If any assumed field/name differs, adjust the T1 factory code below accordingly (the
shape is the contract; the literal property names must match the package).
---
### Task 1: Cert-audit core — `ActorAuditWriter` + `CertAuditEvents` factory + DI + unit tests
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 3
**Files:**
- Create: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Audit/ActorAuditWriter.cs`
- Create: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Audit/CertAuditEvents.cs`
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/EndpointRouteBuilderExtensions.cs` (register `IAuditWriter` near line 55)
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Audit/CertAuditEventsTests.cs`
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Audit/ActorAuditWriterTests.cs`
**Step 1 — Write `CertAuditEventsTests` (failing).** Cover trust/untrust/delete × success/failure:
- `Build("Trust", "trusted", thumb, "alice", success: true, error: null)``Category=="Certificate"`,
`Action=="Trust"`, `SourceNode==thumb`, `Actor=="alice"`, `Outcome` is the success member,
`DetailsJson` contains the store + thumbprint, `EventId != Guid.Empty`.
- A failure case (`success: false, error: "certificate not found in rejected"`) ⇒ `Outcome` is the
failure member and `DetailsJson` includes the error text.
Use Shouldly. (Assert `DetailsJson` by `ShouldContain` substrings, not exact JSON, to stay robust.)
**Step 2 — Write `CertAuditEvents`:**
```csharp
using System.Text.Json;
using ZB.MOM.WW.Audit; // confirm namespace in Task 0
namespace ZB.MOM.WW.OtOpcUa.AdminUI.Audit;
/// <summary>
/// Pure factory for the canonical <see cref="AuditEvent"/> emitted on a certificate-store
/// mutation (Trust / Untrust / Delete). Separated from <c>CertificateStoreManager</c> so the
/// event shape is unit-testable without a filesystem or an actor system.
/// </summary>
public static class CertAuditEvents
{
/// <summary>The audit category for all certificate-store actions.</summary>
public const string Category = "Certificate";
/// <summary>Builds the audit event for one certificate action and outcome.</summary>
/// <param name="action">The action verb: <c>Trust</c> / <c>Untrust</c> / <c>Delete</c>.</param>
/// <param name="store">The PKI sub-store acted on (e.g. <c>trusted</c> / <c>rejected</c>).</param>
/// <param name="thumbprint">The certificate thumbprint.</param>
/// <param name="actor">The authenticated principal performing the action.</param>
/// <param name="success">Whether the action succeeded.</param>
/// <param name="error">The failure message when <paramref name="success"/> is false; otherwise null.</param>
/// <returns>A canonical audit event ready for <see cref="IAuditWriter.WriteAsync"/>.</returns>
public static AuditEvent Build(
string action, string store, string thumbprint, string actor, bool success, string? error)
{
var details = JsonSerializer.Serialize(new
{
store,
thumbprint,
error = success ? null : error,
});
return new AuditEvent
{
EventId = Guid.NewGuid(),
OccurredAtUtc = DateTimeOffset.UtcNow,
Actor = actor,
Category = Category,
Action = action,
SourceNode = thumbprint,
DetailsJson = details,
// CorrelationId / Outcome property types per Task 0. Outcome member names assumed
// Success / Failure — confirm and adjust.
Outcome = success ? AuditOutcome.Success : AuditOutcome.Failure,
};
}
}
```
(Adjust property/enum names to the real package per Task 0. If `AuditEvent` uses a positional
record ctor instead of init-setters, switch to the ctor form.)
**Step 3 — Write `ActorAuditWriterTests` (failing).** Use Akka.TestKit: register a `TestProbe`
in an `ActorRegistry` under `AuditWriterActorKey`, construct `ActorAuditWriter(registry)`, call
`WriteAsync(evt)`, assert the probe `ExpectMsg<AuditEvent>()` equal to `evt`. (Mirror how other
AdminUI TestKit tests build an `ActorRegistry` + register a key — search the AdminUI.Tests project.)
**Step 4 — Write `ActorAuditWriter`** (copy the `AdminOperationsClient` resolve idiom):
```csharp
using Akka.Actor;
using Akka.Hosting;
using ZB.MOM.WW.Audit;
using ZB.MOM.WW.OtOpcUa.ControlPlane; // AuditWriterActorKey
namespace ZB.MOM.WW.OtOpcUa.AdminUI.Audit;
/// <summary>
/// <see cref="IAuditWriter"/> backed by the <c>AuditWriterActor</c> cluster singleton registered
/// in <c>AddOtOpcUaControlPlane</c>. Resolves the singleton proxy from <see cref="ActorRegistry"/>
/// at construction (mirroring <c>AdminOperationsClient</c>); <see cref="WriteAsync"/> is a
/// best-effort, non-blocking <c>Tell</c> — the actor owns batching, dedup, and flush.
/// </summary>
public sealed class ActorAuditWriter : IAuditWriter
{
private readonly IActorRef _proxy;
/// <summary>Initializes a new instance resolving the audit-writer singleton proxy.</summary>
/// <param name="registry">The actor registry to resolve the audit-writer singleton proxy.</param>
public ActorAuditWriter(ActorRegistry registry) => _proxy = registry.Get<AuditWriterActorKey>();
/// <inheritdoc />
public Task WriteAsync(AuditEvent evt, CancellationToken ct = default)
{
_proxy.Tell(evt);
return Task.CompletedTask;
}
}
```
(Confirm the exact `IAuditWriter.WriteAsync` signature in Task 0 and match it.)
**Step 5 — Register in DI.** In `EndpointRouteBuilderExtensions.cs`, next to
`services.AddSingleton<Certificates.CertificateStoreManager>();` (line ~55):
```csharp
services.AddSingleton<ZB.MOM.WW.Audit.IAuditWriter, Audit.ActorAuditWriter>();
```
(Confirm `ActorRegistry` is resolvable in this method's host — it is, Akka.Hosting registers it
as a singleton; `AdminOperationsClient` proves the pattern works in AdminUI.)
**Step 6 — Run tests + build:**
`dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests --filter "FullyQualifiedName~Audit"` → green.
**Step 7 — Commit** (stage by path):
`git add src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Audit/ src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/EndpointRouteBuilderExtensions.cs tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Audit/`
then `git commit -m "feat(adminui): IAuditWriter adapter + cert-action audit-event factory"`.
---
### Task 2: Wire audit into `CertificateStoreManager` + `Certificates.razor` + extend manager tests
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** none (depends on Task 1)
**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Certificates/CertificateStoreManager.cs`
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Certificates.razor:194-205`
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Certificates/CertificateStoreManagerTests.cs`
**Step 1 — Extend `CertificateStoreManagerTests` (failing).** Add a recording fake `IAuditWriter`:
```csharp
private sealed class RecordingAuditWriter : ZB.MOM.WW.Audit.IAuditWriter
{
public List<ZB.MOM.WW.Audit.AuditEvent> Events { get; } = new();
public Task WriteAsync(ZB.MOM.WW.Audit.AuditEvent evt, CancellationToken ct = default)
{ Events.Add(evt); return Task.CompletedTask; }
}
```
Change the test ctor to build `_audit = new RecordingAuditWriter();` and
`_sut = new CertificateStoreManager(_root, _audit);`. Add tests:
- `Trust_success_writes_one_success_audit_event` — seed a rejected cert, `Trust(thumb, "alice")`,
assert `_audit.Events` has one event with `Action=="Trust"`, success outcome, `SourceNode==thumb`.
- `Delete_not_found_writes_one_failure_audit_event``Delete("trusted", unknownThumb, "bob")`,
assert one event, failure outcome.
- Confirm the existing (no-audit-arg) assertions still pass via the defaulted test ctor.
**Step 2 — Modify `CertificateStoreManager`:**
- Add field `private readonly IAuditWriter _audit;`.
- Production ctor: `public CertificateStoreManager(IConfiguration config, IAuditWriter audit)`
set `_audit = audit;` (and keep the pki-root read).
- Internal test ctor: `internal CertificateStoreManager(string pkiRoot, IAuditWriter? audit = null)`
`_audit = audit ?? NullAuditWriter`. (Use a tiny private no-op `IAuditWriter` if the package
has no Null impl — confirm in Task 0.)
- Add the actor param to the three public methods and emit after computing the result:
```csharp
public CertActionResult Trust(string thumbprint, string actor)
{
var r = Move("rejected", "trusted", thumbprint);
Audit("Trust", "rejected→trusted", thumbprint, actor, r);
return r;
}
// Untrust(thumbprint, actor) → Move("trusted","rejected",…) + Audit("Untrust", "trusted→rejected", …)
// Delete(store, thumbprint, actor) → existing body + Audit("Delete", store, …, r) at each return
private void Audit(string action, string store, string thumbprint, string actor, CertActionResult r)
=> _audit.WriteAsync(CertAuditEvents.Build(action, store, thumbprint, actor, r.Success, r.Error));
```
Ensure **every** return path of `Delete` (invalid store, invalid thumbprint, not-found, success,
IO failure) is audited — simplest is to compute the `CertActionResult` into a local, `Audit(...)`,
then return it. Keep `_audit.WriteAsync(...)` fire-and-forget (returns a completed task).
**Step 3 — Modify `Certificates.razor` `ConfirmAction` (line 194):** capture the actor and thread it:
```csharp
var actor = authState.User.Identity?.Name ?? "system";
var result = p.Verb switch
{
"trust" => CertManager.Trust(p.Thumbprint, actor),
"untrust" => CertManager.Untrust(p.Thumbprint, actor),
"delete" => p.Kind switch
{
StoreKind.Trusted => CertManager.Delete("trusted", p.Thumbprint, actor),
StoreKind.Rejected => CertManager.Delete("rejected", p.Thumbprint, actor),
_ => CertActionResult.Fail($"cannot delete from {p.Kind}"),
},
_ => CertActionResult.Fail("unknown action"),
};
```
(`authState` is already in scope from the FleetAdmin re-check at line 185.)
**Step 4 — Test + build:**
`dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests --filter "FullyQualifiedName~CertificateStoreManager"` → green;
`dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI` → 0 errors (proves the razor compiles).
**Step 5 — Commit:**
`git add src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Certificates/CertificateStoreManager.cs src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Certificates.razor tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests/Certificates/CertificateStoreManagerTests.cs`
then `git commit -m "feat(adminui): audit cert Trust/Untrust/Delete to ConfigAuditLog"`.
---
### Task 3: `OpcUaProbeResult` dead-letter tidy — three actor drop handlers + guard test
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** Task 1
**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs` (handler region ~line 482+)
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs` (ctor handlers ~line 86)
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs` (ctor handlers ~line 172)
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/…` (new guard test; place beside the existing dead-letter-probe test — find it via `rg "AllDeadLetters" tests`)
**Step 1 — Write the guard test (failing without the fix).** For at least `DriverHostActor`
(the most-trafficked), subscribe a probe to `AllDeadLetters` on the EventStream, spin up the
actor with its redundancy-topic subscription, publish a `PeerOpcUaProbeActor.OpcUaProbeResult`
to the `RedundancyStateTopic`, and assert **no** `DeadLetter` wrapping an `OpcUaProbeResult`.
(Mirror the Phase-B `AllDeadLetters`-probe idiom already in Runtime.Tests.) A combined test
covering all three is fine if the harness makes that easy; otherwise the `DriverHostActor`
case is the required guard and the other two are covered by inspection + build.
**Step 2 — Add the drop to each actor.** Each already subscribes the redundancy-state topic in
`PreStart` and imports `Akka.Cluster.Tools.PublishSubscribe`. Add, next to the existing
`RedundancyStateChanged` / `SubscribeAck` handlers:
```csharp
// The redundancy-state topic also carries OpcUaProbeResult (published by OpcUaPublishActor's
// peer-probes). We don't consume it here — drop it so it doesn't dead-letter. Matches
// PeerProbeSupervisor / OpcUaPublishActor. (intentional no-op)
Receive<ZB.MOM.WW.OtOpcUa.Runtime.Health.PeerOpcUaProbeActor.OpcUaProbeResult>(_ => { });
```
- `HistorianAdapterActor`: add in the ctor near `Receive<SubscribeAck>(_ => { });` (~line 87).
- `ScriptedAlarmHostActor`: add in the ctor near the other `Receive<…>` registrations (~line 172).
- `DriverHostActor`: add in the steady-state behaviour where `Receive<…>` handlers are registered
(the `Receive<RedundancyStateChanged>` handler — confirm it exists in each behaviour that is
active while subscribed; add the drop wherever `RedundancyStateChanged` is handled so the probe
result is dropped in the same behaviour(s)).
**Note:** if `DriverHostActor` uses `Become`/multiple behaviours, add the drop to every behaviour
that handles `RedundancyStateChanged`, else the probe still dead-letters in the others.
**Step 3 — Test + build:**
`dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~ProbeResult"` → green;
`dotnet build src/Server/ZB.MOM.WW.OtOpcUa.Runtime` → 0 errors.
**Step 4 — Commit:**
`git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/`
then `git commit -m "fix(runtime): drop OpcUaProbeResult in redundancy-topic subscribers (no dead-letter)"`.
---
### Task 4: Reconcile + full build/test + live `/run` + finish
**Classification:** standard (verification + integration review + finish)
**Estimated implement time:** ~5 min + live run
**Parallelizable with:** none (depends on T1, T2, T3)
**Step 1 — Full build + solution test:**
`dotnet build ZB.MOM.WW.OtOpcUa.slnx` → 0 errors;
`dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.AdminUI.Tests tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests` → green.
**Step 2 — Final integration review** (code-reviewer subagent over `git diff master..HEAD`):
confirm no Commons/proto/EF change, audit emitted on all cert paths (incl. failures), the
`AuditEvent` shape matches the package, the drop handler is in every behaviour that subscribes
the topic, and tests genuinely fail without the fix.
**Step 3 — Live `/run` (agent-driven; docker-dev login is disabled).** Rebuild both central
nodes to the branch (`docker compose -f docker-dev/docker-compose.yml up -d --build central-1
central-2`); open the AdminUI Certificates page (`http://localhost:9200`), perform a Trust or
Untrust (or seed a rejected cert first), and confirm a new `ConfigAuditLog` row exists
(query the shared SQL `10.100.0.35,14330` ConfigDb: principal = the dev admin user,
EventType = `Certificate:Trust`, Outcome = success). Confirm server logs show **no**
`OpcUaProbeResult` dead-letter on either node. Capture the evidence.
**Step 4 — Finish.** `git checkout master && git merge --ff-only feat/cert-audit-and-probe-tidy`
(fast-forward; the branch is linear off master), then `git push`. Delete the feature branch.
Update `pending.md`/`stillpending.md` (on-disk, never-staged) to record both items shipped.
@@ -0,0 +1,11 @@
{
"planPath": "docs/plans/2026-06-19-cert-audit-and-probe-deadletter-tidy.md",
"executionState": "IN_PROGRESS",
"tasks": [
{"id": 1, "subject": "Task 1: Cert-audit core — ActorAuditWriter + CertAuditEvents factory + DI + unit tests", "classification": "standard", "status": "pending", "parallelizableWith": [3]},
{"id": 2, "subject": "Task 2: Wire audit into CertificateStoreManager + Certificates.razor + extend manager tests", "classification": "standard", "status": "pending", "blockedBy": [1]},
{"id": 3, "subject": "Task 3: OpcUaProbeResult dead-letter tidy — three actor drop handlers + guard test", "classification": "small", "status": "pending", "parallelizableWith": [1]},
{"id": 4, "subject": "Task 4: Reconcile + full build/test + live /run + finish (merge + push)", "classification": "standard", "status": "pending", "blockedBy": [1, 2, 3]}
],
"lastUpdated": "2026-06-19"
}