plan(phase2): Task 2.0 gate DONE — verified plan specs materially off (MxGw store moved to lib, OtOpcUa path dormant, SB rename structurally impossible); user chose DEEP adopt + pause; corrected deep design in -phase2-deep.md; PAUSED for review

This commit is contained in:
Joseph Doherty
2026-06-02 09:13:09 -04:00
parent 95681ac0b2
commit abe06a2163
3 changed files with 194 additions and 1 deletions
@@ -0,0 +1,183 @@
# Phase 2 (Audit adoption) — Task 2.0 gate findings + DEEP re-scope (for review)
Companion to `2026-06-02-auth-audit-normalization.md`. Produced by the **Task 2.0 read-only
verification gate** (3 parallel explorers, all paths verified 2026-06-02 against live code on each
repo's `feat/adopt-zb-auth` HEAD). **Status: PAUSED for user review before any audit code is written.**
**Decisions taken (2026-06-02):**
- **Depth = DEEP adopt (canonical record).** Each app's audit record becomes the library's 9-field
`ZB.MOM.WW.Audit.AuditEvent`; domain-specific fields relocate into `DetailsJson`; each app consumes
the library's `IAuditWriter`/`IAuditRedactor`/`AuditOutcome` types. (User chose this over the
gate-recommended lighter "Align" — consistent with the standing maximal/full-adopt directive.)
- **Cadence = re-scope + PAUSE for review.** This doc is the review artifact; implementation does not
start until the user signs off (especially on the ScadaBridge cost, below).
> **Why a re-scope was needed:** the plan's Phase 2 task specs were written from optimistic
> `components/audit/current-state/*` docs (see [[component-status-claims-are-optimistic]]). The gate
> found all three repos' specs are materially off — file refs moved (MxGateway), the target path is
> dormant (OtOpcUa), and the "outright rename" is structurally impossible (ScadaBridge).
---
## The canonical contract (shared `ZB.MOM.WW.Audit` 0.1.0)
`AuditEvent` (sealed record): REQUIRED `EventId:Guid`, `OccurredAtUtc:DateTimeOffset` (UTC-normalized
on set), `Actor:string`, `Action:string`, `Outcome:AuditOutcome`; OPTIONAL `Category:string?`,
`Target:string?`, `SourceNode:string?`, `CorrelationId:Guid?`, `DetailsJson:string?`. **Nine fields.**
`AuditOutcome { Success, Failure, Denied }`. `IAuditWriter.WriteAsync(AuditEvent, CancellationToken)`
best-effort, never throws. `IAuditRedactor.Apply(AuditEvent) -> AuditEvent` — pure, never throws.
The package is pinned (central PM / explicit) + feed-mapped in all three repos; **referenced by none yet.**
---
## OtOpcUa — DEEP (Tasks 2.1 + 2.2) · risk: LOWMEDIUM
**Verified current state:** Commons `AuditEvent` is an **8-field positional record**
`(Guid EventId, string Category, string Action, string Actor, DateTime OccurredAtUtc, string? DetailsJson,
NodeId SourceNode, CorrelationId CorrelationId)` — where `NodeId`/`CorrelationId` are `readonly record
struct` newtypes over `string`/`Guid`. It is an **Akka message** delivered via `DistributedPubSub`
(`provider=cluster`) with **default (reflection) serialization** — no custom serializer. **The structured
actor path is DORMANT: zero production emit sites** construct/`Tell` an `AuditEvent` today (only the tests
do); all live audit goes through the bespoke **stored-procedure path** (`sp_NodeApplied`/`sp_PublishGeneration`/
`sp_ValidateDraft`/`sp_RollbackToGeneration` INSERT directly with `ClusterId`/`GenerationId`, NULL `EventId`).
`AuditWriterActor` (`ControlPlane/Audit/AuditWriterActor.cs`): 500/5s batching, two-layer dedup (in-buffer
`Dictionary<Guid,AuditEvent>` + DB filtered-unique `UX_ConfigAuditLog_EventId`), mapping at `:75-84`.
`ConfigAuditLog` (10 cols, no `Outcome`; `ISJSON` CHECK on `DetailsJson`). `ClusterAudit.razor:78` filters
`a.ClusterId == ClusterId`, but the actor sets `NodeId` not `ClusterId`, so structured rows are invisible.
Package pinned `0.1.0` in `Directory.Packages.props`, feed-mapped, unreferenced.
**Deep design — this is the easy one (the record is already ~canonical):**
- **2.1 (high-risk: actor + contract):** Delete Commons `AuditEvent.cs`; reference `ZB.MOM.WW.Audit.AuditEvent`
from `ZB.MOM.WW.OtOpcUa.Commons` + `…ControlPlane`. Field map: `EventId``EventId`; `OccurredAtUtc`
`DateTime``DateTimeOffset` (widen at construction); `Actor`/`Action`/`Category`/`DetailsJson` direct;
`SourceNode` (unwrap `NodeId.Value``string?`); `CorrelationId` (unwrap `.Value``Guid?`); `Target` unused
(null) — OtOpcUa has no extra domain fields to push into `DetailsJson`, so **no field relocation**. Add the
NEW required `Outcome` (derive: `OpcUaAccessDenied`/`CrossClusterNamespaceAttempt``Denied`; config verbs→
`Success`; no `Failure` in OtOpcUa's vocabulary). `AuditWriterActor : IAuditWriter` (`WriteAsync` wraps the
fire-and-forget `Tell`, returns `Task.CompletedTask` — trivially best-effort). Keep batching/dedup. Mapping
at `:75-84` becomes `NodeId = evt.SourceNode`, `CorrelationId = evt.CorrelationId`, `Outcome = evt.Outcome`,
`EventType = $"{evt.Category}:{evt.Action}"` (storage keeps the composite). Value-type unwrap happens at the
(test + future) construction sites. **Akka wire note:** the message type changes shape → a rolling-deploy
wire break IN PRINCIPLE, but **moot** (no live emit traffic). Flag in the commit; no dual-accept window needed.
- **2.2 (high-risk: EF migration + UI query):** add nullable `Outcome` to `ConfigAuditLog` (+ DbContext mapping
`:429-463`) + EF migration `AddConfigAuditLogOutcome` (chains after `20260602112419_CanonicalizeAdminRoles`).
Fix `ClusterAudit.razor:78` so `ClusterId == null && NodeId` resolves to the cluster (OR-predicate joining
`ClusterNodes`, or populate `ClusterId` at flush). SP path stays bespoke (documented).
- **Package refs:** `…Commons` (record + `AuditOutcome`), `…ControlPlane` (`IAuditWriter`), `…Configuration`
(only if `Outcome` is stored as the enum type; otherwise store `string?`/`int?` and skip).
- **Effort:** ~record swap 5m + actor seam 5m + Outcome derivation 5m (2.1); column+migration+query 5m (2.2).
---
## MxGateway — DEEP (Task 2.3, re-scoped) · risk: MEDIUMHIGH (was "standard")
**Verified current state — the plan's file refs are STALE:** Phase 1 (Task 1.3) **moved**
`IApiKeyAuditStore` + `ApiKeyAuditEntry` + `SqliteApiKeyAuditStore` **into the shared library**
(`ZB.MOM.WW.Auth.Abstractions`/`…ApiKeys` 0.1.2) — they no longer exist in MxGateway. `ApiKeyAuditEntry` =
**5 fields** `(string? KeyId, string EventType, string? RemoteAddress, DateTimeOffset CreatedUtc, string? Details)`,
persisted to the SQLite `api_key_audit` table (5 cols). `IApiKeyAuditStore` = `AppendAsync` + `ListRecentAsync`
(the dashboard "recent audit" view reads via `ListRecentAsync`). **Three producers, but one is library-internal:**
- `ApiKeyAdminCommands` (**library-internal**, in `ZB.MOM.WW.Auth.ApiKeys`) — emits CLI/admin verbs
(`init-db`/`create-key`/`revoke-key`/`rotate-key`/`delete-key`/`set-scopes`/`enable-key`/`disable-key`),
keyless for `init-db`, `RemoteAddress` null on the CLI path. **MxGateway cannot edit these call sites.**
- `DashboardApiKeyManagementService` (MxGateway-local) — `dashboard-*` verbs, real `KeyId` + `RemoteAddress`.
- `ConstraintEnforcer.RecordDenialAsync` (MxGateway-local) — single `constraint-denied` EventType, `RemoteAddress`
hardcoded null, `Details = "{commandKind}: {target}: {ConstraintName}: {Message}"`.
`AppendAsync` currently **propagates** exceptions (no best-effort wrap). Serilog migration **landed** (no blocker).
`ZB.MOM.WW.Audit` unreferenced; `nuget.config` already maps the package.
**Deep design — the library-internal CLI producer forces an adapter:**
- Add `<PackageReference Include="ZB.MOM.WW.Audit" />` to `…Server`.
- New **MxGateway-owned canonical store** `audit_event` (SQLite, 9 canonical columns + `details_json`) with its own
migrator — the existing `api_key_audit` lives in the **library-owned** auth DB schema, so we do NOT alter that
schema. Implement `IAuditWriter` over the new store (best-effort try/catch — fixes the no-wrap gap).
- **Adapter for the library-internal CLI events:** register a MxGateway `IApiKeyAuditStore` impl whose
`AppendAsync(ApiKeyAuditEntry)` maps → canonical `AuditEvent` (`EventId=NewGuid`; `KeyId``Actor` with
`"cli"`/`"system"` fallback; `EventType``Action`; `CreatedUtc``OccurredAtUtc`; `RemoteAddress``SourceNode`;
`Outcome=Success`; `Category="ApiKey"`; `Target=KeyId`; `Details``DetailsJson` wrapped `{"detail":"…"}`) and
forwards to `IAuditWriter`. Its `ListRecentAsync` reads the canonical store and maps back to `ApiKeyAuditEntry`
(so the existing dashboard recent-audit view keeps working) **or** the dashboard view is repointed to canonical.
- **Local producers** (`DashboardApiKeyManagementService`, `ConstraintEnforcer`) rewritten to build canonical
`AuditEvent`s directly via `IAuditWriter` (`constraint-denied``Outcome.Denied`; capture `CorrelationId` from
`MxCommandRequest.ClientCorrelationId` (constraint path — needs threading down) / `HttpContext.TraceIdentifier`
(dashboard); structured `Target` from `commandKind`/`target` (GAPS #6)).
- **Open question for review:** retire `api_key_audit` (canonical store becomes the sole audit table) vs keep it
coexisting. Retiring is cleaner-deep but touches the library's store wiring; coexisting is lower-risk.
- **Effort/classification:** re-scoped from "standard ~5m" to **high-risk** (new store + migrator + adapter +
producer rewrites + dashboard read path + DI + tests). Realistically 23 sub-commits.
---
## ScadaBridge — DEEP (Task 2.5, re-scoped) · risk: **VERY HIGH — audit-subsystem re-architecture**
**This is the one to scrutinize at review.** The gate definitively answered the plan's central claim is FALSE.
**Verified current state:** ScadaBridge's `AuditEvent` (`…Commons/Entities/Audit/AuditEvent.cs`) is a
**24-field** record — `EventId, OccurredAtUtc(DateTime), IngestedAtUtc, Channel(AuditChannel), Kind(AuditKind),
CorrelationId, ExecutionId, ParentExecutionId, SourceSiteId, SourceNode, SourceInstanceId, SourceScript, Actor,
Target, Status(AuditStatus), HttpStatus, DurationMs, ErrorMessage, ErrorDetail, RequestSummary, ResponseSummary,
PayloadTruncated, Extra, ForwardState(AuditForwardState?)`. It is the **storage shape of a partitioned SQL Server
audit table** with these as **queryable columns**. `IAuditPayloadFilter.Apply(ScadaBridgeAuditEvent) ->
ScadaBridgeAuditEvent` (NOT the library's record — a reflection contract test `PayloadFilterContractTests` pins
the typing). `IAuditWriter`/`ICentralAuditWriter` are likewise typed to the 24-field record. **`AuditStatus`
drives the site→central forwarding STATE MACHINE** (`Pending→Submitted→Forwarded→Reconciled`;
`Delivered`/`Failed`/`Parked`/`Discarded`) and the **filter's error-cap logic** (`IsErrorStatus`). The Central
reporting/UI queries by `Channel`/`Kind`/`Status`/`Site`. **Phase 1 did NOT touch any audit-pipeline file** (zero
drift). Blast radius of just the interface rename: ~10 files / ~20 sites; the contract test pins it.
**What DEEP adoption concretely requires here (full honesty):**
Replacing the 24-field record with the 9-field canonical + pushing ~15 domain fields into `DetailsJson` means
**re-architecting the entire audit subsystem**, because those fields are not decorative — they are load-bearing:
1. **Storage:** migrate the partitioned SQL Server audit table from ~24 typed columns to the 9 canonical columns
+ a JSON `DetailsJson` column. Massive, lossy-on-queryability data migration; partitioning scheme likely must
change; `IngestedAtUtc`/`ForwardState` are operational columns the forwarder UPDATEs.
3. **Forwarding state machine breaks:** `Status`/`ForwardState` move into opaque JSON — you cannot `UPDATE` a
JSON-embedded field as a column, and the reconciliation queries `WHERE Status/ForwardState = …` stop working.
The site→central forwarder would have to be redesigned (e.g., promote Status back out of JSON, defeating the
point).
4. **Redactor breaks:** `DefaultAuditPayloadFilter` reads `Channel`/`Status`/`RequestSummary`/`ResponseSummary`/
`ErrorDetail`/`Extra`/`PayloadTruncated` to choose truncation caps — on a 9-field canonical record those are
gone (opaque in `DetailsJson`), so the filter must be rewritten to parse JSON.
5. **Reporting/UI breaks:** Central audit-log queries/filters by Channel/Kind/Status/Site lose SQL queryability.
6. ~Dozens of call sites + the contract test + the perf hot-path test.
**Honest assessment:** ScadaBridge DEEP ≈ the **largest single undertaking in the whole program** (bigger than the
Phase-1 ApiKeys re-arch). The audit component's own GAPS doc says *"Align, don't replace"* for exactly this reason.
**Bounded alternative to weigh at review (recommended if "deep" is to be kept tractable):** make the canonical
`ZB.MOM.WW.Audit.AuditEvent` the **seam/transport + cross-project reporting** shape (the redactor and an
`IAuditWriter` operate on the canonical record; domain richness rides in `DetailsJson`), while the **SQL storage
keeps its typed queryable columns** populated by a storage-side projection (canonical+DetailsJson → columns) and
the forwarding state machine continues to key on the `Status`/`ForwardState` columns. This delivers "deep" at the
seam/record level (library types consumed; domain fields in `DetailsJson` for the canonical view) **without**
gutting the partitioned store, the state machine, the filter, or the reporting — a far safer "deep."
---
## Cross-cutting
- **Branch model:** `feat/adopt-zb-audit` per app, **stacked on `feat/adopt-zb-auth` HEAD** (Phase 3 wires the
audit `Actor` from the Phase-1 Auth principal, so audit must build on auth). Local-only, never pushed.
- **No library change / republish** needed for the chosen designs (MxGateway adapts in-repo) — so no Gitea token
required unless the user later wants the canonical mapping pushed into a shared lib.
- **Phase 3 (unchanged in intent):** `IAuditActorAccessor` seam + wire `AuditEvent.Actor` from the Auth principal
at every authenticated emit site; keep `"system"`/`"cli"` fallbacks for keyless paths.
## Re-scoped task list (for review)
| # | Repo | Re-scoped scope | Class | Risk |
|---|---|---|---|---|
| 2.1 | OtOpcUa | Commons record → canonical `AuditEvent`; `AuditWriterActor : IAuditWriter`; `Outcome` derivation; Akka-wire note (dormant) | high-risk | LowMed |
| 2.2 | OtOpcUa | `ConfigAuditLog.Outcome` column + EF migration + `ClusterAudit` visibility fix; SP path bespoke | high-risk | LowMed |
| 2.3 | MxGateway | new canonical SQLite `audit_event` store + migrator; `IAuditWriter`; `IApiKeyAuditStore`→canonical adapter (for library-internal CLI events) incl. `ListRecentAsync`; rewrite local producers; CorrelationId/Target capture; DI; tests | **high-risk** (↑ from standard) | MedHigh |
| 2.5 | ScadaBridge | **DEEP = audit-subsystem re-arch** (24-field→9-field record everywhere; domain fields→`DetailsJson`; SQL partitioned-table migration; forwarding state machine + filter + reporting rewrite; contract/perf tests) — **OR** the bounded "deep-at-the-seam" alternative above | **very-high-risk** | **VERY HIGH** |
## Open items to confirm at review
1. **ScadaBridge:** full audit re-architecture (pure 9-col storage) vs the **bounded "deep-at-the-seam"** variant
(canonical record at the seam/reporting boundary; keep typed storage columns + state machine). Strongly
recommend the bounded variant.
2. **MxGateway:** retire `api_key_audit` (canonical store is sole) vs keep it coexisting.
3. **OtOpcUa:** confirm leaving the SP path bespoke (structured path is dormant; canonicalization is forward-looking
prep) is acceptable, and the `ClusterAudit` fix approach (OR-predicate vs populate `ClusterId`).
4. **Sequencing:** OtOpcUa (2.1→2.2) and MxGateway (2.3) are independent + tractable; ScadaBridge (2.5) is the
gating risk — do it last, and as staged reviewed sub-commits regardless of variant.
@@ -239,6 +239,16 @@ Order within the phase (per `components/auth/GAPS.md` sequencing): **#3 seam →
## PHASE 2 — Audit adoption (audit GAPS #1#3, #5, #6)
> ⚠️ **RE-SCOPED 2026-06-02 — the task specs below are SUPERSEDED.** The Task 2.0 gate (verified against
> live code) found these specs materially wrong: MxGateway's audit files moved into the shared library
> (Phase 1), OtOpcUa's structured audit path is dormant (zero emit sites), and the ScadaBridge
> "outright rename" is structurally impossible (its filter is typed to its own 24-field record, not the
> library's 9-field one). The user chose **DEEP adopt (canonical record)** + **pause for review**. The
> corrected, gate-grounded deep design is in
> [`2026-06-02-auth-audit-normalization-phase2-deep.md`](2026-06-02-auth-audit-normalization-phase2-deep.md)
> — **implementation is PAUSED pending user review of that doc (esp. the ScadaBridge audit-subsystem
> re-architecture cost).** The original specs below are kept for historical context only.
Branch `feat/adopt-zb-audit` per repo. Behaviour-preserving except the OtOpcUa `Outcome` column + `ClusterId` visibility fix. Concrete paths below come from `components/audit/current-state/*`.
### Task 2.0: Explore audit source + confirm elaboration *(GATE — light, paths already known)*
@@ -23,7 +23,7 @@
{"id": 23, "subject": "Task 1.6: Unify dev base DN (#6) — DONE all 3 to dc=zb,dc=local (OtOpcUa 8ba289f, MxGw 9572045, SB 6ae6051)", "status": "completed", "blockedBy": [17]},
{"id": 24, "subject": "Task 1.7: Canonical roles native expansion (#8) [high-risk] — DONE all 3, full-value canonical (MxGw 04bce3ff, OtOpcUa c1619d9 +DB-mig, SB b104760+4118452 +DB-mig +SoD collapse)", "status": "completed", "blockedBy": [18]},
{"id": 25, "subject": "Task 2.0: GATE confirm audit source refs", "status": "pending", "blockedBy": [8]},
{"id": 25, "subject": "Task 2.0: GATE confirm audit source refs — DONE; found plan specs materially off → DEEP re-scope in -phase2-deep.md; PAUSED for user review before 2.1/2.2/2.3/2.5", "status": "completed", "blockedBy": [8]},
{"id": 26, "subject": "Task 2.1: OtOpcUa canonical record + IAuditWriter + Outcome (#1) [high-risk]", "status": "pending", "blockedBy": [25]},
{"id": 27, "subject": "Task 2.2: OtOpcUa Outcome migration + ClusterId fix (#1,#5) [high-risk]", "status": "pending", "blockedBy": [26]},
{"id": 28, "subject": "Task 2.3: MxGateway store→IAuditWriter adapter (#2,#6)", "status": "pending", "blockedBy": [25]},