Files
scadaproj/docs/plans/2026-06-02-auth-audit-normalization-phase2-deep.md
T

209 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Phase 2 (Audit adoption) — Task 2.0 gate findings + DEEP re-scope (for review)
Companion to `2026-06-02-auth-audit-normalization.md`. Produced by the **Task 2.0 read-only
verification gate** (3 parallel explorers, all paths verified 2026-06-02 against live code on each
repo's `feat/adopt-zb-auth` HEAD). **Status: PAUSED for user review before any audit code is written.**
**Decisions taken (2026-06-02):**
- **Depth = DEEP adopt (canonical record).** Each app's audit record becomes the library's 9-field
`ZB.MOM.WW.Audit.AuditEvent`; domain-specific fields relocate into `DetailsJson`; each app consumes
the library's `IAuditWriter`/`IAuditRedactor`/`AuditOutcome` types. (User chose this over the
gate-recommended lighter "Align" — consistent with the standing maximal/full-adopt directive.)
- **Cadence = re-scope + PAUSE for review.** This doc is the review artifact; implementation does not
start until the user signs off (especially on the ScadaBridge cost, below).
> **Why a re-scope was needed:** the plan's Phase 2 task specs were written from optimistic
> `components/audit/current-state/*` docs (see [[component-status-claims-are-optimistic]]). The gate
> found all three repos' specs are materially off — file refs moved (MxGateway), the target path is
> dormant (OtOpcUa), and the "outright rename" is structurally impossible (ScadaBridge).
---
## The canonical contract (shared `ZB.MOM.WW.Audit` 0.1.0)
`AuditEvent` (sealed record): REQUIRED `EventId:Guid`, `OccurredAtUtc:DateTimeOffset` (UTC-normalized
on set), `Actor:string`, `Action:string`, `Outcome:AuditOutcome`; OPTIONAL `Category:string?`,
`Target:string?`, `SourceNode:string?`, `CorrelationId:Guid?`, `DetailsJson:string?`. **Nine fields.**
`AuditOutcome { Success, Failure, Denied }`. `IAuditWriter.WriteAsync(AuditEvent, CancellationToken)`
best-effort, never throws. `IAuditRedactor.Apply(AuditEvent) -> AuditEvent` — pure, never throws.
The package is pinned (central PM / explicit) + feed-mapped in all three repos; **referenced by none yet.**
---
## OtOpcUa — DEEP (Tasks 2.1 + 2.2) · risk: LOWMEDIUM
**Verified current state:** Commons `AuditEvent` is an **8-field positional record**
`(Guid EventId, string Category, string Action, string Actor, DateTime OccurredAtUtc, string? DetailsJson,
NodeId SourceNode, CorrelationId CorrelationId)` — where `NodeId`/`CorrelationId` are `readonly record
struct` newtypes over `string`/`Guid`. It is an **Akka message** delivered via `DistributedPubSub`
(`provider=cluster`) with **default (reflection) serialization** — no custom serializer. **The structured
actor path is DORMANT: zero production emit sites** construct/`Tell` an `AuditEvent` today (only the tests
do); all live audit goes through the bespoke **stored-procedure path** (`sp_NodeApplied`/`sp_PublishGeneration`/
`sp_ValidateDraft`/`sp_RollbackToGeneration` INSERT directly with `ClusterId`/`GenerationId`, NULL `EventId`).
`AuditWriterActor` (`ControlPlane/Audit/AuditWriterActor.cs`): 500/5s batching, two-layer dedup (in-buffer
`Dictionary<Guid,AuditEvent>` + DB filtered-unique `UX_ConfigAuditLog_EventId`), mapping at `:75-84`.
`ConfigAuditLog` (10 cols, no `Outcome`; `ISJSON` CHECK on `DetailsJson`). `ClusterAudit.razor:78` filters
`a.ClusterId == ClusterId`, but the actor sets `NodeId` not `ClusterId`, so structured rows are invisible.
Package pinned `0.1.0` in `Directory.Packages.props`, feed-mapped, unreferenced.
**Deep design — this is the easy one (the record is already ~canonical):**
- **2.1 (high-risk: actor + contract):** Delete Commons `AuditEvent.cs`; reference `ZB.MOM.WW.Audit.AuditEvent`
from `ZB.MOM.WW.OtOpcUa.Commons` + `…ControlPlane`. Field map: `EventId``EventId`; `OccurredAtUtc`
`DateTime``DateTimeOffset` (widen at construction); `Actor`/`Action`/`Category`/`DetailsJson` direct;
`SourceNode` (unwrap `NodeId.Value``string?`); `CorrelationId` (unwrap `.Value``Guid?`); `Target` unused
(null) — OtOpcUa has no extra domain fields to push into `DetailsJson`, so **no field relocation**. Add the
NEW required `Outcome` (derive: `OpcUaAccessDenied`/`CrossClusterNamespaceAttempt``Denied`; config verbs→
`Success`; no `Failure` in OtOpcUa's vocabulary). `AuditWriterActor : IAuditWriter` (`WriteAsync` wraps the
fire-and-forget `Tell`, returns `Task.CompletedTask` — trivially best-effort). Keep batching/dedup. Mapping
at `:75-84` becomes `NodeId = evt.SourceNode`, `CorrelationId = evt.CorrelationId`, `Outcome = evt.Outcome`,
`EventType = $"{evt.Category}:{evt.Action}"` (storage keeps the composite). Value-type unwrap happens at the
(test + future) construction sites. **Akka wire note:** the message type changes shape → a rolling-deploy
wire break IN PRINCIPLE, but **moot** (no live emit traffic). Flag in the commit; no dual-accept window needed.
- **2.2 (high-risk: EF migration + UI query):** add nullable `Outcome` to `ConfigAuditLog` (+ DbContext mapping
`:429-463`) + EF migration `AddConfigAuditLogOutcome` (chains after `20260602112419_CanonicalizeAdminRoles`).
Fix `ClusterAudit.razor:78` so `ClusterId == null && NodeId` resolves to the cluster (OR-predicate joining
`ClusterNodes`, or populate `ClusterId` at flush). SP path stays bespoke (documented).
- **Package refs:** `…Commons` (record + `AuditOutcome`), `…ControlPlane` (`IAuditWriter`), `…Configuration`
(only if `Outcome` is stored as the enum type; otherwise store `string?`/`int?` and skip).
- **Effort:** ~record swap 5m + actor seam 5m + Outcome derivation 5m (2.1); column+migration+query 5m (2.2).
---
## MxGateway — DEEP (Task 2.3, re-scoped) · risk: MEDIUMHIGH (was "standard")
**Verified current state — the plan's file refs are STALE:** Phase 1 (Task 1.3) **moved**
`IApiKeyAuditStore` + `ApiKeyAuditEntry` + `SqliteApiKeyAuditStore` **into the shared library**
(`ZB.MOM.WW.Auth.Abstractions`/`…ApiKeys` 0.1.2) — they no longer exist in MxGateway. `ApiKeyAuditEntry` =
**5 fields** `(string? KeyId, string EventType, string? RemoteAddress, DateTimeOffset CreatedUtc, string? Details)`,
persisted to the SQLite `api_key_audit` table (5 cols). `IApiKeyAuditStore` = `AppendAsync` + `ListRecentAsync`
(the dashboard "recent audit" view reads via `ListRecentAsync`). **Three producers, but one is library-internal:**
- `ApiKeyAdminCommands` (**library-internal**, in `ZB.MOM.WW.Auth.ApiKeys`) — emits CLI/admin verbs
(`init-db`/`create-key`/`revoke-key`/`rotate-key`/`delete-key`/`set-scopes`/`enable-key`/`disable-key`),
keyless for `init-db`, `RemoteAddress` null on the CLI path. **MxGateway cannot edit these call sites.**
- `DashboardApiKeyManagementService` (MxGateway-local) — `dashboard-*` verbs, real `KeyId` + `RemoteAddress`.
- `ConstraintEnforcer.RecordDenialAsync` (MxGateway-local) — single `constraint-denied` EventType, `RemoteAddress`
hardcoded null, `Details = "{commandKind}: {target}: {ConstraintName}: {Message}"`.
`AppendAsync` currently **propagates** exceptions (no best-effort wrap). Serilog migration **landed** (no blocker).
`ZB.MOM.WW.Audit` unreferenced; `nuget.config` already maps the package.
**Deep design — the library-internal CLI producer forces an adapter:**
- Add `<PackageReference Include="ZB.MOM.WW.Audit" />` to `…Server`.
- New **MxGateway-owned canonical store** `audit_event` (SQLite, 9 canonical columns + `details_json`) with its own
migrator — the existing `api_key_audit` lives in the **library-owned** auth DB schema, so we do NOT alter that
schema. Implement `IAuditWriter` over the new store (best-effort try/catch — fixes the no-wrap gap).
- **Adapter for the library-internal CLI events:** register a MxGateway `IApiKeyAuditStore` impl whose
`AppendAsync(ApiKeyAuditEntry)` maps → canonical `AuditEvent` (`EventId=NewGuid`; `KeyId``Actor` with
`"cli"`/`"system"` fallback; `EventType``Action`; `CreatedUtc``OccurredAtUtc`; `RemoteAddress``SourceNode`;
`Outcome=Success`; `Category="ApiKey"`; `Target=KeyId`; `Details``DetailsJson` wrapped `{"detail":"…"}`) and
forwards to `IAuditWriter`. Its `ListRecentAsync` reads the canonical store and maps back to `ApiKeyAuditEntry`
(so the existing dashboard recent-audit view keeps working) **or** the dashboard view is repointed to canonical.
- **Local producers** (`DashboardApiKeyManagementService`, `ConstraintEnforcer`) rewritten to build canonical
`AuditEvent`s directly via `IAuditWriter` (`constraint-denied``Outcome.Denied`; capture `CorrelationId` from
`MxCommandRequest.ClientCorrelationId` (constraint path — needs threading down) / `HttpContext.TraceIdentifier`
(dashboard); structured `Target` from `commandKind`/`target` (GAPS #6)).
- **Open question for review:** retire `api_key_audit` (canonical store becomes the sole audit table) vs keep it
coexisting. Retiring is cleaner-deep but touches the library's store wiring; coexisting is lower-risk.
- **Effort/classification:** re-scoped from "standard ~5m" to **high-risk** (new store + migrator + adapter +
producer rewrites + dashboard read path + DI + tests). Realistically 23 sub-commits.
---
## ScadaBridge — DEEP (Task 2.5, re-scoped) · risk: **VERY HIGH — audit-subsystem re-architecture**
**This is the one to scrutinize at review.** The gate definitively answered the plan's central claim is FALSE.
**Verified current state:** ScadaBridge's `AuditEvent` (`…Commons/Entities/Audit/AuditEvent.cs`) is a
**24-field** record — `EventId, OccurredAtUtc(DateTime), IngestedAtUtc, Channel(AuditChannel), Kind(AuditKind),
CorrelationId, ExecutionId, ParentExecutionId, SourceSiteId, SourceNode, SourceInstanceId, SourceScript, Actor,
Target, Status(AuditStatus), HttpStatus, DurationMs, ErrorMessage, ErrorDetail, RequestSummary, ResponseSummary,
PayloadTruncated, Extra, ForwardState(AuditForwardState?)`. It is the **storage shape of a partitioned SQL Server
audit table** with these as **queryable columns**. `IAuditPayloadFilter.Apply(ScadaBridgeAuditEvent) ->
ScadaBridgeAuditEvent` (NOT the library's record — a reflection contract test `PayloadFilterContractTests` pins
the typing). `IAuditWriter`/`ICentralAuditWriter` are likewise typed to the 24-field record. **`AuditStatus`
drives the site→central forwarding STATE MACHINE** (`Pending→Submitted→Forwarded→Reconciled`;
`Delivered`/`Failed`/`Parked`/`Discarded`) and the **filter's error-cap logic** (`IsErrorStatus`). The Central
reporting/UI queries by `Channel`/`Kind`/`Status`/`Site`. **Phase 1 did NOT touch any audit-pipeline file** (zero
drift). Blast radius of just the interface rename: ~10 files / ~20 sites; the contract test pins it.
**What DEEP adoption concretely requires here (full honesty):**
Replacing the 24-field record with the 9-field canonical + pushing ~15 domain fields into `DetailsJson` means
**re-architecting the entire audit subsystem**, because those fields are not decorative — they are load-bearing:
1. **Storage:** migrate the partitioned SQL Server audit table from ~24 typed columns to the 9 canonical columns
+ a JSON `DetailsJson` column. Massive, lossy-on-queryability data migration; partitioning scheme likely must
change; `IngestedAtUtc`/`ForwardState` are operational columns the forwarder UPDATEs.
3. **Forwarding state machine breaks:** `Status`/`ForwardState` move into opaque JSON — you cannot `UPDATE` a
JSON-embedded field as a column, and the reconciliation queries `WHERE Status/ForwardState = …` stop working.
The site→central forwarder would have to be redesigned (e.g., promote Status back out of JSON, defeating the
point).
4. **Redactor breaks:** `DefaultAuditPayloadFilter` reads `Channel`/`Status`/`RequestSummary`/`ResponseSummary`/
`ErrorDetail`/`Extra`/`PayloadTruncated` to choose truncation caps — on a 9-field canonical record those are
gone (opaque in `DetailsJson`), so the filter must be rewritten to parse JSON.
5. **Reporting/UI breaks:** Central audit-log queries/filters by Channel/Kind/Status/Site lose SQL queryability.
6. ~Dozens of call sites + the contract test + the perf hot-path test.
**Honest assessment:** ScadaBridge DEEP ≈ the **largest single undertaking in the whole program** (bigger than the
Phase-1 ApiKeys re-arch). The audit component's own GAPS doc says *"Align, don't replace"* for exactly this reason.
**Bounded alternative to weigh at review (recommended if "deep" is to be kept tractable):** make the canonical
`ZB.MOM.WW.Audit.AuditEvent` the **seam/transport + cross-project reporting** shape (the redactor and an
`IAuditWriter` operate on the canonical record; domain richness rides in `DetailsJson`), while the **SQL storage
keeps its typed queryable columns** populated by a storage-side projection (canonical+DetailsJson → columns) and
the forwarding state machine continues to key on the `Status`/`ForwardState` columns. This delivers "deep" at the
seam/record level (library types consumed; domain fields in `DetailsJson` for the canonical view) **without**
gutting the partitioned store, the state machine, the filter, or the reporting — a far safer "deep."
---
## Cross-cutting
- **Branch model:** `feat/adopt-zb-audit` per app, **stacked on `feat/adopt-zb-auth` HEAD** (Phase 3 wires the
audit `Actor` from the Phase-1 Auth principal, so audit must build on auth). Local-only, never pushed.
- **No library change / republish** needed for the chosen designs (MxGateway adapts in-repo) — so no Gitea token
required unless the user later wants the canonical mapping pushed into a shared lib.
- **Phase 3 (unchanged in intent):** `IAuditActorAccessor` seam + wire `AuditEvent.Actor` from the Auth principal
at every authenticated emit site; keep `"system"`/`"cli"` fallbacks for keyless paths.
## Re-scoped task list (for review)
| # | Repo | Re-scoped scope | Class | Risk |
|---|---|---|---|---|
| 2.1 | OtOpcUa | Commons record → canonical `AuditEvent`; `AuditWriterActor : IAuditWriter`; `Outcome` derivation; Akka-wire note (dormant) | high-risk | LowMed |
| 2.2 | OtOpcUa | `ConfigAuditLog.Outcome` column + EF migration + `ClusterAudit` visibility fix; SP path bespoke | high-risk | LowMed |
| 2.3 | MxGateway | new canonical SQLite `audit_event` store + migrator; `IAuditWriter`; `IApiKeyAuditStore`→canonical adapter (for library-internal CLI events) incl. `ListRecentAsync`; rewrite local producers; CorrelationId/Target capture; DI; tests | **high-risk** (↑ from standard) | MedHigh |
| 2.5 | ScadaBridge | **DEEP = audit-subsystem re-arch** (24-field→9-field record everywhere; domain fields→`DetailsJson`; SQL partitioned-table migration; forwarding state machine + filter + reporting rewrite; contract/perf tests) — **OR** the bounded "deep-at-the-seam" alternative above | **very-high-risk** | **VERY HIGH** |
## Implementation status (2026-06-02, deep adoption underway)
- **✅ OtOpcUa 2.1 + 2.2 DONE** (`feat/adopt-zb-audit`, spec ✅ + code ✅): `933dd1a` — deleted bespoke Commons
`AuditEvent`, adopted library `ZB.MOM.WW.Audit.AuditEvent`, `AuditWriterActor : IAuditWriter` (best-effort
`WriteAsync` wraps `Self.Tell`), `AuditOutcomeMapper.FromAction` derivation, batching/dedup intact; `b7f5e88`
nullable `Outcome` column + migration `20260602135350_AddConfigAuditLogOutcome` (additive, chains after
CanonicalizeAdminRoles, no pending model changes) + `ClusterAudit` fix via shared `ClusterAuditQuery` (OR-predicate
joining `ClusterNode` membership). SP path untouched. ControlPlane 45/45, Configuration 80/80 (+3), AdminUI 121/121.
Minor backlog: no `IX_ConfigAuditLog_NodeId` (irrelevant while structured path dormant).
- **✅ MxGateway 2.3 DONE** (`feat/adopt-zb-audit`, spec ✅ + code ✅): `a5944bb` — new MxGateway-owned canonical
SQLite `audit_event` store (same auth DB file via the library's `AuthSqliteConnectionFactory`; library tables
untouched), `CanonicalAuditWriter : IAuditWriter` (best-effort, never throws — closes the library's no-wrap gap),
`CanonicalForwardingApiKeyAuditStore : IApiKeyAuditStore` adapter (maps `ApiKeyAuditEntry`→canonical w/ system/cli
fallback + constraint-denied→Denied + DetailsJson wrap; `ListRecent` round-trips for the dashboard view), DI
overrides the library's `TryAddSingleton`'d store; `7ea8358` — Dashboard + ConstraintEnforcer rewritten to emit
canonical `AuditEvent` directly via `IAuditWriter` with structured `Target` + (dashboard) `CorrelationId`. 587 pass,
3 pre-existing FakeWorker reds, +10 tests. `api_key_audit` left unused (documented). Minor backlog: dup `WrapDetail`,
per-op `EnsureTable`, a test temp-dir leak, unfiltered `ListRecent` category.
- **✅ ScadaBridge 2.5 — DONE (FULL re-arch, user-chosen).** Decomposed into C1C7 (design in
`2026-06-02-scadabridge-audit-rearch.md`), all spec+code reviewed, MSSQL-verified, local-only on `feat/adopt-zb-audit`.
Canonical record everywhere; site SQLite two-table (canonical + forwarding sidecar); central `dbo.AuditLog` collapsed to
10 canonical cols + persisted computed cols (`CollapseAuditLogToCanonical` migration); redactor/outcome/UI/export/CLI all
canonical. Forwarding state machine preserved (sidecar) + queryability preserved (persisted computed columns) — the design's
key insight that central is append-only made pure-9-col central feasible without gutting forwarding.
## Open items to confirm at review
1. **ScadaBridge:** full audit re-architecture (pure 9-col storage) vs the **bounded "deep-at-the-seam"** variant
(canonical record at the seam/reporting boundary; keep typed storage columns + state machine). Strongly
recommend the bounded variant.
2. **MxGateway:** retire `api_key_audit` (canonical store is sole) vs keep it coexisting.
3. **OtOpcUa:** confirm leaving the SP path bespoke (structured path is dormant; canonicalization is forward-looking
prep) is acceptable, and the `ClusterAudit` fix approach (OR-predicate vs populate `ClusterId`).
4. **Sequencing:** OtOpcUa (2.1→2.2) and MxGateway (2.3) are independent + tractable; ScadaBridge (2.5) is the
gating risk — do it last, and as staged reviewed sub-commits regardless of variant.