scadaproj/docs/plans/2026-06-02-auth-audit-normalization-phase2-deep.md

Phase 2 (Audit adoption) — Task 2.0 gate findings + DEEP re-scope (for review)

Companion to 2026-06-02-auth-audit-normalization.md. Produced by the Task 2.0 read-only verification gate (3 parallel explorers, all paths verified 2026-06-02 against live code on each repo's feat/adopt-zb-auth HEAD). Status: PAUSED for user review before any audit code is written.

Decisions taken (2026-06-02):

  • Depth = DEEP adopt (canonical record). Each app's audit record becomes the library's 9-field ZB.MOM.WW.Audit.AuditEvent; domain-specific fields relocate into DetailsJson; each app consumes the library's IAuditWriter/IAuditRedactor/AuditOutcome types. (User chose this over the gate-recommended lighter "Align" — consistent with the standing maximal/full-adopt directive.)
  • Cadence = re-scope + PAUSE for review. This doc is the review artifact; implementation does not start until the user signs off (especially on the ScadaBridge cost, below).

Why a re-scope was needed: the plan's Phase 2 task specs were written from optimistic components/audit/current-state/* docs (see component-status-claims-are-optimistic). The gate found all three repos' specs are materially off — file refs moved (MxGateway), the target path is dormant (OtOpcUa), and the "outright rename" is structurally impossible (ScadaBridge).


The canonical contract (shared ZB.MOM.WW.Audit 0.1.0)

AuditEvent (sealed record): REQUIRED EventId:Guid, OccurredAtUtc:DateTimeOffset (UTC-normalized on set), Actor:string, Action:string, Outcome:AuditOutcome; OPTIONAL Category:string?, Target:string?, SourceNode:string?, CorrelationId:Guid?, DetailsJson:string?. That is ten members counting DetailsJson; the "9-field" shorthand used throughout this doc counts the nine typed fields and treats DetailsJson as the extension payload (cf. "9 canonical columns + details_json" in the MxGateway design). AuditOutcome { Success, Failure, Denied }. IAuditWriter.WriteAsync(AuditEvent, CancellationToken): best-effort, never throws. IAuditRedactor.Apply(AuditEvent) -> AuditEvent: pure, never throws. The package is pinned (central PM / explicit) + feed-mapped in all three repos; referenced by none yet.
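To make the contract concrete, here is a minimal, language-neutral sketch in Python (the real types are C# and live in ZB.MOM.WW.Audit). The snake_case names, the `BestEffortWriter` class, and its `dropped` counter are illustrative assumptions, not the library's API; only the field list, the UTC normalization, and the never-throws behaviour come from the contract above.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Callable, Optional


class AuditOutcome(Enum):
    SUCCESS = "Success"
    FAILURE = "Failure"
    DENIED = "Denied"


@dataclass(frozen=True)
class AuditEvent:
    # Required members.
    event_id: uuid.UUID
    occurred_at_utc: datetime
    actor: str
    action: str
    outcome: AuditOutcome
    # Optional members.
    category: Optional[str] = None
    target: Optional[str] = None
    source_node: Optional[str] = None
    correlation_id: Optional[uuid.UUID] = None
    details_json: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: naive timestamps are rejected (a DateTimeOffset always has an offset).
        if self.occurred_at_utc.tzinfo is None:
            raise ValueError("occurred_at_utc must carry a UTC offset")
        # Normalize to UTC; object.__setattr__ is needed because the dataclass is frozen.
        object.__setattr__(
            self, "occurred_at_utc", self.occurred_at_utc.astimezone(timezone.utc)
        )


class BestEffortWriter:
    """Hypothetical writer: forwards to a sink and swallows any sink failure."""

    def __init__(self, sink: Callable[[AuditEvent], None]) -> None:
        self._sink = sink
        self.dropped = 0  # illustrative counter, not part of the contract

    def write(self, event: AuditEvent) -> None:
        try:
            self._sink(event)
        except Exception:
            self.dropped += 1  # audit must never break the caller


def failing_sink(event: AuditEvent) -> None:
    raise RuntimeError("store unavailable")


event = AuditEvent(
    event_id=uuid.uuid4(),
    occurred_at_utc=datetime(2026, 6, 2, 12, 0, tzinfo=timezone(timedelta(hours=2))),
    actor="cli",
    action="init-db",
    outcome=AuditOutcome.SUCCESS,
)
writer = BestEffortWriter(failing_sink)
writer.write(event)  # does not raise; the failure is only counted
```

The timestamp is supplied at +02:00 and stored as 10:00 UTC, and the failing sink does not propagate to the caller.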


OtOpcUa - DEEP (Tasks 2.1 + 2.2) · risk: LOW-MEDIUM

Verified current state: Commons AuditEvent is an 8-field positional record(Guid EventId, string Category, string Action, string Actor, DateTime OccurredAtUtc, string? DetailsJson, NodeId SourceNode, CorrelationId CorrelationId), where NodeId/CorrelationId are readonly record struct newtypes over string/Guid. It is an Akka message delivered via DistributedPubSub (provider=cluster) with default (reflection) serialization; there is no custom serializer. The structured actor path is DORMANT: zero production emit sites construct/Tell an AuditEvent today (only the tests do); all live audit goes through the bespoke stored-procedure path (sp_NodeApplied/sp_PublishGeneration/sp_ValidateDraft/sp_RollbackToGeneration INSERT directly with ClusterId/GenerationId, NULL EventId). AuditWriterActor (ControlPlane/Audit/AuditWriterActor.cs): 500/5s batching, two-layer dedup (in-buffer Dictionary<Guid,AuditEvent> + DB filtered-unique UX_ConfigAuditLog_EventId), mapping at :75-84. ConfigAuditLog (10 cols, no Outcome; ISJSON CHECK on DetailsJson). ClusterAudit.razor:78 filters a.ClusterId == ClusterId, but the actor sets NodeId not ClusterId, so structured rows are invisible. Package pinned 0.1.0 in Directory.Packages.props, feed-mapped, unreferenced.
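The 500/5s batching with two-layer dedup can be pictured with the following Python sketch. It is a simplification under stated assumptions: the class and member names are hypothetical, a set stands in for the UX_ConfigAuditLog_EventId unique index, and the age check runs on each offer rather than on a timer as an actor would.

```python
import uuid
from typing import Callable, Dict, List, Set


class BatchingAuditBuffer:
    """Hypothetical stand-in for the actor's buffer; not the real AuditWriterActor."""

    def __init__(self, now: Callable[[], float], max_batch: int = 500,
                 max_age_s: float = 5.0) -> None:
        self._now = now
        self._max_batch = max_batch
        self._max_age_s = max_age_s
        self._buffer: Dict[uuid.UUID, dict] = {}
        self._last_flush = now()
        self.persisted_ids: Set[uuid.UUID] = set()  # plays the role of the unique index
        self.persisted_rows: List[dict] = []

    def offer(self, event: dict) -> None:
        # Layer 1: a duplicate EventId inside the current batch keeps the first copy.
        self._buffer.setdefault(event["event_id"], event)
        too_big = len(self._buffer) >= self._max_batch
        too_old = self._now() - self._last_flush >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        for event_id, event in self._buffer.items():
            # Layer 2: an EventId already persisted by an earlier batch is skipped.
            if event_id in self.persisted_ids:
                continue
            self.persisted_ids.add(event_id)
            self.persisted_rows.append(event)
        self._buffer.clear()
        self._last_flush = self._now()


clock = [0.0]  # fake clock so the example is deterministic
buf = BatchingAuditBuffer(now=lambda: clock[0], max_batch=3)
first = {"event_id": uuid.uuid4()}
buf.offer(first)
buf.offer(first)                        # in-buffer duplicate, collapsed by layer 1
rows_before_flush = len(buf.persisted_rows)
clock[0] = 5.0
buf.offer({"event_id": uuid.uuid4()})   # age threshold reached, batch is flushed
rows_after_flush = len(buf.persisted_rows)
buf.offer(first)                        # replay after the flush
buf.flush()                             # rejected by layer 2
rows_after_replay = len(buf.persisted_rows)
```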

Deep design — this is the easy one (the record is already ~canonical):

  • 2.1 (high-risk: actor + contract): Delete Commons AuditEvent.cs; reference ZB.MOM.WW.Audit.AuditEvent from ZB.MOM.WW.OtOpcUa.Commons + …ControlPlane. Field map: EventId→EventId; OccurredAtUtc DateTime→DateTimeOffset (widen at construction); Actor/Action/Category/DetailsJson direct; SourceNode (unwrap NodeId.Value→string?); CorrelationId (unwrap .Value→Guid?); Target unused (null). OtOpcUa has no extra domain fields to push into DetailsJson, so no field relocation. Add the NEW required Outcome (derive: OpcUaAccessDenied/CrossClusterNamespaceAttempt→Denied; config verbs→Success; no Failure in OtOpcUa's vocabulary). AuditWriterActor : IAuditWriter (WriteAsync wraps the fire-and-forget Tell and returns Task.CompletedTask, so it is trivially best-effort). Keep batching/dedup. Mapping at :75-84 becomes NodeId = evt.SourceNode, CorrelationId = evt.CorrelationId, Outcome = evt.Outcome, EventType = $"{evt.Category}:{evt.Action}" (storage keeps the composite). Value-type unwrap happens at the (test + future) construction sites. Akka wire note: the message type changes shape → a rolling-deploy wire break IN PRINCIPLE, but moot (no live emit traffic). Flag in the commit; no dual-accept window needed.
  • 2.2 (high-risk: EF migration + UI query): add nullable Outcome to ConfigAuditLog (+ DbContext mapping :429-463) + EF migration AddConfigAuditLogOutcome (chains after 20260602112419_CanonicalizeAdminRoles). Fix ClusterAudit.razor:78 so ClusterId == null && NodeId resolves to the cluster (OR-predicate joining ClusterNodes, or populate ClusterId at flush). SP path stays bespoke (documented).
  • Package refs: …Commons (record + AuditOutcome), …ControlPlane (IAuditWriter), …Configuration (only if Outcome is stored as the enum type; otherwise store string?/int? and skip).
  • Effort: ~record swap 5m + actor seam 5m + Outcome derivation 5m (2.1); column+migration+query 5m (2.2).
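The Outcome derivation and the storage mapping in 2.1 amount to a small lookup plus a projection. The Python sketch below is illustrative only: the two denied action names are read from the derivation rule in the plan text and should be treated as assumptions, the function names are hypothetical, and the event is modelled as a plain dict rather than the library record.

```python
from enum import Enum


class AuditOutcome(Enum):
    SUCCESS = "Success"
    FAILURE = "Failure"
    DENIED = "Denied"


# Assumption: these are the only actions that derive to Denied.
DENIED_ACTIONS = frozenset({"OpcUaAccessDenied", "CrossClusterNamespaceAttempt"})


def outcome_from_action(action: str) -> AuditOutcome:
    """Denied for the access-denial actions, Success for config verbs; never Failure."""
    return AuditOutcome.DENIED if action in DENIED_ACTIONS else AuditOutcome.SUCCESS


def to_config_audit_row(evt: dict) -> dict:
    """Project a canonical event onto ConfigAuditLog-style columns."""
    return {
        "EventId": evt["EventId"],
        "NodeId": evt.get("SourceNode"),            # already unwrapped to a string
        "CorrelationId": evt.get("CorrelationId"),  # already unwrapped to a Guid-like value
        "Outcome": evt["Outcome"].value,
        # Storage keeps the composite Category:Action event type.
        "EventType": f"{evt['Category']}:{evt['Action']}",
        "DetailsJson": evt.get("DetailsJson"),
    }


row = to_config_audit_row({
    "EventId": "00000000-0000-0000-0000-000000000001",
    "Category": "Security",          # hypothetical category value
    "Action": "OpcUaAccessDenied",
    "Outcome": outcome_from_action("OpcUaAccessDenied"),
    "SourceNode": "node-a",
})
```

Storing the outcome as its string value corresponds to the "store string?/int? and skip" option for the Configuration package reference.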

MxGateway - DEEP (Task 2.3, re-scoped) · risk: MEDIUM-HIGH (was "standard")

Verified current state — the plan's file refs are STALE: Phase 1 (Task 1.3) moved IApiKeyAuditStore + ApiKeyAuditEntry + SqliteApiKeyAuditStore into the shared library (ZB.MOM.WW.Auth.Abstractions/…ApiKeys 0.1.2) — they no longer exist in MxGateway. ApiKeyAuditEntry = 5 fields (string? KeyId, string EventType, string? RemoteAddress, DateTimeOffset CreatedUtc, string? Details), persisted to the SQLite api_key_audit table (5 cols). IApiKeyAuditStore = AppendAsync + ListRecentAsync (the dashboard "recent audit" view reads via ListRecentAsync). Three producers, but one is library-internal:

  • ApiKeyAdminCommands (library-internal, in ZB.MOM.WW.Auth.ApiKeys) — emits CLI/admin verbs (init-db/create-key/revoke-key/rotate-key/delete-key/set-scopes/enable-key/disable-key), keyless for init-db, RemoteAddress null on the CLI path. MxGateway cannot edit these call sites.
  • DashboardApiKeyManagementService (MxGateway-local) — dashboard-* verbs, real KeyId + RemoteAddress.
  • ConstraintEnforcer.RecordDenialAsync (MxGateway-local) — single constraint-denied EventType, RemoteAddress hardcoded null, Details = "{commandKind}: {target}: {ConstraintName}: {Message}". AppendAsync currently propagates exceptions (no best-effort wrap). Serilog migration landed (no blocker). ZB.MOM.WW.Audit unreferenced; nuget.config already maps the package.

Deep design — the library-internal CLI producer forces an adapter:

  • Add <PackageReference Include="ZB.MOM.WW.Audit" /> to …Server.
  • New MxGateway-owned canonical store audit_event (SQLite, 9 canonical columns + details_json) with its own migrator — the existing api_key_audit lives in the library-owned auth DB schema, so we do NOT alter that schema. Implement IAuditWriter over the new store (best-effort try/catch — fixes the no-wrap gap).
  • Adapter for the library-internal CLI events: register a MxGateway IApiKeyAuditStore impl whose AppendAsync(ApiKeyAuditEntry) maps → canonical AuditEvent (EventId=NewGuid; KeyId→Actor with "cli"/"system" fallback; EventType→Action; CreatedUtc→OccurredAtUtc; RemoteAddress→SourceNode; Outcome=Success; Category="ApiKey"; Target=KeyId; Details→DetailsJson wrapped {"detail":"…"}) and forwards to IAuditWriter. Its ListRecentAsync reads the canonical store and maps back to ApiKeyAuditEntry (so the existing dashboard recent-audit view keeps working) or the dashboard view is repointed to canonical.
  • Local producers (DashboardApiKeyManagementService, ConstraintEnforcer) rewritten to build canonical AuditEvents directly via IAuditWriter (constraint-denied→Outcome.Denied; capture CorrelationId from MxCommandRequest.ClientCorrelationId (constraint path, needs threading down) / HttpContext.TraceIdentifier (dashboard); structured Target from commandKind/target (GAPS #6)).
  • Open question for review: retire api_key_audit (canonical store becomes the sole audit table) vs keep it coexisting. Retiring is cleaner-deep but touches the library's store wiring; coexisting is lower-risk.
  • Effort/classification: re-scoped from "standard ~5m" to high-risk (new store + migrator + adapter + producer rewrites + dashboard read path + DI + tests). Realistically 2-3 sub-commits.
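The adapter mapping described in the list above can be sketched as a pair of pure functions, one per direction. This Python version is a hedged illustration: the snake_case names and the dict representation of the canonical event are assumptions, and the constraint-denied→Denied branch is included only for the case where such entries reach the adapter.

```python
import json
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ApiKeyAuditEntry:
    key_id: Optional[str]
    event_type: str
    remote_address: Optional[str]
    created_utc: datetime
    details: Optional[str]


def to_canonical(entry: ApiKeyAuditEntry, fallback_actor: str = "cli") -> dict:
    """Map the 5-field library entry onto the canonical shape (AppendAsync direction)."""
    denied = entry.event_type == "constraint-denied"
    return {
        "event_id": uuid.uuid4(),
        "occurred_at_utc": entry.created_utc,
        "actor": entry.key_id or fallback_actor,   # keyless paths fall back to "cli"/"system"
        "action": entry.event_type,
        "outcome": "Denied" if denied else "Success",
        "category": "ApiKey",
        "target": entry.key_id,
        "source_node": entry.remote_address,
        "correlation_id": None,
        # Free-text details are wrapped so the column always holds valid JSON.
        "details_json": None if entry.details is None
        else json.dumps({"detail": entry.details}),
    }


def to_entry(evt: dict) -> ApiKeyAuditEntry:
    """Map back for the dashboard's recent-audit view (ListRecentAsync direction)."""
    details = None
    if evt["details_json"] is not None:
        details = json.loads(evt["details_json"])["detail"]
    return ApiKeyAuditEntry(
        key_id=evt["target"],
        event_type=evt["action"],
        remote_address=evt["source_node"],
        created_utc=evt["occurred_at_utc"],
        details=details,
    )


keyless = ApiKeyAuditEntry(None, "init-db", None,
                           datetime(2026, 6, 2, tzinfo=timezone.utc), "schema created")
canonical = to_canonical(keyless)
round_tripped = to_entry(canonical)
```

Reading KeyId back from Target rather than Actor keeps the round trip lossless for keyless entries, where Actor holds the fallback value.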

ScadaBridge — DEEP (Task 2.5, re-scoped) · risk: VERY HIGH — audit-subsystem re-architecture

This is the one to scrutinize at review. The gate definitively established that the plan's central claim (that adoption is an "outright rename") is FALSE.

Verified current state: ScadaBridge's AuditEvent (…Commons/Entities/Audit/AuditEvent.cs) is a 24-field record — EventId, OccurredAtUtc(DateTime), IngestedAtUtc, Channel(AuditChannel), Kind(AuditKind), CorrelationId, ExecutionId, ParentExecutionId, SourceSiteId, SourceNode, SourceInstanceId, SourceScript, Actor, Target, Status(AuditStatus), HttpStatus, DurationMs, ErrorMessage, ErrorDetail, RequestSummary, ResponseSummary, PayloadTruncated, Extra, ForwardState(AuditForwardState?). It is the storage shape of a partitioned SQL Server audit table with these as queryable columns. IAuditPayloadFilter.Apply(ScadaBridgeAuditEvent) -> ScadaBridgeAuditEvent (NOT the library's record — a reflection contract test PayloadFilterContractTests pins the typing). IAuditWriter/ICentralAuditWriter are likewise typed to the 24-field record. AuditStatus drives the site→central forwarding STATE MACHINE (Pending→Submitted→Forwarded→Reconciled; Delivered/Failed/Parked/Discarded) and the filter's error-cap logic (IsErrorStatus). The Central reporting/UI queries by Channel/Kind/Status/Site. Phase 1 did NOT touch any audit-pipeline file (zero drift). Blast radius of just the interface rename: ~10 files / ~20 sites; the contract test pins it.

What DEEP adoption concretely requires here (full honesty): Replacing the 24-field record with the 9-field canonical + pushing ~15 domain fields into DetailsJson means re-architecting the entire audit subsystem, because those fields are not decorative — they are load-bearing:

  1. Storage: migrate the partitioned SQL Server audit table from ~24 typed columns to the 9 canonical columns + a JSON DetailsJson column. Massive, lossy-on-queryability data migration; partitioning scheme likely must change; IngestedAtUtc/ForwardState are operational columns the forwarder UPDATEs.
  2. Forwarding state machine breaks: Status/ForwardState move into opaque JSON — you cannot UPDATE a JSON-embedded field as a column, and the reconciliation queries WHERE Status/ForwardState = … stop working. The site→central forwarder would have to be redesigned (e.g., promote Status back out of JSON, defeating the point).
  3. Redactor breaks: DefaultAuditPayloadFilter reads Channel/Status/RequestSummary/ResponseSummary/ErrorDetail/Extra/PayloadTruncated to choose truncation caps; on a 9-field canonical record those are gone (opaque in DetailsJson), so the filter must be rewritten to parse JSON.
  4. Reporting/UI breaks: Central audit-log queries/filters by Channel/Kind/Status/Site lose SQL queryability.
  5. ~Dozens of call sites + the contract test + the perf hot-path test.

Honest assessment: ScadaBridge DEEP ≈ the largest single undertaking in the whole program (bigger than the Phase-1 ApiKeys re-arch). The audit component's own GAPS doc says "Align, don't replace" for exactly this reason.

Bounded alternative to weigh at review (recommended if "deep" is to be kept tractable): make the canonical ZB.MOM.WW.Audit.AuditEvent the seam/transport + cross-project reporting shape (the redactor and an IAuditWriter operate on the canonical record; domain richness rides in DetailsJson), while the SQL storage keeps its typed queryable columns populated by a storage-side projection (canonical+DetailsJson → columns) and the forwarding state machine continues to key on the Status/ForwardState columns. This delivers "deep" at the seam/record level (library types consumed; domain fields in DetailsJson for the canonical view) without gutting the partitioned store, the state machine, the filter, or the reporting — a far safer "deep."
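A storage-side projection of the kind described here could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the choice of projected columns, the function name, and the convention that DetailsJson keys match the column names exactly.

```python
import json
from typing import Any, Dict

# Canonical columns copied through unchanged.
CANONICAL_COLUMNS = ("EventId", "OccurredAtUtc", "Actor", "Action", "Outcome",
                     "Category", "Target", "SourceNode", "CorrelationId")
# Hypothetical subset of domain fields that stay typed, queryable columns.
TYPED_COLUMNS = ("Channel", "Kind", "Status", "SourceSiteId")


def project_to_storage_row(evt: Dict[str, Any]) -> Dict[str, Any]:
    """Expand canonical + DetailsJson into a row with typed domain columns."""
    details = json.loads(evt["DetailsJson"]) if evt.get("DetailsJson") else {}
    row = {name: evt.get(name) for name in CANONICAL_COLUMNS}
    for name in TYPED_COLUMNS:
        # Missing keys become NULL so the row shape is stable.
        row[name] = details.get(name)
    row["DetailsJson"] = evt.get("DetailsJson")
    return row


row = project_to_storage_row({
    "EventId": "00000000-0000-0000-0000-000000000002",
    "OccurredAtUtc": "2026-06-02T10:00:00Z",
    "Actor": "system",
    "Action": "ApiCall",            # hypothetical action value
    "Outcome": "Success",
    "DetailsJson": json.dumps({"Channel": "Http", "Status": "Pending"}),
})
```

Because Status lands in a real column, a forwarder can keep issuing plain UPDATE and WHERE statements against it while producers and the redactor see only the canonical record.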


Cross-cutting

  • Branch model: feat/adopt-zb-audit per app, stacked on feat/adopt-zb-auth HEAD (Phase 3 wires the audit Actor from the Phase-1 Auth principal, so audit must build on auth). Local-only, never pushed.
  • No library change / republish needed for the chosen designs (MxGateway adapts in-repo) — so no Gitea token required unless the user later wants the canonical mapping pushed into a shared lib.
  • Phase 3 (unchanged in intent): IAuditActorAccessor seam + wire AuditEvent.Actor from the Auth principal at every authenticated emit site; keep "system"/"cli" fallbacks for keyless paths.

Re-scoped task list (for review)

| # | Repo | Re-scoped scope | Class | Risk |
|---|---|---|---|---|
| 2.1 | OtOpcUa | Commons record → canonical AuditEvent; AuditWriterActor : IAuditWriter; Outcome derivation; Akka-wire note (dormant) | high-risk | Low-Med |
| 2.2 | OtOpcUa | ConfigAuditLog.Outcome column + EF migration + ClusterAudit visibility fix; SP path bespoke | high-risk | Low-Med |
| 2.3 | MxGateway | new canonical SQLite audit_event store + migrator; IAuditWriter; IApiKeyAuditStore→canonical adapter (for library-internal CLI events) incl. ListRecentAsync; rewrite local producers; CorrelationId/Target capture; DI; tests | high-risk (↑ from standard) | Med-High |
| 2.5 | ScadaBridge | DEEP = audit-subsystem re-arch (24-field→9-field record everywhere; domain fields→DetailsJson; SQL partitioned-table migration; forwarding state machine + filter + reporting rewrite; contract/perf tests), OR the bounded "deep-at-the-seam" alternative above | very-high-risk | VERY HIGH |

Implementation status (2026-06-02, deep adoption underway)

  • OtOpcUa 2.1 + 2.2 DONE (feat/adopt-zb-audit, spec + code reviewed). Commit 933dd1a deleted the bespoke Commons AuditEvent and adopted the library ZB.MOM.WW.Audit.AuditEvent, with AuditWriterActor : IAuditWriter (best-effort WriteAsync wraps Self.Tell), AuditOutcomeMapper.FromAction derivation, and batching/dedup intact. Commit b7f5e88 added the nullable Outcome column + migration 20260602135350_AddConfigAuditLogOutcome (additive, chains after CanonicalizeAdminRoles, no pending model changes) + the ClusterAudit fix via a shared ClusterAuditQuery (OR-predicate joining ClusterNode membership). SP path untouched. ControlPlane 45/45, Configuration 80/80 (+3), AdminUI 121/121. Minor backlog: no IX_ConfigAuditLog_NodeId (irrelevant while structured path dormant).
  • MxGateway 2.3 DONE (feat/adopt-zb-audit, spec + code reviewed). Commit a5944bb added the new MxGateway-owned canonical SQLite audit_event store (same auth DB file via the library's AuthSqliteConnectionFactory; library tables untouched), CanonicalAuditWriter : IAuditWriter (best-effort, never throws; closes the library's no-wrap gap), and the CanonicalForwardingApiKeyAuditStore : IApiKeyAuditStore adapter (maps ApiKeyAuditEntry→canonical w/ system/cli fallback + constraint-denied→Denied + DetailsJson wrap; ListRecent round-trips for the dashboard view); DI overrides the library's TryAddSingleton'd store. Commit 7ea8358 rewrote Dashboard + ConstraintEnforcer to emit canonical AuditEvent directly via IAuditWriter with structured Target + (dashboard) CorrelationId. 587 pass, 3 pre-existing FakeWorker reds, +10 tests. api_key_audit left unused (documented). Minor backlog: dup WrapDetail, per-op EnsureTable, a test temp-dir leak, unfiltered ListRecent category.
  • ScadaBridge 2.5 DONE (FULL re-arch, user-chosen). Decomposed into C1-C7 (design in 2026-06-02-scadabridge-audit-rearch.md), all spec+code reviewed, MSSQL-verified, local-only on feat/adopt-zb-audit. Canonical record everywhere; site SQLite two-table (canonical + forwarding sidecar); central dbo.AuditLog collapsed to 10 canonical cols + persisted computed cols (CollapseAuditLogToCanonical migration); redactor/outcome/UI/export/CLI all canonical. Forwarding state machine preserved (sidecar) + queryability preserved (persisted computed columns): the design's key insight that central is append-only made pure-9-col central feasible without gutting forwarding.
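The site-side "canonical + forwarding sidecar" split can be illustrated with an in-memory SQLite sketch in Python. The table names, column names, and helper functions are hypothetical and the canonical table is trimmed to a few columns; the actual schema is in the re-arch design doc. The point shown is only that the canonical row is written once and never updated, while the mutable forwarding state lives in a separate table keyed by the same EventId.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE audit_event (            -- append-only canonical record (trimmed)
    event_id     TEXT PRIMARY KEY,
    actor        TEXT NOT NULL,
    action       TEXT NOT NULL,
    outcome      TEXT NOT NULL,
    details_json TEXT
);
CREATE TABLE audit_forward (          -- mutable sidecar for the forwarder
    event_id      TEXT PRIMARY KEY REFERENCES audit_event(event_id),
    forward_state TEXT NOT NULL DEFAULT 'Pending'
);
""")


def record(event_id: str, actor: str, action: str, outcome: str) -> None:
    # One transaction writes the canonical row and its sidecar row together.
    with conn:
        conn.execute(
            "INSERT INTO audit_event (event_id, actor, action, outcome) VALUES (?, ?, ?, ?)",
            (event_id, actor, action, outcome),
        )
        conn.execute("INSERT INTO audit_forward (event_id) VALUES (?)", (event_id,))


def mark(event_id: str, state: str) -> None:
    # Only the sidecar is updated; audit_event is never touched again.
    with conn:
        conn.execute(
            "UPDATE audit_forward SET forward_state = ? WHERE event_id = ?",
            (state, event_id),
        )


def pending() -> list:
    rows = conn.execute(
        "SELECT event_id FROM audit_forward WHERE forward_state = 'Pending' ORDER BY event_id"
    )
    return [r[0] for r in rows]


record("evt-1", "system", "ApiCall", "Success")
record("evt-2", "system", "ApiCall", "Failure")
mark("evt-1", "Forwarded")
still_pending = pending()
event_count = conn.execute("SELECT COUNT(*) FROM audit_event").fetchone()[0]
```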

Open items to confirm at review

  1. ScadaBridge: full audit re-architecture (pure 9-col storage) vs the bounded "deep-at-the-seam" variant (canonical record at the seam/reporting boundary; keep typed storage columns + state machine). Strongly recommend the bounded variant.
  2. MxGateway: retire api_key_audit (canonical store is sole) vs keep it coexisting.
  3. OtOpcUa: confirm leaving the SP path bespoke (structured path is dormant; canonicalization is forward-looking prep) is acceptable, and the ClusterAudit fix approach (OR-predicate vs populate ClusterId).
  4. Sequencing: OtOpcUa (2.1→2.2) and MxGateway (2.3) are independent + tractable; ScadaBridge (2.5) is the gating risk — do it last, and as staged reviewed sub-commits regardless of variant.