Phase 2 (Audit adoption) — Task 2.0 gate findings + DEEP re-scope (for review)
Companion to 2026-06-02-auth-audit-normalization.md. Produced by the Task 2.0 read-only
verification gate (3 parallel explorers, all paths verified 2026-06-02 against live code on each
repo's feat/adopt-zb-auth HEAD). Status: PAUSED for user review before any audit code is written.
Decisions taken (2026-06-02):
- Depth = DEEP adopt (canonical record). Each app's audit record becomes the library's 9-field ZB.MOM.WW.Audit.AuditEvent; domain-specific fields relocate into DetailsJson; each app consumes the library's IAuditWriter/IAuditRedactor/AuditOutcome types. (User chose this over the gate-recommended lighter "Align", consistent with the standing maximal/full-adopt directive.)
- Cadence = re-scope + PAUSE for review. This doc is the review artifact; implementation does not start until the user signs off (especially on the ScadaBridge cost, below).
Why a re-scope was needed: the plan's Phase 2 task specs were written from optimistic
components/audit/current-state/* docs (see component-status-claims-are-optimistic). The gate found that all three repos' specs are materially off: file refs moved (MxGateway), the target path is dormant (OtOpcUa), and the "outright rename" is structurally impossible (ScadaBridge).
The canonical contract (shared ZB.MOM.WW.Audit 0.1.0)
AuditEvent (sealed record): REQUIRED EventId:Guid, OccurredAtUtc:DateTimeOffset (UTC-normalized
on set), Actor:string, Action:string, Outcome:AuditOutcome; OPTIONAL Category:string?,
Target:string?, SourceNode:string?, CorrelationId:Guid?, DetailsJson:string?. Nine fields plus DetailsJson (ten properties in total).
AuditOutcome { Success, Failure, Denied }. IAuditWriter.WriteAsync(AuditEvent, CancellationToken) —
best-effort, never throws. IAuditRedactor.Apply(AuditEvent) -> AuditEvent — pure, never throws.
The package is pinned (central PM / explicit) + feed-mapped in all three repos; referenced by none yet.
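To make the contract above easier to review, here is a minimal C# sketch of the shapes it describes. It is an illustrative re-declaration written from the field list in this document, not the library's source; the actual ZB.MOM.WW.Audit 0.1.0 types may differ in accessor style, defaults, and attributes.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative re-declaration only; the real types ship in ZB.MOM.WW.Audit.
public enum AuditOutcome { Success, Failure, Denied }

public sealed record AuditEvent
{
    private readonly DateTimeOffset _occurredAtUtc;

    // Required fields.
    public required Guid EventId { get; init; }
    public required DateTimeOffset OccurredAtUtc
    {
        get => _occurredAtUtc;
        init => _occurredAtUtc = value.ToUniversalTime(); // UTC-normalized on set
    }
    public required string Actor { get; init; }
    public required string Action { get; init; }
    public required AuditOutcome Outcome { get; init; }

    // Optional fields.
    public string? Category { get; init; }
    public string? Target { get; init; }
    public string? SourceNode { get; init; }
    public Guid? CorrelationId { get; init; }
    public string? DetailsJson { get; init; }
}

public interface IAuditWriter
{
    // Best-effort by contract: implementations are expected not to throw.
    Task WriteAsync(AuditEvent evt, CancellationToken ct);
}

public interface IAuditRedactor
{
    // Pure by contract: returns a (possibly redacted) record and does not throw.
    AuditEvent Apply(AuditEvent evt);
}
```

With this shape, an event constructed with a +02:00 timestamp reads back with a zero offset, which is the "UTC-normalized on set" behaviour the contract calls for.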
OtOpcUa — DEEP (Tasks 2.1 + 2.2) · risk: LOW–MEDIUM
Verified current state: Commons AuditEvent is an 8-field positional record —
(Guid EventId, string Category, string Action, string Actor, DateTime OccurredAtUtc, string? DetailsJson, NodeId SourceNode, CorrelationId CorrelationId) — where NodeId/CorrelationId are readonly record struct newtypes over string/Guid. It is an Akka message delivered via DistributedPubSub
(provider=cluster) with default (reflection) serialization — no custom serializer. The structured
actor path is DORMANT: zero production emit sites construct/Tell an AuditEvent today (only the tests
do); all live audit goes through the bespoke stored-procedure path (sp_NodeApplied/sp_PublishGeneration/
sp_ValidateDraft/sp_RollbackToGeneration INSERT directly with ClusterId/GenerationId, NULL EventId).
AuditWriterActor (ControlPlane/Audit/AuditWriterActor.cs): 500/5s batching, two-layer dedup (in-buffer
Dictionary<Guid,AuditEvent> + DB filtered-unique UX_ConfigAuditLog_EventId), mapping at :75-84.
ConfigAuditLog (10 cols, no Outcome; ISJSON CHECK on DetailsJson). ClusterAudit.razor:78 filters
a.ClusterId == ClusterId, but the actor sets NodeId not ClusterId, so structured rows are invisible.
Package pinned 0.1.0 in Directory.Packages.props, feed-mapped, unreferenced.
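The ClusterAudit visibility gap described above can be reproduced in memory. The sketch below uses hypothetical, trimmed-down row shapes (AuditRow and ClusterNode are stand-ins, and string ids are an assumption) and LINQ-to-objects rather than the real EF query; it contrasts the current ClusterId-only filter with an OR-predicate that also accepts rows whose NodeId belongs to the cluster.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shapes; the real ConfigAuditLog / ClusterNodes entities have more columns.
public sealed record AuditRow(string? ClusterId, string? NodeId, string EventType);
public sealed record ClusterNode(string ClusterId, string NodeId);

public static class ClusterAuditQuerySketch
{
    // Current behaviour: only rows stamped with the cluster id are visible.
    public static IEnumerable<AuditRow> ByClusterIdOnly(
        IEnumerable<AuditRow> rows, string clusterId) =>
        rows.Where(a => a.ClusterId == clusterId);

    // OR-predicate: also accept rows with no ClusterId whose NodeId is a member of the cluster.
    public static IEnumerable<AuditRow> ByClusterOrNode(
        IEnumerable<AuditRow> rows, IEnumerable<ClusterNode> nodes, string clusterId)
    {
        var memberNodeIds = nodes
            .Where(n => n.ClusterId == clusterId)
            .Select(n => n.NodeId)
            .ToHashSet();

        return rows.Where(a =>
            a.ClusterId == clusterId ||
            (a.ClusterId == null && a.NodeId != null && memberNodeIds.Contains(a.NodeId)));
    }
}
```

A structured row written by the actor (NodeId set, ClusterId null) is dropped by the first filter and kept by the second, provided its node is a member of the cluster being viewed.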
Deep design — this is the easy one (the record is already ~canonical):
- 2.1 (high-risk: actor + contract): Delete Commons AuditEvent.cs; reference ZB.MOM.WW.Audit.AuditEvent from ZB.MOM.WW.OtOpcUa.Commons + …ControlPlane. Field map: EventId→EventId; OccurredAtUtc DateTime→DateTimeOffset (widen at construction); Actor/Action/Category/DetailsJson direct; SourceNode (unwrap NodeId.Value→string?); CorrelationId (unwrap .Value→Guid?); Target unused (null). OtOpcUa has no extra domain fields to push into DetailsJson, so no field relocation. Add the NEW required Outcome (derive: OpcUaAccessDenied/CrossClusterNamespaceAttempt→Denied; config verbs→Success; no Failure in OtOpcUa's vocabulary). AuditWriterActor : IAuditWriter (WriteAsync wraps the fire-and-forget Tell, returns Task.CompletedTask; trivially best-effort). Keep batching/dedup. Mapping at :75-84 becomes NodeId = evt.SourceNode, CorrelationId = evt.CorrelationId, Outcome = evt.Outcome, EventType = $"{evt.Category}:{evt.Action}" (storage keeps the composite). Value-type unwrap happens at the (test + future) construction sites. Akka wire note: the message type changes shape → a rolling-deploy wire break IN PRINCIPLE, but moot (no live emit traffic). Flag in the commit; no dual-accept window needed.
- 2.2 (high-risk: EF migration + UI query): add nullable Outcome to ConfigAuditLog (+ DbContext mapping :429-463) + EF migration AddConfigAuditLogOutcome (chains after 20260602112419_CanonicalizeAdminRoles). Fix ClusterAudit.razor:78 so ClusterId == null && NodeId resolves to the cluster (OR-predicate joining ClusterNodes, or populate ClusterId at flush). SP path stays bespoke (documented).
- Package refs: …Commons (record + AuditOutcome), …ControlPlane (IAuditWriter), …Configuration (only if Outcome is stored as the enum type; otherwise store string?/int? and skip).
- Effort: ~record swap 5m + actor seam 5m + Outcome derivation 5m (2.1); column+migration+query 5m (2.2).
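The Outcome derivation and the storage mapping in 2.1 are small enough to sketch. The class below is a hypothetical stand-in (AuditOutcomeSketch and its method names are not taken from the codebase), it matches the two denial actions by exact string as an assumption, and it assumes legacy DateTime values are already UTC.

```csharp
#nullable enable
using System;

// Mirrors the library enum so the sketch compiles on its own.
public enum AuditOutcome { Success, Failure, Denied }

public static class AuditOutcomeSketch
{
    // The two denial actions named in the design map to Denied;
    // every other (config) verb maps to Success. Failure is never produced.
    public static AuditOutcome FromAction(string action) => action switch
    {
        "OpcUaAccessDenied" => AuditOutcome.Denied,
        "CrossClusterNamespaceAttempt" => AuditOutcome.Denied,
        _ => AuditOutcome.Success,
    };

    // Storage keeps the composite "Category:Action" event type.
    public static string ToEventType(string? category, string action) =>
        $"{category}:{action}";

    // DateTime -> DateTimeOffset widening at construction; assumes the input is UTC.
    public static DateTimeOffset Widen(DateTime occurredAtUtc) =>
        new DateTimeOffset(DateTime.SpecifyKind(occurredAtUtc, DateTimeKind.Utc));
}
```

Keeping the derivation in one pure function means the test and future construction sites can share it, and a later Failure case would be a one-line addition.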
MxGateway — DEEP (Task 2.3, re-scoped) · risk: MEDIUM–HIGH (was "standard")
Verified current state — the plan's file refs are STALE: Phase 1 (Task 1.3) moved
IApiKeyAuditStore + ApiKeyAuditEntry + SqliteApiKeyAuditStore into the shared library
(ZB.MOM.WW.Auth.Abstractions/…ApiKeys 0.1.2) — they no longer exist in MxGateway. ApiKeyAuditEntry =
5 fields (string? KeyId, string EventType, string? RemoteAddress, DateTimeOffset CreatedUtc, string? Details),
persisted to the SQLite api_key_audit table (5 cols). IApiKeyAuditStore = AppendAsync + ListRecentAsync
(the dashboard "recent audit" view reads via ListRecentAsync). Three producers, but one is library-internal:
- ApiKeyAdminCommands (library-internal, in ZB.MOM.WW.Auth.ApiKeys): emits CLI/admin verbs (init-db/create-key/revoke-key/rotate-key/delete-key/set-scopes/enable-key/disable-key), keyless for init-db, RemoteAddress null on the CLI path. MxGateway cannot edit these call sites.
- DashboardApiKeyManagementService (MxGateway-local): dashboard-* verbs, real KeyId + RemoteAddress.
- ConstraintEnforcer.RecordDenialAsync (MxGateway-local): single constraint-denied EventType, RemoteAddress hardcoded null, Details = "{commandKind}: {target}: {ConstraintName}: {Message}".

AppendAsync currently propagates exceptions (no best-effort wrap). Serilog migration landed (no blocker). ZB.MOM.WW.Audit unreferenced; nuget.config already maps the package.
Deep design — the library-internal CLI producer forces an adapter:
- Add <PackageReference Include="ZB.MOM.WW.Audit" /> to …Server.
- New MxGateway-owned canonical store audit_event (SQLite, 9 canonical columns + details_json) with its own migrator. The existing api_key_audit lives in the library-owned auth DB schema, so we do NOT alter that schema. Implement IAuditWriter over the new store (best-effort try/catch; fixes the no-wrap gap).
- Adapter for the library-internal CLI events: register a MxGateway IApiKeyAuditStore impl whose AppendAsync(ApiKeyAuditEntry) maps → canonical AuditEvent (EventId=NewGuid; KeyId→Actor with "cli"/"system" fallback; EventType→Action; CreatedUtc→OccurredAtUtc; RemoteAddress→SourceNode; Outcome=Success; Category="ApiKey"; Target=KeyId; Details→DetailsJson wrapped {"detail":"…"}) and forwards to IAuditWriter. Its ListRecentAsync reads the canonical store and maps back to ApiKeyAuditEntry (so the existing dashboard recent-audit view keeps working) or the dashboard view is repointed to canonical.
- Local producers (DashboardApiKeyManagementService, ConstraintEnforcer) rewritten to build canonical AuditEvents directly via IAuditWriter (constraint-denied→Outcome.Denied; capture CorrelationId from MxCommandRequest.ClientCorrelationId (constraint path; needs threading down) / HttpContext.TraceIdentifier (dashboard); structured Target from commandKind/target (GAPS #6)).
- Open question for review: retire api_key_audit (canonical store becomes the sole audit table) vs keep it coexisting. Retiring is cleaner-deep but touches the library's store wiring; coexisting is lower-risk.
- Effort/classification: re-scoped from "standard ~5m" to high-risk (new store + migrator + adapter + producer rewrites + dashboard read path + DI + tests). Realistically 2–3 sub-commits.
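The adapter's field mapping and the best-effort wrap can be sketched as follows. The record declarations are trimmed stand-ins written from the field lists in this document, and ApiKeyAuditMapping and BestEffort are hypothetical names. The constraint-denied→Denied branch follows the implementation notes later in this document rather than the Outcome=Success shorthand in the design bullet.

```csharp
#nullable enable
using System;
using System.Text.Json;
using System.Threading.Tasks;

public enum AuditOutcome { Success, Failure, Denied }

// Stand-ins for the library types, reduced to the fields this document lists.
public sealed record ApiKeyAuditEntry(
    string? KeyId, string EventType, string? RemoteAddress,
    DateTimeOffset CreatedUtc, string? Details);

public sealed record AuditEvent(
    Guid EventId, DateTimeOffset OccurredAtUtc, string Actor, string Action,
    AuditOutcome Outcome, string? Category, string? Target,
    string? SourceNode, Guid? CorrelationId, string? DetailsJson);

public static class ApiKeyAuditMapping
{
    // Maps a legacy entry to the canonical record; keyless events fall back to "cli".
    public static AuditEvent ToCanonical(ApiKeyAuditEntry entry, string fallbackActor = "cli") =>
        new AuditEvent(
            EventId: Guid.NewGuid(),
            OccurredAtUtc: entry.CreatedUtc.ToUniversalTime(),
            Actor: entry.KeyId ?? fallbackActor,
            Action: entry.EventType,
            Outcome: entry.EventType == "constraint-denied"
                ? AuditOutcome.Denied
                : AuditOutcome.Success,
            Category: "ApiKey",
            Target: entry.KeyId,
            SourceNode: entry.RemoteAddress,
            CorrelationId: null,
            // Free-text details are wrapped so the column always holds valid JSON.
            DetailsJson: entry.Details is null
                ? null
                : JsonSerializer.Serialize(new { detail = entry.Details }));
}

public static class BestEffort
{
    // Runs a write and swallows any exception, so audit failures never reach the caller.
    public static async Task RunAsync(Func<Task> write)
    {
        try { await write(); }
        catch (Exception) { /* intentionally ignored: audit is best-effort */ }
    }
}
```

Serializing the wrapper object, instead of concatenating strings, keeps quotes and control characters in Details properly escaped.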
ScadaBridge — DEEP (Task 2.5, re-scoped) · risk: VERY HIGH — audit-subsystem re-architecture
This is the one to scrutinize at review. The gate definitively established that the plan's central claim is FALSE.
Verified current state: ScadaBridge's AuditEvent (…Commons/Entities/Audit/AuditEvent.cs) is a
24-field record — EventId, OccurredAtUtc(DateTime), IngestedAtUtc, Channel(AuditChannel), Kind(AuditKind), CorrelationId, ExecutionId, ParentExecutionId, SourceSiteId, SourceNode, SourceInstanceId, SourceScript, Actor, Target, Status(AuditStatus), HttpStatus, DurationMs, ErrorMessage, ErrorDetail, RequestSummary, ResponseSummary, PayloadTruncated, Extra, ForwardState(AuditForwardState?). It is the storage shape of a partitioned SQL Server
audit table with these as queryable columns. IAuditPayloadFilter.Apply(ScadaBridgeAuditEvent) -> ScadaBridgeAuditEvent (NOT the library's record — a reflection contract test PayloadFilterContractTests pins
the typing). IAuditWriter/ICentralAuditWriter are likewise typed to the 24-field record. AuditStatus
drives the site→central forwarding STATE MACHINE (Pending→Submitted→Forwarded→Reconciled;
Delivered/Failed/Parked/Discarded) and the filter's error-cap logic (IsErrorStatus). The Central
reporting/UI queries by Channel/Kind/Status/Site. Phase 1 did NOT touch any audit-pipeline file (zero
drift). Blast radius of just the interface rename: ~10 files / ~20 sites; the contract test pins it.
What DEEP adoption concretely requires here (full honesty):
Replacing the 24-field record with the 9-field canonical + pushing ~15 domain fields into DetailsJson means
re-architecting the entire audit subsystem, because those fields are not decorative — they are load-bearing:
- Storage: migrate the partitioned SQL Server audit table from ~24 typed columns to the 9 canonical columns + a JSON DetailsJson column. Massive, lossy-on-queryability data migration; partitioning scheme likely must change; IngestedAtUtc/ForwardState are operational columns the forwarder UPDATEs.
- Forwarding state machine breaks: Status/ForwardState move into opaque JSON. You cannot UPDATE a JSON-embedded field as a column, and the reconciliation queries WHERE Status/ForwardState = … stop working. The site→central forwarder would have to be redesigned (e.g., promote Status back out of JSON, defeating the point).
- Redactor breaks: DefaultAuditPayloadFilter reads Channel/Status/RequestSummary/ResponseSummary/ErrorDetail/Extra/PayloadTruncated to choose truncation caps. On a 9-field canonical record those are gone (opaque in DetailsJson), so the filter must be rewritten to parse JSON.
- Reporting/UI breaks: Central audit-log queries/filters by Channel/Kind/Status/Site lose SQL queryability.
- Dozens of call sites + the contract test + the perf hot-path test.
Honest assessment: ScadaBridge DEEP ≈ the largest single undertaking in the whole program (bigger than the Phase-1 ApiKeys re-arch). The audit component's own GAPS doc says "Align, don't replace" for exactly this reason.
Bounded alternative to weigh at review (recommended if "deep" is to be kept tractable): make the canonical
ZB.MOM.WW.Audit.AuditEvent the seam/transport + cross-project reporting shape (the redactor and an
IAuditWriter operate on the canonical record; domain richness rides in DetailsJson), while the SQL storage
keeps its typed queryable columns populated by a storage-side projection (canonical+DetailsJson → columns) and
the forwarding state machine continues to key on the Status/ForwardState columns. This delivers "deep" at the
seam/record level (library types consumed; domain fields in DetailsJson for the canonical view) without
gutting the partitioned store, the state machine, the filter, or the reporting — a far safer "deep."
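One way to picture the storage-side projection in the bounded alternative is shown below, under assumptions: the JSON key names (channel, kind, status, httpStatus) and the AuditRowColumns shape are hypothetical, and only a few of the queryable columns are included. The canonical fields pass through unchanged, while domain fields are lifted out of DetailsJson into typed columns.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical typed-column row; only a handful of the queryable columns are shown.
public sealed record AuditRowColumns(
    Guid EventId, DateTimeOffset OccurredAtUtc, string Actor,
    string? Channel, string? Kind, string? Status, int? HttpStatus);

public static class StorageProjectionSketch
{
    // Projects canonical fields plus DetailsJson into typed, queryable columns.
    public static AuditRowColumns Project(
        Guid eventId, DateTimeOffset occurredAtUtc, string actor, string? detailsJson)
    {
        string? channel = null, kind = null, status = null;
        int? httpStatus = null;

        if (!string.IsNullOrWhiteSpace(detailsJson))
        {
            try
            {
                using var doc = JsonDocument.Parse(detailsJson);
                var root = doc.RootElement;
                if (root.ValueKind == JsonValueKind.Object)
                {
                    channel = ReadString(root, "channel");
                    kind = ReadString(root, "kind");
                    status = ReadString(root, "status");
                    if (root.TryGetProperty("httpStatus", out var hs)
                        && hs.ValueKind == JsonValueKind.Number
                        && hs.TryGetInt32(out var code))
                    {
                        httpStatus = code;
                    }
                }
            }
            catch (JsonException)
            {
                // Malformed details: keep the canonical fields, leave typed columns null.
            }
        }

        return new AuditRowColumns(eventId, occurredAtUtc, actor, channel, kind, status, httpStatus);
    }

    private static string? ReadString(JsonElement obj, string name) =>
        obj.TryGetProperty(name, out var value) && value.ValueKind == JsonValueKind.String
            ? value.GetString()
            : null;
}
```

Because Status lands in a real column, the forwarder's UPDATE and WHERE Status = … queries would keep working against the projected row; the JSON is parsed once at write time rather than on every query.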
Cross-cutting
- Branch model: feat/adopt-zb-audit per app, stacked on feat/adopt-zb-auth HEAD (Phase 3 wires the audit Actor from the Phase-1 Auth principal, so audit must build on auth). Local-only, never pushed.
- No library change / republish needed for the chosen designs (MxGateway adapts in-repo), so no Gitea token required unless the user later wants the canonical mapping pushed into a shared lib.
- Phase 3 (unchanged in intent): IAuditActorAccessor seam + wire AuditEvent.Actor from the Auth principal at every authenticated emit site; keep "system"/"cli" fallbacks for keyless paths.
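For the Phase 3 seam, a minimal sketch of what an actor accessor might look like is given below. This document names IAuditActorAccessor but not its members, so the GetActor method, the PrincipalAuditActorAccessor class, and the use of ClaimsPrincipal as the Auth principal are all assumptions made for illustration.

```csharp
#nullable enable
using System;
using System.Security.Claims;

// Hypothetical member shape; only the interface name comes from the plan.
public interface IAuditActorAccessor
{
    string GetActor();
}

public sealed class PrincipalAuditActorAccessor : IAuditActorAccessor
{
    private readonly Func<ClaimsPrincipal?> _currentPrincipal;
    private readonly string _fallback;

    // The fallback covers keyless paths ("system" for services, "cli" for the CLI).
    public PrincipalAuditActorAccessor(
        Func<ClaimsPrincipal?> currentPrincipal, string fallback = "system")
    {
        _currentPrincipal = currentPrincipal;
        _fallback = fallback;
    }

    public string GetActor()
    {
        var identity = _currentPrincipal()?.Identity;
        if (identity is { IsAuthenticated: true } && !string.IsNullOrEmpty(identity.Name))
        {
            return identity.Name!;
        }
        return _fallback;
    }
}
```

Injecting the principal through a delegate keeps the accessor usable from both HTTP request scopes and non-HTTP emit sites, which is where the fallback values apply.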
Re-scoped task list (for review)
| # | Repo | Re-scoped scope | Class | Risk |
|---|---|---|---|---|
| 2.1 | OtOpcUa | Commons record → canonical AuditEvent; AuditWriterActor : IAuditWriter; Outcome derivation; Akka-wire note (dormant) | high-risk | Low–Med |
| 2.2 | OtOpcUa | ConfigAuditLog.Outcome column + EF migration + ClusterAudit visibility fix; SP path bespoke | high-risk | Low–Med |
| 2.3 | MxGateway | new canonical SQLite audit_event store + migrator; IAuditWriter; IApiKeyAuditStore→canonical adapter (for library-internal CLI events) incl. ListRecentAsync; rewrite local producers; CorrelationId/Target capture; DI; tests | high-risk (↑ from standard) | Med–High |
| 2.5 | ScadaBridge | DEEP = audit-subsystem re-arch (24-field→9-field record everywhere; domain fields→DetailsJson; SQL partitioned-table migration; forwarding state machine + filter + reporting rewrite; contract/perf tests), OR the bounded "deep-at-the-seam" alternative above | very-high-risk | VERY HIGH |
Implementation status (2026-06-02, deep adoption underway)
- ✅ OtOpcUa 2.1 + 2.2 DONE (feat/adopt-zb-audit, spec ✅ + code ✅). 933dd1a: deleted bespoke Commons AuditEvent, adopted library ZB.MOM.WW.Audit.AuditEvent, AuditWriterActor : IAuditWriter (best-effort WriteAsync wraps Self.Tell), AuditOutcomeMapper.FromAction derivation, batching/dedup intact. b7f5e88: nullable Outcome column + migration 20260602135350_AddConfigAuditLogOutcome (additive, chains after CanonicalizeAdminRoles, no pending model changes) + ClusterAudit fix via shared ClusterAuditQuery (OR-predicate joining ClusterNode membership). SP path untouched. ControlPlane 45/45, Configuration 80/80 (+3), AdminUI 121/121. Minor backlog: no IX_ConfigAuditLog_NodeId (irrelevant while structured path dormant).
- ✅ MxGateway 2.3 DONE (feat/adopt-zb-audit, spec ✅ + code ✅). a5944bb: new MxGateway-owned canonical SQLite audit_event store (same auth DB file via the library's AuthSqliteConnectionFactory; library tables untouched), CanonicalAuditWriter : IAuditWriter (best-effort, never throws; closes the library's no-wrap gap), CanonicalForwardingApiKeyAuditStore : IApiKeyAuditStore adapter (maps ApiKeyAuditEntry→canonical w/ system/cli fallback + constraint-denied→Denied + DetailsJson wrap; ListRecent round-trips for the dashboard view), DI overrides the library's TryAddSingleton'd store. 7ea8358: Dashboard + ConstraintEnforcer rewritten to emit canonical AuditEvent directly via IAuditWriter with structured Target + (dashboard) CorrelationId. 587 pass, 3 pre-existing FakeWorker reds, +10 tests. api_key_audit left unused (documented). Minor backlog: dup WrapDetail, per-op EnsureTable, a test temp-dir leak, unfiltered ListRecent category.
- ✅ ScadaBridge 2.5 DONE (FULL re-arch, user-chosen). Decomposed into C1–C7 (design in 2026-06-02-scadabridge-audit-rearch.md), all spec+code reviewed, MSSQL-verified, local-only on feat/adopt-zb-audit. Canonical record everywhere; site SQLite two-table (canonical + forwarding sidecar); central dbo.AuditLog collapsed to 10 canonical cols + persisted computed cols (CollapseAuditLogToCanonical migration); redactor/outcome/UI/export/CLI all canonical. Forwarding state machine preserved (sidecar) + queryability preserved (persisted computed columns): the design's key insight that central is append-only made pure-9-col central feasible without gutting forwarding.
Open items to confirm at review
- ScadaBridge: full audit re-architecture (pure 9-col storage) vs the bounded "deep-at-the-seam" variant (canonical record at the seam/reporting boundary; keep typed storage columns + state machine). Strongly recommend the bounded variant.
- MxGateway: retire api_key_audit (canonical store is sole) vs keep it coexisting.
- OtOpcUa: confirm leaving the SP path bespoke (structured path is dormant; canonicalization is forward-looking prep) is acceptable, and the ClusterAudit fix approach (OR-predicate vs populate ClusterId).
- Sequencing: OtOpcUa (2.1→2.2) and MxGateway (2.3) are independent + tractable; ScadaBridge (2.5) is the gating risk. Do it last, and as staged reviewed sub-commits regardless of variant.