plan(phase2): Task 2.0 gate DONE — verified plan specs materially off (MxGw store moved to lib, OtOpcUa path dormant, SB rename structurally impossible); user chose DEEP adopt + pause; corrected deep design in -phase2-deep.md; PAUSED for review
This commit is contained in:
@@ -0,0 +1,183 @@
|
||||
# Phase 2 (Audit adoption) — Task 2.0 gate findings + DEEP re-scope (for review)
|
||||
|
||||
Companion to `2026-06-02-auth-audit-normalization.md`. Produced by the **Task 2.0 read-only
|
||||
verification gate** (3 parallel explorers, all paths verified 2026-06-02 against live code on each
|
||||
repo's `feat/adopt-zb-auth` HEAD). **Status: PAUSED for user review before any audit code is written.**
|
||||
|
||||
**Decisions taken (2026-06-02):**
|
||||
- **Depth = DEEP adopt (canonical record).** Each app's audit record becomes the library's 9-field
|
||||
`ZB.MOM.WW.Audit.AuditEvent`; domain-specific fields relocate into `DetailsJson`; each app consumes
|
||||
the library's `IAuditWriter`/`IAuditRedactor`/`AuditOutcome` types. (User chose this over the
|
||||
gate-recommended lighter "Align" — consistent with the standing maximal/full-adopt directive.)
|
||||
- **Cadence = re-scope + PAUSE for review.** This doc is the review artifact; implementation does not
|
||||
start until the user signs off (especially on the ScadaBridge cost, below).
|
||||
|
||||
> **Why a re-scope was needed:** the plan's Phase 2 task specs were written from optimistic
|
||||
> `components/audit/current-state/*` docs (see [[component-status-claims-are-optimistic]]). The gate
|
||||
> found all three repos' specs are materially off — file refs moved (MxGateway), the target path is
|
||||
> dormant (OtOpcUa), and the "outright rename" is structurally impossible (ScadaBridge).
|
||||
|
||||
---
|
||||
|
||||
## The canonical contract (shared `ZB.MOM.WW.Audit` 0.1.0)
|
||||
|
||||
`AuditEvent` (sealed record): REQUIRED `EventId:Guid`, `OccurredAtUtc:DateTimeOffset` (UTC-normalized
|
||||
on set), `Actor:string`, `Action:string`, `Outcome:AuditOutcome`; OPTIONAL `Category:string?`,
|
||||
`Target:string?`, `SourceNode:string?`, `CorrelationId:Guid?`, `DetailsJson:string?`. **Nine fields.**
|
||||
`AuditOutcome { Success, Failure, Denied }`. `IAuditWriter.WriteAsync(AuditEvent, CancellationToken)` —
|
||||
best-effort, never throws. `IAuditRedactor.Apply(AuditEvent) -> AuditEvent` — pure, never throws.
|
||||
The package is pinned (central PM / explicit) + feed-mapped in all three repos; **referenced by none yet.**
|
||||
|
||||
---
|
||||
|
||||
## OtOpcUa — DEEP (Tasks 2.1 + 2.2) · risk: LOW–MEDIUM
|
||||
|
||||
**Verified current state:** Commons `AuditEvent` is an **8-field positional record** —
|
||||
`(Guid EventId, string Category, string Action, string Actor, DateTime OccurredAtUtc, string? DetailsJson,
|
||||
NodeId SourceNode, CorrelationId CorrelationId)` — where `NodeId`/`CorrelationId` are `readonly record
|
||||
struct` newtypes over `string`/`Guid`. It is an **Akka message** delivered via `DistributedPubSub`
|
||||
(`provider=cluster`) with **default (reflection) serialization** — no custom serializer. **The structured
|
||||
actor path is DORMANT: zero production emit sites** construct/`Tell` an `AuditEvent` today (only the tests
|
||||
do); all live audit goes through the bespoke **stored-procedure path** (`sp_NodeApplied`/`sp_PublishGeneration`/
|
||||
`sp_ValidateDraft`/`sp_RollbackToGeneration` INSERT directly with `ClusterId`/`GenerationId`, NULL `EventId`).
|
||||
`AuditWriterActor` (`ControlPlane/Audit/AuditWriterActor.cs`): 500/5s batching, two-layer dedup (in-buffer
|
||||
`Dictionary<Guid,AuditEvent>` + DB filtered-unique `UX_ConfigAuditLog_EventId`), mapping at `:75-84`.
|
||||
`ConfigAuditLog` (10 cols, no `Outcome`; `ISJSON` CHECK on `DetailsJson`). `ClusterAudit.razor:78` filters
|
||||
`a.ClusterId == ClusterId`, but the actor sets `NodeId` not `ClusterId`, so structured rows are invisible.
|
||||
Package pinned `0.1.0` in `Directory.Packages.props`, feed-mapped, unreferenced.
|
||||
|
||||
**Deep design — this is the easy one (the record is already ~canonical):**
|
||||
- **2.1 (high-risk: actor + contract):** Delete Commons `AuditEvent.cs`; reference `ZB.MOM.WW.Audit.AuditEvent`
|
||||
from `ZB.MOM.WW.OtOpcUa.Commons` + `…ControlPlane`. Field map: `EventId`→`EventId`; `OccurredAtUtc`
|
||||
`DateTime`→`DateTimeOffset` (widen at construction); `Actor`/`Action`/`Category`/`DetailsJson` direct;
|
||||
`SourceNode` (unwrap `NodeId.Value`→`string?`); `CorrelationId` (unwrap `.Value`→`Guid?`); `Target` unused
|
||||
(null) — OtOpcUa has no extra domain fields to push into `DetailsJson`, so **no field relocation**. Add the
|
||||
NEW required `Outcome` (derive: `OpcUaAccessDenied`/`CrossClusterNamespaceAttempt`→`Denied`; config verbs→
|
||||
`Success`; no `Failure` in OtOpcUa's vocabulary). `AuditWriterActor : IAuditWriter` (`WriteAsync` wraps the
|
||||
fire-and-forget `Tell`, returns `Task.CompletedTask` — trivially best-effort). Keep batching/dedup. Mapping
|
||||
at `:75-84` becomes `NodeId = evt.SourceNode`, `CorrelationId = evt.CorrelationId`, `Outcome = evt.Outcome`,
|
||||
`EventType = $"{evt.Category}:{evt.Action}"` (storage keeps the composite). Value-type unwrap happens at the
|
||||
(test + future) construction sites. **Akka wire note:** the message type changes shape → a rolling-deploy
|
||||
wire break IN PRINCIPLE, but **moot** (no live emit traffic). Flag in the commit; no dual-accept window needed.
|
||||
- **2.2 (high-risk: EF migration + UI query):** add nullable `Outcome` to `ConfigAuditLog` (+ DbContext mapping
|
||||
`:429-463`) + EF migration `AddConfigAuditLogOutcome` (chains after `20260602112419_CanonicalizeAdminRoles`).
|
||||
Fix `ClusterAudit.razor:78` so `ClusterId == null && NodeId` resolves to the cluster (OR-predicate joining
|
||||
`ClusterNodes`, or populate `ClusterId` at flush). SP path stays bespoke (documented).
|
||||
- **Package refs:** `…Commons` (record + `AuditOutcome`), `…ControlPlane` (`IAuditWriter`), `…Configuration`
|
||||
(only if `Outcome` is stored as the enum type; otherwise store `string?`/`int?` and skip).
|
||||
- **Effort:** ~record swap 5m + actor seam 5m + Outcome derivation 5m (2.1); column+migration+query 5m (2.2).
|
||||
|
||||
---
|
||||
|
||||
## MxGateway — DEEP (Task 2.3, re-scoped) · risk: MEDIUM–HIGH (was "standard")
|
||||
|
||||
**Verified current state — the plan's file refs are STALE:** Phase 1 (Task 1.3) **moved**
|
||||
`IApiKeyAuditStore` + `ApiKeyAuditEntry` + `SqliteApiKeyAuditStore` **into the shared library**
|
||||
(`ZB.MOM.WW.Auth.Abstractions`/`…ApiKeys` 0.1.2) — they no longer exist in MxGateway. `ApiKeyAuditEntry` =
|
||||
**5 fields** `(string? KeyId, string EventType, string? RemoteAddress, DateTimeOffset CreatedUtc, string? Details)`,
|
||||
persisted to the SQLite `api_key_audit` table (5 cols). `IApiKeyAuditStore` = `AppendAsync` + `ListRecentAsync`
|
||||
(the dashboard "recent audit" view reads via `ListRecentAsync`). **Three producers, but one is library-internal:**
|
||||
- `ApiKeyAdminCommands` (**library-internal**, in `ZB.MOM.WW.Auth.ApiKeys`) — emits CLI/admin verbs
|
||||
(`init-db`/`create-key`/`revoke-key`/`rotate-key`/`delete-key`/`set-scopes`/`enable-key`/`disable-key`),
|
||||
keyless for `init-db`, `RemoteAddress` null on the CLI path. **MxGateway cannot edit these call sites.**
|
||||
- `DashboardApiKeyManagementService` (MxGateway-local) — `dashboard-*` verbs, real `KeyId` + `RemoteAddress`.
|
||||
- `ConstraintEnforcer.RecordDenialAsync` (MxGateway-local) — single `constraint-denied` EventType, `RemoteAddress`
|
||||
hardcoded null, `Details = "{commandKind}: {target}: {ConstraintName}: {Message}"`.
|
||||
`AppendAsync` currently **propagates** exceptions (no best-effort wrap). Serilog migration **landed** (no blocker).
|
||||
`ZB.MOM.WW.Audit` unreferenced; `nuget.config` already maps the package.
|
||||
|
||||
**Deep design — the library-internal CLI producer forces an adapter:**
|
||||
- Add `<PackageReference Include="ZB.MOM.WW.Audit" />` to `…Server`.
|
||||
- New **MxGateway-owned canonical store** `audit_event` (SQLite, 9 canonical columns + `details_json`) with its own
|
||||
migrator — the existing `api_key_audit` lives in the **library-owned** auth DB schema, so we do NOT alter that
|
||||
schema. Implement `IAuditWriter` over the new store (best-effort try/catch — fixes the no-wrap gap).
|
||||
- **Adapter for the library-internal CLI events:** register a MxGateway `IApiKeyAuditStore` impl whose
|
||||
`AppendAsync(ApiKeyAuditEntry)` maps → canonical `AuditEvent` (`EventId=NewGuid`; `KeyId`→`Actor` with
|
||||
`"cli"`/`"system"` fallback; `EventType`→`Action`; `CreatedUtc`→`OccurredAtUtc`; `RemoteAddress`→`SourceNode`;
|
||||
`Outcome=Success`; `Category="ApiKey"`; `Target=KeyId`; `Details`→`DetailsJson` wrapped `{"detail":"…"}`) and
|
||||
forwards to `IAuditWriter`. Its `ListRecentAsync` reads the canonical store and maps back to `ApiKeyAuditEntry`
|
||||
(so the existing dashboard recent-audit view keeps working) **or** the dashboard view is repointed to canonical.
|
||||
- **Local producers** (`DashboardApiKeyManagementService`, `ConstraintEnforcer`) rewritten to build canonical
|
||||
`AuditEvent`s directly via `IAuditWriter` (`constraint-denied`→`Outcome.Denied`; capture `CorrelationId` from
|
||||
`MxCommandRequest.ClientCorrelationId` (constraint path — needs threading down) / `HttpContext.TraceIdentifier`
|
||||
(dashboard); structured `Target` from `commandKind`/`target` (GAPS #6)).
|
||||
- **Open question for review:** retire `api_key_audit` (canonical store becomes the sole audit table) vs keep it
|
||||
coexisting. Retiring is cleaner-deep but touches the library's store wiring; coexisting is lower-risk.
|
||||
- **Effort/classification:** re-scoped from "standard ~5m" to **high-risk** (new store + migrator + adapter +
|
||||
producer rewrites + dashboard read path + DI + tests). Realistically 2–3 sub-commits.
|
||||
|
||||
---
|
||||
|
||||
## ScadaBridge — DEEP (Task 2.5, re-scoped) · risk: **VERY HIGH — audit-subsystem re-architecture**
|
||||
|
||||
**This is the one to scrutinize at review.** The gate definitively answered the plan's central claim is FALSE.
|
||||
|
||||
**Verified current state:** ScadaBridge's `AuditEvent` (`…Commons/Entities/Audit/AuditEvent.cs`) is a
|
||||
**24-field** record — `EventId, OccurredAtUtc(DateTime), IngestedAtUtc, Channel(AuditChannel), Kind(AuditKind),
|
||||
CorrelationId, ExecutionId, ParentExecutionId, SourceSiteId, SourceNode, SourceInstanceId, SourceScript, Actor,
|
||||
Target, Status(AuditStatus), HttpStatus, DurationMs, ErrorMessage, ErrorDetail, RequestSummary, ResponseSummary,
|
||||
PayloadTruncated, Extra, ForwardState(AuditForwardState?)`. It is the **storage shape of a partitioned SQL Server
|
||||
audit table** with these as **queryable columns**. `IAuditPayloadFilter.Apply(ScadaBridgeAuditEvent) ->
|
||||
ScadaBridgeAuditEvent` (NOT the library's record — a reflection contract test `PayloadFilterContractTests` pins
|
||||
the typing). `IAuditWriter`/`ICentralAuditWriter` are likewise typed to the 24-field record. **`AuditStatus`
|
||||
drives the site→central forwarding STATE MACHINE** (`Pending→Submitted→Forwarded→Reconciled`;
|
||||
`Delivered`/`Failed`/`Parked`/`Discarded`) and the **filter's error-cap logic** (`IsErrorStatus`). The Central
|
||||
reporting/UI queries by `Channel`/`Kind`/`Status`/`Site`. **Phase 1 did NOT touch any audit-pipeline file** (zero
|
||||
drift). Blast radius of just the interface rename: ~10 files / ~20 sites; the contract test pins it.
|
||||
|
||||
**What DEEP adoption concretely requires here (full honesty):**
|
||||
Replacing the 24-field record with the 9-field canonical + pushing ~15 domain fields into `DetailsJson` means
|
||||
**re-architecting the entire audit subsystem**, because those fields are not decorative — they are load-bearing:
|
||||
1. **Storage:** migrate the partitioned SQL Server audit table from ~24 typed columns to the 9 canonical columns
|
||||
+ a JSON `DetailsJson` column. Massive, lossy-on-queryability data migration; partitioning scheme likely must
|
||||
change; `IngestedAtUtc`/`ForwardState` are operational columns the forwarder UPDATEs.
|
||||
3. **Forwarding state machine breaks:** `Status`/`ForwardState` move into opaque JSON — you cannot `UPDATE` a
|
||||
JSON-embedded field as a column, and the reconciliation queries `WHERE Status/ForwardState = …` stop working.
|
||||
The site→central forwarder would have to be redesigned (e.g., promote Status back out of JSON, defeating the
|
||||
point).
|
||||
4. **Redactor breaks:** `DefaultAuditPayloadFilter` reads `Channel`/`Status`/`RequestSummary`/`ResponseSummary`/
|
||||
`ErrorDetail`/`Extra`/`PayloadTruncated` to choose truncation caps — on a 9-field canonical record those are
|
||||
gone (opaque in `DetailsJson`), so the filter must be rewritten to parse JSON.
|
||||
5. **Reporting/UI breaks:** Central audit-log queries/filters by Channel/Kind/Status/Site lose SQL queryability.
|
||||
6. ~Dozens of call sites + the contract test + the perf hot-path test.
|
||||
|
||||
**Honest assessment:** ScadaBridge DEEP ≈ the **largest single undertaking in the whole program** (bigger than the
|
||||
Phase-1 ApiKeys re-arch). The audit component's own GAPS doc says *"Align, don't replace"* for exactly this reason.
|
||||
|
||||
**Bounded alternative to weigh at review (recommended if "deep" is to be kept tractable):** make the canonical
|
||||
`ZB.MOM.WW.Audit.AuditEvent` the **seam/transport + cross-project reporting** shape (the redactor and an
|
||||
`IAuditWriter` operate on the canonical record; domain richness rides in `DetailsJson`), while the **SQL storage
|
||||
keeps its typed queryable columns** populated by a storage-side projection (canonical+DetailsJson → columns) and
|
||||
the forwarding state machine continues to key on the `Status`/`ForwardState` columns. This delivers "deep" at the
|
||||
seam/record level (library types consumed; domain fields in `DetailsJson` for the canonical view) **without**
|
||||
gutting the partitioned store, the state machine, the filter, or the reporting — a far safer "deep."
|
||||
|
||||
---
|
||||
|
||||
## Cross-cutting
|
||||
|
||||
- **Branch model:** `feat/adopt-zb-audit` per app, **stacked on `feat/adopt-zb-auth` HEAD** (Phase 3 wires the
|
||||
audit `Actor` from the Phase-1 Auth principal, so audit must build on auth). Local-only, never pushed.
|
||||
- **No library change / republish** needed for the chosen designs (MxGateway adapts in-repo) — so no Gitea token
|
||||
required unless the user later wants the canonical mapping pushed into a shared lib.
|
||||
- **Phase 3 (unchanged in intent):** `IAuditActorAccessor` seam + wire `AuditEvent.Actor` from the Auth principal
|
||||
at every authenticated emit site; keep `"system"`/`"cli"` fallbacks for keyless paths.
|
||||
|
||||
## Re-scoped task list (for review)
|
||||
|
||||
| # | Repo | Re-scoped scope | Class | Risk |
|
||||
|---|---|---|---|---|
|
||||
| 2.1 | OtOpcUa | Commons record → canonical `AuditEvent`; `AuditWriterActor : IAuditWriter`; `Outcome` derivation; Akka-wire note (dormant) | high-risk | Low–Med |
|
||||
| 2.2 | OtOpcUa | `ConfigAuditLog.Outcome` column + EF migration + `ClusterAudit` visibility fix; SP path bespoke | high-risk | Low–Med |
|
||||
| 2.3 | MxGateway | new canonical SQLite `audit_event` store + migrator; `IAuditWriter`; `IApiKeyAuditStore`→canonical adapter (for library-internal CLI events) incl. `ListRecentAsync`; rewrite local producers; CorrelationId/Target capture; DI; tests | **high-risk** (↑ from standard) | Med–High |
|
||||
| 2.5 | ScadaBridge | **DEEP = audit-subsystem re-arch** (24-field→9-field record everywhere; domain fields→`DetailsJson`; SQL partitioned-table migration; forwarding state machine + filter + reporting rewrite; contract/perf tests) — **OR** the bounded "deep-at-the-seam" alternative above | **very-high-risk** | **VERY HIGH** |
|
||||
|
||||
## Open items to confirm at review
|
||||
1. **ScadaBridge:** full audit re-architecture (pure 9-col storage) vs the **bounded "deep-at-the-seam"** variant
|
||||
(canonical record at the seam/reporting boundary; keep typed storage columns + state machine). Strongly
|
||||
recommend the bounded variant.
|
||||
2. **MxGateway:** retire `api_key_audit` (canonical store is sole) vs keep it coexisting.
|
||||
3. **OtOpcUa:** confirm leaving the SP path bespoke (structured path is dormant; canonicalization is forward-looking
|
||||
prep) is acceptable, and the `ClusterAudit` fix approach (OR-predicate vs populate `ClusterId`).
|
||||
4. **Sequencing:** OtOpcUa (2.1→2.2) and MxGateway (2.3) are independent + tractable; ScadaBridge (2.5) is the
|
||||
gating risk — do it last, and as staged reviewed sub-commits regardless of variant.
|
||||
@@ -239,6 +239,16 @@ Order within the phase (per `components/auth/GAPS.md` sequencing): **#3 seam →
|
||||
|
||||
## PHASE 2 — Audit adoption (audit GAPS #1–#3, #5, #6)
|
||||
|
||||
> ⚠️ **RE-SCOPED 2026-06-02 — the task specs below are SUPERSEDED.** The Task 2.0 gate (verified against
|
||||
> live code) found these specs materially wrong: MxGateway's audit files moved into the shared library
|
||||
> (Phase 1), OtOpcUa's structured audit path is dormant (zero emit sites), and the ScadaBridge
|
||||
> "outright rename" is structurally impossible (its filter is typed to its own 24-field record, not the
|
||||
> library's 9-field one). The user chose **DEEP adopt (canonical record)** + **pause for review**. The
|
||||
> corrected, gate-grounded deep design is in
|
||||
> [`2026-06-02-auth-audit-normalization-phase2-deep.md`](2026-06-02-auth-audit-normalization-phase2-deep.md)
|
||||
> — **implementation is PAUSED pending user review of that doc (esp. the ScadaBridge audit-subsystem
|
||||
> re-architecture cost).** The original specs below are kept for historical context only.
|
||||
|
||||
Branch `feat/adopt-zb-audit` per repo. Behaviour-preserving except the OtOpcUa `Outcome` column + `ClusterId` visibility fix. Concrete paths below come from `components/audit/current-state/*`.
|
||||
|
||||
### Task 2.0: Explore audit source + confirm elaboration *(GATE — light, paths already known)*
|
||||
|
||||
@@ -23,7 +23,7 @@
|
||||
{"id": 23, "subject": "Task 1.6: Unify dev base DN (#6) — DONE all 3 to dc=zb,dc=local (OtOpcUa 8ba289f, MxGw 9572045, SB 6ae6051)", "status": "completed", "blockedBy": [17]},
|
||||
{"id": 24, "subject": "Task 1.7: Canonical roles native expansion (#8) [high-risk] — DONE all 3, full-value canonical (MxGw 04bce3ff, OtOpcUa c1619d9 +DB-mig, SB b104760+4118452 +DB-mig +SoD collapse)", "status": "completed", "blockedBy": [18]},
|
||||
|
||||
{"id": 25, "subject": "Task 2.0: GATE confirm audit source refs", "status": "pending", "blockedBy": [8]},
|
||||
{"id": 25, "subject": "Task 2.0: GATE confirm audit source refs — DONE; found plan specs materially off → DEEP re-scope in -phase2-deep.md; PAUSED for user review before 2.1/2.2/2.3/2.5", "status": "completed", "blockedBy": [8]},
|
||||
{"id": 26, "subject": "Task 2.1: OtOpcUa canonical record + IAuditWriter + Outcome (#1) [high-risk]", "status": "pending", "blockedBy": [25]},
|
||||
{"id": 27, "subject": "Task 2.2: OtOpcUa Outcome migration + ClusterId fix (#1,#5) [high-risk]", "status": "pending", "blockedBy": [26]},
|
||||
{"id": 28, "subject": "Task 2.3: MxGateway store→IAuditWriter adapter (#2,#6)", "status": "pending", "blockedBy": [25]},
|
||||
|
||||
Reference in New Issue
Block a user