merge: integrate WaitAsync/M5-audit (parallel session) with galaxy array-write + inbound-timeout fixes
@@ -0,0 +1,150 @@
# M5 — Audit Hardening (T3–T8) — Design

**Status:** Approved (awaiting plan).
**Worktree/branch:** `worktree-m5-audit-hardening` off `main` (`e77e209`).
**Source:** Phase-2 milestone M5 from `docs/plans/2026-06-15-stillpending-completion-design.md`.

## Goal

Harden the centralized Audit Log with six independent, ready-to-build items. Two
items originally listed under M5 — **T1 hash-chain tamper evidence** and **T2
Parquet export** — remain **deferred to v1.x** (per CLAUDE.md's audit design
decisions); their stubs (CLI `verify-chain` no-op, export `501`) stay unchanged.

## Scope (in)

T3 per-channel retention · T4 ParentExecutionId tag-cascade · T5 historical
backfill (reframed) · T6 per-node stuck KPIs · T7 structured response-capture
increments · T8 CLI `audit tree`.

## Scope (out / deferred to v1.x)

T1 hash-chain (no Hash/PrevHash columns, no real verify-chain), T2 Parquet
export (the `501` gate stays). Reversing those deferrals is a separate decision.

---

## Items

### T8 — CLI `audit tree` (smallest; reuses existing server walk + UI)
The recursive execution-tree walk (`IAuditLogRepository.GetExecutionTreeAsync`,
backed by `IX_AuditLog_ParentExecution`) and the Blazor `ExecutionTreePage`
already exist; only an HTTP projection + CLI surface are missing.
- **Server:** add `GET /api/audit/tree?executionId=…` in
  `AuditEndpoints.MapAuditAPI` → `repo.GetExecutionTreeAsync` → serialize
  `ExecutionTreeNode[]`.
- **CLI:** add `audit tree --execution-id <guid> [--format table|json]` in
  `AuditCommands` + an `AuditTreeHelpers` renderer (indented ASCII tree for
  `table`; raw nodes for `json`), mirroring `AuditQueryHelpers`/`AuditExportHelpers`.
- No schema change. **Tests:** endpoint returns the tree; CLI renders a
  multi-level tree + handles not-found.

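The `table` rendering described for T8 could look roughly like the sketch below. It is illustrative only: `SketchTreeNode` and `AuditTreeSketch` are hypothetical names, and the real `ExecutionTreeNode` may carry different fields. The sketch assumes the endpoint returns a flat node list in which each node names its parent.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical node shape; the real ExecutionTreeNode is not shown in this design.
public sealed record SketchTreeNode(Guid ExecutionId, Guid? ParentExecutionId, string Label);

public static class AuditTreeSketch
{
    // Renders a flat node list as an indented ASCII tree, two spaces per depth level.
    public static string Render(IReadOnlyList<SketchTreeNode> nodes)
    {
        if (nodes.Count == 0) return "(no rows for this execution id)";

        var ids = nodes.Select(n => n.ExecutionId).ToHashSet();

        // A node is a child only when its parent is part of the result set.
        bool HasKnownParent(SketchTreeNode n) =>
            n.ParentExecutionId.HasValue && ids.Contains(n.ParentExecutionId.Value);

        var childrenByParent = nodes
            .Where(HasKnownParent)
            .ToLookup(n => n.ParentExecutionId!.Value);

        var sb = new StringBuilder();
        foreach (var root in nodes.Where(n => !HasKnownParent(n)))
            Append(sb, root, childrenByParent, 0);
        return sb.ToString().TrimEnd();
    }

    private static void Append(StringBuilder sb, SketchTreeNode node,
        ILookup<Guid, SketchTreeNode> childrenByParent, int depth)
    {
        sb.Append(' ', depth * 2).Append(depth == 0 ? "" : "+- ").AppendLine(node.Label);
        foreach (var child in childrenByParent[node.ExecutionId])
            Append(sb, child, childrenByParent, depth + 1);
    }
}
```

Treating a node whose parent is absent from the result as a root keeps the renderer usable when the walk starts mid-tree; whether the real helper should do that is a design choice this sketch does not settle.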
### T6 — Per-node stuck-count KPIs
KPIs are per-site today; `SourceNode` is on the `Notification` and `SiteCalls`
rows but not aggregated.
- Add `ComputePerNodeKpisAsync` (group by `SourceNode`) parallel to the existing
  `ComputePerSiteKpisAsync` in `NotificationOutboxRepository` and
  `SiteCallAuditRepository`.
- New `PerNode…KpiRequest`/`Response` message pair per actor; register in each
  actor's `Receive<>`.
- Surface a per-node breakdown on the existing KPI tiles
  (`AuditKpiTiles`/`SiteCallKpiTiles`) — additive, behind the existing tiles.
- **Tests:** repository grouping returns correct per-node counts (stuck/parked/
  queue-depth); message round-trip.

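As a rough illustration of the per-node grouping, the following in-memory sketch groups rows by `SourceNode` and counts each state. The row shape, the status strings (`"Stuck"`, `"Parked"`, `"Pending"`), the `"(none)"` bucket for `NULL` nodes, and the reading of queue depth as the pending count are all assumptions made for the example; the real repositories would express this as an EF Core query against their own columns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row and result shapes; the real entities and status values may differ.
public sealed record OutboxRowSketch(string? SourceNode, string Status);
public sealed record PerNodeKpiSketch(string SourceNode, int Stuck, int Parked, int QueueDepth);

public static class PerNodeKpiCalculatorSketch
{
    // Groups rows by node and counts each state; rows without a node share one bucket.
    public static IReadOnlyList<PerNodeKpiSketch> Compute(IEnumerable<OutboxRowSketch> rows) =>
        rows.GroupBy(r => r.SourceNode ?? "(none)")
            .Select(g => new PerNodeKpiSketch(
                g.Key,
                g.Count(r => r.Status == "Stuck"),
                g.Count(r => r.Status == "Parked"),
                g.Count(r => r.Status == "Pending")))   // assumed meaning of queue depth
            .OrderBy(k => k.SourceNode, StringComparer.Ordinal)
            .ToList();
}
```

Keeping the per-node computation separate from `ComputePerSiteKpisAsync` matches the "additive" intent above: the per-site numbers are left untouched.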
### T7 — Structured response-capture increments (no schema change)
- **(a) Inbound request headers** → captured into the existing `Extra` JSON in
  `AuditWriteMiddleware.EmitInboundAudit`, passed through the existing header
  redactor (auth headers redacted by default).
- **(b) `AuditInboundCeilingHits`** counter on `AuditCentralHealthSnapshot`
  (alongside the existing failure counters), incremented when an inbound row
  truncates (request or response hits `InboundMaxBytes`). Surfaced via the
  health snapshot.
- **(c) Per-method opt-out** of body capture: a `SkipBodyCapture` flag on
  `PerTargetRedactionOverride`, checked in the capture pipeline so a noisy/
  sensitive method can suppress body capture (headers + metadata still recorded).
- **Tests:** request headers land in `Extra` and are redacted; ceiling-hit
  increments the counter; opt-out suppresses body but keeps the row.

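Items (a) and (c) can be pictured together with a small sketch. Everything here is hypothetical: the class name, the `requestHeaders` key, the `[REDACTED]` marker, and the deny-list stand in for the existing header redactor, whose actual rules are not shown in this design.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class InboundCaptureSketch
{
    // Assumed deny-list; the existing redactor's real configuration may differ.
    private static readonly HashSet<string> SensitiveHeaders =
        new(StringComparer.OrdinalIgnoreCase) { "Authorization", "Cookie", "X-Api-Key" };

    // Returns the Extra JSON (redacted headers) and the body to store, which is
    // null when the target opted out of body capture.
    public static (string ExtraJson, string? Body) Capture(
        IEnumerable<KeyValuePair<string, string>> headers, string? body, bool skipBodyCapture)
    {
        var redacted = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var header in headers)
        {
            // Last value wins for repeated header names in this simplified sketch.
            redacted[header.Key] = SensitiveHeaders.Contains(header.Key) ? "[REDACTED]" : header.Value;
        }

        var extra = new Dictionary<string, object> { ["requestHeaders"] = redacted };
        return (JsonSerializer.Serialize(extra), skipBodyCapture ? null : body);
    }
}
```

The point of the shape is that the opt-out only affects the body: the headers and the row itself are still produced, which is what the tests above are meant to pin down.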
### T4 — `ParentExecutionId` tag-cascade (touches the actor model — high-risk)
Completes the execution tree beyond the inbound-API→routed-script case.
- **Alarm on-trigger:** thread a `Guid? parentExecutionId` through
  `AlarmActor.SpawnAlarmExecutionActor` → `AlarmExecutionActor` →
  `ScriptRuntimeContext`, so an alarm-triggered script chains to its firing
  context (the alarm's own execution id where one exists; otherwise a root).
- **Nested `CallScript`/`CallShared`:** in `ScriptRuntimeContext`, pass **the
  current run's `ExecutionId`** (not the inherited `_parentExecutionId`) as the
  child invocation's `ParentExecutionId`, so `A → CallScript(B)` records B's
  parent as A — a true multi-level tree.
- **Timer/expression-trigger top-level runs** stay roots (no spawner) — unchanged.
- **Tests:** alarm-triggered script row carries the expected parent; a 2-level
  nested `CallScript` produces a chain A→B→C walkable by `GetExecutionTreeAsync`.
- **Risk:** serialized actor state + correlation plumbing; covered by targeted
  SiteRuntime actor tests + a tree-walk integration assertion.

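The nested-call rule is easy to get backwards, so here is a minimal model of just the id plumbing. `RunContextSketch` is a hypothetical stand-in for `ScriptRuntimeContext`; it shows only that a child receives the calling run's own `ExecutionId` as its parent.

```csharp
using System;

// Hypothetical stand-in for ScriptRuntimeContext; only the id plumbing is modelled.
public sealed class RunContextSketch
{
    public Guid ExecutionId { get; } = Guid.NewGuid();
    public Guid? ParentExecutionId { get; }
    public string Script { get; }

    public RunContextSketch(string script, Guid? parentExecutionId = null)
    {
        Script = script;
        ParentExecutionId = parentExecutionId;
    }

    // The child gets THIS run's ExecutionId as its parent, not the inherited parent id.
    // Passing ParentExecutionId here instead would flatten A -> B -> C into siblings under A's parent.
    public RunContextSketch CallScript(string childScript) => new(childScript, ExecutionId);
}
```

With this rule, `A.CallScript("B").CallScript("C")` yields C whose parent is B and B whose parent is A, which is the chain the tree-walk test expects.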
### T3 — Per-channel retention overrides (one design wrinkle, resolved)
Retention is a single global `RetentionDays`; the purge actor switches out whole
month partitions by `OccurredAtUtc` (channel-blind).
- Add `PerChannelRetentionDays` (`Dictionary<string,int>`, keyed by channel /
  `Action` name) to `AuditLogOptions`, validated like the global value; a channel
  override may only be **shorter** than the global window (longer is meaningless
  under month-partition switch-out, which is governed by the largest retention).
- **Mechanism (resolved):** after the coarse global partition purge, the purge
  actor runs a **bounded row-level delete** for channels whose override is
  shorter than global (`DELETE … WHERE Action=@channel AND OccurredAtUtc<@thr`,
  batched). This runs from the **purge/maintenance path, not the writer role** —
  the append-only invariant binds the writer/ingest role, not maintenance. The
  **M2.10 CI grep-guard is widened** to allow the purge actor's single audited
  deletion call site (an allow-list entry, not a blanket exemption).
- **Tests:** a channel with a shorter override is purged earlier than the global;
  channels without an override follow the global; the guard still rejects
  UPDATE/DELETE everywhere except the sanctioned purge site.

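The validation rule and the `@thr` cut-off can be sketched as two pure functions. The names, the `minDays` lower bound, and the choice to skip overrides equal to the global window are assumptions for illustration; the actual validator and purge actor may draw these lines differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChannelRetentionSketch
{
    // Returns one message per invalid override; an empty list means the options are valid.
    public static IReadOnlyList<string> Validate(
        int globalDays, IReadOnlyDictionary<string, int> overrides, int minDays = 1)
    {
        return overrides
            .Where(o => o.Value < minDays || o.Value > globalDays)
            .Select(o => $"Channel '{o.Key}': {o.Value} days is outside [{minDays}, {globalDays}].")
            .ToList();
    }

    // Cut-off per channel for the row-level delete. Only overrides strictly shorter
    // than the global window need one; the partition purge covers everything else.
    public static IReadOnlyDictionary<string, DateTime> DeleteThresholds(
        DateTime utcNow, int globalDays, IReadOnlyDictionary<string, int> overrides, int minDays = 1)
    {
        return overrides
            .Where(o => o.Value >= minDays && o.Value < globalDays)
            .ToDictionary(o => o.Key, o => utcNow.AddDays(-o.Value));
    }
}
```

Each returned threshold would feed one batched `DELETE … WHERE Action=@channel AND OccurredAtUtc<@thr` issued from the purge path.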
### T5 — Historical backfill (reframed per the computed-column reality)
- **`SourceNode`** is a physical nullable column. For truly historical rows the
  node-of-origin is **unknowable**, so the backfill sets a **configurable
  sentinel** (default `"unknown"`) on `NULL` rows via a one-shot maintenance
  command (run from the purge/maintenance path), rather than guessing a node.
- **`ExecutionId`/`ParentExecutionId`** are **persisted computed columns derived
  from `DetailsJson`**; backfilling them means mutating the JSON, which
  append-only forbids. These are **documented as a runbook limitation** (pre-feature
  rows stay NULL) — no code.
- **Tests:** the SourceNode backfill sets the sentinel only on NULL rows within a
  bounded range and is idempotent; documentation note added.

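The idempotence property asked for in the tests follows directly from only touching `NULL` rows. A minimal in-memory sketch, with hypothetical type names and an assumed "before this timestamp" bound standing in for the bounded range:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row shape; only the two columns the backfill looks at are modelled.
public sealed class AuditRowSketch
{
    public string? SourceNode { get; set; }
    public DateTime OccurredAtUtc { get; init; }
}

public static class SourceNodeBackfillSketch
{
    // Sets the sentinel on NULL rows older than the bound and returns how many rows changed.
    // A second run finds no NULL rows in range, so it changes nothing.
    public static int Backfill(IEnumerable<AuditRowSketch> rows, DateTime beforeUtc, string sentinel = "unknown")
    {
        var updated = 0;
        foreach (var row in rows)
        {
            if (row.SourceNode is null && row.OccurredAtUtc < beforeUtc)
            {
                row.SourceNode = sentinel;
                updated++;
            }
        }
        return updated;
    }
}
```

In SQL terms this corresponds to an `UPDATE … WHERE SourceNode IS NULL AND OccurredAtUtc < @before`, which is why the command belongs on the maintenance path rather than the writer role.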
---

## Cross-cutting

- **Shared seams:** `AuditLogOptions` (T3, T7), `AuditEndpoints.MapAuditAPI`
  (T8), `AuditCommands` (T8), `AuditCentralHealthSnapshot` (T6, T7),
  `IAuditLogRepository`/the KPI repositories (T6), the purge/maintenance role
  (T3, T5). No AuditLog **schema** change in M5 (T1/T2 deferred).
- **Append-only:** the only new mutations are T3's purge-role channel DELETE and
  T5's purge-role sentinel UPDATE. Both run on the maintenance path and both are
  reflected in the CI guard's allow-list. Writer/ingest paths stay INSERT-only.

## Testing strategy

Per-item unit + targeted integration tests (above). T4 additionally gets a
tree-walk integration assertion. Full-solution build + targeted suites at the
integration step. No new infra dependency (Parquet deferred).

## Sequencing

Independent items, parallelizable by disjoint area:
- **Wave A (parallel):** T8 (CLI+endpoint), T6 (KPI repos+actors+tiles), T7
  (middleware+health+redaction-override) — disjoint projects.
- **Wave B (parallel):** T4 (SiteRuntime actors — high-risk), T3 (AuditLog
  options+purge actor+CI guard), T5 (purge-path backfill command + runbook).
- **Wave C:** integration verification + docs (Component-AuditLog/-CLI, CLAUDE.md
  KPI/retention notes, runbook).

## Risks

- **T4** actor-model correlation (serialized state) — targeted tests + tree-walk
  assertion.
- **T3** append-only tension — resolved via maintenance-role delete + CI-guard
  allow-list; verify the guard still blocks all other DELETE/UPDATE.
- **T5** node-of-origin unknowable — sentinel + documented limitation (no false
  precision).

@@ -0,0 +1,92 @@
# M5 — Audit Hardening (T3–T8) Implementation Plan

> **For Claude:** executed via superpowers-extended-cc:subagent-driven-development in this session.

**Goal:** Ship six independent audit-log hardening items (per-channel retention, ParentExecutionId tag-cascade, SourceNode backfill, per-node stuck KPIs, structured response-capture increments, CLI `audit tree`) without an AuditLog schema change.

**Architecture:** Each item extends an existing seam identified in the survey. No new infra dependency (T1 hash-chain + T2 Parquet stay deferred to v1.x). Design: `docs/plans/2026-06-16-m5-audit-hardening-design.md`.

**Tech Stack:** C#/.NET 10, EF Core (MS SQL), Akka.NET, Blazor Server, System.CommandLine, xUnit.

**Conventions:** targeted builds/tests per task (`dotnet build <proj>`, `dotnet test --filter`); full-solution build only at integration (M5.7). Implementers do NOT create worktrees (already in `worktree-m5-audit-hardening`) and commit with pathspec form `git commit -m "..." -- <paths>` (retry on index.lock). Append-only invariant holds for writer/ingest paths; the only sanctioned mutations are T3's purge-role channel delete and T5's purge-role sentinel UPDATE, both reflected in the M2.10 CI-guard allow-list.

---

# Wave A — leverage-existing-infra (parallel; disjoint projects)

### Task M5.1 (T8): CLI `audit tree` + tree endpoint
**Classification:** standard · **~5 min** · **Parallelizable with:** M5.2, M5.3
**Files:**
- Modify: `src/ZB.MOM.WW.ScadaBridge.ManagementService/AuditEndpoints.cs` (`MapAuditAPI`, ~line 97) — add `GET /api/audit/tree?executionId=<guid>` → `IAuditLogRepository.GetExecutionTreeAsync(executionId)` → JSON `ExecutionTreeNode[]`; 400 on missing/invalid guid, empty array when no rows.
- Create: `src/ZB.MOM.WW.ScadaBridge.CLI/Commands/AuditTreeHelpers.cs` — render `ExecutionTreeNode[]` as an indented ASCII tree (table) and as raw JSON (`--format json`), mirroring `AuditQueryHelpers`/`AuditExportHelpers`.
- Modify: `src/ZB.MOM.WW.ScadaBridge.CLI/Commands/AuditCommands.cs` (`Build`, ~line 28) — add `BuildTree()`: `audit tree --execution-id <guid> [--format table|json]`, calls the new endpoint via the existing `ManagementHttpClient` pattern.
- Test: ManagementService tests for the endpoint (multi-level tree + not-found); CLI tests for `AuditTreeHelpers` rendering.
**AC:** `audit tree --execution-id <id>` prints the execution tree (root→children, indented); `--format json` emits the node array; the server walk reuses the existing `GetExecutionTreeAsync` (no new SQL). No schema change.

### Task M5.2 (T6): Per-node stuck-count KPIs
**Classification:** standard · **~5 min** · **Parallelizable with:** M5.1, M5.3
**Files:**
- Modify: `NotificationOutboxRepository` — add `ComputePerNodeKpisAsync` (group by `SourceNode`) parallel to `ComputePerSiteKpisAsync`.
- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/...Repository` — same `ComputePerNodeKpisAsync`.
- Modify: `NotificationOutboxActor.cs` (~line 1054) + `SiteCallAuditActor.cs` (~line 781) — add a `PerNode…KpiRequest`/`Response` message pair (in Commons messages) and a `Receive<>`/handler each.
- Modify: CentralUI `AuditKpiTiles.razor` / `SiteCallKpiTiles.razor` (or the per-site KPI panel) — add an additive per-node breakdown.
- Test: repository per-node grouping returns correct stuck/parked/queue-depth counts; actor message round-trip.
**AC:** per-node stuck/parked counts available + surfaced; `SourceNode` already on both tables (no migration). Per-site KPIs unchanged.

### Task M5.3 (T7): Structured response-capture increments
**Classification:** standard · **~5 min** · **Parallelizable with:** M5.1, M5.2
**Files:**
- Modify: `src/ZB.MOM.WW.ScadaBridge.AuditLog/...AuditWriteMiddleware.cs` (`EmitInboundAudit`, ~line 246) — capture inbound **request headers** into the existing `Extra` JSON (through the existing header redactor; auth headers redacted by default).
- Modify: `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditCentralHealthSnapshot.cs` — add an `AuditInboundCeilingHits` counter (+ its interface), incremented from the middleware when an inbound row truncates (`requestTruncated || responseTruncated`).
- Modify: `src/ZB.MOM.WW.ScadaBridge.AuditLog/Configuration/PerTargetRedactionOverride.cs` — add a `SkipBodyCapture` flag; honor it in the capture pipeline (suppress body, keep headers + metadata + the row).
- Test: request headers land in `Extra` and are redacted; ceiling-hit increments the counter; `SkipBodyCapture` suppresses body but still writes the row.
**AC:** no schema change (uses `Extra` JSON + health snapshot); existing redaction behavior preserved.

---

# Wave B — actor model + maintenance (parallel; T5 after M5.1's CLI edits)

### Task M5.4 (T4): ParentExecutionId tag-cascade
**Classification:** high-risk (actor model + correlation) · **~5 min** · **Parallelizable with:** M5.5 (and M5.6)
**Files:**
- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs` (`SpawnAlarmExecutionActor`, ~line 578) + `AlarmExecutionActor.cs` (ctor, ~line 90) — thread a `Guid? parentExecutionId` so alarm-triggered scripts chain to the firing context; pass it into the `ScriptRuntimeContext` (currently `null`).
- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScriptRuntimeContext.cs` (`CallScript` ~line 394, `CallShared`) — pass **the current run's `_executionId`** (not the inherited `_parentExecutionId`) as the child invocation's `ParentExecutionId`, forming a true multi-level tree.
- Test (`tests/.../SiteRuntime.Tests/`): an alarm-triggered script row carries the expected parent; a 2-level nested `CallScript` (A→B→C) is walkable via `GetExecutionTreeAsync` (or assert the emitted `ParentExecutionId` chain).
**AC:** alarm/trigger-spawned and nested-call runs form a correct execution tree; top-level timer/expression-trigger runs stay roots; no regression to the inbound-API→routed-script path.

### Task M5.5 (T3): Per-channel retention overrides
**Classification:** high-risk (purge/deletion + CI guard) · **~5 min** · **Parallelizable with:** M5.4, M5.6
**Files:**
- Modify: `src/ZB.MOM.WW.ScadaBridge.AuditLog/Configuration/AuditLogOptions.cs` — add `Dictionary<string,int> PerChannelRetentionDays` (keyed by `Action`/channel name); validate in `AuditLogOptionsValidator.cs` (each override in `[30, global]`, shorter-than-global only).
- Modify: `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeActor.cs` (`HandlePurgeTickAsync`, ~line 135) — after the global partition switch-out, for each channel with a shorter override, run a **bounded batched DELETE** (`WHERE Action=@channel AND OccurredAtUtc<@threshold`) via the purge/maintenance path.
- Modify: the M2.10 CI grep-guard script — add an allow-list entry for the purge actor's single audited DELETE call site (do NOT blanket-exempt; the guard must still reject all other UPDATE/DELETE on AuditLog).
- Test: a channel with a shorter override is purged earlier than global; un-overridden channels follow global; the CI guard still fails on a stray DELETE elsewhere.
**AC:** per-channel retention works without violating writer-role append-only; the guard remains effective.

### Task M5.6 (T5): SourceNode sentinel backfill + runbook
**Classification:** small · **~4 min** · **Parallelizable with:** M5.4, M5.5 · **Depends on:** M5.1 (shares `AuditCommands.cs`)
**Files:**
- Create: a one-shot maintenance backfill (purge/maintenance path) that sets `SourceNode` to a configurable sentinel (default `"unknown"`) on `NULL` rows within a bounded `OccurredAtUtc` range; idempotent.
- Modify: `src/ZB.MOM.WW.ScadaBridge.CLI/Commands/AuditCommands.cs` — add `audit backfill-source-node [--sentinel <s>] [--before <date>]` invoking it (after M5.1's `audit tree` is in, to avoid a concurrent edit to this file).
- Modify/Create: a runbook note (`deploy/.../RUNBOOK.md` or the AuditLog component doc) documenting that `ExecutionId`/`ParentExecutionId` are computed from `DetailsJson` and CANNOT be backfilled under append-only (pre-feature rows stay NULL) — no false precision.
- Test: backfill sets the sentinel only on NULL rows in range, is idempotent, and does not touch non-NULL rows.
**AC:** SourceNode backfill is sanctioned maintenance (CI-guard allow-listed if it does UPDATE); the computed-id limitation is documented, not coded.

---

# Wave C — integration + docs

### Task M5.7: Integration verification + docs
**Classification:** high-risk (final integration reviewer) · **~5 min** · **Depends on:** M5.1–M5.6
**Steps:**
1. `dotnet build ZB.MOM.WW.ScadaBridge.slnx` (full solution).
2. Targeted tests across AuditLog, ManagementService, CLI, NotificationOutbox/SiteCallAudit, SiteRuntime, CentralUI; run the CI grep-guard to confirm it still blocks stray UPDATE/DELETE.
3. Docs: `docs/requirements/Component-AuditLog.md` (per-channel retention, per-node KPIs, response-capture increments, tag-cascade, `audit tree`), `Component-CLI.md` + CLI README (`audit tree`, `audit backfill-source-node`), CLAUDE.md audit notes (per-channel retention; tag-cascade now beyond inbound; per-node KPIs), and the runbook computed-id limitation.
4. Commit; final integration review of the whole `1b7600f..HEAD` diff.
**AC:** full build green; all targeted suites + CI guard green; docs reflect the six shipped items; no doc claims a deferred item shipped (T1/T2 remain deferred).

---

## Native tasks & dependencies

Sub-tasks created as native tasks under umbrella #16 (M5). Edges: M5.6 ⟵ M5.1 (shared CLI file); M5.7 ⟵ M5.1–M5.6. Waves: A = {M5.1, M5.2, M5.3} parallel; B = {M5.4, M5.5, M5.6} parallel (M5.6 after M5.1); C = M5.7.

@@ -0,0 +1,13 @@
{
  "planPath": "docs/plans/2026-06-16-m5-audit-hardening.md",
  "tasks": [
    {"id": 119, "subject": "M5.1 (T8): CLI audit tree + tree endpoint", "status": "pending"},
    {"id": 120, "subject": "M5.2 (T6): Per-node stuck-count KPIs", "status": "pending"},
    {"id": 121, "subject": "M5.3 (T7): Structured response-capture increments", "status": "pending"},
    {"id": 122, "subject": "M5.4 (T4): ParentExecutionId tag-cascade", "status": "pending"},
    {"id": 123, "subject": "M5.5 (T3): Per-channel retention overrides", "status": "pending"},
    {"id": 124, "subject": "M5.6 (T5): SourceNode sentinel backfill + runbook", "status": "pending", "blockedBy": [119]},
    {"id": 125, "subject": "M5.7: M5 integration verification + docs", "status": "pending", "blockedBy": [119, 120, 121, 122, 123, 124]}
  ],
  "lastUpdated": "2026-06-16"
}
@@ -0,0 +1,264 @@
# Patch request — event-driven "wait for attribute change (with timeout)" script helper

**Date:** 2026-06-17
**Type:** Source enhancement (small, additive) to the SiteRuntime script surface
**Why now:** the DELMIA/MES receiver re-implementation
([`2026-06-17-delmia-mes-receiver-templates-design.md`](2026-06-17-delmia-mes-receiver-templates-design.md), §9 risk #1)
currently has to **busy-poll** for the handshake completion flag. This spec describes the gap
and a precise, patch-ready design for a host-provided `WaitAsync` helper so scripts can wait
**event-driven** for a tag/attribute to reach a value, bounded by a timeout.

> All file paths, line numbers, message records, and signatures below were read from source on
> 2026-06-17. Treat line numbers as guides (they drift); the type/method names are the anchors.

---

## 1. The gap

The receiver handshake (and any request/response tag interaction) needs to **wait until a
data-sourced attribute reaches a value** — e.g. wait up to 30 s for `RecipeProcessedFlag == true`
or `MoveInCompleteFlag == true` after setting the trigger flag.

ScadaBridge's script surface today has **read** (`Attributes.GetAsync` / indexer) and **write**
(`Attributes.SetAsync` / indexer), but **no "wait for value" primitive**. The only way to wait is
a manual poll loop:

```csharp
// current workaround — every handshake script repeats this
var deadline = DateTime.UtcNow.AddSeconds(30);
while (DateTime.UtcNow < deadline && !CancellationToken.IsCancellationRequested)
{
    if ((bool?)(await Attributes.GetAsync("RecipeProcessedFlag")) == true) break;
    await Task.Delay(200, CancellationToken);
}
```

Why this is unsatisfactory:

- **Latency** — completion is detected up to one poll interval late (200 ms here).
- **Wasted work** — each iteration is an actor `Ask` (`GetAttributeRequest` round-trip to the
  `InstanceActor`); N handshakes × M polls = a lot of needless messages.
- **Boilerplate** — the same loop is copy-pasted into every handshake script, easy to get wrong
  (forgetting `CancellationToken`, off-by-one on the deadline, not handling quality).
- **No quality awareness** — the poll reads whatever value is cached regardless of OPC/MX quality.

Crucially, **the data is already being pushed to the actor that owns it.** A data-sourced
attribute's value arrives from the DCL and is applied in the `InstanceActor`, which then raises
`AttributeValueChanged`. So an event-driven waiter is natural and removes the poll entirely.

---

## 2. Where the change goes (verified wiring)

| Concern | Type / file | Notes |
|---|---|---|
| Change notification | `AttributeValueChanged(InstanceUniqueName, AttributePath, AttributeName, Value, Quality, Timestamp)` — `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Streaming/AttributeValueChanged.cs` | raised on **every** change |
| **Single choke point** | `InstanceActor.HandleAttributeValueChanged(...)` — `src/…/SiteRuntime/Actors/InstanceActor.cs` | both static writes (`HandleSetStaticAttributeCore`) **and** DCL/subscription updates (`HandleTagValueUpdate` ← `TagValueUpdate`) funnel through here, then `PublishAndNotifyChildren` |
| Owner of state | `InstanceActor` (`_attributes`, `_attributeQualities`, `_attributeTimestamps`) | **single-threaded** — registration + current-value check is atomic here |
| Script read path | `AttributeAccessor` (`ScopeAccessors.cs`) → `ScriptRuntimeContext.GetAttribute` → `Ask<GetAttributeResponse>(GetAttributeRequest)` | the helper mirrors this |
| Script globals build | `ScriptExecutionActor` (`src/…/SiteRuntime/Actors/ScriptExecutionActor.cs`) builds `ScriptRuntimeContext` (passes `instanceActor`, `self`, `_askTimeout`) and `ScriptGlobals` (`CancellationToken = cts.Token` from the per-script timeout) | **the script timeout token is NOT currently passed into `ScriptRuntimeContext`** — this patch must thread it in |
| Helper idiom | `ScriptRuntimeContext` nested helpers (e.g. `ExternalSystemHelper`) — ctor deps stored as readonly fields, exposed via an on-demand property | follow this idiom |
| Trust model | `ScriptTrustPolicy` (`src/…/ScriptAnalysis/`) | `System.Threading.Tasks` + `CancellationToken`/`CancellationTokenSource` are in `AllowedExceptions`; lambdas/`Func<>` are fine. **No trust change needed** — the wait runs in host code; the script just `await`s a provided method. |

**Design principle:** do the wait **inside the `InstanceActor`** as a one-shot registered waiter,
not in the script via polling. Because the actor is single-threaded and `HandleAttributeValueChanged`
is the one place every change passes, a waiter that (a) checks the current value on registration and
(b) is re-evaluated on each change **cannot miss the edge** between "read current" and "subscribe".

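The "check on registration, re-evaluate on change" idea can be shown without Akka at all. The sketch below is a plain single-threaded registry with hypothetical names (`AttributeWaitersSketch`, `Register`, `OnAttributeChanged`, `OnTimeout`); it assumes, as the actor model guarantees for the real `InstanceActor`, that its methods are never called concurrently, and it models the reply as a callback instead of an actor message.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Single-threaded by assumption: callers must serialize access, as an actor mailbox would.
public sealed class AttributeWaitersSketch
{
    private readonly Dictionary<string, object?> _attributes = new();
    private readonly Dictionary<string, (string Attribute, Func<object?, bool> Test, Action<bool> Reply)> _waiters = new();

    // Fast path: reply immediately when the current value already satisfies the test.
    public void Register(string correlationId, string attribute, Func<object?, bool> test, Action<bool> reply)
    {
        if (_attributes.TryGetValue(attribute, out var current) && test(current))
        {
            reply(true);
            return;
        }
        _waiters[correlationId] = (attribute, test, reply);
    }

    // Every change passes through here: update state first, then resolve matching waiters.
    public void OnAttributeChanged(string attribute, object? value)
    {
        _attributes[attribute] = value;
        foreach (var (correlationId, waiter) in _waiters.ToList()) // snapshot allows removal
        {
            if (waiter.Attribute == attribute && waiter.Test(value))
            {
                _waiters.Remove(correlationId);
                waiter.Reply(true);
            }
        }
    }

    // One-shot: a timeout for an already-resolved waiter is ignored.
    public void OnTimeout(string correlationId)
    {
        if (_waiters.Remove(correlationId, out var waiter)) waiter.Reply(false);
    }
}
```

Because registration and change handling share one thread of control, there is no window in which a value can change after the current-value check but before the waiter is stored, which is the property the design principle relies on.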
---

## 3. Proposed API (script-facing)

Add to the `Attributes` accessor (`AttributeAccessor` in `ScopeAccessors.cs`), so scope/composition
path resolution (`Resolve(name)`) applies just like get/set:

```csharp
// Wait until `name` equals targetValue (value-equality, codec-normalized). Returns true if matched
// within the timeout, false if it timed out. Honors the script CancellationToken.
Task<bool> Attributes.WaitAsync(string name, object? targetValue, TimeSpan timeout);

// Predicate form — site-local template scripts only (predicate is an in-process delegate).
Task<bool> Attributes.WaitAsync(string name, Func<object?, bool> predicate, TimeSpan timeout);

// Optional richer overload that also returns the matched value + quality.
Task<WaitResult> Attributes.WaitForAsync(string name, object? targetValue, TimeSpan timeout);
// record WaitResult(bool Matched, object? Value, string? Quality, bool TimedOut);
```

> **Status:** IMPLEMENTED. `Attributes.WaitForAsync(...)` returns a `WaitResult`
> (`readonly record struct WaitResult(bool Matched, object? Value, string? Quality, bool TimedOut)`
> in Commons), populated on match (Value + Quality) and `Matched:false, TimedOut:true` on timeout.

Return **bool** (not throw) for the common case — the handshake wants matched/timed-out, not an
exception. The value-equality overload is the one the handshake needs and is the one that can also
be exposed on the inbound/routed side (§6), because a value serializes and a delegate does not.

Handshake, rewritten (replaces the §1 poll loop):

```csharp
await Attributes.SetAsync("RecipeDownloadFlag", true);                 // trigger
var ok = await Attributes.WaitAsync("RecipeProcessedFlag", true, TimeSpan.FromSeconds(30));
if (!ok) return new { Result = false, ResultText = "Timeout waiting for recipe to be processed" };
return new {
    Result     = (bool?)(await Attributes.GetAsync("RecipeProcessResult")) ?? false,
    ResultText = (string?)(await Attributes.GetAsync("RecipeProcessResultText")) ?? ""
};
```

```csharp
await Attributes.SetAsync("MoveInFlag", true);
var ok = await Attributes.WaitAsync("MoveInCompleteFlag", true, TimeSpan.FromSeconds(30));
// … read MoveInSuccessfulFlag / MoveInErrorText / MoveInBatchID …
```

---

## 4. Implementation outline (the patch)

### 4.1 New messages (`src/ZB.MOM.WW.ScadaBridge.Commons/Messages/…`)
```csharp
// actor protocol (site-local; delegate is fine because messaging is in-process)
public record WaitForAttributeRequest(
    string CorrelationId,
    string InstanceName,
    string AttributeName,           // already scope-resolved by the accessor
    string? TargetValueEncoded,     // AttributeValueCodec.Encode(targetValue); null = "any change"
    Func<object?, bool>? Predicate, // local-only; null when TargetValueEncoded is used
    TimeSpan Timeout,
    DateTimeOffset OccurredAtUtc);

public record WaitForAttributeResponse(
    string CorrelationId,
    bool Matched,
    object? Value,
    string Quality,
    bool TimedOut,
    string? ErrorMessage = null);

// internal self-message used to fire the timeout
public record WaitForAttributeTimeout(string CorrelationId);
```

### 4.2 `InstanceActor` (`src/…/SiteRuntime/Actors/InstanceActor.cs`)
- Add a registry: `Dictionary<string, PendingWait> _attributeWaiters` keyed by `CorrelationId`, where
  `PendingWait` holds the attribute name, the match test (decoded target value **or** predicate),
  the original `Sender` (`IActorRef`), and the scheduled `ICancelable` timeout handle.
- **Handle `WaitForAttributeRequest`:**
  1. Build the match test (decode `TargetValueEncoded` via `AttributeValueCodec` → equality test, or
     use `Predicate`).
  2. **Fast path:** if the current `_attributes[name]` already satisfies the test, reply
     `WaitForAttributeResponse(Matched: true, Value, Quality)` immediately and return.
  3. Otherwise register the waiter and schedule the timeout, using the `effectiveTimeout` from step 4:
     `Context.System.Scheduler.ScheduleTellOnceCancelable(effectiveTimeout, Self, new WaitForAttributeTimeout(cid), Self)`,
     storing the returned `ICancelable` (the plain `ScheduleTellOnce` overload returns `void`).
     Capture `Sender` now (it is invalid later).
  4. Bound `effectiveTimeout = min(request.Timeout, requestDeadlineFromCaller)` (the caller's `Ask`
     already carries the script token; see §4.3). Optionally cap the number of concurrent waiters
     per instance (defensive; reply with `ErrorMessage` if exceeded).
- **In `HandleAttributeValueChanged` (after state is updated):** iterate `_attributeWaiters` whose
  attribute matches the changed `AttributeName`; for any whose test now passes, cancel its timeout,
  reply `WaitForAttributeResponse(Matched: true, …)`, and remove it. (Iterate over a snapshot to
  allow removal during enumeration.)
- **Handle `WaitForAttributeTimeout`:** if still registered, reply
  `WaitForAttributeResponse(Matched: false, TimedOut: true)` and remove.
- Optional: a `quality == "Good"`-only mode (parameter on the request) if a handshake must ignore
  Bad-quality transients.

> **Status:** IMPLEMENTED as an opt-in `requireGoodQuality` parameter on `WaitAsync`/`WaitForAsync`
> (additive trailing `RequireGoodQuality` field on `WaitForAttributeRequest`, gated at both the
> fast-path and resolve-loop match sites). Default `false` = quality-agnostic (matches on value only).

### 4.3 `ScriptRuntimeContext` (`src/…/SiteRuntime/Scripts/ScriptRuntimeContext.cs`)
- **Thread the script timeout token in.** Add a `CancellationToken scriptTimeoutToken` constructor
  parameter (today only `_askTimeout` is available to helpers; the per-script `cts.Token` is **not**
  passed). `ScriptExecutionActor` already has `cts.Token` — pass it when constructing the context.
- Add a method that the accessor calls:
  ```csharp
  public async Task<bool> WaitAttribute(string name, string? targetValueEncoded,
      Func<object?,bool>? predicate, TimeSpan timeout)
  {
      var cid = Guid.NewGuid().ToString();
      var req = new WaitForAttributeRequest(cid, _instanceName, name, targetValueEncoded,
          predicate, timeout, DateTimeOffset.UtcNow);
      // Ask bounded by the script timeout token so a script-deadline abort cancels the await.
      var resp = await _instanceActor.Ask<WaitForAttributeResponse>(
          req, timeout + _askTimeout /* small slack */, _scriptTimeoutToken);
      return resp.Matched;
  }
  ```

### 4.4 `ScriptExecutionActor` (`src/…/SiteRuntime/Actors/ScriptExecutionActor.cs`)
|
||||
- Pass `cts.Token` (the per-script timeout, created at the `new CancellationTokenSource(timeout)`
|
||||
site) into the new `ScriptRuntimeContext` constructor parameter from §4.3.
|
||||
|
||||
### 4.5 `AttributeAccessor` (`src/…/SiteRuntime/Scripts/ScopeAccessors.cs`)
|
||||
```csharp
|
||||
public Task<bool> WaitAsync(string key, object? targetValue, TimeSpan timeout)
|
||||
=> _ctx.WaitAttribute(Resolve(key), AttributeValueCodec.Encode(targetValue), null, timeout);
|
||||
|
||||
public Task<bool> WaitAsync(string key, Func<object?, bool> predicate, TimeSpan timeout)
|
||||
=> _ctx.WaitAttribute(Resolve(key), null, predicate, timeout);
|
||||
```
|
||||
|
||||
### 4.6 Trust model — no change
|
||||
`WaitAsync` is a host-provided async method; the wait/scheduling happens in host code. The script
|
||||
only `await`s it and may pass a `Func<>` (a normal closure, not reflection). `System.Threading.Tasks`
|
||||
+ `CancellationToken` are already in `ScriptTrustPolicy.AllowedExceptions`. Verify the new helper
|
||||
type/members don't collide with `ForbiddenIdentifiers` (`dynamic`, `Activator`) — they don't.
|
||||
|
||||
---
|
||||
|
||||
## 5. Correctness notes
|
||||
|
||||
- **No missed edge.** Registration (current-value check) and change-handling both run on the
|
||||
`InstanceActor`'s single thread, so a value that flips between "set trigger" and "register waiter"
|
||||
is caught by the fast-path check; a value that flips after registration is caught by
|
||||
`HandleAttributeValueChanged`. The poll loop and this design are both correct; this one is
|
||||
event-driven and cheaper.
|
||||
- **Timeout is authoritative and self-cleaning.** The scheduled `WaitForAttributeTimeout` guarantees
|
||||
the waiter is removed and the caller answered even if the value never changes. Match cancels the
|
||||
scheduled timeout.
|
||||
- **Cancellation.** Bounding the helper `Ask` with the script timeout token means a script that hits
|
||||
its own `ExecutionTimeoutSeconds` abandons the wait; pair with a best-effort cancel message to the
|
||||
actor to evict the orphan waiter promptly (otherwise it self-evicts at its own timeout).
|
||||
- **Concurrency / re-entrancy.** Multiple waiters per instance are fine (keyed by `CorrelationId`).
|
||||
Consider a per-instance cap as a guard against a script leaking waiters in a loop.
|
||||
|
||||
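The timeout and cancellation notes above can be illustrated with plain .NET tasks. This is a sketch under stated assumptions: `WaitDeadline.AwaitBounded` is a hypothetical helper, and `TimeoutException` from `Task.WaitAsync` stands in for the `AskTimeoutException` an Akka `Ask` would raise. It shows the intended asymmetry: the wait's own timeout becomes a `false` result, while a cancelled script token propagates.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class WaitDeadline
{
    // Returns the wait's result, or false when waitTimeout elapses first.
    // A cancelled scriptToken surfaces as OperationCanceledException and is NOT swallowed.
    public static async Task<bool> AwaitBounded(
        Task<bool> wait, TimeSpan waitTimeout, CancellationToken scriptToken)
    {
        try
        {
            // Task<T>.WaitAsync (available since .NET 6) bounds the await by both the timeout and the token.
            return await wait.WaitAsync(waitTimeout, scriptToken);
        }
        catch (TimeoutException)
        {
            return false; // the wait's own timeout: a normal "no match" outcome
        }
    }
}
```

Only the timeout is caught on purpose, so a script that hits its execution deadline still aborts through cancellation rather than seeing a quiet `false`.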
---
|
||||
|
||||
## 6. Optional: inbound / routed variant
|
||||
|
||||
For symmetry with `RouteTarget.GetAttributes` (`src/…/InboundAPI/RouteHelper.cs`), an inbound script
|
||||
could call `Route.To(code).WaitForAttribute(name, targetValue, timeout)`. Mirror the existing routed
|
||||
pattern: add `RouteToWaitForAttributeRequest/Response`, an `IInstanceRouter.RouteToWaitForAttributeAsync`
|
||||
method, and unpack it on the site comms actor into the same `WaitForAttributeRequest` to the
|
||||
`InstanceActor`. **Value-equality only** across the wire — a `Func<>` predicate cannot be serialized,
|
||||
so the routed form takes the encoded target value (the predicate overload stays site-local). This is
|
||||
optional: the receiver handshake runs **inside** the template script (site-local), so §3–§5 alone
|
||||
fully cover the DELMIA/MES use case.
|
||||
|
||||
> **Status:** IMPLEMENTED. `Route.To(code).WaitForAttribute(name, targetValue, timeout)` is wired
|
||||
> end-to-end (`RouteToWaitForAttributeRequest/Response` → `IInstanceRouter` → `CommunicationService`
|
||||
> → `SiteCommunicationActor` → `DeploymentManagerActor` → `InstanceActor`), value-equality only
|
||||
> across the wire. NOT wired into the CentralUI Test-Run sandbox — that remains a follow-up.
|
||||
|
||||
---
|
||||
|
||||
## 7. Acceptance criteria
|
||||
|
||||
1. A template script can `await Attributes.WaitAsync("Flag", true, TimeSpan.FromSeconds(30))` and it
|
||||
returns `true` promptly when the data-sourced attribute reaches `true` (driven by a DCL update),
|
||||
with no poll loop.
|
||||
2. Returns `false` (no throw) when the value never matches within the timeout.
|
||||
3. The wait is bounded by the script's own `ExecutionTimeoutSeconds` (a shorter script deadline wins).
|
||||
4. No `AttributeValueChanged` edge is missed across the register/change boundary (unit test: flip the
|
||||
value in the same actor step as registration, and one step after).
|
||||
5. Waiters are removed on match and on timeout (no leak; assert registry empty afterward).
|
||||
6. Scope/composition path resolution works (`Children["DelmiaReceiver"]`-scoped wait resolves to the
|
||||
composed child's attribute).
|
||||
7. Passes `ScriptAnalysis` trust validation unchanged.
|
||||
8. The DELMIA/MES handshake base scripts (design doc §4) compile and pass using `WaitAsync` in place
|
||||
of the poll loop.
|
||||
|
||||
Suggested tests: extend `InstanceActor` tests (waiter fast-path, change-match, timeout, removal) and
|
||||
the script-surface tests under `tests/…/SiteRuntime*`.
|
||||
```
|
||||
@@ -0,0 +1,226 @@
|
||||
# WaitAsync Deferred Optional Items — Implementation Plan
|
||||
|
||||
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (subagent-driven) to implement this plan task-by-task.
|
||||
|
||||
**Goal:** Implement the three items deferred from the WaitAsync spec (`docs/plans/2026-06-17-waitfor-attribute-change-helper-spec.md`): §3 `WaitForAsync`/`WaitResult` richer overload, §4.2 quality-gated ("Good"-only) matching, and §6 inbound/routed `Route.To(...).WaitForAttribute` variant.
|
||||
|
||||
**Architecture:** Builds on the shipped core (`b89d69a`→`04e97f4`). Two of the items (§3, §4.2) are site-local enrichments of the existing `Attributes` script surface + `InstanceActor` waiter; no new actor protocol shapes beyond an additive `RequireGoodQuality` field. The third (§6) mirrors the existing `Route.To(...).GetAttributes` cross-cluster path end-to-end (`RouteTarget` → `IInstanceRouter` → `CommunicationService` → `SiteCommunicationActor` → `DeploymentManagerActor` → `InstanceActor`), value-equality only across the wire, with the cluster Ask bounded by the *wait* timeout rather than the generic integration timeout.
|
||||
|
||||
**Tech Stack:** C#/.NET 10, Akka.NET 1.5, xUnit + Akka.TestKit + NSubstitute.
|
||||
|
||||
**Branch/worktree:** `waitfor-attr-helper` at `/Users/dohertj2/Desktop/ScadaBridge/.claude/worktrees/waitfor-attr-helper` (off local main; carries the core feature). Implementers do NOT create worktrees, commit **pathspec form** (`git commit -m "…" -- <paths>`), do NOT push, do NOT touch main. Targeted builds/tests per task; full-solution build only in WD-3.
|
||||
|
||||
---
|
||||
|
||||
## Naming / shared shapes
|
||||
|
||||
- New script return type `WaitResult` (Commons): `public readonly record struct WaitResult(bool Matched, object? Value, string? Quality, bool TimedOut);`
|
||||
- `WaitForAttributeRequest` gains a trailing additive field `bool RequireGoodQuality = false` (site-local request). `RequireGoodQuality` semantics: a match requires the value test to pass **and** `string.Equals(quality, "Good", StringComparison.Ordinal)`.
|
||||
- Routed contract (value-equality only, no predicate, no quality flag across the wire — §6 says value-equality only): `RouteToWaitForAttributeRequest` / `RouteToWaitForAttributeResponse` (Commons `Messages/InboundApi`).
|
||||
- The `WaitForAttributeResponse.Quality` field is already `string?` (null on timeout/error).
|
||||
|
||||
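As an illustration of how a caller might consume the richer return type, here is a small sketch. The `WaitResult` declaration is copied from the bullet above; `WaitResultDemo.Describe` and its message strings are hypothetical and exist only for this example.

```csharp
using System;

// Shape copied from the "Naming / shared shapes" bullet.
public readonly record struct WaitResult(bool Matched, object? Value, string? Quality, bool TimedOut);

public static class WaitResultDemo
{
    // Hypothetical helper: turns a result into a short log line.
    public static string Describe(WaitResult r) => r switch
    {
        { Matched: true } => $"matched value={r.Value} quality={r.Quality}",
        { TimedOut: true } => "timed out",
        _ => "not matched",
    };
}
```

A `readonly record struct` gives value equality for free, which keeps assertions on expected results short in tests.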
---
|
||||
|
||||
## Execution waves
|
||||
|
||||
- **Wave 1 (parallel, disjoint files):** WD-1 ∥ WD-2a. (2 concurrent committers; post-wave HEAD-presence check.)
|
||||
- **Wave 2:** WD-2b (after WD-2a).
|
||||
- **Wave 3:** WD-3 (after WD-1, WD-2a, WD-2b).
|
||||
|
||||
WD-1 must add `RequireGoodQuality` ONLY as a **trailing defaulted** ctor param of `WaitForAttributeRequest`, so WD-2b's `new WaitForAttributeRequest(...)` (built in wave 2) compiles regardless.
|
||||
|
||||
---
|
||||
|
||||
### Task WD-1: Site-local `WaitForAsync` + `WaitResult` + quality-gated mode (§3 + §4.2)
|
||||
|
||||
**Classification:** high-risk (modifies the `InstanceActor` single-threaded match evaluation + an additive message-contract field)
|
||||
**Estimated implement time:** ~5 min
|
||||
**Parallelizable with:** WD-2a
|
||||
|
||||
**Files:**
|
||||
- Create: `src/ZB.MOM.WW.ScadaBridge.Commons/Types/WaitResult.cs`
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Instance/WaitForAttribute.cs` (add trailing `bool RequireGoodQuality = false` to `WaitForAttributeRequest`)
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs` (thread `RequireGoodQuality` into `PendingWait` + both match sites)
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScriptRuntimeContext.cs` (add `WaitAttributeFull` returning `WaitResult`; add `requireGoodQuality` param)
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScopeAccessors.cs` (add `WaitForAsync` overloads + `requireGoodQuality` optional param on `WaitAsync`)
|
||||
- Test: `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Actors/InstanceActorWaitForAttributeTests.cs` + `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Scripts/ScopeAccessorTests.cs`
|
||||
|
||||
**Steps (TDD):**
|
||||
|
||||
1. **`WaitResult`** — add the readonly record struct above.
|
||||
|
||||
2. **`WaitForAttributeRequest`** — add trailing `bool RequireGoodQuality = false`. Keep the `Func<>` predicate field as-is. Update the XML-doc.
|
||||
|
||||
3. **`InstanceActor`** — add `bool RequireGoodQuality` to the `PendingWait` record. At BOTH match sites build the effective match as:
|
||||
```csharp
|
||||
// fast-path (HandleWaitForAttribute): quality from _attributeQualities.GetValueOrDefault(name, <existing default>)
|
||||
// resolve loop (ResolveMatchedWaiters): quality from changed.Quality
|
||||
bool QualityOk(string? q) => !requireGoodQuality || string.Equals(q, "Good", StringComparison.Ordinal);
|
||||
bool matched = QualityOk(quality) && test(value); // keep test() inside its existing try/catch
|
||||
```
|
||||
Store `RequireGoodQuality` on the `PendingWait` so the resolve loop knows it. Keep the throwing-predicate guard (the `QualityOk && test` must still be inside the existing try/catch). On the fast path, a quality failure under `requireGoodQuality` is just a non-match: register the waiter and schedule the timeout as normal (do NOT fast-reply matched).
|
||||
|
||||
4. **`ScriptRuntimeContext`** — refactor: a private `Task<WaitForAttributeResponse> WaitInternal(name, encoded, predicate, timeout, requireGoodQuality)` that does the token-bounded `Ask` (keep the existing `AskTimeoutException → ...` handling; on AskTimeout return a synthetic `WaitForAttributeResponse(.., Matched:false, TimedOut:true)`). Then:
|
||||
```csharp
|
||||
public async Task<bool> WaitAttribute(string name, string? enc, Func<object?,bool>? pred, TimeSpan t, bool requireGoodQuality = false)
|
||||
=> (await WaitInternal(name, enc, pred, t, requireGoodQuality)).Matched;
|
||||
public async Task<WaitResult> WaitAttributeFull(string name, string? enc, Func<object?,bool>? pred, TimeSpan t, bool requireGoodQuality = false)
|
||||
{ var r = await WaitInternal(name, enc, pred, t, requireGoodQuality); return new WaitResult(r.Matched, r.Value, r.Quality, r.TimedOut); }
|
||||
```
|
||||
(Note: `WaitAttribute`'s existing `AskTimeoutException → return false` must be preserved — fold it into `WaitInternal` returning a non-matched/timed-out response, OR catch in both. Do NOT catch `OperationCanceledException`/`TaskCanceledException`.)
|
||||
|
||||
5. **`AttributeAccessor`** — add `requireGoodQuality` optional param to both existing `WaitAsync` overloads, and add two `WaitForAsync` overloads:
|
||||
```csharp
|
||||
public Task<WaitResult> WaitForAsync(string key, object? targetValue, TimeSpan timeout, bool requireGoodQuality = false)
|
||||
=> _ctx.WaitAttributeFull(Resolve(key), AttributeValueCodec.Encode(targetValue), null, timeout, requireGoodQuality);
|
||||
public Task<WaitResult> WaitForAsync(string key, Func<object?,bool> predicate, TimeSpan timeout, bool requireGoodQuality = false)
|
||||
=> _ctx.WaitAttributeFull(Resolve(key), null, predicate, timeout, requireGoodQuality);
|
||||
```
|
||||
XML-doc: `requireGoodQuality:true` ignores Bad/Uncertain-quality transients.
|
||||
|
||||
6. **Tests** (extend existing files): (a) `WaitForAsync` returns a populated `WaitResult` on match (Value+Quality) and on timeout (`Matched:false, TimedOut:true`). (b) quality-gated: a value reaching target at **Bad** quality does NOT match when `requireGoodQuality:true` (stays pending → times out), but DOES match when `false`; and matches when it reaches target at Good quality. Cover both fast-path (already-at-target-but-Bad) and change-match. (c) scope resolution still applied for `WaitForAsync`.
|
||||
|
||||
7. Build `Commons` + `SiteRuntime` + the SiteRuntime test project; run `--filter "FullyQualifiedName~WaitForAttribute|FullyQualifiedName~WaitAsync|FullyQualifiedName~WaitForAsync"` and the `~InstanceActor|~ScopeAccessor` regression filter. All green.
|
||||
|
||||
8. Commit (pathspec).
|
||||
|
||||
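The quality-gated match rule from step 3, and the cases step 6 asks the tests to cover, can be sketched as a pure function. `QualityGate.IsMatch` is a hypothetical name, and treating a throwing predicate as a non-match is an assumption about what the existing try/catch does.

```csharp
using System;

public static class QualityGate
{
    // Effective match = quality gate AND value test, with the test kept inside a guard.
    public static bool IsMatch(bool requireGoodQuality, string? quality,
                               object? value, Func<object?, bool> test)
    {
        bool qualityOk = !requireGoodQuality
            || string.Equals(quality, "Good", StringComparison.Ordinal);
        try
        {
            return qualityOk && test(value);
        }
        catch
        {
            return false; // assumption: a throwing predicate counts as a non-match
        }
    }
}
```

Since the comparison is ordinal, only the exact string `"Good"` passes the gate; `"good"` and `null` do not.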
---
|
||||
|
||||
### Task WD-2a: Routed contract + central path (§6, part 1)
|
||||
|
||||
**Classification:** high-risk (cross-cluster message contract + `IInstanceRouter` surface)
|
||||
**Estimated implement time:** ~5 min
|
||||
**Parallelizable with:** WD-1
|
||||
|
||||
**Files:**
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/InboundApi/RouteToInstanceRequest.cs` (add the two records)
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.InboundAPI/IInstanceRouter.cs` (add method)
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.InboundAPI/CommunicationServiceInstanceRouter.cs` (delegate)
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.InboundAPI/RouteHelper.cs` (`RouteTarget.WaitForAttribute`)
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs` (`RouteToWaitForAttributeAsync` — **wait-timeout-aware** Ask)
|
||||
- Modify (compile-break fixes — interface gained a member): `tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests/Integration/ParentExecutionIdCorrelationTests.cs` (`BridgingInstanceRouter`) and the inline `IInstanceRouter` double in `tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/EndpointContentTypeTests.cs`
|
||||
- Test: `tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/RouteHelperTests.cs`
|
||||
|
||||
**Steps (TDD):**
|
||||
|
||||
1. **Commons records** (mirror `RouteToGetAttributes*`, value-equality only):
|
||||
```csharp
|
||||
public record RouteToWaitForAttributeRequest(
|
||||
string CorrelationId, string InstanceUniqueName, string AttributeName,
|
||||
string? TargetValueEncoded, TimeSpan Timeout, DateTimeOffset Timestamp,
|
||||
Guid? ParentExecutionId = null);
|
||||
public record RouteToWaitForAttributeResponse(
|
||||
string CorrelationId, bool Matched, object? Value, string? Quality, bool TimedOut,
|
||||
bool Success, string? ErrorMessage, DateTimeOffset Timestamp);
|
||||
```
|
||||
(`Success`/`ErrorMessage` = routing-level outcome, e.g. instance-not-found; `Matched`/`TimedOut`/`Value`/`Quality` = wait outcome.)
|
||||
|
||||
2. **`IInstanceRouter`** — add `Task<RouteToWaitForAttributeResponse> RouteToWaitForAttributeAsync(string siteId, RouteToWaitForAttributeRequest request, CancellationToken cancellationToken);`. **Update all 3 implementers** (prod `CommunicationServiceInstanceRouter` + the 2 test doubles listed above; the test doubles can return a canned response / throw NotImplemented only if never exercised — prefer a sane canned response).
|
||||
|
||||
3. **`CommunicationServiceInstanceRouter`** — delegate to `_communicationService.RouteToWaitForAttributeAsync(...)`.
|
||||
|
||||
4. **`RouteHelper.RouteTarget`** — add (mirror `GetAttributes`, throw on `!Success`):
|
||||
```csharp
|
||||
public async Task<bool> WaitForAttribute(string attributeName, object? targetValue, TimeSpan timeout, CancellationToken cancellationToken = default)
|
||||
{
|
||||
var token = Effective(cancellationToken);
|
||||
var siteId = await ResolveSiteAsync(token);
|
||||
var request = new RouteToWaitForAttributeRequest(Guid.NewGuid().ToString(), _instanceCode,
|
||||
attributeName, AttributeValueCodec.Encode(targetValue), timeout, DateTimeOffset.UtcNow, _parentExecutionId);
|
||||
var response = await _instanceRouter.RouteToWaitForAttributeAsync(siteId, request, token);
|
||||
if (!response.Success) throw new InvalidOperationException(response.ErrorMessage ?? "Remote attribute wait failed");
|
||||
return response.Matched;
|
||||
}
|
||||
```
|
||||
(`AttributeValueCodec` is in Commons.Types — add the using if needed.)
|
||||
|
||||
5. **`CommunicationService.RouteToWaitForAttributeAsync`** — mirror `RouteToGetAttributesAsync` BUT bound the Ask by the wait timeout, not the generic integration timeout:
|
||||
```csharp
|
||||
var envelope = new SiteEnvelope(siteId, request);
|
||||
var askTimeout = request.Timeout + _options.IntegrationTimeout; // slack beyond the wait
|
||||
return await GetActor().Ask<RouteToWaitForAttributeResponse>(envelope, askTimeout, cancellationToken);
|
||||
```
|
||||
|
||||
6. **Test** (`RouteHelperTests`): with a substitute `IInstanceRouter` returning a canned `RouteToWaitForAttributeResponse(Matched:true,...)`, `Route.To("x").WaitForAttribute("Flag", true, 30s)` returns true; `Success:false` → throws `InvalidOperationException`; the encoded target equals `AttributeValueCodec.Encode(true)`.
|
||||
|
||||
7. Build `Commons` + `InboundAPI` + `Communication` + the two affected test projects; run `--filter "FullyQualifiedName~RouteHelper"` + a build of AuditLog.Tests/InboundAPI.Tests to confirm the interface-addition compiles. Commit (pathspec).
|
||||
|
||||
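The two-layer outcome in the routed response (routing-level `Success` versus wait-level `Matched`/`TimedOut`) is easy to conflate, so here is a minimal sketch of how a caller could interpret it. The record is a local copy of the shape in step 1; `RoutedWait`, `Interpret`, and `AskTimeout` are hypothetical names used only for illustration.

```csharp
using System;

// Local stand-in mirroring the record in step 1; the real one lives in Commons.
public record RouteToWaitForAttributeResponse(
    string CorrelationId, bool Matched, object? Value, string? Quality, bool TimedOut,
    bool Success, string? ErrorMessage, DateTimeOffset Timestamp);

public static class RoutedWait
{
    // Routing failure -> exception; wait outcome (including a timeout) -> bool.
    public static bool Interpret(RouteToWaitForAttributeResponse response)
    {
        if (!response.Success)
            throw new InvalidOperationException(response.ErrorMessage ?? "Remote attribute wait failed");
        return response.Matched;
    }

    // The transport Ask is bounded by the wait timeout plus slack, so it outlives the wait itself.
    public static TimeSpan AskTimeout(TimeSpan waitTimeout, TimeSpan slack) => waitTimeout + slack;
}
```

A wait that simply times out still has `Success: true`, so the caller gets `false` rather than an exception; only routing problems such as an unknown instance throw.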
---
|
||||
|
||||
### Task WD-2b: Site unpacking + handler (§6, part 2)
|
||||
|
||||
**Classification:** high-risk (actor handler crossing into `InstanceActor`; Ask-timeout correctness)
|
||||
**Estimated implement time:** ~4 min
|
||||
**Parallelizable with:** none
|
||||
**blockedBy:** WD-2a
|
||||
|
||||
**Files:**
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/SiteCommunicationActor.cs` (add `Receive<RouteToWaitForAttributeRequest>(msg => _deploymentManagerProxy.Forward(msg));` next to the other RouteTo forwards ~line 145)
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` (`Receive<RouteToWaitForAttributeRequest>(RouteInboundApiWaitForAttribute);` + handler)
|
||||
- Test: `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Actors/DeploymentManagerActorTests.cs`
|
||||
|
||||
**Steps (TDD):**
|
||||
|
||||
1. **`SiteCommunicationActor`** — add the `Receive`/Forward line.
|
||||
|
||||
2. **`DeploymentManagerActor.RouteInboundApiWaitForAttribute`** — mirror `RouteInboundApiGetAttributes`:
|
||||
```csharp
|
||||
private void RouteInboundApiWaitForAttribute(RouteToWaitForAttributeRequest request)
|
||||
{
|
||||
if (!_instanceActors.TryGetValue(request.InstanceUniqueName, out var instanceActor))
|
||||
{
|
||||
Sender.Tell(new RouteToWaitForAttributeResponse(request.CorrelationId, false, null, null, false,
|
||||
false, $"Instance '{request.InstanceUniqueName}' not found on this site.", DateTimeOffset.UtcNow));
|
||||
return;
|
||||
}
|
||||
var sender = Sender;
|
||||
var inner = new WaitForAttributeRequest(request.CorrelationId, request.InstanceUniqueName,
|
||||
request.AttributeName, request.TargetValueEncoded, null /*predicate*/, request.Timeout,
|
||||
DateTimeOffset.UtcNow /*, RequireGoodQuality defaults false */);
|
||||
// Ask bounded by the WAIT timeout + slack (NOT a fixed 30s).
|
||||
instanceActor.Ask<WaitForAttributeResponse>(inner, request.Timeout + TimeSpan.FromSeconds(5))
|
||||
.ContinueWith(t => t.IsCompletedSuccessfully
|
||||
? new RouteToWaitForAttributeResponse(request.CorrelationId, t.Result.Matched, t.Result.Value,
|
||||
t.Result.Quality, t.Result.TimedOut, true, null, DateTimeOffset.UtcNow)
|
||||
: new RouteToWaitForAttributeResponse(request.CorrelationId, false, null, null, false, false,
|
||||
t.Exception?.GetBaseException().Message ?? "Attribute wait timed out", DateTimeOffset.UtcNow))
|
||||
.PipeTo(sender);
|
||||
}
|
||||
```
|
||||
(`WaitForAttributeRequest` lives in Commons `Messages/Instance` — add the using. Build with both the trailing-`RequireGoodQuality` and pre-field signatures in mind; passing 7 positional args + default is fine.)
|
||||
|
||||
3. **Test** (`DeploymentManagerActorTests`, mirror the routed get-attributes test): deploy/register an instance whose attribute already equals the target → `RouteToWaitForAttributeRequest` → `RouteToWaitForAttributeResponse(Success:true, Matched:true)`; unknown instance → `Success:false`.
|
||||
|
||||
4. Build `Communication` + `SiteRuntime` + SiteRuntime test project; run `--filter "FullyQualifiedName~DeploymentManagerActor"`. Commit (pathspec).
|
||||
|
||||
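The `ContinueWith` mapping in step 2 has three branches worth seeing in isolation: success, fault, and cancellation. The sketch below reproduces that mapping with plain tasks; `InnerResponse`, `RoutedResponse`, and `AskMapping` are simplified hypothetical stand-ins, not the real message types.

```csharp
using System;
using System.Threading.Tasks;

// Simplified stand-ins for WaitForAttributeResponse and RouteToWaitForAttributeResponse.
public record InnerResponse(bool Matched, bool TimedOut);
public record RoutedResponse(bool Matched, bool TimedOut, bool Success, string? ErrorMessage);

public static class AskMapping
{
    // Maps the inner Ask task to a routed response without ever throwing to the caller.
    public static Task<RoutedResponse> Map(Task<InnerResponse> ask) =>
        ask.ContinueWith(t => t.IsCompletedSuccessfully
            ? new RoutedResponse(t.Result.Matched, t.Result.TimedOut, true, null)
            : new RoutedResponse(false, false, false,
                // Exception is null for a cancelled task, so the fallback text covers that case.
                t.Exception?.GetBaseException().Message ?? "Attribute wait timed out"));
}
```

`GetBaseException()` unwraps the `AggregateException` that a faulted task carries, so the routed error message is the original exception's text.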
---
|
||||
|
||||
### Task WD-3: Integration — docs + full verification
|
||||
|
||||
**Classification:** standard
|
||||
**Estimated implement time:** ~4 min
|
||||
**Parallelizable with:** none
|
||||
**blockedBy:** WD-1, WD-2a, WD-2b
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/plans/2026-06-17-waitfor-attribute-change-helper-spec.md` (mark §3 `WaitForAsync`/`WaitResult`, §4.2 quality-gated mode, and §6 routed variant as IMPLEMENTED; note Test-Run sandbox parity excluded)
|
||||
- Modify: `docs/requirements/Component-SiteRuntime.md` (script-surface note: `Attributes.WaitForAsync` + `requireGoodQuality`) and `docs/requirements/Component-InboundAPI.md` (`Route.To(...).WaitForAttribute`) — brief, only if those docs enumerate the script surface
|
||||
- (No new component, no migration, no docker config change)
|
||||
|
||||
**Steps:**
|
||||
|
||||
1. Update the spec doc + component docs as above.
|
||||
2. **Full-solution build:** `dotnet build ZB.MOM.WW.ScadaBridge.slnx` — 0 errors.
|
||||
3. **Targeted test sweep** across everything touched:
|
||||
`dotnet test tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/... --filter "FullyQualifiedName~WaitForAttribute|FullyQualifiedName~WaitAsync|FullyQualifiedName~WaitForAsync|FullyQualifiedName~DeploymentManagerActor"`,
|
||||
`dotnet test tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/... --filter "FullyQualifiedName~RouteHelper"`,
|
||||
and a build of `tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests` + `tests/ZB.MOM.WW.ScadaBridge.Communication.Tests` to confirm no compile/regression from the interface addition.
|
||||
4. `git diff` review; commit (pathspec).
|
||||
|
||||
---
|
||||
|
||||
## Out of scope (explicit)
|
||||
|
||||
- Routed `WaitForAttribute` is NOT wired into the CentralUI Test-Run sandbox (`ISandboxInstanceGateway`/`SandboxInstanceGateway`); production inbound scripts get it. Follow-up if Test-Run parity is wanted.
|
||||
- No predicate or quality flag across the wire (§6 is value-equality only, per spec).
|
||||
- No docker redeploy (no cluster-runtime config change; additive script surface only).
|
||||
@@ -0,0 +1,10 @@
|
||||
{
|
||||
"planPath": "docs/plans/2026-06-17-waitfor-deferred-items.md",
|
||||
"tasks": [
|
||||
{"id": 1, "subject": "WD-1: site-local WaitForAsync + WaitResult + quality-gated mode (§3+§4.2)", "classification": "high-risk", "status": "pending", "parallelizableWith": [2]},
|
||||
{"id": 2, "subject": "WD-2a: routed contract + central path (§6 part 1)", "classification": "high-risk", "status": "pending", "parallelizableWith": [1]},
|
||||
{"id": 3, "subject": "WD-2b: site unpacking + DeploymentManager handler (§6 part 2)", "classification": "high-risk", "status": "pending", "blockedBy": [2]},
|
||||
{"id": 4, "subject": "WD-3: integration — docs + full verification", "classification": "standard", "status": "pending", "blockedBy": [1, 2, 3]}
|
||||
],
|
||||
"lastUpdated": "2026-06-17"
|
||||
}
|
||||
@@ -158,16 +158,32 @@ is per-run and flat — `WHERE ExecutionId = X` returns everything one run did,
|
||||
nothing links a run to the run that *spawned* it. `ParentExecutionId` carries the
|
||||
spawning execution's `ExecutionId`: a spawned run still gets its own fresh
|
||||
`ExecutionId`, and every audit row it emits also carries the spawner's id in
|
||||
`ParentExecutionId`. The first cut bridges the **inbound API → routed-site-script**
|
||||
case: an inbound request runs a method script that calls `Route.Call`, routing to
|
||||
a site instance; the routed site script records the inbound request's
|
||||
`ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
|
||||
itself is top-level (`ParentExecutionId` NULL). The pointer always references the
|
||||
*immediate* spawner, so a routed run that itself routes onward threads its own
|
||||
`ExecutionId` — walking `ParentExecutionId → ExecutionId` recursively
|
||||
reconstructs the call chain as a tree of arbitrary depth. The tag-cascade case
|
||||
(an attribute write triggering another script) is **deferred** — the model
|
||||
generalises to it with no schema change once that spawn point is threaded.
|
||||
`ParentExecutionId`. The pointer always references the *immediate* spawner, so a
|
||||
run that itself spawns further runs threads its own `ExecutionId` — walking
|
||||
`ParentExecutionId → ExecutionId` recursively reconstructs the call chain as a
|
||||
tree of arbitrary depth.
|
||||
|
||||
**Tag-cascade coverage (M5.4 T4):** `ParentExecutionId` threading now spans all
|
||||
known spawn points:
|
||||
|
||||
- **Inbound API → routed site script** — an inbound request runs a method script
|
||||
that calls `Route.Call`; the routed site script records the inbound request's
|
||||
`ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
|
||||
is top-level (`ParentExecutionId` NULL).
|
||||
- **Alarm-triggered on-trigger script** — when an alarm fires and its on-trigger
|
||||
script runs (via `AlarmActor → AlarmExecutionActor`), the alarm context's
|
||||
`ExecutionId` is carried as the run's `ParentExecutionId`. Currently the alarm
|
||||
subsystem has no Guid-typed firing id, so on-trigger runs are roots (NULL) in
|
||||
practice, but the wiring is in place for a future alarm `ExecutionId`.
|
||||
- **Nested `CallScript` / `CallShared` invocations** — when a script calls
|
||||
`Instance.CallScript(...)` or a shared script via `CallShared`, the calling
|
||||
execution's `ExecutionId` threads into the spawned run as its
|
||||
`ParentExecutionId`, making deeply nested call chains visible as a tree.
|
||||
|
||||
Attribute-write-triggered cascades (one tag change triggering another script via a
|
||||
tag subscription) are also wired: trigger-driven runs carry `ParentExecutionId =
|
||||
NULL` (top-level roots), and any nested `CallScript`/`CallShared` they perform
|
||||
chains as above. The schema is unchanged — no further tag-cascade work is deferred.
|
||||
|
||||
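To make the `ParentExecutionId → ExecutionId` walk concrete, here is a minimal in-memory sketch that rebuilds the tree from flat rows. `AuditRun` is a hypothetical three-field projection (the real rows carry many more columns), `ExecutionTree.Render` is an illustrative name, and the sketch assumes the data is acyclic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical minimal projection of an audit row.
public record AuditRun(Guid ExecutionId, Guid? ParentExecutionId, string Label);

public static class ExecutionTree
{
    // Renders each root (ParentExecutionId == null) and its descendants, indented by depth.
    public static string Render(IEnumerable<AuditRun> runs)
    {
        // A Lookup accepts a null key, so all roots group under children[null].
        var children = runs.ToLookup(r => r.ParentExecutionId);
        var sb = new StringBuilder();

        void Walk(AuditRun run, int depth)
        {
            sb.Append(' ', depth * 2).AppendLine(run.Label);
            foreach (var child in children[run.ExecutionId])
                Walk(child, depth + 1);
        }

        foreach (var root in children[null])
            Walk(root, 0);
        return sb.ToString();
    }
}
```

Grouping by the parent id once keeps the walk linear in the number of rows, which is the same idea an index on the parent column serves on the database side.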
## The Site-Local `AuditLog` (SQLite)
|
||||
|
||||
@@ -268,7 +284,34 @@ operational `SiteCalls` shape for the dispatcher and UI.
|
||||
|
||||
- **Default cap** — 8 KB for each of `RequestSummary` and `ResponseSummary`;
|
||||
raised to 64 KB on any error row (`Status IN ('Failed', 'Parked', 'Discarded')`).
|
||||
- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and `ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB (configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min 8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to other channels do not apply here. `PayloadTruncated = 1` is set only when the inbound ceiling is hit — verbatim capture is the normal case. The ceiling applies independently to each body. Header redaction and per-target body redactors still run before persistence.
|
||||
- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and
|
||||
`ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB
|
||||
(configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min
|
||||
8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to
|
||||
other channels do not apply here. `PayloadTruncated = 1` is set only when the
|
||||
inbound ceiling is hit — verbatim capture is the normal case. The ceiling
|
||||
applies independently to each body. Header redaction and per-target body
|
||||
redactors still run before persistence.
|
||||
- **Inbound ceiling hits (M5.3 T7).** Every time the `InboundMaxBytes` ceiling
|
||||
truncates a body an `IAuditInboundCeilingHitsCounter.Increment()` call fires.
|
||||
This counter is surfaced as `AuditInboundCeilingHits` on the central health
|
||||
snapshot (alongside `CentralAuditWriteFailures` / `AuditRedactionFailure`) so
|
||||
operators can detect persistently oversized payloads and raise the ceiling or
|
||||
add per-target body redactors.
|
||||
- **Request headers in `Extra` (M5.3 T7).** For `Channel = ApiInbound`, the
|
||||
`AuditWriteMiddleware` captures the inbound HTTP request headers (post-redaction
|
||||
— `Authorization`, `X-API-Key`, `Cookie`, `Set-Cookie`, and the configured
|
||||
`HeaderRedactList` are scrubbed before serialization) into the `Extra` JSON
|
||||
column under the key `"requestHeaders"`. This makes the full header envelope
|
||||
visible in the Audit Log UI's detail drawer and the CLI's `audit query` output
|
||||
without widening the schema.
|
||||
- **Per-method `SkipBodyCapture` (M5.3 T7).** `PerTargetOverrides` now includes
|
||||
a `SkipBodyCapture: true` flag. When set for an inbound API method, the audit
|
||||
row is always emitted (headers, status, duration, actor, etc. are recorded) but
|
||||
`RequestSummary` and `ResponseSummary` are left null. Use this for methods whose
|
||||
payloads are structurally large or contain secrets not covered by body redactors.
|
||||
Headers are still captured into `Extra.requestHeaders` (after redaction) even
|
||||
when `SkipBodyCapture` is true.
|
||||
- **Truncation** — UTF-8 byte-safe; `PayloadTruncated = 1` when applied. Full
|
||||
bodies are never stored.
|
||||
- **HTTP headers** — `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and
|
||||
@@ -311,16 +354,33 @@ MS SQL for direct-write events). Unredacted secrets never persist.
|
||||
## Retention & Purge
|
||||
|
||||
- **Central:** 365-day default based on `OccurredAtUtc`, configurable via
|
||||
`AuditLog:RetentionDays` (min 7, max 3650). Single global retention in v1 —
|
||||
no per-channel overrides.
|
||||
`AuditLog:RetentionDays` (min 30, max 3650).
|
||||
- **Partitioning:** monthly partitions on `OccurredAtUtc` from day one
|
||||
(`pf_AuditLog_Month` / `ps_AuditLog_Month`). Purge is a partition switch;
|
||||
there are no row-level deletes at central.
|
||||
(`pf_AuditLog_Month` / `ps_AuditLog_Month`). The global partition switch is
|
||||
channel-blind; it drops a whole month once every row in it is older than the
|
||||
global window. There are no row-level deletes at central for the global purge.
|
||||
- **Purge actor:** `AuditLogPurgeActor` singleton on the active central node
|
||||
runs daily, switches out any partition whose latest `OccurredAtUtc` is older
|
||||
than the retention window, and emits an `AuditLog:Purged` event (partition
|
||||
range, rowcount, duration). A partition-maintenance step rolls forward each
|
||||
month, creating the next month's partition ahead of time.
|
||||
than the retention window, then applies any per-channel overrides (see below),
|
||||
and emits an `AuditLog:Purged` event (partition range, rowcount, duration) per
|
||||
switched partition. A partition-maintenance step rolls forward each month,
|
||||
creating the next month's partition ahead of time.
|
||||
- **Per-channel retention overrides (M5.5 T3):** `AuditLog:PerChannelRetentionDays`
|
||||
is a dictionary keyed by canonical channel name (`ApiOutbound`, `DbOutbound`,
|
||||
`Notification`, `ApiInbound`) whose value is a retention window in days that
|
||||
must be strictly shorter than the global `RetentionDays` to take effect. After the daily
|
||||
partition switch-out, the purge actor runs a bounded, batched row DELETE
|
||||
(`PurgeChannelOlderThanAsync`) for each channel whose override is shorter than
|
||||
the global window — expiring rows of that channel earlier than the global
|
||||
partition switch would. Overrides equal to or longer than the global window are
|
||||
silently skipped (the global switch already covers them). The DELETE runs under
|
||||
`scadabridge_audit_purger` (the maintenance role); the append-only writer role
|
||||
is unaffected. Batch size is configurable via
|
||||
`AuditLogPurge:ChannelPurgeBatchSize` (default 5000). Each channel override
|
||||
runs in its own try/catch, mirroring the per-boundary error-isolation of the
|
||||
partition switch-out loop. Values are validated to be in
|
||||
`[30, RetentionDays]`; keys that are not a recognized `AuditChannel` enum name
|
||||
are rejected at startup.
|
||||
- **Sites:** daily site job; default 7-day retention (configurable, min 1,
|
||||
max 90). Respects the hard `ForwardState` invariant — `Pending` rows are
|
||||
never purged on age alone.
|
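The override rules above can be sketched as a small validation step. This is a minimal illustration in Python, not the component's actual code: the function name and the exception type are hypothetical, and it assumes the `[30, RetentionDays]` range check is applied first, with an override equal to the global window then skipped as a no-op.

```python
# Hypothetical sketch of PerChannelRetentionDays validation; names are illustrative.
KNOWN_CHANNELS = {"ApiOutbound", "DbOutbound", "Notification", "ApiInbound"}
MIN_OVERRIDE_DAYS = 30


def effective_channel_overrides(retention_days, per_channel):
    """Validate overrides and return only those that shorten retention."""
    effective = {}
    for channel, days in per_channel.items():
        # Unrecognized channel names are rejected (at startup, per the doc).
        if channel not in KNOWN_CHANNELS:
            raise ValueError(f"unknown audit channel: {channel!r}")
        # Values must fall inside [30, RetentionDays].
        if not MIN_OVERRIDE_DAYS <= days <= retention_days:
            raise ValueError(f"{channel}: {days} not in [30, {retention_days}]")
        # An override equal to the global window adds nothing; skip it.
        if days < retention_days:
            effective[channel] = days
    return effective
```

With the sample configuration (`RetentionDays: 365`, `ApiOutbound: 90`, `Notification: 180`), both overrides would be kept, and only those two channels would get the batched row DELETE.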
||||
@@ -340,10 +400,13 @@ MS SQL for direct-write events). Unredacted secrets never persist.
|
||||
**AuditExport** permission.
|
||||
- **Payload redaction at write.** See Payload Capture Policy. Unredacted
|
||||
secrets never persist; the safety net over-redacts on misconfiguration.
|
||||
- **Hash-chain tamper evidence — deferred to v1.x.** A future `RowHash` column,
|
||||
computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will be
|
||||
verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. Off by
|
||||
default in v1.
|
||||
- **Hash-chain tamper evidence (T1) — deferred to v1.x.** A future `RowHash`
|
||||
column, computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will
|
||||
be verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. The
|
||||
`verify-chain` CLI command is a no-op placeholder today. Off by default in v1.
|
||||
- **Parquet archival (T2) — deferred to v1.x.** Long-term cold storage of purged
|
||||
monthly partitions as Parquet files (suitable for offline analytics) will be
|
||||
added in a future milestone. T1 and T2 are not shipped as part of M5.
|
||||
- **Site SQLite security.** File permissions: read/write by the ScadaBridge
|
||||
service account only. Not backed up off-machine — site SQLite is a buffer,
|
||||
not a record.
|
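Since the hash-chain is deferred, nothing below reflects shipped behavior. As a rough sketch of what `SHA-256(prev.RowHash || canonical(row))` could look like, the following Python assumes a canonical form of sorted-key compact JSON and an all-zero genesis hash for the first row of a partition; both are assumptions, as are the function names.

```python
import hashlib
import json

# Assumption: the first row in a partition chains from 32 zero bytes.
GENESIS = b"\x00" * 32


def canonical(row):
    """Assumed canonical form: compact JSON with sorted keys, UTF-8 encoded."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def chain_hashes(rows):
    """Return the RowHash for each row, each one chained to its predecessor."""
    prev = GENESIS
    hashes = []
    for row in rows:
        prev = hashlib.sha256(prev + canonical(row)).digest()
        hashes.append(prev)
    return hashes


def verify_chain(rows, stored_hashes):
    """Recompute the chain offline and compare it with the stored hashes."""
    return chain_hashes(rows) == list(stored_hashes)
```

Because each hash covers the previous one, changing any row invalidates every later hash in the partition, which is what makes offline verification meaningful.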
||||
@@ -355,11 +418,22 @@ Point-in-time, computed from the central `AuditLog` table; global and per-site.
|
||||
- **Audit volume** — events/min landing in the central `AuditLog`; global plus per-site sparkline.
|
||||
- **Audit error rate** — % of central `AuditLog` rows with `Status IN ('Failed', 'Parked', 'Discarded')` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, permanent failures, parked deliveries) — NOT audit-writer health, which surfaces separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
|
||||
- **Audit backlog** — sum of `Pending` site rows across sites; click drills into a per-site breakdown.
|
||||
- **`AuditInboundCeilingHits`** (M5.3 T7) — rolling count of inbound API responses truncated by the `InboundMaxBytes` ceiling; surfaced on the central health snapshot alongside `CentralAuditWriteFailures`.
|
||||
|
||||
**Per-node stuck KPIs (M5.3 T6):** Both [Notification Outbox](Component-NotificationOutbox.md)
|
||||
and [Site Call Audit](Component-SiteCallAudit.md) now expose a
|
||||
`PerNodeNotificationKpiRequest` / `PerNodeSiteCallKpiRequest` message pair that
|
||||
groups the existing stuck, parked, and delivered-last-interval counts by the
|
||||
`SourceNode` that emitted the original row. This surfaces per-node breakdowns on
|
||||
the Health dashboard tiles and the Notification Outbox / Site Calls pages,
|
||||
making it possible to identify a single misbehaving node (e.g., `site-a:node-b`)
|
||||
as the source of a spike rather than a site-wide problem. The existing global and
|
||||
per-site KPI shapes are unchanged; the per-node slice is additive.
|
||||
|
||||
[Notification Outbox](Component-NotificationOutbox.md) and
|
||||
[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected — they remain
|
||||
sourced from `Notifications` and `SiteCalls` respectively. Audit Log KPIs
|
||||
describe the audit table itself.
|
||||
[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected for their
|
||||
operational dispatch responsibilities — they remain sourced from `Notifications`
|
||||
and `SiteCalls` respectively. Audit Log KPIs describe the audit table itself.
|
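The audit error rate described above can be illustrated with a short Python sketch. It is not the dashboard's actual query: the row shape (a timestamp plus a status string) and the function name are assumptions, and the `Delivered` status in the usage note is only a placeholder for a non-error status.

```python
from datetime import timedelta

# Statuses counted as errors, as listed in the KPI definition.
ERROR_STATUSES = {"Failed", "Parked", "Discarded"}


def audit_error_rate(rows, now, window=timedelta(minutes=5)):
    """Percent of rows in the rolling window whose status is an error status.

    rows: iterable of (occurred_at_utc, status) pairs (assumed shape).
    """
    recent = [status for occurred, status in rows if now - window < occurred <= now]
    if not recent:
        return 0.0  # assumption: an empty window reports 0% rather than N/A
    errors = sum(1 for status in recent if status in ERROR_STATUSES)
    return 100.0 * errors / len(recent)
```

For example, four rows inside the window with one `Failed` among them would yield 25.0, regardless of any older failures outside the window.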
||||
|
||||
## Configuration
|
||||
|
||||
@@ -370,21 +444,78 @@ component (Options pattern):
|
||||
"AuditLog": {
|
||||
"DefaultCapBytes": 8192,
|
||||
"ErrorCapBytes": 65536,
|
||||
"InboundMaxBytes": 1048576,
|
||||
"HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
|
||||
"GlobalBodyRedactors": [
|
||||
{ "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
|
||||
],
|
||||
"PerTargetOverrides": {
|
||||
"Weather/GetForecast": { "CapBytes": 4096 },
|
||||
"PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" }
|
||||
"PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" },
|
||||
"HighVolumeMethod": { "SkipBodyCapture": true }
|
||||
},
|
||||
"RetentionDays": 365
|
||||
"RetentionDays": 365,
|
||||
"PerChannelRetentionDays": {
|
||||
"ApiOutbound": 90,
|
||||
"Notification": 180
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
`PerTargetOverrides` keys bind by External System / Inbound Method /
|
||||
Notification List / Database Connection name. `RetentionDays` is a single
|
||||
global value in v1; per-channel overrides are deferred to v1.x.
|
||||
Notification List / Database Connection name. `SkipBodyCapture: true` omits
|
||||
`RequestSummary`/`ResponseSummary` for that method while still capturing headers
|
||||
into `Extra.requestHeaders` and emitting the full audit row. `RetentionDays` is
|
||||
the global window; `PerChannelRetentionDays` specifies per-channel windows that
|
||||
are strictly shorter — any channel whose override equals or exceeds the global
|
||||
value is silently ignored (the global partition switch-out already governs it).
|
||||
|
||||
The `AuditLogPurge` section controls the purge actor's cadence and batch size:
|
||||
|
||||
```jsonc
|
||||
"AuditLogPurge": {
|
||||
"IntervalHours": 24,
|
||||
"ChannelPurgeBatchSize": 5000
|
||||
}
|
||||
```
|
||||
|
||||
## Ops Notes — Historical Null Columns
|
||||
|
||||
### `SourceNode` backfill (M5.6 T5)
|
||||
|
||||
`SourceNode` (`varchar(64)` NULL) is a physical column stamped on every row at
|
||||
write time. Rows ingested before M5.6 shipped have `SourceNode IS NULL` because
|
||||
the value was not populated until the feature landed. A one-time CLI command sets
|
||||
these to a configurable sentinel:
|
||||
|
||||
```
|
||||
scadabridge audit backfill-source-node --before <ISO-8601-UTC> [--sentinel unknown] [--batch 5000]
|
||||
```
|
||||
|
||||
The default sentinel is `"unknown"`. The true node-of-origin for pre-feature rows
|
||||
is **unknowable** retroactively — the emitting node is long gone from the telemetry
|
||||
pipeline. The sentinel makes that explicit rather than leaving the column NULL
|
||||
(which the Audit Log UI's Node filter already treats as "unresolved", but which
|
||||
an operator might mistake for a data-quality bug).
|
||||
|
||||
The backfill runs via `POST /api/audit/backfill-source-node` (Admin role required)
|
||||
on the maintenance/purge path, NOT the append-only `scadabridge_audit_writer` role.
|
||||
It is idempotent and can be re-run safely.
|
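To show why a batched backfill of this kind is idempotent, here is a minimal Python sketch using an in-memory-style SQLite table as a stand-in for the central MS SQL `AuditLog`. The function name and the two-column schema are assumptions for illustration; the real endpoint's SQL is not shown in this document.

```python
import sqlite3


def backfill_source_node(conn, before, sentinel="unknown", batch=5000):
    """Set SourceNode to the sentinel on NULL rows older than `before`, in batches.

    Returns the number of rows updated. Timestamps are assumed to be ISO-8601
    UTC strings in one consistent format, so string comparison orders them.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE AuditLog SET SourceNode = ?
            WHERE rowid IN (
                SELECT rowid FROM AuditLog
                WHERE SourceNode IS NULL AND OccurredAtUtc < ?
                LIMIT ?
            )
            """,
            (sentinel, before, batch),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total  # nothing left to update
        total += cur.rowcount
```

The `SourceNode IS NULL` predicate is what makes a re-run safe: rows already stamped, whether by the writer or by an earlier backfill, no longer match, so a second run updates nothing.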
||||
|
||||
### `ExecutionId` and `ParentExecutionId` — cannot be backfilled
|
||||
|
||||
`ExecutionId` and `ParentExecutionId` are **PERSISTED COMPUTED columns** derived
|
||||
from `DetailsJson`. They were introduced in the same feature window as the column
|
||||
itself but their value comes from the JSON payload that was written at ingest time.
|
||||
|
||||
The AuditLog append-only invariant **forbids mutating `DetailsJson`** — rows may
|
||||
only be inserted, never updated. Because backfilling the computed values would
|
||||
require rewriting the underlying `DetailsJson`, it is impossible under the
|
||||
append-only contract. Pre-feature rows carry `NULL` in both columns permanently.
|
||||
|
||||
This is a documented limitation, not a defect. The NULL values are visible in the
|
||||
Audit Log UI's execution-tree drilldown (rows with no `ExecutionId` appear as
|
||||
orphaned entries) and in the CLI's `audit tree` output.
|
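The effect of NULL `ExecutionId` values on the execution tree can be sketched as follows. This Python illustration is an assumption-laden stand-in for the server-side walk, not `GetExecutionTreeAsync` itself: it takes rows as `(ExecutionId, ParentExecutionId)` pairs, resolves the root by walking parents upward, then renders the tree downward. Rows with no `ExecutionId` cannot be placed and are left out. The function names are hypothetical, and the renderer assumes the parent links contain no cycles.

```python
def build_execution_tree(rows, execution_id):
    """Return (root_id, children) for the chain containing `execution_id`.

    rows: iterable of (execution_id, parent_execution_id) pairs (assumed shape).
    """
    parent_of = {}
    for eid, parent in rows:
        if eid is None:
            continue  # pre-feature row: cannot be attached to any execution
        parent_of.setdefault(eid, parent)
    if execution_id not in parent_of:
        raise KeyError(execution_id)

    # Walk upward until there is no known parent (guarding against cycles).
    root, seen = execution_id, {execution_id}
    while parent_of.get(root) in parent_of and parent_of[root] not in seen:
        root = parent_of[root]
        seen.add(root)

    # Index children so the tree can be traversed downward from the root.
    children = {}
    for eid, parent in parent_of.items():
        if parent is not None:
            children.setdefault(parent, []).append(eid)
    return root, children


def render(node, children, depth=0):
    """Render an indented tree, two spaces per level (assumes no cycles)."""
    lines = ["  " * depth + node]
    for child in sorted(children.get(node, [])):
        lines.extend(render(child, children, depth + 1))
    return lines
```

Starting from any execution in a chain, the same root and the same tree come back, which matches the resolve-from-any-node behavior described for `audit tree`.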
||||
|
||||
## Dependencies
|
||||
|
||||
@@ -442,6 +573,8 @@ global value in v1; per-channel overrides are deferred to v1.x.
|
||||
tiles (Volume, Error rate, Backlog) plus new health metrics:
|
||||
`SiteAuditBacklog`, `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`,
|
||||
`CentralAuditWriteFailures`, `AuditRedactionFailure`.
|
||||
- **[CLI (#19)](Component-CLI.md)** — new `scadabridge audit query`,
|
||||
`scadabridge audit export`, and `scadabridge audit verify-chain` commands; same
|
||||
permission requirements as the UI.
|
||||
- **[CLI (#19)](Component-CLI.md)** — `scadabridge audit query`,
|
||||
`scadabridge audit export`, `scadabridge audit tree --execution-id <guid>`,
|
||||
`scadabridge audit backfill-source-node --sentinel <s> --before <date>`, and
|
||||
`scadabridge audit verify-chain` (no-op placeholder for the deferred hash-chain
|
||||
feature); same permission requirements as the UI.
|
||||
|
||||
@@ -228,14 +228,17 @@ The new centralized Audit Log component (#23) is exposed via the `scadabridge au
|
||||
The `scadabridge audit` group targets the centralized Audit Log component (#23) and
|
||||
exposes the UI-equivalent operational audit surface. Permissions follow the same
|
||||
read-vs-export split the Central UI uses (see Component-AuditLog.md, Security &
|
||||
Tamper-Evidence, and Security & Auth #10): `audit query` and `audit verify-chain`
|
||||
require the `OperationalAudit` permission; `audit export` additionally requires
|
||||
`AuditExport`. The server enforces permission checks and returns HTTP 403 (CLI
|
||||
exit code 2) on denial.
|
||||
Tamper-Evidence, and Security & Auth #10): `audit query`, `audit tree`, and
|
||||
`audit verify-chain` require the `OperationalAudit` permission; `audit export`
|
||||
additionally requires `AuditExport`; `audit backfill-source-node` requires the
|
||||
`Admin` role (maintenance path only). The server enforces permission checks and
|
||||
returns HTTP 403 (CLI exit code 2) on denial.
|
||||
|
||||
```
|
||||
scadabridge audit query [--since <t>] [--until <t>] [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>] [--correlation-id <id>] [--execution-id <id>] [--parent-execution-id <id>] [--errors-only] [--page-size <n>] [--all]
|
||||
scadabridge audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path> [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>]
|
||||
scadabridge audit tree --execution-id <guid> [--format table|json]
|
||||
scadabridge audit backfill-source-node --before <ISO-8601-UTC> [--sentinel <value>] [--batch <n>]
|
||||
scadabridge audit verify-chain --month <YYYY-MM>
|
||||
```
|
||||
|
||||
@@ -247,6 +250,18 @@ scadabridge audit verify-chain --month <YYYY-MM>
|
||||
requested format (`csv`, `jsonl`, `parquet`) written to `--output`. The server
|
||||
streams rows rather than materializing them in memory; the CLI writes bytes
|
||||
through to disk. Supports the same scoping filters as `audit query`.
|
||||
- `audit tree --execution-id <guid>` (M5.3 T8) — renders the full execution-chain
|
||||
tree for the given `ExecutionId`. The server resolves the root from any node in
|
||||
the chain (walks `ParentExecutionId` to find the root, then traverses downward)
|
||||
and returns all reachable executions with their summary row counts and first/last
|
||||
occurred timestamps. Output format: `json` (default — structured tree suitable
|
||||
for scripting) or `table` (human-readable indented tree). Requires
|
||||
`OperationalAudit` permission. Backed by `GET /api/audit/tree?executionId=<guid>`.
|
||||
- `audit backfill-source-node --before <ISO-8601-UTC>` (M5.6 T5) — sets
|
||||
`SourceNode` to a sentinel value (`--sentinel`, default `"unknown"`) on pre-feature
|
||||
rows where `SourceNode IS NULL` and `OccurredAtUtc < --before`, in batches
|
||||
(`--batch`, default 5000). Admin-only maintenance command. Idempotent.
|
||||
Backed by `POST /api/audit/backfill-source-node`.
|
||||
- `audit verify-chain` — hash-chain verification for the named month.
|
||||
**No-op in v1**: the command is defined so the command tree is stable, but
|
||||
verification only becomes meaningful once the hash-chain ships (see
|
||||
@@ -366,7 +381,7 @@ Configuration is resolved in the following priority order (highest wins):
|
||||
- **System.CommandLine**: Command-line argument parsing.
|
||||
- **Microsoft.AspNetCore.SignalR.Client**: SignalR client for the `debug stream` command's WebSocket connection.
|
||||
- **Management Service (#18)**: The CLI hits the central cluster via the existing HTTP Management API (`POST /management`), which dispatches to the ManagementActor. The `scadabridge audit` command group rides a parallel REST surface on the same Host (`GET /api/audit/query` and `GET /api/audit/export`), sharing HTTP Basic Auth with `/management` but bypassing the actor for read-only, keyset-paged / streaming workloads.
|
||||
- **Audit Log (#23)**: The `scadabridge audit query` and `audit export` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`) on the Host's Management API surface; `audit verify-chain` rides `POST /management` until hash-chain verification ships. Permission checks (`OperationalAudit`, `AuditExport`) are enforced server-side by `AuditEndpoints`.
|
||||
- **Audit Log (#23)**: The `scadabridge audit query`, `audit export`, `audit tree`, and `audit backfill-source-node` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`, `GET /api/audit/tree`, `POST /api/audit/backfill-source-node`) on the Host's Management API surface; `audit verify-chain` is a client-side no-op today (hash-chain deferred to v1.x). Permission checks (`OperationalAudit`, `AuditExport`, `Admin`) are enforced server-side by `AuditEndpoints`.
|
||||
|
||||
## Interactions
|
||||
|
||||
|
||||
@@ -189,6 +189,7 @@ Inbound API scripts **cannot** call shared scripts directly — shared scripts a
|
||||
- `Route.To("instanceUniqueCode").GetAttributes("attr1", "attr2", ...)` — Read multiple attribute values in a **single call**, returned as a dictionary of name-value pairs.
|
||||
- `Route.To("instanceUniqueCode").SetAttribute("attributeName", value)` — Write a single attribute value on a specific instance at any site.
|
||||
- `Route.To("instanceUniqueCode").SetAttributes(dictionary)` — Write multiple attribute values in a **single call**, accepting a dictionary of name-value pairs.
|
||||
- `Route.To("instanceUniqueCode").WaitForAttribute("attributeName", targetValue, timeout)` — Wait, event-driven, until an attribute on a specific instance at any site reaches `targetValue` (value-equality only across the wire), bounded by `timeout`. Returns `true` if matched within the timeout, `false` if it timed out. The cluster call is bounded by the wait timeout rather than the generic integration timeout.
|
||||
|
||||
#### Input/Output
|
||||
- **Input parameters** are available as defined in the method definition.
|
||||
|
||||