Files
ScadaBridge/docs/plans/2026-06-15-stillpending-m2-implementation.md
T
Joseph Doherty 4b6187c853 feat(inbound-api): nested Object/List extended-type validation (#13)
Object/List parameters and return values were shape-validated only (object vs
array), with no field-level/nested type checks — type-wrong nested data passed
inbound validation and failed only at script runtime. Add recursive type
validation (declared Object field types, List element type, scalars at any depth)
with path-qualified errors, symmetric across ParameterValidator and ReturnValueValidator.

Both validators now parse the canonical JSON Schema definition format (the
Central UI / MigrateParametersToJsonSchema output) via a shared recursive engine,
Commons.Types.InboundApi.InboundApiSchema, instead of the legacy flat
[{name,type}] array which they could not even deserialize from migrated rows.
The legacy flat-array form is still accepted on read for transition safety.
Undeclared fields are rejected at every level (consistent with the existing
top-level unexpected-parameter rejection); a present-but-null value satisfies
any type, only absence of a required field is an error.
2026-06-15 15:04:28 -04:00

204 lines
29 KiB
Markdown

# M2 — Correctness & Behavioral Gaps (Tier 2) Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: superpowers-extended-cc:subagent-driven-development. Execute task-by-task on branch `feature/stillpending-m2`, in-place (NOT a worktree — docker tooling builds from the repo path; implementers run **serially** to avoid racing the shared git index). Honor each task's `Classification` for the review chain.
**Goal:** Close the Tier-2 correctness/behavioral divergences from `stillpending.md` — make narrow/inert behaviors match the spec, and where the spec was the divergence, update it in the same slice.
**Architecture:** Touches the central Config DB (EF migrations), Site Runtime actors, the DCL alarm pipeline, the template validation/flattening pipeline, the deployment diff, Host startup validation, the Security cookie-auth pipeline, and Site Event Logging. Each item is independently shippable.
**Tech Stack:** C#/.NET 10, EF Core 10 (MS SQL central + SQLite site), Akka.NET 1.5, OPC UA (`OPCFoundation.NetStandard.Opc.Ua.Client`), ASP.NET Core cookie auth, xUnit/FluentAssertions/NSubstitute.
**Build/verify:** `dotnet build ZB.MOM.WW.ScadaBridge.slnx` (TreatWarningsAsErrors ON). Redeploy: `bash docker/deploy.sh`. Test user `--username multi-role --password password`.
---
## Scope decisions (recorded; per "use recommendations")
- **#19 (script started/completed events)** — already shipped in M1.8 (`e74c3ae`). **Excluded.**
- **#16 (Transport stale-instance enumeration)** — genuine Tier-2 gap but NOT in the approved M2 list, and the fix needs a non-trivial shared-script-hash staleness compute across instances. **Deferred to the Transport milestone (M8).** Tracked, not dropped.
- **#17 (MachineDataDb)** — a deliberate prior decision ("Host-008") removed this validation with a regression test asserting absence *passes*. The approved design doc says to add the option + startup validation, and both REQ-HOST-3/4 and the shipped docker `appsettings.Central.json` carry the key. **Resolution: implement per design doc (add option + central startup validation, no DbContext since nothing consumes it), reverting the Host-008 regression test and noting the reversal in the commit.**
- **#31 (StateTransitionValidator delete-from-NotDeployed)** — the audit claimed a "deliberate per code comment"; investigation found **no such comment**. **Reconcile by intent (git blame); default = align code to the spec matrix (remove `NotDeployed` from `CanDelete`) unless blame shows deliberate orphan-cleanup intent, in which case update the doc matrix.**
- **#8 (conditionFilter) semantics** — the filter is currently an undefined nullable string. **Define it as a comma-separated, case-insensitive list of alarm/condition *type names*; null/blank = mirror all.** Authoritative enforcement is **client-side in `DataConnectionActor` routing** (uniform across OPC UA + MxGateway, since MxGateway has no server-side filter); OPC UA additionally gets a server-side `WhereClause` as a bandwidth optimization where the type maps cleanly. Implementer confirms the discriminator field on `NativeAlarmTransition`.
- **#15 (LDAP re-query)** — highest risk; passwordless group re-query depends on a shared-lib capability that may not exist. **Spike first**, then ship the always-achievable layers (idle-timeout enforcement + DB role-mapping refresh on stored group claims) and the LDAP group re-query only if the lib supports a service-account search; document any residual limitation.
---
## Execution order & dependencies
Risk-first, migration-safe ordering. `#32` first (unblocks DB-backed verification). The two migration-touching tasks (`#32`, M2.5) are serialized so the snapshot stays clean.
| # | Task | Class | Migration? |
|---|------|-------|-----------|
| #32 | M2.0 EF model/snapshot drift | high-risk | snapshot only |
| M2.1 | #22 native-alarm capability validation wired into deploy pipeline | standard | no |
| M2.2 | #10 connection-level diff surfaced | standard | no |
| M2.3 | #7 `Database.CachedWrite` transient/permanent classification | high-risk | no |
| M2.4 | #8 alarm `conditionFilter` applied | high-risk | no |
| M2.5 | #9 per-script execution timeout | standard | **yes** (new column) |
| M2.6 | #13 nested `Object`/`List` validation | standard | no |
| M2.7 | #20 + #21 return-type + argument-type compatibility | standard | no |
| M2.8 | #23 binding-completeness Error + "name exists at site" | standard | no |
| M2.9 | #17 MachineDataDb fail-fast | small | no |
| M2.10 | #18 CI grep-guard (data-layer scan test) | small | no |
| M2.11 | #24 debug snapshot unknown-instance → error | small | no |
| M2.12 | #25 recursion-limit → site event log | small | no |
| M2.13 | #27 OPC UA / MxGateway transition field population | small | no |
| M2.14 | #28 readiness "required singletons running" probe | standard | no |
| M2.15 | #29 site active-node purge-gate DI registration | small | no |
| M2.16 | #30 `FailedWriteCount` consumed by Health Monitoring | small | no |
| M2.17 | #31 StateTransitionValidator reconcile | small | no |
| M2.18 | #26 debug-stream ordering + replay/dedup | high-risk | no |
| M2.19 | #15 LDAP periodic re-query (spike + impl) | high-risk | no |
---
## Tasks
### M2.0 — #32: EF model/snapshot drift (PendingModelChangesWarning)
**Classification:** high-risk · **Files:** `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Configurations/AuditLogEntityTypeConfiguration.cs:68-69`, `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Migrations/ScadaBridgeDbContextModelSnapshot.cs`, possibly a new empty migration.
**Root cause:** `OccurredAtUtc` has `.HasConversion(UtcConverter)` in config; the model snapshot omits the converter annotation → EF throws `PendingModelChangesWarning` in `MsSqlMigrationFixture.MigrateAsync` (~57 AuditLog MSSQL tests fail in fixture ctor).
**Fix:** Run `dotnet ef migrations has-pending-model-changes` (or `migrations add`) against `ScadaBridgeDbContext` to surface the FULL drift (there may be more than `OccurredAtUtc`). Prefer the EF-canonical path: `dotnet ef migrations add ResyncAuditLogModelSnapshot`**verify the generated migration's `Up`/`Down` are empty (no DDL)**; a value-converter-only change produces no DDL but realigns the snapshot. If non-empty/unexpected DDL appears, stop and report. Auto-apply is dev-only per CLAUDE.md, so an empty migration is harmless to prod.
**Tests:** Re-run `dotnet test tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests` (requires MSSQL via `cd infra && docker compose up -d`); the ~57 fixture-ctor failures must clear. If MSSQL is unreachable in this environment, confirm the build + the snapshot diff is empty-DDL and note the test gating.
**DoD:** No `PendingModelChangesWarning`; AuditLog MSSQL suite green (or gated-with-note if no DB). Adversarial: confirm no real schema change was smuggled in.
### M2.1 — #22: native-alarm-source capability validation wired into deploy pipeline
**Classification:** standard · **Files:** `src/.../DeploymentManager/FlatteningPipeline.cs:93,115`, `SemanticValidator.cs:30-33,239-245`, validation service call site.
**Gap (M1-era regression):** `FlatteningPipeline` loads `dataConnections` but never extracts the alarm-capable subset, so `SemanticValidator.Validate(...)` is always called with `alarmCapableConnectionNames = null` → native-alarm-source capability check never runs; a source can reference a non-alarm-capable connection and deploy.
**Fix:** In `FlatteningPipeline`, compute the alarm-capable connection-name set from the loaded connections (filter by the protocol/capability that maps to `IAlarmSubscribableConnection` — OPC UA + MxGateway), pass it into the validator. Confirm the capability predicate (protocol enum / adapter capability) is the same one DCL uses to decide `IAlarmSubscribableConnection`.
**Tests:** `tests/.../TemplateEngine.Tests` SemanticValidator/flattening — add: native-alarm source on a non-alarm-capable connection → validation Error; on a capable one → passes.
**DoD:** Deploy gate rejects native-alarm sources bound to non-capable connections.
### M2.2 — #10: connection-level diff surfaced in deployment diff
**Classification:** standard · **Files:** `src/.../Commons/Types/Flattening/ConfigurationDiff.cs:7-24`, `src/.../TemplateEngine/Flattening/DiffService.cs:19-54,174-204`, Central UI diff render (`CentralUI/Components/Shared/DiffDialog.razor` caller / deployment preview page).
**Gap:** `ComputeConnectionsDiff` exists **with tests** but is dead (no callers); `ConfigurationDiff` has no `ConnectionChanges` slot; `HasChanges` ignores connections.
**Fix:** Add `ConnectionChanges` slot (`IReadOnlyList<DiffEntry<ConnectionConfig>>` — the element type already exists) to `ConfigurationDiff`; include it in `HasChanges`. Call `ComputeConnectionsDiff` from `ComputeDiff` and populate the slot. Surface in the deployment-diff UI alongside attribute/alarm/script changes (connection name + old/new protocol + endpoint config). Wire the existing `ComputeConnectionsDiff` tests' expectations through `ComputeDiff` too.
**Tests:** `tests/.../TemplateEngine.Tests/Flattening/DiffServiceTests.cs` — add a `ComputeDiff` integration assertion that `ConnectionChanges` populates and `HasChanges` is true when only a connection differs.
**DoD:** Standalone connection endpoint/protocol/failover drift appears in the deployment diff.
### M2.3 — #7: `Database.CachedWrite` classifies transient vs permanent SQL errors
**Classification:** high-risk · **Files:** `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:78-204`, new `SqlErrorClassifier.cs` + `PermanentDatabaseException`, reference `ExternalSystemClient.cs:80-162` + `ErrorClassifier.cs`.
**Gap:** `CachedWriteAsync` buffers ALL writes without an immediate attempt; `DeliverBufferedAsync` throws on any `SqlException` → S&F retries permanent errors forever; the script never gets a synchronous `Failed`. The API path (`ExternalSystemClient`) does it right.
**Fix (mirror API path):** Add `SqlErrorClassifier.IsTransient(SqlException)` — transient = connection/timeout/deadlock/throttle error numbers (e.g. `-2, 64, 53, 233, 1205, 40197, 40501, 40613, 49918-49920`); permanent = constraint/syntax/permission/etc. Create `PermanentDatabaseException` (parallel to `PermanentExternalSystemException`). In `CachedWriteAsync`: attempt immediately; on success done; on permanent → return `Failed` synchronously (set the tracking row terminal `Failed`) and do NOT buffer; on transient → buffer to S&F. In `DeliverBufferedAsync`: classify on `SqlException`, return `false` (park) for permanent, rethrow for transient (S&F retries). Keep behavior unified with `TrackedOperationId`/`OperationTrackingStore` and the `Pending → Retrying → Delivered/Parked/Failed/Discarded` lifecycle.
**Tests:** `tests/.../ExternalSystemGateway.Tests/DatabaseGatewayTests.cs` — transient SQL (deadlock 1205, timeout -2) → buffers/retries; permanent SQL (constraint 2627, syntax 102, permission 229) → synchronous `Failed`, not buffered; `DeliverBuffered` parks on permanent. Adversarial: ambiguous error numbers default to the safer classification (document which).
**DoD:** Permanent SQL errors fail fast to the script as `Failed`; only transient errors buffer.
### M2.4 — #8: alarm `conditionFilter` applied (OPC UA WhereClause + client-side routing)
**Classification:** high-risk · **Files:** `src/.../DataConnectionLayer/Actors/DataConnectionActor.cs:1482,1540-1554`, `Adapters/RealOpcUaClient.cs:242,295-310`, `Adapters/MxGatewayDataConnection.cs:154-167`, `IAlarmSubscribableConnection.cs`.
**Decision (semantics):** filter = comma-separated, case-insensitive list of alarm/condition **type names**; null/blank = mirror all. **Authoritative gate = client-side in `DataConnectionActor.HandleAlarmTransitionReceived`** (after source-ref match, drop transitions whose type name isn't in the source's filter set). Store the per-source filter set correctly (the current `_alarmSourceFilter[...]` keying is wrong — key by source reference). OPC UA additionally builds a server-side `WhereClause` in `RealOpcUaClient` as an optimization where the condition type maps cleanly; MxGateway relies solely on the client-side gate.
**Fix:** (1) Parse the filter string into a normalized set at subscribe time, keyed by source ref. (2) In routing, consult the set and skip non-matching transitions. (3) In `RealOpcUaClient.BuildAlarmEventFilter`, attach a `WhereClause` (ContentFilter on the condition/event type) built from the filter when present. Confirm `NativeAlarmTransition` exposes a usable type-name discriminator; if not, filter on the available field and note it.
**Tests:** `tests/.../DataConnectionLayer.Tests/DataConnectionActorAlarmTests.cs` — filter set → only matching-type transitions delivered; null → all delivered; MxGateway path filters client-side; OPC UA builds a non-empty WhereClause. Adversarial: case/whitespace variations in the filter list.
**DoD:** Setting a conditionFilter actually restricts mirrored conditions across both adapters.
### M2.5 — #9: per-script execution timeout
**Classification:** standard · **Migration: yes.** · **Files:** `Commons/Entities/Templates/TemplateScript.cs`, `ConfigurationDatabase/Configurations/TemplateConfiguration.cs` (`TemplateScriptConfiguration`), **new EF migration**, `Commons/Types/Flattening/FlattenedConfiguration.cs` (`ResolvedScript`), `TemplateEngine/Flattening/FlatteningService.cs` (`ResolveInheritedScripts`), `SiteRuntime/Actors/ScriptActor.cs`, `ScriptExecutionActor.cs:100`, `AlarmExecutionActor.cs:66`, `SiteRuntimeOptions.cs:31` (global fallback unchanged).
**Gap:** Only a global `ScriptExecutionTimeoutSeconds`; no per-script field. Mirror the existing nullable `MinTimeBetweenRuns` pattern end-to-end.
**Fix:** Add `int? ExecutionTimeoutSeconds` to `TemplateScript` + EF config (nullable) + **migration** (runs after M2.0 so the snapshot is clean) + `ResolvedScript` + flattening map + `ScriptActor` field; pass it into `ScriptExecutionActor`/`AlarmExecutionActor`, which compute `effective = perScript ?? options.ScriptExecutionTimeoutSeconds`. Validate non-negative.
**Tests:** flattening test (field threads through), actor test (per-script override vs global default both enforce the CTS timeout), EF round-trip test.
**DoD:** A per-script timeout overrides the global; absent → global default.
### M2.6 — #13: nested `Object`/`List` extended-type validation
**Classification:** standard · **Files:** `src/.../InboundAPI/.../ParameterValidator.cs:109-145`, `ReturnValueValidator.cs:18`.
**Gap:** `Object`/`List` are shape-validated only (object-vs-array); no nested/field-level type validation.
**Fix:** Recursive descent through the declared `Object` field schema / `List` element type, type-checking each level (scalars by extended-type, nested Object/List recursively). Reuse the existing extended-type system; keep error messages path-qualified (`field.sub[2].x`). Apply symmetrically in both validators.
**Tests:** `tests/.../InboundAPI.Tests` — valid nested payload passes; wrong scalar type at depth, wrong list element type, missing required nested field → rejected with path.
**DoD:** Nested type mismatches are caught at inbound validation, not at script runtime. (Satisfies the M4 cross-reference to this item.)
**Status: complete.** A shared recursive engine, `Commons.Types.InboundApi.InboundApiSchema` (parse + path-qualified `Validate`), backs both validators so they cannot drift. Key finding: the canonical persisted/authored format is **JSON Schema** (object `properties` + `required`, array `items`) — produced by the Central UI schema builder and the `MigrateParametersToJsonSchema` migration — but the validators still parsed the *legacy flat array* `[{name,type}]` and only shape-checked `Object`/`List`. They could not even consume a migrated JSON-Schema-object definition (the `Deserialize<List<…>>` would fail). Rewriting both to read `InboundApiSchema` fixes that latent format mismatch *and* delivers true nested validation; the legacy flat array is still accepted on read (case-insensitive keys) for transition safety. **Undeclared-field policy: reject at every level** (a declared object rejects any field not in its `properties`, consistent with the existing top-level `InboundAPI-010` "unexpected parameter" rejection); a bare `{"type":"object"}` with no declared fields stays shape-only. A present-but-null value satisfies any type; only the *absence* of a required field is an error.
### M2.7 — #20 + #21: return-type + argument-type compatibility checks
**Classification:** standard · **Files:** `src/.../TemplateEngine/Validation/SemanticValidator.cs:62-63,251-266,279-287,390-425`.
**Gap:** `BuildReturnMap` builds maps never read (no return-type comparison); call validation checks arg *count* only (comma counting), not arg *types*.
**Fix:** #20 — compare a call site's used-return against the target script's declared `ReturnDefinition`; flag incompatible use. #21 — extract/infer argument types at the call site and check each against the parameter definition (count + type). These share `SemanticValidator` — implement together. Be conservative: only flag clear mismatches (avoid false positives on dynamically-typed expressions); document the inference limits.
**Tests:** `tests/.../TemplateEngine.Tests` SemanticValidator — return-type mismatch flagged; arg type mismatch flagged; correct calls pass; dynamic/unknown types don't false-positive.
**DoD:** Type-incompatible script calls fail validation, not just count-mismatched ones.
### M2.8 — #23: connection-binding completeness as deploy-gating Error + "name exists at site"
**Classification:** standard · **Files:** `src/.../TemplateEngine/Validation/ValidationService.cs:504-519`, `ValidationResult.cs:9`.
**Gap:** Missing-binding for a data-sourced attribute is a non-blocking Warning (so `IsValid` stays true); the "connection name exists at the target site" half is missing.
**Fix:** Elevate binding-completeness to Error (or add a parallel Error-level check) so a deployment with unresolved bindings fails the gate; add the "binding references a connection that exists on the target site" check (resolve by site connection, not just name presence). Confirm this doesn't break legitimately-unbound attributes (static/non-data-sourced) — only data-sourced attributes require a binding.
**Tests:** `tests/.../TemplateEngine.Tests` ValidationService — data-sourced attribute with no binding → Error + `IsValid` false; binding to a non-existent site connection → Error; static attribute without binding → OK.
**DoD:** Incomplete/invalid connection bindings block deploy.
### M2.9 — #17: MachineDataDb fail-fast (per design doc; reverts Host-008)
**Classification:** small · **Files:** `src/ZB.MOM.WW.ScadaBridge.Host/DatabaseOptions.cs:6-12`, `StartupValidator.cs:59-62`, `tests/.../Host.Tests/StartupValidatorTests.cs` (the `Central_MissingMachineDataDb_PassesValidation` regression).
**Fix:** Add `string? MachineDataDb` to `DatabaseOptions`; add a Central-only `Require("ScadaBridge:Database:MachineDataDb", non-empty, ...)` in `StartupValidator`. **No DbContext** (nothing consumes it). Revert the Host-008 regression test to expect failure when missing; add `MachineDataDb` to `ValidCentralConfig()`. Commit message must note the deliberate Host-008 reversal and cite REQ-HOST-3/4 + shipped docker appsettings as justification.
**Tests:** `StartupValidatorTests` — Central missing MachineDataDb → fails; present → passes; Site role unaffected.
**DoD:** Central nodes fail fast on empty MachineDataDb; spec REQ-HOST-4 satisfied.
### M2.10 — #18: CI grep-guard against UPDATE/DELETE on AuditLog
**Classification:** small · **Files:** new guard test in `tests/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase.Tests/` (the only thing that actually runs — no CI service exists; build is Docker-only).
**Fix:** Add a test that scans the ConfigurationDatabase source tree (and migration SQL) for `UPDATE`/`DELETE` statements targeting `AuditLog`, failing if any are found in C# data-access code. Scope strictly to the `AuditLog` table (allow purge/delete on Notifications/SiteCalls and partition-switch DDL). This backstops the existing DB-role `DENY UPDATE/DELETE` (migration `20260602174346`). Optionally add an MSBuild target mirroring it, but the test is the enforced control.
**Tests:** the guard test itself; verify it passes on current clean source and would fail on a planted violation (assert via a unit on the scanner helper).
**DoD:** A code-level guard fails the test run on AuditLog mutations.
### M2.11 — #24: debug snapshot/subscribe for unknown instance returns an error
**Classification:** small · **Files:** `src/.../DeploymentManager/.../DeploymentManagerActor.cs:845-866`.
**Gap:** Unknown-instance snapshot/subscribe returns an empty snapshot — caller can't distinguish "not deployed" from "deployed-but-empty".
**Fix:** Check instance registration first; return an explicit "instance not found"/not-deployed error response (matching the existing debug response contract) instead of an empty snapshot.
**Tests:** `tests/.../DeploymentManager` (or SiteRuntime) — unknown instance → error response; known empty instance → empty snapshot (unchanged).
**DoD:** Unknown-instance debug requests are distinguishable from empty ones.
### M2.12 — #25: recursion-limit error → site event log
**Classification:** small · **Files:** `src/.../SiteRuntime/.../ScriptRuntimeContext.cs:302-305,464-466` (thread `ISiteEventLogger` in, mirroring M1.8's `ScriptExecutionActor` wiring).
**Fix:** Inject `ISiteEventLogger` into `ScriptRuntimeContext`; on recursion-limit violation, emit a `script` Error site event (fire-and-forget `_ = logger?.LogEventAsync(...)`) in addition to the existing `ILogger` log, at both check sites.
**Tests:** `tests/.../SiteRuntime.Tests` — recursion-limit hit emits a site event with category `script`, severity Error.
**DoD:** Recursion-limit violations appear in the site event log per spec.
### M2.13 — #27: populate obtainable OPC UA / MxGateway transition fields
**Classification:** small · **Files:** `src/.../DataConnectionLayer/Adapters/RealOpcUaClient.cs:395-403`, `MxGatewayAlarmMapper.cs:79-113`.
**Fix:** Populate fields that are genuinely obtainable: for OPC UA A&C, add SelectClauses + map Category, Description, OriginalRaiseTime where the server exposes them (extend `BuildAlarmEventFilter`'s SelectClauses); for MxGateway, extract `OperatorUser` (present in the event but dropped) and any available Current/Limit values. Leave truly-unavailable fields empty and document which are unavailable-by-protocol vs left-empty.
**Tests:** `tests/.../DataConnectionLayer.Tests` mapper tests — obtainable fields populate from a representative event; unavailable fields documented.
**DoD:** Display fields populate where the source provides them.
### M2.14 — #28: readiness gate checks required cluster singletons
**Classification:** standard · **Files:** `src/.../Host/Program.cs:188-201,314-317`, new health check (peer to `AkkaClusterHealthCheck.cs`).
**Gap:** Readiness covers membership + DB connectivity only; spec wants "required singletons running".
**Fix:** Add a `Ready`-tagged health check that, on the active central node, verifies each required singleton proxy is reachable (e.g. `NotificationOutboxActor`, `AuditLogIngestActor`, `SiteCallAuditActor`, `AuditLogPurgeActor`, `SiteAuditReconciliationActor`) via a short `Ask`/Identify with timeout; degrade to Unhealthy if a required singleton is unreachable. Respect the "(if applicable)" softening — only gate on singletons that should be running for this node's role. Keep the probe cheap (cache/identify, short timeout) so readiness polling stays fast.
**Tests:** `tests/.../Host.Tests` or IntegrationTests — health check reports Unhealthy when a required singleton proxy is absent; Healthy when present. Avoid flakiness (use Identify with a bounded timeout).
**DoD:** `/health/ready` reflects singleton health.
### M2.15 — #29: register the site active-node purge gate
**Classification:** small · **Files:** `src/.../SiteEventLogging/ServiceCollectionExtensions.cs:33-37`, site service registration / cluster setup.
**Gap:** `SiteEventLogActiveNodeCheck` is consulted by `EventLogPurgeService` but no implementation is registered on the site node → purge runs on standby too (defaults to `() => true`).
**Fix:** Register a `SiteEventLogActiveNodeCheck` delegate on the site node that returns true only when this node is the cluster leader/active (mirror how central gates active-node work). Keep the null-default behavior for non-clustered test hosts.
**Tests:** `tests/.../SiteEventLogging.Tests` — purge gated off on standby, on for active; default-true preserved when unregistered.
**DoD:** Site event-log purge runs only on the active node.
### M2.16 — #30: Health Monitoring consumes `FailedWriteCount`
**Classification:** small · **Files:** `src/.../SiteEventLogging/ISiteEventLogger.cs:32-40`, Health Monitoring metric path.
**Fix:** Wire `FailedWriteCount` into the site health metrics the same way other site metrics are collected/reported (find the existing site metric collection path), so the dangling metric is consumed (surface as a health metric / threshold). Keep it raw-count per the health-reporting conventions.
**Tests:** `tests/.../HealthMonitoring`/SiteEventLogging — failed writes increment the reported metric.
**DoD:** `FailedWriteCount` reaches Health Monitoring.
### M2.17 — #31: reconcile StateTransitionValidator delete-from-NotDeployed
**Classification:** small · **Files:** `src/.../DeploymentManager/.../StateTransitionValidator.cs:38-41`, possibly `docs/requirements/Component-DeploymentManager.md` (spec matrix).
**Fix:** `git blame`/log the `CanDelete` line to recover intent. Default: **align code to the spec matrix** — remove `NotDeployed` from the allowed delete states, add a clarifying comment — UNLESS history shows deliberate orphan-cleanup intent, in which case update the spec matrix (Delete from NotDeployed = Yes, with a no-op-cleanup note) instead. Whichever direction, code and doc must agree at the end.
**Tests:** `tests/.../DeploymentManager` StateTransitionValidator — the chosen rule is asserted.
**DoD:** Code and spec matrix agree on delete-from-NotDeployed.
### M2.18 — #26: debug-stream stream-first ordering + replay/dedup
**Classification:** high-risk · **Files:** `src/.../DebugStreamBridgeActor.cs:89-103,163-166`.
**Gap:** `PreStart` sends the snapshot first, then opens the gRPC stream → events in the gap window are lost. Spec wants stream-first + replay with timestamp dedup.
**Fix:** Open the gRPC subscription FIRST (buffer incoming events), then fetch+send the snapshot, then flush buffered events, deduping by timestamp/identity against the snapshot so no gap-window event is lost or double-delivered. Preserve ordering. This is a re-arch of the actor's PreStart lifecycle — keep the existing message contract.
**Tests:** `tests/.../` DebugStreamBridgeActor — an event arriving during the snapshot window is delivered exactly once after the snapshot; ordering preserved; dedup drops the snapshot-overlapping event.
**DoD:** No gap-window events lost; no duplicates.
### M2.19 — #15: LDAP periodic re-query for interactive sessions (SECURITY)
**Classification:** high-risk · **Files:** `src/.../Security/ServiceCollectionExtensions.cs:86-148` (cookie events), `JwtTokenService.cs` (wire the unused `IsIdleTimedOut`/`ShouldRefresh`/`RecordActivity`/`RefreshToken`), `RoleMapper.cs`, LDAP service interface, `CentralUI/Auth/AuthEndpoints.cs` (claims-build parity).
**Spike first:** Determine whether the shared `ZB.MOM.WW.Auth.Ldap` lib exposes a **passwordless service-account group search** for an already-authenticated username. Report the answer before building the LDAP leg.
**Fix (layered):**
1. **Always achievable** — add `CookieAuthenticationEvents.OnValidatePrincipal` that: enforces idle-timeout (reject/sign-out past 30-min idle, advance last-activity on use), and refreshes role claims by **re-running `RoleMapper` on the stored group claims** (picks up central role-mapping changes without LDAP). Stamp a `LastLdapCheck` claim.
2. **If the lib supports passwordless group search** — when `LastLdapCheck` is >15 min old, re-query LDAP groups via the service-account search, re-map roles, update role/site claims. **On LDAP failure: keep existing roles, do NOT sign out** (per "LDAP failure: new logins fail; active sessions continue with current roles"). If the lib does NOT support it, ship layer 1 and document the residual limitation (group-membership changes picked up only at next login) in the security doc.
Rebuild claims identically to `/auth/login` (same claim types). Use the cookie-only model (embedded-JWT is dispositioned doc-only in M4).
**Tests (incl. adversarial):** idle-timeout enforced; role-mapping change reflected without LDAP; LDAP-down on re-query keeps existing roles (no sign-out); >15-min triggers re-query, <15-min skips (TTL respected); a revoked-group user loses roles after re-query (if LDAP leg shipped).
**DoD:** Interactive sessions enforce idle-timeout and refresh roles per the documented policy; any residual LDAP-dependency limitation is documented.
---
## Cross-cutting
- `dotnet build ZB.MOM.WW.ScadaBridge.slnx` green (TreatWarningsAsErrors); relevant unit/integration tests pass per task.
- MSSQL-backed tests need `cd infra && docker compose up -d`; if unavailable, gate-with-note (M2.0 especially).
- Migration tasks (M2.0, M2.5) serialized; M2.0 first.
- `git diff` review before each commit; design-summary commit messages; one logical slice per commit.
- After all tasks: final integration code review, build, and `bash docker/deploy.sh` smoke (`curl localhost:9000/health/ready`).