Files
ScadaBridge/docs/plans/2026-06-15-stillpending-m2-implementation.md
T
Joseph Doherty 4b6187c853 feat(inbound-api): nested Object/List extended-type validation (#13)
Object/List parameters and return values were shape-validated only (object vs
array), with no field-level/nested type checks — type-wrong nested data passed
inbound validation and failed only at script runtime. Add recursive type
validation (declared Object field types, List element type, scalars at any depth)
with path-qualified errors, symmetric across ParameterValidator and ReturnValueValidator.

Both validators now parse the canonical JSON Schema definition format (the
Central UI / MigrateParametersToJsonSchema output) via a shared recursive engine,
Commons.Types.InboundApi.InboundApiSchema, instead of the legacy flat
[{name,type}] array which they could not even deserialize from migrated rows.
The legacy flat-array form is still accepted on read for transition safety.
Undeclared fields are rejected at every level (consistent with the existing
top-level unexpected-parameter rejection); a present-but-null value satisfies
any type, only absence of a required field is an error.
2026-06-15 15:04:28 -04:00

29 KiB

M2 — Correctness & Behavioral Gaps (Tier 2) Implementation Plan

For Claude: REQUIRED SUB-SKILL: superpowers-extended-cc:subagent-driven-development. Execute task-by-task on branch feature/stillpending-m2, in-place (NOT a worktree — docker tooling builds from the repo path; implementers run serially to avoid racing the shared git index). Honor each task's Classification for the review chain.

Goal: Close the Tier-2 correctness/behavioral divergences from stillpending.md — make narrow/inert behaviors match the spec, and where the spec was the divergence, update it in the same slice.

Architecture: Touches the central Config DB (EF migrations), Site Runtime actors, the DCL alarm pipeline, the template validation/flattening pipeline, the deployment diff, Host startup validation, the Security cookie-auth pipeline, and Site Event Logging. Each item is independently shippable.

Tech Stack: C#/.NET 10, EF Core 10 (MS SQL central + SQLite site), Akka.NET 1.5, OPC UA (OPCFoundation.NetStandard.Opc.Ua.Client), ASP.NET Core cookie auth, xUnit/FluentAssertions/NSubstitute.

Build/verify: dotnet build ZB.MOM.WW.ScadaBridge.slnx (TreatWarningsAsErrors ON). Redeploy: bash docker/deploy.sh. Test user --username multi-role --password password.


Scope decisions (recorded; per "use recommendations")

  • #19 (script started/completed events) — already shipped in M1.8 (e74c3ae). Excluded.
  • #16 (Transport stale-instance enumeration) — genuine Tier-2 gap but NOT in the approved M2 list, and the fix needs a non-trivial shared-script-hash staleness compute across instances. Deferred to the Transport milestone (M8). Tracked, not dropped.
  • #17 (MachineDataDb) — a deliberate prior decision ("Host-008") removed this validation with a regression test asserting absence passes. The approved design doc says to add the option + startup validation, and both REQ-HOST-3/4 and the shipped docker appsettings.Central.json carry the key. Resolution: implement per design doc (add option + central startup validation, no DbContext since nothing consumes it), reverting the Host-008 regression test and noting the reversal in the commit.
  • #31 (StateTransitionValidator delete-from-NotDeployed) — the audit claimed a "deliberate per code comment"; investigation found no such comment. Reconcile by intent (git blame); default = align code to the spec matrix (remove NotDeployed from CanDelete) unless blame shows deliberate orphan-cleanup intent, in which case update the doc matrix.
  • #8 (conditionFilter) semantics — the filter is currently an undefined nullable string. Define it as a comma-separated, case-insensitive list of alarm/condition type names; null/blank = mirror all. Authoritative enforcement is client-side in DataConnectionActor routing (uniform across OPC UA + MxGateway, since MxGateway has no server-side filter); OPC UA additionally gets a server-side WhereClause as a bandwidth optimization where the type maps cleanly. Implementer confirms the discriminator field on NativeAlarmTransition.
  • #15 (LDAP re-query) — highest risk; passwordless group re-query depends on a shared-lib capability that may not exist. Spike first, then ship the always-achievable layers (idle-timeout enforcement + DB role-mapping refresh on stored group claims) and the LDAP group re-query only if the lib supports a service-account search; document any residual limitation.

Execution order & dependencies

Risk-first, migration-safe ordering. #32 first (unblocks DB-backed verification). The two migration-touching tasks (#32, M2.5) are serialized so the snapshot stays clean.

# Task Class Migration?
#32 M2.0 EF model/snapshot drift high-risk snapshot only
M2.1 #22 native-alarm capability validation wired into deploy pipeline standard no
M2.2 #10 connection-level diff surfaced standard no
M2.3 #7 Database.CachedWrite transient/permanent classification high-risk no
M2.4 #8 alarm conditionFilter applied high-risk no
M2.5 #9 per-script execution timeout standard yes (new column)
M2.6 #13 nested Object/List validation standard no
M2.7 #20 + #21 return-type + argument-type compatibility standard no
M2.8 #23 binding-completeness Error + "name exists at site" standard no
M2.9 #17 MachineDataDb fail-fast small no
M2.10 #18 CI grep-guard (data-layer scan test) small no
M2.11 #24 debug snapshot unknown-instance → error small no
M2.12 #25 recursion-limit → site event log small no
M2.13 #27 OPC UA / MxGateway transition field population small no
M2.14 #28 readiness "required singletons running" probe standard no
M2.15 #29 site active-node purge-gate DI registration small no
M2.16 #30 FailedWriteCount consumed by Health Monitoring small no
M2.17 #31 StateTransitionValidator reconcile small no
M2.18 #26 debug-stream ordering + replay/dedup high-risk no
M2.19 #15 LDAP periodic re-query (spike + impl) high-risk no

Tasks

M2.0 — #32: EF model/snapshot drift (PendingModelChangesWarning)

Classification: high-risk · Files: src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Configurations/AuditLogEntityTypeConfiguration.cs:68-69, src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Migrations/ScadaBridgeDbContextModelSnapshot.cs, possibly a new empty migration. Root cause: OccurredAtUtc has .HasConversion(UtcConverter) in config; the model snapshot omits the converter annotation → EF throws PendingModelChangesWarning in MsSqlMigrationFixture.MigrateAsync (~57 AuditLog MSSQL tests fail in fixture ctor). Fix: Run dotnet ef migrations has-pending-model-changes (or migrations add) against ScadaBridgeDbContext to surface the FULL drift (there may be more than OccurredAtUtc). Prefer the EF-canonical path: dotnet ef migrations add ResyncAuditLogModelSnapshotverify the generated migration's Up/Down are empty (no DDL); a value-converter-only change produces no DDL but realigns the snapshot. If non-empty/unexpected DDL appears, stop and report. Auto-apply is dev-only per CLAUDE.md, so an empty migration is harmless to prod. Tests: Re-run dotnet test tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests (requires MSSQL via cd infra && docker compose up -d); the ~57 fixture-ctor failures must clear. If MSSQL is unreachable in this environment, confirm the build + the snapshot diff is empty-DDL and note the test gating. DoD: No PendingModelChangesWarning; AuditLog MSSQL suite green (or gated-with-note if no DB). Adversarial: confirm no real schema change was smuggled in.

M2.1 — #22: native-alarm-source capability validation wired into deploy pipeline

Classification: standard · Files: src/.../DeploymentManager/FlatteningPipeline.cs:93,115, SemanticValidator.cs:30-33,239-245, validation service call site. Gap (M1-era regression): FlatteningPipeline loads dataConnections but never extracts the alarm-capable subset, so SemanticValidator.Validate(...) is always called with alarmCapableConnectionNames = null → native-alarm-source capability check never runs; a source can reference a non-alarm-capable connection and deploy. Fix: In FlatteningPipeline, compute the alarm-capable connection-name set from the loaded connections (filter by the protocol/capability that maps to IAlarmSubscribableConnection — OPC UA + MxGateway), pass it into the validator. Confirm the capability predicate (protocol enum / adapter capability) is the same one DCL uses to decide IAlarmSubscribableConnection. Tests: tests/.../TemplateEngine.Tests SemanticValidator/flattening — add: native-alarm source on a non-alarm-capable connection → validation Error; on a capable one → passes. DoD: Deploy gate rejects native-alarm sources bound to non-capable connections.

M2.2 — #10: connection-level diff surfaced in deployment diff

Classification: standard · Files: src/.../Commons/Types/Flattening/ConfigurationDiff.cs:7-24, src/.../TemplateEngine/Flattening/DiffService.cs:19-54,174-204, Central UI diff render (CentralUI/Components/Shared/DiffDialog.razor caller / deployment preview page). Gap: ComputeConnectionsDiff exists with tests but is dead (no callers); ConfigurationDiff has no ConnectionChanges slot; HasChanges ignores connections. Fix: Add ConnectionChanges slot (IReadOnlyList<DiffEntry<ConnectionConfig>> — the element type already exists) to ConfigurationDiff; include it in HasChanges. Call ComputeConnectionsDiff from ComputeDiff and populate the slot. Surface in the deployment-diff UI alongside attribute/alarm/script changes (connection name + old/new protocol + endpoint config). Wire the existing ComputeConnectionsDiff tests' expectations through ComputeDiff too. Tests: tests/.../TemplateEngine.Tests/Flattening/DiffServiceTests.cs — add a ComputeDiff integration assertion that ConnectionChanges populates and HasChanges is true when only a connection differs. DoD: Standalone connection endpoint/protocol/failover drift appears in the deployment diff.

M2.3 — #7: Database.CachedWrite classifies transient vs permanent SQL errors

Classification: high-risk · Files: src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:78-204, new SqlErrorClassifier.cs + PermanentDatabaseException, reference ExternalSystemClient.cs:80-162 + ErrorClassifier.cs. Gap: CachedWriteAsync buffers ALL writes without an immediate attempt; DeliverBufferedAsync throws on any SqlException → S&F retries permanent errors forever; the script never gets a synchronous Failed. The API path (ExternalSystemClient) does it right. Fix (mirror API path): Add SqlErrorClassifier.IsTransient(SqlException) — transient = connection/timeout/deadlock/throttle error numbers (e.g. -2, 64, 53, 233, 1205, 40197, 40501, 40613, 49918-49920); permanent = constraint/syntax/permission/etc. Create PermanentDatabaseException (parallel to PermanentExternalSystemException). In CachedWriteAsync: attempt immediately; on success done; on permanent → return Failed synchronously (set the tracking row terminal Failed) and do NOT buffer; on transient → buffer to S&F. In DeliverBufferedAsync: classify on SqlException, return false (park) for permanent, rethrow for transient (S&F retries). Keep behavior unified with TrackedOperationId/OperationTrackingStore and the Pending → Retrying → Delivered/Parked/Failed/Discarded lifecycle. Tests: tests/.../ExternalSystemGateway.Tests/DatabaseGatewayTests.cs — transient SQL (deadlock 1205, timeout -2) → buffers/retries; permanent SQL (constraint 2627, syntax 102, permission 229) → synchronous Failed, not buffered; DeliverBuffered parks on permanent. Adversarial: ambiguous error numbers default to the safer classification (document which). DoD: Permanent SQL errors fail fast to the script as Failed; only transient errors buffer.

M2.4 — #8: alarm conditionFilter applied (OPC UA WhereClause + client-side routing)

Classification: high-risk · Files: src/.../DataConnectionLayer/Actors/DataConnectionActor.cs:1482,1540-1554, Adapters/RealOpcUaClient.cs:242,295-310, Adapters/MxGatewayDataConnection.cs:154-167, IAlarmSubscribableConnection.cs. Decision (semantics): filter = comma-separated, case-insensitive list of alarm/condition type names; null/blank = mirror all. Authoritative gate = client-side in DataConnectionActor.HandleAlarmTransitionReceived (after source-ref match, drop transitions whose type name isn't in the source's filter set). Store the per-source filter set correctly (the current _alarmSourceFilter[...] keying is wrong — key by source reference). OPC UA additionally builds a server-side WhereClause in RealOpcUaClient as an optimization where the condition type maps cleanly; MxGateway relies solely on the client-side gate. Fix: (1) Parse the filter string into a normalized set at subscribe time, keyed by source ref. (2) In routing, consult the set and skip non-matching transitions. (3) In RealOpcUaClient.BuildAlarmEventFilter, attach a WhereClause (ContentFilter on the condition/event type) built from the filter when present. Confirm NativeAlarmTransition exposes a usable type-name discriminator; if not, filter on the available field and note it. Tests: tests/.../DataConnectionLayer.Tests/DataConnectionActorAlarmTests.cs — filter set → only matching-type transitions delivered; null → all delivered; MxGateway path filters client-side; OPC UA builds a non-empty WhereClause. Adversarial: case/whitespace variations in the filter list. DoD: Setting a conditionFilter actually restricts mirrored conditions across both adapters.

M2.5 — #9: per-script execution timeout

Classification: standard · Migration: yes. · Files: Commons/Entities/Templates/TemplateScript.cs, ConfigurationDatabase/Configurations/TemplateConfiguration.cs (TemplateScriptConfiguration), new EF migration, Commons/Types/Flattening/FlattenedConfiguration.cs (ResolvedScript), TemplateEngine/Flattening/FlatteningService.cs (ResolveInheritedScripts), SiteRuntime/Actors/ScriptActor.cs, ScriptExecutionActor.cs:100, AlarmExecutionActor.cs:66, SiteRuntimeOptions.cs:31 (global fallback unchanged). Gap: Only a global ScriptExecutionTimeoutSeconds; no per-script field. Mirror the existing nullable MinTimeBetweenRuns pattern end-to-end. Fix: Add int? ExecutionTimeoutSeconds to TemplateScript + EF config (nullable) + migration (runs after M2.0 so the snapshot is clean) + ResolvedScript + flattening map + ScriptActor field; pass it into ScriptExecutionActor/AlarmExecutionActor, which compute effective = perScript ?? options.ScriptExecutionTimeoutSeconds. Validate non-negative. Tests: flattening test (field threads through), actor test (per-script override vs global default both enforce the CTS timeout), EF round-trip test. DoD: A per-script timeout overrides the global; absent → global default.

M2.6 — #13: nested Object/List extended-type validation

Classification: standard · Files: src/.../InboundAPI/.../ParameterValidator.cs:109-145, ReturnValueValidator.cs:18. Gap: Object/List are shape-validated only (object-vs-array); no nested/field-level type validation. Fix: Recursive descent through the declared Object field schema / List element type, type-checking each level (scalars by extended-type, nested Object/List recursively). Reuse the existing extended-type system; keep error messages path-qualified (field.sub[2].x). Apply symmetrically in both validators. Tests: tests/.../InboundAPI.Tests — valid nested payload passes; wrong scalar type at depth, wrong list element type, missing required nested field → rejected with path. DoD: Nested type mismatches are caught at inbound validation, not at script runtime. (Satisfies the M4 cross-reference to this item.) Status: complete. A shared recursive engine, Commons.Types.InboundApi.InboundApiSchema (parse + path-qualified Validate), backs both validators so they cannot drift. Key finding: the canonical persisted/authored format is JSON Schema (object properties + required, array items) — produced by the Central UI schema builder and the MigrateParametersToJsonSchema migration — but the validators still parsed the legacy flat array [{name,type}] and only shape-checked Object/List. They could not even consume a migrated JSON-Schema-object definition (the Deserialize<List<…>> would fail). Rewriting both to read InboundApiSchema fixes that latent format mismatch and delivers true nested validation; the legacy flat array is still accepted on read (case-insensitive keys) for transition safety. Undeclared-field policy: reject at every level (a declared object rejects any field not in its properties, consistent with the existing top-level InboundAPI-010 "unexpected parameter" rejection); a bare {"type":"object"} with no declared fields stays shape-only. A present-but-null value satisfies any type; only the absence of a required field is an error.

M2.7 — #20 + #21: return-type + argument-type compatibility checks

Classification: standard · Files: src/.../TemplateEngine/Validation/SemanticValidator.cs:62-63,251-266,279-287,390-425. Gap: BuildReturnMap builds maps never read (no return-type comparison); call validation checks arg count only (comma counting), not arg types. Fix: #20 — compare a call site's used-return against the target script's declared ReturnDefinition; flag incompatible use. #21 — extract/infer argument types at the call site and check each against the parameter definition (count + type). These share SemanticValidator — implement together. Be conservative: only flag clear mismatches (avoid false positives on dynamically-typed expressions); document the inference limits. Tests: tests/.../TemplateEngine.Tests SemanticValidator — return-type mismatch flagged; arg type mismatch flagged; correct calls pass; dynamic/unknown types don't false-positive. DoD: Type-incompatible script calls fail validation, not just count-mismatched ones.

M2.8 — #23: connection-binding completeness as deploy-gating Error + "name exists at site"

Classification: standard · Files: src/.../TemplateEngine/Validation/ValidationService.cs:504-519, ValidationResult.cs:9. Gap: Missing-binding for a data-sourced attribute is a non-blocking Warning (so IsValid stays true); the "connection name exists at the target site" half is missing. Fix: Elevate binding-completeness to Error (or add a parallel Error-level check) so a deployment with unresolved bindings fails the gate; add the "binding references a connection that exists on the target site" check (resolve by site connection, not just name presence). Confirm this doesn't break legitimately-unbound attributes (static/non-data-sourced) — only data-sourced attributes require a binding. Tests: tests/.../TemplateEngine.Tests ValidationService — data-sourced attribute with no binding → Error + IsValid false; binding to a non-existent site connection → Error; static attribute without binding → OK. DoD: Incomplete/invalid connection bindings block deploy.

M2.9 — #17: MachineDataDb fail-fast (per design doc; reverts Host-008)

Classification: small · Files: src/ZB.MOM.WW.ScadaBridge.Host/DatabaseOptions.cs:6-12, StartupValidator.cs:59-62, tests/.../Host.Tests/StartupValidatorTests.cs (the Central_MissingMachineDataDb_PassesValidation regression). Fix: Add string? MachineDataDb to DatabaseOptions; add a Central-only Require("ScadaBridge:Database:MachineDataDb", non-empty, ...) in StartupValidator. No DbContext (nothing consumes it). Revert the Host-008 regression test to expect failure when missing; add MachineDataDb to ValidCentralConfig(). Commit message must note the deliberate Host-008 reversal and cite REQ-HOST-3/4 + shipped docker appsettings as justification. Tests: StartupValidatorTests — Central missing MachineDataDb → fails; present → passes; Site role unaffected. DoD: Central nodes fail fast on empty MachineDataDb; spec REQ-HOST-4 satisfied.

M2.10 — #18: CI grep-guard against UPDATE/DELETE on AuditLog

Classification: small · Files: new guard test in tests/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase.Tests/ (the only thing that actually runs — no CI service exists; build is Docker-only). Fix: Add a test that scans the ConfigurationDatabase source tree (and migration SQL) for UPDATE/DELETE statements targeting AuditLog, failing if any are found in C# data-access code. Scope strictly to the AuditLog table (allow purge/delete on Notifications/SiteCalls and partition-switch DDL). This backstops the existing DB-role DENY UPDATE/DELETE (migration 20260602174346). Optionally add an MSBuild target mirroring it, but the test is the enforced control. Tests: the guard test itself; verify it passes on current clean source and would fail on a planted violation (assert via a unit on the scanner helper). DoD: A code-level guard fails the test run on AuditLog mutations.

M2.11 — #24: debug snapshot/subscribe for unknown instance returns an error

Classification: small · Files: src/.../DeploymentManager/.../DeploymentManagerActor.cs:845-866. Gap: Unknown-instance snapshot/subscribe returns an empty snapshot — caller can't distinguish "not deployed" from "deployed-but-empty". Fix: Check instance registration first; return an explicit "instance not found"/not-deployed error response (matching the existing debug response contract) instead of an empty snapshot. Tests: tests/.../DeploymentManager (or SiteRuntime) — unknown instance → error response; known empty instance → empty snapshot (unchanged). DoD: Unknown-instance debug requests are distinguishable from empty ones.

M2.12 — #25: recursion-limit error → site event log

Classification: small · Files: src/.../SiteRuntime/.../ScriptRuntimeContext.cs:302-305,464-466 (thread ISiteEventLogger in, mirroring M1.8's ScriptExecutionActor wiring). Fix: Inject ISiteEventLogger into ScriptRuntimeContext; on recursion-limit violation, emit a script Error site event (fire-and-forget _ = logger?.LogEventAsync(...)) in addition to the existing ILogger log, at both check sites. Tests: tests/.../SiteRuntime.Tests — recursion-limit hit emits a site event with category script, severity Error. DoD: Recursion-limit violations appear in the site event log per spec.

M2.13 — #27: populate obtainable OPC UA / MxGateway transition fields

Classification: small · Files: src/.../DataConnectionLayer/Adapters/RealOpcUaClient.cs:395-403, MxGatewayAlarmMapper.cs:79-113. Fix: Populate fields that are genuinely obtainable: for OPC UA A&C, add SelectClauses + map Category, Description, OriginalRaiseTime where the server exposes them (extend BuildAlarmEventFilter's SelectClauses); for MxGateway, extract OperatorUser (present in the event but dropped) and any available Current/Limit values. Leave truly-unavailable fields empty and document which are unavailable-by-protocol vs left-empty. Tests: tests/.../DataConnectionLayer.Tests mapper tests — obtainable fields populate from a representative event; unavailable fields documented. DoD: Display fields populate where the source provides them.

M2.14 — #28: readiness gate checks required cluster singletons

Classification: standard · Files: src/.../Host/Program.cs:188-201,314-317, new health check (peer to AkkaClusterHealthCheck.cs). Gap: Readiness covers membership + DB connectivity only; spec wants "required singletons running". Fix: Add a Ready-tagged health check that, on the active central node, verifies each required singleton proxy is reachable (e.g. NotificationOutboxActor, AuditLogIngestActor, SiteCallAuditActor, AuditLogPurgeActor, SiteAuditReconciliationActor) via a short Ask/Identify with timeout; degrade to Unhealthy if a required singleton is unreachable. Respect the "(if applicable)" softening — only gate on singletons that should be running for this node's role. Keep the probe cheap (cache/identify, short timeout) so readiness polling stays fast. Tests: tests/.../Host.Tests or IntegrationTests — health check reports Unhealthy when a required singleton proxy is absent; Healthy when present. Avoid flakiness (use Identify with a bounded timeout). DoD: /health/ready reflects singleton health.

M2.15 — #29: register the site active-node purge gate

Classification: small · Files: src/.../SiteEventLogging/ServiceCollectionExtensions.cs:33-37, site service registration / cluster setup. Gap: SiteEventLogActiveNodeCheck is consulted by EventLogPurgeService but no implementation is registered on the site node → purge runs on standby too (defaults to () => true). Fix: Register a SiteEventLogActiveNodeCheck delegate on the site node that returns true only when this node is the cluster leader/active (mirror how central gates active-node work). Keep the null-default behavior for non-clustered test hosts. Tests: tests/.../SiteEventLogging.Tests — purge gated off on standby, on for active; default-true preserved when unregistered. DoD: Site event-log purge runs only on the active node.

M2.16 — #30: Health Monitoring consumes FailedWriteCount

Classification: small · Files: src/.../SiteEventLogging/ISiteEventLogger.cs:32-40, Health Monitoring metric path. Fix: Wire FailedWriteCount into the site health metrics the same way other site metrics are collected/reported (find the existing site metric collection path), so the dangling metric is consumed (surface as a health metric / threshold). Keep it raw-count per the health-reporting conventions. Tests: tests/.../HealthMonitoring/SiteEventLogging — failed writes increment the reported metric. DoD: FailedWriteCount reaches Health Monitoring.

M2.17 — #31: reconcile StateTransitionValidator delete-from-NotDeployed

Classification: small · Files: src/.../DeploymentManager/.../StateTransitionValidator.cs:38-41, possibly docs/requirements/Component-DeploymentManager.md (spec matrix). Fix: git blame/log the CanDelete line to recover intent. Default: align code to the spec matrix — remove NotDeployed from the allowed delete states, add a clarifying comment — UNLESS history shows deliberate orphan-cleanup intent, in which case update the spec matrix (Delete from NotDeployed = Yes, with a no-op-cleanup note) instead. Whichever direction, code and doc must agree at the end. Tests: tests/.../DeploymentManager StateTransitionValidator — the chosen rule is asserted. DoD: Code and spec matrix agree on delete-from-NotDeployed.

M2.18 — #26: debug-stream stream-first ordering + replay/dedup

Classification: high-risk · Files: src/.../DebugStreamBridgeActor.cs:89-103,163-166. Gap: PreStart sends the snapshot first, then opens the gRPC stream → events in the gap window are lost. Spec wants stream-first + replay with timestamp dedup. Fix: Open the gRPC subscription FIRST (buffer incoming events), then fetch+send the snapshot, then flush buffered events, deduping by timestamp/identity against the snapshot so no gap-window event is lost or double-delivered. Preserve ordering. This is a re-arch of the actor's PreStart lifecycle — keep the existing message contract. Tests: tests/.../ DebugStreamBridgeActor — an event arriving during the snapshot window is delivered exactly once after the snapshot; ordering preserved; dedup drops the snapshot-overlapping event. DoD: No gap-window events lost; no duplicates.

M2.19 — #15: LDAP periodic re-query for interactive sessions (SECURITY)

Classification: high-risk · Files: src/.../Security/ServiceCollectionExtensions.cs:86-148 (cookie events), JwtTokenService.cs (wire the unused IsIdleTimedOut/ShouldRefresh/RecordActivity/RefreshToken), RoleMapper.cs, LDAP service interface, CentralUI/Auth/AuthEndpoints.cs (claims-build parity). Spike first: Determine whether the shared ZB.MOM.WW.Auth.Ldap lib exposes a passwordless service-account group search for an already-authenticated username. Report the answer before building the LDAP leg. Fix (layered):

  1. Always achievable — add CookieAuthenticationEvents.OnValidatePrincipal that: enforces idle-timeout (reject/sign-out past 30-min idle, advance last-activity on use), and refreshes role claims by re-running RoleMapper on the stored group claims (picks up central role-mapping changes without LDAP). Stamp a LastLdapCheck claim.
  2. If the lib supports passwordless group search — when LastLdapCheck is >15 min old, re-query LDAP groups via the service-account search, re-map roles, update role/site claims. On LDAP failure: keep existing roles, do NOT sign out (per "LDAP failure: new logins fail; active sessions continue with current roles"). If the lib does NOT support it, ship layer 1 and document the residual limitation (group-membership changes picked up only at next login) in the security doc. Rebuild claims identically to /auth/login (same claim types). Use the cookie-only model (embedded-JWT is dispositioned doc-only in M4). Tests (incl. adversarial): idle-timeout enforced; role-mapping change reflected without LDAP; LDAP-down on re-query keeps existing roles (no sign-out); >15-min triggers re-query, <15-min skips (TTL respected); a revoked-group user loses roles after re-query (if LDAP leg shipped). DoD: Interactive sessions enforce idle-timeout and refresh roles per the documented policy; any residual LDAP-dependency limitation is documented.

Cross-cutting

  • dotnet build ZB.MOM.WW.ScadaBridge.slnx green (TreatWarningsAsErrors); relevant unit/integration tests pass per task.
  • MSSQL-backed tests need cd infra && docker compose up -d; if unavailable, gate-with-note (M2.0 especially).
  • Migration tasks (M2.0, M2.5) serialized; M2.0 first.
  • git diff review before each commit; design-summary commit messages; one logical slice per commit.
  • After all tasks: final integration code review, build, and bash docker/deploy.sh smoke (curl localhost:9000/health/ready).