Object/List parameters and return values were shape-validated only (object vs
array), with no field-level/nested type checks — type-wrong nested data passed
inbound validation and failed only at script runtime. Add recursive type
validation (declared Object field types, List element type, scalars at any depth)
with path-qualified errors, symmetric across ParameterValidator and ReturnValueValidator.
Both validators now parse the canonical JSON Schema definition format (the
Central UI / MigrateParametersToJsonSchema output) via a shared recursive engine,
Commons.Types.InboundApi.InboundApiSchema, instead of the legacy flat
[{name,type}] array which they could not even deserialize from migrated rows.
The legacy flat-array form is still accepted on read for transition safety.
Undeclared fields are rejected at every level (consistent with the existing
top-level unexpected-parameter rejection); a present-but-null value satisfies
any type, only absence of a required field is an error.
29 KiB
M2 — Correctness & Behavioral Gaps (Tier 2) Implementation Plan
For Claude: REQUIRED SUB-SKILL: superpowers-extended-cc:subagent-driven-development. Execute task-by-task on branch
feature/stillpending-m2, in-place (NOT a worktree — docker tooling builds from the repo path; implementers run serially to avoid racing the shared git index). Honor each task'sClassificationfor the review chain.
Goal: Close the Tier-2 correctness/behavioral divergences from stillpending.md — make narrow/inert behaviors match the spec, and where the spec was the divergence, update it in the same slice.
Architecture: Touches the central Config DB (EF migrations), Site Runtime actors, the DCL alarm pipeline, the template validation/flattening pipeline, the deployment diff, Host startup validation, the Security cookie-auth pipeline, and Site Event Logging. Each item is independently shippable.
Tech Stack: C#/.NET 10, EF Core 10 (MS SQL central + SQLite site), Akka.NET 1.5, OPC UA (OPCFoundation.NetStandard.Opc.Ua.Client), ASP.NET Core cookie auth, xUnit/FluentAssertions/NSubstitute.
Build/verify: dotnet build ZB.MOM.WW.ScadaBridge.slnx (TreatWarningsAsErrors ON). Redeploy: bash docker/deploy.sh. Test user --username multi-role --password password.
Scope decisions (recorded; per "use recommendations")
- #19 (script started/completed events) — already shipped in M1.8 (
e74c3ae). Excluded. - #16 (Transport stale-instance enumeration) — genuine Tier-2 gap but NOT in the approved M2 list, and the fix needs a non-trivial shared-script-hash staleness compute across instances. Deferred to the Transport milestone (M8). Tracked, not dropped.
- #17 (MachineDataDb) — a deliberate prior decision ("Host-008") removed this validation with a regression test asserting absence passes. The approved design doc says to add the option + startup validation, and both REQ-HOST-3/4 and the shipped docker
appsettings.Central.jsoncarry the key. Resolution: implement per design doc (add option + central startup validation, no DbContext since nothing consumes it), reverting the Host-008 regression test and noting the reversal in the commit. - #31 (StateTransitionValidator delete-from-NotDeployed) — the audit claimed a "deliberate per code comment"; investigation found no such comment. Reconcile by intent (git blame); default = align code to the spec matrix (remove
NotDeployedfromCanDelete) unless blame shows deliberate orphan-cleanup intent, in which case update the doc matrix. - #8 (conditionFilter) semantics — the filter is currently an undefined nullable string. Define it as a comma-separated, case-insensitive list of alarm/condition type names; null/blank = mirror all. Authoritative enforcement is client-side in
DataConnectionActorrouting (uniform across OPC UA + MxGateway, since MxGateway has no server-side filter); OPC UA additionally gets a server-sideWhereClauseas a bandwidth optimization where the type maps cleanly. Implementer confirms the discriminator field onNativeAlarmTransition. - #15 (LDAP re-query) — highest risk; passwordless group re-query depends on a shared-lib capability that may not exist. Spike first, then ship the always-achievable layers (idle-timeout enforcement + DB role-mapping refresh on stored group claims) and the LDAP group re-query only if the lib supports a service-account search; document any residual limitation.
Execution order & dependencies
Risk-first, migration-safe ordering. #32 first (unblocks DB-backed verification). The two migration-touching tasks (#32, M2.5) are serialized so the snapshot stays clean.
| # | Task | Class | Migration? |
|---|---|---|---|
| #32 | M2.0 EF model/snapshot drift | high-risk | snapshot only |
| M2.1 | #22 native-alarm capability validation wired into deploy pipeline | standard | no |
| M2.2 | #10 connection-level diff surfaced | standard | no |
| M2.3 | #7 Database.CachedWrite transient/permanent classification |
high-risk | no |
| M2.4 | #8 alarm conditionFilter applied |
high-risk | no |
| M2.5 | #9 per-script execution timeout | standard | yes (new column) |
| M2.6 | #13 nested Object/List validation |
standard | no |
| M2.7 | #20 + #21 return-type + argument-type compatibility | standard | no |
| M2.8 | #23 binding-completeness Error + "name exists at site" | standard | no |
| M2.9 | #17 MachineDataDb fail-fast | small | no |
| M2.10 | #18 CI grep-guard (data-layer scan test) | small | no |
| M2.11 | #24 debug snapshot unknown-instance → error | small | no |
| M2.12 | #25 recursion-limit → site event log | small | no |
| M2.13 | #27 OPC UA / MxGateway transition field population | small | no |
| M2.14 | #28 readiness "required singletons running" probe | standard | no |
| M2.15 | #29 site active-node purge-gate DI registration | small | no |
| M2.16 | #30 FailedWriteCount consumed by Health Monitoring |
small | no |
| M2.17 | #31 StateTransitionValidator reconcile | small | no |
| M2.18 | #26 debug-stream ordering + replay/dedup | high-risk | no |
| M2.19 | #15 LDAP periodic re-query (spike + impl) | high-risk | no |
Tasks
M2.0 — #32: EF model/snapshot drift (PendingModelChangesWarning)
Classification: high-risk · Files: src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Configurations/AuditLogEntityTypeConfiguration.cs:68-69, src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Migrations/ScadaBridgeDbContextModelSnapshot.cs, possibly a new empty migration.
Root cause: OccurredAtUtc has .HasConversion(UtcConverter) in config; the model snapshot omits the converter annotation → EF throws PendingModelChangesWarning in MsSqlMigrationFixture.MigrateAsync (~57 AuditLog MSSQL tests fail in fixture ctor).
Fix: Run dotnet ef migrations has-pending-model-changes (or migrations add) against ScadaBridgeDbContext to surface the FULL drift (there may be more than OccurredAtUtc). Prefer the EF-canonical path: dotnet ef migrations add ResyncAuditLogModelSnapshot — verify the generated migration's Up/Down are empty (no DDL); a value-converter-only change produces no DDL but realigns the snapshot. If non-empty/unexpected DDL appears, stop and report. Auto-apply is dev-only per CLAUDE.md, so an empty migration is harmless to prod.
Tests: Re-run dotnet test tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests (requires MSSQL via cd infra && docker compose up -d); the ~57 fixture-ctor failures must clear. If MSSQL is unreachable in this environment, confirm the build + the snapshot diff is empty-DDL and note the test gating.
DoD: No PendingModelChangesWarning; AuditLog MSSQL suite green (or gated-with-note if no DB). Adversarial: confirm no real schema change was smuggled in.
M2.1 — #22: native-alarm-source capability validation wired into deploy pipeline
Classification: standard · Files: src/.../DeploymentManager/FlatteningPipeline.cs:93,115, SemanticValidator.cs:30-33,239-245, validation service call site.
Gap (M1-era regression): FlatteningPipeline loads dataConnections but never extracts the alarm-capable subset, so SemanticValidator.Validate(...) is always called with alarmCapableConnectionNames = null → native-alarm-source capability check never runs; a source can reference a non-alarm-capable connection and deploy.
Fix: In FlatteningPipeline, compute the alarm-capable connection-name set from the loaded connections (filter by the protocol/capability that maps to IAlarmSubscribableConnection — OPC UA + MxGateway), pass it into the validator. Confirm the capability predicate (protocol enum / adapter capability) is the same one DCL uses to decide IAlarmSubscribableConnection.
Tests: tests/.../TemplateEngine.Tests SemanticValidator/flattening — add: native-alarm source on a non-alarm-capable connection → validation Error; on a capable one → passes.
DoD: Deploy gate rejects native-alarm sources bound to non-capable connections.
M2.2 — #10: connection-level diff surfaced in deployment diff
Classification: standard · Files: src/.../Commons/Types/Flattening/ConfigurationDiff.cs:7-24, src/.../TemplateEngine/Flattening/DiffService.cs:19-54,174-204, Central UI diff render (CentralUI/Components/Shared/DiffDialog.razor caller / deployment preview page).
Gap: ComputeConnectionsDiff exists with tests but is dead (no callers); ConfigurationDiff has no ConnectionChanges slot; HasChanges ignores connections.
Fix: Add ConnectionChanges slot (IReadOnlyList<DiffEntry<ConnectionConfig>> — the element type already exists) to ConfigurationDiff; include it in HasChanges. Call ComputeConnectionsDiff from ComputeDiff and populate the slot. Surface in the deployment-diff UI alongside attribute/alarm/script changes (connection name + old/new protocol + endpoint config). Wire the existing ComputeConnectionsDiff tests' expectations through ComputeDiff too.
Tests: tests/.../TemplateEngine.Tests/Flattening/DiffServiceTests.cs — add a ComputeDiff integration assertion that ConnectionChanges populates and HasChanges is true when only a connection differs.
DoD: Standalone connection endpoint/protocol/failover drift appears in the deployment diff.
M2.3 — #7: Database.CachedWrite classifies transient vs permanent SQL errors
Classification: high-risk · Files: src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:78-204, new SqlErrorClassifier.cs + PermanentDatabaseException, reference ExternalSystemClient.cs:80-162 + ErrorClassifier.cs.
Gap: CachedWriteAsync buffers ALL writes without an immediate attempt; DeliverBufferedAsync throws on any SqlException → S&F retries permanent errors forever; the script never gets a synchronous Failed. The API path (ExternalSystemClient) does it right.
Fix (mirror API path): Add SqlErrorClassifier.IsTransient(SqlException) — transient = connection/timeout/deadlock/throttle error numbers (e.g. -2, 64, 53, 233, 1205, 40197, 40501, 40613, 49918-49920); permanent = constraint/syntax/permission/etc. Create PermanentDatabaseException (parallel to PermanentExternalSystemException). In CachedWriteAsync: attempt immediately; on success done; on permanent → return Failed synchronously (set the tracking row terminal Failed) and do NOT buffer; on transient → buffer to S&F. In DeliverBufferedAsync: classify on SqlException, return false (park) for permanent, rethrow for transient (S&F retries). Keep behavior unified with TrackedOperationId/OperationTrackingStore and the Pending → Retrying → Delivered/Parked/Failed/Discarded lifecycle.
Tests: tests/.../ExternalSystemGateway.Tests/DatabaseGatewayTests.cs — transient SQL (deadlock 1205, timeout -2) → buffers/retries; permanent SQL (constraint 2627, syntax 102, permission 229) → synchronous Failed, not buffered; DeliverBuffered parks on permanent. Adversarial: ambiguous error numbers default to the safer classification (document which).
DoD: Permanent SQL errors fail fast to the script as Failed; only transient errors buffer.
M2.4 — #8: alarm conditionFilter applied (OPC UA WhereClause + client-side routing)
Classification: high-risk · Files: src/.../DataConnectionLayer/Actors/DataConnectionActor.cs:1482,1540-1554, Adapters/RealOpcUaClient.cs:242,295-310, Adapters/MxGatewayDataConnection.cs:154-167, IAlarmSubscribableConnection.cs.
Decision (semantics): filter = comma-separated, case-insensitive list of alarm/condition type names; null/blank = mirror all. Authoritative gate = client-side in DataConnectionActor.HandleAlarmTransitionReceived (after source-ref match, drop transitions whose type name isn't in the source's filter set). Store the per-source filter set correctly (the current _alarmSourceFilter[...] keying is wrong — key by source reference). OPC UA additionally builds a server-side WhereClause in RealOpcUaClient as an optimization where the condition type maps cleanly; MxGateway relies solely on the client-side gate.
Fix: (1) Parse the filter string into a normalized set at subscribe time, keyed by source ref. (2) In routing, consult the set and skip non-matching transitions. (3) In RealOpcUaClient.BuildAlarmEventFilter, attach a WhereClause (ContentFilter on the condition/event type) built from the filter when present. Confirm NativeAlarmTransition exposes a usable type-name discriminator; if not, filter on the available field and note it.
Tests: tests/.../DataConnectionLayer.Tests/DataConnectionActorAlarmTests.cs — filter set → only matching-type transitions delivered; null → all delivered; MxGateway path filters client-side; OPC UA builds a non-empty WhereClause. Adversarial: case/whitespace variations in the filter list.
DoD: Setting a conditionFilter actually restricts mirrored conditions across both adapters.
M2.5 — #9: per-script execution timeout
Classification: standard · Migration: yes. · Files: Commons/Entities/Templates/TemplateScript.cs, ConfigurationDatabase/Configurations/TemplateConfiguration.cs (TemplateScriptConfiguration), new EF migration, Commons/Types/Flattening/FlattenedConfiguration.cs (ResolvedScript), TemplateEngine/Flattening/FlatteningService.cs (ResolveInheritedScripts), SiteRuntime/Actors/ScriptActor.cs, ScriptExecutionActor.cs:100, AlarmExecutionActor.cs:66, SiteRuntimeOptions.cs:31 (global fallback unchanged).
Gap: Only a global ScriptExecutionTimeoutSeconds; no per-script field. Mirror the existing nullable MinTimeBetweenRuns pattern end-to-end.
Fix: Add int? ExecutionTimeoutSeconds to TemplateScript + EF config (nullable) + migration (runs after M2.0 so the snapshot is clean) + ResolvedScript + flattening map + ScriptActor field; pass it into ScriptExecutionActor/AlarmExecutionActor, which compute effective = perScript ?? options.ScriptExecutionTimeoutSeconds. Validate non-negative.
Tests: flattening test (field threads through), actor test (per-script override vs global default both enforce the CTS timeout), EF round-trip test.
DoD: A per-script timeout overrides the global; absent → global default.
M2.6 — #13: nested Object/List extended-type validation
Classification: standard · Files: src/.../InboundAPI/.../ParameterValidator.cs:109-145, ReturnValueValidator.cs:18.
Gap: Object/List are shape-validated only (object-vs-array); no nested/field-level type validation.
Fix: Recursive descent through the declared Object field schema / List element type, type-checking each level (scalars by extended-type, nested Object/List recursively). Reuse the existing extended-type system; keep error messages path-qualified (field.sub[2].x). Apply symmetrically in both validators.
Tests: tests/.../InboundAPI.Tests — valid nested payload passes; wrong scalar type at depth, wrong list element type, missing required nested field → rejected with path.
DoD: Nested type mismatches are caught at inbound validation, not at script runtime. (Satisfies the M4 cross-reference to this item.)
Status: complete. A shared recursive engine, Commons.Types.InboundApi.InboundApiSchema (parse + path-qualified Validate), backs both validators so they cannot drift. Key finding: the canonical persisted/authored format is JSON Schema (object properties + required, array items) — produced by the Central UI schema builder and the MigrateParametersToJsonSchema migration — but the validators still parsed the legacy flat array [{name,type}] and only shape-checked Object/List. They could not even consume a migrated JSON-Schema-object definition (the Deserialize<List<…>> would fail). Rewriting both to read InboundApiSchema fixes that latent format mismatch and delivers true nested validation; the legacy flat array is still accepted on read (case-insensitive keys) for transition safety. Undeclared-field policy: reject at every level (a declared object rejects any field not in its properties, consistent with the existing top-level InboundAPI-010 "unexpected parameter" rejection); a bare {"type":"object"} with no declared fields stays shape-only. A present-but-null value satisfies any type; only the absence of a required field is an error.
M2.7 — #20 + #21: return-type + argument-type compatibility checks
Classification: standard · Files: src/.../TemplateEngine/Validation/SemanticValidator.cs:62-63,251-266,279-287,390-425.
Gap: BuildReturnMap builds maps never read (no return-type comparison); call validation checks arg count only (comma counting), not arg types.
Fix: #20 — compare a call site's used-return against the target script's declared ReturnDefinition; flag incompatible use. #21 — extract/infer argument types at the call site and check each against the parameter definition (count + type). These share SemanticValidator — implement together. Be conservative: only flag clear mismatches (avoid false positives on dynamically-typed expressions); document the inference limits.
Tests: tests/.../TemplateEngine.Tests SemanticValidator — return-type mismatch flagged; arg type mismatch flagged; correct calls pass; dynamic/unknown types don't false-positive.
DoD: Type-incompatible script calls fail validation, not just count-mismatched ones.
M2.8 — #23: connection-binding completeness as deploy-gating Error + "name exists at site"
Classification: standard · Files: src/.../TemplateEngine/Validation/ValidationService.cs:504-519, ValidationResult.cs:9.
Gap: Missing-binding for a data-sourced attribute is a non-blocking Warning (so IsValid stays true); the "connection name exists at the target site" half is missing.
Fix: Elevate binding-completeness to Error (or add a parallel Error-level check) so a deployment with unresolved bindings fails the gate; add the "binding references a connection that exists on the target site" check (resolve by site connection, not just name presence). Confirm this doesn't break legitimately-unbound attributes (static/non-data-sourced) — only data-sourced attributes require a binding.
Tests: tests/.../TemplateEngine.Tests ValidationService — data-sourced attribute with no binding → Error + IsValid false; binding to a non-existent site connection → Error; static attribute without binding → OK.
DoD: Incomplete/invalid connection bindings block deploy.
M2.9 — #17: MachineDataDb fail-fast (per design doc; reverts Host-008)
Classification: small · Files: src/ZB.MOM.WW.ScadaBridge.Host/DatabaseOptions.cs:6-12, StartupValidator.cs:59-62, tests/.../Host.Tests/StartupValidatorTests.cs (the Central_MissingMachineDataDb_PassesValidation regression).
Fix: Add string? MachineDataDb to DatabaseOptions; add a Central-only Require("ScadaBridge:Database:MachineDataDb", non-empty, ...) in StartupValidator. No DbContext (nothing consumes it). Revert the Host-008 regression test to expect failure when missing; add MachineDataDb to ValidCentralConfig(). Commit message must note the deliberate Host-008 reversal and cite REQ-HOST-3/4 + shipped docker appsettings as justification.
Tests: StartupValidatorTests — Central missing MachineDataDb → fails; present → passes; Site role unaffected.
DoD: Central nodes fail fast on empty MachineDataDb; spec REQ-HOST-4 satisfied.
M2.10 — #18: CI grep-guard against UPDATE/DELETE on AuditLog
Classification: small · Files: new guard test in tests/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase.Tests/ (the only thing that actually runs — no CI service exists; build is Docker-only).
Fix: Add a test that scans the ConfigurationDatabase source tree (and migration SQL) for UPDATE/DELETE statements targeting AuditLog, failing if any are found in C# data-access code. Scope strictly to the AuditLog table (allow purge/delete on Notifications/SiteCalls and partition-switch DDL). This backstops the existing DB-role DENY UPDATE/DELETE (migration 20260602174346). Optionally add an MSBuild target mirroring it, but the test is the enforced control.
Tests: the guard test itself; verify it passes on current clean source and would fail on a planted violation (assert via a unit on the scanner helper).
DoD: A code-level guard fails the test run on AuditLog mutations.
M2.11 — #24: debug snapshot/subscribe for unknown instance returns an error
Classification: small · Files: src/.../DeploymentManager/.../DeploymentManagerActor.cs:845-866.
Gap: Unknown-instance snapshot/subscribe returns an empty snapshot — caller can't distinguish "not deployed" from "deployed-but-empty".
Fix: Check instance registration first; return an explicit "instance not found"/not-deployed error response (matching the existing debug response contract) instead of an empty snapshot.
Tests: tests/.../DeploymentManager (or SiteRuntime) — unknown instance → error response; known empty instance → empty snapshot (unchanged).
DoD: Unknown-instance debug requests are distinguishable from empty ones.
M2.12 — #25: recursion-limit error → site event log
Classification: small · Files: src/.../SiteRuntime/.../ScriptRuntimeContext.cs:302-305,464-466 (thread ISiteEventLogger in, mirroring M1.8's ScriptExecutionActor wiring).
Fix: Inject ISiteEventLogger into ScriptRuntimeContext; on recursion-limit violation, emit a script Error site event (fire-and-forget _ = logger?.LogEventAsync(...)) in addition to the existing ILogger log, at both check sites.
Tests: tests/.../SiteRuntime.Tests — recursion-limit hit emits a site event with category script, severity Error.
DoD: Recursion-limit violations appear in the site event log per spec.
M2.13 — #27: populate obtainable OPC UA / MxGateway transition fields
Classification: small · Files: src/.../DataConnectionLayer/Adapters/RealOpcUaClient.cs:395-403, MxGatewayAlarmMapper.cs:79-113.
Fix: Populate fields that are genuinely obtainable: for OPC UA A&C, add SelectClauses + map Category, Description, OriginalRaiseTime where the server exposes them (extend BuildAlarmEventFilter's SelectClauses); for MxGateway, extract OperatorUser (present in the event but dropped) and any available Current/Limit values. Leave truly-unavailable fields empty and document which are unavailable-by-protocol vs left-empty.
Tests: tests/.../DataConnectionLayer.Tests mapper tests — obtainable fields populate from a representative event; unavailable fields documented.
DoD: Display fields populate where the source provides them.
M2.14 — #28: readiness gate checks required cluster singletons
Classification: standard · Files: src/.../Host/Program.cs:188-201,314-317, new health check (peer to AkkaClusterHealthCheck.cs).
Gap: Readiness covers membership + DB connectivity only; spec wants "required singletons running".
Fix: Add a Ready-tagged health check that, on the active central node, verifies each required singleton proxy is reachable (e.g. NotificationOutboxActor, AuditLogIngestActor, SiteCallAuditActor, AuditLogPurgeActor, SiteAuditReconciliationActor) via a short Ask/Identify with timeout; degrade to Unhealthy if a required singleton is unreachable. Respect the "(if applicable)" softening — only gate on singletons that should be running for this node's role. Keep the probe cheap (cache/identify, short timeout) so readiness polling stays fast.
Tests: tests/.../Host.Tests or IntegrationTests — health check reports Unhealthy when a required singleton proxy is absent; Healthy when present. Avoid flakiness (use Identify with a bounded timeout).
DoD: /health/ready reflects singleton health.
M2.15 — #29: register the site active-node purge gate
Classification: small · Files: src/.../SiteEventLogging/ServiceCollectionExtensions.cs:33-37, site service registration / cluster setup.
Gap: SiteEventLogActiveNodeCheck is consulted by EventLogPurgeService but no implementation is registered on the site node → purge runs on standby too (defaults to () => true).
Fix: Register a SiteEventLogActiveNodeCheck delegate on the site node that returns true only when this node is the cluster leader/active (mirror how central gates active-node work). Keep the null-default behavior for non-clustered test hosts.
Tests: tests/.../SiteEventLogging.Tests — purge gated off on standby, on for active; default-true preserved when unregistered.
DoD: Site event-log purge runs only on the active node.
M2.16 — #30: Health Monitoring consumes FailedWriteCount
Classification: small · Files: src/.../SiteEventLogging/ISiteEventLogger.cs:32-40, Health Monitoring metric path.
Fix: Wire FailedWriteCount into the site health metrics the same way other site metrics are collected/reported (find the existing site metric collection path), so the dangling metric is consumed (surface as a health metric / threshold). Keep it raw-count per the health-reporting conventions.
Tests: tests/.../HealthMonitoring/SiteEventLogging — failed writes increment the reported metric.
DoD: FailedWriteCount reaches Health Monitoring.
M2.17 — #31: reconcile StateTransitionValidator delete-from-NotDeployed
Classification: small · Files: src/.../DeploymentManager/.../StateTransitionValidator.cs:38-41, possibly docs/requirements/Component-DeploymentManager.md (spec matrix).
Fix: git blame/log the CanDelete line to recover intent. Default: align code to the spec matrix — remove NotDeployed from the allowed delete states, add a clarifying comment — UNLESS history shows deliberate orphan-cleanup intent, in which case update the spec matrix (Delete from NotDeployed = Yes, with a no-op-cleanup note) instead. Whichever direction, code and doc must agree at the end.
Tests: tests/.../DeploymentManager StateTransitionValidator — the chosen rule is asserted.
DoD: Code and spec matrix agree on delete-from-NotDeployed.
M2.18 — #26: debug-stream stream-first ordering + replay/dedup
Classification: high-risk · Files: src/.../DebugStreamBridgeActor.cs:89-103,163-166.
Gap: PreStart sends the snapshot first, then opens the gRPC stream → events in the gap window are lost. Spec wants stream-first + replay with timestamp dedup.
Fix: Open the gRPC subscription FIRST (buffer incoming events), then fetch+send the snapshot, then flush buffered events, deduping by timestamp/identity against the snapshot so no gap-window event is lost or double-delivered. Preserve ordering. This is a re-arch of the actor's PreStart lifecycle — keep the existing message contract.
Tests: tests/.../ DebugStreamBridgeActor — an event arriving during the snapshot window is delivered exactly once after the snapshot; ordering preserved; dedup drops the snapshot-overlapping event.
DoD: No gap-window events lost; no duplicates.
M2.19 — #15: LDAP periodic re-query for interactive sessions (SECURITY)
Classification: high-risk · Files: src/.../Security/ServiceCollectionExtensions.cs:86-148 (cookie events), JwtTokenService.cs (wire the unused IsIdleTimedOut/ShouldRefresh/RecordActivity/RefreshToken), RoleMapper.cs, LDAP service interface, CentralUI/Auth/AuthEndpoints.cs (claims-build parity).
Spike first: Determine whether the shared ZB.MOM.WW.Auth.Ldap lib exposes a passwordless service-account group search for an already-authenticated username. Report the answer before building the LDAP leg.
Fix (layered):
- Always achievable — add
CookieAuthenticationEvents.OnValidatePrincipalthat: enforces idle-timeout (reject/sign-out past 30-min idle, advance last-activity on use), and refreshes role claims by re-runningRoleMapperon the stored group claims (picks up central role-mapping changes without LDAP). Stamp aLastLdapCheckclaim. - If the lib supports passwordless group search — when
LastLdapCheckis >15 min old, re-query LDAP groups via the service-account search, re-map roles, update role/site claims. On LDAP failure: keep existing roles, do NOT sign out (per "LDAP failure: new logins fail; active sessions continue with current roles"). If the lib does NOT support it, ship layer 1 and document the residual limitation (group-membership changes picked up only at next login) in the security doc. Rebuild claims identically to/auth/login(same claim types). Use the cookie-only model (embedded-JWT is dispositioned doc-only in M4). Tests (incl. adversarial): idle-timeout enforced; role-mapping change reflected without LDAP; LDAP-down on re-query keeps existing roles (no sign-out); >15-min triggers re-query, <15-min skips (TTL respected); a revoked-group user loses roles after re-query (if LDAP leg shipped). DoD: Interactive sessions enforce idle-timeout and refresh roles per the documented policy; any residual LDAP-dependency limitation is documented.
Cross-cutting
dotnet build ZB.MOM.WW.ScadaBridge.slnxgreen (TreatWarningsAsErrors); relevant unit/integration tests pass per task.- MSSQL-backed tests need
cd infra && docker compose up -d; if unavailable, gate-with-note (M2.0 especially). - Migration tasks (M2.0, M2.5) serialized; M2.0 first.
git diffreview before each commit; design-summary commit messages; one logical slice per commit.- After all tasks: final integration code review, build, and
bash docker/deploy.shsmoke (curl localhost:9000/health/ready).