Native Alarms — Execution Resume Notes

Skill in progress: superpowers-extended-cc:executing-plans on docs/plans/2026-05-29-native-alarms.md. To resume: /superpowers-extended-cc:executing-plans docs/plans/2026-05-29-native-alarms.md (reads …md.tasks.json).

Workspace

  • Worktree: /Users/dohertj2/Desktop/scadalink-design-native-alarms (branch feat/native-alarms, off main @ 09e19db which holds the design + plan).
  • Do all work here; main checkout stays untouched. Build: dotnet build ZB.MOM.WW.ScadaBridge.slnx.
  • The shared MS SQL container scadabridge-mssql is up (the ConfigDB MsSql migration-fixture tests use it).

Progress: Tasks 1–27 done & committed; only T28 pending

Commits (oldest→newest): 696da92 T1 … 3bf1d26 T21, a6dcbf6 T22, 1f6c420 T23, T24, 046797e T25, 2b7c765 T26, 003e54c T27. (Full list in git log.)

Cadence is batches of 3 (user choice on resume). Batches 4–8 (T13–27) all done. Greens: SiteRuntime 313/313, Communication 200/200, ManagementService 111/111, CentralUI 581/581, Commons registry 8/8, CLI CommandTree 21/21. Next & final: T28 (integration / live verification).

Decisions / deviations: Batch 8 (T25–27)

  • T25: InstanceConfigure "Native Alarm Source Overrides" card — inline per-row override (connection dropdown / source-ref / filter; blank=inherited), repo-direct upsert/clear + SaveChangesAsync (same rationale as T24). Structural test (source assertions) — InstanceConfigure is heavy (7 services incl. InstanceService + IFlatteningPipeline); CRUD behavior covered by T21. Lists the template's direct native sources (mirrors how the Alarm Overrides card uses GetAlarmsByTemplateIdAsync, not full flatten).
  • T26: seed-sites.sh adds a MxAlarmDemo template + GalaxyAlarms native source (connection "ScadaBridge Site X", source-ref $Area_001) + deployed MxAlarmDemo-1 instance. bash -n valid. Live-verified the CLI→cluster path: template create works against the running env2 (:9100); AddTemplateNativeAlarmSource returns "Unknown command" because the running env2 image is pre-T21 — proves the seed/commands are correct but full live run needs a cluster rebuild (docker-env2/deploy.sh), which is T28's job. Cleaned up the test template afterward (env2 templates back to []).
  • T27: CLAUDE.md (Native Alarms bullet under Data & Communication) + README (Template Engine row) inline; 7 component docs (DCL/SiteRuntime/TemplateEngine/CentralUI/CLI/Communication/ConfigurationDatabase) via 7 parallel documenter subagents. All landed (+265 lines).

T28 plan (final)

  • Add OpcUaAlarmLiveSmokeTests SkippableFact (skips when no alarm-capable OPC UA endpoint) mirroring the existing live OPC UA smoke pattern; note infra OPC UA A&C support in docs/test_infra/test_infra.md.
  • Full dotnet build + dotnet test ZB.MOM.WW.ScadaBridge.slnx. Watch known-flaky: 5 StaleTagMonitor*Tests (Commons.Tests) + CliConfigTests.Load_MalformedConfigFile (parallelism race) — NOT regressions.
  • Rebuild cluster: bash docker-env2/deploy.sh (or docker/deploy.sh), then re-run docker-env2/seed-sites.sh and confirm native alarms appear in the Central UI Debug View with severity + condition badges.

Decisions / deviations: Batch 7 (T22–24)

  • T22: CLI template native-alarm-source add/list/remove + instance native-alarm-source set/clear + README. Known-flaky: CliConfigTests.Load_MalformedConfigFile… fails ~deterministically in the FULL CLI suite (passes in isolation) due to process-global HOME/Console.SetError racing under xUnit parallelism — confirmed pre-existing (baseline fails 3/3 full runs; my added tests merely perturbed scheduling). Not my regression; treat as known-flaky like the StaleTagMonitor set.
  • T23: DebugView alarm table enriched — added Kind + Sev columns, folded SourceReference into the Alarm cell (monospace subtitle), condition badges (Unacked/Shelved/Suppressed) into the State cell, Type/Category/operator/raise-time/value into the row tooltip; FilteredAlarmStates also matches SourceReference. bUnit test sets _connected/_snapshot/_alarmStates via reflection then asserts markup.
  • T24: Template editor got a new "Native Alarms" tab (not a sub-panel) — fits the existing tabbed editor. Repo-direct, NOT the plan's "management HTTP client": TemplateEdit is Blazor Server in-process and calls ITemplateEngineRepository + SaveChangesAsync directly (mirrors how _alarms load/save works). Connection dropdown loads alarm-capable (OpcUa/MxGateway) connections via ICentralUiRepository.GetAllDataConnectionsAsync deduped by name. Test is structural (source-text assertions like TestRunWarningTests), NOT a full interactive bUnit render — TemplateEdit has ~10 injected services with their own graphs (ScriptAnalysisService/IMemoryCache/ISharedScriptCatalog/BuildParentContextsAsync/GetInstancesFilteredAsync…); the codebase has no precedent for fully rendering it, and the CRUD path is already covered behaviorally by the T21 ManagementActor tests. T25 (InstanceConfigure) may be lighter to full-render — check first.

Decisions / deviations: Batch 6 (T19–21)

  • T19: AlarmShelveStateCodec (string↔enum, default Unshelved). Server StreamRelayActor maps msg.Condition.* + Kind.ToString() + nullable OriginalRaiseTime out; client ConvertToDomainEvent rebuilds Condition (severity = wire Priority) + ParseAlarmKind back. Gotcha: client imports Google.Protobuf.WellKnownTypes, so Enum is ambiguous; used System.Enum.TryParse. The confirmed field: proto bool → domain bool? (false, never null after round-trip).
  • T20: ManagementCommandRegistry is reflection-based (auto-discovers *Command records in the Management namespace) — no manual registry edit needed; just added the 7 records to TemplateCommands.cs / InstanceCommands.cs.
  • T21: Handlers call ITemplateEngineRepository directly + SaveChangesAsync (per plan's "call the Task 6 repo methods"), NOT through TemplateService/InstanceService. Trade-off: this skips the service-layer IAuditService logging that the existing template/instance alarm CRUD gets. Acceptable for the read-only-mirror authoring commands and matches the plan's Files scope (ManagementService only), but flag if audit parity is wanted later. Roles: template mutations = Design, instance-override mutations = Deployment, lists = any authenticated. Update = fetch-then-mutate (preserves TemplateId for the unique index). Set-override = upsert (Add if absent else Update).
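
The T19 codec and the Enum ambiguity gotcha can be illustrated with a minimal sketch. This is not the committed AlarmShelveStateCodec: the class name, the method names, and every enum member other than Unshelved are assumptions made for illustration.

```csharp
using System;

// Hypothetical enum: only Unshelved is named in these notes; the other members are assumed.
public enum AlarmShelveState { Unshelved, OneShotShelved, TimedShelved }

public static class AlarmShelveStateCodecSketch
{
    // Enum to wire string.
    public static string Encode(AlarmShelveState state) => state.ToString();

    // Wire string to enum. Null, unknown, or out-of-range input falls back to Unshelved.
    public static AlarmShelveState Decode(string? wire) =>
        // System.Enum is fully qualified so the call stays unambiguous in files that
        // also import Google.Protobuf.WellKnownTypes, which declares its own Enum type.
        System.Enum.TryParse<AlarmShelveState>(wire, ignoreCase: true, out var parsed)
            && System.Enum.IsDefined(typeof(AlarmShelveState), parsed)
            ? parsed
            : AlarmShelveState.Unshelved;
}
```

The IsDefined guard is there because TryParse also accepts numeric strings such as "99" and would otherwise return an undefined enum value instead of the default.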

Decisions / deviations: Batch 5 (T16–18)

  • T16: Connection protocol IS in FlattenedConfiguration.Connections[name].Protocol. ResolveNativeKind maps protocol-contains-"Mx" → NativeMxAccess, else NativeOpcUa; the result is passed into NativeAlarmActor. Added _latestAlarmEvents (enriched event per AlarmName) + extracted BuildAlarmStatesSnapshot(), used by both HandleSubscribeDebugView and HandleDebugSnapshot (enriched events, plus a Normal-projection fallback for computed alarms that haven't fired). Native actors are skipped when _dclManager == null (isolated tests). Beyond the plan's Files list (justified): redeploy/undeploy clear. Added native_alarm_state DELETE to the SiteStorageService.RemoveDeployedConfigAsync transaction (undeploy) + ClearNativeAlarmsForInstanceAsync next to ClearStaticOverridesAsync in the DeploymentManagerActor redeploy path. Native state survives failover (rehydrate) but resets on redeploy, mirroring static-override semantics.
  • T17: Test-only (as the plan predicted) — AlarmStateChanged.Condition getter already defaults to ForComputed(State, Priority) from T2, so computed alarms carry the unified condition without code change. Added regression AlarmActor_ComputedAlarm_CarriesUnifiedConditionState.
  • T18: Proto regen done via the documented macOS manual flow (uncomment <Protobuf> → delete vendored → build → copy obj/Debug/net10.0/Protos/*.cs → re-comment). csproj nets to no change. Only Sitestream.cs changed (service SitestreamGrpc.cs untouched; message-only change). The confirmed field is proto bool per plan (null→false fidelity loss accepted). New fields 8–21 on AlarmStateUpdate.
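
The T16 protocol-to-kind mapping is small enough to sketch. The block below is an illustration under assumptions, not the committed ResolveNativeKind: the enclosing class name, the Computed enum member, the null handling, and the case-insensitive comparison are all assumed; only the "contains Mx → NativeMxAccess, else NativeOpcUa" rule comes from the notes above.

```csharp
using System;

// NativeOpcUa and NativeMxAccess are named in these notes; Computed is an assumed member.
public enum AlarmKind { Computed, NativeOpcUa, NativeMxAccess }

public static class NativeKindSketch
{
    // Any protocol name containing "Mx" (for example "MxGateway") maps to NativeMxAccess;
    // everything else, including a missing protocol, is treated as OPC UA.
    public static AlarmKind ResolveNativeKind(string? protocol) =>
        protocol is not null && protocol.Contains("Mx", StringComparison.OrdinalIgnoreCase)
            ? AlarmKind.NativeMxAccess
            : AlarmKind.NativeOpcUa;
}
```

Defaulting to NativeOpcUa lines up with the default value of the optional nativeKind constructor argument described under T15.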

Decisions / deviations made during execution: Batch 4 (T13–15)

  • T15: NativeAlarmActor ctor has an optional trailing AlarmKind nativeKind = AlarmKind.NativeOpcUa (additive — keeps the 7-arg call working). T16 will pass NativeMxAccess when the connection protocol is MxGateway. Persistence is fire-and-forget (ContinueWith OnlyOnFaulted logs) — never blocks the actor. State keyed by SourceReference; AlarmName on the emitted AlarmStateChanged is set to the SourceReference. Snapshot path: Snapshot buffers, SnapshotComplete atomic-swaps (dropped → emit Active=false). Live path ignores older TransitionTime; retention drops a condition once !Active && Acknowledged. NativeAlarmSourceUnavailable = log + retain (no emit). Subscribe retry via ScheduleTellOnceCancelable at NativeAlarmRetryIntervalMs.
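
The T15 state rules (keyed by SourceReference, stale transitions ignored, retention on !Active && Acknowledged, snapshot buffered then swapped) can be sketched without the actor plumbing. Everything here is hypothetical: the record shape, the class, and the method names are illustrative, and persistence, retry, and event emission are left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal condition record; the real type carries more fields.
public sealed record NativeCondition(
    string SourceReference, bool Active, bool Acknowledged, DateTimeOffset TransitionTime);

public sealed class NativeAlarmStateSketch
{
    private readonly Dictionary<string, NativeCondition> _live = new Dictionary<string, NativeCondition>();
    private Dictionary<string, NativeCondition>? _snapshotBuffer;

    public IReadOnlyDictionary<string, NativeCondition> Live => _live;

    // Live path: returns false when the transition is older than the stored one.
    public bool ApplyLive(NativeCondition incoming)
    {
        if (_live.TryGetValue(incoming.SourceReference, out var current)
            && incoming.TransitionTime < current.TransitionTime)
            return false;

        // Retention: a condition that is inactive and acknowledged is dropped.
        if (!incoming.Active && incoming.Acknowledged)
            _live.Remove(incoming.SourceReference);
        else
            _live[incoming.SourceReference] = incoming;
        return true;
    }

    // Snapshot path: entries accumulate in a side buffer until the snapshot completes.
    public void BufferSnapshot(NativeCondition condition)
    {
        _snapshotBuffer ??= new Dictionary<string, NativeCondition>();
        _snapshotBuffer[condition.SourceReference] = condition;
    }

    // Swaps the buffer in and returns the source references that disappeared,
    // so the caller can emit Active=false for each of them.
    public IReadOnlyList<string> CompleteSnapshot()
    {
        var fresh = _snapshotBuffer ?? new Dictionary<string, NativeCondition>();
        _snapshotBuffer = null;

        var dropped = new List<string>();
        foreach (var key in _live.Keys)
            if (!fresh.ContainsKey(key)) dropped.Add(key);

        _live.Clear();
        foreach (var pair in fresh) _live[pair.Key] = pair.Value;
        return dropped;
    }
}
```

One limitation of this sketch: once a condition has been dropped by retention, a late stale transition for the same SourceReference is no longer rejected, because there is no stored TransitionTime left to compare against.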

Known-flaky baseline (NOT my regressions)

  • 5 StaleTagMonitor*Tests in ZB.MOM.WW.ScadaBridge.Commons.Tests are timing-flaky under load. User approved treating as known-flaky; do not "fix". Watch only for NEW failures.

Decisions / deviations made during execution (carry forward)

  • T2: AlarmStateChanged.Condition is a computed-default property (getter falls back to AlarmConditionStateFactory.ForComputed(State, Priority)); additions are init-props (additive). AlarmConditionStateFactory lives in Commons/Types/Alarms.
  • T8: ResolvedNativeAlarmSource has no IsLocked field (per plan). Inheritance lock is enforced via a local lockedNames HashSet inside ResolveInheritedNativeAlarmSources. Override-lock is NOT enforced at flatten (matches plan; UI/validation layer handles it).
  • T9: SemanticValidator.Validate gained an optional 3rd param IReadOnlySet<string>? alarmCapableConnectionNames = null. Connection-existence check only runs when callers pass it; empty source-ref / empty connection-name always checked. ValidationCategory.NativeAlarmSourceInvalid added. (Wiring real callers to pass the connection set is not yet done — fine for now.)
  • T10: DataConnectionActor routes alarm transitions by source-ref prefix (transition.SourceObjectReference/SourceReference StartsWith bound key), dedup per transition. One feed per source-ref, ref-counted. Internal records AlarmTransitionReceived, AlarmSubscribeCompleted. NativeAlarmSourceUnavailable pushed on entering Reconnecting; ReSubscribeAllAlarms on reconnect.
  • T11 (OPC UA): OpcUaAlarmMapper is pure/tested. RealOpcUaClient.CreateAlarmSubscriptionAsync does event MonitoredItem + EventFilter (select clauses indexed 0–12) + ConditionRefresh via CallAsync (the sync Call is obsolete→error). AlarmConditionState collides with Opc.Ua.AlarmConditionState, so it is fully-qualified as Commons.Types.Alarms.AlarmConditionState at the one new site. Behavior unverified until Task 28 (live A&C server).
  • T12 (MxGateway): MxGatewayAlarmMapper is pure/tested. Gateway proto enums AlarmConditionState/AlarmTransitionKind collide with Commons enums → aliased (ProtoConditionState/ProtoTransitionKind for proto; explicit using X = Commons… for the Commons ones). MxGatewayClient.StreamAlarmsAsync(StreamAlarmsRequest, ct) → IAsyncEnumerable<AlarmFeedMessage> confirmed present in pkg v0.1.0. Adapter opens one shared session-less feed (gateway-wide, null prefix), ref-counted, first-callback drives it (the actor routes). RealMxGatewayClient.RunAlarmStreamAsync reconnects internally (5s) — does NOT use RaiseDisconnected. Reference: OtOpcUa …Driver.Galaxy/Runtime/GatewayGalaxyAlarmFeed.cs. Behavior unverified until Task 28 (live gateway).
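
The T2 computed-default property (which T17 relies on) can be sketched as follows. The type names match the notes, but the record shapes are hypothetical and trimmed to two fields; the Active and Severity mapping inside ForComputed is an assumption, loosely guided by the "severity = wire Priority" note under T19.

```csharp
public enum AlarmState { Normal, Active }

// Hypothetical minimal shape; the real AlarmConditionState carries more fields.
public sealed record AlarmConditionState(bool Active, int Severity);

public static class AlarmConditionStateFactory
{
    // Assumed projection: an Active computed alarm is an active condition,
    // and its priority doubles as severity.
    public static AlarmConditionState ForComputed(AlarmState state, int priority) =>
        new AlarmConditionState(state == AlarmState.Active, priority);
}

public sealed record AlarmStateChanged(string AlarmName, AlarmState State, int Priority)
{
    private readonly AlarmConditionState? _condition;

    // Additive init-only property: existing call sites never set it and get the
    // computed default; native alarm sources set it explicitly.
    public AlarmConditionState Condition
    {
        get => _condition ?? AlarmConditionStateFactory.ForComputed(State, Priority);
        init => _condition = value;
    }
}
```

With this shape, new AlarmStateChanged("HighTemp", AlarmState.Active, 700).Condition yields a condition with Severity 700 without any caller change, while a native source can supply its own value through an object initializer or a with expression.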

Execution cadence

  • Per-task TDD: write test → confirm RED → implement → GREEN → commit. Update native task status + this .tasks.json each task; report at each batch boundary and wait for "start"/feedback.
  • Batches so far: B1 = T1–4, B2 = T5–8, B3 = T9–12. Next proposed: B4 = T13–17.
  • Native task IDs map plan Task N → native id (N+6) — but on resume the native list is rebuilt from .tasks.json (Step 0).

Watch items for remaining tasks

  • T18 (proto): sitestream.proto is not auto-compiled: the <Protobuf> include is commented out and the generated .cs is vendored in SiteStreamGrpc/. Manual macOS regen only (toggle include → dotnet build → copy generated files → re-comment). Do NOT auto-compile on Linux.
  • T28: OPC UA A&C live smoke (SkippableFact) + confirm infra OPC UA server exposes A&C; manual deploy check via bash docker/deploy.sh / docker-env2/deploy.sh.