Native Alarms — Execution Resume Notes
Skill in progress: superpowers-extended-cc:executing-plans on docs/plans/2026-05-29-native-alarms.md.
To resume: /superpowers-extended-cc:executing-plans docs/plans/2026-05-29-native-alarms.md (reads …md.tasks.json).
Workspace
- Worktree: /Users/dohertj2/Desktop/scadalink-design-native-alarms (branch feat/native-alarms, off main @ 09e19db, which holds the design + plan).
- Do all work here; main checkout stays untouched. Build: dotnet build ZB.MOM.WW.ScadaBridge.slnx.
- The shared MS SQL container scadabridge-mssql is up (the ConfigDB MsSql migration-fixture tests use it).
Progress: Tasks 1–27 done & committed; only T28 pending
Commits (oldest→newest): 696da92 T1 … 3bf1d26 T21, a6dcbf6 T22, 1f6c420 T23, T24, 046797e T25, 2b7c765 T26, 003e54c T27. (Full list in git log.)
Cadence is batches of 3 (user choice on resume). Batches 4–8 (T13–27) all ✅. Greens: SiteRuntime 313/313, Communication 200/200, ManagementService 111/111, CentralUI 581/581, Commons registry 8/8, CLI CommandTree 21/21. Next & final: T28 (integration / live verification).
Decisions / deviations — Batch 8 (T25–27)
- T25: InstanceConfigure "Native Alarm Source Overrides" card: inline per-row override (connection dropdown / source-ref / filter; blank = inherited), repo-direct upsert/clear + SaveChangesAsync (same rationale as T24). Structural test (source assertions): InstanceConfigure is heavy (7 services incl. InstanceService + IFlatteningPipeline); CRUD behavior covered by T21. Lists the template's direct native sources (mirrors how the Alarm Overrides card uses GetAlarmsByTemplateIdAsync, not full flatten).
- T26: seed-sites.sh adds a MxAlarmDemo template + GalaxyAlarms native source (connection "ScadaBridge Site X", source-ref $Area_001) + deployed MxAlarmDemo-1 instance. bash -n valid. Live-verified the CLI→cluster path: template create works against the running env2 (:9100); AddTemplateNativeAlarmSource returns "Unknown command" because the running env2 image is pre-T21. This proves the seed/commands are correct, but a full live run needs a cluster rebuild (docker-env2/deploy.sh), which is T28's job. Cleaned up the test template afterward (env2 templates back to []).
- T27: CLAUDE.md (Native Alarms bullet under Data & Communication) + README (Template Engine row) inline; 7 component docs (DCL/SiteRuntime/TemplateEngine/CentralUI/CLI/Communication/ConfigurationDatabase) via 7 parallel documenter subagents. All landed (+265 lines).
T28 plan (final)
- Add OpcUaAlarmLiveSmokeTests SkippableFact (skips when no alarm-capable OPC UA endpoint) mirroring the existing live OPC UA smoke pattern; note infra OPC UA A&C support in docs/test_infra/test_infra.md.
- Full dotnet build + dotnet test ZB.MOM.WW.ScadaBridge.slnx. Watch known-flaky: 5 StaleTagMonitor*Tests (Commons.Tests) + CliConfigTests.Load_MalformedConfigFile (parallelism race). These are NOT regressions.
- Rebuild cluster: bash docker-env2/deploy.sh (or docker/deploy.sh), then re-run docker-env2/seed-sites.sh and confirm native alarms appear in the Central UI Debug View with severity + condition badges.
Decisions / deviations — Batch 7 (T22–24)
- T22: CLI template native-alarm-source add/list/remove + instance native-alarm-source set/clear + README. Known-flaky: CliConfigTests.Load_MalformedConfigFile… fails ~deterministically in the FULL CLI suite (passes in isolation) due to process-global HOME / Console.SetError racing under xUnit parallelism. Confirmed pre-existing (baseline fails 3/3 full runs; my added tests merely perturbed scheduling). Not my regression; treat as known-flaky like the StaleTagMonitor set.
- T23: DebugView alarm table enriched: added Kind + Sev columns, folded SourceReference into the Alarm cell (monospace subtitle), condition badges (Unacked/Shelved/Suppressed) into the State cell, Type/Category/operator/raise-time/value into the row tooltip; FilteredAlarmStates also matches SourceReference. bUnit test sets _connected / _snapshot / _alarmStates via reflection, then asserts markup.
- T24: Template editor got a new "Native Alarms" tab (not a sub-panel), which fits the existing tabbed editor. Repo-direct, NOT the plan's "management HTTP client": TemplateEdit is Blazor Server in-process and calls ITemplateEngineRepository + SaveChangesAsync directly (mirrors how _alarms load/save works). Connection dropdown loads alarm-capable (OpcUa/MxGateway) connections via ICentralUiRepository.GetAllDataConnectionsAsync, deduped by name. Test is structural (source-text assertions like TestRunWarningTests), NOT a full interactive bUnit render: TemplateEdit has ~10 injected services with their own graphs (ScriptAnalysisService/IMemoryCache/ISharedScriptCatalog/BuildParentContextsAsync/GetInstancesFilteredAsync…); the codebase has no precedent for fully rendering it, and the CRUD path is already covered behaviorally by the T21 ManagementActor tests. T25 (InstanceConfigure) may be lighter to full-render; check first.
Decisions / deviations — Batch 6 (T19–21)
- T19: AlarmShelveStateCodec (string↔enum, default Unshelved). Server StreamRelayActor maps msg.Condition.* + Kind.ToString() + nullable OriginalRaiseTime out; client ConvertToDomainEvent rebuilds Condition (severity = wire Priority) + ParseAlarmKind back. Gotcha: client imports Google.Protobuf.WellKnownTypes, so Enum is ambiguous; used System.Enum.TryParse. confirmed proto bool → domain bool? (false, never null after round-trip).
- T20: ManagementCommandRegistry is reflection-based (auto-discovers *Command records in the Management namespace), so no manual registry edit was needed; just added the 7 records to TemplateCommands.cs / InstanceCommands.cs.
- T21: Handlers call ITemplateEngineRepository directly + SaveChangesAsync (per plan's "call the Task 6 repo methods"), NOT through TemplateService/InstanceService. Trade-off: this skips the service-layer IAuditService logging that the existing template/instance alarm CRUD gets. Acceptable for the read-only-mirror authoring commands and matches the plan's Files scope (ManagementService only), but flag if audit parity is wanted later. Roles: template mutations = Design, instance-override mutations = Deployment, lists = any authenticated. Update = fetch-then-mutate (preserves TemplateId for the unique index). Set-override = upsert (Add if absent else Update).
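The T19 codec and its Enum gotcha can be illustrated with a minimal sketch. This is not the project's AlarmShelveStateCodec: the type names ShelveStateSketch and ShelveStateCodecSketch are hypothetical, and the enum members other than Unshelved are assumed (borrowed from the usual OPC UA shelving states), since the notes do not list them.

```csharp
#nullable enable
using System;

// Usage: an unknown or empty wire value falls back to the Unshelved default.
Console.WriteLine(ShelveStateCodecSketch.Decode("TimedShelved"));
Console.WriteLine(ShelveStateCodecSketch.Decode("not-a-state"));
Console.WriteLine(ShelveStateCodecSketch.Encode(ShelveStateSketch.OneShotShelved));

// Hypothetical enum; only Unshelved is named in the notes.
public enum ShelveStateSketch { Unshelved, OneShotShelved, TimedShelved }

public static class ShelveStateCodecSketch
{
    // enum -> wire string
    public static string Encode(ShelveStateSketch state) => state.ToString();

    // wire string -> enum, defaulting to Unshelved for null, blank, or unknown input.
    // System.Enum is fully qualified so the call still resolves in a file that also
    // imports Google.Protobuf.WellKnownTypes, which declares its own Enum type.
    public static ShelveStateSketch Decode(string? wire) =>
        !string.IsNullOrWhiteSpace(wire)
        && System.Enum.TryParse<ShelveStateSketch>(wire, ignoreCase: true, out var parsed)
        && System.Enum.IsDefined(parsed) // rejects numeric strings such as "7"
            ? parsed
            : ShelveStateSketch.Unshelved;
}
```

The IsDefined guard is an extra choice in this sketch, not something the notes mention; it is there because TryParse also accepts numeric strings that do not correspond to a named member.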
Decisions / deviations — Batch 5 (T16–18)
- T16: Connection protocol IS in FlattenedConfiguration.Connections[name].Protocol → ResolveNativeKind maps protocol-contains-"Mx" → NativeMxAccess else NativeOpcUa; passed into NativeAlarmActor. Added _latestAlarmEvents (enriched event per AlarmName) + extracted BuildAlarmStatesSnapshot() used by both HandleSubscribeDebugView and HandleDebugSnapshot (enriched events ∪ Normal-projection fallback for computed alarms that haven't fired). Native actors skipped when _dclManager == null (isolated tests). Beyond the plan's Files list (justified): redeploy/undeploy clear. Added native_alarm_state DELETE to SiteStorageService.RemoveDeployedConfigAsync transaction (undeploy) + ClearNativeAlarmsForInstanceAsync next to ClearStaticOverridesAsync in DeploymentManagerActor redeploy path. Native state survives failover (rehydrate) but resets on redeploy, which mirrors static-override semantics.
- T17: Test-only (as the plan predicted): AlarmStateChanged.Condition getter already defaults to ForComputed(State, Priority) from T2, so computed alarms carry the unified condition without code change. Added regression AlarmActor_ComputedAlarm_CarriesUnifiedConditionState.
- T18: Proto regen done via the documented macOS manual flow (uncomment <Protobuf> → delete vendored → build → copy obj/Debug/net10.0/Protos/*.cs → re-comment). csproj nets to no change. Only Sitestream.cs changed (service SitestreamGrpc.cs untouched; message-only change). confirmed is proto bool per plan (null→false fidelity loss accepted). New fields 8–21 on AlarmStateUpdate.
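The T16 protocol-to-kind mapping is small enough to sketch directly. The following is an illustration under assumptions, not the project's ResolveNativeKind: the names AlarmKindSketch and NativeKindSketch are hypothetical, and the case-sensitive (ordinal) match and the null handling are assumed, since the notes only say "protocol contains Mx".

```csharp
#nullable enable
using System;

// Usage: a protocol name containing "Mx" selects the MxAccess kind; anything else is OPC UA.
Console.WriteLine(NativeKindSketch.ResolveNativeKind("MxGateway"));
Console.WriteLine(NativeKindSketch.ResolveNativeKind("OpcUa"));

// Hypothetical stand-in for the native members of AlarmKind.
public enum AlarmKindSketch { NativeOpcUa, NativeMxAccess }

public static class NativeKindSketch
{
    // Assumption: ordinal, case-sensitive match; a missing protocol falls back to OPC UA.
    public static AlarmKindSketch ResolveNativeKind(string? protocol) =>
        protocol is not null && protocol.Contains("Mx", StringComparison.Ordinal)
            ? AlarmKindSketch.NativeMxAccess
            : AlarmKindSketch.NativeOpcUa;
}
```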
Decisions / deviations made during execution — Batch 4 (T13–15)
- T15: NativeAlarmActor ctor has an optional trailing AlarmKind nativeKind = AlarmKind.NativeOpcUa (additive; keeps the 7-arg call working). T16 will pass NativeMxAccess when the connection protocol is MxGateway. Persistence is fire-and-forget (ContinueWith OnlyOnFaulted logs) and never blocks the actor. State keyed by SourceReference; AlarmName on the emitted AlarmStateChanged is set to the SourceReference. Snapshot path: Snapshot buffers, SnapshotComplete atomic-swaps (dropped → emit Active=false). Live path ignores older TransitionTime; retention drops a condition once !Active && Acknowledged. NativeAlarmSourceUnavailable = log + retain (no emit). Subscribe retry via ScheduleTellOnceCancelable at NativeAlarmRetryIntervalMs.
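The T15 state rules (snapshot buffering with an atomic swap, the stale-transition guard, and the retention rule) can be sketched without any actor plumbing. This is a simplified illustration, not the NativeAlarmActor code: ConditionSketch and NativeAlarmStateSketch are hypothetical types, persistence and retry are left out, and whether conditions retained by a snapshot are re-emitted is not covered here.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Usage: a condition missing from the next snapshot is reported as cleared.
var demo = new NativeAlarmStateSketch();
demo.OnSnapshot(new ConditionSketch("Area.Pump1", true, false, DateTimeOffset.UnixEpoch));
demo.OnSnapshotComplete();
Console.WriteLine(demo.OnSnapshotComplete().Count); // empty follow-up snapshot drops Area.Pump1

// Hypothetical condition shape; the real message carries more fields.
public sealed record ConditionSketch(
    string SourceReference, bool Active, bool Acknowledged, DateTimeOffset TransitionTime);

public sealed class NativeAlarmStateSketch
{
    private Dictionary<string, ConditionSketch> _state = new(StringComparer.Ordinal);
    private Dictionary<string, ConditionSketch>? _buffer;

    public int Count => _state.Count;

    // Snapshot entries are only buffered; live state is untouched until completion.
    public void OnSnapshot(ConditionSketch c) =>
        (_buffer ??= new(StringComparer.Ordinal))[c.SourceReference] = c;

    // Swap the buffer in as the new state and return an Active=false clear
    // for every condition the snapshot dropped.
    public IReadOnlyList<ConditionSketch> OnSnapshotComplete()
    {
        var next = _buffer ?? new(StringComparer.Ordinal);
        _buffer = null;
        var cleared = new List<ConditionSketch>();
        foreach (var (key, old) in _state)
            if (!next.ContainsKey(key))
                cleared.Add(old with { Active = false });
        _state = next;
        return cleared;
    }

    // Returns the transition to emit, or null when it is older than the stored one.
    public ConditionSketch? OnLive(ConditionSketch c)
    {
        if (_state.TryGetValue(c.SourceReference, out var current)
            && c.TransitionTime < current.TransitionTime)
            return null;
        if (!c.Active && c.Acknowledged)
            _state.Remove(c.SourceReference); // retention: inactive + acknowledged is dropped
        else
            _state[c.SourceReference] = c;
        return c;
    }
}
```

Keeping the buffer separate until SnapshotComplete means a half-delivered snapshot never produces spurious clears.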
Known-flaky baseline (NOT my regressions)
- 5 StaleTagMonitor*Tests in ZB.MOM.WW.ScadaBridge.Commons.Tests are timing-flaky under load. User approved treating as known-flaky; do not "fix". Watch only for NEW failures.
Decisions / deviations made during execution (carry forward)
- T2: AlarmStateChanged.Condition is a computed-default property (getter falls back to AlarmConditionStateFactory.ForComputed(State, Priority)); additions are init-props (additive). AlarmConditionStateFactory lives in Commons/Types/Alarms.
- T8: ResolvedNativeAlarmSource has no IsLocked field (per plan). Inheritance lock is enforced via a local lockedNames HashSet inside ResolveInheritedNativeAlarmSources. Override-lock is NOT enforced at flatten (matches plan; UI/validation layer handles it).
- T9: SemanticValidator.Validate gained an optional 3rd param IReadOnlySet<string>? alarmCapableConnectionNames = null. Connection-existence check only runs when callers pass it; empty source-ref / empty connection-name always checked. ValidationCategory.NativeAlarmSourceInvalid added. (Wiring real callers to pass the connection set is not yet done; fine for now.)
- T10: DataConnectionActor routes alarm transitions by source-ref prefix (transition.SourceObjectReference / SourceReference StartsWith bound key), dedup per transition. One feed per source-ref, ref-counted. Internal records AlarmTransitionReceived, AlarmSubscribeCompleted. NativeAlarmSourceUnavailable pushed on entering Reconnecting; ReSubscribeAllAlarms on reconnect.
- T11 (OPC UA): OpcUaAlarmMapper is pure/tested. RealOpcUaClient.CreateAlarmSubscriptionAsync does event MonitoredItem + EventFilter (select clauses indexed 0–12) + ConditionRefresh via CallAsync (the sync Call is obsolete→error). AlarmConditionState collides with Opc.Ua.AlarmConditionState, so it is fully-qualified as Commons.Types.Alarms.AlarmConditionState at the one new site. Behavior unverified until Task 28 (live A&C server).
- T12 (MxGateway): MxGatewayAlarmMapper is pure/tested. Gateway proto enums AlarmConditionState / AlarmTransitionKind collide with Commons enums → aliased (ProtoConditionState / ProtoTransitionKind for proto; explicit using X = Commons… for the Commons ones). MxGatewayClient.StreamAlarmsAsync(StreamAlarmsRequest, ct) → IAsyncEnumerable<AlarmFeedMessage> confirmed present in pkg v0.1.0. Adapter opens one shared session-less feed (gateway-wide, null prefix), ref-counted, first-callback drives it (the actor routes). RealMxGatewayClient.RunAlarmStreamAsync reconnects internally (5s) and does NOT use RaiseDisconnected. Reference: OtOpcUa…Driver.Galaxy/Runtime/GatewayGalaxyAlarmFeed.cs. Behavior unverified until Task 28 (live gateway).
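The T10 routing rule (prefix match on the bound source-ref, one ref-counted feed per key, each transition delivered at most once per subscriber) can be sketched as a plain class. This is an illustration only: AlarmFeedRouterSketch and its string subscriber ids are hypothetical, whereas the real DataConnectionActor works with actor references and messages.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Usage: one subscriber bound to "$Area_001" receives transitions from anything under it.
var router = new AlarmFeedRouterSketch();
router.Subscribe("$Area_001", "actor-a");
Console.WriteLine(string.Join(",", router.Route("$Area_001.Pump1", null)));

public sealed class AlarmFeedRouterSketch
{
    // Bound source-ref key -> subscriber ids; the set size acts as the reference count.
    private readonly Dictionary<string, HashSet<string>> _subscribers = new(StringComparer.Ordinal);

    // Returns true when the key is new, i.e. the caller should open the feed for it.
    public bool Subscribe(string sourceRef, string subscriberId)
    {
        bool isNewKey = !_subscribers.TryGetValue(sourceRef, out var set);
        if (set is null)
            _subscribers[sourceRef] = set = new HashSet<string>(StringComparer.Ordinal);
        set.Add(subscriberId);
        return isNewKey;
    }

    // Returns true when the last subscriber left, i.e. the caller should close the feed.
    public bool Unsubscribe(string sourceRef, string subscriberId)
    {
        if (!_subscribers.TryGetValue(sourceRef, out var set) || !set.Remove(subscriberId))
            return false;
        if (set.Count > 0)
            return false;
        _subscribers.Remove(sourceRef);
        return true;
    }

    // Collects every subscriber whose bound key prefixes either reference.
    // The result is a set, so a subscriber matched through several keys appears once.
    public IReadOnlyCollection<string> Route(string? sourceObjectReference, string? sourceReference)
    {
        var targets = new HashSet<string>(StringComparer.Ordinal);
        foreach (var (key, set) in _subscribers)
        {
            bool match =
                (sourceObjectReference?.StartsWith(key, StringComparison.Ordinal) ?? false)
                || (sourceReference?.StartsWith(key, StringComparison.Ordinal) ?? false);
            if (match)
                targets.UnionWith(set);
        }
        return targets;
    }
}
```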
Execution cadence
- Per-task TDD: write test → confirm RED → implement → GREEN → commit. Update native task status + this .tasks.json each task; report at each batch boundary and wait for "start"/feedback.
- Batches so far: B1 = T1–4, B2 = T5–8, B3 = T9–12. Next proposed: B4 = T13–17.
- Native task IDs map plan Task N → native id (N+6), but on resume the native list is rebuilt from .tasks.json (Step 0).
Watch items for remaining tasks
- T18 (proto): sitestream.proto is not auto-compiled; the <Protobuf> include is commented out and the generated .cs is vendored in SiteStreamGrpc/. Manual macOS regen only (toggle include → dotnet build → copy generated files → re-comment). Do NOT auto-compile on Linux.
- T28: OPC UA A&C live smoke (SkippableFact) + confirm infra OPC UA server exposes A&C; manual deploy check via bash docker/deploy.sh / docker-env2/deploy.sh.