# Native Alarms — Execution Resume Notes
**Skill in progress:** `superpowers-extended-cc:executing-plans` on `docs/plans/2026-05-29-native-alarms.md`.
**To resume:** `/superpowers-extended-cc:executing-plans docs/plans/2026-05-29-native-alarms.md` (reads `…md.tasks.json`).
## Workspace
- **Worktree:** `/Users/dohertj2/Desktop/scadalink-design-native-alarms` (branch `feat/native-alarms`, off `main` @ `09e19db` which holds the design + plan).
- Do all work here; `main` checkout stays untouched. Build: `dotnet build ZB.MOM.WW.ScadaBridge.slnx`.
- The shared MS SQL container `scadabridge-mssql` is up (the ConfigDB MsSql migration-fixture tests use it).
## Progress: ALL 28 TASKS DONE & COMMITTED ✅
## T28 (live integration) — results & a real bug it caught
- **OPC UA A&C live smoke** `[SkippableFact]` **PASSED** against the infra OPC PLC server (real ConditionRefresh snapshot round-trip) — opc-plc DOES expose A&C; noted in test_infra.md.
- **Full build + `dotnet test ZB.MOM.WW.ScadaBridge.slnx`: zero native-alarm regressions.** Every project I touched passes in isolation (SiteRuntime 314, CentralUI 584, DCL 104, ManagementService 111, Communication 200, TemplateEngine 332, ConfigDB 248). Pre-existing/known-flaky failures only: Commons `StaleTagMonitor*` (5, timing), 1 SiteRuntime parallelism flake (314/314 isolated), AuditLog (2) + IntegrationTests (2, `near "IF"` SQLite) — the latter **confirmed failing identically on `main`**.
- **env2 cluster rebuilt** with the new image; **backend verified live** (`template native-alarm-source add`/`list` round-trip through the rebuilt ManagementActor; instance deployed).
- **UI verified live via Claude-for-Chrome** at env2 Traefik LB `:9100`: template editor **Native Alarms tab** lists `GalaxyAlarms` + Add-modal connection dropdown filtered to alarm-capable (MxGateway) conns; instance **Native Alarm Source Overrides** card; Debug View **enriched alarm table (Kind + Sev columns)** + connects live ("Streaming MxAlarmDemo-1").
- **RUNTIME BUG found live + fixed (`add7210`):** the NativeAlarmActor sends `SubscribeAlarmsRequest` to the DCL **manager**, but `DataConnectionManagerActor` only routed tag/write/browse — alarm subscribe/unsubscribe were **unhandled → dead-lettered**, so native alarms never subscribed. Unit tests missed it (T15 used a probe, T10 tested the connection actor directly). Added `HandleRouteAlarms` forwarding (mirrors `HandleRoute`) + regression test. **Re-verified live:** after rebuild, the logs show `[ScadaBridge Site X] Alarm feed subscribed for source $Area_001` → `Native alarm subscription established for GalaxyAlarms` — full chain InstanceActor→NativeAlarmActor→DCL manager→DataConnectionActor→**live MxAccess Gateway** working. (No transitions flowed only because `$Area_001` had no active alarms at the time.)
- **T26 seed bug found live + fixed (`f4ae44a`):** `instance deploy` used `--instance-id`; corrected to `--id`.
## (historical) Progress: Tasks 1–27 done & committed; only T28 pending
Commits (oldest→newest): `696da92` T1 … `3bf1d26` T21, `a6dcbf6` T22, `1f6c420` T23, T24, `046797e` T25, `2b7c765` T26, `003e54c` T27. (Full list in git log.)
**Cadence is batches of 3 (user choice on resume).** Batches 4–8 (T13–27) all ✅. Greens: SiteRuntime 313/313, Communication 200/200, ManagementService 111/111, CentralUI 581/581, Commons registry 8/8, CLI CommandTree 21/21. **Next & final: T28 (integration / live verification).**
## Decisions / deviations — Batch 8 (T25–27)
- **T25:** InstanceConfigure "Native Alarm Source Overrides" card — inline per-row override (connection dropdown / source-ref / filter; blank=inherited), repo-direct upsert/clear + SaveChangesAsync (same rationale as T24). **Structural test** (source assertions) — InstanceConfigure is heavy (7 services incl. InstanceService + IFlatteningPipeline); CRUD behavior covered by T21. Lists the *template's direct* native sources (mirrors how the Alarm Overrides card uses `GetAlarmsByTemplateIdAsync`, not full flatten).
- **T26:** seed-sites.sh adds a `MxAlarmDemo` template + `GalaxyAlarms` native source (connection "ScadaBridge Site X", source-ref `$Area_001`) + deployed `MxAlarmDemo-1` instance. **bash -n valid.** Live-verified the CLI→cluster path: `template create` works against the running env2 (:9100); `AddTemplateNativeAlarmSource` returns "Unknown command" because the **running env2 image is pre-T21** — proves the seed/commands are correct but full live run needs a cluster rebuild (`docker-env2/deploy.sh`), which is **T28's** job. Cleaned up the test template afterward (env2 templates back to []).
- **T27:** CLAUDE.md (Native Alarms bullet under Data & Communication) + README (Template Engine row) inline; 7 component docs (DCL/SiteRuntime/TemplateEngine/CentralUI/CLI/Communication/ConfigurationDatabase) via 7 parallel `documenter` subagents. All landed (+265 lines).
## T28 plan (final)
- Add `OpcUaAlarmLiveSmokeTests` SkippableFact (skips when no alarm-capable OPC UA endpoint) mirroring the existing live OPC UA smoke pattern; note infra OPC UA A&C support in `docs/test_infra/test_infra.md`.
- Full `dotnet build` + `dotnet test ZB.MOM.WW.ScadaBridge.slnx`. Watch known-flaky: 5 StaleTagMonitor*Tests (Commons.Tests) + `CliConfigTests.Load_MalformedConfigFile` (parallelism race) — NOT regressions.
- Rebuild cluster: `bash docker-env2/deploy.sh` (or docker/deploy.sh), then re-run `docker-env2/seed-sites.sh` and confirm native alarms appear in the Central UI Debug View with severity + condition badges.
## Decisions / deviations — Batch 7 (T22–24)
- **T22:** CLI `template native-alarm-source add/list/remove` + `instance native-alarm-source set/clear` + README. **Known-flaky:** `CliConfigTests.Load_MalformedConfigFile…` fails ~deterministically in the FULL CLI suite (passes in isolation) due to process-global `HOME`/`Console.SetError` racing under xUnit parallelism — **confirmed pre-existing** (baseline fails 3/3 full runs; my added tests merely perturbed scheduling). Not my regression; treat as known-flaky like the StaleTagMonitor set.
- **T23:** DebugView alarm table enriched — added Kind + Sev columns, folded SourceReference into the Alarm cell (monospace subtitle), condition badges (Unacked/Shelved/Suppressed) into the State cell, Type/Category/operator/raise-time/value into the row tooltip; `FilteredAlarmStates` also matches SourceReference. bUnit test sets `_connected`/`_snapshot`/`_alarmStates` via reflection then asserts markup.
- **T24:** Template editor got a new **"Native Alarms" tab** (not a sub-panel) — fits the existing tabbed editor. **Repo-direct, NOT the plan's "management HTTP client":** TemplateEdit is Blazor Server in-process and calls `ITemplateEngineRepository` + `SaveChangesAsync` directly (mirrors how `_alarms` load/save works). Connection dropdown loads alarm-capable (OpcUa/MxGateway) connections via `ICentralUiRepository.GetAllDataConnectionsAsync` deduped by name. **Test is structural** (source-text assertions like `TestRunWarningTests`), NOT a full interactive bUnit render — TemplateEdit has ~10 injected services with their own graphs (ScriptAnalysisService/IMemoryCache/ISharedScriptCatalog/BuildParentContextsAsync/GetInstancesFilteredAsync…); the codebase has no precedent for fully rendering it, and the CRUD path is already covered behaviorally by the T21 ManagementActor tests. **T25 (InstanceConfigure) may be lighter to full-render — check first.**
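The T24 connection dropdown described above (alarm-capable protocols only, deduped by name) can be sketched roughly as follows. This is a minimal illustration, not the editor's code: the `DataConnectionRow` record, the `AlarmCapableConnections` helper, the exact protocol strings, the case-insensitive match, and the sort order are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape; the real data-connection entity is not shown in these notes.
public sealed record DataConnectionRow(string Name, string Protocol);

public static class AlarmCapableConnections
{
    // Assumed protocol identifiers, taken from the "OpcUa/MxGateway" wording above.
    private static readonly HashSet<string> AlarmCapableProtocols =
        new(StringComparer.OrdinalIgnoreCase) { "OpcUa", "MxGateway" };

    // Keeps alarm-capable connections, collapses duplicates by name,
    // and sorts the names so the dropdown order is stable.
    public static IReadOnlyList<string> Names(IEnumerable<DataConnectionRow> all) =>
        all.Where(c => AlarmCapableProtocols.Contains(c.Protocol))
           .Select(c => c.Name)
           .Distinct(StringComparer.Ordinal)
           .OrderBy(n => n, StringComparer.Ordinal)
           .ToList();
}
```

In the real editor the rows would come from `ICentralUiRepository.GetAllDataConnectionsAsync`; the sketch takes any `IEnumerable` so it can be exercised without a repository.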
## Decisions / deviations — Batch 6 (T19–21)
- **T19:** `AlarmShelveStateCodec` (string↔enum, default Unshelved). Server `StreamRelayActor` maps `msg.Condition.*` + `Kind.ToString()` + nullable `OriginalRaiseTime` out; client `ConvertToDomainEvent` rebuilds `Condition` (severity = wire `Priority`) + `ParseAlarmKind` back. **Gotcha:** client imports `Google.Protobuf.WellKnownTypes`, so `Enum` is ambiguous — used `System.Enum.TryParse`. `confirmed` proto bool → domain `bool?` (false, never null after round-trip).
- **T20:** `ManagementCommandRegistry` is **reflection-based** (auto-discovers `*Command` records in the Management namespace) — no manual registry edit needed; just added the 7 records to TemplateCommands.cs / InstanceCommands.cs.
- **T21:** Handlers call `ITemplateEngineRepository` **directly** + `SaveChangesAsync` (per plan's "call the Task 6 repo methods"), NOT through TemplateService/InstanceService. **Trade-off:** this skips the service-layer `IAuditService` logging that the existing template/instance alarm CRUD gets. Acceptable for the read-only-mirror authoring commands and matches the plan's Files scope (ManagementService only), but flag if audit parity is wanted later. Roles: template mutations = Design, instance-override mutations = Deployment, lists = any authenticated. Update = fetch-then-mutate (preserves TemplateId for the unique index). Set-override = upsert (Add if absent else Update).
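For the T19 codec above, a string↔enum mapping with an `Unshelved` default might look like the sketch below. Only `Unshelved` is named in these notes; the other enum members, the `Sketch` class name, and the case-insensitive parse are assumptions. `System.Enum` is written fully qualified to mirror the ambiguity workaround noted for the client.

```csharp
using System;

// Hypothetical member set: only Unshelved is named in these notes.
public enum AlarmShelveState { Unshelved, OneShotShelved, TimedShelved }

public static class AlarmShelveStateCodecSketch
{
    // Enum -> wire string.
    public static string Encode(AlarmShelveState state) => state.ToString();

    // Wire string -> enum; null, unknown names, and undefined numeric values fall back to Unshelved.
    public static AlarmShelveState Decode(string? wire) =>
        System.Enum.TryParse<AlarmShelveState>(wire, ignoreCase: true, out var parsed)
            && System.Enum.IsDefined(typeof(AlarmShelveState), parsed)
            ? parsed
            : AlarmShelveState.Unshelved;
}
```

The `IsDefined` check is there because `TryParse` accepts numeric strings such as `"7"` even when no member has that value.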
## Decisions / deviations — Batch 5 (T16–18)
- **T16:** Connection protocol IS in `FlattenedConfiguration.Connections[name].Protocol`; `ResolveNativeKind` maps protocol-contains-"Mx" → `NativeMxAccess` else `NativeOpcUa`; passed into NativeAlarmActor. Added `_latestAlarmEvents` (enriched event per AlarmName) + extracted `BuildAlarmStatesSnapshot()` used by both `HandleSubscribeDebugView` and `HandleDebugSnapshot` (enriched events, plus a Normal-projection fallback for computed alarms that haven't fired). Native actors skipped when `_dclManager == null` (isolated tests). **Beyond the plan's Files list (justified):** redeploy/undeploy clear — added `native_alarm_state` DELETE to `SiteStorageService.RemoveDeployedConfigAsync` transaction (undeploy) + `ClearNativeAlarmsForInstanceAsync` next to `ClearStaticOverridesAsync` in `DeploymentManagerActor` redeploy path. Native state survives failover (rehydrate) but resets on redeploy — mirrors static-override semantics.
- **T17:** Test-only (as the plan predicted) — `AlarmStateChanged.Condition` getter already defaults to `ForComputed(State, Priority)` from T2, so computed alarms carry the unified condition without code change. Added regression `AlarmActor_ComputedAlarm_CarriesUnifiedConditionState`.
- **T18:** Proto regen done via the documented macOS manual flow (uncomment `<Protobuf>` → delete vendored → build → copy `obj/Debug/net10.0/Protos/*.cs` → re-comment). csproj nets to no change. Only `Sitestream.cs` changed (service `SitestreamGrpc.cs` untouched — message-only change). `confirmed` is proto `bool` per plan (null→false fidelity loss accepted). New fields 8–21 on `AlarmStateUpdate`.
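The T16 protocol-to-kind rule above (contains "Mx" gives `NativeMxAccess`, anything else gives `NativeOpcUa`) is small enough to sketch directly. The enum here lists only the two members these notes name, and the case-sensitive comparison and null handling are assumptions; the real `ResolveNativeKind` may differ on both.

```csharp
using System;

// Partial, illustrative enum: only the two native kinds named in these notes.
public enum AlarmKind { NativeOpcUa, NativeMxAccess }

public static class NativeKindSketch
{
    // A protocol containing "Mx" maps to NativeMxAccess; everything else,
    // including a missing protocol, falls back to NativeOpcUa.
    public static AlarmKind ResolveNativeKind(string? protocol) =>
        protocol is not null && protocol.Contains("Mx", StringComparison.Ordinal)
            ? AlarmKind.NativeMxAccess
            : AlarmKind.NativeOpcUa;
}
```

Falling back to `NativeOpcUa` matches the default on the `NativeAlarmActor` constructor parameter described under T15.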
## Decisions / deviations made during execution — Batch 4 (T13–15)
- **T15:** `NativeAlarmActor` ctor has an optional trailing `AlarmKind nativeKind = AlarmKind.NativeOpcUa` (additive — keeps the 7-arg call working). T16 will pass `NativeMxAccess` when the connection protocol is MxGateway. Persistence is **fire-and-forget** (`ContinueWith` OnlyOnFaulted logs) — never blocks the actor. State keyed by `SourceReference`; `AlarmName` on the emitted `AlarmStateChanged` is set to the `SourceReference`. Snapshot path: `Snapshot` buffers, `SnapshotComplete` atomic-swaps (dropped → emit `Active=false`). Live path ignores older `TransitionTime`; retention drops a condition once `!Active && Acknowledged`. `NativeAlarmSourceUnavailable` = log + retain (no emit). Subscribe retry via `ScheduleTellOnceCancelable` at `NativeAlarmRetryIntervalMs`.
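The T15 state rules above (buffer during `Snapshot`, swap on `SnapshotComplete`, ignore older transitions, drop a condition once it is inactive and acknowledged) can be illustrated without any actor plumbing. The sketch below is an assumption-laden model: the record, class, and method names are hypothetical, persistence and retry are omitted, and it returns the cleared conditions instead of emitting messages.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical condition shape, keyed by SourceReference as in the notes.
public sealed record NativeConditionSketch(
    string SourceReference, bool Active, bool Acknowledged, DateTimeOffset TransitionTime);

public sealed class NativeAlarmStateSketch
{
    private Dictionary<string, NativeConditionSketch> _live = new();
    private Dictionary<string, NativeConditionSketch>? _snapshotBuffer;

    public IReadOnlyDictionary<string, NativeConditionSketch> Live => _live;

    // Snapshot: buffer only; live state is untouched until the snapshot completes.
    public void OnSnapshot(NativeConditionSketch c) =>
        (_snapshotBuffer ??= new())[c.SourceReference] = c;

    // SnapshotComplete: swap the buffer in and return an Active=false copy
    // of every condition that was live before but is absent from the snapshot.
    public IReadOnlyList<NativeConditionSketch> OnSnapshotComplete()
    {
        var next = _snapshotBuffer ?? new Dictionary<string, NativeConditionSketch>();
        _snapshotBuffer = null;
        var cleared = new List<NativeConditionSketch>();
        foreach (var (key, old) in _live)
            if (!next.ContainsKey(key))
                cleared.Add(old with { Active = false });
        _live = next;
        return cleared;
    }

    // Live path: returns false when the transition is older than the stored one.
    public bool OnTransition(NativeConditionSketch c)
    {
        if (_live.TryGetValue(c.SourceReference, out var current)
            && c.TransitionTime < current.TransitionTime)
            return false;
        if (!c.Active && c.Acknowledged)
            _live.Remove(c.SourceReference);   // retention: inactive and acknowledged is dropped
        else
            _live[c.SourceReference] = c;
        return true;
    }
}
```

Swapping a whole dictionary means readers never see a half-applied snapshot, which is the point of buffering until `SnapshotComplete`.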
## Known-flaky baseline (NOT my regressions)
- 5 `StaleTagMonitor*Tests` in `ZB.MOM.WW.ScadaBridge.Commons.Tests` are timing-flaky under load. User approved treating as known-flaky; do not "fix". Watch only for NEW failures.
## Decisions / deviations made during execution (carry forward)
- **T2:** `AlarmStateChanged.Condition` is a computed-default property (getter falls back to `AlarmConditionStateFactory.ForComputed(State, Priority)`); additions are init-props (additive). `AlarmConditionStateFactory` lives in `Commons/Types/Alarms`.
- **T8:** `ResolvedNativeAlarmSource` has **no `IsLocked`** field (per plan). Inheritance lock is enforced via a **local `lockedNames` HashSet** inside `ResolveInheritedNativeAlarmSources`. Override-lock is NOT enforced at flatten (matches plan; UI/validation layer handles it).
- **T9:** `SemanticValidator.Validate` gained an **optional** 3rd param `IReadOnlySet<string>? alarmCapableConnectionNames = null`. Connection-existence check only runs when callers pass it; empty source-ref / empty connection-name always checked. `ValidationCategory.NativeAlarmSourceInvalid` added. (Wiring real callers to pass the connection set is not yet done — fine for now.)
- **T10:** `DataConnectionActor` routes alarm transitions by **source-ref prefix** (`transition.SourceObjectReference`/`SourceReference` StartsWith bound key), dedup per transition. One feed per source-ref, ref-counted. Internal records `AlarmTransitionReceived`, `AlarmSubscribeCompleted`. `NativeAlarmSourceUnavailable` pushed on entering Reconnecting; `ReSubscribeAllAlarms` on reconnect.
- **T11 (OPC UA):** `OpcUaAlarmMapper` is pure/tested. `RealOpcUaClient.CreateAlarmSubscriptionAsync` does event MonitoredItem + `EventFilter` (select clauses indexed 0–12) + `ConditionRefresh` via `CallAsync` (the sync `Call` is obsolete→error). **`AlarmConditionState` collides with `Opc.Ua.AlarmConditionState`** — fully-qualified as `Commons.Types.Alarms.AlarmConditionState` at the one `new` site. **Behavior unverified until Task 28 (live A&C server).**
- **T12 (MxGateway):** `MxGatewayAlarmMapper` is pure/tested. Gateway proto enums `AlarmConditionState`/`AlarmTransitionKind` collide with Commons enums → aliased (`ProtoConditionState`/`ProtoTransitionKind` for proto; explicit `using X = Commons…` for the Commons ones). `MxGatewayClient.StreamAlarmsAsync(StreamAlarmsRequest, ct) → IAsyncEnumerable<AlarmFeedMessage>` confirmed present in pkg v0.1.0. Adapter opens **one shared session-less feed** (gateway-wide, null prefix), ref-counted, first-callback drives it (the actor routes). `RealMxGatewayClient.RunAlarmStreamAsync` reconnects internally (5s) — does NOT use `RaiseDisconnected`. Reference: OtOpcUa `…Driver.Galaxy/Runtime/GatewayGalaxyAlarmFeed.cs`. **Behavior unverified until Task 28 (live gateway).**
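The T10 routing rule above (one ref-counted feed per source-ref, transitions matched by source-ref prefix) can be modeled with a plain dictionary. This is a hedged sketch only: the class and method names are hypothetical, subscribers are plain strings instead of actor references, and "dedup per transition" is interpreted here as each subscriber receiving a given transition at most once even when several bound keys match.

```csharp
using System;
using System.Collections.Generic;

public sealed class AlarmFeedRouterSketch
{
    // Bound source-ref key -> subscribers currently holding a reference to that feed.
    private readonly Dictionary<string, HashSet<string>> _subscribers = new(StringComparer.Ordinal);

    // Returns true for the first subscriber of a source-ref (caller should open the feed).
    public bool Subscribe(string sourceRef, string subscriber)
    {
        if (!_subscribers.TryGetValue(sourceRef, out var set))
        {
            _subscribers[sourceRef] = new HashSet<string>(StringComparer.Ordinal) { subscriber };
            return true;
        }
        set.Add(subscriber);
        return false;
    }

    // Returns true when the last subscriber left (caller should close the feed).
    public bool Unsubscribe(string sourceRef, string subscriber)
    {
        if (!_subscribers.TryGetValue(sourceRef, out var set) || !set.Remove(subscriber))
            return false;
        if (set.Count > 0)
            return false;
        _subscribers.Remove(sourceRef);
        return true;
    }

    // Every subscriber whose bound key is a prefix of the transition's source reference;
    // the set result delivers to each subscriber once per transition.
    public IReadOnlyCollection<string> Route(string transitionSourceRef)
    {
        var targets = new HashSet<string>(StringComparer.Ordinal);
        foreach (var (boundKey, set) in _subscribers)
            if (transitionSourceRef.StartsWith(boundKey, StringComparison.Ordinal))
                targets.UnionWith(set);
        return targets;
    }
}
```

For example, with subscriptions bound to `$Area_001` and to a hypothetical `$Area_001.Pump`, a transition from `$Area_001.Pump.HiHi` matches both keys, yet a subscriber registered under both still appears once in the routed set.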
## Execution cadence
- Per-task TDD: write test → confirm RED → implement → GREEN → commit. Update native task status + this `.tasks.json` each task; report at each batch boundary and wait for "start"/feedback.
- Batches so far: B1 = T1–4, B2 = T5–8, B3 = T9–12. Next proposed: B4 = T13–17.
- Native task IDs map plan Task N → native id (N+6) — but on resume the native list is rebuilt from `.tasks.json` (Step 0).
## Watch items for remaining tasks
- **T18 (proto):** `sitestream.proto` is **not auto-compiled**: the `<Protobuf>` include is commented out and the generated `.cs` is vendored in `SiteStreamGrpc/`. Manual macOS regen only (toggle include → `dotnet build` → copy generated files → re-comment). Do NOT auto-compile on Linux.
- **T28:** OPC UA A&C live smoke (SkippableFact) + confirm infra OPC UA server exposes A&C; manual deploy check via `bash docker/deploy.sh` / `docker-env2/deploy.sh`.