# Native Alarms — Execution Resume Notes
**Skill in progress:** `superpowers-extended-cc:executing-plans` on `docs/plans/2026-05-29-native-alarms.md`.
**To resume:** `/superpowers-extended-cc:executing-plans docs/plans/2026-05-29-native-alarms.md` (reads `…md.tasks.json`).
## Workspace
- **Worktree:** `/Users/dohertj2/Desktop/scadalink-design-native-alarms` (branch `feat/native-alarms`, off `main` @ `09e19db` which holds the design + plan).
- Do all work here; `main` checkout stays untouched. Build: `dotnet build ZB.MOM.WW.ScadaBridge.slnx`.
- The shared MS SQL container `scadabridge-mssql` is up (the ConfigDB MsSql migration-fixture tests use it).
## Progress: Tasks 1-21 done & committed; 22-28 pending
Commits (oldest→newest): `696da92` T1 … `20b41b8` T18, `0c6f9a9` T19, `b1df6d5` T20, `3bf1d26` T21. (Full list in git log.)
**Cadence is batches of 3 (user choice on resume).** Batch 4 = T13-15 ✅, Batch 5 = T16-18 ✅, Batch 6 = T19-21 ✅. Greens: SiteRuntime.Tests 313/313, Communication.Tests 200/200, ManagementService.Tests 111/111, Commons.Tests registry 8/8. Next: **Batch 7 = T22 (CLI) → T23 (DebugView UI) → T24 (Template editor UI)**. Then T25 (instance UI), T26 (seed), T27 (docs), T28 (live integration).
## Decisions / deviations: Batch 6 (T19-21)
- **T19:** `AlarmShelveStateCodec` (string↔enum, default Unshelved). Server `StreamRelayActor` maps `msg.Condition.*` + `Kind.ToString()` + nullable `OriginalRaiseTime` out; client `ConvertToDomainEvent` rebuilds `Condition` (severity = wire `Priority`) + `ParseAlarmKind` back. **Gotcha:** client imports `Google.Protobuf.WellKnownTypes`, so `Enum` is ambiguous — used `System.Enum.TryParse`. `confirmed` proto bool → domain `bool?` (false, never null after round-trip).
- **T20:** `ManagementCommandRegistry` is **reflection-based** (auto-discovers `*Command` records in the Management namespace) — no manual registry edit needed; just added the 7 records to TemplateCommands.cs / InstanceCommands.cs.
- **T21:** Handlers call `ITemplateEngineRepository` **directly** + `SaveChangesAsync` (per plan's "call the Task 6 repo methods"), NOT through TemplateService/InstanceService. **Trade-off:** this skips the service-layer `IAuditService` logging that the existing template/instance alarm CRUD gets. Acceptable for the read-only-mirror authoring commands and matches the plan's Files scope (ManagementService only), but flag if audit parity is wanted later. Roles: template mutations = Design, instance-override mutations = Deployment, lists = any authenticated. Update = fetch-then-mutate (preserves TemplateId for the unique index). Set-override = upsert (Add if absent else Update).
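
The T19 codec behaviour (string↔enum, defaulting to Unshelved) and the `Enum` ambiguity workaround can be sketched roughly as below. This is a minimal illustration, not the committed code: only `Unshelved` is named in these notes, so the other enum members, the class name `AlarmShelveStateCodecSketch`, and the case-insensitive parse are assumptions.

```csharp
using System;

// Hypothetical enum: only Unshelved is named in these notes; the other
// members are placeholders for illustration.
public enum AlarmShelveState { Unshelved, OneShotShelved, TimedShelved }

public static class AlarmShelveStateCodecSketch
{
    // Enum -> wire string.
    public static string Encode(AlarmShelveState state) => state.ToString();

    // Wire string -> enum. Null, empty, unknown, or out-of-range numeric
    // input falls back to Unshelved.
    public static AlarmShelveState Decode(string? wire) =>
        // System.Enum is written fully qualified so the call still resolves in
        // a file that also imports Google.Protobuf.WellKnownTypes, which
        // declares its own Enum type.
        System.Enum.TryParse<AlarmShelveState>(wire, ignoreCase: true, out var parsed)
            && System.Enum.IsDefined(typeof(AlarmShelveState), parsed)
            ? parsed
            : AlarmShelveState.Unshelved;
}
```

The `IsDefined` guard is there because `TryParse` accepts numeric strings such as `"99"` and would otherwise return an undefined enum value instead of the default.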
## Decisions / deviations: Batch 5 (T16-18)
- **T16:** Connection protocol IS in `FlattenedConfiguration.Connections[name].Protocol`; `ResolveNativeKind` maps protocol-contains-"Mx" → `NativeMxAccess`, else `NativeOpcUa`, and the result is passed into `NativeAlarmActor`. Added `_latestAlarmEvents` (enriched event per AlarmName) + extracted `BuildAlarmStatesSnapshot()` used by both `HandleSubscribeDebugView` and `HandleDebugSnapshot` (enriched events, with a Normal-projection fallback for computed alarms that haven't fired). Native actors skipped when `_dclManager == null` (isolated tests). **Beyond the plan's Files list (justified):** redeploy/undeploy clear: added `native_alarm_state` DELETE to the `SiteStorageService.RemoveDeployedConfigAsync` transaction (undeploy) + `ClearNativeAlarmsForInstanceAsync` next to `ClearStaticOverridesAsync` in the `DeploymentManagerActor` redeploy path. Native state survives failover (rehydrate) but resets on redeploy, mirroring static-override semantics.
- **T17:** Test-only (as the plan predicted) — `AlarmStateChanged.Condition` getter already defaults to `ForComputed(State, Priority)` from T2, so computed alarms carry the unified condition without code change. Added regression `AlarmActor_ComputedAlarm_CarriesUnifiedConditionState`.
- **T18:** Proto regen done via the documented macOS manual flow (uncomment `<Protobuf>` → delete vendored → build → copy `obj/Debug/net10.0/Protos/*.cs` → re-comment). csproj nets to no change. Only `Sitestream.cs` changed (service `SitestreamGrpc.cs` untouched; message-only change). `confirmed` is proto `bool` per plan (null→false fidelity loss accepted). New fields 8-21 on `AlarmStateUpdate`.
## Decisions / deviations made during execution: Batch 4 (T13-15)
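
The T16 protocol-to-kind mapping is simple enough to sketch. The version below is an assumption-laden illustration: the notes only say "protocol-contains-"Mx"", so the case-insensitive comparison, the null handling, the `Computed` enum member, and the class name `NativeKindResolverSketch` are all guesses rather than the committed behaviour.

```csharp
using System;

// Only the two Native* members appear in these notes; Computed is an
// assumed placeholder for non-native alarms.
public enum AlarmKind { Computed, NativeOpcUa, NativeMxAccess }

public static class NativeKindResolverSketch
{
    // Any protocol string containing "Mx" (e.g. "MxGateway") selects
    // NativeMxAccess; everything else, including null, selects NativeOpcUa.
    // Case-insensitive matching is an assumption of this sketch.
    public static AlarmKind ResolveNativeKind(string? protocol) =>
        protocol is not null
            && protocol.Contains("Mx", StringComparison.OrdinalIgnoreCase)
            ? AlarmKind.NativeMxAccess
            : AlarmKind.NativeOpcUa;
}
```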
- **T15:** `NativeAlarmActor` ctor has an optional trailing `AlarmKind nativeKind = AlarmKind.NativeOpcUa` (additive — keeps the 7-arg call working). T16 will pass `NativeMxAccess` when the connection protocol is MxGateway. Persistence is **fire-and-forget** (`ContinueWith` OnlyOnFaulted logs) — never blocks the actor. State keyed by `SourceReference`; `AlarmName` on the emitted `AlarmStateChanged` is set to the `SourceReference`. Snapshot path: `Snapshot` buffers, `SnapshotComplete` atomic-swaps (dropped → emit `Active=false`). Live path ignores older `TransitionTime`; retention drops a condition once `!Active && Acknowledged`. `NativeAlarmSourceUnavailable` = log + retain (no emit). Subscribe retry via `ScheduleTellOnceCancelable` at `NativeAlarmRetryIntervalMs`.
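
The T15 live-path rules (ignore older `TransitionTime`, drop a condition once `!Active && Acknowledged`) and the fire-and-forget persistence can be sketched outside the actor as plain C#. This is a simplified illustration under stated assumptions: `NativeTransition`, `NativeAlarmStateSketch`, and both method names are hypothetical, equal timestamps are treated as not stale, and the snapshot swap and retry scheduling are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical, simplified transition shape; the real message types are
// not shown in these notes.
public sealed record NativeTransition(
    string SourceReference, bool Active, bool Acknowledged, DateTimeOffset TransitionTime);

public sealed class NativeAlarmStateSketch
{
    // State is keyed by SourceReference, as in the notes.
    private readonly Dictionary<string, NativeTransition> _bySourceRef = new();

    public int Count => _bySourceRef.Count;

    // Returns true when the transition was applied (and would be emitted).
    public bool ApplyLive(NativeTransition t)
    {
        // Live path: a transition older than the one already held is ignored.
        if (_bySourceRef.TryGetValue(t.SourceReference, out var current)
            && t.TransitionTime < current.TransitionTime)
            return false;

        // Retention: an inactive, acknowledged condition is no longer tracked.
        if (!t.Active && t.Acknowledged)
            _bySourceRef.Remove(t.SourceReference);
        else
            _bySourceRef[t.SourceReference] = t;
        return true;
    }

    // Fire-and-forget persistence: the caller never awaits the write; a
    // faulted task is only logged. A delegate that throws synchronously,
    // before returning its Task, would still propagate to the caller.
    public static void PersistFireAndForget(Func<Task> persist, Action<Exception> logError)
    {
        _ = persist().ContinueWith(
            faulted => logError(faulted.Exception!.GetBaseException()),
            TaskContinuationOptions.OnlyOnFaulted);
    }
}
```

Keeping the rules in a plain class like this is one way to unit-test them without an actor system; whether the real `NativeAlarmActor` is factored that way is not stated in these notes.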
## Known-flaky baseline (NOT my regressions)
- 5 `StaleTagMonitor*Tests` in `ZB.MOM.WW.ScadaBridge.Commons.Tests` are timing-flaky under load. User approved treating as known-flaky; do not "fix". Watch only for NEW failures.
## Decisions / deviations made during execution (carry forward)
- **T2:** `AlarmStateChanged.Condition` is a computed-default property (getter falls back to `AlarmConditionStateFactory.ForComputed(State, Priority)`); additions are init-props (additive). `AlarmConditionStateFactory` lives in `Commons/Types/Alarms`.
- **T8:** `ResolvedNativeAlarmSource` has **no `IsLocked`** field (per plan). Inheritance lock is enforced via a **local `lockedNames` HashSet** inside `ResolveInheritedNativeAlarmSources`. Override-lock is NOT enforced at flatten (matches plan; UI/validation layer handles it).
- **T9:** `SemanticValidator.Validate` gained an **optional** 3rd param `IReadOnlySet<string>? alarmCapableConnectionNames = null`. Connection-existence check only runs when callers pass it; empty source-ref / empty connection-name always checked. `ValidationCategory.NativeAlarmSourceInvalid` added. (Wiring real callers to pass the connection set is not yet done — fine for now.)
- **T10:** `DataConnectionActor` routes alarm transitions by **source-ref prefix** (`transition.SourceObjectReference`/`SourceReference` StartsWith bound key), dedup per transition. One feed per source-ref, ref-counted. Internal records `AlarmTransitionReceived`, `AlarmSubscribeCompleted`. `NativeAlarmSourceUnavailable` pushed on entering Reconnecting; `ReSubscribeAllAlarms` on reconnect.
- **T11 (OPC UA):** `OpcUaAlarmMapper` is pure/tested. `RealOpcUaClient.CreateAlarmSubscriptionAsync` does event MonitoredItem + `EventFilter` (select clauses indexed 0-12) + `ConditionRefresh` via `CallAsync` (the sync `Call` is obsolete→error). **`AlarmConditionState` collides with `Opc.Ua.AlarmConditionState`**, so it is fully qualified as `Commons.Types.Alarms.AlarmConditionState` at the one `new` site. **Behavior unverified until Task 28 (live A&C server).**
- **T12 (MxGateway):** `MxGatewayAlarmMapper` is pure/tested. Gateway proto enums `AlarmConditionState`/`AlarmTransitionKind` collide with Commons enums → aliased (`ProtoConditionState`/`ProtoTransitionKind` for proto; explicit `using X = Commons…` for the Commons ones). `MxGatewayClient.StreamAlarmsAsync(StreamAlarmsRequest, ct) → IAsyncEnumerable<AlarmFeedMessage>` confirmed present in pkg v0.1.0. Adapter opens **one shared session-less feed** (gateway-wide, null prefix), ref-counted, first-callback drives it (the actor routes). `RealMxGatewayClient.RunAlarmStreamAsync` reconnects internally (5s) — does NOT use `RaiseDisconnected`. Reference: OtOpcUa `…Driver.Galaxy/Runtime/GatewayGalaxyAlarmFeed.cs`. **Behavior unverified until Task 28 (live gateway).**
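
The T2 "computed-default property" pattern, which T17 relies on, can be sketched as follows. Only the names `AlarmStateChanged`, `Condition`, `AlarmConditionStateFactory`, and `ForComputed(State, Priority)` come from these notes; the record shapes, the `AlarmState` enum, and the `Severity` field are assumptions made to keep the example self-contained.

```csharp
// Hypothetical shapes for illustration only.
public enum AlarmState { Normal, Active }

public sealed record AlarmConditionState(bool Active, int Severity);

public static class AlarmConditionStateFactory
{
    // Projects a computed alarm's (State, Priority) into the unified condition.
    public static AlarmConditionState ForComputed(AlarmState state, int priority) =>
        new(state == AlarmState.Active, priority);
}

public sealed record AlarmStateChanged(string AlarmName, AlarmState State, int Priority)
{
    private readonly AlarmConditionState? _condition;

    // Additive init-only property: native alarms set it explicitly, while
    // computed alarms never set it and get the ForComputed fallback.
    public AlarmConditionState Condition
    {
        get => _condition ?? AlarmConditionStateFactory.ForComputed(State, Priority);
        init => _condition = value;
    }
}
```

Because the fallback lives in the getter, existing call sites that construct `AlarmStateChanged` without a condition keep compiling and still expose a populated `Condition`.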
## Execution cadence
- Per-task TDD: write test → confirm RED → implement → GREEN → commit. Update native task status + this `.tasks.json` each task; report at each batch boundary and wait for "start"/feedback.
- Batches so far: B1 = T1-4, B2 = T5-8, B3 = T9-12. Next proposed: B4 = T13-17.
- Native task IDs map plan Task N → native id (N+6) — but on resume the native list is rebuilt from `.tasks.json` (Step 0).
## Watch items for remaining tasks
- **T18 (proto):** `sitestream.proto` is **not auto-compiled**: the `<Protobuf>` include is commented out and the generated `.cs` is vendored in `SiteStreamGrpc/`. Manual macOS regen only (toggle include → `dotnet build` → copy generated files → re-comment). Do NOT auto-compile on Linux.
- **T28:** OPC UA A&C live smoke (SkippableFact) + confirm infra OPC UA server exposes A&C; manual deploy check via `bash docker/deploy.sh` / `docker-env2/deploy.sh`.