48 lines
8.8 KiB
Markdown
48 lines
8.8 KiB
Markdown
# Native Alarms — Execution Resume Notes
|
||
|
||
**Skill in progress:** `superpowers-extended-cc:executing-plans` on `docs/plans/2026-05-29-native-alarms.md`.
|
||
**To resume:** `/superpowers-extended-cc:executing-plans docs/plans/2026-05-29-native-alarms.md` (reads `…md.tasks.json`).
|
||
|
||
## Workspace
|
||
- **Worktree:** `/Users/dohertj2/Desktop/scadalink-design-native-alarms` (branch `feat/native-alarms`, off `main` @ `09e19db` which holds the design + plan).
|
||
- Do all work here; `main` checkout stays untouched. Build: `dotnet build ZB.MOM.WW.ScadaBridge.slnx`.
|
||
- The shared MS SQL container `scadabridge-mssql` is up (the ConfigDB MsSql migration-fixture tests use it).
|
||
|
||
## Progress: Tasks 1–21 done & committed; 22–28 pending
|
||
Commits (oldest→newest): `696da92` T1 … `20b41b8` T18, `0c6f9a9` T19, `b1df6d5` T20, `3bf1d26` T21. (Full list in git log.)
|
||
|
||
**Cadence is batches of 3 (user choice on resume).** Batch 4 = T13–15 ✅, Batch 5 = T16–18 ✅, Batch 6 = T19–21 ✅. Greens: SiteRuntime.Tests 313/313, Communication.Tests 200/200, ManagementService.Tests 111/111, Commons.Tests registry 8/8. Next: **Batch 7 = T22 (CLI) → T23 (DebugView UI) → T24 (Template editor UI)**. Then 25 (instance UI), 26 (seed), 27 (docs), 28 (live integration).
|
||
|
||
## Decisions / deviations — Batch 6 (T19–21)
|
||
- **T19:** `AlarmShelveStateCodec` (string↔enum, default Unshelved). Server `StreamRelayActor` maps `msg.Condition.*` + `Kind.ToString()` + nullable `OriginalRaiseTime` out; client `ConvertToDomainEvent` rebuilds `Condition` (severity = wire `Priority`) + `ParseAlarmKind` back. **Gotcha:** client imports `Google.Protobuf.WellKnownTypes`, so `Enum` is ambiguous — used `System.Enum.TryParse`. `confirmed` proto bool → domain `bool?` (false, never null after round-trip).
|
||
- **T20:** `ManagementCommandRegistry` is **reflection-based** (auto-discovers `*Command` records in the Management namespace) — no manual registry edit needed; just added the 7 records to TemplateCommands.cs / InstanceCommands.cs.
|
||
- **T21:** Handlers call `ITemplateEngineRepository` **directly** + `SaveChangesAsync` (per plan's "call the Task 6 repo methods"), NOT through TemplateService/InstanceService. **Trade-off:** this skips the service-layer `IAuditService` logging that the existing template/instance alarm CRUD gets. Acceptable for the read-only-mirror authoring commands and matches the plan's Files scope (ManagementService only), but flag if audit parity is wanted later. Roles: template mutations = Design, instance-override mutations = Deployment, lists = any authenticated. Update = fetch-then-mutate (preserves TemplateId for the unique index). Set-override = upsert (Add if absent else Update).
|
||
|
||
## Decisions / deviations — Batch 5 (T16–18)
|
||
- **T16:** Connection protocol IS in `FlattenedConfiguration.Connections[name].Protocol` → `ResolveNativeKind` maps protocol-contains-"Mx" → `NativeMxAccess` else `NativeOpcUa`; passed into NativeAlarmActor. Added `_latestAlarmEvents` (enriched event per AlarmName) + extracted `BuildAlarmStatesSnapshot()` used by both `HandleSubscribeDebugView` and `HandleDebugSnapshot` (enriched events ∪ Normal-projection fallback for computed alarms that haven't fired). Native actors skipped when `_dclManager == null` (isolated tests). **Beyond the plan's Files list (justified):** redeploy/undeploy clear — added `native_alarm_state` DELETE to `SiteStorageService.RemoveDeployedConfigAsync` transaction (undeploy) + `ClearNativeAlarmsForInstanceAsync` next to `ClearStaticOverridesAsync` in `DeploymentManagerActor` redeploy path. Native state survives failover (rehydrate) but resets on redeploy — mirrors static-override semantics.
|
||
- **T17:** Test-only (as the plan predicted) — `AlarmStateChanged.Condition` getter already defaults to `ForComputed(State, Priority)` from T2, so computed alarms carry the unified condition without code change. Added regression `AlarmActor_ComputedAlarm_CarriesUnifiedConditionState`.
|
||
- **T18:** Proto regen done via the documented macOS manual flow (uncomment `<Protobuf>` → delete vendored → build → copy `obj/Debug/net10.0/Protos/*.cs` → re-comment). csproj nets to no change. Only `Sitestream.cs` changed (service `SitestreamGrpc.cs` untouched — message-only change). `confirmed` is proto `bool` per plan (null→false fidelity loss accepted). New fields 8–21 on `AlarmStateUpdate`.
|
||
|
||
## Decisions / deviations made during execution — Batch 4 (T13–15)
|
||
- **T15:** `NativeAlarmActor` ctor has an optional trailing `AlarmKind nativeKind = AlarmKind.NativeOpcUa` (additive — keeps the 7-arg call working). T16 will pass `NativeMxAccess` when the connection protocol is MxGateway. Persistence is **fire-and-forget** (`ContinueWith` OnlyOnFaulted logs) — never blocks the actor. State keyed by `SourceReference`; `AlarmName` on the emitted `AlarmStateChanged` is set to the `SourceReference`. Snapshot path: `Snapshot` buffers, `SnapshotComplete` atomic-swaps (dropped → emit `Active=false`). Live path ignores older `TransitionTime`; retention drops a condition once `!Active && Acknowledged`. `NativeAlarmSourceUnavailable` = log + retain (no emit). Subscribe retry via `ScheduleTellOnceCancelable` at `NativeAlarmRetryIntervalMs`.
|
||
|
||
## Known-flaky baseline (NOT my regressions)
|
||
- 5 `StaleTagMonitor*Tests` in `ZB.MOM.WW.ScadaBridge.Commons.Tests` are timing-flaky under load. User approved treating as known-flaky; do not "fix". Watch only for NEW failures.
|
||
|
||
## Decisions / deviations made during execution (carry forward)
|
||
- **T2:** `AlarmStateChanged.Condition` is a computed-default property (getter falls back to `AlarmConditionStateFactory.ForComputed(State, Priority)`); additions are init-props (additive). `AlarmConditionStateFactory` lives in `Commons/Types/Alarms`.
|
||
- **T8:** `ResolvedNativeAlarmSource` has **no `IsLocked`** field (per plan). Inheritance lock is enforced via a **local `lockedNames` HashSet** inside `ResolveInheritedNativeAlarmSources`. Override-lock is NOT enforced at flatten (matches plan; UI/validation layer handles it).
|
||
- **T9:** `SemanticValidator.Validate` gained an **optional** 3rd param `IReadOnlySet<string>? alarmCapableConnectionNames = null`. Connection-existence check only runs when callers pass it; empty source-ref / empty connection-name always checked. `ValidationCategory.NativeAlarmSourceInvalid` added. (Wiring real callers to pass the connection set is not yet done — fine for now.)
|
||
- **T10:** `DataConnectionActor` routes alarm transitions by **source-ref prefix** (`transition.SourceObjectReference`/`SourceReference` StartsWith bound key), dedup per transition. One feed per source-ref, ref-counted. Internal records `AlarmTransitionReceived`, `AlarmSubscribeCompleted`. `NativeAlarmSourceUnavailable` pushed on entering Reconnecting; `ReSubscribeAllAlarms` on reconnect.
|
||
- **T11 (OPC UA):** `OpcUaAlarmMapper` is pure/tested. `RealOpcUaClient.CreateAlarmSubscriptionAsync` does event MonitoredItem + `EventFilter` (select clauses indexed 0–12) + `ConditionRefresh` via `CallAsync` (the sync `Call` is obsolete→error). **`AlarmConditionState` collides with `Opc.Ua.AlarmConditionState`** — fully-qualified as `Commons.Types.Alarms.AlarmConditionState` at the one `new` site. **Behavior unverified until Task 28 (live A&C server).**
|
||
- **T12 (MxGateway):** `MxGatewayAlarmMapper` is pure/tested. Gateway proto enums `AlarmConditionState`/`AlarmTransitionKind` collide with Commons enums → aliased (`ProtoConditionState`/`ProtoTransitionKind` for proto; explicit `using X = Commons…` for the Commons ones). `MxGatewayClient.StreamAlarmsAsync(StreamAlarmsRequest, ct) → IAsyncEnumerable<AlarmFeedMessage>` confirmed present in pkg v0.1.0. Adapter opens **one shared session-less feed** (gateway-wide, null prefix), ref-counted, first-callback drives it (the actor routes). `RealMxGatewayClient.RunAlarmStreamAsync` reconnects internally (5s) — does NOT use `RaiseDisconnected`. Reference: OtOpcUa `…Driver.Galaxy/Runtime/GatewayGalaxyAlarmFeed.cs`. **Behavior unverified until Task 28 (live gateway).**
|
||
|
||
## Execution cadence
|
||
- Per-task TDD: write test → confirm RED → implement → GREEN → commit. Update native task status + this `.tasks.json` each task; report at each batch boundary and wait for "start"/feedback.
|
||
- Batches so far: B1 = T1–4, B2 = T5–8, B3 = T9–12. Next proposed: B4 = T13–17.
|
||
- Native task IDs map plan Task N → native id (N+6) — but on resume the native list is rebuilt from `.tasks.json` (Step 0).
|
||
|
||
## Watch items for remaining tasks
|
||
- **T18 (proto):** `sitestream.proto` is **not auto-compiled** — `<Protobuf>` include is commented out, generated `.cs` vendored in `SiteStreamGrpc/`. Manual macOS regen only (toggle include → `dotnet build` → copy generated files → re-comment). Do NOT auto-compile on Linux.
|
||
- **T28:** OPC UA A&C live smoke (SkippableFact) + confirm infra OPC UA server exposes A&C; manual deploy check via `bash docker/deploy.sh` / `docker-env2/deploy.sh`.
|