docs: close native-alarm spec gaps surfaced by docs audit
The native alarms feature merged with 7 component docs updated, but the spec layer drifted: HighLevelReqs, Commons, and ManagementService had no native-alarm coverage and the README table flagged it on only one row. Add HighLevelReqs §3.4.2 (+ validation), document the Commons types/entities/messages and the 7 ManagementService commands, sync the README rows + link the TreeView sub-component, fix 2 broken plan links, and drop the one-off native-alarms RESUME scratchpad.
@@ -760,7 +760,7 @@ Mirrors `docker/README.md`'s structure but documents the env2 specifics. Reuses
A second Docker deployment of a minimal ScadaBridge cluster topology, designed to run **concurrently with** the primary `docker/` stack so the Transport (#24) feature can be exercised end-to-end across two real environments.
-See [`docs/plans/2026-05-24-second-environment-design.md`](../docs/plans/2026-05-24-second-environment-design.md) for the design rationale.
+See [`docs/plans/2026-05-24-second-environment-design.md`](2026-05-24-second-environment-design.md) for the design rationale.
## Cluster Topology
@@ -886,7 +886,7 @@ Same as primary (env2 shares LDAP). See `infra/glauth/config.toml` and primary `
## Transport Testing Workflow
-See [`docs/plans/2026-05-24-second-environment-verification.md`](../docs/plans/2026-05-24-second-environment-verification.md) for the manual golden-path checklist.
+See [`docs/plans/2026-05-24-second-environment-verification.md`](2026-05-24-second-environment-verification.md) for the manual golden-path checklist.
## What's Different from Primary
@@ -1,72 +0,0 @@
# Native Alarms — Execution Resume Notes
**Skill in progress:** `superpowers-extended-cc:executing-plans` on `docs/plans/2026-05-29-native-alarms.md`.
**To resume:** `/superpowers-extended-cc:executing-plans docs/plans/2026-05-29-native-alarms.md` (reads `…md.tasks.json`).
## Workspace
- **Worktree:** `/Users/dohertj2/Desktop/scadalink-design-native-alarms` (branch `feat/native-alarms`, off `main` @ `09e19db` which holds the design + plan).
- Do all work here; `main` checkout stays untouched. Build: `dotnet build ZB.MOM.WW.ScadaBridge.slnx`.
- The shared MS SQL container `scadabridge-mssql` is up (the ConfigDB MsSql migration-fixture tests use it).
## Progress: ALL 28 TASKS DONE & COMMITTED ✅
## T28 (live integration) — results & a real bug it caught
- **OPC UA A&C live smoke** `[SkippableFact]` **PASSED** against the infra OPC PLC server (real ConditionRefresh snapshot round-trip) — opc-plc DOES expose A&C; noted in test_infra.md.
- **Full build + `dotnet test ZB.MOM.WW.ScadaBridge.slnx`: zero native-alarm regressions.** Every project I touched passes in isolation (SiteRuntime 314, CentralUI 584, DCL 104, ManagementService 111, Communication 200, TemplateEngine 332, ConfigDB 248). Pre-existing/known-flaky failures only: Commons `StaleTagMonitor*` (5, timing), 1 SiteRuntime parallelism flake (314/314 isolated), AuditLog (2) + IntegrationTests (2, `near "IF"` SQLite) — the latter **confirmed failing identically on `main`**.
- **env2 cluster rebuilt** with the new image; **backend verified live** (`template native-alarm-source add`/`list` round-trip through the rebuilt ManagementActor; instance deployed).
- **UI verified live via Claude-for-Chrome** at env2 Traefik LB `:9100`: template editor **Native Alarms tab** lists `GalaxyAlarms` + Add-modal connection dropdown filtered to alarm-capable (MxGateway) conns; instance **Native Alarm Source Overrides** card; Debug View **enriched alarm table (Kind + Sev columns)** + connects live ("Streaming MxAlarmDemo-1").
- **RUNTIME BUG found live + fixed (`add7210`):** the NativeAlarmActor sends `SubscribeAlarmsRequest` to the DCL **manager**, but `DataConnectionManagerActor` only routed tag/write/browse — alarm subscribe/unsubscribe were **unhandled → dead-lettered**, so native alarms never subscribed. Unit tests missed it (T15 used a probe, T10 tested the connection actor directly). Added `HandleRouteAlarms` forwarding (mirrors `HandleRoute`) + regression test. **Re-verified live:** after rebuild, the logs show `[ScadaBridge Site X] Alarm feed subscribed for source $Area_001` → `Native alarm subscription established for GalaxyAlarms` — full chain InstanceActor→NativeAlarmActor→DCL manager→DataConnectionActor→**live MxAccess Gateway** working. (No transitions flowed only because `$Area_001` had no active alarms at the time.)
- **T26 seed bug found live + fixed (`f4ae44a`):** `instance deploy` used `--instance-id`; corrected to `--id`.
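
The routing gap behind the `add7210` fix can be sketched in a few lines. This is a language-neutral illustration in Python, not the Akka.NET code: `ConnectionManager`, `ConnectionActor`, `WriteTagRequest`, and the `route_alarms` switch are hypothetical names, and only `SubscribeAlarmsRequest` comes from the notes above. It shows why a manager that forwards only some message types silently drops the rest, and why a probe-based unit test would not notice.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionActor:
    """Stand-in for a per-connection actor; records what it receives."""
    name: str
    received: list = field(default_factory=list)

    def tell(self, message):
        self.received.append(message)


@dataclass(frozen=True)
class WriteTagRequest:  # hypothetical example of an already-routed message
    connection_name: str
    tag: str
    value: object


@dataclass(frozen=True)
class SubscribeAlarmsRequest:
    connection_name: str
    source_reference: str


class ConnectionManager:
    """Forwards known message types to the named connection.

    Anything it has no route for ends up in dead_letters, which is the
    failure mode described in the notes.
    """

    def __init__(self, connections, route_alarms):
        self.connections = connections
        self.dead_letters = []
        self.routed_types = {WriteTagRequest}
        if route_alarms:  # the fix: register the alarm message as routable
            self.routed_types.add(SubscribeAlarmsRequest)

    def tell(self, message):
        target = self.connections.get(getattr(message, "connection_name", None))
        if type(message) not in self.routed_types or target is None:
            self.dead_letters.append(message)
            return
        target.tell(message)
```

A test that sends the request straight to `ConnectionActor` passes either way; only a test that goes through the manager exercises the missing route.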
## (historical) Progress: Tasks 1–27 done & committed; only T28 pending
Commits (oldest→newest): `696da92` T1 … `3bf1d26` T21, `a6dcbf6` T22, `1f6c420` T23, T24, `046797e` T25, `2b7c765` T26, `003e54c` T27. (Full list in git log.)
**Cadence is batches of 3 (user choice on resume).** Batches 4–8 (T13–27) all ✅. Greens: SiteRuntime 313/313, Communication 200/200, ManagementService 111/111, CentralUI 581/581, Commons registry 8/8, CLI CommandTree 21/21. **Next & final: T28 (integration / live verification).**
## Decisions / deviations — Batch 8 (T25–27)
- **T25:** InstanceConfigure "Native Alarm Source Overrides" card — inline per-row override (connection dropdown / source-ref / filter; blank=inherited), repo-direct upsert/clear + SaveChangesAsync (same rationale as T24). **Structural test** (source assertions) — InstanceConfigure is heavy (7 services incl. InstanceService + IFlatteningPipeline); CRUD behavior covered by T21. Lists the *template's direct* native sources (mirrors how the Alarm Overrides card uses `GetAlarmsByTemplateIdAsync`, not full flatten).
- **T26:** seed-sites.sh adds a `MxAlarmDemo` template + `GalaxyAlarms` native source (connection "ScadaBridge Site X", source-ref `$Area_001`) + deployed `MxAlarmDemo-1` instance. **bash -n valid.** Live-verified the CLI→cluster path: `template create` works against the running env2 (:9100); `AddTemplateNativeAlarmSource` returns "Unknown command" because the **running env2 image is pre-T21** — proves the seed/commands are correct but full live run needs a cluster rebuild (`docker-env2/deploy.sh`), which is **T28's** job. Cleaned up the test template afterward (env2 templates back to []).
- **T27:** CLAUDE.md (Native Alarms bullet under Data & Communication) + README (Template Engine row) inline; 7 component docs (DCL/SiteRuntime/TemplateEngine/CentralUI/CLI/Communication/ConfigurationDatabase) via 7 parallel `documenter` subagents. All landed (+265 lines).
## T28 plan (final)
- Add `OpcUaAlarmLiveSmokeTests` SkippableFact (skips when no alarm-capable OPC UA endpoint) mirroring the existing live OPC UA smoke pattern; note infra OPC UA A&C support in `docs/test_infra/test_infra.md`.
- Full `dotnet build` + `dotnet test ZB.MOM.WW.ScadaBridge.slnx`. Watch known-flaky: 5 StaleTagMonitor*Tests (Commons.Tests) + `CliConfigTests.Load_MalformedConfigFile` (parallelism race) — NOT regressions.
- Rebuild cluster: `bash docker-env2/deploy.sh` (or docker/deploy.sh), then re-run `docker-env2/seed-sites.sh` and confirm native alarms appear in the Central UI Debug View with severity + condition badges.
## Decisions / deviations — Batch 7 (T22–24)
- **T22:** CLI `template native-alarm-source add/list/remove` + `instance native-alarm-source set/clear` + README. **Known-flaky:** `CliConfigTests.Load_MalformedConfigFile…` fails ~deterministically in the FULL CLI suite (passes in isolation) due to process-global `HOME`/`Console.SetError` racing under xUnit parallelism — **confirmed pre-existing** (baseline fails 3/3 full runs; my added tests merely perturbed scheduling). Not my regression; treat as known-flaky like the StaleTagMonitor set.
- **T23:** DebugView alarm table enriched — added Kind + Sev columns, folded SourceReference into the Alarm cell (monospace subtitle), condition badges (Unacked/Shelved/Suppressed) into the State cell, Type/Category/operator/raise-time/value into the row tooltip; `FilteredAlarmStates` also matches SourceReference. bUnit test sets `_connected`/`_snapshot`/`_alarmStates` via reflection then asserts markup.
- **T24:** Template editor got a new **"Native Alarms" tab** (not a sub-panel) — fits the existing tabbed editor. **Repo-direct, NOT the plan's "management HTTP client":** TemplateEdit is Blazor Server in-process and calls `ITemplateEngineRepository` + `SaveChangesAsync` directly (mirrors how `_alarms` load/save works). Connection dropdown loads alarm-capable (OpcUa/MxGateway) connections via `ICentralUiRepository.GetAllDataConnectionsAsync` deduped by name. **Test is structural** (source-text assertions like `TestRunWarningTests`), NOT a full interactive bUnit render — TemplateEdit has ~10 injected services with their own graphs (ScriptAnalysisService/IMemoryCache/ISharedScriptCatalog/BuildParentContextsAsync/GetInstancesFilteredAsync…); the codebase has no precedent for fully rendering it, and the CRUD path is already covered behaviorally by the T21 ManagementActor tests. **T25 (InstanceConfigure) may be lighter to full-render — check first.**
## Decisions / deviations — Batch 6 (T19–21)
- **T19:** `AlarmShelveStateCodec` (string↔enum, default Unshelved). Server `StreamRelayActor` maps `msg.Condition.*` + `Kind.ToString()` + nullable `OriginalRaiseTime` out; client `ConvertToDomainEvent` rebuilds `Condition` (severity = wire `Priority`) + `ParseAlarmKind` back. **Gotcha:** client imports `Google.Protobuf.WellKnownTypes`, so `Enum` is ambiguous — used `System.Enum.TryParse`. `confirmed` proto bool → domain `bool?` (false, never null after round-trip).
- **T20:** `ManagementCommandRegistry` is **reflection-based** (auto-discovers `*Command` records in the Management namespace) — no manual registry edit needed; just added the 7 records to TemplateCommands.cs / InstanceCommands.cs.
- **T21:** Handlers call `ITemplateEngineRepository` **directly** + `SaveChangesAsync` (per plan's "call the Task 6 repo methods"), NOT through TemplateService/InstanceService. **Trade-off:** this skips the service-layer `IAuditService` logging that the existing template/instance alarm CRUD gets. Acceptable for the read-only-mirror authoring commands and matches the plan's Files scope (ManagementService only), but flag if audit parity is wanted later. Roles: template mutations = Design, instance-override mutations = Deployment, lists = any authenticated. Update = fetch-then-mutate (preserves TemplateId for the unique index). Set-override = upsert (Add if absent else Update).
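
A minimal sketch of the T19 codec idea (string↔enum with an `Unshelved` default), written in Python rather than the project's C#. Only `Unshelved` is named in the notes; `TimedShelved` and `OneShotShelved` are assumed members borrowed from the OPC UA shelving model, and the function names are hypothetical.

```python
from enum import Enum


class AlarmShelveState(Enum):
    UNSHELVED = "Unshelved"
    TIMED_SHELVED = "TimedShelved"       # assumed member, not confirmed by the notes
    ONE_SHOT_SHELVED = "OneShotShelved"  # assumed member, not confirmed by the notes


def encode_shelve_state(state: AlarmShelveState) -> str:
    """Enum to wire string."""
    return state.value


def decode_shelve_state(text) -> AlarmShelveState:
    """Wire string to enum; missing or unknown input falls back to UNSHELVED."""
    try:
        return AlarmShelveState(text)
    except ValueError:
        return AlarmShelveState.UNSHELVED
```

Falling back to a default keeps an older client readable when the server starts sending a value it does not know, at the cost of hiding the unknown value.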
## Decisions / deviations — Batch 5 (T16–18)
- **T16:** Connection protocol IS in `FlattenedConfiguration.Connections[name].Protocol` → `ResolveNativeKind` maps protocol-contains-"Mx" → `NativeMxAccess` else `NativeOpcUa`; passed into NativeAlarmActor. Added `_latestAlarmEvents` (enriched event per AlarmName) + extracted `BuildAlarmStatesSnapshot()` used by both `HandleSubscribeDebugView` and `HandleDebugSnapshot` (enriched events ∪ Normal-projection fallback for computed alarms that haven't fired). Native actors skipped when `_dclManager == null` (isolated tests). **Beyond the plan's Files list (justified):** redeploy/undeploy clear — added `native_alarm_state` DELETE to `SiteStorageService.RemoveDeployedConfigAsync` transaction (undeploy) + `ClearNativeAlarmsForInstanceAsync` next to `ClearStaticOverridesAsync` in `DeploymentManagerActor` redeploy path. Native state survives failover (rehydrate) but resets on redeploy — mirrors static-override semantics.
- **T17:** Test-only (as the plan predicted) — `AlarmStateChanged.Condition` getter already defaults to `ForComputed(State, Priority)` from T2, so computed alarms carry the unified condition without code change. Added regression `AlarmActor_ComputedAlarm_CarriesUnifiedConditionState`.
- **T18:** Proto regen done via the documented macOS manual flow (uncomment `<Protobuf>` → delete vendored → build → copy `obj/Debug/net10.0/Protos/*.cs` → re-comment). csproj nets to no change. Only `Sitestream.cs` changed (service `SitestreamGrpc.cs` untouched — message-only change). `confirmed` is proto `bool` per plan (null→false fidelity loss accepted). New fields 8–21 on `AlarmStateUpdate`.
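
The T16 kind resolution (protocol contains "Mx" maps to `NativeMxAccess`, anything else to `NativeOpcUa`) is small enough to sketch directly. This Python version is an illustration only: the function name is a snake_case stand-in for `ResolveNativeKind`, and treating the match as case-sensitive is an assumption.

```python
from enum import Enum


class AlarmKind(Enum):
    NATIVE_OPC_UA = "NativeOpcUa"
    NATIVE_MX_ACCESS = "NativeMxAccess"


def resolve_native_kind(protocol: str) -> AlarmKind:
    """Map a connection protocol name to the native alarm kind.

    Assumption: a case-sensitive substring match on "Mx".
    """
    if "Mx" in protocol:
        return AlarmKind.NATIVE_MX_ACCESS
    return AlarmKind.NATIVE_OPC_UA
```

Because OPC UA is the fall-through branch, any future protocol without "Mx" in its name would be classified as `NativeOpcUa` until the mapping is extended.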
## Decisions / deviations made during execution — Batch 4 (T13–15)
- **T15:** `NativeAlarmActor` ctor has an optional trailing `AlarmKind nativeKind = AlarmKind.NativeOpcUa` (additive — keeps the 7-arg call working). T16 will pass `NativeMxAccess` when the connection protocol is MxGateway. Persistence is **fire-and-forget** (`ContinueWith` OnlyOnFaulted logs) — never blocks the actor. State keyed by `SourceReference`; `AlarmName` on the emitted `AlarmStateChanged` is set to the `SourceReference`. Snapshot path: `Snapshot` buffers, `SnapshotComplete` atomic-swaps (dropped → emit `Active=false`). Live path ignores older `TransitionTime`; retention drops a condition once `!Active && Acknowledged`. `NativeAlarmSourceUnavailable` = log + retain (no emit). Subscribe retry via `ScheduleTellOnceCancelable` at `NativeAlarmRetryIntervalMs`.
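
The T15 state rules (buffer the snapshot, swap on `SnapshotComplete`, ignore older live transitions, drop a condition once it is inactive and acknowledged) can be sketched as a plain state holder. This is a Python illustration under assumptions, not the actor itself: `Condition`, `NativeAlarmState`, and the `emitted` list are hypothetical, persistence and retry are omitted, and whether the real actor also emits the fresh snapshot entries is not stated in the notes, so this sketch emits only the dropped ones.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Condition:
    source_reference: str
    active: bool
    acknowledged: bool
    transition_time: float


class NativeAlarmState:
    def __init__(self):
        self.conditions = {}        # keyed by source_reference
        self._snapshot_buffer = {}
        self.emitted = []           # stand-in for published AlarmStateChanged events

    def on_snapshot(self, condition):
        # Snapshot entries are buffered and not applied until the snapshot completes.
        self._snapshot_buffer[condition.source_reference] = condition

    def on_snapshot_complete(self):
        # Swap in the buffered snapshot; anything that vanished is reported inactive.
        fresh, self._snapshot_buffer = self._snapshot_buffer, {}
        dropped = [c for ref, c in self.conditions.items() if ref not in fresh]
        self.conditions = fresh
        for condition in dropped:
            self.emitted.append(replace(condition, active=False))

    def on_transition(self, condition):
        current = self.conditions.get(condition.source_reference)
        if current is not None and condition.transition_time < current.transition_time:
            return  # stale: older than what is already held
        if not condition.active and condition.acknowledged:
            # Retention rule: nothing left to track once cleared and acknowledged.
            self.conditions.pop(condition.source_reference, None)
        else:
            self.conditions[condition.source_reference] = condition
        self.emitted.append(condition)
```

Buffering the snapshot means subscribers never observe a half-applied refresh; the trade-off is that live transitions arriving mid-snapshot need an ordering rule, which the `TransitionTime` comparison provides.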
## Known-flaky baseline (NOT my regressions)
- 5 `StaleTagMonitor*Tests` in `ZB.MOM.WW.ScadaBridge.Commons.Tests` are timing-flaky under load. User approved treating as known-flaky; do not "fix". Watch only for NEW failures.
## Decisions / deviations made during execution (carry forward)
- **T2:** `AlarmStateChanged.Condition` is a computed-default property (getter falls back to `AlarmConditionStateFactory.ForComputed(State, Priority)`); additions are init-props (additive). `AlarmConditionStateFactory` lives in `Commons/Types/Alarms`.
- **T8:** `ResolvedNativeAlarmSource` has **no `IsLocked`** field (per plan). Inheritance lock is enforced via a **local `lockedNames` HashSet** inside `ResolveInheritedNativeAlarmSources`. Override-lock is NOT enforced at flatten (matches plan; UI/validation layer handles it).
- **T9:** `SemanticValidator.Validate` gained an **optional** 3rd param `IReadOnlySet<string>? alarmCapableConnectionNames = null`. Connection-existence check only runs when callers pass it; empty source-ref / empty connection-name always checked. `ValidationCategory.NativeAlarmSourceInvalid` added. (Wiring real callers to pass the connection set is not yet done — fine for now.)
- **T10:** `DataConnectionActor` routes alarm transitions by **source-ref prefix** (`transition.SourceObjectReference`/`SourceReference` StartsWith bound key), dedup per transition. One feed per source-ref, ref-counted. Internal records `AlarmTransitionReceived`, `AlarmSubscribeCompleted`. `NativeAlarmSourceUnavailable` pushed on entering Reconnecting; `ReSubscribeAllAlarms` on reconnect.
- **T11 (OPC UA):** `OpcUaAlarmMapper` is pure/tested. `RealOpcUaClient.CreateAlarmSubscriptionAsync` does event MonitoredItem + `EventFilter` (select clauses indexed 0–12) + `ConditionRefresh` via `CallAsync` (the sync `Call` is obsolete→error). **`AlarmConditionState` collides with `Opc.Ua.AlarmConditionState`** — fully-qualified as `Commons.Types.Alarms.AlarmConditionState` at the one `new` site. **Behavior unverified until Task 28 (live A&C server).**
- **T12 (MxGateway):** `MxGatewayAlarmMapper` is pure/tested. Gateway proto enums `AlarmConditionState`/`AlarmTransitionKind` collide with Commons enums → aliased (`ProtoConditionState`/`ProtoTransitionKind` for proto; explicit `using X = Commons…` for the Commons ones). `MxGatewayClient.StreamAlarmsAsync(StreamAlarmsRequest, ct) → IAsyncEnumerable<AlarmFeedMessage>` confirmed present in pkg v0.1.0. Adapter opens **one shared session-less feed** (gateway-wide, null prefix), ref-counted, first-callback drives it (the actor routes). `RealMxGatewayClient.RunAlarmStreamAsync` reconnects internally (5s) — does NOT use `RaiseDisconnected`. Reference: OtOpcUa `…Driver.Galaxy/Runtime/GatewayGalaxyAlarmFeed.cs`. **Behavior unverified until Task 28 (live gateway).**
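
The T10 routing rule (one ref-counted feed per source-ref, transitions delivered by source-ref prefix, deduplicated per transition) might look roughly like this. The sketch is Python and purely illustrative: `AlarmFeedRouter`, `open_feeds`, and the callback-style subscribers are hypothetical, and the real actor's reconnect handling is left out.

```python
class AlarmFeedRouter:
    def __init__(self):
        self._subscribers = {}   # source_ref -> list of subscriber callbacks
        self.open_feeds = set()  # source_refs with an open upstream feed

    def subscribe(self, source_ref, subscriber):
        subscribers = self._subscribers.setdefault(source_ref, [])
        if not subscribers:
            self.open_feeds.add(source_ref)  # first subscriber opens the feed
        subscribers.append(subscriber)

    def unsubscribe(self, source_ref, subscriber):
        subscribers = self._subscribers.get(source_ref, [])
        if subscriber in subscribers:
            subscribers.remove(subscriber)
        if not subscribers:
            # Last subscriber gone: release the feed.
            self._subscribers.pop(source_ref, None)
            self.open_feeds.discard(source_ref)

    def route(self, transition_source_ref, transition):
        """Deliver to every subscriber whose bound key prefixes the transition's
        source reference, at most once per subscriber. Returns the delivery count."""
        delivered = set()
        for key, subscribers in self._subscribers.items():
            if not transition_source_ref.startswith(key):
                continue
            for subscriber in subscribers:
                if id(subscriber) not in delivered:
                    delivered.add(id(subscriber))
                    subscriber(transition)
        return len(delivered)
```

Prefix matching lets a subscription on an area reference receive alarms from objects beneath it; the per-transition dedup matters when one subscriber is bound to two overlapping prefixes.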
## Execution cadence
- Per-task TDD: write test → confirm RED → implement → GREEN → commit. Update native task status + this `.tasks.json` each task; report at each batch boundary and wait for "start"/feedback.
- Batches so far: B1 = T1–4, B2 = T5–8, B3 = T9–12. Next proposed: B4 = T13–17.
- Native task IDs map plan Task N → native id (N+6) — but on resume the native list is rebuilt from `.tasks.json` (Step 0).
## Watch items for remaining tasks
- **T18 (proto):** `sitestream.proto` is **not auto-compiled** — `<Protobuf>` include is commented out, generated `.cs` vendored in `SiteStreamGrpc/`. Manual macOS regen only (toggle include → `dotnet build` → copy generated files → re-comment). Do NOT auto-compile on Linux.
- **T28:** OPC UA A&C live smoke (SkippableFact) + confirm infra OPC UA server exposes A&C; manual deploy check via `bash docker/deploy.sh` / `docker-env2/deploy.sh`.