Files
lmxopcua/docs/plans/2026-06-11-alarm-followups-round2.md
T

276 lines
21 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Alarm Follow-ups (Round 2) Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.
**Goal:** Activate exactly-once alarm historization by feeding `HistorianAdapterActor` from the `alerts` topic (with a config-gated durable sink), and verify/document the Galaxy alarm-feed reconnect behaviour.
**Architecture:** `HistorianAdapterActor` subscribes to the cluster `alerts` DPS topic, translates each `AlarmTransitionEvent``AlarmHistorianEvent`, and runs it through the **existing T2 Primary gate** (DPS fans the Primary's single publish to both nodes' historians, so the gate keeps writes exactly-once). `AlarmTransitionEvent` is extended with the two fields the historian record needs (`AlarmTypeName`, `Comment`), including a small `ScriptedAlarmEngine` change to carry `Comment` through the emission. A config-gated `AddAlarmHistorian` registers the real `SqliteStoreAndForwardSink`→Wonderware sink when configured (else `Null`). Galaxy alarm reconnect is already handled by the feed's own retry loop + gRPC channel auto-reconnect — verify + document only.
**Tech Stack:** .NET 10, Akka.NET (cluster, DistributedPubSub, TestKit/xunit2), xUnit + Shouldly, SQLite store-and-forward, named-pipe IPC to Wonderware.
**Design of record:** `docs/plans/2026-06-11-alarm-followups-round2-design.md` (committed master `3ad7960d`).
**Hard rules:** stage by explicit path (never `git add .`); never stage `sql_login.txt` / `src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no force-push, no `--no-verify`; **no Configuration entity / EF migration change** (the historian queue is a standalone SQLite file, NOT the Config DB). Build on a feature branch off master.
---
### Task 0: Branch + baseline
**Classification:** trivial
**Estimated implement time:** ~1 min
**Parallelizable with:** none
**Files:** (none — git only)
**Steps:**
1. `git checkout master && git switch -c feat/alarm-followups-r2` (off `3ad7960d`).
2. Confirm clean tree + green baseline: `dotnet build ZB.MOM.WW.OtOpcUa.slnx` → 0 errors.
3. No commit (branch only).
---
### Task 1: Extend `AlarmTransitionEvent` + carry `Comment` through the engine emission (B1)
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 3, Task 4 (different projects — but all three touch the Runtime/Core compilation graph; the executor serialises same-assembly builds, see Execution notes)
**Files:**
- Modify: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Alerts/AlarmTransitionEvent.cs`
- Modify: `src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs` (the `ScriptedAlarmEvent` record ≈ line 817 + `BuildEmission` + the ack/confirm/comment/shelve ops that receive the operator comment)
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs` (`OnEngineEmission`, the `AlarmTransitionEvent` construction ≈ line 268276)
- Test: `tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/` (the engine emission tests — find the ack/comment-transition test) + `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ScriptedAlarms/ScriptedAlarmHostActorTests.cs`
**Context:** `AlarmTransitionEvent` (the `alerts` DPS payload) has 8 positional fields and lacks two that `AlarmHistorianEvent` needs: `AlarmTypeName` (Part-9 subtype) and `Comment`. `ScriptedAlarmHostActor.OnEngineEmission` builds the event from the engine emission `e` (`ScriptedAlarmEvent`), which already carries `AlarmKind Kind` but has NO `Comment`. Add the two fields and populate them. The Akka cluster uses the default serializer (no Hyperion/protobuf), so appending record fields is forward/backward compatible across a rolling restart.
**Step 1: Extend `AlarmTransitionEvent`** — append two trailing params WITH defaults so existing construction sites (tests) still compile and only `ScriptedAlarmHostActor` must populate:
```csharp
public sealed record AlarmTransitionEvent(
string AlarmId,
string EquipmentPath,
string AlarmName,
string TransitionKind,
int Severity,
string Message,
string User,
DateTime TimestampUtc,
string AlarmTypeName = "AlarmCondition", // Part-9 subtype (LimitAlarm/DiscreteAlarm/OffNormalAlarm/AlarmCondition)
string? Comment = null); // operator comment on ack/confirm/comment/shelve transitions; null otherwise
```
Add `<param>` doc lines for both (TreatWarningsAsErrors).
**Step 2: Carry `Comment` through the engine.** In `ScriptedAlarmEngine.cs`:
- Add `string? Comment = null` as a trailing param to the `ScriptedAlarmEvent` record (≈ line 817).
- In `BuildEmission` (where `ScriptedAlarmEvent` is constructed), populate `Comment` from the condition/op state for comment-bearing transitions. READ how the ack/confirm/`AddComment`/shelve ops receive the operator comment (they take a `comment` argument) — thread the latest operator comment into the emitted event (e.g. carry it on the condition state the emission reads, or pass it into `BuildEmission`). For engine-driven transitions (Activated/Cleared) `Comment` stays null.
**Step 3: Failing tests first.**
- Engine test: a transition produced by an ack/`AddComment` op with an operator comment yields a `ScriptedAlarmEvent` whose `Comment` equals that text; an Activated/Cleared emission has `Comment == null`. (Find the existing engine ack/comment test and assert the new field.)
- Host test (`ScriptedAlarmHostActorTests`): after an emission, the published `AlarmTransitionEvent` carries `AlarmTypeName == e.Kind.ToString()` and `Comment == e.Comment` (extend the existing alerts-publish assertion).
Run them → FAIL (fields don't exist / not populated).
**Step 4: Populate in `OnEngineEmission`.** In the `AlarmTransitionEvent` construction (≈ line 268276) add:
```csharp
AlarmTypeName: e.Kind.ToString(),
Comment: e.Comment,
```
Leave the Primary gate + the OPC UA node write (`_publishActor.Tell`) + everything else unchanged.
**Step 5:** Run the two suites:
`dotnet test tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests` and
`dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter ScriptedAlarmHostActor` → green. Then `dotnet build ZB.MOM.WW.OtOpcUa.slnx` → 0 errors (a Commons-record change ripples; confirm whole-solution build).
**Step 6: Commit** by explicit path (the 3 source files + the 2 test files).
> Standard: data-contract change (DPS-serialised) + engine emission threading. The careful part is the engine `Comment` plumbing — keep Activated/Cleared `Comment == null`.
---
### Task 2: `HistorianAdapterActor` subscribes to `alerts` + translates (B2)
**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Blocked by:** Task 1 (needs the extended `AlarmTransitionEvent`)
**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs`
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/HistorianAdapterActorTests.cs`
**Context:** `HistorianAdapterActor` already has the Primary gate (in its `Receive<AlarmHistorianEvent>`), the `_localRole` cache, and the `redundancy-state` subscription (PreStart). It does NOT yet subscribe to `alerts` and has no feeder. Add the alerts subscription + a translate path that runs through the **same** gate. The gate stays load-bearing: the Primary publishes once, DPS delivers to BOTH nodes' historian actors, the gate keeps only the Primary writing.
**Step 1: Refactor the gate into a shared path.** Extract the gate + enqueue out of the `Receive<AlarmHistorianEvent>` lambda into a private `Historize(AlarmHistorianEvent evt)` method:
```csharp
private void Historize(AlarmHistorianEvent evt)
{
if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached) return;
_ = EnqueueAsync(evt);
}
```
Point the existing `Receive<AlarmHistorianEvent>(evt => Historize(evt));` at it (behaviour unchanged).
**Step 2: Failing TestKit tests** (extend `HistorianAdapterActorTests`; it has a `RecordingSink` + sends `RedundancyStateChanged` directly):
- `Alerts_transition_is_historized_by_default` — send an `AlarmTransitionEvent` (no role set) → the fake sink records ONE enqueue whose translated `AlarmHistorianEvent` has the right `AlarmId`/`AlarmTypeName`/`EventKind`/`Severity` bucket/`Comment`.
- `Secondary_node_does_not_historize_alerts_transition` — after a `Secondary` `RedundancyStateChanged`, an `AlarmTransitionEvent` records ZERO enqueues.
- `Primary_node_historizes_alerts_transition` — after `Primary`, ONE enqueue.
- `Alerts_transition_translation_buckets_severity` — severity int boundaries map to the right `AlarmSeverity` (e.g. 250→Low, 251→Medium, 750→High, 751→Critical) — can be a focused unit test on the translation helper if you extract one.
Run → FAIL (no `Receive<AlarmTransitionEvent>` yet).
**Step 3: Implement.**
- `using ZB.MOM.WW.OtOpcUa.Commons.Messages.Alerts;`
- In PreStart, ALSO subscribe to the alerts topic: `_mediator.Tell(new Subscribe(ScriptedAlarmHostActor.AlertsTopic, Self));` (reuse the public const `ScriptedAlarmHostActor.AlertsTopic = "alerts"` — same assembly). The existing `Receive<SubscribeAck>` no-op covers both acks.
- Add `Receive<AlarmTransitionEvent>(evt => Historize(Translate(evt)));`.
- Add a `private static AlarmHistorianEvent Translate(AlarmTransitionEvent t)`:
```csharp
private static AlarmHistorianEvent Translate(AlarmTransitionEvent t) => new(
AlarmId: t.AlarmId,
EquipmentPath: t.EquipmentPath,
AlarmName: t.AlarmName,
AlarmTypeName: t.AlarmTypeName,
Severity: ToSeverity(t.Severity),
EventKind: t.TransitionKind,
Message: t.Message,
User: t.User,
Comment: t.Comment,
TimestampUtc: t.TimestampUtc);
// Invert ScriptedAlarmHostActor.SeverityToInt's buckets (Low=250, Medium=500, High=750, Critical=1000).
private static AlarmSeverity ToSeverity(int s) => s switch
{
<= 250 => AlarmSeverity.Low,
<= 500 => AlarmSeverity.Medium,
<= 750 => AlarmSeverity.High,
_ => AlarmSeverity.Critical,
};
```
(Confirm the `AlarmSeverity` enum members + namespace via `AlarmHistorianEvent.cs`.)
**Step 4:** `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter HistorianAdapter` → green; full Runtime.Tests stays green.
**Step 5: Commit** by explicit path.
> High-risk: redundancy exactly-once + a cluster DPS subscription. Do NOT remove the gate (it's what keeps the two nodes from double-writing). Keep the `Receive<AlarmHistorianEvent>` path (a future direct source can still use it).
---
### Task 3: Config-gated durable sink (`AddAlarmHistorian`) + Host wiring (B3)
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 1, Task 4
**Blocked by:** none (independent of B1/B2)
**Files:**
- Create: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/AlarmHistorianOptions.cs` (the config record)
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ServiceCollectionExtensions.cs` (add `AddAlarmHistorian(IConfiguration)`)
- Modify: the Host startup where `AddOtOpcUaRuntime()` is called (`src/Server/ZB.MOM.WW.OtOpcUa.Host/Program.cs` or the host bootstrap — find the call site) to call `AddAlarmHistorian(configuration)`
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Host/appsettings.json` (a commented/example `AlarmHistorian` section, default absent/disabled)
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/AlarmHistorianRegistrationTests.cs` (new)
**Context:** Production defaults to `NullAlarmHistorianSink` (`ServiceCollectionExtensions.cs:41` `TryAddSingleton<IAlarmHistorianSink>(NullAlarmHistorianSink.Instance)`). Add a config-gated registration that swaps in the real `SqliteStoreAndForwardSink`→`WonderwareHistorianClient` when an `AlarmHistorian` section is present + enabled.
**Step 1: Options record** (`AlarmHistorianOptions.cs`):
```csharp
public sealed class AlarmHistorianOptions
{
public const string SectionName = "AlarmHistorian";
public bool Enabled { get; init; }
public string DatabasePath { get; init; } = "alarm-historian.db";
public string PipeName { get; init; } = "OtOpcUaHistorian";
public string SharedSecret { get; init; } = "";
public int BatchSize { get; init; } = 100;
}
```
**Step 2: Failing registration tests** (`AlarmHistorianRegistrationTests`, xUnit + Shouldly, build a `ServiceCollection` + in-memory `IConfiguration`):
- Section absent → resolved `IAlarmHistorianSink` is `NullAlarmHistorianSink`.
- Section present with `Enabled=true` → resolved `IAlarmHistorianSink` is a `SqliteStoreAndForwardSink` (assert the concrete type). Use a temp DB path.
- Section present with `Enabled=false` → stays `NullAlarmHistorianSink`.
Run → FAIL (`AddAlarmHistorian` doesn't exist).
**Step 3: Implement `AddAlarmHistorian`** in `ServiceCollectionExtensions`:
```csharp
public static IServiceCollection AddAlarmHistorian(this IServiceCollection services, IConfiguration configuration)
{
var opts = configuration.GetSection(AlarmHistorianOptions.SectionName).Get<AlarmHistorianOptions>();
if (opts is not { Enabled: true }) return services; // leave the Null default from AddOtOpcUaRuntime
services.AddSingleton<IAlarmHistorianSink>(sp =>
{
var writer = new WonderwareHistorianClient(
new WonderwareHistorianClientOptions(PipeName: opts.PipeName, SharedSecret: opts.SharedSecret),
sp.GetService<ILogger<WonderwareHistorianClient>>());
var sink = new SqliteStoreAndForwardSink(
opts.DatabasePath, writer, sp.GetRequiredService<ILogger<SqliteStoreAndForwardSink>>(),
batchSize: opts.BatchSize);
sink.StartDrainLoop(TimeSpan.FromSeconds(5));
return sink;
});
return services;
}
```
- Use `services.AddSingleton<IAlarmHistorianSink>(...)` (NOT `TryAdd`) so it overrides the `Null` default registered by `AddOtOpcUaRuntime`. **Order matters**: `AddAlarmHistorian` must run AFTER `AddOtOpcUaRuntime` — verify the Host calls them in that order, or have `AddAlarmHistorian` remove the prior registration first.
- Confirm the exact `WonderwareHistorianClient` ctor + `WonderwareHistorianClientOptions` record params + `SqliteStoreAndForwardSink` ctor + `StartDrainLoop` signature from the referenced files; adjust the call to match. Add the project references if the Host/Runtime don't already reference `Driver.Historian.Wonderware.Client` + `Core.AlarmHistorian` (check first; if a new project reference is needed, surface it).
- Dispose: ensure the sink (IDisposable) is disposed on shutdown (singleton registered in DI is disposed by the container at host stop — verify the sink is `IDisposable` and DI owns it; it is).
**Step 4: Host wiring** — call `builder.Services.AddAlarmHistorian(builder.Configuration);` right after `AddOtOpcUaRuntime()`. Add the disabled example section to `appsettings.json`:
```jsonc
// "AlarmHistorian": { "Enabled": false, "DatabasePath": "alarm-historian.db", "PipeName": "OtOpcUaHistorian", "SharedSecret": "" }
```
(Keep it commented or `Enabled:false` so dev/docker-dev stay on `Null`.)
**Step 5:** `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter AlarmHistorianRegistration` → green; `dotnet build ZB.MOM.WW.OtOpcUa.slnx` → 0 errors.
**Step 6: Commit** by explicit path.
> If wiring the real sink pulls a new project reference into the Host/Runtime that ripples (e.g. the Wonderware client drags Win-only deps), surface it before expanding — the sink construction may need to live in the Host project (which already references drivers) rather than Runtime. Prefer putting `AddAlarmHistorian` wherever the Wonderware client is already referenceable.
---
### Task 4: Galaxy alarm-reconnect — acknowledger-recovery test + doc (A)
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** Task 1, Task 3
**Blocked by:** none
**Files:**
- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/Runtime/GatewayGalaxyAlarmAcknowledgerTests.cs` (create if absent; else extend)
- Modify: `docs/drivers/Galaxy.md` (the Reconnect + Replay section)
**Context:** The alarm feed already recovers (its `RunAsync` re-invokes `StreamAlarmsAsync` on a transport fault — covered by `GatewayGalaxyAlarmFeedTests.Reopens_stream_after_a_transport_fault`). gRPC.NET channels auto-reconnect, so the same client recovers after a gateway restart; `_ownedMxClient` is intentionally NOT recreated, and gRPC keepalive isn't reachable (`MxGatewayClientOptions` is a NuGet package). This task verifies the **acknowledger** path + documents the behaviour. NO production code change.
**Step 1: Acknowledger-recovery test.** Read `GatewayGalaxyAlarmAcknowledger.cs:29-48` (it holds a client/delegate and calls `AcknowledgeAlarmAsync`). Mirror however `GatewayGalaxyAlarmFeedTests` fakes the client/stream factory. Write a test where the acknowledge call fails once with a transient `RpcException` (or the test double's fault) and the NEXT call succeeds — asserting the acknowledger does not latch a dead state and a retry on the same client succeeds. If the acknowledger has no internal retry (it likely just forwards one call), assert instead that a second independent `AcknowledgeAsync` after a faulted first call still issues the unary call (i.e. the acknowledger is stateless w.r.t. faults — the gRPC channel handles reconnect). Name it `Acknowledge_after_transient_fault_succeeds_on_retry`.
**Step 2:** Run → it should pass immediately if the acknowledger is stateless (no fault latch). If it FAILS (the acknowledger caches a dead client / latches), that's a real gap — STOP and surface it (escalates A to a real fix, out of this task's verify-only scope).
**Step 3: Document** in `docs/drivers/Galaxy.md` (Reconnect + Replay section): a short note — the session-less alarm feed/acknowledger run on `_ownedMxClient`, which is **not** recreated on reconnect by design; the feed's own re-invoke loop (`GatewayGalaxyAlarmFeed.RunAsync`, ~5s backoff) plus gRPC.NET channel auto-reconnect recover the alarm stream + acks after a gateway restart. Channel-level keepalive hardening would require exposing knobs on the `MxGatewayClient` package (sibling repo) — noted as a future option, not needed today.
**Step 4: Commit** by explicit path (the test + the doc).
---
### Task 5: Full-suite gate + docs + finish
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Blocked by:** Task 2, Task 3, Task 4
**Files:** `docs/AlarmHistorian.md` (or `docs/AlarmTracking.md`) — note the historian is now fed from the `alerts` topic (scripted alarms, Primary-gated, exactly-once) + the config-gated real sink (`AlarmHistorian` appsettings section). Keep terse.
**Steps:**
1. Update the historian doc(s) to reflect: `HistorianAdapterActor` now subscribes to `alerts` and historizes scripted-alarm transitions exactly-once (Primary-gated); the durable `SqliteStoreAndForwardSink`→Wonderware sink is enabled via the `AlarmHistorian` config section (else `Null`); Galaxy/AB-CIP historization is out of scope (AVEVA native / future).
2. Run the FULL suite: `dotnet test ZB.MOM.WW.OtOpcUa.slnx` — confirm all affected unit suites green; the ONLY failures should be the known pre-existing env/integration ones (AbCip/AbLegacy IntegrationTests fixtures, OpcUaServer.IntegrationTests PKI, Host.IntegrationTests deploy-Rejected). Capture full output (don't pipe through `tail` — the pipe masks the real exit code).
3. **Commit** docs by explicit path.
4. Run **superpowers-extended-cc:finishing-a-development-branch** (verify tests → present the 4 options → execute the user's choice).
---
## Execution notes
- **Dependency spine:** T0 → {T1, T3, T4 mutually parallel by files} ; T2 after T1 ; T5 after T2/T3/T4.
- **Same-assembly build contention:** T1 (Runtime/ScriptedAlarms) and T3 (Runtime/ServiceCollectionExtensions) both compile into `ZB.MOM.WW.OtOpcUa.Runtime`; T2 also. When executing in one shared working tree, **serialise build/test of same-assembly tasks** even though their files are disjoint (concurrent `dotnet build`/`test` of the same project collide on obj/ and a mid-edit sibling breaks the build). T4 (`Driver.Galaxy`) is the only fully-independent project — safe to run concurrently. (This is the lesson from round 1.)
- **Classifications drive review:** T1/T3 standard (parallel spec+code review). T2 high-risk (serial spec→code + final integration review). T4 small (code review only). T0 trivial.
- **No bUnit / no docker-dev gate:** there's no UI change, and the real sink is config-gated (stays `Null` on docker-dev), so exactly-once is proven by TestKit, not a live rig. An optional end-to-end (configure the `AlarmHistorian` section + a real/fake pipe) is NOT required for done.
- **Done =** build clean + `dotnet test` green (modulo the known pre-existing env/integration failures).