Files
lmxopcua/docs/plans/2026-06-11-alarm-followups-round2.md
T

21 KiB
Raw Blame History

Alarm Follow-ups (Round 2) Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.

Goal: Activate exactly-once alarm historization by feeding HistorianAdapterActor from the alerts topic (with a config-gated durable sink), and verify/document the Galaxy alarm-feed reconnect behaviour.

Architecture: HistorianAdapterActor subscribes to the cluster alerts DPS topic, translates each AlarmTransitionEventAlarmHistorianEvent, and runs it through the existing T2 Primary gate (DPS fans the Primary's single publish to both nodes' historians, so the gate keeps writes exactly-once). AlarmTransitionEvent is extended with the two fields the historian record needs (AlarmTypeName, Comment), including a small ScriptedAlarmEngine change to carry Comment through the emission. A config-gated AddAlarmHistorian registers the real SqliteStoreAndForwardSink→Wonderware sink when configured (else Null). Galaxy alarm reconnect is already handled by the feed's own retry loop + gRPC channel auto-reconnect — verify + document only.

Tech Stack: .NET 10, Akka.NET (cluster, DistributedPubSub, TestKit/xunit2), xUnit + Shouldly, SQLite store-and-forward, named-pipe IPC to Wonderware.

Design of record: docs/plans/2026-06-11-alarm-followups-round2-design.md (committed master 3ad7960d).

Hard rules: stage by explicit path (never git add .); never stage sql_login.txt / src/Server/.../Host/pki/; never echo the gateway API key into a new tracked file; no force-push, no --no-verify; no Configuration entity / EF migration change (the historian queue is a standalone SQLite file, NOT the Config DB). Build on a feature branch off master.


Task 0: Branch + baseline

Classification: trivial Estimated implement time: ~1 min Parallelizable with: none

Files: (none — git only)

Steps:

  1. git checkout master && git switch -c feat/alarm-followups-r2 (off 3ad7960d).
  2. Confirm clean tree + green baseline: dotnet build ZB.MOM.WW.OtOpcUa.slnx → 0 errors.
  3. No commit (branch only).

Task 1: Extend AlarmTransitionEvent + carry Comment through the engine emission (B1)

Classification: standard Estimated implement time: ~5 min Parallelizable with: Task 3, Task 4 (different projects — but all three touch the Runtime/Core compilation graph; the executor serialises same-assembly builds, see Execution notes)

Files:

  • Modify: src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Alerts/AlarmTransitionEvent.cs
  • Modify: src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs (the ScriptedAlarmEvent record ≈ line 817 + BuildEmission + the ack/confirm/comment/shelve ops that receive the operator comment)
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs (OnEngineEmission, the AlarmTransitionEvent construction ≈ line 268276)
  • Test: tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ (the engine emission tests — find the ack/comment-transition test) + tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ScriptedAlarms/ScriptedAlarmHostActorTests.cs

Context: AlarmTransitionEvent (the alerts DPS payload) has 8 positional fields and lacks two that AlarmHistorianEvent needs: AlarmTypeName (Part-9 subtype) and Comment. ScriptedAlarmHostActor.OnEngineEmission builds the event from the engine emission e (ScriptedAlarmEvent), which already carries AlarmKind Kind but has NO Comment. Add the two fields and populate them. The Akka cluster uses the default serializer (no Hyperion/protobuf), so appending record fields is forward/backward compatible across a rolling restart.

Step 1: Extend AlarmTransitionEvent — append two trailing params WITH defaults so existing construction sites (tests) still compile and only ScriptedAlarmHostActor must populate:

public sealed record AlarmTransitionEvent(
    string AlarmId,
    string EquipmentPath,
    string AlarmName,
    string TransitionKind,
    int Severity,
    string Message,
    string User,
    DateTime TimestampUtc,
    string AlarmTypeName = "AlarmCondition",  // Part-9 subtype (LimitAlarm/DiscreteAlarm/OffNormalAlarm/AlarmCondition)
    string? Comment = null);                  // operator comment on ack/confirm/comment/shelve transitions; null otherwise

Add <param> doc lines for both (TreatWarningsAsErrors).

Step 2: Carry Comment through the engine. In ScriptedAlarmEngine.cs:

  • Add string? Comment = null as a trailing param to the ScriptedAlarmEvent record (≈ line 817).
  • In BuildEmission (where ScriptedAlarmEvent is constructed), populate Comment from the condition/op state for comment-bearing transitions. READ how the ack/confirm/AddComment/shelve ops receive the operator comment (they take a comment argument) — thread the latest operator comment into the emitted event (e.g. carry it on the condition state the emission reads, or pass it into BuildEmission). For engine-driven transitions (Activated/Cleared) Comment stays null.

Step 3: Failing tests first.

  • Engine test: a transition produced by an ack/AddComment op with an operator comment yields a ScriptedAlarmEvent whose Comment equals that text; an Activated/Cleared emission has Comment == null. (Find the existing engine ack/comment test and assert the new field.)
  • Host test (ScriptedAlarmHostActorTests): after an emission, the published AlarmTransitionEvent carries AlarmTypeName == e.Kind.ToString() and Comment == e.Comment (extend the existing alerts-publish assertion). Run them → FAIL (fields don't exist / not populated).

Step 4: Populate in OnEngineEmission. In the AlarmTransitionEvent construction (≈ line 268276) add:

AlarmTypeName: e.Kind.ToString(),
Comment: e.Comment,

Leave the Primary gate + the OPC UA node write (_publishActor.Tell) + everything else unchanged.

Step 5: Run the two suites: dotnet test tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests and dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter ScriptedAlarmHostActor → green. Then dotnet build ZB.MOM.WW.OtOpcUa.slnx → 0 errors (a Commons-record change ripples; confirm whole-solution build).

Step 6: Commit by explicit path (the 3 source files + the 2 test files).

Standard: data-contract change (DPS-serialised) + engine emission threading. The careful part is the engine Comment plumbing — keep Activated/Cleared Comment == null.


Task 2: HistorianAdapterActor subscribes to alerts + translates (B2)

Classification: high-risk Estimated implement time: ~5 min Parallelizable with: none Blocked by: Task 1 (needs the extended AlarmTransitionEvent)

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/HistorianAdapterActorTests.cs

Context: HistorianAdapterActor already has the Primary gate (in its Receive<AlarmHistorianEvent>), the _localRole cache, and the redundancy-state subscription (PreStart). It does NOT yet subscribe to alerts and has no feeder. Add the alerts subscription + a translate path that runs through the same gate. The gate stays load-bearing: the Primary publishes once, DPS delivers to BOTH nodes' historian actors, the gate keeps only the Primary writing.

Step 1: Refactor the gate into a shared path. Extract the gate + enqueue out of the Receive<AlarmHistorianEvent> lambda into a private Historize(AlarmHistorianEvent evt) method:

private void Historize(AlarmHistorianEvent evt)
{
    if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached) return;
    _ = EnqueueAsync(evt);
}

Point the existing Receive<AlarmHistorianEvent>(evt => Historize(evt)); at it (behaviour unchanged).

Step 2: Failing TestKit tests (extend HistorianAdapterActorTests; it has a RecordingSink + sends RedundancyStateChanged directly):

  • Alerts_transition_is_historized_by_default — send an AlarmTransitionEvent (no role set) → the fake sink records ONE enqueue whose translated AlarmHistorianEvent has the right AlarmId/AlarmTypeName/EventKind/Severity bucket/Comment.
  • Secondary_node_does_not_historize_alerts_transition — after a Secondary RedundancyStateChanged, an AlarmTransitionEvent records ZERO enqueues.
  • Primary_node_historizes_alerts_transition — after Primary, ONE enqueue.
  • Alerts_transition_translation_buckets_severity — severity int boundaries map to the right AlarmSeverity (e.g. 250→Low, 251→Medium, 750→High, 751→Critical) — can be a focused unit test on the translation helper if you extract one. Run → FAIL (no Receive<AlarmTransitionEvent> yet).

Step 3: Implement.

  • using ZB.MOM.WW.OtOpcUa.Commons.Messages.Alerts;
  • In PreStart, ALSO subscribe to the alerts topic: _mediator.Tell(new Subscribe(ScriptedAlarmHostActor.AlertsTopic, Self)); (reuse the public const ScriptedAlarmHostActor.AlertsTopic = "alerts" — same assembly). The existing Receive<SubscribeAck> no-op covers both acks.
  • Add Receive<AlarmTransitionEvent>(evt => Historize(Translate(evt)));.
  • Add a private static AlarmHistorianEvent Translate(AlarmTransitionEvent t):
    private static AlarmHistorianEvent Translate(AlarmTransitionEvent t) => new(
        AlarmId: t.AlarmId,
        EquipmentPath: t.EquipmentPath,
        AlarmName: t.AlarmName,
        AlarmTypeName: t.AlarmTypeName,
        Severity: ToSeverity(t.Severity),
        EventKind: t.TransitionKind,
        Message: t.Message,
        User: t.User,
        Comment: t.Comment,
        TimestampUtc: t.TimestampUtc);
    
    // Invert ScriptedAlarmHostActor.SeverityToInt's buckets (Low=250, Medium=500, High=750, Critical=1000).
    private static AlarmSeverity ToSeverity(int s) => s switch
    {
        <= 250 => AlarmSeverity.Low,
        <= 500 => AlarmSeverity.Medium,
        <= 750 => AlarmSeverity.High,
        _ => AlarmSeverity.Critical,
    };
    
    (Confirm the AlarmSeverity enum members + namespace via AlarmHistorianEvent.cs.)

Step 4: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter HistorianAdapter → green; full Runtime.Tests stays green.

Step 5: Commit by explicit path.

High-risk: redundancy exactly-once + a cluster DPS subscription. Do NOT remove the gate (it's what keeps the two nodes from double-writing). Keep the Receive<AlarmHistorianEvent> path (a future direct source can still use it).


Task 3: Config-gated durable sink (AddAlarmHistorian) + Host wiring (B3)

Classification: standard Estimated implement time: ~5 min Parallelizable with: Task 1, Task 4 Blocked by: none (independent of B1/B2)

Files:

  • Create: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/AlarmHistorianOptions.cs (the config record)
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ServiceCollectionExtensions.cs (add AddAlarmHistorian(IConfiguration))
  • Modify: the Host startup where AddOtOpcUaRuntime() is called (src/Server/ZB.MOM.WW.OtOpcUa.Host/Program.cs or the host bootstrap — find the call site) to call AddAlarmHistorian(configuration)
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Host/appsettings.json (a commented/example AlarmHistorian section, default absent/disabled)
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/AlarmHistorianRegistrationTests.cs (new)

Context: Production defaults to NullAlarmHistorianSink (ServiceCollectionExtensions.cs:41 TryAddSingleton<IAlarmHistorianSink>(NullAlarmHistorianSink.Instance)). Add a config-gated registration that swaps in the real SqliteStoreAndForwardSinkWonderwareHistorianClient when an AlarmHistorian section is present + enabled.

Step 1: Options record (AlarmHistorianOptions.cs):

public sealed class AlarmHistorianOptions
{
    public const string SectionName = "AlarmHistorian";
    public bool Enabled { get; init; }
    public string DatabasePath { get; init; } = "alarm-historian.db";
    public string PipeName { get; init; } = "OtOpcUaHistorian";
    public string SharedSecret { get; init; } = "";
    public int BatchSize { get; init; } = 100;
}

Step 2: Failing registration tests (AlarmHistorianRegistrationTests, xUnit + Shouldly, build a ServiceCollection + in-memory IConfiguration):

  • Section absent → resolved IAlarmHistorianSink is NullAlarmHistorianSink.
  • Section present with Enabled=true → resolved IAlarmHistorianSink is a SqliteStoreAndForwardSink (assert the concrete type). Use a temp DB path.
  • Section present with Enabled=false → stays NullAlarmHistorianSink. Run → FAIL (AddAlarmHistorian doesn't exist).

Step 3: Implement AddAlarmHistorian in ServiceCollectionExtensions:

public static IServiceCollection AddAlarmHistorian(this IServiceCollection services, IConfiguration configuration)
{
    var opts = configuration.GetSection(AlarmHistorianOptions.SectionName).Get<AlarmHistorianOptions>();
    if (opts is not { Enabled: true }) return services; // leave the Null default from AddOtOpcUaRuntime

    services.AddSingleton<IAlarmHistorianSink>(sp =>
    {
        var writer = new WonderwareHistorianClient(
            new WonderwareHistorianClientOptions(PipeName: opts.PipeName, SharedSecret: opts.SharedSecret),
            sp.GetService<ILogger<WonderwareHistorianClient>>());
        var sink = new SqliteStoreAndForwardSink(
            opts.DatabasePath, writer, sp.GetRequiredService<ILogger<SqliteStoreAndForwardSink>>(),
            batchSize: opts.BatchSize);
        sink.StartDrainLoop(TimeSpan.FromSeconds(5));
        return sink;
    });
    return services;
}
  • Use services.AddSingleton<IAlarmHistorianSink>(...) (NOT TryAdd) so it overrides the Null default registered by AddOtOpcUaRuntime. Order matters: AddAlarmHistorian must run AFTER AddOtOpcUaRuntime — verify the Host calls them in that order, or have AddAlarmHistorian remove the prior registration first.
  • Confirm the exact WonderwareHistorianClient ctor + WonderwareHistorianClientOptions record params + SqliteStoreAndForwardSink ctor + StartDrainLoop signature from the referenced files; adjust the call to match. Add the project references if the Host/Runtime don't already reference Driver.Historian.Wonderware.Client + Core.AlarmHistorian (check first; if a new project reference is needed, surface it).
  • Dispose: ensure the sink (IDisposable) is disposed on shutdown (singleton registered in DI is disposed by the container at host stop — verify the sink is IDisposable and DI owns it; it is).

Step 4: Host wiring — call builder.Services.AddAlarmHistorian(builder.Configuration); right after AddOtOpcUaRuntime(). Add the disabled example section to appsettings.json:

// "AlarmHistorian": { "Enabled": false, "DatabasePath": "alarm-historian.db", "PipeName": "OtOpcUaHistorian", "SharedSecret": "" }

(Keep it commented or Enabled:false so dev/docker-dev stay on Null.)

Step 5: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter AlarmHistorianRegistration → green; dotnet build ZB.MOM.WW.OtOpcUa.slnx → 0 errors.

Step 6: Commit by explicit path.

If wiring the real sink pulls a new project reference into the Host/Runtime that ripples (e.g. the Wonderware client drags Win-only deps), surface it before expanding — the sink construction may need to live in the Host project (which already references drivers) rather than Runtime. Prefer putting AddAlarmHistorian wherever the Wonderware client is already referenceable.


Task 4: Galaxy alarm-reconnect — acknowledger-recovery test + doc (A)

Classification: small Estimated implement time: ~4 min Parallelizable with: Task 1, Task 3 Blocked by: none

Files:

  • Test: tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/Runtime/GatewayGalaxyAlarmAcknowledgerTests.cs (create if absent; else extend)
  • Modify: docs/drivers/Galaxy.md (the Reconnect + Replay section)

Context: The alarm feed already recovers (its RunAsync re-invokes StreamAlarmsAsync on a transport fault — covered by GatewayGalaxyAlarmFeedTests.Reopens_stream_after_a_transport_fault). gRPC.NET channels auto-reconnect, so the same client recovers after a gateway restart; _ownedMxClient is intentionally NOT recreated, and gRPC keepalive isn't reachable (MxGatewayClientOptions is a NuGet package). This task verifies the acknowledger path + documents the behaviour. NO production code change.

Step 1: Acknowledger-recovery test. Read GatewayGalaxyAlarmAcknowledger.cs:29-48 (it holds a client/delegate and calls AcknowledgeAlarmAsync). Mirror however GatewayGalaxyAlarmFeedTests fakes the client/stream factory. Write a test where the acknowledge call fails once with a transient RpcException (or the test double's fault) and the NEXT call succeeds — asserting the acknowledger does not latch a dead state and a retry on the same client succeeds. If the acknowledger has no internal retry (it likely just forwards one call), assert instead that a second independent AcknowledgeAsync after a faulted first call still issues the unary call (i.e. the acknowledger is stateless w.r.t. faults — the gRPC channel handles reconnect). Name it Acknowledge_after_transient_fault_succeeds_on_retry.

Step 2: Run → it should pass immediately if the acknowledger is stateless (no fault latch). If it FAILS (the acknowledger caches a dead client / latches), that's a real gap — STOP and surface it (escalates A to a real fix, out of this task's verify-only scope).

Step 3: Document in docs/drivers/Galaxy.md (Reconnect + Replay section): a short note — the session-less alarm feed/acknowledger run on _ownedMxClient, which is not recreated on reconnect by design; the feed's own re-invoke loop (GatewayGalaxyAlarmFeed.RunAsync, ~5s backoff) plus gRPC.NET channel auto-reconnect recover the alarm stream + acks after a gateway restart. Channel-level keepalive hardening would require exposing knobs on the MxGatewayClient package (sibling repo) — noted as a future option, not needed today.

Step 4: Commit by explicit path (the test + the doc).


Task 5: Full-suite gate + docs + finish

Classification: small Estimated implement time: ~4 min Parallelizable with: none Blocked by: Task 2, Task 3, Task 4

Files: docs/AlarmHistorian.md (or docs/AlarmTracking.md) — note the historian is now fed from the alerts topic (scripted alarms, Primary-gated, exactly-once) + the config-gated real sink (AlarmHistorian appsettings section). Keep terse.

Steps:

  1. Update the historian doc(s) to reflect: HistorianAdapterActor now subscribes to alerts and historizes scripted-alarm transitions exactly-once (Primary-gated); the durable SqliteStoreAndForwardSink→Wonderware sink is enabled via the AlarmHistorian config section (else Null); Galaxy/AB-CIP historization is out of scope (AVEVA native / future).
  2. Run the FULL suite: dotnet test ZB.MOM.WW.OtOpcUa.slnx — confirm all affected unit suites green; the ONLY failures should be the known pre-existing env/integration ones (AbCip/AbLegacy IntegrationTests fixtures, OpcUaServer.IntegrationTests PKI, Host.IntegrationTests deploy-Rejected). Capture full output (don't pipe through tail — the pipe masks the real exit code).
  3. Commit docs by explicit path.
  4. Run superpowers-extended-cc:finishing-a-development-branch (verify tests → present the 4 options → execute the user's choice).

Execution notes

  • Dependency spine: T0 → {T1, T3, T4 mutually parallel by files} ; T2 after T1 ; T5 after T2/T3/T4.
  • Same-assembly build contention: T1 (Runtime/ScriptedAlarms) and T3 (Runtime/ServiceCollectionExtensions) both compile into ZB.MOM.WW.OtOpcUa.Runtime; T2 also. When executing in one shared working tree, serialise build/test of same-assembly tasks even though their files are disjoint (concurrent dotnet build/test of the same project collide on obj/ and a mid-edit sibling breaks the build). T4 (Driver.Galaxy) is the only fully-independent project — safe to run concurrently. (This is the lesson from round 1.)
  • Classifications drive review: T1/T3 standard (parallel spec+code review). T2 high-risk (serial spec→code + final integration review). T4 small (code review only). T0 trivial.
  • No bUnit / no docker-dev gate: there's no UI change, and the real sink is config-gated (stays Null on docker-dev), so exactly-once is proven by TestKit, not a live rig. An optional end-to-end (configure the AlarmHistorian section + a real/fake pipe) is NOT required for done.
  • Done = build clean + dotnet test green (modulo the known pre-existing env/integration failures).