Alarm Follow-ups (Round 2) Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.
Goal: Activate exactly-once alarm historization by feeding HistorianAdapterActor from the alerts topic (with a config-gated durable sink), and verify/document the Galaxy alarm-feed reconnect behaviour.
Architecture: HistorianAdapterActor subscribes to the cluster alerts DPS topic, translates each AlarmTransitionEvent → AlarmHistorianEvent, and runs it through the existing T2 Primary gate (DPS fans the Primary's single publish to both nodes' historians, so the gate keeps writes exactly-once). AlarmTransitionEvent is extended with the two fields the historian record needs (AlarmTypeName, Comment), including a small ScriptedAlarmEngine change to carry Comment through the emission. A config-gated AddAlarmHistorian registers the real SqliteStoreAndForwardSink→Wonderware sink when configured (else Null). Galaxy alarm reconnect is already handled by the feed's own retry loop + gRPC channel auto-reconnect — verify + document only.
Tech Stack: .NET 10, Akka.NET (cluster, DistributedPubSub, TestKit/xunit2), xUnit + Shouldly, SQLite store-and-forward, named-pipe IPC to Wonderware.
Design of record: docs/plans/2026-06-11-alarm-followups-round2-design.md (committed master 3ad7960d).
Hard rules: stage by explicit path (never git add .); never stage sql_login.txt / src/Server/.../Host/pki/; never echo the gateway API key into a new tracked file; no force-push, no --no-verify; no Configuration entity / EF migration change (the historian queue is a standalone SQLite file, NOT the Config DB). Build on a feature branch off master.
Task 0: Branch + baseline
Classification: trivial
Estimated implement time: ~1 min
Parallelizable with: none
Files: (none — git only)
Steps:
- git checkout master && git switch -c feat/alarm-followups-r2 (off 3ad7960d).
- Confirm clean tree + green baseline: dotnet build ZB.MOM.WW.OtOpcUa.slnx → 0 errors.
- No commit (branch only).
Task 1: Extend AlarmTransitionEvent + carry Comment through the engine emission (B1)
Classification: standard
Estimated implement time: ~5 min
Parallelizable with: Task 3, Task 4 (different projects, but all three touch the Runtime/Core compilation graph; the executor serialises same-assembly builds, see Execution notes)
Files:
- Modify: src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Alerts/AlarmTransitionEvent.cs
- Modify: src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs (the ScriptedAlarmEvent record ≈ line 817 + BuildEmission + the ack/confirm/comment/shelve ops that receive the operator comment)
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs (OnEngineEmission, the AlarmTransitionEvent construction ≈ line 268–276)
- Test: tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ (the engine emission tests; find the ack/comment-transition test) + tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ScriptedAlarms/ScriptedAlarmHostActorTests.cs
Context: AlarmTransitionEvent (the alerts DPS payload) has 8 positional fields and lacks two that AlarmHistorianEvent needs: AlarmTypeName (Part-9 subtype) and Comment. ScriptedAlarmHostActor.OnEngineEmission builds the event from the engine emission e (ScriptedAlarmEvent), which already carries AlarmKind Kind but has NO Comment. Add the two fields and populate them. The Akka cluster uses the default serializer (no Hyperion/protobuf), so appending record fields is forward/backward compatible across a rolling restart.
Step 1: Extend AlarmTransitionEvent — append two trailing params WITH defaults so existing construction sites (tests) still compile and only ScriptedAlarmHostActor must populate:
public sealed record AlarmTransitionEvent(
string AlarmId,
string EquipmentPath,
string AlarmName,
string TransitionKind,
int Severity,
string Message,
string User,
DateTime TimestampUtc,
string AlarmTypeName = "AlarmCondition", // Part-9 subtype (LimitAlarm/DiscreteAlarm/OffNormalAlarm/AlarmCondition)
string? Comment = null); // operator comment on ack/confirm/comment/shelve transitions; null otherwise
Add <param> doc lines for both (TreatWarningsAsErrors).
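As a minimal sketch of why the trailing defaults keep existing construction sites compiling, the record from Step 1 can be exercised with both an unchanged 8-argument call and a call that names the new fields. The class AlarmTransitionEventExamples and all argument values are hypothetical, for illustration only.

```csharp
#nullable enable
using System;

// Same shape as the record in Step 1.
public sealed record AlarmTransitionEvent(
    string AlarmId,
    string EquipmentPath,
    string AlarmName,
    string TransitionKind,
    int Severity,
    string Message,
    string User,
    DateTime TimestampUtc,
    string AlarmTypeName = "AlarmCondition",
    string? Comment = null);

// Hypothetical call sites; the values are made up for illustration.
public static class AlarmTransitionEventExamples
{
    // An existing 8-argument call site: compiles unchanged and picks up the defaults.
    public static AlarmTransitionEvent Legacy() => new(
        "alarm-1", "Plant/Line1/Pump", "HighTemp", "Activated",
        500, "Temperature high", "system", DateTime.UtcNow);

    // A call site that populates the two new fields by name.
    public static AlarmTransitionEvent Extended() => new(
        "alarm-1", "Plant/Line1/Pump", "HighTemp", "Acknowledged",
        500, "Temperature high", "operator1", DateTime.UtcNow,
        AlarmTypeName: "LimitAlarm",
        Comment: "Checked on site");
}
```

Only the call site that needs the new fields has to change; every other caller keeps the "AlarmCondition" / null defaults.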
Step 2: Carry Comment through the engine. In ScriptedAlarmEngine.cs:
- Add string? Comment = null as a trailing param to the ScriptedAlarmEvent record (≈ line 817).
- In BuildEmission (where ScriptedAlarmEvent is constructed), populate Comment from the condition/op state for comment-bearing transitions. READ how the ack/confirm/AddComment/shelve ops receive the operator comment (they take a comment argument), then thread the latest operator comment into the emitted event (e.g. carry it on the condition state the emission reads, or pass it into BuildEmission). For engine-driven transitions (Activated/Cleared) Comment stays null.
Step 3: Failing tests first.
- Engine test: a transition produced by an ack/AddComment op with an operator comment yields a ScriptedAlarmEvent whose Comment equals that text; an Activated/Cleared emission has Comment == null. (Find the existing engine ack/comment test and assert the new field.)
- Host test (ScriptedAlarmHostActorTests): after an emission, the published AlarmTransitionEvent carries AlarmTypeName == e.Kind.ToString() and Comment == e.Comment (extend the existing alerts-publish assertion).
Run them → FAIL (fields don't exist / not populated).
Step 4: Populate in OnEngineEmission. In the AlarmTransitionEvent construction (≈ line 268–276) add:
AlarmTypeName: e.Kind.ToString(),
Comment: e.Comment,
Leave the Primary gate + the OPC UA node write (_publishActor.Tell) + everything else unchanged.
Step 5: Run the two suites:
dotnet test tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests and
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter ScriptedAlarmHostActor → green. Then dotnet build ZB.MOM.WW.OtOpcUa.slnx → 0 errors (a Commons-record change ripples; confirm whole-solution build).
Step 6: Commit by explicit path (the 3 source files + the 2 test files).
Standard: data-contract change (DPS-serialised) + engine emission threading. The careful part is the engine Comment plumbing: keep Activated/Cleared Comment == null.
Task 2: HistorianAdapterActor subscribes to alerts + translates (B2)
Classification: high-risk
Estimated implement time: ~5 min
Parallelizable with: none
Blocked by: Task 1 (needs the extended AlarmTransitionEvent)
Files:
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs
- Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/HistorianAdapterActorTests.cs
Context: HistorianAdapterActor already has the Primary gate (in its Receive<AlarmHistorianEvent>), the _localRole cache, and the redundancy-state subscription (PreStart). It does NOT yet subscribe to alerts and has no feeder. Add the alerts subscription + a translate path that runs through the same gate. The gate stays load-bearing: the Primary publishes once, DPS delivers to BOTH nodes' historian actors, the gate keeps only the Primary writing.
Step 1: Refactor the gate into a shared path. Extract the gate + enqueue out of the Receive<AlarmHistorianEvent> lambda into a private Historize(AlarmHistorianEvent evt) method:
private void Historize(AlarmHistorianEvent evt)
{
if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached) return;
_ = EnqueueAsync(evt);
}
Point the existing Receive<AlarmHistorianEvent>(evt => Historize(evt)); at it (behaviour unchanged).
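The exactly-once argument can be illustrated without Akka. This is a sketch under stated assumptions, not the actor itself: FakeHistorianNode and GateDemo are hypothetical stand-ins, the enum is a local copy carrying only the three roles named in this plan, and a plain loop stands in for the DPS fan-out.

```csharp
#nullable enable
using System.Collections.Generic;

// Local stand-in for the real enum; only the roles named in this plan.
public enum RedundancyRole { Primary, Secondary, Detached }

// Hypothetical stand-in for one node's historian actor: just the gate logic.
public sealed class FakeHistorianNode
{
    public RedundancyRole? LocalRole { get; set; }   // null = role not yet known
    public List<string> Enqueued { get; } = new();

    public void OnAlert(string alarmId)
    {
        // Same predicate as Historize: only Secondary/Detached are dropped.
        if (LocalRole is RedundancyRole.Secondary or RedundancyRole.Detached) return;
        Enqueued.Add(alarmId);
    }
}

public static class GateDemo
{
    // Simulates the fan-out: one publish is delivered to every subscribed node,
    // then the total number of historian writes across all nodes is returned.
    public static int TotalWrites(string alarmId, params FakeHistorianNode[] nodes)
    {
        foreach (var node in nodes) node.OnAlert(alarmId);
        var total = 0;
        foreach (var node in nodes) total += node.Enqueued.Count;
        return total;
    }
}
```

With one Primary and one Secondary node the single publish produces one write. A node whose role is still unknown (null) also writes, which matches the "historized by default" behaviour the tests in Step 2 expect.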
Step 2: Failing TestKit tests (extend HistorianAdapterActorTests; it has a RecordingSink + sends RedundancyStateChanged directly):
- Alerts_transition_is_historized_by_default: send an AlarmTransitionEvent (no role set) → the fake sink records ONE enqueue whose translated AlarmHistorianEvent has the right AlarmId/AlarmTypeName/EventKind/Severity bucket/Comment.
- Secondary_node_does_not_historize_alerts_transition: after a Secondary RedundancyStateChanged, an AlarmTransitionEvent records ZERO enqueues.
- Primary_node_historizes_alerts_transition: after Primary, ONE enqueue.
- Alerts_transition_translation_buckets_severity: severity int boundaries map to the right AlarmSeverity (e.g. 250→Low, 251→Medium, 750→High, 751→Critical); this can be a focused unit test on the translation helper if you extract one.
Run → FAIL (no Receive<AlarmTransitionEvent> yet).
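The severity boundary cases can be tabulated ahead of the implementation. The sketch below mirrors the ToSeverity helper given in Step 3. The local AlarmSeverity enum and the SeverityBuckets class are assumptions for illustration; the real enum members and namespace should be confirmed via AlarmHistorianEvent.cs.

```csharp
// Local stand-in; confirm the real enum members against AlarmHistorianEvent.cs.
public enum AlarmSeverity { Low, Medium, High, Critical }

// Hypothetical holder for the translation helper and its boundary table.
public static class SeverityBuckets
{
    // Upper-inclusive buckets: ..250 Low, 251..500 Medium, 501..750 High, 751.. Critical.
    public static AlarmSeverity ToSeverity(int s) => s switch
    {
        <= 250 => AlarmSeverity.Low,
        <= 500 => AlarmSeverity.Medium,
        <= 750 => AlarmSeverity.High,
        _ => AlarmSeverity.Critical,
    };

    // Each bucket edge plus its immediate successor.
    public static readonly (int Input, AlarmSeverity Expected)[] BoundaryCases =
    {
        (250, AlarmSeverity.Low),
        (251, AlarmSeverity.Medium),
        (500, AlarmSeverity.Medium),
        (501, AlarmSeverity.High),
        (750, AlarmSeverity.High),
        (751, AlarmSeverity.Critical),
        (1000, AlarmSeverity.Critical),
    };

    // True when every tabulated case maps to its expected bucket.
    public static bool AllBoundariesHold()
    {
        foreach (var (input, expected) in BoundaryCases)
            if (ToSeverity(input) != expected) return false;
        return true;
    }
}
```

In the real suite the same table would likely feed an xUnit theory. Keeping the edge and its successor side by side makes an off-by-one in the switch show up as a single failing row.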
Step 3: Implement.
- using ZB.MOM.WW.OtOpcUa.Commons.Messages.Alerts;
- In PreStart, ALSO subscribe to the alerts topic: _mediator.Tell(new Subscribe(ScriptedAlarmHostActor.AlertsTopic, Self)); (reuse the public const ScriptedAlarmHostActor.AlertsTopic = "alerts", same assembly). The existing Receive<SubscribeAck> no-op covers both acks.
- Add Receive<AlarmTransitionEvent>(evt => Historize(Translate(evt)));.
- Add a private static AlarmHistorianEvent Translate(AlarmTransitionEvent t):
private static AlarmHistorianEvent Translate(AlarmTransitionEvent t) => new(
    AlarmId: t.AlarmId,
    EquipmentPath: t.EquipmentPath,
    AlarmName: t.AlarmName,
    AlarmTypeName: t.AlarmTypeName,
    Severity: ToSeverity(t.Severity),
    EventKind: t.TransitionKind,
    Message: t.Message,
    User: t.User,
    Comment: t.Comment,
    TimestampUtc: t.TimestampUtc);
// Invert ScriptedAlarmHostActor.SeverityToInt's buckets (Low=250, Medium=500, High=750, Critical=1000).
private static AlarmSeverity ToSeverity(int s) => s switch
{
    <= 250 => AlarmSeverity.Low,
    <= 500 => AlarmSeverity.Medium,
    <= 750 => AlarmSeverity.High,
    _ => AlarmSeverity.Critical,
};
(Confirm the AlarmSeverity enum members + namespace via AlarmHistorianEvent.cs.)
Step 4: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter HistorianAdapter → green; full Runtime.Tests stays green.
Step 5: Commit by explicit path.
High-risk: redundancy exactly-once + a cluster DPS subscription. Do NOT remove the gate (it's what keeps the two nodes from double-writing). Keep the Receive<AlarmHistorianEvent> path (a future direct source can still use it).
Task 3: Config-gated durable sink (AddAlarmHistorian) + Host wiring (B3)
Classification: standard
Estimated implement time: ~5 min
Parallelizable with: Task 1, Task 4
Blocked by: none (independent of B1/B2)
Files:
- Create: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/AlarmHistorianOptions.cs (the config record)
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ServiceCollectionExtensions.cs (add AddAlarmHistorian(IConfiguration))
- Modify: the Host startup where AddOtOpcUaRuntime() is called (src/Server/ZB.MOM.WW.OtOpcUa.Host/Program.cs or the host bootstrap; find the call site) to call AddAlarmHistorian(configuration)
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.Host/appsettings.json (a commented/example AlarmHistorian section, default absent/disabled)
- Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/AlarmHistorianRegistrationTests.cs (new)
Context: Production defaults to NullAlarmHistorianSink (ServiceCollectionExtensions.cs:41 TryAddSingleton<IAlarmHistorianSink>(NullAlarmHistorianSink.Instance)). Add a config-gated registration that swaps in the real SqliteStoreAndForwardSink→WonderwareHistorianClient when an AlarmHistorian section is present + enabled.
Step 1: Options class (AlarmHistorianOptions.cs):
public sealed class AlarmHistorianOptions
{
public const string SectionName = "AlarmHistorian";
public bool Enabled { get; init; }
public string DatabasePath { get; init; } = "alarm-historian.db";
public string PipeName { get; init; } = "OtOpcUaHistorian";
public string SharedSecret { get; init; } = "";
public int BatchSize { get; init; } = 100;
}
Step 2: Failing registration tests (AlarmHistorianRegistrationTests, xUnit + Shouldly, build a ServiceCollection + in-memory IConfiguration):
- Section absent → resolved IAlarmHistorianSink is NullAlarmHistorianSink.
- Section present with Enabled=true → resolved IAlarmHistorianSink is a SqliteStoreAndForwardSink (assert the concrete type). Use a temp DB path.
- Section present with Enabled=false → stays NullAlarmHistorianSink.
Run → FAIL (AddAlarmHistorian doesn't exist).
Step 3: Implement AddAlarmHistorian in ServiceCollectionExtensions:
public static IServiceCollection AddAlarmHistorian(this IServiceCollection services, IConfiguration configuration)
{
var opts = configuration.GetSection(AlarmHistorianOptions.SectionName).Get<AlarmHistorianOptions>();
if (opts is not { Enabled: true }) return services; // leave the Null default from AddOtOpcUaRuntime
services.AddSingleton<IAlarmHistorianSink>(sp =>
{
var writer = new WonderwareHistorianClient(
new WonderwareHistorianClientOptions(PipeName: opts.PipeName, SharedSecret: opts.SharedSecret),
sp.GetService<ILogger<WonderwareHistorianClient>>());
var sink = new SqliteStoreAndForwardSink(
opts.DatabasePath, writer, sp.GetRequiredService<ILogger<SqliteStoreAndForwardSink>>(),
batchSize: opts.BatchSize);
sink.StartDrainLoop(TimeSpan.FromSeconds(5));
return sink;
});
return services;
}
- Use services.AddSingleton<IAlarmHistorianSink>(...) (NOT TryAdd) so it overrides the Null default registered by AddOtOpcUaRuntime. Order matters: AddAlarmHistorian must run AFTER AddOtOpcUaRuntime. Verify the Host calls them in that order, or have AddAlarmHistorian remove the prior registration first.
- Confirm the exact WonderwareHistorianClient ctor + WonderwareHistorianClientOptions record params + SqliteStoreAndForwardSink ctor + StartDrainLoop signature from the referenced files; adjust the call to match. Add the project references if the Host/Runtime don't already reference Driver.Historian.Wonderware.Client + Core.AlarmHistorian (check first; if a new project reference is needed, surface it).
- Dispose: ensure the sink (IDisposable) is disposed on shutdown (a singleton registered in DI is disposed by the container at host stop; verify the sink is IDisposable and DI owns it; it is).
Step 4: Host wiring — call builder.Services.AddAlarmHistorian(builder.Configuration); right after AddOtOpcUaRuntime(). Add the disabled example section to appsettings.json:
// "AlarmHistorian": { "Enabled": false, "DatabasePath": "alarm-historian.db", "PipeName": "OtOpcUaHistorian", "SharedSecret": "" }
(Keep it commented or Enabled:false so dev/docker-dev stay on Null.)
Step 5: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter AlarmHistorianRegistration → green; dotnet build ZB.MOM.WW.OtOpcUa.slnx → 0 errors.
Step 6: Commit by explicit path.
If wiring the real sink pulls a new project reference into the Host/Runtime that ripples (e.g. the Wonderware client drags Win-only deps), surface it before expanding: the sink construction may need to live in the Host project (which already references drivers) rather than Runtime. Prefer putting AddAlarmHistorian wherever the Wonderware client is already referenceable.
Task 4: Galaxy alarm-reconnect — acknowledger-recovery test + doc (A)
Classification: small
Estimated implement time: ~4 min
Parallelizable with: Task 1, Task 3
Blocked by: none
Files:
- Test: tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/Runtime/GatewayGalaxyAlarmAcknowledgerTests.cs (create if absent; else extend)
- Modify: docs/drivers/Galaxy.md (the Reconnect + Replay section)
Context: The alarm feed already recovers (its RunAsync re-invokes StreamAlarmsAsync on a transport fault — covered by GatewayGalaxyAlarmFeedTests.Reopens_stream_after_a_transport_fault). gRPC.NET channels auto-reconnect, so the same client recovers after a gateway restart; _ownedMxClient is intentionally NOT recreated, and gRPC keepalive isn't reachable (MxGatewayClientOptions is a NuGet package). This task verifies the acknowledger path + documents the behaviour. NO production code change.
Step 1: Acknowledger-recovery test. Read GatewayGalaxyAlarmAcknowledger.cs:29-48 (it holds a client/delegate and calls AcknowledgeAlarmAsync). Mirror however GatewayGalaxyAlarmFeedTests fakes the client/stream factory. Write a test where the acknowledge call fails once with a transient RpcException (or the test double's fault) and the NEXT call succeeds — asserting the acknowledger does not latch a dead state and a retry on the same client succeeds. If the acknowledger has no internal retry (it likely just forwards one call), assert instead that a second independent AcknowledgeAsync after a faulted first call still issues the unary call (i.e. the acknowledger is stateless w.r.t. faults — the gRPC channel handles reconnect). Name it Acknowledge_after_transient_fault_succeeds_on_retry.
Step 2: Run → it should pass immediately if the acknowledger is stateless (no fault latch). If it FAILS (the acknowledger caches a dead client / latches), that's a real gap — STOP and surface it (escalates A to a real fix, out of this task's verify-only scope).
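The shape of the "stateless w.r.t. faults" assertion can be sketched with plain delegates. Everything here is hypothetical: StatelessAcknowledger is not the real GatewayGalaxyAlarmAcknowledger, FlakyClient is an illustrative test double, and InvalidOperationException stands in for a transient RpcException so the sketch needs no gRPC reference.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Hypothetical forwarder: holds a delegate and issues one call per request, with no fault latch.
public sealed class StatelessAcknowledger
{
    private readonly Func<string, Task> _acknowledge;
    public StatelessAcknowledger(Func<string, Task> acknowledge) => _acknowledge = acknowledge;
    public Task AcknowledgeAsync(string alarmId) => _acknowledge(alarmId);
}

// Test double: the first call faults, later calls succeed; every attempt is counted.
public sealed class FlakyClient
{
    public int Calls { get; private set; }

    public Task AcknowledgeAlarmAsync(string alarmId)
    {
        Calls++;
        return Calls == 1
            ? Task.FromException(new InvalidOperationException("transient transport fault"))
            : Task.CompletedTask;
    }
}

public static class AcknowledgerRecoveryDemo
{
    // Returns whether the first call faulted and how many calls reached the client.
    public static async Task<(bool FirstFaulted, int Calls)> RunAsync()
    {
        var client = new FlakyClient();
        var ack = new StatelessAcknowledger(client.AcknowledgeAlarmAsync);

        var firstFaulted = false;
        try { await ack.AcknowledgeAsync("alarm-1"); }
        catch (InvalidOperationException) { firstFaulted = true; }

        // The second, independent call must still reach the same client and succeed.
        await ack.AcknowledgeAsync("alarm-1");
        return (firstFaulted, client.Calls);
    }
}
```

If the real acknowledger latched a dead state, the second call would never reach the client and the call count would stay at 1, which is the gap Step 2 says to stop and surface.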
Step 3: Document in docs/drivers/Galaxy.md (Reconnect + Replay section): a short note — the session-less alarm feed/acknowledger run on _ownedMxClient, which is not recreated on reconnect by design; the feed's own re-invoke loop (GatewayGalaxyAlarmFeed.RunAsync, ~5s backoff) plus gRPC.NET channel auto-reconnect recover the alarm stream + acks after a gateway restart. Channel-level keepalive hardening would require exposing knobs on the MxGatewayClient package (sibling repo) — noted as a future option, not needed today.
Step 4: Commit by explicit path (the test + the doc).
Task 5: Full-suite gate + docs + finish
Classification: small
Estimated implement time: ~4 min
Parallelizable with: none
Blocked by: Task 2, Task 3, Task 4
Files: docs/AlarmHistorian.md (or docs/AlarmTracking.md) — note the historian is now fed from the alerts topic (scripted alarms, Primary-gated, exactly-once) + the config-gated real sink (AlarmHistorian appsettings section). Keep terse.
Steps:
- Update the historian doc(s) to reflect: HistorianAdapterActor now subscribes to alerts and historizes scripted-alarm transitions exactly-once (Primary-gated); the durable SqliteStoreAndForwardSink→Wonderware sink is enabled via the AlarmHistorian config section (else Null); Galaxy/AB-CIP historization is out of scope (AVEVA native / future).
- Run the FULL suite: dotnet test ZB.MOM.WW.OtOpcUa.slnx and confirm all affected unit suites are green; the ONLY failures should be the known pre-existing env/integration ones (AbCip/AbLegacy IntegrationTests fixtures, OpcUaServer.IntegrationTests PKI, Host.IntegrationTests deploy-Rejected). Capture full output (don't pipe through tail; the pipe masks the real exit code).
- Commit docs by explicit path.
- Run superpowers-extended-cc:finishing-a-development-branch (verify tests → present the 4 options → execute the user's choice).
Execution notes
- Dependency spine: T0 → {T1, T3, T4 mutually parallel by files} ; T2 after T1 ; T5 after T2/T3/T4.
- Same-assembly build contention: T1 (Runtime/ScriptedAlarms) and T3 (Runtime/ServiceCollectionExtensions) both compile into ZB.MOM.WW.OtOpcUa.Runtime; T2 also. When executing in one shared working tree, serialise build/test of same-assembly tasks even though their files are disjoint (concurrent dotnet build/test of the same project collide on obj/ and a mid-edit sibling breaks the build). T4 (Driver.Galaxy) is the only fully-independent project and is safe to run concurrently. (This is the lesson from round 1.)
- Classifications drive review: T1/T3 standard (parallel spec+code review). T2 high-risk (serial spec→code + final integration review). T4 small (code review only). T0 trivial.
- No bUnit / no docker-dev gate: there's no UI change, and the real sink is config-gated (stays Null on docker-dev), so exactly-once is proven by TestKit, not a live rig. An optional end-to-end (configure the AlarmHistorian section + a real/fake pipe) is NOT required for done.
- Done = build clean + dotnet test green (modulo the known pre-existing env/integration failures).