Alarm Historian Follow-ups Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.
Goal: Close the six code follow-ups left open by the alarm-followups round-2 work (f64f7ce6): honor the per-alarm HistorizeToAveva opt-out at the durable historian write, expose drain/capacity/retention config knobs, validate the historian config at startup, record the real operator for shelve/enable/disable transitions, and fix two SqliteStoreAndForwardSink thread-safety nits.
Architecture: All six are bounded hardening fixes on the already-shipped historian + scripted-alarm surface — no new components. The load-bearing one (T1) threads a HistorizeToAveva flag through the engine emission → the DPS-serialized AlarmTransitionEvent → a gate in HistorianAdapterActor that suppresses only the durable write (the live /alerts UI keeps every transition). The rest are localized: options/registration hardening (T2), a one-line TransitionUser widening (T3), and field-visibility fixes on the SQLite sink (T4).
Tech Stack: C# / .NET 10, Akka.NET (cluster DistributedPubSub + TestKit), xUnit + Shouldly, Serilog, EF Core (untouched here), SQLite store-and-forward.
Source of truth: pending.md items 1–6 (item 7, the docker-dev rig cleanup, is an operational deferral and is out of scope). Round-2 design: docs/plans/2026-06-11-alarm-followups-round2-design.md.
Base: Branch off master @ f64f7ce6.
Hard rules (carried from round 2): stage by explicit path (never git add .); never stage sql_login.txt or src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/; never echo the gateway API key into a new tracked file; no force-push, no --no-verify; NO Configuration entity / EF migration change — the historian queue is a standalone SQLite file, NOT the Config DB; commit per task by explicit path.
Same-assembly build contention: T1, T2, T3 all touch the Runtime assembly (and T1/T3 share ScriptedAlarmHostActor.cs) → they MUST run serially (build/test collide on obj/). T4 touches only Core.AlarmHistorian → it is the one task safe to run concurrently with the Runtime chain.
Task 0: Branch
Classification: trivial
Estimated implement time: ~1 min
Parallelizable with: none
Files: none (git only)
Step 1: Create the feature branch off master
git checkout master
git status --short # expect only untracked pending.md
git checkout -b feat/alarm-historian-followups
git rev-parse --short HEAD # expect f64f7ce6
Expected: on feat/alarm-historian-followups, HEAD f64f7ce6.
(Do NOT commit pending.md — it is an untracked working note and stays untracked.)
Task 1: Honor HistorizeToAveva opt-out at the durable write (pending #1)
Classification: high-risk
Estimated implement time: ~5 min
Parallelizable with: Task 4
Why high-risk: changes a DistributedPubSub-serialized data contract (AlarmTransitionEvent) and the warm-redundant historian gate. Mis-gating either double-writes, drops rows, or (the trap pending.md calls out) silently suppresses the live UI instead of just the durable write.
The design (locked):
- The flag is already on ScriptedAlarmDefinition.HistorizeToAveva (default true) and reaches the engine via ToDefinition (ScriptedAlarmHostActor.cs:477). It is NOT yet on the emitted event, so the historian can't see it.
- Carry it: ScriptedAlarmEvent (engine) → AlarmTransitionEvent (Commons, DPS-serialized) → gate in HistorianAdapterActor.
- Gate the durable write ONLY. ScriptedAlarmHostActor.OnEngineEmission must keep publishing every transition to alerts (its existing Primary gate is unchanged) so the live /alerts UI is unaffected. The HistorizeToAveva check lives solely in HistorianAdapterActor.Receive<AlarmTransitionEvent>.
- Default true, non-nullable. On a rolling restart Akka's JSON serializer applies default(bool) = false to an old-format message's missing field (the same quirk the AlarmTypeName null-coalesce in Translate guards). This is safe here because the node that writes is always the Primary (or a boot-window node), and that node also published the event, so the written event always carries the publisher's own same-version flag value. A cross-version old→new flow only reaches the Secondary, which never writes. Document this reasoning in a code comment.
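The "gate the durable write only" rule above can be sketched as two independent decisions. This is a minimal model, not the actors' code: HistorizeGateSketch and its method names are hypothetical, and the boolean inputs stand in for the existing Primary gate in OnEngineEmission and for ShouldHistorize() in HistorianAdapterActor.

```csharp
// Hypothetical model for illustration only; none of these names exist in the codebase.
public static class HistorizeGateSketch
{
    // Live fan-out decision: depends on the existing Primary gate and
    // deliberately takes no HistorizeToAveva input at all.
    public static bool ShouldPublishLive(bool passesPrimaryGate) => passesPrimaryGate;

    // Durable-write decision: the existing role gate AND the per-alarm flag.
    public static bool ShouldWriteDurable(bool shouldHistorize, bool historizeToAveva) =>
        shouldHistorize && historizeToAveva;
}
```

For an opted-out alarm on the Primary, ShouldPublishLive(true) is true while ShouldWriteDurable(true, false) is false, which is the split the design asks for: the UI sees the transition, the durable sink does not.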
Files:
- Modify: src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs (ScriptedAlarmEvent record ~837; BuildEmission ~592)
- Modify: src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Alerts/AlarmTransitionEvent.cs (record ~19-29)
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs (OnEngineEmission ~289-301)
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs (ctor Receive<AlarmTransitionEvent> ~74)
- Test: tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs
- Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/HistorianAdapterActorTests.cs
Step 1: Write the failing engine test (flag carried through emission)
In ScriptedAlarmEngineTests.cs, add a test that an alarm defined with HistorizeToAveva: false emits a ScriptedAlarmEvent whose HistorizeToAveva is false (and true when the definition is true). Follow the existing engine-test setup pattern in that file (register a definition, drive the predicate active, capture OnEvent). Assert:
emitted.HistorizeToAveva.ShouldBeFalse();
Run (expect FAIL — ScriptedAlarmEvent has no HistorizeToAveva):
dotnet test tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests --filter "FullyQualifiedName~HistorizeToAveva"
Step 2: Add the field to ScriptedAlarmEvent + populate in BuildEmission
ScriptedAlarmEvent record (append after Comment):
public sealed record ScriptedAlarmEvent(
    string AlarmId,
    string EquipmentPath,
    string AlarmName,
    AlarmKind Kind,
    AlarmSeverity Severity,
    string Message,
    AlarmConditionState Condition,
    EmissionKind Emission,
    DateTime TimestampUtc,
    string? Comment = null,
    bool HistorizeToAveva = true);
BuildEmission — add to the new ScriptedAlarmEvent(...) initializer (the definition is on state.Definition):
Comment: kind switch
{
    EmissionKind.Acknowledged => condition.LastAckComment,
    EmissionKind.Confirmed => condition.LastConfirmComment,
    EmissionKind.CommentAdded => condition.Comments.Count == 0 ? null : condition.Comments[^1].Text,
    _ => null,
},
HistorizeToAveva: state.Definition.HistorizeToAveva);
Step 3: Add the field to AlarmTransitionEvent (Commons contract)
Append after Comment (additive, default true — see the rolling-restart note above), and add the <param> doc line:
/// <param name="HistorizeToAveva">When <c>false</c>, the durable historian sink suppresses this transition (the live <c>alerts</c> fan-out is unaffected). Defaults to <c>true</c>. On a rolling restart an old-format message deserializes this as <c>false</c> (CLR default); that is safe because the writing node is always the same-version publisher — see <c>HistorianAdapterActor</c>.</param>
public sealed record AlarmTransitionEvent(
    string AlarmId,
    string EquipmentPath,
    string AlarmName,
    string TransitionKind,
    int Severity,
    string Message,
    string User,
    DateTime TimestampUtc,
    string AlarmTypeName = "AlarmCondition",
    string? Comment = null,
    bool HistorizeToAveva = true);
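Because the new parameter is trailing and optional, call sites written before this change keep compiling and receive true. A minimal sketch on a trimmed-down record follows; TransitionSketch and AdditiveFieldSketch are hypothetical names used only to show the shape, not the real contract.

```csharp
#nullable enable

// Hypothetical, trimmed-down stand-in for the contract; only the parameter shape matters.
public sealed record TransitionSketch(
    string AlarmId,
    string? Comment = null,
    bool HistorizeToAveva = true);

public static class AdditiveFieldSketch
{
    // A call site written before the field existed: unchanged, picks up the default (true).
    public static TransitionSketch LegacyCallSite() => new("alarm-1", "note");

    // A new call site opts out by naming the trailing argument.
    public static TransitionSketch OptedOut() => new("alarm-1", HistorizeToAveva: false);
}
```

This only covers in-process construction. What a deserializer does with a missing field is a separate question, which is what the rolling-restart note addresses.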
Step 4: Populate it in ScriptedAlarmHostActor.OnEngineEmission
Add to the new AlarmTransitionEvent(...) initializer (after Comment: e.Comment):
AlarmTypeName: e.Kind.ToString(),
Comment: e.Comment,
HistorizeToAveva: e.HistorizeToAveva);
Do NOT touch the Primary _localRole gate below it — the alerts publish stays ungated by HistorizeToAveva (live UI must see every transition).
Step 5: Gate the durable write in HistorianAdapterActor
Change the Receive<AlarmTransitionEvent> handler (currently ~74) to also require the flag:
// HistorizeToAveva=false is a per-alarm opt-out of DURABLE historization only — the live
// `alerts` fan-out (browser UI) already happened upstream. Gate the sink write, not the publish.
Receive<AlarmTransitionEvent>(t => { if (ShouldHistorize() && t.HistorizeToAveva) _ = EnqueueAsync(Translate(t)); });
Step 6: Write the failing historian-gate test, then make it pass
In HistorianAdapterActorTests.cs: add a bool historizeToAveva = true parameter to the SampleTransition helper (thread it into the AlarmTransitionEvent), then add:
/// <summary>Per-alarm opt-out (pending #1): a Primary node MUST NOT write to the durable sink when the
/// transition carries HistorizeToAveva=false — even though it would otherwise historize. The live alerts
/// fan-out is upstream and unaffected.</summary>
[Fact]
public void Primary_node_does_not_historize_when_opted_out()
{
    var (actor, sink) = CreateActor();
    TellRedundancyRole(actor, RedundancyRole.Primary);
    actor.Tell(SampleTransition(historizeToAveva: false));
    // Give the actor time to (not) enqueue.
    ExpectNoMsg(Settle);
    sink.EnqueueCount.ShouldBe(0);
}
(Match the existing helpers — CreateActor, TellRedundancyRole, Settle, ExpectNoMsg. If ExpectNoMsg isn't the local idiom, mirror the existing Secondary_node_does_not_historize test's settle-then-assert-0 shape.)
Step 7: Run the full affected suites
dotnet test tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~HistorianAdapterActor|FullyQualifiedName~ScriptedAlarmHostActor"
Expected: PASS. The existing ScriptedAlarmHostActorTests already define alarms with HistorizeToAveva: true/false, so confirm those still pass (the alerts publish must still fire for the false alarm).
Step 8: Commit
git add src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs \
src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Alerts/AlarmTransitionEvent.cs \
src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs \
src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs \
tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs \
tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/HistorianAdapterActorTests.cs
git commit -m "feat(historian): honor per-alarm HistorizeToAveva opt-out at the durable write"
Task 2: Historian config knobs + startup validation (pending #2, #3, #4)
Classification: standard
Estimated implement time: ~4 min
Parallelizable with: Task 4
Files:
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/AlarmHistorianOptions.cs
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ServiceCollectionExtensions.cs (AddAlarmHistorian ~67-89)
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.Host/appsettings.json (the existing disabled AlarmHistorian example section)
- Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/AlarmHistorianRegistrationTests.cs
Step 1: Write the failing tests
In AlarmHistorianRegistrationTests.cs, add:
- A binding test: a config with DrainIntervalSeconds, Capacity, DeadLetterRetentionDays set binds them onto AlarmHistorianOptions.
- Validate() tests (pure, no DI):
  - { Enabled = true, SharedSecret = "" } → returns a warning mentioning SharedSecret;
  - { Enabled = true, DatabasePath = "alarm-historian.db" } (relative) → returns a warning mentioning the relative path;
  - { Enabled = true, SharedSecret = "x", DatabasePath = "/abs/h.db" } → returns empty;
  - { Enabled = false, SharedSecret = "" } → returns empty (disabled is never a misconfig).
[Fact]
public void Validate_warns_on_empty_shared_secret_when_enabled()
{
    var opts = new AlarmHistorianOptions { Enabled = true, SharedSecret = "", DatabasePath = "/var/h.db" };
    opts.Validate().ShouldContain(w => w.Contains("SharedSecret"));
}
[Fact]
public void Validate_warns_on_relative_database_path_when_enabled()
{
    var opts = new AlarmHistorianOptions { Enabled = true, SharedSecret = "s", DatabasePath = "alarm-historian.db" };
    opts.Validate().ShouldContain(w => w.Contains("DatabasePath"));
}
[Fact]
public void Validate_is_silent_when_correctly_configured()
{
    var opts = new AlarmHistorianOptions { Enabled = true, SharedSecret = "x", DatabasePath = "/abs/h.db" };
    opts.Validate().ShouldBeEmpty();
}
[Fact]
public void Validate_is_silent_when_disabled()
{
    new AlarmHistorianOptions { Enabled = false, SharedSecret = "" }.Validate().ShouldBeEmpty();
}
Run (expect FAIL — no Validate, no new options):
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~AlarmHistorianRegistration"
Step 2: Add the knobs + Validate() to AlarmHistorianOptions
/// <summary>Seconds between drain-worker ticks. Defaults to 5.</summary>
public int DrainIntervalSeconds { get; init; } = 5;
/// <summary>Maximum queued rows before the sink evicts the oldest. Defaults to 1,000,000
/// (<c>SqliteStoreAndForwardSink.DefaultCapacity</c>).</summary>
public long Capacity { get; init; } = 1_000_000;
/// <summary>Days to retain dead-lettered rows before purge. Defaults to 30.</summary>
public int DeadLetterRetentionDays { get; init; } = 30;
/// <summary>Returns operator-facing misconfiguration warnings for an <c>Enabled</c> historian
/// (empty when disabled or correctly configured). Pure — the registration logs each entry.</summary>
/// <returns>Zero or more human-readable warning messages.</returns>
public IReadOnlyList<string> Validate()
{
var warnings = new List<string>();
if (!Enabled) return warnings;
if (string.IsNullOrWhiteSpace(SharedSecret))
warnings.Add("AlarmHistorian:SharedSecret is empty while the historian is enabled — the Wonderware sidecar Hello frame will carry an empty secret.");
if (!Path.IsPathRooted(DatabasePath))
warnings.Add($"AlarmHistorian:DatabasePath '{DatabasePath}' is relative — it resolves against the process working directory (e.g. System32 for a Windows service). Set an absolute path.");
return warnings;
}
(Add using System.Collections.Generic; and using System.IO; if the file lacks them.)
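For reference, this is how Path.IsPathRooted treats the inputs the tests above use. RootedPathSketch is a hypothetical wrapper added only for illustration; the real check sits inline in Validate().

```csharp
#nullable enable
using System.IO;

// Hypothetical helper for illustration; mirrors the rooted-path check in Validate().
public static class RootedPathSketch
{
    // true  => Validate() would NOT add the relative-path warning
    // false => Validate() would add it (includes null and empty paths)
    public static bool IsRooted(string? databasePath) => Path.IsPathRooted(databasePath);
}
```

A bare file name such as "alarm-historian.db" is not rooted, while "/abs/h.db" is rooted on both Windows and Unix-like systems, so the tests do not depend on the platform. One caveat: on Windows, IsPathRooted also accepts drive-relative forms such as "C:h.db". Path.IsPathFullyQualified is stricter, but on Windows it rejects "/abs/h.db", so switching to it would make the "/abs/h.db" test platform-dependent.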
Step 3: Thread the knobs + log warnings in AddAlarmHistorian
After the existing if (opts is not { Enabled: true }) return services; guard (so only on the enabled path), log each warning and pass the new knobs to the sink:
foreach (var warning in opts.Validate())
    Serilog.Log.Logger.ForContext<SqliteStoreAndForwardSink>().Warning("{HistorianConfigWarning}", warning);
services.AddSingleton<IAlarmHistorianSink>(sp =>
{
    var sink = new SqliteStoreAndForwardSink(
        opts.DatabasePath,
        writerFactory(opts, sp),
        Serilog.Log.Logger.ForContext<SqliteStoreAndForwardSink>(),
        batchSize: opts.BatchSize,
        capacity: opts.Capacity,
        deadLetterRetention: TimeSpan.FromDays(opts.DeadLetterRetentionDays));
    sink.StartDrainLoop(TimeSpan.FromSeconds(opts.DrainIntervalSeconds));
    return sink;
});
return services;
Step 4: Update the appsettings example
In Host/appsettings.json, extend the disabled AlarmHistorian example section with the three new keys (DrainIntervalSeconds, Capacity, DeadLetterRetentionDays) so operators see them. Keep Enabled: false. Make a targeted edit to that section only.
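As a sketch of the target shape, the three new keys with the defaults from AlarmHistorianOptions would look roughly like this. The keys already present in that section are not reproduced here and should stay exactly as they are; the fragment shows only the additions next to Enabled.

```json
{
  "AlarmHistorian": {
    "Enabled": false,
    "DrainIntervalSeconds": 5,
    "Capacity": 1000000,
    "DeadLetterRetentionDays": 30
  }
}
```

Writing the defaults out explicitly is a choice: it changes no behavior, but it makes the knobs discoverable to operators reading the file.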
Step 5: Run + commit
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~AlarmHistorianRegistration"
git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/AlarmHistorianOptions.cs \
src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ServiceCollectionExtensions.cs \
src/Server/ZB.MOM.WW.OtOpcUa.Host/appsettings.json \
tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/AlarmHistorianRegistrationTests.cs
git commit -m "feat(historian): drain/capacity/retention config knobs + startup config-warning validation"
Task 3: Record the real operator for shelve/enable/disable transitions (pending #5)
Classification: small
Estimated implement time: ~3 min
Parallelizable with: Task 4
The design: every shelve/unshelve/enable/disable op in Part9StateMachine appends an audit AlarmComment carrying the acting user (ApplyOneShotShelve → "ShelveOneShot", ApplyTimedShelve → "ShelveTimed", ApplyUnshelve → "Unshelve", ApplyEnable → "Enable", ApplyDisable → "Disable"), and auto-unshelve appends with user="system". So the operator is already the last Comments entry on the emitted condition — exactly the CommentAdded case. No Core/state change needed; widen the TransitionUser switch. (Acknowledged/Confirmed keep using LastAckUser/LastConfirmUser — leave them.)
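The "operator is the last Comments entry" observation can be sketched in isolation. AuditEntrySketch and LastAuditUserSketch are hypothetical names, and the real AlarmComment type likely carries more fields than shown here.

```csharp
using System.Collections.Generic;

// Hypothetical minimal audit entry; stands in for AlarmComment for illustration only.
public sealed record AuditEntrySketch(string User, string Kind);

public static class LastAuditUserSketch
{
    // Returns the user on the most recent audit entry, or "system" when there is none.
    public static string Resolve(List<AuditEntrySketch> comments) =>
        comments.Count > 0 ? comments[^1].User : "system";
}
```

With entries [("system", "Unshelve"), ("carol", "ShelveOneShot")], Resolve returns "carol". This relies on each shelve/enable/disable op appending its audit entry before the event is emitted, as described above.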
Files:
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs (TransitionUser ~508-514)
- Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ScriptedAlarms/ScriptedAlarmHostActorTests.cs
Step 1: Write the failing test
Mirror the existing Acknowledge/AddComment command tests (they drive a command via the alarm-commands topic and FishForMessage<AlarmTransitionEvent> on the alerts probe). Add a test that an operator shelve command (Operation: "OneShotShelve", User: "carol") on an active alarm yields a transition with evt.User.ShouldBe("carol") (it currently returns "system"). Use the exact command-operation string the existing inbound tests use (check the host actor's command dispatch for the canonical OneShotShelve/Shelve spelling).
Run (expect FAIL — currently "system"):
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~ScriptedAlarmHostActor"
Step 2: Widen TransitionUser
private static string TransitionUser(ScriptedAlarmEvent e) => e.Emission switch
{
    EmissionKind.Acknowledged => e.Condition.LastAckUser ?? "system",
    EmissionKind.Confirmed => e.Condition.LastConfirmUser ?? "system",
    // Shelve / unshelve / enable / disable / comment ops each append the acting user as the last
    // audit entry on the emitted condition (auto-unshelve appends "system"); read it from there.
    EmissionKind.CommentAdded
        or EmissionKind.Shelved
        or EmissionKind.Unshelved
        or EmissionKind.Enabled
        or EmissionKind.Disabled => e.Condition.Comments.Count > 0 ? e.Condition.Comments[^1].User : "system",
    _ => "system",
};
Step 3: Run + commit
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~ScriptedAlarmHostActor"
git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs \
tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ScriptedAlarms/ScriptedAlarmHostActorTests.cs
git commit -m "fix(alarms): historize the real operator for shelve/unshelve/enable/disable transitions"
Task 4: SqliteStoreAndForwardSink thread-safety nits (pending #6)
Classification: small
Estimated implement time: ~3 min
Parallelizable with: Task 1, Task 2, Task 3 (different assembly — Core.AlarmHistorian)
The fixes (visibility-only, no functional behavior change):
- _backoffIndex (~74) is written by BumpBackoff/ResetBackoff (drain thread) and read by CurrentBackoff (status query thread) without a memory barrier → mark volatile.
- The capacity-eviction log (~659-661) reads _evictedCount outside _statusLock (it's incremented inside the lock on 658) → capture the total inside the lock and log the captured value.
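Both fixes follow standard patterns, sketched here on a hypothetical class. EvictionCounterSketch is not the sink; it reuses the field names above only to make the mapping obvious.

```csharp
// Hypothetical stand-in for illustration; the real fields live in SqliteStoreAndForwardSink.
public sealed class EvictionCounterSketch
{
    private readonly object _statusLock = new();
    private long _evictedCount;          // guarded by _statusLock
    private volatile int _backoffIndex;  // single writer, read from another thread

    // Increment and snapshot inside the same critical section, so the returned
    // total is exactly the value this call produced even under concurrent evictions.
    public long RecordEviction(long evicted)
    {
        long lifetimeEvicted;
        lock (_statusLock)
        {
            _evictedCount += evicted;
            lifetimeEvicted = _evictedCount;
        }
        return lifetimeEvicted; // safe to log outside the lock
    }

    public void SetBackoff(int index) => _backoffIndex = index;

    public int CurrentBackoff => _backoffIndex;
}
```

The two fields get different treatment for a reason: C# does not permit volatile on a long, and a 64-bit read is not guaranteed atomic on 32-bit platforms, so the counter stays under the lock, while the int index only needs visibility across threads.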
Files:
- Modify: src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs
Step 1: Make _backoffIndex volatile
private volatile int _backoffIndex;
Step 2: Capture _evictedCount under the lock for the log message
Replace lines ~658-661:
Interlocked.Add(ref _queuedRowCount, -toEvict);
long lifetimeEvicted;
lock (_statusLock) { _evictedCount += toEvict; lifetimeEvicted = _evictedCount; }
_logger.Warning(
"Historian queue at capacity {Cap} — evicted {Count} oldest row(s) to make room (lifetime evictions: {Total})",
_capacity, toEvict, lifetimeEvicted);
Step 3: Build + run the sink suite (guards no regression)
dotnet test tests/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian.Tests
Expected: PASS (no behavior change; these are visibility/lock-scope fixes). No new test — thread-safety here isn't deterministically unit-testable; correctness is by inspection + the existing suite.
Step 4: Commit
git add src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs
git commit -m "fix(historian): volatile _backoffIndex + read _evictedCount under lock (thread-safety)"
Task 5: Full-suite gate + docs + close out
Classification: small
Estimated implement time: ~4 min
Parallelizable with: none (depends on Tasks 1-4)
Files:
- Modify: docs/ScriptedAlarms.md and/or docs/AlarmTracking.md (historian config + HistorizeToAveva opt-out note)
- Modify: pending.md (strike the now-resolved items 1-6; keep item 7)
Step 1: Build the whole solution
dotnet build ZB.MOM.WW.OtOpcUa.slnx
Expected: 0 errors.
Step 2: Run the full suite, capturing the real exit code
dotnet test ZB.MOM.WW.OtOpcUa.slnx 2>&1 | tee /tmp/followups-test.log; echo "EXIT=${PIPESTATUS[0]}"
Expected: only the known pre-existing env/integration failures (AbCip/AbLegacy fixtures, OpcUaServer DualEndpoint/PKI, Host EquipmentNamespace deploy-Rejected) — NO new failures in Core.ScriptedAlarms.Tests, Core.AlarmHistorian.Tests, or Runtime.Tests. If anything in those three regresses, stop and fix before finishing. (Do NOT pipe to tail — it masks the exit code.)
Step 3: Document the new behavior
In the historian section of docs/ScriptedAlarms.md (and/or docs/AlarmTracking.md): note that (a) HistorizeToAveva=false now suppresses the durable historian write while the live /alerts UI still shows the transition; (b) the AlarmHistorian section exposes DrainIntervalSeconds, Capacity, DeadLetterRetentionDays; (c) an enabled historian logs a startup warning on empty SharedSecret or a relative DatabasePath.
Step 4: Strike resolved items from pending.md
Remove items 1-6 from pending.md (or mark them resolved in this branch) and leave item 7 (docker-dev rig cleanup, operational deferral). Note: pending.md is untracked — update it but it need not be committed unless the user asks.
Step 5: Commit the docs
git add docs/ScriptedAlarms.md docs/AlarmTracking.md # stage only the file(s) actually edited in Step 3
git commit -m "docs(historian): HistorizeToAveva opt-out + config knobs + startup validation"
Step 6: Finish the branch
Use superpowers-extended-cc:finishing-a-development-branch (verify tests → present the 4 options → execute the user's choice).
Dependency spine
T0 (branch)
 ├─ T1 (HistorizeToAveva opt-out)       [high-risk] ─┐
 │   └─ T2 (config knobs + validation)  [standard]   │  (Runtime serial chain;
 │       └─ T3 (TransitionUser)         [small]      │   T1/T3 share a file)
 └─ T4 (sink thread-safety)             [small]     ─┘  (Core.AlarmHistorian, parallel)
        └──────────────┬───────────────┘
           T5 (gate + docs + finish)  [blocked by T1,T2,T3,T4]
T4 is the only task safe to run concurrently with the Runtime chain. T1 → T2 → T3 serialize (same assembly; T1 and T3 edit ScriptedAlarmHostActor.cs).