Files
lmxopcua/docs/plans/2026-06-11-alarm-historian-followups.md
T

24 KiB
Raw Blame History

Alarm Historian Follow-ups Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.

Goal: Close the six code follow-ups left open by the alarm-followups round-2 work (f64f7ce6): honor the per-alarm HistorizeToAveva opt-out at the durable historian write, expose drain/capacity/retention config knobs, validate the historian config at startup, record the real operator for shelve/enable/disable transitions, and fix two SqliteStoreAndForwardSink thread-safety nits.

Architecture: All six are bounded hardening fixes on the already-shipped historian + scripted-alarm surface — no new components. The load-bearing one (T1) threads a HistorizeToAveva flag through the engine emission → the DPS-serialized AlarmTransitionEvent → a gate in HistorianAdapterActor that suppresses only the durable write (the live /alerts UI keeps every transition). The rest are localized: options/registration hardening (T2), a one-line TransitionUser widening (T3), and field-visibility fixes on the SQLite sink (T4).

Tech Stack: C# / .NET 10, Akka.NET (cluster DistributedPubSub + TestKit), xUnit + Shouldly, Serilog, EF Core (untouched here), SQLite store-and-forward.

Source of truth: pending.md items 16 (item 7, the docker-dev rig cleanup, is an operational deferral and is out of scope). Round-2 design: docs/plans/2026-06-11-alarm-followups-round2-design.md.

Base: Branch off master @ f64f7ce6.

Hard rules (carried from round 2): stage by explicit path (never git add .); never stage sql_login.txt or src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/; never echo the gateway API key into a new tracked file; no force-push, no --no-verify; NO Configuration entity / EF migration change — the historian queue is a standalone SQLite file, NOT the Config DB; commit per task by explicit path.

Same-assembly build contention: T1, T2, T3 all touch the Runtime assembly (and T1/T3 share ScriptedAlarmHostActor.cs) → they MUST run serially (build/test collide on obj/). T4 touches only Core.AlarmHistorian → it is the one task safe to run concurrently with the Runtime chain.


Task 0: Branch

Classification: trivial Estimated implement time: ~1 min Parallelizable with: none

Files: none (git only)

Step 1: Create the feature branch off master

git checkout master
git status --short            # expect only untracked pending.md
git checkout -b feat/alarm-historian-followups
git rev-parse --short HEAD    # expect f64f7ce6

Expected: on feat/alarm-historian-followups, HEAD f64f7ce6.

(Do NOT commit pending.md — it is an untracked working note and stays untracked.)


Task 1: Honor HistorizeToAveva opt-out at the durable write (pending #1)

Classification: high-risk Estimated implement time: ~5 min Parallelizable with: Task 4

Why high-risk: changes a DistributedPubSub-serialized data contract (AlarmTransitionEvent) and the warm-redundant historian gate. Mis-gating either double-writes, drops rows, or (the trap pending.md calls out) silently suppresses the live UI instead of just the durable write.

The design (locked):

  • The flag is already on ScriptedAlarmDefinition.HistorizeToAveva (default true) and reaches the engine via ToDefinition (ScriptedAlarmHostActor.cs:477). It is NOT yet on the emitted event, so the historian can't see it.
  • Carry it: ScriptedAlarmEvent (engine) → AlarmTransitionEvent (Commons, DPS-serialized) → gate in HistorianAdapterActor.
  • Gate the durable write ONLY. ScriptedAlarmHostActor.OnEngineEmission must keep publishing every transition to alerts (its existing Primary gate is unchanged) so the live /alerts UI is unaffected. The HistorizeToAveva check lives solely in HistorianAdapterActor.Receive<AlarmTransitionEvent>.
  • Default true, non-nullable. On a rolling restart Akka's JSON serializer applies default(bool) = false to an old-format message's missing field (the same quirk the AlarmTypeName null-coalesce in Translate guards). This is safe here because the node that writes is always the Primary (or a boot-window node), and that node also published the event — so the written event always carries the publisher's own same-version flag value. A cross-version old→new flow only reaches the Secondary, which never writes. Document this reasoning in a code comment.

Files:

  • Modify: src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs (ScriptedAlarmEvent record ~837; BuildEmission ~592)
  • Modify: src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Alerts/AlarmTransitionEvent.cs (record ~19-29)
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs (OnEngineEmission ~289-301)
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs (ctor Receive<AlarmTransitionEvent> ~74)
  • Test: tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/HistorianAdapterActorTests.cs

Step 1: Write the failing engine test (flag carried through emission)

In ScriptedAlarmEngineTests.cs, add a test that an alarm defined with HistorizeToAveva: false emits a ScriptedAlarmEvent whose HistorizeToAveva is false (and true when the definition is true). Follow the existing engine-test setup pattern in that file (register a definition, drive the predicate active, capture OnEvent). Assert:

emitted.HistorizeToAveva.ShouldBeFalse();

Run (expect FAIL — ScriptedAlarmEvent has no HistorizeToAveva):

dotnet test tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests --filter "FullyQualifiedName~HistorizeToAveva"

Step 2: Add the field to ScriptedAlarmEvent + populate in BuildEmission

ScriptedAlarmEvent record (append after Comment):

public sealed record ScriptedAlarmEvent(
    string AlarmId,
    string EquipmentPath,
    string AlarmName,
    AlarmKind Kind,
    AlarmSeverity Severity,
    string Message,
    AlarmConditionState Condition,
    EmissionKind Emission,
    DateTime TimestampUtc,
    string? Comment = null,
    bool HistorizeToAveva = true);

BuildEmission — add to the new ScriptedAlarmEvent(...) initializer (the definition is on state.Definition):

            Comment: kind switch
            {
                EmissionKind.Acknowledged => condition.LastAckComment,
                EmissionKind.Confirmed => condition.LastConfirmComment,
                EmissionKind.CommentAdded => condition.Comments.Count == 0 ? null : condition.Comments[^1].Text,
                _ => null,
            },
            HistorizeToAveva: state.Definition.HistorizeToAveva);

Step 3: Add the field to AlarmTransitionEvent (Commons contract)

Append after Comment (additive, default true — see the rolling-restart note above), and add the <param> doc line:

/// <param name="HistorizeToAveva">When <c>false</c>, the durable historian sink suppresses this transition (the live <c>alerts</c> fan-out is unaffected). Defaults to <c>true</c>. On a rolling restart an old-format message deserializes this as <c>false</c> (CLR default); that is safe because the writing node is always the same-version publisher — see <c>HistorianAdapterActor</c>.</param>
public sealed record AlarmTransitionEvent(
    string AlarmId,
    string EquipmentPath,
    string AlarmName,
    string TransitionKind,
    int Severity,
    string Message,
    string User,
    DateTime TimestampUtc,
    string AlarmTypeName = "AlarmCondition",
    string? Comment = null,
    bool HistorizeToAveva = true);

Step 4: Populate it in ScriptedAlarmHostActor.OnEngineEmission

Add to the new AlarmTransitionEvent(...) initializer (after Comment: e.Comment):

            AlarmTypeName: e.Kind.ToString(),
            Comment: e.Comment,
            HistorizeToAveva: e.HistorizeToAveva);

Do NOT touch the Primary _localRole gate below it — the alerts publish stays ungated by HistorizeToAveva (live UI must see every transition).

Step 5: Gate the durable write in HistorianAdapterActor

Change the Receive<AlarmTransitionEvent> handler (currently ~74) to also require the flag:

        // HistorizeToAveva=false is a per-alarm opt-out of DURABLE historization only — the live
        // `alerts` fan-out (browser UI) already happened upstream. Gate the sink write, not the publish.
        Receive<AlarmTransitionEvent>(t => { if (ShouldHistorize() && t.HistorizeToAveva) _ = EnqueueAsync(Translate(t)); });

Step 6: Write the failing historian-gate test, then make it pass

In HistorianAdapterActorTests.cs: add a bool historizeToAveva = true parameter to the SampleTransition helper (thread it into the AlarmTransitionEvent), then add:

/// <summary>Per-alarm opt-out (pending #1): a Primary node MUST NOT write to the durable sink when the
/// transition carries HistorizeToAveva=false — even though it would otherwise historize. The live alerts
/// fan-out is upstream and unaffected.</summary>
[Fact]
public void Primary_node_does_not_historize_when_opted_out()
{
    var (actor, sink) = CreateActor();
    TellRedundancyRole(actor, RedundancyRole.Primary);
    actor.Tell(SampleTransition(historizeToAveva: false));
    // Give the actor time to (not) enqueue.
    ExpectNoMsg(Settle);
    sink.EnqueueCount.ShouldBe(0);
}

(Match the existing helpers — CreateActor, TellRedundancyRole, Settle, ExpectNoMsg. If ExpectNoMsg isn't the local idiom, mirror the existing Secondary_node_does_not_historize test's settle-then-assert-0 shape.)

Step 7: Run the full affected suites

dotnet test tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~HistorianAdapterActor|FullyQualifiedName~ScriptedAlarmHostActor"

Expected: PASS. The existing ScriptedAlarmHostActorTests already define alarms with HistorizeToAveva: true/false, so confirm those still pass (the alerts publish must still fire for the false alarm).

Step 8: Commit

git add src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs \
        src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Alerts/AlarmTransitionEvent.cs \
        src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs \
        src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs \
        tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs \
        tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/HistorianAdapterActorTests.cs
git commit -m "feat(historian): honor per-alarm HistorizeToAveva opt-out at the durable write"

Task 2: Historian config knobs + startup validation (pending #2, #3, #4)

Classification: standard Estimated implement time: ~4 min Parallelizable with: Task 4

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/AlarmHistorianOptions.cs
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ServiceCollectionExtensions.cs (AddAlarmHistorian ~67-89)
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Host/appsettings.json (the existing disabled AlarmHistorian example section)
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/AlarmHistorianRegistrationTests.cs

Step 1: Write the failing tests

In AlarmHistorianRegistrationTests.cs, add:

  • A binding test: a config with DrainIntervalSeconds, Capacity, DeadLetterRetentionDays set binds them onto AlarmHistorianOptions.
  • Validate() tests (pure, no DI): { Enabled = true, SharedSecret = "" } → returns a warning mentioning SharedSecret; { Enabled = true, DatabasePath = "alarm-historian.db" } (relative) → returns a warning mentioning the relative path; { Enabled = true, SharedSecret = "x", DatabasePath = "/abs/h.db" } → returns empty; { Enabled = false, SharedSecret = "" } → returns empty (disabled is never a misconfig).
[Fact]
public void Validate_warns_on_empty_shared_secret_when_enabled()
{
    var opts = new AlarmHistorianOptions { Enabled = true, SharedSecret = "", DatabasePath = "/var/h.db" };
    opts.Validate().ShouldContain(w => w.Contains("SharedSecret"));
}

[Fact]
public void Validate_warns_on_relative_database_path_when_enabled()
{
    var opts = new AlarmHistorianOptions { Enabled = true, SharedSecret = "s", DatabasePath = "alarm-historian.db" };
    opts.Validate().ShouldContain(w => w.Contains("DatabasePath"));
}

[Fact]
public void Validate_is_silent_when_disabled()
{
    new AlarmHistorianOptions { Enabled = false, SharedSecret = "" }.Validate().ShouldBeEmpty();
}

Run (expect FAIL — no Validate, no new options):

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~AlarmHistorianRegistration"

Step 2: Add the knobs + Validate() to AlarmHistorianOptions

    /// <summary>Seconds between drain-worker ticks. Defaults to 5.</summary>
    public int DrainIntervalSeconds { get; init; } = 5;

    /// <summary>Maximum queued rows before the sink evicts the oldest. Defaults to 1,000,000
    /// (<c>SqliteStoreAndForwardSink.DefaultCapacity</c>).</summary>
    public long Capacity { get; init; } = 1_000_000;

    /// <summary>Days to retain dead-lettered rows before purge. Defaults to 30.</summary>
    public int DeadLetterRetentionDays { get; init; } = 30;

    /// <summary>Returns operator-facing misconfiguration warnings for an <c>Enabled</c> historian
    /// (empty when disabled or correctly configured). Pure — the registration logs each entry.</summary>
    /// <returns>Zero or more human-readable warning messages.</returns>
    public IReadOnlyList<string> Validate()
    {
        var warnings = new List<string>();
        if (!Enabled) return warnings;
        if (string.IsNullOrWhiteSpace(SharedSecret))
            warnings.Add("AlarmHistorian:SharedSecret is empty while the historian is enabled — the Wonderware sidecar Hello frame will carry an empty secret.");
        if (!Path.IsPathRooted(DatabasePath))
            warnings.Add($"AlarmHistorian:DatabasePath '{DatabasePath}' is relative — it resolves against the process working directory (e.g. System32 for a Windows service). Set an absolute path.");
        return warnings;
    }

(Add using System.Collections.Generic; and using System.IO; if the file lacks them.)

Step 3: Thread the knobs + log warnings in AddAlarmHistorian

Inside the existing if (opts is not { Enabled: true }) return services; guard's downstream block:

        foreach (var warning in opts.Validate())
            Serilog.Log.Logger.ForContext<SqliteStoreAndForwardSink>().Warning("{HistorianConfigWarning}", warning);

        services.AddSingleton<IAlarmHistorianSink>(sp =>
        {
            var sink = new SqliteStoreAndForwardSink(
                opts.DatabasePath,
                writerFactory(opts, sp),
                Serilog.Log.Logger.ForContext<SqliteStoreAndForwardSink>(),
                batchSize: opts.BatchSize,
                capacity: opts.Capacity,
                deadLetterRetention: TimeSpan.FromDays(opts.DeadLetterRetentionDays));
            sink.StartDrainLoop(TimeSpan.FromSeconds(opts.DrainIntervalSeconds));
            return sink;
        });
        return services;

Step 4: Update the appsettings example

In Host/appsettings.json, extend the disabled AlarmHistorian example section with the three new keys (DrainIntervalSeconds, Capacity, DeadLetterRetentionDays) so operators see them. Keep Enabled: false. Make a targeted edit to that section only.

Step 5: Run + commit

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~AlarmHistorianRegistration"
git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/AlarmHistorianOptions.cs \
        src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ServiceCollectionExtensions.cs \
        src/Server/ZB.MOM.WW.OtOpcUa.Host/appsettings.json \
        tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/AlarmHistorianRegistrationTests.cs
git commit -m "feat(historian): drain/capacity/retention config knobs + startup config-warning validation"

Task 3: Record the real operator for shelve/enable/disable transitions (pending #5)

Classification: small Estimated implement time: ~3 min Parallelizable with: Task 4

The design: every shelve/unshelve/enable/disable op in Part9StateMachine appends an audit AlarmComment carrying the acting user (ApplyOneShotShelve"ShelveOneShot", ApplyTimedShelve"ShelveTimed", ApplyUnshelve"Unshelve", ApplyEnable"Enable", ApplyDisable"Disable"), and auto-unshelve appends with user="system". So the operator is already the last Comments entry on the emitted condition — exactly the CommentAdded case. No Core/state change needed; widen the TransitionUser switch. (Acknowledged/Confirmed keep using LastAckUser/LastConfirmUser — leave them.)

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs (TransitionUser ~508-514)
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ScriptedAlarms/ScriptedAlarmHostActorTests.cs

Step 1: Write the failing test

Mirror the existing Acknowledge/AddComment command tests (they drive a command via the alarm-commands topic and FishForMessage<AlarmTransitionEvent> on the alerts probe). Add a test that an operator shelve command (Operation: "OneShotShelve", User: "carol") on an active alarm yields a transition with evt.User.ShouldBe("carol") (it currently returns "system"). Use the exact command-operation string the existing inbound tests use (check the host actor's command dispatch for the canonical OneShotShelve/Shelve spelling).

Run (expect FAIL — currently "system"):

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~ScriptedAlarmHostActor"

Step 2: Widen TransitionUser

    private static string TransitionUser(ScriptedAlarmEvent e) => e.Emission switch
    {
        EmissionKind.Acknowledged => e.Condition.LastAckUser ?? "system",
        EmissionKind.Confirmed => e.Condition.LastConfirmUser ?? "system",
        // Shelve / unshelve / enable / disable / comment ops each append the acting user as the last
        // audit entry on the emitted condition (auto-unshelve appends "system"); read it from there.
        EmissionKind.CommentAdded
            or EmissionKind.Shelved
            or EmissionKind.Unshelved
            or EmissionKind.Enabled
            or EmissionKind.Disabled => e.Condition.Comments.Count > 0 ? e.Condition.Comments[^1].User : "system",
        _ => "system",
    };

Step 3: Run + commit

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~ScriptedAlarmHostActor"
git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs \
        tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ScriptedAlarms/ScriptedAlarmHostActorTests.cs
git commit -m "fix(alarms): historize the real operator for shelve/unshelve/enable/disable transitions"

Task 4: SqliteStoreAndForwardSink thread-safety nits (pending #6)

Classification: small Estimated implement time: ~3 min Parallelizable with: Task 1, Task 2, Task 3 (different assembly — Core.AlarmHistorian)

The fixes (visibility-only, no functional behavior change):

  1. _backoffIndex (~74) is written by BumpBackoff/ResetBackoff (drain thread) and read by CurrentBackoff (status query thread) without a memory barrier → mark volatile.
  2. The capacity-eviction log (~659-661) reads _evictedCount outside _statusLock (it's incremented inside the lock on 658) → capture the total inside the lock and log the captured value.

Files:

  • Modify: src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs

Step 1: Make _backoffIndex volatile

    private volatile int _backoffIndex;

Step 2: Capture _evictedCount under the lock for the log message

Replace lines ~658-661:

        Interlocked.Add(ref _queuedRowCount, -toEvict);
        long lifetimeEvicted;
        lock (_statusLock) { _evictedCount += toEvict; lifetimeEvicted = _evictedCount; }
        _logger.Warning(
            "Historian queue at capacity {Cap} — evicted {Count} oldest row(s) to make room (lifetime evictions: {Total})",
            _capacity, toEvict, lifetimeEvicted);

Step 3: Build + run the sink suite (guards no regression)

dotnet test tests/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian.Tests

Expected: PASS (no behavior change; these are visibility/lock-scope fixes). No new test — thread-safety here isn't deterministically unit-testable; correctness is by inspection + the existing suite.

Step 4: Commit

git add src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs
git commit -m "fix(historian): volatile _backoffIndex + read _evictedCount under lock (thread-safety)"

Task 5: Full-suite gate + docs + close out

Classification: small Estimated implement time: ~4 min Parallelizable with: none (depends on Tasks 1-4)

Files:

  • Modify: docs/ScriptedAlarms.md and/or docs/AlarmTracking.md (historian config + HistorizeToAveva opt-out note)
  • Modify: pending.md (strike the now-resolved items 1-6; keep item 7)

Step 1: Build the whole solution

dotnet build ZB.MOM.WW.OtOpcUa.slnx

Expected: 0 errors.

Step 2: Run the full suite, capturing the real exit code

dotnet test ZB.MOM.WW.OtOpcUa.slnx 2>&1 | tee /tmp/followups-test.log; echo "EXIT=${PIPESTATUS[0]}"

Expected: only the known pre-existing env/integration failures (AbCip/AbLegacy fixtures, OpcUaServer DualEndpoint/PKI, Host EquipmentNamespace deploy-Rejected) — NO new failures in Core.ScriptedAlarms.Tests, Core.AlarmHistorian.Tests, or Runtime.Tests. If anything in those three regresses, stop and fix before finishing. (Do NOT pipe to tail — it masks the exit code.)

Step 3: Document the new behavior

In the historian section of docs/ScriptedAlarms.md (and/or docs/AlarmTracking.md): note that (a) HistorizeToAveva=false now suppresses the durable historian write while the live /alerts UI still shows the transition; (b) the AlarmHistorian section exposes DrainIntervalSeconds, Capacity, DeadLetterRetentionDays; (c) an enabled historian logs a startup warning on empty SharedSecret or a relative DatabasePath.

Step 4: Strike resolved items from pending.md

Remove items 1-6 from pending.md (or mark them resolved in this branch) and leave item 7 (docker-dev rig cleanup, operational deferral). Note: pending.md is untracked — update it but it need not be committed unless the user asks.

Step 5: Commit the docs

git add docs/ScriptedAlarms.md docs/AlarmTracking.md
git commit -m "docs(historian): HistorizeToAveva opt-out + config knobs + startup validation"

Step 6: Finish the branch

Use superpowers-extended-cc:finishing-a-development-branch (verify tests → present the 4 options → execute the user's choice).


Dependency spine

T0 (branch)
 ├─ T1 (HistorizeToAveva opt-out)      [high-risk]  ─┐
 │   └─ T2 (config knobs + validation) [standard]   │  (Runtime serial chain;
 │       └─ T3 (TransitionUser)        [small]       │   T1/T3 share a file)
 └─ T4 (sink thread-safety)            [small]      ─┘  (Core.AlarmHistorian — parallel)
              └──────────────┬───────────────┘
                             T5 (gate + docs + finish)  [blocked by T1,T2,T3,T4]

T4 is the only task safe to run concurrently with the Runtime chain. T1 → T2 → T3 serialize (same assembly; T1 and T3 edit ScriptedAlarmHostActor.cs).