lmxopcua/docs/plans/2026-06-14-driver-reconfigure-while-faulted-design.md

Driver-reconfigure-while-faulted — Design

Date: 2026-06-14
Status: Approved (brainstorming); ready for implementation plan
Branch: feat/driver-reconfigure-while-faulted off master f9be3843
Follow-up: pending.md open item #7 (also "Incidental findings from the OpcUaClient live-verify (2026-06-13)").

Problem

A DriverInstanceActor (src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs) that is stuck in Reconnecting (its InitializeAsync keeps failing on a bad config) ignores a corrected config. When the operator deploys a fix, DriverHostActor.ApplyChildDelta Tells the child a DriverInstanceActor.ApplyDelta, but only the Connected() and Stubbed() behaviours have a Receive<ApplyDelta> handler. In Connecting() and Reconnecting() the message dead-letters, so the actor keeps retrying the stale _currentConfigJson forever. The only workaround today is to restart the whole node, which respawns the driver actor fresh from the current deployment artifact.

Surfaced live on 2026-06-13: a MAIN-opcua-eq driver was faulted from a prior bad config and never adopted the corrected one until the node was restarted.

Goal

An ApplyDelta delivered while the actor is Connecting or Reconnecting adopts the new config and re-initialises with it, so a corrected deployment recovers a faulted driver without a node restart — and the adopted config is guaranteed to win, even against an old initialise still in flight.

Non-goals: changing the Connected apply path (ReinitializeAsync), the host/spawn logic, any message contract visible outside the actor, or anything in EF/Configuration. The live /run verification is deferred (user-driven).

Approach (chosen: B — generation guard + immediate re-init)

Alternatives considered:

  • A — config-swap only. ApplyDelta just sets _currentConfigJson + replies success; the Reconnecting retry timer picks the new config up on its next tick. Smallest diff, but in Connecting (no timer) an old in-flight init that succeeds connects on the stale config — the adopt can lose the race — plus up to reconnectInterval latency. Rejected: the adopt isn't guaranteed to stick, which is the whole point.
  • C — host respawns the faulted child. DriverHostActor Context.Stops + respawns the child with the new config (automating "restart node"). No state-machine change, but drops health history and live subscriptions, adds restart churn, and papers over the actor contract instead of fixing it. Rejected.
  • B — generation guard + immediate re-init (chosen). The corrected config always wins; immediate retry (no timer-tick wait); also closes a latent concurrent-init window in Reconnecting. Contained to this one actor; fully unit-testable offline.

Generation guard (the correctness mechanism)

Each InitializeAsync attempt is tagged with a monotonic generation. A result is honoured only when it matches the latest generation; an older result is from a superseded attempt (e.g. an ApplyDelta adopted a new config mid-(re)connect) and is dropped.

// Monotonic tag for InitializeAsync attempts; read and written only on the actor thread.
private int _initGeneration;

private void InitializeAsync(string driverConfigJson)
{
    _currentConfigJson = driverConfigJson;
    var generation = ++_initGeneration;          // bump on the actor thread; the closure captures the local
    var self = Self;
    _ = Task.Run(async () =>
    {
        try
        {
            await _driver.InitializeAsync(driverConfigJson, CancellationToken.None);
            self.Tell(new InitializeSucceeded(generation));
        }
        catch (Exception ex)
        {
            self.Tell(new InitializeFailed(ex.Message, generation));
        }
    });
}

The two internal result records gain the token (nothing outside the actor constructs them):

public sealed record InitializeSucceeded(int Generation);
public sealed record InitializeFailed(string Reason, int Generation);

Every InitializeSucceeded/InitializeFailed handler in Connecting() and Reconnecting() drops a superseded result first:

if (msg.Generation != _initGeneration) return;   // superseded — a newer InitializeAsync replaced this one

_initGeneration is read/written only on the actor thread: InitializeAsync runs inside a message handler, the Task.Run closure captures generation as a local, and the result handler runs on the actor thread. No lock needed.
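
The drop rule can be exercised without Akka. The following is a minimal sketch, not the actor itself: GenerationGuardModel and GenerationGuardDemo are hypothetical names, and plain sequential method calls stand in for the actor's one-message-at-a-time mailbox.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, Akka-free model of the guard. Calls are made one at a time,
// standing in for the actor's one-message-at-a-time mailbox.
public sealed class GenerationGuardModel
{
    private int _initGeneration;

    public string CurrentConfig { get; private set; } = "";
    public bool Connected { get; private set; }
    public List<string> Log { get; } = new List<string>();

    // Mirrors InitializeAsync: swap the config, bump the generation, and return
    // the tag that the background attempt would carry in its result message.
    public int StartInit(string config)
    {
        CurrentConfig = config;
        return ++_initGeneration;
    }

    // Mirrors a guarded InitializeSucceeded handler.
    public void OnSucceeded(int generation)
    {
        if (generation != _initGeneration)
        {
            Log.Add($"dropped gen {generation}");   // superseded attempt
            return;
        }
        Connected = true;
        Log.Add($"connected gen {generation} on {CurrentConfig}");
    }
}

public static class GenerationGuardDemo
{
    public static GenerationGuardModel Run()
    {
        var model = new GenerationGuardModel();
        int g1 = model.StartInit("v1");   // original attempt on the bad config
        int g2 = model.StartInit("v2");   // ApplyDelta adopts v2 mid-connect
        model.OnSucceeded(g1);            // stale result arrives first: dropped
        model.OnSucceeded(g2);            // latest result: honoured
        return model;
    }
}
```

In this model the stale generation-1 success is logged as dropped and only the generation-2 result marks the model connected, which is the ordering the guard is meant to enforce.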

The adopt handler (new — Connecting + Reconnecting)

Receive<ApplyDelta>(msg =>
{
    _log.Info("DriverInstance {Id}: ApplyDelta during (re)connect — adopting new config, re-initialising now",
        _driverInstanceId);
    InitializeAsync(msg.DriverConfigJson);   // swaps _currentConfigJson, bumps generation (supersedes the
                                             // in-flight init), starts a fresh attempt immediately
    Sender.Tell(new ApplyResult(true, "config adopted; reinitializing", msg.Correlation));
});

The actor stays in its current state; the new init's result drives the next transition through the existing (now generation-guarded) handlers:

  • Connecting: new init succeeds → Become(Connected); fails → Become(Reconnecting).
  • Reconnecting: new init succeeds → Become(Connected) (and the existing handler cancels the retry timer); fails → no-op, the retry timer keeps retrying the new config (now in _currentConfigJson).

The Reconnecting retry timer is left running — if the immediate attempt fails it continues retrying the new config. A redundant concurrent attempt (immediate + a timer tick) is harmlessly deduped by the generation guard.
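
The transitions above can be summarised as a pure function. This is an illustrative sketch only: DriverState, InitOutcome, and AdoptTransitions are hypothetical names that do not exist in the actor, which expresses the same rules through Become and its Receive handlers.

```csharp
using System;

public enum DriverState { Connecting, Reconnecting, Connected }
public enum InitOutcome { Succeeded, Failed }

// Hypothetical summary of how an init result moves the actor after an adopt.
public static class AdoptTransitions
{
    public static DriverState Next(DriverState current, InitOutcome outcome, bool isLatestGeneration)
    {
        // A superseded result is dropped by the generation guard: no transition.
        if (!isLatestGeneration) return current;

        return (current, outcome) switch
        {
            (DriverState.Connecting, InitOutcome.Succeeded) => DriverState.Connected,
            (DriverState.Connecting, InitOutcome.Failed) => DriverState.Reconnecting,
            (DriverState.Reconnecting, InitOutcome.Succeeded) => DriverState.Connected,
            // Failure while Reconnecting is a no-op; the retry timer keeps going.
            (DriverState.Reconnecting, InitOutcome.Failed) => DriverState.Reconnecting,
            // Connected has no init-result handler, so nothing changes there.
            _ => current,
        };
    }
}
```

For example, AdoptTransitions.Next(DriverState.Connecting, InitOutcome.Succeeded, false) leaves the state at Connecting, which corresponds to the stale-result case.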

Why it's correct

  • No stale hijack. In Connecting, if the old in-flight init's InitializeSucceeded(oldGen) lands while still connecting, the guard drops it, so the actor cannot connect on the stale config. Only the new-generation result transitions it. (If the stale result lands after the new init already reached Connected, it dead-letters harmlessly — Connected has no InitializeSucceeded handler.)
  • No subscription churn / leak. Connecting has no live subscription yet; Reconnecting already detached on its entry (DisconnectObserved/ForceReconnect call DetachSubscription). _desiredRefs is retained and re-applied by ResubscribeDesired on the next Connected entry. The adopt handler touches no subscription state.
  • Reuses an established pattern. The Reconnecting retry loop already calls InitializeAsync (not ReinitializeAsync) repeatedly on a not-connected driver, so adopting via InitializeAsync introduces no new driver-contract assumption.
  • Stubbed untouched. It never calls InitializeAsync, so _initGeneration stays 0 and its existing ApplyDelta stubbed-success reply is unchanged.

Testing (offline — Akka TestKit, no live gate)

Observability uses the existing pattern: with desired subscriptions set (SetDesiredSubscriptions), a Connected entry auto-subscribes, so driver.SubscribeCount is the deterministic "did we connect, and on which config" probe. The shared stub gains opt-in per-config gating and config capture (additive: the existing InitializeShouldThrow/InitializeCount fields and their tests are untouched).

  1. Reconnecting adopts a corrected config and connects (headline bug). Init throws for v1 → stuck Reconnecting; Ask<ApplyResult>(ApplyDelta(v2)) where v2 succeeds → reply Success with the correlation; the actor reaches Connected on v2 (SubscribeCount goes to 1; the config that drove Connected is v2).
  2. Generation guard ignores a superseded in-flight result (the race). A gated v1 init is pending in Connecting; ApplyDelta(v2) is sent (also gated). Release v1 → the actor stays Connecting (no auto-subscribe fires, SubscribeCount == 0). Release v2 → Connected (SubscribeCount == 1). Proves a stale init cannot hijack state.
  3. Regression — existing tests stay green: Initialize_failure_keeps_actor_in_Reconnecting_state, ApplyDelta_when_Connected_calls_ReinitializeAsync_and_replies_success.
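
One way the per-config gating and config capture might look is sketched below. GatedStubDriver and its members (Gate, Release, Fail, SeenConfigs) are hypothetical names chosen for illustration; the real stub in DriverInstanceActorTests.cs may be shaped differently, and this sketch implements no driver interface.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical test double: names and shape are illustrative, not the real stub's API.
public sealed class GatedStubDriver
{
    private readonly ConcurrentDictionary<string, TaskCompletionSource<bool>> _gates =
        new ConcurrentDictionary<string, TaskCompletionSource<bool>>();
    private readonly ConcurrentQueue<string> _seenConfigs = new ConcurrentQueue<string>();

    // Config capture: every config passed to InitializeAsync, in call order.
    public IReadOnlyCollection<string> SeenConfigs => _seenConfigs.ToArray();

    // Opt in: InitializeAsync for this exact config stays pending until Release or Fail.
    public void Gate(string config) =>
        _gates[config] = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

    // Both assume Gate(config) was called first; otherwise the indexer throws.
    public void Release(string config) => _gates[config].TrySetResult(true);

    public void Fail(string config, string reason) =>
        _gates[config].TrySetException(new InvalidOperationException(reason));

    public async Task InitializeAsync(string config, CancellationToken ct)
    {
        _seenConfigs.Enqueue(config);
        if (_gates.TryGetValue(config, out var gate))
            await gate.Task.ConfigureAwait(false);   // gated: wait for the test to decide
        // Ungated configs fall through and complete immediately.
    }
}
```

With a double along these lines, test 2 can gate both v1 and v2, release v1 first, and assert that nothing connects until v2 is released.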

Files touched

  • src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs — generation field + tagged result records + guarded handlers + the two new Receive<ApplyDelta> handlers.
  • tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorTests.cs — new adopt + generation-guard tests; extend the private stub with opt-in gating/config-capture.

No host change, no contract change visible outside the actor, no EF/Configuration change, no bUnit.

Risk / classification

High-risk: actor state machine + concurrency. Requires the full review chain (spec + code reviews) and a final integration review. The live /run gate (deploy a corrected config to a faulted MAIN-opcua-eq and confirm it adopts without a node restart) is deferred per the user's instruction.