Driver-reconfigure-while-faulted — Design
Date: 2026-06-14
Status: Approved (brainstorming) — ready for implementation plan
Branch: feat/driver-reconfigure-while-faulted off master f9be3843
Follow-up: pending.md open item #7 (also "Incidental findings from the OpcUaClient live-verify (2026-06-13)").
Problem
A DriverInstanceActor (src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs)
that is stuck Reconnecting (its InitializeAsync keeps failing on a bad config) ignores a
corrected config. When the operator deploys a fix, DriverHostActor.ApplyChildDelta Tells the
child a DriverInstanceActor.ApplyDelta — but only the Connected() and Stubbed() behaviours have
a Receive<ApplyDelta> handler. In Connecting() and Reconnecting() the message dead-letters,
so the actor keeps retrying the stale _currentConfigJson forever. The only workaround today is to
restart the whole node (which respawns the driver actor fresh from the current deployment artifact).
Surfaced live on 2026-06-13: a MAIN-opcua-eq driver was faulted from a prior bad config and never
adopted the corrected one until the node was restarted.
Goal
An ApplyDelta delivered while the actor is Connecting or Reconnecting adopts the new config and
re-initialises with it, so a corrected deployment recovers a faulted driver without a node restart —
and the adopted config is guaranteed to win, even against an old initialise still in flight.
Non-goals: changing the Connected apply path (ReinitializeAsync), the host/spawn logic, any
message contract visible outside the actor, or anything in EF/Configuration. The live /run
verification is deferred (user-driven).
Approach (chosen: B — generation guard + immediate re-init)
Alternatives considered:
- A: config-swap only. ApplyDelta just sets _currentConfigJson + replies success; the Reconnecting retry timer picks the new config up on its next tick. Smallest diff, but in Connecting (no timer) an old in-flight init that succeeds connects on the stale config, so the adopt can lose the race, plus up to reconnectInterval latency. Rejected: the adopt isn't guaranteed to stick, which is the whole point.
- C: host respawns the faulted child. DriverHostActor Context.Stops + respawns the child with the new config (automating "restart node"). No state-machine change, but drops health history and live subscriptions, adds restart churn, and papers over the actor contract instead of fixing it. Rejected.
- B: generation guard + immediate re-init (chosen). The corrected config always wins; immediate retry (no timer-tick wait); also closes a latent concurrent-init window in Reconnecting. Contained to this one actor; fully unit-testable offline.
Generation guard (the correctness mechanism)
Each InitializeAsync attempt is tagged with a monotonic generation. A result is honoured only when
it matches the latest generation; an older result is from a superseded attempt (e.g. an
ApplyDelta adopted a new config mid-(re)connect) and is dropped.
private int _initGeneration;

private void InitializeAsync(string driverConfigJson)
{
    _currentConfigJson = driverConfigJson;
    var generation = ++_initGeneration; // bump on the actor thread; the closure captures the local
    var self = Self;
    _ = Task.Run(async () =>
    {
        try
        {
            await _driver.InitializeAsync(driverConfigJson, CancellationToken.None);
            self.Tell(new InitializeSucceeded(generation));
        }
        catch (Exception ex)
        {
            self.Tell(new InitializeFailed(ex.Message, generation));
        }
    });
}
The two internal result records gain the token (nothing outside the actor constructs them):
public sealed record InitializeSucceeded(int Generation);
public sealed record InitializeFailed(string Reason, int Generation);
Every InitializeSucceeded/InitializeFailed handler in Connecting() and Reconnecting() drops a
superseded result first:
if (msg.Generation != _initGeneration) return; // superseded — a newer InitializeAsync replaced this one
_initGeneration is read/written only on the actor thread — InitializeAsync runs inside a
message handler, the Task.Run closure captures generation as a local, and the result handler runs
on the actor thread. No lock needed.
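The guard can be exercised without Akka at all. The following is a minimal sketch, not the actor itself: GenerationGuardModel, BeginInitialize, and OnInitializeSucceeded are hypothetical names that stand in for the actor, its InitializeAsync, and its guarded InitializeSucceeded handler, with every call made on one thread to mimic the actor thread.

```csharp
using System;

// Hypothetical single-threaded stand-in for the actor (not the real DriverInstanceActor).
public sealed class GenerationGuardModel
{
    private int _initGeneration;

    public string State { get; private set; } = "Connecting";
    public string CurrentConfig { get; private set; } = "";
    public string ConnectedConfig { get; private set; } = "";

    // Mirrors InitializeAsync: store the config, bump the generation, and return
    // the tag that the background attempt would carry back in its result message.
    public int BeginInitialize(string configJson)
    {
        CurrentConfig = configJson;
        return ++_initGeneration;
    }

    // Mirrors the guarded success handler. Returns false when the result is dropped.
    public bool OnInitializeSucceeded(int generation)
    {
        if (generation != _initGeneration) return false; // superseded attempt
        State = "Connected";
        ConnectedConfig = CurrentConfig;
        return true;
    }
}

public static class GenerationGuardDemo
{
    public static void Run()
    {
        var actor = new GenerationGuardModel();
        int v1 = actor.BeginInitialize("v1"); // original attempt, still in flight
        int v2 = actor.BeginInitialize("v2"); // ApplyDelta adopts a new config

        Console.WriteLine(actor.OnInitializeSucceeded(v1)); // False: stale result dropped
        Console.WriteLine(actor.State);                     // Connecting
        Console.WriteLine(actor.OnInitializeSucceeded(v2)); // True
        Console.WriteLine(actor.State + " on " + actor.ConnectedConfig); // Connected on v2
    }
}
```

The model keeps only what the guard needs: one counter and a comparison. Because the stale v1 success is rejected before any state change, the order in which the two results arrive does not affect which config ends up connected.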
The adopt handler (new — Connecting + Reconnecting)
Receive<ApplyDelta>(msg =>
{
    _log.Info("DriverInstance {Id}: ApplyDelta during (re)connect — adopting new config, re-initialising now",
        _driverInstanceId);
    InitializeAsync(msg.DriverConfigJson); // swaps _currentConfigJson, bumps generation (supersedes the
                                           // in-flight init), starts a fresh attempt immediately
    Sender.Tell(new ApplyResult(true, "config adopted; reinitializing", msg.Correlation));
});
The actor stays in its current state; the new init's result drives the next transition through the existing (now generation-guarded) handlers:
- Connecting: new init succeeds → Become(Connected); fails → Become(Reconnecting).
- Reconnecting: new init succeeds → Become(Connected) (and the existing handler cancels the retry timer); fails → no-op, the retry timer keeps retrying the new config (now in _currentConfigJson).
The Reconnecting retry timer is left running — if the immediate attempt fails it continues
retrying the new config. A redundant concurrent attempt (immediate + a timer tick) is harmlessly
deduped by the generation guard.
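The transitions described above can be written as one pure function. This is a sketch for reasoning about the rules, not code from the actor: DriverState, InitTransitions, and Next are hypothetical names, and the Connected branch models the fact that Connected has no init-result handler.

```csharp
using System;

public enum DriverState { Connecting, Reconnecting, Connected }

public static class InitTransitions
{
    // Returns the state after an init result tagged resultGeneration arrives,
    // given the latest generation the actor has issued.
    public static DriverState Next(
        DriverState state, bool succeeded, int resultGeneration, int latestGeneration)
    {
        // Connected has no init-result handler, so the message changes nothing.
        if (state == DriverState.Connected) return state;

        // Generation guard: a result from a superseded attempt is dropped.
        if (resultGeneration != latestGeneration) return state;

        if (succeeded) return DriverState.Connected;

        // Failure: Connecting falls back to Reconnecting; Reconnecting stays put
        // and its retry timer keeps retrying the current config.
        return DriverState.Reconnecting;
    }
}
```

For example, if an ApplyDelta starts attempt 2 and a timer tick then starts attempt 3, Next(DriverState.Reconnecting, true, 2, 3) leaves the state at Reconnecting and only the result tagged 3 can move the actor to Connected. Both attempts ran against the new config, so dropping the earlier one loses nothing.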
Why it's correct
- No stale hijack. In Connecting, if the old in-flight init's InitializeSucceeded(oldGen) lands while still connecting, the guard drops it, so the actor cannot connect on the stale config. Only the new-generation result transitions it. (If the stale result lands after the new init already reached Connected, it dead-letters harmlessly, because Connected has no InitializeSucceeded handler.)
- No subscription churn / leak. Connecting has no live subscription yet; Reconnecting already detached on its entry (DisconnectObserved/ForceReconnect call DetachSubscription). _desiredRefs is retained and re-applied by ResubscribeDesired on the next Connected entry. The adopt handler touches no subscription state.
- Reuses an established pattern. The Reconnecting retry loop already calls InitializeAsync (not ReinitializeAsync) repeatedly on a not-connected driver, so adopting via InitializeAsync introduces no new driver-contract assumption.
- Stubbed untouched. It never calls InitializeAsync, so _initGeneration stays 0 and its existing ApplyDelta stubbed-success reply is unchanged.
Testing (offline — Akka TestKit, no live gate)
Observability uses the existing pattern: with a SetDesiredSubscriptions set, a Connected entry
auto-subscribes, so driver.SubscribeCount is the deterministic "did we connect, and on which config"
probe. The shared stub gains opt-in per-config gating + config capture (additive — existing
InitializeShouldThrow/InitializeCount fields and their tests are untouched).
- Reconnecting adopts a corrected config and connects (headline bug). Init throws for v1 → stuck Reconnecting; Ask<ApplyResult>(ApplyDelta(v2)) where v2 succeeds → reply Success with the correlation; the actor reaches Connected on v2 (SubscribeCount goes to 1; the config that drove Connected is v2).
- Generation guard ignores a superseded in-flight result (the race). A gated v1 init is pending in Connecting; ApplyDelta(v2) is sent (also gated). Release v1 → the actor stays Connecting (no auto-subscribe fires, SubscribeCount == 0). Release v2 → Connected (SubscribeCount == 1). Proves a stale init cannot hijack state.
- Regression (existing tests stay green): Initialize_failure_keeps_actor_in_Reconnecting_state, ApplyDelta_when_Connected_calls_ReinitializeAsync_and_replies_success.
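One way the opt-in per-config gating and config capture could look is sketched below. It is an illustration under assumptions, not the existing test stub: GatedStubDriver, Gate, Release, and InitializedConfigs are hypothetical names, and the real stub's interface and its InitializeShouldThrow/InitializeCount fields are not reproduced here.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stub: an init for a gated config blocks until the test releases it.
public sealed class GatedStubDriver
{
    private readonly ConcurrentDictionary<string, TaskCompletionSource<bool>> _gates =
        new ConcurrentDictionary<string, TaskCompletionSource<bool>>();
    private readonly ConcurrentQueue<string> _initialized = new ConcurrentQueue<string>();

    // Configs whose init has completed, in completion order (the config capture).
    public string[] InitializedConfigs => _initialized.ToArray();

    // Opt-in: only configs passed to Gate block; all others complete immediately.
    public void Gate(string configJson) =>
        _gates[configJson] = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

    // Lets a pending (or future) init for this config complete.
    public void Release(string configJson)
    {
        if (_gates.TryGetValue(configJson, out var gate)) gate.TrySetResult(true);
    }

    // The cancellation token is accepted for signature parity but not observed here.
    public async Task InitializeAsync(string configJson, CancellationToken ct)
    {
        if (_gates.TryGetValue(configJson, out var gate))
            await gate.Task;
        _initialized.Enqueue(configJson);
    }
}
```

In the race test, both v1 and v2 would be gated before the actor starts, so the test decides the order in which the two inits complete instead of relying on timing. Because gating is opt-in per config, tests that never call Gate see the same immediate completion as before.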
Files touched
- src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs: generation field + tagged result records + guarded handlers + the two new Receive<ApplyDelta> handlers.
- tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorTests.cs: new adopt + generation-guard tests; extend the private stub with opt-in gating/config-capture.
No host change, no contract change visible outside the actor, no EF/Configuration change, no bUnit.
Risk / classification
High-risk — actor state machine + concurrency. Full review chain (spec + code reviews) and a final
integration review. The live /run gate (deploy a corrected config to a faulted MAIN-opcua-eq and
confirm it adopts without a node restart) is deferred per the user's instruction.