# Driver-reconfigure-while-faulted — Design **Date:** 2026-06-14 **Status:** Approved (brainstorming) — ready for implementation plan **Branch:** `feat/driver-reconfigure-while-faulted` off master `f9be3843` **Follow-up:** `pending.md` open item #7 (also "Incidental findings from the OpcUaClient live-verify (2026-06-13)"). ## Problem A `DriverInstanceActor` (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs`) that is stuck `Reconnecting` (its `InitializeAsync` keeps failing on a bad config) **ignores a corrected config**. When the operator deploys a fix, `DriverHostActor.ApplyChildDelta` `Tell`s the child a `DriverInstanceActor.ApplyDelta` — but only the `Connected()` and `Stubbed()` behaviours have a `Receive` handler. In `Connecting()` and `Reconnecting()` the message **dead-letters**, so the actor keeps retrying the stale `_currentConfigJson` forever. The only workaround today is to restart the whole node (which respawns the driver actor fresh from the current deployment artifact). Surfaced live on 2026-06-13: a `MAIN-opcua-eq` driver was faulted from a prior bad config and never adopted the corrected one until the node was restarted. ## Goal An `ApplyDelta` delivered while the actor is `Connecting` or `Reconnecting` **adopts the new config and re-initialises with it**, so a corrected deployment recovers a faulted driver without a node restart — and the adopted config is guaranteed to win, even against an old initialise still in flight. Non-goals: changing the `Connected` apply path (`ReinitializeAsync`), the host/spawn logic, any message contract visible outside the actor, or anything in EF/Configuration. The live `/run` verification is deferred (user-driven). ## Approach (chosen: B — generation guard + immediate re-init) Alternatives considered: - **A — config-swap only.** `ApplyDelta` just sets `_currentConfigJson` + replies success; the `Reconnecting` retry timer picks the new config up on its next tick. Smallest diff, but in `Connecting` (no timer) an old in-flight init that *succeeds* connects on the stale config — the adopt can lose the race — plus up to `reconnectInterval` latency. **Rejected:** the adopt isn't guaranteed to stick, which is the whole point. - **C — host respawns the faulted child.** `DriverHostActor` `Context.Stop`s + respawns the child with the new config (automating "restart node"). No state-machine change, but drops health history and live subscriptions, adds restart churn, and papers over the actor contract instead of fixing it. **Rejected.** - **B — generation guard + immediate re-init (chosen).** The corrected config always wins; immediate retry (no timer-tick wait); also closes a latent concurrent-init window in `Reconnecting`. Contained to this one actor; fully unit-testable offline. ### Generation guard (the correctness mechanism) Each `InitializeAsync` attempt is tagged with a monotonic generation. A result is honoured only when it matches the latest generation; an older result is from a **superseded** attempt (e.g. an `ApplyDelta` adopted a new config mid-(re)connect) and is dropped. ```csharp private int _initGeneration; private void InitializeAsync(string driverConfigJson) { _currentConfigJson = driverConfigJson; var generation = ++_initGeneration; // bump on the actor thread; the closure captures the local var self = Self; _ = Task.Run(async () => { try { await _driver.InitializeAsync(driverConfigJson, CancellationToken.None); self.Tell(new InitializeSucceeded(generation)); } catch (Exception ex) { self.Tell(new InitializeFailed(ex.Message, generation)); } }); } ``` The two internal result records gain the token (nothing outside the actor constructs them): ```csharp public sealed record InitializeSucceeded(int Generation); public sealed record InitializeFailed(string Reason, int Generation); ``` Every `InitializeSucceeded`/`InitializeFailed` handler in `Connecting()` and `Reconnecting()` drops a superseded result first: ```csharp if (msg.Generation != _initGeneration) return; // superseded — a newer InitializeAsync replaced this one ``` `_initGeneration` is read/written **only on the actor thread** — `InitializeAsync` runs inside a message handler, the `Task.Run` closure captures `generation` as a local, and the result handler runs on the actor thread. No lock needed. ### The adopt handler (new — `Connecting` + `Reconnecting`) ```csharp Receive(msg => { _log.Info("DriverInstance {Id}: ApplyDelta during (re)connect — adopting new config, re-initialising now", _driverInstanceId); InitializeAsync(msg.DriverConfigJson); // swaps _currentConfigJson, bumps generation (supersedes the // in-flight init), starts a fresh attempt immediately Sender.Tell(new ApplyResult(true, "config adopted; reinitializing", msg.Correlation)); }); ``` The actor **stays in its current state**; the new init's result drives the next transition through the existing (now generation-guarded) handlers: - `Connecting`: new init succeeds → `Become(Connected)`; fails → `Become(Reconnecting)`. - `Reconnecting`: new init succeeds → `Become(Connected)` (and the existing handler cancels the retry timer); fails → no-op, the retry timer keeps retrying the **new** config (now in `_currentConfigJson`). The `Reconnecting` retry timer is **left running** — if the immediate attempt fails it continues retrying the new config. A redundant concurrent attempt (immediate + a timer tick) is harmlessly deduped by the generation guard. ### Why it's correct - **No stale hijack.** In `Connecting`, if the old in-flight init's `InitializeSucceeded(oldGen)` lands while still connecting, the guard drops it, so the actor cannot connect on the stale config. Only the new-generation result transitions it. (If the stale result lands *after* the new init already reached `Connected`, it dead-letters harmlessly — `Connected` has no `InitializeSucceeded` handler.) - **No subscription churn / leak.** `Connecting` has no live subscription yet; `Reconnecting` already detached on its entry (`DisconnectObserved`/`ForceReconnect` call `DetachSubscription`). `_desiredRefs` is retained and re-applied by `ResubscribeDesired` on the next `Connected` entry. The adopt handler touches no subscription state. - **Reuses an established pattern.** The `Reconnecting` retry loop already calls `InitializeAsync` (not `ReinitializeAsync`) repeatedly on a not-connected driver, so adopting via `InitializeAsync` introduces no new driver-contract assumption. - **`Stubbed` untouched.** It never calls `InitializeAsync`, so `_initGeneration` stays 0 and its existing `ApplyDelta` stubbed-success reply is unchanged. ## Testing (offline — Akka TestKit, no live gate) Observability uses the existing pattern: with a `SetDesiredSubscriptions` set, a `Connected` entry auto-subscribes, so `driver.SubscribeCount` is the deterministic "did we connect, and on which config" probe. The shared stub gains **opt-in** per-config gating + config capture (additive — existing `InitializeShouldThrow`/`InitializeCount` fields and their tests are untouched). 1. **Reconnecting adopts a corrected config and connects** (headline bug). Init throws for `v1` → stuck `Reconnecting`; `Ask(ApplyDelta(v2))` where `v2` succeeds → reply `Success` with the correlation; the actor reaches `Connected` on `v2` (`SubscribeCount` goes to 1; the config that drove `Connected` is `v2`). 2. **Generation guard ignores a superseded in-flight result** (the race). A gated `v1` init is pending in `Connecting`; `ApplyDelta(v2)` is sent (also gated). Release `v1` → the actor **stays `Connecting`** (no auto-subscribe fires, `SubscribeCount == 0`). Release `v2` → `Connected` (`SubscribeCount == 1`). Proves a stale init cannot hijack state. 3. **Regression** — existing tests stay green: `Initialize_failure_keeps_actor_in_Reconnecting_state`, `ApplyDelta_when_Connected_calls_ReinitializeAsync_and_replies_success`. ## Files touched - `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs` — generation field + tagged result records + guarded handlers + the two new `Receive` handlers. - `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorTests.cs` — new adopt + generation-guard tests; extend the private stub with opt-in gating/config-capture. No host change, no contract change visible outside the actor, no EF/Configuration change, no bUnit. ## Risk / classification **High-risk** — actor state machine + concurrency. Full review chain (spec + code reviews) and a final integration review. The live `/run` gate (deploy a corrected config to a faulted `MAIN-opcua-eq` and confirm it adopts without a node restart) is **deferred** per the user's instruction.