diff --git a/docs/plans/2026-06-14-driver-reconfigure-while-faulted-design.md b/docs/plans/2026-06-14-driver-reconfigure-while-faulted-design.md new file mode 100644 index 00000000..e2f904d5 --- /dev/null +++ b/docs/plans/2026-06-14-driver-reconfigure-while-faulted-design.md @@ -0,0 +1,166 @@ +# Driver-reconfigure-while-faulted — Design + +**Date:** 2026-06-14 +**Status:** Approved (brainstorming) — ready for implementation plan +**Branch:** `feat/driver-reconfigure-while-faulted` off master `f9be3843` +**Follow-up:** `pending.md` open item #7 (also "Incidental findings from the OpcUaClient live-verify (2026-06-13)"). + +## Problem + +A `DriverInstanceActor` (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs`) +that is stuck `Reconnecting` (its `InitializeAsync` keeps failing on a bad config) **ignores a +corrected config**. When the operator deploys a fix, `DriverHostActor.ApplyChildDelta` `Tell`s the +child a `DriverInstanceActor.ApplyDelta` — but only the `Connected()` and `Stubbed()` behaviours have +a `Receive` handler. In `Connecting()` and `Reconnecting()` the message **dead-letters**, +so the actor keeps retrying the stale `_currentConfigJson` forever. The only workaround today is to +restart the whole node (which respawns the driver actor fresh from the current deployment artifact). + +Surfaced live on 2026-06-13: a `MAIN-opcua-eq` driver was faulted from a prior bad config and never +adopted the corrected one until the node was restarted. + +## Goal + +An `ApplyDelta` delivered while the actor is `Connecting` or `Reconnecting` **adopts the new config and +re-initialises with it**, so a corrected deployment recovers a faulted driver without a node restart — +and the adopted config is guaranteed to win, even against an old initialise still in flight. + +Non-goals: changing the `Connected` apply path (`ReinitializeAsync`), the host/spawn logic, any +message contract visible outside the actor, or anything in EF/Configuration. The live `/run` +verification is deferred (user-driven). + +## Approach (chosen: B — generation guard + immediate re-init) + +Alternatives considered: + +- **A — config-swap only.** `ApplyDelta` just sets `_currentConfigJson` + replies success; the + `Reconnecting` retry timer picks the new config up on its next tick. Smallest diff, but in + `Connecting` (no timer) an old in-flight init that *succeeds* connects on the stale config — the + adopt can lose the race — plus up to `reconnectInterval` latency. **Rejected:** the adopt isn't + guaranteed to stick, which is the whole point. +- **C — host respawns the faulted child.** `DriverHostActor` `Context.Stop`s + respawns the child + with the new config (automating "restart node"). No state-machine change, but drops health history + and live subscriptions, adds restart churn, and papers over the actor contract instead of fixing + it. **Rejected.** +- **B — generation guard + immediate re-init (chosen).** The corrected config always wins; immediate + retry (no timer-tick wait); also closes a latent concurrent-init window in `Reconnecting`. + Contained to this one actor; fully unit-testable offline. + +### Generation guard (the correctness mechanism) + +Each `InitializeAsync` attempt is tagged with a monotonic generation. A result is honoured only when +it matches the latest generation; an older result is from a **superseded** attempt (e.g. an +`ApplyDelta` adopted a new config mid-(re)connect) and is dropped. + +```csharp +private int _initGeneration; + +private void InitializeAsync(string driverConfigJson) +{ + _currentConfigJson = driverConfigJson; + var generation = ++_initGeneration; // bump on the actor thread; the closure captures the local + var self = Self; + _ = Task.Run(async () => + { + try + { + await _driver.InitializeAsync(driverConfigJson, CancellationToken.None); + self.Tell(new InitializeSucceeded(generation)); + } + catch (Exception ex) + { + self.Tell(new InitializeFailed(ex.Message, generation)); + } + }); +} +``` + +The two internal result records gain the token (nothing outside the actor constructs them): + +```csharp +public sealed record InitializeSucceeded(int Generation); +public sealed record InitializeFailed(string Reason, int Generation); +``` + +Every `InitializeSucceeded`/`InitializeFailed` handler in `Connecting()` and `Reconnecting()` drops a +superseded result first: + +```csharp +if (msg.Generation != _initGeneration) return; // superseded — a newer InitializeAsync replaced this one +``` + +`_initGeneration` is read/written **only on the actor thread** — `InitializeAsync` runs inside a +message handler, the `Task.Run` closure captures `generation` as a local, and the result handler runs +on the actor thread. No lock needed. + +### The adopt handler (new — `Connecting` + `Reconnecting`) + +```csharp +Receive(msg => +{ + _log.Info("DriverInstance {Id}: ApplyDelta during (re)connect — adopting new config, re-initialising now", + _driverInstanceId); + InitializeAsync(msg.DriverConfigJson); // swaps _currentConfigJson, bumps generation (supersedes the + // in-flight init), starts a fresh attempt immediately + Sender.Tell(new ApplyResult(true, "config adopted; reinitializing", msg.Correlation)); +}); +``` + +The actor **stays in its current state**; the new init's result drives the next transition through +the existing (now generation-guarded) handlers: + +- `Connecting`: new init succeeds → `Become(Connected)`; fails → `Become(Reconnecting)`. +- `Reconnecting`: new init succeeds → `Become(Connected)` (and the existing handler cancels the retry + timer); fails → no-op, the retry timer keeps retrying the **new** config (now in `_currentConfigJson`). + +The `Reconnecting` retry timer is **left running** — if the immediate attempt fails it continues +retrying the new config. A redundant concurrent attempt (immediate + a timer tick) is harmlessly +deduped by the generation guard. + +### Why it's correct + +- **No stale hijack.** In `Connecting`, if the old in-flight init's `InitializeSucceeded(oldGen)` lands + while still connecting, the guard drops it, so the actor cannot connect on the stale config. Only the + new-generation result transitions it. (If the stale result lands *after* the new init already reached + `Connected`, it dead-letters harmlessly — `Connected` has no `InitializeSucceeded` handler.) +- **No subscription churn / leak.** `Connecting` has no live subscription yet; `Reconnecting` already + detached on its entry (`DisconnectObserved`/`ForceReconnect` call `DetachSubscription`). `_desiredRefs` + is retained and re-applied by `ResubscribeDesired` on the next `Connected` entry. The adopt handler + touches no subscription state. +- **Reuses an established pattern.** The `Reconnecting` retry loop already calls `InitializeAsync` + (not `ReinitializeAsync`) repeatedly on a not-connected driver, so adopting via `InitializeAsync` + introduces no new driver-contract assumption. +- **`Stubbed` untouched.** It never calls `InitializeAsync`, so `_initGeneration` stays 0 and its + existing `ApplyDelta` stubbed-success reply is unchanged. + +## Testing (offline — Akka TestKit, no live gate) + +Observability uses the existing pattern: with a `SetDesiredSubscriptions` set, a `Connected` entry +auto-subscribes, so `driver.SubscribeCount` is the deterministic "did we connect, and on which config" +probe. The shared stub gains **opt-in** per-config gating + config capture (additive — existing +`InitializeShouldThrow`/`InitializeCount` fields and their tests are untouched). + +1. **Reconnecting adopts a corrected config and connects** (headline bug). Init throws for `v1` → + stuck `Reconnecting`; `Ask(ApplyDelta(v2))` where `v2` succeeds → reply `Success` with + the correlation; the actor reaches `Connected` on `v2` (`SubscribeCount` goes to 1; the config that + drove `Connected` is `v2`). +2. **Generation guard ignores a superseded in-flight result** (the race). A gated `v1` init is pending + in `Connecting`; `ApplyDelta(v2)` is sent (also gated). Release `v1` → the actor **stays + `Connecting`** (no auto-subscribe fires, `SubscribeCount == 0`). Release `v2` → `Connected` + (`SubscribeCount == 1`). Proves a stale init cannot hijack state. +3. **Regression** — existing tests stay green: `Initialize_failure_keeps_actor_in_Reconnecting_state`, + `ApplyDelta_when_Connected_calls_ReinitializeAsync_and_replies_success`. + +## Files touched + +- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs` — generation field + tagged + result records + guarded handlers + the two new `Receive` handlers. +- `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorTests.cs` — new adopt + + generation-guard tests; extend the private stub with opt-in gating/config-capture. + +No host change, no contract change visible outside the actor, no EF/Configuration change, no bUnit. + +## Risk / classification + +**High-risk** — actor state machine + concurrency. Full review chain (spec + code reviews) and a final +integration review. The live `/run` gate (deploy a corrected config to a faulted `MAIN-opcua-eq` and +confirm it adopts without a node restart) is **deferred** per the user's instruction.