docs(drivers): design for adopting config mid-(re)connect (reconfigure-while-faulted #7)
This commit is contained in:
@@ -0,0 +1,166 @@
|
||||
# Driver-reconfigure-while-faulted — Design
|
||||
|
||||
**Date:** 2026-06-14
|
||||
**Status:** Approved (brainstorming) — ready for implementation plan
|
||||
**Branch:** `feat/driver-reconfigure-while-faulted` off master `f9be3843`
|
||||
**Follow-up:** `pending.md` open item #7 (also "Incidental findings from the OpcUaClient live-verify (2026-06-13)").
|
||||
|
||||
## Problem
|
||||
|
||||
A `DriverInstanceActor` (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs`)
|
||||
that is stuck `Reconnecting` (its `InitializeAsync` keeps failing on a bad config) **ignores a
|
||||
corrected config**. When the operator deploys a fix, `DriverHostActor.ApplyChildDelta` `Tell`s the
|
||||
child a `DriverInstanceActor.ApplyDelta` — but only the `Connected()` and `Stubbed()` behaviours have
|
||||
a `Receive<ApplyDelta>` handler. In `Connecting()` and `Reconnecting()` the message **dead-letters**,
|
||||
so the actor keeps retrying the stale `_currentConfigJson` forever. The only workaround today is to
|
||||
restart the whole node (which respawns the driver actor fresh from the current deployment artifact).
|
||||
|
||||
Surfaced live on 2026-06-13: a `MAIN-opcua-eq` driver was faulted from a prior bad config and never
|
||||
adopted the corrected one until the node was restarted.
|
||||
|
||||
## Goal
|
||||
|
||||
An `ApplyDelta` delivered while the actor is `Connecting` or `Reconnecting` **adopts the new config and
|
||||
re-initialises with it**, so a corrected deployment recovers a faulted driver without a node restart —
|
||||
and the adopted config is guaranteed to win, even against an old initialise still in flight.
|
||||
|
||||
Non-goals: changing the `Connected` apply path (`ReinitializeAsync`), the host/spawn logic, any
|
||||
message contract visible outside the actor, or anything in EF/Configuration. The live `/run`
|
||||
verification is deferred (user-driven).
|
||||
|
||||
## Approach (chosen: B — generation guard + immediate re-init)
|
||||
|
||||
Alternatives considered:
|
||||
|
||||
- **A — config-swap only.** `ApplyDelta` just sets `_currentConfigJson` + replies success; the
|
||||
`Reconnecting` retry timer picks the new config up on its next tick. Smallest diff, but in
|
||||
`Connecting` (no timer) an old in-flight init that *succeeds* connects on the stale config — the
|
||||
adopt can lose the race — plus up to `reconnectInterval` latency. **Rejected:** the adopt isn't
|
||||
guaranteed to stick, which is the whole point.
|
||||
- **C — host respawns the faulted child.** `DriverHostActor` `Context.Stop`s + respawns the child
|
||||
with the new config (automating "restart node"). No state-machine change, but drops health history
|
||||
and live subscriptions, adds restart churn, and papers over the actor contract instead of fixing
|
||||
it. **Rejected.**
|
||||
- **B — generation guard + immediate re-init (chosen).** The corrected config always wins; immediate
|
||||
retry (no timer-tick wait); also closes a latent concurrent-init window in `Reconnecting`.
|
||||
Contained to this one actor; fully unit-testable offline.
|
||||
|
||||
### Generation guard (the correctness mechanism)
|
||||
|
||||
Each `InitializeAsync` attempt is tagged with a monotonic generation. A result is honoured only when
|
||||
it matches the latest generation; an older result is from a **superseded** attempt (e.g. an
|
||||
`ApplyDelta` adopted a new config mid-(re)connect) and is dropped.
|
||||
|
||||
```csharp
|
||||
private int _initGeneration;
|
||||
|
||||
private void InitializeAsync(string driverConfigJson)
|
||||
{
|
||||
_currentConfigJson = driverConfigJson;
|
||||
var generation = ++_initGeneration; // bump on the actor thread; the closure captures the local
|
||||
var self = Self;
|
||||
_ = Task.Run(async () =>
|
||||
{
|
||||
try
|
||||
{
|
||||
await _driver.InitializeAsync(driverConfigJson, CancellationToken.None);
|
||||
self.Tell(new InitializeSucceeded(generation));
|
||||
}
|
||||
catch (Exception ex)
|
||||
{
|
||||
self.Tell(new InitializeFailed(ex.Message, generation));
|
||||
}
|
||||
});
|
||||
}
|
||||
```
|
||||
|
||||
The two internal result records gain the token (nothing outside the actor constructs them):
|
||||
|
||||
```csharp
|
||||
public sealed record InitializeSucceeded(int Generation);
|
||||
public sealed record InitializeFailed(string Reason, int Generation);
|
||||
```
|
||||
|
||||
Every `InitializeSucceeded`/`InitializeFailed` handler in `Connecting()` and `Reconnecting()` drops a
|
||||
superseded result first:
|
||||
|
||||
```csharp
|
||||
if (msg.Generation != _initGeneration) return; // superseded — a newer InitializeAsync replaced this one
|
||||
```
|
||||
|
||||
`_initGeneration` is read/written **only on the actor thread** — `InitializeAsync` runs inside a
|
||||
message handler, the `Task.Run` closure captures `generation` as a local, and the result handler runs
|
||||
on the actor thread. No lock needed.
|
||||
|
||||
### The adopt handler (new — `Connecting` + `Reconnecting`)
|
||||
|
||||
```csharp
|
||||
Receive<ApplyDelta>(msg =>
|
||||
{
|
||||
_log.Info("DriverInstance {Id}: ApplyDelta during (re)connect — adopting new config, re-initialising now",
|
||||
_driverInstanceId);
|
||||
InitializeAsync(msg.DriverConfigJson); // swaps _currentConfigJson, bumps generation (supersedes the
|
||||
// in-flight init), starts a fresh attempt immediately
|
||||
Sender.Tell(new ApplyResult(true, "config adopted; reinitializing", msg.Correlation));
|
||||
});
|
||||
```
|
||||
|
||||
The actor **stays in its current state**; the new init's result drives the next transition through
|
||||
the existing (now generation-guarded) handlers:
|
||||
|
||||
- `Connecting`: new init succeeds → `Become(Connected)`; fails → `Become(Reconnecting)`.
|
||||
- `Reconnecting`: new init succeeds → `Become(Connected)` (and the existing handler cancels the retry
|
||||
timer); fails → no-op, the retry timer keeps retrying the **new** config (now in `_currentConfigJson`).
|
||||
|
||||
The `Reconnecting` retry timer is **left running** — if the immediate attempt fails it continues
|
||||
retrying the new config. A redundant concurrent attempt (immediate + a timer tick) is harmlessly
|
||||
deduped by the generation guard.
|
||||
|
||||
### Why it's correct
|
||||
|
||||
- **No stale hijack.** In `Connecting`, if the old in-flight init's `InitializeSucceeded(oldGen)` lands
|
||||
while still connecting, the guard drops it, so the actor cannot connect on the stale config. Only the
|
||||
new-generation result transitions it. (If the stale result lands *after* the new init already reached
|
||||
`Connected`, it dead-letters harmlessly — `Connected` has no `InitializeSucceeded` handler.)
|
||||
- **No subscription churn / leak.** `Connecting` has no live subscription yet; `Reconnecting` already
|
||||
detached on its entry (`DisconnectObserved`/`ForceReconnect` call `DetachSubscription`). `_desiredRefs`
|
||||
is retained and re-applied by `ResubscribeDesired` on the next `Connected` entry. The adopt handler
|
||||
touches no subscription state.
|
||||
- **Reuses an established pattern.** The `Reconnecting` retry loop already calls `InitializeAsync`
|
||||
(not `ReinitializeAsync`) repeatedly on a not-connected driver, so adopting via `InitializeAsync`
|
||||
introduces no new driver-contract assumption.
|
||||
- **`Stubbed` untouched.** It never calls `InitializeAsync`, so `_initGeneration` stays 0 and its
|
||||
existing `ApplyDelta` stubbed-success reply is unchanged.
|
||||
|
||||
## Testing (offline — Akka TestKit, no live gate)
|
||||
|
||||
Observability uses the existing pattern: with a `SetDesiredSubscriptions` set, a `Connected` entry
|
||||
auto-subscribes, so `driver.SubscribeCount` is the deterministic "did we connect, and on which config"
|
||||
probe. The shared stub gains **opt-in** per-config gating + config capture (additive — existing
|
||||
`InitializeShouldThrow`/`InitializeCount` fields and their tests are untouched).
|
||||
|
||||
1. **Reconnecting adopts a corrected config and connects** (headline bug). Init throws for `v1` →
|
||||
stuck `Reconnecting`; `Ask<ApplyResult>(ApplyDelta(v2))` where `v2` succeeds → reply `Success` with
|
||||
the correlation; the actor reaches `Connected` on `v2` (`SubscribeCount` goes to 1; the config that
|
||||
drove `Connected` is `v2`).
|
||||
2. **Generation guard ignores a superseded in-flight result** (the race). A gated `v1` init is pending
|
||||
in `Connecting`; `ApplyDelta(v2)` is sent (also gated). Release `v1` → the actor **stays
|
||||
`Connecting`** (no auto-subscribe fires, `SubscribeCount == 0`). Release `v2` → `Connected`
|
||||
(`SubscribeCount == 1`). Proves a stale init cannot hijack state.
|
||||
3. **Regression** — existing tests stay green: `Initialize_failure_keeps_actor_in_Reconnecting_state`,
|
||||
`ApplyDelta_when_Connected_calls_ReinitializeAsync_and_replies_success`.
|
||||
|
||||
## Files touched
|
||||
|
||||
- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs` — generation field + tagged
|
||||
result records + guarded handlers + the two new `Receive<ApplyDelta>` handlers.
|
||||
- `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorTests.cs` — new adopt +
|
||||
generation-guard tests; extend the private stub with opt-in gating/config-capture.
|
||||
|
||||
No host change, no contract change visible outside the actor, no EF/Configuration change, no bUnit.
|
||||
|
||||
## Risk / classification
|
||||
|
||||
**High-risk** — actor state machine + concurrency. Full review chain (spec + code reviews) and a final
|
||||
integration review. The live `/run` gate (deploy a corrected config to a faulted `MAIN-opcua-eq` and
|
||||
confirm it adopts without a node restart) is **deferred** per the user's instruction.
|
||||
Reference in New Issue
Block a user