# Driver-reconfigure-while-faulted: Design

**Date:** 2026-06-14
**Status:** Approved (brainstorming); ready for implementation plan
**Branch:** `feat/driver-reconfigure-while-faulted` off master `f9be3843`
**Follow-up:** `pending.md` open item #7 (also "Incidental findings from the OpcUaClient live-verify (2026-06-13)").

## Problem

A `DriverInstanceActor` (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs`)
that is stuck `Reconnecting` (its `InitializeAsync` keeps failing on a bad config) **ignores a
corrected config**. When the operator deploys a fix, `DriverHostActor.ApplyChildDelta` `Tell`s the
child a `DriverInstanceActor.ApplyDelta`, but only the `Connected()` and `Stubbed()` behaviours have
a `Receive<ApplyDelta>` handler. In `Connecting()` and `Reconnecting()` the message **dead-letters**,
so the actor keeps retrying the stale `_currentConfigJson` forever. The only workaround today is to
restart the whole node (which respawns the driver actor fresh from the current deployment artifact).

Surfaced live on 2026-06-13: a `MAIN-opcua-eq` driver was faulted from a prior bad config and never
adopted the corrected one until the node was restarted.

## Goal

An `ApplyDelta` delivered while the actor is `Connecting` or `Reconnecting` **adopts the new config and
re-initialises with it**, so a corrected deployment recovers a faulted driver without a node restart.
The adopted config is guaranteed to win, even against an older initialise attempt still in flight.

Non-goals: changing the `Connected` apply path (`ReinitializeAsync`), the host/spawn logic, any
message contract visible outside the actor, or anything in EF/Configuration. The live `/run`
verification is deferred (user-driven).

## Approach (chosen: B, generation guard + immediate re-init)

Alternatives considered:

- **A: config-swap only.** `ApplyDelta` just sets `_currentConfigJson` + replies success; the
  `Reconnecting` retry timer picks the new config up on its next tick. Smallest diff, but in
  `Connecting` (no timer) an old in-flight init that *succeeds* connects on the stale config, so the
  adopt can lose the race; in `Reconnecting` it also adds up to `reconnectInterval` of latency.
  **Rejected:** the adopt isn't guaranteed to stick, which is the whole point.
- **C: host respawns the faulted child.** `DriverHostActor` `Context.Stop`s + respawns the child
  with the new config (automating "restart node"). No state-machine change, but it drops health
  history and live subscriptions, adds restart churn, and papers over the actor contract instead of
  fixing it. **Rejected.**
- **B: generation guard + immediate re-init (chosen).** The corrected config always wins; the retry
  is immediate (no timer-tick wait); it also closes a latent concurrent-init window in `Reconnecting`.
  Contained to this one actor; fully unit-testable offline.

### Generation guard (the correctness mechanism)

Each `InitializeAsync` attempt is tagged with a monotonically increasing generation number. A result
is honoured only when it matches the latest generation; an older result comes from a **superseded**
attempt (e.g. an `ApplyDelta` adopted a new config mid-(re)connect) and is dropped.

```csharp
private int _initGeneration;

private void InitializeAsync(string driverConfigJson)
{
    _currentConfigJson = driverConfigJson;
    var generation = ++_initGeneration; // bump on the actor thread; the closure captures the local
    var self = Self;
    _ = Task.Run(async () =>
    {
        try
        {
            await _driver.InitializeAsync(driverConfigJson, CancellationToken.None);
            self.Tell(new InitializeSucceeded(generation));
        }
        catch (Exception ex)
        {
            self.Tell(new InitializeFailed(ex.Message, generation));
        }
    });
}
```

The two internal result records gain the generation token (nothing outside the actor constructs them):

```csharp
public sealed record InitializeSucceeded(int Generation);
public sealed record InitializeFailed(string Reason, int Generation);
```

Every `InitializeSucceeded`/`InitializeFailed` handler in `Connecting()` and `Reconnecting()` drops a
superseded result first:

```csharp
if (msg.Generation != _initGeneration) return; // superseded: a newer InitializeAsync replaced this one
```

`_initGeneration` is read and written **only on the actor thread**: `InitializeAsync` runs inside a
message handler, the `Task.Run` closure captures `generation` as a local, and the result handler runs
on the actor thread. No lock is needed.

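To see the guard in isolation, the following minimal sketch models it without Akka. `InitTracker`, `StartAttempt`, `OnSucceeded`, and `ConnectedConfig` are hypothetical names invented for this illustration; they stand in for the `_initGeneration` bookkeeping described above and are not part of `DriverInstanceActor`.

```csharp
using System;

// Hypothetical stand-in for the actor's init bookkeeping; not the real DriverInstanceActor.
public sealed class InitTracker
{
    private int _initGeneration;

    public string CurrentConfig { get; private set; } = "";
    public string ConnectedConfig { get; private set; } = "";

    // Mirrors InitializeAsync: adopt the config, bump the generation, and return the tag
    // that the asynchronous attempt would carry back in its result message.
    public int StartAttempt(string configJson)
    {
        CurrentConfig = configJson;
        return ++_initGeneration;
    }

    // Mirrors a guarded InitializeSucceeded handler: honour only the latest attempt.
    public bool OnSucceeded(int generation)
    {
        if (generation != _initGeneration) return false; // superseded, drop it
        ConnectedConfig = CurrentConfig;
        return true;
    }
}

public static class GenerationGuardDemo
{
    public static void Main()
    {
        var tracker = new InitTracker();
        int oldGen = tracker.StartAttempt("v1"); // original attempt, still in flight
        int newGen = tracker.StartAttempt("v2"); // an ApplyDelta adopts v2 mid-connect

        Console.WriteLine(tracker.OnSucceeded(oldGen)); // False: the stale v1 result is dropped
        Console.WriteLine(tracker.OnSucceeded(newGen)); // True: the latest attempt is honoured
        Console.WriteLine(tracker.ConnectedConfig);     // v2
    }
}
```

Because the stale result is rejected by comparing two integers, the order in which the two attempts complete does not matter in this model: only the attempt started last can move the tracker forward.
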
### The adopt handler (new, in `Connecting` + `Reconnecting`)

```csharp
Receive<ApplyDelta>(msg =>
{
    _log.Info("DriverInstance {Id}: ApplyDelta during (re)connect: adopting new config, re-initialising now",
        _driverInstanceId);
    InitializeAsync(msg.DriverConfigJson); // swaps _currentConfigJson, bumps generation (supersedes the
                                           // in-flight init), starts a fresh attempt immediately
    Sender.Tell(new ApplyResult(true, "config adopted; reinitializing", msg.Correlation));
});
```

The actor **stays in its current state**; the new init's result drives the next transition through
the existing (now generation-guarded) handlers:

- `Connecting`: new init succeeds → `Become(Connected)`; fails → `Become(Reconnecting)`.
- `Reconnecting`: new init succeeds → `Become(Connected)` (and the existing handler cancels the retry
  timer); fails → no-op, the retry timer keeps retrying the **new** config (now in `_currentConfigJson`).

The `Reconnecting` retry timer is **left running**, so if the immediate attempt fails it continues
retrying the new config. If a timer tick starts a second attempt while the immediate one is still in
flight, the generation guard honours only the result of the latest attempt, so the redundant attempt
is harmless.

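The transitions above, including the overlap between the immediate attempt and a timer tick, can be walked through with a small single-threaded model. This is only a sketch under stated assumptions: `AdoptModel`, `DriverState`, `StartInit`, `RetryTick`, and the explicit `State` check are hypothetical, and the real actor expresses the same rules through `Become` and its `Receive` handlers rather than an enum.

```csharp
using System;

public enum DriverState { Connecting, Reconnecting, Connected }

// Hypothetical single-threaded model of the not-connected states; not the real actor.
public sealed class AdoptModel
{
    private int _initGeneration;

    public DriverState State { get; private set; } = DriverState.Connecting;
    public string CurrentConfig { get; private set; } = "";

    // Every attempt (first init, timer retry, or adopt) goes through here.
    public int StartInit(string configJson)
    {
        CurrentConfig = configJson;
        return ++_initGeneration;
    }

    // Retry timer tick in Reconnecting: retry whatever config is current.
    public int RetryTick() => StartInit(CurrentConfig);

    // ApplyDelta while not connected: adopt and re-init immediately; State is unchanged.
    public int ApplyDelta(string configJson) => StartInit(configJson);

    public void OnSucceeded(int generation)
    {
        // The State check models Connected having no init-result handler.
        if (State == DriverState.Connected || generation != _initGeneration) return;
        State = DriverState.Connected;
    }

    public void OnFailed(int generation)
    {
        if (State == DriverState.Connected || generation != _initGeneration) return;
        State = DriverState.Reconnecting; // if already Reconnecting, nothing changes
    }
}

public static class AdoptDemo
{
    public static void Main()
    {
        var actor = new AdoptModel();
        int g1 = actor.StartInit("v1");
        actor.OnFailed(g1);               // bad config: Connecting -> Reconnecting
        int g2 = actor.ApplyDelta("v2");  // operator deploys the fix
        int g3 = actor.RetryTick();       // timer fires while g2 is still in flight
        actor.OnSucceeded(g2);            // superseded by g3, so it is dropped
        Console.WriteLine(actor.State);   // Reconnecting
        actor.OnSucceeded(g3);            // latest attempt is honoured
        Console.WriteLine($"{actor.State} on {actor.CurrentConfig}"); // Connected on v2
    }
}
```

Both `g2` and `g3` run against `v2` in this walk-through, so dropping the earlier result loses nothing: whichever attempt is latest connects on the adopted config.
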
### Why it's correct

- **No stale hijack.** In `Connecting`, if the old in-flight init's `InitializeSucceeded(oldGen)` lands
  while still connecting, the guard drops it, so the actor cannot connect on the stale config. Only the
  new-generation result transitions it. (If the stale result lands *after* the new init already reached
  `Connected`, it dead-letters harmlessly, because `Connected` has no `InitializeSucceeded` handler.)
- **No subscription churn / leak.** `Connecting` has no live subscription yet; `Reconnecting` already
  detached on its entry (`DisconnectObserved`/`ForceReconnect` call `DetachSubscription`). `_desiredRefs`
  is retained and re-applied by `ResubscribeDesired` on the next `Connected` entry. The adopt handler
  touches no subscription state.
- **Reuses an established pattern.** The `Reconnecting` retry loop already calls `InitializeAsync`
  (not `ReinitializeAsync`) repeatedly on a not-connected driver, so adopting via `InitializeAsync`
  introduces no new driver-contract assumption.
- **`Stubbed` untouched.** It never calls `InitializeAsync`, so `_initGeneration` stays 0 and its
  existing `ApplyDelta` stubbed-success reply is unchanged.

## Testing (offline: Akka TestKit, no live gate)

Observability uses the existing pattern: with a `SetDesiredSubscriptions` set, a `Connected` entry
auto-subscribes, so `driver.SubscribeCount` is the deterministic "did we connect, and on which config"
probe. The shared stub gains **opt-in** per-config gating + config capture (additive: the existing
`InitializeShouldThrow`/`InitializeCount` fields and their tests are untouched).

1. **Reconnecting adopts a corrected config and connects** (headline bug). Init throws for `v1` →
   stuck `Reconnecting`; `Ask<ApplyResult>(ApplyDelta(v2))` where `v2` succeeds → reply `Success` with
   the correlation; the actor reaches `Connected` on `v2` (`SubscribeCount` goes to 1; the config that
   drove `Connected` is `v2`).
2. **Generation guard ignores a superseded in-flight result** (the race). A gated `v1` init is pending
   in `Connecting`; `ApplyDelta(v2)` is sent (also gated). Release `v1` → the actor **stays
   `Connecting`** (no auto-subscribe fires, `SubscribeCount == 0`). Release `v2` → `Connected`
   (`SubscribeCount == 1`). Proves a stale init cannot hijack state.
3. **Regression.** Existing tests stay green: `Initialize_failure_keeps_actor_in_Reconnecting_state`,
   `ApplyDelta_when_Connected_calls_ReinitializeAsync_and_replies_success`.

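One way the opt-in gating and config capture could look is sketched below. Everything here is an assumption made for illustration: `GatedStubDriver`, `Gate`, `Release`, and `LastCompletedConfig` are hypothetical names, and the real stub in `DriverInstanceActorTests.cs` and the driver interface it implements may be shaped differently. The idea is that a test registers a gate for a config, the matching `InitializeAsync` call parks on a `TaskCompletionSource`, and the test releases each config in whatever order the race under test requires.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical test double; the real stub and driver interface may differ.
public sealed class GatedStubDriver
{
    private readonly ConcurrentDictionary<string, TaskCompletionSource<bool>> _gates =
        new ConcurrentDictionary<string, TaskCompletionSource<bool>>();
    private int _initializeCount;
    private volatile string _lastCompletedConfig = "";

    public int InitializeCount => Volatile.Read(ref _initializeCount);
    public string LastCompletedConfig => _lastCompletedConfig;

    // Opt-in: only configs registered here block; any other config completes immediately.
    public void Gate(string configJson) =>
        _gates[configJson] = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

    // Let the pending (or a future) init for this config finish successfully.
    public void Release(string configJson) => _gates[configJson].TrySetResult(true);

    public async Task InitializeAsync(string configJson, CancellationToken cancellationToken)
    {
        Interlocked.Increment(ref _initializeCount);
        if (_gates.TryGetValue(configJson, out var gate))
            await gate.Task.ConfigureAwait(false); // parked until the test releases it
        _lastCompletedConfig = configJson;         // config capture
    }
}

public static class GatedStubDemo
{
    public static async Task Main()
    {
        var driver = new GatedStubDriver();
        driver.Gate("v1");
        driver.Gate("v2");

        Task v1 = driver.InitializeAsync("v1", CancellationToken.None);
        Task v2 = driver.InitializeAsync("v2", CancellationToken.None);

        driver.Release("v1");              // the stale attempt finishes first
        await v1;
        Console.WriteLine(v2.IsCompleted); // False: v2 is still gated

        driver.Release("v2");
        await v2;
        Console.WriteLine(driver.LastCompletedConfig); // v2
    }
}
```

In test 2 the same release order (`v1` first, then `v2`) would be driven against the actor, with `SubscribeCount` rather than `LastCompletedConfig` serving as the probe for which result was honoured.
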
## Files touched

- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs`: generation field + tagged
  result records + guarded handlers + the two new `Receive<ApplyDelta>` handlers.
- `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorTests.cs`: new adopt +
  generation-guard tests; extend the private stub with opt-in gating/config-capture.

No host change, no contract change visible outside the actor, no EF/Configuration change, no bUnit.

## Risk / classification

**High-risk**: actor state machine + concurrency. Requires the full review chain (spec + code
reviews) and a final integration review. The live `/run` gate (deploy a corrected config to a faulted
`MAIN-opcua-eq` and confirm it adopts without a node restart) is **deferred** per the user's
instruction.