docs(plans): driver-pages Phase 10 reconnect-transition E2E + close-out design

Joseph Doherty
2026-06-18 08:47:12 -04:00
parent 08c7a2bd42
commit 482418c85a
@@ -0,0 +1,157 @@
# Driver-pages Phase 10 — reconnect-transition E2E + plan close-out — Design
**Date:** 2026-06-18
**Status:** Approved
**Base:** master `08c7a2bd`
**Branch:** `feat/driver-pages-reconnect-e2e`
## Context
The `2026-05-28-adminui-driver-pages` plan is **fully shipped** (Phases 0–10), but its
`.tasks.json` is comprehensively stale (it still marks Phases 6–10 "pending" while the
code + tests all exist and are real). A brainstorming pass on 2026-06-18 verified, seam
by seam, that:
- **Phase 6** (live `DriverStatusPanel`) is built end-to-end: `DriverHealthChanged`
contract, `AkkaDriverHealthPublisher` (DI-bound in prod, invoked at
`DriverInstanceActor.PublishHealthSnapshot`), `DriverStatusSignalRBridge` (spawned
admin-gated at `Program.cs:196`), the shared-singleton snapshot store, the hub
(`MapHub<DriverStatusHub>`), and the panel wired into all 9 driver pages. Page
`DriverInstanceId` and actor `_driverInstanceId` key on the same EF value — no mismatch.
- **Phase 8** (Reconnect/Restart) is built: messages, `AdminOperationsActor` +
`DriverHostActor` handlers, DriverOperator-gated buttons in the panel.
- **Phase 10** automated E2E tests (`DriverTestConnectE2eTests`, `DriverReconnectE2eTests`,
`DriverStatusHubE2eTests`) are real, skip-gated, and honest about their scope.
Two genuine remnants remain, both flagged by the `DriverReconnectE2eTests` /
`DriverStatusHubE2eTests` scope-notes:
1. **The one real coverage gap:** no test proves a *deployed* driver actually transitions
**Healthy → Reconnecting → Healthy** in response to `ReconnectDriver`. The existing
10.2 test only asserts command *ingestion* (the round-trip reply), not the resulting
actor health transition.
2. **A harness-fidelity gap behind it:** `TwoNodeClusterHarness.BuildNodeAsync` calls
`WithOtOpcUaRuntimeActors()` (the Akka actor spawn) but **not** `AddOtOpcUaRuntime()`
(the DI registration that binds `IDriverHealthPublisher → AkkaDriverHealthPublisher`).
Consequently, in *every* current `Host.IntegrationTests` test, deployed driver actors fall
back to `NullDriverHealthPublisher` and emit no health to the `driver-health` DPS topic.
The harness also leaves `IDriverFactory` at `NullDriverFactory`, so deployed drivers
reach `Stubbed`, never `Connected`.
The plan's stale trackers caused this item to be (wrongly) re-listed as OPEN in
`stillpending.md` §A.9. This phase closes both remnants and reconciles the trackers.
## Goal
Close the genuine `ReconnectDriver` health-transition E2E gap, fix the harness-fidelity gap
behind it, prove the full Phase 10 driver suite green, and reconcile the stale trackers so
this fully-shipped plan stops re-triggering as backlog.
## Design
### 1. Harness fidelity fix
In `TwoNodeClusterHarness.BuildNodeAsync`:
- Add `builder.Services.AddOtOpcUaRuntime()` **before** `AddAkka` (matching production
`Program.cs:87`). This binds the real `AkkaDriverHealthPublisher` so deployed drivers
publish health to the `driver-health` DPS topic. It also seeds the `Null*` runtime
defaults (`IHistorianDataSource`, `IAlarmHistorianSink`, `IHistoryWriter`,
`IDriverFactory`, …) — all harmless no-ops that don't change existing test behavior
(nothing in the current suite subscribes to driver-health, and the Null sinks are inert).
- Add an **opt-in** seam to inject a test `IDriverFactory` for tests that need a connecting
driver. Default (no factory supplied) leaves the existing behavior untouched. Mechanism:
a `StartAsync` parameter (e.g. `IDriverFactory? driverFactory = null`) threaded into
`BuildNodeAsync`; when supplied, register it as a singleton **after** `AddOtOpcUaRuntime`
so it wins over the `Null` default (last-registration-wins / replace).
This change is fidelity-improving for the whole suite: the existing `DriverStatusHubE2eTests`
keeps spawning its own bridge, but real driver health now flows in tests that deploy drivers.
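
The opt-in override described above can be sketched as follows. This is a minimal illustration, not the harness code: `IDriverFactory`, `NullDriverFactory`, and `FakeDriverFactory` are reduced to hypothetical stand-ins (the real shapes are not shown in this design), and `AddRuntimeDefaults` stands in for `AddOtOpcUaRuntime()`. It relies only on the documented `Microsoft.Extensions.DependencyInjection` behavior that resolving a single service returns the last registration.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical stand-ins: the real IDriverFactory / NullDriverFactory are not shown in this design.
public interface IDriverFactory { string Name { get; } }
public sealed class NullDriverFactory : IDriverFactory { public string Name => "null"; }
public sealed class FakeDriverFactory : IDriverFactory { public string Name => "fake"; }

public static class HarnessRegistrationSketch
{
    // Stand-in for AddOtOpcUaRuntime(): seeds the Null default (assumed behavior).
    public static IServiceCollection AddRuntimeDefaults(this IServiceCollection services)
        => services.AddSingleton<IDriverFactory, NullDriverFactory>();

    // Mirrors the proposed seam: the optional factory is registered AFTER the defaults,
    // so a single-service resolve returns it instead of the Null default.
    public static IServiceProvider BuildNode(IDriverFactory? driverFactory = null)
    {
        var services = new ServiceCollection();
        services.AddRuntimeDefaults();
        if (driverFactory is not null)
            services.AddSingleton<IDriverFactory>(driverFactory);
        return services.BuildServiceProvider();
    }
}
```

With no argument, `BuildNode().GetRequiredService<IDriverFactory>()` resolves the Null default; with a factory supplied, it resolves the supplied instance. If any consumer resolved `IEnumerable<IDriverFactory>` it would see both registrations, in which case replacing the descriptor (for example via the `Replace` extension) may be the safer variant.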
### 2. The reconnect-transition E2E test (the real gap)
A new test (in `DriverReconnectE2eTests.cs`, or a focused new file) that:
1. Starts the harness with a **controllable fake `IDriverFactory`** (see decision below).
2. Seeds a driver row + minimal equipment/tag using the existing
`SeedDeploymentWithEquipmentTags` precedent (from `DriverHostActorLiveValueTests`),
bound to `NodeANodeId`.
3. Triggers a deploy (`DispatchDeployment`, or `POST /api/deployments` with
`HarnessDeployApiKey`).
4. Spawns the **real** `DriverStatusSignalRBridge` over the real DI snapshot store (the
store is the observation surface; a mock `IHubContext` captures the hub push the same
way the existing hub test does).
5. Waits (condition-poll, generous timeout) for the snapshot store to report the deployed
instance as `Healthy`.
6. Dispatches `ReconnectDriver` via `IAdminOperationsClient` (the real cluster-singleton
path the AdminUI button uses).
7. Asserts the store observes the transition **`Reconnecting`** and then returns to
**`Healthy`** within a timeout — proving the full wiring:
`ReconnectDriver → AdminOperationsActor → DriverHostActor.HandleReconnectDriver →
DriverInstanceActor FSM (ForceReconnect → Become(Reconnecting) → Become(Connected)) →
PublishHealthSnapshot → driver-health DPS topic → DriverStatusSignalRBridge → store`.
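
Steps 5 and 7 can be sketched with two small helpers. This is an illustrative sketch under stated assumptions: the names `WaitUntilAsync` and `ContainsInOrder` are hypothetical, and health states are modelled as plain strings because the real snapshot type is not shown here. Since a fake driver may reconnect almost instantly, a purely polled store could miss a short-lived `Reconnecting`; recording every captured push into a list and checking the order afterwards is one way to avoid that.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

public static class TransitionAssertSketch
{
    // Polls `condition` until it returns true or `timeout` elapses; returns whether it was met.
    public static async Task<bool> WaitUntilAsync(
        Func<bool> condition, TimeSpan timeout, TimeSpan? pollInterval = null)
    {
        var interval = pollInterval ?? TimeSpan.FromMilliseconds(50);
        var clock = Stopwatch.StartNew();
        while (clock.Elapsed < timeout)
        {
            if (condition()) return true;
            await Task.Delay(interval);
        }
        return condition(); // one final check at the deadline
    }

    // True when the `expected` states occur in `observed` in that order
    // (other states may be interleaved between them).
    public static bool ContainsInOrder(IReadOnlyList<string> observed, params string[] expected)
    {
        var next = 0;
        foreach (var state in observed)
        {
            if (next < expected.Length && state == expected[next]) next++;
        }
        return next == expected.Length;
    }
}
```

A test could append each captured state to a thread-safe list, then wait with `WaitUntilAsync(() => ContainsInOrder(snapshotOfList, "Healthy", "Reconnecting", "Healthy"), timeout)` and assert the result with Shouldly.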
#### Decision: controllable fake driver factory (not the real Modbus sim)
**Recommended and approved:** observe the transition via a deterministic, controllable fake
`IDriver` / `IDriverFactory` test double rather than a real Modbus sim connection.
Rationale:
- **Determinism, no flakiness.** A fake driver whose connect succeeds drives the actor to
`Connected` immediately; `ReconnectDriver` re-enters `Reconnecting` then `Connected`
deterministically. No sim timing, no skip-gate, runs everywhere.
- **Smaller blast radius.** The real-sim path additionally needs the full driver-factory
bootstrap (all 9 driver factories + deps) wired into the shared harness.
- **The wiring is what matters.** This gap is about the *health-transition + command
wiring*, not the Modbus protocol. The real Modbus TCP connect/reconnect is already
covered by the `Modbus.IntegrationTests` and the 10.1 `TestConnect` E2E (against the sim).
The fake `IDriver` exposes a minimal controllable surface (succeed-connect, optionally
signal a fault) sufficient to walk the FSM through `Connected → Reconnecting → Connected`.
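
A rough sketch of such a controllable surface follows. The `IConnectableDriver` interface is a hypothetical stand-in invented for illustration; the real `IDriver` contract is not shown in this design and will differ. The sketch only demonstrates the two behaviors named above: a connect that succeeds by default, and an optional one-shot fault.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical minimal contract: NOT the real IDriver interface.
public interface IConnectableDriver
{
    Task ConnectAsync(CancellationToken ct);
}

public sealed class ControllableFakeDriver : IConnectableDriver
{
    private int _connectCalls;
    private volatile bool _failNextConnect;

    // Number of connect attempts seen so far (successful or faulted).
    public int ConnectCalls => Volatile.Read(ref _connectCalls);

    // Arms a single simulated fault; the attempt after it succeeds again.
    public void FailNextConnect() => _failNextConnect = true;

    public Task ConnectAsync(CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        Interlocked.Increment(ref _connectCalls);
        if (_failNextConnect)
        {
            _failNextConnect = false;
            return Task.FromException(new InvalidOperationException("simulated connect fault"));
        }
        return Task.CompletedTask; // succeed immediately: no sim timing involved
    }
}
```

Because connect completes synchronously, the test controls timing entirely through its own waits, and `ConnectCalls` gives a secondary signal that `ReconnectDriver` actually caused a second connect attempt.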
### 3. Live suite run
Run the full `Host.IntegrationTests` driver E2E suite. Bring up the Modbus sim
(`lmxopcua-fix up modbus standard`, endpoint `10.100.0.35:5020` / `MODBUS_SIM_ENDPOINT`) so
the skip-gated 10.1 `DriverTestConnectE2eTests` actually execute green (not skipped),
alongside the new deterministic reconnect test (which runs regardless of the sim).
### 4. Reconcile the stale trackers
- `docs/plans/2026-05-28-adminui-driver-pages-plan.md.tasks.json`: mark Phases 6–10 tasks
`completed` with their real commits / a "shipped — reconciled 2026-06-18" note; flip
`executionState`/`lastUpdated`.
- `stillpending.md` §A.9: mark Phase 6/8/10 SHIPPED (note the new reconnect-transition test;
keep the deferred full-stack hub test as a documented follow-up). *(Never staged — local
working file.)*
- `docs/plans/2026-05-28-adminui-driver-pages-design.md` §8.3: fix the stale `ModbusTcp`
`Modbus` reference in the smoke checklist.
- Memory: update `project_stillpending_backlog.md` + `MEMORY.md`.
### 5. Explicitly deferred (documented follow-up)
The **full-stack WebSocket + JWT `DriverStatusHub` connection test** (a real `HubConnection`
to `/hubs/driverstatus` with a minted bearer token, `JoinDriver`, assert client receipt).
Deferred because there is no repo precedent (no test mints a JWT or opens a real
`HubConnection`), it is flaky-prone, and it only re-covers transport that the mock-hub test
and the §8.3 manual runbook already cover.
## Out of scope
- The 10.4 manual browser smoke (driving the AdminUI on docker-dev). Foldable later; the
automated reconnect test + green suite is the higher-value core.
- Any production code change. This phase is test + harness + docs only.
## Constraints
- xUnit + Shouldly. **No bUnit.**
- **No** EF migration, **no** Commons wire/proto contract change, **no**
Core.Abstractions / interface contract change.
- Stage by explicit path; never `git add .`; never stage `sql_login.txt`,
`src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/`, `pending.md`, `current.md`,
`docker-dev/docker-compose.yml`, `stillpending.md`.
- No force-push, no `--no-verify`.
## Finish
Merge to master + push.