Driver-pages Phase 10 — reconnect-transition E2E + plan close-out — Design

Date: 2026-06-18
Status: Approved
Base: master 08c7a2bd
Branch: feat/driver-pages-reconnect-e2e

Context

The 2026-05-28-adminui-driver-pages plan is fully shipped (Phases 0–10), but its .tasks.json is comprehensively stale (it still marks Phases 6–10 "pending" while the code + tests all exist and are real). A brainstorming pass on 2026-06-18 verified, seam by seam, that:

  • Phase 6 (live DriverStatusPanel) is built end-to-end: DriverHealthChanged contract, AkkaDriverHealthPublisher (DI-bound in prod, invoked at DriverInstanceActor.PublishHealthSnapshot), DriverStatusSignalRBridge (spawned admin-gated at Program.cs:196), the shared-singleton snapshot store, the hub (MapHub<DriverStatusHub>), and the panel wired into all 9 driver pages. Page DriverInstanceId and actor _driverInstanceId key on the same EF value — no mismatch.
  • Phase 8 (Reconnect/Restart) is built: messages, AdminOperationsActor + DriverHostActor handlers, DriverOperator-gated buttons in the panel.
  • Phase 10 automated E2E tests (DriverTestConnectE2eTests, DriverReconnectE2eTests, DriverStatusHubE2eTests) are real, skip-gated, and honest about their scope.

Two genuine remnants remain, both flagged by the DriverReconnectE2eTests / DriverStatusHubE2eTests scope-notes:

  1. The one real coverage gap: no test proves a deployed driver actually transitions Healthy → Reconnecting → Healthy in response to ReconnectDriver. The existing 10.2 test only asserts command ingestion (the round-trip reply), not the resulting actor health transition.
  2. A harness-fidelity gap behind it: TwoNodeClusterHarness.BuildNodeAsync calls WithOtOpcUaRuntimeActors() (the Akka actor spawn) but not AddOtOpcUaRuntime() (the DI registration that binds IDriverHealthPublisher → AkkaDriverHealthPublisher). Consequently, in every current Host.IntegrationTests test, deployed driver actors fall back to NullDriverHealthPublisher and emit no health to the driver-health DPS topic. The harness also leaves IDriverFactory at NullDriverFactory, so deployed drivers reach Stubbed, never Connected.

The plan's stale trackers caused this item to be (wrongly) re-listed as OPEN in stillpending.md §A.9. This phase closes both remnants and reconciles the trackers.

Goal

Close the genuine ReconnectDriver health-transition E2E gap, fix the harness-fidelity gap behind it, prove the full Phase 10 driver suite green, and reconcile the stale trackers so this fully-shipped plan stops re-triggering as backlog.

Design

1. Harness fidelity fix

In TwoNodeClusterHarness.BuildNodeAsync:

  • Add builder.Services.AddOtOpcUaRuntime() before AddAkka (matching production Program.cs:87). This binds the real AkkaDriverHealthPublisher so deployed drivers publish health to the driver-health DPS topic. It also seeds the Null* runtime defaults (IHistorianDataSource, IAlarmHistorianSink, IHistoryWriter, IDriverFactory, …) — all harmless no-ops that don't change existing test behavior (nothing in the current suite subscribes to driver-health, and the Null sinks are inert).
  • Add an opt-in seam to inject a test IDriverFactory for tests that need a connecting driver. Default (no factory supplied) leaves the existing behavior untouched. Mechanism: a StartAsync parameter (e.g. IDriverFactory? driverFactory = null) threaded into BuildNodeAsync; when supplied, register it as a singleton after AddOtOpcUaRuntime so it wins over the Null default (last-registration-wins / replace).

This change is fidelity-improving for the whole suite: the existing DriverStatusHubE2eTests keeps spawning its own bridge, but real driver health now flows in tests that deploy drivers.
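The opt-in factory seam can be sketched as follows. This is a minimal illustration, not the harness code: the `IDriverFactory` shape, `AddRuntimeDefaults` (standing in for `AddOtOpcUaRuntime`), and `HarnessRegistration.Build` are hypothetical, and the sketch assumes the runtime registers its Null default with a plain `AddSingleton`. Only the `Microsoft.Extensions.DependencyInjection` calls are real.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection.Extensions;

// Hypothetical stand-ins; the real IDriverFactory contract is not reproduced here.
public interface IDriverFactory { string Name { get; } }
public sealed class NullDriverFactory : IDriverFactory { public string Name => "null"; }
public sealed class FakeDriverFactory : IDriverFactory { public string Name => "fake"; }

public static class HarnessRegistration
{
    // Stand-in for AddOtOpcUaRuntime(): seeds the Null default (assumed AddSingleton).
    public static IServiceCollection AddRuntimeDefaults(this IServiceCollection services)
        => services.AddSingleton<IDriverFactory, NullDriverFactory>();

    // Mirrors the proposed optional parameter: null keeps the existing behavior.
    public static IServiceProvider Build(IDriverFactory? driverFactory = null)
    {
        var services = new ServiceCollection();
        services.AddRuntimeDefaults();

        if (driverFactory is not null)
        {
            // Replace removes the first existing IDriverFactory descriptor and adds this one,
            // so the test double is the only registration left.
            services.Replace(ServiceDescriptor.Singleton<IDriverFactory>(driverFactory));
        }

        return services.BuildServiceProvider();
    }
}
```

A plain `AddSingleton` after the defaults would also resolve to the test double, because resolving a single service returns the last registration. `Replace` additionally keeps `IEnumerable<IDriverFactory>` consumers from seeing both.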

2. The reconnect-transition E2E test (the real gap)

A new test (in DriverReconnectE2eTests.cs, or a focused new file) that:

  1. Starts the harness with a controllable fake IDriverFactory (see decision below).
  2. Seeds a driver row + minimal equipment/tag using the existing SeedDeploymentWithEquipmentTags precedent (from DriverHostActorLiveValueTests), bound to NodeANodeId.
  3. Triggers a deploy (DispatchDeployment, or POST /api/deployments with HarnessDeployApiKey).
  4. Spawns the real DriverStatusSignalRBridge over the real DI snapshot store (the store is the observation surface; a mock IHubContext captures the hub push the same way the existing hub test does).
  5. Waits (condition-poll, generous timeout) for the snapshot store to report the deployed instance as Healthy.
  6. Dispatches ReconnectDriver via IAdminOperationsClient (the real cluster-singleton path the AdminUI button uses).
  7. Asserts the store observes the transition Reconnecting and then returns to Healthy within a timeout — proving the full wiring: ReconnectDriver → AdminOperationsActor → DriverHostActor.HandleReconnectDriver → DriverInstanceActor FSM (ForceReconnect → Become(Reconnecting) → Become(Connected)) → PublishHealthSnapshot → driver-health DPS topic → DriverStatusSignalRBridge → store.
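Steps 5 and 7 rely on condition-polling and on seeing states in order. The sketch below shows one possible shape for both helpers. `Poll` and `StateRecorder` are hypothetical names, and health states are modelled as plain strings because the real health type is not reproduced here. Recording every observed change (for example from the captured hub pushes) rather than sampling only the latest snapshot is an assumption of this sketch, made because a pure poll could in principle miss a short-lived Reconnecting state.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

public static class Poll
{
    // Re-evaluates the condition until it holds or the timeout elapses.
    public static async Task<bool> UntilAsync(
        Func<bool> condition, TimeSpan timeout, TimeSpan? interval = null)
    {
        var delay = interval ?? TimeSpan.FromMilliseconds(50);
        var elapsed = Stopwatch.StartNew();
        while (true)
        {
            if (condition()) return true;
            if (elapsed.Elapsed >= timeout) return false;
            await Task.Delay(delay);
        }
    }
}

public sealed class StateRecorder
{
    private readonly object _gate = new();
    private readonly List<string> _states = new();

    // Records a state only when it differs from the previously recorded one.
    public void Observe(string state)
    {
        lock (_gate)
        {
            if (_states.Count == 0 || _states[^1] != state) _states.Add(state);
        }
    }

    // True when 'expected' occurs in order (not necessarily adjacent) in the history.
    public bool SawInOrder(params string[] expected)
    {
        lock (_gate)
        {
            var next = 0;
            foreach (var state in _states)
            {
                if (next < expected.Length && state == expected[next]) next++;
            }
            return next == expected.Length;
        }
    }
}
```

In a test, the final assertion could then read roughly as `(await Poll.UntilAsync(() => recorder.SawInOrder("Healthy", "Reconnecting", "Healthy"), TimeSpan.FromSeconds(30))).ShouldBeTrue()`, with the timeout value chosen here only for illustration.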

Decision: controllable fake driver factory (not the real Modbus sim)

Recommended and approved: observe the transition via a deterministic, controllable fake IDriver / IDriverFactory test double rather than a real Modbus sim connection.

Rationale:

  • Determinism, no flakiness. A fake driver whose connect succeeds drives the actor to Connected immediately; ReconnectDriver re-enters Reconnecting then Connected deterministically. No sim timing, no skip-gate, runs everywhere.
  • Smaller blast radius. The real-sim path additionally needs the full driver-factory bootstrap (all 9 driver factories + deps) wired into the shared harness.
  • The wiring is what matters. This gap is about the health-transition + command wiring, not the Modbus protocol. The real Modbus TCP connect/reconnect is already covered by the Modbus.IntegrationTests and the 10.1 TestConnect E2E (against the sim).

The fake IDriver exposes a minimal controllable surface (succeed-connect, optionally signal a fault) sufficient to walk the FSM through Connected → Reconnecting → Connected.
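One way such a double might look is sketched below. `ISketchDriver` is a hypothetical one-method stand-in, not the real `IDriver` contract from Core.Abstractions, and `HoldConnect` / `ReleaseConnect` are illustrative names. The gate is an addition of this sketch: by default connect succeeds immediately, but a test can hold the next connect open so the actor stays in Reconnecting long enough to be observed. Fault signalling is left out.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shape, NOT the real Core.Abstractions IDriver contract.
public interface ISketchDriver
{
    Task ConnectAsync(CancellationToken ct);
}

public sealed class ControllableFakeDriver : ISketchDriver
{
    // Starts completed, so connects succeed immediately unless a test holds them.
    private volatile TaskCompletionSource _gate = Completed();
    private int _connectCalls;

    public int ConnectCalls => Volatile.Read(ref _connectCalls);

    // Makes subsequent connects wait until ReleaseConnect() is called.
    public void HoldConnect() =>
        _gate = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);

    // Lets any held connect complete successfully.
    public void ReleaseConnect() => _gate.TrySetResult();

    public async Task ConnectAsync(CancellationToken ct)
    {
        Interlocked.Increment(ref _connectCalls);
        await _gate.Task.WaitAsync(ct); // honours cancellation while held
    }

    private static TaskCompletionSource Completed()
    {
        var tcs = new TaskCompletionSource();
        tcs.SetResult();
        return tcs;
    }
}
```

A matching fake factory would simply hand out the same `ControllableFakeDriver` instance, so the test keeps a handle on the driver the actor is using.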

3. Live suite run

Run the full Host.IntegrationTests driver E2E suite. Bring up the Modbus sim (lmxopcua-fix up modbus standard, endpoint 10.100.0.35:5020 / MODBUS_SIM_ENDPOINT) so the skip-gated 10.1 DriverTestConnectE2eTests actually execute green (not skipped), alongside the new deterministic reconnect test (which runs regardless of the sim).
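Before the run, it can help to confirm that the sim is actually reachable, so that a "green" result is not a silent skip. The probe below is a rough sketch: `ModbusSimGate` is a hypothetical helper, not the existing skip gate, and treating `10.100.0.35:5020` as the fallback when `MODBUS_SIM_ENDPOINT` is unset is an assumption drawn from the endpoint named above.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class ModbusSimGate
{
    public const string EnvVar = "MODBUS_SIM_ENDPOINT";
    public const string DefaultEndpoint = "10.100.0.35:5020"; // assumed fallback

    // Parses "host:port"; a null or blank value falls back to the default endpoint.
    public static (string Host, int Port) Parse(string? raw)
    {
        var value = string.IsNullOrWhiteSpace(raw) ? DefaultEndpoint : raw.Trim();
        var idx = value.LastIndexOf(':');
        if (idx <= 0
            || !int.TryParse(value[(idx + 1)..], out var port)
            || port is < 1 or > 65535)
        {
            throw new FormatException($"Expected host:port, got '{value}'.");
        }
        return (value[..idx], port);
    }

    // True when a TCP connection to the configured endpoint succeeds within the timeout.
    public static async Task<bool> IsReachableAsync(TimeSpan timeout)
    {
        var (host, port) = Parse(Environment.GetEnvironmentVariable(EnvVar));
        using var cts = new CancellationTokenSource(timeout);
        using var client = new TcpClient();
        try
        {
            await client.ConnectAsync(host, port, cts.Token);
            return true;
        }
        catch (Exception ex) when (ex is SocketException or OperationCanceledException)
        {
            return false;
        }
    }
}
```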

4. Reconcile the stale trackers

  • docs/plans/2026-05-28-adminui-driver-pages-plan.md.tasks.json: mark Phases 6–10 tasks completed with their real commits / a "shipped — reconciled 2026-06-18" note; flip executionState/lastUpdated.
  • stillpending.md §A.9: mark Phase 6/8/10 SHIPPED (note the new reconnect-transition test; keep the deferred full-stack hub test as a documented follow-up). (Never staged — local working file.)
  • docs/plans/2026-05-28-adminui-driver-pages-design.md §8.3: fix the stale ModbusTcpModbus reference in the smoke checklist.
  • Memory: update project_stillpending_backlog.md + MEMORY.md.

5. Explicitly deferred (documented follow-up)

Deferred: the full-stack WebSocket + JWT DriverStatusHub connection test (a real HubConnection to /hubs/driverstatus with a minted bearer token, JoinDriver, assert client receipt). It has no repo precedent (no test mints a JWT or opens a real HubConnection), is flake-prone, and only re-covers transport that the mock-hub test and the §8.3 manual runbook already handle.

Out of scope

  • The 10.4 manual browser smoke (driving the AdminUI on docker-dev). Foldable later; the automated reconnect test + green suite is the higher-value core.
  • Any production code change. This phase is test + harness + docs only.

Constraints

  • xUnit + Shouldly. No bUnit.
  • No EF migration, no Commons wire/proto contract change, no Core.Abstractions / interface contract change.
  • Stage by explicit path; never git add .; never stage sql_login.txt, src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/, pending.md, current.md, docker-dev/docker-compose.yml, stillpending.md.
  • No force-push, no --no-verify.

Finish

Merge to master + push.