task-galaxy-e2e branch — non-FOCAS work-in-progress snapshot

Catch-all commit for pending work on the task-galaxy-e2e branch that
wasn't part of the FOCAS migration. Grouping by topic so future per-topic
commits can be cherry-picked if needed.

TwinCAT
- src/.../Driver.TwinCAT/AdsTwinCATClient.cs + TwinCATDriverFactoryExtensions.cs:
  factory-registration extensions + ADS client refinements.
- src/.../Driver.TwinCAT.Cli/Commands/BrowseCommand.cs: new browse command
  for the TwinCAT test-client CLI.
- tests/.../Driver.TwinCAT.IntegrationTests/TwinCAT3SmokeTests.cs + TwinCatProject/:
  fixture scaffold with a minimal POU + README pointing at the TCBSD/ESXi
  VM for e2e.
- docs/Driver.TwinCAT.Cli.md + docs/drivers/TwinCAT-Test-Fixture.md:
  documentation for the above.
- docs/v3/twincat-backlog.md: forward-looking backlog seed.

Admin UI + fleet status
- src/.../Admin/Components/Pages/Clusters/DriversTab.razor + Hosts.razor:
  UI refresh for fleet-status rendering.
- src/.../Admin/Hubs/FleetStatusHub.cs + FleetStatusPoller.cs +
  Admin/Program.cs: SignalR hub + poller plumbing for live fleet data.
- tests/.../Admin.Tests/FleetStatusPollerTests.cs: poller coverage.

Server + redundancy runtime (Phase 6.3 follow-ups)
- src/.../Server/Hosting/RedundancyPublisherHostedService.cs: HostedService
  that owns the RedundancyStatePublisher lifecycle + wires peer reachability.
- src/.../Server/Redundancy/ServerRedundancyNodeWriter.cs: OPC UA
  variable-node writer binding ServiceLevel + ServerUriArray to the
  publisher's events.
- src/.../Server/Program.cs + Server.csproj: hosted-service registration.
- tests/.../Server.Tests/ServerRedundancyNodeWriterTests.cs +
  Server.Tests.csproj: coverage for the above.

Configuration
- src/.../Configuration/Validation/DraftValidator.cs +
  tests/.../Configuration.Tests/DraftValidatorTests.cs: draft-validation
  refinements.

E2E scripts (shared infrastructure)
- scripts/e2e/README.md + _common.ps1 + test-all.ps1: shared helpers + the
  all-drivers test-all runner.
- scripts/e2e/test-opcuaclient.ps1: OPC UA Client e2e runner.

Docs
- docs/v2/implementation/phase-6-{1,2,3,4}*.md + exit-gate-phase-{3,7}.md:
  phase-gate + implementation doc updates.
- docs/v2/plan.md: top-level plan refresh.
- docs/v2/redundancy-interop-playbook.md: client interop playbook for the
  Phase 6.3 redundancy-runtime work.

Two orphan FOCAS docs remain on disk but deliberately unstaged —
docs/v2/focas-deployment.md and docs/v2/implementation/focas-simulator-plan.md
describe the now-retired Tier-C topology and should either be rewritten
or deleted in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-04-24 14:12:19 -04:00
parent 4b0664bd55
commit 69e0d02c72
58 changed files with 3070 additions and 247 deletions


@@ -0,0 +1,129 @@
# Phase 3 Exit Gate — Driver Fleet (reconstructed retroactively)
> **Status**: **CLOSED (reconstructed 2026-04-23)**. The original plan split the
> driver work across Phases 3 / 4 / 5 (Modbus alone → four PLC drivers → two
> specialty drivers). In execution, all seven non-Galaxy drivers shipped under
> one umbrella against `Core.Abstractions` + `Core`'s generic driver-hosting
> machinery. This doc captures the closure retroactively; no forward work
> remains under these three original phase numbers.
>
> **Plan doc**: none — phases 3/4/5 were intentionally not split out into
> separate plan docs once it was clear the capability-interface contract
> introduced in Phase 1 (`Core.Abstractions` — plan decision #4) was stable
> enough that each driver could land as its own stream rather than as a
> gated mini-phase. See `docs/v2/plan.md` §6 for the now-consolidated
> migration strategy.
## Scope
All seven drivers in the v2 target list (Decision #5) minus Galaxy (closed
separately under Phase 2). The Galaxy Proxy+Host+Shared split exited under
`exit-gate-phase-2-final.md`; this gate does not re-cover it.
## What shipped
### Drivers
| Driver | Project | Capability surface | Test projects |
|---|---|---|---|
| Modbus TCP | `Driver.Modbus` + `Driver.Modbus.Cli` | `IDriver` + `ITagDiscovery` + `IReadable` + `IWritable` + `ISubscribable` + `IHostConnectivityProbe` | `Tests`, `IntegrationTests`, `Cli.Tests` |
| AB CIP | `Driver.AbCip` + `Driver.AbCip.Cli` | all of the above + `IPerCallHostResolver` + `IAlarmSource` | `Tests`, `IntegrationTests`, `Cli.Tests` |
| AB Legacy (PCCC / DF1) | `Driver.AbLegacy` + `Driver.AbLegacy.Cli` | `IDriver` + `IReadable` + `IWritable` + `ITagDiscovery` + `ISubscribable` + `IHostConnectivityProbe` + `IPerCallHostResolver` | `Tests`, `IntegrationTests`, `Cli.Tests` |
| Siemens S7 | `Driver.S7` + `Driver.S7.Cli` | `IDriver` + `ITagDiscovery` + `IReadable` + `IWritable` + `ISubscribable` + `IHostConnectivityProbe` | `Tests`, `IntegrationTests`, `Cli.Tests` |
| Beckhoff TwinCAT (ADS) | `Driver.TwinCAT` + `Driver.TwinCAT.Cli` | `IDriver` + `IReadable` + `IWritable` + `ITagDiscovery` + `ISubscribable` + `IHostConnectivityProbe` + `IPerCallHostResolver` | `Tests`, `IntegrationTests`, `Cli.Tests` |
| FANUC FOCAS | `Driver.FOCAS` + `Driver.FOCAS.Host` + `Driver.FOCAS.Shared` + `Driver.FOCAS.Cli` | `IDriver` + `IReadable` + `IWritable` + `ITagDiscovery` + `ISubscribable` + `IHostConnectivityProbe` + `IPerCallHostResolver`; Tier-C out-of-process backend mirrors the Galaxy Proxy/Host split. `Fwlib64FocasBackend` shipped 2026-04-23 as the production backend (P/Invoke against `Fwlib64.dll`); Host retargeted from net48 x86 to net10.0-windows x64 at the same time. | `Tests`, `Host.Tests`, `Shared.Tests`, `Cli.Tests` |
| OPC UA Client (gateway) | `Driver.OpcUaClient` | `IDriver` + `ITagDiscovery` + `IReadable` + `IWritable` + `ISubscribable` + `IHostConnectivityProbe` + `IAlarmSource` + `IHistoryProvider` (richest surface in the fleet — it's bridging another UA server) | `Tests`, `IntegrationTests` |
### Supporting infrastructure
| PR / Task | Summary |
|---|---|
| #248 | `DriverFactoryRegistry` + `DriverInstanceBootstrapper` — central DB `DriverInstance` rows materialise into live `IDriver` instances at server startup. |
| #210 | Modbus server-side factory + seed SQL (closed first child of umbrella #209). |
| #211 #212 #213 | AB CIP / S7 / AB Legacy server-side factories + seed SQL. |
| #220 (FOCAS) | FOCAS factory wired into the bootstrap pipeline; Tier-C split (`Driver.FOCAS.Host` process launcher, named-pipe IPC, NSSM install scripts, post-mortem MMF) shipped across the five-PR series. |
| (this session) | TwinCAT factory wired in + Server project reference added; all seven driver factories now register uniformly in `Server/Program.cs`. |
| #249 #250 #251 | Per-driver test-client CLI suite (`otopcua-<driver>-cli`) — shared lib + one CLI per driver for direct-to-PLC smoke testing independent of the server. |
| #253 + follow-ups | E2E CLI test scripts (`scripts/e2e/test-<driver>.ps1`) — five-stage bidirectional bridge + subscribe-sees-change assertions per driver, plus `test-all.ps1` matrix runner. |
| (this session) | OPC UA Client e2e script shipped (`test-opcuaclient.ps1`, 8 stages) — the only driver that was missing an e2e script. |
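The registry + bootstrapper pattern in #248 (central `DriverInstance` rows materialising into live `IDriver` instances at startup) reduces to a small lookup-and-construct loop. A minimal sketch, in Python purely for illustration; the real code is C# and the method names here are invented:

```python
class DriverFactoryRegistry:
    """Maps a driver-type key to a factory callable (illustrative sketch)."""

    def __init__(self):
        self._factories = {}

    def register(self, driver_type, factory):
        if driver_type in self._factories:
            raise ValueError(f"duplicate factory for {driver_type}")
        self._factories[driver_type] = factory

    def materialise(self, instance_row):
        """Turn one configured DriverInstance row into a live driver object."""
        factory = self._factories.get(instance_row["DriverType"])
        if factory is None:
            raise KeyError(f"no factory registered for {instance_row['DriverType']}")
        return factory(instance_row)


registry = DriverFactoryRegistry()
registry.register("Modbus", lambda row: f"ModbusDriver({row['Name']})")
registry.register("S7", lambda row: f"S7Driver({row['Name']})")

# The bootstrapper half: walk configured rows and materialise each one.
drivers = [registry.materialise(r) for r in (
    {"DriverType": "Modbus", "Name": "line1"},
    {"DriverType": "S7", "Name": "press2"},
)]
```

Registering all factories up front and failing fast on duplicates is what makes "all seven factories now register uniformly" checkable at startup.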
### Docs
Per-driver test-fixture documentation:
- `docs/drivers/Modbus-Test-Fixture.md`
- `docs/drivers/AbServer-Test-Fixture.md` (covers AB CIP fixture)
- `docs/drivers/AbLegacy-Test-Fixture.md`
- `docs/drivers/S7-Test-Fixture.md`
- `docs/drivers/TwinCAT-Test-Fixture.md`
- `docs/drivers/FOCAS-Test-Fixture.md`
- `docs/drivers/OpcUaClient-Test-Fixture.md`
Driver-level ops docs:
- `docs/Driver.Modbus.Cli.md`, `docs/Driver.AbCip.Cli.md`, `docs/Driver.AbLegacy.Cli.md`, `docs/Driver.S7.Cli.md`, `docs/Driver.TwinCAT.Cli.md`, `docs/Driver.FOCAS.Cli.md`
- `docs/v2/driver-specs.md` — unified capability-matrix spec for all eight drivers (Galaxy + seven).
## Compliance evidence
No dedicated `phase-3-compliance.ps1` exists — scope was too broad to fit the
single-script pattern that worked for Phases 6.x and 7. Verification instead
takes the form of the per-driver test suites + e2e scripts:
- [x] **Unit tests** — every driver has a `Tests` project with capability-interface contract tests; `dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.*.Tests` is green.
- [x] **Integration tests** — `Driver.*.IntegrationTests` stands up Docker-hosted simulators (pymodbus, ab_server, python-snap7, opc-plc) at collection init and exercises real wire-level read/write/subscribe/probe per driver.
- [x] **CLI tests** — `Driver.*.Cli.Tests` covers the per-driver test-client CLIs (#249–#251).
- [x] **E2E scripts** — `scripts/e2e/test-<driver>.ps1` covers the driver-CLI → PLC → OtOpcUa server → OPC UA client round-trip for all seven drivers + Galaxy; `test-all.ps1` aggregates; README status section (rewritten this session) summarises live-boot evidence.
- [x] **Factory registration** — all seven factories plus Galaxy register in `src/ZB.MOM.WW.OtOpcUa.Server/Program.cs` inside the `DriverFactoryRegistry` composition; the `DriverInstanceBootstrapper` can materialise any configured row.
- [x] **Seed SQL** — #210–#213 provide per-driver Config DB seed scripts so a fresh Config DB can be populated without Admin UI interaction.
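The e2e scripts' stage model from #253, where ordered stages run until one fails and the pass count is reported as `n/total`, can be sketched as follows. Python for illustration only; the stage names loosely mirror the five-stage runners, and the checks are stand-ins:

```python
def run_stages(stages):
    """Run named e2e stages in order; stop at the first failure (sketch)."""
    passed = []
    for name, check in stages:
        if not check():
            return passed, name  # report what passed and which stage broke
        passed.append(name)
    return passed, None


plc = {"tag1": 10}  # stand-in for a simulator/PLC tag table
stages = [
    ("probe", lambda: True),
    ("read", lambda: plc["tag1"] == 10),
    ("write", lambda: plc.update(tag1=42) is None),
    ("bridge", lambda: plc["tag1"] == 42),
    ("subscribe-sees-change", lambda: plc["tag1"] != 10),
]
passed, failed = run_stages(stages)
```

Reporting the partial pass list is what lets the live-boot matrix record results like "5/8" instead of a bare pass/fail.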
### Live-boot verification
Recorded across the session-level tracking tasks:
| Driver | Fixture | Stages | Tracking |
|---|---|---|---|
| Modbus | pymodbus (dl205 profile) | 5/5 | #209 exit gate; bidirectional + subscribe-sees-change added in #253 follow-ups |
| AB CIP | `ab_server` ControlLogix | 5/5 | #220 |
| S7 | python-snap7 | 5/5 | #220 |
| AB Legacy | `ab_server` SLC500 / MicroLogix / PLC-5 (requires `/1,0` cip-path for Docker fixture) | 5/5 | #222 partial |
| OPC UA Client | opc-plc Docker fixture | 5/8 (probe, remote read, forward bridge, subscribe, browse) | (this session) |
| TwinCAT | TCBSD VM @ 10.100.0.128 (AmsNetId `41.169.163.43.1.1`) — real TwinCAT runtime under FreeBSD on ESXi; bypasses the Hyper-V/RTIME conflict that blocks XAR on this dev box | features validated | fixture is the TCBSD VM; `TWINCAT_TRUST_WIRE=1` still gates the e2e script by default so unintentional runs against cold fixtures don't false-pass |
| FOCAS | Lab-rig CNC + `Fwlib64.dll` | — | **deferred** — `Fwlib64FocasBackend` shipped 2026-04-23; wire-level live-boot gated behind `FOCAS_TRUST_WIRE=1`, lab rig tracked under #222 follow-up |
| Galaxy | Live Galaxy + `OtOpcUaGalaxyHost` (this dev box) | 7/7 (read / write / subscribe / alarms / history) | closed under Phase 2 |
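The `TWINCAT_TRUST_WIRE=1` / `FOCAS_TRUST_WIRE=1` gating referenced in the table above makes wire-level runs opt-in so a cold fixture cannot false-pass. It amounts to an environment-variable guard; a sketch in Python (the helper name is invented):

```python
import os


def wire_tests_enabled(env_var="TWINCAT_TRUST_WIRE"):
    """Wire-level e2e only runs when explicitly opted in; default is skip."""
    return os.environ.get(env_var) == "1"


# Default: variable unset, so wire-level stages are skipped.
os.environ.pop("TWINCAT_TRUST_WIRE", None)
default_skipped = not wire_tests_enabled()

# Operator opts in deliberately before pointing at a warm fixture.
os.environ["TWINCAT_TRUST_WIRE"] = "1"
opted_in = wire_tests_enabled()
```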
## Deferred to post-gate follow-ups
Items intentionally not blocking closure of this umbrella — each is hardware-
dependent and tracked separately:
- [ ] **FOCAS wire-level live-boot** — `test-focas.ps1` against a real CNC once `Fwlib64.dll` is on PATH and `FOCAS_TRUST_WIRE=1` is set (#222 follow-up). The `Fwlib64FocasBackend` shipped 2026-04-23 — code exists, unit-tests green; only the live-CNC smoke test remains.
- [x] **FOCAS `Fwlib64FocasBackend`** — **CLOSED 2026-04-23**. The production backend in `src/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Host/Backend/Fwlib64FocasBackend.cs` wraps `FwlibFocasClient` to fulfil `IFocasBackend` against the licensed `Fwlib64.dll`. Host project retargeted to `net10.0-windows` x64. Default when `OTOPCUA_FOCAS_BACKEND` is unset. 6 new backend tests green. Only wire-level live-boot against real hardware remains — see item above.
- [ ] **OPC UA Client stages 5/7/8** — reverse-bridge, alarm, history stages are opt-in via sidecar NodeId params because opc-plc's default image has no writable nodes and doesn't historize. Against a richer upstream (Prosys, UA Expert sample server) all eight stages can run.
## Completion checklist
- [x] Modbus driver shipped + unit + integration + CLI tests green
- [x] AB CIP driver shipped + tests green + live-boot 5/5
- [x] AB Legacy driver shipped + tests green + live-boot 5/5
- [x] S7 driver shipped + tests green + live-boot 5/5
- [x] TwinCAT driver shipped + tests green + features validated against the TCBSD VM virtual-PLC fixture
- [x] FOCAS driver shipped (Tier-C split) + tests green (wire-live deferred)
- [x] OPC UA Client driver shipped + tests green + live-boot 5/8
- [x] `DriverFactoryRegistry` + `DriverInstanceBootstrapper` shipped
- [x] All seven factories registered in `Server/Program.cs`
- [x] Per-driver test-client CLI suite shipped
- [x] E2E test scripts shipped + `test-all.ps1` aggregator green
- [x] Per-driver test-fixture docs present
- [x] `docs/v2/driver-specs.md` unified capability spec present
- [x] `scripts/e2e/README.md` status section reflects current live-boot matrix
- [x] Exit gate doc checked in (this file)
- [x] TwinCAT validated against the TCBSD VM virtual-PLC fixture — `TWINCAT_TRUST_WIRE=1` + e2e script still gated by default to prevent false-pass against cold fixtures
- [ ] FOCAS lab-rig follow-up filed + tracked (#222)
## Why no compliance script
The Phases 6.1/6.2/6.3/6.4/7 pattern of a single `phase-N-compliance.ps1`
worked because each of those phases touched a narrow slice of server-side
runtime. A "phase-3-compliance.ps1" would have had to boot seven simulators,
configure seven DriverInstance rows, and run seven e2e scripts — which is
exactly what `scripts/e2e/test-all.ps1` already does. The aggregate runner
+ its README is the compliance artefact for this umbrella.


@@ -1,6 +1,6 @@
# Phase 7 Exit Gate — Scripting, Virtual Tags, Scripted Alarms, Historian Sink
> **Status**: **FULLY CLOSED** 2026-04-23 audit — the three original follow-ups (#239 / #240 / #241) were all shipped under later branches but this exit-gate doc wasn't updated at the time. All three verified against the repo + tests green.
>
> **Compliance script**: `scripts/compliance/phase-7-compliance.ps1`
> **Plan doc**: `docs/v2/implementation/phase-7-scripting-and-alarming.md`
@@ -45,13 +45,13 @@ Covered by `scripts/compliance/phase-7-compliance.ps1`:
- [x] Walker emits `NodeSourceKind.Virtual` + `NodeSourceKind.ScriptedAlarm` variables
- [x] `DriverNodeManager` dispatch routes Reads by source; Writes to non-Driver rejected with `BadUserAccessDenied` (plan #6)
## Deferred to Post-Gate Follow-ups (all closed as of 2026-04-23 audit)
Originally kept out of the capstone so the gate could close cleanly. Each landed as a targeted follow-up PR; audit this session verified them against the repo:
- [x] **SealedBootstrap composition root** (task #239) — **CLOSED**. `src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7Composer.cs` instantiates `VirtualTagEngine` + `ScriptedAlarmEngine` via `Phase7EngineComposer.Compose`, and `SqliteStoreAndForwardSink` in `ResolveHistorianSink` when a registered driver provides `IAlarmHistorianWriter` (today: `GalaxyProxyDriver`). `OpcUaServerService.ExecuteAsync` calls `Phase7Composer.PrepareAsync` then `OpcUaApplicationHost.SetPhase7Sources` **before** `applicationHost.StartAsync` so `OtOpcUaServer` + `DriverNodeManager` capture the `VirtualReadable` / `ScriptedAlarmReadable` at construction. 38 tests green under `tests/ZB.MOM.WW.OtOpcUa.Server.Tests/Phase7/` + `SealedBootstrapIntegrationTests`. The work landed under the label "Phase 7 follow-up #246" and was never re-labelled against #239.
- [x] **Live OPC UA end-to-end smoke** (task #240) — **CLOSED**. `scripts/e2e/test-phase7-virtualtags.ps1` drives a full Client.CLI read of a driver-sourced input, reads the VirtualTag computed off it, triggers a scripted alarm by writing the trigger value, and subscribes to the alarm condition — all through a running OtOpcUa server. Covered in `scripts/e2e/test-all.ps1` + `scripts/e2e/README.md` matrix.
- [x] **sp_ComputeGenerationDiff extension** (task #241) — **CLOSED**. Migration `20260420232000_ExtendComputeGenerationDiffWithPhase7.cs` extends the stored proc to emit Script / VirtualTag / ScriptedAlarm sections alongside the existing NodeAcl / Tag / Equipment / DriverInstance / Namespace output. Admin DiffViewer picks them up through its existing section-plugin architecture (Phase 6.4 Stream C).
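The ordering constraint in #239 (sources must be set before `StartAsync`, because the node manager captures them at construction) can be made concrete with a toy host. Python sketch; class and method names are illustrative, not the real API:

```python
class Host:
    """Toy host illustrating capture-at-construction ordering (not the real API)."""

    def __init__(self):
        self.sources = None
        self.started = False
        self.captured = None

    def set_sources(self, sources):
        if self.started:
            # Setting sources after start would be silently ignored by the
            # already-constructed node manager, so fail loudly instead.
            raise RuntimeError("sources must be set before start")
        self.sources = sources

    def start(self):
        # The node manager captures whatever sources exist right now.
        self.captured = self.sources
        self.started = True


host = Host()
host.set_sources({"virtual": "VirtualReadable", "alarm": "ScriptedAlarmReadable"})
host.start()
```

Composing the engines first and only then starting the host is exactly why the sources are live in production rather than dormant.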
## Completion Checklist
@@ -66,9 +66,9 @@ Kept out of the capstone so the gate can close cleanly while the less-critical w
- [x] `phase-7-compliance.ps1` present and passes
- [x] Full solution `dotnet test` passes (no new failures beyond pre-existing tolerated CLI flake)
- [x] Exit-gate doc checked in
- [x] `SealedBootstrap` composition follow-up shipped (#239 / Phase 7 follow-up #246)
- [x] Live end-to-end smoke follow-up shipped (#240 — `scripts/e2e/test-phase7-virtualtags.ps1`)
- [x] `sp_ComputeGenerationDiff` extension follow-up shipped (#241 — migration `ExtendComputeGenerationDiffWithPhase7`)
## How to run


@@ -1,6 +1,8 @@
# Phase 6.1 — Resilience & Observability Runtime
> **Status**: **SHIPPED** 2026-04-19 — Streams A/B/C/D + E data layer merged to `v2` across PRs #78-82. Final exit-gate PR #83 turns the compliance script into real checks (all pass) and records this status update.
>
> **Stream E.2/E.3 closed 2026-04-23** — `FleetStatusPoller` now polls `DriverInstanceResilienceStatus`, detects per-`(DriverInstanceId, HostName)` deltas, and pushes `ResilienceStatusChangedMessage` via `FleetStatusHub` on the fleet group. Admin `/hosts` page subscribes on load and upserts the matching `HostStatusRow` in-memory on receipt, so operator-visible resilience state now reflects the runtime within one poller tick (~5 s) instead of the Admin page's own 10-second refresh. `FleetStatusPollerTests.Poller_pushes_ResilienceStatusChanged_on_delta` covers the first-observation push, the no-delta-no-push invariant, and the mutated-row re-push.
>
> Baseline: 906 solution tests → post-Phase-6.1: 1042 passing (+136 net). One pre-existing Client.CLI Subscribe flake unchanged.
>
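The poller's delta contract (push on first observation, stay silent when nothing changed, re-push when a row mutates) is the invariant `FleetStatusPollerTests` pins down. A minimal sketch, in Python for illustration; the real poller is a C# hosted service pushing over SignalR:

```python
class FleetStatusPoller:
    """Pushes a status message only when a (driver, host) row changes (sketch)."""

    def __init__(self, push):
        self._seen = {}   # (DriverInstanceId, HostName) -> last observed status
        self._push = push

    def poll(self, rows):
        for row in rows:
            key = (row["DriverInstanceId"], row["HostName"])
            status = row["Status"]
            if self._seen.get(key) != status:  # first observation or mutation
                self._seen[key] = status
                self._push({"key": key, "status": status})


pushed = []
poller = FleetStatusPoller(pushed.append)
row = {"DriverInstanceId": 1, "HostName": "nodeA", "Status": "Healthy"}
poller.poll([row])                            # first observation: push
poller.poll([row])                            # no delta: no push
poller.poll([dict(row, Status="Degraded")])   # mutated row: re-push
```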
@@ -129,7 +131,7 @@ Closes these gaps flagged in the 2026-04-19 audit:
- [ ] Stream B: Tier registry + generalised watchdog + scheduled recycle + wedge detector
- [ ] Stream C: `/healthz` + `/readyz` + structured logging + JSON Serilog sink
- [ ] Stream D: LiteDB cache + Polly fallback in Configuration
- [x] Stream E: Admin `/hosts` page refresh (E.1 in PRs #78-82 with the data layer; E.2/E.3 closed 2026-04-23)
- [ ] Cross-cutting: `phase-6-1-compliance.ps1` exits 0; full solution `dotnet test` passes; exit-gate doc recorded
## Adversarial Review — 2026-04-19 (Codex, thread `019da489-e317-7aa1-ab1f-6335e0be2447`)


@@ -1,10 +1,9 @@
# Phase 6.2 — Authorization Runtime (ACL + LDAP grants)
> **Status**: **FULLY SHIPPED** (updated 2026-04-23 audit). Streams A-D core merged to `v2` across PRs #84-87 + exit-gate PR #88 on 2026-04-19; both named deferrals landed separately and were confirmed against the repo this session:
>
> - **Task #143 Stream C dispatch wiring** — `DriverNodeManager` calls `AuthorizationGate.IsAllowed(context.UserIdentity, OpcUaOperation.<Op>, scope)` on Read (line 249), Write (line 536) with per-classification `OpcUaOperation.WriteOperate` / `WriteTune` / `WriteConfigure` routed via `WriteAuthzPolicy`, and HistoryRead (4 call sites). `TriePermissionEvaluator` + `PermissionTrieCache` back the gate.
> - **Task #144 Stream D Admin UI** — `RoleGrants.razor` (LDAP group → Admin role mapping) + `AclsTab.razor` (per-cluster node-ACL editor with a probe-this-permission surface via `PermissionProbeService`) + `AclChangeNotifier` SignalR hub for cache invalidation all present and wired.
>
> Baseline pre-Phase-6.2: 1042 solution tests → post-Phase-6.2 core: 1097 passing (+55 net). One pre-existing Client.CLI Subscribe flake unchanged.
>
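The per-classification write routing described for task #143 resolves a write to `WriteOperate` / `WriteTune` / `WriteConfigure` before the gate is consulted. A two-step sketch in Python; the grant store here is a plain dict standing in for the trie-backed evaluator:

```python
def write_operation_for(classification):
    """Route a write by tag classification to the matching operation (sketch)."""
    return {
        "Operate": "WriteOperate",
        "Tune": "WriteTune",
        "Configure": "WriteConfigure",
    }[classification]


class AuthorizationGate:
    """Answers is_allowed from a per-identity grant set (dict stand-in)."""

    def __init__(self, grants):
        self._grants = grants  # identity -> set of granted operations

    def is_allowed(self, identity, operation):
        return operation in self._grants.get(identity, set())


gate = AuthorizationGate({"op1": {"Read", "WriteOperate"}})
allowed = gate.is_allowed("op1", write_operation_for("Operate"))
denied = gate.is_allowed("op1", write_operation_for("Configure"))
```

Splitting classification routing from the allow check keeps the gate itself operation-agnostic, which is what lets one gate serve all the dispatch surfaces.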


@@ -1,13 +1,20 @@
# Phase 6.3 — Redundancy Runtime
> **Status**: **SHIPPED (core + Stream C)** — original body merged 2026-04-19; audit 2026-04-23 promoted **Stream C (task #147)** into shipped state.
>
> **In** (verified in repo):
> - Stream A — `ClusterTopologyLoader`, `RedundancyCoordinator`, `RedundancyTopology`, `PeerReachability` all present under `src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`. Coordinator is now also hosted by `Program.cs` via the new `RedundancyPublisherHostedService`, which calls `RefreshAsync` on startup.
> - Stream B — `ServiceLevelCalculator` + `RecoveryStateManager`.
> - **Stream C (task #147) — OPC UA node wiring**. `ServerRedundancyNodeWriter` maintains `Server.ServiceLevel` (i=2267), `Server.ServerRedundancy.RedundancySupport` (i=2994), and `Server.ServerRedundancy.ServerUriArray` (non-transparent subtype) by writing the `PropertyState.Value` + calling `ClearChangeMasks`. `RedundancyPublisherHostedService` drives the publisher on a 1 s tick and fans `OnStateChanged` / `OnServerUriArrayChanged` into the writer. Mapping of `Configuration.RedundancyMode` → Part 4 `RedundancySupport` is Warm/Hot/None (v2 clusters don't enumerate Cold / HotAndMirrored per decision #85). Idempotent per-value dedupe prevents spurious OPC UA notifications. Unit coverage: `ServerRedundancyNodeWriterTests` (4 tests, green).
> - Stream D — `ApplyLeaseRegistry`.
> - Stream E — `RedundancyTab.razor` with SignalR `RoleChanged` wiring (via `FleetStatusPoller` + `FleetStatusHub`) — stale-flag + role-swap banner.
>
> **Closed this session (2026-04-23)**:
> - **Task #148 part 2** — `DraftValidator.ValidateClusterTopology(cluster, nodes)` now catches three pre-publish invariants the SQL CHECK can't see: (a) unsupported `NodeCount`/`RedundancyMode` pairs; (b) `Enabled`-node count vs. declared `NodeCount` mismatch (catches disabled-node drift with mode still Hot/Warm); (c) multiple-Primary per decision #84. Returns every failure in one pass — same shape as `Validate`. 8 new tests in `DraftValidatorTests` green.
> - **Task #150 Stream F** — `docs/v2/redundancy-interop-playbook.md` captures the manual validation matrix against UaExpert + Kepware + AVEVA MXAccess failover. Automating these closed-source GUI clients in PR-CI is out of scope; the automatable half is already covered by `ServiceLevelCalculatorTests` / `RedundancyStatePublisherTests` / `ClusterTopologyLoaderTests` / `ServerRedundancyNodeWriterTests`.
>
> **Remaining (documented limitation, not blocking v2.0)**:
> - Non-transparent redundancy-state node upgrade — the SDK's default `Server.ServerRedundancy` object is the base `ServerRedundancyState`, so `ApplyServerUriArray` currently logs-and-skips. Operators on the rare deployment that needs `ServerUriArray` read-back get a clear warning with the upgrade path. Documented in the interop playbook's "Known limitations" section.
>
> Baseline pre-Phase-6.3: 1097 solution tests → post-Phase-6.3 core: 1137 passing (+40 net).
>
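Two details from Stream C above can be made concrete: the Warm/Hot/None-only mapping of `RedundancyMode` to `RedundancySupport`, and the per-value dedupe that suppresses spurious OPC UA notifications. A Python sketch; the writer below records writes instead of touching real `PropertyState` nodes:

```python
# v2 clusters never enumerate Cold / HotAndMirrored (per decision #85),
# so the mode-to-support mapping is deliberately this small.
SUPPORT_BY_MODE = {"Hot": "Hot", "Warm": "Warm", "None": "None"}


class NodeWriter:
    """Writes a node value only when it actually changed (dedupe sketch)."""

    def __init__(self):
        self.node_values = {}
        self.writes = []  # stand-in for PropertyState writes + ClearChangeMasks

    def apply(self, node, value):
        if self.node_values.get(node) == value:
            return  # idempotent: no spurious notification for an unchanged value
        self.node_values[node] = value
        self.writes.append((node, value))


writer = NodeWriter()
writer.apply("ServiceLevel", 200)
writer.apply("ServiceLevel", 200)  # deduped, no second write
writer.apply("RedundancySupport", SUPPORT_BY_MODE["Warm"])
writer.apply("ServiceLevel", 150)  # real change, written through
```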


@@ -1,12 +1,17 @@
# Phase 6.4 — Admin UI Completion
> **Status**: **SHIPPED (mostly)** 2026-04-19; audit 2026-04-23 confirms what landed separately after the data-layer PR #91:
>
> **In** (verified in repo):
> - **Task #153 Stream A UI** — `UnsTab.razor` with drag/drop handlers + concurrent-edit via `DraftRevisionToken` + `UnsImpactAnalyzer`; Playwright smoke test in `tests/ZB.MOM.WW.OtOpcUa.Admin.E2ETests/UnsTabDragDropE2ETests.cs`.
> - **Task #155 Stream B** — `EquipmentImportBatch` entity + migration, `EquipmentImportBatchService.CreateBatchAsync` / `FinaliseBatchAsync` / `DropBatchAsync` / `ListByUserAsync`, `ImportEquipment.razor` UI.
> - **Task #156 Stream C** — `DiffViewer.razor` + `DiffSection.razor` refactor in place.
> - Admin UI `IdentificationFields.razor` surface shipped (part of #157).
>
> **Closed this session (2026-04-23)**:
> - **Task #157 Stream D server-side half** — the earlier audit claim that this was still missing was stale. `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/IdentificationFolderBuilder.cs` ships the OPC 40010 Identification sub-folder materializer (Manufacturer / Model / SerialNumber / HardwareRevision / SoftwareRevision / YearOfConstruction / AssetLocation / ManufacturerUri / DeviceManualUri); `EquipmentNodeWalker.Walk` calls it per equipment; `IdentificationFolderBuilderTests` (158 lines) + two walker-level tests (`Walk_Materializes_Identification_Subfolder_When_AnyFieldPresent`, `Walk_Omits_Identification_Subfolder_When_AllFieldsNull`) cover the null-handling branches. The initial audit grepped only `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/`; the builder lives in `Core/OpcUa/`.
>
> **Phase 6.4 is now FULLY SHIPPED — no deferred surfaces remain.**
>
> Baseline pre-Phase-6.4: 1137 solution tests → post-Phase-6.4 data layer: 1159 passing (+22).
>
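The builder's null-handling contract (materialise the Identification sub-folder when any field is present, omit it when all nine are null) is what the two walker-level tests pin down. A Python sketch; the field names come from the doc, the function name is invented:

```python
IDENTIFICATION_FIELDS = (
    "Manufacturer", "Model", "SerialNumber", "HardwareRevision",
    "SoftwareRevision", "YearOfConstruction", "AssetLocation",
    "ManufacturerUri", "DeviceManualUri",
)


def build_identification_folder(equipment):
    """Return the sub-folder's properties, or None when every field is null."""
    props = {
        field: equipment.get(field)
        for field in IDENTIFICATION_FIELDS
        if equipment.get(field) is not None
    }
    return props or None  # omit the folder entirely when nothing is set


with_fields = build_identification_folder({"Manufacturer": "FANUC", "Model": "0i-MF"})
without = build_identification_folder({})
```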


@@ -689,7 +689,7 @@ Galaxy.Proxy ──→ Galaxy.Shared ←── Galaxy.Host
**Decided:**
- Mono-repo (Decision #31 above).
- `Core.Abstractions` is **internal-only for now** — no standalone NuGet. Keep the contract mutable while the first 8 drivers are being built; revisit publishing after the driver fleet (originally Phase 5, folded into the Phase 3 umbrella — see exit gate) once the shape has stabilized. Design the contract *as if* it will eventually be public (no leaky types, stable names) to minimize churn later.
---
@@ -742,24 +742,30 @@ Each step leaves the system runnable. The generic extraction is effectively free
10. **Build `Galaxy.Proxy`** — .NET 10 in-process proxy implementing IDriver interfaces, forwarding over IPC
11. **Validate parity** — v2 Galaxy driver must pass the same integration tests as v1
**Phase 3 — Driver fleet (all seven non-Galaxy drivers) — ✅ CLOSED 2026-04-23** (see [`implementation/exit-gate-phase-3.md`](implementation/exit-gate-phase-3.md))
Originally split across Phase 3 (Modbus alone), Phase 4 (PLC drivers), and
Phase 5 (specialty drivers). In execution, once `Core.Abstractions` had
stabilised under Phase 1 + Phase 2, each driver landed as its own stream
rather than as a gated mini-phase; the phase numbers were folded into a
single umbrella. Shipped:
12. **`Driver.Modbus`** — NModbus, config-driven tags, internal poll loop, device-as-folder hierarchy (umbrella closure #210)
13. **`Driver.AbCip`** — libplctag, ControlLogix/CompactLogix symbolic tags (#211, live-booted under #220)
14. **`Driver.AbLegacy`** — libplctag, SLC 500 / MicroLogix / PLC-5 file-based addressing (#213, live-booted under #222)
15. **`Driver.S7`** — S7netplus, Siemens S7-300/400/1200/1500 (#212, live-booted under #220)
16. **`Driver.TwinCAT`** — Beckhoff.TwinCAT.Ads v7, native ADS notifications, symbol upload (factory wired 2026-04-23; wire-live deferred, #221)
17. **`Driver.FOCAS`** — FANUC FOCAS2 P/Invoke via Tier-C out-of-process `Driver.FOCAS.Host` (#220 five-PR split; wire-live deferred, #222 follow-up)
18. **`Driver.OpcUaClient`** — OPC UA client gateway / aggregation, namespace remapping, subscription proxying (scaffold #66; live-boot 5/8 stages via `test-opcuaclient.ps1`)
Supporting infrastructure: `DriverFactoryRegistry` + `DriverInstanceBootstrapper`
(#248); per-driver test-client CLI suite (#249–#251); e2e test scripts with
aggregate runner (#253); server-side factory + seed SQL per driver (#210–#213).
**Decided:**
- **Parity test for Galaxy**: existing v1 IntegrationTests suite + scripted Client.CLI walkthrough (see Section 4 above).
- **Timeline**: no hard deadline. Each phase ships when it's right — tests passing, Galaxy parity bar met. Quality cadence over calendar cadence.
- **FOCAS SDK**: license already secured. FOCAS driver shipped as part of the Phase 3 umbrella with Tier-C host; `Fwlib64.dll` available for P/Invoke (wire-level live-boot gated on lab rig, #222 follow-up).
---

# Redundancy Interop Playbook (Phase 6.3 Stream F — task #150)
> **Scope**: manual validation that third-party OPC UA clients + AVEVA MXAccess
> observe our non-transparent redundancy signals (ServiceLevel, ServerUriArray,
> RedundancySupport) and fail over to the Backup node when the Primary drops.
>
> **Why manual**: the third-party clients named here are Windows-GUI binaries
> (UaExpert, Kepware QuickClient) or embedded inside AVEVA System Platform.
> Automating any of them into PR-CI is out of scope for v2. This playbook
> captures the minimal dev-box-plus-VM setup and the expected pass criteria so
> the work can be executed repeatably at v2 release readiness and after any
> Phase 6.3 follow-up change.
## Prerequisites
1. Two `OtOpcUa.Server` nodes in one `ServerCluster`:
- Declared as `NodeCount = 2`, `RedundancyMode = Hot` (or `Warm`).
- Each with a distinct `ApplicationUri` (enforced by unique index per
decision #86).
- Each node's `StaticRoutes.xml` points at the other (`ServerCluster.Node[].Host`).
2. `scripts/install/Install-Services.ps1` applied on each node so the
`RedundancyPublisherHostedService` is running.
3. At least one `DriverInstance` with a reachable simulator or PLC so both
servers have a non-empty address space to browse.
4. On the client host:
- `UaExpert` ≥ 1.7 installed
- Kepware `ClientAce QuickClient` (or equivalent) — optional, for a second
client
5. For the AVEVA leg: a `Galaxy.Host` running against an MXAccess deployment
with an external OPC UA client object pointed at the cluster (not at a
single node).
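
The playbook doesn't reproduce the `StaticRoutes.xml` schema; purely to illustrate the peer-pointing requirement in step 1, a hypothetical fragment might look like the following. Element and attribute names are inferred from the `ServerCluster.Node[].Host` path above, not taken from the real schema — consult the file actually shipped with `OtOpcUa.Server`.

```xml
<!-- Hypothetical illustration only; shape inferred from ServerCluster.Node[].Host.
     Consult the schema shipped with OtOpcUa.Server for the real element names. -->
<ServerCluster NodeCount="2" RedundancyMode="Hot">
  <Node>
    <Host>node-a.example.local</Host>
    <ApplicationUri>urn:example:otopcua:node-a</ApplicationUri>
  </Node>
  <Node>
    <Host>node-b.example.local</Host>
    <ApplicationUri>urn:example:otopcua:node-b</ApplicationUri>
  </Node>
</ServerCluster>
```

Each node's copy lists the *other* node first in practice; the point is only that both `ApplicationUri`s are distinct (decision #86) and each node can resolve its peer's `Host`.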
## Expected signals on a running cluster
| Node | `ServiceLevel` | `RedundancySupport` | `ServerUriArray` |
|---|---|---|---|
| Primary, healthy, peer reachable | 200 | `Hot` (or `Warm`) | self + peer |
| Primary, mid-apply | 75 (`PrimaryMidApply`) | same | same |
| Primary, peer UNreachable | 150 (`PrimaryPeerDown`) | same | same |
| Backup, healthy | 100 (`Secondary`) | same | same |
| Either, dwelling in recovery | 50 (`Recovering`) | same | same |
| Either, invariant violation (two Primary, disabled-node mismatch) | 2 (`InvalidTopology`) | same | same |
(The band constants live in `ServiceLevelCalculator.Classify`.)
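
The band table reduces to a small state-to-byte map. The sketch below is an illustrative Python restatement for playbook readers — it is *not* the C# `ServiceLevelCalculator.Classify` implementation, and the state names merely paraphrase the band labels above.

```python
# Illustrative Python restatement of the ServiceLevel bands in the table
# above. State names paraphrase the band labels; the authoritative
# constants live in ServiceLevelCalculator.Classify (C#).

BANDS = {
    "PrimaryHealthy":  200,  # Primary, healthy, peer reachable
    "PrimaryPeerDown": 150,  # Primary, peer unreachable
    "Secondary":       100,  # Backup, healthy
    "PrimaryMidApply":  75,  # Primary, mid-apply
    "Recovering":       50,  # Either node, dwelling in recovery
    "InvalidTopology":   2,  # Invariant violation (e.g. two Primaries)
}

def classify(state: str) -> int:
    """Map a redundancy state to its ServiceLevel byte (0-255)."""
    return BANDS[state]

assert classify("PrimaryPeerDown") == 150
```

Reading the byte in UaExpert and matching it against this map is the core of every Block A pass criterion.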
## Test matrix
Each row is one manual run; pass criterion in the right column.
### Block A — UA protocol signals (UaExpert)
| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| A1 | ServiceLevel published | Connect UaExpert to Primary. Browse to `Server.ServerStatus.ServiceLevel`. | Value = 200 (or the expected Band byte per table above) |
| A2 | ServiceLevel updates on peer down | Connect to Primary. Stop Backup (`sc stop OtOpcUa`). Watch `ServiceLevel`. | Transitions 200 → 150 within ~2 s of peer probe timeout |
| A3 | RedundancySupport | Browse to `Server.ServerRedundancy.RedundancySupport`. | Value matches the declared `RedundancyMode` (Warm / Hot / None) |
| A4 | ServerUriArray (non-transparent upgrade) | Requires a redundancy-object-type upgrade follow-up. | When upgrade lands: `ServerUriArray` reports both ApplicationUris, self first |
| A5 | Mid-apply dip | On Primary trigger a `sp_PublishGeneration` apply. | `ServiceLevel` drops to 75 for the apply duration + dwell |
### Block B — Client failover
| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| B1 | UaExpert picks Primary by ServiceLevel | In UaExpert configure a Redundancy Group with both endpoint URLs. | Client picks the Primary URL (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill the Primary's `OtOpcUa` service. | Client session reconnects to Backup within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
| B3 | UaExpert cuts back when Primary returns | Start the Primary service. Wait ≥ recovery dwell (see `RecoveryStateManager.DwellTime`). | `ServiceLevel` on returning Primary goes through 50 (Recovering) → 200; UaExpert may or may not switch back (client-policy dependent; both are accepted outcomes) |
| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware in place of UaExpert. | Same pass criteria; establishes we're not UaExpert-specific |
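
B3's expected 50 → 200 transition is dwell-gated: the returning Primary advertises `Recovering` until it has been back for the full dwell window. A minimal Python sketch of that timing (hypothetical helper; the real logic lives in `RecoveryStateManager`):

```python
# Hypothetical sketch of the B3 dwell gate: a returning Primary reports
# Recovering (50) until it has been back for at least DwellTime, then
# resumes the healthy Primary band (200). Real logic: RecoveryStateManager.

DWELL_SECONDS = 60  # default DwellTime configured in Program.cs

def service_level_after_return(seconds_since_return: float) -> int:
    """ServiceLevel a returning, otherwise-healthy Primary should publish."""
    return 50 if seconds_since_return < DWELL_SECONDS else 200
```

So when running B3, don't judge a "stuck at 50" reading as a failure until the dwell window has elapsed.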
### Block C — Galaxy MXAccess failover
This block validates that an AVEVA System Platform app consuming our cluster
via MXAccess tolerates a Primary drop the same way a native OPC UA client does.
The MXAccess toolkit wraps an OPC UA client internally and performs its own
redundancy negotiation; we're asserting that negotiation honors our
`ServiceLevel` signal.
| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| C1 | Galaxy binds to Primary on first connect | Bring the cluster up. Start a Galaxy `$MxAccessClient` object pointed at the cluster with both node URLs. | Galaxy reports `QUALITY = Good` + initial values from the Primary |
| C2 | Galaxy redirects on Primary drop | Stop the Primary. | Galaxy's `QUALITY` briefly goes `Uncertain`, then back to `Good`; values continue streaming from the Backup within MXAccess's `ReconnectInterval` (default 20 s) |
| C3 | Galaxy handles mid-apply dip | Trigger a generation apply on the Primary. | Galaxy continues reading — the mid-apply dip is advertisory (ServiceLevel 75), not a session drop; MXAccess should stay bound |
## Recording results
Copy the tables above into a tracking doc per run. The tracking doc shape:
```
Run date: 2026-MM-DD
Cluster: <id> Primary: <node> Backup: <node> Release: <sha>
A1: PASS evidence: UaExpert screenshot uaexpert-a1.png
A2: PASS evidence: ServiceLevel trend grafana-a2.png
```
One pass of every row is the acceptance criterion. Re-run after any Phase 6.3
follow-up ships (especially the non-transparent redundancy-type upgrade, which
flips A4 from "deferred" to "expected pass").
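
To keep record lines uniform across runs, they can be generated rather than hand-typed. A throwaway Python helper (the function name and fields are hypothetical, matched to the shape shown above):

```python
# Throwaway helper emitting tracking-doc rows in the shape shown above.
# Name and arguments are hypothetical; adjust to taste.

def record(row_id: str, passed: bool, evidence: str) -> str:
    """Format one playbook-run result line, e.g. 'A1: PASS evidence: ...'."""
    status = "PASS" if passed else "FAIL"
    return f"{row_id}: {status} evidence: {evidence}"

print(record("A1", True, "UaExpert screenshot uaexpert-a1.png"))
# A1: PASS evidence: UaExpert screenshot uaexpert-a1.png
```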
## Known limitations
- **A4 pending**: `Server.ServerRedundancy` on our current SDK build lands as
the base `ServerRedundancyState`, which has no `ServerUriArray` child.
`ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips until the
redundancy-object-type upgrade follow-up lands.
- **Recovery dwell default**: `RecoveryStateManager.DwellTime` defaults to 60 s
in `Program.cs`. Adjust via future config knob if B3 takes too long to
observe.
- **C-block external dependency**: The `Galaxy.Host` side of the redundancy
story is largely out of our code — it's MXAccess's own client-redundancy
policy talking to our published ServiceLevel. A negative result on C1-C3
does not necessarily indicate an OtOpcUa bug; cross-check with UaExpert
(Block A / B) first.
## Automation notes (why this is a playbook, not a test)
- UaExpert and Kepware binaries are closed-source Windows GUIs; they don't
ship headless CLIs for the browse/connect/subscribe flows.
- The OPC Foundation reference SDK *can* drive every scenario, but our own
`Driver.OpcUaClient` tests already cover that client's behaviour; Block B
adds value specifically because these two clients have independent
redundancy implementations we don't control.
- For the subset of scenarios that *can* be automated — the self-loopback
case where our own `otopcua-cli` drives Primary + Backup — the existing
`tests/ZB.MOM.WW.OtOpcUa.Server.Tests/RedundancyStatePublisherTests` +
`ServiceLevelCalculatorTests` (unit) + `ClusterTopologyLoaderTests`
(integration) already cover the math + data path. The wire-level assertion
that the values actually land on the right OPC UA nodes is covered by
`ServerRedundancyNodeWriterTests`.