docs: add four planning runbooks for Phase 6.3 interop, v2 GA gates, live-hardware validation, and alarms worker wiring

Produces docs/plans/ entries for tasks #13, #15, #16, and #17-#20:

- phase-6-3-redundancy-interop-plan.md: automation boundary analysis, concrete test matrix (A/B/C blocks), and a step-by-step cutover runbook for the deferred Stream F client interop work
- v2-ga-lab-gates-plan.md: exact gate list with command, pass criterion, and owner for each of the nine v2 GA exit criteria
- live-hardware-validation-runbooks.md: one runbook per driver (FOCAS CNC smoke #54, AB CIP live-boot, TwinCAT wire-live) with preconditions, procedure, expected results, and recording template
- alarms-worker-wiring-plan.md: focused plan for A.2/A.3-A.4/C.1/D.1 worker wiring in the mxaccessgw sibling repo, documenting the discovered AVEVA API surface, the architectural decision that blocks A.2, the dependency order, and what each item needs to unblock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 04:52:07 -04:00
parent da8a3e46f7
commit 16a87b08f3
4 changed files with 1422 additions and 0 deletions
--- a/docs/plans/alarms-worker-wiring-plan.md
+++ b/docs/plans/alarms-worker-wiring-plan.md
@@ -0,0 +1,340 @@
+# Alarms Worker Wiring Plan
+
+> **Context**: The alarms-over-gateway epic shipped 19 PRs across the
+> `lmxopcua` and `mxaccessgw` repos (merged 2026-04-30). Contracts are live;
+> the sub-attribute fallback path keeps Galaxy alarms functional today. Four
+> items remain as inert scaffolds gated on a dev-rig finding. This document is
+> the focused implementation plan for those four items only.
+>
+> **Do not duplicate `docs/plans/alarms-over-gateway.md`** — that document is
+> the full historical record of all 19 PRs. This document covers only what is
+> still to be done and exactly what blocks each item.
+>
+> **This work lives in the mxaccessgw sibling repo** at
+> `C:\Users\dohertj2\Desktop\mxaccessgw\` — not in this (lmxopcua) repo,
+> except where lmxopcua changes are noted explicitly.
+
+---
+
+## Dev-rig finding that blocks everything (2026-04-30)
+
+During PR A.2 work the following was discovered on the dev box:
+
+> The MXAccess COM Toolkit at
+> `C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll`
+> exposes **no alarm-event family** — only `OnDataChange`, `OnWriteComplete`,
+> `OperationComplete`, `OnBufferedDataChange`.
+>
+> AVEVA's `aaAlarmManagedClient` / `ArchestrAAlarmsAndEvents.SDK` assemblies
+> are **x64-only** and incompatible with the worker's x86 net48 bitness.
+
+The architectural decision required before any of A.2, A.3/A.4, C.1 can ship:
+
+> **Either** accept the value-driven sub-attribute path as the production
+> architecture (operator-comment fidelity is the only v1 regression), **or**
+> add an x64 alarm-helper sub-process alongside the x86 worker.
+
+The resolution drives the implementation shape of every item below. This plan
+assumes the x64 alarm-helper sub-process route (the higher-parity option) but
+notes the sub-attribute-only exit at each step.
+
+---
+
+## Discovered AVEVA API surface
+
+Before implementing, verify the following against the AVEVA SDK actually
+installed on the dev box and in the mxaccessgw worker's deployment folder:
+
+| Assembly | Bitness | Likely location | Key types |
+|----------|---------|-----------------|-----------|
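+| `ArchestrA.MXAccess.dll` | x86 | `C:\Program Files (x86)\ArchestrA\Framework\Bin\` | `IMxAlarmEventSink`, `MxAlarmEventArgs` (hypothesised; the 2026-04-30 dev-rig finding saw no alarm-event family in this assembly, so **confirm whether they exist at the installed version**) |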
+| `aaAlarmManagedClient.dll` | x64 | `C:\Program Files\ArchestrA\Framework\Bin\` | `AlarmClient`, `IAlarmConsumer`, `AlarmEventArgs` |
+| `ArchestrAAlarmsAndEvents.SDK.dll` | x64 | Same or Historian SDK folder | `AlarmHistorianWriter`, `GetAlarmExtendedRec` |
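+
+A quick way to sanity-check the Bitness column is to read the PE header of
+each assembly directly. The sketch below is illustrative only: `Get-PeMachine`
+is a hypothetical helper name, not part of either repo, and the paths are the
+"likely" locations from the table, so they may differ on the dev box. Note
+that a managed AnyCPU assembly also reports the I386 machine type, so an x86
+result here does not by itself prove a 32-bit-only requirement.
+
+```powershell
+# Hypothetical helper: report the PE "Machine" field of a DLL or EXE.
+function Get-PeMachine {
+    param([Parameter(Mandatory)][string]$Path)
+
+    $bytes    = [System.IO.File]::ReadAllBytes($Path)
+    # Offset 0x3C of the DOS header holds the file offset of the "PE\0\0" signature.
+    $peOffset = [System.BitConverter]::ToInt32($bytes, 0x3C)
+    # The 2-byte Machine field follows the 4-byte signature.
+    $machine  = [System.BitConverter]::ToUInt16($bytes, $peOffset + 4)
+
+    switch ($machine) {
+        0x014c  { 'x86 (I386)' }
+        0x8664  { 'x64 (AMD64)' }
+        default { 'other (0x{0:X4})' -f $machine }
+    }
+}
+
+# Paths are assumptions taken from the table above; adjust to the actual install.
+Get-PeMachine 'C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll'
+Get-PeMachine 'C:\Program Files\ArchestrA\Framework\Bin\aaAlarmManagedClient.dll'
+```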
+
+The AVEVA MXAccess Toolkit reference in the mxaccessgw repo (`gateway.md`) is
+the canonical API doc for the gateway worker's side. The alarm-client API is
+documented separately; verify the following call shapes during PR A.2:
+
+| Operation | Likely API | Notes |
+|-----------|-----------|-------|
+| Subscribe to alarm events | `AlarmClient.RegisterConsumer(IAlarmConsumer)` + `AlarmClient.Subscribe(filterSpec)` | Confirm exact method signatures against the SDK version on the dev box |
+| Receive alarm event | `IAlarmConsumer.OnAlarmEvent(AlarmEventArgs)` callback | Field set: alarm name, source, type, transition kind, severity, timestamps, operator fields |
+| Acknowledge alarm | `AlarmClient.AcknowledgeAlarm(alarmRef, comment, userPrincipal)` or equivalent | Confirm whether this is synchronous or returns a status |
+| Query active alarms | `AlarmClient.GetAlarmExtendedRec(filter)` or `GetActiveAlarms()` | Returns current active set for ConditionRefresh |
+| Get statistics | `AlarmClient.GetStatistics()` | Optional — useful for worker health checks |
+
+Record the exact method signatures against the installed SDK before starting
+A.2 — the proto field set in `OnAlarmTransitionEvent` must match the SDK's
+actual payload.
+
+---
+
+## Dependency order
+
+```
+A.2 (worker: AlarmClient subscription)
+  └─► A.3 (gateway: dispatch OnAlarmTransition + AcknowledgeAlarm RPC handler)
+        └─► A.4 (gateway: QueryActiveAlarms RPC handler)
+              └─► lmxopcua B.2 (GalaxyDriver IAlarmSource live)
+                    └─► C.1 (sidecar: AahClientManagedAlarmEventWriter live)
+                          └─► D.1 (smoke artifact captured)
+```
+
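+A.2 is the single blocking item for the native alarm path. A.3, A.4, and B.2
+unblock serially once A.2 delivers alarm events through the channel. C.1 is
+listed in the chain but, as its own section notes, has no bitness dependency
+on A.2 and can proceed in parallel.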
+
+---
+
+## Item A.2 — Worker: subscribe to MxAccess alarm event source
+
+**Repo**: `mxaccessgw` — `src\MxGateway.Worker\` (net48, x86)
+
+**What it needs**:
+
+The worker must subscribe to AVEVA's alarm events and fan them into the same
+bounded channel the data-change pump uses, translating each MxAccess alarm
+event into a `WorkerEvent` proto with family `MX_EVENT_FAMILY_ON_ALARM_TRANSITION`
+(defined in PR A.1, already merged).
+
+**Architectural choice determines the implementation path**:
+
+**Option X1 — aaAlarmManagedClient in a new x64 alarm-helper process**
+
+Add a second worker-mode sub-process (`MxGateway.AlarmWorker`, net8.0 x64)
+alongside the existing x86 worker. The AlarmWorker:
+
+1. Loads `aaAlarmManagedClient.dll` (x64) on startup.
+2. Calls `AlarmClient.RegisterConsumer` with a `WorkerAlarmConsumer` sink.
+3. Calls `AlarmClient.Subscribe` with a session-level filter (all alarms for
+   the session's Galaxy scope).
+4. Translates each `IAlarmConsumer.OnAlarmEvent` callback into a protobuf
+   `WorkerEvent` (family `ON_ALARM_TRANSITION`) and writes it to an IPC
+   channel readable by the gateway server-side multiplexer.
+5. Handles session lifecycle: re-subscribes after reconnect; unsubscribes on
+   session close.
+
+IPC from AlarmWorker to gateway: the simplest options are a named pipe (when
+the AlarmWorker runs as its own process) or an in-process queue (if the
+AlarmWorker is instead hosted inside the gateway process as a separate
+`IHostedService`).
+
+**Option X2 — Accept sub-attribute fallback as production (no A.2 work)**
+
+If the architectural decision is to accept the sub-attribute path as permanent:
+
+- `MxAccessAlarmEventSink.Attach()` in the worker remains a no-op (as
+  currently coded with the architectural comment).
+- The `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` proto family stays defined but
+  the gateway never emits events on it.
+- lmxopcua's `GalaxyDriver` does not implement `IAlarmSource` for the
+  native path; the value-driven sub-attribute path remains the production
+  path.
+- The only regression vs. v1 is operator-comment fidelity on Galaxy alarms.
+- C.1 is still needed if scripted-alarm historian write-back is required.
+
+**What blocks it**: the architectural decision above. Once made, A.2 becomes
+a 2–3 day implementation task (sub-process plumbing + proto translation +
+unit tests for the consumer sink cancellation behaviour).
+
+**Tests to write (when A.2 proceeds)**:
+
+- `WorkerAlarmConsumerTests` — fake `IAlarmConsumer` source emits canned
+  transitions; assert each produces the correct `WorkerEvent` body shape.
+- Cancellation/session-close test — closing the session unsubscribes from
+  the AlarmClient cleanly (no leaked `IAlarmConsumer` reference if the
+  worker is recycled mid-session).
+- Re-subscribe-after-reconnect test — `ReconnectSupervisor` triggers a
+  reconnect; assert the alarm consumer re-attaches to the new session.
+
+---
+
+## Item A.3 / A.4 — Gateway: dispatch and RPC handlers
+
+**Repo**: `mxaccessgw` — `src\MxGateway.Server\`
+
+**Depends on**: A.2 delivering `WorkerEvent` bodies with family
+`MX_EVENT_FAMILY_ON_ALARM_TRANSITION`.
+
+**What it needs**:
+
+### A.3 — Dispatch + AcknowledgeAlarm
+
+1. The session-level event multiplexer (`Sessions\SessionEventStream.cs` or
+   equivalent — verify name in the mxaccessgw repo) must recognise the new
+   `WorkerEvent` body and forward it as an `MxEvent` with family
+   `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` to every `StreamEvents` subscriber
+   for that session.
+
+2. New RPC handler `AcknowledgeAlarm` builds an `AlarmAcknowledgeCommand`
+   worker command and forwards it to the alarm-helper process (Option X1) or
+   the worker's MxAccess session (Option X2 if MxAccess exposes ack). Maps
+   the reply status to `AcknowledgeAlarmReply.MxStatusProxy`.
+
+3. Authorization: new API scope `invoke:alarm-ack` on the API key. Keys
+   without it receive `PERMISSION_DENIED`. Follow the existing scope-check
+   pattern used by `invoke:write`.
+
+### A.4 — QueryActiveAlarms
+
+1. New RPC handler `QueryActiveAlarms` calls `AlarmClient.GetAlarmExtendedRec`
+   (or `GetActiveAlarms` — confirm the method name during implementation)
+   on the alarm-helper process, batches results into `ActiveAlarmSnapshot`
+   proto messages, and streams them back to the caller.
+
+2. New API scope `invoke:alarm-query` (separate from ack so read-only clients
+   can refresh without ack rights).
+
+**What blocks A.3/A.4**: A.2 must deliver `WorkerEvent` bodies on the channel.
+A.3/A.4 are pure dispatch wiring once the events arrive.
+
+**Tests to write**:
+
+- A.3 dispatch test — fake worker emits an `AlarmTransition` event; assert
+  the gateway forwards it on the `StreamEvents` channel of every subscribed
+  session (mirrors existing `OnDataChange` dispatch tests).
+- A.3 AcknowledgeAlarm auth test — existing key without `invoke:alarm-ack`
+  scope returns `PERMISSION_DENIED`.
+- A.4 pagination test — synthetic active-alarm set of 0 / 1 / 100 entries;
+  assert each streams back as separate `ActiveAlarmSnapshot` messages.
+- Integration (parity rig — requires dev box with AVEVA platform):
+  trigger a real Galaxy alarm, call `QueryActiveAlarms`, assert the alarm
+  appears in the stream; call `AcknowledgeAlarm`, assert the alarm transitions
+  to `ActiveAcked` and an `Acknowledge` transition event appears on
+  `StreamEvents`.
+
+---
+
+## Item C.1 — Historian sidecar: AahClientManagedAlarmEventWriter
+
+**Repo**: `lmxopcua` — `src\Drivers\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\`
+
+**Depends on**: Architectural decision (the sidecar uses `aahClientManaged`
+x64, which is not bitness-constrained like the worker). C.1 can be unblocked
+independently of A.2 if the goal is to wire up the scripted-alarm historian
+path.
+
+**Current state**:
+
+`SdkAlarmHistorianWriteBackend` in `src\MxGateway.Worker\MxAccess\` is a
+placeholder returning `RetryPlease`. The lmxopcua sidecar's `WriteAlarmEvents`
+IPC slot is defined in `Ipc\Contracts.cs` but `Program.cs` constructs
+`HistorianFrameHandler` without an `alarmWriter` (line 57 per the alarms plan).
+The `IAlarmEventWriter` interface exists; only the production implementation
+and the consumer wiring are missing.
+
+**What it needs**:
+
+1. New `AahClientManagedAlarmEventWriter.cs` implementing `IAlarmEventWriter`
+   (defined in `Ipc\HistorianFrameHandler.cs`). Calls `aahClientManaged`'s
+   alarm-event write API — same path v1's `GalaxyHistorianWriter` used.
+   Uses `HistorianClusterEndpointPicker` for multi-node routing.
+   Maps `MxStatus` write outcomes to `HistorianWriteOutcome` enum
+   (Ack / PermanentFail / RetryPlease).
+
+2. `Program.cs` — build `AahClientManagedAlarmEventWriter` next to the
+   existing `BuildHistorian()` call; pass it to `HistorianFrameHandler`.
+   Gate behind `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED` env var (default `true`
+   when `OTOPCUA_HISTORIAN_ENABLED=true`).
+
+3. `Install-Services.ps1` — add the new env var to the install-time block.
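+
+The default rule in step 2 can be pinned down before touching `Program.cs` or
+`Install-Services.ps1`. The sketch below is one reading of it, not the
+sidecar's actual logic: `Resolve-AlarmWriteEnabled` is a hypothetical name, and
+treating an explicit `true` as ineffective while the Historian itself is
+disabled is an assumption the plan does not spell out.
+
+```powershell
+# Hypothetical helper mirroring the intended default for the alarm-write gate.
+function Resolve-AlarmWriteEnabled {
+    param(
+        [string]$HistorianEnabled,   # value of OTOPCUA_HISTORIAN_ENABLED
+        [string]$AlarmWriteEnabled   # value of OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED
+    )
+
+    # -eq on strings is case-insensitive in PowerShell, so 'True' also matches.
+    $historianOn = $HistorianEnabled -eq 'true'
+
+    # Assumption: alarm write-back is meaningless without the Historian.
+    if (-not $historianOn) { return $false }
+
+    # Unset or blank: default to true because the Historian is enabled.
+    if ([string]::IsNullOrWhiteSpace($AlarmWriteEnabled)) { return $true }
+
+    return ($AlarmWriteEnabled -eq 'true')
+}
+
+Resolve-AlarmWriteEnabled `
+    -HistorianEnabled  $env:OTOPCUA_HISTORIAN_ENABLED `
+    -AlarmWriteEnabled $env:OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED
+```
+
+Whatever rule is finally chosen, the `Program.cs` seam test listed under
+"Tests to write" should cover the same combinations.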
+
+**What blocks C.1**: only access to the `aahClientManaged` SDK on the dev box,
+which is confirmed available per `project_aveva_platform_installed.md` (the
+AVEVA Historian SDK is present). C.1 can proceed without A.2 since the
+sidecar's `aahClientManaged` is x64 and does not share the worker's x86
+bitness constraint.
+
+**Tests to write**:
+
+- Outcome-mapping table: every `MxStatus` on alarm-write → expected
+  `HistorianWriteOutcome`.
+- Batch test: 1 / 100 / 1000 events through a fake `aahClientManaged`
+  writer; assert per-row outcome list parallel to input order.
+- Cluster failover: primary Historian node returns `BadCommunicationError`;
+  picker rotates to secondary; eventual success.
+- `Program.cs` seam: assert handler constructed with alarm writer when env
+  var enabled; without it when disabled.
+- Live integration (parity rig): write a synthetic alarm event through the
+  IPC; query it back via `ReadEvents`; assert round-trip fidelity.
+
+---
+
+## Item D.1 — Smoke artifact
+
+**Repo**: `lmxopcua` (deployment refresh) + `mxaccessgw` (rig verification)
+
+**Depends on**: A.2, A.3, A.4, and C.1 all passing on the dev rig with a live
+Galaxy and live Historian.
+
+**Current state**: The deployment script `Refresh-Services.ps1` (task D.1) has
+shipped as PR #417 (merged 2026-04-30). What was NOT captured at that time was
+a smoke artifact — a log snippet or test output confirming that:
+
+1. An alarm transition event from a live Galaxy alarm reaches lmxopcua's
+   `AlarmConditionService` via the new `IAlarmSource` path (not the fallback).
+2. A scripted-alarm historian write-back reaches AVEVA Historian via the
+   sidecar `IAlarmEventWriter`.
+
+**What it needs**:
+
+Once A.2, A.3, A.4, and C.1 are wired on the parity rig:
+
+1. Deploy the updated mxaccessgw (with A.2 / A.3 / A.4 changes).
+2. Deploy the updated sidecar (with C.1 changes).
+3. Run `Refresh-Services.ps1` to confirm clean service restarts.
+4. Trigger a Galaxy alarm (e.g. set an AnalogLimitAlarm attribute out of
+   range in Galaxy IDE).
+5. Observe the lmxopcua OPC UA alarm surface via the Client CLI:
+
+   ```powershell
+   dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+       alarms -u opc.tcp://localhost:4840 --subscribe
+   ```
+
+   Pass: the alarm condition appears on the OPC UA A&E surface within
+   2 × publishing interval.
+
+6. Trigger a scripted alarm via the lmxopcua `ScriptedAlarmEngine`
+   (or an OPC UA method call if one is wired).
+7. Confirm in the AVEVA Historian that the scripted alarm event is stored
+   (query via the Historian client or HistorianWatch tool).
+
+8. Capture log snippets:
+   - mxaccessgw log: `[INF] AlarmTransition dispatched sessionId=<> alarmRef=<>`
+   - lmxopcua log: `[INF] AlarmConditionService: IAlarmSource event alarmRef=<> origin=Driver`
+   - Sidecar log: `[INF] AahClientManagedAlarmEventWriter: Wrote <n> alarm events`
+
+9. Commit the log snippets as `docs/plans/alarms-d1-smoke-artifact.md`
+   (a new doc, not this one).
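+
+The log capture in item 8 can be scripted so the three snippets are collected
+the same way on every run. This is a minimal sketch: the log file paths are
+placeholders (the plan does not say where each service logs), and the search
+strings are the fixed prefixes of the log lines quoted above.
+
+```powershell
+# Placeholder paths: replace with the real log locations on the parity rig.
+$checks = @(
+    @{ Name = 'mxaccessgw'; Log = 'C:\path\to\mxaccessgw.log'; Text = 'AlarmTransition dispatched' },
+    @{ Name = 'lmxopcua';   Log = 'C:\path\to\lmxopcua.log';   Text = 'AlarmConditionService: IAlarmSource event' },
+    @{ Name = 'sidecar';    Log = 'C:\path\to\sidecar.log';    Text = 'AahClientManagedAlarmEventWriter: Wrote' }
+)
+
+foreach ($check in $checks) {
+    # -SimpleMatch treats the text literally rather than as a regex.
+    $hits = @(Select-String -Path $check.Log -Pattern $check.Text -SimpleMatch)
+
+    if ($hits.Count -eq 0) {
+        Write-Output "MISSING  $($check.Name): no line containing '$($check.Text)'"
+    }
+    else {
+        Write-Output "FOUND    $($check.Name): $($hits.Count) line(s), first 3 shown"
+        $hits | Select-Object -First 3 | ForEach-Object { "    $($_.Line)" }
+    }
+}
+```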
+
+**What blocks D.1**: all of A.2, A.3, A.4, and C.1, plus the operator decision
+on the x64 alarm-helper architecture (or explicit acceptance of the
+sub-attribute fallback as production).
+
+---
+
+## Summary of blocks
+
+| Item | Blocked by | Estimated effort once unblocked |
+|------|-----------|--------------------------------|
+| A.2 | Architectural decision (x64 alarm-helper vs. sub-attribute fallback as production) | 2–3 days implementation; 1 day tests |
+| A.3 | A.2 delivering WorkerEvent bodies | 1–2 days |
+| A.4 | A.2 (active-alarm query needs AlarmClient session) | 1 day |
+| C.1 | aahClientManaged SDK access (available on dev box); NOT blocked by A.2 | 1–2 days |
+| D.1 | A.2 + A.3 + A.4 + C.1 all passing on parity rig | 0.5 day (smoke + artifact capture) |
+
+C.1 can proceed in parallel with A.2 / A.3 since the sidecar's `aahClientManaged`
+is x64 and does not share the worker bitness constraint.
+
+---
+
+## What this plan does NOT cover
+
+- The value-driven sub-attribute fallback path — already shipped and
+  functional (not being changed).
+- Track B (lmxopcua EventPump, GalaxyDriver IAlarmSource re-implementation)
+  and Track E (client SDK surface refresh) from the alarms-over-gateway plan —
+  those are in `lmxopcua` and depend on A.3 being live; they follow naturally
+  once A.3 ships.
+- Galaxy-native alarm historian path — System Platform's own `HistorizeToAveva`
+  toggle on the Galaxy template; not in scope.
+- Alarm ACL / role-grant surface — already shipped in Phase 6.2.
--- a/docs/plans/live-hardware-validation-runbooks.md
+++ b/docs/plans/live-hardware-validation-runbooks.md
@@ -0,0 +1,497 @@
+# Live-Hardware Driver Validation Runbooks
+
+> **Scope**: These runbooks cover the three driver validation tasks that
+> require physical hardware or a hardware-equivalent live environment and
+> cannot be satisfied by the Docker-based simulator fixtures or unit tests
+> alone.
+>
+> Driver implementation is complete. The runbooks document the preconditions,
+> step-by-step procedure, expected results, and how to record the outcome for
+> each driver that has an open live-hardware gap.
+
+---
+
+## 1. FANUC FOCAS — Live CNC Smoke (task #54)
+
+### Background
+
+The FOCAS driver (`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/`) uses the
+pure-managed `WireFocasClient` that speaks FOCAS2 over TCP directly (no
+`Fwlib64.dll`, no P/Invoke). The integration test suite at
+`tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.IntegrationTests/` runs against
+the `focas-mock` Python server (PDU-verified against `fwlibe64.dll` upstream)
+and covers all call-shapes the driver issues. What the mock cannot cover:
+
+- Series-specific firmware quirks (e.g. 0i-F vs 30i-B parameter range limits)
+- Real CNC Ethernet stack behaviour (TCP keep-alive, session-close edge cases)
+- Series gating: some driver nodes are conditionally emitted based on
+  `CncSeries` — only a physical CNC can confirm the suppression works
+
+### Preconditions
+
+| Item | Requirement |
+|------|-------------|
+| CNC hardware | FANUC CNC with Ethernet option enabled; TCP port 8193 reachable from the dev box or from the host running OtOpcUa |
+| CNC series | Any of: 0i-D, 0i-F, 0i-MF, 0i-TF, 16i, 30i-B, 31i, 32i, Power Motion i |
+| CNC state | Running state (not E-stop, not alarm) for live axis-data reads |
+| Network | TCP reachability from OtOpcUa server host to CNC port 8193 |
+| OtOpcUa | Server built and deployed (`dotnet publish` or running via `dotnet run`) |
+| Config | DriverInstance row for FOCAS in Config DB (`Type="FOCAS"`, `Backend="wire"`, `Devices[0].HostAddress="focas://<cnc-ip>:8193"`, `Devices[0].Series="<series>"`) |
+
+### Procedure
+
+**Step 1 — Verify TCP reachability**
+
+```powershell
+Test-NetConnection -ComputerName <cnc-ip> -Port 8193
+```
+
+Pass: `TcpTestSucceeded: True`.
+
+**Step 2 — Start OtOpcUa with FOCAS driver configured**
+
+Ensure the Config DB has the DriverInstance row. Start the server:
+
+```powershell
+sc start OtOpcUa
+# or for a dev run:
+dotnet run --project src/Server/ZB.MOM.WW.OtOpcUa.Server
+```
+
+Watch the Serilog log for:
+
+```
+[INF] FocasDriver initializing device focas://<cnc-ip>:8193 series=<series>
+[INF] FocasDriver device <cnc-ip>:8193 Connected
+```
+
+If `EW_SOCKET` (-16) appears, the TCP endpoint is unreachable or the CNC
+Ethernet option is not active.
+
+**Step 3 — Browse the address space**
+
+```powershell
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    browse -u opc.tcp://localhost:4840 -r -d 3
+```
+
+Expected: a node tree containing at minimum:
+
+```
+FOCAS/
+  <device>/
+    Identity/
+      SeriesNumber
+      Version
+      MaxAxes
+    Status/
+      RunState
+      Mode
+      EmergencyStop
+    Axes/
+      <X|Y|Z>/
+        AbsolutePosition
+        MachinePosition
+```
+
+Nodes suppressed by the `Series` capability gate will be absent — this is
+correct behaviour.
+
+**Step 4 — Read identity nodes**
+
+```powershell
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS/<device>/Identity/SeriesNumber"
+
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS/<device>/Identity/MaxAxes"
+```
+
+Pass: `Good` quality; `SeriesNumber` matches the string printed on the CNC
+control panel (e.g. `"0i-F"`); `MaxAxes` is a non-zero integer.
+
+**Step 5 — Read live status and axis data**
+
+```powershell
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS/<device>/Status/RunState"
+
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS/<device>/Axes/X/AbsolutePosition"
+```
+
+Pass: both return `Good` quality. `AbsolutePosition` is a `Double` (e.g.
+`-12.3456` mm). Manually compare against the machine's position display.
+
+**Step 6 — Subscribe and observe polling**
+
+```powershell
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    subscribe -u opc.tcp://localhost:4840 `
+    -n "ns=2;s=FOCAS/<device>/Status/RunState" -i 500
+```
+
+Let it run for 30 s while jogging an axis or changing mode on the CNC operator
+panel. Pass: at least one data-change event received within 5 s; events
+continue arriving every ~500 ms.
+
+**Step 7 — 2-minute soak**
+
+Let the server run for 2 minutes with the subscription active. Pass: no
+`EW_SOCKET`, `EW_HANDLE`, `EW_BUSY` errors in the Serilog output; subscribed
+node continues delivering updates.
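+
+If the Serilog output is also written to a file, the soak check can be done
+with a search instead of by eye. A minimal sketch, assuming a file sink at a
+placeholder path (the runbook does not name one):
+
+```powershell
+# Placeholder path: point this at the server's Serilog file sink, if one is configured.
+$log = 'C:\path\to\otopcua.log'
+
+# Match any of the three FOCAS error names called out in the pass criterion.
+$errors = @(Select-String -Path $log -Pattern 'EW_(SOCKET|HANDLE|BUSY)')
+
+if ($errors.Count -eq 0) {
+    'PASS: no EW_SOCKET / EW_HANDLE / EW_BUSY lines found'
+}
+else {
+    "FAIL: $($errors.Count) FOCAS error line(s)"
+    $errors | ForEach-Object { "  line $($_.LineNumber): $($_.Line)" }
+}
+```
+
+The search covers the whole file, so start the soak with a fresh log or
+discard matches that predate the soak window.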
+
+**Step 8 — Run the FOCAS e2e script**
+
+```powershell
+pwsh scripts/e2e/test-focas.ps1 -ServerUrl opc.tcp://localhost:4840 `
+    -DriverInstance "<device>" -Series "<series>"
+```
+
+Pass: script exits 0.
+
+### Expected results
+
+| Check | Expected |
+|-------|----------|
+| TCP connect to CNC port 8193 | Success |
+| FOCAS session open (`cnc_allclibhndl3`) | EW_OK (0) in driver log |
+| `Identity/SeriesNumber` | Matches CNC panel, `Good` quality |
+| `Identity/MaxAxes` | Non-zero integer, `Good` quality |
+| `Status/RunState` | Integer 0–3, `Good` quality |
+| `Axes/X/AbsolutePosition` | Double, `Good` quality, matches display |
+| Subscribe: events delivered | >= 1 event within 5 s; updates continue at ~500 ms during the 30 s run |
+| 2-minute soak: no FOCAS errors | Clean Serilog log |
+
+### Recording the outcome
+
+```
+FOCAS live-CNC smoke — task #54
+Date: YYYY-MM-DD
+CNC: <manufacturer> <model> series=<series> firmware=<version>
+IP: <cnc-ip>:8193
+OtOpcUa SHA: <git sha>
+
+TCP connect: PASS
+Session open: PASS
+Identity reads: PASS  SeriesNumber="<>" MaxAxes=<n>
+Status read:  PASS  RunState=<n>
+Axis read:    PASS  X/AbsolutePosition=<value>
+Subscribe:    PASS  <n> events in 30s
+2-min soak:   PASS  no errors
+e2e script:   PASS
+```
+
+---
+
+## 2. Allen-Bradley CIP — Live Boot (ControlLogix)
+
+### Background
+
+The AB CIP driver (`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip/`) uses
+`libplctag` 1.6.x. The Docker `ab_server` simulator covers connectivity and
+atomic type reads (7 integration tests). Live-boot validation is needed to
+confirm UDT shape-reading, array tag access, and the CIP packing behaviour on
+a real ControlLogix backplane — all gaps acknowledged in
+`docs/drivers/AbServer-Test-Fixture.md`.
+
+AB CIP live-boot was first verified against a ControlLogix rig at PR #222.
+Re-run it before each release.
+
+### Preconditions
+
+| Item | Requirement |
+|------|-------------|
+| PLC hardware | ControlLogix (preferred) or CompactLogix; firmware 20+ for request packing |
+| Network | TCP port 44818 reachable from OtOpcUa server host |
+| PLC state | Running; at least one DINT / REAL / BOOL / STRING controller-scoped tag defined |
+| OtOpcUa | Server built and deployed |
+| Config | DriverInstance row: `Type="AbCip"`, `Host="<plc-ip>"`, `Path="1,0"`, `PlcType="ControlLogix"` |
+
+### Procedure
+
+**Step 1 — Verify TCP reachability**
+
+```powershell
+Test-NetConnection -ComputerName <plc-ip> -Port 44818
+```
+
+Pass: `TcpTestSucceeded: True`.
+
+**Step 2 — Start OtOpcUa and watch driver log**
+
+```powershell
+sc start OtOpcUa
+```
+
+Look for:
+
+```
+[INF] AbCipDriver device <plc-ip> Connected  path=1,0  plcType=ControlLogix
+```
+
+**Step 3 — Browse the address space**
+
+```powershell
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    browse -u opc.tcp://localhost:4840 -r -d 3
+```
+
+Pass: node tree shows the tags defined in the ControlLogix project (controller-
+and program-scoped). UDT members appear as child nodes.
+
+**Step 4 — Read atomic tags**
+
+```powershell
+# Read a DINT tag
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    read -u opc.tcp://localhost:4840 -n "ns=2;s=AbCip/<device>/<TagName>"
+```
+
+Pass: `Good` quality; value type matches the PLC tag type.
+
+**Step 5 — Read a UDT member**
+
+```powershell
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    read -u opc.tcp://localhost:4840 -n "ns=2;s=AbCip/<device>/<UDT>/<MemberName>"
+```
+
+Pass: `Good` quality; value matches the live PLC value.
+
+**Step 6 — Write a DINT tag (if in ReadWrite mode)**
+
+```powershell
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    write -u opc.tcp://localhost:4840 `
+    -n "ns=2;s=AbCip/<device>/<TagName>" -v 42 -t Int32
+```
+
+Verify the new value via a subsequent read or on the PLC HMI.
+
+Pass: read back returns 42 with `Good` quality.
+
+**Step 7 — Subscribe to a tag that changes**
+
+```powershell
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    subscribe -u opc.tcp://localhost:4840 `
+    -n "ns=2;s=AbCip/<device>/<ChangingTag>" -i 500
+```
+
+Trigger a value change on the PLC. Pass: events received within 2 s.
+
+**Step 8 — Point the integration tests at the live PLC (instead of the Docker sim) and confirm parity**
+
+```powershell
+$env:AB_SERVER_ENDPOINT = "<plc-ip>:44818"
+dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests `
+    --filter "AbServerFact"
+```
+
+Pass: all 7 integration tests pass against the live PLC.
+
+### Expected results
+
+| Check | Expected |
+|-------|----------|
+| TCP connect | Success |
+| Driver log `Connected` | Present, no error |
+| Browse | Node tree mirrors PLC tag list |
+| Atomic read | `Good` quality, correct type |
+| UDT member read | `Good` quality, correct value |
+| Write round-trip | Written value reads back |
+| Subscribe | Events delivered on value change |
+| Integration tests with live PLC | 7/7 pass |
+
+### Recording the outcome
+
+```
+AB CIP live-boot
+Date: YYYY-MM-DD
+PLC: Allen-Bradley <model> firmware=<version>
+IP: <plc-ip>:44818  path=1,0
+OtOpcUa SHA: <git sha>
+
+TCP connect: PASS
+Driver connected: PASS
+Browse: PASS  <n> tags visible
+Atomic read: PASS
+UDT read: PASS
+Write round-trip: PASS
+Subscribe: PASS
+Integration tests: 7/7 PASS
+```
+
+---
+
+## 3. Beckhoff TwinCAT — Wire-Live Validation
+
+### Background
+
+The TwinCAT driver (`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT/`) uses the
+Beckhoff `TwinCAT.Ads` .NET SDK v6. The integration test suite at
+`tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests/`
+(`TwinCAT3SmokeTests.cs`) covers 14 `[TwinCATFact]` methods + one 16-case
+`[TwinCATTheory]` (30 cases total) against a live ADS runtime. The TCBSD ESXi
+VM at `10.100.0.128` (AmsNetId `41.169.163.43.1.1`) is the primary fixture
+runtime (project memory `project_tcbsd_fixture.md`) and bypasses the
+TwinCAT/Hyper-V conflict on the dev box.
+
+Live-hardware validation extends beyond the TCBSD VM to confirm the driver
+works against a production PLC (not just the ESXi test VM) and that the three
+defects found during original integration testing do not regress on newer
+firmware:
+
+1. Notification cycle time unit (250 ms was being set to ~41 min — fixed).
+2. `STRING(N)` / `WSTRING(N)` type mapper (fixed).
+3. Bit-indexed BOOL path (fixed).
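+
+The "~41 min" figure in defect 1 is what a unit mix-up between milliseconds
+and 100 ns ticks would produce. The arithmetic below is a hedged
+reconstruction, not taken from the fix itself: it assumes the requested 250 ms
+was converted to 100 ns units and the result was then interpreted as
+milliseconds.
+
+```powershell
+$requestedMs = 250
+$ticksPerMs  = 10000                          # 1 ms = 10,000 units of 100 ns
+
+$asTicks     = $requestedMs * $ticksPerMs     # 2,500,000
+$misreadMin  = $asTicks / 1000 / 60           # 2,500,000 ms = 2,500 s, about 41.7 min
+
+[math]::Round($misreadMin, 1)
+```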
+
+### Preconditions
+
+**TCBSD ESXi fixture (primary — no physical hardware needed)**
+
+| Item | Requirement |
+|------|-------------|
+| TCBSD VM | Running on ESXi at `10.100.0.128` |
+| AMS Net ID | `41.169.163.43.1.1` |
+| ADS port | `851` (TwinCAT 3 PLC runtime 1) |
+| PLC project | TwinCAT project from `tests/.../TwinCatProject/` loaded and in Run state |
+| Network | TCP port 48898 reachable from dev box to `10.100.0.128` |
+
+**Production PLC (for true wire-live validation)**
+
+| Item | Requirement |
+|------|-------------|
+| TwinCAT hardware | Beckhoff IPC or CX series, TwinCAT 3 (TC3); TC2 is a known gap per fixture doc |
+| AMS route | Route configured on TwinCAT device back to the OtOpcUa host |
+| PLC state | Run state |
+| GVL | At least a `GVL_Fixture.nCounter` DINT and `GVL_Fixture.rSetpoint` REAL present |
+
+### Procedure — TCBSD ESXi fixture
+
+**Step 1 — Verify TCBSD VM is reachable**
+
+```powershell
+Test-NetConnection -ComputerName 10.100.0.128 -Port 48898
+```
+
+Pass: `TcpTestSucceeded: True`.
+
+**Step 2 — Run the integration test suite**
+
+```powershell
+$env:TWINCAT_TARGET_HOST  = "10.100.0.128"
+$env:TWINCAT_TARGET_NETID = "41.169.163.43.1.1"
+
+dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests `
+    --logger "console;verbosity=normal"
+```
+
+Pass: all 30 test cases pass (14 `[TwinCATFact]` + 16-case `[TwinCATTheory]`).
+No `[TwinCATFact]` / `[TwinCATTheory]` skips are expected: the env vars are
+set, so the runtime probe should succeed.
+
+Key tests to watch:
+
+| Test | Validates |
+|------|-----------|
+| `Driver_subscribe_receives_native_ADS_notifications_on_counter_changes` | Native ADS notification path (the cycle-time-unit bug regression) |
+| `Driver_reads_every_primitive_type_with_correct_mapping` | 16-type theory incl. `STRING(N)` |
+| `Driver_reads_bit_indexed_BOOL_from_word` | Bit-indexed BOOL fix regression |
+| `Driver_auto_reconnects_after_underlying_client_is_disposed` | Reconnect on ADS client dispose |
+| `Driver_routes_reads_per_device_and_isolates_unreachable_peers` | Multi-device isolation |
+
+**Step 3 — OtOpcUa server browse/read via Client CLI**
+
+Start OtOpcUa with a TwinCAT DriverInstance pointing at the TCBSD VM:
+
+```powershell
+# appsettings.json DriverInstance: Type=TwinCAT, AmsNetId=41.169.163.43.1.1, AmsPort=851
+sc start OtOpcUa
+# or dev run
+dotnet run --project src/Server/ZB.MOM.WW.OtOpcUa.Server
+```
+
+```powershell
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    browse -u opc.tcp://localhost:4840 -r -d 4
+
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    read -u opc.tcp://localhost:4840 -n "ns=2;s=TwinCAT/<device>/GVL_Fixture/nCounter"
+```
+
+Pass: browse shows the PLC symbol tree; read returns `Good` quality with an
+integer value.
+
+### Procedure — Production PLC (optional, for full wire-live signoff)
+
+If a Beckhoff production IPC is available in the lab:
+
+**Step 1** — Configure the AMS route on the TwinCAT device (TwinCAT System
+Manager → Routes → Add static route from the TwinCAT device back to the
+OtOpcUa server machine).
+
+**Step 2** — Set env vars and run the integration suite against the production
+target:
+
+```powershell
+$env:TWINCAT_TARGET_HOST  = "<production-plc-ip>"
+$env:TWINCAT_TARGET_NETID = "<production-ams-net-id>"
+$env:TWINCAT_TARGET_PORT  = "851"
+
+dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests
+```
+
+**Step 3** — Subscribe to a counter tag for 30 s to confirm native
+notifications arrive:
+
+```powershell
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+    subscribe -u opc.tcp://localhost:4840 `
+    -n "ns=2;s=TwinCAT/<device>/GVL_Fixture/nCounter" -i 100
+```
+
+Pass: events arrive every ~100 ms driven by the PLC's ADS notification, not
+by polling.
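+
+The ~100 ms cadence can be sanity-checked from the event count recorded over
+the 30 s window: 30 s / 100 ms gives roughly 300 events. The sketch below is a
+hypothetical helper, not part of the Client CLI. The `Tolerance` default is an
+assumed slack value, and the check only holds if the counter changes at least
+once per interval.
+
+```powershell
+# Hypothetical helper: compares an observed event count with the count implied
+# by the sampling interval. Tolerance is an assumed slack, not a project threshold.
+function Test-NotificationRate {
+    param(
+        [int]$EventCount,
+        [double]$DurationSeconds = 30,
+        [double]$IntervalMs = 100,
+        [double]$Tolerance = 0.2
+    )
+    $expected = $DurationSeconds * 1000 / $IntervalMs
+    $ratio = $EventCount / $expected
+    [pscustomobject]@{
+        Expected = $expected
+        Ratio    = $ratio
+        Ok       = ([math]::Abs($ratio - 1) -le $Tolerance)
+    }
+}
+
+# Example: 290 events counted by hand from the subscribe output.
+$check = Test-NotificationRate -EventCount 290
+```
+
+A count far below the expected value may point at polling or coalescing, but
+it is only a hint; the integration test listed in Step 2 remains the
+authoritative regression check.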
+
+### Expected results
+
+| Check | TCBSD VM | Production PLC |
+|-------|----------|----------------|
+| ADS port 48898 reachable | Required | Required |
+| Integration tests: all 30 pass | Required | Optional (same 30) |
+| Notification cycle-time test passes | Required | Required |
+| Server browse shows symbol tree | Required | Optional |
+| Read `Good` quality | Required | Optional |
+| Native ADS notifications deliver in subscribe | Required | Recommended |
+
+### Known gaps (documented — not blockers for v2 GA)
+
+Per `docs/drivers/TwinCAT-Test-Fixture.md` §"What it does NOT cover":
+
+- Multi-hop AMS routing — single-hop only.
+- TC2 (ADS v1) compatibility — TC3 only.
+- Notification coalescing under sustained CPU load.
+- `Symbol version changed (0x0711)` storm handling under rapid PLC re-downloads.
+
+These are deferred to v3 per `docs/v3/twincat-backlog.md`.
+
+### Recording the outcome
+
+```
+TwinCAT wire-live validation
+Date: YYYY-MM-DD
+Target: TCBSD VM 10.100.0.128 AmsNetId=41.169.163.43.1.1  (and/or production PLC details)
+TwinCAT version: <version>
+OtOpcUa SHA: <git sha>
+
+ADS port reachable: PASS
+Integration tests: 30/30 PASS
+  notification-cycle-time test: PASS  (regression check)
+  STRING(N) type test: PASS  (regression check)
+  bit-indexed BOOL test: PASS  (regression check)
+Server browse: PASS
+Read Good quality: PASS
+Native subscription delivery: PASS  <n> events in 30s
+```
--- a/docs/plans/phase-6-3-redundancy-interop-plan.md
+++ b/docs/plans/phase-6-3-redundancy-interop-plan.md
@@ -0,0 +1,278 @@
+# Phase 6.3 Redundancy — Client Interop Matrix and Cutover Validation Plan
+
+> **Scope**: Phase 6.3 redundancy runtime core shipped (PRs #89-90, #98-99,
+> #24-peerprobe, Stream C node wiring, Stream D lease wrap). What remains is
+> Stream F (task #150): validating that third-party OPC UA clients honour
+> our `ServiceLevel` / `ServerUriArray` / `RedundancySupport` signals and
+> fail over correctly when the Primary drops. This document defines what is
+> automatable as integration tests, what requires two live instances plus a
+> real client, and a step-by-step cutover-validation runbook.
+>
+> **Source of truth**: `docs/Redundancy.md`, `docs/v2/redundancy-interop-playbook.md`,
+> `docs/v2/implementation/phase-6-3-redundancy-runtime.md`,
+> `scripts/compliance/phase-6-3-compliance.ps1`.
+
+## What is already tested (no live cluster needed)
+
+The following are covered by existing automated tests that run in ordinary
+`dotnet test`:
+
+| Area | Test class(es) | What it asserts |
+|---|---|---|
+| `ServiceLevelCalculator` — 8-state matrix | `ServiceLevelCalculatorTests` | All 10 band values; role × self-health × peer-http × peer-ua × apply × recovery × topology combinations |
+| `RecoveryStateManager` — dwell + witness | `RecoveryStateManagerTests` | 60 s dwell default; premature-exit rejection; witness-required gate |
+| `ApplyLeaseRegistry` — lease lifecycle | `ApplyLeaseRegistryTests` | Disposal on success / exception / cancellation; watchdog force-close at 10 min |
+| `ServerRedundancyNodeWriter` — OPC UA variable binding | `ServerRedundancyNodeWriterTests` | `ServiceLevel` byte push; `RedundancySupport` enum; `ServerUriArray` skip-log when node absent |
+| `RedundancyStatePublisher` — orchestration | `RedundancyStatePublisherTests` | Edge-triggered `OnStateChanged`; idempotent dedup |
+| `ClusterTopologyLoader` | `ClusterTopologyLoaderTests` | Two-node seed; one-node degenerate; duplicate-URI rejection |
+| `DraftValidator.ValidateClusterTopology` | `DraftValidatorTests` (8 cases) | NodeCount/mode pairs; Enabled-count vs NodeCount; multiple-Primary rejection |
+
+Run with:
+
+```powershell
+dotnet test ZB.MOM.WW.OtOpcUa.slnx --filter "FullyQualifiedName~Redundancy"
+```
+
+Compliance gate (every Phase 6.3 static check):
+
+```powershell
+pwsh ./scripts/compliance/phase-6-3-compliance.ps1
+```
+
+Pass criteria: exit 0; all `[PASS]` lines green; `[DEFERRED]` lines are
+known-deferred surfaces, not failures.
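+
+The marker rule above can be checked mechanically. A minimal sketch, assuming
+each check prints one line that begins with `[PASS]`, `[FAIL]`, or
+`[DEFERRED]`; the exact line layout, the sample lines, and the
+`Get-ComplianceSummary` helper are hypothetical and not part of the
+compliance scripts.
+
+```powershell
+# Hypothetical helper: tallies marker lines from captured compliance output.
+# Assumes each check line begins with [PASS], [FAIL], or [DEFERRED].
+function Get-ComplianceSummary {
+    param([string[]]$Lines)
+
+    $counts = @{ PASS = 0; FAIL = 0; DEFERRED = 0 }
+    foreach ($line in $Lines) {
+        if ($line -match '^\s*\[(PASS|FAIL|DEFERRED)\]') {
+            $counts[$Matches[1]]++
+        }
+    }
+    [pscustomobject]@{
+        Pass     = $counts.PASS
+        Fail     = $counts.FAIL
+        Deferred = $counts.DEFERRED
+        # DEFERRED lines are known-deferred surfaces, so only FAIL blocks the gate.
+        GateOk   = ($counts.FAIL -eq 0)
+    }
+}
+
+# Made-up sample lines for illustration only.
+$sample = @(
+    '[PASS] ServiceLevelCalculator present',
+    '[DEFERRED] ServerUriArray upgrade',
+    '[PASS] RecoveryStateManager present'
+)
+$summary = Get-ComplianceSummary -Lines $sample
+```
+
+The script's exit code is still the primary signal; a tally like this is only
+useful for attaching a short summary to the run record.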
+
+## What cannot be automated — requires two live instances
+
+The scenarios below require two running `OtOpcUa.Server` processes in the
+same `ServerCluster`, a real SQL Server Config DB, and at least one driver
+instance with a reachable endpoint (simulator or real PLC).
+
+### Why it cannot be unit/integration-tested in-process
+
+- UaExpert, Kepware KEPServerEX, and AVEVA OI Gateway are closed-source
+  Windows GUI binaries with no headless CLI for the subscribe/browse flows.
+- The AVEVA MXAccess failover leg (`IAlarmSource` reconnect, `$MxAccessClient`
+  quality transition) involves the Galaxy runtime's own client-redundancy
+  policy and the COM-layer session model — both live outside this repo.
+- Even the automatable sub-set (our own `otopcua-cli` as the client) needs
+  two distinct listening TCP endpoints; that requires two live processes,
+  which is out of scope for `dotnet test` integration fixtures.
+
+## Test matrix
+
+### Prerequisites
+
+1. Two `OtOpcUa.Server` processes on separate Windows hosts (or separate
+   ports on the same host for dev) sharing one Config DB (`ServerCluster`
+   with `NodeCount=2`, `RedundancyMode=Warm` or `Hot`).
+2. Each node registered in `ClusterNode`:
+   - Node A: `RedundancyRole=Primary`, `ServiceLevelBase=255`,
+     `ApplicationUri=urn:node-a:OtOpcUa`
+   - Node B: `RedundancyRole=Secondary`, `ServiceLevelBase=100`,
+     `ApplicationUri=urn:node-b:OtOpcUa`
+3. `PeerHttpProbeLoop` and `PeerUaProbeLoop` HostedServices running on both
+   nodes (registered via `AddHostedService<PeerHttpProbeLoop>` +
+   `AddHostedService<PeerUaProbeLoop>` in `Program.cs`).
+4. At least one `DriverInstance` in the cluster with a reachable PLC or
+   simulator (e.g. Modbus sim at `10.100.0.35:5020`).
+5. Client machine with UaExpert >= 1.7 installed.
+6. Optional second client: Kepware KEPServerEX 6.x QuickClient or AVEVA
+   OI Gateway 2020R2+.
+
+### Block A — OPC UA protocol signals (UaExpert, no failover yet)
+
+| ID | Scenario | Procedure | Pass criterion | Automatable? |
+|----|----------|-----------|----------------|--------------|
+| A1 | ServiceLevel published on Primary | Connect UaExpert to Node A. Browse `Server/ServiceLevel`. | Value = 255 (`AuthoritativePrimary`) | No (requires UaExpert GUI) |
+| A2 | ServiceLevel published on Backup | Connect UaExpert to Node B. Read same node. | Value = 100 (`AuthoritativeBackup`) | No |
+| A3 | ServiceLevel updates when peer drops | Node A connected. Stop Node B (`sc stop OtOpcUa`). Watch `ServiceLevel` on Node A. | Transitions 255 → 230 (`IsolatedPrimary`) within ~6 s (3 × 2 s HTTP probe interval) | No |
+| A4 | RedundancySupport | Browse `Server/ServerRedundancy/RedundancySupport` on either node. | Value = `Warm` or `Hot` matching the cluster `RedundancyMode` | No |
+| A5 | ServerUriArray | Browse `Server/ServerRedundancy/ServerUriArray` on either node. | Array contains both `ApplicationUri` values; self listed first. Note: requires non-transparent redundancy-type upgrade (currently logs-and-skips — see known limitation A5 below). | No |
+| A6 | Mid-apply ServiceLevel dip | Trigger a `sp_PublishGeneration` apply (via Admin UI draft → publish) while watching Node A `ServiceLevel`. | Drops to 200 (`PrimaryMidApply`) for the apply duration; returns to 255 after `RefreshAsync`. | No |
+| A7 | Client.CLI reads correct ServiceLevel | `dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://<node-a>:4840 -n "i=2267"` | Prints current byte value matching expected band. | **Yes** — scriptable with the Client CLI |
+| A8 | otopcua-cli failover reconnect | `dotnet run ... -- connect -u opc.tcp://<node-a>:4840 -F opc.tcp://<node-b>:4840` — then kill Node A. | CLI session reconnects to Node B within the session keep-alive timeout. | **Yes** — scriptable with the Client CLI |
+
+### Block B — Third-party client failover
+
+| ID | Scenario | Procedure | Pass criterion |
+|----|----------|-----------|----------------|
+| B1 | UaExpert picks Primary by ServiceLevel | Configure a Redundancy Group in UaExpert with both endpoint URLs. | Client connects to Node A (higher ServiceLevel) |
+| B2 | UaExpert cuts over on Primary kill | Kill Node A `OtOpcUa` service. | Client session reconnects to Node B within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
+| B3 | UaExpert returns when Primary restores | Start Node A. Wait >= 60 s recovery dwell. | `ServiceLevel` on Node A progresses: 180 (`RecoveringPrimary`) → 255 (`AuthoritativePrimary`). UaExpert may or may not switch back (client-policy-dependent; both outcomes accepted). |
+| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware configured for the same two endpoints. | Same pass criteria; confirms the failover behaviour is not specific to UaExpert. |
+| B5 | AVEVA OI Gateway | Configure OI Gateway OPC DA/UA client object against the cluster. Kill Primary. | OI Gateway data quality recovers within `ReconnectInterval` (default 20 s); no permanent data-loss alert. |
+
+### Block C — Galaxy MXAccess failover
+
+This block requires a running Galaxy and `$MxAccessClient` object (AVEVA
+System Platform installed, Galaxy deployed on dev box — see project memory
+`project_aveva_platform_installed.md`).
+
+| ID | Scenario | Procedure | Pass criterion |
+|----|----------|-----------|----------------|
+| C1 | Galaxy binds to Primary on first connect | Bring cluster up. Start a Galaxy `$MxAccessClient` with both node URLs configured. | Galaxy reports `QUALITY = Good`; initial values stream from Node A. |
+| C2 | Galaxy redirects on Primary drop | Stop Node A. | Galaxy `QUALITY` briefly goes `Uncertain`, then returns to `Good`; values continue streaming from Node B within MXAccess's `ReconnectInterval` (default 20 s). |
+| C3 | Galaxy tolerates mid-apply dip | Trigger generation apply on Node A. | Galaxy remains bound — mid-apply dip (200) is advisory, not a session drop. No quality interruption. |
+
+Note: A negative result on C1–C3 does not necessarily indicate an OtOpcUa
+defect. Cross-check with Block A / B first to confirm our `ServiceLevel`
+signal is correct before debugging the MXAccess client layer.
+
+## Step-by-step cutover-validation runbook
+
+This is the minimum procedure to satisfy the v2 GA exit criterion:
+"Non-transparent redundancy cutover validated with at least one production
+client (Ignition 8.3 recommended — see decision #85)."
+
+### Step 1 — Provision the cluster
+
+```powershell
+# On the Config DB host, seed or verify cluster rows:
+# ServerCluster: Id=<id>, Name="test-cluster", NodeCount=2, RedundancyMode=Warm
+# ClusterNode A: NodeId="node-a", ClusterId=<id>, RedundancyRole=Primary,
+#   ServiceLevelBase=255, ApplicationUri="urn:node-a:OtOpcUa"
+# ClusterNode B: NodeId="node-b", ClusterId=<id>, RedundancyRole=Secondary,
+#   ServiceLevelBase=100, ApplicationUri="urn:node-b:OtOpcUa"
+```
+
+Verify uniqueness constraint: no two `ClusterNode` rows share the same
+`ApplicationUri` (unique index on `ApplicationUri`).
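+
+Before seeding, the planned rows can be checked for a clash so the unique
+index does not reject the insert halfway through. A minimal sketch, assuming
+the rows are held as in-memory objects; `Get-DuplicateApplicationUri` is a
+hypothetical helper and does not query the Config DB.
+
+```powershell
+# Hypothetical pre-flight check: returns ApplicationUri values used by more than one node.
+function Get-DuplicateApplicationUri {
+    param([object[]]$Nodes)
+
+    $Nodes |
+        Group-Object -Property ApplicationUri |
+        Where-Object { $_.Count -gt 1 } |
+        ForEach-Object { $_.Name }
+}
+
+# Rows mirror the ClusterNode seed values from Step 1.
+$nodes = @(
+    [pscustomobject]@{ NodeId = 'node-a'; ApplicationUri = 'urn:node-a:OtOpcUa' },
+    [pscustomobject]@{ NodeId = 'node-b'; ApplicationUri = 'urn:node-b:OtOpcUa' }
+)
+$duplicates = @(Get-DuplicateApplicationUri -Nodes $nodes)
+```
+
+An empty `$duplicates` array means the seed rows are safe to insert.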
+
+### Step 2 — Start both server instances
+
+On Node A host:
+
+```powershell
+# appsettings.json: Node:NodeId = "node-a"
+sc start OtOpcUa
+```
+
+On Node B host:
+
+```powershell
+# appsettings.json: Node:NodeId = "node-b"
+sc start OtOpcUa
+```
+
+Wait 10 s for the HostedServices to complete their first probe cycle.
+
+### Step 3 — Verify baseline ServiceLevel via Client CLI
+
+```powershell
+# Node A should report 255
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
+    -u opc.tcp://<node-a-host>:4840 -n "i=2267"
+
+# Node B should report 100
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
+    -u opc.tcp://<node-b-host>:4840 -n "i=2267"
+```
+
+Pass: Node A = 255, Node B = 100.
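+
+When reading raw bytes from the CLI it helps to translate them back to band
+names. The lookup below is a sketch that contains only the bands named in this
+plan; `ServiceLevelCalculator` may define others, and `Get-ServiceLevelBand`
+is a hypothetical helper.
+
+```powershell
+# Band values and names are the ones quoted in this plan; the helper is hypothetical.
+$ServiceLevelBands = @{
+    255 = 'AuthoritativePrimary'
+    230 = 'IsolatedPrimary'
+    200 = 'PrimaryMidApply'
+    180 = 'RecoveringPrimary'
+    100 = 'AuthoritativeBackup'
+    80  = 'IsolatedBackup'
+    1   = 'NoData'
+}
+
+function Get-ServiceLevelBand {
+    param([byte]$Value)
+    # Hashtable keys above are [int], so cast before the lookup.
+    $key = [int]$Value
+    if ($ServiceLevelBands.ContainsKey($key)) { return $ServiceLevelBands[$key] }
+    return "Unknown($Value)"
+}
+
+$nodeABand = Get-ServiceLevelBand 255
+$nodeBBand = Get-ServiceLevelBand 100
+```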
+
+### Step 4 — Verify ServerUriArray
+
+```powershell
+# i=11314 is the standard NodeId for Server/ServerRedundancy/ServerUriArray
+dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
+    -u opc.tcp://<node-a-host>:4840 -n "i=11314"
+```
+
+Pass: array returned contains both `ApplicationUri` strings. If
+`ServerUriArray` node returns empty or an error, the non-transparent
+redundancy-type upgrade follow-up is still pending (known limitation —
+`ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips on the
+base `ServerRedundancyState` object type).
+
+### Step 5 — Execute Primary kill + failover (B2 scenario)
+
+1. Connect UaExpert (or Kepware) Redundancy Group to both endpoints.
+2. Confirm client is subscribed to at least one variable node.
+3. Kill Node A: `sc stop OtOpcUa` on Node A host.
+4. Observe:
+   - Node B `ServiceLevel` should transition: 100 (`AuthoritativeBackup`)
+     → 80 (`IsolatedBackup`) within ~6 s.
+   - Client should reconnect to Node B and resume data-change events.
+5. Record: time from kill to client reconnect; whether data gaps occurred.
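+
+The timing in item 5 can be captured with a small polling helper rather than a
+stopwatch. A minimal sketch: `Measure-TimeToCondition` is hypothetical, and
+the probe shown here is a stand-in. A real probe would read `ServiceLevel`
+from Node B and compare it with the expected band, which depends on the CLI's
+output format (not assumed here).
+
+```powershell
+# Hypothetical helper: polls a probe until it returns $true or the timeout expires,
+# and reports how long that took.
+function Measure-TimeToCondition {
+    param(
+        [scriptblock]$Probe,
+        [int]$TimeoutSeconds = 60,
+        [int]$PollMilliseconds = 500
+    )
+
+    $watch = [System.Diagnostics.Stopwatch]::StartNew()
+    while ($watch.Elapsed.TotalSeconds -lt $TimeoutSeconds) {
+        if (& $Probe) {
+            return [pscustomobject]@{ Met = $true; Seconds = $watch.Elapsed.TotalSeconds }
+        }
+        Start-Sleep -Milliseconds $PollMilliseconds
+    }
+    [pscustomobject]@{ Met = $false; Seconds = $watch.Elapsed.TotalSeconds }
+}
+
+# Stand-in probe that succeeds on the third poll.
+$script:polls = 0
+$result = Measure-TimeToCondition -Probe { $script:polls++; $script:polls -ge 3 } -PollMilliseconds 50
+```
+
+Start the helper immediately before stopping Node A so that `Seconds`
+approximates the kill-to-transition time, limited by the poll interval.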
+
+### Step 6 — Verify Primary recovery (B3 scenario)
+
+1. Restart Node A: `sc start OtOpcUa` on Node A host.
+2. Observe Node A `ServiceLevel` progression:
+   - ~0 s: 1 (`NoData`) briefly while HostedServices start.
+   - Startup: 180 (`RecoveringPrimary`) — recovery dwell gate active.
+   - After >= 60 s dwell + one positive publish witness: 255 (`AuthoritativePrimary`).
+3. Observe Node B:
+   - Returns to 100 (`AuthoritativeBackup`) once it sees Node A peer probe succeed.
+4. Record dwell duration and whether the client (UaExpert/Kepware) switches back.
+
+### Step 7 — Execute mid-apply dip (A6 scenario)
+
+1. Via Admin UI, create a trivial draft change and publish.
+2. Watch Node A `ServiceLevel` during apply.
+3. Expected: drops to 200 (`PrimaryMidApply`) for the apply duration
+   (typically seconds); returns to 255 when `GenerationRefreshHostedService`
+   releases the lease.
+
+### Step 8 — Record results
+
+Copy the following block into a tracking doc:
+
+```
+Run date: YYYY-MM-DD
+Release SHA: <git sha>
+Cluster: <cluster-id>  Primary: node-a  Backup: node-b
+Config DB: 10.100.0.35,14330
+
+A1: [PASS/FAIL]  evidence: <screenshot or CLI output>
+A2: [PASS/FAIL]
+A3: [PASS/FAIL]  time-to-IsolatedPrimary: <N>s
+A4: [PASS/FAIL]
+A5: [PASS/FAIL/DEFERRED - ServerUriArray upgrade pending]
+A6: [PASS/FAIL]  mid-apply duration: <N>s
+A7: [PASS/FAIL]  CLI output attached
+A8: [PASS/FAIL]  CLI reconnect observed
+B1: [PASS/FAIL]
+B2: [PASS/FAIL]  reconnect time: <N>s
+B3: [PASS/FAIL]  dwell observed: <N>s
+B4: [PASS/FAIL]  (Kepware)
+B5: [PASS/FAIL]  (OI Gateway — if available)
+C1: [PASS/FAIL/SKIP - Galaxy not available]
+C2: [PASS/FAIL/SKIP]
+C3: [PASS/FAIL/SKIP]
+```
+
+A passing result on every row not marked SKIP or DEFERRED is the v2 GA acceptance criterion.
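+
+Once the block is filled in, the verdict can be derived from it. A minimal
+sketch, assuming each filled-in row starts with the ID followed by one of
+`PASS`, `FAIL`, `SKIP`, or `DEFERRED` (brackets optional). The
+`Test-CutoverRecord` helper and the sample rows are hypothetical, and the
+helper does not verify that every row is present.
+
+```powershell
+# Hypothetical helper: reads filled-in "ID: RESULT ..." rows from the Step 8 block
+# and reports whether any row failed. SKIP and DEFERRED rows do not count against the run.
+function Test-CutoverRecord {
+    param([string[]]$Lines)
+
+    $rows = foreach ($line in $Lines) {
+        if ($line -match '^\s*([ABC]\d):\s*\[?(PASS|FAIL|SKIP|DEFERRED)\b') {
+            [pscustomobject]@{ Id = $Matches[1]; Result = $Matches[2].ToUpper() }
+        }
+    }
+    $failed = @($rows | Where-Object { $_.Result -eq 'FAIL' } | ForEach-Object { $_.Id })
+    $passed = @($rows | Where-Object { $_.Result -eq 'PASS' })
+    [pscustomobject]@{
+        Accepted  = ($failed.Count -eq 0 -and $passed.Count -gt 0)
+        FailedIds = $failed
+    }
+}
+
+# Made-up sample rows for illustration only.
+$record = @(
+    'A1: PASS  evidence: cli-output.txt',
+    'A5: DEFERRED - ServerUriArray upgrade pending',
+    'B2: PASS  reconnect time: 4s',
+    'C1: SKIP - Galaxy not available'
+)
+$verdict = Test-CutoverRecord -Lines $record
+```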
+
+## Known limitations
+
+### A5 — ServerUriArray node not yet writable
+
+The OPC UA .NET Standard SDK's default `Server.ServerRedundancy` object is the
+base `ServerRedundancyState`, which has no `ServerUriArray` child node.
+`ServerRedundancyNodeWriter.ApplyServerUriArray` currently logs a warning and
+skips. The operator obtains `ServerUriArray` by reading `ClusterNode` rows
+directly until the non-transparent redundancy-type upgrade follow-up ships.
+
+### Recovery dwell is 60 s by default
+
+`RecoveryStateManager.DwellTime` defaults to `TimeSpan.FromSeconds(60)` in
+`Program.cs`. Step 6 of the runbook will block for at least 60 s waiting for
+Node A to return to `AuthoritativePrimary`. This is intentional per
+decision #154 (thrash prevention) — do not lower it for the test run.
+
+### IsolatedBackup (80) does not auto-promote
+
+Per decision #154, the Backup at band 80 does not self-elevate. If the operator
+needs authoritative service from Node B while Node A is down, they must write
+`RedundancyRole=Primary` on the `ClusterNode` row for Node B and publish a
+draft generation. The Admin UI `RedundancyTab` exposes this flow.
+
+## Dependency on existing tests
+
+The cutover runbook validates the end-to-end wire path. The math and edge cases
+are already locked by the unit/integration tests enumerated in the first section.
+A failing runbook step that contradicts a passing unit test most likely indicates a
+deployment configuration error or an SDK version mismatch rather than a logic bug.
+Check `PeerHttpProbeLoop` logs first (look for `PeerProbe` Serilog events).
--- a/docs/plans/v2-ga-lab-gates-plan.md
+++ b/docs/plans/v2-ga-lab-gates-plan.md
@@ -0,0 +1,307 @@
+# v2 GA Lab Gates Plan
+
+> **Canonical tracker**: `docs/v2/v2-release-readiness.md` — all code-path
+> release blockers are closed as of 2026-04-24. This document maps the
+> remaining exit-criteria from that tracker to concrete commands, automation
+> boundaries, operator procedures, and pass criteria.
+>
+> **Status**: RELEASE-READY (code-path). Manual/lab gates remain open.
+
+## The gate list
+
+From `docs/v2/v2-release-readiness.md` §"Release-readiness exit criteria":
+
+| # | Gate | Kind | Automatable here |
+|---|------|------|-----------------|
+| G1 | All four Phase 6.N compliance scripts exit 0 | Script | Yes — run on this box |
+| G2 | `dotnet test ZB.MOM.WW.OtOpcUa.slnx` passes with <= 1 known-flake failure | Script | Yes — run on this box |
+| G3 | Release blockers closed | Audit | Already closed (code-path) |
+| G4 | Phase 5 driver complement shipped | Audit | Already closed |
+| G5 | Production deployment checklist signed off by Fleet Admin | Operator | No — separate doc, human signoff |
+| G6 | At least one end-to-end integration run against live Galaxy succeeds | Dev rig | No — requires AVEVA platform |
+| G7 | FOCAS live-CNC wire-level smoke (#54) passes against a real FANUC control | Lab hardware | No — requires FANUC CNC |
+| G8 | OPC UA CTT / UA Compliance Test Tool passes against the live endpoint | Operator tool | No — requires CTT binary + live endpoint |
+| G9 | Non-transparent redundancy cutover validated with >= 1 production client | Lab | No — see `docs/plans/phase-6-3-redundancy-interop-plan.md` |
+
+---
+
+## G1 — Phase 6 compliance scripts
+
+### Command
+
+```powershell
+pwsh ./scripts/compliance/phase-6-all.ps1
+```
+
+This meta-runner at `scripts/compliance/phase-6-all.ps1` invokes each
+sub-script in a separate `powershell.exe` process to isolate exit codes:
+
+| Sub-script | Phase | What it checks |
+|-----------|-------|---------------|
+| `phase-6-1-compliance.ps1` | 6.1 Resilience & Observability | Polly resilience classes, health endpoints, LiteDB sealed cache, observability sinks |
+| `phase-6-2-compliance.ps1` | 6.2 Authorization runtime | `AuthorizationGate`, `TriePermissionEvaluator`, `NodeScopeResolver`, dispatch wiring in `DriverNodeManager` |
+| `phase-6-3-compliance.ps1` | 6.3 Redundancy runtime | `ServiceLevelCalculator` 8-state band values, `RecoveryStateManager`, `ApplyLeaseRegistry`, `ServerRedundancyNodeWriter`; also invokes `dotnet test` with a baseline of 1097 |
+| `phase-6-4-compliance.ps1` | 6.4 Admin UI completion | Data-layer types, Identification folder, deferred Blazor items marked `[DEFERRED]` |
+
+### Pass criterion
+
+```
+Phase 6 aggregate: PASS
+```
+
+Exit code 0. Any `[FAIL]` line is a blocker. `[DEFERRED]` lines are expected
+for the known-deferred surfaces listed in the implementation docs; they do not
+fail the run.
+
+### Prerequisites
+
+- SQL Server `10.100.0.35,14330` reachable (Config DB tests use it).
+- `dotnet` SDK on PATH (`.NET 10`).
+- Run from repo root.
+
+---
+
+## G2 — Full solution test suite
+
+### Command
+
+```powershell
+dotnet test ZB.MOM.WW.OtOpcUa.slnx --logger "console;verbosity=minimal"
+```
+
+To include the integration suites that need their fixtures, bring the fixtures up first and then run the same command:
+
+```powershell
+# bring modbus fixture up first
+lmxopcua-fix up modbus standard
+
+dotnet test ZB.MOM.WW.OtOpcUa.slnx --logger "console;verbosity=minimal"
+```
+
+### Pass criterion
+
+- Passed count >= 1159 (2026-04-19 baseline after Phase 5 driver complement).
+- Failed count <= 1 (the pre-existing
+  `SubscribeCommandTests.Execute_PrintsSubscriptionMessage` flake in
+  `Client.CLI` is the only tolerated failure).
+- No new `[FAILED]` tests relative to the baseline.
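+
+The three criteria combine into a single verdict. A minimal sketch, assuming
+the passed count and the list of failed test names have already been
+extracted from the test output (how they are extracted is not covered here);
+`Test-G2Gate` is a hypothetical helper.
+
+```powershell
+# Hypothetical helper encoding the three G2 criteria listed above.
+function Test-G2Gate {
+    param(
+        [int]$PassedCount,
+        [string[]]$FailedTests = @(),
+        [int]$Baseline = 1159
+    )
+    $knownFlake = 'SubscribeCommandTests.Execute_PrintsSubscriptionMessage'
+    # Any failure other than the known flake counts as a new failure.
+    $unexpected = @($FailedTests | Where-Object { $_ -notlike "*$knownFlake" })
+
+    ($PassedCount -ge $Baseline) -and ($FailedTests.Count -le 1) -and ($unexpected.Count -eq 0)
+}
+
+# Example: baseline met, only the tolerated flake failed.
+$gateOk = Test-G2Gate -PassedCount 1159 -FailedTests @(
+    'ZB.MOM.WW.OtOpcUa.Client.CLI.Tests::SubscribeCommandTests.Execute_PrintsSubscriptionMessage'
+)
+```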
+
+### Known flake
+
+`ZB.MOM.WW.OtOpcUa.Client.CLI.Tests::SubscribeCommandTests.Execute_PrintsSubscriptionMessage`
+is a timing-sensitive subscribe-then-cancel test. Rerun the specific project
+if it appears:
+
+```powershell
+# dotnet test has no --count option, so repeat the run in a loop
+1..3 | ForEach-Object {
+    dotnet test tests/Client/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests `
+        --filter "FullyQualifiedName~SubscribeCommandTests.Execute_PrintsSubscriptionMessage"
+}
+```
+
+If it fails all three runs, investigate; otherwise treat as flake.
+
+### Docker fixtures needed for integration suites
+
+| Driver | Command | Endpoint used |
+|--------|---------|---------------|
+| Modbus | `lmxopcua-fix up modbus standard` | `10.100.0.35:5020` |
+| AB CIP | `lmxopcua-fix up abcip controllogix` | `10.100.0.35:44818` |
+| S7 | `lmxopcua-fix up s7 s7_1500` | `10.100.0.35:1102` |
+| OPC UA Client | `lmxopcua-fix up opcuaclient` | `opc.tcp://10.100.0.35:50000` |
+| FOCAS | `lmxopcua-fix up focas` (mock server) | `10.100.0.35:8193` |
+
+TwinCAT integration tests require the TCBSD ESXi VM at `10.100.0.128`
+(AmsNetId `41.169.163.43.1.1`). Set the env vars before running:
+
+```powershell
+$env:TWINCAT_TARGET_HOST   = "10.100.0.128"
+$env:TWINCAT_TARGET_NETID  = "41.169.163.43.1.1"
+dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests
+```
+
+Galaxy integration tests run against the live mxaccessgw on the dev box
+(gate G6).
+
+---
+
+## G3 — Release blockers closed (audit, already satisfied)
+
+All three code-path release blockers are closed per `v2-release-readiness.md`:
+
+- Authorization dispatch wiring (task #143, PR #94) — CLOSED.
+- Config fallback Phase 6.1 Stream D (task #136, PR #96) — CLOSED.
+- Redundancy Phase 6.3 Streams A/C core (tasks #145/#147, PRs #98-99) — CLOSED.
+
+No action required. Record the PR numbers in the release notes.
+
+---
+
+## G4 — Driver complement (audit, already satisfied)
+
+All eight drivers shipped:
+
+Galaxy, Modbus (+ DL205/S7/MELSEC profiles), S7 native, OPC UA Client, AB CIP,
+AB Legacy, TwinCAT ADS, FOCAS (managed wire client — Tier-C isolation retired,
+FOCAS is now Tier A in-process via `WireFocasClient`).
+
+No action required.
+
+---
+
+## G5 — Production deployment checklist (operator action)
+
+The deployment checklist is a separate document covering:
+
+- Windows service install (`scripts/install/Install-Services.ps1`)
+- Config DB migration (`scripts/db/Apply-Migrations.ps1`)
+- Certificate provisioning and trust
+- LDAP / GLAuth configuration for production AD target
+- mxaccessgw API key provisioning (`apikey create-key` in the sibling repo)
+- Service account permissions
+- Prometheus / OpenTelemetry export configuration
+- Firewall rules (port 4840 OPC UA, port 5120 gRPC to mxaccessgw,
+  Admin port 5000/5001)
+
+**Sign-off party**: Fleet Admin (operator). Not automatable.
+
+Record sign-off as a comment on the v2 release issue.
+
+---
+
+## G6 — Live Galaxy end-to-end integration run
+
+**Requires**: AVEVA System Platform installed on dev box (confirmed available
+per project memory `project_aveva_platform_installed.md`); mxaccessgw running
+with a provisioned API key; at least one Galaxy object deployed.
+
+### Procedure
+
+1. Start mxaccessgw:
+
+   ```powershell
+   # in sibling repo C:\Users\dohertj2\Desktop\mxaccessgw\
+   dotnet run --project src/MxGateway.Server -- --apikey-path .local/api-key.txt
+   ```
+
+2. Start OtOpcUa server with Galaxy driver instance configured:
+
+   ```powershell
+   sc start OtOpcUa
+   ```
+
+3. Browse via Client CLI:
+
+   ```powershell
+   dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+       browse -u opc.tcp://localhost:4840 -r -d 3
+   ```
+
+4. Read a known Galaxy tag (e.g. a deployed `$UserDefined` object attribute):
+
+   ```powershell
+   dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+       read -u opc.tcp://localhost:4840 -n "ns=2;s=<tag_name.AttributeName>"
+   ```
+
+5. Subscribe and verify live updates:
+
+   ```powershell
+   dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
+       subscribe -u opc.tcp://localhost:4840 -n "ns=2;s=<tag_name.AttributeName>" -i 1000
+   ```
+
+### Pass criterion
+
+- Browse returns a non-empty node tree mirroring the Galaxy hierarchy.
+- Read returns `Good` quality with a non-null value.
+- Subscribe receives at least one data-change notification within 5 s
+  (or within the configured publishing interval).
+- No `BadNoCommunication` or `BadTimeout` errors in the server log.
+
+Record: Galaxy version, deployed object count, OtOpcUa git SHA.
+
+---
+
+## G7 — FOCAS live-CNC smoke (task #54)
+
+**Requires**: real FANUC CNC with Ethernet option, accessible on TCP port 8193
+from the dev box; CNC series known (e.g. 0i-F, 30i-B).
+
+See `docs/plans/live-hardware-validation-runbooks.md` §FOCAS for the full
+runbook.
+
+### Pass criterion
+
+- `WireFocasClient` opens a FOCAS2 session (`cnc_allclibhndl3` succeeds).
+- Identity nodes (`Identity/SeriesNumber`, `Identity/MaxAxes`) return non-null
+  values matching the physical control panel display.
+- At least one axis position (`Axes/X/AbsolutePosition` or similar) returns
+  `Good` quality with a plausible double value.
+- Subscribe on a polled tag delivers at least three updates within 5 s.
+- No `EW_SOCKET` (-16) or `EW_HANDLE` (-8) errors in the server log during a
+  2-minute soak.
+
+Record: CNC series, firmware version, test date, OtOpcUa git SHA.
+
+---
+
+## G8 — OPC UA Compliance Test Tool (CTT) pass
+
+**Requires**: OPC Foundation OPC UA Compliance Test Tool (CTT) or the
+open-source UA Compliance Test Tool installed on the client machine;
+live OtOpcUa server endpoint.
+
+### Recommended minimum profile set
+
+- `Attribute Read`
+- `Attribute Write`
+- `Browse`
+- `Subscription` (DataChange)
+- `Server-side monitoring`
+- `Security — None profile` (if server configured with `Security:Profiles=[None]`)
+
+### Procedure
+
+1. Launch CTT. Add server endpoint: `opc.tcp://localhost:4840`.
+2. Run the profile set above.
+3. Capture the CTT report HTML/XML.
+
+### Pass criterion
+
+All mandatory test cases in each profile: **PASS** or **NOT APPLICABLE**.
+
+Zero mandatory failures. Advisory failures may be documented with rationale
+(e.g. optional capability not implemented).
+
+Record: CTT version, profile set, OtOpcUa git SHA, report artifact.
+
+---
+
+## G9 — Non-transparent redundancy cutover with production client
+
+See `docs/plans/phase-6-3-redundancy-interop-plan.md` for the full runbook.
+
+**Minimum acceptable result**: one complete pass of the A-block (UaExpert
+OPC UA signal verification) plus scenario B2 (UaExpert failover on Primary
+kill).
+
+Ignition 8.3 is the recommended production client per decision #85. If
+Ignition is not available on the lab machine, UaExpert is accepted for v2 GA.
+
+Record: client name + version, OtOpcUa git SHA, test date.
+
+---
+
+## Gate summary table
+
+| Gate | Command / Procedure | Pass criterion | Owner |
+|------|---------------------|----------------|-------|
+| G1 | `pwsh ./scripts/compliance/phase-6-all.ps1` | Exit 0, no `[FAIL]` | Dev |
+| G2 | `dotnet test ZB.MOM.WW.OtOpcUa.slnx` | >= 1159 passing, <= 1 failure | Dev |
+| G3 | Audit PR list in release-readiness.md | All blockers show CLOSED | Dev |
+| G4 | Audit driver table | All 8 drivers listed as shipped | Dev |
+| G5 | Run deployment checklist doc | All items checked; Fleet Admin signs off | Fleet Admin |
+| G6 | Browse/read/subscribe against live Galaxy | Good quality, non-empty tree | Dev (dev box) |
+| G7 | FOCAS CNC smoke — see live-hardware runbook | Session open, Good quality reads | Dev + lab hardware |
+| G8 | CTT profile run against live endpoint | Zero mandatory failures | Dev + CTT tool |
+| G9 | Redundancy cutover runbook | A-block + B2 pass with >= 1 client | Dev + two instances |