docs: add four planning runbooks for Phase 6.3 interop, v2 GA gates, live-hardware validation, and alarms worker wiring

Produces docs/plans/ entries for tasks #13, #15, #16, and #17-#20:
- phase-6-3-redundancy-interop-plan.md: automation boundary analysis,
  concrete test matrix (A/B/C blocks), and a step-by-step cutover
  runbook for the deferred Stream F client interop work
- v2-ga-lab-gates-plan.md: exact gate list with command, pass criterion,
  and owner for each of the nine v2 GA exit criteria
- live-hardware-validation-runbooks.md: one runbook per driver (FOCAS
  CNC smoke #54, AB CIP live-boot, TwinCAT wire-live) with preconditions,
  procedure, expected results, and recording template
- alarms-worker-wiring-plan.md: focused plan for A.2/A.3-A.4/C.1/D.1
  worker wiring in the mxaccessgw sibling repo, documenting the
  discovered AVEVA API surface, the architectural decision that blocks
  A.2, the dependency order, and what each item needs to unblock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-18 04:52:07 -04:00
parent da8a3e46f7
commit 16a87b08f3
4 changed files with 1422 additions and 0 deletions

@@ -0,0 +1,340 @@
# Alarms Worker Wiring Plan
> **Context**: The alarms-over-gateway epic shipped 19 PRs across the
> `lmxopcua` and `mxaccessgw` repos (merged 2026-04-30). Contracts are live;
> the sub-attribute fallback path keeps Galaxy alarms functional today. Four
> items remain as inert scaffolds gated on a dev-rig finding. This document is
> the focused implementation plan for those four items only.
>
> **Do not duplicate `docs/plans/alarms-over-gateway.md`** — that document is
> the full historical record of all 19 PRs. This document covers only what is
> still to be done and exactly what blocks each item.
>
> **This work lives in the mxaccessgw sibling repo** at
> `C:\Users\dohertj2\Desktop\mxaccessgw\` — not in this (lmxopcua) repo,
> except where lmxopcua changes are noted explicitly.
---
## Dev-rig finding that blocks everything (2026-04-30)
During PR A.2 work the following was discovered on the dev box:
> The MXAccess COM Toolkit at
> `C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll`
> exposes **no alarm-event family** — only `OnDataChange`, `OnWriteComplete`,
> `OperationComplete`, `OnBufferedDataChange`.
>
> AVEVA's `aaAlarmManagedClient` / `ArchestrAAlarmsAndEvents.SDK` assemblies
> are **x64-only** and incompatible with the worker's x86 net48 bitness.
The architectural decision required before any of A.2, A.3/A.4, C.1 can ship:
> **Either** accept the value-driven sub-attribute path as the production
> architecture (operator-comment fidelity is the only v1 regression), **or**
> add an x64 alarm-helper sub-process alongside the x86 worker.
Resolution drives the implementation shape of every item below. The plan
presented here assumes the x64 alarm-helper sub-process route (the higher
parity option), but notes the sub-attribute-only exit at each step.
---
## Discovered AVEVA API surface
Before implementing, verify the following against the AVEVA SDK actually
installed on the dev box and in the mxaccessgw worker's deployment folder:
| Assembly | Bitness | Likely location | Key types |
|----------|---------|-----------------|-----------|
| `ArchestrA.MXAccess.dll` | x86 | `C:\Program Files (x86)\ArchestrA\Framework\Bin\` | `IMxAlarmEventSink`, `MxAlarmEventArgs` (**confirm exists at actual version**) |
| `aaAlarmManagedClient.dll` | x64 | `C:\Program Files\ArchestrA\Framework\Bin\` | `AlarmClient`, `IAlarmConsumer`, `AlarmEventArgs` |
| `ArchestrAAlarmsAndEvents.SDK.dll` | x64 | Same or Historian SDK folder | `AlarmHistorianWriter`, `GetAlarmExtendedRec` |
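The bitness column above can be spot-checked without loading the assemblies. The following is a minimal sketch, assuming only the standard PE/COFF header layout; `pe_machine` and `fake_pe` are hypothetical helper names, not part of either repo. For managed assemblies the CLR header flags also matter (an AnyCPU assembly reports the x86 machine value), so treat the result as a first check rather than proof.

```python
import struct

# Machine field values from the PE/COFF file header.
MACHINE_NAMES = {0x014C: "x86", 0x8664: "x64"}


def pe_machine(data: bytes) -> str:
    """Return 'x86', 'x64', or 'unknown(0x....)' for a PE image's Machine field."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE image: missing MZ signature")
    # e_lfanew (offset 0x3C) holds the file offset of the 'PE\0\0' signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE image: missing PE signature")
    # The 16-bit Machine field immediately follows the signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, f"unknown(0x{machine:04x})")


def fake_pe(machine: int) -> bytes:
    """Build a synthetic header for demonstration only (not a loadable DLL)."""
    header = bytearray(0x80)
    header[0:2] = b"MZ"
    struct.pack_into("<I", header, 0x3C, 0x40)
    header[0x40:0x44] = b"PE\0\0"
    struct.pack_into("<H", header, 0x44, machine)
    return bytes(header)
```

On the dev box the same function could be fed the bytes of each DLL listed in the table, for example via `pathlib.Path(path).read_bytes()`.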
The AVEVA MXAccess Toolkit reference in the mxaccessgw repo (`gateway.md`) is
the canonical API doc for the gateway worker's side. The alarm-client API is
documented separately; verify the following call shapes during PR A.2:
| Operation | Likely API | Notes |
|-----------|-----------|-------|
| Subscribe to alarm events | `AlarmClient.RegisterConsumer(IAlarmConsumer)` + `AlarmClient.Subscribe(filterSpec)` | Confirm exact method signatures against the SDK version on the dev box |
| Receive alarm event | `IAlarmConsumer.OnAlarmEvent(AlarmEventArgs)` callback | Field set: alarm name, source, type, transition kind, severity, timestamps, operator fields |
| Acknowledge alarm | `AlarmClient.AcknowledgeAlarm(alarmRef, comment, userPrincipal)` or equivalent | Confirm whether this is synchronous or returns a status |
| Query active alarms | `AlarmClient.GetAlarmExtendedRec(filter)` or `GetActiveAlarms()` | Returns current active set for ConditionRefresh |
| Get statistics | `AlarmClient.GetStatistics()` | Optional — useful for worker health checks |
Record the exact method signatures against the installed SDK before starting
A.2 — the proto field set in `OnAlarmTransitionEvent` must match the SDK's
actual payload.
---
## Dependency order
```
A.2 (worker: AlarmClient subscription)
└─► A.3 (gateway: dispatch OnAlarmTransition + AcknowledgeAlarm RPC handler)
└─► A.4 (gateway: QueryActiveAlarms RPC handler)
└─► lmxopcua B.2 (GalaxyDriver IAlarmSource live)
└─► C.1 (sidecar: AahClientManagedAlarmEventWriter live)
└─► D.1 (smoke artifact captured)
```
A.2 is the single blocking item. All subsequent items unblock serially once
A.2 delivers alarm events through the channel.
---
## Item A.2 — Worker: subscribe to MxAccess alarm event source
**Repo**: `mxaccessgw`, `src\MxGateway.Worker\` (net48, x86)
**What it needs**:
The worker must subscribe to AVEVA's alarm events and fan them into the same
bounded channel the data-change pump uses, translating each MxAccess alarm
event into a `WorkerEvent` proto with family `MX_EVENT_FAMILY_ON_ALARM_TRANSITION`
(defined in PR A.1, already merged).
**Architectural choice determines the implementation path**:
**Option X1 — aaAlarmManagedClient in a new x64 alarm-helper process**
Add a second worker-mode sub-process (`MxGateway.AlarmWorker`, net8.0 x64)
alongside the existing x86 worker. The AlarmWorker:
1. Loads `aaAlarmManagedClient.dll` (x64) on startup.
2. Calls `AlarmClient.RegisterConsumer` with a `WorkerAlarmConsumer` sink.
3. Calls `AlarmClient.Subscribe` with a session-level filter (all alarms for
the session's Galaxy scope).
4. Translates each `IAlarmConsumer.OnAlarmEvent` callback into a protobuf
`WorkerEvent` (family `ON_ALARM_TRANSITION`) and writes it to an IPC
channel readable by the gateway server-side multiplexer.
5. Handles session lifecycle: re-subscribes after reconnect; unsubscribes on
session close.
IPC from AlarmWorker to gateway: simplest option is a named pipe or an
in-process queue if the AlarmWorker is hosted in the same gateway process
space as a separate `IHostedService`.
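The translate-and-enqueue step of Option X1 can be sketched as follows. This is an illustration only, written in Python rather than the worker's C#: the `AlarmCallback` fields are a hypothetical subset of the SDK payload, the dict stands in for the `WorkerEvent` proto, and the drop-when-full policy is an assumption (the plan only says the channel is bounded).

```python
import queue
from dataclasses import dataclass

FAMILY = "MX_EVENT_FAMILY_ON_ALARM_TRANSITION"


@dataclass(frozen=True)
class AlarmCallback:
    # Hypothetical field subset; confirm the real payload against the SDK.
    alarm_name: str
    source: str
    transition: str
    severity: int


def to_worker_event(cb: AlarmCallback, session_id: str) -> dict:
    """Translate one alarm callback into a WorkerEvent-shaped dict."""
    return {
        "family": FAMILY,
        "session_id": session_id,
        "body": {
            "alarm_name": cb.alarm_name,
            "source": cb.source,
            "transition": cb.transition,
            "severity": cb.severity,
        },
    }


class AlarmFanIn:
    """Feeds translated alarm events into a bounded channel."""

    def __init__(self, capacity: int) -> None:
        self.channel: queue.Queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def on_alarm_event(self, cb: AlarmCallback, session_id: str) -> bool:
        try:
            # Never block the SDK callback thread; count drops instead (assumed policy).
            self.channel.put_nowait(to_worker_event(cb, session_id))
            return True
        except queue.Full:
            self.dropped += 1
            return False
```

Whether a full channel should drop, block, or fault the session is a design choice the real implementation would need to settle alongside the existing data-change pump's behaviour.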
**Option X2 — Accept sub-attribute fallback as production (no A.2 work)**
If the architectural decision is to accept the sub-attribute path as permanent:
- `MxAccessAlarmEventSink.Attach()` in the worker remains a no-op (as
currently coded with the architectural comment).
- The `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` proto family stays defined but
the gateway never emits events on it.
- lmxopcua's `GalaxyDriver` does not implement `IAlarmSource` for the
native path; the value-driven sub-attribute path remains the production
path.
- The only regression vs. v1 is operator-comment fidelity on Galaxy alarms.
- C.1 is still needed if scripted-alarm historian write-back is required.
**What blocks it**: the architectural decision above. Once made, A.2 becomes
a 2-3 day implementation task (sub-process plumbing + proto translation +
unit tests for the consumer sink cancellation behaviour).
**Tests to write (when A.2 proceeds)**:
- `WorkerAlarmConsumerTests` — fake `IAlarmConsumer` source emits canned
transitions; assert each produces the correct `WorkerEvent` body shape.
- Cancellation/session-close test — closing the session unsubscribes from
the AlarmClient cleanly (no leaked `IAlarmConsumer` reference if the
worker is recycled mid-session).
- Re-subscribe-after-reconnect test — `ReconnectSupervisor` triggers a
reconnect; assert the alarm consumer re-attaches to the new session.
---
## Item A.3 / A.4 — Gateway: dispatch and RPC handlers
**Repo**: `mxaccessgw`, `src\MxGateway.Server\`
**Depends on**: A.2 delivering `WorkerEvent` bodies with family
`MX_EVENT_FAMILY_ON_ALARM_TRANSITION`.
**What it needs**:
### A.3 — Dispatch + AcknowledgeAlarm
1. The session-level event multiplexer (`Sessions\SessionEventStream.cs` or
equivalent — verify name in the mxaccessgw repo) must recognise the new
`WorkerEvent` body and forward it as an `MxEvent` with family
`MX_EVENT_FAMILY_ON_ALARM_TRANSITION` to every `StreamEvents` subscriber
for that session.
2. New RPC handler `AcknowledgeAlarm` builds an `AlarmAcknowledgeCommand`
worker command and forwards it to the alarm-helper process (Option X1) or
the worker's MxAccess session (Option X2 if MxAccess exposes ack). Maps
the reply status to `AcknowledgeAlarmReply.MxStatusProxy`.
3. Authorization: new API scope `invoke:alarm-ack` on the API key. Keys
without it receive `PERMISSION_DENIED`. Follow the existing scope-check
pattern used by `invoke:write`.
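The scope gate in step 3 amounts to a membership check before the command is forwarded. A minimal sketch, assuming the API key's scopes are available as a set of strings; the `Status` enum and function names are hypothetical stand-ins and do not reflect the gateway's actual scope-check code.

```python
from enum import Enum


class Status(Enum):
    OK = "OK"
    PERMISSION_DENIED = "PERMISSION_DENIED"


ALARM_ACK_SCOPE = "invoke:alarm-ack"


def authorize(key_scopes: frozenset, required: str) -> Status:
    """Allow the call only if the API key carries the required scope."""
    return Status.OK if required in key_scopes else Status.PERMISSION_DENIED


def acknowledge_alarm(key_scopes: frozenset, alarm_ref: str, comment: str) -> Status:
    status = authorize(key_scopes, ALARM_ACK_SCOPE)
    if status is not Status.OK:
        return status  # rejected before any worker command is built
    # Forwarding the acknowledge command to the worker is elided in this sketch.
    return Status.OK
```

Note that a key holding only `invoke:write` is rejected here; the plan treats alarm acknowledgement as its own scope.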
### A.4 — QueryActiveAlarms
1. New RPC handler `QueryActiveAlarms` calls `AlarmClient.GetAlarmExtendedRec`
(or `GetActiveAlarms` — confirm the method name during implementation)
on the alarm-helper process, batches results into `ActiveAlarmSnapshot`
proto messages, and streams them back to the caller.
2. New API scope `invoke:alarm-query` (separate from ack so read-only clients
can refresh without ack rights).
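The streaming shape described for A.4 (one `ActiveAlarmSnapshot` message per active alarm) maps naturally onto a generator. This is a sketch under assumptions: the two snapshot fields and the tuple input are hypothetical, and the real handler would stream proto messages over gRPC rather than yield dataclasses.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ActiveAlarmSnapshot:
    # Hypothetical fields; the real message is defined in the gateway proto.
    alarm_ref: str
    state: str


def query_active_alarms(active: Iterable[tuple[str, str]]) -> Iterator[ActiveAlarmSnapshot]:
    """Yield one snapshot per active alarm, preserving source order."""
    for alarm_ref, state in active:
        yield ActiveAlarmSnapshot(alarm_ref=alarm_ref, state=state)
```

An empty active set simply produces an empty stream, which is the 0-entry case a pagination test would need to cover.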
**What blocks A.3/A.4**: A.2 must deliver `WorkerEvent` bodies on the channel.
A.3/A.4 are pure dispatch wiring once the events arrive.
**Tests to write**:
- A.3 dispatch test — fake worker emits an `AlarmTransition` event; assert
the gateway forwards it on the `StreamEvents` channel of every subscribed
session (mirrors existing `OnDataChange` dispatch tests).
- A.3 AcknowledgeAlarm auth test — existing key without `invoke:alarm-ack`
scope returns `PERMISSION_DENIED`.
- A.4 pagination test — synthetic active-alarm set of 0 / 1 / 100 entries;
assert each streams back as separate `ActiveAlarmSnapshot` messages.
- Integration (parity rig — requires dev box with AVEVA platform):
trigger a real Galaxy alarm, call `QueryActiveAlarms`, assert the alarm
appears in the stream; call `AcknowledgeAlarm`, assert the alarm transitions
to `ActiveAcked` and an `Acknowledge` transition event appears on
`StreamEvents`.
---
## Item C.1 — Historian sidecar: AahClientManagedAlarmEventWriter
**Repo**: `lmxopcua`, `src\Drivers\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\`
**Depends on**: Architectural decision (the sidecar uses `aahClientManaged`
x64, which is not bitness-constrained like the worker). C.1 can be unblocked
independently of A.2 if the goal is to wire up the scripted-alarm historian
path.
**Current state**:
`SdkAlarmHistorianWriteBackend` in `src\MxGateway.Worker\MxAccess\` is a
placeholder returning `RetryPlease`. The lmxopcua sidecar's `WriteAlarmEvents`
IPC slot is defined in `Ipc\Contracts.cs` but `Program.cs` constructs
`HistorianFrameHandler` without an `alarmWriter` (line 57 per the alarms plan).
The `IAlarmEventWriter` interface exists; only the production implementation
and the consumer wiring are missing.
**What it needs**:
1. New `AahClientManagedAlarmEventWriter.cs` implementing `IAlarmEventWriter`
(defined in `Ipc\HistorianFrameHandler.cs`). Calls `aahClientManaged`'s
alarm-event write API — same path v1's `GalaxyHistorianWriter` used.
Uses `HistorianClusterEndpointPicker` for multi-node routing.
Maps `MxStatus` write outcomes to `HistorianWriteOutcome` enum
(Ack / PermanentFail / RetryPlease).
2. `Program.cs` — build `AahClientManagedAlarmEventWriter` next to the
existing `BuildHistorian()` call; pass it to `HistorianFrameHandler`.
Gate behind `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED` env var (default `true`
when `OTOPCUA_HISTORIAN_ENABLED=true`).
3. `Install-Services.ps1` — add the new env var to the install-time block.
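The env-var gate in step 2 has one subtle case: the alarm-write flag defaults to enabled only when the historian itself is enabled. A minimal sketch of that truth table follows. Two points are assumptions not stated in the plan: flags are parsed as case-insensitive `true`, and an explicit alarm-write flag is ignored when the historian is disabled.

```python
from typing import Mapping, Optional


def _flag(env: Mapping[str, str], name: str) -> Optional[bool]:
    """Return None when the variable is unset, else whether it equals 'true'."""
    raw = env.get(name)
    if raw is None:
        return None
    return raw.strip().lower() == "true"


def alarm_write_enabled(env: Mapping[str, str]) -> bool:
    historian = _flag(env, "OTOPCUA_HISTORIAN_ENABLED") or False
    if not historian:
        # Assumption: no alarm writer without the historian itself.
        return False
    explicit = _flag(env, "OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED")
    # Default is enabled when the historian is enabled and the flag is unset.
    return True if explicit is None else explicit
```

The same four cases (both unset, historian only, explicit opt-out, alarm flag without historian) are the ones the `Program.cs` seam test would need to pin down.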
**What blocks C.1**: access to the `aahClientManaged` SDK on the dev box
(confirmed available per `project_aveva_platform_installed.md` — AVEVA
Historian SDK is present). C.1 can proceed without A.2 since the sidecar's
`aahClientManaged` is x64 and does not share the worker's x86 bitness
constraint.
**Tests to write**:
- Outcome-mapping table: every `MxStatus` on alarm-write → expected
`HistorianWriteOutcome`.
- Batch test: 1 / 100 / 1000 events through a fake `aahClientManaged`
writer; assert per-row outcome list parallel to input order.
- Cluster failover: primary Historian node returns `BadCommunicationError`;
picker rotates to secondary; eventual success.
- `Program.cs` seam: assert handler constructed with alarm writer when env
var enabled; without it when disabled.
- Live integration (parity rig): write a synthetic alarm event through the
IPC; query it back via `ReadEvents`; assert round-trip fidelity.
---
## Item D.1 — Smoke artifact
**Repo**: `lmxopcua` (deployment refresh) + `mxaccessgw` (rig verification)
**Depends on**: A.2, A.3, A.4, and C.1 all passing on the dev rig with a live
Galaxy and live Historian.
**Current state**: The deployment script `Refresh-Services.ps1` (task D.1) has
shipped as PR #417 (merged 2026-04-30). What was NOT captured at that time was
a smoke artifact — a log snippet or test output confirming that:
1. An alarm transition event from a live Galaxy alarm reaches lmxopcua's
`AlarmConditionService` via the new `IAlarmSource` path (not the fallback).
2. A scripted-alarm historian write-back reaches AVEVA Historian via the
sidecar `IAlarmEventWriter`.
**What it needs**:
Once A.2, A.3, C.1 are wired on the parity rig:
1. Deploy the updated mxaccessgw (with A.2 / A.3 / A.4 changes).
2. Deploy the updated sidecar (with C.1 changes).
3. Run `Refresh-Services.ps1` to confirm clean service restarts.
4. Trigger a Galaxy alarm (e.g. set an AnalogLimitAlarm attribute out of
range in Galaxy IDE).
5. Observe the lmxopcua OPC UA alarm surface via the Client CLI:
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
alarms -u opc.tcp://localhost:4840 --subscribe
```
Pass: the alarm condition appears on the OPC UA A&E surface within
2 × publishing interval.
6. Trigger a scripted alarm via the lmxopcua `ScriptedAlarmEngine`
(or an OPC UA method call if one is wired).
7. Confirm in the AVEVA Historian that the scripted alarm event is stored
(query via the Historian client or HistorianWatch tool).
8. Capture log snippets:
- mxaccessgw log: `[INF] AlarmTransition dispatched sessionId=<> alarmRef=<>`
- lmxopcua log: `[INF] AlarmConditionService: IAlarmSource event alarmRef=<> origin=Driver`
- Sidecar log: `[INF] AahClientManagedAlarmEventWriter: Wrote <n> alarm events`
9. Commit the log snippets as `docs/plans/alarms-d1-smoke-artifact.md`
(a new doc, not this one).
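Before committing the artifact, the three captured snippets from step 8 can be checked mechanically. The sketch below assumes the log formats quoted in step 8 and that the placeholder values contain no whitespace; `missing_evidence` is a hypothetical helper, and the log text passed in is whatever was captured on the rig.

```python
import re

# Patterns derived from the three log lines listed in step 8.
EXPECTED = {
    "mxaccessgw": re.compile(
        r"\[INF\] AlarmTransition dispatched sessionId=\S+ alarmRef=\S+"
    ),
    "lmxopcua": re.compile(
        r"\[INF\] AlarmConditionService: IAlarmSource event alarmRef=\S+ origin=Driver"
    ),
    "sidecar": re.compile(
        r"\[INF\] AahClientManagedAlarmEventWriter: Wrote \d+ alarm events"
    ),
}


def missing_evidence(logs: dict[str, str]) -> list[str]:
    """Return the names of components whose expected log line was not found."""
    return [
        name
        for name, pattern in EXPECTED.items()
        if not pattern.search(logs.get(name, ""))
    ]
```

An empty return value means all three components produced their expected line; any name returned points at the hop where the event was lost.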
**What blocks D.1**: all of A.2, A.3, C.1, plus the operator decision on the
x64 alarm-helper architecture (or explicit acceptance of the sub-attribute
fallback as production).
---
## Summary of blocks
| Item | Blocked by | Estimated effort once unblocked |
|------|-----------|--------------------------------|
| A.2 | Architectural decision (x64 alarm-helper vs. sub-attribute fallback as production) | 2-3 days implementation; 1 day tests |
| A.3 | A.2 delivering WorkerEvent bodies | 1-2 days |
| A.4 | A.2 (active-alarm query needs AlarmClient session) | 1 day |
| C.1 | aahClientManaged SDK access (available on dev box); NOT blocked by A.2 | 1-2 days |
| D.1 | A.2 + A.3 + C.1 all passing on parity rig | 0.5 day (smoke + artifact capture) |
C.1 can proceed in parallel with A.2 / A.3 since the sidecar's `aahClientManaged`
is x64 and does not share the worker bitness constraint.
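The "Blocked by" column of the summary table can be encoded as a small graph to see which items may start in parallel. This sketch uses the standard-library `graphlib`; the `"decision"` node is a stand-in for the architectural decision, and the edges mirror the table rows as written.

```python
from graphlib import TopologicalSorter

# item -> set of items that must finish first (from the summary table).
BLOCKED_BY = {
    "A.2": {"decision"},
    "A.3": {"A.2"},
    "A.4": {"A.2"},
    "C.1": set(),
    "D.1": {"A.2", "A.3", "C.1"},
}


def waves(blocked_by: dict[str, set[str]]) -> list[list[str]]:
    """Group items into waves; everything in one wave can proceed in parallel."""
    sorter = TopologicalSorter(blocked_by)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorter.get_ready()
        result.append(sorted(ready))
        sorter.done(*ready)
    return result
```

With these edges, C.1 lands in the first wave next to the decision itself, which matches the note that it can proceed in parallel with A.2 and A.3.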
---
## What this plan does NOT cover
- The value-driven sub-attribute fallback path — already shipped and
functional (not being changed).
- Track B (lmxopcua EventPump, GalaxyDriver IAlarmSource re-implementation)
and Track E (client SDK surface refresh) from the alarms-over-gateway plan —
those are in `lmxopcua` and depend on A.3 being live; they follow naturally
once A.3 ships.
- Galaxy-native alarm historian path — System Platform's own `HistorizeToAveva`
toggle on the Galaxy template; not in scope.
- Alarm ACL / role-grant surface — already shipped in Phase 6.2.

@@ -0,0 +1,497 @@
# Live-Hardware Driver Validation Runbooks
> **Scope**: These runbooks cover the three driver validation tasks that
> require physical hardware or a hardware-equivalent live environment and
> cannot be satisfied by the Docker-based simulator fixtures or unit tests
> alone.
>
> Driver implementation is complete. The runbooks document the preconditions,
> step-by-step procedure, expected results, and how to record the outcome for
> each driver that has an open live-hardware gap.
---
## 1. FANUC FOCAS — Live CNC Smoke (task #54)
### Background
The FOCAS driver (`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/`) uses the
pure-managed `WireFocasClient` that speaks FOCAS2 over TCP directly (no
`Fwlib64.dll`, no P/Invoke). The integration test suite at
`tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.IntegrationTests/` runs against
the `focas-mock` Python server (PDU-verified against `fwlibe64.dll` upstream)
and covers all call-shapes the driver issues. What the mock cannot cover:
- Series-specific firmware quirks (e.g. 0i-F vs 30i-B parameter range limits)
- Real CNC Ethernet stack behaviour (TCP keep-alive, session-close edge cases)
- Series gating: some driver nodes are conditionally emitted based on
`CncSeries` — only a physical CNC can confirm the suppression works
### Preconditions
| Item | Requirement |
|------|-------------|
| CNC hardware | FANUC CNC with Ethernet option enabled; TCP port 8193 reachable from the dev box or from the host running OtOpcUa |
| CNC series | Any of: 0i-D, 0i-F, 0i-MF, 0i-TF, 16i, 30i-B, 31i, 32i, Power Motion i |
| CNC state | Running state (not E-stop, not alarm) for live axis-data reads |
| Network | TCP reachability from OtOpcUa server host to CNC port 8193 |
| OtOpcUa | Server built and deployed (`dotnet publish` or running via `dotnet run`) |
| Config | DriverInstance row for FOCAS in Config DB (`Type="FOCAS"`, `Backend="wire"`, `Devices[0].HostAddress="focas://<cnc-ip>:8193"`, `Devices[0].Series="<series>"`) |
### Procedure
**Step 1 — Verify TCP reachability**
```powershell
Test-NetConnection -ComputerName <cnc-ip> -Port 8193
```
Pass: `TcpTestSucceeded: True`.
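Where `Test-NetConnection` is not available (for example on a non-Windows host running OtOpcUa), the same reachability check can be sketched with a plain TCP connect. `tcp_reachable` is a hypothetical helper name, and the 3-second timeout is an arbitrary choice.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable hosts.
        return False
```

A `True` result only shows that something accepts connections on the port; it does not confirm that the peer speaks FOCAS.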
**Step 2 — Start OtOpcUa with FOCAS driver configured**
Ensure the Config DB has the DriverInstance row. Start the server:
```powershell
sc start OtOpcUa
# or for a dev run:
dotnet run --project src/Server/ZB.MOM.WW.OtOpcUa.Server
```
Watch the Serilog log for:
```
[INF] FocasDriver initializing device focas://<cnc-ip>:8193 series=<series>
[INF] FocasDriver device <cnc-ip>:8193 Connected
```
If `EW_SOCKET (-1)` appears, the TCP endpoint is unreachable or the CNC
Ethernet option is not active.
**Step 3 — Browse the address space**
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
browse -u opc.tcp://localhost:4840 -r -d 3
```
Expected: a node tree containing at minimum:
```
FOCAS/
<device>/
Identity/
SeriesNumber
Version
MaxAxes
Status/
RunState
Mode
EmergencyStop
Axes/
<X|Y|Z>/
AbsolutePosition
MachinePosition
```
Nodes suppressed by the `Series` capability gate will be absent — this is
correct behaviour.
**Step 4 — Read identity nodes**
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS/<device>/Identity/SeriesNumber"
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS/<device>/Identity/MaxAxes"
```
Pass: `Good` quality; `SeriesNumber` matches the string printed on the CNC
control panel (e.g. `"0i-F"`); `MaxAxes` is a non-zero integer.
**Step 5 — Read live status and axis data**
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS/<device>/Status/RunState"
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS/<device>/Axes/X/AbsolutePosition"
```
Pass: both return `Good` quality. `AbsolutePosition` is a `Double` (e.g.
`-12.3456` mm). Manually compare against the machine's position display.
**Step 6 — Subscribe and observe polling**
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
subscribe -u opc.tcp://localhost:4840 `
-n "ns=2;s=FOCAS/<device>/Status/RunState" -i 500
```
Let run for 30 s while jogging an axis or changing mode on the CNC operator
panel. Pass: at least one data-change event received within 5 s; events
continue arriving every ~500 ms.
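The pass criterion for this step can be made explicit if event arrival times are recorded. A minimal sketch, assuming arrival offsets in seconds since the subscription started; treating "~500 ms" as "no gap longer than twice the interval" is an assumption, not a figure from the runbook.

```python
def subscribe_passes(
    arrivals_s: list[float],
    interval_s: float = 0.5,
    first_within_s: float = 5.0,
    gap_factor: float = 2.0,
) -> bool:
    """Check first-event latency and inter-event gaps for a subscription run."""
    if not arrivals_s or arrivals_s[0] > first_within_s:
        return False  # no event at all, or the first one arrived too late
    gaps = [later - earlier for earlier, later in zip(arrivals_s, arrivals_s[1:])]
    # Every gap must stay within the tolerated multiple of the sampling interval.
    return all(gap <= interval_s * gap_factor for gap in gaps)
```

The tolerance would need tuning per node: a value that only changes on operator action will not produce events at every sampling interval.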
**Step 7 — 2-minute soak**
Let the server run for 2 minutes with the subscription active. Pass: no
`EW_SOCKET`, `EW_HANDLE`, `EW_BUSY` errors in the Serilog output; subscribed
node continues delivering updates.
**Step 8 — Run the FOCAS e2e script**
```powershell
pwsh scripts/e2e/test-focas.ps1 -ServerUrl opc.tcp://localhost:4840 `
-DriverInstance "<device>" -Series "<series>"
```
Pass: script exits 0.
### Expected results
| Check | Expected |
|-------|----------|
| TCP connect to CNC port 8193 | Success |
| FOCAS session open (`cnc_allclibhndl3`) | EW_OK (0) in driver log |
| `Identity/SeriesNumber` | Matches CNC panel, `Good` quality |
| `Identity/MaxAxes` | Non-zero integer, `Good` quality |
| `Status/RunState` | Integer 0-3, `Good` quality |
| `Axes/X/AbsolutePosition` | Double, `Good` quality, matches display |
| Subscribe: events delivered | >= 3 events in 5 s soak |
| 2-minute soak: no FOCAS errors | Clean Serilog log |
### Recording the outcome
```
FOCAS live-CNC smoke — task #54
Date: YYYY-MM-DD
CNC: <manufacturer> <model> series=<series> firmware=<version>
IP: <cnc-ip>:8193
OtOpcUa SHA: <git sha>
TCP connect: PASS
Session open: PASS
Identity reads: PASS SeriesNumber="<>" MaxAxes=<n>
Status read: PASS RunState=<n>
Axis read: PASS X/AbsolutePosition=<value>
Subscribe: PASS <n> events in 30s
2-min soak: PASS no errors
e2e script: PASS
```
---
## 2. Allen-Bradley CIP — Live Boot (ControlLogix)
### Background
The AB CIP driver (`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip/`) uses
`libplctag` 1.6.x. The Docker `ab_server` simulator covers connectivity and
atomic type reads (7 integration tests). Live-boot validation is needed to
confirm UDT shape-reading, array tag access, and the CIP packing behaviour on
a real ControlLogix backplane — all gaps acknowledged in
`docs/drivers/AbServer-Test-Fixture.md`.
AB CIP live-boot was first verified against a ControlLogix rig at PR #222.
Continue running before each release.
### Preconditions
| Item | Requirement |
|------|-------------|
| PLC hardware | ControlLogix (preferred) or CompactLogix; firmware 20+ for request packing |
| Network | TCP port 44818 reachable from OtOpcUa server host |
| PLC state | Running; at least one DINT / REAL / BOOL / STRING controller-scoped tag defined |
| OtOpcUa | Server built and deployed |
| Config | DriverInstance row: `Type="AbCip"`, `Host="<plc-ip>"`, `Path="1,0"`, `PlcType="ControlLogix"` |
### Procedure
**Step 1 — Verify TCP reachability**
```powershell
Test-NetConnection -ComputerName <plc-ip> -Port 44818
```
Pass: `TcpTestSucceeded: True`.
**Step 2 — Start OtOpcUa and watch driver log**
```powershell
sc start OtOpcUa
```
Look for:
```
[INF] AbCipDriver device <plc-ip> Connected path=1,0 plcType=ControlLogix
```
**Step 3 — Browse the address space**
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
browse -u opc.tcp://localhost:4840 -r -d 3
```
Pass: node tree shows the tags defined in the ControlLogix project (controller-
and program-scoped). UDT members appear as child nodes.
**Step 4 — Read atomic tags**
```powershell
# Read a DINT tag
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
read -u opc.tcp://localhost:4840 -n "ns=2;s=AbCip/<device>/<TagName>"
```
Pass: `Good` quality; value type matches the PLC tag type.
**Step 5 — Read a UDT member**
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
read -u opc.tcp://localhost:4840 -n "ns=2;s=AbCip/<device>/<UDT>/<MemberName>"
```
Pass: `Good` quality; value matches the live PLC value.
**Step 6 — Write a DINT tag (if in ReadWrite mode)**
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
write -u opc.tcp://localhost:4840 `
-n "ns=2;s=AbCip/<device>/<TagName>" -v 42 -t Int32
```
Verify the new value via a subsequent read or on the PLC HMI.
Pass: read back returns 42 with `Good` quality.
**Step 7 — Subscribe to a tag that changes**
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
subscribe -u opc.tcp://localhost:4840 `
-n "ns=2;s=AbCip/<device>/<ChangingTag>" -i 500
```
Trigger a value change on the PLC. Pass: events received within 2 s.
**Step 8 — Point the Docker-sim integration suite at the live PLC and confirm parity**
```powershell
$env:AB_SERVER_ENDPOINT = "<plc-ip>:44818"
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests `
--filter "AbServerFact"
```
Pass: all 7 integration tests pass against the live PLC.
### Expected results
| Check | Expected |
|-------|----------|
| TCP connect | Success |
| Driver log `Connected` | Present, no error |
| Browse | Node tree mirrors PLC tag list |
| Atomic read | `Good` quality, correct type |
| UDT member read | `Good` quality, correct value |
| Write round-trip | Written value reads back |
| Subscribe | Events delivered on value change |
| Integration tests with live PLC | 7/7 pass |
### Recording the outcome
```
AB CIP live-boot
Date: YYYY-MM-DD
PLC: Allen-Bradley <model> firmware=<version>
IP: <plc-ip>:44818 path=1,0
OtOpcUa SHA: <git sha>
TCP connect: PASS
Driver connected: PASS
Browse: PASS <n> tags visible
Atomic read: PASS
UDT read: PASS
Write round-trip: PASS
Subscribe: PASS
Integration tests: 7/7 PASS
```
---
## 3. Beckhoff TwinCAT — Wire-Live Validation
### Background
The TwinCAT driver (`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT/`) uses the
Beckhoff `TwinCAT.Ads` .NET SDK v6. The integration test suite at
`tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests/`
(`TwinCAT3SmokeTests.cs`) covers 14 `[TwinCATFact]` methods + one 16-case
`[TwinCATTheory]` (30 cases total) against a live ADS runtime. The TCBSD ESXi
VM at `10.100.0.128` (AmsNetId `41.169.163.43.1.1`) is the primary fixture
runtime (project memory `project_tcbsd_fixture.md`) and bypasses the
TwinCAT/Hyper-V conflict on the dev box.
Live-hardware validation extends beyond the TCBSD VM to confirm the driver
works against a production PLC (not just the ESXi test VM) and that the three
defects found during original integration testing do not regress on newer
firmware:
1. Notification cycle time unit (250 ms was being set to ~41 min — fixed).
2. `STRING(N)` / `WSTRING(N)` type mapper (fixed).
3. Bit-indexed BOOL path (fixed).
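The "~41 min" figure in defect 1 is consistent with one particular unit mix-up, worked through below. This is an assumption offered for illustration (the root cause is not stated here): a 250 ms cycle time converted to 100 ns ticks and then interpreted as milliseconds.

```python
TICKS_PER_MS = 10_000  # number of 100 ns ticks in one millisecond

intended_ms = 250
as_ticks = intended_ms * TICKS_PER_MS        # the value expressed in 100 ns ticks

# If that tick count is then read as milliseconds:
misread_seconds = as_ticks / 1000
misread_minutes = misread_seconds / 60

print(as_ticks, misread_seconds, round(misread_minutes, 2))
```

A regression test for this defect therefore only needs to assert that a notification arrives well inside one second for a 250 ms cycle time, since the faulty value is four orders of magnitude off.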
### Preconditions
**TCBSD ESXi fixture (primary — no physical hardware needed)**
| Item | Requirement |
|------|-------------|
| TCBSD VM | Running on ESXi at `10.100.0.128` |
| AMS Net ID | `41.169.163.43.1.1` |
| ADS port | `851` (TwinCAT 3 PLC runtime 1) |
| PLC project | TwinCAT project from `tests/.../TwinCatProject/` loaded and in Run state |
| Network | TCP port 48898 reachable from dev box to `10.100.0.128` |
**Production PLC (for true wire-live validation)**
| Item | Requirement |
|------|-------------|
| TwinCAT hardware | Beckhoff IPC or CX series, TwinCAT 3 (TC3); TC2 is a known gap per fixture doc |
| AMS route | Route configured on TwinCAT device back to the OtOpcUa host |
| PLC state | Run state |
| GVL | At least a `GVL_Fixture.nCounter` DINT and `GVL_Fixture.rSetpoint` REAL present |
### Procedure — TCBSD ESXi fixture
**Step 1 — Verify TCBSD VM is reachable**
```powershell
Test-NetConnection -ComputerName 10.100.0.128 -Port 48898
```
Pass: `TcpTestSucceeded: True`.
**Step 2 — Run the integration test suite**
```powershell
$env:TWINCAT_TARGET_HOST = "10.100.0.128"
$env:TWINCAT_TARGET_NETID = "41.169.163.43.1.1"
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests `
--logger "console;verbosity=normal"
```
Pass: all 30 test cases pass (14 `[TwinCATFact]` + 16-case `[TwinCATTheory]`).
No `[TwinCATFact]` / `[TwinCATTheory]` skips are expected: both env vars are
set, so the runtime probe should succeed.
Key tests to watch:
| Test | Validates |
|------|-----------|
| `Driver_subscribe_receives_native_ADS_notifications_on_counter_changes` | Native ADS notification path (the cycle-time-unit bug regression) |
| `Driver_reads_every_primitive_type_with_correct_mapping` | 16-type theory incl. `STRING(N)` |
| `Driver_reads_bit_indexed_BOOL_from_word` | Bit-indexed BOOL fix regression |
| `Driver_auto_reconnects_after_underlying_client_is_disposed` | Reconnect on ADS client dispose |
| `Driver_routes_reads_per_device_and_isolates_unreachable_peers` | Multi-device isolation |
**Step 3 — OtOpcUa server browse/read via Client CLI**
Start OtOpcUa with a TwinCAT DriverInstance pointing at the TCBSD VM:
```powershell
# appsettings.json DriverInstance: Type=TwinCAT, AmsNetId=41.169.163.43.1.1, AmsPort=851
sc start OtOpcUa
# or dev run
dotnet run --project src/Server/ZB.MOM.WW.OtOpcUa.Server
```
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
browse -u opc.tcp://localhost:4840 -r -d 4
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
read -u opc.tcp://localhost:4840 -n "ns=2;s=TwinCAT/<device>/GVL_Fixture/nCounter"
```
Pass: browse shows the PLC symbol tree; read returns `Good` quality with an
integer value.
### Procedure — Production PLC (optional, for full wire-live signoff)
If a Beckhoff production IPC is available in the lab:
**Step 1** — Configure the AMS route on the TwinCAT device (TwinCAT System
Manager → Routes → Add static route from the TwinCAT device back to the
OtOpcUa server machine).
**Step 2** — Set env vars and run the integration suite against the production
target:
```powershell
$env:TWINCAT_TARGET_HOST = "<production-plc-ip>"
$env:TWINCAT_TARGET_NETID = "<production-ams-net-id>"
$env:TWINCAT_TARGET_PORT = "851"
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests
```
**Step 3** — Subscribe to a counter tag for 30 s to confirm native
notifications arrive:
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
subscribe -u opc.tcp://localhost:4840 `
-n "ns=2;s=TwinCAT/<device>/GVL_Fixture/nCounter" -i 100
```
Pass: events arrive every ~100 ms driven by the PLC's ADS notification, not
by polling.
### Expected results
| Check | TCBSD VM | Production PLC |
|-------|----------|----------------|
| ADS port 48898 reachable | Required | Required |
| Integration tests: all 30 pass | Required | Optional (same 30) |
| Notification cycle-time test passes | Required | Required |
| Server browse shows symbol tree | Required | Optional |
| Read `Good` quality | Required | Optional |
| Native ADS notifications deliver in subscribe | Required | Recommended |
### Known gaps (documented — not blockers for v2 GA)
Per `docs/drivers/TwinCAT-Test-Fixture.md` §"What it does NOT cover":
- Multi-hop AMS routing — single-hop only.
- TC2 (ADS v1) compatibility — TC3 only.
- Notification coalescing under sustained CPU load.
- `Symbol version changed (0x0702)` storm handling under rapid PLC re-downloads.
These are deferred to v3 per `docs/v3/twincat-backlog.md`.
### Recording the outcome
```
TwinCAT wire-live validation
Date: YYYY-MM-DD
Target: TCBSD VM 10.100.0.128 AmsNetId=41.169.163.43.1.1 (and/or production PLC details)
TwinCAT version: <version>
OtOpcUa SHA: <git sha>
ADS port reachable: PASS
Integration tests: 30/30 PASS
notification-cycle-time test: PASS (regression check)
STRING(N) type test: PASS (regression check)
bit-indexed BOOL test: PASS (regression check)
Server browse: PASS
Read Good quality: PASS
Native subscription delivery: PASS <n> events in 30s
```

View File

@@ -0,0 +1,278 @@
# Phase 6.3 Redundancy — Client Interop Matrix and Cutover Validation Plan
> **Scope**: Phase 6.3 redundancy runtime core shipped (PRs #89-90, #98-99,
> #24-peerprobe, Stream C node wiring, Stream D lease wrap). What remains is
> Stream F (task #150): validating that third-party OPC UA clients honour
> our `ServiceLevel` / `ServerUriArray` / `RedundancySupport` signals and
> fail over correctly when the Primary drops. This document defines what is
> automatable as integration tests, what requires two live instances plus a
> real client, and a step-by-step cutover-validation runbook.
>
> **Source of truth**: `docs/Redundancy.md`, `docs/v2/redundancy-interop-playbook.md`,
> `docs/v2/implementation/phase-6-3-redundancy-runtime.md`,
> `scripts/compliance/phase-6-3-compliance.ps1`.
## What is already tested (no live cluster needed)
The following are covered by existing automated tests that run in ordinary
`dotnet test`:
| Area | Test class(es) | What it asserts |
|---|---|---|
| `ServiceLevelCalculator` — 8-state matrix | `ServiceLevelCalculatorTests` | All 10 band values; role × self-health × peer-http × peer-ua × apply × recovery × topology combinations |
| `RecoveryStateManager` — dwell + witness | `RecoveryStateManagerTests` | 60 s dwell default; premature-exit rejection; witness-required gate |
| `ApplyLeaseRegistry` — lease lifecycle | `ApplyLeaseRegistryTests` | Disposal on success / exception / cancellation; watchdog force-close at 10 min |
| `ServerRedundancyNodeWriter` — OPC UA variable binding | `ServerRedundancyNodeWriterTests` | `ServiceLevel` byte push; `RedundancySupport` enum; `ServerUriArray` skip-log when node absent |
| `RedundancyStatePublisher` — orchestration | `RedundancyStatePublisherTests` | Edge-triggered `OnStateChanged`; idempotent dedup |
| `ClusterTopologyLoader` | `ClusterTopologyLoaderTests` | Two-node seed; one-node degenerate; duplicate-URI rejection |
| `DraftValidator.ValidateClusterTopology` | `DraftValidatorTests` (8 cases) | NodeCount/mode pairs; Enabled-count vs NodeCount; multiple-Primary rejection |
Run with:
```powershell
dotnet test ZB.MOM.WW.OtOpcUa.slnx --filter "FullyQualifiedName~Redundancy"
```
Compliance gate (every Phase 6.3 static check):
```powershell
pwsh ./scripts/compliance/phase-6-3-compliance.ps1
```
Pass criteria: exit 0; all `[PASS]` lines green; `[DEFERRED]` lines are
known-deferred surfaces, not failures.
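The pass criteria above can be tallied from captured script output. This is a rough sketch under one assumption: that each result line carries a literal `[PASS]`, `[FAIL]`, or `[DEFERRED]` marker. `Get-ComplianceSummary` is a hypothetical helper, and the script's exit code remains the authoritative signal.
```powershell
# Hypothetical helper: tally [PASS] / [FAIL] / [DEFERRED] markers in captured output.
function Get-ComplianceSummary {
    param([string[]] $Lines = @())
    $count = @{ PASS = 0; FAIL = 0; DEFERRED = 0 }
    foreach ($line in $Lines) {
        if ($line -match '\[(PASS|FAIL|DEFERRED)\]') {
            $count[$Matches[1]]++
        }
    }
    [pscustomobject]@{
        Pass     = $count.PASS
        Fail     = $count.FAIL
        Deferred = $count.DEFERRED
        GateOk   = ($count.FAIL -eq 0)   # DEFERRED lines do not fail the gate
    }
}
```
Usage: `$out = pwsh ./scripts/compliance/phase-6-3-compliance.ps1; Get-ComplianceSummary -Lines $out`, then confirm `$LASTEXITCODE` is 0 as well.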
## What cannot be automated — requires two live instances
The scenarios below require two running `OtOpcUa.Server` processes in the
same `ServerCluster`, a real SQL Server Config DB, and at least one driver
instance with a reachable endpoint (simulator or real PLC).
### Why it cannot be unit/integration-tested in-process
- UaExpert, Kepware KEPServerEX, and AVEVA OI Gateway are closed-source
Windows GUI binaries with no headless CLI interface for the
subscribe/browse flows.
- The AVEVA MXAccess failover leg (`IAlarmSource` reconnect, `$MxAccessClient`
quality transition) involves the Galaxy runtime's own client-redundancy
policy and the COM-layer session model — both live outside this repo.
- Even the automatable sub-set (our own `otopcua-cli` as the client) needs
two distinct listening TCP endpoints; that requires two live processes,
which is out of scope for `dotnet test` integration fixtures.
## Test matrix
### Prerequisites
1. Two `OtOpcUa.Server` processes on separate Windows hosts (or separate
ports on the same host for dev) sharing one Config DB (`ServerCluster`
with `NodeCount=2`, `RedundancyMode=Warm` or `Hot`).
2. Each node registered in `ClusterNode`:
- Node A: `RedundancyRole=Primary`, `ServiceLevelBase=255`,
`ApplicationUri=urn:node-a:OtOpcUa`
- Node B: `RedundancyRole=Secondary`, `ServiceLevelBase=100`,
`ApplicationUri=urn:node-b:OtOpcUa`
3. `PeerHttpProbeLoop` and `PeerUaProbeLoop` HostedServices running on both
nodes (registered via `AddHostedService<PeerHttpProbeLoop>` +
`AddHostedService<PeerUaProbeLoop>` in `Program.cs`).
4. At least one `DriverInstance` in the cluster with a reachable PLC or
simulator (e.g. Modbus sim at `10.100.0.35:5020`).
5. Client machine with UaExpert >= 1.7 installed.
6. Optional second client: Kepware KEPServerEX 6.x QuickClient or AVEVA
OI Gateway 2020R2+.
### Block A — OPC UA protocol signals (UaExpert, no failover yet)
| ID | Scenario | Procedure | Pass criterion | Automatable? |
|----|----------|-----------|----------------|--------------|
| A1 | ServiceLevel published on Primary | Connect UaExpert to Node A. Browse `Server/ServiceLevel` (`i=2267`). | Value = 255 (`AuthoritativePrimary`) | No — requires UaExpert GUI |
| A2 | ServiceLevel published on Backup | Connect UaExpert to Node B. Read same node. | Value = 100 (`AuthoritativeBackup`) | No |
| A3 | ServiceLevel updates when peer drops | Node A connected. Stop Node B (`sc.exe stop OtOpcUa`). Watch `ServiceLevel` on Node A. | Transitions 255 → 230 (`IsolatedPrimary`) within ~6 s (3 × 2 s HTTP probe interval) | No |
| A4 | RedundancySupport | Browse `Server/ServerRedundancy/RedundancySupport` on either node. | Value = `Warm` or `Hot` matching the cluster `RedundancyMode` | No |
| A5 | ServerUriArray | Browse `Server/ServerRedundancy/ServerUriArray` on either node. | Array contains both `ApplicationUri` values; self listed first. Note: requires non-transparent redundancy-type upgrade (currently logs-and-skips — see known limitation A5 below). | No |
| A6 | Mid-apply ServiceLevel dip | Trigger a `sp_PublishGeneration` apply (via Admin UI draft → publish) while watching Node A `ServiceLevel`. | Drops to 200 (`PrimaryMidApply`) for the apply duration; returns to 255 after `RefreshAsync`. | No |
| A7 | Client.CLI reads correct ServiceLevel | `dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://<node-a>:4840 -n "i=2267"` | Prints current byte value matching expected band. | **Yes** — scriptable with the Client CLI |
| A8 | otopcua-cli failover reconnect | `dotnet run ... -- connect -u opc.tcp://<node-a>:4840 -F opc.tcp://<node-b>:4840` — then kill Node A. | CLI session reconnects to Node B within the session keep-alive timeout. | **Yes** — scriptable with the Client CLI |
### Block B — Third-party client failover
| ID | Scenario | Procedure | Pass criterion |
|----|----------|-----------|----------------|
| B1 | UaExpert picks Primary by ServiceLevel | Configure a Redundancy Group in UaExpert with both endpoint URLs. | Client connects to Node A (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill Node A `OtOpcUa` service. | Client session reconnects to Node B within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
| B3 | UaExpert returns when Primary restores | Start Node A. Wait >= 60 s recovery dwell. | `ServiceLevel` on Node A progresses: 180 (`RecoveringPrimary`) → 255 (`AuthoritativePrimary`). UaExpert may or may not switch back (client-policy-dependent; both outcomes accepted). |
| B4 | Kepware QuickClient failover | Repeat B1-B3 with Kepware configured for the same two endpoints. | Same pass criteria; confirms the behaviour is not UaExpert-specific. |
| B5 | AVEVA OI Gateway | Configure OI Gateway OPC DA/UA client object against the cluster. Kill Primary. | OI Gateway data quality recovers within `ReconnectInterval` (default 20 s); no permanent data-loss alert. |
### Block C — Galaxy MXAccess failover
This block requires a running Galaxy and `$MxAccessClient` object (AVEVA
System Platform installed, Galaxy deployed on dev box — see project memory
`project_aveva_platform_installed.md`).
| ID | Scenario | Procedure | Pass criterion |
|----|----------|-----------|----------------|
| C1 | Galaxy binds to Primary on first connect | Bring cluster up. Start a Galaxy `$MxAccessClient` with both node URLs configured. | Galaxy reports `QUALITY = Good`; initial values stream from Node A. |
| C2 | Galaxy redirects on Primary drop | Stop Node A. | Galaxy `QUALITY` briefly goes `Uncertain`, then returns to `Good`; values continue streaming from Node B within MXAccess's `ReconnectInterval` (default 20 s). |
| C3 | Galaxy tolerates mid-apply dip | Trigger generation apply on Node A. | Galaxy remains bound — mid-apply dip (200) is advisory, not a session drop. No quality interruption. |
Note: A negative result on C1-C3 does not necessarily indicate an OtOpcUa
defect. Cross-check with Block A / B first to confirm our `ServiceLevel`
signal is correct before debugging the MXAccess client layer.
## Step-by-step cutover-validation runbook
This is the minimum procedure to satisfy the v2 GA exit criterion:
"Non-transparent redundancy cutover validated with at least one production
client (Ignition 8.3 recommended — see decision #85)."
### Step 1 — Provision the cluster
```powershell
# On the Config DB host, seed or verify cluster rows:
# ServerCluster: Id=<id>, Name="test-cluster", NodeCount=2, RedundancyMode=Warm
# ClusterNode A: NodeId="node-a", ClusterId=<id>, RedundancyRole=Primary,
# ServiceLevelBase=255, ApplicationUri="urn:node-a:OtOpcUa"
# ClusterNode B: NodeId="node-b", ClusterId=<id>, RedundancyRole=Secondary,
# ServiceLevelBase=100, ApplicationUri="urn:node-b:OtOpcUa"
```
Verify uniqueness constraint: no two `ClusterNode` rows share the same
`ApplicationUri` (unique index on `ApplicationUri`).
### Step 2 — Start both server instances
On Node A host:
```powershell
# appsettings.json: Node:NodeId = "node-a"
# Use sc.exe: in Windows PowerShell, bare "sc" is an alias for Set-Content
sc.exe start OtOpcUa
```
On Node B host:
```powershell
# appsettings.json: Node:NodeId = "node-b"
sc.exe start OtOpcUa
```
Wait 10 s for the HostedServices to complete their first probe cycle.
### Step 3 — Verify baseline ServiceLevel via Client CLI
```powershell
# Node A should report 255
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
-u opc.tcp://<node-a-host>:4840 -n "i=2267"
# Node B should report 100
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
-u opc.tcp://<node-b-host>:4840 -n "i=2267"
```
Pass: Node A = 255, Node B = 100.
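When recording results it can help to translate the raw byte into its band name. The lookup below is a sketch that covers only the bands named in this plan (`ServiceLevelCalculator` may define others); `Get-ServiceLevelBand` is a hypothetical helper, not an existing script.
```powershell
# Band values as listed in this plan; other bands may exist in ServiceLevelCalculator.
$ServiceLevelBands = @{
    255 = 'AuthoritativePrimary'
    230 = 'IsolatedPrimary'
    200 = 'PrimaryMidApply'
    180 = 'RecoveringPrimary'
    100 = 'AuthoritativeBackup'
    80  = 'IsolatedBackup'
    1   = 'NoData'
}

# Hypothetical helper: map a ServiceLevel byte to the band name used in this plan.
function Get-ServiceLevelBand {
    param([Parameter(Mandatory)] [byte] $Value)
    $key = [int]$Value
    if ($ServiceLevelBands.ContainsKey($key)) { return $ServiceLevelBands[$key] }
    return "Unknown($Value)"
}
```
Usage: `Get-ServiceLevelBand 255` yields `AuthoritativePrimary`; a value outside the table is reported as `Unknown(<value>)` so that unexpected bands stand out in the run record.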
### Step 4 — Verify ServerUriArray
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
-u opc.tcp://<node-a-host>:4840 -n "i=11314"
```
Pass: array returned contains both `ApplicationUri` strings. If
`ServerUriArray` node returns empty or an error, the non-transparent
redundancy-type upgrade follow-up is still pending (known limitation —
`ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips on the
base `ServerRedundancyState` object type).
### Step 5 — Execute Primary kill + failover (B2 scenario)
1. Connect UaExpert (or Kepware) Redundancy Group to both endpoints.
2. Confirm client is subscribed to at least one variable node.
3. Kill Node A: `sc.exe stop OtOpcUa` on Node A host.
4. Observe:
- Node B `ServiceLevel` should transition: 100 (`AuthoritativeBackup`)
→ 80 (`IsolatedBackup`) within ~6 s.
- Client should reconnect to Node B and resume data-change events.
5. Record: time from kill to client reconnect; whether data gaps occurred.
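The "within ~6 s" observation in item 4 can be timed with a small polling loop. This is a sketch only: `Wait-ForValue` is a hypothetical helper, and the probe script block is an assumption. It must return the current `ServiceLevel` as a single value, for example by wrapping the Client CLI `read` command and parsing its output, which is left to the operator because the CLI output format is not specified here.
```powershell
# Hypothetical helper: poll a probe until it returns the expected value or times out.
function Wait-ForValue {
    param(
        [Parameter(Mandatory)] [scriptblock] $Probe,
        [Parameter(Mandatory)] $Expected,
        [int] $TimeoutSeconds = 30,
        [int] $IntervalMs = 500
    )
    $actual = $null
    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    while ($sw.Elapsed.TotalSeconds -lt $TimeoutSeconds) {
        $actual = & $Probe
        if ($actual -eq $Expected) {
            return [pscustomobject]@{
                Reached = $true
                Seconds = [math]::Round($sw.Elapsed.TotalSeconds, 2)
                Last    = $actual
            }
        }
        Start-Sleep -Milliseconds $IntervalMs
    }
    # Timed out: report the last value seen so the run record shows where it stalled.
    [pscustomobject]@{
        Reached = $false
        Seconds = [math]::Round($sw.Elapsed.TotalSeconds, 2)
        Last    = $actual
    }
}
```
Start the call immediately after stopping Node A, with `-Expected 80`; the `Seconds` property is accurate only to within one polling interval.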
### Step 6 — Verify Primary recovery (B3 scenario)
1. Restart Node A: `sc.exe start OtOpcUa` on Node A host.
2. Observe Node A `ServiceLevel` progression:
- ~0 s: 1 (`NoData`) briefly while HostedServices start.
- Startup: 180 (`RecoveringPrimary`) — recovery dwell gate active.
- After >= 60 s dwell + one positive publish witness: 255 (`AuthoritativePrimary`).
3. Observe Node B:
- Returns to 100 (`AuthoritativeBackup`) once it sees Node A peer probe succeed.
4. Record dwell duration and whether the client (UaExpert/Kepware) switches back.
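The dwell duration to record can be derived from timestamped `ServiceLevel` samples. The sketch below assumes the operator has collected objects with a `Time` (datetime) and a `Level` (integer) property. `Get-ObservedDwellSeconds` is a hypothetical helper, and because the samples are polled, the result can differ from the true dwell by up to one polling interval.
```powershell
# Hypothetical helper: seconds between the first 180 sample and the first later 255 sample.
function Get-ObservedDwellSeconds {
    param([Parameter(Mandatory)] [object[]] $Samples)   # objects with Time and Level
    $ordered  = @($Samples | Sort-Object Time)
    $first180 = $ordered | Where-Object { $_.Level -eq 180 } | Select-Object -First 1
    if (-not $first180) { return $null }                 # RecoveringPrimary never observed
    $first255 = $ordered |
        Where-Object { $_.Level -eq 255 -and $_.Time -gt $first180.Time } |
        Select-Object -First 1
    if (-not $first255) { return $null }                 # AuthoritativePrimary not yet restored
    ($first255.Time - $first180.Time).TotalSeconds
}
```
A `$null` result means the progression was incomplete in the captured window, which is itself worth noting in the run record.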
### Step 7 — Execute mid-apply dip (A6 scenario)
1. Via Admin UI, create a trivial draft change and publish.
2. Watch Node A `ServiceLevel` during apply.
3. Expected: drops to 200 (`PrimaryMidApply`) for the apply duration
(typically seconds); returns to 255 when `GenerationRefreshHostedService`
releases the lease.
### Step 8 — Record results
Copy the following block into a tracking doc:
```
Run date: YYYY-MM-DD
Release SHA: <git sha>
Cluster: <cluster-id> Primary: node-a Backup: node-b
Config DB: 10.100.0.35,14330
A1: [PASS/FAIL] evidence: <screenshot or CLI output>
A2: [PASS/FAIL]
A3: [PASS/FAIL] time-to-IsolatedPrimary: <N>s
A4: [PASS/FAIL]
A5: [PASS/FAIL/DEFERRED - ServerUriArray upgrade pending]
A6: [PASS/FAIL] mid-apply duration: <N>s
A7: [PASS/FAIL] CLI output attached
A8: [PASS/FAIL] CLI reconnect observed
B1: [PASS/FAIL]
B2: [PASS/FAIL] reconnect time: <N>s
B3: [PASS/FAIL] dwell observed: <N>s
B4: [PASS/FAIL] (Kepware)
B5: [PASS/FAIL] (OI Gateway — if available)
C1: [PASS/FAIL/SKIP - Galaxy not available]
C2: [PASS/FAIL/SKIP]
C3: [PASS/FAIL/SKIP]
```
One pass of every non-SKIP row is the v2 GA acceptance criterion.
## Known limitations
### A5 — ServerUriArray node not yet writable
The OPC UA .NET Standard SDK's default `Server.ServerRedundancy` object is the
base `ServerRedundancyState`, which has no `ServerUriArray` child node.
`ServerRedundancyNodeWriter.ApplyServerUriArray` currently logs a warning and
skips the write. Until the non-transparent redundancy-type upgrade follow-up
ships, operators obtain the equivalent server list by reading the `ClusterNode`
rows directly.
### Recovery dwell is 60 s by default
`RecoveryStateManager.DwellTime` defaults to `TimeSpan.FromSeconds(60)` in
`Program.cs`. Step 6 of the runbook will block for at least 60 s waiting for
Node A to return to `AuthoritativePrimary`. This is intentional per
decision #154 (thrash prevention) — do not lower it for the test run.
### IsolatedBackup (80) does not auto-promote
Per decision #154, the Backup at band 80 does not self-elevate. If the operator
needs authoritative service from Node B while Node A is down, they must write
`RedundancyRole=Primary` on the `ClusterNode` row for Node B and publish a
draft generation. The Admin UI `RedundancyTab` exposes this flow.
## Dependency on existing tests
The cutover runbook validates the end-to-end wire path. The math and edge cases
are already locked by the unit/integration tests enumerated in the first section.
A failing runbook step that contradicts a passing unit test indicates a
deployment configuration error or an SDK version mismatch — not a logic bug.
Check `PeerHttpProbeLoop` logs first (look for `PeerProbe` Serilog events).

View File

@@ -0,0 +1,307 @@
# v2 GA Lab Gates Plan
> **Canonical tracker**: `docs/v2/v2-release-readiness.md` — all code-path
> release blockers are closed as of 2026-04-24. This document maps the
> remaining exit-criteria from that tracker to concrete commands, automation
> boundaries, operator procedures, and pass criteria.
>
> **Status**: RELEASE-READY (code-path). Manual/lab gates remain open.
## The gate list
From `docs/v2/v2-release-readiness.md` §"Release-readiness exit criteria":
| # | Gate | Kind | Automatable here |
|---|------|------|-----------------|
| G1 | All four Phase 6.N compliance scripts exit 0 | Script | Yes — run on this box |
| G2 | `dotnet test ZB.MOM.WW.OtOpcUa.slnx` passes with <= 1 known-flake failure | Script | Yes — run on this box |
| G3 | Release blockers closed | Audit | Already closed (code-path) |
| G4 | Phase 5 driver complement shipped | Audit | Already closed |
| G5 | Production deployment checklist signed off by Fleet Admin | Operator | No — separate doc, human signoff |
| G6 | At least one end-to-end integration run against live Galaxy succeeds | Dev rig | No — requires AVEVA platform |
| G7 | FOCAS live-CNC wire-level smoke (#54) passes against a real FANUC control | Lab hardware | No — requires FANUC CNC |
| G8 | OPC UA CTT / UA Compliance Test Tool passes against the live endpoint | Operator tool | No — requires CTT binary + live endpoint |
| G9 | Non-transparent redundancy cutover validated with >= 1 production client | Lab | No — see `docs/plans/phase-6-3-redundancy-interop-plan.md` |
---
## G1 — Phase 6 compliance scripts
### Command
```powershell
pwsh ./scripts/compliance/phase-6-all.ps1
```
This meta-runner at `scripts/compliance/phase-6-all.ps1` invokes each
sub-script in a separate `powershell.exe` process to isolate exit codes:
| Sub-script | Phase | What it checks |
|-----------|-------|---------------|
| `phase-6-1-compliance.ps1` | 6.1 Resilience & Observability | Polly resilience classes, health endpoints, LiteDB sealed cache, observability sinks |
| `phase-6-2-compliance.ps1` | 6.2 Authorization runtime | `AuthorizationGate`, `TriePermissionEvaluator`, `NodeScopeResolver`, dispatch wiring in `DriverNodeManager` |
| `phase-6-3-compliance.ps1` | 6.3 Redundancy runtime | `ServiceLevelCalculator` 8-state band values, `RecoveryStateManager`, `ApplyLeaseRegistry`, `ServerRedundancyNodeWriter`; also invokes `dotnet test` with a baseline of 1097 |
| `phase-6-4-compliance.ps1` | 6.4 Admin UI completion | Data-layer types, Identification folder, deferred Blazor items marked `[DEFERRED]` |
### Pass criterion
```
Phase 6 aggregate: PASS
```
Exit code 0. Any `[FAIL]` line is a blocker. `[DEFERRED]` lines are expected
for the known-deferred surfaces listed in the implementation docs; they do not
fail the run.
### Prerequisites
- SQL Server `10.100.0.35,14330` reachable (Config DB tests use it).
- `dotnet` SDK on PATH (`.NET 10`).
- Run from repo root.
---
## G2 — Full solution test suite
### Command
```powershell
dotnet test ZB.MOM.WW.OtOpcUa.slnx --logger "console;verbosity=minimal"
```
To include the integration suites that need their fixtures, bring the fixture up before the run:
```powershell
# bring modbus fixture up first
lmxopcua-fix up modbus standard
dotnet test ZB.MOM.WW.OtOpcUa.slnx --logger "console;verbosity=minimal"
```
### Pass criterion
- Passed count >= 1159 (2026-04-19 baseline after Phase 5 driver complement).
- Failed count <= 1 (the pre-existing
`SubscribeCommandTests.Execute_PrintsSubscriptionMessage` flake in
`Client.CLI` is the only tolerated failure).
- No new `[FAILED]` tests relative to the baseline.
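The three criteria above reduce to a small check once the totals are known. This is a minimal sketch: `Test-G2Gate` is a hypothetical helper, and the passed count and the list of failed test names are assumed to be copied from the `dotnet test` summary by hand, since no parsing of the console logger output is attempted here.
```powershell
# Hypothetical helper: apply the G2 thresholds to totals taken from a test run.
function Test-G2Gate {
    param(
        [Parameter(Mandatory)] [int] $Passed,
        [string[]] $FailedTests = @(),
        [int] $Baseline = 1159          # 2026-04-19 baseline from this plan
    )
    $knownFlake = 'SubscribeCommandTests.Execute_PrintsSubscriptionMessage'
    # Anything other than the known flake counts as a new failure.
    $unexpected = @($FailedTests | Where-Object { $_ -notlike "*$knownFlake" })
    $passedOk = ($Passed -ge $Baseline)
    $failedOk = ($FailedTests.Count -le 1)
    [pscustomobject]@{
        PassedOk   = $passedOk
        FailedOk   = $failedOk
        Unexpected = $unexpected
        GateOk     = ($passedOk -and $failedOk -and ($unexpected.Count -eq 0))
    }
}
```
Usage: `Test-G2Gate -Passed 1160 -FailedTests @()` reports `GateOk = True`; any name in `Unexpected` is a new failure relative to the baseline.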
### Known flake
`ZB.MOM.WW.OtOpcUa.Client.CLI.Tests::SubscribeCommandTests.Execute_PrintsSubscriptionMessage`
is a timing-sensitive subscribe-then-cancel test. Rerun the specific project
if it appears:
```powershell
# Run the flaky test three times in a row
1..3 | ForEach-Object {
    dotnet test tests/Client/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests `
      --filter "FullyQualifiedName~SubscribeCommandTests.Execute_PrintsSubscriptionMessage"
}
```
If it fails all three runs, investigate; otherwise treat as flake.
### Docker fixtures needed for integration suites
| Driver | Command | Endpoint used |
|--------|---------|---------------|
| Modbus | `lmxopcua-fix up modbus standard` | `10.100.0.35:5020` |
| AB CIP | `lmxopcua-fix up abcip controllogix` | `10.100.0.35:44818` |
| S7 | `lmxopcua-fix up s7 s7_1500` | `10.100.0.35:1102` |
| OPC UA Client | `lmxopcua-fix up opcuaclient` | `opc.tcp://10.100.0.35:50000` |
| FOCAS | `lmxopcua-fix up focas` (mock server) | `10.100.0.35:8193` |
TwinCAT integration tests require the TCBSD ESXi VM at `10.100.0.128`
(AmsNetId `41.169.163.43.1.1`). Set the env vars before running:
```powershell
$env:TWINCAT_TARGET_HOST = "10.100.0.128"
$env:TWINCAT_TARGET_NETID = "41.169.163.43.1.1"
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests
```
Galaxy integration tests run against the live mxaccessgw on the dev box
(gate G6).
---
## G3 — Release blockers closed (audit, already satisfied)
All three code-path release blockers are closed per `v2-release-readiness.md`:
- Authorization dispatch wiring (task #143, PR #94) — CLOSED.
- Config fallback Phase 6.1 Stream D (task #136, PR #96) — CLOSED.
- Redundancy Phase 6.3 Streams A/C core (tasks #145/#147, PRs #98-99) — CLOSED.
No action required. Record the PR numbers in the release notes.
---
## G4 — Driver complement (audit, already satisfied)
All eight drivers shipped:
Galaxy, Modbus (+ DL205/S7/MELSEC profiles), S7 native, OPC UA Client, AB CIP,
AB Legacy, TwinCAT ADS, FOCAS (managed wire client — Tier-C isolation retired,
FOCAS is now Tier A in-process via `WireFocasClient`).
No action required.
---
## G5 — Production deployment checklist (operator action)
The deployment checklist is a separate document covering:
- Windows service install (`scripts/install/Install-Services.ps1`)
- Config DB migration (`scripts/db/Apply-Migrations.ps1`)
- Certificate provisioning and trust
- LDAP / GLAuth configuration for production AD target
- mxaccessgw API key provisioning (`apikey create-key` in the sibling repo)
- Service account permissions
- Prometheus / OpenTelemetry export configuration
- Firewall rules (port 4840 OPC UA, port 5120 gRPC to mxaccessgw,
Admin port 5000/5001)
**Sign-off party**: Fleet Admin (operator). Not automatable.
Record sign-off as a comment on the v2 release issue.
---
## G6 — Live Galaxy end-to-end integration run
**Requires**: AVEVA System Platform installed on dev box (confirmed available
per project memory `project_aveva_platform_installed.md`); mxaccessgw running
with a provisioned API key; at least one Galaxy object deployed.
### Procedure
1. Start mxaccessgw:
```powershell
# in sibling repo C:\Users\dohertj2\Desktop\mxaccessgw\
dotnet run --project src/MxGateway.Server -- --apikey-path .local/api-key.txt
```
2. Start OtOpcUa server with Galaxy driver instance configured:
```powershell
# Use sc.exe: in Windows PowerShell, bare "sc" is an alias for Set-Content
sc.exe start OtOpcUa
```
3. Browse via Client CLI:
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
browse -u opc.tcp://localhost:4840 -r -d 3
```
4. Read a known Galaxy tag (e.g. a deployed `$UserDefined` object attribute):
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
read -u opc.tcp://localhost:4840 -n "ns=2;s=<tag_name.AttributeName>"
```
5. Subscribe and verify live updates:
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
subscribe -u opc.tcp://localhost:4840 -n "ns=2;s=<tag_name.AttributeName>" -i 1000
```
### Pass criterion
- Browse returns a non-empty node tree mirroring the Galaxy hierarchy.
- Read returns `Good` quality with a non-null value.
- Subscribe receives at least one data-change notification within 5 s
(or within the configured publishing interval).
- No `BadNoCommunication` or `BadTimeout` errors in the server log.
Record: Galaxy version, deployed object count, OtOpcUa git SHA.
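The "no `BadNoCommunication` or `BadTimeout`" criterion can be checked by scanning the server log after the run. The sketch below is an assumption-laden example: `Find-LogErrors` is a hypothetical helper, and the log lines are assumed to contain the status-code names as plain text.
```powershell
# Hypothetical helper: return log lines that mention any of the given status codes.
function Find-LogErrors {
    param(
        [string[]] $Lines = @(),
        [string[]] $Patterns = @('BadNoCommunication', 'BadTimeout')
    )
    # -SimpleMatch treats each pattern as literal text rather than a regex.
    $Lines | Select-String -Pattern $Patterns -SimpleMatch | ForEach-Object { $_.Line }
}
```
Usage: `Find-LogErrors -Lines (Get-Content <server-log-path>)`, where the log path depends on the deployment. An empty result satisfies the criterion.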
---
## G7 — FOCAS live-CNC smoke (task #54)
**Requires**: real FANUC CNC with Ethernet option, accessible on TCP port 8193
from the dev box; CNC series known (e.g. 0i-F, 30i-B).
See `docs/plans/live-hardware-validation-runbooks.md` §FOCAS for the full
runbook.
### Pass criterion
- `WireFocasClient` opens a FOCAS2 session (`cnc_allclibhndl3` succeeds).
- Identity nodes (`Identity/SeriesNumber`, `Identity/MaxAxes`) return non-null
values matching the physical control panel display.
- At least one axis position (`Axes/X/AbsolutePosition` or similar) returns
`Good` quality with a plausible double value.
- Subscribe on a polled tag delivers at least three updates within 5 s.
- No `EW_SOCKET` (-16) or `EW_HANDLE` (-8) errors in the server log during a
2-minute soak.
Record: CNC series, firmware version, test date, OtOpcUa git SHA.
---
## G8 — OPC UA Conformance Test Tool (CTT) pass
**Requires**: OPC Foundation OPC UA Compliance Test Tool (CTT) or the
open-source UA Compliance Test Tool installed on the client machine;
live OtOpcUa server endpoint.
### Recommended minimum profile set
- `Attribute Read`
- `Attribute Write`
- `Browse`
- `Subscription` (DataChange)
- `Server-side monitoring`
- `Security — None profile` (if server configured with `Security:Profiles=[None]`)
### Procedure
1. Launch CTT. Add server endpoint: `opc.tcp://localhost:4840`.
2. Run the profile set above.
3. Capture the CTT report HTML/XML.
### Pass criterion
All mandatory test cases in each profile: **PASS** or **NOT APPLICABLE**.
Zero mandatory failures. Advisory failures may be documented with rationale
(e.g. optional capability not implemented).
Record: CTT version, profile set, OtOpcUa git SHA, report artifact.
---
## G9 — Non-transparent redundancy cutover with production client
See `docs/plans/phase-6-3-redundancy-interop-plan.md` for the full runbook.
**Minimum acceptable result**: one complete pass of the A-block (UaExpert
OPC UA signal verification) plus scenario B2 (UaExpert failover on Primary
kill).
Ignition 8.3 is the recommended production client per decision #85. If
Ignition is not available on the lab machine, UaExpert is accepted for v2 GA.
Record: client name + version, OtOpcUa git SHA, test date.
---
## Gate summary table
| Gate | Command / Procedure | Pass criterion | Owner |
|------|---------------------|----------------|-------|
| G1 | `pwsh ./scripts/compliance/phase-6-all.ps1` | Exit 0, no `[FAIL]` | Dev |
| G2 | `dotnet test ZB.MOM.WW.OtOpcUa.slnx` | >= 1159 passing, <= 1 failure | Dev |
| G3 | Audit PR list in release-readiness.md | All blockers show CLOSED | Dev |
| G4 | Audit driver table | All 8 drivers listed as shipped | Dev |
| G5 | Run deployment checklist doc | All items checked; Fleet Admin signs off | Fleet Admin |
| G6 | Browse/read/subscribe against live Galaxy | Good quality, non-empty tree | Dev (dev box) |
| G7 | FOCAS CNC smoke — see live-hardware runbook | Session open, Good quality reads | Dev + lab hardware |
| G8 | CTT profile run against live endpoint | Zero mandatory failures | Dev + CTT tool |
| G9 | Redundancy cutover runbook | A-block + B2 pass with >= 1 client | Dev + two instances |