diff --git a/docs/plans/alarms-worker-wiring-plan.md b/docs/plans/alarms-worker-wiring-plan.md new file mode 100644 index 0000000..de0632a --- /dev/null +++ b/docs/plans/alarms-worker-wiring-plan.md @@ -0,0 +1,340 @@ +# Alarms Worker Wiring Plan + +> **Context**: The alarms-over-gateway epic shipped 19 PRs across the +> `lmxopcua` and `mxaccessgw` repos (merged 2026-04-30). Contracts are live; +> the sub-attribute fallback path keeps Galaxy alarms functional today. Four +> items remain as inert scaffolds gated on a dev-rig finding. This document is +> the focused implementation plan for those four items only. +> +> **Do not duplicate `docs/plans/alarms-over-gateway.md`** — that document is +> the full historical record of all 19 PRs. This document covers only what is +> still to be done and exactly what blocks each item. +> +> **This work lives in the mxaccessgw sibling repo** at +> `C:\Users\dohertj2\Desktop\mxaccessgw\` — not in this (lmxopcua) repo, +> except where lmxopcua changes are noted explicitly. + +--- + +## Dev-rig finding that blocks everything (2026-04-30) + +During PR A.2 work the following was discovered on the dev box: + +> The MXAccess COM Toolkit at +> `C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll` +> exposes **no alarm-event family** — only `OnDataChange`, `OnWriteComplete`, +> `OperationComplete`, `OnBufferedDataChange`. +> +> AVEVA's `aaAlarmManagedClient` / `ArchestrAAlarmsAndEvents.SDK` assemblies +> are **x64-only** and incompatible with the worker's x86 net48 bitness. + +The architectural decision required before any of A.2, A.3/A.4, C.1 can ship: + +> **Either** accept the value-driven sub-attribute path as the production +> architecture (operator-comment fidelity is the only v1 regression), **or** +> add an x64 alarm-helper sub-process alongside the x86 worker. + +Resolution drives the implementation shape of every item below. 
The plan +presented here assumes the x64 alarm-helper sub-process route (the higher +parity option), but notes the sub-attribute-only exit at each step. + +--- + +## Discovered AVEVA API surface + +Before implementing, verify the following against the AVEVA SDK actually +installed on the dev box and in the mxaccessgw worker's deployment folder: + +| Assembly | Bitness | Likely location | Key types | +|----------|---------|-----------------|-----------| +| `ArchestrA.MXAccess.dll` | x86 | `C:\Program Files (x86)\ArchestrA\Framework\Bin\` | `IMxAlarmEventSink`, `MxAlarmEventArgs` — **confirm exists at actual version** | +| `aaAlarmManagedClient.dll` | x64 | `C:\Program Files\ArchestrA\Framework\Bin\` | `AlarmClient`, `IAlarmConsumer`, `AlarmEventArgs` | +| `ArchestrAAlarmsAndEvents.SDK.dll` | x64 | Same or Historian SDK folder | `AlarmHistorianWriter`, `GetAlarmExtendedRec` | + +The AVEVA MXAccess Toolkit reference in the mxaccessgw repo (`gateway.md`) is +the canonical API doc for the gateway worker's side. 
The alarm-client API is +documented separately; verify the following call shapes during PR A.2: + +| Operation | Likely API | Notes | +|-----------|-----------|-------| +| Subscribe to alarm events | `AlarmClient.RegisterConsumer(IAlarmConsumer)` + `AlarmClient.Subscribe(filterSpec)` | Confirm exact method signatures against the SDK version on the dev box | +| Receive alarm event | `IAlarmConsumer.OnAlarmEvent(AlarmEventArgs)` callback | Field set: alarm name, source, type, transition kind, severity, timestamps, operator fields | +| Acknowledge alarm | `AlarmClient.AcknowledgeAlarm(alarmRef, comment, userPrincipal)` or equivalent | Confirm whether this is synchronous or returns a status | +| Query active alarms | `AlarmClient.GetAlarmExtendedRec(filter)` or `GetActiveAlarms()` | Returns current active set for ConditionRefresh | +| Get statistics | `AlarmClient.GetStatistics()` | Optional — useful for worker health checks | + +Record the exact method signatures against the installed SDK before starting +A.2 — the proto field set in `OnAlarmTransitionEvent` must match the SDK's +actual payload. + +--- + +## Dependency order + +``` +A.2 (worker: AlarmClient subscription) + └─► A.3 (gateway: dispatch OnAlarmTransition + AcknowledgeAlarm RPC handler) + └─► A.4 (gateway: QueryActiveAlarms RPC handler) + └─► lmxopcua B.2 (GalaxyDriver IAlarmSource live) + └─► C.1 (sidecar: AahClientManagedAlarmEventWriter live) + └─► D.1 (smoke artifact captured) +``` + +A.2 is the single blocking item. All subsequent items unblock serially once +A.2 delivers alarm events through the channel. 
+ +--- + +## Item A.2 — Worker: subscribe to MxAccess alarm event source + +**Repo**: `mxaccessgw` — `src\MxGateway.Worker\` (net48, x86) + +**What it needs**: + +The worker must subscribe to AVEVA's alarm events and fan them into the same +bounded channel the data-change pump uses, translating each MxAccess alarm +event into a `WorkerEvent` proto with family `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` +(defined in PR A.1, already merged). + +**Architectural choice determines the implementation path**: + +**Option X1 — aaAlarmManagedClient in a new x64 alarm-helper process** + +Add a second worker-mode sub-process (`MxGateway.AlarmWorker`, net8.0 x64) +alongside the existing x86 worker. The AlarmWorker: + +1. Loads `aaAlarmManagedClient.dll` (x64) on startup. +2. Calls `AlarmClient.RegisterConsumer` with a `WorkerAlarmConsumer` sink. +3. Calls `AlarmClient.Subscribe` with a session-level filter (all alarms for + the session's Galaxy scope). +4. Translates each `IAlarmConsumer.OnAlarmEvent` callback into a protobuf + `WorkerEvent` (family `ON_ALARM_TRANSITION`) and writes it to an IPC + channel readable by the gateway server-side multiplexer. +5. Handles session lifecycle: re-subscribes after reconnect; unsubscribes on + session close. + +IPC from AlarmWorker to gateway: simplest option is a named pipe or an +in-process queue if the AlarmWorker is hosted in the same gateway process +space as a separate `IHostedService`. + +**Option X2 — Accept sub-attribute fallback as production (no A.2 work)** + +If the architectural decision is to accept the sub-attribute path as permanent: + +- `MxAccessAlarmEventSink.Attach()` in the worker remains a no-op (as + currently coded with the architectural comment). +- The `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` proto family stays defined but + the gateway never emits events on it. +- lmxopcua's `GalaxyDriver` does not implement `IAlarmSource` for the + native path; the value-driven sub-attribute path remains the production + path. 
+- The only regression vs. v1 is operator-comment fidelity on Galaxy alarms. +- C.1 is still needed if scripted-alarm historian write-back is required. + +**What blocks it**: the architectural decision above. Once made, A.2 becomes +a 2–3 day implementation task (sub-process plumbing + proto translation + +unit tests for the consumer sink cancellation behaviour). + +**Tests to write (when A.2 proceeds)**: + +- `WorkerAlarmConsumerTests` — fake `IAlarmConsumer` source emits canned + transitions; assert each produces the correct `WorkerEvent` body shape. +- Cancellation/session-close test — closing the session unsubscribes from + the AlarmClient cleanly (no leaked `IAlarmConsumer` reference if the + worker is recycled mid-session). +- Re-subscribe-after-reconnect test — `ReconnectSupervisor` triggers a + reconnect; assert the alarm consumer re-attaches to the new session. + +--- + +## Item A.3 / A.4 — Gateway: dispatch and RPC handlers + +**Repo**: `mxaccessgw` — `src\MxGateway.Server\` + +**Depends on**: A.2 delivering `WorkerEvent` bodies with family +`MX_EVENT_FAMILY_ON_ALARM_TRANSITION`. + +**What it needs**: + +### A.3 — Dispatch + AcknowledgeAlarm + +1. The session-level event multiplexer (`Sessions\SessionEventStream.cs` or + equivalent — verify name in the mxaccessgw repo) must recognise the new + `WorkerEvent` body and forward it as an `MxEvent` with family + `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` to every `StreamEvents` subscriber + for that session. + +2. New RPC handler `AcknowledgeAlarm` builds an `AlarmAcknowledgeCommand` + worker command and forwards it to the alarm-helper process (Option X1) or + the worker's MxAccess session (Option X2 if MxAccess exposes ack). Maps + the reply status to `AcknowledgeAlarmReply.MxStatusProxy`. + +3. Authorization: new API scope `invoke:alarm-ack` on the API key. Keys + without it receive `PERMISSION_DENIED`. Follow the existing scope-check + pattern used by `invoke:write`. + +### A.4 — QueryActiveAlarms + +1. 
New RPC handler `QueryActiveAlarms` calls `AlarmClient.GetAlarmExtendedRec` + (or `GetActiveAlarms` — confirm the method name during implementation) + on the alarm-helper process, batches results into `ActiveAlarmSnapshot` + proto messages, and streams them back to the caller. + +2. New API scope `invoke:alarm-query` (separate from ack so read-only clients + can refresh without ack rights). + +**What blocks A.3/A.4**: A.2 must deliver `WorkerEvent` bodies on the channel. +A.3/A.4 are pure dispatch wiring once the events arrive. + +**Tests to write**: + +- A.3 dispatch test — fake worker emits an `AlarmTransition` event; assert + the gateway forwards it on the `StreamEvents` channel of every subscribed + session (mirrors existing `OnDataChange` dispatch tests). +- A.3 AcknowledgeAlarm auth test — existing key without `invoke:alarm-ack` + scope returns `PERMISSION_DENIED`. +- A.4 pagination test — synthetic active-alarm set of 0 / 1 / 100 entries; + assert each streams back as separate `ActiveAlarmSnapshot` messages. +- Integration (parity rig — requires dev box with AVEVA platform): + trigger a real Galaxy alarm, call `QueryActiveAlarms`, assert the alarm + appears in the stream; call `AcknowledgeAlarm`, assert the alarm transitions + to `ActiveAcked` and an `Acknowledge` transition event appears on + `StreamEvents`. + +--- + +## Item C.1 — Historian sidecar: AahClientManagedAlarmEventWriter + +**Repo**: `lmxopcua` — `src\Drivers\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\` + +**Depends on**: Architectural decision (the sidecar uses `aahClientManaged` +x64, which is not bitness-constrained like the worker). C.1 can be unblocked +independently of A.2 if the goal is to wire up the scripted-alarm historian +path. + +**Current state**: + +`SdkAlarmHistorianWriteBackend` in `src\MxGateway.Worker\MxAccess\` is a +placeholder returning `RetryPlease`. 
The lmxopcua sidecar's `WriteAlarmEvents` +IPC slot is defined in `Ipc\Contracts.cs` but `Program.cs` constructs +`HistorianFrameHandler` without an `alarmWriter` (line 57 per the alarms plan). +The `IAlarmEventWriter` interface exists; only the production implementation +and the consumer wiring are missing. + +**What it needs**: + +1. New `AahClientManagedAlarmEventWriter.cs` implementing `IAlarmEventWriter` + (defined in `Ipc\HistorianFrameHandler.cs`). Calls `aahClientManaged`'s + alarm-event write API — same path v1's `GalaxyHistorianWriter` used. + Uses `HistorianClusterEndpointPicker` for multi-node routing. + Maps `MxStatus` write outcomes to `HistorianWriteOutcome` enum + (Ack / PermanentFail / RetryPlease). + +2. `Program.cs` — build `AahClientManagedAlarmEventWriter` next to the + existing `BuildHistorian()` call; pass it to `HistorianFrameHandler`. + Gate behind `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED` env var (default `true` + when `OTOPCUA_HISTORIAN_ENABLED=true`). + +3. `Install-Services.ps1` — add the new env var to the install-time block. + +**What blocks C.1**: access to the `aahClientManaged` SDK on the dev box +(confirmed available per `project_aveva_platform_installed.md` — AVEVA +Historian SDK is present). C.1 can proceed without A.2 since the sidecar's +`aahClientManaged` is x64 and does not share the worker's x86 bitness +constraint. + +**Tests to write**: + +- Outcome-mapping table: every `MxStatus` on alarm-write → expected + `HistorianWriteOutcome`. +- Batch test: 1 / 100 / 1000 events through a fake `aahClientManaged` + writer; assert per-row outcome list parallel to input order. +- Cluster failover: primary Historian node returns `BadCommunicationError`; + picker rotates to secondary; eventual success. +- `Program.cs` seam: assert handler constructed with alarm writer when env + var enabled; without it when disabled. 
+- Live integration (parity rig): write a synthetic alarm event through the + IPC; query it back via `ReadEvents`; assert round-trip fidelity. + +--- + +## Item D.1 — Smoke artifact + +**Repo**: `lmxopcua` (deployment refresh) + `mxaccessgw` (rig verification) + +**Depends on**: A.2, A.3, A.4, and C.1 all passing on the dev rig with a live +Galaxy and live Historian. + +**Current state**: The deployment script `Refresh-Services.ps1` (task D.1) has +shipped as PR #417 (merged 2026-04-30). What was NOT captured at that time was +a smoke artifact — a log snippet or test output confirming that: + +1. An alarm transition event from a live Galaxy alarm reaches lmxopcua's + `AlarmConditionService` via the new `IAlarmSource` path (not the fallback). +2. A scripted-alarm historian write-back reaches AVEVA Historian via the + sidecar `IAlarmEventWriter`. + +**What it needs**: + +Once A.2, A.3, C.1 are wired on the parity rig: + +1. Deploy the updated mxaccessgw (with A.2 / A.3 / A.4 changes). +2. Deploy the updated sidecar (with C.1 changes). +3. Run `Refresh-Services.ps1` to confirm clean service restarts. +4. Trigger a Galaxy alarm (e.g. set an AnalogLimitAlarm attribute out of + range in Galaxy IDE). +5. Observe the lmxopcua OPC UA alarm surface via the Client CLI: + + ```powershell + dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + alarms -u opc.tcp://localhost:4840 --subscribe + ``` + + Pass: the alarm condition appears on the OPC UA A&E surface within + 2 × publishing interval. + +6. Trigger a scripted alarm via the lmxopcua `ScriptedAlarmEngine` + (or an OPC UA method call if one is wired). +7. Confirm in the AVEVA Historian that the scripted alarm event is stored + (query via the Historian client or HistorianWatch tool). + +8. 
Capture log snippets: + - mxaccessgw log: `[INF] AlarmTransition dispatched sessionId=<> alarmRef=<>` + - lmxopcua log: `[INF] AlarmConditionService: IAlarmSource event alarmRef=<> origin=Driver` + - Sidecar log: `[INF] AahClientManagedAlarmEventWriter: Wrote alarm events` + +9. Commit the log snippets as `docs/plans/alarms-d1-smoke-artifact.md` + (a new doc, not this one). + +**What blocks D.1**: all of A.2, A.3, C.1, plus the operator decision on the +x64 alarm-helper architecture (or explicit acceptance of the sub-attribute +fallback as production). + +--- + +## Summary of blocks + +| Item | Blocked by | Estimated effort once unblocked | +|------|-----------|--------------------------------| +| A.2 | Architectural decision (x64 alarm-helper vs. sub-attribute fallback as production) | 2–3 days implementation; 1 day tests | +| A.3 | A.2 delivering WorkerEvent bodies | 1–2 days | +| A.4 | A.2 (active-alarm query needs AlarmClient session) | 1 day | +| C.1 | aahClientManaged SDK access (available on dev box); NOT blocked by A.2 | 1–2 days | +| D.1 | A.2 + A.3 + C.1 all passing on parity rig | 0.5 day (smoke + artifact capture) | + +C.1 can proceed in parallel with A.2 / A.3 since the sidecar's `aahClientManaged` +is x64 and does not share the worker bitness constraint. + +--- + +## What this plan does NOT cover + +- The value-driven sub-attribute fallback path — already shipped and + functional (not being changed). +- Track B (lmxopcua EventPump, GalaxyDriver IAlarmSource re-implementation) + and Track E (client SDK surface refresh) from the alarms-over-gateway plan — + those are in `lmxopcua` and depend on A.3 being live; they follow naturally + once A.3 ships. +- Galaxy-native alarm historian path — System Platform's own `HistorizeToAveva` + toggle on the Galaxy template; not in scope. +- Alarm ACL / role-grant surface — already shipped in Phase 6.2. 
diff --git a/docs/plans/live-hardware-validation-runbooks.md b/docs/plans/live-hardware-validation-runbooks.md new file mode 100644 index 0000000..09323a1 --- /dev/null +++ b/docs/plans/live-hardware-validation-runbooks.md @@ -0,0 +1,497 @@ +# Live-Hardware Driver Validation Runbooks + +> **Scope**: These runbooks cover the three driver validation tasks that +> require physical hardware or a hardware-equivalent live environment and +> cannot be satisfied by the Docker-based simulator fixtures or unit tests +> alone. +> +> Driver implementation is complete. The runbooks document the preconditions, +> step-by-step procedure, expected results, and how to record the outcome for +> each driver that has an open live-hardware gap. + +--- + +## 1. FANUC FOCAS — Live CNC Smoke (task #54) + +### Background + +The FOCAS driver (`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/`) uses the +pure-managed `WireFocasClient` that speaks FOCAS2 over TCP directly (no +`Fwlib64.dll`, no P/Invoke). The integration test suite at +`tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.IntegrationTests/` runs against +the `focas-mock` Python server (PDU-verified against `fwlibe64.dll` upstream) +and covers all call-shapes the driver issues. What the mock cannot cover: + +- Series-specific firmware quirks (e.g. 
0i-F vs 30i-B parameter range limits) +- Real CNC Ethernet stack behaviour (TCP keep-alive, session-close edge cases) +- Series gating: some driver nodes are conditionally emitted based on + `CncSeries` — only a physical CNC can confirm the suppression works + +### Preconditions + +| Item | Requirement | +|------|-------------| +| CNC hardware | FANUC CNC with Ethernet option enabled; TCP port 8193 reachable from the dev box or from the host running OtOpcUa | +| CNC series | Any of: 0i-D, 0i-F, 0i-MF, 0i-TF, 16i, 30i-B, 31i, 32i, Power Motion i | +| CNC state | Running state (not E-stop, not alarm) for live axis-data reads | +| Network | TCP reachability from OtOpcUa server host to CNC port 8193 | +| OtOpcUa | Server built and deployed (`dotnet publish` or running via `dotnet run`) | +| Config | DriverInstance row for FOCAS in Config DB (`Type="FOCAS"`, `Backend="wire"`, `Devices[0].HostAddress="focas://:8193"`, `Devices[0].Series=""`) | + +### Procedure + +**Step 1 — Verify TCP reachability** + +```powershell +Test-NetConnection -ComputerName -Port 8193 +``` + +Pass: `TcpTestSucceeded: True`. + +**Step 2 — Start OtOpcUa with FOCAS driver configured** + +Ensure the Config DB has the DriverInstance row. Start the server: + +```powershell +sc start OtOpcUa +# or for a dev run: +dotnet run --project src/Server/ZB.MOM.WW.OtOpcUa.Server +``` + +Watch the Serilog log for: + +``` +[INF] FocasDriver initializing device focas://:8193 series= +[INF] FocasDriver device :8193 Connected +``` + +If `EW_SOCKET (-1)` appears, the TCP endpoint is unreachable or the CNC +Ethernet option is not active. 
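When `EW_SOCKET (-1)` shows up and it is unclear whether the network path or the CNC's Ethernet option is at fault, a plain TCP probe run from the OtOpcUa host separates the two cases. The sketch below is a portable stand-in for the `Test-NetConnection` check in Step 1; the function name is illustrative, and the host must be supplied by the operator since the runbook does not fix one.

```python
import socket
import sys


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


if __name__ == "__main__":
    # Usage: python probe.py <cnc-host> [port]
    # 8193 is the FOCAS port used throughout this runbook.
    if len(sys.argv) > 1:
        host = sys.argv[1]
        port = int(sys.argv[2]) if len(sys.argv) > 2 else 8193
        print(f"{host}:{port} reachable={tcp_reachable(host, port)}")
```

A `True` result alongside a persistent `EW_SOCKET` in the driver log suggests the port is open but the FOCAS session itself is being rejected, which points back at the CNC configuration and away from the network.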
+ +**Step 3 — Browse the address space** + +```powershell +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + browse -u opc.tcp://localhost:4840 -r -d 3 +``` + +Expected: a node tree containing at minimum: + +``` +FOCAS/ + / + Identity/ + SeriesNumber + Version + MaxAxes + Status/ + RunState + Mode + EmergencyStop + Axes/ + / + AbsolutePosition + MachinePosition +``` + +Nodes suppressed by the `Series` capability gate will be absent — this is +correct behaviour. + +**Step 4 — Read identity nodes** + +```powershell +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS//Identity/SeriesNumber" + +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS//Identity/MaxAxes" +``` + +Pass: `Good` quality; `SeriesNumber` matches the string printed on the CNC +control panel (e.g. `"0i-F"`); `MaxAxes` is a non-zero integer. + +**Step 5 — Read live status and axis data** + +```powershell +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS//Status/RunState" + +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS//Axes/X/AbsolutePosition" +``` + +Pass: both return `Good` quality. `AbsolutePosition` is a `Double` (e.g. +`-12.3456` mm). Manually compare against the machine's position display. + +**Step 6 — Subscribe and observe polling** + +```powershell +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + subscribe -u opc.tcp://localhost:4840 ` + -n "ns=2;s=FOCAS//Status/RunState" -i 500 +``` + +Let run for 30 s while jogging an axis or changing mode on the CNC operator +panel. Pass: at least one data-change event received within 5 s; events +continue arriving every ~500 ms. + +**Step 7 — 2-minute soak** + +Let the server run for 2 minutes with the subscription active. 
Pass: no +`EW_SOCKET`, `EW_HANDLE`, `EW_BUSY` errors in the Serilog output; subscribed +node continues delivering updates. + +**Step 8 — Run the FOCAS e2e script** + +```powershell +pwsh scripts/e2e/test-focas.ps1 -ServerUrl opc.tcp://localhost:4840 ` + -DriverInstance "" -Series "" +``` + +Pass: script exits 0. + +### Expected results + +| Check | Expected | +|-------|----------| +| TCP connect to CNC port 8193 | Success | +| FOCAS session open (`cnc_allclibhndl3`) | EW_OK (0) in driver log | +| `Identity/SeriesNumber` | Matches CNC panel, `Good` quality | +| `Identity/MaxAxes` | Non-zero integer, `Good` quality | +| `Status/RunState` | Integer 0–3, `Good` quality | +| `Axes/X/AbsolutePosition` | Double, `Good` quality, matches display | +| Subscribe: events delivered | >= 3 events in 5 s soak | +| 2-minute soak: no FOCAS errors | Clean Serilog log | + +### Recording the outcome + +``` +FOCAS live-CNC smoke — task #54 +Date: YYYY-MM-DD +CNC: series= firmware= +IP: :8193 +OtOpcUa SHA: + +TCP connect: PASS +Session open: PASS +Identity reads: PASS SeriesNumber="<>" MaxAxes= +Status read: PASS RunState= +Axis read: PASS X/AbsolutePosition= +Subscribe: PASS events in 30s +2-min soak: PASS no errors +e2e script: PASS +``` + +--- + +## 2. Allen-Bradley CIP — Live Boot (ControlLogix) + +### Background + +The AB CIP driver (`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip/`) uses +`libplctag` 1.6.x. The Docker `ab_server` simulator covers connectivity and +atomic type reads (7 integration tests). Live-boot validation is needed to +confirm UDT shape-reading, array tag access, and the CIP packing behaviour on +a real ControlLogix backplane — all gaps acknowledged in +`docs/drivers/AbServer-Test-Fixture.md`. + +AB CIP live-boot was first verified against a ControlLogix rig at PR #222. +Continue running before each release. 
+ +### Preconditions + +| Item | Requirement | +|------|-------------| +| PLC hardware | ControlLogix (preferred) or CompactLogix; firmware 20+ for request packing | +| Network | TCP port 44818 reachable from OtOpcUa server host | +| PLC state | Running; at least one DINT / REAL / BOOL / STRING controller-scoped tag defined | +| OtOpcUa | Server built and deployed | +| Config | DriverInstance row: `Type="AbCip"`, `Host=""`, `Path="1,0"`, `PlcType="ControlLogix"` | + +### Procedure + +**Step 1 — Verify TCP reachability** + +```powershell +Test-NetConnection -ComputerName -Port 44818 +``` + +Pass: `TcpTestSucceeded: True`. + +**Step 2 — Start OtOpcUa and watch driver log** + +```powershell +sc start OtOpcUa +``` + +Look for: + +``` +[INF] AbCipDriver device Connected path=1,0 plcType=ControlLogix +``` + +**Step 3 — Browse the address space** + +```powershell +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + browse -u opc.tcp://localhost:4840 -r -d 3 +``` + +Pass: node tree shows the tags defined in the ControlLogix project (controller- +and program-scoped). UDT members appear as child nodes. + +**Step 4 — Read atomic tags** + +```powershell +# Read a DINT tag +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + read -u opc.tcp://localhost:4840 -n "ns=2;s=AbCip//" +``` + +Pass: `Good` quality; value type matches the PLC tag type. + +**Step 5 — Read a UDT member** + +```powershell +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + read -u opc.tcp://localhost:4840 -n "ns=2;s=AbCip///" +``` + +Pass: `Good` quality; value matches the live PLC value. + +**Step 6 — Write a DINT tag (if in ReadWrite mode)** + +```powershell +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + write -u opc.tcp://localhost:4840 ` + -n "ns=2;s=AbCip//" -v 42 -t Int32 +``` + +Verify the new value via a subsequent read or on the PLC HMI. + +Pass: read back returns 42 with `Good` quality. 
+ +**Step 7 — Subscribe to a tag that changes** + +```powershell +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + subscribe -u opc.tcp://localhost:4840 ` + -n "ns=2;s=AbCip//" -i 500 +``` + +Trigger a value change on the PLC. Pass: events received within 2 s. + +**Step 8 — Point the integration suite at the live PLC and confirm parity with the Docker sim** + +```powershell +$env:AB_SERVER_ENDPOINT = ":44818" +dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests ` + --filter "AbServerFact" +``` + +Pass: all 7 integration tests pass against the live PLC. + +### Expected results + +| Check | Expected | +|-------|----------| +| TCP connect | Success | +| Driver log `Connected` | Present, no error | +| Browse | Node tree mirrors PLC tag list | +| Atomic read | `Good` quality, correct type | +| UDT member read | `Good` quality, correct value | +| Write round-trip | Written value reads back | +| Subscribe | Events delivered on value change | +| Integration tests with live PLC | 7/7 pass | + +### Recording the outcome + +``` +AB CIP live-boot +Date: YYYY-MM-DD +PLC: Allen-Bradley firmware= +IP: :44818 path=1,0 +OtOpcUa SHA: + +TCP connect: PASS +Driver connected: PASS +Browse: PASS tags visible +Atomic read: PASS +UDT read: PASS +Write round-trip: PASS +Subscribe: PASS +Integration tests: 7/7 PASS +``` + +--- + +## 3. Beckhoff TwinCAT — Wire-Live Validation + +### Background + +The TwinCAT driver (`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT/`) uses the +Beckhoff `TwinCAT.Ads` .NET SDK v6. The integration test suite at +`tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests/` +(`TwinCAT3SmokeTests.cs`) covers 14 `[TwinCATFact]` methods + one 16-case +`[TwinCATTheory]` (30 cases total) against a live ADS runtime. The TCBSD ESXi +VM at `10.100.0.128` (AmsNetId `41.169.163.43.1.1`) is the primary fixture +runtime (project memory `project_tcbsd_fixture.md`) and bypasses the +TwinCAT/Hyper-V conflict on the dev box. 
+ +Live-hardware validation extends beyond the TCBSD VM to confirm the driver +works against a production PLC (not just the ESXi test VM) and that the three +defects found during original integration testing do not regress on newer +firmware: + +1. Notification cycle time unit (250 ms was being set to ~41 min — fixed). +2. `STRING(N)` / `WSTRING(N)` type mapper (fixed). +3. Bit-indexed BOOL path (fixed). + +### Preconditions + +**TCBSD ESXi fixture (primary — no physical hardware needed)** + +| Item | Requirement | +|------|-------------| +| TCBSD VM | Running on ESXi at `10.100.0.128` | +| AMS Net ID | `41.169.163.43.1.1` | +| ADS port | `851` (TwinCAT 3 PLC runtime 1) | +| PLC project | TwinCAT project from `tests/.../TwinCatProject/` loaded and in Run state | +| Network | TCP port 48898 reachable from dev box to `10.100.0.128` | + +**Production PLC (for true wire-live validation)** + +| Item | Requirement | +|------|-------------| +| TwinCAT hardware | Beckhoff IPC or CX series, TwinCAT 3 (TC3); TC2 is a known gap per fixture doc | +| AMS route | Route configured on TwinCAT device back to the OtOpcUa host | +| PLC state | Run state | +| GVL | At least a `GVL_Fixture.nCounter` DINT and `GVL_Fixture.rSetpoint` REAL present | + +### Procedure — TCBSD ESXi fixture + +**Step 1 — Verify TCBSD VM is reachable** + +```powershell +Test-NetConnection -ComputerName 10.100.0.128 -Port 48898 +``` + +Pass: `TcpTestSucceeded: True`. + +**Step 2 — Run the integration test suite** + +```powershell +$env:TWINCAT_TARGET_HOST = "10.100.0.128" +$env:TWINCAT_TARGET_NETID = "41.169.163.43.1.1" + +dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests ` + --logger "console;verbosity=normal" +``` + +Pass: all 30 test cases pass (14 `[TwinCATFact]` + 16-case `[TwinCATTheory]`). +No `[TwinCATFact]` / `[TwinCATTheory]` skips — the env var is set, so the +runtime probe is expected to succeed. 
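The runbook does not state the root cause of the cycle-time defect, but the two figures it quotes (250 ms intended, roughly 41 min observed) are consistent with a unit mix-up between milliseconds and 100 ns ticks. The arithmetic below is a sketch under that assumption; the function names are illustrative and are not taken from the driver.

```python
TICKS_PER_MS = 10_000  # 1 ms = 10,000 ticks of 100 ns each


def ms_to_ticks(ms: int) -> int:
    """Convert milliseconds to 100 ns ticks."""
    return ms * TICKS_PER_MS


def misread_as_minutes(ms: int) -> float:
    """Minutes that result if the tick count is mistakenly treated as milliseconds."""
    ticks = ms_to_ticks(ms)
    return ticks / 1000 / 60  # treat ticks as ms, then ms -> s -> min


if __name__ == "__main__":
    print(ms_to_ticks(250))                    # tick count for a 250 ms cycle
    print(round(misread_as_minutes(250), 2))   # minutes if that count is read as ms
```

A 250 ms cycle is 2,500,000 ticks, and 2,500,000 ms is 2,500 s, or about 41.7 min. If the notification regression test ever reports update intervals in that range again, a conversion at this boundary is a reasonable first place to look.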
+ +Key tests to watch: + +| Test | Validates | +|------|-----------| +| `Driver_subscribe_receives_native_ADS_notifications_on_counter_changes` | Native ADS notification path (the cycle-time-unit bug regression) | +| `Driver_reads_every_primitive_type_with_correct_mapping` | 16-type theory incl. `STRING(N)` | +| `Driver_reads_bit_indexed_BOOL_from_word` | Bit-indexed BOOL fix regression | +| `Driver_auto_reconnects_after_underlying_client_is_disposed` | Reconnect on ADS client dispose | +| `Driver_routes_reads_per_device_and_isolates_unreachable_peers` | Multi-device isolation | + +**Step 3 — OtOpcUa server browse/read via Client CLI** + +Start OtOpcUa with a TwinCAT DriverInstance pointing at the TCBSD VM: + +```powershell +# appsettings.json DriverInstance: Type=TwinCAT, AmsNetId=41.169.163.43.1.1, AmsPort=851 +sc start OtOpcUa +# or dev run +dotnet run --project src/Server/ZB.MOM.WW.OtOpcUa.Server +``` + +```powershell +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + browse -u opc.tcp://localhost:4840 -r -d 4 + +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + read -u opc.tcp://localhost:4840 -n "ns=2;s=TwinCAT//GVL_Fixture/nCounter" +``` + +Pass: browse shows the PLC symbol tree; read returns `Good` quality with an +integer value. + +### Procedure — Production PLC (optional, for full wire-live signoff) + +If a Beckhoff production IPC is available in the lab: + +**Step 1** — Configure the AMS route on the TwinCAT device (TwinCAT System +Manager → Routes → Add static route from the TwinCAT device back to the +OtOpcUa server machine). 
+ +**Step 2** — Set env vars and run the integration suite against the production +target: + +```powershell +$env:TWINCAT_TARGET_HOST = "" +$env:TWINCAT_TARGET_NETID = "" +$env:TWINCAT_TARGET_PORT = "851" + +dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests +``` + +**Step 3** — Subscribe to a counter tag for 30 s to confirm native +notifications arrive: + +```powershell +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + subscribe -u opc.tcp://localhost:4840 ` + -n "ns=2;s=TwinCAT//GVL_Fixture/nCounter" -i 100 +``` + +Pass: events arrive every ~100 ms driven by the PLC's ADS notification, not +by polling. + +### Expected results + +| Check | TCBSD VM | Production PLC | +|-------|----------|----------------| +| ADS port 48898 reachable | Required | Required | +| Integration tests: all 30 pass | Required | Optional (same 30) | +| Notification cycle-time test passes | Required | Required | +| Server browse shows symbol tree | Required | Optional | +| Read `Good` quality | Required | Optional | +| Native ADS notifications deliver in subscribe | Required | Recommended | + +### Known gaps (documented — not blockers for v2 GA) + +Per `docs/drivers/TwinCAT-Test-Fixture.md` §"What it does NOT cover": + +- Multi-hop AMS routing — single-hop only. +- TC2 (ADS v1) compatibility — TC3 only. +- Notification coalescing under sustained CPU load. +- `Symbol version changed (0x0702)` storm handling under rapid PLC re-downloads. + +These are deferred to v3 per `docs/v3/twincat-backlog.md`. 
+ +### Recording the outcome + +``` +TwinCAT wire-live validation +Date: YYYY-MM-DD +Target: TCBSD VM 10.100.0.128 AmsNetId=41.169.163.43.1.1 (and/or production PLC details) +TwinCAT version: +OtOpcUa SHA: + +ADS port reachable: PASS +Integration tests: 30/30 PASS + notification-cycle-time test: PASS (regression check) + STRING(N) type test: PASS (regression check) + bit-indexed BOOL test: PASS (regression check) +Server browse: PASS +Read Good quality: PASS +Native subscription delivery: PASS events in 30s +``` diff --git a/docs/plans/phase-6-3-redundancy-interop-plan.md b/docs/plans/phase-6-3-redundancy-interop-plan.md new file mode 100644 index 0000000..ae6464f --- /dev/null +++ b/docs/plans/phase-6-3-redundancy-interop-plan.md @@ -0,0 +1,278 @@ +# Phase 6.3 Redundancy — Client Interop Matrix and Cutover Validation Plan + +> **Scope**: Phase 6.3 redundancy runtime core shipped (PRs #89-90, #98-99, +> #24-peerprobe, Stream C node wiring, Stream D lease wrap). What remains is +> Stream F (task #150): validating that third-party OPC UA clients honour +> our `ServiceLevel` / `ServerUriArray` / `RedundancySupport` signals and +> fail over correctly when the Primary drops. This document defines what is +> automatable as integration tests, what requires two live instances plus a +> real client, and a step-by-step cutover-validation runbook. +> +> **Source of truth**: `docs/Redundancy.md`, `docs/v2/redundancy-interop-playbook.md`, +> `docs/v2/implementation/phase-6-3-redundancy-runtime.md`, +> `scripts/compliance/phase-6-3-compliance.ps1`. 
+ +## What is already tested (no live cluster needed) + +The following are covered by existing automated tests that run in ordinary +`dotnet test`: + +| Area | Test class(es) | What it asserts | +|---|---|---| +| `ServiceLevelCalculator` — 8-state matrix | `ServiceLevelCalculatorTests` | All 10 band values; role × self-health × peer-http × peer-ua × apply × recovery × topology combinations | +| `RecoveryStateManager` — dwell + witness | `RecoveryStateManagerTests` | 60 s dwell default; premature-exit rejection; witness-required gate | +| `ApplyLeaseRegistry` — lease lifecycle | `ApplyLeaseRegistryTests` | Disposal on success / exception / cancellation; watchdog force-close at 10 min | +| `ServerRedundancyNodeWriter` — OPC UA variable binding | `ServerRedundancyNodeWriterTests` | `ServiceLevel` byte push; `RedundancySupport` enum; `ServerUriArray` skip-log when node absent | +| `RedundancyStatePublisher` — orchestration | `RedundancyStatePublisherTests` | Edge-triggered `OnStateChanged`; idempotent dedup | +| `ClusterTopologyLoader` | `ClusterTopologyLoaderTests` | Two-node seed; one-node degenerate; duplicate-URI rejection | +| `DraftValidator.ValidateClusterTopology` | `DraftValidatorTests` (8 cases) | NodeCount/mode pairs; Enabled-count vs NodeCount; multiple-Primary rejection | + +Run with: + +```powershell +dotnet test ZB.MOM.WW.OtOpcUa.slnx --filter "FullyQualifiedName~Redundancy" +``` + +Compliance gate (every Phase 6.3 static check): + +```powershell +pwsh ./scripts/compliance/phase-6-3-compliance.ps1 +``` + +Pass criteria: exit 0; all `[PASS]` lines green; `[DEFERRED]` lines are +known-deferred surfaces, not failures. + +## What cannot be automated — requires two live instances + +The scenarios below require two running `OtOpcUa.Server` processes in the +same `ServerCluster`, a real SQL Server Config DB, and at least one driver +instance with a reachable endpoint (simulator or real PLC). 
+ +### Why it cannot be unit/integration-tested in-process + +- UaExpert, Kepware KEPServerEX, and AVEVA OI Gateway are closed-source + Windows GUI binaries with no headless CLI interface for the + subscribe/browse flows. +- The AVEVA MXAccess failover leg (`IAlarmSource` reconnect, `$MxAccessClient` + quality transition) involves the Galaxy runtime's own client-redundancy + policy and the COM-layer session model — both live outside this repo. +- Even the automatable sub-set (our own `otopcua-cli` as the client) needs + two distinct listening TCP endpoints; that requires two live processes, + which is out of scope for `dotnet test` integration fixtures. + +## Test matrix + +### Prerequisites + +1. Two `OtOpcUa.Server` processes on separate Windows hosts (or separate + ports on the same host for dev) sharing one Config DB (`ServerCluster` + with `NodeCount=2`, `RedundancyMode=Warm` or `Hot`). +2. Each node registered in `ClusterNode`: + - Node A: `RedundancyRole=Primary`, `ServiceLevelBase=255`, + `ApplicationUri=urn:node-a:OtOpcUa` + - Node B: `RedundancyRole=Secondary`, `ServiceLevelBase=100`, + `ApplicationUri=urn:node-b:OtOpcUa` +3. `PeerHttpProbeLoop` and `PeerUaProbeLoop` HostedServices running on both + nodes (registered via `AddHostedService` + + `AddHostedService` in `Program.cs`). +4. At least one `DriverInstance` in the cluster with a reachable PLC or + simulator (e.g. Modbus sim at `10.100.0.35:5020`). +5. Client machine with UaExpert >= 1.7 installed. +6. Optional second client: Kepware KEPServerEX 6.x QuickClient or AVEVA + OI Gateway 2020R2+. + +### Block A — OPC UA protocol signals (UaExpert, no failover yet) + +| ID | Scenario | Procedure | Pass criterion | Automatable? | +|----|----------|-----------|----------------|--------------| +| A1 | ServiceLevel published on Primary | Connect UaExpert to Node A. Browse `Server/ServerStatus/ServiceLevel`. 
| Value = 255 (`AuthoritativePrimary`) | No — requires UaExpert GUI | +| A2 | ServiceLevel published on Backup | Connect UaExpert to Node B. Read same node. | Value = 100 (`AuthoritativeBackup`) | No | +| A3 | ServiceLevel updates when peer drops | Node A connected. Stop Node B (`sc stop OtOpcUa`). Watch `ServiceLevel` on Node A. | Transitions 255 → 230 (`IsolatedPrimary`) within ~6 s (3 × 2 s HTTP probe interval) | No | +| A4 | RedundancySupport | Browse `Server/ServerRedundancy/RedundancySupport` on either node. | Value = `Warm` or `Hot` matching the cluster `RedundancyMode` | No | +| A5 | ServerUriArray | Browse `Server/ServerRedundancy/ServerUriArray` on either node. | Array contains both `ApplicationUri` values; self listed first. Note: requires non-transparent redundancy-type upgrade (currently logs-and-skips — see known limitation A5 below). | No | +| A6 | Mid-apply ServiceLevel dip | Trigger a `sp_PublishGeneration` apply (via Admin UI draft → publish) while watching Node A `ServiceLevel`. | Drops to 200 (`PrimaryMidApply`) for the apply duration; returns to 255 after `RefreshAsync`. | No | +| A7 | Client.CLI reads correct ServiceLevel | `dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://:4840 -n "i=2267"` | Prints current byte value matching expected band. | **Yes** — scriptable with the Client CLI | +| A8 | otopcua-cli failover reconnect | `dotnet run ... -- connect -u opc.tcp://:4840 -F opc.tcp://:4840` — then kill Node A. | CLI session reconnects to Node B within the session keep-alive timeout. | **Yes** — scriptable with the Client CLI | + +### Block B — Third-party client failover + +| ID | Scenario | Procedure | Pass criterion | +|----|----------|-----------|----------------| +| B1 | UaExpert picks Primary by ServiceLevel | Configure a Redundancy Group in UaExpert with both endpoint URLs. | Client connects to Node A (higher ServiceLevel) | +| B2 | UaExpert cuts over on Primary kill | Kill Node A `OtOpcUa` service. 
| Client session reconnects to Node B within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. | +| B3 | UaExpert returns when Primary restores | Start Node A. Wait >= 60 s recovery dwell. | `ServiceLevel` on Node A progresses: 180 (`RecoveringPrimary`) → 255 (`AuthoritativePrimary`). UaExpert may or may not switch back (client-policy-dependent; both outcomes accepted). | +| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware configured for the same two endpoints. | Same pass criteria; establishes no UaExpert-specific behaviour. | +| B5 | AVEVA OI Gateway | Configure OI Gateway OPC DA/UA client object against the cluster. Kill Primary. | OI Gateway data quality recovers within `ReconnectInterval` (default 20 s); no permanent data-loss alert. | + +### Block C — Galaxy MXAccess failover + +This block requires a running Galaxy and `$MxAccessClient` object (AVEVA +System Platform installed, Galaxy deployed on dev box — see project memory +`project_aveva_platform_installed.md`). + +| ID | Scenario | Procedure | Pass criterion | +|----|----------|-----------|----------------| +| C1 | Galaxy binds to Primary on first connect | Bring cluster up. Start a Galaxy `$MxAccessClient` with both node URLs configured. | Galaxy reports `QUALITY = Good`; initial values stream from Node A. | +| C2 | Galaxy redirects on Primary drop | Stop Node A. | Galaxy `QUALITY` briefly goes `Uncertain`, then returns to `Good`; values continue streaming from Node B within MXAccess's `ReconnectInterval` (default 20 s). | +| C3 | Galaxy tolerates mid-apply dip | Trigger generation apply on Node A. | Galaxy remains bound — mid-apply dip (200) is advisory, not a session drop. No quality interruption. | + +Note: A negative result on C1–C3 does not necessarily indicate an OtOpcUa +defect. Cross-check with Block A / B first to confirm our `ServiceLevel` +signal is correct before debugging the MXAccess client layer. 
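When recording Block A to C observations, it can help to label raw `ServiceLevel` bytes with the band names used in this plan. The following is a minimal sketch in Python, assuming only the seven bands this document names (the calculator defines others that are not listed here). `BANDS` and `describe_service_level` are hypothetical helper names, not part of OtOpcUa.

```python
# Band names exactly as they appear in this plan. This is NOT the full
# ServiceLevelCalculator table; any other value is reported as unlisted.
BANDS = {
    255: "AuthoritativePrimary",
    230: "IsolatedPrimary",
    200: "PrimaryMidApply",
    180: "RecoveringPrimary",
    100: "AuthoritativeBackup",
    80: "IsolatedBackup",
    1: "NoData",
}


def describe_service_level(value: int) -> str:
    """Return 'value (BandName)' for a ServiceLevel byte read from i=2267."""
    if not 0 <= value <= 255:
        raise ValueError(f"ServiceLevel is a byte, got {value}")
    return f"{value} ({BANDS.get(value, 'band not named in this plan')})"


if __name__ == "__main__":
    # Scenario A3: Node A is expected to move from 255 to 230 when Node B stops.
    for observed in (255, 230):
        print(describe_service_level(observed))
```

Keeping the lookup limited to the bands named here avoids guessing at values that only the calculator's own tests define.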
+ +## Step-by-step cutover-validation runbook + +This is the minimum procedure to satisfy the v2 GA exit criterion: +"Non-transparent redundancy cutover validated with at least one production +client (Ignition 8.3 recommended — see decision #85)." + +### Step 1 — Provision the cluster + +```powershell +# On the Config DB host, seed or verify cluster rows: +# ServerCluster: Id=, Name="test-cluster", NodeCount=2, RedundancyMode=Warm +# ClusterNode A: NodeId="node-a", ClusterId=, RedundancyRole=Primary, +# ServiceLevelBase=255, ApplicationUri="urn:node-a:OtOpcUa" +# ClusterNode B: NodeId="node-b", ClusterId=, RedundancyRole=Secondary, +# ServiceLevelBase=100, ApplicationUri="urn:node-b:OtOpcUa" +``` + +Verify uniqueness constraint: no two `ClusterNode` rows share the same +`ApplicationUri` (unique index on `ApplicationUri`). + +### Step 2 — Start both server instances + +On Node A host: + +```powershell +# appsettings.json: Node:NodeId = "node-a" +sc start OtOpcUa +``` + +On Node B host: + +```powershell +# appsettings.json: Node:NodeId = "node-b" +sc start OtOpcUa +``` + +Wait 10 s for HostedServices to complete first probe cycle. + +### Step 3 — Verify baseline ServiceLevel via Client CLI + +```powershell +# Node A should report 255 +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read ` + -u opc.tcp://:4840 -n "i=2267" + +# Node B should report 100 +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read ` + -u opc.tcp://:4840 -n "i=2267" +``` + +Pass: Node A = 255, Node B = 100. + +### Step 4 — Verify ServerUriArray + +```powershell +dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read ` + -u opc.tcp://:4840 -n "i=2271" +``` + +Pass: array returned contains both `ApplicationUri` strings. 
If +`ServerUriArray` node returns empty or an error, the non-transparent +redundancy-type upgrade follow-up is still pending (known limitation — +`ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips on the +base `ServerRedundancyState` object type). + +### Step 5 — Execute Primary kill + failover (B2 scenario) + +1. Connect UaExpert (or Kepware) Redundancy Group to both endpoints. +2. Confirm client is subscribed to at least one variable node. +3. Kill Node A: `sc stop OtOpcUa` on Node A host. +4. Observe: + - Node B `ServiceLevel` should transition: 100 (`AuthoritativeBackup`) + → 80 (`IsolatedBackup`) within ~6 s. + - Client should reconnect to Node B and resume data-change events. +5. Record: time from kill to client reconnect; whether data gaps occurred. + +### Step 6 — Verify Primary recovery (B3 scenario) + +1. Restart Node A: `sc start OtOpcUa` on Node A host. +2. Observe Node A `ServiceLevel` progression: + - ~0 s: 1 (`NoData`) briefly while HostedServices start. + - Startup: 180 (`RecoveringPrimary`) — recovery dwell gate active. + - After >= 60 s dwell + one positive publish witness: 255 (`AuthoritativePrimary`). +3. Observe Node B: + - Returns to 100 (`AuthoritativeBackup`) once it sees Node A peer probe succeed. +4. Record dwell duration and whether the client (UaExpert/Kepware) switches back. + +### Step 7 — Execute mid-apply dip (A6 scenario) + +1. Via Admin UI, create a trivial draft change and publish. +2. Watch Node A `ServiceLevel` during apply. +3. Expected: drops to 200 (`PrimaryMidApply`) for the apply duration + (typically seconds); returns to 255 when `GenerationRefreshHostedService` + releases the lease. 
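The Step 6 observations can be checked against the dwell rule after the fact. The sketch below assumes the operator has logged `(seconds_since_restart, service_level)` samples by polling Node A. `check_recovery` and the sample format are hypothetical, and the sketch checks only the dwell timing, not the publish-witness condition.

```python
def check_recovery(samples, dwell_s=60.0):
    """Check that 255 was not reached before `dwell_s` seconds spent at 180.

    `samples` is a time-ordered list of (seconds, service_level) tuples.
    Returns (ok, observed_dwell_seconds), or (False, None) when the log
    never shows both 180 and a later 255.
    """
    # First sample in the RecoveringPrimary band.
    first_180 = next((t for t, level in samples if level == 180), None)
    # First AuthoritativePrimary sample that comes after entering 180.
    first_255 = next(
        (t for t, level in samples
         if level == 255 and first_180 is not None and t > first_180),
        None,
    )
    if first_180 is None or first_255 is None:
        return False, None  # progression incomplete in the log
    dwell = first_255 - first_180
    return dwell >= dwell_s, dwell


if __name__ == "__main__":
    log = [(0, 1), (3, 180), (40, 180), (66, 255)]
    print(check_recovery(log))
```

The observed dwell is only as precise as the polling interval, so a short interval (around 1 s) keeps the measurement error small relative to the 60 s default.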
+ +### Step 8 — Record results + +Copy the following block into a tracking doc: + +``` +Run date: YYYY-MM-DD +Release SHA: +Cluster: Primary: node-a Backup: node-b +Config DB: 10.100.0.35,14330 + +A1: [PASS/FAIL] evidence: +A2: [PASS/FAIL] +A3: [PASS/FAIL] time-to-IsolatedPrimary: s +A4: [PASS/FAIL] +A5: [PASS/FAIL/DEFERRED - ServerUriArray upgrade pending] +A6: [PASS/FAIL] mid-apply duration: s +A7: [PASS/FAIL] CLI output attached +A8: [PASS/FAIL] CLI reconnect observed +B1: [PASS/FAIL] +B2: [PASS/FAIL] reconnect time: s +B3: [PASS/FAIL] dwell observed: s +B4: [PASS/FAIL] (Kepware) +B5: [PASS/FAIL] (OI Gateway — if available) +C1: [PASS/FAIL/SKIP - Galaxy not available] +C2: [PASS/FAIL/SKIP] +C3: [PASS/FAIL/SKIP] +``` + +One pass of every non-SKIP row is the v2 GA acceptance criterion. + +## Known limitations + +### A5 — ServerUriArray node not yet writable + +The OPC UA .NET Standard SDK's default `Server.ServerRedundancy` object is the +base `ServerRedundancyState`, which has no `ServerUriArray` child node. +`ServerRedundancyNodeWriter.ApplyServerUriArray` currently logs a warning and +skips. The operator obtains `ServerUriArray` by reading `ClusterNode` rows +directly until the non-transparent redundancy-type upgrade follow-up ships. + +### Recovery dwell is 60 s by default + +`RecoveryStateManager.DwellTime` defaults to `TimeSpan.FromSeconds(60)` in +`Program.cs`. Step 6 of the runbook will block for at least 60 s waiting for +Node A to return to `AuthoritativePrimary`. This is intentional per +decision #154 (thrash prevention) — do not lower it for the test run. + +### IsolatedBackup (80) does not auto-promote + +Per decision #154, the Backup at band 80 does not self-elevate. If the operator +needs authoritative service from Node B while Node A is down, they must write +`RedundancyRole=Primary` on the `ClusterNode` row for Node B and publish a +draft generation. The Admin UI `RedundancyTab` exposes this flow. 
+ +## Dependency on existing tests + +The cutover runbook validates the end-to-end wire path. The math and edge cases +are already locked by the unit/integration tests enumerated in the first section. +A failing runbook step that contradicts a passing unit test indicates a +deployment configuration error or an SDK version mismatch — not a logic bug. +Check `PeerHttpProbeLoop` logs first (look for `PeerProbe` Serilog events). diff --git a/docs/plans/v2-ga-lab-gates-plan.md b/docs/plans/v2-ga-lab-gates-plan.md new file mode 100644 index 0000000..c94d909 --- /dev/null +++ b/docs/plans/v2-ga-lab-gates-plan.md @@ -0,0 +1,307 @@ +# v2 GA Lab Gates Plan + +> **Canonical tracker**: `docs/v2/v2-release-readiness.md` — all code-path +> release blockers are closed as of 2026-04-24. This document maps the +> remaining exit-criteria from that tracker to concrete commands, automation +> boundaries, operator procedures, and pass criteria. +> +> **Status**: RELEASE-READY (code-path). Manual/lab gates remain open. 
+ +## The gate list + +From `docs/v2/v2-release-readiness.md` §"Release-readiness exit criteria": + +| # | Gate | Kind | Automatable here | +|---|------|------|-----------------| +| G1 | All four Phase 6.N compliance scripts exit 0 | Script | Yes — run on this box | +| G2 | `dotnet test ZB.MOM.WW.OtOpcUa.slnx` passes with <= 1 known-flake failure | Script | Yes — run on this box | +| G3 | Release blockers closed | Audit | Already closed (code-path) | +| G4 | Phase 5 driver complement shipped | Audit | Already closed | +| G5 | Production deployment checklist signed off by Fleet Admin | Operator | No — separate doc, human signoff | +| G6 | At least one end-to-end integration run against live Galaxy succeeds | Dev rig | No — requires AVEVA platform | +| G7 | FOCAS live-CNC wire-level smoke (#54) passes against a real FANUC control | Lab hardware | No — requires FANUC CNC | +| G8 | OPC UA CTT / UA Compliance Test Tool passes against the live endpoint | Operator tool | No — requires CTT binary + live endpoint | +| G9 | Non-transparent redundancy cutover validated with >= 1 production client | Lab | No — see `docs/plans/phase-6-3-redundancy-interop-plan.md` | + +--- + +## G1 — Phase 6 compliance scripts + +### Command + +```powershell +pwsh ./scripts/compliance/phase-6-all.ps1 +``` + +This meta-runner at `scripts/compliance/phase-6-all.ps1` invokes each +sub-script in a separate `powershell.exe` process to isolate exit codes: + +| Sub-script | Phase | What it checks | +|-----------|-------|---------------| +| `phase-6-1-compliance.ps1` | 6.1 Resilience & Observability | Polly resilience classes, health endpoints, LiteDB sealed cache, observability sinks | +| `phase-6-2-compliance.ps1` | 6.2 Authorization runtime | `AuthorizationGate`, `TriePermissionEvaluator`, `NodeScopeResolver`, dispatch wiring in `DriverNodeManager` | +| `phase-6-3-compliance.ps1` | 6.3 Redundancy runtime | `ServiceLevelCalculator` 8-state band values, `RecoveryStateManager`, `ApplyLeaseRegistry`, 
`ServerRedundancyNodeWriter`; also invokes `dotnet test` with a baseline of 1097 | +| `phase-6-4-compliance.ps1` | 6.4 Admin UI completion | Data-layer types, Identification folder, deferred Blazor items marked `[DEFERRED]` | + +### Pass criterion + +``` +Phase 6 aggregate: PASS +``` + +Exit code 0. Any `[FAIL]` line is a blocker. `[DEFERRED]` lines are expected +for the known-deferred surfaces listed in the implementation docs; they do not +fail the run. + +### Prerequisites + +- SQL Server `10.100.0.35,14330` reachable (Config DB tests use it). +- `dotnet` SDK on PATH (`.NET 10`). +- Run from repo root. + +--- + +## G2 — Full solution test suite + +### Command + +```powershell +dotnet test ZB.MOM.WW.OtOpcUa.slnx --logger "console;verbosity=minimal" +``` + +To include the integration suites that need their fixtures up, start the +fixtures first: + +```powershell +# bring modbus fixture up first +lmxopcua-fix up modbus standard + +dotnet test ZB.MOM.WW.OtOpcUa.slnx --logger "console;verbosity=minimal" +``` + +### Pass criterion + +- Passed count >= 1159 (2026-04-19 baseline after Phase 5 driver complement). +- Failed count <= 1 (the pre-existing + `SubscribeCommandTests.Execute_PrintsSubscriptionMessage` flake in + `Client.CLI` is the only tolerated failure). +- No new `[FAILED]` tests relative to the baseline. + +### Known flake + +`ZB.MOM.WW.OtOpcUa.Client.CLI.Tests::SubscribeCommandTests.Execute_PrintsSubscriptionMessage` +is a timing-sensitive subscribe-then-cancel test. Rerun the specific test +three times if it appears: + +```powershell +# `dotnet test` has no repeat-count option, so run the filtered test in a loop +1..3 | ForEach-Object { + dotnet test tests/Client/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests ` + --filter "FullyQualifiedName~SubscribeCommandTests.Execute_PrintsSubscriptionMessage" +} +``` + +If it fails all three runs, investigate; otherwise treat as flake. 
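The G2 pass criterion and the flake rule can be written down as two small predicates. This is a minimal sketch, assuming the passed count and the list of failed test names have already been extracted from the `dotnet test` output by some other means. `g2_passes` and `is_flake` are hypothetical names, not existing scripts in the repo.

```python
BASELINE_PASSED = 1159  # 2026-04-19 baseline quoted in the pass criterion
KNOWN_FLAKE = "SubscribeCommandTests.Execute_PrintsSubscriptionMessage"


def g2_passes(passed: int, failed_tests: list[str]) -> bool:
    """Apply the G2 rule: enough passes, and at most the one known flake failing."""
    if passed < BASELINE_PASSED:
        return False
    if len(failed_tests) > 1:
        return False
    # Any failure that is not the known flake counts as a new failure.
    return all(KNOWN_FLAKE in name for name in failed_tests)


def is_flake(rerun_passed: list[bool]) -> bool:
    """Treat the failure as a flake unless every rerun failed."""
    return any(rerun_passed)


if __name__ == "__main__":
    print(g2_passes(1160, []))
    print(is_flake([False, True, False]))
```

Matching on a substring of the test name keeps the check independent of how the test runner prefixes the assembly or namespace.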
+ +### Docker fixtures needed for integration suites + +| Driver | Command | Endpoint used | +|--------|---------|---------------| +| Modbus | `lmxopcua-fix up modbus standard` | `10.100.0.35:5020` | +| AB CIP | `lmxopcua-fix up abcip controllogix` | `10.100.0.35:44818` | +| S7 | `lmxopcua-fix up s7 s7_1500` | `10.100.0.35:1102` | +| OPC UA Client | `lmxopcua-fix up opcuaclient` | `opc.tcp://10.100.0.35:50000` | +| FOCAS | `lmxopcua-fix up focas` (mock server) | `10.100.0.35:8193` | + +TwinCAT integration tests require the TCBSD ESXi VM at `10.100.0.128` +(AmsNetId `41.169.163.43.1.1`). Set env var before running: + +```powershell +$env:TWINCAT_TARGET_HOST = "10.100.0.128" +$env:TWINCAT_TARGET_NETID = "41.169.163.43.1.1" +dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests +``` + +Galaxy integration tests run against the live mxaccessgw on the dev box +(gate G6). + +--- + +## G3 — Release blockers closed (audit, already satisfied) + +All three code-path release blockers are closed per `v2-release-readiness.md`: + +- Authorization dispatch wiring (task #143, PR #94) — CLOSED. +- Config fallback Phase 6.1 Stream D (task #136, PR #96) — CLOSED. +- Redundancy Phase 6.3 Streams A/C core (tasks #145/#147, PRs #98-99) — CLOSED. + +No action required. Record the PR numbers in the release notes. + +--- + +## G4 — Driver complement (audit, already satisfied) + +All eight drivers shipped: + +Galaxy, Modbus (+ DL205/S7/MELSEC profiles), S7 native, OPC UA Client, AB CIP, +AB Legacy, TwinCAT ADS, FOCAS (managed wire client — Tier-C isolation retired, +FOCAS is now Tier A in-process via `WireFocasClient`). + +No action required. 
+ +--- + +## G5 — Production deployment checklist (operator action) + +The deployment checklist is a separate document covering: + +- Windows service install (`scripts/install/Install-Services.ps1`) +- Config DB migration (`scripts/db/Apply-Migrations.ps1`) +- Certificate provisioning and trust +- LDAP / GLAuth configuration for production AD target +- mxaccessgw API key provisioning (`apikey create-key` in the sibling repo) +- Service account permissions +- Prometheus / OpenTelemetry export configuration +- Firewall rules (port 4840 OPC UA, port 5120 gRPC to mxaccessgw, + Admin port 5000/5001) + +**Sign-off party**: Fleet Admin (operator). Not automatable. + +Record sign-off as a comment on the v2 release issue. + +--- + +## G6 — Live Galaxy end-to-end integration run + +**Requires**: AVEVA System Platform installed on dev box (confirmed available +per project memory `project_aveva_platform_installed.md`); mxaccessgw running +with a provisioned API key; at least one Galaxy object deployed. + +### Procedure + +1. Start mxaccessgw: + + ```powershell + # in sibling repo C:\Users\dohertj2\Desktop\mxaccessgw\ + dotnet run --project src/MxGateway.Server -- --apikey-path .local/api-key.txt + ``` + +2. Start OtOpcUa server with Galaxy driver instance configured: + + ```powershell + sc start OtOpcUa + ``` + +3. Browse via Client CLI: + + ```powershell + dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + browse -u opc.tcp://localhost:4840 -r -d 3 + ``` + +4. Read a known Galaxy tag (e.g. a deployed `$UserDefined` object attribute): + + ```powershell + dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + read -u opc.tcp://localhost:4840 -n "ns=2;s=" + ``` + +5. Subscribe and verify live updates: + + ```powershell + dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- ` + subscribe -u opc.tcp://localhost:4840 -n "ns=2;s=" -i 1000 + ``` + +### Pass criterion + +- Browse returns a non-empty node tree mirroring the Galaxy hierarchy. 
+- Read returns `Good` quality with a non-null value. +- Subscribe receives at least one data-change notification within 5 s + (or within the configured publishing interval). +- No `BadNoCommunication` or `BadTimeout` errors in the server log. + +Record: Galaxy version, deployed object count, OtOpcUa git SHA. + +--- + +## G7 — FOCAS live-CNC smoke (task #54) + +**Requires**: real FANUC CNC with Ethernet option, accessible on TCP port 8193 +from the dev box; CNC series known (e.g. 0i-F, 30i-B). + +See `docs/plans/live-hardware-validation-runbooks.md` §FOCAS for the full +runbook. + +### Pass criterion + +- `WireFocasClient` opens a FOCAS2 session (`cnc_allclibhndl3` succeeds). +- Identity nodes (`Identity/SeriesNumber`, `Identity/MaxAxes`) return non-null + values matching the physical control panel display. +- At least one axis position (`Axes/X/AbsolutePosition` or similar) returns + `Good` quality with a plausible double value. +- Subscribe on a polled tag delivers at least three updates within 5 s. +- No `EW_SOCKET` (-1) or `EW_HANDLE` (-7) errors in the server log during a + 2-minute soak. + +Record: CNC series, firmware version, test date, OtOpcUa git SHA. + +--- + +## G8 — OPC UA Conformance Test Tool (CTT) pass + +**Requires**: OPC Foundation OPC UA Compliance Test Tool (CTT) or the +open-source UA Compliance Test Tool installed on the client machine; +live OtOpcUa server endpoint. + +### Recommended minimum profile set + +- `Attribute Read` +- `Attribute Write` +- `Browse` +- `Subscription` (DataChange) +- `Server-side monitoring` +- `Security — None profile` (if server configured with `Security:Profiles=[None]`) + +### Procedure + +1. Launch CTT. Add server endpoint: `opc.tcp://localhost:4840`. +2. Run the profile set above. +3. Capture the CTT report HTML/XML. + +### Pass criterion + +All mandatory test cases in each profile: **PASS** or **NOT APPLICABLE**. + +Zero mandatory failures. Advisory failures may be documented with rationale +(e.g. 
optional capability not implemented). + +Record: CTT version, profile set, OtOpcUa git SHA, report artifact. + +--- + +## G9 — Non-transparent redundancy cutover with production client + +See `docs/plans/phase-6-3-redundancy-interop-plan.md` for the full runbook. + +**Minimum acceptable result**: one complete pass of the A-block (UaExpert +OPC UA signal verification) plus scenario B2 (UaExpert failover on Primary +kill). + +Ignition 8.3 is the recommended production client per decision #85. If +Ignition is not available on the lab machine, UaExpert is accepted for v2 GA. + +Record: client name + version, OtOpcUa git SHA, test date. + +--- + +## Gate summary table + +| Gate | Command / Procedure | Pass criterion | Owner | +|------|---------------------|----------------|-------| +| G1 | `pwsh ./scripts/compliance/phase-6-all.ps1` | Exit 0, no `[FAIL]` | Dev | +| G2 | `dotnet test ZB.MOM.WW.OtOpcUa.slnx` | >= 1159 passing, <= 1 failure | Dev | +| G3 | Audit PR list in release-readiness.md | All blockers show CLOSED | Dev | +| G4 | Audit driver table | All 8 drivers listed as shipped | Dev | +| G5 | Run deployment checklist doc | All items checked; Fleet Admin signs off | Fleet Admin | +| G6 | Browse/read/subscribe against live Galaxy | Good quality, non-empty tree | Dev (dev box) | +| G7 | FOCAS CNC smoke — see live-hardware runbook | Session open, Good quality reads | Dev + lab hardware | +| G8 | CTT profile run against live endpoint | Zero mandatory failures | Dev + CTT tool | +| G9 | Redundancy cutover runbook | A-block + B2 pass with >= 1 client | Dev + two instances |
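For tracking purposes, the summary table reduces to a checklist. The sketch below assumes results are kept as a simple gate-to-boolean mapping and that every gate must be recorded as passed before GA. The `open_gates` helper and the mapping format are hypothetical; this plan does not prescribe a tracking tool.

```python
GATES = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9"]


def open_gates(results: dict[str, bool]) -> list[str]:
    """Return the gates not yet recorded as passed, in G1..G9 order."""
    # A gate missing from `results` is treated as still open.
    return [gate for gate in GATES if not results.get(gate, False)]


if __name__ == "__main__":
    # Example: script gates and audits done, lab gates still outstanding.
    recorded = {"G1": True, "G2": True, "G3": True, "G4": True}
    print(open_gates(recorded))
```

Treating an unrecorded gate as open is the conservative default: a gate only closes when someone explicitly records a pass.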