docs: add four planning runbooks for Phase 6.3 interop, v2 GA gates, live-hardware validation, and alarms worker wiring
Produces docs/plans/ entries for tasks #13, #15, #16, and #17-#20:

- phase-6-3-redundancy-interop-plan.md: automation boundary analysis, concrete test matrix (A/B/C blocks), and a step-by-step cutover runbook for the deferred Stream F client interop work
- v2-ga-lab-gates-plan.md: exact gate list with command, pass criterion, and owner for each of the nine v2 GA exit criteria
- live-hardware-validation-runbooks.md: one runbook per driver (FOCAS CNC smoke #54, AB CIP live-boot, TwinCAT wire-live) with preconditions, procedure, expected results, and recording template
- alarms-worker-wiring-plan.md: focused plan for A.2/A.3-A.4/C.1/D.1 worker wiring in the mxaccessgw sibling repo, documenting the discovered AVEVA API surface, the architectural decision that blocks A.2, the dependency order, and what each item needs to unblock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
340
docs/plans/alarms-worker-wiring-plan.md
Normal file
@@ -0,0 +1,340 @@
# Alarms Worker Wiring Plan

> **Context**: The alarms-over-gateway epic shipped 19 PRs across the
> `lmxopcua` and `mxaccessgw` repos (merged 2026-04-30). Contracts are live;
> the sub-attribute fallback path keeps Galaxy alarms functional today. Four
> items remain as inert scaffolds gated on a dev-rig finding. This document is
> the focused implementation plan for those four items only.
>
> **Do not duplicate `docs/plans/alarms-over-gateway.md`**: that document is
> the full historical record of all 19 PRs. This document covers only what is
> still to be done and exactly what blocks each item.
>
> **This work lives in the mxaccessgw sibling repo** at
> `C:\Users\dohertj2\Desktop\mxaccessgw\`, not in this (lmxopcua) repo,
> except where lmxopcua changes are noted explicitly.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Dev-rig finding that blocks everything (2026-04-30)
|
||||||
|
|
||||||
|
During PR A.2 work the following was discovered on the dev box:
|
||||||
|
|
||||||
|
> The MXAccess COM Toolkit at
|
||||||
|
> `C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll`
|
||||||
|
> exposes **no alarm-event family** — only `OnDataChange`, `OnWriteComplete`,
|
||||||
|
> `OperationComplete`, `OnBufferedDataChange`.
|
||||||
|
>
|
||||||
|
> AVEVA's `aaAlarmManagedClient` / `ArchestrAAlarmsAndEvents.SDK` assemblies
|
||||||
|
> are **x64-only** and incompatible with the worker's x86 net48 bitness.
|
||||||
|
|
||||||
|
The architectural decision required before any of A.2, A.3/A.4, C.1 can ship:
|
||||||
|
|
||||||
|
> **Either** accept the value-driven sub-attribute path as the production
|
||||||
|
> architecture (operator-comment fidelity is the only v1 regression), **or**
|
||||||
|
> add an x64 alarm-helper sub-process alongside the x86 worker.
|
||||||
|
|
||||||
|
Resolution drives the implementation shape of every item below. The plan
|
||||||
|
presented here assumes the x64 alarm-helper sub-process route (the higher
|
||||||
|
parity option), but notes the sub-attribute-only exit at each step.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Discovered AVEVA API surface
|
||||||
|
|
||||||
|
Before implementing, verify the following against the AVEVA SDK actually
installed on the dev box and in the mxaccessgw worker's deployment folder:

| Assembly | Bitness | Likely location | Key types |
|----------|---------|-----------------|-----------|
| `ArchestrA.MXAccess.dll` | x86 | `C:\Program Files (x86)\ArchestrA\Framework\Bin\` | `IMxAlarmEventSink`, `MxAlarmEventArgs` (**confirm exists at actual version**) |
| `aaAlarmManagedClient.dll` | x64 | `C:\Program Files\ArchestrA\Framework\Bin\` | `AlarmClient`, `IAlarmConsumer`, `AlarmEventArgs` |
| `ArchestrAAlarmsAndEvents.SDK.dll` | x64 | Same or Historian SDK folder | `AlarmHistorianWriter`, `GetAlarmExtendedRec` |

The AVEVA MXAccess Toolkit reference in the mxaccessgw repo (`gateway.md`) is
the canonical API doc for the gateway worker's side. The alarm-client API is
documented separately; verify the following call shapes during PR A.2:

| Operation | Likely API | Notes |
|-----------|-----------|-------|
| Subscribe to alarm events | `AlarmClient.RegisterConsumer(IAlarmConsumer)` + `AlarmClient.Subscribe(filterSpec)` | Confirm exact method signatures against the SDK version on the dev box |
| Receive alarm event | `IAlarmConsumer.OnAlarmEvent(AlarmEventArgs)` callback | Field set: alarm name, source, type, transition kind, severity, timestamps, operator fields |
| Acknowledge alarm | `AlarmClient.AcknowledgeAlarm(alarmRef, comment, userPrincipal)` or equivalent | Confirm whether this is synchronous or returns a status |
| Query active alarms | `AlarmClient.GetAlarmExtendedRec(filter)` or `GetActiveAlarms()` | Returns current active set for ConditionRefresh |
| Get statistics | `AlarmClient.GetStatistics()` | Optional; useful for worker health checks |

Record the exact method signatures against the installed SDK before starting
A.2, because the proto field set in `OnAlarmTransitionEvent` must match the SDK's
actual payload.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Dependency order
|
||||||
|
|
||||||
|
```
A.2 (worker: AlarmClient subscription)
└─► A.3 (gateway: dispatch OnAlarmTransition + AcknowledgeAlarm RPC handler)
└─► A.4 (gateway: QueryActiveAlarms RPC handler)
└─► lmxopcua B.2 (GalaxyDriver IAlarmSource live)
└─► C.1 (sidecar: AahClientManagedAlarmEventWriter live)
└─► D.1 (smoke artifact captured)
```
|
||||||
|
|
||||||
|
A.2 is the single blocking item. All subsequent items unblock serially once
|
||||||
|
A.2 delivers alarm events through the channel.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Item A.2 — Worker: subscribe to MxAccess alarm event source
|
||||||
|
|
||||||
|
**Repo**: `mxaccessgw` — `src\MxGateway.Worker\` (net48, x86)
|
||||||
|
|
||||||
|
**What it needs**:
|
||||||
|
|
||||||
|
The worker must subscribe to AVEVA's alarm events and fan them into the same
|
||||||
|
bounded channel the data-change pump uses, translating each MxAccess alarm
|
||||||
|
event into a `WorkerEvent` proto with family `MX_EVENT_FAMILY_ON_ALARM_TRANSITION`
|
||||||
|
(defined in PR A.1, already merged).
|
||||||
|
|
||||||
|
**Architectural choice determines the implementation path**:
|
||||||
|
|
||||||
|
**Option X1 — aaAlarmManagedClient in a new x64 alarm-helper process**
|
||||||
|
|
||||||
|
Add a second worker-mode sub-process (`MxGateway.AlarmWorker`, net8.0 x64)
|
||||||
|
alongside the existing x86 worker. The AlarmWorker:
|
||||||
|
|
||||||
|
1. Loads `aaAlarmManagedClient.dll` (x64) on startup.
|
||||||
|
2. Calls `AlarmClient.RegisterConsumer` with a `WorkerAlarmConsumer` sink.
|
||||||
|
3. Calls `AlarmClient.Subscribe` with a session-level filter (all alarms for
|
||||||
|
the session's Galaxy scope).
|
||||||
|
4. Translates each `IAlarmConsumer.OnAlarmEvent` callback into a protobuf
|
||||||
|
`WorkerEvent` (family `ON_ALARM_TRANSITION`) and writes it to an IPC
|
||||||
|
channel readable by the gateway server-side multiplexer.
|
||||||
|
5. Handles session lifecycle: re-subscribes after reconnect; unsubscribes on
|
||||||
|
session close.
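
The five AlarmWorker steps above can be modelled in a few lines. The following is a minimal, language-neutral sketch in Python, not the worker's C# code. `WorkerAlarmConsumer` is named in this plan, but its methods, the raw payload keys (`alarm_ref`, `transition`, `severity`), and the drop-when-full policy are assumptions made for illustration. The plan only says the channel is bounded and does not state what happens on overflow.

```python
import queue
from dataclasses import dataclass

# Family name taken from the plan; everything else here is illustrative.
FAMILY_ON_ALARM_TRANSITION = "MX_EVENT_FAMILY_ON_ALARM_TRANSITION"


@dataclass(frozen=True)
class WorkerEvent:
    family: str
    alarm_ref: str
    transition: str
    severity: int


class WorkerAlarmConsumer:
    """Toy stand-in for the consumer sink (hypothetical API)."""

    def __init__(self, capacity):
        # Bounded channel shared with the event pump.
        self.channel = queue.Queue(maxsize=capacity)
        self.subscribed = False
        self.dropped = 0

    def subscribe(self):
        self.subscribed = True

    def unsubscribe(self):
        self.subscribed = False

    def on_alarm_event(self, raw):
        """Translate one raw callback payload and enqueue it.

        Returns True if the event was queued, False if it was ignored
        (not subscribed) or dropped (channel full; assumed policy).
        """
        if not self.subscribed:
            return False
        event = WorkerEvent(
            family=FAMILY_ON_ALARM_TRANSITION,
            alarm_ref=raw["alarm_ref"],
            transition=raw["transition"],
            severity=int(raw["severity"]),
        )
        try:
            self.channel.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False
```

Counting drops instead of blocking keeps the SDK callback thread from stalling; whether the real worker should block, drop, or coalesce is a decision to make during PR A.2.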
|
||||||
|
|
||||||
|
IPC from AlarmWorker to gateway: simplest option is a named pipe or an
|
||||||
|
in-process queue if the AlarmWorker is hosted in the same gateway process
|
||||||
|
space as a separate `IHostedService`.
|
||||||
|
|
||||||
|
**Option X2 — Accept sub-attribute fallback as production (no A.2 work)**
|
||||||
|
|
||||||
|
If the architectural decision is to accept the sub-attribute path as permanent:
|
||||||
|
|
||||||
|
- `MxAccessAlarmEventSink.Attach()` in the worker remains a no-op (as
|
||||||
|
currently coded with the architectural comment).
|
||||||
|
- The `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` proto family stays defined but
|
||||||
|
the gateway never emits events on it.
|
||||||
|
- lmxopcua's `GalaxyDriver` does not implement `IAlarmSource` for the
|
||||||
|
native path; the value-driven sub-attribute path remains the production
|
||||||
|
path.
|
||||||
|
- The only regression vs. v1 is operator-comment fidelity on Galaxy alarms.
|
||||||
|
- C.1 is still needed if scripted-alarm historian write-back is required.
|
||||||
|
|
||||||
|
**What blocks it**: the architectural decision above. Once made, A.2 becomes
|
||||||
|
a 2–3 day implementation task (sub-process plumbing + proto translation +
|
||||||
|
unit tests for the consumer sink cancellation behaviour).
|
||||||
|
|
||||||
|
**Tests to write (when A.2 proceeds)**:
|
||||||
|
|
||||||
|
- `WorkerAlarmConsumerTests` — fake `IAlarmConsumer` source emits canned
|
||||||
|
transitions; assert each produces the correct `WorkerEvent` body shape.
|
||||||
|
- Cancellation/session-close test — closing the session unsubscribes from
|
||||||
|
the AlarmClient cleanly (no leaked `IAlarmConsumer` reference if the
|
||||||
|
worker is recycled mid-session).
|
||||||
|
- Re-subscribe-after-reconnect test — `ReconnectSupervisor` triggers a
|
||||||
|
reconnect; assert the alarm consumer re-attaches to the new session.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Item A.3 / A.4 — Gateway: dispatch and RPC handlers
|
||||||
|
|
||||||
|
**Repo**: `mxaccessgw` — `src\MxGateway.Server\`
|
||||||
|
|
||||||
|
**Depends on**: A.2 delivering `WorkerEvent` bodies with family
|
||||||
|
`MX_EVENT_FAMILY_ON_ALARM_TRANSITION`.
|
||||||
|
|
||||||
|
**What it needs**:
|
||||||
|
|
||||||
|
### A.3 — Dispatch + AcknowledgeAlarm
|
||||||
|
|
||||||
|
1. The session-level event multiplexer (`Sessions\SessionEventStream.cs` or
|
||||||
|
equivalent — verify name in the mxaccessgw repo) must recognise the new
|
||||||
|
`WorkerEvent` body and forward it as an `MxEvent` with family
|
||||||
|
`MX_EVENT_FAMILY_ON_ALARM_TRANSITION` to every `StreamEvents` subscriber
|
||||||
|
for that session.
|
||||||
|
|
||||||
|
2. New RPC handler `AcknowledgeAlarm` builds an `AlarmAcknowledgeCommand`
|
||||||
|
worker command and forwards it to the alarm-helper process (Option X1) or
|
||||||
|
the worker's MxAccess session (Option X2 if MxAccess exposes ack). Maps
|
||||||
|
the reply status to `AcknowledgeAlarmReply.MxStatusProxy`.
|
||||||
|
|
||||||
|
3. Authorization: new API scope `invoke:alarm-ack` on the API key. Keys
|
||||||
|
without it receive `PERMISSION_DENIED`. Follow the existing scope-check
|
||||||
|
pattern used by `invoke:write`.
|
||||||
|
|
||||||
|
### A.4 — QueryActiveAlarms
|
||||||
|
|
||||||
|
1. New RPC handler `QueryActiveAlarms` calls `AlarmClient.GetAlarmExtendedRec`
|
||||||
|
(or `GetActiveAlarms` — confirm the method name during implementation)
|
||||||
|
on the alarm-helper process, batches results into `ActiveAlarmSnapshot`
|
||||||
|
proto messages, and streams them back to the caller.
|
||||||
|
|
||||||
|
2. New API scope `invoke:alarm-query` (separate from ack so read-only clients
|
||||||
|
can refresh without ack rights).
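
The two-scope split described for A.3 and A.4 amounts to a lookup from RPC name to required scope. A minimal sketch in Python follows; the scope strings, RPC names, and `PERMISSION_DENIED` come from this plan, while the `authorize` function and its return convention are hypothetical and do not reflect the gateway's actual scope-check code.

```python
ACK_SCOPE = "invoke:alarm-ack"      # from the plan
QUERY_SCOPE = "invoke:alarm-query"  # from the plan

# RPC name -> scope it requires. RPC names come from the plan.
REQUIRED_SCOPE = {
    "AcknowledgeAlarm": ACK_SCOPE,
    "QueryActiveAlarms": QUERY_SCOPE,
}


def authorize(rpc, key_scopes):
    """Return "OK" or "PERMISSION_DENIED" for the given API key scopes."""
    required = REQUIRED_SCOPE[rpc]
    return "OK" if required in key_scopes else "PERMISSION_DENIED"
```

A read-only client holding only `invoke:alarm-query` can therefore refresh the active set but is refused on `AcknowledgeAlarm`.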
|
||||||
|
|
||||||
|
**What blocks A.3/A.4**: A.2 must deliver `WorkerEvent` bodies on the channel.
|
||||||
|
A.3/A.4 are pure dispatch wiring once the events arrive.
|
||||||
|
|
||||||
|
**Tests to write**:
|
||||||
|
|
||||||
|
- A.3 dispatch test — fake worker emits an `AlarmTransition` event; assert
|
||||||
|
the gateway forwards it on the `StreamEvents` channel of every subscribed
|
||||||
|
session (mirrors existing `OnDataChange` dispatch tests).
|
||||||
|
- A.3 AcknowledgeAlarm auth test — existing key without `invoke:alarm-ack`
|
||||||
|
scope returns `PERMISSION_DENIED`.
|
||||||
|
- A.4 pagination test — synthetic active-alarm set of 0 / 1 / 100 entries;
|
||||||
|
assert each streams back as separate `ActiveAlarmSnapshot` messages.
|
||||||
|
- Integration (parity rig — requires dev box with AVEVA platform):
|
||||||
|
trigger a real Galaxy alarm, call `QueryActiveAlarms`, assert the alarm
|
||||||
|
appears in the stream; call `AcknowledgeAlarm`, assert the alarm transitions
|
||||||
|
to `ActiveAcked` and an `Acknowledge` transition event appears on
|
||||||
|
`StreamEvents`.
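
The A.4 pagination test in the list above expects one `ActiveAlarmSnapshot` message per active alarm. A minimal Python sketch of that streaming shape is shown here; the dictionary layout and the `alarm_ref` key are placeholders, since the real message is a proto defined in the mxaccessgw repo.

```python
from typing import Iterable, Iterator


def stream_active_alarms(active: Iterable[dict]) -> Iterator[dict]:
    """Yield one snapshot message per active alarm, preserving input order."""
    for alarm in active:
        # One message per alarm, never a single batched reply.
        yield {"type": "ActiveAlarmSnapshot", "alarm_ref": alarm["alarm_ref"]}
```

Feeding it synthetic sets of 0, 1, and 100 entries gives the three cases named in the test description.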
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Item C.1 — Historian sidecar: AahClientManagedAlarmEventWriter
|
||||||
|
|
||||||
|
**Repo**: `lmxopcua` — `src\Drivers\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\`
|
||||||
|
|
||||||
|
**Depends on**: Architectural decision (the sidecar uses `aahClientManaged`
|
||||||
|
x64, which is not bitness-constrained like the worker). C.1 is independently
|
||||||
|
unblockable from A.2 if the goal is to wire up the scripted-alarm historian
|
||||||
|
path.
|
||||||
|
|
||||||
|
**Current state**:
|
||||||
|
|
||||||
|
`SdkAlarmHistorianWriteBackend` in `src\MxGateway.Worker\MxAccess\` is a
|
||||||
|
placeholder returning `RetryPlease`. The lmxopcua sidecar's `WriteAlarmEvents`
|
||||||
|
IPC slot is defined in `Ipc\Contracts.cs` but `Program.cs` constructs
|
||||||
|
`HistorianFrameHandler` without an `alarmWriter` (line 57 per the alarms plan).
|
||||||
|
The `IAlarmEventWriter` interface exists; only the production implementation
|
||||||
|
and the consumer wiring are missing.
|
||||||
|
|
||||||
|
**What it needs**:
|
||||||
|
|
||||||
|
1. New `AahClientManagedAlarmEventWriter.cs` implementing `IAlarmEventWriter`
|
||||||
|
(defined in `Ipc\HistorianFrameHandler.cs`). Calls `aahClientManaged`'s
|
||||||
|
alarm-event write API — same path v1's `GalaxyHistorianWriter` used.
|
||||||
|
Uses `HistorianClusterEndpointPicker` for multi-node routing.
|
||||||
|
Maps `MxStatus` write outcomes to `HistorianWriteOutcome` enum
|
||||||
|
(Ack / PermanentFail / RetryPlease).
|
||||||
|
|
||||||
|
2. `Program.cs` — build `AahClientManagedAlarmEventWriter` next to the
|
||||||
|
existing `BuildHistorian()` call; pass it to `HistorianFrameHandler`.
|
||||||
|
Gate behind `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED` env var (default `true`
|
||||||
|
when `OTOPCUA_HISTORIAN_ENABLED=true`).
|
||||||
|
|
||||||
|
3. `Install-Services.ps1` — add the new env var to the install-time block.
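
The outcome mapping in item 1 can be expressed as a lookup table. The sketch below is Python rather than the sidecar's C#, and the status names are hypothetical placeholders: only `BadCommunicationError` appears in this plan, and treating it (and any unknown status) as `RetryPlease` is an assumption to confirm against the real `MxStatus` values.

```python
from enum import Enum


class HistorianWriteOutcome(Enum):
    ACK = "Ack"
    PERMANENT_FAIL = "PermanentFail"
    RETRY_PLEASE = "RetryPlease"


# Hypothetical status names; only BadCommunicationError appears in the plan.
_OUTCOME_BY_STATUS = {
    "Good": HistorianWriteOutcome.ACK,
    "BadCommunicationError": HistorianWriteOutcome.RETRY_PLEASE,
    "BadInvalidArgument": HistorianWriteOutcome.PERMANENT_FAIL,
}


def map_outcome(status):
    # Assumption: unknown statuses are retried rather than discarded.
    return _OUTCOME_BY_STATUS.get(status, HistorianWriteOutcome.RETRY_PLEASE)


def map_batch(statuses):
    """Per-row outcomes, parallel to the input order."""
    return [map_outcome(s) for s in statuses]
```

Keeping the table as data makes the "every `MxStatus` maps to an expected outcome" test a simple iteration over the table.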
|
||||||
|
|
||||||
|
**What blocks C.1**: access to the `aahClientManaged` SDK on the dev box
|
||||||
|
(confirmed available per `project_aveva_platform_installed.md` — AVEVA
|
||||||
|
Historian SDK is present). C.1 can proceed without A.2 since the sidecar's
|
||||||
|
`aahClientManaged` is x64 and does not share the worker's x86 bitness
|
||||||
|
constraint.
|
||||||
|
|
||||||
|
**Tests to write**:
|
||||||
|
|
||||||
|
- Outcome-mapping table: every `MxStatus` on alarm-write → expected
|
||||||
|
`HistorianWriteOutcome`.
|
||||||
|
- Batch test: 1 / 100 / 1000 events through a fake `aahClientManaged`
|
||||||
|
writer; assert per-row outcome list parallel to input order.
|
||||||
|
- Cluster failover: primary Historian node returns `BadCommunicationError`;
|
||||||
|
picker rotates to secondary; eventual success.
|
||||||
|
- `Program.cs` seam: assert handler constructed with alarm writer when env
|
||||||
|
var enabled; without it when disabled.
|
||||||
|
- Live integration (parity rig): write a synthetic alarm event through the
|
||||||
|
IPC; query it back via `ReadEvents`; assert round-trip fidelity.
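
The `Program.cs` seam test above depends on how the two environment variables combine. A small Python sketch of that gating logic follows. The variable names and the "default `true` when the historian is enabled" rule come from this plan; the accepted spellings of a true value, the default of `OTOPCUA_HISTORIAN_ENABLED`, and the rule that a disabled historian forces alarm writes off are assumptions.

```python
_TRUE = {"1", "true", "yes", "on"}  # accepted spellings are an assumption


def _flag(env, name, default):
    """Read a boolean flag from a mapping, falling back to `default`."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in _TRUE


def alarm_write_enabled(env):
    """Decide whether the alarm writer should be constructed."""
    historian_on = _flag(env, "OTOPCUA_HISTORIAN_ENABLED", default=False)
    if not historian_on:
        # Assumed: no historian means no alarm write-back.
        return False
    return _flag(env, "OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED", default=True)
```

In a real process the mapping would be the environment itself, for example `alarm_write_enabled(os.environ)`.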
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Item D.1 — Smoke artifact
|
||||||
|
|
||||||
|
**Repo**: `lmxopcua` (deployment refresh) + `mxaccessgw` (rig verification)
|
||||||
|
|
||||||
|
**Depends on**: A.2, A.3, A.4, and C.1 all passing on the dev rig with a live
|
||||||
|
Galaxy and live Historian.
|
||||||
|
|
||||||
|
**Current state**: The deployment script `Refresh-Services.ps1` (task D.1) has
|
||||||
|
shipped as PR #417 (merged 2026-04-30). What was NOT captured at that time was
|
||||||
|
a smoke artifact — a log snippet or test output confirming that:
|
||||||
|
|
||||||
|
1. An alarm transition event from a live Galaxy alarm reaches lmxopcua's
|
||||||
|
`AlarmConditionService` via the new `IAlarmSource` path (not the fallback).
|
||||||
|
2. A scripted-alarm historian write-back reaches AVEVA Historian via the
|
||||||
|
sidecar `IAlarmEventWriter`.
|
||||||
|
|
||||||
|
**What it needs**:
|
||||||
|
|
||||||
|
Once A.2, A.3, C.1 are wired on the parity rig:
|
||||||
|
|
||||||
|
1. Deploy the updated mxaccessgw (with A.2 / A.3 / A.4 changes).
|
||||||
|
2. Deploy the updated sidecar (with C.1 changes).
|
||||||
|
3. Run `Refresh-Services.ps1` to confirm clean service restarts.
|
||||||
|
4. Trigger a Galaxy alarm (e.g. set an AnalogLimitAlarm attribute out of
|
||||||
|
range in Galaxy IDE).
|
||||||
|
5. Observe the lmxopcua OPC UA alarm surface via the Client CLI:
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
alarms -u opc.tcp://localhost:4840 --subscribe
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: the alarm condition appears on the OPC UA A&E surface within
|
||||||
|
2 × publishing interval.
|
||||||
|
|
||||||
|
6. Trigger a scripted alarm via the lmxopcua `ScriptedAlarmEngine`
|
||||||
|
(or an OPC UA method call if one is wired).
|
||||||
|
7. Confirm in the AVEVA Historian that the scripted alarm event is stored
|
||||||
|
(query via the Historian client or HistorianWatch tool).
|
||||||
|
|
||||||
|
8. Capture log snippets:
|
||||||
|
- mxaccessgw log: `[INF] AlarmTransition dispatched sessionId=<> alarmRef=<>`
|
||||||
|
- lmxopcua log: `[INF] AlarmConditionService: IAlarmSource event alarmRef=<> origin=Driver`
|
||||||
|
- Sidecar log: `[INF] AahClientManagedAlarmEventWriter: Wrote <n> alarm events`
|
||||||
|
|
||||||
|
9. Commit the log snippets as `docs/plans/alarms-d1-smoke-artifact.md`
|
||||||
|
(a new doc, not this one).
|
||||||
|
|
||||||
|
**What blocks D.1**: all of A.2, A.3, C.1, plus the operator decision on the
|
||||||
|
x64 alarm-helper architecture (or explicit acceptance of the sub-attribute
|
||||||
|
fallback as production).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Summary of blocks
|
||||||
|
|
||||||
|
| Item | Blocked by | Estimated effort once unblocked |
|------|-----------|--------------------------------|
| A.2 | Architectural decision (x64 alarm-helper vs. sub-attribute fallback as production) | 2–3 days implementation; 1 day tests |
| A.3 | A.2 delivering WorkerEvent bodies | 1–2 days |
| A.4 | A.2 (active-alarm query needs AlarmClient session) | 1 day |
| C.1 | aahClientManaged SDK access (available on dev box); NOT blocked by A.2 | 1–2 days |
| D.1 | A.2 + A.3 + C.1 all passing on parity rig | 0.5 day (smoke + artifact capture) |
|
||||||
|
|
||||||
|
C.1 can proceed in parallel with A.2 / A.3 since the sidecar's `aahClientManaged`
|
||||||
|
is x64 and does not share the worker bitness constraint.
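
The "Blocked by" column can be treated as a small dependency graph. The Python sketch below transcribes it and asks which items are ready for a given set of completed work. It models the architectural decision as a pseudo-item named `decision` and treats C.1 as having no blockers because the SDK is already on the dev box; both are simplifications introduced here, not part of the plan.

```python
from graphlib import TopologicalSorter

# Item -> items that must be finished first, from the "Blocked by" column.
# "decision" is a pseudo-item standing in for the architectural decision.
BLOCKED_BY = {
    "A.2": {"decision"},
    "A.3": {"A.2"},
    "A.4": {"A.2"},
    "C.1": set(),
    "D.1": {"A.2", "A.3", "C.1"},
}


def ready_items(done):
    """Items not yet done whose blockers are all in `done`, sorted by name."""
    finished = set(done)
    return sorted(
        item
        for item, deps in BLOCKED_BY.items()
        if item not in finished and deps <= finished
    )


def full_order():
    """One valid end-to-end ordering of all items (blockers first)."""
    return list(TopologicalSorter(BLOCKED_BY).static_order())
```

With nothing done, only C.1 is ready, which matches the statement that C.1 can proceed in parallel with the A-track.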
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What this plan does NOT cover
|
||||||
|
|
||||||
|
- The value-driven sub-attribute fallback path — already shipped and
|
||||||
|
functional (not being changed).
|
||||||
|
- Track B (lmxopcua EventPump, GalaxyDriver IAlarmSource re-implementation)
|
||||||
|
and Track E (client SDK surface refresh) from the alarms-over-gateway plan —
|
||||||
|
those are in `lmxopcua` and depend on A.3 being live; they follow naturally
|
||||||
|
once A.3 ships.
|
||||||
|
- Galaxy-native alarm historian path — System Platform's own `HistorizeToAveva`
|
||||||
|
toggle on the Galaxy template; not in scope.
|
||||||
|
- Alarm ACL / role-grant surface — already shipped in Phase 6.2.
|
||||||
497
docs/plans/live-hardware-validation-runbooks.md
Normal file
@@ -0,0 +1,497 @@
|
|||||||
|
# Live-Hardware Driver Validation Runbooks
|
||||||
|
|
||||||
|
> **Scope**: These runbooks cover the three driver validation tasks that
|
||||||
|
> require physical hardware or a hardware-equivalent live environment and
|
||||||
|
> cannot be satisfied by the Docker-based simulator fixtures or unit tests
|
||||||
|
> alone.
|
||||||
|
>
|
||||||
|
> Driver implementation is complete. The runbooks document the preconditions,
|
||||||
|
> step-by-step procedure, expected results, and how to record the outcome for
|
||||||
|
> each driver that has an open live-hardware gap.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. FANUC FOCAS — Live CNC Smoke (task #54)
|
||||||
|
|
||||||
|
### Background
|
||||||
|
|
||||||
|
The FOCAS driver (`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/`) uses the
|
||||||
|
pure-managed `WireFocasClient` that speaks FOCAS2 over TCP directly (no
|
||||||
|
`Fwlib64.dll`, no P/Invoke). The integration test suite at
|
||||||
|
`tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.IntegrationTests/` runs against
|
||||||
|
the `focas-mock` Python server (PDU-verified against `fwlibe64.dll` upstream)
|
||||||
|
and covers all call-shapes the driver issues. What the mock cannot cover:
|
||||||
|
|
||||||
|
- Series-specific firmware quirks (e.g. 0i-F vs 30i-B parameter range limits)
|
||||||
|
- Real CNC Ethernet stack behaviour (TCP keep-alive, session-close edge cases)
|
||||||
|
- Series gating: some driver nodes are conditionally emitted based on
|
||||||
|
`CncSeries` — only a physical CNC can confirm the suppression works
|
||||||
|
|
||||||
|
### Preconditions
|
||||||
|
|
||||||
|
| Item | Requirement |
|------|-------------|
| CNC hardware | FANUC CNC with Ethernet option enabled; TCP port 8193 reachable from the dev box or from the host running OtOpcUa |
| CNC series | Any of: 0i-D, 0i-F, 0i-MF, 0i-TF, 16i, 30i-B, 31i, 32i, Power Motion i |
| CNC state | Running state (not E-stop, not alarm) for live axis-data reads |
| Network | TCP reachability from OtOpcUa server host to CNC port 8193 |
| OtOpcUa | Server built and deployed (`dotnet publish` or running via `dotnet run`) |
| Config | DriverInstance row for FOCAS in Config DB (`Type="FOCAS"`, `Backend="wire"`, `Devices[0].HostAddress="focas://<cnc-ip>:8193"`, `Devices[0].Series="<series>"`) |
|
||||||
|
|
||||||
|
### Procedure
|
||||||
|
|
||||||
|
**Step 1 — Verify TCP reachability**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
Test-NetConnection -ComputerName <cnc-ip> -Port 8193
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: `TcpTestSucceeded: True`.
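
On a host without PowerShell, the same reachability check can be approximated with a plain TCP connect. This is a minimal Python sketch, not part of the runbook tooling; the function name and the 3-second default timeout are arbitrary choices made here.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and completes the handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

For the FOCAS check this would be called as `tcp_reachable("<cnc-ip>", 8193)`. Like `Test-NetConnection`, it only proves the port accepts connections, not that a FOCAS session can be opened.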
|
||||||
|
|
||||||
|
**Step 2 — Start OtOpcUa with FOCAS driver configured**
|
||||||
|
|
||||||
|
Ensure the Config DB has the DriverInstance row. Start the server:
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
sc start OtOpcUa
|
||||||
|
# or for a dev run:
|
||||||
|
dotnet run --project src/Server/ZB.MOM.WW.OtOpcUa.Server
|
||||||
|
```
|
||||||
|
|
||||||
|
Watch the Serilog log for:
|
||||||
|
|
||||||
|
```
|
||||||
|
[INF] FocasDriver initializing device focas://<cnc-ip>:8193 series=<series>
|
||||||
|
[INF] FocasDriver device <cnc-ip>:8193 Connected
|
||||||
|
```
|
||||||
|
|
||||||
|
If `EW_SOCKET (-1)` appears, the TCP endpoint is unreachable or the CNC
|
||||||
|
Ethernet option is not active.
|
||||||
|
|
||||||
|
**Step 3 — Browse the address space**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
browse -u opc.tcp://localhost:4840 -r -d 3
|
||||||
|
```
|
||||||
|
|
||||||
|
Expected: a node tree containing at minimum:
|
||||||
|
|
||||||
|
```
FOCAS/
  <device>/
    Identity/
      SeriesNumber
      Version
      MaxAxes
    Status/
      RunState
      Mode
      EmergencyStop
    Axes/
      <X|Y|Z>/
        AbsolutePosition
        MachinePosition
```
|
||||||
|
|
||||||
|
Nodes suppressed by the `Series` capability gate will be absent — this is
|
||||||
|
correct behaviour.
|
||||||
|
|
||||||
|
**Step 4 — Read identity nodes**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS/<device>/Identity/SeriesNumber"
|
||||||
|
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS/<device>/Identity/MaxAxes"
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: `Good` quality; `SeriesNumber` matches the string printed on the CNC
|
||||||
|
control panel (e.g. `"0i-F"`); `MaxAxes` is a non-zero integer.
|
||||||
|
|
||||||
|
**Step 5 — Read live status and axis data**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS/<device>/Status/RunState"
|
||||||
|
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
read -u opc.tcp://localhost:4840 -n "ns=2;s=FOCAS/<device>/Axes/X/AbsolutePosition"
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: both return `Good` quality. `AbsolutePosition` is a `Double` (e.g.
|
||||||
|
`-12.3456` mm). Manually compare against the machine's position display.
|
||||||
|
|
||||||
|
**Step 6 — Subscribe and observe polling**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
subscribe -u opc.tcp://localhost:4840 `
|
||||||
|
-n "ns=2;s=FOCAS/<device>/Status/RunState" -i 500
|
||||||
|
```
|
||||||
|
|
||||||
|
Let it run for 30 s while jogging an axis or changing mode on the CNC operator
|
||||||
|
panel. Pass: at least one data-change event received within 5 s; events
|
||||||
|
continue arriving every ~500 ms.
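
The pass criterion for Step 6 can be checked mechanically if the event arrival times are logged. The Python sketch below assumes arrival times are recorded as seconds since the subscription started; the `slack` factor that interprets "~500 ms" as "no gap longer than twice the interval" is an assumption, not something the runbook specifies.

```python
def subscription_passes(arrivals, first_within=5.0, interval=0.5, slack=2.0):
    """Check event arrival times (seconds since subscribe) against Step 6.

    - the first event must arrive within `first_within` seconds
    - no gap between consecutive events may exceed `interval * slack`
    """
    if not arrivals:
        return False
    times = sorted(arrivals)
    if times[0] > first_within:
        return False
    max_gap = interval * slack
    return all(later - earlier <= max_gap
               for earlier, later in zip(times, times[1:]))
```

A single event inside the first 5 s passes; whether a run with only one event is acceptable for a 30 s window is left to the operator.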
|
||||||
|
|
||||||
|
**Step 7 — 2-minute soak**
|
||||||
|
|
||||||
|
Let the server run for 2 minutes with the subscription active. Pass: no
|
||||||
|
`EW_SOCKET`, `EW_HANDLE`, `EW_BUSY` errors in the Serilog output; subscribed
|
||||||
|
node continues delivering updates.
|
||||||
|
|
||||||
|
**Step 8 — Run the FOCAS e2e script**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
pwsh scripts/e2e/test-focas.ps1 -ServerUrl opc.tcp://localhost:4840 `
|
||||||
|
-DriverInstance "<device>" -Series "<series>"
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: script exits 0.
|
||||||
|
|
||||||
|
### Expected results
|
||||||
|
|
||||||
|
| Check | Expected |
|-------|----------|
| TCP connect to CNC port 8193 | Success |
| FOCAS session open (`cnc_allclibhndl3`) | EW_OK (0) in driver log |
| `Identity/SeriesNumber` | Matches CNC panel, `Good` quality |
| `Identity/MaxAxes` | Non-zero integer, `Good` quality |
| `Status/RunState` | Integer 0–3, `Good` quality |
| `Axes/X/AbsolutePosition` | Double, `Good` quality, matches display |
| Subscribe: events delivered | >= 1 event within 5 s, then updates every ~500 ms for the 30 s run |
| 2-minute soak: no FOCAS errors | Clean Serilog log |
|
||||||
|
|
||||||
|
### Recording the outcome
|
||||||
|
|
||||||
|
```
FOCAS live-CNC smoke (task #54)
Date:           YYYY-MM-DD
CNC:            <manufacturer> <model> series=<series> firmware=<version>
IP:             <cnc-ip>:8193
OtOpcUa SHA:    <git sha>

TCP connect:    PASS
Session open:   PASS
Identity reads: PASS SeriesNumber="<>" MaxAxes=<n>
Status read:    PASS RunState=<n>
Axis read:      PASS X/AbsolutePosition=<value>
Subscribe:      PASS <n> events in 30s
2-min soak:     PASS no errors
e2e script:     PASS
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Allen-Bradley CIP — Live Boot (ControlLogix)
|
||||||
|
|
||||||
|
### Background
|
||||||
|
|
||||||
|
The AB CIP driver (`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip/`) uses
|
||||||
|
`libplctag` 1.6.x. The Docker `ab_server` simulator covers connectivity and
|
||||||
|
atomic type reads (7 integration tests). Live-boot validation is needed to
|
||||||
|
confirm UDT shape-reading, array tag access, and the CIP packing behaviour on
|
||||||
|
a real ControlLogix backplane — all gaps acknowledged in
|
||||||
|
`docs/drivers/AbServer-Test-Fixture.md`.
|
||||||
|
|
||||||
|
AB CIP live-boot was first verified against a ControlLogix rig at PR #222.
|
||||||
|
Continue running this runbook before each release.
|
||||||
|
|
||||||
|
### Preconditions
|
||||||
|
|
||||||
|
| Item | Requirement |
|------|-------------|
| PLC hardware | ControlLogix (preferred) or CompactLogix; firmware 20+ for request packing |
| Network | TCP port 44818 reachable from OtOpcUa server host |
| PLC state | Running; at least one DINT / REAL / BOOL / STRING controller-scoped tag defined |
| OtOpcUa | Server built and deployed |
| Config | DriverInstance row: `Type="AbCip"`, `Host="<plc-ip>"`, `Path="1,0"`, `PlcType="ControlLogix"` |
|
||||||
|
|
||||||
|
### Procedure
|
||||||
|
|
||||||
|
**Step 1 — Verify TCP reachability**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
Test-NetConnection -ComputerName <plc-ip> -Port 44818
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: `TcpTestSucceeded: True`.
|
||||||
|
|
||||||
|
**Step 2 — Start OtOpcUa and watch driver log**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
sc start OtOpcUa
|
||||||
|
```
|
||||||
|
|
||||||
|
Look for:
|
||||||
|
|
||||||
|
```
|
||||||
|
[INF] AbCipDriver device <plc-ip> Connected path=1,0 plcType=ControlLogix
|
||||||
|
```
|
||||||
|
|
||||||
|
**Step 3 — Browse the address space**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
browse -u opc.tcp://localhost:4840 -r -d 3
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: node tree shows the tags defined in the ControlLogix project (controller-
|
||||||
|
and program-scoped). UDT members appear as child nodes.
|
||||||
|
|
||||||
|
**Step 4 — Read atomic tags**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
# Read a DINT tag
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
read -u opc.tcp://localhost:4840 -n "ns=2;s=AbCip/<device>/<TagName>"
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: `Good` quality; value type matches the PLC tag type.
|
||||||
|
|
||||||
|
**Step 5 — Read a UDT member**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
read -u opc.tcp://localhost:4840 -n "ns=2;s=AbCip/<device>/<UDT>/<MemberName>"
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: `Good` quality; value matches the live PLC value.
|
||||||
|
|
||||||
|
**Step 6 — Write a DINT tag (if in ReadWrite mode)**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
write -u opc.tcp://localhost:4840 `
|
||||||
|
-n "ns=2;s=AbCip/<device>/<TagName>" -v 42 -t Int32
|
||||||
|
```
|
||||||
|
|
||||||
|
Verify the new value via a subsequent read or on the PLC HMI.
|
||||||
|
|
||||||
|
Pass: read back returns 42 with `Good` quality.
|
||||||
|
|
||||||
|
**Step 7 — Subscribe to a tag that changes**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
subscribe -u opc.tcp://localhost:4840 `
|
||||||
|
-n "ns=2;s=AbCip/<device>/<ChangingTag>" -i 500
|
||||||
|
```
|
||||||
|
|
||||||
|
Jog or trigger a value change on the PLC. Pass: events received within 2 s.
|
||||||
|
|
||||||
|
**Step 8 — Point the integration tests at the live PLC instead of the docker sim and confirm parity**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
$env:AB_SERVER_ENDPOINT = "<plc-ip>:44818"
|
||||||
|
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests `
|
||||||
|
--filter "AbServerFact"
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: all 7 integration tests pass against the live PLC.
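A value assigned through `$env:` stays set for the rest of the PowerShell session, so a later run that is meant to hit the docker sim would silently keep targeting the live PLC. The sketch below shows one way to scope the override to a single command. `Invoke-WithEnvironment` is a hypothetical helper, not something that exists in the repo.

```powershell
# Hypothetical helper: set environment variables for one script block,
# then restore whatever was there before (or remove them again).
function Invoke-WithEnvironment {
    param(
        [Parameter(Mandatory)] [hashtable]   $Variables,
        [Parameter(Mandatory)] [scriptblock] $Script
    )
    $previous = @{}
    try {
        foreach ($name in $Variables.Keys) {
            $path = "Env:$name"
            # Remember the old value; $null means the variable was unset.
            $previous[$name] = if (Test-Path $path) { (Get-Item $path).Value } else { $null }
            Set-Item -Path $path -Value ([string]$Variables[$name])
        }
        & $Script
    }
    finally {
        foreach ($name in $previous.Keys) {
            if ($null -eq $previous[$name]) {
                Remove-Item -Path "Env:$name" -ErrorAction SilentlyContinue
            }
            else {
                Set-Item -Path "Env:$name" -Value $previous[$name]
            }
        }
    }
}

# Example call (left commented out so loading the helper has no side effects):
# Invoke-WithEnvironment -Variables @{ AB_SERVER_ENDPOINT = '<plc-ip>:44818' } -Script {
#     dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests --filter "AbServerFact"
# }
```

The `finally` block runs even when the test command throws, which is the main reason to prefer this over a manual set-then-clear pair.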
|
||||||
|
|
||||||
|
### Expected results
|
||||||
|
|
||||||
|
| Check | Expected |
|
||||||
|
|-------|----------|
|
||||||
|
| TCP connect | Success |
|
||||||
|
| Driver log `Connected` | Present, no error |
|
||||||
|
| Browse | Node tree mirrors PLC tag list |
|
||||||
|
| Atomic read | `Good` quality, correct type |
|
||||||
|
| UDT member read | `Good` quality, correct value |
|
||||||
|
| Write round-trip | Written value reads back |
|
||||||
|
| Subscribe | Events delivered on value change |
|
||||||
|
| Integration tests with live PLC | 7/7 pass |
|
||||||
|
|
||||||
|
### Recording the outcome
|
||||||
|
|
||||||
|
```
|
||||||
|
AB CIP live-boot
|
||||||
|
Date: YYYY-MM-DD
|
||||||
|
PLC: Allen-Bradley <model> firmware=<version>
|
||||||
|
IP: <plc-ip>:44818 path=1,0
|
||||||
|
OtOpcUa SHA: <git sha>
|
||||||
|
|
||||||
|
TCP connect: PASS
|
||||||
|
Driver connected: PASS
|
||||||
|
Browse: PASS <n> tags visible
|
||||||
|
Atomic read: PASS
|
||||||
|
UDT read: PASS
|
||||||
|
Write round-trip: PASS
|
||||||
|
Subscribe: PASS
|
||||||
|
Integration tests: 7/7 PASS
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Beckhoff TwinCAT — Wire-Live Validation
|
||||||
|
|
||||||
|
### Background
|
||||||
|
|
||||||
|
The TwinCAT driver (`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT/`) uses the
|
||||||
|
Beckhoff `TwinCAT.Ads` .NET SDK v6. The integration test suite at
|
||||||
|
`tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests/`
|
||||||
|
(`TwinCAT3SmokeTests.cs`) covers 14 `[TwinCATFact]` methods + one 16-case
|
||||||
|
`[TwinCATTheory]` (30 cases total) against a live ADS runtime. The TCBSD ESXi
|
||||||
|
VM at `10.100.0.128` (AmsNetId `41.169.163.43.1.1`) is the primary fixture
|
||||||
|
runtime (project memory `project_tcbsd_fixture.md`) and bypasses the
|
||||||
|
TwinCAT/Hyper-V conflict on the dev box.
|
||||||
|
|
||||||
|
Live-hardware validation extends beyond the TCBSD VM to confirm the driver
|
||||||
|
works against a production PLC (not just the ESXi test VM) and that the three
|
||||||
|
defects found during original integration testing do not regress on newer
|
||||||
|
firmware:
|
||||||
|
|
||||||
|
1. Notification cycle time unit (250 ms was being set to ~41 min — fixed).
|
||||||
|
2. `STRING(N)` / `WSTRING(N)` type mapper (fixed).
|
||||||
|
3. Bit-indexed BOOL path (fixed).
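The ~41 min figure in item 1 is what one common unit mix-up would produce. The arithmetic below is only a sketch: it assumes (the source does not confirm this) that a 250 ms interval was converted to 100 ns ticks and then passed to an API that interprets the number as milliseconds.

```powershell
# Worked arithmetic only; no TwinCAT.Ads calls are made here.
# Assumption: the defect passed 100 ns ticks where milliseconds were expected.
$intended = [TimeSpan]::FromMilliseconds([double]250)

# 250 ms expressed in 100 ns ticks.
$ticks = $intended.Ticks

# The same number misread as a millisecond count.
$misread = [TimeSpan]::FromMilliseconds([double]$ticks)

"{0} ticks read as milliseconds is about {1:N1} minutes" -f $ticks, $misread.TotalMinutes
```

250 ms is 2,500,000 ticks, and 2,500,000 ms is 2,500 s, or roughly 41.7 minutes. A subscription that goes quiet for about that long is therefore a useful symptom to look for when checking this regression on newer firmware.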
|
||||||
|
|
||||||
|
### Preconditions
|
||||||
|
|
||||||
|
**TCBSD ESXi fixture (primary — no physical hardware needed)**
|
||||||
|
|
||||||
|
| Item | Requirement |
|
||||||
|
|------|-------------|
|
||||||
|
| TCBSD VM | Running on ESXi at `10.100.0.128` |
|
||||||
|
| AMS Net ID | `41.169.163.43.1.1` |
|
||||||
|
| ADS port | `851` (TwinCAT 3 PLC runtime 1) |
|
||||||
|
| PLC project | TwinCAT project from `tests/.../TwinCatProject/` loaded and in Run state |
|
||||||
|
| Network | TCP port 48898 reachable from dev box to `10.100.0.128` |
|
||||||
|
|
||||||
|
**Production PLC (for true wire-live validation)**
|
||||||
|
|
||||||
|
| Item | Requirement |
|
||||||
|
|------|-------------|
|
||||||
|
| TwinCAT hardware | Beckhoff IPC or CX series, TwinCAT 3 (TC3); TC2 is a known gap per fixture doc |
|
||||||
|
| AMS route | Route configured on TwinCAT device back to the OtOpcUa host |
|
||||||
|
| PLC state | Run state |
|
||||||
|
| GVL | At least a `GVL_Fixture.nCounter` DINT and `GVL_Fixture.rSetpoint` REAL present |
|
||||||
|
|
||||||
|
### Procedure — TCBSD ESXi fixture
|
||||||
|
|
||||||
|
**Step 1 — Verify TCBSD VM is reachable**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
Test-NetConnection -ComputerName 10.100.0.128 -Port 48898
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: `TcpTestSucceeded: True`.
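When this check runs as part of a script rather than by eye, it helps to turn the result into a hard stop. The following is a minimal sketch; `Assert-TcpReachable` and its `-Probe` parameter are hypothetical names, and the default probe simply wraps the same `Test-NetConnection` call shown above.

```powershell
# Hypothetical wrapper: fail fast when a precondition port is unreachable.
function Assert-TcpReachable {
    param(
        [Parameter(Mandatory)] [string] $ComputerName,
        [Parameter(Mandatory)] [int]    $Port,
        # Injectable probe so the function can be exercised without a network.
        [scriptblock] $Probe = {
            param($h, $p)
            (Test-NetConnection -ComputerName $h -Port $p -WarningAction SilentlyContinue).TcpTestSucceeded
        }
    )
    $ok = [bool](& $Probe $ComputerName $Port)
    if (-not $ok) {
        throw "TCP $($ComputerName):$Port is not reachable; stop before running the suite."
    }
    "TCP $($ComputerName):$Port reachable"
}

# Example call (commented out so loading the helper touches no network):
# Assert-TcpReachable -ComputerName 10.100.0.128 -Port 48898
```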
|
||||||
|
|
||||||
|
**Step 2 — Run the integration test suite**
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
$env:TWINCAT_TARGET_HOST = "10.100.0.128"
|
||||||
|
$env:TWINCAT_TARGET_NETID = "41.169.163.43.1.1"
|
||||||
|
|
||||||
|
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests `
|
||||||
|
--logger "console;verbosity=normal"
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: all 30 test cases pass (14 `[TwinCATFact]` + 16-case `[TwinCATTheory]`).
|
||||||
|
No `[TwinCATFact]` / `[TwinCATTheory]` skips are expected: both env vars are set, so the
runtime probe should succeed.
|
||||||
|
|
||||||
|
Key tests to watch:
|
||||||
|
|
||||||
|
| Test | Validates |
|
||||||
|
|------|-----------|
|
||||||
|
| `Driver_subscribe_receives_native_ADS_notifications_on_counter_changes` | Native ADS notification path (the cycle-time-unit bug regression) |
|
||||||
|
| `Driver_reads_every_primitive_type_with_correct_mapping` | 16-type theory incl. `STRING(N)` |
|
||||||
|
| `Driver_reads_bit_indexed_BOOL_from_word` | Bit-indexed BOOL fix regression |
|
||||||
|
| `Driver_auto_reconnects_after_underlying_client_is_disposed` | Reconnect on ADS client dispose |
|
||||||
|
| `Driver_routes_reads_per_device_and_isolates_unreachable_peers` | Multi-device isolation |
|
||||||
|
|
||||||
|
**Step 3 — OtOpcUa server browse/read via Client CLI**
|
||||||
|
|
||||||
|
Start OtOpcUa with a TwinCAT DriverInstance pointing at the TCBSD VM:
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
# appsettings.json DriverInstance: Type=TwinCAT, AmsNetId=41.169.163.43.1.1, AmsPort=851
|
||||||
|
sc.exe start OtOpcUa  # use sc.exe: bare `sc` is the Set-Content alias in Windows PowerShell 5.1
|
||||||
|
# or dev run
|
||||||
|
dotnet run --project src/Server/ZB.MOM.WW.OtOpcUa.Server
|
||||||
|
```
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
browse -u opc.tcp://localhost:4840 -r -d 4
|
||||||
|
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
read -u opc.tcp://localhost:4840 -n "ns=2;s=TwinCAT/<device>/GVL_Fixture/nCounter"
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: browse shows the PLC symbol tree; read returns `Good` quality with an
|
||||||
|
integer value.
|
||||||
|
|
||||||
|
### Procedure — Production PLC (optional, for full wire-live signoff)
|
||||||
|
|
||||||
|
If a Beckhoff production IPC is available in the lab:
|
||||||
|
|
||||||
|
**Step 1** — Configure the AMS route on the TwinCAT device (TwinCAT System
|
||||||
|
Manager → Routes → Add static route from the TwinCAT device back to the
|
||||||
|
OtOpcUa server machine).
|
||||||
|
|
||||||
|
**Step 2** — Set env vars and run the integration suite against the production
|
||||||
|
target:
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
$env:TWINCAT_TARGET_HOST = "<production-plc-ip>"
|
||||||
|
$env:TWINCAT_TARGET_NETID = "<production-ams-net-id>"
|
||||||
|
$env:TWINCAT_TARGET_PORT = "851"
|
||||||
|
|
||||||
|
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests
|
||||||
|
```
|
||||||
|
|
||||||
|
**Step 3** — Subscribe to a counter tag for 30 s to confirm native
|
||||||
|
notifications arrive:
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
|
||||||
|
subscribe -u opc.tcp://localhost:4840 `
|
||||||
|
-n "ns=2;s=TwinCAT/<device>/GVL_Fixture/nCounter" -i 100
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: events arrive every ~100 ms driven by the PLC's ADS notification, not
|
||||||
|
by polling.
|
||||||
|
|
||||||
|
### Expected results
|
||||||
|
|
||||||
|
| Check | TCBSD VM | Production PLC |
|
||||||
|
|-------|----------|----------------|
|
||||||
|
| AMS/TCP port 48898 reachable | Required | Required |
|
||||||
|
| Integration tests: all 30 pass | Required | Optional (same 30) |
|
||||||
|
| Notification cycle-time test passes | Required | Required |
|
||||||
|
| Server browse shows symbol tree | Required | Optional |
|
||||||
|
| Read `Good` quality | Required | Optional |
|
||||||
|
| Native ADS notifications deliver in subscribe | Required | Recommended |
|
||||||
|
|
||||||
|
### Known gaps (documented — not blockers for v2 GA)
|
||||||
|
|
||||||
|
Per `docs/drivers/TwinCAT-Test-Fixture.md` §"What it does NOT cover":
|
||||||
|
|
||||||
|
- Multi-hop AMS routing — single-hop only.
|
||||||
|
- TC2 (ADS v1) compatibility — TC3 only.
|
||||||
|
- Notification coalescing under sustained CPU load.
|
||||||
|
- `Symbol version changed (0x0702)` storm handling under rapid PLC re-downloads.
|
||||||
|
|
||||||
|
These are deferred to v3 per `docs/v3/twincat-backlog.md`.
|
||||||
|
|
||||||
|
### Recording the outcome
|
||||||
|
|
||||||
|
```
|
||||||
|
TwinCAT wire-live validation
|
||||||
|
Date: YYYY-MM-DD
|
||||||
|
Target: TCBSD VM 10.100.0.128 AmsNetId=41.169.163.43.1.1 (and/or production PLC details)
|
||||||
|
TwinCAT version: <version>
|
||||||
|
OtOpcUa SHA: <git sha>
|
||||||
|
|
||||||
|
ADS port reachable: PASS
|
||||||
|
Integration tests: 30/30 PASS
|
||||||
|
notification-cycle-time test: PASS (regression check)
|
||||||
|
STRING(N) type test: PASS (regression check)
|
||||||
|
bit-indexed BOOL test: PASS (regression check)
|
||||||
|
Server browse: PASS
|
||||||
|
Read Good quality: PASS
|
||||||
|
Native subscription delivery: PASS <n> events in 30s
|
||||||
|
```
|
||||||
278
docs/plans/phase-6-3-redundancy-interop-plan.md
Normal file
@@ -0,0 +1,278 @@
|
|||||||
|
# Phase 6.3 Redundancy — Client Interop Matrix and Cutover Validation Plan
|
||||||
|
|
||||||
|
> **Scope**: Phase 6.3 redundancy runtime core shipped (PRs #89-90, #98-99,
|
||||||
|
> #24-peerprobe, Stream C node wiring, Stream D lease wrap). What remains is
|
||||||
|
> Stream F (task #150): validating that third-party OPC UA clients honour
|
||||||
|
> our `ServiceLevel` / `ServerUriArray` / `RedundancySupport` signals and
|
||||||
|
> fail over correctly when the Primary drops. This document defines what is
|
||||||
|
> automatable as integration tests, what requires two live instances plus a
|
||||||
|
> real client, and a step-by-step cutover-validation runbook.
|
||||||
|
>
|
||||||
|
> **Source of truth**: `docs/Redundancy.md`, `docs/v2/redundancy-interop-playbook.md`,
|
||||||
|
> `docs/v2/implementation/phase-6-3-redundancy-runtime.md`,
|
||||||
|
> `scripts/compliance/phase-6-3-compliance.ps1`.
|
||||||
|
|
||||||
|
## What is already tested (no live cluster needed)
|
||||||
|
|
||||||
|
The following are covered by existing automated tests that run in ordinary
|
||||||
|
`dotnet test`:
|
||||||
|
|
||||||
|
| Area | Test class(es) | What it asserts |
|
||||||
|
|---|---|---|
|
||||||
|
| `ServiceLevelCalculator` — 8-state matrix | `ServiceLevelCalculatorTests` | All 10 band values; role × self-health × peer-http × peer-ua × apply × recovery × topology combinations |
|
||||||
|
| `RecoveryStateManager` — dwell + witness | `RecoveryStateManagerTests` | 60 s dwell default; premature-exit rejection; witness-required gate |
|
||||||
|
| `ApplyLeaseRegistry` — lease lifecycle | `ApplyLeaseRegistryTests` | Disposal on success / exception / cancellation; watchdog force-close at 10 min |
|
||||||
|
| `ServerRedundancyNodeWriter` — OPC UA variable binding | `ServerRedundancyNodeWriterTests` | `ServiceLevel` byte push; `RedundancySupport` enum; `ServerUriArray` skip-log when node absent |
|
||||||
|
| `RedundancyStatePublisher` — orchestration | `RedundancyStatePublisherTests` | Edge-triggered `OnStateChanged`; idempotent dedup |
|
||||||
|
| `ClusterTopologyLoader` | `ClusterTopologyLoaderTests` | Two-node seed; one-node degenerate; duplicate-URI rejection |
|
||||||
|
| `DraftValidator.ValidateClusterTopology` | `DraftValidatorTests` (8 cases) | NodeCount/mode pairs; Enabled-count vs NodeCount; multiple-Primary rejection |
|
||||||
|
|
||||||
|
Run with:
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet test ZB.MOM.WW.OtOpcUa.slnx --filter "FullyQualifiedName~Redundancy"
|
||||||
|
```
|
||||||
|
|
||||||
|
Compliance gate (every Phase 6.3 static check):
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
pwsh ./scripts/compliance/phase-6-3-compliance.ps1
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass criteria: exit 0; all `[PASS]` lines green; `[DEFERRED]` lines are
|
||||||
|
known-deferred surfaces, not failures.
|
||||||
|
|
||||||
|
## What cannot be automated — requires two live instances
|
||||||
|
|
||||||
|
The scenarios below require two running `OtOpcUa.Server` processes in the
|
||||||
|
same `ServerCluster`, a real SQL Server Config DB, and at least one driver
|
||||||
|
instance with a reachable endpoint (simulator or real PLC).
|
||||||
|
|
||||||
|
### Why it cannot be unit/integration-tested in-process
|
||||||
|
|
||||||
|
- UaExpert, Kepware KEPServerEX, and AVEVA OI Gateway are closed-source
|
||||||
|
Windows GUI binaries with no headless CLI interface for the
|
||||||
|
subscribe/browse flows.
|
||||||
|
- The AVEVA MXAccess failover leg (`IAlarmSource` reconnect, `$MxAccessClient`
|
||||||
|
quality transition) involves the Galaxy runtime's own client-redundancy
|
||||||
|
policy and the COM-layer session model — both live outside this repo.
|
||||||
|
- Even the automatable sub-set (our own `otopcua-cli` as the client) needs
|
||||||
|
two distinct listening TCP endpoints; that requires two live processes,
|
||||||
|
which is out of scope for `dotnet test` integration fixtures.
|
||||||
|
|
||||||
|
## Test matrix
|
||||||
|
|
||||||
|
### Prerequisites
|
||||||
|
|
||||||
|
1. Two `OtOpcUa.Server` processes on separate Windows hosts (or separate
|
||||||
|
ports on the same host for dev) sharing one Config DB (`ServerCluster`
|
||||||
|
with `NodeCount=2`, `RedundancyMode=Warm` or `Hot`).
|
||||||
|
2. Each node registered in `ClusterNode`:
|
||||||
|
- Node A: `RedundancyRole=Primary`, `ServiceLevelBase=255`,
|
||||||
|
`ApplicationUri=urn:node-a:OtOpcUa`
|
||||||
|
- Node B: `RedundancyRole=Secondary`, `ServiceLevelBase=100`,
|
||||||
|
`ApplicationUri=urn:node-b:OtOpcUa`
|
||||||
|
3. `PeerHttpProbeLoop` and `PeerUaProbeLoop` HostedServices running on both
|
||||||
|
nodes (registered via `AddHostedService<PeerHttpProbeLoop>` +
|
||||||
|
`AddHostedService<PeerUaProbeLoop>` in `Program.cs`).
|
||||||
|
4. At least one `DriverInstance` in the cluster with a reachable PLC or
|
||||||
|
simulator (e.g. Modbus sim at `10.100.0.35:5020`).
|
||||||
|
5. Client machine with UaExpert >= 1.7 installed.
|
||||||
|
6. Optional second client: Kepware KEPServerEX 6.x QuickClient or AVEVA
|
||||||
|
OI Gateway 2020R2+.
|
||||||
|
|
||||||
|
### Block A — OPC UA protocol signals (UaExpert, no failover yet)
|
||||||
|
|
||||||
|
| ID | Scenario | Procedure | Pass criterion | Automatable? |
|
||||||
|
|----|----------|-----------|----------------|--------------|
|
||||||
|
| A1 | ServiceLevel published on Primary | Connect UaExpert to Node A. Browse `Server/ServiceLevel` (`i=2267`). | Value = 255 (`AuthoritativePrimary`) | No (requires UaExpert GUI) |
|
||||||
|
| A2 | ServiceLevel published on Backup | Connect UaExpert to Node B. Read same node. | Value = 100 (`AuthoritativeBackup`) | No |
|
||||||
|
| A3 | ServiceLevel updates when peer drops | Node A connected. Stop Node B (`sc.exe stop OtOpcUa`). Watch `ServiceLevel` on Node A. | Transitions 255 → 230 (`IsolatedPrimary`) within ~6 s (3 × 2 s HTTP probe interval) | No |
|
||||||
|
| A4 | RedundancySupport | Browse `Server/ServerRedundancy/RedundancySupport` on either node. | Value = `Warm` or `Hot` matching the cluster `RedundancyMode` | No |
|
||||||
|
| A5 | ServerUriArray | Browse `Server/ServerRedundancy/ServerUriArray` on either node. | Array contains both `ApplicationUri` values; self listed first. Note: requires non-transparent redundancy-type upgrade (currently logs-and-skips — see known limitation A5 below). | No |
|
||||||
|
| A6 | Mid-apply ServiceLevel dip | Trigger a `sp_PublishGeneration` apply (via Admin UI draft → publish) while watching Node A `ServiceLevel`. | Drops to 200 (`PrimaryMidApply`) for the apply duration; returns to 255 after `RefreshAsync`. | No |
|
||||||
|
| A7 | Client.CLI reads correct ServiceLevel | `dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://<node-a>:4840 -n "i=2267"` | Prints current byte value matching expected band. | **Yes** — scriptable with the Client CLI |
|
||||||
|
| A8 | otopcua-cli failover reconnect | `dotnet run ... -- connect -u opc.tcp://<node-a>:4840 -F opc.tcp://<node-b>:4840` — then kill Node A. | CLI session reconnects to Node B within the session keep-alive timeout. | **Yes** — scriptable with the Client CLI |
|
||||||
|
|
||||||
|
### Block B — Third-party client failover
|
||||||
|
|
||||||
|
| ID | Scenario | Procedure | Pass criterion |
|
||||||
|
|----|----------|-----------|----------------|
|
||||||
|
| B1 | UaExpert picks Primary by ServiceLevel | Configure a Redundancy Group in UaExpert with both endpoint URLs. | Client connects to Node A (higher ServiceLevel) |
|
||||||
|
| B2 | UaExpert cuts over on Primary kill | Kill Node A `OtOpcUa` service. | Client session reconnects to Node B within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
|
||||||
|
| B3 | UaExpert returns when Primary restores | Start Node A. Wait >= 60 s recovery dwell. | `ServiceLevel` on Node A progresses: 180 (`RecoveringPrimary`) → 255 (`AuthoritativePrimary`). UaExpert may or may not switch back (client-policy-dependent; both outcomes accepted). |
|
||||||
|
| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware configured for the same two endpoints. | Same pass criteria; establishes no UaExpert-specific behaviour. |
|
||||||
|
| B5 | AVEVA OI Gateway | Configure OI Gateway OPC DA/UA client object against the cluster. Kill Primary. | OI Gateway data quality recovers within `ReconnectInterval` (default 20 s); no permanent data-loss alert. |
|
||||||
|
|
||||||
|
### Block C — Galaxy MXAccess failover
|
||||||
|
|
||||||
|
This block requires a running Galaxy and `$MxAccessClient` object (AVEVA
|
||||||
|
System Platform installed, Galaxy deployed on dev box — see project memory
|
||||||
|
`project_aveva_platform_installed.md`).
|
||||||
|
|
||||||
|
| ID | Scenario | Procedure | Pass criterion |
|
||||||
|
|----|----------|-----------|----------------|
|
||||||
|
| C1 | Galaxy binds to Primary on first connect | Bring cluster up. Start a Galaxy `$MxAccessClient` with both node URLs configured. | Galaxy reports `QUALITY = Good`; initial values stream from Node A. |
|
||||||
|
| C2 | Galaxy redirects on Primary drop | Stop Node A. | Galaxy `QUALITY` briefly goes `Uncertain`, then returns to `Good`; values continue streaming from Node B within MXAccess's `ReconnectInterval` (default 20 s). |
|
||||||
|
| C3 | Galaxy tolerates mid-apply dip | Trigger generation apply on Node A. | Galaxy remains bound — mid-apply dip (200) is advisory, not a session drop. No quality interruption. |
|
||||||
|
|
||||||
|
Note: A negative result on C1–C3 does not necessarily indicate an OtOpcUa
|
||||||
|
defect. Cross-check with Block A / B first to confirm our `ServiceLevel`
|
||||||
|
signal is correct before debugging the MXAccess client layer.
|
||||||
|
|
||||||
|
## Step-by-step cutover-validation runbook
|
||||||
|
|
||||||
|
This is the minimum procedure to satisfy the v2 GA exit criterion:
|
||||||
|
"Non-transparent redundancy cutover validated with at least one production
|
||||||
|
client (Ignition 8.3 recommended — see decision #85)."
|
||||||
|
|
||||||
|
### Step 1 — Provision the cluster
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
# On the Config DB host, seed or verify cluster rows:
|
||||||
|
# ServerCluster: Id=<id>, Name="test-cluster", NodeCount=2, RedundancyMode=Warm
|
||||||
|
# ClusterNode A: NodeId="node-a", ClusterId=<id>, RedundancyRole=Primary,
|
||||||
|
# ServiceLevelBase=255, ApplicationUri="urn:node-a:OtOpcUa"
|
||||||
|
# ClusterNode B: NodeId="node-b", ClusterId=<id>, RedundancyRole=Secondary,
|
||||||
|
# ServiceLevelBase=100, ApplicationUri="urn:node-b:OtOpcUa"
|
||||||
|
```
|
||||||
|
|
||||||
|
Verify uniqueness constraint: no two `ClusterNode` rows share the same
|
||||||
|
`ApplicationUri` (unique index on `ApplicationUri`).
|
||||||
|
|
||||||
|
### Step 2 — Start both server instances
|
||||||
|
|
||||||
|
On Node A host:
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
# appsettings.json: Node:NodeId = "node-a"
|
||||||
|
sc.exe start OtOpcUa
|
||||||
|
```
|
||||||
|
|
||||||
|
On Node B host:
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
# appsettings.json: Node:NodeId = "node-b"
|
||||||
|
sc.exe start OtOpcUa
|
||||||
|
```
|
||||||
|
|
||||||
|
Wait 10 s for HostedServices to complete first probe cycle.
|
||||||
|
|
||||||
|
### Step 3 — Verify baseline ServiceLevel via Client CLI
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
# Node A should report 255
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
|
||||||
|
-u opc.tcp://<node-a-host>:4840 -n "i=2267"
|
||||||
|
|
||||||
|
# Node B should report 100
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
|
||||||
|
-u opc.tcp://<node-b-host>:4840 -n "i=2267"
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: Node A = 255, Node B = 100.
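The CLI prints a raw byte, so a small lookup makes recorded evidence easier to read. The sketch below only covers the band values and names quoted in this plan; the calculator may define further bands that are not listed here, and `Get-ServiceLevelBand` is a hypothetical helper name.

```powershell
# Band names and values as quoted in this plan; not an exhaustive list.
$ServiceLevelBands = @{
    255 = 'AuthoritativePrimary'
    230 = 'IsolatedPrimary'
    200 = 'PrimaryMidApply'
    180 = 'RecoveringPrimary'
    100 = 'AuthoritativeBackup'
    80  = 'IsolatedBackup'
    1   = 'NoData'
}

# Hypothetical helper: translate a ServiceLevel byte into its band name.
function Get-ServiceLevelBand {
    param([Parameter(Mandatory)] [byte] $Value)
    $key = [int]$Value
    if ($ServiceLevelBands.ContainsKey($key)) {
        $ServiceLevelBands[$key]
    }
    else {
        # Values outside the table are reported rather than guessed at.
        "Unlisted($Value)"
    }
}
```

For the baseline in this step, `Get-ServiceLevelBand 255` on Node A and `Get-ServiceLevelBand 100` on Node B should name the two authoritative bands.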
|
||||||
|
|
||||||
|
### Step 4 — Verify ServerUriArray
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
|
||||||
|
  -u opc.tcp://<node-a-host>:4840 -n "i=11314"
|
||||||
|
```
|
||||||
|
|
||||||
|
Pass: array returned contains both `ApplicationUri` strings. If
|
||||||
|
`ServerUriArray` node returns empty or an error, the non-transparent
|
||||||
|
redundancy-type upgrade follow-up is still pending (known limitation —
|
||||||
|
`ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips on the
|
||||||
|
base `ServerRedundancyState` object type).
|
||||||
|
|
||||||
|
### Step 5 — Execute Primary kill + failover (B2 scenario)
|
||||||
|
|
||||||
|
1. Connect UaExpert (or Kepware) Redundancy Group to both endpoints.
|
||||||
|
2. Confirm client is subscribed to at least one variable node.
|
||||||
|
3. Kill Node A: `sc.exe stop OtOpcUa` on Node A host.
|
||||||
|
4. Observe:
|
||||||
|
- Node B `ServiceLevel` should transition: 100 (`AuthoritativeBackup`)
|
||||||
|
→ 80 (`IsolatedBackup`) within ~6 s.
|
||||||
|
- Client should reconnect to Node B and resume data-change events.
|
||||||
|
5. Record: time from kill to client reconnect; whether data gaps occurred.
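The kill-to-reconnect time in item 5 is easier to record consistently with a stopwatch loop than with a wall clock. This is a minimal sketch; `Measure-TimeToCondition` is a hypothetical helper, and the condition script block is whatever observable the operator can script (for example, a Client CLI read of Node B that returns the expected band).

```powershell
# Hypothetical helper: poll a condition and report how long it took to become true.
function Measure-TimeToCondition {
    param(
        [Parameter(Mandatory)] [scriptblock] $Condition,
        [int] $TimeoutSeconds   = 60,
        [int] $PollMilliseconds = 250
    )
    $watch = [System.Diagnostics.Stopwatch]::StartNew()
    while ($watch.Elapsed.TotalSeconds -lt $TimeoutSeconds) {
        if (& $Condition) {
            return [pscustomobject]@{
                Met     = $true
                Seconds = [math]::Round($watch.Elapsed.TotalSeconds, 2)
            }
        }
        Start-Sleep -Milliseconds $PollMilliseconds
    }
    # Timed out: report the elapsed time so the record shows how long was waited.
    [pscustomobject]@{
        Met     = $false
        Seconds = [math]::Round($watch.Elapsed.TotalSeconds, 2)
    }
}
```

Start the measurement immediately after stopping Node A. The reported time is only as precise as the poll interval, so keep `-PollMilliseconds` well below the ~6 s transition being measured.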
|
||||||
|
|
||||||
|
### Step 6 — Verify Primary recovery (B3 scenario)
|
||||||
|
|
||||||
|
1. Restart Node A: `sc.exe start OtOpcUa` on Node A host.
|
||||||
|
2. Observe Node A `ServiceLevel` progression:
|
||||||
|
- ~0 s: 1 (`NoData`) briefly while HostedServices start.
|
||||||
|
- Startup: 180 (`RecoveringPrimary`) — recovery dwell gate active.
|
||||||
|
- After >= 60 s dwell + one positive publish witness: 255 (`AuthoritativePrimary`).
|
||||||
|
3. Observe Node B:
|
||||||
|
- Returns to 100 (`AuthoritativeBackup`) once it sees Node A peer probe succeed.
|
||||||
|
4. Record dwell duration and whether the client (UaExpert/Kepware) switches back.
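If Node A's `ServiceLevel` is sampled during recovery, the expected progression can be checked mechanically. The sketch below assumes the samples were captured by hand or by a polling script as objects with a `T` property (seconds since restart) and a `Value` property (the byte read). `Test-RecoveryProgression` is a hypothetical name.

```powershell
# Hypothetical check over recorded samples of Node A's ServiceLevel.
# Each sample is an object with T (seconds since restart) and Value (byte).
function Test-RecoveryProgression {
    param(
        [Parameter(Mandatory)] [object[]] $Samples,
        [int] $DwellSeconds = 60
    )
    $ordered    = @($Samples | Sort-Object { $_.T })
    $recovering = @($ordered | Where-Object { $_.Value -eq 180 })
    $primary    = @($ordered | Where-Object { $_.Value -eq 255 })

    # Both bands must have been observed at least once.
    if ($recovering.Count -eq 0 -or $primary.Count -eq 0) { return $false }

    $dwellStart   = $recovering[0].T
    $firstPrimary = $primary[0].T

    # 255 must come after 180, and not before the dwell has elapsed.
    ($firstPrimary - $dwellStart) -ge $DwellSeconds
}
```

A `$false` result with 255 appearing early suggests the dwell gate was bypassed; a `$false` result with no 255 at all more likely means the publish witness never arrived.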
|
||||||
|
|
||||||
|
### Step 7 — Execute mid-apply dip (A6 scenario)
|
||||||
|
|
||||||
|
1. Via Admin UI, create a trivial draft change and publish.
|
||||||
|
2. Watch Node A `ServiceLevel` during apply.
|
||||||
|
3. Expected: drops to 200 (`PrimaryMidApply`) for the apply duration
|
||||||
|
(typically seconds); returns to 255 when `GenerationRefreshHostedService`
|
||||||
|
releases the lease.
|
||||||
|
|
||||||
|
### Step 8 — Record results
|
||||||
|
|
||||||
|
Copy the following block into a tracking doc:
|
||||||
|
|
||||||
|
```
|
||||||
|
Run date: YYYY-MM-DD
|
||||||
|
Release SHA: <git sha>
|
||||||
|
Cluster: <cluster-id> Primary: node-a Backup: node-b
|
||||||
|
Config DB: 10.100.0.35,14330
|
||||||
|
|
||||||
|
A1: [PASS/FAIL] evidence: <screenshot or CLI output>
|
||||||
|
A2: [PASS/FAIL]
|
||||||
|
A3: [PASS/FAIL] time-to-IsolatedPrimary: <N>s
|
||||||
|
A4: [PASS/FAIL]
|
||||||
|
A5: [PASS/FAIL/DEFERRED - ServerUriArray upgrade pending]
|
||||||
|
A6: [PASS/FAIL] mid-apply duration: <N>s
|
||||||
|
A7: [PASS/FAIL] CLI output attached
|
||||||
|
A8: [PASS/FAIL] CLI reconnect observed
|
||||||
|
B1: [PASS/FAIL]
|
||||||
|
B2: [PASS/FAIL] reconnect time: <N>s
|
||||||
|
B3: [PASS/FAIL] dwell observed: <N>s
|
||||||
|
B4: [PASS/FAIL] (Kepware)
|
||||||
|
B5: [PASS/FAIL] (OI Gateway — if available)
|
||||||
|
C1: [PASS/FAIL/SKIP - Galaxy not available]
|
||||||
|
C2: [PASS/FAIL/SKIP]
|
||||||
|
C3: [PASS/FAIL/SKIP]
|
||||||
|
```
|
||||||
|
|
||||||
|
One pass of every row not marked SKIP or DEFERRED is the v2 GA acceptance criterion.
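Once the block is filled in, the acceptance decision can be derived from the text itself. This is a sketch under two assumptions: rows are filled in as `ID: [STATUS ...]` exactly as in the template, and `DEFERRED` is treated like `SKIP` rather than as a failure. `Test-CutoverAcceptance` is a hypothetical helper name.

```powershell
# Hypothetical parser for the filled-in results block.
# Assumption: DEFERRED is treated like SKIP (not a failure).
function Test-CutoverAcceptance {
    param([Parameter(Mandatory)] [string] $ResultsText)

    $rows = foreach ($line in $ResultsText -split "`r?`n") {
        # Matches filled rows such as "B2: [PASS] ..." or "C1: [SKIP - reason]".
        # An unfilled template row like "[PASS/FAIL]" does not match.
        if ($line -match '^\s*([ABC]\d):\s*\[(PASS|FAIL|SKIP|DEFERRED)(\]|\s)') {
            [pscustomobject]@{ Id = $Matches[1]; Status = $Matches[2] }
        }
    }
    $rows   = @($rows)
    $failed = @($rows | Where-Object { $_.Status -eq 'FAIL' })
    $passed = @($rows | Where-Object { $_.Status -eq 'PASS' })

    [pscustomobject]@{
        # Accepted only when something passed and nothing failed.
        Accepted = ($failed.Count -eq 0 -and $passed.Count -gt 0)
        Failed   = @($failed | ForEach-Object { $_.Id })
    }
}
```

Rows left in their template form are ignored, so a block with unfilled rows can still report as accepted; check that every expected ID is present before relying on the result.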
|
||||||
|
|
||||||
|
## Known limitations
|
||||||
|
|
||||||
|
### A5 — ServerUriArray node not yet writable
|
||||||
|
|
||||||
|
The OPC UA .NET Standard SDK's default `Server.ServerRedundancy` object is the
|
||||||
|
base `ServerRedundancyState`, which has no `ServerUriArray` child node.
|
||||||
|
`ServerRedundancyNodeWriter.ApplyServerUriArray` currently logs a warning and
|
||||||
|
skips. The operator obtains `ServerUriArray` by reading `ClusterNode` rows
|
||||||
|
directly until the non-transparent redundancy-type upgrade follow-up ships.
|
||||||
|
|
||||||
|
### Recovery dwell is 60 s by default
|
||||||
|
|
||||||
|
`RecoveryStateManager.DwellTime` defaults to `TimeSpan.FromSeconds(60)` in
|
||||||
|
`Program.cs`. Step 6 of the runbook will block for at least 60 s waiting for
|
||||||
|
Node A to return to `AuthoritativePrimary`. This is intentional per
|
||||||
|
decision #154 (thrash prevention) — do not lower it for the test run.
|
||||||
|
|
||||||
|
### IsolatedBackup (80) does not auto-promote
|
||||||
|
|
||||||
|
Per decision #154, the Backup at band 80 does not self-elevate. If the operator
|
||||||
|
needs authoritative service from Node B while Node A is down, they must write
|
||||||
|
`RedundancyRole=Primary` on the `ClusterNode` row for Node B and publish a
|
||||||
|
draft generation. The Admin UI `RedundancyTab` exposes this flow.
|
||||||
|
|
||||||
|
## Dependency on existing tests
|
||||||
|
|
||||||
|
The cutover runbook validates the end-to-end wire path. The math and edge cases
|
||||||
|
are already locked by the unit/integration tests enumerated in the first section.
|
||||||
|
A failing runbook step that contradicts a passing unit test indicates a
|
||||||
|
deployment configuration error or an SDK version mismatch — not a logic bug.
|
||||||
|
Check `PeerHttpProbeLoop` logs first (look for `PeerProbe` Serilog events).
|
||||||
307
docs/plans/v2-ga-lab-gates-plan.md
Normal file
@@ -0,0 +1,307 @@
|
|||||||
|
# v2 GA Lab Gates Plan
|
||||||
|
|
||||||
|
> **Canonical tracker**: `docs/v2/v2-release-readiness.md` — all code-path
|
||||||
|
> release blockers are closed as of 2026-04-24. This document maps the
|
||||||
|
> remaining exit-criteria from that tracker to concrete commands, automation
|
||||||
|
> boundaries, operator procedures, and pass criteria.
|
||||||
|
>
|
||||||
|
> **Status**: RELEASE-READY (code-path). Manual/lab gates remain open.
|
||||||
|
|
||||||
|
## The gate list
|
||||||
|
|
||||||
|
From `docs/v2/v2-release-readiness.md` §"Release-readiness exit criteria":
|
||||||
|
|
||||||
|
| # | Gate | Kind | Automatable here |
|
||||||
|
|---|------|------|-----------------|
|
||||||
|
| G1 | All four Phase 6.N compliance scripts exit 0 | Script | Yes — run on this box |
|
||||||
|
| G2 | `dotnet test ZB.MOM.WW.OtOpcUa.slnx` passes with <= 1 known-flake failure | Script | Yes — run on this box |
|
||||||
|
| G3 | Release blockers closed | Audit | Already closed (code-path) |
|
||||||
|
| G4 | Phase 5 driver complement shipped | Audit | Already closed |
|
||||||
|
| G5 | Production deployment checklist signed off by Fleet Admin | Operator | No — separate doc, human signoff |
|
||||||
|
| G6 | At least one end-to-end integration run against live Galaxy succeeds | Dev rig | No — requires AVEVA platform |
|
||||||
|
| G7 | FOCAS live-CNC wire-level smoke (#54) passes against a real FANUC control | Lab hardware | No — requires FANUC CNC |
|
||||||
|
| G8 | OPC UA CTT / UA Compliance Test Tool passes against the live endpoint | Operator tool | No — requires CTT binary + live endpoint |
|
||||||
|
| G9 | Non-transparent redundancy cutover validated with >= 1 production client | Lab | No — see `docs/plans/phase-6-3-redundancy-interop-plan.md` |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## G1 — Phase 6 compliance scripts
|
||||||
|
|
||||||
|
### Command
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
pwsh ./scripts/compliance/phase-6-all.ps1
|
||||||
|
```
|
||||||
|
|
||||||
|
This meta-runner at `scripts/compliance/phase-6-all.ps1` invokes each
|
||||||
|
sub-script in a separate `powershell.exe` process to isolate exit codes:
|
||||||
|
|
||||||
|
| Sub-script | Phase | What it checks |
|
||||||
|
|-----------|-------|---------------|
|
||||||
|
| `phase-6-1-compliance.ps1` | 6.1 Resilience & Observability | Polly resilience classes, health endpoints, LiteDB sealed cache, observability sinks |
|
||||||
|
| `phase-6-2-compliance.ps1` | 6.2 Authorization runtime | `AuthorizationGate`, `TriePermissionEvaluator`, `NodeScopeResolver`, dispatch wiring in `DriverNodeManager` |
|
||||||
|
| `phase-6-3-compliance.ps1` | 6.3 Redundancy runtime | `ServiceLevelCalculator` 8-state band values, `RecoveryStateManager`, `ApplyLeaseRegistry`, `ServerRedundancyNodeWriter`; also invokes `dotnet test` with a baseline of 1097 |
|
||||||
|
| `phase-6-4-compliance.ps1` | 6.4 Admin UI completion | Data-layer types, Identification folder, deferred Blazor items marked `[DEFERRED]` |
|
||||||
|
|
||||||
|
### Pass criterion
|
||||||
|
|
||||||
|
```
|
||||||
|
Phase 6 aggregate: PASS
|
||||||
|
```
|
||||||
|
|
||||||
|
Exit code 0. Any `[FAIL]` line is a blocker. `[DEFERRED]` lines are expected
|
||||||
|
for the known-deferred surfaces listed in the implementation docs; they do not
|
||||||
|
fail the run.
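The pass rule above (exit code 0, no `[FAIL]`, `[DEFERRED]` tolerated) can be applied to captured output. The sketch assumes each check prints one line containing one of those three tags; the actual line format of the compliance scripts is not shown in this plan, and `Get-ComplianceSummary` is a hypothetical helper.

```powershell
# Hypothetical summariser for captured compliance output.
# Assumption: each check prints one line containing [PASS], [FAIL] or [DEFERRED].
function Get-ComplianceSummary {
    param(
        [string[]] $Lines = @(),
        [Parameter(Mandatory)] [int] $ExitCode
    )
    $count = @{ PASS = 0; FAIL = 0; DEFERRED = 0 }
    foreach ($line in $Lines) {
        if ($line -match '\[(PASS|FAIL|DEFERRED)\]') {
            $count[$Matches[1]]++
        }
    }
    [pscustomobject]@{
        Pass       = $count.PASS
        Fail       = $count.FAIL
        Deferred   = $count.DEFERRED
        # DEFERRED lines are counted for the record but never fail the gate.
        GatePassed = ($ExitCode -eq 0 -and $count.FAIL -eq 0)
    }
}

# Example call (commented out; it would run the real meta-runner):
# $out = pwsh ./scripts/compliance/phase-6-all.ps1
# Get-ComplianceSummary -Lines $out -ExitCode $LASTEXITCODE
```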
|
||||||
|
|
||||||
|
### Prerequisites
|
||||||
|
|
||||||
|
- SQL Server `10.100.0.35,14330` reachable (Config DB tests use it).
|
||||||
|
- `dotnet` SDK on PATH (`.NET 10`).
|
||||||
|
- Run from repo root.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## G2 — Full solution test suite
|
||||||
|
|
||||||
|
### Command
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
dotnet test ZB.MOM.WW.OtOpcUa.slnx --logger "console;verbosity=minimal"
|
||||||
|
```
|
||||||
|
|
||||||
|
To include the integration suites in the run, bring their fixtures up before running the same command:
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
# bring modbus fixture up first
|
||||||
|
lmxopcua-fix up modbus standard
|
||||||
|
|
||||||
|
dotnet test ZB.MOM.WW.OtOpcUa.slnx --logger "console;verbosity=minimal"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Pass criterion
|
||||||
|
|
||||||
|
- Passed count >= 1159 (2026-04-19 baseline after Phase 5 driver complement).
|
||||||
|
- Failed count <= 1 (the pre-existing
|
||||||
|
`SubscribeCommandTests.Execute_PrintsSubscriptionMessage` flake in
|
||||||
|
`Client.CLI` is the only tolerated failure).
|
||||||
|
- No new `[FAILED]` tests relative to the baseline.
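The three criteria above can be evaluated from TRX reports instead of console output. The sketch below assumes the run was made with `--logger trx` (which typically writes one report per test project) and that each failed result's `testName` contains the class and method name. `Test-G2Gate` is a hypothetical helper, and it does not compare against a stored baseline, so the "no new failures" rule is approximated as "no failure other than the known flake".

```powershell
# Hypothetical G2 evaluator over one or more TRX reports.
function Test-G2Gate {
    param(
        [Parameter(Mandatory)] [xml[]] $Reports,
        [int]    $MinPassed  = 1159,
        [string] $KnownFlake = 'SubscribeCommandTests.Execute_PrintsSubscriptionMessage'
    )
    $passed      = 0
    $failedNames = @()
    foreach ($report in $Reports) {
        # ResultSummary/Counters carries the per-report totals.
        $passed += [int]$report.TestRun.ResultSummary.Counters.passed
        $failedNames += @(
            $report.TestRun.Results.UnitTestResult |
                Where-Object { $_ -and $_.outcome -eq 'Failed' } |
                ForEach-Object { $_.testName }
        )
    }
    $unexpected = @($failedNames | Where-Object { $_ -notlike "*$KnownFlake*" })

    [pscustomobject]@{
        Passed     = $passed
        Failed     = $failedNames.Count
        Unexpected = $unexpected
        GatePassed = ($passed -ge $MinPassed -and
                      $failedNames.Count -le 1 -and
                      $unexpected.Count -eq 0)
    }
}

# Example call (commented out; paths depend on where the reports were written):
# $reports = Get-ChildItem -Recurse -Filter *.trx | ForEach-Object { [xml](Get-Content $_.FullName -Raw) }
# Test-G2Gate -Reports $reports
```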
|
||||||
|
|
||||||
|
### Known flake
|
||||||
|
|
||||||
|
`ZB.MOM.WW.OtOpcUa.Client.CLI.Tests::SubscribeCommandTests.Execute_PrintsSubscriptionMessage`
|
||||||
|
is a timing-sensitive subscribe-then-cancel test. Rerun the specific project
|
||||||
|
if it appears:
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
# `dotnet test` has no repeat-count option, so loop the run three times.
1..3 | ForEach-Object {
  dotnet test tests/Client/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests `
    --filter "FullyQualifiedName~SubscribeCommandTests.Execute_PrintsSubscriptionMessage"
}
|
||||||
|
```
|
||||||
|
|
||||||
|
If it fails all three runs, investigate; otherwise treat as flake.

### Docker fixtures needed for integration suites

| Driver | Command | Endpoint used |
|--------|---------|---------------|
| Modbus | `lmxopcua-fix up modbus standard` | `10.100.0.35:5020` |
| AB CIP | `lmxopcua-fix up abcip controllogix` | `10.100.0.35:44818` |
| S7 | `lmxopcua-fix up s7 s7_1500` | `10.100.0.35:1102` |
| OPC UA Client | `lmxopcua-fix up opcuaclient` | `opc.tcp://10.100.0.35:50000` |
| FOCAS | `lmxopcua-fix up focas` (mock server) | `10.100.0.35:8193` |
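
Before starting a long test run it can help to confirm that each fixture endpoint accepts a TCP connection. The sketch below is an illustrative Python helper, not a repo script: the endpoints are copied from the table above, and the function names are hypothetical. It only checks that the port is open, not that the fixture speaks the expected protocol.

```python
import socket

# Endpoints copied from the fixture table above.
FIXTURES = {
    "Modbus": ("10.100.0.35", 5020),
    "AB CIP": ("10.100.0.35", 44818),
    "S7": ("10.100.0.35", 1102),
    "OPC UA Client": ("10.100.0.35", 50000),
    "FOCAS": ("10.100.0.35", 8193),
}


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def unreachable_fixtures(fixtures: dict[str, tuple[str, int]]) -> list[str]:
    """Names of fixtures whose endpoint does not accept a TCP connection."""
    return [name for name, (host, port) in fixtures.items()
            if not is_reachable(host, port)]
```

Calling `unreachable_fixtures(FIXTURES)` from a machine on the lab network returns the drivers whose fixture still needs `lmxopcua-fix up`.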

TwinCAT integration tests require the TCBSD ESXi VM at `10.100.0.128`
(AmsNetId `41.169.163.43.1.1`). Set these environment variables before running:

```powershell
$env:TWINCAT_TARGET_HOST = "10.100.0.128"
$env:TWINCAT_TARGET_NETID = "41.169.163.43.1.1"
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests
```
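
An AMS Net ID has six dot-separated byte values, so it is easy to paste the four-octet host address into the wrong variable. A small Python sketch of a format check follows; the function name is hypothetical and the check is purely syntactic.

```python
def is_valid_ams_net_id(value: str) -> bool:
    """True if value is six dot-separated integers, each in 0..255."""
    parts = value.split(".")
    return len(parts) == 6 and all(
        p.isascii() and p.isdigit() and int(p) <= 255 for p in parts
    )


print(is_valid_ams_net_id("41.169.163.43.1.1"))  # the Net ID used above
print(is_valid_ams_net_id("10.100.0.128"))       # a host address, not a Net ID
```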

Galaxy integration tests run against the live mxaccessgw on the dev box
(gate G6).

---

## G3 — Release blockers closed (audit, already satisfied)

All three code-path release blockers are closed per `v2-release-readiness.md`:

- Authorization dispatch wiring (task #143, PR #94) — CLOSED.
- Config fallback Phase 6.1 Stream D (task #136, PR #96) — CLOSED.
- Redundancy Phase 6.3 Streams A/C core (tasks #145/#147, PRs #98-99) — CLOSED.

No action required. Record the PR numbers in the release notes.

---

## G4 — Driver complement (audit, already satisfied)

All eight drivers shipped:

Galaxy, Modbus (+ DL205/S7/MELSEC profiles), S7 native, OPC UA Client, AB CIP,
AB Legacy, TwinCAT ADS, FOCAS (managed wire client — Tier-C isolation retired,
FOCAS is now Tier A in-process via `WireFocasClient`).

No action required.

---

## G5 — Production deployment checklist (operator action)

The deployment checklist is a separate document covering:

- Windows service install (`scripts/install/Install-Services.ps1`)
- Config DB migration (`scripts/db/Apply-Migrations.ps1`)
- Certificate provisioning and trust
- LDAP / GLAuth configuration for production AD target
- mxaccessgw API key provisioning (`apikey create-key` in the sibling repo)
- Service account permissions
- Prometheus / OpenTelemetry export configuration
- Firewall rules (port 4840 OPC UA, port 5120 gRPC to mxaccessgw,
  Admin port 5000/5001)

**Sign-off party**: Fleet Admin (operator). Not automatable.

Record sign-off as a comment on the v2 release issue.

---

## G6 — Live Galaxy end-to-end integration run

**Requires**: AVEVA System Platform installed on dev box (confirmed available
per project memory `project_aveva_platform_installed.md`); mxaccessgw running
with a provisioned API key; at least one Galaxy object deployed.

### Procedure

1. Start mxaccessgw:

   ```powershell
   # in sibling repo C:\Users\dohertj2\Desktop\mxaccessgw\
   dotnet run --project src/MxGateway.Server -- --apikey-path .local/api-key.txt
   ```

2. Start OtOpcUa server with Galaxy driver instance configured:

   ```powershell
   # use sc.exe explicitly: in Windows PowerShell, bare `sc` is an alias for Set-Content
   sc.exe start OtOpcUa
   ```

3. Browse via Client CLI:

   ```powershell
   dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
     browse -u opc.tcp://localhost:4840 -r -d 3
   ```

4. Read a known Galaxy tag (e.g. a deployed `$UserDefined` object attribute):

   ```powershell
   dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
     read -u opc.tcp://localhost:4840 -n "ns=2;s=<tag_name.AttributeName>"
   ```

5. Subscribe and verify live updates:

   ```powershell
   dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
     subscribe -u opc.tcp://localhost:4840 -n "ns=2;s=<tag_name.AttributeName>" -i 1000
   ```

### Pass criterion

- Browse returns a non-empty node tree mirroring the Galaxy hierarchy.
- Read returns `Good` quality with a non-null value.
- Subscribe receives at least one data-change notification within 5 s
  (or within the configured publishing interval).
- No `BadNoCommunication` or `BadTimeout` errors in the server log.

Record: Galaxy version, deployed object count, OtOpcUa git SHA.

---

## G7 — FOCAS live-CNC smoke (task #54)

**Requires**: real FANUC CNC with Ethernet option, accessible on TCP port 8193
from the dev box; CNC series known (e.g. 0i-F, 30i-B).

See `docs/plans/live-hardware-validation-runbooks.md` §FOCAS for the full
runbook.

### Pass criterion

- `WireFocasClient` opens a FOCAS2 session (`cnc_allclibhndl3` succeeds).
- Identity nodes (`Identity/SeriesNumber`, `Identity/MaxAxes`) return non-null
  values matching the physical control panel display.
- At least one axis position (`Axes/X/AbsolutePosition` or similar) returns
  `Good` quality with a plausible double value.
- Subscribe on a polled tag delivers at least three updates within 5 s.
- No `EW_SOCKET` (-16) or `EW_HANDLE` (-8) errors in the server log during a
  2-minute soak.
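
The subscription bullet is a count-within-a-window check. The Python sketch below shows one way to evaluate it, assuming notification arrival times have been recorded in seconds on the same clock as the subscribe call. The function name and the way timestamps are collected are hypothetical; the Client CLI is not assumed to emit them in this form.

```python
def enough_updates(start: float, arrivals: list[float],
                   min_updates: int = 3, window_s: float = 5.0) -> bool:
    """True if at least min_updates arrivals fall within window_s of start."""
    in_window = [t for t in arrivals if start <= t <= start + window_s]
    return len(in_window) >= min_updates


print(enough_updates(0.0, [1.0, 2.1, 3.0]))   # three updates inside 5 s
print(enough_updates(0.0, [1.0, 2.1, 6.0]))   # third update arrives too late
```

With `min_updates=1` the same function covers the G6 criterion of at least one data-change notification within 5 s.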

Record: CNC series, firmware version, test date, OtOpcUa git SHA.

---

## G8 — OPC UA Conformance Test Tool (CTT) pass

**Requires**: OPC Foundation OPC UA Compliance Test Tool (CTT) or the
open-source UA Compliance Test Tool installed on the client machine;
live OtOpcUa server endpoint.

### Recommended minimum profile set

- `Attribute Read`
- `Attribute Write`
- `Browse`
- `Subscription` (DataChange)
- `Server-side monitoring`
- `Security — None profile` (if server configured with `Security:Profiles=[None]`)

### Procedure

1. Launch CTT. Add server endpoint: `opc.tcp://localhost:4840`.
2. Run the profile set above.
3. Capture the CTT report HTML/XML.

### Pass criterion

All mandatory test cases in each profile: **PASS** or **NOT APPLICABLE**.

Zero mandatory failures. Advisory failures may be documented with rationale
(e.g. optional capability not implemented).
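
The distinction between mandatory and advisory cases can be made explicit when summarizing a report. The following Python sketch assumes the CTT results have already been transcribed into simple records; the `CaseResult` type and both function names are hypothetical, and nothing here parses the actual CTT report format.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    name: str
    mandatory: bool
    outcome: str  # "PASS", "FAIL" or "NOT APPLICABLE"


ACCEPTED = {"PASS", "NOT APPLICABLE"}


def g8_passes(results: list[CaseResult]) -> bool:
    """True when every mandatory case is PASS or NOT APPLICABLE."""
    return all(r.outcome in ACCEPTED for r in results if r.mandatory)


def advisory_failures(results: list[CaseResult]) -> list[str]:
    """Names of non-mandatory cases that still need a documented rationale."""
    return [r.name for r in results
            if not r.mandatory and r.outcome not in ACCEPTED]
```

Keeping the advisory list separate makes it straightforward to attach a rationale per item in the release notes.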

Record: CTT version, profile set, OtOpcUa git SHA, report artifact.

---

## G9 — Non-transparent redundancy cutover with production client

See `docs/plans/phase-6-3-redundancy-interop-plan.md` for the full runbook.

**Minimum acceptable result**: one complete pass of the A-block (UaExpert
OPC UA signal verification) plus scenario B2 (UaExpert failover on Primary
kill).

Ignition 8.3 is the recommended production client per decision #85. If
Ignition is not available on the lab machine, UaExpert is accepted for v2 GA.

Record: client name + version, OtOpcUa git SHA, test date.

---

## Gate summary table

| Gate | Command / Procedure | Pass criterion | Owner |
|------|---------------------|----------------|-------|
| G1 | `pwsh ./scripts/compliance/phase-6-all.ps1` | Exit 0, no `[FAIL]` | Dev |
| G2 | `dotnet test ZB.MOM.WW.OtOpcUa.slnx` | >= 1159 passing, <= 1 failure | Dev |
| G3 | Audit PR list in `v2-release-readiness.md` | All blockers show CLOSED | Dev |
| G4 | Audit driver table | All 8 drivers listed as shipped | Dev |
| G5 | Run deployment checklist doc | All items checked; Fleet Admin signs off | Fleet Admin |
| G6 | Browse/read/subscribe against live Galaxy | Good quality, non-empty tree | Dev (dev box) |
| G7 | FOCAS CNC smoke (see live-hardware runbook) | Session open, Good quality reads | Dev + lab hardware |
| G8 | CTT profile run against live endpoint | Zero mandatory failures | Dev + CTT tool |
| G9 | Redundancy cutover runbook | A-block + B2 pass with >= 1 client | Dev + two instances |