lmxopcua/docs/plans/v2-ga-lab-gates-plan.md
Joseph Doherty 16a87b08f3 docs: add four planning runbooks for Phase 6.3 interop, v2 GA gates, live-hardware validation, and alarms worker wiring
Produces docs/plans/ entries for tasks #13, #15, #16, and #17-#20:
- phase-6-3-redundancy-interop-plan.md: automation boundary analysis,
  concrete test matrix (A/B/C blocks), and a step-by-step cutover
  runbook for the deferred Stream F client interop work
- v2-ga-lab-gates-plan.md: exact gate list with command, pass criterion,
  and owner for each of the nine v2 GA exit criteria
- live-hardware-validation-runbooks.md: one runbook per driver (FOCAS
  CNC smoke #54, AB CIP live-boot, TwinCAT wire-live) with preconditions,
  procedure, expected results, and recording template
- alarms-worker-wiring-plan.md: focused plan for A.2/A.3-A.4/C.1/D.1
  worker wiring in the mxaccessgw sibling repo, documenting the
  discovered AVEVA API surface, the architectural decision that blocks
  A.2, the dependency order, and what each item needs to unblock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 04:53:36 -04:00


# v2 GA Lab Gates Plan
> **Canonical tracker**: `docs/v2/v2-release-readiness.md` — all code-path
> release blockers are closed as of 2026-04-24. This document maps the
> remaining exit criteria from that tracker to concrete commands, automation
> boundaries, operator procedures, and pass criteria.
>
> **Status**: RELEASE-READY (code-path). Manual/lab gates remain open.
## The gate list
From `docs/v2/v2-release-readiness.md` §"Release-readiness exit criteria":
| # | Gate | Kind | Automatable here |
|---|------|------|-----------------|
| G1 | All four Phase 6.N compliance scripts exit 0 | Script | Yes — run on this box |
| G2 | `dotnet test ZB.MOM.WW.OtOpcUa.slnx` passes with <= 1 known-flake failure | Script | Yes — run on this box |
| G3 | Release blockers closed | Audit | Already closed (code-path) |
| G4 | Phase 5 driver complement shipped | Audit | Already closed |
| G5 | Production deployment checklist signed off by Fleet Admin | Operator | No — separate doc, human signoff |
| G6 | At least one end-to-end integration run against live Galaxy succeeds | Dev rig | No — requires AVEVA platform |
| G7 | FOCAS live-CNC wire-level smoke (#54) passes against a real FANUC control | Lab hardware | No — requires FANUC CNC |
| G8 | OPC UA CTT / UA Compliance Test Tool passes against the live endpoint | Operator tool | No — requires CTT binary + live endpoint |
| G9 | Non-transparent redundancy cutover validated with >= 1 production client | Lab | No — see `docs/plans/phase-6-3-redundancy-interop-plan.md` |
---
## G1 — Phase 6 compliance scripts
### Command
```powershell
pwsh ./scripts/compliance/phase-6-all.ps1
```
This meta-runner at `scripts/compliance/phase-6-all.ps1` invokes each
sub-script in a separate `powershell.exe` process to isolate exit codes:
| Sub-script | Phase | What it checks |
|-----------|-------|---------------|
| `phase-6-1-compliance.ps1` | 6.1 Resilience & Observability | Polly resilience classes, health endpoints, LiteDB sealed cache, observability sinks |
| `phase-6-2-compliance.ps1` | 6.2 Authorization runtime | `AuthorizationGate`, `TriePermissionEvaluator`, `NodeScopeResolver`, dispatch wiring in `DriverNodeManager` |
| `phase-6-3-compliance.ps1` | 6.3 Redundancy runtime | `ServiceLevelCalculator` 8-state band values, `RecoveryStateManager`, `ApplyLeaseRegistry`, `ServerRedundancyNodeWriter`; also invokes `dotnet test` with a baseline of 1097 |
| `phase-6-4-compliance.ps1` | 6.4 Admin UI completion | Data-layer types, Identification folder, deferred Blazor items marked `[DEFERRED]` |
### Pass criterion
```
Phase 6 aggregate: PASS
```
Exit code 0. Any `[FAIL]` line is a blocker. `[DEFERRED]` lines are expected
for the known-deferred surfaces listed in the implementation docs; they do not
fail the run.
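
The pass criterion above can be checked mechanically once the run's output is captured. The following is a minimal sketch, assuming the meta-runner writes its `[FAIL]` markers and the `Phase 6 aggregate: PASS` line to standard output as plain text. `Test-Phase6Output` is a hypothetical helper and is not part of the repo.

```powershell
# Hypothetical helper: evaluates captured meta-runner output against the G1 criterion.
# Assumes [FAIL] markers and the aggregate line appear in the output as plain text.
function Test-Phase6Output {
    param(
        [string[]] $Lines,
        [Parameter(Mandatory)] [int] $ExitCode
    )

    # [DEFERRED] lines are deliberately ignored; only [FAIL] lines block the gate.
    $failLines = @($Lines | Where-Object { $_ -match '\[FAIL\]' })
    $aggregate = @($Lines | Where-Object { $_ -match 'Phase 6 aggregate:\s*PASS' })

    [pscustomobject]@{
        Passed    = ($ExitCode -eq 0) -and ($failLines.Count -eq 0) -and ($aggregate.Count -gt 0)
        FailLines = $failLines
    }
}

# Possible usage from the repo root:
#   $output = pwsh ./scripts/compliance/phase-6-all.ps1 2>&1 | ForEach-Object { "$_" }
#   Test-Phase6Output -Lines $output -ExitCode $LASTEXITCODE
```

Requiring both a zero exit code and the aggregate line guards against a run that exits early without reporting.
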
### Prerequisites
- SQL Server `10.100.0.35,14330` reachable (Config DB tests use it).
- `dotnet` SDK on PATH (`.NET 10`).
- Run from repo root.
---
## G2 — Full solution test suite
### Command
```powershell
dotnet test ZB.MOM.WW.OtOpcUa.slnx --logger "console;verbosity=minimal"
```
To include the integration suites, first bring up the fixtures they depend on (the Modbus fixture is shown here), then rerun the same command:
```powershell
# bring modbus fixture up first
lmxopcua-fix up modbus standard
dotnet test ZB.MOM.WW.OtOpcUa.slnx --logger "console;verbosity=minimal"
```
### Pass criterion
- Passed count >= 1159 (2026-04-19 baseline after Phase 5 driver complement).
- Failed count <= 1 (the pre-existing
`SubscribeCommandTests.Execute_PrintsSubscriptionMessage` flake in
`Client.CLI` is the only tolerated failure).
- No new `[FAILED]` tests relative to the baseline.
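
The three conditions above reduce to a small check. This is a sketch only: `Test-G2Gate` is a hypothetical helper, and it assumes the operator has already collected the counts and failed-test names from the run (for example from the console summary). It does not parse `dotnet test` output.

```powershell
# Hypothetical helper: applies the G2 thresholds to counts taken from a test run.
# Counts and failed-test names are supplied by the caller; nothing is parsed here.
function Test-G2Gate {
    param(
        [Parameter(Mandatory)] [int] $Passed,
        [Parameter(Mandatory)] [int] $Failed,
        [string[]] $FailedTests = @(),
        [int] $BaselinePassed = 1159
    )

    $knownFlake = 'SubscribeCommandTests.Execute_PrintsSubscriptionMessage'
    # Any named failure other than the known flake counts as a new failure.
    $unexpected = @($FailedTests | Where-Object { $_ -and ($_ -notlike "*$knownFlake") })

    [pscustomobject]@{
        Passed          = ($Passed -ge $BaselinePassed) -and ($Failed -le 1) -and ($unexpected.Count -eq 0)
        UnexpectedFails = $unexpected
    }
}

# Possible usage:
#   Test-G2Gate -Passed 1161 -Failed 1 -FailedTests 'SubscribeCommandTests.Execute_PrintsSubscriptionMessage'
```

If `-FailedTests` is omitted, a single failure is accepted without checking which test it was, so pass the names whenever the failed count is non-zero.
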
### Known flake
`ZB.MOM.WW.OtOpcUa.Client.CLI.Tests::SubscribeCommandTests.Execute_PrintsSubscriptionMessage`
is a timing-sensitive subscribe-then-cancel test. Rerun the specific project
if it appears:
```powershell
# dotnet test has no repeat option, so run the filtered test three times in a loop.
1..3 | ForEach-Object {
    dotnet test tests/Client/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests `
        --filter "FullyQualifiedName~SubscribeCommandTests.Execute_PrintsSubscriptionMessage"
}
```
If it fails all three runs, investigate; otherwise treat as flake.
### Docker fixtures needed for integration suites
| Driver | Command | Endpoint used |
|--------|---------|---------------|
| Modbus | `lmxopcua-fix up modbus standard` | `10.100.0.35:5020` |
| AB CIP | `lmxopcua-fix up abcip controllogix` | `10.100.0.35:44818` |
| S7 | `lmxopcua-fix up s7 s7_1500` | `10.100.0.35:1102` |
| OPC UA Client | `lmxopcua-fix up opcuaclient` | `opc.tcp://10.100.0.35:50000` |
| FOCAS | `lmxopcua-fix up focas` (mock server) | `10.100.0.35:8193` |
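
Before starting the integration suites it can help to confirm that each fixture port accepts connections. The sketch below uses the .NET `TcpClient` class; `Test-TcpEndpoint` is a hypothetical helper, and an open port only shows that something is listening, not that the fixture is healthy.

```powershell
# Hypothetical helper: returns $true when a TCP connection to the endpoint succeeds
# within the timeout. It proves the port is open, nothing more.
function Test-TcpEndpoint {
    param(
        [Parameter(Mandatory)] [string] $HostName,
        [Parameter(Mandatory)] [int] $Port,
        [int] $TimeoutMs = 2000
    )

    $client = [System.Net.Sockets.TcpClient]::new()
    try {
        $task = $client.ConnectAsync($HostName, $Port)
        # Wait() returns $false on timeout and throws if the connection is refused.
        return ($task.Wait($TimeoutMs) -and $client.Connected)
    }
    catch {
        return $false
    }
    finally {
        $client.Dispose()
    }
}

# Possible usage with the endpoints from the table above:
#   foreach ($port in 5020, 44818, 1102, 50000, 8193) {
#       '{0}: {1}' -f $port, (Test-TcpEndpoint -HostName '10.100.0.35' -Port $port)
#   }
```
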
TwinCAT integration tests require the TCBSD ESXi VM at `10.100.0.128`
(AmsNetId `41.169.163.43.1.1`). Set both environment variables before running:
```powershell
$env:TWINCAT_TARGET_HOST = "10.100.0.128"
$env:TWINCAT_TARGET_NETID = "41.169.163.43.1.1"
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.IntegrationTests
```
Galaxy integration tests run against the live mxaccessgw on the dev box
(gate G6).
---
## G3 — Release blockers closed (audit, already satisfied)
All three code-path release blockers are closed per `v2-release-readiness.md`:
- Authorization dispatch wiring (task #143, PR #94) — CLOSED.
- Config fallback Phase 6.1 Stream D (task #136, PR #96) — CLOSED.
- Redundancy Phase 6.3 Streams A/C core (tasks #145/#147, PRs #98-99) — CLOSED.
No action required. Record the PR numbers in the release notes.
---
## G4 — Driver complement (audit, already satisfied)
All eight drivers shipped:
Galaxy, Modbus (+ DL205/S7/MELSEC profiles), S7 native, OPC UA Client, AB CIP,
AB Legacy, TwinCAT ADS, FOCAS (managed wire client — Tier-C isolation retired,
FOCAS is now Tier A in-process via `WireFocasClient`).
No action required.
---
## G5 — Production deployment checklist (operator action)
The deployment checklist is a separate document covering:
- Windows service install (`scripts/install/Install-Services.ps1`)
- Config DB migration (`scripts/db/Apply-Migrations.ps1`)
- Certificate provisioning and trust
- LDAP / GLAuth configuration for production AD target
- mxaccessgw API key provisioning (`apikey create-key` in the sibling repo)
- Service account permissions
- Prometheus / OpenTelemetry export configuration
- Firewall rules (port 4840 OPC UA, port 5120 gRPC to mxaccessgw,
Admin port 5000/5001)
**Sign-off party**: Fleet Admin (operator). Not automatable.
Record sign-off as a comment on the v2 release issue.
---
## G6 — Live Galaxy end-to-end integration run
**Requires**: AVEVA System Platform installed on dev box (confirmed available
per project memory `project_aveva_platform_installed.md`); mxaccessgw running
with a provisioned API key; at least one Galaxy object deployed.
### Procedure
1. Start mxaccessgw:
```powershell
# in sibling repo C:\Users\dohertj2\Desktop\mxaccessgw\
dotnet run --project src/MxGateway.Server -- --apikey-path .local/api-key.txt
```
2. Start OtOpcUa server with Galaxy driver instance configured:
```powershell
# Use sc.exe explicitly: in Windows PowerShell 5.1, `sc` is an alias for Set-Content.
sc.exe start OtOpcUa
```
3. Browse via Client CLI:
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
browse -u opc.tcp://localhost:4840 -r -d 3
```
4. Read a known Galaxy tag (e.g. a deployed `$UserDefined` object attribute):
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
read -u opc.tcp://localhost:4840 -n "ns=2;s=<tag_name.AttributeName>"
```
5. Subscribe and verify live updates:
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
subscribe -u opc.tcp://localhost:4840 -n "ns=2;s=<tag_name.AttributeName>" -i 1000
```
### Pass criterion
- Browse returns a non-empty node tree mirroring the Galaxy hierarchy.
- Read returns `Good` quality with a non-null value.
- Subscribe receives at least one data-change notification within 5 s
(or within the configured publishing interval).
- No `BadNoCommunication` or `BadTimeout` errors in the server log.
Record: Galaxy version, deployed object count, OtOpcUa git SHA.
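
The last pass criterion, no `BadNoCommunication` or `BadTimeout` in the server log, can be checked with a text search. This is a sketch under assumptions: `Find-StatusCodeHits` is a hypothetical helper, the log is assumed to be a plain-text file, and its path is not specified in this plan.

```powershell
# Hypothetical helper: lists log lines that contain any of the given status names.
# The log location is an assumption; pass the path the server actually writes to.
function Find-StatusCodeHits {
    param(
        [Parameter(Mandatory)] [string] $LogPath,
        [string[]] $Codes = @('BadNoCommunication', 'BadTimeout')
    )

    # -SimpleMatch treats each code as a literal string rather than a regex.
    Select-String -LiteralPath $LogPath -Pattern $Codes -SimpleMatch |
        ForEach-Object { '{0}: {1}' -f $_.LineNumber, $_.Line }
}

# Possible usage (the path is a placeholder):
#   $hits = @(Find-StatusCodeHits -LogPath '<path-to-server-log>')
#   if ($hits.Count -eq 0) { 'log clean' } else { $hits }
```

An empty result satisfies the criterion. The same helper could be pointed at other status names through `-Codes`, provided the log records them as text.
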
---
## G7 — FOCAS live-CNC smoke (task #54)
**Requires**: real FANUC CNC with Ethernet option, accessible on TCP port 8193
from the dev box; CNC series known (e.g. 0i-F, 30i-B).
See `docs/plans/live-hardware-validation-runbooks.md` §FOCAS for the full
runbook.
### Pass criterion
- `WireFocasClient` opens a FOCAS2 session (`cnc_allclibhndl3` succeeds).
- Identity nodes (`Identity/SeriesNumber`, `Identity/MaxAxes`) return non-null
values matching the physical control panel display.
- At least one axis position (`Axes/X/AbsolutePosition` or similar) returns
`Good` quality with a plausible double value.
- Subscribe on a polled tag delivers at least three updates within 5 s.
- No `EW_SOCKET` (-16) or `EW_HANDLE` (-8) errors in the server log during a
2-minute soak.
Record: CNC series, firmware version, test date, OtOpcUa git SHA.
---
## G8 — OPC UA Conformance Test Tool (CTT) pass
**Requires**: OPC Foundation OPC UA Compliance Test Tool (CTT) or the
open-source UA Compliance Test Tool installed on the client machine;
live OtOpcUa server endpoint.
### Recommended minimum profile set
- `Attribute Read`
- `Attribute Write`
- `Browse`
- `Subscription` (DataChange)
- `Server-side monitoring`
- `Security — None profile` (if server configured with `Security:Profiles=[None]`)
### Procedure
1. Launch CTT. Add server endpoint: `opc.tcp://localhost:4840`.
2. Run the profile set above.
3. Capture the CTT report HTML/XML.
### Pass criterion
All mandatory test cases in each profile: **PASS** or **NOT APPLICABLE**.
Zero mandatory failures. Advisory failures may be documented with rationale
(e.g. optional capability not implemented).
Record: CTT version, profile set, OtOpcUa git SHA, report artifact.
---
## G9 — Non-transparent redundancy cutover with production client
See `docs/plans/phase-6-3-redundancy-interop-plan.md` for the full runbook.
**Minimum acceptable result**: one complete pass of the A-block (UaExpert
OPC UA signal verification) plus scenario B2 (UaExpert failover on Primary
kill).
Ignition 8.3 is the recommended production client per decision #85. If
Ignition is not available on the lab machine, UaExpert is accepted for v2 GA.
Record: client name + version, OtOpcUa git SHA, test date.
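
Gates G6 through G9 each end with a "Record:" line listing the evidence to keep. One way to capture that evidence consistently is a small JSON record per run. The sketch below is illustrative only: `New-GateRecord` and its field names are hypothetical, and this plan does not prescribe a record format.

```powershell
# Hypothetical helper: builds one evidence record for a gate run as JSON.
# Field names are illustrative; the plan does not prescribe a record format.
function New-GateRecord {
    param(
        [Parameter(Mandatory)] [ValidatePattern('^G[1-9]$')] [string] $Gate,
        [Parameter(Mandatory)] [hashtable] $Fields,
        # Defaults to the current commit; evaluated only when -GitSha is omitted.
        [string] $GitSha = (git rev-parse HEAD)
    )

    $record = [ordered]@{
        gate     = $Gate
        testDate = (Get-Date).ToString('yyyy-MM-dd')
        gitSha   = $GitSha
    }
    # Append the gate-specific fields in a stable (sorted) order.
    foreach ($key in ($Fields.Keys | Sort-Object)) {
        $record[$key] = $Fields[$key]
    }
    $record | ConvertTo-Json
}

# Possible usage from the repo root:
#   New-GateRecord -Gate 'G9' -Fields @{ client = 'UaExpert'; clientVersion = '<version>' }
```
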
---
## Gate summary table
| Gate | Command / Procedure | Pass criterion | Owner |
|------|---------------------|----------------|-------|
| G1 | `pwsh ./scripts/compliance/phase-6-all.ps1` | Exit 0, no `[FAIL]` | Dev |
| G2 | `dotnet test ZB.MOM.WW.OtOpcUa.slnx` | >= 1159 passing, <= 1 failure | Dev |
| G3 | Audit PR list in release-readiness.md | All blockers show CLOSED | Dev |
| G4 | Audit driver table | All 8 drivers listed as shipped | Dev |
| G5 | Run deployment checklist doc | All items checked; Fleet Admin signs off | Fleet Admin |
| G6 | Browse/read/subscribe against live Galaxy | Good quality, non-empty tree | Dev (dev box) |
| G7 | FOCAS CNC smoke — see live-hardware runbook | Session open, Good quality reads | Dev + lab hardware |
| G8 | CTT profile run against live endpoint | Zero mandatory failures | Dev + CTT tool |
| G9 | Redundancy cutover runbook | A-block + B2 pass with >= 1 client | Dev + two instances |