Files
lmxopcua/docs/v2/v2-release-readiness.md
Joseph Doherty fb6dd3478d Phase 6.2 Stream C wiring — AuthorizationBootstrap + OpcUaApplicationHost.SetAuthorization
Closes task #133 — the "authz gate is inert in production" blocker
surfaced during task #123. Before this commit, every ACL check on the
six dispatch surfaces (Read, Write, HistoryRead, Browse,
CreateMonitoredItems, Call) short-circuited to allow because Program.cs
constructed OpcUaApplicationHost without passing authzGate or
scopeResolver.

New pieces:

- `AuthorizationOptions` — bound to `Node:Authorization` in
  appsettings.json. `Enabled` (default false) is the master switch;
  `StrictMode` (default false) controls the anonymous / no-LDAP-groups
  fallback behaviour.
- `AuthorizationBootstrap` — singleton service that loads `NodeAcl`
  rows for the published generation, builds a `PermissionTrieCache` +
  `AuthorizationGate`, merges every registered driver's
  `EquipmentNamespaceContent` through `ScopePathIndexBuilder` into one
  full-path `NodeScopeResolver`. Returns `(null, null)` when disabled
  or when no generation is Published yet.
- `DriverEquipmentContentRegistry.Snapshot()` — new method returning a
  defensive copy of the driver → content map so the bootstrap can
  iterate without holding the lock.
- `OpcUaApplicationHost.SetAuthorization(gate, resolver)` — late-bind
  method matching the existing `SetPhase7Sources` pattern. Must run
  before `StartAsync`; rejects post-start rebinding with
  InvalidOperationException.
- `OpcUaServerService.ExecuteAsync` calls `AuthorizationBootstrap.BuildAsync`
  after `PopulateEquipmentContentAsync` and before `applicationHost.StartAsync`,
  in the same window that `SetPhase7Sources` runs.

Behaviour change
- Default (Enabled=false): no behaviour change — the gate stays null,
  all six dispatch surfaces run unchanged. Safe for any existing
  deployment on upgrade.
- Enabled=true with StrictMode=false: identities carrying LDAP groups
  are evaluated against the trie; anonymous / no-groups identities
  pass through (v1 legacy-client compatibility).
- Enabled=true with StrictMode=true: everything evaluates. Anonymous
  or no-groups identities are denied.

Follow-up not covered here: rebind the gate+resolver on generation
refresh (the `GenerationRefreshHostedService` that shipped earlier in
this session). Today the gate only reflects the bootstrap generation
— operators publishing new ACL changes need a process restart to see
them. Matches the current driver-hot-reload limitation and is tracked
in the existing 6.3 follow-up bullet.

Docs: v2-release-readiness.md Phase 6.2 Stream C.12 bullet flipped to
Closed with operator-facing config pointer (`Node:Authorization:Enabled`).

All 283/283 Server.Tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 15:35:46 -04:00

131 lines
15 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# v2 Release Readiness
> **Last updated**: 2026-04-24 (Phase 5 driver complement closed — AB CIP, AB Legacy, TwinCAT, FOCAS all shipped; FOCAS Tier-C retired for a pure-managed in-process client)
> **Status**: **RELEASE-READY (code-path)** for v2 GA. All three original code-path release blockers remain closed. Phase 5 is now complete. Remaining work is manual (live-hardware validations, client interop matrix, deployment checklist signoff, OPC UA CTT pass) + hardening follow-ups; see exit-criteria checklist below.
This doc is the single view of where v2 stands against its release criteria. Update it whenever a deferred follow-up closes or a new release blocker is discovered.
## Release-readiness dashboard
| Phase | Shipped | Status |
|---|---|---|
| Phase 0 — Rename + entry gate | ✓ | Shipped |
| Phase 1 — Configuration + Admin scaffold | ✓ | Shipped (some UI items deferred to 6.4) |
| Phase 2 — Galaxy driver split (Proxy/Host/Shared) | ✓ | Shipped |
| Phase 3 — OPC UA server + LDAP + security profiles | ✓ | Shipped |
| Phase 4 — Redundancy scaffold (entities + endpoints) | ✓ | Shipped (runtime closes in 6.3) |
| Phase 5 — Drivers | ✓ | **Shipped** — Galaxy, Modbus (+ DL205/S7/MELSEC profiles), S7 native, OPC UA Client, AB CIP, AB Legacy, TwinCAT ADS, FOCAS (managed wire client) |
| Phase 6.1 — Resilience & Observability | ✓ | Shipped (PRs #7883) |
| Phase 6.2 — Authorization runtime | ◐ core | Core shipped (PRs #8488, #94 dispatch wiring); finer-grained Browse/Subscribe/Alarm/Call gating + 3-user interop matrix deferred |
| Phase 6.3 — Redundancy runtime | ◐ core | Core shipped (PRs #8990, #9899); peer-probe HostedServices, OPC UA variable-node binding, `sp_PublishGeneration` lease wrap, client interop matrix deferred |
| Phase 6.4 — Admin UI completion | ◐ data layer + Identification | Data layer + OPC 40010 Identification folder shipped (PRs #9192, Identification audit close-out 2026-04-23); Blazor UI pieces deferred |
**Driver integration-test counts** (end-to-end against live or simulated targets): Modbus 26, FOCAS 9, AbCip 7, OpcUaClient 3, S7 3, AbLegacy 2, TwinCAT 2. Plus Galaxy's separate cross-FX parity/stability suite.
**Aggregate test counts** (2026-04-19 baseline): 1159 passing across the solution. One pre-existing Client.CLI `SubscribeCommandTests.Execute_PrintsSubscriptionMessage` flake tracked separately. Rerun `dotnet test ZB.MOM.WW.OtOpcUa.slnx` after the FOCAS migration commits land to refresh the number.
## Release blockers (must close before v2 GA)
All code-path release blockers are closed. The remaining items are live-hardware / manual validations listed under exit criteria.
### ~~Security — Phase 6.2 dispatch wiring~~ (task #143 — **CLOSED** 2026-04-19, PR #94)
**Closed**. `AuthorizationGate` + `NodeScopeResolver` thread through `OpcUaApplicationHost → OtOpcUaServer → DriverNodeManager`. `OnReadValue` + `OnWriteValue` + all four HistoryRead paths call `gate.IsAllowed(identity, operation, scope)` before the invoker. Production deployments activate enforcement by constructing `OpcUaApplicationHost` with an `AuthorizationGate(StrictMode: true)` + populating the `NodeAcl` table.
Remaining Stream C surfaces (hardening, not release-blocking):
- ~~Browse + TranslateBrowsePathsToNodeIds gating with ancestor-visibility logic per `acl-design.md` §Browse.~~ **Partial, 2026-04-24.** `DriverNodeManager.Browse` override post-filters the `ReferenceDescription` list via a new `FilterBrowseReferences` helper — denied nodes disappear silently per OPC UA convention. Ancestor-visibility implication (Read-grant at `Line/Tag` implying Browse on `Line`) still to ship; needs a subtree-has-any-grant query on the trie evaluator. `TranslateBrowsePathsToNodeIds` surface not yet wired.
- ~~CreateMonitoredItems + TransferSubscriptions gating with per-item `(AuthGenerationId, MembershipVersion)` stamp so revoked grants surface `BadUserAccessDenied` within one publish cycle (decision #153).~~ **Partial, 2026-04-24.** `DriverNodeManager.CreateMonitoredItems` override pre-gates each request and pre-populates `BadUserAccessDenied` into the errors slot for denied items (the base stack honours pre-set errors and skips those items). Decision #153's per-item `(AuthGenerationId, MembershipVersion)` stamp for detecting mid-subscription revocation is still to ship — needs subscription-layer plumbing. TransferSubscriptions not yet wired (same pattern).
- ~~Alarm Acknowledge / Confirm / Shelve gating.~~ **Partial, 2026-04-24.** Acknowledge + Confirm map to dedicated `OpcUaOperation.AlarmAcknowledge` / `AlarmConfirm` via `MapCallOperation`; Shelve falls through to generic `OpcUaOperation.Call` (needs per-instance method NodeId resolution to distinguish — follow-up).
- ~~Call (method invocation) gating.~~ **Closed 2026-04-24.** `DriverNodeManager.Call` override pre-gates each `CallMethodRequest` via `GateCallMethodRequests`. Denied calls return `BadUserAccessDenied` without running the method. Alarm methods map to alarm-specific operation kinds; everything else gates as generic `Call`.
- ~~Finer-grained scope resolution — current `NodeScopeResolver` returns a flat cluster-level scope. Joining against the live Configuration DB to populate UnsArea / UnsLine / Equipment path is tracked as Stream C.12.~~ **Closed 2026-04-24.** `AuthorizationBootstrap` now loads `NodeAcl` rows for the current generation into a `PermissionTrieCache`, builds the gate, and merges every registered driver's `EquipmentNamespaceContent` into a full-path `NodeScopeResolver` index. `OpcUaServerService` calls the bootstrap after the equipment registry is populated, before `OpcUaApplicationHost.StartAsync`. Disabled by default — operators flip `Node:Authorization:Enabled=true` to enforce, `StrictMode=true` to reject anonymous/no-groups identities.
- 3-user integration matrix covering every operation × allow/deny.
### ~~Config fallback — Phase 6.1 Stream D wiring~~ (task #136 — **CLOSED** 2026-04-19, PR #96)
**Closed**. `SealedBootstrap` consumes `ResilientConfigReader` + `GenerationSealedCache` + `StaleConfigFlag` end-to-end; `/healthz` surfaces the stale flag.
Remaining follow-ups (hardening):
- A `HostedService` that polls `sp_GetCurrentGenerationForCluster` periodically so peer-published generations land in this node's cache without a restart.
- Richer snapshot payload via `sp_GetGenerationContent` so fallback can serve full generation content (DriverInstance enumeration, ACL rows, etc.) from the sealed cache alone.
### ~~Redundancy — Phase 6.3 Streams A/C core~~ (tasks #145 + #147 — **CLOSED** 2026-04-19, PRs #9899)
**Closed**. `RedundancyCoordinator` + `RedundancyStatePublisher` + `PeerReachabilityTracker` orchestrate topology + apply lease + recovery state + peer reachability through `ServiceLevelCalculator` + emit `OnStateChanged` / `OnServerUriArrayChanged` edge-triggered events.
Remaining Phase 6.3 surfaces (hardening, not release-blocking):
- ~~`PeerHttpProbeLoop` + `PeerUaProbeLoop` HostedServices populating `PeerReachabilityTracker` on each tick.~~ **Closed 2026-04-24.** Two-layer probe model shipped: HTTP probe at 2 s / 1 s timeout against `/healthz`; OPC UA probe at 10 s / 5 s timeout via `DiscoveryClient.GetEndpoints`, short-circuiting when HTTP reports the peer unhealthy. Registered on the Server as `AddHostedService<PeerHttpProbeLoop>` + `AddHostedService<PeerUaProbeLoop>`. Publisher now sees accurate `PeerReachability` per peer instead of degrading to `Unknown` → Isolated-Primary band (230).
- OPC UA variable-node wiring: bind `ServiceLevel` Byte + `ServerUriArray` String[] to the publisher's events via `BaseDataVariable.OnReadValue` / direct value push.
- ~~`sp_PublishGeneration` wraps its apply in `await using var lease = coordinator.BeginApplyLease(...)` so the `PrimaryMidApply` band (200) fires during actual publishes (task #148 part 2).~~ **Closed 2026-04-24.** The apply loop now lives in `GenerationRefreshHostedService` — polls `sp_GetCurrentGenerationForCluster` every 5s, opens a lease when a new generation is detected, calls `RedundancyCoordinator.RefreshAsync` inside the `await using`, releases the lease on all exit paths. Replaces the previous "topology never refreshes without a process restart" behaviour.
- Client interop matrix — Ignition / Kepware / Aveva OI Gateway (Stream F, task #150). Manual + doc-only.
### ~~Phase 5 driver complement~~ (task #120 — **CLOSED** 2026-04-24)
**Closed**. All four deferred drivers shipped:
- **AB CIP** (PRs #202222) — `Driver.AbCip`, `Driver.AbCip.IntegrationTests` (7 tests), AB CIP Cli. Live-boot verified against a ControlLogix rig.
- **AB Legacy** (PRs #202, #223) — `Driver.AbLegacy`, `Driver.AbLegacy.IntegrationTests` (2 tests), AB Legacy Cli. PCCC cip-path workaround for SLC/MicroLogix.
- **TwinCAT ADS** (PRs #205, this branch `task-galaxy-e2e`) — `Driver.TwinCAT`, `Driver.TwinCAT.IntegrationTests` (2 tests), TwinCAT Cli. TCBSD/ESXi fixture for e2e since local Hyper-V / TwinCAT RTIME are mutually exclusive on the dev box.
- **FOCAS** (PRs #173, #199 + this session's migration) — `Driver.FOCAS` with an **in-process managed `FocasWireClient`** that speaks FOCAS/2 over TCP directly. Tier-C isolation retired — `Driver.FOCAS.Host` + `Driver.FOCAS.Shared` + `FwlibNative` P/Invoke + shim DLL + NSSM service all deleted. `Driver.FOCAS.IntegrationTests` covers 9 scenarios (fixed tree identity/axes/program/timers/spindle + user-authored PARAM/MACRO/PMC reads, Browse, Subscribe, IAlarmSource raise/clear, Probe transitions).
Decision recorded: FOCAS is **read-only** against the CNC by design — writes return `BadNotWritable`. See `docs/drivers/FOCAS.md` + `docs/drivers/FOCAS-Test-Fixture.md` for the deployment + coverage map.
## Nice-to-haves (not release-blocking)
- **Admin UI** — Phase 6.1 Stream E.2/E.3 (`/hosts` column refresh), Phase 6.2 Stream D (`RoleGrantsTab` + `AclsTab` Probe), Phase 6.3 Stream E (`RedundancyTab`), Phase 6.4 Streams A/B UI pieces, Stream C DiffViewer, Stream D `IdentificationFields.razor`. Tasks #134, #144, #149, #153, #155, #156, #157.
- **Background services** — Phase 6.1 Stream B.4 `ScheduledRecycleScheduler` HostedService (task #137), Phase 6.1 Stream A analyzer (task #135 — Roslyn analyzer asserting every capability surface routes through `CapabilityInvoker`).
- **Multi-host dispatch** — Phase 6.1 Stream A follow-up (task #135). Every driver currently gets a single pipeline keyed on `driver.DriverInstanceId`; multi-host drivers (Modbus with N PLCs) need per-PLC host resolution so failing PLCs trip per-PLC breakers without poisoning siblings. Decision #144 requires this but not wired.
- **Phase 7** — scripting + alarming + historian sink (plan drafted 2026-04-20 in `docs/v2/implementation/phase-7-*.md`). Out of scope for v2 GA.
## Live-hardware validations (task #54 + task family)
The code ships; these tasks remain open as lab/field verification:
- **#54** — FOCAS live-CNC wire-level smoke against a real FANUC control. The mock's wire responder is PDU-verified against `fwlibe64.dll` upstream but OtOpcUa's managed client has not been pointed at a production CNC.
- **AB CIP live-boot** — already passed on a ControlLogix rig (PR #222). Continue to run ahead of each release.
- **TwinCAT wire-live** — TCBSD/ESXi fixture covers the common path; production PLC verification remains lab-gated.
## Running the release-readiness check
```bash
pwsh ./scripts/compliance/phase-6-all.ps1
```
This meta-runner invokes each `phase-6-N-compliance.ps1` script in sequence and reports an aggregate PASS/FAIL:
- `phase-6-1-compliance.ps1` — Resilience & Observability
- `phase-6-2-compliance.ps1` — Authorization runtime
- `phase-6-3-compliance.ps1` — Redundancy runtime
- `phase-6-4-compliance.ps1` — Admin UI completion
Exit 0 = every phase passes its compliance checks + no test-count regression.
## Release-readiness exit criteria
v2 GA requires all of the following:
- [ ] All four Phase 6.N compliance scripts exit 0.
- [ ] `dotnet test ZB.MOM.WW.OtOpcUa.slnx` passes with ≤ 1 known-flake failure.
- [x] Release blockers listed above all closed.
- [x] Phase 5 driver complement shipped (Galaxy, Modbus, S7, OpcUaClient, AbCip, AbLegacy, TwinCAT, FOCAS).
- [ ] Production deployment checklist (separate doc) signed off by Fleet Admin.
- [ ] At least one end-to-end integration run against the live Galaxy on the dev box succeeds.
- [ ] FOCAS live-CNC wire-level smoke (#54) runs clean against a real FANUC control.
- [ ] OPC UA conformance test (CTT or UA Compliance Test Tool) passes against the live endpoint.
- [ ] Non-transparent redundancy cutover validated with at least one production client (Ignition 8.3 recommended — see decision #85).
## Change log
- **2026-04-24** — Phase 5 driver complement closed (task #120 CLOSED). AB CIP, AB Legacy, TwinCAT, FOCAS all shipped. FOCAS migration: retired the Tier-C split (`Driver.FOCAS.Host` + `Driver.FOCAS.Shared` + `FwlibNative` + shim DLL deleted) in favour of a pure-managed in-process `FocasWireClient` inlined into `Driver.FOCAS`; driver is now read-only against the CNC by design. Integration test matrix grew to cover Browse / Subscribe / IAlarmSource / Probe end-to-end.
- **2026-04-23** — Phase 6.4 audit close-out. IdentificationFolderBuilder + OPC 40010 Identification folder verified against the shipped code.
- **2026-04-20** — Phase 7 plan drafted (`phase-7-scripting-and-alarming.md`, `phase-7-e2e-smoke.md`). Out of scope for v2 GA.
- **2026-04-19** — Release blocker #3 closed (PRs #9899). Phase 6.3 Streams A + C core shipped: `ClusterTopologyLoader` + `RedundancyCoordinator` + `RedundancyStatePublisher` + `PeerReachabilityTracker`. Code-path release blockers all closed; remaining Phase 6.3 surfaces (peer-probe HostedServices, OPC UA variable-node binding, `sp_PublishGeneration` lease wrap, client interop matrix) are hardening follow-ups.
- **2026-04-19** — Release blocker #2 closed (PR #96). `SealedBootstrap` consumes `ResilientConfigReader` + `GenerationSealedCache` + `StaleConfigFlag`; `/healthz` surfaces the stale flag. Remaining follow-ups (periodic poller + richer snapshot payload) downgraded to hardening.
- **2026-04-19** — Release blocker #1 closed (PR #94). `AuthorizationGate` wired into `DriverNodeManager` Read / Write / HistoryRead dispatch. Remaining Stream C surfaces (Browse / Subscribe / Alarm / Call + finer-grained scope resolution) downgraded to hardening follow-ups — no longer release-blocking.
- **2026-04-19** — Phase 6.4 data layer merged (PRs #9192). Phase 6 core complete.
- **2026-04-19** — Phase 6.3 core merged (PRs #8990). `ServiceLevelCalculator` + `RecoveryStateManager` + `ApplyLeaseRegistry` land as pure logic; coordinator / UA-node wiring / Admin UI / interop deferred.
- **2026-04-19** — Phase 6.2 core merged (PRs #8488). `AuthorizationGate` + `TriePermissionEvaluator` + `LdapGroupRoleMapping` land; dispatch wiring + Admin UI deferred.
- **2026-04-19** — Phase 6.1 shipped (PRs #7883). Polly resilience + Tier A/B/C stability + health endpoints + LiteDB generation-sealed cache + Admin `/hosts` data layer all live.