Closes task #116 (GA hardening backlog). Before this commit the RedundancyStatePublisher saw PeerReachability.Unknown for every peer because the tracker had no writers, so every healthy peer got degraded to the Isolated-Primary band (230) even when fully reachable. Not release-blocking (safe default), but not the full non-transparent-redundancy UX either.
Two-layer probe model per docs/v2/implementation/phase-6-3-redundancy-runtime.md §Stream B:
- PeerHttpProbeLoop (Stream B.1): fast-fail layer at 2 s / 1 s timeout. Hits each peer's http://{Host}:{DashboardPort}/healthz via an injected IHttpClientFactory. Writes the HTTP bit of PeerReachability while preserving the UA bit from the last UA probe so a transient HTTP blip doesn't clobber the authoritative UA reading.
- PeerUaProbeLoop (Stream B.2): authoritative layer at 10 s / 5 s timeout. Calls DiscoveryClient.GetEndpoints against opc.tcp://{Host}:{OpcUaPort}; this is cheap compared to a full Session.Create and requires no cert trust. Short-circuits when the HTTP probe last reported the peer unhealthy (no wasted handshakes on a known-dead endpoint), clearing the stale UaHealthy bit in that case.
Both inherit from BackgroundService, follow the tick/delay/catch pattern RedundancyPublisherHostedService + ResilienceStatusPublisherHostedService established, and expose TickAsync() as internal for test drive-through.
New PeerProbeOptions class carries the four intervals/timeouts so operators can tune cadence per site. Registered as singleton in Program.cs; HTTP client registered by name so the OtOpcUa handler chain (Serilog enrichers, potential future OpenTelemetry instrumentation) isn't bypassed.
Tests: 9 new unit tests across PeerHttpProbeLoopTests (5) and PeerUaProbeLoopTests (4). All pass. Server.Tests total 243 → 252. Full solution build clean.
Docs: v2-release-readiness.md Phase 6.3 follow-ups list marks the peer-probe bullet struck-through with a close-out note.
Still deferred in Phase 6.3:
- OPC UA variable-node binding (task #117: ServiceLevel + ServerUriArray)
- sp_PublishGeneration lease wrap (task #118)
- Client interop matrix (task #119)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v2 Release Readiness
Last updated: 2026-04-24 (Phase 5 driver complement closed: AB CIP, AB Legacy, TwinCAT, FOCAS all shipped; FOCAS Tier-C retired for a pure-managed in-process client)
Status: RELEASE-READY (code-path) for v2 GA. All three original code-path release blockers remain closed. Phase 5 is now complete. Remaining work is manual (live-hardware validations, client interop matrix, deployment checklist signoff, OPC UA CTT pass) + hardening follow-ups; see exit-criteria checklist below.
This doc is the single view of where v2 stands against its release criteria. Update it whenever a deferred follow-up closes or a new release blocker is discovered.
Release-readiness dashboard
| Phase | Shipped | Status |
|---|---|---|
| Phase 0 — Rename + entry gate | ✓ | Shipped |
| Phase 1 — Configuration + Admin scaffold | ✓ | Shipped (some UI items deferred to 6.4) |
| Phase 2 — Galaxy driver split (Proxy/Host/Shared) | ✓ | Shipped |
| Phase 3 — OPC UA server + LDAP + security profiles | ✓ | Shipped |
| Phase 4 — Redundancy scaffold (entities + endpoints) | ✓ | Shipped (runtime closes in 6.3) |
| Phase 5 — Drivers | ✓ | Shipped — Galaxy, Modbus (+ DL205/S7/MELSEC profiles), S7 native, OPC UA Client, AB CIP, AB Legacy, TwinCAT ADS, FOCAS (managed wire client) |
| Phase 6.1 — Resilience & Observability | ✓ | Shipped (PRs #78–83) |
| Phase 6.2 — Authorization runtime | ◐ core | Core shipped (PRs #84–88, #94 dispatch wiring); finer-grained Browse/Subscribe/Alarm/Call gating + 3-user interop matrix deferred |
| Phase 6.3 — Redundancy runtime | ◐ core | Core shipped (PRs #89–90, #98–99); peer-probe HostedServices closed 2026-04-24; OPC UA variable-node binding, sp_PublishGeneration lease wrap, client interop matrix deferred |
| Phase 6.4 — Admin UI completion | ◐ data layer + Identification | Data layer + OPC 40010 Identification folder shipped (PRs #91–92, Identification audit close-out 2026-04-23); Blazor UI pieces deferred |
Driver integration-test counts (end-to-end against live or simulated targets): Modbus 26, FOCAS 9, AbCip 7, OpcUaClient 3, S7 3, AbLegacy 2, TwinCAT 2. Plus Galaxy's separate cross-FX parity/stability suite.
Aggregate test counts (2026-04-19 baseline): 1159 passing across the solution. One pre-existing Client.CLI SubscribeCommandTests.Execute_PrintsSubscriptionMessage flake tracked separately. Rerun dotnet test ZB.MOM.WW.OtOpcUa.slnx after the FOCAS migration commits land to refresh the number.
Release blockers (must close before v2 GA)
All code-path release blockers are closed. The remaining items are live-hardware / manual validations listed under exit criteria.
Security — Phase 6.2 dispatch wiring (task #143 — CLOSED 2026-04-19, PR #94)
Closed. AuthorizationGate + NodeScopeResolver thread through OpcUaApplicationHost → OtOpcUaServer → DriverNodeManager. OnReadValue + OnWriteValue + all four HistoryRead paths call gate.IsAllowed(identity, operation, scope) before the invoker. Production deployments activate enforcement by constructing OpcUaApplicationHost with an AuthorizationGate(StrictMode: true) + populating the NodeAcl table.
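The gate-before-invoker pattern described above can be sketched as follows. This is a language-neutral illustration in Python, not the shipped C# code: the class shape, the ACL tuple layout, and the reading of StrictMode as "deny when no ACL row matches" are all assumptions made for the example.

```python
from dataclasses import dataclass, field

BAD_USER_ACCESS_DENIED = "BadUserAccessDenied"  # OPC UA status code name


@dataclass
class AuthorizationGate:
    """Hypothetical stand-in for AuthorizationGate; not the real API."""
    strict_mode: bool = True
    # Assumed ACL shape: a set of (identity, operation, scope) grants.
    node_acl: set = field(default_factory=set)

    def is_allowed(self, identity: str, operation: str, scope: str) -> bool:
        if (identity, operation, scope) in self.node_acl:
            return True
        # Assumption: strict mode denies anything without an explicit grant,
        # while lax mode lets unmatched requests through.
        return not self.strict_mode


def guarded_read(gate, identity, scope, invoker):
    """Consult the gate first; only call the driver invoker when allowed."""
    if not gate.is_allowed(identity, "Read", scope):
        return (BAD_USER_ACCESS_DENIED, None)
    return ("Good", invoker())


gate = AuthorizationGate(strict_mode=True, node_acl={("alice", "Read", "cluster-1")})
calls = []


def read_tag():
    calls.append("read")
    return 42


allowed = guarded_read(gate, "alice", "cluster-1", read_tag)
denied = guarded_read(gate, "bob", "cluster-1", read_tag)
```

In this sketch the denied request never reaches the invoker, which is the property the dispatch wiring is meant to guarantee for Read, Write, and HistoryRead.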
Remaining Stream C surfaces (hardening, not release-blocking):
- Browse + TranslateBrowsePathsToNodeIds gating with ancestor-visibility logic per acl-design.md §Browse.
- CreateMonitoredItems + TransferSubscriptions gating with per-item (AuthGenerationId, MembershipVersion) stamp so revoked grants surface BadUserAccessDenied within one publish cycle (decision #153).
- Alarm Acknowledge / Confirm / Shelve gating.
- Call (method invocation) gating.
- Finer-grained scope resolution: current NodeScopeResolver returns a flat cluster-level scope. Joining against the live Configuration DB to populate UnsArea / UnsLine / Equipment path is tracked as Stream C.12.
- 3-user integration matrix covering every operation × allow/deny.
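One way the per-item (AuthGenerationId, MembershipVersion) stamp from the CreateMonitoredItems bullet could work is sketched below in Python. The feature is still deferred, so everything here (the MonitoredItem shape, the publish_cycle function, the is_allowed callback) is hypothetical and only illustrates the idea of re-checking access when the stamp changes.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Stamp = Tuple[int, int]  # (AuthGenerationId, MembershipVersion)


@dataclass
class MonitoredItem:
    node_id: str
    identity: str
    stamp: Stamp          # stamp captured when access was last checked
    denied: bool = False


def publish_cycle(items, current_stamp: Stamp, is_allowed: Callable[[str, str], bool]):
    """Return (node_id, status) per item, re-checking access only on stamp change."""
    results = []
    for item in items:
        if item.stamp != current_stamp:
            # ACLs or group membership changed since the last check: re-evaluate.
            item.denied = not is_allowed(item.identity, item.node_id)
            item.stamp = current_stamp
        results.append((item.node_id, "BadUserAccessDenied" if item.denied else "Good"))
    return results


grants = {("alice", "ns=2;s=Line1.Temp")}


def check(identity, node):
    return (identity, node) in grants


item = MonitoredItem("ns=2;s=Line1.Temp", "alice", stamp=(7, 1))
before = publish_cycle([item], (7, 1), check)

grants.clear()                                  # grant revoked, generation bumps to 8
after = publish_cycle([item], (8, 1), check)
```

Comparing a small stamp on every cycle keeps the steady-state cost low, and a revoked grant shows up on the first cycle after the stamp moves.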
Config fallback — Phase 6.1 Stream D wiring (task #136 — CLOSED 2026-04-19, PR #96)
Closed. SealedBootstrap consumes ResilientConfigReader + GenerationSealedCache + StaleConfigFlag end-to-end; /healthz surfaces the stale flag.
Remaining follow-ups (hardening):
- A HostedService that polls sp_GetCurrentGenerationForCluster periodically so peer-published generations land in this node's cache without a restart.
- Richer snapshot payload via sp_GetGenerationContent so fallback can serve full generation content (DriverInstance enumeration, ACL rows, etc.) from the sealed cache alone.
Redundancy — Phase 6.3 Streams A/C core (tasks #145 + #147 — CLOSED 2026-04-19, PRs #98–99)
Closed. RedundancyCoordinator + RedundancyStatePublisher + PeerReachabilityTracker orchestrate topology + apply lease + recovery state + peer reachability through ServiceLevelCalculator + emit OnStateChanged / OnServerUriArrayChanged edge-triggered events.
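A minimal Python sketch of edge-triggered publishing with an apply lease follows. The 200 and 230 bands are the ones named in this document; the 255 "healthy" value, the priority of the mid-apply band over the others, and every class and method name are assumptions for illustration rather than the shipped C# design.

```python
from contextlib import contextmanager

PRIMARY_MID_APPLY = 200      # band named in this document
ISOLATED_PRIMARY = 230       # band named in this document
HEALTHY_PRIMARY = 255        # hypothetical value, chosen only for the example


class StatePublisher:
    """Hypothetical publisher that fires on_state_changed only on edges."""

    def __init__(self):
        self._active_leases = 0
        self._peer_reachable = False
        self._last_level = None
        self.on_state_changed = []   # callbacks taking the new level

    def _compute_level(self) -> int:
        if self._active_leases > 0:
            return PRIMARY_MID_APPLY
        return HEALTHY_PRIMARY if self._peer_reachable else ISOLATED_PRIMARY

    def refresh(self) -> None:
        level = self._compute_level()
        if level != self._last_level:          # edge-triggered: skip repeats
            self._last_level = level
            for callback in self.on_state_changed:
                callback(level)

    def set_peer_reachable(self, reachable: bool) -> None:
        self._peer_reachable = reachable
        self.refresh()

    @contextmanager
    def begin_apply_lease(self):
        """Hold the mid-apply band for the duration of the with-block."""
        self._active_leases += 1
        self.refresh()
        try:
            yield
        finally:
            self._active_leases -= 1
            self.refresh()


publisher = StatePublisher()
seen = []
publisher.on_state_changed.append(seen.append)

publisher.refresh()                   # first evaluation: isolated primary
publisher.refresh()                   # unchanged, so no event
publisher.set_peer_reachable(True)    # edge
with publisher.begin_apply_lease():   # edge into mid-apply
    publisher.refresh()               # unchanged inside the lease
# Leaving the with-block releases the lease: edge back to healthy.
```

The with-block plays the role that a scoped lease disposal would play in C#: the band is restored even if the apply raises.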
Remaining Phase 6.3 surfaces (hardening, not release-blocking):
- PeerHttpProbeLoop + PeerUaProbeLoop HostedServices populating PeerReachabilityTracker on each tick. Closed 2026-04-24. Two-layer probe model shipped: HTTP probe at 2 s / 1 s timeout against /healthz; OPC UA probe at 10 s / 5 s timeout via DiscoveryClient.GetEndpoints, short-circuiting when HTTP reports the peer unhealthy. Registered on the Server as AddHostedService<PeerHttpProbeLoop> + AddHostedService<PeerUaProbeLoop>. Publisher now sees accurate PeerReachability per peer instead of degrading to Unknown → Isolated-Primary band (230).
- OPC UA variable-node wiring: bind ServiceLevel Byte + ServerUriArray String[] to the publisher's events via BaseDataVariable.OnReadValue / direct value push.
- sp_PublishGeneration wraps its apply in await using var lease = coordinator.BeginApplyLease(...) so the PrimaryMidApply band (200) fires during actual publishes (task #148 part 2).
- Client interop matrix: Ignition / Kepware / Aveva OI Gateway (Stream F, task #150). Manual + doc-only.
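The interaction between the two probe layers in the closed peer-probe item can be illustrated with a short Python sketch. The tracker and tick functions below are hypothetical stand-ins, not the C# PeerReachabilityTracker API, and the probes are injected as plain callables instead of real HTTP or OPC UA calls.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class PeerReachability:
    http_healthy: bool = False
    ua_healthy: bool = False


class PeerReachabilityTracker:
    """Hypothetical tracker; a peer with no entry stands in for Unknown."""

    def __init__(self):
        self._state: Dict[str, PeerReachability] = {}

    def get(self, peer: str) -> PeerReachability:
        return self._state.get(peer, PeerReachability())

    def update(self, peer: str, value: PeerReachability) -> None:
        self._state[peer] = value


def http_tick(tracker, peer, probe_healthz: Callable[[str], bool]) -> None:
    """Fast layer: write only the HTTP bit, keep the last UA reading."""
    current = tracker.get(peer)
    tracker.update(peer, replace(current, http_healthy=probe_healthz(peer)))


def ua_tick(tracker, peer, get_endpoints: Callable[[str], bool]) -> None:
    """Authoritative layer: skip the UA call when HTTP says the peer is down."""
    current = tracker.get(peer)
    if not current.http_healthy:
        # Short-circuit: clear the stale UA bit without attempting a handshake.
        tracker.update(peer, replace(current, ua_healthy=False))
        return
    tracker.update(peer, replace(current, ua_healthy=get_endpoints(peer)))


tracker = PeerReachabilityTracker()
ua_calls = []


def fake_get_endpoints(peer):
    ua_calls.append(peer)
    return True


http_tick(tracker, "node-b", lambda p: True)
ua_tick(tracker, "node-b", fake_get_endpoints)
both_up = tracker.get("node-b")

http_tick(tracker, "node-b", lambda p: False)   # HTTP blip: UA bit preserved
after_blip = tracker.get("node-b")

ua_tick(tracker, "node-b", fake_get_endpoints)  # short-circuits, clears UA bit
after_ua = tracker.get("node-b")
```

Only one UA probe is issued in this run: the second UA tick returns early because the HTTP layer already reported the peer as down.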
Phase 5 driver complement (task #120 — CLOSED 2026-04-24)
Closed. All four deferred drivers shipped:
- AB CIP (PRs #202–222): Driver.AbCip, Driver.AbCip.IntegrationTests (7 tests), AB CIP Cli. Live-boot verified against a ControlLogix rig.
- AB Legacy (PRs #202, #223): Driver.AbLegacy, Driver.AbLegacy.IntegrationTests (2 tests), AB Legacy Cli. PCCC cip-path workaround for SLC/MicroLogix.
- TwinCAT ADS (PRs #205, this branch task-galaxy-e2e): Driver.TwinCAT, Driver.TwinCAT.IntegrationTests (2 tests), TwinCAT Cli. TCBSD/ESXi fixture for e2e since local Hyper-V / TwinCAT RTIME are mutually exclusive on the dev box.
- FOCAS (PRs #173, #199 + this session's migration): Driver.FOCAS with an in-process managed FocasWireClient that speaks FOCAS/2 over TCP directly. Tier-C isolation retired: Driver.FOCAS.Host + Driver.FOCAS.Shared + FwlibNative P/Invoke + shim DLL + NSSM service all deleted. Driver.FOCAS.IntegrationTests covers 9 scenarios (fixed tree identity/axes/program/timers/spindle + user-authored PARAM/MACRO/PMC reads, Browse, Subscribe, IAlarmSource raise/clear, Probe transitions).
Decision recorded: FOCAS is read-only against the CNC by design — writes return BadNotWritable. See docs/drivers/FOCAS.md + docs/drivers/FOCAS-Test-Fixture.md for the deployment + coverage map.
Nice-to-haves (not release-blocking)
- Admin UI: Phase 6.1 Stream E.2/E.3 (/hosts column refresh), Phase 6.2 Stream D (RoleGrantsTab + AclsTab Probe), Phase 6.3 Stream E (RedundancyTab), Phase 6.4 Streams A/B UI pieces, Stream C DiffViewer, Stream D IdentificationFields.razor. Tasks #134, #144, #149, #153, #155, #156, #157.
- Background services: Phase 6.1 Stream B.4 ScheduledRecycleScheduler HostedService (task #137), Phase 6.1 Stream A analyzer (task #135: Roslyn analyzer asserting every capability surface routes through CapabilityInvoker).
- Multi-host dispatch: Phase 6.1 Stream A follow-up (task #135). Every driver currently gets a single pipeline keyed on driver.DriverInstanceId; multi-host drivers (Modbus with N PLCs) need per-PLC host resolution so failing PLCs trip per-PLC breakers without poisoning siblings. Decision #144 requires this, but it is not yet wired.
- Phase 7: scripting + alarming + historian sink (plan drafted 2026-04-20 in docs/v2/implementation/phase-7-*.md). Out of scope for v2 GA.
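The multi-host dispatch item comes down to the key used to look up a breaker. The Python sketch below keys breakers on (driver instance id, host) so one failing PLC does not open the breaker for its siblings. The CircuitBreaker here is a deliberately minimal consecutive-failure counter standing in for a real resilience pipeline, and the registry and its method names are hypothetical.

```python
from collections import defaultdict


class CircuitBreaker:
    """Minimal consecutive-failure breaker; a stand-in for a real pipeline."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def record(self, success: bool) -> None:
        # Any success resets the count; failures accumulate until the threshold.
        self.failures = 0 if success else self.failures + 1


class BreakerRegistry:
    """Hypothetical registry keyed on (driver instance id, host)."""

    def __init__(self):
        self._breakers = defaultdict(CircuitBreaker)

    def for_host(self, driver_instance_id: str, host: str) -> CircuitBreaker:
        return self._breakers[(driver_instance_id, host)]


registry = BreakerRegistry()
for _ in range(3):
    registry.for_host("modbus-1", "10.0.0.11").record(success=False)
registry.for_host("modbus-1", "10.0.0.12").record(success=True)

failing_open = registry.for_host("modbus-1", "10.0.0.11").is_open
sibling_open = registry.for_host("modbus-1", "10.0.0.12").is_open
```

With a key of driver instance id alone, both hosts would share one breaker and the healthy PLC at 10.0.0.12 would be cut off along with the failing one.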
Live-hardware validations (task #54 + task family)
The code ships; these tasks remain open as lab/field verification:
- #54: FOCAS live-CNC wire-level smoke against a real FANUC control. The mock's wire responder is PDU-verified against fwlibe64.dll upstream but OtOpcUa's managed client has not been pointed at a production CNC.
- AB CIP live-boot: already passed on a ControlLogix rig (PR #222). Continue to run ahead of each release.
- TwinCAT wire-live: TCBSD/ESXi fixture covers the common path; production PLC verification remains lab-gated.
Running the release-readiness check
pwsh ./scripts/compliance/phase-6-all.ps1
This meta-runner invokes each phase-6-N-compliance.ps1 script in sequence and reports an aggregate PASS/FAIL:
- phase-6-1-compliance.ps1: Resilience & Observability
- phase-6-2-compliance.ps1: Authorization runtime
- phase-6-3-compliance.ps1: Redundancy runtime
- phase-6-4-compliance.ps1: Admin UI completion
Exit 0 = every phase passes its compliance checks + no test-count regression.
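The aggregate PASS/FAIL rule can be sketched in Python as below. This is not the PowerShell meta-runner itself: the function names are hypothetical, the script runner is injected so the example stays self-contained, and running every script without stopping at the first failure is an assumption.

```python
from typing import Callable, Dict, Iterable

PHASE_SCRIPTS = [
    "phase-6-1-compliance.ps1",
    "phase-6-2-compliance.ps1",
    "phase-6-3-compliance.ps1",
    "phase-6-4-compliance.ps1",
]


def run_all(scripts: Iterable[str], run_script: Callable[[str], int]) -> Dict[str, int]:
    """Run every script in order (no early exit) and collect exit codes."""
    return {script: run_script(script) for script in scripts}


def aggregate_exit_code(results: Dict[str, int]) -> int:
    """Return 0 only when every phase script exited 0; otherwise 1."""
    return 0 if all(code == 0 for code in results.values()) else 1


# Fake runner for illustration; a real one might shell out to pwsh.
fake_exit_codes = {"phase-6-3-compliance.ps1": 1}
results = run_all(PHASE_SCRIPTS, lambda s: fake_exit_codes.get(s, 0))
overall = aggregate_exit_code(results)
all_green = aggregate_exit_code(run_all(PHASE_SCRIPTS, lambda s: 0))
```

Collecting every exit code before aggregating means a single run reports all failing phases, not only the first.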
Release-readiness exit criteria
v2 GA requires all of the following:
- All four Phase 6.N compliance scripts exit 0.
- dotnet test ZB.MOM.WW.OtOpcUa.slnx passes with ≤ 1 known-flake failure.
- Release blockers listed above all closed.
- Phase 5 driver complement shipped (Galaxy, Modbus, S7, OpcUaClient, AbCip, AbLegacy, TwinCAT, FOCAS).
- Production deployment checklist (separate doc) signed off by Fleet Admin.
- At least one end-to-end integration run against the live Galaxy on the dev box succeeds.
- FOCAS live-CNC wire-level smoke (#54) runs clean against a real FANUC control.
- OPC UA conformance test (CTT or UA Compliance Test Tool) passes against the live endpoint.
- Non-transparent redundancy cutover validated with at least one production client (Ignition 8.3 recommended — see decision #85).
Change log
- 2026-04-24: Phase 5 driver complement closed (task #120 CLOSED). AB CIP, AB Legacy, TwinCAT, FOCAS all shipped. FOCAS migration: retired the Tier-C split (Driver.FOCAS.Host + Driver.FOCAS.Shared + FwlibNative + shim DLL deleted) in favour of a pure-managed in-process FocasWireClient inlined into Driver.FOCAS; driver is now read-only against the CNC by design. Integration test matrix grew to cover Browse / Subscribe / IAlarmSource / Probe end-to-end.
- 2026-04-23: Phase 6.4 audit close-out. IdentificationFolderBuilder + OPC 40010 Identification folder verified against the shipped code.
- 2026-04-20: Phase 7 plan drafted (phase-7-scripting-and-alarming.md, phase-7-e2e-smoke.md). Out of scope for v2 GA.
- 2026-04-19: Release blocker #3 closed (PRs #98–99). Phase 6.3 Streams A + C core shipped: ClusterTopologyLoader + RedundancyCoordinator + RedundancyStatePublisher + PeerReachabilityTracker. Code-path release blockers all closed; remaining Phase 6.3 surfaces (peer-probe HostedServices, OPC UA variable-node binding, sp_PublishGeneration lease wrap, client interop matrix) are hardening follow-ups.
- 2026-04-19: Release blocker #2 closed (PR #96). SealedBootstrap consumes ResilientConfigReader + GenerationSealedCache + StaleConfigFlag; /healthz surfaces the stale flag. Remaining follow-ups (periodic poller + richer snapshot payload) downgraded to hardening.
- 2026-04-19: Release blocker #1 closed (PR #94). AuthorizationGate wired into DriverNodeManager Read / Write / HistoryRead dispatch. Remaining Stream C surfaces (Browse / Subscribe / Alarm / Call + finer-grained scope resolution) downgraded to hardening follow-ups and are no longer release-blocking.
- 2026-04-19: Phase 6.4 data layer merged (PRs #91–92). Phase 6 core complete.
- 2026-04-19: Phase 6.3 core merged (PRs #89–90). ServiceLevelCalculator + RecoveryStateManager + ApplyLeaseRegistry land as pure logic; coordinator / UA-node wiring / Admin UI / interop deferred.
- 2026-04-19: Phase 6.2 core merged (PRs #84–88). AuthorizationGate + TriePermissionEvaluator + LdapGroupRoleMapping land; dispatch wiring + Admin UI deferred.
- 2026-04-19: Phase 6.1 shipped (PRs #78–83). Polly resilience + Tier A/B/C stability + health endpoints + LiteDB generation-sealed cache + Admin /hosts data layer all live.