docs(v2): update for gap-closeout — peer-URI discovery, role overlays, release status
This commit is contained in:
@@ -59,7 +59,7 @@ For Modbus / S7 / AB CIP / AB Legacy / TwinCAT / FOCAS / OPC UA Client specifics
|
||||
| [security.md](security.md) | Transport security profiles, LDAP auth, ACL trie, role grants, OTOPCUA0001 analyzer |
|
||||
| [Redundancy.md](Redundancy.md) | `RedundancyCoordinator`, `ServiceLevelCalculator`, apply-lease, Prometheus metrics |
|
||||
| [Reservations.md](Reservations.md) | Fleet-wide ZTag / SAPID external-ID reservations — publish-time claim, release flow |
|
||||
| [ServiceHosting.md](ServiceHosting.md) | Two-process deploy (Server + Admin) install/uninstall, plus the optional `OtOpcUaWonderwareHistorian` sidecar |
|
||||
| [ServiceHosting.md](ServiceHosting.md) | Single fused `OtOpcUa.Host` binary install/uninstall with `OTOPCUA_ROLES` gating, plus the optional `OtOpcUaWonderwareHistorian` sidecar |
|
||||
| [StatusDashboard.md](StatusDashboard.md) | Pointer — superseded by [v2/admin-ui.md](v2/admin-ui.md) |
|
||||
|
||||
### Client tooling
|
||||
|
||||
@@ -2,7 +2,9 @@
|
||||
|
||||
## Overview
|
||||
|
||||
OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two or more `OtOpcUa.Host` processes run side-by-side, share the same Config DB, and join the same Akka.NET cluster. Each process owns a distinct `ApplicationUri`; OPC UA clients see both endpoints via the standard `ServerUriArray` and pick one based on the `ServiceLevel` byte that each server publishes.
|
||||
OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two or more `OtOpcUa.Host` processes run side-by-side, share the same Config DB, and join the same Akka.NET cluster. Each process owns a distinct `ApplicationUri`; OPC UA clients discover both endpoints by reading `Server.ServerArray` (NodeId `i=2254`) on either node and pick one based on the `ServiceLevel` byte that each server publishes.
|
||||
|
||||
> **Discovery surface.** The `ServerArray` path on the `Server` object is what each node populates with self + peer `ApplicationUri`s — see `OpcUaApplicationHost.PopulateServerArray` and the per-node `PeerApplicationUris` option below. The redundancy-object-type `ServerUriArray` proper (a child of `Server.ServerRedundancy`) remains deferred pending an SDK object-type upgrade; clients should read `Server.ServerArray` for peer discovery today.
|
||||
|
||||
> **v2 change.** v1's operator-managed `ClusterNode.RedundancyRole` column + `RedundancyCoordinator` / `ApplyLeaseRegistry` / `PeerHttpProbeLoop` are gone. Primary/secondary is now derived from **Akka cluster role-leader** for the `driver` role. The operator no longer writes a role into the DB; cluster topology + health drive ServiceLevel automatically.
|
||||
|
||||
@@ -78,6 +80,20 @@ Both nodes share the same `ConfigDb` connection string; `Cluster.PublicHostname`
|
||||
|
||||
There is no longer a `Node:NodeId` setting, no `ClusterNode.RedundancyRole`, no `ServiceLevelBase`. NodeId is derived as `host:port` of the cluster `PublicHostname` (see `ClusterRoleInfo.LocalNode` for the formula).
|
||||
|
||||
### Peer URI advertising
|
||||
|
||||
Each node advertises its partner via `OpcUaApplicationHostOptions.PeerApplicationUris` (an `IList<string>`, default empty). `OpcUaApplicationHost.PopulateServerArray` appends each configured peer URI to the SDK's `IServerInternal.ServerUris` string table after server startup, so that `Server.ServerArray` reads served by `OnReadServerArray` return both self + peers. Set this per-node in `appsettings.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"OpcUaServer": {
|
||||
"PeerApplicationUris": ["urn:node-b:OtOpcUa"]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Node A lists Node B's `ApplicationUri` and vice-versa. Validated by `DualEndpointTests` in `tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests/` — boots two `OpcUaApplicationHost` instances on loopback, asserts a real OPCFoundation client `Session` reading `Server.ServerArray` from Node A sees both URIs.
|
||||
|
||||
## Split-brain
|
||||
|
||||
`akka.conf` configures Akka's split-brain resolver with `active-strategy = keep-oldest`, `stable-after = 15s`, and `failure-detector.threshold = 10.0`. Under a clean partition: the oldest member stays up + the smaller (or younger) side downs itself within ~15 seconds. The `RedundancyStateActor` on the surviving partition re-computes from the post-partition `Cluster.State`.
|
||||
|
||||
@@ -25,6 +25,16 @@ Galaxy access still uses the separately-installed **mxaccessgw** sidecar (see `d
|
||||
|
||||
Single-node dev: `OTOPCUA_ROLES=admin,driver`. Production: typically two admin nodes (HA pair) + N driver nodes.
|
||||
|
||||
### Per-role configuration overlays
|
||||
|
||||
`Program.cs:33-35` builds a role suffix by joining the parsed roles **alphabetically** with `-` and loads `appsettings.{roleSuffix}.json` as an optional overlay on top of base `appsettings.json`. Three overlays ship in `src/Server/ZB.MOM.WW.OtOpcUa.Host/`:
|
||||
|
||||
- `appsettings.admin.json` — admin-only nodes
|
||||
- `appsettings.driver.json` — driver-only nodes
|
||||
- `appsettings.admin-driver.json` — fused single-node dev / small deployments
|
||||
|
||||
All three carry Serilog log-level overrides + `Security:Ldap:DevStubMode = false`. Loading order is **base `appsettings.json` → role overlay (`appsettings.{role}.json`) → environment overlay (`appsettings.{Environment}.json`)** — later layers win. Overlays are optional; the base file boots a node on its own.
|
||||
|
||||
## Akka cluster
|
||||
|
||||
The host joins an Akka.NET cluster bound to the address in `appsettings.json::Cluster`:
|
||||
|
||||
@@ -124,4 +124,5 @@ Each cluster member has a `NodeId` derived as `{PublicHostname}:{Port}` of the A
|
||||
| Driver actors | `Runtime.WithOtOpcUaRuntimeActors` | extension on `AkkaConfigurationBuilder` |
|
||||
| Auth pipeline | `Security.AddOtOpcUaAuth` + `MapOtOpcUaAuth` | extensions on `IServiceCollection` / `IEndpointRouteBuilder` |
|
||||
| OPC UA facade | `OpcUaServer.OpcUaApplicationHost` | runtime host, started by driver-role startup |
|
||||
| Partner-URI advertising | `OpcUaServer.OpcUaApplicationHost.PopulateServerArray` | runs after `_application.Start`, appends `PeerApplicationUris` to the SDK `ServerUris` `StringTable` so `Server.ServerArray` (i=2254) returns self + peers |
|
||||
| Health endpoints | `Host.Health.AddOtOpcUaHealth` + `MapOtOpcUaHealth` | extensions on `IServiceCollection` / `IEndpointRouteBuilder` |
|
||||
|
||||
@@ -67,6 +67,8 @@ The Cluster.Tests project verifies these key values stay correct (`HoconLoaderTe
|
||||
- `SeedNodes`: where new nodes go to join. List one (or two) stable nodes. First node bootstraps the cluster from its own address.
|
||||
- `Roles`: free-form tags Akka gossip propagates. v2 uses `admin` + `driver`; per-role wiring in `Program.cs` reads `OTOPCUA_ROLES` env var, not this list — these two should stay in sync.
|
||||
|
||||
Per-role overlay files (`appsettings.admin.json`, `appsettings.driver.json`, `appsettings.admin-driver.json`) layer on top of base `appsettings.json` based on the parsed `OTOPCUA_ROLES` (alphabetical, joined by `-`). See [ServiceHosting.md § Per-role configuration overlays](../ServiceHosting.md#per-role-configuration-overlays).
|
||||
|
||||
## IClusterRoleInfo
|
||||
|
||||
Anywhere in the host that needs the local node's identity or a view of who-else-is-in-the-cluster, inject `IClusterRoleInfo`:
|
||||
|
||||
@@ -96,7 +96,7 @@ Shipped as PR #183 (12 tests in configuration; 13 more in Admin.Tests).
|
||||
| F.4 — Test harness (modal, synthetic inputs, output + logger display) | **Partial** | `ScriptTestHarnessService.cs` is complete and tested. `ScriptsTab.razor` calls `Harness.RunVirtualTagAsync` with zero-value synthetic inputs derived from the extractor. A full interactive input-form modal was not shipped — the harness zeroes all inputs automatically rather than prompting the operator per-tag. |
|
||||
| F.5 — Script log viewer (SignalR tail of `scripts-*.log` filtered by `ScriptName`, load-more) | **Not started** | No SignalR stream of the scripts log is wired in the Admin UI. The `AlertHub` / `FleetStatusHub` exist but there is no `ScriptLogHub`. |
|
||||
| F.6 — `/alarms/historian` diagnostics view | **Done** | `AlarmsHistorian.razor` + `HistorianDiagnosticsService.cs` |
|
||||
| F.7 — Playwright smoke (author calc tag, verify in equipment tree; author alarm, verify in `AlarmsAndConditions`) | **Not started** | `tests/Server/ZB.MOM.WW.OtOpcUa.Admin.E2ETests/` exists but its `UnsTabDragDropE2ETests.cs` is the only Playwright test; no Phase 7 Admin UI playwright scenario. |
|
||||
| F.7 — Playwright smoke (author calc tag, verify in equipment tree; author alarm, verify in `AlarmsAndConditions`) | **Not started** | No Phase 7 Playwright/E2E project exists in the repo today; future-work item without an assigned path. |
|
||||
|
||||
Shipped as PR #185 (13 Admin service tests; UI completeness is partial — see gaps section).
|
||||
|
||||
|
||||
@@ -55,6 +55,7 @@ Each row is one manual run; pass criterion in the right column.
|
||||
| A2 | ServiceLevel updates on peer down | Connect to Primary. Stop Backup (`sc stop OtOpcUa`). Watch `ServiceLevel`. | Transitions 200 → 150 within ~2 s of peer probe timeout |
|
||||
| A3 | RedundancySupport | Browse to `Server.ServerRedundancy.RedundancySupport`. | Value matches the declared `RedundancyMode` (Warm / Hot / None) |
|
||||
| A4 | ServerUriArray (non-transparent upgrade) | Requires a redundancy-object-type upgrade follow-up. | When upgrade lands: `ServerUriArray` reports both ApplicationUris, self first |
|
||||
| A4b | Peer URI visibility via `Server.ServerArray` (i=2254) | Configure each `OpcUaApplicationHost` with the partner's `ApplicationUri` via `OpcUaApplicationHostOptions.PeerApplicationUris`. From any client, Read NodeId `i=2254` (`Server.ServerArray`). | Returned `String[]` includes both self + peer `ApplicationUri`s. Validated by `DualEndpointTests` in `tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests/` (loopback dual-host with real OPCFoundation client `Session` read). |
|
||||
| A5 | Mid-apply dip | On Primary trigger a `sp_PublishGeneration` apply. | `ServiceLevel` drops to 75 for the apply duration + dwell |
|
||||
|
||||
### Block B — Client failover
|
||||
@@ -101,7 +102,9 @@ flips A4 from "deferred" to "expected pass").
|
||||
- **A4 pending**: `Server.ServerRedundancy` on our current SDK build lands as
|
||||
the base `ServerRedundancyState`, which has no `ServerUriArray` child.
|
||||
`ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips until the
|
||||
redundancy-object-type upgrade follow-up lands.
|
||||
redundancy-object-type upgrade follow-up lands. Cross-reference **A4b** —
|
||||
peer URIs are visible today via `Server.ServerArray` (i=2254) populated by
|
||||
`OpcUaApplicationHost.PopulateServerArray`.
|
||||
- **Recovery dwell default**: `RecoveryStateManager.DwellTime` defaults to 60 s
|
||||
in `Program.cs`. Adjust via future config knob if B3 takes too long to
|
||||
observe.
|
||||
@@ -121,8 +124,8 @@ flips A4 from "deferred" to "expected pass").
|
||||
redundancy implementations we don't control.
|
||||
- For the sub-set of scenarios that *can* be automated — the self-loopback
|
||||
case where our own `otopcua-cli` drives Primary + Backup — the existing
|
||||
`tests/Server/ZB.MOM.WW.OtOpcUa.Server.Tests/RedundancyStatePublisherTests` +
|
||||
`ServiceLevelCalculatorTests` (unit) + `ClusterTopologyLoaderTests`
|
||||
(integration) already cover the math + data path. The wire-level assertion
|
||||
that the values actually land on the right OPC UA nodes is covered by
|
||||
`ServerRedundancyNodeWriterTests`.
|
||||
`tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests/RedundancyStateActorTests` +
|
||||
`ServiceLevelCalculatorTests` (unit) already cover the math + data path.
|
||||
The wire-level assertion that the peer URIs actually land on the
|
||||
`Server.ServerArray` node (i=2254) is covered by `DualEndpointTests` in
|
||||
`tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests/`.
|
||||
|
||||
@@ -57,7 +57,7 @@ Remaining follow-ups (hardening):
|
||||
Remaining Phase 6.3 surfaces (hardening, not release-blocking):
|
||||
|
||||
- ~~`PeerHttpProbeLoop` + `PeerUaProbeLoop` HostedServices populating `PeerReachabilityTracker` on each tick.~~ **Closed 2026-04-24.** Two-layer probe model shipped: HTTP probe at 2 s / 1 s timeout against `/healthz`; OPC UA probe at 10 s / 5 s timeout via `DiscoveryClient.GetEndpoints`, short-circuiting when HTTP reports the peer unhealthy. Registered on the Server as `AddHostedService<PeerHttpProbeLoop>` + `AddHostedService<PeerUaProbeLoop>`. Publisher now sees accurate `PeerReachability` per peer instead of degrading to `Unknown` → Isolated-Primary band (230).
|
||||
- OPC UA variable-node wiring: bind `ServiceLevel` Byte + `ServerUriArray` String[] to the publisher's events via `BaseDataVariable.OnReadValue` / direct value push.
|
||||
- ~~OPC UA variable-node wiring: bind `ServiceLevel` Byte + `ServerUriArray` String[] to the publisher's events via `BaseDataVariable.OnReadValue` / direct value push.~~ **Closed 2026-05-26.** `ServiceLevel` byte binding closed earlier under Path D. Peer-URI half closed via `OpcUaApplicationHost.PopulateServerArray` — populates self + each `PeerApplicationUris` entry into the SDK `IServerInternal.ServerUris` `StringTable`; clients read `Server.ServerArray` (NodeId `i=2254`). Validated by `DualEndpointTests` in `tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests/`. `ServerUriArray` proper (the redundancy-object-type child) remains deferred pending object-type upgrade.
|
||||
- ~~`sp_PublishGeneration` wraps its apply in `await using var lease = coordinator.BeginApplyLease(...)` so the `PrimaryMidApply` band (200) fires during actual publishes (task #148 part 2).~~ **Closed 2026-04-24.** The apply loop now lives in `GenerationRefreshHostedService` — polls `sp_GetCurrentGenerationForCluster` every 5s, opens a lease when a new generation is detected, calls `RedundancyCoordinator.RefreshAsync` inside the `await using`, releases the lease on all exit paths. Replaces the previous "topology never refreshes without a process restart" behaviour.
|
||||
- Client interop matrix — Ignition / Kepware / Aveva OI Gateway (Stream F, task #150). Manual + doc-only.
|
||||
|
||||
@@ -118,6 +118,7 @@ v2 GA requires all of the following:
|
||||
|
||||
## Change log
|
||||
|
||||
- **2026-05-26** — Gap-closeout pass. `OpcUaApplicationHost.PopulateServerArray` populates `Server.ServerArray` (NodeId `i=2254`) with self + `OpcUaApplicationHostOptions.PeerApplicationUris`, giving non-transparent peer URI visibility through the standard discovery surface. New `tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests/` IT project (`DualEndpointTests`) validates with two real `OpcUaApplicationHost` instances on loopback + a live OPCFoundation client `Session` read. CI `v2-ci.yml` `integration:` job converted to a matrix across `Host.IntegrationTests` + `OpcUaServer.IntegrationTests`. Per-role appsettings overlays shipped (`appsettings.admin.json` / `appsettings.driver.json` / `appsettings.admin-driver.json`) — `Program.cs:33-35` loads by alphabetical-joined role suffix. `FailoverScenarioTests` → `FailoverDuringDeployTests` rename. Stale empty `src/Server/{Server,Admin}` + `tests/Server/{Server.Tests,Admin.Tests,Admin.E2ETests}` directories deleted (no source, absent from `.slnx`).
|
||||
- **2026-04-24** — Phase 5 driver complement closed (task #120 CLOSED). AB CIP, AB Legacy, TwinCAT, FOCAS all shipped. FOCAS migration: retired the Tier-C split (`Driver.FOCAS.Host` + `Driver.FOCAS.Shared` + `FwlibNative` + shim DLL deleted) in favour of a pure-managed in-process `FocasWireClient` inlined into `Driver.FOCAS`; driver is now read-only against the CNC by design. Integration test matrix grew to cover Browse / Subscribe / IAlarmSource / Probe end-to-end.
|
||||
- **2026-04-23** — Phase 6.4 audit close-out. IdentificationFolderBuilder + OPC 40010 Identification folder verified against the shipped code.
|
||||
- **2026-04-20** — Phase 7 plan drafted (`phase-7-scripting-and-alarming.md`, `phase-7-e2e-smoke.md`). Out of scope for v2 GA.
|
||||
|
||||
Reference in New Issue
Block a user