diff --git a/docs/README.md b/docs/README.md index c85de84..793cd86 100644 --- a/docs/README.md +++ b/docs/README.md @@ -59,7 +59,7 @@ For Modbus / S7 / AB CIP / AB Legacy / TwinCAT / FOCAS / OPC UA Client specifics | [security.md](security.md) | Transport security profiles, LDAP auth, ACL trie, role grants, OTOPCUA0001 analyzer | | [Redundancy.md](Redundancy.md) | `RedundancyCoordinator`, `ServiceLevelCalculator`, apply-lease, Prometheus metrics | | [Reservations.md](Reservations.md) | Fleet-wide ZTag / SAPID external-ID reservations — publish-time claim, release flow | -| [ServiceHosting.md](ServiceHosting.md) | Two-process deploy (Server + Admin) install/uninstall, plus the optional `OtOpcUaWonderwareHistorian` sidecar | +| [ServiceHosting.md](ServiceHosting.md) | Single fused `OtOpcUa.Host` binary install/uninstall with `OTOPCUA_ROLES` gating, plus the optional `OtOpcUaWonderwareHistorian` sidecar | | [StatusDashboard.md](StatusDashboard.md) | Pointer — superseded by [v2/admin-ui.md](v2/admin-ui.md) | ### Client tooling diff --git a/docs/Redundancy.md b/docs/Redundancy.md index fbab890..1be82f5 100644 --- a/docs/Redundancy.md +++ b/docs/Redundancy.md @@ -2,7 +2,9 @@ ## Overview -OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two or more `OtOpcUa.Host` processes run side-by-side, share the same Config DB, and join the same Akka.NET cluster. Each process owns a distinct `ApplicationUri`; OPC UA clients see both endpoints via the standard `ServerUriArray` and pick one based on the `ServiceLevel` byte that each server publishes. +OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two or more `OtOpcUa.Host` processes run side-by-side, share the same Config DB, and join the same Akka.NET cluster. Each process owns a distinct `ApplicationUri`; OPC UA clients discover both endpoints by reading `Server.ServerArray` (NodeId `i=2254`) on either node and pick one based on the `ServiceLevel` byte that each server publishes. + +> **Discovery surface.** The `ServerArray` path on the `Server` object is what each node populates with self + peer `ApplicationUri`s — see `OpcUaApplicationHost.PopulateServerArray` and the per-node `PeerApplicationUris` option below. The redundancy-object-type `ServerUriArray` proper (a child of `Server.ServerRedundancy`) remains deferred pending an SDK object-type upgrade; clients should read `Server.ServerArray` for peer discovery today. > **v2 change.** v1's operator-managed `ClusterNode.RedundancyRole` column + `RedundancyCoordinator` / `ApplyLeaseRegistry` / `PeerHttpProbeLoop` are gone. Primary/secondary is now derived from **Akka cluster role-leader** for the `driver` role. The operator no longer writes a role into the DB; cluster topology + health drive ServiceLevel automatically. @@ -78,6 +80,20 @@ Both nodes share the same `ConfigDb` connection string; `Cluster.PublicHostname` There is no longer a `Node:NodeId` setting, no `ClusterNode.RedundancyRole`, no `ServiceLevelBase`. NodeId is derived as `host:port` of the cluster `PublicHostname` (see `ClusterRoleInfo.LocalNode` for the formula). +### Peer URI advertising + +Each node advertises its partner via `OpcUaApplicationHostOptions.PeerApplicationUris` (an `IList`, default empty). `OpcUaApplicationHost.PopulateServerArray` appends each configured peer URI to the SDK's `IServerInternal.ServerUris` string table after server startup, so that `Server.ServerArray` reads served by `OnReadServerArray` return both self + peers. Set this per-node in `appsettings.json`: + +```json +{ + "OpcUaServer": { + "PeerApplicationUris": ["urn:node-b:OtOpcUa"] + } +} +``` + +Node A lists Node B's `ApplicationUri` and vice-versa. Validated by `DualEndpointTests` in `tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests/` — boots two `OpcUaApplicationHost` instances on loopback, asserts a real OPCFoundation client `Session` reading `Server.ServerArray` from Node A sees both URIs. + ## Split-brain `akka.conf` configures Akka's split-brain resolver with `active-strategy = keep-oldest`, `stable-after = 15s`, and `failure-detector.threshold = 10.0`. Under a clean partition: the oldest member stays up + the smaller (or younger) side downs itself within ~15 seconds. The `RedundancyStateActor` on the surviving partition re-computes from the post-partition `Cluster.State`. diff --git a/docs/ServiceHosting.md b/docs/ServiceHosting.md index df0ff98..eb0cc5f 100644 --- a/docs/ServiceHosting.md +++ b/docs/ServiceHosting.md @@ -25,6 +25,16 @@ Galaxy access still uses the separately-installed **mxaccessgw** sidecar (see `d Single-node dev: `OTOPCUA_ROLES=admin,driver`. Production: typically two admin nodes (HA pair) + N driver nodes. +### Per-role configuration overlays + +`Program.cs:33-35` builds a role suffix by joining the parsed roles **alphabetically** with `-` and loads `appsettings.{roleSuffix}.json` as an optional overlay on top of base `appsettings.json`. Three overlays ship in `src/Server/ZB.MOM.WW.OtOpcUa.Host/`: + +- `appsettings.admin.json` — admin-only nodes +- `appsettings.driver.json` — driver-only nodes +- `appsettings.admin-driver.json` — fused single-node dev / small deployments + +All three carry Serilog log-level overrides + `Security:Ldap:DevStubMode = false`. Loading order is **base `appsettings.json` → role overlay (`appsettings.{role}.json`) → environment overlay (`appsettings.{Environment}.json`)** — later layers win. Overlays are optional; the base file boots a node on its own. + ## Akka cluster The host joins an Akka.NET cluster bound to the address in `appsettings.json::Cluster`: diff --git a/docs/v2/Architecture-v2.md b/docs/v2/Architecture-v2.md index 1f35833..4240635 100644 --- a/docs/v2/Architecture-v2.md +++ b/docs/v2/Architecture-v2.md @@ -124,4 +124,5 @@ Each cluster member has a `NodeId` derived as `{PublicHostname}:{Port}` of the A | Driver actors | `Runtime.WithOtOpcUaRuntimeActors` | extension on `AkkaConfigurationBuilder` | | Auth pipeline | `Security.AddOtOpcUaAuth` + `MapOtOpcUaAuth` | extensions on `IServiceCollection` / `IEndpointRouteBuilder` | | OPC UA facade | `OpcUaServer.OpcUaApplicationHost` | runtime host, started by driver-role startup | +| Partner-URI advertising | `OpcUaServer.OpcUaApplicationHost.PopulateServerArray` | runs after `_application.Start`, appends `PeerApplicationUris` to the SDK `ServerUris` `StringTable` so `Server.ServerArray` (i=2254) returns self + peers | | Health endpoints | `Host.Health.AddOtOpcUaHealth` + `MapOtOpcUaHealth` | extensions on `IServiceCollection` / `IEndpointRouteBuilder` | diff --git a/docs/v2/Cluster.md b/docs/v2/Cluster.md index beed2cb..2840a34 100644 --- a/docs/v2/Cluster.md +++ b/docs/v2/Cluster.md @@ -67,6 +67,8 @@ The Cluster.Tests project verifies these key values stay correct (`HoconLoaderTe - `SeedNodes`: where new nodes go to join. List one (or two) stable nodes. First node bootstraps the cluster from its own address. - `Roles`: free-form tags Akka gossip propagates. v2 uses `admin` + `driver`; per-role wiring in `Program.cs` reads `OTOPCUA_ROLES` env var, not this list — these two should stay in sync. +Per-role overlay files (`appsettings.admin.json`, `appsettings.driver.json`, `appsettings.admin-driver.json`) layer on top of base `appsettings.json` based on the parsed `OTOPCUA_ROLES` (alphabetical, joined by `-`). See [ServiceHosting.md § Per-role configuration overlays](../ServiceHosting.md#per-role-configuration-overlays). + ## IClusterRoleInfo Anywhere in the host that needs the local node's identity or a view of who-else-is-in-the-cluster, inject `IClusterRoleInfo`: diff --git a/docs/v2/phase-7-status.md b/docs/v2/phase-7-status.md index cb7929d..ff25919 100644 --- a/docs/v2/phase-7-status.md +++ b/docs/v2/phase-7-status.md @@ -96,7 +96,7 @@ Shipped as PR #183 (12 tests in configuration; 13 more in Admin.Tests). | F.4 — Test harness (modal, synthetic inputs, output + logger display) | **Partial** | `ScriptTestHarnessService.cs` is complete and tested. `ScriptsTab.razor` calls `Harness.RunVirtualTagAsync` with zero-value synthetic inputs derived from the extractor. A full interactive input-form modal was not shipped — the harness zeroes all inputs automatically rather than prompting the operator per-tag. | | F.5 — Script log viewer (SignalR tail of `scripts-*.log` filtered by `ScriptName`, load-more) | **Not started** | No SignalR stream of the scripts log is wired in the Admin UI. The `AlertHub` / `FleetStatusHub` exist but there is no `ScriptLogHub`. | | F.6 — `/alarms/historian` diagnostics view | **Done** | `AlarmsHistorian.razor` + `HistorianDiagnosticsService.cs` | -| F.7 — Playwright smoke (author calc tag, verify in equipment tree; author alarm, verify in `AlarmsAndConditions`) | **Not started** | `tests/Server/ZB.MOM.WW.OtOpcUa.Admin.E2ETests/` exists but its `UnsTabDragDropE2ETests.cs` is the only Playwright test; no Phase 7 Admin UI playwright scenario. | +| F.7 — Playwright smoke (author calc tag, verify in equipment tree; author alarm, verify in `AlarmsAndConditions`) | **Not started** | No Phase 7 Playwright/E2E project exists in the repo today; future-work item without an assigned path. | Shipped as PR #185 (13 Admin service tests; UI completeness is partial — see gaps section). diff --git a/docs/v2/redundancy-interop-playbook.md b/docs/v2/redundancy-interop-playbook.md index 3fc6aab..a2a8ed8 100644 --- a/docs/v2/redundancy-interop-playbook.md +++ b/docs/v2/redundancy-interop-playbook.md @@ -55,6 +55,7 @@ Each row is one manual run; pass criterion in the right column. | A2 | ServiceLevel updates on peer down | Connect to Primary. Stop Backup (`sc stop OtOpcUa`). Watch `ServiceLevel`. | Transitions 200 → 150 within ~2 s of peer probe timeout | | A3 | RedundancySupport | Browse to `Server.ServerRedundancy.RedundancySupport`. | Value matches the declared `RedundancyMode` (Warm / Hot / None) | | A4 | ServerUriArray (non-transparent upgrade) | Requires a redundancy-object-type upgrade follow-up. | When upgrade lands: `ServerUriArray` reports both ApplicationUris, self first | +| A4b | Peer URI visibility via `Server.ServerArray` (i=2254) | Configure each `OpcUaApplicationHost` with the partner's `ApplicationUri` via `OpcUaApplicationHostOptions.PeerApplicationUris`. From any client, Read NodeId `i=2254` (`Server.ServerArray`). | Returned `String[]` includes both self + peer `ApplicationUri`s. Validated by `DualEndpointTests` in `tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests/` (loopback dual-host with real OPCFoundation client `Session` read). | | A5 | Mid-apply dip | On Primary trigger a `sp_PublishGeneration` apply. | `ServiceLevel` drops to 75 for the apply duration + dwell | ### Block B — Client failover @@ -101,7 +102,9 @@ flips A4 from "deferred" to "expected pass"). - **A4 pending**: `Server.ServerRedundancy` on our current SDK build lands as the base `ServerRedundancyState`, which has no `ServerUriArray` child. `ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips until the - redundancy-object-type upgrade follow-up lands. + redundancy-object-type upgrade follow-up lands. Cross-reference **A4b** — + peer URIs are visible today via `Server.ServerArray` (i=2254) populated by + `OpcUaApplicationHost.PopulateServerArray`. - **Recovery dwell default**: `RecoveryStateManager.DwellTime` defaults to 60 s in `Program.cs`. Adjust via future config knob if B3 takes too long to observe. @@ -121,8 +124,8 @@ flips A4 from "deferred" to "expected pass"). redundancy implementations we don't control. - For the sub-set of scenarios that *can* be automated — the self-loopback case where our own `otopcua-cli` drives Primary + Backup — the existing - `tests/Server/ZB.MOM.WW.OtOpcUa.Server.Tests/RedundancyStatePublisherTests` + - `ServiceLevelCalculatorTests` (unit) + `ClusterTopologyLoaderTests` - (integration) already cover the math + data path. The wire-level assertion - that the values actually land on the right OPC UA nodes is covered by - `ServerRedundancyNodeWriterTests`. + `tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests/RedundancyStateActorTests` + + `ServiceLevelCalculatorTests` (unit) already cover the math + data path. + The wire-level assertion that the peer URIs actually land on the + `Server.ServerArray` node (i=2254) is covered by `DualEndpointTests` in + `tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests/`. diff --git a/docs/v2/v2-release-readiness.md b/docs/v2/v2-release-readiness.md index 93ffa51..1a806cf 100644 --- a/docs/v2/v2-release-readiness.md +++ b/docs/v2/v2-release-readiness.md @@ -57,7 +57,7 @@ Remaining follow-ups (hardening): Remaining Phase 6.3 surfaces (hardening, not release-blocking): - ~~`PeerHttpProbeLoop` + `PeerUaProbeLoop` HostedServices populating `PeerReachabilityTracker` on each tick.~~ **Closed 2026-04-24.** Two-layer probe model shipped: HTTP probe at 2 s / 1 s timeout against `/healthz`; OPC UA probe at 10 s / 5 s timeout via `DiscoveryClient.GetEndpoints`, short-circuiting when HTTP reports the peer unhealthy. Registered on the Server as `AddHostedService` + `AddHostedService`. Publisher now sees accurate `PeerReachability` per peer instead of degrading to `Unknown` → Isolated-Primary band (230). -- OPC UA variable-node wiring: bind `ServiceLevel` Byte + `ServerUriArray` String[] to the publisher's events via `BaseDataVariable.OnReadValue` / direct value push. +- ~~OPC UA variable-node wiring: bind `ServiceLevel` Byte + `ServerUriArray` String[] to the publisher's events via `BaseDataVariable.OnReadValue` / direct value push.~~ **Closed 2026-05-26.** `ServiceLevel` byte binding closed earlier under Path D. Peer-URI half closed via `OpcUaApplicationHost.PopulateServerArray` — populates self + each `PeerApplicationUris` entry into the SDK `IServerInternal.ServerUris` `StringTable`; clients read `Server.ServerArray` (NodeId `i=2254`). Validated by `DualEndpointTests` in `tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests/`. `ServerUriArray` proper (the redundancy-object-type child) remains deferred pending object-type upgrade. - ~~`sp_PublishGeneration` wraps its apply in `await using var lease = coordinator.BeginApplyLease(...)` so the `PrimaryMidApply` band (200) fires during actual publishes (task #148 part 2).~~ **Closed 2026-04-24.** The apply loop now lives in `GenerationRefreshHostedService` — polls `sp_GetCurrentGenerationForCluster` every 5s, opens a lease when a new generation is detected, calls `RedundancyCoordinator.RefreshAsync` inside the `await using`, releases the lease on all exit paths. Replaces the previous "topology never refreshes without a process restart" behaviour. - Client interop matrix — Ignition / Kepware / Aveva OI Gateway (Stream F, task #150). Manual + doc-only. @@ -118,6 +118,7 @@ v2 GA requires all of the following: ## Change log +- **2026-05-26** — Gap-closeout pass. `OpcUaApplicationHost.PopulateServerArray` populates `Server.ServerArray` (NodeId `i=2254`) with self + `OpcUaApplicationHostOptions.PeerApplicationUris`, giving non-transparent peer URI visibility through the standard discovery surface. New `tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests/` IT project (`DualEndpointTests`) validates with two real `OpcUaApplicationHost` instances on loopback + a live OPCFoundation client `Session` read. CI `v2-ci.yml` `integration:` job converted to a matrix across `Host.IntegrationTests` + `OpcUaServer.IntegrationTests`. Per-role appsettings overlays shipped (`appsettings.admin.json` / `appsettings.driver.json` / `appsettings.admin-driver.json`) — `Program.cs:33-35` loads by alphabetical-joined role suffix. `FailoverScenarioTests` → `FailoverDuringDeployTests` rename. Stale empty `src/Server/{Server,Admin}` + `tests/Server/{Server.Tests,Admin.Tests,Admin.E2ETests}` directories deleted (no source, absent from `.slnx`). - **2026-04-24** — Phase 5 driver complement closed (task #120 CLOSED). AB CIP, AB Legacy, TwinCAT, FOCAS all shipped. FOCAS migration: retired the Tier-C split (`Driver.FOCAS.Host` + `Driver.FOCAS.Shared` + `FwlibNative` + shim DLL deleted) in favour of a pure-managed in-process `FocasWireClient` inlined into `Driver.FOCAS`; driver is now read-only against the CNC by design. Integration test matrix grew to cover Browse / Subscribe / IAlarmSource / Probe end-to-end. - **2026-04-23** — Phase 6.4 audit close-out. IdentificationFolderBuilder + OPC 40010 Identification folder verified against the shipped code. - **2026-04-20** — Phase 7 plan drafted (`phase-7-scripting-and-alarming.md`, `phase-7-e2e-smoke.md`). Out of scope for v2 GA.