Five operational docs rewritten for v2 (multi-process, multi-driver, Config-DB authoritative):

- **docs/Configuration.md** — replaced the appsettings-only story with the two-layer model. `appsettings.json` is bootstrap only (Node identity, Config DB connection string, transport security, LDAP bind, logging). Authoritative config (clusters, namespaces, UNS, equipment, tags, driver instances, ACLs, role grants, poll groups) lives in the Config DB, accessed via `OtOpcUaConfigDbContext` and edited through the Admin UI draft/publish workflow. Added a v1-to-v2 migration index so operators can locate where each old section moved. Cross-links to docs/v2/config-db-schema.md and docs/v2/admin-ui.md.
- **docs/Redundancy.md** — Phase 6.3 rewrite. Names every class under `src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`: `RedundancyCoordinator`, `RedundancyTopology`, `ApplyLeaseRegistry` (publish fencing), `PeerReachabilityTracker`, `RecoveryStateManager`, `ServiceLevelCalculator` (pure function), `RedundancyStatePublisher`. Documents the full 11-band ServiceLevel matrix (Maintenance=0 through AuthoritativePrimary=255) from `ServiceLevelCalculator.cs` and the per-`ClusterNode` fields (`RedundancyRole`, `ServiceLevelBase`, `ApplicationUri`). Covers metrics (the `otopcua.redundancy.role_transition` counter plus primary/secondary/stale_count gauges on meter `ZB.MOM.WW.OtOpcUa.Redundancy`) and the SignalR `RoleChanged` push from `FleetStatusPoller` to `RedundancyTab.razor`.
- **docs/security.md** — preserved the transport-security section (still accurate) and added Phase 6.2 authorization. Four concerns now documented in one place: (1) transport security profiles; (2) OPC UA auth via `LdapUserAuthenticator` (note: the task spec called this `LdapAuthenticationProvider` — the actual class name is `LdapUserAuthenticator` in `Server/Security/`); (3) data-plane authorization via `NodeAcl` + `PermissionTrie` + `AuthorizationGate` — additive-only model per decision #129, ClusterId → Namespace → UnsArea → UnsLine → Equipment → Tag hierarchy, the `NodePermissions` bundle, and `PermissionProbeService` in Admin for "probe this permission"; (4) control-plane authorization via `LdapGroupRoleMapping` + `AdminRole` (ConfigViewer / ConfigEditor / FleetAdmin, CanEdit / CanPublish policies) — deliberately independent of data-plane ACLs per decision #150. Also documents the OTOPCUA0001 Roslyn analyzer (`UnwrappedCapabilityCallAnalyzer`) as the compile-time guard ensuring every driver-capability async call is wrapped by `CapabilityInvoker`.
- **docs/ServiceHosting.md** — three-process rewrite: OtOpcUa Server (net10 x64, `BackgroundService` + `AddWindowsService`, hosts the OPC UA endpoint and all non-Galaxy drivers), OtOpcUa Admin (net10 x64, Blazor Server + SignalR + `/metrics` via the OpenTelemetry Prometheus exporter), and OtOpcUa Galaxy.Host (.NET Framework 4.8 x86, NSSM-wrapped, environment-variable driven, STA thread + MXAccess COM). The pipe-ACL denies-Admins detail and the non-elevated-shell requirement are captured from feedback memory. Divergence from CLAUDE.md: the task spec said "TopShelf is still the service-installer wrapper per CLAUDE.md note", but no csproj in the repo references TopShelf — decision #30 replaced it with the generic host's `AddWindowsService` wrapper (per the doc comment on `OpcUaServerService`). The doc reflects the actual state and flags this divergence so someone can update CLAUDE.md separately.
- **docs/StatusDashboard.md** — replaced the full v1 reference (dashboard endpoints, health-check rules, the `StatusData` DTO, etc.) with a short "superseded by Admin UI" pointer that preserves git-blame continuity and avoids broken links from other docs that reference it.

Class references verified by reading:

- src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/{RedundancyCoordinator, ServiceLevelCalculator, ApplyLeaseRegistry, RedundancyStatePublisher}.cs
- src/ZB.MOM.WW.OtOpcUa.Core/Authorization/{PermissionTrie, PermissionTrieBuilder, PermissionTrieCache, TriePermissionEvaluator, AuthorizationGate}.cs
- src/ZB.MOM.WW.OtOpcUa.Server/Security/{AuthorizationGate, LdapUserAuthenticator}.cs
- src/ZB.MOM.WW.OtOpcUa.Admin/{Program.cs, Services/AdminRoles.cs, Services/RedundancyMetrics.cs, Hubs/FleetStatusPoller.cs}
- src/ZB.MOM.WW.OtOpcUa.Server/Program.cs + appsettings.json
- src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/{Program.cs, Ipc/PipeServer.cs}
- src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/{ClusterNode, NodeAcl, LdapGroupRoleMapping}.cs
- src/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Redundancy

## Overview
OtOpcUa supports OPC UA non-transparent warm/hot redundancy. Two (or more) OtOpcUa Server processes run side-by-side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct ApplicationUri; OPC UA clients see both endpoints via the standard ServerUriArray and pick one based on the ServiceLevel that each server publishes.
The redundancy surface lives in src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/:
| Class | Role |
|---|---|
| `RedundancyCoordinator` | Process-singleton; owns the current `RedundancyTopology` loaded from the `ClusterNode` table. `RefreshAsync` re-reads after `sp_PublishGeneration` so operator role swaps take effect without a process restart. A CAS-style swap (`Interlocked.Exchange`) means readers always see a coherent snapshot. |
| `RedundancyTopology` | Immutable `(ClusterId, Self, Peers, ServerUriArray, ValidityFlags)` snapshot. |
| `ApplyLeaseRegistry` | Tracks in-progress `sp_PublishGeneration` apply leases keyed on `(ConfigGenerationId, PublishRequestId)`. An `await using` disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than `ApplyMaxDuration` (default 10 minutes) so a crashed publisher can't pin the node at `PrimaryMidApply`. |
| `PeerReachabilityTracker` | Maintains last-known reachability for each peer node over two independent probes — OPC UA ping and HTTP `/healthz`. Both must succeed for `peerReachable = true`. |
| `RecoveryStateManager` | Gates transitions out of the `Recovering*` bands; requires dwell + publish-witness satisfaction before allowing a return to nominal. |
| `ServiceLevelCalculator` | Pure function `(role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte`. |
| `RedundancyStatePublisher` | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA `ServiceLevel` variable via an edge-triggered `OnStateChanged` event, and fires `OnServerUriArrayChanged` when the topology's `ServerUriArray` shifts. |
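The coordinator's snapshot-swap described above can be sketched in a few lines. This is a minimal Python illustration of the pattern, not the actual C# implementation (which uses `Interlocked.Exchange`); the class and method shapes here are assumptions:

```python
import threading


class Coordinator:
    """Illustrative snapshot-swap: readers always see one coherent topology."""

    def __init__(self, initial_topology):
        # The topology object is treated as immutable; refresh replaces the
        # reference, never mutates the object in place.
        self._topology = initial_topology
        self._writer_lock = threading.Lock()  # serializes writers only

    @property
    def topology(self):
        # A single reference read; readers never take a lock.
        return self._topology

    def refresh(self, load_topology):
        # Analogue of RefreshAsync: re-read after a publish and swap the
        # reference in one step, so no reader sees a half-built topology.
        with self._writer_lock:
            self._topology = load_topology()
```

Because each snapshot is immutable, a reader that captured the old reference keeps a consistent (if slightly stale) view until its next read.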
## Data model
Per-node redundancy state lives in the Config DB ClusterNode table (src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs):
| Column | Role |
|---|---|
| `NodeId` | Unique node identity; matches `Node:NodeId` in the server's bootstrap `appsettings.json`. |
| `ClusterId` | Foreign key into `ServerCluster`. |
| `RedundancyRole` | `Primary`, `Secondary`, or `Standalone` (`RedundancyRole` enum in `Configuration/Enums`). |
| `ServiceLevelBase` | Per-node base value used to bias the nominal ServiceLevel output. |
| `ApplicationUri` | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. |
ServerUriArray is derived from the set of peer ApplicationUri values at topology-load time and republished when the topology changes.
## ServiceLevel matrix
ServiceLevelCalculator produces one of the following bands (see ServiceLevelBand enum in the same file):
| Band | Byte | Meaning |
|---|---|---|
| `Maintenance` | 0 | Operator-declared maintenance. |
| `NoData` | 1 | Self-reported unhealthy (`/healthz` fails). |
| `InvalidTopology` | 2 | More than one Primary detected; both nodes self-demote. |
| `RecoveringBackup` | 30 | Backup post-fault, dwell not met. |
| `BackupMidApply` | 50 | Backup inside a publish-apply window. |
| `IsolatedBackup` | 80 | Primary unreachable; the Backup says "take over if asked" — it does not auto-promote (non-transparent model). |
| `AuthoritativeBackup` | 100 | Backup nominal. |
| `RecoveringPrimary` | 180 | Primary post-fault, dwell not met. |
| `PrimaryMidApply` | 200 | Primary inside a publish-apply window. |
| `IsolatedPrimary` | 230 | Primary with an unreachable peer; retains authority. |
| `AuthoritativePrimary` | 255 | Primary nominal. |
The reserved bands (0 `Maintenance`, 1 `NoData`, 2 `InvalidTopology`) take precedence over operational states per OPC UA Part 5 §6.3.34. Operational bands occupy 30..255, so spec-compliant clients that treat "< 3 = unhealthy" keep working.
Standalone nodes (single-instance deployments) report AuthoritativePrimary when healthy and PrimaryMidApply during publish.
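The band selection can be summarized as a pure function over the calculator's inputs. The following Python sketch mirrors the matrix above; the exact precedence among the apply, recovery, and isolation conditions is an illustrative assumption, not a transcription of `ServiceLevelCalculator.cs`:

```python
from enum import IntEnum


class Band(IntEnum):
    # Reserved bands (take precedence over operational states)
    MAINTENANCE = 0
    NO_DATA = 1
    INVALID_TOPOLOGY = 2
    # Backup bands
    RECOVERING_BACKUP = 30
    BACKUP_MID_APPLY = 50
    ISOLATED_BACKUP = 80
    AUTHORITATIVE_BACKUP = 100
    # Primary bands
    RECOVERING_PRIMARY = 180
    PRIMARY_MID_APPLY = 200
    ISOLATED_PRIMARY = 230
    AUTHORITATIVE_PRIMARY = 255


def service_level(role, self_healthy, peer_ua, peer_http,
                  apply_in_progress, recovery_dwell_met,
                  topology_valid, operator_maintenance) -> int:
    """Pure band selection; precedence ordering here is assumed."""
    if operator_maintenance:
        return Band.MAINTENANCE
    if not self_healthy:
        return Band.NO_DATA
    if not topology_valid:
        return Band.INVALID_TOPOLOGY
    peer_reachable = peer_ua and peer_http      # both probes must succeed
    primary = role in ("Primary", "Standalone")  # Standalone reports as Primary
    if apply_in_progress:
        return Band.PRIMARY_MID_APPLY if primary else Band.BACKUP_MID_APPLY
    if not recovery_dwell_met:
        return Band.RECOVERING_PRIMARY if primary else Band.RECOVERING_BACKUP
    if role != "Standalone" and not peer_reachable:
        return Band.ISOLATED_PRIMARY if primary else Band.ISOLATED_BACKUP
    return Band.AUTHORITATIVE_PRIMARY if primary else Band.AUTHORITATIVE_BACKUP
```

A nominal Primary returns 255; a Secondary that cannot reach its peer on either probe drops to `IsolatedBackup` (80).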
## Publish fencing and split-brain prevention
Any Admin-triggered sp_PublishGeneration acquires an apply lease through ApplyLeaseRegistry.BeginApplyLease. While the lease is held:
- The calculator reports `PrimaryMidApply`/`BackupMidApply` — clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation.
- `RedundancyCoordinator.RefreshAsync` is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically.
- The watchdog force-closes any lease older than `ApplyMaxDuration`; a stuck publisher therefore cannot strand a node at `PrimaryMidApply`.
Because role transitions are operator-driven (write RedundancyRole in the Config DB + publish), the Backup never auto-promotes. An IsolatedBackup at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154).
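The lease lifecycle can be sketched with a context manager standing in for the C# `await using` scope. This is a hedged Python illustration; the registry's internal bookkeeping and the watchdog API are assumptions, not the actual `ApplyLeaseRegistry` surface:

```python
import time
from contextlib import contextmanager


class LeaseRegistry:
    """Illustrative lease bookkeeping with a stale-lease sweep."""

    def __init__(self, apply_max_duration_s=600):  # default 10 minutes
        # (generation_id, publish_request_id) -> time the lease was opened
        self._leases = {}
        self._max = apply_max_duration_s

    @property
    def apply_in_progress(self):
        return bool(self._leases)

    @contextmanager
    def begin_apply_lease(self, generation_id, publish_request_id):
        key = (generation_id, publish_request_id)
        self._leases[key] = time.monotonic()
        try:
            yield key                    # the publisher applies the generation here
        finally:
            # Every exit path (success / exception / cancellation) releases
            # the lease, mirroring the disposable scope in the real class.
            self._leases.pop(key, None)

    def sweep_stale(self, now=None):
        """Watchdog: force-close leases older than the apply-max duration."""
        now = time.monotonic() if now is None else now
        stale = [k for k, t in self._leases.items() if now - t > self._max]
        for key in stale:
            del self._leases[key]
        return stale
```

While `apply_in_progress` is true the calculator reports the mid-apply bands; the sweep guarantees a crashed publisher cannot hold that state indefinitely.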
## Metrics
RedundancyMetrics in src/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs registers the ZB.MOM.WW.OtOpcUa.Redundancy meter on the Admin process. Instruments:
| Name | Kind | Tags | Description |
|---|---|---|---|
| `otopcua.redundancy.role_transition` | Counter | `cluster.id`, `node.id`, `from_role`, `to_role` | Incremented every time `FleetStatusPoller` observes a `RedundancyRole` change on a `ClusterNode` row. |
| `otopcua.redundancy.primary_count` | ObservableGauge | `cluster.id` | Primary-role nodes per cluster — should be exactly 1 in the nominal state. |
| `otopcua.redundancy.secondary_count` | ObservableGauge | `cluster.id` | Secondary-role nodes per cluster. |
| `otopcua.redundancy.stale_count` | ObservableGauge | `cluster.id` | Nodes whose `LastSeenAt` exceeded the stale threshold. |
Admin Program.cs wires OpenTelemetry to the Prometheus exporter when Metrics:Prometheus:Enabled=true (default), exposing the meter under /metrics. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed.
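The gauge values reduce to simple per-cluster counts over the node rows. A hedged Python sketch of that aggregation (the tuple shape is assumed, not the actual `ClusterNode` entity):

```python
from collections import Counter


def cluster_counts(rows):
    """Derive per-cluster gauge values from (cluster_id, role, is_stale) rows.

    The row shape is hypothetical; the real poller reads ClusterNode entities.
    """
    primary, secondary, stale = Counter(), Counter(), Counter()
    for cluster_id, role, is_stale in rows:
        if role == "Primary":
            primary[cluster_id] += 1
        elif role == "Secondary":
            secondary[cluster_id] += 1
        if is_stale:  # e.g. LastSeenAt older than the stale threshold
            stale[cluster_id] += 1
    return primary, secondary, stale
```

Alerting on `primary_count != 1` for any cluster is the natural Prometheus rule implied by the table above.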
## Real-time notifications (Admin UI)
FleetStatusPoller in src/ZB.MOM.WW.OtOpcUa.Admin/Hubs/ polls the ClusterNode table, records role transitions, updates RedundancyMetrics.SetClusterCounts, and pushes a RoleChanged SignalR event onto FleetStatusHub when a transition is observed. RedundancyTab.razor subscribes with _hub.On<RoleChangedMessage>("RoleChanged", …) so connected Admin sessions see role swaps the moment they happen.
## Configuring a redundant pair
Redundancy is configured in the Config DB, not appsettings.json. The fields that must differ between the two instances:
| Field | Location | Instance 1 | Instance 2 |
|---|---|---|---|
| `NodeId` | `appsettings.json` `Node:NodeId` (bootstrap) | `node-a` | `node-b` |
| `ClusterNode.ApplicationUri` | Config DB | `urn:node-a:OtOpcUa` | `urn:node-b:OtOpcUa` |
| `ClusterNode.RedundancyRole` | Config DB | `Primary` | `Secondary` |
| `ClusterNode.ServiceLevelBase` | Config DB | typically 255 | typically 100 |
Shared between instances: ClusterId, Config DB connection string, published generation, cluster-level ACLs, UNS hierarchy, driver instances.
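For concreteness, a hypothetical bootstrap fragment for node-a. Only `Node:NodeId` is documented above; the surrounding key names (connection-string key, etc.) are illustrative assumptions:

```json
{
  "Node": { "NodeId": "node-a" },
  "ConnectionStrings": { "ConfigDb": "<shared Config DB connection string>" }
}
```

node-b's file would be identical except for `"NodeId": "node-b"`; everything that differs per role (`RedundancyRole`, `ApplicationUri`, `ServiceLevelBase`) lives in the `ClusterNode` rows and changes only via a published generation.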
Role swaps, stand-alone promotions, and base-level adjustments all happen through the Admin UI RedundancyTab — the operator edits the ClusterNode row in a draft generation and publishes. RedundancyCoordinator.RefreshAsync picks up the new topology without a process restart.
## Client-side failover
The OtOpcUa Client CLI at `src/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See Client.CLI.md for the command reference.
## Depth reference
For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift — see docs/v2/plan.md §Phase 6.3.