Joseph Doherty a55153d7d5 Add configurable non-transparent OPC UA server redundancy

Separates ApplicationUri from namespace identity so each instance in a
redundant pair has a unique server URI while sharing the same Galaxy
namespace. Exposes RedundancySupport, ServerUriArray, and dynamic
ServiceLevel through the standard OPC UA server object. ServiceLevel
is computed from role (Primary/Secondary) and runtime health (MXAccess
and DB connectivity). Adds CLI redundancy command, second deployed
service instance, and 31 new tests including paired-server integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-28 13:32:17 -04:00

OPC UA Server Redundancy Plan

Summary

Add configurable non-transparent warm/hot redundancy to the LmxOpcUa server so that two instances sharing the same Galaxy repository can operate as a redundant pair. Each instance should advertise the redundant set through the standard OPC UA redundancy nodes, publish a dynamic ServiceLevel based on runtime health, and allow clients to discover and fail over between the instances. The CLI tool should gain a redundancy command for inspecting the redundant server set.

This review tightens the original draft in a few important ways:

It separates namespace identity from application identity. The current host uses urn:{GalaxyName}:LmxOpcUa as both the namespace URI and ApplicationUri; that must change for redundancy because each server in the pair needs a unique server URI.
It avoids hand-wavy "write the redundancy nodes directly" language and instead targets the OPC UA SDK's built-in ServerObjectState / ServerRedundancyState model.
It removes a few inaccurate hardcoded assumptions, including the ServerUriArray node id and the deployment port examples.
It fixes execution order so test-builder and helper changes happen before integration coverage depends on them.

This plan still covers server-side redundancy exposure, client-side discovery, a second deployed service instance, documentation, and tests. It does not implement automatic server-side failover or subscription transfer; those remain client responsibilities per the OPC UA specification.

Background: OPC UA Redundancy Model

OPC UA exposes redundancy through standard nodes under Server/ServerRedundancy plus the Server/ServiceLevel property:

Node	Type	Purpose
`RedundancySupport`	`RedundancySupport` enum	Declares the redundancy mode: `None`, `Cold`, `Warm`, `Hot`, `Transparent`, `HotAndMirrored`
`ServerUriArray`	`String[]`	Lists the `ApplicationUri` values of all servers in the redundant set for non-transparent redundancy
`ServiceLevel`	`Byte` (0-255)	Indicates current operational quality; clients prefer the server with the highest value

Non-Transparent Redundancy (our target)

In non-transparent redundancy (Warm or Hot), both servers run independently with their own sessions and subscriptions. Clients discover the redundant set by reading ServerUriArray, monitor ServiceLevel on each server, and manage their own failover. This fits the current architecture, where each instance independently connects to the same Galaxy repository and MXAccess runtime.

ServiceLevel semantics

Range	Meaning
0	Server is not operational
1-99	Degraded
100-199	Healthy secondary
200-255	Healthy primary

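The primary should advertise a higher ServiceLevel than the secondary so clients prefer it when both are healthy. The bands above are this plan's convention: OPC UA Part 4 itself defines 0 as Maintenance, 1 as NoData, 2-199 as Degraded, and 200-255 as Healthy, so a client that applies the specification ranges strictly may treat a healthy secondary below 200 as degraded.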
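To illustrate how a client might act on these values, here is a minimal sketch of the client-side choice. It is not part of the server plan: `FailoverSelection` and `PickPreferred` are hypothetical names, and the input is assumed to be a map of server URI to the last ServiceLevel the client read from each server.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical client-side helper; names are illustrative only.
public static class FailoverSelection
{
    // Picks the server URI with the highest ServiceLevel, ignoring servers
    // that report 0 (not operational). Returns null when none is usable.
    public static string PickPreferred(IReadOnlyDictionary<string, byte> serviceLevels)
    {
        return serviceLevels
            .Where(kv => kv.Value > 0)
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal) // deterministic tie-break
            .Select(kv => kv.Key)
            .FirstOrDefault();
    }
}
```

With a primary at 200 and a secondary at 150 this selects the primary; if the primary drops to 0, the secondary is selected instead.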

Current State

LmxOpcUaServer extends StandardServer but does not expose redundancy state
ServerRedundancy/RedundancySupport remains the SDK default (None)
Server/ServiceLevel remains the SDK default (255)
No configuration exists for redundancy mode, role, or redundant partner URIs
OpcUaServerHost currently sets ApplicationUri = urn:{GalaxyName}:LmxOpcUa
LmxNodeManager uses the same urn:{GalaxyName}:LmxOpcUa as the published namespace URI
A single deployed instance is documented in service_info.md
No CLI command exists for reading redundancy information

Key gap to fix first

For redundancy, each server in the set must advertise a unique ApplicationUri, and ServerUriArray must contain those unique values. The current implementation cannot do that because it reuses the namespace URI as the server ApplicationUri. Phase 1 therefore needs an application-identity change before the redundancy nodes can be correct.

Scope

In scope (Phase 1)

Add explicit application-identity configuration so each instance can have a unique ApplicationUri
Add redundancy configuration for mode, role, and server URI membership
Expose RedundancySupport, ServerUriArray, and dynamic ServiceLevel
Compute ServiceLevel from runtime health and preferred role
Add a CLI redundancy command
Document two-instance deployment
Add unit and integration coverage

Deferred

Automatic subscription transfer
Server-initiated failover
Transparent redundancy mode
Load-balancer-specific HTTP health endpoints
Mirrored data/session state

Configuration Design

1. Add explicit `OpcUa.ApplicationUri`

File: src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/OpcUaConfiguration.cs

Add:

public string? ApplicationUri { get; set; }

Rules:

ApplicationUri = null preserves the current behavior for non-redundant deployments
When Redundancy.Enabled = true, ApplicationUri must be explicitly set and unique per instance
LmxNodeManager should continue using urn:{GalaxyName}:LmxOpcUa as the namespace URI so both redundant servers expose the same namespace
Redundancy.ServerUris must contain the exact ApplicationUri values for all servers in the redundant set

Example:

{
  "OpcUa": {
    "ServerName": "LmxOpcUa",
    "GalaxyName": "ZB",
    "ApplicationUri": "urn:localhost:LmxOpcUa:instance1"
  }
}

2. New `Redundancy` section in `appsettings.json`

{
  "Redundancy": {
    "Enabled": false,
    "Mode": "Warm",
    "Role": "Primary",
    "ServerUris": [],
    "ServiceLevelBase": 200
  }
}

3. Configuration model

File: src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/RedundancyConfiguration.cs (new)

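using System.Collections.Generic;

// Binds to the "Redundancy" section of appsettings.json.
public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;                          // off unless explicitly enabled
    public string Mode { get; set; } = "Warm";                          // "Warm" or "Hot" in Phase 1
    public string Role { get; set; } = "Primary";                       // "Primary" or "Secondary"
    public List<string> ServerUris { get; set; } = new List<string>();  // ApplicationUri of every server in the set
    public int ServiceLevelBase { get; set; } = 200;                    // 1-255; baseline for a healthy primary
}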

4. Configuration rules

Enabled defaults to false
Mode supports Warm and Hot in Phase 1
Role supports Primary and Secondary
ServerUris must contain the local OpcUa.ApplicationUri when redundancy is enabled
ServerUris should contain at least two unique entries when redundancy is enabled
ServiceLevelBase should be in the range 1-255
Effective baseline:
- Primary: ServiceLevelBase
- Secondary: max(0, ServiceLevelBase - 50)
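
The effective-baseline rule can be sketched as a small helper. `RedundancyBaseline` and `EffectiveBaseline` are hypothetical names, not part of the plan. Treating any role other than `Secondary` as primary, and matching the role case-insensitively, are assumptions made for the sketch.

```csharp
using System;

// Hypothetical helper; names are illustrative, not from the plan.
public static class RedundancyBaseline
{
    // Primary keeps ServiceLevelBase; Secondary gets max(0, ServiceLevelBase - 50).
    public static int EffectiveBaseline(string role, int serviceLevelBase)
    {
        if (serviceLevelBase < 1 || serviceLevelBase > 255)
            throw new ArgumentOutOfRangeException(nameof(serviceLevelBase), "Expected 1-255.");

        // Assumption: anything that is not "Secondary" is treated as Primary.
        bool isSecondary = string.Equals(role, "Secondary", StringComparison.OrdinalIgnoreCase);
        return isSecondary ? Math.Max(0, serviceLevelBase - 50) : serviceLevelBase;
    }
}
```

With the default base of 200 this gives 200 for the primary and 150 for the secondary. A base of 30 gives 0 for the secondary because of the lower bound.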

App root updates

File: src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/AppConfiguration.cs

Add:

public RedundancyConfiguration Redundancy { get; set; } = new RedundancyConfiguration();

Implementation Steps

Step 1: Separate application identity from namespace identity

Files:

src/.../Configuration/OpcUaConfiguration.cs
src/.../OpcUa/OpcUaServerHost.cs
docs/OpcUaServer.md
tests/.../Configuration/ConfigurationLoadingTests.cs

Changes:

Add optional OpcUa.ApplicationUri
Keep urn:{GalaxyName}:LmxOpcUa as the namespace URI used by LmxNodeManager
Set ApplicationConfiguration.ApplicationUri from OpcUa.ApplicationUri when supplied
Keep ApplicationUri and namespace URI distinct in docs and tests

This step is required before redundancy can be correct.

Step 2: Add `RedundancyConfiguration` and bind it

Files:

src/.../Configuration/RedundancyConfiguration.cs (new)
src/.../Configuration/AppConfiguration.cs
src/.../OpcUaService.cs

Changes:

Create RedundancyConfiguration
Add Redundancy to AppConfiguration
Bind the section: configuration.GetSection("Redundancy").Bind(_config.Redundancy);
Pass _config.Redundancy through to OpcUaServerHost and LmxOpcUaServer

Step 3: Add `RedundancyModeResolver`

File: src/.../OpcUa/RedundancyModeResolver.cs (new)

Responsibilities:

map Mode to RedundancySupport
validate supported Phase 1 modes
fall back safely when disabled or invalid

public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled);
}
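
One possible body for that signature is sketched below. So that it compiles without the OPC UA SDK, it declares a local stand-in for the `RedundancySupport` enum; the real project would use the SDK type instead. Trimming whitespace and falling back to `None` for every unsupported mode are assumptions consistent with the responsibilities listed above.

```csharp
using System;

// Stand-in for the SDK's RedundancySupport enum so the sketch is self-contained.
public enum RedundancySupport { None, Cold, Warm, Hot, Transparent, HotAndMirrored }

public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled)
    {
        // Disabled or missing mode: advertise no redundancy.
        if (!enabled || string.IsNullOrWhiteSpace(mode))
            return RedundancySupport.None;

        switch (mode.Trim().ToLowerInvariant())
        {
            case "warm": return RedundancySupport.Warm;
            case "hot": return RedundancySupport.Hot;
            default: return RedundancySupport.None; // not supported in Phase 1
        }
    }
}
```

Falling back to `None` means a misconfigured instance behaves like a standalone server rather than advertising a mode it does not implement.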

Step 4: Add `ServiceLevelCalculator`

File: src/.../OpcUa/ServiceLevelCalculator.cs (new)

Purpose:

compute the current ServiceLevel from a baseline plus health inputs

Suggested signature:

public sealed class ServiceLevelCalculator
{
    public byte Calculate(int baseLevel, bool mxAccessConnected, bool dbConnected);
}

Suggested logic:

start with the role-adjusted baseline supplied by the caller
subtract 100 if MXAccess is disconnected
subtract 50 if the Galaxy DB is unreachable
return 0 if both are down
clamp to 0-255
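
The suggested logic can be sketched as follows, using the suggested signature. Checking the "both down" case before applying the penalties is one reading of the list above, not something the plan prescribes.

```csharp
using System;

public sealed class ServiceLevelCalculator
{
    // baseLevel is the role-adjusted baseline supplied by the caller.
    public byte Calculate(int baseLevel, bool mxAccessConnected, bool dbConnected)
    {
        // Both dependencies down: report the server as not operational.
        if (!mxAccessConnected && !dbConnected)
            return 0;

        int level = baseLevel;
        if (!mxAccessConnected) level -= 100; // MXAccess disconnected
        if (!dbConnected) level -= 50;        // Galaxy DB unreachable

        // Clamp to the Byte range used by ServiceLevel.
        return (byte)Math.Max(0, Math.Min(255, level));
    }
}
```

With a baseline of 200, an MXAccess outage yields 100 and a DB outage yields 150. A primary that has lost MXAccess (100) therefore ranks below a healthy secondary with the default baseline (150), which is the ordering clients need in order to fail over.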

Step 5: Extend `ConfigurationValidator`

File: src/.../Configuration/ConfigurationValidator.cs

Add validation/logging for:

OpcUa.ApplicationUri
Redundancy.Enabled, Mode, Role
ServerUris membership and uniqueness
ServiceLevelBase
local OpcUa.ApplicationUri must appear in Redundancy.ServerUris when enabled
warning when fewer than 2 unique server URIs are configured
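
The checks above can be sketched without the SDK. The shape of the existing ConfigurationValidator is not shown in this plan, so `RedundancyValidationSketch` is a hypothetical name, as is its flat parameter list and its return value of prefixed message strings. Skipping all checks when redundancy is disabled is also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, SDK-free sketch of the redundancy checks.
public static class RedundancyValidationSketch
{
    public static List<string> Validate(
        bool enabled, string mode, string role,
        string applicationUri, IEnumerable<string> serverUris, int serviceLevelBase)
    {
        var problems = new List<string>();
        if (!enabled) return problems; // assumption: nothing to check when disabled

        if (!IsOneOf(mode, "Warm", "Hot"))
            problems.Add("error: Redundancy.Mode must be Warm or Hot");
        if (!IsOneOf(role, "Primary", "Secondary"))
            problems.Add("error: Redundancy.Role must be Primary or Secondary");
        if (serviceLevelBase < 1 || serviceLevelBase > 255)
            problems.Add("error: Redundancy.ServiceLevelBase must be in 1-255");

        // Ordinal comparison: ServerUris must hold the exact ApplicationUri values.
        var unique = new HashSet<string>(
            (serverUris ?? Enumerable.Empty<string>()).Where(u => !string.IsNullOrWhiteSpace(u)),
            StringComparer.Ordinal);

        if (string.IsNullOrWhiteSpace(applicationUri))
            problems.Add("error: OpcUa.ApplicationUri must be set when redundancy is enabled");
        else if (!unique.Contains(applicationUri))
            problems.Add("error: Redundancy.ServerUris must contain the local OpcUa.ApplicationUri");

        if (unique.Count < 2)
            problems.Add("warning: fewer than 2 unique server URIs configured");

        return problems;
    }

    private static bool IsOneOf(string value, params string[] allowed) =>
        allowed.Any(a => string.Equals(a, value, StringComparison.OrdinalIgnoreCase));
}
```

Returning messages instead of throwing lets the caller log warnings and decide separately whether errors should block startup.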

Step 6: Expose redundancy through the standard OPC UA server object

File: src/.../OpcUa/LmxOpcUaServer.cs

Changes:

Accept RedundancyConfiguration and local ApplicationUri
On startup, locate the built-in ServerObjectState
Configure ServerObjectState.ServiceLevel
Configure the server redundancy object using the SDK's standard server-state types instead of writing guessed node ids directly
If the default ServerRedundancyState does not expose ServerUriArray, replace or upgrade it with the appropriate non-transparent redundancy state type from the SDK before populating values
Expose an internal method such as UpdateServiceLevel(bool mxConnected, bool dbConnected) for service-layer health updates

Important: the implementation should use SDK types/constants (ServerObjectState, ServerRedundancyState, NonTransparentRedundancyState, VariableIds.*) rather than hand-maintained numeric literals.

Step 7: Update `OpcUaServerHost`

File: src/.../OpcUa/OpcUaServerHost.cs

Changes:

Accept RedundancyConfiguration
Pass redundancy config and resolved local ApplicationUri into LmxOpcUaServer
Log redundancy mode/role/server URIs at startup

Step 8: Wire health updates in `OpcUaService`

File: src/.../OpcUaService.cs

Changes:

Bind and pass redundancy config
After startup, initialize the starting ServiceLevel
Subscribe to IMxAccessClient.ConnectionStateChanged
Update DB health whenever startup repository checks, change-detection work, or rebuild attempts succeed/fail
Prefer event-driven updates; add a lightweight periodic refresh only if necessary

Avoid introducing a second large standalone polling loop when existing connection and repository activity already gives most of the needed health signals.
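
One way to keep these updates event-driven is a small tracker that holds the two health flags and publishes a new value only when the computed level changes. `ServiceLevelTracker` and its members are hypothetical. The `compute` delegate stands in for the calculator plus the role-adjusted baseline, and `publish` stands in for whatever method pushes the value to the server, such as the `UpdateServiceLevel`-style method suggested in Step 6.

```csharp
using System;

// Hypothetical sketch: tracks two health flags and publishes a recomputed
// ServiceLevel only when the value actually changes.
public sealed class ServiceLevelTracker
{
    private readonly object _gate = new object();
    private readonly Func<bool, bool, byte> _compute;
    private readonly Action<byte> _publish;
    private bool _mxConnected;
    private bool _dbConnected;
    private byte? _last;

    public ServiceLevelTracker(Func<bool, bool, byte> compute, Action<byte> publish)
    {
        _compute = compute ?? throw new ArgumentNullException(nameof(compute));
        _publish = publish ?? throw new ArgumentNullException(nameof(publish));
    }

    // Call from the MXAccess connection-state handler.
    public void SetMxAccess(bool connected) => Update(() => _mxConnected = connected);

    // Call after repository checks, change detection, or rebuild attempts.
    public void SetDatabase(bool reachable) => Update(() => _dbConnected = reachable);

    private void Update(Action mutate)
    {
        lock (_gate)
        {
            mutate();
            byte level = _compute(_mxConnected, _dbConnected);
            if (_last == level) return; // suppress redundant writes
            _last = level;
            _publish(level);
        }
    }
}
```

Publishing inside the lock keeps the sketch simple and ordered. A real implementation might prefer to publish outside the lock if the publish call can block.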

Step 9: Update test builders and helpers before integration coverage

Files:

src/.../OpcUaServiceBuilder.cs
tests/.../Helpers/OpcUaServerFixture.cs
tests/.../Helpers/OpcUaTestClient.cs

Changes:

add WithRedundancy(...)
add WithApplicationUri(...) or allow full OpcUaConfiguration override
ensure two in-process redundancy tests can run with distinct ServerName, ApplicationUri, and certificate identity
when needed, use separate PKI roots in tests so paired fixtures do not collide on certificate state

Step 10: Update `appsettings.json`

File: src/.../appsettings.json

Add:

An OpcUa.ApplicationUri example, with the explanatory commentary kept in the docs
Redundancy section with Enabled = false defaults

Step 11: Add CLI `redundancy` command

Files:

tools/opcuacli-dotnet/Commands/RedundancyCommand.cs (new)
tools/opcuacli-dotnet/README.md
docs/CliTool.md

Command: redundancy

Read:

VariableIds.Server_ServerRedundancy_RedundancySupport
VariableIds.Server_ServiceLevel
VariableIds.Server_ServerRedundancy_ServerUriArray

Output example:

Redundancy Mode:  Warm
Service Level:    200
Server URIs:
  - urn:localhost:LmxOpcUa:instance1
  - urn:localhost:LmxOpcUa:instance2

Use SDK constants instead of hardcoded numeric ids in the command implementation.

Step 12: Deploy the second service instance

Deployment target: C:\publish\lmxopcua\instance2

Suggested configuration differences:

Setting	instance1	instance2
`OpcUa.Port`	`4840`	`4841`
`Dashboard.Port`	`8081`	`8082`
`OpcUa.ServerName`	`LmxOpcUa`	`LmxOpcUa2`
`OpcUa.ApplicationUri`	`urn:localhost:LmxOpcUa:instance1`	`urn:localhost:LmxOpcUa:instance2`
`Redundancy.Enabled`	`true`	`true`
`Redundancy.Role`	`Primary`	`Secondary`
`Redundancy.Mode`	`Warm`	`Warm`
`Redundancy.ServerUris`	same two-entry set	same two-entry set

Deployment notes:

both instances should share the same GalaxyName and namespace URI
each instance must have a distinct application certificate identity
if certificate handling is sensitive, give each instance an explicit Security.CertificateSubject or separate PKI root

Update service_info.md with the second instance details after deployment is real, not speculative.

Test Plan

Unit tests: `RedundancyModeResolver`

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyModeResolverTests.cs

Test	Description
`Resolve_Disabled_ReturnsNone`	`Enabled=false` returns `None`
`Resolve_Warm_ReturnsWarm`	`Mode="Warm"` maps correctly
`Resolve_Hot_ReturnsHot`	`Mode="Hot"` maps correctly
`Resolve_Unknown_FallsBackToNone`	Unknown mode falls back safely
`Resolve_CaseInsensitive`	Case-insensitive parsing works

Unit tests: `ServiceLevelCalculator`

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/ServiceLevelCalculatorTests.cs

Test	Description
`FullyHealthy_Primary_ReturnsBase`	Healthy primary baseline is preserved
`FullyHealthy_Secondary_ReturnsBaseMinusFifty`	Healthy secondary baseline is lower
`MxAccessDown_ReducesServiceLevel`	MXAccess failure reduces score
`DbDown_ReducesServiceLevel`	DB failure reduces score
`BothDown_ReturnsZero`	Both unavailable returns 0
`ClampedTo255`	Upper clamp works
`ClampedToZero`	Lower clamp works

Unit tests: `RedundancyConfiguration`

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyConfigurationTests.cs

Test	Description
`DefaultConfig_Disabled`	`Enabled` defaults to `false`
`DefaultConfig_ModeWarm`	`Mode` defaults to `Warm`
`DefaultConfig_RolePrimary`	`Role` defaults to `Primary`
`DefaultConfig_EmptyServerUris`	`ServerUris` defaults to empty
`DefaultConfig_ServiceLevelBase200`	`ServiceLevelBase` defaults to `200`

Updates to existing configuration tests

File: tests/ZB.MOM.WW.LmxOpcUa.Tests/Configuration/ConfigurationLoadingTests.cs

Add coverage for:

OpcUa.ApplicationUri
Redundancy section binding
redundancy validation when ApplicationUri is missing
redundancy validation when local ApplicationUri is absent from ServerUris
invalid ServiceLevelBase

Integration tests

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Integration/RedundancyTests.cs

Cover:

redundancy disabled reports None
warm redundancy reports configured mode
ServerUriArray matches configuration
primary reports higher ServiceLevel than secondary
both servers expose the same namespace URI but different ApplicationUri values
service level drops when MXAccess disconnects

Pattern:

use two fixture instances
give each fixture a distinct ServerName, ApplicationUri, and port
if secure transport is enabled in those tests, isolate PKI roots to avoid certificate cross-talk

Documentation Plan

New file

docs/Redundancy.md

Contents:

overview of OPC UA non-transparent redundancy
difference between namespace URI and server ApplicationUri
redundancy configuration reference
service-level computation
two-instance deployment guide
CLI redundancy command usage
troubleshooting

Updates to existing docs

File	Changes
`docs/Configuration.md`	Add `OpcUa.ApplicationUri` and `Redundancy` sections
`docs/OpcUaServer.md`	Correct the current `ApplicationUri == namespace` description and add redundancy behavior
`docs/CliTool.md`	Add `redundancy` command
`docs/ServiceHosting.md`	Add multi-instance deployment notes
`README.md`	Mention redundancy support and link docs
`CLAUDE.md`	Add redundancy architecture note

Update after real deployment

service_info.md

Only update this once the second instance is actually deployed and verified.

File Change Summary

File	Action	Description
`src/.../Configuration/OpcUaConfiguration.cs`	Modify	Add explicit `ApplicationUri`
`src/.../Configuration/RedundancyConfiguration.cs`	New	Redundancy config model
`src/.../Configuration/AppConfiguration.cs`	Modify	Add `Redundancy` section
`src/.../Configuration/ConfigurationValidator.cs`	Modify	Validate/log redundancy and application identity
`src/.../OpcUa/RedundancyModeResolver.cs`	New	Map config mode to `RedundancySupport`
`src/.../OpcUa/ServiceLevelCalculator.cs`	New	Compute `ServiceLevel` from health inputs
`src/.../OpcUa/LmxOpcUaServer.cs`	Modify	Expose redundancy state via SDK server object
`src/.../OpcUa/OpcUaServerHost.cs`	Modify	Pass local application identity and redundancy config
`src/.../OpcUaService.cs`	Modify	Bind config and wire health updates
`src/.../OpcUaServiceBuilder.cs`	Modify	Support redundancy/application identity injection
`src/.../appsettings.json`	Modify	Add redundancy settings
`tools/opcuacli-dotnet/Commands/RedundancyCommand.cs`	New	Read redundancy state from a server
`tests/.../Redundancy/*.cs`	New	Unit tests for redundancy config and calculators
`tests/.../Configuration/ConfigurationLoadingTests.cs`	Modify	Bind/validate new settings
`tests/.../Integration/RedundancyTests.cs`	New	Paired-server integration tests
`tests/.../Helpers/OpcUaServerFixture.cs`	Modify	Support paired redundancy fixtures
`tests/.../Helpers/OpcUaTestClient.cs`	Modify	Read redundancy nodes in integration tests
`docs/Redundancy.md`	New	Dedicated redundancy guide
`docs/Configuration.md`	Modify	Document new config
`docs/OpcUaServer.md`	Modify	Correct application identity and add redundancy details
`docs/CliTool.md`	Modify	Document `redundancy` command
`docs/ServiceHosting.md`	Modify	Multi-instance deployment notes
`README.md`	Modify	Link redundancy docs
`CLAUDE.md`	Modify	Architecture note
`service_info.md`	Modify later	Document real second-instance deployment

Verification Guardrails

Gate 1: Build

dotnet build ZB.MOM.WW.LmxOpcUa.slnx

Gate 2: Unit tests

dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests

Gate 3: Redundancy integration tests

dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~Redundancy"

Gate 4: CLI build

cd tools/opcuacli-dotnet
dotnet build

Gate 5: Manual single-instance check

opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa

Expected:

RedundancySupport=None
ServiceLevel=255

Gate 6: Manual paired-instance check

opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa

Expected:

both report the same ServerUriArray
each reports its own unique local ApplicationUri
primary reports a higher ServiceLevel

Gate 7: Full test suite

dotnet test ZB.MOM.WW.LmxOpcUa.slnx

Risks and Considerations

Application identity is the main correctness risk. Without unique ApplicationUri values, the redundant set is invalid even if ServerUriArray is populated.
SDK wiring may require replacing the default redundancy state node. The base ServerRedundancyState does not expose ServerUriArray; the implementation may need the non-transparent subtype from the SDK.
Two in-process servers can collide on certificates. Tests and deployment need distinct application identities and, when necessary, isolated PKI roots.
Both instances hit the same MXAccess runtime and Galaxy DB. Verify client-registration and polling behavior under paired load.
ServiceLevel should remain meaningful, not noisy. Prefer deterministic role + health inputs over frequent arbitrary adjustments.
service_info.md is deployment documentation, not design. Do not prefill it with speculative values before the second instance actually exists.

Execution Order

Step 1: add OpcUa.ApplicationUri and separate it from namespace identity
Steps 2-5: config model, resolver, calculator, validator
Gate 1 + Gate 2
Step 9: update builders/helpers so tests can express paired servers cleanly
Steps 6-8: server exposure and service-layer health wiring
Gate 1 + Gate 2 + Gate 3
Step 10: update appsettings.json
Step 11: add CLI redundancy command
Gate 4 + Gate 5
Step 12: deploy and verify the second instance (Gate 6)
Update service_info.md with real deployment details
Documentation updates
Gate 7
