lmxopcua/redundancy.md
Joseph Doherty a55153d7d5 Add configurable non-transparent OPC UA server redundancy
Separates ApplicationUri from namespace identity so each instance in a
redundant pair has a unique server URI while sharing the same Galaxy
namespace. Exposes RedundancySupport, ServerUriArray, and dynamic
ServiceLevel through the standard OPC UA server object. ServiceLevel
is computed from role (Primary/Secondary) and runtime health (MXAccess
and DB connectivity). Adds CLI redundancy command, second deployed
service instance, and 31 new tests including paired-server integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 13:32:17 -04:00

OPC UA Server Redundancy Plan

Summary

Add configurable non-transparent warm/hot redundancy to the LmxOpcUa server so that two instances sharing the same Galaxy repository can operate as a redundant pair. Each instance should advertise the redundant set through the standard OPC UA redundancy nodes, publish a dynamic ServiceLevel based on runtime health, and allow clients to discover and fail over between the instances. The CLI tool should gain a redundancy command for inspecting the redundant server set.

This review tightens the original draft in a few important ways:

  • It separates namespace identity from application identity. The current host uses urn:{GalaxyName}:LmxOpcUa as both the namespace URI and ApplicationUri; that must change for redundancy because each server in the pair needs a unique server URI.
  • It avoids hand-wavy "write the redundancy nodes directly" language and instead targets the OPC UA SDK's built-in ServerObjectState / ServerRedundancyState model.
  • It removes a few inaccurate hardcoded assumptions, including the ServerUriArray node id and the deployment port examples.
  • It fixes execution order so test-builder and helper changes happen before integration coverage depends on them.

This plan still covers server-side redundancy exposure, client-side discovery, a second deployed service instance, documentation, and tests. It does not implement automatic server-side failover or subscription transfer; those remain client responsibilities per the OPC UA specification.


Background: OPC UA Redundancy Model

OPC UA exposes redundancy through standard nodes under Server/ServerRedundancy plus the Server/ServiceLevel property:

| Node | Type | Purpose |
| --- | --- | --- |
| RedundancySupport | RedundancySupport enum | Declares the redundancy mode: None, Cold, Warm, Hot, Transparent, HotAndMirrored |
| ServerUriArray | String[] | Lists the ApplicationUri values of all servers in the redundant set for non-transparent redundancy |
| ServiceLevel | Byte (0-255) | Indicates current operational quality; clients prefer the server with the highest value |

Non-Transparent Redundancy (our target)

In non-transparent redundancy (Warm or Hot), both servers run independently with their own sessions and subscriptions. Clients discover the redundant set by reading ServerUriArray, monitor ServiceLevel on each server, and manage their own failover. This fits the current architecture, where each instance independently connects to the same Galaxy repository and MXAccess runtime.

ServiceLevel semantics

| Range | Meaning |
| --- | --- |
| 0 | Server is not operational |
| 1-99 | Degraded |
| 100-199 | Healthy secondary |
| 200-255 | Healthy primary |

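The primary should advertise a higher ServiceLevel than the secondary so clients prefer it when both are healthy. Note that the bands above are this plan's working convention: OPC UA Part 4 defines the ServiceLevel sub-ranges as 0 (Maintenance), 1 (NoData), 2-199 (Degraded), and 200-255 (Healthy), so a secondary advertising a value below 200 may be read as degraded by clients that apply the specification's ranges strictly.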
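The client-side preference described above can be sketched as a small selection routine. This is only an illustration: ServerStatusSample and FailoverSelector are hypothetical names, and a real client would fill the samples by reading Server/ServiceLevel from each server listed in ServerUriArray.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical snapshot of one server in the redundant set, as seen by a client.
public sealed class ServerStatusSample
{
    public ServerStatusSample(string applicationUri, byte serviceLevel)
    {
        ApplicationUri = applicationUri;
        ServiceLevel = serviceLevel;
    }

    public string ApplicationUri { get; }
    public byte ServiceLevel { get; }
}

public static class FailoverSelector
{
    // Returns the ApplicationUri of the server with the highest non-zero
    // ServiceLevel, or null when no server reports itself as operational.
    // Ties keep the input order because OrderByDescending is a stable sort.
    public static string SelectPreferred(IEnumerable<ServerStatusSample> samples)
    {
        return samples
            .Where(s => s.ServiceLevel > 0)
            .OrderByDescending(s => s.ServiceLevel)
            .Select(s => s.ApplicationUri)
            .FirstOrDefault();
    }
}
```

A client would typically re-run a selection like this whenever a monitored ServiceLevel changes and reconnect only when the preferred server differs from the current one.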


Current State

  • LmxOpcUaServer extends StandardServer but does not expose redundancy state
  • ServerRedundancy/RedundancySupport remains the SDK default (None)
  • Server/ServiceLevel remains the SDK default (255)
  • No configuration exists for redundancy mode, role, or redundant partner URIs
  • OpcUaServerHost currently sets ApplicationUri = urn:{GalaxyName}:LmxOpcUa
  • LmxNodeManager uses the same urn:{GalaxyName}:LmxOpcUa as the published namespace URI
  • A single deployed instance is documented in service_info.md
  • No CLI command exists for reading redundancy information

Key gap to fix first

For redundancy, each server in the set must advertise a unique ApplicationUri, and ServerUriArray must contain those unique values. The current implementation cannot do that because it reuses the namespace URI as the server ApplicationUri. Phase 1 therefore needs an application-identity change before the redundancy nodes can be correct.


Scope

In scope (Phase 1)

  1. Add explicit application-identity configuration so each instance can have a unique ApplicationUri
  2. Add redundancy configuration for mode, role, and server URI membership
  3. Expose RedundancySupport, ServerUriArray, and dynamic ServiceLevel
  4. Compute ServiceLevel from runtime health and preferred role
  5. Add a CLI redundancy command
  6. Document two-instance deployment
  7. Add unit and integration coverage

Deferred

  • Automatic subscription transfer
  • Server-initiated failover
  • Transparent redundancy mode
  • Load-balancer-specific HTTP health endpoints
  • Mirrored data/session state

Configuration Design

1. Add explicit OpcUa.ApplicationUri

File: src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/OpcUaConfiguration.cs

Add:

public string? ApplicationUri { get; set; }

Rules:

  • ApplicationUri = null preserves the current behavior for non-redundant deployments
  • when Redundancy.Enabled = true, ApplicationUri must be explicitly set and unique per instance
  • LmxNodeManager should continue using urn:{GalaxyName}:LmxOpcUa as the namespace URI so both redundant servers expose the same namespace
  • Redundancy.ServerUris must contain the exact ApplicationUri values for all servers in the redundant set

Example:

{
  "OpcUa": {
    "ServerName": "LmxOpcUa",
    "GalaxyName": "ZB",
    "ApplicationUri": "urn:localhost:LmxOpcUa:instance1"
  }
}
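The rules above separate two identities that the current host conflates. A minimal sketch of that separation follows; ServerIdentity and its method names are hypothetical, and treating a blank value the same as null is an assumption rather than something the plan specifies.

```csharp
using System;

public static class ServerIdentity
{
    // Namespace URI shared by every instance that serves the same Galaxy.
    public static string NamespaceUri(string galaxyName)
    {
        return "urn:" + galaxyName + ":LmxOpcUa";
    }

    // An explicit ApplicationUri wins. A null (or, by assumption, blank) value
    // keeps the current behavior of reusing the namespace URI, which is only
    // acceptable when redundancy is disabled.
    public static string ResolveApplicationUri(
        string configuredUri, string galaxyName, bool redundancyEnabled)
    {
        if (!string.IsNullOrWhiteSpace(configuredUri))
            return configuredUri;

        if (redundancyEnabled)
            throw new InvalidOperationException(
                "OpcUa.ApplicationUri must be set explicitly when Redundancy.Enabled is true.");

        return NamespaceUri(galaxyName);
    }
}
```

With this shape, both instances of a pair produce the same NamespaceUri("ZB") while each resolves its own configured ApplicationUri.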

2. New Redundancy section in appsettings.json

{
  "Redundancy": {
    "Enabled": false,
    "Mode": "Warm",
    "Role": "Primary",
    "ServerUris": [],
    "ServiceLevelBase": 200
  }
}

3. Configuration model

File: src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/RedundancyConfiguration.cs (new)

public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}

4. Configuration rules

  • Enabled defaults to false
  • Mode supports Warm and Hot in Phase 1
  • Role supports Primary and Secondary
  • ServerUris must contain the local OpcUa.ApplicationUri when redundancy is enabled
  • ServerUris should contain at least two unique entries when redundancy is enabled
  • ServiceLevelBase should be in the range 1-255
  • Effective baseline:
    • Primary: ServiceLevelBase
    • Secondary: max(0, ServiceLevelBase - 50)
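The effective-baseline rule can be written as a one-line helper. RoleBaseline is a hypothetical name, and comparing the role case-insensitively (with any value other than Secondary treated as Primary) is an assumption; the validator is expected to reject unknown roles before this runs.

```csharp
using System;

public static class RoleBaseline
{
    // Primary keeps ServiceLevelBase; Secondary is offset by 50 and floored at 0.
    public static int Effective(string role, int serviceLevelBase)
    {
        bool isSecondary = string.Equals(role, "Secondary", StringComparison.OrdinalIgnoreCase);
        return isSecondary ? Math.Max(0, serviceLevelBase - 50) : serviceLevelBase;
    }
}
```

With the default ServiceLevelBase of 200, this yields 200 for the primary and 150 for the secondary.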

App root updates

File: src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/AppConfiguration.cs

  • Add public RedundancyConfiguration Redundancy { get; set; } = new RedundancyConfiguration();

Implementation Steps

Step 1: Separate application identity from namespace identity

Files:

  • src/.../Configuration/OpcUaConfiguration.cs
  • src/.../OpcUa/OpcUaServerHost.cs
  • docs/OpcUaServer.md
  • tests/.../Configuration/ConfigurationLoadingTests.cs

Changes:

  1. Add optional OpcUa.ApplicationUri
  2. Keep urn:{GalaxyName}:LmxOpcUa as the namespace URI used by LmxNodeManager
  3. Set ApplicationConfiguration.ApplicationUri from OpcUa.ApplicationUri when supplied
  4. Keep ApplicationUri and namespace URI distinct in docs and tests

This step is required before redundancy can be correct.

Step 2: Add RedundancyConfiguration and bind it

Files:

  • src/.../Configuration/RedundancyConfiguration.cs (new)
  • src/.../Configuration/AppConfiguration.cs
  • src/.../OpcUaService.cs

Changes:

  1. Create RedundancyConfiguration
  2. Add Redundancy to AppConfiguration
  3. Bind configuration.GetSection("Redundancy").Bind(_config.Redundancy);
  4. Pass _config.Redundancy through to OpcUaServerHost and LmxOpcUaServer

Step 3: Add RedundancyModeResolver

File: src/.../OpcUa/RedundancyModeResolver.cs (new)

Responsibilities:

  • map Mode to RedundancySupport
  • validate supported Phase 1 modes
  • fall back safely when disabled or invalid

Suggested signature:

public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled);
}
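One possible body for that signature, consistent with the resolver tests listed later in this plan, is sketched here. The local RedundancySupport enum is a stand-in so the example compiles on its own; real code would reference the SDK's Opc.Ua.RedundancySupport instead. Trimming whitespace is an added assumption.

```csharp
using System;

// Local stand-in for the SDK enum, declared only to keep the sketch self-contained.
public enum RedundancySupport { None, Cold, Warm, Hot, Transparent, HotAndMirrored }

public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled)
    {
        // Disabled or missing mode: advertise no redundancy.
        if (!enabled || string.IsNullOrWhiteSpace(mode))
            return RedundancySupport.None;

        // Phase 1 accepts only Warm and Hot; anything else falls back to None.
        string normalized = mode.Trim();
        if (string.Equals(normalized, "Warm", StringComparison.OrdinalIgnoreCase))
            return RedundancySupport.Warm;
        if (string.Equals(normalized, "Hot", StringComparison.OrdinalIgnoreCase))
            return RedundancySupport.Hot;

        return RedundancySupport.None;
    }
}
```

Falling back to None rather than throwing keeps a misconfigured instance running as a plain single server; the validator is the place to surface the bad value.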

Step 4: Add ServiceLevelCalculator

File: src/.../OpcUa/ServiceLevelCalculator.cs (new)

Purpose:

  • compute the current ServiceLevel from a baseline plus health inputs

Suggested signature:

public sealed class ServiceLevelCalculator
{
    public byte Calculate(int baseLevel, bool mxAccessConnected, bool dbConnected);
}

Suggested logic:

  • start with the role-adjusted baseline supplied by the caller
  • subtract 100 if MXAccess is disconnected
  • subtract 50 if the Galaxy DB is unreachable
  • return 0 if both are down
  • clamp to 0-255
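The suggested logic translates fairly directly into code. The following is a sketch under the rules listed above, not the final implementation; it uses Math.Min/Math.Max rather than Math.Clamp so it does not depend on a newer runtime.

```csharp
using System;

public sealed class ServiceLevelCalculator
{
    public byte Calculate(int baseLevel, bool mxAccessConnected, bool dbConnected)
    {
        // Both dependencies down: report 0 regardless of the baseline.
        if (!mxAccessConnected && !dbConnected)
            return 0;

        int level = baseLevel;
        if (!mxAccessConnected) level -= 100; // MXAccess loss is the larger penalty
        if (!dbConnected) level -= 50;        // Galaxy DB loss is the smaller penalty

        // Clamp into the Byte range used by Server/ServiceLevel.
        return (byte)Math.Min(255, Math.Max(0, level));
    }
}
```

One consequence worth checking in tests: with the default base of 200, a primary that has lost only the Galaxy DB scores 150, which equals a healthy secondary's baseline under this plan's rules, so clients see a tie in that case.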

Step 5: Extend ConfigurationValidator

File: src/.../Configuration/ConfigurationValidator.cs

Add validation/logging for:

  • OpcUa.ApplicationUri
  • Redundancy.Enabled, Mode, Role
  • ServerUris membership and uniqueness
  • ServiceLevelBase
  • local OpcUa.ApplicationUri must appear in Redundancy.ServerUris when enabled
  • warning when fewer than 2 unique server URIs are configured
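The checks above could be collected in a single pure function that returns a list of messages. This is a hedged sketch: RedundancyValidation, its parameter list, and the message wording are hypothetical, and the real ConfigurationValidator may log rather than return strings. It assumes serverUris is never null, which matches the configuration model's default.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RedundancyValidation
{
    // Returns human-readable problems; an empty list means the settings passed.
    public static List<string> Validate(
        bool enabled, string mode, string role,
        IList<string> serverUris, int serviceLevelBase, string applicationUri)
    {
        var problems = new List<string>();
        if (!enabled)
            return problems; // nothing to check for non-redundant deployments

        if (!IsOneOf(mode, "Warm", "Hot"))
            problems.Add("Redundancy.Mode must be Warm or Hot.");
        if (!IsOneOf(role, "Primary", "Secondary"))
            problems.Add("Redundancy.Role must be Primary or Secondary.");
        if (serviceLevelBase < 1 || serviceLevelBase > 255)
            problems.Add("Redundancy.ServiceLevelBase must be in the range 1-255.");

        // Exact (ordinal) match, because ServerUris must hold the exact ApplicationUri values.
        if (string.IsNullOrWhiteSpace(applicationUri))
            problems.Add("OpcUa.ApplicationUri must be set when redundancy is enabled.");
        else if (!serverUris.Contains(applicationUri))
            problems.Add("Redundancy.ServerUris must contain the local OpcUa.ApplicationUri.");

        int unique = serverUris.Distinct(StringComparer.Ordinal).Count();
        if (unique != serverUris.Count)
            problems.Add("Redundancy.ServerUris contains duplicate entries.");
        if (unique < 2)
            problems.Add("Warning: fewer than two unique server URIs are configured.");

        return problems;
    }

    private static bool IsOneOf(string value, params string[] allowed)
    {
        return allowed.Any(a => string.Equals(a, value, StringComparison.OrdinalIgnoreCase));
    }
}
```

Keeping the function free of logging and I/O makes each rule easy to cover with a unit test.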

Step 6: Expose redundancy through the standard OPC UA server object

File: src/.../OpcUa/LmxOpcUaServer.cs

Changes:

  1. Accept RedundancyConfiguration and local ApplicationUri
  2. On startup, locate the built-in ServerObjectState
  3. Configure ServerObjectState.ServiceLevel
  4. Configure the server redundancy object using the SDK's standard server-state types instead of writing guessed node ids directly
  5. If the default ServerRedundancyState does not expose ServerUriArray, replace or upgrade it with the appropriate non-transparent redundancy state type from the SDK before populating values
  6. Expose an internal method such as UpdateServiceLevel(bool mxConnected, bool dbConnected) for service-layer health updates

Important: the implementation should use SDK types/constants (ServerObjectState, ServerRedundancyState, NonTransparentRedundancyState, VariableIds.*) rather than hand-maintained numeric literals.

Step 7: Update OpcUaServerHost

File: src/.../OpcUa/OpcUaServerHost.cs

Changes:

  1. Accept RedundancyConfiguration
  2. Pass redundancy config and resolved local ApplicationUri into LmxOpcUaServer
  3. Log redundancy mode/role/server URIs at startup

Step 8: Wire health updates in OpcUaService

File: src/.../OpcUaService.cs

Changes:

  1. Bind and pass redundancy config
  2. After startup, initialize the starting ServiceLevel
  3. Subscribe to IMxAccessClient.ConnectionStateChanged
  4. Update DB health whenever startup repository checks, change-detection work, or rebuild attempts succeed/fail
  5. Prefer event-driven updates; add a lightweight periodic refresh only if necessary

Avoid introducing a second large standalone polling loop when existing connection and repository activity already gives most of the needed health signals.
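An event-driven approach along these lines can be sketched as a small tracker that recomputes the level whenever a health input changes and publishes only when the value differs from the last one. ServiceLevelTracker and its members are hypothetical; the publish callback stands in for whatever method LmxOpcUaServer ends up exposing, and the penalties repeat the calculator rules from Step 4.

```csharp
using System;

public sealed class ServiceLevelTracker
{
    private readonly object _gate = new object();
    private readonly int _baseLevel;
    private readonly Action<byte> _publish;
    private bool _mxConnected;
    private bool _dbConnected;
    private byte? _lastPublished;

    public ServiceLevelTracker(int baseLevel, Action<byte> publish)
    {
        _baseLevel = baseLevel;
        _publish = publish ?? throw new ArgumentNullException(nameof(publish));
    }

    // Intended to be called from connection-state and repository-result handlers.
    public void SetMxAccessConnected(bool connected) { Update(() => _mxConnected = connected); }
    public void SetDbConnected(bool connected) { Update(() => _dbConnected = connected); }

    private void Update(Action mutate)
    {
        // Publishing inside the lock keeps updates ordered; the callback should be quick.
        lock (_gate)
        {
            mutate();
            byte level = Compute();
            if (_lastPublished == level)
                return; // unchanged value: stay quiet to keep ServiceLevel meaningful
            _lastPublished = level;
            _publish(level);
        }
    }

    private byte Compute()
    {
        if (!_mxConnected && !_dbConnected)
            return 0;
        int level = _baseLevel;
        if (!_mxConnected) level -= 100;
        if (!_dbConnected) level -= 50;
        return (byte)Math.Min(255, Math.Max(0, level));
    }
}
```

Because unchanged values are suppressed, repeated success reports from change-detection work do not generate repeated ServiceLevel writes.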

Step 9: Update test builders and helpers before integration coverage

Files:

  • src/.../OpcUaServiceBuilder.cs
  • tests/.../Helpers/OpcUaServerFixture.cs
  • tests/.../Helpers/OpcUaTestClient.cs

Changes:

  • add WithRedundancy(...)
  • add WithApplicationUri(...) or allow full OpcUaConfiguration override
  • ensure redundancy tests can run two in-process servers with distinct ServerName, ApplicationUri, and certificate identity
  • when needed, use separate PKI roots in tests so paired fixtures do not collide on certificate state

Step 10: Update appsettings.json

File: src/.../appsettings.json

Add:

  • an OpcUa.ApplicationUri example value, with the accompanying commentary kept in the docs
  • Redundancy section with Enabled = false defaults

Step 11: Add CLI redundancy command

Files:

  • tools/opcuacli-dotnet/Commands/RedundancyCommand.cs (new)
  • tools/opcuacli-dotnet/README.md
  • docs/CliTool.md

Command: redundancy

Read:

  • VariableIds.Server_ServerRedundancy_RedundancySupport
  • VariableIds.Server_ServiceLevel
  • VariableIds.Server_ServerRedundancy_ServerUriArray

Output example:

Redundancy Mode:  Warm
Service Level:    200
Server URIs:
  - urn:localhost:LmxOpcUa:instance1
  - urn:localhost:LmxOpcUa:instance2

Use SDK constants instead of hardcoded numeric ids in the command implementation.

Step 12: Deploy the second service instance

Deployment target: C:\publish\lmxopcua\instance2

Suggested configuration differences:

| Setting | instance1 | instance2 |
| --- | --- | --- |
| OpcUa.Port | 4840 | 4841 |
| Dashboard.Port | 8081 | 8082 |
| OpcUa.ServerName | LmxOpcUa | LmxOpcUa2 |
| OpcUa.ApplicationUri | urn:localhost:LmxOpcUa:instance1 | urn:localhost:LmxOpcUa:instance2 |
| Redundancy.Enabled | true | true |
| Redundancy.Role | Primary | Secondary |
| Redundancy.Mode | Warm | Warm |
| Redundancy.ServerUris | same two-entry set | same two-entry set |

Deployment notes:

  • both instances should share the same GalaxyName and namespace URI
  • each instance must have a distinct application certificate identity
  • if certificate handling is sensitive, give each instance an explicit Security.CertificateSubject or separate PKI root

Update service_info.md with the second instance details after deployment is real, not speculative.


Test Plan

Unit tests: RedundancyModeResolver

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyModeResolverTests.cs

| Test | Description |
| --- | --- |
| Resolve_Disabled_ReturnsNone | Enabled=false returns None |
| Resolve_Warm_ReturnsWarm | Mode="Warm" maps correctly |
| Resolve_Hot_ReturnsHot | Mode="Hot" maps correctly |
| Resolve_Unknown_FallsBackToNone | Unknown mode falls back safely |
| Resolve_CaseInsensitive | Case-insensitive parsing works |

Unit tests: ServiceLevelCalculator

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/ServiceLevelCalculatorTests.cs

| Test | Description |
| --- | --- |
| FullyHealthy_Primary_ReturnsBase | Healthy primary baseline is preserved |
| FullyHealthy_Secondary_ReturnsBaseMinusFifty | Healthy secondary baseline is lower |
| MxAccessDown_ReducesServiceLevel | MXAccess failure reduces score |
| DbDown_ReducesServiceLevel | DB failure reduces score |
| BothDown_ReturnsZero | Both unavailable returns 0 |
| ClampedTo255 | Upper clamp works |
| ClampedToZero | Lower clamp works |

Unit tests: RedundancyConfiguration

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyConfigurationTests.cs

| Test | Description |
| --- | --- |
| DefaultConfig_Disabled | Enabled defaults to false |
| DefaultConfig_ModeWarm | Mode defaults to Warm |
| DefaultConfig_RolePrimary | Role defaults to Primary |
| DefaultConfig_EmptyServerUris | ServerUris defaults to empty |
| DefaultConfig_ServiceLevelBase200 | ServiceLevelBase defaults to 200 |

Updates to existing configuration tests

File: tests/ZB.MOM.WW.LmxOpcUa.Tests/Configuration/ConfigurationLoadingTests.cs

Add coverage for:

  • OpcUa.ApplicationUri
  • Redundancy section binding
  • redundancy validation when ApplicationUri is missing
  • redundancy validation when local ApplicationUri is absent from ServerUris
  • invalid ServiceLevelBase

Integration tests

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Integration/RedundancyTests.cs

Cover:

  • redundancy disabled reports None
  • warm redundancy reports configured mode
  • ServerUriArray matches configuration
  • primary reports higher ServiceLevel than secondary
  • both servers expose the same namespace URI but different ApplicationUri values
  • service level drops when MXAccess disconnects

Pattern:

  • use two fixture instances
  • give each fixture a distinct ServerName, ApplicationUri, and port
  • if secure transport is enabled in those tests, isolate PKI roots to avoid certificate cross-talk

Documentation Plan

New file

  • docs/Redundancy.md

Contents:

  1. overview of OPC UA non-transparent redundancy
  2. difference between namespace URI and server ApplicationUri
  3. redundancy configuration reference
  4. service-level computation
  5. two-instance deployment guide
  6. CLI redundancy command usage
  7. troubleshooting

Updates to existing docs

| File | Changes |
| --- | --- |
| docs/Configuration.md | Add OpcUa.ApplicationUri and Redundancy sections |
| docs/OpcUaServer.md | Correct the current ApplicationUri == namespace description and add redundancy behavior |
| docs/CliTool.md | Add redundancy command |
| docs/ServiceHosting.md | Add multi-instance deployment notes |
| README.md | Mention redundancy support and link docs |
| CLAUDE.md | Add redundancy architecture note |

Update after real deployment

  • service_info.md

Only update this once the second instance is actually deployed and verified.


File Change Summary

| File | Action | Description |
| --- | --- | --- |
| src/.../Configuration/OpcUaConfiguration.cs | Modify | Add explicit ApplicationUri |
| src/.../Configuration/RedundancyConfiguration.cs | New | Redundancy config model |
| src/.../Configuration/AppConfiguration.cs | Modify | Add Redundancy section |
| src/.../Configuration/ConfigurationValidator.cs | Modify | Validate/log redundancy and application identity |
| src/.../OpcUa/RedundancyModeResolver.cs | New | Map config mode to RedundancySupport |
| src/.../OpcUa/ServiceLevelCalculator.cs | New | Compute ServiceLevel from health inputs |
| src/.../OpcUa/LmxOpcUaServer.cs | Modify | Expose redundancy state via SDK server object |
| src/.../OpcUa/OpcUaServerHost.cs | Modify | Pass local application identity and redundancy config |
| src/.../OpcUaService.cs | Modify | Bind config and wire health updates |
| src/.../OpcUaServiceBuilder.cs | Modify | Support redundancy/application identity injection |
| src/.../appsettings.json | Modify | Add redundancy settings |
| tools/opcuacli-dotnet/Commands/RedundancyCommand.cs | New | Read redundancy state from a server |
| tests/.../Redundancy/*.cs | New | Unit tests for redundancy config and calculators |
| tests/.../Configuration/ConfigurationLoadingTests.cs | Modify | Bind/validate new settings |
| tests/.../Integration/RedundancyTests.cs | New | Paired-server integration tests |
| tests/.../Helpers/OpcUaServerFixture.cs | Modify | Support paired redundancy fixtures |
| tests/.../Helpers/OpcUaTestClient.cs | Modify | Read redundancy nodes in integration tests |
| docs/Redundancy.md | New | Dedicated redundancy guide |
| docs/Configuration.md | Modify | Document new config |
| docs/OpcUaServer.md | Modify | Correct application identity and add redundancy details |
| docs/CliTool.md | Modify | Document redundancy command |
| docs/ServiceHosting.md | Modify | Multi-instance deployment notes |
| README.md | Modify | Link redundancy docs |
| CLAUDE.md | Modify | Architecture note |
| service_info.md | Modify later | Document real second-instance deployment |

Verification Guardrails

Gate 1: Build

dotnet build ZB.MOM.WW.LmxOpcUa.slnx

Gate 2: Unit tests

dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests

Gate 3: Redundancy integration tests

dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~Redundancy"

Gate 4: CLI build

cd tools/opcuacli-dotnet
dotnet build

Gate 5: Manual single-instance check

opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa

Expected:

  • RedundancySupport=None
  • ServiceLevel=255

Gate 6: Manual paired-instance check

opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa

Expected:

  • both report the same ServerUriArray
  • each reports its own unique local ApplicationUri
  • primary reports a higher ServiceLevel

Gate 7: Full test suite

dotnet test ZB.MOM.WW.LmxOpcUa.slnx

Risks and Considerations

  1. Application identity is the main correctness risk. Without unique ApplicationUri values, the redundant set is invalid even if ServerUriArray is populated.
  2. SDK wiring may require replacing the default redundancy state node. The base ServerRedundancyState does not expose ServerUriArray; the implementation may need the non-transparent subtype from the SDK.
  3. Two in-process servers can collide on certificates. Tests and deployment need distinct application identities and, when necessary, isolated PKI roots.
  4. Both instances hit the same MXAccess runtime and Galaxy DB. Verify client-registration and polling behavior under paired load.
  5. ServiceLevel should remain meaningful, not noisy. Prefer deterministic role + health inputs over frequent arbitrary adjustments.
  6. service_info.md is deployment documentation, not design. Do not prefill it with speculative values before the second instance actually exists.

Execution Order

  1. Step 1: add OpcUa.ApplicationUri and separate it from namespace identity
  2. Steps 2-5: config model, resolver, calculator, validator
  3. Gate 1 + Gate 2
  4. Step 9: update builders/helpers so tests can express paired servers cleanly
  5. Steps 6-8: server exposure and service-layer health wiring
  6. Gate 1 + Gate 2 + Gate 3
  7. Step 10: update appsettings.json
  8. Step 11: add CLI redundancy command
  9. Gate 4 + Gate 5
  10. Step 12: deploy and verify the second instance
  11. Update service_info.md with real deployment details
  12. Documentation updates
  13. Gate 7