Joseph Doherty a3c2d9b243 Add OPC UA server redundancy implementation plan

Covers non-transparent warm/hot redundancy with configurable roles,
dynamic ServiceLevel, CLI support, second service instance deployment,
and verification guardrails across unit, integration, and manual tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-28 12:52:15 -04:00

OPC UA Server Redundancy Plan

Summary

Add configurable non-transparent warm/hot redundancy to the LmxOpcUa server so that two instances sharing the same Galaxy repository can operate as a redundant pair. Each instance advertises itself and its partner through the OPC UA ServerRedundancy node, publishes a dynamic ServiceLevel reflecting runtime health, and allows clients to discover the redundant set and fail over between instances. The CLI tool gains a redundancy command for inspecting the redundant server set.

This plan covers server-side redundancy exposure, client-side discovery, a second deployed service instance, documentation, and tests. It does not implement automatic server-side failover or subscription transfer — those are client responsibilities per the OPC UA specification.

Background: OPC UA Redundancy Model

OPC UA defines redundancy through three address-space nodes: RedundancySupport and ServerUriArray under Server/ServerRedundancy, and ServiceLevel directly under the Server object:

Node	Type	Purpose
`RedundancySupport`	`RedundancySupport` enum	Declares the redundancy mode: `None`, `Cold`, `Warm`, `Hot`, `Transparent`, `HotAndMirrored`
`ServerUriArray`	`String[]`	Lists the `ApplicationUri` values of all servers in the redundant set (non-transparent modes)
`ServiceLevel`	`Byte` (0–255)	Indicates current operational quality; clients prefer the server with the highest value

Non-Transparent Redundancy (our target)

In non-transparent redundancy (Warm or Hot), both servers run independently with their own sessions and subscriptions. Clients discover the redundant set by reading ServerUriArray, monitor ServiceLevel on each server, and manage their own failover. This model fits our architecture where each instance connects to the same Galaxy repository and MXAccess runtime independently.

ServiceLevel Semantics

Range	Meaning
0	Server is not operational
1–99	Degraded (e.g., MXAccess disconnected, DB unreachable)
100–199	Healthy secondary
200–255	Healthy primary (preferred)

The primary server should advertise a higher ServiceLevel than the secondary so clients prefer it when both are healthy.
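
As an illustration of how a client might apply these ranges, the sketch below picks the server with the highest ServiceLevel from readings it has already collected. The `ServerReading` and `PreferredServerSelector` names are hypothetical and not part of this plan; reading the values over OPC UA is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical holder for one server's ServiceLevel reading.
public sealed class ServerReading
{
    public ServerReading(string applicationUri, byte serviceLevel)
    {
        ApplicationUri = applicationUri;
        ServiceLevel = serviceLevel;
    }

    public string ApplicationUri { get; }
    public byte ServiceLevel { get; }
}

public static class PreferredServerSelector
{
    // Maps a ServiceLevel value to the bands in the table above.
    public static string Describe(byte serviceLevel)
    {
        if (serviceLevel == 0) return "Not operational";
        if (serviceLevel < 100) return "Degraded";
        if (serviceLevel < 200) return "Healthy secondary";
        return "Healthy primary";
    }

    // Returns the reading with the highest ServiceLevel, or null when no
    // server is operational (every reading is 0 or the input is empty).
    public static ServerReading Pick(IEnumerable<ServerReading> readings)
    {
        return readings
            .Where(r => r.ServiceLevel > 0)
            .OrderByDescending(r => r.ServiceLevel)
            .FirstOrDefault();
    }
}
```

Because the sort is stable, two servers reporting the same ServiceLevel resolve to whichever appears first in the input, so a client could list its preferred server first as a tie-breaker.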

Current State

LmxOpcUaServer extends StandardServer but does not override any redundancy-related methods
ServerRedundancy/RedundancySupport defaults to None (SDK default)
ServiceLevel defaults to 255 (SDK default — "fully operational")
No configuration for redundant partner URIs or role designation
Single deployed instance at C:\publish\lmxopcua\instance1 on port 4840
No CLI support for reading redundancy information

Scope

In Scope (Phase 1)

Redundancy configuration model — role, partner URIs, ServiceLevel weights
Server redundancy node exposure — RedundancySupport, ServerUriArray, dynamic ServiceLevel
ServiceLevel computation — based on runtime health (MXAccess state, DB connectivity, role)
CLI redundancy command — read RedundancySupport, ServerUriArray, ServiceLevel from a server
Second service instance — deployed at C:\publish\lmxopcua\instance2 with non-overlapping ports
Documentation — new docs/Redundancy.md component doc, updates to existing docs
Unit tests — config, ServiceLevel computation, resolver tests
Integration tests — two-server redundancy E2E test in the integration test project

Deferred

Automatic subscription transfer (client-side responsibility)
Server-initiated failover (Galaxy redundancy table / engine flags)
Transparent redundancy mode
Health-check HTTP endpoint for load balancers

Configuration Design

New `Redundancy` section in `appsettings.json`

{
  "Redundancy": {
    "Enabled": false,
    "Mode": "Warm",
    "Role": "Primary",
    "ServerUris": [],
    "ServiceLevelBase": 200
  }
}

Configuration model

File: src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/RedundancyConfiguration.cs (new)

public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}

Configuration rules

Enabled defaults to false for backward compatibility. When false, RedundancySupport = None and ServiceLevel = 255 (SDK defaults).
Mode must be Warm or Hot (Phase 1). Maps to RedundancySupport.Warm or RedundancySupport.Hot.
Role must be Primary or Secondary. Controls the base ServiceLevel (Primary gets ServiceLevelBase, Secondary gets ServiceLevelBase - 50).
ServerUris lists the ApplicationUri values for all servers in the redundant set, including the local server. The OPC UA spec requires this to contain the full set. These are application URIs such as urn:ZB:LmxOpcUa, not endpoint URLs.
ServiceLevelBase is the starting ServiceLevel when the server is fully healthy. Degraded conditions subtract from this value.

App root updates

File: src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/AppConfiguration.cs

Add public RedundancyConfiguration Redundancy { get; set; } = new RedundancyConfiguration();

Implementation Steps

Step 1: Add RedundancyConfiguration model and bind it

Files:

src/.../Configuration/RedundancyConfiguration.cs (new)
src/.../Configuration/AppConfiguration.cs
src/.../OpcUaService.cs

Changes:

Create RedundancyConfiguration class with properties above
Add Redundancy property to AppConfiguration
Bind the section with configuration.GetSection("Redundancy").Bind(_config.Redundancy);
Pass _config.Redundancy through to OpcUaServerHost and LmxOpcUaServer

Step 2: Add RedundancyModeResolver

File: src/.../OpcUa/RedundancyModeResolver.cs (new)

Responsibilities:

Map Mode string to RedundancySupport enum value
Validate against supported Phase 1 modes (Warm, Hot)
Fall back to None with warning for unknown modes

public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled);
}
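
One possible body for the resolver, following the three responsibilities above. The local `RedundancySupport` enum is a stand-in so the sketch compiles on its own (the real type is `Opc.Ua.RedundancySupport` from the SDK), and `Console.Error` is a placeholder for the project's logger, which this plan does not name.

```csharp
using System;

// Stand-in for Opc.Ua.RedundancySupport so the sketch is self-contained.
public enum RedundancySupport { None, Cold, Warm, Hot, Transparent, HotAndMirrored }

public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled)
    {
        // Disabled redundancy always reports None, whatever Mode says.
        if (!enabled) return RedundancySupport.None;

        // Phase 1 accepts only Warm and Hot, matched case-insensitively.
        if (string.Equals(mode, "Warm", StringComparison.OrdinalIgnoreCase))
            return RedundancySupport.Warm;
        if (string.Equals(mode, "Hot", StringComparison.OrdinalIgnoreCase))
            return RedundancySupport.Hot;

        // Placeholder for the real logger: warn and fall back safely.
        Console.Error.WriteLine($"Unknown redundancy mode '{mode}'; falling back to None.");
        return RedundancySupport.None;
    }
}
```

Matching the two supported names explicitly, rather than parsing the whole enum, keeps modes such as Transparent from slipping through before they are implemented.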

Step 3: Add ServiceLevelCalculator

File: src/.../OpcUa/ServiceLevelCalculator.cs (new)

Computes the dynamic ServiceLevel byte from runtime health:

public class ServiceLevelCalculator
{
    public byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary);
}

Logic:

Start with baseLine (ServiceLevelBase from config) and subtract 50 when isPrimary is false (e.g., a base of 200 yields 200 for Primary, 150 for Secondary)
Subtract 100 if MXAccess is disconnected
Subtract 50 if Galaxy DB is unreachable
Clamp to 0–255
Return 0 if both MXAccess and DB are down
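
The logic above could be implemented roughly as follows. This sketch assumes `baseLine` is the raw `ServiceLevelBase` and that the secondary offset of 50 is applied inside the calculator, which is how the unit tests later in this plan read; the constant names are illustrative.

```csharp
using System;

public class ServiceLevelCalculator
{
    // Values come from the plan; the names are illustrative only.
    private const int SecondaryOffset = 50;
    private const int MxAccessPenalty = 100;
    private const int DbPenalty = 50;

    public byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary)
    {
        // With both dependencies down, report "not operational".
        if (!mxAccessConnected && !dbConnected) return 0;

        int level = baseLine;
        if (!isPrimary) level -= SecondaryOffset;
        if (!mxAccessConnected) level -= MxAccessPenalty;
        if (!dbConnected) level -= DbPenalty;

        // Clamp into the Byte range before converting.
        return (byte)Math.Max(0, Math.Min(255, level));
    }
}
```

For example, a secondary with a base of 200 and the Galaxy DB unreachable computes 200 - 50 - 50 = 100.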

Step 4: Extend ConfigurationValidator for redundancy

File: src/.../Configuration/ConfigurationValidator.cs

Add validation/logging for:

Redundancy.Enabled, Mode, Role
ServerUris should not be empty when Enabled = true
ServiceLevelBase should be 1–255
Warning when Enabled = true but ServerUris has fewer than 2 entries
Log effective redundancy configuration at startup
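
A minimal sketch of these checks, kept separate from the existing ConfigurationValidator because its shape is not shown in this plan. `RedundancyValidationResult` and `RedundancyConfigurationRules` are hypothetical names. Treating an unknown Role as an error and an unknown Mode as a warning (since the resolver falls back to None) are assumptions, not decisions recorded here.

```csharp
using System;
using System.Collections.Generic;

// Mirrors the configuration model defined earlier in this plan.
public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}

// Hypothetical result type; the real validator may report differently.
public sealed class RedundancyValidationResult
{
    public List<string> Errors { get; } = new List<string>();
    public List<string> Warnings { get; } = new List<string>();
    public bool IsValid => Errors.Count == 0;
}

public static class RedundancyConfigurationRules
{
    public static RedundancyValidationResult Validate(RedundancyConfiguration config)
    {
        var result = new RedundancyValidationResult();
        if (!config.Enabled) return result; // nothing to check when disabled

        // Assumption: unknown Mode only warns, because the resolver falls back to None.
        if (!IsOneOf(config.Mode, "Warm", "Hot"))
            result.Warnings.Add($"Redundancy.Mode '{config.Mode}' is not supported; None will be used.");

        // Assumption: an unknown Role is rejected outright.
        if (!IsOneOf(config.Role, "Primary", "Secondary"))
            result.Errors.Add($"Redundancy.Role '{config.Role}' must be Primary or Secondary.");

        if (config.ServiceLevelBase < 1 || config.ServiceLevelBase > 255)
            result.Errors.Add("Redundancy.ServiceLevelBase must be between 1 and 255.");

        if (config.ServerUris == null || config.ServerUris.Count == 0)
            result.Warnings.Add("Redundancy.ServerUris is empty although redundancy is enabled.");
        else if (config.ServerUris.Count < 2)
            result.Warnings.Add("Redundancy.ServerUris has fewer than 2 entries.");

        return result;
    }

    private static bool IsOneOf(string value, params string[] allowed)
    {
        foreach (var candidate in allowed)
            if (string.Equals(value, candidate, StringComparison.OrdinalIgnoreCase)) return true;
        return false;
    }
}
```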

Step 5: Update LmxOpcUaServer to expose redundancy state

File: src/.../OpcUa/LmxOpcUaServer.cs

Changes:

Accept RedundancyConfiguration in the constructor
Override OnServerStarted to write redundancy nodes:
- Set Server/ServerRedundancy/RedundancySupport to the resolved mode
- Set Server/ServerRedundancy/ServerUriArray to the configured URIs
Override SetServerState or use a timer to update Server/ServiceLevel periodically based on ServiceLevelCalculator
Expose a method UpdateServiceLevel(bool mxConnected, bool dbConnected) that the service layer can call when health state changes
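
The update path might look something like the sketch below, shown in isolation from the SDK. `ServiceLevelPublisher` is a hypothetical helper, and the `writeServiceLevel` delegate stands in for whatever SDK call ends up setting Server/ServiceLevel; neither is an SDK API. The lock and the write-only-on-change check are one way to keep concurrent health notifications from interleaving.

```csharp
using System;

public sealed class ServiceLevelPublisher
{
    private readonly object _gate = new object();
    private readonly Func<bool, bool, byte> _calculate;
    private readonly Action<byte> _writeServiceLevel;
    private byte? _lastWritten;

    // calculate: maps (mxConnected, dbConnected) to a ServiceLevel byte.
    // writeServiceLevel: placeholder for the actual node update.
    public ServiceLevelPublisher(Func<bool, bool, byte> calculate, Action<byte> writeServiceLevel)
    {
        _calculate = calculate ?? throw new ArgumentNullException(nameof(calculate));
        _writeServiceLevel = writeServiceLevel ?? throw new ArgumentNullException(nameof(writeServiceLevel));
    }

    public void UpdateServiceLevel(bool mxConnected, bool dbConnected)
    {
        lock (_gate)
        {
            byte level = _calculate(mxConnected, dbConnected);

            // Skip the write when nothing changed, so a periodic refresh
            // does not generate redundant node updates.
            if (_lastWritten == level) return;

            _writeServiceLevel(level);
            _lastWritten = level;
        }
    }
}
```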

Step 6: Update OpcUaServerHost to pass redundancy config

File: src/.../OpcUa/OpcUaServerHost.cs

Changes:

Accept RedundancyConfiguration in the constructor
Pass it through to LmxOpcUaServer
Log active redundancy mode at startup

Step 7: Wire ServiceLevel updates in OpcUaService

File: src/.../OpcUaService.cs

Changes:

Bind redundancy config section
Pass redundancy config to OpcUaServerHost
Subscribe to MxAccessClient.ConnectionStateChanged to trigger ServiceLevel updates
After Galaxy DB health checks, trigger ServiceLevel updates
Use a periodic timer (e.g., every 5 seconds) to refresh ServiceLevel based on current component health
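
A rough sketch of the periodic refresh, assuming health can be probed through two simple delegates. `ServiceLevelRefreshLoop` and its parameters are hypothetical; the MxAccessClient.ConnectionStateChanged handler and the Galaxy DB health check could both call `RefreshNow()` for an immediate update between timer ticks.

```csharp
using System;
using System.Threading;

public sealed class ServiceLevelRefreshLoop : IDisposable
{
    private readonly Func<bool> _isMxAccessConnected;
    private readonly Func<bool> _isDbReachable;
    private readonly Action<bool, bool> _updateServiceLevel;
    private readonly Timer _timer;

    public ServiceLevelRefreshLoop(
        Func<bool> isMxAccessConnected,
        Func<bool> isDbReachable,
        Action<bool, bool> updateServiceLevel,
        TimeSpan period)
    {
        _isMxAccessConnected = isMxAccessConnected ?? throw new ArgumentNullException(nameof(isMxAccessConnected));
        _isDbReachable = isDbReachable ?? throw new ArgumentNullException(nameof(isDbReachable));
        _updateServiceLevel = updateServiceLevel ?? throw new ArgumentNullException(nameof(updateServiceLevel));

        // First tick fires after one full period, then repeats at that period.
        _timer = new Timer(_ => RefreshNow(), null, period, period);
    }

    // Samples current health and forwards it; safe to call from event handlers.
    public void RefreshNow()
    {
        _updateServiceLevel(_isMxAccessConnected(), _isDbReachable());
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```

With the plan's example interval, the loop would be created with `TimeSpan.FromSeconds(5)` and disposed when the service stops.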

Step 8: Update appsettings.json

File: src/.../appsettings.json

Add the Redundancy section with backward-compatible defaults (Enabled: false).

Step 9: Update OpcUaServiceBuilder for test injection

File: src/.../OpcUaServiceBuilder.cs

Add WithRedundancy(RedundancyConfiguration) builder method so tests can inject redundancy configuration.

Step 10: Add CLI `redundancy` command

Files:

tools/opcuacli-dotnet/Commands/RedundancyCommand.cs (new)

Command: redundancy

Reads from the target server:

Server/ServerRedundancy/RedundancySupport (i=3709)
Server/ServiceLevel (i=2267)
Server/ServerRedundancy/ServerUriArray (i=11314, if non-transparent redundancy)

Output format:

Redundancy Mode:  Warm
Service Level:    200
Server URIs:
  - urn:ZB:LmxOpcUa
  - urn:ZB:LmxOpcUa2

Options: --url, --username, --password, --security (same shared options as other commands).
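
The output format above could be produced by a small formatting helper like the one sketched here. `RedundancyReportFormatter` is a hypothetical name; how the values are read from the server and how the command plugs into the CLI framework are not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RedundancyReportFormatter
{
    // Renders the three values in the layout shown above. The Server URIs
    // block is omitted when the server exposes no ServerUriArray.
    public static string Format(string mode, byte serviceLevel, IReadOnlyList<string> serverUris)
    {
        var sb = new StringBuilder();
        sb.Append("Redundancy Mode:  ").Append(mode).Append('\n');
        sb.Append("Service Level:    ").Append(serviceLevel).Append('\n');

        if (serverUris != null && serverUris.Count > 0)
        {
            sb.Append("Server URIs:\n");
            foreach (var uri in serverUris)
                sb.Append("  - ").Append(uri).Append('\n');
        }

        return sb.ToString();
    }
}
```

Keeping the formatting in a pure function makes the command's output easy to cover with a unit test that needs no running server.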

Step 11: Deploy second service instance

Deployment target: C:\publish\lmxopcua\instance2

Configuration differences from instance1:

Setting	instance1	instance2
`OpcUa.Port`	`4840`	`4841`
`OpcUa.ServerName`	`LmxOpcUa`	`LmxOpcUa2`
`Dashboard.Port`	`8083`	`8084`
`Redundancy.Enabled`	`true`	`true`
`Redundancy.Role`	`Primary`	`Secondary`
`Redundancy.Mode`	`Warm`	`Warm`
`Redundancy.ServerUris`	`["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"]`	`["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"]`
`Redundancy.ServiceLevelBase`	`200`	`200`

Windows service for instance2:

Name: LmxOpcUa2
Display name: LMX OPC UA Server (Instance 2)
Executable: C:\publish\lmxopcua\instance2\ZB.MOM.WW.LmxOpcUa.Host.exe

Both instances share the same Galaxy DB (ZB) and MXAccess runtime. The GalaxyName remains ZB for both so they expose the same namespace.

Update service_info.md with the second instance details.

Test Plan

Unit tests — RedundancyModeResolver

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyModeResolverTests.cs

Test	Description
`Resolve_Disabled_ReturnsNone`	`Enabled=false` always returns `RedundancySupport.None`
`Resolve_Warm_ReturnsWarm`	`Mode="Warm"` maps to `RedundancySupport.Warm`
`Resolve_Hot_ReturnsHot`	`Mode="Hot"` maps to `RedundancySupport.Hot`
`Resolve_Unknown_FallsBackToNone`	Unknown mode falls back safely
`Resolve_CaseInsensitive`	`"warm"` and `"WARM"` both resolve

Unit tests — ServiceLevelCalculator

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/ServiceLevelCalculatorTests.cs

Test	Description
`FullyHealthy_Primary_ReturnsBase`	All healthy, primary role → `ServiceLevelBase`
`FullyHealthy_Secondary_ReturnsBaseMinusFifty`	All healthy, secondary role → `ServiceLevelBase - 50`
`MxAccessDown_ReducesServiceLevel`	MXAccess disconnected subtracts 100
`DbDown_ReducesServiceLevel`	DB unreachable subtracts 50
`BothDown_ReturnsZero`	MXAccess + DB both down → 0
`ClampedTo255`	Base of 255 with healthy → 255
`ClampedToZero`	Heavy penalties don't go negative

Unit tests — RedundancyConfiguration defaults

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyConfigurationTests.cs

Test	Description
`DefaultConfig_Disabled`	`Enabled` defaults to `false`
`DefaultConfig_ModeWarm`	`Mode` defaults to `"Warm"`
`DefaultConfig_RolePrimary`	`Role` defaults to `"Primary"`
`DefaultConfig_EmptyServerUris`	`ServerUris` defaults to empty
`DefaultConfig_ServiceLevelBase200`	`ServiceLevelBase` defaults to `200`

Updates to existing configuration tests

File: tests/ZB.MOM.WW.LmxOpcUa.Tests/Configuration/ConfigurationLoadingTests.cs

Add:

Redundancy_Section_BindsCorrectly — verify binding from appsettings.json
Redundancy_Section_BindsCustomValues — in-memory override test
Validator_RedundancyEnabled_EmptyServerUris_ReturnsTrue_WithWarning — validates but warns
Validator_RedundancyEnabled_InvalidServiceLevelBase_ReturnsFalse — rejects 0 or >255

Integration tests — redundancy E2E

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Integration/RedundancyTests.cs

These tests start two in-process OPC UA servers with redundancy enabled and verify client-visible behavior:

Test	Description
`Server_WithRedundancyDisabled_ReportsNone`	Default config → `RedundancySupport.None`, `ServiceLevel=255`
`Server_WithRedundancyEnabled_ReportsConfiguredMode`	`Enabled=true, Mode=Warm` → `RedundancySupport.Warm`
`Server_WithRedundancyEnabled_ExposesServerUriArray`	Client can read `ServerUriArray` and it matches config
`Server_Primary_HasHigherServiceLevel_ThanSecondary`	Primary server reports higher `ServiceLevel` than secondary
`TwoServers_BothExposeSameRedundantSet`	Two server fixtures, both report the same `ServerUriArray`
`Server_ServiceLevel_DropsWith_MxAccessDisconnect`	Simulate MXAccess disconnect → `ServiceLevel` decreases

Pattern: Use OpcUaServerFixture.WithFakeMxAccessClient() with redundancy config injected, connect with OpcUaTestClient, read the standard OPC UA redundancy nodes.

Documentation Plan

New file: `docs/Redundancy.md`

Contents:

Overview of OPC UA non-transparent redundancy
Redundancy configuration section reference (Enabled, Mode, Role, ServerUris, ServiceLevelBase)
ServiceLevel computation logic and degraded-state penalties
How clients discover and fail over between instances
Deployment guide for a two-instance redundant pair (ports, service names, shared Galaxy DB)
CLI redundancy command usage
Troubleshooting: mismatched ServerUris, ServiceLevel stuck at 0, etc.

Updates to existing docs

File	Changes
`docs/Configuration.md`	Add `Redundancy` section table, example JSON, add to validation rules list, update example appsettings.json
`docs/OpcUaServer.md`	Add redundancy state exposure section, link to `Redundancy.md`
`docs/CliTool.md`	Add `redundancy` command documentation
`docs/ServiceHosting.md`	Add multi-instance deployment notes
`README.md`	Add `Redundancy` to the component documentation table, mention redundancy in Quick Start
`CLAUDE.md`	Add redundancy architecture note

Update: `service_info.md`

Add a second section documenting instance2:

Path: C:\publish\lmxopcua\instance2
Windows service name: LmxOpcUa2
Port: 4841
Dashboard port: 8084
Redundancy role: Secondary
Endpoint: opc.tcp://localhost:4841/LmxOpcUa

File Change Summary

File	Action	Description
`src/.../Configuration/RedundancyConfiguration.cs`	New	Redundancy config model
`src/.../Configuration/AppConfiguration.cs`	Modify	Add `Redundancy` section
`src/.../Configuration/ConfigurationValidator.cs`	Modify	Validate/log redundancy settings
`src/.../OpcUa/RedundancyModeResolver.cs`	New	Mode string → `RedundancySupport` enum
`src/.../OpcUa/ServiceLevelCalculator.cs`	New	Dynamic ServiceLevel from health state
`src/.../OpcUa/LmxOpcUaServer.cs`	Modify	Expose redundancy nodes, accept ServiceLevel updates
`src/.../OpcUa/OpcUaServerHost.cs`	Modify	Pass redundancy config through
`src/.../OpcUaService.cs`	Modify	Bind redundancy config, wire ServiceLevel updates
`src/.../OpcUaServiceBuilder.cs`	Modify	Add `WithRedundancy()` builder
`src/.../appsettings.json`	Modify	Add `Redundancy` section
`tools/opcuacli-dotnet/Commands/RedundancyCommand.cs`	New	CLI command to read redundancy info
`tests/.../Redundancy/RedundancyModeResolverTests.cs`	New	Mode resolver unit tests
`tests/.../Redundancy/ServiceLevelCalculatorTests.cs`	New	ServiceLevel computation tests
`tests/.../Redundancy/RedundancyConfigurationTests.cs`	New	Config defaults tests
`tests/.../Configuration/ConfigurationLoadingTests.cs`	Modify	Binding + validation tests
`tests/.../Integration/RedundancyTests.cs`	New	E2E two-server redundancy tests
`tests/.../Helpers/OpcUaServerFixture.cs`	Modify	Accept redundancy config
`docs/Redundancy.md`	New	Dedicated redundancy component doc
`docs/Configuration.md`	Modify	Add Redundancy section
`docs/OpcUaServer.md`	Modify	Add redundancy state section
`docs/CliTool.md`	Modify	Add redundancy command
`docs/ServiceHosting.md`	Modify	Multi-instance notes
`README.md`	Modify	Add Redundancy to component table
`CLAUDE.md`	Modify	Add redundancy architecture note
`service_info.md`	Modify	Add instance2 details

Verification Guardrails

Each step must pass these gates before proceeding to the next:

Gate 1: Build (after each implementation step)

dotnet build ZB.MOM.WW.LmxOpcUa.slnx

Must produce 0 errors. Proceed only when green.

Gate 2: Unit tests (after steps 1–4, 9)

dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests

All existing + new tests must pass. No regressions.

Gate 3: Integration tests (after steps 5–7)

dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~Integration.RedundancyTests"

All redundancy E2E tests must pass.

Gate 4: CLI tool builds (after step 10)

cd tools/opcuacli-dotnet && dotnet build

Must compile without errors.

Gate 5: Manual verification — single instance (after step 8)

# Publish and start with Redundancy.Enabled=false
opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
# Should report: RedundancySupport=None, ServiceLevel=255

Gate 6: Manual verification — redundant pair (after step 11)

# Start both instances
sc start LmxOpcUa
sc start LmxOpcUa2

# Verify instance1 (Primary)
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
# Should report: RedundancySupport=Warm, ServiceLevel=200, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]

# Verify instance2 (Secondary)
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa
# Should report: RedundancySupport=Warm, ServiceLevel=150, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]

# Both instances should serve the same Galaxy address space
opcuacli-dotnet.exe browse -u opc.tcp://localhost:4840/LmxOpcUa -r -d 2
opcuacli-dotnet.exe browse -u opc.tcp://localhost:4841/LmxOpcUa -r -d 2

Gate 7: Full test suite (final)

dotnet test ZB.MOM.WW.LmxOpcUa.slnx

All tests across all projects must pass.

Gate 8: Documentation review

All new/modified doc files render correctly in Markdown
Example JSON snippets match the actual appsettings.json
CLI examples use correct flags and expected output
service_info.md accurately reflects both deployed instances

Risks and Considerations

Backward compatibility: Redundancy.Enabled = false must be the default so existing single-instance deployments are unaffected.
ServiceLevel timing: Updates must not race with OPC UA publish cycles. Use the server's internal lock or ServerInternal APIs.
ServerUriArray immutability: The OPC UA spec expects this to be static during a server session. Changes require a server restart.
MXAccess shared state: Both instances connect to the same MXAccess runtime. If MXAccess has per-client registration limits, verify that two clients can coexist.
Galaxy DB contention: Both instances poll for deploy changes. Ensure change detection doesn't trigger duplicate rebuilds or locking issues.
Port conflicts: The second instance must use different ports for OPC UA (4841) and Dashboard (8084).
Certificate identity: Each instance needs its own application certificate with a distinct SubjectName matching its ServerName.

Execution Order

Steps 1–4: Config model, resolver, calculator, validator (unit-testable in isolation)
Gate 1 + Gate 2: Build + unit tests pass
Steps 5–7: Server integration (redundancy nodes, ServiceLevel wiring)
Gate 1 + Gate 2 + Gate 3: Build + all tests including E2E
Step 8: Update appsettings.json
Gate 5: Manual single-instance verification
Step 9: Update service builder for tests
Step 10: CLI redundancy command
Gate 4: CLI builds
Step 11: Deploy second instance + update service_info.md
Gate 6: Manual two-instance verification
Documentation updates (all doc files)
Gate 7 + Gate 8: Full test suite + documentation review
Commit and push
