lmxopcua/redundancy.md
Joseph Doherty a3c2d9b243 Add OPC UA server redundancy implementation plan
Covers non-transparent warm/hot redundancy with configurable roles,
dynamic ServiceLevel, CLI support, second service instance deployment,
and verification guardrails across unit, integration, and manual tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 12:52:15 -04:00

OPC UA Server Redundancy Plan

Summary

Add configurable non-transparent warm/hot redundancy to the LmxOpcUa server so that two instances sharing the same Galaxy repository can operate as a redundant pair. Each instance advertises itself and its partner through the OPC UA ServerRedundancy node, publishes a dynamic ServiceLevel reflecting runtime health, and allows clients to discover the redundant set and fail over between instances. The CLI tool gains a redundancy command for inspecting the redundant server set.

This plan covers server-side redundancy exposure, client-side discovery, a second deployed service instance, documentation, and tests. It does not implement automatic server-side failover or subscription transfer — those are client responsibilities per the OPC UA specification.


Background: OPC UA Redundancy Model

OPC UA defines redundancy through three address-space nodes on the Server object: RedundancySupport and ServerUriArray under Server/ServerRedundancy, and ServiceLevel directly under Server:

| Node | Type | Purpose |
|---|---|---|
| RedundancySupport | RedundancySupport enum | Declares the redundancy mode: None, Cold, Warm, Hot, Transparent, HotAndMirrored |
| ServerUriArray | String[] | Lists the ApplicationUri values of all servers in the redundant set (non-transparent modes) |
| ServiceLevel | Byte (0–255) | Indicates current operational quality; clients prefer the server with the highest value |

Non-Transparent Redundancy (our target)

In non-transparent redundancy (Warm or Hot), both servers run independently with their own sessions and subscriptions. Clients discover the redundant set by reading ServerUriArray, monitor ServiceLevel on each server, and manage their own failover. This model fits our architecture where each instance connects to the same Galaxy repository and MXAccess runtime independently.

ServiceLevel Semantics

| Range | Meaning |
|---|---|
| 0 | Server is not operational |
| 1–99 | Degraded (e.g., MXAccess disconnected, DB unreachable) |
| 100–199 | Healthy secondary |
| 200–255 | Healthy primary (preferred) |

The primary server should advertise a higher ServiceLevel than the secondary so clients prefer it when both are healthy.
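As a rough illustration of the client side of this rule, the sketch below picks the endpoint with the highest non-zero ServiceLevel. It assumes the client has already read ServiceLevel from each server; `PreferredServerSelector` and its tie-break are hypothetical and not part of LmxOpcUa or the OPC UA SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical client-side helper; names are illustrative only.
public static class PreferredServerSelector
{
    // Returns the endpoint URL with the highest ServiceLevel, or null when the
    // map is empty or every server reports 0 (not operational).
    public static string Select(IReadOnlyDictionary<string, byte> serviceLevelByEndpoint)
    {
        return serviceLevelByEndpoint
            .Where(kv => kv.Value > 0)
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal) // deterministic tie-break (assumption)
            .Select(kv => kv.Key)
            .FirstOrDefault();
    }
}
```

With a healthy pair reporting 200 and 150, the helper returns the endpoint reporting 200. If that server later reports 0, the endpoint reporting 150 is chosen instead.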


Current State

  • LmxOpcUaServer extends StandardServer but does not override any redundancy-related methods
  • ServerRedundancy/RedundancySupport defaults to None (SDK default)
  • ServiceLevel defaults to 255 (SDK default — "fully operational")
  • No configuration for redundant partner URIs or role designation
  • Single deployed instance at C:\publish\lmxopcua\instance1 on port 4840
  • No CLI support for reading redundancy information

Scope

In Scope (Phase 1)

  1. Redundancy configuration model — role, partner URIs, ServiceLevel weights
  2. Server redundancy node exposure: RedundancySupport, ServerUriArray, dynamic ServiceLevel
  3. ServiceLevel computation — based on runtime health (MXAccess state, DB connectivity, role)
  4. CLI redundancy command — read RedundancySupport, ServerUriArray, ServiceLevel from a server
  5. Second service instance — deployed at C:\publish\lmxopcua\instance2 with non-overlapping ports
  6. Documentation — new docs/Redundancy.md component doc, updates to existing docs
  7. Unit tests — config, ServiceLevel computation, resolver tests
  8. Integration tests — two-server redundancy E2E test in the integration test project

Deferred

  • Automatic subscription transfer (client-side responsibility)
  • Server-initiated failover (Galaxy redundancy table / engine flags)
  • Transparent redundancy mode
  • Health-check HTTP endpoint for load balancers

Configuration Design

New Redundancy section in appsettings.json

{
  "Redundancy": {
    "Enabled": false,
    "Mode": "Warm",
    "Role": "Primary",
    "ServerUris": [],
    "ServiceLevelBase": 200
  }
}

Configuration model

File: src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/RedundancyConfiguration.cs (new)

public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}

Configuration rules

  • Enabled defaults to false for backward compatibility. When false, RedundancySupport = None and ServiceLevel = 255 (SDK defaults).
  • Mode must be Warm or Hot (Phase 1). Maps to RedundancySupport.Warm or RedundancySupport.Hot.
  • Role must be Primary or Secondary. Controls the base ServiceLevel (Primary gets ServiceLevelBase, Secondary gets ServiceLevelBase - 50).
  • ServerUris lists the ApplicationUri values for all servers in the redundant set, including the local server. The OPC UA spec requires this to contain the full set. These are application URIs like urn:ZB:LmxOpcUa, not endpoint URLs.
  • ServiceLevelBase is the starting ServiceLevel when the server is fully healthy. Degraded conditions subtract from this value.

App root updates

File: src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/AppConfiguration.cs

  • Add public RedundancyConfiguration Redundancy { get; set; } = new RedundancyConfiguration();

Implementation Steps

Step 1: Add RedundancyConfiguration model and bind it

Files:

  • src/.../Configuration/RedundancyConfiguration.cs (new)
  • src/.../Configuration/AppConfiguration.cs
  • src/.../OpcUaService.cs

Changes:

  1. Create RedundancyConfiguration class with properties above
  2. Add Redundancy property to AppConfiguration
  3. Bind configuration.GetSection("Redundancy").Bind(_config.Redundancy);
  4. Pass _config.Redundancy through to OpcUaServerHost and LmxOpcUaServer

Step 2: Add RedundancyModeResolver

File: src/.../OpcUa/RedundancyModeResolver.cs (new)

Responsibilities:

  • Map Mode string to RedundancySupport enum value
  • Validate against supported Phase 1 modes (Warm, Hot)
  • Fall back to None with warning for unknown modes

public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled);
}
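
One possible body for that signature is sketched below, under a few assumptions. The local `RedundancySupport` enum is a stand-in that mirrors the SDK's enum so the snippet compiles on its own; in the server project it would be replaced by `using Opc.Ua;`. The `Console.Error` call is a placeholder for whatever logger the project uses.

```csharp
using System;

// Stand-in for Opc.Ua.RedundancySupport so this sketch compiles without the SDK.
// In the server project, delete this enum and add `using Opc.Ua;` instead.
public enum RedundancySupport { None, Cold, Warm, Hot, Transparent, HotAndMirrored }

public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled)
    {
        // Disabled redundancy always reports None, whatever Mode says.
        if (!enabled) return RedundancySupport.None;

        // Case-insensitive match against the Phase 1 modes.
        switch ((mode ?? string.Empty).Trim().ToUpperInvariant())
        {
            case "WARM": return RedundancySupport.Warm;
            case "HOT": return RedundancySupport.Hot;
            default:
                // Placeholder for the project's logger (assumption).
                Console.Error.WriteLine(
                    $"Unsupported redundancy mode '{mode}'; falling back to None.");
                return RedundancySupport.None;
        }
    }
}
```

Modes that OPC UA defines but Phase 1 does not support (for example Transparent) take the same fallback path as unknown strings.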

Step 3: Add ServiceLevelCalculator

File: src/.../OpcUa/ServiceLevelCalculator.cs (new)

Computes the dynamic ServiceLevel byte from runtime health:

public class ServiceLevelCalculator
{
    public byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary);
}

Logic:

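  • Start with baseLine (ServiceLevelBase from config, e.g., 200)
  • Subtract 50 if isPrimary is false (a healthy Secondary with a base of 200 reports 150)
  • Subtract 100 if MXAccess is disconnected
  • Subtract 50 if Galaxy DB is unreachable
  • Clamp to 0–255
  • Return 0 if both MXAccess and DB are down, regardless of the arithmetic above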
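
The rules above can be sketched as follows. This is a minimal version that assumes the role offset is applied inside the calculator, as the isPrimary parameter and the unit-test table in the Test Plan suggest. The constant names are illustrative.

```csharp
using System;

public class ServiceLevelCalculator
{
    private const int SecondaryOffset = 50;   // Secondary reports base - 50
    private const int MxAccessPenalty = 100;  // MXAccess disconnected
    private const int DbPenalty = 50;         // Galaxy DB unreachable

    public byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary)
    {
        // With both dependencies down the server is treated as not operational.
        if (!mxAccessConnected && !dbConnected) return 0;

        int level = baseLine;
        if (!isPrimary) level -= SecondaryOffset;
        if (!mxAccessConnected) level -= MxAccessPenalty;
        if (!dbConnected) level -= DbPenalty;

        // Clamp into the Byte range before narrowing.
        return (byte)Math.Max(0, Math.Min(255, level));
    }
}
```

With a base of 200, this yields 200 for a healthy primary, 150 for a healthy secondary, 100 for a primary without MXAccess, and 150 for a primary without the Galaxy DB. That last value equals a healthy secondary, which may be worth checking against the ServiceLevel range table before the penalties are finalized.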

Step 4: Extend ConfigurationValidator for redundancy

File: src/.../Configuration/ConfigurationValidator.cs

Add validation/logging for:

  • Redundancy.Enabled, Mode, Role
  • ServerUris should not be empty when Enabled = true
  • ServiceLevelBase should be 1–255
  • Warning when Enabled = true but ServerUris has fewer than 2 entries
  • Log effective redundancy configuration at startup
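
A minimal sketch of these checks is shown below. It assumes that an out-of-range ServiceLevelBase is an error while an empty or short ServerUris list is only a warning, which matches the validator tests listed in the Test Plan. Treating an unknown Role as an error is an assumption of this sketch. `RedundancyValidation` is a hypothetical helper name; the real checks would live in ConfigurationValidator and log through the project's logger.

```csharp
using System;
using System.Collections.Generic;

// Copy of the plan's configuration model so the sketch is self-contained.
public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}

// Hypothetical helper; messages are collected instead of logged.
public static class RedundancyValidation
{
    public static bool Validate(RedundancyConfiguration config, List<string> warnings, List<string> errors)
    {
        if (!config.Enabled) return true; // nothing to validate when redundancy is off

        if (!IsOneOf(config.Mode, "Warm", "Hot"))
            warnings.Add($"Mode '{config.Mode}' is not supported in Phase 1; RedundancySupport falls back to None.");

        // Assumption: the plan defines no fallback for Role, so reject unknown values.
        if (!IsOneOf(config.Role, "Primary", "Secondary"))
            errors.Add($"Role '{config.Role}' must be Primary or Secondary.");

        if (config.ServiceLevelBase < 1 || config.ServiceLevelBase > 255)
            errors.Add($"ServiceLevelBase {config.ServiceLevelBase} must be between 1 and 255.");

        int uriCount = config.ServerUris == null ? 0 : config.ServerUris.Count;
        if (uriCount == 0)
            warnings.Add("ServerUris is empty although redundancy is enabled.");
        else if (uriCount < 2)
            warnings.Add("ServerUris has fewer than 2 entries; a redundant set needs at least two servers.");

        return errors.Count == 0;
    }

    private static bool IsOneOf(string value, params string[] allowed)
    {
        foreach (var candidate in allowed)
            if (string.Equals(value, candidate, StringComparison.OrdinalIgnoreCase)) return true;
        return false;
    }
}
```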

Step 5: Update LmxOpcUaServer to expose redundancy state

File: src/.../OpcUa/LmxOpcUaServer.cs

Changes:

  1. Accept RedundancyConfiguration in the constructor
  2. Override OnServerStarted to write redundancy nodes:
    • Set Server/ServerRedundancy/RedundancySupport to the resolved mode
    • Set Server/ServerRedundancy/ServerUriArray to the configured URIs
  3. Override SetServerState or use a timer to update Server/ServiceLevel periodically based on ServiceLevelCalculator
  4. Expose a method UpdateServiceLevel(bool mxConnected, bool dbConnected) that the service layer can call when health state changes

Step 6: Update OpcUaServerHost to pass redundancy config

File: src/.../OpcUa/OpcUaServerHost.cs

Changes:

  1. Accept RedundancyConfiguration in the constructor
  2. Pass it through to LmxOpcUaServer
  3. Log active redundancy mode at startup

Step 7: Wire ServiceLevel updates in OpcUaService

File: src/.../OpcUaService.cs

Changes:

  1. Bind redundancy config section
  2. Pass redundancy config to OpcUaServerHost
  3. Subscribe to MxAccessClient.ConnectionStateChanged to trigger ServiceLevel updates
  4. After Galaxy DB health checks, trigger ServiceLevel updates
  5. Use a periodic timer (e.g., every 5 seconds) to refresh ServiceLevel based on current component health
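
One way to combine the event-driven and timer-driven triggers is sketched below. It serializes updates behind a lock and only writes when the computed value changes. `ServiceLevelPublisher` and both delegates are hypothetical: `calculate` would wrap ServiceLevelCalculator, and `write` would be whatever call sets Server/ServiceLevel on the running server (for example the planned UpdateServiceLevel path).

```csharp
using System;

// Hypothetical helper; not an SDK type and not part of the current code base.
public sealed class ServiceLevelPublisher
{
    private readonly object _gate = new object();
    private readonly Func<bool, bool, byte> _calculate; // (mxConnected, dbConnected) -> level
    private readonly Action<byte> _write;               // writes Server/ServiceLevel (assumption)
    private bool _mxConnected;
    private bool _dbConnected;
    private byte? _lastWritten;

    public ServiceLevelPublisher(Func<bool, bool, byte> calculate, Action<byte> write)
    {
        _calculate = calculate ?? throw new ArgumentNullException(nameof(calculate));
        _write = write ?? throw new ArgumentNullException(nameof(write));
    }

    // Call from the MXAccess connection-state handler.
    public void OnMxAccessStateChanged(bool connected)
    {
        lock (_gate) { _mxConnected = connected; PublishLocked(); }
    }

    // Call after each Galaxy DB health check.
    public void OnDbHealthChecked(bool reachable)
    {
        lock (_gate) { _dbConnected = reachable; PublishLocked(); }
    }

    // Call from the periodic timer as a safety net.
    public void Refresh()
    {
        lock (_gate) { PublishLocked(); }
    }

    private void PublishLocked()
    {
        byte level = _calculate(_mxConnected, _dbConnected);
        if (_lastWritten == level) return; // skip redundant writes
        _write(level);
        _lastWritten = level;
    }
}
```

A 5-second refresh could then be driven by `new System.Threading.Timer(_ => publisher.Refresh(), null, TimeSpan.Zero, TimeSpan.FromSeconds(5))`, with the timer disposed when the service stops.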

Step 8: Update appsettings.json

File: src/.../appsettings.json

Add the Redundancy section with backward-compatible defaults (Enabled: false).

Step 9: Update OpcUaServiceBuilder for test injection

File: src/.../OpcUaServiceBuilder.cs

Add WithRedundancy(RedundancyConfiguration) builder method so tests can inject redundancy configuration.

Step 10: Add CLI redundancy command

Files:

  • tools/opcuacli-dotnet/Commands/RedundancyCommand.cs (new)

Command: redundancy

Reads from the target server:

  • Server/ServerRedundancy/RedundancySupport (i=3709)
  • Server/ServiceLevel (i=2267)
  • Server/ServerRedundancy/ServerUriArray (i=11314, if non-transparent redundancy)

Output format:

Redundancy Mode:  Warm
Service Level:    200
Server URIs:
  - urn:ZB:LmxOpcUa
  - urn:ZB:LmxOpcUa2

Options: --url, --username, --password, --security (same shared options as other commands).
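
The output format shown above could be produced by a small formatter such as the sketch below, kept separate from the OPC UA read calls so it can be unit-tested without a server. `RedundancyReport` is a hypothetical name, and omitting the Server URIs block when the array is empty is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical formatter for the CLI redundancy command.
public static class RedundancyReport
{
    public static string Format(string mode, byte serviceLevel, IReadOnlyList<string> serverUris)
    {
        var lines = new List<string>
        {
            $"Redundancy Mode:  {mode}",
            $"Service Level:    {serviceLevel}",
        };

        // ServerUriArray is only meaningful for non-transparent modes,
        // so the block is skipped when nothing was read.
        if (serverUris != null && serverUris.Count > 0)
        {
            lines.Add("Server URIs:");
            foreach (var uri in serverUris)
                lines.Add($"  - {uri}");
        }

        return string.Join(Environment.NewLine, lines);
    }
}
```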

Step 11: Deploy second service instance

Deployment target: C:\publish\lmxopcua\instance2

Configuration differences from instance1:

| Setting | instance1 | instance2 |
|---|---|---|
| OpcUa.Port | 4840 | 4841 |
| OpcUa.ServerName | LmxOpcUa | LmxOpcUa2 |
| Dashboard.Port | 8083 | 8084 |
| Redundancy.Enabled | true | true |
| Redundancy.Role | Primary | Secondary |
| Redundancy.Mode | Warm | Warm |
| Redundancy.ServerUris | ["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"] | ["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"] |
| Redundancy.ServiceLevelBase | 200 | 200 |

Windows service for instance2:

  • Name: LmxOpcUa2
  • Display name: LMX OPC UA Server (Instance 2)
  • Executable: C:\publish\lmxopcua\instance2\ZB.MOM.WW.LmxOpcUa.Host.exe

Both instances share the same Galaxy DB (ZB) and MXAccess runtime. The GalaxyName remains ZB for both so they expose the same namespace.

Update service_info.md with the second instance details.


Test Plan

Unit tests — RedundancyModeResolver

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyModeResolverTests.cs

| Test | Description |
|---|---|
| Resolve_Disabled_ReturnsNone | Enabled=false always returns RedundancySupport.None |
| Resolve_Warm_ReturnsWarm | Mode="Warm" maps to RedundancySupport.Warm |
| Resolve_Hot_ReturnsHot | Mode="Hot" maps to RedundancySupport.Hot |
| Resolve_Unknown_FallsBackToNone | Unknown mode falls back safely |
| Resolve_CaseInsensitive | "warm" and "WARM" both resolve |

Unit tests — ServiceLevelCalculator

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/ServiceLevelCalculatorTests.cs

| Test | Description |
|---|---|
| FullyHealthy_Primary_ReturnsBase | All healthy, primary role → ServiceLevelBase |
| FullyHealthy_Secondary_ReturnsBaseMinusFifty | All healthy, secondary role → ServiceLevelBase - 50 |
| MxAccessDown_ReducesServiceLevel | MXAccess disconnected subtracts 100 |
| DbDown_ReducesServiceLevel | DB unreachable subtracts 50 |
| BothDown_ReturnsZero | MXAccess + DB both down → 0 |
| ClampedTo255 | Base of 255 with healthy → 255 |
| ClampedToZero | Heavy penalties don't go negative |

Unit tests — RedundancyConfiguration defaults

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyConfigurationTests.cs

| Test | Description |
|---|---|
| DefaultConfig_Disabled | Enabled defaults to false |
| DefaultConfig_ModeWarm | Mode defaults to "Warm" |
| DefaultConfig_RolePrimary | Role defaults to "Primary" |
| DefaultConfig_EmptyServerUris | ServerUris defaults to empty |
| DefaultConfig_ServiceLevelBase200 | ServiceLevelBase defaults to 200 |

Updates to existing configuration tests

File: tests/ZB.MOM.WW.LmxOpcUa.Tests/Configuration/ConfigurationLoadingTests.cs

Add:

  • Redundancy_Section_BindsCorrectly — verify binding from appsettings.json
  • Redundancy_Section_BindsCustomValues — in-memory override test
  • Validator_RedundancyEnabled_EmptyServerUris_ReturnsTrue_WithWarning — validates but warns
  • Validator_RedundancyEnabled_InvalidServiceLevelBase_ReturnsFalse — rejects 0 or >255

Integration tests — redundancy E2E

New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Integration/RedundancyTests.cs

These tests start two in-process OPC UA servers with redundancy enabled and verify client-visible behavior:

| Test | Description |
|---|---|
| Server_WithRedundancyDisabled_ReportsNone | Default config → RedundancySupport.None, ServiceLevel=255 |
| Server_WithRedundancyEnabled_ReportsConfiguredMode | Enabled=true, Mode=Warm → RedundancySupport.Warm |
| Server_WithRedundancyEnabled_ExposesServerUriArray | Client can read ServerUriArray and it matches config |
| Server_Primary_HasHigherServiceLevel_ThanSecondary | Primary server reports higher ServiceLevel than secondary |
| TwoServers_BothExposeSameRedundantSet | Two server fixtures, both report the same ServerUriArray |
| Server_ServiceLevel_DropsWith_MxAccessDisconnect | Simulate MXAccess disconnect → ServiceLevel decreases |

Pattern: Use OpcUaServerFixture.WithFakeMxAccessClient() with redundancy config injected, connect with OpcUaTestClient, read the standard OPC UA redundancy nodes.


Documentation Plan

New file: docs/Redundancy.md

Contents:

  1. Overview of OPC UA non-transparent redundancy
  2. Redundancy configuration section reference (Enabled, Mode, Role, ServerUris, ServiceLevelBase)
  3. ServiceLevel computation logic and degraded-state penalties
  4. How clients discover and fail over between instances
  5. Deployment guide for a two-instance redundant pair (ports, service names, shared Galaxy DB)
  6. CLI redundancy command usage
  7. Troubleshooting: mismatched ServerUris, ServiceLevel stuck at 0, etc.

Updates to existing docs

| File | Changes |
|---|---|
| docs/Configuration.md | Add Redundancy section table, example JSON, add to validation rules list, update example appsettings.json |
| docs/OpcUaServer.md | Add redundancy state exposure section, link to Redundancy.md |
| docs/CliTool.md | Add redundancy command documentation |
| docs/ServiceHosting.md | Add multi-instance deployment notes |
| README.md | Add Redundancy to the component documentation table, mention redundancy in Quick Start |
| CLAUDE.md | Add redundancy architecture note |

Update: service_info.md

Add a second section documenting instance2:

  • Path: C:\publish\lmxopcua\instance2
  • Windows service name: LmxOpcUa2
  • Port: 4841
  • Dashboard port: 8084
  • Redundancy role: Secondary
  • Endpoint: opc.tcp://localhost:4841/LmxOpcUa

File Change Summary

| File | Action | Description |
|---|---|---|
| src/.../Configuration/RedundancyConfiguration.cs | New | Redundancy config model |
| src/.../Configuration/AppConfiguration.cs | Modify | Add Redundancy section |
| src/.../Configuration/ConfigurationValidator.cs | Modify | Validate/log redundancy settings |
| src/.../OpcUa/RedundancyModeResolver.cs | New | Mode string → RedundancySupport enum |
| src/.../OpcUa/ServiceLevelCalculator.cs | New | Dynamic ServiceLevel from health state |
| src/.../OpcUa/LmxOpcUaServer.cs | Modify | Expose redundancy nodes, accept ServiceLevel updates |
| src/.../OpcUa/OpcUaServerHost.cs | Modify | Pass redundancy config through |
| src/.../OpcUaService.cs | Modify | Bind redundancy config, wire ServiceLevel updates |
| src/.../OpcUaServiceBuilder.cs | Modify | Add WithRedundancy() builder |
| src/.../appsettings.json | Modify | Add Redundancy section |
| tools/opcuacli-dotnet/Commands/RedundancyCommand.cs | New | CLI command to read redundancy info |
| tests/.../Redundancy/RedundancyModeResolverTests.cs | New | Mode resolver unit tests |
| tests/.../Redundancy/ServiceLevelCalculatorTests.cs | New | ServiceLevel computation tests |
| tests/.../Redundancy/RedundancyConfigurationTests.cs | New | Config defaults tests |
| tests/.../Configuration/ConfigurationLoadingTests.cs | Modify | Binding + validation tests |
| tests/.../Integration/RedundancyTests.cs | New | E2E two-server redundancy tests |
| tests/.../Helpers/OpcUaServerFixture.cs | Modify | Accept redundancy config |
| docs/Redundancy.md | New | Dedicated redundancy component doc |
| docs/Configuration.md | Modify | Add Redundancy section |
| docs/OpcUaServer.md | Modify | Add redundancy state section |
| docs/CliTool.md | Modify | Add redundancy command |
| docs/ServiceHosting.md | Modify | Multi-instance notes |
| README.md | Modify | Add Redundancy to component table |
| CLAUDE.md | Modify | Add redundancy architecture note |
| service_info.md | Modify | Add instance2 details |

Verification Guardrails

Each step must pass these gates before proceeding to the next:

Gate 1: Build (after each implementation step)

dotnet build ZB.MOM.WW.LmxOpcUa.slnx

Must produce 0 errors. Proceed only when green.

Gate 2: Unit tests (after steps 1–4, 9)

dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests

All existing + new tests must pass. No regressions.

Gate 3: Integration tests (after steps 5–7)

dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~Integration.RedundancyTests"

All redundancy E2E tests must pass.

Gate 4: CLI tool builds (after step 10)

cd tools/opcuacli-dotnet && dotnet build

Must compile without errors.

Gate 5: Manual verification — single instance (after step 8)

# Publish and start with Redundancy.Enabled=false
opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
# Should report: RedundancySupport=None, ServiceLevel=255

Gate 6: Manual verification — redundant pair (after step 11)

# Start both instances
sc start LmxOpcUa
sc start LmxOpcUa2

# Verify instance1 (Primary)
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
# Should report: RedundancySupport=Warm, ServiceLevel=200, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]

# Verify instance2 (Secondary)
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa
# Should report: RedundancySupport=Warm, ServiceLevel=150, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]

# Both instances should serve the same Galaxy address space
opcuacli-dotnet.exe browse -u opc.tcp://localhost:4840/LmxOpcUa -r -d 2
opcuacli-dotnet.exe browse -u opc.tcp://localhost:4841/LmxOpcUa -r -d 2

Gate 7: Full test suite (final)

dotnet test ZB.MOM.WW.LmxOpcUa.slnx

All tests across all projects must pass.

Gate 8: Documentation review

  • All new/modified doc files render correctly in Markdown
  • Example JSON snippets match the actual appsettings.json
  • CLI examples use correct flags and expected output
  • service_info.md accurately reflects both deployed instances

Risks and Considerations

  1. Backward compatibility: Redundancy.Enabled = false must be the default so existing single-instance deployments are unaffected.
  2. ServiceLevel timing: Updates must not race with OPC UA publish cycles. Use the server's internal lock or ServerInternal APIs.
  3. ServerUriArray immutability: The OPC UA spec expects this to be static during a server session. Changes require a server restart.
  4. MXAccess shared state: Both instances connect to the same MXAccess runtime. If MXAccess has per-client registration limits, verify that two clients can coexist.
  5. Galaxy DB contention: Both instances poll for deploy changes. Ensure change detection doesn't trigger duplicate rebuilds or locking issues.
  6. Port conflicts: The second instance must use different ports for OPC UA (4841) and Dashboard (8084).
  7. Certificate identity: Each instance needs its own application certificate with a distinct SubjectName matching its ServerName.

Execution Order

  1. Steps 1–4: Config model, resolver, calculator, validator (unit-testable in isolation)
  2. Gate 1 + Gate 2: Build + unit tests pass
  3. Steps 5–7: Server integration (redundancy nodes, ServiceLevel wiring)
  4. Gate 1 + Gate 2 + Gate 3: Build + all tests including E2E
  5. Step 8: Update appsettings.json
  6. Gate 5: Manual single-instance verification
  7. Step 9: Update service builder for tests
  8. Step 10: CLI redundancy command
  9. Gate 4: CLI builds
  10. Step 11: Deploy second instance + update service_info.md
  11. Gate 6: Manual two-instance verification
  12. Documentation updates (all doc files)
  13. Gate 7 + Gate 8: Full test suite + documentation review
  14. Commit and push