Covers non-transparent warm/hot redundancy with configurable roles, dynamic ServiceLevel, CLI support, second service instance deployment, and verification guardrails across unit, integration, and manual tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OPC UA Server Redundancy Plan
Summary
Add configurable non-transparent warm/hot redundancy to the LmxOpcUa server so that two instances sharing the same Galaxy repository can operate as a redundant pair. Each instance advertises itself and its partner through the OPC UA ServerRedundancy node, publishes a dynamic ServiceLevel reflecting runtime health, and allows clients to discover the redundant set and fail over between instances. The CLI tool gains a redundancy command for inspecting the redundant server set.
This plan covers server-side redundancy exposure, client-side discovery, a second deployed service instance, documentation, and tests. It does not implement automatic server-side failover or subscription transfer — those are client responsibilities per the OPC UA specification.
Background: OPC UA Redundancy Model
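OPC UA defines redundancy through three address-space nodes under the Server object. RedundancySupport and ServerUriArray live under Server/ServerRedundancy; ServiceLevel sits directly under Server: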
| Node | Type | Purpose |
|---|---|---|
| RedundancySupport | RedundancySupport enum | Declares the redundancy mode: None, Cold, Warm, Hot, Transparent, HotAndMirrored |
| ServerUriArray | String[] | Lists the ApplicationUri values of all servers in the redundant set (non-transparent modes) |
| ServiceLevel | Byte (0–255) | Indicates current operational quality; clients prefer the server with the highest value |
Non-Transparent Redundancy (our target)
In non-transparent redundancy (Warm or Hot), both servers run independently with their own sessions and subscriptions. Clients discover the redundant set by reading ServerUriArray, monitor ServiceLevel on each server, and manage their own failover. This model fits our architecture where each instance connects to the same Galaxy repository and MXAccess runtime independently.
ServiceLevel Semantics
| Range | Meaning |
|---|---|
| 0 | Server is not operational |
| 1–99 | Degraded (e.g., MXAccess disconnected, DB unreachable) |
| 100–199 | Healthy secondary |
| 200–255 | Healthy primary (preferred) |
The primary server should advertise a higher ServiceLevel than the secondary so clients prefer it when both are healthy.
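The client-side preference rule can be sketched as follows. This is a minimal illustration, not part of the planned server work: `PreferredServerSelector` and its `Pick` method are hypothetical names, and the map of ApplicationUri to ServiceLevel is assumed to have been filled by reading ServiceLevel from each server in ServerUriArray.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical client-side helper; not part of the plan or the OPC UA SDK.
public static class PreferredServerSelector
{
    // Returns the URI with the highest ServiceLevel, or null when the map is
    // empty or every candidate reports 0 (not operational).
    public static string Pick(IDictionary<string, byte> serviceLevelsByUri)
    {
        string best = null;
        byte bestLevel = 0;
        foreach (KeyValuePair<string, byte> entry in serviceLevelsByUri)
        {
            // Strictly greater: a level of 0 is never selected, and the
            // first candidate enumerated wins a tie.
            if (entry.Value > bestLevel)
            {
                best = entry.Key;
                bestLevel = entry.Value;
            }
        }
        return best;
    }
}
```

With the levels from the table above, a healthy primary (200) is chosen over a healthy secondary (150), and a degraded primary (for example 50) loses to the secondary.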
Current State
- LmxOpcUaServer extends StandardServer but does not override any redundancy-related methods
- ServerRedundancy/RedundancySupport defaults to None (SDK default)
- ServiceLevel defaults to 255 (SDK default, "fully operational")
- No configuration for redundant partner URIs or role designation
- Single deployed instance at C:\publish\lmxopcua\instance1 on port 4840
- No CLI support for reading redundancy information
Scope
In Scope (Phase 1)
- Redundancy configuration model: role, partner URIs, ServiceLevel weights
- Server redundancy node exposure: RedundancySupport, ServerUriArray, dynamic ServiceLevel
- ServiceLevel computation: based on runtime health (MXAccess state, DB connectivity, role)
- CLI redundancy command: read RedundancySupport, ServerUriArray, ServiceLevel from a server
- Second service instance: deployed at C:\publish\lmxopcua\instance2 with non-overlapping ports
- Documentation: new docs/Redundancy.md component doc, updates to existing docs
- Unit tests: config, ServiceLevel computation, resolver tests
- Integration tests: two-server redundancy E2E test in the integration test project
Deferred
- Automatic subscription transfer (client-side responsibility)
- Server-initiated failover (Galaxy redundancy table / engine flags)
- Transparent redundancy mode
- Health-check HTTP endpoint for load balancers
Configuration Design
New Redundancy section in appsettings.json
{
  "Redundancy": {
    "Enabled": false,
    "Mode": "Warm",
    "Role": "Primary",
    "ServerUris": [],
    "ServiceLevelBase": 200
  }
}
Configuration model
File: src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/RedundancyConfiguration.cs (new)
public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}
Configuration rules
- Enabled defaults to false for backward compatibility. When false, RedundancySupport = None and ServiceLevel = 255 (SDK defaults).
- Mode must be Warm or Hot (Phase 1). Maps to RedundancySupport.Warm or RedundancySupport.Hot.
- Role must be Primary or Secondary. Controls the base ServiceLevel (Primary gets ServiceLevelBase, Secondary gets ServiceLevelBase - 50).
- ServerUris lists the ApplicationUri values for all servers in the redundant set, including the local server. The OPC UA spec requires this to contain the full set. These are application URIs like urn:ZB:LmxOpcUa, not endpoint URLs.
- ServiceLevelBase is the starting ServiceLevel when the server is fully healthy. Degraded conditions subtract from this value.
App root updates
File: src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/AppConfiguration.cs
- Add public RedundancyConfiguration Redundancy { get; set; } = new RedundancyConfiguration();
Implementation Steps
Step 1: Add RedundancyConfiguration model and bind it
Files:
- src/.../Configuration/RedundancyConfiguration.cs (new)
- src/.../Configuration/AppConfiguration.cs
- src/.../OpcUaService.cs
Changes:
- Create RedundancyConfiguration class with properties above
- Add Redundancy property to AppConfiguration
- Bind configuration.GetSection("Redundancy").Bind(_config.Redundancy);
- Pass _config.Redundancy through to OpcUaServerHost and LmxOpcUaServer
Step 2: Add RedundancyModeResolver
File: src/.../OpcUa/RedundancyModeResolver.cs (new)
Responsibilities:
- Map Mode string to RedundancySupport enum value
- Validate against supported Phase 1 modes (Warm, Hot)
- Fall back to None with warning for unknown modes
public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled);
}
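One possible body for the resolver is sketched below. To keep the example compilable on its own, it declares a local stand-in for the SDK's RedundancySupport enum (same member names and numeric values as the OPC UA specification); the real class would return the SDK type. The Console.Error call is a placeholder for whatever logger the host uses, and trimming whitespace is an added assumption.

```csharp
using System;

// Stand-in for the SDK enum so this sketch compiles without the OPC UA
// assemblies. The real resolver would return Opc.Ua.RedundancySupport.
public enum RedundancySupport
{
    None = 0, Cold = 1, Warm = 2, Hot = 3, Transparent = 4, HotAndMirrored = 5
}

public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled)
    {
        // Disabled always wins, regardless of the configured mode string.
        if (!enabled) return RedundancySupport.None;

        string normalized = (mode ?? string.Empty).Trim();
        if (string.Equals(normalized, "Warm", StringComparison.OrdinalIgnoreCase))
            return RedundancySupport.Warm;
        if (string.Equals(normalized, "Hot", StringComparison.OrdinalIgnoreCase))
            return RedundancySupport.Hot;

        // Anything else is outside the Phase 1 set: warn and fall back.
        Console.Error.WriteLine(
            "Unsupported redundancy mode '" + mode + "'; falling back to None.");
        return RedundancySupport.None;
    }
}
```

Under this sketch, Transparent and HotAndMirrored are treated like any other unsupported string and resolve to None.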
Step 3: Add ServiceLevelCalculator
File: src/.../OpcUa/ServiceLevelCalculator.cs (new)
Computes the dynamic ServiceLevel byte from runtime health:
public class ServiceLevelCalculator
{
    public byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary);
}
Logic:
- Start with baseLine (from config, e.g., 200 for Primary, 150 for Secondary)
- Subtract 100 if MXAccess is disconnected
- Subtract 50 if Galaxy DB is unreachable
- Clamp to 0–255
- Return 0 if both MXAccess and DB are down
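The logic above could be implemented roughly as follows. The sketch assumes that baseLine is the configured ServiceLevelBase and that the 50-point secondary offset is applied inside the calculator when isPrimary is false, which is what the planned unit tests suggest. If the caller already passes a role-adjusted baseline, the offset line should be dropped.

```csharp
using System;

public class ServiceLevelCalculator
{
    // Penalty values are taken from the plan. Applying the secondary
    // offset here rather than in the caller is an assumption.
    private const int SecondaryOffset = 50;
    private const int MxAccessPenalty = 100;
    private const int DbPenalty = 50;

    public byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary)
    {
        // Both dependencies down: report "not operational" outright.
        if (!mxAccessConnected && !dbConnected) return 0;

        int level = baseLine;
        if (!isPrimary) level -= SecondaryOffset;
        if (!mxAccessConnected) level -= MxAccessPenalty;
        if (!dbConnected) level -= DbPenalty;

        // Clamp into the Byte range before narrowing.
        return (byte)Math.Max(0, Math.Min(255, level));
    }
}
```

With a base of 200, a primary drops to 100 when MXAccess disconnects and to 150 when only the Galaxy DB is unreachable. Under these numbers a healthy secondary (150) outranks a primary that has lost MXAccess.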
Step 4: Extend ConfigurationValidator for redundancy
File: src/.../Configuration/ConfigurationValidator.cs
Add validation/logging for:
- Redundancy.Enabled, Mode, Role
- ServerUris should not be empty when Enabled = true
- ServiceLevelBase should be 1–255
- Warning when Enabled = true but ServerUris has fewer than 2 entries
- Log effective redundancy configuration at startup
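A rough shape for these checks is sketched below. The existing ConfigurationValidator API is not shown in this plan, so `RedundancyValidationResult` and `RedundancyValidation.Validate` are hypothetical names. The severity split is also an assumption: too few ServerUris entries (including none) produces a warning, matching the planned validator tests. An out-of-range ServiceLevelBase or an unknown Role is an error. An unknown Mode is a warning, because the resolver falls back to None.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical result type; the real validator may report differently.
public class RedundancyValidationResult
{
    public readonly List<string> Errors = new List<string>();
    public readonly List<string> Warnings = new List<string>();
    public bool IsValid { get { return Errors.Count == 0; } }
}

public static class RedundancyValidation
{
    public static RedundancyValidationResult Validate(
        bool enabled, string mode, string role, IList<string> serverUris, int serviceLevelBase)
    {
        var result = new RedundancyValidationResult();
        if (!enabled) return result; // nothing to check when redundancy is off

        if (!IsOneOf(mode, "Warm", "Hot"))
            result.Warnings.Add("Redundancy.Mode '" + mode + "' is not Warm or Hot.");
        if (!IsOneOf(role, "Primary", "Secondary"))
            result.Errors.Add("Redundancy.Role must be Primary or Secondary.");
        if (serviceLevelBase < 1 || serviceLevelBase > 255)
            result.Errors.Add("Redundancy.ServiceLevelBase must be 1-255.");

        int uriCount = serverUris == null ? 0 : serverUris.Count;
        if (uriCount < 2)
            result.Warnings.Add("Redundancy.ServerUris has fewer than 2 entries.");

        return result;
    }

    private static bool IsOneOf(string value, string a, string b)
    {
        return string.Equals(value, a, StringComparison.OrdinalIgnoreCase)
            || string.Equals(value, b, StringComparison.OrdinalIgnoreCase);
    }
}
```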
Step 5: Update LmxOpcUaServer to expose redundancy state
File: src/.../OpcUa/LmxOpcUaServer.cs
Changes:
- Accept RedundancyConfiguration in the constructor
- Override OnServerStarted to write redundancy nodes:
  - Set Server/ServerRedundancy/RedundancySupport to the resolved mode
  - Set Server/ServerRedundancy/ServerUriArray to the configured URIs
- Override SetServerState or use a timer to update Server/ServiceLevel periodically based on ServiceLevelCalculator
- Expose a method UpdateServiceLevel(bool mxConnected, bool dbConnected) that the service layer can call when health state changes
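The UpdateServiceLevel entry point might be structured like the sketch below, which serializes callers and writes the node only when the computed value changes. `ServiceLevelPublisher` is a hypothetical helper. The `compute` delegate stands in for ServiceLevelCalculator with the role and base already bound, and the `write` delegate stands in for whatever SDK call actually updates Server/ServiceLevel, which this plan does not specify.

```csharp
using System;

// Hypothetical helper, shown in isolation from the OPC UA SDK.
public class ServiceLevelPublisher
{
    private readonly object _gate = new object();
    private readonly Func<bool, bool, byte> _compute;
    private readonly Action<byte> _write;
    private byte? _last; // null until the first write

    public ServiceLevelPublisher(Func<bool, bool, byte> compute, Action<byte> write)
    {
        if (compute == null) throw new ArgumentNullException("compute");
        if (write == null) throw new ArgumentNullException("write");
        _compute = compute;
        _write = write;
    }

    // Returns true when the node value was written, false when unchanged.
    public bool UpdateServiceLevel(bool mxConnected, bool dbConnected)
    {
        lock (_gate)
        {
            byte level = _compute(mxConnected, dbConnected);
            if (_last.HasValue && _last.Value == level) return false;

            _write(level);
            _last = level;
            return true;
        }
    }
}
```

Skipping unchanged values keeps a periodic refresh from generating redundant node writes. The lock covers the case where a connection event and a timer tick arrive together.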
Step 6: Update OpcUaServerHost to pass redundancy config
File: src/.../OpcUa/OpcUaServerHost.cs
Changes:
- Accept RedundancyConfiguration in the constructor
- Pass it through to LmxOpcUaServer
- Log active redundancy mode at startup
Step 7: Wire ServiceLevel updates in OpcUaService
File: src/.../OpcUaService.cs
Changes:
- Bind redundancy config section
- Pass redundancy config to OpcUaServerHost
- Subscribe to MxAccessClient.ConnectionStateChanged to trigger ServiceLevel updates
- After Galaxy DB health checks, trigger ServiceLevel updates
- Use a periodic timer (e.g., every 5 seconds) to refresh ServiceLevel based on current component health
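The event-plus-timer wiring could look something like this sketch. `ServiceLevelRefresher` and its method names are hypothetical. The `update` delegate is assumed to forward to the server's UpdateServiceLevel, and the two On... methods are assumed to be called from the MXAccess connection-state handler and the Galaxy DB health check. The signatures of those existing events are not shown in this plan.

```csharp
using System;
using System.Threading;

// Hypothetical wiring class; names are illustrative only.
public sealed class ServiceLevelRefresher : IDisposable
{
    private readonly Action<bool, bool> _update;
    private volatile bool _mxConnected;
    private volatile bool _dbConnected;
    private Timer _timer;

    public ServiceLevelRefresher(Action<bool, bool> update)
    {
        if (update == null) throw new ArgumentNullException("update");
        _update = update;
    }

    // Call from the MXAccess connection-state handler.
    public void OnMxAccessStateChanged(bool connected)
    {
        _mxConnected = connected;
        Refresh();
    }

    // Call after each Galaxy DB health check.
    public void OnDbHealthChecked(bool reachable)
    {
        _dbConnected = reachable;
        Refresh();
    }

    // Pushes the latest known health flags to the server.
    public void Refresh()
    {
        _update(_mxConnected, _dbConnected);
    }

    // Starts (or restarts) the periodic safety-net refresh.
    public void Start(TimeSpan period)
    {
        if (_timer != null) _timer.Dispose();
        _timer = new Timer(_ => Refresh(), null, period, period);
    }

    public void Dispose()
    {
        if (_timer != null) _timer.Dispose();
    }
}
```

The events give prompt updates, and the timer (for example `Start(TimeSpan.FromSeconds(5))`) only guards against a missed event.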
Step 8: Update appsettings.json
File: src/.../appsettings.json
Add the Redundancy section with backward-compatible defaults (Enabled: false).
Step 9: Update OpcUaServiceBuilder for test injection
File: src/.../OpcUaServiceBuilder.cs
Add WithRedundancy(RedundancyConfiguration) builder method so tests can inject redundancy configuration.
Step 10: Add CLI redundancy command
Files:
- tools/opcuacli-dotnet/Commands/RedundancyCommand.cs (new)
Command: redundancy
Reads from the target server:
- Server/ServerRedundancy/RedundancySupport (i=3709)
- Server/ServiceLevel (i=2267)
- Server/ServerRedundancy/ServerUriArray (i=11314, if non-transparent redundancy)
Output format:
Redundancy Mode: Warm
Service Level: 200
Server URIs:
- urn:ZB:LmxOpcUa
- urn:ZB:LmxOpcUa2
Options: --url, --username, --password, --security (same shared options as other commands).
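The output format above can be produced by a small formatter once the three values have been read. In this sketch `RedundancyReport.Format` is a hypothetical helper, and reading the nodes is assumed to be handled by the CLI's existing session code. Omitting the "Server URIs" section when the array is empty is also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical formatter for the redundancy command's console output.
public static class RedundancyReport
{
    public static string Format(string mode, byte serviceLevel, IList<string> serverUris)
    {
        var sb = new StringBuilder();
        sb.Append("Redundancy Mode: ").Append(mode).Append('\n');
        sb.Append("Service Level: ").Append(serviceLevel).Append('\n');

        // Only non-transparent modes populate ServerUriArray.
        if (serverUris != null && serverUris.Count > 0)
        {
            sb.Append("Server URIs:\n");
            foreach (string uri in serverUris)
            {
                sb.Append("- ").Append(uri).Append('\n');
            }
        }
        return sb.ToString();
    }
}
```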
Step 11: Deploy second service instance
Deployment target: C:\publish\lmxopcua\instance2
Configuration differences from instance1:
| Setting | instance1 | instance2 |
|---|---|---|
| OpcUa.Port | 4840 | 4841 |
| OpcUa.ServerName | LmxOpcUa | LmxOpcUa2 |
| Dashboard.Port | 8083 | 8084 |
| Redundancy.Enabled | true | true |
| Redundancy.Role | Primary | Secondary |
| Redundancy.Mode | Warm | Warm |
| Redundancy.ServerUris | ["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"] | ["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"] |
| Redundancy.ServiceLevelBase | 200 | 200 |
Windows service for instance2:
- Name: LmxOpcUa2
- Display name: LMX OPC UA Server (Instance 2)
- Executable: C:\publish\lmxopcua\instance2\ZB.MOM.WW.LmxOpcUa.Host.exe
Both instances share the same Galaxy DB (ZB) and MXAccess runtime. The GalaxyName remains ZB for both so they expose the same namespace.
Update service_info.md with the second instance details.
Test Plan
Unit tests — RedundancyModeResolver
New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyModeResolverTests.cs
| Test | Description |
|---|---|
| Resolve_Disabled_ReturnsNone | Enabled=false always returns RedundancySupport.None |
| Resolve_Warm_ReturnsWarm | Mode="Warm" maps to RedundancySupport.Warm |
| Resolve_Hot_ReturnsHot | Mode="Hot" maps to RedundancySupport.Hot |
| Resolve_Unknown_FallsBackToNone | Unknown mode falls back safely |
| Resolve_CaseInsensitive | "warm" and "WARM" both resolve |
Unit tests — ServiceLevelCalculator
New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/ServiceLevelCalculatorTests.cs
| Test | Description |
|---|---|
| FullyHealthy_Primary_ReturnsBase | All healthy, primary role → ServiceLevelBase |
| FullyHealthy_Secondary_ReturnsBaseMinusFifty | All healthy, secondary role → ServiceLevelBase - 50 |
| MxAccessDown_ReducesServiceLevel | MXAccess disconnected subtracts 100 |
| DbDown_ReducesServiceLevel | DB unreachable subtracts 50 |
| BothDown_ReturnsZero | MXAccess + DB both down → 0 |
| ClampedTo255 | Base of 255 with healthy → 255 |
| ClampedToZero | Heavy penalties don't go negative |
Unit tests — RedundancyConfiguration defaults
New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyConfigurationTests.cs
| Test | Description |
|---|---|
| DefaultConfig_Disabled | Enabled defaults to false |
| DefaultConfig_ModeWarm | Mode defaults to "Warm" |
| DefaultConfig_RolePrimary | Role defaults to "Primary" |
| DefaultConfig_EmptyServerUris | ServerUris defaults to empty |
| DefaultConfig_ServiceLevelBase200 | ServiceLevelBase defaults to 200 |
Updates to existing configuration tests
File: tests/ZB.MOM.WW.LmxOpcUa.Tests/Configuration/ConfigurationLoadingTests.cs
Add:
- Redundancy_Section_BindsCorrectly: verify binding from appsettings.json
- Redundancy_Section_BindsCustomValues: in-memory override test
- Validator_RedundancyEnabled_EmptyServerUris_ReturnsTrue_WithWarning: validates but warns
- Validator_RedundancyEnabled_InvalidServiceLevelBase_ReturnsFalse: rejects 0 or >255
Integration tests — redundancy E2E
New file: tests/ZB.MOM.WW.LmxOpcUa.Tests/Integration/RedundancyTests.cs
These tests start two in-process OPC UA servers with redundancy enabled and verify client-visible behavior:
| Test | Description |
|---|---|
| Server_WithRedundancyDisabled_ReportsNone | Default config → RedundancySupport.None, ServiceLevel=255 |
| Server_WithRedundancyEnabled_ReportsConfiguredMode | Enabled=true, Mode=Warm → RedundancySupport.Warm |
| Server_WithRedundancyEnabled_ExposesServerUriArray | Client can read ServerUriArray and it matches config |
| Server_Primary_HasHigherServiceLevel_ThanSecondary | Primary server reports higher ServiceLevel than secondary |
| TwoServers_BothExposeSameRedundantSet | Two server fixtures, both report the same ServerUriArray |
| Server_ServiceLevel_DropsWith_MxAccessDisconnect | Simulate MXAccess disconnect → ServiceLevel decreases |
Pattern: Use OpcUaServerFixture.WithFakeMxAccessClient() with redundancy config injected, connect with OpcUaTestClient, read the standard OPC UA redundancy nodes.
Documentation Plan
New file: docs/Redundancy.md
Contents:
- Overview of OPC UA non-transparent redundancy
- Redundancy configuration section reference (Enabled, Mode, Role, ServerUris, ServiceLevelBase)
- ServiceLevel computation logic and degraded-state penalties
- How clients discover and fail over between instances
- Deployment guide for a two-instance redundant pair (ports, service names, shared Galaxy DB)
- CLI redundancy command usage
- Troubleshooting: mismatched ServerUris, ServiceLevel stuck at 0, etc.
Updates to existing docs
| File | Changes |
|---|---|
| docs/Configuration.md | Add Redundancy section table, example JSON, add to validation rules list, update example appsettings.json |
| docs/OpcUaServer.md | Add redundancy state exposure section, link to Redundancy.md |
| docs/CliTool.md | Add redundancy command documentation |
| docs/ServiceHosting.md | Add multi-instance deployment notes |
| README.md | Add Redundancy to the component documentation table, mention redundancy in Quick Start |
| CLAUDE.md | Add redundancy architecture note |
Update: service_info.md
Add a second section documenting instance2:
- Path: C:\publish\lmxopcua\instance2
- Windows service name: LmxOpcUa2
- Port: 4841
- Dashboard port: 8084
- Redundancy role: Secondary
- Endpoint: opc.tcp://localhost:4841/LmxOpcUa
File Change Summary
| File | Action | Description |
|---|---|---|
| src/.../Configuration/RedundancyConfiguration.cs | New | Redundancy config model |
| src/.../Configuration/AppConfiguration.cs | Modify | Add Redundancy section |
| src/.../Configuration/ConfigurationValidator.cs | Modify | Validate/log redundancy settings |
| src/.../OpcUa/RedundancyModeResolver.cs | New | Mode string → RedundancySupport enum |
| src/.../OpcUa/ServiceLevelCalculator.cs | New | Dynamic ServiceLevel from health state |
| src/.../OpcUa/LmxOpcUaServer.cs | Modify | Expose redundancy nodes, accept ServiceLevel updates |
| src/.../OpcUa/OpcUaServerHost.cs | Modify | Pass redundancy config through |
| src/.../OpcUaService.cs | Modify | Bind redundancy config, wire ServiceLevel updates |
| src/.../OpcUaServiceBuilder.cs | Modify | Add WithRedundancy() builder |
| src/.../appsettings.json | Modify | Add Redundancy section |
| tools/opcuacli-dotnet/Commands/RedundancyCommand.cs | New | CLI command to read redundancy info |
| tests/.../Redundancy/RedundancyModeResolverTests.cs | New | Mode resolver unit tests |
| tests/.../Redundancy/ServiceLevelCalculatorTests.cs | New | ServiceLevel computation tests |
| tests/.../Redundancy/RedundancyConfigurationTests.cs | New | Config defaults tests |
| tests/.../Configuration/ConfigurationLoadingTests.cs | Modify | Binding + validation tests |
| tests/.../Integration/RedundancyTests.cs | New | E2E two-server redundancy tests |
| tests/.../Helpers/OpcUaServerFixture.cs | Modify | Accept redundancy config |
| docs/Redundancy.md | New | Dedicated redundancy component doc |
| docs/Configuration.md | Modify | Add Redundancy section |
| docs/OpcUaServer.md | Modify | Add redundancy state section |
| docs/CliTool.md | Modify | Add redundancy command |
| docs/ServiceHosting.md | Modify | Multi-instance notes |
| README.md | Modify | Add Redundancy to component table |
| CLAUDE.md | Modify | Add redundancy architecture note |
| service_info.md | Modify | Add instance2 details |
Verification Guardrails
Each step must pass these gates before proceeding to the next:
Gate 1: Build (after each implementation step)
dotnet build ZB.MOM.WW.LmxOpcUa.slnx
Must produce 0 errors. Proceed only when green.
Gate 2: Unit tests (after steps 1–4, 9)
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests
All existing + new tests must pass. No regressions.
Gate 3: Integration tests (after steps 5–7)
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~Integration.RedundancyTests"
All redundancy E2E tests must pass.
Gate 4: CLI tool builds (after step 10)
cd tools/opcuacli-dotnet && dotnet build
Must compile without errors.
Gate 5: Manual verification — single instance (after step 8)
# Publish and start with Redundancy.Enabled=false
opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
# Should report: RedundancySupport=None, ServiceLevel=255
Gate 6: Manual verification — redundant pair (after step 11)
# Start both instances
sc start LmxOpcUa
sc start LmxOpcUa2
# Verify instance1 (Primary)
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
# Should report: RedundancySupport=Warm, ServiceLevel=200, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]
# Verify instance2 (Secondary)
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa
# Should report: RedundancySupport=Warm, ServiceLevel=150, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]
# Both instances should serve the same Galaxy address space
opcuacli-dotnet.exe browse -u opc.tcp://localhost:4840/LmxOpcUa -r -d 2
opcuacli-dotnet.exe browse -u opc.tcp://localhost:4841/LmxOpcUa -r -d 2
Gate 7: Full test suite (final)
dotnet test ZB.MOM.WW.LmxOpcUa.slnx
All tests across all projects must pass.
Gate 8: Documentation review
- All new/modified doc files render correctly in Markdown
- Example JSON snippets match the actual appsettings.json
- CLI examples use correct flags and expected output
- service_info.md accurately reflects both deployed instances
Risks and Considerations
- Backward compatibility: Redundancy.Enabled = false must be the default so existing single-instance deployments are unaffected.
- ServiceLevel timing: Updates must not race with OPC UA publish cycles. Use the server's internal lock or ServerInternal APIs.
- ServerUriArray immutability: The OPC UA spec expects this to be static during a server session. Changes require a server restart.
- MXAccess shared state: Both instances connect to the same MXAccess runtime. If MXAccess has per-client registration limits, verify that two clients can coexist.
- Galaxy DB contention: Both instances poll for deploy changes. Ensure change detection doesn't trigger duplicate rebuilds or locking issues.
- Port conflicts: The second instance must use different ports for OPC UA (4841) and Dashboard (8084).
- Certificate identity: Each instance needs its own application certificate with a distinct SubjectName matching its ServerName.
Execution Order
- Steps 1–4: Config model, resolver, calculator, validator (unit-testable in isolation)
- Gate 1 + Gate 2: Build + unit tests pass
- Steps 5–7: Server integration (redundancy nodes, ServiceLevel wiring)
- Gate 1 + Gate 2 + Gate 3: Build + all tests including E2E
- Step 8: Update appsettings.json
- Gate 5: Manual single-instance verification
- Step 9: Update service builder for tests
- Step 10: CLI redundancy command
- Gate 4: CLI builds
- Step 11: Deploy second instance + update service_info.md
- Gate 6: Manual two-instance verification
- Documentation updates (all doc files)
- Gate 7 + Gate 8: Full test suite + documentation review
- Commit and push