# OPC UA Server Redundancy Plan

## Summary

Add configurable non-transparent warm/hot redundancy to the LmxOpcUa server so that two instances sharing the same Galaxy repository can operate as a redundant pair. Each instance should advertise the redundant set through the standard OPC UA redundancy nodes, publish a dynamic `ServiceLevel` based on runtime health, and allow clients to discover and fail over between the instances. The CLI tool should gain a `redundancy` command for inspecting the redundant server set.

This review tightens the original draft in a few important ways:

- It separates **namespace identity** from **application identity**. The current host uses `urn:{GalaxyName}:LmxOpcUa` as both the namespace URI and `ApplicationUri`; that must change for redundancy because each server in the pair needs a unique server URI.
- It avoids hand-wavy "write the redundancy nodes directly" language and instead targets the OPC UA SDK's built-in `ServerObjectState` / `ServerRedundancyState` model.
- It removes a few inaccurate hardcoded assumptions, including the `ServerUriArray` node id and the deployment port examples.
- It fixes execution order so test-builder and helper changes happen before integration coverage depends on them.

This plan still covers server-side redundancy exposure, client-side discovery, a second deployed service instance, documentation, and tests. It does **not** implement automatic server-side failover or subscription transfer; those remain client responsibilities per the OPC UA specification.


---

## Background: OPC UA Redundancy Model

OPC UA exposes redundancy through standard nodes under `Server/ServerRedundancy` plus the `Server/ServiceLevel` property:

| Node | Type | Purpose |
|---|---|---|
| `RedundancySupport` | `RedundancySupport` enum | Declares the redundancy mode: `None`, `Cold`, `Warm`, `Hot`, `Transparent`, `HotAndMirrored` |
| `ServerUriArray` | `String[]` | Lists the `ApplicationUri` values of all servers in the redundant set for non-transparent redundancy |
| `ServiceLevel` | `Byte` (0-255) | Indicates current operational quality; clients prefer the server with the highest value |

### Non-Transparent Redundancy (our target)

In non-transparent redundancy (`Warm` or `Hot`), both servers run independently with their own sessions and subscriptions. Clients discover the redundant set by reading `ServerUriArray`, monitor `ServiceLevel` on each server, and manage their own failover. This fits the current architecture, where each instance independently connects to the same Galaxy repository and MXAccess runtime.

### ServiceLevel semantics

| Range | Meaning |
|---|---|
| 0 | Server is not operational |
| 1-99 | Degraded |
| 100-199 | Healthy secondary |
| 200-255 | Healthy primary |

The primary should advertise a higher `ServiceLevel` than the secondary so clients prefer it when both are healthy.

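
As an illustration of the client-side selection this model implies, the sketch below picks the preferred server from a set of `ServiceLevel` readings. `ServerReading` and `FailoverSelector` are hypothetical names used only for this example, and the readings are assumed to have been collected from each server listed in `ServerUriArray` beforehand.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one ServiceLevel value read from one server in the set.
public sealed record ServerReading(string ApplicationUri, byte ServiceLevel);

public static class FailoverSelector
{
    // Returns the reading with the highest ServiceLevel, or null when no
    // readings were supplied or every server reports 0 (not operational).
    public static ServerReading? PickPreferred(IEnumerable<ServerReading> readings)
    {
        return readings
            .Where(r => r.ServiceLevel > 0)          // skip non-operational servers
            .OrderByDescending(r => r.ServiceLevel)  // highest value wins
            .FirstOrDefault();
    }
}
```

A real client would re-evaluate this choice whenever a monitored `ServiceLevel` changes, and would typically add some hysteresis so it does not switch servers on every small fluctuation.
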

---

## Current State

- `LmxOpcUaServer` extends `StandardServer` but does not expose redundancy state
- `ServerRedundancy/RedundancySupport` remains the SDK default (`None`)
- `Server/ServiceLevel` remains the SDK default (`255`)
- No configuration exists for redundancy mode, role, or redundant partner URIs
- `OpcUaServerHost` currently sets `ApplicationUri = urn:{GalaxyName}:LmxOpcUa`
- `LmxNodeManager` uses the same `urn:{GalaxyName}:LmxOpcUa` as the published namespace URI
- A single deployed instance is documented in [service_info.md](C:\Users\dohertj2\Desktop\lmxopcua\service_info.md)
- No CLI command exists for reading redundancy information

## Key gap to fix first

For redundancy, each server in the set must advertise a unique `ApplicationUri`, and `ServerUriArray` must contain those unique values. The current implementation cannot do that because it reuses the namespace URI as the server `ApplicationUri`.

Phase 1 therefore needs an application-identity change before the redundancy nodes can be correct.

---

## Scope

### In scope (Phase 1)

1. Add explicit application-identity configuration so each instance can have a unique `ApplicationUri`
2. Add redundancy configuration for mode, role, and server URI membership
3. Expose `RedundancySupport`, `ServerUriArray`, and dynamic `ServiceLevel`
4. Compute `ServiceLevel` from runtime health and preferred role
5. Add a CLI `redundancy` command
6. Document two-instance deployment
7. Add unit and integration coverage

### Deferred

- Automatic subscription transfer
- Server-initiated failover
- Transparent redundancy mode
- Load-balancer-specific HTTP health endpoints
- Mirrored data/session state

---

## Configuration Design

### 1. Add explicit `OpcUa.ApplicationUri`

**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/OpcUaConfiguration.cs`

Add:

```csharp
public string? ApplicationUri { get; set; }
```

Rules:

- `ApplicationUri = null` preserves the current behavior for non-redundant deployments
- when `Redundancy.Enabled = true`, `ApplicationUri` must be explicitly set and unique per instance
- `LmxNodeManager` should continue using `urn:{GalaxyName}:LmxOpcUa` as the namespace URI so both redundant servers expose the same namespace
- `Redundancy.ServerUris` must contain the exact `ApplicationUri` values for all servers in the redundant set

Example:

```json
{
  "OpcUa": {
    "ServerName": "LmxOpcUa",
    "GalaxyName": "ZB",
    "ApplicationUri": "urn:localhost:LmxOpcUa:instance1"
  }
}
```

### 2. New `Redundancy` section in `appsettings.json`

```json
{
  "Redundancy": {
    "Enabled": false,
    "Mode": "Warm",
    "Role": "Primary",
    "ServerUris": [],
    "ServiceLevelBase": 200
  }
}
```

### 3. Configuration model

**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/RedundancyConfiguration.cs` (new)

```csharp
public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}
```

### 4. Configuration rules

- `Enabled` defaults to `false`
- `Mode` supports `Warm` and `Hot` in Phase 1
- `Role` supports `Primary` and `Secondary`
- `ServerUris` must contain the local `OpcUa.ApplicationUri` when redundancy is enabled
- `ServerUris` should contain at least two unique entries when redundancy is enabled
- `ServiceLevelBase` should be in the range `1-255`
- Effective baseline:
  - Primary: `ServiceLevelBase`
  - Secondary: `max(0, ServiceLevelBase - 50)`

### App root updates

**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/AppConfiguration.cs`

- Add `public RedundancyConfiguration Redundancy { get; set; } = new RedundancyConfiguration();`

---

## Implementation Steps

### Step 1: Separate application identity from namespace identity

**Files:**

- `src/.../Configuration/OpcUaConfiguration.cs`
- `src/.../OpcUa/OpcUaServerHost.cs`
- `docs/OpcUaServer.md`
- `tests/.../Configuration/ConfigurationLoadingTests.cs`

Changes:

1. Add optional `OpcUa.ApplicationUri`
2. Keep `urn:{GalaxyName}:LmxOpcUa` as the namespace URI used by `LmxNodeManager`
3. Set `ApplicationConfiguration.ApplicationUri` from `OpcUa.ApplicationUri` when supplied
4. Keep `ApplicationUri` and namespace URI distinct in docs and tests

This step is required before redundancy can be correct.

### Step 2: Add `RedundancyConfiguration` and bind it

**Files:**

- `src/.../Configuration/RedundancyConfiguration.cs` (new)
- `src/.../Configuration/AppConfiguration.cs`
- `src/.../OpcUaService.cs`

Changes:

1. Create `RedundancyConfiguration`
2. Add `Redundancy` to `AppConfiguration`
3. Bind `configuration.GetSection("Redundancy").Bind(_config.Redundancy);`
4. Pass `_config.Redundancy` through to `OpcUaServerHost` and `LmxOpcUaServer`

### Step 3: Add `RedundancyModeResolver`

**File:** `src/.../OpcUa/RedundancyModeResolver.cs` (new)

Responsibilities:

- map `Mode` to `RedundancySupport`
- validate supported Phase 1 modes
- fall back safely when disabled or invalid

```csharp
public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled);
}
```

### Step 4: Add `ServiceLevelCalculator`

**File:** `src/.../OpcUa/ServiceLevelCalculator.cs` (new)

Purpose:

- compute the current `ServiceLevel` from a baseline plus health inputs

Suggested signature:

```csharp
public sealed class ServiceLevelCalculator
{
    public byte Calculate(int baseLevel, bool mxAccessConnected, bool dbConnected);
}
```

Suggested logic:

- start with the role-adjusted baseline supplied by the caller
- subtract 100 if MXAccess is disconnected
- subtract 50 if the Galaxy DB is unreachable
- return `0` if both are down
- clamp to `0-255`

### Step 5: Extend `ConfigurationValidator`

**File:** `src/.../Configuration/ConfigurationValidator.cs`

Add validation/logging for:

- `OpcUa.ApplicationUri`
- `Redundancy.Enabled`, `Mode`, `Role`
- `ServerUris` membership and uniqueness
- `ServiceLevelBase`
- local `OpcUa.ApplicationUri` must appear in `Redundancy.ServerUris` when enabled
- warning when fewer than 2 unique server URIs are configured

### Step 6: Expose redundancy through the standard OPC UA server object

**File:** `src/.../OpcUa/LmxOpcUaServer.cs`

Changes:

1. Accept `RedundancyConfiguration` and local `ApplicationUri`
2. On startup, locate the built-in `ServerObjectState`
3. Configure `ServerObjectState.ServiceLevel`
4. Configure the server redundancy object using the SDK's standard server-state types instead of writing guessed node ids directly
5. If the default `ServerRedundancyState` does not expose `ServerUriArray`, replace or upgrade it with the appropriate non-transparent redundancy state type from the SDK before populating values
6. Expose an internal method such as `UpdateServiceLevel(bool mxConnected, bool dbConnected)` for service-layer health updates

Important: the implementation should use SDK types/constants (`ServerObjectState`, `ServerRedundancyState`, `NonTransparentRedundancyState`, `VariableIds.*`) rather than hand-maintained numeric literals.

### Step 7: Update `OpcUaServerHost`

**File:** `src/.../OpcUa/OpcUaServerHost.cs`

Changes:

1. Accept `RedundancyConfiguration`
2. Pass redundancy config and resolved local `ApplicationUri` into `LmxOpcUaServer`
3. Log redundancy mode/role/server URIs at startup

### Step 8: Wire health updates in `OpcUaService`

**File:** `src/.../OpcUaService.cs`

Changes:

1. Bind and pass redundancy config
2. After startup, initialize the starting `ServiceLevel`
3. Subscribe to `IMxAccessClient.ConnectionStateChanged`
4. Update DB health whenever startup repository checks, change-detection work, or rebuild attempts succeed/fail
5. Prefer event-driven updates; add a lightweight periodic refresh only if necessary

Avoid introducing a second large standalone polling loop when existing connection and repository activity already gives most of the needed health signals.

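
The Step 4 logic, combined with the role-adjusted baseline from the configuration rules, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the final implementation: it uses static methods for brevity where Step 4 suggests an instance method, `ServiceLevelSketch` and `RoleAdjustedBaseline` are hypothetical names, and case-insensitive role matching is an assumption.

```csharp
using System;

public static class ServiceLevelSketch
{
    // Effective baseline per the configuration rules:
    // Primary keeps ServiceLevelBase, Secondary gets max(0, ServiceLevelBase - 50).
    // Case-insensitive matching of the role string is an assumption.
    public static int RoleAdjustedBaseline(int serviceLevelBase, string role)
    {
        bool isSecondary = string.Equals(role, "Secondary", StringComparison.OrdinalIgnoreCase);
        return isSecondary ? Math.Max(0, serviceLevelBase - 50) : serviceLevelBase;
    }

    // Applies the suggested penalties and clamps the result to the Byte range.
    public static byte Calculate(int baseLevel, bool mxAccessConnected, bool dbConnected)
    {
        // Both dependencies down: report 0 regardless of the baseline.
        if (!mxAccessConnected && !dbConnected)
        {
            return 0;
        }

        int level = baseLevel;
        if (!mxAccessConnected) level -= 100; // MXAccess disconnected
        if (!dbConnected) level -= 50;        // Galaxy DB unreachable

        // Clamp to 0-255 before narrowing to byte.
        return (byte)Math.Max(0, Math.Min(255, level));
    }
}
```

With the default `ServiceLevelBase` of 200, this yields 200 for a healthy primary, 150 for a healthy secondary, 100 for a primary that has lost MXAccess, and 0 when both dependencies are down.
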

### Step 9: Update test builders and helpers before integration coverage

**Files:**

- `src/.../OpcUaServiceBuilder.cs`
- `tests/.../Helpers/OpcUaServerFixture.cs`
- `tests/.../Helpers/OpcUaTestClient.cs`

Changes:

- add `WithRedundancy(...)`
- add `WithApplicationUri(...)` or allow full `OpcUaConfiguration` override
- ensure two in-process redundancy tests can run with distinct `ServerName`, `ApplicationUri`, and certificate identity
- when needed, use separate PKI roots in tests so paired fixtures do not collide on certificate state

### Step 10: Update `appsettings.json`

**File:** `src/.../appsettings.json`

Add:

- `OpcUa.ApplicationUri` example/commentary in docs
- `Redundancy` section with `Enabled = false` defaults

### Step 11: Add CLI `redundancy` command

**Files:**

- `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` (new)
- `tools/opcuacli-dotnet/README.md`
- `docs/CliTool.md`

Command: `redundancy`

Read:

- `VariableIds.Server_ServerRedundancy_RedundancySupport`
- `VariableIds.Server_ServiceLevel`
- `VariableIds.Server_ServerRedundancy_ServerUriArray`

Output example:

```text
Redundancy Mode: Warm
Service Level: 200
Server URIs:
  - urn:localhost:LmxOpcUa:instance1
  - urn:localhost:LmxOpcUa:instance2
```

Use SDK constants instead of hardcoded numeric ids in the command implementation.

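
Once the three values have been read from the server, rendering the Step 11 output is straightforward. The sketch below covers only that formatting step and deliberately leaves out the SDK read calls; `RedundancyReport` is a hypothetical helper name, and the two-space indentation of the URI entries is a presentational assumption.

```csharp
using System;
using System.Collections.Generic;

public static class RedundancyReport
{
    // Builds the text shown by the `redundancy` command from values that were
    // already read from RedundancySupport, ServiceLevel, and ServerUriArray.
    public static string Format(string redundancyMode, byte serviceLevel, IReadOnlyList<string> serverUris)
    {
        var lines = new List<string>
        {
            $"Redundancy Mode: {redundancyMode}",
            $"Service Level: {serviceLevel}",
            "Server URIs:"
        };

        // One indented bullet per member of the redundant set.
        foreach (string uri in serverUris)
        {
            lines.Add($"  - {uri}");
        }

        return string.Join(Environment.NewLine, lines);
    }
}
```

Keeping the formatting separate from the session code makes the command output unit-testable without a running server.
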

### Step 12: Deploy the second service instance

**Deployment target:** `C:\publish\lmxopcua\instance2`

Suggested configuration differences:

| Setting | instance1 | instance2 |
|---|---|---|
| `OpcUa.Port` | `4840` | `4841` |
| `Dashboard.Port` | `8081` | `8082` |
| `OpcUa.ServerName` | `LmxOpcUa` | `LmxOpcUa2` |
| `OpcUa.ApplicationUri` | `urn:localhost:LmxOpcUa:instance1` | `urn:localhost:LmxOpcUa:instance2` |
| `Redundancy.Enabled` | `true` | `true` |
| `Redundancy.Role` | `Primary` | `Secondary` |
| `Redundancy.Mode` | `Warm` | `Warm` |
| `Redundancy.ServerUris` | same two-entry set | same two-entry set |

Deployment notes:

- both instances should share the same `GalaxyName` and namespace URI
- each instance must have a distinct application certificate identity
- if certificate handling is sensitive, give each instance an explicit `Security.CertificateSubject` or separate PKI root

Update [service_info.md](C:\Users\dohertj2\Desktop\lmxopcua\service_info.md) with the second instance details after deployment is real, not speculative.

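
Put together, the instance2 column of the Step 12 table could translate into an `appsettings.json` fragment like the one below. This is a sketch, not the deployed file: the nesting of `Port` under `OpcUa` and `Dashboard` is inferred from the dotted setting names in the table, and the `GalaxyName` and `ServiceLevelBase` values are carried over from the earlier configuration examples.

```json
{
  "OpcUa": {
    "Port": 4841,
    "ServerName": "LmxOpcUa2",
    "GalaxyName": "ZB",
    "ApplicationUri": "urn:localhost:LmxOpcUa:instance2"
  },
  "Dashboard": {
    "Port": 8082
  },
  "Redundancy": {
    "Enabled": true,
    "Mode": "Warm",
    "Role": "Secondary",
    "ServerUris": [
      "urn:localhost:LmxOpcUa:instance1",
      "urn:localhost:LmxOpcUa:instance2"
    ],
    "ServiceLevelBase": 200
  }
}
```

The instance1 file would differ only in the ports, `ServerName`, `ApplicationUri`, and `Role`; `ServerUris` stays identical on both sides so each server advertises the same redundant set.
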

---

## Test Plan

### Unit tests: `RedundancyModeResolver`

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyModeResolverTests.cs`

| Test | Description |
|---|---|
| `Resolve_Disabled_ReturnsNone` | `Enabled=false` returns `None` |
| `Resolve_Warm_ReturnsWarm` | `Mode="Warm"` maps correctly |
| `Resolve_Hot_ReturnsHot` | `Mode="Hot"` maps correctly |
| `Resolve_Unknown_FallsBackToNone` | Unknown mode falls back safely |
| `Resolve_CaseInsensitive` | Case-insensitive parsing works |

### Unit tests: `ServiceLevelCalculator`

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/ServiceLevelCalculatorTests.cs`

| Test | Description |
|---|---|
| `FullyHealthy_Primary_ReturnsBase` | Healthy primary baseline is preserved |
| `FullyHealthy_Secondary_ReturnsBaseMinusFifty` | Healthy secondary baseline is lower |
| `MxAccessDown_ReducesServiceLevel` | MXAccess failure reduces score |
| `DbDown_ReducesServiceLevel` | DB failure reduces score |
| `BothDown_ReturnsZero` | Both unavailable returns 0 |
| `ClampedTo255` | Upper clamp works |
| `ClampedToZero` | Lower clamp works |

### Unit tests: `RedundancyConfiguration`

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyConfigurationTests.cs`

| Test | Description |
|---|---|
| `DefaultConfig_Disabled` | `Enabled` defaults to `false` |
| `DefaultConfig_ModeWarm` | `Mode` defaults to `Warm` |
| `DefaultConfig_RolePrimary` | `Role` defaults to `Primary` |
| `DefaultConfig_EmptyServerUris` | `ServerUris` defaults to empty |
| `DefaultConfig_ServiceLevelBase200` | `ServiceLevelBase` defaults to `200` |

### Updates to existing configuration tests

**File:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Configuration/ConfigurationLoadingTests.cs`

Add coverage for:

- `OpcUa.ApplicationUri`
- `Redundancy` section binding
- redundancy validation when `ApplicationUri` is missing
- redundancy validation when local `ApplicationUri` is absent from `ServerUris`
- invalid `ServiceLevelBase`

### Integration tests

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Integration/RedundancyTests.cs`

Cover:

- redundancy disabled reports `None`
- warm redundancy reports configured mode
- `ServerUriArray` matches configuration
- primary reports higher `ServiceLevel` than secondary
- both servers expose the same namespace URI but different `ApplicationUri` values
- service level drops when MXAccess disconnects

Pattern:

- use two fixture instances
- give each fixture a distinct `ServerName`, `ApplicationUri`, and port
- if secure transport is enabled in those tests, isolate PKI roots to avoid certificate cross-talk

---

## Documentation Plan

### New file

- `docs/Redundancy.md`

Contents:

1. overview of OPC UA non-transparent redundancy
2. difference between namespace URI and server `ApplicationUri`
3. redundancy configuration reference
4. service-level computation
5. two-instance deployment guide
6. CLI `redundancy` command usage
7. troubleshooting

### Updates to existing docs

| File | Changes |
|---|---|
| `docs/Configuration.md` | Add `OpcUa.ApplicationUri` and `Redundancy` sections |
| `docs/OpcUaServer.md` | Correct the current `ApplicationUri == namespace` description and add redundancy behavior |
| `docs/CliTool.md` | Add `redundancy` command |
| `docs/ServiceHosting.md` | Add multi-instance deployment notes |
| `README.md` | Mention redundancy support and link docs |
| `CLAUDE.md` | Add redundancy architecture note |

### Update after real deployment

- `service_info.md`

Only update this once the second instance is actually deployed and verified.


---

## File Change Summary

| File | Action | Description |
|---|---|---|
| `src/.../Configuration/OpcUaConfiguration.cs` | Modify | Add explicit `ApplicationUri` |
| `src/.../Configuration/RedundancyConfiguration.cs` | New | Redundancy config model |
| `src/.../Configuration/AppConfiguration.cs` | Modify | Add `Redundancy` section |
| `src/.../Configuration/ConfigurationValidator.cs` | Modify | Validate/log redundancy and application identity |
| `src/.../OpcUa/RedundancyModeResolver.cs` | New | Map config mode to `RedundancySupport` |
| `src/.../OpcUa/ServiceLevelCalculator.cs` | New | Compute `ServiceLevel` from health inputs |
| `src/.../OpcUa/LmxOpcUaServer.cs` | Modify | Expose redundancy state via SDK server object |
| `src/.../OpcUa/OpcUaServerHost.cs` | Modify | Pass local application identity and redundancy config |
| `src/.../OpcUaService.cs` | Modify | Bind config and wire health updates |
| `src/.../OpcUaServiceBuilder.cs` | Modify | Support redundancy/application identity injection |
| `src/.../appsettings.json` | Modify | Add redundancy settings |
| `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` | New | Read redundancy state from a server |
| `tests/.../Redundancy/*.cs` | New | Unit tests for redundancy config and calculators |
| `tests/.../Configuration/ConfigurationLoadingTests.cs` | Modify | Bind/validate new settings |
| `tests/.../Integration/RedundancyTests.cs` | New | Paired-server integration tests |
| `tests/.../Helpers/OpcUaServerFixture.cs` | Modify | Support paired redundancy fixtures |
| `tests/.../Helpers/OpcUaTestClient.cs` | Modify | Read redundancy nodes in integration tests |
| `docs/Redundancy.md` | New | Dedicated redundancy guide |
| `docs/Configuration.md` | Modify | Document new config |
| `docs/OpcUaServer.md` | Modify | Correct application identity and add redundancy details |
| `docs/CliTool.md` | Modify | Document `redundancy` command |
| `docs/ServiceHosting.md` | Modify | Multi-instance deployment notes |
| `README.md` | Modify | Link redundancy docs |
| `CLAUDE.md` | Modify | Architecture note |
| `service_info.md` | Modify later | Document real second-instance deployment |

---

## Verification Guardrails

### Gate 1: Build

```bash
dotnet build ZB.MOM.WW.LmxOpcUa.slnx
```

### Gate 2: Unit tests

```bash
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests
```

### Gate 3: Redundancy integration tests

```bash
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~Redundancy"
```

### Gate 4: CLI build

```bash
cd tools/opcuacli-dotnet
dotnet build
```

### Gate 5: Manual single-instance check

```bash
opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
```

Expected:

- `RedundancySupport=None`
- `ServiceLevel=255`

### Gate 6: Manual paired-instance check

```bash
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa
```

Expected:

- both report the same `ServerUriArray`
- each reports its own unique local `ApplicationUri`
- primary reports a higher `ServiceLevel`

### Gate 7: Full test suite

```bash
dotnet test ZB.MOM.WW.LmxOpcUa.slnx
```

---

## Risks and Considerations

1. **Application identity is the main correctness risk.** Without unique `ApplicationUri` values, the redundant set is invalid even if `ServerUriArray` is populated.
2. **SDK wiring may require replacing the default redundancy state node.** The base `ServerRedundancyState` does not expose `ServerUriArray`; the implementation may need the non-transparent subtype from the SDK.
3. **Two in-process servers can collide on certificates.** Tests and deployment need distinct application identities and, when necessary, isolated PKI roots.
4. **Both instances hit the same MXAccess runtime and Galaxy DB.** Verify client-registration and polling behavior under paired load.
5. **`ServiceLevel` should remain meaningful, not noisy.** Prefer deterministic role + health inputs over frequent arbitrary adjustments.
6. **`service_info.md` is deployment documentation, not design.** Do not prefill it with speculative values before the second instance actually exists.

---

## Execution Order

1. Step 1: add `OpcUa.ApplicationUri` and separate it from namespace identity
2. Steps 2-5: config model, resolver, calculator, validator
3. Gate 1 + Gate 2
4. Step 9: update builders/helpers so tests can express paired servers cleanly
5. Steps 6-8: server exposure and service-layer health wiring
6. Gate 1 + Gate 2 + Gate 3
7. Step 10: update `appsettings.json`
8. Step 11: add CLI `redundancy` command
9. Gate 4 + Gate 5
10. Step 12: deploy and verify the second instance
11. Update `service_info.md` with real deployment details
12. Documentation updates
13. Gate 7