# OPC UA Server Redundancy Plan

## Summary

Add configurable non-transparent warm/hot redundancy to the LmxOpcUa server so that two instances sharing the same Galaxy repository can operate as a redundant pair. Each instance advertises itself and its partner through the OPC UA `ServerRedundancy` node, publishes a dynamic `ServiceLevel` reflecting runtime health, and allows clients to discover the redundant set and fail over between instances. The CLI tool gains a `redundancy` command for inspecting the redundant server set.

This plan covers server-side redundancy exposure, client-side discovery, a second deployed service instance, documentation, and tests. It does **not** implement automatic server-side failover or subscription transfer; those are client responsibilities per the OPC UA specification.

---

## Background: OPC UA Redundancy Model

OPC UA exposes redundancy through three address-space nodes. `RedundancySupport` and `ServerUriArray` live under `Server/ServerRedundancy`; `ServiceLevel` is a property of the `Server` object itself:

| Node | Type | Purpose |
|---|---|---|
| `RedundancySupport` | `RedundancySupport` enum | Declares the redundancy mode: `None`, `Cold`, `Warm`, `Hot`, `Transparent`, `HotAndMirrored` |
| `ServerUriArray` | `String[]` | Lists the `ApplicationUri` values of all servers in the redundant set (non-transparent modes) |
| `ServiceLevel` | `Byte` (0–255) | Indicates current operational quality; clients prefer the server with the highest value |

### Non-Transparent Redundancy (our target)

In non-transparent redundancy (`Warm` or `Hot`), both servers run independently with their own sessions and subscriptions. Clients discover the redundant set by reading `ServerUriArray`, monitor `ServiceLevel` on each server, and manage their own failover. This model fits our architecture, where each instance connects to the same Galaxy repository and MXAccess runtime independently.

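
As an illustration of the client-side behavior described above, the sketch below picks a preferred server from the `ServiceLevel` values a client has already read. It is a minimal example, not part of this plan's deliverables: `RedundantSetSelector` and `PickPreferred` are hypothetical names, and how the client reads each `ServiceLevel` (and maps an `ApplicationUri` to an endpoint) is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical client-side helper; not part of the plan or of any OPC UA SDK.
public static class RedundantSetSelector
{
    // serviceLevels maps a server identifier (for example its endpoint URL)
    // to the ServiceLevel byte the client last read from that server.
    // An unreachable server can be recorded with a level of 0.
    public static string PickPreferred(IDictionary<string, byte> serviceLevels)
    {
        if (serviceLevels == null || serviceLevels.Count == 0)
            return null;

        // Highest ServiceLevel wins; ties are broken by identifier so the
        // choice is deterministic.
        var best = serviceLevels
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal)
            .First();

        // A level of 0 means the server is not operational, so nothing is usable.
        return best.Value == 0 ? null : best.Key;
    }
}
```

A client would typically re-run a selection like this whenever a monitored `ServiceLevel` changes and reconnect only if the preferred server differs from the current one.
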
### ServiceLevel Semantics

| Range | Meaning |
|---|---|
| 0 | Server is not operational |
| 1–99 | Degraded (e.g., MXAccess disconnected, DB unreachable) |
| 100–199 | Healthy secondary |
| 200–255 | Healthy primary (preferred) |

The primary server should advertise a higher `ServiceLevel` than the secondary so clients prefer it when both are healthy.

---

## Current State

- `LmxOpcUaServer` extends `StandardServer` but does not override any redundancy-related methods
- `ServerRedundancy/RedundancySupport` defaults to `None` (SDK default)
- `ServiceLevel` defaults to `255` (SDK default, "fully operational")
- No configuration for redundant partner URIs or role designation
- Single deployed instance at `C:\publish\lmxopcua\instance1` on port 4840
- No CLI support for reading redundancy information

---

## Scope

### In Scope (Phase 1)

1. **Redundancy configuration model**: role, partner URIs, ServiceLevel weights
2. **Server redundancy node exposure**: `RedundancySupport`, `ServerUriArray`, dynamic `ServiceLevel`
3. **ServiceLevel computation**: based on runtime health (MXAccess state, DB connectivity, role)
4. **CLI redundancy command**: read `RedundancySupport`, `ServerUriArray`, `ServiceLevel` from a server
5. **Second service instance**: deployed at `C:\publish\lmxopcua\instance2` with non-overlapping ports
6. **Documentation**: new `docs/Redundancy.md` component doc, updates to existing docs
7. **Unit tests**: config, ServiceLevel computation, resolver tests
8. **Integration tests**: two-server redundancy E2E test in the integration test project

### Deferred

- Automatic subscription transfer (client-side responsibility)
- Server-initiated failover (Galaxy `redundancy` table / engine flags)
- Transparent redundancy mode
- Health-check HTTP endpoint for load balancers

---

## Configuration Design

### New `Redundancy` section in `appsettings.json`

```json
{
  "Redundancy": {
    "Enabled": false,
    "Mode": "Warm",
    "Role": "Primary",
    "ServerUris": [],
    "ServiceLevelBase": 200
  }
}
```

### Configuration model

**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/RedundancyConfiguration.cs` (new)

```csharp
using System.Collections.Generic;

public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    // ApplicationUri values of every server in the redundant set.
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}
```

### Configuration rules

- `Enabled` defaults to `false` for backward compatibility. When `false`, `RedundancySupport = None` and `ServiceLevel = 255` (SDK defaults).
- `Mode` must be `Warm` or `Hot` (Phase 1). Maps to `RedundancySupport.Warm` or `RedundancySupport.Hot`.
- `Role` must be `Primary` or `Secondary`. Controls the base `ServiceLevel` (Primary gets `ServiceLevelBase`, Secondary gets `ServiceLevelBase - 50`).
- `ServerUris` lists the `ApplicationUri` values for **all** servers in the redundant set, including the local server. The OPC UA spec requires this to contain the full set. These are application URIs such as `urn:ZB:LmxOpcUa`, not endpoint URLs.
- `ServiceLevelBase` is the starting ServiceLevel when the server is fully healthy. Degraded conditions subtract from this value.

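
The `Mode` and `Role` rules above can be sketched as two small pure functions. This is an illustrative sketch under stated assumptions rather than the planned implementation: `RedundancySupportSketch` is a local stand-in for the SDK's `RedundancySupport` enum so the example compiles on its own, `RedundancyRulesSketch` and its method names are hypothetical, and treating unknown modes as `None` and matching strings case-insensitively are assumptions of this example.

```csharp
using System;

// Local stand-in for the SDK's RedundancySupport enum (assumption: used only
// so that this sketch is self-contained).
public enum RedundancySupportSketch { None, Cold, Warm, Hot, Transparent, HotAndMirrored }

// Hypothetical helper illustrating the configuration rules.
public static class RedundancyRulesSketch
{
    // Disabled redundancy always reports None; only Warm and Hot are
    // accepted in Phase 1, anything else falls back to None.
    public static RedundancySupportSketch ResolveMode(bool enabled, string mode)
    {
        if (!enabled)
            return RedundancySupportSketch.None;
        if (string.Equals(mode, "Warm", StringComparison.OrdinalIgnoreCase))
            return RedundancySupportSketch.Warm;
        if (string.Equals(mode, "Hot", StringComparison.OrdinalIgnoreCase))
            return RedundancySupportSketch.Hot;
        return RedundancySupportSketch.None;
    }

    // Primary keeps ServiceLevelBase; Secondary gets ServiceLevelBase - 50.
    public static int BaselineFor(string role, int serviceLevelBase)
    {
        bool isSecondary = string.Equals(role, "Secondary", StringComparison.OrdinalIgnoreCase);
        return isSecondary ? serviceLevelBase - 50 : serviceLevelBase;
    }
}
```

With the default `ServiceLevelBase` of 200, this yields a healthy baseline of 200 for the primary and 150 for the secondary.
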
### App root updates

**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/AppConfiguration.cs`

- Add `public RedundancyConfiguration Redundancy { get; set; } = new RedundancyConfiguration();`

---

## Implementation Steps

### Step 1: Add RedundancyConfiguration model and bind it

**Files:**

- `src/.../Configuration/RedundancyConfiguration.cs` (new)
- `src/.../Configuration/AppConfiguration.cs`
- `src/.../OpcUaService.cs`

Changes:

1. Create `RedundancyConfiguration` class with properties above
2. Add `Redundancy` property to `AppConfiguration`
3. Bind `configuration.GetSection("Redundancy").Bind(_config.Redundancy);`
4. Pass `_config.Redundancy` through to `OpcUaServerHost` and `LmxOpcUaServer`

### Step 2: Add RedundancyModeResolver

**File:** `src/.../OpcUa/RedundancyModeResolver.cs` (new)

Responsibilities:

- Map `Mode` string to `RedundancySupport` enum value
- Validate against supported Phase 1 modes (`Warm`, `Hot`)
- Fall back to `None` with warning for unknown modes

```csharp
public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled);
}
```

### Step 3: Add ServiceLevelCalculator

**File:** `src/.../OpcUa/ServiceLevelCalculator.cs` (new)

Computes the dynamic `ServiceLevel` byte from runtime health:

```csharp
public class ServiceLevelCalculator
{
    public byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary);
}
```

Logic:

- Start with `baseLine` (from config, e.g., 200 for Primary, 150 for Secondary)
- Subtract 100 if MXAccess is disconnected
- Subtract 50 if Galaxy DB is unreachable
- Clamp to 0–255
- Return 0 if both MXAccess and DB are down

### Step 4: Extend ConfigurationValidator for redundancy

**File:** `src/.../Configuration/ConfigurationValidator.cs`

Add validation/logging for:

- `Redundancy.Enabled`, `Mode`, `Role`
- `ServerUris` should not be empty when `Enabled = true`
- `ServiceLevelBase` should be 1–255
- Warning when `Enabled = true` but `ServerUris` has fewer than 2 entries
- Log effective redundancy configuration at startup

### Step 5: Update LmxOpcUaServer to expose redundancy state

**File:** `src/.../OpcUa/LmxOpcUaServer.cs`

Changes:

1. Accept `RedundancyConfiguration` in the constructor
2. Override `OnServerStarted` to write redundancy nodes:
   - Set `Server/ServerRedundancy/RedundancySupport` to the resolved mode
   - Set `Server/ServerRedundancy/ServerUriArray` to the configured URIs
3. Override `SetServerState` or use a timer to update `Server/ServiceLevel` periodically based on `ServiceLevelCalculator`
4. Expose a method `UpdateServiceLevel(bool mxConnected, bool dbConnected)` that the service layer can call when health state changes

### Step 6: Update OpcUaServerHost to pass redundancy config

**File:** `src/.../OpcUa/OpcUaServerHost.cs`

Changes:

1. Accept `RedundancyConfiguration` in the constructor
2. Pass it through to `LmxOpcUaServer`
3. Log active redundancy mode at startup

### Step 7: Wire ServiceLevel updates in OpcUaService

**File:** `src/.../OpcUaService.cs`

Changes:

1. Bind redundancy config section
2. Pass redundancy config to `OpcUaServerHost`
3. Subscribe to `MxAccessClient.ConnectionStateChanged` to trigger `ServiceLevel` updates
4. After Galaxy DB health checks, trigger `ServiceLevel` updates
5. Use a periodic timer (e.g., every 5 seconds) to refresh `ServiceLevel` based on current component health

### Step 8: Update appsettings.json

**File:** `src/.../appsettings.json`

Add the `Redundancy` section with backward-compatible defaults (`Enabled: false`).

### Step 9: Update OpcUaServiceBuilder for test injection

**File:** `src/.../OpcUaServiceBuilder.cs`

Add `WithRedundancy(RedundancyConfiguration)` builder method so tests can inject redundancy configuration.

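
The penalty logic listed for `ServiceLevelCalculator` in Step 3 can be sketched as below. This is a minimal example, not the final implementation. It assumes `baseLine` is already role-adjusted (200 for Primary, 150 for Secondary, as the Step 3 logic states), so the `isPrimary` parameter from the planned signature is omitted here to avoid applying the secondary offset twice; the class name `ServiceLevelCalculatorSketch` is hypothetical.

```csharp
using System;

// Hypothetical sketch of the Step 3 logic; the real class may differ.
public class ServiceLevelCalculatorSketch
{
    // baseLine is assumed to be the role-adjusted starting value.
    public byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected)
    {
        // Both dependencies down: the server is not operational.
        if (!mxAccessConnected && !dbConnected)
            return 0;

        int level = baseLine;
        if (!mxAccessConnected)
            level -= 100; // MXAccess disconnected
        if (!dbConnected)
            level -= 50;  // Galaxy DB unreachable

        // Clamp into the valid Byte range before casting.
        return (byte)Math.Max(0, Math.Min(255, level));
    }
}
```

For example, a primary with a baseline of 200 reports 100 while MXAccess is disconnected and 150 while only the Galaxy DB is unreachable; a secondary with a baseline of 150 reports 50 and 100 in the same situations.
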
### Step 10: Add CLI `redundancy` command

**Files:**

- `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` (new)

Command: `redundancy`

Reads from the target server:

- `Server/ServerRedundancy/RedundancySupport` (i=2035)
- `Server/ServiceLevel` (i=2267)
- `Server/ServerRedundancy/ServerUriArray` (i=11314, if non-transparent redundancy)

Output format:

```
Redundancy Mode: Warm
Service Level: 200
Server URIs:
  - urn:ZB:LmxOpcUa
  - urn:ZB:LmxOpcUa2
```

Options: `--url`, `--username`, `--password`, `--security` (same shared options as other commands).

### Step 11: Deploy second service instance

**Deployment target:** `C:\publish\lmxopcua\instance2`

Configuration differences from instance1:

| Setting | instance1 | instance2 |
|---|---|---|
| `OpcUa.Port` | `4840` | `4841` |
| `OpcUa.ServerName` | `LmxOpcUa` | `LmxOpcUa2` |
| `Dashboard.Port` | `8083` | `8084` |
| `Redundancy.Enabled` | `true` | `true` |
| `Redundancy.Role` | `Primary` | `Secondary` |
| `Redundancy.Mode` | `Warm` | `Warm` |
| `Redundancy.ServerUris` | `["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"]` | `["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"]` |
| `Redundancy.ServiceLevelBase` | `200` | `200` |

Windows service for instance2:

- Name: `LmxOpcUa2`
- Display name: `LMX OPC UA Server (Instance 2)`
- Executable: `C:\publish\lmxopcua\instance2\ZB.MOM.WW.LmxOpcUa.Host.exe`

Both instances share the same Galaxy DB (`ZB`) and MXAccess runtime. The `GalaxyName` remains `ZB` for both so they expose the same namespace.

Update `service_info.md` with the second instance details.

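
The instance1/instance2 differences in Step 11 lend themselves to a quick consistency check before deployment. The sketch below is one possible shape for such a check, not something this plan requires: `InstanceSettingsSketch`, `RedundantPairCheckSketch`, and `FindProblems` are hypothetical names, the check assumes both instances run on the same host (so all four ports must be distinct), and it compares `ServerUris` in order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical view of the settings that differ between the two instances.
public class InstanceSettingsSketch
{
    public int OpcUaPort;
    public int DashboardPort;
    public string Role;
    public List<string> ServerUris = new List<string>();
}

// Hypothetical pre-deployment check for a redundant pair on one host.
public static class RedundantPairCheckSketch
{
    public static List<string> FindProblems(InstanceSettingsSketch a, InstanceSettingsSketch b)
    {
        var problems = new List<string>();

        // On a shared host, every listening port must be unique.
        var ports = new[] { a.OpcUaPort, a.DashboardPort, b.OpcUaPort, b.DashboardPort };
        if (ports.Distinct().Count() != ports.Length)
            problems.Add("Port conflict between instances");

        // One Primary and one Secondary are expected.
        if (string.Equals(a.Role, b.Role, StringComparison.OrdinalIgnoreCase))
            problems.Add("Both instances have the same Role");

        // Both servers should advertise the same redundant set.
        if (!a.ServerUris.SequenceEqual(b.ServerUris, StringComparer.Ordinal))
            problems.Add("ServerUris differ between instances");

        return problems;
    }
}
```

Run against the values in the table above, a check like this returns an empty list; reusing port 4840 for instance2 would produce a port-conflict entry.
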
---

## Test Plan

### Unit tests: RedundancyModeResolver

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyModeResolverTests.cs`

| Test | Description |
|---|---|
| `Resolve_Disabled_ReturnsNone` | `Enabled=false` always returns `RedundancySupport.None` |
| `Resolve_Warm_ReturnsWarm` | `Mode="Warm"` maps to `RedundancySupport.Warm` |
| `Resolve_Hot_ReturnsHot` | `Mode="Hot"` maps to `RedundancySupport.Hot` |
| `Resolve_Unknown_FallsBackToNone` | Unknown mode falls back safely |
| `Resolve_CaseInsensitive` | `"warm"` and `"WARM"` both resolve |

### Unit tests: ServiceLevelCalculator

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/ServiceLevelCalculatorTests.cs`

| Test | Description |
|---|---|
| `FullyHealthy_Primary_ReturnsBase` | All healthy, primary role → `ServiceLevelBase` |
| `FullyHealthy_Secondary_ReturnsBaseMinusFifty` | All healthy, secondary role → `ServiceLevelBase - 50` |
| `MxAccessDown_ReducesServiceLevel` | MXAccess disconnected subtracts 100 |
| `DbDown_ReducesServiceLevel` | DB unreachable subtracts 50 |
| `BothDown_ReturnsZero` | MXAccess + DB both down → 0 |
| `ClampedTo255` | Base of 255 with healthy → 255 |
| `ClampedToZero` | Heavy penalties don't go negative |

### Unit tests: RedundancyConfiguration defaults

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyConfigurationTests.cs`

| Test | Description |
|---|---|
| `DefaultConfig_Disabled` | `Enabled` defaults to `false` |
| `DefaultConfig_ModeWarm` | `Mode` defaults to `"Warm"` |
| `DefaultConfig_RolePrimary` | `Role` defaults to `"Primary"` |
| `DefaultConfig_EmptyServerUris` | `ServerUris` defaults to empty |
| `DefaultConfig_ServiceLevelBase200` | `ServiceLevelBase` defaults to `200` |

### Updates to existing configuration tests

**File:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Configuration/ConfigurationLoadingTests.cs`

Add:

- `Redundancy_Section_BindsCorrectly`: verify binding from appsettings.json
- `Redundancy_Section_BindsCustomValues`: in-memory override test
- `Validator_RedundancyEnabled_EmptyServerUris_ReturnsTrue_WithWarning`: validates but warns
- `Validator_RedundancyEnabled_InvalidServiceLevelBase_ReturnsFalse`: rejects 0 or >255

### Integration tests: redundancy E2E

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Integration/RedundancyTests.cs`

These tests start two in-process OPC UA servers with redundancy enabled and verify client-visible behavior:

| Test | Description |
|---|---|
| `Server_WithRedundancyDisabled_ReportsNone` | Default config → `RedundancySupport.None`, `ServiceLevel=255` |
| `Server_WithRedundancyEnabled_ReportsConfiguredMode` | `Enabled=true, Mode=Warm` → `RedundancySupport.Warm` |
| `Server_WithRedundancyEnabled_ExposesServerUriArray` | Client can read `ServerUriArray` and it matches config |
| `Server_Primary_HasHigherServiceLevel_ThanSecondary` | Primary server reports higher `ServiceLevel` than secondary |
| `TwoServers_BothExposeSameRedundantSet` | Two server fixtures, both report the same `ServerUriArray` |
| `Server_ServiceLevel_DropsWith_MxAccessDisconnect` | Simulate MXAccess disconnect → `ServiceLevel` decreases |

Pattern: Use `OpcUaServerFixture.WithFakeMxAccessClient()` with redundancy config injected, connect with `OpcUaTestClient`, read the standard OPC UA redundancy nodes.

---

## Documentation Plan

### New file: `docs/Redundancy.md`

Contents:

1. Overview of OPC UA non-transparent redundancy
2. Redundancy configuration section reference (`Enabled`, `Mode`, `Role`, `ServerUris`, `ServiceLevelBase`)
3. ServiceLevel computation logic and degraded-state penalties
4. How clients discover and fail over between instances
5. Deployment guide for a two-instance redundant pair (ports, service names, shared Galaxy DB)
6. CLI `redundancy` command usage
7. Troubleshooting: mismatched `ServerUris`, ServiceLevel stuck at 0, etc.

### Updates to existing docs

| File | Changes |
|---|---|
| `docs/Configuration.md` | Add `Redundancy` section table, example JSON, add to validation rules list, update example appsettings.json |
| `docs/OpcUaServer.md` | Add redundancy state exposure section, link to `Redundancy.md` |
| `docs/CliTool.md` | Add `redundancy` command documentation |
| `docs/ServiceHosting.md` | Add multi-instance deployment notes |
| `README.md` | Add `Redundancy` to the component documentation table, mention redundancy in Quick Start |
| `CLAUDE.md` | Add redundancy architecture note |

### Update: `service_info.md`

Add a second section documenting `instance2`:

- Path: `C:\publish\lmxopcua\instance2`
- Windows service name: `LmxOpcUa2`
- Port: `4841`
- Dashboard port: `8084`
- Redundancy role: `Secondary`
- Endpoint: `opc.tcp://localhost:4841/LmxOpcUa`

---

## File Change Summary

| File | Action | Description |
|---|---|---|
| `src/.../Configuration/RedundancyConfiguration.cs` | New | Redundancy config model |
| `src/.../Configuration/AppConfiguration.cs` | Modify | Add `Redundancy` section |
| `src/.../Configuration/ConfigurationValidator.cs` | Modify | Validate/log redundancy settings |
| `src/.../OpcUa/RedundancyModeResolver.cs` | New | Mode string → `RedundancySupport` enum |
| `src/.../OpcUa/ServiceLevelCalculator.cs` | New | Dynamic ServiceLevel from health state |
| `src/.../OpcUa/LmxOpcUaServer.cs` | Modify | Expose redundancy nodes, accept ServiceLevel updates |
| `src/.../OpcUa/OpcUaServerHost.cs` | Modify | Pass redundancy config through |
| `src/.../OpcUaService.cs` | Modify | Bind redundancy config, wire ServiceLevel updates |
| `src/.../OpcUaServiceBuilder.cs` | Modify | Add `WithRedundancy()` builder |
| `src/.../appsettings.json` | Modify | Add `Redundancy` section |
| `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` | New | CLI command to read redundancy info |
| `tests/.../Redundancy/RedundancyModeResolverTests.cs` | New | Mode resolver unit tests |
| `tests/.../Redundancy/ServiceLevelCalculatorTests.cs` | New | ServiceLevel computation tests |
| `tests/.../Redundancy/RedundancyConfigurationTests.cs` | New | Config defaults tests |
| `tests/.../Configuration/ConfigurationLoadingTests.cs` | Modify | Binding + validation tests |
| `tests/.../Integration/RedundancyTests.cs` | New | E2E two-server redundancy tests |
| `tests/.../Helpers/OpcUaServerFixture.cs` | Modify | Accept redundancy config |
| `docs/Redundancy.md` | New | Dedicated redundancy component doc |
| `docs/Configuration.md` | Modify | Add Redundancy section |
| `docs/OpcUaServer.md` | Modify | Add redundancy state section |
| `docs/CliTool.md` | Modify | Add redundancy command |
| `docs/ServiceHosting.md` | Modify | Multi-instance notes |
| `README.md` | Modify | Add Redundancy to component table |
| `CLAUDE.md` | Modify | Add redundancy architecture note |
| `service_info.md` | Modify | Add instance2 details |

---

## Verification Guardrails

Each step must pass these gates before proceeding to the next:

### Gate 1: Build (after each implementation step)

```bash
dotnet build ZB.MOM.WW.LmxOpcUa.slnx
```

Must produce 0 errors. Proceed only when green.

### Gate 2: Unit tests (after steps 1–4, 9)

```bash
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests
```

All existing + new tests must pass. No regressions.

### Gate 3: Integration tests (after steps 5–7)

```bash
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~Integration.RedundancyTests"
```

All redundancy E2E tests must pass.

### Gate 4: CLI tool builds (after step 10)

```bash
cd tools/opcuacli-dotnet && dotnet build
```

Must compile without errors.

### Gate 5: Manual verification, single instance (after step 8)

```bash
# Publish and start with Redundancy.Enabled=false
opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
# Should report: RedundancySupport=None, ServiceLevel=255
```

### Gate 6: Manual verification, redundant pair (after step 11)

```bash
# Start both instances
sc start LmxOpcUa
sc start LmxOpcUa2

# Verify instance1 (Primary)
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
# Should report: RedundancySupport=Warm, ServiceLevel=200, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]

# Verify instance2 (Secondary)
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa
# Should report: RedundancySupport=Warm, ServiceLevel=150, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]

# Both instances should serve the same Galaxy address space
opcuacli-dotnet.exe browse -u opc.tcp://localhost:4840/LmxOpcUa -r -d 2
opcuacli-dotnet.exe browse -u opc.tcp://localhost:4841/LmxOpcUa -r -d 2
```

### Gate 7: Full test suite (final)

```bash
dotnet test ZB.MOM.WW.LmxOpcUa.slnx
```

All tests across all projects must pass.

### Gate 8: Documentation review

- All new/modified doc files render correctly in Markdown
- Example JSON snippets match the actual `appsettings.json`
- CLI examples use correct flags and expected output
- `service_info.md` accurately reflects both deployed instances

---

## Risks and Considerations

1. **Backward compatibility**: `Redundancy.Enabled = false` must be the default so existing single-instance deployments are unaffected.
2. **ServiceLevel timing**: Updates must not race with OPC UA publish cycles. Use the server's internal lock or `ServerInternal` APIs.
3. **ServerUriArray immutability**: The OPC UA spec expects this to be static during a server session. Changes require a server restart.
4. **MXAccess shared state**: Both instances connect to the same MXAccess runtime. If MXAccess has per-client registration limits, verify that two clients can coexist.
5. **Galaxy DB contention**: Both instances poll for deploy changes. Ensure change detection doesn't trigger duplicate rebuilds or locking issues.
6. **Port conflicts**: The second instance must use different ports for OPC UA (4841) and Dashboard (8084).
7. **Certificate identity**: Each instance needs its own application certificate with a distinct `SubjectName` matching its `ServerName`.

---

## Execution Order

1. Steps 1–4: Config model, resolver, calculator, validator (unit-testable in isolation)
2. **Gate 1 + Gate 2**: Build + unit tests pass
3. Steps 5–7: Server integration (redundancy nodes, ServiceLevel wiring)
4. **Gate 1 + Gate 2 + Gate 3**: Build + all tests including E2E
5. Step 8: Update appsettings.json
6. **Gate 5**: Manual single-instance verification
7. Step 9: Update service builder for tests
8. Step 10: CLI redundancy command
9. **Gate 4**: CLI builds
10. Step 11: Deploy second instance + update service_info.md
11. **Gate 6**: Manual two-instance verification
12. Documentation updates (all doc files)
13. **Gate 7 + Gate 8**: Full test suite + documentation review
14. Commit and push