diff --git a/redundancy.md b/redundancy.md
new file mode 100644
index 0000000..106a8df
--- /dev/null
+++ b/redundancy.md
@@ -0,0 +1,508 @@
+# OPC UA Server Redundancy Plan
+
+## Summary
+
+Add configurable non-transparent warm/hot redundancy to the LmxOpcUa server so that two instances sharing the same Galaxy repository can operate as a redundant pair. Each instance advertises itself and its partner through the OPC UA `ServerRedundancy` node, publishes a dynamic `ServiceLevel` reflecting runtime health, and allows clients to discover the redundant set and fail over between instances. The CLI tool gains a `redundancy` command for inspecting the redundant server set.
+
+This plan covers server-side redundancy exposure, client-side discovery, a second deployed service instance, documentation, and tests. It does **not** implement automatic server-side failover or subscription transfer; those are client responsibilities per the OPC UA specification.
+
+---
+
+## Background: OPC UA Redundancy Model
+
+OPC UA exposes redundancy through three address-space nodes on the `Server` object: `RedundancySupport` and `ServerUriArray` live under `Server/ServerRedundancy`, while `ServiceLevel` sits directly under `Server`.
+
+| Node | Type | Purpose |
+|---|---|---|
+| `RedundancySupport` | `RedundancySupport` enum | Declares the redundancy mode: `None`, `Cold`, `Warm`, `Hot`, `Transparent`, `HotAndMirrored` |
+| `ServerUriArray` | `String[]` | Lists the `ApplicationUri` values of all servers in the redundant set (non-transparent modes) |
+| `ServiceLevel` | `Byte` (0–255) | Indicates current operational quality; clients prefer the server with the highest value |
+
+### Non-Transparent Redundancy (our target)
+
+In non-transparent redundancy (`Warm` or `Hot`), both servers run independently with their own sessions and subscriptions. Clients discover the redundant set by reading `ServerUriArray`, monitor `ServiceLevel` on each server, and manage their own failover. This model fits our architecture, where each instance connects to the same Galaxy repository and MXAccess runtime independently.
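+
+As a rough illustration of the client-side choice described above, the sketch below picks the preferred server from `ServiceLevel` values a client has already read. `ServerCandidate` and `FailoverSelector` are hypothetical names used only for this example; they are not part of the LmxOpcUa code or the OPC UA SDK, and how the values are read from each server is left out.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical client-side view of one server in the redundant set.
+public sealed class ServerCandidate
+{
+    public ServerCandidate(string applicationUri, byte serviceLevel)
+    {
+        ApplicationUri = applicationUri;
+        ServiceLevel = serviceLevel;
+    }
+
+    public string ApplicationUri { get; }
+    public byte ServiceLevel { get; }
+}
+
+public static class FailoverSelector
+{
+    // Returns the candidate with the highest ServiceLevel, or null when the
+    // list is empty or every server reports 0 (not operational).
+    public static ServerCandidate PickPreferred(IEnumerable<ServerCandidate> candidates)
+    {
+        return candidates
+            .Where(c => c.ServiceLevel > 0)
+            .OrderByDescending(c => c.ServiceLevel)
+            .FirstOrDefault();
+    }
+}
+```
+
+A client would typically re-run this selection whenever a monitored `ServiceLevel` changes, and reconnect only if the preferred server differs from the one it is currently using.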
+
+### ServiceLevel Semantics
+
+| Range | Meaning |
+|---|---|
+| 0 | Server is not operational |
+| 1–99 | Degraded (e.g., MXAccess disconnected, DB unreachable) |
+| 100–199 | Healthy secondary |
+| 200–255 | Healthy primary (preferred) |
+
+The primary server should advertise a higher `ServiceLevel` than the secondary so clients prefer it when both are healthy.
+
+---
+
+## Current State
+
+- `LmxOpcUaServer` extends `StandardServer` but does not override any redundancy-related methods
+- `ServerRedundancy/RedundancySupport` defaults to `None` (SDK default)
+- `ServiceLevel` defaults to `255` (SDK default, "fully operational")
+- No configuration for redundant partner URIs or role designation
+- Single deployed instance at `C:\publish\lmxopcua\instance1` on port 4840
+- No CLI support for reading redundancy information
+
+---
+
+## Scope
+
+### In Scope (Phase 1)
+
+1. **Redundancy configuration model**: role, partner URIs, ServiceLevel weights
+2. **Server redundancy node exposure**: `RedundancySupport`, `ServerUriArray`, dynamic `ServiceLevel`
+3. **ServiceLevel computation**: based on runtime health (MXAccess state, DB connectivity, role)
+4. **CLI redundancy command**: read `RedundancySupport`, `ServerUriArray`, `ServiceLevel` from a server
+5. **Second service instance**: deployed at `C:\publish\lmxopcua\instance2` with non-overlapping ports
+6. **Documentation**: new `docs/Redundancy.md` component doc, updates to existing docs
+7. **Unit tests**: config, ServiceLevel computation, resolver tests
+8. **Integration tests**: two-server redundancy E2E test in the integration test project
+
+### Deferred
+
+- Automatic subscription transfer (client-side responsibility)
+- Server-initiated failover (Galaxy `redundancy` table / engine flags)
+- Transparent redundancy mode
+- Health-check HTTP endpoint for load balancers
+
+---
+
+## Configuration Design
+
+### New `Redundancy` section in `appsettings.json`
+
+```json
+{
+  "Redundancy": {
+    "Enabled": false,
+    "Mode": "Warm",
+    "Role": "Primary",
+    "ServerUris": [],
+    "ServiceLevelBase": 200
+  }
+}
+```
+
+### Configuration model
+
+**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/RedundancyConfiguration.cs` (new)
+
+```csharp
+using System.Collections.Generic;
+
+public class RedundancyConfiguration
+{
+    public bool Enabled { get; set; } = false;
+    public string Mode { get; set; } = "Warm";      // "Warm" or "Hot"
+    public string Role { get; set; } = "Primary";   // "Primary" or "Secondary"
+    // ApplicationUri values of every server in the redundant set
+    public List<string> ServerUris { get; set; } = new List<string>();
+    public int ServiceLevelBase { get; set; } = 200;
+}
+```
+
+### Configuration rules
+
+- `Enabled` defaults to `false` for backward compatibility. When `false`, `RedundancySupport = None` and `ServiceLevel = 255` (SDK defaults).
+- `Mode` must be `Warm` or `Hot` (Phase 1). Maps to `RedundancySupport.Warm` or `RedundancySupport.Hot`.
+- `Role` must be `Primary` or `Secondary`. Controls the base `ServiceLevel` (Primary gets `ServiceLevelBase`, Secondary gets `ServiceLevelBase - 50`).
+- `ServerUris` lists the `ApplicationUri` values for **all** servers in the redundant set, including the local server. The OPC UA spec requires this to contain the full set. These are application URIs like `urn:ZB:LmxOpcUa`, not endpoint URLs.
+- `ServiceLevelBase` is the starting ServiceLevel when the server is fully healthy. Degraded conditions subtract from this value.
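+
+The rules above could be checked along the following lines. This is only a sketch: `RedundancySettings` is a stand-in that mirrors the fields of `RedundancyConfiguration` so the example compiles on its own, `RedundancyRules` is a hypothetical helper, and case-insensitive matching for `Role` is an assumption. `ServerUris` checks are left out because the plan treats them as warnings rather than hard failures.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Stand-in mirroring RedundancyConfiguration (hypothetical, for illustration only).
+public class RedundancySettings
+{
+    public bool Enabled { get; set; }
+    public string Mode { get; set; } = "Warm";
+    public string Role { get; set; } = "Primary";
+    public int ServiceLevelBase { get; set; } = 200;
+}
+
+public static class RedundancyRules
+{
+    // Returns the list of rule violations; an empty list means the settings pass.
+    public static List<string> Check(RedundancySettings s)
+    {
+        var errors = new List<string>();
+        if (!s.Enabled) return errors; // disabled: SDK defaults apply, nothing to check
+
+        if (!IsOneOf(s.Mode, "Warm", "Hot"))
+            errors.Add("Mode must be Warm or Hot");
+        if (!IsOneOf(s.Role, "Primary", "Secondary"))
+            errors.Add("Role must be Primary or Secondary");
+        if (s.ServiceLevelBase < 1 || s.ServiceLevelBase > 255)
+            errors.Add("ServiceLevelBase must be in the range 1-255");
+        return errors;
+    }
+
+    // Case-insensitive membership test (assumption: config values ignore case).
+    private static bool IsOneOf(string value, params string[] allowed)
+    {
+        foreach (var a in allowed)
+            if (string.Equals(value, a, StringComparison.OrdinalIgnoreCase)) return true;
+        return false;
+    }
+}
+```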
+
+### App root updates
+
+**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/AppConfiguration.cs`
+
+- Add `public RedundancyConfiguration Redundancy { get; set; } = new RedundancyConfiguration();`
+
+---
+
+## Implementation Steps
+
+### Step 1: Add RedundancyConfiguration model and bind it
+
+**Files:**
+- `src/.../Configuration/RedundancyConfiguration.cs` (new)
+- `src/.../Configuration/AppConfiguration.cs`
+- `src/.../OpcUaService.cs`
+
+Changes:
+1. Create `RedundancyConfiguration` class with properties above
+2. Add `Redundancy` property to `AppConfiguration`
+3. Bind `configuration.GetSection("Redundancy").Bind(_config.Redundancy);`
+4. Pass `_config.Redundancy` through to `OpcUaServerHost` and `LmxOpcUaServer`
+
+### Step 2: Add RedundancyModeResolver
+
+**File:** `src/.../OpcUa/RedundancyModeResolver.cs` (new)
+
+Responsibilities:
+- Map `Mode` string to `RedundancySupport` enum value
+- Validate against supported Phase 1 modes (`Warm`, `Hot`)
+- Fall back to `None` with warning for unknown modes
+
+```csharp
+public static class RedundancyModeResolver
+{
+    // Signature only; the body is part of the implementation work.
+    public static RedundancySupport Resolve(string mode, bool enabled);
+}
+```
+
+### Step 3: Add ServiceLevelCalculator
+
+**File:** `src/.../OpcUa/ServiceLevelCalculator.cs` (new)
+
+Computes the dynamic `ServiceLevel` byte from runtime health:
+
+```csharp
+public class ServiceLevelCalculator
+{
+    // Signature only; the body is part of the implementation work.
+    public byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary);
+}
+```
+
+Logic:
+- Start with `baseLine` (from config, e.g., 200 for Primary, 150 for Secondary)
+- Subtract 100 if MXAccess is disconnected
+- Subtract 50 if Galaxy DB is unreachable
+- Clamp to 0–255
+- Return 0 if both MXAccess and DB are down
+
+### Step 4: Extend ConfigurationValidator for redundancy
+
+**File:** `src/.../Configuration/ConfigurationValidator.cs`
+
+Add validation/logging for:
+- `Redundancy.Enabled`, `Mode`, `Role`
+- `ServerUris` should not be empty when `Enabled = true`
+- `ServiceLevelBase` should be 1–255
+- Warning when `Enabled = true` but `ServerUris` has fewer than 2 entries
+- Log effective redundancy configuration at startup
+
+### Step 5: Update LmxOpcUaServer to expose redundancy state
+
+**File:** `src/.../OpcUa/LmxOpcUaServer.cs`
+
+Changes:
+1. Accept `RedundancyConfiguration` in the constructor
+2. Override `OnServerStarted` to write redundancy nodes:
+   - Set `Server/ServerRedundancy/RedundancySupport` to the resolved mode
+   - Set `Server/ServerRedundancy/ServerUriArray` to the configured URIs
+3. Override `SetServerState` or use a timer to update `Server/ServiceLevel` periodically based on `ServiceLevelCalculator`
+4. Expose a method `UpdateServiceLevel(bool mxConnected, bool dbConnected)` that the service layer can call when health state changes
+
+### Step 6: Update OpcUaServerHost to pass redundancy config
+
+**File:** `src/.../OpcUa/OpcUaServerHost.cs`
+
+Changes:
+1. Accept `RedundancyConfiguration` in the constructor
+2. Pass it through to `LmxOpcUaServer`
+3. Log active redundancy mode at startup
+
+### Step 7: Wire ServiceLevel updates in OpcUaService
+
+**File:** `src/.../OpcUaService.cs`
+
+Changes:
+1. Bind redundancy config section
+2. Pass redundancy config to `OpcUaServerHost`
+3. Subscribe to `MxAccessClient.ConnectionStateChanged` to trigger `ServiceLevel` updates
+4. After Galaxy DB health checks, trigger `ServiceLevel` updates
+5. Use a periodic timer (e.g., every 5 seconds) to refresh `ServiceLevel` based on current component health
+
+### Step 8: Update appsettings.json
+
+**File:** `src/.../appsettings.json`
+
+Add the `Redundancy` section with backward-compatible defaults (`Enabled: false`).
+
+### Step 9: Update OpcUaServiceBuilder for test injection
+
+**File:** `src/.../OpcUaServiceBuilder.cs`
+
+Add `WithRedundancy(RedundancyConfiguration)` builder method so tests can inject redundancy configuration.
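+
+Returning to the penalty rules listed in Step 3, a minimal sketch of the arithmetic might look like the following. `ServiceLevelSketch` is a hypothetical name, and the sketch assumes the 50-point secondary offset is applied inside the calculation (so `baseLine` is the configured `ServiceLevelBase`); the plan leaves open whether the caller applies that offset instead.
+
+```csharp
+using System;
+
+public static class ServiceLevelSketch
+{
+    // Assumption: baseLine is the configured ServiceLevelBase and the
+    // secondary offset (50) is applied here rather than by the caller.
+    public static byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary)
+    {
+        // Both dependencies down: report "not operational".
+        if (!mxAccessConnected && !dbConnected) return 0;
+
+        int level = isPrimary ? baseLine : baseLine - 50;
+        if (!mxAccessConnected) level -= 100; // MXAccess disconnected
+        if (!dbConnected) level -= 50;        // Galaxy DB unreachable
+
+        // Clamp into the Byte range used by ServiceLevel.
+        return (byte)Math.Max(0, Math.Min(255, level));
+    }
+}
+```
+
+With a base of 200, a healthy primary yields 200 and a healthy secondary yields 150; a primary that loses MXAccess drops to 100, below the healthy secondary, so clients following the highest `ServiceLevel` would move to the secondary.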
+
+### Step 10: Add CLI `redundancy` command
+
+**Files:**
+- `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` (new)
+
+Command: `redundancy`
+
+Reads from the target server:
+- `Server/ServerRedundancy/RedundancySupport` (i=3709)
+- `Server/ServiceLevel` (i=2267)
+- `Server/ServerRedundancy/ServerUriArray` (i=11314, if non-transparent redundancy)
+
+Output format:
+```
+Redundancy Mode: Warm
+Service Level: 200
+Server URIs:
+  - urn:ZB:LmxOpcUa
+  - urn:ZB:LmxOpcUa2
+```
+
+Options: `--url`, `--username`, `--password`, `--security` (same shared options as other commands).
+
+### Step 11: Deploy second service instance
+
+**Deployment target:** `C:\publish\lmxopcua\instance2`
+
+Configuration differences from instance1:
+
+| Setting | instance1 | instance2 |
+|---|---|---|
+| `OpcUa.Port` | `4840` | `4841` |
+| `OpcUa.ServerName` | `LmxOpcUa` | `LmxOpcUa2` |
+| `Dashboard.Port` | `8083` | `8084` |
+| `Redundancy.Enabled` | `true` | `true` |
+| `Redundancy.Role` | `Primary` | `Secondary` |
+| `Redundancy.Mode` | `Warm` | `Warm` |
+| `Redundancy.ServerUris` | `["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"]` | `["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"]` |
+| `Redundancy.ServiceLevelBase` | `200` | `200` |
+
+Windows service for instance2:
+- Name: `LmxOpcUa2`
+- Display name: `LMX OPC UA Server (Instance 2)`
+- Executable: `C:\publish\lmxopcua\instance2\ZB.MOM.WW.LmxOpcUa.Host.exe`
+
+Both instances share the same Galaxy DB (`ZB`) and MXAccess runtime. The `GalaxyName` remains `ZB` for both so they expose the same namespace.
+
+Update `service_info.md` with the second instance details.
+
+---
+
+## Test Plan
+
+### Unit tests: RedundancyModeResolver
+
+**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyModeResolverTests.cs`
+
+| Test | Description |
+|---|---|
+| `Resolve_Disabled_ReturnsNone` | `Enabled=false` always returns `RedundancySupport.None` |
+| `Resolve_Warm_ReturnsWarm` | `Mode="Warm"` maps to `RedundancySupport.Warm` |
+| `Resolve_Hot_ReturnsHot` | `Mode="Hot"` maps to `RedundancySupport.Hot` |
+| `Resolve_Unknown_FallsBackToNone` | Unknown mode falls back safely |
+| `Resolve_CaseInsensitive` | `"warm"` and `"WARM"` both resolve |
+
+### Unit tests: ServiceLevelCalculator
+
+**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/ServiceLevelCalculatorTests.cs`
+
+| Test | Description |
+|---|---|
+| `FullyHealthy_Primary_ReturnsBase` | All healthy, primary role → `ServiceLevelBase` |
+| `FullyHealthy_Secondary_ReturnsBaseMinusFifty` | All healthy, secondary role → `ServiceLevelBase - 50` |
+| `MxAccessDown_ReducesServiceLevel` | MXAccess disconnected subtracts 100 |
+| `DbDown_ReducesServiceLevel` | DB unreachable subtracts 50 |
+| `BothDown_ReturnsZero` | MXAccess + DB both down → 0 |
+| `ClampedTo255` | Base of 255 with healthy → 255 |
+| `ClampedToZero` | Heavy penalties don't go negative |
+
+### Unit tests: RedundancyConfiguration defaults
+
+**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyConfigurationTests.cs`
+
+| Test | Description |
+|---|---|
+| `DefaultConfig_Disabled` | `Enabled` defaults to `false` |
+| `DefaultConfig_ModeWarm` | `Mode` defaults to `"Warm"` |
+| `DefaultConfig_RolePrimary` | `Role` defaults to `"Primary"` |
+| `DefaultConfig_EmptyServerUris` | `ServerUris` defaults to empty |
+| `DefaultConfig_ServiceLevelBase200` | `ServiceLevelBase` defaults to `200` |
+
+### Updates to existing configuration tests
+
+**File:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Configuration/ConfigurationLoadingTests.cs`
+
+Add:
+- `Redundancy_Section_BindsCorrectly`: verify binding from appsettings.json
+- `Redundancy_Section_BindsCustomValues`: in-memory override test
+- `Validator_RedundancyEnabled_EmptyServerUris_ReturnsTrue_WithWarning`: validates but warns
+- `Validator_RedundancyEnabled_InvalidServiceLevelBase_ReturnsFalse`: rejects 0 or >255
+
+### Integration tests: redundancy E2E
+
+**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Integration/RedundancyTests.cs`
+
+These tests start two in-process OPC UA servers with redundancy enabled and verify client-visible behavior:
+
+| Test | Description |
+|---|---|
+| `Server_WithRedundancyDisabled_ReportsNone` | Default config → `RedundancySupport.None`, `ServiceLevel=255` |
+| `Server_WithRedundancyEnabled_ReportsConfiguredMode` | `Enabled=true, Mode=Warm` → `RedundancySupport.Warm` |
+| `Server_WithRedundancyEnabled_ExposesServerUriArray` | Client can read `ServerUriArray` and it matches config |
+| `Server_Primary_HasHigherServiceLevel_ThanSecondary` | Primary server reports higher `ServiceLevel` than secondary |
+| `TwoServers_BothExposeSameRedundantSet` | Two server fixtures, both report the same `ServerUriArray` |
+| `Server_ServiceLevel_DropsWith_MxAccessDisconnect` | Simulate MXAccess disconnect → `ServiceLevel` decreases |
+
+Pattern: Use `OpcUaServerFixture.WithFakeMxAccessClient()` with redundancy config injected, connect with `OpcUaTestClient`, read the standard OPC UA redundancy nodes.
+
+---
+
+## Documentation Plan
+
+### New file: `docs/Redundancy.md`
+
+Contents:
+1. Overview of OPC UA non-transparent redundancy
+2. Redundancy configuration section reference (`Enabled`, `Mode`, `Role`, `ServerUris`, `ServiceLevelBase`)
+3. ServiceLevel computation logic and degraded-state penalties
+4. How clients discover and fail over between instances
+5. Deployment guide for a two-instance redundant pair (ports, service names, shared Galaxy DB)
+6. CLI `redundancy` command usage
+7. Troubleshooting: mismatched `ServerUris`, ServiceLevel stuck at 0, etc.
+
+### Updates to existing docs
+
+| File | Changes |
+|---|---|
+| `docs/Configuration.md` | Add `Redundancy` section table, example JSON, add to validation rules list, update example appsettings.json |
+| `docs/OpcUaServer.md` | Add redundancy state exposure section, link to `Redundancy.md` |
+| `docs/CliTool.md` | Add `redundancy` command documentation |
+| `docs/ServiceHosting.md` | Add multi-instance deployment notes |
+| `README.md` | Add `Redundancy` to the component documentation table, mention redundancy in Quick Start |
+| `CLAUDE.md` | Add redundancy architecture note |
+
+### Update: `service_info.md`
+
+Add a second section documenting `instance2`:
+- Path: `C:\publish\lmxopcua\instance2`
+- Windows service name: `LmxOpcUa2`
+- Port: `4841`
+- Dashboard port: `8084`
+- Redundancy role: `Secondary`
+- Endpoint: `opc.tcp://localhost:4841/LmxOpcUa`
+
+---
+
+## File Change Summary
+
+| File | Action | Description |
+|---|---|---|
+| `src/.../Configuration/RedundancyConfiguration.cs` | New | Redundancy config model |
+| `src/.../Configuration/AppConfiguration.cs` | Modify | Add `Redundancy` section |
+| `src/.../Configuration/ConfigurationValidator.cs` | Modify | Validate/log redundancy settings |
+| `src/.../OpcUa/RedundancyModeResolver.cs` | New | Mode string → `RedundancySupport` enum |
+| `src/.../OpcUa/ServiceLevelCalculator.cs` | New | Dynamic ServiceLevel from health state |
+| `src/.../OpcUa/LmxOpcUaServer.cs` | Modify | Expose redundancy nodes, accept ServiceLevel updates |
+| `src/.../OpcUa/OpcUaServerHost.cs` | Modify | Pass redundancy config through |
+| `src/.../OpcUaService.cs` | Modify | Bind redundancy config, wire ServiceLevel updates |
+| `src/.../OpcUaServiceBuilder.cs` | Modify | Add `WithRedundancy()` builder |
+| `src/.../appsettings.json` | Modify | Add `Redundancy` section |
+| `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` | New | CLI command to read redundancy info |
+| `tests/.../Redundancy/RedundancyModeResolverTests.cs` | New | Mode resolver unit tests |
+| `tests/.../Redundancy/ServiceLevelCalculatorTests.cs` | New | ServiceLevel computation tests |
+| `tests/.../Redundancy/RedundancyConfigurationTests.cs` | New | Config defaults tests |
+| `tests/.../Configuration/ConfigurationLoadingTests.cs` | Modify | Binding + validation tests |
+| `tests/.../Integration/RedundancyTests.cs` | New | E2E two-server redundancy tests |
+| `tests/.../Helpers/OpcUaServerFixture.cs` | Modify | Accept redundancy config |
+| `docs/Redundancy.md` | New | Dedicated redundancy component doc |
+| `docs/Configuration.md` | Modify | Add Redundancy section |
+| `docs/OpcUaServer.md` | Modify | Add redundancy state section |
+| `docs/CliTool.md` | Modify | Add redundancy command |
+| `docs/ServiceHosting.md` | Modify | Multi-instance notes |
+| `README.md` | Modify | Add Redundancy to component table |
+| `CLAUDE.md` | Modify | Add redundancy architecture note |
+| `service_info.md` | Modify | Add instance2 details |
+
+---
+
+## Verification Guardrails
+
+Each step must pass these gates before proceeding to the next:
+
+### Gate 1: Build (after each implementation step)
+```bash
+dotnet build ZB.MOM.WW.LmxOpcUa.slnx
+```
+Must produce 0 errors. Proceed only when green.
+
+### Gate 2: Unit tests (after steps 1–4, 9)
+```bash
+dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests
+```
+All existing + new tests must pass. No regressions.
+
+### Gate 3: Integration tests (after steps 5–7)
+```bash
+dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~Integration.RedundancyTests"
+```
+All redundancy E2E tests must pass.
+
+### Gate 4: CLI tool builds (after step 10)
+```bash
+cd tools/opcuacli-dotnet && dotnet build
+```
+Must compile without errors.
+
+### Gate 5: Manual verification, single instance (after step 8)
+```bash
+# Publish and start with Redundancy.Enabled=false
+opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
+opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
+# Should report: RedundancySupport=None, ServiceLevel=255
+```
+
+### Gate 6: Manual verification, redundant pair (after step 11)
+```bash
+# Start both instances
+sc start LmxOpcUa
+sc start LmxOpcUa2
+
+# Verify instance1 (Primary)
+opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
+# Should report: RedundancySupport=Warm, ServiceLevel=200, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]
+
+# Verify instance2 (Secondary)
+opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa
+# Should report: RedundancySupport=Warm, ServiceLevel=150, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]
+
+# Both instances should serve the same Galaxy address space
+opcuacli-dotnet.exe browse -u opc.tcp://localhost:4840/LmxOpcUa -r -d 2
+opcuacli-dotnet.exe browse -u opc.tcp://localhost:4841/LmxOpcUa -r -d 2
+```
+
+### Gate 7: Full test suite (final)
+```bash
+dotnet test ZB.MOM.WW.LmxOpcUa.slnx
+```
+All tests across all projects must pass.
+
+### Gate 8: Documentation review
+- All new/modified doc files render correctly in Markdown
+- Example JSON snippets match the actual `appsettings.json`
+- CLI examples use correct flags and expected output
+- `service_info.md` accurately reflects both deployed instances
+
+---
+
+## Risks and Considerations
+
+1. **Backward compatibility**: `Redundancy.Enabled = false` must be the default so existing single-instance deployments are unaffected.
+2. **ServiceLevel timing**: Updates must not race with OPC UA publish cycles. Use the server's internal lock or `ServerInternal` APIs.
+3. **ServerUriArray immutability**: The OPC UA spec expects this to be static during a server session. Changes require a server restart.
+4. **MXAccess shared state**: Both instances connect to the same MXAccess runtime. If MXAccess has per-client registration limits, verify that two clients can coexist.
+5. **Galaxy DB contention**: Both instances poll for deploy changes. Ensure change detection doesn't trigger duplicate rebuilds or locking issues.
+6. **Port conflicts**: The second instance must use different ports for OPC UA (4841) and Dashboard (8084).
+7. **Certificate identity**: Each instance needs its own application certificate with a distinct `SubjectName` matching its `ServerName`.
+
+---
+
+## Execution Order
+
+1. Steps 1–4: Config model, resolver, calculator, validator (unit-testable in isolation)
+2. **Gate 1 + Gate 2**: Build + unit tests pass
+3. Steps 5–7: Server integration (redundancy nodes, ServiceLevel wiring)
+4. **Gate 1 + Gate 2 + Gate 3**: Build + all tests including E2E
+5. Step 8: Update appsettings.json
+6. **Gate 5**: Manual single-instance verification
+7. Step 9: Update service builder for tests
+8. Step 10: CLI redundancy command
+9. **Gate 4**: CLI builds
+10. Step 11: Deploy second instance + update service_info.md
+11. **Gate 6**: Manual two-instance verification
+12. Documentation updates (all doc files)
+13. **Gate 7 + Gate 8**: Full test suite + documentation review
+14. Commit and push