lmxopcua/redundancy.md
Joseph Doherty a3c2d9b243 Add OPC UA server redundancy implementation plan
Covers non-transparent warm/hot redundancy with configurable roles,
dynamic ServiceLevel, CLI support, second service instance deployment,
and verification guardrails across unit, integration, and manual tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 12:52:15 -04:00

# OPC UA Server Redundancy Plan
## Summary
Add configurable non-transparent warm/hot redundancy to the LmxOpcUa server so that two instances sharing the same Galaxy repository can operate as a redundant pair. Each instance advertises itself and its partner through the OPC UA `ServerRedundancy` node, publishes a dynamic `ServiceLevel` reflecting runtime health, and allows clients to discover the redundant set and fail over between instances. The CLI tool gains a `redundancy` command for inspecting the redundant server set.
This plan covers server-side redundancy exposure, client-side discovery, a second deployed service instance, documentation, and tests. It does **not** implement automatic server-side failover or subscription transfer — those are client responsibilities per the OPC UA specification.
---
## Background: OPC UA Redundancy Model
OPC UA exposes redundancy through three address-space nodes: `RedundancySupport` and `ServerUriArray` under `Server/ServerRedundancy`, and `ServiceLevel` directly under `Server`:
| Node | Type | Purpose |
|---|---|---|
| `RedundancySupport` | `RedundancySupport` enum | Declares the redundancy mode: `None`, `Cold`, `Warm`, `Hot`, `Transparent`, `HotAndMirrored` |
| `ServerUriArray` | `String[]` | Lists the `ApplicationUri` values of all servers in the redundant set (non-transparent modes) |
| `ServiceLevel` | `Byte` (0–255) | Indicates current operational quality; clients prefer the server with the highest value |
### Non-Transparent Redundancy (our target)
In non-transparent redundancy (`Warm` or `Hot`), both servers run independently with their own sessions and subscriptions. Clients discover the redundant set by reading `ServerUriArray`, monitor `ServiceLevel` on each server, and manage their own failover. This model fits our architecture where each instance connects to the same Galaxy repository and MXAccess runtime independently.
### ServiceLevel Semantics
| Range | Meaning |
|---|---|
| 0 | Server is not operational |
| 1–99 | Degraded (e.g., MXAccess disconnected, DB unreachable) |
| 100–199 | Healthy secondary |
| 200–255 | Healthy primary (preferred) |
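The primary server should advertise a higher `ServiceLevel` than the secondary so clients prefer it when both are healthy. These bands are a project convention: OPC UA Part 4 defines the sub-ranges as 0 (Maintenance), 1 (NoData), 2–199 (Degraded), and 200–255 (Healthy), so a spec-aware client may treat a healthy secondary reporting 150 as degraded.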
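Because failover is the client's job in this model, a short client-side sketch may help make the preference rule concrete. It assumes the client has already read `ServiceLevel` from each server; `ServerCandidate` and `FailoverSelector` are hypothetical names, not part of the LmxOpcUa code base or the OPC UA SDK.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical client-side types, used only for illustration.
public sealed class ServerCandidate
{
    public ServerCandidate(string endpointUrl, byte serviceLevel)
    {
        EndpointUrl = endpointUrl;
        ServiceLevel = serviceLevel;
    }

    public string EndpointUrl { get; }
    public byte ServiceLevel { get; }
}

public static class FailoverSelector
{
    // Returns the candidate with the highest ServiceLevel, or null when the
    // list is empty or every candidate reports 0 (not operational).
    // OrderByDescending is stable, so ties keep the caller's ordering.
    public static ServerCandidate PickPreferred(IEnumerable<ServerCandidate> candidates)
    {
        return candidates
            .Where(c => c.ServiceLevel > 0)
            .OrderByDescending(c => c.ServiceLevel)
            .FirstOrDefault();
    }
}
```

With a healthy pair reporting 200 and 150, `PickPreferred` returns the 200 entry; if that server later drops to 0, the same call returns the secondary.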
---
## Current State
- `LmxOpcUaServer` extends `StandardServer` but does not override any redundancy-related methods
- `ServerRedundancy/RedundancySupport` defaults to `None` (SDK default)
- `ServiceLevel` defaults to `255` (SDK default — "fully operational")
- No configuration for redundant partner URIs or role designation
- Single deployed instance at `C:\publish\lmxopcua\instance1` on port 4840
- No CLI support for reading redundancy information
---
## Scope
### In Scope (Phase 1)
1. **Redundancy configuration model** — role, partner URIs, ServiceLevel weights
2. **Server redundancy node exposure**: `RedundancySupport`, `ServerUriArray`, dynamic `ServiceLevel`
3. **ServiceLevel computation** — based on runtime health (MXAccess state, DB connectivity, role)
4. **CLI redundancy command** — read `RedundancySupport`, `ServerUriArray`, `ServiceLevel` from a server
5. **Second service instance** — deployed at `C:\publish\lmxopcua\instance2` with non-overlapping ports
6. **Documentation** — new `docs/Redundancy.md` component doc, updates to existing docs
7. **Unit tests** — config, ServiceLevel computation, resolver tests
8. **Integration tests** — two-server redundancy E2E test in the integration test project
### Deferred
- Automatic subscription transfer (client-side responsibility)
- Server-initiated failover (Galaxy `redundancy` table / engine flags)
- Transparent redundancy mode
- Health-check HTTP endpoint for load balancers
---
## Configuration Design
### New `Redundancy` section in `appsettings.json`
```json
{
  "Redundancy": {
    "Enabled": false,
    "Mode": "Warm",
    "Role": "Primary",
    "ServerUris": [],
    "ServiceLevelBase": 200
  }
}
```
### Configuration model
**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/RedundancyConfiguration.cs` (new)
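```csharp
using System.Collections.Generic;

public class RedundancyConfiguration
{
    // Off by default so existing single-instance deployments keep the SDK defaults.
    public bool Enabled { get; set; } = false;
    // "Warm" or "Hot" (Phase 1).
    public string Mode { get; set; } = "Warm";
    // "Primary" or "Secondary".
    public string Role { get; set; } = "Primary";
    // ApplicationUri of every server in the redundant set, including this one.
    public List<string> ServerUris { get; set; } = new List<string>();
    // ServiceLevel reported by a fully healthy Primary.
    public int ServiceLevelBase { get; set; } = 200;
}
```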
### Configuration rules
- `Enabled` defaults to `false` for backward compatibility. When `false`, `RedundancySupport = None` and `ServiceLevel = 255` (SDK defaults).
- `Mode` must be `Warm` or `Hot` (Phase 1). Maps to `RedundancySupport.Warm` or `RedundancySupport.Hot`.
- `Role` must be `Primary` or `Secondary`. Controls the base `ServiceLevel` (Primary gets `ServiceLevelBase`, Secondary gets `ServiceLevelBase - 50`).
- `ServerUris` lists the `ApplicationUri` values for **all** servers in the redundant set, including the local server. The OPC UA spec requires this to contain the full set. These are application URIs such as `urn:ZB:LmxOpcUa`, not endpoint URLs such as `opc.tcp://localhost:4840/LmxOpcUa`.
- `ServiceLevelBase` is the starting ServiceLevel when the server is fully healthy. Degraded conditions subtract from this value.
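As an illustration of these rules, the Primary instance of a two-server pair might carry a section like the one below. The values are taken from the instance1 column of the deployment table in Step 11; treat it as a sketch rather than the final file.

```json
{
  "Redundancy": {
    "Enabled": true,
    "Mode": "Warm",
    "Role": "Primary",
    "ServerUris": ["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"],
    "ServiceLevelBase": 200
  }
}
```

The Secondary would use the same section with `"Role": "Secondary"`, which yields a healthy `ServiceLevel` of 150 under the rules above.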
### App root updates
**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/AppConfiguration.cs`
- Add `public RedundancyConfiguration Redundancy { get; set; } = new RedundancyConfiguration();`
---
## Implementation Steps
### Step 1: Add RedundancyConfiguration model and bind it
**Files:**
- `src/.../Configuration/RedundancyConfiguration.cs` (new)
- `src/.../Configuration/AppConfiguration.cs`
- `src/.../OpcUaService.cs`
Changes:
1. Create `RedundancyConfiguration` class with properties above
2. Add `Redundancy` property to `AppConfiguration`
3. Bind `configuration.GetSection("Redundancy").Bind(_config.Redundancy);`
4. Pass `_config.Redundancy` through to `OpcUaServerHost` and `LmxOpcUaServer`
### Step 2: Add RedundancyModeResolver
**File:** `src/.../OpcUa/RedundancyModeResolver.cs` (new)
Responsibilities:
- Map `Mode` string to `RedundancySupport` enum value
- Validate against supported Phase 1 modes (`Warm`, `Hot`)
- Fall back to `None` with warning for unknown modes
```csharp
public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled);
}
```
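One possible body for `Resolve` is sketched below. To keep it compilable without the SDK it declares a stand-in `RedundancySupport` enum (the real one lives in `Opc.Ua`), and it reports the fallback through an optional `warn` callback; both the callback parameter and the `RedundancyModeResolverSketch` name are assumptions, since the plan does not say how the warning is logged.

```csharp
using System;

// Stand-in for Opc.Ua.RedundancySupport so the sketch compiles on its own.
public enum RedundancySupport { None, Cold, Warm, Hot, Transparent, HotAndMirrored }

public static class RedundancyModeResolverSketch
{
    // Maps the configured mode string to the enum. Only Warm and Hot are
    // accepted (Phase 1); disabled or unrecognised input resolves to None.
    public static RedundancySupport Resolve(string mode, bool enabled, Action<string> warn = null)
    {
        if (!enabled)
            return RedundancySupport.None;

        if (string.Equals(mode, "Warm", StringComparison.OrdinalIgnoreCase))
            return RedundancySupport.Warm;

        if (string.Equals(mode, "Hot", StringComparison.OrdinalIgnoreCase))
            return RedundancySupport.Hot;

        // Hypothetical logging hook: the caller decides where the warning goes.
        warn?.Invoke("Unsupported redundancy mode '" + mode + "'; falling back to None.");
        return RedundancySupport.None;
    }
}
```

An explicit string comparison is used instead of `Enum.TryParse` so that values such as `"Cold"` or numeric strings cannot slip through as valid modes.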
### Step 3: Add ServiceLevelCalculator
**File:** `src/.../OpcUa/ServiceLevelCalculator.cs` (new)
Computes the dynamic `ServiceLevel` byte from runtime health:
```csharp
public class ServiceLevelCalculator
{
    public byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary);
}
```
Logic:
- Start with `baseLine` (the configured `ServiceLevelBase`, e.g., 200); subtract 50 when `isPrimary` is `false`, giving 150 for a Secondary
- Subtract 100 if MXAccess is disconnected
- Subtract 50 if Galaxy DB is unreachable
- Clamp to 0–255
- Return 0 if both MXAccess and DB are down, regardless of the arithmetic above
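The rules above can be sketched as a small pure function. This version assumes `baseLine` is the configured `ServiceLevelBase` and that the 50-point Secondary offset is applied inside the calculator, which is what the planned `FullyHealthy_Secondary_ReturnsBaseMinusFifty` test implies; the class and constant names are illustrative.

```csharp
using System;

public static class ServiceLevelSketch
{
    public const int SecondaryOffset = 50;   // Secondary reports base - 50
    public const int MxAccessPenalty = 100;  // MXAccess disconnected
    public const int DbPenalty = 50;         // Galaxy DB unreachable

    public static byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary)
    {
        // With both dependencies down the server is not operational.
        if (!mxAccessConnected && !dbConnected)
            return 0;

        int level = baseLine;
        if (!isPrimary) level -= SecondaryOffset;
        if (!mxAccessConnected) level -= MxAccessPenalty;
        if (!dbConnected) level -= DbPenalty;

        // Clamp into the Byte range before casting.
        return (byte)Math.Max(0, Math.Min(255, level));
    }
}
```

For a base of 200, a healthy Primary yields 200, a healthy Secondary 150, and a Primary with MXAccess disconnected 100.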
### Step 4: Extend ConfigurationValidator for redundancy
**File:** `src/.../Configuration/ConfigurationValidator.cs`
Add validation/logging for:
- `Redundancy.Enabled`, `Mode`, `Role`
- `ServerUris` should not be empty when `Enabled = true`
- `ServiceLevelBase` should be 1–255
- Warning when `Enabled = true` but `ServerUris` has fewer than 2 entries
- Log effective redundancy configuration at startup
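A minimal sketch of these checks follows, assuming a result object that separates blocking errors from warnings. The plan only fixes two outcomes (too few `ServerUris` warns, an out-of-range `ServiceLevelBase` fails); treating an unknown `Mode` as a warning and an unknown `Role` as an error are assumptions, as are the `RedundancyValidationResult` and `RedundancyValidationSketch` names.

```csharp
using System;
using System.Collections.Generic;

// Copy of the configuration model from this plan so the sketch compiles on its own.
public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}

// Hypothetical result type: errors fail validation, warnings are only logged.
public sealed class RedundancyValidationResult
{
    public List<string> Errors { get; } = new List<string>();
    public List<string> Warnings { get; } = new List<string>();
    public bool IsValid => Errors.Count == 0;
}

public static class RedundancyValidationSketch
{
    public static RedundancyValidationResult Validate(RedundancyConfiguration config)
    {
        if (config == null) throw new ArgumentNullException(nameof(config));
        var result = new RedundancyValidationResult();
        if (!config.Enabled) return result; // assumption: nothing to check when disabled

        // Assumption: warn only, because the resolver already falls back to None.
        if (!Matches(config.Mode, "Warm") && !Matches(config.Mode, "Hot"))
            result.Warnings.Add("Redundancy.Mode '" + config.Mode + "' is not Warm or Hot.");

        // Assumption: fail, because the base ServiceLevel depends on the role.
        if (!Matches(config.Role, "Primary") && !Matches(config.Role, "Secondary"))
            result.Errors.Add("Redundancy.Role must be Primary or Secondary.");

        if (config.ServiceLevelBase < 1 || config.ServiceLevelBase > 255)
            result.Errors.Add("Redundancy.ServiceLevelBase must be between 1 and 255.");

        int uriCount = config.ServerUris == null ? 0 : config.ServerUris.Count;
        if (uriCount < 2)
            result.Warnings.Add("Redundancy.ServerUris has " + uriCount + " entries; a redundant set needs at least 2.");

        return result;
    }

    private static bool Matches(string value, string expected)
    {
        return string.Equals(value, expected, StringComparison.OrdinalIgnoreCase);
    }
}
```

Keeping warnings separate from errors lets the validator return `true` for an enabled configuration with empty `ServerUris` while still surfacing the problem in the startup log.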
### Step 5: Update LmxOpcUaServer to expose redundancy state
**File:** `src/.../OpcUa/LmxOpcUaServer.cs`
Changes:
1. Accept `RedundancyConfiguration` in the constructor
2. Override `OnServerStarted` to write redundancy nodes:
- Set `Server/ServerRedundancy/RedundancySupport` to the resolved mode
- Set `Server/ServerRedundancy/ServerUriArray` to the configured URIs
3. Override `SetServerState` or use a timer to update `Server/ServiceLevel` periodically based on `ServiceLevelCalculator`
4. Expose a method `UpdateServiceLevel(bool mxConnected, bool dbConnected)` that the service layer can call when health state changes
### Step 6: Update OpcUaServerHost to pass redundancy config
**File:** `src/.../OpcUa/OpcUaServerHost.cs`
Changes:
1. Accept `RedundancyConfiguration` in the constructor
2. Pass it through to `LmxOpcUaServer`
3. Log active redundancy mode at startup
### Step 7: Wire ServiceLevel updates in OpcUaService
**File:** `src/.../OpcUaService.cs`
Changes:
1. Bind redundancy config section
2. Pass redundancy config to `OpcUaServerHost`
3. Subscribe to `MxAccessClient.ConnectionStateChanged` to trigger `ServiceLevel` updates
4. After Galaxy DB health checks, trigger `ServiceLevel` updates
5. Use a periodic timer (e.g., every 5 seconds) to refresh `ServiceLevel` based on current component health
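The event-plus-timer wiring described in these changes could look roughly like the sketch below. `ServiceLevelPublisher` is a hypothetical glue class: the `calculate` delegate stands in for `ServiceLevelCalculator`, and the `publish` delegate stands in for whatever call writes the value into the server. Neither delegate shape comes from the existing code.

```csharp
using System;
using System.Threading;

// Hypothetical glue class; names and delegate shapes are illustrative only.
public sealed class ServiceLevelPublisher : IDisposable
{
    private readonly object _gate = new object();
    private readonly Func<bool, bool, byte> _calculate; // (mxConnected, dbConnected) -> level
    private readonly Action<byte> _publish;             // writes the level to the server
    private bool _mxConnected;
    private bool _dbConnected;
    private byte? _lastPublished;
    private Timer _timer;

    public ServiceLevelPublisher(Func<bool, bool, byte> calculate, Action<byte> publish)
    {
        _calculate = calculate ?? throw new ArgumentNullException(nameof(calculate));
        _publish = publish ?? throw new ArgumentNullException(nameof(publish));
    }

    // Call from the MXAccess connection-state event handler.
    public void SetMxAccessConnected(bool connected)
    {
        lock (_gate) { _mxConnected = connected; }
        Refresh();
    }

    // Call after each Galaxy DB health check.
    public void SetDbConnected(bool connected)
    {
        lock (_gate) { _dbConnected = connected; }
        Refresh();
    }

    // Recomputes the level and publishes it only when it has changed.
    public void Refresh()
    {
        lock (_gate)
        {
            byte level = _calculate(_mxConnected, _dbConnected);
            if (_lastPublished == level) return;
            _lastPublished = level;
            _publish(level);
        }
    }

    // Safety net in case an event is missed (the plan suggests about 5 seconds).
    public void StartPeriodicRefresh(TimeSpan interval)
    {
        lock (_gate)
        {
            _timer?.Dispose();
            _timer = new Timer(_ => Refresh(), null, interval, interval);
        }
    }

    public void Dispose()
    {
        lock (_gate)
        {
            _timer?.Dispose();
            _timer = null;
        }
    }
}
```

Publishing under a single lock serialises event-driven and timer-driven updates, and skipping unchanged values keeps the periodic refresh from generating redundant writes to the `ServiceLevel` node.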
### Step 8: Update appsettings.json
**File:** `src/.../appsettings.json`
Add the `Redundancy` section with backward-compatible defaults (`Enabled: false`).
### Step 9: Update OpcUaServiceBuilder for test injection
**File:** `src/.../OpcUaServiceBuilder.cs`
Add `WithRedundancy(RedundancyConfiguration)` builder method so tests can inject redundancy configuration.
### Step 10: Add CLI `redundancy` command
**Files:**
- `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` (new)
Command: `redundancy`
Reads from the target server:
- `Server/ServerRedundancy/RedundancySupport` (i=3709)
- `Server/ServiceLevel` (i=2267)
- `Server/ServerRedundancy/ServerUriArray` (i=11314, if non-transparent redundancy)
Output format:
```
Redundancy Mode: Warm
Service Level: 200
Server URIs:
- urn:ZB:LmxOpcUa
- urn:ZB:LmxOpcUa2
```
Options: `--url`, `--username`, `--password`, `--security` (same shared options as other commands).
### Step 11: Deploy second service instance
**Deployment target:** `C:\publish\lmxopcua\instance2`
Configuration differences from instance1:
| Setting | instance1 | instance2 |
|---|---|---|
| `OpcUa.Port` | `4840` | `4841` |
| `OpcUa.ServerName` | `LmxOpcUa` | `LmxOpcUa2` |
| `Dashboard.Port` | `8083` | `8084` |
| `Redundancy.Enabled` | `true` | `true` |
| `Redundancy.Role` | `Primary` | `Secondary` |
| `Redundancy.Mode` | `Warm` | `Warm` |
| `Redundancy.ServerUris` | `["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"]` | `["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"]` |
| `Redundancy.ServiceLevelBase` | `200` | `200` |
Windows service for instance2:
- Name: `LmxOpcUa2`
- Display name: `LMX OPC UA Server (Instance 2)`
- Executable: `C:\publish\lmxopcua\instance2\ZB.MOM.WW.LmxOpcUa.Host.exe`
Both instances share the same Galaxy DB (`ZB`) and MXAccess runtime. The `GalaxyName` remains `ZB` for both so they expose the same namespace.
Update `service_info.md` with the second instance details.
---
## Test Plan
### Unit tests — RedundancyModeResolver
**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyModeResolverTests.cs`
| Test | Description |
|---|---|
| `Resolve_Disabled_ReturnsNone` | `Enabled=false` always returns `RedundancySupport.None` |
| `Resolve_Warm_ReturnsWarm` | `Mode="Warm"` maps to `RedundancySupport.Warm` |
| `Resolve_Hot_ReturnsHot` | `Mode="Hot"` maps to `RedundancySupport.Hot` |
| `Resolve_Unknown_FallsBackToNone` | Unknown mode falls back safely |
| `Resolve_CaseInsensitive` | `"warm"` and `"WARM"` both resolve |
### Unit tests — ServiceLevelCalculator
**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/ServiceLevelCalculatorTests.cs`
| Test | Description |
|---|---|
| `FullyHealthy_Primary_ReturnsBase` | All healthy, primary role → `ServiceLevelBase` |
| `FullyHealthy_Secondary_ReturnsBaseMinusFifty` | All healthy, secondary role → `ServiceLevelBase - 50` |
| `MxAccessDown_ReducesServiceLevel` | MXAccess disconnected subtracts 100 |
| `DbDown_ReducesServiceLevel` | DB unreachable subtracts 50 |
| `BothDown_ReturnsZero` | MXAccess + DB both down → 0 |
| `ClampedTo255` | Base of 255 with healthy → 255 |
| `ClampedToZero` | Heavy penalties don't go negative |
### Unit tests — RedundancyConfiguration defaults
**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyConfigurationTests.cs`
| Test | Description |
|---|---|
| `DefaultConfig_Disabled` | `Enabled` defaults to `false` |
| `DefaultConfig_ModeWarm` | `Mode` defaults to `"Warm"` |
| `DefaultConfig_RolePrimary` | `Role` defaults to `"Primary"` |
| `DefaultConfig_EmptyServerUris` | `ServerUris` defaults to empty |
| `DefaultConfig_ServiceLevelBase200` | `ServiceLevelBase` defaults to `200` |
### Updates to existing configuration tests
**File:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Configuration/ConfigurationLoadingTests.cs`
Add:
- `Redundancy_Section_BindsCorrectly` — verify binding from appsettings.json
- `Redundancy_Section_BindsCustomValues` — in-memory override test
- `Validator_RedundancyEnabled_EmptyServerUris_ReturnsTrue_WithWarning` — validates but warns
- `Validator_RedundancyEnabled_InvalidServiceLevelBase_ReturnsFalse` — rejects 0 or >255
### Integration tests — redundancy E2E
**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Integration/RedundancyTests.cs`
These tests start two in-process OPC UA servers with redundancy enabled and verify client-visible behavior:
| Test | Description |
|---|---|
| `Server_WithRedundancyDisabled_ReportsNone` | Default config → `RedundancySupport.None`, `ServiceLevel=255` |
| `Server_WithRedundancyEnabled_ReportsConfiguredMode` | `Enabled=true, Mode=Warm` → `RedundancySupport.Warm` |
| `Server_WithRedundancyEnabled_ExposesServerUriArray` | Client can read `ServerUriArray` and it matches config |
| `Server_Primary_HasHigherServiceLevel_ThanSecondary` | Primary server reports higher `ServiceLevel` than secondary |
| `TwoServers_BothExposeSameRedundantSet` | Two server fixtures, both report the same `ServerUriArray` |
| `Server_ServiceLevel_DropsWith_MxAccessDisconnect` | Simulate MXAccess disconnect → `ServiceLevel` decreases |
Pattern: Use `OpcUaServerFixture.WithFakeMxAccessClient()` with redundancy config injected, connect with `OpcUaTestClient`, read the standard OPC UA redundancy nodes.
---
## Documentation Plan
### New file: `docs/Redundancy.md`
Contents:
1. Overview of OPC UA non-transparent redundancy
2. Redundancy configuration section reference (`Enabled`, `Mode`, `Role`, `ServerUris`, `ServiceLevelBase`)
3. ServiceLevel computation logic and degraded-state penalties
4. How clients discover and fail over between instances
5. Deployment guide for a two-instance redundant pair (ports, service names, shared Galaxy DB)
6. CLI `redundancy` command usage
7. Troubleshooting: mismatched `ServerUris`, ServiceLevel stuck at 0, etc.
### Updates to existing docs
| File | Changes |
|---|---|
| `docs/Configuration.md` | Add `Redundancy` section table, example JSON, add to validation rules list, update example appsettings.json |
| `docs/OpcUaServer.md` | Add redundancy state exposure section, link to `Redundancy.md` |
| `docs/CliTool.md` | Add `redundancy` command documentation |
| `docs/ServiceHosting.md` | Add multi-instance deployment notes |
| `README.md` | Add `Redundancy` to the component documentation table, mention redundancy in Quick Start |
| `CLAUDE.md` | Add redundancy architecture note |
### Update: `service_info.md`
Add a second section documenting `instance2`:
- Path: `C:\publish\lmxopcua\instance2`
- Windows service name: `LmxOpcUa2`
- Port: `4841`
- Dashboard port: `8084`
- Redundancy role: `Secondary`
- Endpoint: `opc.tcp://localhost:4841/LmxOpcUa`
---
## File Change Summary
| File | Action | Description |
|---|---|---|
| `src/.../Configuration/RedundancyConfiguration.cs` | New | Redundancy config model |
| `src/.../Configuration/AppConfiguration.cs` | Modify | Add `Redundancy` section |
| `src/.../Configuration/ConfigurationValidator.cs` | Modify | Validate/log redundancy settings |
| `src/.../OpcUa/RedundancyModeResolver.cs` | New | Mode string → `RedundancySupport` enum |
| `src/.../OpcUa/ServiceLevelCalculator.cs` | New | Dynamic ServiceLevel from health state |
| `src/.../OpcUa/LmxOpcUaServer.cs` | Modify | Expose redundancy nodes, accept ServiceLevel updates |
| `src/.../OpcUa/OpcUaServerHost.cs` | Modify | Pass redundancy config through |
| `src/.../OpcUaService.cs` | Modify | Bind redundancy config, wire ServiceLevel updates |
| `src/.../OpcUaServiceBuilder.cs` | Modify | Add `WithRedundancy()` builder |
| `src/.../appsettings.json` | Modify | Add `Redundancy` section |
| `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` | New | CLI command to read redundancy info |
| `tests/.../Redundancy/RedundancyModeResolverTests.cs` | New | Mode resolver unit tests |
| `tests/.../Redundancy/ServiceLevelCalculatorTests.cs` | New | ServiceLevel computation tests |
| `tests/.../Redundancy/RedundancyConfigurationTests.cs` | New | Config defaults tests |
| `tests/.../Configuration/ConfigurationLoadingTests.cs` | Modify | Binding + validation tests |
| `tests/.../Integration/RedundancyTests.cs` | New | E2E two-server redundancy tests |
| `tests/.../Helpers/OpcUaServerFixture.cs` | Modify | Accept redundancy config |
| `docs/Redundancy.md` | New | Dedicated redundancy component doc |
| `docs/Configuration.md` | Modify | Add Redundancy section |
| `docs/OpcUaServer.md` | Modify | Add redundancy state section |
| `docs/CliTool.md` | Modify | Add redundancy command |
| `docs/ServiceHosting.md` | Modify | Multi-instance notes |
| `README.md` | Modify | Add Redundancy to component table |
| `CLAUDE.md` | Modify | Add redundancy architecture note |
| `service_info.md` | Modify | Add instance2 details |
---
## Verification Guardrails
Each step must pass these gates before proceeding to the next:
### Gate 1: Build (after each implementation step)
```bash
dotnet build ZB.MOM.WW.LmxOpcUa.slnx
```
Must produce 0 errors. Proceed only when green.
### Gate 2: Unit tests (after steps 1–4, 9)
```bash
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests
```
All existing + new tests must pass. No regressions.
### Gate 3: Integration tests (after steps 5–7)
```bash
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~Integration.RedundancyTests"
```
All redundancy E2E tests must pass.
### Gate 4: CLI tool builds (after step 10)
```bash
cd tools/opcuacli-dotnet && dotnet build
```
Must compile without errors.
### Gate 5: Manual verification — single instance (after step 8)
```bash
# Publish and start with Redundancy.Enabled=false
opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
# Should report: RedundancySupport=None, ServiceLevel=255
```
### Gate 6: Manual verification — redundant pair (after step 11)
```bash
# Start both instances
sc start LmxOpcUa
sc start LmxOpcUa2
# Verify instance1 (Primary)
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
# Should report: RedundancySupport=Warm, ServiceLevel=200, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]
# Verify instance2 (Secondary)
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa
# Should report: RedundancySupport=Warm, ServiceLevel=150, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]
# Both instances should serve the same Galaxy address space
opcuacli-dotnet.exe browse -u opc.tcp://localhost:4840/LmxOpcUa -r -d 2
opcuacli-dotnet.exe browse -u opc.tcp://localhost:4841/LmxOpcUa -r -d 2
```
### Gate 7: Full test suite (final)
```bash
dotnet test ZB.MOM.WW.LmxOpcUa.slnx
```
All tests across all projects must pass.
### Gate 8: Documentation review
- All new/modified doc files render correctly in Markdown
- Example JSON snippets match the actual `appsettings.json`
- CLI examples use correct flags and expected output
- `service_info.md` accurately reflects both deployed instances
---
## Risks and Considerations
1. **Backward compatibility**: `Redundancy.Enabled = false` must be the default so existing single-instance deployments are unaffected.
2. **ServiceLevel timing**: Updates must not race with OPC UA publish cycles. Use the server's internal lock or `ServerInternal` APIs.
3. **ServerUriArray immutability**: The OPC UA spec expects this to be static during a server session. Changes require a server restart.
4. **MXAccess shared state**: Both instances connect to the same MXAccess runtime. If MXAccess has per-client registration limits, verify that two clients can coexist.
5. **Galaxy DB contention**: Both instances poll for deploy changes. Ensure change detection doesn't trigger duplicate rebuilds or locking issues.
6. **Port conflicts**: The second instance must use different ports for OPC UA (4841) and Dashboard (8084).
7. **Certificate identity**: Each instance needs its own application certificate with a distinct `SubjectName` matching its `ServerName`.
---
## Execution Order
1. Steps 1–4: Config model, resolver, calculator, validator (unit-testable in isolation)
2. **Gate 1 + Gate 2**: Build + unit tests pass
3. Steps 5–7: Server integration (redundancy nodes, ServiceLevel wiring)
4. **Gate 1 + Gate 2 + Gate 3**: Build + all tests including E2E
5. Step 8: Update appsettings.json
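6. **Gate 5**: Manual single-instance verification (the `redundancy` check in this gate depends on the CLI command from Step 10, so run that part once Step 10 is complete)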
7. Step 9: Update service builder for tests
8. Step 10: CLI redundancy command
9. **Gate 4**: CLI builds
10. Step 11: Deploy second instance + update service_info.md
11. **Gate 6**: Manual two-instance verification
12. Documentation updates (all doc files)
13. **Gate 7 + Gate 8**: Full test suite + documentation review
14. Commit and push