Joseph Doherty a55153d7d5 Add configurable non-transparent OPC UA server redundancy
Separates ApplicationUri from namespace identity so each instance in a
redundant pair has a unique server URI while sharing the same Galaxy
namespace. Exposes RedundancySupport, ServerUriArray, and dynamic
ServiceLevel through the standard OPC UA server object. ServiceLevel
is computed from role (Primary/Secondary) and runtime health (MXAccess
and DB connectivity). Adds CLI redundancy command, second deployed
service instance, and 31 new tests including paired-server integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 13:32:17 -04:00


# OPC UA Server Redundancy Plan
## Summary
Add configurable non-transparent warm/hot redundancy to the LmxOpcUa server so that two instances sharing the same Galaxy repository can operate as a redundant pair. Each instance should advertise the redundant set through the standard OPC UA redundancy nodes, publish a dynamic `ServiceLevel` based on runtime health, and allow clients to discover and fail over between the instances. The CLI tool should gain a `redundancy` command for inspecting the redundant server set.
This review tightens the original draft in a few important ways:
- It separates **namespace identity** from **application identity**. The current host uses `urn:{GalaxyName}:LmxOpcUa` as both the namespace URI and `ApplicationUri`; that must change for redundancy because each server in the pair needs a unique server URI.
- It avoids hand-wavy "write the redundancy nodes directly" language and instead targets the OPC UA SDK's built-in `ServerObjectState` / `ServerRedundancyState` model.
- It removes a few inaccurate hardcoded assumptions, including the `ServerUriArray` node id and the deployment port examples.
- It fixes execution order so test-builder and helper changes happen before integration coverage depends on them.
This plan still covers server-side redundancy exposure, client-side discovery, a second deployed service instance, documentation, and tests. It does **not** implement automatic server-side failover or subscription transfer; those remain client responsibilities per the OPC UA specification.
---
## Background: OPC UA Redundancy Model
OPC UA exposes redundancy through standard nodes under `Server/ServerRedundancy` plus the `Server/ServiceLevel` property:
| Node | Type | Purpose |
|---|---|---|
| `RedundancySupport` | `RedundancySupport` enum | Declares the redundancy mode: `None`, `Cold`, `Warm`, `Hot`, `Transparent`, `HotAndMirrored` |
| `ServerUriArray` | `String[]` | Lists the `ApplicationUri` values of all servers in the redundant set for non-transparent redundancy |
| `ServiceLevel` | `Byte` (0-255) | Indicates current operational quality; clients prefer the server with the highest value |
### Non-Transparent Redundancy (our target)
In non-transparent redundancy (`Warm` or `Hot`), both servers run independently with their own sessions and subscriptions. Clients discover the redundant set by reading `ServerUriArray`, monitor `ServiceLevel` on each server, and manage their own failover. This fits the current architecture, where each instance independently connects to the same Galaxy repository and MXAccess runtime.
### ServiceLevel semantics
| Range | Meaning |
|---|---|
| 0 | Server is not operational |
| 1-99 | Degraded |
| 100-199 | Healthy secondary |
| 200-255 | Healthy primary |
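The primary should advertise a higher `ServiceLevel` than the secondary so clients prefer it when both are healthy. Note that these bands are this plan's working convention for a primary/secondary pair. The OPC UA specification defines its own sub-ranges (0 Maintenance, 1 NoData, 2-199 Degraded, 200-255 Healthy), so a client that applies the specification's bands literally would read a healthy secondary below 200 as Degraded.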
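To make the client-side preference concrete, the following is a minimal sketch of how a client might choose a server after reading `ServiceLevel` from each member of the redundant set. `ServerStatus` and `FailoverSelector` are hypothetical names used only for illustration; they are not part of the OPC UA SDK or of this codebase, and a real client would also need reconnect and subscription handling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical snapshot of one server in the redundant set (illustration only).
public sealed class ServerStatus
{
    public ServerStatus(string applicationUri, byte serviceLevel)
    {
        ApplicationUri = applicationUri;
        ServiceLevel = serviceLevel;
    }

    public string ApplicationUri { get; }
    public byte ServiceLevel { get; }
}

public static class FailoverSelector
{
    // Returns the server with the highest non-zero ServiceLevel,
    // or null when every server reports 0 (not operational).
    public static ServerStatus PickPreferred(IEnumerable<ServerStatus> servers)
    {
        return servers
            .Where(s => s.ServiceLevel > 0)
            .OrderByDescending(s => s.ServiceLevel)
            .FirstOrDefault();
    }
}
```

Because the selection depends only on the published value, the servers never need to coordinate: each one reports its own health and the client decides.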
---
## Current State
- `LmxOpcUaServer` extends `StandardServer` but does not expose redundancy state
- `ServerRedundancy/RedundancySupport` remains the SDK default (`None`)
- `Server/ServiceLevel` remains the SDK default (`255`)
- No configuration exists for redundancy mode, role, or redundant partner URIs
- `OpcUaServerHost` currently sets `ApplicationUri = urn:{GalaxyName}:LmxOpcUa`
- `LmxNodeManager` uses the same `urn:{GalaxyName}:LmxOpcUa` as the published namespace URI
- A single deployed instance is documented in [service_info.md](service_info.md)
- No CLI command exists for reading redundancy information
## Key gap to fix first
For redundancy, each server in the set must advertise a unique `ApplicationUri`, and `ServerUriArray` must contain those unique values. The current implementation cannot do that because it reuses the namespace URI as the server `ApplicationUri`. Phase 1 therefore needs an application-identity change before the redundancy nodes can be correct.
---
## Scope
### In scope (Phase 1)
1. Add explicit application-identity configuration so each instance can have a unique `ApplicationUri`
2. Add redundancy configuration for mode, role, and server URI membership
3. Expose `RedundancySupport`, `ServerUriArray`, and dynamic `ServiceLevel`
4. Compute `ServiceLevel` from runtime health and preferred role
5. Add a CLI `redundancy` command
6. Document two-instance deployment
7. Add unit and integration coverage
### Deferred
- Automatic subscription transfer
- Server-initiated failover
- Transparent redundancy mode
- Load-balancer-specific HTTP health endpoints
- Mirrored data/session state
---
## Configuration Design
### 1. Add explicit `OpcUa.ApplicationUri`
**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/OpcUaConfiguration.cs`
Add:
```csharp
public string? ApplicationUri { get; set; }
```
Rules:
- `ApplicationUri = null` preserves the current behavior for non-redundant deployments
- when `Redundancy.Enabled = true`, `ApplicationUri` must be explicitly set and unique per instance
- `LmxNodeManager` should continue using `urn:{GalaxyName}:LmxOpcUa` as the namespace URI so both redundant servers expose the same namespace
- `Redundancy.ServerUris` must contain the exact `ApplicationUri` values for all servers in the redundant set
Example:
```json
{
  "OpcUa": {
    "ServerName": "LmxOpcUa",
    "GalaxyName": "ZB",
    "ApplicationUri": "urn:localhost:LmxOpcUa:instance1"
  }
}
```
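The rules above imply a simple fallback when resolving the two identities. The sketch below shows one way to express it; `IdentityResolver` is a hypothetical helper name, and treating a whitespace-only value as "not set" is an assumption rather than something the plan specifies.

```csharp
using System;

public static class IdentityResolver
{
    // Namespace URI shared by every instance that serves the same Galaxy.
    public static string NamespaceUri(string galaxyName)
    {
        return "urn:" + galaxyName + ":LmxOpcUa";
    }

    // Falls back to the namespace URI when no explicit ApplicationUri is
    // configured, which matches the current non-redundant behavior.
    public static string ApplicationUri(string configuredUri, string galaxyName)
    {
        return string.IsNullOrWhiteSpace(configuredUri)
            ? NamespaceUri(galaxyName)
            : configuredUri;
    }
}
```

With `GalaxyName = "ZB"`, an unset `ApplicationUri` resolves to `urn:ZB:LmxOpcUa`, while the configured value in the example above resolves to itself and leaves the namespace URI unchanged.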
### 2. New `Redundancy` section in `appsettings.json`
```json
{
  "Redundancy": {
    "Enabled": false,
    "Mode": "Warm",
    "Role": "Primary",
    "ServerUris": [],
    "ServiceLevelBase": 200
  }
}
```
### 3. Configuration model
**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/RedundancyConfiguration.cs` (new)
```csharp
using System.Collections.Generic;

public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}
```
### 4. Configuration rules
- `Enabled` defaults to `false`
- `Mode` supports `Warm` and `Hot` in Phase 1
- `Role` supports `Primary` and `Secondary`
- `ServerUris` must contain the local `OpcUa.ApplicationUri` when redundancy is enabled
- `ServerUris` should contain at least two unique entries when redundancy is enabled
- `ServiceLevelBase` should be in the range `1-255`
- Effective baseline:
  - Primary: `ServiceLevelBase`
  - Secondary: `max(0, ServiceLevelBase - 50)`
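The effective-baseline rule can be written as a small pure function. This is a sketch only: `ServiceLevelBaseline` is a hypothetical name, and matching the role case-insensitively is an assumption.

```csharp
using System;

public static class ServiceLevelBaseline
{
    // Primary keeps ServiceLevelBase; Secondary gets max(0, ServiceLevelBase - 50).
    public static int ForRole(string role, int serviceLevelBase)
    {
        bool isSecondary = string.Equals(role, "Secondary", StringComparison.OrdinalIgnoreCase);
        return isSecondary ? Math.Max(0, serviceLevelBase - 50) : serviceLevelBase;
    }
}
```

With the default `ServiceLevelBase` of 200, a primary starts at 200 and a secondary at 150.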
### App root updates
**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/AppConfiguration.cs`
- Add `public RedundancyConfiguration Redundancy { get; set; } = new RedundancyConfiguration();`
---
## Implementation Steps
### Step 1: Separate application identity from namespace identity
**Files:**
- `src/.../Configuration/OpcUaConfiguration.cs`
- `src/.../OpcUa/OpcUaServerHost.cs`
- `docs/OpcUaServer.md`
- `tests/.../Configuration/ConfigurationLoadingTests.cs`
Changes:
1. Add optional `OpcUa.ApplicationUri`
2. Keep `urn:{GalaxyName}:LmxOpcUa` as the namespace URI used by `LmxNodeManager`
3. Set `ApplicationConfiguration.ApplicationUri` from `OpcUa.ApplicationUri` when supplied
4. Keep `ApplicationUri` and namespace URI distinct in docs and tests
This step is required before redundancy can be correct.
### Step 2: Add `RedundancyConfiguration` and bind it
**Files:**
- `src/.../Configuration/RedundancyConfiguration.cs` (new)
- `src/.../Configuration/AppConfiguration.cs`
- `src/.../OpcUaService.cs`
Changes:
1. Create `RedundancyConfiguration`
2. Add `Redundancy` to `AppConfiguration`
3. Bind `configuration.GetSection("Redundancy").Bind(_config.Redundancy);`
4. Pass `_config.Redundancy` through to `OpcUaServerHost` and `LmxOpcUaServer`
### Step 3: Add `RedundancyModeResolver`
**File:** `src/.../OpcUa/RedundancyModeResolver.cs` (new)
Responsibilities:
- map `Mode` to `RedundancySupport`
- validate supported Phase 1 modes
- fall back safely when disabled or invalid
```csharp
public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled);
}
```
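One possible body for the resolver is sketched below. To keep the example compilable without the OPC UA SDK, it uses a local stand-in enum, `RedundancySupportSketch`, whose members mirror the `RedundancySupport` values listed in the background table; real code would return the SDK enum instead. The `Sketch` suffixes mark names that are hypothetical.

```csharp
using System;

// Stand-in for the SDK's RedundancySupport enum so the sketch compiles
// on its own; real code would use the SDK type.
public enum RedundancySupportSketch { None, Cold, Warm, Hot, Transparent, HotAndMirrored }

public static class RedundancyModeResolverSketch
{
    public static RedundancySupportSketch Resolve(string mode, bool enabled)
    {
        // Disabled or missing mode: advertise no redundancy.
        if (!enabled || string.IsNullOrWhiteSpace(mode))
            return RedundancySupportSketch.None;

        switch (mode.Trim().ToLowerInvariant())
        {
            case "warm": return RedundancySupportSketch.Warm;
            case "hot": return RedundancySupportSketch.Hot;
            // Anything else is unsupported in Phase 1, so fall back safely.
            default: return RedundancySupportSketch.None;
        }
    }
}
```

Falling back to `None` rather than throwing keeps a misconfigured instance running as a plain single server; the validator is the place to surface the mistake.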
### Step 4: Add `ServiceLevelCalculator`
**File:** `src/.../OpcUa/ServiceLevelCalculator.cs` (new)
Purpose:
- compute the current `ServiceLevel` from a baseline plus health inputs
Suggested signature:
```csharp
public sealed class ServiceLevelCalculator
{
    public byte Calculate(int baseLevel, bool mxAccessConnected, bool dbConnected);
}
```
Suggested logic:
- start with the role-adjusted baseline supplied by the caller
- subtract 100 if MXAccess is disconnected
- subtract 50 if the Galaxy DB is unreachable
- return `0` if both are down
- clamp to `0-255`
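The suggested logic translates fairly directly into code. The following is a minimal sketch under the penalties listed above, not a final implementation; the class is named `ServiceLevelCalculatorSketch` to make clear it is illustrative.

```csharp
using System;

public sealed class ServiceLevelCalculatorSketch
{
    // baseLevel is the role-adjusted baseline supplied by the caller.
    public byte Calculate(int baseLevel, bool mxAccessConnected, bool dbConnected)
    {
        // Both dependencies down: the server is not operational.
        if (!mxAccessConnected && !dbConnected) return 0;

        int level = baseLevel;
        if (!mxAccessConnected) level -= 100; // MXAccess disconnected
        if (!dbConnected) level -= 50;        // Galaxy DB unreachable

        // Clamp to the Byte range used by ServiceLevel.
        return (byte)Math.Max(0, Math.Min(255, level));
    }
}
```

With a baseline of 200, losing MXAccess yields 100 and losing only the database yields 150. A primary without MXAccess (100) therefore ranks below a healthy secondary (150), which is the ordering a failover client needs.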
### Step 5: Extend `ConfigurationValidator`
**File:** `src/.../Configuration/ConfigurationValidator.cs`
Add validation/logging for:
- `OpcUa.ApplicationUri`
- `Redundancy.Enabled`, `Mode`, `Role`
- `ServerUris` membership and uniqueness
- `ServiceLevelBase`
- local `OpcUa.ApplicationUri` must appear in `Redundancy.ServerUris` when enabled
- warning when fewer than 2 unique server URIs are configured
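As a rough illustration of these checks, the sketch below validates the redundancy settings as plain values and returns a list of messages. `RedundancyValidationSketch` and its parameter list are hypothetical; the existing `ConfigurationValidator` may report problems differently (for example through logging), and the message wording here is invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RedundancyValidationSketch
{
    // Returns human-readable problems; an empty list means these checks passed.
    public static List<string> Validate(bool enabled, string mode, string role,
        string applicationUri, IList<string> serverUris, int serviceLevelBase)
    {
        var problems = new List<string>();
        if (!enabled) return problems; // nothing to validate when disabled

        var uris = serverUris ?? new List<string>();

        if (!IsOneOf(mode, "Warm", "Hot"))
            problems.Add("Redundancy.Mode must be Warm or Hot.");
        if (!IsOneOf(role, "Primary", "Secondary"))
            problems.Add("Redundancy.Role must be Primary or Secondary.");

        if (string.IsNullOrWhiteSpace(applicationUri))
            problems.Add("OpcUa.ApplicationUri must be set when redundancy is enabled.");
        else if (!uris.Contains(applicationUri)) // exact match, as the rules require
            problems.Add("Redundancy.ServerUris must contain the local OpcUa.ApplicationUri.");

        if (uris.Distinct().Count() < 2)
            problems.Add("Warning: fewer than 2 unique server URIs are configured.");

        if (serviceLevelBase < 1 || serviceLevelBase > 255)
            problems.Add("Redundancy.ServiceLevelBase must be in the range 1-255.");

        return problems;
    }

    private static bool IsOneOf(string value, params string[] allowed)
    {
        return allowed.Any(a => string.Equals(a, value, StringComparison.OrdinalIgnoreCase));
    }
}
```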
### Step 6: Expose redundancy through the standard OPC UA server object
**File:** `src/.../OpcUa/LmxOpcUaServer.cs`
Changes:
1. Accept `RedundancyConfiguration` and local `ApplicationUri`
2. On startup, locate the built-in `ServerObjectState`
3. Configure `ServerObjectState.ServiceLevel`
4. Configure the server redundancy object using the SDK's standard server-state types instead of writing guessed node ids directly
5. If the default `ServerRedundancyState` does not expose `ServerUriArray`, replace or upgrade it with the appropriate non-transparent redundancy state type from the SDK before populating values
6. Expose an internal method such as `UpdateServiceLevel(bool mxConnected, bool dbConnected)` for service-layer health updates
Important: the implementation should use SDK types/constants (`ServerObjectState`, `ServerRedundancyState`, `NonTransparentRedundancyState`, `VariableIds.*`) rather than hand-maintained numeric literals.
### Step 7: Update `OpcUaServerHost`
**File:** `src/.../OpcUa/OpcUaServerHost.cs`
Changes:
1. Accept `RedundancyConfiguration`
2. Pass redundancy config and resolved local `ApplicationUri` into `LmxOpcUaServer`
3. Log redundancy mode/role/server URIs at startup
### Step 8: Wire health updates in `OpcUaService`
**File:** `src/.../OpcUaService.cs`
Changes:
1. Bind and pass redundancy config
2. After startup, initialize the starting `ServiceLevel`
3. Subscribe to `IMxAccessClient.ConnectionStateChanged`
4. Update DB health whenever startup repository checks, change-detection work, or rebuild attempts succeed/fail
5. Prefer event-driven updates; add a lightweight periodic refresh only if necessary
Avoid introducing a second large standalone polling loop when existing connection and repository activity already gives most of the needed health signals.
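One way to keep the updates event-driven is a small tracker that stores the two health flags and raises an event only when the computed value changes. The sketch below is illustrative: `ServiceLevelTrackerSketch` is a hypothetical type, the penalties repeat the suggested calculator logic, and the wiring to `IMxAccessClient.ConnectionStateChanged` and to `UpdateServiceLevel` is left to the caller.

```csharp
using System;

public sealed class ServiceLevelTrackerSketch
{
    private readonly object _gate = new object();
    private readonly int _baseLevel;
    private bool _mxConnected;
    private bool _dbConnected;
    private byte _current;

    // Raised only when the computed ServiceLevel actually changes.
    public event Action<byte> ServiceLevelChanged;

    public ServiceLevelTrackerSketch(int baseLevel, bool mxConnected, bool dbConnected)
    {
        _baseLevel = baseLevel;
        _mxConnected = mxConnected;
        _dbConnected = dbConnected;
        _current = Compute();
    }

    public byte Current { get { lock (_gate) { return _current; } } }

    public void SetMxAccessConnected(bool connected) { Apply(connected, null); }
    public void SetDbConnected(bool connected) { Apply(null, connected); }

    private void Apply(bool? mx, bool? db)
    {
        byte next;
        lock (_gate)
        {
            if (mx.HasValue) _mxConnected = mx.Value;
            if (db.HasValue) _dbConnected = db.Value;
            next = Compute();
            if (next == _current) return; // no change, stay quiet
            _current = next;
        }
        // Invoke outside the lock so handlers cannot deadlock the tracker.
        Action<byte> handler = ServiceLevelChanged;
        if (handler != null) handler(next);
    }

    private byte Compute()
    {
        if (!_mxConnected && !_dbConnected) return 0;
        int level = _baseLevel - (_mxConnected ? 0 : 100) - (_dbConnected ? 0 : 50);
        return (byte)Math.Max(0, Math.Min(255, level));
    }
}
```

Suppressing unchanged values keeps `ServiceLevel` from generating notification noise when the same health signal arrives repeatedly.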
### Step 9: Update test builders and helpers before integration coverage
**Files:**
- `src/.../OpcUaServiceBuilder.cs`
- `tests/.../Helpers/OpcUaServerFixture.cs`
- `tests/.../Helpers/OpcUaTestClient.cs`
Changes:
- add `WithRedundancy(...)`
- add `WithApplicationUri(...)` or allow full `OpcUaConfiguration` override
- ensure two in-process redundancy tests can run with distinct `ServerName`, `ApplicationUri`, and certificate identity
- when needed, use separate PKI roots in tests so paired fixtures do not collide on certificate state
### Step 10: Update `appsettings.json`
**File:** `src/.../appsettings.json`
Add:
- `OpcUa.ApplicationUri` example/commentary in docs
- `Redundancy` section with `Enabled = false` defaults
### Step 11: Add CLI `redundancy` command
**Files:**
- `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` (new)
- `tools/opcuacli-dotnet/README.md`
- `docs/CliTool.md`
Command: `redundancy`
Read:
- `VariableIds.Server_ServerRedundancy_RedundancySupport`
- `VariableIds.Server_ServiceLevel`
- `VariableIds.Server_ServerRedundancy_ServerUriArray`
Output example:
```text
Redundancy Mode: Warm
Service Level: 200
Server URIs:
- urn:localhost:LmxOpcUa:instance1
- urn:localhost:LmxOpcUa:instance2
```
Use SDK constants instead of hardcoded numeric ids in the command implementation.
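The presentation half of the command can be kept separate from the SDK reads, which makes it easy to unit test. The sketch below formats already-read values into the layout shown in the output example; `RedundancyReportSketch` is a hypothetical helper, and reading the three nodes from a session is deliberately out of scope here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RedundancyReportSketch
{
    // Formats values that the command has already read from the server.
    public static string Format(string mode, byte serviceLevel, IEnumerable<string> serverUris)
    {
        var sb = new StringBuilder();
        sb.Append("Redundancy Mode: ").Append(mode).Append('\n');
        sb.Append("Service Level: ").Append(serviceLevel).Append('\n');
        sb.Append("Server URIs:");
        foreach (string uri in serverUris)
        {
            sb.Append("\n- ").Append(uri);
        }
        return sb.ToString();
    }
}
```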
### Step 12: Deploy the second service instance
**Deployment target:** `C:\publish\lmxopcua\instance2`
Suggested configuration differences:
| Setting | instance1 | instance2 |
|---|---|---|
| `OpcUa.Port` | `4840` | `4841` |
| `Dashboard.Port` | `8081` | `8082` |
| `OpcUa.ServerName` | `LmxOpcUa` | `LmxOpcUa2` |
| `OpcUa.ApplicationUri` | `urn:localhost:LmxOpcUa:instance1` | `urn:localhost:LmxOpcUa:instance2` |
| `Redundancy.Enabled` | `true` | `true` |
| `Redundancy.Role` | `Primary` | `Secondary` |
| `Redundancy.Mode` | `Warm` | `Warm` |
| `Redundancy.ServerUris` | same two-entry set | same two-entry set |
Deployment notes:
- both instances should share the same `GalaxyName` and namespace URI
- each instance must have a distinct application certificate identity
- if certificate handling is sensitive, give each instance an explicit `Security.CertificateSubject` or separate PKI root
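For reference, the instance2 column of the table could translate into an `appsettings.json` fragment along these lines. This is a sketch: the key nesting follows the dotted setting names used in the table, `GalaxyName` is carried over from the earlier example, and `ServiceLevelBase` is left at its default, so the secondary's effective baseline would be 150.

```json
{
  "OpcUa": {
    "Port": 4841,
    "ServerName": "LmxOpcUa2",
    "GalaxyName": "ZB",
    "ApplicationUri": "urn:localhost:LmxOpcUa:instance2"
  },
  "Dashboard": {
    "Port": 8082
  },
  "Redundancy": {
    "Enabled": true,
    "Mode": "Warm",
    "Role": "Secondary",
    "ServerUris": [
      "urn:localhost:LmxOpcUa:instance1",
      "urn:localhost:LmxOpcUa:instance2"
    ],
    "ServiceLevelBase": 200
  }
}
```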
Update [service_info.md](service_info.md) with the second instance's details only after the deployment actually exists; do not add speculative values.
---
## Test Plan
### Unit tests: `RedundancyModeResolver`
**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyModeResolverTests.cs`
| Test | Description |
|---|---|
| `Resolve_Disabled_ReturnsNone` | `Enabled=false` returns `None` |
| `Resolve_Warm_ReturnsWarm` | `Mode="Warm"` maps correctly |
| `Resolve_Hot_ReturnsHot` | `Mode="Hot"` maps correctly |
| `Resolve_Unknown_FallsBackToNone` | Unknown mode falls back safely |
| `Resolve_CaseInsensitive` | Case-insensitive parsing works |
### Unit tests: `ServiceLevelCalculator`
**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/ServiceLevelCalculatorTests.cs`
| Test | Description |
|---|---|
| `FullyHealthy_Primary_ReturnsBase` | Healthy primary baseline is preserved |
| `FullyHealthy_Secondary_ReturnsBaseMinusFifty` | Healthy secondary baseline is lower |
| `MxAccessDown_ReducesServiceLevel` | MXAccess failure reduces score |
| `DbDown_ReducesServiceLevel` | DB failure reduces score |
| `BothDown_ReturnsZero` | Both unavailable returns 0 |
| `ClampedTo255` | Upper clamp works |
| `ClampedToZero` | Lower clamp works |
### Unit tests: `RedundancyConfiguration`
**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyConfigurationTests.cs`
| Test | Description |
|---|---|
| `DefaultConfig_Disabled` | `Enabled` defaults to `false` |
| `DefaultConfig_ModeWarm` | `Mode` defaults to `Warm` |
| `DefaultConfig_RolePrimary` | `Role` defaults to `Primary` |
| `DefaultConfig_EmptyServerUris` | `ServerUris` defaults to empty |
| `DefaultConfig_ServiceLevelBase200` | `ServiceLevelBase` defaults to `200` |
### Updates to existing configuration tests
**File:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Configuration/ConfigurationLoadingTests.cs`
Add coverage for:
- `OpcUa.ApplicationUri`
- `Redundancy` section binding
- redundancy validation when `ApplicationUri` is missing
- redundancy validation when local `ApplicationUri` is absent from `ServerUris`
- invalid `ServiceLevelBase`
### Integration tests
**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Integration/RedundancyTests.cs`
Cover:
- redundancy disabled reports `None`
- warm redundancy reports configured mode
- `ServerUriArray` matches configuration
- primary reports higher `ServiceLevel` than secondary
- both servers expose the same namespace URI but different `ApplicationUri` values
- service level drops when MXAccess disconnects
Pattern:
- use two fixture instances
- give each fixture a distinct `ServerName`, `ApplicationUri`, and port
- if secure transport is enabled in those tests, isolate PKI roots to avoid certificate cross-talk
---
## Documentation Plan
### New file
- `docs/Redundancy.md`
Contents:
1. overview of OPC UA non-transparent redundancy
2. difference between namespace URI and server `ApplicationUri`
3. redundancy configuration reference
4. service-level computation
5. two-instance deployment guide
6. CLI `redundancy` command usage
7. troubleshooting
### Updates to existing docs
| File | Changes |
|---|---|
| `docs/Configuration.md` | Add `OpcUa.ApplicationUri` and `Redundancy` sections |
| `docs/OpcUaServer.md` | Correct the current `ApplicationUri == namespace` description and add redundancy behavior |
| `docs/CliTool.md` | Add `redundancy` command |
| `docs/ServiceHosting.md` | Add multi-instance deployment notes |
| `README.md` | Mention redundancy support and link docs |
| `CLAUDE.md` | Add redundancy architecture note |
### Update after real deployment
- `service_info.md`
Only update this once the second instance is actually deployed and verified.
---
## File Change Summary
| File | Action | Description |
|---|---|---|
| `src/.../Configuration/OpcUaConfiguration.cs` | Modify | Add explicit `ApplicationUri` |
| `src/.../Configuration/RedundancyConfiguration.cs` | New | Redundancy config model |
| `src/.../Configuration/AppConfiguration.cs` | Modify | Add `Redundancy` section |
| `src/.../Configuration/ConfigurationValidator.cs` | Modify | Validate/log redundancy and application identity |
| `src/.../OpcUa/RedundancyModeResolver.cs` | New | Map config mode to `RedundancySupport` |
| `src/.../OpcUa/ServiceLevelCalculator.cs` | New | Compute `ServiceLevel` from health inputs |
| `src/.../OpcUa/LmxOpcUaServer.cs` | Modify | Expose redundancy state via SDK server object |
| `src/.../OpcUa/OpcUaServerHost.cs` | Modify | Pass local application identity and redundancy config |
| `src/.../OpcUaService.cs` | Modify | Bind config and wire health updates |
| `src/.../OpcUaServiceBuilder.cs` | Modify | Support redundancy/application identity injection |
| `src/.../appsettings.json` | Modify | Add redundancy settings |
| `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` | New | Read redundancy state from a server |
| `tests/.../Redundancy/*.cs` | New | Unit tests for redundancy config and calculators |
| `tests/.../Configuration/ConfigurationLoadingTests.cs` | Modify | Bind/validate new settings |
| `tests/.../Integration/RedundancyTests.cs` | New | Paired-server integration tests |
| `tests/.../Helpers/OpcUaServerFixture.cs` | Modify | Support paired redundancy fixtures |
| `tests/.../Helpers/OpcUaTestClient.cs` | Modify | Read redundancy nodes in integration tests |
| `docs/Redundancy.md` | New | Dedicated redundancy guide |
| `docs/Configuration.md` | Modify | Document new config |
| `docs/OpcUaServer.md` | Modify | Correct application identity and add redundancy details |
| `docs/CliTool.md` | Modify | Document `redundancy` command |
| `docs/ServiceHosting.md` | Modify | Multi-instance deployment notes |
| `README.md` | Modify | Link redundancy docs |
| `CLAUDE.md` | Modify | Architecture note |
| `service_info.md` | Modify later | Document real second-instance deployment |
---
## Verification Guardrails
### Gate 1: Build
```bash
dotnet build ZB.MOM.WW.LmxOpcUa.slnx
```
### Gate 2: Unit tests
```bash
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests
```
### Gate 3: Redundancy integration tests
```bash
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~Redundancy"
```
### Gate 4: CLI build
```bash
cd tools/opcuacli-dotnet
dotnet build
```
### Gate 5: Manual single-instance check
```bash
opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
```
Expected:
- `RedundancySupport=None`
- `ServiceLevel=255`
### Gate 6: Manual paired-instance check
```bash
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa
```
Expected:
- both report the same `ServerUriArray`
- each reports its own unique local `ApplicationUri`
- primary reports a higher `ServiceLevel`
### Gate 7: Full test suite
```bash
dotnet test ZB.MOM.WW.LmxOpcUa.slnx
```
---
## Risks and Considerations
1. **Application identity is the main correctness risk.** Without unique `ApplicationUri` values, the redundant set is invalid even if `ServerUriArray` is populated.
2. **SDK wiring may require replacing the default redundancy state node.** The base `ServerRedundancyState` does not expose `ServerUriArray`; the implementation may need the non-transparent subtype from the SDK.
3. **Two in-process servers can collide on certificates.** Tests and deployment need distinct application identities and, when necessary, isolated PKI roots.
4. **Both instances hit the same MXAccess runtime and Galaxy DB.** Verify client-registration and polling behavior under paired load.
5. **`ServiceLevel` should remain meaningful, not noisy.** Prefer deterministic role + health inputs over frequent arbitrary adjustments.
6. **`service_info.md` is deployment documentation, not design.** Do not prefill it with speculative values before the second instance actually exists.
---
## Execution Order
1. Step 1: add `OpcUa.ApplicationUri` and separate it from namespace identity
2. Steps 2-5: config model, resolver, calculator, validator
3. Gate 1 + Gate 2
4. Step 9: update builders/helpers so tests can express paired servers cleanly
5. Steps 6-8: server exposure and service-layer health wiring
6. Gate 1 + Gate 2 + Gate 3
7. Step 10: update `appsettings.json`
8. Step 11: add CLI `redundancy` command
9. Gate 4 + Gate 5
10. Step 12: deploy and verify the second instance
11. Update `service_info.md` with real deployment details
12. Documentation updates
13. Gate 7