
# OPC UA Server Redundancy Plan

## Summary

Add configurable non-transparent warm/hot redundancy to the LmxOpcUa server so that two instances sharing the same Galaxy repository can operate as a redundant pair. Each instance should advertise the redundant set through the standard OPC UA redundancy nodes, publish a dynamic `ServiceLevel` based on runtime health, and allow clients to discover and fail over between the instances. The CLI tool should gain a `redundancy` command for inspecting the redundant server set.

This review tightens the original draft in a few important ways:

- It separates **namespace identity** from **application identity**. The current host uses `urn:{GalaxyName}:LmxOpcUa` as both the namespace URI and `ApplicationUri`; that must change for redundancy because each server in the pair needs a unique server URI.
- It avoids hand-wavy "write the redundancy nodes directly" language and instead targets the OPC UA SDK's built-in `ServerObjectState` / `ServerRedundancyState` model.
- It removes a few inaccurate hardcoded assumptions, including the `ServerUriArray` node id and the deployment port examples.
- It fixes execution order so test-builder and helper changes happen before integration coverage depends on them.

This plan still covers server-side redundancy exposure, client-side discovery, a second deployed service instance, documentation, and tests. It does **not** implement automatic server-side failover or subscription transfer; those remain client responsibilities per the OPC UA specification.

---

## Background: OPC UA Redundancy Model

OPC UA exposes redundancy through standard nodes under `Server/ServerRedundancy` plus the `Server/ServiceLevel` property:

| Node | Type | Purpose |
|---|---|---|
| `RedundancySupport` | `RedundancySupport` enum | Declares the redundancy mode: `None`, `Cold`, `Warm`, `Hot`, `Transparent`, `HotAndMirrored` |
| `ServerUriArray` | `String[]` | Lists the `ApplicationUri` values of all servers in the redundant set for non-transparent redundancy |
| `ServiceLevel` | `Byte` (0-255) | Indicates current operational quality; clients prefer the server with the highest value |

### Non-Transparent Redundancy (our target)

In non-transparent redundancy (`Warm` or `Hot`), both servers run independently with their own sessions and subscriptions. Clients discover the redundant set by reading `ServerUriArray`, monitor `ServiceLevel` on each server, and manage their own failover. This fits the current architecture, where each instance independently connects to the same Galaxy repository and MXAccess runtime.
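
The client-side selection described above can be sketched roughly as follows. This is an illustration only: `RedundantServerPicker` and `PickPreferred` are hypothetical names that exist in neither the OPC UA SDK nor this codebase, and the sketch assumes the client has already read `ServiceLevel` from each reachable server.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical client-side helper; not part of the OPC UA SDK or LmxOpcUa.
public static class RedundantServerPicker
{
    // Input: endpoint URL -> last ServiceLevel read from that server.
    // Returns the endpoint with the highest ServiceLevel, or null when
    // no server was read or every server reports 0.
    public static string PickPreferred(IReadOnlyDictionary<string, byte> serviceLevels)
    {
        return serviceLevels
            .Where(kv => kv.Value > 0)
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal) // deterministic tie-break
            .Select(kv => kv.Key)
            .FirstOrDefault();
    }
}
```

A real client would re-evaluate this choice whenever a monitored `ServiceLevel` changes or a session drops, and would recreate its subscriptions on the newly selected server.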

### ServiceLevel semantics

| Range | Meaning |
|---|---|
| 0 | Server is not operational |
| 1-99 | Degraded |
| 100-199 | Healthy secondary |
| 200-255 | Healthy primary |

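The primary should advertise a higher `ServiceLevel` than the secondary so clients prefer it when both are healthy.

These bands are this plan's own convention. OPC UA Part 4 defines its own sub-ranges (`0` Maintenance, `1` NoData, `2-199` Degraded, `200-255` Healthy), so a client that applies them strictly would classify a healthy secondary below `200` as degraded, although it would still prefer the primary.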

---

## Current State

- `LmxOpcUaServer` extends `StandardServer` but does not expose redundancy state
- `ServerRedundancy/RedundancySupport` remains the SDK default (`None`)
- `Server/ServiceLevel` remains the SDK default (`255`)
- No configuration exists for redundancy mode, role, or redundant partner URIs
- `OpcUaServerHost` currently sets `ApplicationUri = urn:{GalaxyName}:LmxOpcUa`
- `LmxNodeManager` uses the same `urn:{GalaxyName}:LmxOpcUa` as the published namespace URI
- A single deployed instance is documented in [service_info.md](service_info.md)
- No CLI command exists for reading redundancy information

## Key gap to fix first

For redundancy, each server in the set must advertise a unique `ApplicationUri`, and `ServerUriArray` must contain those unique values. The current implementation cannot do that because it reuses the namespace URI as the server `ApplicationUri`. Phase 1 therefore needs an application-identity change before the redundancy nodes can be correct.

---

## Scope

### In scope (Phase 1)

1. Add explicit application-identity configuration so each instance can have a unique `ApplicationUri`
2. Add redundancy configuration for mode, role, and server URI membership
3. Expose `RedundancySupport`, `ServerUriArray`, and dynamic `ServiceLevel`
4. Compute `ServiceLevel` from runtime health and preferred role
5. Add a CLI `redundancy` command
6. Document two-instance deployment
7. Add unit and integration coverage

### Deferred

- Automatic subscription transfer
- Server-initiated failover
- Transparent redundancy mode
- Load-balancer-specific HTTP health endpoints
- Mirrored data/session state

---

## Configuration Design

### 1. Add explicit `OpcUa.ApplicationUri`

**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/OpcUaConfiguration.cs`

Add:

```csharp
public string? ApplicationUri { get; set; }
```

Rules:

- `ApplicationUri = null` preserves the current behavior for non-redundant deployments
- when `Redundancy.Enabled = true`, `ApplicationUri` must be explicitly set and unique per instance
- `LmxNodeManager` should continue using `urn:{GalaxyName}:LmxOpcUa` as the namespace URI so both redundant servers expose the same namespace
- `Redundancy.ServerUris` must contain the exact `ApplicationUri` values for all servers in the redundant set

Example:

```json
{
  "OpcUa": {
    "ServerName": "LmxOpcUa",
    "GalaxyName": "ZB",
    "ApplicationUri": "urn:localhost:LmxOpcUa:instance1"
  }
}
```

### 2. New `Redundancy` section in `appsettings.json`

```json
{
  "Redundancy": {
    "Enabled": false,
    "Mode": "Warm",
    "Role": "Primary",
    "ServerUris": [],
    "ServiceLevelBase": 200
  }
}
```

### 3. Configuration model

**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/RedundancyConfiguration.cs` (new)

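```csharp
using System.Collections.Generic;

// Bound from the "Redundancy" section of appsettings.json.
public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";       // "Warm" or "Hot" in Phase 1
    public string Role { get; set; } = "Primary";    // "Primary" or "Secondary"
    public List<string> ServerUris { get; set; } = new List<string>(); // ApplicationUri of every server in the set
    public int ServiceLevelBase { get; set; } = 200; // valid range 1-255
}
```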

### 4. Configuration rules

- `Enabled` defaults to `false`
- `Mode` supports `Warm` and `Hot` in Phase 1
- `Role` supports `Primary` and `Secondary`
- `ServerUris` must contain the local `OpcUa.ApplicationUri` when redundancy is enabled
- `ServerUris` should contain at least two unique entries when redundancy is enabled
- `ServiceLevelBase` should be in the range `1-255`
- Effective baseline:
  - Primary: `ServiceLevelBase`
  - Secondary: `max(0, ServiceLevelBase - 50)`
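
The effective-baseline rule can be expressed as a small helper. This is a sketch, not existing code: `ServiceLevelBaseline` and `ForRole` are hypothetical names, and it assumes that any role other than `Secondary` has already been validated and is treated as `Primary`.

```csharp
using System;

// Hypothetical helper illustrating the effective-baseline rule.
public static class ServiceLevelBaseline
{
    private const int SecondaryOffset = 50;

    // Primary keeps the configured base; Secondary sits 50 lower, floored at 0.
    // Assumption: role validation happens elsewhere, so non-"Secondary" means Primary.
    public static int ForRole(string role, int serviceLevelBase) =>
        string.Equals(role, "Secondary", StringComparison.OrdinalIgnoreCase)
            ? Math.Max(0, serviceLevelBase - SecondaryOffset)
            : serviceLevelBase;
}
```

With the default `ServiceLevelBase` of `200`, this yields `200` for the primary and `150` for the secondary.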

### App root updates

**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/AppConfiguration.cs`

- Add `public RedundancyConfiguration Redundancy { get; set; } = new RedundancyConfiguration();`

---

## Implementation Steps

### Step 1: Separate application identity from namespace identity

**Files:**

- `src/.../Configuration/OpcUaConfiguration.cs`
- `src/.../OpcUa/OpcUaServerHost.cs`
- `docs/OpcUaServer.md`
- `tests/.../Configuration/ConfigurationLoadingTests.cs`

Changes:

1. Add optional `OpcUa.ApplicationUri`
2. Keep `urn:{GalaxyName}:LmxOpcUa` as the namespace URI used by `LmxNodeManager`
3. Set `ApplicationConfiguration.ApplicationUri` from `OpcUa.ApplicationUri` when supplied
4. Keep `ApplicationUri` and namespace URI distinct in docs and tests

This step is required before redundancy can be correct.
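
A minimal sketch of the identity split, assuming the fallback stays exactly as it is today when no explicit URI is configured. `ApplicationIdentity` is a hypothetical helper name; the real logic would presumably live in `OpcUaServerHost`, and treating a whitespace-only value as "not configured" is an assumption of this sketch.

```csharp
// Hypothetical helper; illustrates the rule only, not the actual host code.
public static class ApplicationIdentity
{
    // Namespace URI shared by every instance that serves the same Galaxy.
    public static string NamespaceUri(string galaxyName) =>
        $"urn:{galaxyName}:LmxOpcUa";

    // An explicit OpcUa.ApplicationUri wins. Otherwise fall back to the current
    // behavior (ApplicationUri equal to the namespace URI), which is only
    // acceptable for non-redundant deployments.
    public static string ResolveApplicationUri(string configuredUri, string galaxyName) =>
        string.IsNullOrWhiteSpace(configuredUri)
            ? NamespaceUri(galaxyName)
            : configuredUri.Trim();
}
```

For a Galaxy named `ZB`, an unset `ApplicationUri` resolves to `urn:ZB:LmxOpcUa`, while the namespace URI stays `urn:ZB:LmxOpcUa` regardless of what the instance advertises as its own identity.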

### Step 2: Add `RedundancyConfiguration` and bind it

**Files:**

- `src/.../Configuration/RedundancyConfiguration.cs` (new)
- `src/.../Configuration/AppConfiguration.cs`
- `src/.../OpcUaService.cs`

Changes:

1. Create `RedundancyConfiguration`
2. Add `Redundancy` to `AppConfiguration`
3. Bind `configuration.GetSection("Redundancy").Bind(_config.Redundancy);`
4. Pass `_config.Redundancy` through to `OpcUaServerHost` and `LmxOpcUaServer`
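
The binding call in item 3 can be tried in isolation. The sketch below assumes the `Microsoft.Extensions.Configuration` and `Microsoft.Extensions.Configuration.Binder` packages are referenced and uses an in-memory provider in place of `appsettings.json`; `RedundancyBindingSketch` is a hypothetical name.

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

// Same shape as the configuration model in this plan.
public class RedundancyConfiguration
{
    public bool Enabled { get; set; }
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}

// Hypothetical demo class; the real binding happens in OpcUaService.
public static class RedundancyBindingSketch
{
    public static RedundancyConfiguration Load()
    {
        // In-memory keys stand in for appsettings.json; array items use ":<index>" keys.
        IConfiguration configuration = new ConfigurationBuilder()
            .AddInMemoryCollection(new Dictionary<string, string>
            {
                ["Redundancy:Enabled"] = "true",
                ["Redundancy:Role"] = "Secondary",
                ["Redundancy:ServerUris:0"] = "urn:localhost:LmxOpcUa:instance1",
                ["Redundancy:ServerUris:1"] = "urn:localhost:LmxOpcUa:instance2",
            })
            .Build();

        var redundancy = new RedundancyConfiguration();
        configuration.GetSection("Redundancy").Bind(redundancy);
        return redundancy; // Mode and ServiceLevelBase keep their defaults
    }
}
```

Keys that are absent from the section leave the property initializers untouched, which is why the defaults in the model double as the documented defaults.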

### Step 3: Add `RedundancyModeResolver`

**File:** `src/.../OpcUa/RedundancyModeResolver.cs` (new)

Responsibilities:

- map `Mode` to `RedundancySupport`
- validate supported Phase 1 modes
- fall back safely when disabled or invalid

```csharp
public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled);
}
```
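
One possible body for that signature, assuming the OPC Foundation .NET Standard SDK (which supplies the `RedundancySupport` enum in the `Opc.Ua` namespace) is referenced. Explicit string matching is used instead of `Enum.TryParse` so that numeric strings and modes outside Phase 1, such as `Transparent`, fall back to `None`.

```csharp
using Opc.Ua; // RedundancySupport enum from the OPC UA SDK

public static class RedundancyModeResolver
{
    // Disabled, empty, or unsupported modes resolve to None.
    public static RedundancySupport Resolve(string mode, bool enabled)
    {
        if (!enabled || string.IsNullOrWhiteSpace(mode))
            return RedundancySupport.None;

        switch (mode.Trim().ToLowerInvariant())
        {
            case "warm": return RedundancySupport.Warm;
            case "hot": return RedundancySupport.Hot;
            default: return RedundancySupport.None; // not supported in Phase 1
        }
    }
}
```

Whether an invalid mode should also log a warning here or only in `ConfigurationValidator` is left open by this sketch.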

### Step 4: Add `ServiceLevelCalculator`

**File:** `src/.../OpcUa/ServiceLevelCalculator.cs` (new)

Purpose:

- compute the current `ServiceLevel` from a baseline plus health inputs

Suggested signature:

```csharp
public sealed class ServiceLevelCalculator
{
    public byte Calculate(int baseLevel, bool mxAccessConnected, bool dbConnected);
}
```

Suggested logic:

- start with the role-adjusted baseline supplied by the caller
- subtract 100 if MXAccess is disconnected
- subtract 50 if the Galaxy DB is unreachable
- return `0` if both are down
- clamp to `0-255`
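
The suggested logic translates fairly directly into code. The sketch below follows the suggested signature; the penalty constant names are illustrative, and `Math.Min`/`Math.Max` are used rather than `Math.Clamp` on the assumption that the host may target a framework where `Math.Clamp` is unavailable.

```csharp
using System;

public sealed class ServiceLevelCalculator
{
    private const int MxAccessPenalty = 100; // MXAccess disconnected
    private const int DbPenalty = 50;        // Galaxy DB unreachable

    // baseLevel is the role-adjusted baseline supplied by the caller.
    public byte Calculate(int baseLevel, bool mxAccessConnected, bool dbConnected)
    {
        // Both dependencies down: the server is not operational.
        if (!mxAccessConnected && !dbConnected)
            return 0;

        int level = baseLevel;
        if (!mxAccessConnected) level -= MxAccessPenalty;
        if (!dbConnected) level -= DbPenalty;

        // Clamp into the Byte range used by ServiceLevel.
        return (byte)Math.Min(255, Math.Max(0, level));
    }
}
```

With a primary baseline of `200`, losing MXAccess yields `100` and losing the Galaxy DB yields `150`; with a secondary baseline of `150`, the same failures yield `50` and `100`.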

### Step 5: Extend `ConfigurationValidator`

**File:** `src/.../Configuration/ConfigurationValidator.cs`

Add validation/logging for:

- `OpcUa.ApplicationUri`
- `Redundancy.Enabled`, `Mode`, `Role`
- `ServerUris` membership and uniqueness
- `ServiceLevelBase`
- local `OpcUa.ApplicationUri` must appear in `Redundancy.ServerUris` when enabled
- warning when fewer than 2 unique server URIs are configured
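
The checks above could look roughly like this when written as a standalone function. `RedundancyValidation` is a hypothetical name, the message strings are illustrative, and treating duplicate URIs as an error rather than a warning is an assumption of this sketch; the real checks belong in `ConfigurationValidator` and its existing logging.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical standalone sketch of the redundancy validation rules.
public static class RedundancyValidation
{
    public static List<string> Validate(
        bool enabled, string mode, string role,
        IEnumerable<string> serverUris, int serviceLevelBase, string applicationUri)
    {
        var problems = new List<string>();
        if (!enabled) return problems; // nothing to check for single-instance deployments

        var uris = (serverUris ?? Enumerable.Empty<string>()).ToList();
        // Ordinal comparison: ServerUris must match ApplicationUri values exactly.
        var unique = new HashSet<string>(uris, StringComparer.Ordinal);

        if (!IsOneOf(mode, "Warm", "Hot"))
            problems.Add("error: Redundancy.Mode must be Warm or Hot");
        if (!IsOneOf(role, "Primary", "Secondary"))
            problems.Add("error: Redundancy.Role must be Primary or Secondary");
        if (serviceLevelBase < 1 || serviceLevelBase > 255)
            problems.Add("error: Redundancy.ServiceLevelBase must be in 1-255");
        if (string.IsNullOrWhiteSpace(applicationUri))
            problems.Add("error: OpcUa.ApplicationUri must be set when redundancy is enabled");
        else if (!unique.Contains(applicationUri))
            problems.Add("error: Redundancy.ServerUris must contain the local ApplicationUri");
        if (unique.Count != uris.Count)
            problems.Add("error: Redundancy.ServerUris contains duplicate entries");
        if (unique.Count < 2)
            problems.Add("warning: fewer than 2 unique server URIs configured");

        return problems;
    }

    private static bool IsOneOf(string value, params string[] allowed) =>
        allowed.Any(a => string.Equals(a, value, StringComparison.OrdinalIgnoreCase));
}
```

Returning messages instead of throwing keeps the sketch easy to unit test; the real validator can map `error:` entries to a failed start and `warning:` entries to log output.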

### Step 6: Expose redundancy through the standard OPC UA server object

**File:** `src/.../OpcUa/LmxOpcUaServer.cs`

Changes:

1. Accept `RedundancyConfiguration` and local `ApplicationUri`
2. On startup, locate the built-in `ServerObjectState`
3. Configure `ServerObjectState.ServiceLevel`
4. Configure the server redundancy object using the SDK's standard server-state types instead of writing guessed node ids directly
5. If the default `ServerRedundancyState` does not expose `ServerUriArray`, replace or upgrade it with the appropriate non-transparent redundancy state type from the SDK before populating values
6. Expose an internal method such as `UpdateServiceLevel(bool mxConnected, bool dbConnected)` for service-layer health updates

Important: the implementation should use SDK types/constants (`ServerObjectState`, `ServerRedundancyState`, `NonTransparentRedundancyState`, `VariableIds.*`) rather than hand-maintained numeric literals.

### Step 7: Update `OpcUaServerHost`

**File:** `src/.../OpcUa/OpcUaServerHost.cs`

Changes:

1. Accept `RedundancyConfiguration`
2. Pass redundancy config and resolved local `ApplicationUri` into `LmxOpcUaServer`
3. Log redundancy mode/role/server URIs at startup

### Step 8: Wire health updates in `OpcUaService`

**File:** `src/.../OpcUaService.cs`

Changes:

1. Bind and pass redundancy config
2. After startup, initialize the starting `ServiceLevel`
3. Subscribe to `IMxAccessClient.ConnectionStateChanged`
4. Update DB health whenever startup repository checks, change-detection work, or rebuild attempts succeed/fail
5. Prefer event-driven updates; add a lightweight periodic refresh only if necessary

Avoid introducing a second large standalone polling loop when existing connection and repository activity already gives most of the needed health signals.
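
One way to keep the updates event-driven is a small state holder that forwards to the server only when a health input actually changes. `HealthState` and its method names are hypothetical; the callback would be something like the `UpdateServiceLevel(bool mxConnected, bool dbConnected)` method proposed in Step 6.

```csharp
using System;

// Hypothetical helper: collapses health events into ServiceLevel updates.
public sealed class HealthState
{
    private readonly object _gate = new object();
    private readonly Action<bool, bool> _onChanged;
    private bool _mxConnected; // both inputs start as "down"
    private bool _dbConnected;

    public HealthState(Action<bool, bool> onChanged)
    {
        _onChanged = onChanged ?? throw new ArgumentNullException(nameof(onChanged));
    }

    // Call from the MXAccess connection-state handler.
    public void ReportMxAccess(bool connected) => Update(connected, null);

    // Call when a repository check, change-detection pass, or rebuild succeeds or fails.
    public void ReportDb(bool connected) => Update(null, connected);

    private void Update(bool? mx, bool? db)
    {
        lock (_gate)
        {
            bool newMx = mx ?? _mxConnected;
            bool newDb = db ?? _dbConnected;
            if (newMx == _mxConnected && newDb == _dbConnected)
                return; // unchanged: do not touch ServiceLevel

            _mxConnected = newMx;
            _dbConnected = newDb;
            // Invoked inside the lock so updates arrive in order;
            // the callback is assumed to be quick and non-blocking.
            _onChanged(newMx, newDb);
        }
    }
}
```

Because both inputs start as "down" and an unchanged report is swallowed, the starting `ServiceLevel` still has to be initialized explicitly after startup, as item 2 requires.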

### Step 9: Update test builders and helpers before integration coverage

**Files:**

- `src/.../OpcUaServiceBuilder.cs`
- `tests/.../Helpers/OpcUaServerFixture.cs`
- `tests/.../Helpers/OpcUaTestClient.cs`

Changes:

- add `WithRedundancy(...)`
- add `WithApplicationUri(...)` or allow full `OpcUaConfiguration` override
- ensure redundancy tests can run two in-process servers with distinct `ServerName`, `ApplicationUri`, and certificate identity
- when needed, use separate PKI roots in tests so paired fixtures do not collide on certificate state

### Step 10: Update `appsettings.json`

**File:** `src/.../appsettings.json`

Add:

- an `OpcUa.ApplicationUri` example, with the explanatory commentary kept in the docs rather than in the JSON file
- `Redundancy` section with `Enabled = false` defaults

### Step 11: Add CLI `redundancy` command

**Files:**

- `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` (new)
- `tools/opcuacli-dotnet/README.md`
- `docs/CliTool.md`

Command: `redundancy`

Read:

- `VariableIds.Server_ServerRedundancy_RedundancySupport`
- `VariableIds.Server_ServiceLevel`
- `VariableIds.Server_ServerRedundancy_ServerUriArray`

Output example:

```text
Redundancy Mode:  Warm
Service Level:    200
Server URIs:
  - urn:localhost:LmxOpcUa:instance1
  - urn:localhost:LmxOpcUa:instance2
```

Use SDK constants instead of hardcoded numeric ids in the command implementation.
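
The output layout above can be produced by a formatter that is kept separate from the SDK read calls, which also makes it easy to unit test. `RedundancyReport` is a hypothetical name, and the sketch assumes the three values have already been read from the server.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical formatter for the `redundancy` command output.
public static class RedundancyReport
{
    private const int LabelWidth = 18; // aligns values as in the output example

    public static string Format(string mode, byte serviceLevel, IEnumerable<string> serverUris)
    {
        var sb = new StringBuilder();
        sb.Append("Redundancy Mode:".PadRight(LabelWidth)).Append(mode).Append('\n');
        sb.Append("Service Level:".PadRight(LabelWidth)).Append(serviceLevel).Append('\n');
        sb.Append("Server URIs:").Append('\n');
        foreach (var uri in serverUris)
            sb.Append("  - ").Append(uri).Append('\n');
        return sb.ToString();
    }
}
```

When redundancy is disabled and the array is empty, this prints the `Server URIs:` header with no entries; whether the command should print a placeholder instead is a presentation choice the plan does not fix.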

### Step 12: Deploy the second service instance

**Deployment target:** `C:\publish\lmxopcua\instance2`

Suggested configuration differences:

| Setting | instance1 | instance2 |
|---|---|---|
| `OpcUa.Port` | `4840` | `4841` |
| `Dashboard.Port` | `8081` | `8082` |
| `OpcUa.ServerName` | `LmxOpcUa` | `LmxOpcUa2` |
| `OpcUa.ApplicationUri` | `urn:localhost:LmxOpcUa:instance1` | `urn:localhost:LmxOpcUa:instance2` |
| `Redundancy.Enabled` | `true` | `true` |
| `Redundancy.Role` | `Primary` | `Secondary` |
| `Redundancy.Mode` | `Warm` | `Warm` |
| `Redundancy.ServerUris` | same two-entry set | same two-entry set |

Deployment notes:

- both instances should share the same `GalaxyName` and namespace URI
- each instance must have a distinct application certificate identity
- if certificate handling is sensitive, give each instance an explicit `Security.CertificateSubject` or separate PKI root

Update [service_info.md](service_info.md) with the second instance details only after the deployment is real, not speculative.

---

## Test Plan

### Unit tests: `RedundancyModeResolver`

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyModeResolverTests.cs`

| Test | Description |
|---|---|
| `Resolve_Disabled_ReturnsNone` | `Enabled=false` returns `None` |
| `Resolve_Warm_ReturnsWarm` | `Mode="Warm"` maps correctly |
| `Resolve_Hot_ReturnsHot` | `Mode="Hot"` maps correctly |
| `Resolve_Unknown_FallsBackToNone` | Unknown mode falls back safely |
| `Resolve_CaseInsensitive` | Case-insensitive parsing works |

### Unit tests: `ServiceLevelCalculator`

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/ServiceLevelCalculatorTests.cs`

| Test | Description |
|---|---|
| `FullyHealthy_Primary_ReturnsBase` | Healthy primary baseline is preserved |
| `FullyHealthy_Secondary_ReturnsBaseMinusFifty` | Healthy secondary baseline is lower |
| `MxAccessDown_ReducesServiceLevel` | MXAccess failure reduces score |
| `DbDown_ReducesServiceLevel` | DB failure reduces score |
| `BothDown_ReturnsZero` | Both unavailable returns 0 |
| `ClampedTo255` | Upper clamp works |
| `ClampedToZero` | Lower clamp works |

### Unit tests: `RedundancyConfiguration`

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyConfigurationTests.cs`

| Test | Description |
|---|---|
| `DefaultConfig_Disabled` | `Enabled` defaults to `false` |
| `DefaultConfig_ModeWarm` | `Mode` defaults to `Warm` |
| `DefaultConfig_RolePrimary` | `Role` defaults to `Primary` |
| `DefaultConfig_EmptyServerUris` | `ServerUris` defaults to empty |
| `DefaultConfig_ServiceLevelBase200` | `ServiceLevelBase` defaults to `200` |

### Updates to existing configuration tests

**File:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Configuration/ConfigurationLoadingTests.cs`

Add coverage for:

- `OpcUa.ApplicationUri`
- `Redundancy` section binding
- redundancy validation when `ApplicationUri` is missing
- redundancy validation when local `ApplicationUri` is absent from `ServerUris`
- invalid `ServiceLevelBase`

### Integration tests

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Integration/RedundancyTests.cs`

Cover:

- redundancy disabled reports `None`
- warm redundancy reports configured mode
- `ServerUriArray` matches configuration
- primary reports higher `ServiceLevel` than secondary
- both servers expose the same namespace URI but different `ApplicationUri` values
- service level drops when MXAccess disconnects

Pattern:

- use two fixture instances
- give each fixture a distinct `ServerName`, `ApplicationUri`, and port
- if secure transport is enabled in those tests, isolate PKI roots to avoid certificate cross-talk

---

## Documentation Plan

### New file

- `docs/Redundancy.md`

Contents:

1. overview of OPC UA non-transparent redundancy
2. difference between namespace URI and server `ApplicationUri`
3. redundancy configuration reference
4. service-level computation
5. two-instance deployment guide
6. CLI `redundancy` command usage
7. troubleshooting

### Updates to existing docs

| File | Changes |
|---|---|
| `docs/Configuration.md` | Add `OpcUa.ApplicationUri` and `Redundancy` sections |
| `docs/OpcUaServer.md` | Correct the current `ApplicationUri == namespace` description and add redundancy behavior |
| `docs/CliTool.md` | Add `redundancy` command |
| `docs/ServiceHosting.md` | Add multi-instance deployment notes |
| `README.md` | Mention redundancy support and link docs |
| `CLAUDE.md` | Add redundancy architecture note |

### Update after real deployment

- `service_info.md`

Only update this once the second instance is actually deployed and verified.

---

## File Change Summary

| File | Action | Description |
|---|---|---|
| `src/.../Configuration/OpcUaConfiguration.cs` | Modify | Add explicit `ApplicationUri` |
| `src/.../Configuration/RedundancyConfiguration.cs` | New | Redundancy config model |
| `src/.../Configuration/AppConfiguration.cs` | Modify | Add `Redundancy` section |
| `src/.../Configuration/ConfigurationValidator.cs` | Modify | Validate/log redundancy and application identity |
| `src/.../OpcUa/RedundancyModeResolver.cs` | New | Map config mode to `RedundancySupport` |
| `src/.../OpcUa/ServiceLevelCalculator.cs` | New | Compute `ServiceLevel` from health inputs |
| `src/.../OpcUa/LmxOpcUaServer.cs` | Modify | Expose redundancy state via SDK server object |
| `src/.../OpcUa/OpcUaServerHost.cs` | Modify | Pass local application identity and redundancy config |
| `src/.../OpcUaService.cs` | Modify | Bind config and wire health updates |
| `src/.../OpcUaServiceBuilder.cs` | Modify | Support redundancy/application identity injection |
| `src/.../appsettings.json` | Modify | Add redundancy settings |
| `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` | New | Read redundancy state from a server |
| `tests/.../Redundancy/*.cs` | New | Unit tests for redundancy config and calculators |
| `tests/.../Configuration/ConfigurationLoadingTests.cs` | Modify | Bind/validate new settings |
| `tests/.../Integration/RedundancyTests.cs` | New | Paired-server integration tests |
| `tests/.../Helpers/OpcUaServerFixture.cs` | Modify | Support paired redundancy fixtures |
| `tests/.../Helpers/OpcUaTestClient.cs` | Modify | Read redundancy nodes in integration tests |
| `docs/Redundancy.md` | New | Dedicated redundancy guide |
| `docs/Configuration.md` | Modify | Document new config |
| `docs/OpcUaServer.md` | Modify | Correct application identity and add redundancy details |
| `docs/CliTool.md` | Modify | Document `redundancy` command |
| `docs/ServiceHosting.md` | Modify | Multi-instance deployment notes |
| `README.md` | Modify | Link redundancy docs |
| `CLAUDE.md` | Modify | Architecture note |
| `service_info.md` | Modify later | Document real second-instance deployment |

---

## Verification Guardrails

### Gate 1: Build

```bash
dotnet build ZB.MOM.WW.LmxOpcUa.slnx
```

### Gate 2: Unit tests

```bash
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests
```

### Gate 3: Redundancy integration tests

```bash
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~Redundancy"
```

### Gate 4: CLI build

```bash
cd tools/opcuacli-dotnet
dotnet build
```

### Gate 5: Manual single-instance check

```bash
opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
```

Expected:

- `RedundancySupport=None`
- `ServiceLevel=255`

### Gate 6: Manual paired-instance check

```bash
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa
```

Expected:

- both report the same `ServerUriArray`
- each server's own `ApplicationUri` is unique and appears in that array
- primary reports a higher `ServiceLevel`

### Gate 7: Full test suite

```bash
dotnet test ZB.MOM.WW.LmxOpcUa.slnx
```

---

## Risks and Considerations

1. **Application identity is the main correctness risk.** Without unique `ApplicationUri` values, the redundant set is invalid even if `ServerUriArray` is populated.
2. **SDK wiring may require replacing the default redundancy state node.** The base `ServerRedundancyState` does not expose `ServerUriArray`; the implementation may need the non-transparent subtype from the SDK.
3. **Two in-process servers can collide on certificates.** Tests and deployment need distinct application identities and, when necessary, isolated PKI roots.
4. **Both instances hit the same MXAccess runtime and Galaxy DB.** Verify client-registration and polling behavior under paired load.
5. **`ServiceLevel` should remain meaningful, not noisy.** Prefer deterministic role + health inputs over frequent arbitrary adjustments.
6. **`service_info.md` is deployment documentation, not design.** Do not prefill it with speculative values before the second instance actually exists.

---

## Execution Order

1. Step 1: add `OpcUa.ApplicationUri` and separate it from namespace identity
2. Steps 2-5: config model, resolver, calculator, validator
3. Gate 1 + Gate 2
4. Step 9: update builders/helpers so tests can express paired servers cleanly
5. Steps 6-8: server exposure and service-layer health wiring
6. Gate 1 + Gate 2 + Gate 3
7. Step 10: update `appsettings.json`
8. Step 11: add CLI `redundancy` command
9. Gate 4 + Gate 5
10. Step 12: deploy the second instance, then run Gate 6 to verify the pair
11. Update `service_info.md` with real deployment details
12. Documentation updates
13. Gate 7