# OPC UA Server Redundancy Plan

## Summary

Add configurable non-transparent warm/hot redundancy to the LmxOpcUa server so that two instances sharing the same Galaxy repository can operate as a redundant pair. Each instance advertises itself and its partner through the OPC UA `ServerRedundancy` node, publishes a dynamic `ServiceLevel` reflecting runtime health, and allows clients to discover the redundant set and fail over between instances. The CLI tool gains a `redundancy` command for inspecting the redundant server set.

This plan covers server-side redundancy exposure, client-side discovery, a second deployed service instance, documentation, and tests. It does **not** implement automatic server-side failover or subscription transfer — those are client responsibilities per the OPC UA specification.

---

## Background: OPC UA Redundancy Model

OPC UA exposes redundancy through three address-space nodes: `RedundancySupport` and `ServerUriArray` under `Server/ServerRedundancy`, and `ServiceLevel` directly under `Server`:

| Node | Type | Purpose |
|---|---|---|
| `RedundancySupport` | `RedundancySupport` enum | Declares the redundancy mode: `None`, `Cold`, `Warm`, `Hot`, `Transparent`, `HotAndMirrored` |
| `ServerUriArray` | `String[]` | Lists the `ApplicationUri` values of all servers in the redundant set (non-transparent modes) |
| `ServiceLevel` | `Byte` (0–255) | Indicates current operational quality; clients prefer the server with the highest value |

### Non-Transparent Redundancy (our target)

In non-transparent redundancy (`Warm` or `Hot`), both servers run independently with their own sessions and subscriptions. Clients discover the redundant set by reading `ServerUriArray`, monitor `ServiceLevel` on each server, and manage their own failover. This model fits our architecture where each instance connects to the same Galaxy repository and MXAccess runtime independently.

### ServiceLevel Semantics

| Range | Meaning |
|---|---|
| 0 | Server is not operational |
| 1–99 | Degraded (e.g., MXAccess disconnected, DB unreachable) |
| 100–199 | Healthy secondary |
| 200–255 | Healthy primary (preferred) |

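The primary server should advertise a higher `ServiceLevel` than the secondary so clients prefer it when both are healthy. The ranges above are this plan's convention; OPC UA Part 4 defines its own sub-ranges (0 Maintenance, 1 NoData, 2–199 Degraded, 200–255 Healthy), so a spec-aware client may treat a healthy secondary at 150 as degraded.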
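
To illustrate how a client might act on these values, the sketch below picks the server with the highest `ServiceLevel` and ignores servers reporting 0. `ServerReading` and `PreferredServerSelector` are hypothetical names used only for illustration; they are not part of the LmxOpcUa codebase or the OPC UA SDK. A real client would obtain the readings by reading `Server/ServiceLevel` from each endpoint.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration type: one ServiceLevel reading per server endpoint.
public sealed class ServerReading
{
    public ServerReading(string endpointUrl, byte serviceLevel)
    {
        EndpointUrl = endpointUrl;
        ServiceLevel = serviceLevel;
    }

    public string EndpointUrl { get; }
    public byte ServiceLevel { get; }
}

public static class PreferredServerSelector
{
    // Returns the endpoint with the highest ServiceLevel, or null when
    // every server reports 0 (not operational) or the list is empty.
    public static string Select(IEnumerable<ServerReading> readings)
    {
        return readings
            .Where(r => r.ServiceLevel > 0)
            .OrderByDescending(r => r.ServiceLevel)
            .Select(r => r.EndpointUrl)
            .FirstOrDefault();
    }
}
```

With a healthy primary at 200 and a healthy secondary at 150, `Select` returns the primary's endpoint; if the primary drops to 0, it returns the secondary's.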

---

## Current State

- `LmxOpcUaServer` extends `StandardServer` but does not override any redundancy-related methods
- `ServerRedundancy/RedundancySupport` defaults to `None` (SDK default)
- `ServiceLevel` defaults to `255` (SDK default — "fully operational")
- No configuration for redundant partner URIs or role designation
- Single deployed instance at `C:\publish\lmxopcua\instance1` on port 4840
- No CLI support for reading redundancy information

---

## Scope

### In Scope (Phase 1)

1. **Redundancy configuration model** — role, partner URIs, ServiceLevel weights
2. **Server redundancy node exposure** — `RedundancySupport`, `ServerUriArray`, dynamic `ServiceLevel`
3. **ServiceLevel computation** — based on runtime health (MXAccess state, DB connectivity, role)
4. **CLI redundancy command** — read `RedundancySupport`, `ServerUriArray`, `ServiceLevel` from a server
5. **Second service instance** — deployed at `C:\publish\lmxopcua\instance2` with non-overlapping ports
6. **Documentation** — new `docs/Redundancy.md` component doc, updates to existing docs
7. **Unit tests** — config, ServiceLevel computation, resolver tests
8. **Integration tests** — two-server redundancy E2E test in the integration test project

### Deferred

- Automatic subscription transfer (client-side responsibility)
- Server-initiated failover (Galaxy `redundancy` table / engine flags)
- Transparent redundancy mode
- Health-check HTTP endpoint for load balancers

---

## Configuration Design

### New `Redundancy` section in `appsettings.json`

```json
{
  "Redundancy": {
    "Enabled": false,
    "Mode": "Warm",
    "Role": "Primary",
    "ServerUris": [],
    "ServiceLevelBase": 200
  }
}
```

### Configuration model

**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/RedundancyConfiguration.cs` (new)

```csharp
public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}
```

### Configuration rules

- `Enabled` defaults to `false` for backward compatibility. When `false`, `RedundancySupport = None` and `ServiceLevel = 255` (SDK defaults).
- `Mode` must be `Warm` or `Hot` (Phase 1). Maps to `RedundancySupport.Warm` or `RedundancySupport.Hot`.
- `Role` must be `Primary` or `Secondary`. Controls the base `ServiceLevel` (Primary gets `ServiceLevelBase`, Secondary gets `ServiceLevelBase - 50`).
- `ServerUris` lists the `ApplicationUri` values for **all** servers in the redundant set, including the local server. The OPC UA spec requires this to contain the full set. These are application URIs like `urn:ZB:LmxOpcUa`, not namespace URIs or endpoint URLs.
- `ServiceLevelBase` is the starting ServiceLevel when the server is fully healthy. Degraded conditions subtract from this value.

### App root updates

**File:** `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/AppConfiguration.cs`

- Add `public RedundancyConfiguration Redundancy { get; set; } = new RedundancyConfiguration();`

---

## Implementation Steps

### Step 1: Add RedundancyConfiguration model and bind it

**Files:**
- `src/.../Configuration/RedundancyConfiguration.cs` (new)
- `src/.../Configuration/AppConfiguration.cs`
- `src/.../OpcUaService.cs`

Changes:
1. Create `RedundancyConfiguration` class with properties above
2. Add `Redundancy` property to `AppConfiguration`
3. Bind `configuration.GetSection("Redundancy").Bind(_config.Redundancy);`
4. Pass `_config.Redundancy` through to `OpcUaServerHost` and `LmxOpcUaServer`

### Step 2: Add RedundancyModeResolver

**File:** `src/.../OpcUa/RedundancyModeResolver.cs` (new)

Responsibilities:
- Map `Mode` string to `RedundancySupport` enum value
- Validate against supported Phase 1 modes (`Warm`, `Hot`)
- Fall back to `None` with warning for unknown modes

```csharp
public static class RedundancyModeResolver
{
    public static RedundancySupport Resolve(string mode, bool enabled);
}
```
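
One possible body for the resolver is sketched below. To keep the snippet compilable on its own it declares a stand-in `RedundancySupport` enum that mirrors the OPC UA values; the real class would use the SDK's `Opc.Ua.RedundancySupport`. The optional `onWarning` callback is a hypothetical stand-in for the project's logger and is not part of the planned signature.

```csharp
using System;

// Stand-in for Opc.Ua.RedundancySupport so the sketch compiles on its own;
// the real resolver would reference the SDK enum instead.
public enum RedundancySupport { None = 0, Cold = 1, Warm = 2, Hot = 3, Transparent = 4, HotAndMirrored = 5 }

public static class RedundancyModeResolver
{
    // onWarning is a hypothetical hook standing in for the project's logger.
    public static RedundancySupport Resolve(string mode, bool enabled, Action<string> onWarning = null)
    {
        if (!enabled)
            return RedundancySupport.None;

        // Only the Phase 1 modes are accepted; matching ignores case.
        if (string.Equals(mode, "Warm", StringComparison.OrdinalIgnoreCase))
            return RedundancySupport.Warm;
        if (string.Equals(mode, "Hot", StringComparison.OrdinalIgnoreCase))
            return RedundancySupport.Hot;

        onWarning?.Invoke("Unsupported redundancy mode '" + mode + "'; falling back to None.");
        return RedundancySupport.None;
    }
}
```

Explicit string comparisons are used instead of `Enum.TryParse` so that values such as `"Transparent"` or numeric strings cannot slip through as valid Phase 1 modes.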

### Step 3: Add ServiceLevelCalculator

**File:** `src/.../OpcUa/ServiceLevelCalculator.cs` (new)

Computes the dynamic `ServiceLevel` byte from runtime health:

```csharp
public class ServiceLevelCalculator
{
    public byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary);
}
```

Logic:
- Start with `baseLine` (`ServiceLevelBase` from config, e.g., 200)
- Subtract 50 if the role is Secondary (`isPrimary = false`), giving 150 for a healthy secondary
- Subtract 100 if MXAccess is disconnected
- Subtract 50 if Galaxy DB is unreachable
- Return 0 if both MXAccess and DB are down
- Otherwise clamp the result to 0–255
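
The logic above can be sketched as follows. This is a minimal illustration, not the final implementation. It assumes the secondary offset of 50 is applied inside the calculator, as the `isPrimary` parameter and the `FullyHealthy_Secondary_ReturnsBaseMinusFifty` test suggest. The constant names are illustrative.

```csharp
using System;

public class ServiceLevelCalculator
{
    // Penalty values follow the plan's logic list; the constant names are illustrative.
    private const int SecondaryOffset = 50;
    private const int MxAccessPenalty = 100;
    private const int DbPenalty = 50;

    public byte Calculate(int baseLine, bool mxAccessConnected, bool dbConnected, bool isPrimary)
    {
        // With both dependencies down the server cannot serve useful data.
        if (!mxAccessConnected && !dbConnected)
            return 0;

        int level = baseLine;
        if (!isPrimary) level -= SecondaryOffset;
        if (!mxAccessConnected) level -= MxAccessPenalty;
        if (!dbConnected) level -= DbPenalty;

        // Clamp into the Byte range required by the ServiceLevel node.
        return (byte)Math.Max(0, Math.Min(255, level));
    }
}
```

With the default base of 200 this yields 200 for a healthy primary, 150 for a healthy secondary, 100 for a primary that has lost MXAccess, and 50 for a secondary that has lost MXAccess. A primary with only the Galaxy DB unreachable also computes to 150, the same value as a healthy secondary, which may be worth revisiting when the penalties are tuned.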

### Step 4: Extend ConfigurationValidator for redundancy

**File:** `src/.../Configuration/ConfigurationValidator.cs`

Add validation/logging for:
- `Mode` is `Warm` or `Hot`, and `Role` is `Primary` or `Secondary`, whenever `Enabled = true`
- `ServiceLevelBase` is in the range 1–255 (validation fails otherwise)
- `ServerUris` is empty or has fewer than 2 entries while `Enabled = true` (warning only; validation still passes)
- Log the effective redundancy configuration at startup
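
A rough sketch of these checks is shown below. `RedundancyValidation` is a hypothetical helper used only for illustration; the real checks would live in `ConfigurationValidator`. The split between errors and warnings follows the test plan where it is stated (too few `ServerUris` warns, an out-of-range `ServiceLevelBase` fails). Treating an invalid `Role` as an error and an invalid `Mode` as a warning is an assumption.

```csharp
using System;
using System.Collections.Generic;

public class RedundancyConfiguration
{
    public bool Enabled { get; set; } = false;
    public string Mode { get; set; } = "Warm";
    public string Role { get; set; } = "Primary";
    public List<string> ServerUris { get; set; } = new List<string>();
    public int ServiceLevelBase { get; set; } = 200;
}

// Hypothetical helper; the real checks would live in ConfigurationValidator.
public static class RedundancyValidation
{
    // Returns true when the configuration is usable. Hard failures go to
    // errors; conditions that should only be logged go to warnings.
    public static bool Validate(RedundancyConfiguration config, List<string> errors, List<string> warnings)
    {
        if (!config.Enabled)
            return true; // nothing to check: the SDK defaults apply

        if (!IsOneOf(config.Mode, "Warm", "Hot"))
            warnings.Add("Unsupported Mode '" + config.Mode + "'; the resolver falls back to None.");
        if (!IsOneOf(config.Role, "Primary", "Secondary"))
            errors.Add("Role must be Primary or Secondary.");
        if (config.ServiceLevelBase < 1 || config.ServiceLevelBase > 255)
            errors.Add("ServiceLevelBase must be between 1 and 255.");
        if (config.ServerUris == null || config.ServerUris.Count < 2)
            warnings.Add("ServerUris should list every server in the redundant set (at least 2).");

        return errors.Count == 0;
    }

    private static bool IsOneOf(string value, params string[] allowed)
    {
        foreach (string candidate in allowed)
            if (string.Equals(value, candidate, StringComparison.OrdinalIgnoreCase))
                return true;
        return false;
    }
}
```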

### Step 5: Update LmxOpcUaServer to expose redundancy state

**File:** `src/.../OpcUa/LmxOpcUaServer.cs`

Changes:
1. Accept `RedundancyConfiguration` in the constructor
2. Override `OnServerStarted` to write redundancy nodes:
   - Set `Server/ServerRedundancy/RedundancySupport` to the resolved mode
   - Set `Server/ServerRedundancy/ServerUriArray` to the configured URIs
3. Override `SetServerState` or use a timer to update `Server/ServiceLevel` periodically based on `ServiceLevelCalculator`
4. Expose a method `UpdateServiceLevel(bool mxConnected, bool dbConnected)` that the service layer can call when health state changes

### Step 6: Update OpcUaServerHost to pass redundancy config

**File:** `src/.../OpcUa/OpcUaServerHost.cs`

Changes:
1. Accept `RedundancyConfiguration` in the constructor
2. Pass it through to `LmxOpcUaServer`
3. Log active redundancy mode at startup

### Step 7: Wire ServiceLevel updates in OpcUaService

**File:** `src/.../OpcUaService.cs`

Changes:
1. Bind redundancy config section
2. Pass redundancy config to `OpcUaServerHost`
3. Subscribe to `MxAccessClient.ConnectionStateChanged` to trigger `ServiceLevel` updates
4. After Galaxy DB health checks, trigger `ServiceLevel` updates
5. Use a periodic timer (e.g., every 5 seconds) to refresh `ServiceLevel` based on current component health
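
One way to combine the event-driven and timer-driven triggers without writing the node on every tick is sketched below. `ServiceLevelMonitor` is a hypothetical coordinator, not an existing class. The `compute` delegate stands in for `ServiceLevelCalculator`, and `publish` stands in for `LmxOpcUaServer.UpdateServiceLevel` or whatever ultimately writes `Server/ServiceLevel`.

```csharp
using System;

// Hypothetical coordinator: caches the latest health flags and pushes a new
// ServiceLevel only when the computed value actually changes.
public sealed class ServiceLevelMonitor
{
    private readonly object _gate = new object();
    private readonly Func<bool, bool, byte> _compute;
    private readonly Action<byte> _publish;
    private bool _mxConnected;
    private bool _dbConnected;
    private byte? _lastPublished;

    public ServiceLevelMonitor(Func<bool, bool, byte> compute, Action<byte> publish)
    {
        _compute = compute;
        _publish = publish;
    }

    // Intended to be called from the MXAccess connection-state handler.
    public void OnMxAccessStateChanged(bool connected)
    {
        lock (_gate) { _mxConnected = connected; Refresh(); }
    }

    // Intended to be called after each Galaxy DB health check.
    public void OnDbHealthChecked(bool reachable)
    {
        lock (_gate) { _dbConnected = reachable; Refresh(); }
    }

    // Also suitable as the periodic timer callback.
    public void RefreshNow()
    {
        lock (_gate) { Refresh(); }
    }

    private void Refresh()
    {
        byte level = _compute(_mxConnected, _dbConnected);
        if (_lastPublished == level) return; // skip redundant writes
        _lastPublished = level;
        _publish(level);
    }
}
```

Serializing all three entry points behind one lock keeps concurrent event and timer callbacks from publishing out of order. The sketch publishes while holding the lock for simplicity, so the `publish` callback should be short.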

### Step 8: Update appsettings.json

**File:** `src/.../appsettings.json`

Add the `Redundancy` section with backward-compatible defaults (`Enabled: false`).

### Step 9: Update OpcUaServiceBuilder for test injection

**File:** `src/.../OpcUaServiceBuilder.cs`

Add `WithRedundancy(RedundancyConfiguration)` builder method so tests can inject redundancy configuration.

### Step 10: Add CLI `redundancy` command

**Files:**
- `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` (new)

Command: `redundancy`

Reads from the target server:
- `Server/ServerRedundancy/RedundancySupport` (i=3709)
- `Server/ServiceLevel` (i=2267)
- `Server/ServerRedundancy/ServerUriArray` (i=11314, if non-transparent redundancy)

Output format:
```
Redundancy Mode:  Warm
Service Level:    200
Server URIs:
  - urn:ZB:LmxOpcUa
  - urn:ZB:LmxOpcUa2
```

Options: `--url`, `--username`, `--password`, `--security` (same shared options as other commands).
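
The output format above could be produced by a small formatter along these lines. `RedundancyReportFormatter` is a hypothetical name; the sketch assumes the command has already read the three node values and only shows how they are laid out.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical formatter; the values would come from reading the three nodes above.
public static class RedundancyReportFormatter
{
    public static string Format(string mode, byte serviceLevel, IReadOnlyList<string> serverUris)
    {
        var sb = new StringBuilder();
        sb.Append("Redundancy Mode:  ").Append(mode).Append('\n');
        sb.Append("Service Level:    ").Append(serviceLevel).Append('\n');

        // ServerUriArray is only meaningful for non-transparent modes,
        // so the section is skipped when nothing was read.
        if (serverUris != null && serverUris.Count > 0)
        {
            sb.Append("Server URIs:\n");
            foreach (string uri in serverUris)
                sb.Append("  - ").Append(uri).Append('\n');
        }
        return sb.ToString();
    }
}
```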

### Step 11: Deploy second service instance

**Deployment target:** `C:\publish\lmxopcua\instance2`

Configuration differences from instance1:

| Setting | instance1 | instance2 |
|---|---|---|
| `OpcUa.Port` | `4840` | `4841` |
| `OpcUa.ServerName` | `LmxOpcUa` | `LmxOpcUa2` |
| `Dashboard.Port` | `8083` | `8084` |
| `Redundancy.Enabled` | `true` | `true` |
| `Redundancy.Role` | `Primary` | `Secondary` |
| `Redundancy.Mode` | `Warm` | `Warm` |
| `Redundancy.ServerUris` | `["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"]` | `["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"]` |
| `Redundancy.ServiceLevelBase` | `200` | `200` |
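
As a rough guide, the instance2 column could translate into an `appsettings.json` fragment like the one below. It assumes the dotted setting names in the table map to nested JSON sections (`OpcUa`, `Dashboard`, `Redundancy`). Every setting not listed in the table is omitted and would stay as in instance1.

```json
{
  "OpcUa": {
    "Port": 4841,
    "ServerName": "LmxOpcUa2"
  },
  "Dashboard": {
    "Port": 8084
  },
  "Redundancy": {
    "Enabled": true,
    "Mode": "Warm",
    "Role": "Secondary",
    "ServerUris": ["urn:ZB:LmxOpcUa", "urn:ZB:LmxOpcUa2"],
    "ServiceLevelBase": 200
  }
}
```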

Windows service for instance2:
- Name: `LmxOpcUa2`
- Display name: `LMX OPC UA Server (Instance 2)`
- Executable: `C:\publish\lmxopcua\instance2\ZB.MOM.WW.LmxOpcUa.Host.exe`

Both instances share the same Galaxy DB (`ZB`) and MXAccess runtime. The `GalaxyName` remains `ZB` for both so they expose the same namespace.

Update `service_info.md` with the second instance details.

---

## Test Plan

### Unit tests — RedundancyModeResolver

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyModeResolverTests.cs`

| Test | Description |
|---|---|
| `Resolve_Disabled_ReturnsNone` | `Enabled=false` always returns `RedundancySupport.None` |
| `Resolve_Warm_ReturnsWarm` | `Mode="Warm"` maps to `RedundancySupport.Warm` |
| `Resolve_Hot_ReturnsHot` | `Mode="Hot"` maps to `RedundancySupport.Hot` |
| `Resolve_Unknown_FallsBackToNone` | Unknown mode falls back safely |
| `Resolve_CaseInsensitive` | `"warm"` and `"WARM"` both resolve |

### Unit tests — ServiceLevelCalculator

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/ServiceLevelCalculatorTests.cs`

| Test | Description |
|---|---|
| `FullyHealthy_Primary_ReturnsBase` | All healthy, primary role → `ServiceLevelBase` |
| `FullyHealthy_Secondary_ReturnsBaseMinusFifty` | All healthy, secondary role → `ServiceLevelBase - 50` |
| `MxAccessDown_ReducesServiceLevel` | MXAccess disconnected subtracts 100 |
| `DbDown_ReducesServiceLevel` | DB unreachable subtracts 50 |
| `BothDown_ReturnsZero` | MXAccess + DB both down → 0 |
| `ClampedTo255` | Base of 255 with healthy → 255 |
| `ClampedToZero` | Heavy penalties don't go negative |

### Unit tests — RedundancyConfiguration defaults

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Redundancy/RedundancyConfigurationTests.cs`

| Test | Description |
|---|---|
| `DefaultConfig_Disabled` | `Enabled` defaults to `false` |
| `DefaultConfig_ModeWarm` | `Mode` defaults to `"Warm"` |
| `DefaultConfig_RolePrimary` | `Role` defaults to `"Primary"` |
| `DefaultConfig_EmptyServerUris` | `ServerUris` defaults to empty |
| `DefaultConfig_ServiceLevelBase200` | `ServiceLevelBase` defaults to `200` |

### Updates to existing configuration tests

**File:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Configuration/ConfigurationLoadingTests.cs`

Add:
- `Redundancy_Section_BindsCorrectly` — verify binding from appsettings.json
- `Redundancy_Section_BindsCustomValues` — in-memory override test
- `Validator_RedundancyEnabled_EmptyServerUris_ReturnsTrue_WithWarning` — validates but warns
- `Validator_RedundancyEnabled_InvalidServiceLevelBase_ReturnsFalse` — rejects 0 or >255

### Integration tests — redundancy E2E

**New file:** `tests/ZB.MOM.WW.LmxOpcUa.Tests/Integration/RedundancyTests.cs`

These tests start two in-process OPC UA servers with redundancy enabled and verify client-visible behavior:

| Test | Description |
|---|---|
| `Server_WithRedundancyDisabled_ReportsNone` | Default config → `RedundancySupport.None`, `ServiceLevel=255` |
| `Server_WithRedundancyEnabled_ReportsConfiguredMode` | `Enabled=true, Mode=Warm` → `RedundancySupport.Warm` |
| `Server_WithRedundancyEnabled_ExposesServerUriArray` | Client can read `ServerUriArray` and it matches config |
| `Server_Primary_HasHigherServiceLevel_ThanSecondary` | Primary server reports higher `ServiceLevel` than secondary |
| `TwoServers_BothExposeSameRedundantSet` | Two server fixtures, both report the same `ServerUriArray` |
| `Server_ServiceLevel_DropsWith_MxAccessDisconnect` | Simulate MXAccess disconnect → `ServiceLevel` decreases |

Pattern: Use `OpcUaServerFixture.WithFakeMxAccessClient()` with redundancy config injected, connect with `OpcUaTestClient`, read the standard OPC UA redundancy nodes.

---

## Documentation Plan

### New file: `docs/Redundancy.md`

Contents:
1. Overview of OPC UA non-transparent redundancy
2. Redundancy configuration section reference (`Enabled`, `Mode`, `Role`, `ServerUris`, `ServiceLevelBase`)
3. ServiceLevel computation logic and degraded-state penalties
4. How clients discover and fail over between instances
5. Deployment guide for a two-instance redundant pair (ports, service names, shared Galaxy DB)
6. CLI `redundancy` command usage
7. Troubleshooting: mismatched `ServerUris`, ServiceLevel stuck at 0, etc.

### Updates to existing docs

| File | Changes |
|---|---|
| `docs/Configuration.md` | Add `Redundancy` section table, example JSON, add to validation rules list, update example appsettings.json |
| `docs/OpcUaServer.md` | Add redundancy state exposure section, link to `Redundancy.md` |
| `docs/CliTool.md` | Add `redundancy` command documentation |
| `docs/ServiceHosting.md` | Add multi-instance deployment notes |
| `README.md` | Add `Redundancy` to the component documentation table, mention redundancy in Quick Start |
| `CLAUDE.md` | Add redundancy architecture note |

### Update: `service_info.md`

Add a second section documenting `instance2`:
- Path: `C:\publish\lmxopcua\instance2`
- Windows service name: `LmxOpcUa2`
- Port: `4841`
- Dashboard port: `8084`
- Redundancy role: `Secondary`
- Endpoint: `opc.tcp://localhost:4841/LmxOpcUa`

---

## File Change Summary

| File | Action | Description |
|---|---|---|
| `src/.../Configuration/RedundancyConfiguration.cs` | New | Redundancy config model |
| `src/.../Configuration/AppConfiguration.cs` | Modify | Add `Redundancy` section |
| `src/.../Configuration/ConfigurationValidator.cs` | Modify | Validate/log redundancy settings |
| `src/.../OpcUa/RedundancyModeResolver.cs` | New | Mode string → `RedundancySupport` enum |
| `src/.../OpcUa/ServiceLevelCalculator.cs` | New | Dynamic ServiceLevel from health state |
| `src/.../OpcUa/LmxOpcUaServer.cs` | Modify | Expose redundancy nodes, accept ServiceLevel updates |
| `src/.../OpcUa/OpcUaServerHost.cs` | Modify | Pass redundancy config through |
| `src/.../OpcUaService.cs` | Modify | Bind redundancy config, wire ServiceLevel updates |
| `src/.../OpcUaServiceBuilder.cs` | Modify | Add `WithRedundancy()` builder |
| `src/.../appsettings.json` | Modify | Add `Redundancy` section |
| `tools/opcuacli-dotnet/Commands/RedundancyCommand.cs` | New | CLI command to read redundancy info |
| `tests/.../Redundancy/RedundancyModeResolverTests.cs` | New | Mode resolver unit tests |
| `tests/.../Redundancy/ServiceLevelCalculatorTests.cs` | New | ServiceLevel computation tests |
| `tests/.../Redundancy/RedundancyConfigurationTests.cs` | New | Config defaults tests |
| `tests/.../Configuration/ConfigurationLoadingTests.cs` | Modify | Binding + validation tests |
| `tests/.../Integration/RedundancyTests.cs` | New | E2E two-server redundancy tests |
| `tests/.../Helpers/OpcUaServerFixture.cs` | Modify | Accept redundancy config |
| `docs/Redundancy.md` | New | Dedicated redundancy component doc |
| `docs/Configuration.md` | Modify | Add Redundancy section |
| `docs/OpcUaServer.md` | Modify | Add redundancy state section |
| `docs/CliTool.md` | Modify | Add redundancy command |
| `docs/ServiceHosting.md` | Modify | Multi-instance notes |
| `README.md` | Modify | Add Redundancy to component table |
| `CLAUDE.md` | Modify | Add redundancy architecture note |
| `service_info.md` | Modify | Add instance2 details |

---

## Verification Guardrails

Each step must pass these gates before proceeding to the next:

### Gate 1: Build (after each implementation step)
```bash
dotnet build ZB.MOM.WW.LmxOpcUa.slnx
```
Must produce 0 errors. Proceed only when green.

### Gate 2: Unit tests (after steps 1–4, 9)
```bash
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests
```
All existing + new tests must pass. No regressions.

### Gate 3: Integration tests (after steps 5–7)
```bash
dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~Integration.RedundancyTests"
```
All redundancy E2E tests must pass.

### Gate 4: CLI tool builds (after step 10)
```bash
cd tools/opcuacli-dotnet && dotnet build
```
Must compile without errors.

### Gate 5: Manual verification — single instance (after step 8)
```bash
# Publish and start with Redundancy.Enabled=false
opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
# Should report: RedundancySupport=None, ServiceLevel=255
```

### Gate 6: Manual verification — redundant pair (after step 11)
```bash
# Start both instances
sc start LmxOpcUa
sc start LmxOpcUa2

# Verify instance1 (Primary)
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
# Should report: RedundancySupport=Warm, ServiceLevel=200, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]

# Verify instance2 (Secondary)
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa
# Should report: RedundancySupport=Warm, ServiceLevel=150, ServerUris=[urn:ZB:LmxOpcUa, urn:ZB:LmxOpcUa2]

# Both instances should serve the same Galaxy address space
opcuacli-dotnet.exe browse -u opc.tcp://localhost:4840/LmxOpcUa -r -d 2
opcuacli-dotnet.exe browse -u opc.tcp://localhost:4841/LmxOpcUa -r -d 2
```

### Gate 7: Full test suite (final)
```bash
dotnet test ZB.MOM.WW.LmxOpcUa.slnx
```
All tests across all projects must pass.

### Gate 8: Documentation review
- All new/modified doc files render correctly in Markdown
- Example JSON snippets match the actual `appsettings.json`
- CLI examples use correct flags and expected output
- `service_info.md` accurately reflects both deployed instances

---

## Risks and Considerations

1. **Backward compatibility**: `Redundancy.Enabled = false` must be the default so existing single-instance deployments are unaffected.
2. **ServiceLevel timing**: Updates must not race with OPC UA publish cycles. Use the server's internal lock or `ServerInternal` APIs.
3. **ServerUriArray immutability**: The OPC UA spec expects this to be static during a server session. Changes require a server restart.
4. **MXAccess shared state**: Both instances connect to the same MXAccess runtime. If MXAccess has per-client registration limits, verify that two clients can coexist.
5. **Galaxy DB contention**: Both instances poll for deploy changes. Ensure change detection doesn't trigger duplicate rebuilds or locking issues.
6. **Port conflicts**: The second instance must use different ports for OPC UA (4841) and Dashboard (8084).
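7. **Certificate identity**: Each instance needs its own application certificate with a distinct `SubjectName` matching its `ServerName`. The `ApplicationUri` in that certificate (e.g., `urn:ZB:LmxOpcUa2`) must also be the value listed for the instance in `ServerUris`, otherwise clients cannot match the entries in `ServerUriArray` to the servers they connect to.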

---

## Execution Order

1. Steps 1–4: Config model, resolver, calculator, validator (unit-testable in isolation)
2. **Gate 1 + Gate 2**: Build + unit tests pass
3. Steps 5–7 and 9: Server integration (redundancy nodes, ServiceLevel wiring) plus the `WithRedundancy()` builder that the E2E fixtures need
4. **Gate 1 + Gate 2 + Gate 3**: Build + all tests including E2E
5. Step 8: Update appsettings.json
6. Step 10: CLI redundancy command
7. **Gate 4**: CLI builds
8. **Gate 5**: Manual single-instance verification (relies on the CLI `redundancy` command from Step 10)
9. Step 11: Deploy second instance + update service_info.md
10. **Gate 6**: Manual two-instance verification
11. Documentation updates (all doc files)
12. **Gate 7 + Gate 8**: Full test suite + documentation review
13. Commit and push