Primary/Backup Data Connection Endpoints — Design
Date: 2026-03-22
Status: Approved
Problem
Data connections currently support a single endpoint. If that endpoint goes down, the connection retries indefinitely at 5s intervals against the same address. When redundant infrastructure exists (e.g., two LmxProxy instances, two OPC UA servers), there is no way to automatically fail over to a backup.
Design Decisions
| Decision | Choice |
|---|---|
| Failover mode | Automatic after N failed retries |
| Failback | No auto-failback; stay on active until it fails (round-robin) |
| Backup required? | Optional — single-endpoint connections work unchanged |
| Failover trigger | After configurable retry count (default 3) |
| Entity model | Separate PrimaryConfiguration and BackupConfiguration columns |
| UI approach | Two JSON text areas; backup collapsible |
| Failover logic location | DataConnectionActor (adapters stay single-endpoint) |
| Observability | Health reports + site event log entries |
Entity Model
DataConnection changes:
| Field | Type | Notes |
|---|---|---|
| PrimaryConfiguration | string? (max 4000) | Renamed from Configuration |
| BackupConfiguration | string? (max 4000) | New. Null = no backup |
| FailoverRetryCount | int (default 3) | New. Retries before switching |
Both endpoints use the same Protocol. EF Core migration renames Configuration → PrimaryConfiguration (data-preserving).
DataConnectionArtifact changes:
- ConfigurationJson → PrimaryConfigurationJson + BackupConfigurationJson
Failover State Machine
The DataConnectionActor Reconnecting state is extended:
Connected
│ disconnect detected
▼
Push bad quality to all subscribers
│
▼
Retry active endpoint (5s interval)
│ failure
▼
_consecutiveFailures++
│
├─ < FailoverRetryCount → retry same endpoint
│
├─ ≥ FailoverRetryCount AND backup exists
│ → dispose adapter, switch _activeEndpoint, reset counter
│ → create fresh adapter with other config
│ → attempt connect
│
└─ ≥ FailoverRetryCount AND no backup
→ keep retrying indefinitely (current behavior)
On successful reconnect (either endpoint):
- Reset _consecutiveFailures = 0
- ReSubscribeAll(): re-create all subscriptions on the new adapter
- Transition to Connected
- Log failover event if endpoint changed
- Report active endpoint in health metrics
Round-robin on failure: primary → backup → primary → backup...
Adapter lifecycle on failover: Actor disposes current IDataConnection adapter and creates a fresh one via DataConnectionFactory.Create() with the other endpoint's config. Clean slate — no stale state.
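The counting and switching rules above can be sketched in a few lines. This is a language-neutral illustration in Python, not the DataConnectionActor code; the names FailoverState, Endpoint, on_connect_failed, and on_connected are hypothetical, and adapter disposal, re-subscription, and logging are left out.

```python
from dataclasses import dataclass
from enum import Enum


class Endpoint(Enum):
    PRIMARY = "Primary"
    BACKUP = "Backup"


@dataclass
class FailoverState:
    """Hypothetical stand-in for the actor's failover fields."""
    has_backup: bool
    failover_retry_count: int = 3
    active: Endpoint = Endpoint.PRIMARY
    consecutive_failures: int = 0

    def on_connect_failed(self) -> bool:
        """Record one failed attempt; return True if the endpoint was switched."""
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failover_retry_count:
            return False  # below the threshold: retry the same endpoint
        if not self.has_backup:
            return False  # single endpoint: keep retrying indefinitely
        # Round-robin to the other endpoint and start counting afresh.
        self.active = (Endpoint.BACKUP if self.active is Endpoint.PRIMARY
                       else Endpoint.PRIMARY)
        self.consecutive_failures = 0
        return True

    def on_connected(self) -> None:
        """A successful reconnect on either endpoint resets the counter."""
        self.consecutive_failures = 0
```

With the default retry count of 3, the third consecutive failure triggers the switch, and a further three failures on the backup switch back to the primary. A state created with has_backup=False never switches, which matches the current single-endpoint behavior.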
Actor State
New fields in DataConnectionActor:
- IDictionary<string, string> _primaryConfig
- IDictionary<string, string>? _backupConfig
- ActiveEndpoint _activeEndpoint (enum: Primary, Backup)
- int _consecutiveFailures
- int _failoverRetryCount
CreateConnectionCommand gains: primaryConfig, backupConfig, failoverRetryCount.
DataConnectionFactory is unchanged — still creates single-endpoint adapters.
Health & Observability
DataConnectionHealthReport gains:
- ActiveEndpoint (string): "Primary", "Backup", or "Primary (no backup)"
Site event log entries:
- DataConnectionFailover: connection name, from-endpoint, to-endpoint, reason
- DataConnectionRestored: connection name, active endpoint
Uses existing ISiteEventLogger.
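One plausible way to derive the ActiveEndpoint string is shown below as a Python illustration. The function name active_endpoint_label is hypothetical, and the sketch assumes that "Primary (no backup)" is reported whenever no backup configuration is set; the design does not spell out that rule.

```python
from typing import Optional


def active_endpoint_label(active: str, backup_config: Optional[str]) -> str:
    """Return the health-report label for the active endpoint.

    `active` is "Primary" or "Backup". Assumption: a connection with no
    backup configuration always reports "Primary (no backup)".
    """
    if backup_config is None:
        return "Primary (no backup)"
    return active
```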
Central UI
List page: Add Active Endpoint column from health reports.
Form (Create/Edit):
- "Primary Endpoint Configuration" label (renamed from "Configuration")
- "Add Backup Endpoint" button reveals second JSON text area
- "Remove Backup" button in edit mode when backup exists
- "Failover Retry Count" numeric input (default 3, min 1, max 20) — visible only when backup configured
- Vertical stacking, collapsible backup subsection
CLI
- --configuration renamed to --primary-config (hidden alias for backwards compat)
- --backup-config (optional)
- --failover-retry-count (optional, default 3)
- data-connection get shows both configs and active endpoint
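The hidden-alias idea can be illustrated with Python's argparse. This is only a sketch of the option shape, not the project's CLI code; the function name build_parser and the prog string are made up for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the renamed option and its hidden legacy alias."""
    parser = argparse.ArgumentParser(prog="data-connection-create")
    parser.add_argument("--primary-config", dest="primary_config")
    # Legacy spelling: still accepted and stored in the same destination,
    # but suppressed from the help and usage output.
    parser.add_argument("--configuration", dest="primary_config",
                        help=argparse.SUPPRESS)
    parser.add_argument("--backup-config", dest="backup_config", default=None)
    parser.add_argument("--failover-retry-count", dest="failover_retry_count",
                        type=int, default=3)
    return parser
```

Existing scripts that pass --configuration keep working because both spellings write to the same destination, while new users only see --primary-config in the help text.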
Management API
- CreateDataConnectionCommand / UpdateDataConnectionCommand gain PrimaryConfiguration, BackupConfiguration, FailoverRetryCount
- Setting BackupConfiguration to null removes the backup
- GetDataConnectionResponse returns both configs
Deployment Flow
DataConnectionArtifact carries PrimaryConfigurationJson and BackupConfigurationJson. Site-side deployment handler passes both to CreateConnectionCommand.
Testing
Unit tests:
- Actor: failover after N failures, round-robin, single-endpoint retries forever, counter reset, ReSubscribeAll on failover
- Manager actor: updated CreateConnectionCommand
- Factory: unchanged registration
Integration test (manual with test infra):
- Primary = opc.tcp://localhost:50000, backup = opc.tcp://localhost:50010
- Subscribe to Motor.Speed
- docker compose stop opcua → verify failover to opcua2 after 3 retries
- docker compose stop opcua2 && docker compose start opcua → verify round-robin back
Implementation Tasks
- #4 Entity model & database (foundation)
- #6 CreateConnectionCommand & DataConnectionManagerActor (blocked by #4)
- #5 DataConnectionActor failover state machine (blocked by #4, #6)
- #7 Health reporting & site event log (blocked by #5)
- #8 Central UI (blocked by #4)
- #9 CLI, Management API, deployment (blocked by #4)
- #10 Documentation (blocked by #5)
- #11 Tests (blocked by #5)