Joseph Doherty 6267ff882c docs(dcl): add primary/backup data connection endpoints design

Covers entity model, failover state machine, health reporting,
UI/CLI changes, and deployment flow for optional backup endpoints
with automatic failover after configurable retry count.

2026-03-22 08:09:25 -04:00

Primary/Backup Data Connection Endpoints — Design

Date: 2026-03-22
Status: Approved

Problem

Data connections currently support a single endpoint. If that endpoint goes down, the connection retries indefinitely at 5s intervals against the same address. When redundant infrastructure exists (e.g., two LmxProxy instances, two OPC UA servers), there is no way to automatically fail over to a backup.

Design Decisions

Decision	Choice
Failover mode	Automatic after N failed retries
Failback	No auto-failback; stay on the active endpoint until it fails, then switch to the other one (round-robin)
Backup required?	Optional — single-endpoint connections work unchanged
Failover trigger	After configurable retry count (default 3)
Entity model	Separate `PrimaryConfiguration` and `BackupConfiguration` columns
UI approach	Two JSON text areas; backup collapsible
Failover logic location	DataConnectionActor (adapters stay single-endpoint)
Observability	Health reports + site event log entries

Entity Model

DataConnection changes:

Field	Type	Notes
`PrimaryConfiguration`	string? (max 4000)	Renamed from `Configuration`
`BackupConfiguration`	string? (max 4000)	New. Null = no backup
`FailoverRetryCount`	int (default 3)	New. Retries before switching

Both endpoints use the same Protocol. An EF Core migration renames the Configuration column to PrimaryConfiguration, preserving existing data.

DataConnectionArtifact changes:

ConfigurationJson → PrimaryConfigurationJson + BackupConfigurationJson
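
The field table above can be summarized in a small sketch. This is an illustrative Python stand-in for the entity, not the project's actual class: the `name` and `protocol` fields, the validation in `__post_init__`, and the `has_backup` helper are assumptions added for the example.

```python
from dataclasses import dataclass
from typing import Optional

MAX_CONFIG_LENGTH = 4000  # from the field table: string? (max 4000)


@dataclass
class DataConnection:
    """Illustrative stand-in for the entity; only the relevant fields are shown."""

    name: str       # assumed field, used here only for readability
    protocol: str   # both endpoints share the same protocol
    primary_configuration: Optional[str] = None  # renamed from Configuration
    backup_configuration: Optional[str] = None   # None means no backup
    failover_retry_count: int = 3                # retries before switching

    def __post_init__(self) -> None:
        # Enforce the 4000-character limit on both configuration columns.
        for label, value in (
            ("primary", self.primary_configuration),
            ("backup", self.backup_configuration),
        ):
            if value is not None and len(value) > MAX_CONFIG_LENGTH:
                raise ValueError(
                    f"{label} configuration exceeds {MAX_CONFIG_LENGTH} characters"
                )

    @property
    def has_backup(self) -> bool:
        """A connection without a backup behaves as a single-endpoint connection."""
        return self.backup_configuration is not None
```

A connection created with only a primary configuration keeps the default retry count of 3 and reports `has_backup` as false, which matches the "single-endpoint connections work unchanged" decision.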

Failover State Machine

The DataConnectionActor Reconnecting state is extended:

Connected
    │ disconnect detected
    ▼
Push bad quality to all subscribers
    │
    ▼
Retry active endpoint (5s interval)
    │ failure
    ▼
_consecutiveFailures++
    │
    ├─ < FailoverRetryCount → retry same endpoint
    │
    ├─ ≥ FailoverRetryCount AND backup exists
    │     → dispose adapter, switch _activeEndpoint, reset counter
    │     → create fresh adapter with other config
    │     → attempt connect
    │
    └─ ≥ FailoverRetryCount AND no backup
          → keep retrying indefinitely (current behavior)

On successful reconnect (either endpoint):

Reset _consecutiveFailures = 0
ReSubscribeAll() — re-create all subscriptions on the new adapter
Transition to Connected
Log failover event if endpoint changed
Report active endpoint in health metrics

Round-robin on failure: primary → backup → primary → backup...

Adapter lifecycle on failover: the actor disposes the current IDataConnection adapter and creates a fresh one via DataConnectionFactory.Create() with the other endpoint's config, so no stale state carries over.
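
The counting and switching rules above can be sketched in a few lines. This is a minimal Python model of the decision logic only, under stated assumptions: `FailoverTracker` and `reconnect` are hypothetical names, the 5s retry delay and adapter disposal are omitted, and `try_connect` stands in for creating a fresh adapter and connecting with the given config.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional

Config = Dict[str, str]


class ActiveEndpoint(Enum):
    PRIMARY = "Primary"
    BACKUP = "Backup"


class FailoverTracker:
    """Tracks the active endpoint and consecutive failures (hypothetical helper)."""

    def __init__(self, primary: Config, backup: Optional[Config],
                 failover_retry_count: int = 3) -> None:
        self.primary = primary
        self.backup = backup
        self.failover_retry_count = failover_retry_count
        self.active = ActiveEndpoint.PRIMARY
        self.consecutive_failures = 0

    def active_config(self) -> Config:
        if self.active is ActiveEndpoint.BACKUP and self.backup is not None:
            return self.backup
        return self.primary

    def on_failure(self) -> bool:
        """Record one failed attempt; return True if the active endpoint switched."""
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failover_retry_count or self.backup is None:
            return False  # keep retrying the same endpoint (forever if no backup)
        # Threshold reached and a backup exists: switch and reset the counter.
        self.active = (ActiveEndpoint.BACKUP
                       if self.active is ActiveEndpoint.PRIMARY
                       else ActiveEndpoint.PRIMARY)
        self.consecutive_failures = 0
        return True

    def on_success(self) -> None:
        self.consecutive_failures = 0


def reconnect(tracker: FailoverTracker,
              try_connect: Callable[[Config], bool],
              max_attempts: int) -> List[str]:
    """Attempt to connect until success or max_attempts; return endpoints tried."""
    tried: List[str] = []
    for _ in range(max_attempts):
        tried.append(tracker.active.value)
        if try_connect(tracker.active_config()):
            tracker.on_success()
            break
        tracker.on_failure()
    return tried
```

With a retry count of 3 and a primary that never answers, the sequence of attempts is three against the primary and then one against the backup. With no backup configured, `on_failure` never switches, so the attempts stay on the primary indefinitely.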

Actor State

New fields in DataConnectionActor:

IDictionary<string, string> _primaryConfig
IDictionary<string, string>? _backupConfig
ActiveEndpoint _activeEndpoint (enum: Primary, Backup)
int _consecutiveFailures
int _failoverRetryCount

CreateConnectionCommand gains: primaryConfig, backupConfig, failoverRetryCount.

DataConnectionFactory is unchanged — still creates single-endpoint adapters.

Health & Observability

DataConnectionHealthReport gains:

ActiveEndpoint (string): "Primary", "Backup", or "Primary (no backup)"
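
One way to derive the three possible values is sketched below. The function name and signature are hypothetical; only the three output strings come from this design.

```python
from typing import Dict, Optional


def describe_active_endpoint(active: str,
                             backup_config: Optional[Dict[str, str]]) -> str:
    """Return the ActiveEndpoint string for a health report (illustrative only)."""
    if backup_config is None:
        # Without a backup the connection can only ever be on the primary.
        return "Primary (no backup)"
    if active not in ("Primary", "Backup"):
        raise ValueError(f"unknown endpoint: {active!r}")
    return active
```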

Site event log entries:

DataConnectionFailover — connection name, from-endpoint, to-endpoint, reason
DataConnectionRestored — connection name, active endpoint

Uses existing ISiteEventLogger.

Central UI

List page: Add Active Endpoint column from health reports.

Form (Create/Edit):

"Primary Endpoint Configuration" label (renamed from "Configuration")
"Add Backup Endpoint" button reveals second JSON text area
"Remove Backup" button in edit mode when backup exists
"Failover Retry Count" numeric input (default 3, min 1, max 20) — visible only when backup configured
Vertical stacking, collapsible backup subsection

CLI

--configuration renamed to --primary-config (--configuration kept as a hidden alias for backwards compatibility)
--backup-config (optional)
--failover-retry-count (optional, default 3)
data-connection get shows both configs and active endpoint
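
The hidden-alias behavior can be illustrated with Python's `argparse`. This is a sketch of the idea only, not the project's CLI code: the flag names and defaults come from the list above, while the parser layout and help texts are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="data-connection")
    parser.add_argument("--primary-config", dest="primary_config",
                        help="Primary endpoint settings (JSON)")
    # Old flag name kept as a hidden alias: it writes to the same destination
    # but is suppressed from the help output.
    parser.add_argument("--configuration", dest="primary_config",
                        help=argparse.SUPPRESS)
    parser.add_argument("--backup-config", dest="backup_config", default=None,
                        help="Optional backup endpoint settings (JSON)")
    parser.add_argument("--failover-retry-count", dest="failover_retry_count",
                        type=int, default=3,
                        help="Retries before switching endpoints")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--configuration", "{}"])
    print(args.primary_config, args.backup_config, args.failover_retry_count)
```

Existing scripts that pass the old flag keep working, while new users only see `--primary-config` in the help text.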

Management API

CreateDataConnectionCommand / UpdateDataConnectionCommand gain PrimaryConfiguration, BackupConfiguration, FailoverRetryCount
Setting BackupConfiguration to null removes the backup
GetDataConnectionResponse returns both configs

Deployment Flow

DataConnectionArtifact carries PrimaryConfigurationJson and BackupConfigurationJson. Site-side deployment handler passes both to CreateConnectionCommand.
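
A rough sketch of that hand-off follows. It assumes each configuration JSON is a flat object of string values (to match the `IDictionary<string, string>` actor fields) and that the retry count is supplied alongside the artifact; neither detail is specified here, and the Python class shapes and the `to_command` helper are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DataConnectionArtifact:
    name: str
    primary_configuration_json: str
    backup_configuration_json: Optional[str] = None  # None means no backup


@dataclass
class CreateConnectionCommand:
    name: str
    primary_config: Dict[str, str]
    backup_config: Optional[Dict[str, str]]
    failover_retry_count: int


def to_command(artifact: DataConnectionArtifact,
               failover_retry_count: int = 3) -> CreateConnectionCommand:
    """Parse both JSON payloads and build the command (assumes flat JSON objects)."""
    backup = (json.loads(artifact.backup_configuration_json)
              if artifact.backup_configuration_json is not None else None)
    return CreateConnectionCommand(
        name=artifact.name,
        primary_config=json.loads(artifact.primary_configuration_json),
        backup_config=backup,
        failover_retry_count=failover_retry_count,
    )
```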

Testing

Unit tests:

Actor: failover after N failures, round-robin, single-endpoint retries forever, counter reset, ReSubscribeAll on failover
Manager actor: updated CreateConnectionCommand
Factory: unchanged registration

Integration test (manual with test infra):

Primary=opc.tcp://localhost:50000, backup=opc.tcp://localhost:50010
Subscribe to Motor.Speed
docker compose stop opcua → verify failover to opcua2 after 3 retries
docker compose stop opcua2 && docker compose start opcua → verify round-robin back

Implementation Tasks

#4 Entity model & database (foundation)
#6 CreateConnectionCommand & DataConnectionManagerActor (blocked by #4)
#5 DataConnectionActor failover state machine (blocked by #4, #6)
#7 Health reporting & site event log (blocked by #5)
#8 Central UI (blocked by #4)
#9 CLI, Management API, deployment (blocked by #4)
#10 Documentation (blocked by #5)
#11 Tests (blocked by #5)
