ScadaBridge/docs/plans/2026-03-22-primary-backup-data-connections-design.md
Joseph Doherty 43228185b4 docs: convert standard diagrams from draw.io PNGs to inline Mermaid
Gitea renders mermaid inline, so the flow/state/hierarchy/DAG diagrams
move to text-in-markdown: auto-layout (removes the manual overlap-prone
draw.io step), diffable source, no committed binaries, and a dark-text
theme so labels stay legible. Keep draw.io PNGs only for the two complex
bespoke diagrams (logical architecture, env2 topology) where pixel
control still wins. All 24 mermaid blocks validated by rendering.
2026-06-01 00:23:00 -04:00

Primary/Backup Data Connection Endpoints — Design

Date: 2026-03-22

Status: Approved

Problem

Data connections currently support a single endpoint. If that endpoint goes down, the connection retries indefinitely at 5s intervals against the same address. When redundant infrastructure exists (e.g., two OPC UA servers), there is no way to automatically fail over to a backup.

Design Decisions

| Decision | Choice |
| --- | --- |
| Failover mode | Automatic after N failed retries |
| Failback | No auto-failback; stay on active until it fails (round-robin) |
| Backup required? | Optional; single-endpoint connections work unchanged |
| Failover trigger | After configurable retry count (default 3) |
| Entity model | Separate PrimaryConfiguration and BackupConfiguration columns |
| UI approach | Two JSON text areas; backup collapsible |
| Failover logic location | DataConnectionActor (adapters stay single-endpoint) |
| Observability | Health reports + site event log entries |

Entity Model

DataConnection changes:

| Field | Type | Notes |
| --- | --- | --- |
| PrimaryConfiguration | string? (max 4000) | Renamed from Configuration |
| BackupConfiguration | string? (max 4000) | New. Null = no backup |
| FailoverRetryCount | int (default 3) | New. Retries before switching |

Both endpoints use the same Protocol. The EF Core migration renames Configuration → PrimaryConfiguration (data-preserving).
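
As a rough illustration, the entity after the change might look like the sketch below. Only PrimaryConfiguration, BackupConfiguration, and FailoverRetryCount come from this design; Id, Name, and the exact shape of Protocol are placeholders for the existing columns, which are not listed here. In EF Core the rename would normally be expressed with `migrationBuilder.RenameColumn(...)` rather than a drop-and-add, which is what keeps existing rows intact.

```csharp
#nullable enable
using System.ComponentModel.DataAnnotations;

// Hypothetical sketch of the DataConnection entity after the change.
// Id, Name, and Protocol stand in for the existing columns (assumed shapes).
public class DataConnection
{
    public int Id { get; set; }
    public string Name { get; set; } = string.Empty;
    public string Protocol { get; set; } = string.Empty;

    // Renamed from Configuration.
    [MaxLength(4000)]
    public string? PrimaryConfiguration { get; set; }

    // New. Null means no backup endpoint is configured.
    [MaxLength(4000)]
    public string? BackupConfiguration { get; set; }

    // New. Failed retries on the active endpoint before switching.
    public int FailoverRetryCount { get; set; } = 3;
}
```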

DataConnectionArtifact changes:

  • ConfigurationJson → PrimaryConfigurationJson + BackupConfigurationJson

Failover State Machine

The DataConnectionActor Reconnecting state is extended:

%%{init: {'theme':'base', 'themeVariables': {'textColor':'#111111','lineColor':'#555555','edgeLabelBackground':'#ffffff','fontSize':'15px'}}}%%
flowchart TD
    C(["Connected"])
    BQ["Push bad quality<br/>to all subscribers"]
    RT["Retry active endpoint<br/>(5s interval)"]
    INC["_consecutiveFailures++"]
    BR{"Evaluate<br/>_consecutiveFailures"}
    SAME["Retry same endpoint"]
    FO["Failover<br/>- dispose adapter, switch _activeEndpoint, reset counter<br/>- create fresh adapter with other config<br/>- attempt connect"]
    NB["Keep retrying indefinitely<br/>(current behavior)"]
    RC(["On successful reconnect (either endpoint)<br/>1. Reset _consecutiveFailures = 0<br/>2. ReSubscribeAll() — re-create subscriptions on new adapter<br/>3. Transition to Connected<br/>4. Log failover event if endpoint changed<br/>5. Report active endpoint in health metrics"])

    C -->|"disconnect detected"| BQ
    BQ --> RT
    RT -->|"failure"| INC
    INC --> BR
    BR -->|"&lt; FailoverRetryCount"| SAME
    SAME -.->|"retry"| RT
    BR -->|"&gt;= FailoverRetryCount AND backup exists"| FO
    BR -->|"&gt;= FailoverRetryCount AND no backup"| NB
    NB -.->|"retry (round-robin n/a)"| RT
    FO -->|"connect succeeds"| RC
    FO -.->|"connect fails (round-robin: primary to backup to primary...)"| RT
    RC -->|"Transition to Connected"| C

    classDef start fill:#d5e8d4,stroke:#82b366,color:#111111;
    classDef proc fill:#dae8fc,stroke:#6c8ebf,color:#111111;
    classDef dec fill:#fff2cc,stroke:#d6b656,color:#111111;
    classDef warn fill:#ffe6cc,stroke:#d79b00,color:#111111;
    classDef bad fill:#f8cecc,stroke:#b85450,color:#111111;
    class C,RC start
    class BQ,RT,SAME proc
    class INC,BR dec
    class FO warn
    class NB bad

On successful reconnect (either endpoint):

  1. Reset _consecutiveFailures = 0
  2. ReSubscribeAll() — re-create all subscriptions on the new adapter
  3. Transition to Connected
  4. Log failover event if endpoint changed
  5. Report active endpoint in health metrics

Round-robin on failure: primary → backup → primary → backup...

Adapter lifecycle on failover: the actor disposes the current IDataConnection adapter and creates a fresh one via DataConnectionFactory.Create() with the other endpoint's config. The new adapter starts from a clean slate, with no stale state carried over.
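
The counting and round-robin rules can be sketched in isolation from the actor. `FailoverTracker` and `RetryDecision` below are hypothetical names used only for illustration; in this design the same state lives directly in DataConnectionActor as `_consecutiveFailures` and `_activeEndpoint`, and adapter disposal and creation are left to the caller.

```csharp
#nullable enable
using System;

public enum ActiveEndpoint { Primary, Backup }

public enum RetryDecision { RetrySameEndpoint, FailOver }

// Hypothetical helper that holds only the counting logic from the flowchart.
public sealed class FailoverTracker
{
    private readonly bool _hasBackup;
    private readonly int _failoverRetryCount;
    private int _consecutiveFailures;

    public FailoverTracker(bool hasBackup, int failoverRetryCount = 3)
    {
        if (failoverRetryCount < 1)
            throw new ArgumentOutOfRangeException(nameof(failoverRetryCount));
        _hasBackup = hasBackup;
        _failoverRetryCount = failoverRetryCount;
    }

    public ActiveEndpoint Active { get; private set; } = ActiveEndpoint.Primary;

    // Call after each failed connect attempt on the active endpoint.
    public RetryDecision RecordFailure()
    {
        _consecutiveFailures++;

        // Below the threshold, or no backup configured: keep retrying the same endpoint.
        if (_consecutiveFailures < _failoverRetryCount || !_hasBackup)
            return RetryDecision.RetrySameEndpoint;

        // Threshold reached and a backup exists: switch endpoint and reset the counter.
        Active = Active == ActiveEndpoint.Primary
            ? ActiveEndpoint.Backup
            : ActiveEndpoint.Primary;
        _consecutiveFailures = 0;
        return RetryDecision.FailOver;
    }

    // Call after a successful connect on either endpoint.
    public void RecordSuccess() => _consecutiveFailures = 0;
}
```

With a backup configured and the default count of 3, the third consecutive failure returns `FailOver` and flips `Active`; three more failures flip it back, which gives the primary → backup → primary sequence. Without a backup, every call returns `RetrySameEndpoint`.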

Actor State

New fields in DataConnectionActor:

  • IDictionary<string, string> _primaryConfig
  • IDictionary<string, string>? _backupConfig
  • ActiveEndpoint _activeEndpoint (enum: Primary, Backup)
  • int _consecutiveFailures
  • int _failoverRetryCount

CreateConnectionCommand gains: primaryConfig, backupConfig, failoverRetryCount.

DataConnectionFactory is unchanged — still creates single-endpoint adapters.
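
One possible shape for the extended command, and for picking the config handed to the next adapter, is sketched below. The record form, the `ConnectionName` member, and the `EndpointConfig.For` helper are assumptions for illustration; the design only states that the command gains primaryConfig, backupConfig, and failoverRetryCount.

```csharp
#nullable enable
using System.Collections.Generic;

public enum ActiveEndpoint { Primary, Backup }

// Hypothetical shape. The real CreateConnectionCommand has other members
// that this design does not list; ConnectionName is a placeholder.
public sealed record CreateConnectionCommand(
    string ConnectionName,
    IDictionary<string, string> PrimaryConfig,
    IDictionary<string, string>? BackupConfig,
    int FailoverRetryCount = 3);

public static class EndpointConfig
{
    // Returns the config the next single-endpoint adapter should be created with.
    // Falls back to the primary config when no backup is configured.
    public static IDictionary<string, string> For(
        CreateConnectionCommand cmd, ActiveEndpoint active) =>
        active == ActiveEndpoint.Backup && cmd.BackupConfig is not null
            ? cmd.BackupConfig
            : cmd.PrimaryConfig;
}
```

Because the factory only ever receives one dictionary, it needs no knowledge of failover, which matches the decision to keep adapters single-endpoint.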

Health & Observability

DataConnectionHealthReport gains:

  • ActiveEndpoint (string): "Primary", "Backup", or "Primary (no backup)"

Site event log entries:

  • DataConnectionFailover — connection name, from-endpoint, to-endpoint, reason
  • DataConnectionRestored — connection name, active endpoint

Both entries are written through the existing ISiteEventLogger.
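
The three ActiveEndpoint strings could be derived from actor state roughly as follows. `HealthLabels.Describe` is a hypothetical helper, not part of the existing code; only the three string values come from this design.

```csharp
public enum ActiveEndpoint { Primary, Backup }

public static class HealthLabels
{
    // Maps actor state to the strings listed for DataConnectionHealthReport.ActiveEndpoint.
    // The (Backup, hasBackup: false) combination should not occur in practice.
    public static string Describe(ActiveEndpoint active, bool hasBackup) =>
        (active, hasBackup) switch
        {
            (ActiveEndpoint.Backup, _) => "Backup",
            (ActiveEndpoint.Primary, true) => "Primary",
            _ => "Primary (no backup)",
        };
}
```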

Central UI

List page: Add Active Endpoint column from health reports.

Form (Create/Edit):

  • "Primary Endpoint Configuration" label (renamed from "Configuration")
  • "Add Backup Endpoint" button reveals second JSON text area
  • "Remove Backup" button in edit mode when backup exists
  • "Failover Retry Count" numeric input (default 3, min 1, max 20) — visible only when backup configured
  • Vertical stacking, collapsible backup subsection

CLI

  • --configuration renamed to --primary-config (old name kept as a hidden alias for backwards compatibility)
  • --backup-config (optional)
  • --failover-retry-count (optional, default 3)
  • data-connection get shows both configs and active endpoint

Management API

  • CreateDataConnectionCommand / UpdateDataConnectionCommand gain PrimaryConfiguration, BackupConfiguration, FailoverRetryCount
  • Setting BackupConfiguration to null removes the backup
  • GetDataConnectionResponse returns both configs
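
A minimal sketch of how an update might be applied, assuming the handler copies the command onto the entity. The record and entity shapes are hypothetical and trimmed to the three new fields, and enforcing the 1 to 20 range on the API side (the bounds come from the Central UI form) is an assumption rather than something this design specifies.

```csharp
#nullable enable
using System;

// Hypothetical, trimmed-down shapes: only the three fields from this design.
public sealed record UpdateDataConnectionCommand(
    string? PrimaryConfiguration,
    string? BackupConfiguration,
    int FailoverRetryCount = 3);

public sealed class DataConnection
{
    public string? PrimaryConfiguration { get; set; }
    public string? BackupConfiguration { get; set; }
    public int FailoverRetryCount { get; set; } = 3;
}

public static class DataConnectionUpdater
{
    public static void Apply(DataConnection entity, UpdateDataConnectionCommand cmd)
    {
        // Assumption: the API enforces the same bounds as the UI form.
        if (cmd.FailoverRetryCount < 1 || cmd.FailoverRetryCount > 20)
            throw new ArgumentOutOfRangeException(
                nameof(cmd), "FailoverRetryCount must be between 1 and 20.");

        entity.PrimaryConfiguration = cmd.PrimaryConfiguration;
        // A null BackupConfiguration removes the backup endpoint.
        entity.BackupConfiguration = cmd.BackupConfiguration;
        entity.FailoverRetryCount = cmd.FailoverRetryCount;
    }
}
```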

Deployment Flow

DataConnectionArtifact carries PrimaryConfigurationJson and BackupConfigurationJson. The site-side deployment handler passes both to CreateConnectionCommand.
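
Since the artifact carries JSON strings while the actor holds `IDictionary<string, string>` configs, the handler needs a conversion step somewhere. The sketch below shows one way to do it with System.Text.Json; `ArtifactConfig.Parse` is a hypothetical helper, and it assumes each configuration is a flat JSON object whose values are all strings.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;

public static class ArtifactConfig
{
    // Parses a flat JSON object of string values.
    // Returns null when the artifact has no JSON for that endpoint,
    // which maps naturally onto "no backup configured".
    public static IDictionary<string, string>? Parse(string? json) =>
        string.IsNullOrWhiteSpace(json)
            ? null
            : JsonSerializer.Deserialize<Dictionary<string, string>>(json);
}
```

Non-string values such as numbers or nested objects would make `Deserialize` throw a `JsonException` under this assumption, so a real handler may need a more permissive target type.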

Testing

Unit tests:

  • Actor: failover after N failures, round-robin, single-endpoint retries forever, counter reset, ReSubscribeAll on failover
  • Manager actor: updated CreateConnectionCommand
  • Factory: unchanged registration

Integration test (manual with test infra):

  1. Primary=opc.tcp://localhost:50000, backup=opc.tcp://localhost:50010
  2. Subscribe to Motor.Speed
  3. docker compose stop opcua → verify failover to opcua2 after 3 retries
  4. docker compose stop opcua2 && docker compose start opcua → verify round-robin back to opcua (primary)

Implementation Tasks

  1. #4 Entity model & database (foundation)
  2. #6 CreateConnectionCommand & DataConnectionManagerActor (blocked by #4)
  3. #5 DataConnectionActor failover state machine (blocked by #4, #6)
  4. #7 Health reporting & site event log (blocked by #5)
  5. #8 Central UI (blocked by #4)
  6. #9 CLI, Management API, deployment (blocked by #4)
  7. #10 Documentation (blocked by #5)
  8. #11 Tests (blocked by #5)