scadalink-design/docs/plans/2026-03-22-primary-backup-data-connections.md
Joseph Doherty 5ca1be328c docs(dcl): add primary/backup data connections implementation plan
8 tasks with TDD steps, exact file paths, and code samples.
Covers entity model, failover state machine, health reporting,
UI, CLI, management API, deployment, and documentation.
2026-03-22 08:13:23 -04:00


Primary/Backup Data Connection Endpoints — Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task.

Goal: Add optional backup endpoints to data connections, with automatic failover after a configurable number of failed retries.

Architecture: The DataConnectionActor gains failover logic in its Reconnecting state. After N failed retries on the active endpoint, it disposes the adapter and creates a fresh one with the other endpoint's config. Adapters remain single-endpoint. The entity model splits Configuration into PrimaryConfiguration + BackupConfiguration.

Tech Stack: C# / .NET 10, Akka.NET, EF Core, Blazor Server, System.CommandLine

Design doc: docs/plans/2026-03-22-primary-backup-data-connections-design.md


Task 1: Entity Model & Database Migration

Files:

  • Modify: src/ScadaLink.Commons/Entities/Sites/DataConnection.cs
  • Modify: src/ScadaLink.ConfigurationDatabase/Configurations/SiteConfiguration.cs (lines 32-56)
  • Modify: src/ScadaLink.Commons/Messages/Artifacts/DataConnectionArtifact.cs

Step 1: Update DataConnection entity

In DataConnection.cs, rename Configuration to PrimaryConfiguration, add BackupConfiguration and FailoverRetryCount:

public class DataConnection
{
    public int Id { get; set; }
    public int SiteId { get; set; }
    public string Name { get; set; }
    public string Protocol { get; set; }
    public string? PrimaryConfiguration { get; set; }
    public string? BackupConfiguration { get; set; }
    public int FailoverRetryCount { get; set; } = 3;

    public DataConnection(int siteId, string name, string protocol)
    {
        SiteId = siteId;
        Name = name ?? throw new ArgumentNullException(nameof(name));
        Protocol = protocol ?? throw new ArgumentNullException(nameof(protocol));
    }
}

Step 2: Update EF Core mapping

In SiteConfiguration.cs, update the DataConnection mapping (around lines 46-47):

  • Rename Configuration property mapping to PrimaryConfiguration (MaxLength 4000)
  • Add BackupConfiguration property (optional, MaxLength 4000)
  • Add FailoverRetryCount property (required, default 3)

builder.Property(d => d.PrimaryConfiguration).HasMaxLength(4000);
builder.Property(d => d.BackupConfiguration).HasMaxLength(4000);
builder.Property(d => d.FailoverRetryCount).HasDefaultValue(3);

Step 3: Create EF Core migration

Run:

cd src/ScadaLink.ConfigurationDatabase
dotnet ef migrations add AddDataConnectionBackupEndpoint \
  --startup-project ../ScadaLink.Host

Verify that the migration renames Configuration to PrimaryConfiguration (it should use RenameColumn, not drop+add, so existing data is preserved). If the scaffolded migration drops and recreates the column, fix it manually:

migrationBuilder.RenameColumn(
    name: "Configuration",
    table: "DataConnections",
    newName: "PrimaryConfiguration");

migrationBuilder.AddColumn<string>(
    name: "BackupConfiguration",
    table: "DataConnections",
    maxLength: 4000,
    nullable: true);

migrationBuilder.AddColumn<int>(
    name: "FailoverRetryCount",
    table: "DataConnections",
    nullable: false,
    defaultValue: 3);

Step 4: Update DataConnectionArtifact

In DataConnectionArtifact.cs, replace the single ConfigurationJson field with primary and backup fields, and add the retry count:

public record DataConnectionArtifact(
    string Name,
    string Protocol,
    string? PrimaryConfigurationJson,
    string? BackupConfigurationJson,
    int FailoverRetryCount = 3);

Step 5: Build and fix compile errors

Run: dotnet build ScadaLink.slnx

This will surface all references to the old Configuration and ConfigurationJson fields across the codebase. Fix each one — this includes:

  • ManagementActor handlers
  • CLI commands
  • UI pages
  • Deployment/flattening code
  • Tests

Fix only the field name renames in this step (use PrimaryConfiguration where Configuration was). Don't add backup logic yet — just make it compile.

Step 6: Run tests, fix failures

Run: dotnet test ScadaLink.slnx

Fix any test failures caused by the rename.

Step 7: Commit

git add -A
git commit -m "feat(dcl): rename Configuration to PrimaryConfiguration, add BackupConfiguration and FailoverRetryCount"

Task 2: Update CreateConnectionCommand & Manager Actor

Files:

  • Modify: src/ScadaLink.Commons/Messages/DataConnection/CreateConnectionCommand.cs
  • Modify: src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs (lines 39-62)

Step 1: Update CreateConnectionCommand message

public record CreateConnectionCommand(
    string ConnectionName,
    string ProtocolType,
    IDictionary<string, string> PrimaryConnectionDetails,
    IDictionary<string, string>? BackupConnectionDetails = null,
    int FailoverRetryCount = 3);

Step 2: Update DataConnectionManagerActor.HandleCreateConnection

Update the handler (around lines 39-62) to pass both configs to DataConnectionActor:

private void HandleCreateConnection(CreateConnectionCommand command)
{
    if (_connectionActors.ContainsKey(command.ConnectionName))
    {
        _log.Warning("Connection {0} already exists", command.ConnectionName);
        return;
    }

    var adapter = _factory.Create(command.ProtocolType, command.PrimaryConnectionDetails);

    var props = Props.Create(() => new DataConnectionActor(
        command.ConnectionName,
        adapter,
        _options,
        _healthCollector,
        command.ProtocolType,
        command.PrimaryConnectionDetails,
        command.BackupConnectionDetails,
        command.FailoverRetryCount));

    var actorName = new string(command.ConnectionName
        .Select(c => char.IsLetterOrDigit(c) || "-_.*$+:@&=,!~';()".Contains(c) ? c : '-')
        .ToArray());
    var actorRef = Context.ActorOf(props, actorName);
    _connectionActors[command.ConnectionName] = actorRef;

    _log.Info("Created DataConnectionActor for {0} (protocol={1}, backup={2})",
        command.ConnectionName, command.ProtocolType, command.BackupConnectionDetails != null ? "yes" : "none");
}

Step 3: Update all callers of CreateConnectionCommand

Search for all places that construct CreateConnectionCommand and update them to use the new signature. The primary caller is the site-side deployment handler.
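
Callers that hold the configuration as a JSON string need to turn it into the IDictionary<string, string> values the command expects. The following is a minimal sketch under assumptions: ConnectionDetailsParser is a hypothetical helper name, and the plan does not say how the site-side deployment handler parses configuration today. It flattens the top-level JSON properties to strings, so a numeric value such as a port is preserved, and it returns null when no configuration is present, which matches the optional BackupConnectionDetails parameter.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper (not part of the plan): converts a stored JSON
// configuration string into the string dictionary used by CreateConnectionCommand.
public static class ConnectionDetailsParser
{
    public static IDictionary<string, string>? Parse(string? json)
    {
        // A missing or blank configuration means "no endpoint configured".
        if (string.IsNullOrWhiteSpace(json))
            return null;

        using var doc = JsonDocument.Parse(json);
        if (doc.RootElement.ValueKind != JsonValueKind.Object)
            throw new FormatException("Connection configuration must be a JSON object.");

        var result = new Dictionary<string, string>();
        foreach (var property in doc.RootElement.EnumerateObject())
        {
            // Strings are unwrapped; numbers, booleans and nested values keep their raw JSON text.
            result[property.Name] = property.Value.ValueKind == JsonValueKind.String
                ? property.Value.GetString() ?? string.Empty
                : property.Value.GetRawText();
        }
        return result;
    }
}
```

A caller could then pass Parse(artifact.PrimaryConfigurationJson) and Parse(artifact.BackupConfigurationJson) into the new command, rejecting the artifact if the primary result is null.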

Step 4: Build and test

Run: dotnet build ScadaLink.slnx && dotnet test tests/ScadaLink.DataConnectionLayer.Tests

Step 5: Commit

git add -A
git commit -m "feat(dcl): extend CreateConnectionCommand with backup config and failover retry count"

Task 3: DataConnectionActor Failover State Machine

Files:

  • Modify: src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs
  • Modify: src/ScadaLink.DataConnectionLayer/DataConnectionFactory.cs

This is the core change. The actor gains failover logic in its Reconnecting state.

Step 1: Add new state fields to DataConnectionActor

Add these fields alongside the existing ones (around line 30):

private readonly string _protocolType;
private readonly IDictionary<string, string> _primaryConfig;
private readonly IDictionary<string, string>? _backupConfig;
private readonly int _failoverRetryCount;
private readonly IDataConnectionFactory _factory;
private ActiveEndpoint _activeEndpoint = ActiveEndpoint.Primary;
private int _consecutiveFailures;

public enum ActiveEndpoint { Primary, Backup }

Step 2: Update constructor

Extend the constructor to accept both configs and the factory:

public DataConnectionActor(
    string connectionName,
    IDataConnection adapter,
    DataConnectionOptions options,
    ISiteHealthCollector healthCollector,
    IDataConnectionFactory factory, // creates a fresh adapter on failover
    string protocolType,
    IDictionary<string, string> primaryConfig,
    IDictionary<string, string>? backupConfig = null,
    int failoverRetryCount = 3)
{
    _connectionName = connectionName;
    _adapter = adapter;
    _options = options;
    _healthCollector = healthCollector;
    _factory = factory;
    _protocolType = protocolType;
    _primaryConfig = primaryConfig;
    _backupConfig = backupConfig;
    _failoverRetryCount = failoverRetryCount;
    _connectionDetails = primaryConfig; // start with primary
}

Note: The actor also needs IDataConnectionFactory injected to create new adapters on failover. Pass it through the constructor or resolve via DI. The DataConnectionManagerActor already has the factory — pass it through to the actor constructor.

Step 3: Extend HandleReconnectResult with failover logic

Replace the reconnect failure handling (around lines 279-296) to include failover:

private void HandleReconnectResult(ConnectResult result)
{
    if (result.Success)
    {
        _consecutiveFailures = 0;
        _log.Info("Reconnected {0} on {1} endpoint", _connectionName, _activeEndpoint);
        ReSubscribeAll();
        BecomeConnected();
        return;
    }

    _consecutiveFailures++;
    _log.Warning("Reconnect attempt {0}/{1} failed for {2} on {3}: {4}",
        _consecutiveFailures, _failoverRetryCount, _connectionName, _activeEndpoint, result.Error);

    if (_consecutiveFailures >= _failoverRetryCount && _backupConfig != null)
    {
        // Switch endpoint
        var previousEndpoint = _activeEndpoint;
        _activeEndpoint = _activeEndpoint == ActiveEndpoint.Primary
            ? ActiveEndpoint.Backup
            : ActiveEndpoint.Primary;
        _consecutiveFailures = 0;

        var newConfig = _activeEndpoint == ActiveEndpoint.Primary ? _primaryConfig : _backupConfig;

        _log.Warning("Failing over {0} from {1} to {2}", _connectionName, previousEndpoint, _activeEndpoint);

        // Dispose old adapter, create new one
        _ = _adapter.DisposeAsync();
        _adapter = _factory.Create(_protocolType, newConfig);
        _connectionDetails = newConfig;

        // Wire up disconnect handler on new adapter
        _adapter.Disconnected += () => _self.Tell(new AdapterDisconnected());
    }

    // Schedule next retry
    Context.System.Scheduler.ScheduleTellOnce(
        _options.ReconnectInterval, Self, AttemptConnect.Instance, ActorRefs.NoSender);
}
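
The switching rule in HandleReconnectResult can also be read in isolation from Akka.NET. The following is a minimal sketch, not the plan's implementation: FailoverPolicy and its members are hypothetical names, and the ActiveEndpoint enum is redeclared here only to keep the example self-contained. It captures the same rule as the handler above: count consecutive failures, switch to the other endpoint once the count reaches the retry threshold and a backup exists, and reset the counter on a switch or a success.

```csharp
using System;

public enum ActiveEndpoint { Primary, Backup }

// Hypothetical, actor-free model of the failover rule (illustration only).
public sealed class FailoverPolicy
{
    private readonly int _failoverRetryCount;
    private readonly bool _hasBackup;

    public FailoverPolicy(int failoverRetryCount, bool hasBackup)
    {
        if (failoverRetryCount < 1)
            throw new ArgumentOutOfRangeException(nameof(failoverRetryCount));
        _failoverRetryCount = failoverRetryCount;
        _hasBackup = hasBackup;
    }

    public ActiveEndpoint Active { get; private set; } = ActiveEndpoint.Primary;
    public int ConsecutiveFailures { get; private set; }

    // A successful connect clears the failure streak but keeps the current endpoint.
    public void RecordSuccess() => ConsecutiveFailures = 0;

    // Returns true when this failure triggers a switch to the other endpoint.
    public bool RecordFailure()
    {
        ConsecutiveFailures++;
        if (!_hasBackup || ConsecutiveFailures < _failoverRetryCount)
            return false;

        Active = Active == ActiveEndpoint.Primary
            ? ActiveEndpoint.Backup
            : ActiveEndpoint.Primary;
        ConsecutiveFailures = 0;
        return true;
    }
}
```

With a retry count of 2 and a backup configured, the second consecutive failure switches to Backup, and two further failures switch back to Primary. Without a backup, RecordFailure never returns true, which corresponds to retrying the single endpoint indefinitely.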

Step 4: Pass IDataConnectionFactory to DataConnectionActor

Update DataConnectionManagerActor.HandleCreateConnection to pass the factory:

var props = Props.Create(() => new DataConnectionActor(
    command.ConnectionName, adapter, _options, _healthCollector,
    _factory, // pass factory for failover adapter creation
    command.ProtocolType, command.PrimaryConnectionDetails,
    command.BackupConnectionDetails, command.FailoverRetryCount));

And update the DataConnectionActor constructor to store _factory.

Step 5: Build and run existing tests

Run: dotnet build ScadaLink.slnx && dotnet test tests/ScadaLink.DataConnectionLayer.Tests

Existing tests must pass (they use single-endpoint configs, so no failover triggered).

Step 6: Commit

git add -A
git commit -m "feat(dcl): add failover state machine to DataConnectionActor with round-robin endpoint switching"

Task 4: Failover Tests

Files:

  • Modify: tests/ScadaLink.DataConnectionLayer.Tests/DataConnectionActorTests.cs

Step 1: Write test — failover after N retries

[Fact]
public async Task Reconnecting_AfterFailoverRetryCount_SwitchesToBackup()
{
    // Arrange: create actor with primary + backup, failoverRetryCount = 2
    var primaryAdapter = Substitute.For<IDataConnection>();
    var backupAdapter = Substitute.For<IDataConnection>();
    var factory = Substitute.For<IDataConnectionFactory>();
    factory.Create("OpcUa", Arg.Is<IDictionary<string, string>>(d => d["endpoint"] == "backup"))
        .Returns(backupAdapter);

    // Primary connects then disconnects
    primaryAdapter.ConnectAsync(Arg.Any<IDictionary<string, string>>(), Arg.Any<CancellationToken>())
        .Returns(Task.CompletedTask);
    primaryAdapter.Status.Returns(ConnectionHealth.Connected);

    var primaryConfig = new Dictionary<string, string> { ["endpoint"] = "primary" };
    var backupConfig = new Dictionary<string, string> { ["endpoint"] = "backup" };

    // Create actor, connect on primary
    // ... (use test kit patterns from existing tests)
    // Simulate disconnect, verify 2 failures then factory.Create called with backup config
}

Step 2: Write test — single endpoint retries forever

[Fact]
public async Task Reconnecting_NoBackup_RetriesIndefinitely()
{
    // Arrange: create actor with primary only, no backup
    // Simulate 10 reconnect failures
    // Verify: factory.Create never called with backup, just keeps retrying
}

Step 3: Write test — round-robin back to primary after backup fails

[Fact]
public async Task Reconnecting_BackupFails_SwitchesBackToPrimary()
{
    // Arrange: primary + backup, failoverRetryCount = 1
    // Simulate: primary fails 1x → switch to backup → backup fails 1x → switch to primary
    // Verify: round-robin pattern
}
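
For the round-robin test, asserting the order in which endpoints are requested may be easier with a hand-written fake than with per-call mock setup. The sketch below is an assumption-laden illustration: RecordingFactory is a hypothetical test helper, and the two interfaces are trimmed stand-ins declared only so the example compiles on its own. A real test would use the project's IDataConnection and IDataConnectionFactory, and its stub adapter would also need a ConnectAsync that fails.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed stand-ins for the project's interfaces (assumed shapes, illustration only).
public interface IDataConnection { }

public interface IDataConnectionFactory
{
    IDataConnection Create(string protocolType, IDictionary<string, string> connectionDetails);
}

// Hypothetical test fake: records the "endpoint" value of every adapter it is asked to create.
public sealed class RecordingFactory : IDataConnectionFactory
{
    private sealed class StubConnection : IDataConnection { }

    private readonly object _gate = new();
    private readonly List<string> _requestedEndpoints = new();

    // Snapshot of the endpoints requested so far, in call order.
    public IReadOnlyList<string> RequestedEndpoints
    {
        get { lock (_gate) { return _requestedEndpoints.ToList(); } }
    }

    public IDataConnection Create(string protocolType, IDictionary<string, string> connectionDetails)
    {
        // The actor may call this from its own thread, so guard the list.
        lock (_gate) { _requestedEndpoints.Add(connectionDetails["endpoint"]); }
        return new StubConnection();
    }
}
```

With failoverRetryCount = 1, the expected sequence after the primary fails once and the backup fails once would be "backup" followed by "primary", since the initial primary adapter is supplied by the manager rather than created through the factory.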

Step 4: Write test — successful reconnect resets counter

[Fact]
public async Task Reconnecting_SuccessfulConnect_ResetsConsecutiveFailures()
{
    // Arrange: failoverRetryCount = 3
    // Simulate: 2 failures on primary, then success
    // Verify: no failover, counter reset
}

Step 5: Write test — ReSubscribeAll called after failover

[Fact]
public async Task Failover_ReSubscribesAllTagsOnNewAdapter()
{
    // Arrange: actor with subscriptions, then failover
    // Verify: new adapter receives SubscribeAsync calls for all previously subscribed tags
}

Step 6: Run all tests

Run: dotnet test tests/ScadaLink.DataConnectionLayer.Tests --verbosity normal

Step 7: Commit

git add -A
git commit -m "test(dcl): add failover state machine tests for DataConnectionActor"

Task 5: Health Reporting & Site Event Logging

Files:

  • Modify: src/ScadaLink.Commons/Messages/DataConnection/DataConnectionHealthReport.cs
  • Modify: src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs (ReplyWithHealthReport, HandleReconnectResult)

Step 1: Add ActiveEndpoint to health report

public record DataConnectionHealthReport(
    string ConnectionName,
    ConnectionHealth Status,
    int TotalSubscribedTags,
    int ResolvedTags,
    string ActiveEndpoint,
    DateTimeOffset Timestamp);

Step 2: Update ReplyWithHealthReport in DataConnectionActor

Update the health report method (around line 516) to include the active endpoint:

private void ReplyWithHealthReport()
{
    var endpointLabel = _backupConfig == null
        ? "Primary (no backup)"
        : _activeEndpoint.ToString();

    Sender.Tell(new DataConnectionHealthReport(
        _connectionName, _adapter.Status,
        _subscriptionsByInstance.Values.Sum(s => s.Count),
        _resolvedTags,
        endpointLabel,
        DateTimeOffset.UtcNow));
}

Step 3: Add site event logging on failover

In HandleReconnectResult, after switching endpoints, log a site event:

if (_siteEventLogger != null)
{
    _ = _siteEventLogger.LogEventAsync(
        "connection", "Warning", null, _connectionName,
        $"Failover from {previousEndpoint} to {_activeEndpoint}",
        $"After {_failoverRetryCount} consecutive failures");
}

Note: The actor needs ISiteEventLogger injected. Add it as an optional constructor parameter.

Step 4: Add site event logging on successful reconnect after failover

In the success path of HandleReconnectResult, log a restore event if the active endpoint differs from the last known-good endpoint:

if (_siteEventLogger != null)
{
    _ = _siteEventLogger.LogEventAsync(
        "connection", "Info", null, _connectionName,
        $"Connection restored on {_activeEndpoint} endpoint", null);
}
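
The "changed from last known good" condition needs some state that the snippet above does not show. One possible shape is sketched here; RestoredEndpointTracker is a hypothetical name, the enum is redeclared only for self-containment, and the actor could equally hold a single nullable field instead of a separate class.

```csharp
using System;

public enum ActiveEndpoint { Primary, Backup }

// Hypothetical helper: remembers the endpoint of the last successful connect.
public sealed class RestoredEndpointTracker
{
    private ActiveEndpoint? _lastKnownGood;

    // Call on every successful connect. Returns true only when the connection
    // came back on a different endpoint than the previous successful one,
    // which is the case where a "restored" site event is worth logging.
    public bool OnConnected(ActiveEndpoint current)
    {
        var changed = _lastKnownGood.HasValue && _lastKnownGood.Value != current;
        _lastKnownGood = current;
        return changed;
    }
}
```

Under this sketch the very first connect never logs a restore event, a reconnect on the same endpoint does not log either, and a reconnect after a failover logs once.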

Step 5: Build and test

Run: dotnet build ScadaLink.slnx && dotnet test tests/ScadaLink.DataConnectionLayer.Tests

Step 6: Commit

git add -A
git commit -m "feat(dcl): add active endpoint to health reports and log failover events"

Task 6: Central UI Changes

Files:

  • Modify: src/ScadaLink.CentralUI/Components/Pages/Admin/DataConnections.razor
  • Modify: src/ScadaLink.CentralUI/Components/Pages/Admin/DataConnectionForm.razor

Step 1: Update DataConnections list page

Add an Active Endpoint column to the table (around lines 28-64). Insert it after the Protocol column:

<th>Active Endpoint</th>

And in the row template:

<td>@connection.ActiveEndpoint</td>

This requires the list page to fetch health data alongside the connection list. Add a health status lookup or include ActiveEndpoint in the data connection response.

Step 2: Update DataConnectionForm — rename Configuration label

Change the "Configuration" label to "Primary Endpoint Configuration" (around lines 44-61).

Step 3: Add backup endpoint section

Below the primary config field, add:

@if (!_showBackup)
{
    <button type="button" class="btn btn-outline-secondary btn-sm mt-2"
            @onclick="() => _showBackup = true">
        Add Backup Endpoint
    </button>
}
else
{
    <div class="mt-3">
        <div class="d-flex justify-content-between align-items-center">
            <label class="form-label">Backup Endpoint Configuration</label>
            <button type="button" class="btn btn-outline-danger btn-sm"
                    @onclick="RemoveBackup">
                Remove Backup
            </button>
        </div>
        <textarea class="form-control" rows="4"
                  @bind="_model.BackupConfiguration"
                  placeholder='{"Host": "backup-host", "Port": 50101}' />
    </div>

    <div class="mt-3">
        <label class="form-label">Failover Retry Count</label>
        <input type="number" class="form-control" min="1" max="20"
               @bind="_model.FailoverRetryCount" />
        <small class="text-muted">Retries before switching to backup (default: 3)</small>
    </div>
}

Step 4: Update form model and save logic

Add BackupConfiguration and FailoverRetryCount to the form model. Update the save method to pass both configs to the management API.

In edit mode, set _showBackup = true if BackupConfiguration is not null.
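
The markup in Step 3 references a RemoveBackup handler and a _showBackup flag that are not defined in this plan. A minimal sketch of that state, written as plain C# so it can be tested outside a component, might look like the following. The class names are hypothetical, and resetting FailoverRetryCount to 3 on removal is an assumption rather than a stated requirement.

```csharp
using System;

// Hypothetical form model mirroring the fields named in this task.
public sealed class DataConnectionFormModel
{
    public string? PrimaryConfiguration { get; set; }
    public string? BackupConfiguration { get; set; }
    public int FailoverRetryCount { get; set; } = 3;
}

// Hypothetical holder for the collapsible backup section's state.
public sealed class BackupSectionState
{
    public bool ShowBackup { get; private set; }

    // Edit mode: expand the section when the loaded connection already has a backup.
    public void Load(DataConnectionFormModel model) =>
        ShowBackup = model.BackupConfiguration is not null;

    public void AddBackup() => ShowBackup = true;

    // Clear the backup so the saved connection returns to single-endpoint behavior.
    public void RemoveBackup(DataConnectionFormModel model)
    {
        model.BackupConfiguration = null;
        model.FailoverRetryCount = 3; // assumption: fall back to the default
        ShowBackup = false;
    }
}
```

In the component, the same logic would typically live in the @code block, with _showBackup set during initialization in edit mode.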

Step 5: Build and verify visually

Run: dotnet build ScadaLink.slnx

Visual verification requires running the cluster — document as manual test.

Step 6: Commit

git add -A
git commit -m "feat(ui): add primary/backup endpoint fields to data connection form"

Task 7: CLI, Management API, and Deployment

Files:

  • Modify: src/ScadaLink.Commons/Messages/Management/DataConnectionCommands.cs
  • Modify: src/ScadaLink.CLI/Commands/DataConnectionCommands.cs
  • Modify: src/ScadaLink.ManagementService/ManagementActor.cs (lines 689-711)
  • Modify: Deployment/flattening code that creates DataConnectionArtifact

Step 1: Update management command messages

public record CreateDataConnectionCommand(
    int SiteId, string Name, string Protocol,
    string? PrimaryConfiguration,
    string? BackupConfiguration = null,
    int FailoverRetryCount = 3);

public record UpdateDataConnectionCommand(
    int DataConnectionId, string Name, string Protocol,
    string? PrimaryConfiguration,
    string? BackupConfiguration = null,
    int FailoverRetryCount = 3);

Step 2: Update ManagementActor handlers

In HandleCreateDataConnection (around line 689): set PrimaryConfiguration, BackupConfiguration, FailoverRetryCount from command.

In HandleUpdateDataConnection (around line 699): same fields.

Step 3: Update CLI commands

In BuildCreate (around lines 75-98):

  • Rename --configuration to --primary-config
  • Add hidden alias --configuration pointing to same option
  • Add --backup-config option (optional)
  • Add --failover-retry-count option (optional, default 3)

In BuildUpdate (around lines 36-59): same changes.

In BuildGet (around lines 22-34): update output to show both configs.

Step 4: Update deployment artifact creation

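Find where DataConnectionArtifact is constructed (in deployment/flattening code). Update it to pass PrimaryConfigurationJson, BackupConfigurationJson, and FailoverRetryCount from the entity. Pass FailoverRetryCount explicitly; otherwise the record's default of 3 silently replaces the configured value.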
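
As an illustration of that mapping, here is a short sketch. DataConnectionArtifactMapper is a hypothetical name, the entity is a trimmed copy of the class from Task 1, and the real flattening code may look quite different.

```csharp
using System;

// Trimmed copy of the Task 1 entity, redeclared so the sketch is self-contained.
public class DataConnection
{
    public string Name { get; set; }
    public string Protocol { get; set; }
    public string? PrimaryConfiguration { get; set; }
    public string? BackupConfiguration { get; set; }
    public int FailoverRetryCount { get; set; } = 3;

    public DataConnection(string name, string protocol)
    {
        Name = name ?? throw new ArgumentNullException(nameof(name));
        Protocol = protocol ?? throw new ArgumentNullException(nameof(protocol));
    }
}

public record DataConnectionArtifact(
    string Name,
    string Protocol,
    string? PrimaryConfigurationJson,
    string? BackupConfigurationJson,
    int FailoverRetryCount = 3);

// Hypothetical mapper: copies every failover-related field from entity to artifact.
public static class DataConnectionArtifactMapper
{
    public static DataConnectionArtifact ToArtifact(DataConnection connection) =>
        new(connection.Name,
            connection.Protocol,
            connection.PrimaryConfiguration,
            connection.BackupConfiguration,
            connection.FailoverRetryCount); // explicit, so the record default is not used
}
```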

Step 5: Build and test CLI

Run: dotnet build ScadaLink.slnx

Test CLI manually:

scadalink data-connection create --site-id 1 --name "Test" --protocol OpcUa \
  --primary-config '{"endpoint":"opc.tcp://localhost:50000"}' \
  --backup-config '{"endpoint":"opc.tcp://localhost:50010"}' \
  --failover-retry-count 3

Step 6: Commit

git add -A
git commit -m "feat(cli): add --primary-config, --backup-config, --failover-retry-count to data connection commands"

Task 8: Documentation Updates

Files:

  • Modify: docs/requirements/Component-DataConnectionLayer.md
  • Modify: docs/requirements/HighLevelReqs.md
  • Modify: docs/requirements/Component-CentralUI.md
  • Modify: docs/test_infra/test_infra.md

Step 1: Update Component-DataConnectionLayer.md

Add new section "Endpoint Redundancy" covering:

  • Optional backup endpoints
  • Failover state machine (include ASCII diagram from design doc)
  • Configuration model (PrimaryConfiguration + BackupConfiguration)
  • Failover retry count and round-robin behavior
  • Subscription re-creation on failover
  • Health reporting (ActiveEndpoint field)
  • Site event logging (DataConnectionFailover, DataConnectionRestored)

Update the configuration reference tables to show the new entity fields.

Step 2: Update HighLevelReqs.md

Add requirement: "Data connections support optional backup endpoints with automatic failover after a configurable retry count. On failover, all subscriptions are transparently re-created on the new endpoint."

Step 3: Update Component-CentralUI.md

Update the Data Connections workflow section to describe:

  • Primary/backup config fields on the form
  • Collapsible backup section
  • Failover retry count field
  • Active endpoint column on list page

Step 4: Update test_infra.md

Add a note in the Remote Test Infrastructure section that the dual OPC UA servers (50000/50010) and dual LmxProxy instances (50100/50101) enable primary/backup testing.

Step 5: Commit

git add -A
git commit -m "docs(dcl): document primary/backup endpoint redundancy across requirements and test infra"