ScadaLink Cluster Topology Guide

Architecture Overview

ScadaLink uses a hub-and-spoke architecture:

  • Central Cluster: Two-node active/standby Akka.NET cluster for management, UI, and coordination.
  • Site Clusters: Two-node active/standby Akka.NET clusters at each remote site for data collection and local processing.
                    ┌──────────────────────────┐
                    │     Central Cluster      │
                    │  ┌──────┐    ┌──────┐    │
  Users ──────────► │  │Node A│◄──►│Node B│    │
  (HTTPS/LB)        │  │Active│    │Stby  │    │
                    │  └──┬───┘    └──┬───┘    │
                    └─────┼───────────┼────────┘
                          │           │
              ┌───────────┼───────────┼───────────┐
              │           │           │           │
        ┌─────▼─────┐ ┌──▼──────┐ ┌──▼──────┐ ┌──▼──────┐
        │  Site 01  │ │ Site 02 │ │ Site 03 │ │ Site N  │
        │ ┌──┐ ┌──┐ │ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│
        │ │A │ │B │ │ │ │A ││B ││ │ │A ││B ││ │ │A ││B ││
        │ └──┘ └──┘ │ │ └──┘└──┘│ │ └──┘└──┘│ │ └──┘└──┘│
        └───────────┘ └─────────┘ └─────────┘ └─────────┘

Central Cluster Setup

Cluster Configuration

Both central nodes must be configured as seed nodes for each other:

Node A (central-01.example.com):

{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}

Node B (central-02.example.com):

{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-02.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}

Cluster Behavior

  • Split-brain resolver: Keep-oldest with down-if-alone = on, 15-second stable-after.
  • Minimum members: min-nr-of-members = 1 — a single node can form a cluster.
  • Failure detection: 2-second heartbeat interval, 10-second threshold.
  • Total failover time: ~25 seconds from node failure to singleton migration (10-second failure-detection threshold plus 15-second split-brain stable-after window).
  • Singleton handover: Uses CoordinatedShutdown for graceful migration.
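The behavior above maps onto Akka.NET HOCON roughly as follows. This is a sketch of the standard split-brain-resolver and failure-detector keys; how ScadaLink surfaces these settings (HOCON file vs. appsettings.json) may differ:

```hocon
akka {
  cluster {
    # Keep-oldest split-brain strategy with down-if-alone, 15 s stable-after
    downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider"
    split-brain-resolver {
      active-strategy = keep-oldest
      stable-after = 15s
      keep-oldest.down-if-alone = on
    }

    # A single node may form the cluster
    min-nr-of-members = 1

    # 2 s heartbeats, node considered unreachable after 10 s of silence
    failure-detector {
      heartbeat-interval = 2s
      acceptable-heartbeat-pause = 10s
    }
  }
}
```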

Shared State

Both central nodes share state through:

  • SQL Server: All configuration, deployment records, templates, and audit logs.
  • JWT signing key: Same JwtSigningKey in both nodes' configuration.
  • Data Protection keys: Shared key ring (stored in SQL Server or shared file path).
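For the JWT signing key, both nodes simply carry the same value in configuration. The `Security:JwtSigningKey` path below is illustrative; the actual key path depends on ScadaLink's config schema:

```json
{
  "ScadaLink": {
    "Security": {
      "JwtSigningKey": "<same base64 key on both central nodes>"
    }
  }
}
```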

Load Balancer

A load balancer sits in front of both central nodes for the Blazor Server UI:

  • Health check: GET /health/ready
  • Protocol: HTTPS (TLS termination at LB or pass-through)
  • Sticky sessions: Not required (JWT + shared Data Protection keys)
  • If the active node fails, the LB routes to the standby (which becomes active after singleton migration).
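A minimal HAProxy sketch of this setup, assuming hypothetical certificate and CA-bundle paths — any load balancer that can do HTTP health checks against `/health/ready` works equally well:

```haproxy
frontend scadalink_ui
    bind :443 ssl crt /etc/haproxy/certs/scadalink.pem
    default_backend central_nodes

backend central_nodes
    # Active health check against the readiness endpoint
    option httpchk GET /health/ready
    http-check expect status 200
    # Standby node only receives traffic when the primary fails its check
    server central-01 central-01.example.com:443 check ssl verify required ca-file /etc/ssl/certs/ca-bundle.crt
    server central-02 central-02.example.com:443 check ssl verify required ca-file /etc/ssl/certs/ca-bundle.crt backup
```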

Site Cluster Setup

Cluster Configuration

Each site has its own two-node cluster:

Site Node A (site-01-a.example.com):

{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-a.example.com:8081",
        "akka.tcp://scadalink@site-01-b.example.com:8081"
      ]
    }
  }
}
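Site Node B (site-01-b.example.com) mirrors Node A, changing only the hostname:

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-b.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-a.example.com:8081",
        "akka.tcp://scadalink@site-01-b.example.com:8081"
      ]
    }
  }
}
```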

Site Cluster Behavior

  • Same split-brain resolver as central (keep-oldest).
  • Singleton actors: Site Deployment Manager migrates on failover.
  • Staggered instance startup: 50ms delay between Instance Actor creation to prevent reconnection storms.
  • SQLite persistence: Both nodes access the same SQLite files (or each has its own copy with async replication).
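The staggered startup can be sketched as follows (illustrative Python, not ScadaLink code; the actor-creation callback is hypothetical):

```python
import asyncio

STARTUP_STAGGER = 0.050  # 50 ms between Instance Actor creations

async def start_instances(configs, create_actor):
    """Create one Instance Actor per persisted config, pausing 50 ms
    between creations so all instances do not reconnect to their
    devices at the same moment (a reconnection storm)."""
    actors = []
    for cfg in configs:
        actors.append(create_actor(cfg))
        await asyncio.sleep(STARTUP_STAGGER)
    return actors
```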

Central-Site Communication

  • Sites connect to central via Akka.NET remoting.
  • The Communication:CentralSeedNode setting in the site config points to one of the central nodes.
  • If that central node is down, the site's communication actor will retry until it connects to the active central node.
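The retry behavior can be sketched as exponential backoff with a cap. The base, factor, and cap below are assumptions for illustration; ScadaLink's actual retry intervals are internal:

```python
def backoff_schedule(base=1.0, cap=30.0, factor=2.0):
    """Yield successive retry delays: 1 s, 2 s, 4 s, ... capped at 30 s.
    The site's communication actor keeps retrying on a schedule like
    this until a connection to the active central node succeeds."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)
```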

Scaling Guidelines

Target Scale

  • 10 sites maximum per central cluster
  • 500 machines (instances) total across all sites
  • 75 tags per machine (37,500 total tag subscriptions)

Resource Requirements

Component     CPU      RAM    Disk    Notes
Central node  4 cores  8 GB   50 GB   SQL Server is separate
Site node     2 cores  4 GB   20 GB   SQLite databases grow with store-and-forward (S&F) data
SQL Server    4 cores  16 GB  100 GB  Shared across central cluster

Network Bandwidth

  • Health reports: ~1 KB per site per 30 seconds = negligible
  • Tag value updates: Depends on data change rate; OPC UA subscription-based
  • Deployment artifacts: One-time burst per deployment (varies by config size)
  • Debug view streaming: ~500 bytes per attribute change per subscriber
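At the target scale of 10 sites, the steady-state health-report load works out to well under 1 KB/s in total:

```python
SITES = 10                    # target scale: sites per central cluster
HEALTH_REPORT_BYTES = 1024    # ~1 KB per health report
REPORT_INTERVAL_S = 30        # one report per site every 30 seconds

# Aggregate health-report bandwidth at the central cluster
bytes_per_second = SITES * HEALTH_REPORT_BYTES / REPORT_INTERVAL_S
print(f"{bytes_per_second:.0f} B/s")  # prints: 341 B/s
```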

Dual-Node Failure Recovery

Scenario: Both Nodes Down

  1. First node starts: Forms a single-node cluster (min-nr-of-members = 1).
  2. Central: Reconnects to SQL Server, reads deployment state, becomes operational.
  3. Site: Opens SQLite databases, rebuilds Instance Actors from persisted configs, resumes S&F retries.
  4. Second node starts: Joins the existing cluster as standby.

Automatic Recovery

No manual intervention required for dual-node failure. The first node to start will:

  • Form the cluster
  • Take over all singletons
  • Begin processing immediately
  • Accept the second node when it joins
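After a dual-node restart, an operator script can still confirm the first node came back by polling the same `GET /health/ready` endpoint the load balancer uses. A minimal sketch, with the HTTP call injected as a callable so the timeout and interval are the only assumptions:

```python
import time

def wait_for_ready(probe, timeout_s=120, interval_s=5):
    """Poll the node's readiness endpoint until it reports healthy or
    the timeout expires. `probe` is any callable that returns True
    when GET /health/ready answers 200."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```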