ScadaLink Cluster Topology Guide

Architecture Overview

ScadaLink uses a hub-and-spoke architecture:

  • Central Cluster: Two-node active/standby Akka.NET cluster for management, UI, and coordination.
  • Site Clusters: Two-node active/standby Akka.NET clusters at each remote site for data collection and local processing.
                    ┌──────────────────────────┐
                    │     Central Cluster      │
                    │  ┌──────┐    ┌──────┐    │
  Users ──────────► │  │Node A│◄──►│Node B│    │
  (HTTPS/LB)        │  │Active│    │Stby  │    │
                    │  └──┬───┘    └──┬───┘    │
                    └─────┼───────────┼────────┘
                          │           │
              ┌───────────┼───────────┼───────────┐
              │           │           │           │
        ┌─────▼─────┐ ┌──▼──────┐ ┌──▼──────┐ ┌──▼──────┐
        │  Site 01  │ │ Site 02 │ │ Site 03 │ │ Site N  │
        │ ┌──┐ ┌──┐ │ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│
        │ │A │ │B │ │ │ │A ││B ││ │ │A ││B ││ │ │A ││B ││
        │ └──┘ └──┘ │ │ └──┘└──┘│ │ └──┘└──┘│ │ └──┘└──┘│
        └───────────┘ └─────────┘ └─────────┘ └─────────┘

Central Cluster Setup

Cluster Configuration

Both central nodes must be configured as seed nodes for each other:

Node A (central-01.example.com):

{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}

Node B (central-02.example.com):

{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-02.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}

Cluster Behavior

  • Split-brain resolver: Keep-oldest with down-if-alone = on, 15-second stable-after.
  • Minimum members: min-nr-of-members = 1 — a single node can form a cluster.
  • Failure detection: 2-second heartbeat interval, 10-second threshold.
  • Total failover time: ~25 seconds from node failure to singleton migration (10-second failure-detection threshold plus 15-second split-brain stable-after window).
  • Singleton handover: Uses CoordinatedShutdown for graceful migration.
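The behavior above maps onto Akka.NET HOCON roughly as follows. This is a sketch of the standard split-brain-resolver and failure-detector keys; how ScadaLink surfaces these settings (HOCON file vs. appsettings.json) may differ:

```hocon
akka {
  cluster {
    # Keep-oldest split-brain strategy with down-if-alone, 15 s stable-after
    downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider"
    split-brain-resolver {
      active-strategy = keep-oldest
      stable-after = 15s
      keep-oldest.down-if-alone = on
    }

    # A single node may form the cluster
    min-nr-of-members = 1

    # 2 s heartbeats, node considered unreachable after 10 s of silence
    failure-detector {
      heartbeat-interval = 2s
      acceptable-heartbeat-pause = 10s
    }
  }
}
```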

Shared State

Both central nodes share state through:

  • SQL Server: All configuration, deployment records, templates, and audit logs.
  • JWT signing key: Same JwtSigningKey in both nodes' configuration.
  • Data Protection keys: Shared key ring (stored in SQL Server or shared file path).
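For the JWT signing key, both nodes simply carry the same value in configuration. The `Security:JwtSigningKey` path below is illustrative; the actual key path depends on ScadaLink's config schema:

```json
{
  "ScadaLink": {
    "Security": {
      "JwtSigningKey": "<same base64 key on both central nodes>"
    }
  }
}
```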

Load Balancer

A load balancer sits in front of both central nodes for the Blazor Server UI:

  • Health check: GET /health/ready
  • Protocol: HTTPS (TLS termination at LB or pass-through)
  • Sticky sessions: Not required (JWT + shared Data Protection keys)
  • If the active node fails, the LB routes to the standby (which becomes active after singleton migration).
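A minimal HAProxy sketch of this setup, assuming hypothetical certificate and CA-bundle paths — any load balancer that can do HTTP health checks against `/health/ready` works equally well:

```haproxy
frontend scadalink_ui
    bind :443 ssl crt /etc/haproxy/certs/scadalink.pem
    default_backend central_nodes

backend central_nodes
    # Active health check against the readiness endpoint
    option httpchk GET /health/ready
    http-check expect status 200
    # Standby node only receives traffic when the primary fails its check
    server central-01 central-01.example.com:443 check ssl verify required ca-file /etc/ssl/certs/ca-bundle.crt
    server central-02 central-02.example.com:443 check ssl verify required ca-file /etc/ssl/certs/ca-bundle.crt backup
```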

Site Cluster Setup

Cluster Configuration

Each site has its own two-node cluster:

Site Node A (site-01-a.example.com):

{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-a.example.com:8081",
        "akka.tcp://scadalink@site-01-b.example.com:8081"
      ]
    }
  }
}
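Site Node B (site-01-b.example.com) mirrors Node A, changing only the hostname:

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-b.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-a.example.com:8081",
        "akka.tcp://scadalink@site-01-b.example.com:8081"
      ]
    }
  }
}
```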

Site Cluster Behavior

  • Same split-brain resolver as central (keep-oldest).
  • Singleton actors: Site Deployment Manager migrates on failover.
  • Staggered instance startup: 50ms delay between Instance Actor creation to prevent reconnection storms.
  • SQLite persistence: Both nodes access the same SQLite files (or each has its own copy with async replication).
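The staggered startup can be sketched as follows (illustrative Python, not ScadaLink code; the actor-creation callback is hypothetical):

```python
import asyncio

STARTUP_STAGGER = 0.050  # 50 ms between Instance Actor creations

async def start_instances(configs, create_actor):
    """Create one Instance Actor per persisted config, pausing 50 ms
    between creations so all instances do not reconnect to their
    devices at the same moment (a reconnection storm)."""
    actors = []
    for cfg in configs:
        actors.append(create_actor(cfg))
        await asyncio.sleep(STARTUP_STAGGER)
    return actors
```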

Central-Site Communication

  • Sites connect to central via Akka.NET remoting.
  • The Communication:CentralSeedNode setting in the site config points to one of the central nodes.
  • If that central node is down, the site's communication actor will retry until it connects to the active central node.
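The retry behavior can be sketched as exponential backoff with a cap. The base, factor, and cap below are assumptions for illustration; ScadaLink's actual retry intervals are internal:

```python
def backoff_schedule(base=1.0, cap=30.0, factor=2.0):
    """Yield successive retry delays: 1 s, 2 s, 4 s, ... capped at 30 s.
    The site's communication actor keeps retrying on a schedule like
    this until a connection to the active central node succeeds."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)
```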

Scaling Guidelines

Target Scale

  • 10 sites maximum per central cluster
  • 500 machines (instances) total across all sites
  • 75 tags per machine (37,500 total tag subscriptions)

Resource Requirements

Component     CPU      RAM    Disk    Notes
Central node  4 cores  8 GB   50 GB   SQL Server is separate
Site node     2 cores  4 GB   20 GB   SQLite databases grow with store-and-forward (S&F) data
SQL Server    4 cores  16 GB  100 GB  Shared across central cluster

Network Bandwidth

  • Health reports: ~1 KB per site per 30 seconds = negligible
  • Tag value updates: Depends on data change rate; OPC UA subscription-based
  • Deployment artifacts: One-time burst per deployment (varies by config size)
  • Debug view streaming: ~500 bytes per attribute change per subscriber
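At the target scale of 10 sites, the steady-state health-report load works out to well under 1 KB/s in total:

```python
SITES = 10                    # target scale: sites per central cluster
HEALTH_REPORT_BYTES = 1024    # ~1 KB per health report
REPORT_INTERVAL_S = 30        # one report per site every 30 seconds

# Aggregate health-report bandwidth at the central cluster
bytes_per_second = SITES * HEALTH_REPORT_BYTES / REPORT_INTERVAL_S
print(f"{bytes_per_second:.0f} B/s")  # prints: 341 B/s
```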

Dual-Node Failure Recovery

Scenario: Both Nodes Down

  1. First node starts: Forms a single-node cluster (min-nr-of-members = 1).
  2. Central: Reconnects to SQL Server, reads deployment state, becomes operational.
  3. Site: Opens SQLite databases, rebuilds Instance Actors from persisted configs, resumes S&F retries.
  4. Second node starts: Joins the existing cluster as standby.

Automatic Recovery

No manual intervention required for dual-node failure. The first node to start will:

  • Form the cluster
  • Take over all singletons
  • Begin processing immediately
  • Accept the second node when it joins
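After a dual-node restart, an operator script can still confirm the first node came back by polling the same `GET /health/ready` endpoint the load balancer uses. A minimal sketch, with the HTTP call injected as a callable so the timeout and interval are the only assumptions:

```python
import time

def wait_for_ready(probe, timeout_s=120, interval_s=5):
    """Poll the node's readiness endpoint until it reports healthy or
    the timeout expires. `probe` is any callable that returns True
    when GET /health/ready answers 200."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```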