# ScadaLink Cluster Topology Guide

## Architecture Overview

ScadaLink uses a hub-and-spoke architecture:

- **Central Cluster**: Two-node active/standby Akka.NET cluster for management, UI, and coordination.
- **Site Clusters**: Two-node active/standby Akka.NET clusters at each remote site for data collection and local processing.

```
                 ┌──────────────────────────┐
                 │     Central Cluster      │
                 │  ┌──────┐    ┌──────┐    │
Users ──────────►│  │Node A│◄──►│Node B│    │
   (HTTPS/LB)    │  │Active│    │Stby  │    │
                 │  └──┬───┘    └──┬───┘    │
                 └─────┼───────────┼────────┘
                       │           │
            ┌──────────┼───────────┼───────────┐
            │          │           │           │
      ┌─────▼─────┐ ┌──▼──────┐ ┌──▼──────┐ ┌──▼──────┐
      │  Site 01  │ │ Site 02 │ │ Site 03 │ │ Site N  │
      │ ┌──┐ ┌──┐ │ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│
      │ │A │ │B │ │ │ │A ││B ││ │ │A ││B ││ │ │A ││B ││
      │ └──┘ └──┘ │ │ └──┘└──┘│ │ └──┘└──┘│ │ └──┘└──┘│
      └───────────┘ └─────────┘ └─────────┘ └─────────┘
```

## Central Cluster Setup

### Cluster Configuration

Both central nodes must be configured as seed nodes for each other:

**Node A** (`central-01.example.com`):

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}
```

**Node B** (`central-02.example.com`):

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-02.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}
```

### Cluster Behavior

- **Split-brain resolver**: Keep-oldest with `down-if-alone = on`, 15-second stable-after.
- **Minimum members**: `min-nr-of-members = 1` — a single node can form a cluster.
- **Failure detection**: 2-second heartbeat interval, 10-second threshold.
- **Total failover time**: ~25 seconds from node failure to singleton migration (roughly the 10-second failure-detection threshold plus the 15-second stable-after window, then handover).
- **Singleton handover**: Uses CoordinatedShutdown for graceful migration.

### Shared State

Both central nodes share state through:

- **SQL Server**: All configuration, deployment records, templates, and audit logs.
- **JWT signing key**: Same `JwtSigningKey` in both nodes' configuration.
- **Data Protection keys**: Shared key ring (stored in SQL Server or a shared file path).

### Load Balancer

A load balancer sits in front of both central nodes for the Blazor Server UI:

- Health check: `GET /health/ready`
- Protocol: HTTPS (TLS termination at LB or pass-through)
- Sticky sessions: Not required (JWT + shared Data Protection keys)
- If the active node fails, the LB routes to the standby (which becomes active after singleton migration).

## Site Cluster Setup

### Cluster Configuration

Each site has its own two-node cluster:

**Site Node A** (`site-01-a.example.com`):

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-a.example.com:8081",
        "akka.tcp://scadalink@site-01-b.example.com:8081"
      ]
    }
  }
}
```

### Site Cluster Behavior

- Same split-brain resolver as central (keep-oldest).
- Singleton actors: The Site Deployment Manager migrates on failover.
- Staggered instance startup: 50 ms delay between Instance Actor creations to prevent reconnection storms.
- SQLite persistence: Both nodes access the same SQLite files (or each has its own copy with async replication).

### Central-Site Communication

- Sites connect to central via Akka.NET remoting.
- The `Communication:CentralSeedNode` setting in the site config points to one of the central nodes.
- If that central node is down, the site's communication actor will retry until it connects to the active central node.
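As an illustration, the site-side setting might look like this in a site node's configuration. Only the `Communication:CentralSeedNode` key is documented here; the surrounding section shape is an assumption:

```json
{
  "ScadaLink": {
    "Communication": {
      // Assumed section shape; only the CentralSeedNode key is documented above.
      // Either central node is a valid target: the communication actor retries
      // until it reaches the active one.
      "CentralSeedNode": "akka.tcp://scadalink@central-01.example.com:8081"
    }
  }
}
```

Because connection retries continue until an active central node answers, pointing all sites at the same central hostname is acceptable; it does not pin them to that node after failover.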
## Scaling Guidelines

### Target Scale

- 10 sites maximum per central cluster
- 500 machines (instances) total across all sites
- 75 tags per machine (37,500 total tag subscriptions)

### Resource Requirements

| Component | CPU | RAM | Disk | Notes |
|-----------|-----|-----|------|-------|
| Central node | 4 cores | 8 GB | 50 GB | SQL Server is separate |
| Site node | 2 cores | 4 GB | 20 GB | SQLite databases grow with store-and-forward (S&F) |
| SQL Server | 4 cores | 16 GB | 100 GB | Shared across central cluster |

### Network Bandwidth

- Health reports: ~1 KB per site per 30 seconds = negligible
- Tag value updates: Depends on data change rate; OPC UA subscription-based
- Deployment artifacts: One-time burst per deployment (varies by config size)
- Debug view streaming: ~500 bytes per attribute change per subscriber

## Dual-Node Failure Recovery

### Scenario: Both Nodes Down

1. **First node starts**: Forms a single-node cluster (`min-nr-of-members = 1`).
2. **Central**: Reconnects to SQL Server, reads deployment state, becomes operational.
3. **Site**: Opens SQLite databases, rebuilds Instance Actors from persisted configs, resumes S&F retries.
4. **Second node starts**: Joins the existing cluster as standby.

### Automatic Recovery

No manual intervention is required after a dual-node failure. The first node to start will:

- Form the cluster
- Take over all singletons
- Begin processing immediately
- Accept the second node when it joins
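The target-scale and bandwidth figures from the Scaling Guidelines section can be cross-checked with a few lines of arithmetic (all inputs are taken from that section; this is a sanity check, not part of ScadaLink itself):

```python
# Inputs from the "Target Scale" and "Network Bandwidth" sections.
SITES = 10                 # max sites per central cluster
MACHINES = 500             # instances across all sites
TAGS_PER_MACHINE = 75
HEALTH_REPORT_BYTES = 1024  # ~1 KB per site
HEALTH_INTERVAL_S = 30      # one report per site per 30 seconds

# Total tag subscriptions across the deployment.
total_tags = MACHINES * TAGS_PER_MACHINE
print(f"total tag subscriptions: {total_tags}")  # 37500

# Steady-state health-report traffic at the central cluster.
health_bytes_per_sec = SITES * HEALTH_REPORT_BYTES / HEALTH_INTERVAL_S
print(f"health-report traffic: {health_bytes_per_sec:.0f} B/s")  # ~341 B/s
```

At roughly 341 B/s for all ten sites combined, health reporting is indeed negligible next to tag updates, which scale with the data change rate rather than a fixed interval.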