ScadaBridge Cluster Topology Guide
Architecture Overview
ScadaBridge uses a hub-and-spoke architecture:
- Central Cluster: Two-node active/standby Akka.NET cluster for management, UI, and coordination.
- Site Clusters: Two-node active/standby Akka.NET clusters at each remote site for data collection and local processing.
Central Cluster Setup
Cluster Configuration
Both central nodes must be configured as seed nodes for each other:
Node A (central-01.example.com):
{
  "ScadaBridge": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@central-01.example.com:8081",
        "akka.tcp://scadabridge@central-02.example.com:8081"
      ]
    }
  }
}
Node B (central-02.example.com):
{
  "ScadaBridge": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-02.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@central-01.example.com:8081",
        "akka.tcp://scadabridge@central-02.example.com:8081"
      ]
    }
  }
}
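Because the two node configurations differ only in `NodeHostname`, a mismatch is easy to miss. The following is a minimal sketch of a pre-deployment check, not part of ScadaBridge itself: the function names are hypothetical, and the rule that both nodes must carry identical `SeedNodes` lists is an assumption drawn from the two examples above.

```python
def seed_node_problems(configs):
    """Return a list of human-readable problems found across node configs."""
    problems = []
    seed_lists = []
    for cfg in configs:
        node = cfg["ScadaBridge"]["Node"]
        seeds = cfg["ScadaBridge"]["Cluster"]["SeedNodes"]
        # Assumption: the "akka.tcp" scheme and the "scadabridge" system name
        # match the seed-node URIs shown in the examples above.
        own = f"akka.tcp://scadabridge@{node['NodeHostname']}:{node['RemotingPort']}"
        if own not in seeds:
            problems.append(f"{node['NodeHostname']}: own address missing from SeedNodes")
        seed_lists.append(seeds)
    # Assumption: every node in a cluster should list the same seeds in the same order.
    if any(s != seed_lists[0] for s in seed_lists):
        problems.append("SeedNodes lists differ between nodes")
    return problems


def central_config(hostname):
    """Build a central node config shaped like the examples above (helper for the demo)."""
    return {
        "ScadaBridge": {
            "Node": {"Role": "Central", "NodeHostname": hostname, "RemotingPort": 8081},
            "Cluster": {
                "SeedNodes": [
                    "akka.tcp://scadabridge@central-01.example.com:8081",
                    "akka.tcp://scadabridge@central-02.example.com:8081",
                ]
            },
        }
    }


node_a = central_config("central-01.example.com")
node_b = central_config("central-02.example.com")
print(seed_node_problems([node_a, node_b]))  # prints []
```

In practice the dictionaries would be loaded from each node's configuration file with `json.load` rather than built in code.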
Cluster Behavior
- Split-brain resolver: Keep-oldest with `down-if-alone = on`, 15-second stable-after.
- Minimum members: `min-nr-of-members = 1`, so a single node can form a cluster.
- Failure detection: 2-second heartbeat interval, 10-second threshold.
- Total failover time: ~25 seconds from node failure to singleton migration (10-second failure detection plus 15-second stable-after).
- Singleton handover: Uses CoordinatedShutdown for graceful migration.
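The ~25-second figure is consistent with adding the 10-second failure-detection threshold to the 15-second stable-after window. The sketch below only makes that arithmetic explicit so the effect of tuning either value is easy to see. It assumes the two phases run back to back and ignores any extra time the singleton needs to start on the surviving node. The function name is hypothetical.

```python
def estimated_failover_seconds(detection_threshold_s=10, stable_after_s=15):
    """Rough estimate: time to detect the failure plus the stable-after wait."""
    return detection_threshold_s + stable_after_s


print(estimated_failover_seconds())       # prints 25
print(estimated_failover_seconds(5, 15))  # prints 20
```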
Shared State
Both central nodes share state through:
- SQL Server: All configuration, deployment records, templates, and audit logs.
- JWT signing key: Same `JwtSigningKey` in both nodes' configuration.
- Data Protection keys: Shared key ring (stored in SQL Server or shared file path).
Load Balancer
A load balancer sits in front of both central nodes for the Blazor Server UI:
- Health check: `GET /health/ready`
- Protocol: HTTPS (TLS termination at LB or pass-through)
- Sticky sessions: Not required (JWT + shared Data Protection keys)
- If the active node fails, the LB routes to the standby (which becomes active after singleton migration).
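The routing rule above can be sketched as a readiness probe plus a "first healthy backend" selection. This is an illustrative model of what a load balancer does, not ScadaBridge code: the function names and the 2-second timeout are assumptions, and only the `/health/ready` path comes from the list above.

```python
from urllib.error import URLError
from urllib.request import urlopen


def is_ready(base_url, timeout=2.0):
    """Return True if GET <base_url>/health/ready answers with HTTP 200."""
    try:
        with urlopen(f"{base_url}/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, TLS failure, or a non-2xx response.
        return False


def pick_backend(base_urls, probe=is_ready):
    """Return the first backend whose readiness probe passes, else None."""
    for url in base_urls:
        if probe(url):
            return url
    return None


nodes = ["https://central-01.example.com", "https://central-02.example.com"]
# Simulate central-01 being down by injecting a fake probe instead of doing network I/O.
print(pick_backend(nodes, probe=lambda url: "central-02" in url))
# prints https://central-02.example.com
```

Passing the probe as a parameter keeps the selection logic testable without a running cluster.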
Site Cluster Setup
Cluster Configuration
Each site has its own two-node cluster:
Site Node A (site-01-a.example.com):
{
  "ScadaBridge": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@site-01-a.example.com:8081",
        "akka.tcp://scadabridge@site-01-b.example.com:8081"
      ]
    }
  }
}
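Only Site Node A is shown above. Assuming Site Node B follows the same pattern as the central nodes (identical seed list, only `NodeHostname` changed), both files can be generated from one helper. This is a sketch with a hypothetical function name, not a ScadaBridge tool.

```python
import json


def site_node_config(site_id, hostname, seed_hostnames, port=8081):
    """Build a site node config shaped like the Site Node A example above."""
    seeds = [f"akka.tcp://scadabridge@{h}:{port}" for h in seed_hostnames]
    return {
        "ScadaBridge": {
            "Node": {
                "Role": "Site",
                "NodeHostname": hostname,
                "SiteId": site_id,
                "RemotingPort": port,
            },
            "Cluster": {"SeedNodes": seeds},
        }
    }


hosts = ["site-01-a.example.com", "site-01-b.example.com"]
# Assumption: Node B differs from Node A only in NodeHostname.
node_b = site_node_config("plant-north", "site-01-b.example.com", hosts)
print(json.dumps(node_b, indent=2))
```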
Site Cluster Behavior
- Same split-brain resolver as central (keep-oldest).
- Singleton actors: Site Deployment Manager migrates on failover.
- Staggered instance startup: 50ms delay between Instance Actor creation to prevent reconnection storms.
- SQLite persistence: Both nodes access the same SQLite files (or each has its own copy with async replication).
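To get a feel for what the 50 ms stagger costs at startup, the sketch below computes the start offset of each Instance Actor. The function name is hypothetical, and the figure of 50 instances per site is only an illustration (500 machines spread evenly over 10 sites), not a documented value.

```python
def stagger_offsets_ms(instance_count, delay_ms=50):
    """Start offset in milliseconds for each instance when creations are spaced delay_ms apart."""
    return [i * delay_ms for i in range(instance_count)]


offsets = stagger_offsets_ms(50)
print(offsets[:3], offsets[-1])  # prints [0, 50, 100] 2450
```

Under these assumptions the last of 50 instances is created about 2.5 seconds after the first.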
Central-Site Communication
- Sites connect to central via Akka.NET remoting.
- The `Communication:CentralSeedNode` setting in the site config points to one of the central nodes.
- If that central node is down, the site's communication actor will retry until it connects to the active central node.
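The retry behaviour described above can be modelled as a simple loop. This is a sketch of the general pattern, not the communication actor's actual implementation: `connect` is a hypothetical callable that raises `ConnectionError` on failure, and the fixed retry delay is an assumption, since the source does not state the retry interval.

```python
import time


def connect_with_retry(connect, retry_delay_s=5.0, max_attempts=None, sleep=time.sleep):
    """Call connect() until it succeeds and return the number of attempts used.

    max_attempts=None retries indefinitely, matching "retry until it connects".
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            connect()
            return attempt
        except ConnectionError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(retry_delay_s)


# Demo: a fake central node that refuses the first two attempts.
calls = {"n": 0}


def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("central node unavailable")


delays = []  # record the requested delays instead of actually sleeping
attempts = connect_with_retry(flaky_connect, retry_delay_s=1.0, sleep=delays.append)
print(attempts, delays)  # prints 3 [1.0, 1.0]
```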
Scaling Guidelines
Target Scale
- 10 sites maximum per central cluster
- 500 machines (instances) total across all sites
- 75 tags per machine (37,500 total tag subscriptions)
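The totals above follow from simple multiplication. The per-site averages in this sketch assume machines are spread evenly across sites, which is an illustration only, since real sites will vary.

```python
MAX_SITES = 10
TOTAL_MACHINES = 500
TAGS_PER_MACHINE = 75

total_tags = TOTAL_MACHINES * TAGS_PER_MACHINE
# Assumption: an even spread of machines across all sites.
avg_machines_per_site = TOTAL_MACHINES / MAX_SITES
avg_tags_per_site = total_tags / MAX_SITES

print(total_tags, avg_machines_per_site, avg_tags_per_site)  # prints 37500 50.0 3750.0
```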
Resource Requirements
| Component | CPU | RAM | Disk | Notes |
|---|---|---|---|---|
| Central node | 4 cores | 8 GB | 50 GB | SQL Server is separate |
| Site node | 2 cores | 4 GB | 20 GB | SQLite databases grow with S&F |
| SQL Server | 4 cores | 16 GB | 100 GB | Shared across central cluster |
Network Bandwidth
- Health reports: ~1 KB per site per 30 seconds = negligible
- Tag value updates: Depends on data change rate; OPC UA subscription-based
- Deployment artifacts: One-time burst per deployment (varies by config size)
- Debug view streaming: ~500 bytes per attribute change per subscriber
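The two quantified items above can be turned into rough bytes-per-second estimates. This is a back-of-the-envelope sketch with hypothetical function names. It treats 1 KB as 1024 bytes, and the change rate and subscriber count in the example are made-up inputs, not measured values.

```python
def health_report_bytes_per_s(sites=10, report_bytes=1024, interval_s=30):
    """Aggregate health-report traffic arriving at central, in bytes per second."""
    return sites * report_bytes / interval_s


def debug_stream_bytes_per_s(changes_per_s, subscribers, bytes_per_change=500):
    """Debug view traffic: every attribute change is sent to every subscriber."""
    return changes_per_s * subscribers * bytes_per_change


print(round(health_report_bytes_per_s(), 1))  # prints 341.3
print(debug_stream_bytes_per_s(100, 2))       # prints 100000
```

Even at the 10-site maximum, health reports stay well under 1 KB/s in total, which supports the "negligible" label. Debug view streaming depends on the data change rate and the number of subscribers.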
Dual-Node Failure Recovery
Scenario: Both Nodes Down
- First node starts: Forms a single-node cluster (`min-nr-of-members = 1`).
- Central: Reconnects to SQL Server, reads deployment state, becomes operational.
- Site: Opens SQLite databases, rebuilds Instance Actors from persisted configs, resumes S&F retries.
- Second node starts: Joins the existing cluster as standby.
Automatic Recovery
No manual intervention is required after a dual-node failure. The first node to start will:
- Form the cluster
- Take over all singletons
- Begin processing immediately
- Accept the second node when it joins
