Joseph Doherty 43228185b4 docs: convert standard diagrams from draw.io PNGs to inline Mermaid
Gitea renders mermaid inline, so the flow/state/hierarchy/DAG diagrams
move to text-in-markdown: auto-layout (removes the manual overlap-prone
draw.io step), diffable source, no committed binaries, and a dark-text
theme so labels stay legible. Keep draw.io PNGs only for the two complex
bespoke diagrams (logical architecture, env2 topology) where pixel
control still wins. All 24 mermaid blocks validated by rendering.
2026-06-01 00:23:00 -04:00


# ScadaBridge Cluster Topology Guide
## Architecture Overview
ScadaBridge uses a hub-and-spoke architecture:
- **Central Cluster**: Two-node active/standby Akka.NET cluster for management, UI, and coordination.
- **Site Clusters**: Two-node active/standby Akka.NET clusters at each remote site for data collection and local processing.
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'textColor':'#111111','lineColor':'#555555','edgeLabelBackground':'#ffffff','fontSize':'15px'}}}%%
flowchart TD
USERS["Users<br/>(HTTPS / LB)"]
subgraph CENTRAL["Central Cluster"]
NA["Node A<br/>Active"]
NB["Node B<br/>Standby"]
NA <--> NB
end
USERS --> NA
CENTRAL --> SITE01
CENTRAL --> SITE02
CENTRAL --> SITE03
CENTRAL --> SITEN
subgraph SITE01["Site 01"]
S01A["A<br/>Active"]
S01B["B<br/>Standby"]
end
subgraph SITE02["Site 02"]
S02A["A<br/>Active"]
S02B["B<br/>Standby"]
end
subgraph SITE03["Site 03"]
S03A["A<br/>Active"]
S03B["B<br/>Standby"]
end
subgraph SITEN["Site N"]
SNA["A<br/>Active"]
SNB["B<br/>Standby"]
end
classDef start fill:#d5e8d4,stroke:#82b366,color:#111111;
classDef proc fill:#dae8fc,stroke:#6c8ebf,color:#111111;
classDef dec fill:#fff2cc,stroke:#d6b656,color:#111111;
classDef warn fill:#ffe6cc,stroke:#d79b00,color:#111111;
classDef muted fill:#f5f5f5,stroke:#999999,color:#666666;
class USERS dec
class CENTRAL proc
class NA,S01A,S02A,S03A,SNA start
class NB,S01B,S02B,S03B,SNB muted
class SITE01,SITE02,SITE03,SITEN warn
```
## Central Cluster Setup
### Cluster Configuration
Each central node lists both nodes, itself included, as seed nodes, so the two configurations carry the same `SeedNodes` array:
**Node A** (`central-01.example.com`):
```json
{
  "ScadaBridge": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@central-01.example.com:8081",
        "akka.tcp://scadabridge@central-02.example.com:8081"
      ]
    }
  }
}
```
**Node B** (`central-02.example.com`):
```json
{
  "ScadaBridge": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-02.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@central-01.example.com:8081",
        "akka.tcp://scadabridge@central-02.example.com:8081"
      ]
    }
  }
}
```
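
A quick consistency check can catch seed-node mismatches before the nodes start. The sketch below is a minimal illustration in Python and is not part of ScadaBridge: the helpers `node_address`, `check_seed_nodes`, and `make_config` are hypothetical, and the address format is assumed to follow the configs above (`akka.tcp://scadabridge@<host>:<port>`).

```python
SEEDS = [
    "akka.tcp://scadabridge@central-01.example.com:8081",
    "akka.tcp://scadabridge@central-02.example.com:8081",
]

def make_config(hostname: str, port: int = 8081) -> dict:
    """Build a config dict shaped like the JSON examples above."""
    return {
        "ScadaBridge": {
            "Node": {"Role": "Central", "NodeHostname": hostname, "RemotingPort": port},
            "Cluster": {"SeedNodes": list(SEEDS)},
        }
    }

def node_address(config: dict, system: str = "scadabridge") -> str:
    """Derive the address a node would advertise from its Node section."""
    node = config["ScadaBridge"]["Node"]
    return f"akka.tcp://{system}@{node['NodeHostname']}:{node['RemotingPort']}"

def check_seed_nodes(configs: list) -> list:
    """Return a list of problems; an empty list means the configs agree."""
    problems = []
    seed_lists = [c["ScadaBridge"]["Cluster"]["SeedNodes"] for c in configs]
    # All nodes are expected to carry an identical seed list.
    if any(seeds != seed_lists[0] for seeds in seed_lists[1:]):
        problems.append("seed node lists differ between nodes")
    # Each node is expected to appear in its own seed list.
    for config, seeds in zip(configs, seed_lists):
        address = node_address(config)
        if address not in seeds:
            problems.append(f"{address} is missing from its own seed list")
    return problems

node_a = make_config("central-01.example.com")
node_b = make_config("central-02.example.com")
print(check_seed_nodes([node_a, node_b]))
```

Running the check in a deployment pipeline, before either node is restarted, is one way to use it; the same function would apply to a site pair.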
### Cluster Behavior
- **Split-brain resolver**: Keep-oldest with `down-if-alone = on`, 15-second stable-after.
- **Minimum members**: `min-nr-of-members = 1` — a single node can form a cluster.
- **Failure detection**: 2-second heartbeat interval, 10-second threshold.
- **Total failover time**: ~25 seconds from node failure to singleton migration.
- **Singleton handover**: Uses CoordinatedShutdown for graceful migration.
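
The timing figures above can be related with simple arithmetic. This Python sketch assumes, without confirmation from the ScadaBridge source, that the ~25-second total is the failure-detection threshold plus the split-brain resolver's stable-after period; the function name is hypothetical.

```python
HEARTBEAT_INTERVAL_S = 2    # heartbeat interval from the list above
FAILURE_THRESHOLD_S = 10    # failure-detection threshold from the list above
STABLE_AFTER_S = 15         # split-brain resolver stable-after from the list above

def failover_estimate_s(threshold_s: int = FAILURE_THRESHOLD_S,
                        stable_after_s: int = STABLE_AFTER_S) -> int:
    """Assumed model: detection time plus the resolver's stable-after window."""
    return threshold_s + stable_after_s

# Heartbeat intervals that elapse before the threshold is reached.
missed_heartbeats = FAILURE_THRESHOLD_S // HEARTBEAT_INTERVAL_S

print(failover_estimate_s(), missed_heartbeats)
```

Under this model, lowering either setting shortens failover but leaves less margin for transient network glitches.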
### Shared State
Both central nodes share state through:
- **SQL Server**: All configuration, deployment records, templates, and audit logs.
- **JWT signing key**: Same `JwtSigningKey` in both nodes' configuration.
- **Data Protection keys**: Shared key ring (stored in SQL Server or shared file path).
### Load Balancer
A load balancer sits in front of both central nodes for the Blazor Server UI:
- Health check: `GET /health/ready`
- Protocol: HTTPS (TLS termination at LB or pass-through)
- Sticky sessions: Not required (JWT + shared Data Protection keys)
- If the active node fails, the LB routes to the standby (which becomes active after singleton migration).
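
The routing behaviour described above can be sketched as a readiness probe plus a selection step. This is an illustrative Python sketch, not the load balancer's actual logic: `pick_backend` and `http_ready` are hypothetical names, and treating HTTP 200 from `/health/ready` as "ready" is an assumption.

```python
import urllib.request
from typing import Callable, Optional, Sequence

def http_ready(node: str, timeout_s: float = 2.0) -> bool:
    """Probe GET /health/ready; assumes HTTP 200 means the node is ready."""
    url = f"https://{node}/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return False

def pick_backend(nodes: Sequence[str],
                 is_ready: Callable[[str], bool] = http_ready) -> Optional[str]:
    """Return the first node whose readiness probe passes, or None."""
    for node in nodes:
        if is_ready(node):
            return node
    return None

# Example with a stubbed probe: central-01 is down, central-02 is ready.
ready_nodes = {"central-02.example.com"}
chosen = pick_backend(
    ["central-01.example.com", "central-02.example.com"],
    is_ready=lambda node: node in ready_nodes,
)
print(chosen)
```

Passing the probe as a parameter keeps the selection logic testable without a network.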
## Site Cluster Setup
### Cluster Configuration
Each site has its own two-node cluster:
**Site Node A** (`site-01-a.example.com`):
```json
{
  "ScadaBridge": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@site-01-a.example.com:8081",
        "akka.tcp://scadabridge@site-01-b.example.com:8081"
      ]
    }
  }
}
```
### Site Cluster Behavior
- Same split-brain resolver as central (keep-oldest).
- Singleton actors: Site Deployment Manager migrates on failover.
- Staggered instance startup: A 50 ms delay between successive Instance Actor creations prevents reconnection storms.
- SQLite persistence: The two nodes either access the same SQLite files or each keep their own copy with asynchronous replication.
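
To see what the 50 ms stagger means for startup time, the following Python sketch computes the start offset for each instance. It only illustrates the arithmetic; `startup_offsets_ms` and the instance names are hypothetical, and the real scheduling happens inside the site's actors.

```python
STAGGER_DELAY_MS = 50  # delay between Instance Actor creations, from the list above

def startup_offsets_ms(instance_ids: list, delay_ms: int = STAGGER_DELAY_MS) -> dict:
    """Map each instance to the offset (in ms) at which it would be created."""
    return {instance_id: index * delay_ms
            for index, instance_id in enumerate(instance_ids)}

# Hypothetical site with 50 machines.
instances = [f"machine-{n:02d}" for n in range(1, 51)]
offsets = startup_offsets_ms(instances)
print(offsets["machine-01"], offsets["machine-50"])
```

With 50 instances the last actor is created about 2.45 seconds after the first, so the stagger spreads reconnections out without noticeably delaying recovery.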
### Central-Site Communication
- Sites connect to central via Akka.NET remoting.
- The `Communication:CentralSeedNode` setting in the site config points to one of the central nodes.
- If that central node is down, the site's communication actor will retry until it connects to the active central node.
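
The retry behaviour can be pictured as a simple loop. This Python sketch is an assumption-laden illustration, not the communication actor's implementation: `connect_with_retry`, the 5-second delay, and the `max_attempts` convention are all hypothetical.

```python
import time
from typing import Callable

def connect_with_retry(
    seed_node: str,
    try_connect: Callable[[str], bool],
    retry_delay_s: float = 5.0,      # assumed value, not taken from ScadaBridge
    max_attempts: int = 0,           # 0 means retry indefinitely
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry until try_connect succeeds; return the number of attempts used."""
    attempt = 0
    while max_attempts == 0 or attempt < max_attempts:
        attempt += 1
        if try_connect(seed_node):
            return attempt
        sleep(retry_delay_s)
    raise ConnectionError(f"gave up on {seed_node} after {attempt} attempts")

# Example: a stubbed connection that succeeds on the third attempt.
calls = []

def flaky_connect(address: str) -> bool:
    calls.append(address)
    return len(calls) >= 3

attempts = connect_with_retry(
    "akka.tcp://scadabridge@central-01.example.com:8081",
    flaky_connect,
    sleep=lambda seconds: None,  # skip real waiting in the example
)
print(attempts)
```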
## Scaling Guidelines
### Target Scale
- 10 sites maximum per central cluster
- 500 machines (instances) total across all sites
- 75 tags per machine (37,500 total tag subscriptions)
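
The target figures combine as follows. This small Python sketch reproduces the arithmetic and adds a hypothetical `within_target` helper for checking a planned deployment against the limits; the site names are examples only.

```python
MAX_SITES = 10          # sites per central cluster
MAX_MACHINES = 500      # machines (instances) across all sites
TAGS_PER_MACHINE = 75   # tags per machine

def total_tag_subscriptions(machines: int,
                            tags_per_machine: int = TAGS_PER_MACHINE) -> int:
    """Total tag subscriptions for a given machine count."""
    return machines * tags_per_machine

def within_target(machines_per_site: dict) -> bool:
    """Check a planned layout (site name -> machine count) against the targets."""
    return (len(machines_per_site) <= MAX_SITES
            and sum(machines_per_site.values()) <= MAX_MACHINES)

print(total_tag_subscriptions(MAX_MACHINES))
print(within_target({"plant-north": 60, "plant-south": 40}))
```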
### Resource Requirements
| Component | CPU | RAM | Disk | Notes |
|-----------|-----|-----|------|-------|
| Central node | 4 cores | 8 GB | 50 GB | SQL Server is separate |
| Site node | 2 cores | 4 GB | 20 GB | SQLite databases grow with store-and-forward (S&F) |
| SQL Server | 4 cores | 16 GB | 100 GB | Shared across central cluster |
### Network Bandwidth
- Health reports: ~1 KB per site every 30 seconds, which is negligible
- Tag value updates: Depends on data change rate; OPC UA subscription-based
- Deployment artifacts: One-time burst per deployment (varies by config size)
- Debug view streaming: ~500 bytes per attribute change per subscriber
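
A rough estimate of the two quantified items can be computed directly. The Python sketch below assumes 1 KB means 1,000 bytes; the function names and the example change rate and subscriber count are hypothetical.

```python
HEALTH_REPORT_BYTES = 1_000      # ~1 KB per report (assuming 1 KB = 1,000 bytes)
HEALTH_REPORT_PERIOD_S = 30      # one report per site every 30 seconds
DEBUG_BYTES_PER_CHANGE = 500     # ~500 bytes per attribute change per subscriber

def health_report_bytes_per_s(sites: int) -> float:
    """Aggregate health-report traffic arriving at the central cluster."""
    return sites * HEALTH_REPORT_BYTES / HEALTH_REPORT_PERIOD_S

def debug_stream_bytes_per_s(changes_per_s: float, subscribers: int) -> float:
    """Debug view traffic for a given change rate and subscriber count."""
    return DEBUG_BYTES_PER_CHANGE * changes_per_s * subscribers

print(round(health_report_bytes_per_s(10)))   # ten sites
print(debug_stream_bytes_per_s(20, 2))        # 20 changes/s, 2 subscribers
```

Even at the ten-site maximum, health reports amount to a few hundred bytes per second, which is consistent with the "negligible" label above.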
## Dual-Node Failure Recovery
### Scenario: Both Nodes Down
1. **First node starts**: Forms a single-node cluster (`min-nr-of-members = 1`).
2. **Central**: Reconnects to SQL Server, reads deployment state, becomes operational.
3. **Site**: Opens SQLite databases, rebuilds Instance Actors from persisted configs, resumes S&F retries.
4. **Second node starts**: Joins the existing cluster as standby.
### Automatic Recovery
No manual intervention is required after a dual-node failure. The first node to start will:
- Form the cluster
- Take over all singletons
- Begin processing immediately
- Accept the second node when it joins
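
The recovery sequence can be modelled as a tiny state machine. This Python sketch is a toy model of the behaviour described above, not Akka.NET cluster code; the `Cluster` class and its method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Cluster:
    """Toy model: the first node to start hosts the singletons."""
    members: List[str] = field(default_factory=list)
    singleton_host: Optional[str] = None

    def start_node(self, name: str) -> str:
        """Add a node and return the role it takes on."""
        self.members.append(name)
        if self.singleton_host is None:
            # An empty cluster: this node forms it and takes the singletons.
            self.singleton_host = name
            return "active"
        # A cluster already exists: the node joins as standby.
        return "standby"

cluster = Cluster()
first_role = cluster.start_node("node-b")   # whichever node comes up first
second_role = cluster.start_node("node-a")
print(first_role, second_role, cluster.singleton_host)
```

The model shows that the active role follows startup order rather than node name, which is why no manual step is needed to choose a primary.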