# ScadaLink Cluster Topology Guide

## Architecture Overview

ScadaLink uses a hub-and-spoke architecture:

- **Central Cluster**: Two-node active/standby Akka.NET cluster for management, UI, and coordination.
- **Site Clusters**: Two-node active/standby Akka.NET clusters at each remote site for data collection and local processing.

```
                    ┌──────────────────────────┐
                    │      Central Cluster     │
                    │  ┌──────┐    ┌──────┐    │
Users ────────────► │  │Node A│◄──►│Node B│    │
(HTTPS/LB)          │  │Active│    │Stby  │    │
                    │  └──┬───┘    └──┬───┘    │
                    └─────┼───────────┼────────┘
                          │           │
              ┌───────────┼───────────┼───────────┐
              │           │           │           │
        ┌─────▼─────┐  ┌──▼──────┐ ┌──▼──────┐ ┌──▼──────┐
        │  Site 01  │  │ Site 02 │ │ Site 03 │ │ Site N  │
        │ ┌──┐ ┌──┐ │  │ ┌──┐┌──┐│ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│
        │ │A │ │B │ │  │ │A ││B ││ │ │A ││B ││ │ │A ││B ││
        │ └──┘ └──┘ │  │ └──┘└──┘│ │ └──┘└──┘│ │ └──┘└──┘│
        └───────────┘  └─────────┘ └─────────┘ └─────────┘
```

## Central Cluster Setup

### Cluster Configuration

Both central nodes must be configured as seed nodes for each other:

**Node A** (`central-01.example.com`):

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}
```

**Node B** (`central-02.example.com`):

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-02.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}
```

### Cluster Behavior

- **Split-brain resolver**: Keep-oldest with `down-if-alone = on`, 15-second stable-after.
- **Minimum members**: `min-nr-of-members = 1`, so a single node can form a cluster.
- **Failure detection**: 2-second heartbeat interval, 10-second threshold.
- **Total failover time**: ~25 seconds from node failure to singleton migration.
- **Singleton handover**: Uses CoordinatedShutdown for graceful migration.

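The behavior above maps to standard Akka.NET cluster settings. A sketch of the corresponding HOCON follows; the key names are stock Akka.NET, but the values are restated from the bullets above, not taken from a shipped configuration:

```hocon
akka.cluster {
  # A single node may form the cluster on its own (dual-node recovery).
  min-nr-of-members = 1

  split-brain-resolver {
    active-strategy = keep-oldest
    stable-after = 15s
    keep-oldest {
      down-if-alone = on
    }
  }

  failure-detector {
    heartbeat-interval = 2s
    # "10-second threshold" is read here as the acceptable heartbeat pause;
    # adjust if it instead refers to the phi-accrual threshold value.
    acceptable-heartbeat-pause = 10s
  }
}
```
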
### Shared State

Both central nodes share state through:

- **SQL Server**: All configuration, deployment records, templates, and audit logs.
- **JWT signing key**: Same `JwtSigningKey` in both nodes' configuration.
- **Data Protection keys**: Shared key ring (stored in SQL Server or a shared file path).

### Load Balancer

A load balancer sits in front of both central nodes for the Blazor Server UI:

- Health check: `GET /health/ready`
- Protocol: HTTPS (TLS termination at the LB or pass-through)
- Sticky sessions: Not required (JWT auth plus shared Data Protection keys)
- If the active node fails, the LB routes to the standby, which becomes active after singleton migration.

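As one concrete option (HAProxy is assumed here purely for illustration; no specific load balancer is mandated), a TLS pass-through front end for the UI might look like:

```
# Hypothetical HAProxy front end for the central UI (TLS pass-through).
frontend central_ui
    bind *:443
    mode tcp
    default_backend central_nodes

backend central_nodes
    mode tcp
    balance roundrobin
    # Active health check against the readiness endpoint over TLS.
    option httpchk GET /health/ready
    server central-01 central-01.example.com:443 check check-ssl verify none
    server central-02 central-02.example.com:443 check check-ssl verify none
```

With pass-through, certificates live on the central nodes themselves; terminating TLS at the LB instead would move them there.
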
## Site Cluster Setup

### Cluster Configuration

Each site has its own two-node cluster:

**Site Node A** (`site-01-a.example.com`):

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-a.example.com:8081",
        "akka.tcp://scadalink@site-01-b.example.com:8081"
      ]
    }
  }
}
```

### Site Cluster Behavior

- Same split-brain resolver as central (keep-oldest).
- Singleton actors: the Site Deployment Manager migrates on failover.
- Staggered instance startup: a 50 ms delay between Instance Actor creations prevents reconnection storms.
- SQLite persistence: both nodes access the same SQLite files (or each keeps its own copy with async replication).

### Central-Site Communication

- Sites connect to central via Akka.NET remoting.
- The `Communication:CentralSeedNode` setting in the site config points to one of the central nodes.
- If that central node is down, the site's communication actor retries until it reaches the active central node.

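For example, a site's settings fragment might carry the following; only the `Communication:CentralSeedNode` key is stated in the text, and the surrounding `ScadaLink` nesting is assumed from the other examples:

```json
{
  "ScadaLink": {
    "Communication": {
      "CentralSeedNode": "akka.tcp://scadalink@central-01.example.com:8081"
    }
  }
}
```
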
## Scaling Guidelines

### Target Scale

- 10 sites maximum per central cluster
- 500 machines (instances) total across all sites
- 75 tags per machine (37,500 total tag subscriptions)

### Resource Requirements

| Component    | CPU     | RAM   | Disk   | Notes                          |
|--------------|---------|-------|--------|--------------------------------|
| Central node | 4 cores | 8 GB  | 50 GB  | SQL Server is separate         |
| Site node    | 2 cores | 4 GB  | 20 GB  | SQLite databases grow with S&F |
| SQL Server   | 4 cores | 16 GB | 100 GB | Shared across central cluster  |

### Network Bandwidth

- Health reports: ~1 KB per site per 30 seconds (negligible)
- Tag value updates: depends on data change rate; OPC UA subscription-based
- Deployment artifacts: one-time burst per deployment (varies by config size)
- Debug view streaming: ~500 bytes per attribute change per subscriber

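The figures above can be sanity-checked with a back-of-envelope calculation. A sketch: the message sizes come from the bullets, while the tag change rate is an assumed illustrative value, not a measured one:

```python
# Back-of-envelope bandwidth estimate at target scale
# (10 sites, 500 machines, 75 tags each; sizes from the bullets above).

SITES = 10
MACHINES = 500
TAGS_PER_MACHINE = 75

# Health reports: ~1 KB per site every 30 seconds.
health_bps = SITES * 1024 / 30

# Tag updates: ASSUMED 1% of tags change per second, ~100 bytes per update.
assumed_change_rate = 0.01
assumed_update_bytes = 100
tag_bps = MACHINES * TAGS_PER_MACHINE * assumed_change_rate * assumed_update_bytes

total_subscriptions = MACHINES * TAGS_PER_MACHINE
print(f"tag subscriptions: {total_subscriptions}")
print(f"health reports:    {health_bps:.0f} B/s")
print(f"tag updates:       {tag_bps / 1024:.1f} KiB/s (at assumed rate)")
```

Even at the assumed change rate, steady-state traffic stays in the tens of KiB/s; deployment bursts and debug streaming dominate only transiently.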
## Dual-Node Failure Recovery

### Scenario: Both Nodes Down

1. **First node starts**: Forms a single-node cluster (`min-nr-of-members = 1`).
2. **Central**: Reconnects to SQL Server, reads deployment state, becomes operational.
3. **Site**: Opens SQLite databases, rebuilds Instance Actors from persisted configs, resumes S&F retries.
4. **Second node starts**: Joins the existing cluster as standby.

### Automatic Recovery

No manual intervention is required after a dual-node failure. The first node to start will:

- Form the cluster
- Take over all singletons
- Begin processing immediately
- Accept the second node when it joins