Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs

- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)
92 new tests, all passing. Zero warnings.
Joseph Doherty
2026-03-16 22:12:31 -04:00
parent 3b2320bd35
commit b659978764
68 changed files with 6253 additions and 44 deletions

@@ -0,0 +1,195 @@
# ScadaLink Installation Guide
## Prerequisites
- Windows Server 2019 or later
- .NET 10.0 Runtime
- SQL Server 2019+ (Central nodes only)
- Network connectivity between all cluster nodes (TCP ports 8081-8082)
- LDAP/Active Directory server accessible from Central nodes
- SMTP server accessible from all nodes (for Notification Service)
## Single Binary Deployment
ScadaLink ships as a single executable (`ScadaLink.Host.exe`) that runs in either Central or Site role based on configuration.
### Windows Service Installation
```powershell
# Central node (sc.exe expects a space after each "option=")
sc.exe create "ScadaLink-Central" binPath= "C:\ScadaLink\ScadaLink.Host.exe" start= auto
sc.exe description "ScadaLink-Central" "ScadaLink SCADA Central Hub"
# Site node
sc.exe create "ScadaLink-Site" binPath= "C:\ScadaLink\ScadaLink.Host.exe" start= auto
sc.exe description "ScadaLink-Site" "ScadaLink SCADA Site Agent"
```
### Directory Structure
```
C:\ScadaLink\
  ScadaLink.Host.exe
  appsettings.json
  appsettings.Production.json
  data\                      # Site: SQLite databases
    site.db                  # Deployed configs, static overrides
    store-and-forward.db     # S&F message buffer
  logs\                      # Rolling log files
    scadalink-20260316.log
```
## Configuration Templates
### Central Node — `appsettings.json`
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    },
    "Database": {
      "ConfigurationDb": "Server=sqlserver.example.com;Database=ScadaLink;User Id=scadalink_svc;Password=<CHANGE_ME>;Encrypt=true;TrustServerCertificate=false",
      "MachineDataDb": "Server=sqlserver.example.com;Database=ScadaLink_MachineData;User Id=scadalink_svc;Password=<CHANGE_ME>;Encrypt=true;TrustServerCertificate=false"
    },
    "Security": {
      "LdapServer": "ldap.example.com",
      "LdapPort": 636,
      "LdapUseTls": true,
      "AllowInsecureLdap": false,
      "LdapSearchBase": "dc=example,dc=com",
      "JwtSigningKey": "<GENERATE_A_32_PLUS_CHAR_RANDOM_STRING>",
      "JwtExpiryMinutes": 15,
      "IdleTimeoutMinutes": 30
    },
    "HealthMonitoring": {
      "ReportInterval": "00:00:30",
      "OfflineTimeout": "00:01:00"
    },
    "Logging": {
      "MinimumLevel": "Information"
    }
  },
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Warning",
        "Akka": "Warning"
      }
    }
  }
}
```
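The `JwtSigningKey` placeholder above has to be replaced with a random string of at least 32 characters. One way to produce such a value is sketched below in Python. The helper name `generate_signing_key` is hypothetical and is not part of ScadaLink; any cryptographically secure generator would serve equally well.

```python
import secrets


def generate_signing_key(num_bytes: int = 48) -> str:
    """Return a URL-safe random string built from num_bytes of OS entropy."""
    # token_urlsafe base64-encodes the bytes, so 48 bytes become 64 characters,
    # comfortably above the 32-character minimum used in the template.
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(generate_signing_key())
```

Generate the key once and place the same value on both central nodes, since the guide requires a shared signing key.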
### Site Node — `appsettings.json`
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-node-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-node-a.example.com:8081",
        "akka.tcp://scadalink@site-01-node-b.example.com:8081"
      ]
    },
    "Database": {
      "SiteDbPath": "C:\\ScadaLink\\data\\site.db"
    },
    "DataConnection": {
      "ReconnectInterval": "00:00:05",
      "TagResolutionRetryInterval": "00:00:30"
    },
    "StoreAndForward": {
      "SqliteDbPath": "C:\\ScadaLink\\data\\store-and-forward.db",
      "DefaultRetryInterval": "00:00:30",
      "DefaultMaxRetries": 50,
      "ReplicationEnabled": true
    },
    "SiteRuntime": {
      "ScriptTimeoutSeconds": 30,
      "StaggeredStartupDelayMs": 50
    },
    "SiteEventLog": {
      "RetentionDays": 30,
      "MaxStorageMB": 1024,
      "PurgeIntervalHours": 24
    },
    "Communication": {
      "CentralSeedNode": "akka.tcp://scadalink@central-01.example.com:8081"
    },
    "HealthMonitoring": {
      "ReportInterval": "00:00:30"
    },
    "Logging": {
      "MinimumLevel": "Information"
    }
  }
}
```
## Database Setup (Central Only)
### SQL Server
1. Create the configuration database:
```sql
CREATE DATABASE ScadaLink;
CREATE LOGIN scadalink_svc WITH PASSWORD = '<STRONG_PASSWORD>';
GO
-- GO (the sqlcmd/SSMS batch separator) ensures the database exists before USE is compiled.
USE ScadaLink;
CREATE USER scadalink_svc FOR LOGIN scadalink_svc;
ALTER ROLE db_owner ADD MEMBER scadalink_svc;
```
2. Create the machine data database:
```sql
CREATE DATABASE ScadaLink_MachineData;
GO
-- GO (the sqlcmd/SSMS batch separator) ensures the database exists before USE is compiled.
USE ScadaLink_MachineData;
CREATE USER scadalink_svc FOR LOGIN scadalink_svc;
ALTER ROLE db_owner ADD MEMBER scadalink_svc;
```
3. Apply EF Core migrations (development):
   - Migrations auto-apply on startup in the Development environment.
4. Apply EF Core migrations (production):
   - Generate the SQL script: `dotnet ef migrations script --project src/ScadaLink.ConfigurationDatabase`
   - Review the script, then execute it against the production database.
## Network Requirements
| Source | Destination | Port | Protocol | Purpose |
|--------|------------|------|----------|---------|
| Central A | Central B | 8081 | TCP | Akka.NET remoting |
| Site A | Site B | 8081 | TCP | Akka.NET remoting |
| Site nodes | Central nodes | 8081 | TCP | Central-site communication |
| Central nodes | LDAP server | 636 | TCP/TLS | Authentication |
| All nodes | SMTP server | 587 | TCP/TLS | Notification delivery |
| Central nodes | SQL Server | 1433 | TCP | Configuration database |
| Users | Central nodes | 443 | HTTPS | Blazor Server UI |
## Firewall Rules
Ensure bidirectional TCP connectivity between all Akka.NET cluster peers. The remoting port (default 8081) must be open in both directions.
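The firewall rule above can be spot-checked before the service is started. The following is a minimal sketch in Python, assuming plain TCP reachability is a sufficient test. The function name `can_connect` is hypothetical, and the probe does not speak the Akka.NET protocol; it only confirms that the port accepts connections.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the hostname and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `can_connect("central-02.example.com", 8081)` run from `central-01` checks one direction only. Repeat it from the peer to confirm the reverse path, and note that the target port only answers while something is listening on it.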
## Post-Installation Verification
1. Start the service: `sc.exe start ScadaLink-Central`
2. Check the log file: `type C:\ScadaLink\logs\scadalink-*.log`
3. Verify the readiness endpoint: `curl.exe http://localhost:5000/health/ready` (in Windows PowerShell, plain `curl` is an alias for `Invoke-WebRequest`)
4. For Central: verify the UI is accessible at `https://central-01.example.com/`
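Because a freshly started service may take a moment to become ready, step 3 is often easier to script as a polling loop. The sketch below is one way to do that in Python. The function name `wait_until_ready` and the retry defaults are assumptions, and the URL is the one used in the step above.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, attempts: int = 10, delay_s: float = 3.0) -> bool:
    """Poll url until it answers HTTP 200 or the attempts are exhausted."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Not listening yet, or the endpoint returned an error status.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

A call such as `wait_until_ready("http://localhost:5000/health/ready")` returns `True` once the endpoint reports ready and `False` if it never does within the retry budget.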

@@ -0,0 +1,97 @@
# ScadaLink Production Deployment Checklist
## Pre-Deployment
### Configuration Verification
- [ ] `ScadaLink:Node:Role` is set correctly (`Central` or `Site`)
- [ ] `ScadaLink:Node:NodeHostname` matches the machine's resolvable hostname
- [ ] `ScadaLink:Cluster:SeedNodes` contains exactly 2 entries for the cluster pair
- [ ] Seed node addresses use fully qualified hostnames (not `localhost`)
- [ ] Remoting port (default 8081) is open bidirectionally between cluster peers
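Several of the configuration checks above can be automated against a parsed `appsettings.json`. The following Python sketch illustrates the idea. The function `check_node_config` and the seed-address pattern are assumptions written for this checklist; they are not part of ScadaLink.

```python
import re

# Expected shape: akka.tcp://<system>@<host>:<port>
SEED_RE = re.compile(r"^akka\.tcp://(?P<system>[^@]+)@(?P<host>[^:]+):(?P<port>\d+)$")


def check_node_config(cfg: dict) -> list:
    """Return a list of problems found in the Node and Cluster sections."""
    problems = []
    scada = cfg.get("ScadaLink", {})
    node = scada.get("Node", {})
    seeds = scada.get("Cluster", {}).get("SeedNodes", [])

    if node.get("Role") not in ("Central", "Site"):
        problems.append("Node:Role must be 'Central' or 'Site'")
    if len(seeds) != 2:
        problems.append(f"Cluster:SeedNodes has {len(seeds)} entries, expected 2")
    for seed in seeds:
        match = SEED_RE.match(seed)
        if match is None:
            problems.append(f"malformed seed node address: {seed}")
            continue
        host = match["host"]
        # Treat loopback names and dotless hostnames as not fully qualified.
        if host in ("localhost", "127.0.0.1") or "." not in host:
            problems.append(f"seed node host is not fully qualified: {host}")
    return problems
```

Load the file with `json.load` and pass the resulting dictionary in. An empty list means none of these particular checks failed; it does not verify that the hostnames resolve or that the port is open.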
### Central Node
- [ ] `ScadaLink:Database:ConfigurationDb` connection string is valid and tested
- [ ] `ScadaLink:Database:MachineDataDb` connection string is valid and tested
- [ ] SQL Server login has `db_owner` role on both databases
- [ ] EF Core migrations have been applied (SQL script reviewed and executed)
- [ ] `ScadaLink:Security:JwtSigningKey` is at least 32 characters, randomly generated
- [ ] **Both central nodes use the same JwtSigningKey** (required for JWT failover)
- [ ] `ScadaLink:Security:LdapServer` points to the production LDAP/AD server
- [ ] `ScadaLink:Security:LdapUseTls` is `true` (LDAPS required in production)
- [ ] `ScadaLink:Security:AllowInsecureLdap` is `false`
- [ ] LDAP search base DN is correct for the organization
- [ ] LDAP group-to-role mappings are configured
- [ ] Load balancer is configured in front of central UI (sticky sessions not required)
- [ ] ASP.NET Data Protection keys are shared between central nodes (for cookie failover)
- [ ] HTTPS certificate is installed and configured
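The security-related items in the central node list lend themselves to a scripted pre-flight check. A minimal Python sketch follows, assuming the `Security` section has been loaded from each node's configuration into a dictionary. The helpers `check_security_config` and `keys_match` are hypothetical names.

```python
import hmac


def check_security_config(security: dict) -> list:
    """Return a list of problems found in one node's Security section."""
    problems = []
    key = security.get("JwtSigningKey", "")
    if len(key) < 32:
        problems.append("JwtSigningKey is shorter than 32 characters")
    if key.startswith("<") and key.endswith(">"):
        problems.append("JwtSigningKey still holds a template placeholder")
    if security.get("LdapUseTls") is not True:
        problems.append("LdapUseTls must be true in production")
    if security.get("AllowInsecureLdap") is not False:
        problems.append("AllowInsecureLdap must be false in production")
    return problems


def keys_match(node_a: dict, node_b: dict) -> bool:
    """Check that both central nodes carry the same signing key."""
    a = node_a.get("JwtSigningKey", "").encode()
    b = node_b.get("JwtSigningKey", "").encode()
    # compare_digest avoids leaking where the keys differ via timing.
    return hmac.compare_digest(a, b)
```

Run `check_security_config` against each central node and `keys_match` across the pair. Length alone says nothing about randomness, so the check complements key generation rather than replacing it.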
### Site Node
- [ ] `ScadaLink:Node:SiteId` is set and unique across all sites
- [ ] `ScadaLink:Database:SiteDbPath` points to a writable directory
- [ ] SQLite data directory has sufficient disk space (the S&F buffer has no maximum size)
- [ ] `ScadaLink:Communication:CentralSeedNode` points to a reachable central node
- [ ] OPC UA server endpoints are accessible from site nodes
- [ ] OPC UA security certificates are configured if required
### Security
- [ ] No secrets in `appsettings.json` committed to source control
- [ ] Secrets managed via environment variables or a secrets manager
- [ ] Windows Service account has minimum necessary permissions
- [ ] Log directory permissions restrict access to service account and administrators
- [ ] SMTP credentials use OAuth2 Client Credentials (preferred) or secure Basic Auth
- [ ] API keys for Inbound API are generated with sufficient entropy (32+ chars)
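The first two items can be supported by a small scan of the configuration file. The sketch below is written in Python and rests on two assumptions: that the host loads environment variables through the standard .NET configuration provider, where `__` replaces `:` in key names, and that secrets can be recognized by the hint words listed in `SECRET_HINTS`. Both function names are hypothetical.

```python
# Substrings that suggest a value is a secret (an assumption; extend as needed).
SECRET_HINTS = ("password", "secret", "signingkey")


def env_var_name(config_key: str) -> str:
    """Map a colon-separated config key to the .NET environment variable form."""
    return config_key.replace(":", "__")


def find_inline_secrets(cfg: dict, prefix: str = "") -> list:
    """Return config keys whose name or string value looks like a secret."""
    found = []
    for name, value in cfg.items():
        key = f"{prefix}:{name}" if prefix else name
        if isinstance(value, dict):
            found.extend(find_inline_secrets(value, key))
        elif isinstance(value, str) and value:
            text = f"{name}={value}".lower()
            if any(hint in text for hint in SECRET_HINTS):
                found.append(key)
    return found
```

Each key returned by `find_inline_secrets` is a candidate for removal from the file, with `env_var_name` giving the variable name to set instead, for example `ScadaLink__Security__JwtSigningKey`.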
### Network
- [ ] DNS resolution works between all cluster nodes
- [ ] Firewall rules permit Akka.NET remoting (TCP 8081)
- [ ] Firewall rules permit LDAP (TCP 636 for LDAPS)
- [ ] Firewall rules permit SMTP (TCP 587 for TLS)
- [ ] Firewall rules permit SQL Server (TCP 1433) from central nodes only
- [ ] Load balancer health check configured against `/health/ready`
## Deployment
### Order of Operations
1. Deploy central node A (forms single-node cluster)
2. Verify central node A is healthy: `GET /health/ready` returns 200
3. Deploy central node B (joins existing cluster)
4. Verify both central nodes show as cluster members in logs
5. Deploy site nodes (order does not matter)
6. Verify sites register with central via health dashboard
### Rollback Plan
- [ ] Previous version binaries are retained for rollback
- [ ] Database backup taken before migration
- [ ] Rollback SQL script is available (if migration requires it)
- [ ] Service can be stopped and previous binary restored
## Post-Deployment
### Smoke Tests
- [ ] Central UI is accessible and login works
- [ ] Health dashboard shows all expected sites as online
- [ ] Template engine can create/save/delete a test template
- [ ] Deployment pipeline can deploy a test instance to a site
- [ ] Inbound API responds to test requests with valid API key
- [ ] Notification Service can send a test email
### Monitoring Setup
- [ ] Log aggregation is configured (Serilog file sink + centralized collector)
- [ ] Health dashboard bookmarked for operations team
- [ ] Alerting configured for site offline threshold violations
- [ ] Disk space monitoring on site nodes (SQLite growth)
### Documentation
- [ ] Cluster topology documented (hostnames, ports, roles)
- [ ] Runbook updated with environment-specific details
- [ ] On-call team briefed on failover procedures

@@ -0,0 +1,172 @@
# ScadaLink Cluster Topology Guide
## Architecture Overview
ScadaLink uses a hub-and-spoke architecture:
- **Central Cluster**: Two-node active/standby Akka.NET cluster for management, UI, and coordination.
- **Site Clusters**: Two-node active/standby Akka.NET clusters at each remote site for data collection and local processing.
```
                  ┌──────────────────────────┐
                  │     Central Cluster      │
                  │  ┌──────┐    ┌──────┐    │
Users ──────────► │  │Node A│◄──►│Node B│    │
(HTTPS/LB)        │  │Active│    │Stby  │    │
                  │  └──┬───┘    └──┬───┘    │
                  └─────┼───────────┼────────┘
                        │           │
            ┌───────────┼───────────┼───────────┐
            │           │           │           │
      ┌─────▼─────┐  ┌──▼──────┐ ┌──▼──────┐ ┌──▼──────┐
      │  Site 01  │  │ Site 02 │ │ Site 03 │ │ Site N  │
      │ ┌──┐ ┌──┐ │  │ ┌──┐┌──┐│ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│
      │ │A │ │B │ │  │ │A ││B ││ │ │A ││B ││ │ │A ││B ││
      │ └──┘ └──┘ │  │ └──┘└──┘│ │ └──┘└──┘│ │ └──┘└──┘│
      └───────────┘  └─────────┘ └─────────┘ └─────────┘
```
## Central Cluster Setup
### Cluster Configuration
Both central nodes must be configured as seed nodes for each other:
**Node A** (`central-01.example.com`):
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}
```
**Node B** (`central-02.example.com`):
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-02.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}
```
### Cluster Behavior
- **Split-brain resolver**: Keep-oldest with `down-if-alone = on`, 15-second stable-after.
- **Minimum members**: `min-nr-of-members = 1` — a single node can form a cluster.
- **Failure detection**: 2-second heartbeat interval, 10-second threshold.
- **Total failover time**: ~25 seconds from node failure to singleton migration.
- **Singleton handover**: Uses CoordinatedShutdown for graceful migration.
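The ~25-second failover figure is consistent with adding the 10-second failure-detection threshold to the 15-second stable-after window. That decomposition is an inference from the numbers listed above and is not stated by the source. Under that assumption, a small Python helper (hypothetical name `failover_estimate_s`) shows how the estimate shifts if either setting is tuned.

```python
def failover_estimate_s(detection_s: float = 10.0, stable_after_s: float = 15.0) -> float:
    """Estimate seconds from node failure to singleton migration.

    Assumes the two phases run back to back: the failure detector marks the
    node unreachable, then the split-brain resolver waits for stable-after
    before downing it. Singleton startup time is not included.
    """
    return detection_s + stable_after_s


if __name__ == "__main__":
    print(failover_estimate_s())           # defaults from the list above
    print(failover_estimate_s(10.0, 7.0))  # a shorter stable-after window
```

With the documented defaults the function returns 25.0, matching the stated total. Shortening either value reduces failover time at the cost of a higher risk of reacting to a transient network glitch.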
### Shared State
Both central nodes share state through:
- **SQL Server**: All configuration, deployment records, templates, and audit logs.
- **JWT signing key**: Same `JwtSigningKey` in both nodes' configuration.
- **Data Protection keys**: Shared key ring (stored in SQL Server or shared file path).
### Load Balancer
A load balancer sits in front of both central nodes for the Blazor Server UI:
- Health check: `GET /health/ready`
- Protocol: HTTPS (TLS termination at LB or pass-through)
- Sticky sessions: Not required (JWT + shared Data Protection keys)
- If the active node fails, the LB routes to the standby (which becomes active after singleton migration).
## Site Cluster Setup
### Cluster Configuration
Each site has its own two-node cluster:
**Site Node A** (`site-01-a.example.com`):
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-a.example.com:8081",
        "akka.tcp://scadalink@site-01-b.example.com:8081"
      ]
    }
  }
}
```
### Site Cluster Behavior
- Same split-brain resolver as central (keep-oldest).
- Singleton actors: Site Deployment Manager migrates on failover.
- Staggered instance startup: 50ms delay between Instance Actor creation to prevent reconnection storms.
- SQLite persistence: Both nodes access the same SQLite files (or each has its own copy with async replication).
### Central-Site Communication
- Sites connect to central via Akka.NET remoting.
- The `Communication:CentralSeedNode` setting in the site config points to one of the central nodes.
- If that central node is down, the site's communication actor will retry until it connects to the active central node.
## Scaling Guidelines
### Target Scale
- 10 sites maximum per central cluster
- 500 machines (instances) total across all sites
- 75 tags per machine (37,500 total tag subscriptions)
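The limits above can be turned into a quick capacity check for a planned rollout. The Python sketch below uses the three target-scale figures as constants. The function names and the input shape (a mapping of site ID to machine count) are assumptions made for illustration.

```python
MAX_SITES = 10          # sites per central cluster
MAX_MACHINES = 500      # instances across all sites
TAGS_PER_MACHINE = 75   # tags assumed per machine


def total_subscriptions(machines: int, tags_per_machine: int = TAGS_PER_MACHINE) -> int:
    """Return the total number of tag subscriptions for a machine count."""
    return machines * tags_per_machine


def check_target_scale(machines_per_site: dict) -> list:
    """Return a list of target-scale limits that the plan exceeds."""
    problems = []
    if len(machines_per_site) > MAX_SITES:
        problems.append(f"{len(machines_per_site)} sites exceeds {MAX_SITES}")
    total = sum(machines_per_site.values())
    if total > MAX_MACHINES:
        problems.append(f"{total} machines exceeds {MAX_MACHINES}")
    return problems
```

`total_subscriptions(500)` reproduces the 37,500 figure, and `check_target_scale({"plant-north": 60, "plant-south": 40})` returns an empty list because both limits are respected.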
### Resource Requirements
| Component | CPU | RAM | Disk | Notes |
|-----------|-----|-----|------|-------|
| Central node | 4 cores | 8 GB | 50 GB | SQL Server is separate |
| Site node | 2 cores | 4 GB | 20 GB | SQLite databases grow with S&F |
| SQL Server | 4 cores | 16 GB | 100 GB | Shared across central cluster |
### Network Bandwidth
- Health reports: ~1 KB per site per 30 seconds = negligible
- Tag value updates: Depends on data change rate; OPC UA subscription-based
- Deployment artifacts: One-time burst per deployment (varies by config size)
- Debug view streaming: ~500 bytes per attribute change per subscriber
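The per-item figures above translate into rough sustained rates. The Python sketch below works through the arithmetic, assuming 1 KB means 1024 bytes and that every subscriber receives every attribute change. The function names and the example change rate are hypothetical.

```python
def health_report_bps(sites: int = 10, report_bytes: int = 1024, interval_s: float = 30.0) -> float:
    """Bytes per second of health reports arriving at central."""
    return sites * report_bytes / interval_s


def debug_stream_bps(changes_per_s: float, subscribers: int, bytes_per_change: int = 500) -> float:
    """Bytes per second of debug view traffic for a given change rate."""
    return changes_per_s * subscribers * bytes_per_change


if __name__ == "__main__":
    # 10 sites * 1024 bytes / 30 s is about 341 bytes per second.
    print(round(health_report_bps(), 1))
    # 100 changes per second to 2 subscribers is 100,000 bytes per second.
    print(debug_stream_bps(100, 2))
```

At the maximum of 10 sites, health reporting stays well under 1 KB/s, which supports the "negligible" label. Debug view streaming scales with both change rate and subscriber count, so it is the item to watch on constrained links.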
## Dual-Node Failure Recovery
### Scenario: Both Nodes Down
1. **First node starts**: Forms a single-node cluster (`min-nr-of-members = 1`).
2. **Central**: Reconnects to SQL Server, reads deployment state, becomes operational.
3. **Site**: Opens SQLite databases, rebuilds Instance Actors from persisted configs, resumes S&F retries.
4. **Second node starts**: Joins the existing cluster as standby.
### Automatic Recovery
No manual intervention required for dual-node failure. The first node to start will:
- Form the cluster
- Take over all singletons
- Begin processing immediately
- Accept the second node when it joins