Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs
- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)

92 new tests, all passing. Zero warnings.
docs/deployment/installation-guide.md
@@ -0,0 +1,195 @@
# ScadaLink Installation Guide

## Prerequisites

- Windows Server 2019 or later
- .NET 10.0 Runtime
- SQL Server 2019+ (Central nodes only)
- Network connectivity between all cluster nodes (TCP ports 8081-8082)
- LDAP/Active Directory server accessible from Central nodes
- SMTP server accessible from all nodes (for Notification Service)

## Single Binary Deployment

ScadaLink ships as a single executable (`ScadaLink.Host.exe`) that runs in either Central or Site role based on configuration.

### Windows Service Installation

```powershell
# sc.exe expects a space between each "option=" and its value (binPath= "...", start= auto).

# Central Node
sc.exe create "ScadaLink-Central" binPath= "C:\ScadaLink\ScadaLink.Host.exe" start= auto
sc.exe description "ScadaLink-Central" "ScadaLink SCADA Central Hub"

# Site Node
sc.exe create "ScadaLink-Site" binPath= "C:\ScadaLink\ScadaLink.Host.exe" start= auto
sc.exe description "ScadaLink-Site" "ScadaLink SCADA Site Agent"
```

### Directory Structure

```
C:\ScadaLink\
  ScadaLink.Host.exe
  appsettings.json
  appsettings.Production.json
  data\                        # Site: SQLite databases
    site.db                    # Deployed configs, static overrides
    store-and-forward.db       # S&F message buffer
  logs\                        # Rolling log files
    scadalink-20260316.log
```

## Configuration Templates

### Central Node: `appsettings.json`

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    },
    "Database": {
      "ConfigurationDb": "Server=sqlserver.example.com;Database=ScadaLink;User Id=scadalink_svc;Password=<CHANGE_ME>;Encrypt=true;TrustServerCertificate=false",
      "MachineDataDb": "Server=sqlserver.example.com;Database=ScadaLink_MachineData;User Id=scadalink_svc;Password=<CHANGE_ME>;Encrypt=true;TrustServerCertificate=false"
    },
    "Security": {
      "LdapServer": "ldap.example.com",
      "LdapPort": 636,
      "LdapUseTls": true,
      "AllowInsecureLdap": false,
      "LdapSearchBase": "dc=example,dc=com",
      "JwtSigningKey": "<GENERATE_A_32_PLUS_CHAR_RANDOM_STRING>",
      "JwtExpiryMinutes": 15,
      "IdleTimeoutMinutes": 30
    },
    "HealthMonitoring": {
      "ReportInterval": "00:00:30",
      "OfflineTimeout": "00:01:00"
    },
    "Logging": {
      "MinimumLevel": "Information"
    }
  },
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Warning",
        "Akka": "Warning"
      }
    }
  }
}
```

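
The `JwtSigningKey` placeholder above asks for a random string of at least 32 characters. One possible way to produce such a value is sketched below with Python's standard `secrets` module. The function name `generate_signing_key` and the 48-byte default are illustrative choices, not part of ScadaLink; any cryptographically secure generator would serve equally well.

```python
import secrets


def generate_signing_key(num_bytes: int = 48) -> str:
    """Return a URL-safe random string usable as a signing-key value.

    token_urlsafe encodes num_bytes random bytes as unpadded Base64 text,
    so 48 bytes yield 64 characters and 24 bytes yield exactly 32.
    (Hypothetical helper; the name and default are assumptions.)
    """
    if num_bytes < 24:
        # Fewer than 24 bytes would produce fewer than 32 characters.
        raise ValueError("num_bytes must be at least 24 to yield 32+ characters")
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(generate_signing_key())
```

Generate the value once and use it on both central nodes, since the checklist requires them to share the same key.
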
### Site Node: `appsettings.json`

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-node-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-node-a.example.com:8081",
        "akka.tcp://scadalink@site-01-node-b.example.com:8081"
      ]
    },
    "Database": {
      "SiteDbPath": "C:\\ScadaLink\\data\\site.db"
    },
    "DataConnection": {
      "ReconnectInterval": "00:00:05",
      "TagResolutionRetryInterval": "00:00:30"
    },
    "StoreAndForward": {
      "SqliteDbPath": "C:\\ScadaLink\\data\\store-and-forward.db",
      "DefaultRetryInterval": "00:00:30",
      "DefaultMaxRetries": 50,
      "ReplicationEnabled": true
    },
    "SiteRuntime": {
      "ScriptTimeoutSeconds": 30,
      "StaggeredStartupDelayMs": 50
    },
    "SiteEventLog": {
      "RetentionDays": 30,
      "MaxStorageMB": 1024,
      "PurgeIntervalHours": 24
    },
    "Communication": {
      "CentralSeedNode": "akka.tcp://scadalink@central-01.example.com:8081"
    },
    "HealthMonitoring": {
      "ReportInterval": "00:00:30"
    },
    "Logging": {
      "MinimumLevel": "Information"
    }
  }
}
```

## Database Setup (Central Only)

### SQL Server

1. Create the configuration database:
   ```sql
   CREATE DATABASE ScadaLink;
   CREATE LOGIN scadalink_svc WITH PASSWORD = '<STRONG_PASSWORD>';
   USE ScadaLink;
   CREATE USER scadalink_svc FOR LOGIN scadalink_svc;
   ALTER ROLE db_owner ADD MEMBER scadalink_svc;
   ```

2. Create the machine data database:
   ```sql
   CREATE DATABASE ScadaLink_MachineData;
   USE ScadaLink_MachineData;
   CREATE USER scadalink_svc FOR LOGIN scadalink_svc;
   ALTER ROLE db_owner ADD MEMBER scadalink_svc;
   ```

3. Apply EF Core migrations (development):
   - Migrations are applied automatically on startup in the Development environment.

4. Apply EF Core migrations (production):
   - Generate the SQL script: `dotnet ef migrations script --project src/ScadaLink.ConfigurationDatabase`
   - Review and execute the SQL script against the production database.

## Network Requirements

| Source | Destination | Port | Protocol | Purpose |
|--------|-------------|------|----------|---------|
| Central A | Central B | 8081 | TCP | Akka.NET remoting |
| Site A | Site B | 8081 | TCP | Akka.NET remoting |
| Site nodes | Central nodes | 8081 | TCP | Central-site communication |
| Central nodes | LDAP server | 636 | TCP/TLS | Authentication |
| All nodes | SMTP server | 587 | TCP/TLS | Notification delivery |
| Central nodes | SQL Server | 1433 | TCP | Configuration database |
| Users | Central nodes | 443 | HTTPS | Blazor Server UI |

## Firewall Rules

Ensure bidirectional TCP connectivity between all Akka.NET cluster peers. The remoting port (default 8081) must be open in both directions.

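
The port requirements above can be spot-checked from each node before the services are started. The following is a minimal sketch using only Python's standard `socket` module. The helper name `tcp_port_open` is hypothetical, and the hosts in the usage comment are the example hostnames from this guide rather than real endpoints.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only proves that this machine can open an
    outbound connection, so run it from both peers to confirm that the
    remoting port is reachable in both directions.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (hostnames are the placeholders used in this guide):
#   tcp_port_open("central-02.example.com", 8081)   # Akka.NET remoting
#   tcp_port_open("ldap.example.com", 636)          # LDAPS
#   tcp_port_open("sqlserver.example.com", 1433)    # SQL Server
```

A successful connect confirms routing and firewall rules only; it does not verify that the listener is the expected service or that TLS negotiation would succeed.
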
## Post-Installation Verification

1. Start the service: `sc.exe start ScadaLink-Central`
2. Check the log file: `type C:\ScadaLink\logs\scadalink-*.log`
3. Verify the readiness endpoint: `curl http://localhost:5000/health/ready`
4. For Central: verify the UI is accessible at `https://central-01.example.com/`
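
A freshly started service may need some time before the readiness endpoint answers, so a single `curl` call can fail spuriously. The sketch below polls the endpoint until it returns HTTP 200 or a retry budget is exhausted. It uses only the Python standard library; the function names, the retry count, and the delay are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_until_ready(url, attempts=30, delay=2.0,
                     fetch=fetch_status, sleep=time.sleep) -> bool:
    """Poll url until it returns 200; give up after `attempts` tries."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # Wait before the next try, but not after the last one.
    return False


# Example usage (URL taken from step 3 above):
#   wait_until_ready("http://localhost:5000/health/ready")
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a running node or real delays.
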
docs/deployment/production-checklist.md
@@ -0,0 +1,97 @@
# ScadaLink Production Deployment Checklist

## Pre-Deployment

### Configuration Verification

- [ ] `ScadaLink:Node:Role` is set correctly (`Central` or `Site`)
- [ ] `ScadaLink:Node:NodeHostname` matches the machine's resolvable hostname
- [ ] `ScadaLink:Cluster:SeedNodes` contains exactly 2 entries for the cluster pair
- [ ] Seed node addresses use fully qualified hostnames (not `localhost`)
- [ ] Remoting port (default 8081) is open bidirectionally between cluster peers

### Central Node

- [ ] `ScadaLink:Database:ConfigurationDb` connection string is valid and tested
- [ ] `ScadaLink:Database:MachineDataDb` connection string is valid and tested
- [ ] SQL Server login has `db_owner` role on both databases
- [ ] EF Core migrations have been applied (SQL script reviewed and executed)
- [ ] `ScadaLink:Security:JwtSigningKey` is at least 32 characters, randomly generated
- [ ] **Both central nodes use the same JwtSigningKey** (required for JWT failover)
- [ ] `ScadaLink:Security:LdapServer` points to the production LDAP/AD server
- [ ] `ScadaLink:Security:LdapUseTls` is `true` (LDAPS required in production)
- [ ] `ScadaLink:Security:AllowInsecureLdap` is `false`
- [ ] LDAP search base DN is correct for the organization
- [ ] LDAP group-to-role mappings are configured
- [ ] Load balancer is configured in front of central UI (sticky sessions not required)
- [ ] ASP.NET Data Protection keys are shared between central nodes (for cookie failover)
- [ ] HTTPS certificate is installed and configured

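
Several of the configuration items above are mechanical and could be checked by a script before deployment. The sketch below is one possible approach, not a ScadaLink tool: it takes an already parsed `appsettings.json` as a Python dict and reports violations of the role, seed-node, and security rules from this checklist. The function name `check_config` and the seed-address pattern are assumptions based on the examples in the installation guide.

```python
import re

# Matches addresses such as akka.tcp://scadalink@central-01.example.com:8081
# (pattern inferred from the example configs; an assumption, not a spec).
SEED_NODE_RE = re.compile(r"^akka\.tcp://[^@/]+@(?P<host>[^:/]+):(?P<port>\d+)$")


def check_config(config: dict) -> list:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    scadalink = config.get("ScadaLink", {})
    role = scadalink.get("Node", {}).get("Role")

    if role not in ("Central", "Site"):
        problems.append("Node:Role must be 'Central' or 'Site'")

    seeds = scadalink.get("Cluster", {}).get("SeedNodes", [])
    if len(seeds) != 2:
        problems.append("Cluster:SeedNodes must contain exactly 2 entries")
    for seed in seeds:
        match = SEED_NODE_RE.match(seed)
        if match is None:
            problems.append(f"Malformed seed node address: {seed}")
        elif match.group("host") == "localhost" or "." not in match.group("host"):
            problems.append(f"Seed node is not a fully qualified hostname: {seed}")

    if role == "Central":
        security = scadalink.get("Security", {})
        if len(security.get("JwtSigningKey", "")) < 32:
            problems.append("Security:JwtSigningKey must be at least 32 characters")
        # Missing values are reported too, since the checklist wants them explicit.
        if security.get("LdapUseTls") is not True:
            problems.append("Security:LdapUseTls must be true")
        if security.get("AllowInsecureLdap") is not False:
            problems.append("Security:AllowInsecureLdap must be false")

    return problems
```

To use it, load the file with `json.load` and pass the result to `check_config`. Items that need a live environment, such as testing connection strings or comparing the key across both central nodes, are outside what this sketch covers.
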
### Site Node

- [ ] `ScadaLink:Node:SiteId` is set and unique across all sites
- [ ] `ScadaLink:Database:SiteDbPath` points to a writable directory
- [ ] SQLite data directory has sufficient disk space (no max buffer size for S&F)
- [ ] `ScadaLink:Communication:CentralSeedNode` points to a reachable central node
- [ ] OPC UA server endpoints are accessible from site nodes
- [ ] OPC UA security certificates are configured if required

### Security

- [ ] No secrets in `appsettings.json` committed to source control
- [ ] Secrets managed via environment variables or a secrets manager
- [ ] Windows Service account has minimum necessary permissions
- [ ] Log directory permissions restrict access to service account and administrators
- [ ] SMTP credentials use OAuth2 Client Credentials (preferred) or secure Basic Auth
- [ ] API keys for Inbound API are generated with sufficient entropy (32+ chars)

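
For the environment-variable option, the .NET configuration system maps a double underscore in a variable name to the `:` separator used in keys such as `ScadaLink:Security:JwtSigningKey`. The sketch below derives the variable names for a few keys from this checklist. It assumes that `ScadaLink.Host.exe` loads the standard environment-variable configuration provider without a custom prefix, which this document does not confirm; the helper name `env_var_name` is hypothetical.

```python
def env_var_name(config_key: str) -> str:
    """Translate a colon-separated .NET configuration key into the
    double-underscore form read by the environment-variable provider.

    Assumption: the host uses the default provider with no prefix.
    """
    return config_key.replace(":", "__")


# Keys taken from this checklist that typically hold secrets.
SECRET_KEYS = [
    "ScadaLink:Security:JwtSigningKey",
    "ScadaLink:Database:ConfigurationDb",
    "ScadaLink:Database:MachineDataDb",
]

if __name__ == "__main__":
    for key in SECRET_KEYS:
        print(f"{key} -> {env_var_name(key)}")
```

With that mapping, the placeholder values can remain in `appsettings.json` while the real secrets are supplied to the Windows Service through its environment.
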
### Network

- [ ] DNS resolution works between all cluster nodes
- [ ] Firewall rules permit Akka.NET remoting (TCP 8081)
- [ ] Firewall rules permit LDAP (TCP 636 for LDAPS)
- [ ] Firewall rules permit SMTP (TCP 587 for TLS)
- [ ] Firewall rules permit SQL Server (TCP 1433) from central nodes only
- [ ] Load balancer health check configured against `/health/ready`

## Deployment

### Order of Operations

1. Deploy central node A (forms single-node cluster)
2. Verify central node A is healthy: `GET /health/ready` returns 200
3. Deploy central node B (joins existing cluster)
4. Verify both central nodes show as cluster members in logs
5. Deploy site nodes (order does not matter)
6. Verify sites register with central via health dashboard

### Rollback Plan

- [ ] Previous version binaries are retained for rollback
- [ ] Database backup taken before migration
- [ ] Rollback SQL script is available (if migration requires it)
- [ ] Service can be stopped and previous binary restored

## Post-Deployment

### Smoke Tests

- [ ] Central UI is accessible and login works
- [ ] Health dashboard shows all expected sites as online
- [ ] Template engine can create/save/delete a test template
- [ ] Deployment pipeline can deploy a test instance to a site
- [ ] Inbound API responds to test requests with valid API key
- [ ] Notification Service can send a test email

### Monitoring Setup

- [ ] Log aggregation is configured (Serilog file sink + centralized collector)
- [ ] Health dashboard bookmarked for operations team
- [ ] Alerting configured for site offline threshold violations
- [ ] Disk space monitoring on site nodes (SQLite growth)

### Documentation

- [ ] Cluster topology documented (hostnames, ports, roles)
- [ ] Runbook updated with environment-specific details
- [ ] On-call team briefed on failover procedures
docs/deployment/topology-guide.md
@@ -0,0 +1,172 @@
# ScadaLink Cluster Topology Guide

## Architecture Overview

ScadaLink uses a hub-and-spoke architecture:
- **Central Cluster**: Two-node active/standby Akka.NET cluster for management, UI, and coordination.
- **Site Clusters**: Two-node active/standby Akka.NET clusters at each remote site for data collection and local processing.

```
                    ┌──────────────────────────┐
                    │     Central Cluster      │
                    │  ┌──────┐    ┌──────┐    │
  Users ──────────► │  │Node A│◄──►│Node B│    │
  (HTTPS/LB)        │  │Active│    │Stby  │    │
                    │  └──┬───┘    └──┬───┘    │
                    └─────┼───────────┼────────┘
                          │           │
              ┌───────────┼───────────┼───────────┐
              │           │           │           │
        ┌─────▼─────┐  ┌──▼──────┐ ┌──▼──────┐ ┌──▼──────┐
        │ Site 01   │  │ Site 02 │ │ Site 03 │ │ Site N  │
        │ ┌──┐ ┌──┐ │  │ ┌──┐┌──┐│ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│
        │ │A │ │B │ │  │ │A ││B ││ │ │A ││B ││ │ │A ││B ││
        │ └──┘ └──┘ │  │ └──┘└──┘│ │ └──┘└──┘│ │ └──┘└──┘│
        └───────────┘  └─────────┘ └─────────┘ └─────────┘
```

## Central Cluster Setup

### Cluster Configuration

Both central nodes must be configured as seed nodes for each other:

**Node A** (`central-01.example.com`):
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}
```

**Node B** (`central-02.example.com`):
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-02.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}
```

### Cluster Behavior

- **Split-brain resolver**: Keep-oldest with `down-if-alone = on`, 15-second stable-after.
- **Minimum members**: `min-nr-of-members = 1`, so a single node can form a cluster.
- **Failure detection**: 2-second heartbeat interval, 10-second threshold.
- **Total failover time**: ~25 seconds from node failure to singleton migration.
- **Singleton handover**: Uses CoordinatedShutdown for graceful migration.

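
The ~25-second failover figure is consistent with adding the 10-second failure-detection threshold to the 15-second split-brain resolver stable-after period. The sketch below makes that arithmetic explicit. Treating the total as a simple sum is an assumption on our part; the real figure also depends on timing details not listed here, and the constant and function names are illustrative.

```python
# Values taken from the Cluster Behavior list above.
FAILURE_DETECTION_THRESHOLD_S = 10   # time before a silent node is considered unreachable
SBR_STABLE_AFTER_S = 15              # time the resolver waits before acting


def estimated_failover_seconds(detection_s: int = FAILURE_DETECTION_THRESHOLD_S,
                               stable_after_s: int = SBR_STABLE_AFTER_S) -> int:
    """Rough estimate (assumed to be a plain sum) of the time from node
    failure until singletons can start migrating to the surviving node."""
    return detection_s + stable_after_s


if __name__ == "__main__":
    print(f"Estimated failover time: ~{estimated_failover_seconds()} s")
```

Changing either setting moves the estimate by the same amount, which helps when weighing faster failover against the risk of reacting to short network interruptions.
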
### Shared State

Both central nodes share state through:
- **SQL Server**: All configuration, deployment records, templates, and audit logs.
- **JWT signing key**: Same `JwtSigningKey` in both nodes' configuration.
- **Data Protection keys**: Shared key ring (stored in SQL Server or shared file path).

### Load Balancer

A load balancer sits in front of both central nodes for the Blazor Server UI:
- Health check: `GET /health/ready`
- Protocol: HTTPS (TLS termination at LB or pass-through)
- Sticky sessions: Not required (JWT + shared Data Protection keys)
- If the active node fails, the LB routes to the standby (which becomes active after singleton migration).

## Site Cluster Setup

### Cluster Configuration

Each site has its own two-node cluster:

**Site Node A** (`site-01-a.example.com`):
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-a.example.com:8081",
        "akka.tcp://scadalink@site-01-b.example.com:8081"
      ]
    }
  }
}
```

### Site Cluster Behavior

- Same split-brain resolver as central (keep-oldest).
- Singleton actors: Site Deployment Manager migrates on failover.
- Staggered instance startup: 50 ms delay between Instance Actor creations to prevent reconnection storms.
- SQLite persistence: Both nodes access the same SQLite files (or each has its own copy with async replication).

### Central-Site Communication

- Sites connect to central via Akka.NET remoting.
- The `Communication:CentralSeedNode` setting in the site config points to one of the central nodes.
- If that central node is down, the site's communication actor will retry until it connects to the active central node.

## Scaling Guidelines

### Target Scale

- 10 sites maximum per central cluster
- 500 machines (instances) total across all sites
- 75 tags per machine (37,500 total tag subscriptions)

### Resource Requirements

| Component | CPU | RAM | Disk | Notes |
|-----------|-----|-----|------|-------|
| Central node | 4 cores | 8 GB | 50 GB | SQL Server is separate |
| Site node | 2 cores | 4 GB | 20 GB | SQLite databases grow with S&F |
| SQL Server | 4 cores | 16 GB | 100 GB | Shared across central cluster |

### Network Bandwidth

- Health reports: ~1 KB per site per 30 seconds = negligible
- Tag value updates: Depends on data change rate; OPC UA subscription-based
- Deployment artifacts: One-time burst per deployment (varies by config size)
- Debug view streaming: ~500 bytes per attribute change per subscriber

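
The scale and bandwidth figures above can be reproduced with a few lines of arithmetic. The sketch below recomputes the total tag subscriptions and estimates the aggregate health-report traffic at the 10-site maximum. It assumes 1 KB means 1024 bytes and that every site reports on the same 30-second interval; the function names are illustrative.

```python
def total_tag_subscriptions(machines: int, tags_per_machine: int) -> int:
    """Total tag subscriptions across all sites."""
    return machines * tags_per_machine


def health_report_bytes_per_second(sites: int,
                                   report_bytes: int = 1024,
                                   interval_s: int = 30) -> float:
    """Average inbound health-report traffic at central, in bytes per second.

    Assumes each site sends one report of report_bytes every interval_s.
    """
    return sites * report_bytes / interval_s


if __name__ == "__main__":
    print(total_tag_subscriptions(500, 75))                # 37500
    print(round(health_report_bytes_per_second(10), 1))    # about 341.3 bytes/s
```

At roughly 341 bytes per second for all ten sites combined, the "negligible" label for health reports holds even on constrained site links.
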
## Dual-Node Failure Recovery

### Scenario: Both Nodes Down

1. **First node starts**: Forms a single-node cluster (`min-nr-of-members = 1`).
2. **Central**: Reconnects to SQL Server, reads deployment state, becomes operational.
3. **Site**: Opens SQLite databases, rebuilds Instance Actors from persisted configs, resumes S&F retries.
4. **Second node starts**: Joins the existing cluster as standby.

### Automatic Recovery

No manual intervention is required after a dual-node failure. The first node to start will:
- Form the cluster
- Take over all singletons
- Begin processing immediately
- Accept the second node when it joins