Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs
- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)

92 new tests, all passing. Zero warnings.
docs/deployment/installation-guide.md (new file, 195 lines)
# ScadaLink Installation Guide

## Prerequisites

- Windows Server 2019 or later
- .NET 10.0 Runtime
- SQL Server 2019+ (Central nodes only)
- Network connectivity between all cluster nodes (TCP ports 8081-8082)
- LDAP/Active Directory server accessible from Central nodes
- SMTP server accessible from all nodes (for Notification Service)

## Single Binary Deployment

ScadaLink ships as a single executable (`ScadaLink.Host.exe`) that runs in either Central or Site role based on configuration.

### Windows Service Installation

```powershell
# Central Node (note: sc.exe requires a space after each "option=")
sc.exe create "ScadaLink-Central" binPath= "C:\ScadaLink\ScadaLink.Host.exe" start= auto
sc.exe description "ScadaLink-Central" "ScadaLink SCADA Central Hub"

# Site Node
sc.exe create "ScadaLink-Site" binPath= "C:\ScadaLink\ScadaLink.Host.exe" start= auto
sc.exe description "ScadaLink-Site" "ScadaLink SCADA Site Agent"
```

### Directory Structure

```
C:\ScadaLink\
    ScadaLink.Host.exe
    appsettings.json
    appsettings.Production.json
    data\                       # Site: SQLite databases
        site.db                 # Deployed configs, static overrides
        store-and-forward.db    # S&F message buffer
    logs\                       # Rolling log files
        scadalink-20260316.log
```

## Configuration Templates

### Central Node — `appsettings.json`

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    },
    "Database": {
      "ConfigurationDb": "Server=sqlserver.example.com;Database=ScadaLink;User Id=scadalink_svc;Password=<CHANGE_ME>;Encrypt=true;TrustServerCertificate=false",
      "MachineDataDb": "Server=sqlserver.example.com;Database=ScadaLink_MachineData;User Id=scadalink_svc;Password=<CHANGE_ME>;Encrypt=true;TrustServerCertificate=false"
    },
    "Security": {
      "LdapServer": "ldap.example.com",
      "LdapPort": 636,
      "LdapUseTls": true,
      "AllowInsecureLdap": false,
      "LdapSearchBase": "dc=example,dc=com",
      "JwtSigningKey": "<GENERATE_A_32_PLUS_CHAR_RANDOM_STRING>",
      "JwtExpiryMinutes": 15,
      "IdleTimeoutMinutes": 30
    },
    "HealthMonitoring": {
      "ReportInterval": "00:00:30",
      "OfflineTimeout": "00:01:00"
    },
    "Logging": {
      "MinimumLevel": "Information"
    }
  },
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Warning",
        "Akka": "Warning"
      }
    }
  }
}
```
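One way to produce a value for the `JwtSigningKey` placeholder (a minimal sketch, not the only option — any cryptographically random string of 32+ characters works):

```python
import secrets

# 48 random bytes, URL-safe base64 encoded -> a 64-character string,
# comfortably above the 32-character minimum
key = secrets.token_urlsafe(48)
print(len(key) >= 32)  # True
```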
### Site Node — `appsettings.json`

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-node-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-node-a.example.com:8081",
        "akka.tcp://scadalink@site-01-node-b.example.com:8081"
      ]
    },
    "Database": {
      "SiteDbPath": "C:\\ScadaLink\\data\\site.db"
    },
    "DataConnection": {
      "ReconnectInterval": "00:00:05",
      "TagResolutionRetryInterval": "00:00:30"
    },
    "StoreAndForward": {
      "SqliteDbPath": "C:\\ScadaLink\\data\\store-and-forward.db",
      "DefaultRetryInterval": "00:00:30",
      "DefaultMaxRetries": 50,
      "ReplicationEnabled": true
    },
    "SiteRuntime": {
      "ScriptTimeoutSeconds": 30,
      "StaggeredStartupDelayMs": 50
    },
    "SiteEventLog": {
      "RetentionDays": 30,
      "MaxStorageMB": 1024,
      "PurgeIntervalHours": 24
    },
    "Communication": {
      "CentralSeedNode": "akka.tcp://scadalink@central-01.example.com:8081"
    },
    "HealthMonitoring": {
      "ReportInterval": "00:00:30"
    },
    "Logging": {
      "MinimumLevel": "Information"
    }
  }
}
```

## Database Setup (Central Only)

### SQL Server

1. Create the configuration database:
```sql
CREATE DATABASE ScadaLink;
CREATE LOGIN scadalink_svc WITH PASSWORD = '<STRONG_PASSWORD>';
USE ScadaLink;
CREATE USER scadalink_svc FOR LOGIN scadalink_svc;
ALTER ROLE db_owner ADD MEMBER scadalink_svc;
```

2. Create the machine data database:
```sql
CREATE DATABASE ScadaLink_MachineData;
USE ScadaLink_MachineData;
CREATE USER scadalink_svc FOR LOGIN scadalink_svc;
ALTER ROLE db_owner ADD MEMBER scadalink_svc;
```

3. Apply EF Core migrations (development):
   - Migrations auto-apply on startup in the Development environment.

4. Apply EF Core migrations (production):
   - Generate a SQL script: `dotnet ef migrations script --project src/ScadaLink.ConfigurationDatabase`
   - Review and execute the SQL script against the production database.

## Network Requirements

| Source | Destination | Port | Protocol | Purpose |
|--------|-------------|------|----------|---------|
| Central A | Central B | 8081 | TCP | Akka.NET remoting |
| Site A | Site B | 8081 | TCP | Akka.NET remoting |
| Site nodes | Central nodes | 8081 | TCP | Central-site communication |
| Central nodes | LDAP server | 636 | TCP/TLS | Authentication |
| All nodes | SMTP server | 587 | TCP/TLS | Notification delivery |
| Central nodes | SQL Server | 1433 | TCP | Configuration database |
| Users | Central nodes | 443 | HTTPS | Blazor Server UI |

## Firewall Rules

Ensure bidirectional TCP connectivity between all Akka.NET cluster peers. The remoting port (default 8081) must be open in both directions.
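A quick way to check reachability from any node (a minimal sketch; on Windows, `Test-NetConnection <host> -Port 8081` gives the same answer):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hostname is illustrative):
# port_open("central-01.example.com", 8081)
```

Run it in both directions between each cluster pair, since a one-way rule is enough to break Akka.NET remoting.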
## Post-Installation Verification

1. Start the service: `sc.exe start ScadaLink-Central`
2. Check the log file: `type C:\ScadaLink\logs\scadalink-*.log`
3. Verify the readiness endpoint: `curl http://localhost:5000/health/ready`
4. For Central: verify the UI is accessible at `https://central-01.example.com/`
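For scripted verification, step 3 can be wrapped in a retry loop (a sketch; the endpoint and port come from the steps above, the timeout values are illustrative):

```python
import time
import urllib.error
import urllib.request

def wait_ready(url: str, timeout_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Poll a readiness endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service not up yet -- keep polling
        time.sleep(interval_s)
    return False

# Example:
# wait_ready("http://localhost:5000/health/ready")
```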
docs/deployment/production-checklist.md (new file, 97 lines)
# ScadaLink Production Deployment Checklist

## Pre-Deployment

### Configuration Verification

- [ ] `ScadaLink:Node:Role` is set correctly (`Central` or `Site`)
- [ ] `ScadaLink:Node:NodeHostname` matches the machine's resolvable hostname
- [ ] `ScadaLink:Cluster:SeedNodes` contains exactly 2 entries for the cluster pair
- [ ] Seed node addresses use fully qualified hostnames (not `localhost`)
- [ ] Remoting port (default 8081) is open bidirectionally between cluster peers

### Central Node

- [ ] `ScadaLink:Database:ConfigurationDb` connection string is valid and tested
- [ ] `ScadaLink:Database:MachineDataDb` connection string is valid and tested
- [ ] SQL Server login has `db_owner` role on both databases
- [ ] EF Core migrations have been applied (SQL script reviewed and executed)
- [ ] `ScadaLink:Security:JwtSigningKey` is at least 32 characters, randomly generated
- [ ] **Both central nodes use the same JwtSigningKey** (required for JWT failover)
- [ ] `ScadaLink:Security:LdapServer` points to the production LDAP/AD server
- [ ] `ScadaLink:Security:LdapUseTls` is `true` (LDAPS required in production)
- [ ] `ScadaLink:Security:AllowInsecureLdap` is `false`
- [ ] LDAP search base DN is correct for the organization
- [ ] LDAP group-to-role mappings are configured
- [ ] Load balancer is configured in front of the central UI (sticky sessions not required)
- [ ] ASP.NET Data Protection keys are shared between central nodes (for cookie failover)
- [ ] HTTPS certificate is installed and configured

### Site Node

- [ ] `ScadaLink:Node:SiteId` is set and unique across all sites
- [ ] `ScadaLink:Database:SiteDbPath` points to a writable directory
- [ ] SQLite data directory has sufficient disk space (no max buffer size for S&F)
- [ ] `ScadaLink:Communication:CentralSeedNode` points to a reachable central node
- [ ] OPC UA server endpoints are accessible from site nodes
- [ ] OPC UA security certificates are configured if required

### Security

- [ ] No secrets in `appsettings.json` committed to source control
- [ ] Secrets managed via environment variables or a secrets manager
- [ ] Windows Service account has minimum necessary permissions
- [ ] Log directory permissions restrict access to service account and administrators
- [ ] SMTP credentials use OAuth2 Client Credentials (preferred) or secure Basic Auth
- [ ] API keys for the Inbound API are generated with sufficient entropy (32+ chars)

### Network

- [ ] DNS resolution works between all cluster nodes
- [ ] Firewall rules permit Akka.NET remoting (TCP 8081)
- [ ] Firewall rules permit LDAP (TCP 636 for LDAPS)
- [ ] Firewall rules permit SMTP (TCP 587 for TLS)
- [ ] Firewall rules permit SQL Server (TCP 1433) from central nodes only
- [ ] Load balancer health check configured against `/health/ready`

## Deployment

### Order of Operations

1. Deploy central node A (forms a single-node cluster)
2. Verify central node A is healthy: `GET /health/ready` returns 200
3. Deploy central node B (joins the existing cluster)
4. Verify both central nodes show as cluster members in the logs
5. Deploy site nodes (order does not matter)
6. Verify sites register with central via the health dashboard

### Rollback Plan

- [ ] Previous version binaries are retained for rollback
- [ ] Database backup taken before migration
- [ ] Rollback SQL script is available (if the migration requires it)
- [ ] Service can be stopped and the previous binary restored

## Post-Deployment

### Smoke Tests

- [ ] Central UI is accessible and login works
- [ ] Health dashboard shows all expected sites as online
- [ ] Template engine can create/save/delete a test template
- [ ] Deployment pipeline can deploy a test instance to a site
- [ ] Inbound API responds to test requests with a valid API key
- [ ] Notification Service can send a test email

### Monitoring Setup

- [ ] Log aggregation is configured (Serilog file sink + centralized collector)
- [ ] Health dashboard bookmarked for the operations team
- [ ] Alerting configured for site offline threshold violations
- [ ] Disk space monitoring on site nodes (SQLite growth)

### Documentation

- [ ] Cluster topology documented (hostnames, ports, roles)
- [ ] Runbook updated with environment-specific details
- [ ] On-call team briefed on failover procedures
docs/deployment/topology-guide.md (new file, 172 lines)
# ScadaLink Cluster Topology Guide

## Architecture Overview

ScadaLink uses a hub-and-spoke architecture:
- **Central Cluster**: Two-node active/standby Akka.NET cluster for management, UI, and coordination.
- **Site Clusters**: Two-node active/standby Akka.NET clusters at each remote site for data collection and local processing.

```
                  ┌──────────────────────────┐
                  │      Central Cluster     │
                  │   ┌──────┐    ┌──────┐   │
Users ───────────►│   │Node A│◄──►│Node B│   │
(HTTPS/LB)        │   │Active│    │ Stby │   │
                  │   └──┬───┘    └──┬───┘   │
                  └──────┼───────────┼───────┘
                         │           │
             ┌───────────┼───────────┼───────────┐
             │           │           │           │
       ┌─────▼─────┐ ┌───▼─────┐ ┌───▼─────┐ ┌───▼─────┐
       │  Site 01  │ │ Site 02 │ │ Site 03 │ │ Site N  │
       │ ┌──┐ ┌──┐ │ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│
       │ │A │ │B │ │ │ │A ││B ││ │ │A ││B ││ │ │A ││B ││
       │ └──┘ └──┘ │ │ └──┘└──┘│ │ └──┘└──┘│ │ └──┘└──┘│
       └───────────┘ └─────────┘ └─────────┘ └─────────┘
```

## Central Cluster Setup

### Cluster Configuration

Both central nodes must be configured as seed nodes for each other:

**Node A** (`central-01.example.com`):
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}
```

**Node B** (`central-02.example.com`):
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-02.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}
```

### Cluster Behavior

- **Split-brain resolver**: Keep-oldest with `down-if-alone = on`, 15-second stable-after.
- **Minimum members**: `min-nr-of-members = 1` — a single node can form a cluster.
- **Failure detection**: 2-second heartbeat interval, 10-second threshold.
- **Total failover time**: ~25 seconds from node failure to singleton migration.
- **Singleton handover**: Uses CoordinatedShutdown for graceful migration.
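These settings correspond roughly to the following Akka.NET HOCON (a sketch for orientation only — the authoritative values live in the application's own configuration, and the mapping of the "10-second threshold" onto `acceptable-heartbeat-pause` is an assumption to verify against the actual config):

```hocon
akka.cluster {
  min-nr-of-members = 1
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"

  split-brain-resolver {
    active-strategy = keep-oldest
    stable-after = 15s
    keep-oldest {
      down-if-alone = on
    }
  }

  failure-detector {
    heartbeat-interval = 2s
    acceptable-heartbeat-pause = 10s
  }
}
```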
### Shared State

Both central nodes share state through:
- **SQL Server**: All configuration, deployment records, templates, and audit logs.
- **JWT signing key**: Same `JwtSigningKey` in both nodes' configuration.
- **Data Protection keys**: Shared key ring (stored in SQL Server or a shared file path).

### Load Balancer

A load balancer sits in front of both central nodes for the Blazor Server UI:
- Health check: `GET /health/ready`
- Protocol: HTTPS (TLS termination at the LB or pass-through)
- Sticky sessions: Not required (JWT + shared Data Protection keys)
- If the active node fails, the LB routes to the standby (which becomes active after singleton migration).

## Site Cluster Setup

### Cluster Configuration

Each site has its own two-node cluster:

**Site Node A** (`site-01-a.example.com`):
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-a.example.com:8081",
        "akka.tcp://scadalink@site-01-b.example.com:8081"
      ]
    }
  }
}
```

### Site Cluster Behavior

- Same split-brain resolver as central (keep-oldest).
- Singleton actors: the Site Deployment Manager migrates on failover.
- Staggered instance startup: 50ms delay between Instance Actor creations to prevent reconnection storms.
- SQLite persistence: Both nodes access the same SQLite files (or each has its own copy with async replication).

### Central-Site Communication

- Sites connect to central via Akka.NET remoting.
- The `Communication:CentralSeedNode` setting in the site config points to one of the central nodes.
- If that central node is down, the site's communication actor will retry until it connects to the active central node.

## Scaling Guidelines

### Target Scale

- 10 sites maximum per central cluster
- 500 machines (instances) total across all sites
- 75 tags per machine (37,500 total tag subscriptions)

### Resource Requirements

| Component | CPU | RAM | Disk | Notes |
|-----------|-----|-----|------|-------|
| Central node | 4 cores | 8 GB | 50 GB | SQL Server is separate |
| Site node | 2 cores | 4 GB | 20 GB | SQLite databases grow with S&F |
| SQL Server | 4 cores | 16 GB | 100 GB | Shared across central cluster |

### Network Bandwidth

- Health reports: ~1 KB per site per 30 seconds = negligible
- Tag value updates: Depends on data change rate; OPC UA subscription-based
- Deployment artifacts: One-time burst per deployment (varies by config size)
- Debug view streaming: ~500 bytes per attribute change per subscriber
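At the target scale above, steady-state health-report traffic really is negligible — a quick sanity check (a sketch; the ~1 KB per 30 seconds figure is the estimate quoted above):

```python
SITES = 10           # target scale: sites per central cluster
REPORT_BYTES = 1024  # ~1 KB per health report
INTERVAL_S = 30      # one report per site every 30 seconds

# Aggregate health-report bandwidth at central, in bytes per second
bps = SITES * REPORT_BYTES / INTERVAL_S
print(round(bps))  # 341 -- a few hundred bytes/s, dwarfed by tag traffic
```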
## Dual-Node Failure Recovery

### Scenario: Both Nodes Down

1. **First node starts**: Forms a single-node cluster (`min-nr-of-members = 1`).
2. **Central**: Reconnects to SQL Server, reads deployment state, becomes operational.
3. **Site**: Opens SQLite databases, rebuilds Instance Actors from persisted configs, resumes S&F retries.
4. **Second node starts**: Joins the existing cluster as standby.

### Automatic Recovery

No manual intervention is required for a dual-node failure. The first node to start will:
- Form the cluster
- Take over all singletons
- Begin processing immediately
- Accept the second node when it joins
docs/operations/failover-procedures.md (new file, 134 lines)
# ScadaLink Failover Procedures

## Automatic Failover (No Intervention Required)

### Central Cluster Failover

**What happens automatically:**

1. Active central node becomes unreachable (process crash, network failure, hardware failure).
2. Akka.NET failure detection triggers after ~10 seconds (2s heartbeat, 10s threshold).
3. Split-brain resolver (keep-oldest) evaluates cluster state for 15 seconds (stable-after).
4. Standby node is promoted to active. Total time: ~25 seconds.
5. Cluster singletons migrate to the new active node.
6. Load balancer detects the failed node via `/health/ready` and routes traffic to the surviving node.
7. Active user sessions continue (JWT tokens are validated by the new node using the shared signing key).
8. SignalR connections are dropped and Blazor clients automatically reconnect.

**What is preserved:**
- All configuration and deployment state (stored in SQL Server)
- Active JWT sessions (shared signing key)
- Deployment status records (SQL Server with optimistic concurrency)

**What is temporarily disrupted:**
- In-flight deployments: Central re-queries site state and re-issues if needed (idempotent)
- Real-time debug view streams: Clients reconnect automatically
- Health dashboard: Resumes on reconnect

### Site Cluster Failover

**What happens automatically:**

1. Active site node becomes unreachable.
2. Failure detection and split-brain resolution (~25 seconds total).
3. Site Deployment Manager singleton migrates to standby.
4. Instance Actors are recreated from persisted SQLite configurations.
5. Staggered startup: 50ms delay between instance creations to prevent reconnection storms.
6. DCL connection actors reconnect to OPC UA servers.
7. Script Actors and Alarm Actors resume processing from incoming values (no stale state).
8. S&F buffer is read from SQLite — pending retries resume.

**What is preserved:**
- Deployed instance configurations (SQLite)
- Static attribute overrides (SQLite)
- S&F message buffer (SQLite)
- Site event logs (SQLite)

**What is temporarily disrupted:**
- Tag value subscriptions: DCL reconnects and re-subscribes transparently
- Active script executions: Cancelled; trigger fires again on next value change
- Alarm states: Re-evaluated from incoming tag values (correct state within one update cycle)

## Manual Intervention Scenarios

### Scenario 1: Both Central Nodes Down

**Symptoms:** No central UI access; sites report "central unreachable" in logs.

**Recovery:**
1. Start either central node. It will form a single-node cluster.
2. Verify SQL Server is accessible.
3. Check `/health/ready` returns 200.
4. Start the second node. It will join the cluster automatically.
5. Verify both nodes appear in the Akka.NET cluster member list (check logs for "Member joined").

**No data loss:** All state is in SQL Server.

### Scenario 2: Both Site Nodes Down

**Symptoms:** Site appears offline in the central health dashboard.

**Recovery:**
1. Start either site node.
2. Check logs for "Store-and-forward SQLite storage initialized".
3. Verify instance actors are recreated: "Instance {Name}: created N script actors and M alarm actors".
4. Start the second site node.
5. Verify the site appears online in the central health dashboard within 60 seconds.

**No data loss:** All state is in SQLite.

### Scenario 3: Split-Brain (Network Partition Between Peers)

**Symptoms:** Both nodes believe they are the active node. Logs show "Cluster partition detected".

**How the system handles it:**
- Keep-oldest resolver: The older node (first to join the cluster) survives; the younger is downed.
- `down-if-alone = on`: If a node is alone (no peers), it downs itself.
- Stable-after (15s): The resolver waits 15 seconds for the partition to stabilize before acting.

**Manual intervention (if auto-resolution fails):**
1. Stop both nodes.
2. Start the preferred node first (it becomes the "oldest").
3. Start the second node.

### Scenario 4: SQL Server Outage (Central)

**Symptoms:** Central UI returns errors. `/health/ready` returns 503. Logs show database connection failures.

**Impact:**
- Active sessions with valid JWTs can still access cached UI state.
- New logins fail (LDAP auth still works, but role mapping requires the DB).
- Template changes and deployments fail.
- Sites continue operating independently.

**Recovery:**
1. Restore SQL Server access.
2. Central nodes will automatically reconnect (EF Core connection resiliency).
3. Verify `/health/ready` returns 200.
4. No manual intervention needed on ScadaLink nodes.

### Scenario 5: Forced Singleton Migration

**When to use:** The active node is degraded but not crashed (e.g., high CPU, disk full).

**Procedure:**
1. Initiate graceful shutdown on the degraded node:
   - Stop the Windows Service: `sc.exe stop ScadaLink-Central`
   - CoordinatedShutdown will migrate singletons to the standby.
2. Wait for the standby to take over (check logs for "Singleton acquired").
3. Fix the issue on the original node.
4. Restart the service. It will rejoin as standby.

## Failover Timeline

```
T+0s    Node fails (last heartbeat received)
T+2s    Akka.NET begins marking the node as unreachable (missed heartbeats)
T+10s   Failure detection confirmed (threshold reached)
T+10s   Split-brain resolver begins stable-after countdown
T+25s   Resolver acts: surviving node promoted
T+25s   Singleton migration begins
T+26s   Instance Actors start recreating (staggered)
T+30s   Health report sent from new active node
T+60s   All instances operational (500 instances × 50ms stagger = 25s)
```
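The timeline arithmetic can be sanity-checked directly (a sketch; the constants are the values quoted in this document):

```python
HEARTBEAT_THRESHOLD_S = 10  # failure detection window (2s heartbeats, 10s threshold)
STABLE_AFTER_S = 15         # split-brain resolver stable-after wait
INSTANCES = 500             # target scale (all instances on a site cluster pair)
STAGGER_MS = 50             # delay between Instance Actor creations

promotion_s = HEARTBEAT_THRESHOLD_S + STABLE_AFTER_S  # time until standby is promoted
restart_window_s = INSTANCES * STAGGER_MS / 1000      # staggered recreation window

print(promotion_s)       # 25
print(restart_window_s)  # 25.0
```

Promotion plus the recreation window lands at ~50 seconds, comfortably within the T+60s "all instances operational" bound.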
docs/operations/maintenance-procedures.md (new file, 215 lines)
# ScadaLink Maintenance Procedures

## SQL Server Maintenance (Central)

### Regular Maintenance Schedule

| Task | Frequency | Window |
|------|-----------|--------|
| Index rebuild | Weekly | Off-peak hours |
| Statistics update | Daily | Automated |
| Backup (full) | Daily | Off-peak hours |
| Backup (differential) | Every 4 hours | Anytime |
| Backup (transaction log) | Every 15 minutes | Anytime |
| Integrity check (DBCC CHECKDB) | Weekly | Off-peak hours |

### Index Maintenance

```sql
-- Rebuild fragmented indexes on the configuration database
-- (ONLINE = ON requires Enterprise edition; drop it on Standard)
USE ScadaLink;
EXEC sp_MSforeachtable 'ALTER INDEX ALL ON ? REBUILD WITH (ONLINE = ON)';
```

For large tables (AuditLogEntries, DeploymentRecords), consider targeted rebuilds:
```sql
ALTER INDEX IX_AuditLogEntries_Timestamp ON AuditLogEntries REBUILD
WITH (ONLINE = ON, FILLFACTOR = 90);
```

### Audit Log Retention

The AuditLogEntries table grows continuously. Implement a retention policy:

```sql
-- Delete audit entries older than 1 year
DELETE FROM AuditLogEntries
WHERE Timestamp < DATEADD(YEAR, -1, GETUTCDATE());
```

Consider partitioning the AuditLogEntries table by month for efficient purging.

### Database Growth Monitoring

```sql
-- Check database sizes
EXEC sp_helpdb 'ScadaLink';
EXEC sp_helpdb 'ScadaLink_MachineData';

-- Check table sizes ([RowCount] is bracketed because ROWCOUNT is a reserved word)
SELECT
    t.name AS TableName,
    p.rows AS [RowCount],
    SUM(a.total_pages) * 8 / 1024.0 AS TotalSpaceMB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
GROUP BY t.name, p.rows
ORDER BY TotalSpaceMB DESC;
```

## SQLite Management (Site)

### Database Files

| File | Purpose | Growth Pattern |
|------|---------|----------------|
| `site.db` | Deployed configs, static overrides | Stable (grows with deployments) |
| `store-and-forward.db` | S&F message buffer | Variable (grows during outages) |

### Monitoring SQLite Size

```powershell
# Check SQLite file sizes
Get-ChildItem C:\ScadaLink\data\*.db | Select-Object Name, @{N='SizeMB';E={[math]::Round($_.Length/1MB,2)}}
```

### S&F Database Growth

The S&F database has **no max buffer size** by design. During extended outages, it can grow significantly.

**Monitoring:**
- Check buffer depth in the health dashboard.
- Alert if `store-and-forward.db` exceeds 1 GB.

**Manual cleanup (if needed):**
1. Identify and discard permanently undeliverable parked messages via the central UI.
2. If the database is very large and the site is healthy, the messages will be delivered and removed automatically.

### SQLite Vacuum

SQLite does not reclaim disk space after deleting rows (unless `auto_vacuum` is enabled). Periodically vacuum:

```powershell
# Stop the ScadaLink service first
sc.exe stop ScadaLink-Site

# Vacuum the S&F database
sqlite3 C:\ScadaLink\data\store-and-forward.db "VACUUM;"

# Restart the service
sc.exe start ScadaLink-Site
```

**Important:** Only vacuum when the service is stopped. VACUUM needs exclusive access and cannot run alongside other write transactions.

### SQLite Backup

```powershell
# Hot backup using the SQLite backup API (safe while the service is running)
sqlite3 C:\ScadaLink\data\site.db ".backup C:\Backups\site-$(Get-Date -Format yyyyMMdd).db"
sqlite3 C:\ScadaLink\data\store-and-forward.db ".backup C:\Backups\sf-$(Get-Date -Format yyyyMMdd).db"
```
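If the `sqlite3` command-line shell is not installed on a site node, the same online backup is available from Python's standard library (a sketch; the paths in the example are illustrative):

```python
import sqlite3

def hot_backup(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database using the online backup API.

    Produces a consistent snapshot even while other writers are active.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)
    finally:
        dest.close()
        src.close()

# Example (illustrative paths):
# hot_backup(r"C:\ScadaLink\data\site.db", r"C:\Backups\site-20260316.db")
```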
|
||||
|
||||
## Log Rotation
|
||||
|
||||
### Serilog File Sink
|
||||
|
||||
ScadaLink uses Serilog's rolling file sink with daily rotation:
|
||||
- New file created each day: `scadalink-20260316.log`
|
||||
- Files are not automatically deleted.
|
||||
|
||||
### Log Retention Policy
|
||||
|
||||
Implement a scheduled task to delete old log files:
|
||||
|
||||
```powershell
|
||||
# Delete log files older than 30 days
|
||||
Get-ChildItem C:\ScadaLink\logs\scadalink-*.log |
|
||||
Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-30) } |
|
||||
Remove-Item -Force
|
||||
```
|
||||
|
||||
Schedule this as a Windows Task:
|
||||
```powershell
|
||||
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -Command `"Get-ChildItem C:\ScadaLink\logs\scadalink-*.log | Where-Object { `$_.LastWriteTime -lt (Get-Date).AddDays(-30) } | Remove-Item -Force`""
|
||||
$trigger = New-ScheduledTaskTrigger -Daily -At "03:00"
|
||||
Register-ScheduledTask -TaskName "ScadaLink-LogCleanup" -Action $action -Trigger $trigger -Description "Clean up ScadaLink log files older than 30 days"
|
||||
```
|
||||
|
||||
### Log Disk Space
|
||||
|
||||
Monitor disk space on all nodes:
|
||||
```powershell
|
||||
Get-PSDrive C | Select-Object @{N='UsedGB';E={[math]::Round($_.Used/1GB,1)}}, @{N='FreeGB';E={[math]::Round($_.Free/1GB,1)}}
|
||||
```
|
||||
|
||||
Alert if free space drops below 5 GB.
|
||||
|
||||
## Site Event Log Maintenance
|
||||
|
||||
### Automatic Purge
|
||||
|
||||
The Site Event Logging component has built-in purge:
|
||||
- **Retention**: 30 days (configurable via `SiteEventLog:RetentionDays`)
|
||||
- **Storage cap**: 1 GB (configurable via `SiteEventLog:MaxStorageMB`)
|
||||
- **Purge interval**: Every 24 hours (configurable via `SiteEventLog:PurgeIntervalHours`)
|
||||
|
||||
No manual intervention needed under normal conditions.
|
||||
|
||||
### Manual Purge (Emergency)

If event log storage is consuming excessive disk space:

```powershell
# Stop the service
sc.exe stop ScadaLink-Site

# Delete the event log database and let it be recreated
Remove-Item C:\ScadaLink\data\event-log.db

# Restart the service
sc.exe start ScadaLink-Site
```
## Certificate Management

### LDAP Certificates

If using LDAPS (port 636), the LDAP server's TLS certificate must be trusted:

1. Export the CA certificate from Active Directory.
2. Import it into the Windows certificate store on both central nodes.
3. Restart the ScadaLink service.
### OPC UA Certificates

OPC UA connections may require certificate trust configuration:

1. On first connection, the OPC UA client generates a self-signed certificate.
2. The OPC UA server must trust this certificate.
3. If the site node is replaced, a new certificate is generated; update the server's trust list.
## Scheduled Maintenance Window

### Recommended Procedure

1. **Notify operators** that the system will be in maintenance mode.
2. **Gracefully stop the standby node** first (this allows the singleton to remain on the active node).
3. Perform maintenance on the standby node (OS updates, disk cleanup, etc.).
4. **Start the standby node** and verify it joins the cluster.
5. **Gracefully stop the active node** (CoordinatedShutdown migrates singletons to the now-running standby).
6. Perform maintenance on the former active node.
7. **Start the former active node**; it rejoins as standby.

This procedure maintains availability throughout the maintenance window.
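Assuming the service names from the installation guide, the stop/start steps of the rolling procedure reduce to standard service control commands (a sketch; run on the node being maintained):

```powershell
# On the node under maintenance: stop, patch, restart
sc.exe stop ScadaLink-Central
# ... perform OS updates / disk cleanup ...
sc.exe start ScadaLink-Central

# Verify the service reports RUNNING before moving to the other node
sc.exe query ScadaLink-Central
```

Confirm cluster membership in the logs (`Member joined`) after each restart before proceeding to the peer node.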
### Emergency Maintenance (Both Nodes)

If both nodes must be stopped simultaneously:

1. Stop both nodes.
2. Perform maintenance.
3. Start one node (it forms a single-node cluster).
4. Verify health.
5. Start the second node.

Sites continue operating independently during central maintenance. Site-buffered data (S&F) is delivered once central communication is restored.
201
docs/operations/troubleshooting-guide.md
Normal file
@@ -0,0 +1,201 @@
# ScadaLink Troubleshooting Guide

## Log Analysis

### Log Location

- **File logs:** `C:\ScadaLink\logs\scadalink-YYYYMMDD.log`
- **Console output:** Available when running interactively (not as a Windows Service)

### Log Format

```
[14:32:05 INF] [Central/central-01] Template "PumpStation" saved by admin
```

Format: `[Time Level] [NodeRole/NodeHostname] Message`

All log entries are enriched with:

- `SiteId` — Site identifier (or "central" for central nodes)
- `NodeHostname` — Machine hostname
- `NodeRole` — "Central" or "Site"
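A Serilog output template producing this format would look roughly like the following (illustrative reconstruction from the sample line above; the shipped template may differ, and `NodeRole`/`NodeHostname` are assumed to be enricher properties):

```json
{
  "Name": "File",
  "Args": {
    "path": "C:\\ScadaLink\\logs\\scadalink-.log",
    "rollingInterval": "Day",
    "outputTemplate": "[{Timestamp:HH:mm:ss} {Level:u3}] [{NodeRole}/{NodeHostname}] {Message:lj}{NewLine}{Exception}"
  }
}
```

Here `{Level:u3}` renders the three-letter uppercase level (`INF`, `ERR`), which is what the filtering patterns below match on.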
### Key Log Patterns

| Pattern | Meaning |
|---------|---------|
| `Starting ScadaLink host as {Role}` | Node startup |
| `Member joined` | Cluster peer connected |
| `Member removed` | Cluster peer departed |
| `Singleton acquired` | This node became the active singleton holder |
| `Instance {Name}: created N script actors` | Instance successfully deployed |
| `Script {Name} failed trust validation` | Script uses a forbidden API |
| `Immediate delivery to {Target} failed` | S&F transient failure; message buffered |
| `Message {Id} parked` | S&F max retries reached |
| `Site {SiteId} marked offline` | No health report for 60 seconds |
| `Rejecting stale report` | Out-of-order health report (normal during failover) |
### Filtering Logs

Use the structured log properties for targeted analysis:

```powershell
# Find all errors for a specific site
Select-String -Path "logs\scadalink-*.log" -Pattern "\[ERR\].*site-01"

# Find S&F activity
Select-String -Path "logs\scadalink-*.log" -Pattern "store-and-forward|buffered|parked"

# Find failover events
Select-String -Path "logs\scadalink-*.log" -Pattern "Singleton|Member joined|Member removed"
```
## Common Issues

### Issue: Site Appears Offline in Health Dashboard

**Possible causes:**

1. The site nodes are actually down.
2. Network connectivity between site and central is broken.
3. The health report interval has not elapsed since site startup.

**Diagnosis:**

1. Check if the site service is running: `sc.exe query ScadaLink-Site`
2. Check the site logs for errors.
3. Verify the network path: `Test-NetConnection -ComputerName central-01.example.com -Port 8081`
4. Wait 60 seconds (the offline detection threshold).

**Resolution:**

- If the service is stopped, start it.
- If the network is blocked, open firewall port 8081.
- If the site just started, wait for the first health report (30-second interval).
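If a blocked firewall turns out to be the cause, an inbound rule for the cluster port can be added on the central node (a sketch; the rule name is arbitrary and the port follows the 8081 default above):

```powershell
# Allow inbound ScadaLink cluster traffic on TCP 8081
New-NetFirewallRule -DisplayName "ScadaLink Cluster (TCP 8081)" `
    -Direction Inbound -Protocol TCP -LocalPort 8081 -Action Allow
```

Re-run `Test-NetConnection` from the site node afterwards to confirm the port is reachable.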
### Issue: Deployment Stuck in "InProgress"

**Possible causes:**

1. The site is unreachable during deployment.
2. The central node failed over mid-deployment.
3. Instance compilation failed on the site.

**Diagnosis:**

1. Check the deployment status in the UI.
2. Search the site logs for the deployment ID: `Select-String "dep-XXXXX"`
3. Search the central logs for the deployment ID.

**Resolution:**

- If the site is unreachable: fix connectivity, then re-deploy (idempotent by revision hash).
- If compilation failed: check the script errors in the site logs, fix the template, re-deploy.
- If stuck after failover: the new central node will re-query site state; wait or manually re-deploy.
### Issue: S&F Messages Accumulating

**Possible causes:**

1. The external system is down.
2. The SMTP server is unreachable.
3. Network issues between the site and the external target.

**Diagnosis:**

1. Check the S&F buffer depth in the health dashboard.
2. Check the site logs for retry activity and error messages.
3. Verify external system connectivity from the site node.

**Resolution:**

- Fix the external system / SMTP / network issue. Retries resume automatically.
- If messages are permanently undeliverable: park and discard them via the central UI.
- Check parked messages for patterns (same target, same error).
### Issue: OPC UA Connection Keeps Disconnecting

**Possible causes:**

1. The OPC UA server is unstable.
2. Network intermittency.
3. Certificate trust issues.

**Diagnosis:**

1. Check the DCL logs: look at the frequency of "Entering Reconnecting state" entries.
2. Check the health dashboard: data connection status for the affected connection.
3. Verify OPC UA server health independently.

**Resolution:**

- The DCL auto-reconnects at the configured interval (default 5 seconds).
- If the server certificate changed, update the trust store.
- If the server is consistently unstable, investigate the OPC UA server directly.
### Issue: Script Execution Errors

**Possible causes:**

1. Script timeout (default 30 seconds).
2. A runtime exception in the script code.
3. The script references an external system that is down.

**Diagnosis:**

1. Check the health dashboard: script error count per interval.
2. Check the site logs for the script name and error details.
3. Check if the script uses `ExternalSystem.Call()` — the target may be down.

**Resolution:**

- If timeout: optimize the script or increase the timeout in configuration.
- If runtime error: fix the script in the template editor, re-deploy.
- If the external system is down: script errors will stop when the system recovers.
### Issue: Login Fails but LDAP Server is Up

**Possible causes:**

1. Incorrect LDAP search base DN.
2. The user account is locked in AD.
3. The LDAP group-to-role mapping does not include a required group.
4. A TLS certificate issue on the LDAP connection.

**Diagnosis:**

1. Check the central logs for LDAP bind errors.
2. Verify LDAP connectivity: `Test-NetConnection -ComputerName ldap.example.com -Port 636`
3. Test the LDAP bind manually using an LDAP browser tool.

**Resolution:**

- Fix the LDAP configuration.
- Unlock the user account in AD.
- Update the group mappings in the configuration database.
### Issue: High Dead Letter Count

**Possible causes:**

1. Messages being sent to actors that no longer exist (e.g., after instance deletion).
2. Actor mailbox overflow.
3. Misconfigured actor paths after deployment changes.

**Diagnosis:**

1. Check the health dashboard: dead letter count trend.
2. Check the site logs for dead letter details (actor path, message type).

**Resolution:**

- Dead letters during failover are expected and transient.
- Persistent dead letters indicate a configuration or code issue.
- If dead letters reference deleted instances, they are harmless (S&F messages are retained by design).
## Health Dashboard Interpretation

### Metric: Data Connection Status

| Status | Meaning | Action |
|--------|---------|--------|
| Connected | OPC UA connection active | None |
| Disconnected | Connection lost, auto-reconnecting | Check the OPC UA server |
| Connecting | Initial connection in progress | Wait |
### Metric: Tag Resolution

- `TotalSubscribed`: Number of tags the system is trying to monitor.
- `SuccessfullyResolved`: Tags with active subscriptions.
- A gap between the two indicates unresolved tags (devices still booting, or path errors).
### Metric: S&F Buffer Depth

- `ExternalSystem`: Messages to external REST APIs awaiting delivery.
- `Notification`: Email notifications awaiting SMTP delivery.
- A growing depth indicates the target system is unreachable.
### Metric: Error Counts (Per Interval)

- Counts reset every 30 seconds (the health report interval).
- These are raw counts, not rates — compare across intervals.
- Occasional script errors during failover are expected.