Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs

- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)
92 new tests, all passing. Zero warnings.
Author: Joseph Doherty
Date:   2026-03-16 22:12:31 -04:00
Commit: b659978764 (parent 3b2320bd35)
68 changed files with 6253 additions and 44 deletions

# ScadaLink Installation Guide
## Prerequisites
- Windows Server 2019 or later
- .NET 10.0 Runtime
- SQL Server 2019+ (Central nodes only)
- Network connectivity between all cluster nodes (TCP ports 8081-8082)
- LDAP/Active Directory server accessible from Central nodes
- SMTP server accessible from all nodes (for Notification Service)
## Single Binary Deployment
ScadaLink ships as a single executable (`ScadaLink.Host.exe`) that runs in either Central or Site role based on configuration.
### Windows Service Installation
```powershell
# Note: sc.exe requires a space after each option's equals sign (binPath= , start= )

# Central Node
sc.exe create "ScadaLink-Central" binPath= "C:\ScadaLink\ScadaLink.Host.exe" start= auto
sc.exe description "ScadaLink-Central" "ScadaLink SCADA Central Hub"

# Site Node
sc.exe create "ScadaLink-Site" binPath= "C:\ScadaLink\ScadaLink.Host.exe" start= auto
sc.exe description "ScadaLink-Site" "ScadaLink SCADA Site Agent"
```
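Optionally, configure service recovery so Windows restarts the process after a crash. This is a suggested hardening step, not part of the base install:

```powershell
# Restart 5s after the first and second failures, 60s after the third; reset the failure counter daily
sc.exe failure "ScadaLink-Central" reset= 86400 actions= restart/5000/restart/5000/restart/60000
```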
### Directory Structure
```
C:\ScadaLink\
├── ScadaLink.Host.exe
├── appsettings.json
├── appsettings.Production.json
├── data\                        # Site: SQLite databases
│   ├── site.db                  # Deployed configs, static overrides
│   └── store-and-forward.db     # S&F message buffer
└── logs\                        # Rolling log files
    └── scadalink-20260316.log
```
## Configuration Templates
### Central Node — `appsettings.json`
```json
{
"ScadaLink": {
"Node": {
"Role": "Central",
"NodeHostname": "central-01.example.com",
"RemotingPort": 8081
},
"Cluster": {
"SeedNodes": [
"akka.tcp://scadalink@central-01.example.com:8081",
"akka.tcp://scadalink@central-02.example.com:8081"
]
},
"Database": {
"ConfigurationDb": "Server=sqlserver.example.com;Database=ScadaLink;User Id=scadalink_svc;Password=<CHANGE_ME>;Encrypt=true;TrustServerCertificate=false",
"MachineDataDb": "Server=sqlserver.example.com;Database=ScadaLink_MachineData;User Id=scadalink_svc;Password=<CHANGE_ME>;Encrypt=true;TrustServerCertificate=false"
},
"Security": {
"LdapServer": "ldap.example.com",
"LdapPort": 636,
"LdapUseTls": true,
"AllowInsecureLdap": false,
"LdapSearchBase": "dc=example,dc=com",
"JwtSigningKey": "<GENERATE_A_32_PLUS_CHAR_RANDOM_STRING>",
"JwtExpiryMinutes": 15,
"IdleTimeoutMinutes": 30
},
"HealthMonitoring": {
"ReportInterval": "00:00:30",
"OfflineTimeout": "00:01:00"
},
"Logging": {
"MinimumLevel": "Information"
}
},
"Serilog": {
"MinimumLevel": {
"Default": "Information",
"Override": {
"Microsoft": "Warning",
"Akka": "Warning"
}
}
}
}
```
### Site Node — `appsettings.json`
```json
{
"ScadaLink": {
"Node": {
"Role": "Site",
"NodeHostname": "site-01-node-a.example.com",
"SiteId": "plant-north",
"RemotingPort": 8081
},
"Cluster": {
"SeedNodes": [
"akka.tcp://scadalink@site-01-node-a.example.com:8081",
"akka.tcp://scadalink@site-01-node-b.example.com:8081"
]
},
"Database": {
"SiteDbPath": "C:\\ScadaLink\\data\\site.db"
},
"DataConnection": {
"ReconnectInterval": "00:00:05",
"TagResolutionRetryInterval": "00:00:30"
},
"StoreAndForward": {
"SqliteDbPath": "C:\\ScadaLink\\data\\store-and-forward.db",
"DefaultRetryInterval": "00:00:30",
"DefaultMaxRetries": 50,
"ReplicationEnabled": true
},
"SiteRuntime": {
"ScriptTimeoutSeconds": 30,
"StaggeredStartupDelayMs": 50
},
"SiteEventLog": {
"RetentionDays": 30,
"MaxStorageMB": 1024,
"PurgeIntervalHours": 24
},
"Communication": {
"CentralSeedNode": "akka.tcp://scadalink@central-01.example.com:8081"
},
"HealthMonitoring": {
"ReportInterval": "00:00:30"
},
"Logging": {
"MinimumLevel": "Information"
}
}
}
```
## Database Setup (Central Only)
### SQL Server
1. Create the configuration database:
```sql
CREATE DATABASE ScadaLink;
CREATE LOGIN scadalink_svc WITH PASSWORD = '<STRONG_PASSWORD>';
USE ScadaLink;
CREATE USER scadalink_svc FOR LOGIN scadalink_svc;
ALTER ROLE db_owner ADD MEMBER scadalink_svc;
```
2. Create the machine data database:
```sql
CREATE DATABASE ScadaLink_MachineData;
USE ScadaLink_MachineData;
CREATE USER scadalink_svc FOR LOGIN scadalink_svc;
ALTER ROLE db_owner ADD MEMBER scadalink_svc;
```
3. Apply EF Core migrations (development):
- Migrations auto-apply on startup in Development environment.
4. Apply EF Core migrations (production):
- Generate SQL script: `dotnet ef migrations script --project src/ScadaLink.ConfigurationDatabase`
- Review and execute the SQL script against the production database.
## Network Requirements
| Source | Destination | Port | Protocol | Purpose |
|--------|------------|------|----------|---------|
| Central A | Central B | 8081 | TCP | Akka.NET remoting |
| Site A | Site B | 8081 | TCP | Akka.NET remoting |
| Site nodes | Central nodes | 8081 | TCP | Central-site communication |
| Central nodes | LDAP server | 636 | TCP/TLS | Authentication |
| All nodes | SMTP server | 587 | TCP/TLS | Notification delivery |
| Central nodes | SQL Server | 1433 | TCP | Configuration database |
| Users | Central nodes | 443 | HTTPS | Blazor Server UI |
## Firewall Rules
Ensure bidirectional TCP connectivity between all Akka.NET cluster peers. The remoting port (default 8081) must be open in both directions.
## Post-Installation Verification
1. Start the service: `sc.exe start ScadaLink-Central`
2. Check the log file: `type C:\ScadaLink\logs\scadalink-*.log`
3. Verify the readiness endpoint: `curl http://localhost:5000/health/ready`
4. For Central: verify the UI is accessible at `https://central-01.example.com/`
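The readiness check in step 3 is easy to script for monitoring; a minimal Python sketch, assuming the URL and port shown above:

```python
"""Poll the ScadaLink readiness endpoint (URL and port taken from step 3 above)."""
import urllib.request
import urllib.error


def check_ready(url="http://localhost:5000/health/ready", timeout=5.0):
    """Return True when the readiness endpoint answers HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status (HTTPError) all mean "not ready"
        return False


if __name__ == "__main__":
    print("ready" if check_ready() else "not ready")
```

A 503 raises `HTTPError` (a subclass of `URLError`), so a node that is up but not ready is reported the same as one that is down.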

# ScadaLink Production Deployment Checklist
## Pre-Deployment
### Configuration Verification
- [ ] `ScadaLink:Node:Role` is set correctly (`Central` or `Site`)
- [ ] `ScadaLink:Node:NodeHostname` matches the machine's resolvable hostname
- [ ] `ScadaLink:Cluster:SeedNodes` contains exactly 2 entries for the cluster pair
- [ ] Seed node addresses use fully qualified hostnames (not `localhost`)
- [ ] Remoting port (default 8081) is open bidirectionally between cluster peers
### Central Node
- [ ] `ScadaLink:Database:ConfigurationDb` connection string is valid and tested
- [ ] `ScadaLink:Database:MachineDataDb` connection string is valid and tested
- [ ] SQL Server login has `db_owner` role on both databases
- [ ] EF Core migrations have been applied (SQL script reviewed and executed)
- [ ] `ScadaLink:Security:JwtSigningKey` is at least 32 characters, randomly generated
- [ ] **Both central nodes use the same JwtSigningKey** (required for JWT failover)
- [ ] `ScadaLink:Security:LdapServer` points to the production LDAP/AD server
- [ ] `ScadaLink:Security:LdapUseTls` is `true` (LDAPS required in production)
- [ ] `ScadaLink:Security:AllowInsecureLdap` is `false`
- [ ] LDAP search base DN is correct for the organization
- [ ] LDAP group-to-role mappings are configured
- [ ] Load balancer is configured in front of central UI (sticky sessions not required)
- [ ] ASP.NET Data Protection keys are shared between central nodes (for cookie failover)
- [ ] HTTPS certificate is installed and configured
### Site Node
- [ ] `ScadaLink:Node:SiteId` is set and unique across all sites
- [ ] `ScadaLink:Database:SiteDbPath` points to a writable directory
- [ ] SQLite data directory has sufficient disk space (no max buffer size for S&F)
- [ ] `ScadaLink:Communication:CentralSeedNode` points to a reachable central node
- [ ] OPC UA server endpoints are accessible from site nodes
- [ ] OPC UA security certificates are configured if required
### Security
- [ ] No secrets in `appsettings.json` committed to source control
- [ ] Secrets managed via environment variables or a secrets manager
- [ ] Windows Service account has minimum necessary permissions
- [ ] Log directory permissions restrict access to service account and administrators
- [ ] SMTP credentials use OAuth2 Client Credentials (preferred) or secure Basic Auth
- [ ] API keys for Inbound API are generated with sufficient entropy (32+ chars)
### Network
- [ ] DNS resolution works between all cluster nodes
- [ ] Firewall rules permit Akka.NET remoting (TCP 8081)
- [ ] Firewall rules permit LDAP (TCP 636 for LDAPS)
- [ ] Firewall rules permit SMTP (TCP 587 for TLS)
- [ ] Firewall rules permit SQL Server (TCP 1433) from central nodes only
- [ ] Load balancer health check configured against `/health/ready`
## Deployment
### Order of Operations
1. Deploy central node A (forms single-node cluster)
2. Verify central node A is healthy: `GET /health/ready` returns 200
3. Deploy central node B (joins existing cluster)
4. Verify both central nodes show as cluster members in logs
5. Deploy site nodes (order does not matter)
6. Verify sites register with central via health dashboard
### Rollback Plan
- [ ] Previous version binaries are retained for rollback
- [ ] Database backup taken before migration
- [ ] Rollback SQL script is available (if migration requires it)
- [ ] Service can be stopped and previous binary restored
## Post-Deployment
### Smoke Tests
- [ ] Central UI is accessible and login works
- [ ] Health dashboard shows all expected sites as online
- [ ] Template engine can create/save/delete a test template
- [ ] Deployment pipeline can deploy a test instance to a site
- [ ] Inbound API responds to test requests with valid API key
- [ ] Notification Service can send a test email
### Monitoring Setup
- [ ] Log aggregation is configured (Serilog file sink + centralized collector)
- [ ] Health dashboard bookmarked for operations team
- [ ] Alerting configured for site offline threshold violations
- [ ] Disk space monitoring on site nodes (SQLite growth)
### Documentation
- [ ] Cluster topology documented (hostnames, ports, roles)
- [ ] Runbook updated with environment-specific details
- [ ] On-call team briefed on failover procedures

# ScadaLink Cluster Topology Guide
## Architecture Overview
ScadaLink uses a hub-and-spoke architecture:
- **Central Cluster**: Two-node active/standby Akka.NET cluster for management, UI, and coordination.
- **Site Clusters**: Two-node active/standby Akka.NET clusters at each remote site for data collection and local processing.
```
┌──────────────────────────┐
│ Central Cluster │
│ ┌──────┐ ┌──────┐ │
Users ──────────► │ │Node A│◄──►│Node B│ │
(HTTPS/LB) │ │Active│ │Stby │ │
│ └──┬───┘ └──┬───┘ │
└─────┼───────────┼────────┘
│ │
┌───────────┼───────────┼───────────┐
│ │ │ │
┌─────▼─────┐ ┌──▼──────┐ ┌──▼──────┐ ┌──▼──────┐
│ Site 01 │ │ Site 02 │ │ Site 03 │ │ Site N │
│ ┌──┐ ┌──┐ │ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│
│ │A │ │B │ │ │ │A ││B ││ │ │A ││B ││ │ │A ││B ││
│ └──┘ └──┘ │ │ └──┘└──┘│ │ └──┘└──┘│ │ └──┘└──┘│
└───────────┘ └─────────┘ └─────────┘ └─────────┘
```
## Central Cluster Setup
### Cluster Configuration
Both central nodes must be configured as seed nodes for each other:
**Node A** (`central-01.example.com`):
```json
{
"ScadaLink": {
"Node": {
"Role": "Central",
"NodeHostname": "central-01.example.com",
"RemotingPort": 8081
},
"Cluster": {
"SeedNodes": [
"akka.tcp://scadalink@central-01.example.com:8081",
"akka.tcp://scadalink@central-02.example.com:8081"
]
}
}
}
```
**Node B** (`central-02.example.com`):
```json
{
"ScadaLink": {
"Node": {
"Role": "Central",
"NodeHostname": "central-02.example.com",
"RemotingPort": 8081
},
"Cluster": {
"SeedNodes": [
"akka.tcp://scadalink@central-01.example.com:8081",
"akka.tcp://scadalink@central-02.example.com:8081"
]
}
}
}
```
### Cluster Behavior
- **Split-brain resolver**: Keep-oldest with `down-if-alone = on`, 15-second stable-after.
- **Minimum members**: `min-nr-of-members = 1` — a single node can form a cluster.
- **Failure detection**: 2-second heartbeat interval, 10-second threshold.
- **Total failover time**: ~25 seconds from node failure to singleton migration.
- **Singleton handover**: Uses CoordinatedShutdown for graceful migration.
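In Akka.NET HOCON terms, the behavior above corresponds roughly to the following fragment. This is a sketch: key names assume the stock Akka.Cluster split-brain resolver, and the 10-second threshold is mapped to `acceptable-heartbeat-pause`; the shipped defaults may be set in code instead:

```hocon
akka.cluster {
  min-nr-of-members = 1
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    stable-after = 15s
    keep-oldest.down-if-alone = on
  }
  failure-detector {
    heartbeat-interval = 2s
    acceptable-heartbeat-pause = 10s
  }
}
```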
### Shared State
Both central nodes share state through:
- **SQL Server**: All configuration, deployment records, templates, and audit logs.
- **JWT signing key**: Same `JwtSigningKey` in both nodes' configuration.
- **Data Protection keys**: Shared key ring (stored in SQL Server or shared file path).
### Load Balancer
A load balancer sits in front of both central nodes for the Blazor Server UI:
- Health check: `GET /health/ready`
- Protocol: HTTPS (TLS termination at LB or pass-through)
- Sticky sessions: Not required (JWT + shared Data Protection keys)
- If the active node fails, the LB routes to the standby (which becomes active after singleton migration).
## Site Cluster Setup
### Cluster Configuration
Each site has its own two-node cluster:
**Site Node A** (`site-01-a.example.com`):
```json
{
"ScadaLink": {
"Node": {
"Role": "Site",
"NodeHostname": "site-01-a.example.com",
"SiteId": "plant-north",
"RemotingPort": 8081
},
"Cluster": {
"SeedNodes": [
"akka.tcp://scadalink@site-01-a.example.com:8081",
"akka.tcp://scadalink@site-01-b.example.com:8081"
]
}
}
}
```
### Site Cluster Behavior
- Same split-brain resolver as central (keep-oldest).
- Singleton actors: Site Deployment Manager migrates on failover.
- Staggered instance startup: 50ms delay between Instance Actor creation to prevent reconnection storms.
- SQLite persistence: Both nodes access the same SQLite files (or each has its own copy with async replication).
### Central-Site Communication
- Sites connect to central via Akka.NET remoting.
- The `Communication:CentralSeedNode` setting in the site config points to one of the central nodes.
- If that central node is down, the site's communication actor will retry until it connects to the active central node.
## Scaling Guidelines
### Target Scale
- 10 sites maximum per central cluster
- 500 machines (instances) total across all sites
- 75 tags per machine (37,500 total tag subscriptions)
### Resource Requirements
| Component | CPU | RAM | Disk | Notes |
|-----------|-----|-----|------|-------|
| Central node | 4 cores | 8 GB | 50 GB | SQL Server is separate |
| Site node | 2 cores | 4 GB | 20 GB | SQLite databases grow with S&F |
| SQL Server | 4 cores | 16 GB | 100 GB | Shared across central cluster |
### Network Bandwidth
- Health reports: ~1 KB per site per 30 seconds = negligible
- Tag value updates: Depends on data change rate; OPC UA subscription-based
- Deployment artifacts: One-time burst per deployment (varies by config size)
- Debug view streaming: ~500 bytes per attribute change per subscriber
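The figures above can be cross-checked with simple arithmetic (the ~1 KB health-report size is taken from the bandwidth list):

```python
# Back-of-envelope check of the scale targets
sites = 10
machines = 500
tags_per_machine = 75
health_report_bytes = 1024      # ~1 KB per site
health_interval_s = 30

total_tags = machines * tags_per_machine
health_bandwidth_bps = sites * health_report_bytes / health_interval_s

print(total_tags)               # 37500 tag subscriptions
print(health_bandwidth_bps)     # ~341 bytes/s across all sites: negligible
```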
## Dual-Node Failure Recovery
### Scenario: Both Nodes Down
1. **First node starts**: Forms a single-node cluster (`min-nr-of-members = 1`).
2. **Central**: Reconnects to SQL Server, reads deployment state, becomes operational.
3. **Site**: Opens SQLite databases, rebuilds Instance Actors from persisted configs, resumes S&F retries.
4. **Second node starts**: Joins the existing cluster as standby.
### Automatic Recovery
No manual intervention required for dual-node failure. The first node to start will:
- Form the cluster
- Take over all singletons
- Begin processing immediately
- Accept the second node when it joins

# ScadaLink Failover Procedures
## Automatic Failover (No Intervention Required)
### Central Cluster Failover
**What happens automatically:**
1. Active central node becomes unreachable (process crash, network failure, hardware failure).
2. Akka.NET failure detection triggers after ~10 seconds (2s heartbeat, 10s threshold).
3. Split-brain resolver (keep-oldest) evaluates cluster state for 15 seconds (stable-after).
4. Standby node is promoted to active. Total time: ~25 seconds.
5. Cluster singletons migrate to the new active node.
6. Load balancer detects the failed node via `/health/ready` and routes traffic to the surviving node.
7. Active user sessions continue (JWT tokens are validated by the new node using the shared signing key).
8. SignalR connections are dropped and Blazor clients automatically reconnect.
**What is preserved:**
- All configuration and deployment state (stored in SQL Server)
- Active JWT sessions (shared signing key)
- Deployment status records (SQL Server with optimistic concurrency)
**What is temporarily disrupted:**
- In-flight deployments: Central re-queries site state and re-issues if needed (idempotent)
- Real-time debug view streams: Clients reconnect automatically
- Health dashboard: Resumes on reconnect
### Site Cluster Failover
**What happens automatically:**
1. Active site node becomes unreachable.
2. Failure detection and split-brain resolution (~25 seconds total).
3. Site Deployment Manager singleton migrates to standby.
4. Instance Actors are recreated from persisted SQLite configurations.
5. Staggered startup: 50ms delay between instance creations to prevent reconnection storms.
6. DCL connection actors reconnect to OPC UA servers.
7. Script Actors and Alarm Actors resume processing from incoming values (no stale state).
8. S&F buffer is read from SQLite — pending retries resume.
**What is preserved:**
- Deployed instance configurations (SQLite)
- Static attribute overrides (SQLite)
- S&F message buffer (SQLite)
- Site event logs (SQLite)
**What is temporarily disrupted:**
- Tag value subscriptions: DCL reconnects and re-subscribes transparently
- Active script executions: Cancelled; trigger fires again on next value change
- Alarm states: Re-evaluated from incoming tag values (correct state within one update cycle)
## Manual Intervention Scenarios
### Scenario 1: Both Central Nodes Down
**Symptoms:** No central UI access, sites report "central unreachable" in logs.
**Recovery:**
1. Start either central node. It will form a single-node cluster.
2. Verify SQL Server is accessible.
3. Check `/health/ready` returns 200.
4. Start the second node. It will join the cluster automatically.
5. Verify both nodes appear in the Akka.NET cluster member list (check logs for "Member joined").
**No data loss:** All state is in SQL Server.
### Scenario 2: Both Site Nodes Down
**Symptoms:** Site appears offline in central health dashboard.
**Recovery:**
1. Start either site node.
2. Check logs for "Store-and-forward SQLite storage initialized".
3. Verify instance actors are recreated: "Instance {Name}: created N script actors and M alarm actors".
4. Start the second site node.
5. Verify the site appears online in the central health dashboard within 60 seconds.
**No data loss:** All state is in SQLite.
### Scenario 3: Split-Brain (Network Partition Between Peers)
**Symptoms:** Both nodes believe they are the active node. Logs show "Cluster partition detected".
**How the system handles it:**
- Keep-oldest resolver: The older node (first to join cluster) survives; the younger is downed.
- `down-if-alone = on`: If a node is alone (no peers), it downs itself.
- Stable-after (15s): The resolver waits 15 seconds for the partition to stabilize before acting.
**Manual intervention (if auto-resolution fails):**
1. Stop both nodes.
2. Start the preferred node first (it becomes the "oldest").
3. Start the second node.
### Scenario 4: SQL Server Outage (Central)
**Symptoms:** Central UI returns errors. `/health/ready` returns 503. Logs show database connection failures.
**Impact:**
- Active sessions with valid JWTs can still access cached UI state.
- New logins fail (LDAP auth still works but role mapping requires DB).
- Template changes and deployments fail.
- Sites continue operating independently.
**Recovery:**
1. Restore SQL Server access.
2. Central nodes will automatically reconnect (EF Core connection resiliency).
3. Verify `/health/ready` returns 200.
4. No manual intervention needed on ScadaLink nodes.
### Scenario 5: Forced Singleton Migration
**When to use:** The active node is degraded but not crashed (e.g., high CPU, disk full).
**Procedure:**
1. Initiate graceful shutdown on the degraded node:
- Stop the Windows Service: `sc.exe stop ScadaLink-Central`
- CoordinatedShutdown will migrate singletons to the standby.
2. Wait for the standby to take over (check logs for "Singleton acquired").
3. Fix the issue on the original node.
4. Restart the service. It will rejoin as standby.
## Failover Timeline
```
T+0s Node failure detected (heartbeat timeout)
T+2s Akka.NET marks node as unreachable
T+10s Failure detection confirmed (threshold reached)
T+10s Split-brain resolver begins stable-after countdown
T+25s Resolver actions: surviving node promoted
T+25s Singleton migration begins
T+26s Instance Actors start recreating (staggered)
T+30s Health report sent from new active node
T+60s   All instances operational (500 instances * 50ms stagger = 25s of staggered creation, plus reconnection margin)
```
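The stagger math behind the last line can be sanity-checked using values from the timeline and the `SiteRuntime:StaggeredStartupDelayMs` setting:

```python
# Staggered Instance Actor recreation after failover
instances = 500
stagger_ms = 50              # SiteRuntime:StaggeredStartupDelayMs
recreation_begins_s = 26     # T+26s in the timeline above

recreation_s = instances * stagger_ms / 1000
all_created_at = recreation_begins_s + recreation_s

print(recreation_s)          # 25.0 seconds of staggered creation
print(all_created_at)        # 51.0: leaves ~9s of the T+60s budget for reconnection
```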

# ScadaLink Maintenance Procedures
## SQL Server Maintenance (Central)
### Regular Maintenance Schedule
| Task | Frequency | Window |
|------|-----------|--------|
| Index rebuild | Weekly | Off-peak hours |
| Statistics update | Daily | Automated |
| Backup (full) | Daily | Off-peak hours |
| Backup (differential) | Every 4 hours | Anytime |
| Backup (transaction log) | Every 15 minutes | Anytime |
| Integrity check (DBCC CHECKDB) | Weekly | Off-peak hours |
### Index Maintenance
```sql
-- Rebuild fragmented indexes on the configuration database
-- Note: ONLINE = ON requires SQL Server Enterprise Edition; on Standard Edition,
-- omit the option (the rebuild then takes the tables offline briefly)
USE ScadaLink;
EXEC sp_MSforeachtable 'ALTER INDEX ALL ON ? REBUILD WITH (ONLINE = ON)';
```
For large tables (AuditLogEntries, DeploymentRecords), consider filtered rebuilds:
```sql
ALTER INDEX IX_AuditLogEntries_Timestamp ON AuditLogEntries REBUILD
WITH (ONLINE = ON, FILLFACTOR = 90);
```
### Audit Log Retention
The AuditLogEntries table grows continuously. Implement a retention policy:
```sql
-- Delete audit entries older than 1 year
DELETE FROM AuditLogEntries
WHERE Timestamp < DATEADD(YEAR, -1, GETUTCDATE());
```
Consider partitioning the AuditLogEntries table by month for efficient purging.
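On large tables, a single unbounded DELETE can escalate locks and bloat the transaction log. A batched variant keeps both bounded (a sketch, not part of the shipped maintenance scripts):

```sql
-- Delete expired audit entries in batches to limit lock duration and log growth
DECLARE @batch INT = 10000;
WHILE 1 = 1
BEGIN
    DELETE TOP (@batch) FROM AuditLogEntries
    WHERE Timestamp < DATEADD(YEAR, -1, GETUTCDATE());
    IF @@ROWCOUNT < @batch BREAK;
END;
```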
### Database Growth Monitoring
```sql
-- Check database sizes
EXEC sp_helpdb 'ScadaLink';
EXEC sp_helpdb 'ScadaLink_MachineData';
-- Check table sizes
SELECT
t.NAME AS TableName,
p.rows AS RowCount,
SUM(a.total_pages) * 8 / 1024.0 AS TotalSpaceMB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
GROUP BY t.Name, p.Rows
ORDER BY TotalSpaceMB DESC;
```
## SQLite Management (Site)
### Database Files
| File | Purpose | Growth Pattern |
|------|---------|---------------|
| `site.db` | Deployed configs, static overrides | Stable (grows with deployments) |
| `store-and-forward.db` | S&F message buffer | Variable (grows during outages) |
### Monitoring SQLite Size
```powershell
# Check SQLite file sizes
Get-ChildItem C:\ScadaLink\data\*.db | Select-Object Name, @{N='SizeMB';E={[math]::Round($_.Length/1MB,2)}}
```
### S&F Database Growth
The S&F database has **no max buffer size** by design. During extended outages, it can grow significantly.
**Monitoring:**
- Check buffer depth in the health dashboard.
- Alert if `store-and-forward.db` exceeds 1 GB.
**Manual cleanup (if needed):**
1. Identify and discard permanently undeliverable parked messages via the central UI.
2. If the database is very large and the site is healthy, the messages will be delivered and removed automatically.
### SQLite Vacuum
SQLite does not reclaim disk space after deleting rows unless `auto_vacuum` is enabled; the file retains its high-water-mark size. Periodically vacuum to reclaim space:
```powershell
# Stop the ScadaLink service first
sc.exe stop ScadaLink-Site
# Vacuum the S&F database
sqlite3 C:\ScadaLink\data\store-and-forward.db "VACUUM;"
# Restart the service
sc.exe start ScadaLink-Site
```
**Important:** Only vacuum when the service is stopped. `VACUUM` requires exclusive access to the database file and will fail if other connections hold locks.
### SQLite Backup
```powershell
# Hot backup using SQLite backup API (safe while service is running)
sqlite3 C:\ScadaLink\data\site.db ".backup C:\Backups\site-$(Get-Date -Format yyyyMMdd).db"
sqlite3 C:\ScadaLink\data\store-and-forward.db ".backup C:\Backups\sf-$(Get-Date -Format yyyyMMdd).db"
```
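It is also worth verifying each backup file after it is written. A suggested check, assuming the `sqlite3` CLI used above:

```powershell
# Should print "ok"; anything else means the backup copy is corrupt
sqlite3 "C:\Backups\site-$(Get-Date -Format yyyyMMdd).db" "PRAGMA integrity_check;"
```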
## Log Rotation
### Serilog File Sink
ScadaLink uses Serilog's rolling file sink with daily rotation:
- New file created each day: `scadalink-20260316.log`
- Files are not automatically deleted.
### Log Retention Policy
Implement a scheduled task to delete old log files:
```powershell
# Delete log files older than 30 days
Get-ChildItem C:\ScadaLink\logs\scadalink-*.log |
Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-30) } |
Remove-Item -Force
```
Schedule this as a Windows Task:
```powershell
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -Command `"Get-ChildItem C:\ScadaLink\logs\scadalink-*.log | Where-Object { `$_.LastWriteTime -lt (Get-Date).AddDays(-30) } | Remove-Item -Force`""
$trigger = New-ScheduledTaskTrigger -Daily -At "03:00"
Register-ScheduledTask -TaskName "ScadaLink-LogCleanup" -Action $action -Trigger $trigger -Description "Clean up ScadaLink log files older than 30 days"
```
### Log Disk Space
Monitor disk space on all nodes:
```powershell
Get-PSDrive C | Select-Object @{N='UsedGB';E={[math]::Round($_.Used/1GB,1)}}, @{N='FreeGB';E={[math]::Round($_.Free/1GB,1)}}
```
Alert if free space drops below 5 GB.
## Site Event Log Maintenance
### Automatic Purge
The Site Event Logging component has built-in purge:
- **Retention**: 30 days (configurable via `SiteEventLog:RetentionDays`)
- **Storage cap**: 1 GB (configurable via `SiteEventLog:MaxStorageMB`)
- **Purge interval**: Every 24 hours (configurable via `SiteEventLog:PurgeIntervalHours`)
No manual intervention needed under normal conditions.
### Manual Purge (Emergency)
If event log storage is consuming excessive disk space:
```powershell
# Stop the service
sc.exe stop ScadaLink-Site
# Delete the event log database and let it be recreated
Remove-Item C:\ScadaLink\data\event-log.db
# Restart the service
sc.exe start ScadaLink-Site
```
## Certificate Management
### LDAP Certificates
If using LDAPS (port 636), the LDAP server's TLS certificate must be trusted:
1. Export the CA certificate from Active Directory.
2. Import into the Windows certificate store on both central nodes.
3. Restart the ScadaLink service.
### OPC UA Certificates
OPC UA connections may require certificate trust configuration:
1. On first connection, the OPC UA client generates a self-signed certificate.
2. The OPC UA server must trust this certificate.
3. If the site node is replaced, a new certificate is generated; update the server trust list.
## Scheduled Maintenance Window
### Recommended Procedure
1. **Notify operators** that the system will be in maintenance mode.
2. **Gracefully stop the standby node** first (allows singleton to remain on active).
3. Perform maintenance on the standby node (OS updates, disk cleanup, etc.).
4. **Start the standby node** and verify it joins the cluster.
5. **Gracefully stop the active node** (CoordinatedShutdown migrates singletons to the now-running standby).
6. Perform maintenance on the former active node.
7. **Start the former active node** — it rejoins as standby.
This procedure maintains availability throughout the maintenance window.
### Emergency Maintenance (Both Nodes)
If both nodes must be stopped simultaneously:
1. Stop both nodes.
2. Perform maintenance.
3. Start one node (it forms a single-node cluster).
4. Verify health.
5. Start the second node.
Sites continue operating independently during central maintenance. Site-buffered data (S&F) will be delivered when central communication restores.

# ScadaLink Troubleshooting Guide
## Log Analysis
### Log Location
- **File logs:** `C:\ScadaLink\logs\scadalink-YYYYMMDD.log`
- **Console output:** Available when running interactively (not as a Windows Service)
### Log Format
```
[14:32:05 INF] [Central/central-01] Template "PumpStation" saved by admin
```
Format: `[Time Level] [NodeRole/NodeHostname] Message`
All log entries are enriched with:
- `SiteId` — Site identifier (or "central" for central nodes)
- `NodeHostname` — Machine hostname
- `NodeRole` — "Central" or "Site"
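The fixed prefix makes the format easy to parse mechanically; a Python sketch (the regex is an assumption derived from the sample line above, and Serilog's three-letter level tokens):

```python
import re

# Matches the documented format: [Time Level] [NodeRole/NodeHostname] Message
LINE_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2}) (?P<level>[A-Z]{3})\] "
    r"\[(?P<role>[^/\]]+)/(?P<host>[^\]]+)\] (?P<message>.*)"
)


def parse_line(line):
    """Return the structured fields of a log line, or None if it does not match."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None


sample = '[14:32:05 INF] [Central/central-01] Template "PumpStation" saved by admin'
fields = parse_line(sample)
print(fields["level"], fields["role"], fields["host"])
```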
### Key Log Patterns
| Pattern | Meaning |
|---------|---------|
| `Starting ScadaLink host as {Role}` | Node startup |
| `Member joined` | Cluster peer connected |
| `Member removed` | Cluster peer departed |
| `Singleton acquired` | This node became the active singleton holder |
| `Instance {Name}: created N script actors` | Instance successfully deployed |
| `Script {Name} failed trust validation` | Script uses forbidden API |
| `Immediate delivery to {Target} failed` | S&F transient failure, message buffered |
| `Message {Id} parked` | S&F max retries reached |
| `Site {SiteId} marked offline` | No health report for 60 seconds |
| `Rejecting stale report` | Out-of-order health report (normal during failover) |
### Filtering Logs
Use the structured log properties for targeted analysis:
```powershell
# Find all errors for a specific site
Select-String -Path "logs\scadalink-*.log" -Pattern "\[ERR\].*site-01"
# Find S&F activity
Select-String -Path "logs\scadalink-*.log" -Pattern "store-and-forward|buffered|parked"
# Find failover events
Select-String -Path "logs\scadalink-*.log" -Pattern "Singleton|Member joined|Member removed"
```
## Common Issues
### Issue: Site Appears Offline in Health Dashboard
**Possible causes:**
1. Site nodes are actually down.
2. Network connectivity between site and central is broken.
3. Health report interval has not elapsed since site startup.
**Diagnosis:**
1. Check if the site service is running: `sc.exe query ScadaLink-Site`
2. Check site logs for errors.
3. Verify network: `Test-NetConnection -ComputerName central-01.example.com -Port 8081`
4. Wait 60 seconds (the offline detection threshold).
**Resolution:**
- If the service is stopped, start it.
- If network is blocked, open firewall port 8081.
- If the site just started, wait for the first health report (30-second interval).
### Issue: Deployment Stuck in "InProgress"
**Possible causes:**
1. Site is unreachable during deployment.
2. Central node failed over mid-deployment.
3. Instance compilation failed on site.
**Diagnosis:**
1. Check deployment status in the UI.
2. Check site logs for the deployment ID: `Select-String "dep-XXXXX"`
3. Check central logs for the deployment ID.
**Resolution:**
- If the site is unreachable: fix connectivity, then re-deploy (idempotent by revision hash).
- If compilation failed: check the script errors in site logs, fix the template, re-deploy.
- If stuck after failover: the new central node will re-query site state; wait or manually re-deploy.
### Issue: S&F Messages Accumulating
**Possible causes:**
1. External system is down.
2. SMTP server is unreachable.
3. Network issues between site and external target.
**Diagnosis:**
1. Check S&F buffer depth in health dashboard.
2. Check site logs for retry activity and error messages.
3. Verify external system connectivity from the site node.
**Resolution:**
- Fix the external system / SMTP / network issue. Retries resume automatically.
- If messages are permanently undeliverable: park and discard via the central UI.
- Check parked messages for patterns (same target, same error).
### Issue: OPC UA Connection Keeps Disconnecting
**Possible causes:**
1. OPC UA server is unstable.
2. Network intermittency.
3. Certificate trust issues.
**Diagnosis:**
1. Check DCL logs: look for "Entering Reconnecting state" frequency.
2. Check health dashboard: data connection status for the affected connection.
3. Verify OPC UA server health independently.
**Resolution:**
- DCL auto-reconnects at the configured interval (default 5 seconds).
- If the server certificate changed, update the trust store.
- If the server is consistently unstable, investigate the OPC UA server directly.
### Issue: Script Execution Errors
**Possible causes:**
1. Script timeout (default 30 seconds).
2. Runtime exception in script code.
3. Script references external system that is down.
**Diagnosis:**
1. Check health dashboard: script error count per interval.
2. Check site logs for the script name and error details.
3. Check if the script uses `ExternalSystem.Call()` — the target may be down.
**Resolution:**
- If timeout: optimize the script or increase the timeout in configuration.
- If runtime error: fix the script in the template editor, re-deploy.
- If external system is down: script errors will stop when the system recovers.
### Issue: Login Fails but LDAP Server is Up
**Possible causes:**
1. Incorrect LDAP search base DN.
2. User account is locked in AD.
3. LDAP group-to-role mapping does not include a required group.
4. TLS certificate issue on LDAP connection.
**Diagnosis:**
1. Check central logs for LDAP bind errors.
2. Verify LDAP connectivity: `Test-NetConnection -ComputerName ldap.example.com -Port 636`
3. Test LDAP bind manually using an LDAP browser tool.
**Resolution:**
- Fix the LDAP configuration.
- Unlock the user account in AD.
- Update group mappings in the configuration database.
### Issue: High Dead Letter Count
**Possible causes:**
1. Messages being sent to actors that no longer exist (e.g., after instance deletion).
2. Actor mailbox overflow.
3. Misconfigured actor paths after deployment changes.
**Diagnosis:**
1. Check health dashboard: dead letter count trend.
2. Check site logs for dead letter details (actor path, message type).
**Resolution:**
- Dead letters during failover are expected and transient.
- Persistent dead letters indicate a configuration or code issue.
- If dead letters reference deleted instances, they are harmless (S&F messages are retained by design).
## Health Dashboard Interpretation
### Metric: Data Connection Status
| Status | Meaning | Action |
|--------|---------|--------|
| Connected | OPC UA connection active | None |
| Disconnected | Connection lost, auto-reconnecting | Check OPC UA server |
| Connecting | Initial connection in progress | Wait |
### Metric: Tag Resolution
- `TotalSubscribed`: Number of tags the system is trying to monitor.
- `SuccessfullyResolved`: Tags with active subscriptions.
- Gap indicates unresolved tags (devices still booting or path errors).
### Metric: S&F Buffer Depth
- `ExternalSystem`: Messages to external REST APIs awaiting delivery.
- `Notification`: Email notifications awaiting SMTP delivery.
- Growing depth indicates the target system is unreachable.
### Metric: Error Counts (Per Interval)
- Counts reset every 30 seconds (health report interval).
- Raw counts, not rates — compare across intervals.
- Occasional script errors during failover are expected.