Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs

- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)
92 new tests, all passing. Zero warnings.
This commit is contained in:
Joseph Doherty
2026-03-16 22:12:31 -04:00
parent 3b2320bd35
commit b659978764
68 changed files with 6253 additions and 44 deletions

# ScadaLink Failover Procedures
## Automatic Failover (No Intervention Required)
### Central Cluster Failover
**What happens automatically:**
1. Active central node becomes unreachable (process crash, network failure, hardware failure).
2. Akka.NET failure detection triggers after ~10 seconds (2s heartbeat, 10s threshold).
3. Split-brain resolver (keep-oldest) evaluates cluster state for 15 seconds (stable-after).
4. Standby node is promoted to active. Total time: ~25 seconds.
5. Cluster singletons migrate to the new active node.
6. Load balancer detects the failed node via `/health/ready` and routes traffic to the surviving node.
7. Active user sessions continue (JWT tokens are validated by the new node using the shared signing key).
8. SignalR connections are dropped and Blazor clients automatically reconnect.
**What is preserved:**
- All configuration and deployment state (stored in SQL Server)
- Active JWT sessions (shared signing key)
- Deployment status records (SQL Server with optimistic concurrency)
**What is temporarily disrupted:**
- In-flight deployments: Central re-queries site state and re-issues if needed (idempotent)
- Real-time debug view streams: Clients reconnect automatically
- Health dashboard: Resumes on reconnect
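
The detection and resolution timings above come from the cluster configuration. A minimal HOCON sketch (the key names are Akka.NET's standard failure-detector and split-brain-resolver settings; mapping the 10-second threshold to `acceptable-heartbeat-pause` is an assumption about this deployment):

```hocon
akka.cluster {
  # 2s heartbeat, ~10s to declare a peer unreachable
  failure-detector {
    heartbeat-interval = 2s
    acceptable-heartbeat-pause = 10s
  }
  # Keep-oldest split-brain resolution, as described in the scenarios below
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    stable-after = 15s
    keep-oldest {
      down-if-alone = on
    }
  }
}
```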
### Site Cluster Failover
**What happens automatically:**
1. Active site node becomes unreachable.
2. Failure detection and split-brain resolution (~25 seconds total).
3. Site Deployment Manager singleton migrates to standby.
4. Instance Actors are recreated from persisted SQLite configurations.
5. Staggered startup: 50ms delay between instance creations to prevent reconnection storms.
6. DCL connection actors reconnect to OPC UA servers.
7. Script Actors and Alarm Actors resume processing from incoming values (no stale state).
8. S&F buffer is read from SQLite — pending retries resume.
**What is preserved:**
- Deployed instance configurations (SQLite)
- Static attribute overrides (SQLite)
- S&F message buffer (SQLite)
- Site event logs (SQLite)
**What is temporarily disrupted:**
- Tag value subscriptions: DCL reconnects and re-subscribes transparently
- Active script executions: Cancelled; trigger fires again on next value change
- Alarm states: Re-evaluated from incoming tag values (correct state within one update cycle)
## Manual Intervention Scenarios
### Scenario 1: Both Central Nodes Down
**Symptoms:** No central UI access, sites report "central unreachable" in logs.
**Recovery:**
1. Start either central node. It will form a single-node cluster.
2. Verify SQL Server is accessible.
3. Check `/health/ready` returns 200.
4. Start the second node. It will join the cluster automatically.
5. Verify both nodes appear in the Akka.NET cluster member list (check logs for "Member joined").
**No data loss:** All state is in SQL Server.
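The readiness checks in steps 3–5 can be scripted against both nodes (a sketch; the hostnames and port are placeholders, substitute your deployment's values):

```powershell
# Hypothetical hostnames and port -- replace with your environment's values.
foreach ($node in 'central-01', 'central-02') {
    try {
        $r = Invoke-WebRequest -Uri "http://${node}:8080/health/ready" -UseBasicParsing -TimeoutSec 5
        Write-Output "$node : $($r.StatusCode)"
    }
    catch {
        Write-Output "$node : unreachable ($($_.Exception.Message))"
    }
}
```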
### Scenario 2: Both Site Nodes Down
**Symptoms:** Site appears offline in central health dashboard.
**Recovery:**
1. Start either site node.
2. Check logs for "Store-and-forward SQLite storage initialized".
3. Verify instance actors are recreated: "Instance {Name}: created N script actors and M alarm actors".
4. Start the second site node.
5. Verify the site appears online in the central health dashboard within 60 seconds.
**No data loss:** All state is in SQLite.
### Scenario 3: Split-Brain (Network Partition Between Peers)
**Symptoms:** Both nodes believe they are the active node. Logs show "Cluster partition detected".
**How the system handles it:**
- Keep-oldest resolver: the older node (first to join the cluster) survives; the younger side is downed.
- `down-if-alone = on`: if the oldest node ends up alone (cut off from all peers), it downs itself so the other side can take over.
- Stable-after (15s): the resolver waits 15 seconds for the partition to stabilize before acting.
**Manual intervention (if auto-resolution fails):**
1. Stop both nodes.
2. Start the preferred node first (it becomes the "oldest").
3. Start the second node.
### Scenario 4: SQL Server Outage (Central)
**Symptoms:** Central UI returns errors. `/health/ready` returns 503. Logs show database connection failures.
**Impact:**
- Active sessions with valid JWTs can still access cached UI state.
- New logins fail (LDAP auth still works but role mapping requires DB).
- Template changes and deployments fail.
- Sites continue operating independently.
**Recovery:**
1. Restore SQL Server access.
2. Central nodes will automatically reconnect (EF Core connection resiliency).
3. Verify `/health/ready` returns 200.
4. No manual intervention needed on ScadaLink nodes.
### Scenario 5: Forced Singleton Migration
**When to use:** The active node is degraded but not crashed (e.g., high CPU, disk full).
**Procedure:**
1. Initiate graceful shutdown on the degraded node:
- Stop the Windows Service: `sc.exe stop ScadaLink-Central`
- CoordinatedShutdown will migrate singletons to the standby.
2. Wait for the standby to take over (check logs for "Singleton acquired").
3. Fix the issue on the original node.
4. Restart the service. It will rejoin as standby.
## Failover Timeline
```
T+0s    Node fails (crash, network loss, hardware fault)
T+2s    First heartbeat missed
T+10s   Failure detector threshold reached; node marked unreachable
T+10s   Split-brain resolver begins stable-after countdown
T+25s   Resolver acts: surviving node promoted
T+25s Singleton migration begins
T+26s Instance Actors start recreating (staggered)
T+30s Health report sent from new active node
T+60s All instances operational (500 instances * 50ms stagger = 25s)
```

# ScadaLink Maintenance Procedures
## SQL Server Maintenance (Central)
### Regular Maintenance Schedule
| Task | Frequency | Window |
|------|-----------|--------|
| Index rebuild | Weekly | Off-peak hours |
| Statistics update | Daily | Automated |
| Backup (full) | Daily | Off-peak hours |
| Backup (differential) | Every 4 hours | Anytime |
| Backup (transaction log) | Every 15 minutes | Anytime |
| Integrity check (DBCC CHECKDB) | Weekly | Off-peak hours |
### Index Maintenance
```sql
-- Rebuild fragmented indexes on the configuration database.
-- Note: ONLINE = ON requires Enterprise Edition; on Standard Edition,
-- omit it and rebuild during a maintenance window.
USE ScadaLink;
EXEC sp_MSforeachtable 'ALTER INDEX ALL ON ? REBUILD WITH (ONLINE = ON)';
```
For large tables (AuditLogEntries, DeploymentRecords), consider filtered rebuilds:
```sql
ALTER INDEX IX_AuditLogEntries_Timestamp ON AuditLogEntries REBUILD
WITH (ONLINE = ON, FILLFACTOR = 90);
```
### Audit Log Retention
The AuditLogEntries table grows continuously. Implement a retention policy:
```sql
-- Delete audit entries older than 1 year
DELETE FROM AuditLogEntries
WHERE Timestamp < DATEADD(YEAR, -1, GETUTCDATE());
```
Consider partitioning the AuditLogEntries table by month for efficient purging.
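On a large table, the single `DELETE` above runs as one long transaction and can bloat the log. A batched variant keeps each transaction small (a sketch; the 10,000-row batch size is an assumption, tune it for your environment):

```sql
-- Delete in batches so each transaction stays small
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (10000) FROM AuditLogEntries
    WHERE [Timestamp] < DATEADD(YEAR, -1, GETUTCDATE());
    SET @rows = @@ROWCOUNT;
END;
```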
### Database Growth Monitoring
```sql
-- Check database sizes
EXEC sp_helpdb 'ScadaLink';
EXEC sp_helpdb 'ScadaLink_MachineData';
-- Check table sizes
SELECT
    t.name AS TableName,
    p.rows AS [RowCount],   -- bracketed: ROWCOUNT is a reserved keyword
    SUM(a.total_pages) * 8 / 1024.0 AS TotalSpaceMB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
GROUP BY t.name, p.rows
ORDER BY TotalSpaceMB DESC;
```
## SQLite Management (Site)
### Database Files
| File | Purpose | Growth Pattern |
|------|---------|---------------|
| `site.db` | Deployed configs, static overrides | Stable (grows with deployments) |
| `store-and-forward.db` | S&F message buffer | Variable (grows during outages) |
### Monitoring SQLite Size
```powershell
# Check SQLite file sizes
Get-ChildItem C:\ScadaLink\data\*.db | Select-Object Name, @{N='SizeMB';E={[math]::Round($_.Length/1MB,2)}}
```
### S&F Database Growth
The S&F database has **no max buffer size** by design. During extended outages, it can grow significantly.
**Monitoring:**
- Check buffer depth in the health dashboard.
- Alert if `store-and-forward.db` exceeds 1 GB.
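The 1 GB alert can be scripted as a scheduled check (a sketch; the path is the data directory used throughout this guide):

```powershell
# Warn when the S&F buffer database exceeds 1 GB
$sf = Get-Item 'C:\ScadaLink\data\store-and-forward.db'
if ($sf.Length -gt 1GB) {
    Write-Warning ("store-and-forward.db is {0:N1} GB" -f ($sf.Length / 1GB))
}
```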
**Manual cleanup (if needed):**
1. Identify and discard permanently undeliverable parked messages via the central UI.
2. If the database is very large and the site is healthy, the messages will be delivered and removed automatically.
### SQLite Vacuum
By default (without `auto_vacuum`), SQLite keeps freed pages for internal reuse rather than returning disk space to the operating system. Periodically vacuum:
```powershell
# Stop the ScadaLink service first
sc.exe stop ScadaLink-Site
# Vacuum the S&F database
sqlite3 C:\ScadaLink\data\store-and-forward.db "VACUUM;"
# Restart the service
sc.exe start ScadaLink-Site
```
**Important:** Only vacuum when the service is stopped. `VACUUM` needs exclusive access to the database and will fail if other connections hold open transactions.
### SQLite Backup
```powershell
# Hot backup using SQLite backup API (safe while service is running)
sqlite3 C:\ScadaLink\data\site.db ".backup C:\Backups\site-$(Get-Date -Format yyyyMMdd).db"
sqlite3 C:\ScadaLink\data\store-and-forward.db ".backup C:\Backups\sf-$(Get-Date -Format yyyyMMdd).db"
```
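After taking a backup, it is worth confirming the copy is readable (a sketch using the standard `sqlite3` CLI; `PRAGMA integrity_check` prints `ok` for a healthy database):

```powershell
# Verify the backup copy is a consistent SQLite database (expect "ok")
sqlite3 "C:\Backups\site-$(Get-Date -Format yyyyMMdd).db" "PRAGMA integrity_check;"
```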
## Log Rotation
### Serilog File Sink
ScadaLink uses Serilog's rolling file sink with daily rotation:
- New file created each day: `scadalink-20260316.log`
- Files are not automatically deleted.
### Log Retention Policy
Implement a scheduled task to delete old log files:
```powershell
# Delete log files older than 30 days
Get-ChildItem C:\ScadaLink\logs\scadalink-*.log |
Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-30) } |
Remove-Item -Force
```
Schedule this as a Windows Task:
```powershell
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -Command `"Get-ChildItem C:\ScadaLink\logs\scadalink-*.log | Where-Object { `$_.LastWriteTime -lt (Get-Date).AddDays(-30) } | Remove-Item -Force`""
$trigger = New-ScheduledTaskTrigger -Daily -At "03:00"
Register-ScheduledTask -TaskName "ScadaLink-LogCleanup" -Action $action -Trigger $trigger -Description "Clean up ScadaLink log files older than 30 days"
```
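Alternatively, Serilog's file sink can prune old files itself via `retainedFileCountLimit` (a sketch, assuming logging is configured through `Serilog.Settings.Configuration` in `appsettings.json`; the path shown is this guide's log directory):

```json
"Serilog": {
  "WriteTo": [
    {
      "Name": "File",
      "Args": {
        "path": "C:\\ScadaLink\\logs\\scadalink-.log",
        "rollingInterval": "Day",
        "retainedFileCountLimit": 30
      }
    }
  ]
}
```

With this set, a scheduled cleanup task becomes a backstop rather than the primary mechanism.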
### Log Disk Space
Monitor disk space on all nodes:
```powershell
Get-PSDrive C | Select-Object @{N='UsedGB';E={[math]::Round($_.Used/1GB,1)}}, @{N='FreeGB';E={[math]::Round($_.Free/1GB,1)}}
```
Alert if free space drops below 5 GB.
## Site Event Log Maintenance
### Automatic Purge
The Site Event Logging component has built-in purge:
- **Retention**: 30 days (configurable via `SiteEventLog:RetentionDays`)
- **Storage cap**: 1 GB (configurable via `SiteEventLog:MaxStorageMB`)
- **Purge interval**: Every 24 hours (configurable via `SiteEventLog:PurgeIntervalHours`)
No manual intervention needed under normal conditions.
### Manual Purge (Emergency)
If event log storage is consuming excessive disk space:
```powershell
# Stop the service
sc.exe stop ScadaLink-Site
# Delete the event log database (and any -wal/-shm sidecar files) and let it be recreated
Remove-Item C:\ScadaLink\data\event-log.db*
# Restart the service
sc.exe start ScadaLink-Site
```
## Certificate Management
### LDAP Certificates
If using LDAPS (port 636), the LDAP server's TLS certificate must be trusted:
1. Export the CA certificate from Active Directory.
2. Import into the Windows certificate store on both central nodes.
3. Restart the ScadaLink service.
### OPC UA Certificates
OPC UA connections may require certificate trust configuration:
1. On first connection, the OPC UA client generates a self-signed certificate.
2. The OPC UA server must trust this certificate.
3. If the site node is replaced, a new certificate is generated; update the server trust list.
## Scheduled Maintenance Window
### Recommended Procedure
1. **Notify operators** that the system will be in maintenance mode.
2. **Gracefully stop the standby node** first (allows singleton to remain on active).
3. Perform maintenance on the standby node (OS updates, disk cleanup, etc.).
4. **Start the standby node** and verify it joins the cluster.
5. **Gracefully stop the active node** (CoordinatedShutdown migrates singletons to the now-running standby).
6. Perform maintenance on the former active node.
7. **Start the former active node** — it rejoins as standby.
This procedure maintains availability throughout the maintenance window.
### Emergency Maintenance (Both Nodes)
If both nodes must be stopped simultaneously:
1. Stop both nodes.
2. Perform maintenance.
3. Start one node (it forms a single-node cluster).
4. Verify health.
5. Start the second node.
Sites continue operating independently during central maintenance. Site-buffered data (S&F) will be delivered once communication with central is restored.

# ScadaLink Troubleshooting Guide
## Log Analysis
### Log Location
- **File logs:** `C:\ScadaLink\logs\scadalink-YYYYMMDD.log`
- **Console output:** Available when running interactively (not as a Windows Service)
### Log Format
```
[14:32:05 INF] [Central/central-01] Template "PumpStation" saved by admin
```
Format: `[Time Level] [NodeRole/NodeHostname] Message`
All log entries are enriched with:
- `SiteId` — Site identifier (or "central" for central nodes)
- `NodeHostname` — Machine hostname
- `NodeRole` — "Central" or "Site"
### Key Log Patterns
| Pattern | Meaning |
|---------|---------|
| `Starting ScadaLink host as {Role}` | Node startup |
| `Member joined` | Cluster peer connected |
| `Member removed` | Cluster peer departed |
| `Singleton acquired` | This node became the active singleton holder |
| `Instance {Name}: created N script actors` | Instance successfully deployed |
| `Script {Name} failed trust validation` | Script uses forbidden API |
| `Immediate delivery to {Target} failed` | S&F transient failure, message buffered |
| `Message {Id} parked` | S&F max retries reached |
| `Site {SiteId} marked offline` | No health report for 60 seconds |
| `Rejecting stale report` | Out-of-order health report (normal during failover) |
### Filtering Logs
Use the structured log properties for targeted analysis:
```powershell
# Find all errors for a specific site
Select-String -Path "logs\scadalink-*.log" -Pattern "\[ERR\].*site-01"
# Find S&F activity
Select-String -Path "logs\scadalink-*.log" -Pattern "store-and-forward|buffered|parked"
# Find failover events
Select-String -Path "logs\scadalink-*.log" -Pattern "Singleton|Member joined|Member removed"
```
## Common Issues
### Issue: Site Appears Offline in Health Dashboard
**Possible causes:**
1. Site nodes are actually down.
2. Network connectivity between site and central is broken.
3. Health report interval has not elapsed since site startup.
**Diagnosis:**
1. Check if the site service is running: `sc.exe query ScadaLink-Site`
2. Check site logs for errors.
3. Verify network: `Test-NetConnection -ComputerName central-01.example.com -Port 8081`
4. Wait 60 seconds (the offline detection threshold).
**Resolution:**
- If the service is stopped, start it.
- If network is blocked, open firewall port 8081.
- If the site just started, wait for the first health report (30-second interval).
### Issue: Deployment Stuck in "InProgress"
**Possible causes:**
1. Site is unreachable during deployment.
2. Central node failed over mid-deployment.
3. Instance compilation failed on site.
**Diagnosis:**
1. Check deployment status in the UI.
2. Check site logs for the deployment ID: `Select-String "dep-XXXXX"`
3. Check central logs for the deployment ID.
**Resolution:**
- If the site is unreachable: fix connectivity, then re-deploy (idempotent by revision hash).
- If compilation failed: check the script errors in site logs, fix the template, re-deploy.
- If stuck after failover: the new central node will re-query site state; wait or manually re-deploy.
### Issue: S&F Messages Accumulating
**Possible causes:**
1. External system is down.
2. SMTP server is unreachable.
3. Network issues between site and external target.
**Diagnosis:**
1. Check S&F buffer depth in health dashboard.
2. Check site logs for retry activity and error messages.
3. Verify external system connectivity from the site node.
**Resolution:**
- Fix the external system / SMTP / network issue. Retries resume automatically.
- If messages are permanently undeliverable: park and discard via the central UI.
- Check parked messages for patterns (same target, same error).
### Issue: OPC UA Connection Keeps Disconnecting
**Possible causes:**
1. OPC UA server is unstable.
2. Network intermittency.
3. Certificate trust issues.
**Diagnosis:**
1. Check DCL logs: look for "Entering Reconnecting state" frequency.
2. Check health dashboard: data connection status for the affected connection.
3. Verify OPC UA server health independently.
**Resolution:**
- DCL auto-reconnects at the configured interval (default 5 seconds).
- If the server certificate changed, update the trust store.
- If the server is consistently unstable, investigate the OPC UA server directly.
### Issue: Script Execution Errors
**Possible causes:**
1. Script timeout (default 30 seconds).
2. Runtime exception in script code.
3. Script references external system that is down.
**Diagnosis:**
1. Check health dashboard: script error count per interval.
2. Check site logs for the script name and error details.
3. Check if the script uses `ExternalSystem.Call()` — the target may be down.
**Resolution:**
- If timeout: optimize the script or increase the timeout in configuration.
- If runtime error: fix the script in the template editor, re-deploy.
- If external system is down: script errors will stop when the system recovers.
### Issue: Login Fails but LDAP Server is Up
**Possible causes:**
1. Incorrect LDAP search base DN.
2. User account is locked in AD.
3. LDAP group-to-role mapping does not include a required group.
4. TLS certificate issue on LDAP connection.
**Diagnosis:**
1. Check central logs for LDAP bind errors.
2. Verify LDAP connectivity: `Test-NetConnection -ComputerName ldap.example.com -Port 636`
3. Test LDAP bind manually using an LDAP browser tool.
**Resolution:**
- Fix the LDAP configuration.
- Unlock the user account in AD.
- Update group mappings in the configuration database.
### Issue: High Dead Letter Count
**Possible causes:**
1. Messages being sent to actors that no longer exist (e.g., after instance deletion).
2. Actor mailbox overflow.
3. Misconfigured actor paths after deployment changes.
**Diagnosis:**
1. Check health dashboard: dead letter count trend.
2. Check site logs for dead letter details (actor path, message type).
**Resolution:**
- Dead letters during failover are expected and transient.
- Persistent dead letters indicate a configuration or code issue.
- If dead letters reference deleted instances, they are harmless (S&F messages are retained by design).
## Health Dashboard Interpretation
### Metric: Data Connection Status
| Status | Meaning | Action |
|--------|---------|--------|
| Connected | OPC UA connection active | None |
| Disconnected | Connection lost, auto-reconnecting | Check OPC UA server |
| Connecting | Initial connection in progress | Wait |
### Metric: Tag Resolution
- `TotalSubscribed`: Number of tags the system is trying to monitor.
- `SuccessfullyResolved`: Tags with active subscriptions.
- Gap indicates unresolved tags (devices still booting or path errors).
### Metric: S&F Buffer Depth
- `ExternalSystem`: Messages to external REST APIs awaiting delivery.
- `Notification`: Email notifications awaiting SMTP delivery.
- Growing depth indicates the target system is unreachable.
### Metric: Error Counts (Per Interval)
- Counts reset every 30 seconds (health report interval).
- Raw counts, not rates — compare across intervals.
- Occasional script errors during failover are expected.