Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs

- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)
92 new tests, all passing. Zero warnings.
This commit is contained in:
Joseph Doherty
2026-03-16 22:12:31 -04:00
parent 3b2320bd35
commit b659978764
68 changed files with 6253 additions and 44 deletions

# ScadaLink Failover Procedures
## Automatic Failover (No Intervention Required)
### Central Cluster Failover
**What happens automatically:**
1. Active central node becomes unreachable (process crash, network failure, hardware failure).
2. Akka.NET failure detection triggers after ~10 seconds (2s heartbeat, 10s threshold).
3. Split-brain resolver (keep-oldest) evaluates cluster state for 15 seconds (stable-after).
4. Standby node is promoted to active. Total time: ~25 seconds.
5. Cluster singletons migrate to the new active node.
6. Load balancer detects the failed node via `/health/ready` and routes traffic to the surviving node.
7. Active user sessions continue (JWT tokens are validated by the new node using the shared signing key).
8. SignalR connections are dropped and Blazor clients automatically reconnect.
**What is preserved:**
- All configuration and deployment state (stored in SQL Server)
- Active JWT sessions (shared signing key)
- Deployment status records (SQL Server with optimistic concurrency)
**What is temporarily disrupted:**
- In-flight deployments: Central re-queries site state and re-issues if needed (idempotent)
- Real-time debug view streams: Clients reconnect automatically
- Health dashboard: Resumes on reconnect
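
The detection and resolution timings above come from the cluster configuration. A minimal HOCON sketch (the key names are Akka.NET's standard failure-detector and split-brain-resolver settings; mapping the 10-second threshold to `acceptable-heartbeat-pause` is an assumption about this deployment):

```hocon
akka.cluster {
  # 2s heartbeat, ~10s to declare a peer unreachable
  failure-detector {
    heartbeat-interval = 2s
    acceptable-heartbeat-pause = 10s
  }
  # Keep-oldest split-brain resolution, as described in the scenarios below
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    stable-after = 15s
    keep-oldest {
      down-if-alone = on
    }
  }
}
```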
### Site Cluster Failover
**What happens automatically:**
1. Active site node becomes unreachable.
2. Failure detection and split-brain resolution (~25 seconds total).
3. Site Deployment Manager singleton migrates to standby.
4. Instance Actors are recreated from persisted SQLite configurations.
5. Staggered startup: 50ms delay between instance creations to prevent reconnection storms.
6. DCL connection actors reconnect to OPC UA servers.
7. Script Actors and Alarm Actors resume processing from incoming values (no stale state).
8. S&F buffer is read from SQLite — pending retries resume.
**What is preserved:**
- Deployed instance configurations (SQLite)
- Static attribute overrides (SQLite)
- S&F message buffer (SQLite)
- Site event logs (SQLite)
**What is temporarily disrupted:**
- Tag value subscriptions: DCL reconnects and re-subscribes transparently
- Active script executions: Cancelled; trigger fires again on next value change
- Alarm states: Re-evaluated from incoming tag values (correct state within one update cycle)
## Manual Intervention Scenarios
### Scenario 1: Both Central Nodes Down
**Symptoms:** No central UI access, sites report "central unreachable" in logs.
**Recovery:**
1. Start either central node. It will form a single-node cluster.
2. Verify SQL Server is accessible.
3. Check `/health/ready` returns 200.
4. Start the second node. It will join the cluster automatically.
5. Verify both nodes appear in the Akka.NET cluster member list (check logs for "Member joined").
**No data loss:** All state is in SQL Server.
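The readiness checks in steps 3–5 can be scripted against both nodes (a sketch; the hostnames and port are placeholders, substitute your deployment's values):

```powershell
# Hypothetical hostnames and port -- replace with your environment's values.
foreach ($node in 'central-01', 'central-02') {
    try {
        $r = Invoke-WebRequest -Uri "http://${node}:8080/health/ready" -UseBasicParsing -TimeoutSec 5
        Write-Output "$node : $($r.StatusCode)"
    }
    catch {
        Write-Output "$node : unreachable ($($_.Exception.Message))"
    }
}
```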
### Scenario 2: Both Site Nodes Down
**Symptoms:** Site appears offline in central health dashboard.
**Recovery:**
1. Start either site node.
2. Check logs for "Store-and-forward SQLite storage initialized".
3. Verify instance actors are recreated: "Instance {Name}: created N script actors and M alarm actors".
4. Start the second site node.
5. Verify the site appears online in the central health dashboard within 60 seconds.
**No data loss:** All state is in SQLite.
### Scenario 3: Split-Brain (Network Partition Between Peers)
**Symptoms:** Both nodes believe they are the active node. Logs show "Cluster partition detected".
**How the system handles it:**
- Keep-oldest resolver: the older node (first to join the cluster) survives; the younger side is downed.
- `down-if-alone = on`: if the oldest node ends up alone (cut off from all peers), it downs itself so the other side can take over.
- Stable-after (15s): the resolver waits 15 seconds for the partition to stabilize before acting.
**Manual intervention (if auto-resolution fails):**
1. Stop both nodes.
2. Start the preferred node first (it becomes the "oldest").
3. Start the second node.
### Scenario 4: SQL Server Outage (Central)
**Symptoms:** Central UI returns errors. `/health/ready` returns 503. Logs show database connection failures.
**Impact:**
- Active sessions with valid JWTs can still access cached UI state.
- New logins fail (LDAP auth still works but role mapping requires DB).
- Template changes and deployments fail.
- Sites continue operating independently.
**Recovery:**
1. Restore SQL Server access.
2. Central nodes will automatically reconnect (EF Core connection resiliency).
3. Verify `/health/ready` returns 200.
4. No manual intervention needed on ScadaLink nodes.
### Scenario 5: Forced Singleton Migration
**When to use:** The active node is degraded but not crashed (e.g., high CPU, disk full).
**Procedure:**
1. Initiate graceful shutdown on the degraded node:
- Stop the Windows Service: `sc.exe stop ScadaLink-Central`
- CoordinatedShutdown will migrate singletons to the standby.
2. Wait for the standby to take over (check logs for "Singleton acquired").
3. Fix the issue on the original node.
4. Restart the service. It will rejoin as standby.
## Failover Timeline
```
T+0s    Node fails (crash, network loss, hardware fault)
T+2s    First heartbeat missed
T+10s   Failure detector threshold reached; node marked unreachable
T+10s   Split-brain resolver begins stable-after countdown
T+25s   Resolver acts: surviving node promoted
T+25s Singleton migration begins
T+26s Instance Actors start recreating (staggered)
T+30s Health report sent from new active node
T+60s All instances operational (500 instances * 50ms stagger = 25s)
```

# ScadaLink Maintenance Procedures
## SQL Server Maintenance (Central)
### Regular Maintenance Schedule
| Task | Frequency | Window |
|------|-----------|--------|
| Index rebuild | Weekly | Off-peak hours |
| Statistics update | Daily | Automated |
| Backup (full) | Daily | Off-peak hours |
| Backup (differential) | Every 4 hours | Anytime |
| Backup (transaction log) | Every 15 minutes | Anytime |
| Integrity check (DBCC CHECKDB) | Weekly | Off-peak hours |
### Index Maintenance
```sql
-- Rebuild fragmented indexes on the configuration database.
-- Note: ONLINE = ON requires Enterprise Edition; on Standard Edition,
-- omit it and rebuild during a maintenance window.
USE ScadaLink;
EXEC sp_MSforeachtable 'ALTER INDEX ALL ON ? REBUILD WITH (ONLINE = ON)';
```
For large tables (AuditLogEntries, DeploymentRecords), consider filtered rebuilds:
```sql
ALTER INDEX IX_AuditLogEntries_Timestamp ON AuditLogEntries REBUILD
WITH (ONLINE = ON, FILLFACTOR = 90);
```
### Audit Log Retention
The AuditLogEntries table grows continuously. Implement a retention policy:
```sql
-- Delete audit entries older than 1 year
DELETE FROM AuditLogEntries
WHERE Timestamp < DATEADD(YEAR, -1, GETUTCDATE());
```
Consider partitioning the AuditLogEntries table by month for efficient purging.
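On a large table, the single `DELETE` above runs as one long transaction and can bloat the log. A batched variant keeps each transaction small (a sketch; the 10,000-row batch size is an assumption, tune it for your environment):

```sql
-- Delete in batches so each transaction stays small
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (10000) FROM AuditLogEntries
    WHERE [Timestamp] < DATEADD(YEAR, -1, GETUTCDATE());
    SET @rows = @@ROWCOUNT;
END;
```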
### Database Growth Monitoring
```sql
-- Check database sizes
EXEC sp_helpdb 'ScadaLink';
EXEC sp_helpdb 'ScadaLink_MachineData';
-- Check table sizes
SELECT
    t.name AS TableName,
    p.rows AS [RowCount],   -- bracketed: ROWCOUNT is a reserved keyword
    SUM(a.total_pages) * 8 / 1024.0 AS TotalSpaceMB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
GROUP BY t.name, p.rows
ORDER BY TotalSpaceMB DESC;
```
## SQLite Management (Site)
### Database Files
| File | Purpose | Growth Pattern |
|------|---------|---------------|
| `site.db` | Deployed configs, static overrides | Stable (grows with deployments) |
| `store-and-forward.db` | S&F message buffer | Variable (grows during outages) |
### Monitoring SQLite Size
```powershell
# Check SQLite file sizes
Get-ChildItem C:\ScadaLink\data\*.db | Select-Object Name, @{N='SizeMB';E={[math]::Round($_.Length/1MB,2)}}
```
### S&F Database Growth
The S&F database has **no max buffer size** by design. During extended outages, it can grow significantly.
**Monitoring:**
- Check buffer depth in the health dashboard.
- Alert if `store-and-forward.db` exceeds 1 GB.
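The 1 GB alert can be scripted as a scheduled check (a sketch; the path is the data directory used throughout this guide):

```powershell
# Warn when the S&F buffer database exceeds 1 GB
$sf = Get-Item 'C:\ScadaLink\data\store-and-forward.db'
if ($sf.Length -gt 1GB) {
    Write-Warning ("store-and-forward.db is {0:N1} GB" -f ($sf.Length / 1GB))
}
```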
**Manual cleanup (if needed):**
1. Identify and discard permanently undeliverable parked messages via the central UI.
2. If the database is very large and the site is healthy, the messages will be delivered and removed automatically.
### SQLite Vacuum
By default (without `auto_vacuum`), SQLite keeps freed pages for internal reuse rather than returning disk space to the operating system. Periodically vacuum:
```powershell
# Stop the ScadaLink service first
sc.exe stop ScadaLink-Site
# Vacuum the S&F database
sqlite3 C:\ScadaLink\data\store-and-forward.db "VACUUM;"
# Restart the service
sc.exe start ScadaLink-Site
```
**Important:** Only vacuum when the service is stopped. `VACUUM` needs exclusive access to the database and will fail if other connections hold open transactions.
### SQLite Backup
```powershell
# Hot backup using SQLite backup API (safe while service is running)
sqlite3 C:\ScadaLink\data\site.db ".backup C:\Backups\site-$(Get-Date -Format yyyyMMdd).db"
sqlite3 C:\ScadaLink\data\store-and-forward.db ".backup C:\Backups\sf-$(Get-Date -Format yyyyMMdd).db"
```
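After taking a backup, it is worth confirming the copy is readable (a sketch using the standard `sqlite3` CLI; `PRAGMA integrity_check` prints `ok` for a healthy database):

```powershell
# Verify the backup copy is a consistent SQLite database (expect "ok")
sqlite3 "C:\Backups\site-$(Get-Date -Format yyyyMMdd).db" "PRAGMA integrity_check;"
```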
## Log Rotation
### Serilog File Sink
ScadaLink uses Serilog's rolling file sink with daily rotation:
- New file created each day: `scadalink-20260316.log`
- Files are not automatically deleted.
### Log Retention Policy
Implement a scheduled task to delete old log files:
```powershell
# Delete log files older than 30 days
Get-ChildItem C:\ScadaLink\logs\scadalink-*.log |
Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-30) } |
Remove-Item -Force
```
Schedule this as a Windows Task:
```powershell
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -Command `"Get-ChildItem C:\ScadaLink\logs\scadalink-*.log | Where-Object { `$_.LastWriteTime -lt (Get-Date).AddDays(-30) } | Remove-Item -Force`""
$trigger = New-ScheduledTaskTrigger -Daily -At "03:00"
Register-ScheduledTask -TaskName "ScadaLink-LogCleanup" -Action $action -Trigger $trigger -Description "Clean up ScadaLink log files older than 30 days"
```
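Alternatively, Serilog's file sink can prune old files itself via `retainedFileCountLimit` (a sketch, assuming logging is configured through `Serilog.Settings.Configuration` in `appsettings.json`; the path shown is this guide's log directory):

```json
"Serilog": {
  "WriteTo": [
    {
      "Name": "File",
      "Args": {
        "path": "C:\\ScadaLink\\logs\\scadalink-.log",
        "rollingInterval": "Day",
        "retainedFileCountLimit": 30
      }
    }
  ]
}
```

With this set, a scheduled cleanup task becomes a backstop rather than the primary mechanism.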
### Log Disk Space
Monitor disk space on all nodes:
```powershell
Get-PSDrive C | Select-Object @{N='UsedGB';E={[math]::Round($_.Used/1GB,1)}}, @{N='FreeGB';E={[math]::Round($_.Free/1GB,1)}}
```
Alert if free space drops below 5 GB.
## Site Event Log Maintenance
### Automatic Purge
The Site Event Logging component has built-in purge:
- **Retention**: 30 days (configurable via `SiteEventLog:RetentionDays`)
- **Storage cap**: 1 GB (configurable via `SiteEventLog:MaxStorageMB`)
- **Purge interval**: Every 24 hours (configurable via `SiteEventLog:PurgeIntervalHours`)
No manual intervention needed under normal conditions.
### Manual Purge (Emergency)
If event log storage is consuming excessive disk space:
```powershell
# Stop the service
sc.exe stop ScadaLink-Site
# Delete the event log database (and any -wal/-shm sidecar files) and let it be recreated
Remove-Item C:\ScadaLink\data\event-log.db*
# Restart the service
sc.exe start ScadaLink-Site
```
## Certificate Management
### LDAP Certificates
If using LDAPS (port 636), the LDAP server's TLS certificate must be trusted:
1. Export the CA certificate from Active Directory.
2. Import into the Windows certificate store on both central nodes.
3. Restart the ScadaLink service.
### OPC UA Certificates
OPC UA connections may require certificate trust configuration:
1. On first connection, the OPC UA client generates a self-signed certificate.
2. The OPC UA server must trust this certificate.
3. If the site node is replaced, a new certificate is generated; update the server trust list.
## Scheduled Maintenance Window
### Recommended Procedure
1. **Notify operators** that the system will be in maintenance mode.
2. **Gracefully stop the standby node** first (allows singleton to remain on active).
3. Perform maintenance on the standby node (OS updates, disk cleanup, etc.).
4. **Start the standby node** and verify it joins the cluster.
5. **Gracefully stop the active node** (CoordinatedShutdown migrates singletons to the now-running standby).
6. Perform maintenance on the former active node.
7. **Start the former active node** — it rejoins as standby.
This procedure maintains availability throughout the maintenance window.
### Emergency Maintenance (Both Nodes)
If both nodes must be stopped simultaneously:
1. Stop both nodes.
2. Perform maintenance.
3. Start one node (it forms a single-node cluster).
4. Verify health.
5. Start the second node.
Sites continue operating independently during central maintenance. Site-buffered data (S&F) will be delivered once communication with central is restored.

# ScadaLink Troubleshooting Guide
## Log Analysis
### Log Location
- **File logs:** `C:\ScadaLink\logs\scadalink-YYYYMMDD.log`
- **Console output:** Available when running interactively (not as a Windows Service)
### Log Format
```
[14:32:05 INF] [Central/central-01] Template "PumpStation" saved by admin
```
Format: `[Time Level] [NodeRole/NodeHostname] Message`
All log entries are enriched with:
- `SiteId` — Site identifier (or "central" for central nodes)
- `NodeHostname` — Machine hostname
- `NodeRole` — "Central" or "Site"
### Key Log Patterns
| Pattern | Meaning |
|---------|---------|
| `Starting ScadaLink host as {Role}` | Node startup |
| `Member joined` | Cluster peer connected |
| `Member removed` | Cluster peer departed |
| `Singleton acquired` | This node became the active singleton holder |
| `Instance {Name}: created N script actors` | Instance successfully deployed |
| `Script {Name} failed trust validation` | Script uses forbidden API |
| `Immediate delivery to {Target} failed` | S&F transient failure, message buffered |
| `Message {Id} parked` | S&F max retries reached |
| `Site {SiteId} marked offline` | No health report for 60 seconds |
| `Rejecting stale report` | Out-of-order health report (normal during failover) |
### Filtering Logs
Use the structured log properties for targeted analysis:
```powershell
# Find all errors for a specific site
Select-String -Path "logs\scadalink-*.log" -Pattern "\[ERR\].*site-01"
# Find S&F activity
Select-String -Path "logs\scadalink-*.log" -Pattern "store-and-forward|buffered|parked"
# Find failover events
Select-String -Path "logs\scadalink-*.log" -Pattern "Singleton|Member joined|Member removed"
```
## Common Issues
### Issue: Site Appears Offline in Health Dashboard
**Possible causes:**
1. Site nodes are actually down.
2. Network connectivity between site and central is broken.
3. Health report interval has not elapsed since site startup.
**Diagnosis:**
1. Check if the site service is running: `sc.exe query ScadaLink-Site`
2. Check site logs for errors.
3. Verify network: `Test-NetConnection -ComputerName central-01.example.com -Port 8081`
4. Wait 60 seconds (the offline detection threshold).
**Resolution:**
- If the service is stopped, start it.
- If network is blocked, open firewall port 8081.
- If the site just started, wait for the first health report (30-second interval).
### Issue: Deployment Stuck in "InProgress"
**Possible causes:**
1. Site is unreachable during deployment.
2. Central node failed over mid-deployment.
3. Instance compilation failed on site.
**Diagnosis:**
1. Check deployment status in the UI.
2. Check site logs for the deployment ID: `Select-String "dep-XXXXX"`
3. Check central logs for the deployment ID.
**Resolution:**
- If the site is unreachable: fix connectivity, then re-deploy (idempotent by revision hash).
- If compilation failed: check the script errors in site logs, fix the template, re-deploy.
- If stuck after failover: the new central node will re-query site state; wait or manually re-deploy.
### Issue: S&F Messages Accumulating
**Possible causes:**
1. External system is down.
2. SMTP server is unreachable.
3. Network issues between site and external target.
**Diagnosis:**
1. Check S&F buffer depth in health dashboard.
2. Check site logs for retry activity and error messages.
3. Verify external system connectivity from the site node.
**Resolution:**
- Fix the external system / SMTP / network issue. Retries resume automatically.
- If messages are permanently undeliverable: park and discard via the central UI.
- Check parked messages for patterns (same target, same error).
### Issue: OPC UA Connection Keeps Disconnecting
**Possible causes:**
1. OPC UA server is unstable.
2. Network intermittency.
3. Certificate trust issues.
**Diagnosis:**
1. Check DCL logs: look for "Entering Reconnecting state" frequency.
2. Check health dashboard: data connection status for the affected connection.
3. Verify OPC UA server health independently.
**Resolution:**
- DCL auto-reconnects at the configured interval (default 5 seconds).
- If the server certificate changed, update the trust store.
- If the server is consistently unstable, investigate the OPC UA server directly.
### Issue: Script Execution Errors
**Possible causes:**
1. Script timeout (default 30 seconds).
2. Runtime exception in script code.
3. Script references external system that is down.
**Diagnosis:**
1. Check health dashboard: script error count per interval.
2. Check site logs for the script name and error details.
3. Check if the script uses `ExternalSystem.Call()` — the target may be down.
**Resolution:**
- If timeout: optimize the script or increase the timeout in configuration.
- If runtime error: fix the script in the template editor, re-deploy.
- If external system is down: script errors will stop when the system recovers.
### Issue: Login Fails but LDAP Server is Up
**Possible causes:**
1. Incorrect LDAP search base DN.
2. User account is locked in AD.
3. LDAP group-to-role mapping does not include a required group.
4. TLS certificate issue on LDAP connection.
**Diagnosis:**
1. Check central logs for LDAP bind errors.
2. Verify LDAP connectivity: `Test-NetConnection -ComputerName ldap.example.com -Port 636`
3. Test LDAP bind manually using an LDAP browser tool.
**Resolution:**
- Fix the LDAP configuration.
- Unlock the user account in AD.
- Update group mappings in the configuration database.
### Issue: High Dead Letter Count
**Possible causes:**
1. Messages being sent to actors that no longer exist (e.g., after instance deletion).
2. Actor mailbox overflow.
3. Misconfigured actor paths after deployment changes.
**Diagnosis:**
1. Check health dashboard: dead letter count trend.
2. Check site logs for dead letter details (actor path, message type).
**Resolution:**
- Dead letters during failover are expected and transient.
- Persistent dead letters indicate a configuration or code issue.
- If dead letters reference deleted instances, they are harmless (S&F messages are retained by design).
## Health Dashboard Interpretation
### Metric: Data Connection Status
| Status | Meaning | Action |
|--------|---------|--------|
| Connected | OPC UA connection active | None |
| Disconnected | Connection lost, auto-reconnecting | Check OPC UA server |
| Connecting | Initial connection in progress | Wait |
### Metric: Tag Resolution
- `TotalSubscribed`: Number of tags the system is trying to monitor.
- `SuccessfullyResolved`: Tags with active subscriptions.
- Gap indicates unresolved tags (devices still booting or path errors).
### Metric: S&F Buffer Depth
- `ExternalSystem`: Messages to external REST APIs awaiting delivery.
- `Notification`: Email notifications awaiting SMTP delivery.
- Growing depth indicates the target system is unreachable.
### Metric: Error Counts (Per Interval)
- Counts reset every 30 seconds (health report interval).
- Raw counts, not rates — compare across intervals.
- Occasional script errors during failover are expected.