ScadaLink Maintenance Procedures
SQL Server Maintenance (Central)
Regular Maintenance Schedule
| Task | Frequency | Window |
|---|---|---|
| Index rebuild | Weekly | Off-peak hours |
| Statistics update | Daily | Automated |
| Backup (full) | Daily | Off-peak hours |
| Backup (differential) | Every 4 hours | Anytime |
| Backup (transaction log) | Every 15 minutes | Anytime |
| Integrity check (DBCC CHECKDB) | Weekly | Off-peak hours |
Index Maintenance
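-- Rebuild all indexes on every table in the configuration database.
-- Note: ONLINE = ON requires an edition that supports online index operations (e.g. Enterprise).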
USE ScadaLink;
EXEC sp_MSforeachtable 'ALTER INDEX ALL ON ? REBUILD WITH (ONLINE = ON)';
For large tables (AuditLogEntries, DeploymentRecords), consider rebuilding individual indexes instead of all indexes at once:
ALTER INDEX IX_AuditLogEntries_Timestamp ON AuditLogEntries REBUILD
WITH (ONLINE = ON, FILLFACTOR = 90);
Audit Log Retention
The AuditLogEntries table grows continuously. Implement a retention policy:
-- Delete audit entries older than 1 year
DELETE FROM AuditLogEntries
WHERE Timestamp < DATEADD(YEAR, -1, GETUTCDATE());
Consider partitioning the AuditLogEntries table by month for efficient purging.
Database Growth Monitoring
-- Check database sizes
EXEC sp_helpdb 'ScadaLink';
EXEC sp_helpdb 'ScadaLink_MachineData';
-- Check table sizes
SELECT
    t.name AS TableName,
    p.rows AS [RowCount],  -- bracketed because ROWCOUNT is a reserved keyword
    SUM(a.total_pages) * 8 / 1024.0 AS TotalSpaceMB  -- pages are 8 KB each
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
GROUP BY t.name, p.rows
ORDER BY TotalSpaceMB DESC;
SQLite Management (Site)
Database Files
| File | Purpose | Growth Pattern |
|---|---|---|
| site.db | Deployed configs, static overrides | Stable (grows with deployments) |
| store-and-forward.db | S&F message buffer | Variable (grows during outages) |
Monitoring SQLite Size
# Check SQLite file sizes
Get-ChildItem C:\ScadaLink\data\*.db | Select-Object Name, @{N='SizeMB';E={[math]::Round($_.Length/1MB,2)}}
S&F Database Growth
The S&F database has no max buffer size by design. During extended outages, it can grow significantly.
Monitoring:
- Check buffer depth in the health dashboard.
- Alert if store-and-forward.db exceeds 1 GB.
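The 1 GB alert above could be automated with a small size check. The following is a minimal sketch, assuming Python is available on the site node; the function name `exceeds_limit` is illustrative and not part of ScadaLink, and how the alert is actually raised is left to the caller.

```python
import os

ONE_GB = 1024 ** 3  # alert threshold taken from the guidance above


def exceeds_limit(db_path: str, limit_bytes: int = ONE_GB) -> bool:
    """Return True when the file at db_path is larger than limit_bytes.

    A missing file is treated as empty (nothing has been buffered yet).
    """
    try:
        return os.path.getsize(db_path) > limit_bytes
    except FileNotFoundError:
        return False


if __name__ == "__main__":
    # Path as used elsewhere in this document; adjust to the actual install.
    path = r"C:\ScadaLink\data\store-and-forward.db"
    if exceeds_limit(path):
        print(f"ALERT: {path} exceeds 1 GB")
```

The check only looks at the main database file. If the database runs with a write-ahead log, the accompanying `-wal` file would need to be counted separately.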
Manual cleanup (if needed):
- Identify and discard permanently undeliverable parked messages via the central UI.
- If the database is very large and the site is healthy, the messages will be delivered and removed automatically.
SQLite Vacuum
By default, SQLite does not return disk space to the operating system when rows are deleted; freed pages are kept for reuse and the file does not shrink. Periodically vacuum:
# Stop the ScadaLink service first.
# Stop-Service waits until the service has stopped; sc.exe stop returns immediately.
Stop-Service ScadaLink-Site
# Vacuum the S&F database
sqlite3 C:\ScadaLink\data\store-and-forward.db "VACUUM;"
# Restart the service
sc.exe start ScadaLink-Site
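Important: Only vacuum when the service is stopped. VACUUM needs exclusive access to the database and rebuilds it into a temporary copy, so it can require free disk space of up to twice the size of the database file.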
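Whether a vacuum justifies a service stop can be estimated beforehand from the number of free pages in the file. The sketch below is one way to do that, assuming Python is available on the node; the function names and the 100 MB default threshold are illustrative choices, not ScadaLink settings.

```python
import sqlite3


def reclaimable_bytes(db_path: str) -> int:
    """Return the bytes held by free pages, which VACUUM would release."""
    conn = sqlite3.connect(db_path)
    try:
        free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        return free_pages * page_size
    finally:
        conn.close()


def vacuum_if_worthwhile(db_path: str, min_bytes: int = 100 * 1024 * 1024) -> bool:
    """Run VACUUM only when at least min_bytes can be reclaimed.

    Call this only while the ScadaLink service is stopped.
    Returns True when a vacuum was performed.
    """
    if reclaimable_bytes(db_path) < min_bytes:
        return False
    # isolation_level=None puts the connection in autocommit mode;
    # VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
    return True
```

`reclaimable_bytes` only reads two pragmas, so it could be used for monitoring; `vacuum_if_worthwhile` belongs between the stop and start steps shown above.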
SQLite Backup
# Hot backup using SQLite backup API (safe while service is running)
sqlite3 C:\ScadaLink\data\site.db ".backup C:\Backups\site-$(Get-Date -Format yyyyMMdd).db"
sqlite3 C:\ScadaLink\data\store-and-forward.db ".backup C:\Backups\sf-$(Get-Date -Format yyyyMMdd).db"
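The same online backup API is exposed by Python's standard `sqlite3` module, which also makes it easy to verify the copy afterwards. This is a minimal sketch, assuming Python is available on the node; `backup_database` and `verify_backup` are illustrative names, and the paths in the usage section simply mirror the ones used above.

```python
import sqlite3
from datetime import date


def backup_database(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database to dest_path using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # copies all pages in a single step by default
    finally:
        dest.close()
        src.close()


def verify_backup(path: str) -> bool:
    """Return True when PRAGMA integrity_check reports 'ok' for the file."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    finally:
        conn.close()


if __name__ == "__main__":
    stamp = date.today().strftime("%Y%m%d")
    target = rf"C:\Backups\site-{stamp}.db"
    backup_database(r"C:\ScadaLink\data\site.db", target)
    print("backup ok" if verify_backup(target) else "backup FAILED integrity check")
```

Running the integrity check on the copy rather than the live file avoids adding load to the running service.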
Log Rotation
Serilog File Sink
ScadaLink uses Serilog's rolling file sink with daily rotation:
- New file created each day: scadalink-20260316.log
- Files are not automatically deleted.
Log Retention Policy
Implement a scheduled task to delete old log files:
# Delete log files older than 30 days
Get-ChildItem C:\ScadaLink\logs\scadalink-*.log |
Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-30) } |
Remove-Item -Force
Schedule this as a Windows Task:
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -Command `"Get-ChildItem C:\ScadaLink\logs\scadalink-*.log | Where-Object { `$_.LastWriteTime -lt (Get-Date).AddDays(-30) } | Remove-Item -Force`""
$trigger = New-ScheduledTaskTrigger -Daily -At "03:00"
Register-ScheduledTask -TaskName "ScadaLink-LogCleanup" -Action $action -Trigger $trigger -Description "Clean up ScadaLink log files older than 30 days"
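The cleanup above keys on LastWriteTime, which changes if a file is copied or touched. An alternative is to use the date embedded in the file name. The sketch below shows only the selection logic, assuming Python is available and that every log file follows the `scadalink-YYYYMMDD.log` pattern from the example above; `expired_logs` is an illustrative name, and deleting the returned files is left to the caller.

```python
import re
from datetime import date, datetime, timedelta

# Assumed naming pattern, based on the example file name in this document.
LOG_NAME = re.compile(r"^scadalink-(\d{8})\.log$")


def expired_logs(names, today: date, keep_days: int = 30):
    """Return the names whose embedded date is more than keep_days before today."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        match = LOG_NAME.match(name)
        if not match:
            continue  # ignore files that do not follow the naming pattern
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that are not a valid calendar date
        if stamp < cutoff:
            expired.append(name)
    return expired
```

Files that do not match the pattern are never selected, so unrelated files in the log directory are left alone.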
Log Disk Space
Monitor disk space on all nodes:
Get-PSDrive C | Select-Object @{N='UsedGB';E={[math]::Round($_.Used/1GB,1)}}, @{N='FreeGB';E={[math]::Round($_.Free/1GB,1)}}
Alert if free space drops below 5 GB.
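The 5 GB rule can be expressed as a simple check for use in a monitoring script. This is a minimal sketch using Python's standard library, assuming Python is available on the node; `free_space_low` is an illustrative name, and the alerting mechanism is not shown.

```python
import shutil

GIB = 1024 ** 3


def free_space_low(path: str, min_free_gb: float = 5.0) -> bool:
    """Return True when the volume containing path has less than min_free_gb free."""
    usage = shutil.disk_usage(path)
    return usage.free < min_free_gb * GIB


if __name__ == "__main__":
    # Drive letter as used in this document; adjust per node.
    if free_space_low("C:\\"):
        print("ALERT: less than 5 GB free on C:")
```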
Site Event Log Maintenance
Automatic Purge
The Site Event Logging component has built-in purge:
- Retention: 30 days (configurable via SiteEventLog:RetentionDays)
- Storage cap: 1 GB (configurable via SiteEventLog:MaxStorageMB)
- Purge interval: Every 24 hours (configurable via SiteEventLog:PurgeIntervalHours)
No manual intervention needed under normal conditions.
Manual Purge (Emergency)
If event log storage is consuming excessive disk space, the event log database can be deleted and recreated. This discards all retained site events:
# Stop the service.
# Stop-Service waits until the service has stopped; sc.exe stop returns immediately.
Stop-Service ScadaLink-Site
# Delete the event log database and let it be recreated
Remove-Item C:\ScadaLink\data\event-log.db
# Restart the service
sc.exe start ScadaLink-Site
Certificate Management
LDAP Certificates
If using LDAPS (port 636), the LDAP server's TLS certificate must be trusted:
- Export the CA certificate from Active Directory.
- Import into the Windows certificate store on both central nodes.
- Restart the ScadaLink service.
OPC UA Certificates
OPC UA connections may require certificate trust configuration:
- On first connection, the OPC UA client generates a self-signed certificate.
- The OPC UA server must trust this certificate.
- If the site node is replaced, a new certificate is generated; update the server trust list.
Scheduled Maintenance Window
Recommended Procedure
- Notify operators that the system will be in maintenance mode.
- Gracefully stop the standby node first (allows singleton to remain on active).
- Perform maintenance on the standby node (OS updates, disk cleanup, etc.).
- Start the standby node and verify it joins the cluster.
- Gracefully stop the active node (CoordinatedShutdown migrates singletons to the now-running standby).
- Perform maintenance on the former active node.
- Start the former active node; it rejoins as standby.
This procedure maintains availability throughout the maintenance window.
Emergency Maintenance (Both Nodes)
If both nodes must be stopped simultaneously:
- Stop both nodes.
- Perform maintenance.
- Start one node (it forms a single-node cluster).
- Verify health.
- Start the second node.
Sites continue operating independently during central maintenance. Site-buffered data (S&F) will be delivered once communication with central is restored.