# ScadaLink Maintenance Procedures

## SQL Server Maintenance (Central)

### Regular Maintenance Schedule

| Task | Frequency | Window |
|------|-----------|--------|
| Index rebuild | Weekly | Off-peak hours |
| Statistics update | Daily | Automated |
| Backup (full) | Daily | Off-peak hours |
| Backup (differential) | Every 4 hours | Anytime |
| Backup (transaction log) | Every 15 minutes | Anytime |
| Integrity check (DBCC CHECKDB) | Weekly | Off-peak hours |
### Index Maintenance

```sql
-- Rebuild fragmented indexes on the configuration database
USE ScadaLink;
EXEC sp_MSforeachtable 'ALTER INDEX ALL ON ? REBUILD WITH (ONLINE = ON)';
```
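If rebuilding every index is too heavy for the window, a fragmentation check can narrow the work first. A sketch using the standard `sys.dm_db_index_physical_stats` DMV (the 30% threshold is a common rule of thumb, not a ScadaLink requirement):

```sql
-- List rebuild candidates: indexes above 30% fragmentation
SELECT OBJECT_NAME(ips.object_id) AS TableName,
       i.name AS IndexName,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID('ScadaLink'), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON ips.object_id = i.object_id AND ips.index_id = i.index_id
WHERE ips.avg_fragmentation_in_percent > 30
ORDER BY ips.avg_fragmentation_in_percent DESC;
```

Note that `ONLINE = ON` rebuilds require a SQL Server edition that supports online index operations; on other editions, drop the option and run the rebuild in the maintenance window.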

For large tables (AuditLogEntries, DeploymentRecords), consider targeted per-index rebuilds with a lower fill factor:

```sql
ALTER INDEX IX_AuditLogEntries_Timestamp ON AuditLogEntries REBUILD
WITH (ONLINE = ON, FILLFACTOR = 90);
```

### Audit Log Retention

The AuditLogEntries table grows continuously. Implement a retention policy:

```sql
-- Delete audit entries older than 1 year
DELETE FROM AuditLogEntries
WHERE Timestamp < DATEADD(YEAR, -1, GETUTCDATE());
```
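On a large AuditLogEntries table, a single one-year DELETE can hold long locks and inflate the transaction log. A batched variant (sketch; the 10,000-row chunk size is an assumption to tune):

```sql
-- Delete in chunks until no rows remain past the cutoff
DECLARE @deleted INT = 1;
WHILE @deleted > 0
BEGIN
    DELETE TOP (10000) FROM AuditLogEntries
    WHERE Timestamp < DATEADD(YEAR, -1, GETUTCDATE());
    SET @deleted = @@ROWCOUNT;
END;
```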

Consider partitioning the AuditLogEntries table by month for efficient purging.

### Database Growth Monitoring

```sql
-- Check database sizes
EXEC sp_helpdb 'ScadaLink';
EXEC sp_helpdb 'ScadaLink_MachineData';

-- Check table sizes
SELECT
    t.name AS TableName,
    p.rows AS [RowCount],
    SUM(a.total_pages) * 8 / 1024.0 AS TotalSpaceMB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
GROUP BY t.name, p.rows
ORDER BY TotalSpaceMB DESC;
```

## SQLite Management (Site)

### Database Files

| File | Purpose | Growth Pattern |
|------|---------|----------------|
| `site.db` | Deployed configs, static overrides | Stable (grows with deployments) |
| `store-and-forward.db` | S&F message buffer | Variable (grows during outages) |

### Monitoring SQLite Size

```powershell
# Check SQLite file sizes
Get-ChildItem C:\ScadaLink\data\*.db |
    Select-Object Name, @{N='SizeMB';E={[math]::Round($_.Length/1MB,2)}}
```

### S&F Database Growth

The S&F database has **no max buffer size** by design. During extended outages, it can grow significantly.

**Monitoring:**

- Check buffer depth in the health dashboard.
- Alert if `store-and-forward.db` exceeds 1 GB.

**Manual cleanup (if needed):**

1. Identify and discard permanently undeliverable parked messages via the central UI.
2. If the database is large but the site is healthy, queued messages will be delivered and removed automatically; no further action is needed.

### SQLite Vacuum

SQLite does not reclaim disk space after deleting rows. Periodically vacuum:

```powershell
# Stop the ScadaLink service first
sc.exe stop ScadaLink-Site

# Vacuum the S&F database
sqlite3 C:\ScadaLink\data\store-and-forward.db "VACUUM;"

# Restart the service
sc.exe start ScadaLink-Site
```

**Important:** Only vacuum while the service is stopped. VACUUM requires exclusive access to the database and will fail or block if the service still holds connections.

### SQLite Backup

```powershell
# Hot backup using the SQLite backup API (safe while the service is running)
sqlite3 C:\ScadaLink\data\site.db ".backup C:\Backups\site-$(Get-Date -Format yyyyMMdd).db"
sqlite3 C:\ScadaLink\data\store-and-forward.db ".backup C:\Backups\sf-$(Get-Date -Format yyyyMMdd).db"
```
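A quick sanity check on a finished backup copy (sketch; assumes today's backup file name from the commands above):

```powershell
# Verify the backup copy is a readable, consistent SQLite file
sqlite3 "C:\Backups\site-$(Get-Date -Format yyyyMMdd).db" "PRAGMA integrity_check;"
# A healthy file prints the single line: ok
```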

## Log Rotation

### Serilog File Sink

ScadaLink uses Serilog's rolling file sink with daily rotation:

- A new file is created each day: `scadalink-20260316.log`
- Files are not automatically deleted.

### Log Retention Policy

Implement a scheduled task to delete old log files:

```powershell
# Delete log files older than 30 days
Get-ChildItem C:\ScadaLink\logs\scadalink-*.log |
    Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-30) } |
    Remove-Item -Force
```

Schedule this as a Windows Task:

```powershell
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -Command `"Get-ChildItem C:\ScadaLink\logs\scadalink-*.log | Where-Object { `$_.LastWriteTime -lt (Get-Date).AddDays(-30) } | Remove-Item -Force`""
$trigger = New-ScheduledTaskTrigger -Daily -At "03:00"
Register-ScheduledTask -TaskName "ScadaLink-LogCleanup" -Action $action -Trigger $trigger -Description "Clean up ScadaLink log files older than 30 days"
```

### Log Disk Space

Monitor disk space on all nodes:

```powershell
Get-PSDrive C |
    Select-Object @{N='UsedGB';E={[math]::Round($_.Used/1GB,1)}}, @{N='FreeGB';E={[math]::Round($_.Free/1GB,1)}}
```

Alert if free space drops below 5 GB.

## Site Event Log Maintenance

### Automatic Purge

The Site Event Logging component has a built-in purge:

- **Retention**: 30 days (configurable via `SiteEventLog:RetentionDays`)
- **Storage cap**: 1 GB (configurable via `SiteEventLog:MaxStorageMB`)
- **Purge interval**: Every 24 hours (configurable via `SiteEventLog:PurgeIntervalHours`)

No manual intervention is needed under normal conditions.
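For reference, an `appsettings.json` fragment overriding these defaults might look like this (section and key names inferred from the configuration paths above; the exact file location depends on the site installation):

```json
{
  "SiteEventLog": {
    "RetentionDays": 30,
    "MaxStorageMB": 1024,
    "PurgeIntervalHours": 24
  }
}
```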

### Manual Purge (Emergency)

If event log storage is consuming excessive disk space:

```powershell
# Stop the service
sc.exe stop ScadaLink-Site

# Delete the event log database and let it be recreated
Remove-Item C:\ScadaLink\data\event-log.db

# Restart the service
sc.exe start ScadaLink-Site
```

## Certificate Management

### LDAP Certificates

If using LDAPS (port 636), the LDAP server's TLS certificate must be trusted:

1. Export the CA certificate from Active Directory.
2. Import it into the Windows certificate store on both central nodes.
3. Restart the ScadaLink service.

### OPC UA Certificates

OPC UA connections may require certificate trust configuration:

1. On first connection, the OPC UA client generates a self-signed certificate.
2. The OPC UA server must trust this certificate.
3. If the site node is replaced, a new certificate is generated; update the server trust list.

## Scheduled Maintenance Window

### Recommended Procedure

1. **Notify operators** that the system will be in maintenance mode.
2. **Gracefully stop the standby node** first (allows the singleton to remain on the active node).
3. Perform maintenance on the standby node (OS updates, disk cleanup, etc.).
4. **Start the standby node** and verify it joins the cluster.
5. **Gracefully stop the active node** (CoordinatedShutdown migrates singletons to the now-running standby).
6. Perform maintenance on the former active node.
7. **Start the former active node** — it rejoins as standby.

This procedure maintains availability throughout the maintenance window.

### Emergency Maintenance (Both Nodes)

If both nodes must be stopped simultaneously:

1. Stop both nodes.
2. Perform maintenance.
3. Start one node (it forms a single-node cluster).
4. Verify health.
5. Start the second node.

Sites continue operating independently during central maintenance. Site-buffered data (S&F) will be delivered once communication with central is restored.