Joseph Doherty b659978764 Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs

- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)
92 new tests, all passing. Zero warnings.

2026-03-16 22:12:31 -04:00

ScadaLink Maintenance Procedures

SQL Server Maintenance (Central)

Regular Maintenance Schedule

Task	Frequency	Window
Index rebuild	Weekly	Off-peak hours
Statistics update	Daily	Automated
Backup (full)	Daily	Off-peak hours
Backup (differential)	Every 4 hours	Anytime
Backup (transaction log)	Every 15 minutes	Anytime
Integrity check (DBCC CHECKDB)	Weekly	Off-peak hours

Index Maintenance

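-- Rebuild all indexes in the configuration database.
-- Note: sp_MSforeachtable is an undocumented procedure, and ONLINE = ON
-- is not available in Standard edition.
USE ScadaLink;
EXEC sp_MSforeachtable 'ALTER INDEX ALL ON ? REBUILD WITH (ONLINE = ON)';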

For large tables (AuditLogEntries, DeploymentRecords), consider rebuilding individual indexes rather than every index at once:

ALTER INDEX IX_AuditLogEntries_Timestamp ON AuditLogEntries REBUILD
    WITH (ONLINE = ON, FILLFACTOR = 90);

Audit Log Retention

The AuditLogEntries table grows continuously. Implement a retention policy:

-- Delete audit entries older than 1 year
DELETE FROM AuditLogEntries
WHERE Timestamp < DATEADD(YEAR, -1, GETUTCDATE());

Consider partitioning the AuditLogEntries table by month for efficient purging.

Database Growth Monitoring

-- Check database sizes
EXEC sp_helpdb 'ScadaLink';
EXEC sp_helpdb 'ScadaLink_MachineData';

-- Check table sizes
SELECT
    t.name AS TableName,
    p.rows AS [RowCount],  -- bracketed because ROWCOUNT is a reserved keyword
    SUM(a.total_pages) * 8 / 1024.0 AS TotalSpaceMB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
GROUP BY t.name, p.rows
ORDER BY TotalSpaceMB DESC;

SQLite Management (Site)

Database Files

File	Purpose	Growth Pattern
`site.db`	Deployed configs, static overrides	Stable (grows with deployments)
`store-and-forward.db`	S&F message buffer	Variable (grows during outages)

Monitoring SQLite Size

# Check SQLite file sizes
Get-ChildItem C:\ScadaLink\data\*.db | Select-Object Name, @{N='SizeMB';E={[math]::Round($_.Length/1MB,2)}}

S&F Database Growth

The S&F database has no max buffer size by design. During extended outages, it can grow significantly.

Monitoring:

Check buffer depth in the health dashboard.
Alert if store-and-forward.db exceeds 1 GB.
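
The 1 GB alert can also be checked from a script. The following is a minimal sketch in Python, assuming Python 3 is available on the site node (the source does not say so); the function name `sf_db_alert` and the `ALERT_BYTES` constant are hypothetical and not part of ScadaLink.

```python
from pathlib import Path

# Threshold from the guidance above: alert once the file exceeds 1 GB.
ALERT_BYTES = 1024 ** 3


def sf_db_alert(db_path, threshold_bytes=ALERT_BYTES):
    """Return (size_in_bytes, should_alert) for the S&F database file.

    A missing file is reported as size 0 with no alert.
    """
    path = Path(db_path)
    size = path.stat().st_size if path.exists() else 0
    return size, size > threshold_bytes
```

For example, `sf_db_alert(r"C:\ScadaLink\data\store-and-forward.db")` returns the current size and whether it is over the threshold. How the alert is raised is left to whatever monitoring tool is already in use.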

Manual cleanup (if needed):

Identify and discard permanently undeliverable parked messages via the central UI.
If the database is large but the site is healthy and connected to central, no manual cleanup is needed: buffered messages are delivered and removed automatically.

SQLite Vacuum

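SQLite does not return disk space to the file system after rows are deleted (unless auto_vacuum is enabled). Freed pages are reused for new data, but the file itself does not shrink. Periodically vacuum: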

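# Stop the ScadaLink service first.
# Stop-Service waits until the service has stopped; sc.exe stop returns immediately.
Stop-Service -Name ScadaLink-Site

# Vacuum the S&F database
sqlite3 C:\ScadaLink\data\store-and-forward.db "VACUUM;"

# Restart the service
Start-Service -Name ScadaLink-Site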

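Important: Only vacuum when the service is stopped. VACUUM rebuilds the whole database file and needs exclusive access to it, so it fails or blocks while another connection has an open transaction. It can also need free disk space of up to twice the size of the original database file while it runs.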
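
Because a vacuum requires downtime, it can help to estimate first how much space it would free. The sketch below is one way to do that in Python, assuming Python 3 is available on the node; the function name `reclaimable_bytes` is hypothetical. It multiplies the number of free pages (`PRAGMA freelist_count`) by the page size (`PRAGMA page_size`) and opens the file read-only so that nothing is modified.

```python
import sqlite3
from pathlib import Path


def reclaimable_bytes(db_path):
    """Return the bytes held by free pages, which VACUUM would release."""
    # Open read-only via a file: URI so the check never writes to the
    # database and never creates the file if the path is wrong.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
    finally:
        conn.close()
    return page_size * free_pages
```

If the result is small relative to the file size, the vacuum is unlikely to be worth a service stop.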

SQLite Backup

# Hot backup using SQLite backup API (safe while service is running)
sqlite3 C:\ScadaLink\data\site.db ".backup C:\Backups\site-$(Get-Date -Format yyyyMMdd).db"
sqlite3 C:\ScadaLink\data\store-and-forward.db ".backup C:\Backups\sf-$(Get-Date -Format yyyyMMdd).db"
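
The same online backup API is exposed by Python's standard `sqlite3` module as `Connection.backup` (Python 3.7 and later). The sketch below assumes Python is available on the node and uses a hypothetical helper name, `backup_sqlite`. It writes a date-stamped copy, as the commands above do, and then runs `PRAGMA integrity_check` on the copy.

```python
import sqlite3
from datetime import date
from pathlib import Path


def backup_sqlite(src_path, backup_dir, prefix):
    """Copy a live SQLite database to backup_dir via the online backup API.

    Returns the path of the backup file, named <prefix>-YYYYMMDD.db.
    Raises RuntimeError if the copy fails its integrity check.
    """
    dest = Path(backup_dir) / f"{prefix}-{date.today():%Y%m%d}.db"
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dest)
    try:
        src.backup(dst)  # Connection.backup requires Python 3.7+
        result = dst.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        dst.close()
        src.close()
    if result != "ok":
        raise RuntimeError(f"integrity check failed for {dest}: {result}")
    return dest
```

A call such as `backup_sqlite(r"C:\ScadaLink\data\site.db", r"C:\Backups", "site")` would produce the same file name as the first command above.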

Log Rotation

Serilog File Sink

ScadaLink uses Serilog's rolling file sink with daily rotation:

New file created each day: scadalink-20260316.log
Files are not automatically deleted.

Log Retention Policy

Implement a scheduled task to delete old log files:

# Delete log files older than 30 days
Get-ChildItem C:\ScadaLink\logs\scadalink-*.log |
    Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-30) } |
    Remove-Item -Force

Schedule this as a Windows Task:

$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -Command `"Get-ChildItem C:\ScadaLink\logs\scadalink-*.log | Where-Object { `$_.LastWriteTime -lt (Get-Date).AddDays(-30) } | Remove-Item -Force`""
$trigger = New-ScheduledTaskTrigger -Daily -At "03:00"
Register-ScheduledTask -TaskName "ScadaLink-LogCleanup" -Action $action -Trigger $trigger -Description "Clean up ScadaLink log files older than 30 days"

Log Disk Space

Monitor disk space on all nodes:

Get-PSDrive C | Select-Object @{N='UsedGB';E={[math]::Round($_.Used/1GB,1)}}, @{N='FreeGB';E={[math]::Round($_.Free/1GB,1)}}

Alert if free space drops below 5 GB.
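
As an illustration, the 5 GB threshold can be checked with a few lines of Python, assuming Python 3 is available on the node. The function name `low_disk_space` and the `MIN_FREE_GB` constant are hypothetical.

```python
import shutil

MIN_FREE_GB = 5  # threshold from the guidance above


def low_disk_space(path, min_free_gb=MIN_FREE_GB):
    """Return (free_gb, is_low) for the volume that contains path."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return round(free_gb, 1), free_gb < min_free_gb
```

Calling `low_disk_space("C:\\")` reports the free space on the C: volume and whether it is under the threshold.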

Site Event Log Maintenance

Automatic Purge

The Site Event Logging component has built-in purge:

Retention: 30 days (configurable via SiteEventLog:RetentionDays)
Storage cap: 1 GB (configurable via SiteEventLog:MaxStorageMB)
Purge interval: Every 24 hours (configurable via SiteEventLog:PurgeIntervalHours)

No manual intervention needed under normal conditions.

Manual Purge (Emergency)

If event log storage is consuming excessive disk space:

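# Stop the service.
# Stop-Service waits until the service has stopped, so the file is no longer in use.
Stop-Service -Name ScadaLink-Site

# Delete the event log database and let it be recreated
Remove-Item C:\ScadaLink\data\event-log.db

# Restart the service
Start-Service -Name ScadaLink-Site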

Certificate Management

LDAP Certificates

If using LDAPS (port 636), the LDAP server's TLS certificate must be trusted:

1. Export the CA certificate from Active Directory.
2. Import it into the Windows certificate store (Local Machine, Trusted Root Certification Authorities) on both central nodes.
3. Restart the ScadaLink service.

OPC UA Certificates

OPC UA connections may require certificate trust configuration:

On first connection, the OPC UA client generates a self-signed certificate.
The OPC UA server must trust this certificate.
If the site node is replaced, a new certificate is generated; update the server trust list.

Scheduled Maintenance Window

Recommended Procedure

1. Notify operators that the system will be in maintenance mode.
2. Gracefully stop the standby node first (this allows the singleton to remain on the active node).
3. Perform maintenance on the standby node (OS updates, disk cleanup, etc.).
4. Start the standby node and verify it joins the cluster.
5. Gracefully stop the active node (CoordinatedShutdown migrates singletons to the now-running standby).
6. Perform maintenance on the former active node.
7. Start the former active node; it rejoins as standby.

This procedure maintains availability throughout the maintenance window.

Emergency Maintenance (Both Nodes)

If both nodes must be stopped simultaneously:

1. Stop both nodes.
2. Perform maintenance.
3. Start one node (it forms a single-node cluster).
4. Verify health.
5. Start the second node.

Sites continue operating independently during central maintenance. Site-buffered data (S&F) is delivered once communication with central is restored.