ScadaBridge/docs/operations/maintenance-procedures.md

# ScadaBridge Maintenance Procedures

## SQL Server Maintenance (Central)

### Regular Maintenance Schedule

| Task | Frequency | Window |
|------|-----------|--------|
| Index rebuild | Weekly | Off-peak hours |
| Statistics update | Daily | Automated |
| Backup (full) | Daily | Off-peak hours |
| Backup (differential) | Every 4 hours | Anytime |
| Backup (transaction log) | Every 15 minutes | Anytime |
| Integrity check (DBCC CHECKDB) | Weekly | Off-peak hours |

### Index Maintenance

```sql
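-- Rebuild all indexes on every table in the configuration database.
-- Note: sp_MSforeachtable is an undocumented procedure, and ONLINE = ON
-- requires an edition that supports online index operations (e.g. Enterprise).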
USE ScadaBridge;
EXEC sp_MSforeachtable 'ALTER INDEX ALL ON ? REBUILD WITH (ONLINE = ON)';
```

For large tables (AuditLogEntries, DeploymentRecords), consider rebuilding individual indexes instead of all indexes at once:
```sql
ALTER INDEX IX_AuditLogEntries_Timestamp ON AuditLogEntries REBUILD
    WITH (ONLINE = ON, FILLFACTOR = 90);
```

### Audit Log Retention

The AuditLogEntries table grows continuously. Implement a retention policy:

```sql
-- Delete audit entries older than 1 year
DELETE FROM AuditLogEntries
WHERE Timestamp < DATEADD(YEAR, -1, GETUTCDATE());
```

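On a large table, a single `DELETE` runs as one transaction, which can hold locks for a long time and grow the transaction log. Prefer deleting in batches (for example, `DELETE TOP (5000) ...` repeated until no rows remain), or consider partitioning the AuditLogEntries table by month for efficient purging.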

### Database Growth Monitoring

```sql
-- Check database sizes
EXEC sp_helpdb 'ScadaBridge';
EXEC sp_helpdb 'ScadaBridge_MachineData';

-- Check table sizes
SELECT
    t.name AS TableName,
    p.rows AS [RowCount],  -- ROWCOUNT is a reserved keyword, so the alias is bracketed
    SUM(a.total_pages) * 8 / 1024.0 AS TotalSpaceMB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
GROUP BY t.name, p.rows
ORDER BY TotalSpaceMB DESC;
```

## SQLite Management (Site)

### Database Files

| File | Purpose | Growth Pattern |
|------|---------|---------------|
| `site.db` | Deployed configs, static overrides | Stable (grows with deployments) |
| `store-and-forward.db` | S&F message buffer | Variable (grows during outages) |

### Monitoring SQLite Size

```powershell
# Check SQLite file sizes
Get-ChildItem C:\ScadaBridge\data\*.db | Select-Object Name, @{N='SizeMB';E={[math]::Round($_.Length/1MB,2)}}
```

### S&F Database Growth

The S&F database has **no max buffer size** by design. During extended outages, it can grow significantly.

**Monitoring:**
- Check buffer depth in the health dashboard.
- Alert if `store-and-forward.db` exceeds 1 GB.
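
The 1 GB alert could be scripted as a check that runs from a scheduled task or a monitoring agent. The following is a minimal sketch: the function name `Test-SfDatabaseSize` and the use of `Write-Warning` are illustrative assumptions, not part of ScadaBridge.

```powershell
# Hypothetical helper: returns $true when the file at $Path is larger than $ThresholdBytes.
function Test-SfDatabaseSize {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [long] $ThresholdBytes = 1GB
    )
    $file = Get-Item -LiteralPath $Path -ErrorAction Stop
    return $file.Length -gt $ThresholdBytes
}

# Example use. How the alert is raised (event log, e-mail, monitoring agent) is site-specific.
$sfDb = 'C:\ScadaBridge\data\store-and-forward.db'
if ((Test-Path -LiteralPath $sfDb) -and (Test-SfDatabaseSize -Path $sfDb)) {
    Write-Warning 'store-and-forward.db exceeds 1 GB'
}
```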

**Manual cleanup (if needed):**
1. Identify and discard permanently undeliverable parked messages via the central UI.
2. If the database is very large but the site is healthy, no manual cleanup is needed: buffered messages are delivered and removed automatically.

### SQLite Vacuum

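SQLite does not return freed pages to the file system when rows are deleted (unless `auto_vacuum` is enabled); the file keeps its size and the free pages are reused by later inserts. Periodically vacuum to shrink the file: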

```powershell
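# Stop the ScadaBridge service first. Stop-Service waits until the service
# has stopped, whereas sc.exe stop returns immediately.
Stop-Service -Name ScadaBridge-Site

# Vacuum the S&F database
sqlite3 C:\ScadaBridge\data\store-and-forward.db "VACUUM;"

# Restart the service
Start-Service -Name ScadaBridge-Site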
```

**Important:** Only vacuum when the service is stopped. `VACUUM` rewrites the entire database file and needs exclusive access to it while it runs.
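
`VACUUM` rebuilds the database into a temporary copy, and the SQLite documentation notes that it can need free disk space of up to twice the size of the original file. A pre-flight check along the following lines may help avoid a failed vacuum on a nearly full disk. This is a sketch: `Test-VacuumHeadroom` is a hypothetical helper, and it assumes the database and SQLite's temporary files are on the same drive (`C`).

```powershell
# Hypothetical helper: returns $true when free space is at least twice the database size.
function Test-VacuumHeadroom {
    param(
        [Parameter(Mandatory)] [long] $DatabaseBytes,
        [Parameter(Mandatory)] [long] $FreeBytes
    )
    return $FreeBytes -ge (2 * $DatabaseBytes)
}

# Example: compare the S&F database size with the free space on drive C.
$db = Get-Item 'C:\ScadaBridge\data\store-and-forward.db' -ErrorAction SilentlyContinue
if ($db) {
    $free = (Get-PSDrive -Name C).Free
    if (-not (Test-VacuumHeadroom -DatabaseBytes $db.Length -FreeBytes $free)) {
        Write-Warning 'Not enough free disk space to vacuum store-and-forward.db safely'
    }
}
```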

### SQLite Backup

```powershell
# Hot backup using SQLite backup API (safe while service is running)
sqlite3 C:\ScadaBridge\data\site.db ".backup C:\Backups\site-$(Get-Date -Format yyyyMMdd).db"
sqlite3 C:\ScadaBridge\data\store-and-forward.db ".backup C:\Backups\sf-$(Get-Date -Format yyyyMMdd).db"
```

## Log Rotation

### Serilog File Sink

ScadaBridge uses Serilog's rolling file sink with daily rotation:
- New file created each day: `scadabridge-20260316.log`
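- Files are not automatically deleted. (The Serilog file sink retains only the 31 most recent files by default, so this implies `retainedFileCountLimit` is set to `null`.)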

### Log Retention Policy

Implement a scheduled task to delete old log files:

```powershell
# Delete log files older than 30 days
Get-ChildItem C:\ScadaBridge\logs\scadabridge-*.log |
    Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-30) } |
    Remove-Item -Force
```

Schedule this as a Windows Task:
```powershell
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -Command `"Get-ChildItem C:\ScadaBridge\logs\scadabridge-*.log | Where-Object { `$_.LastWriteTime -lt (Get-Date).AddDays(-30) } | Remove-Item -Force`""
$trigger = New-ScheduledTaskTrigger -Daily -At "03:00"
Register-ScheduledTask -TaskName "ScadaBridge-LogCleanup" -Action $action -Trigger $trigger -Description "Clean up ScadaBridge log files older than 30 days"
```

### Log Disk Space

Monitor disk space on all nodes:
```powershell
Get-PSDrive C | Select-Object @{N='UsedGB';E={[math]::Round($_.Used/1GB,1)}}, @{N='FreeGB';E={[math]::Round($_.Free/1GB,1)}}
```

Alert if free space drops below 5 GB.
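
The 5 GB threshold could be turned into a scripted check. The sketch below is one possible shape; `Test-LowDiskSpace` is a hypothetical helper name and the default drive `C` follows the example above.

```powershell
# Hypothetical helper: returns $true when free space on the drive is below $ThresholdBytes.
function Test-LowDiskSpace {
    param(
        [string] $DriveName = 'C',
        [long] $ThresholdBytes = 5GB
    )
    $drive = Get-PSDrive -Name $DriveName -ErrorAction Stop
    return $drive.Free -lt $ThresholdBytes
}
```

For example, `if (Test-LowDiskSpace -DriveName C) { Write-Warning 'Free space on C: is below 5 GB' }` could run from the same scheduled task as the log cleanup. How the alert is delivered is left to the site's monitoring tooling.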

## Site Event Log Maintenance

### Automatic Purge

The Site Event Logging component has built-in purge:
- **Retention**: 30 days (configurable via `SiteEventLog:RetentionDays`)
- **Storage cap**: 1 GB (configurable via `SiteEventLog:MaxStorageMB`)
- **Purge interval**: Every 24 hours (configurable via `SiteEventLog:PurgeIntervalHours`)

No manual intervention needed under normal conditions.

### Manual Purge (Emergency)

If event log storage is consuming excessive disk space:

```powershell
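# Stop the service (Stop-Service waits until it has fully stopped)
Stop-Service -Name ScadaBridge-Site

# Delete the event log database and let it be recreated.
# This discards all stored site events. The wildcard also removes any
# -wal/-shm companion files SQLite may have left next to the database.
Remove-Item C:\ScadaBridge\data\event-log.db*

# Restart the service
Start-Service -Name ScadaBridge-Site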
```

## Certificate Management

### LDAP Certificates

If using LDAPS (port 636), the LDAP server's TLS certificate must be trusted:
1. Export the CA certificate from Active Directory.
2. Import it into the Local Machine certificate store (Trusted Root Certification Authorities for a root CA) on both central nodes.
3. Restart the ScadaBridge service.
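
The import step can be scripted with the `Import-Certificate` cmdlet from the Windows PKI module. The sketch below assumes the exported file is the root CA certificate; an intermediate CA would go into `Cert:\LocalMachine\CA` instead. `Import-LdapCaCertificate` is a hypothetical wrapper name.

```powershell
# Hypothetical helper: imports a CA certificate into the machine-wide Trusted Root store.
# Requires an elevated session on Windows; run it on each central node.
function Import-LdapCaCertificate {
    param(
        [Parameter(Mandatory)] [string] $CertificatePath
    )
    Import-Certificate -FilePath $CertificatePath -CertStoreLocation Cert:\LocalMachine\Root
}
```

A call such as `Import-LdapCaCertificate -CertificatePath C:\Temp\ldap-ca.cer` (the path is a placeholder) returns the imported certificate object, whose thumbprint and expiry date can be checked before restarting the service.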

### OPC UA Certificates

OPC UA connections may require certificate trust configuration:
1. On first connection, the OPC UA client generates a self-signed certificate.
2. The OPC UA server must trust this certificate.
3. If the site node is replaced, a new certificate is generated; update the server trust list.

## Scheduled Maintenance Window

### Recommended Procedure

1. **Notify operators** that the system will be in maintenance mode.
2. **Gracefully stop the standby node** first (allows singleton to remain on active).
3. Perform maintenance on the standby node (OS updates, disk cleanup, etc.).
4. **Start the standby node** and verify it joins the cluster.
5. **Gracefully stop the active node** (CoordinatedShutdown migrates singletons to the now-running standby).
6. Perform maintenance on the former active node.
7. **Start the former active node**, which rejoins as standby.

This procedure maintains availability throughout the maintenance window.

### Emergency Maintenance (Both Nodes)

If both nodes must be stopped simultaneously:
1. Stop both nodes.
2. Perform maintenance.
3. Start one node (it forms a single-node cluster).
4. Verify health.
5. Start the second node.

Sites continue operating independently during central maintenance. Site-buffered data (S&F) is delivered once communication with central is restored.