# ScadaLink Troubleshooting Guide

## Log Analysis

### Log Location

- **File logs:** `C:\ScadaLink\logs\scadalink-YYYYMMDD.log`
- **Console output:** Available when running interactively (not as a Windows Service)

### Log Format

```
[14:32:05 INF] [Central/central-01] Template "PumpStation" saved by admin
```

Format: `[Time Level] [NodeRole/NodeHostname] Message`

All log entries are enriched with:
- `SiteId`: Site identifier (or "central" for central nodes)
- `NodeHostname`: Machine hostname
- `NodeRole`: "Central" or "Site"
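
The line format above can be split into fields with a regular expression. The following is a minimal sketch, not part of ScadaLink: the function name `ConvertFrom-ScadaLinkLogLine` is hypothetical, and the pattern assumes a three-letter uppercase level code (as in `INF` and `ERR`) and an `HH:mm:ss` timestamp, as in the sample line.

```powershell
# Hypothetical helper: split one log line into the documented parts
# "[Time Level] [NodeRole/NodeHostname] Message".
function ConvertFrom-ScadaLinkLogLine {
    param([Parameter(Mandatory)][string]$Line)

    # Assumption: HH:mm:ss time and a three-letter uppercase level code.
    $pattern = '^\[(?<Time>\d{2}:\d{2}:\d{2}) (?<Level>[A-Z]{3})\] ' +
               '\[(?<Role>[^/\]]+)/(?<Hostname>[^\]]+)\] (?<Message>.*)$'

    # -cmatch is case-sensitive; lines that do not fit the format yield nothing.
    if ($Line -cmatch $pattern) {
        [pscustomobject]@{
            Time         = $Matches['Time']
            Level        = $Matches['Level']
            NodeRole     = $Matches['Role']
            NodeHostname = $Matches['Hostname']
            Message      = $Matches['Message']
        }
    }
}

$entry = ConvertFrom-ScadaLinkLogLine -Line '[14:32:05 INF] [Central/central-01] Template "PumpStation" saved by admin'
$entry.Level          # INF
$entry.NodeHostname   # central-01
```

Parsed objects can then be filtered with `Where-Object` on `Level` or `NodeRole` instead of matching raw text.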

### Key Log Patterns

| Pattern | Meaning |
|---------|---------|
| `Starting ScadaLink host as {Role}` | Node startup |
| `Member joined` | Cluster peer connected |
| `Member removed` | Cluster peer departed |
| `Singleton acquired` | This node became the active singleton holder |
| `Instance {Name}: created N script actors` | Instance successfully deployed |
| `Script {Name} failed trust validation` | Script uses forbidden API |
| `Immediate delivery to {Target} failed` | S&F transient failure, message buffered |
| `Message {Id} parked` | S&F max retries reached |
| `Site {SiteId} marked offline` | No health report for 60 seconds |
| `Rejecting stale report` | Out-of-order health report (normal during failover) |

### Filtering Logs

Use the structured log properties for targeted analysis:

```powershell
# Find all errors for a specific site
# (the level shares its bracket with the time, e.g. "[14:32:05 ERR]")
Select-String -Path "logs\scadalink-*.log" -Pattern "ERR\].*site-01"

# Find S&F activity
Select-String -Path "logs\scadalink-*.log" -Pattern "store-and-forward|buffered|parked"

# Find failover events
Select-String -Path "logs\scadalink-*.log" -Pattern "Singleton|Member joined|Member removed"
```
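
When many nodes write to the same log directory, it can help to see which node produces the most errors before reading individual lines. The sketch below builds on the filters above; the function name `Get-ScadaLinkErrorSummary` is hypothetical, and it assumes error lines carry the `ERR` level code followed by the `[NodeRole/NodeHostname]` block shown under Log Format.

```powershell
# Hypothetical helper: count ERR entries per node across the log files.
function Get-ScadaLinkErrorSummary {
    param([string]$Path = 'logs\scadalink-*.log')

    # Capture the "[NodeRole/NodeHostname]" block that follows an ERR level.
    Select-String -Path $Path -Pattern ' ERR\] \[(?<Node>[^\]]+)\]' -CaseSensitive |
        ForEach-Object { $_.Matches[0].Groups['Node'].Value } |
        Group-Object |
        Sort-Object Count -Descending |
        Select-Object Count, Name
}
```

Called without arguments from the install directory, it scans every `scadalink-*.log` file and returns one row per node, highest error count first.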

## Common Issues

### Issue: Site Appears Offline in Health Dashboard

**Possible causes:**
1. Site nodes are actually down.
2. Network connectivity between site and central is broken.
3. Health report interval has not elapsed since site startup.

**Diagnosis:**
1. Check if the site service is running: `sc.exe query ScadaLink-Site`
2. Check site logs for errors.
3. Verify network: `Test-NetConnection -ComputerName central-01.example.com -Port 8081`
4. Wait 60 seconds (the offline detection threshold).

**Resolution:**
- If the service is stopped, start it.
- If network is blocked, open firewall port 8081.
- If the site just started, wait for the first health report (30-second interval).
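
The timing behind this issue (a report every 30 seconds, offline after 60 seconds without one) can be sketched as a small calculation. This is an illustration of the documented thresholds only, not ScadaLink's implementation: the function name `Get-SiteReportStatus` and its output fields are hypothetical, and treating exactly 60 seconds as offline is an assumption.

```powershell
# Hypothetical helper: apply the documented 30 s report interval and
# 60 s offline threshold to the time of the last health report.
function Get-SiteReportStatus {
    param(
        [Parameter(Mandatory)][datetime]$LastReport,
        [datetime]$Now = (Get-Date),
        [int]$ReportIntervalSeconds = 30,
        [int]$OfflineAfterSeconds = 60
    )

    $age = ($Now - $LastReport).TotalSeconds
    [pscustomobject]@{
        AgeSeconds    = [int]$age
        MissedReports = [int][math]::Floor($age / $ReportIntervalSeconds)
        # Assumption: the threshold is inclusive (60 s or more means offline).
        Offline       = $age -ge $OfflineAfterSeconds
    }
}

$last = Get-Date '2024-01-01T12:00:00'
# 45 s after the last report: one report missed, still inside the 60 s window.
Get-SiteReportStatus -LastReport $last -Now $last.AddSeconds(45)
```

A single missed report therefore does not mark a site offline; two consecutive missed reports do.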

### Issue: Deployment Stuck in "InProgress"

**Possible causes:**
1. Site is unreachable during deployment.
2. Central node failed over mid-deployment.
3. Instance compilation failed on site.

**Diagnosis:**
1. Check deployment status in the UI.
2. Check site logs for the deployment ID: `Select-String -Path "logs\scadalink-*.log" -Pattern "dep-XXXXX"`
3. Check central logs for the deployment ID.

**Resolution:**
- If the site is unreachable: fix connectivity, then re-deploy (idempotent by revision hash).
- If compilation failed: check the script errors in site logs, fix the template, re-deploy.
- If stuck after failover: the new central node will re-query site state; wait or manually re-deploy.

### Issue: S&F Messages Accumulating

**Possible causes:**
1. External system is down.
2. SMTP server is unreachable.
3. Network issues between site and external target.

**Diagnosis:**
1. Check S&F buffer depth in health dashboard.
2. Check site logs for retry activity and error messages.
3. Verify external system connectivity from the site node.

**Resolution:**
- Fix the external system / SMTP / network issue. Retries resume automatically.
- If messages are permanently undeliverable: park and discard via the central UI.
- Check parked messages for patterns (same target, same error).
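
To look for the "same target" pattern in the site logs, failed deliveries can be grouped by target. The following is a sketch under stated assumptions: the function name `Get-FailedDeliveryTarget` is hypothetical, and it assumes the rendered log message follows the `Immediate delivery to {Target} failed` template from the Key Log Patterns table, with the target name substituted in place.

```powershell
# Hypothetical helper: count failed immediate deliveries per target.
function Get-FailedDeliveryTarget {
    param([Parameter(Mandatory)][string[]]$Line)

    $Line |
        # Assumption: the message text matches the documented template.
        Select-String -Pattern 'Immediate delivery to (?<Target>.+?) failed' |
        ForEach-Object { $_.Matches[0].Groups['Target'].Value } |
        Group-Object |
        Sort-Object Count -Descending |
        Select-Object Count, Name
}

# Sample lines for illustration only; the target names are made up.
$sample = @(
    '[09:00:01 WRN] [Site/site-01] Immediate delivery to erp-api failed',
    '[09:00:31 WRN] [Site/site-01] Immediate delivery to erp-api failed',
    '[09:01:02 WRN] [Site/site-01] Immediate delivery to smtp-relay failed'
)
Get-FailedDeliveryTarget -Line $sample
```

For real logs, pass the file contents instead of the sample, for example `Get-FailedDeliveryTarget -Line (Get-Content "logs\scadalink-*.log")`. One target dominating the list points at that system rather than at the network as a whole.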

### Issue: OPC UA Connection Keeps Disconnecting

**Possible causes:**
1. OPC UA server is unstable.
2. Network intermittency.
3. Certificate trust issues.

**Diagnosis:**
1. Check DCL logs: look for "Entering Reconnecting state" frequency.
2. Check health dashboard: data connection status for the affected connection.
3. Verify OPC UA server health independently.

**Resolution:**
- DCL auto-reconnects at the configured interval (default 5 seconds).
- If the server certificate changed, update the trust store.
- If the server is consistently unstable, investigate the OPC UA server directly.
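
The reconnect frequency mentioned in the diagnosis steps can be estimated by counting "Entering Reconnecting state" entries per hour. This is a minimal sketch: the function name `Get-ReconnectCountByHour` is hypothetical, and it assumes DCL entries use the same `[HH:mm:ss Level]` prefix as the rest of the log.

```powershell
# Hypothetical helper: count DCL reconnect entries per hour of day.
function Get-ReconnectCountByHour {
    param([Parameter(Mandatory)][string[]]$Line)

    $Line |
        Where-Object { $_ -match 'Entering Reconnecting state' } |
        ForEach-Object {
            # Assumption: each line starts with "[HH:mm:ss Level]".
            if ($_ -match '^\[(?<Hour>\d{2}):\d{2}:\d{2} ') { $Matches['Hour'] }
        } |
        Group-Object |
        Sort-Object Name |
        Select-Object @{ Name = 'Hour'; Expression = { $_.Name } }, Count
}
```

A typical call is `Get-ReconnectCountByHour -Line (Get-Content "logs\scadalink-YYYYMMDD.log")` with the date of interest filled in. Because log lines carry only a time of day, running it over one daily file at a time keeps the hours from different days apart.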

### Issue: Script Execution Errors

**Possible causes:**
1. Script timeout (default 30 seconds).
2. Runtime exception in script code.
3. Script references external system that is down.

**Diagnosis:**
1. Check health dashboard: script error count per interval.
2. Check site logs for the script name and error details.
3. Check if the script uses `ExternalSystem.Call()`; if so, the target may be down.

**Resolution:**
- If timeout: optimize the script or increase the timeout in configuration.
- If runtime error: fix the script in the template editor, re-deploy.
- If external system is down: script errors will stop when the system recovers.

### Issue: Login Fails but LDAP Server is Up

**Possible causes:**
1. Incorrect LDAP search base DN.
2. User account is locked in AD.
3. LDAP group-to-role mapping does not include a required group.
4. TLS certificate issue on LDAP connection.

**Diagnosis:**
1. Check central logs for LDAP bind errors.
2. Verify LDAP connectivity: `Test-NetConnection -ComputerName ldap.example.com -Port 636`
3. Test LDAP bind manually using an LDAP browser tool.

**Resolution:**
- Fix the LDAP configuration.
- Unlock the user account in AD.
- Update group mappings in the configuration database.

### Issue: High Dead Letter Count

**Possible causes:**
1. Messages being sent to actors that no longer exist (e.g., after instance deletion).
2. Actor mailbox overflow.
3. Misconfigured actor paths after deployment changes.

**Diagnosis:**
1. Check health dashboard: dead letter count trend.
2. Check site logs for dead letter details (actor path, message type).

**Resolution:**
- Dead letters during failover are expected and transient.
- Persistent dead letters indicate a configuration or code issue.
- If dead letters reference deleted instances, they are harmless (S&F messages are retained by design).

## Health Dashboard Interpretation

### Metric: Data Connection Status

| Status | Meaning | Action |
|--------|---------|--------|
| Connected | OPC UA connection active | None |
| Disconnected | Connection lost, auto-reconnecting | Check OPC UA server |
| Connecting | Initial connection in progress | Wait |

### Metric: Tag Resolution

- `TotalSubscribed`: Number of tags the system is trying to monitor.
- `SuccessfullyResolved`: Tags with active subscriptions.
- A gap between the two values indicates unresolved tags (devices still booting or tag path errors).

### Metric: S&F Buffer Depth

- `ExternalSystem`: Messages to external REST APIs awaiting delivery.
- `Notification`: Email notifications awaiting SMTP delivery.
- Growing depth indicates the target system is unreachable.

### Metric: Error Counts (Per Interval)

- Counts reset every 30 seconds (health report interval).
- Values are raw counts per interval, not rates; compare them across intervals to spot a trend.
- Occasional script errors during failover are expected.
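
Because the dashboard shows raw per-interval counts, a quick conversion makes consecutive intervals easier to compare. The sketch below is illustrative arithmetic only: the function name `ConvertTo-ErrorRate` and its output fields are hypothetical, and the counts are assumed to have been copied by hand from consecutive health reports.

```powershell
# Hypothetical helper: turn per-interval error counts into per-minute
# rates and show the change from the previous interval.
function ConvertTo-ErrorRate {
    param(
        [Parameter(Mandatory)][int[]]$Count,
        [int]$IntervalSeconds = 30   # documented health report interval
    )

    for ($i = 0; $i -lt $Count.Length; $i++) {
        [pscustomobject]@{
            Interval  = $i
            Count     = $Count[$i]
            PerMinute = $Count[$i] * 60 / $IntervalSeconds
            # The first interval has no predecessor, so its delta is 0.
            Delta     = if ($i -eq 0) { 0 } else { $Count[$i] - $Count[$i - 1] }
        }
    }
}

# Three consecutive intervals with 2, 5, and 3 errors.
ConvertTo-ErrorRate -Count 2, 5, 3
```

A `Delta` that stays positive over several intervals suggests a growing problem, while a brief spike that falls back is consistent with the failover noise described above.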