# ScadaLink Troubleshooting Guide

## Log Analysis

### Log Location

- **File logs:** `C:\ScadaLink\logs\scadalink-YYYYMMDD.log`
- **Console output:** Available when running interactively (not as a Windows Service)

### Log Format

```
[14:32:05 INF] [Central/central-01] Template "PumpStation" saved by admin
```

Format: `[Time Level] [NodeRole/NodeHostname] Message`

All log entries are enriched with:

- `SiteId`: Site identifier (or "central" for central nodes)
- `NodeHostname`: Machine hostname
- `NodeRole`: "Central" or "Site"

### Key Log Patterns

| Pattern | Meaning |
|---------|---------|
| `Starting ScadaLink host as {Role}` | Node startup |
| `Member joined` | Cluster peer connected |
| `Member removed` | Cluster peer departed |
| `Singleton acquired` | This node became the active singleton holder |
| `Instance {Name}: created N script actors` | Instance successfully deployed |
| `Script {Name} failed trust validation` | Script uses forbidden API |
| `Immediate delivery to {Target} failed` | S&F transient failure, message buffered |
| `Message {Id} parked` | S&F max retries reached |
| `Site {SiteId} marked offline` | No health report for 60 seconds |
| `Rejecting stale report` | Out-of-order health report (normal during failover) |

### Filtering Logs

Use the structured log properties for targeted analysis:

```powershell
# Find all errors for a specific site
# (the level shares its bracket with the timestamp, e.g. "[14:32:05 ERR]")
Select-String -Path "logs\scadalink-*.log" -Pattern " ERR\].*site-01"

# Find S&F activity
Select-String -Path "logs\scadalink-*.log" -Pattern "store-and-forward|buffered|parked"

# Find failover events
Select-String -Path "logs\scadalink-*.log" -Pattern "Singleton|Member joined|Member removed"
```

## Common Issues

### Issue: Site Appears Offline in Health Dashboard

**Possible causes:**

1. Site nodes are actually down.
2. Network connectivity between site and central is broken.
3. Health report interval has not elapsed since site startup.

**Diagnosis:**

1. Check if the site service is running: `sc.exe query ScadaLink-Site`
2. Check site logs for errors.
3. Verify network: `Test-NetConnection -ComputerName central-01.example.com -Port 8081`
4. Wait 60 seconds (the offline detection threshold).

**Resolution:**

- If the service is stopped, start it.
- If network is blocked, open firewall port 8081.
- If the site just started, wait for the first health report (30-second interval).

### Issue: Deployment Stuck in "InProgress"

**Possible causes:**

1. Site is unreachable during deployment.
2. Central node failed over mid-deployment.
3. Instance compilation failed on site.

**Diagnosis:**

1. Check deployment status in the UI.
2. Check site logs for the deployment ID: `Select-String -Path "logs\scadalink-*.log" -Pattern "dep-XXXXX"`
3. Check central logs for the deployment ID.

**Resolution:**

- If the site is unreachable: fix connectivity, then re-deploy (idempotent by revision hash).
- If compilation failed: check the script errors in site logs, fix the template, re-deploy.
- If stuck after failover: the new central node will re-query site state; wait or manually re-deploy.

### Issue: S&F Messages Accumulating

**Possible causes:**

1. External system is down.
2. SMTP server is unreachable.
3. Network issues between site and external target.

**Diagnosis:**

1. Check S&F buffer depth in health dashboard.
2. Check site logs for retry activity and error messages.
3. Verify external system connectivity from the site node.

**Resolution:**

- Fix the external system / SMTP / network issue. Retries resume automatically.
- If messages are permanently undeliverable: park and discard via the central UI.
- Check parked messages for patterns (same target, same error).

### Issue: OPC UA Connection Keeps Disconnecting

**Possible causes:**

1. OPC UA server is unstable.
2. Network intermittency.
3. Certificate trust issues.

**Diagnosis:**

1. Check DCL logs: look for "Entering Reconnecting state" frequency.
2. Check health dashboard: data connection status for the affected connection.
3. Verify OPC UA server health independently.

**Resolution:**

- DCL auto-reconnects at the configured interval (default 5 seconds).
- If the server certificate changed, update the trust store.
- If the server is consistently unstable, investigate the OPC UA server directly.

### Issue: Script Execution Errors

**Possible causes:**

1. Script timeout (default 30 seconds).
2. Runtime exception in script code.
3. Script references external system that is down.

**Diagnosis:**

1. Check health dashboard: script error count per interval.
2. Check site logs for the script name and error details.
3. Check if the script uses `ExternalSystem.Call()` (the target may be down).

**Resolution:**

- If timeout: optimize the script or increase the timeout in configuration.
- If runtime error: fix the script in the template editor, re-deploy.
- If external system is down: script errors will stop when the system recovers.

### Issue: Login Fails but LDAP Server is Up

**Possible causes:**

1. Incorrect LDAP search base DN.
2. User account is locked in AD.
3. LDAP group-to-role mapping does not include a required group.
4. TLS certificate issue on LDAP connection.

**Diagnosis:**

1. Check central logs for LDAP bind errors.
2. Verify LDAP connectivity: `Test-NetConnection -ComputerName ldap.example.com -Port 636`
3. Test LDAP bind manually using an LDAP browser tool.

**Resolution:**

- Fix the LDAP configuration.
- Unlock the user account in AD.
- Update group mappings in the configuration database.

### Issue: High Dead Letter Count

**Possible causes:**

1. Messages being sent to actors that no longer exist (e.g., after instance deletion).
2. Actor mailbox overflow.
3. Misconfigured actor paths after deployment changes.

**Diagnosis:**

1. Check health dashboard: dead letter count trend.
2. Check site logs for dead letter details (actor path, message type).

**Resolution:**

- Dead letters during failover are expected and transient.
- Persistent dead letters indicate a configuration or code issue.
- If dead letters reference deleted instances, they are harmless (S&F messages are retained by design).

## Health Dashboard Interpretation

### Metric: Data Connection Status

| Status | Meaning | Action |
|--------|---------|--------|
| Connected | OPC UA connection active | None |
| Disconnected | Connection lost, auto-reconnecting | Check OPC UA server |
| Connecting | Initial connection in progress | Wait |

### Metric: Tag Resolution

- `TotalSubscribed`: Number of tags the system is trying to monitor.
- `SuccessfullyResolved`: Tags with active subscriptions.
- Gap indicates unresolved tags (devices still booting or path errors).

### Metric: S&F Buffer Depth

- `ExternalSystem`: Messages to external REST APIs awaiting delivery.
- `Notification`: Email notifications awaiting SMTP delivery.
- Growing depth indicates the target system is unreachable.

### Metric: Error Counts (Per Interval)

- Counts reset every 30 seconds (health report interval).
- Raw counts, not rates; compare across intervals.
- Occasional script errors during failover are expected.
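
Because the error counts are raw per-interval values, a trend is easier to judge once the counts are put on a common time base and compared across consecutive intervals. The Python sketch below shows one way to do that. It assumes the counts were copied by hand from successive health reports; the function names, the sample data, and the three-interval threshold are illustrative and not part of ScadaLink.

```python
HEALTH_REPORT_INTERVAL_SECONDS = 30  # health report interval stated in this guide


def errors_per_minute(count, interval_seconds=HEALTH_REPORT_INTERVAL_SECONDS):
    """Convert one raw per-interval count into an errors-per-minute rate."""
    return count * 60 / interval_seconds


def is_sustained_increase(counts, min_intervals=3):
    """Return True if each of the last `min_intervals` counts exceeds the one before it.

    `min_intervals` is a hypothetical threshold chosen for illustration.
    """
    if len(counts) < min_intervals + 1:
        return False  # not enough history to judge a trend
    recent = counts[-(min_intervals + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical script error counts from five consecutive health reports.
script_errors = [0, 1, 2, 5, 9]
print(errors_per_minute(script_errors[-1]))
print(is_sustained_increase(script_errors))
```

A single spike followed by a drop, such as the occasional script errors seen during failover, does not count as a sustained increase under this check, whereas a count that keeps climbing across intervals does.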