ScadaBridge/docs/operations/troubleshooting-guide.md
Joseph Doherty 7b0b9c7365 refactor: rename ScadaLink → ZB.MOM.WW.ScadaBridge (code + projects + namespaces)
2026-05-28 09:37:45 -04:00

ScadaBridge Troubleshooting Guide

Log Analysis

Log Location

  • File logs: C:\ScadaBridge\logs\scadabridge-YYYYMMDD.log
  • Console output: Available when running interactively (not as a Windows Service)

Log Format

[14:32:05 INF] [Central/central-01] Template "PumpStation" saved by admin

Format: [Time Level] [NodeRole/NodeHostname] Message

All log entries are enriched with:

  • SiteId — Site identifier (or "central" for central nodes)
  • NodeHostname — Machine hostname
  • NodeRole — "Central" or "Site"
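The line format shown above is regular enough to parse with one regular expression, which helps when logs need to be filtered by role, host, or level rather than by free text. The following is a minimal Python sketch, assuming every line follows the sample exactly and that levels are three uppercase letters (such as INF or ERR). `LogEntry` and `parse_line` are illustrative names, not part of ScadaBridge.

```python
import re
from typing import NamedTuple, Optional

# Matches: [HH:MM:SS LVL] [NodeRole/NodeHostname] Message
# Assumption: levels are three uppercase letters, as in the INF/ERR samples.
LOG_LINE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2}) (?P<level>[A-Z]{3})\] "
    r"\[(?P<role>[^/\]]+)/(?P<host>[^\]]+)\] (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    time: str
    level: str
    role: str
    host: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Return the parsed fields, or None if the line does not fit the format."""
    match = LOG_LINE.match(line.rstrip("\r\n"))
    return LogEntry(**match.groupdict()) if match else None
```

Lines that do not match (for example, continuation lines of a stack trace) return `None`, so callers can skip them or attach them to the previous entry.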

Key Log Patterns

| Pattern | Meaning |
|---|---|
| Starting ScadaBridge host as {Role} | Node startup |
| Member joined | Cluster peer connected |
| Member removed | Cluster peer departed |
| Singleton acquired | This node became the active singleton holder |
| Instance {Name}: created N script actors | Instance successfully deployed |
| Script {Name} failed trust validation | Script uses forbidden API |
| Immediate delivery to {Target} failed | S&F transient failure, message buffered |
| Message {Id} parked | S&F max retries reached |
| Site {SiteId} marked offline | No health report for 60 seconds |
| Rejecting stale report | Out-of-order health report (normal during failover) |
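To get a quick overview of a log file, the patterns above can be turned into regular expressions and tallied. This is a rough Python sketch: the regexes are derived from the message templates listed here with the placeholders replaced by wildcards, so the exact runtime wording is an assumption. The event names (`peer_joined`, `sf_parked`, and so on) are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: runtime messages match the templates with {placeholders} filled in.
EVENT_PATTERNS = {
    "node_startup": re.compile(r"Starting ScadaBridge host as \S+"),
    "peer_joined": re.compile(r"Member joined"),
    "peer_departed": re.compile(r"Member removed"),
    "singleton_acquired": re.compile(r"Singleton acquired"),
    "instance_deployed": re.compile(r"Instance .+: created \d+ script actors"),
    "script_rejected": re.compile(r"Script .+ failed trust validation"),
    "sf_buffered": re.compile(r"Immediate delivery to .+ failed"),
    "sf_parked": re.compile(r"Message .+ parked"),
    "site_offline": re.compile(r"Site .+ marked offline"),
    "stale_report": re.compile(r"Rejecting stale report"),
}


def count_events(lines: Iterable[str]) -> Counter:
    """Count how many lines match each known event pattern."""
    counts: Counter = Counter()
    for line in lines:
        for name, pattern in EVENT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # count each line under the first matching event only
    return counts
```

A sudden rise in `peer_departed` together with `singleton_acquired` would point at a failover, while a steady stream of `sf_parked` suggests a delivery target that has been down for a long time.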

Filtering Logs

Use the structured log properties for targeted analysis:

# Find all errors for a specific site
Select-String -Path "logs\scadabridge-*.log" -Pattern "\[ERR\].*site-01"

# Find S&F activity
Select-String -Path "logs\scadabridge-*.log" -Pattern "store-and-forward|buffered|parked"

# Find failover events
Select-String -Path "logs\scadabridge-*.log" -Pattern "Singleton|Member joined|Member removed"

Common Issues

Issue: Site Appears Offline in Health Dashboard

Possible causes:

  1. Site nodes are actually down.
  2. Network connectivity between site and central is broken.
  3. Health report interval has not elapsed since site startup.

Diagnosis:

  1. Check if the site service is running: sc.exe query ScadaBridge-Site
  2. Check site logs for errors.
  3. Verify network: Test-NetConnection -ComputerName central-01.example.com -Port 8081
  4. Wait 60 seconds (the offline detection threshold).

Resolution:

  • If the service is stopped, start it.
  • If network is blocked, open firewall port 8081.
  • If the site just started, wait for the first health report (30-second interval).
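The two timing values in this section interact: with a 30-second report interval and a 60-second offline threshold, a site is flagged after roughly two consecutive missed reports. The Python sketch below only illustrates that arithmetic. The function names are hypothetical, and treating exactly 60 seconds as offline is an assumption about how the threshold is applied.

```python
from datetime import datetime, timedelta

REPORT_INTERVAL = timedelta(seconds=30)    # health report interval (from this guide)
OFFLINE_THRESHOLD = timedelta(seconds=60)  # offline detection threshold (from this guide)


def is_offline(last_report: datetime, now: datetime) -> bool:
    """True once no report has arrived for the full threshold (assumed inclusive)."""
    return now - last_report >= OFFLINE_THRESHOLD


def missed_reports(last_report: datetime, now: datetime) -> int:
    """Number of whole report intervals that have passed without a report."""
    return max(0, (now - last_report) // REPORT_INTERVAL)
```

For example, 45 seconds after the last report the site has missed one report but is not yet offline; at 65 seconds it has missed two and is flagged.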

Issue: Deployment Stuck in "InProgress"

Possible causes:

  1. Site is unreachable during deployment.
  2. Central node failed over mid-deployment.
  3. Instance compilation failed on site.

Diagnosis:

  1. Check deployment status in the UI.
  2. Check site logs for the deployment ID: Select-String -Path "logs\scadabridge-*.log" -Pattern "dep-XXXXX"
  3. Check central logs for the deployment ID.

Resolution:

  • If the site is unreachable: fix connectivity, then re-deploy (idempotent by revision hash).
  • If compilation failed: check the script errors in site logs, fix the template, re-deploy.
  • If stuck after failover: the new central node will re-query site state; wait or manually re-deploy.
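When the same deployment ID has to be traced across several daily log files, a small script can collect every matching line with its file name and line number. This is a minimal Python sketch, assuming the `scadabridge-YYYYMMDD.log` naming from the Log Location section and that the deployment ID appears verbatim in the log text. `find_deployment` is an illustrative name.

```python
from pathlib import Path
from typing import Iterator, Tuple


def find_deployment(log_dir: str, deployment_id: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (file name, line number, line) for every line mentioning the ID."""
    for path in sorted(Path(log_dir).glob("scadabridge-*.log")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if deployment_id in line:
                    yield path.name, lineno, line.rstrip("\r\n")
```

Running it against both the site and central log directories and sorting the results by timestamp gives a single timeline for the deployment.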

Issue: S&F Messages Accumulating

Possible causes:

  1. External system is down.
  2. SMTP server is unreachable.
  3. Network issues between site and external target.

Diagnosis:

  1. Check S&F buffer depth in health dashboard.
  2. Check site logs for retry activity and error messages.
  3. Verify external system connectivity from the site node.

Resolution:

  • Fix the external system / SMTP / network issue. Retries resume automatically.
  • If messages are permanently undeliverable: park and discard via the central UI.
  • Check parked messages for patterns (same target, same error).
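Grouping delivery failures by target is a quick way to see whether one system accounts for most of the backlog. The Python sketch below assumes the "Immediate delivery to {Target} failed" wording from Key Log Patterns appears verbatim in the site logs. `failures_by_target` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: the target name sits between "delivery to " and " failed".
DELIVERY_FAILED = re.compile(r"Immediate delivery to (?P<target>.+?) failed")


def failures_by_target(lines: Iterable[str]) -> Counter:
    """Count buffered-delivery failures per target system."""
    counts: Counter = Counter()
    for line in lines:
        match = DELIVERY_FAILED.search(line)
        if match:
            counts[match.group("target")] += 1
    return counts
```

Calling `failures_by_target(lines).most_common(5)` lists the targets responsible for the most buffered messages.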

Issue: OPC UA Connection Keeps Disconnecting

Possible causes:

  1. OPC UA server is unstable.
  2. Network intermittency.
  3. Certificate trust issues.

Diagnosis:

  1. Check DCL logs: look for "Entering Reconnecting state" frequency.
  2. Check health dashboard: data connection status for the affected connection.
  3. Verify OPC UA server health independently.

Resolution:

  • DCL auto-reconnects at the configured interval (default 5 seconds).
  • If the server certificate changed, update the trust store.
  • If the server is consistently unstable, investigate the OPC UA server directly.
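To judge how often the connection drops, the "Entering Reconnecting state" entries can be counted per hour. This Python sketch assumes the log line format described under Log Format and that the phrase appears verbatim in the message. `reconnects_per_hour` is an illustrative name.

```python
import re
from collections import Counter
from typing import Iterable

# Captures the hour from the [HH:MM:SS LVL] prefix of reconnect entries.
RECONNECT = re.compile(
    r"^\[(?P<hour>\d{2}):\d{2}:\d{2} [A-Z]{3}\].*Entering Reconnecting state"
)


def reconnects_per_hour(lines: Iterable[str]) -> Counter:
    """Count reconnect attempts grouped by hour of day ("00" to "23")."""
    counts: Counter = Counter()
    for line in lines:
        match = RECONNECT.match(line)
        if match:
            counts[match.group("hour")] += 1
    return counts
```

A handful of reconnects clustered in one hour suggests a single outage, while a similar count in every hour points to an unstable server or network path.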

Issue: Script Execution Errors

Possible causes:

  1. Script timeout (default 30 seconds).
  2. Runtime exception in script code.
  3. Script references external system that is down.

Diagnosis:

  1. Check health dashboard: script error count per interval.
  2. Check site logs for the script name and error details.
  3. Check if the script uses ExternalSystem.Call() — the target may be down.

Resolution:

  • If timeout: optimize the script or increase the timeout in configuration.
  • If runtime error: fix the script in the template editor, re-deploy.
  • If external system is down: script errors will stop when the system recovers.

Issue: Login Fails but LDAP Server is Up

Possible causes:

  1. Incorrect LDAP search base DN.
  2. User account is locked in AD.
  3. LDAP group-to-role mapping does not include a required group.
  4. TLS certificate issue on LDAP connection.

Diagnosis:

  1. Check central logs for LDAP bind errors.
  2. Verify LDAP connectivity: Test-NetConnection -ComputerName ldap.example.com -Port 636
  3. Test LDAP bind manually using an LDAP browser tool.

Resolution:

  • Correct the search base DN or TLS settings in the LDAP configuration.
  • Unlock the user account in AD.
  • Update group mappings in the configuration database.

Issue: High Dead Letter Count

Possible causes:

  1. Messages being sent to actors that no longer exist (e.g., after instance deletion).
  2. Actor mailbox overflow.
  3. Misconfigured actor paths after deployment changes.

Diagnosis:

  1. Check health dashboard: dead letter count trend.
  2. Check site logs for dead letter details (actor path, message type).

Resolution:

  • Dead letters during failover are expected and transient.
  • Persistent dead letters indicate a configuration or code issue.
  • If dead letters reference deleted instances, they are harmless (S&F messages are retained by design).

Health Dashboard Interpretation

Metric: Data Connection Status

Status Meaning Action
Connected OPC UA connection active None
Disconnected Connection lost, auto-reconnecting Check OPC UA server
Connecting Initial connection in progress Wait

Metric: Tag Resolution

  • TotalSubscribed: Number of tags the system is trying to monitor.
  • SuccessfullyResolved: Tags with active subscriptions.
  • A gap between the two values indicates unresolved tags (devices still booting or tag path errors).
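The gap is simply the difference between the two counters. A small Python sketch of the arithmetic follows; the function name and the choice to report an empty subscription set as fully resolved are assumptions made for illustration.

```python
from typing import Tuple


def tag_resolution_gap(total_subscribed: int, successfully_resolved: int) -> Tuple[int, float]:
    """Return (unresolved tag count, percentage of tags resolved)."""
    gap = total_subscribed - successfully_resolved
    if total_subscribed == 0:
        return gap, 100.0  # nothing to resolve; treated as fully resolved
    return gap, 100.0 * successfully_resolved / total_subscribed
```

With 200 subscribed tags and 150 resolved, the gap is 50 tags and 75% of tags are resolved.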

Metric: S&F Buffer Depth

  • ExternalSystem: Messages to external REST APIs awaiting delivery.
  • Notification: Email notifications awaiting SMTP delivery.
  • Growing depth indicates the target system is unreachable.

Metric: Error Counts (Per Interval)

  • Counts reset every 30 seconds (health report interval).
  • Values are raw counts, not rates; compare successive intervals to spot trends.
  • Occasional script errors during failover are expected.
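Because each value covers one 30-second interval, a single spike means little; a run of elevated intervals is the more useful signal. The Python sketch below converts a raw count to a per-minute figure and checks for a sustained run. The threshold and the three-interval window are hypothetical values chosen for illustration, not ScadaBridge defaults.

```python
from typing import Sequence

REPORT_INTERVAL_SECONDS = 30  # health report interval (from this guide)


def per_minute(count: int) -> float:
    """Convert one interval's raw count to an errors-per-minute figure."""
    return count * 60 / REPORT_INTERVAL_SECONDS


def sustained(counts: Sequence[int], threshold: int, intervals: int = 3) -> bool:
    """True if the most recent `intervals` counts (>= 1) all exceed `threshold`."""
    recent = counts[-intervals:]
    return len(recent) == intervals and all(c > threshold for c in recent)
```

A sequence such as 0, 4, 0, 0 (a brief failover blip) is not flagged with a threshold of 2, whereas 0, 3, 5, 4 is.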