Joseph Doherty b659978764 Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs

- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)
92 new tests, all passing. Zero warnings.

2026-03-16 22:12:31 -04:00

ScadaLink Troubleshooting Guide

Log Analysis

Log Location

File logs: C:\ScadaLink\logs\scadalink-YYYYMMDD.log
Console output: Available when running interactively (not as a Windows Service)

Log Format

[14:32:05 INF] [Central/central-01] Template "PumpStation" saved by admin

Format: [Time Level] [NodeRole/NodeHostname] Message

All log entries are enriched with:

SiteId — Site identifier (or "central" for central nodes)
NodeHostname — Machine hostname
NodeRole — "Central" or "Site"
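The line format above can be split into fields with a regular expression. The following is a minimal sketch, assuming the time is always `HH:mm:ss` and the level is a three-letter code, as in the sample line. `ConvertFrom-ScadaLinkLogLine` is a hypothetical helper name, not part of ScadaLink.

```powershell
# Sketch: split one log line into the documented fields.
# ConvertFrom-ScadaLinkLogLine is a hypothetical helper, not shipped with ScadaLink.
function ConvertFrom-ScadaLinkLogLine {
    param([Parameter(Mandatory)][string]$Line)

    # Documented format: [Time Level] [NodeRole/NodeHostname] Message
    $pattern = '^\[(?<Time>\d{2}:\d{2}:\d{2}) (?<Level>[A-Z]{3})\] ' +
               '\[(?<NodeRole>[^/\]]+)/(?<NodeHostname>[^\]]+)\] (?<Message>.*)$'

    if ($Line -match $pattern) {
        [pscustomobject]@{
            Time         = $Matches['Time']
            Level        = $Matches['Level']
            NodeRole     = $Matches['NodeRole']
            NodeHostname = $Matches['NodeHostname']
            Message      = $Matches['Message']
        }
    }
    # Lines that do not match (for example multi-line exception text) produce no output.
}

ConvertFrom-ScadaLinkLogLine '[14:32:05 INF] [Central/central-01] Template "PumpStation" saved by admin'
```

Piping `Get-Content` through this helper yields objects that can be filtered with `Where-Object` on `Level` or `NodeHostname` instead of matching raw text.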

Key Log Patterns

Pattern	Meaning
`Starting ScadaLink host as {Role}`	Node startup
`Member joined`	Cluster peer connected
`Member removed`	Cluster peer departed
`Singleton acquired`	This node became the active singleton holder
`Instance {Name}: created N script actors`	Instance successfully deployed
`Script {Name} failed trust validation`	Script uses forbidden API
`Immediate delivery to {Target} failed`	Store-and-forward (S&F) transient failure, message buffered
`Message {Id} parked`	S&F max retries reached
`Site {SiteId} marked offline`	No health report for 60 seconds
`Rejecting stale report`	Out-of-order health report (normal during failover)

Filtering Logs

Use the structured log properties for targeted analysis:

# Find all errors for a specific site
Select-String -Path "logs\scadalink-*.log" -Pattern " ERR\].*site-01"

# Find S&F activity
Select-String -Path "logs\scadalink-*.log" -Pattern "store-and-forward|buffered|parked"

# Find failover events
Select-String -Path "logs\scadalink-*.log" -Pattern "Singleton|Member joined|Member removed"
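
Before drilling into individual patterns, it can help to see how many lines each level produced. The sketch below assumes the three-letter level codes shown in the Log Format sample. `Get-LogLevelSummary` is a hypothetical helper name.

```powershell
# Sketch: count log lines per level across the daily log files.
# Get-LogLevelSummary is a hypothetical helper; the default path is relative to C:\ScadaLink.
function Get-LogLevelSummary {
    param([string[]]$Path = 'logs\scadalink-*.log')

    # Capture the level from the leading "[HH:mm:ss LVL]" prefix of each line.
    Select-String -Path $Path -Pattern '^\[\d{2}:\d{2}:\d{2} (\w{3})\]' |
        Group-Object { $_.Matches[0].Groups[1].Value } |
        Sort-Object Count -Descending |
        Select-Object Name, Count
}

# Example call (run from C:\ScadaLink):
# Get-LogLevelSummary
```

A sudden rise in the `ERR` or `WRN` count for one day's file is a hint about where to apply the narrower filters above.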

Common Issues

Issue: Site Appears Offline in Health Dashboard

Possible causes:

Site nodes are actually down.
Network connectivity between site and central is broken.
Health report interval has not elapsed since site startup.

Diagnosis:

Check if the site service is running: sc.exe query ScadaLink-Site
Check site logs for errors.
Verify network: Test-NetConnection -ComputerName central-01.example.com -Port 8081
Wait 60 seconds (the offline detection threshold).

Resolution:

If the service is stopped, start it.
If network is blocked, open firewall port 8081.
If the site just started, wait for the first health report (30-second interval).
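
The diagnosis and resolution steps above amount to a short decision sequence, sketched below. `Get-SiteOfflineAdvice` and its parameters are hypothetical names introduced for illustration. The 30-second value is the health report interval stated above.

```powershell
# Sketch: map the three documented causes to the matching resolution.
# Get-SiteOfflineAdvice is a hypothetical helper, not part of ScadaLink.
function Get-SiteOfflineAdvice {
    param(
        [Parameter(Mandatory)][bool]$ServiceRunning,    # result of the service check
        [Parameter(Mandatory)][bool]$PortReachable,     # result of the port 8081 check
        [Parameter(Mandatory)][int]$SecondsSinceStart   # time since the site service started
    )

    if (-not $ServiceRunning) { return 'Start the ScadaLink-Site service.' }
    if (-not $PortReachable)  { return 'Open firewall port 8081 between site and central.' }
    if ($SecondsSinceStart -lt 30) { return 'Wait for the first health report (30-second interval).' }
    return 'Check site logs for errors.'
}

Get-SiteOfflineAdvice -ServiceRunning $true -PortReachable $false -SecondsSinceStart 300
```

On a site node, `$ServiceRunning` could come from `(Get-Service -Name 'ScadaLink-Site').Status -eq 'Running'` and `$PortReachable` from the `TcpTestSucceeded` property of the `Test-NetConnection` result.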

Issue: Deployment Stuck in "InProgress"

Possible causes:

Site is unreachable during deployment.
Central node failed over mid-deployment.
Instance compilation failed on site.

Diagnosis:

Check deployment status in the UI.
Check site logs for the deployment ID: Select-String -Path "logs\scadalink-*.log" -Pattern "dep-XXXXX"
Check central logs for the deployment ID.

Resolution:

If the site is unreachable: fix connectivity, then re-deploy (idempotent by revision hash).
If compilation failed: check the script errors in site logs, fix the template, re-deploy.
If stuck after failover: the new central node will re-query site state; wait or manually re-deploy.
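
Because a deployment touches both the site and central logs, it can be convenient to search several log locations for the same ID in one pass. The following is a sketch only: `Get-DeploymentTrace` is a hypothetical helper, and any remote or copied log path passed to it is an assumption about how the files are made available.

```powershell
# Sketch: list every line that mentions a deployment ID across one or more log locations.
# Get-DeploymentTrace is a hypothetical helper, not part of ScadaLink.
function Get-DeploymentTrace {
    param(
        [Parameter(Mandatory)][string]$DeploymentId,
        [string[]]$Path = 'logs\scadalink-*.log'
    )

    # -SimpleMatch treats the ID as literal text rather than a regular expression.
    Select-String -Path $Path -Pattern $DeploymentId -SimpleMatch |
        Select-Object Path, LineNumber, Line
}

# Example call; "dep-XXXXX" stands for the real deployment ID:
# Get-DeploymentTrace -DeploymentId 'dep-XXXXX'
```

The `Path` column shows which file each hit came from, which makes it easier to see whether the deployment stalled on the central side or on the site.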

Issue: S&F Messages Accumulating

Possible causes:

External system is down.
SMTP server is unreachable.
Network issues between site and external target.

Diagnosis:

Check S&F buffer depth in health dashboard.
Check site logs for retry activity and error messages.
Verify external system connectivity from the site node.

Resolution:

Fix the external system / SMTP / network issue. Retries resume automatically.
If messages are permanently undeliverable: park and discard via the central UI.
Check parked messages for patterns (same target, same error).
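
To spot a pattern such as one target accounting for most failures, the `Immediate delivery to {Target} failed` log lines can be grouped by target. This is a sketch that assumes the target is rendered as a single token without spaces. `Get-FailedDeliveryByTarget` is a hypothetical helper name.

```powershell
# Sketch: count failed immediate deliveries per target.
# Get-FailedDeliveryByTarget is a hypothetical helper, not part of ScadaLink.
# Assumption: {Target} appears in the log as one whitespace-free token.
function Get-FailedDeliveryByTarget {
    param([string[]]$Path = 'logs\scadalink-*.log')

    Select-String -Path $Path -Pattern 'Immediate delivery to (\S+) failed' |
        Group-Object { $_.Matches[0].Groups[1].Value } |
        Sort-Object Count -Descending |
        Select-Object @{ Name = 'Target'; Expression = { $_.Name } }, Count
}

# Example call (run from C:\ScadaLink on the site node):
# Get-FailedDeliveryByTarget
```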

Issue: OPC UA Connection Keeps Disconnecting

Possible causes:

OPC UA server is unstable.
Network intermittency.
Certificate trust issues.

Diagnosis:

Check DCL logs: look for "Entering Reconnecting state" frequency.
Check health dashboard: data connection status for the affected connection.
Verify OPC UA server health independently.

Resolution:

DCL auto-reconnects at the configured interval (default 5 seconds).
If the server certificate changed, update the trust store.
If the server is consistently unstable, investigate the OPC UA server directly.
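
The frequency of "Entering Reconnecting state" entries can be estimated by grouping matches by log file and hour. This is a sketch that assumes the `[HH:mm:ss LVL]` line prefix shown in the Log Format section. `Get-ReconnectsPerHour` is a hypothetical helper name.

```powershell
# Sketch: count "Entering Reconnecting state" entries per log file and hour.
# Get-ReconnectsPerHour is a hypothetical helper, not part of ScadaLink.
function Get-ReconnectsPerHour {
    param([string[]]$Path = 'logs\scadalink-*.log')

    Select-String -Path $Path -Pattern '^\[(\d{2}):\d{2}:\d{2} \w{3}\].*Entering Reconnecting state' |
        ForEach-Object {
            [pscustomobject]@{
                File = $_.Filename                      # the daily file name carries the date
                Hour = $_.Matches[0].Groups[1].Value    # hour taken from the timestamp
            }
        } |
        Group-Object File, Hour |
        Select-Object Name, Count
}

# Example call (run from C:\ScadaLink on the site node):
# Get-ReconnectsPerHour
```

A handful of reconnects clustered in one hour points to a transient outage, while a steady count in every hour suggests an unstable server or network path.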

Issue: Script Execution Errors

Possible causes:

Script timeout (default 30 seconds).
Runtime exception in script code.
Script references external system that is down.

Diagnosis:

Check health dashboard: script error count per interval.
Check site logs for the script name and error details.
Check if the script uses ExternalSystem.Call() — the target may be down.

Resolution:

If timeout: optimize the script or increase the timeout in configuration.
If runtime error: fix the script in the template editor, re-deploy.
If external system is down: script errors will stop when the system recovers.

Issue: Login Fails but LDAP Server is Up

Possible causes:

Incorrect LDAP search base DN.
User account is locked in AD.
LDAP group-to-role mapping does not include a required group.
TLS certificate issue on LDAP connection.

Diagnosis:

Check central logs for LDAP bind errors.
Verify LDAP connectivity: Test-NetConnection -ComputerName ldap.example.com -Port 636
Test LDAP bind manually using an LDAP browser tool.

Resolution:

Fix the LDAP configuration.
Unlock the user account in AD.
Update group mappings in the configuration database.

Issue: High Dead Letter Count

Possible causes:

Messages being sent to actors that no longer exist (e.g., after instance deletion).
Actor mailbox overflow.
Misconfigured actor paths after deployment changes.

Diagnosis:

Check health dashboard: dead letter count trend.
Check site logs for dead letter details (actor path, message type).

Resolution:

Dead letters during failover are expected and transient.
Persistent dead letters indicate a configuration or code issue.
If dead letters reference deleted instances, they are harmless (S&F messages are retained by design).

Health Dashboard Interpretation

Metric: Data Connection Status

Status	Meaning	Action
Connected	OPC UA connection active	None
Disconnected	Connection lost, auto-reconnecting	Check OPC UA server
Connecting	Initial connection in progress	Wait

Metric: Tag Resolution

TotalSubscribed: Number of tags the system is trying to monitor.
SuccessfullyResolved: Tags with active subscriptions.
Gap indicates unresolved tags (devices still booting or path errors).

Metric: S&F Buffer Depth

ExternalSystem: Messages to external REST APIs awaiting delivery.
Notification: Email notifications awaiting SMTP delivery.
Growing depth indicates the target system is unreachable.

Metric: Error Counts (Per Interval)

Counts reset every 30 seconds (health report interval).
Raw counts, not rates — compare across intervals.
Occasional script errors during failover are expected.
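
Because the dashboard shows raw counts per 30-second interval, comparing intervals is easier after converting a run of counts to a per-minute rate. The sketch below uses made-up sample counts. `Get-ErrorRatePerMinute` is a hypothetical helper name.

```powershell
# Sketch: convert raw per-interval error counts to an average per-minute rate.
# Get-ErrorRatePerMinute is a hypothetical helper, not part of ScadaLink.
function Get-ErrorRatePerMinute {
    param(
        [Parameter(Mandatory)][int[]]$Counts,   # counts from consecutive health reports
        [int]$IntervalSeconds = 30              # documented health report interval
    )

    $total   = ($Counts | Measure-Object -Sum).Sum
    $seconds = $Counts.Count * $IntervalSeconds
    [math]::Round($total * 60 / $seconds, 2)
}

# Hypothetical sample: five consecutive intervals (150 seconds) with 13 errors in total.
Get-ErrorRatePerMinute -Counts 0, 2, 1, 4, 6   # 13 * 60 / 150 = 5.2 errors per minute
```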
