Refine Data Connection Layer: error handling, reconnection, write failures, health reporting

Add connection lifecycle (fixed-interval auto-reconnect, immediate bad quality on
disconnect, transparent re-subscribe), synchronous write failure errors to scripts,
periodic tag path resolution retry, and enhanced health reporting with tag resolution
counts. Update cross-references in Health Monitoring and Site Runtime.
This commit is contained in:
Joseph Doherty
2026-03-16 07:51:37 -04:00
parent f0108e161b
commit 19c7e6880f
4 changed files with 89 additions and 2 deletions

View File

@@ -67,6 +67,41 @@ Each value update delivered to an Instance Actor includes:
- **Quality**: Data quality indicator (good, bad, uncertain).
- **Timestamp**: When the value was read from the device.
## Connection Lifecycle & Reconnection
The DCL manages connection lifecycle automatically:
1. **Connection drop detection**: When a connection to a data source is lost, the DCL immediately pushes a value update with quality `bad` for **every tag subscribed on that connection**. Instance Actors and their downstream consumers (alarms, scripts checking quality) see the staleness immediately.
2. **Auto-reconnect with fixed interval**: The DCL retries the connection at a configurable fixed interval (e.g., every 5 seconds). The retry interval is defined **per data connection**. This is consistent with the fixed-interval retry philosophy used throughout the system.
3. **Connection state transitions**: The DCL tracks each connection's state as `connected`, `disconnected`, or `reconnecting`. All transitions are logged to Site Event Logging.
4. **Transparent re-subscribe**: On successful reconnection, the DCL automatically re-establishes all previously active subscriptions for that connection. Instance Actors require no action — they simply see quality return to `good` as fresh values arrive from restored subscriptions.
## Write Failure Handling
Writes to physical devices are **synchronous** from the script's perspective:
- If the write fails (connection down, device rejection, timeout), the error is **returned to the calling script**. Script authors can catch and handle write errors (log, notify, retry, etc.).
- Write failures are also logged to Site Event Logging.
- There is **no store-and-forward for device writes** — these are real-time control operations. Buffering stale setpoints for later application would be dangerous in an industrial context.
## Tag Path Resolution
When the DCL subscribes to a tag path from the flattened configuration but the path does not exist on the physical device (e.g., typo in the template, device firmware changed, device still booting):
1. The failure is **logged to Site Event Logging**.
2. The attribute is marked with quality `bad`.
3. The DCL **periodically retries resolution** at a configurable interval, accommodating devices that come online in stages or load modules after startup.
4. On successful resolution, the subscription activates normally and quality reflects the live value from the device.
Note: Pre-deployment validation at central does **not** verify that tag paths resolve to real tags on physical devices — that is a runtime concern handled here.
## Health Reporting
The DCL reports the following metrics to the Health Monitoring component via the existing periodic heartbeat:
- **Connection status**: `connected`, `disconnected`, or `reconnecting` per data connection.
- **Tag resolution counts**: Per connection, the number of total subscribed tags vs. successfully resolved tags. This gives operators visibility into misconfigured templates without needing to open the debug view for individual instances.
## Dependencies
- **Site Runtime (Instance Actors)**: Receives subscription registrations and delivers value updates. Receives write requests.