Refine Data Connection Layer: error handling, reconnection, write failures, health reporting

Add connection lifecycle (fixed-interval auto-reconnect, immediate bad quality on disconnect, transparent re-subscribe), synchronous write failure errors to scripts, periodic tag path resolution retry, and enhanced health reporting with tag resolution counts. Update cross-references in Health Monitoring and Site Runtime.
2026-03-16 07:51:37 -04:00
parent f0108e161b
commit 19c7e6880f
4 changed files with 89 additions and 2 deletions
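The connection lifecycle summarized in the commit description (bad quality on disconnect, fixed-interval retry, transparent re-subscribe) can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the codebase: the `DataConnection` class, its method names, the in-memory `updates` and `transitions` lists (stand-ins for value delivery to Instance Actors and for Site Event Logging), and the exact points at which the state moves between `disconnected` and `reconnecting`.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Set, Tuple


class ConnState(Enum):
    CONNECTED = "connected"
    DISCONNECTED = "disconnected"
    RECONNECTING = "reconnecting"


@dataclass
class DataConnection:
    """Hypothetical model of one data connection (illustration only)."""
    name: str
    retry_interval_s: float = 5.0  # fixed interval, configured per connection
    state: ConnState = ConnState.CONNECTED
    subscribed: Set[str] = field(default_factory=set)
    updates: List[Tuple[str, str]] = field(default_factory=list)  # (tag, quality)
    transitions: List[str] = field(default_factory=list)  # stand-in for event log
    resubscribed: List[str] = field(default_factory=list)

    def _set_state(self, new_state: ConnState) -> None:
        # Every transition is recorded, mirroring "all transitions are logged".
        self.transitions.append(f"{self.state.value}->{new_state.value}")
        self.state = new_state

    def on_connection_lost(self) -> None:
        # Push quality "bad" for every tag subscribed on this connection.
        self._set_state(ConnState.DISCONNECTED)
        for tag in sorted(self.subscribed):
            self.updates.append((tag, "bad"))

    def attempt_reconnect(self, connect: Callable[[], bool]) -> float:
        """Run one retry; return the delay before the next one (0.0 if connected)."""
        self._set_state(ConnState.RECONNECTING)
        if connect():
            self._set_state(ConnState.CONNECTED)
            # Transparent re-subscribe: restore every previously active subscription.
            self.resubscribed = sorted(self.subscribed)
            return 0.0
        self._set_state(ConnState.DISCONNECTED)
        return self.retry_interval_s  # fixed interval, no backoff


conn = DataConnection("plc-1", subscribed={"Tank.Level", "Pump.Speed"})
conn.on_connection_lost()
first_delay = conn.attempt_reconnect(lambda: False)   # device still unreachable
second_delay = conn.attempt_reconnect(lambda: True)   # device is back
```

In this sketch the caller owns the timer: a failed attempt returns the same per-connection interval every time, which is what distinguishes a fixed-interval policy from exponential backoff.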
--- a/Component-DataConnectionLayer.md
+++ b/Component-DataConnectionLayer.md
@@ -67,6 +67,41 @@ Each value update delivered to an Instance Actor includes:
 - **Quality**: Data quality indicator (good, bad, uncertain).
 - **Timestamp**: When the value was read from the device.

+## Connection Lifecycle & Reconnection
+
+The DCL manages connection lifecycle automatically:
+
+1. **Connection drop detection**: When a connection to a data source is lost, the DCL immediately pushes a value update with quality `bad` for **every tag subscribed on that connection**. Instance Actors and their downstream consumers (alarms, scripts checking quality) see the staleness immediately.
+2. **Auto-reconnect with fixed interval**: The DCL retries the connection at a configurable fixed interval (e.g., every 5 seconds). The retry interval is defined **per data connection**. This is consistent with the fixed-interval retry philosophy used throughout the system.
+3. **Connection state transitions**: The DCL tracks each connection's state as `connected`, `disconnected`, or `reconnecting`. All transitions are logged to Site Event Logging.
+4. **Transparent re-subscribe**: On successful reconnection, the DCL automatically re-establishes all previously active subscriptions for that connection. Instance Actors require no action — they simply see quality return to `good` as fresh values arrive from restored subscriptions.
+
+## Write Failure Handling
+
+Writes to physical devices are **synchronous** from the script's perspective:
+
+- If the write fails (connection down, device rejection, timeout), the error is **returned to the calling script**. Script authors can catch and handle write errors (log, notify, retry, etc.).
+- Write failures are also logged to Site Event Logging.
+- There is **no store-and-forward for device writes** — these are real-time control operations. Buffering stale setpoints for later application would be dangerous in an industrial context.
+
+## Tag Path Resolution
+
+When the DCL subscribes to a tag path from the flattened configuration but the path does not exist on the physical device (e.g., typo in the template, device firmware changed, device still booting), it handles the failure as follows:
+
+1. The failure is **logged to Site Event Logging**.
+2. The attribute is marked with quality `bad`.
+3. The DCL **periodically retries resolution** at a configurable interval, accommodating devices that come online in stages or load modules after startup.
+4. On successful resolution, the subscription activates normally and quality reflects the live value from the device.
+
+Note: Pre-deployment validation at central does **not** verify that tag paths resolve to real tags on physical devices — that is a runtime concern handled here.
+
+## Health Reporting
+
+The DCL reports the following metrics to the Health Monitoring component via the existing periodic heartbeat:
+
+- **Connection status**: `connected`, `disconnected`, or `reconnecting` per data connection.
+- **Tag resolution counts**: Per connection, the total number of subscribed tags vs. the number of successfully resolved tags. This gives operators visibility into misconfigured templates without needing to open the debug view for individual instances.
+
 ## Dependencies

 - **Site Runtime (Instance Actors)**: Receives subscription registrations and delivers value updates. Receives write requests.