Refine Data Connection Layer: error handling, reconnection, write failures, health reporting

Add connection lifecycle (fixed-interval auto-reconnect, immediate bad quality on disconnect, transparent re-subscribe), synchronous write failure errors to scripts, periodic tag path resolution retry, and enhanced health reporting with tag resolution counts. Update cross-references in Health Monitoring and Site Runtime.
2026-03-16 07:51:37 -04:00
parent f0108e161b
commit 19c7e6880f
4 changed files with 89 additions and 2 deletions
--- a/Component-DataConnectionLayer.md
+++ b/Component-DataConnectionLayer.md
@@ -67,6 +67,41 @@ Each value update delivered to an Instance Actor includes:
 - **Quality**: Data quality indicator (good, bad, uncertain).
 - **Timestamp**: When the value was read from the device.

+## Connection Lifecycle & Reconnection
+
+The DCL manages connection lifecycle automatically:
+
+1. **Connection drop detection**: When a connection to a data source is lost, the DCL immediately pushes a value update with quality `bad` for **every tag subscribed on that connection**. Instance Actors and their downstream consumers (alarms, scripts checking quality) see the staleness immediately.
+2. **Auto-reconnect with fixed interval**: The DCL retries the connection at a configurable fixed interval (e.g., every 5 seconds). The retry interval is defined **per data connection**. This is consistent with the fixed-interval retry philosophy used throughout the system.
+3. **Connection state transitions**: The DCL tracks each connection's state as `connected`, `disconnected`, or `reconnecting`. All transitions are logged to Site Event Logging.
+4. **Transparent re-subscribe**: On successful reconnection, the DCL automatically re-establishes all previously active subscriptions for that connection. Instance Actors require no action — they simply see quality return to `good` as fresh values arrive from restored subscriptions.
+
+## Write Failure Handling
+
+Writes to physical devices are **synchronous** from the script's perspective:
+
+- If the write fails (connection down, device rejection, timeout), the error is **returned to the calling script**. Script authors can catch and handle write errors (log, notify, retry, etc.).
+- Write failures are also logged to Site Event Logging.
+- There is **no store-and-forward for device writes** — these are real-time control operations. Buffering stale setpoints for later application would be dangerous in an industrial context.
+
+## Tag Path Resolution
+
+When the DCL subscribes to a tag path from the flattened configuration but the path does not exist on the physical device (e.g., typo in the template, device firmware changed, device still booting):
+
+1. The failure is **logged to Site Event Logging**.
+2. The attribute is marked with quality `bad`.
+3. The DCL **periodically retries resolution** at a configurable interval, accommodating devices that come online in stages or load modules after startup.
+4. On successful resolution, the subscription activates normally and quality reflects the live value from the device.
+
+Note: Pre-deployment validation at central does **not** verify that tag paths resolve to real tags on physical devices — that is a runtime concern handled here.
+
+## Health Reporting
+
+The DCL reports the following metrics to the Health Monitoring component via the existing periodic heartbeat:
+
+- **Connection status**: `connected`, `disconnected`, or `reconnecting` per data connection.
+- **Tag resolution counts**: Per connection, the total number of subscribed tags vs. the number successfully resolved. This gives operators visibility into misconfigured templates without needing to open the debug view for individual instances.
+
 ## Dependencies

 - **Site Runtime (Instance Actors)**: Receives subscription registrations and delivers value updates. Receives write requests.
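
The connection lifecycle added in the hunk above (immediate `bad` quality, fixed-interval retry, logged state transitions, transparent re-subscribe) can be sketched roughly as follows. This is a minimal illustration, not the DCL implementation: the `DataConnection` class, its `connect`/`subscribe`/`publish` hooks, the `tick` method, and the choice to send `None` as the value alongside `bad` quality are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Set, Tuple

CONNECTED = "connected"
DISCONNECTED = "disconnected"
RECONNECTING = "reconnecting"


@dataclass
class DataConnection:
    """Hypothetical model of one data connection; not the real DCL API."""

    connect: Callable[[], bool]               # assumed driver hook, True on success
    subscribe: Callable[[str], None]          # assumed driver hook, one tag path
    publish: Callable[[str, Any, str], None]  # (tag, value, quality) to Instance Actors
    retry_interval_s: float = 5.0             # fixed interval, set per data connection
    subscriptions: Set[str] = field(default_factory=set)
    state: str = CONNECTED
    # Stand-in for Site Event Logging: every (old, new) state transition.
    transitions: List[Tuple[str, str]] = field(default_factory=list)
    next_retry_at: float = 0.0

    def _set_state(self, new_state: str) -> None:
        if new_state != self.state:
            self.transitions.append((self.state, new_state))
            self.state = new_state

    def on_connection_lost(self, now: float) -> None:
        self._set_state(DISCONNECTED)
        # Push bad quality for every subscribed tag right away, with no grace period.
        # Sending None as the value is an assumption of this sketch.
        for tag in sorted(self.subscriptions):
            self.publish(tag, None, "bad")
        self.next_retry_at = now + self.retry_interval_s

    def tick(self, now: float) -> None:
        """Call periodically; attempts one reconnect once the interval has elapsed."""
        if self.state == CONNECTED or now < self.next_retry_at:
            return
        self._set_state(RECONNECTING)
        if self.connect():
            self._set_state(CONNECTED)
            # Transparent re-subscribe: Instance Actors are not involved.
            for tag in sorted(self.subscriptions):
                self.subscribe(tag)
        else:
            # Same interval every time: no exponential backoff.
            self.next_retry_at = now + self.retry_interval_s
```

In this sketch the connection stays in `reconnecting` between failed attempts; the diff does not say whether it should fall back to `disconnected` instead, so treat that detail as an open choice.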
--- a/Component-HealthMonitoring.md
+++ b/Component-HealthMonitoring.md
@@ -25,7 +25,8 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
 |--------|--------|-------------|
 | Site online/offline | Communication Layer | Whether the site is reachable (based on heartbeat) |
 | Active/standby node status | Cluster Infrastructure | Which node is active, which is standby |
-| Data connection health | Data Connection Layer | Connected/disconnected per data connection |
+| Data connection health | Data Connection Layer | Connected/disconnected/reconnecting per data connection |
+| Tag resolution counts | Data Connection Layer | Per connection: total subscribed tags vs. successfully resolved tags |
 | Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
 | Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
 | Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |
--- a/Component-SiteRuntime.md
+++ b/Component-SiteRuntime.md
@@ -103,7 +103,7 @@ Deployment Manager Singleton (Cluster Singleton)

 ### GetAttribute / SetAttribute
 - **GetAttribute**: Returns the current in-memory value for the requested attribute.
-- **SetAttribute** (for attributes with data source reference): Sends a write request to the Data Connection Layer. The DCL writes to the physical device. The existing subscription picks up the confirmed value from the device and sends it back as a value update, which then updates the in-memory value. The in-memory value is **not** optimistically updated.
+- **SetAttribute** (for attributes with data source reference): Sends a write request to the Data Connection Layer. The DCL writes to the physical device. If the write fails (connection down, device rejection, timeout), the error is returned synchronously to the calling script for handling. On success, the existing subscription picks up the confirmed value from the device and sends it back as a value update, which then updates the in-memory value. The in-memory value is **not** optimistically updated.
 - **SetAttribute** (for static attributes): Updates the in-memory value directly. This change is ephemeral — it is lost on restart and resets to the deployed configuration value.

 ### Debug View Support
--- a/docs/plans/2026-03-16-data-connection-layer-refinement-design.md
+++ b/docs/plans/2026-03-16-data-connection-layer-refinement-design.md
@@ -0,0 +1,51 @@
+# Data Connection Layer Refinement — Design
+
+**Date**: 2026-03-16
+**Component**: Data Connection Layer (`Component-DataConnectionLayer.md`)
+**Status**: Approved
+
+## Problem
+
+The Data Connection Layer doc covered the happy path (interface, subscriptions, write-back, value format) but lacked specification for error handling, reconnection behavior, write failures, tag path resolution, and health reporting granularity.
+
+## Decisions
+
+### Connection Lifecycle & Reconnection
+- **Auto-reconnect with fixed interval** per data connection, consistent with the system's fixed-interval retry philosophy.
+- **Immediate bad quality** on disconnect — all tags on the affected connection are pushed with quality `bad` as soon as the connection drops.
+- **Transparent re-subscribe** on reconnection — the DCL re-establishes all prior subscriptions automatically. Instance Actors take no action; they see quality return to `good` as values resume.
+- Connection state transitions (`connected` / `disconnected` / `reconnecting`) logged to Site Event Logging.
+
+### Write Failure Handling
+- Writes are **synchronous** from the script's perspective. Failures (connection down, device rejection, timeout) return an error to the calling script.
+- **No store-and-forward for device writes** — buffering stale setpoints is dangerous for industrial control.
+- Write failures also logged to Site Event Logging.
+
+### Tag Path Resolution
+- Unresolvable tag paths are marked with quality `bad` and logged.
+- **Periodic retry** at a configurable interval to accommodate devices that start in stages.
+- On successful resolution, the subscription activates normally.
+
+### Health Reporting
+- Per-connection status: `connected` / `disconnected` / `reconnecting`.
+- Per-connection tag resolution counts: total subscribed tags vs. successfully resolved tags.
+- Both reported via existing Health Monitoring heartbeat.
+
+### Subscription Model
+- **No deduplication** — each Instance Actor gets its own subscription even if multiple actors subscribe to the same tag path. Protocol layers (e.g., OPC UA) handle this efficiently at the expected scale.
+
+## Affected Documents
+
+| Document | Change |
+|----------|--------|
+| `Component-DataConnectionLayer.md` | Added 4 new sections: Connection Lifecycle & Reconnection, Write Failure Handling, Tag Path Resolution, Health Reporting |
+| `Component-HealthMonitoring.md` | Added tag resolution counts to monitored metrics table; added `reconnecting` to the data connection health states |
+| `Component-SiteRuntime.md` | Updated SetAttribute description to note synchronous write failure errors |
+
+## Alternatives Considered
+
+- **Exponential backoff for reconnection**: Rejected — fixed interval is simpler and consistent with the rest of the system.
+- **Grace period before marking quality as bad**: Rejected — in SCADA, immediate staleness indication is safer.
+- **Instance Actor-driven re-subscription**: Rejected — adds complexity to Instance Actors for no benefit.
+- **Fire-and-forget writes**: Rejected — script authors need to know when device writes fail.
+- **Subscription deduplication in DCL**: Rejected — adds reference-counting complexity for minimal gain at expected scale.
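
The Write Failure Handling decision above (synchronous errors to the script, logging, no store-and-forward) might look something like the following from the script's side. This is a hedged sketch only: `DeviceWriteError`, `write_tag`, the `device_write` callback, and the use of `ValueError` to stand in for a device rejection are hypothetical names chosen for illustration, not part of the documented DCL interface.

```python
from typing import Any, Callable, List


class DeviceWriteError(Exception):
    """Hypothetical error type surfaced to the calling script."""

    def __init__(self, tag: str, reason: str) -> None:
        super().__init__(f"write to {tag} failed: {reason}")
        self.tag = tag
        self.reason = reason


def write_tag(
    connected: bool,
    device_write: Callable[[str, Any], None],
    tag: str,
    value: Any,
    event_log: List[str],
) -> None:
    """Write synchronously; on failure, log and raise. Nothing is buffered."""
    try:
        if not connected:
            raise ConnectionError("connection down")
        # Assumed driver call: may raise TimeoutError, or ValueError on rejection.
        device_write(tag, value)
    except (ConnectionError, TimeoutError, ValueError) as exc:
        # Stand-in for Site Event Logging.
        event_log.append(f"write failed: {tag}: {exc}")
        # No retry queue here: a stale setpoint is never applied later.
        raise DeviceWriteError(tag, str(exc)) from exc
```

A script calling `write_tag` would wrap the call in `try`/`except DeviceWriteError` and decide for itself whether to log, notify, or retry with a fresh value.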