Refine Data Connection Layer: error handling, reconnection, write failures, health reporting
Add connection lifecycle (fixed-interval auto-reconnect, immediate bad quality on disconnect, transparent re-subscribe), synchronous write failure errors to scripts, periodic tag path resolution retry, and enhanced health reporting with tag resolution counts. Update cross-references in Health Monitoring and Site Runtime.
@@ -0,0 +1,51 @@
# Data Connection Layer Refinement — Design
**Date**: 2026-03-16
**Component**: Data Connection Layer (`Component-DataConnectionLayer.md`)
**Status**: Approved
## Problem

The Data Connection Layer doc covered the happy path (interface, subscriptions, write-back, value format) but did not specify error handling, reconnection behavior, write failure handling, tag path resolution, or the granularity of health reporting.
## Decisions
### Connection Lifecycle & Reconnection

- **Auto-reconnect with fixed interval** per data connection, consistent with the system's fixed-interval retry philosophy.
- **Immediate bad quality** on disconnect: all tags on the affected connection are pushed with quality `bad` as soon as the connection drops.
- **Transparent re-subscribe** on reconnection: the DCL re-establishes all prior subscriptions automatically. Instance Actors take no action; they see quality return to `good` as values resume.
- Connection state transitions (`connected` / `disconnected` / `reconnecting`) are logged to Site Event Logging.
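
The lifecycle rules in this section can be sketched as a small state machine. This is a minimal illustration, not the DCL's actual implementation: the `DataConnection` class, its callbacks (`try_connect`, `subscribe`, `publish`, `log_event`), and the caller-driven `tick` method are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class ConnState(Enum):
    CONNECTED = "connected"
    DISCONNECTED = "disconnected"
    RECONNECTING = "reconnecting"


@dataclass
class DataConnection:
    """Hypothetical model of one data connection's lifecycle."""
    name: str
    reconnect_interval_s: float                  # fixed interval, no backoff
    try_connect: Callable[[], bool]              # stand-in for the protocol adapter
    subscribe: Callable[[str], None]             # stand-in for a protocol subscribe call
    publish: Callable[[str, Optional[object], str], None]  # (tag, value, quality)
    log_event: Callable[[str], None]             # stand-in for Site Event Logging
    subscriptions: List[str] = field(default_factory=list)
    state: ConnState = ConnState.CONNECTED
    next_attempt_at: float = 0.0

    def _transition(self, new_state: ConnState) -> None:
        # Log every actual state change exactly once.
        if new_state is not self.state:
            self.log_event(f"{self.name}: {self.state.value} -> {new_state.value}")
            self.state = new_state

    def on_connection_lost(self, now: float) -> None:
        # Push bad quality for every tag immediately; there is no grace period.
        for tag_path in self.subscriptions:
            self.publish(tag_path, None, "bad")
        self._transition(ConnState.DISCONNECTED)
        self.next_attempt_at = now + self.reconnect_interval_s

    def tick(self, now: float) -> None:
        """Attempt a reconnect if one is due; called periodically by the owner."""
        if self.state is ConnState.CONNECTED or now < self.next_attempt_at:
            return
        self._transition(ConnState.RECONNECTING)
        if self.try_connect():
            self._transition(ConnState.CONNECTED)
            # Re-establish every prior subscription; subscribers do nothing.
            for tag_path in self.subscriptions:
                self.subscribe(tag_path)
        else:
            # Same interval after every failed attempt.
            self.next_attempt_at = now + self.reconnect_interval_s
```

Passing `now` in explicitly keeps the sketch deterministic and easy to test; a real implementation would more likely use a scheduler or timer message.
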
### Write Failure Handling

- Writes are **synchronous** from the script's perspective. Failures (connection down, device rejection, timeout) return an error to the calling script.
- **No store-and-forward for device writes**: buffering stale setpoints is dangerous for industrial control.
- Write failures are also logged to Site Event Logging.
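
As a rough illustration of the write path described in this section, the sketch below surfaces all three failure modes to the caller as an exception and never queues a write for later. `DeviceWriter`, `WriteError`, and the `is_connected` / `send` / `log_event` callbacks are assumptions made for the example; the design does not prescribe these names or an exception-based error channel.

```python
from typing import Callable


class WriteError(Exception):
    """Error surfaced to the calling script when a device write fails."""


class DeviceWriter:
    """Hypothetical synchronous write path with no store-and-forward."""

    def __init__(
        self,
        is_connected: Callable[[], bool],
        send: Callable[[str, object, float], bool],  # True if the device accepted
        log_event: Callable[[str], None],            # stand-in for Site Event Logging
    ) -> None:
        self._is_connected = is_connected
        self._send = send
        self._log_event = log_event

    def write(self, tag_path: str, value: object, timeout_s: float = 2.0) -> None:
        # Connection down: fail now rather than buffer a setpoint that may be stale.
        if not self._is_connected():
            raise self._failure(tag_path, "connection down")
        try:
            accepted = self._send(tag_path, value, timeout_s)
        except TimeoutError:
            raise self._failure(tag_path, "timeout") from None
        if not accepted:
            raise self._failure(tag_path, "device rejected the write")

    def _failure(self, tag_path: str, reason: str) -> WriteError:
        # Every failure is logged and also returned to the caller.
        message = f"write to {tag_path} failed: {reason}"
        self._log_event(message)
        return WriteError(message)
```

A script calling `write` either gets a normal return, meaning the device accepted the value, or a `WriteError` it can handle in place.
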
### Tag Path Resolution

- Unresolvable tag paths are marked with quality `bad` and logged.
- **Periodic retry** at a configurable interval to accommodate devices that start in stages.
- On successful resolution, the subscription activates normally.
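
The retry behavior above might look like the following sketch. `TagResolver`, its callbacks, and the choice to log only the first failed attempt are assumptions for illustration; the design states only that unresolved paths are marked `bad`, logged, and retried at a configurable interval.

```python
from typing import Callable, Dict, Optional, Set, Tuple


class TagResolver:
    """Hypothetical tracker for resolved and still-pending tag paths."""

    def __init__(
        self,
        resolve: Callable[[str], bool],   # True once the device knows the path
        publish: Callable[[str, Optional[object], str], None],  # (tag, value, quality)
        log_event: Callable[[str], None],
        retry_interval_s: float,          # configurable retry interval
    ) -> None:
        self._resolve = resolve
        self._publish = publish
        self._log_event = log_event
        self._retry_interval_s = retry_interval_s
        self._pending: Dict[str, float] = {}  # tag path -> time of next retry
        self._resolved: Set[str] = set()

    def subscribe(self, tag_path: str, now: float) -> None:
        self._attempt(tag_path, now, first_attempt=True)

    def tick(self, now: float) -> None:
        """Retry every pending path whose retry time has arrived."""
        for tag_path, due_at in list(self._pending.items()):
            if now >= due_at:
                self._attempt(tag_path, now, first_attempt=False)

    def _attempt(self, tag_path: str, now: float, first_attempt: bool) -> None:
        if self._resolve(tag_path):
            # Resolved: the subscription is now active like any other.
            self._pending.pop(tag_path, None)
            self._resolved.add(tag_path)
            return
        if first_attempt:
            # Mark bad and log once (an assumption), then keep retrying quietly.
            self._publish(tag_path, None, "bad")
            self._log_event(f"unresolved tag path: {tag_path}")
        self._pending[tag_path] = now + self._retry_interval_s

    def counts(self) -> Tuple[int, int]:
        """Return (total subscribed tags, successfully resolved tags)."""
        return len(self._resolved) + len(self._pending), len(self._resolved)
```

The `counts` helper returns the pair of numbers that the health reporting decision calls for.
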
### Health Reporting

- Per-connection status: `connected` / `disconnected` / `reconnecting`.
- Per-connection tag resolution counts: total subscribed tags vs. successfully resolved tags.
- Both are reported via the existing Health Monitoring heartbeat.
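
One possible shape for the per-connection data carried in the heartbeat is sketched below. The field names (`status`, `subscribed_tags`, `resolved_tags`) and the `build_dcl_heartbeat` function are hypothetical; the actual heartbeat schema is defined by Health Monitoring and is not shown in this document.

```python
from dataclasses import dataclass
from typing import Dict, List

VALID_STATUSES = {"connected", "disconnected", "reconnecting"}


@dataclass(frozen=True)
class ConnectionHealth:
    """Hypothetical per-connection health snapshot."""
    name: str
    status: str           # one of VALID_STATUSES
    subscribed_tags: int  # total subscribed tags on this connection
    resolved_tags: int    # tags whose paths resolved successfully


def build_dcl_heartbeat(connections: List[ConnectionHealth]) -> Dict[str, dict]:
    """Build the DCL portion of a heartbeat, keyed by connection name."""
    section: Dict[str, dict] = {}
    for conn in connections:
        if conn.status not in VALID_STATUSES:
            raise ValueError(f"unknown status for {conn.name}: {conn.status}")
        if not 0 <= conn.resolved_tags <= conn.subscribed_tags:
            raise ValueError(f"inconsistent tag counts for {conn.name}")
        section[conn.name] = {
            "status": conn.status,
            "subscribed_tags": conn.subscribed_tags,
            "resolved_tags": conn.resolved_tags,
        }
    return section
```

Reporting both counts, rather than only a percentage, lets an operator see at a glance how many tags are still unresolved on each connection.
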
### Subscription Model

- **No deduplication**: each Instance Actor gets its own subscription even if multiple actors subscribe to the same tag path. Protocol layers (e.g., OPC UA) handle this efficiently at the expected scale.
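
To make the no-deduplication decision concrete, the sketch below issues one protocol-level subscription per call and keeps no reference counts. `SubscriptionTable` and `protocol_subscribe` are hypothetical names used only for this example.

```python
from itertools import count
from typing import Callable, Dict, Tuple


class SubscriptionTable:
    """Hypothetical table in which every subscribe call is independent."""

    def __init__(self, protocol_subscribe: Callable[[str], None]) -> None:
        self._protocol_subscribe = protocol_subscribe  # stand-in for the protocol layer
        self._ids = count(1)
        self.by_id: Dict[int, Tuple[str, str]] = {}    # id -> (actor id, tag path)

    def subscribe(self, actor_id: str, tag_path: str) -> int:
        # No lookup for an existing subscription to the same tag path:
        # each actor gets its own, so no reference counting is needed.
        sub_id = next(self._ids)
        self._protocol_subscribe(tag_path)
        self.by_id[sub_id] = (actor_id, tag_path)
        return sub_id

    def unsubscribe(self, sub_id: int) -> None:
        # Removing one actor's subscription never affects another actor's.
        self.by_id.pop(sub_id, None)
```
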
## Affected Documents

| Document | Change |
|----------|--------|
| `Component-DataConnectionLayer.md` | Added 4 new sections: Connection Lifecycle & Reconnection, Write Failure Handling, Tag Path Resolution, Health Reporting |
| `Component-HealthMonitoring.md` | Added tag resolution counts to monitored metrics table |
| `Component-SiteRuntime.md` | Updated SetAttribute description to note synchronous write failure errors |
## Alternatives Considered

- **Exponential backoff for reconnection**: Rejected. A fixed interval is simpler and consistent with the rest of the system.
- **Grace period before marking quality as bad**: Rejected. In SCADA, immediate staleness indication is safer.
- **Instance Actor-driven re-subscription**: Rejected. It adds complexity to Instance Actors for no benefit.
- **Fire-and-forget writes**: Rejected. Script authors need to know when device writes fail.
- **Subscription deduplication in DCL**: Rejected. It adds reference-counting complexity for minimal gain at expected scale.