4b6ff49822
Expanding a Galaxy object in the tag picker hung on "loading…": the browse reply inlined every child's full attribute set (~152 KB), exceeding Akka's 128 KB remote frame, and remoting silently discarded the oversized reply. Browse path (DataConnectionLayer): - RealMxGatewayClient: navigation now uses BrowseChildren(include_attributes= false) — child objects only — and an object's own attributes load lazily via DiscoverHierarchy(root, max_depth=0) when it's expanded. Payload drops from ~152 KB/level to a few KB. Seam contract unchanged. - DataConnectionActor.CapBrowseChildren: protocol-agnostic byte-budget cap (~100 KB) on every BrowseNodeResult before it crosses the site→central frame, OR-ing the adapter's own Truncated flag. Byte budget, not a count — the only bound that holds regardless of NodeId/attribute-name length. - RealOpcUaClient: requestedMaxReferencesPerNode 1000 → 500 to narrow the window before the byte budget applies. - Graceful gRPC Unimplemented handling → NotSupportedException → BrowseFailureKind.NotBrowsable with an actionable message (older gateway builds lacking BrowseChildren). Picker UI (CentralUI): - NodeBrowserDialog: modal-lg → modal-xl; new scoped .razor.css caps the tree at 55vh with its own scrollbar so manual entry + Select/Cancel stay visible. - Protocol-agnostic failure messages (was hardcoded "OPC UA …"); renamed the leftover opcua-browser-tree class to node-browser-tree. Tests: new frame-budget cap test + NotSupported=>NotBrowsable mapping test; DCL suite 88/88. Doc: Component-DataConnectionLayer.md records the lazy attribute-light browse and the frame-size guard.
256 lines
19 KiB
Markdown
256 lines
19 KiB
Markdown
# Component: Data Connection Layer
|
||
|
||
## Purpose
|
||
|
||
The Data Connection Layer provides a uniform interface for reading from and writing to physical machines at site clusters. It abstracts protocol-specific details behind a common interface, manages subscriptions, and delivers live tag value updates to Instance Actors. It is a **clean data pipe** — it performs no evaluation of triggers, alarm conditions, or business logic.
|
||
|
||
## Location
|
||
|
||
Site clusters only. Central does not interact with machines directly.
|
||
|
||
## Responsibilities
|
||
|
||
- Manage data connections defined centrally and deployed to sites as part of artifact deployment (OPC UA servers). Data connection definitions are stored in local SQLite after deployment.
|
||
- Establish and maintain connections to data sources based on deployed instance configurations.
|
||
- Subscribe to tag paths as requested by Instance Actors (based on attribute data source references in the flattened configuration).
|
||
- Deliver tag value updates to the requesting Instance Actors.
|
||
- Support writing values to machines (when Instance Actors forward `SetAttribute` write requests for data-connected attributes).
|
||
- Report data connection health status to the Health Monitoring component.
|
||
|
||
## Common Interface
|
||
|
||
All protocol adapters implement the same interface:
|
||
|
||
```
|
||
IDataConnection : IAsyncDisposable
|
||
├── Connect(connectionDetails) → void
|
||
├── Disconnect() → void
|
||
├── Subscribe(tagPath, callback) → subscriptionId
|
||
├── Unsubscribe(subscriptionId) → void
|
||
├── Read(tagPath) → value
|
||
├── ReadBatch(tagPaths) → values
|
||
├── Write(tagPath, value) → void
|
||
├── WriteBatch(values) → void
|
||
├── WriteBatchAndWait(values, flagPath, flagValue, responsePath, responseValue, timeout) → bool
|
||
├── Status → ConnectionHealth
|
||
└── Disconnected → event Action?
|
||
```
|
||
|
||
The `Disconnected` event is raised by an adapter when it detects an unexpected connection loss (server offline, network failure, keep-alive timeout). The `DataConnectionActor` subscribes to this event to trigger the reconnection state machine. Additional protocols can be added by implementing this interface.
|
||
|
||
### Common Value Type
|
||
|
||
All protocols produce the same value tuple consumed by Instance Actors. Before the first value update arrives from the DCL, data-sourced attributes are held at **uncertain** quality by the Instance Actor (see Site Runtime — Initialization):
|
||
|
||
| Concept | ScadaBridge Design |
|
||
|---|---|
|
||
| Value container | `TagValue(Value, Quality, Timestamp)` |
|
||
| Quality | `QualityCode` enum: Good / Bad / Uncertain |
|
||
| Timestamp | `DateTimeOffset` (UTC) |
|
||
| Value type | `object?` |
|
||
|
||
## Supported Protocols
|
||
|
||
### OPC UA
|
||
- Uses the **OPC Foundation .NET Standard Library** (`OPCFoundation.NetStandard.Opc.Ua.Client`).
|
||
- Session-based connection with endpoint discovery, certificate handling, and configurable security modes.
|
||
- Subscriptions via OPC UA Monitored Items with data change notifications (1000ms sampling, queue size 10, discard-oldest).
|
||
- Read/Write via OPC UA Read/Write services with StatusCode-based quality mapping.
|
||
- Disconnect detection via `Session.KeepAlive` event (see Disconnect Detection Pattern below).
|
||
|
||
### MxGateway
|
||
- Connects to the **MxAccess Gateway** (AVEVA/Wonderware MXAccess-backed Galaxy) over gRPC using the `ZB.MOM.WW.MxGateway.Client` NuGet package (from the Gitea feed); `ZB.MOM.WW.MxGateway.Contracts` is pulled in transitively.
|
||
- Session-based: `OpenSession` + `Register` on connect; `AddItem` + `Advise` per subscription; value changes arrive on the gateway's server-streaming event feed (`StreamEvents`), resumable via `worker_sequence`.
|
||
- Read/Write via `ReadBulk` / `WriteBulk`; writes carry a configurable `WriteUserId`. Quality maps the OPC-style quality byte (≥192 Good, ≥64 Uncertain, else Bad), with a failing MXAccess status proxy treated as Bad.
|
||
- Galaxy hierarchy browse via the separate `GalaxyRepositoryClient` — objects are navigable nodes (keyed by Galaxy gobject id), attributes are selectable leaves (keyed by full tag reference). Browse is **lazy and attribute-light**: navigation uses `BrowseChildren` with `include_attributes=false` (child objects only), and an object's own attributes are fetched only when it is expanded, via `DiscoverHierarchy(root=<object>, max_depth=0)` scoped to that single object. This keeps each browse level's reply small; inlining every child's full attribute set could exceed the Akka remote frame and silently drop the reply.
|
||
- Disconnect detection: a fault on the event stream raises `IDataConnection.Disconnected`, driving the same reconnection state machine as OPC UA.
|
||
- Implemented as `MxGatewayDataConnection` over an `IMxGatewayClient` seam; the seam is decoupled from the generated gRPC types (only `RealMxGatewayClient` references them), so the adapter is fully unit-testable with a fake.
|
||
|
||
## Endpoint Redundancy
|
||
|
||
Data connections support an optional backup endpoint for automatic failover when the active endpoint becomes unreachable. Both endpoints use the same protocol.
|
||
|
||
**Entity fields:**
|
||
|
||
| Field | Type | Notes |
|
||
|-------|------|-------|
|
||
| `PrimaryConfiguration` | string? (max 4000) | Required. Renamed from `Configuration` |
|
||
| `BackupConfiguration` | string? (max 4000) | Optional. Null = no backup |
|
||
| `FailoverRetryCount` | int (default 3) | Retries on active endpoint before switching |
|
||
|
||
**Failover state machine:**
|
||
|
||
```
|
||
Connected → disconnect → push bad quality → retry active endpoint (5s)
|
||
→ N failures (≥ FailoverRetryCount) → switch to other endpoint
|
||
→ dispose adapter, create fresh adapter with other config
|
||
→ reconnect → ReSubscribeAll → Connected
|
||
```
|
||
|
||
- **Round-robin**: primary → backup → primary → backup. No preferred endpoint after first failover — the connection stays on whichever endpoint is working.
|
||
- **No auto-failback**: The connection remains on the active endpoint until it fails.
|
||
- **Single-endpoint connections** (no backup): Retry indefinitely on the same endpoint, preserving existing behavior.
|
||
- **Adapter lifecycle on failover**: The actor disposes the current `IDataConnection` adapter and creates a fresh one via `DataConnectionFactory.Create()` with the other endpoint's configuration. Clean slate — no stale state.
|
||
|
||
**Health reporting:**
|
||
|
||
- `DataConnectionHealthReport` includes `ActiveEndpoint`: `"Primary"`, `"Backup"`, or `"Primary (no backup)"`.
|
||
|
||
**Site event log entries:**
|
||
|
||
- `DataConnectionFailover` (Warning) — connection name, from-endpoint, to-endpoint, failure count.
|
||
- `DataConnectionRestored` (Info) — connection name, active endpoint.
|
||
|
||
See [`2026-03-22-primary-backup-data-connections-design.md`](../plans/2026-03-22-primary-backup-data-connections-design.md) for the full design.
|
||
|
||
## Connection Configuration Reference
|
||
|
||
All settings are parsed from the data connection's configuration JSON dictionaries (`PrimaryConfiguration` and optional `BackupConfiguration`, stored as `IDictionary<string, string>` connection details). Both endpoints use the same protocol-specific keys. Invalid numeric values fall back to defaults silently.
|
||
|
||
### OPC UA Settings
|
||
|
||
| Key | Type | Default | Description |
|
||
|-----|------|---------|-------------|
|
||
| `endpoint` / `EndpointUrl` | string | `opc.tcp://localhost:4840` | OPC UA server endpoint URL |
|
||
| `SessionTimeoutMs` | int | `60000` | OPC UA session timeout in milliseconds |
|
||
| `OperationTimeoutMs` | int | `15000` | Transport operation timeout in milliseconds |
|
||
| `PublishingIntervalMs` | int | `1000` | Subscription publishing interval in milliseconds |
|
||
| `KeepAliveCount` | int | `10` | Keep-alive frames before session timeout |
|
||
| `LifetimeCount` | int | `30` | Subscription lifetime in publish intervals |
|
||
| `MaxNotificationsPerPublish` | int | `100` | Max notifications batched per publish cycle |
|
||
| `SamplingIntervalMs` | int | `1000` | Per-item server sampling rate in milliseconds |
|
||
| `QueueSize` | int | `10` | Per-item notification buffer size |
|
||
| `SecurityMode` | string | `None` | Preferred endpoint security: `None`, `Sign`, or `SignAndEncrypt` |
|
||
| `AutoAcceptUntrustedCerts` | bool | `true` | Accept untrusted server certificates |
|
||
|
||
### MxGateway Settings
|
||
|
||
| Key | Type | Default | Description |
|
||
|-----|------|---------|-------------|
|
||
| `Endpoint` | string | `http://localhost:5000` | Gateway base URL |
|
||
| `ApiKey` | string | — | Sent to the gateway as `authorization: Bearer <key>` |
|
||
| `ClientName` | string | `scadabridge` (when blank) | MXAccess client registration name |
|
||
| `WriteUserId` | int | `0` | MXAccess user id applied to every write-back (0 = no user context) |
|
||
| `UseTls` | bool | `false` | Use TLS to a secured gateway |
|
||
| `CaFile` | string | — | Path to the CA certificate (TLS only) |
|
||
| `ServerName` | string | — | TLS server-name override |
|
||
| `ReadTimeoutMs` | int | `5000` | `ReadBulk` per-call timeout in milliseconds |
|
||
|
||
Secret handling for `ApiKey` follows the same at-rest treatment and log/telemetry redaction as the OPC UA `UserIdentity` username/password fields.
|
||
|
||
### Shared Settings (appsettings.json)
|
||
|
||
These are configured via `DataConnectionOptions` in `appsettings.json`, not per-connection:
|
||
|
||
| Setting | Default | Description |
|
||
|---------|---------|-------------|
|
||
| `ReconnectInterval` | 5s | Fixed interval between reconnection attempts |
|
||
| `TagResolutionRetryInterval` | 10s | Retry interval for unresolved tag paths |
|
||
| `WriteTimeout` | 30s | Timeout for write operations |
|
||
|
||
## Subscription Management
|
||
|
||
- When an Instance Actor is created (as part of the Site Runtime actor hierarchy), it registers its data source references with the Data Connection Layer.
|
||
- The DCL subscribes to the tag paths using the concrete connection details from the flattened configuration.
|
||
- Tag value updates are delivered directly to the requesting Instance Actor.
|
||
- When an Instance Actor is stopped (due to disable, delete, or redeployment), the DCL cleans up the associated subscriptions.
|
||
- When a new Instance Actor is created for a redeployment, subscriptions are established fresh based on the new configuration.
|
||
|
||
## Write-Back Support
|
||
|
||
- When a script calls `Instance.SetAttribute` for an attribute with a data source reference, the Instance Actor sends a write request to the DCL.
|
||
- The DCL writes the value to the physical device via the appropriate protocol.
|
||
- The existing subscription picks up the confirmed new value from the device and delivers it back to the Instance Actor as a standard value update.
|
||
- The Instance Actor's in-memory value is **not** updated until the device confirms the write.
|
||
|
||
## Browsing the address space
|
||
|
||
DCL is a clean data pipe on the hot path. Browse is an **opt-in capability** for protocols that support it, exposed via `IBrowsableDataConnection`. Only consumed by management/UI (the tag picker on the instance configure page); Instance Actors never call it. The browse path is **protocol-agnostic**: the same command/service/dialog serve every browsable protocol.
|
||
|
||
- `OpcUaDataConnection` and `MxGatewayDataConnection` both implement `IBrowsableDataConnection`; other/custom protocols do not (and return a `NotBrowsable` failure).
|
||
- `DataConnectionManagerActor` handles `BrowseNodeCommand` (fields: `ConnectionName`, `ParentNodeId`) and replies with `BrowseNodeResult` (children + `Truncated` + structured `BrowseFailure?`). The Central UI facade is `IBrowseService`/`BrowseService`, backing the `NodeBrowserDialog` tag picker.
|
||
- Node ids are opaque protocol-specific strings: OPC UA uses NodeIds; MxGateway uses Galaxy gobject ids for navigable objects and full tag references for selectable attribute leaves.
|
||
- Browse runs against the live session; no caching at DCL.
|
||
- **Frame-size guard**: the reply crosses the site→central Akka frame (default 128 KB) on a temp Ask actor; an oversized reply is silently discarded by remoting, hanging the picker. The child handler caps each `BrowseNodeResult` to a byte budget (~100 KB) before replying, OR-ing the adapter's own truncation signal into `Truncated`. This is protocol-agnostic (every adapter's reply funnels through it). Per-protocol upstream caps narrow the window first: OPC UA requests at most 500 references per node (continuation point → `Truncated`); MxGateway relies on the gateway's `BrowseChildren` page cap. A `Truncated` level prompts manual node-id entry in the picker rather than auto-paging.
|
||
|
||
## Value Update Message Format
|
||
|
||
Each value update delivered to an Instance Actor includes:
|
||
- **Tag path**: The relative path of the attribute's data source reference.
|
||
- **Value**: The new value from the device.
|
||
- **Quality**: Data quality indicator (good, bad, uncertain).
|
||
- **Timestamp**: When the value was read from the device.
|
||
|
||
## Connection Actor Model
|
||
|
||
Each data connection is managed by a dedicated connection actor that uses the Akka.NET **Become/Stash** pattern to model its lifecycle as a state machine:
|
||
|
||
- **Connecting**: The actor attempts to establish the connection. Subscription requests and write commands received during this phase are **stashed** (buffered in the actor's stash).
|
||
- **Connected**: The actor is actively servicing subscriptions. On entering this state, all stashed messages are unstashed and processed.
|
||
- **Reconnecting**: The connection was lost. The actor transitions back to a connecting-like state, stashing new requests while it retries.
|
||
|
||
This pattern ensures no messages are lost during connection transitions and is the standard Akka.NET approach for actors with I/O lifecycle dependencies.
|
||
|
||
**OPC UA-specific notes**: The `RealOpcUaClient` uses the OPC Foundation SDK's `Session.KeepAlive` event for proactive disconnect detection. The SDK sends keep-alive requests at the subscription's `KeepAliveCount × PublishingInterval` (default: 10s). When keep-alive fails, the `ConnectionLost` event fires, triggering the same reconnection flow. On reconnection, the DCL re-creates the OPC UA session and subscription, then re-adds all monitored items.
|
||
|
||
## Connection Lifecycle & Reconnection
|
||
|
||
The DCL manages connection lifecycle automatically:
|
||
|
||
1. **Connection drop detection**: When a connection to a data source is lost, the DCL immediately pushes a value update with quality `bad` for **every tag subscribed on that connection**. Instance Actors and their downstream consumers (alarms, scripts checking quality) see the staleness immediately.
|
||
2. **Auto-reconnect with fixed interval**: The DCL retries the connection at a configurable fixed interval (e.g., every 5 seconds). The retry interval is defined **per data connection**. This is consistent with the fixed-interval retry philosophy used throughout the system. Individual gRPC/OPC UA operations (reads, writes) fail immediately to the caller on error; there is no operation-level retry within the adapter.
|
||
3. **Connection state transitions**: The DCL tracks each connection's state as `connected`, `disconnected`, or `reconnecting`. All transitions are logged to Site Event Logging.
|
||
4. **Transparent re-subscribe**: On successful reconnection, the DCL automatically re-establishes all previously active subscriptions for that connection. Instance Actors require no action — they simply see quality return to `good` as fresh values arrive from restored subscriptions.
|
||
|
||
### Disconnect Detection Pattern
|
||
|
||
Each adapter implements the `IDataConnection.Disconnected` event to proactively signal connection loss to the `DataConnectionActor`. Detection uses two complementary paths:
|
||
|
||
**Proactive detection** (server goes offline between operations):
|
||
- **OPC UA**: The OPC Foundation SDK fires `Session.KeepAlive` events at regular intervals. `RealOpcUaClient` hooks this event; when `ServiceResult.IsBad(e.Status)` (server unreachable, keep-alive timeout), it fires `ConnectionLost`. The `OpcUaDataConnection` adapter translates this into `IDataConnection.Disconnected`.
|
||
|
||
**Reactive detection** (failure discovered during an operation):
|
||
- Both adapters wrap `ReadAsync` (and by extension `ReadBatchAsync`) with exception handling. If a read throws a non-cancellation exception, the adapter calls `RaiseDisconnected()` and re-throws. The `DataConnectionActor`'s existing error handling catches the exception while the disconnect event triggers the reconnection state machine.
|
||
|
||
**Event marshalling**: The `DataConnectionActor` subscribes to `_adapter.Disconnected` in `PreStart()`. Since `Disconnected` may fire from a background thread (gRPC stream task, OPC UA keep-alive timer), the handler sends an `AdapterDisconnected` message to `Self`, marshalling the notification onto the actor's message loop. This triggers `BecomeReconnecting()` → bad quality push → retry timer.
|
||
|
||
**Once-only guard**: `OpcUaDataConnection` uses a `volatile bool _disconnectFired` flag to ensure `RaiseDisconnected()` fires exactly once per connection session. The flag resets on successful reconnection (`ConnectAsync`).
|
||
|
||
## Write Failure Handling
|
||
|
||
Writes to physical devices are **synchronous** from the script's perspective:
|
||
|
||
- If the write fails (connection down, device rejection, timeout), the error is **returned to the calling script**. Script authors can catch and handle write errors (log, notify, retry, etc.).
|
||
- Write failures are also logged to Site Event Logging.
|
||
- There is **no store-and-forward for device writes** — these are real-time control operations. Buffering stale setpoints for later application would be dangerous in an industrial context.
|
||
|
||
## Tag Path Resolution
|
||
|
||
When the DCL subscribes to a tag path from the flattened configuration but the path does not exist on the physical device (e.g., typo in the template, device firmware changed, device still booting):
|
||
|
||
1. The failure is **logged to Site Event Logging**.
|
||
2. The attribute is marked with quality `bad`.
|
||
3. The DCL **periodically retries resolution** at a configurable interval, accommodating devices that come online in stages or load modules after startup.
|
||
4. On successful resolution, the subscription activates normally and quality reflects the live value from the device.
|
||
|
||
Note: Pre-deployment validation at central does **not** verify that tag paths resolve to real tags on physical devices — that is a runtime concern handled here.
|
||
|
||
## Health Reporting
|
||
|
||
The DCL reports the following metrics to the Health Monitoring component via the existing periodic heartbeat:
|
||
|
||
- **Connection status**: `connected`, `disconnected`, or `reconnecting` per data connection.
|
||
- **Tag resolution counts**: Per connection, the number of total subscribed tags vs. successfully resolved tags. This gives operators visibility into misconfigured templates without needing to open the debug view for individual instances.
|
||
|
||
## Dependencies
|
||
|
||
- **Site Runtime (Instance Actors)**: Receives subscription registrations and delivers value updates. Receives write requests.
|
||
- **Health Monitoring**: Reports connection status.
|
||
- **Site Event Logging**: Logs connection status changes.
|
||
|
||
## Interactions
|
||
|
||
- **Site Runtime (Instance Actors)**: Bidirectional — delivers value updates, receives subscription registrations and write-back commands.
|
||
- **Health Monitoring**: Reports connection health periodically.
|
||
- **Site Event Logging**: Logs connection/disconnection events.
|