Instrument the historian plugin with runtime query health counters and read-only cluster failover so operators can detect silent query degradation and keep serving history when a single cluster node goes down

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:08:32 -04:00
parent 4fe37fd1b7
commit 8f340553d9
20 changed files with 1526 additions and 32 deletions
@@ -103,7 +103,9 @@ Controls the Wonderware Historian SDK connection for OPC UA historical data acce
 | Property | Type | Default | Description |
 |----------|------|---------|-------------|
 | `Enabled` | `bool` | `false` | Enables OPC UA historical data access |
-| `ServerName` | `string` | `"localhost"` | Historian server hostname |
+| `ServerName` | `string` | `"localhost"` | Single Historian server hostname used when `ServerNames` is empty. Preserved for backward compatibility with pre-cluster deployments |
+| `ServerNames` | `List<string>` | `[]` | Ordered list of Historian cluster nodes. When non-empty, supersedes `ServerName` and enables read-only cluster failover. See [Historical Data Access](HistoricalDataAccess.md#read-only-cluster-failover) |
+| `FailureCooldownSeconds` | `int` | `60` | How long a failed cluster node is skipped before being re-tried. Zero disables the cooldown |
 | `IntegratedSecurity` | `bool` | `true` | Use Windows authentication |
 | `UserName` | `string?` | `null` | Username when `IntegratedSecurity` is false |
 | `Password` | `string?` | `null` | Password when `IntegratedSecurity` is false |
@@ -250,6 +252,9 @@ Three boolean properties act as feature flags that control optional subsystems:
 - `AutoAcceptClientCertificates = true` emits a warning
 - Only-`None` profile configuration emits a warning
 - `OpcUa.AlarmFilter.ObjectFilters` is non-empty while `OpcUa.AlarmTrackingEnabled = false` emits a warning (filter has no effect)
+- `Historian.ServerName` (or `Historian.ServerNames`) must not be empty when `Historian.Enabled = true`
+- `Historian.FailureCooldownSeconds` must be zero or positive
+- Setting `Historian.ServerName` alongside a non-empty `Historian.ServerNames` emits a warning (the single `ServerName` is ignored)
 - `OpcUa.ApplicationUri` must be set when `Redundancy.Enabled = true`
 - `Redundancy.ServiceLevelBase` must be between 1 and 255
 - `Redundancy.ServerUris` should contain at least 2 entries when enabled
@@ -316,6 +321,8 @@ Integration tests use this constructor to inject substitute implementations of `
  "Historian": {
    "Enabled": false,
    "ServerName": "localhost",
+    "ServerNames": [],
+    "FailureCooldownSeconds": 60,
    "IntegratedSecurity": true,
    "UserName": null,
    "Password": null,
@@ -46,6 +46,8 @@ public class HistorianConfiguration
 {
    public bool Enabled { get; set; } = false;
    public string ServerName { get; set; } = "localhost";
+    public List<string> ServerNames { get; set; } = new();
+    public int FailureCooldownSeconds { get; set; } = 60;
    public bool IntegratedSecurity { get; set; } = true;
    public string? UserName { get; set; }
    public string? Password { get; set; }
@@ -61,7 +63,9 @@ When `Enabled` is `false`, `HistorianPluginLoader.TryLoad` is not called, no plu

 | Property | Default | Description |
 |---|---|---|
-| `ServerName` | `localhost` | Historian server hostname |
+| `ServerName` | `localhost` | Single Historian server hostname used when `ServerNames` is empty. Preserved for backward compatibility with pre-cluster deployments |
+| `ServerNames` | `[]` | Ordered list of Historian cluster nodes. When non-empty, supersedes `ServerName` and enables read-only cluster failover (see [Cluster Failover](#read-only-cluster-failover)) |
+| `FailureCooldownSeconds` | `60` | How long a failed cluster node is skipped before being re-tried. Zero means no cooldown (retry on every request) |
 | `IntegratedSecurity` | `true` | Use Windows authentication |
 | `UserName` | `null` | Username when `IntegratedSecurity` is false |
 | `Password` | `null` | Password when `IntegratedSecurity` is false |
@@ -73,12 +77,74 @@ When `Enabled` is `false`, `HistorianPluginLoader.TryLoad` is not called, no plu

 `HistorianDataSource` (in the plugin assembly) maintains a persistent connection to the Historian server via `ArchestrA.HistorianAccess`:

-1. **Lazy connect** -- The connection is established on the first query via `EnsureConnected()`.
-2. **Connection reuse** -- Subsequent queries reuse the same connection.
-3. **Auto-reconnect** -- On connection failure, the connection is disposed and re-established on the next query.
+1. **Lazy connect** -- The connection is established on the first query via `EnsureConnected()`. When a cluster is configured, the data source iterates `HistorianClusterEndpointPicker.GetHealthyNodes()` in order and returns the first node that successfully connects.
+2. **Connection reuse** -- Subsequent queries reuse the same connection. The active node is tracked in `_activeProcessNode` / `_activeEventNode` and surfaced on the dashboard.
+3. **Auto-reconnect** -- On connection failure, the connection is disposed, the active node is marked failed in the picker, and the next query re-enters the picker loop to try the next eligible candidate.
 4. **Clean shutdown** -- `Dispose()` closes the connection when the service stops.

-The connection is opened with `ReadOnly = true` and `ConnectionType = Process`.
+The connection is opened with `ReadOnly = true` and `ConnectionType = Process`. The event (alarm history) path uses a separate connection with `ConnectionType = Event`, but both silos share the same cluster picker so a node that fails on one silo is immediately skipped on the other.
+
+## Read-Only Cluster Failover
+
+When `HistorianConfiguration.ServerNames` is non-empty, the plugin picks from an ordered list of cluster nodes instead of a single `ServerName`. Each connection attempt tries candidates in configuration order until one succeeds. Failed nodes are placed into a timed cooldown and re-admitted when the cooldown elapses.
+
+### HistorianClusterEndpointPicker
+
+The picker (in the plugin assembly, internal) is pure logic with no SDK dependency — all cluster behavior is unit-testable with a fake clock and scripted factory. Key characteristics:
+
+- **Ordered iteration**: nodes are tried in the exact order they appear in `ServerNames`. Operators can express a preference ("primary first, fallback second") by ordering the list.
+- **Per-node cooldown**: `MarkFailed(node, error)` starts a `FailureCooldownSeconds` window during which the node is skipped from `GetHealthyNodes()`. `MarkHealthy(node)` clears the window immediately (used on successful connect).
+- **Automatic re-admission**: when a node's cooldown elapses, the next call to `GetHealthyNodes()` includes it automatically — no background probe, no manual reset. The cumulative `FailureCount` and `LastError` are retained for operator diagnostics.
+- **Thread-safe**: a single lock guards the per-node state. Operations are microsecond-scale so contention is a non-issue.
+- **Shared across silos**: one picker instance is shared by the process-values connection and the event-history connection, so a node failure on one path immediately benches it for the other.
+- **Zero cooldown mode**: `FailureCooldownSeconds = 0` disables the cooldown entirely — the node is never benched. Useful for tests or for operators who want the SDK's own retry semantics to be the sole gate.
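The cooldown bookkeeping described in the list above can be sketched roughly as follows. This is a minimal illustration, not the plugin's `HistorianClusterEndpointPicker`: the class name `ClusterPickerSketch`, the injected `utcNow` clock delegate, and the case-insensitive node comparison are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of per-node cooldown tracking; not the plugin's real picker.
public sealed class ClusterPickerSketch
{
    private sealed class NodeState
    {
        public DateTime? CooldownUntil;
        public int FailureCount;
        public string? LastError;
    }

    private readonly object _lock = new object();
    private readonly List<string> _order;                  // configuration order
    private readonly Dictionary<string, NodeState> _state;
    private readonly TimeSpan _cooldown;
    private readonly Func<DateTime> _utcNow;               // injectable clock (assumed)

    public ClusterPickerSketch(IEnumerable<string> nodes, int cooldownSeconds, Func<DateTime> utcNow)
    {
        _order = nodes.Distinct(StringComparer.OrdinalIgnoreCase).ToList();
        _state = _order.ToDictionary(n => n, _ => new NodeState(), StringComparer.OrdinalIgnoreCase);
        _cooldown = TimeSpan.FromSeconds(cooldownSeconds);
        _utcNow = utcNow;
    }

    // Nodes with no cooldown, or an elapsed cooldown, in configuration order.
    public IReadOnlyList<string> GetHealthyNodes()
    {
        lock (_lock)
        {
            var now = _utcNow();
            return _order
                .Where(n => _state[n].CooldownUntil is null || _state[n].CooldownUntil <= now)
                .ToList();
        }
    }

    public void MarkFailed(string node, string error)
    {
        lock (_lock)
        {
            var s = _state[node];
            s.FailureCount++;       // cumulative; kept for diagnostics
            s.LastError = error;
            // A zero cooldown never benches the node.
            s.CooldownUntil = _cooldown > TimeSpan.Zero ? _utcNow() + _cooldown : (DateTime?)null;
        }
    }

    public void MarkHealthy(string node)
    {
        lock (_lock) { _state[node].CooldownUntil = null; }
    }
}
```

Because re-admission is computed inside `GetHealthyNodes()` from the clock, a test can advance a fake clock past the cooldown and observe the node return without any background timer.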
+
+### Connection attempt flow
+
+`HistorianDataSource.ConnectToAnyHealthyNode(HistorianConnectionType)` performs the actual iteration:
+
+1. Snapshot healthy nodes from the picker. If empty, throw `InvalidOperationException` with either "No historian nodes configured" or "All N historian nodes are in cooldown".
+2. For each candidate, clone `HistorianConfiguration` with the candidate as `ServerName` and pass it to the factory. On success: `MarkHealthy(node)` and return the `(Connection, Node)` tuple. On exception: `MarkFailed(node, ex.Message)`, log a warning, continue.
+3. If all candidates fail, wrap the last inner exception in an `InvalidOperationException` with the cumulative failure count so the existing read-method catch blocks surface a meaningful error through the health counters.
+
+The wrapping exception intentionally includes the last inner error message in the outer `Message` so the health snapshot's `LastError` field is still human-readable when the cluster exhausts every candidate.
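The three steps above might look roughly like the sketch below. It is generic over the connection type and takes the picker operations as delegates so that it stands alone; the class name `FailoverSketch`, the parameter shapes, and the exact wording of the final "all candidates failed" message are assumptions, and the config cloning and warning log from step 2 are reduced to the `connect` delegate.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the candidate loop; not the plugin's real method.
public static class FailoverSketch
{
    public static (TConn Connection, string Node) ConnectToAnyHealthyNode<TConn>(
        IReadOnlyList<string> healthyNodes,   // snapshot taken from the picker
        int configuredNodeCount,
        Func<string, TConn> connect,          // stands in for clone-config + factory call
        Action<string> markHealthy,
        Action<string, string> markFailed)
    {
        if (healthyNodes.Count == 0)
        {
            throw new InvalidOperationException(configuredNodeCount == 0
                ? "No historian nodes configured"
                : $"All {configuredNodeCount} historian nodes are in cooldown");
        }

        Exception? last = null;
        foreach (var node in healthyNodes)
        {
            try
            {
                var connection = connect(node);
                markHealthy(node);
                return (connection, node);
            }
            catch (Exception ex)
            {
                markFailed(node, ex.Message);  // the real code also logs a warning here
                last = ex;
            }
        }

        // Keep the last inner message in the outer Message so LastError stays readable.
        throw new InvalidOperationException(
            $"All {healthyNodes.Count} candidate historian nodes failed: {last!.Message}", last);
    }
}
```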
+
+### Single-node backward compatibility
+
+When `ServerNames` is empty, the picker is seeded with a single entry from `ServerName` and the iteration loop still runs — it just has one candidate. Legacy deployments see no behavior change: the picker marks the single node healthy on success, runs the same cooldown logic on failure, and the dashboard renders a compact `Node: <hostname>` line instead of the cluster table.
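The seeding rule can be expressed as a one-line fallback. The helper below is hypothetical (the name `ResolveNodes` does not appear in the commit) and only illustrates how a single `ServerName` becomes a one-element candidate list.

```csharp
using System.Collections.Generic;

// Hypothetical helper; illustrates the ServerNames-over-ServerName precedence.
public static class NodeListSketch
{
    public static IReadOnlyList<string> ResolveNodes(IReadOnlyList<string> serverNames, string serverName)
        => serverNames.Count > 0 ? serverNames : new[] { serverName };
}
```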
+
+### Cluster health surface
+
+Runtime cluster state is exposed on `HistorianHealthSnapshot`:
+
+- `NodeCount` / `HealthyNodeCount` -- size of the configured cluster and how many are currently eligible.
+- `ActiveProcessNode` / `ActiveEventNode` -- which nodes are currently serving the two connection silos, or `null` when a silo has no open connection.
+- `Nodes: List<HistorianClusterNodeState>` -- per-node state with `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime`.
+
+The dashboard renders this as a cluster table when `NodeCount > 1`. See [Status Dashboard](StatusDashboard.md#historian). `HealthCheckService` flips the overall service health to `Degraded` when `HealthyNodeCount < NodeCount` so operators can alert on a partially-failed cluster even while queries are still succeeding via the remaining nodes.
+
+## Runtime Health Counters
+
+`HistorianDataSource` maintains runtime query counters updated on every read method exit — success or failure — so the dashboard can distinguish "plugin loaded but never queried" from "plugin loaded and queries are failing". The load-time `HistorianPluginLoader.LastOutcome` only reports whether the assembly resolved at startup; it cannot catch a connection that succeeds at boot and degrades later.
+
+### Counters
+
+- `TotalQueries` / `TotalSuccesses` / `TotalFailures` — cumulative since startup. Every call to `RecordSuccess` or `RecordFailure` in the read methods updates these under `_healthLock`. Empty result sets count as successes — the counter reflects "the SDK call returned" rather than "the SDK call returned data".
+- `ConsecutiveFailures` -- increments on every failed query and is reset to zero by the first success. Drives `HealthCheckService` degradation at threshold 3.
+- `LastSuccessTime` / `LastFailureTime` — UTC timestamps of the most recent success or failure, or `null` when no query of that outcome has occurred yet.
+- `LastError` — exception message from the most recent failure, prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call is broken. Cleared on the next success.
+- `ProcessConnectionOpen` / `EventConnectionOpen` — whether the plugin currently holds an open SDK connection on each silo. Read from the data source's `_connection` / `_eventConnection` fields via a `Volatile.Read`.
+
+These fields are read once per dashboard refresh via `IHistorianDataSource.GetHealthSnapshot()` and serialized into `HistorianStatusInfo`. See [Status Dashboard](StatusDashboard.md#historian) for the HTML/JSON surface.
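A rough sketch of how such counters could be kept consistent under a single lock is shown below. The type names `QueryHealthSketch` and `HealthSnapshotSketch`, the `Snapshot()` method, and the space after the read-path prefix are assumptions; only the counter semantics come from the list above.

```csharp
using System;

// Hypothetical snapshot shape; the real HistorianHealthSnapshot carries more fields.
public sealed record HealthSnapshotSketch(
    long Total, long Successes, long Failures, int ConsecutiveFailures,
    DateTime? LastSuccessTime, DateTime? LastFailureTime, string? LastError);

public sealed class QueryHealthSketch
{
    private readonly object _healthLock = new object();
    private long _total, _successes, _failures;
    private int _consecutive;
    private DateTime? _lastSuccess, _lastFailure;
    private string? _lastError;

    // Called on every successful read, including reads that return no rows.
    public void RecordSuccess()
    {
        lock (_healthLock)
        {
            _total++;
            _successes++;
            _consecutive = 0;          // first success clears the failure streak
            _lastSuccess = DateTime.UtcNow;
            _lastError = null;         // cleared on the next success
        }
    }

    // readPath is one of the path names, e.g. "raw" or "events".
    public void RecordFailure(string readPath, Exception ex)
    {
        lock (_healthLock)
        {
            _total++;
            _failures++;
            _consecutive++;
            _lastFailure = DateTime.UtcNow;
            _lastError = $"{readPath}: {ex.Message}";
        }
    }

    // Copies all fields under the lock so the dashboard sees a consistent view.
    public HealthSnapshotSketch Snapshot()
    {
        lock (_healthLock)
        {
            return new HealthSnapshotSketch(_total, _successes, _failures, _consecutive,
                                            _lastSuccess, _lastFailure, _lastError);
        }
    }
}
```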
+
+### Two SDK connection silos
+
+The plugin maintains two independent `ArchestrA.HistorianAccess` connections, one per `HistorianConnectionType`:
+
+- **Process connection** (`ConnectionType = Process`) — serves historical *value* queries: `ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`. This is the SDK's query channel for tags stored in the Historian runtime.
+- **Event connection** (`ConnectionType = Event`) — serves historical *event/alarm* queries: `ReadEventsAsync`. The SDK requires a separately opened connection for its event store because the query API and wire schema are distinct from value queries.
+
+Both connections are lazy: they open on the first query that needs them. Either can be open, closed, or open against a different cluster node than the other. The dashboard renders both independently in the Historian panel (`Process Conn: open (host-a) | Event Conn: closed`) so operators can tell which silos are active and which node is serving each. When cluster support is configured, both silos share the same `HistorianClusterEndpointPicker`, so a failure on one silo marks the node unhealthy for the other as well.

 ## Raw Reads

@@ -104,16 +104,32 @@ New operation names are auto-registered on first use, so the `Operations` dictio

 ### Historian

-`HistorianStatusInfo` -- reflects the outcome of the runtime-loaded historian plugin. See [Historical Data Access](HistoricalDataAccess.md) for the plugin architecture.
+`HistorianStatusInfo` -- reflects the outcome of the runtime-loaded historian plugin and the runtime query-health counters. See [Historical Data Access](HistoricalDataAccess.md) for the plugin architecture and the [Runtime Health Counters](HistoricalDataAccess.md#runtime-health-counters) section for the data source instrumentation.

 | Field | Type | Description |
 |-------|------|-------------|
 | `Enabled` | `bool` | Whether `Historian.Enabled` is set in configuration |
-| `PluginStatus` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` |
+| `PluginStatus` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` — load-time outcome from `HistorianPluginLoader.LastOutcome` |
 | `PluginError` | `string?` | Exception message from the last load attempt when `PluginStatus=LoadFailed`; otherwise `null` |
 | `PluginPath` | `string` | Absolute path the loader probed for the plugin assembly |
-| `ServerName` | `string` | Configured historian hostname |
+| `ServerName` | `string` | Legacy single-node hostname from `Historian.ServerName`; ignored when `ServerNames` is non-empty |
 | `Port` | `int` | Configured historian TCP port |
+| `QueryTotal` | `long` | Total historian read queries attempted since startup (raw + aggregate + at-time + events) |
+| `QuerySuccesses` | `long` | Queries that completed without an exception |
+| `QueryFailures` | `long` | Queries that raised an exception — each failure also triggers the plugin's reconnect path |
+| `ConsecutiveFailures` | `int` | Failures since the last success. Resets to zero on any successful query. Drives the `Degraded` health rule at threshold 3 |
+| `LastSuccessTime` | `DateTime?` | UTC timestamp of the most recent successful query, or `null` when no query has succeeded since startup |
+| `LastFailureTime` | `DateTime?` | UTC timestamp of the most recent failure, or `null` when no query has failed since startup |
+| `LastQueryError` | `string?` | Exception message from the most recent failure. Prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call failed |
+| `ProcessConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **process** silo (historical value queries — `ReadRaw`, `ReadAggregate`, `ReadAtTime`). See [Two SDK connection silos](HistoricalDataAccess.md#two-sdk-connection-silos) |
+| `EventConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **event** silo (alarm history queries — `ReadEvents`). Separate from the process connection because the SDK requires distinct query channels |
+| `ActiveProcessNode` | `string?` | Cluster node currently serving the process silo, or `null` when no process connection is open |
+| `ActiveEventNode` | `string?` | Cluster node currently serving the event silo, or `null` when no event connection is open |
+| `NodeCount` | `int` | Total configured historian cluster nodes. 1 for a legacy single-node deployment |
+| `HealthyNodeCount` | `int` | Nodes currently eligible for new connections (not in failure cooldown) |
+| `Nodes` | `List<HistorianClusterNodeState>` | Per-node cluster state in configuration order. Each entry carries `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime` |
+
+The operator dashboard renders a cluster table inside the Historian panel when `NodeCount > 1`. Legacy single-node deployments render a compact `Node: <hostname>` line and no table. Panel color reflects combined load-time and runtime health: green when none of the following conditions apply; yellow when any cluster node is in cooldown or 1-4 consecutive query failures have accumulated; red when the plugin is not loaded, all cluster nodes have failed, or 5 or more consecutive query failures have accumulated.
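The color rule can be read as a small decision function. The sketch below is an interpretation, not the dashboard's code: the names `PanelColor` and `PanelColorSketch.Classify` are hypothetical, and treating "all cluster nodes are failed" as `healthyNodeCount == 0` is an assumption.

```csharp
// Hypothetical classification of the Historian panel color.
public enum PanelColor { Green, Yellow, Red }

public static class PanelColorSketch
{
    public static PanelColor Classify(bool pluginLoaded, int nodeCount, int healthyNodeCount, int consecutiveFailures)
    {
        // Red: plugin not loaded, no eligible node left, or 5+ consecutive failures.
        if (!pluginLoaded || healthyNodeCount == 0 || consecutiveFailures >= 5)
            return PanelColor.Red;

        // Yellow: at least one node in cooldown, or 1-4 consecutive failures.
        if (healthyNodeCount < nodeCount || consecutiveFailures >= 1)
            return PanelColor.Yellow;

        return PanelColor.Green;
    }
}
```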

 ### Alarms