Instrument the historian plugin with runtime query health counters and read-only cluster failover so operators can detect silent query degradation and keep serving history when a single cluster node goes down

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-04-13 14:08:32 -04:00
parent 4fe37fd1b7
commit 8f340553d9
20 changed files with 1526 additions and 32 deletions


@@ -103,7 +103,9 @@ Controls the Wonderware Historian SDK connection for OPC UA historical data acce
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `Enabled` | `bool` | `false` | Enables OPC UA historical data access |
| `ServerName` | `string` | `"localhost"` | Single Historian server hostname used when `ServerNames` is empty. Preserved for backward compatibility with pre-cluster deployments |
| `ServerNames` | `List<string>` | `[]` | Ordered list of Historian cluster nodes. When non-empty, supersedes `ServerName` and enables read-only cluster failover. See [Historical Data Access](HistoricalDataAccess.md#read-only-cluster-failover) |
| `FailureCooldownSeconds` | `int` | `60` | How long a failed cluster node is skipped before being retried. Zero disables the cooldown |
| `IntegratedSecurity` | `bool` | `true` | Use Windows authentication |
| `UserName` | `string?` | `null` | Username when `IntegratedSecurity` is false |
| `Password` | `string?` | `null` | Password when `IntegratedSecurity` is false |
@@ -250,6 +252,9 @@ Three boolean properties act as feature flags that control optional subsystems:
- `AutoAcceptClientCertificates = true` emits a warning
- Only-`None` profile configuration emits a warning
- `OpcUa.AlarmFilter.ObjectFilters` is non-empty while `OpcUa.AlarmTrackingEnabled = false` emits a warning (filter has no effect)
- `Historian.ServerName` (or `Historian.ServerNames`) must not be empty when `Historian.Enabled = true`
- `Historian.FailureCooldownSeconds` must be zero or positive
- Setting `Historian.ServerName` alongside a non-empty `Historian.ServerNames` emits a warning (the single `ServerName` is ignored)
- `OpcUa.ApplicationUri` must be set when `Redundancy.Enabled = true`
- `Redundancy.ServiceLevelBase` must be between 1 and 255
- `Redundancy.ServerUris` should contain at least 2 entries when enabled
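The historian-specific rules above can be sketched as a small check helper. This is an illustrative stand-in only — `HistorianSettings` and `HistorianValidation.Check` are hypothetical names; the real `ConfigurationValidator` wires these rules into its own logging and result types.

```csharp
using System;
using System.Collections.Generic;

// Illustrative stand-in for the documented configuration shape.
public class HistorianSettings
{
    public bool Enabled;
    public string ServerName = "localhost";
    public List<string> ServerNames = new List<string>();
    public int FailureCooldownSeconds = 60;
}

public static class HistorianValidation
{
    // Mirrors the documented historian rules: a missing server name is an error,
    // a negative cooldown is an error, and ServerName alongside a non-empty
    // ServerNames is only a warning (the single name is ignored at runtime).
    public static (List<string> Errors, List<string> Warnings) Check(HistorianSettings c)
    {
        var errors = new List<string>();
        var warnings = new List<string>();

        if (c.Enabled && c.ServerNames.Count == 0 && string.IsNullOrWhiteSpace(c.ServerName))
            errors.Add("Historian.ServerName or Historian.ServerNames must be set when Enabled = true");

        if (c.FailureCooldownSeconds < 0)
            errors.Add("Historian.FailureCooldownSeconds must be zero or positive");

        if (c.ServerNames.Count > 0 && !string.IsNullOrWhiteSpace(c.ServerName))
            warnings.Add("Historian.ServerName is ignored because ServerNames is non-empty");

        return (errors, warnings);
    }
}
```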
@@ -316,6 +321,8 @@ Integration tests use this constructor to inject substitute implementations of `
"Historian": {
"Enabled": false,
"ServerName": "localhost",
"ServerNames": [],
"FailureCooldownSeconds": 60,
"IntegratedSecurity": true,
"UserName": null,
"Password": null,


@@ -46,6 +46,8 @@ public class HistorianConfiguration
{
public bool Enabled { get; set; } = false;
public string ServerName { get; set; } = "localhost";
public List<string> ServerNames { get; set; } = new();
public int FailureCooldownSeconds { get; set; } = 60;
public bool IntegratedSecurity { get; set; } = true;
public string? UserName { get; set; }
public string? Password { get; set; }
@@ -61,7 +63,9 @@ When `Enabled` is `false`, `HistorianPluginLoader.TryLoad` is not called, no plu
| Property | Default | Description |
|---|---|---|
| `ServerName` | `localhost` | Single Historian server hostname used when `ServerNames` is empty. Preserved for backward compatibility with pre-cluster deployments |
| `ServerNames` | `[]` | Ordered list of Historian cluster nodes. When non-empty, supersedes `ServerName` and enables read-only cluster failover (see [Cluster Failover](#read-only-cluster-failover)) |
| `FailureCooldownSeconds` | `60` | How long a failed cluster node is skipped before being retried. Zero means no cooldown (retry on every request) |
| `IntegratedSecurity` | `true` | Use Windows authentication |
| `UserName` | `null` | Username when `IntegratedSecurity` is false |
| `Password` | `null` | Password when `IntegratedSecurity` is false |
@@ -73,12 +77,74 @@ When `Enabled` is `false`, `HistorianPluginLoader.TryLoad` is not called, no plu
`HistorianDataSource` (in the plugin assembly) maintains a persistent connection to the Historian server via `ArchestrA.HistorianAccess`:
1. **Lazy connect** -- The connection is established on the first query via `EnsureConnected()`. When a cluster is configured, the data source iterates `HistorianClusterEndpointPicker.GetHealthyNodes()` in order and returns the first node that successfully connects.
2. **Connection reuse** -- Subsequent queries reuse the same connection. The active node is tracked in `_activeProcessNode` / `_activeEventNode` and surfaced on the dashboard.
3. **Auto-reconnect** -- On connection failure, the connection is disposed, the active node is marked failed in the picker, and the next query re-enters the picker loop to try the next eligible candidate.
4. **Clean shutdown** -- `Dispose()` closes the connection when the service stops.
The connection is opened with `ReadOnly = true` and `ConnectionType = Process`. The event (alarm history) path uses a separate connection with `ConnectionType = Event`, but both silos share the same cluster picker so a node that fails on one silo is immediately skipped on the other.
## Read-Only Cluster Failover
When `HistorianConfiguration.ServerNames` is non-empty, the plugin picks from an ordered list of cluster nodes instead of a single `ServerName`. Each connection attempt tries candidates in configuration order until one succeeds. Failed nodes are placed into a timed cooldown and re-admitted when the cooldown elapses.
### HistorianClusterEndpointPicker
The picker (in the plugin assembly, internal) is pure logic with no SDK dependency — all cluster behavior is unit-testable with a fake clock and scripted factory. Key characteristics:
- **Ordered iteration**: nodes are tried in the exact order they appear in `ServerNames`. Operators can express a preference ("primary first, fallback second") by ordering the list.
- **Per-node cooldown**: `MarkFailed(node, error)` starts a `FailureCooldownSeconds` window during which the node is skipped from `GetHealthyNodes()`. `MarkHealthy(node)` clears the window immediately (used on successful connect).
- **Automatic re-admission**: when a node's cooldown elapses, the next call to `GetHealthyNodes()` includes it automatically — no background probe, no manual reset. The cumulative `FailureCount` and `LastError` are retained for operator diagnostics.
- **Thread-safe**: a single lock guards the per-node state. Operations are microsecond-scale so contention is a non-issue.
- **Shared across silos**: one picker instance is shared by the process-values connection and the event-history connection, so a node failure on one path immediately benches it for the other.
- **Zero cooldown mode**: `FailureCooldownSeconds = 0` disables the cooldown entirely — the node is never benched. Useful for tests or for operators who want the SDK's own retry semantics to be the sole gate.
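The cooldown and re-admission behavior above can be exercised with a fake clock. The following is a simplified, self-contained re-sketch for illustration only (`MiniPicker` is a hypothetical name; the real `HistorianClusterEndpointPicker` later in this commit adds failure counters, name trimming, and de-duplication):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified sketch of the picker's cooldown/re-admission logic.
public class MiniPicker
{
    private readonly Func<DateTime> _clock;
    private readonly TimeSpan _cooldown;
    private readonly List<string> _order;                       // configuration order
    private readonly Dictionary<string, DateTime?> _cooldownUntil;

    public MiniPicker(IEnumerable<string> nodes, int cooldownSeconds, Func<DateTime> clock)
    {
        _clock = clock;
        _cooldown = TimeSpan.FromSeconds(cooldownSeconds);
        _order = nodes.ToList();
        _cooldownUntil = _order.ToDictionary(n => n, n => (DateTime?)null);
    }

    // Nodes whose cooldown is absent or elapsed, in configuration order.
    // Re-admission is passive: no background probe, just a clock comparison.
    public List<string> GetHealthyNodes() =>
        _order.Where(n => _cooldownUntil[n] == null || _cooldownUntil[n] <= _clock()).ToList();

    public void MarkFailed(string node) =>
        _cooldownUntil[node] = _cooldown > TimeSpan.Zero ? _clock() + _cooldown : (DateTime?)null;

    public void MarkHealthy(string node) => _cooldownUntil[node] = null;
}
```

Injecting the clock as a `Func<DateTime>` is what makes the cooldown window testable without sleeping: a test advances a captured local and the benched node re-enters the pool on the next `GetHealthyNodes()` call.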
### Connection attempt flow
`HistorianDataSource.ConnectToAnyHealthyNode(HistorianConnectionType)` performs the actual iteration:
1. Snapshot healthy nodes from the picker. If empty, throw `InvalidOperationException` with either "No historian nodes configured" or "All N historian nodes are in cooldown".
2. For each candidate, clone `HistorianConfiguration` with the candidate as `ServerName` and pass it to the factory. On success: `MarkHealthy(node)` and return the `(Connection, Node)` tuple. On exception: `MarkFailed(node, ex.Message)`, log a warning, continue.
3. If all candidates fail, wrap the last inner exception in an `InvalidOperationException` with the cumulative failure count so the existing read-method catch blocks surface a meaningful error through the health counters.
The wrapping exception intentionally includes the last inner error message in the outer `Message` so the health snapshot's `LastError` field is still human-readable when the cluster exhausts every candidate.
### Single-node backward compatibility
When `ServerNames` is empty, the picker is seeded with a single entry from `ServerName` and the iteration loop still runs — it just has one candidate. Legacy deployments see no behavior change: the picker marks the single node healthy on success, runs the same cooldown logic on failure, and the dashboard renders a compact `Node: <hostname>` line instead of the cluster table.
### Cluster health surface
Runtime cluster state is exposed on `HistorianHealthSnapshot`:
- `NodeCount` / `HealthyNodeCount` -- size of the configured cluster and how many are currently eligible.
- `ActiveProcessNode` / `ActiveEventNode` -- which nodes are currently serving the two connection silos, or `null` when a silo has no open connection.
- `Nodes: List<HistorianClusterNodeState>` -- per-node state with `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime`.
The dashboard renders this as a cluster table when `NodeCount > 1`. See [Status Dashboard](StatusDashboard.md#historian). `HealthCheckService` flips the overall service health to `Degraded` when `HealthyNodeCount < NodeCount` so operators can alert on a partially-failed cluster even while queries are still succeeding via the remaining nodes.
## Runtime Health Counters
`HistorianDataSource` maintains runtime query counters updated on every read method exit — success or failure — so the dashboard can distinguish "plugin loaded but never queried" from "plugin loaded and queries are failing". The load-time `HistorianPluginLoader.LastOutcome` only reports whether the assembly resolved at startup; it cannot catch a connection that succeeds at boot and degrades later.
### Counters
- `TotalQueries` / `TotalSuccesses` / `TotalFailures` — cumulative since startup. Every call to `RecordSuccess` or `RecordFailure` in the read methods updates these under `_healthLock`. Empty result sets count as successes — the counter reflects "the SDK call returned" rather than "the SDK call returned data".
- `ConsecutiveFailures` — latches while queries are failing; reset to zero by the first success. Drives `HealthCheckService` degradation at threshold 3.
- `LastSuccessTime` / `LastFailureTime` — UTC timestamps of the most recent success or failure, or `null` when no query of that outcome has occurred yet.
- `LastError` — exception message from the most recent failure, prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call is broken. Cleared on the next success.
- `ProcessConnectionOpen` / `EventConnectionOpen` — whether the plugin currently holds an open SDK connection on each silo. Read from the data source's `_connection` / `_eventConnection` fields via a `Volatile.Read`.
These fields are read once per dashboard refresh via `IHistorianDataSource.GetHealthSnapshot()` and serialized into `HistorianStatusInfo`. See [Status Dashboard](StatusDashboard.md#historian) for the HTML/JSON surface.
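A hedged sketch of the recording helpers described above — the field names mirror the documented counters, but `QueryHealthCounters` is an illustrative shape, not the actual `HistorianDataSource` code:

```csharp
using System;

// Sketch of the lock-guarded counter updates performed at every read-method exit.
public class QueryHealthCounters
{
    private readonly object _healthLock = new object();
    public long TotalSuccesses, TotalFailures;
    public int ConsecutiveFailures;
    public DateTime? LastSuccessTime, LastFailureTime;
    public string LastError;

    public void RecordSuccess()
    {
        lock (_healthLock)
        {
            TotalSuccesses++;
            ConsecutiveFailures = 0;   // first success resets the latch
            LastError = null;          // cleared on the next success
            LastSuccessTime = DateTime.UtcNow;
        }
    }

    public void RecordFailure(string path, Exception ex)
    {
        lock (_healthLock)
        {
            TotalFailures++;
            ConsecutiveFailures++;
            LastError = path + ": " + ex.Message;  // e.g. "raw: <SDK error>"
            LastFailureTime = DateTime.UtcNow;
        }
    }

    public long TotalQueries
    {
        get { lock (_healthLock) return TotalSuccesses + TotalFailures; }
    }
}
```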
### Two SDK connection silos
The plugin maintains two independent `ArchestrA.HistorianAccess` connections, one per `HistorianConnectionType`:
- **Process connection** (`ConnectionType = Process`) — serves historical *value* queries: `ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`. This is the SDK's query channel for tags stored in the Historian runtime.
- **Event connection** (`ConnectionType = Event`) — serves historical *event/alarm* queries: `ReadEventsAsync`. The SDK requires a separately opened connection for its event store because the query API and wire schema are distinct from value queries.
Both connections are lazy: they open on the first query that needs them. Either can be open, closed, or open against a different cluster node than the other. The dashboard renders both independently in the Historian panel (`Process Conn: open (host-a) | Event Conn: closed`) so operators can tell which silos are active and which node is serving each. When cluster support is configured, both silos share the same `HistorianClusterEndpointPicker`, so a failure on one silo marks the node unhealthy for the other as well.
## Raw Reads


@@ -104,16 +104,32 @@ New operation names are auto-registered on first use, so the `Operations` dictio
### Historian
`HistorianStatusInfo` -- reflects the outcome of the runtime-loaded historian plugin and the runtime query-health counters. See [Historical Data Access](HistoricalDataAccess.md) for the plugin architecture and the [Runtime Health Counters](HistoricalDataAccess.md#runtime-health-counters) section for the data source instrumentation.
| Field | Type | Description |
|-------|------|-------------|
| `Enabled` | `bool` | Whether `Historian.Enabled` is set in configuration |
| `PluginStatus` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` — load-time outcome from `HistorianPluginLoader.LastOutcome` |
| `PluginError` | `string?` | Exception message from the last load attempt when `PluginStatus=LoadFailed`; otherwise `null` |
| `PluginPath` | `string` | Absolute path the loader probed for the plugin assembly |
| `ServerName` | `string` | Legacy single-node hostname from `Historian.ServerName`; ignored when `ServerNames` is non-empty |
| `Port` | `int` | Configured historian TCP port |
| `QueryTotal` | `long` | Total historian read queries attempted since startup (raw + aggregate + at-time + events) |
| `QuerySuccesses` | `long` | Queries that completed without an exception |
| `QueryFailures` | `long` | Queries that raised an exception — each failure also triggers the plugin's reconnect path |
| `ConsecutiveFailures` | `int` | Failures since the last success. Resets to zero on any successful query. Drives the `Degraded` health rule at threshold 3 |
| `LastSuccessTime` | `DateTime?` | UTC timestamp of the most recent successful query, or `null` when no query has succeeded since startup |
| `LastFailureTime` | `DateTime?` | UTC timestamp of the most recent failure |
| `LastQueryError` | `string?` | Exception message from the most recent failure. Prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call failed |
| `ProcessConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **process** silo (historical value queries — `ReadRaw`, `ReadAggregate`, `ReadAtTime`). See [Two SDK connection silos](HistoricalDataAccess.md#two-sdk-connection-silos) |
| `EventConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **event** silo (alarm history queries — `ReadEvents`). Separate from the process connection because the SDK requires distinct query channels |
| `ActiveProcessNode` | `string?` | Cluster node currently serving the process silo, or `null` when no process connection is open |
| `ActiveEventNode` | `string?` | Cluster node currently serving the event silo, or `null` when no event connection is open |
| `NodeCount` | `int` | Total configured historian cluster nodes. 1 for a legacy single-node deployment |
| `HealthyNodeCount` | `int` | Nodes currently eligible for new connections (not in failure cooldown) |
| `Nodes` | `List<HistorianClusterNodeState>` | Per-node cluster state in configuration order. Each entry carries `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime` |
The operator dashboard renders a cluster table inside the Historian panel when `NodeCount > 1`. Legacy single-node deployments render a compact `Node: <hostname>` line and no table. Panel color reflects combined load-time and runtime health: green when everything is fine; yellow when any cluster node is in cooldown or 1-4 consecutive query failures have accumulated; red when the plugin is unloaded, all cluster nodes have failed, or 5+ consecutive failures have accumulated.
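The color rules can be read as a small decision function. This is a hypothetical sketch of the rules as stated — the real `StatusReportService` may order or combine the checks differently:

```csharp
public static class HistorianPanel
{
    // Red: plugin unloaded, every cluster node failed, or 5+ consecutive failures.
    // Yellow: any node benched, or 1-4 consecutive failures. Green otherwise.
    public static string Color(bool pluginLoaded, int consecutiveFailures,
                               int nodeCount, int healthyNodeCount)
    {
        if (!pluginLoaded || healthyNodeCount == 0 || consecutiveFailures >= 5)
            return "red";
        if (healthyNodeCount < nodeCount || consecutiveFailures >= 1)
            return "yellow";
        return "green";
    }
}
```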
### Alarms


@@ -463,6 +463,106 @@ Filter syntax quick reference (documented in `AlarmFilterConfiguration.cs` XML-d
- Empty list → filter disabled → current unfiltered behavior.
- Match semantics: an object is included when any template in its derivation chain matches any pattern, and the inclusion propagates to all descendants in the containment hierarchy. Each object is evaluated once regardless of how many patterns or ancestors match.
## Historian Runtime Health Surface
Updated: `2026-04-13 10:44-10:52 America/New_York`
Both instances updated with runtime historian query instrumentation so the status dashboard can detect silent query degradation that the load-time `PluginStatus` cannot catch.
Backups:
- `C:\publish\lmxopcua\backups\20260413-104406-instance1`
- `C:\publish\lmxopcua\backups\20260413-104406-instance2`
Code changes:
- `Host/Historian/HistorianHealthSnapshot.cs` (new) — DTO with `TotalQueries`, `TotalSuccesses`, `TotalFailures`, `ConsecutiveFailures`, `LastSuccessTime`, `LastFailureTime`, `LastError`, `ProcessConnectionOpen`, `EventConnectionOpen`.
- `Host/Historian/IHistorianDataSource.cs` — added `GetHealthSnapshot()` interface method.
- `Historian.Aveva/HistorianDataSource.cs` — added `_healthLock`-guarded counters, `RecordSuccess()` / `RecordFailure(path)` helpers called at every terminal site in all four read methods (raw, aggregate, at-time, events). Error messages carry a `raw:` / `aggregate:` / `at-time:` / `events:` prefix so operators can tell which SDK call is broken.
- `Host/OpcUa/LmxNodeManager.cs` — exposes `HistorianHealth` property that proxies to `IHistorianDataSource.GetHealthSnapshot()`.
- `Host/Status/StatusData.cs` — added 9 new fields on `HistorianStatusInfo`.
- `Host/Status/StatusReportService.cs` — `BuildHistorianStatusInfo()` populates the new fields from the node manager; panel color gradient: green → yellow (1-4 consecutive failures) → red (≥5 consecutive or plugin unloaded). Renders `Queries: N (Success: X, Failure: Y) | Consecutive Failures: Z`, `Process Conn: open/closed | Event Conn: open/closed`, plus `Last Success:` / `Last Failure:` / `Last Error:` lines when applicable.
- `Host/Status/HealthCheckService.cs` — new Rule 2b2: `Degraded` when `ConsecutiveFailures >= 3`. Threshold chosen to avoid flagging single transient blips.
Tests:
- 5 new unit tests in `HistorianDataSourceLifecycleTests` covering fresh zero-state, single failure, multi-failure consecutive increment, cross-read-path counting, and error-message-carries-path.
- Full suite: 16/16 plugin tests, 447/447 host tests passing.
Live verification on instance1:
```
Before any query:
Queries: 0 (Success: 0, Failure: 0) | Process Conn: closed | Event Conn: closed
After TestMachine_001.TestHistoryValue raw read:
Queries: 1 (Success: 1, Failure: 0) | Process Conn: open
Last Success: 2026-04-13T14:45:18Z
After aggregate hourly-average over 24h:
Queries: 2 (Success: 2, Failure: 0)
After historyread against an unknown node id (bad tag):
Queries: 2 (counter unchanged — rejected at node-lookup before reaching the plugin; correct)
```
JSON endpoint `/api/status` carries all 9 new fields with correct types. Both instances deployed; instance1 `LmxOpcUa` PID 33824, instance2 `LmxOpcUa2` PID 30200.
## Historian Read-Only Cluster Support
Updated: `2026-04-13 11:25-12:00 America/New_York`
Both instances updated with Wonderware Historian read-only cluster failover. Operators can supply an ordered list of historian cluster nodes; the plugin iterates them on each fresh connect and benches failed nodes for a configurable cooldown window. Single-node deployments are preserved via the existing `ServerName` field.
Backups:
- `C:\publish\lmxopcua\backups\20260413-112519-instance1`
- `C:\publish\lmxopcua\backups\20260413-112519-instance2`
Code changes:
- `Host/Configuration/HistorianConfiguration.cs` — added `ServerNames: List<string>` (defaults to `[]`) and `FailureCooldownSeconds: int` (defaults to 60). `ServerName` preserved as fallback when `ServerNames` is empty.
- `Host/Historian/HistorianClusterNodeState.cs` (new) — per-node DTO: `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime`.
- `Host/Historian/HistorianHealthSnapshot.cs` — extended with `ActiveProcessNode`, `ActiveEventNode`, `NodeCount`, `HealthyNodeCount`, `Nodes: List<HistorianClusterNodeState>`.
- `Historian.Aveva/HistorianClusterEndpointPicker.cs` (new, internal) — pure picker with injected clock, thread-safe via lock, exposing `GetHealthyNodes()` / `MarkFailed()` / `MarkHealthy()` / `SnapshotNodeStates()`. Nodes iterate in configuration order; failed nodes are skipped until their cooldown elapses; the cumulative `FailureCount` and `LastError` are retained across recovery for operator diagnostics.
- `Historian.Aveva/HistorianDataSource.cs` — new `ConnectToAnyHealthyNode(type)` method iterates picker candidates, clones `HistorianConfiguration` per attempt with the candidate as `ServerName`, and returns the first successful `(Connection, Node)` tuple. `EnsureConnected` and `EnsureEventConnected` both call it. `HandleConnectionError` and `HandleEventConnectionError` now mark the active node failed in the picker before nulling. `_activeProcessNode` / `_activeEventNode` track the live node for the dashboard. Both silos (process + event) share a single picker instance so a node failure on one immediately benches it for the other.
- `Host/Status/StatusData.cs` — added `NodeCount`, `HealthyNodeCount`, `ActiveProcessNode`, `ActiveEventNode`, `Nodes` to `HistorianStatusInfo`.
- `Host/Status/StatusReportService.cs` — Historian panel renders `Process Conn: open (<node>)` badges and a cluster table (when `NodeCount > 1`) showing each node's state, cooldown expiry, failure count, and last error. Single-node deployments render a compact `Node: <hostname>` line.
- `Host/Status/HealthCheckService.cs` — new Rule 2b3: `Degraded` when `NodeCount > 1 && HealthyNodeCount < NodeCount`. Lets operators alert on a partially-failed cluster even while queries are still succeeding via the remaining nodes.
- `Host/Configuration/ConfigurationValidator.cs` — logs the effective node list and `FailureCooldownSeconds` at startup, validates that `FailureCooldownSeconds >= 0`, warns when `ServerName` is set alongside a non-empty `ServerNames`.
Tests:
- `HistorianClusterEndpointPickerTests.cs` — 19 unit tests covering config parsing, ordered iteration, cooldown expiry, zero-cooldown mode, mark-healthy clears, cumulative failure counting, unknown-node safety, concurrent writers (thread-safety smoke test).
- `HistorianClusterFailoverTests.cs` — 6 integration tests driving `HistorianDataSource` via a scripted `FakeHistorianConnectionFactory`: first-node-fails-picks-second, all-nodes-fail, second-call-skips-cooled-down-node, single-node-legacy-behavior, picker-order-respected, shared-picker-across-silos.
- Full plugin suite: 41/41 tests passing. Host suite: 446/447 (1 pre-existing flaky MxAccess monitor test passes on retry).
Live verification on instance1 (cluster = `["does-not-exist-historian.invalid", "localhost"]`, `FailureCooldownSeconds=30`):
**Failover cycle 1** (fresh picker state, both nodes healthy):
```
2026-04-13 11:27:25.381 [WRN] Historian node does-not-exist-historian.invalid failed during connect attempt; trying next candidate
2026-04-13 11:27:25.910 [INF] Historian SDK connection opened to localhost:32568
```
- historyread returned 1 value successfully (`Queries: 1 (Success: 1, Failure: 0)`).
- Dashboard: panel yellow, `Cluster: 1 of 2 nodes healthy`, bad node `cooldown` until `11:27:55Z`, `Process Conn: open (localhost)`.
**Cooldown expiry**:
- At 11:29 UTC, the cooldown window had elapsed. Panel back to green, both nodes healthy, but `does-not-exist-historian.invalid` retains `FailureCount=1` and `LastError` as history.
**Failover cycle 2** (service restart to drop persistent connection):
```
2026-04-13 14:00:39.352 [WRN] Historian node does-not-exist-historian.invalid failed during connect attempt; trying next candidate
2026-04-13 14:00:39.885 [INF] Historian SDK connection opened to localhost:32568
```
- historyread returned 1 value successfully on the second restart cycle — proves the picker re-admits a cooled-down node and the whole failover cycle repeats cleanly.
**Single-node restoration**:
- Changed instance1 back to `"ServerNames": []`, restarted. Dashboard renders `Node: localhost` (no cluster table), panel green, backward compat verified.
Final configuration: both instances running with empty `ServerNames` (single-node mode). `LmxOpcUa` PID 31064, `LmxOpcUa2` PID 15012.
Operator configuration shape:
```json
"Historian": {
"Enabled": true,
"ServerName": "localhost", // ignored when ServerNames is non-empty
"ServerNames": ["historian-a", "historian-b"],
"FailureCooldownSeconds": 60,
...
}
```
## Notes
The service deployment and restart succeeded. The live CLI checks confirm the endpoint is reachable and that the array node identifier has changed to the bracketless form. The array value on the live service still prints as blank even though the status is good; if this environment is expected to populate `MoveInPartNumbers`, the runtime data path still needs follow-up investigation.


@@ -0,0 +1,181 @@
using System;
using System.Collections.Generic;
using System.Linq;
using ZB.MOM.WW.LmxOpcUa.Host.Configuration;
using ZB.MOM.WW.LmxOpcUa.Host.Historian;
namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
{
/// <summary>
/// Thread-safe, pure-logic endpoint picker for the Wonderware Historian cluster. Tracks which
/// configured nodes are healthy, places failed nodes in a time-bounded cooldown, and hands
/// out an ordered list of eligible candidates for the data source to try in sequence.
/// </summary>
/// <remarks>
/// Design notes:
/// <list type="bullet">
/// <item>No SDK dependency — fully unit-testable with an injected clock.</item>
/// <item>Per-node state is guarded by a single lock; operations are microsecond-scale
/// so contention is a non-issue.</item>
/// <item>Cooldown is purely passive: a node re-enters the healthy pool the next time
/// it is queried after its cooldown window elapses. There is no background probe.</item>
/// <item>Nodes are returned in configuration order so operators can express a
/// preference (primary first, fallback second).</item>
/// <item>When <see cref="HistorianConfiguration.ServerNames"/> is empty, the picker is
/// initialized with a single entry from <see cref="HistorianConfiguration.ServerName"/>
/// so legacy deployments continue to work unchanged.</item>
/// </list>
/// </remarks>
internal sealed class HistorianClusterEndpointPicker
{
private readonly Func<DateTime> _clock;
private readonly TimeSpan _cooldown;
private readonly object _lock = new object();
private readonly List<NodeEntry> _nodes;
public HistorianClusterEndpointPicker(HistorianConfiguration config)
: this(config, () => DateTime.UtcNow) { }
internal HistorianClusterEndpointPicker(HistorianConfiguration config, Func<DateTime> clock)
{
_clock = clock ?? throw new ArgumentNullException(nameof(clock));
_cooldown = TimeSpan.FromSeconds(Math.Max(0, config.FailureCooldownSeconds));
var names = (config.ServerNames != null && config.ServerNames.Count > 0)
? config.ServerNames
: new List<string> { config.ServerName };
_nodes = names
.Where(n => !string.IsNullOrWhiteSpace(n))
.Select(n => n.Trim())
.Distinct(StringComparer.OrdinalIgnoreCase)
.Select(n => new NodeEntry { Name = n })
.ToList();
}
/// <summary>
/// Gets the total number of configured cluster nodes. Stable — nodes are never added
/// or removed after construction.
/// </summary>
public int NodeCount
{
get
{
lock (_lock)
return _nodes.Count;
}
}
/// <summary>
/// Returns an ordered snapshot of nodes currently eligible for a connection attempt,
/// with any node whose cooldown has elapsed automatically restored to the pool.
/// An empty list means all nodes are in active cooldown.
/// </summary>
public IReadOnlyList<string> GetHealthyNodes()
{
lock (_lock)
{
var now = _clock();
return _nodes
.Where(n => IsHealthyAt(n, now))
.Select(n => n.Name)
.ToList();
}
}
/// <summary>
/// Gets the count of nodes currently eligible for a connection attempt (i.e., not in cooldown).
/// </summary>
public int HealthyNodeCount
{
get
{
lock (_lock)
{
var now = _clock();
return _nodes.Count(n => IsHealthyAt(n, now));
}
}
}
/// <summary>
/// Places <paramref name="node"/> into cooldown starting at the current clock time.
/// Increments the node's failure counter and stores the latest error message for
/// surfacing on the dashboard. Unknown node names are ignored.
/// </summary>
public void MarkFailed(string node, string? error)
{
lock (_lock)
{
var entry = FindEntry(node);
if (entry == null)
return;
var now = _clock();
entry.FailureCount++;
entry.LastError = error;
entry.LastFailureTime = now;
entry.CooldownUntil = _cooldown.TotalMilliseconds > 0 ? now + _cooldown : (DateTime?)null;
}
}
/// <summary>
/// Marks <paramref name="node"/> as healthy immediately — clears any active cooldown but
/// leaves the cumulative failure counter intact for operator diagnostics. Unknown node
/// names are ignored.
/// </summary>
public void MarkHealthy(string node)
{
lock (_lock)
{
var entry = FindEntry(node);
if (entry == null)
return;
entry.CooldownUntil = null;
}
}
/// <summary>
/// Captures the current per-node state for the health dashboard. Freshly computed from
/// <see cref="_clock"/> so recently-expired cooldowns are reported as healthy.
/// </summary>
public List<HistorianClusterNodeState> SnapshotNodeStates()
{
lock (_lock)
{
var now = _clock();
return _nodes.Select(n => new HistorianClusterNodeState
{
Name = n.Name,
IsHealthy = IsHealthyAt(n, now),
CooldownUntil = IsHealthyAt(n, now) ? null : n.CooldownUntil,
FailureCount = n.FailureCount,
LastError = n.LastError,
LastFailureTime = n.LastFailureTime
}).ToList();
}
}
private static bool IsHealthyAt(NodeEntry entry, DateTime now)
{
return entry.CooldownUntil == null || entry.CooldownUntil <= now;
}
private NodeEntry? FindEntry(string node)
{
for (var i = 0; i < _nodes.Count; i++)
if (string.Equals(_nodes[i].Name, node, StringComparison.OrdinalIgnoreCase))
return _nodes[i];
return null;
}
private sealed class NodeEntry
{
public string Name { get; set; } = "";
public DateTime? CooldownUntil { get; set; }
public int FailureCount { get; set; }
public string? LastError { get; set; }
public DateTime? LastFailureTime { get; set; }
}
}
}


@@ -27,20 +27,155 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
private HistorianAccess? _eventConnection;
private bool _disposed;
// Runtime query health state. Guarded by _healthLock — updated on every read
// method exit (success or failure) so the dashboard can distinguish "plugin
// loaded but never queried" from "plugin loaded and queries are failing".
private readonly object _healthLock = new object();
private long _totalSuccesses;
private long _totalFailures;
private int _consecutiveFailures;
private DateTime? _lastSuccessTime;
private DateTime? _lastFailureTime;
private string? _lastError;
private string? _activeProcessNode;
private string? _activeEventNode;
// Cluster endpoint picker — shared across process + event paths so a node that
// fails on one silo is skipped on the other. Initialized from config at construction.
private readonly HistorianClusterEndpointPicker _picker;
/// <summary>
/// Initializes a Historian reader that translates OPC UA history requests into Wonderware Historian SDK queries.
/// </summary>
/// <param name="config">The Historian SDK connection settings used for runtime history lookups.</param>
public HistorianDataSource(HistorianConfiguration config)
: this(config, new SdkHistorianConnectionFactory()) { }
: this(config, new SdkHistorianConnectionFactory(), null) { }
/// <summary>
/// Initializes a Historian reader with a custom connection factory for testing.
/// Initializes a Historian reader with a custom connection factory for testing. When
/// <paramref name="picker"/> is <see langword="null"/> a new picker is built from
/// <paramref name="config"/>, preserving backward compatibility with existing tests.
/// </summary>
internal HistorianDataSource(HistorianConfiguration config, IHistorianConnectionFactory factory)
internal HistorianDataSource(
HistorianConfiguration config,
IHistorianConnectionFactory factory,
HistorianClusterEndpointPicker? picker = null)
{
_config = config;
_factory = factory;
_picker = picker ?? new HistorianClusterEndpointPicker(config);
}
/// <summary>
/// Iterates the picker's healthy node list, cloning the configuration per attempt and
/// handing it to the factory. Marks each tried node as healthy on success or failed on
/// exception. Returns the winning connection + node name; throws when no nodes succeed.
/// </summary>
private (HistorianAccess Connection, string Node) ConnectToAnyHealthyNode(HistorianConnectionType type)
{
var candidates = _picker.GetHealthyNodes();
if (candidates.Count == 0)
{
var total = _picker.NodeCount;
throw new InvalidOperationException(
total == 0
? "No historian nodes configured"
: $"All {total} historian nodes are in cooldown — no healthy endpoints to connect to");
}
Exception? lastException = null;
foreach (var node in candidates)
{
var attemptConfig = CloneConfigWithServerName(node);
try
{
var conn = _factory.CreateAndConnect(attemptConfig, type);
_picker.MarkHealthy(node);
return (conn, node);
}
catch (Exception ex)
{
_picker.MarkFailed(node, ex.Message);
lastException = ex;
Log.Warning(ex,
"Historian node {Node} failed during connect attempt; trying next candidate", node);
}
}
var inner = lastException?.Message ?? "(no detail)";
throw new InvalidOperationException(
$"All {candidates.Count} healthy historian candidate(s) failed during connect: {inner}",
lastException);
}
private HistorianConfiguration CloneConfigWithServerName(string serverName)
{
return new HistorianConfiguration
{
Enabled = _config.Enabled,
ServerName = serverName,
ServerNames = _config.ServerNames,
FailureCooldownSeconds = _config.FailureCooldownSeconds,
IntegratedSecurity = _config.IntegratedSecurity,
UserName = _config.UserName,
Password = _config.Password,
Port = _config.Port,
CommandTimeoutSeconds = _config.CommandTimeoutSeconds,
MaxValuesPerRead = _config.MaxValuesPerRead
};
}
/// <inheritdoc />
public HistorianHealthSnapshot GetHealthSnapshot()
{
var nodeStates = _picker.SnapshotNodeStates();
var healthyCount = 0;
foreach (var n in nodeStates)
if (n.IsHealthy)
healthyCount++;
lock (_healthLock)
{
return new HistorianHealthSnapshot
{
TotalQueries = _totalSuccesses + _totalFailures,
TotalSuccesses = _totalSuccesses,
TotalFailures = _totalFailures,
ConsecutiveFailures = _consecutiveFailures,
LastSuccessTime = _lastSuccessTime,
LastFailureTime = _lastFailureTime,
LastError = _lastError,
ProcessConnectionOpen = Volatile.Read(ref _connection) != null,
EventConnectionOpen = Volatile.Read(ref _eventConnection) != null,
ActiveProcessNode = _activeProcessNode,
ActiveEventNode = _activeEventNode,
NodeCount = nodeStates.Count,
HealthyNodeCount = healthyCount,
Nodes = nodeStates
};
}
}
private void RecordSuccess()
{
lock (_healthLock)
{
_totalSuccesses++;
_lastSuccessTime = DateTime.UtcNow;
_consecutiveFailures = 0;
_lastError = null;
}
}
private void RecordFailure(string error)
{
lock (_healthLock)
{
_totalFailures++;
_lastFailureTime = DateTime.UtcNow;
_consecutiveFailures++;
_lastError = error;
}
}
private void EnsureConnected()
@@ -53,8 +188,9 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
return;
// Create and wait for connection outside the lock so concurrent history
// requests are not serialized behind a slow Historian handshake.
var conn = _factory.CreateAndConnect(_config, HistorianConnectionType.Process);
// requests are not serialized behind a slow Historian handshake. The cluster
// picker iterates configured nodes and returns the first that successfully connects.
var (conn, winningNode) = ConnectToAnyHealthyNode(HistorianConnectionType.Process);
lock (_connectionLock)
{
@@ -74,7 +210,9 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
}
_connection = conn;
Log.Information("Historian SDK connection opened to {Server}:{Port}", _config.ServerName, _config.Port);
lock (_healthLock)
_activeProcessNode = winningNode;
Log.Information("Historian SDK connection opened to {Server}:{Port}", winningNode, _config.Port);
}
}
@@ -96,7 +234,17 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
}
_connection = null;
Log.Warning(ex, "Historian SDK connection reset — will reconnect on next request");
string? failedNode;
lock (_healthLock)
{
failedNode = _activeProcessNode;
_activeProcessNode = null;
}
if (failedNode != null)
_picker.MarkFailed(failedNode, ex?.Message ?? "mid-query failure");
Log.Warning(ex, "Historian SDK connection reset (node={Node}) — will reconnect on next request",
failedNode ?? "(unknown)");
}
}
@@ -108,7 +256,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
if (Volatile.Read(ref _eventConnection) != null)
return;
var conn = _factory.CreateAndConnect(_config, HistorianConnectionType.Event);
var (conn, winningNode) = ConnectToAnyHealthyNode(HistorianConnectionType.Event);
lock (_eventConnectionLock)
{
@@ -127,8 +275,10 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
}
_eventConnection = conn;
lock (_healthLock)
_activeEventNode = winningNode;
Log.Information("Historian SDK event connection opened to {Server}:{Port}",
_config.ServerName, _config.Port);
winningNode, _config.Port);
}
}
@@ -150,7 +300,17 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
}
_eventConnection = null;
Log.Warning(ex, "Historian SDK event connection reset — will reconnect on next request");
string? failedNode;
lock (_healthLock)
{
failedNode = _activeEventNode;
_activeEventNode = null;
}
if (failedNode != null)
_picker.MarkFailed(failedNode, ex?.Message ?? "mid-query failure");
Log.Warning(ex, "Historian SDK event connection reset (node={Node}) — will reconnect on next request",
failedNode ?? "(unknown)");
}
}
@@ -183,6 +343,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
if (!query.StartQuery(args, out var error))
{
Log.Warning("Historian SDK raw query start failed for {Tag}: {Error}", tagName, error.ErrorCode);
RecordFailure($"raw StartQuery: {error.ErrorCode}");
HandleConnectionError();
return Task.FromResult(results);
}
@@ -219,6 +380,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
}
query.EndQuery(out _);
RecordSuccess();
}
catch (OperationCanceledException)
{
@@ -231,6 +393,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
catch (Exception ex)
{
Log.Warning(ex, "HistoryRead raw failed for {Tag}", tagName);
RecordFailure($"raw: {ex.Message}");
HandleConnectionError(ex);
}
@@ -265,6 +428,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
{
Log.Warning("Historian SDK aggregate query start failed for {Tag}: {Error}", tagName,
error.ErrorCode);
RecordFailure($"aggregate StartQuery: {error.ErrorCode}");
HandleConnectionError();
return Task.FromResult(results);
}
@@ -287,6 +451,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
}
query.EndQuery(out _);
RecordSuccess();
}
catch (OperationCanceledException)
{
@@ -299,6 +464,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
catch (Exception ex)
{
Log.Warning(ex, "HistoryRead aggregate failed for {Tag}", tagName);
RecordFailure($"aggregate: {ex.Message}");
HandleConnectionError(ex);
}
@@ -380,6 +546,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
query.EndQuery(out _);
}
RecordSuccess();
}
catch (OperationCanceledException)
{
@@ -392,6 +559,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
catch (Exception ex)
{
Log.Warning(ex, "HistoryRead at-time failed for {Tag}", tagName);
RecordFailure($"at-time: {ex.Message}");
HandleConnectionError(ex);
}
@@ -430,6 +598,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
if (!query.StartQuery(args, out var error))
{
Log.Warning("Historian SDK event query start failed: {Error}", error.ErrorCode);
RecordFailure($"events StartQuery: {error.ErrorCode}");
HandleEventConnectionError();
return Task.FromResult(results);
}
@@ -445,6 +614,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
}
query.EndQuery(out _);
RecordSuccess();
}
catch (OperationCanceledException)
{
@@ -457,6 +627,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva
catch (Exception ex)
{
Log.Warning(ex, "HistoryRead events failed for source {Source}", sourceName ?? "(all)");
RecordFailure($"events: {ex.Message}");
HandleEventConnectionError(ex);
}

View File

@@ -1,4 +1,5 @@
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;
using Opc.Ua;
@@ -127,20 +128,39 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Configuration
Log.Warning("Only the 'None' security profile is configured — transport security is disabled");
// Historian
Log.Information("Historian.Enabled={Enabled}, ServerName={ServerName}, IntegratedSecurity={IntegratedSecurity}, Port={Port}",
config.Historian.Enabled, config.Historian.ServerName, config.Historian.IntegratedSecurity,
var clusterNodes = config.Historian.ServerNames ?? new List<string>();
var effectiveNodes = clusterNodes.Count > 0
? string.Join(",", clusterNodes)
: config.Historian.ServerName;
Log.Information(
"Historian.Enabled={Enabled}, Nodes=[{Nodes}], IntegratedSecurity={IntegratedSecurity}, Port={Port}",
config.Historian.Enabled, effectiveNodes, config.Historian.IntegratedSecurity,
config.Historian.Port);
Log.Information("Historian.CommandTimeoutSeconds={Timeout}, MaxValuesPerRead={MaxValues}",
config.Historian.CommandTimeoutSeconds, config.Historian.MaxValuesPerRead);
Log.Information(
"Historian.CommandTimeoutSeconds={Timeout}, MaxValuesPerRead={MaxValues}, FailureCooldownSeconds={Cooldown}",
config.Historian.CommandTimeoutSeconds, config.Historian.MaxValuesPerRead,
config.Historian.FailureCooldownSeconds);
if (config.Historian.Enabled)
{
if (string.IsNullOrWhiteSpace(config.Historian.ServerName))
if (clusterNodes.Count == 0 && string.IsNullOrWhiteSpace(config.Historian.ServerName))
{
Log.Error("Historian.ServerName must not be empty when Historian is enabled");
Log.Error("Historian.ServerName (or ServerNames) must not be empty when Historian is enabled");
valid = false;
}
if (config.Historian.FailureCooldownSeconds < 0)
{
Log.Error("Historian.FailureCooldownSeconds must be zero or positive");
valid = false;
}
if (clusterNodes.Count > 0 && !string.IsNullOrWhiteSpace(config.Historian.ServerName)
&& config.Historian.ServerName != "localhost")
Log.Warning(
"Historian.ServerName='{ServerName}' is ignored because Historian.ServerNames has {Count} entries",
config.Historian.ServerName, clusterNodes.Count);
if (config.Historian.Port < 1 || config.Historian.Port > 65535)
{
Log.Error("Historian.Port must be between 1 and 65535");

View File

@@ -1,3 +1,5 @@
using System.Collections.Generic;
namespace ZB.MOM.WW.LmxOpcUa.Host.Configuration
{
/// <summary>
@@ -11,10 +13,25 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Configuration
public bool Enabled { get; set; } = false;
/// <summary>
/// Gets or sets the Historian server hostname.
/// Gets or sets the single Historian server hostname used when <see cref="ServerNames"/>
/// is empty. Preserved for backward compatibility with pre-cluster deployments.
/// </summary>
public string ServerName { get; set; } = "localhost";
/// <summary>
/// Gets or sets the ordered list of Historian cluster nodes. When non-empty, this list
/// supersedes <see cref="ServerName"/>: the data source attempts each node in order on
/// connect, falling through to the next on failure. A failed node is placed in cooldown
/// for <see cref="FailureCooldownSeconds"/> before being re-eligible.
/// </summary>
public List<string> ServerNames { get; set; } = new();
/// <summary>
/// Gets or sets the cooldown window, in seconds, that a historian node is skipped after
/// a connection failure. A value of zero retries the node on every request. Default 60s.
/// </summary>
public int FailureCooldownSeconds { get; set; } = 60;
/// <summary>
/// Gets or sets a value indicating whether Windows Integrated Security is used.
/// When false, <see cref="UserName"/> and <see cref="Password"/> are used instead.

View File

@@ -0,0 +1,49 @@
using System;
namespace ZB.MOM.WW.LmxOpcUa.Host.Historian
{
/// <summary>
/// Point-in-time state of a single historian cluster node. One entry per configured node is
/// surfaced inside <see cref="HistorianHealthSnapshot"/> so the status dashboard can render
/// per-node health and operators can see which nodes are in cooldown.
/// </summary>
public sealed class HistorianClusterNodeState
{
/// <summary>
/// Gets or sets the configured node hostname exactly as it appears in
/// <c>HistorianConfiguration.ServerNames</c>.
/// </summary>
public string Name { get; set; } = "";
/// <summary>
/// Gets or sets a value indicating whether the node is currently eligible for new connection
/// attempts. <see langword="false"/> means the node is in its post-failure cooldown window
/// and the picker is skipping it.
/// </summary>
public bool IsHealthy { get; set; }
/// <summary>
/// Gets or sets the UTC timestamp at which the node's cooldown expires, or
/// <see langword="null"/> when the node is not in cooldown.
/// </summary>
public DateTime? CooldownUntil { get; set; }
/// <summary>
/// Gets or sets the number of times this node has transitioned from healthy to failed
/// since startup. Does not decrement on recovery.
/// </summary>
public int FailureCount { get; set; }
/// <summary>
/// Gets or sets the message from the most recent failure, or <see langword="null"/> when
/// the node has never failed.
/// </summary>
public string? LastError { get; set; }
/// <summary>
/// Gets or sets the UTC timestamp of the most recent failure, or <see langword="null"/>
/// when the node has never failed.
/// </summary>
public DateTime? LastFailureTime { get; set; }
}
}

View File

@@ -0,0 +1,97 @@
using System;
using System.Collections.Generic;
namespace ZB.MOM.WW.LmxOpcUa.Host.Historian
{
/// <summary>
/// Point-in-time runtime health of the historian plugin, surfaced to the status dashboard
/// and health check service. Fills the gap between the load-time plugin status
/// (<see cref="HistorianPluginLoader.LastOutcome"/>) and actual query behavior so operators
/// can detect silent query degradation.
/// </summary>
public sealed class HistorianHealthSnapshot
{
/// <summary>
/// Gets or sets the total number of historian read operations attempted since startup
/// across all read paths (raw, aggregate, at-time, events).
/// </summary>
public long TotalQueries { get; set; }
/// <summary>
/// Gets or sets the total number of read operations that completed without an exception
/// being caught by the plugin's error handler. Includes empty result sets as successes —
/// the counter reflects "the SDK call returned" not "the SDK call returned data".
/// </summary>
public long TotalSuccesses { get; set; }
/// <summary>
/// Gets or sets the total number of read operations that raised an exception. Each failure
/// also resets and closes the underlying SDK connection via the existing reconnect path.
/// </summary>
public long TotalFailures { get; set; }
/// <summary>
/// Gets or sets the number of consecutive failures since the last success. Increments on
/// every failure and resets to zero on the next successful query. The health check service uses this as a degradation signal.
/// </summary>
public int ConsecutiveFailures { get; set; }
/// <summary>
/// Gets or sets the UTC timestamp of the last successful read, or <see langword="null"/>
/// when no query has succeeded since startup.
/// </summary>
public DateTime? LastSuccessTime { get; set; }
/// <summary>
/// Gets or sets the UTC timestamp of the last failure, or <see langword="null"/> when no
/// query has failed since startup.
/// </summary>
public DateTime? LastFailureTime { get; set; }
/// <summary>
/// Gets or sets the exception message from the most recent failure. Cleared on the next
/// successful query.
/// </summary>
public string? LastError { get; set; }
/// <summary>
/// Gets or sets a value indicating whether the plugin currently holds an open SDK
/// connection for the process (historical values) path.
/// </summary>
public bool ProcessConnectionOpen { get; set; }
/// <summary>
/// Gets or sets a value indicating whether the plugin currently holds an open SDK
/// connection for the event (alarm history) path.
/// </summary>
public bool EventConnectionOpen { get; set; }
/// <summary>
/// Gets or sets the node the plugin is currently connected to for the process path,
/// or <see langword="null"/> when no connection is open.
/// </summary>
public string? ActiveProcessNode { get; set; }
/// <summary>
/// Gets or sets the node the plugin is currently connected to for the event path,
/// or <see langword="null"/> when no event connection is open.
/// </summary>
public string? ActiveEventNode { get; set; }
/// <summary>
/// Gets or sets the total number of configured historian cluster nodes. A value of 1
/// indicates a single-node deployment (a legacy <c>ServerName</c> or a one-entry <c>ServerNames</c> list).
/// </summary>
public int NodeCount { get; set; }
/// <summary>
/// Gets or sets the number of configured nodes that are currently healthy (not in cooldown).
/// </summary>
public int HealthyNodeCount { get; set; }
/// <summary>
/// Gets or sets the per-node cluster state in configuration order.
/// </summary>
public List<HistorianClusterNodeState> Nodes { get; set; } = new();
}
}

View File

@@ -29,5 +29,12 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Historian
Task<List<HistorianEventDto>> ReadEventsAsync(
string? sourceName, DateTime startTime, DateTime endTime, int maxEvents,
CancellationToken ct = default);
/// <summary>
/// Returns a runtime snapshot of query success/failure counters and connection state.
/// Consumed by the status dashboard and health check service so operators can detect
/// silent query degradation that the load-time plugin status can't catch.
/// </summary>
HistorianHealthSnapshot GetHealthSnapshot();
}
}

View File

@@ -190,6 +190,13 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
public IReadOnlyList<string> AlarmFilterPatterns =>
_alarmObjectFilter?.RawPatterns ?? Array.Empty<string>();
/// <summary>
/// Gets the runtime historian health snapshot, or <see langword="null"/> when the historian
/// plugin is not loaded. Surfaced on the status dashboard so operators can detect query
/// failures that the load-time plugin status cannot catch.
/// </summary>
public HistorianHealthSnapshot? HistorianHealth => _historianDataSource?.GetHealthSnapshot();
/// <summary>
/// Gets the number of distinct alarm conditions currently tracked (one per alarm attribute).
/// </summary>

View File

@@ -42,6 +42,33 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Status
Color = "yellow"
};
// Rule 2b2: Historian plugin loaded but queries are failing consecutively → Degraded.
// A threshold of 3 avoids flagging a single transient blip; reaching it means the SDK
// is in a broken state that the reconnect loop isn't recovering from.
if (historian != null && historian.Enabled && historian.PluginStatus == "Loaded"
&& historian.ConsecutiveFailures >= 3)
return new HealthInfo
{
Status = "Degraded",
Message =
$"Historian plugin has {historian.ConsecutiveFailures} consecutive query failures: " +
$"{historian.LastQueryError ?? "(no error)"}",
Color = "yellow"
};
// Rule 2b3: Historian cluster has nodes in cooldown → Degraded (partial cluster).
// Only surfaces when the operator actually configured a multi-node cluster.
if (historian != null && historian.Enabled && historian.PluginStatus == "Loaded"
&& historian.NodeCount > 1 && historian.HealthyNodeCount < historian.NodeCount)
return new HealthInfo
{
Status = "Degraded",
Message =
$"Historian cluster has {historian.HealthyNodeCount} of {historian.NodeCount} " +
"nodes healthy — one or more nodes are in failure cooldown",
Color = "yellow"
};
// Rule 2 / 2c: Success rate too low for any recorded operation
if (metrics != null)
{

View File

@@ -257,6 +257,81 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Status
/// Gets or sets the configured historian TCP port.
/// </summary>
public int Port { get; set; }
/// <summary>
/// Gets or sets the total number of historian read queries attempted since startup.
/// </summary>
public long QueryTotal { get; set; }
/// <summary>
/// Gets or sets the number of historian queries that completed without an exception.
/// </summary>
public long QuerySuccesses { get; set; }
/// <summary>
/// Gets or sets the number of historian queries that raised an exception.
/// </summary>
public long QueryFailures { get; set; }
/// <summary>
/// Gets or sets the number of consecutive failures since the last successful query.
/// </summary>
public int ConsecutiveFailures { get; set; }
/// <summary>
/// Gets or sets the UTC timestamp of the last successful query.
/// </summary>
public DateTime? LastSuccessTime { get; set; }
/// <summary>
/// Gets or sets the UTC timestamp of the last query failure.
/// </summary>
public DateTime? LastFailureTime { get; set; }
/// <summary>
/// Gets or sets the exception message from the most recent failure.
/// </summary>
public string? LastQueryError { get; set; }
/// <summary>
/// Gets or sets a value indicating whether the plugin currently holds an open process-path
/// SDK connection.
/// </summary>
public bool ProcessConnectionOpen { get; set; }
/// <summary>
/// Gets or sets a value indicating whether the plugin currently holds an open event-path
/// SDK connection.
/// </summary>
public bool EventConnectionOpen { get; set; }
/// <summary>
/// Gets or sets the total number of configured historian cluster nodes.
/// </summary>
public int NodeCount { get; set; }
/// <summary>
/// Gets or sets the number of cluster nodes currently eligible for new connections
/// (i.e., not in failure cooldown).
/// </summary>
public int HealthyNodeCount { get; set; }
/// <summary>
/// Gets or sets the node currently serving process (historical value) queries, or null
/// when no process connection is open.
/// </summary>
public string? ActiveProcessNode { get; set; }
/// <summary>
/// Gets or sets the node currently serving event (alarm history) queries, or null when
/// no event connection is open.
/// </summary>
public string? ActiveEventNode { get; set; }
/// <summary>
/// Gets or sets the per-node cluster state in configuration order.
/// </summary>
public List<Historian.HistorianClusterNodeState> Nodes { get; set; } = new();
}
/// <summary>

View File

@@ -125,6 +125,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Status
private HistorianStatusInfo BuildHistorianStatusInfo()
{
var outcome = HistorianPluginLoader.LastOutcome;
var health = _nodeManager?.HistorianHealth;
return new HistorianStatusInfo
{
Enabled = _historianConfig?.Enabled ?? false,
@@ -132,7 +133,21 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Status
PluginError = outcome.Error,
PluginPath = outcome.PluginPath,
ServerName = _historianConfig?.ServerName ?? "",
Port = _historianConfig?.Port ?? 0
Port = _historianConfig?.Port ?? 0,
QueryTotal = health?.TotalQueries ?? 0,
QuerySuccesses = health?.TotalSuccesses ?? 0,
QueryFailures = health?.TotalFailures ?? 0,
ConsecutiveFailures = health?.ConsecutiveFailures ?? 0,
LastSuccessTime = health?.LastSuccessTime,
LastFailureTime = health?.LastFailureTime,
LastQueryError = health?.LastError,
ProcessConnectionOpen = health?.ProcessConnectionOpen ?? false,
EventConnectionOpen = health?.EventConnectionOpen ?? false,
NodeCount = health?.NodeCount ?? 0,
HealthyNodeCount = health?.HealthyNodeCount ?? 0,
ActiveProcessNode = health?.ActiveProcessNode,
ActiveEventNode = health?.ActiveEventNode,
Nodes = health?.Nodes ?? new List<Historian.HistorianClusterNodeState>()
};
}
@@ -304,13 +319,66 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Status
sb.AppendLine("</div>");
// Historian panel
var histColor = data.Historian.PluginStatus == "Loaded" ? "green"
: !data.Historian.Enabled ? "gray" : "red";
var anyClusterNodeFailed =
data.Historian.NodeCount > 0 && data.Historian.HealthyNodeCount < data.Historian.NodeCount;
var allClusterNodesFailed =
data.Historian.NodeCount > 0 && data.Historian.HealthyNodeCount == 0;
var histColor = !data.Historian.Enabled ? "gray"
: data.Historian.PluginStatus != "Loaded" ? "red"
: allClusterNodesFailed ? "red"
: data.Historian.ConsecutiveFailures >= 5 ? "red"
: anyClusterNodeFailed || data.Historian.ConsecutiveFailures > 0 ? "yellow"
: "green";
sb.AppendLine($"<div class='panel {histColor}'><h2>Historian</h2>");
sb.AppendLine(
$"<p>Enabled: <b>{data.Historian.Enabled}</b> | Plugin: <b>{data.Historian.PluginStatus}</b> | Server: {WebUtility.HtmlEncode(data.Historian.ServerName)}:{data.Historian.Port}</p>");
$"<p>Enabled: <b>{data.Historian.Enabled}</b> | Plugin: <b>{data.Historian.PluginStatus}</b> | Port: {data.Historian.Port}</p>");
if (!string.IsNullOrEmpty(data.Historian.PluginError))
sb.AppendLine($"<p>Error: {WebUtility.HtmlEncode(data.Historian.PluginError)}</p>");
sb.AppendLine($"<p>Plugin Error: {WebUtility.HtmlEncode(data.Historian.PluginError)}</p>");
if (data.Historian.PluginStatus == "Loaded")
{
sb.AppendLine(
$"<p>Queries: <b>{data.Historian.QueryTotal:N0}</b> " +
$"(Success: {data.Historian.QuerySuccesses:N0}, Failure: {data.Historian.QueryFailures:N0}) " +
$"| Consecutive Failures: <b>{data.Historian.ConsecutiveFailures}</b></p>");
var procBadge = data.Historian.ProcessConnectionOpen
? $"open ({WebUtility.HtmlEncode(data.Historian.ActiveProcessNode ?? "?")})"
: "closed";
var evtBadge = data.Historian.EventConnectionOpen
? $"open ({WebUtility.HtmlEncode(data.Historian.ActiveEventNode ?? "?")})"
: "closed";
sb.AppendLine(
$"<p>Process Conn: <b>{procBadge}</b> | Event Conn: <b>{evtBadge}</b></p>");
if (data.Historian.LastSuccessTime.HasValue)
sb.AppendLine($"<p>Last Success: {data.Historian.LastSuccessTime:O}</p>");
if (data.Historian.LastFailureTime.HasValue)
sb.AppendLine($"<p>Last Failure: {data.Historian.LastFailureTime:O}</p>");
if (!string.IsNullOrEmpty(data.Historian.LastQueryError))
sb.AppendLine(
$"<p>Last Error: <code>{WebUtility.HtmlEncode(data.Historian.LastQueryError)}</code></p>");
// Cluster table: only when a true multi-node cluster is configured.
if (data.Historian.NodeCount > 1)
{
sb.AppendLine(
$"<p><b>Cluster:</b> {data.Historian.HealthyNodeCount} of {data.Historian.NodeCount} nodes healthy</p>");
sb.AppendLine(
"<table><tr><th>Node</th><th>State</th><th>Cooldown Until</th><th>Failures</th><th>Last Error</th></tr>");
foreach (var node in data.Historian.Nodes)
{
var state = node.IsHealthy ? "healthy" : "cooldown";
var cooldown = node.CooldownUntil?.ToString("O") ?? "-";
var lastErr = WebUtility.HtmlEncode(node.LastError ?? "");
sb.AppendLine(
$"<tr><td>{WebUtility.HtmlEncode(node.Name)}</td><td>{state}</td>" +
$"<td>{cooldown}</td><td>{node.FailureCount}</td><td><code>{lastErr}</code></td></tr>");
}
sb.AppendLine("</table>");
}
else if (data.Historian.NodeCount == 1)
{
sb.AppendLine($"<p>Node: {WebUtility.HtmlEncode(data.Historian.Nodes[0].Name)}</p>");
}
}
sb.AppendLine("</div>");
// Alarms panel

View File

@@ -75,6 +75,8 @@
"Historian": {
"Enabled": false,
"ServerName": "localhost",
"ServerNames": [],
"FailureCooldownSeconds": 60,
"IntegratedSecurity": true,
"UserName": null,
"Password": null,

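The defaults above leave clustering off. As a hedged sketch of a cluster-enabled `Historian` section (the hostnames `hist-a`/`hist-b`/`hist-c` are placeholders, not names from this repository):

```json
{
  "Historian": {
    "Enabled": true,
    "ServerName": "localhost",
    "ServerNames": [ "hist-a", "hist-b", "hist-c" ],
    "FailureCooldownSeconds": 60,
    "IntegratedSecurity": true,
    "UserName": null,
    "Password": null
  }
}
```

With this configuration the picker tries `hist-a` first and falls through in order on failure; a failed node is skipped for 60 seconds before becoming eligible again, and `ServerName` is ignored because `ServerNames` is non-empty.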
View File

@@ -1,4 +1,5 @@
using System;
using System.Collections.Generic;
using ArchestrA;
using ZB.MOM.WW.LmxOpcUa.Historian.Aveva;
using ZB.MOM.WW.LmxOpcUa.Host.Configuration;
@@ -11,15 +12,38 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva.Tests
/// </summary>
internal sealed class FakeHistorianConnectionFactory : IHistorianConnectionFactory
{
/// <summary>
/// Exception thrown on every CreateAndConnect call unless a more specific rule in
/// <see cref="ServerBehaviors"/> or <see cref="OnConnect"/> fires first.
/// </summary>
public Exception? ConnectException { get; set; }
public int ConnectCallCount { get; private set; }
public Action<int>? OnConnect { get; set; }
/// <summary>
/// Per-server-name override: if the requested <c>config.ServerName</c> has an entry
/// whose value is non-null, that exception is thrown instead of the global
/// <see cref="ConnectException"/>. Lets tests script cluster failover behavior like
/// "node A always fails; node B always succeeds".
/// </summary>
public Dictionary<string, Exception?> ServerBehaviors { get; } =
new Dictionary<string, Exception?>(StringComparer.OrdinalIgnoreCase);
/// <summary>
/// Ordered history of server names passed to CreateAndConnect so tests can assert the
/// picker's iteration order and failover sequence.
/// </summary>
public List<string> ConnectHistory { get; } = new List<string>();
public HistorianAccess CreateAndConnect(HistorianConfiguration config, HistorianConnectionType type)
{
ConnectCallCount++;
ConnectHistory.Add(config.ServerName);
if (ServerBehaviors.TryGetValue(config.ServerName, out var serverException) && serverException != null)
throw serverException;
if (OnConnect != null)
{

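As a sketch of how a test might script failover with this fake (assuming the internal `HistorianDataSource(config, factory, picker)` constructor shown earlier in this commit; this snippet is illustrative, not a test from the commit):

```csharp
var config = new HistorianConfiguration
{
    Enabled = true,
    ServerNames = new List<string> { "node-a", "node-b" }
};
var factory = new FakeHistorianConnectionFactory();
// node-a always refuses; node-b has no entry, so CreateAndConnect succeeds there.
factory.ServerBehaviors["node-a"] = new InvalidOperationException("connection refused");

var source = new HistorianDataSource(config, factory);
// A first history read drives the picker: it should try node-a, mark it failed,
// and win on node-b — so factory.ConnectHistory should record ["node-a", "node-b"]
// and node-a should appear in cooldown in source.GetHealthSnapshot().Nodes.
```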
View File

@@ -0,0 +1,291 @@
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Shouldly;
using Xunit;
using ZB.MOM.WW.LmxOpcUa.Host.Configuration;
namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva.Tests
{
/// <summary>
/// Exhaustive coverage of the cluster endpoint picker: config parsing, healthy-list ordering,
/// cooldown behavior with an injected clock, and thread-safety under concurrent writers.
/// </summary>
public class HistorianClusterEndpointPickerTests
{
// ---------- Construction / config parsing ----------
[Fact]
public void SingleServerName_FallbackWhenServerNamesEmpty()
{
var picker = new HistorianClusterEndpointPicker(Config(serverName: "host-a"));
picker.NodeCount.ShouldBe(1);
picker.GetHealthyNodes().ShouldBe(new[] { "host-a" });
}
[Fact]
public void ServerNames_TakesPrecedenceOverLegacyServerName()
{
var picker = new HistorianClusterEndpointPicker(
Config(serverName: "legacy", serverNames: new[] { "host-a", "host-b" }));
picker.NodeCount.ShouldBe(2);
picker.GetHealthyNodes().ShouldBe(new[] { "host-a", "host-b" });
}
[Fact]
public void ServerNames_OrderedAsConfigured()
{
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "c", "a", "b" }));
picker.GetHealthyNodes().ShouldBe(new[] { "c", "a", "b" });
}
[Fact]
public void ServerNames_WhitespaceTrimmedAndEmptyDropped()
{
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { " host-a ", "", " ", "host-b" }));
picker.GetHealthyNodes().ShouldBe(new[] { "host-a", "host-b" });
}
[Fact]
public void ServerNames_CaseInsensitiveDeduplication()
{
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "Host-A", "HOST-A", "host-a" }));
picker.NodeCount.ShouldBe(1);
}
[Fact]
public void EmptyConfig_ProducesEmptyPool()
{
var picker = new HistorianClusterEndpointPicker(
Config(serverName: "", serverNames: Array.Empty<string>()));
picker.NodeCount.ShouldBe(0);
picker.GetHealthyNodes().ShouldBeEmpty();
}
// ---------- MarkFailed / cooldown window ----------
[Fact]
public void MarkFailed_RemovesNodeFromHealthyList()
{
var clock = new FakeClock();
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "a", "b" }, cooldownSeconds: 60), clock.Now);
picker.MarkFailed("a", "boom");
picker.GetHealthyNodes().ShouldBe(new[] { "b" });
picker.HealthyNodeCount.ShouldBe(1);
}
[Fact]
public void MarkFailed_RecordsErrorAndTimestamp()
{
var clock = new FakeClock { UtcNow = new DateTime(2026, 4, 13, 10, 0, 0, DateTimeKind.Utc) };
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "a", "b" }), clock.Now);
picker.MarkFailed("a", "connection refused");
var states = picker.SnapshotNodeStates();
var a = states.First(s => s.Name == "a");
a.IsHealthy.ShouldBeFalse();
a.FailureCount.ShouldBe(1);
a.LastError.ShouldBe("connection refused");
a.LastFailureTime.ShouldBe(clock.UtcNow);
}
[Fact]
public void MarkFailed_CooldownExpiryRestoresNode()
{
var clock = new FakeClock { UtcNow = new DateTime(2026, 4, 13, 10, 0, 0, DateTimeKind.Utc) };
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "a", "b" }, cooldownSeconds: 60), clock.Now);
picker.MarkFailed("a", "boom");
picker.GetHealthyNodes().ShouldBe(new[] { "b" });
// Advance clock just before expiry — still in cooldown
clock.UtcNow = clock.UtcNow.AddSeconds(59);
picker.GetHealthyNodes().ShouldBe(new[] { "b" });
// Advance past cooldown — node returns to pool
clock.UtcNow = clock.UtcNow.AddSeconds(2);
picker.GetHealthyNodes().ShouldBe(new[] { "a", "b" });
}
[Fact]
public void ZeroCooldown_NeverBenchesNode()
{
var clock = new FakeClock();
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "a", "b" }, cooldownSeconds: 0), clock.Now);
picker.MarkFailed("a", "boom");
// Zero cooldown → node remains eligible immediately
picker.GetHealthyNodes().ShouldBe(new[] { "a", "b" });
var state = picker.SnapshotNodeStates().First(s => s.Name == "a");
state.FailureCount.ShouldBe(1);
state.LastError.ShouldBe("boom");
}
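// The cooldown behavior pinned down by the two tests above reduces to this
// eligibility predicate (pseudocode sketch; any conforming implementation
// must satisfy it, whatever its internals look like):
//
//   eligible(node, now) =
//       node.IsHealthy
//       OR cooldownSeconds == 0
//       OR now >= node.LastFailureTime + cooldownSeconds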
[Fact]
public void AllNodesFailed_HealthyListIsEmpty()
{
var clock = new FakeClock();
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "a", "b" }, cooldownSeconds: 60), clock.Now);
picker.MarkFailed("a", "boom");
picker.MarkFailed("b", "boom");
picker.GetHealthyNodes().ShouldBeEmpty();
picker.HealthyNodeCount.ShouldBe(0);
}
[Fact]
public void MarkFailed_AccumulatesFailureCount()
{
var clock = new FakeClock();
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "a" }, cooldownSeconds: 10), clock.Now);
picker.MarkFailed("a", "error 1");
clock.UtcNow = clock.UtcNow.AddSeconds(20); // past the 10s cooldown, node recovers

picker.MarkFailed("a", "error 2");
picker.SnapshotNodeStates().First().FailureCount.ShouldBe(2);
picker.SnapshotNodeStates().First().LastError.ShouldBe("error 2");
}
// ---------- MarkHealthy ----------
[Fact]
public void MarkHealthy_ClearsCooldownImmediately()
{
var clock = new FakeClock();
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "a", "b" }, cooldownSeconds: 3600), clock.Now);
picker.MarkFailed("a", "boom");
picker.GetHealthyNodes().ShouldBe(new[] { "b" });
picker.MarkHealthy("a");
picker.GetHealthyNodes().ShouldBe(new[] { "a", "b" });
}
[Fact]
public void MarkHealthy_PreservesCumulativeFailureCount()
{
var clock = new FakeClock();
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "a" }), clock.Now);
picker.MarkFailed("a", "boom");
picker.MarkHealthy("a");
var state = picker.SnapshotNodeStates().First();
state.IsHealthy.ShouldBeTrue();
state.FailureCount.ShouldBe(1); // history preserved
}
// ---------- Unknown node handling ----------
[Fact]
public void MarkFailed_UnknownNode_IsIgnored()
{
var clock = new FakeClock();
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "a" }), clock.Now);
Should.NotThrow(() => picker.MarkFailed("not-configured", "boom"));
picker.GetHealthyNodes().ShouldBe(new[] { "a" });
}
[Fact]
public void MarkHealthy_UnknownNode_IsIgnored()
{
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "a" }));
Should.NotThrow(() => picker.MarkHealthy("not-configured"));
}
// ---------- SnapshotNodeStates ----------
[Fact]
public void SnapshotNodeStates_ReflectsConfigurationOrder()
{
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "z", "m", "a" }));
picker.SnapshotNodeStates().Select(s => s.Name).ShouldBe(new[] { "z", "m", "a" });
}
[Fact]
public void SnapshotNodeStates_HealthyEntriesHaveNoCooldown()
{
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "a" }));
var state = picker.SnapshotNodeStates().First();
state.IsHealthy.ShouldBeTrue();
state.CooldownUntil.ShouldBeNull();
state.LastError.ShouldBeNull();
state.LastFailureTime.ShouldBeNull();
}
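// Shape implied by the snapshot assertions in this group (illustrative; the
// real type name and member kinds are assumptions):
//
//   record NodeState(string Name, bool IsHealthy, int FailureCount,
//                    string? LastError, DateTime? LastFailureTime,
//                    DateTime? CooldownUntil);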
// ---------- Thread safety smoke test ----------
[Fact]
public void ConcurrentMarkAndQuery_DoesNotCorrupt()
{
var clock = new FakeClock();
var picker = new HistorianClusterEndpointPicker(
Config(serverNames: new[] { "a", "b", "c", "d" }, cooldownSeconds: 5), clock.Now);
var tasks = new List<Task>();
for (var i = 0; i < 8; i++)
{
tasks.Add(Task.Run(() =>
{
for (var j = 0; j < 1000; j++)
{
picker.MarkFailed("a", "boom");
picker.MarkHealthy("a");
_ = picker.GetHealthyNodes();
_ = picker.SnapshotNodeStates();
}
}));
}
Task.WaitAll(tasks.ToArray());
// Just verify we can still read state after the storm.
picker.NodeCount.ShouldBe(4);
picker.GetHealthyNodes().Count.ShouldBeInRange(3, 4);
}
// ---------- Helpers ----------
private static HistorianConfiguration Config(
string serverName = "localhost",
string[]? serverNames = null,
int cooldownSeconds = 60)
{
return new HistorianConfiguration
{
ServerName = serverName,
ServerNames = (serverNames ?? Array.Empty<string>()).ToList(),
FailureCooldownSeconds = cooldownSeconds
};
}
private sealed class FakeClock
{
public DateTime UtcNow { get; set; } = new DateTime(2026, 1, 1, 0, 0, 0, DateTimeKind.Utc);
public DateTime Now() => UtcNow;
}
}
}


@@ -0,0 +1,166 @@
using System;
using System.Linq;
using Shouldly;
using Xunit;
using ZB.MOM.WW.LmxOpcUa.Host.Configuration;
namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva.Tests
{
/// <summary>
/// End-to-end behavior of the cluster endpoint picker wired into
/// <see cref="HistorianDataSource"/>. Verifies that a failing node is skipped on the next
/// attempt, that the picker state is shared across process + event silos, and that the
/// health snapshot surfaces the winning node.
/// </summary>
public class HistorianClusterFailoverTests
{
private static HistorianConfiguration ClusterConfig(params string[] nodes) => new()
{
Enabled = true,
ServerNames = nodes.ToList(),
Port = 32568,
IntegratedSecurity = true,
CommandTimeoutSeconds = 5,
FailureCooldownSeconds = 60
};
[Fact]
public void Connect_FirstNodeFails_PicksSecond()
{
// host-a fails during connect; host-b connects successfully. The fake returns an
// unconnected HistorianAccess on success, so the query phase will subsequently trip
// HandleConnectionError on host-b — that's expected. The observable signal is that
// the picker tried host-a first, skipped to host-b, and host-a's failure was recorded.
var factory = new FakeHistorianConnectionFactory();
factory.ServerBehaviors["host-a"] = new InvalidOperationException("A down");
var config = ClusterConfig("host-a", "host-b");
using var ds = new HistorianDataSource(config, factory);
ds.ReadRawAsync("Tag1", DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 10)
.GetAwaiter().GetResult();
factory.ConnectHistory.ShouldBe(new[] { "host-a", "host-b" });
var snap = ds.GetHealthSnapshot();
snap.NodeCount.ShouldBe(2);
snap.Nodes.Single(n => n.Name == "host-a").IsHealthy.ShouldBeFalse();
snap.Nodes.Single(n => n.Name == "host-a").FailureCount.ShouldBe(1);
snap.Nodes.Single(n => n.Name == "host-a").LastError.ShouldContain("A down");
}
[Fact]
public void Connect_AllNodesFail_ReturnsEmptyResults_AndAllInCooldown()
{
var factory = new FakeHistorianConnectionFactory();
factory.ServerBehaviors["host-a"] = new InvalidOperationException("A down");
factory.ServerBehaviors["host-b"] = new InvalidOperationException("B down");
var config = ClusterConfig("host-a", "host-b");
using var ds = new HistorianDataSource(config, factory);
var results = ds.ReadRawAsync("Tag1", DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 10)
.GetAwaiter().GetResult();
results.Count.ShouldBe(0);
factory.ConnectHistory.ShouldBe(new[] { "host-a", "host-b" });
var snap = ds.GetHealthSnapshot();
snap.ActiveProcessNode.ShouldBeNull();
snap.HealthyNodeCount.ShouldBe(0);
snap.TotalFailures.ShouldBe(1); // one read call failed (after all cluster tries)
snap.LastError.ShouldContain("All 2 healthy historian candidate(s) failed");
snap.LastError.ShouldContain("B down"); // last inner exception preserved
}
[Fact]
public void Connect_SecondCall_SkipsCooledDownNode()
{
// After first call: host-a is in cooldown (60s), host-b is also marked failed via
// HandleConnectionError since the fake connection doesn't support real queries.
// Second call: both are in cooldown and the picker returns empty → the read method
// catches the "all nodes failed" exception and returns empty without retrying connect.
// We verify this by checking that the second call adds NOTHING to the connect history.
var factory = new FakeHistorianConnectionFactory();
factory.ServerBehaviors["host-a"] = new InvalidOperationException("A down");
var config = ClusterConfig("host-a", "host-b"); // 60s cooldown
using var ds = new HistorianDataSource(config, factory);
ds.ReadRawAsync("Tag1", DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 10)
.GetAwaiter().GetResult();
factory.ConnectHistory.Clear();
var results = ds.ReadRawAsync("Tag1", DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 10)
.GetAwaiter().GetResult();
// Both nodes are in cooldown → picker returns empty → factory is not called at all.
results.Count.ShouldBe(0);
factory.ConnectHistory.ShouldBeEmpty();
}
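// The failover behavior exercised by the tests above can be read as this
// connect loop (an illustrative sketch, not the actual HistorianDataSource
// source; field and variable names are assumptions):
//
//   Exception? last = null;
//   var candidates = _picker.GetHealthyNodes();
//   foreach (var node in candidates)
//   {
//       try { return _factory.Connect(node); }
//       catch (Exception ex) { _picker.MarkFailed(node, ex.Message); last = ex; }
//   }
//   throw new InvalidOperationException(
//       $"All {candidates.Count} healthy historian candidate(s) failed", last);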
[Fact]
public void Connect_SingleNodeConfig_BehavesLikeLegacy()
{
var factory = new FakeHistorianConnectionFactory();
var config = new HistorianConfiguration
{
Enabled = true,
ServerName = "legacy-host",
Port = 32568,
FailureCooldownSeconds = 0
};
using var ds = new HistorianDataSource(config, factory);
ds.ReadRawAsync("Tag1", DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 10)
.GetAwaiter().GetResult();
factory.ConnectHistory.ShouldBe(new[] { "legacy-host" });
var snap = ds.GetHealthSnapshot();
snap.NodeCount.ShouldBe(1);
snap.Nodes.Single().Name.ShouldBe("legacy-host");
}
[Fact]
public void Connect_PickerOrderRespected()
{
var factory = new FakeHistorianConnectionFactory();
factory.ServerBehaviors["host-a"] = new InvalidOperationException("A down");
factory.ServerBehaviors["host-b"] = new InvalidOperationException("B down");
factory.ServerBehaviors["host-c"] = new InvalidOperationException("C down");
var config = ClusterConfig("host-a", "host-b", "host-c");
using var ds = new HistorianDataSource(config, factory);
ds.ReadRawAsync("Tag1", DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 10)
.GetAwaiter().GetResult();
// Candidates are tried in configuration order.
factory.ConnectHistory.ShouldBe(new[] { "host-a", "host-b", "host-c" });
}
[Fact]
public void Connect_SharedPickerAcrossProcessAndEventSilos()
{
// Process path tries host-a (fails), then host-b; both end up in the 60s
// cooldown (host-b via HandleConnectionError, since the fake connection cannot
// serve a real query). The picker instance is shared across the process and
// event silos, so when the event path asks next it sees no healthy nodes and
// must not retry host-a.
var factory = new FakeHistorianConnectionFactory();
factory.ServerBehaviors["host-a"] = new InvalidOperationException("A down");
var config = ClusterConfig("host-a", "host-b");
using var ds = new HistorianDataSource(config, factory);
// Process path: host-a fails → host-b reached (then torn down mid-query via the fake).
ds.ReadRawAsync("Tag1", DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 10)
.GetAwaiter().GetResult();
// At this point host-a and host-b are both in cooldown. ReadEvents will hit the
// picker's empty-healthy-list path and return empty without calling the factory.
factory.ConnectHistory.Clear();
var events = ds.ReadEventsAsync(null, DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 10)
.GetAwaiter().GetResult();
events.Count.ShouldBe(0);
factory.ConnectHistory.ShouldBeEmpty();
// Key point: host-a was NOT retried by the event silo; it is still in the
// shared cooldown from the process path's failure. (Already implied by the
// ShouldBeEmpty assertion above; kept to document the intent explicitly.)
factory.ConnectHistory.ShouldNotContain("host-a");
}
}
}


@@ -19,7 +19,10 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva.Tests
ServerName = "test-historian",
Port = 32568,
IntegratedSecurity = true,
CommandTimeoutSeconds = 5
CommandTimeoutSeconds = 5,
// Zero cooldown so reconnect-after-error tests can retry through the cluster picker
// on the very next call, matching the pre-cluster behavior they were written against.
FailureCooldownSeconds = 0
};
[Fact]
@@ -174,5 +177,105 @@ namespace ZB.MOM.WW.LmxOpcUa.Historian.Aveva.Tests
// Dispose should handle the null connection gracefully
Should.NotThrow(() => ds.Dispose());
}
// ---------- HistorianHealthSnapshot instrumentation ----------
[Fact]
public void GetHealthSnapshot_FreshDataSource_ReportsZeroCounters()
{
var ds = new HistorianDataSource(DefaultConfig, new FakeHistorianConnectionFactory());
var snap = ds.GetHealthSnapshot();
snap.TotalQueries.ShouldBe(0);
snap.TotalSuccesses.ShouldBe(0);
snap.TotalFailures.ShouldBe(0);
snap.ConsecutiveFailures.ShouldBe(0);
snap.LastSuccessTime.ShouldBeNull();
snap.LastFailureTime.ShouldBeNull();
snap.LastError.ShouldBeNull();
snap.ProcessConnectionOpen.ShouldBeFalse();
snap.EventConnectionOpen.ShouldBeFalse();
}
[Fact]
public void GetHealthSnapshot_AfterConnectionFailure_RecordsFailure()
{
var factory = new FakeHistorianConnectionFactory
{
ConnectException = new InvalidOperationException("Connection refused")
};
var ds = new HistorianDataSource(DefaultConfig, factory);
ds.ReadRawAsync("Tag1", DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 100)
.GetAwaiter().GetResult();
var snap = ds.GetHealthSnapshot();
snap.TotalQueries.ShouldBe(1);
snap.TotalFailures.ShouldBe(1);
snap.TotalSuccesses.ShouldBe(0);
snap.ConsecutiveFailures.ShouldBe(1);
snap.LastFailureTime.ShouldNotBeNull();
snap.LastError.ShouldContain("Connection refused");
snap.ProcessConnectionOpen.ShouldBeFalse();
}
[Fact]
public void GetHealthSnapshot_AfterMultipleFailures_IncrementsConsecutive()
{
var factory = new FakeHistorianConnectionFactory
{
ConnectException = new InvalidOperationException("boom")
};
var ds = new HistorianDataSource(DefaultConfig, factory);
for (var i = 0; i < 4; i++)
ds.ReadRawAsync("Tag1", DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 100)
.GetAwaiter().GetResult();
var snap = ds.GetHealthSnapshot();
snap.TotalFailures.ShouldBe(4);
snap.ConsecutiveFailures.ShouldBe(4);
snap.TotalSuccesses.ShouldBe(0);
}
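// Counter semantics these snapshot tests pin down (sketch; the property names
// match HistorianHealthSnapshot, the update points are assumptions):
//
//   query start:  TotalQueries++
//   success:      TotalSuccesses++;  ConsecutiveFailures = 0;  LastSuccessTime = now
//   failure:      TotalFailures++;   ConsecutiveFailures++;    LastFailureTime = now;
//                 LastError = "<readPath>: " + ex.Message   // e.g. "aggregate: unreachable"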
[Fact]
public void GetHealthSnapshot_AcrossReadPaths_CountsAllFailures()
{
var factory = new FakeHistorianConnectionFactory
{
ConnectException = new InvalidOperationException("sdk down")
};
var ds = new HistorianDataSource(DefaultConfig, factory);
ds.ReadRawAsync("Tag1", DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 10)
.GetAwaiter().GetResult();
ds.ReadAggregateAsync("Tag1", DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 60000, "Average")
.GetAwaiter().GetResult();
ds.ReadAtTimeAsync("Tag1", new[] { DateTime.UtcNow })
.GetAwaiter().GetResult();
ds.ReadEventsAsync(null, DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 10)
.GetAwaiter().GetResult();
var snap = ds.GetHealthSnapshot();
snap.TotalFailures.ShouldBe(4);
snap.TotalQueries.ShouldBe(4);
snap.LastError.ShouldContain("sdk down");
}
[Fact]
public void GetHealthSnapshot_ErrorMessageCarriesReadPath()
{
var factory = new FakeHistorianConnectionFactory
{
ConnectException = new InvalidOperationException("unreachable")
};
var ds = new HistorianDataSource(DefaultConfig, factory);
ds.ReadAggregateAsync("Tag1", DateTime.UtcNow.AddHours(-1), DateTime.UtcNow, 60000, "Average")
.GetAwaiter().GetResult();
var snap = ds.GetHealthSnapshot();
snap.LastError.ShouldStartWith("aggregate:");
}
}
}