Instrument the historian plugin with runtime query health counters and read-only cluster failover so operators can detect silent query degradation and keep serving history when a single cluster node goes down

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-04-13 14:08:32 -04:00
parent 4fe37fd1b7
commit 8f340553d9
20 changed files with 1526 additions and 32 deletions


@@ -463,6 +463,106 @@ Filter syntax quick reference (documented in `AlarmFilterConfiguration.cs` XML-d
- Empty list → filter disabled → current unfiltered behavior.
- Match semantics: an object is included when any template in its derivation chain matches any pattern, and the inclusion propagates to all descendants in the containment hierarchy. Each object is evaluated once regardless of how many patterns or ancestors match.
## Historian Runtime Health Surface
Updated: `2026-04-13 10:44-10:52 America/New_York`
Both instances updated with runtime historian query instrumentation so the status dashboard can detect silent query degradation that the load-time `PluginStatus` cannot catch.
Backups:
- `C:\publish\lmxopcua\backups\20260413-104406-instance1`
- `C:\publish\lmxopcua\backups\20260413-104406-instance2`
Code changes:
- `Host/Historian/HistorianHealthSnapshot.cs` (new) — DTO with `TotalQueries`, `TotalSuccesses`, `TotalFailures`, `ConsecutiveFailures`, `LastSuccessTime`, `LastFailureTime`, `LastError`, `ProcessConnectionOpen`, `EventConnectionOpen`.
- `Host/Historian/IHistorianDataSource.cs` — added `GetHealthSnapshot()` interface method.
- `Historian.Aveva/HistorianDataSource.cs` — added `_healthLock`-guarded counters, `RecordSuccess()` / `RecordFailure(path)` helpers called at every terminal site in all four read methods (raw, aggregate, at-time, events). Error messages carry a `raw:` / `aggregate:` / `at-time:` / `events:` prefix so operators can tell which SDK call is broken. A sketch of this counter plumbing follows this list.
- `Host/OpcUa/LmxNodeManager.cs` — exposes `HistorianHealth` property that proxies to `IHistorianDataSource.GetHealthSnapshot()`.
- `Host/Status/StatusData.cs` — added 9 new fields on `HistorianStatusInfo`.
- `Host/Status/StatusReportService.cs` — `BuildHistorianStatusInfo()` populates the new fields from the node manager; panel color gradient: green → yellow (1-4 consecutive failures) → red (≥5 consecutive or plugin unloaded). Renders `Queries: N (Success: X, Failure: Y) | Consecutive Failures: Z`, `Process Conn: open/closed | Event Conn: open/closed`, plus `Last Success:` / `Last Failure:` / `Last Error:` lines when applicable.
- `Host/Status/HealthCheckService.cs` — new Rule 2b2: `Degraded` when `ConsecutiveFailures >= 3`. Threshold chosen to avoid flagging single transient blips.
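For orientation, a minimal sketch of how the snapshot DTO and the lock-guarded counters fit together. The type and member names come from the list above; the field layout, the `RecordFailure` parameters, and the exact locking shape are assumptions, not the actual implementation.
```csharp
using System;

// Sketch only: the real types live in Host/Historian/HistorianHealthSnapshot.cs and
// Historian.Aveva/HistorianDataSource.cs; field layout and exact signatures are assumed.
public sealed class HistorianHealthSnapshot
{
    public long TotalQueries { get; init; }
    public long TotalSuccesses { get; init; }
    public long TotalFailures { get; init; }
    public int ConsecutiveFailures { get; init; }
    public DateTime? LastSuccessTime { get; init; }
    public DateTime? LastFailureTime { get; init; }
    public string? LastError { get; init; }
    public bool ProcessConnectionOpen { get; init; }
    public bool EventConnectionOpen { get; init; }
}

// Assumed shape of the counter plumbing: every terminal site in the four read paths
// (raw, aggregate, at-time, events) calls RecordSuccess or RecordFailure.
public sealed class HistorianDataSourceHealthSketch
{
    private readonly object _healthLock = new();
    private long _totalQueries, _totalSuccesses, _totalFailures;
    private int _consecutiveFailures;
    private DateTime? _lastSuccessTime, _lastFailureTime;
    private string? _lastError;

    public void RecordSuccess()
    {
        lock (_healthLock)
        {
            _totalQueries++;
            _totalSuccesses++;
            _consecutiveFailures = 0;            // a success resets the streak
            _lastSuccessTime = DateTime.UtcNow;
        }
    }

    public void RecordFailure(string path, string error)
    {
        lock (_healthLock)
        {
            _totalQueries++;
            _totalFailures++;
            _consecutiveFailures++;              // 3+ trips Rule 2b2 (Degraded), 5+ turns the panel red
            _lastFailureTime = DateTime.UtcNow;
            _lastError = $"{path}: {error}";     // "raw:" / "aggregate:" / "at-time:" / "events:" prefix
        }
    }
}
```
`GetHealthSnapshot()` would copy these fields into a `HistorianHealthSnapshot` under the same lock, along with the current process/event connection state; that snapshot is what the status dashboard and `/api/status` read.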
Tests:
- 5 new unit tests in `HistorianDataSourceLifecycleTests` covering fresh zero-state, single failure, multi-failure consecutive increment, cross-read-path counting, and error-message-carries-path.
- Full suite: 16/16 plugin tests, 447/447 host tests passing.
Live verification on instance1:
```
Before any query:
Queries: 0 (Success: 0, Failure: 0) | Process Conn: closed | Event Conn: closed
After TestMachine_001.TestHistoryValue raw read:
Queries: 1 (Success: 1, Failure: 0) | Process Conn: open
Last Success: 2026-04-13T14:45:18Z
After aggregate hourly-average over 24h:
Queries: 2 (Success: 2, Failure: 0)
After historyread against an unknown node id (bad tag):
Queries: 2 (counter unchanged — rejected at node-lookup before reaching the plugin; correct)
```
JSON endpoint `/api/status` carries all 9 new fields with correct types. Both instances deployed; instance1 `LmxOpcUa` PID 33824, instance2 `LmxOpcUa2` PID 30200.
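For reference, a plausible shape of the historian block inside that payload. Property names mirror the snapshot fields above; the casing, nesting, and example values are assumptions rather than captured output.
```json
"historian": {
  "totalQueries": 2,
  "totalSuccesses": 2,
  "totalFailures": 0,
  "consecutiveFailures": 0,
  "lastSuccessTime": "2026-04-13T14:45:18Z",
  "lastFailureTime": null,
  "lastError": null,
  "processConnectionOpen": true,
  "eventConnectionOpen": false
}
```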
## Historian Read-Only Cluster Support
Updated: `2026-04-13 11:25-12:00 America/New_York`
Both instances updated with Wonderware Historian read-only cluster failover. Operators can supply an ordered list of historian cluster nodes; the plugin iterates them on each fresh connect and benches failed nodes for a configurable cooldown window. Single-node deployments are preserved via the existing `ServerName` field.
Backups:
- `C:\publish\lmxopcua\backups\20260413-112519-instance1`
- `C:\publish\lmxopcua\backups\20260413-112519-instance2`
Code changes:
- `Host/Configuration/HistorianConfiguration.cs` — added `ServerNames: List<string>` (defaults to `[]`) and `FailureCooldownSeconds: int` (defaults to 60). `ServerName` preserved as fallback when `ServerNames` is empty.
- `Host/Historian/HistorianClusterNodeState.cs` (new) — per-node DTO: `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime`.
- `Host/Historian/HistorianHealthSnapshot.cs` — extended with `ActiveProcessNode`, `ActiveEventNode`, `NodeCount`, `HealthyNodeCount`, `Nodes: List<HistorianClusterNodeState>`.
- `Historian.Aveva/HistorianClusterEndpointPicker.cs` (new, internal) — pure picker with injected clock, thread-safe via lock, BFS-style `GetHealthyNodes()` / `MarkFailed()` / `MarkHealthy()` / `SnapshotNodeStates()`. Nodes iterate in configuration order; failed nodes skip until cooldown elapses; the cumulative `FailureCount` and `LastError` are retained across recovery for operator diagnostics. A sketch of this picker follows this list.
- `Historian.Aveva/HistorianDataSource.cs` — new `ConnectToAnyHealthyNode(type)` method iterates picker candidates, clones `HistorianConfiguration` per attempt with the candidate as `ServerName`, and returns the first successful `(Connection, Node)` tuple. `EnsureConnected` and `EnsureEventConnected` both call it. `HandleConnectionError` and `HandleEventConnectionError` now mark the active node failed in the picker before nulling. `_activeProcessNode` / `_activeEventNode` track the live node for the dashboard. Both silos (process + event) share a single picker instance so a node failure on one immediately benches it for the other.
- `Host/Status/StatusData.cs` — added `NodeCount`, `HealthyNodeCount`, `ActiveProcessNode`, `ActiveEventNode`, `Nodes` to `HistorianStatusInfo`.
- `Host/Status/StatusReportService.cs` — Historian panel renders `Process Conn: open (<node>)` badges and a cluster table (when `NodeCount > 1`) showing each node's state, cooldown expiry, failure count, and last error. Single-node deployments render a compact `Node: <hostname>` line.
- `Host/Status/HealthCheckService.cs` — new Rule 2b3: `Degraded` when `NodeCount > 1 && HealthyNodeCount < NodeCount`. Lets operators alert on a partially-failed cluster even while queries are still succeeding via the remaining nodes.
- `Host/Configuration/ConfigurationValidator.cs` — logs the effective node list and `FailureCooldownSeconds` at startup, validates that `FailureCooldownSeconds >= 0`, warns when `ServerName` is set alongside a non-empty `ServerNames`.
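For orientation, a minimal sketch of the picker described above. The public member names (`GetHealthyNodes`, `MarkFailed`, `MarkHealthy`) and the `HistorianClusterNodeState` fields come from this list; the constructor shape and internal layout are assumptions.
```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Per-node state DTO (fields from the list above).
public sealed class HistorianClusterNodeState
{
    public string Name { get; init; } = "";
    public bool IsHealthy { get; set; } = true;
    public DateTime CooldownUntil { get; set; }
    public int FailureCount { get; set; }
    public string? LastError { get; set; }
    public DateTime? LastFailureTime { get; set; }
}

// Sketch of the picker: configuration order is preserved, failed nodes are benched
// until their cooldown elapses, and failure history survives recovery.
internal sealed class HistorianClusterEndpointPicker
{
    private readonly object _lock = new();
    private readonly Func<DateTime> _utcNow;        // injected clock, swapped for a fake in tests
    private readonly List<string> _orderedNodes;
    private readonly TimeSpan _cooldown;
    private readonly Dictionary<string, HistorianClusterNodeState> _states;

    public HistorianClusterEndpointPicker(IEnumerable<string> nodes, TimeSpan cooldown, Func<DateTime> utcNow)
    {
        _orderedNodes = nodes.ToList();
        _cooldown = cooldown;
        _utcNow = utcNow;
        _states = _orderedNodes.ToDictionary(
            n => n, n => new HistorianClusterNodeState { Name = n });
    }

    // Candidates in configuration order, skipping nodes still inside their cooldown window.
    public IReadOnlyList<string> GetHealthyNodes()
    {
        lock (_lock)
        {
            var now = _utcNow();
            return _orderedNodes
                .Where(n => _states[n].IsHealthy || _states[n].CooldownUntil <= now)
                .ToList();
        }
    }

    public void MarkFailed(string node, string error)
    {
        lock (_lock)
        {
            if (!_states.TryGetValue(node, out var state)) return;   // unknown-node safety
            state.IsHealthy = false;
            state.FailureCount++;                                     // cumulative across recoveries
            state.LastError = error;
            state.LastFailureTime = _utcNow();
            state.CooldownUntil = _utcNow() + _cooldown;
        }
    }

    public void MarkHealthy(string node)
    {
        lock (_lock)
        {
            if (_states.TryGetValue(node, out var state)) state.IsHealthy = true;
            // FailureCount and LastError are intentionally retained as operator history.
        }
    }
}
```
Under this shape, `ConnectToAnyHealthyNode` would walk `GetHealthyNodes()` in configuration order, clone the configuration with each candidate as `ServerName`, mark the first node that connects healthy, and mark each candidate that throws as failed before moving on; `SnapshotNodeStates()` would hand the per-node states to the dashboard cluster table.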
Tests:
- `HistorianClusterEndpointPickerTests.cs` — 19 unit tests covering config parsing, ordered iteration, cooldown expiry, zero-cooldown mode, mark-healthy clears, cumulative failure counting, unknown-node safety, concurrent writers (thread-safety smoke test). The cooldown-expiry case is sketched after this list.
- `HistorianClusterFailoverTests.cs` — 6 integration tests driving `HistorianDataSource` via a scripted `FakeHistorianConnectionFactory`: first-node-fails-picks-second, all-nodes-fail, second-call-skips-cooled-down-node, single-node-legacy-behavior, picker-order-respected, shared-picker-across-silos.
- Full plugin suite: 41/41 tests passing. Host suite: 446/447 (1 pre-existing flaky MxAccess monitor test passes on retry).
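As an illustration, the cooldown-expiry case might look roughly like this, assuming xUnit, the picker shape sketched above, and that the internal picker is visible to the test project; the real tests live in `HistorianClusterEndpointPickerTests.cs` and may differ.
```csharp
using System;
using Xunit;

public class HistorianClusterEndpointPickerCooldownSketch
{
    [Fact]
    public void FailedNode_IsSkippedDuringCooldown_AndReadmittedAfterItElapses()
    {
        // Injected clock: the test advances `now` instead of sleeping.
        var now = new DateTime(2026, 4, 13, 15, 27, 25, DateTimeKind.Utc);
        var picker = new HistorianClusterEndpointPicker(
            new[] { "historian-a", "historian-b" }, TimeSpan.FromSeconds(30), () => now);

        picker.MarkFailed("historian-a", "connect timeout");

        // During the cooldown window only the healthy node is offered.
        Assert.Equal(new[] { "historian-b" }, picker.GetHealthyNodes());

        // Past the cooldown the node is offered again, in configuration order,
        // still carrying its cumulative FailureCount / LastError as history.
        now = now.AddSeconds(31);
        Assert.Equal(new[] { "historian-a", "historian-b" }, picker.GetHealthyNodes());
    }
}
```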
Live verification on instance1 (cluster = `["does-not-exist-historian.invalid", "localhost"]`, `FailureCooldownSeconds=30`):
**Failover cycle 1** (fresh picker state, both nodes healthy):
```
2026-04-13 11:27:25.381 [WRN] Historian node does-not-exist-historian.invalid failed during connect attempt; trying next candidate
2026-04-13 11:27:25.910 [INF] Historian SDK connection opened to localhost:32568
```
- historyread returned 1 value successfully (`Queries: 1 (Success: 1, Failure: 0)`).
- Dashboard: panel yellow, `Cluster: 1 of 2 nodes healthy`, bad node in `cooldown` until `11:27:55Z`, `Process Conn: open (localhost)`.
**Cooldown expiry**:
- At 11:29 UTC, the cooldown window had elapsed. Panel back to green, both nodes healthy, but `does-not-exist-historian.invalid` retains `FailureCount=1` and `LastError` as history.
**Failover cycle 2** (service restart to drop persistent connection):
```
2026-04-13 14:00:39.352 [WRN] Historian node does-not-exist-historian.invalid failed during connect attempt; trying next candidate
2026-04-13 14:00:39.885 [INF] Historian SDK connection opened to localhost:32568
```
- historyread returned 1 value successfully on the second restart cycle, proving the picker re-admits a cooled-down node and the whole failover cycle repeats cleanly.
**Single-node restoration**:
- Changed instance1 back to `"ServerNames": []`, restarted. Dashboard renders `Node: localhost` (no cluster table), panel green, backward compat verified.
Final configuration: both instances running with empty `ServerNames` (single-node mode). `LmxOpcUa` PID 31064, `LmxOpcUa2` PID 15012.
Operator configuration shape:
```json
"Historian": {
"Enabled": true,
"ServerName": "localhost", // ignored when ServerNames is non-empty
"ServerNames": ["historian-a", "historian-b"],
"FailureCooldownSeconds": 60,
...
}
```
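A minimal sketch of the documented precedence and validation rules (non-empty `ServerNames` wins, `FailureCooldownSeconds >= 0`, warn when both fields are set); the helper names here are hypothetical, not the real `HistorianConfiguration` / `ConfigurationValidator` API.
```csharp
using System.Collections.Generic;

// Hypothetical helper illustrating the documented rules; not the actual classes.
public sealed class HistorianClusterSettingsSketch
{
    public string ServerName { get; set; } = "localhost";
    public List<string> ServerNames { get; set; } = new();
    public int FailureCooldownSeconds { get; set; } = 60;

    // Non-empty ServerNames wins; otherwise fall back to the legacy single ServerName.
    public IReadOnlyList<string> GetEffectiveNodeList() =>
        ServerNames.Count > 0 ? ServerNames : new List<string> { ServerName };

    // Startup checks mirroring ConfigurationValidator's documented behavior.
    public IEnumerable<string> Validate()
    {
        if (FailureCooldownSeconds < 0)
            yield return "FailureCooldownSeconds must be >= 0";
        if (ServerNames.Count > 0 && !string.IsNullOrWhiteSpace(ServerName))
            yield return "ServerName is set but ignored because ServerNames is non-empty";
    }
}
```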
## Notes
The service deployment and restart succeeded. The live CLI checks confirm the endpoint is reachable and that the array node identifier has changed to the bracketless form. The array value on the live service still prints as blank even though the status is good, so if this environment is expected to populate `MoveInPartNumbers`, the runtime data path still needs follow-up investigation.