
Joseph Doherty 8f340553d9 Instrument the historian plugin with runtime query health counters and read-only cluster failover so operators can detect silent query degradation and keep serving history when a single cluster node goes down

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-13 14:08:32 -04:00

Status Dashboard

Overview

The service hosts an embedded HTTP status dashboard that surfaces real-time health, connection state, subscription counts, data change throughput, and Galaxy metadata. Operators access it through a browser to verify the bridge is functioning without needing an OPC UA client. The dashboard is enabled by default on port 8081 and can be disabled via configuration.

HTTP Server

StatusWebServer wraps a System.Net.HttpListener bound to http://+:{port}/. It starts a background task that accepts requests in a loop and dispatches them by path. Only GET requests are accepted; all other methods return 405 Method Not Allowed. Responses include Cache-Control: no-cache headers to prevent stale data in the browser.

Endpoints

Path	Content-Type	Description
`/`	`text/html`	Operator dashboard with auto-refresh
`/health`	`text/html`	Focused health page with service-level badge and component cards
`/api/status`	`application/json`	Full status snapshot as JSON (`StatusData`)
`/api/health`	`application/json`	Health endpoint (`HealthEndpointData`) -- returns `503` when status is `Unhealthy`, `200` otherwise

Any other path returns 404 Not Found.
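
The dispatch behaviour described above can be modelled as a small pure function. This is a minimal Python sketch, not the service's C# code: the name `dispatch` is hypothetical, and it assumes that the method check runs before the path lookup, that paths are matched exactly, and that `Cache-Control: no-cache` is also sent on error responses. None of these details are confirmed by the text.

```python
from typing import Dict, Tuple

# Path -> Content-Type, copied from the endpoint table above.
ROUTES: Dict[str, str] = {
    "/": "text/html",
    "/health": "text/html",
    "/api/status": "application/json",
    "/api/health": "application/json",
}


def dispatch(method: str, path: str) -> Tuple[int, Dict[str, str]]:
    """Return (status code, response headers) for one request.

    Assumption: non-GET methods are rejected before the path is examined.
    The 503 that /api/health can return depends on the health status and
    is not modelled here.
    """
    headers = {"Cache-Control": "no-cache"}
    if method.upper() != "GET":
        return 405, headers  # Method Not Allowed
    content_type = ROUTES.get(path)
    if content_type is None:
        return 404, headers  # Not Found
    headers["Content-Type"] = content_type
    return 200, headers
```

For example, `dispatch("POST", "/api/status")` yields a 405 and `dispatch("GET", "/metrics")` yields a 404 under these assumptions.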

Health Check Logic

HealthCheckService.CheckHealth evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, and 2d only fire when the corresponding integration is enabled and a non-null snapshot is passed:

- Rule 1 -- Unhealthy: MXAccess connection state is not Connected. Returns a red banner with the current state.
- Rule 2b -- Degraded: Historian.Enabled=true but the plugin load outcome is not Loaded. Returns a yellow banner citing the plugin status (NotFound, LoadFailed) and the error message if one is available.
- Rule 2 / 2c -- Degraded: Any recorded operation has a low success rate. The sample threshold depends on the operation category:
  - Regular operations (Read, Write, Subscribe, AlarmAcknowledge): >100 invocations and <50% success rate.
  - Historian operations (HistoryReadRaw, HistoryReadProcessed, HistoryReadAtTime, HistoryReadEvents): >10 invocations and <50% success rate. The lower threshold surfaces a stuck historian quickly, since history reads are rare relative to live reads.
- Rule 2d -- Degraded (latched): AlarmTrackingEnabled=true and any alarm acknowledge MXAccess write has failed since startup. Latched on purpose -- an ack write failure is a durable MXAccess write problem that should stay visible until the operator restarts.
- Rule 3 -- Healthy: All checks pass. Returns a green banner with "All systems operational."

The /api/health endpoint returns 200 for both Healthy and Degraded states, and 503 only for Unhealthy. This allows load balancers or monitoring tools to distinguish between a service that is running but degraded and one that has lost its runtime connection.
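
The first-match-wins ordering and the status-code mapping can be sketched as follows. This is an illustrative Python model of the rules as listed, not the actual `HealthCheckService.CheckHealth` implementation. The names `check_health`, `OpStats`, and `http_status` are hypothetical, as is the choice to label the regular-operation branch "2" and the historian branch "2c" and to treat unlisted operation names as regular operations.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

HISTORIAN_OPS = {
    "HistoryReadRaw", "HistoryReadProcessed",
    "HistoryReadAtTime", "HistoryReadEvents",
}


@dataclass
class OpStats:
    total_count: int
    success_rate: float  # fraction in [0, 1]


def check_health(
    connection_state: str,
    operations: Dict[str, OpStats],
    historian_enabled: bool = False,
    plugin_status: str = "Disabled",
    alarm_tracking_enabled: bool = False,
    ack_write_failures: int = 0,
) -> Tuple[str, str]:
    """Return (status, matching rule label); the first matching rule wins."""
    # Rule 1: no MXAccess connection.
    if connection_state != "Connected":
        return "Unhealthy", "1"
    # Rule 2b: historian enabled but plugin not loaded.
    if historian_enabled and plugin_status != "Loaded":
        return "Degraded", "2b"
    # Rule 2 / 2c: low success rate above a per-category sample threshold.
    for name, stats in operations.items():
        is_historian = name in HISTORIAN_OPS
        if is_historian and not historian_enabled:
            continue  # 2c only fires when the historian is enabled
        min_samples = 10 if is_historian else 100
        if stats.total_count > min_samples and stats.success_rate < 0.5:
            return "Degraded", "2c" if is_historian else "2"
    # Rule 2d: latched on any failed alarm ack write since startup.
    if alarm_tracking_enabled and ack_write_failures > 0:
        return "Degraded", "2d"
    # Rule 3: everything passed.
    return "Healthy", "3"


def http_status(status: str) -> int:
    """/api/health returns 503 only for Unhealthy, 200 otherwise."""
    return 503 if status == "Unhealthy" else 200
```

Note that both thresholds are strict: an operation with exactly 100 invocations (or exactly 10 for historian reads) does not yet trip the rule in this sketch.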

Status Data Model

StatusReportService aggregates data from all bridge components into a StatusData DTO, which is then rendered as HTML or serialized to JSON. The DTO contains the following sections:

Connection

Field	Type	Description
`State`	`string`	Current MXAccess connection state (Connected, Disconnected, Connecting)
`ReconnectCount`	`int`	Number of reconnect attempts since startup
`ActiveSessions`	`int`	Number of active OPC UA client sessions

Health

Field	Type	Description
`Status`	`string`	Healthy, Degraded, or Unhealthy
`Message`	`string`	Operator-facing explanation
`Color`	`string`	CSS color token (green, yellow, red, gray)

Subscriptions

Field	Type	Description
`ActiveCount`	`int`	Number of active MXAccess tag subscriptions

Galaxy

Field	Type	Description
`GalaxyName`	`string`	Name of the Galaxy being bridged
`DbConnected`	`bool`	Whether the Galaxy repository database is reachable
`LastDeployTime`	`DateTime?`	Most recent deploy timestamp from the Galaxy
`ObjectCount`	`int`	Number of Galaxy objects in the address space
`AttributeCount`	`int`	Number of Galaxy attributes as OPC UA variables
`LastRebuildTime`	`DateTime?`	UTC timestamp of the last completed address-space rebuild

Data change

Field	Type	Description
`EventsPerSecond`	`double`	Rate of MXAccess data change events per second
`AvgBatchSize`	`double`	Average items processed per dispatch cycle
`PendingItems`	`int`	Items waiting in the dispatch queue
`TotalEvents`	`long`	Total MXAccess data change events since startup

Operations

A dictionary of MetricsStatistics keyed by operation name. Each entry contains:

- TotalCount -- total invocations
- SuccessRate -- fraction of successful operations
- AverageMilliseconds, MinMilliseconds, MaxMilliseconds, Percentile95Milliseconds -- latency distribution

The instrumented operation names are:

Name	Source
`Read`	MXAccess live tag reads (`MxAccessClient.ReadWrite.cs`)
`Write`	MXAccess live tag writes
`Subscribe`	MXAccess subscription attach
`HistoryReadRaw`	`LmxNodeManager.HistoryReadRawModified` -> historian plugin
`HistoryReadProcessed`	`LmxNodeManager.HistoryReadProcessed` -> historian plugin (aggregates)
`HistoryReadAtTime`	`LmxNodeManager.HistoryReadAtTime` -> historian plugin (interpolated)
`HistoryReadEvents`	`LmxNodeManager.HistoryReadEvents` -> historian plugin (alarm/event history)
`AlarmAcknowledge`	`LmxNodeManager.OnAlarmAcknowledge` -> MXAccess AckMsg write

New operation names are auto-registered on first use, so the Operations dictionary only contains entries for features that have actually been exercised since startup.
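
To illustrate how per-operation statistics and auto-registration on first use fit together, here is a minimal Python sketch. It is an assumption-laden model rather than the real metrics code: the class name `OperationMetrics` is hypothetical, it keeps every sample in memory, and it uses the nearest-rank definition of the 95th percentile, which the text does not specify.

```python
from collections import defaultdict
from typing import Dict, List


class OperationMetrics:
    """Hypothetical in-memory registry keyed by operation name."""

    def __init__(self) -> None:
        # defaultdict creates the entry the first time a name is recorded,
        # which mirrors "auto-registered on first use".
        self._durations: Dict[str, List[float]] = defaultdict(list)
        self._successes: Dict[str, int] = defaultdict(int)

    def record(self, name: str, milliseconds: float, success: bool) -> None:
        self._durations[name].append(milliseconds)
        if success:
            self._successes[name] += 1

    def snapshot(self) -> Dict[str, dict]:
        result = {}
        for name, samples in self._durations.items():
            ordered = sorted(samples)
            count = len(ordered)
            # Nearest-rank p95 with integer arithmetic: ceil(0.95 * count).
            rank = (95 * count + 99) // 100
            result[name] = {
                "TotalCount": count,
                "SuccessRate": self._successes.get(name, 0) / count,
                "AverageMilliseconds": sum(ordered) / count,
                "MinMilliseconds": ordered[0],
                "MaxMilliseconds": ordered[-1],
                "Percentile95Milliseconds": ordered[rank - 1],
            }
        return result
```

An operation that has never been recorded simply has no key in the snapshot, which matches the behaviour described for the Operations dictionary.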

Historian

HistorianStatusInfo -- reflects the outcome of the runtime-loaded historian plugin and the runtime query-health counters. See Historical Data Access for the plugin architecture and the Runtime Health Counters section for the data source instrumentation.

Field	Type	Description
`Enabled`	`bool`	Whether `Historian.Enabled` is set in configuration
`PluginStatus`	`string`	`Disabled`, `NotFound`, `LoadFailed`, or `Loaded` — load-time outcome from `HistorianPluginLoader.LastOutcome`
`PluginError`	`string?`	Exception message from the last load attempt when `PluginStatus=LoadFailed`; otherwise `null`
`PluginPath`	`string`	Absolute path the loader probed for the plugin assembly
`ServerName`	`string`	Legacy single-node hostname from `Historian.ServerName`; ignored when `ServerNames` is non-empty
`Port`	`int`	Configured historian TCP port
`QueryTotal`	`long`	Total historian read queries attempted since startup (raw + aggregate + at-time + events)
`QuerySuccesses`	`long`	Queries that completed without an exception
`QueryFailures`	`long`	Queries that raised an exception — each failure also triggers the plugin's reconnect path
`ConsecutiveFailures`	`int`	Failures since the last success. Resets to zero on any successful query. Drives the `Degraded` health rule at threshold 3
`LastSuccessTime`	`DateTime?`	UTC timestamp of the most recent successful query, or `null` when no query has succeeded since startup
`LastFailureTime`	`DateTime?`	UTC timestamp of the most recent failure
`LastQueryError`	`string?`	Exception message from the most recent failure. Prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call failed
`ProcessConnectionOpen`	`bool`	Whether the plugin currently holds an open SDK connection for the process silo (historical value queries — `ReadRaw`, `ReadAggregate`, `ReadAtTime`). See Two SDK connection silos
`EventConnectionOpen`	`bool`	Whether the plugin currently holds an open SDK connection for the event silo (alarm history queries — `ReadEvents`). Separate from the process connection because the SDK requires distinct query channels
`ActiveProcessNode`	`string?`	Cluster node currently serving the process silo, or `null` when no process connection is open
`ActiveEventNode`	`string?`	Cluster node currently serving the event silo, or `null` when no event connection is open
`NodeCount`	`int`	Total configured historian cluster nodes. 1 for a legacy single-node deployment
`HealthyNodeCount`	`int`	Nodes currently eligible for new connections (not in failure cooldown)
`Nodes`	`List<HistorianClusterNodeState>`	Per-node cluster state in configuration order. Each entry carries `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime`

The operator dashboard renders a cluster table inside the Historian panel when NodeCount > 1. Legacy single-node deployments render a compact Node: <hostname> line and no table. Panel color reflects combined load-time and runtime health: green when the plugin is loaded, every cluster node is healthy, and there are no consecutive query failures; yellow when any cluster node is in cooldown or 1-4 consecutive query failures have accumulated; red when the plugin is not loaded, all cluster nodes have failed, or 5 or more consecutive query failures have accumulated.
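
The counter semantics and the panel color thresholds can be sketched together. This Python model is illustrative only: `HistorianCounters` and `panel_color` are hypothetical names, the space after the read-path prefix in `LastQueryError` is an assumption, and the case where the historian is disabled entirely is not modelled.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class HistorianCounters:
    query_total: int = 0
    query_successes: int = 0
    query_failures: int = 0
    consecutive_failures: int = 0
    last_success_time: Optional[datetime] = None
    last_failure_time: Optional[datetime] = None
    last_query_error: Optional[str] = None

    def record_success(self) -> None:
        self.query_total += 1
        self.query_successes += 1
        self.consecutive_failures = 0  # any success resets the streak
        self.last_success_time = datetime.now(timezone.utc)

    def record_failure(self, read_path: str, message: str) -> None:
        # read_path is one of: raw, aggregate, at-time, events
        self.query_total += 1
        self.query_failures += 1
        self.consecutive_failures += 1
        self.last_failure_time = datetime.now(timezone.utc)
        self.last_query_error = f"{read_path}: {message}"


def panel_color(plugin_loaded: bool, node_count: int,
                healthy_node_count: int, consecutive_failures: int) -> str:
    """Map load-time and runtime health to the Historian panel color."""
    if (not plugin_loaded or healthy_node_count == 0
            or consecutive_failures >= 5):
        return "red"
    if healthy_node_count < node_count or consecutive_failures >= 1:
        return "yellow"
    return "green"
```

With these definitions a three-node cluster with one node in cooldown renders yellow, and a single successful query returns a panel with four consecutive failures to green.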

Alarms

AlarmStatusInfo -- surfaces alarm-condition tracking health and dispatch counters.

Field	Type	Description
`TrackingEnabled`	`bool`	Whether `OpcUa.AlarmTrackingEnabled` is set in configuration
`ConditionCount`	`int`	Number of distinct alarm conditions currently tracked
`ActiveAlarmCount`	`int`	Number of alarms currently in the `InAlarm=true` state
`TransitionCount`	`long`	Total `InAlarm` transitions observed in the dispatch loop since startup
`AckEventCount`	`long`	Total alarm acknowledgement transitions observed since startup
`AckWriteFailures`	`long`	Total MXAccess AckMsg writes that have failed while processing alarm acknowledges. Any non-zero value latches the service into Degraded (see Rule 2d).
`FilterEnabled`	`bool`	Whether `OpcUa.AlarmFilter.ObjectFilters` has any patterns configured
`FilterPatternCount`	`int`	Number of compiled filter patterns (after comma-splitting and trimming)
`FilterIncludedObjectCount`	`int`	Number of Galaxy objects included by the filter during the most recent address-space build. Zero when the filter is disabled.

When the filter is active, the operator dashboard's Alarms panel renders an extra line Filter: N pattern(s), M object(s) included so operators can verify scope at a glance. See Alarm Tracking for the matching rules and resolution algorithm.
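
As a rough illustration of how `FilterPatternCount` and the extra dashboard line relate, the following Python sketch splits and trims filter entries. It assumes `ObjectFilters` is a list of strings that may each hold several comma-separated patterns and that empty fragments are dropped; the function names and the sample patterns are hypothetical.

```python
from typing import List


def compile_patterns(object_filters: List[str]) -> List[str]:
    """Split each entry on commas, trim whitespace, drop empty fragments."""
    patterns: List[str] = []
    for entry in object_filters:
        for part in entry.split(","):
            trimmed = part.strip()
            if trimmed:  # assumption: blank fragments are not counted
                patterns.append(trimmed)
    return patterns


def filter_line(pattern_count: int, included_objects: int) -> str:
    """Render the extra Alarms panel line shown when the filter is active."""
    return (f"Filter: {pattern_count} pattern(s), "
            f"{included_objects} object(s) included")
```

For instance, the hypothetical entries `["Tank*, Pump_01", " Valve? ,"]` would compile to three patterns under these assumptions.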

Redundancy

RedundancyInfo -- only populated when Redundancy.Enabled=true in configuration. Shows mode, role, computed service level, application URI, and the set of peer server URIs. See Redundancy for the full guide.

Footer

Field	Type	Description
`Timestamp`	`DateTime`	UTC time when the snapshot was generated
`Version`	`string`	Service assembly version

`/api/health` Payload

The health endpoint returns a HealthEndpointData document distinct from the full dashboard snapshot. It is designed for load balancers and external monitoring probes that only need an up/down signal plus component-level detail:

Field	Type	Description
`Status`	`string`	`Healthy`, `Degraded`, or `Unhealthy` (drives the HTTP status code)
`ServiceLevel`	`byte`	OPC UA-style 0-255 service level. 255 when healthy non-redundant; 0 when MXAccess is down; redundancy-adjusted otherwise
`RedundancyEnabled`	`bool`	Whether redundancy is configured
`RedundancyRole`	`string?`	`Primary` or `Secondary` when redundancy is enabled; `null` otherwise
`RedundancyMode`	`string?`	`Warm` or `Hot` when redundancy is enabled; `null` otherwise
`Components.MxAccess`	`string`	`Connected` or `Disconnected`
`Components.Database`	`string`	`Connected` or `Disconnected`
`Components.OpcUaServer`	`string`	`Running` or `Stopped`
`Components.Historian`	`string`	`Disabled`, `NotFound`, `LoadFailed`, or `Loaded` -- matches `HistorianStatusInfo.PluginStatus`
`Components.Alarms`	`string`	`Disabled` or `Enabled` -- mirrors `OpcUa.AlarmTrackingEnabled`
`Uptime`	`string`	Formatted service uptime (e.g., `3d 5h 20m`)
`Timestamp`	`DateTime`	UTC time the snapshot was generated

Monitoring tools should:

- Alert on Status=Unhealthy (HTTP 503) for hard outages.
- Alert on Status=Degraded (HTTP 200) for latched or cumulative failures -- a degraded status means the server is still operating but a subsystem needs attention (historian plugin missing, alarm ack writes failing, history read error rate too high, etc.).
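
A monitoring probe following this guidance might look like the Python sketch below. The function names and the severity labels (`critical`, `warning`, `ok`, `unknown`) are hypothetical, and the sketch assumes the 503 response still carries the JSON body. Because `urllib.request.urlopen` raises `HTTPError` for a 503, the error object is read as the response.

```python
import json
import urllib.error
import urllib.request
from typing import Tuple


def fetch_health(url: str, timeout: float = 5.0) -> Tuple[int, dict]:
    """GET the health endpoint and return (HTTP status, parsed JSON body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # Non-2xx responses raise; the body is still readable from err.
        return err.code, json.load(err)


def classify(http_status: int, payload: dict) -> str:
    """Map a health response to a hypothetical alert severity."""
    status = payload.get("Status")
    if http_status == 503 or status == "Unhealthy":
        return "critical"  # hard outage
    if status == "Degraded":
        return "warning"   # still serving, a subsystem needs attention
    if http_status == 200 and status == "Healthy":
        return "ok"
    return "unknown"
```

A caller would combine the two, for example `classify(*fetch_health("http://localhost:8081/api/health"))`, where the host name is a placeholder and 8081 is the default dashboard port.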

HTML Dashboards

`/` -- Operator dashboard

Monospace, dark background, color-coded panels. Panels: Connection, Health, Redundancy (when enabled), Subscriptions, Data Change Dispatch, Galaxy Info, Historian, Alarms, Operations (table), Footer. Each panel border color reflects component state (green, yellow, red, or gray).

The page includes a <meta http-equiv='refresh'> tag set to the configured RefreshIntervalSeconds (default 10 seconds), so the browser polls automatically without JavaScript.
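
For illustration, the refresh tag could be rendered like this in Python. The function name is hypothetical, and the `content` attribute is the standard HTML way to carry the interval rather than something quoted from the service's template.

```python
def refresh_meta_tag(refresh_interval_seconds: int = 10) -> str:
    """Build the auto-refresh tag; the default mirrors RefreshIntervalSeconds."""
    return (f"<meta http-equiv='refresh' "
            f"content='{refresh_interval_seconds}'>")
```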

`/health` -- Focused health view

Large status badge, computed ServiceLevel value, redundancy summary (when enabled), and a row of component cards: MXAccess, Galaxy Database, OPC UA Server, Historian, Alarm Tracking. Each card turns red when its component is in a failure state and gray when disabled. Best for wallboards and quick at-a-glance monitoring.

Configuration

The dashboard is configured through the Dashboard section in appsettings.json:

{
  "Dashboard": {
    "Enabled": true,
    "Port": 8081,
    "RefreshIntervalSeconds": 10
  }
}

Setting Enabled to false prevents the StatusWebServer from starting. The StatusReportService is still created so that other components can query health programmatically, but no HTTP listener is opened.
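
The effect of the Dashboard section can be sketched in Python as below. The defaults (enabled, port 8081, 10-second refresh) come from this document; the names `DashboardConfig`, `load_dashboard_config`, and `listener_prefix` are hypothetical, and falling back to defaults for missing keys is an assumption about how the configuration binder behaves.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class DashboardConfig:
    enabled: bool = True
    port: int = 8081
    refresh_interval_seconds: int = 10


def load_dashboard_config(appsettings_text: str) -> DashboardConfig:
    """Read the Dashboard section, using documented defaults for gaps."""
    section = json.loads(appsettings_text).get("Dashboard", {})
    return DashboardConfig(
        enabled=section.get("Enabled", True),
        port=section.get("Port", 8081),
        refresh_interval_seconds=section.get("RefreshIntervalSeconds", 10),
    )


def listener_prefix(config: DashboardConfig) -> Optional[str]:
    """Return the HttpListener-style prefix, or None when disabled."""
    return f"http://+:{config.port}/" if config.enabled else None
```

With `"Enabled": false` the sketch returns no prefix, which corresponds to no HTTP listener being opened.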

Component Wiring

StatusReportService is initialized after all other service components are created. OpcUaService.Start() calls SetComponents() to supply the live references, including the historian configuration so the dashboard can label the plugin target and evaluate Rule 2b:

StatusReportInstance.SetComponents(
    effectiveMxClient,
    Metrics,
    GalaxyStatsInstance,
    ServerHost,
    NodeManagerInstance,
    _config.Redundancy,
    _config.OpcUa.ApplicationUri,
    _config.Historian);

This deferred wiring allows the report service to be constructed before the MXAccess client or node manager is fully initialized. If a component is null, the report service falls back to default values (e.g., ConnectionState.Disconnected, zero counts, HistorianPluginStatus.Disabled).
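
The deferred-wiring pattern with null fallbacks can be modelled in a few lines of Python. This is a simplified sketch with hypothetical names (`StatusReport`, `set_components`, and the `state` and `active_subscriptions` attributes), and it wires only one component to keep the idea visible.

```python
from typing import Any, Optional


class StatusReport:
    """Hypothetical report service that tolerates missing components."""

    def __init__(self) -> None:
        # Constructed early; live references arrive later.
        self._mx_client: Optional[Any] = None

    def set_components(self, mx_client: Optional[Any]) -> None:
        """Supply live references once the host has created them."""
        self._mx_client = mx_client

    def connection_state(self) -> str:
        if self._mx_client is None:
            return "Disconnected"  # default while unwired
        return self._mx_client.state

    def active_subscriptions(self) -> int:
        if self._mx_client is None:
            return 0  # zero counts while unwired
        return self._mx_client.active_subscriptions
```

Querying the report before `set_components` is called returns the defaults instead of raising, which is the property the deferred wiring relies on.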

The historian plugin status is sourced from HistorianPluginLoader.LastOutcome, which is updated on every load attempt. OpcUaService explicitly calls HistorianPluginLoader.MarkDisabled() when Historian.Enabled=false so the dashboard can distinguish "feature off" from "load failed" without ambiguity.
