Surface historian plugin and alarm-tracking health in the status dashboard so operators can detect misconfiguration and runtime degradation that previously showed as fully healthy
Wraps the 4 HistoryRead overrides and OnAlarmAcknowledge with PerformanceMetrics.BeginOperation, adds alarm counters to LmxNodeManager, publishes a structured HistorianPluginOutcome from HistorianPluginLoader, and extends HealthCheckService with plugin-load, history-read, and alarm-ack-failure degradation rules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -12,19 +12,24 @@ The service hosts an embedded HTTP status dashboard that surfaces real-time heal

 | Path | Content-Type | Description |
 |------|-------------|-------------|
-| `/` | `text/html` | HTML dashboard with auto-refresh |
-| `/api/status` | `application/json` | Full status snapshot as JSON |
-| `/api/health` | `application/json` | Health check: returns `200` with `{"status":"healthy"}` or `503` with `{"status":"unhealthy"}` |
+| `/` | `text/html` | Operator dashboard with auto-refresh |
+| `/health` | `text/html` | Focused health page with service-level badge and component cards |
+| `/api/status` | `application/json` | Full status snapshot as JSON (`StatusData`) |
+| `/api/health` | `application/json` | Health endpoint (`HealthEndpointData`) -- returns `503` when status is `Unhealthy`, `200` otherwise |

 Any other path returns `404 Not Found`.

 ## Health Check Logic

-`HealthCheckService` evaluates bridge health using two rules applied in order:
+`HealthCheckService.CheckHealth` evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, and 2d only fire when the corresponding integration is enabled and a non-null snapshot is passed:

-1. **Unhealthy** -- MXAccess connection state is not `Connected`. Returns a red banner with the current state.
-2. **Degraded** -- Any recorded operation has more than 100 total invocations and a success rate below 50%. Returns a yellow banner identifying the failing operation.
-3. **Healthy** -- All checks pass. Returns a green banner with "All systems operational."
+1. **Rule 1 -- Unhealthy**: MXAccess connection state is not `Connected`. Returns a red banner with the current state.
+2. **Rule 2b -- Degraded**: `Historian.Enabled=true` but the plugin load outcome is not `Loaded`. Returns a yellow banner citing the plugin status (`NotFound`, `LoadFailed`) and the error message if one is available.
+3. **Rule 2 / 2c -- Degraded**: Any recorded operation has a low success rate. The sample threshold depends on the operation category:
+   - Regular operations (`Read`, `Write`, `Subscribe`, `AlarmAcknowledge`): >100 invocations and <50% success rate.
+   - Historian operations (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`): >10 invocations and <50% success rate. The lower threshold surfaces a stuck historian quickly, since history reads are rare relative to live reads.
+4. **Rule 2d -- Degraded (latched)**: `AlarmTrackingEnabled=true` and any alarm acknowledge MXAccess write has failed since startup. Latched on purpose -- an ack write failure is a durable MXAccess write problem that should stay visible until the operator restarts.
+5. **Rule 3 -- Healthy**: All checks pass. Returns a green banner with "All systems operational."

 The `/api/health` endpoint returns `200` for both Healthy and Degraded states, and `503` only for Unhealthy. This allows load balancers or monitoring tools to distinguish between a service that is running but degraded and one that has lost its runtime connection.
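
Reviewer sketch (not part of the commit): the rule ordering and the status-code mapping described in this hunk could look roughly like the following. Every type and member name here (`HealthRulesSketch`, `OperationStats`, `Evaluate`, `ToHttpStatusCode`) is hypothetical; the real `HealthCheckService.CheckHealth` signature is not shown in the diff, and `SuccessRate` is assumed to be a fraction between 0 and 1.

```csharp
using System.Collections.Generic;

// Hypothetical stand-ins; the real HealthCheckService types are not visible in this diff.
public enum HealthStatus { Healthy, Degraded, Unhealthy }

// Assumption: SuccessRate is a fraction in [0, 1].
public sealed record OperationStats(long TotalCount, double SuccessRate);

public static class HealthRulesSketch
{
    // Operation names taken from the documentation text above.
    private static readonly HashSet<string> HistorianOps = new()
    {
        "HistoryReadRaw", "HistoryReadProcessed", "HistoryReadAtTime", "HistoryReadEvents"
    };

    public static HealthStatus Evaluate(
        bool mxAccessConnected,
        bool historianEnabled,
        string pluginStatus,
        IReadOnlyDictionary<string, OperationStats> operations,
        bool alarmTrackingEnabled,
        long ackWriteFailures)
    {
        // Rule 1: a lost MXAccess connection wins over everything else.
        if (!mxAccessConnected) return HealthStatus.Unhealthy;

        // Rule 2b: historian is enabled but the plugin did not load.
        if (historianEnabled && pluginStatus != "Loaded") return HealthStatus.Degraded;

        // Rule 2 / 2c: low success rate, with a lower sample threshold for history reads.
        foreach (var (name, stats) in operations)
        {
            bool isHistorianOp = HistorianOps.Contains(name);
            if (isHistorianOp && !historianEnabled) continue; // 2c only fires when enabled
            long minSamples = isHistorianOp ? 10 : 100;
            if (stats.TotalCount > minSamples && stats.SuccessRate < 0.5)
                return HealthStatus.Degraded;
        }

        // Rule 2d: latched, any ack write failure since startup.
        if (alarmTrackingEnabled && ackWriteFailures > 0) return HealthStatus.Degraded;

        // Rule 3: nothing matched.
        return HealthStatus.Healthy;
    }

    // Only Unhealthy maps to 503; Healthy and Degraded both return 200.
    public static int ToHttpStatusCode(HealthStatus status) =>
        status == HealthStatus.Unhealthy ? 503 : 200;
}
```

Because the first matching rule returns, a disconnected MXAccess client hides any historian or alarm degradation until the connection is restored, which matches the "first rule that matches wins" wording.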
@@ -82,6 +87,51 @@ A dictionary of `MetricsStatistics` keyed by operation name. Each entry contains
 - `SuccessRate` -- fraction of successful operations
 - `AverageMilliseconds`, `MinMilliseconds`, `MaxMilliseconds`, `Percentile95Milliseconds` -- latency distribution

+The instrumented operation names are:
+
+| Name | Source |
+|---|---|
+| `Read` | MXAccess live tag reads (`MxAccessClient.ReadWrite.cs`) |
+| `Write` | MXAccess live tag writes |
+| `Subscribe` | MXAccess subscription attach |
+| `HistoryReadRaw` | `LmxNodeManager.HistoryReadRawModified` -> historian plugin |
+| `HistoryReadProcessed` | `LmxNodeManager.HistoryReadProcessed` -> historian plugin (aggregates) |
+| `HistoryReadAtTime` | `LmxNodeManager.HistoryReadAtTime` -> historian plugin (interpolated) |
+| `HistoryReadEvents` | `LmxNodeManager.HistoryReadEvents` -> historian plugin (alarm/event history) |
+| `AlarmAcknowledge` | `LmxNodeManager.OnAlarmAcknowledge` -> MXAccess AckMsg write |
+
+New operation names are auto-registered on first use, so the `Operations` dictionary only contains entries for features that have actually been exercised since startup.
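
Reviewer sketch (not part of the commit): one common way to get "auto-registered on first use" behaviour is a `ConcurrentDictionary` with `GetOrAdd`. The class names `MetricsRegistrySketch` and `OperationCounters` are hypothetical, and the actual `PerformanceMetrics` implementation may differ.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical per-operation counters; the real MetricsStatistics also tracks latency.
public sealed class OperationCounters
{
    private long _total;
    private long _succeeded;

    public long Total => Interlocked.Read(ref _total);
    public long Succeeded => Interlocked.Read(ref _succeeded);

    public void Record(bool success)
    {
        Interlocked.Increment(ref _total);
        if (success) Interlocked.Increment(ref _succeeded);
    }
}

public sealed class MetricsRegistrySketch
{
    private readonly ConcurrentDictionary<string, OperationCounters> _operations = new();

    // The entry is created the first time a name is recorded, so names that
    // were never exercised do not appear in the snapshot at all.
    public void Record(string operationName, bool success) =>
        _operations.GetOrAdd(operationName, _ => new OperationCounters()).Record(success);

    public IReadOnlyDictionary<string, OperationCounters> Operations => _operations;
}
```

A consequence worth noting for dashboard consumers: a missing key such as `HistoryReadEvents` means "never called since startup", not "zero failures".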
+
+### Historian
+
+`HistorianStatusInfo` -- reflects the outcome of the runtime-loaded historian plugin. See [Historical Data Access](HistoricalDataAccess.md) for the plugin architecture.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `Enabled` | `bool` | Whether `Historian.Enabled` is set in configuration |
+| `PluginStatus` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` |
+| `PluginError` | `string?` | Exception message from the last load attempt when `PluginStatus=LoadFailed`; otherwise `null` |
+| `PluginPath` | `string` | Absolute path the loader probed for the plugin assembly |
+| `ServerName` | `string` | Configured historian hostname |
+| `Port` | `int` | Configured historian TCP port |
+
+### Alarms
+
+`AlarmStatusInfo` -- surfaces alarm-condition tracking health and dispatch counters.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `TrackingEnabled` | `bool` | Whether `OpcUa.AlarmTrackingEnabled` is set in configuration |
+| `ConditionCount` | `int` | Number of distinct alarm conditions currently tracked |
+| `ActiveAlarmCount` | `int` | Number of alarms currently in the `InAlarm=true` state |
+| `TransitionCount` | `long` | Total `InAlarm` transitions observed in the dispatch loop since startup |
+| `AckEventCount` | `long` | Total alarm acknowledgement transitions observed since startup |
+| `AckWriteFailures` | `long` | Total MXAccess AckMsg writes that have failed while processing alarm acknowledges. Any non-zero value latches the service into Degraded (see Rule 2d). |
+
+### Redundancy
+
+`RedundancyInfo` -- only populated when `Redundancy.Enabled=true` in configuration. Shows mode, role, computed service level, application URI, and the set of peer server URIs. See [Redundancy](Redundancy.md) for the full guide.
+
 ### Footer

 | Field | Type | Description |
@@ -89,12 +139,42 @@ A dictionary of `MetricsStatistics` keyed by operation name. Each entry contains
 | `Timestamp` | `DateTime` | UTC time when the snapshot was generated |
 | `Version` | `string` | Service assembly version |

-## HTML Dashboard
+## `/api/health` Payload

-The HTML dashboard uses a monospace font on a dark background with color-coded panels. Each status section renders as a bordered panel whose border color reflects the component state (green, yellow, red, or gray). The operations table shows per-operation latency and success rate statistics.
+The health endpoint returns a `HealthEndpointData` document distinct from the full dashboard snapshot. It is designed for load balancers and external monitoring probes that only need an up/down signal plus component-level detail:

+| Field | Type | Description |
+|-------|------|-------------|
+| `Status` | `string` | `Healthy`, `Degraded`, or `Unhealthy` (drives the HTTP status code) |
+| `ServiceLevel` | `byte` | OPC UA-style 0-255 service level. 255 when healthy non-redundant; 0 when MXAccess is down; redundancy-adjusted otherwise |
+| `RedundancyEnabled` | `bool` | Whether redundancy is configured |
+| `RedundancyRole` | `string?` | `Primary` or `Secondary` when redundancy is enabled; `null` otherwise |
+| `RedundancyMode` | `string?` | `Warm` or `Hot` when redundancy is enabled; `null` otherwise |
+| `Components.MxAccess` | `string` | `Connected` or `Disconnected` |
+| `Components.Database` | `string` | `Connected` or `Disconnected` |
+| `Components.OpcUaServer` | `string` | `Running` or `Stopped` |
+| `Components.Historian` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` -- matches `HistorianStatusInfo.PluginStatus` |
+| `Components.Alarms` | `string` | `Disabled` or `Enabled` -- mirrors `OpcUa.AlarmTrackingEnabled` |
+| `Uptime` | `string` | Formatted service uptime (e.g., `3d 5h 20m`) |
+| `Timestamp` | `DateTime` | UTC time the snapshot was generated |
+
+Monitoring tools should:
+
+- Alert on `Status=Unhealthy` (HTTP 503) for hard outages.
+- Alert on `Status=Degraded` (HTTP 200) for latched or cumulative failures -- a degraded status means the server is still operating but a subsystem needs attention (historian plugin missing, alarm ack writes failing, history read error rate too high, etc.).
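
Reviewer sketch (not part of the commit): a monitoring probe could turn the two alerting rules above into a small classifier over the HTTP status code and response body. `HealthProbeSketch` and `AlertLevel` are hypothetical names. The JSON property casing of the serialized `HealthEndpointData` is not shown in the diff, so the lookup below is deliberately case-insensitive, and treating a payload without a recognizable status as a warning is an assumption.

```csharp
using System;
using System.Text.Json;

public enum AlertLevel { None, Warning, Critical }

public static class HealthProbeSketch
{
    // Classifies one /api/health response. Throws JsonException if the body is not JSON.
    public static AlertLevel Classify(int httpStatusCode, string responseBody)
    {
        // 503 is documented as the hard-outage signal (Status=Unhealthy).
        if (httpStatusCode == 503) return AlertLevel.Critical;

        using JsonDocument doc = JsonDocument.Parse(responseBody);
        if (doc.RootElement.ValueKind != JsonValueKind.Object) return AlertLevel.Warning;

        foreach (JsonProperty prop in doc.RootElement.EnumerateObject())
        {
            // Assumption: the field may be serialized as "Status" or "status".
            if (!string.Equals(prop.Name, "status", StringComparison.OrdinalIgnoreCase)) continue;
            if (prop.Value.ValueKind != JsonValueKind.String) break;

            string value = prop.Value.GetString() ?? "";
            if (string.Equals(value, "Unhealthy", StringComparison.OrdinalIgnoreCase)) return AlertLevel.Critical;
            if (string.Equals(value, "Degraded", StringComparison.OrdinalIgnoreCase)) return AlertLevel.Warning;
            if (string.Equals(value, "Healthy", StringComparison.OrdinalIgnoreCase)) return AlertLevel.None;
            break;
        }

        // Unrecognized payload: surface it rather than silently reporting healthy.
        return AlertLevel.Warning;
    }
}
```

The key point is that a probe keyed only on the HTTP status code will never see `Degraded`, since that state returns `200`; the body has to be inspected.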
+
+## HTML Dashboards
+
+### `/` -- Operator dashboard
+
+Monospace, dark background, color-coded panels. Panels: Connection, Health, Redundancy (when enabled), Subscriptions, Data Change Dispatch, Galaxy Info, **Historian**, **Alarms**, Operations (table), Footer. Each panel border color reflects component state (green, yellow, red, or gray).
+
 The page includes a `<meta http-equiv='refresh'>` tag set to the configured `RefreshIntervalSeconds` (default 10 seconds), so the browser polls automatically without JavaScript.

+### `/health` -- Focused health view
+
+Large status badge, computed `ServiceLevel` value, redundancy summary (when enabled), and a row of component cards: MXAccess, Galaxy Database, OPC UA Server, **Historian**, **Alarm Tracking**. Each card turns red when its component is in a failure state and grey when disabled. Best for wallboards and quick at-a-glance monitoring.
+
 ## Configuration

 The dashboard is configured through the `Dashboard` section in `appsettings.json`:
@@ -113,10 +193,20 @@ Setting `Enabled` to `false` prevents the `StatusWebServer` from starting. The `

 ## Component Wiring

-`StatusReportService` is initialized after all other service components are created. `OpcUaService.Start()` calls `SetComponents()` to supply the live references:
+`StatusReportService` is initialized after all other service components are created. `OpcUaService.Start()` calls `SetComponents()` to supply the live references, including the historian configuration so the dashboard can label the plugin target and evaluate Rule 2b:

 ```csharp
-_statusReport.SetComponents(effectiveMxClient, _metrics, _galaxyStats, _serverHost, _nodeManager);
+StatusReportInstance.SetComponents(
+    effectiveMxClient,
+    Metrics,
+    GalaxyStatsInstance,
+    ServerHost,
+    NodeManagerInstance,
+    _config.Redundancy,
+    _config.OpcUa.ApplicationUri,
+    _config.Historian);
 ```

-This deferred wiring allows the report service to be constructed before the MXAccess client or node manager are fully initialized. If a component is `null`, the report service falls back to default values (e.g., `ConnectionState.Disconnected`, zero counts).
+This deferred wiring allows the report service to be constructed before the MXAccess client or node manager are fully initialized. If a component is `null`, the report service falls back to default values (e.g., `ConnectionState.Disconnected`, zero counts, `HistorianPluginStatus.Disabled`).
+
+The historian plugin status is sourced from `HistorianPluginLoader.LastOutcome`, which is updated on every load attempt. `OpcUaService` explicitly calls `HistorianPluginLoader.MarkDisabled()` when `Historian.Enabled=false` so the dashboard can distinguish "feature off" from "load failed" without ambiguity.
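
Reviewer sketch (not part of the commit): the "last outcome" pattern described above can be illustrated with a static holder that every load attempt overwrites, plus an explicit `MarkDisabled()`. All names ending in `Sketch` are hypothetical, and the `fileExists` and `load` delegates are illustrative seams; the real `HistorianPluginLoader` and `HistorianPluginOutcome` shapes are not shown in this diff.

```csharp
#nullable enable
using System;

public enum PluginStatusSketch { Disabled, NotFound, LoadFailed, Loaded }

// Hypothetical outcome record; field names follow the HistorianStatusInfo table.
public sealed record PluginOutcomeSketch(PluginStatusSketch Status, string PluginPath, string? Error);

public static class PluginLoaderSketch
{
    // Overwritten on every attempt, so readers always see the most recent result.
    private static volatile PluginOutcomeSketch _lastOutcome =
        new(PluginStatusSketch.Disabled, "", null);

    public static PluginOutcomeSketch LastOutcome => _lastOutcome;

    // Called when the feature is switched off, so "off" is never confused with "failed".
    public static void MarkDisabled() =>
        _lastOutcome = new(PluginStatusSketch.Disabled, "", null);

    // fileExists and load are injected so the sketch stays free of file-system access.
    public static bool TryLoad(string pluginPath, Func<string, bool> fileExists, Action<string> load)
    {
        if (!fileExists(pluginPath))
        {
            _lastOutcome = new(PluginStatusSketch.NotFound, pluginPath, null);
            return false;
        }
        try
        {
            load(pluginPath);
            _lastOutcome = new(PluginStatusSketch.Loaded, pluginPath, null);
            return true;
        }
        catch (Exception ex)
        {
            _lastOutcome = new(PluginStatusSketch.LoadFailed, pluginPath, ex.Message);
            return false;
        }
    }
}
```

Publishing an immutable record through a single field means the dashboard thread reads a consistent status, path, and error together rather than three separately updated values.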