# Status Dashboard

## Overview

The service hosts an embedded HTTP status dashboard that surfaces real-time health, connection state, subscription counts, data change throughput, and Galaxy metadata. Operators access it through a browser to verify the bridge is functioning without needing an OPC UA client. The dashboard is enabled by default on port 8081 and can be disabled via configuration.

## HTTP Server

`StatusWebServer` wraps a `System.Net.HttpListener` bound to `http://+:{port}/`. It starts a background task that accepts requests in a loop and dispatches them by path. Only `GET` requests are accepted; all other methods return `405 Method Not Allowed`. Responses include `Cache-Control: no-cache` headers to prevent stale data in the browser.

### Endpoints

| Path | Content-Type | Description |
|------|-------------|-------------|
| `/` | `text/html` | Operator dashboard with auto-refresh |
| `/health` | `text/html` | Focused health page with service-level badge and component cards |
| `/api/status` | `application/json` | Full status snapshot as JSON (`StatusData`) |
| `/api/health` | `application/json` | Health endpoint (`HealthEndpointData`) -- returns `503` when status is `Unhealthy`, `200` otherwise |

Any other path returns `404 Not Found`.

## Health Check Logic

`HealthCheckService.CheckHealth` evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, 2d, and 2e only fire when the corresponding integration is enabled and a non-null snapshot is passed:

1. **Rule 1 -- Unhealthy**: MXAccess connection state is not `Connected`. Returns a red banner with the current state.
2. **Rule 2b -- Degraded**: `Historian.Enabled=true` but the plugin load outcome is not `Loaded`. Returns a yellow banner citing the plugin status (`NotFound`, `LoadFailed`) and the error message if one is available.
3. **Rule 2 / 2c -- Degraded**: Any recorded operation has a low success rate. The sample threshold depends on the operation category:
   - Regular operations (`Read`, `Write`, `Subscribe`, `AlarmAcknowledge`): >100 invocations and <50% success rate.
   - Historian operations (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`): >10 invocations and <50% success rate. The lower threshold surfaces a stuck historian quickly, since history reads are rare relative to live reads.
4. **Rule 2d -- Degraded (latched)**: `AlarmTrackingEnabled=true` and any alarm acknowledge MXAccess write has failed since startup. Latched on purpose -- an ack write failure is a durable MXAccess write problem that should stay visible until the operator restarts.
5. **Rule 2e -- Degraded**: `RuntimeStatus.StoppedCount > 0` -- at least one Galaxy runtime host (`$WinPlatform` / `$AppEngine`) is currently reported Stopped by the runtime probe manager. The rule names the stopped hosts in the message. Ordered after Rule 1 so an MxAccess transport outage stays `Unhealthy` via Rule 1 and this rule never double-messages; the probe manager also forces every entry to `Unknown` when the transport is disconnected, so the `StoppedCount` is always 0 in that case.
6. **Rule 3 -- Healthy**: All checks pass. Returns a green banner with "All systems operational."

The `/api/health` endpoint returns `200` for both Healthy and Degraded states, and `503` only for Unhealthy. This allows load balancers or monitoring tools to distinguish between a service that is running but degraded and one that has lost its runtime connection.

## Status Data Model

`StatusReportService` aggregates data from all bridge components into a `StatusData` DTO, which is then rendered as HTML or serialized to JSON.
The DTO contains the following sections:

### Connection

| Field | Type | Description |
|-------|------|-------------|
| `State` | `string` | Current MXAccess connection state (Connected, Disconnected, Connecting) |
| `ReconnectCount` | `int` | Number of reconnect attempts since startup |
| `ActiveSessions` | `int` | Number of active OPC UA client sessions |

### Health

| Field | Type | Description |
|-------|------|-------------|
| `Status` | `string` | Healthy, Degraded, or Unhealthy |
| `Message` | `string` | Operator-facing explanation |
| `Color` | `string` | CSS color token (green, yellow, red, gray) |

### Subscriptions

| Field | Type | Description |
|-------|------|-------------|
| `ActiveCount` | `int` | Number of active MXAccess tag subscriptions (includes bridge-owned runtime status probes -- see `ProbeCount`) |
| `ProbeCount` | `int` | Subset of `ActiveCount` attributable to bridge-owned runtime status probes (`.ScanState` per deployed `$WinPlatform` / `$AppEngine`). Rendered as a separate `Probes: N (bridge-owned runtime status)` line on the dashboard so operators can distinguish probe overhead from client-driven subscription load |

### Galaxy

| Field | Type | Description |
|-------|------|-------------|
| `GalaxyName` | `string` | Name of the Galaxy being bridged |
| `DbConnected` | `bool` | Whether the Galaxy repository database is reachable |
| `LastDeployTime` | `DateTime?` | Most recent deploy timestamp from the Galaxy |
| `ObjectCount` | `int` | Number of Galaxy objects in the address space |
| `AttributeCount` | `int` | Number of Galaxy attributes as OPC UA variables |
| `LastRebuildTime` | `DateTime?` | UTC timestamp of the last completed address-space rebuild |

### Data change

| Field | Type | Description |
|-------|------|-------------|
| `EventsPerSecond` | `double` | Rate of MXAccess data change events per second |
| `AvgBatchSize` | `double` | Average items processed per dispatch cycle |
| `PendingItems` | `int` | Items waiting in the dispatch queue |
| `TotalEvents` | `long` | Total MXAccess data change events since startup |

### Galaxy Runtime

Populated from the `GalaxyRuntimeProbeManager` that advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine`. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) for the probe machinery, state machine, and the subtree quality invalidation that fires on transitions. Disabled when `MxAccess.RuntimeStatusProbesEnabled = false`; the panel is suppressed entirely from the HTML when `Total == 0`.

| Field | Type | Description |
|-------|------|-------------|
| `Total` | `int` | Number of runtime hosts tracked (Platforms + AppEngines) |
| `RunningCount` | `int` | Hosts whose last probe callback reported `ScanState = true` with Good quality |
| `StoppedCount` | `int` | Hosts whose last probe callback reported `ScanState != true` or a failed item status, or whose initial probe timed out in Unknown state |
| `UnknownCount` | `int` | Hosts still awaiting initial probe resolution, or rewritten to Unknown when the MxAccess transport is Disconnected |
| `Hosts` | `List<GalaxyRuntimeStatus>` | Per-host detail rows, sorted alphabetically by `ObjectName` |

Each `GalaxyRuntimeStatus` entry:

| Field | Type | Description |
|-------|------|-------------|
| `ObjectName` | `string` | Galaxy `tag_name` of the host (e.g., `DevPlatform`, `DevAppEngine`) |
| `GobjectId` | `int` | Galaxy `gobject_id` of the host |
| `Kind` | `string` | `$WinPlatform` or `$AppEngine` |
| `State` | `enum` | `Unknown`, `Running`, or `Stopped` |
| `LastStateCallbackTime` | `DateTime?` | UTC time of the most recent probe callback, whether good or bad |
| `LastStateChangeTime` | `DateTime?` | UTC time of the most recent Running↔Stopped transition; backs the dashboard "Since" column |
| `LastScanState` | `bool?` | Last `ScanState` value received; `null` before the first callback |
| `LastError` | `string?` | Detail message from the most recent failure callback (e.g., `"ScanState = false (OffScan)"`); cleared on successful recovery |
| `GoodUpdateCount` | `long` | Cumulative count of `ScanState = true` callbacks |
| `FailureCount` | `long` | Cumulative count of `ScanState != true` callbacks or failed item statuses |

The HTML panel renders a per-host table with Name / Kind / State / Since / Last Error columns. Panel color reflects aggregate state: green when every host is `Running`, yellow when any host is `Unknown` with zero `Stopped`, red when any host is `Stopped`, gray when the MxAccess transport is disconnected (the Connection panel is the primary signal in that case and every row is force-rewritten to `Unknown`).

### Operations

A dictionary of `MetricsStatistics` keyed by operation name. Each entry contains:

- `TotalCount` -- total invocations
- `SuccessRate` -- fraction of successful operations
- `AverageMilliseconds`, `MinMilliseconds`, `MaxMilliseconds`, `Percentile95Milliseconds` -- latency distribution

The instrumented operation names are:

| Name | Source |
|---|---|
| `Read` | MXAccess live tag reads (`MxAccessClient.ReadWrite.cs`) |
| `Write` | MXAccess live tag writes |
| `Subscribe` | MXAccess subscription attach |
| `HistoryReadRaw` | `LmxNodeManager.HistoryReadRawModified` -> historian plugin |
| `HistoryReadProcessed` | `LmxNodeManager.HistoryReadProcessed` -> historian plugin (aggregates) |
| `HistoryReadAtTime` | `LmxNodeManager.HistoryReadAtTime` -> historian plugin (interpolated) |
| `HistoryReadEvents` | `LmxNodeManager.HistoryReadEvents` -> historian plugin (alarm/event history) |
| `AlarmAcknowledge` | `LmxNodeManager.OnAlarmAcknowledge` -> MXAccess AckMsg write |

New operation names are auto-registered on first use, so the `Operations` dictionary only contains entries for features that have actually been exercised since startup.

### Historian

`HistorianStatusInfo` -- reflects the outcome of the runtime-loaded historian plugin and the runtime query-health counters.
See [Historical Data Access](HistoricalDataAccess.md) for the plugin architecture and the [Runtime Health Counters](HistoricalDataAccess.md#runtime-health-counters) section for the data source instrumentation.

| Field | Type | Description |
|-------|------|-------------|
| `Enabled` | `bool` | Whether `Historian.Enabled` is set in configuration |
| `PluginStatus` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` -- load-time outcome from `HistorianPluginLoader.LastOutcome` |
| `PluginError` | `string?` | Exception message from the last load attempt when `PluginStatus=LoadFailed`; otherwise `null` |
| `PluginPath` | `string` | Absolute path the loader probed for the plugin assembly |
| `ServerName` | `string` | Legacy single-node hostname from `Historian.ServerName`; ignored when `ServerNames` is non-empty |
| `Port` | `int` | Configured historian TCP port |
| `QueryTotal` | `long` | Total historian read queries attempted since startup (raw + aggregate + at-time + events) |
| `QuerySuccesses` | `long` | Queries that completed without an exception |
| `QueryFailures` | `long` | Queries that raised an exception -- each failure also triggers the plugin's reconnect path |
| `ConsecutiveFailures` | `int` | Failures since the last success. Resets to zero on any successful query. Drives the `Degraded` health rule at threshold 3 |
| `LastSuccessTime` | `DateTime?` | UTC timestamp of the most recent successful query, or `null` when no query has succeeded since startup |
| `LastFailureTime` | `DateTime?` | UTC timestamp of the most recent failure |
| `LastQueryError` | `string?` | Exception message from the most recent failure. Prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call failed |
| `ProcessConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **process** silo (historical value queries -- `ReadRaw`, `ReadAggregate`, `ReadAtTime`). See [Two SDK connection silos](HistoricalDataAccess.md#two-sdk-connection-silos) |
| `EventConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **event** silo (alarm history queries -- `ReadEvents`). Separate from the process connection because the SDK requires distinct query channels |
| `ActiveProcessNode` | `string?` | Cluster node currently serving the process silo, or `null` when no process connection is open |
| `ActiveEventNode` | `string?` | Cluster node currently serving the event silo, or `null` when no event connection is open |
| `NodeCount` | `int` | Total configured historian cluster nodes. 1 for a legacy single-node deployment |
| `HealthyNodeCount` | `int` | Nodes currently eligible for new connections (not in failure cooldown) |
| `Nodes` | `List` | Per-node cluster state in configuration order. Each entry carries `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime` |

The operator dashboard renders a cluster table inside the Historian panel when `NodeCount > 1`. Legacy single-node deployments render a compact `Node: ` line and no table. Panel color reflects combined load-time + runtime health: green when everything is fine, yellow when any cluster node is in cooldown or 1-4 consecutive query failures are accumulated, red when the plugin is unloaded / all cluster nodes are failed / 5+ consecutive failures.

### Alarms

`AlarmStatusInfo` -- surfaces alarm-condition tracking health and dispatch counters.

| Field | Type | Description |
|-------|------|-------------|
| `TrackingEnabled` | `bool` | Whether `OpcUa.AlarmTrackingEnabled` is set in configuration |
| `ConditionCount` | `int` | Number of distinct alarm conditions currently tracked |
| `ActiveAlarmCount` | `int` | Number of alarms currently in the `InAlarm=true` state |
| `TransitionCount` | `long` | Total `InAlarm` transitions observed in the dispatch loop since startup |
| `AckEventCount` | `long` | Total alarm acknowledgement transitions observed since startup |
| `AckWriteFailures` | `long` | Total MXAccess AckMsg writes that have failed while processing alarm acknowledges. Any non-zero value latches the service into Degraded (see Rule 2d). |
| `FilterEnabled` | `bool` | Whether `OpcUa.AlarmFilter.ObjectFilters` has any patterns configured |
| `FilterPatternCount` | `int` | Number of compiled filter patterns (after comma-splitting and trimming) |
| `FilterIncludedObjectCount` | `int` | Number of Galaxy objects included by the filter during the most recent address-space build. Zero when the filter is disabled. |

When the filter is active, the operator dashboard's Alarms panel renders an extra line `Filter: N pattern(s), M object(s) included` so operators can verify scope at a glance. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) for the matching rules and resolution algorithm.

### Redundancy

`RedundancyInfo` -- only populated when `Redundancy.Enabled=true` in configuration. Shows mode, role, computed service level, application URI, and the set of peer server URIs. See [Redundancy](Redundancy.md) for the full guide.

### Footer

| Field | Type | Description |
|-------|------|-------------|
| `Timestamp` | `DateTime` | UTC time when the snapshot was generated |
| `Version` | `string` | Service assembly version |

## `/api/health` Payload

The health endpoint returns a `HealthEndpointData` document distinct from the full dashboard snapshot.
It is designed for load balancers and external monitoring probes that only need an up/down signal plus component-level detail:

| Field | Type | Description |
|-------|------|-------------|
| `Status` | `string` | `Healthy`, `Degraded`, or `Unhealthy` (drives the HTTP status code) |
| `ServiceLevel` | `byte` | OPC UA-style 0-255 service level. 255 when healthy non-redundant; 0 when MXAccess is down; redundancy-adjusted otherwise |
| `RedundancyEnabled` | `bool` | Whether redundancy is configured |
| `RedundancyRole` | `string?` | `Primary` or `Secondary` when redundancy is enabled; `null` otherwise |
| `RedundancyMode` | `string?` | `Warm` or `Hot` when redundancy is enabled; `null` otherwise |
| `Components.MxAccess` | `string` | `Connected` or `Disconnected` |
| `Components.Database` | `string` | `Connected` or `Disconnected` |
| `Components.OpcUaServer` | `string` | `Running` or `Stopped` |
| `Components.Historian` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` -- matches `HistorianStatusInfo.PluginStatus` |
| `Components.Alarms` | `string` | `Disabled` or `Enabled` -- mirrors `OpcUa.AlarmTrackingEnabled` |
| `Uptime` | `string` | Formatted service uptime (e.g., `3d 5h 20m`) |
| `Timestamp` | `DateTime` | UTC time the snapshot was generated |

Monitoring tools should:

- Alert on `Status=Unhealthy` (HTTP 503) for hard outages.
- Alert on `Status=Degraded` (HTTP 200) for latched or cumulative failures -- a degraded status means the server is still operating but a subsystem needs attention (historian plugin missing, alarm ack writes failing, history read error rate too high, etc.).

## HTML Dashboards

### `/` -- Operator dashboard

Monospace, dark background, color-coded panels. Panels: Connection, Health, Redundancy (when enabled), Subscriptions, Data Change Dispatch, Galaxy Info, **Historian**, **Alarms**, Operations (table), Footer. Each panel border color reflects component state (green, yellow, red, or gray).
The page includes a `<meta http-equiv="refresh">` tag set to the configured `RefreshIntervalSeconds` (default 10 seconds), so the browser polls automatically without JavaScript.

### `/health` -- Focused health view

Large status badge, computed `ServiceLevel` value, redundancy summary (when enabled), and a row of component cards: MXAccess, Galaxy Database, OPC UA Server, **Historian**, **Alarm Tracking**. Each card turns red when its component is in a failure state and grey when disabled. Best for wallboards and quick at-a-glance monitoring.

## Configuration

The dashboard is configured through the `Dashboard` section in `appsettings.json`:

```json
{
  "Dashboard": {
    "Enabled": true,
    "Port": 8081,
    "RefreshIntervalSeconds": 10
  }
}
```

Setting `Enabled` to `false` prevents the `StatusWebServer` from starting. The `StatusReportService` is still created so that other components can query health programmatically, but no HTTP listener is opened.

### Dashboard start failures are non-fatal

If the dashboard is enabled but the listener cannot bind the configured port (e.g., a previous instance did not clean up, another service is squatting on the port, or the user lacks URL-reservation rights), `StatusWebServer.Start()` logs the listener exception at Error level and returns `false`. `OpcUaService` then logs a Warning, disposes the unstarted instance, sets `DashboardStartFailed = true`, and continues in degraded mode -- the OPC UA endpoint still starts. Operators can detect the failure by searching the service log for:

```
[WRN] Status dashboard failed to bind on port {Port}; service continues without dashboard
```

Stability review 2026-04-13 Finding 2.

## Component Wiring

`StatusReportService` is initialized after all other service components are created.
`OpcUaService.Start()` calls `SetComponents()` to supply the live references, including the historian configuration so the dashboard can label the plugin target and evaluate Rule 2b:

```csharp
StatusReportInstance.SetComponents(
    effectiveMxClient,
    Metrics,
    GalaxyStatsInstance,
    ServerHost,
    NodeManagerInstance,
    _config.Redundancy,
    _config.OpcUa.ApplicationUri,
    _config.Historian);
```

This deferred wiring allows the report service to be constructed before the MXAccess client or node manager are fully initialized. If a component is `null`, the report service falls back to default values (e.g., `ConnectionState.Disconnected`, zero counts, `HistorianPluginStatus.Disabled`).

The historian plugin status is sourced from `HistorianPluginLoader.LastOutcome`, which is updated on every load attempt. `OpcUaService` explicitly calls `HistorianPluginLoader.MarkDisabled()` when `Historian.Enabled=false` so the dashboard can distinguish "feature off" from "load failed" without ambiguity.
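
The deferred-wiring fallback described above can be pictured with a minimal sketch. It is not the actual `StatusReportService` code: `DeferredReportSketch`, `IMxClientSketch`, `FixedMxClientSketch`, and the two-argument `SetComponents` overload are hypothetical names invented for illustration, and the two enums are simplified stand-ins for the real bridge types. It only shows one plausible way to return the documented defaults (`Disconnected`, zero counts, `Disabled`) while a component reference is still `null`.

```csharp
#nullable enable
using System;

// Simplified stand-ins for the bridge enums (assumption: the real types may differ).
public enum ConnectionState { Disconnected, Connecting, Connected }
public enum HistorianPluginStatus { Disabled, NotFound, LoadFailed, Loaded }

// Hypothetical minimal view of the MXAccess client used by the report service.
public interface IMxClientSketch
{
    ConnectionState State { get; }
    int ActiveSubscriptionCount { get; }
}

// Hypothetical fixed-value client, handy for trying the sketch out.
public sealed record FixedMxClientSketch(ConnectionState State, int ActiveSubscriptionCount)
    : IMxClientSketch;

public sealed class DeferredReportSketch
{
    // Both references stay null until SetComponents is called.
    private IMxClientSketch? _mxClient;
    private Func<HistorianPluginStatus>? _pluginStatus;

    public void SetComponents(IMxClientSketch? mxClient, Func<HistorianPluginStatus>? pluginStatus)
    {
        _mxClient = mxClient;
        _pluginStatus = pluginStatus;
    }

    // Null-conditional access combined with ?? yields the default when a component is missing.
    public ConnectionState State => _mxClient?.State ?? ConnectionState.Disconnected;
    public int ActiveSubscriptions => _mxClient?.ActiveSubscriptionCount ?? 0;
    public HistorianPluginStatus PluginStatus => _pluginStatus?.Invoke() ?? HistorianPluginStatus.Disabled;
}
```

Under these assumptions, a freshly constructed `DeferredReportSketch` reports `Disconnected`, `0`, and `Disabled`; after `SetComponents(new FixedMxClientSketch(ConnectionState.Connected, 7), () => HistorianPluginStatus.Loaded)` the same properties reflect the live components. Reading through the references on every access, rather than caching values at wiring time, keeps the snapshot current without requiring a particular construction order.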