Status Dashboard

Overview

The service hosts an embedded HTTP status dashboard that surfaces real-time health, connection state, subscription counts, data change throughput, and Galaxy metadata. Operators access it through a browser to verify the bridge is functioning without needing an OPC UA client. The dashboard is enabled by default on port 8081 and can be disabled via configuration.

HTTP Server

StatusWebServer wraps a System.Net.HttpListener bound to http://+:{port}/. It starts a background task that accepts requests in a loop and dispatches them by path. Only GET requests are accepted; all other methods return 405 Method Not Allowed. Responses include Cache-Control: no-cache headers to prevent stale data in the browser.

Endpoints

| Path | Content-Type | Description |
| --- | --- | --- |
| / | text/html | Operator dashboard with auto-refresh |
| /health | text/html | Focused health page with service-level badge and component cards |
| /api/status | application/json | Full status snapshot as JSON (StatusData) |
| /api/health | application/json | Health endpoint (HealthEndpointData) -- returns 503 when status is Unhealthy, 200 otherwise |

Any other path returns 404 Not Found.
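
The dispatch rules above can be summarized in a small model. This is a Python sketch for illustration, not the service's C# code: ROUTES and dispatch are hypothetical names, and the sketch assumes the method check runs before the path lookup, which the text does not state.

```python
from http import HTTPStatus

# Path -> content type, copied from the endpoint table above.
ROUTES = {
    "/": "text/html",
    "/health": "text/html",
    "/api/status": "application/json",
    "/api/health": "application/json",
}


def dispatch(method, path, health_status="Healthy"):
    """Return (status_code, content_type) for a request under the documented rules."""
    # Only GET is accepted (assumed to be checked before the path).
    if method != "GET":
        return HTTPStatus.METHOD_NOT_ALLOWED.value, None
    content_type = ROUTES.get(path)
    if content_type is None:
        return HTTPStatus.NOT_FOUND.value, None
    # /api/health is the only route whose status code depends on health.
    if path == "/api/health" and health_status == "Unhealthy":
        return HTTPStatus.SERVICE_UNAVAILABLE.value, content_type
    return HTTPStatus.OK.value, content_type
```

For example, dispatch("GET", "/api/health", "Degraded") yields a 200 with application/json, while the same call with "Unhealthy" yields a 503.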

Health Check Logic

HealthCheckService.CheckHealth evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, 2d, and 2e only fire when the corresponding integration is enabled and a non-null snapshot is passed:

  1. Rule 1 -- Unhealthy: MXAccess connection state is not Connected. Returns a red banner with the current state.
  2. Rule 2b -- Degraded: Historian.Enabled=true but the plugin load outcome is not Loaded. Returns a yellow banner citing the plugin status (NotFound, LoadFailed) and the error message if one is available.
  3. Rule 2 / 2c -- Degraded: Any recorded operation has a low success rate. The sample threshold depends on the operation category:
    • Regular operations (Read, Write, Subscribe, AlarmAcknowledge): >100 invocations and <50% success rate.
    • Historian operations (HistoryReadRaw, HistoryReadProcessed, HistoryReadAtTime, HistoryReadEvents): >10 invocations and <50% success rate. The lower threshold surfaces a stuck historian quickly, since history reads are rare relative to live reads.
  4. Rule 2d -- Degraded (latched): AlarmTrackingEnabled=true and any alarm acknowledge MXAccess write has failed since startup. Latched on purpose -- an ack write failure is a durable MXAccess write problem that should stay visible until the operator restarts.
  5. Rule 2e -- Degraded: RuntimeStatus.StoppedCount > 0, meaning at least one Galaxy runtime host ($WinPlatform / $AppEngine) is currently reported Stopped by the runtime probe manager. The message names the stopped hosts. The rule is ordered after Rule 1, so an MXAccess transport outage stays Unhealthy via Rule 1 and is never reported twice; the probe manager also forces every entry to Unknown when the transport is disconnected, so StoppedCount is always 0 in that case.
  6. Rule 3 -- Healthy: All checks pass. Returns a green banner with "All systems operational."

The /api/health endpoint returns 200 for both Healthy and Degraded states, and 503 only for Unhealthy. This allows load balancers or monitoring tools to distinguish between a service that is running but degraded and one that has lost its runtime connection.
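
The rule ordering and the status-code mapping can be modeled as a single first-match-wins function. The following Python sketch is illustrative only and is not HealthCheckService itself: HealthInputs, check_health, and http_status_for are hypothetical names. It assumes SuccessRate is a 0..1 fraction, that every Degraded result uses a yellow banner, that operation names outside the historian list count as regular operations, and that Rule 2c is skipped when the historian is disabled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Historian operation names taken from the rule text above.
HISTORIAN_OPS = {
    "HistoryReadRaw", "HistoryReadProcessed",
    "HistoryReadAtTime", "HistoryReadEvents",
}


@dataclass
class HealthInputs:
    """Hypothetical flattened view of the values the rules inspect."""
    connection_state: str = "Connected"
    historian_enabled: bool = False
    plugin_status: str = "Disabled"
    # operation name -> (total invocations, success rate as a 0..1 fraction)
    operations: Dict[str, Tuple[int, float]] = field(default_factory=dict)
    alarm_tracking_enabled: bool = False
    ack_write_failures: int = 0
    stopped_hosts: List[str] = field(default_factory=list)


def check_health(s: HealthInputs) -> Tuple[str, str]:
    """Return (status, banner color); the first matching rule wins."""
    # Rule 1: MXAccess is not connected.
    if s.connection_state != "Connected":
        return "Unhealthy", "red"
    # Rule 2b: historian enabled but the plugin did not load.
    if s.historian_enabled and s.plugin_status != "Loaded":
        return "Degraded", "yellow"
    # Rule 2 / 2c: low success rate above a per-category sample threshold.
    for name, (total, success_rate) in s.operations.items():
        if name in HISTORIAN_OPS:
            if not s.historian_enabled:
                continue  # assumption: 2c does not fire with the historian off
            min_samples = 10
        else:
            min_samples = 100
        if total > min_samples and success_rate < 0.5:
            return "Degraded", "yellow"
    # Rule 2d: latched alarm acknowledge write failures.
    if s.alarm_tracking_enabled and s.ack_write_failures > 0:
        return "Degraded", "yellow"
    # Rule 2e: at least one runtime host is reported Stopped.
    if s.stopped_hosts:
        return "Degraded", "yellow"
    # Rule 3: all checks passed.
    return "Healthy", "green"


def http_status_for(status: str) -> int:
    """/api/health returns 503 only for Unhealthy."""
    return 503 if status == "Unhealthy" else 200
```

Note that both thresholds are strict: exactly 100 regular invocations (or exactly 10 historian invocations) do not yet trigger the rule.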

Status Data Model

StatusReportService aggregates data from all bridge components into a StatusData DTO, which is then rendered as HTML or serialized to JSON. The DTO contains the following sections:

Connection

| Field | Type | Description |
| --- | --- | --- |
| State | string | Current MXAccess connection state (Connected, Disconnected, Connecting) |
| ReconnectCount | int | Number of reconnect attempts since startup |
| ActiveSessions | int | Number of active OPC UA client sessions |

Health

| Field | Type | Description |
| --- | --- | --- |
| Status | string | Healthy, Degraded, or Unhealthy |
| Message | string | Operator-facing explanation |
| Color | string | CSS color token (green, yellow, red, gray) |

Subscriptions

| Field | Type | Description |
| --- | --- | --- |
| ActiveCount | int | Number of active MXAccess tag subscriptions (includes bridge-owned runtime status probes; see ProbeCount) |
| ProbeCount | int | Subset of ActiveCount attributable to bridge-owned runtime status probes (<Host>.ScanState per deployed $WinPlatform / $AppEngine). Rendered as a separate Probes: N (bridge-owned runtime status) line on the dashboard so operators can distinguish probe overhead from client-driven subscription load |

Galaxy

| Field | Type | Description |
| --- | --- | --- |
| GalaxyName | string | Name of the Galaxy being bridged |
| DbConnected | bool | Whether the Galaxy repository database is reachable |
| LastDeployTime | DateTime? | Most recent deploy timestamp from the Galaxy |
| ObjectCount | int | Number of Galaxy objects in the address space |
| AttributeCount | int | Number of Galaxy attributes as OPC UA variables |
| LastRebuildTime | DateTime? | UTC timestamp of the last completed address-space rebuild |

Data change

| Field | Type | Description |
| --- | --- | --- |
| EventsPerSecond | double | Rate of MXAccess data change events per second |
| AvgBatchSize | double | Average items processed per dispatch cycle |
| PendingItems | int | Items waiting in the dispatch queue |
| TotalEvents | long | Total MXAccess data change events since startup |

Galaxy Runtime

Populated from the GalaxyRuntimeProbeManager, which advises <Host>.ScanState on every deployed $WinPlatform and $AppEngine. See MXAccess Bridge for the probe machinery, state machine, and the subtree quality invalidation that fires on transitions. Probing is disabled when MxAccess.RuntimeStatusProbesEnabled = false, and the panel is omitted from the HTML entirely when Total == 0.

| Field | Type | Description |
| --- | --- | --- |
| Total | int | Number of runtime hosts tracked (Platforms + AppEngines) |
| RunningCount | int | Hosts whose last probe callback reported ScanState = true with Good quality |
| StoppedCount | int | Hosts whose last probe callback reported ScanState != true or a failed item status, or whose initial probe timed out in Unknown state |
| UnknownCount | int | Hosts still awaiting initial probe resolution, or rewritten to Unknown when the MXAccess transport is Disconnected |
| Hosts | List<GalaxyRuntimeStatus> | Per-host detail rows, sorted alphabetically by ObjectName |

Each GalaxyRuntimeStatus entry:

| Field | Type | Description |
| --- | --- | --- |
| ObjectName | string | Galaxy tag_name of the host (e.g., DevPlatform, DevAppEngine) |
| GobjectId | int | Galaxy gobject_id of the host |
| Kind | string | $WinPlatform or $AppEngine |
| State | enum | Unknown, Running, or Stopped |
| LastStateCallbackTime | DateTime? | UTC time of the most recent probe callback, whether good or bad |
| LastStateChangeTime | DateTime? | UTC time of the most recent Running↔Stopped transition; backs the dashboard "Since" column |
| LastScanState | bool? | Last ScanState value received; null before the first callback |
| LastError | string? | Detail message from the most recent failure callback (e.g., "ScanState = false (OffScan)"); cleared on successful recovery |
| GoodUpdateCount | long | Cumulative count of ScanState = true callbacks |
| FailureCount | long | Cumulative count of ScanState != true callbacks or failed item statuses |

The HTML panel renders a per-host table with Name / Kind / State / Since / Last Error columns. Panel color reflects aggregate state: green when every host is Running, yellow when any host is Unknown and none is Stopped, red when any host is Stopped, and gray when the MXAccess transport is disconnected. In the disconnected case the Connection panel is the primary signal and every row is force-rewritten to Unknown.
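
The aggregate color rule for this panel reduces to a short precedence chain. The Python sketch below is a model of the documented behavior, not the renderer itself; runtime_panel_color is a hypothetical name, and returning None stands in for the panel being suppressed when Total == 0.

```python
from typing import List, Optional


def runtime_panel_color(host_states: List[str],
                        transport_connected: bool = True) -> Optional[str]:
    """Map per-host states (Unknown, Running, Stopped) to a panel color."""
    if not host_states:
        return None  # Total == 0: the panel is not rendered at all
    if not transport_connected:
        return "gray"  # the Connection panel is the primary signal here
    if "Stopped" in host_states:
        return "red"
    if "Unknown" in host_states:
        return "yellow"
    return "green"  # every host is Running
```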

Operations

A dictionary of MetricsStatistics keyed by operation name. Each entry contains:

  • TotalCount -- total invocations
  • SuccessRate -- fraction of successful operations
  • AverageMilliseconds, MinMilliseconds, MaxMilliseconds, Percentile95Milliseconds -- latency distribution

The instrumented operation names are:

| Name | Source |
| --- | --- |
| Read | MXAccess live tag reads (MxAccessClient.ReadWrite.cs) |
| Write | MXAccess live tag writes |
| Subscribe | MXAccess subscription attach |
| HistoryReadRaw | LmxNodeManager.HistoryReadRawModified -> historian plugin |
| HistoryReadProcessed | LmxNodeManager.HistoryReadProcessed -> historian plugin (aggregates) |
| HistoryReadAtTime | LmxNodeManager.HistoryReadAtTime -> historian plugin (interpolated) |
| HistoryReadEvents | LmxNodeManager.HistoryReadEvents -> historian plugin (alarm/event history) |
| AlarmAcknowledge | LmxNodeManager.OnAlarmAcknowledge -> MXAccess AckMsg write |

New operation names are auto-registered on first use, so the Operations dictionary only contains entries for features that have actually been exercised since startup.
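
Auto-registration on first use can be pictured as a dictionary that creates an entry the first time an operation is recorded. The Python sketch below is a simplified stand-in, not the service's metrics class: Metrics, OperationStats, and record are hypothetical names, the success rate is assumed to be a 0..1 fraction, and the 95th-percentile latency is left out.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class OperationStats:
    total_count: int = 0
    success_count: int = 0
    total_ms: float = 0.0
    min_ms: float = float("inf")
    max_ms: float = 0.0

    @property
    def success_rate(self) -> float:
        # Entries are only created by Metrics.record, so total_count >= 1.
        return self.success_count / self.total_count

    @property
    def average_ms(self) -> float:
        return self.total_ms / self.total_count


class Metrics:
    def __init__(self) -> None:
        # Empty at startup: no entry exists until an operation is recorded.
        self.operations: Dict[str, OperationStats] = {}

    def record(self, name: str, duration_ms: float, success: bool) -> None:
        # setdefault registers the operation name on first use.
        stats = self.operations.setdefault(name, OperationStats())
        stats.total_count += 1
        if success:
            stats.success_count += 1
        stats.total_ms += duration_ms
        stats.min_ms = min(stats.min_ms, duration_ms)
        stats.max_ms = max(stats.max_ms, duration_ms)
```

A deployment that never serves a history read would, under this model, simply have no HistoryReadRaw key in the dictionary.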

Historian

HistorianStatusInfo -- reflects the outcome of the runtime-loaded historian plugin and the runtime query-health counters. See Historical Data Access for the plugin architecture and the Runtime Health Counters section for the data source instrumentation.

| Field | Type | Description |
| --- | --- | --- |
| Enabled | bool | Whether Historian.Enabled is set in configuration |
| PluginStatus | string | Disabled, NotFound, LoadFailed, or Loaded; the load-time outcome from HistorianPluginLoader.LastOutcome |
| PluginError | string? | Exception message from the last load attempt when PluginStatus=LoadFailed; otherwise null |
| PluginPath | string | Absolute path the loader probed for the plugin assembly |
| ServerName | string | Legacy single-node hostname from Historian.ServerName; ignored when ServerNames is non-empty |
| Port | int | Configured historian TCP port |
| QueryTotal | long | Total historian read queries attempted since startup (raw + aggregate + at-time + events) |
| QuerySuccesses | long | Queries that completed without an exception |
| QueryFailures | long | Queries that raised an exception; each failure also triggers the plugin's reconnect path |
| ConsecutiveFailures | int | Failures since the last success. Resets to zero on any successful query. Drives the Degraded health rule at threshold 3 |
| LastSuccessTime | DateTime? | UTC timestamp of the most recent successful query, or null when no query has succeeded since startup |
| LastFailureTime | DateTime? | UTC timestamp of the most recent failure |
| LastQueryError | string? | Exception message from the most recent failure. Prefixed with the read-path name (raw:, aggregate:, at-time:, events:) so operators can tell which SDK call failed |
| ProcessConnectionOpen | bool | Whether the plugin currently holds an open SDK connection for the process silo (historical value queries: ReadRaw, ReadAggregate, ReadAtTime). See Two SDK connection silos |
| EventConnectionOpen | bool | Whether the plugin currently holds an open SDK connection for the event silo (alarm history queries: ReadEvents). Separate from the process connection because the SDK requires distinct query channels |
| ActiveProcessNode | string? | Cluster node currently serving the process silo, or null when no process connection is open |
| ActiveEventNode | string? | Cluster node currently serving the event silo, or null when no event connection is open |
| NodeCount | int | Total configured historian cluster nodes. 1 for a legacy single-node deployment |
| HealthyNodeCount | int | Nodes currently eligible for new connections (not in failure cooldown) |
| Nodes | List<HistorianClusterNodeState> | Per-node cluster state in configuration order. Each entry carries Name, IsHealthy, CooldownUntil, FailureCount, LastError, LastFailureTime |

The operator dashboard renders a cluster table inside the Historian panel when NodeCount > 1. Legacy single-node deployments render a compact Node: <hostname> line and no table. Panel color reflects combined load-time and runtime health: green when everything is fine; yellow when any cluster node is in cooldown or 1-4 consecutive query failures have accumulated; red when the plugin is not loaded, all cluster nodes have failed, or 5 or more consecutive failures have accumulated.
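
The panel color rule can be expressed with the fields from the table above. This Python sketch is an interpretation, not the dashboard's own code: historian_panel_color is a hypothetical name, "any node in cooldown" is read as HealthyNodeCount < NodeCount, "all nodes failed" is read as HealthyNodeCount == 0, and the function assumes Historian.Enabled=true (the color used for a disabled historian is not stated in this document).

```python
def historian_panel_color(plugin_status: str,
                          node_count: int,
                          healthy_node_count: int,
                          consecutive_failures: int) -> str:
    """Combine load-time and runtime health into one panel color."""
    # Red conditions are checked first so they take precedence over yellow.
    if (plugin_status != "Loaded"
            or healthy_node_count == 0
            or consecutive_failures >= 5):
        return "red"
    # Yellow: at least one node in cooldown, or 1-4 consecutive failures.
    if healthy_node_count < node_count or consecutive_failures >= 1:
        return "yellow"
    return "green"
```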

Alarms

AlarmStatusInfo -- surfaces alarm-condition tracking health and dispatch counters.

| Field | Type | Description |
| --- | --- | --- |
| TrackingEnabled | bool | Whether OpcUa.AlarmTrackingEnabled is set in configuration |
| ConditionCount | int | Number of distinct alarm conditions currently tracked |
| ActiveAlarmCount | int | Number of alarms currently in the InAlarm=true state |
| TransitionCount | long | Total InAlarm transitions observed in the dispatch loop since startup |
| AckEventCount | long | Total alarm acknowledgement transitions observed since startup |
| AckWriteFailures | long | Total MXAccess AckMsg writes that have failed while processing alarm acknowledges. Any non-zero value latches the service into Degraded (see Rule 2d). |
| FilterEnabled | bool | Whether OpcUa.AlarmFilter.ObjectFilters has any patterns configured |
| FilterPatternCount | int | Number of compiled filter patterns (after comma-splitting and trimming) |
| FilterIncludedObjectCount | int | Number of Galaxy objects included by the filter during the most recent address-space build. Zero when the filter is disabled. |

When the filter is active, the operator dashboard's Alarms panel renders an extra line Filter: N pattern(s), M object(s) included so operators can verify scope at a glance. See Alarm Tracking for the matching rules and resolution algorithm.
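
As a rough illustration of how FilterPatternCount relates to the configured ObjectFilters entries, the Python sketch below splits each entry on commas, trims whitespace, and formats the extra dashboard line. The function names are hypothetical, the example pattern strings are invented, and dropping empty segments is an assumption; the real matching rules live in the Alarm Tracking document.

```python
from typing import List, Optional


def split_patterns(object_filters: List[str]) -> List[str]:
    """Comma-split and trim each configured entry; empty pieces are dropped."""
    patterns = []
    for entry in object_filters:
        for piece in entry.split(","):
            piece = piece.strip()
            if piece:
                patterns.append(piece)
    return patterns


def filter_line(object_filters: List[str], included_objects: int) -> Optional[str]:
    """Return the extra Alarms panel line, or None when no filter is configured."""
    patterns = split_patterns(object_filters)
    if not patterns:
        return None
    return f"Filter: {len(patterns)} pattern(s), {included_objects} object(s) included"
```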

Redundancy

RedundancyInfo -- only populated when Redundancy.Enabled=true in configuration. Shows mode, role, computed service level, application URI, and the set of peer server URIs. See Redundancy for the full guide.

Footer

| Field | Type | Description |
| --- | --- | --- |
| Timestamp | DateTime | UTC time when the snapshot was generated |
| Version | string | Service assembly version |

/api/health Payload

The health endpoint returns a HealthEndpointData document distinct from the full dashboard snapshot. It is designed for load balancers and external monitoring probes that only need an up/down signal plus component-level detail:

| Field | Type | Description |
| --- | --- | --- |
| Status | string | Healthy, Degraded, or Unhealthy (drives the HTTP status code) |
| ServiceLevel | byte | OPC UA-style 0-255 service level. 255 when healthy non-redundant; 0 when MXAccess is down; redundancy-adjusted otherwise |
| RedundancyEnabled | bool | Whether redundancy is configured |
| RedundancyRole | string? | Primary or Secondary when redundancy is enabled; null otherwise |
| RedundancyMode | string? | Warm or Hot when redundancy is enabled; null otherwise |
| Components.MxAccess | string | Connected or Disconnected |
| Components.Database | string | Connected or Disconnected |
| Components.OpcUaServer | string | Running or Stopped |
| Components.Historian | string | Disabled, NotFound, LoadFailed, or Loaded -- matches HistorianStatusInfo.PluginStatus |
| Components.Alarms | string | Disabled or Enabled -- mirrors OpcUa.AlarmTrackingEnabled |
| Uptime | string | Formatted service uptime (e.g., 3d 5h 20m) |
| Timestamp | DateTime | UTC time the snapshot was generated |

Monitoring tools should:

  • Alert on Status=Unhealthy (HTTP 503) for hard outages.
  • Alert on Status=Degraded (HTTP 200) for latched or cumulative failures -- a degraded status means the server is still operating but a subsystem needs attention (historian plugin missing, alarm ack writes failing, history read error rate too high, etc.).
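
A monitoring probe following these two rules might look like the Python sketch below. It is a hypothetical client, not something shipped with the service: fetch and classify are invented names, the severity labels are arbitrary, and the JSON property is assumed to be serialized as Status exactly as in the table above (the actual casing depends on the serializer settings).

```python
import json
import urllib.error
import urllib.request


def fetch(url: str, timeout: float = 5.0):
    """Return (http_status, body). urlopen raises HTTPError for a 503, so catch it."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8")


def classify(http_status: int, body: str) -> str:
    """Map an /api/health response to an alert severity."""
    if http_status == 503:
        return "critical"  # Status=Unhealthy: hard outage
    if http_status != 200:
        return "unknown"   # unexpected response; not covered by the rules above
    status = json.loads(body).get("Status")  # assumed property name
    if status == "Degraded":
        return "warning"   # still serving, but a subsystem needs attention
    if status == "Healthy":
        return "ok"
    return "unknown"
```

With the default port, a probe would call classify(*fetch("http://localhost:8081/api/health")). The HTTP status code alone cannot separate Healthy from Degraded, which is why the body is parsed on a 200.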

HTML Dashboards

/ -- Operator dashboard

Monospace, dark background, color-coded panels. Panels: Connection, Health, Redundancy (when enabled), Subscriptions, Data Change Dispatch, Galaxy Info, Historian, Alarms, Operations (table), Footer. Each panel border color reflects component state (green, yellow, red, or gray).

The page includes a <meta http-equiv='refresh'> tag set to the configured RefreshIntervalSeconds (default 10 seconds), so the browser polls automatically without JavaScript.
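
For reference, a meta refresh tag carries the interval in its content attribute. The Python one-liner below shows the shape of the tag for a given interval; refresh_meta_tag is a hypothetical helper, and the exact markup emitted by the service may differ.

```python
def refresh_meta_tag(refresh_interval_seconds: int = 10) -> str:
    """Build a meta refresh tag that reloads the page every N seconds."""
    return f"<meta http-equiv='refresh' content='{int(refresh_interval_seconds)}'>"
```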

/health -- Focused health view

Large status badge, computed ServiceLevel value, redundancy summary (when enabled), and a row of component cards: MXAccess, Galaxy Database, OPC UA Server, Historian, Alarm Tracking. Each card turns red when its component is in a failure state and gray when disabled. Best for wallboards and quick at-a-glance monitoring.

Configuration

The dashboard is configured through the Dashboard section in appsettings.json:

{
  "Dashboard": {
    "Enabled": true,
    "Port": 8081,
    "RefreshIntervalSeconds": 10
  }
}

Setting Enabled to false prevents the StatusWebServer from starting. The StatusReportService is still created so that other components can query health programmatically, but no HTTP listener is opened.
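
To make the defaults concrete, the Python sketch below reads the Dashboard section from an appsettings.json string and falls back to the documented defaults (enabled, port 8081, 10-second refresh) for missing keys. DashboardConfig and load_dashboard_config are hypothetical names used for illustration; the service itself binds this section in C#.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DashboardConfig:
    enabled: bool = True
    port: int = 8081
    refresh_interval_seconds: int = 10

    @property
    def listener_prefix(self) -> str:
        # Same prefix shape the HTTP Server section describes.
        return f"http://+:{self.port}/"


def load_dashboard_config(appsettings_json: str) -> DashboardConfig:
    """Parse the Dashboard section, applying documented defaults for missing keys."""
    section = json.loads(appsettings_json).get("Dashboard", {})
    return DashboardConfig(
        enabled=bool(section.get("Enabled", True)),
        port=int(section.get("Port", 8081)),
        refresh_interval_seconds=int(section.get("RefreshIntervalSeconds", 10)),
    )
```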

Component Wiring

StatusReportService is initialized after all other service components are created. OpcUaService.Start() calls SetComponents() to supply the live references, including the historian configuration so the dashboard can label the plugin target and evaluate Rule 2b:

StatusReportInstance.SetComponents(
    effectiveMxClient,
    Metrics,
    GalaxyStatsInstance,
    ServerHost,
    NodeManagerInstance,
    _config.Redundancy,
    _config.OpcUa.ApplicationUri,
    _config.Historian);

This deferred wiring allows the report service to be constructed before the MXAccess client or node manager are fully initialized. If a component is null, the report service falls back to default values (e.g., ConnectionState.Disconnected, zero counts, HistorianPluginStatus.Disabled).

The historian plugin status is sourced from HistorianPluginLoader.LastOutcome, which is updated on every load attempt. OpcUaService explicitly calls HistorianPluginLoader.MarkDisabled() when Historian.Enabled=false so the dashboard can distinguish "feature off" from "load failed" without ambiguity.