# Component: HealthAndMetrics

## Purpose

Provides health checking, performance metrics collection, and an HTTP status dashboard for monitoring the LmxProxy service.

## Location

- `src/ZB.MOM.WW.LmxProxy.Host/Health/HealthCheckService.cs`: basic health check.
- `src/ZB.MOM.WW.LmxProxy.Host/Health/DetailedHealthCheckService.cs`: detailed health check with test tag read.
- `src/ZB.MOM.WW.LmxProxy.Host/Metrics/PerformanceMetrics.cs`: operation metrics collection.
- `src/ZB.MOM.WW.LmxProxy.Host/Status/StatusReportService.cs`: status report generation.
- `src/ZB.MOM.WW.LmxProxy.Host/Status/StatusWebServer.cs`: HTTP status endpoint.

## Responsibilities

- Evaluate service health based on connection state, operation success rates, and test tag reads.
- Track per-operation performance metrics (counts, latencies, percentiles).
- Serve an HTML status dashboard plus JSON status and health-check HTTP endpoints.
- Log a metrics summary at a fixed periodic interval.

## 1. Health Checks

### 1.1 Basic Health Check (HealthCheckService)

`CheckHealthAsync()` evaluates:

| Check | Healthy | Degraded |
|-------|---------|----------|
| MxAccess connected | Yes | n/a |
| Success rate (only evaluated when total operations > 100) | ≥ 50% | < 50% |
| Client count | ≤ 100 | > 100 |

Returns a health data dictionary with the keys `scada_connected`, `scada_connection_state`, `total_clients`, `total_tags`, `total_operations`, and `average_success_rate`.

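The thresholds in the table can be read as a small decision function. The following is a minimal sketch in Python of that logic, not the service's C# code. The type and function names are hypothetical, and the mapping of a lost connection to Unhealthy is an assumption, since the table only documents the Healthy and Degraded thresholds.

```python
from dataclasses import dataclass
from enum import Enum


class HealthStatus(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


@dataclass
class HealthInputs:
    """Hypothetical container mirroring part of the health data dictionary."""
    scada_connected: bool
    total_clients: int
    total_operations: int
    average_success_rate: float  # percentage, 0 to 100


MIN_OPS_FOR_RATE_CHECK = 100  # success rate is only judged above this many operations
MIN_SUCCESS_RATE = 50.0       # percent
MAX_CLIENTS = 100


def evaluate_basic_health(inputs: HealthInputs) -> HealthStatus:
    # Assumption: a lost MxAccess connection is reported as Unhealthy.
    if not inputs.scada_connected:
        return HealthStatus.UNHEALTHY
    # The success rate only counts once more than 100 operations were recorded.
    if (inputs.total_operations > MIN_OPS_FOR_RATE_CHECK
            and inputs.average_success_rate < MIN_SUCCESS_RATE):
        return HealthStatus.DEGRADED
    if inputs.total_clients > MAX_CLIENTS:
        return HealthStatus.DEGRADED
    return HealthStatus.HEALTHY
```

Note that with exactly 100 recorded operations the success rate is not yet evaluated, so a low rate on a freshly started service does not degrade the status.
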
### 1.2 Detailed Health Check (DetailedHealthCheckService)

`CheckHealthAsync()` performs an active probe:

1. Checks `IsConnected` and returns **Unhealthy** if not connected.
2. Reads a test tag (default `System.Heartbeat`).
3. If the test tag quality is not Good, returns **Degraded**.
4. If the test tag timestamp is older than **5 minutes**, returns **Degraded** (stale data detection).
5. Otherwise returns **Healthy**.

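The five probe steps can be sketched as a single function. This is an illustrative Python sketch under stated assumptions: `TagValue`, `check_detailed_health`, and the `read_tag` callback are hypothetical stand-ins for the MxAccess read, and timestamps are assumed to be timezone-aware UTC. Behavior when the read itself fails is not covered, because the steps above do not specify it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

STALE_AFTER = timedelta(minutes=5)


@dataclass
class TagValue:
    """Hypothetical result of a test tag read."""
    quality_good: bool
    timestamp: datetime  # assumed timezone-aware UTC


def check_detailed_health(
    is_connected: bool,
    read_tag: Callable[[str], TagValue],
    test_tag: str = "System.Heartbeat",
    now: Optional[datetime] = None,
) -> str:
    # Step 1: no connection means Unhealthy, and no read is attempted.
    if not is_connected:
        return "Unhealthy"
    # Step 2: active probe of the test tag.
    value = read_tag(test_tag)
    # Step 3: any quality other than Good degrades the result.
    if not value.quality_good:
        return "Degraded"
    # Step 4: a timestamp older than 5 minutes counts as stale data.
    current = now if now is not None else datetime.now(timezone.utc)
    if current - value.timestamp > STALE_AFTER:
        return "Degraded"
    # Step 5: everything passed.
    return "Healthy"
```

Passing `now` explicitly keeps the staleness rule testable without waiting five minutes.
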
## 2. Performance Metrics

### 2.1 Tracking

`PerformanceMetrics` uses a `ConcurrentDictionary<string, OperationMetrics>` to track operations by name.

Operations tracked: `Read`, `ReadBatch`, `Write`, `WriteBatch` (recorded by ScadaGrpcService).

### 2.2 Recording

Two recording patterns:

- `RecordOperation(name, duration, success)`: explicit recording.
- `BeginOperation(name)`: returns a disposable `ITimingScope`. On dispose, the scope automatically records the duration (measured with a `Stopwatch`) and the success flag (set via `SetSuccess(bool)`).

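The timing-scope pattern maps naturally onto a context manager. The sketch below is a Python analogue, not the actual `ITimingScope` implementation: the class, the `record_operation` callback, and the default success value of `True` are assumptions made for illustration.

```python
import time
from typing import Callable, List, Tuple


class TimingScope:
    """Starts timing on creation and reports the result when the scope exits."""

    def __init__(self, name: str, record: Callable[[str, float, bool], None]):
        self._name = name
        self._record = record
        self._success = True  # assumed default; the source does not state it
        self._start = time.perf_counter()

    def set_success(self, success: bool) -> None:
        self._success = success

    def __enter__(self) -> "TimingScope":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        elapsed_ms = (time.perf_counter() - self._start) * 1000.0
        self._record(self._name, elapsed_ms, self._success)
        return False  # never swallow exceptions raised inside the scope


recorded: List[Tuple[str, float, bool]] = []


def record_operation(name: str, duration_ms: float, success: bool) -> None:
    """Stand-in for the explicit recording call."""
    recorded.append((name, duration_ms, success))


# Usage: the duration is captured even though the caller never touches a timer.
with TimingScope("Read", record_operation) as scope:
    scope.set_success(False)  # pretend the read failed
```

The benefit of the scope form is that the duration is recorded on every exit path, including early returns and exceptions.
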
### 2.3 Per-Operation Statistics

`OperationMetrics` maintains:

- `_totalCount`, `_successCount`: running counters.
- `_totalMilliseconds`, `_minMilliseconds`, `_maxMilliseconds`: latency total and range.
- `_durations`: rolling buffer of up to **1000 latency samples** for percentile calculation.

`MetricsStatistics` snapshot:

- `TotalCount`, `SuccessCount`, `SuccessRate` (percentage).
- `AverageMilliseconds`, `MinMilliseconds`, `MaxMilliseconds`.
- `Percentile95Milliseconds`: calculated from sorted samples at the 95th percentile index.

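To make the relationship between the counters, the rolling buffer, and the snapshot concrete, here is a minimal Python sketch. It is not the service's `OperationMetrics` class. In particular, the exact index rule for the 95th percentile is not given in this document, so the nearest-rank rule (`ceil(0.95 * n) - 1`) used below is an assumption.

```python
import threading
from collections import deque


class OperationMetrics:
    MAX_SAMPLES = 1000  # size of the rolling latency buffer

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._total_count = 0
        self._success_count = 0
        self._total_ms = 0.0
        self._min_ms = float("inf")
        self._max_ms = 0.0
        # deque(maxlen=...) silently drops the oldest sample once full.
        self._durations = deque(maxlen=self.MAX_SAMPLES)

    def record(self, duration_ms: float, success: bool) -> None:
        with self._lock:
            self._total_count += 1
            if success:
                self._success_count += 1
            self._total_ms += duration_ms
            self._min_ms = min(self._min_ms, duration_ms)
            self._max_ms = max(self._max_ms, duration_ms)
            self._durations.append(duration_ms)

    def snapshot(self) -> dict:
        with self._lock:
            if self._total_count == 0:
                return {"total_count": 0, "success_count": 0, "success_rate": 0.0,
                        "average_ms": 0.0, "min_ms": 0.0, "max_ms": 0.0, "p95_ms": 0.0}
            ordered = sorted(self._durations)
            # Assumed nearest-rank rule, in integer arithmetic: ceil(0.95 * n) - 1.
            index = (95 * len(ordered) + 99) // 100 - 1
            return {
                "total_count": self._total_count,
                "success_count": self._success_count,
                "success_rate": 100.0 * self._success_count / self._total_count,
                "average_ms": self._total_ms / self._total_count,
                "min_ms": self._min_ms,
                "max_ms": self._max_ms,
                "p95_ms": ordered[index],
            }
```

One consequence of this design is that the counters, average, minimum, and maximum cover the full lifetime of the operation, while the percentile only reflects the most recent 1000 samples.
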
### 2.4 Periodic Reporting

A timer fires every **60 seconds**, logging a summary of all operation metrics to Serilog.

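A fixed-interval reporter of this kind can be sketched with a background thread and a stop event. This is a Python illustration only: the `PeriodicReporter` class and its `report` callback are hypothetical, and the real service uses its own timer and Serilog.

```python
import threading
from typing import Callable


class PeriodicReporter:
    """Calls `report` every `interval_seconds` until stopped."""

    def __init__(self, interval_seconds: float, report: Callable[[], None]):
        self._interval = interval_seconds
        self._report = report
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        # Event.wait returns False on timeout (time to report)
        # and True as soon as stop() has been called.
        while not self._stop.wait(self._interval):
            self._report()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

With the interval from this section the reporter would be created as `PeriodicReporter(60.0, log_summary)`, where `log_summary` is a hypothetical function that writes the current statistics to the log.
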
## 3. Status Web Server

### 3.1 Server

`StatusWebServer` uses `HttpListener` on `http://+:{Port}/` (default port 8080).

- Starts an async request-handling loop, spawning a task per request.
- Graceful shutdown: cancels the listener, waits **5 seconds** for the listener task to exit.
- Returns HTTP 405 for non-GET methods, HTTP 500 on errors.

### 3.2 Endpoints

| Endpoint | Method | Response |
|----------|--------|----------|
| `/` | GET | HTML dashboard (auto-refresh every 30 seconds) |
| `/api/status` | GET | JSON status report (camelCase) |
| `/api/health` | GET | Plain text `OK` (200) or `UNHEALTHY` (503) |

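The routing rules from the endpoint table, together with the 405 response for non-GET methods, can be sketched with Python's standard `http.server` module. This is not the `HttpListener`-based implementation: `get_status_report`, its field names, the placeholder HTML, and the 404 response for unknown paths are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def get_status_report() -> dict:
    """Hypothetical stand-in for the generated status report."""
    return {"serviceName": "LmxProxy", "isHealthy": True}


class StatusHandler(BaseHTTPRequestHandler):
    def _send(self, code: int, content_type: str, body: str) -> None:
        data = body.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self) -> None:
        report = get_status_report()
        if self.path == "/":
            # Placeholder page; the auto-refresh tag matches the 30 second interval.
            html = ('<html><head><meta http-equiv="refresh" content="30"></head>'
                    "<body>status</body></html>")
            self._send(200, "text/html; charset=utf-8", html)
        elif self.path == "/api/status":
            self._send(200, "application/json", json.dumps(report))
        elif self.path == "/api/health":
            healthy = report["isHealthy"]
            self._send(200 if healthy else 503, "text/plain",
                       "OK" if healthy else "UNHEALTHY")
        else:
            # Assumption: unknown paths return 404; the source does not say.
            self._send(404, "text/plain", "Not Found")

    def do_POST(self) -> None:
        self._send(405, "text/plain", "Method Not Allowed")

    # Other write methods get the same 405 response.
    do_PUT = do_DELETE = do_POST

    def log_message(self, format, *args) -> None:
        pass  # keep the sketch quiet


def make_server(port: int = 8080) -> ThreadingHTTPServer:
    # Binding to "" listens on all interfaces, similar to the "+" wildcard prefix.
    return ThreadingHTTPServer(("", port), StatusHandler)
```

Calling `make_server().serve_forever()` would start the sketch on port 8080, and `ThreadingHTTPServer` handles each request on its own thread, which loosely mirrors the task-per-request loop.
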
### 3.3 HTML Dashboard

Generated by `StatusReportService`:

- Bootstrap-like CSS grid layout with status cards.
- Color-coded status: green = Healthy, yellow = Degraded, red = Unhealthy/Error.
- Operations table with columns: Count, SuccessRate, Avg/Min/Max/P95 milliseconds.
- Service metadata: ServiceName, Version (assembly version), connection state.
- Subscription stats: TotalClients, TotalTags, ActiveSubscriptions.
- Auto-refresh via `<meta http-equiv="refresh" content="30">`.
- Last updated timestamp.

### 3.4 JSON Status Report

A nested JSON document with camelCase property names. It contains service metadata, connection status, subscription stats, performance data, and health check results.

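The camelCase convention means that a property such as `ServiceName` appears as `serviceName` in the JSON output. The Python sketch below illustrates that renaming on a nested structure. The report layout and values are hypothetical and only reuse property names mentioned in this document; the actual nesting of the service's report is not specified here.

```python
import json


def to_camel_case(name: str) -> str:
    """Lower-case the first character: 'ServiceName' -> 'serviceName'."""
    return name[:1].lower() + name[1:]


def camelize(value):
    """Recursively rename dictionary keys; lists and scalars pass through."""
    if isinstance(value, dict):
        return {to_camel_case(key): camelize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [camelize(item) for item in value]
    return value


# Hypothetical report layout built from names used in this document.
report = {
    "ServiceName": "LmxProxy",
    "Subscriptions": {"TotalClients": 3, "TotalTags": 120, "ActiveSubscriptions": 5},
    "Operations": [{"Name": "Read", "Count": 42, "SuccessRate": 100.0}],
}

status_json = json.dumps(camelize(report), indent=2)
```
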
## Dependencies

- **MxAccessClient**: `IsConnected` and `ConnectionState` for health checks; test tag read for the detailed check.
- **SubscriptionManager**: subscription statistics.
- **PerformanceMetrics**: operation statistics for status report and health evaluation.
- **Configuration**: `WebServerConfiguration` for port and prefix.

## Interactions

- **GrpcServer** populates PerformanceMetrics via timing scopes on every RPC.
- **ServiceHost** creates all health/metrics/status components at startup and disposes them at shutdown.
- External monitoring systems can poll `/api/health` for availability checks.
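
An external monitor only needs the HTTP status code of `/api/health` to decide availability. The following Python sketch shows one way such a probe could look. The function name and the 2 second timeout are assumptions, and treating an unreachable host as unavailable is a design choice of this sketch, not something the service prescribes.

```python
import urllib.error
import urllib.request


def is_service_healthy(base_url: str, timeout_seconds: float = 2.0) -> bool:
    """Return True only when /api/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # Non-2xx answers such as 503 UNHEALTHY arrive as HTTPError.
        return False
    except OSError:
        # Connection refused, DNS failure, or timeout: treat as unavailable.
        return False
```

A monitoring job could call `is_service_healthy("http://localhost:8080")` on its own schedule, with the host and port adjusted to the deployment.
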