# Component: HealthAndMetrics

## Purpose

Provides health checking, performance metrics collection, and an HTTP status dashboard for monitoring the LmxProxy service.

## Location

- `src/ZB.MOM.WW.LmxProxy.Host/Health/HealthCheckService.cs` — basic health check.
- `src/ZB.MOM.WW.LmxProxy.Host/Health/DetailedHealthCheckService.cs` — detailed health check with test tag read.
- `src/ZB.MOM.WW.LmxProxy.Host/Metrics/PerformanceMetrics.cs` — operation metrics collection.
- `src/ZB.MOM.WW.LmxProxy.Host/Status/StatusReportService.cs` — status report generation.
- `src/ZB.MOM.WW.LmxProxy.Host/Status/StatusWebServer.cs` — HTTP status endpoint.

## Responsibilities

- Evaluate service health based on connection state, operation success rates, and test tag reads.
- Track per-operation performance metrics (counts, latencies, percentiles).
- Serve an HTML status dashboard and JSON/health HTTP endpoints.
- Report metrics to logs on a periodic interval.

## 1. Health Checks

### 1.1 Basic Health Check (HealthCheckService)

`CheckHealthAsync()` evaluates:

| Check | Healthy | Degraded |
|-------|---------|----------|
| MxAccess connected | Yes | — |
| Success rate (if > 100 total ops) | ≥ 50% | < 50% |
| Client count | ≤ 100 | > 100 |

Returns health data dictionary: `scada_connected`, `scada_connection_state`, `total_clients`, `total_tags`, `total_operations`, `average_success_rate`.

### 1.2 Detailed Health Check (DetailedHealthCheckService)

`CheckHealthAsync()` performs an active probe:

1. Checks `IsConnected` — returns **Unhealthy** if not connected.
2. Reads a test tag (default `System.Heartbeat`).
3. If test tag quality is not Good — returns **Degraded**.
4. If test tag timestamp is older than **5 minutes** — returns **Degraded** (stale data detection).
5. Otherwise returns **Healthy**.

## 2. Performance Metrics

### 2.1 Tracking

`PerformanceMetrics` uses a `ConcurrentDictionary` to track operations by name.
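
As an illustration of name-keyed tracking, the following is a minimal sketch and not the actual `PerformanceMetrics` code. The class names `OperationCounters` and `MetricsRegistrySketch`, the `TryGet` helper, and the parameter types are assumptions made for the example; only the use of a `ConcurrentDictionary` keyed by operation name comes from this document.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical per-operation counters; the real implementation may differ.
public sealed class OperationCounters
{
    private long _totalCount;
    private long _successCount;
    private long _totalTicks;

    public long TotalCount => Interlocked.Read(ref _totalCount);
    public long SuccessCount => Interlocked.Read(ref _successCount);

    // Records one completed operation using lock-free counter updates.
    public void Record(TimeSpan duration, bool success)
    {
        Interlocked.Increment(ref _totalCount);
        if (success) Interlocked.Increment(ref _successCount);
        Interlocked.Add(ref _totalTicks, duration.Ticks);
    }

    // Mean latency in milliseconds, or 0 when nothing has been recorded.
    public double AverageMilliseconds
    {
        get
        {
            long count = TotalCount;
            if (count == 0) return 0.0;
            long ticks = Interlocked.Read(ref _totalTicks);
            return TimeSpan.FromTicks(ticks).TotalMilliseconds / count;
        }
    }
}

// Hypothetical registry keyed by operation name (for example "Read").
public sealed class MetricsRegistrySketch
{
    private readonly ConcurrentDictionary<string, OperationCounters> _operations =
        new ConcurrentDictionary<string, OperationCounters>(StringComparer.Ordinal);

    public void RecordOperation(string name, TimeSpan duration, bool success)
    {
        // GetOrAdd creates the entry on first use. Under contention the factory
        // may run more than once, but only one instance is stored and returned.
        _operations.GetOrAdd(name, _ => new OperationCounters()).Record(duration, success);
    }

    public bool TryGet(string name, out OperationCounters counters)
    {
        return _operations.TryGetValue(name, out counters);
    }
}
```

Keeping the counters in their own class means the dictionary is only touched to look up or create an entry, while the per-call updates go through `Interlocked` and need no lock.
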
Operations tracked: `Read`, `ReadBatch`, `Write`, `WriteBatch` (recorded by ScadaGrpcService).

### 2.2 Recording

Two recording patterns:

- `RecordOperation(name, duration, success)` — explicit recording.
- `BeginOperation(name)` — returns an `ITimingScope` (disposable). On dispose, automatically records duration (via `Stopwatch`) and success flag (set via `SetSuccess(bool)`).

### 2.3 Per-Operation Statistics

`OperationMetrics` maintains:

- `_totalCount`, `_successCount` — running counters.
- `_totalMilliseconds`, `_minMilliseconds`, `_maxMilliseconds` — latency range.
- `_durations` — rolling buffer of up to **1000 latency samples** for percentile calculation.

`MetricsStatistics` snapshot:

- `TotalCount`, `SuccessCount`, `SuccessRate` (percentage).
- `AverageMilliseconds`, `MinMilliseconds`, `MaxMilliseconds`.
- `Percentile95Milliseconds` — calculated from sorted samples at the 95th percentile index.

### 2.4 Periodic Reporting

A timer fires every **60 seconds**, logging a summary of all operation metrics to Serilog.

## 3. Status Web Server

### 3.1 Server

`StatusWebServer` uses `HttpListener` on `http://+:{Port}/` (default port 8080).

- Starts an async request-handling loop, spawning a task per request.
- Graceful shutdown: cancels the listener, waits **5 seconds** for the listener task to exit.
- Returns HTTP 405 for non-GET methods, HTTP 500 on errors.

### 3.2 Endpoints

| Endpoint | Method | Response |
|----------|--------|----------|
| `/` | GET | HTML dashboard (auto-refresh every 30 seconds) |
| `/api/status` | GET | JSON status report (camelCase) |
| `/api/health` | GET | Plain text `OK` (200) or `UNHEALTHY` (503) |

### 3.3 HTML Dashboard

Generated by `StatusReportService`:

- Bootstrap-like CSS grid layout with status cards.
- Color-coded status: green = Healthy, yellow = Degraded, red = Unhealthy/Error.
- Operations table with columns: Count, SuccessRate, Avg/Min/Max/P95 milliseconds.
- Service metadata: ServiceName, Version (assembly version), connection state.
- Subscription stats: TotalClients, TotalTags, ActiveSubscriptions.
- Auto-refresh via `<meta http-equiv="refresh" content="30">`.
- Last updated timestamp.

### 3.4 JSON Status Report

Fully nested structure with camelCase property names:

- Service metadata, connection status, subscription stats, performance data, health check results.

## Dependencies

- **MxAccessClient** — `IsConnected`, `ConnectionState` for health checks; test tag read for detailed check.
- **SubscriptionManager** — subscription statistics.
- **PerformanceMetrics** — operation statistics for status report and health evaluation.
- **Configuration** — `WebServerConfiguration` for port and prefix.

## Interactions

- **GrpcServer** populates PerformanceMetrics via timing scopes on every RPC.
- **ServiceHost** creates all health/metrics/status components at startup and disposes them at shutdown.
- External monitoring systems can poll `/api/health` for availability checks.
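
For the last interaction, an external poller could look roughly like the sketch below. It is an assumption-laden example, not part of LmxProxy: the `ProbeResult` enum and `HealthProbeSketch` class are hypothetical names, and the only facts taken from this document are the `/api/health` path and its `OK` (200) and `UNHEALTHY` (503) responses.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical outcome of one availability probe.
public enum ProbeResult { Available, Unhealthy, Unexpected, Unreachable }

public static class HealthProbeSketch
{
    // Maps the documented status codes (200 and 503) to a probe result.
    // Any other code, such as 500, is reported as Unexpected.
    public static ProbeResult Interpret(HttpStatusCode status)
    {
        switch (status)
        {
            case HttpStatusCode.OK: return ProbeResult.Available;
            case HttpStatusCode.ServiceUnavailable: return ProbeResult.Unhealthy;
            default: return ProbeResult.Unexpected;
        }
    }

    // Sends one GET to the health endpoint and classifies the outcome.
    public static async Task<ProbeResult> ProbeAsync(
        HttpClient client, Uri healthUri, CancellationToken ct)
    {
        try
        {
            using (HttpResponseMessage response =
                await client.GetAsync(healthUri, ct).ConfigureAwait(false))
            {
                return Interpret(response.StatusCode);
            }
        }
        catch (HttpRequestException)
        {
            // Connection refused, DNS failure, and similar transport errors.
            return ProbeResult.Unreachable;
        }
        catch (TaskCanceledException) when (!ct.IsCancellationRequested)
        {
            // The HttpClient timeout elapsed without a caller-requested cancel.
            return ProbeResult.Unreachable;
        }
    }
}
```

A caller might create one `HttpClient` with a short `Timeout` and call `ProbeAsync(client, new Uri("http://localhost:8080/api/health"), CancellationToken.None)` on a schedule; the host name here is a placeholder, and 8080 is the default port noted in section 3.1.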