Files
ScadaBridge/docs/requirements/Component-KpiHistory.md
T

169 lines
13 KiB
Markdown

# Component: KPI History
## Purpose
The KPI History component is the central, reusable **KPI-history backbone** — a tall / EAV time-series store, a periodic recorder singleton, a bucketed query API, and a reusable custom SVG trend-chart component. It turns the system's existing point-in-time KPIs into trends, and ships those trends for the **Notification Outbox (#21)**, **Site Call Audit (#22)**, **Audit Log (#23)**, and **Site Health (#11)** sources.
This supersedes the earlier "KPI history — point-in-time only, no separate time-series store is added" stance carried by the Notification Outbox and Site Call Audit KPI sections. M6 explicitly introduces a store. It lives in **central MS SQL** — the existing HA store — so it adds **no new infrastructure dependency**: a single `KpiSample` table, an EF mapping + migration, and a central cluster singleton that samples every minute.
The backbone is deliberately source-agnostic. Each owning component contributes an `IKpiSampleSource` registered into DI; the recorder enumerates them. KPI History therefore does **not** reference every component, and every source reuses the KPI reads its owner already computes — no per-source schema or storage work.
## Location
- `src/ZB.MOM.WW.ScadaBridge.KpiHistory` — the component project: the `KpiHistoryRecorderActor`, `KpiHistoryOptions` + validator, and the DI/options wiring (`ServiceCollectionExtensions`). It owns the recorder, the options, and consumes the `IKpiSampleSource` abstraction (defined in Commons).
- **`IKpiSampleSource` implementations live with their owners**, not here — `NotificationOutboxKpiSampleSource` (in NotificationOutbox), `SiteCallAuditKpiSampleSource` (in SiteCallAudit), `AuditLogKpiSampleSource` (in AuditLog), `SiteHealthKpiSampleSource` (in HealthMonitoring). Each registers itself via `TryAddEnumerable`.
- **Commons** — the `KpiSample` POCO entity (`Entities/Kpi`), the `IKpiSampleSource` and `IKpiHistoryRepository` interfaces (`Interfaces/Kpi`), and the `KpiSources` / `KpiScopes` constant catalogs + `KpiSeriesPoint` / `KpiSeriesBucketer` types (`Types/Kpi`).
- **Configuration Database** — the EF mapping (`KpiSampleEntityTypeConfiguration`), the migration that creates the `KpiSample` table + indexes, and the `KpiHistoryRepository` implementation.
- **Central UI** — the `KpiHistoryQueryService` query service and the reusable `KpiTrendChart.razor` component, plus the trend sections embedded on four surfaces.
The recorder is a **singleton on the active central node**, consistent with the other central singletons (Notification Outbox, Site Call Audit, purge actors).
## Responsibilities
- Own the `KpiSample` table — the central tall / EAV KPI-history store in MS SQL.
- Run the recorder loop: every `SampleInterval`, enumerate all registered `IKpiSampleSource`s and persist their samples stamped with one shared tick timestamp.
- Isolate sources from one another and from the store: a failure in any one source (or in the write) is logged and skipped for that tick and never disrupts the source component or the rest of the tick (best-effort observability).
- Purge aged rows on a daily cadence (`PurgeInterval`) older than `RetentionDays`.
- Provide a bucketed series-query API (`IKpiHistoryRepository.GetRawSeriesAsync` + `KpiSeriesBucketer`) and the Central UI query service + reusable trend chart that consume it.
KPI History is **observability, never a user-facing critical path** — neither recording nor querying may ever break a hosting page or disrupt a source component.
## Schema — `KpiSample` (tall / EAV)
A persistence-ignorant POCO in Commons; EF mapping + migration in Configuration Database; one table in central MS SQL. One row per `(Source, Metric, Scope, ScopeKey)` per recorder tick:
| Column | Type | Notes |
|---|---|---|
| `Id` | `bigint` PK identity | Surrogate key assigned by the store. |
| `Source` | `varchar(64)` | Owning source — a `KpiSources` constant: `NotificationOutbox` / `SiteCallAudit` / `AuditLog` / `SiteHealth`. |
| `Metric` | `varchar(64)` | Per-source metric name, e.g. `queueDepth`, `parkedCount`, `deadLetters` — drawn from each source's own metric catalog. |
| `Scope` | `varchar(16)` | A `KpiScopes` constant: `Global` / `Site` / `Node`. |
| `ScopeKey` | `varchar(64)` NULL | Site id (for `Site`) or node name (for `Node`); `NULL` for `Global`. |
| `Value` | `float` (`double`) | Counts carried exactly within range; ages stored as **seconds**. |
| `CapturedAtUtc` | `datetime2` | The recorder tick timestamp (UTC), shared across every sample in one tick. |
All timestamps are UTC, consistent with the system-wide convention.
Two named indexes back the access paths:
- **`IX_KpiSample_Series` (`Source`, `Metric`, `Scope`, `ScopeKey`, `CapturedAtUtc`)** — the per-series range query (one series scanned in time order).
- **`IX_KpiSample_Captured` (`CapturedAtUtc`)** — the retention purge.
## Recorder — `KpiHistoryRecorderActor`
The recorder is the Akka.NET cluster singleton **`kpi-history-recorder`** (singleton-manager actor `kpi-history-recorder-singleton`), running on the active central node. It is **not readiness-gated** — the recorder is pure observability and must never gate `/health/ready`, so it is started outside the readiness barrier (unlike the operational singletons). On graceful shutdown it drains via a `CoordinatedShutdown` task for clean singleton handover.
A timer fires every `SampleInterval` (default 60s; an immediate first tick primes the series, then it settles into the periodic cadence). On each tick the recorder:
1. Opens a **per-tick DI scope** (scoped `DbContext`/repository — the same scope-per-sweep pattern as the `NotificationOutboxActor`).
2. Enumerates the registered `IEnumerable<IKpiSampleSource>`. Each source returns an `IReadOnlyList<KpiSample>` stamped with the tick's single `CapturedAtUtc`.
3. Writes all collected samples via `IKpiHistoryRepository.RecordSamplesAsync`.
**Best-effort, per-source isolation.** Each source call and the write are individually guarded. A throwing source is logged and its samples skipped for that tick; it never aborts the tick, the other sources, or the source component itself. This is the same `IEnumerable<>`-of-adapters decoupling pattern used by `INotificationDeliveryAdapter`.
**Retention.** A daily purge timer (`PurgeInterval`, default 24h) deletes rows older than `RetentionDays` (default 90) via `IKpiHistoryRepository.PurgeOlderThanAsync`, reusing the existing purge-scheduler shape. Hourly/longer-range downsampling is deferred (YAGNI).
## Sample Sources
Each `IKpiSampleSource` lives in its owning component and is registered into DI with `TryAddEnumerable` (idempotent, additive). Each reuses the KPI reads its owner already performs — the Notification Outbox / Site Call Audit / Audit Log sources call their owners' existing `Compute…KpisAsync` aggregator reads; the Site Health source reads the in-memory `ICentralHealthAggregator` (no DB read). `Value` carries counts exactly and ages as seconds; all metric names below are the exact shipped strings.
### `NotificationOutboxKpiSampleSource` (in NotificationOutbox)
Scopes: **Global + per-Site + per-Node** (the per-node breakdown reuses the M5 `ComputePerNodeKpisAsync`).
- `queueDepth`
- `stuckCount`
- `parkedCount`
- `deliveredLastInterval`
- `oldestPendingAgeSeconds`
### `SiteCallAuditKpiSampleSource` (in SiteCallAudit)
Scopes: **Global + per-Site + per-Node**.
- `buffered`
- `parked`
- `failedLastInterval`
- `deliveredLastInterval`
- `stuck`
- `oldestPendingAgeSeconds`
### `AuditLogKpiSampleSource` (in AuditLog)
Scope: **Global**.
- `totalEventsLastHour`
- `errorEventsLastHour`
- `backlogTotal`
### `SiteHealthKpiSampleSource` (in HealthMonitoring)
Reads `ICentralHealthAggregator.GetAllSiteStates()` (in-memory, no DB). Scope: **per-Site** — the largest latent win, since Site Health was previously sequence-numbered every 30s but its history discarded.
- `connectionsUp`
- `connectionsDown`
- `scriptErrors`
- `alarmEvalErrors`
- `sfBufferDepth`
- `deadLetters`
- `parkedMessages`
- `deployedInstances`
- `enabledInstances`
- `disabledInstances`
- `auditBacklogPending`
- `eventLogWriteFailures`
## Query + UI
### Bucketed query
`IKpiHistoryRepository.GetRawSeriesAsync(source, metric, scope, scopeKey, fromUtc, toUtc, …)` returns the raw points for one series over `[fromUtc, toUtc]`. `KpiSeriesBucketer.Bucket(raw, fromUtc, toUtc, maxPoints)` then partitions the window into ≤ `maxPoints` time buckets and returns the **last value per bucket** as `KpiSeriesPoint(BucketStartUtc, Value)`. Last-value is correct for gauge metrics; v1 ships exactly one aggregation — avg / min / max are deferred.
### `KpiHistoryQueryService` (Central UI)
A scoped-repository direct read with a **dual-constructor test seam** (one ctor resolves a scoped `IKpiHistoryRepository` per call; the other accepts an injected repository for tests) — the same shape as `AuditLogQueryService`. `GetSeriesAsync` resolves the effective point cap (caller override or `KpiHistoryOptions.DefaultMaxSeriesPoints`), fetches the raw series, and reduces it via `KpiSeriesBucketer`. A query failure surfaces as an unavailable chart (em-dash / message), mirroring how the existing KPI tiles surface transient failures — it never breaks the hosting page.
### `KpiTrendChart.razor` (Central UI)
A reusable **custom inline-SVG** line/area chart — a polyline path with min/max + time-range axis labels, a responsive `viewBox`, and clean corporate styling. There is **no third-party charting library** (per the CLAUDE.md no-third-party-component-framework rule). The time window (e.g. 24h / 7d) is owned by the parent page.
### Surfaces
Trend sections render on four pages, each feeding `KpiTrendChart` from `KpiHistoryQueryService`:
- **Notification Outbox** page — outbox KPI trends.
- **Site Calls** page — cached-call KPI trends.
- **Audit Log** page — audit volume / error / backlog trends.
- **Health dashboard** — a per-site Site Health trend panel.
## Configuration — `KpiHistoryOptions`
Bound from the `ScadaBridge:KpiHistory` section on the central host (Options pattern), validated on startup by `KpiHistoryOptionsValidator`:
| Option | Default | Notes |
|---|---|---|
| `SampleInterval` | `60s` | Recorder tick cadence. Must be `> 0`. |
| `RetentionDays` | `90` | Rows older than this are purged. Bounded to `[1, 3650]` days. |
| `PurgeInterval` | `1d` | Daily purge cadence. Must be `> 0`. |
| `DefaultMaxSeriesPoints` | `200` | Default bucket cap for a series query when the caller does not override it. Bounded to `[2, 5000]`. |
Validation fails fast at startup on a non-positive `SampleInterval` / `PurgeInterval` (which would stall the recorder / purge), an out-of-range `RetentionDays` (too short loses history; too long defeats retention), or an out-of-range `DefaultMaxSeriesPoints`.
## Dependencies
- **Commons**: defines the `KpiSample` entity, the `IKpiSampleSource` and `IKpiHistoryRepository` interfaces, the `KpiSources` / `KpiScopes` catalogs, and the `KpiSeriesPoint` / `KpiSeriesBucketer` query types.
- **Configuration Database**: hosts the `KpiSample` table, its EF mapping, the migration, and the `KpiHistoryRepository` implementation.
- **Cluster Infrastructure**: hosts the `kpi-history-recorder` cluster singleton with active/standby failover.
- **Host**: binds `KpiHistoryOptions`, registers the component on the central role, and starts the recorder singleton **outside** the readiness barrier.
- **Notification Outbox / Site Call Audit / Audit Log / Health Monitoring**: each contributes an `IKpiSampleSource` and the KPI/aggregator reads it reuses. KPI History depends on the `IKpiSampleSource` abstraction, not on these components directly.
- **Central UI**: hosts `KpiHistoryQueryService` and the `KpiTrendChart` component.
## Interactions
- **Notification Outbox (#21)**: registers `NotificationOutboxKpiSampleSource` (Global / Site / Node), sampled each recorder tick; its trends render on the Notification Outbox page.
- **Site Call Audit (#22)**: registers `SiteCallAuditKpiSampleSource` (Global / Site / Node); its trends render on the Site Calls page.
- **Audit Log (#23)**: registers `AuditLogKpiSampleSource` (Global); its trends render on the Audit Log page.
- **Health Monitoring (#11)**: registers `SiteHealthKpiSampleSource` (per-Site), reading the in-memory central health aggregator; its trends render in the Health dashboard's per-site panel.
- **Central UI (#9)**: renders the reusable `KpiTrendChart` fed by `KpiHistoryQueryService` across the four trend surfaces; a query failure degrades to an unavailable-chart placeholder rather than breaking the page.
- **Cluster Infrastructure (#13)**: provides the active/standby singleton hosting for the recorder, which drains on `CoordinatedShutdown` for clean handover.