# M6 — KPI History & Trends — Design
**Date:** 2026-06-17
**Milestone:** M6 (from `docs/plans/2026-06-15-stillpending-completion-design.md`)
**Status:** Approved — proceed to implementation plan.
## Summary
M6 was originally scoped as Notifications (T9 Teams adapter, T10 `NotificationType` enum
values + UI Type selector, T11 historical/trend KPI charts). During brainstorming the scope
was reshaped:
- **T9 (Teams + other non-Email delivery adapters)** — **DEFERRED to next major version**
(user decision 2026-06-17). The `INotificationDeliveryAdapter` seam already exists; no code
now. Transport choice (Incoming Webhook vs Microsoft Graph) and the Teams list-targeting
model remain to be designed.
- **T10 (`NotificationType` enum values + Central UI list "Type" selector)** — **DEFERRED with
T9.** A Type selector has no purpose until a second delivery type exists.
- **T11 (historical/trend KPI charts)** — promoted from a notifications-only feature into a
**reusable common KPI-history backbone** with trend charts shipped for **all** current KPI
sources.
This supersedes the original `docs/plans/notif.md` "KPI history — point-in-time only … no
separate time-series store is added" decision: the completion plan (T11) explicitly introduces
a store. We keep it in **central MS SQL** (the existing HA store) — no new infra dependency.
## Goal
A reusable central KPI-history backbone — tall/EAV store + periodic recorder + bucketed query
API + a custom SVG trend-chart component — with trend charts delivered for Notification Outbox,
Site Call Audit, Audit Log, and Site Health.
## Reuse map (why a common backbone)
Every source below is **point-in-time only today** (computed on demand, or only the latest kept
in memory). The backbone turns each into a trend with no per-source schema work:
| Source | Metrics (already computed) | Scope | Persisted today? |
|---|---|---|---|
| **Notification Outbox** | queue depth, stuck, parked, delivered/interval, oldest-pending age | global / per-site / per-node | on-demand only |
| **Site Call Audit** | buffered, parked, failed/interval, delivered/interval, oldest-pending, stuck | global / per-site / per-node | on-demand only |
| **Audit Log** | events/hour, error events/hour, backlog total | global | on-demand only |
| **Site Health** | connection up/down, tag-resolution good/bad, script errors, alarm-eval errors, S&F buffer depth, dead-letter count, parked count, deployed/enabled/disabled instances, audit backlog (pending/oldest/bytes), event-log write failures | per-site | only latest, in-memory |
Site Health is the largest latent win — sequence-numbered every 30s, but history is currently
discarded.
## Architecture
### New component: `#26 KpiHistory`
A small central component (mirrors NotificationOutbox / SiteCallAudit / AuditLog being their own
projects). Owns the recorder, options, and the `IKpiSampleSource` abstraction.
**Decoupling.** Each owning component registers its own `IKpiSampleSource` into DI; the recorder
enumerates `IEnumerable<IKpiSampleSource>` (same pattern as `INotificationDeliveryAdapter`). So
KpiHistory does **not** reference every component — each source impl lives with its owner and
calls that owner's existing `Compute…KpisAsync` methods (or, for Site Health, reads the
in-memory aggregator). CLAUDE.md component count goes 25 → 26.
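The decoupling above can be sketched in a few lines. This is a language-neutral illustration in Python, not the project's actual code: the `collect` method name, the `compute_kpis` callable, and the plain list standing in for DI registration are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Protocol


@dataclass(frozen=True)
class KpiSample:
    source: str
    metric: str
    scope: str                 # "Global" / "Site" / "Node"
    scope_key: Optional[str]   # None for Global
    value: float
    captured_at_utc: datetime


class KpiSampleSource(Protocol):
    """Stand-in for IKpiSampleSource; the method name is an assumption."""

    def collect(self, captured_at_utc: datetime) -> list[KpiSample]: ...


class AuditLogKpiSampleSource:
    """Lives with its owning component and wraps that owner's KPI computation."""

    def __init__(self, compute_kpis):
        # Hypothetical callable returning {metric_name: value}.
        self._compute_kpis = compute_kpis

    def collect(self, captured_at_utc):
        return [
            KpiSample("AuditLog", metric, "Global", None, float(value), captured_at_utc)
            for metric, value in self._compute_kpis().items()
        ]


def collect_all(sources, captured_at_utc):
    """The recorder sees only the abstraction, never the owning components."""
    samples = []
    for source in sources:
        samples.extend(source.collect(captured_at_utc))
    return samples


tick = datetime(2026, 6, 17, 12, 0, tzinfo=timezone.utc)
registered = [AuditLogKpiSampleSource(lambda: {"backlogTotal": 7, "errorEventsLastHour": 2})]
samples = collect_all(registered, tick)
```

Adding a new KPI source then means registering one more object in the owning component; the recorder itself does not change.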
### Schema — `KpiSample` (tall / EAV)
Commons entity, EF mapping + migration in ConfigurationDatabase, central MS SQL:
| Column | Type | Notes |
|---|---|---|
| `Id` | `bigint` PK identity | |
| `Source` | `varchar(64)` | `NotificationOutbox` / `SiteCallAudit` / `AuditLog` / `SiteHealth` |
| `Metric` | `varchar(64)` | per-source constant, e.g. `queueDepth`, `parkedCount`, `deadLetterCount` |
| `Scope` | `varchar(16)` | `Global` / `Site` / `Node` |
| `ScopeKey` | `varchar(64)` NULL | site id / node name; `NULL` for `Global` |
| `Value` | `float` | counts are exact up to 2^53 (SQL Server `float` is a 64-bit double); ages stored as **seconds** |
| `CapturedAtUtc` | `datetime2` | recorder tick timestamp (UTC) |
Indexes:
- `IX_KpiSample_Series (Source, Metric, Scope, ScopeKey, CapturedAtUtc)` — per-series range query.
- `IX_KpiSample_Captured (CapturedAtUtc)` — retention purge.
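To make the tall layout and the per-series range query concrete, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for central MS SQL. The column types are SQLite approximations (ISO-8601 text instead of `datetime2`), and the `series` helper and the sample rows are invented for illustration. One detail worth noting: `Global` series carry a `NULL` `ScopeKey`, so the predicate must use `IS NULL` rather than `=`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE KpiSample (
    Id            INTEGER PRIMARY KEY AUTOINCREMENT,
    Source        TEXT NOT NULL,
    Metric        TEXT NOT NULL,
    Scope         TEXT NOT NULL,
    ScopeKey      TEXT NULL,
    Value         REAL NOT NULL,
    CapturedAtUtc TEXT NOT NULL   -- ISO-8601 UTC; sorts chronologically as text
);
CREATE INDEX IX_KpiSample_Series
    ON KpiSample (Source, Metric, Scope, ScopeKey, CapturedAtUtc);
CREATE INDEX IX_KpiSample_Captured ON KpiSample (CapturedAtUtc);
""")

rows = [
    ("NotificationOutbox", "queueDepth", "Site", "site-a", 4.0, "2026-06-17T10:00:00Z"),
    ("NotificationOutbox", "queueDepth", "Site", "site-a", 9.0, "2026-06-17T10:01:00Z"),
    ("NotificationOutbox", "queueDepth", "Site", "site-b", 1.0, "2026-06-17T10:01:00Z"),
    ("NotificationOutbox", "queueDepth", "Global", None, 10.0, "2026-06-17T10:01:00Z"),
]
conn.executemany(
    "INSERT INTO KpiSample (Source, Metric, Scope, ScopeKey, Value, CapturedAtUtc) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)


def series(source, metric, scope, scope_key, from_utc, to_utc):
    """Range query for one series; column order matches IX_KpiSample_Series."""
    if scope_key is None:
        key_clause, key_params = "ScopeKey IS NULL", []
    else:
        key_clause, key_params = "ScopeKey = ?", [scope_key]
    sql = (
        "SELECT CapturedAtUtc, Value FROM KpiSample "
        "WHERE Source = ? AND Metric = ? AND Scope = ? AND " + key_clause + " "
        "AND CapturedAtUtc BETWEEN ? AND ? ORDER BY CapturedAtUtc"
    )
    params = [source, metric, scope] + key_params + [from_utc, to_utc]
    return conn.execute(sql, params).fetchall()


day_start, day_end = "2026-06-17T00:00:00Z", "2026-06-17T23:59:59Z"
site_a = series("NotificationOutbox", "queueDepth", "Site", "site-a", day_start, day_end)
global_series = series("NotificationOutbox", "queueDepth", "Global", None, day_start, day_end)
```

Because every metric is just another `(Source, Metric, Scope, ScopeKey)` combination, adding a metric needs no schema change, which is the point of the tall/EAV shape.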
### Recorder — `KpiHistoryRecorderActor` (central cluster singleton)
Runs on the active central node (consistent with the existing central singletons — Notification
Outbox, Site Call Audit, purge actors). Timer fires every `SampleIntervalSeconds` (default 60s).
Per tick:
1. Open a DI scope (scoped `DbContext`/repository — mirrors `NotificationOutboxActor`'s
scope-per-sweep pattern).
2. Enumerate registered `IKpiSampleSource`s; each returns `IReadOnlyList<KpiSample>` stamped with
the tick's `CapturedAtUtc`.
3. Write all samples via `IKpiHistoryRepository.RecordSamplesAsync`.
**Best-effort (R2).** Each source call and the write are individually wrapped; a failure logs and
skips that source's samples for the tick — it never throws into or disrupts the source component.
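The tick and its best-effort isolation can be sketched as follows. This is an illustrative Python outline only: sources are modelled as plain callables and `write_samples` stands in for `IKpiHistoryRepository.RecordSamplesAsync`; neither reflects the real signatures, and the DI scope and actor timer are omitted.

```python
import logging
from datetime import datetime, timezone

log = logging.getLogger("kpi_history")


def record_tick(sources, write_samples, now_utc):
    """Run one recorder tick and return the number of samples written."""
    batch = []
    for source in sources:
        try:
            # Each source is isolated: a failure drops only its own samples.
            batch.extend(source(now_utc))
        except Exception:
            log.exception("KPI source %r failed; skipping it for this tick", source)
    try:
        write_samples(batch)
    except Exception:
        # The write is wrapped as well, so the tick never raises to its caller.
        log.exception("KPI sample write failed; samples for this tick are lost")
        return 0
    return len(batch)


def outbox_source(ts):
    return [("NotificationOutbox", "queueDepth", "Global", None, 3.0, ts)]


def broken_source(ts):
    raise RuntimeError("simulated source failure")


def audit_source(ts):
    return [("AuditLog", "backlogTotal", "Global", None, 12.0, ts)]


stored = []
tick = datetime(2026, 6, 17, 12, 0, tzinfo=timezone.utc)
written = record_tick([outbox_source, broken_source, audit_source], stored.extend, tick)
```

Here the failing source is logged and skipped while the sources before and after it still contribute their samples to the same tick.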
**Retention.** Daily purge (`PurgeIntervalHours`, default 24) deletes rows older than
`RetentionDays` (default 90), reusing the existing purge-scheduler shape. Hourly downsampling
beyond N days is deferred (YAGNI).
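As a small worked example of the retention rule, the sketch below computes the cutoff and drops aged rows from an in-memory list. The boundary behaviour (a row captured exactly at the cutoff is kept) is an assumption; the design only states that rows older than `RetentionDays` are deleted.

```python
from datetime import datetime, timedelta, timezone


def purge_older_than(samples, now_utc, retention_days=90):
    """samples: list of (metric, captured_at_utc). Returns (kept, deleted_count)."""
    cutoff = now_utc - timedelta(days=retention_days)
    # Assumption: strictly older than the cutoff is deleted; the cutoff itself stays.
    kept = [s for s in samples if s[1] >= cutoff]
    return kept, len(samples) - len(kept)


now = datetime(2026, 6, 17, tzinfo=timezone.utc)
samples = [
    ("queueDepth", now - timedelta(days=91)),   # aged out
    ("queueDepth", now - timedelta(days=90)),   # exactly at the cutoff
    ("queueDepth", now - timedelta(days=1)),    # recent
]
kept, deleted = purge_older_than(samples, now)
```

In the database the same predicate would be a range delete on `CapturedAtUtc`, which is what `IX_KpiSample_Captured` is for.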
### Sample sources
- **`NotificationOutboxKpiSampleSource`** (in NotificationOutbox) →
`queueDepth, stuckCount, parkedCount, deliveredLastInterval, oldestPendingAgeSeconds`;
Global + per-Site + per-Node (reuses the M5 `ComputePerNodeKpisAsync`).
- **`SiteCallAuditKpiSampleSource`** (in SiteCallAudit) →
`buffered, parked, failedLastInterval, deliveredLastInterval, oldestPendingAgeSeconds, stuck`;
Global + per-Site + per-Node.
- **`AuditLogKpiSampleSource`** (in AuditLog) →
`totalEventsLastHour, errorEventsLastHour, backlogTotal`; Global.
- **`SiteHealthKpiSampleSource`** (in HealthMonitoring) → reads
`ICentralHealthAggregator.GetAllSiteStates()` (in-memory, no DB): per-Site
`connectionsUp/Down, tagsGood/Bad, scriptErrors, alarmEvalErrors, sfBufferDepth, deadLetters,
parkedMessages, deployedInstances/enabledInstances/disabledInstances, auditBacklogPending,
eventLogWriteFailures`.
### Query + UI
- **`IKpiHistoryRepository.GetSeriesAsync(source, metric, scope, scopeKey, fromUtc, toUtc,
maxPoints)`** → buckets `[fromUtc, toUtc]` into ≤ `maxPoints` buckets and returns the
**last value per bucket** (`KpiSeriesPoint(BucketStartUtc, Value)`). Last-value is correct for
gauge metrics; v1 ships this single aggregation, and avg/min/max are deferred.
- **`KpiHistoryQueryService`** (CentralUI) — scoped-repo direct read with a dual-ctor test seam,
exactly like `AuditLogQueryService`.
- **`KpiTrendChart.razor`** — reusable custom **SVG** line/area component: polyline path, min/max
+ time-range axis labels, responsive `viewBox`, clean corporate styling (no third-party
charting lib, per CLAUDE.md). The time window (24h / 7d) is owned by the parent page.
- **Surfaces:** trend charts on the **Notification Outbox**, **Site Calls**, and **Audit Log**
pages, plus a per-site trend panel on the **Health dashboard**.
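The last-value-per-bucket reduction behind `GetSeriesAsync` can be illustrated with a short Python sketch. It assumes equal-width buckets, omits empty buckets from the result, and places a sample falling exactly on `toUtc` in the final bucket; the design does not specify these details, so treat them as choices made for the example.

```python
from datetime import datetime, timezone


def bucket_last_value(points, from_utc, to_utc, max_points):
    """points: iterable of (captured_at_utc, value).

    Returns [(bucket_start_utc, last_value)] with at most max_points entries.
    """
    if max_points < 1 or to_utc <= from_utc:
        return []
    width = (to_utc - from_utc) / max_points       # timedelta per bucket
    last = {}
    for ts, value in sorted(points, key=lambda p: p[0]):
        if not (from_utc <= ts <= to_utc):
            continue
        # Clamp so a sample at exactly to_utc lands in the final bucket.
        idx = min(int((ts - from_utc) / width), max_points - 1)
        last[idx] = value                          # later samples overwrite earlier ones
    return [(from_utc + i * width, last[i]) for i in sorted(last)]


def at(hour, minute):
    return datetime(2026, 6, 17, hour, minute, tzinfo=timezone.utc)


raw = [(at(12, 1), 1.0), (at(12, 10), 2.0), (at(12, 20), 5.0),
       (at(12, 59), 7.0), (at(13, 0), 8.0)]
series = bucket_last_value(raw, at(12, 0), at(13, 0), max_points=4)
```

With a one-hour window and `max_points=4`, each bucket is 15 minutes wide; the 12:30 bucket has no samples and is left out, so the chart component would need to decide whether to bridge or break the line across such gaps.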
### Config — `KpiHistoryOptions`
`SampleIntervalSeconds` (60), `RetentionDays` (90), `PurgeIntervalHours` (24),
`DefaultMaxSeriesPoints` (200), with an options validator; bound on the central role in Host.
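A minimal sketch of the options and a validator, using the defaults listed above. The validation rule shown (every value must be positive) is an assumption, since the design does not state which constraints the validator enforces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiHistoryOptions:
    sample_interval_seconds: int = 60
    retention_days: int = 90
    purge_interval_hours: int = 24
    default_max_series_points: int = 200


def validate(options):
    """Return a list of error messages; an empty list means the options are valid."""
    errors = []
    for name, value in vars(options).items():
        # Assumed rule: all four settings must be strictly positive.
        if value <= 0:
            errors.append(f"{name} must be positive, got {value}")
    return errors


default_errors = validate(KpiHistoryOptions())
bad_errors = validate(KpiHistoryOptions(retention_days=0))
```

Failing fast at startup on a bad value keeps a misconfigured interval from surfacing later as a silent gap in the recorded history.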
## Error handling
- Recorder: best-effort, per-source isolation (above). The KPI history is observability, never on
a user-facing critical path.
- Query: a query failure surfaces in the UI as an unavailable chart (em-dash / message), mirroring
how the existing KPI tiles surface transient failures — it never breaks the hosting page.
## Testing
- Recorder writes samples; **best-effort source-failure isolation** (a throwing source does not
abort the tick or other sources).
- Repository range/bucket-query correctness; retention purge deletes only aged rows.
- `KpiTrendChart` SVG render (unit/bUnit-style).
- One Playwright trend-chart UI test (per the M5–M10 testing strategy).
- Targeted tests per task (filtered tests + per-project builds); full-solution build at
integration.
## Docs & deploy
- New `docs/requirements/Component-KpiHistory.md` (#26).
- README component table + CLAUDE.md (25 → 26 + KPI-history bullet).
- Interactions updated in Component-NotificationOutbox / -SiteCallAudit / -AuditLog /
-HealthMonitoring / -CentralUI.
- Update `docs/plans/2026-06-15-stillpending-completion-design.md` (T9/T10 deferred; T11 →
KPI-history backbone).
- EF migration auto-applies in dev; cluster rebuild via `bash docker/deploy.sh` at integration.
## Execution housekeeping
- Work in the dedicated worktree `m6-kpi-history`, branched off **local `main` (639e331,
includes unpushed M5)** — not `origin/main`.
- Implementers commit in **pathspec form** (`git commit -m "…" -- <paths>`) and retry if the
commit fails on `.git/index.lock`.
- Keep concurrent committers to ≤ 2–3 and run a **post-wave HEAD-presence check** per the
concurrent-commit ref-race lesson.
## Deferred (next major version)
- **T9** — Teams (and other non-Email) delivery adapter behind `INotificationDeliveryAdapter`.
- **T10** — `NotificationType` enum values + Central UI list "Type" selector.
## Task list
- **MK-1 (first priority):** Common KPI-history/rollup store — backbone (schema + recorder +
repository + retention + chart component).
- **MK-2:** Notification Outbox trend charts (T11 first consumer) — blocked by MK-1.
- **MK-3:** Trend consumers — Site Call Audit / Audit Log / Site Health — blocked by MK-1.
(With the "charts for all sources" scope, MK-2 and MK-3 together deliver the full UI this
milestone; the implementation plan sequences them.)