docs(m6): KPI History & Trends design — reusable tall/EAV KPI-history backbone + trend charts for all sources; T9/T10 (Teams) deferred to next major version
@@ -0,0 +1,182 @@
# M6 — KPI History & Trends — Design

**Date:** 2026-06-17
**Milestone:** M6 (from `docs/plans/2026-06-15-stillpending-completion-design.md`)
**Status:** Approved — proceed to implementation plan.

## Summary

M6 was originally scoped as Notifications (T9 Teams adapter, T10 `NotificationType` enum
values + UI Type selector, T11 historical/trend KPI charts). During brainstorming the scope
was reshaped:

- **T9 (Teams + other non-Email delivery adapters)** — **DEFERRED to next major version**
  (user decision 2026-06-17). The `INotificationDeliveryAdapter` seam already exists; no code
  now. Transport choice (Incoming Webhook vs Microsoft Graph) and the Teams list-targeting
  model remain to be designed.
- **T10 (`NotificationType` enum values + Central UI list "Type" selector)** — **DEFERRED with
  T9.** A Type selector has no purpose until a second delivery type exists.
- **T11 (historical/trend KPI charts)** — promoted from a notifications-only feature into a
  **reusable common KPI-history backbone** with trend charts shipped for **all** current KPI
  sources.

This supersedes the original `docs/plans/notif.md` "KPI history — point-in-time only … no
separate time-series store is added" decision: the completion plan (T11) explicitly introduces
a store. We keep it in **central MS SQL** (the existing HA store) — no new infra dependency.

## Goal

A reusable central KPI-history backbone — tall/EAV store + periodic recorder + bucketed query
API + a custom SVG trend-chart component — with trend charts delivered for Notification Outbox,
Site Call Audit, Audit Log, and Site Health.

## Reuse map (why a common backbone)

Every source below is **point-in-time only today** (computed on demand, or only the latest kept
in memory). The backbone turns each into a trend with no per-source schema work:

| Source | Metrics (already computed) | Scope | Persisted today? |
|---|---|---|---|
| **Notification Outbox** | queue depth, stuck, parked, delivered/interval, oldest-pending age | global / per-site / per-node | on-demand only |
| **Site Call Audit** | buffered, parked, failed/interval, delivered/interval, oldest-pending, stuck | global / per-site / per-node | on-demand only |
| **Audit Log** | events/hour, error events/hour, backlog total | global | on-demand only |
| **Site Health** | connection up/down, tag-resolution good/bad, script errors, alarm-eval errors, S&F buffer depth, dead-letter count, parked count, deployed/enabled/disabled instances, audit backlog (pending/oldest/bytes), event-log write failures | per-site | only latest, in-memory |

Site Health is the largest latent win — sequence-numbered every 30s, but history is currently
discarded.

## Architecture

### New component: `#26 KpiHistory`

A small central component (mirrors NotificationOutbox / SiteCallAudit / AuditLog being their own
projects). Owns the recorder, options, and the `IKpiSampleSource` abstraction.

**Decoupling.** Each owning component registers its own `IKpiSampleSource` into DI; the recorder
enumerates `IEnumerable<IKpiSampleSource>` (same pattern as `INotificationDeliveryAdapter`). So
KpiHistory does **not** reference every component — each source impl lives with its owner and
calls that owner's existing `Compute…KpisAsync` methods (or, for Site Health, reads the
in-memory aggregator). CLAUDE.md component count goes 25 → 26.
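
The decoupling can be illustrated with a small language-neutral sketch (Python here; the real component would presumably be C#). The `collect` method name, the `KpiSample` field names, and the plain list standing in for the DI container are all assumptions made for illustration, not the actual `IKpiSampleSource` contract.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Protocol


@dataclass(frozen=True)
class KpiSample:
    source: str
    metric: str
    scope: str                 # "Global" / "Site" / "Node"
    scope_key: Optional[str]   # None for Global
    value: float
    captured_at_utc: datetime


class KpiSampleSource(Protocol):
    """Stand-in for IKpiSampleSource; the method name is hypothetical."""

    def collect(self, captured_at_utc: datetime) -> list[KpiSample]: ...


class AuditLogSource:
    """Lives with its owning component and wraps that owner's KPI computation."""

    def collect(self, captured_at_utc: datetime) -> list[KpiSample]:
        backlog = 42.0  # placeholder for the owner's already-computed KPI
        return [KpiSample("AuditLog", "backlogTotal", "Global", None,
                          backlog, captured_at_utc)]


# Stand-in for the DI container: owners register, the recorder only enumerates.
registered_sources: list[KpiSampleSource] = [AuditLogSource()]


def collect_all(sources, captured_at_utc):
    """The recorder depends only on the abstraction, never on a concrete owner."""
    return [s for src in sources for s in src.collect(captured_at_utc)]


tick = datetime(2026, 6, 17, 12, 0, tzinfo=timezone.utc)
samples = collect_all(registered_sources, tick)
```

Adding a new source in this sketch means appending one more object to `registered_sources`; `collect_all` does not change.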

### Schema — `KpiSample` (tall / EAV)

Commons entity, EF mapping + migration in ConfigurationDatabase, central MS SQL:

| Column | Type | Notes |
|---|---|---|
| `Id` | `bigint` PK identity | |
| `Source` | `varchar(64)` | `NotificationOutbox` / `SiteCallAudit` / `AuditLog` / `SiteHealth` |
| `Metric` | `varchar(64)` | per-source constant, e.g. `queueDepth`, `parkedCount`, `deadLetterCount` |
| `Scope` | `varchar(16)` | `Global` / `Site` / `Node` |
| `ScopeKey` | `varchar(64)` NULL | site id / node name; `NULL` for `Global` |
| `Value` | `float` | counts are exact up to 2^53 (`float` is 64-bit double precision); ages stored as **seconds** |
| `CapturedAtUtc` | `datetime2` | recorder tick timestamp (UTC) |

Indexes:
- `IX_KpiSample_Series (Source, Metric, Scope, ScopeKey, CapturedAtUtc)` — per-series range query.
- `IX_KpiSample_Captured (CapturedAtUtc)` — retention purge.
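
To make the tall/EAV shape concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for central MS SQL. Column types, the ISO-8601 text timestamps, the sample rows, and the `get_series` helper are assumptions for illustration; the real mapping is an EF entity and migration. Note that SQLite's `IS` is NULL-safe equality, which MS SQL would need to express differently for the `Global` series.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE KpiSample (
    Id            INTEGER PRIMARY KEY AUTOINCREMENT,
    Source        TEXT NOT NULL,
    Metric        TEXT NOT NULL,
    Scope         TEXT NOT NULL,
    ScopeKey      TEXT NULL,
    Value         REAL NOT NULL,
    CapturedAtUtc TEXT NOT NULL   -- ISO-8601 UTC here; datetime2 in MS SQL
);
CREATE INDEX IX_KpiSample_Series
    ON KpiSample (Source, Metric, Scope, ScopeKey, CapturedAtUtc);
CREATE INDEX IX_KpiSample_Captured ON KpiSample (CapturedAtUtc);
""")

# One row per (source, metric, scope, scope key, tick): new metrics need no schema change.
rows = [
    ("NotificationOutbox", "queueDepth", "Global", None, 5.0, "2026-06-17T12:00:00Z"),
    ("NotificationOutbox", "queueDepth", "Global", None, 7.0, "2026-06-17T12:01:00Z"),
    ("NotificationOutbox", "queueDepth", "Site", "site-a", 2.0, "2026-06-17T12:01:00Z"),
    ("SiteHealth", "deadLetters", "Site", "site-a", 1.0, "2026-03-01T00:00:00Z"),
]
conn.executemany(
    "INSERT INTO KpiSample (Source, Metric, Scope, ScopeKey, Value, CapturedAtUtc) "
    "VALUES (?, ?, ?, ?, ?, ?)", rows)


def get_series(source, metric, scope, scope_key, from_utc, to_utc):
    """Per-series range query; its predicates line up with IX_KpiSample_Series."""
    cur = conn.execute(
        "SELECT CapturedAtUtc, Value FROM KpiSample "
        "WHERE Source = ? AND Metric = ? AND Scope = ? AND ScopeKey IS ? "
        "AND CapturedAtUtc >= ? AND CapturedAtUtc <= ? ORDER BY CapturedAtUtc",
        (source, metric, scope, scope_key, from_utc, to_utc))
    return cur.fetchall()


global_depth = get_series("NotificationOutbox", "queueDepth", "Global", None,
                          "2026-06-17T00:00:00Z", "2026-06-18T00:00:00Z")

# Retention purge: the cutoff is 90 days before 2026-06-17, served by IX_KpiSample_Captured.
purged = conn.execute(
    "DELETE FROM KpiSample WHERE CapturedAtUtc < ?", ("2026-03-19T00:00:00Z",)).rowcount
```

The per-site row for `site-a` shares the same `Source` and `Metric` as the global rows but is a separate series, which is why `Scope` and `ScopeKey` sit in the series index ahead of `CapturedAtUtc`.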

### Recorder — `KpiHistoryRecorderActor` (central cluster singleton)

Runs on the active central node (consistent with the existing central singletons — Notification
Outbox, Site Call Audit, purge actors). Timer fires every `SampleIntervalSeconds` (default 60s).
Per tick:

1. Open a DI scope (scoped `DbContext`/repository — mirrors `NotificationOutboxActor`'s
   scope-per-sweep pattern).
2. Enumerate registered `IKpiSampleSource`s; each returns `IReadOnlyList<KpiSample>` stamped with
   the tick's `CapturedAtUtc`.
3. Write all samples via `IKpiHistoryRepository.RecordSamplesAsync`.

**Best-effort (R2).** Each source call and the write are individually wrapped; a failure is logged
and that source's samples are skipped for the tick. It never throws into or disrupts the source component.
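
The per-tick flow and its best-effort isolation can be sketched as follows. This is an illustrative Python outline, not the actor itself: the DI scope from step 1 is omitted, sources are modelled as plain callables, and `run_tick`, `healthy`, and `broken` are hypothetical names.

```python
import logging
from datetime import datetime, timezone

log = logging.getLogger("kpi_history")


def run_tick(sources, write_samples, now=None):
    """One recorder tick: collect from every source, then write, all best-effort."""
    captured_at = now or datetime.now(timezone.utc)
    collected = []
    for source in sources:
        try:
            collected.extend(source(captured_at))
        except Exception:
            # A failing source loses only its own samples for this tick.
            log.exception("KPI source failed; skipping it for this tick")
    try:
        if collected:
            write_samples(collected)
    except Exception:
        # A failed write is logged too; nothing propagates out of the tick.
        log.exception("KPI sample write failed; dropping this tick's samples")
        return 0
    return len(collected)


def healthy(ts):
    return [("AuditLog", "backlogTotal", 3.0, ts)]


def broken(ts):
    raise RuntimeError("source unavailable")


written = []
count = run_tick([broken, healthy], written.extend)
```

In this run the throwing source is logged and skipped, while the healthy source's single sample is still written.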

**Retention.** Daily purge (`PurgeIntervalHours`, default 24) deletes rows older than
`RetentionDays` (default 90), reusing the existing purge-scheduler shape. Hourly downsampling
beyond N days is deferred (YAGNI).

### Sample sources

- **`NotificationOutboxKpiSampleSource`** (in NotificationOutbox) →
  `queueDepth, stuckCount, parkedCount, deliveredLastInterval, oldestPendingAgeSeconds`;
  Global + per-Site + per-Node (reuses the M5 `ComputePerNodeKpisAsync`).
- **`SiteCallAuditKpiSampleSource`** (in SiteCallAudit) →
  `buffered, parked, failedLastInterval, deliveredLastInterval, oldestPendingAgeSeconds, stuck`;
  Global + per-Site + per-Node.
- **`AuditLogKpiSampleSource`** (in AuditLog) →
  `totalEventsLastHour, errorEventsLastHour, backlogTotal`; Global.
- **`SiteHealthKpiSampleSource`** (in HealthMonitoring) → reads
  `ICentralHealthAggregator.GetAllSiteStates()` (in-memory, no DB): per-Site
  `connectionsUp/Down, tagsGood/Bad, scriptErrors, alarmEvalErrors, sfBufferDepth, deadLetters,
  parkedMessages, deployedInstances/enabledInstances/disabledInstances, auditBacklogPending,
  eventLogWriteFailures`.

### Query + UI

- **`IKpiHistoryRepository.GetSeriesAsync(source, metric, scope, scopeKey, fromUtc, toUtc,
  maxPoints)`** → buckets `[fromUtc, toUtc]` into ≤ `maxPoints` buckets and returns
  **last-value per bucket** (`KpiSeriesPoint(BucketStartUtc, Value)`). Last-value is correct for
  gauge metrics; v1 uses this single aggregation, and avg/min/max are deferred.
- **`KpiHistoryQueryService`** (CentralUI) — scoped-repo direct read with a dual-ctor test seam,
  exactly like `AuditLogQueryService`.
- **`KpiTrendChart.razor`** — reusable custom **SVG** line/area component: polyline path, min/max
  + time-range axis labels, responsive `viewBox`, clean corporate styling (no third-party
  charting lib, per CLAUDE.md). The time window (24h / 7d) is owned by the parent page.
- **Surfaces:** trend charts on the **Notification Outbox**, **Site Calls**, and **Audit Log**
  pages, plus a per-site trend panel on the **Health dashboard**.
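
The last-value-per-bucket aggregation described for `GetSeriesAsync` can be sketched in a few lines. This is an in-memory Python illustration under stated assumptions: equal-width buckets, empty buckets omitted from the result, and a sample exactly at `toUtc` clamped into the final bucket. The design does not specify these edge cases, and the real query would run against the repository rather than a list.

```python
from datetime import datetime, timedelta, timezone


def bucket_last_value(samples, from_utc, to_utc, max_points):
    """samples: iterable of (captured_at_utc, value).

    Returns [(bucket_start_utc, value)] with at most max_points entries,
    keeping the latest sample that falls inside each bucket.
    """
    if max_points < 1 or to_utc <= from_utc:
        return []
    width = (to_utc - from_utc) / max_points
    last = {}  # bucket index -> (captured_at, value) of the latest sample seen
    for captured_at, value in samples:
        if not (from_utc <= captured_at <= to_utc):
            continue
        # Clamp so a sample exactly at to_utc lands in the final bucket.
        idx = min(int((captured_at - from_utc) / width), max_points - 1)
        if idx not in last or captured_at >= last[idx][0]:
            last[idx] = (captured_at, value)
    return [(from_utc + i * width, last[i][1]) for i in sorted(last)]


start = datetime(2026, 6, 17, 12, 0, tzinfo=timezone.utc)
raw = [
    (start, 1.0),
    (start + timedelta(minutes=10), 2.0),
    (start + timedelta(minutes=20), 3.0),
    (start + timedelta(minutes=59), 4.0),
]
# One hour split into four 15-minute buckets.
points = bucket_last_value(raw, start, start + timedelta(hours=1), 4)
```

Here the first bucket holds two samples and reports the later one (2.0), and the third bucket is empty and therefore absent. A chart component would need to decide whether to draw across such gaps or break the line.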

### Config — `KpiHistoryOptions`

`SampleIntervalSeconds` (60), `RetentionDays` (90), `PurgeIntervalHours` (24),
`DefaultMaxSeriesPoints` (200), with an options validator; bound on the central role in Host.
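
As a rough illustration of the options and their validator, the sketch below models the four settings with their documented defaults. The "must be positive" rule is an assumption, since the design only states that a validator exists, and the snake_case names are Python stand-ins for the real option properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiHistoryOptions:
    sample_interval_seconds: int = 60
    retention_days: int = 90
    purge_interval_hours: int = 24
    default_max_series_points: int = 200


def validate(options):
    """Return a list of error messages; an empty list means the options are usable."""
    errors = []
    for name, value in vars(options).items():
        # Assumed rule: every setting must be a positive number.
        if value <= 0:
            errors.append(f"{name} must be positive, got {value}")
    return errors


defaults = KpiHistoryOptions()
# Rows retained per series at the defaults: samples per day times retention days.
rows_per_series = (24 * 60 * 60 // defaults.sample_interval_seconds) * defaults.retention_days
```

At the defaults, one series accumulates 129,600 rows before the purge starts removing the oldest, which gives a feel for how table size scales with the number of series.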

## Error handling

- Recorder: best-effort, per-source isolation (above). The KPI history is observability, never on
  a user-facing critical path.
- Query: a query failure surfaces in the UI as an unavailable chart (em-dash / message), mirroring
  how the existing KPI tiles surface transient failures — it never breaks the hosting page.

## Testing

- Recorder writes samples; **best-effort source-failure isolation** (a throwing source does not
  abort the tick or other sources).
- Repository range/bucket-query correctness; retention purge deletes only aged rows.
- `KpiTrendChart` SVG render (unit/bUnit-style).
- One Playwright trend-chart UI test (per the M5–M10 testing strategy).
- Targeted tests per task (filtered tests + per-project builds); full-solution build at
  integration.

## Docs & deploy

- New `docs/requirements/Component-KpiHistory.md` (#26).
- README component table + CLAUDE.md (25 → 26 + KPI-history bullet).
- Interactions updated in Component-NotificationOutbox / -SiteCallAudit / -AuditLog /
  -HealthMonitoring / -CentralUI.
- Update `docs/plans/2026-06-15-stillpending-completion-design.md` (T9/T10 deferred; T11 →
  KPI-history backbone).
- EF migration auto-applies in dev; cluster rebuild via `bash docker/deploy.sh` at integration.

## Execution housekeeping

- Work in the dedicated worktree `m6-kpi-history`, branched off **local `main` (639e331,
  includes unpushed M5)** — not `origin/main`.
- Implementers commit in **pathspec form** (`git commit -m "…" -- <paths>`) and retry on
  `.git/index.lock`.
- Keep concurrent committers to ≤ 2–3 and run a **post-wave HEAD-presence check** per the
  concurrent-commit ref-race lesson.

## Deferred (next major version)

- **T9** — Teams (and other non-Email) delivery adapter behind `INotificationDeliveryAdapter`.
- **T10** — `NotificationType` enum values + Central UI list "Type" selector.

## Task list

- **MK-1 (first priority):** Common KPI-history/rollup store — backbone (schema + recorder +
  repository + retention + chart component).
- **MK-2:** Notification Outbox trend charts (T11 first consumer) — blocked by MK-1.
- **MK-3:** Trend consumers — Site Call Audit / Audit Log / Site Health — blocked by MK-1.

(With the "charts for all sources" scope, MK-2 and MK-3 together deliver the full UI in this
milestone; the implementation plan sequences them.)