From 6084e56c9f090c2e37f0c31e99606afd515a2a19 Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Wed, 17 Jun 2026 05:18:58 -0400
Subject: [PATCH] =?UTF-8?q?docs(m6):=20KPI=20History=20&=20Trends=20design?=
 =?UTF-8?q?=20=E2=80=94=20reusable=20tall/EAV=20KPI-history=20backbone=20+?=
 =?UTF-8?q?=20trend=20charts=20for=20all=20sources;=20T9/T10=20(Teams)=20d?=
 =?UTF-8?q?eferred=20to=20next=20major=20version?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../plans/2026-06-17-m6-kpi-history-design.md | 182 ++++++++++++++++++
 1 file changed, 182 insertions(+)
 create mode 100644 docs/plans/2026-06-17-m6-kpi-history-design.md

diff --git a/docs/plans/2026-06-17-m6-kpi-history-design.md b/docs/plans/2026-06-17-m6-kpi-history-design.md
new file mode 100644
index 00000000..d5f276dc
--- /dev/null
+++ b/docs/plans/2026-06-17-m6-kpi-history-design.md
@@ -0,0 +1,182 @@
+# M6 — KPI History & Trends — Design
+
+**Date:** 2026-06-17
+**Milestone:** M6 (from `docs/plans/2026-06-15-stillpending-completion-design.md`)
+**Status:** Approved — proceed to implementation plan.
+
+## Summary
+
+M6 was originally scoped as Notifications (T9 Teams adapter, T10 `NotificationType` enum
+values + UI Type selector, T11 historical/trend KPI charts). During brainstorming the scope
+was reshaped:
+
+- **T9 (Teams + other non-Email delivery adapters)** — **DEFERRED to next major version**
+  (user decision 2026-06-17). The `INotificationDeliveryAdapter` seam already exists; no code
+  now. Transport choice (Incoming Webhook vs Microsoft Graph) and the Teams list-targeting
+  model remain to be designed.
+- **T10 (`NotificationType` enum values + Central UI list "Type" selector)** — **DEFERRED with
+  T9.** A Type selector has no purpose until a second delivery type exists.
+- **T11 (historical/trend KPI charts)** — promoted from a notifications-only feature into a
+  **reusable common KPI-history backbone** with trend charts shipped for **all** current KPI
+  sources.
+
+This supersedes the original `docs/plans/notif.md` "KPI history — point-in-time only … no
+separate time-series store is added" decision: the completion plan (T11) explicitly introduces
+a store. We keep it in **central MS SQL** (the existing HA store) — no new infra dependency.
+
+## Goal
+
+A reusable central KPI-history backbone — tall/EAV store + periodic recorder + bucketed query
+API + a custom SVG trend-chart component — with trend charts delivered for Notification Outbox,
+Site Call Audit, Audit Log, and Site Health.
+
+## Reuse map (why a common backbone)
+
+Every source below is **point-in-time only today** (computed on demand, or only the latest kept
+in memory). The backbone turns each into a trend with no per-source schema work:
+
+| Source | Metrics (already computed) | Scope | Persisted today? |
+|---|---|---|---|
+| **Notification Outbox** | queue depth, stuck, parked, delivered/interval, oldest-pending age | global / per-site / per-node | on-demand only |
+| **Site Call Audit** | buffered, parked, failed/interval, delivered/interval, oldest-pending, stuck | global / per-site / per-node | on-demand only |
+| **Audit Log** | events/hour, error events/hour, backlog total | global | on-demand only |
+| **Site Health** | connection up/down, tag-resolution good/bad, script errors, alarm-eval errors, S&F buffer depth, dead-letter count, parked count, deployed/enabled/disabled instances, audit backlog (pending/oldest/bytes), event-log write failures | per-site | only latest, in-memory |
+
+Site Health is the largest latent win — sequence-numbered every 30s, but history is currently
+discarded.
+
+## Architecture
+
+### New component: `#26 KpiHistory`
+
+A small central component (mirrors NotificationOutbox / SiteCallAudit / AuditLog being their own
+projects).
+Owns the recorder, options, and the `IKpiSampleSource` abstraction.
+
+**Decoupling.** Each owning component registers its own `IKpiSampleSource` into DI; the recorder
+enumerates `IEnumerable<IKpiSampleSource>` (same pattern as `INotificationDeliveryAdapter`). So
+KpiHistory does **not** reference every component — each source impl lives with its owner and
+calls that owner's existing `Compute…KpisAsync` methods (or, for Site Health, reads the
+in-memory aggregator). CLAUDE.md component count goes 25 → 26.
+
+### Schema — `KpiSample` (tall / EAV)
+
+Commons entity, EF mapping + migration in ConfigurationDatabase, central MS SQL:
+
+| Column | Type | Notes |
+|---|---|---|
+| `Id` | `bigint` PK identity | |
+| `Source` | `varchar(64)` | `NotificationOutbox` / `SiteCallAudit` / `AuditLog` / `SiteHealth` |
+| `Metric` | `varchar(64)` | per-source constant, e.g. `queueDepth`, `parkedCount`, `deadLetterCount` |
+| `Scope` | `varchar(16)` | `Global` / `Site` / `Node` |
+| `ScopeKey` | `varchar(64)` NULL | site id / node name; `NULL` for `Global` |
+| `Value` | `float` | counts exact within range; ages stored as **seconds** |
+| `CapturedAtUtc` | `datetime2` | recorder tick timestamp (UTC) |
+
+Indexes:
+- `IX_KpiSample_Series (Source, Metric, Scope, ScopeKey, CapturedAtUtc)` — per-series range query.
+- `IX_KpiSample_Captured (CapturedAtUtc)` — retention purge.
+
+### Recorder — `KpiHistoryRecorderActor` (central cluster singleton)
+
+Runs on the active central node (consistent with the existing central singletons — Notification
+Outbox, Site Call Audit, purge actors). Timer fires every `SampleIntervalSeconds` (default 60s).
+Per tick:
+
+1. Open a DI scope (scoped `DbContext`/repository — mirrors `NotificationOutboxActor`'s
+   scope-per-sweep pattern).
+2. Enumerate registered `IKpiSampleSource`s; each returns `IReadOnlyList<KpiSample>` stamped with
+   the tick's `CapturedAtUtc`.
+3. Write all samples via `IKpiHistoryRepository.RecordSamplesAsync`.
+
+**Best-effort (R2).** Each source call and the write are individually wrapped; a failure logs and
+skips that source's samples for the tick — it never throws into or disrupts the source component.
+
+**Retention.** Daily purge (`PurgeIntervalHours`, default 24) deletes rows older than
+`RetentionDays` (default 90), reusing the existing purge-scheduler shape. Hourly downsampling
+beyond N days is deferred (YAGNI).
+
+### Sample sources
+
+- **`NotificationOutboxKpiSampleSource`** (in NotificationOutbox) →
+  `queueDepth, stuckCount, parkedCount, deliveredLastInterval, oldestPendingAgeSeconds`;
+  Global + per-Site + per-Node (reuses the M5 `ComputePerNodeKpisAsync`).
+- **`SiteCallAuditKpiSampleSource`** (in SiteCallAudit) →
+  `buffered, parked, failedLastInterval, deliveredLastInterval, oldestPendingAgeSeconds, stuck`;
+  Global + per-Site + per-Node.
+- **`AuditLogKpiSampleSource`** (in AuditLog) →
+  `totalEventsLastHour, errorEventsLastHour, backlogTotal`; Global.
+- **`SiteHealthKpiSampleSource`** (in HealthMonitoring) → reads
+  `ICentralHealthAggregator.GetAllSiteStates()` (in-memory, no DB): per-Site
+  `connectionsUp/Down, tagsGood/Bad, scriptErrors, alarmEvalErrors, sfBufferDepth, deadLetters,
+  parkedMessages, deployedInstances/enabledInstances/disabledInstances, auditBacklogPending,
+  eventLogWriteFailures`.
+
+### Query + UI
+
+- **`IKpiHistoryRepository.GetSeriesAsync(source, metric, scope, scopeKey, fromUtc, toUtc,
+  maxPoints)`** → buckets `[fromUtc, toUtc]` into ≤ `maxPoints` buckets and returns
+  **last-value per bucket** (`KpiSeriesPoint(BucketStartUtc, Value)`). Last-value is correct for
+  gauge metrics; v1 uses one aggregation; avg/min/max are deferred.
+- **`KpiHistoryQueryService`** (CentralUI) — scoped-repo direct read with a dual-ctor test seam,
+  exactly like `AuditLogQueryService`.
+- **`KpiTrendChart.razor`** — reusable custom **SVG** line/area component: polyline path, min/max +
+  time-range axis labels, responsive `viewBox`, clean corporate styling (no third-party
+  charting lib, per CLAUDE.md). The time window (24h / 7d) is owned by the parent page.
+- **Surfaces:** trend charts on the **Notification Outbox**, **Site Calls**, and **Audit Log**
+  pages, plus a per-site trend panel on the **Health dashboard**.
+
+### Config — `KpiHistoryOptions`
+
+`SampleIntervalSeconds` (60), `RetentionDays` (90), `PurgeIntervalHours` (24),
+`DefaultMaxSeriesPoints` (200), with an options validator; bound on the central role in Host.
+
+## Error handling
+
+- Recorder: best-effort, per-source isolation (above). The KPI history is observability, never on
+  a user-facing critical path.
+- Query: a query failure surfaces in the UI as an unavailable chart (em-dash / message), mirroring
+  how the existing KPI tiles surface transient failures — it never breaks the hosting page.
+
+## Testing
+
+- Recorder writes samples; **best-effort source-failure isolation** (a throwing source does not
+  abort the tick or other sources).
+- Repository range/bucket-query correctness; retention purge deletes only aged rows.
+- `KpiTrendChart` SVG render (unit/bUnit-style).
+- One Playwright trend-chart UI test (per the M5–M10 testing strategy).
+- Targeted tests per task (filtered tests + per-project builds); full-solution build at
+  integration.
+
+## Docs & deploy
+
+- New `docs/requirements/Component-KpiHistory.md` (#26).
+- README component table + CLAUDE.md (25 → 26 + KPI-history bullet).
+- Interactions updated in Component-NotificationOutbox / -SiteCallAudit / -AuditLog /
+  -HealthMonitoring / -CentralUI.
+- Update `docs/plans/2026-06-15-stillpending-completion-design.md` (T9/T10 deferred; T11 →
+  KPI-history backbone).
+- EF migration auto-applies in dev; cluster rebuild via `bash docker/deploy.sh` at integration.
+
+## Execution housekeeping
+
+- Work in the dedicated worktree `m6-kpi-history`, branched off **local `main` (639e331,
+  includes unpushed M5)** — not `origin/main`.
+- Implementers commit **pathspec form** (`git commit -m "…" -- `), retry on
+  `.git/index.lock`.
+- Keep concurrent committers to ≤ 2–3 and run a **post-wave HEAD-presence check** per the
+  concurrent-commit ref-race lesson.
+
+## Deferred (next major version)
+
+- **T9** — Teams (and other non-Email) delivery adapter behind `INotificationDeliveryAdapter`.
+- **T10** — `NotificationType` enum values + Central UI list "Type" selector.
+
+## Task list
+
+- **MK-1 (first priority):** Common KPI-history/rollup store — backbone (schema + recorder +
+  repository + retention + chart component).
+- **MK-2:** Notification Outbox trend charts (T11 first consumer) — blocked by MK-1.
+- **MK-3:** Trend consumers — Site Call Audit / Audit Log / Site Health — blocked by MK-1.
+
+(With the "charts for all sources" scope, MK-2 and MK-3 together deliver the full UI this
+milestone; the implementation plan sequences them.)