M6 — KPI History & Trends — Design
Date: 2026-06-17
Milestone: M6 (from docs/plans/2026-06-15-stillpending-completion-design.md)
Status: Approved — proceed to implementation plan.
Summary
M6 was originally scoped as Notifications (T9 Teams adapter, T10 NotificationType enum
values + UI Type selector, T11 historical/trend KPI charts). During brainstorming the scope
was reshaped:
- T9 (Teams + other non-Email delivery adapters): DEFERRED to next major version (user decision 2026-06-17). The INotificationDeliveryAdapter seam already exists; no code now. Transport choice (Incoming Webhook vs Microsoft Graph) and the Teams list-targeting model remain to be designed.
- T10 (NotificationType enum values + Central UI list "Type" selector): DEFERRED with T9. A Type selector has no purpose until a second delivery type exists.
- T11 (historical/trend KPI charts): promoted from a notifications-only feature into a reusable common KPI-history backbone with trend charts shipped for all current KPI sources.
This supersedes the original docs/plans/notif.md "KPI history — point-in-time only … no
separate time-series store is added" decision: the completion plan (T11) explicitly introduces
a store. We keep it in central MS SQL (the existing HA store) — no new infra dependency.
Goal
A reusable central KPI-history backbone — tall/EAV store + periodic recorder + bucketed query API + a custom SVG trend-chart component — with trend charts delivered for Notification Outbox, Site Call Audit, Audit Log, and Site Health.
Reuse map (why a common backbone)
Every source below is point-in-time only today (computed on demand, or only the latest kept in memory). The backbone turns each into a trend with no per-source schema work:
| Source | Metrics (already computed) | Scope | Persisted today? |
|---|---|---|---|
| Notification Outbox | queue depth, stuck, parked, delivered/interval, oldest-pending age | global / per-site / per-node | on-demand only |
| Site Call Audit | buffered, parked, failed/interval, delivered/interval, oldest-pending, stuck | global / per-site / per-node | on-demand only |
| Audit Log | events/hour, error events/hour, backlog total | global | on-demand only |
| Site Health | connection up/down, tag-resolution good/bad, script errors, alarm-eval errors, S&F buffer depth, dead-letter count, parked count, deployed/enabled/disabled instances, audit backlog (pending/oldest/bytes), event-log write failures | per-site | only latest, in-memory |
Site Health is the largest latent win — sequence-numbered every 30s, but history is currently discarded.
Architecture
New component: #26 KpiHistory
A small central component (mirrors NotificationOutbox / SiteCallAudit / AuditLog being their own
projects). Owns the recorder, options, and the IKpiSampleSource abstraction.
Decoupling. Each owning component registers its own IKpiSampleSource into DI; the recorder
enumerates IEnumerable<IKpiSampleSource> (same pattern as INotificationDeliveryAdapter). So
KpiHistory does not reference every component — each source impl lives with its owner and
calls that owner's existing Compute…KpisAsync methods (or, for Site Health, reads the
in-memory aggregator). CLAUDE.md component count goes 25 → 26.
Schema — KpiSample (tall / EAV)
Commons entity, EF mapping + migration in ConfigurationDatabase, central MS SQL:
| Column | Type | Notes |
|---|---|---|
| Id | bigint PK identity | |
| Source | varchar(64) | NotificationOutbox / SiteCallAudit / AuditLog / SiteHealth |
| Metric | varchar(64) | per-source constant, e.g. queueDepth, parkedCount, deadLetterCount |
| Scope | varchar(16) | Global / Site / Node |
| ScopeKey | varchar(64) NULL | site id / node name; NULL for Global |
| Value | float | counts exact within range; ages stored as seconds |
| CapturedAtUtc | datetime2 | recorder tick timestamp (UTC) |
Indexes:
- IX_KpiSample_Series (Source, Metric, Scope, ScopeKey, CapturedAtUtc): per-series range query.
- IX_KpiSample_Captured (CapturedAtUtc): retention purge.
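To make the tall/EAV shape concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for central MS SQL. It is not the project's EF mapping: the SQLite column types, the NOT NULL choices, and the `make_store` / `series` helper names are assumptions for illustration. The point it shows is that every metric is just another row, so adding a metric or a source needs no schema change, and one composite index serves every per-series range query.

```python
import sqlite3

def make_store():
    """Create an in-memory stand-in for the KpiSample table and its two indexes."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE KpiSample (
            Id            INTEGER PRIMARY KEY AUTOINCREMENT,
            Source        TEXT NOT NULL,
            Metric        TEXT NOT NULL,
            Scope         TEXT NOT NULL,
            ScopeKey      TEXT NULL,
            Value         REAL NOT NULL,
            CapturedAtUtc TEXT NOT NULL
        );
        CREATE INDEX IX_KpiSample_Series
            ON KpiSample (Source, Metric, Scope, ScopeKey, CapturedAtUtc);
        CREATE INDEX IX_KpiSample_Captured ON KpiSample (CapturedAtUtc);
    """)
    return db

def record(db, rows):
    """Insert (Source, Metric, Scope, ScopeKey, Value, CapturedAtUtc) tuples."""
    db.executemany(
        "INSERT INTO KpiSample (Source, Metric, Scope, ScopeKey, Value, CapturedAtUtc) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )

def series(db, source, metric, scope, scope_key, from_utc, to_utc):
    """Range query for one series; SQLite's IS also matches a NULL ScopeKey (Global)."""
    cursor = db.execute(
        "SELECT CapturedAtUtc, Value FROM KpiSample "
        "WHERE Source = ? AND Metric = ? AND Scope = ? AND ScopeKey IS ? "
        "AND CapturedAtUtc >= ? AND CapturedAtUtc <= ? "
        "ORDER BY CapturedAtUtc",
        (source, metric, scope, scope_key, from_utc, to_utc),
    )
    return cursor.fetchall()
```

Timestamps are stored here as ISO-8601 text so that lexical order matches time order; the real column is datetime2.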
Recorder — KpiHistoryRecorderActor (central cluster singleton)
Runs on the active central node (consistent with the existing central singletons — Notification
Outbox, Site Call Audit, purge actors). Timer fires every SampleIntervalSeconds (default 60s).
Per tick:
- Open a DI scope (scoped DbContext/repository, mirroring NotificationOutboxActor's scope-per-sweep pattern).
- Enumerate registered IKpiSampleSources; each returns IReadOnlyList<KpiSample> stamped with the tick's CapturedAtUtc.
- Write all samples via IKpiHistoryRepository.RecordSamplesAsync.
Best-effort (R2). Each source call and the write are individually wrapped; a failure logs and skips that source's samples for the tick — it never throws into or disrupts the source component.
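The per-tick flow and its failure isolation can be sketched as follows. This is a language-neutral Python illustration, not the actor itself: sources are modelled as plain callables, and `run_tick`, `SampleSource`, and the `write` callback are hypothetical names standing in for the recorder, IKpiSampleSource, and RecordSamplesAsync.

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Optional

log = logging.getLogger("kpi_history")

@dataclass(frozen=True)
class KpiSample:
    source: str
    metric: str
    scope: str
    scope_key: Optional[str]
    value: float
    captured_at_utc: datetime

# Assumption: a source is a callable that takes the tick timestamp and returns samples.
SampleSource = Callable[[datetime], List[KpiSample]]

def run_tick(sources: List[SampleSource],
             write: Callable[[List[KpiSample]], None],
             now: datetime) -> int:
    """Collect from every source, isolating failures; return how many samples were written."""
    collected: List[KpiSample] = []
    for source in sources:
        try:
            collected.extend(source(now))
        except Exception:
            # Best-effort: log and skip only this source's samples for this tick.
            log.exception("KPI source failed; skipping it for this tick")
    try:
        write(collected)
    except Exception:
        # The write is wrapped separately so a storage failure never propagates.
        log.exception("KPI sample write failed; this tick's samples are dropped")
        return 0
    return len(collected)
```

A throwing source costs one tick of its own data and nothing else; the remaining sources are still sampled and written.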
Retention. Daily purge (PurgeIntervalHours, default 24) deletes rows older than
RetentionDays (default 90), reusing the existing purge-scheduler shape. Hourly downsampling
beyond N days is deferred (YAGNI).
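As a small worked example of the retention rule, the sketch below computes the purge cutoff and filters timestamps against it. The function names are hypothetical, and treating a row captured exactly at the cutoff as retained is an assumption; the design only says rows "older than RetentionDays" are deleted.

```python
from datetime import datetime, timedelta, timezone
from typing import List

def purge_cutoff(now_utc: datetime, retention_days: int = 90) -> datetime:
    """Rows captured strictly before this instant are eligible for deletion."""
    return now_utc - timedelta(days=retention_days)

def surviving(captured_times: List[datetime], now_utc: datetime,
              retention_days: int = 90) -> List[datetime]:
    """Return the capture timestamps that would remain after one purge run."""
    cutoff = purge_cutoff(now_utc, retention_days)
    return [t for t in captured_times if t >= cutoff]
```

With the default of 90 days, a purge running at 2026-06-17 00:00 UTC uses a cutoff of 2026-03-19 00:00 UTC.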
Sample sources
- NotificationOutboxKpiSampleSource (in NotificationOutbox) → queueDepth, stuckCount, parkedCount, deliveredLastInterval, oldestPendingAgeSeconds; Global + per-Site + per-Node (reuses the M5 ComputePerNodeKpisAsync).
- SiteCallAuditKpiSampleSource (in SiteCallAudit) → buffered, parked, failedLastInterval, deliveredLastInterval, oldestPendingAgeSeconds, stuck; Global + per-Site + per-Node.
- AuditLogKpiSampleSource (in AuditLog) → totalEventsLastHour, errorEventsLastHour, backlogTotal; Global.
- SiteHealthKpiSampleSource (in HealthMonitoring) → reads ICentralHealthAggregator.GetAllSiteStates() (in-memory, no DB): per-Site connectionsUp/Down, tagsGood/Bad, scriptErrors, alarmEvalErrors, sfBufferDepth, deadLetters, parkedMessages, deployedInstances/enabledInstances/disabledInstances, auditBacklogPending, eventLogWriteFailures.
Query + UI
- IKpiHistoryRepository.GetSeriesAsync(source, metric, scope, scopeKey, fromUtc, toUtc, maxPoints) → buckets [fromUtc, toUtc] into ≤ maxPoints buckets and returns last-value per bucket (KpiSeriesPoint(BucketStartUtc, Value)). Last-value is correct for gauge metrics; v1 uses one aggregation, avg/min/max deferred.
- KpiHistoryQueryService (CentralUI): scoped-repo direct read with a dual-ctor test seam, exactly like AuditLogQueryService.
- KpiTrendChart.razor: reusable custom SVG line/area component with a polyline path, min/max and time-range axis labels, a responsive viewBox, and clean corporate styling (no third-party charting lib, per CLAUDE.md). The time window (24h / 7d) is owned by the parent page.
- Surfaces: trend charts on the Notification Outbox, Site Calls, and Audit Log pages, plus a per-site trend panel on the Health dashboard.
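The last-value-per-bucket query can be sketched as below. This is an in-memory Python illustration rather than the repository's SQL: `bucket_last_value` is a hypothetical name, equal-width buckets are assumed, and omitting empty buckets (instead of carrying the previous value forward) is also an assumption the design does not settle.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

Sample = Tuple[datetime, float]  # (CapturedAtUtc, Value)

def bucket_last_value(samples: List[Sample], from_utc: datetime,
                      to_utc: datetime, max_points: int) -> List[Sample]:
    """Split [from_utc, to_utc] into max_points equal buckets; keep each bucket's last sample."""
    if max_points <= 0 or to_utc <= from_utc:
        return []
    width = (to_utc - from_utc) / max_points
    latest: Dict[int, Sample] = {}  # bucket index -> latest (captured_at, value) seen
    for captured_at, value in samples:
        if not (from_utc <= captured_at <= to_utc):
            continue
        # A sample exactly at to_utc belongs to the final bucket, not a new one.
        index = min(int((captured_at - from_utc) / width), max_points - 1)
        if index not in latest or captured_at >= latest[index][0]:
            latest[index] = (captured_at, value)
    # Emit (BucketStartUtc, Value) in time order; buckets without samples are skipped.
    return [(from_utc + i * width, latest[i][1]) for i in sorted(latest)]
```

For a gauge such as queue depth, the last reading in a bucket is the value the gauge actually held at the end of that interval, which is why a single last-value aggregation is enough for v1.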
Config — KpiHistoryOptions
SampleIntervalSeconds (60), RetentionDays (90), PurgeIntervalHours (24),
DefaultMaxSeriesPoints (200), with an options validator; bound on the central role in Host.
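A minimal sketch of the options and a validator follows, using the defaults listed above. The rule that every value must be positive is an assumption; the design only states that a validator exists, and the names here are Python stand-ins for the .NET options class.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class KpiHistoryOptions:
    sample_interval_seconds: int = 60
    retention_days: int = 90
    purge_interval_hours: int = 24
    default_max_series_points: int = 200

def validate(options: KpiHistoryOptions) -> List[str]:
    """Return a list of problems; an empty list means the options are usable."""
    problems = []
    for name, value in vars(options).items():
        # Assumed rule: a zero or negative interval, retention, or point cap is invalid.
        if value <= 0:
            problems.append(f"{name} must be positive, got {value}")
    return problems
```

Failing fast at startup on a bad value is cheaper than discovering a zero-second timer or a zero-day retention window at runtime.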
Error handling
- Recorder: best-effort, per-source isolation (above). The KPI history is observability, never on a user-facing critical path.
- Query: a query failure surfaces in the UI as an unavailable chart (em-dash / message), mirroring how the existing KPI tiles surface transient failures — it never breaks the hosting page.
Testing
- Recorder writes samples; best-effort source-failure isolation (a throwing source does not abort the tick or other sources).
- Repository range/bucket-query correctness; retention purge deletes only aged rows.
- KpiTrendChart SVG render (unit/bUnit-style).
- One Playwright trend-chart UI test (per the M5–M10 testing strategy).
- Targeted tests per task (filtered tests + per-project builds); full-solution build at integration.
Docs & deploy
- New docs/requirements/Component-KpiHistory.md (#26).
- README component table + CLAUDE.md (25 → 26 + KPI-history bullet).
- Interactions updated in Component-NotificationOutbox / -SiteCallAudit / -AuditLog / -HealthMonitoring / -CentralUI.
- Update docs/plans/2026-06-15-stillpending-completion-design.md (T9/T10 deferred; T11 → KPI-history backbone).
- EF migration auto-applies in dev; cluster rebuild via bash docker/deploy.sh at integration.
Execution housekeeping
- Work in the dedicated worktree m6-kpi-history, branched off local main (639e331, includes unpushed M5), not origin/main.
- Implementers commit in pathspec form (git commit -m "…" -- <paths>) and retry on .git/index.lock.
- Keep concurrent committers to ≤ 2–3 and run a post-wave HEAD-presence check per the concurrent-commit ref-race lesson.
Deferred (next major version)
- T9: Teams (and other non-Email) delivery adapter behind INotificationDeliveryAdapter.
- T10: NotificationType enum values + Central UI list "Type" selector.
Task list
- MK-1 (first priority): Common KPI-history/rollup store — backbone (schema + recorder + repository + retention + chart component).
- MK-2: Notification Outbox trend charts (T11 first consumer) — blocked by MK-1.
- MK-3: Trend consumers — Site Call Audit / Audit Log / Site Health — blocked by MK-1.
(With the "charts for all sources" scope, MK-2 and MK-3 together deliver the full UI this milestone; the implementation plan sequences them.)