docs(notification-outbox): add central-computed outbox KPIs to Health Monitoring

Joseph Doherty
2026-05-18 23:17:07 -04:00
parent 0b56c809e1
commit 6ccf3766dc


@@ -29,8 +29,11 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
| Tag resolution counts | Data Connection Layer | Per connection: total subscribed tags vs. successfully resolved tags |
| Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
| Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
- | Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |
+ | Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category external, notification (notifications awaiting forward to central), DB write |
| Dead letter count | Akka.NET EventStream | Messages sent to actors that no longer exist — indicates stale references or timing issues |
| Notification Outbox queue depth | Notification Outbox (central) | Count of `Pending` + `Retrying` notifications — central-computed, not site-reported |
| Notification Outbox stuck count | Notification Outbox (central) | Count of `Pending` / `Retrying` notifications older than the configurable stuck-age threshold — central-computed, not site-reported |
| Notification Outbox parked count | Notification Outbox (central) | Count of `Parked` notifications — central-computed, not site-reported |
## Reporting Protocol
@@ -50,10 +53,22 @@ Script error rates and alarm evaluation error rates are calculated as **raw coun
- **Alarm evaluation errors** include all failures during alarm condition evaluation.
- For detailed diagnostics (error types, stack traces, affected instances), operators use the **Site Event Log Viewer** — the health dashboard is for quick triage, not forensics.
## Notification Outbox KPIs
The Notification Outbox is a **central** component, so its KPIs are **central-computed** rather than collected from sites and carried in the site health report:
- The dashboard surfaces three **headline** outbox KPIs: **queue depth** (`Pending` + `Retrying`), **stuck count** (`Pending` / `Retrying` rows older than the configurable stuck-age threshold), and **parked count** (`Parked`).
- The Notification Outbox component computes these on demand from the central `Notifications` table; the health dashboard polls it for the headline tiles.
- The fuller KPI set — which also includes **delivered (last interval)** and **oldest pending age** — lives on the Central UI **Notification Outbox** page, not the health dashboard.
- Outbox KPIs are **point-in-time**, computed on demand from the `Notifications` table. There is no time-series store — consistent with Health Monitoring's "current status only" philosophy. The outbox's own ~1-year row retention answers historical questions directly.
These are distinct from the site-reported **Store-and-forward buffer depth** notification metric, which now covers the **site→central leg** — notifications still buffered in a site's Store-and-Forward Engine awaiting forward to central — and remains part of the site health report.
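
The three headline KPIs described above can be sketched as a single point-in-time query. This is a minimal illustration only, not the component's actual implementation: the `Notifications` schema shown here (`status`, `created_at_epoch`), the `Delivered` status value, the function and class names, and the choice to measure age from row creation time are all assumptions made for the example. SQLite stands in for whatever database the central cluster really uses.

```python
import sqlite3
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class OutboxKpis:
    queue_depth: int   # Pending + Retrying
    stuck_count: int   # Pending / Retrying older than the stuck-age threshold
    parked_count: int  # Parked


def compute_outbox_kpis(conn: sqlite3.Connection, now: datetime,
                        stuck_age: timedelta) -> OutboxKpis:
    """Compute the headline KPIs on demand; nothing is cached or stored."""
    # Hypothetical: age is measured from created_at_epoch (Unix seconds).
    cutoff = int((now - stuck_age).timestamp())
    row = conn.execute(
        """
        SELECT
            COALESCE(SUM(status IN ('Pending', 'Retrying')), 0),
            COALESCE(SUM(status IN ('Pending', 'Retrying')
                         AND created_at_epoch < ?), 0),
            COALESCE(SUM(status = 'Parked'), 0)
        FROM Notifications
        """,
        (cutoff,),
    ).fetchone()
    return OutboxKpis(*row)


def demo_connection(now: datetime) -> sqlite3.Connection:
    """Build an in-memory table with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Notifications ("
        "id INTEGER PRIMARY KEY, status TEXT, created_at_epoch INTEGER)"
    )
    ages = [
        ("Pending", timedelta(minutes=10)),
        ("Pending", timedelta(hours=2)),
        ("Retrying", timedelta(hours=3)),
        ("Parked", timedelta(hours=5)),
        ("Delivered", timedelta(hours=1)),
    ]
    conn.executemany(
        "INSERT INTO Notifications (status, created_at_epoch) VALUES (?, ?)",
        [(status, int((now - age).timestamp())) for status, age in ages],
    )
    return conn


if __name__ == "__main__":
    now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
    print(compute_outbox_kpis(demo_connection(now), now, timedelta(hours=1)))
```

Because every call re-reads the table, the result is always current and there is no time series to maintain, which matches the point-in-time behaviour the section describes.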
## Central Storage
- Health metrics are held **in memory** at the central cluster for display in the UI.
- No historical health data is persisted — the dashboard shows current/latest status only.
- Notification Outbox KPIs are not stored by Health Monitoring; they are computed point-in-time from the central `Notifications` table each time the dashboard refreshes — consistent with the current-status-only philosophy.
- Site connectivity history (online/offline transitions) may optionally be logged via the Audit Log or a separate mechanism if needed in the future.
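
The "current/latest status only" storage model can be pictured as a map that keeps exactly one report per site and overwrites it on every update. The sketch below is an assumption-laden illustration, not the real central component: `SiteHealthReport`, `LatestHealthStore`, and their fields are hypothetical names, and the lock simply stands in for whatever concurrency model the central cluster uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from threading import Lock


@dataclass(frozen=True)
class SiteHealthReport:
    site_id: str
    received_at: datetime
    metrics: dict  # hypothetical payload, e.g. {"script_errors": 2}


class LatestHealthStore:
    """Keeps only the most recent report per site, in memory."""

    def __init__(self) -> None:
        self._latest: dict[str, SiteHealthReport] = {}
        self._lock = Lock()

    def record(self, report: SiteHealthReport) -> None:
        # Overwrite rather than append: no history is retained.
        with self._lock:
            self._latest[report.site_id] = report

    def snapshot(self) -> dict[str, SiteHealthReport]:
        # Return a copy so the dashboard reads a stable view.
        with self._lock:
            return dict(self._latest)


if __name__ == "__main__":
    store = LatestHealthStore()
    t = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
    store.record(SiteHealthReport("site-a", t, {"script_errors": 2}))
    store.record(SiteHealthReport("site-a", t, {"script_errors": 5}))
    print(store.snapshot()["site-a"].metrics)
```

A restart of the process empties such a store, which is consistent with the statement that no historical health data is persisted.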
## No Alerting
@@ -66,8 +81,9 @@ Script error rates and alarm evaluation error rates are calculated as **raw coun
- **Communication Layer**: Transports health reports from sites to central.
- **Data Connection Layer (site)**: Provides connection health metrics.
- **Site Runtime (site)**: Provides script error rate and alarm evaluation error rate metrics.
- - **Store-and-Forward Engine (site)**: Provides buffer depth metrics.
+ - **Store-and-Forward Engine (site)**: Provides buffer depth metrics, including the notification backlog awaiting forward to central.
- **Cluster Infrastructure (site)**: Provides node role status.
- **Notification Outbox (central)**: Provides central-computed outbox KPIs — queue depth, stuck count, parked count — for the headline dashboard tiles.
## Interactions