From d4e86c1b1d8aa7e8dee5972b0bf4d3d496243058 Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Mon, 18 May 2026 22:54:17 -0400
Subject: [PATCH] docs(plans): design for central notification outbox

Captures the basic design for a reliable notification outbox: sites
store-and-forward notifications to the central cluster, which logs them
to a type-agnostic Notifications table (single audit source) and
delivers them via per-type adapters with retry, parking, and KPIs.
---
 docs/plans/notif.md | 222 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 222 insertions(+)
 create mode 100644 docs/plans/notif.md

diff --git a/docs/plans/notif.md b/docs/plans/notif.md
new file mode 100644
index 0000000..88a5d18
--- /dev/null
+++ b/docs/plans/notif.md
@@ -0,0 +1,222 @@
+# Design: Notification Outbox
+
+**Date:** 2026-05-18
+**Status:** Basic design — approved, open for refinement.
+
+## Problem
+
+Notification delivery today happens at the site clusters: scripts call `Notify.To().Send()`,
+the Notification Service composes an email, and the site sends it via SMTP. The Store-and-Forward
+Engine buffers transient failures. Two gaps motivated this design:
+
+1. **No audit trail.** A successful send is recorded nowhere. A permanently-failed send is
+   returned to the script and then lost. Only a transiently-failed-and-buffered notification
+   is visible — indirectly, as Store-and-Forward activity.
+2. **No monitoring.** There is no view of delivery health: no KPIs, and no way to find
+   notifications that are stuck or have been parked.
+
+## Solution overview
+
+Invert where delivery happens. Sites no longer send notifications directly. Instead:
+
+- A site script's notification is **store-and-forwarded to the central cluster**.
+- Central **logs every notification to a `Notifications` table** in the central config DB
+  (MS SQL) — the single source of audit truth.
+- A central **Notification Outbox** dispatches and delivers from that table, with retry,
+  parking, per-notification status, and KPIs.
+
+The `Notifications` table is type-agnostic so it can record any notification type the system
+supports — email today, Microsoft Teams and others later.
+
+### End-to-end flow
+
+```
+Site script: Notify.To("list").Send(subject, body)
+   │  generate NotificationId (GUID) locally; return it to the script immediately
+   ▼
+Site Store-and-Forward Engine (notification category, target = central)
+   │  durably forwards to central via the Communication Layer (ClusterClient);
+   │  buffers/retries if central is unreachable
+   ▼
+Central ingest: insert-if-not-exists on NotificationId → Notifications table (Pending)
+   │  ack the site → site S&F clears the message
+   ▼
+Central Notification Outbox actor (singleton, active central node)
+   │  polls due rows; resolves the list; delivers via the matching adapter
+   ├── success              → Delivered
+   ├── transient failure    → Retrying (schedule NextAttemptAt)
+   └── permanent failure
+       / retries exhausted  → Parked
+```
+
+`Notify.Status(notificationId)` round-trips site→central and reads the table. Before central
+has ingested the row, status reads as `Pending` (in transit).
+
+## Component design
+
+### New component #21: Notification Outbox
+
+A **central** component — the first outbox to live centrally (the Store-and-Forward Engine
+remains site-only).
+
+- **Location:** Central cluster.
+- **Actor:** `NotificationOutboxActor` — a **singleton on the active central node**.
+- **Owns:** the durable central queue (the `Notifications` table), the dispatcher loop,
+  retry scheduling, parking, per-notification status tracking, and KPI computation.
+- SMTP/HTTP delivery is blocking I/O — delivery work runs on a **dedicated blocking-I/O
+  dispatcher** (same pattern as Script Execution Actors).
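
The central ingest step in the flow above (insert-if-not-exists on `NotificationId`, then ack) can be sketched as follows. This is an illustrative Python sketch, not the plan's implementation: `sqlite3` stands in for the central MS SQL database, `INSERT OR IGNORE` is SQLite syntax (SQL Server would need a different conditional-insert statement), the table is trimmed to a few columns, and the function names `create_schema` and `ingest` are hypothetical.

```python
import sqlite3
import uuid
from datetime import datetime, timezone


def create_schema(conn: sqlite3.Connection) -> None:
    # Trimmed stand-in for the Notifications table; the real table has more columns.
    conn.execute(
        """
        CREATE TABLE Notifications (
            NotificationId TEXT PRIMARY KEY,
            ListName       TEXT NOT NULL,
            Subject        TEXT NOT NULL,
            Body           TEXT NOT NULL,
            Status         TEXT NOT NULL,
            SiteEnqueuedAt TEXT NOT NULL,
            CreatedAt      TEXT NOT NULL
        )
        """
    )


def ingest(conn: sqlite3.Connection, notification_id: str, list_name: str,
           subject: str, body: str, site_enqueued_at: str) -> bool:
    """Insert-if-not-exists keyed on NotificationId.

    Returns True when a new row was written and False when the row already
    existed (a resend after a lost ack). In both cases the row is persisted,
    so the caller may ack the site.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO Notifications "
        "(NotificationId, ListName, Subject, Body, Status, SiteEnqueuedAt, CreatedAt) "
        "VALUES (?, ?, ?, ?, 'Pending', ?, ?)",
        (notification_id, list_name, subject, body, site_enqueued_at,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    nid = str(uuid.uuid4())  # generated at the site in the real flow
    enqueued = datetime.now(timezone.utc).isoformat()
    print(ingest(conn, nid, "ops-oncall", "Pump fault", "Pump 3 tripped", enqueued))
    print(ingest(conn, nid, "ops-oncall", "Pump fault", "Pump 3 tripped", enqueued))
```

Because the primary key doubles as the idempotency key, a resend needs no separate lookup: the second call writes nothing and the site can still be acked.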
+
+### Notification Service (revised)
+
+Shrinks to two clear jobs, both **central-only**:
+
+- Manage **notification-list and SMTP definitions** in the config DB.
+- Provide **delivery adapters** — stateless "deliver one notification" implementations per
+  type (see below).
+
+Notifications and SMTP config are **no longer deployed to sites**. Sites never talk to SMTP.
+
+### Store-and-Forward Engine (revised)
+
+Keeps its notification category, but the delivery *target* changes from SMTP to **central**.
+"Delivering" a buffered notification now means handing it to the Communication Layer for the
+central cluster and clearing it on central's ack. The site→central forward uses a fixed
+retry interval (host-level config, since it concerns reaching central, not any list).
+
+## Typed notification lists
+
+Each notification list gains a **`Type`** field plus type-specific targets:
+
+- `Email` — a set of recipient addresses (implemented now).
+- `Teams`, others — future types.
+
+`Notify.To("list")` works transparently for any type — the script does not care. Lists are
+defined and stored centrally only.
+
+**Recipient resolution happens at central, at delivery time** — the site forwards only
+`(listName, subject, body)`. This keeps definitions in one place and removes the deploy-to-sites
+artifact entirely.
+
+## The `Notifications` table (central MS SQL)
+
+Type-agnostic. One row per notification.
+
+| Field | Notes |
+|---|---|
+| `NotificationId` | GUID, primary key. Generated at the **site**; used as the idempotency key. |
+| `Type` | `Email` / `Teams` / … discriminator. |
+| `ListName` | Target notification list. |
+| `Subject`, `Body` | Plain-text content. |
+| `TypeData` | JSON — extensibility hook for future per-type fields. |
+| `Status` | `Pending` → `Retrying` → `Delivered` / `Parked` / `Discarded`. |
+| `RetryCount` | Delivery attempts so far. |
+| `LastError` | Detail of the most recent failure. |
+| `ResolvedTargets` | Who the notification actually went to — snapshotted by central at delivery time, for audit. |
+| `SourceSiteId`, `SourceInstanceId`, `SourceScript` | Provenance. |
+| `SiteEnqueuedAt` | When the script called `Send()` (carried from the site). |
+| `CreatedAt` | When central ingested the row. |
+| `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt` | Delivery timestamps. |
+
+All timestamps are UTC.
+
+### Status lifecycle
+
+- `Pending` — ingested, awaiting first dispatch.
+- `Retrying` — a transient failure occurred; `NextAttemptAt` schedules the next attempt.
+- `Delivered` — terminal, success.
+- `Parked` — terminal-not-delivered: a permanent failure, or retries exhausted. `LastError`
+  distinguishes which.
+- `Discarded` — terminal, reached **only by operator action** on a parked notification. The
+  row is kept (not deleted) so the table remains a complete audit record.
+
+### Retry policy
+
+Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry
+interval — consistent with the existing fixed-interval (no backoff) convention.
+
+### Retention
+
+Terminal rows (`Delivered`, `Parked`, `Discarded`) are removed by a **daily purge job** after
+a configurable window (default ~1 year). This preserves a strong audit trail while bounding
+table growth. Non-terminal rows are never purged.
+
+## Delivery adapters
+
+An `INotificationDeliveryAdapter` is registered per `Type`. Each `Deliver(...)` call returns
+one of `success | transient failure | permanent failure`, mirroring the External System
+Gateway error-classification pattern.
+
+- **Email adapter — implemented now.** The existing SMTP composition/send logic, relocated
+  to the central cluster.
+- **Teams and other adapters — future.** The `Type` discriminator and the adapter interface
+  are the seam; no Teams code is written in this basic plan. Teams auth and targeting
+  (Incoming Webhooks vs Graph API) is a separate design conversation.
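
How the three-way adapter result could drive the status lifecycle and the fixed-interval retry policy described above is sketched below in illustrative Python. Everything named here is hypothetical: `DeliveryResult`, the `NotificationDeliveryAdapter` protocol, `NotificationRow`, and `apply_result` are stand-ins, not the plan's types. One assumption is worth flagging: the plan does not say whether max-retry-count includes the first attempt, so this sketch parks a row once the attempt count exceeds `max_retry_count` (the first attempt plus `max_retry_count` retries).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import List, Optional, Protocol


class DeliveryResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"


class NotificationDeliveryAdapter(Protocol):
    """Hypothetical per-type adapter shape: deliver one notification."""

    def deliver(self, targets: List[str], subject: str, body: str) -> DeliveryResult: ...


@dataclass
class NotificationRow:
    status: str = "Pending"
    retry_count: int = 0  # delivery attempts so far
    last_error: Optional[str] = None
    last_attempt_at: Optional[datetime] = None
    next_attempt_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None


def apply_result(row: NotificationRow, result: DeliveryResult, now: datetime,
                 max_retry_count: int, retry_interval: timedelta,
                 error: Optional[str] = None) -> None:
    """Update one row after a delivery attempt (fixed interval, no backoff)."""
    row.retry_count += 1
    row.last_attempt_at = now
    if result is DeliveryResult.SUCCESS:
        row.status, row.delivered_at, row.next_attempt_at = "Delivered", now, None
    elif result is DeliveryResult.PERMANENT_FAILURE:
        row.status, row.next_attempt_at = "Parked", None
        row.last_error = f"permanent failure: {error}"
    elif row.retry_count > max_retry_count:
        # Assumption: the first attempt plus max_retry_count retries are allowed.
        row.status, row.next_attempt_at = "Parked", None
        row.last_error = f"retries exhausted: {error}"
    else:
        row.status = "Retrying"
        row.last_error = error
        row.next_attempt_at = now + retry_interval


if __name__ == "__main__":
    now = datetime(2026, 5, 18, tzinfo=timezone.utc)
    row = NotificationRow()
    apply_result(row, DeliveryResult.TRANSIENT_FAILURE, now, 2,
                 timedelta(minutes=1), "SMTP timeout")
    print(row.status, row.next_attempt_at)
```

Prefixing `last_error` with the parking reason is one way to let `LastError` distinguish a permanent failure from exhausted retries; a separate column would work as well.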
+
+## Active/standby behavior
+
+The `NotificationOutboxActor` is a singleton on the active central node. All outbox state
+lives in MS SQL, which is already the central HA store — so no Akka-level replication is
+needed (unlike the site S&F engine). On central failover the new active node resumes
+dispatch directly from the table.
+
+The site→central handoff is **at-least-once**: central acks only after the row is persisted,
+and a lost ack causes the site to resend. The GUID `NotificationId` idempotency key makes a
+resend harmless (insert-if-not-exists). A rare failover mid-delivery could re-send one
+already-`Delivered` notification — an accepted trade-off, consistent with the duplicate-delivery
+trade-off the Store-and-Forward Engine already accepts.
+
+## Monitoring
+
+### KPIs
+
+Central-computed from the `Notifications` table — global, with a per-source-site breakdown:
+
+- **Queue depth** — count of `Pending` + `Retrying`.
+- **Stuck count** — `Pending`/`Retrying` rows older than a configurable age threshold
+  (default 10 minutes).
+- **Parked count** — count of `Parked`.
+- **Delivered (last interval)** — count of `Delivered` since the previous sample.
+- **Oldest pending age** — age of the oldest non-terminal notification.
+
+### Stuck detection
+
+A notification is **stuck** if it is `Pending` or `Retrying` and older than the configurable
+age threshold. Detection is **display-only** — a count KPI and a row badge. No automated
+escalation or alerting, consistent with the current system-wide no-alerting policy.
+
+### Surfacing
+
+- **Health Monitoring dashboard** — headline KPI tiles: queue depth, stuck count, parked
+  count. These are central-computed (not part of the site health report). The site S&F
+  notification backlog remains a separate site health metric, covering the site→central leg.
+- **New Central UI "Notification Outbox" page** — KPI tiles plus a queryable notification
+  list: filter by status, type, source site, list, and time range; a stuck-only toggle;
+  keyword search on subject. Parked notifications offer **Retry** (→ `Pending`, reset
+  `RetryCount`/`NextAttemptAt`) and **Discard** (→ `Discarded`) actions. Stuck rows are badged.
+
+## Cross-document impact
+
+| Document | Change |
+|---|---|
+| `Component-NotificationOutbox.md` | **New** — component #21. |
+| `Component-NotificationService.md` | Delivery moves central; lists gain a `Type`; no deploy-to-sites; async script API; delivery adapters. |
+| `Component-StoreAndForward.md` | Notification category retargeted from SMTP to central. |
+| `Component-HealthMonitoring.md` | Outbox KPIs added as central-computed headline metrics. |
+| `Component-CentralUI.md` | New Notification Outbox page. |
+| Central–Site Communication | New `NotificationSubmit` + ack message pair. |
+| Configuration Database / Commons | `Notifications` table, entity POCO, repository interface + implementation, EF migration, message contracts. |
+| `README.md` | Component table 20 → 21. |
+| `CLAUDE.md` | Component list 20 → 21; new key design decisions. |
+
+## Open questions for refinement
+
+- **Site→central forward retry config** — where the fixed forward-retry interval lives
+  (host appsettings vs a deployed setting).
+- **`Notify.Status` payload** — whether status queries also return retry count / last error
+  to scripts, or just the status enum.
+- **Stuck threshold default** — 10 minutes is a placeholder.
+- **Pre-ingest status** — confirm `Pending` is the right reading for a notification still
+  in the site S&F buffer (vs a distinct "Forwarding" state).
+- **Site-side diagnostics** — whether to keep a lightweight Site Event Logging entry for
+  "notification enqueued / forwarded," now that central holds the authoritative record.
+- **KPI history** — KPIs are currently point-in-time; whether any trend/history is wanted.
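
As a reference point for the stuck-threshold and KPI-history questions, the point-in-time KPIs from the Monitoring section could be computed roughly as in the illustrative Python sketch below. Rows are plain dictionaries keyed by the table's column names, and `compute_kpis` is a hypothetical name. Two assumptions are not settled by the plan: age is measured from `CreatedAt` rather than `SiteEnqueuedAt`, and "older than" is a strict comparison.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List

NON_TERMINAL = {"Pending", "Retrying"}


def compute_kpis(rows: List[Dict[str, Any]], now: datetime,
                 previous_sample_at: datetime,
                 stuck_after: timedelta = timedelta(minutes=10)) -> Dict[str, Any]:
    """Point-in-time KPIs over a list of notification rows."""
    # Assumption: age is measured from central ingest (CreatedAt).
    open_ages = [now - r["CreatedAt"] for r in rows if r["Status"] in NON_TERMINAL]
    return {
        "queue_depth": len(open_ages),
        "stuck_count": sum(1 for age in open_ages if age > stuck_after),
        "parked_count": sum(1 for r in rows if r["Status"] == "Parked"),
        "delivered_last_interval": sum(
            1 for r in rows
            if r["Status"] == "Delivered" and r["DeliveredAt"] > previous_sample_at
        ),
        # None when nothing is pending or retrying.
        "oldest_pending_age": max(open_ages, default=None),
    }


if __name__ == "__main__":
    now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
    sample = [
        {"Status": "Pending", "CreatedAt": now - timedelta(minutes=2), "DeliveredAt": None},
        {"Status": "Retrying", "CreatedAt": now - timedelta(minutes=30), "DeliveredAt": None},
    ]
    print(compute_kpis(sample, now, previous_sample_at=now - timedelta(minutes=5)))
```

The per-source-site breakdown would follow by grouping rows on `SourceSiteId` and calling the same function per group. Storing each returned dictionary with its sample time is one simple route to trend history, if that is wanted.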