From 03887203906fbd4ae5a8b8bfd3cf2416f417b273 Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Mon, 18 May 2026 23:04:17 -0400
Subject: [PATCH] docs(notification-outbox): add Component-NotificationOutbox design doc

---
 .../Component-NotificationOutbox.md | 166 ++++++++++++++++++
 1 file changed, 166 insertions(+)
 create mode 100644 docs/requirements/Component-NotificationOutbox.md

diff --git a/docs/requirements/Component-NotificationOutbox.md b/docs/requirements/Component-NotificationOutbox.md
new file mode 100644
index 0000000..98913c5
--- /dev/null
+++ b/docs/requirements/Component-NotificationOutbox.md
@@ -0,0 +1,166 @@
+# Component: Notification Outbox
+
+## Purpose
+
+The Notification Outbox is the central component that receives store-and-forwarded notifications from site clusters, logs every one to the `Notifications` table in the central configuration database, and delivers them through per-type delivery adapters. The `Notifications` table is the single source of audit truth: every notification, whether successfully delivered, parked, or discarded, has exactly one durable row. The outbox provides delivery retry, parking of failures, per-notification status tracking, and KPIs for delivery health.
+
+This inverts where notification delivery happens. Sites no longer send notifications directly via SMTP; a site script's notification is store-and-forwarded to central, and the central outbox owns dispatch and delivery.
+
+## Location
+
+Central cluster. The `NotificationOutboxActor` is a **singleton on the active central node**. It is the first outbox component to live centrally; the Store-and-Forward Engine remains site-only.
+
+## Responsibilities
+
+- Own the durable central queue: the `Notifications` table in the central MS SQL database.
+- Ingest store-and-forwarded notifications from sites, insert-if-not-exists on `NotificationId`, and ack the site only after the row is persisted.
+- Run the dispatcher loop: poll due rows, resolve the target notification list, and deliver via the matching adapter.
+- Schedule retries for transient failures and park notifications on permanent failure or exhausted retries.
+- Track per-notification status across the delivery lifecycle.
+- Compute delivery KPIs from the `Notifications` table for the Health Monitoring dashboard and the Central UI.
+- Purge terminal rows daily after a configurable retention window.
+
+SMTP and HTTP delivery are blocking I/O. Delivery work runs on a **dedicated blocking-I/O dispatcher**, the same pattern used by Script Execution Actors, so delivery never blocks the actor's dispatcher loop.
+
+## End-to-End Flow
+
+```
+Site script: Notify.To("list").Send(subject, body)
+   │  generate NotificationId (GUID) locally; return it to the script immediately
+   ▼
+Site Store-and-Forward Engine (notification category, target = central)
+   │  durably forwards to central via the Communication Layer (ClusterClient);
+   │  buffers/retries if central is unreachable
+   ▼
+Central ingest: insert-if-not-exists on NotificationId → Notifications table (Pending)
+   │  ack the site → site S&F clears the message
+   ▼
+Central Notification Outbox actor (singleton, active central node)
+   │  polls due rows; resolves the list; delivers via the matching adapter
+   ├── success             → Delivered
+   ├── transient failure   → Retrying (schedule NextAttemptAt)
+   └── permanent failure
+       / retries exhausted → Parked
+```
+
+The site forwards only `(listName, subject, body)` plus provenance; recipient resolution happens at central, at delivery time. This keeps notification-list definitions in one place and removes the deploy-to-sites artifact entirely.
+
+`Notify.Status(notificationId)` returns a small status record: status, retry count, last error, and key timestamps (enqueued, delivered). While the notification is still in the site S&F buffer, the site answers the query **locally** (status `Forwarding`); once forwarded, the query round-trips to central and reads the `Notifications` table.
+
+## The `Notifications` Table
+
+The table is type-agnostic so it can record any notification type the system supports: email today, Microsoft Teams and others later. One row per notification.
+
+| Field | Notes |
+|---|---|
+| `NotificationId` | GUID, primary key. Generated at the **site**; used as the idempotency key. |
+| `Type` | `Email` / `Teams` / … discriminator. |
+| `ListName` | Target notification list. |
+| `Subject`, `Body` | Plain-text content. |
+| `TypeData` | JSON; extensibility hook for future per-type fields. |
+| `Status` | `Pending` → `Retrying` → `Delivered` / `Parked` / `Discarded`. |
+| `RetryCount` | Delivery attempts so far. |
+| `LastError` | Detail of the most recent failure. |
+| `ResolvedTargets` | Who the notification actually went to, snapshotted by central at delivery time for audit. |
+| `SourceSiteId`, `SourceInstanceId`, `SourceScript` | Provenance. |
+| `SiteEnqueuedAt` | When the script called `Send()` (carried from the site). |
+| `CreatedAt` | When central ingested the row. |
+| `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt` | Delivery timestamps. |
+
+All timestamps are UTC.
+
+### Status Lifecycle
+
+- `Forwarding`: in the site S&F buffer, not yet received by central. **Site-local only**: it is never stored in the central `Notifications` table, and is reported by `Notify.Status` while the site still holds the notification.
+- `Pending`: ingested by central, awaiting first dispatch.
+- `Retrying`: a transient failure occurred; `NextAttemptAt` schedules the next attempt.
+- `Delivered`: terminal, success.
+- `Parked`: terminal, not delivered; caused by a permanent failure or exhausted retries. `LastError` distinguishes which.
+- `Discarded`: terminal, reached **only by operator action** on a parked notification. The row is kept rather than deleted, so the table remains a complete audit record for the retention window.
+
+### Retry Policy
+
+Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry interval. The interval is fixed (no exponential backoff), consistent with the existing fixed-interval store-and-forward convention.
+
+### Retention
+
+Terminal rows (`Delivered`, `Parked`, `Discarded`) are removed by a **daily purge job** after a configurable window (default ~1 year). This preserves a strong audit trail while bounding table growth. Non-terminal rows are never purged.
+
+## Ingest & Idempotency
+
+The site→central handoff is **at-least-once**. Central ingests an inbound notification submission with an **insert-if-not-exists** on `NotificationId`, then acks the site; the site S&F engine clears the message only on that ack. Because central acks only after the row is persisted (ack-after-persist), a lost ack causes the site to resend, and the GUID `NotificationId` idempotency key makes the resend harmless: the duplicate insert is a no-op.
+
+A rare central failover mid-delivery could re-send a notification that the adapter had already delivered but whose row had not yet been updated to `Delivered`. This is an accepted trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward Engine already accepts.
+
+## Dispatcher
+
+The dispatcher loop runs on a fixed interval. On each tick the `NotificationOutboxActor`:
+
+1. Polls the `Notifications` table for **due rows**: `Pending` rows, and `Retrying` rows whose `NextAttemptAt` has passed.
+2. Resolves the target notification list to its recipients/targets at central, at delivery time.
+3. Hands the notification to the delivery adapter registered for its `Type`, running on the dedicated blocking-I/O dispatcher.
+4. Applies the result:
+   - **success** → `Delivered`, set `DeliveredAt`, snapshot `ResolvedTargets`.
+   - **transient failure** → `Retrying`, increment `RetryCount`, set `NextAttemptAt`, record `LastError`; once retries are exhausted → `Parked`.
+   - **permanent failure** → `Parked`, record `LastError`.
+
+## Delivery Adapters
+
+A delivery adapter implementing `INotificationDeliveryAdapter` is registered per `Type`. Each `Deliver(...)` call returns one of `success | transient failure | permanent failure`, mirroring the External System Gateway error-classification pattern.
+
+- **Email adapter (implemented now).** The existing SMTP composition/send logic, relocated to the central cluster.
+- **Teams and other adapters (future).** The `Type` discriminator and the adapter interface are the seam; no Teams code exists in this design. Teams auth and targeting (Incoming Webhooks vs Graph API) is a separate design conversation.
+
+Delivery adapters are provided by the Notification Service, which manages notification-list and SMTP definitions and supplies the stateless per-type "deliver one notification" implementations.
+
+## Active/Standby Behavior
+
+The `NotificationOutboxActor` is a singleton on the active central node. All outbox state lives in MS SQL, which is already the central HA store, so **no Akka-level replication is needed** (unlike the site S&F engine). On central failover the new active node resumes dispatch directly from the `Notifications` table: `Pending` rows and due `Retrying` rows are picked up on the next dispatcher tick.
+
+## Monitoring
+
+### KPIs
+
+KPIs are central-computed from the `Notifications` table, both globally and with a per-source-site breakdown:
+
+- **Queue depth**: count of `Pending` + `Retrying`.
+- **Stuck count**: `Pending` / `Retrying` rows older than the configurable stuck-age threshold.
+- **Parked count**: count of `Parked`.
+- **Delivered (last interval)**: count of `Delivered` since the previous sample.
+- **Oldest pending age**: age of the oldest non-terminal notification.
+
+KPIs are point-in-time, computed on demand from the table. The ~1-year row retention answers historical questions directly, so no separate time-series store is added.
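
The KPI definitions above can be illustrated with a small, language-neutral sketch. It is written in Python purely for illustration (the component itself is not specified in Python), and every name in it (`NotificationRow`, `compute_kpis`, `compute_kpis_by_site`) is hypothetical. It also assumes that a row's age is measured from `CreatedAt`, which the design does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

NON_TERMINAL = {"Pending", "Retrying"}


@dataclass
class NotificationRow:
    """Hypothetical subset of the Notifications columns that the KPIs need."""
    status: str
    source_site_id: str
    created_at: datetime                     # UTC, when central ingested the row
    delivered_at: Optional[datetime] = None  # UTC, set once Delivered


def compute_kpis(rows, now, stuck_age, previous_sample_at):
    """Point-in-time KPIs over a list of NotificationRow."""
    open_rows = [r for r in rows if r.status in NON_TERMINAL]
    return {
        "queue_depth": len(open_rows),
        # Assumption: "older than" is measured from CreatedAt.
        "stuck_count": sum(1 for r in open_rows if now - r.created_at > stuck_age),
        "parked_count": sum(1 for r in rows if r.status == "Parked"),
        "delivered_last_interval": sum(
            1 for r in rows
            if r.status == "Delivered"
            and r.delivered_at is not None
            and r.delivered_at > previous_sample_at
        ),
        # None when there are no non-terminal rows.
        "oldest_pending_age": max(
            (now - r.created_at for r in open_rows), default=None
        ),
    }


def compute_kpis_by_site(rows, now, stuck_age, previous_sample_at):
    """Per-source-site breakdown: the same KPIs, grouped by SourceSiteId."""
    by_site = {}
    for r in rows:
        by_site.setdefault(r.source_site_id, []).append(r)
    return {
        site: compute_kpis(site_rows, now, stuck_age, previous_sample_at)
        for site, site_rows in by_site.items()
    }
```

In a real deployment these counts would more likely be aggregate queries against the `Notifications` table than in-memory loops; the sketch only pins down what each KPI counts.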
+
+### Stuck Detection
+
+A notification is **stuck** if it is `Pending` or `Retrying` and older than a configurable age threshold (default 10 minutes). Detection is **display-only**: a count KPI and a row badge. There is no automated escalation or alerting, consistent with the system-wide no-alerting policy.
+
+### Surfacing
+
+- **Health Monitoring dashboard**: headline KPI tiles for queue depth, stuck count, and parked count. These are central-computed and are not part of the site health report. The site S&F notification backlog remains a separate site health metric covering the site→central leg.
+- **Central UI "Notification Outbox" page**: KPI tiles plus a queryable notification list, with filters for status, type, source site, list, and time range, a stuck-only toggle, and keyword search on subject. Parked notifications offer **Retry** (→ `Pending`, reset `RetryCount` / `NextAttemptAt`) and **Discard** (→ `Discarded`) actions. Stuck rows are badged.
+
+## Configuration
+
+The component is configured via `NotificationOutboxOptions`, bound from an `appsettings.json` section on the central host (Options pattern):
+
+- **Dispatch interval**: how often the dispatcher loop polls for due rows.
+- **Stuck-age threshold**: age beyond which a non-terminal notification is counted as stuck (default 10 minutes).
+- **Terminal-row retention window**: age after which terminal rows are removed by the daily purge job (default ~1 year).
+
+Delivery max-retry-count and retry interval are not part of `NotificationOutboxOptions`; they are reused from the central SMTP configuration.
+
+## Dependencies
+
+- **Notification Service**: Provides notification-list and SMTP definitions, and the per-type delivery adapters the outbox invokes.
+- **Configuration Database**: Hosts the `Notifications` table; provides the entity POCO, repository, and EF migration for outbox persistence.
+- **Central–Site Communication**: Carries inbound notification submissions and acks between sites and central.
+- **Health Monitoring**: Consumes the outbox KPIs as central-computed headline metrics.
+- **Central UI**: Hosts the Notification Outbox page.
+
+## Interactions
+
+- **Site Store-and-Forward Engine**: Forwards notifications to central via the Communication Layer; the outbox ingests them and acks once persisted.
+- **Notification Service**: Supplies delivery adapters and resolves notification lists at delivery time.
+- **Central UI**: Queries the `Notifications` table for the Notification Outbox page and issues operator Retry/Discard actions on parked notifications.
+- **Health Monitoring**: Polls the outbox for KPI tiles on the health dashboard.
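
The status transitions that these interactions drive (adapter results applied by the dispatcher, plus the operator Retry and Discard actions from the Central UI) can be sketched as a small state machine. This is an illustrative Python sketch, not the component's implementation: the names `Notification`, `apply_outcome`, `operator_retry`, and `operator_discard` are hypothetical, and it assumes a notification is parked once `RetryCount` reaches the configured maximum, a boundary the design leaves unspecified.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    DISCARDED = "Discarded"


class Outcome(Enum):
    SUCCESS = "success"
    TRANSIENT = "transient failure"
    PERMANENT = "permanent failure"


@dataclass
class Notification:
    """Hypothetical in-memory stand-in for one Notifications row."""
    status: Status = Status.PENDING
    retry_count: int = 0
    last_error: Optional[str] = None
    next_attempt_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None


def apply_outcome(n, outcome, now, max_retries, retry_interval, error=None):
    """Apply one delivery attempt's result to a due (Pending/Retrying) row."""
    if n.status not in (Status.PENDING, Status.RETRYING):
        raise ValueError(f"row is not due for dispatch: {n.status.value}")
    if outcome is Outcome.SUCCESS:
        n.status, n.delivered_at, n.next_attempt_at = Status.DELIVERED, now, None
    elif outcome is Outcome.PERMANENT:
        n.status, n.last_error, n.next_attempt_at = Status.PARKED, error, None
    else:
        n.retry_count += 1
        n.last_error = error
        # Assumption: park once the retry count reaches the configured maximum.
        if n.retry_count >= max_retries:
            n.status, n.next_attempt_at = Status.PARKED, None
        else:
            # Fixed interval, no exponential backoff.
            n.status, n.next_attempt_at = Status.RETRYING, now + retry_interval
    return n


def operator_retry(n):
    """Operator Retry: Parked -> Pending, resetting RetryCount and NextAttemptAt."""
    if n.status is not Status.PARKED:
        raise ValueError("Retry is only offered on parked notifications")
    n.status, n.retry_count, n.next_attempt_at = Status.PENDING, 0, None
    return n


def operator_discard(n):
    """Operator Discard: Parked -> Discarded; the row itself is kept."""
    if n.status is not Status.PARKED:
        raise ValueError("Discard is only offered on parked notifications")
    n.status = Status.DISCARDED
    return n
```

Keeping the transition logic in pure functions like these, separate from the polling and I/O, is one way to make the lifecycle rules easy to unit test; whether the real actor is structured this way is not stated in the design.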