# Design: Notification Outbox

**Date:** 2026-05-18

**Status:** Basic design — approved, open for refinement.

## Problem

Notification delivery today happens at the site clusters: scripts call `Notify.To().Send()`, the Notification Service composes an email, and the site sends it via SMTP. The Store-and-Forward Engine buffers transient failures. Two gaps motivated this design:

1. **No audit trail.** A successful send is recorded nowhere. A permanently-failed send is returned to the script and then lost. Only a transiently-failed-and-buffered notification is visible — indirectly, as Store-and-Forward activity.
2. **No monitoring.** There is no view of delivery health: no KPIs, and no way to find notifications that are stuck or have been parked.

## Solution overview

Invert where delivery happens. Sites no longer send notifications directly. Instead:

- A site script's notification is **store-and-forwarded to the central cluster**.
- Central **logs every notification to a `Notifications` table** in the central config DB (MS SQL) — the single source of audit truth.
- A central **Notification Outbox** dispatches and delivers from that table, with retry, parking, per-notification status, and KPIs.

The `Notifications` table is type-agnostic so it can record any notification type the system supports — email today, Microsoft Teams and others later.
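As a rough illustration of the central logging step, the sketch below shows an insert-if-not-exists ingest keyed on a site-generated GUID, which is what makes a resent notification harmless. It is a minimal sketch under stated assumptions: an in-memory SQLite database stands in for the central MS SQL config DB, the table carries only a subset of the planned columns, and the function names `create_schema` and `ingest` are hypothetical.

```python
import sqlite3
import uuid
from datetime import datetime, timezone


def create_schema(conn: sqlite3.Connection) -> None:
    # Reduced, illustrative subset of the Notifications table.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS Notifications (
            NotificationId TEXT PRIMARY KEY,  -- GUID generated at the site
            ListName       TEXT NOT NULL,
            Subject        TEXT NOT NULL,
            Body           TEXT NOT NULL,
            Status         TEXT NOT NULL,
            CreatedAt      TEXT NOT NULL      -- UTC, set when central ingests
        )
        """
    )


def ingest(conn: sqlite3.Connection, notification_id: str,
           list_name: str, subject: str, body: str) -> bool:
    """Insert the row unless the NotificationId already exists.

    Returns True when a new row was written and False for a duplicate.
    In both cases the caller can ack the site, because the row is persisted.
    """
    created_at = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        # SQLite idiom; the MS SQL equivalent would differ (assumption).
        "INSERT OR IGNORE INTO Notifications "
        "(NotificationId, ListName, Subject, Body, Status, CreatedAt) "
        "VALUES (?, ?, ?, ?, 'Pending', ?)",
        (notification_id, list_name, subject, body, created_at),
    )
    conn.commit()
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    nid = str(uuid.uuid4())
    print(ingest(conn, nid, "ops-alerts", "Pump fault", "Pump 3 tripped"))
    # A resend after a lost ack carries the same NotificationId.
    print(ingest(conn, nid, "ops-alerts", "Pump fault", "Pump 3 tripped"))
```

Keying on an identifier generated at the site means central needs no extra deduplication state: the primary key itself rejects the duplicate.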
### End-to-end flow

```
Site script:  Notify.To("list").Send(subject, body)
   │  generate NotificationId (GUID) locally; return it to the script immediately
   ▼
Site Store-and-Forward Engine   (notification category, target = central)
   │  durably forwards to central via the Communication Layer (ClusterClient);
   │  buffers/retries if central is unreachable
   ▼
Central ingest:  insert-if-not-exists on NotificationId → Notifications table (Pending)
   │  ack the site → site S&F clears the message
   ▼
Central Notification Outbox actor   (singleton, active central node)
   │  polls due rows; resolves the list; delivers via the matching adapter
   ├── success            → Delivered
   ├── transient failure  → Retrying (schedule NextAttemptAt)
   └── permanent failure / retries exhausted → Parked
```

`Notify.Status(notificationId)` returns a small **status record** — status, retry count, last error, and key timestamps (enqueued, delivered). While the notification is still in the site S&F buffer the site answers the query **locally** (status `Forwarding`); once forwarded, the query round-trips to central and reads the `Notifications` table.

## Component design

### New component #21: Notification Outbox

A **central** component — the first outbox to live centrally (the Store-and-Forward Engine remains site-only).

- **Location:** Central cluster.
- **Actor:** `NotificationOutboxActor` — a **singleton on the active central node**.
- **Owns:** the durable central queue (the `Notifications` table), the dispatcher loop, retry scheduling, parking, per-notification status tracking, and KPI computation.
- SMTP/HTTP delivery is blocking I/O — delivery work runs on a **dedicated blocking-I/O dispatcher** (same pattern as Script Execution Actors).

### Notification Service (revised)

Shrinks to two clear jobs, both **central-only**:

- Manage **notification-list and SMTP definitions** in the config DB.
- Provide **delivery adapters** — stateless "deliver one notification" implementations per type (see below).
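The "deliver one notification" adapter contract in the last bullet, together with the outcome-to-status mapping from the end-to-end flow, can be sketched roughly as follows. The three outcomes and the status names come from this document; everything else is an assumption made for illustration: the Python names (`DeliveryAdapter`, `StubEmailAdapter`, `next_status`) are hypothetical and are not the project's `INotificationDeliveryAdapter` interface, and the exhaustion rule (park once the attempt count reaches the configured maximum) is one plausible reading of "retries exhausted".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Protocol


class DeliveryResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"


@dataclass(frozen=True)
class Notification:
    notification_id: str
    type: str        # "Email", "Teams", ...
    list_name: str
    subject: str
    body: str


class DeliveryAdapter(Protocol):
    """Stateless: delivers one notification to already-resolved targets."""

    def deliver(self, notification: Notification,
                targets: List[str]) -> DeliveryResult: ...


class StubEmailAdapter:
    """Hypothetical stand-in; a real adapter would talk to SMTP."""

    def deliver(self, notification: Notification,
                targets: List[str]) -> DeliveryResult:
        if not targets:
            # Nobody to send to: retrying cannot help.
            return DeliveryResult.PERMANENT_FAILURE
        return DeliveryResult.SUCCESS


def next_status(result: DeliveryResult, attempts_so_far: int,
                max_retries: int) -> str:
    """Map a delivery outcome to the next row status.

    attempts_so_far includes the attempt that just finished (assumption).
    """
    if result is DeliveryResult.SUCCESS:
        return "Delivered"
    if result is DeliveryResult.PERMANENT_FAILURE:
        return "Parked"
    # Transient failure: retry until the configured maximum is reached.
    return "Retrying" if attempts_so_far < max_retries else "Parked"


# One adapter registered per Type value.
ADAPTERS: Dict[str, DeliveryAdapter] = {"Email": StubEmailAdapter()}

if __name__ == "__main__":
    n = Notification("id-1", "Email", "ops-alerts", "Pump fault", "Pump 3 tripped")
    result = ADAPTERS[n.type].deliver(n, ["operator@example.com"])
    print(next_status(result, attempts_so_far=1, max_retries=3))
```

Keeping the adapter free of retry and status logic means a future type only has to classify its own failures; the outbox owns scheduling and parking.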
Notifications and SMTP config are **no longer deployed to sites**. Sites never talk to SMTP.

### Store-and-Forward Engine (revised)

Keeps its notification category, but the delivery *target* changes from SMTP to **central**. "Delivering" a buffered notification now means handing it to the Communication Layer for the central cluster and clearing it on central's ack. The site→central forward uses a fixed retry interval configured in the host `appsettings.json` — it concerns reaching the central cluster rather than any notification list.

## Typed notification lists

Each notification list gains a **`Type`** field plus type-specific targets:

- `Email` — a set of recipient addresses (implemented now).
- `Teams`, others — future types.

`Notify.To("list")` works transparently for any type — the script does not care. Lists are defined and stored centrally only. **Recipient resolution happens at central, at delivery time** — the site forwards only `(listName, subject, body)`. This keeps definitions in one place and removes the deploy-to-sites artifact entirely.

## The `Notifications` table (central MS SQL)

Type-agnostic. One row per notification.

| Field | Notes |
|---|---|
| `NotificationId` | GUID, primary key. Generated at the **site**; used as the idempotency key. |
| `Type` | `Email` / `Teams` / … discriminator. |
| `ListName` | Target notification list. |
| `Subject`, `Body` | Plain-text content. |
| `TypeData` | JSON — extensibility hook for future per-type fields. |
| `Status` | `Pending` → `Retrying` → `Delivered` / `Parked` / `Discarded`. |
| `RetryCount` | Delivery attempts so far. |
| `LastError` | Detail of the most recent failure. |
| `ResolvedTargets` | Who the notification actually went to — snapshotted by central at delivery time, for audit. |
| `SourceSiteId`, `SourceInstanceId`, `SourceScript` | Provenance. |
| `SiteEnqueuedAt` | When the script called `Send()` (carried from the site). |
| `CreatedAt` | When central ingested the row. |
| `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt` | Delivery timestamps. |

All timestamps are UTC.

### Status lifecycle

- `Forwarding` — in the site S&F buffer, not yet received by central. **Site-local only** — never stored in the central `Notifications` table; reported by `Notify.Status` while the site still holds the notification.
- `Pending` — ingested by central, awaiting first dispatch.
- `Retrying` — a transient failure occurred; `NextAttemptAt` schedules the next attempt.
- `Delivered` — terminal, success.
- `Parked` — terminal-not-delivered: a permanent failure, or retries exhausted. `LastError` distinguishes which.
- `Discarded` — terminal, reached **only by operator action** on a parked notification. The row is kept (not deleted) so the table remains a complete audit record.

### Retry policy

Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry interval — consistent with the existing fixed-interval (no backoff) convention.

### Retention

Terminal rows (`Delivered`, `Parked`, `Discarded`) are removed by a **daily purge job** after a configurable window (default ~1 year). This preserves a strong audit trail while bounding table growth. Non-terminal rows are never purged.

## Delivery adapters

An `INotificationDeliveryAdapter` is registered per `Type`. Each `Deliver(...)` call returns one of `success | transient failure | permanent failure`, mirroring the External System Gateway error-classification pattern.

- **Email adapter — implemented now.** The existing SMTP composition/send logic, relocated to the central cluster.
- **Teams and other adapters — future.** The `Type` discriminator and the adapter interface are the seam; no Teams code is written in this basic plan. Teams auth and targeting (Incoming Webhooks vs Graph API) is a separate design conversation.

## Active/standby behavior

The `NotificationOutboxActor` is a singleton on the active central node.
All outbox state lives in MS SQL, which is already the central HA store — so no Akka-level replication is needed (unlike the site S&F engine). On central failover the new active node resumes dispatch directly from the table.

The site→central handoff is **at-least-once**: central acks only after the row is persisted, and a lost ack causes the site to resend. The GUID `NotificationId` idempotency key makes a resend harmless (insert-if-not-exists). A rare failover mid-delivery could re-send one already-`Delivered` notification — an accepted trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward Engine already accepts.

## Monitoring

### KPIs

Central-computed from the `Notifications` table — global, with a per-source-site breakdown:

- **Queue depth** — count of `Pending` + `Retrying`.
- **Stuck count** — `Pending`/`Retrying` rows older than a configurable age threshold (default 10 minutes).
- **Parked count** — count of `Parked`.
- **Delivered (last interval)** — count of `Delivered` since the previous sample.
- **Oldest pending age** — age of the oldest non-terminal notification.

### Stuck detection

A notification is **stuck** if it is `Pending` or `Retrying` and older than the configurable age threshold. Detection is **display-only** — a count KPI and a row badge. No automated escalation or alerting, consistent with the current system-wide no-alerting policy.

### Surfacing

- **Health Monitoring dashboard** — headline KPI tiles: queue depth, stuck count, parked count. These are central-computed (not part of the site health report). The site S&F notification backlog remains a separate site health metric, covering the site→central leg.
- **New Central UI "Notification Outbox" page** — KPI tiles plus a queryable notification list: filter by status, type, source site, list, and time range; a stuck-only toggle; keyword search on subject.
  Parked notifications offer **Retry** (→ `Pending`, reset `RetryCount`/`NextAttemptAt`) and **Discard** (→ `Discarded`) actions. Stuck rows are badged.

## Cross-document impact

| Document | Change |
|---|---|
| `Component-NotificationOutbox.md` | **New** — component #21. |
| `Component-NotificationService.md` | Delivery moves central; lists gain a `Type`; no deploy-to-sites; async script API; delivery adapters. |
| `Component-StoreAndForward.md` | Notification category retargeted from SMTP to central. |
| `Component-HealthMonitoring.md` | Outbox KPIs added as central-computed headline metrics. |
| `Component-SiteEventLogging.md` | New Notification event category — logs site→central forward failures and long-buffered notifications. |
| `Component-CentralUI.md` | New Notification Outbox page. |
| Central–Site Communication | New `NotificationSubmit` + ack message pair. |
| Configuration Database / Commons | `Notifications` table, entity POCO, repository interface + implementation, EF migration, message contracts. |
| `README.md` | Component table 20 → 21. |
| `CLAUDE.md` | Component list 20 → 21; new key design decisions. |

## Refinement decisions (2026-05-18)

- **Site→central forward retry config** — the fixed forward-retry interval lives in the host `appsettings.json` (infrastructure config, not a deployed artifact).
- **`Notify.Status` payload** — returns a status record: status, retry count, last error, and key timestamps (enqueued, delivered).
- **Stuck threshold default** — 10 minutes, configurable.
- **Pre-ingest status** — a distinct site-local `Forwarding` state; the site answers `Notify.Status` from its own S&F buffer without a round-trip to central.
- **Site-side diagnostics** — Site Event Logging records site→central **forward failures** and long-buffered notifications only, not routine enqueue/forward success events.
- **KPI history** — point-in-time only, computed on demand from the `Notifications` table; the ~1-year row retention answers historical questions directly, so no separate time-series store is added.

## Open questions

None outstanding — the basic design is fully specified. The next step is an implementation plan against the cross-document impact table.