docs(plans): design for central notification outbox
Captures the basic design for a reliable notification outbox: sites store-and-forward notifications to the central cluster, which logs them to a type-agnostic Notifications table (single audit source) and delivers them via per-type adapters with retry, parking, and KPIs.
222
docs/plans/notif.md
Normal file
@@ -0,0 +1,222 @@
# Design: Notification Outbox

**Date:** 2026-05-18
**Status:** Basic design; approved, open for refinement.
## Problem

Notification delivery today happens at the site clusters: scripts call `Notify.To().Send()`,
the Notification Service composes an email, and the site sends it via SMTP. The Store-and-Forward
Engine buffers transient failures. Two gaps motivated this design:

1. **No audit trail.** A successful send is recorded nowhere. A permanently failed send is
   reported back to the script and then lost. Only a transiently failed, buffered notification
   is visible, and only indirectly, as Store-and-Forward activity.
2. **No monitoring.** There is no view of delivery health: no KPIs, and no way to find
   notifications that are stuck or have been parked.
## Solution overview

Invert where delivery happens. Sites no longer send notifications directly. Instead:

- A site script's notification is **store-and-forwarded to the central cluster**.
- Central **logs every notification to a `Notifications` table** in the central config DB
  (MS SQL), the single source of audit truth.
- A central **Notification Outbox** dispatches and delivers from that table, with retry,
  parking, per-notification status, and KPIs.

The `Notifications` table is type-agnostic so it can record any notification type the system
supports: email today, Microsoft Teams and others later.
### End-to-end flow

```
Site script:  Notify.To("list").Send(subject, body)
     │  generate NotificationId (GUID) locally; return it to the script immediately
     ▼
Site Store-and-Forward Engine  (notification category, target = central)
     │  durably forwards to central via the Communication Layer (ClusterClient);
     │  buffers/retries if central is unreachable
     ▼
Central ingest:  insert-if-not-exists on NotificationId → Notifications table (Pending)
     │  ack the site → site S&F clears the message
     ▼
Central Notification Outbox actor  (singleton, active central node)
     │  polls due rows; resolves the list; delivers via the matching adapter
     ├── success            → Delivered
     ├── transient failure  → Retrying (schedule NextAttemptAt)
     └── permanent failure
         / retries exhausted → Parked
```

`Notify.Status(notificationId)` round-trips site→central and reads the table. Before central
has ingested the row, status reads as `Pending` (in transit).
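The site-side contract in the flow above (ID generated locally, returned immediately, `Pending` until central has the row) can be sketched as follows. This is an illustration in Python, not the project's implementation: `SiteNotifier`, `outbox`, and `central_rows` are hypothetical names, and the in-memory containers stand in for the Store-and-Forward buffer and the central table.

```python
import uuid


class SiteNotifier:
    """Hypothetical stand-in for the site-side Notify API."""

    def __init__(self, central_rows):
        self.outbox = []                  # stands in for the site S&F buffer
        self.central_rows = central_rows  # stands in for the central Notifications table

    def send(self, list_name, subject, body):
        # The ID is generated locally, so the script gets it back without a round trip.
        notification_id = str(uuid.uuid4())
        self.outbox.append((notification_id, list_name, subject, body))
        return notification_id

    def status(self, notification_id):
        # No central row yet: the notification is read as Pending (in transit).
        row = self.central_rows.get(notification_id)
        return row["Status"] if row else "Pending"
```

One consequence of this reading is that an ID central has never seen is indistinguishable from one still in the site buffer, which is the trade-off the "Pre-ingest status" open question is about.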
## Component design

### New component #21: Notification Outbox

A **central** component, and the first outbox to live centrally (the Store-and-Forward Engine
remains site-only).

- **Location:** Central cluster.
- **Actor:** `NotificationOutboxActor`, a **singleton on the active central node**.
- **Owns:** the durable central queue (the `Notifications` table), the dispatcher loop,
  retry scheduling, parking, per-notification status tracking, and KPI computation.
- SMTP/HTTP delivery is blocking I/O, so delivery work runs on a **dedicated blocking-I/O
  dispatcher** (same pattern as Script Execution Actors).
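One dispatcher pass, as described in the list above, might look like the sketch below. It is written in Python with a thread pool standing in for the dedicated blocking-I/O dispatcher; the real component is an actor, and the selection rule (a `Pending` row has no `NextAttemptAt`, oldest `CreatedAt` first) is an assumption rather than something the design states.

```python
from concurrent.futures import ThreadPoolExecutor


def due_rows(rows, now):
    """Return the rows to attempt now, oldest first (ordering is an assumption)."""
    due = [
        r for r in rows
        if r["Status"] in ("Pending", "Retrying")
        and (r["NextAttemptAt"] is None or r["NextAttemptAt"] <= now)
    ]
    return sorted(due, key=lambda r: r["CreatedAt"])


def dispatch_once(rows, now, deliver, pool: ThreadPoolExecutor):
    """Run one pass; the blocking `deliver` calls execute on the pool's threads."""
    batch = due_rows(rows, now)
    # pool.map yields results in input order and waits for each delivery to finish.
    outcomes = list(pool.map(deliver, batch))
    return list(zip(batch, outcomes))
```

Keeping the selection (`due_rows`) separate from the blocking work means the polling logic can be tested without any network access.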
### Notification Service (revised)

Shrinks to two clear jobs, both **central-only**:

- Manage **notification-list and SMTP definitions** in the config DB.
- Provide **delivery adapters**: stateless "deliver one notification" implementations per
  type (see below).

Notification lists and SMTP config are **no longer deployed to sites**. Sites never talk to SMTP.
### Store-and-Forward Engine (revised)

Keeps its notification category, but the delivery *target* changes from SMTP to **central**.
"Delivering" a buffered notification now means handing it to the Communication Layer for the
central cluster and clearing it on central's ack. The site→central forward uses a fixed
retry interval (host-level config, since it concerns reaching central, not any list).
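The "clear on central's ack" rule can be sketched as one forwarding pass over the buffer. This is a Python illustration under stated assumptions: `forward_pending` and `submit` are hypothetical names, `submit` is assumed to return `True` on ack and to return `False` or raise `ConnectionError` when central is unreachable, and a plain list stands in for the durable buffer.

```python
def forward_pending(buffer, submit):
    """Attempt each buffered message once; keep every message central did not ack.

    Returns the number of messages still buffered. The caller is expected to
    run this again after the fixed retry interval.
    """
    kept = []
    for message in buffer:
        try:
            acked = submit(message)
        except ConnectionError:
            acked = False
        if not acked:
            kept.append(message)
    # A message leaves the buffer only once central has acknowledged it.
    buffer[:] = kept
    return len(kept)
```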
## Typed notification lists

Each notification list gains a **`Type`** field plus type-specific targets:

- `Email`: a set of recipient addresses (implemented now).
- `Teams`, others: future types.

`Notify.To("list")` works transparently for any type; the script does not care. Lists are
defined and stored centrally only.

**Recipient resolution happens at central, at delivery time.** The site forwards only
`(listName, subject, body)`. This keeps definitions in one place and removes the deploy-to-sites
artifact entirely.
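Because resolution happens at delivery time, the recipients are whatever the list contains when the outbox attempts delivery, not when the script called `Send()`. A small Python sketch of that lookup follows; `resolve_targets` and the dictionary layout are hypothetical, and returning `None` for an unknown list is an assumption, since the design does not say how that case is handled.

```python
def resolve_targets(lists, list_name):
    """Look up the list's current definition.

    Returns (type, sorted targets), or None if the list is unknown.
    """
    definition = lists.get(list_name)
    if definition is None:
        return None
    return definition["Type"], sorted(definition["Targets"])
```

The sorted result is what would be snapshotted into `ResolvedTargets` for audit.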
## The `Notifications` table (central MS SQL)

Type-agnostic. One row per notification.

| Field | Notes |
|---|---|
| `NotificationId` | GUID, primary key. Generated at the **site**; used as the idempotency key. |
| `Type` | `Email` / `Teams` / … discriminator. |
| `ListName` | Target notification list. |
| `Subject`, `Body` | Plain-text content. |
| `TypeData` | JSON; extensibility hook for future per-type fields. |
| `Status` | `Pending` → `Retrying` → `Delivered` / `Parked` / `Discarded`. |
| `RetryCount` | Delivery attempts so far. |
| `LastError` | Detail of the most recent failure. |
| `ResolvedTargets` | Who the notification actually went to, snapshotted by central at delivery time for audit. |
| `SourceSiteId`, `SourceInstanceId`, `SourceScript` | Provenance. |
| `SiteEnqueuedAt` | When the script called `Send()` (carried from the site). |
| `CreatedAt` | When central ingested the row. |
| `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt` | Delivery timestamps. |

All timestamps are UTC.
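For readers who prefer to see the row as a type, here is the table above expressed as a Python dataclass. It is only a sketch: the snake_case names, the choice of which fields are optional, and the defaults are assumptions, and the design's own entity would be defined in the project's language.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Notification:
    notification_id: str               # GUID generated at the site
    type: str                          # "Email", "Teams", ...
    list_name: str
    subject: str
    body: str
    source_site_id: str
    site_enqueued_at: datetime         # UTC, carried from the site
    created_at: datetime               # UTC, set when central ingests the row
    type_data: dict = field(default_factory=dict)
    status: str = "Pending"
    retry_count: int = 0
    last_error: Optional[str] = None
    resolved_targets: Optional[list] = None   # filled in at delivery time
    source_instance_id: Optional[str] = None  # optionality is an assumption
    source_script: Optional[str] = None       # optionality is an assumption
    last_attempt_at: Optional[datetime] = None
    next_attempt_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None
```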
### Status lifecycle

- `Pending`: ingested, awaiting first dispatch.
- `Retrying`: a transient failure occurred; `NextAttemptAt` schedules the next attempt.
- `Delivered`: terminal, success.
- `Parked`: terminal-not-delivered: a permanent failure, or retries exhausted. `LastError`
  distinguishes which.
- `Discarded`: terminal, reached **only by operator action** on a parked notification. The
  row is kept (not deleted) so the table remains a complete audit record.
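The lifecycle can be written down as a transition table, which makes the operator-only paths explicit. The Python sketch below is one reading of the design, not a confirmed rule set: it includes the `Parked` → `Pending` move that the operator Retry action performs, and it assumes a repeated transient failure keeps a row in `Retrying`.

```python
# Allowed status transitions. Parked can only be left by operator action.
ALLOWED = {
    "Pending":   {"Delivered", "Retrying", "Parked"},
    "Retrying":  {"Delivered", "Retrying", "Parked"},
    "Parked":    {"Pending", "Discarded"},  # operator Retry / Discard
    "Delivered": set(),
    "Discarded": set(),
}


def can_transition(current, new):
    """Return True if moving a row from `current` to `new` is allowed."""
    return new in ALLOWED.get(current, set())
```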
### Retry policy

Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry
interval, consistent with the existing fixed-interval (no backoff) convention.
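A fixed-interval policy reduces to a small state update after each attempt. The sketch below is in Python and rests on assumptions: the function name and outcome strings are hypothetical, and `RetryCount` is treated as the number of retries already scheduled, although the table's "delivery attempts so far" could also be read as counting the first attempt.

```python
def apply_outcome(row, outcome, now, max_retries, retry_interval, error=None):
    """Update a row dict in place after one delivery attempt and return it.

    `outcome` is "success", "transient", or "permanent".
    """
    row["LastAttemptAt"] = now
    if outcome == "success":
        row["Status"] = "Delivered"
        row["DeliveredAt"] = now
        row["NextAttemptAt"] = None
        return row
    row["LastError"] = error
    if outcome == "transient" and row["RetryCount"] < max_retries:
        row["RetryCount"] += 1
        row["Status"] = "Retrying"
        row["NextAttemptAt"] = now + retry_interval  # fixed interval, no backoff
    else:
        # Permanent failure, or a transient failure with retries exhausted.
        row["Status"] = "Parked"
        row["NextAttemptAt"] = None
    return row
```

Under this reading, `max_retries=2` allows the first attempt plus two retries; the third transient failure parks the row.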
### Retention

Terminal rows (`Delivered`, `Parked`, `Discarded`) are removed by a **daily purge job** after
a configurable window (default ~1 year). This preserves a strong audit trail while bounding
table growth. Non-terminal rows are never purged.
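The purge rule is a filter on status and age. In the Python sketch below the window is measured from `CreatedAt`, which is an assumption: the design does not say which timestamp the retention window counts from, and `purge` is a hypothetical name.

```python
from datetime import timedelta

TERMINAL = {"Delivered", "Parked", "Discarded"}


def purge(rows, now, retention=timedelta(days=365)):
    """Return the rows that survive one purge pass.

    A row is dropped only if it is terminal AND older than the retention window.
    """
    cutoff = now - retention
    return [
        r for r in rows
        if r["Status"] not in TERMINAL or r["CreatedAt"] >= cutoff
    ]
```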
## Delivery adapters

An `INotificationDeliveryAdapter` is registered per `Type`. Each `Deliver(...)` call returns
one of `success | transient failure | permanent failure`, mirroring the External System
Gateway error-classification pattern.

- **Email adapter (implemented now).** The existing SMTP composition/send logic, relocated
  to the central cluster.
- **Teams and other adapters (future).** The `Type` discriminator and the adapter interface
  are the seam; no Teams code is written in this basic plan. Teams auth and targeting
  (Incoming Webhooks vs Graph API) is a separate design conversation.
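The adapter seam can be illustrated with a three-valued result and a per-type registry. This Python sketch does not reproduce the actual `INotificationDeliveryAdapter` signature, which the design leaves as `Deliver(...)`; the parameter list, the `FakeEmailAdapter` test double, and the choice to treat an unregistered type as a permanent failure are all assumptions.

```python
from enum import Enum
from typing import Protocol


class DeliveryResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"


class DeliveryAdapter(Protocol):
    """Hypothetical shape of a stateless 'deliver one notification' adapter."""

    def deliver(self, targets: list, subject: str, body: str) -> DeliveryResult: ...


class FakeEmailAdapter:
    """Test double that records sends; a real adapter would talk to SMTP."""

    def __init__(self):
        self.sent = []

    def deliver(self, targets, subject, body):
        if not targets:
            return DeliveryResult.PERMANENT_FAILURE  # nobody to send to
        self.sent.append((tuple(targets), subject, body))
        return DeliveryResult.SUCCESS


# One adapter per notification Type.
ADAPTERS = {"Email": FakeEmailAdapter()}


def deliver(notification_type, targets, subject, body):
    adapter = ADAPTERS.get(notification_type)
    if adapter is None:
        # Assumption: retrying cannot help when no adapter is registered.
        return DeliveryResult.PERMANENT_FAILURE
    return adapter.deliver(targets, subject, body)
```

Adding a Teams adapter later would then be a new registry entry, with no change to the dispatch code.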
## Active/standby behavior

The `NotificationOutboxActor` is a singleton on the active central node. All outbox state
lives in MS SQL, which is already the central HA store, so no Akka-level replication is
needed (unlike the site S&F engine). On central failover the new active node resumes
dispatch directly from the table.

The site→central handoff is **at-least-once**: central acks only after the row is persisted,
and a lost ack causes the site to resend. The GUID `NotificationId` idempotency key makes a
resend harmless (insert-if-not-exists). A rare failover mid-delivery could re-send one
notification that had already gone out but was not yet marked `Delivered`. This is an accepted
trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward Engine
already accepts.
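The insert-if-not-exists step is what turns an at-least-once handoff into a single row per notification. The sketch below demonstrates the idea in Python with an in-memory SQLite database, using SQLite's `INSERT OR IGNORE` as a stand-in; the central store is MS SQL, which would need its own equivalent construct, and the reduced column set and `ingest` name are for illustration only.

```python
import sqlite3


def ingest(conn, notification_id, list_name, subject, body):
    """Insert the row if its ID is new; return True if this call created it."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO Notifications "
        "(NotificationId, ListName, Subject, Body, Status) "
        "VALUES (?, ?, ?, ?, 'Pending')",
        (notification_id, list_name, subject, body),
    )
    conn.commit()  # the ack to the site would be sent only after this returns
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Notifications ("
    "NotificationId TEXT PRIMARY KEY, ListName TEXT, "
    "Subject TEXT, Body TEXT, Status TEXT)"
)
```

A resend after a lost ack hits the primary key, inserts nothing, and can simply be acked again.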
## Monitoring

### KPIs

Computed centrally from the `Notifications` table, globally and with a per-source-site breakdown:

- **Queue depth**: count of `Pending` + `Retrying`.
- **Stuck count**: `Pending`/`Retrying` rows older than a configurable age threshold
  (default 10 minutes).
- **Parked count**: count of `Parked`.
- **Delivered (last interval)**: count of `Delivered` since the previous sample.
- **Oldest pending age**: age of the oldest non-terminal (`Pending` or `Retrying`) notification.
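The KPIs above are simple aggregates over the table. The following Python sketch computes the global figures from in-memory rows; it assumes age is measured from `CreatedAt` (the design does not say whether `SiteEnqueuedAt` should be used instead), leaves out the per-source-site breakdown, and uses hypothetical names throughout.

```python
from datetime import timedelta


def compute_kpis(rows, now, previous_sample, stuck_after=timedelta(minutes=10)):
    """Compute point-in-time outbox KPIs from a list of row dicts."""
    open_rows = [r for r in rows if r["Status"] in ("Pending", "Retrying")]
    ages = [now - r["CreatedAt"] for r in open_rows]
    return {
        "queue_depth": len(open_rows),
        "stuck_count": sum(1 for age in ages if age > stuck_after),
        "parked_count": sum(1 for r in rows if r["Status"] == "Parked"),
        "delivered_last_interval": sum(
            1 for r in rows
            if r["Status"] == "Delivered" and r["DeliveredAt"] > previous_sample
        ),
        # None when nothing is waiting.
        "oldest_pending_age": max(ages, default=None),
    }
```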
### Stuck detection

A notification is **stuck** if it is `Pending` or `Retrying` and older than the configurable
age threshold. Detection is **display-only**: a count KPI and a row badge. No automated
escalation or alerting, consistent with the current system-wide no-alerting policy.
### Surfacing

- **Health Monitoring dashboard**: headline KPI tiles for queue depth, stuck count, and parked
  count. These are central-computed (not part of the site health report). The site S&F
  notification backlog remains a separate site health metric, covering the site→central leg.
- **New Central UI "Notification Outbox" page**: KPI tiles plus a queryable notification
  list, with filters for status, type, source site, list, and time range; a stuck-only toggle;
  and keyword search on subject. Parked notifications offer **Retry** (→ `Pending`, reset
  `RetryCount`/`NextAttemptAt`) and **Discard** (→ `Discarded`) actions. Stuck rows are badged.
## Cross-document impact

| Document | Change |
|---|---|
| `Component-NotificationOutbox.md` | **New**: component #21. |
| `Component-NotificationService.md` | Delivery moves central; lists gain a `Type`; no deploy-to-sites; async script API; delivery adapters. |
| `Component-StoreAndForward.md` | Notification category retargeted from SMTP to central. |
| `Component-HealthMonitoring.md` | Outbox KPIs added as central-computed headline metrics. |
| `Component-CentralUI.md` | New Notification Outbox page. |
| Central–Site Communication | New `NotificationSubmit` + ack message pair. |
| Configuration Database / Commons | `Notifications` table, entity POCO, repository interface + implementation, EF migration, message contracts. |
| `README.md` | Component table 20 → 21. |
| `CLAUDE.md` | Component list 20 → 21; new key design decisions. |
## Open questions for refinement

- **Site→central forward retry config**: where the fixed forward-retry interval lives
  (host appsettings vs a deployed setting).
- **`Notify.Status` payload**: whether status queries also return retry count / last error
  to scripts, or just the status enum.
- **Stuck threshold default**: 10 minutes is a placeholder.
- **Pre-ingest status**: confirm `Pending` is the right reading for a notification still
  in the site S&F buffer (vs a distinct "Forwarding" state).
- **Site-side diagnostics**: whether to keep a lightweight Site Event Logging entry for
  "notification enqueued / forwarded," now that central holds the authoritative record.
- **KPI history**: KPIs are currently point-in-time; whether any trend/history is wanted.