Captures the basic design for a reliable notification outbox: sites store-and-forward notifications to the central cluster, which logs them to a type-agnostic Notifications table (single audit source) and delivers them via per-type adapters with retry, parking, and KPIs.
Design: Notification Outbox
Date: 2026-05-18 Status: Basic design — approved, open for refinement.
Problem
Notification delivery today happens at the site clusters: scripts call Notify.To().Send(),
the Notification Service composes an email, and the site sends it via SMTP. The Store-and-Forward
Engine buffers transient failures. Two gaps motivated this design:
- No audit trail. A successful send is recorded nowhere. A permanently-failed send is returned to the script and then lost. Only a transiently-failed-and-buffered notification is visible — indirectly, as Store-and-Forward activity.
- No monitoring. There is no view of delivery health: no KPIs, and no way to find notifications that are stuck or have been parked.
Solution overview
Invert where delivery happens. Sites no longer send notifications directly. Instead:
- A site script's notification is store-and-forwarded to the central cluster.
- Central logs every notification to a Notifications table in the central config DB (MS SQL), the single source of audit truth.
- A central Notification Outbox dispatches and delivers from that table, with retry, parking, per-notification status, and KPIs.
The Notifications table is type-agnostic so it can record any notification type the system
supports — email today, Microsoft Teams and others later.
End-to-end flow
Site script: Notify.To("list").Send(subject, body)
│ generate NotificationId (GUID) locally; return it to the script immediately
▼
Site Store-and-Forward Engine (notification category, target = central)
│ durably forwards to central via the Communication Layer (ClusterClient);
│ buffers/retries if central is unreachable
▼
Central ingest: insert-if-not-exists on NotificationId → Notifications table (Pending)
│ ack the site → site S&F clears the message
▼
Central Notification Outbox actor (singleton, active central node)
│ polls due rows; resolves the list; delivers via the matching adapter
├── success → Delivered
├── transient failure → Retrying (schedule NextAttemptAt)
└── permanent failure / retries exhausted → Parked
Notify.Status(notificationId) round-trips site→central and reads the table. Before central
has ingested the row, status reads as Pending (in transit).
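The site-side half of this flow can be sketched as follows. This is a minimal illustration in Python, not the system's implementation language; `SiteNotifier`, `OutboxMessage`, and the in-memory `buffer` list are hypothetical stand-ins for the script API and the Store-and-Forward Engine's durable buffer.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class OutboxMessage:
    """What the site forwards to central: the list name, never recipients."""
    notification_id: str
    list_name: str
    subject: str
    body: str
    site_enqueued_at: datetime


@dataclass
class SiteNotifier:
    # Hypothetical stand-in for the Store-and-Forward notification category.
    buffer: list = field(default_factory=list)

    def send(self, list_name: str, subject: str, body: str) -> str:
        # The GUID is generated locally so the script gets an id immediately,
        # before central has seen the notification.
        notification_id = str(uuid.uuid4())
        self.buffer.append(OutboxMessage(
            notification_id, list_name, subject, body,
            datetime.now(timezone.utc),
        ))
        return notification_id


notifier = SiteNotifier()
nid = notifier.send("ops-oncall", "Pump 3 fault", "Pressure out of range")
```

The returned id is what a script would later pass to Notify.Status; until central ingests the row, that lookup would read as Pending.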
Component design
New component #21: Notification Outbox
A central component — the first outbox to live centrally (the Store-and-Forward Engine remains site-only).
- Location: Central cluster.
- Actor: NotificationOutboxActor, a singleton on the active central node.
- Owns: the durable central queue (the Notifications table), the dispatcher loop, retry scheduling, parking, per-notification status tracking, and KPI computation.
- SMTP/HTTP delivery is blocking I/O, so delivery work runs on a dedicated blocking-I/O dispatcher (same pattern as Script Execution Actors).
Notification Service (revised)
Shrinks to two clear jobs, both central-only:
- Manage notification-list and SMTP definitions in the config DB.
- Provide delivery adapters — stateless "deliver one notification" implementations per type (see below).
Notification-list and SMTP definitions are no longer deployed to sites. Sites never talk to SMTP.
Store-and-Forward Engine (revised)
Keeps its notification category, but the delivery target changes from SMTP to central. "Delivering" a buffered notification now means handing it to the Communication Layer for the central cluster and clearing it on central's ack. The site→central forward uses a fixed retry interval (host-level config, since it concerns reaching central, not any list).
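A rough sketch of one forwarding pass, in Python for illustration. The `forward_pending` function and the `submit` callback are hypothetical names; stopping at the first failure and preserving order are assumptions of this sketch, not decisions stated in the design. The caller would wait the fixed retry interval between passes.

```python
from collections import deque
from typing import Callable


def forward_pending(buffer: deque, submit: Callable[[dict], bool]) -> int:
    """Try to hand each buffered message to central once, in order.

    `submit` returns True when central acks (row persisted). A message is
    removed only on ack; on the first failure the pass stops so the caller
    can wait the fixed retry interval before trying again.
    """
    forwarded = 0
    while buffer:
        if not submit(buffer[0]):
            break          # central unreachable or no ack: keep the message
        buffer.popleft()   # acked: safe to clear from the site buffer
        forwarded += 1
    return forwarded


# Simulated central that fails the first call, then acks everything.
calls = {"n": 0}


def flaky_submit(message: dict) -> bool:
    calls["n"] += 1
    return calls["n"] > 1


pending = deque([{"id": "a"}, {"id": "b"}])
first_pass = forward_pending(pending, flaky_submit)   # nothing acked yet
second_pass = forward_pending(pending, flaky_submit)  # both acked
```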
Typed notification lists
Each notification list gains a Type field plus type-specific targets:
- Email: a set of recipient addresses (implemented now).
- Teams, others: future types.
Notify.To("list") works transparently for any type — the script does not care. Lists are
defined and stored centrally only.
Recipient resolution happens at central, at delivery time — the site forwards only
(listName, subject, body). This keeps definitions in one place and removes the deploy-to-sites
artifact entirely.
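Delivery-time resolution might look like the sketch below (Python, illustrative only). `NotificationList`, `LISTS`, and `resolve` are hypothetical names, the addresses are placeholders, and raising on an unknown list is an assumption; the design does not say how a missing list is handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationList:
    name: str
    type: str       # "Email" today; other types are future work
    targets: tuple  # type-specific; recipient addresses for Email


# Hypothetical central list store; sites hold no copy of this.
LISTS = {
    "ops-oncall": NotificationList(
        "ops-oncall", "Email", ("a@example.com", "b@example.com")),
}


def resolve(list_name: str) -> NotificationList:
    """Look the list up at delivery time, not at enqueue time."""
    try:
        return LISTS[list_name]
    except KeyError:
        # Assumption: an unknown list is an error the dispatcher must handle.
        raise LookupError(f"unknown notification list: {list_name}") from None


resolved = resolve("ops-oncall")
```

Because the lookup happens when the row is dispatched, an edit to a list also applies to notifications that were queued before the edit.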
The Notifications table (central MS SQL)
Type-agnostic. One row per notification.
| Field | Notes |
|---|---|
| NotificationId | GUID, primary key. Generated at the site; used as the idempotency key. |
| Type | Email / Teams / … discriminator. |
| ListName | Target notification list. |
| Subject, Body | Plain-text content. |
| TypeData | JSON; extensibility hook for future per-type fields. |
| Status | Pending → Retrying → Delivered / Parked / Discarded. |
| RetryCount | Delivery attempts so far. |
| LastError | Detail of the most recent failure. |
| ResolvedTargets | Who the notification actually went to; snapshotted by central at delivery time, for audit. |
| SourceSiteId, SourceInstanceId, SourceScript | Provenance. |
| SiteEnqueuedAt | When the script called Send() (carried from the site). |
| CreatedAt | When central ingested the row. |
| LastAttemptAt, NextAttemptAt, DeliveredAt | Delivery timestamps. |
All timestamps are UTC.
Status lifecycle
- Pending: ingested, awaiting first dispatch.
- Retrying: a transient failure occurred; NextAttemptAt schedules the next attempt.
- Delivered: terminal, success.
- Parked: terminal-not-delivered, caused by a permanent failure or by exhausted retries. LastError distinguishes which.
- Discarded: terminal, reached only by operator action on a parked notification. The row is kept (not deleted) so the table remains a complete audit record.
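The lifecycle can be read as a small state machine. The sketch below (Python, illustrative) encodes one interpretation of the allowed transitions; the `TRANSITIONS` map and `transition` helper are hypothetical, and the Parked → Pending edge reflects the operator Retry action described under Surfacing.

```python
from enum import Enum


class Status(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    DISCARDED = "Discarded"


# Allowed transitions as read from the lifecycle. Parked -> Pending and
# Parked -> Discarded are the operator Retry and Discard actions.
TRANSITIONS = {
    Status.PENDING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    Status.RETRYING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    Status.PARKED: {Status.PENDING, Status.DISCARDED},
    Status.DELIVERED: set(),
    Status.DISCARDED: set(),
}


def transition(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the lifecycle does not allow it."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```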
Retry policy
Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry interval — consistent with the existing fixed-interval (no backoff) convention.
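As a sketch of the fixed-interval policy (Python, illustrative): the helper name, the limit of 3, and the 5-minute interval are hypothetical values. It also assumes the retry limit counts all delivery attempts, which the design does not spell out.

```python
from datetime import datetime, timedelta, timezone


def after_transient_failure(retry_count, max_retries, retry_interval, now):
    """Return (status, retry_count, next_attempt_at) after a transient failure.

    Assumption: the notification is parked once the attempt count reaches
    max_retries. The interval is fixed; there is no backoff.
    """
    retry_count += 1
    if retry_count >= max_retries:
        return "Parked", retry_count, None
    return "Retrying", retry_count, now + retry_interval


now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
interval = timedelta(minutes=5)
first = after_transient_failure(0, 3, interval, now)
third = after_transient_failure(2, 3, interval, now)
```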
Retention
Terminal rows (Delivered, Parked, Discarded) are removed by a daily purge job after
a configurable window (default ~1 year). This preserves a strong audit trail while bounding
table growth. Non-terminal rows are never purged.
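The purge rule can be sketched as a filter (Python, illustrative). Which timestamp defines a row's age is not stated in the design; the `last_changed_at` key here is a hypothetical field, and 365 days stands in for the roughly one-year default.

```python
from datetime import datetime, timedelta, timezone

TERMINAL = {"Delivered", "Parked", "Discarded"}


def purgeable(rows, now, window=timedelta(days=365)):
    """Return ids of terminal rows older than the retention window."""
    cutoff = now - window
    return [
        r["id"] for r in rows
        if r["status"] in TERMINAL and r["last_changed_at"] < cutoff
    ]


now = datetime(2026, 5, 18, tzinfo=timezone.utc)
rows = [
    {"id": "old-delivered", "status": "Delivered",
     "last_changed_at": now - timedelta(days=400)},
    {"id": "recent-parked", "status": "Parked",
     "last_changed_at": now - timedelta(days=10)},
    # Non-terminal rows are never purged, however old they are.
    {"id": "old-pending", "status": "Pending",
     "last_changed_at": now - timedelta(days=400)},
]
to_purge = purgeable(rows, now)
```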
Delivery adapters
An INotificationDeliveryAdapter is registered per Type. Each Deliver(...) call returns
one of success | transient failure | permanent failure, mirroring the External System
Gateway error-classification pattern.
- Email adapter — implemented now. The existing SMTP composition/send logic, relocated to the central cluster.
- Teams and other adapters: future. The Type discriminator and the adapter interface are the seam; no Teams code is written in this basic plan. Teams auth and targeting (Incoming Webhooks vs Graph API) is a separate design conversation.
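The adapter seam could be shaped roughly like this (Python, illustrative; the real interface is INotificationDeliveryAdapter and its signature is not given here). `FakeEmailAdapter`, `ADAPTERS`, and `dispatch` are hypothetical, and treating a missing adapter or an empty target list as a permanent failure is an assumption of this sketch.

```python
from enum import Enum
from typing import Protocol


class DeliveryResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"


class NotificationDeliveryAdapter(Protocol):
    def deliver(self, targets: list, subject: str, body: str) -> DeliveryResult: ...


class FakeEmailAdapter:
    """Test double; a real adapter would compose and send via SMTP."""

    def __init__(self):
        self.sent = []

    def deliver(self, targets, subject, body):
        if not targets:
            # Assumption: nothing to send to is not worth retrying.
            return DeliveryResult.PERMANENT_FAILURE
        self.sent.append((tuple(targets), subject))
        return DeliveryResult.SUCCESS


# One adapter registered per Type; only Email exists in this sketch.
ADAPTERS = {"Email": FakeEmailAdapter()}


def dispatch(type_, targets, subject, body):
    adapter = ADAPTERS.get(type_)
    if adapter is None:
        # Assumption: an unregistered type cannot succeed on retry.
        return DeliveryResult.PERMANENT_FAILURE
    return adapter.deliver(targets, subject, body)
```

Adding a new type would then mean registering one more entry in the map, with no change to the dispatcher.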
Active/standby behavior
The NotificationOutboxActor is a singleton on the active central node. All outbox state
lives in MS SQL, which is already the central HA store — so no Akka-level replication is
needed (unlike the site S&F engine). On central failover the new active node resumes
dispatch directly from the table.
The site→central handoff is at-least-once: central acks only after the row is persisted,
and a lost ack causes the site to resend. The GUID NotificationId idempotency key makes a
resend harmless (insert-if-not-exists). A rare failover mid-delivery, after the adapter has
sent but before the row is marked Delivered, could re-send one notification. This is an
accepted trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward
Engine already accepts.
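The idempotent ingest can be demonstrated with a small runnable sketch. It uses Python's built-in sqlite3 module and SQLite's `INSERT OR IGNORE` purely for illustration; the real table lives in MS SQL, where the insert-if-not-exists statement would be written differently. The `ingest` function and the reduced three-column table are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Notifications ("
    " NotificationId TEXT PRIMARY KEY,"
    " ListName TEXT NOT NULL,"
    " Status TEXT NOT NULL)"
)


def ingest(notification_id: str, list_name: str) -> bool:
    """Insert-if-not-exists; return True so the caller acks either way."""
    with conn:  # commits on exit, so the ack implies a persisted row
        conn.execute(
            "INSERT OR IGNORE INTO Notifications VALUES (?, ?, 'Pending')",
            (notification_id, list_name),
        )
    return True


nid = "11111111-1111-1111-1111-111111111111"
first_ack = ingest(nid, "ops-oncall")
second_ack = ingest(nid, "ops-oncall")  # resend after a lost ack
row_count = conn.execute("SELECT COUNT(*) FROM Notifications").fetchone()[0]
```

Both calls ack, but only one row exists, which is what makes the site's resend harmless.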
Monitoring
KPIs
Central-computed from the Notifications table — global, with a per-source-site breakdown:
- Queue depth: count of Pending + Retrying.
- Stuck count: Pending/Retrying rows older than a configurable age threshold (default 10 minutes).
- Parked count: count of Parked.
- Delivered (last interval): count of Delivered since the previous sample.
- Oldest pending age: age of the oldest non-terminal notification.
Stuck detection
A notification is stuck if it is Pending or Retrying and older than the configurable
age threshold. Detection is display-only — a count KPI and a row badge. No automated
escalation or alerting, consistent with the current system-wide no-alerting policy.
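A point-in-time KPI computation might look like the sketch below (Python, illustrative). `compute_kpis` is a hypothetical helper, and measuring age from CreatedAt is an assumption; the design does not say whether age is counted from CreatedAt or SiteEnqueuedAt. The delivered-per-interval KPI is omitted because it needs the previous sample time.

```python
from datetime import datetime, timedelta, timezone

NON_TERMINAL = {"Pending", "Retrying"}


def compute_kpis(rows, now, stuck_after=timedelta(minutes=10)):
    """Compute headline KPIs from an iterable of notification rows."""
    open_rows = [r for r in rows if r["status"] in NON_TERMINAL]
    ages = [now - r["created_at"] for r in open_rows]
    return {
        "queue_depth": len(open_rows),
        "stuck_count": sum(1 for age in ages if age > stuck_after),
        "parked_count": sum(1 for r in rows if r["status"] == "Parked"),
        "oldest_pending_age": max(ages, default=timedelta(0)),
    }


now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
rows = [
    {"status": "Pending", "created_at": now - timedelta(minutes=2)},
    {"status": "Retrying", "created_at": now - timedelta(minutes=30)},
    {"status": "Parked", "created_at": now - timedelta(minutes=60)},
    {"status": "Delivered", "created_at": now - timedelta(minutes=5)},
]
kpis = compute_kpis(rows, now)
```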
Surfacing
- Health Monitoring dashboard — headline KPI tiles: queue depth, stuck count, parked count. These are central-computed (not part of the site health report). The site S&F notification backlog remains a separate site health metric, covering the site→central leg.
- New Central UI "Notification Outbox" page: KPI tiles plus a queryable notification list. Filter by status, type, source site, list, and time range; a stuck-only toggle; keyword search on subject. Parked notifications offer Retry (→ Pending, reset RetryCount/NextAttemptAt) and Discard (→ Discarded) actions. Stuck rows are badged.
Cross-document impact
| Document | Change |
|---|---|
| Component-NotificationOutbox.md | New: component #21. |
| Component-NotificationService.md | Delivery moves central; lists gain a Type; no deploy-to-sites; async script API; delivery adapters. |
| Component-StoreAndForward.md | Notification category retargeted from SMTP to central. |
| Component-HealthMonitoring.md | Outbox KPIs added as central-computed headline metrics. |
| Component-CentralUI.md | New Notification Outbox page. |
| Central–Site Communication | New NotificationSubmit + ack message pair. |
| Configuration Database / Commons | Notifications table, entity POCO, repository interface + implementation, EF migration, message contracts. |
| README.md | Component table 20 → 21. |
| CLAUDE.md | Component list 20 → 21; new key design decisions. |
Open questions for refinement
- Site→central forward retry config: where the fixed forward-retry interval lives (host appsettings vs a deployed setting).
- Notify.Status payload: whether status queries also return retry count / last error to scripts, or just the status enum.
- Stuck threshold default: 10 minutes is a placeholder.
- Pre-ingest status: confirm Pending is the right reading for a notification still in the site S&F buffer (vs a distinct "Forwarding" state).
- Site-side diagnostics: whether to keep a lightweight Site Event Logging entry for "notification enqueued / forwarded," now that central holds the authoritative record.
- KPI history: KPIs are currently point-in-time; whether any trend/history is wanted.