Joseph Doherty d4e86c1b1d docs(plans): design for central notification outbox

Captures the basic design for a reliable notification outbox: sites
store-and-forward notifications to the central cluster, which logs
them to a type-agnostic Notifications table (single audit source) and
delivers them via per-type adapters with retry, parking, and KPIs.

2026-05-18 22:54:17 -04:00

Design: Notification Outbox

Date: 2026-05-18
Status: Basic design (approved, open for refinement).

Problem

Notification delivery today happens at the site clusters: scripts call Notify.To().Send(), the Notification Service composes an email, and the site sends it via SMTP. The Store-and-Forward Engine buffers transient failures. Two gaps motivated this design:

No audit trail. A successful send is recorded nowhere. A permanently failed send is reported back to the script and then lost. Only a notification that failed transiently and was buffered is visible, and only indirectly, as Store-and-Forward activity.
No monitoring. There is no view of delivery health: no KPIs, and no way to find notifications that are stuck or have been parked.

Solution overview

Invert where delivery happens. Sites no longer send notifications directly. Instead:

A site script's notification is store-and-forwarded to the central cluster.
Central logs every notification to a Notifications table in the central config DB (MS SQL) — the single source of audit truth.
A central Notification Outbox dispatches and delivers from that table, with retry, parking, per-notification status, and KPIs.

The Notifications table is type-agnostic so it can record any notification type the system supports — email today, Microsoft Teams and others later.

End-to-end flow

Site script: Notify.To("list").Send(subject, body)
    │  generate NotificationId (GUID) locally; return it to the script immediately
    ▼
Site Store-and-Forward Engine  (notification category, target = central)
    │  durably forwards to central via the Communication Layer (ClusterClient);
    │  buffers/retries if central is unreachable
    ▼
Central ingest: insert-if-not-exists on NotificationId → Notifications table (Pending)
    │  ack the site → site S&F clears the message
    ▼
Central Notification Outbox actor (singleton, active central node)
    │  polls due rows; resolves the list; delivers via the matching adapter
    ├── success            → Delivered
    ├── transient failure  → Retrying (schedule NextAttemptAt)
    └── permanent failure
        / retries exhausted → Parked

Notify.Status(notificationId) round-trips site→central and reads the table. Before central has ingested the row, status reads as Pending (in transit).
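
The site-side half of this flow can be sketched as follows. This is a minimal, runnable illustration in Python, independent of the project's implementation language. `SiteOutbox`, `send`, and `status` are hypothetical names, the in-memory list stands in for the site Store-and-Forward buffer, and the dictionary stands in for the central Notifications table.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SiteOutbox:
    """Hypothetical stand-in for the site Store-and-Forward buffer."""
    buffered: list = field(default_factory=list)

    def send(self, list_name: str, subject: str, body: str) -> str:
        # The id is generated locally, so no round trip to central is needed.
        notification_id = str(uuid.uuid4())
        self.buffered.append({
            "NotificationId": notification_id,
            "ListName": list_name,
            "Subject": subject,
            "Body": body,
            "SiteEnqueuedAt": datetime.now(timezone.utc),
        })
        # Returned immediately, before central has seen the message.
        return notification_id


def status(central_rows: dict, notification_id: str) -> str:
    """Read status from central; an id central has not ingested reads as Pending."""
    row = central_rows.get(notification_id)
    return row["Status"] if row is not None else "Pending"
```

Because the script receives the id before central has ingested anything, a status query issued right after `send` falls into the "not yet ingested" branch and reads as Pending.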

Component design

New component #21: Notification Outbox

A central component — the first outbox to live centrally (the Store-and-Forward Engine remains site-only).

Location: Central cluster.
Actor: NotificationOutboxActor — a singleton on the active central node.
Owns: the durable central queue (the Notifications table), the dispatcher loop, retry scheduling, parking, per-notification status tracking, and KPI computation.
SMTP/HTTP delivery is blocking I/O — delivery work runs on a dedicated blocking-I/O dispatcher (same pattern as Script Execution Actors).

Notification Service (revised)

Shrinks to two clear jobs, both central-only:

Manage notification-list and SMTP definitions in the config DB.
Provide delivery adapters — stateless "deliver one notification" implementations per type (see below).

Notification lists and SMTP configuration are no longer deployed to sites. Sites never talk to SMTP.

Store-and-Forward Engine (revised)

Keeps its notification category, but the delivery target changes from SMTP to central. "Delivering" a buffered notification now means handing it to the Communication Layer for the central cluster and clearing it on central's ack. The site→central forward uses a fixed retry interval (host-level config, since it concerns reaching central, not any list).

Typed notification lists

Each notification list gains a Type field plus type-specific targets:

Email — a set of recipient addresses (implemented now).
Teams, others — future types.

Notify.To("list") works transparently for any type — the script does not care. Lists are defined and stored centrally only.

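Recipient resolution happens at central, at delivery time. The site never resolves recipients: it forwards the list name, subject, and body (along with the NotificationId and provenance fields), not addresses. This keeps definitions in one place and removes the deploy-to-sites artifact entirely.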
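
A minimal sketch of delivery-time resolution, assuming list definitions are available to central as a simple name-keyed mapping. `NotificationList`, `list_type`, and `resolve` are illustrative names, not part of this design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationList:
    name: str
    list_type: str   # "Email" today; other types later
    targets: tuple   # type-specific targets, e.g. recipient addresses


def resolve(lists: dict, list_name: str):
    """Look up the list as it is defined at delivery time.

    Returns (type, targets snapshot), or None if the list does not exist.
    The snapshot is what would be recorded in ResolvedTargets.
    """
    definition = lists.get(list_name)
    if definition is None:
        return None
    return definition.list_type, list(definition.targets)
```

One consequence of resolving late: if a list is edited between `Send()` and delivery, the notification goes to the recipients defined at delivery time, which is why the resolved snapshot is worth storing for audit.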

The `Notifications` table (central MS SQL)

Type-agnostic. One row per notification.

Field	Notes
`NotificationId`	GUID, primary key. Generated at the site; used as the idempotency key.
`Type`	`Email` / `Teams` / … discriminator.
`ListName`	Target notification list.
`Subject`, `Body`	Plain-text content.
`TypeData`	JSON — extensibility hook for future per-type fields.
`Status`	One of `Pending`, `Retrying`, `Delivered`, `Parked`, `Discarded` (see Status lifecycle).
`RetryCount`	Delivery attempts so far.
`LastError`	Detail of the most recent failure.
`ResolvedTargets`	Who the notification actually went to — snapshotted by central at delivery time, for audit.
`SourceSiteId`, `SourceInstanceId`, `SourceScript`	Provenance.
`SiteEnqueuedAt`	When the script called `Send()` (carried from the site).
`CreatedAt`	When central ingested the row.
`LastAttemptAt`, `NextAttemptAt`, `DeliveredAt`	Delivery timestamps.

All timestamps are UTC.

Status lifecycle

Pending — ingested, awaiting first dispatch.
Retrying — a transient failure occurred; NextAttemptAt schedules the next attempt.
Delivered — terminal, success.
Parked — terminal-not-delivered: a permanent failure, or retries exhausted. LastError distinguishes which.
Discarded — terminal, reached only by operator action on a parked notification. The row is kept (not deleted) so the table remains a complete audit record.
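
The lifecycle can be written down as an explicit transition table. The sketch below is one reading of this design, not a prescribed implementation: `ALLOWED` and `can_transition` are hypothetical names, and self-transitions (a second transient failure leaving a row in Retrying) are deliberately not modelled as transitions.

```python
from enum import Enum


class Status(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    DISCARDED = "Discarded"


# Allowed moves between distinct states.
ALLOWED = {
    # First dispatch: success, transient failure, or permanent failure.
    Status.PENDING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    # Later attempts: success, or parked once retries are exhausted.
    Status.RETRYING: {Status.DELIVERED, Status.PARKED},
    # Operator actions only: Retry (back to Pending) or Discard.
    Status.PARKED: {Status.PENDING, Status.DISCARDED},
    # Terminal states have no outgoing transitions.
    Status.DELIVERED: set(),
    Status.DISCARDED: set(),
}


def can_transition(current: Status, target: Status) -> bool:
    return target in ALLOWED[current]
```

Keeping the table explicit makes it easy to reject an invalid update, such as discarding a row that is not parked, before it reaches the database.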

Retry policy

Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry interval — consistent with the existing fixed-interval (no backoff) convention.
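
A minimal sketch of the fixed-interval policy. This design does not state whether max-retry-count includes the first attempt; the sketch assumes a notification is parked once more than `max_retry_count` attempts have failed. `on_failure` and its parameters are illustrative names.

```python
from datetime import datetime, timedelta, timezone


def on_failure(attempts_so_far: int, permanent: bool,
               max_retry_count: int, retry_interval: timedelta,
               now: datetime):
    """Return (status, retry_count, next_attempt_at) after a failed attempt."""
    attempts = attempts_so_far + 1
    # Assumption: a permanent failure parks immediately; a transient one
    # parks only when the configured attempt budget is used up.
    if permanent or attempts > max_retry_count:
        return "Parked", attempts, None
    # Fixed interval, no backoff.
    return "Retrying", attempts, now + retry_interval
```

For example, with `max_retry_count=2` and a 60-second interval, the first two transient failures schedule another attempt 60 seconds out, and the third parks the notification.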

Retention

Terminal rows (Delivered, Parked, Discarded) are removed by a daily purge job after a configurable window (default ~1 year). This preserves a strong audit trail while bounding table growth. Non-terminal rows are never purged.
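
The purge rule reduces to a single predicate. In this sketch the window is measured from `CreatedAt` and defaults to 365 days; both are assumptions, since this design fixes neither the reference timestamp nor the exact default. `is_purgeable` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

TERMINAL = {"Delivered", "Parked", "Discarded"}


def is_purgeable(status: str, created_at: datetime, now: datetime,
                 retention: timedelta = timedelta(days=365)) -> bool:
    """True only for terminal rows older than the retention window."""
    return status in TERMINAL and (now - created_at) > retention
```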

Delivery adapters

An INotificationDeliveryAdapter is registered per Type. Each Deliver(...) call returns one of success | transient failure | permanent failure, mirroring the External System Gateway error-classification pattern.

Email adapter — implemented now. The existing SMTP composition/send logic, relocated to the central cluster.
Teams and other adapters — future. The Type discriminator and the adapter interface are the seam; no Teams code is written in this basic plan. Teams auth and targeting (Incoming Webhooks vs Graph API) is a separate design conversation.
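
The adapter seam can be sketched as an interface plus a registry keyed by Type. This is an illustration only: `DeliveryAdapter`, `Outcome`, `RecordingEmailAdapter`, and `dispatch` are hypothetical names, the email adapter records calls instead of speaking SMTP, and treating an unregistered Type as a permanent failure is an assumption.

```python
from enum import Enum
from typing import Protocol


class Outcome(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"


class DeliveryAdapter(Protocol):
    """Stateless 'deliver one notification' contract, one per Type."""
    def deliver(self, targets: list, subject: str, body: str) -> Outcome: ...


class RecordingEmailAdapter:
    """Test double: records what would be sent rather than using SMTP."""
    def __init__(self) -> None:
        self.sent = []

    def deliver(self, targets: list, subject: str, body: str) -> Outcome:
        if not targets:
            # Nothing to send to; retrying would not help.
            return Outcome.PERMANENT_FAILURE
        self.sent.append((tuple(targets), subject, body))
        return Outcome.SUCCESS


def dispatch(adapters: dict, type_name: str, targets: list,
             subject: str, body: str) -> Outcome:
    adapter = adapters.get(type_name)
    if adapter is None:
        # Assumption: no adapter registered for this Type cannot be retried away.
        return Outcome.PERMANENT_FAILURE
    return adapter.deliver(targets, subject, body)
```

Adding a new type then means registering one more entry in the `adapters` mapping; the dispatcher and the table schema stay unchanged.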

Active/standby behavior

The NotificationOutboxActor is a singleton on the active central node. All outbox state lives in MS SQL, which is already the central HA store — so no Akka-level replication is needed (unlike the site S&F engine). On central failover the new active node resumes dispatch directly from the table.

The site→central handoff is at-least-once: central acks only after the row is persisted, and a lost ack causes the site to resend. Because NotificationId (a GUID) is the idempotency key, a resend is harmless (insert-if-not-exists). Delivery itself is also at-least-once: if central fails over after an adapter has sent a notification but before the Delivered status is persisted, the new active node may send it again. This rare duplicate is an accepted trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward Engine already accepts.
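
The persist-then-ack ingest can be sketched with an in-memory SQLite database standing in for the central MS SQL store. `INSERT OR IGNORE` is SQLite syntax used here only to demonstrate insert-if-not-exists; the equivalent MS SQL statement would differ. `open_central_db` and `ingest` are hypothetical names and the table is reduced to a few columns.

```python
import sqlite3


def open_central_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE Notifications (
               NotificationId TEXT PRIMARY KEY,
               ListName       TEXT NOT NULL,
               Subject        TEXT NOT NULL,
               Body           TEXT NOT NULL,
               Status         TEXT NOT NULL)"""
    )
    return db


def ingest(db: sqlite3.Connection, msg: dict) -> bool:
    """Persist the notification, then ack.

    A resend of an id that is already stored is a no-op but is still
    acked, so the site can clear its buffered copy.
    """
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR IGNORE INTO Notifications "
            "(NotificationId, ListName, Subject, Body, Status) "
            "VALUES (?, ?, ?, ?, 'Pending')",
            (msg["NotificationId"], msg["ListName"],
             msg["Subject"], msg["Body"]),
        )
    return True  # the ack is produced only after the commit
```

Note that a duplicate arriving after the row has already progressed (for example to Delivered) must not reset its status, which is why the sketch ignores the duplicate rather than upserting.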

Monitoring

KPIs

Central-computed from the Notifications table — global, with a per-source-site breakdown:

Queue depth — count of Pending + Retrying.
Stuck count — Pending/Retrying rows older than a configurable age threshold (default 10 minutes).
Parked count — count of Parked.
Delivered (last interval) — count of Delivered since the previous sample.
Oldest pending age — age of the oldest non-terminal notification.

Stuck detection

A notification is stuck if it is Pending or Retrying and older than the configurable age threshold. Detection is display-only — a count KPI and a row badge. No automated escalation or alerting, consistent with the current system-wide no-alerting policy.
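
The KPIs and the stuck rule can be computed in one pass over the table. The sketch below works on plain dictionaries and measures age from `CreatedAt`, which is an assumption: this design does not say whether age runs from `CreatedAt` or `SiteEnqueuedAt`. `compute_kpis` and its parameter names are illustrative.

```python
from datetime import datetime, timedelta, timezone

NON_TERMINAL = {"Pending", "Retrying"}


def compute_kpis(rows: list, now: datetime, previous_sample_at: datetime,
                 stuck_after: timedelta = timedelta(minutes=10)) -> dict:
    open_rows = [r for r in rows if r["Status"] in NON_TERMINAL]
    ages = [now - r["CreatedAt"] for r in open_rows]
    return {
        "queue_depth": len(open_rows),
        # Stuck: non-terminal and older than the threshold.
        "stuck_count": sum(1 for age in ages if age > stuck_after),
        "parked_count": sum(1 for r in rows if r["Status"] == "Parked"),
        "delivered_last_interval": sum(
            1 for r in rows
            if r["Status"] == "Delivered"
            and r["DeliveredAt"] > previous_sample_at
        ),
        # None when nothing is waiting.
        "oldest_pending_age": max(ages, default=None),
    }
```

A per-source-site breakdown would apply the same function to the rows grouped by `SourceSiteId`.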

Surfacing

Health Monitoring dashboard — headline KPI tiles: queue depth, stuck count, parked count. These are central-computed (not part of the site health report). The site S&F notification backlog remains a separate site health metric, covering the site→central leg.
New Central UI "Notification Outbox" page — KPI tiles plus a queryable notification list: filter by status, type, source site, list, and time range; a stuck-only toggle; keyword search on subject. Parked notifications offer Retry (→ Pending, reset RetryCount/NextAttemptAt) and Discard (→ Discarded) actions. Stuck rows are badged.

Cross-document impact

Document	Change
`Component-NotificationOutbox.md`	New — component #21.
`Component-NotificationService.md`	Delivery moves central; lists gain a `Type`; no deploy-to-sites; async script API; delivery adapters.
`Component-StoreAndForward.md`	Notification category retargeted from SMTP to central.
`Component-HealthMonitoring.md`	Outbox KPIs added as central-computed headline metrics.
`Component-CentralUI.md`	New Notification Outbox page.
Central–Site Communication	New `NotificationSubmit` + ack message pair.
Configuration Database / Commons	`Notifications` table, entity POCO, repository interface + implementation, EF migration, message contracts.
`README.md`	Component table 20 → 21.
`CLAUDE.md`	Component list 20 → 21; new key design decisions.

Open questions for refinement

Site→central forward retry config — where the fixed forward-retry interval lives (host appsettings vs a deployed setting).
Notify.Status payload — whether status queries also return retry count / last error to scripts, or just the status enum.
Stuck threshold default — 10 minutes is a placeholder.
Pre-ingest status — confirm Pending is the right reading for a notification still in the site S&F buffer (vs a distinct "Forwarding" state).
Site-side diagnostics — whether to keep a lightweight Site Event Logging entry for "notification enqueued / forwarded," now that central holds the authoritative record.
KPI history — KPIs are currently point-in-time; whether any trend/history is wanted.
