Component: Notification Outbox
Purpose
The Notification Outbox is the central component that receives store-and-forwarded notifications from site clusters, logs every one to the Notifications table in the central configuration database, and delivers them through per-type delivery adapters. The Notifications table is the single source of audit truth: every notification — successfully delivered, parked, or discarded — has exactly one durable row. The outbox provides delivery retry, parking of failures, per-notification status tracking, and KPIs for delivery health.
This inverts where notification delivery happens. Sites no longer send notifications directly via SMTP; a site script's notification is store-and-forwarded to central, and the central outbox owns dispatch and delivery.
Location
Central cluster. The NotificationOutboxActor is a singleton on the active central node. It is the first outbox component to live centrally — the Store-and-Forward Engine remains site-only.
Responsibilities
- Own the durable central queue: the Notifications table in the central MS SQL database.
- Ingest store-and-forwarded notifications from sites, insert-if-not-exists on NotificationId, and ack the site only after the row is persisted.
- Run the dispatcher loop: poll due rows, resolve the target notification list, and deliver via the matching adapter.
- Schedule retries for transient failures and park notifications on permanent failure or exhausted retries.
- Track per-notification status across the delivery lifecycle.
- Compute delivery KPIs from the Notifications table for the Health Monitoring dashboard and the Central UI.
- Purge terminal rows daily after a configurable retention window.
SMTP and HTTP delivery is blocking I/O. Delivery work runs on a dedicated blocking-I/O dispatcher, the same pattern used by Script Execution Actors, so delivery never blocks the actor's dispatcher loop.
End-to-End Flow
Site script: Notify.To("list").Send(subject, body)
│ generate NotificationId (GUID) locally; return it to the script immediately
▼
Site Store-and-Forward Engine (notification category, target = central)
│ durably forwards to central via Central–Site Communication (ClusterClient);
│ buffers/retries if central is unreachable
▼
Central ingest: insert-if-not-exists on NotificationId → Notifications table (Pending)
│ ack the site → site S&F clears the message
▼
Central Notification Outbox actor (singleton, active central node)
│ polls due rows; resolves the list; delivers via the matching adapter
├── success → Delivered
├── transient failure → Retrying (schedule NextAttemptAt)
└── permanent failure / retries exhausted → Parked
The site forwards only (listName, subject, body) plus provenance — recipient resolution happens at central, at delivery time. This keeps notification-list definitions in one place and removes the deploy-to-sites artifact entirely.
Notify.Status(notificationId) returns a small status record — status, retry count, last error, and key timestamps (enqueued, delivered). While the notification is still in the site S&F buffer the site answers the query locally (status Forwarding); once forwarded, the query round-trips to central and reads the Notifications table.
The Notifications Table
The table is type-agnostic so it can record any notification type the system supports — email today, Microsoft Teams and others later. One row per notification.
| Field | Notes |
|---|---|
| NotificationId | GUID, primary key. Generated at the site; used as the idempotency key. |
| Type | Email / Teams / … discriminator. |
| ListName | Target notification list. |
| Subject, Body | Plain-text content. |
| TypeData | JSON; extensibility hook for future per-type fields. |
| Status | Lifecycle state: one of Pending, Retrying, Delivered, Parked, Discarded. See Status Lifecycle below. |
| RetryCount | Delivery attempts so far. |
| LastError | Detail of the most recent failure. |
| ResolvedTargets | Who the notification actually went to, snapshotted by central at delivery time, for audit. |
| SourceSiteId, SourceInstanceId, SourceScript | Provenance. |
| SiteEnqueuedAt | When the script called Send() (carried from the site). |
| CreatedAt | When central ingested the row. |
| LastAttemptAt, NextAttemptAt, DeliveredAt | Delivery timestamps. |
All timestamps are UTC.
Status Lifecycle
- Forwarding: in the site S&F buffer, not yet received by central. Site-local only; never stored in the central Notifications table. Reported by Notify.Status while the site still holds the notification.
- Pending: ingested by central, awaiting first dispatch.
- Retrying: a transient failure occurred; NextAttemptAt schedules the next attempt.
- Delivered: terminal, success.
- Parked: terminal, not delivered, after a permanent failure or exhausted retries. LastError distinguishes which.
- Discarded: terminal, reached only by operator action on a parked notification. The row is kept (not deleted) so the table remains a complete audit record.
The Notification Outbox and the central Site Call Audit component share the TrackedOperationId tracking model and this status lifecycle, but differ in delivery locality: the Notification Outbox delivers notifications itself (central SMTP), whereas Site Call Audit only audits cached calls delivered site-locally by the site Store-and-Forward Engine — it is not a dispatcher.
Retry Policy
Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry interval. The interval is fixed (no exponential backoff), consistent with the existing fixed-interval store-and-forward convention.
Retention
Terminal rows (Delivered, Parked, Discarded) are removed by a daily purge job after a configurable window (default 365 days). This preserves a strong audit trail while bounding table growth. Non-terminal rows are never purged.
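The purge rule above can be sketched as a single delete over terminal rows. This is a minimal illustration, not the component's implementation: it uses Python with an in-memory SQLite table in place of the central MS SQL database, and it assumes row age is measured from CreatedAt (the source does not say which timestamp the window applies to). The function name `purge_terminal_rows` is hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

TERMINAL_STATUSES = ("Delivered", "Parked", "Discarded")

def purge_terminal_rows(conn, now, retention_days=365):
    """Delete terminal rows older than the retention window; return rows removed."""
    # Assumption: age is measured from CreatedAt, stored as an ISO-8601 UTC string.
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    cur = conn.execute(
        "DELETE FROM Notifications WHERE Status IN (?, ?, ?) AND CreatedAt < ?",
        (*TERMINAL_STATUSES, cutoff),
    )
    conn.commit()
    return cur.rowcount

# Usage: one old terminal row, one old non-terminal row, one recent terminal row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Notifications "
    "(NotificationId TEXT PRIMARY KEY, Status TEXT, CreatedAt TEXT)"
)
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
sample = [
    ("old-delivered", "Delivered", now - timedelta(days=400)),
    ("old-pending", "Pending", now - timedelta(days=400)),
    ("new-parked", "Parked", now - timedelta(days=10)),
]
conn.executemany(
    "INSERT INTO Notifications VALUES (?, ?, ?)",
    [(nid, status, ts.isoformat()) for nid, status, ts in sample],
)
removed = purge_terminal_rows(conn, now)
remaining = sorted(r[0] for r in conn.execute("SELECT NotificationId FROM Notifications"))
```

Only the old Delivered row is removed. The old Pending row survives regardless of age, matching the rule that non-terminal rows are never purged.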
Ingest & Idempotency
The site→central handoff is at-least-once. Central ingests an inbound notification submission with an insert-if-not-exists on NotificationId, then acks the site; the site S&F engine clears the message only on that ack. Because central acks only after the row is persisted (ack-after-persist), a lost ack causes the site to resend, and the GUID NotificationId idempotency key makes the resend harmless — the duplicate insert is a no-op.
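A rough sketch of ack-after-persist with an idempotent insert follows. It is illustrative only: it uses Python with in-memory SQLite, and SQLite's `INSERT OR IGNORE` stands in for whatever insert-if-not-exists construct the MS SQL implementation uses. The `ingest` function, its return shape, and the reduced column set are assumptions made for the example.

```python
import sqlite3
import uuid

def ingest(conn, notification_id, list_name, subject, body):
    """Insert-if-not-exists on NotificationId, commit, then signal the ack."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO Notifications "
        "(NotificationId, ListName, Subject, Body, Status) "
        "VALUES (?, ?, ?, ?, 'Pending')",
        (notification_id, list_name, subject, body),
    )
    conn.commit()  # the row is durable before any ack is produced
    # rowcount is 0 when the key already existed, so the resend was a no-op.
    return {"ack": True, "inserted": cur.rowcount == 1}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Notifications (NotificationId TEXT PRIMARY KEY, "
    "ListName TEXT, Subject TEXT, Body TEXT, Status TEXT)"
)

# The site generates the GUID once and reuses it on every resend.
nid = str(uuid.uuid4())
first = ingest(conn, nid, "ops", "Pump fault", "Pump 3 tripped")
resend = ingest(conn, nid, "ops", "Pump fault", "Pump 3 tripped")  # lost-ack retry
row_count = conn.execute("SELECT COUNT(*) FROM Notifications").fetchone()[0]
```

Both calls ack, but only the first inserts, so a lost ack leaves exactly one row in the table.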
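A rare central failover mid-delivery, after the adapter has sent the notification but before the row is updated to Delivered, could cause the new active node to re-send a notification that was in fact already delivered. This is an accepted trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward Engine already accepts.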
Dispatcher
The dispatcher loop runs on a fixed interval. On each tick the NotificationOutboxActor:
- Polls the Notifications table for due rows: Pending rows, and Retrying rows whose NextAttemptAt has passed.
- Resolves the target notification list to its recipients/targets at central, at delivery time.
- Hands the notification to the delivery adapter registered for its Type, running on the dedicated blocking-I/O dispatcher.
- Applies the result:
  - success → Delivered, set DeliveredAt, snapshot ResolvedTargets.
  - transient failure → Retrying, increment RetryCount, set NextAttemptAt, record LastError; once retries are exhausted → Parked.
  - permanent failure → Parked, record LastError.
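The result-handling step can be sketched as a small state transition. This Python sketch is illustrative rather than the actor's real code: the `Row` and `Result` types, the `apply_result` name, and the default retry settings are hypothetical. It also assumes retries count as exhausted once RetryCount reaches the configured maximum, which the source does not spell out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import List, Optional

class Result(Enum):
    SUCCESS = "success"
    TRANSIENT = "transient failure"
    PERMANENT = "permanent failure"

@dataclass
class Row:
    status: str = "Pending"
    retry_count: int = 0
    last_error: Optional[str] = None
    next_attempt_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None
    resolved_targets: Optional[List[str]] = None

def apply_result(row, result, now, targets, error=None,
                 max_retries=3, retry_interval=timedelta(minutes=1)):
    """Mutate the row according to the adapter's result and return it."""
    if result is Result.SUCCESS:
        row.status = "Delivered"
        row.delivered_at = now
        row.resolved_targets = list(targets)  # audit snapshot
    elif result is Result.TRANSIENT:
        row.retry_count += 1
        row.last_error = error
        if row.retry_count >= max_retries:  # assumed exhaustion rule
            row.status = "Parked"
            row.next_attempt_at = None
        else:
            row.status = "Retrying"
            row.next_attempt_at = now + retry_interval  # fixed interval, no backoff
    else:  # permanent failure
        row.status = "Parked"
        row.last_error = error
    return row

now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
row = Row()
history = []
for _ in range(3):
    apply_result(row, Result.TRANSIENT, now, ["ops@example.com"], error="SMTP timeout")
    history.append((row.status, row.retry_count))

ok = apply_result(Row(), Result.SUCCESS, now, ["ops@example.com"])
```

With a maximum of three retries, the first two transient failures leave the row in Retrying and the third parks it.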
Delivery Adapters
A delivery adapter implementing INotificationDeliveryAdapter is registered per Type. Each Deliver(...) call returns one of success | transient failure | permanent failure, mirroring the External System Gateway error-classification pattern.
- Email adapter: implemented now. The existing SMTP composition/send logic, relocated to the central cluster.
- Teams and other adapters: future. The Type discriminator and the adapter interface are the seam; no Teams code exists in this design. Teams auth and targeting (Incoming Webhooks vs Graph API) is a separate design conversation.
Delivery adapters are provided by the Notification Service, which manages notification-list and SMTP definitions and supplies the stateless per-type "deliver one notification" implementations.
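To illustrate the per-Type seam, here is a language-neutral sketch in Python. The actual INotificationDeliveryAdapter signature is not given in this document, so the `deliver` parameters, the `FakeEmailAdapter` stand-in, the registry dictionary, and the treatment of an unregistered Type as a permanent failure are all assumptions made for the example.

```python
from enum import Enum
from typing import Dict, Protocol, Sequence

class DeliveryResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"

class NotificationDeliveryAdapter(Protocol):
    """Stateless 'deliver one notification' contract (hypothetical shape)."""
    def deliver(self, subject: str, body: str,
                targets: Sequence[str]) -> DeliveryResult: ...

class FakeEmailAdapter:
    """Stand-in for the email adapter; sends nothing."""
    def deliver(self, subject, body, targets):
        # No recipients cannot succeed on retry, so classify it as permanent.
        if not targets:
            return DeliveryResult.PERMANENT_FAILURE
        return DeliveryResult.SUCCESS

# One adapter registered per Type discriminator value.
ADAPTERS: Dict[str, NotificationDeliveryAdapter] = {"Email": FakeEmailAdapter()}

def deliver_by_type(type_, subject, body, targets):
    adapter = ADAPTERS.get(type_)
    if adapter is None:
        # Assumption: an unregistered Type is not retryable.
        return DeliveryResult.PERMANENT_FAILURE
    return adapter.deliver(subject, body, targets)

sent = deliver_by_type("Email", "Pump fault", "Pump 3 tripped", ["ops@example.com"])
empty = deliver_by_type("Email", "Pump fault", "Pump 3 tripped", [])
unknown = deliver_by_type("Teams", "Pump fault", "Pump 3 tripped", ["channel"])
```

Adding a new notification type then means registering one more adapter under its Type value; the dispatcher code is unchanged.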
Active/Standby Behavior
The NotificationOutboxActor is a singleton on the active central node. All outbox state lives in MS SQL, which is already the central HA store, so no Akka-level replication is needed (unlike the site S&F engine). On central failover the new active node resumes dispatch directly from the Notifications table — Pending rows and due Retrying rows are picked up on the next dispatcher tick.
Monitoring
KPIs
KPIs are central-computed from the Notifications table — global, with a per-source-site breakdown:
- Queue depth: count of Pending + Retrying.
- Stuck count: Pending/Retrying rows older than the configurable stuck-age threshold.
- Parked count: count of Parked.
- Delivered (last interval): count of Delivered since the previous sample.
- Oldest pending age: age of the oldest non-terminal notification.
KPIs are point-in-time, computed on demand from the table. The configurable row retention (default 365 days) answers historical questions directly, so no separate time-series store is added.
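As a rough sketch of how these point-in-time KPIs fall out of the table, the Python below computes them over in-memory rows, globally and per source site. It assumes age is measured from CreatedAt and uses hypothetical names (`compute_kpis`, `kpis_by_site`, and the dictionary keys). The Delivered (last interval) KPI is omitted because it depends on a previous sample time.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

NON_TERMINAL = {"Pending", "Retrying"}

def compute_kpis(rows, now, stuck_age=timedelta(minutes=10)):
    """Point-in-time KPIs over a set of notification rows."""
    open_rows = [r for r in rows if r["status"] in NON_TERMINAL]
    ages = [now - r["created_at"] for r in open_rows]  # assumed age basis
    return {
        "queue_depth": len(open_rows),
        "stuck_count": sum(1 for age in ages if age > stuck_age),
        "parked_count": sum(1 for r in rows if r["status"] == "Parked"),
        "oldest_pending_age": max(ages, default=None),
    }

def kpis_by_site(rows, now):
    """Per-source-site breakdown using the same calculation."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r["source_site"]].append(r)
    return {site: compute_kpis(site_rows, now) for site, site_rows in grouped.items()}

now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"status": "Pending", "source_site": "A", "created_at": now - timedelta(minutes=2)},
    {"status": "Retrying", "source_site": "A", "created_at": now - timedelta(minutes=30)},
    {"status": "Parked", "source_site": "B", "created_at": now - timedelta(hours=2)},
    {"status": "Delivered", "source_site": "B", "created_at": now - timedelta(hours=3)},
]
global_kpis = compute_kpis(rows, now)
per_site = kpis_by_site(rows, now)
```

In this sample the queue depth is 2, one row exceeds the 10-minute stuck threshold, and site B contributes only to the parked count.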
Stuck Detection
A notification is stuck if it is Pending or Retrying and older than a configurable age threshold (default 10 minutes). Detection is display-only — a count KPI and a row badge. There is no automated escalation or alerting, consistent with the system-wide no-alerting policy.
Surfacing
- Health Monitoring dashboard — headline KPI tiles: queue depth, stuck count, parked count. These are central-computed and are not part of the site health report. The site S&F notification backlog remains a separate site health metric covering the site→central leg.
- Central UI "Notification Outbox" page — KPI tiles plus a queryable notification list: filter by status, type, source site, list, and time range; a stuck-only toggle; keyword search on subject. Parked notifications offer Retry (→ Pending, reset RetryCount/NextAttemptAt) and Discard (→ Discarded) actions. Stuck rows are badged.
Configuration
The component is configured via NotificationOutboxOptions, bound from an appsettings.json section on the central host (Options pattern):
- Dispatch interval — how often the dispatcher loop polls for due rows.
- Stuck-age threshold — age beyond which a non-terminal notification is counted as stuck (default 10 minutes).
- Terminal-row retention window — age after which terminal rows are removed by the daily purge job (default 365 days).
Delivery max-retry-count and retry interval are not part of NotificationOutboxOptions — they are reused from the central SMTP configuration.
Dependencies
- Notification Service: Provides notification-list and SMTP definitions, and the per-type delivery adapters the outbox invokes.
- Configuration Database: Hosts the Notifications table; provides the entity POCO, repository, and EF migration for outbox persistence.
- Central–Site Communication: Carries inbound notification submissions and acks between sites and central.
- Health Monitoring: Consumes the outbox KPIs as central-computed headline metrics.
- Central UI: Hosts the Notification Outbox page.
Interactions
- Site Store-and-Forward Engine: Forwards notifications to central via Central–Site Communication; the outbox ingests them and acks once persisted.
- Notification Service: Supplies delivery adapters and resolves notification lists at delivery time.
- Central UI: Queries the Notifications table for the Notification Outbox page and issues operator Retry/Discard actions on parked notifications.
- Health Monitoring: Polls the outbox for KPI tiles on the health dashboard.