Component: Notification Outbox
Purpose
The Notification Outbox is the central component that receives store-and-forwarded notifications from site clusters, logs every one to the Notifications table in the central configuration database, and delivers them through per-type delivery adapters. The Notifications table is the single source of audit truth: every notification — successfully delivered, parked, or discarded — has exactly one durable row. The outbox provides delivery retry, parking of failures, per-notification status tracking, and KPIs for delivery health.
This inverts where notification delivery happens. Sites no longer send notifications directly via SMTP; a site script's notification is store-and-forwarded to central, and the central outbox owns dispatch and delivery.
Location
Central cluster. The NotificationOutboxActor is a singleton on the active central node. It is the first outbox component to live centrally — the Store-and-Forward Engine remains site-only.
Responsibilities
- Own the durable central queue: the Notifications table in the central MS SQL database.
- Ingest store-and-forwarded notifications from sites, insert-if-not-exists on NotificationId, and ack the site only after the row is persisted.
- Run the dispatcher loop: poll due rows, resolve the target notification list, and deliver via the matching adapter.
- Schedule retries for transient failures and park notifications on permanent failure or exhausted retries.
- Track per-notification status across the delivery lifecycle.
- Compute delivery KPIs from the Notifications table for the Health Monitoring dashboard and the Central UI.
- Purge terminal rows daily after a configurable retention window.
SMTP and HTTP delivery is blocking I/O. Delivery work runs on a dedicated blocking-I/O dispatcher, the same pattern used by Script Execution Actors, so delivery never blocks the actor's dispatcher loop.
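As a language-neutral illustration of the blocking-I/O pattern described above, the following Python sketch hands a blocking send to a dedicated thread pool so the calling loop returns immediately. It is not the Akka-based implementation; `delivery_pool`, `send_blocking`, and `dispatch` are hypothetical names, and the `gate` event only simulates slow network I/O.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical dedicated pool standing in for the blocking-I/O dispatcher.
delivery_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="delivery-io")


def send_blocking(notification_id: str, gate: threading.Event) -> str:
    """Stand-in for a blocking SMTP/HTTP send; `gate` simulates slow I/O."""
    gate.wait(timeout=5)
    return f"sent:{notification_id}"


def dispatch(notification_id: str, gate: threading.Event, on_done) -> Future:
    """Hand the send to the delivery pool and return at once, without waiting."""
    future = delivery_pool.submit(send_blocking, notification_id, gate)
    # The callback fires once the send has finished (normally on the worker thread).
    future.add_done_callback(lambda f: on_done(f.result()))
    return future
```

The caller keeps polling while the send is in flight; the result is applied later through the callback.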
End-to-End Flow
%%{init: {'theme':'base', 'themeVariables': {'textColor':'#111111','lineColor':'#555555','edgeLabelBackground':'#ffffff','fontSize':'15px'}}}%%
flowchart TD
SCRIPT(["Site script: Notify.To('list').Send(subject, body)<br/>generate NotificationId (GUID) locally;<br/>return it to the script immediately"])
SNF["Site Store-and-Forward Engine<br/>(notification category, target = central)<br/>durably forwards to central via Central-Site Communication<br/>(ClusterClient); buffers/retries if central is unreachable"]
INGEST[("Central ingest: insert-if-not-exists on NotificationId<br/>to Notifications table (Pending)<br/>ack the site, site S and F clears the message")]
OUTBOX["Central Notification Outbox actor<br/>(singleton, active central node)<br/>polls due rows; resolves the list;<br/>delivers via the matching adapter"]
D1{Delivery outcome}
DELIVERED(["Delivered"])
RETRYING["Retrying<br/>(schedule NextAttemptAt)"]
PARKED(["Parked"])
SCRIPT --> SNF
SNF --> INGEST
INGEST --> OUTBOX
OUTBOX --> D1
D1 -->|success| DELIVERED
D1 -->|transient failure| RETRYING
D1 -->|"permanent failure /<br/>retries exhausted"| PARKED
RETRYING -.->|retry due| OUTBOX
classDef start fill:#d5e8d4,stroke:#82b366,color:#111111;
classDef proc fill:#dae8fc,stroke:#6c8ebf,color:#111111;
classDef dec fill:#fff2cc,stroke:#d6b656,color:#111111;
classDef warn fill:#ffe6cc,stroke:#d79b00,color:#111111;
classDef bad fill:#f8cecc,stroke:#b85450,color:#111111;
classDef alt fill:#e1d5e7,stroke:#9673a6,color:#111111;
class SCRIPT,DELIVERED start
class SNF warn
class INGEST proc
class OUTBOX alt
class D1,RETRYING dec
class PARKED bad
The site forwards only (listName, subject, body) plus provenance — recipient resolution happens at central, at delivery time. This keeps notification-list definitions in one place and removes the deploy-to-sites artifact entirely.
Notify.Status(notificationId) returns a small status record — status, retry count, last error, and key timestamps (enqueued, delivered). While the notification is still in the site S&F buffer the site answers the query locally (status Forwarding); once forwarded, the query round-trips to central and reads the Notifications table.
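The status routing described above can be sketched as follows. This is a minimal Python illustration, not the real script API: `StatusRecord`, `notify_status`, the dictionary used as the site buffer, and the `query_central` callable are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class StatusRecord:
    status: str
    retry_count: int = 0
    last_error: Optional[str] = None
    enqueued_at: Optional[str] = None
    delivered_at: Optional[str] = None


def notify_status(
    notification_id: str,
    site_buffer: Dict[str, str],  # hypothetical: id -> enqueue timestamp
    query_central: Callable[[str], StatusRecord],
) -> StatusRecord:
    # Still buffered at the site: answer locally, with no round trip to central.
    if notification_id in site_buffer:
        return StatusRecord(status="Forwarding", enqueued_at=site_buffer[notification_id])
    # Already forwarded: central reads the Notifications table.
    return query_central(notification_id)
```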
The Notifications Table
The table is type-agnostic so it can record any notification type the system supports — email today, Microsoft Teams and others later. One row per notification.
| Field | Notes |
|---|---|
| NotificationId | GUID, primary key. Generated at the site; used as the idempotency key. |
| Type | Email / Teams / … discriminator. |
| ListName | Target notification list. |
| Subject, Body | Plain-text content. |
| TypeData | JSON; extensibility hook for future per-type fields. |
| Status | Lifecycle state: one of Pending, Retrying, Delivered, Parked, Discarded. See Status Lifecycle below. |
| RetryCount | Delivery attempts so far. |
| LastError | Detail of the most recent failure. |
| ResolvedTargets | Who the notification actually went to, snapshotted by central at delivery time for audit. |
| SourceSiteId, SourceInstanceId, SourceScript | Provenance. |
| SourceNode | The cluster node on which the notification was enqueued: node-a / node-b for site-originated rows (qualified by SourceSiteId). Nullable. Carried verbatim from the site through the S&F handoff. |
| SiteEnqueuedAt | When the script called Send() (carried from the site). |
| CreatedAt | When central ingested the row. |
| LastAttemptAt, NextAttemptAt, DeliveredAt | Delivery timestamps. |
All timestamps are UTC.
Status Lifecycle
- Forwarding: in the site S&F buffer, not yet received by central. Site-local only; never stored in the central Notifications table. Reported by Notify.Status while the site still holds the notification.
- Pending: ingested by central, awaiting first dispatch.
- Retrying: a transient failure occurred; NextAttemptAt schedules the next attempt.
- Delivered: terminal, success.
- Parked: terminal but not delivered, caused by a permanent failure or exhausted retries. LastError distinguishes which.
- Discarded: terminal, reached only by operator action on a parked notification. The row is kept (not deleted) so the table remains a complete audit record.
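One way to make the lifecycle concrete is a transition table. The Python sketch below is illustrative only: the `Status` enum and `ALLOWED` map are hypothetical names, the Parked → Pending edge reflects the operator Retry action described under Surfacing, and the Retrying → Retrying self-edge (a further transient failure) is a modelling assumption.

```python
from enum import Enum


class Status(Enum):
    FORWARDING = "Forwarding"  # site-local only, never stored at central
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    DISCARDED = "Discarded"


# Which statuses each status may move to.
ALLOWED = {
    Status.FORWARDING: {Status.PENDING},                 # central ingest
    Status.PENDING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    Status.RETRYING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    Status.PARKED: {Status.PENDING, Status.DISCARDED},   # operator Retry / Discard
    Status.DELIVERED: set(),
    Status.DISCARDED: set(),
}


def can_transition(current: Status, new: Status) -> bool:
    """True if the lifecycle permits moving from `current` to `new`."""
    return new in ALLOWED[current]
```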
The Notification Outbox and the central Site Call Audit component share the TrackedOperationId tracking model and this status lifecycle, but differ in delivery locality: the Notification Outbox delivers notifications itself (central SMTP), whereas Site Call Audit only audits cached calls delivered site-locally by the site Store-and-Forward Engine — it is not a dispatcher.
Retry Policy
Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry interval. The interval is fixed (no exponential backoff), consistent with the existing fixed-interval store-and-forward convention.
Retention
Terminal rows (Delivered, Parked, Discarded) are removed by a daily purge job after a configurable window (default 365 days). This preserves a strong audit trail while bounding table growth. Non-terminal rows are never purged.
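The purge rule can be sketched as a simple predicate over rows. This Python example is an illustration under one loud assumption: the source does not say which timestamp the retention window is measured from, so the sketch uses CreatedAt. The `purge` function and the dictionary row shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TERMINAL = {"Delivered", "Parked", "Discarded"}


def purge(rows, now, retention=timedelta(days=365)):
    """Return the rows that survive one purge pass.

    Assumption: row age is measured from CreatedAt.
    Non-terminal rows are always kept, whatever their age.
    """
    cutoff = now - retention
    return [
        r for r in rows
        if not (r["Status"] in TERMINAL and r["CreatedAt"] < cutoff)
    ]
```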
Ingest & Idempotency
The site→central handoff is at-least-once. Central ingests an inbound notification submission with an insert-if-not-exists on NotificationId, then acks the site; the site S&F engine clears the message only on that ack. Because central acks only after the row is persisted (ack-after-persist), a lost ack causes the site to resend, and the GUID NotificationId idempotency key makes the resend harmless — the duplicate insert is a no-op.
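The ack-after-persist handshake can be illustrated with a small Python sketch. It uses an in-memory SQLite database as a stand-in for the central MS SQL store, and SQLite's `INSERT OR IGNORE` as the insert-if-not-exists; the table columns are trimmed and the `open_store` and `ingest` helpers are hypothetical.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """In-memory stand-in for the central database (trimmed column set)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Notifications ("
        " NotificationId TEXT PRIMARY KEY,"
        " ListName TEXT NOT NULL,"
        " Subject TEXT NOT NULL,"
        " Body TEXT NOT NULL,"
        " Status TEXT NOT NULL)"
    )
    return conn


def ingest(conn, notification_id, list_name, subject, body) -> bool:
    """Persist the submission if it is new; a normal return means 'ack the site'.

    Returns True when a row was inserted, False when the id already existed.
    """
    with conn:  # commits on success, so the ack always follows the persisted row
        cur = conn.execute(
            "INSERT OR IGNORE INTO Notifications"
            " (NotificationId, ListName, Subject, Body, Status)"
            " VALUES (?, ?, ?, ?, 'Pending')",
            (notification_id, list_name, subject, body),
        )
    return cur.rowcount == 1
```

A resend after a lost ack calls `ingest` again with the same NotificationId; the second call inserts nothing and still returns normally, so the site receives its ack and clears the message.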
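A rare central failover mid-delivery can re-send a notification that was already delivered, for example when the active node fails after the adapter has sent the message but before the row is marked Delivered. This is an accepted trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward Engine already accepts.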
Dispatcher
The dispatcher loop runs on a fixed interval. On each tick the NotificationOutboxActor:
- Polls the Notifications table for due rows: Pending rows, and Retrying rows whose NextAttemptAt has passed.
- Resolves the target notification list to its recipients/targets at central, at delivery time.
- Hands the notification to the delivery adapter registered for its Type, running on the dedicated blocking-I/O dispatcher.
- Applies the result:
  - success → Delivered, set DeliveredAt, snapshot ResolvedTargets.
  - transient failure → Retrying, increment RetryCount, set NextAttemptAt, record LastError; once retries are exhausted → Parked.
  - permanent failure → Parked, record LastError.
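The tick described above can be sketched in Python over in-memory rows. This is an illustration, not the actor implementation: rows are plain dictionaries, adapters are callables returning an `(outcome, error)` pair, and `tick`, `is_due`, and the outcome constants are hypothetical. The exhaustion boundary (park once RetryCount reaches the maximum) is an assumption about how max-retry-count is interpreted.

```python
from datetime import datetime, timedelta, timezone

SUCCESS, TRANSIENT, PERMANENT = "success", "transient", "permanent"


def is_due(row, now):
    """Pending rows are always due; Retrying rows once NextAttemptAt has passed."""
    if row["Status"] == "Pending":
        return True
    return row["Status"] == "Retrying" and row["NextAttemptAt"] <= now


def tick(rows, resolve_list, adapters, now, max_retries, retry_interval):
    """Run one dispatcher pass over `rows`, mutating them in place."""
    for row in rows:
        if not is_due(row, now):
            continue
        targets = resolve_list(row["ListName"])  # resolved at delivery time
        outcome, error = adapters[row["Type"]](row, targets)
        row["LastAttemptAt"] = now
        if outcome == SUCCESS:
            row.update(Status="Delivered", DeliveredAt=now, ResolvedTargets=targets)
        elif outcome == TRANSIENT:
            row["RetryCount"] += 1
            row["LastError"] = error
            if row["RetryCount"] >= max_retries:  # assumed exhaustion rule
                row["Status"] = "Parked"
            else:
                # Fixed interval, no exponential backoff.
                row.update(Status="Retrying", NextAttemptAt=now + retry_interval)
        else:
            row.update(Status="Parked", LastError=error)
```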
Each delivery attempt also writes a Notification.Attempt row to the central AuditLog via ICentralAuditWriter; a transition to a terminal status (Delivered / Parked / Discarded) writes a Notification.Terminal row. Audit writes are direct (no telemetry — the dispatcher runs at central), insert-if-not-exists on EventId. The site-emitted Notification.Enqueued row arrives separately via the standard audit telemetry channel from the site's SQLite write-buffer, so the full per-notification audit trail is Enqueued (site-originated) → Attempt × N (central direct-write) → Terminal (central direct-write). See Component-AuditLog.md, Central direct-write (central-originated events).
The operational Notifications table remains the source of truth for the dispatcher and for Retry/Discard actions; the AuditLog rows are immutable shadows. Operator Retry/Discard still mutates only the Notifications row, and each transition emits the corresponding Notification.Attempt / Notification.Terminal audit row.
Audit-write failure never affects delivery. If the ICentralAuditWriter direct-write fails (transient DB error, schema lock, etc.) the dispatcher logs the failure and increments the CentralAuditWriteFailures health metric (see Health Monitoring #11), but the delivery attempt's outcome on the Notifications row stands. The audit row is recovered by re-emission on the next dispatcher tick or by the on-startup reconciliation sweep; central never aborts a notification because audit failed.
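The isolation between delivery outcome and audit write can be sketched as follows. This Python example is illustrative: `apply_outcome`, the `write_audit` callable, and the dictionary-based metrics are hypothetical stand-ins for the dispatcher, ICentralAuditWriter, and the health metric, and the recovery paths (re-emission, reconciliation sweep) are not modelled.

```python
import logging

log = logging.getLogger("notification-outbox")


def apply_outcome(row, new_status, write_audit, metrics):
    """Update the Notifications row first, then attempt the audit write.

    Returns False when the audit write failed and the row still needs
    re-emission; the delivery outcome stands either way.
    """
    row["Status"] = new_status
    try:
        write_audit(row["NotificationId"], new_status)
    except Exception as exc:  # audit failure must never undo the outcome
        metrics["CentralAuditWriteFailures"] = (
            metrics.get("CentralAuditWriteFailures", 0) + 1
        )
        log.warning("audit write failed for %s: %s", row["NotificationId"], exc)
        return False
    return True
```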
Delivery Adapters
A delivery adapter implementing INotificationDeliveryAdapter is registered per Type. Each Deliver(...) call returns one of success | transient failure | permanent failure, mirroring the External System Gateway error-classification pattern.
- Email adapter — implemented now. The existing SMTP composition/send logic, relocated to the central cluster.
- Teams and other adapters — future. The Type discriminator and the adapter interface are the seam; no Teams code exists in this design. Teams auth and targeting (Incoming Webhooks vs Graph API) is a separate design conversation.
Delivery adapters are provided by the Notification Service, which manages notification-list and SMTP definitions and supplies the stateless per-type "deliver one notification" implementations.
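The adapter seam can be illustrated with a Python sketch of a per-Type registry. This mirrors the shape of INotificationDeliveryAdapter only loosely: the method signature, `FakeEmailAdapter`, the `ADAPTERS` registry, and the choice to treat an unregistered Type or an empty target list as a permanent failure are all assumptions made for the example.

```python
from enum import Enum
from typing import Protocol, Sequence


class DeliveryResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"


class NotificationDeliveryAdapter(Protocol):
    def deliver(self, subject: str, body: str,
                targets: Sequence[str]) -> DeliveryResult: ...


class FakeEmailAdapter:
    """Test double; a real email adapter would wrap the SMTP send."""

    def __init__(self):
        self.sent = []

    def deliver(self, subject, body, targets):
        if not targets:  # assumed: nothing to send to is not retryable
            return DeliveryResult.PERMANENT_FAILURE
        self.sent.append((subject, list(targets)))
        return DeliveryResult.SUCCESS


# One adapter per Type discriminator; future types register here.
ADAPTERS = {"Email": FakeEmailAdapter()}


def deliver(notification_type, subject, body, targets) -> DeliveryResult:
    adapter = ADAPTERS.get(notification_type)
    if adapter is None:  # assumed handling of an unregistered Type
        return DeliveryResult.PERMANENT_FAILURE
    return adapter.deliver(subject, body, targets)
```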
Active/Standby Behavior
The NotificationOutboxActor is a singleton on the active central node. All outbox state lives in MS SQL, which is already the central HA store, so no Akka-level replication is needed (unlike the site S&F engine). On central failover the new active node resumes dispatch directly from the Notifications table — Pending rows and due Retrying rows are picked up on the next dispatcher tick.
Monitoring
KPIs
KPIs are central-computed from the Notifications table — global, with a per-source-site breakdown:
- Queue depth: count of Pending + Retrying.
- Stuck count: Pending/Retrying rows older than the configurable stuck-age threshold.
- Parked count: count of Parked.
- Delivered (last interval): count of Delivered since the previous sample.
- Oldest pending age: age of the oldest non-terminal notification.
KPIs are point-in-time, computed on demand from the table. The configurable row retention (default 365 days) answers historical questions directly, so no separate time-series store is added.
Stuck Detection
A notification is stuck if it is Pending or Retrying and older than a configurable age threshold (default 10 minutes). Detection is display-only — a count KPI and a row badge. There is no automated escalation or alerting, consistent with the system-wide no-alerting policy.
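The KPI definitions and the stuck rule can be sketched together as a single pass over the rows. This Python example is illustrative: `compute_kpis` and the dictionary row shape are hypothetical, row age is assumed to be measured from CreatedAt (the source does not name the timestamp), and the per-source-site breakdown is omitted for brevity.

```python
from datetime import datetime, timedelta, timezone

NON_TERMINAL = {"Pending", "Retrying"}


def compute_kpis(rows, now, last_sample_at, stuck_age=timedelta(minutes=10)):
    """Point-in-time KPIs computed on demand from the rows."""
    open_rows = [r for r in rows if r["Status"] in NON_TERMINAL]
    ages = [now - r["CreatedAt"] for r in open_rows]  # assumed age basis
    return {
        "queue_depth": len(open_rows),
        "stuck_count": sum(1 for age in ages if age > stuck_age),
        "parked_count": sum(1 for r in rows if r["Status"] == "Parked"),
        "delivered_last_interval": sum(
            1 for r in rows
            if r["Status"] == "Delivered" and r["DeliveredAt"] > last_sample_at
        ),
        "oldest_pending_age": max(ages, default=timedelta(0)),
    }
```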
Surfacing
- Health Monitoring dashboard: headline KPI tiles for queue depth, stuck count, and parked count. These are central-computed and are not part of the site health report. The site S&F notification backlog remains a separate site health metric covering the site→central leg.
- Central UI "Notification Outbox" page: KPI tiles plus a queryable notification list, with filters for status, type, source site, list, and time range; a stuck-only toggle; and keyword search on subject. Parked notifications offer Retry (→ Pending, reset RetryCount/NextAttemptAt) and Discard (→ Discarded) actions. Stuck rows are badged.
Configuration
The component is configured via NotificationOutboxOptions, bound from an appsettings.json section on the central host (Options pattern):
- Dispatch interval — how often the dispatcher loop polls for due rows.
- Stuck-age threshold — age beyond which a non-terminal notification is counted as stuck (default 10 minutes).
- Terminal-row retention window — age after which terminal rows are removed by the daily purge job (default 365 days).
Delivery max-retry-count and retry interval are not part of NotificationOutboxOptions — they are reused from the central SMTP configuration.
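The shape of the options can be sketched as a small validated record. This Python dataclass is only an analogue of the .NET options class: the field names are hypothetical, and the dispatch-interval default is invented for the example because the source states none. The other two defaults are the documented ones.

```python
from dataclasses import dataclass, fields
from datetime import timedelta


@dataclass(frozen=True)
class NotificationOutboxOptions:
    # Hypothetical default: the source gives no default dispatch interval.
    dispatch_interval: timedelta = timedelta(seconds=10)
    stuck_age_threshold: timedelta = timedelta(minutes=10)  # documented default
    terminal_retention: timedelta = timedelta(days=365)     # documented default

    def __post_init__(self):
        # Reject zero or negative durations at bind time.
        for f in fields(self):
            if getattr(self, f.name) <= timedelta(0):
                raise ValueError(f"{f.name} must be positive")
```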
Dependencies
- Notification Service: Provides notification-list and SMTP definitions, and the per-type delivery adapters the outbox invokes.
- Configuration Database: Hosts the Notifications table; provides the entity POCO, repository, and EF migration for outbox persistence.
- Central–Site Communication: Carries inbound notification submissions and acks between sites and central.
- Audit Log (#23): The dispatcher direct-writes Notification.Attempt and Notification.Terminal rows to the central AuditLog via ICentralAuditWriter (insert-if-not-exists on EventId); the site-emitted Notification.Enqueued row arrives via the standard audit telemetry channel. See Component-AuditLog.md, Central direct-write (central-originated events).
- Health Monitoring: Consumes the outbox KPIs as central-computed headline metrics.
- Central UI: Hosts the Notification Outbox page.
Interactions
- Site Store-and-Forward Engine: Forwards notifications to central via Central–Site Communication; the outbox ingests them and acks once persisted.
- Notification Service: Supplies delivery adapters and resolves notification lists at delivery time.
- Central UI: Queries the Notifications table for the Notification Outbox page and issues operator Retry/Discard actions on parked notifications.
- Health Monitoring: Polls the outbox for KPI tiles on the health dashboard.