Component: Notification Outbox

Purpose

The Notification Outbox is the central component that receives store-and-forwarded notifications from site clusters, logs every one to the Notifications table in the central configuration database, and delivers them through per-type delivery adapters. The Notifications table is the single source of audit truth: every notification — successfully delivered, parked, or discarded — has exactly one durable row. The outbox provides delivery retry, parking of failures, per-notification status tracking, and KPIs for delivery health.

This inverts where notification delivery happens. Sites no longer send notifications directly via SMTP; a site script's notification is store-and-forwarded to central, and the central outbox owns dispatch and delivery.

Location

Central cluster. The NotificationOutboxActor is a singleton on the active central node. It is the first outbox component to live centrally — the Store-and-Forward Engine remains site-only.

Responsibilities

  • Own the durable central queue — the Notifications table in the central MS SQL database.
  • Ingest store-and-forwarded notifications from sites, insert-if-not-exists on NotificationId, and ack the site only after the row is persisted.
  • Run the dispatcher loop: poll due rows, resolve the target notification list, and deliver via the matching adapter.
  • Schedule retries for transient failures and park notifications on permanent failure or exhausted retries.
  • Track per-notification status across the delivery lifecycle.
  • Compute delivery KPIs from the Notifications table for the Health Monitoring dashboard and the Central UI.
  • Purge terminal rows daily after a configurable retention window.

SMTP and HTTP delivery is blocking I/O. Delivery work runs on a dedicated blocking-I/O dispatcher, the same pattern used by Script Execution Actors, so delivery never blocks the actor's dispatcher loop.

End-to-End Flow

Site script: Notify.To("list").Send(subject, body)
    │  generate NotificationId (GUID) locally; return it to the script immediately
    ▼
Site Store-and-Forward Engine  (notification category, target = central)
    │  durably forwards to central via the Communication Layer (ClusterClient);
    │  buffers/retries if central is unreachable
    ▼
Central ingest: insert-if-not-exists on NotificationId → Notifications table (Pending)
    │  ack the site → site S&F clears the message
    ▼
Central Notification Outbox actor (singleton, active central node)
    │  polls due rows; resolves the list; delivers via the matching adapter
    ├── success            → Delivered
    ├── transient failure  → Retrying (schedule NextAttemptAt)
    └── permanent failure
        / retries exhausted → Parked

The site forwards only (listName, subject, body) plus provenance — recipient resolution happens at central, at delivery time. This keeps notification-list definitions in one place and removes the deploy-to-sites artifact entirely.

Notify.Status(notificationId) returns a small status record — status, retry count, last error, and key timestamps (enqueued, delivered). While the notification is still in the site S&F buffer the site answers the query locally (status Forwarding); once forwarded, the query round-trips to central and reads the Notifications table.
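
The routing rule for the status query can be illustrated with a short sketch. It is written in Python for brevity and is not the component's actual API: the `StatusRecord` shape, the `site_buffer` mapping, and the `query_central` callable are hypothetical names standing in for the site S&F buffer and the round-trip to central.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class StatusRecord:  # hypothetical shape of the "small status record"
    status: str
    retry_count: int = 0
    last_error: Optional[str] = None
    enqueued_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None


def notify_status(
    notification_id: str,
    site_buffer: dict,
    query_central: Callable[[str], Optional[StatusRecord]],
) -> Optional[StatusRecord]:
    """Answer locally while the site still holds the notification, else ask central."""
    enqueued_at = site_buffer.get(notification_id)
    if enqueued_at is not None:
        # Still in the site S&F buffer: central has no row yet.
        return StatusRecord(status="Forwarding", enqueued_at=enqueued_at)
    # Already forwarded: central reads the Notifications table.
    return query_central(notification_id)


# Usage with an in-memory buffer and a fake central lookup.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
buffer = {"n-1": t0}
central = {"n-2": StatusRecord("Delivered", enqueued_at=t0, delivered_at=t0)}
local = notify_status("n-1", buffer, central.get)
remote = notify_status("n-2", buffer, central.get)
```

What the query returns for an id that neither side knows is not specified here; the sketch simply returns `None`.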

The Notifications Table

The table is type-agnostic so it can record any notification type the system supports — email today, Microsoft Teams and others later. One row per notification.

| Field | Notes |
| --- | --- |
| NotificationId | GUID, primary key. Generated at the site; used as the idempotency key. |
| Type | Email / Teams / … discriminator. |
| ListName | Target notification list. |
| Subject, Body | Plain-text content. |
| TypeData | JSON; extensibility hook for future per-type fields. |
| Status | Pending / Retrying / Delivered / Parked / Discarded. |
| RetryCount | Delivery attempts so far. |
| LastError | Detail of the most recent failure. |
| ResolvedTargets | Who the notification actually went to; snapshotted by central at delivery time, for audit. |
| SourceSiteId, SourceInstanceId, SourceScript | Provenance. |
| SiteEnqueuedAt | When the script called Send() (carried from the site). |
| CreatedAt | When central ingested the row. |
| LastAttemptAt, NextAttemptAt, DeliveredAt | Delivery timestamps. |

All timestamps are UTC.
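
As a rough illustration of the table shape, the fields above could map to a schema like the following. This is a sketch under stated assumptions: SQLite stands in for the central MS SQL database, and the column types, nullability, and the CHECK constraint are illustrative choices, not the actual EF migration.

```python
import sqlite3

# Assumption: SQLite stands in for MS SQL; types and nullability are illustrative.
NOTIFICATIONS_DDL = """
CREATE TABLE IF NOT EXISTS Notifications (
    NotificationId   TEXT PRIMARY KEY,   -- GUID generated at the site
    Type             TEXT NOT NULL,      -- 'Email', 'Teams', ...
    ListName         TEXT NOT NULL,
    Subject          TEXT NOT NULL,
    Body             TEXT NOT NULL,
    TypeData         TEXT,               -- JSON, per-type extension fields
    Status           TEXT NOT NULL CHECK (Status IN
                     ('Pending', 'Retrying', 'Delivered', 'Parked', 'Discarded')),
    RetryCount       INTEGER NOT NULL DEFAULT 0,
    LastError        TEXT,
    ResolvedTargets  TEXT,               -- snapshot taken at delivery time
    SourceSiteId     TEXT NOT NULL,
    SourceInstanceId TEXT,
    SourceScript     TEXT,
    SiteEnqueuedAt   TEXT NOT NULL,      -- UTC, carried from the site
    CreatedAt        TEXT NOT NULL,      -- UTC, set at central ingest
    LastAttemptAt    TEXT,
    NextAttemptAt    TEXT,
    DeliveredAt      TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(NOTIFICATIONS_DDL)

# Column names in declaration order, read back from the schema.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Notifications)")]
```

The CHECK constraint lists only the five central statuses, which matches the rule that Forwarding is site-local and never stored centrally.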

Status Lifecycle

  • Forwarding — in the site S&F buffer, not yet received by central. Site-local only — never stored in the central Notifications table; reported by Notify.Status while the site still holds the notification.
  • Pending — ingested by central, awaiting first dispatch.
  • Retrying — a transient failure occurred; NextAttemptAt schedules the next attempt.
  • Delivered — terminal, success.
  • Parked — terminal-not-delivered: a permanent failure, or retries exhausted. LastError distinguishes which.
  • Discarded — terminal, reached only by operator action on a parked notification. The row is kept (not deleted) so the table remains a complete audit record.
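
The lifecycle above can be read as a small state machine. The sketch below encodes it as a transition table; the `ALLOWED` map and `can_transition` helper are hypothetical names, and the Parked → Pending edge reflects the operator Retry action described under Surfacing.

```python
from enum import Enum


class Status(Enum):
    FORWARDING = "Forwarding"  # site-local only, never stored at central
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    DISCARDED = "Discarded"


# Allowed transitions, as read from the lifecycle description.
ALLOWED = {
    Status.FORWARDING: {Status.PENDING},  # central ingest
    Status.PENDING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    Status.RETRYING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    Status.PARKED: {Status.PENDING, Status.DISCARDED},  # operator Retry / Discard
    Status.DELIVERED: set(),  # terminal
    Status.DISCARDED: set(),  # terminal
}


def can_transition(current: Status, target: Status) -> bool:
    """Return True when the lifecycle permits moving from current to target."""
    return target in ALLOWED[current]
```

A guard like this could sit in front of every status update so that, for example, a Delivered row can never be moved back into the queue.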

Retry Policy

Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry interval. The interval is fixed (no exponential backoff), consistent with the existing fixed-interval store-and-forward convention.
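
A minimal sketch of the fixed-interval policy follows. The function name and the exact exhaustion boundary are assumptions: the text does not say whether the limit counts attempts or retries, so this sketch parks once the incremented RetryCount reaches the configured maximum.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def after_transient_failure(
    retry_count: int,
    max_retry_count: int,
    retry_interval: timedelta,
    now: datetime,
) -> Tuple[str, int, Optional[datetime]]:
    """Return (status, new RetryCount, NextAttemptAt) after a transient failure."""
    attempts = retry_count + 1
    if attempts >= max_retry_count:  # assumption: boundary is "reaches the maximum"
        return "Parked", attempts, None
    # Fixed interval: the delay does not grow with the attempt number.
    return "Retrying", attempts, now + retry_interval


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
first = after_transient_failure(0, 3, timedelta(minutes=1), now)
last = after_transient_failure(2, 3, timedelta(minutes=1), now)
```

With a maximum of 3 and a one-minute interval, the first failure schedules a retry one minute out, and the third failure parks the notification.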

Retention

Terminal rows (Delivered, Parked, Discarded) are removed by a daily purge job after a configurable window (default ~1 year). This preserves a strong audit trail while bounding table growth. Non-terminal rows are never purged.
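
The purge rule can be sketched as a single delete. Two assumptions are made here: SQLite stands in for MS SQL, and row age is measured from CreatedAt, since the text does not say which timestamp the retention window applies to.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

TERMINAL = ("Delivered", "Parked", "Discarded")


def purge_terminal_rows(conn, now: datetime, retention: timedelta) -> int:
    """Delete terminal rows older than the retention window; return rows removed."""
    cutoff = (now - retention).isoformat()
    cur = conn.execute(
        "DELETE FROM Notifications WHERE Status IN (?, ?, ?) AND CreatedAt < ?",
        (*TERMINAL, cutoff),
    )
    conn.commit()
    return cur.rowcount


# Demo table with only the columns the purge needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Notifications (NotificationId TEXT PRIMARY KEY, Status TEXT, CreatedAt TEXT)"
)
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
old = (now - timedelta(days=730)).isoformat()
recent = (now - timedelta(days=30)).isoformat()
conn.executemany(
    "INSERT INTO Notifications VALUES (?, ?, ?)",
    [("a", "Delivered", old), ("b", "Pending", old), ("c", "Parked", recent)],
)
removed = purge_terminal_rows(conn, now, timedelta(days=365))
remaining = sorted(r[0] for r in conn.execute("SELECT NotificationId FROM Notifications"))
```

The old Pending row survives because non-terminal rows are never purged, and the recent Parked row survives because it is still inside the window. The string comparison relies on all timestamps being stored as UTC in one consistent ISO-8601 format.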

Ingest & Idempotency

The site→central handoff is at-least-once. Central ingests an inbound notification submission with an insert-if-not-exists on NotificationId, then acks the site; the site S&F engine clears the message only on that ack. Because central acks only after the row is persisted (ack-after-persist), a lost ack causes the site to resend, and the GUID NotificationId idempotency key makes the resend harmless — the duplicate insert is a no-op.
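
The ack-after-persist handshake can be sketched as follows. SQLite's `INSERT OR IGNORE` stands in for whatever insert-if-not-exists form the MS SQL repository uses, and the `ingest` function and its return shape are hypothetical.

```python
import sqlite3
import uuid


def ingest(conn, notification_id, list_name, subject, body):
    """Insert-if-not-exists keyed on NotificationId, then build the ack."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO Notifications "
        "(NotificationId, ListName, Subject, Body, Status) "
        "VALUES (?, ?, ?, ?, 'Pending')",
        (notification_id, list_name, subject, body),
    )
    conn.commit()  # ack-after-persist: the row is durable before the ack exists
    return {"ack": notification_id, "inserted": cur.rowcount == 1}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Notifications (NotificationId TEXT PRIMARY KEY, "
    "ListName TEXT, Subject TEXT, Body TEXT, Status TEXT)"
)
nid = str(uuid.uuid4())
first = ingest(conn, nid, "ops", "Pump alarm", "Pressure high")

# Simulate progress at central, then a site resend caused by a lost ack.
conn.execute("UPDATE Notifications SET Status = 'Delivered' WHERE NotificationId = ?", (nid,))
resend = ingest(conn, nid, "ops", "Pump alarm", "Pressure high")
rows = conn.execute("SELECT Status FROM Notifications").fetchall()
```

The resend is acked again so the site can clear its buffer, but it neither creates a second row nor resets the existing row's status.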

A rare central failover mid-delivery could re-send one notification that had already gone out but was not yet recorded as Delivered. This is an accepted trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward Engine already accepts.

Dispatcher

The dispatcher loop runs on a fixed interval. On each tick the NotificationOutboxActor:

  1. Polls the Notifications table for due rows: Pending rows, and Retrying rows whose NextAttemptAt has passed.
  2. Resolves the target notification list to its recipients/targets at central, at delivery time.
  3. Hands the notification to the delivery adapter registered for its Type, running on the dedicated blocking-I/O dispatcher.
  4. Applies the result:
    • success → Delivered, set DeliveredAt, snapshot ResolvedTargets.
    • transient failure → Retrying, increment RetryCount, set NextAttemptAt, record LastError; once retries are exhausted → Parked.
    • permanent failure → Parked, record LastError.
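
The four steps above can be sketched as one synchronous tick. This is an in-memory illustration, not the Akka.NET actor: the `Row` type, the adapter callable shape, `resolve_list`, and the retry boundary (park when RetryCount reaches the maximum) are assumptions, and the dedicated blocking-I/O dispatcher is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Row:  # hypothetical in-memory stand-in for a Notifications row
    notification_id: str
    type: str
    list_name: str
    status: str = "Pending"
    retry_count: int = 0
    next_attempt_at: Optional[datetime] = None
    last_attempt_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None
    last_error: Optional[str] = None
    resolved_targets: Optional[list] = None


def is_due(row: Row, now: datetime) -> bool:
    if row.status == "Pending":
        return True
    return (row.status == "Retrying"
            and row.next_attempt_at is not None
            and row.next_attempt_at <= now)


def tick(rows, adapters, resolve_list, now, max_retry_count, retry_interval):
    for row in [r for r in rows if is_due(r, now)]:        # 1. poll due rows
        targets = resolve_list(row.list_name)              # 2. resolve at delivery time
        outcome, error = adapters[row.type](row, targets)  # 3. adapter for the Type
        row.last_attempt_at = now
        if outcome == "success":                           # 4. apply the result
            row.status, row.delivered_at = "Delivered", now
            row.resolved_targets = targets
        elif outcome == "transient":
            row.retry_count += 1
            row.last_error = error
            if row.retry_count >= max_retry_count:
                row.status, row.next_attempt_at = "Parked", None
            else:
                row.status, row.next_attempt_at = "Retrying", now + retry_interval
        else:  # permanent failure
            row.status, row.last_error = "Parked", error


def fake_email_adapter(row, targets):
    # Outcome keyed on the list name purely to drive the demo.
    if row.list_name == "ops":
        return "success", None
    if row.list_name == "flaky":
        return "transient", "SMTP timeout"
    return "permanent", "rejected"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    Row("1", "Email", "ops"),
    Row("2", "Email", "flaky"),
    Row("3", "Email", "bad"),
    Row("4", "Email", "ops", status="Retrying", next_attempt_at=now + timedelta(hours=1)),
]
tick(rows, {"Email": fake_email_adapter}, lambda name: [f"{name}@example.com"],
     now, max_retry_count=3, retry_interval=timedelta(minutes=1))
```

After the tick, row 1 is Delivered with its targets snapshotted, row 2 is Retrying with a NextAttemptAt one minute out, row 3 is Parked, and row 4 is untouched because its NextAttemptAt has not passed.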

Delivery Adapters

A delivery adapter implementing INotificationDeliveryAdapter is registered per Type. Each Deliver(...) call returns one of success | transient failure | permanent failure, mirroring the External System Gateway error-classification pattern.

  • Email adapter — implemented now. The existing SMTP composition/send logic, relocated to the central cluster.
  • Teams and other adapters — future. The Type discriminator and the adapter interface are the seam; no Teams code exists in this design. Teams auth and targeting (Incoming Webhooks vs Graph API) is a separate design conversation.

Delivery adapters are provided by the Notification Service, which manages notification-list and SMTP definitions and supplies the stateless per-type "deliver one notification" implementations.
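
The adapter seam can be illustrated with a three-valued result and a per-Type registry. The sketch is in Python rather than the .NET interface, and the `deliver` parameters, the stub adapter, and the SMTP-reply-code classification (4xx as transient, 5xx as permanent) are assumptions for illustration; the design does not spell out the signature or the classification rules.

```python
from enum import Enum
from typing import Protocol, Sequence


class DeliveryResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"


class NotificationDeliveryAdapter(Protocol):
    """Stand-in for INotificationDeliveryAdapter; parameters are hypothetical."""

    def deliver(self, subject: str, body: str, targets: Sequence[str]) -> DeliveryResult:
        ...


class StubEmailAdapter:
    """Pretends to send mail and classifies a canned SMTP reply code."""

    def __init__(self, smtp_reply_code: int) -> None:
        self.smtp_reply_code = smtp_reply_code

    def deliver(self, subject: str, body: str, targets: Sequence[str]) -> DeliveryResult:
        code = self.smtp_reply_code
        if 200 <= code < 300:
            return DeliveryResult.SUCCESS
        if 400 <= code < 500:  # assumption: 4xx replies are worth retrying
            return DeliveryResult.TRANSIENT_FAILURE
        return DeliveryResult.PERMANENT_FAILURE


# One adapter per Type; a Teams entry would be added here later.
ADAPTERS = {"Email": StubEmailAdapter(250)}

result = ADAPTERS["Email"].deliver("Pump alarm", "Pressure high", ["ops@example.com"])
```

Because the dispatcher only sees the three-valued result, adding a new notification type means registering another adapter without touching the retry or parking logic.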

Active/Standby Behavior

The NotificationOutboxActor is a singleton on the active central node. All outbox state lives in MS SQL, which is already the central HA store, so no Akka-level replication is needed (unlike the site S&F engine). On central failover the new active node resumes dispatch directly from the Notifications table — Pending rows and due Retrying rows are picked up on the next dispatcher tick.

Monitoring

KPIs

KPIs are central-computed from the Notifications table — global, with a per-source-site breakdown:

  • Queue depth — count of Pending + Retrying.
  • Stuck count: Pending / Retrying rows older than the configurable stuck-age threshold.
  • Parked count — count of Parked.
  • Delivered (last interval) — count of Delivered since the previous sample.
  • Oldest pending age — age of the oldest non-terminal notification.

KPIs are point-in-time, computed on demand from the table. The ~1-year row retention answers historical questions directly, so no separate time-series store is added.

Stuck Detection

A notification is stuck if it is Pending or Retrying and older than a configurable age threshold (default 10 minutes). Detection is display-only — a count KPI and a row badge. There is no automated escalation or alerting, consistent with the system-wide no-alerting policy.
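
The KPI definitions, including the stuck rule, can be sketched as one on-demand computation over the rows. The row dictionaries and the `compute_kpis` signature are hypothetical, and age is measured from CreatedAt as an assumption, since the text does not name the timestamp.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def compute_kpis(rows, now, previous_sample_at, stuck_age=timedelta(minutes=10)):
    """Point-in-time KPIs computed directly from Notifications rows."""
    open_rows = [r for r in rows if r["Status"] in ("Pending", "Retrying")]
    ages = [now - r["CreatedAt"] for r in open_rows]  # assumption: age from CreatedAt
    return {
        "queue_depth": len(open_rows),
        "stuck_count": sum(1 for age in ages if age > stuck_age),
        "parked_count": sum(1 for r in rows if r["Status"] == "Parked"),
        "delivered_last_interval": sum(
            1 for r in rows
            if r["Status"] == "Delivered" and r["DeliveredAt"] > previous_sample_at
        ),
        "oldest_pending_age": max(ages, default=None),
        # Per-source-site breakdown of the queue depth.
        "queue_depth_by_site": dict(Counter(r["SourceSiteId"] for r in open_rows)),
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)


def row(status, site, minutes_old, delivered_at=None):
    return {"Status": status, "SourceSiteId": site,
            "CreatedAt": now - timedelta(minutes=minutes_old),
            "DeliveredAt": delivered_at}


rows = [
    row("Pending", "A", 5),
    row("Retrying", "A", 20),
    row("Pending", "B", 1),
    row("Parked", "B", 120),
    row("Delivered", "A", 3, delivered_at=now - timedelta(seconds=30)),
]
kpis = compute_kpis(rows, now, previous_sample_at=now - timedelta(minutes=1))
```

In this sample only the 20-minute-old Retrying row exceeds the 10-minute default, so it is the single stuck notification and also sets the oldest pending age.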

Surfacing

  • Health Monitoring dashboard — headline KPI tiles: queue depth, stuck count, parked count. These are central-computed and are not part of the site health report. The site S&F notification backlog remains a separate site health metric covering the site→central leg.
  • Central UI "Notification Outbox" page — KPI tiles plus a queryable notification list: filter by status, type, source site, list, and time range; a stuck-only toggle; keyword search on subject. Parked notifications offer Retry (→ Pending, reset RetryCount / NextAttemptAt) and Discard (→ Discarded) actions. Stuck rows are badged.
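
The two operator actions on a parked notification can be sketched as guarded updates. The function names and the dictionary row shape are hypothetical; resetting NextAttemptAt to empty is an assumption about what "reset" means, and LastError is left untouched because the text does not say otherwise.

```python
def retry_parked(row: dict) -> dict:
    """Operator Retry: Parked -> Pending with RetryCount / NextAttemptAt reset."""
    if row["Status"] != "Parked":
        raise ValueError("Retry is only offered on parked notifications")
    return {**row, "Status": "Pending", "RetryCount": 0, "NextAttemptAt": None}


def discard_parked(row: dict) -> dict:
    """Operator Discard: Parked -> Discarded. The row is kept for audit."""
    if row["Status"] != "Parked":
        raise ValueError("Discard is only offered on parked notifications")
    return {**row, "Status": "Discarded"}


parked = {"NotificationId": "n-1", "Status": "Parked", "RetryCount": 3,
          "NextAttemptAt": None, "LastError": "SMTP timeout"}
retried = retry_parked(parked)
discarded = discard_parked(parked)
```

Both helpers return a new row rather than deleting anything, which keeps the table a complete audit record.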

Configuration

The component is configured via NotificationOutboxOptions, bound from an appsettings.json section on the central host (Options pattern):

  • Dispatch interval — how often the dispatcher loop polls for due rows.
  • Stuck-age threshold — age beyond which a non-terminal notification is counted as stuck (default 10 minutes).
  • Terminal-row retention window — age after which terminal rows are removed by the daily purge job (default ~1 year).

Delivery max-retry-count and retry interval are not part of NotificationOutboxOptions — they are reused from the central SMTP configuration.
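
The shape of the options can be sketched as follows. This is a Python stand-in for the .NET Options binding: the JSON key names, the units, and the 5-second dispatch interval are illustrative assumptions, and the dispatch interval is left without a default because none is stated.

```python
import json
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class NotificationOutboxOptions:
    dispatch_interval: timedelta  # no default stated in the design
    stuck_age_threshold: timedelta = timedelta(minutes=10)
    terminal_retention: timedelta = timedelta(days=365)  # "~1 year"


def bind_options(section: dict) -> NotificationOutboxOptions:
    """Bind a config section to options; key names here are hypothetical."""
    kwargs = {"dispatch_interval": timedelta(seconds=section["DispatchIntervalSeconds"])}
    if "StuckAgeThresholdMinutes" in section:
        kwargs["stuck_age_threshold"] = timedelta(minutes=section["StuckAgeThresholdMinutes"])
    if "TerminalRetentionDays" in section:
        kwargs["terminal_retention"] = timedelta(days=section["TerminalRetentionDays"])
    return NotificationOutboxOptions(**kwargs)


# Illustrative section content; only the dispatch interval is supplied.
appsettings = json.loads('{"NotificationOutbox": {"DispatchIntervalSeconds": 5}}')
options = bind_options(appsettings["NotificationOutbox"])
```

Retry count and retry interval are deliberately absent from the sketch, mirroring the rule that they come from the central SMTP configuration.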

Dependencies

  • Notification Service: Provides notification-list and SMTP definitions, and the per-type delivery adapters the outbox invokes.
  • Configuration Database: Hosts the Notifications table; provides the entity POCO, repository, and EF migration for outbox persistence.
  • Central-Site Communication: Carries inbound notification submissions and acks between sites and central.
  • Health Monitoring: Consumes the outbox KPIs as central-computed headline metrics.
  • Central UI: Hosts the Notification Outbox page.

Interactions

  • Site Store-and-Forward Engine: Forwards notifications to central via the Communication Layer; the outbox ingests them and acks once persisted.
  • Notification Service: Supplies delivery adapters and resolves notification lists at delivery time.
  • Central UI: Queries the Notifications table for the Notification Outbox page and issues operator Retry/Discard actions on parked notifications.
  • Health Monitoring: Polls the outbox for KPI tiles on the health dashboard.