scadalink-design/docs/plans/notif.md
Joseph Doherty d4e86c1b1d docs(plans): design for central notification outbox
Captures the basic design for a reliable notification outbox: sites
store-and-forward notifications to the central cluster, which logs
them to a type-agnostic Notifications table (single audit source) and
delivers them via per-type adapters with retry, parking, and KPIs.
2026-05-18 22:54:17 -04:00


Design: Notification Outbox

Date: 2026-05-18 Status: Basic design — approved, open for refinement.

Problem

Notification delivery today happens at the site clusters: scripts call Notify.To().Send(), the Notification Service composes an email, and the site sends it via SMTP. The Store-and-Forward Engine buffers transient failures. Two gaps motivated this design:

  1. No audit trail. A successful send is recorded nowhere. A permanently failed send is reported back to the script and then lost. Only a notification that failed transiently and was buffered is visible, and only indirectly, as Store-and-Forward activity.
  2. No monitoring. There is no view of delivery health: no KPIs, and no way to find notifications that are stuck or have been parked.

Solution overview

Invert where delivery happens. Sites no longer send notifications directly. Instead:

  • A site script's notification is store-and-forwarded to the central cluster.
  • Central logs every notification to a Notifications table in the central config DB (MS SQL) — the single source of audit truth.
  • A central Notification Outbox dispatches and delivers from that table, with retry, parking, per-notification status, and KPIs.

The Notifications table is type-agnostic so it can record any notification type the system supports — email today, Microsoft Teams and others later.

End-to-end flow

Site script: Notify.To("list").Send(subject, body)
    │  generate NotificationId (GUID) locally; return it to the script immediately
    ▼
Site Store-and-Forward Engine  (notification category, target = central)
    │  durably forwards to central via the Communication Layer (ClusterClient);
    │  buffers/retries if central is unreachable
    ▼
Central ingest: insert-if-not-exists on NotificationId → Notifications table (Pending)
    │  ack the site → site S&F clears the message
    ▼
Central Notification Outbox actor (singleton, active central node)
    │  polls due rows; resolves the list; delivers via the matching adapter
    ├── success            → Delivered
    ├── transient failure  → Retrying (schedule NextAttemptAt)
    └── permanent failure
        / retries exhausted → Parked

Notify.Status(notificationId) round-trips site→central and reads the table. Before central has ingested the row, status reads as Pending (in transit).
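
The site-side half of the flow above can be sketched as follows. This is a minimal illustration in Python, not the system's implementation: `SiteOutbox`, `send`, and `on_central_ack` are hypothetical names, and an in-memory dict stands in for the durable Store-and-Forward buffer.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SiteOutbox:
    """Hypothetical stand-in for the site Store-and-Forward buffer."""

    buffered: dict = field(default_factory=dict)

    def send(self, list_name: str, subject: str, body: str) -> str:
        # The id is generated at the site, so the script gets it back
        # immediately, before central has seen the notification.
        notification_id = str(uuid.uuid4())
        self.buffered[notification_id] = {
            "NotificationId": notification_id,
            "ListName": list_name,
            "Subject": subject,
            "Body": body,
            "SiteEnqueuedAt": datetime.now(timezone.utc),
        }
        return notification_id

    def on_central_ack(self, notification_id: str) -> None:
        # Clear the buffered message only once central acknowledges it.
        # A repeated ack for an already-cleared id is ignored.
        self.buffered.pop(notification_id, None)
```

The message stays buffered until the ack arrives, which is what lets the site resend after a lost ack.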

Component design

New component #21: Notification Outbox

A central component — the first outbox to live centrally (the Store-and-Forward Engine remains site-only).

  • Location: Central cluster.
  • Actor: NotificationOutboxActor — a singleton on the active central node.
  • Owns: the durable central queue (the Notifications table), the dispatcher loop, retry scheduling, parking, per-notification status tracking, and KPI computation.
  • SMTP/HTTP delivery is blocking I/O — delivery work runs on a dedicated blocking-I/O dispatcher (same pattern as Script Execution Actors).
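
As an illustration of the dispatcher loop's "due rows" selection, here is a small Python sketch. It assumes rows are plain dicts keyed by the table's column names; the `Id` key is shorthand for this example only, and the oldest-first ordering is an assumption the design does not state.

```python
from datetime import datetime, timedelta, timezone


def is_due(row: dict, now: datetime) -> bool:
    """True when the dispatcher should attempt this row now."""
    if row["Status"] == "Pending":
        return True  # never attempted: due immediately
    # Retrying rows wait until their scheduled NextAttemptAt.
    return row["Status"] == "Retrying" and row["NextAttemptAt"] <= now


def due_rows(rows: list, now: datetime) -> list:
    """Due rows, oldest first (the ordering is an assumption)."""
    return sorted((r for r in rows if is_due(r, now)), key=lambda r: r["CreatedAt"])


# Example poll: "b" is an overdue retry, "c" is scheduled for later.
now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
rows = [
    {"Id": "a", "Status": "Pending", "CreatedAt": now - timedelta(minutes=1), "NextAttemptAt": None},
    {"Id": "b", "Status": "Retrying", "CreatedAt": now - timedelta(minutes=9), "NextAttemptAt": now - timedelta(seconds=1)},
    {"Id": "c", "Status": "Retrying", "CreatedAt": now - timedelta(minutes=5), "NextAttemptAt": now + timedelta(minutes=1)},
    {"Id": "d", "Status": "Delivered", "CreatedAt": now - timedelta(hours=1), "NextAttemptAt": None},
]
due_ids = [r["Id"] for r in due_rows(rows, now)]  # ["b", "a"]
```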

Notification Service (revised)

Shrinks to two clear jobs, both central-only:

  • Manage notification-list and SMTP definitions in the config DB.
  • Provide delivery adapters — stateless "deliver one notification" implementations per type (see below).

Notification lists and SMTP config are no longer deployed to sites. Sites never talk to SMTP.

Store-and-Forward Engine (revised)

Keeps its notification category, but the delivery target changes from SMTP to central. "Delivering" a buffered notification now means handing it to the Communication Layer for the central cluster and clearing it on central's ack. The site→central forward uses a fixed retry interval (host-level config, since it concerns reaching central, not any list).

Typed notification lists

Each notification list gains a Type field plus type-specific targets:

  • Email — a set of recipient addresses (implemented now).
  • Teams, others — future types.

Notify.To("list") works transparently for any type — the script does not care. Lists are defined and stored centrally only.

Recipient resolution happens at central, at delivery time — the site forwards only (listName, subject, body). This keeps definitions in one place and removes the deploy-to-sites artifact entirely.

The Notifications table (central MS SQL)

Type-agnostic. One row per notification.

| Field | Notes |
| --- | --- |
| NotificationId | GUID, primary key. Generated at the site; used as the idempotency key. |
| Type | Email / Teams / … discriminator. |
| ListName | Target notification list. |
| Subject, Body | Plain-text content. |
| TypeData | JSON; extensibility hook for future per-type fields. |
| Status | Pending / Retrying / Delivered / Parked / Discarded. |
| RetryCount | Delivery attempts so far. |
| LastError | Detail of the most recent failure. |
| ResolvedTargets | Who the notification actually went to, snapshotted by central at delivery time for audit. |
| SourceSiteId, SourceInstanceId, SourceScript | Provenance. |
| SiteEnqueuedAt | When the script called Send() (carried from the site). |
| CreatedAt | When central ingested the row. |
| LastAttemptAt, NextAttemptAt, DeliveredAt | Delivery timestamps. |

All timestamps are UTC.
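
For orientation, the row shape can be written out as a typed record. This Python dataclass is only a sketch: the field names follow the table above, but the Python types, the nullability of each field, and the defaults are assumptions, not the planned entity.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
from uuid import UUID


@dataclass
class NotificationRow:
    # Set at ingest.
    notification_id: UUID          # generated at the site; idempotency key
    type: str                      # "Email", later "Teams", ...
    list_name: str
    subject: str
    body: str
    source_site_id: str            # provenance (string types are assumed)
    source_instance_id: str
    source_script: str
    site_enqueued_at: datetime     # UTC, carried from the site
    created_at: datetime           # UTC, when central ingested the row
    type_data: Optional[str] = None  # JSON text for per-type extras
    # Updated by the outbox as delivery progresses.
    status: str = "Pending"
    retry_count: int = 0
    last_error: Optional[str] = None
    resolved_targets: Optional[str] = None  # snapshotted at delivery time
    last_attempt_at: Optional[datetime] = None
    next_attempt_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None
```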

Status lifecycle

  • Pending — ingested, awaiting first dispatch.
  • Retrying — a transient failure occurred; NextAttemptAt schedules the next attempt.
  • Delivered — terminal, success.
  • Parked — terminal-not-delivered: a permanent failure, or retries exhausted. LastError distinguishes which.
  • Discarded — terminal, reached only by operator action on a parked notification. The row is kept (not deleted) so the table remains a complete audit record.
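
The lifecycle can be read as a small transition table. The sketch below is one possible encoding in Python; `ALLOWED` and `can_transition` are hypothetical names. The Parked to Pending edge reflects the operator Retry action on the Notification Outbox page, and allowing Retrying to stay Retrying after another transient failure is an assumption.

```python
from enum import Enum


class Status(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    DISCARDED = "Discarded"


# Allowed next statuses for each current status.
ALLOWED = {
    Status.PENDING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    Status.RETRYING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    # Parked rows move only by operator action: Retry or Discard.
    Status.PARKED: {Status.PENDING, Status.DISCARDED},
    Status.DELIVERED: set(),
    Status.DISCARDED: set(),
}


def can_transition(current: Status, target: Status) -> bool:
    """True when the lifecycle permits moving from current to target."""
    return target in ALLOWED[current]
```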

Retry policy

Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry interval — consistent with the existing fixed-interval (no backoff) convention.
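
A sketch of how one delivery attempt might update a row under this policy. `apply_outcome` is a hypothetical helper, and two readings are assumed rather than taken from the design: RetryCount is incremented on every attempt, and one initial attempt plus max-retry-count retries are allowed before parking.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def apply_outcome(
    retry_count: int,
    outcome: str,  # "success" | "transient" | "permanent"
    now: datetime,
    max_retries: int,
    retry_interval: timedelta,
) -> Tuple[str, int, Optional[datetime]]:
    """Return (Status, RetryCount, NextAttemptAt) after one delivery attempt."""
    attempts = retry_count + 1  # RetryCount counts attempts made so far
    if outcome == "success":
        return "Delivered", attempts, None
    if outcome == "permanent":
        return "Parked", attempts, None
    if outcome != "transient":
        raise ValueError(f"unknown outcome: {outcome!r}")
    # Assumption: one initial attempt plus max_retries retries are allowed.
    if attempts > max_retries:
        return "Parked", attempts, None
    # Fixed interval, no backoff.
    return "Retrying", attempts, now + retry_interval
```

With `max_retries=3`, the first three transient failures schedule another attempt and the fourth parks the notification.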

Retention

Terminal rows (Delivered, Parked, Discarded) are removed by a daily purge job after a configurable window (default ~1 year). This preserves a strong audit trail while bounding table growth. Non-terminal rows are never purged.
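
The purge rule reduces to a simple predicate. In this sketch `is_purgeable` and `terminal_since` are hypothetical: the design does not say which timestamp the retention window is measured from, so the caller supplies it, and 365 days stands in for the approximate one-year default.

```python
from datetime import datetime, timedelta

TERMINAL = {"Delivered", "Parked", "Discarded"}


def is_purgeable(
    status: str,
    terminal_since: datetime,
    now: datetime,
    retention: timedelta = timedelta(days=365),
) -> bool:
    """True when a row is terminal and older than the retention window."""
    # Non-terminal rows (Pending, Retrying) are never purged, whatever their age.
    return status in TERMINAL and now - terminal_since >= retention
```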

Delivery adapters

An INotificationDeliveryAdapter is registered per Type. Each Deliver(...) call returns one of success | transient failure | permanent failure, mirroring the External System Gateway error-classification pattern.

  • Email adapter — implemented now. The existing SMTP composition/send logic, relocated to the central cluster.
  • Teams and other adapters — future. The Type discriminator and the adapter interface are the seam; no Teams code is written in this basic plan. Teams auth and targeting (Incoming Webhooks vs Graph API) is a separate design conversation.
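
The adapter seam might look roughly like this. The sketch is in Python for illustration only; the `deliver` signature, `FakeEmailAdapter`, the `ADAPTERS` registry, and `dispatch` are all hypothetical, and treating an unregistered type as a permanent failure is an assumption.

```python
from enum import Enum
from typing import Dict, List, Protocol


class DeliveryResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"


class NotificationDeliveryAdapter(Protocol):
    def deliver(self, targets: List[str], subject: str, body: str) -> DeliveryResult: ...


class FakeEmailAdapter:
    """Test double; a real adapter would compose and send via SMTP."""

    def __init__(self) -> None:
        self.sent: list = []

    def deliver(self, targets: List[str], subject: str, body: str) -> DeliveryResult:
        if not targets:
            # Nothing to send to: retrying would not help.
            return DeliveryResult.PERMANENT_FAILURE
        self.sent.append((tuple(targets), subject, body))
        return DeliveryResult.SUCCESS


# One adapter registered per Type discriminator.
ADAPTERS: Dict[str, NotificationDeliveryAdapter] = {"Email": FakeEmailAdapter()}


def dispatch(type_name: str, targets: List[str], subject: str, body: str) -> DeliveryResult:
    adapter = ADAPTERS.get(type_name)
    if adapter is None:
        # Assumption: an unregistered type is not retryable.
        return DeliveryResult.PERMANENT_FAILURE
    return adapter.deliver(targets, subject, body)
```

Adding a new type then means registering one more adapter under its Type key; the dispatch path does not change.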

Active/standby behavior

The NotificationOutboxActor is a singleton on the active central node. All outbox state lives in MS SQL, which is already the central HA store — so no Akka-level replication is needed (unlike the site S&F engine). On central failover the new active node resumes dispatch directly from the table.

The site→central handoff is at-least-once: central acks only after the row is persisted, and a lost ack causes the site to resend. The GUID NotificationId idempotency key makes a resend harmless (insert-if-not-exists). A failover mid-delivery (after the adapter has sent but before the row is marked Delivered) could, rarely, re-send a notification that was already delivered. This is an accepted trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward Engine already accepts.
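
The idempotent ingest can be demonstrated with an in-memory SQLite database. This is a sketch only: SQLite's `INSERT OR IGNORE` stands in for whatever insert-if-not-exists form the real MS SQL ingest uses, the table is cut down to a few columns, and `open_db` and `ingest` are hypothetical names.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """In-memory stand-in for the central Notifications table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Notifications ("
        " NotificationId TEXT PRIMARY KEY,"
        " ListName TEXT NOT NULL,"
        " Subject TEXT NOT NULL,"
        " Body TEXT NOT NULL,"
        " Status TEXT NOT NULL)"
    )
    return conn


def ingest(conn: sqlite3.Connection, notification_id: str,
           list_name: str, subject: str, body: str) -> bool:
    """Insert the row if absent; return True once it is safe to ack the site."""
    with conn:  # commits on success, so the ack happens only after persistence
        conn.execute(
            "INSERT OR IGNORE INTO Notifications"
            " (NotificationId, ListName, Subject, Body, Status)"
            " VALUES (?, ?, ?, ?, 'Pending')",
            (notification_id, list_name, subject, body),
        )
    return True
```

A resend of the same NotificationId leaves the existing row untouched, including any status it has since reached, and is still acked so the site can clear its buffer.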

Monitoring

KPIs

Central-computed from the Notifications table — global, with a per-source-site breakdown:

  • Queue depth — count of Pending + Retrying.
  • Stuck count — count of Pending/Retrying rows older than a configurable age threshold (default 10 minutes).
  • Parked count — count of Parked.
  • Delivered (last interval) — count of Delivered since the previous sample.
  • Oldest pending age — age of the oldest non-terminal notification.

Stuck detection

A notification is stuck if it is Pending or Retrying and older than the configurable age threshold. Detection is display-only — a count KPI and a row badge. No automated escalation or alerting, consistent with the current system-wide no-alerting policy.
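
A sketch of the point-in-time KPIs and the stuck rule, computed over rows held as plain dicts. `compute_kpis` is a hypothetical helper; measuring age from CreatedAt (rather than SiteEnqueuedAt) is an assumption. The "Delivered (last interval)" KPI and the per-source-site breakdown are left out for brevity.

```python
from datetime import datetime, timedelta

NON_TERMINAL = {"Pending", "Retrying"}


def compute_kpis(rows: list, now: datetime,
                 stuck_after: timedelta = timedelta(minutes=10)) -> dict:
    """Point-in-time KPIs over an iterable of row dicts."""
    open_rows = [r for r in rows if r["Status"] in NON_TERMINAL]
    # Assumption: age is measured from central ingest (CreatedAt).
    ages = [now - r["CreatedAt"] for r in open_rows]
    return {
        "queue_depth": len(open_rows),
        "stuck_count": sum(1 for age in ages if age > stuck_after),
        "parked_count": sum(1 for r in rows if r["Status"] == "Parked"),
        # None when nothing is waiting.
        "oldest_pending_age": max(ages, default=None),
    }
```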

Surfacing

  • Health Monitoring dashboard — headline KPI tiles: queue depth, stuck count, parked count. These are central-computed (not part of the site health report). The site S&F notification backlog remains a separate site health metric, covering the site→central leg.
  • New Central UI "Notification Outbox" page — KPI tiles plus a queryable notification list: filter by status, type, source site, list, and time range; a stuck-only toggle; keyword search on subject. Parked notifications offer Retry (→ Pending, reset RetryCount/NextAttemptAt) and Discard (→ Discarded) actions. Stuck rows are badged.
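
The two operator actions on a parked notification could be sketched as below. `retry_parked` and `discard_parked` are hypothetical names, and the reset values (RetryCount to 0, NextAttemptAt cleared) are an assumed reading of "reset".

```python
def _require_parked(row: dict) -> None:
    if row["Status"] != "Parked":
        raise ValueError(f"action requires a Parked row, got {row['Status']}")


def retry_parked(row: dict) -> dict:
    """Operator Retry: put a parked notification back in the queue."""
    _require_parked(row)
    # Assumption: "reset" means a zero count and no scheduled attempt.
    row.update(Status="Pending", RetryCount=0, NextAttemptAt=None)
    return row


def discard_parked(row: dict) -> dict:
    """Operator Discard: mark terminal but keep the row for audit."""
    _require_parked(row)
    row["Status"] = "Discarded"
    return row
```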

Cross-document impact

| Document | Change |
| --- | --- |
| Component-NotificationOutbox.md | New (component #21). |
| Component-NotificationService.md | Delivery moves central; lists gain a Type; no deploy-to-sites; async script API; delivery adapters. |
| Component-StoreAndForward.md | Notification category retargeted from SMTP to central. |
| Component-HealthMonitoring.md | Outbox KPIs added as central-computed headline metrics. |
| Component-CentralUI.md | New Notification Outbox page. |
| Central-Site Communication | New NotificationSubmit + ack message pair. |
| Configuration Database / Commons | Notifications table, entity POCO, repository interface + implementation, EF migration, message contracts. |
| README.md | Component table 20 → 21. |
| CLAUDE.md | Component list 20 → 21; new key design decisions. |

Open questions for refinement

  • Site→central forward retry config — where the fixed forward-retry interval lives (host appsettings vs a deployed setting).
  • Notify.Status payload — whether status queries also return retry count / last error to scripts, or just the status enum.
  • Stuck threshold default — 10 minutes is a placeholder.
  • Pre-ingest status — confirm Pending is the right reading for a notification still in the site S&F buffer (vs a distinct "Forwarding" state).
  • Site-side diagnostics — whether to keep a lightweight Site Event Logging entry for "notification enqueued / forwarded," now that central holds the authoritative record.
  • KPI history — KPIs are currently point-in-time; whether any trend/history is wanted.