scadalink-design/docs/plans/notif.md
Joseph Doherty bbfa0c515e docs(plans): fold refinement decisions into notification outbox design
Resolves the six open questions: host-level forward-retry config,
Notify.Status returns a status record, 10-min stuck threshold, a
site-local Forwarding state, site-side logging of forward failures
only, and point-in-time KPIs computed from the Notifications table.
2026-05-18 22:57:45 -04:00

# Design: Notification Outbox
**Date:** 2026-05-18
**Status:** Basic design — approved, open for refinement.
## Problem
Notification delivery today happens at the site clusters: scripts call `Notify.To().Send()`,
the Notification Service composes an email, and the site sends it via SMTP. The Store-and-Forward
Engine buffers transient failures. Two gaps motivated this design:
1. **No audit trail.** A successful send is recorded nowhere. A permanently-failed send is
returned to the script and then lost. Only a transiently-failed-and-buffered notification
is visible — indirectly, as Store-and-Forward activity.
2. **No monitoring.** There is no view of delivery health: no KPIs, and no way to find
notifications that are stuck or have been parked.
## Solution overview
Invert where delivery happens. Sites no longer send notifications directly. Instead:
- A site script's notification is **store-and-forwarded to the central cluster**.
- Central **logs every notification to a `Notifications` table** in the central config DB
(MS SQL) — the single source of audit truth.
- A central **Notification Outbox** dispatches and delivers from that table, with retry,
parking, per-notification status, and KPIs.
The `Notifications` table is type-agnostic so it can record any notification type the system
supports — email today, Microsoft Teams and others later.
### End-to-end flow
```
Site script: Notify.To("list").Send(subject, body)
  │  generate NotificationId (GUID) locally; return it to the script immediately
Site Store-and-Forward Engine (notification category, target = central)
  │  durably forwards to central via the Communication Layer (ClusterClient);
  │  buffers/retries if central is unreachable
Central ingest: insert-if-not-exists on NotificationId → Notifications table (Pending)
  │  ack the site → site S&F clears the message
Central Notification Outbox actor (singleton, active central node)
  │  polls due rows; resolves the list; delivers via the matching adapter
  ├── success → Delivered
  ├── transient failure → Retrying (schedule NextAttemptAt)
  └── permanent failure / retries exhausted → Parked
```
`Notify.Status(notificationId)` returns a small **status record** — status, retry count,
last error, and key timestamps (enqueued, delivered). While the notification is still in the
site S&F buffer the site answers the query **locally** (status `Forwarding`); once forwarded,
the query round-trips to central and reads the `Notifications` table.
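The local-versus-central lookup described above can be sketched as follows. This is a minimal illustration, not the actual script API: `StatusRecord`, `notify_status`, the `site_buffer` mapping, and the `query_central` callback are all hypothetical names introduced here, and the field set simply mirrors the status record listed in the text.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class StatusRecord:
    """Hypothetical shape of the record returned by Notify.Status."""
    status: str
    retry_count: int = 0
    last_error: Optional[str] = None
    enqueued_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None


def notify_status(
    notification_id: str,
    site_buffer: Dict[str, datetime],
    query_central: Callable[[str], Optional[StatusRecord]],
) -> Optional[StatusRecord]:
    """Resolve a status query at the site.

    site_buffer maps NotificationId -> enqueue time for notifications the
    site S&F engine still holds (an assumed, simplified representation).
    """
    if notification_id in site_buffer:
        # Still buffered at the site: answer locally, no round-trip to central.
        return StatusRecord(status="Forwarding", enqueued_at=site_buffer[notification_id])
    # Already forwarded: central answers from the Notifications table.
    return query_central(notification_id)
```

In this sketch an unknown id falls through to central, which returns `None` if no row exists; how the real API reports an unknown id is not specified in this design.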
## Component design
### New component #21: Notification Outbox
A **central** component — the first outbox to live centrally (the Store-and-Forward Engine
remains site-only).
- **Location:** Central cluster.
- **Actor:** `NotificationOutboxActor` — a **singleton on the active central node**.
- **Owns:** the durable central queue (the `Notifications` table), the dispatcher loop,
retry scheduling, parking, per-notification status tracking, and KPI computation.
- SMTP/HTTP delivery is blocking I/O — delivery work runs on a **dedicated blocking-I/O
dispatcher** (same pattern as Script Execution Actors).
### Notification Service (revised)
Shrinks to two clear jobs, both **central-only**:
- Manage **notification-list and SMTP definitions** in the config DB.
- Provide **delivery adapters** — stateless "deliver one notification" implementations per
type (see below).
Notification lists and SMTP config are **no longer deployed to sites**. Sites never talk to SMTP.
### Store-and-Forward Engine (revised)
Keeps its notification category, but the delivery *target* changes from SMTP to **central**.
"Delivering" a buffered notification now means handing it to the Communication Layer for the
central cluster and clearing it on central's ack. The site→central forward uses a fixed
retry interval configured in the host `appsettings.json` — it concerns reaching the central
cluster rather than any notification list.
## Typed notification lists
Each notification list gains a **`Type`** field plus type-specific targets:
- `Email` — a set of recipient addresses (implemented now).
- `Teams`, others — future types.
`Notify.To("list")` works transparently for any type — the script does not care. Lists are
defined and stored centrally only.
**Recipient resolution happens at central, at delivery time** — the site forwards only
`(listName, subject, body)`. This keeps definitions in one place and removes the deploy-to-sites
artifact entirely.
## The `Notifications` table (central MS SQL)
Type-agnostic. One row per notification.
| Field | Notes |
|---|---|
| `NotificationId` | GUID, primary key. Generated at the **site**; used as the idempotency key. |
| `Type` | `Email` / `Teams` / … discriminator. |
| `ListName` | Target notification list. |
| `Subject`, `Body` | Plain-text content. |
| `TypeData` | JSON — extensibility hook for future per-type fields. |
| `Status` | `Pending` → `Retrying` → `Delivered` / `Parked` / `Discarded`. |
| `RetryCount` | Delivery attempts so far. |
| `LastError` | Detail of the most recent failure. |
| `ResolvedTargets` | Who the notification actually went to — snapshotted by central at delivery time, for audit. |
| `SourceSiteId`, `SourceInstanceId`, `SourceScript` | Provenance. |
| `SiteEnqueuedAt` | When the script called `Send()` (carried from the site). |
| `CreatedAt` | When central ingested the row. |
| `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt` | Delivery timestamps. |
All timestamps are UTC.
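To make the column list concrete, here is a rough in-memory equivalent of one row. It is only a sketch: the real entity is expected to be a POCO with an EF migration, so the Python names, the string-typed `status`, and the choice of which fields default to empty are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Notification:
    # Identity and content
    notification_id: uuid.UUID        # generated at the site; idempotency key
    type: str                         # "Email", "Teams", ...
    list_name: str
    subject: str
    body: str
    # Provenance
    source_site_id: str
    source_instance_id: str
    source_script: str
    site_enqueued_at: datetime        # when the script called Send()
    # Delivery state (defaults describe a freshly ingested row)
    type_data: Optional[str] = None   # JSON text, per-type extension fields
    status: str = "Pending"
    retry_count: int = 0
    last_error: Optional[str] = None
    resolved_targets: Optional[str] = None
    created_at: datetime = field(default_factory=_utc_now)  # central ingest time
    last_attempt_at: Optional[datetime] = None
    next_attempt_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None
```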
### Status lifecycle
- `Forwarding` — in the site S&F buffer, not yet received by central. **Site-local only**
(never stored in the central `Notifications` table); reported by `Notify.Status` while the
site still holds the notification.
- `Pending` — ingested by central, awaiting first dispatch.
- `Retrying` — a transient failure occurred; `NextAttemptAt` schedules the next attempt.
- `Delivered` — terminal, success.
- `Parked` — terminal-not-delivered: a permanent failure, or retries exhausted. `LastError`
distinguishes which.
- `Discarded` — terminal, reached **only by operator action** on a parked notification. The
row is kept (not deleted) so the table remains a complete audit record.
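The lifecycle can be read as a small transition table. The sketch below encodes the central states only (`Forwarding` is site-local and never stored); the `ALLOWED` map and `transition` helper are hypothetical names, and treating a repeated transient failure as `Retrying` → `Retrying` is an assumption about how the dispatcher would record it.

```python
# Allowed central status transitions, keyed by current status.
ALLOWED = {
    "Pending": {"Delivered", "Retrying", "Parked"},
    "Retrying": {"Delivered", "Retrying", "Parked"},
    # A parked row leaves Parked only by operator action: Retry or Discard.
    "Parked": {"Pending", "Discarded"},
    # Terminal states: no outgoing transitions.
    "Delivered": set(),
    "Discarded": set(),
}


def transition(current: str, target: str) -> str:
    """Return the target status, or raise if the move is not allowed."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

For example, `transition("Parked", "Pending")` succeeds, while `transition("Delivered", "Pending")` raises `ValueError`.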
### Retry policy
Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry
interval — consistent with the existing fixed-interval (no backoff) convention.
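A fixed-interval policy reduces to a few lines. In this sketch, `after_transient_failure` is a hypothetical helper, and it assumes the max-retry-count bounds the total number of delivery attempts; the design does not state whether the first attempt counts toward that limit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def after_transient_failure(
    retry_count: int,
    max_retries: int,
    interval: timedelta,
    now: datetime,
) -> Tuple[str, int, Optional[datetime]]:
    """Return (status, retry_count, next_attempt_at) after a failed attempt."""
    retry_count += 1
    if retry_count >= max_retries:
        # Retries exhausted: park the notification, nothing further scheduled.
        return ("Parked", retry_count, None)
    # Fixed interval, no backoff: the delay does not grow with retry_count.
    return ("Retrying", retry_count, now + interval)
```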
### Retention
Terminal rows (`Delivered`, `Parked`, `Discarded`) are removed by a **daily purge job** after
a configurable window (default ~1 year). This preserves a strong audit trail while bounding
table growth. Non-terminal rows are never purged.
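The purge rule is a simple predicate over status and age. The sketch below is illustrative only: `is_purgeable` is a hypothetical name, 365 days stands in for the "~1 year" default, and the design does not say which timestamp the window is measured from, so the caller passes a reference time explicitly.

```python
from datetime import datetime, timedelta, timezone

TERMINAL = {"Delivered", "Parked", "Discarded"}


def is_purgeable(
    status: str,
    reference_time: datetime,
    now: datetime,
    retention: timedelta = timedelta(days=365),
) -> bool:
    """True if a row is terminal and older than the retention window."""
    # Non-terminal rows are never purged, regardless of age.
    return status in TERMINAL and (now - reference_time) > retention
```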
## Delivery adapters
An `INotificationDeliveryAdapter` is registered per `Type`. Each `Deliver(...)` call returns
one of `success | transient failure | permanent failure`, mirroring the External System
Gateway error-classification pattern.
- **Email adapter — implemented now.** The existing SMTP composition/send logic, relocated
to the central cluster.
- **Teams and other adapters — future.** The `Type` discriminator and the adapter interface
are the seam; no Teams code is written in this basic plan. Teams auth and targeting
(Incoming Webhooks vs Graph API) is a separate design conversation.
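The adapter seam might look roughly like this. It is a Python stand-in for `INotificationDeliveryAdapter`, whose real signature is not given in this design: the `deliver` parameters, the `InMemoryEmailAdapter` test double, and the `dispatch` helper are assumptions, as is treating an unregistered `Type` or an empty target set as a permanent failure.

```python
from enum import Enum
from typing import Dict, List, Protocol, Sequence, Tuple


class DeliveryResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"


class NotificationDeliveryAdapter(Protocol):
    """Stateless 'deliver one notification' contract (assumed signature)."""

    def deliver(self, targets: Sequence[str], subject: str, body: str) -> DeliveryResult:
        ...


class InMemoryEmailAdapter:
    """Test double: records messages instead of talking to SMTP."""

    def __init__(self) -> None:
        self.sent: List[Tuple[Tuple[str, ...], str, str]] = []

    def deliver(self, targets: Sequence[str], subject: str, body: str) -> DeliveryResult:
        if not targets:
            # Assumption: an empty recipient set will not succeed on retry.
            return DeliveryResult.PERMANENT_FAILURE
        self.sent.append((tuple(targets), subject, body))
        return DeliveryResult.SUCCESS


def dispatch(
    adapters: Dict[str, NotificationDeliveryAdapter],
    notification_type: str,
    targets: Sequence[str],
    subject: str,
    body: str,
) -> DeliveryResult:
    """Pick the adapter registered for the Type and deliver once."""
    adapter = adapters.get(notification_type)
    if adapter is None:
        # Assumption: no adapter for this Type is treated as permanent.
        return DeliveryResult.PERMANENT_FAILURE
    return adapter.deliver(targets, subject, body)
```

Adding a future type would then mean registering one more entry in the `adapters` mapping, with no change to the dispatch path.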
## Active/standby behavior
The `NotificationOutboxActor` is a singleton on the active central node. All outbox state
lives in MS SQL, which is already the central HA store — so no Akka-level replication is
needed (unlike the site S&F engine). On central failover the new active node resumes
dispatch directly from the table.
The site→central handoff is **at-least-once**: central acks only after the row is persisted,
and a lost ack causes the site to resend. The GUID `NotificationId` idempotency key makes a
resend harmless (insert-if-not-exists). A rare failover mid-delivery could re-send one
notification that had been delivered but not yet marked `Delivered` in the table. This is an
accepted trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward
Engine already accepts.
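The insert-if-not-exists ingest can be demonstrated with an in-memory SQLite database. This is a sketch under clear assumptions: the production store is MS SQL, where the equivalent statement would be written differently; `INSERT OR IGNORE` is SQLite syntax; and `make_db`, `ingest`, and the reduced column set are hypothetical.

```python
import sqlite3
import uuid


def make_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the Notifications table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE Notifications (
               NotificationId TEXT PRIMARY KEY,
               ListName TEXT NOT NULL,
               Subject TEXT NOT NULL,
               Body TEXT NOT NULL,
               Status TEXT NOT NULL
           )"""
    )
    return conn


def ingest(conn: sqlite3.Connection, notification_id: str,
           list_name: str, subject: str, body: str) -> bool:
    """Persist the row if it is new; return True only on first insert.

    The caller acks the site after this returns, whether or not the row
    was new, so a resend after a lost ack is simply acked again.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO Notifications VALUES (?, ?, ?, ?, 'Pending')",
        (notification_id, list_name, subject, body),
    )
    conn.commit()
    return cur.rowcount == 1
```

Calling `ingest` twice with the same `NotificationId` leaves exactly one row, which is what makes the at-least-once handoff safe.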
## Monitoring
### KPIs
Central-computed from the `Notifications` table — global, with a per-source-site breakdown:
- **Queue depth** — count of `Pending` + `Retrying`.
- **Stuck count** — `Pending`/`Retrying` rows older than a configurable age threshold
(default 10 minutes).
- **Parked count** — count of `Parked`.
- **Delivered (last interval)** — count of `Delivered` since the previous sample.
- **Oldest pending age** — age of the oldest non-terminal notification.
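A point-in-time computation of most of these KPIs could look like the sketch below. It works on plain dictionaries rather than the real table, omits the per-source-site breakdown and the "Delivered (last interval)" count, and assumes age is measured from `created_at`; the design does not say whether `CreatedAt` or `SiteEnqueuedAt` is the reference. `compute_kpis` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable


def compute_kpis(
    rows: Iterable[Dict],
    now: datetime,
    stuck_after: timedelta = timedelta(minutes=10),
) -> Dict:
    """Compute global outbox KPIs from rows with 'status' and 'created_at'."""
    rows = list(rows)
    # Non-terminal rows: still waiting for a first or a repeated attempt.
    open_rows = [r for r in rows if r["status"] in ("Pending", "Retrying")]
    ages = [now - r["created_at"] for r in open_rows]
    return {
        "queue_depth": len(open_rows),
        "stuck_count": sum(1 for age in ages if age > stuck_after),
        "parked_count": sum(1 for r in rows if r["status"] == "Parked"),
        # None when nothing is pending.
        "oldest_pending_age": max(ages, default=None),
    }
```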
### Stuck detection
A notification is **stuck** if it is `Pending` or `Retrying` and older than the configurable
age threshold. Detection is **display-only** — a count KPI and a row badge. No automated
escalation or alerting, consistent with the current system-wide no-alerting policy.
### Surfacing
- **Health Monitoring dashboard** — headline KPI tiles: queue depth, stuck count, parked
count. These are central-computed (not part of the site health report). The site S&F
notification backlog remains a separate site health metric, covering the site→central leg.
- **New Central UI "Notification Outbox" page** — KPI tiles plus a queryable notification
list: filter by status, type, source site, list, and time range; a stuck-only toggle;
keyword search on subject. Parked notifications offer **Retry** (→ `Pending`, reset
`RetryCount`/`NextAttemptAt`) and **Discard** (→ `Discarded`) actions. Stuck rows are badged.
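The two operator actions on a parked notification amount to small state updates. In this sketch the helper names are hypothetical, rows are plain dictionaries, and "reset `NextAttemptAt`" is interpreted as clearing it so the row is due immediately, which is an assumption.

```python
from typing import Dict


def _require_parked(row: Dict) -> None:
    if row["status"] != "Parked":
        raise ValueError("action is only offered on parked notifications")


def retry_parked(row: Dict) -> Dict:
    """Operator Retry: back to Pending with retry bookkeeping reset."""
    _require_parked(row)
    return {**row, "status": "Pending", "retry_count": 0, "next_attempt_at": None}


def discard_parked(row: Dict) -> Dict:
    """Operator Discard: the row is kept for audit, only the status changes."""
    _require_parked(row)
    return {**row, "status": "Discarded"}
```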
## Cross-document impact
| Document | Change |
|---|---|
| `Component-NotificationOutbox.md` | **New** — component #21. |
| `Component-NotificationService.md` | Delivery moves central; lists gain a `Type`; no deploy-to-sites; async script API; delivery adapters. |
| `Component-StoreAndForward.md` | Notification category retargeted from SMTP to central. |
| `Component-HealthMonitoring.md` | Outbox KPIs added as central-computed headline metrics. |
| `Component-SiteEventLogging.md` | New Notification event category — logs site→central forward failures and long-buffered notifications. |
| `Component-CentralUI.md` | New Notification Outbox page. |
| Central-Site Communication | New `NotificationSubmit` + ack message pair. |
| Configuration Database / Commons | `Notifications` table, entity POCO, repository interface + implementation, EF migration, message contracts. |
| `README.md` | Component table 20 → 21. |
| `CLAUDE.md` | Component list 20 → 21; new key design decisions. |
## Refinement decisions (2026-05-18)
- **Site→central forward retry config** — the fixed forward-retry interval lives in the host
`appsettings.json` (infrastructure config, not a deployed artifact).
- **`Notify.Status` payload** — returns a status record: status, retry count, last error,
and key timestamps (enqueued, delivered).
- **Stuck threshold default** — 10 minutes, configurable.
- **Pre-ingest status** — a distinct site-local `Forwarding` state; the site answers
`Notify.Status` from its own S&F buffer without a round-trip to central.
- **Site-side diagnostics** — Site Event Logging records site→central **forward failures**
and long-buffered notifications only, not routine enqueue/forward success events.
- **KPI history** — point-in-time only, computed on demand from the `Notifications` table;
the ~1-year row retention answers historical questions directly, so no separate
time-series store is added.
## Open questions
None outstanding — the basic design is fully specified. The next step is an implementation
plan against the cross-document impact table.