# Component: Notification Outbox
## Purpose
The Notification Outbox is the central component that receives store-and-forwarded notifications from site clusters, logs every one to the `Notifications` table in the central configuration database, and delivers them through per-type delivery adapters. The `Notifications` table is the single source of audit truth: every notification — successfully delivered, parked, or discarded — has exactly one durable row. The outbox provides delivery retry, parking of failures, per-notification status tracking, and KPIs for delivery health.
This inverts where notification delivery happens. Sites no longer send notifications directly via SMTP; a site script's notification is store-and-forwarded to central, and the central outbox owns dispatch and delivery.
## Location
Central cluster. The `NotificationOutboxActor` is a **singleton on the active central node**. It is the first outbox component to live centrally — the Store-and-Forward Engine remains site-only.
## Responsibilities
- Own the durable central queue — the `Notifications` table in the central MS SQL database.
- Ingest store-and-forwarded notifications from sites, insert-if-not-exists on `NotificationId`, and ack the site only after the row is persisted.
- Run the dispatcher loop: poll due rows, resolve the target notification list, and deliver via the matching adapter.
- Schedule retries for transient failures and park notifications on permanent failure or exhausted retries.
- Track per-notification status across the delivery lifecycle.
- Compute delivery KPIs from the `Notifications` table for the Health Monitoring dashboard and the Central UI.
- Purge terminal rows daily after a configurable retention window.
SMTP and HTTP delivery is blocking I/O. Delivery work runs on a **dedicated blocking-I/O dispatcher**, the same pattern used by Script Execution Actors, so delivery never blocks the actor's dispatcher loop.
## End-to-End Flow
```
Site script: Notify.To("list").Send(subject, body)
  │  generate NotificationId (GUID) locally; return it to the script immediately
  ▼
Site Store-and-Forward Engine (notification category, target = central)
  │  durably forwards to central via the Communication Layer (ClusterClient);
  │  buffers/retries if central is unreachable
  ▼
Central ingest: insert-if-not-exists on NotificationId → Notifications table (Pending)
  │  ack the site → site S&F clears the message
  ▼
Central Notification Outbox actor (singleton, active central node)
  │  polls due rows; resolves the list; delivers via the matching adapter
  ├── success                                → Delivered
  ├── transient failure                      → Retrying (schedule NextAttemptAt)
  └── permanent failure / retries exhausted  → Parked
```
The site forwards only `(listName, subject, body)` plus provenance — recipient resolution happens at central, at delivery time. This keeps notification-list definitions in one place and removes the deploy-to-sites artifact entirely.
`Notify.Status(notificationId)` returns a small status record — status, retry count, last error, and key timestamps (enqueued, delivered). While the notification is still in the site S&F buffer the site answers the query **locally** (status `Forwarding`); once forwarded, the query round-trips to central and reads the `Notifications` table.
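The local-versus-central routing of a status query can be sketched as follows. This is a minimal illustration, not the actual site API: `notify_status`, `site_buffer`, and `query_central` are hypothetical names, and the record is reduced to three fields.

```python
from typing import Callable, Dict, Optional


def notify_status(
    notification_id: str,
    site_buffer: Dict[str, dict],
    query_central: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Answer locally while the site still holds the message, else ask central."""
    if notification_id in site_buffer:
        # Still in the site S&F buffer: central has not seen it yet.
        return {"status": "Forwarding", "retry_count": 0, "last_error": None}
    # Already forwarded: round-trip to central, which reads the Notifications table.
    return query_central(notification_id)


# Usage with an in-memory stand-in for central (hypothetical data).
central_rows = {"n-2": {"status": "Delivered", "retry_count": 1, "last_error": None}}
local = notify_status("n-1", {"n-1": {}}, central_rows.get)
remote = notify_status("n-2", {}, central_rows.get)
```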
## The `Notifications` Table
The table is type-agnostic so it can record any notification type the system supports — email today, Microsoft Teams and others later. One row per notification.
| Field | Notes |
|---|---|
| `NotificationId` | GUID, primary key. Generated at the **site**; used as the idempotency key. |
| `Type` | `Email` / `Teams` / … discriminator. |
| `ListName` | Target notification list. |
| `Subject`, `Body` | Plain-text content. |
| `TypeData` | JSON — extensibility hook for future per-type fields. |
| `Status` | `Pending` / `Retrying` / `Delivered` / `Parked` / `Discarded`. |
| `RetryCount` | Delivery attempts so far. |
| `LastError` | Detail of the most recent failure. |
| `ResolvedTargets` | Who the notification actually went to — snapshotted by central at delivery time, for audit. |
| `SourceSiteId`, `SourceInstanceId`, `SourceScript` | Provenance. |
| `SiteEnqueuedAt` | When the script called `Send()` (carried from the site). |
| `CreatedAt` | When central ingested the row. |
| `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt` | Delivery timestamps. |
All timestamps are UTC.
### Status Lifecycle
- `Forwarding` — in the site S&F buffer, not yet received by central. **Site-local only** — never stored in the central `Notifications` table; reported by `Notify.Status` while the site still holds the notification.
- `Pending` — ingested by central, awaiting first dispatch.
- `Retrying` — a transient failure occurred; `NextAttemptAt` schedules the next attempt.
- `Delivered` — terminal, success.
- `Parked` — terminal-not-delivered: a permanent failure, or retries exhausted. `LastError` distinguishes which.
- `Discarded` — terminal, reached **only by operator action** on a parked notification. The row is kept (not deleted) so the table remains a complete audit record.
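The lifecycle can be read as a small state machine. The sketch below encodes the transitions implied by the list above plus the operator Retry and Discard actions described for parked notifications; the `ALLOWED` table and `can_transition` helper are illustrative names, not part of the design.

```python
from enum import Enum


class Status(Enum):
    FORWARDING = "Forwarding"  # site-local only, never stored at central
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    DISCARDED = "Discarded"


# Allowed transitions, read off the lifecycle list and the operator actions.
ALLOWED = {
    Status.FORWARDING: {Status.PENDING},  # central ingest
    Status.PENDING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    Status.RETRYING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    Status.PARKED: {Status.PENDING, Status.DISCARDED},  # operator Retry / Discard
    Status.DELIVERED: set(),
    Status.DISCARDED: set(),
}


def can_transition(current: Status, new: Status) -> bool:
    """Return True if the lifecycle permits moving from current to new."""
    return new in ALLOWED[current]
```

Note that `Discarded` is reachable only from `Parked`, which matches the operator-action-only rule.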
### Retry Policy
Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry interval. The interval is fixed (no exponential backoff), consistent with the existing fixed-interval store-and-forward convention.
### Retention
Terminal rows (`Delivered`, `Parked`, `Discarded`) are removed by a **daily purge job** after a configurable window (default ~1 year). This preserves a strong audit trail while bounding table growth. Non-terminal rows are never purged.
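A purge of this kind might look like the sketch below. It uses SQLite in place of MS SQL so that it runs standalone, and it assumes row age is measured from `CreatedAt`; the document does not say which timestamp the retention window applies to, so treat that choice as a placeholder.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

TERMINAL = ("Delivered", "Parked", "Discarded")


def purge(conn, now, retention=timedelta(days=365)):
    """Delete terminal rows older than the retention window; return rows removed."""
    cutoff = (now - retention).isoformat(timespec="seconds")
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "DELETE FROM Notifications "
            "WHERE Status IN (?, ?, ?) AND CreatedAt < ?",
            (*TERMINAL, cutoff),
        )
    return cur.rowcount


# Usage: only the old terminal row is removed; the old Pending row survives.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Notifications "
    "(NotificationId TEXT PRIMARY KEY, Status TEXT, CreatedAt TEXT)"
)
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
old = (now - timedelta(days=400)).isoformat(timespec="seconds")
recent = (now - timedelta(days=10)).isoformat(timespec="seconds")
conn.executemany(
    "INSERT INTO Notifications VALUES (?, ?, ?)",
    [("a", "Delivered", old), ("b", "Pending", old), ("c", "Parked", recent)],
)
removed = purge(conn, now)
remaining = conn.execute("SELECT COUNT(*) FROM Notifications").fetchone()[0]
```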
## Ingest & Idempotency
The site→central handoff is **at-least-once**. Central ingests an inbound notification submission with an **insert-if-not-exists** on `NotificationId`, then acks the site; the site S&F engine clears the message only on that ack. Because central acks only after the row is persisted (ack-after-persist), a lost ack causes the site to resend, and the GUID `NotificationId` idempotency key makes the resend harmless — the duplicate insert is a no-op.
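The ack-after-persist handshake can be illustrated with a short sketch. It uses SQLite's `INSERT OR IGNORE` as a stand-in for whatever insert-if-not-exists form the MS SQL implementation uses, with a reduced column set; the `ingest` function name is hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Notifications ("
    "NotificationId TEXT PRIMARY KEY, ListName TEXT, "
    "Subject TEXT, Body TEXT, Status TEXT)"
)


def ingest(conn, notification_id, list_name, subject, body) -> bool:
    """Insert-if-not-exists; True means the row is durable and the site may be acked."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR IGNORE INTO Notifications "
            "(NotificationId, ListName, Subject, Body, Status) "
            "VALUES (?, ?, ?, ?, 'Pending')",
            (notification_id, list_name, subject, body),
        )
    return True  # reached only after the commit, so the ack follows persistence


nid = str(uuid.uuid4())
first_ack = ingest(conn, nid, "ops", "Pump fault", "Pump 3 tripped")
# The ack was lost, so the site resends the same NotificationId.
second_ack = ingest(conn, nid, "ops", "Pump fault", "Pump 3 tripped")
row_count = conn.execute("SELECT COUNT(*) FROM Notifications").fetchone()[0]
```

Both calls are acked, but the table holds a single row because the second insert hits the primary key and is ignored.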
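A rare central failover mid-delivery can re-send a notification: if the active node fails after the adapter has sent it but before the `Delivered` status is persisted, the new active node finds the row still due and delivers it again. This is an accepted trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward Engine already accepts.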
## Dispatcher
The dispatcher loop runs on a fixed interval. On each tick the `NotificationOutboxActor`:
1. Polls the `Notifications` table for **due rows**: `Pending` rows, and `Retrying` rows whose `NextAttemptAt` has passed.
2. Resolves the target notification list to its recipients/targets at central, at delivery time.
3. Hands the notification to the delivery adapter registered for its `Type`, running on the dedicated blocking-I/O dispatcher.
4. Applies the result:
- **success** → `Delivered`, set `DeliveredAt`, snapshot `ResolvedTargets`.
- **transient failure** → `Retrying`, increment `RetryCount`, set `NextAttemptAt`, record `LastError`; once retries are exhausted → `Parked`.
- **permanent failure** → `Parked`, record `LastError`.
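Steps 1 and 4 can be sketched as two small functions. This is an in-memory illustration only: `Row`, `Outcome`, `is_due`, and `apply_result` are hypothetical names, and treating retries as exhausted when `RetryCount` reaches the configured maximum is an assumption about the boundary condition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    SUCCESS = 1
    TRANSIENT = 2
    PERMANENT = 3


@dataclass
class Row:
    status: str = "Pending"
    retry_count: int = 0
    last_error: Optional[str] = None
    last_attempt_at: Optional[datetime] = None
    next_attempt_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None
    resolved_targets: Optional[List[str]] = None


def is_due(row: Row, now: datetime) -> bool:
    """Pending rows are always due; Retrying rows once NextAttemptAt has passed."""
    if row.status == "Pending":
        return True
    return (
        row.status == "Retrying"
        and row.next_attempt_at is not None
        and row.next_attempt_at <= now
    )


def apply_result(row, outcome, now, targets, error=None,
                 max_retries=3, retry_interval=timedelta(minutes=1)):
    """Apply one delivery outcome to the row (step 4 of the dispatcher tick)."""
    row.last_attempt_at = now
    if outcome is Outcome.SUCCESS:
        row.status = "Delivered"
        row.delivered_at = now
        row.resolved_targets = list(targets)  # audit snapshot
    elif outcome is Outcome.TRANSIENT:
        row.retry_count += 1
        row.last_error = error
        if row.retry_count >= max_retries:  # assumed meaning of "exhausted"
            row.status = "Parked"
        else:
            row.status = "Retrying"
            row.next_attempt_at = now + retry_interval  # fixed interval, no backoff
    else:
        row.status = "Parked"
        row.last_error = error
    return row
```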
## Delivery Adapters
A delivery adapter implementing `INotificationDeliveryAdapter` is registered per `Type`. Each `Deliver(...)` call returns one of `success | transient failure | permanent failure`, mirroring the External System Gateway error-classification pattern.
- **Email adapter — implemented now.** The existing SMTP composition/send logic, relocated to the central cluster.
- **Teams and other adapters — future.** The `Type` discriminator and the adapter interface are the seam; no Teams code exists in this design. Teams auth and targeting (Incoming Webhooks vs Graph API) is a separate design conversation.
Delivery adapters are provided by the Notification Service, which manages notification-list and SMTP definitions and supplies the stateless per-type "deliver one notification" implementations.
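The per-`Type` adapter seam might be shaped roughly like this. It is a Python rendering of the idea rather than the real `INotificationDeliveryAdapter` signature, which this document does not spell out; `FakeEmailAdapter`, `ADAPTERS`, and `dispatch` are hypothetical, and treating an unregistered type as a permanent failure is an assumption.

```python
from enum import Enum
from typing import Dict, List, Protocol


class DeliveryResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"


class NotificationDeliveryAdapter(Protocol):
    def deliver(self, targets: List[str], subject: str, body: str) -> DeliveryResult:
        ...


class FakeEmailAdapter:
    """Stand-in for the email adapter: records messages instead of sending them."""

    def __init__(self) -> None:
        self.sent = []

    def deliver(self, targets: List[str], subject: str, body: str) -> DeliveryResult:
        if not targets:
            # Assumption: an empty recipient list cannot succeed on retry.
            return DeliveryResult.PERMANENT_FAILURE
        self.sent.append((tuple(targets), subject, body))
        return DeliveryResult.SUCCESS


# One adapter registered per Type discriminator.
ADAPTERS: Dict[str, NotificationDeliveryAdapter] = {"Email": FakeEmailAdapter()}


def dispatch(type_name: str, targets: List[str], subject: str, body: str) -> DeliveryResult:
    adapter = ADAPTERS.get(type_name)
    if adapter is None:
        # Assumption: no adapter for this Type is treated as a permanent failure.
        return DeliveryResult.PERMANENT_FAILURE
    return adapter.deliver(targets, subject, body)
```

Adding a Teams adapter later would mean registering one more entry in the map; the dispatcher code would not change.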
## Active/Standby Behavior
The `NotificationOutboxActor` is a singleton on the active central node. All outbox state lives in MS SQL, which is already the central HA store, so **no Akka-level replication is needed** (unlike the site S&F engine). On central failover the new active node resumes dispatch directly from the `Notifications` table — `Pending` rows and due `Retrying` rows are picked up on the next dispatcher tick.
## Monitoring
### KPIs
KPIs are central-computed from the `Notifications` table — global, with a per-source-site breakdown:
- **Queue depth** — count of `Pending` + `Retrying`.
- **Stuck count** — `Pending` / `Retrying` rows older than the configurable stuck-age threshold.
- **Parked count** — count of `Parked`.
- **Delivered (last interval)** — count of `Delivered` since the previous sample.
- **Oldest pending age** — age of the oldest non-terminal notification.
KPIs are point-in-time, computed on demand from the table. The ~1-year row retention answers historical questions directly, so no separate time-series store is added.
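As a rough illustration, the five KPIs can be computed in one pass over the rows. The sketch below works on in-memory dictionaries rather than SQL and assumes age is measured from `CreatedAt`; the per-source-site breakdown is omitted, and `compute_kpis` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def compute_kpis(rows, now, last_sample, stuck_age=timedelta(minutes=10)):
    """Point-in-time KPIs from a list of row dicts (global totals only)."""
    active = [r for r in rows if r["Status"] in ("Pending", "Retrying")]
    ages = [now - r["CreatedAt"] for r in active]
    return {
        "queue_depth": len(active),
        "stuck_count": sum(1 for age in ages if age > stuck_age),
        "parked_count": sum(1 for r in rows if r["Status"] == "Parked"),
        "delivered_last_interval": sum(
            1 for r in rows
            if r["Status"] == "Delivered" and r["DeliveredAt"] > last_sample
        ),
        "oldest_pending_age": max(ages, default=None),
    }


# Usage with hypothetical rows.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"Status": "Pending", "CreatedAt": now - timedelta(minutes=30), "DeliveredAt": None},
    {"Status": "Retrying", "CreatedAt": now - timedelta(minutes=2), "DeliveredAt": None},
    {"Status": "Parked", "CreatedAt": now - timedelta(hours=3), "DeliveredAt": None},
    {"Status": "Delivered", "CreatedAt": now - timedelta(minutes=4),
     "DeliveredAt": now - timedelta(minutes=1)},
]
kpis = compute_kpis(rows, now, last_sample=now - timedelta(minutes=5))
```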
### Stuck Detection
A notification is **stuck** if it is `Pending` or `Retrying` and older than a configurable age threshold (default 10 minutes). Detection is **display-only** — a count KPI and a row badge. There is no automated escalation or alerting, consistent with the system-wide no-alerting policy.
### Surfacing
- **Health Monitoring dashboard** — headline KPI tiles: queue depth, stuck count, parked count. These are central-computed and are not part of the site health report. The site S&F notification backlog remains a separate site health metric covering the site→central leg.
- **Central UI "Notification Outbox" page** — KPI tiles plus a queryable notification list: filter by status, type, source site, list, and time range; a stuck-only toggle; keyword search on subject. Parked notifications offer **Retry** (→ `Pending`, reset `RetryCount` / `NextAttemptAt`) and **Discard** (→ `Discarded`) actions. Stuck rows are badged.
## Configuration
The component is configured via `NotificationOutboxOptions`, bound from an `appsettings.json` section on the central host (Options pattern):
- **Dispatch interval** — how often the dispatcher loop polls for due rows.
- **Stuck-age threshold** — age beyond which a non-terminal notification is counted as stuck (default 10 minutes).
- **Terminal-row retention window** — age after which terminal rows are removed by the daily purge job (default ~1 year).
Delivery max-retry-count and retry interval are not part of `NotificationOutboxOptions` — they are reused from the central SMTP configuration.
## Dependencies
- **Notification Service**: Provides notification-list and SMTP definitions, and the per-type delivery adapters the outbox invokes.
- **Configuration Database**: Hosts the `Notifications` table; provides the entity POCO, repository, and EF migration for outbox persistence.
- **Central-Site Communication**: Carries inbound notification submissions and acks between sites and central.
- **Health Monitoring**: Consumes the outbox KPIs as central-computed headline metrics.
- **Central UI**: Hosts the Notification Outbox page.
## Interactions
- **Site Store-and-Forward Engine**: Forwards notifications to central via the Communication Layer; the outbox ingests them and acks once persisted.
- **Notification Service**: Supplies delivery adapters and resolves notification lists at delivery time.
- **Central UI**: Queries the `Notifications` table for the Notification Outbox page and issues operator Retry/Discard actions on parked notifications.
- **Health Monitoring**: Polls the outbox for KPI tiles on the health dashboard.