Final cross-bundle reviewer identified 7 inconsistencies that the per-bundle reviewers couldn't see; all fixed in one logical commit.

Critical:
- HighLevelReqs AL-3: drop 'then upsert-on-newer-status' - AuditLog is strictly append-only (correct for SiteCalls/Notifications, wrong for the immutable AuditLog shadow).
- Component-AuditLog Error rate KPI: align with HealthMonitoring's exclusion list (Success/Delivered/Enqueued) rather than just non-Success; otherwise every Delivered notification or Enqueued cached call would be counted as an error.

Important:
- Component-AuditLog line 154: ISiteAuditWriter -> IAuditWriter (canonical name per Commons and the rest of this doc).
- Component-AuditLog Central direct-write paragraph: convert remaining slash notation (ApiInbound/Completed, Notification/Attempt, Notification/Terminal) to dot notation used everywhere else.
- Component-ClusterInfrastructure: scope SiteCallAuditActor to reconciliation + KPIs + Retry/Discard relay; cached-telemetry ingest is AuditLogIngestActor's role per Combined Telemetry contract.
- Component-CentralUI Audit Log page: state the OperationalAudit read permission and the read-vs-export split (matching CLI doc).
- Component-NotificationOutbox: add never-fail-the-action invariant for dispatcher audit writes.

Minor:
- Component-InboundAPI: 'Non-blocking semantics' was ambiguous (could be read as async); reword to 'Fail-soft' - the write is still synchronous before flush, but failures are caught and don't change the response.
- Component-CLI: realign audit-query/audit-export flags to actually match the Central UI Audit Log filter set (channel, kind, status, site, instance, target, actor, correlation-id, errors-only); drop --user and --entity-id which are IAuditService concepts, not Audit Log columns.
- Component-AuditLog KPI tile names: 'Volume/Error rate/Backlog' -> 'Audit volume/Audit error rate/Audit backlog' (matches Central UI and Health Monitoring); drop the two orphan KPIs (Top inbound callers, Top outbound 5xx) that were never surfaced anywhere.
- Component-AuditLog Interactions: re-attribute DbOutbound emissions to ESG (where Database.* lives) with a note that Site Runtime is the API surface for scripts.
- HighLevelReqs AL-12: drop 'and reconciliation operations' (CLI has no reconcile command; reconciliation is an internal self-healing pull). Add note that verify-chain becomes operational once AL-11's hash chain ships.

# Component: Notification Outbox

## Purpose

The Notification Outbox is the central component that receives store-and-forwarded notifications from site clusters, logs every one to the `Notifications` table in the central configuration database, and delivers them through per-type delivery adapters. The `Notifications` table is the single source of audit truth: every notification (successfully delivered, parked, or discarded) has exactly one durable row. The outbox provides delivery retry, parking of failures, per-notification status tracking, and KPIs for delivery health.

This inverts where notification delivery happens. Sites no longer send notifications directly via SMTP; a site script's notification is store-and-forwarded to central, and the central outbox owns dispatch and delivery.

## Location

Central cluster. The `NotificationOutboxActor` is a **singleton on the active central node**. It is the first outbox component to live centrally; the Store-and-Forward Engine remains site-only.

## Responsibilities

- Own the durable central queue: the `Notifications` table in the central MS SQL database.
- Ingest store-and-forwarded notifications from sites, insert-if-not-exists on `NotificationId`, and ack the site only after the row is persisted.
- Run the dispatcher loop: poll due rows, resolve the target notification list, and deliver via the matching adapter.
- Schedule retries for transient failures and park notifications on permanent failure or exhausted retries.
- Track per-notification status across the delivery lifecycle.
- Compute delivery KPIs from the `Notifications` table for the Health Monitoring dashboard and the Central UI.
- Purge terminal rows daily after a configurable retention window.

SMTP and HTTP delivery is blocking I/O. Delivery work runs on a **dedicated blocking-I/O dispatcher**, the same pattern used by Script Execution Actors, so delivery never blocks the actor's dispatcher loop.

## End-to-End Flow

```
Site script: Notify.To("list").Send(subject, body)
  │  generate NotificationId (GUID) locally; return it to the script immediately
  ▼
Site Store-and-Forward Engine (notification category, target = central)
  │  durably forwards to central via Central–Site Communication (ClusterClient);
  │  buffers/retries if central is unreachable
  ▼
Central ingest: insert-if-not-exists on NotificationId → Notifications table (Pending)
  │  ack the site → site S&F clears the message
  ▼
Central Notification Outbox actor (singleton, active central node)
  │  polls due rows; resolves the list; delivers via the matching adapter
  ├── success → Delivered
  ├── transient failure → Retrying (schedule NextAttemptAt)
  └── permanent failure
      / retries exhausted → Parked
```

The site forwards only `(listName, subject, body)` plus provenance; recipient resolution happens at central, at delivery time. This keeps notification-list definitions in one place and removes the deploy-to-sites artifact entirely.

`Notify.Status(notificationId)` returns a small status record: status, retry count, last error, and key timestamps (enqueued, delivered). While the notification is still in the site S&F buffer the site answers the query **locally** (status `Forwarding`); once forwarded, the query round-trips to central and reads the `Notifications` table.

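The local-versus-central status lookup can be sketched as follows. This is a language-neutral illustration in Python, not the system's actual API: `StatusRecord`, `site_buffer`, and `central_lookup` are hypothetical names, and the timestamp fields are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StatusRecord:
    status: str
    retry_count: int = 0
    last_error: Optional[str] = None


def notify_status(
    notification_id: str,
    site_buffer: set,
    central_lookup: Callable[[str], Optional[StatusRecord]],
) -> Optional[StatusRecord]:
    """Answer locally while the site still holds the message, else ask central."""
    if notification_id in site_buffer:
        # Still in the site S&F buffer: central has no row yet.
        return StatusRecord(status="Forwarding")
    # Already forwarded: round-trip to central (hypothetical callable).
    return central_lookup(notification_id)
```

In this sketch an unknown id simply yields `None`; how the real API reports an unknown id is not stated in this document.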
## The `Notifications` Table

The table is type-agnostic so it can record any notification type the system supports (email today, Microsoft Teams and others later). One row per notification.

| Field | Notes |
|---|---|
| `NotificationId` | GUID, primary key. Generated at the **site**; used as the idempotency key. |
| `Type` | `Email` / `Teams` / … discriminator. |
| `ListName` | Target notification list. |
| `Subject`, `Body` | Plain-text content. |
| `TypeData` | JSON; extensibility hook for future per-type fields. |
| `Status` | Lifecycle state: one of `Pending`, `Retrying`, `Delivered`, `Parked`, `Discarded`. See Status Lifecycle below. |
| `RetryCount` | Delivery attempts so far. |
| `LastError` | Detail of the most recent failure. |
| `ResolvedTargets` | Who the notification actually went to, snapshotted by central at delivery time, for audit. |
| `SourceSiteId`, `SourceInstanceId`, `SourceScript` | Provenance. |
| `SiteEnqueuedAt` | When the script called `Send()` (carried from the site). |
| `CreatedAt` | When central ingested the row. |
| `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt` | Delivery timestamps. |

All timestamps are UTC.

### Status Lifecycle

- `Forwarding`: in the site S&F buffer, not yet received by central. **Site-local only**: it is never stored in the central `Notifications` table, and is reported by `Notify.Status` while the site still holds the notification.
- `Pending`: ingested by central, awaiting first dispatch.
- `Retrying`: a transient failure occurred; `NextAttemptAt` schedules the next attempt.
- `Delivered`: terminal, success.
- `Parked`: terminal-not-delivered (a permanent failure, or retries exhausted). `LastError` distinguishes which.
- `Discarded`: terminal, reached **only by operator action** on a parked notification. The row is kept (not deleted) so the table remains a complete audit record.

The Notification Outbox and the central [`Site Call Audit`](Component-SiteCallAudit.md) component share the `TrackedOperationId` tracking model and this status lifecycle, but differ in delivery locality: the Notification Outbox **delivers** notifications itself (central SMTP), whereas Site Call Audit only **audits** cached calls delivered site-locally by the site Store-and-Forward Engine; it is not a dispatcher.

### Retry Policy

Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry interval. The interval is fixed (no exponential backoff), consistent with the existing fixed-interval store-and-forward convention.

### Retention

Terminal rows (`Delivered`, `Parked`, `Discarded`) are removed by a **daily purge job** after a configurable window (default 365 days). This preserves a strong audit trail while bounding table growth. Non-terminal rows are never purged.

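A minimal sketch of the purge rule, using Python's built-in `sqlite3` as a stand-in for the central MS SQL database. The function name is hypothetical, and measuring row age from `CreatedAt` is an assumption: the document does not say which timestamp the retention window is measured from.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

TERMINAL = ("Delivered", "Parked", "Discarded")


def purge_terminal_rows(conn: sqlite3.Connection, now: datetime,
                        retention_days: int = 365) -> int:
    """Delete terminal rows older than the retention window; return rows removed."""
    # Timestamps are stored as ISO-8601 UTC strings, so string comparison
    # orders them chronologically in this sketch.
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    cur = conn.execute(
        "DELETE FROM Notifications "
        "WHERE Status IN (?, ?, ?) AND CreatedAt < ?",  # age basis is an assumption
        (*TERMINAL, cutoff),
    )
    conn.commit()
    return cur.rowcount
```

Because the status filter lists only terminal states, a `Pending` or `Retrying` row is never removed regardless of its age.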
## Ingest & Idempotency

The site→central handoff is **at-least-once**. Central ingests an inbound notification submission with an insert-if-not-exists on `NotificationId`, then acks the site; the site S&F engine clears the message only on that ack. Because central acks only after the row is persisted (ack-after-persist), a lost ack causes the site to resend, and the GUID `NotificationId` idempotency key makes the resend harmless: the duplicate insert is a no-op.

A rare central failover mid-delivery could re-send a notification that was already delivered but whose row had not yet been updated to `Delivered`. This is an accepted trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward Engine already accepts.

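The ack-after-persist handoff can be illustrated with a small sketch. It uses Python's `sqlite3` and SQLite's `INSERT OR IGNORE` as a stand-in for the insert-if-not-exists against MS SQL; the `ingest` function and the reduced column set are assumptions made for illustration.

```python
import sqlite3


def ingest(conn: sqlite3.Connection, notification_id: str, list_name: str,
           subject: str, body: str) -> bool:
    """Persist the submission if unseen, then return True as the ack to the site.

    A resend of an already-persisted NotificationId is a no-op but is still acked,
    so the site can clear its buffered copy.
    """
    conn.execute(
        "INSERT OR IGNORE INTO Notifications "
        "(NotificationId, ListName, Subject, Body, Status) "
        "VALUES (?, ?, ?, ?, 'Pending')",
        (notification_id, list_name, subject, body),
    )
    conn.commit()  # ack-after-persist: commit before acknowledging
    return True
```

Note that a duplicate submission leaves the existing row untouched, so a resend that arrives after delivery does not reset the row to `Pending`.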
## Dispatcher

The dispatcher loop runs on a fixed interval. On each tick the `NotificationOutboxActor`:

1. Polls the `Notifications` table for **due rows**: `Pending` rows, and `Retrying` rows whose `NextAttemptAt` has passed.
2. Resolves the target notification list to its recipients/targets at central, at delivery time.
3. Hands the notification to the delivery adapter registered for its `Type`, running on the dedicated blocking-I/O dispatcher.
4. Applies the result:
   - **success** → `Delivered`, set `DeliveredAt`, snapshot `ResolvedTargets`.
   - **transient failure** → `Retrying`, increment `RetryCount`, set `NextAttemptAt`, record `LastError`; once retries are exhausted → `Parked`.
   - **permanent failure** → `Parked`, record `LastError`.

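The due-row test and the result transitions in the steps above can be sketched as a small state function. This is an illustration in Python, not the actor's code: `Row`, `Outcome`, `is_due`, and `apply_result` are hypothetical names, the `ResolvedTargets` snapshot is omitted, and treating retries as exhausted when `RetryCount` reaches the configured maximum is an assumption about the exact comparison.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    TRANSIENT = "transient failure"
    PERMANENT = "permanent failure"


@dataclass
class Row:
    status: str = "Pending"
    retry_count: int = 0
    next_attempt_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None
    last_error: Optional[str] = None


def is_due(row: Row, now: datetime) -> bool:
    """Pending rows are always due; Retrying rows are due once NextAttemptAt has passed."""
    if row.status == "Pending":
        return True
    return (row.status == "Retrying"
            and row.next_attempt_at is not None
            and row.next_attempt_at <= now)


def apply_result(row: Row, outcome: Outcome, now: datetime, max_retries: int,
                 retry_interval: timedelta, error: Optional[str] = None) -> Row:
    """Apply one delivery result to the row, following the listed transitions."""
    if outcome is Outcome.SUCCESS:
        row.status, row.delivered_at = "Delivered", now
    elif outcome is Outcome.PERMANENT:
        row.status, row.last_error = "Parked", error
    else:
        row.retry_count += 1
        row.last_error = error
        if row.retry_count >= max_retries:  # exhaustion test is an assumption
            row.status, row.next_attempt_at = "Parked", None
        else:
            row.status = "Retrying"
            row.next_attempt_at = now + retry_interval  # fixed interval, no backoff
    return row
```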
Each delivery attempt also writes a `Notification.Attempt` row to the central `AuditLog` via `ICentralAuditWriter`; a transition to a terminal status (`Delivered` / `Parked` / `Discarded`) writes a `Notification.Terminal` row. Audit writes are **direct** (no telemetry, because the dispatcher runs at central), insert-if-not-exists on `EventId`. The site-emitted `Notification.Enqueued` row arrives separately via the standard audit telemetry channel from the site's SQLite write-buffer, so the full per-notification audit trail is `Enqueued` (site-originated) → `Attempt` × N (central direct-write) → `Terminal` (central direct-write). See [Component-AuditLog.md](Component-AuditLog.md), Central direct-write (central-originated events).

The operational `Notifications` table remains the **source of truth** for the dispatcher and for Retry/Discard actions; the `AuditLog` rows are immutable shadows. Operator Retry/Discard still mutates only the `Notifications` row, and each transition emits the corresponding `Notification.Attempt` / `Notification.Terminal` audit row.

**Audit-write failure never affects delivery.** If the `ICentralAuditWriter` direct-write fails (transient DB error, schema lock, etc.) the dispatcher logs the failure and increments the `CentralAuditWriteFailures` health metric (see Health Monitoring #11), but the delivery attempt's outcome on the `Notifications` row stands. The audit row is recovered by re-emission on the next dispatcher tick or by the on-startup reconciliation sweep; central never aborts a notification because audit failed.

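The fail-soft rule for audit writes can be sketched as a wrapper that catches the failure, logs it, bumps a counter, and returns without raising. This is a Python illustration under stated assumptions: `AuditHealth` is a hypothetical stand-in for the `CentralAuditWriteFailures` metric, and `write` stands in for the `ICentralAuditWriter` call, whose real signature is not given here.

```python
import logging
from typing import Callable

logger = logging.getLogger("notification_outbox")


class AuditHealth:
    """Hypothetical counter standing in for the CentralAuditWriteFailures metric."""

    def __init__(self) -> None:
        self.central_audit_write_failures = 0


def write_audit_fail_soft(write: Callable[[dict], None], event: dict,
                          health: AuditHealth) -> bool:
    """Attempt the audit write; on failure log, count, and return False without raising."""
    try:
        write(event)
        return True
    except Exception:
        logger.exception("audit write failed for event %s", event.get("EventId"))
        health.central_audit_write_failures += 1
        return False
```

The caller applies the delivery outcome to the `Notifications` row regardless of the return value; a `False` result only marks the audit row as one to re-emit later.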
## Delivery Adapters

A delivery adapter implementing `INotificationDeliveryAdapter` is registered per `Type`. Each `Deliver(...)` call returns one of `success | transient failure | permanent failure`, mirroring the External System Gateway error-classification pattern.

- **Email adapter (implemented now).** The existing SMTP composition/send logic, relocated to the central cluster.
- **Teams and other adapters (future).** The `Type` discriminator and the adapter interface are the seam; no Teams code exists in this design. Teams auth and targeting (Incoming Webhooks vs Graph API) is a separate design conversation.

Delivery adapters are provided by the Notification Service, which manages notification-list and SMTP definitions and supplies the stateless per-type "deliver one notification" implementations.

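The per-`Type` adapter seam might look like the sketch below. It is written in Python for illustration only: the parameter list of `deliver` is hypothetical (the document elides the real `Deliver(...)` signature), `RecordingEmailAdapter` records messages instead of sending them, and returning a permanent failure for an unregistered `Type` or an empty recipient list is an assumption.

```python
from enum import Enum
from typing import Dict, List, Protocol


class DeliveryResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"


class NotificationDeliveryAdapter(Protocol):
    def deliver(self, targets: List[str], subject: str, body: str) -> DeliveryResult: ...


class RecordingEmailAdapter:
    """Stand-in for the email adapter: records instead of sending via SMTP."""

    def __init__(self) -> None:
        self.sent: list = []

    def deliver(self, targets: List[str], subject: str, body: str) -> DeliveryResult:
        if not targets:
            # Treating an empty recipient list as permanent is an assumption.
            return DeliveryResult.PERMANENT_FAILURE
        self.sent.append((tuple(targets), subject, body))
        return DeliveryResult.SUCCESS


# One adapter instance per Type discriminator value.
ADAPTERS: Dict[str, NotificationDeliveryAdapter] = {"Email": RecordingEmailAdapter()}


def deliver(notification_type: str, targets: List[str], subject: str,
            body: str) -> DeliveryResult:
    """Look up the adapter by Type; an unregistered Type cannot succeed on retry."""
    adapter = ADAPTERS.get(notification_type)
    if adapter is None:
        return DeliveryResult.PERMANENT_FAILURE
    return adapter.deliver(targets, subject, body)
```

Adding a future type in this sketch means registering one more entry in `ADAPTERS`; the dispatcher-side lookup does not change.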
## Active/Standby Behavior

The `NotificationOutboxActor` is a singleton on the active central node. All outbox state lives in MS SQL, which is already the central HA store, so **no Akka-level replication is needed** (unlike the site S&F engine). On central failover the new active node resumes dispatch directly from the `Notifications` table: `Pending` rows and due `Retrying` rows are picked up on the next dispatcher tick.

## Monitoring

### KPIs

KPIs are central-computed from the `Notifications` table, globally and with a per-source-site breakdown:

- **Queue depth**: count of `Pending` + `Retrying`.
- **Stuck count**: `Pending` / `Retrying` rows older than the configurable stuck-age threshold.
- **Parked count**: count of `Parked`.
- **Delivered (last interval)**: count of `Delivered` since the previous sample.
- **Oldest pending age**: age of the oldest non-terminal notification.

KPIs are point-in-time, computed on demand from the table. The configurable row retention (default 365 days) answers historical questions directly, so no separate time-series store is added.

### Stuck Detection

A notification is **stuck** if it is `Pending` or `Retrying` and older than a configurable age threshold (default 10 minutes). Detection is **display-only**: a count KPI and a row badge. There is no automated escalation or alerting, consistent with the system-wide no-alerting policy.

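The KPI definitions and the stuck rule can be computed in a single pass over the rows, as in this Python sketch. The function name and the `(Status, CreatedAt)` row shape are hypothetical, measuring age from `CreatedAt` is an assumption, and the "Delivered (last interval)" KPI is left out because it depends on when the previous sample was taken.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

NON_TERMINAL = {"Pending", "Retrying"}


def compute_kpis(rows: Iterable[Tuple[str, datetime]], now: datetime,
                 stuck_age: timedelta = timedelta(minutes=10)) -> dict:
    """rows are (Status, CreatedAt) pairs; age is measured from CreatedAt (an assumption)."""
    rows = list(rows)
    open_ages = [now - created for status, created in rows if status in NON_TERMINAL]
    oldest: Optional[timedelta] = max(open_ages) if open_ages else None
    return {
        "queue_depth": len(open_ages),
        "stuck_count": sum(1 for age in open_ages if age > stuck_age),
        "parked_count": sum(1 for status, _ in rows if status == "Parked"),
        "oldest_pending_age": oldest,
    }
```

A per-source-site breakdown would apply the same function to the rows of each site.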
### Surfacing

- **Health Monitoring dashboard**: headline KPI tiles for queue depth, stuck count, and parked count. These are central-computed and are not part of the site health report. The site S&F notification backlog remains a separate site health metric covering the site→central leg.
- **Central UI "Notification Outbox" page**: KPI tiles plus a queryable notification list with filters for status, type, source site, list, and time range; a stuck-only toggle; and keyword search on subject. Parked notifications offer **Retry** (→ `Pending`, reset `RetryCount` / `NextAttemptAt`) and **Discard** (→ `Discarded`) actions. Stuck rows are badged.

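The two operator actions on a parked notification can be sketched as guarded transitions. This Python illustration uses hypothetical names (`ParkedRow`, `operator_retry`, `operator_discard`); resetting `RetryCount` to 0 and `NextAttemptAt` to empty is an assumption about what "reset" means, and raising an error for a non-parked row is an illustrative choice.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ParkedRow:
    status: str
    retry_count: int = 0
    next_attempt_at: Optional[datetime] = None


def operator_retry(row: ParkedRow) -> ParkedRow:
    """Retry: Parked -> Pending, resetting RetryCount and NextAttemptAt."""
    if row.status != "Parked":
        raise ValueError("Retry is offered only for Parked notifications")
    row.status, row.retry_count, row.next_attempt_at = "Pending", 0, None
    return row


def operator_discard(row: ParkedRow) -> ParkedRow:
    """Discard: Parked -> Discarded; the row is kept, not deleted."""
    if row.status != "Parked":
        raise ValueError("Discard is offered only for Parked notifications")
    row.status = "Discarded"
    return row
```

After a Retry the row is `Pending` again, so the next dispatcher tick picks it up like any newly ingested notification.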
## Configuration

The component is configured via `NotificationOutboxOptions`, bound from an `appsettings.json` section on the central host (Options pattern):

- **Dispatch interval**: how often the dispatcher loop polls for due rows.
- **Stuck-age threshold**: age beyond which a non-terminal notification is counted as stuck (default 10 minutes).
- **Terminal-row retention window**: age after which terminal rows are removed by the daily purge job (default 365 days).

Delivery max-retry-count and retry interval are not part of `NotificationOutboxOptions`; they are reused from the central SMTP configuration.

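For orientation, the three options and their stated defaults can be modelled as a small value type. This Python sketch is not the real options class: the field names are illustrative renderings (the actual property names are not given in this document), the dispatch interval is required because no default is stated, and the positivity check is an added assumption.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class NotificationOutboxOptions:
    # No default is stated for the dispatch interval, so it is required here.
    dispatch_interval: timedelta
    stuck_age_threshold: timedelta = timedelta(minutes=10)
    terminal_retention: timedelta = timedelta(days=365)

    def __post_init__(self) -> None:
        # Validation is an assumption of this sketch, not a documented rule.
        for name in ("dispatch_interval", "stuck_age_threshold", "terminal_retention"):
            if getattr(self, name) <= timedelta(0):
                raise ValueError(f"{name} must be positive")
```

Max-retry-count and retry interval are deliberately absent, matching the statement that they come from the central SMTP configuration.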
## Dependencies

- **Notification Service**: Provides notification-list and SMTP definitions, and the per-type delivery adapters the outbox invokes.
- **Configuration Database**: Hosts the `Notifications` table; provides the entity POCO, repository, and EF migration for outbox persistence.
- **Central–Site Communication**: Carries inbound notification submissions and acks between sites and central.
- **Audit Log (#23)**: The dispatcher direct-writes `Notification.Attempt` and `Notification.Terminal` rows to the central `AuditLog` via `ICentralAuditWriter` (insert-if-not-exists on `EventId`); the site-emitted `Notification.Enqueued` row arrives via the standard audit telemetry channel. See [Component-AuditLog.md](Component-AuditLog.md), Central direct-write (central-originated events).
- **Health Monitoring**: Consumes the outbox KPIs as central-computed headline metrics.
- **Central UI**: Hosts the Notification Outbox page.

## Interactions

- **Site Store-and-Forward Engine**: Forwards notifications to central via Central–Site Communication; the outbox ingests them and acks once persisted.
- **Notification Service**: Supplies delivery adapters and resolves notification lists at delivery time.
- **Central UI**: Queries the `Notifications` table for the Notification Outbox page and issues operator Retry/Discard actions on parked notifications.
- **Health Monitoring**: Polls the outbox for KPI tiles on the health dashboard.