docs(components): reference docs batch 3/4 — NotificationService, NotificationOutbox, SiteCallAudit, HealthMonitoring, SiteEventLogging, InboundAPI
This commit is contained in:
@@ -0,0 +1,260 @@
|
||||
# Notification Outbox
|
||||
|
||||
The Notification Outbox is the central component that receives store-and-forwarded notifications from site clusters, persists each one to the `Notifications` table in the central MS SQL database, and delivers them through per-type delivery adapters. It is the first outbox component to run centrally — the Store-and-Forward Engine remains site-only.
|
||||
|
||||
## Overview
|
||||
|
||||
Notification Outbox (#21) runs exclusively on the central cluster. Sites no longer send notifications directly via SMTP: a script's `Notify.Send` call generates a `NotificationId` (GUID) locally, the notification is stored in the site S&F buffer, forwarded to central via Central–Site Communication, and the central outbox owns all dispatch and delivery from that point on.
|
||||
|
||||
The component code lives in `src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/`, with a flat layout:
|
||||
|
||||
- Root — `NotificationOutboxActor`, `NotificationOutboxOptions`, `ServiceCollectionExtensions`.
|
||||
- `Delivery/` — `INotificationDeliveryAdapter`, `DeliveryOutcome`/`DeliveryResult`, `EmailNotificationDeliveryAdapter`.
|
||||
- `Messages/` — `InternalMessages` (actor-internal timer and pipe messages, never sent over the network).
|
||||
|
||||
Shared types used throughout — `Notification`, `NotificationStatus`, `NotificationType`, `INotificationOutboxRepository`, and all ClusterClient message contracts — live in `src/ZB.MOM.WW.ScadaBridge.Commons/`.
|
||||
|
||||
The DI entry point is `ServiceCollectionExtensions.AddNotificationOutbox`, called by the Host on central nodes. It binds `NotificationOutboxOptions` and registers the typed delivery adapters. The Host separately registers `NotificationOutboxActor` as a cluster singleton and wires the ClusterClient receptionist so inbound `NotificationSubmit` messages reach the actor regardless of which central node is active.
|
||||
|
||||
## Key Concepts
|
||||
|
||||
### `Notifications` table as source of truth
|
||||
|
||||
The central MS SQL `Notifications` table is the single source of audit truth for every notification in the system. One row per `NotificationId` (GUID primary key), regardless of delivery channel. The table is type-agnostic: the `Type` discriminator (`Email`; future channels add new enum members) selects the delivery adapter while the rest of the schema is shared. The `Notifications` table answers both operational queries (current status, retry count, next attempt) and KPI queries (queue depth, stuck count, throughput); no separate time-series store is needed.
|
||||
|
||||
### Status lifecycle
|
||||
|
||||
| Status | Where it lives | Meaning |
|
||||
|---|---|---|
|
||||
| `Forwarding` | Site-local only | Notification is in the site S&F buffer; never stored in `Notifications`. |
|
||||
| `Pending` | Central | Ingested; awaiting first dispatch sweep. |
|
||||
| `Retrying` | Central | Transient failure; `NextAttemptAt` schedules the next attempt. |
|
||||
| `Delivered` | Central, terminal | Successfully sent; `DeliveredAt` and `ResolvedTargets` are set. |
|
||||
| `Parked` | Central, terminal | Permanent failure or retries exhausted; `LastError` records why. |
|
||||
| `Discarded` | Central, terminal | Operator-cancelled a `Parked` notification; row is kept for the audit record. |
|
||||
|
||||
`Forwarding` is answered site-locally by `Notify.Status(id)`; once the ack arrives, queries round-trip to the `Notifications` table on central.
|
||||
|
||||
### At-least-once site→central handoff
|
||||
|
||||
Central ingests a `NotificationSubmit` with `insert-if-not-exists` on `NotificationId`, then acks the site with `NotificationSubmitAck`. The site S&F engine clears the message only after receiving that ack. A lost ack causes the site to resend; the GUID idempotency key makes the resend a no-op. Because the ack is sent only after the row is persisted, no notification is lost to a race between the ack and a central failover — the row already exists.
|
||||
|
||||
### No Akka-level replication
|
||||
|
||||
All outbox state lives in MS SQL, which is already the central HA store. There is no Akka replication of the actor's in-memory state. On a central failover the new active node's singleton starts a fresh dispatch sweep; `Pending` and due `Retrying` rows are reclaimed from the table on the next tick.
|
||||
|
||||
## Architecture
|
||||
|
||||
### `NotificationOutboxActor`
|
||||
|
||||
`NotificationOutboxActor` is a `ReceiveActor` that implements `IWithTimers`. It runs as a cluster singleton on the active central node. The actor is responsible for both the ingest path (accepting `NotificationSubmit` messages) and the dispatch path (running the periodic delivery loop). All async work is executed via `PipeTo(Self)` so every result lands on the actor's mailbox thread, preserving single-threaded actor semantics:
|
||||
|
||||
```csharp
|
||||
public class NotificationOutboxActor : ReceiveActor, IWithTimers
|
||||
{
|
||||
public NotificationOutboxActor(
|
||||
IServiceProvider serviceProvider,
|
||||
NotificationOutboxOptions options,
|
||||
ICentralAuditWriter auditWriter,
|
||||
ILogger<NotificationOutboxActor> logger)
|
||||
{
|
||||
Receive<NotificationSubmit>(HandleSubmit);
|
||||
Receive<InternalMessages.IngestPersisted>(HandleIngestPersisted);
|
||||
Receive<InternalMessages.DispatchTick>(_ => HandleDispatchTick());
|
||||
Receive<InternalMessages.DispatchComplete>(_ => _dispatching = false);
|
||||
Receive<InternalMessages.PurgeTick>(_ => HandlePurgeTick());
|
||||
Receive<InternalMessages.PurgeComplete>(_ => { });
|
||||
Receive<NotificationOutboxQueryRequest>(HandleQuery);
|
||||
Receive<NotificationStatusQuery>(HandleStatusQuery);
|
||||
Receive<NotificationDetailRequest>(HandleDetailRequest);
|
||||
Receive<RetryNotificationRequest>(HandleRetry);
|
||||
Receive<DiscardNotificationRequest>(HandleDiscard);
|
||||
Receive<NotificationKpiRequest>(HandleKpiRequest);
|
||||
Receive<PerSiteNotificationKpiRequest>(HandlePerSiteKpiRequest);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
`PreStart` starts two periodic Akka timers: `DispatchTick` at `DispatchInterval` and `PurgeTick` at `PurgeInterval`. A lifecycle-scoped `CancellationTokenSource` (`_shutdownCts`) is created in `PreStart` and cancelled in `PostStop` so any in-flight SMTP send observes coordinated shutdown instead of blocking for a full connect/auth/send timeout.
|
||||
|
||||
### Ingest path
|
||||
|
||||
`HandleSubmit` maps a `NotificationSubmit` to a `Notification` entity and calls `PersistAsync`, which opens a fresh DI scope, resolves `INotificationOutboxRepository`, and calls `InsertIfNotExistsAsync`. The boolean result is intentionally ignored — an existing row is treated identically to a fresh insert. The async result is piped back to `Self` as `InternalMessages.IngestPersisted`, which carries the original `Sender` reference so the ack is sent from the actor thread:
|
||||
|
||||
```csharp
|
||||
private void HandleSubmit(NotificationSubmit msg)
|
||||
{
|
||||
var sender = Sender;
|
||||
var notification = BuildNotification(msg);
|
||||
|
||||
PersistAsync(notification).PipeTo(
|
||||
Self,
|
||||
success: () => new InternalMessages.IngestPersisted(
|
||||
msg.NotificationId, sender, Succeeded: true, Error: null),
|
||||
failure: ex => new InternalMessages.IngestPersisted(
|
||||
msg.NotificationId, sender, Succeeded: false, Error: ex.GetBaseException().Message));
|
||||
}
|
||||
```
|
||||
|
||||
`NotificationSubmitAck(Accepted: true)` is returned for both a fresh insert and an existing row. Only a thrown repository error yields `Accepted: false`, causing the site to retain and retry its S&F message.
|
||||
|
||||
### Dispatch loop
|
||||
|
||||
On each `DispatchTick` the actor checks a boolean in-flight guard (`_dispatching`). If a sweep is already running the tick is silently dropped — sweeps never overlap. Otherwise the guard is raised and `RunDispatchPass` is launched:
|
||||
|
||||
1. A scoped `INotificationOutboxRepository` fetches `Pending` rows and `Retrying` rows whose `NextAttemptAt ≤ now`, ordered by `CreatedAt` ascending, capped at `DispatchBatchSize`.
|
||||
2. The retry policy (`maxRetries`, `retryDelay`) is resolved by reading `SmtpConfiguration` from `INotificationRepository`. Non-positive values are clamped to `FallbackMaxRetries = 10` and `FallbackRetryDelay = 1 min` with a warning log so a misconfigured SMTP row does not silently park notifications without retrying.
|
||||
3. Each notification in the batch is delivered sequentially via `DeliverOneAsync`. Per-notification exceptions are caught and logged so one bad row never aborts the rest of the batch.
|
||||
|
||||
`DispatchComplete` (singleton instance) is piped back to `Self` on both the success and failure projections, ensuring the in-flight guard is always cleared even if the sweep faults unexpectedly.
|
||||
|
||||
### Delivery adapters
|
||||
|
||||
`INotificationDeliveryAdapter` is the per-channel delivery seam:
|
||||
|
||||
```csharp
|
||||
public interface INotificationDeliveryAdapter
|
||||
{
|
||||
NotificationType Type { get; }
|
||||
Task<DeliveryOutcome> DeliverAsync(
|
||||
Notification notification, CancellationToken cancellationToken = default);
|
||||
}
|
||||
```
|
||||
|
||||
`DeliveryOutcome` carries a `DeliveryResult` enum (`Success`, `TransientFailure`, `PermanentFailure`), resolved recipients on success, and an error string on failure. The adapter map (`NotificationType → INotificationDeliveryAdapter`) is built lazily on the first dispatch sweep and cached in `_adaptersCache` for the actor's lifetime, paired with an actor-lifetime `IServiceScope` (`_adaptersScope`) disposed in `PostStop`. This avoids rebuilding the dictionary on every sweep while respecting each adapter's scoped DI dependencies.
|
||||
|
||||
`EmailNotificationDeliveryAdapter` is the only registered adapter. It resolves the target list and recipients from `INotificationRepository` at delivery time (not at ingest), connects to SMTP via `ISmtpClientWrapper`, acquires an OAuth2 token if configured, and sends a BCC plain-text email. Error classification mirrors the External System Gateway pattern:
|
||||
|
||||
| Exception | `DeliveryResult` |
|
||||
|---|---|
|
||||
| `SmtpPermanentException` (SMTP 5xx) | `PermanentFailure` |
|
||||
| Connection/protocol/timeout errors | `TransientFailure` |
|
||||
| `OperationCanceledException` (shutdown) | propagated, not classified |
|
||||
| Missing list, empty recipient list, no SMTP config, invalid TLS mode | `PermanentFailure` |
|
||||
| Unclassified (e.g. OAuth2 token fetch failure) | `PermanentFailure` |
|
||||
|
||||
### Status transitions in `DeliverOneAsync`
|
||||
|
||||
```csharp
|
||||
switch (outcome.Result)
|
||||
{
|
||||
case DeliveryResult.Success:
|
||||
notification.Status = NotificationStatus.Delivered;
|
||||
notification.DeliveredAt = now;
|
||||
notification.ResolvedTargets = outcome.ResolvedTargets;
|
||||
break;
|
||||
|
||||
case DeliveryResult.TransientFailure:
|
||||
notification.RetryCount++;
|
||||
notification.LastError = outcome.Error;
|
||||
if (notification.RetryCount >= maxRetries)
|
||||
notification.Status = NotificationStatus.Parked;
|
||||
else
|
||||
{
|
||||
notification.Status = NotificationStatus.Retrying;
|
||||
notification.NextAttemptAt = now + retryDelay;
|
||||
}
|
||||
break;
|
||||
|
||||
case DeliveryResult.PermanentFailure:
|
||||
notification.Status = NotificationStatus.Parked;
|
||||
notification.LastError = outcome.Error;
|
||||
break;
|
||||
}
|
||||
await outboxRepository.UpdateAsync(notification, cancellationToken);
|
||||
```
|
||||
|
||||
Every attempt also writes audit rows via `ICentralAuditWriter` (see Audit Integration below). Audit-write failure is caught, logged, and never propagates back into the dispatcher — the delivery outcome on the `Notifications` row stands regardless.
|
||||
|
||||
### Audit integration
|
||||
|
||||
Each delivery attempt emits two `AuditChannel.Notification` / `AuditKind.NotifyDeliver` rows via `ICentralAuditWriter`:
|
||||
|
||||
- An `AuditStatus.Attempted` row (always, per attempt), carrying attempt duration in milliseconds.
|
||||
- A terminal row (`Delivered`, `Parked`, or `Discarded`) when the post-outcome status is terminal.
|
||||
|
||||
`CorrelationId` on both rows is parsed from the `NotificationId` GUID. `ExecutionId` and `ParentExecutionId` are echoed from `Notification.OriginExecutionId` / `Notification.OriginParentExecutionId`, linking the central `NotifyDeliver` rows to the site-emitted `NotifySend` row for the same script run. The `Actor` field is `"system"` — there is no authenticated user at dispatch time.
|
||||
|
||||
Manual discard via `HandleDiscard` also emits a terminal `Discarded` row (with a null error, because the discard is operator-driven, not a delivery failure).
|
||||
|
||||
### Purge
|
||||
|
||||
`HandlePurgeTick` fires daily at `PurgeInterval`. `RunPurgePass` opens a scoped `INotificationOutboxRepository` and calls `DeleteTerminalOlderThanAsync(cutoff)`, where `cutoff = now − TerminalRetention` (default 365 days). The deleted count is logged at `Information`. Purge faults are caught internally so the returned task never faults.
|
||||
|
||||
## Usage
|
||||
|
||||
The outbox is consumed through two DI seams.
|
||||
|
||||
**Ingest** — the Host registers `NotificationOutboxActor` as an Akka cluster singleton and with the `ClusterClientReceptionist`. Site clusters send `NotificationSubmit` messages through Central–Site Communication; the actor ingests them without further configuration by the caller.
|
||||
|
||||
**Operator actions / UI queries** — the Central UI's Notification Outbox page and the ManagementActor resolve the singleton `IActorRef` and send query or command messages:
|
||||
|
||||
| Message | Actor response | Allowed when |
|
||||
|---|---|---|
|
||||
| `NotificationOutboxQueryRequest` | `NotificationOutboxQueryResponse` | Any time |
|
||||
| `NotificationStatusQuery` | `NotificationStatusResponse` | Any time |
|
||||
| `NotificationDetailRequest` | `NotificationDetailResponse` | Any time |
|
||||
| `RetryNotificationRequest` | `RetryNotificationResponse` | Row is `Parked` |
|
||||
| `DiscardNotificationRequest` | `DiscardNotificationResponse` | Row is `Parked` |
|
||||
| `NotificationKpiRequest` | `NotificationKpiResponse` | Any time |
|
||||
| `PerSiteNotificationKpiRequest` | `PerSiteNotificationKpiResponse` | Any time |
|
||||
|
||||
Retry resets the notification to `Pending` with `RetryCount = 0`, `NextAttemptAt = null`, and `LastError = null` so the dispatch loop reclaims it on the next sweep. Discard transitions to terminal `Discarded` and emits the corresponding audit row.
|
||||
|
||||
## Configuration
|
||||
|
||||
Options are bound from `ScadaBridge:NotificationOutbox` via `NotificationOutboxOptions`:
|
||||
|
||||
| Key | Default | Description |
|
||||
|---|---|---|
|
||||
| `DispatchInterval` | `00:00:10` (10 s) | How often the dispatcher polls for due rows. |
|
||||
| `DispatchBatchSize` | `100` | Maximum notifications claimed per sweep. |
|
||||
| `StuckAgeThreshold` | `00:10:00` (10 min) | Age beyond which a non-terminal row is counted as stuck in KPIs and the UI badge. Display-only; does not affect dispatcher behaviour. |
|
||||
| `TerminalRetention` | `365` days | How long terminal rows are kept before the daily purge removes them. |
|
||||
| `PurgeInterval` | `1` day | Cadence of the background purge sweep. |
|
||||
| `DeliveredKpiWindow` | `00:01:00` (1 min) | Trailing window for the "delivered last interval" throughput KPI. |
|
||||
|
||||
Delivery retry policy (`MaxRetries`, `RetryDelay`) is read at runtime from `SmtpConfiguration` via `INotificationRepository`, not from `NotificationOutboxOptions`. Non-positive values are clamped to `FallbackMaxRetries = 10` and `FallbackRetryDelay = 1 min` with a `Warning` log.
|
||||
|
||||
## Dependencies & Interactions
|
||||
|
||||
- [Commons (#16)](./Commons.md) — owns `Notification`, `NotificationStatus`, `NotificationType`, `INotificationOutboxRepository`, `INotificationRepository`, and all message contracts (`NotificationSubmit`, `NotificationSubmitAck`, `NotificationStatusQuery`, `NotificationKpiRequest`, and their responses). Also owns `ScadaBridgeAuditEventFactory` and the `AuditChannel`/`AuditKind`/`AuditStatus` enums used to build dispatch audit rows.
|
||||
- [Configuration Database (#17)](./ConfigurationDatabase.md) — registers the scoped `INotificationOutboxRepository` (the central `dbo.Notifications` table) and `INotificationRepository` (notification-list, recipient, and SMTP configuration tables). Central hosts must call `AddConfigurationDatabase` before `AddNotificationOutbox`.
|
||||
- [Notification Service (#8)](./StoreAndForward.md) — supplies `ISmtpClientWrapper`, `OAuth2TokenService`, `NotificationOptions`, `SmtpTlsModeParser`, `SmtpErrorClassifier`, and the `SmtpPermanentException` type. `AddNotificationOutbox` relies on `AddNotificationService` being called by the Host to register these shared SMTP primitives; registering them twice would duplicate them.
|
||||
- [Central–Site Communication (#5)](./Communication.md) — carries `NotificationSubmit` / `NotificationSubmitAck` between sites and central via ClusterClient, and `NotificationStatusQuery` / `NotificationStatusResponse` for the `Notify.Status` round-trip.
|
||||
- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — the site-side component that durably buffers notifications in SQLite and retries forwarding until central acks. The outbox is the receiving end of the S&F handoff.
|
||||
- [Audit Log (#23)](./AuditLog.md) — the outbox is a central direct-write caller of `ICentralAuditWriter`. It emits `NotifyDeliver` rows (Attempted + terminal) per delivery attempt and per operator Discard. The upstream `NotifySend` row is emitted by the site and arrives at central via standard audit telemetry.
|
||||
- [Health Monitoring (#11)](./Host.md) — polls `NotificationKpiRequest` / `PerSiteNotificationKpiRequest` for the headline KPI tiles on the health dashboard (queue depth, stuck count, parked count). These are central-computed from the `Notifications` table and are separate from the site S&F backlog metric.
|
||||
- [Central UI (#9)](./Host.md) — hosts the Notification Outbox page: KPI tiles, a queryable/filterable notification list, per-row Retry/Discard actions on parked notifications, and a stuck-row badge.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Notifications stuck in `Pending`
|
||||
|
||||
A notification stays `Pending` when the dispatch loop is not running or is failing silently. Check for `"Dispatch sweep failed"` at `Error` level in the central node logs. The most common cause is a missing or misconfigured `SmtpConfiguration` row, which the adapter surfaces as a `PermanentFailure` and parks the notification immediately. A `Warning` log line naming `SmtpConfiguration.MaxRetries` or `SmtpConfiguration.RetryDelay` being non-positive indicates the retry policy was clamped — correct the SMTP configuration row.
|
||||
|
||||
### Notifications parked with "no delivery adapter for type"
|
||||
|
||||
The actor parks a notification immediately and logs this message when `NotificationType` has a value for which no `INotificationDeliveryAdapter` is registered. Currently only `Email` has an adapter; future channel types must be registered in `AddNotificationOutbox` before notifications of that type are submitted.
|
||||
|
||||
### Dispatch loop wedged (guard stuck `true`)
|
||||
|
||||
The boolean `_dispatching` guard is cleared by `DispatchComplete`, which is piped even on a faulted sweep. If the actor itself stops and restarts, `PreStart` reinitialises the timers and the guard resets. A wedged guard without a restart indicates the `PipeTo` callback is never completing — examine logs around `"Dispatch sweep faulted unexpectedly"`.
|
||||
|
||||
### SMTP credentials appearing in logs
|
||||
|
||||
`EmailNotificationDeliveryAdapter` runs `CredentialRedactor.Scrub` on all exception messages before logging. If credential strings appear in logs the SMTP exception message is not being routed through the `catch` filters in `DeliverAsync` — ensure the exception type is reachable by one of the three `catch` blocks and not escaping before scrubbing.
|
||||
|
||||
### Failover mid-delivery
|
||||
|
||||
A central failover while a delivery attempt is in flight leaves the row in its pre-attempt status (`Pending` or `Retrying`). The new active node picks it up on the next dispatch tick. One notification may be re-sent to SMTP as a result — this is an accepted trade-off, consistent with the at-least-once guarantee the S&F Engine already provides.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Notification Outbox design specification](../requirements/Component-NotificationOutbox.md)
|
||||
- [Audit Log](./AuditLog.md)
|
||||
- [Commons](./Commons.md)
|
||||
- [Configuration Database](./ConfigurationDatabase.md)
|
||||
- [Central–Site Communication](./Communication.md)
|
||||
- [Store-and-Forward Engine](./StoreAndForward.md)
|
||||
- [Health Monitoring](./Host.md)
|
||||
Reference in New Issue
Block a user