NotificationService (Notify.Send returns string not NotificationId; MaxConcurrentConnections unenforced; AddHttpClient), NotificationOutbox (one Attempted row always, terminal row only on terminal status), SiteCallAudit (direct dual-write, no Tell; KPI tiles consumed by CentralUI), HealthMonitoring (CentralOfflineTimeout 180s = 6x ReportInterval; HealthReportSender gates on IsActiveNode), SiteEventLogging (active-node purge seam not wired; runs on both nodes), InboundAPI (whole System.Diagnostics namespace forbidden).
20 KiB
Notification Outbox
The Notification Outbox is the central component that receives store-and-forwarded notifications from site clusters, persists each one to the Notifications table in the central MS SQL database, and delivers them through per-type delivery adapters. It is the first outbox component to run centrally — the Store-and-Forward Engine remains site-only.
Overview
Notification Outbox (#21) runs exclusively on the central cluster. Sites no longer send notifications directly via SMTP: a script's Notify.Send call generates a NotificationId (GUID) locally, the notification is stored in the site S&F buffer, forwarded to central via Central–Site Communication, and the central outbox owns all dispatch and delivery from that point on.
The component code lives in src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/, with a flat layout:
- Root —
NotificationOutboxActor,NotificationOutboxOptions,ServiceCollectionExtensions. Delivery/—INotificationDeliveryAdapter,DeliveryOutcome/DeliveryResult,EmailNotificationDeliveryAdapter.Messages/—InternalMessages(actor-internal timer and pipe messages, never sent over the network).
Shared types used throughout — Notification, NotificationStatus, NotificationType, INotificationOutboxRepository, and all ClusterClient message contracts — live in src/ZB.MOM.WW.ScadaBridge.Commons/.
The DI entry point is ServiceCollectionExtensions.AddNotificationOutbox, called by the Host on central nodes. It binds NotificationOutboxOptions and registers the typed delivery adapters. The Host separately registers NotificationOutboxActor as a cluster singleton and wires the ClusterClient receptionist so inbound NotificationSubmit messages reach the actor regardless of which central node is active.
Key Concepts
Notifications table as source of truth
The central MS SQL Notifications table is the single source of audit truth for every notification in the system. One row per NotificationId (GUID primary key), regardless of delivery channel. The table is type-agnostic: the Type discriminator (Email; future channels add new enum members) selects the delivery adapter while the rest of the schema is shared. The Notifications table answers both operational queries (current status, retry count, next attempt) and KPI queries (queue depth, stuck count, throughput); no separate time-series store is needed.
Status lifecycle
| Status | Where it lives | Meaning |
|---|---|---|
Forwarding |
Site-local only | Notification is in the site S&F buffer; never stored in Notifications. |
Pending |
Central | Ingested; awaiting first dispatch sweep. |
Retrying |
Central | Transient failure; NextAttemptAt schedules the next attempt. |
Delivered |
Central, terminal | Successfully sent; DeliveredAt and ResolvedTargets are set. |
Parked |
Central, terminal | Permanent failure or retries exhausted; LastError records why. |
Discarded |
Central, terminal | Operator-cancelled a Parked notification; row is kept for the audit record. |
Forwarding is answered site-locally by Notify.Status(id); once the ack arrives, queries round-trip to the Notifications table on central.
At-least-once site→central handoff
Central ingests a NotificationSubmit with insert-if-not-exists on NotificationId, then acks the site with NotificationSubmitAck. The site S&F engine clears the message only after receiving that ack. A lost ack causes the site to resend; the GUID idempotency key makes the resend a no-op. Because the ack is sent only after the row is persisted, no notification is lost to a race between the ack and a central failover — the row already exists.
No Akka-level replication
All outbox state lives in MS SQL, which is already the central HA store. There is no Akka replication of the actor's in-memory state. On a central failover the new active node's singleton starts a fresh dispatch sweep; Pending and due Retrying rows are reclaimed from the table on the next tick.
Architecture
NotificationOutboxActor
NotificationOutboxActor is a ReceiveActor that implements IWithTimers. It runs as a cluster singleton on the active central node. The actor is responsible for both the ingest path (accepting NotificationSubmit messages) and the dispatch path (running the periodic delivery loop). All async work is executed via PipeTo(Self) so every result lands on the actor's mailbox thread, preserving single-threaded actor semantics:
public class NotificationOutboxActor : ReceiveActor, IWithTimers
{
public NotificationOutboxActor(
IServiceProvider serviceProvider,
NotificationOutboxOptions options,
ICentralAuditWriter auditWriter,
ILogger<NotificationOutboxActor> logger)
{
Receive<NotificationSubmit>(HandleSubmit);
Receive<InternalMessages.IngestPersisted>(HandleIngestPersisted);
Receive<InternalMessages.DispatchTick>(_ => HandleDispatchTick());
Receive<InternalMessages.DispatchComplete>(_ => _dispatching = false);
Receive<InternalMessages.PurgeTick>(_ => HandlePurgeTick());
Receive<InternalMessages.PurgeComplete>(_ => { });
Receive<NotificationOutboxQueryRequest>(HandleQuery);
Receive<NotificationStatusQuery>(HandleStatusQuery);
Receive<NotificationDetailRequest>(HandleDetailRequest);
Receive<RetryNotificationRequest>(HandleRetry);
Receive<DiscardNotificationRequest>(HandleDiscard);
Receive<NotificationKpiRequest>(HandleKpiRequest);
Receive<PerSiteNotificationKpiRequest>(HandlePerSiteKpiRequest);
}
}
PreStart starts two periodic Akka timers: DispatchTick at DispatchInterval and PurgeTick at PurgeInterval. A lifecycle-scoped CancellationTokenSource (_shutdownCts) is created in PreStart and cancelled in PostStop so any in-flight SMTP send observes coordinated shutdown instead of blocking for a full connect/auth/send timeout.
Ingest path
HandleSubmit maps a NotificationSubmit to a Notification entity and calls PersistAsync, which opens a fresh DI scope, resolves INotificationOutboxRepository, and calls InsertIfNotExistsAsync. The boolean result is intentionally ignored — an existing row is treated identically to a fresh insert. The async result is piped back to Self as InternalMessages.IngestPersisted, which carries the original Sender reference so the ack is sent from the actor thread:
private void HandleSubmit(NotificationSubmit msg)
{
var sender = Sender;
var notification = BuildNotification(msg);
PersistAsync(notification).PipeTo(
Self,
success: () => new InternalMessages.IngestPersisted(
msg.NotificationId, sender, Succeeded: true, Error: null),
failure: ex => new InternalMessages.IngestPersisted(
msg.NotificationId, sender, Succeeded: false, Error: ex.GetBaseException().Message));
}
NotificationSubmitAck(Accepted: true) is returned for both a fresh insert and an existing row. Only a thrown repository error yields Accepted: false, causing the site to retain and retry its S&F message.
Dispatch loop
On each DispatchTick the actor checks a boolean in-flight guard (_dispatching). If a sweep is already running the tick is silently dropped — sweeps never overlap. Otherwise the guard is raised and RunDispatchPass is launched:
- A scoped
INotificationOutboxRepositoryfetchesPendingrows andRetryingrows whoseNextAttemptAt ≤ now, ordered byCreatedAtascending, capped atDispatchBatchSize. - The retry policy (
maxRetries,retryDelay) is resolved by readingSmtpConfigurationfromINotificationRepository. Non-positive values are clamped toFallbackMaxRetries = 10andFallbackRetryDelay = 1 minwith a warning log so a misconfigured SMTP row does not silently park notifications without retrying. - Each notification in the batch is delivered sequentially via
DeliverOneAsync. Per-notification exceptions are caught and logged so one bad row never aborts the rest of the batch.
DispatchComplete (singleton instance) is piped back to Self on both the success and failure projections, ensuring the in-flight guard is always cleared even if the sweep faults unexpectedly.
Delivery adapters
INotificationDeliveryAdapter is the per-channel delivery seam:
public interface INotificationDeliveryAdapter
{
NotificationType Type { get; }
Task<DeliveryOutcome> DeliverAsync(
Notification notification, CancellationToken cancellationToken = default);
}
DeliveryOutcome carries a DeliveryResult enum (Success, TransientFailure, PermanentFailure), resolved recipients on success, and an error string on failure. The adapter map (NotificationType → INotificationDeliveryAdapter) is built lazily on the first dispatch sweep and cached in _adaptersCache for the actor's lifetime, paired with an actor-lifetime IServiceScope (_adaptersScope) disposed in PostStop. This avoids rebuilding the dictionary on every sweep while respecting each adapter's scoped DI dependencies.
EmailNotificationDeliveryAdapter is the only registered adapter. It resolves the target list and recipients from INotificationRepository at delivery time (not at ingest), connects to SMTP via ISmtpClientWrapper, acquires an OAuth2 token if configured, and sends a BCC plain-text email. Error classification mirrors the External System Gateway pattern:
| Exception | DeliveryResult |
|---|---|
SmtpPermanentException (SMTP 5xx) |
PermanentFailure |
| Connection/protocol/timeout errors | TransientFailure |
OperationCanceledException (shutdown) |
propagated, not classified |
| Missing list, empty recipient list, no SMTP config, invalid TLS mode | PermanentFailure |
| Unclassified (e.g. OAuth2 token fetch failure) | PermanentFailure |
Status transitions in DeliverOneAsync
switch (outcome.Result)
{
case DeliveryResult.Success:
notification.Status = NotificationStatus.Delivered;
notification.DeliveredAt = now;
notification.ResolvedTargets = outcome.ResolvedTargets;
break;
case DeliveryResult.TransientFailure:
notification.RetryCount++;
notification.LastError = outcome.Error;
if (notification.RetryCount >= maxRetries)
notification.Status = NotificationStatus.Parked;
else
{
notification.Status = NotificationStatus.Retrying;
notification.NextAttemptAt = now + retryDelay;
}
break;
case DeliveryResult.PermanentFailure:
notification.Status = NotificationStatus.Parked;
notification.LastError = outcome.Error;
break;
}
await outboxRepository.UpdateAsync(notification, cancellationToken);
Every attempt also writes audit rows via ICentralAuditWriter (see Audit Integration below). Audit-write failure is caught, logged, and never propagates back into the dispatcher — the delivery outcome on the Notifications row stands regardless.
Audit integration
Each delivery attempt emits at least one AuditChannel.Notification / AuditKind.NotifyDeliver row via ICentralAuditWriter:
- An
AuditStatus.Attemptedrow (always, per attempt), carrying attempt duration in milliseconds. - A second, terminal row (
Delivered,Parked, orDiscarded) only when the post-outcome status is terminal — a transient failure that transitions the notification toRetryingemits only theAttemptedrow.
CorrelationId on the emitted row(s) is parsed from the NotificationId GUID. ExecutionId and ParentExecutionId are echoed from Notification.OriginExecutionId / Notification.OriginParentExecutionId, linking the central NotifyDeliver rows to the site-emitted NotifySend row for the same script run. The Actor field is "system" — there is no authenticated user at dispatch time.
Manual discard via HandleDiscard also emits a terminal Discarded row (with a null error, because the discard is operator-driven, not a delivery failure).
Purge
HandlePurgeTick fires daily at PurgeInterval. RunPurgePass opens a scoped INotificationOutboxRepository and calls DeleteTerminalOlderThanAsync(cutoff), where cutoff = now − TerminalRetention (default 365 days). The deleted count is logged at Information. Purge faults are caught internally so the returned task never faults.
Usage
The outbox is consumed through two DI seams.
Ingest — the Host registers NotificationOutboxActor as an Akka cluster singleton and with the ClusterClientReceptionist. Site clusters send NotificationSubmit messages through Central–Site Communication; the actor ingests them without further configuration by the caller.
Operator actions / UI queries — the Central UI's Notification Outbox page and the ManagementActor resolve the singleton IActorRef and send query or command messages:
| Message | Actor response | Allowed when |
|---|---|---|
NotificationOutboxQueryRequest |
NotificationOutboxQueryResponse |
Any time |
NotificationStatusQuery |
NotificationStatusResponse |
Any time |
NotificationDetailRequest |
NotificationDetailResponse |
Any time |
RetryNotificationRequest |
RetryNotificationResponse |
Row is Parked |
DiscardNotificationRequest |
DiscardNotificationResponse |
Row is Parked |
NotificationKpiRequest |
NotificationKpiResponse |
Any time |
PerSiteNotificationKpiRequest |
PerSiteNotificationKpiResponse |
Any time |
Retry resets the notification to Pending with RetryCount = 0, NextAttemptAt = null, and LastError = null so the dispatch loop reclaims it on the next sweep. Discard transitions to terminal Discarded and emits the corresponding audit row.
Configuration
Options are bound from ScadaBridge:NotificationOutbox via NotificationOutboxOptions:
| Key | Default | Description |
|---|---|---|
DispatchInterval |
00:00:10 (10 s) |
How often the dispatcher polls for due rows. |
DispatchBatchSize |
100 |
Maximum notifications claimed per sweep. |
StuckAgeThreshold |
00:10:00 (10 min) |
Age beyond which a non-terminal row is counted as stuck in KPIs and the UI badge. Display-only; does not affect dispatcher behaviour. |
TerminalRetention |
365 days |
How long terminal rows are kept before the daily purge removes them. |
PurgeInterval |
1 day |
Cadence of the background purge sweep. |
DeliveredKpiWindow |
00:01:00 (1 min) |
Trailing window for the "delivered last interval" throughput KPI. |
Delivery retry policy (MaxRetries, RetryDelay) is read at runtime from SmtpConfiguration via INotificationRepository, not from NotificationOutboxOptions. Non-positive values are clamped to FallbackMaxRetries = 10 and FallbackRetryDelay = 1 min with a Warning log.
Dependencies & Interactions
- Commons (#16) — owns
Notification,NotificationStatus,NotificationType,INotificationOutboxRepository,INotificationRepository, and all message contracts (NotificationSubmit,NotificationSubmitAck,NotificationStatusQuery,NotificationKpiRequest, and their responses). Also ownsScadaBridgeAuditEventFactoryand theAuditChannel/AuditKind/AuditStatusenums used to build dispatch audit rows. - Configuration Database (#17) — registers the scoped
INotificationOutboxRepository(the centraldbo.Notificationstable) andINotificationRepository(notification-list, recipient, and SMTP configuration tables). Central hosts must callAddConfigurationDatabasebeforeAddNotificationOutbox. - Notification Service (#8) — supplies
ISmtpClientWrapper,OAuth2TokenService,NotificationOptions,SmtpTlsModeParser,SmtpErrorClassifier, and theSmtpPermanentExceptiontype.AddNotificationOutboxrelies onAddNotificationServicebeing called by the Host to register these shared SMTP primitives; registering them twice would duplicate them. - Central–Site Communication (#5) — carries
NotificationSubmit/NotificationSubmitAckbetween sites and central via ClusterClient, andNotificationStatusQuery/NotificationStatusResponsefor theNotify.Statusround-trip. - Store-and-Forward Engine (#6) — the site-side component that durably buffers notifications in SQLite and retries forwarding until central acks. The outbox is the receiving end of the S&F handoff.
- Audit Log (#23) — the outbox is a central direct-write caller of
ICentralAuditWriter. It emits anAttemptedNotifyDeliverrow per delivery attempt, plus a terminal row only when the attempt drives the notification to a terminal status (Delivered/Parked/Discarded); it also emits a terminal row per operator Discard. The upstreamNotifySendrow is emitted by the site and arrives at central via standard audit telemetry. - Health Monitoring (#11) — polls
NotificationKpiRequest/PerSiteNotificationKpiRequestfor the headline KPI tiles on the health dashboard (queue depth, stuck count, parked count). These are central-computed from theNotificationstable and are separate from the site S&F backlog metric. - Central UI (#9) — hosts the Notification Outbox page: KPI tiles, a queryable/filterable notification list, per-row Retry/Discard actions on parked notifications, and a stuck-row badge.
Troubleshooting
Notifications stuck in Pending
A notification stays Pending when the dispatch loop is not running or is failing silently. Check for "Dispatch sweep failed" at Error level in the central node logs. The most common cause is a missing or misconfigured SmtpConfiguration row, which the adapter surfaces as a PermanentFailure and parks the notification immediately. A Warning log line naming SmtpConfiguration.MaxRetries or SmtpConfiguration.RetryDelay being non-positive indicates the retry policy was clamped — correct the SMTP configuration row.
Notifications parked with "no delivery adapter for type"
The actor parks a notification immediately and logs this message when NotificationType has a value for which no INotificationDeliveryAdapter is registered. Currently only Email has an adapter; future channel types must be registered in AddNotificationOutbox before notifications of that type are submitted.
Dispatch loop wedged (guard stuck true)
The boolean _dispatching guard is cleared by DispatchComplete, which is piped even on a faulted sweep. If the actor itself stops and restarts, PreStart reinitialises the timers and the guard resets. A wedged guard without a restart indicates the PipeTo callback is never completing — examine logs around "Dispatch sweep faulted unexpectedly".
SMTP credentials appearing in logs
EmailNotificationDeliveryAdapter runs CredentialRedactor.Scrub on all exception messages before logging. If credential strings appear in logs the SMTP exception message is not being routed through the catch filters in DeliverAsync — ensure the exception type is reachable by one of the three catch blocks and not escaping before scrubbing.
Failover mid-delivery
A central failover while a delivery attempt is in flight leaves the row in its pre-attempt status (Pending or Retrying). The new active node picks it up on the next dispatch tick. One notification may be re-sent to SMTP as a result — this is an accepted trade-off, consistent with the at-least-once guarantee the S&F Engine already provides.