Files
Joseph Doherty 9175b0c013 docs(components): accuracy fixes from deep review (batch 3)
NotificationService (Notify.Send returns string not NotificationId;
MaxConcurrentConnections unenforced; AddHttpClient), NotificationOutbox
(one Attempted row always, terminal row only on terminal status), SiteCallAudit
(direct dual-write, no Tell; KPI tiles consumed by CentralUI), HealthMonitoring
(CentralOfflineTimeout 180s = 6x ReportInterval; HealthReportSender gates on
IsActiveNode), SiteEventLogging (active-node purge seam not wired; runs on both
nodes), InboundAPI (whole System.Diagnostics namespace forbidden).
2026-06-03 16:37:15 -04:00

20 KiB
Raw Permalink Blame History

Notification Outbox

The Notification Outbox is the central component that receives store-and-forwarded notifications from site clusters, persists each one to the Notifications table in the central MS SQL database, and delivers them through per-type delivery adapters. It is the first outbox component to run centrally — the Store-and-Forward Engine remains site-only.

Overview

Notification Outbox (#21) runs exclusively on the central cluster. Sites no longer send notifications directly via SMTP: a script's Notify.Send call generates a NotificationId (GUID) locally, the notification is stored in the site S&F buffer, forwarded to central via CentralSite Communication, and the central outbox owns all dispatch and delivery from that point on.

The component code lives in src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/, with a flat layout:

  • Root — NotificationOutboxActor, NotificationOutboxOptions, ServiceCollectionExtensions.
  • Delivery/INotificationDeliveryAdapter, DeliveryOutcome/DeliveryResult, EmailNotificationDeliveryAdapter.
  • Messages/InternalMessages (actor-internal timer and pipe messages, never sent over the network).

Shared types used throughout — Notification, NotificationStatus, NotificationType, INotificationOutboxRepository, and all ClusterClient message contracts — live in src/ZB.MOM.WW.ScadaBridge.Commons/.

The DI entry point is ServiceCollectionExtensions.AddNotificationOutbox, called by the Host on central nodes. It binds NotificationOutboxOptions and registers the typed delivery adapters. The Host separately registers NotificationOutboxActor as a cluster singleton and wires the ClusterClient receptionist so inbound NotificationSubmit messages reach the actor regardless of which central node is active.

Key Concepts

Notifications table as source of truth

The central MS SQL Notifications table is the single source of audit truth for every notification in the system. One row per NotificationId (GUID primary key), regardless of delivery channel. The table is type-agnostic: the Type discriminator (Email; future channels add new enum members) selects the delivery adapter while the rest of the schema is shared. The Notifications table answers both operational queries (current status, retry count, next attempt) and KPI queries (queue depth, stuck count, throughput); no separate time-series store is needed.

Status lifecycle

Status Where it lives Meaning
Forwarding Site-local only Notification is in the site S&F buffer; never stored in Notifications.
Pending Central Ingested; awaiting first dispatch sweep.
Retrying Central Transient failure; NextAttemptAt schedules the next attempt.
Delivered Central, terminal Successfully sent; DeliveredAt and ResolvedTargets are set.
Parked Central, terminal Permanent failure or retries exhausted; LastError records why.
Discarded Central, terminal Operator-cancelled a Parked notification; row is kept for the audit record.

Forwarding is answered site-locally by Notify.Status(id); once the ack arrives, queries round-trip to the Notifications table on central.

At-least-once site→central handoff

Central ingests a NotificationSubmit with insert-if-not-exists on NotificationId, then acks the site with NotificationSubmitAck. The site S&F engine clears the message only after receiving that ack. A lost ack causes the site to resend; the GUID idempotency key makes the resend a no-op. Because the ack is sent only after the row is persisted, no notification is lost to a race between the ack and a central failover — the row already exists.

No Akka-level replication

All outbox state lives in MS SQL, which is already the central HA store. There is no Akka replication of the actor's in-memory state. On a central failover the new active node's singleton starts a fresh dispatch sweep; Pending and due Retrying rows are reclaimed from the table on the next tick.

Architecture

NotificationOutboxActor

NotificationOutboxActor is a ReceiveActor that implements IWithTimers. It runs as a cluster singleton on the active central node. The actor is responsible for both the ingest path (accepting NotificationSubmit messages) and the dispatch path (running the periodic delivery loop). All async work is executed via PipeTo(Self) so every result lands on the actor's mailbox thread, preserving single-threaded actor semantics:

public class NotificationOutboxActor : ReceiveActor, IWithTimers
{
    public NotificationOutboxActor(
        IServiceProvider serviceProvider,
        NotificationOutboxOptions options,
        ICentralAuditWriter auditWriter,
        ILogger<NotificationOutboxActor> logger)
    {
        Receive<NotificationSubmit>(HandleSubmit);
        Receive<InternalMessages.IngestPersisted>(HandleIngestPersisted);
        Receive<InternalMessages.DispatchTick>(_ => HandleDispatchTick());
        Receive<InternalMessages.DispatchComplete>(_ => _dispatching = false);
        Receive<InternalMessages.PurgeTick>(_ => HandlePurgeTick());
        Receive<InternalMessages.PurgeComplete>(_ => { });
        Receive<NotificationOutboxQueryRequest>(HandleQuery);
        Receive<NotificationStatusQuery>(HandleStatusQuery);
        Receive<NotificationDetailRequest>(HandleDetailRequest);
        Receive<RetryNotificationRequest>(HandleRetry);
        Receive<DiscardNotificationRequest>(HandleDiscard);
        Receive<NotificationKpiRequest>(HandleKpiRequest);
        Receive<PerSiteNotificationKpiRequest>(HandlePerSiteKpiRequest);
    }
}

PreStart starts two periodic Akka timers: DispatchTick at DispatchInterval and PurgeTick at PurgeInterval. A lifecycle-scoped CancellationTokenSource (_shutdownCts) is created in PreStart and cancelled in PostStop so any in-flight SMTP send observes coordinated shutdown instead of blocking for a full connect/auth/send timeout.

Ingest path

HandleSubmit maps a NotificationSubmit to a Notification entity and calls PersistAsync, which opens a fresh DI scope, resolves INotificationOutboxRepository, and calls InsertIfNotExistsAsync. The boolean result is intentionally ignored — an existing row is treated identically to a fresh insert. The async result is piped back to Self as InternalMessages.IngestPersisted, which carries the original Sender reference so the ack is sent from the actor thread:

private void HandleSubmit(NotificationSubmit msg)
{
    var sender = Sender;
    var notification = BuildNotification(msg);

    PersistAsync(notification).PipeTo(
        Self,
        success: () => new InternalMessages.IngestPersisted(
            msg.NotificationId, sender, Succeeded: true, Error: null),
        failure: ex => new InternalMessages.IngestPersisted(
            msg.NotificationId, sender, Succeeded: false, Error: ex.GetBaseException().Message));
}

NotificationSubmitAck(Accepted: true) is returned for both a fresh insert and an existing row. Only a thrown repository error yields Accepted: false, causing the site to retain and retry its S&F message.

Dispatch loop

On each DispatchTick the actor checks a boolean in-flight guard (_dispatching). If a sweep is already running the tick is silently dropped — sweeps never overlap. Otherwise the guard is raised and RunDispatchPass is launched:

  1. A scoped INotificationOutboxRepository fetches Pending rows and Retrying rows whose NextAttemptAt ≤ now, ordered by CreatedAt ascending, capped at DispatchBatchSize.
  2. The retry policy (maxRetries, retryDelay) is resolved by reading SmtpConfiguration from INotificationRepository. Non-positive values are clamped to FallbackMaxRetries = 10 and FallbackRetryDelay = 1 min with a warning log so a misconfigured SMTP row does not silently park notifications without retrying.
  3. Each notification in the batch is delivered sequentially via DeliverOneAsync. Per-notification exceptions are caught and logged so one bad row never aborts the rest of the batch.

DispatchComplete (singleton instance) is piped back to Self on both the success and failure projections, ensuring the in-flight guard is always cleared even if the sweep faults unexpectedly.

Delivery adapters

INotificationDeliveryAdapter is the per-channel delivery seam:

public interface INotificationDeliveryAdapter
{
    NotificationType Type { get; }
    Task<DeliveryOutcome> DeliverAsync(
        Notification notification, CancellationToken cancellationToken = default);
}

DeliveryOutcome carries a DeliveryResult enum (Success, TransientFailure, PermanentFailure), resolved recipients on success, and an error string on failure. The adapter map (NotificationType → INotificationDeliveryAdapter) is built lazily on the first dispatch sweep and cached in _adaptersCache for the actor's lifetime, paired with an actor-lifetime IServiceScope (_adaptersScope) disposed in PostStop. This avoids rebuilding the dictionary on every sweep while respecting each adapter's scoped DI dependencies.

EmailNotificationDeliveryAdapter is the only registered adapter. It resolves the target list and recipients from INotificationRepository at delivery time (not at ingest), connects to SMTP via ISmtpClientWrapper, acquires an OAuth2 token if configured, and sends a BCC plain-text email. Error classification mirrors the External System Gateway pattern:

Exception DeliveryResult
SmtpPermanentException (SMTP 5xx) PermanentFailure
Connection/protocol/timeout errors TransientFailure
OperationCanceledException (shutdown) propagated, not classified
Missing list, empty recipient list, no SMTP config, invalid TLS mode PermanentFailure
Unclassified (e.g. OAuth2 token fetch failure) PermanentFailure

Status transitions in DeliverOneAsync

switch (outcome.Result)
{
    case DeliveryResult.Success:
        notification.Status = NotificationStatus.Delivered;
        notification.DeliveredAt = now;
        notification.ResolvedTargets = outcome.ResolvedTargets;
        break;

    case DeliveryResult.TransientFailure:
        notification.RetryCount++;
        notification.LastError = outcome.Error;
        if (notification.RetryCount >= maxRetries)
            notification.Status = NotificationStatus.Parked;
        else
        {
            notification.Status = NotificationStatus.Retrying;
            notification.NextAttemptAt = now + retryDelay;
        }
        break;

    case DeliveryResult.PermanentFailure:
        notification.Status = NotificationStatus.Parked;
        notification.LastError = outcome.Error;
        break;
}
await outboxRepository.UpdateAsync(notification, cancellationToken);

Every attempt also writes audit rows via ICentralAuditWriter (see Audit Integration below). Audit-write failure is caught, logged, and never propagates back into the dispatcher — the delivery outcome on the Notifications row stands regardless.

Audit integration

Each delivery attempt emits at least one AuditChannel.Notification / AuditKind.NotifyDeliver row via ICentralAuditWriter:

  • An AuditStatus.Attempted row (always, per attempt), carrying attempt duration in milliseconds.
  • A second, terminal row (Delivered, Parked, or Discarded) only when the post-outcome status is terminal — a transient failure that transitions the notification to Retrying emits only the Attempted row.

CorrelationId on the emitted row(s) is parsed from the NotificationId GUID. ExecutionId and ParentExecutionId are echoed from Notification.OriginExecutionId / Notification.OriginParentExecutionId, linking the central NotifyDeliver rows to the site-emitted NotifySend row for the same script run. The Actor field is "system" — there is no authenticated user at dispatch time.

Manual discard via HandleDiscard also emits a terminal Discarded row (with a null error, because the discard is operator-driven, not a delivery failure).

Purge

HandlePurgeTick fires daily at PurgeInterval. RunPurgePass opens a scoped INotificationOutboxRepository and calls DeleteTerminalOlderThanAsync(cutoff), where cutoff = now TerminalRetention (default 365 days). The deleted count is logged at Information. Purge faults are caught internally so the returned task never faults.

Usage

The outbox is consumed through two DI seams.

Ingest — the Host registers NotificationOutboxActor as an Akka cluster singleton and with the ClusterClientReceptionist. Site clusters send NotificationSubmit messages through CentralSite Communication; the actor ingests them without further configuration by the caller.

Operator actions / UI queries — the Central UI's Notification Outbox page and the ManagementActor resolve the singleton IActorRef and send query or command messages:

Message Actor response Allowed when
NotificationOutboxQueryRequest NotificationOutboxQueryResponse Any time
NotificationStatusQuery NotificationStatusResponse Any time
NotificationDetailRequest NotificationDetailResponse Any time
RetryNotificationRequest RetryNotificationResponse Row is Parked
DiscardNotificationRequest DiscardNotificationResponse Row is Parked
NotificationKpiRequest NotificationKpiResponse Any time
PerSiteNotificationKpiRequest PerSiteNotificationKpiResponse Any time

Retry resets the notification to Pending with RetryCount = 0, NextAttemptAt = null, and LastError = null so the dispatch loop reclaims it on the next sweep. Discard transitions to terminal Discarded and emits the corresponding audit row.

Configuration

Options are bound from ScadaBridge:NotificationOutbox via NotificationOutboxOptions:

Key Default Description
DispatchInterval 00:00:10 (10 s) How often the dispatcher polls for due rows.
DispatchBatchSize 100 Maximum notifications claimed per sweep.
StuckAgeThreshold 00:10:00 (10 min) Age beyond which a non-terminal row is counted as stuck in KPIs and the UI badge. Display-only; does not affect dispatcher behaviour.
TerminalRetention 365 days How long terminal rows are kept before the daily purge removes them.
PurgeInterval 1 day Cadence of the background purge sweep.
DeliveredKpiWindow 00:01:00 (1 min) Trailing window for the "delivered last interval" throughput KPI.

Delivery retry policy (MaxRetries, RetryDelay) is read at runtime from SmtpConfiguration via INotificationRepository, not from NotificationOutboxOptions. Non-positive values are clamped to FallbackMaxRetries = 10 and FallbackRetryDelay = 1 min with a Warning log.

Dependencies & Interactions

  • Commons (#16) — owns Notification, NotificationStatus, NotificationType, INotificationOutboxRepository, INotificationRepository, and all message contracts (NotificationSubmit, NotificationSubmitAck, NotificationStatusQuery, NotificationKpiRequest, and their responses). Also owns ScadaBridgeAuditEventFactory and the AuditChannel/AuditKind/AuditStatus enums used to build dispatch audit rows.
  • Configuration Database (#17) — registers the scoped INotificationOutboxRepository (the central dbo.Notifications table) and INotificationRepository (notification-list, recipient, and SMTP configuration tables). Central hosts must call AddConfigurationDatabase before AddNotificationOutbox.
  • Notification Service (#8) — supplies ISmtpClientWrapper, OAuth2TokenService, NotificationOptions, SmtpTlsModeParser, SmtpErrorClassifier, and the SmtpPermanentException type. AddNotificationOutbox relies on AddNotificationService being called by the Host to register these shared SMTP primitives; registering them twice would duplicate them.
  • CentralSite Communication (#5) — carries NotificationSubmit / NotificationSubmitAck between sites and central via ClusterClient, and NotificationStatusQuery / NotificationStatusResponse for the Notify.Status round-trip.
  • Store-and-Forward Engine (#6) — the site-side component that durably buffers notifications in SQLite and retries forwarding until central acks. The outbox is the receiving end of the S&F handoff.
  • Audit Log (#23) — the outbox is a central direct-write caller of ICentralAuditWriter. It emits an Attempted NotifyDeliver row per delivery attempt, plus a terminal row only when the attempt drives the notification to a terminal status (Delivered/Parked/Discarded); it also emits a terminal row per operator Discard. The upstream NotifySend row is emitted by the site and arrives at central via standard audit telemetry.
  • Health Monitoring (#11) — polls NotificationKpiRequest / PerSiteNotificationKpiRequest for the headline KPI tiles on the health dashboard (queue depth, stuck count, parked count). These are central-computed from the Notifications table and are separate from the site S&F backlog metric.
  • Central UI (#9) — hosts the Notification Outbox page: KPI tiles, a queryable/filterable notification list, per-row Retry/Discard actions on parked notifications, and a stuck-row badge.

Troubleshooting

Notifications stuck in Pending

A notification stays Pending when the dispatch loop is not running or is failing silently. Check for "Dispatch sweep failed" at Error level in the central node logs. The most common cause is a missing or misconfigured SmtpConfiguration row, which the adapter surfaces as a PermanentFailure and parks the notification immediately. A Warning log line naming SmtpConfiguration.MaxRetries or SmtpConfiguration.RetryDelay being non-positive indicates the retry policy was clamped — correct the SMTP configuration row.

Notifications parked with "no delivery adapter for type"

The actor parks a notification immediately and logs this message when NotificationType has a value for which no INotificationDeliveryAdapter is registered. Currently only Email has an adapter; future channel types must be registered in AddNotificationOutbox before notifications of that type are submitted.

Dispatch loop wedged (guard stuck true)

The boolean _dispatching guard is cleared by DispatchComplete, which is piped even on a faulted sweep. If the actor itself stops and restarts, PreStart reinitialises the timers and the guard resets. A wedged guard without a restart indicates the PipeTo callback is never completing — examine logs around "Dispatch sweep faulted unexpectedly".

SMTP credentials appearing in logs

EmailNotificationDeliveryAdapter runs CredentialRedactor.Scrub on all exception messages before logging. If credential strings appear in logs the SMTP exception message is not being routed through the catch filters in DeliverAsync — ensure the exception type is reachable by one of the three catch blocks and not escaping before scrubbing.

Failover mid-delivery

A central failover while a delivery attempt is in flight leaves the row in its pre-attempt status (Pending or Retrying). The new active node picks it up on the next dispatch tick. One notification may be re-sent to SMTP as a result — this is an accepted trade-off, consistent with the at-least-once guarantee the S&F Engine already provides.