
Joseph Doherty 9e5e32d0f2 docs(audit): add SourceNode column to AuditLog/Notifications/SiteCalls design + plan

- Adds SourceNode varchar(64) NULL to AuditLog, Notifications, and SiteCalls
  tables with role-name semantics: node-a/node-b for site rows (qualified by
  SourceSiteId), central-a/central-b for central direct-write rows.
- New IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc) index.
- Reframes CLAUDE.md from documentation-only to implementation project.
- Adds docs/plans/2026-05-23-audit-source-node.md + tasks.json companion.

2026-05-23 15:34:44 -04:00


Component: Notification Outbox

Purpose

The Notification Outbox is the central component that receives store-and-forwarded notifications from site clusters, logs every one to the Notifications table in the central configuration database, and delivers them through per-type delivery adapters. The Notifications table is the single source of audit truth: every notification — successfully delivered, parked, or discarded — has exactly one durable row. The outbox provides delivery retry, parking of failures, per-notification status tracking, and KPIs for delivery health.

This inverts where notification delivery happens. Sites no longer send notifications directly via SMTP; a site script's notification is store-and-forwarded to central, and the central outbox owns dispatch and delivery.

Location

Central cluster. The NotificationOutboxActor is a singleton on the active central node. It is the first outbox component to live centrally — the Store-and-Forward Engine remains site-only.

Responsibilities

- Own the durable central queue: the Notifications table in the central MS SQL database.
- Ingest store-and-forwarded notifications from sites, insert-if-not-exists on NotificationId, and ack the site only after the row is persisted.
- Run the dispatcher loop: poll due rows, resolve the target notification list, and deliver via the matching adapter.
- Schedule retries for transient failures and park notifications on permanent failure or exhausted retries.
- Track per-notification status across the delivery lifecycle.
- Compute delivery KPIs from the Notifications table for the Health Monitoring dashboard and the Central UI.
- Purge terminal rows daily after a configurable retention window.

SMTP and HTTP delivery is blocking I/O. Delivery work runs on a dedicated blocking-I/O dispatcher, the same pattern used by Script Execution Actors, so delivery never blocks the actor's dispatcher loop.

End-to-End Flow

Site script: Notify.To("list").Send(subject, body)
    │  generate NotificationId (GUID) locally; return it to the script immediately
    ▼
Site Store-and-Forward Engine  (notification category, target = central)
    │  durably forwards to central via Central–Site Communication (ClusterClient);
    │  buffers/retries if central is unreachable
    ▼
Central ingest: insert-if-not-exists on NotificationId → Notifications table (Pending)
    │  ack the site → site S&F clears the message
    ▼
Central Notification Outbox actor (singleton, active central node)
    │  polls due rows; resolves the list; delivers via the matching adapter
    ├── success            → Delivered
    ├── transient failure  → Retrying (schedule NextAttemptAt)
    └── permanent failure
        / retries exhausted → Parked

The site forwards only (listName, subject, body) plus provenance — recipient resolution happens at central, at delivery time. This keeps notification-list definitions in one place and removes the deploy-to-sites artifact entirely.

Notify.Status(notificationId) returns a small status record — status, retry count, last error, and key timestamps (enqueued, delivered). While the notification is still in the site S&F buffer the site answers the query locally (status Forwarding); once forwarded, the query round-trips to central and reads the Notifications table.
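
The routing of a status query described above can be sketched as follows. This is a minimal Python illustration, not the project's code: `StatusRecord`, `notify_status`, `site_buffer`, and `central_lookup` are hypothetical names, and the record omits the timestamps for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StatusRecord:
    status: str
    retry_count: int = 0
    last_error: Optional[str] = None


def notify_status(
    notification_id: str,
    site_buffer: set[str],
    central_lookup: Callable[[str], Optional[StatusRecord]],
) -> Optional[StatusRecord]:
    # Still buffered at the site: answer locally, no round trip to central.
    if notification_id in site_buffer:
        return StatusRecord(status="Forwarding")
    # Already forwarded: central's Notifications table is authoritative.
    return central_lookup(notification_id)
```

The point of the split is that Forwarding never needs a central row: the site is the only party that knows about a notification it has not yet handed off.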

The `Notifications` Table

The table is type-agnostic so it can record any notification type the system supports — email today, Microsoft Teams and others later. One row per notification.

| Field | Notes |
| --- | --- |
| `NotificationId` | GUID, primary key. Generated at the site; used as the idempotency key. |
| `Type` | `Email` / `Teams` / … discriminator. |
| `ListName` | Target notification list. |
| `Subject`, `Body` | Plain-text content. |
| `TypeData` | JSON; extensibility hook for future per-type fields. |
| `Status` | Lifecycle state: one of `Pending`, `Retrying`, `Delivered`, `Parked`, `Discarded`. See Status Lifecycle below. |
| `RetryCount` | Delivery attempts so far. |
| `LastError` | Detail of the most recent failure. |
| `ResolvedTargets` | The recipients the notification actually went to, snapshotted by central at delivery time for audit. |
| `SourceSiteId`, `SourceInstanceId`, `SourceScript` | Provenance. |
| `SourceNode` | The cluster node on which the notification was enqueued: `node-a` / `node-b` for site-originated rows (qualified by `SourceSiteId`). Nullable. Carried verbatim from the site through the S&F handoff. |
| `SiteEnqueuedAt` | When the script called `Send()` (carried from the site). |
| `CreatedAt` | When central ingested the row. |
| `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt` | Delivery timestamps. |

All timestamps are UTC.

Status Lifecycle

- Forwarding: in the site S&F buffer, not yet received by central. Site-local only; never stored in the central Notifications table. Reported by Notify.Status while the site still holds the notification.
- Pending: ingested by central, awaiting first dispatch.
- Retrying: a transient failure occurred; NextAttemptAt schedules the next attempt.
- Delivered: terminal, success.
- Parked: terminal, not delivered, after a permanent failure or exhausted retries. LastError distinguishes which.
- Discarded: terminal, reached only by operator action on a parked notification. The row is kept (not deleted) so the table remains a complete audit record.
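
The lifecycle above can be read as a small state machine. The following Python sketch is illustrative only (the names `Status`, `ALLOWED`, and `can_transition` are hypothetical, and the project's real implementation is not shown in this document). It covers the centrally stored states and includes the operator Retry and Discard actions on parked rows.

```python
from enum import Enum


class Status(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    DISCARDED = "Discarded"


# Allowed transitions for rows in the central table. Forwarding is
# site-local and never stored centrally, so it is not modelled here.
ALLOWED = {
    Status.PENDING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    Status.RETRYING: {Status.DELIVERED, Status.RETRYING, Status.PARKED},
    # Operator actions on a parked row: Retry -> Pending, Discard -> Discarded.
    Status.PARKED: {Status.PENDING, Status.DISCARDED},
    Status.DELIVERED: set(),
    Status.DISCARDED: set(),
}


def can_transition(current: Status, new: Status) -> bool:
    """Return True if the lifecycle permits moving from current to new."""
    return new in ALLOWED[current]
```

Encoding the table explicitly makes it easy to reject an invalid update, such as discarding a row that was never parked.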

The Notification Outbox and the central Site Call Audit component share the TrackedOperationId tracking model and this status lifecycle, but differ in delivery locality: the Notification Outbox delivers notifications itself (central SMTP), whereas Site Call Audit only audits cached calls delivered site-locally by the site Store-and-Forward Engine — it is not a dispatcher.

Retry Policy

Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry interval. The interval is fixed (no exponential backoff), consistent with the existing fixed-interval store-and-forward convention.

Retention

Terminal rows (Delivered, Parked, Discarded) are removed by a daily purge job after a configurable window (default 365 days). This preserves a strong audit trail while bounding table growth. Non-terminal rows are never purged.
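
A minimal sketch of the purge, assuming row age is measured from CreatedAt (the document does not say which timestamp defines the window). It uses Python with an in-memory SQLite table purely to stay self-contained; the real table lives in MS SQL, and `purge_terminal_rows` is a hypothetical name.

```python
import sqlite3
from datetime import datetime, timedelta

TERMINAL = ("Delivered", "Parked", "Discarded")


def purge_terminal_rows(conn: sqlite3.Connection, now: datetime,
                        retention_days: int = 365) -> int:
    """Delete terminal rows older than the retention window; return the count."""
    # Assumption: timestamps are stored as ISO-8601 UTC strings, so string
    # comparison matches chronological order.
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    cur = conn.execute(
        "DELETE FROM Notifications "
        "WHERE Status IN (?, ?, ?) AND CreatedAt < ?",
        (*TERMINAL, cutoff),
    )
    conn.commit()
    return cur.rowcount
```

Because the status filter is part of the DELETE itself, a Pending or Retrying row can never be removed, however old it is.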

Ingest & Idempotency

The site→central handoff is at-least-once. Central ingests an inbound notification submission with an insert-if-not-exists on NotificationId, then acks the site; the site S&F engine clears the message only on that ack. Because central acks only after the row is persisted (ack-after-persist), a lost ack causes the site to resend, and the GUID NotificationId idempotency key makes the resend harmless — the duplicate insert is a no-op.
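
The ack-after-persist handoff can be sketched as below. This is an illustration in Python with SQLite, not the project's code: `ingest` is a hypothetical function, and SQLite's `INSERT OR IGNORE` stands in for whatever insert-if-not-exists form the MS SQL implementation uses.

```python
import sqlite3


def ingest(conn: sqlite3.Connection, notification_id: str, list_name: str,
           subject: str, body: str) -> bool:
    """Persist the submission if it is new, then return True as the ack."""
    conn.execute(
        "INSERT OR IGNORE INTO Notifications "
        "(NotificationId, ListName, Subject, Body, Status) "
        "VALUES (?, ?, ?, ?, 'Pending')",
        (notification_id, list_name, subject, body),
    )
    # Commit before acking: the site clears its buffer only on this ack.
    conn.commit()
    return True
```

A resend of an already-ingested NotificationId leaves the existing row untouched, including its current Status, and still produces an ack so the site can clear its buffer.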

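A central failover mid-delivery can, in rare cases, re-send a notification: if the active node fails after the adapter has sent the message but before the Delivered status is persisted, the new active node sees the row as still due and delivers it again. This is an accepted trade-off, consistent with the duplicate-delivery trade-off the Store-and-Forward Engine already accepts.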

Dispatcher

The dispatcher loop runs on a fixed interval. On each tick the NotificationOutboxActor:

1. Polls the Notifications table for due rows: Pending rows, and Retrying rows whose NextAttemptAt has passed.
2. Resolves the target notification list to its recipients/targets at central, at delivery time.
3. Hands the notification to the delivery adapter registered for its Type, running on the dedicated blocking-I/O dispatcher.
4. Applies the result:
   - success → Delivered, set DeliveredAt, snapshot ResolvedTargets.
   - transient failure → Retrying, increment RetryCount, set NextAttemptAt, record LastError; once retries are exhausted → Parked.
   - permanent failure → Parked, record LastError.
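
The due-row test and the result handling of a tick can be sketched as follows. This Python sketch is illustrative, not the project's code: `Row`, `is_due`, and `apply_result` are hypothetical names, the ResolvedTargets snapshot is omitted, and the exhaustion rule (park once RetryCount reaches the maximum) is an assumption about how the boundary is counted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Row:
    status: str = "Pending"
    retry_count: int = 0
    last_error: Optional[str] = None
    next_attempt_at: Optional[datetime] = None
    delivered_at: Optional[datetime] = None


def is_due(row: Row, now: datetime) -> bool:
    # Due rows: Pending, or Retrying whose NextAttemptAt has passed.
    if row.status == "Pending":
        return True
    return (row.status == "Retrying" and row.next_attempt_at is not None
            and row.next_attempt_at <= now)


def apply_result(row: Row, outcome: str, error: Optional[str], now: datetime,
                 max_retries: int, retry_interval: timedelta) -> None:
    if outcome == "success":
        row.status, row.delivered_at = "Delivered", now
    elif outcome == "transient":
        row.retry_count += 1
        row.last_error = error
        if row.retry_count >= max_retries:  # assumed exhaustion rule
            row.status, row.next_attempt_at = "Parked", None
        else:
            # Fixed interval, no backoff.
            row.status, row.next_attempt_at = "Retrying", now + retry_interval
    elif outcome == "permanent":
        row.status, row.last_error = "Parked", error
    else:
        raise ValueError(f"unknown outcome: {outcome}")
```

Here `max_retries` and `retry_interval` stand for the values reused from the central SMTP configuration.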

Each delivery attempt also writes a Notification.Attempt row to the central AuditLog via ICentralAuditWriter; a transition to a terminal status (Delivered / Parked / Discarded) writes a Notification.Terminal row. Audit writes are direct (no telemetry — the dispatcher runs at central), insert-if-not-exists on EventId. The site-emitted Notification.Enqueued row arrives separately via the standard audit telemetry channel from the site's SQLite write-buffer, so the full per-notification audit trail is Enqueued (site-originated) → Attempt × N (central direct-write) → Terminal (central direct-write). See Component-AuditLog.md, Central direct-write (central-originated events).

The operational Notifications table remains the source of truth for the dispatcher and for Retry/Discard actions; the AuditLog rows are immutable shadows. Operator Retry/Discard still mutates only the Notifications row, and each transition emits the corresponding Notification.Attempt / Notification.Terminal audit row.

Audit-write failure never affects delivery. If the ICentralAuditWriter direct-write fails (transient DB error, schema lock, etc.) the dispatcher logs the failure and increments the CentralAuditWriteFailures health metric (see Health Monitoring #11), but the delivery attempt's outcome on the Notifications row stands. The audit row is recovered by re-emission on the next dispatcher tick or by the on-startup reconciliation sweep; central never aborts a notification because audit failed.
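
The isolation between audit writes and delivery can be sketched as a wrapper that swallows audit failures. This is a Python illustration under assumptions: `write_audit_safely` and `Metrics` are hypothetical, and the `write` callable stands in for the ICentralAuditWriter call. Recovery by re-emission or the reconciliation sweep is not shown.

```python
import logging
from dataclasses import dataclass
from typing import Callable

log = logging.getLogger("notification_outbox")


@dataclass
class Metrics:
    central_audit_write_failures: int = 0


def write_audit_safely(write: Callable[[dict], None], event: dict,
                       metrics: Metrics) -> bool:
    """Attempt the audit write; never let a failure propagate to delivery."""
    try:
        write(event)
        return True
    except Exception as exc:  # audit failure must not abort the notification
        metrics.central_audit_write_failures += 1
        log.warning("audit write failed for %s: %s", event.get("EventId"), exc)
        return False
```

The caller updates the Notifications row regardless of the return value, so the delivery outcome stands even when the audit row is missing.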

Delivery Adapters

A delivery adapter implementing INotificationDeliveryAdapter is registered per Type. Each Deliver(...) call returns one of success | transient failure | permanent failure, mirroring the External System Gateway error-classification pattern.

- Email adapter: implemented now. The existing SMTP composition/send logic, relocated to the central cluster.
- Teams and other adapters: future. The Type discriminator and the adapter interface are the seam; no Teams code exists in this design. Teams auth and targeting (Incoming Webhooks vs Graph API) is a separate design conversation.

Delivery adapters are provided by the Notification Service, which manages notification-list and SMTP definitions and supplies the stateless per-type "deliver one notification" implementations.
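
The per-Type adapter seam can be sketched as a registry keyed by the Type discriminator. This Python sketch is illustrative only: it is not the real INotificationDeliveryAdapter signature, `FakeEmailAdapter` records calls instead of using SMTP, and treating an empty target list or an unregistered Type as a permanent failure is an assumption.

```python
from enum import Enum
from typing import Protocol


class DeliveryResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient failure"
    PERMANENT_FAILURE = "permanent failure"


class DeliveryAdapter(Protocol):
    def deliver(self, targets: list[str], subject: str, body: str) -> DeliveryResult: ...


class FakeEmailAdapter:
    """Stand-in for the real SMTP adapter; records instead of sending."""

    def __init__(self) -> None:
        self.sent: list[tuple[list[str], str]] = []

    def deliver(self, targets: list[str], subject: str, body: str) -> DeliveryResult:
        if not targets:
            # Assumed classification: an empty list cannot succeed on retry.
            return DeliveryResult.PERMANENT_FAILURE
        self.sent.append((targets, subject))
        return DeliveryResult.SUCCESS


ADAPTERS: dict[str, DeliveryAdapter] = {"Email": FakeEmailAdapter()}


def dispatch(type_: str, targets: list[str], subject: str, body: str) -> DeliveryResult:
    adapter = ADAPTERS.get(type_)
    if adapter is None:
        # Assumed: no adapter registered for this Type is a permanent failure.
        return DeliveryResult.PERMANENT_FAILURE
    return adapter.deliver(targets, subject, body)
```

Adding a new notification type then means registering one more entry in the map, with no change to the dispatcher.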

Active/Standby Behavior

The NotificationOutboxActor is a singleton on the active central node. All outbox state lives in MS SQL, which is already the central HA store, so no Akka-level replication is needed (unlike the site S&F engine). On central failover the new active node resumes dispatch directly from the Notifications table — Pending rows and due Retrying rows are picked up on the next dispatcher tick.

Monitoring

KPIs

KPIs are central-computed from the Notifications table — global, with a per-source-site breakdown:

- Queue depth: count of Pending + Retrying.
- Stuck count: Pending / Retrying rows older than the configurable stuck-age threshold.
- Parked count: count of Parked.
- Delivered (last interval): count of Delivered since the previous sample.
- Oldest pending age: age of the oldest non-terminal notification.

KPIs are point-in-time, computed on demand from the table. The configurable row retention (default 365 days) answers historical questions directly, so no separate time-series store is added.

Stuck Detection

A notification is stuck if it is Pending or Retrying and older than a configurable age threshold (default 10 minutes). Detection is display-only — a count KPI and a row badge. There is no automated escalation or alerting, consistent with the system-wide no-alerting policy.
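
The point-in-time KPIs and the stuck test can be sketched over an in-memory list of rows. This is a Python illustration, not the project's query: `compute_kpis` is a hypothetical name, age is assumed to be measured from CreatedAt, and the Delivered (last interval) count is left out because it depends on the previous sample.

```python
from datetime import datetime, timedelta

NON_TERMINAL = {"Pending", "Retrying"}


def compute_kpis(rows: list[tuple[str, datetime]], now: datetime,
                 stuck_age: timedelta = timedelta(minutes=10)) -> dict:
    """rows: (Status, CreatedAt) pairs. Age is measured from CreatedAt (assumed)."""
    open_ages = [now - created for status, created in rows if status in NON_TERMINAL]
    return {
        "queue_depth": len(open_ages),
        "stuck_count": sum(1 for age in open_ages if age > stuck_age),
        "parked_count": sum(1 for status, _ in rows if status == "Parked"),
        "oldest_pending_age": max(open_ages, default=None),
    }
```

A per-source-site breakdown would apply the same function to the rows grouped by SourceSiteId.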

Surfacing

- Health Monitoring dashboard: headline KPI tiles for queue depth, stuck count, and parked count. These are central-computed and are not part of the site health report. The site S&F notification backlog remains a separate site health metric covering the site→central leg.
- Central UI "Notification Outbox" page: KPI tiles plus a queryable notification list, with filters by status, type, source site, list, and time range; a stuck-only toggle; and keyword search on subject. Parked notifications offer Retry (→ Pending, reset RetryCount / NextAttemptAt) and Discard (→ Discarded) actions. Stuck rows are badged.

Configuration

The component is configured via NotificationOutboxOptions, bound from an appsettings.json section on the central host (Options pattern):

- Dispatch interval: how often the dispatcher loop polls for due rows.
- Stuck-age threshold: age beyond which a non-terminal notification is counted as stuck (default 10 minutes).
- Terminal-row retention window: age after which terminal rows are removed by the daily purge job (default 365 days).

Delivery max-retry-count and retry interval are not part of NotificationOutboxOptions — they are reused from the central SMTP configuration.
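
The shape of the options can be sketched as a small immutable record. This Python sketch only mirrors the three settings listed above; the field names and the `validate` method are hypothetical, and the dispatch interval is left without a default because the document does not state one.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class NotificationOutboxOptions:
    # No default is documented for the dispatch interval; the caller supplies it.
    dispatch_interval: timedelta
    stuck_age_threshold: timedelta = timedelta(minutes=10)  # documented default
    terminal_retention: timedelta = timedelta(days=365)     # documented default

    def validate(self) -> None:
        """Reject zero or negative durations (an assumed sanity check)."""
        for name in ("dispatch_interval", "stuck_age_threshold", "terminal_retention"):
            if getattr(self, name) <= timedelta(0):
                raise ValueError(f"{name} must be positive")
```

Max-retry-count and retry interval are deliberately absent, matching the statement that they come from the central SMTP configuration.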

Dependencies

- Notification Service: Provides notification-list and SMTP definitions, and the per-type delivery adapters the outbox invokes.
- Configuration Database: Hosts the Notifications table; provides the entity POCO, repository, and EF migration for outbox persistence.
- Central–Site Communication: Carries inbound notification submissions and acks between sites and central.
- Audit Log (#23): The dispatcher direct-writes Notification.Attempt and Notification.Terminal rows to the central AuditLog via ICentralAuditWriter (insert-if-not-exists on EventId); the site-emitted Notification.Enqueued row arrives via the standard audit telemetry channel. See Component-AuditLog.md, Central direct-write (central-originated events).
- Health Monitoring: Consumes the outbox KPIs as central-computed headline metrics.
- Central UI: Hosts the Notification Outbox page.

Interactions

- Site Store-and-Forward Engine: Forwards notifications to central via Central–Site Communication; the outbox ingests them and acks once persisted.
- Notification Service: Supplies delivery adapters and resolves notification lists at delivery time.
- Central UI: Queries the Notifications table for the Notification Outbox page and issues operator Retry/Discard actions on parked notifications.
- Health Monitoring: Polls the outbox for KPI tiles on the health dashboard.
