From c1c8e356874cd64b6c90ecd9f959505df82a228b Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Wed, 3 Jun 2026 15:52:33 -0400 Subject: [PATCH] =?UTF-8?q?docs(components):=20reference=20docs=20batch=20?= =?UTF-8?q?3/4=20=E2=80=94=20NotificationService,=20NotificationOutbox,=20?= =?UTF-8?q?SiteCallAudit,=20HealthMonitoring,=20SiteEventLogging,=20Inboun?= =?UTF-8?q?dAPI?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/components/HealthMonitoring.md | 184 ++++++++++++++++ docs/components/InboundAPI.md | 279 +++++++++++++++++++++++++ docs/components/NotificationOutbox.md | 260 +++++++++++++++++++++++ docs/components/NotificationService.md | 228 ++++++++++++++++++++ docs/components/SiteCallAudit.md | 254 ++++++++++++++++++++++ docs/components/SiteEventLogging.md | 234 +++++++++++++++++++++ 6 files changed, 1439 insertions(+) create mode 100644 docs/components/HealthMonitoring.md create mode 100644 docs/components/InboundAPI.md create mode 100644 docs/components/NotificationOutbox.md create mode 100644 docs/components/NotificationService.md create mode 100644 docs/components/SiteCallAudit.md create mode 100644 docs/components/SiteEventLogging.md diff --git a/docs/components/HealthMonitoring.md b/docs/components/HealthMonitoring.md new file mode 100644 index 00000000..c7f78f33 --- /dev/null +++ b/docs/components/HealthMonitoring.md @@ -0,0 +1,184 @@ +# Health Monitoring + +The Health Monitoring component collects operational metrics from site cluster subsystems, forwards them to central on a 30-second cadence, and exposes an in-memory aggregated view that the Central UI health dashboard polls. + +## Overview + +Health Monitoring (#11) runs on both site and central nodes with different roles on each side. It exists as a display-only, in-memory signal layer — no historical persistence, no alerting. The philosophy is current-status-only: the dashboard answers "what is the system doing right now?" rather than "what happened over the last hour?". + +The component code lives in `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/`: + +- `SiteHealthCollector` / `ISiteHealthCollector` — the site-side thread-safe accumulator that other site subsystems push metrics into. Script Actors, Alarm Actors, DCL connection actors, and the Audit Log bridge all call into this singleton. +- `HealthReportSender` — a `BackgroundService` that ticks every `ReportInterval` (30 s), atomically drains the collector into a `SiteHealthReport`, stamps a monotonic sequence number, and fires it to central over `IHealthReportTransport` (Akka Tell, fire-and-forget). Only the active node sends — the standby runs the loop but skips the send. +- `CentralHealthAggregator` / `ICentralHealthAggregator` — the central-side `BackgroundService` and in-memory store. Receives `SiteHealthReport` messages, applies them atomically via compare-and-swap on `SiteHealthState` records, and runs a periodic offline-detection sweep. +- `CentralHealthReportLoop` — central-only counterpart to `HealthReportSender`: generates a synthetic `SiteHealthReport` for the central cluster itself (site ID `$central`) so central appears as a first-class card on `/monitoring/health`. Only the Primary (leader) node generates reports. +- `SiteHealthState` — immutable record holding the latest report, last-heartbeat timestamp, sequence number, and online/offline flag for one site. Handed directly to UI callers; never torn because the aggregator swaps it atomically. + +DI entry points are split by role: `AddSiteHealthMonitoring` for site nodes (registers `ISiteHealthCollector` + starts `HealthReportSender`); `AddCentralHealthAggregation` for central nodes (registers `CentralHealthAggregator` + starts `CentralHealthReportLoop`); `AddHealthMonitoring` for nodes that need the collector but not the sender (shared use). All three are idempotent with respect to `HealthMonitoringOptionsValidator` registration. + +## Key Concepts + +### Monotonic sequence numbers + +Sequence numbers are seeded at construction with the current Unix epoch in milliseconds rather than starting at zero. This ensures that, after a site failover, the newly-active node's first report always carries a higher sequence number than any report the previous active node sent — the central aggregator's sequence guard would otherwise silently discard the new active's reports as stale. The same seeding applies to `CentralHealthReportLoop` for the `$central` synthetic site. + +### Raw error counts per interval + +`ScriptErrorCount`, `AlarmEvaluationErrorCount`, `DeadLetterCount`, `SiteAuditWriteFailures`, and `AuditRedactionFailure` are raw counts accumulated since the previous report, not rates. `SiteHealthCollector.CollectReport` atomically reads and resets each counter via `Interlocked.Exchange`. If the transport `Send` throws, `HealthReportSender` restores the drained counts back into the collector via `Interlocked.Add` so they roll forward into the next report rather than being silently lost. Concurrent increments that arrive during a failed send accumulate against zero and are preserved by the restore `Add`. + +### Online/offline detection + +Online status is driven by `LastHeartbeatAt`, not by `LastReportReceivedAt`. Heartbeats arrive from `SiteCommunicationActor` every ~5 s (`CommunicationOptions.TransportHeartbeatInterval`), so the 60 s `OfflineTimeout` tolerates roughly twelve missed heartbeats before declaring a site offline. A single-node failover — where the standby is alive but the active cannot produce a full report — therefore does not trigger a false offline transition. + +The synthetic `$central` site has no heartbeat source; its only signal is the 30 s `CentralHealthReportLoop` self-report. It therefore gets a longer `CentralOfflineTimeout` (default 3 × `ReportInterval` = 90 s), equivalent to one missed self-report. The validator rejects any configuration where `CentralOfflineTimeout < OfflineTimeout`. + +The offline-check `PeriodicTimer` runs at half the shorter of the two timeouts so whichever site class has the tighter window is swept at least twice within it. + +### Dead-letter monitoring + +`SiteHealthCollector.IncrementDeadLetter` is called by the site's Akka `EventStream` dead-letter subscriber. Each call atomically increments `_deadLetterCount`; the count appears in the next health report as `DeadLetterCount`. Dead letters indicate messages sent to actors that no longer exist — typically stale references or timing races after instance redeploy. The health dashboard surfaces the count per report interval for quick triage; Site Event Log Viewer provides the per-message detail. + +## Architecture + +### Site-side collection + +`SiteHealthCollector` holds all per-interval counters (`_scriptErrorCount`, `_alarmErrorCount`, `_deadLetterCount`, `_siteAuditWriteFailures`, `_auditRedactionFailures`) as `int` fields touched only through `Interlocked` operations, and snapshot state (`ConcurrentDictionary` for connection health, tag resolution, and S&F buffer depths) that is overwritten rather than incremented. This split means `CollectReport` can atomically reset the counters in one pass while taking a point-in-time copy of the dictionaries, with no locks: + +```csharp +// SiteHealthCollector.CollectReport (abbreviated) +public SiteHealthReport CollectReport(string siteId) +{ + var scriptErrors = Interlocked.Exchange(ref _scriptErrorCount, 0); + var alarmErrors = Interlocked.Exchange(ref _alarmErrorCount, 0); + var deadLetters = Interlocked.Exchange(ref _deadLetterCount, 0); + var auditFailures = Interlocked.Exchange(ref _siteAuditWriteFailures, 0); + var redactFailures = Interlocked.Exchange(ref _auditRedactionFailures, 0); + + return new SiteHealthReport( + SiteId: siteId, + SequenceNumber: 0, // caller stamps monotonic seq + ReportTimestamp: _timeProvider.GetUtcNow(), + ScriptErrorCount: scriptErrors, + DeadLetterCount: deadLetters, + SiteAuditWriteFailures: auditFailures, + AuditRedactionFailure: redactFailures, + /* ... connection snapshots, instance counts, S&F depths */ + SiteAuditBacklog: _siteAuditBacklog); +} +``` + +The `SequenceNumber` field in the returned record is always `0`; `HealthReportSender` overwrites it with the atomically-incremented monotonic counter immediately before calling `_transport.Send`. + +### Site-side report send + +`HealthReportSender` is an active-node-only sender: at the top of each tick it checks `_collector.IsActiveNode` and skips the remainder when `false`. The active/standby flag is set by the Deployment Manager singleton ownership check, not by this component. + +The send itself is synchronous and fire-and-forget (`IHealthReportTransport.Send` wraps an Akka `Tell`). A transport exception is caught, logged, and the interval counts are restored before re-throwing — the outer `catch (Exception)` swallows the rethrow so the background service never terminates from a single failed send. + +### Central aggregation + +`CentralHealthAggregator` stores one `SiteHealthState` record per site in a `ConcurrentDictionary`. Every write (from `ProcessReport` or `MarkHeartbeat`) uses a compare-and-swap loop: + +```csharp +// CentralHealthAggregator.ProcessReport (core CAS path) +var updated = existing with +{ + LatestReport = report, + LastReportReceivedAt = now, + LastHeartbeatAt = now, + LastSequenceNumber = report.SequenceNumber, + IsOnline = true +}; + +if (_siteStates.TryUpdate(report.SiteId, updated, existing)) + return; + +// CAS lost — retry with fresh value +``` + +Sequence numbers guard against stale reports from a pre-failover node overwriting the new active's fresher state. A heartbeat for an unknown site (e.g., just after a central restart) registers the site as online with a null `LatestReport` so the site is not shown as "unknown" during the failover window. + +The offline sweep runs on a `PeriodicTimer` at `ComputeCheckInterval(_options)` — half the shorter of `OfflineTimeout` and `CentralOfflineTimeout`. It checks `LastHeartbeatAt` (not report time) and applies a single non-retried CAS: if the CAS loses, the site was just heard from and leaving it online is correct. + +### Audit Log metrics bridge + +Audit Log registers `AddAuditLogHealthMetricsBridge` on site nodes after `AddSiteHealthMonitoring`. This replaces the default no-op failure counters with two bridges that forward directly into `ISiteHealthCollector`: + +- `HealthMetricsAuditWriteFailureCounter` — called by `FallbackAuditWriter` on every primary SQLite failure; increments `SiteAuditWriteFailures`. +- `HealthMetricsAuditRedactionFailureCounter` — called by the payload redactor on every over-redaction event; increments `AuditRedactionFailure`. + +A third collaborator, `SiteAuditBacklogReporter`, is a hosted service that polls `ISiteAuditQueue.GetBacklogStatsAsync` every 30 s and pushes a `SiteAuditBacklogSnapshot` into `ISiteHealthCollector.UpdateSiteAuditBacklog`. The snapshot (`PendingCount`, `OldestPendingUtc`, `OnDiskBytes`) rides as `SiteAuditBacklog` on the next health report. The poll runs in a separate service rather than inline in `CollectReport` to keep the report path free of synchronous SQLite I/O. + +On central, `AuditCentralHealthSnapshot` (in the Audit Log component) is the symmetric counterpart: it accumulates `CentralAuditWriteFailures`, `AuditRedactionFailure`, and a per-site `SiteAuditTelemetryStalled` map fed by `SiteAuditTelemetryStalledTracker`. These are read by the central health dashboard alongside the aggregated site states. See [Audit Log](./AuditLog.md) for the full counter and stall-detection design. + +## Usage + +Site subsystems call `ISiteHealthCollector` directly — it is a DI singleton. Examples of callers by subsystem: + +| Caller | Method | Metric in report | +|--------|--------|-----------------| +| Script Actor (on unhandled exception) | `IncrementScriptError()` | `ScriptErrorCount` | +| Alarm Actor (on eval failure) | `IncrementAlarmError()` | `AlarmEvaluationErrorCount` | +| Akka EventStream dead-letter subscriber | `IncrementDeadLetter()` | `DeadLetterCount` | +| DCL connection actor | `UpdateConnectionHealth(name, health)` | `DataConnectionStatuses` | +| DCL connection actor | `UpdateTagResolution(name, total, resolved)` | `TagResolutionCounts` | +| DCL connection actor | `UpdateTagQuality(name, good, bad, uncertain)` | `DataConnectionTagQuality` | +| Audit Log bridge | `IncrementSiteAuditWriteFailures()` | `SiteAuditWriteFailures` | +| Audit Log bridge | `IncrementAuditRedactionFailure()` | `AuditRedactionFailure` | +| `SiteAuditBacklogReporter` | `UpdateSiteAuditBacklog(snapshot)` | `SiteAuditBacklog` | +| `HealthReportSender` | `SetParkedMessageCount(count)` | `ParkedMessageCount` | + +Central consumers resolve `ICentralHealthAggregator` and call `GetAllSiteStates()` or `GetSiteState(siteId)` to read a snapshot-safe dictionary of `SiteHealthState` records. The health dashboard polls this on a ~10 s timer. Because `SiteHealthState` is an immutable record swapped atomically, a consumer can hold the reference without risk of a torn read. + +## Configuration + +Options class: `HealthMonitoringOptions`, bound from the `ScadaBridge:HealthMonitoring` section. Validated at startup by `HealthMonitoringOptionsValidator` (registered with `ValidateOnStart`) so a bad configuration fails with a clear key-naming message rather than an opaque `ArgumentOutOfRangeException` inside a `PeriodicTimer` constructor. + +| Key | Default | Constraint | Description | +|-----|---------|-----------|-------------| +| `ScadaBridge:HealthMonitoring:ReportInterval` | `00:00:30` (30 s) | Must be `> 0` | Interval at which site nodes emit health reports to central. Also the `CentralHealthReportLoop` self-report cadence. | +| `ScadaBridge:HealthMonitoring:OfflineTimeout` | `00:01:00` (60 s) | Must be `> 0` | Silence window after which a real site is marked offline. Driven by `LastHeartbeatAt`, not last report time. | +| `ScadaBridge:HealthMonitoring:CentralOfflineTimeout` | `00:03:00` (3 min) | Must be `>= OfflineTimeout` | Grace window for the `$central` synthetic site, which has no heartbeat source. Defaults to 3× `ReportInterval`. | + +The offline-check cadence is derived at runtime as `min(OfflineTimeout, CentralOfflineTimeout) / 2` — not directly configurable. + +## Dependencies & Interactions + +- [Commons (#16)](./Commons.md) — defines `SiteHealthReport`, `SiteHealthReportReplica`, `NodeStatus`, `SiteAuditBacklogSnapshot`, and the `ISiteHealthCollector` / `ICentralHealthAggregator` interfaces consumed throughout. `SiteHealthReport` is an additive record; new fields use default values so existing producers remain valid. +- [Central–Site Communication (#5)](./Communication.md) — transports `SiteHealthReport` messages from site to central via Akka remoting (fire-and-forget Tell through `IHealthReportTransport`). Also delivers heartbeats from `SiteCommunicationActor` that `CentralHealthAggregator.MarkHeartbeat` uses to keep sites online between reports. `SiteHealthReportReplica` is broadcast on DistributedPubSub so both central nodes maintain identical aggregator state. +- [Site Runtime (#3)](./SiteRuntime.md) — Script Actors call `IncrementScriptError`; Alarm Actors call `IncrementAlarmError`; the Deployment Manager singleton ownership check drives `SetActiveNode`. +- [Data Connection Layer (#4)](./DataConnectionLayer.md) — connection actors call `UpdateConnectionHealth`, `UpdateTagResolution`, `UpdateConnectionEndpoint`, `UpdateTagQuality`, and `RemoveConnection` on `ISiteHealthCollector`. +- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — `HealthReportSender` queries `StoreAndForwardStorage` for `GetParkedMessageCountAsync` and `GetBufferDepthByCategoryAsync`; the results populate `ParkedMessageCount` and `StoreAndForwardBufferDepths` (keyed by `StoreAndForwardCategory` name). +- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — `IClusterNodeProvider` supplies cluster node list and `SelfIsPrimary` flag to both `HealthReportSender` and `CentralHealthReportLoop`. Heartbeat cadence (default 5 s) is owned by Cluster Infrastructure / `SiteCommunicationActor`. +- [Audit Log (#23)](./AuditLog.md) — `AddAuditLogHealthMetricsBridge` wires `HealthMetricsAuditWriteFailureCounter` and `HealthMetricsAuditRedactionFailureCounter` into the site collector, and registers `SiteAuditBacklogReporter` to poll the site-local SQLite drain backlog. On central, `AuditCentralHealthSnapshot` exposes `CentralAuditWriteFailures`, `AuditRedactionFailure`, and per-site `SiteAuditTelemetryStalled` alongside the aggregated site states on the health dashboard. +- [Central UI (#9)](./Host.md) — the health dashboard resolves `ICentralHealthAggregator` and polls `GetAllSiteStates()` on a ~10 s timer. Notification Outbox and Site Call Audit KPIs are computed on demand from their own central tables by those components; Health Monitoring does not own or cache them. +- [Host (#15)](./Host.md) — implements `ISiteIdentityProvider` (supplies `SiteId` for report payloads) and `IClusterNodeProvider`, and calls the appropriate `Add*` entry points from the role-specific composition root. + +## Troubleshooting + +### A site flaps online/offline during single-node failover + +The 60 s `OfflineTimeout` is driven by heartbeats, not reports. The standby node keeps sending heartbeats even when the active is down. If the site still shows as offline during a failover window shorter than 60 s, check that `SiteCommunicationActor` is running on the standby (it is not a singleton — both nodes run it) and that heartbeats are reaching central. Temporarily increasing `OfflineTimeout` reduces false-offline transitions at the cost of slower genuine-offline detection. + +### Reports from the new active are silently discarded after failover + +This happens when the new active's process-start sequence numbers fall below the prior active's last sequence number. `HealthReportSender` seeds `_sequenceNumber` with `TimeProvider.GetUtcNow().ToUnixTimeMilliseconds()` at construction, so this should not occur unless the new node's clock is significantly behind the old node's. Check time synchronization between site nodes. + +### `$central` shows as offline + +`CentralHealthReportLoop` only generates reports when `IClusterNodeProvider.SelfIsPrimary` is true. If both central nodes are healthy but the `$central` entry shows offline, the primary node's loop may have stalled or the Akka cluster may be in a split-brain state. Check `CentralHealthReportLoop` logs for `"Failed to generate central health report"` errors. + +### `SiteAuditWriteFailures` non-zero + +Non-zero `SiteAuditWriteFailures` in consecutive reports indicates the site-local SQLite audit writer is throwing persistently and rows are being routed to the `RingBufferFallback`. Check disk space and SQLite file health at the site node. See [Audit Log](./AuditLog.md) — the fallback ring is drop-oldest; sustained failure loses rows. + +## Related Documentation + +- [Health Monitoring design specification](../requirements/Component-HealthMonitoring.md) +- [Audit Log](./AuditLog.md) +- [Central–Site Communication](./Communication.md) +- [Site Runtime](./SiteRuntime.md) +- [Data Connection Layer](./DataConnectionLayer.md) +- [Store-and-Forward Engine](./StoreAndForward.md) +- [Cluster Infrastructure](./ClusterInfrastructure.md) +- [Commons](./Commons.md) diff --git a/docs/components/InboundAPI.md b/docs/components/InboundAPI.md new file mode 100644 index 00000000..19eea6c7 --- /dev/null +++ b/docs/components/InboundAPI.md @@ -0,0 +1,279 @@ +# Inbound API + +The Inbound API exposes a `POST /api/{methodName}` endpoint on the active central node so external systems can invoke C# scripts that live entirely on central, with the ability to reach any site instance through a routing surface. It is the inward counterpart of the External System Gateway — where that component handles scripts calling out, this handles callers coming in. + +## Overview + +Inbound API (#14) is a central-only, active-node-only component. Its code lives in `src/ZB.MOM.WW.ScadaBridge.InboundAPI/`, with shared entity and message types in `src/ZB.MOM.WW.ScadaBridge.Commons/`. + +The component has three runtime responsibilities: + +- **Auth and dispatch** — `EndpointExtensions.MapInboundAPI` registers the endpoint; `InboundApiEndpointFilter` enforces the active-node gate and body-size cap before the handler runs; the handler authenticates via `IApiKeyVerifier` and resolves the matching `ApiMethod` from `IInboundApiRepository`. +- **Script execution** — `InboundScriptExecutor` compiles `ApiMethod.Script` via Roslyn, caches the compiled delegate, and runs it inside `InboundScriptContext` against a method-level timeout. +- **Audit emission** — `AuditWriteMiddleware` wraps the entire request pipeline; it mints the per-request `ExecutionId`, buffers request and response bodies up to the configured cap, and writes one `ApiInbound` row to `ICentralAuditWriter` in its `finally` block regardless of outcome. + +The DI entry point is `ServiceCollectionExtensions.AddInboundAPI`, which registers `InboundScriptExecutor` (singleton), `RouteHelper` (scoped), `CommunicationServiceInstanceRouter` (scoped), and `InboundApiEndpointFilter` (singleton). API key verification is registered separately by the Host composition root via `AddZbApiKeyAuth` — `AddInboundAPI` does not register it. + +## Key Concepts + +### API key authentication + +Authentication uses a Bearer token in the `Authorization` header (`sbk__`). The shared `IApiKeyVerifier` performs a peppered-HMAC constant-time secret comparison against the key store. Every verifier failure — missing token, unknown key, revoked key, secret mismatch — maps to a single `401` with the body `{"error":"Invalid or missing API key"}` so the failure reason is never surfaced to the caller. + +The spec describes `X-API-Key` header auth. The code has retired that header in favour of a `Bearer` token scheme (`Authorization: Bearer sbk__`). The constant `UnauthorizedMessage` and `NotApprovedMessage` in `EndpointExtensions` are deliberately identical across different reject branches to prevent method enumeration. + +### Per-method scope authorization + +Once a key verifies, the handler checks whether `identity.Scopes.Contains(methodName)` (ordinal, case-sensitive) before making any database call. A key must carry the exact method name as a scope — `"Echo"` does not grant `"echo"`. If the scope check fails, or the subsequent `IInboundApiRepository.GetMethodByNameAsync` returns null, both branches emit `403` with the same body `{"error":"API key not approved for this method"}`. The scope check runs first to avoid a DB round-trip on the reject path and to eliminate a latency timing oracle. + +### `ApiMethod` entity + +`ApiMethod` (in `ZB.MOM.WW.ScadaBridge.Commons.Entities.InboundApi`) is the persistence-ignorant shape: + +```csharp +public class ApiMethod +{ + public int Id { get; set; } + public string Name { get; set; } // route segment + public string Script { get; set; } // Roslyn C# script body + public string? ParameterDefinitions { get; set; } // JSON: List + public string? ReturnDefinition { get; set; } // JSON: List + public int TimeoutSeconds { get; set; } +} +``` + +`ParameterDefinitions` and `ReturnDefinition` are stored as JSON strings to keep the schema simple; both are deserialized on every request by `ParameterValidator` and `ReturnValueValidator`. + +### Extended type system + +Parameter and return field definitions share the same six-type vocabulary: + +| Type | JSON shape | C# value after coercion | +|-----------|----------------------|-------------------------------------| +| `Boolean` | `true` / `false` | `bool` | +| `Integer` | number (whole) | `long` | +| `Float` | number | `double` | +| `String` | string | `string` | +| `Object` | JSON object | `Dictionary` | +| `List` | JSON array | `List` | + +`Object` and `List` are validated for JSON shape only — field-level or element-level type constraints are the script's responsibility. Template attributes use only the four primitive types; the extended types apply here and in the External System Gateway. + +## Architecture + +### Request pipeline + +``` +POST /api/{methodName} + │ + ├─ AuditWriteMiddleware ← mints ExecutionId; buffers bodies; emits audit row in finally + │ └─ InboundApiEndpointFilter ← 503 on standby node; 413 on oversized body + │ └─ HandleInboundApiRequest + │ ├─ IApiKeyVerifier.VerifyAsync ← 401 on any auth failure + │ ├─ scope check + GetMethodByNameAsync ← 403 on not-approved + │ ├─ ParameterValidator.Validate ← 400 on bad parameters + │ └─ InboundScriptExecutor.ExecuteAsync + │ ├─ ForbiddenApiChecker ← static trust model enforcement + │ ├─ Roslyn compile + cache ← handler cached by method name + │ └─ ReturnValueValidator ← 500 on return shape mismatch + └─ ICentralAuditWriter.WriteAsync ← fire-and-forget from middleware finally +``` + +The filter is applied at registration time via `.AddEndpointFilter()` in `EndpointExtensions.MapInboundAPI`; it runs before the handler so a standby node or an oversized body never reaches auth or script execution. + +### Script compilation and handler cache + +`InboundScriptExecutor` is a singleton holding two `ConcurrentDictionary` instances: + +- `_scriptHandlers` — maps method name to a compiled `Func>`. +- `_knownBadMethods` — records methods whose scripts have failed to compile, capped at 1 000 entries, so a bad script is compiled at most once per startup and a flood of unique bogus names cannot grow the cache without bound. + +The compilation path in `CompileAndRegister`: + +```csharp +public bool CompileAndRegister(ApiMethod method) +{ + var handler = Compile(method); + if (handler == null) + { + TryRecordBadMethod(method.Name); + return false; + } + _knownBadMethods.TryRemove(method.Name, out _); + return Register(method.Name, handler); +} +``` + +`Compile` runs `ForbiddenApiChecker.FindViolations` first — a Roslyn syntax-tree walk that rejects forbidden namespace references (`System.IO`, `System.Diagnostics`, `System.Threading` except `Tasks`, `System.Reflection`, `System.Net`, `System.Runtime.InteropServices`, `Microsoft.Win32`) and reflection-gateway member names (`GetType`, `Assembly`, `GetMethod`, `CreateInstance`, `InvokeMember`, and others). Scripts containing `dynamic` or `Activator` are also rejected. This is defence-in-depth, not a true sandbox. + +If a method is invoked before it has been compiled — for example a method created after startup — `ExecuteAsync` performs a lazy compile on first call, then stores the handler via `GetOrAdd` so concurrent first callers share one delegate. + +Scripts are compiled with a restricted reference set (`mscorlib`, `System.Linq`, `System.Collections.Generic`, `RouteHelper`'s assembly, `ScriptParameters`'s assembly, and the C# runtime binder) and with imports for `System`, `System.Collections.Generic`, `System.Linq`, and `System.Threading.Tasks`. The `globalsType` is `InboundScriptContext`. + +### Script context and the `Route` surface + +`InboundScriptContext` is the Roslyn globals object injected into every running script: + +```csharp +public class InboundScriptContext +{ + public ScriptParameters Parameters { get; } + public RouteHelper Route { get; } + public CancellationToken CancellationToken { get; } +} +``` + +`Parameters` wraps the validated, type-coerced values. `Parameters["key"]` gives raw `object?` access; `Parameters.Get("key")` adds typed conversion with clear error messages (`ScriptParameterException`). `Route` is a scoped `RouteHelper` already bound to the method-level deadline token and to the inbound request's `ParentExecutionId`. + +`RouteHelper.To(instanceCode)` returns a `RouteTarget` that exposes five operations: + +| Method | Description | +|--------|-------------| +| `Call(scriptName, parameters?)` | Invoke a script on the instance; returns the script's return value. | +| `GetAttribute(name)` | Read one attribute value. | +| `GetAttributes(names)` | Batch-read; returns `IReadOnlyDictionary`. | +| `SetAttribute(name, value)` | Write one attribute value. | +| `SetAttributes(dict)` | Batch-write. | + +All five operations are synchronous from the script's perspective (the central node blocks until the site responds or the method timeout fires). There is no store-and-forward — a site-unreachable or timed-out routed call throws `InvalidOperationException` back to the script, which surfaces as a `500` to the caller. + +`RouteTarget.Call` builds a `RouteToCallRequest` carrying `ParentExecutionId` so the spawned site script execution records the inbound request as its parent in the audit tree: + +```csharp +var request = new RouteToCallRequest( + correlationId, _instanceCode, scriptName, ScriptArgs.Normalize(parameters), + DateTimeOffset.UtcNow, _parentExecutionId); + +var response = await _instanceRouter.RouteToCallAsync(siteId, request, token); +``` + +`IInstanceRouter` is the seam over `CommunicationService`; in production, `CommunicationServiceInstanceRouter` delegates every call directly to `CommunicationService.RouteToCallAsync / RouteToGetAttributesAsync / RouteToSetAttributesAsync`. + +### Active-node gating + +`IActiveNodeGate.IsActiveNode` is the seam the Host implements using Akka cluster state. When `false`, `InboundApiEndpointFilter` returns `503` before any auth or script logic runs. When no implementation is registered — non-clustered hosts, tests — the endpoint is served, preserving prior behaviour. + +### Audit integration + +`AuditWriteMiddleware` sits in the pipeline above the endpoint filter and handler. At the start of every request it: + +1. Mints a fresh `Guid` as the request's `ExecutionId` and stashes it on `HttpContext.Items[InboundExecutionIdItemKey]`. +2. Calls `HttpRequest.EnableBuffering()` (for POST/PUT/PATCH requests only) and reads up to `AuditLogOptions.InboundMaxBytes` bytes of the request body into a bounded audit copy, then rewinds the stream to position 0 so the downstream handler sees the full payload. +3. Wraps `HttpResponse.Body` in `CapturedResponseStream`, which mirrors every write to the real sink while capturing up to `InboundMaxBytes` bytes for the audit copy. + +In the `finally` block, the middleware calls `ICentralAuditWriter.WriteAsync` (fire-and-forget with fault observation) to emit one `AuditChannel.ApiInbound` row. The row's `AuditKind` is `InboundAuthFailure` for `401`/`403` and `InboundRequest` for all other outcomes. Status is `Delivered` for 2xx and `Failed` for 4xx/5xx or a handler exception. `Actor` is the resolved API key display name (stashed by the endpoint handler on `HttpContext.Items[AuditActorItemKey]` after successful auth); it is forced null for auth failures so the middleware never echoes an unauthenticated principal. The audit row's `ExecutionId` is the same `Guid` minted in step 1. + +The endpoint handler reads that same `ExecutionId` from `HttpContext.Items` and threads it into `InboundScriptExecutor.ExecuteAsync` as `parentExecutionId`, which in turn passes it to `RouteHelper.WithParentExecutionId`. Any `Route.To().Call()` inside the script carries it as `RouteToCallRequest.ParentExecutionId`, so the spawned site script execution is linked back to this inbound request in the audit tree. + +Audit emission is best-effort. A write failure is caught, logged at `Warning`, and dropped. It never alters the HTTP response. + +## Usage + +### HTTP contract + +```http +POST /api/{methodName} +Authorization: Bearer sbk__ +Content-Type: application/json + +{ + "siteId": "SiteA", + "startDate": "2026-03-01", + "endDate": "2026-03-16" +} +``` + +Success response (`200`): + +```json +{ + "siteName": "Site Alpha", + "totalUnits": 14250, + "lines": [ + { "lineName": "Line-1", "units": 8200, "efficiency": 92.5 } + ] +} +``` + +Error responses: + +| Status | Condition | +|--------|-----------| +| `401` | Missing, malformed, unknown, revoked, or secret-mismatched token. | +| `403` | Valid key, but not in scope for this method; or method not found. | +| `400` | Missing required parameters, wrong types, or unexpected fields. | +| `413` | Request body exceeds `MaxRequestBodyBytes`. | +| `500` | Script execution error, compilation failure, or return-shape mismatch. | +| `503` | Request reached a standby node. | + +The `403` body is identical whether the method does not exist or the key lacks scope, so a caller holding a valid key cannot enumerate method names by observing status differences. + +### Writing a method script + +A method script runs as a Roslyn C# script with `InboundScriptContext` as globals. The script has access to `Parameters`, `Route`, and `CancellationToken`. + +```csharp +// Example: read a parameter, call a site script, return a result +var siteId = Parameters.Get("siteId"); +var result = await Route.To(siteId).Call("GetProductionSummary", new { date = Parameters.Get("date") }); +return result; +``` + +The `Route.To().Call()` inherits the method-level timeout automatically. A script that needs a tighter per-call bound may pass an explicit `CancellationToken`. Scripts may not access `System.IO`, `System.Diagnostics.Process`, `System.Threading` (except `Tasks`), `System.Reflection`, `System.Net`, or reflection-gateway members — violations are rejected statically at compile time. + +### Startup compilation and hot-reload + +At startup the Host loads all `ApiMethod` rows from the configuration database and calls `CompileAndRegister` on each. After a method is updated via the Management API or CLI, the Management Service calls `CompileAndRegister` again — the updated script takes effect on the next request, with no node restart. Methods created after startup compile lazily on first invocation. Scripts modified directly in the database do not take effect until the next node restart; always use the Management API, CLI, or Central UI. + +## Configuration + +Options class: `InboundApiOptions`, bound from the `ScadaBridge:InboundApi` section. + +| Key | Default | Description | +|-----|---------|-------------| +| `DefaultMethodTimeout` | `00:00:30` | Execution timeout applied when `ApiMethod.TimeoutSeconds` is zero or not set. | +| `MaxRequestBodyBytes` | `1048576` (1 MiB) | Body size cap enforced by `InboundApiEndpointFilter` before the body is parsed. Requests whose `Content-Length` exceeds this return `413`; chunked requests are cut off by Kestrel as they stream in. | +| `ApiKeyPepper` | _(required)_ | Server-side HMAC pepper for bearer credentials. Consumed by the shared `IApiKeyVerifier`; must be a strong, random value (`≥ 16` characters), different per environment, supplied via a secret store. | + +The inbound body-capture cap for audit is configured separately under `AuditLog:InboundMaxBytes` (default 1 MiB; range `[8192, 16777216]`). It governs only the audit copy — the downstream handler always sees the full body. + +## Dependencies & Interactions + +- [Commons (#16)](./Commons.md) — owns `ApiMethod`, `ParameterDefinition`, `ScriptParameters`, `ScriptParameterException`, the `RouteToCall*` / `RouteToGetAttributes*` / `RouteToSetAttributes*` message records, `IInboundApiRepository`, and `IInstanceLocator`. Also owns `ICentralAuditWriter` (via `ZB.MOM.WW.Audit`), `AuditChannel`, `AuditKind`, `AuditStatus`, and `ScadaBridgeAuditEventFactory`. +- [Configuration Database (#17)](./ConfigurationDatabase.md) — provides the `IInboundApiRepository` implementation (`GetMethodByNameAsync`, `GetAllApiMethodsAsync`, CRUD). Method definitions persist in the central MS SQL configuration database. +- [Central–Site Communication (#5)](./Communication.md) — `CommunicationServiceInstanceRouter` delegates every `Route.To()` operation to `CommunicationService`. The routed call travels from the central `CommunicationActor` to the target site via `ClusterClient`, reaches the target `InstanceActor`, and a `ScriptExecutionActor` executes the named script. The return value flows back synchronously. +- [Audit Log (#23)](./AuditLog.md) — `AuditWriteMiddleware` resolves `ICentralAuditWriter` to emit the `ApiInbound` row via the central direct-write path. The inbound request is the parent execution for any site script it spawns: the middleware's `ExecutionId` becomes `RouteToCallRequest.ParentExecutionId` on every routed `Call`. Cross-link: `AuditWriteMiddleware.InboundExecutionIdItemKey` / `AuditWriteMiddleware.AuditActorItemKey` are the `HttpContext.Items` keys that tie the endpoint handler and middleware together. +- [Security (#10)](./Security.md) — API key verification (`IApiKeyVerifier`, `AddZbApiKeyAuth`) is registered by the Host. The inbound API uses a dedicated key scheme independent of LDAP/AD session auth. +- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — `IActiveNodeGate` (interface in this project; implementation in the Host) gates the endpoint to the active central node. A standby returns `503` without running any script logic. +- [Health Monitoring (#11)](./HealthMonitoring.md) — `ScadaBridgeTelemetry.RecordInboundApiRequest(methodName)` is called on every request (after auth failures are classified); `CentralAuditWriteFailures` surfaces on the central health snapshot when an audit write fails. +- Design spec: [Component-InboundAPI.md](../requirements/Component-InboundAPI.md). + +## Troubleshooting + +### A method always returns 500 after a script update + +The in-memory handler cache still holds the previous compiled delegate. If the update went through the Management API or CLI, `CompileAndRegister` should have been called automatically and the new script should be active on the next request. If the script was edited directly in the database, the cached delegate is stale until the next node restart. Check the `ScadaBridge.InboundAPI` log category for a `"script compilation failed"` or `"trust model violation"` warning to distinguish a compile error from a routing failure. + +### A method is stuck in the known-bad-methods cache + +If a previously broken script is fixed but `ExecuteAsync` still returns `"Script compilation failed for this method"`, the method name is in `_knownBadMethods`. `CompileAndRegister` clears the bad-method entry on a successful compile; calling it (via the Management API or CLI `api-method update`) after the fix is applied resets the cache and makes the corrected script active immediately. + +### Routed calls time out but the site is reachable + +The method-level timeout covers the entire execution including `Route.To().Call()`. A slow site script, a large return value, or network latency can consume the budget. `TimeoutSeconds` on the `ApiMethod` entity controls the cap per method; `DefaultMethodTimeout` in `InboundApiOptions` applies when `TimeoutSeconds` is zero. Increase `TimeoutSeconds` for long-running methods; a `503` from the `/health/active` endpoint on the site side indicates a site failover mid-call. + +### Audit rows missing for inbound requests + +`AuditWriteMiddleware` emits on a fire-and-forget `Task`; a write failure is caught and logged at `Warning` under `AuditWriteMiddleware`. `CentralAuditWriteFailures` increments on the central health snapshot. The request itself still returns its normal HTTP response — a missing audit row never means the call failed. + +## Related Documentation + +- [Inbound API design specification](../requirements/Component-InboundAPI.md) +- [Audit Log](./AuditLog.md) +- [Commons](./Commons.md) +- [Configuration Database](./ConfigurationDatabase.md) +- [Central–Site Communication](./Communication.md) +- [Security](./Security.md) +- [Cluster Infrastructure](./ClusterInfrastructure.md) +- [External System Gateway](./ExternalSystemGateway.md) +- [Site Runtime](./SiteRuntime.md) diff --git a/docs/components/NotificationOutbox.md b/docs/components/NotificationOutbox.md new file mode 100644 index 00000000..21150628 --- /dev/null +++ b/docs/components/NotificationOutbox.md @@ -0,0 +1,260 @@ +# Notification Outbox + +The Notification Outbox is the central component that receives store-and-forwarded notifications from site clusters, persists each one to the `Notifications` table in the central MS SQL database, and delivers them through per-type delivery adapters. It is the first outbox component to run centrally — the Store-and-Forward Engine remains site-only. + +## Overview + +Notification Outbox (#21) runs exclusively on the central cluster. Sites no longer send notifications directly via SMTP: a script's `Notify.Send` call generates a `NotificationId` (GUID) locally, the notification is stored in the site S&F buffer, forwarded to central via Central–Site Communication, and the central outbox owns all dispatch and delivery from that point on. + +The component code lives in `src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/`, with a flat layout: + +- Root — `NotificationOutboxActor`, `NotificationOutboxOptions`, `ServiceCollectionExtensions`. +- `Delivery/` — `INotificationDeliveryAdapter`, `DeliveryOutcome`/`DeliveryResult`, `EmailNotificationDeliveryAdapter`. +- `Messages/` — `InternalMessages` (actor-internal timer and pipe messages, never sent over the network). + +Shared types used throughout — `Notification`, `NotificationStatus`, `NotificationType`, `INotificationOutboxRepository`, and all ClusterClient message contracts — live in `src/ZB.MOM.WW.ScadaBridge.Commons/`. + +The DI entry point is `ServiceCollectionExtensions.AddNotificationOutbox`, called by the Host on central nodes. It binds `NotificationOutboxOptions` and registers the typed delivery adapters. The Host separately registers `NotificationOutboxActor` as a cluster singleton and wires the ClusterClient receptionist so inbound `NotificationSubmit` messages reach the actor regardless of which central node is active. + +## Key Concepts + +### `Notifications` table as source of truth + +The central MS SQL `Notifications` table is the single source of audit truth for every notification in the system. One row per `NotificationId` (GUID primary key), regardless of delivery channel. The table is type-agnostic: the `Type` discriminator (`Email`; future channels add new enum members) selects the delivery adapter while the rest of the schema is shared. The `Notifications` table answers both operational queries (current status, retry count, next attempt) and KPI queries (queue depth, stuck count, throughput); no separate time-series store is needed. + +### Status lifecycle + +| Status | Where it lives | Meaning | +|---|---|---| +| `Forwarding` | Site-local only | Notification is in the site S&F buffer; never stored in `Notifications`. | +| `Pending` | Central | Ingested; awaiting first dispatch sweep. | +| `Retrying` | Central | Transient failure; `NextAttemptAt` schedules the next attempt. | +| `Delivered` | Central, terminal | Successfully sent; `DeliveredAt` and `ResolvedTargets` are set. | +| `Parked` | Central, terminal | Permanent failure or retries exhausted; `LastError` records why. | +| `Discarded` | Central, terminal | Operator-cancelled a `Parked` notification; row is kept for the audit record. | + +`Forwarding` is answered site-locally by `Notify.Status(id)`; once the ack arrives, queries round-trip to the `Notifications` table on central. + +### At-least-once site→central handoff + +Central ingests a `NotificationSubmit` with `insert-if-not-exists` on `NotificationId`, then acks the site with `NotificationSubmitAck`. The site S&F engine clears the message only after receiving that ack. A lost ack causes the site to resend; the GUID idempotency key makes the resend a no-op. Because the ack is sent only after the row is persisted, no notification is lost to a race between the ack and a central failover — the row already exists. + +### No Akka-level replication + +All outbox state lives in MS SQL, which is already the central HA store. There is no Akka replication of the actor's in-memory state. On a central failover the new active node's singleton starts a fresh dispatch sweep; `Pending` and due `Retrying` rows are reclaimed from the table on the next tick. + +## Architecture + +### `NotificationOutboxActor` + +`NotificationOutboxActor` is a `ReceiveActor` that implements `IWithTimers`. It runs as a cluster singleton on the active central node. The actor is responsible for both the ingest path (accepting `NotificationSubmit` messages) and the dispatch path (running the periodic delivery loop). All async work is executed via `PipeTo(Self)` so every result lands on the actor's mailbox thread, preserving single-threaded actor semantics: + +```csharp +public class NotificationOutboxActor : ReceiveActor, IWithTimers +{ + public NotificationOutboxActor( + IServiceProvider serviceProvider, + NotificationOutboxOptions options, + ICentralAuditWriter auditWriter, + ILogger logger) + { + Receive(HandleSubmit); + Receive(HandleIngestPersisted); + Receive(_ => HandleDispatchTick()); + Receive(_ => _dispatching = false); + Receive(_ => HandlePurgeTick()); + Receive(_ => { }); + Receive(HandleQuery); + Receive(HandleStatusQuery); + Receive(HandleDetailRequest); + Receive(HandleRetry); + Receive(HandleDiscard); + Receive(HandleKpiRequest); + Receive(HandlePerSiteKpiRequest); + } +} +``` + +`PreStart` starts two periodic Akka timers: `DispatchTick` at `DispatchInterval` and `PurgeTick` at `PurgeInterval`. A lifecycle-scoped `CancellationTokenSource` (`_shutdownCts`) is created in `PreStart` and cancelled in `PostStop` so any in-flight SMTP send observes coordinated shutdown instead of blocking for a full connect/auth/send timeout. + +### Ingest path + +`HandleSubmit` maps a `NotificationSubmit` to a `Notification` entity and calls `PersistAsync`, which opens a fresh DI scope, resolves `INotificationOutboxRepository`, and calls `InsertIfNotExistsAsync`. The boolean result is intentionally ignored — an existing row is treated identically to a fresh insert. The async result is piped back to `Self` as `InternalMessages.IngestPersisted`, which carries the original `Sender` reference so the ack is sent from the actor thread: + +```csharp +private void HandleSubmit(NotificationSubmit msg) +{ + var sender = Sender; + var notification = BuildNotification(msg); + + PersistAsync(notification).PipeTo( + Self, + success: () => new InternalMessages.IngestPersisted( + msg.NotificationId, sender, Succeeded: true, Error: null), + failure: ex => new InternalMessages.IngestPersisted( + msg.NotificationId, sender, Succeeded: false, Error: ex.GetBaseException().Message)); +} +``` + +`NotificationSubmitAck(Accepted: true)` is returned for both a fresh insert and an existing row. Only a thrown repository error yields `Accepted: false`, causing the site to retain and retry its S&F message. + +### Dispatch loop + +On each `DispatchTick` the actor checks a boolean in-flight guard (`_dispatching`). If a sweep is already running the tick is silently dropped — sweeps never overlap. Otherwise the guard is raised and `RunDispatchPass` is launched: + +1. A scoped `INotificationOutboxRepository` fetches `Pending` rows and `Retrying` rows whose `NextAttemptAt ≤ now`, ordered by `CreatedAt` ascending, capped at `DispatchBatchSize`. +2. The retry policy (`maxRetries`, `retryDelay`) is resolved by reading `SmtpConfiguration` from `INotificationRepository`. Non-positive values are clamped to `FallbackMaxRetries = 10` and `FallbackRetryDelay = 1 min` with a warning log so a misconfigured SMTP row does not silently park notifications without retrying. +3. Each notification in the batch is delivered sequentially via `DeliverOneAsync`. Per-notification exceptions are caught and logged so one bad row never aborts the rest of the batch. + +`DispatchComplete` (singleton instance) is piped back to `Self` on both the success and failure projections, ensuring the in-flight guard is always cleared even if the sweep faults unexpectedly. + +### Delivery adapters + +`INotificationDeliveryAdapter` is the per-channel delivery seam: + +```csharp +public interface INotificationDeliveryAdapter +{ + NotificationType Type { get; } + Task DeliverAsync( + Notification notification, CancellationToken cancellationToken = default); +} +``` + +`DeliveryOutcome` carries a `DeliveryResult` enum (`Success`, `TransientFailure`, `PermanentFailure`), resolved recipients on success, and an error string on failure. The adapter map (`NotificationType → INotificationDeliveryAdapter`) is built lazily on the first dispatch sweep and cached in `_adaptersCache` for the actor's lifetime, paired with an actor-lifetime `IServiceScope` (`_adaptersScope`) disposed in `PostStop`. This avoids rebuilding the dictionary on every sweep while respecting each adapter's scoped DI dependencies. + +`EmailNotificationDeliveryAdapter` is the only registered adapter. It resolves the target list and recipients from `INotificationRepository` at delivery time (not at ingest), connects to SMTP via `ISmtpClientWrapper`, acquires an OAuth2 token if configured, and sends a BCC plain-text email. Error classification mirrors the External System Gateway pattern: + +| Exception | `DeliveryResult` | +|---|---| +| `SmtpPermanentException` (SMTP 5xx) | `PermanentFailure` | +| Connection/protocol/timeout errors | `TransientFailure` | +| `OperationCanceledException` (shutdown) | propagated, not classified | +| Missing list, empty recipient list, no SMTP config, invalid TLS mode | `PermanentFailure` | +| Unclassified (e.g. OAuth2 token fetch failure) | `PermanentFailure` | + +### Status transitions in `DeliverOneAsync` + +```csharp +switch (outcome.Result) +{ + case DeliveryResult.Success: + notification.Status = NotificationStatus.Delivered; + notification.DeliveredAt = now; + notification.ResolvedTargets = outcome.ResolvedTargets; + break; + + case DeliveryResult.TransientFailure: + notification.RetryCount++; + notification.LastError = outcome.Error; + if (notification.RetryCount >= maxRetries) + notification.Status = NotificationStatus.Parked; + else + { + notification.Status = NotificationStatus.Retrying; + notification.NextAttemptAt = now + retryDelay; + } + break; + + case DeliveryResult.PermanentFailure: + notification.Status = NotificationStatus.Parked; + notification.LastError = outcome.Error; + break; +} +await outboxRepository.UpdateAsync(notification, cancellationToken); +``` + +Every attempt also writes audit rows via `ICentralAuditWriter` (see Audit Integration below). Audit-write failure is caught, logged, and never propagates back into the dispatcher — the delivery outcome on the `Notifications` row stands regardless. + +### Audit integration + +Each delivery attempt emits two `AuditChannel.Notification` / `AuditKind.NotifyDeliver` rows via `ICentralAuditWriter`: + +- An `AuditStatus.Attempted` row (always, per attempt), carrying attempt duration in milliseconds. +- A terminal row (`Delivered`, `Parked`, or `Discarded`) when the post-outcome status is terminal. + +`CorrelationId` on both rows is parsed from the `NotificationId` GUID. `ExecutionId` and `ParentExecutionId` are echoed from `Notification.OriginExecutionId` / `Notification.OriginParentExecutionId`, linking the central `NotifyDeliver` rows to the site-emitted `NotifySend` row for the same script run. The `Actor` field is `"system"` — there is no authenticated user at dispatch time. + +Manual discard via `HandleDiscard` also emits a terminal `Discarded` row (with a null error, because the discard is operator-driven, not a delivery failure). + +### Purge + +`HandlePurgeTick` fires daily at `PurgeInterval`. `RunPurgePass` opens a scoped `INotificationOutboxRepository` and calls `DeleteTerminalOlderThanAsync(cutoff)`, where `cutoff = now − TerminalRetention` (default 365 days). The deleted count is logged at `Information`. Purge faults are caught internally so the returned task never faults. + +## Usage + +The outbox is consumed through two DI seams. + +**Ingest** — the Host registers `NotificationOutboxActor` as an Akka cluster singleton and with the `ClusterClientReceptionist`. Site clusters send `NotificationSubmit` messages through Central–Site Communication; the actor ingests them without further configuration by the caller. + +**Operator actions / UI queries** — the Central UI's Notification Outbox page and the ManagementActor resolve the singleton `IActorRef` and send query or command messages: + +| Message | Actor response | Allowed when | +|---|---|---| +| `NotificationOutboxQueryRequest` | `NotificationOutboxQueryResponse` | Any time | +| `NotificationStatusQuery` | `NotificationStatusResponse` | Any time | +| `NotificationDetailRequest` | `NotificationDetailResponse` | Any time | +| `RetryNotificationRequest` | `RetryNotificationResponse` | Row is `Parked` | +| `DiscardNotificationRequest` | `DiscardNotificationResponse` | Row is `Parked` | +| `NotificationKpiRequest` | `NotificationKpiResponse` | Any time | +| `PerSiteNotificationKpiRequest` | `PerSiteNotificationKpiResponse` | Any time | + +Retry resets the notification to `Pending` with `RetryCount = 0`, `NextAttemptAt = null`, and `LastError = null` so the dispatch loop reclaims it on the next sweep. Discard transitions to terminal `Discarded` and emits the corresponding audit row. + +## Configuration + +Options are bound from `ScadaBridge:NotificationOutbox` via `NotificationOutboxOptions`: + +| Key | Default | Description | +|---|---|---| +| `DispatchInterval` | `00:00:10` (10 s) | How often the dispatcher polls for due rows. | +| `DispatchBatchSize` | `100` | Maximum notifications claimed per sweep. | +| `StuckAgeThreshold` | `00:10:00` (10 min) | Age beyond which a non-terminal row is counted as stuck in KPIs and the UI badge. Display-only; does not affect dispatcher behaviour. | +| `TerminalRetention` | `365` days | How long terminal rows are kept before the daily purge removes them. | +| `PurgeInterval` | `1` day | Cadence of the background purge sweep. | +| `DeliveredKpiWindow` | `00:01:00` (1 min) | Trailing window for the "delivered last interval" throughput KPI. | + +Delivery retry policy (`MaxRetries`, `RetryDelay`) is read at runtime from `SmtpConfiguration` via `INotificationRepository`, not from `NotificationOutboxOptions`. Non-positive values are clamped to `FallbackMaxRetries = 10` and `FallbackRetryDelay = 1 min` with a `Warning` log. + +## Dependencies & Interactions + +- [Commons (#16)](./Commons.md) — owns `Notification`, `NotificationStatus`, `NotificationType`, `INotificationOutboxRepository`, `INotificationRepository`, and all message contracts (`NotificationSubmit`, `NotificationSubmitAck`, `NotificationStatusQuery`, `NotificationKpiRequest`, and their responses). Also owns `ScadaBridgeAuditEventFactory` and the `AuditChannel`/`AuditKind`/`AuditStatus` enums used to build dispatch audit rows. +- [Configuration Database (#17)](./ConfigurationDatabase.md) — registers the scoped `INotificationOutboxRepository` (the central `dbo.Notifications` table) and `INotificationRepository` (notification-list, recipient, and SMTP configuration tables). Central hosts must call `AddConfigurationDatabase` before `AddNotificationOutbox`. +- [Notification Service (#8)](./StoreAndForward.md) — supplies `ISmtpClientWrapper`, `OAuth2TokenService`, `NotificationOptions`, `SmtpTlsModeParser`, `SmtpErrorClassifier`, and the `SmtpPermanentException` type. `AddNotificationOutbox` relies on `AddNotificationService` being called by the Host to register these shared SMTP primitives; registering them twice would duplicate them. +- [Central–Site Communication (#5)](./Communication.md) — carries `NotificationSubmit` / `NotificationSubmitAck` between sites and central via ClusterClient, and `NotificationStatusQuery` / `NotificationStatusResponse` for the `Notify.Status` round-trip. +- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — the site-side component that durably buffers notifications in SQLite and retries forwarding until central acks. The outbox is the receiving end of the S&F handoff. +- [Audit Log (#23)](./AuditLog.md) — the outbox is a central direct-write caller of `ICentralAuditWriter`. It emits `NotifyDeliver` rows (Attempted + terminal) per delivery attempt and per operator Discard. The upstream `NotifySend` row is emitted by the site and arrives at central via standard audit telemetry. +- [Health Monitoring (#11)](./Host.md) — polls `NotificationKpiRequest` / `PerSiteNotificationKpiRequest` for the headline KPI tiles on the health dashboard (queue depth, stuck count, parked count). These are central-computed from the `Notifications` table and are separate from the site S&F backlog metric. +- [Central UI (#9)](./Host.md) — hosts the Notification Outbox page: KPI tiles, a queryable/filterable notification list, per-row Retry/Discard actions on parked notifications, and a stuck-row badge. + +## Troubleshooting + +### Notifications stuck in `Pending` + +A notification stays `Pending` when the dispatch loop is not running or is failing silently. Check for `"Dispatch sweep failed"` at `Error` level in the central node logs. The most common cause is a missing or misconfigured `SmtpConfiguration` row, which the adapter surfaces as a `PermanentFailure` and parks the notification immediately. A `Warning` log line naming `SmtpConfiguration.MaxRetries` or `SmtpConfiguration.RetryDelay` being non-positive indicates the retry policy was clamped — correct the SMTP configuration row. + +### Notifications parked with "no delivery adapter for type" + +The actor parks a notification immediately and logs this message when `NotificationType` has a value for which no `INotificationDeliveryAdapter` is registered. Currently only `Email` has an adapter; future channel types must be registered in `AddNotificationOutbox` before notifications of that type are submitted. + +### Dispatch loop wedged (guard stuck `true`) + +The boolean `_dispatching` guard is cleared by `DispatchComplete`, which is piped even on a faulted sweep. If the actor itself stops and restarts, `PreStart` reinitialises the timers and the guard resets. A wedged guard without a restart indicates the `PipeTo` callback is never completing — examine logs around `"Dispatch sweep faulted unexpectedly"`. + +### SMTP credentials appearing in logs + +`EmailNotificationDeliveryAdapter` runs `CredentialRedactor.Scrub` on all exception messages before logging. If credential strings appear in logs the SMTP exception message is not being routed through the `catch` filters in `DeliverAsync` — ensure the exception type is reachable by one of the three `catch` blocks and not escaping before scrubbing. + +### Failover mid-delivery + +A central failover while a delivery attempt is in flight leaves the row in its pre-attempt status (`Pending` or `Retrying`). The new active node picks it up on the next dispatch tick. One notification may be re-sent to SMTP as a result — this is an accepted trade-off, consistent with the at-least-once guarantee the S&F Engine already provides. + +## Related Documentation + +- [Notification Outbox design specification](../requirements/Component-NotificationOutbox.md) +- [Audit Log](./AuditLog.md) +- [Commons](./Commons.md) +- [Configuration Database](./ConfigurationDatabase.md) +- [Central–Site Communication](./Communication.md) +- [Store-and-Forward Engine](./StoreAndForward.md) +- [Health Monitoring](./Host.md) diff --git a/docs/components/NotificationService.md b/docs/components/NotificationService.md new file mode 100644 index 00000000..744684e7 --- /dev/null +++ b/docs/components/NotificationService.md @@ -0,0 +1,228 @@ +# Notification Service + +The Notification Service is the central-only component that owns notification-list and SMTP definitions, and supplies the per-channel `INotificationDeliveryAdapter` implementations that the Notification Outbox invokes at delivery time. Sites never deliver notifications; they store-and-forward notification payloads to central, where this component's adapters perform all actual SMTP sends. + +## Overview + +Notification Service (#8) runs on the central cluster only. Its responsibilities split cleanly into two layers: + +- **Definitions** — `NotificationList` and `SmtpConfiguration` entities stored in the central Configuration Database. Notification lists carry a `NotificationType` discriminator (`Email` now; additional types such as `Teams` are planned). Lists and SMTP config are never deployed to sites. +- **Delivery adapters** — stateless, per-type implementations of `INotificationDeliveryAdapter`. The Notification Outbox selects the adapter matching a notification's `Type`, calls `DeliverAsync`, and receives a three-way `DeliveryOutcome` (`Success` / `TransientFailure` / `PermanentFailure`). The adapter owns the full recipient-resolution, connection, authentication, send, and disconnect sequence. + +The component code lives in `src/ZB.MOM.WW.ScadaBridge.NotificationService/`. The `EmailNotificationDeliveryAdapter` that consumes these primitives lives in `src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Delivery/`. + +## Key Concepts + +### Central-only delivery + +Before the current design, site nodes delivered notifications directly over SMTP. That arrangement required SMTP credentials and notification lists to be deployed to every site. The redesign inverts the path: a site script calls `Notify.To("list").Send(subject, body)`, receives a `NotificationId` GUID immediately, and the notification is store-and-forwarded to central. The Notification Outbox on central ingests it and calls the delivery adapter. Sites never open an SMTP connection. + +This means: +- Credential exposure is limited to the central cluster. +- List membership is resolved at delivery time, so a list change takes effect for all future deliveries without redeploying to sites. +- The SMTP `MaxConcurrentConnections` limit is enforced at a single point. + +### `NotificationType` discriminator + +`NotificationList.Type` is a `NotificationType` enum value (`Email` currently). The script API `Notify.To("listName")` is type-agnostic — the calling script does not reference a type. The Notification Outbox reads the type from the central database when it picks up the notification, then selects the matching adapter by `INotificationDeliveryAdapter.Type`. Adding a new delivery channel means adding a new adapter; existing scripts continue to work. + +### Per-delivery SMTP client lifetime + +`MailKitSmtpClientWrapper` wraps a single `MailKit.Net.Smtp.SmtpClient`. MailKit's client is not thread-safe and holds one TCP/TLS connection. The DI registration is therefore a **factory**, not a singleton wrapper: + +```csharp +services.AddSingleton>(_ => () => new MailKitSmtpClientWrapper()); +``` + +`EmailNotificationDeliveryAdapter.SendAsync` invokes the factory at the top of each delivery attempt, runs connect → authenticate → send → disconnect on the fresh wrapper, and disposes it in a `finally` block. Each delivery pays a full TCP+TLS handshake; this is the deliberate cost of avoiding shared connection state between concurrent outbox dispatches. The factory shape allows a future pooled implementation to be slotted in without changing callers. + +## Architecture + +### Primitives registered by `AddNotificationService` + +`ServiceCollectionExtensions.AddNotificationService` is the single DI entry point, called on the central composition root only: + +```csharp +public static IServiceCollection AddNotificationService(this IServiceCollection services) +{ + services.AddOptions() + .BindConfiguration("ScadaBridge:Notification"); + + services.AddHttpClient(); + services.AddSingleton(); + services.AddSingleton>(_ => () => new MailKitSmtpClientWrapper()); + + return services; +} +``` + +Three things are registered: the `NotificationOptions` fallback values, the `OAuth2TokenService` token cache, and the `ISmtpClientWrapper` factory. The `EmailNotificationDeliveryAdapter` itself is registered by `ZB.MOM.WW.ScadaBridge.NotificationOutbox`, which depends on this project. + +### `INotificationDeliveryAdapter` + +```csharp +public interface INotificationDeliveryAdapter +{ + NotificationType Type { get; } + Task DeliverAsync( + Notification notification, + CancellationToken cancellationToken = default); +} +``` + +The `DeliveryOutcome` record carries a `DeliveryResult` (`Success` / `TransientFailure` / `PermanentFailure`), `ResolvedTargets` (a snapshotted string of the concrete recipients, written to the `Notifications` audit row on success), and an `Error` string on failure. + +### Email delivery sequence + +`EmailNotificationDeliveryAdapter.DeliverAsync` runs this sequence, classifying every failure before returning: + +1. **Resolve list** — calls `INotificationRepository.GetListByNameAsync`. An unknown list returns `Permanent` immediately (the list was deleted; retrying cannot fix it). +2. **Resolve recipients** — calls `GetRecipientsByListIdAsync`. An empty list returns `Permanent`. +3. **Resolve SMTP config** — calls `GetAllSmtpConfigurationsAsync`, takes the first row. No config returns `Permanent`. +4. **Parse TLS mode** — `SmtpTlsModeParser.Parse(smtpConfig.TlsMode)`. An unrecognised string returns `Permanent` (config fault, not a transient network condition). +5. **Validate addresses** — `EmailAddressValidator.ValidateAddresses(fromAddress, recipients)`. A malformed address returns `Permanent`. +6. **Send** — calls the private `SendAsync`, which connect/auth/send/disconnects via a fresh `ISmtpClientWrapper`. + +`SendAsync` maps `SmtpCommandException` 5xx responses to `SmtpPermanentException`, then lets it propagate. `DeliverAsync` catches `SmtpPermanentException` → `Permanent`; SMTP 4xx / socket / protocol / timeout exceptions → `Transient` (via `SmtpErrorClassifier`); unclassified exceptions (e.g., OAuth2 token fetch failure) → `Permanent` (retrying a broken credential wastes token-endpoint calls). + +### SMTP error classification + +`SmtpErrorClassifier.Classify` uses MailKit's typed exceptions and the numeric `SmtpStatusCode` rather than message substring matching: + +```csharp +public static SmtpErrorClass Classify(Exception ex, CancellationToken cancellationToken) +{ + if (ex is OperationCanceledException && cancellationToken.IsCancellationRequested) + return SmtpErrorClass.Unknown; + + if (ex is SmtpCommandException command) + { + var code = (int)command.StatusCode; + if (code >= 400 && code < 500) return SmtpErrorClass.Transient; + if (code >= 500 && code < 600) return SmtpErrorClass.Permanent; + return SmtpErrorClass.Unknown; + } + + if (ex is SmtpProtocolException + or ServiceNotConnectedException + or SocketException + or TimeoutException) + return SmtpErrorClass.Transient; + + return SmtpErrorClass.Unknown; +} +``` + +A `Permanent` classification inside `SendAsync` is wrapped in `SmtpPermanentException` so the outer `DeliverAsync` catch filter can identify it cleanly. + +### OAuth2 token lifecycle + +`OAuth2TokenService.GetTokenAsync` fetches tokens for Microsoft 365 Client Credentials SMTP. Credentials are supplied as `tenantId:clientId:clientSecret`. Tokens are cached in a `ConcurrentDictionary` keyed by a SHA-256 hash of the credential string (NS-006), so distinct SMTP configurations never share a token. A per-credential `SemaphoreSlim` prevents thundering-herd refreshes. Tokens are refreshed 60 seconds before the reported `expires_in` expiry. Only the tenant is logged — the client secret and token value are never written to logs. + +### Credential redaction + +`CredentialRedactor.Scrub(text, credentials)` masks the full packed credential string and its trailing colon-component (password or `clientSecret`) in any text before it reaches a log line. Components shorter than 12 characters are not masked — a short username such as `root` would otherwise mask unrelated diagnostic text. All SMTP error paths in `EmailNotificationDeliveryAdapter` pass exception messages through `Scrub` before logging. + +## Usage + +### Script API + +Site scripts do not interact with this component directly. The script surface is: + +```csharp +// Returns a NotificationId immediately — does not block for delivery. +NotificationId id = Notify.To("Shift-Supervisors").Send("Tank overflow", "Tank T-03 is at 98%"); + +// Site-local while still in the S&F buffer; round-trips to central once forwarded. +NotificationDeliveryStatus status = Notify.Status(id); +``` + +`Notify.To("list")` is type-agnostic. The `NotificationId` is a GUID generated at the site. `Notify.Status` returns a `NotificationDeliveryStatus` record with `Status` (`Forwarding` site-local, or `Pending` / `Retrying` / `Delivered` / `Parked` / `Discarded` from central), `RetryCount`, `LastError`, and `DeliveredAt`. + +### Registering the adapter + +On the central host, both projects are registered. The Notification Outbox registers `EmailNotificationDeliveryAdapter` as a keyed or enumerated `INotificationDeliveryAdapter` and calls `AddNotificationService` to get its dependencies: + +```csharp +// Central composition root (simplified) +services.AddNotificationService(); +services.AddNotificationOutbox(); // registers EmailNotificationDeliveryAdapter +``` + +## Configuration + +`NotificationOptions` is bound from `ScadaBridge:Notification`. These values are **fallbacks** — when a `SmtpConfiguration` row has a non-positive value for a field, the adapter uses the option value instead. A positive value on the row always takes precedence. + +| Section | Key | Default | Description | +|---------|-----|---------|-------------| +| `ScadaBridge:Notification` | `ConnectionTimeoutSeconds` | `30` | SMTP connection/operation timeout in seconds. Applied when `SmtpConfiguration.ConnectionTimeoutSeconds` is zero or negative. | +| `ScadaBridge:Notification` | `MaxConcurrentConnections` | `5` | Maximum concurrent SMTP connections. Used as the documented default; enforcement is in `EmailNotificationDeliveryAdapter`. | + +SMTP retry settings (`MaxRetries`, `RetryDelay`) live on the `SmtpConfiguration` entity and are read by the Notification Outbox dispatcher — they are not part of `NotificationOptions`. + +### `SmtpConfiguration` entity fields + +| Field | Type | Notes | +|-------|------|-------| +| `Host` | `string` | SMTP server hostname or IP. | +| `Port` | `int` | e.g., 587 for StartTLS, 465 for SSL. | +| `AuthType` | `string` | `basic` or `oauth2`. | +| `Credentials` | `string?` | Basic: `username:password`. OAuth2: `tenantId:clientId:clientSecret`. | +| `TlsMode` | `string?` | `None`, `StartTLS`, or `SSL`. Null/empty defaults to `StartTls`. | +| `FromAddress` | `string` | Sender address in the From header. Also the XOAUTH2 `user=` identity for M365. | +| `ConnectionTimeoutSeconds` | `int` | 0 → falls back to `NotificationOptions`. | +| `MaxConcurrentConnections` | `int` | 0 → falls back to `NotificationOptions`. | +| `MaxRetries` | `int` | Read by Notification Outbox. | +| `RetryDelay` | `TimeSpan` | Read by Notification Outbox. | + +### `NotificationList` entity fields + +| Field | Type | Notes | +|-------|------|-------| +| `Name` | `string` | Unique list name. Passed as `Notify.To("name")`. | +| `Type` | `NotificationType` | Enum discriminator. Currently `Email` only. | +| `Recipients` | `ICollection` | Resolved at delivery time by the adapter. | + +Each `NotificationRecipient` carries `Name` (display) and `EmailAddress`. + +## Dependencies & Interactions + +- [Commons (#16)](./Commons.md) — owns `NotificationList`, `NotificationRecipient`, `SmtpConfiguration`, `Notification`, `NotificationType`, `NotificationStatus`, `INotificationRepository`, and the `NotificationSubmit` / `NotificationSubmitAck` / `NotificationStatusQuery` / `NotificationStatusResponse` / `NotificationDeliveryStatus` message contracts. +- [Configuration Database (#17)](./ConfigurationDatabase.md) — persists `NotificationList`, `NotificationRecipient`, and `SmtpConfiguration`. Implements `INotificationRepository`. The `EmailNotificationDeliveryAdapter` resolves lists and recipients via this repository at delivery time. +- [Notification Outbox (#21)](./NotificationOutbox.md) — the central dispatch counterpart. The Notification Outbox registers `EmailNotificationDeliveryAdapter`, drives retry and parking, and owns the `Notifications` audit table. Notification Service supplies the SMTP primitives (`ISmtpClientWrapper` factory, `OAuth2TokenService`, `SmtpErrorClassifier`, `CredentialRedactor`, `EmailAddressValidator`); Notification Outbox owns when and how often `DeliverAsync` is called. +- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — site-side buffer. Site scripts hand notifications to the S&F engine, which forwards them to central. The Notification Service has no direct interaction with the site S&F engine; by the time `DeliverAsync` is called, the notification has already been ingested by the Notification Outbox. +- [Security & Auth (#10)](./Security.md) — Design role is required to manage notification lists and SMTP configuration. +- Design spec: [Component-NotificationService.md](../requirements/Component-NotificationService.md). + +## Troubleshooting + +### A notification is Parked with a permanent failure + +A `PermanentFailure` outcome means `EmailNotificationDeliveryAdapter` determined that retrying cannot fix the failure. Common root causes: + +| Symptom | Cause | Fix | +|---------|-------|-----| +| "Notification list '…' not found" | List was renamed or deleted after the notification was submitted. | Recreate the list or discard the notification in the Central UI Outbox page. | +| "Notification list '…' has no recipients" | List exists but has no recipient rows. | Add recipients to the list. | +| "No SMTP configuration available" | No `SmtpConfiguration` row exists. | Add an SMTP configuration in Central UI. | +| "Unknown SMTP TLS mode '…'" | `TlsMode` field contains a value other than `None`, `StartTLS`, or `SSL`. | Correct the `TlsMode` value. | +| "Invalid sender (from) email address" or "Invalid recipient email address(es)" | Malformed address in the `SmtpConfiguration.FromAddress` or in a `NotificationRecipient.EmailAddress`. | Correct the address; the adapter validates via `MailboxAddress.TryParse`. | +| SMTP 5xx reply | Server rejected the message permanently (e.g., mailbox not found, policy block). | Check the `LastError` field on the `Notifications` row. The error text has credentials redacted. | +| OAuth2 credential parse error | `Credentials` field is not in `tenantId:clientId:clientSecret` format. | Correct the credentials on the SMTP configuration. | + +### Transient failures retrying indefinitely + +The retry count and delay come from `SmtpConfiguration.MaxRetries` and `RetryDelay`, enforced by the Notification Outbox. Once `MaxRetries` is exhausted, the Notification Outbox moves the row to `Parked`. If a notification stays in `Retrying` longer than expected, check whether `MaxRetries` is set to a non-zero value on the `SmtpConfiguration` row and that the Notification Outbox actor is running on the active central node. + +### OAuth2 token not refreshing + +`OAuth2TokenService` caches tokens per credential hash. A singleton restart resets the cache; the next `GetTokenAsync` call fetches a fresh token. If token fetches fail repeatedly (network partition to `login.microsoftonline.com`, wrong tenant/client/secret), the failure surfaces as an unclassified exception in `DeliverAsync` and the notification is parked as permanent. The log line includes the tenant ID but not the secret. + +## Related Documentation + +- [Notification Service design specification](../requirements/Component-NotificationService.md) +- [Notification Outbox](./NotificationOutbox.md) +- [Commons](./Commons.md) +- [Configuration Database](./ConfigurationDatabase.md) +- [Store-and-Forward Engine](./StoreAndForward.md) +- [Security & Auth](./Security.md) diff --git a/docs/components/SiteCallAudit.md b/docs/components/SiteCallAudit.md new file mode 100644 index 00000000..bd818f32 --- /dev/null +++ b/docs/components/SiteCallAudit.md @@ -0,0 +1,254 @@ +# Site Call Audit + +Site Call Audit (#22) is a central-only observability component that maintains an eventually-consistent mirror of every cached call — `ExternalSystem.CachedCall()` and `Database.CachedWrite()` — issued by site scripts. It ingests lifecycle telemetry from sites into the central `SiteCalls` MS SQL table, computes point-in-time KPIs, and relays operator Retry/Discard actions back to the owning site. It does not deliver anything: cached-call execution stays entirely site-local. + +## Overview + +The component lives in `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/` and runs only on the central cluster. Its single class, `SiteCallAuditActor`, is an Akka.NET `ReceiveActor` deployed as a `ClusterSingletonManager`-managed singleton on the active central node. + +Telemetry reaches central through the shared `CachedCallTelemetry` packet (see [Audit Log](./AuditLog.md)), which carries both an `AuditEvent` for the `AuditLog` table and a `SiteCallOperational` snapshot for the `SiteCalls` table. The `AuditLogIngestActor` (Audit Log #23) writes both in a single MS SQL transaction when it receives an `IngestCachedTelemetryCommand`; it then tells `SiteCallAuditActor` an `UpsertSiteCallCommand` so the `SiteCalls` row is always consistent with the paired audit row. The `SiteCallAuditActor` is therefore an ingest target, not a transport; it never constructs telemetry packets and never decides what gets delivered. + +Sites remain the source of truth. `Tracking.Status()` is answered site-locally from the site SQLite tracking store; the central `SiteCalls` row is what the Central UI Site Calls page reads — it may lag by one telemetry cycle. + +## Key Concepts + +### Mirror, not dispatcher + +The Notification Outbox (#21) ingests notifications and dispatches them centrally. Site Call Audit is different: the Store-and-Forward Engine on each site performs all retry and delivery attempts against the site's locally reachable external systems and databases. Central cannot reach those systems. The `SiteCalls` table is read-only from central's perspective — operators can view it and trigger Retry/Discard actions, but the actions are forwarded to the site; central never mutates the mirror row directly. + +### Monotonic upsert idempotency + +The `SiteCalls` table holds one row per `TrackedOperationId`. `ISiteCallAuditRepository.UpsertAsync` implements insert-if-not-exists followed by a conditional update that only applies when the incoming status has a strictly higher rank than the stored status: + +``` +Submitted=0, Forwarded=1, Attempted=2, Skipped=2, +Delivered=3, Failed=3, Parked=3, Discarded=3 +``` + +Out-of-order telemetry, duplicate gRPC packets, and future reconciliation pulls therefore all feed the same writer safely — status never regresses. + +### Stuck calls + +A non-terminal row (`TerminalAtUtc IS NULL`) created before `now − StuckAgeThreshold` (default 10 minutes) is classified as stuck. Stuck is display-only: surfaced as a `StuckCount` KPI and a row badge in the UI. There is no escalation or alerting. + +## Architecture + +### `SiteCallAuditActor` + +`SiteCallAuditActor` is a `ReceiveActor` with two constructors: + +- **Production**: receives `IServiceProvider` and opens a fresh DI scope per message to resolve the scoped EF Core `ISiteCallAuditRepository`. This mirrors `AuditLogIngestActor`'s pattern — a long-lived singleton cannot hold a scope across messages. +- **Test**: receives a concrete `ISiteCallAuditRepository` and reuses it across all messages, allowing integration tests to run against a real MS SQL fixture without DI scaffolding. + +The actor catches all repository exceptions in its write path and replies `Accepted=false` without rethrowing, keeping the singleton alive across storage faults. The `SupervisorStrategy` override (one-for-one, `maxNrOfRetries: 0`) governs any future children — the actor currently has none. + +```csharp +private async Task OnUpsertAsync(UpsertSiteCallCommand cmd) +{ + var replyTo = Sender; + var id = cmd.SiteCall.TrackedOperationId; + + IServiceScope? scope = null; + ISiteCallAuditRepository repository; + if (_injectedRepository is not null) + { + repository = _injectedRepository; + } + else + { + scope = _serviceProvider!.CreateScope(); + repository = scope.ServiceProvider.GetRequiredService(); + } + + try + { + var siteCall = cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }; + await repository.UpsertAsync(siteCall).ConfigureAwait(false); + replyTo.Tell(new UpsertSiteCallReply(id, Accepted: true)); + } + catch (Exception ex) + { + _logger.LogError(ex, "SiteCallAudit upsert failed for {TrackedOperationId}", id); + replyTo.Tell(new UpsertSiteCallReply(id, Accepted: false)); + } + finally + { + scope?.Dispose(); + } +} +``` + +`IngestedAtUtc` is always stamped at central-side persist time, not carried from the site. This ensures the column reflects when central last processed the row, not when the site emitted it. + +### Message handlers + +| Message | Direction | Handler | +|---|---|---| +| `UpsertSiteCallCommand` | Central ingest → actor | `OnUpsertAsync` — scope-per-message upsert | +| `SiteCallQueryRequest` | Central UI → actor | `HandleQuery` — keyset-paged query, max 200 rows | +| `SiteCallDetailRequest` | Central UI → actor | `HandleDetail` — single-row full detail | +| `SiteCallKpiRequest` | Central UI → actor | `HandleKpi` — global KPI snapshot | +| `PerSiteSiteCallKpiRequest` | Central UI → actor | `HandlePerSiteKpi` — per-site KPI list | +| `RetrySiteCallRequest` | Central UI → actor | `HandleRetrySiteCall` — relay to site | +| `DiscardSiteCallRequest` | Central UI → actor | `HandleDiscardSiteCall` — relay to site | +| `RegisterCentralCommunication` | Host → actor | Wires the `CentralCommunicationActor` transport | + +All read handlers capture `Sender` before the first `await` and use `PipeTo` to return the response — the standard Akka pattern for async ask-reply handlers. + +### `SiteCalls` table + +One row per `TrackedOperationId` in the central MS SQL configuration database. Key columns: + +| Column | Type | Notes | +|---|---|---| +| `TrackedOperationId` | `uniqueidentifier` | PK; GUID stamped site-side at call time | +| `Channel` | `varchar` | `"ApiOutbound"` or `"DbOutbound"` | +| `Target` | `varchar` | Human-readable target, e.g. `"ERP.GetOrder"` | +| `SourceSite` | `varchar` | Site that issued the call | +| `SourceNode` | `varchar NULL` | `node-a` / `node-b`; nullable for retired nodes | +| `Status` | `varchar` | String form of `AuditStatus`; monotonic | +| `RetryCount` | `int` | Dispatch attempts so far | +| `LastError` | `varchar NULL` | Most recent error text | +| `HttpStatus` | `int NULL` | Last HTTP response code (API calls only) | +| `CreatedAtUtc` | `datetime2` | First submit timestamp from site | +| `UpdatedAtUtc` | `datetime2` | Latest site-side status mutation | +| `TerminalAtUtc` | `datetime2 NULL` | Set when status reaches a terminal rank | +| `IngestedAtUtc` | `datetime2` | Central-side stamp, updated on every upsert | + +Unlike the `AuditLog` table, `SiteCalls` is a standard non-partitioned table on `[PRIMARY]` holding mutable operational state. No DB-role restriction applies; it is updated in place by the upsert. + +### Status lifecycle + +``` +Submitted → Forwarded → Attempted ──→ Delivered (terminal, success) + └──→ Parked (non-terminal, awaiting operator action) + └──→ Failed (terminal, permanent failure) + └──→ Discarded (terminal, operator-initiated on Parked row) +``` + +`Failed` rows are not operator-actionable — a permanent failure (e.g. HTTP 4xx) would fail again, and the error was already returned synchronously to the calling script. Only `Parked` rows support Retry and Discard. + +### Retry / Discard relay + +The `CentralCommunicationActor` is wired into `SiteCallAuditActor` after both actors exist, via `RegisterCentralCommunication`. Until registration completes, any relay request receives an immediate `SiteCallRelayOutcome.SiteUnreachable` outcome — there is genuinely no route to any site. + +When `_centralCommunication` is set, the relay handler wraps the command in a `SiteEnvelope` keyed to `SourceSite` and Asks the `CentralCommunicationActor`, which routes it over the per-site `ClusterClient`: + +```csharp +private void HandleRetrySiteCall(RetrySiteCallRequest request) +{ + var sender = Sender; + + if (_centralCommunication is null) + { + sender.Tell(UnreachableRetry(request.CorrelationId)); + return; + } + + var relay = new RetryParkedOperation( + request.CorrelationId, new TrackedOperationId(request.TrackedOperationId)); + var envelope = new SiteEnvelope(request.SourceSite, relay); + + _centralCommunication.Ask(envelope, _options.RelayTimeout) + .PipeTo( + sender, + success: ack => MapRetryResponse(request.CorrelationId, ack), + failure: ex => MapRetryFailure(request.CorrelationId, request.SourceSite, ex)); +} +``` + +The site applies `RetryParkedOperation` / `DiscardParkedOperation` to its own Store-and-Forward buffer and returns a `ParkedOperationActionAck`. The actor maps the ack to a `SiteCallRelayOutcome`: + +| Ack | Outcome | +|---|---| +| `Applied=true` | `Applied` | +| `Applied=false`, no error | `NotParked` — site had nothing to do | +| `Applied=false`, error present | `OperationFailed` — site faulted | +| Ask timeout / no route | `SiteUnreachable` | + +`SiteUnreachable` is distinguished from `OperationFailed` because central is a mirror — a relay that never reached the site is a transient transport condition, not an operation failure. The UI surfaces "site unreachable" so operators know to retry once the site is back online. + +The corrected cached-call state flows back to central via the normal telemetry path after the site applies the action. Central never writes the `SiteCalls` row to reflect a relay outcome directly. + +### KPI computation + +KPIs are computed point-in-time against the `SiteCalls` table by `ISiteCallAuditRepository.ComputeKpisAsync` and `ComputePerSiteKpisAsync`. All aggregation is server-side; no rows are materialised. The actor derives the cutoff timestamps from `SiteCallAuditOptions` before calling the repository: + +```csharp +private void HandleKpi(SiteCallKpiRequest request) +{ + var sender = Sender; + var now = DateTime.UtcNow; + var stuckCutoff = now - _options.StuckAgeThreshold; + var intervalSince = now - _options.KpiInterval; + + KpiAsync(request.CorrelationId, stuckCutoff, intervalSince).PipeTo( + sender, + success: response => response, + failure: ex => new SiteCallKpiResponse( + request.CorrelationId, Success: false, + ErrorMessage: ex.GetBaseException().Message, + BufferedCount: 0, ParkedCount: 0, FailedLastInterval: 0, + DeliveredLastInterval: 0, OldestPendingAge: null, StuckCount: 0)); +} +``` + +The `SiteCallKpiSnapshot` shape mirrors `NotificationKpiSnapshot` so the Central UI dashboard can reuse the same tile layout for both components. + +## Usage + +The actor accepts only Akka messages — there is no public API beyond the message protocol defined in Commons. The Central UI's Site Calls page sends `SiteCallQueryRequest` / `SiteCallKpiRequest` / `PerSiteSiteCallKpiRequest` / `SiteCallDetailRequest` through `CommunicationService`, which Asks the singleton and awaits `SiteCallQueryResponse` / `SiteCallKpiResponse` / `PerSiteSiteCallKpiResponse` / `SiteCallDetailResponse`. + +The ingest path is driven by `AuditLogIngestActor.OnCachedTelemetryAsync`, which tells an `UpsertSiteCallCommand` after committing the dual-write transaction. The `SiteCallAuditActor` does not need to coordinate with `AuditLogIngestActor` — the transaction guarantees the `AuditLog` row always precedes the upsert command. + +Registration is via `ServiceCollectionExtensions.AddSiteCallAudit`, which binds `SiteCallAuditOptions` from the `ScadaBridge:SiteCallAudit` configuration section. The actor `Props` and the `ClusterSingletonManager` registration are wired in the Host's central-role composition. + +## Configuration + +`SiteCallAuditOptions` is bound from the `ScadaBridge:SiteCallAudit` section. + +| Key | Default | Description | +|---|---|---| +| `StuckAgeThreshold` | `00:10:00` | Age past which a non-terminal row is counted as stuck. Display-only; no escalation. | +| `KpiInterval` | `00:01:00` | Trailing window for `DeliveredLastInterval` and `FailedLastInterval` KPIs. | +| `RelayTimeout` | `00:00:10` | Ask timeout for the central→site Retry/Discard relay. Must be less than `CommunicationOptions.QueryTimeout` (default 30 s) so the inner relay times out first and returns the distinct `SiteUnreachable` outcome. | + +## Dependencies & Interactions + +- [Commons (#16)](./Commons.md) — owns `SiteCall`, `SiteCallOperational`, `TrackedOperationId`, `SiteCallAuditOptions`-adjacent types (`SiteCallKpiSnapshot`, `SiteCallSiteKpiSnapshot`, `SiteCallQueryFilter`, `SiteCallPaging`), all message contracts (`UpsertSiteCallCommand`, `UpsertSiteCallReply`, `SiteCallQueryRequest/Response`, `SiteCallDetailRequest/Response`, `SiteCallKpiRequest/Response`, `PerSiteSiteCallKpiRequest/Response`, `RetrySiteCallRequest/Response`, `DiscardSiteCallRequest/Response`, `SiteCallRelayOutcome`), and the `ISiteCallAuditRepository` interface. +- [Configuration Database (#17)](./ConfigurationDatabase.md) — implements `ISiteCallAuditRepository` against the central `dbo.SiteCalls` table. Central hosts must call `AddConfigurationDatabase` for the actor to resolve its scoped repository. +- [Audit Log (#23)](./AuditLog.md) — shares the `CachedCallTelemetry` packet. `AuditLogIngestActor.OnCachedTelemetryAsync` writes the `AuditLog` row and the `SiteCalls` upsert in a single MS SQL transaction, then tells `UpsertSiteCallCommand` to this actor. The two components coordinate via message-passing, not a shared service. +- [Central–Site Communication (#5)](./Communication.md) — the `CentralCommunicationActor` is the transport the relay handlers use. It is registered via `RegisterCentralCommunication` by the Host after both actors are running. `CommunicationService` also provides the async wrappers (`RetrySiteCallAsync`, `DiscardSiteCallAsync`) that the Central UI calls; those methods Ask the `SiteCallAuditActor` with the outer `CommunicationOptions.QueryTimeout`. +- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — site-side executor of `RetryParkedOperation` and `DiscardParkedOperation`. The site's S&F buffer is the source of truth for parked cached calls; it emits updated telemetry after applying an operator action. +- [Health Monitoring (#11)](./HealthMonitoring.md) — consumes `SiteCallKpiResponse` / `PerSiteSiteCallKpiResponse` to surface buffered count, parked count, stuck count, and throughput KPI tiles on the health dashboard alongside the Notification Outbox tiles. +- [Central UI (#9)](./Host.md) — the Site Calls page queries this actor for the paginated list, detail modal, and KPIs; it issues Retry/Discard actions that flow through `CommunicationService` to the relay handlers here. +- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — hosts the `SiteCallAuditActor` singleton with active/standby failover via `ClusterSingletonManager`. + +## Troubleshooting + +### Relay returns `SiteUnreachable` + +The `CentralCommunicationActor` could not route the command to the site — the site is offline, the `ClusterClient` route has not yet resolved, or the relay timed out waiting for a `ParkedOperationActionAck`. The `_options.RelayTimeout` (default 10 s) is the inner Ask timeout. The action was NOT applied. Retry the operator action once the site is back online; the `SiteCalls` mirror row will self-correct via telemetry after the site applies it. + +### Relay returns `NotParked` + +The site was reached but reported no parked row for the given `TrackedOperationId`. The call was likely already delivered, discarded, or transitioned out of `Parked` status between the operator clicking Retry/Discard and the relay arriving. No action is required; the telemetry will update the mirror row shortly. + +### Upsert replied `Accepted=false` + +The actor caught a repository exception and replied false to the caller without rethrowing. The central singleton remains alive. Check the structured log for a `SiteCallAudit upsert failed for {TrackedOperationId}` error with the exception detail. If the MS SQL configuration database is temporarily unavailable, the telemetry sender will retry on its next cycle (the at-least-once gRPC transport) or the future reconciliation pull will backfill the row. + +### `SiteCalls` rows not appearing + +Ingest flows through `AuditLogIngestActor.OnCachedTelemetryAsync`, which writes the `AuditLog` row and `SiteCalls` upsert in one transaction before telling `UpsertSiteCallCommand`. If that transaction fails, neither row is written. Check `AuditLog` ingest health first — a missing `AuditLog` row for the same `TrackedOperationId` confirms the telemetry never reached central, not that the `SiteCalls` upsert failed in isolation. + +## Related Documentation + +- [Site Call Audit design specification](../requirements/Component-SiteCallAudit.md) +- [Audit Log](./AuditLog.md) +- [Notification Outbox](./NotificationOutbox.md) +- [Configuration Database](./ConfigurationDatabase.md) +- [Central–Site Communication](./Communication.md) +- [Store-and-Forward Engine](./StoreAndForward.md) +- [Commons](./Commons.md) +- [Health Monitoring](./HealthMonitoring.md) diff --git a/docs/components/SiteEventLogging.md b/docs/components/SiteEventLogging.md new file mode 100644 index 00000000..d99753f9 --- /dev/null +++ b/docs/components/SiteEventLogging.md @@ -0,0 +1,234 @@ +# Site Event Logging + +The Site Event Logging component records operational events at each site cluster into a local SQLite database. Events are written by site actors on a fire-and-forget basis and are available for remote query from central, providing a diagnostic window into site runtime activity without coupling subsystems to a central store. + +## Overview + +Site Event Logging (#12) is a site-only write path that runs alongside the operational subsystems it observes. Unlike the Audit Log (#23), which spans the script trust boundary and flows to a central append-only table, the site event log is a local diagnostic store: it captures events that are useful for troubleshooting runtime behaviour (script failures, connection flapping, deployment outcomes) but are not part of a ledger that must survive failover or node replacement. + +The component code lives in `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/`: + +- `SiteEventLogger` — the singleton write path: one owned `SqliteConnection` behind a shared write lock, fed by a bounded `Channel` so actor threads never block on disk I/O. +- `EventLogQueryService` — executes queries against the local SQLite, filtering and paginating results for central requests. +- `EventLogHandlerActor` — Akka actor bridge that receives `EventLogQueryRequest` messages from the `SiteCommunicationActor` and returns `EventLogQueryResponse`. +- `EventLogPurgeService` — `BackgroundService` that enforces time-based retention and the storage cap on a configurable interval. +- `SiteEventLogOptions` — options class bound from `ScadaBridge:SiteEventLog`. + +The DI entry point is `ServiceCollectionExtensions.AddSiteEventLogging`, registered on site nodes by `SiteServiceRegistration`. `EventLogHandlerActor` is wired separately as a cluster singleton inside `AkkaHostedService` because it must be created inside the `ActorSystem`. + +## Key Concepts + +### Active-node-only writes + +Only the active site node generates and stores events. The standby's local SQLite receives no writes, so purging there is unnecessary. `EventLogPurgeService` consults an optional `SiteEventLogActiveNodeCheck` delegate on each tick; the Host registers the real check on site nodes, and the purge early-exits on the standby. When no delegate is registered (tests, non-clustered hosts), the purge runs on every tick, preserving pre-cluster behaviour. + +On failover, the newly active node starts logging to its own SQLite database. Historical events from the previous active node are not queryable until that node comes back online. This is acceptable because event logs are diagnostic, not transactional — a missing log tail after failover is not a data-integrity concern. + +### Event types and severity + +`ISiteEventLogger.LogEventAsync` accepts a free-form `eventType` string and one of three case-sensitive `severity` values: `"Info"`, `"Warning"`, or `"Error"`. Unknown severities are rejected at write time — the allowed set is enforced by a `HashSet` with `StringComparer.Ordinal`, matching the SQLite `BINARY` collation used by query filters so a stored value is never invisible to a later query. + +The `event_type` values used across site subsystems are: `script`, `alarm`, `deployment`, `connection`, `store_and_forward`, `instance_lifecycle`. + +### Non-blocking write path + +`LogEventAsync` validates its arguments and enqueues a `PendingEvent` onto a bounded `Channel`. The background writer loop drains it sequentially against the shared connection. The returned `Task` completes once the event is durably persisted and faults if the write fails, so a caller that awaits it can detect a dropped event. The caller is never blocked on disk I/O. + +### Keyset pagination + +Queries use keyset pagination: the caller supplies a nullable `ContinuationToken` (the `id` of the last row in the previous page), and the query appends `id > $afterId` so each page starts exactly after the previous one with no row-skipping or re-scanning. The response includes a new `ContinuationToken` and a `HasMore` flag. + +## Architecture + +### Schema and indexes + +`SiteEventLogger.InitializeSchema` sets `PRAGMA auto_vacuum = INCREMENTAL` before creating the table — this is required before any table exists for the mode to take effect, and it allows `PRAGMA incremental_vacuum` to reclaim free pages during cap-purge batches: + +```csharp +cmd.CommandText = """ + CREATE TABLE IF NOT EXISTS site_events ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + timestamp TEXT NOT NULL, + event_type TEXT NOT NULL, + severity TEXT NOT NULL, + instance_id TEXT, + source TEXT NOT NULL, + message TEXT NOT NULL, + details TEXT + ); + CREATE INDEX IF NOT EXISTS idx_events_timestamp ON site_events(timestamp); + CREATE INDEX IF NOT EXISTS idx_events_type ON site_events(event_type); + CREATE INDEX IF NOT EXISTS idx_events_instance ON site_events(instance_id); + CREATE INDEX IF NOT EXISTS idx_events_severity ON site_events(severity); + """; +``` + +Keyword search (`KeywordFilter`) runs as `LIKE '%…%' ESCAPE '\'` on `message` and `source`. A leading-wildcard `LIKE` cannot use a B-tree index, so keyword-only queries full-scan the table. All other filters (`event_type`, `severity`, `instance_id`, `timestamp`) are covered by the indexes above. + +### Connection lock + +`SiteEventLogger` owns one `SqliteConnection` that is not thread-safe. Every database access — writes from the background loop, reads from `EventLogQueryService`, deletes from `EventLogPurgeService` — must go through `WithConnection`, which serialises callers on a shared lock: + +```csharp +internal bool WithConnection(Action action) +{ + ArgumentNullException.ThrowIfNull(action); + lock (_writeLock) + { + if (_disposed) return false; + action(_connection); + return true; + } +} +``` + +`EventLogQueryService` and `EventLogPurgeService` both depend on the concrete `SiteEventLogger` rather than `ISiteEventLogger` to avoid a downcast that would throw `InvalidCastException` for any other implementation. + +### Write queue and drop behaviour + +The write queue is bounded at `WriteQueueCapacity` (default 10,000). Overflow uses `BoundedChannelFullMode.DropOldest`: when the queue is full, the oldest pending event is evicted, its completion `Task` is faulted with `InvalidOperationException`, and `FailedWriteCount` is incremented so the drop is observable. On any SQLite write error, `FailedWriteCount` is also incremented and the affected `Task` is faulted: + +```csharp +_writeQueue = Channel.CreateBounded( + new BoundedChannelOptions(capacity) + { + SingleReader = true, + SingleWriter = false, + FullMode = BoundedChannelFullMode.DropOldest, + }, + itemDropped: dropped => + { + Interlocked.Increment(ref _failedWriteCount); + dropped.Completion.TrySetException( + new InvalidOperationException( + $"Event was dropped because the write queue exceeded its bounded capacity ({capacity}).")); + }); +``` + +### Purge: retention and storage cap + +`EventLogPurgeService` runs two passes on each tick: + +1. **Retention purge** — deletes all rows where `timestamp < cutoff` (cutoff = `UtcNow` minus `RetentionDays`). A single `DELETE` statement; no batching needed. + +2. **Storage cap purge** — if the logical database size exceeds `MaxStorageMb`, deletes the oldest 1,000 rows per batch and calls `PRAGMA incremental_vacuum` after each batch to reclaim free pages. The loop stops when the size is within the cap, when no rows are deleted, or when the size fails to decrease across a batch (guards against a scenario where vacuuming cannot shrink the file): + +```csharp +cmd.CommandText = $""" + DELETE FROM site_events WHERE id IN ( + SELECT id FROM site_events ORDER BY id ASC LIMIT {CapPurgeBatchSize} + ) + """; +var rows = cmd.ExecuteNonQuery(); + +using var vacuumCmd = connection.CreateCommand(); +vacuumCmd.CommandText = "PRAGMA incremental_vacuum"; +vacuumCmd.ExecuteNonQuery(); +``` + +Logical size is measured as `(page_count - freelist_count) × page_size` so the cap loop observes reclaimed pages even before they are returned to the OS. + +A purge runs once on `BackgroundService` startup and then on each `PurgeInterval` tick. + +### Central query path + +Central queries arrive via the `SiteCommunicationActor`, which dispatches `EventLogQueryRequest` messages to the `EventLogHandlerActor` cluster singleton. The actor delegates immediately to `IEventLogQueryService.ExecuteQuery` and returns the `EventLogQueryResponse` to the sender synchronously, keeping the actor message loop unblocked while the read runs under the shared lock: + +```csharp +public class EventLogHandlerActor : ReceiveActor +{ + public EventLogHandlerActor(IEventLogQueryService queryService) + { + _queryService = queryService; + + Receive(msg => + { + var response = _queryService.ExecuteQuery(msg); + Sender.Tell(response); + }); + } +} +``` + +`EventLogQueryService` clamps the caller-supplied `PageSize` to `MaxQueryPageSize` (default 500) before building the query, so a central client that requests `int.MaxValue` cannot force the query to materialise the entire log into one list while holding the write lock. + +## Usage + +Callers resolve `ISiteEventLogger` from DI. Because the write is non-blocking and best-effort, site actors discard the returned `Task` with `_ =` rather than awaiting it on the hot path: + +```csharp +// ScriptExecutionActor — reporting a script failure +_ = siteEventLogger?.LogEventAsync( + "script", "Error", instanceName, $"ScriptActor:{scriptName}", errorMsg, ex.ToString()); + +// DataConnectionActor — reporting a connection loss +_ = _siteEventLogger.LogEventAsync( + "connection", "Warning", null, _connectionName, + $"Connection lost — entering reconnect cycle", null); + +// DataConnectionActor — reporting a reconnection +_ = _siteEventLogger.LogEventAsync( + "connection", "Info", null, _connectionName, + $"Connection restored on {_activeEndpoint} endpoint", null); +``` + +The `source` argument uses the convention `"ActorType:Name"` (e.g. `"ScriptActor:MonitorSpeed"`, `"DataConnectionActor:PLC1"`). The `details` field carries any supplemental context — stack traces, compiler output, thresholds — as free-form text; JSON is conventional but not validated. + +Callers that need to confirm persistence — rare in production, common in tests — can await the returned `Task` and handle a faulted result. + +## Configuration + +Options are bound from the `ScadaBridge:SiteEventLog` section by `SiteServiceRegistration`. + +| Key | Default | Description | +|-----|---------|-------------| +| `RetentionDays` | `30` | Days before events are deleted by the retention purge. | +| `MaxStorageMb` | `1024` | Maximum logical database size in MB. Oldest rows are deleted in 1,000-row batches when exceeded. | +| `DatabasePath` | `site_events.db` | File path for the SQLite database. | +| `QueryPageSize` | `500` | Default page size when the caller does not supply one. | +| `MaxQueryPageSize` | `500` | Hard upper bound on caller-supplied page sizes. Silent clamp. | +| `PurgeInterval` | `24h` (`TimeSpan`) | Interval between purge ticks. An initial purge also runs on service startup. | +| `WriteQueueCapacity` | `10000` | Bounded write-queue capacity. Overflow evicts oldest with `DropOldest`. | + +The docker cluster appsettings (`ScadaBridge:SiteEventLog`) sets `RetentionDays: 30` and `MaxStorageMb: 1024`, matching the code defaults. `PurgeScheduleCron` appears in those files as a vestigial key; the actual purge cadence is driven by `PurgeInterval` in the options class, not a cron expression. + +## Dependencies & Interactions + +- [Commons (#16)](./Commons.md) — defines the `EventLogQueryRequest` / `EventLogQueryResponse` / `EventLogEntry` message contracts in `ZB.MOM.WW.ScadaBridge.Commons.Messages.RemoteQuery`, shared across the site query path and the central dispatch path (`QueryEventLogsCommand`). +- [Central–Site Communication (#5)](./Communication.md) — the `SiteCommunicationActor` dispatches inbound `EventLogQueryRequest` messages to `EventLogHandlerActor` and carries the `EventLogQueryResponse` back to central. The query timeout is 30 s. +- [Site Runtime (#3)](./SiteRuntime.md) — `ScriptActor` and `ScriptExecutionActor` log `script`-type events: trigger expression failures, script execution errors, and timeouts. `ISiteEventLogger` is resolved from DI inside execution actors. +- [Data Connection Layer (#4)](./DataConnectionLayer.md) — `DataConnectionActor` logs `connection`-type events: connection loss, reconnection, and endpoint failover. `DataConnectionManagerActor` may also log connection-category events. +- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — logs `store_and_forward`-type events on the site→central notification forward path (forward failures, long-buffered notifications). Routine enqueue and forward-success events are not logged; central's `Notifications` table is the authoritative record. +- [Host (#15)](./Host.md) — `SiteServiceRegistration` calls `AddSiteEventLogging` and binds `SiteEventLogOptions`. `AkkaHostedService` wires `EventLogHandlerActor` as a cluster singleton scoped to `"site-{SiteId}"` and registers the `SiteEventLogActiveNodeCheck` delegate so the purge runs only on the active node. +- [Audit Log (#23)](./AuditLog.md) — a distinct component. The Audit Log captures every trust-boundary action (outbound API calls, DB writes, notifications, inbound API) and flows to a central append-only table with monthly partitioning and 365-day retention. The site event log captures internal runtime diagnostics (failures, state transitions) locally with 30-day retention. The two stores are complementary, not overlapping. +- [Site Call Audit (#22)](./SiteCallAudit.md) — a distinct component. Site Call Audit mirrors cached-call operational status in the central `SiteCalls` table via gRPC telemetry. Site Event Logging has no role in that flow. + +## Troubleshooting + +### Write failures are observable but not propagated + +A SQLite write failure increments `FailedWriteCount` on `ISiteEventLogger`, logs an error via `ILogger`, and faults the returned `Task`. The calling actor discards the `Task` on the hot path (`_ = logger?.LogEventAsync(…)`), so the failure does not surface to the actor's message loop. `FailedWriteCount` is available for Health Monitoring integration but is not yet wired to the health surface; a non-zero count indicates disk pressure, a full queue, or a corrupt database file. + +### Queue overflow drops oldest events + +When the site write queue fills (sustained disk slowness or very high event rates), the oldest pending event is silently evicted and `FailedWriteCount` is incremented. Recent events are preserved at the cost of older ones. Reducing event throughput or increasing `WriteQueueCapacity` addresses sustained overflow. + +### Cap-purge loop terminates early + +If the database size does not decrease across a cap-purge batch, the loop stops to avoid emptying the entire table. This situation should not occur with `auto_vacuum = INCREMENTAL` enabled, but the guard prevents runaway deletion if vacuuming regresses. A `Warning` log line reporting the stable size is the signal to investigate filesystem-level free-page reclamation. + +### Central query returns stale data after failover + +After a site failover, the new active node's event log starts empty. Central queries will see no events until the new node generates them. This is by design — event logs are not replicated. Historical events from the previous active node return when that node comes back online and responds to queries. + +## Related Documentation + +- [Site Event Logging design specification](../requirements/Component-SiteEventLogging.md) +- [Audit Log](./AuditLog.md) +- [Site Call Audit](./SiteCallAudit.md) +- [Central–Site Communication](./Communication.md) +- [Site Runtime](./SiteRuntime.md) +- [Data Connection Layer](./DataConnectionLayer.md) +- [Store-and-Forward Engine](./StoreAndForward.md) +- [Host](./Host.md) +- [Commons](./Commons.md)