NotificationService (Notify.Send returns string not NotificationId; MaxConcurrentConnections unenforced; AddHttpClient), NotificationOutbox (one Attempted row always, terminal row only on terminal status), SiteCallAudit (direct dual-write, no Tell; KPI tiles consumed by CentralUI), HealthMonitoring (CentralOfflineTimeout 180s = 6x ReportInterval; HealthReportSender gates on IsActiveNode), SiteEventLogging (active-node purge seam not wired; runs on both nodes), InboundAPI (whole System.Diagnostics namespace forbidden).
16 KiB
Site Event Logging
The Site Event Logging component records operational events at each site cluster into a local SQLite database. Events are written by site actors on a fire-and-forget basis and are available for remote query from central, providing a diagnostic window into site runtime activity without coupling subsystems to a central store.
Overview
Site Event Logging (#12) is a site-only write path that runs alongside the operational subsystems it observes. Unlike the Audit Log (#23), which spans the script trust boundary and flows to a central append-only table, the site event log is a local diagnostic store: it captures events that are useful for troubleshooting runtime behaviour (script failures, connection flapping, deployment outcomes) but are not part of a ledger that must survive failover or node replacement.
The component code lives in src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/:
SiteEventLogger— the singleton write path: one ownedSqliteConnectionbehind a shared write lock, fed by a boundedChannel<T>so actor threads never block on disk I/O.EventLogQueryService— executes queries against the local SQLite, filtering and paginating results for central requests.EventLogHandlerActor— Akka actor bridge that receivesEventLogQueryRequestmessages from theSiteCommunicationActorand returnsEventLogQueryResponse.EventLogPurgeService—BackgroundServicethat enforces time-based retention and the storage cap on a configurable interval.SiteEventLogOptions— options class bound fromScadaBridge:SiteEventLog.
The DI entry point is ServiceCollectionExtensions.AddSiteEventLogging, registered on site nodes by SiteServiceRegistration. EventLogHandlerActor is wired separately as a cluster singleton inside AkkaHostedService because it must be created inside the ActorSystem.
Key Concepts
Active-node-only writes
Only the active site node generates and stores events. The standby's local SQLite receives no writes, so purging there is unnecessary. EventLogPurgeService consults an optional SiteEventLogActiveNodeCheck delegate on each tick and early-exits when the delegate returns false. The delegate is an optional seam: AddSiteEventLogging resolves it via sp.GetService<SiteEventLogActiveNodeCheck>(), so the service compiles and runs without it. The Host does not currently register the delegate, so GetService returns null and the constructor defaults to () => true. As a result the purge currently runs on every tick on both nodes. When no delegate is registered, the purge runs on every tick, preserving pre-cluster behaviour.
On failover, the newly active node starts logging to its own SQLite database. Historical events from the previous active node are not queryable until that node comes back online. This is acceptable because event logs are diagnostic, not transactional — a missing log tail after failover is not a data-integrity concern.
Event types and severity
ISiteEventLogger.LogEventAsync accepts a free-form eventType string and one of three case-sensitive severity values: "Info", "Warning", or "Error". Unknown severities are rejected at write time — the allowed set is enforced by a HashSet<string> with StringComparer.Ordinal, matching the SQLite BINARY collation used by query filters so a stored value is never invisible to a later query.
The event_type values used across site subsystems are: script, alarm, deployment, connection, store_and_forward, instance_lifecycle.
Non-blocking write path
LogEventAsync validates its arguments and enqueues a PendingEvent onto a bounded Channel<T>. The background writer loop drains it sequentially against the shared connection. The returned Task completes once the event is durably persisted and faults if the write fails, so a caller that awaits it can detect a dropped event. The caller is never blocked on disk I/O.
Keyset pagination
Queries use keyset pagination: the caller supplies a nullable ContinuationToken (the id of the last row in the previous page), and the query appends id > $afterId so each page starts exactly after the previous one with no row-skipping or re-scanning. The response includes a new ContinuationToken and a HasMore flag.
Architecture
Schema and indexes
SiteEventLogger.InitializeSchema sets PRAGMA auto_vacuum = INCREMENTAL before creating the table — this is required before any table exists for the mode to take effect, and it allows PRAGMA incremental_vacuum to reclaim free pages during cap-purge batches:
cmd.CommandText = """
CREATE TABLE IF NOT EXISTS site_events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
event_type TEXT NOT NULL,
severity TEXT NOT NULL,
instance_id TEXT,
source TEXT NOT NULL,
message TEXT NOT NULL,
details TEXT
);
CREATE INDEX IF NOT EXISTS idx_events_timestamp ON site_events(timestamp);
CREATE INDEX IF NOT EXISTS idx_events_type ON site_events(event_type);
CREATE INDEX IF NOT EXISTS idx_events_instance ON site_events(instance_id);
CREATE INDEX IF NOT EXISTS idx_events_severity ON site_events(severity);
""";
Keyword search (KeywordFilter) runs as LIKE '%…%' ESCAPE '\' on message and source. A leading-wildcard LIKE cannot use a B-tree index, so keyword-only queries full-scan the table. All other filters (event_type, severity, instance_id, timestamp) are covered by the indexes above.
Connection lock
SiteEventLogger owns one SqliteConnection that is not thread-safe. Every database access — writes from the background loop, reads from EventLogQueryService, deletes from EventLogPurgeService — must go through WithConnection, which serialises callers on a shared lock:
internal bool WithConnection(Action<SqliteConnection> action)
{
ArgumentNullException.ThrowIfNull(action);
lock (_writeLock)
{
if (_disposed) return false;
action(_connection);
return true;
}
}
EventLogQueryService and EventLogPurgeService both depend on the concrete SiteEventLogger rather than ISiteEventLogger to avoid a downcast that would throw InvalidCastException for any other implementation.
Write queue and drop behaviour
The write queue is bounded at WriteQueueCapacity (default 10,000). Overflow uses BoundedChannelFullMode.DropOldest: when the queue is full, the oldest pending event is evicted, its completion Task is faulted with InvalidOperationException, and FailedWriteCount is incremented so the drop is observable. On any SQLite write error, FailedWriteCount is also incremented and the affected Task is faulted:
_writeQueue = Channel.CreateBounded<PendingEvent>(
new BoundedChannelOptions(capacity)
{
SingleReader = true,
SingleWriter = false,
FullMode = BoundedChannelFullMode.DropOldest,
},
itemDropped: dropped =>
{
Interlocked.Increment(ref _failedWriteCount);
dropped.Completion.TrySetException(
new InvalidOperationException(
$"Event was dropped because the write queue exceeded its bounded capacity ({capacity})."));
});
Purge: retention and storage cap
EventLogPurgeService runs two passes on each tick:
-
Retention purge — deletes all rows where
timestamp < cutoff(cutoff =UtcNowminusRetentionDays). A singleDELETEstatement; no batching needed. -
Storage cap purge — if the logical database size exceeds
MaxStorageMb, deletes the oldest 1,000 rows per batch and callsPRAGMA incremental_vacuumafter each batch to reclaim free pages. The loop stops when the size is within the cap, when no rows are deleted, or when the size fails to decrease across a batch (guards against a scenario where vacuuming cannot shrink the file):
cmd.CommandText = $"""
DELETE FROM site_events WHERE id IN (
SELECT id FROM site_events ORDER BY id ASC LIMIT {CapPurgeBatchSize}
)
""";
var rows = cmd.ExecuteNonQuery();
using var vacuumCmd = connection.CreateCommand();
vacuumCmd.CommandText = "PRAGMA incremental_vacuum";
vacuumCmd.ExecuteNonQuery();
Logical size is measured as (page_count - freelist_count) × page_size so the cap loop observes reclaimed pages even before they are returned to the OS.
A purge runs once on BackgroundService startup and then on each PurgeInterval tick.
Central query path
Central queries arrive via the SiteCommunicationActor, which dispatches EventLogQueryRequest messages to the EventLogHandlerActor cluster singleton. The actor delegates immediately to IEventLogQueryService.ExecuteQuery and returns the EventLogQueryResponse to the sender synchronously, keeping the actor message loop unblocked while the read runs under the shared lock:
public class EventLogHandlerActor : ReceiveActor
{
public EventLogHandlerActor(IEventLogQueryService queryService)
{
_queryService = queryService;
Receive<EventLogQueryRequest>(msg =>
{
var response = _queryService.ExecuteQuery(msg);
Sender.Tell(response);
});
}
}
EventLogQueryService clamps the caller-supplied PageSize to MaxQueryPageSize (default 500) before building the query, so a central client that requests int.MaxValue cannot force the query to materialise the entire log into one list while holding the write lock.
Usage
Callers resolve ISiteEventLogger from DI. Because the write is non-blocking and best-effort, site actors discard the returned Task with _ = rather than awaiting it on the hot path:
// ScriptExecutionActor — reporting a script failure
_ = siteEventLogger?.LogEventAsync(
"script", "Error", instanceName, $"ScriptActor:{scriptName}", errorMsg, ex.ToString());
// DataConnectionActor — reporting a connection loss
_ = _siteEventLogger.LogEventAsync(
"connection", "Warning", null, _connectionName,
$"Connection lost — entering reconnect cycle", null);
// DataConnectionActor — reporting a reconnection
_ = _siteEventLogger.LogEventAsync(
"connection", "Info", null, _connectionName,
$"Connection restored on {_activeEndpoint} endpoint", null);
The source argument uses the convention "ActorType:Name" (e.g. "ScriptActor:MonitorSpeed", "DataConnectionActor:PLC1"). The details field carries any supplemental context — stack traces, compiler output, thresholds — as free-form text; JSON is conventional but not validated.
Callers that need to confirm persistence — rare in production, common in tests — can await the returned Task and handle a faulted result.
Configuration
Options are bound from the ScadaBridge:SiteEventLog section by SiteServiceRegistration.
| Key | Default | Description |
|---|---|---|
RetentionDays |
30 |
Days before events are deleted by the retention purge. |
MaxStorageMb |
1024 |
Maximum logical database size in MB. Oldest rows are deleted in 1,000-row batches when exceeded. |
DatabasePath |
site_events.db |
File path for the SQLite database. |
QueryPageSize |
500 |
Default page size when the caller does not supply one. |
MaxQueryPageSize |
500 |
Hard upper bound on caller-supplied page sizes. Silent clamp. |
PurgeInterval |
24h (TimeSpan) |
Interval between purge ticks. An initial purge also runs on service startup. |
WriteQueueCapacity |
10000 |
Bounded write-queue capacity. Overflow evicts oldest with DropOldest. |
The docker cluster appsettings (ScadaBridge:SiteEventLog) sets RetentionDays: 30 and MaxStorageMb: 1024, matching the code defaults. PurgeScheduleCron appears in those files as a vestigial key; the actual purge cadence is driven by PurgeInterval in the options class, not a cron expression.
Dependencies & Interactions
- Commons (#16) — defines the
EventLogQueryRequest/EventLogQueryResponse/EventLogEntrymessage contracts inZB.MOM.WW.ScadaBridge.Commons.Messages.RemoteQuery, shared across the site query path and the central dispatch path (QueryEventLogsCommand). - Central–Site Communication (#5) — the
SiteCommunicationActordispatches inboundEventLogQueryRequestmessages toEventLogHandlerActorand carries theEventLogQueryResponseback to central. The query timeout is 30 s. - Site Runtime (#3) —
ScriptActorandScriptExecutionActorlogscript-type events: trigger expression failures, script execution errors, and timeouts.ISiteEventLoggeris resolved from DI inside execution actors. - Data Connection Layer (#4) —
DataConnectionActorlogsconnection-type events: connection loss, reconnection, and endpoint failover.DataConnectionManagerActormay also log connection-category events. - Store-and-Forward Engine (#6) — logs
store_and_forward-type events on the site→central notification forward path (forward failures, long-buffered notifications). Routine enqueue and forward-success events are not logged; central'sNotificationstable is the authoritative record. - Host (#15) —
SiteServiceRegistrationcallsAddSiteEventLoggingand bindsSiteEventLogOptions.AkkaHostedServicewiresEventLogHandlerActoras a cluster singleton scoped to"site-{SiteId}". TheSiteEventLogActiveNodeCheckdelegate is an optional seam defined inSiteEventLoggingfor the Host to register when it wants to gate the purge to the active node only; the Host does not currently register it, so the purge defaults to always-active and runs on every node. - Audit Log (#23) — a distinct component. The Audit Log captures every trust-boundary action (outbound API calls, DB writes, notifications, inbound API) and flows to a central append-only table with monthly partitioning and 365-day retention. The site event log captures internal runtime diagnostics (failures, state transitions) locally with 30-day retention. The two stores are complementary, not overlapping.
- Site Call Audit (#22) — a distinct component. Site Call Audit mirrors cached-call operational status in the central
SiteCallstable via gRPC telemetry. Site Event Logging has no role in that flow.
Troubleshooting
Write failures are observable but not propagated
A SQLite write failure increments FailedWriteCount on ISiteEventLogger, logs an error via ILogger, and faults the returned Task. The calling actor discards the Task on the hot path (_ = logger?.LogEventAsync(…)), so the failure does not surface to the actor's message loop. FailedWriteCount is available for Health Monitoring integration but is not yet wired to the health surface; a non-zero count indicates disk pressure, a full queue, or a corrupt database file.
Queue overflow drops oldest events
When the site write queue fills (sustained disk slowness or very high event rates), the oldest pending event is silently evicted and FailedWriteCount is incremented. Recent events are preserved at the cost of older ones. Reducing event throughput or increasing WriteQueueCapacity addresses sustained overflow.
Cap-purge loop terminates early
If the database size does not decrease across a cap-purge batch, the loop stops to avoid emptying the entire table. This situation should not occur with auto_vacuum = INCREMENTAL enabled, but the guard prevents runaway deletion if vacuuming regresses. A Warning log line reporting the stable size is the signal to investigate filesystem-level free-page reclamation.
Central query returns stale data after failover
After a site failover, the new active node's event log starts empty. Central queries will see no events until the new node generates them. This is by design — event logs are not replicated. Historical events from the previous active node return when that node comes back online and responds to queries.