Files
Joseph Doherty 9175b0c013 docs(components): accuracy fixes from deep review (batch 3)
NotificationService (Notify.Send returns string not NotificationId;
MaxConcurrentConnections unenforced; AddHttpClient), NotificationOutbox
(one Attempted row always, terminal row only on terminal status), SiteCallAudit
(direct dual-write, no Tell; KPI tiles consumed by CentralUI), HealthMonitoring
(CentralOfflineTimeout 180s = 6x ReportInterval; HealthReportSender gates on
IsActiveNode), SiteEventLogging (active-node purge seam not wired; runs on both
nodes), InboundAPI (whole System.Diagnostics namespace forbidden).
2026-06-03 16:37:15 -04:00

16 KiB
Raw Permalink Blame History

Site Event Logging

The Site Event Logging component records operational events at each site cluster into a local SQLite database. Events are written by site actors on a fire-and-forget basis and are available for remote query from central, providing a diagnostic window into site runtime activity without coupling subsystems to a central store.

Overview

Site Event Logging (#12) is a site-only write path that runs alongside the operational subsystems it observes. Unlike the Audit Log (#23), which spans the script trust boundary and flows to a central append-only table, the site event log is a local diagnostic store: it captures events that are useful for troubleshooting runtime behaviour (script failures, connection flapping, deployment outcomes) but are not part of a ledger that must survive failover or node replacement.

The component code lives in src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/:

  • SiteEventLogger — the singleton write path: one owned SqliteConnection behind a shared write lock, fed by a bounded Channel<T> so actor threads never block on disk I/O.
  • EventLogQueryService — executes queries against the local SQLite, filtering and paginating results for central requests.
  • EventLogHandlerActor — Akka actor bridge that receives EventLogQueryRequest messages from the SiteCommunicationActor and returns EventLogQueryResponse.
  • EventLogPurgeServiceBackgroundService that enforces time-based retention and the storage cap on a configurable interval.
  • SiteEventLogOptions — options class bound from ScadaBridge:SiteEventLog.

The DI entry point is ServiceCollectionExtensions.AddSiteEventLogging, registered on site nodes by SiteServiceRegistration. EventLogHandlerActor is wired separately as a cluster singleton inside AkkaHostedService because it must be created inside the ActorSystem.

Key Concepts

Active-node-only writes

Only the active site node generates and stores events. The standby's local SQLite receives no writes, so purging there is unnecessary. EventLogPurgeService consults an optional SiteEventLogActiveNodeCheck delegate on each tick and early-exits when the delegate returns false. The delegate is an optional seam: AddSiteEventLogging resolves it via sp.GetService<SiteEventLogActiveNodeCheck>(), so the service compiles and runs without it. The Host does not currently register the delegate, so GetService returns null and the constructor defaults to () => true. As a result the purge currently runs on every tick on both nodes. When no delegate is registered, the purge runs on every tick, preserving pre-cluster behaviour.

On failover, the newly active node starts logging to its own SQLite database. Historical events from the previous active node are not queryable until that node comes back online. This is acceptable because event logs are diagnostic, not transactional — a missing log tail after failover is not a data-integrity concern.

Event types and severity

ISiteEventLogger.LogEventAsync accepts a free-form eventType string and one of three case-sensitive severity values: "Info", "Warning", or "Error". Unknown severities are rejected at write time — the allowed set is enforced by a HashSet<string> with StringComparer.Ordinal, matching the SQLite BINARY collation used by query filters so a stored value is never invisible to a later query.

The event_type values used across site subsystems are: script, alarm, deployment, connection, store_and_forward, instance_lifecycle.

Non-blocking write path

LogEventAsync validates its arguments and enqueues a PendingEvent onto a bounded Channel<T>. The background writer loop drains it sequentially against the shared connection. The returned Task completes once the event is durably persisted and faults if the write fails, so a caller that awaits it can detect a dropped event. The caller is never blocked on disk I/O.

Keyset pagination

Queries use keyset pagination: the caller supplies a nullable ContinuationToken (the id of the last row in the previous page), and the query appends id > $afterId so each page starts exactly after the previous one with no row-skipping or re-scanning. The response includes a new ContinuationToken and a HasMore flag.

Architecture

Schema and indexes

SiteEventLogger.InitializeSchema sets PRAGMA auto_vacuum = INCREMENTAL before creating the table — this is required before any table exists for the mode to take effect, and it allows PRAGMA incremental_vacuum to reclaim free pages during cap-purge batches:

cmd.CommandText = """
    CREATE TABLE IF NOT EXISTS site_events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp TEXT NOT NULL,
        event_type TEXT NOT NULL,
        severity TEXT NOT NULL,
        instance_id TEXT,
        source TEXT NOT NULL,
        message TEXT NOT NULL,
        details TEXT
    );
    CREATE INDEX IF NOT EXISTS idx_events_timestamp ON site_events(timestamp);
    CREATE INDEX IF NOT EXISTS idx_events_type ON site_events(event_type);
    CREATE INDEX IF NOT EXISTS idx_events_instance ON site_events(instance_id);
    CREATE INDEX IF NOT EXISTS idx_events_severity ON site_events(severity);
    """;

Keyword search (KeywordFilter) runs as LIKE '%…%' ESCAPE '\' on message and source. A leading-wildcard LIKE cannot use a B-tree index, so keyword-only queries full-scan the table. All other filters (event_type, severity, instance_id, timestamp) are covered by the indexes above.

Connection lock

SiteEventLogger owns one SqliteConnection that is not thread-safe. Every database access — writes from the background loop, reads from EventLogQueryService, deletes from EventLogPurgeService — must go through WithConnection, which serialises callers on a shared lock:

internal bool WithConnection(Action<SqliteConnection> action)
{
    ArgumentNullException.ThrowIfNull(action);
    lock (_writeLock)
    {
        if (_disposed) return false;
        action(_connection);
        return true;
    }
}

EventLogQueryService and EventLogPurgeService both depend on the concrete SiteEventLogger rather than ISiteEventLogger to avoid a downcast that would throw InvalidCastException for any other implementation.

Write queue and drop behaviour

The write queue is bounded at WriteQueueCapacity (default 10,000). Overflow uses BoundedChannelFullMode.DropOldest: when the queue is full, the oldest pending event is evicted, its completion Task is faulted with InvalidOperationException, and FailedWriteCount is incremented so the drop is observable. On any SQLite write error, FailedWriteCount is also incremented and the affected Task is faulted:

_writeQueue = Channel.CreateBounded<PendingEvent>(
    new BoundedChannelOptions(capacity)
    {
        SingleReader = true,
        SingleWriter = false,
        FullMode = BoundedChannelFullMode.DropOldest,
    },
    itemDropped: dropped =>
    {
        Interlocked.Increment(ref _failedWriteCount);
        dropped.Completion.TrySetException(
            new InvalidOperationException(
                $"Event was dropped because the write queue exceeded its bounded capacity ({capacity})."));
    });

Purge: retention and storage cap

EventLogPurgeService runs two passes on each tick:

  1. Retention purge — deletes all rows where timestamp < cutoff (cutoff = UtcNow minus RetentionDays). A single DELETE statement; no batching needed.

  2. Storage cap purge — if the logical database size exceeds MaxStorageMb, deletes the oldest 1,000 rows per batch and calls PRAGMA incremental_vacuum after each batch to reclaim free pages. The loop stops when the size is within the cap, when no rows are deleted, or when the size fails to decrease across a batch (guards against a scenario where vacuuming cannot shrink the file):

cmd.CommandText = $"""
    DELETE FROM site_events WHERE id IN (
        SELECT id FROM site_events ORDER BY id ASC LIMIT {CapPurgeBatchSize}
    )
    """;
var rows = cmd.ExecuteNonQuery();

using var vacuumCmd = connection.CreateCommand();
vacuumCmd.CommandText = "PRAGMA incremental_vacuum";
vacuumCmd.ExecuteNonQuery();

Logical size is measured as (page_count - freelist_count) × page_size so the cap loop observes reclaimed pages even before they are returned to the OS.

A purge runs once on BackgroundService startup and then on each PurgeInterval tick.

Central query path

Central queries arrive via the SiteCommunicationActor, which dispatches EventLogQueryRequest messages to the EventLogHandlerActor cluster singleton. The actor delegates immediately to IEventLogQueryService.ExecuteQuery and returns the EventLogQueryResponse to the sender synchronously, keeping the actor message loop unblocked while the read runs under the shared lock:

public class EventLogHandlerActor : ReceiveActor
{
    public EventLogHandlerActor(IEventLogQueryService queryService)
    {
        _queryService = queryService;

        Receive<EventLogQueryRequest>(msg =>
        {
            var response = _queryService.ExecuteQuery(msg);
            Sender.Tell(response);
        });
    }
}

EventLogQueryService clamps the caller-supplied PageSize to MaxQueryPageSize (default 500) before building the query, so a central client that requests int.MaxValue cannot force the query to materialise the entire log into one list while holding the write lock.

Usage

Callers resolve ISiteEventLogger from DI. Because the write is non-blocking and best-effort, site actors discard the returned Task with _ = rather than awaiting it on the hot path:

// ScriptExecutionActor — reporting a script failure
_ = siteEventLogger?.LogEventAsync(
    "script", "Error", instanceName, $"ScriptActor:{scriptName}", errorMsg, ex.ToString());

// DataConnectionActor — reporting a connection loss
_ = _siteEventLogger.LogEventAsync(
    "connection", "Warning", null, _connectionName,
    $"Connection lost — entering reconnect cycle", null);

// DataConnectionActor — reporting a reconnection
_ = _siteEventLogger.LogEventAsync(
    "connection", "Info", null, _connectionName,
    $"Connection restored on {_activeEndpoint} endpoint", null);

The source argument uses the convention "ActorType:Name" (e.g. "ScriptActor:MonitorSpeed", "DataConnectionActor:PLC1"). The details field carries any supplemental context — stack traces, compiler output, thresholds — as free-form text; JSON is conventional but not validated.

Callers that need to confirm persistence — rare in production, common in tests — can await the returned Task and handle a faulted result.

Configuration

Options are bound from the ScadaBridge:SiteEventLog section by SiteServiceRegistration.

Key Default Description
RetentionDays 30 Days before events are deleted by the retention purge.
MaxStorageMb 1024 Maximum logical database size in MB. Oldest rows are deleted in 1,000-row batches when exceeded.
DatabasePath site_events.db File path for the SQLite database.
QueryPageSize 500 Default page size when the caller does not supply one.
MaxQueryPageSize 500 Hard upper bound on caller-supplied page sizes. Silent clamp.
PurgeInterval 24h (TimeSpan) Interval between purge ticks. An initial purge also runs on service startup.
WriteQueueCapacity 10000 Bounded write-queue capacity. Overflow evicts oldest with DropOldest.

The docker cluster appsettings (ScadaBridge:SiteEventLog) sets RetentionDays: 30 and MaxStorageMb: 1024, matching the code defaults. PurgeScheduleCron appears in those files as a vestigial key; the actual purge cadence is driven by PurgeInterval in the options class, not a cron expression.

Dependencies & Interactions

  • Commons (#16) — defines the EventLogQueryRequest / EventLogQueryResponse / EventLogEntry message contracts in ZB.MOM.WW.ScadaBridge.Commons.Messages.RemoteQuery, shared across the site query path and the central dispatch path (QueryEventLogsCommand).
  • CentralSite Communication (#5) — the SiteCommunicationActor dispatches inbound EventLogQueryRequest messages to EventLogHandlerActor and carries the EventLogQueryResponse back to central. The query timeout is 30 s.
  • Site Runtime (#3)ScriptActor and ScriptExecutionActor log script-type events: trigger expression failures, script execution errors, and timeouts. ISiteEventLogger is resolved from DI inside execution actors.
  • Data Connection Layer (#4)DataConnectionActor logs connection-type events: connection loss, reconnection, and endpoint failover. DataConnectionManagerActor may also log connection-category events.
  • Store-and-Forward Engine (#6) — logs store_and_forward-type events on the site→central notification forward path (forward failures, long-buffered notifications). Routine enqueue and forward-success events are not logged; central's Notifications table is the authoritative record.
  • Host (#15)SiteServiceRegistration calls AddSiteEventLogging and binds SiteEventLogOptions. AkkaHostedService wires EventLogHandlerActor as a cluster singleton scoped to "site-{SiteId}". The SiteEventLogActiveNodeCheck delegate is an optional seam defined in SiteEventLogging for the Host to register when it wants to gate the purge to the active node only; the Host does not currently register it, so the purge defaults to always-active and runs on every node.
  • Audit Log (#23) — a distinct component. The Audit Log captures every trust-boundary action (outbound API calls, DB writes, notifications, inbound API) and flows to a central append-only table with monthly partitioning and 365-day retention. The site event log captures internal runtime diagnostics (failures, state transitions) locally with 30-day retention. The two stores are complementary, not overlapping.
  • Site Call Audit (#22) — a distinct component. Site Call Audit mirrors cached-call operational status in the central SiteCalls table via gRPC telemetry. Site Event Logging has no role in that flow.

Troubleshooting

Write failures are observable but not propagated

A SQLite write failure increments FailedWriteCount on ISiteEventLogger, logs an error via ILogger, and faults the returned Task. The calling actor discards the Task on the hot path (_ = logger?.LogEventAsync(…)), so the failure does not surface to the actor's message loop. FailedWriteCount is available for Health Monitoring integration but is not yet wired to the health surface; a non-zero count indicates disk pressure, a full queue, or a corrupt database file.

Queue overflow drops oldest events

When the site write queue fills (sustained disk slowness or very high event rates), the oldest pending event is silently evicted and FailedWriteCount is incremented. Recent events are preserved at the cost of older ones. Reducing event throughput or increasing WriteQueueCapacity addresses sustained overflow.

Cap-purge loop terminates early

If the database size does not decrease across a cap-purge batch, the loop stops to avoid emptying the entire table. This situation should not occur with auto_vacuum = INCREMENTAL enabled, but the guard prevents runaway deletion if vacuuming regresses. A Warning log line reporting the stable size is the signal to investigate filesystem-level free-page reclamation.

Central query returns stale data after failover

After a site failover, the new active node's event log starts empty. Central queries will see no events until the new node generates them. This is by design — event logs are not replicated. Historical events from the previous active node return when that node comes back online and responds to queries.