Files
ScadaBridge/docs/components/SiteEventLogging.md
T

235 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Site Event Logging
The Site Event Logging component records operational events at each site cluster into a local SQLite database. Events are written by site actors on a fire-and-forget basis and are available for remote query from central, providing a diagnostic window into site runtime activity without coupling subsystems to a central store.
## Overview
Site Event Logging (#12) is a site-only write path that runs alongside the operational subsystems it observes. Unlike the Audit Log (#23), which spans the script trust boundary and flows to a central append-only table, the site event log is a local diagnostic store: it captures events that are useful for troubleshooting runtime behaviour (script failures, connection flapping, deployment outcomes) but are not part of a ledger that must survive failover or node replacement.
The component code lives in `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/`:
- `SiteEventLogger` — the singleton write path: one owned `SqliteConnection` behind a shared write lock, fed by a bounded `Channel<T>` so actor threads never block on disk I/O.
- `EventLogQueryService` — executes queries against the local SQLite, filtering and paginating results for central requests.
- `EventLogHandlerActor` — Akka actor bridge that receives `EventLogQueryRequest` messages from the `SiteCommunicationActor` and returns `EventLogQueryResponse`.
- `EventLogPurgeService``BackgroundService` that enforces time-based retention and the storage cap on a configurable interval.
- `SiteEventLogOptions` — options class bound from `ScadaBridge:SiteEventLog`.
The DI entry point is `ServiceCollectionExtensions.AddSiteEventLogging`, registered on site nodes by `SiteServiceRegistration`. `EventLogHandlerActor` is wired separately as a cluster singleton inside `AkkaHostedService` because it must be created inside the `ActorSystem`.
## Key Concepts
### Active-node-only writes
Only the active site node generates and stores events. The standby's local SQLite receives no writes, so purging there is unnecessary. `EventLogPurgeService` consults an optional `SiteEventLogActiveNodeCheck` delegate on each tick; the Host registers the real check on site nodes, and the purge early-exits on the standby. When no delegate is registered (tests, non-clustered hosts), the purge runs on every tick, preserving pre-cluster behaviour.
On failover, the newly active node starts logging to its own SQLite database. Historical events from the previous active node are not queryable until that node comes back online. This is acceptable because event logs are diagnostic, not transactional — a missing log tail after failover is not a data-integrity concern.
### Event types and severity
`ISiteEventLogger.LogEventAsync` accepts a free-form `eventType` string and one of three case-sensitive `severity` values: `"Info"`, `"Warning"`, or `"Error"`. Unknown severities are rejected at write time — the allowed set is enforced by a `HashSet<string>` with `StringComparer.Ordinal`, matching the SQLite `BINARY` collation used by query filters so a stored value is never invisible to a later query.
The `event_type` values used across site subsystems are: `script`, `alarm`, `deployment`, `connection`, `store_and_forward`, `instance_lifecycle`.
### Non-blocking write path
`LogEventAsync` validates its arguments and enqueues a `PendingEvent` onto a bounded `Channel<T>`. The background writer loop drains it sequentially against the shared connection. The returned `Task` completes once the event is durably persisted and faults if the write fails, so a caller that awaits it can detect a dropped event. The caller is never blocked on disk I/O.
### Keyset pagination
Queries use keyset pagination: the caller supplies a nullable `ContinuationToken` (the `id` of the last row in the previous page), and the query appends `id > $afterId` so each page starts exactly after the previous one with no row-skipping or re-scanning. The response includes a new `ContinuationToken` and a `HasMore` flag.
## Architecture
### Schema and indexes
`SiteEventLogger.InitializeSchema` sets `PRAGMA auto_vacuum = INCREMENTAL` before creating the table — this is required before any table exists for the mode to take effect, and it allows `PRAGMA incremental_vacuum` to reclaim free pages during cap-purge batches:
```csharp
cmd.CommandText = """
CREATE TABLE IF NOT EXISTS site_events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
event_type TEXT NOT NULL,
severity TEXT NOT NULL,
instance_id TEXT,
source TEXT NOT NULL,
message TEXT NOT NULL,
details TEXT
);
CREATE INDEX IF NOT EXISTS idx_events_timestamp ON site_events(timestamp);
CREATE INDEX IF NOT EXISTS idx_events_type ON site_events(event_type);
CREATE INDEX IF NOT EXISTS idx_events_instance ON site_events(instance_id);
CREATE INDEX IF NOT EXISTS idx_events_severity ON site_events(severity);
""";
```
Keyword search (`KeywordFilter`) runs as `LIKE '%…%' ESCAPE '\'` on `message` and `source`. A leading-wildcard `LIKE` cannot use a B-tree index, so keyword-only queries full-scan the table. All other filters (`event_type`, `severity`, `instance_id`, `timestamp`) are covered by the indexes above.
### Connection lock
`SiteEventLogger` owns one `SqliteConnection` that is not thread-safe. Every database access — writes from the background loop, reads from `EventLogQueryService`, deletes from `EventLogPurgeService` — must go through `WithConnection`, which serialises callers on a shared lock:
```csharp
internal bool WithConnection(Action<SqliteConnection> action)
{
ArgumentNullException.ThrowIfNull(action);
lock (_writeLock)
{
if (_disposed) return false;
action(_connection);
return true;
}
}
```
`EventLogQueryService` and `EventLogPurgeService` both depend on the concrete `SiteEventLogger` rather than `ISiteEventLogger` to avoid a downcast that would throw `InvalidCastException` for any other implementation.
### Write queue and drop behaviour
The write queue is bounded at `WriteQueueCapacity` (default 10,000). Overflow uses `BoundedChannelFullMode.DropOldest`: when the queue is full, the oldest pending event is evicted, its completion `Task` is faulted with `InvalidOperationException`, and `FailedWriteCount` is incremented so the drop is observable. On any SQLite write error, `FailedWriteCount` is also incremented and the affected `Task` is faulted:
```csharp
_writeQueue = Channel.CreateBounded<PendingEvent>(
new BoundedChannelOptions(capacity)
{
SingleReader = true,
SingleWriter = false,
FullMode = BoundedChannelFullMode.DropOldest,
},
itemDropped: dropped =>
{
Interlocked.Increment(ref _failedWriteCount);
dropped.Completion.TrySetException(
new InvalidOperationException(
$"Event was dropped because the write queue exceeded its bounded capacity ({capacity})."));
});
```
### Purge: retention and storage cap
`EventLogPurgeService` runs two passes on each tick:
1. **Retention purge** — deletes all rows where `timestamp < cutoff` (cutoff = `UtcNow` minus `RetentionDays`). A single `DELETE` statement; no batching needed.
2. **Storage cap purge** — if the logical database size exceeds `MaxStorageMb`, deletes the oldest 1,000 rows per batch and calls `PRAGMA incremental_vacuum` after each batch to reclaim free pages. The loop stops when the size is within the cap, when no rows are deleted, or when the size fails to decrease across a batch (guards against a scenario where vacuuming cannot shrink the file):
```csharp
cmd.CommandText = $"""
DELETE FROM site_events WHERE id IN (
SELECT id FROM site_events ORDER BY id ASC LIMIT {CapPurgeBatchSize}
)
""";
var rows = cmd.ExecuteNonQuery();
using var vacuumCmd = connection.CreateCommand();
vacuumCmd.CommandText = "PRAGMA incremental_vacuum";
vacuumCmd.ExecuteNonQuery();
```
Logical size is measured as `(page_count - freelist_count) × page_size` so the cap loop observes reclaimed pages even before they are returned to the OS.
A purge runs once on `BackgroundService` startup and then on each `PurgeInterval` tick.
### Central query path
Central queries arrive via the `SiteCommunicationActor`, which dispatches `EventLogQueryRequest` messages to the `EventLogHandlerActor` cluster singleton. The actor delegates immediately to `IEventLogQueryService.ExecuteQuery` and returns the `EventLogQueryResponse` to the sender synchronously, keeping the actor message loop unblocked while the read runs under the shared lock:
```csharp
public class EventLogHandlerActor : ReceiveActor
{
public EventLogHandlerActor(IEventLogQueryService queryService)
{
_queryService = queryService;
Receive<EventLogQueryRequest>(msg =>
{
var response = _queryService.ExecuteQuery(msg);
Sender.Tell(response);
});
}
}
```
`EventLogQueryService` clamps the caller-supplied `PageSize` to `MaxQueryPageSize` (default 500) before building the query, so a central client that requests `int.MaxValue` cannot force the query to materialise the entire log into one list while holding the write lock.
## Usage
Callers resolve `ISiteEventLogger` from DI. Because the write is non-blocking and best-effort, site actors discard the returned `Task` with `_ =` rather than awaiting it on the hot path:
```csharp
// ScriptExecutionActor — reporting a script failure
_ = siteEventLogger?.LogEventAsync(
"script", "Error", instanceName, $"ScriptActor:{scriptName}", errorMsg, ex.ToString());
// DataConnectionActor — reporting a connection loss
_ = _siteEventLogger.LogEventAsync(
"connection", "Warning", null, _connectionName,
$"Connection lost — entering reconnect cycle", null);
// DataConnectionActor — reporting a reconnection
_ = _siteEventLogger.LogEventAsync(
"connection", "Info", null, _connectionName,
$"Connection restored on {_activeEndpoint} endpoint", null);
```
The `source` argument uses the convention `"ActorType:Name"` (e.g. `"ScriptActor:MonitorSpeed"`, `"DataConnectionActor:PLC1"`). The `details` field carries any supplemental context — stack traces, compiler output, thresholds — as free-form text; JSON is conventional but not validated.
Callers that need to confirm persistence — rare in production, common in tests — can await the returned `Task` and handle a faulted result.
## Configuration
Options are bound from the `ScadaBridge:SiteEventLog` section by `SiteServiceRegistration`.
| Key | Default | Description |
|-----|---------|-------------|
| `RetentionDays` | `30` | Days before events are deleted by the retention purge. |
| `MaxStorageMb` | `1024` | Maximum logical database size in MB. Oldest rows are deleted in 1,000-row batches when exceeded. |
| `DatabasePath` | `site_events.db` | File path for the SQLite database. |
| `QueryPageSize` | `500` | Default page size when the caller does not supply one. |
| `MaxQueryPageSize` | `500` | Hard upper bound on caller-supplied page sizes. Silent clamp. |
| `PurgeInterval` | `24h` (`TimeSpan`) | Interval between purge ticks. An initial purge also runs on service startup. |
| `WriteQueueCapacity` | `10000` | Bounded write-queue capacity. Overflow evicts oldest with `DropOldest`. |
The docker cluster appsettings (`ScadaBridge:SiteEventLog`) sets `RetentionDays: 30` and `MaxStorageMb: 1024`, matching the code defaults. `PurgeScheduleCron` appears in those files as a vestigial key; the actual purge cadence is driven by `PurgeInterval` in the options class, not a cron expression.
## Dependencies & Interactions
- [Commons (#16)](./Commons.md) — defines the `EventLogQueryRequest` / `EventLogQueryResponse` / `EventLogEntry` message contracts in `ZB.MOM.WW.ScadaBridge.Commons.Messages.RemoteQuery`, shared across the site query path and the central dispatch path (`QueryEventLogsCommand`).
- [CentralSite Communication (#5)](./Communication.md) — the `SiteCommunicationActor` dispatches inbound `EventLogQueryRequest` messages to `EventLogHandlerActor` and carries the `EventLogQueryResponse` back to central. The query timeout is 30 s.
- [Site Runtime (#3)](./SiteRuntime.md) — `ScriptActor` and `ScriptExecutionActor` log `script`-type events: trigger expression failures, script execution errors, and timeouts. `ISiteEventLogger` is resolved from DI inside execution actors.
- [Data Connection Layer (#4)](./DataConnectionLayer.md) — `DataConnectionActor` logs `connection`-type events: connection loss, reconnection, and endpoint failover. `DataConnectionManagerActor` may also log connection-category events.
- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — logs `store_and_forward`-type events on the site→central notification forward path (forward failures, long-buffered notifications). Routine enqueue and forward-success events are not logged; central's `Notifications` table is the authoritative record.
- [Host (#15)](./Host.md) — `SiteServiceRegistration` calls `AddSiteEventLogging` and binds `SiteEventLogOptions`. `AkkaHostedService` wires `EventLogHandlerActor` as a cluster singleton scoped to `"site-{SiteId}"` and registers the `SiteEventLogActiveNodeCheck` delegate so the purge runs only on the active node.
- [Audit Log (#23)](./AuditLog.md) — a distinct component. The Audit Log captures every trust-boundary action (outbound API calls, DB writes, notifications, inbound API) and flows to a central append-only table with monthly partitioning and 365-day retention. The site event log captures internal runtime diagnostics (failures, state transitions) locally with 30-day retention. The two stores are complementary, not overlapping.
- [Site Call Audit (#22)](./SiteCallAudit.md) — a distinct component. Site Call Audit mirrors cached-call operational status in the central `SiteCalls` table via gRPC telemetry. Site Event Logging has no role in that flow.
## Troubleshooting
### Write failures are observable but not propagated
A SQLite write failure increments `FailedWriteCount` on `ISiteEventLogger`, logs an error via `ILogger`, and faults the returned `Task`. The calling actor discards the `Task` on the hot path (`_ = logger?.LogEventAsync(…)`), so the failure does not surface to the actor's message loop. `FailedWriteCount` is available for Health Monitoring integration but is not yet wired to the health surface; a non-zero count indicates disk pressure, a full queue, or a corrupt database file.
### Queue overflow drops oldest events
When the site write queue fills (sustained disk slowness or very high event rates), the oldest pending event is silently evicted and `FailedWriteCount` is incremented. Recent events are preserved at the cost of older ones. Reducing event throughput or increasing `WriteQueueCapacity` addresses sustained overflow.
### Cap-purge loop terminates early
If the database size does not decrease across a cap-purge batch, the loop stops to avoid emptying the entire table. This situation should not occur with `auto_vacuum = INCREMENTAL` enabled, but the guard prevents runaway deletion if vacuuming regresses. A `Warning` log line reporting the stable size is the signal to investigate filesystem-level free-page reclamation.
### Central query returns stale data after failover
After a site failover, the new active node's event log starts empty. Central queries will see no events until the new node generates them. This is by design — event logs are not replicated. Historical events from the previous active node return when that node comes back online and responds to queries.
## Related Documentation
- [Site Event Logging design specification](../requirements/Component-SiteEventLogging.md)
- [Audit Log](./AuditLog.md)
- [Site Call Audit](./SiteCallAudit.md)
- [CentralSite Communication](./Communication.md)
- [Site Runtime](./SiteRuntime.md)
- [Data Connection Layer](./DataConnectionLayer.md)
- [Store-and-Forward Engine](./StoreAndForward.md)
- [Host](./Host.md)
- [Commons](./Commons.md)