# Component: Site Event Logging
## Purpose
The Site Event Logging component records operational events at each site cluster, providing a local audit trail of runtime activity. Events are queryable from the central UI for remote troubleshooting.
## Location
Site clusters (event recording and storage). Central cluster (remote query access via UI).
## Responsibilities
- Record operational events from all site subsystems.
- Persist events to local SQLite.
- Enforce 30-day retention policy with automatic purging.
- Respond to remote queries from central for event log data.
## Events Logged
| Category | Events |
|----------|--------|
| Script Executions | Script started, completed, failed (with error details), recursion limit exceeded |
| Alarm Events | Alarm activated, alarm cleared (which alarm, which instance), alarm evaluation error |
| Deployment Events | Configuration received from central, scripts compiled, applied successfully, apply failed |
| Data Connection Status | Connected, disconnected, reconnected (per connection) |
| Store-and-Forward | Message queued, delivered, retried, parked |
| Instance Lifecycle | Instance enabled, disabled, deleted |
## Event Entry Schema
Each event entry contains:
- **Timestamp**: When the event occurred.
- **Event Type**: Category of the event (script, alarm, deployment, connection, store-and-forward, instance-lifecycle).
- **Severity**: Info, Warning, or Error.
- **Instance ID** *(optional)*: The instance associated with the event (if applicable).
- **Source**: The subsystem that generated the event (e.g., "ScriptActor:MonitorSpeed", "AlarmActor:OverTemp", "DataConnection:PLC1").
- **Message**: Human-readable description of the event.
- **Details** *(optional)*: Additional structured data (e.g., exception stack trace, alarm name, message ID, compilation errors).
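The schema above maps naturally onto a single SQLite table. The DDL below is an illustrative sketch, not the shipped design: column names and types (e.g. `details` as a JSON text blob, ISO-8601 UTC timestamps) are assumptions drawn from the field list.

```python
import sqlite3

# Illustrative table for the event entry schema described above.
# Column names/types are assumptions, not the actual design artifact.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE site_events (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp   TEXT NOT NULL,   -- ISO-8601 UTC, e.g. 2026-03-16T13:00:00Z
        event_type  TEXT NOT NULL,   -- script | alarm | deployment | connection | ...
        severity    TEXT NOT NULL,   -- Info | Warning | Error
        instance_id TEXT,            -- optional: associated instance
        source      TEXT NOT NULL,   -- e.g. "ScriptActor:MonitorSpeed"
        message     TEXT NOT NULL,   -- human-readable description
        details     TEXT             -- optional structured data (JSON blob)
    )
""")
conn.execute(
    "INSERT INTO site_events "
    "(timestamp, event_type, severity, instance_id, source, message) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("2026-03-16T13:00:00Z", "script", "Error", "Pump-7",
     "ScriptActor:MonitorSpeed", "Script failed: recursion limit exceeded"),
)
row = conn.execute("SELECT severity, source FROM site_events").fetchone()
```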
## Storage
- Events are stored in **local SQLite** on each site node.
- Each node maintains its own event log. Only the **active node** generates and stores events. Event logs are **not replicated** to the standby node. On failover, the new active node starts logging to its own SQLite database; historical events from the previous active node are no longer queryable via central until that node comes back online. This is acceptable because event logs are diagnostic, not transactional.
- **Retention**: 30 days. A **daily background job** runs on the active node and deletes all events older than 30 days. Hard delete — no archival.
- **Storage cap**: A configurable maximum database size (default: 1 GB) is enforced. If the storage cap is reached before the 30-day retention window, the oldest events are purged first. This prevents disk exhaustion from alarm storms, script failure loops, or connection flapping.
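The retention and cap rules above can be sketched as a two-phase purge: delete past the 30-day window, then, if the file still exceeds the cap, delete oldest-first in batches. This is a minimal sketch assuming the illustrative `site_events` table with ISO-8601 text timestamps; the batch size and use of `VACUUM` to reclaim file space are assumptions, not the actual job.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30
MAX_DB_BYTES = 1 * 1024 ** 3  # configurable cap, 1 GB default

def db_bytes(conn: sqlite3.Connection) -> int:
    # File size = page_count * page_size; includes freelist pages,
    # which is why a VACUUM follows each batch delete below.
    (pages,) = conn.execute("PRAGMA page_count").fetchone()
    (size,) = conn.execute("PRAGMA page_size").fetchone()
    return pages * size

def purge(conn: sqlite3.Connection, max_bytes: int = MAX_DB_BYTES) -> None:
    # Phase 1: hard-delete everything past the retention window (no archival).
    # ISO-8601 UTC timestamps compare correctly as strings.
    cutoff = (datetime.now(timezone.utc)
              - timedelta(days=RETENTION_DAYS)).strftime("%Y-%m-%dT%H:%M:%SZ")
    conn.execute("DELETE FROM site_events WHERE timestamp < ?", (cutoff,))
    conn.commit()
    # Phase 2: if the file still exceeds the cap, drop oldest-first in batches.
    while db_bytes(conn) > max_bytes:
        cur = conn.execute(
            "DELETE FROM site_events WHERE rowid IN "
            "(SELECT rowid FROM site_events ORDER BY timestamp LIMIT 1000)")
        conn.commit()
        if cur.rowcount == 0:
            break  # table empty; nothing more to free
        conn.execute("VACUUM")  # reclaim freed pages so the size check moves
```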
## Central Access
- The central UI can query site event logs remotely via the Communication Layer.
- Queries support filtering by:
- Event type / category
- Time range
- Instance ID
- Severity
- **Keyword search**: Free-text search on message and source fields (SQLite LIKE query). Useful for finding events by script name, alarm name, or error message across all instances.
- Results are **paginated** with a configurable page size (default: 500 events). Each response includes a continuation token for fetching additional pages. This prevents broad queries from overwhelming the communication channel.
- The site processes the query locally against SQLite and returns matching results to central.
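One way the site-side query handler could implement the filters, LIKE-based keyword search, and continuation token is keyset pagination, where the token is simply the last event ID returned. That token encoding is an assumption (the document only specifies that a token exists), and the `query_events` helper and column names are hypothetical.

```python
import sqlite3

PAGE_SIZE = 500  # configurable default page size

def query_events(conn, *, severity=None, keyword=None,
                 after_id=0, page_size=PAGE_SIZE):
    """Keyset-paginated event query. `after_id` doubles as the
    continuation token returned from the previous page."""
    sql = "SELECT id, severity, source, message FROM site_events WHERE id > ?"
    args = [after_id]
    if severity:
        sql += " AND severity = ?"
        args.append(severity)
    if keyword:
        # Free-text search on message and source fields (SQLite LIKE).
        sql += " AND (message LIKE ? OR source LIKE ?)"
        args += [f"%{keyword}%", f"%{keyword}%"]
    sql += " ORDER BY id LIMIT ?"
    args.append(page_size)
    rows = conn.execute(sql, args).fetchall()
    # A full page means more results may exist; hand back a token.
    next_token = rows[-1][0] if len(rows) == page_size else None
    return rows, next_token
```

A keyset cursor stays stable as new events are appended, unlike OFFSET-based paging, which can skip or repeat rows between pages.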
## Dependencies
- **SQLite**: Local storage on each site node.
- **Communication Layer**: Handles remote query requests from central.
- **Site Runtime**: Generates script execution events, alarm events, deployment application events, and instance lifecycle events.
- **Data Connection Layer**: Generates connection status events.
- **Store-and-Forward Engine**: Generates buffer activity events.
## Interactions
- **All site subsystems**: Event logging is a cross-cutting concern — any subsystem that produces notable events calls the Event Logging service.
- **Communication Layer**: Receives remote queries from central and returns results.
- **Central UI**: Site Event Log Viewer displays queried events.
- **Health Monitoring**: Script error rates and alarm evaluation error rates can be derived from event log data.
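As a concrete illustration of the Health Monitoring derivation above, the script error rate falls out of a single aggregate over the event log. The query below is a sketch against the illustrative `site_events` columns, not a defined Health Monitoring interface; in SQLite, `AVG` over a boolean expression yields the matching fraction directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE site_events (event_type TEXT, severity TEXT, timestamp TEXT)")
conn.executemany("INSERT INTO site_events VALUES (?, ?, ?)", [
    ("script", "Info",  "2026-03-16T12:00:00Z"),
    ("script", "Info",  "2026-03-16T12:05:00Z"),
    ("script", "Error", "2026-03-16T12:10:00Z"),
    ("alarm",  "Error", "2026-03-16T12:11:00Z"),  # excluded: not a script event
])
(rate,) = conn.execute("""
    SELECT AVG(severity = 'Error')   -- fraction of script events that failed
    FROM site_events
    WHERE event_type = 'script'
""").fetchone()
# One of three script events failed, so rate is 1/3.
```

A production variant would add a `WHERE timestamp > ?` window so the rate reflects recent behavior rather than the full 30-day log.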