Component: Site Event Logging

Purpose

The Site Event Logging component records operational events at each site cluster, providing a local audit trail of runtime activity. Events are queryable from the central UI for remote troubleshooting.

Location

Site clusters (event recording and storage). Central cluster (remote query access via UI).

Responsibilities

  • Record operational events from all site subsystems.
  • Persist events to local SQLite.
  • Enforce 30-day retention policy with automatic purging.
  • Respond to remote queries from central for event log data.

Events Logged

  • Script Executions: Script started, completed, failed (with error details), recursion limit exceeded
  • Alarm Events: Alarm activated, alarm cleared (which alarm, which instance), alarm evaluation error
  • Deployment Events: Configuration received from central, scripts compiled, applied successfully, apply failed
  • Data Connection Status: Connected, disconnected, reconnected (per connection)
  • Store-and-Forward: Message queued, delivered, retried, parked
  • Instance Lifecycle: Instance enabled, disabled, deleted

Event Entry Schema

Each event entry contains:

  • Timestamp: When the event occurred.
  • Event Type: Category of the event (script, alarm, deployment, connection, store-and-forward, instance-lifecycle).
  • Severity: Info, Warning, or Error.
  • Instance ID (optional): The instance associated with the event (if applicable).
  • Source: The subsystem that generated the event (e.g., "ScriptActor:MonitorSpeed", "AlarmActor:OverTemp", "DataConnection:PLC1").
  • Message: Human-readable description of the event.
  • Details (optional): Additional structured data (e.g., exception stack trace, alarm name, message ID, compilation errors).
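As a sketch, the entry fields above map naturally onto a single SQLite table. The column names, types, constraints, and indexes below are illustrative assumptions only, shown in Python with the stdlib sqlite3 module for brevity (the actual runtime is .NET-based):

```python
import sqlite3

# Hypothetical schema mirroring the event entry fields; names and types
# are assumptions, not the component's actual schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp   TEXT NOT NULL,   -- ISO-8601 UTC
    event_type  TEXT NOT NULL,   -- script | alarm | deployment | connection | ...
    severity    TEXT NOT NULL CHECK (severity IN ('Info', 'Warning', 'Error')),
    instance_id TEXT,            -- optional: associated instance
    source      TEXT NOT NULL,   -- e.g. 'ScriptActor:MonitorSpeed'
    message     TEXT NOT NULL,   -- human-readable description
    details     TEXT             -- optional structured data, e.g. JSON
);
CREATE INDEX IF NOT EXISTS ix_events_timestamp ON events (timestamp);
CREATE INDEX IF NOT EXISTS ix_events_type_time ON events (event_type, timestamp);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO events (timestamp, event_type, severity, source, message) "
    "VALUES (?, ?, ?, ?, ?)",
    ("2026-03-16T14:00:00Z", "script", "Error",
     "ScriptActor:MonitorSpeed", "Script failed: recursion limit exceeded"),
)
conn.commit()
```

Indexing on timestamp (for time-range queries and retention purges) and on (event_type, timestamp) (for filtered queries) reflects the access patterns described under Central Access.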

Storage

  • Events are stored in local SQLite on each site node.
  • Each node maintains its own event log. Only the active node generates and stores events; event logs are not replicated to the standby node.
  • On failover, the new active node starts logging to its own SQLite database. Historical events from the previous active node are not queryable via central until that node comes back online. This is acceptable because event logs are diagnostic, not transactional.
  • Retention: 30 days. A daily background job runs on the active node and deletes all events older than 30 days. Hard delete — no archival.
  • Storage cap: A configurable maximum database size (default: 1 GB) is enforced. If the storage cap is reached before the 30-day retention window, the oldest events are purged first. This prevents disk exhaustion from alarm storms, script failure loops, or connection flapping.
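The daily purge described above can be sketched as two passes: a retention delete, then an oldest-first size-cap delete. This is a minimal illustration using the stdlib sqlite3 module; the function name, batch size, and cutoff handling are assumptions, not the actual implementation:

```python
import os
import sqlite3
import tempfile
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30
MAX_DB_BYTES = 1 * 1024**3  # 1 GB default cap (configurable)

def purge_events(conn, db_path, retention_days=RETENTION_DAYS,
                 max_bytes=MAX_DB_BYTES):
    """Daily purge sketch: hard-delete events past retention (no archival),
    then drop the oldest events in batches while the file exceeds the cap."""
    cutoff = (datetime.now(timezone.utc)
              - timedelta(days=retention_days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    conn.execute("DELETE FROM events WHERE timestamp < ?", (cutoff,))
    conn.commit()
    conn.execute("VACUUM")  # reclaim file space so the size check is accurate
    while os.path.getsize(db_path) > max_bytes:
        # Oldest-first purge in batches until the file fits under the cap.
        conn.execute("DELETE FROM events WHERE id IN "
                     "(SELECT id FROM events ORDER BY timestamp, id LIMIT 1000)")
        conn.commit()
        conn.execute("VACUUM")

# Demo against a throwaway file database with a minimal table.
db_path = os.path.join(tempfile.mkdtemp(), "events.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, "
             "timestamp TEXT, message TEXT)")
stale = (datetime.now(timezone.utc)
         - timedelta(days=40)).strftime("%Y-%m-%dT%H:%M:%SZ")
fresh = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
conn.execute("INSERT INTO events (timestamp, message) VALUES (?, ?)",
             (stale, "stale"))
conn.execute("INSERT INTO events (timestamp, message) VALUES (?, ?)",
             (fresh, "fresh"))
conn.commit()
purge_events(conn, db_path)
```

Note that ISO-8601 timestamps stored as text sort lexicographically, so the string comparison against the cutoff is equivalent to a time comparison as long as the format is consistent.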

Central Access

  • The central UI can query site event logs remotely via the Communication Layer.
  • Queries support filtering by:
    • Event type / category
    • Time range
    • Instance ID
    • Severity
    • Keyword search: Free-text search on message and source fields (SQLite LIKE query). Useful for finding events by script name, alarm name, or error message across all instances.
  • Results are paginated with a configurable page size (default: 500 events). Each response includes a continuation token for fetching additional pages. This prevents broad queries from overwhelming the communication channel.
  • The site processes the query locally against SQLite and returns matching results to central.
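The filter-plus-pagination behavior above can be sketched with keyset pagination, where the last row id of a page serves as the continuation token. Function and parameter names are illustrative assumptions (the real implementation is .NET-based); only the query shape follows the design:

```python
import sqlite3

PAGE_SIZE = 500  # configurable default page size per the design

def query_events(conn, severity=None, keyword=None, after_id=0,
                 page_size=PAGE_SIZE):
    """Sketch of local query handling: optional filters plus keyset
    pagination. `after_id` doubles as the continuation token (the last
    row id of the previous page); 0 means start from the beginning."""
    sql = ("SELECT id, timestamp, severity, source, message "
           "FROM events WHERE id > ?")
    args = [after_id]
    if severity is not None:
        sql += " AND severity = ?"
        args.append(severity)
    if keyword is not None:
        # Free-text LIKE search on message and source fields.
        sql += " AND (message LIKE ? OR source LIKE ?)"
        args += [f"%{keyword}%", f"%{keyword}%"]
    sql += " ORDER BY id LIMIT ?"
    args.append(page_size)
    rows = conn.execute(sql, args).fetchall()
    # A full page implies there may be more; a short page is the last one.
    next_token = rows[-1][0] if len(rows) == page_size else None
    return rows, next_token

# Demo: 7 matching events queried in pages of 5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, timestamp TEXT, "
             "severity TEXT, source TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO events (timestamp, severity, source, message) "
    "VALUES (?, ?, ?, ?)",
    [("2026-03-16T09:00:00Z", "Error", "ScriptActor:MonitorSpeed",
      f"failure {i}") for i in range(7)],
)
conn.commit()
page1, token = query_events(conn, severity="Error", keyword="failure",
                            page_size=5)
```

Keyset pagination keeps each page query cheap regardless of how deep into the result set the client has paged, which matters for broad queries over a 30-day log.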

Dependencies

  • SQLite: Local storage on each site node.
  • Communication Layer: Handles remote query requests from central.
  • Site Runtime: Generates script execution events, alarm events, deployment application events, and instance lifecycle events.
  • Data Connection Layer: Generates connection status events.
  • Store-and-Forward Engine: Generates buffer activity events.

Interactions

  • All site subsystems: Event logging is a cross-cutting concern — any subsystem that produces notable events calls the Event Logging service.
  • Communication Layer: Receives remote queries from central and returns results.
  • Central UI: Site Event Log Viewer displays queried events.
  • Health Monitoring: Script error rates and alarm evaluation error rates can be derived from event log data.