diff --git a/CLAUDE.md b/CLAUDE.md index 6bd8abd..513340d 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -36,7 +36,7 @@ This project contains design documentation for a distributed SCADA system built - Use `git diff` to review changes before committing. - Commit related changes together with a descriptive message summarizing the design decision. -## Current Component List (21 components) +## Current Component List (22 components) 1. Template Engine — Template modeling, inheritance, composition, validation, flattening, diffs. 2. Deployment Manager — Central-side deployment pipeline, system-wide artifact deployment, instance lifecycle. @@ -59,6 +59,7 @@ This project contains design documentation for a distributed SCADA system built 19. CLI — Command-line tool using HTTP Management API, System.CommandLine, JSON/table output. 20. Traefik Proxy — Reverse proxy/load balancer fronting central cluster, active node routing via `/health/active`, automatic failover. 21. Notification Outbox — Central component ingesting store-and-forwarded notifications, `Notifications` audit table, dispatcher loop, retry/parking, delivery KPIs. +22. Site Call Audit — Central component auditing site cached calls (`CachedCall`/`CachedWrite`); `SiteCalls` audit table, telemetry ingest, reconciliation, KPIs, central→site Retry/Discard relay; sites remain the source of truth. ## Key Design Decisions (for context across sessions) @@ -120,6 +121,11 @@ This project contains design documentation for a distributed SCADA system built - Site→central handoff is at-least-once: ack-after-persist plus insert-if-not-exists on `NotificationId`. - No Akka replication — MS SQL is the HA store; daily purge of terminal rows after a configurable window (default 365 days). - Notification Outbox retry reuses central SMTP max-retry-count and fixed interval. 
+- Cached calls (`ExternalSystem.CachedCall`, `Database.CachedWrite`) return a `TrackedOperationId` tracking handle, unified with `Notify.Send`'s existing tracking model (`Notify.Status` retained as a thin alias). +- A site-local operation tracking table (SQLite, alongside the S&F buffer) is the source of truth for cached-call status; `Tracking.Status(id)` reads it site-locally and authoritatively; terminal rows purged after a configurable window (default 7 days). +- Unified tracking status lifecycle `Pending → Retrying → Delivered / Parked / Failed / Discarded`; `Failed` = permanent failure (also returned synchronously to the calling script). No `Forwarding` state for cached calls. +- Site Call Audit (#22): central `SiteCallAuditActor` singleton with a `SiteCalls` audit table (central MS SQL) fed by best-effort site telemetry plus periodic reconciliation pulls — an eventually-consistent mirror, NOT a dispatcher; cached-call delivery stays site-local. Ingest is insert-if-not-exists then upsert-on-newer-status. +- Central UI Site Calls page + central→site `RetryParkedOperation`/`DiscardParkedOperation` relay for parked cached calls; central never mutates the `SiteCalls` row directly. ### Security & Auth - Authentication: direct LDAP bind (username/password), no Kerberos/NTLM. LDAPS/StartTLS required. @@ -144,6 +150,7 @@ This project contains design documentation for a distributed SCADA system built - Notification Outbox KPIs are central-computed point-in-time from the `Notifications` table (global + per-source-site): queue depth, stuck count, parked count, delivered-last-interval, oldest-pending age. - Stuck = `Pending`/`Retrying` older than a configurable age threshold (default 10 min) — display-only (KPI count + row badge), no escalation/alerting. - Headline KPI tiles surface on the Health dashboard; a new Central UI Notification Outbox page offers a queryable list with Retry/Discard actions on parked notifications. 
+- Site Call Audit KPIs are central-computed point-in-time from the `SiteCalls` table (global + per-site), mirroring the Notification Outbox KPI shape; tiles surface on the Health dashboard alongside a queryable Central UI Site Calls page with Retry/Discard on parked rows. ### Code Organization - Entity classes are persistence-ignorant POCOs in Commons; EF mappings in Configuration Database. diff --git a/README.md b/README.md index 433f7cf..4accec8 100644 --- a/README.md +++ b/README.md @@ -55,6 +55,7 @@ This document serves as the master index for the SCADA system design. The system | 19 | CLI | [docs/requirements/Component-CLI.md](docs/requirements/Component-CLI.md) | Standalone command-line tool, System.CommandLine, HTTP transport via Management API, JSON/table output, mirrors all Management Service operations. | | 20 | Traefik Proxy | [docs/requirements/Component-TraefikProxy.md](docs/requirements/Component-TraefikProxy.md) | Reverse proxy/load balancer fronting central cluster, active node routing via `/health/active`, automatic failover. | | 21 | Notification Outbox | [docs/requirements/Component-NotificationOutbox.md](docs/requirements/Component-NotificationOutbox.md) | Central component ingesting store-and-forwarded notifications into the `Notifications` audit table, with `NotificationOutboxActor` singleton dispatcher, per-type delivery adapters, retry/parking, status tracking, daily purge, and delivery KPIs. | +| 22 | Site Call Audit | [docs/requirements/Component-SiteCallAudit.md](docs/requirements/Component-SiteCallAudit.md) | Central component auditing site cached calls (`ExternalSystem.CachedCall`/`Database.CachedWrite`) into the `SiteCalls` audit table, with `SiteCallAuditActor` singleton, telemetry ingest, periodic reconciliation, point-in-time KPIs, daily purge, and central→site Retry/Discard relay for parked calls. 
| ### Reference Documentation diff --git a/docs/requirements/Component-CentralUI.md b/docs/requirements/Component-CentralUI.md index 180ffaa..9a5102f 100644 --- a/docs/requirements/Component-CentralUI.md +++ b/docs/requirements/Component-CentralUI.md @@ -127,10 +127,17 @@ Central cluster only. Sites have no user interface. - **Stuck rows are visually badged** — a notification is stuck if it is `Pending` or `Retrying` and older than the configurable stuck-age threshold. Stuck detection is display-only; there is no automated escalation or alerting. - All queries are served from the central `Notifications` table — no remote per-site queries are needed, unlike the Parked Message Management page. +### Site Calls (Deployment Role) +- Monitor cached calls store-and-forwarded from sites — `ExternalSystem.CachedCall()` and `Database.CachedWrite()` operations. Scoped to the `ExternalCall` and `DatabaseWrite` kinds only; notifications keep their separate Notification Outbox page and are not merged here. +- A **queryable cached-call list** filterable by site, kind, status, and time range. Each row shows the call's timestamp, site, kind, target summary, status badge, retry count, and last error. +- **Retry** and **Discard** actions are available on `Parked` rows only — `Failed` rows are not actionable, since a permanent failure would simply fail again and its error was already returned synchronously to the calling script. The actions issue central→site commands to the owning site; if the site is offline the UI surfaces a "site unreachable" message. +- Data is served from the central Site Call Audit component's `SiteCalls` table. The page is **read-mostly** — an eventually-consistent mirror of site state; the site remains the source of truth. + ### Health Monitoring Dashboard (All Roles) - Overview of all sites with online/offline status. 
- Per-site detail: active/standby node status, data connection health, script error rates, alarm evaluation error rates, store-and-forward buffer depths. - Headline **Notification Outbox KPI tiles** — queue depth, stuck count, and parked count. These are central-computed by the Notification Outbox from the central `Notifications` table (not part of any site health report). The full outbox view is on the dedicated Notification Outbox page. +- Headline **Site Call Audit KPI tiles** — buffered count, parked count, and failed-last-interval. These are central-computed by the Site Call Audit component from the central `SiteCalls` table (not part of any site health report). The full cached-call view is on the dedicated Site Calls page. ### Site Event Log Viewer (Deployment Role) - Query site event logs remotely. @@ -155,3 +162,4 @@ Central cluster only. Sites have no user interface. - **Configuration Database**: All central data, including audit log data for the audit log viewer. Accessed via `ICentralUiRepository`. - **Health Monitoring**: Provides site health data for the dashboard. - **Notification Outbox**: Provides notification delivery KPIs and serves the `Notifications` table queries and Retry/Discard actions for the Notification Outbox page. +- **Site Call Audit**: Serves the `SiteCalls` table queries and relays Retry/Discard actions to sites for the Site Calls page. diff --git a/docs/requirements/Component-Commons.md b/docs/requirements/Component-Commons.md index 4c08d64..3811dd5 100644 --- a/docs/requirements/Component-Commons.md +++ b/docs/requirements/Component-Commons.md @@ -35,6 +35,9 @@ Commons must define shared primitive and utility types used across multiple comp - **`AlarmLevel` enum**: None, Low, LowLow, High, HighHigh. Severity level for an active alarm; always `None` for binary trigger types, set by `HiLo` triggers. - **`AlarmTriggerType` enum**: ValueMatch, RangeViolation, RateOfChange, HiLo. 
- **`ConnectionHealth` enum**: Connected, Disconnected, Connecting, Error. +- **`TrackedOperationId`**: A GUID identifying a tracked store-and-forward operation (`ExternalSystem.CachedCall`, `Database.CachedWrite`, `Notify.Send`). Generated caller-side at the site at call time, returned to the script as a tracking handle, and reused as the idempotency key for telemetry sent to central. The notification domain's existing `NotificationId` is the notification-specific name for this same concept. +- **`TrackedOperationKind` enum**: ExternalCall, DatabaseWrite. Discriminates the two cached-call kinds carried by a tracked operation (notifications are tracked separately via the `NotificationType` enum). +- **`TrackedOperationStatus` enum**: Pending, Retrying, Delivered, Parked, Failed, Discarded. The unified lifecycle state shared by all tracked store-and-forward operations. This is the operation's externally-observable lifecycle status in the site-local tracking table (the status record); it is related to but distinct from the S&F buffer's own `StoreAndForwardMessageStatus`, which tracks a buffered message's retry state within the buffer (the retry mechanism). `Failed` (permanent failure) has no notification analogue — notifications use only the other five states (the `NotificationStatus` enum omits `Failed`). Types defined here must be immutable and thread-safe. @@ -84,6 +87,7 @@ Commons must define repository interfaces that consuming components use for data - `IExternalSystemRepository` — External system definitions, method definitions, database connection definitions. - `INotificationRepository` — Notification lists (including the `Type` field), recipients, SMTP configuration. - `INotificationOutboxRepository` — The `Notifications` table: insert-if-not-exists ingest on `NotificationId`, due-row polling (`Pending` rows and `Retrying` rows past `NextAttemptAt`), status transitions, KPI aggregate queries, and the bulk delete of terminal rows used by the daily purge job. 
+- `ISiteCallAuditRepository` — The `SiteCalls` table: insert-if-not-exists ingest on `TrackedOperationId`, upsert-on-newer-status from telemetry and reconciliation pulls, KPI aggregate queries, and the bulk delete of terminal rows used by the daily purge job. - `ISiteRepository` — Sites, data connections, and their site assignments. - `ICentralUiRepository` — Read-oriented queries spanning multiple domain areas for display purposes. @@ -119,6 +123,8 @@ Commons must define the shared DTOs and message contracts used for inter-compone - **Script Execution DTOs**: Script call requests (with recursion depth), return values, error results. - **System-Wide Artifact DTOs**: Shared script packages, external system definitions, database connection definitions, notification list definitions. - **Notification DTOs**: `NotificationSubmit` (site→central submission: `NotificationId`, `ListName`, `Subject`, `Body`, provenance, `SiteEnqueuedAt`) and `NotificationSubmitAck` (central acknowledgement returned only after the `Notifications` row is persisted — ack-after-persist — which the site Store-and-Forward Engine waits on before clearing the buffered message). `NotificationStatusQuery` / `NotificationStatusResponse` back the `Notify.Status` script API, round-tripping a status record (status, retry count, last error, key timestamps) once a notification has been forwarded. Recipient resolution is *not* part of any contract — the site forwards only `(listName, subject, body)` and central resolves the list at delivery time. Subject to the additive-only evolution rules in REQ-COM-5a, since a submission can cross the site→central version-skew boundary. 
+- **Cached Call Tracking DTOs**: `CachedCallTelemetry` (site→central lifecycle telemetry for a tracked cached call: `TrackedOperationId`, source site, `Kind` — the `TrackedOperationKind` enum (`ExternalCall` / `DatabaseWrite`) — target summary, status, retry count, last error, key timestamps, and source instance / script provenance) and `CachedCallReconcileRequest` / `CachedCallReconcileResponse` (cursor-based per-site pull of tracking rows changed since a cursor, used so missed telemetry self-heals). All three live in the `Integration/` message folder and are subject to the additive-only evolution rules in REQ-COM-5a, since they cross the site→central version-skew boundary. +- **Parked Operation Command DTOs**: `RetryParkedOperation` and `DiscardParkedOperation` (central→site command/control messages keyed by `TrackedOperationId`, instructing the owning site to retry or discard a parked store-and-forward operation). These generalize the existing parked-message retry/discard commands to also cover parked cached calls; they live in the `RemoteQuery/` message folder alongside the other parked-message management messages. All message types must be `record` types or immutable classes suitable for use as Akka.NET messages (though Commons itself must not depend on Akka.NET). 
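Reviewer sketch (not part of the patch): the `CachedCallTelemetry` contract and the `ISiteCallAuditRepository` ingest rule (insert-if-not-exists on `TrackedOperationId`, then upsert-on-newer-status) might look roughly like the Python below. It is illustrative only and is not the project's C# code. The field names, the in-memory dict standing in for the `SiteCalls` table, and in particular the use of an `updated_at_utc` timestamp to decide what counts as "newer" are assumptions; the docs do not say how "newer" is determined.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from uuid import UUID, uuid4


class TrackedOperationKind(Enum):
    EXTERNAL_CALL = "ExternalCall"
    DATABASE_WRITE = "DatabaseWrite"


class TrackedOperationStatus(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


@dataclass(frozen=True)  # immutable, mirroring the record-type requirement
class CachedCallTelemetry:
    tracked_operation_id: UUID
    source_site: str
    kind: TrackedOperationKind
    status: TrackedOperationStatus
    retry_count: int
    updated_at_utc: datetime  # hypothetical field used to order events


def ingest(site_calls: dict, event: CachedCallTelemetry) -> str:
    """Insert-if-not-exists, then overwrite only when the event is newer."""
    current = site_calls.get(event.tracked_operation_id)
    if current is None:
        site_calls[event.tracked_operation_id] = event
        return "inserted"
    if event.updated_at_utc > current.updated_at_utc:  # assumed ordering rule
        site_calls[event.tracked_operation_id] = event
        return "updated"
    return "ignored"  # duplicate or out-of-order event is a harmless no-op


# Demo: a re-sent Pending event arriving after Delivered changes nothing.
op_id = uuid4()
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
pending = CachedCallTelemetry(op_id, "site-a", TrackedOperationKind.EXTERNAL_CALL,
                              TrackedOperationStatus.PENDING, 0, t0)
delivered = CachedCallTelemetry(op_id, "site-a", TrackedOperationKind.EXTERNAL_CALL,
                                TrackedOperationStatus.DELIVERED, 1, t1)
table = {}
results = [ingest(table, e) for e in (pending, delivered, pending)]
```

Because the same `TrackedOperationId` keys every event, at-least-once delivery and reconciliation pulls can replay events freely without corrupting the mirror row.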
@@ -145,11 +151,13 @@ ScadaLink.Commons/ │ ├── StaleTagMonitor.cs # heartbeat staleness watchdog │ ├── ValueFormatter.cs # culture-invariant value-to-string helper │ ├── DynamicJsonElement.cs # dynamic JSON wrapper for scripts +│ ├── TrackedOperationId.cs # tracked store-and-forward operation ID (GUID) │ ├── Enums/ # InstanceState, DeploymentStatus, AlarmState, │ │ # AlarmLevel, AlarmTriggerType, ConnectionHealth, │ │ # DataType, StoreAndForwardCategory, │ │ # StoreAndForwardMessageStatus, -│ │ # NotificationType, NotificationStatus +│ │ # NotificationType, NotificationStatus, +│ │ # TrackedOperationKind, TrackedOperationStatus │ ├── DataConnections/ # OPC UA endpoint config value objects + enums │ ├── Flattening/ # FlattenedConfiguration, ConfigurationDiff, │ │ # DeploymentPackage, ValidationResult @@ -164,6 +172,7 @@ ScadaLink.Commons/ │ │ ├── IExternalSystemRepository.cs │ │ ├── INotificationRepository.cs │ │ ├── INotificationOutboxRepository.cs +│ │ ├── ISiteCallAuditRepository.cs │ │ ├── ISiteRepository.cs │ │ └── ICentralUiRepository.cs │ └── Services/ # REQ-COM-4a: Cross-cutting service interfaces @@ -199,11 +208,13 @@ ScadaLink.Commons/ │ ├── Artifacts/ │ ├── DataConnection/ # data-connection subscribe/write/health messages │ ├── Instance/ # attribute get/set request/command messages -│ ├── Integration/ # external-integration call request/response +│ ├── Integration/ # external-integration call request/response, +│ │ # cached-call tracking telemetry + reconcile │ ├── Notification/ # NotificationSubmit + ack, -│ │ # NotificationStatusQuery/Response +│ │ # NotificationStatusQuery/Response │ ├── InboundApi/ # Route.To() request messages -│ ├── RemoteQuery/ # event-log and parked-message query messages +│ ├── RemoteQuery/ # event-log and parked-message query messages, +│ │ # parked-operation retry/discard commands │ └── Management/ # HTTP/ClusterClient management commands + registry ├── Serialization/ # OpcUaEndpointConfigSerializer (typed↔legacy JSON) └── 
Validators/ # OpcUaEndpointConfigValidator diff --git a/docs/requirements/Component-Communication.md b/docs/requirements/Component-Communication.md index f6cdb84..6c870a9 100644 --- a/docs/requirements/Component-Communication.md +++ b/docs/requirements/Component-Communication.md @@ -122,7 +122,7 @@ Keepalive settings are configurable via `CommunicationOptions`: - Site event logs. - Instance debug snapshots (attribute values and alarm states). - Central can also send management commands: - - Retry or discard parked messages. + - Retry or discard parked messages and parked cached calls — central sends `RetryParkedOperation` / `DiscardParkedOperation` (keyed by `TrackedOperationId`) to the owning site, which applies the change to its S&F buffer and tracking table. ### 9. Notification Submission (Site → Central) - **Pattern**: Fire-and-forget with acknowledgment. @@ -131,6 +131,14 @@ Keepalive settings are configurable via `CommunicationOptions`: - The `NotificationId` GUID — generated at the site — is the **idempotency key**. The handoff is at-least-once: a re-sent submission after a lost ack is harmless because central's insert-if-not-exists treats the duplicate as a no-op. - **Transport**: ClusterClient (site→central command/control), consistent with how other site→central messages are sent. +### 10. Cached Call Telemetry (Site → Central) +- **Pattern**: Fire-and-forget telemetry with a periodic reconciliation pull. +- The site **Store-and-Forward Engine** emits a `CachedCallTelemetry` message to central on **every** cached-call lifecycle transition (`Pending → Retrying → Delivered / Parked / Failed / Discarded`). The first telemetry event for an operation carries its initial status — `Pending` when a transient failure has buffered the call, or directly `Delivered`/`Failed` for a cached call that never buffers. 
The message carries the `TrackedOperationId`, source site, `Kind` (the `TrackedOperationKind` enum), target summary, status, retry count, last error, key timestamps, and source provenance. +- Emission is **best-effort and at-least-once**, **idempotent on `TrackedOperationId`** — central's Site Call Audit component ingests with insert-if-not-exists then upsert-on-newer-status, so a re-sent or out-of-order event is harmless. +- **Reconciliation pull**: because telemetry is best-effort, the central **Site Call Audit** component periodically — and on site reconnect — issues a `CachedCallReconcileRequest` to each site; the site replies with a `CachedCallReconcileResponse` carrying all tracking rows changed since a cursor. Any telemetry missed during a disconnect self-heals through this pull. +- Central audit is an **eventually-consistent mirror** — the site's operation tracking table remains the source of truth for cached-call status (`Tracking.Status(id)` is always answered site-locally). +- **Transport**: ClusterClient (site→central command/control), consistent with how other site→central messages are sent. + ## Topology ``` @@ -182,6 +190,7 @@ Each request/response pattern has a default timeout that can be overridden in co | 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack | | 8. Remote Queries | 30 seconds | Querying parked messages or event logs | | 9. Notification Submission | 30 seconds | Fire-and-forget with ack; central acks after persisting the row | +| 10. Cached Call Telemetry | 30 seconds | Reconciliation pull is request/response; telemetry emission itself is fire-and-forget | Timeouts use the Akka.NET **ask pattern**. If no response is received within the timeout, the caller receives a timeout failure. @@ -237,6 +246,7 @@ The ManagementActor is registered at the well-known path `/user/management` on c - **Site Runtime**: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data. 
- **Central UI**: Debug view requests and remote queries flow through communication. - **Health Monitoring**: Receives periodic health reports from sites. -- **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication. +- **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication. Also emits `CachedCallTelemetry` and answers `CachedCallReconcileRequest` pulls, and receives relayed `RetryParkedOperation` / `DiscardParkedOperation` commands. +- **Site Call Audit (central)**: Receives cached-call telemetry and reconciliation responses; issues reconciliation pulls and relays parked-operation Retry/Discard commands to sites through communication. - **Site Event Logging**: Event log queries are routed through communication. - **Management Service**: The ManagementActor is registered with ClusterClientReceptionist on central nodes. The CLI communicates with the ManagementActor via ClusterClient, which is a separate channel from inter-cluster remoting. diff --git a/docs/requirements/Component-ConfigurationDatabase.md b/docs/requirements/Component-ConfigurationDatabase.md index a00807b..55d8f74 100644 --- a/docs/requirements/Component-ConfigurationDatabase.md +++ b/docs/requirements/Component-ConfigurationDatabase.md @@ -57,6 +57,9 @@ The configuration database stores all central system data, organized by domain a - **SMTP Configuration**: Email server settings. - **Notifications**: The durable central notification queue owned by the Notification Outbox — one row per notification, the single source of audit truth. The schema is **type-agnostic** so it records any notification type the system supports (email today, Microsoft Teams and others later): a `Type` discriminator selects the type, and a `TypeData` JSON column (`nvarchar(max)`) carries any future per-type fields without a schema change. 
Columns: `NotificationId` (GUID, primary key — generated at the site, used as the idempotency key), `Type`, `ListName`, `Subject`, `Body`, `TypeData`, `Status`, `RetryCount`, `LastError`, `ResolvedTargets`, `SourceSiteId`, `SourceInstanceId`, `SourceScript`, `SiteEnqueuedAt`, `CreatedAt`, `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt`. `Status` is a `NotificationStatus` enum stored with values `Pending`, `Retrying`, `Delivered`, `Parked`, `Discarded` (the site-local `Forwarding` state is never persisted centrally). Indexed on `Status` and `NextAttemptAt` for efficient dispatcher polling of due rows, and on `SourceSiteId` and `CreatedAt` for KPI computation and the Central UI query page. Terminal rows are removed by a daily purge job — see Scheduled Maintenance below. See Component-NotificationOutbox.md for the full lifecycle. +### Site Calls +- **SiteCalls**: The central audit table for cached site calls — `ExternalSystem.CachedCall()` and `Database.CachedWrite()` — owned by the Site Call Audit component and a sibling of the `Notifications` table. One row per cached operation. Columns: `TrackedOperationId` (GUID, primary key — generated site-side at call time, used as the idempotency key), `SourceSite`, `Kind` (a `TrackedOperationKind` enum stored with values `ExternalCall` / `DatabaseWrite`), `TargetSummary` (external system + method for an `ExternalCall`, database connection name for a `DatabaseWrite`), `Status` (a `TrackedOperationStatus` enum stored with values `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`), `RetryCount`, `LastError`, `Provenance` (source instance / script), `CreatedAtUtc`, `UpdatedAtUtc`, `TerminalAtUtc`. The table is populated **only** by Site Call Audit telemetry and reconciliation pulls — sites are the source of truth and the row is an eventually-consistent mirror, never written by a central dispatcher. 
Ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`, then **upsert-on-newer-status**; the lifecycle is monotonic, so at-least-once and out-of-order telemetry are harmless. Indexed on `Status` and `SourceSite` for KPI computation and the Central UI query page. Terminal rows are removed by a daily purge job — see Scheduled Maintenance below. See Component-SiteCallAudit.md for the full lifecycle. + ### Inbound API - **API Keys**: Key definitions (name/label, key value, enabled flag). - **API Methods**: Method definitions (name, approved key references, parameter definitions, return value definitions, implementation script, timeout). @@ -97,6 +100,7 @@ Repository interfaces are defined in **Commons** alongside the POCO entity class | `IExternalSystemRepository` | External System Gateway | External system definitions, method definitions, database connection definitions | | `INotificationRepository` | Notification Service | Notification lists (including the `Type` field), recipients, SMTP configuration | | `INotificationOutboxRepository` | Notification Outbox | The `Notifications` table — insert-if-not-exists ingest, due-row polling, status transitions, KPI aggregate queries, and bulk delete of terminal rows used by the daily purge job | +| `ISiteCallAuditRepository` | Site Call Audit | The `SiteCalls` table — insert-if-not-exists ingest, upsert-on-newer-status, KPI aggregate queries, and bulk delete of terminal rows used by the daily purge job | | `IHealthMonitoringRepository` | Health Monitoring | (Minimal — health data is in-memory; repository needed only if connectivity history is persisted in the future) | | `ICentralUiRepository` | Central UI | Read-oriented queries spanning multiple domain areas for display purposes | @@ -274,6 +278,10 @@ The Configuration Database supports seeding initial data required for the system The `Notifications` table grows one row per notification and is never trimmed by normal operation — `Discarded` rows are 
deliberately retained for audit. To bound table growth while preserving a strong audit trail, a **daily purge job** deletes terminal rows (`Delivered`, `Parked`, `Discarded`) older than a configurable retention window (default 365 days). Non-terminal rows (`Pending`, `Retrying`) are never purged. The purge is a bulk `DELETE` against `INotificationOutboxRepository`; it is owned and scheduled by the Notification Outbox component (see Component-NotificationOutbox.md), which supplies the retention window from `NotificationOutboxOptions`. The Configuration Database component provides only the repository operation and the table. +### SiteCalls Table Purge + +The `SiteCalls` table grows one row per cached site call and is never trimmed by normal operation. To bound table growth while preserving a strong audit trail, a **daily purge job** deletes terminal rows (`Delivered`, `Failed`, `Discarded`) older than a configurable retention window (default 365 days). Non-terminal rows (`Pending`, `Retrying`, `Parked`) are never purged. The purge is a bulk `DELETE`; it is owned and scheduled by the Site Call Audit component (see Component-SiteCallAudit.md), which supplies the retention window. The Configuration Database component provides only the repository operation and the table. + --- ## Connection Management @@ -301,6 +309,7 @@ The `Notifications` table grows one row per notification and is never trimmed by - **External System Gateway**: Uses `IExternalSystemRepository` for external system and database connection definitions. - **Notification Service**: Uses `INotificationRepository` for notification lists and SMTP configuration. - **Notification Outbox**: Uses `INotificationOutboxRepository` for all access to the `Notifications` table — ingest, dispatch polling, status updates, KPI queries, and the daily purge of terminal rows. 
+- **Site Call Audit**: Uses `ISiteCallAuditRepository` for all access to the `SiteCalls` table — telemetry/reconciliation ingest, KPI queries, and the daily purge of terminal rows. - **Central UI**: Uses `ICentralUiRepository` for read-oriented queries across domain areas, including audit log queries for the audit log viewer. - **All central components that modify state**: Call `IAuditService.LogAsync()` after successful operations to record audit entries within the same transaction. - **Host**: Provides database connection configuration. Registers DbContext, repository implementations, and `IAuditService` implementation in the DI container. Triggers auto-migration in development or validates schema version in production. diff --git a/docs/requirements/Component-ExternalSystemGateway.md b/docs/requirements/Component-ExternalSystemGateway.md index f64af27..e38560c 100644 --- a/docs/requirements/Component-ExternalSystemGateway.md +++ b/docs/requirements/Component-ExternalSystemGateway.md @@ -59,10 +59,11 @@ Each database connection definition includes: - Failures are immediate — no buffering. ### Cached Write (Store-and-Forward) -- Script calls `Database.CachedWrite("name", "sql", parameters)`. -- The write is submitted to the Store-and-Forward Engine. +- Script calls `Database.CachedWrite("name", "sql", parameters)`. This is **deferred delivery**: the call returns a `TrackedOperationId` tracking handle immediately rather than the write result. - Payload includes: connection name, SQL statement, serialized parameter values. -- If the database is unavailable, the write is buffered and retried per the connection's retry settings. +- The write is attempted immediately. On immediate success it is recorded as a terminal `Delivered` tracking record. On **transient failure** (database unavailable) it is buffered (`Pending`/`Retrying`) and retried per the connection's retry settings by the Store-and-Forward Engine. +- On **permanent failure** (e.g. 
a SQL syntax or constraint error — a request that will never succeed), the error is returned **synchronously** to the calling script and the write is **not** buffered. The call is also recorded as a terminal `Failed` tracking record capturing the error. +- Cached-write status is observable to scripts via `Tracking.Status(id)` (answered site-locally and authoritatively) and centrally via the Site Call Audit component. ## Invocation Protocol @@ -84,10 +85,11 @@ Scripts choose between two call modes per invocation, mirroring the dual-mode da - Use for request/response interactions where the script needs the result (e.g., fetching a recipe, querying inventory). ### Cached (Store-and-Forward) -- Script calls `ExternalSystem.CachedCall("systemName", "methodName", params)`. -- The call is attempted immediately. If it succeeds, the response is discarded (fire-and-forget). -- On **transient failure** (connection refused, timeout, HTTP 5xx), the call is routed to the Store-and-Forward Engine for retry per the system's retry settings. The script does **not** block — the call is buffered and the script continues. -- On **permanent failure** (HTTP 4xx), the error is returned **synchronously** to the calling script. No retry — the request itself is wrong. +- Script calls `ExternalSystem.CachedCall("systemName", "methodName", params)`. This is **deferred delivery**: the call returns a `TrackedOperationId` tracking handle immediately rather than the response body. +- The call is attempted immediately. If it succeeds, the response is discarded and the call is recorded as a terminal `Delivered` tracking record. +- On **transient failure** (connection refused, timeout, HTTP 5xx), the call is routed to the Store-and-Forward Engine for retry per the system's retry settings. The script does **not** block — the call is buffered (`Pending`/`Retrying`) and the script continues. +- On **permanent failure** (HTTP 4xx), the error is returned **synchronously** to the calling script. 
No retry — the request itself is wrong. The call is also recorded as a terminal `Failed` tracking record capturing the error. +- Cached-call status is observable to scripts via `Tracking.Status(id)` (answered site-locally and authoritatively) and centrally via the Site Call Audit component. - Use for outbound data pushes where deferred delivery is acceptable (e.g., posting production data, sending quality reports). ## Call Timeout & Error Handling @@ -95,7 +97,7 @@ Scripts choose between two call modes per invocation, mirroring the dual-mode da - Each external system definition specifies a **timeout** that applies to all method calls on that system. - Error classification by HTTP response: - **Transient failures** (connection refused, timeout, HTTP 408, 429, 5xx): Behavior depends on call mode — `CachedCall` buffers for retry; `Call` returns error to script. - - **Permanent failures** (HTTP 4xx except 408/429): Always returned to the calling script regardless of call mode. Logged to Site Event Logging. + - **Permanent failures** (HTTP 4xx except 408/429): Always returned to the calling script regardless of call mode. Logged to Site Event Logging. For `CachedCall`, the failure is additionally recorded as a terminal `Failed` tracking record — so even a never-buffered cached call has an authoritative status record. - This classification ensures the S&F buffer is not polluted with requests that will never succeed. - **Idempotency note**: `CachedCall` retries may result in duplicate delivery if the external system received the original request but the response was lost. Callers should use `CachedCall` only for operations that are idempotent or where duplicate delivery is acceptable. @@ -114,7 +116,8 @@ Scripts choose between two call modes per invocation, mirroring the dual-mode da - **Configuration Database (MS SQL)**: Stores external system and database connection definitions (central only). 
- **Local SQLite**: At sites, external system and database connection definitions are read from local SQLite (populated by artifact deployment). Sites do not access the central config DB. -- **Store-and-Forward Engine**: Handles buffering for failed external system calls and cached database writes. +- **Store-and-Forward Engine**: Handles buffering for failed external system calls and cached database writes, and owns the site-local operation tracking table read by `Tracking.Status(id)`. +- **Site Call Audit**: Central audit mirror for cached calls — receives cached-call lifecycle telemetry so `CachedCall`/`CachedWrite` status is observable centrally. - **Communication Layer**: Routes inbound external system requests from central to sites. - **Security & Auth**: Design role manages definitions. - **Configuration Database (via IAuditService)**: Definition changes are audit logged. @@ -122,5 +125,6 @@ Scripts choose between two call modes per invocation, mirroring the dual-mode da ## Interactions - **Site Runtime (Script/Alarm Execution Actors)**: Scripts invoke external system methods and database operations through this component. -- **Store-and-Forward Engine**: Failed calls and cached writes are routed here for reliable delivery. +- **Store-and-Forward Engine**: Failed calls and cached writes are routed here for reliable delivery; it also assigns each cached call a `TrackedOperationId` tracking row. +- **Site Call Audit**: The central observability sibling for cached calls — cached-call status reported here is queried via the Central UI Site Calls page. - **Deployment Manager**: Receives updated definitions as part of system-wide artifact deployment (triggered explicitly by Deployment role). 
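The transient/permanent error classification above can be sketched as a small classifier. This is an illustrative Python sketch, not the project's C# implementation; the names `FailureKind` and `classify_http_status` are hypothetical. It assumes a missing HTTP status (`None`) stands for a connection refusal or timeout, and that non-4xx/5xx responses are not failures.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TRANSIENT = "transient"  # CachedCall buffers for retry; Call returns the error
    PERMANENT = "permanent"  # returned to the script; CachedCall records Failed


def classify_http_status(status: Optional[int]) -> Optional[FailureKind]:
    """Classify a call outcome; returns None when the response is not a failure.

    `status` is None when no HTTP response arrived (connection refused, timeout).
    """
    if status is None:
        return FailureKind.TRANSIENT
    if status in (408, 429) or 500 <= status <= 599:
        return FailureKind.TRANSIENT
    if 400 <= status <= 499:
        return FailureKind.PERMANENT
    return None  # assumption: 1xx/2xx/3xx are not failures in this sketch
```

Under this reading, a `CachedCall` that hits a 404 never enters the S&F buffer, while a 503 or a 429 does.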
diff --git a/docs/requirements/Component-HealthMonitoring.md b/docs/requirements/Component-HealthMonitoring.md index 86c8d38..548603d 100644 --- a/docs/requirements/Component-HealthMonitoring.md +++ b/docs/requirements/Component-HealthMonitoring.md @@ -64,11 +64,23 @@ The Notification Outbox is a **central** component, so its KPIs are **central-co These are distinct from the site-reported **Store-and-forward buffer depth** notification metric, which now covers the **site→central leg** — notifications still buffered in a site's Store-and-Forward Engine awaiting forward to central — and remains part of the site health report. +## Site Call Audit KPIs + +The Site Call Audit is a **central** component, so its KPIs — like the Notification Outbox's — are **central-computed** rather than collected from sites and carried in the site health report: + +- The dashboard surfaces Site Call Audit **headline** KPI tiles alongside the existing Notification Outbox tiles. +- The Site Call Audit component computes these on demand from the central `SiteCalls` table, **global and per-source-site**; the health dashboard polls it for the headline tiles. +- The KPI set is **buffered count** (`Pending` + `Retrying`), **parked count** (`Parked`), **failed (last interval)**, **delivered (last interval)**, **oldest pending age**, and **stuck count** (`Pending` / `Retrying` rows older than the configurable stuck-age threshold). +- **Stuck** is `Pending` / `Retrying` rows older than a configurable threshold (default **10 minutes**) — **display-only** (KPI count plus a row badge), with no escalation or alerting, consistent with the Notification Outbox stuck metric. +- Site Call Audit KPIs are **point-in-time**, computed on demand from the `SiteCalls` table. There is no time-series store — consistent with Health Monitoring's "current status only" philosophy. 
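The KPI set above can be sketched as a point-in-time computation over a snapshot of `SiteCalls` rows. This is a rough Python illustration, not the component's actual query. The names `SiteCallRow` and `compute_kpis` are hypothetical. It also assumes that age is measured from `CreatedAtUtc`, and that the "last interval" counts are keyed on `TerminalAtUtc`. The source does not specify either choice.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional


@dataclass
class SiteCallRow:
    status: str  # Pending, Retrying, Delivered, Parked, Failed, Discarded
    created_at_utc: datetime
    terminal_at_utc: Optional[datetime] = None


def compute_kpis(
    rows: Iterable[SiteCallRow],
    now: datetime,
    interval: timedelta,
    stuck_age: timedelta = timedelta(minutes=10),  # default from the text
) -> Dict[str, object]:
    rows = list(rows)
    buffered = [r for r in rows if r.status in ("Pending", "Retrying")]
    since = now - interval

    def terminal_since(status: str) -> int:
        # Count rows that reached `status` within the last interval (assumed).
        return sum(
            1
            for r in rows
            if r.status == status
            and r.terminal_at_utc is not None
            and r.terminal_at_utc >= since
        )

    oldest = min((r.created_at_utc for r in buffered), default=None)
    return {
        "buffered": len(buffered),
        "parked": sum(1 for r in rows if r.status == "Parked"),
        "failed_last_interval": terminal_since("Failed"),
        "delivered_last_interval": terminal_since("Delivered"),
        "oldest_pending_age": (now - oldest) if oldest is not None else None,
        "stuck": sum(1 for r in buffered if now - r.created_at_utc > stuck_age),
    }
```

Per-source-site figures would follow by filtering the rows on the source site before calling the same function. Nothing is stored between calls, which matches the current-status-only approach.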
+ +Unlike the Notification Outbox, the Site Call Audit is **not a dispatcher** — cached calls are delivered by each site's Store-and-Forward Engine, and the `SiteCalls` table is an eventually-consistent central mirror of site-owned status. + ## Central Storage - Health metrics are held **in memory** at the central cluster for display in the UI. - No historical health data is persisted — the dashboard shows current/latest status only. -- Notification Outbox KPIs are not stored by Health Monitoring; they are computed point-in-time from the central `Notifications` table each time the dashboard refreshes — consistent with the current-status-only philosophy. +- Notification Outbox and Site Call Audit KPIs are not stored by Health Monitoring; they are computed point-in-time from the central `Notifications` and `SiteCalls` tables respectively each time the dashboard refreshes — consistent with the current-status-only philosophy. - Site connectivity history (online/offline transitions) may optionally be logged via the Audit Log or a separate mechanism if needed in the future. ## No Alerting @@ -84,6 +96,7 @@ These are distinct from the site-reported **Store-and-forward buffer depth** not - **Store-and-Forward Engine (site)**: Provides buffer depth metrics, including the notification backlog awaiting forward to central. - **Cluster Infrastructure (site)**: Provides node role status. - **Notification Outbox (central)**: Provides central-computed outbox KPIs — queue depth, stuck count, parked count — for the headline dashboard tiles. +- **Site Call Audit (central)**: Provides central-computed cached-call KPIs — buffered count, parked count, failed/delivered (last interval), oldest pending age, stuck count — for the headline dashboard tiles. 
## Interactions diff --git a/docs/requirements/Component-Host.md b/docs/requirements/Component-Host.md index db96928..1196050 100644 --- a/docs/requirements/Component-Host.md +++ b/docs/requirements/Component-Host.md @@ -32,7 +32,7 @@ The same compiled binary must be deployable to both central and site nodes. The At startup the Host must inspect the configured node role and register only the component services appropriate for that role: - **Shared** (both Central and Site): ClusterInfrastructure, Communication, HealthMonitoring, ExternalSystemGateway. -- **Central only**: TemplateEngine, DeploymentManager, Security, AuditLogging, CentralUI, InboundAPI, ManagementService, NotificationService, NotificationOutbox. +- **Central only**: TemplateEngine, DeploymentManager, Security, AuditLogging, CentralUI, InboundAPI, ManagementService, NotificationService, NotificationOutbox, SiteCallAudit. - **Site only**: SiteRuntime, DataConnectionLayer, StoreAndForward, SiteEventLogging. Components not applicable to the current role must not be registered in the DI container or the Akka.NET actor system. 
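The role-based registration rule above amounts to a simple selection over three groups. The following Python sketch only illustrates that selection. The real Host does this through C# extension methods. The function name `components_for_role` and the role strings `"Central"` and `"Site"` are assumptions made for illustration.

```python
from typing import List

SHARED = (
    "ClusterInfrastructure", "Communication",
    "HealthMonitoring", "ExternalSystemGateway",
)
CENTRAL_ONLY = (
    "TemplateEngine", "DeploymentManager", "Security", "AuditLogging",
    "CentralUI", "InboundAPI", "ManagementService", "NotificationService",
    "NotificationOutbox", "SiteCallAudit",
)
SITE_ONLY = (
    "SiteRuntime", "DataConnectionLayer", "StoreAndForward", "SiteEventLogging",
)


def components_for_role(role: str) -> List[str]:
    """Return the components to register for a node role (hypothetical helper)."""
    if role == "Central":
        return [*SHARED, *CENTRAL_ONLY]
    if role == "Site":
        return [*SHARED, *SITE_ONLY]
    # An unknown role is rejected rather than silently registering everything.
    raise ValueError(f"unknown node role: {role!r}")
```

With this grouping, `SiteCallAudit` is registered on central nodes only, and the three groups together cover the 18 component libraries the Host references.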
@@ -62,6 +62,7 @@ The Host must bind configuration sections from `appsettings.json` to strongly-ty | `ScadaLink:InboundApi` | `InboundApiOptions` | Inbound API | DefaultMethodTimeout | | `ScadaLink:Notification` | `NotificationOptions` | Notification Service | (SMTP config is stored in the central config DB, not in appsettings) | | `ScadaLink:NotificationOutbox` | `NotificationOutboxOptions` | Notification Outbox | Dispatcher poll interval, stuck-age threshold, retention window (delivery retry settings reuse the central SMTP configuration) | +| `ScadaLink:SiteCallAudit` | `SiteCallAuditOptions` | Site Call Audit | Reconciliation pull interval, stuck-age threshold, retention window | | `ScadaLink:ManagementService` | `ManagementServiceOptions` | Management Service | (Reserved for future configuration) | | `ScadaLink:Logging` | `LoggingOptions` | Host | Serilog sink configuration, log level overrides | @@ -179,6 +180,7 @@ The Host's `Program.cs` calls these extension methods; the component libraries o | ExternalSystemGateway | Yes | Yes | Yes | Yes | No | | NotificationService | Yes | No | Yes | Yes | No | | NotificationOutbox | Yes | No | Yes | Yes | No | +| SiteCallAudit | Yes | No | Yes | Yes | No | | TemplateEngine | Yes | No | Yes | Yes | No | | DeploymentManager | Yes | No | Yes | Yes | No | | Security | Yes | No | Yes | Yes | No | @@ -195,7 +197,7 @@ The Host's `Program.cs` calls these extension methods; the component libraries o ## Dependencies -- **All 17 component libraries**: The Host references every component project to call their extension methods (excludes CLI, which is a separate executable). +- **All 18 component libraries**: The Host references every component project to call their extension methods (excludes CLI, which is a separate executable). - **Akka.Hosting**: For `AddAkka()` and the hosting configuration builder. - **Akka.Remote.Hosting, Akka.Cluster.Hosting**: For Akka subsystem configuration. 
(No Akka.Persistence plugin — see the Persistence note under REQ-HOST-6.) - **Serilog.AspNetCore**: For structured logging integration. diff --git a/docs/requirements/Component-NotificationOutbox.md b/docs/requirements/Component-NotificationOutbox.md index bd26149..612ae47 100644 --- a/docs/requirements/Component-NotificationOutbox.md +++ b/docs/requirements/Component-NotificationOutbox.md @@ -78,6 +78,8 @@ All timestamps are UTC. - `Parked` — terminal-not-delivered: a permanent failure, or retries exhausted. `LastError` distinguishes which. - `Discarded` — terminal, reached **only by operator action** on a parked notification. The row is kept (not deleted) so the table remains a complete audit record. +The Notification Outbox and the central [`Site Call Audit`](Component-SiteCallAudit.md) component share the `TrackedOperationId` tracking model and this status lifecycle, but differ in delivery locality: the Notification Outbox **delivers** notifications itself (central SMTP), whereas Site Call Audit only **audits** cached calls delivered site-locally by the site Store-and-Forward Engine — it is not a dispatcher. + ### Retry Policy Delivery retry reuses the central SMTP configuration's max-retry-count and fixed retry interval. The interval is fixed (no exponential backoff), consistent with the existing fixed-interval store-and-forward convention. diff --git a/docs/requirements/Component-NotificationService.md b/docs/requirements/Component-NotificationService.md index c2de73f..f419826 100644 --- a/docs/requirements/Component-NotificationService.md +++ b/docs/requirements/Component-NotificationService.md @@ -61,6 +61,7 @@ NotificationStatus status = Notify.Status(id); - `Notify.To("listName").Send(...)` is **asynchronous**: it generates a `NotificationId` (GUID) locally, hands the notification to the site Store-and-Forward Engine for forwarding to central, and returns the `NotificationId` to the script **immediately**. The script does not block waiting for delivery. 
- The message body is **plain text** only. No HTML content. - `Notify.Status(notificationId)` returns a small **status record** — the current status, retry count, last error, and key timestamps (enqueued, delivered). While the notification is still in the site Store-and-Forward buffer, the site answers the query **locally** with status `Forwarding`; once forwarded to central, the query round-trips to central and reads the `Notifications` table. +- The returned `NotificationId` is a `TrackedOperationId` — the shared Commons tracking-handle type used by all store-and-forward producers; `NotificationId` is simply the notification-domain name for it. Likewise, `Notify.Status` is a thin alias of the unified `Tracking.Status` accessor, retained for backward compatibility. This is a naming/type clarification only — notification delivery behavior is unchanged. ## Notification Delivery Behavior diff --git a/docs/requirements/Component-SiteCallAudit.md b/docs/requirements/Component-SiteCallAudit.md new file mode 100644 index 0000000..491c800 --- /dev/null +++ b/docs/requirements/Component-SiteCallAudit.md @@ -0,0 +1,130 @@ +# Component: Site Call Audit + +## Purpose + +Provides central, queryable audit and operational visibility for cached calls +made by site scripts — `ExternalSystem.CachedCall()` and `Database.CachedWrite()`. +Each such call carries a `TrackedOperationId`; sites report lifecycle telemetry +to this component, which maintains a central audit record, computes KPIs, and +relays Retry/Discard actions back to the owning site. + +This is the second centrally-hosted observability component for site +store-and-forward activity (the Notification Outbox is the first). Unlike the +Notification Outbox, Site Call Audit is **not a dispatcher** — it never delivers +anything. Cached calls are delivered by the site's Store-and-Forward Engine +against site-local external systems and databases, which central cannot reach. + +## Location + +Central cluster only. 
A singleton actor (`SiteCallAuditActor`) on the active +central node. Registered as component #22 in the Host role configuration. + +## Responsibilities + +- Ingest cached-call lifecycle telemetry from sites into the central `SiteCalls` + table. +- Run periodic per-site reconciliation pulls so missed telemetry self-heals. +- Compute point-in-time KPIs (global and per-site) from the `SiteCalls` table. +- Relay operator Retry/Discard actions for parked cached calls to the owning + site over the command/control channel. +- Purge terminal audit rows after a configurable retention window. + +## The `SiteCalls` Table + +Lives in the central MS SQL configuration database — a sibling of the +`Notifications` table. One row per `TrackedOperationId`: + +- **TrackedOperationId** — GUID, primary key. Generated site-side at call time. +- **SourceSite** — site that issued the call. +- **Kind** — `TrackedOperationKind` enum: `ExternalCall` or `DatabaseWrite`. +- **TargetSummary** — external system + method name for an `ExternalCall`; for a + `DatabaseWrite`, just the database connection name — intentionally not the SQL + statement or table, a deliberate scoping choice. +- **Status** — `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`. +- **RetryCount** — attempts so far. +- **LastError** — most recent error detail, if any. +- **Provenance** — source instance / script. +- **CreatedAtUtc**, **UpdatedAtUtc**, **TerminalAtUtc** — key timestamps. + +## Status Lifecycle + +`Pending → Retrying → Delivered / Parked / Failed / Discarded` + +- **Pending** — non-terminal: buffered after a transient failure, awaiting its + first retry. +- **Retrying** — non-terminal: undergoing retry attempts. +- **Delivered** — terminal, success. A cached call that succeeds on its first + immediate attempt is recorded directly as `Delivered`. +- **Parked** — non-terminal: transient retries exhausted; awaiting manual action. +- **Failed** — terminal: permanent failure (e.g. HTTP 4xx). 
The error was also + returned synchronously to the calling script; the record captures it. `Failed` + rows are **not operator-actionable** — see Retry / Discard Relay. +- **Discarded** — terminal, reached **only by operator action** on a `Parked` + row. The row is kept (not deleted) so the table remains a complete audit + record. + +The site is the source of truth. The `SiteCalls` row is an eventually-consistent +mirror — never queried by scripts (`Tracking.Status()` is answered site-locally). + +## Ingest & Idempotency + +Telemetry ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`, +then **upsert-on-newer-status**. The lifecycle is monotonic, so status only +advances and never regresses; at-least-once and out-of-order telemetry are +therefore harmless. + +## Reconciliation + +Because telemetry is best-effort, `SiteCallAuditActor` periodically — and on site +reconnect — pulls "all tracking rows changed since cursor X" from each site. +Gaps left by lost telemetry self-heal. Central converges to the site; the site +never depends on central. + +## Retry / Discard Relay + +Parked cached calls live in the owning site's S&F buffer. Operator Retry/Discard +from the Central UI is relayed to that site as a `RetryParkedOperation` / +`DiscardParkedOperation` command over the command/control channel. The site +applies the change and emits telemetry reflecting the new state; central never +mutates the `SiteCalls` row directly. If the site is offline the command fails +fast and the UI surfaces a "site unreachable" message. + +Only `Parked` rows are operator-actionable. `Failed` rows offer no Retry or +Discard: a permanent failure (e.g. HTTP 4xx) would simply fail again, and the +error was already returned synchronously to the calling script — there is +nothing for an operator to recover. 
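The ingest rule described under Ingest & Idempotency can be sketched as follows. This is a minimal Python illustration in which an in-memory dictionary stands in for the `SiteCalls` table. The record shape `TelemetryRecord` and the function `ingest` are hypothetical. The sketch also assumes that "newer" is judged by the site-assigned `UpdatedAtUtc` timestamp. The source does not state how newness is compared, so treat that as one possible reading.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass(frozen=True)
class TelemetryRecord:
    tracked_operation_id: str
    source_site: str
    status: str
    retry_count: int
    updated_at_utc: datetime  # assigned by the site, the source of truth
    last_error: Optional[str] = None


def ingest(table: Dict[str, TelemetryRecord], msg: TelemetryRecord) -> bool:
    """Apply one telemetry message; return True if the stored row changed."""
    existing = table.get(msg.tracked_operation_id)
    if existing is None:
        # Insert-if-not-exists: first sighting of this TrackedOperationId.
        table[msg.tracked_operation_id] = msg
        return True
    if msg.updated_at_utc > existing.updated_at_utc:
        # Upsert-on-newer-status: only a strictly newer site state overwrites.
        table[msg.tracked_operation_id] = msg
        return True
    # Duplicate or out-of-order (stale) telemetry is ignored.
    return False
```

Because duplicates and stale messages are no-ops, the same function could serve both the live telemetry path and the reconciliation pull, so replaying a site's rows after a reconnect does not regress the mirror.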
+ +## KPIs + +Point-in-time, computed from the `SiteCalls` table, global and per-source-site, +mirroring the Notification Outbox KPI shape: + +- Buffered count (`Pending` + `Retrying`) +- Parked count +- Failed-last-interval +- Delivered-last-interval +- Oldest-pending age +- Stuck count — `Pending`/`Retrying` older than a configurable threshold + (default 10 minutes); display-only, no escalation. + +## Retention + +Daily purge of terminal rows (`Delivered`, `Failed`, `Discarded`) after a +configurable window (default 365 days), matching the `Notifications` purge. + +## Dependencies + +- **Configuration Database**: hosts the `SiteCalls` table and its repository. +- **Central–Site Communication**: receives cached-call telemetry and reconciliation + responses; sends Retry/Discard commands. +- **Store-and-Forward Engine**: the site-side origin of cached-call telemetry and + the executor of relayed Retry/Discard commands. +- **Commons**: `TrackedOperationId`, status enum, telemetry message contracts. + +## Interactions + +- **Central UI**: the Site Calls page queries this component and issues + Retry/Discard actions. +- **Health Monitoring**: surfaces Site Call Audit KPI tiles on the dashboard. +- **Cluster Infrastructure**: hosts the `SiteCallAuditActor` singleton with + active/standby failover. diff --git a/docs/requirements/Component-SiteRuntime.md b/docs/requirements/Component-SiteRuntime.md index aa16147..38a99a4 100644 --- a/docs/requirements/Component-SiteRuntime.md +++ b/docs/requirements/Component-SiteRuntime.md @@ -254,15 +254,19 @@ Available to all Script Execution Actors and Alarm Execution Actors: ### External Systems - `ExternalSystem.Call("systemName", "methodName", params)` — Synchronous HTTP call. Blocks until response or timeout. All failures return to script. Use when the script needs the result. -- `ExternalSystem.CachedCall("systemName", "methodName", params)` — Fire-and-forget with store-and-forward on transient failure. 
Use for outbound data pushes where deferred delivery is acceptable. +- `ExternalSystem.CachedCall("systemName", "methodName", params)` — Deferred delivery. Returns a `TrackedOperationId` tracking handle immediately rather than the response; the call is attempted immediately and, on transient failure, store-and-forwarded for retry. Use for outbound data pushes where deferred delivery is acceptable. +- The returned `TrackedOperationId` can be passed to `Tracking.Status(id)` (see **Operation Tracking** below) to observe delivery progress. ### Notifications -- `Notify.To("listName").Send("subject", "message")` — Send a notification via a named notification list. Generates a `NotificationId` (GUID) locally and returns it immediately; the notification is store-and-forwarded to the central cluster, which owns delivery. The script never contacts SMTP. -- `Notify.Status("notificationId")` — Returns a status record (status, retry count, last error, key timestamps). While the notification is still in the site store-and-forward buffer the site answers locally (status `Forwarding`); once forwarded the query round-trips to central. +- `Notify.To("listName").Send("subject", "message")` — Send a notification via a named notification list. Generates a `TrackedOperationId` (GUID) locally and returns it immediately; the notification is store-and-forwarded to the central cluster, which owns delivery. The script never contacts SMTP. (`NotificationId` is the notification-domain name for this same `TrackedOperationId` type.) +- `Notify.Status("trackedOperationId")` — A thin alias of `Tracking.Status(id)` retained for the notification domain. Returns a status record (status, retry count, last error, key timestamps). While the notification is still in the site store-and-forward buffer the site answers locally (status `Forwarding`); once forwarded the query round-trips to central. 
### Database Access - `Database.Connection("connectionName")` — Obtain a raw MS SQL client connection (ADO.NET) for synchronous read/write. -- `Database.CachedWrite("connectionName", "sql", parameters)` — Submit a write operation for store-and-forward delivery. +- `Database.CachedWrite("connectionName", "sql", parameters)` — Submit a write operation for store-and-forward delivery. Returns a `TrackedOperationId` tracking handle immediately; pass it to `Tracking.Status(id)` to observe delivery progress. + +### Operation Tracking +- `Tracking.Status("trackedOperationId")` — Returns a status record (status, retry count, last error, key timestamps) for any tracked operation: a cached external system call, a cached database write, or a notification. For cached calls and writes the answer is always site-local and authoritative — the site owns the operation tracking table. (`Notify.Status(...)` is a thin alias scoped to the notification domain.) ### Parameter Access - `Parameters["key"]` — Raw dictionary access (returns `object?`, requires manual casting). @@ -283,7 +287,7 @@ Available to all Script Execution Actors and Alarm Execution Actors: Scripts execute **in-process** with constrained access. The following restrictions are enforced at compilation and runtime: -- **Allowed**: Access to the Script Runtime API (GetAttribute, SetAttribute, CallScript, CallShared, ExternalSystem, Notify, Database), standard C# language features, basic .NET types (collections, string manipulation, math, date/time). +- **Allowed**: Access to the Script Runtime API (GetAttribute, SetAttribute, CallScript, CallShared, ExternalSystem, Notify, Database, Tracking), standard C# language features, basic .NET types (collections, string manipulation, math, date/time). 
- **Forbidden**: File system access (`System.IO`), process spawning (`System.Diagnostics.Process`), threading (`System.Threading` — except async/await), reflection (`System.Reflection`), raw network access (`System.Net.Sockets`, `System.Net.Http` — must use `ExternalSystem.Call`), assembly loading, unsafe code. - **Execution timeout**: Configurable per-script maximum execution time. Exceeding the timeout cancels the script and logs an error. - **Memory**: Scripts share the host process memory. No per-script memory limit, but the execution timeout prevents runaway allocations. @@ -354,7 +358,7 @@ Per Akka.NET best practices, internal actor communication uses **Tell** (fire-an ## Dependencies - **Data Connection Layer**: Provides tag value updates to Instance Actors. Receives write requests from Instance Actors. -- **Store-and-Forward Engine**: Handles reliable delivery for external system calls, cached database writes, and notifications submitted by scripts. For the notification category specifically, it forwards to the central cluster for delivery (not directly to SMTP). +- **Store-and-Forward Engine**: Handles reliable delivery for external system calls, cached database writes, and notifications submitted by scripts. For the notification category specifically, it forwards to the central cluster for delivery (not directly to SMTP). Owns the site-local operation tracking table that backs `Tracking.Status(id)`. - **External System Gateway**: Provides external system method invocations for scripts. - **Communication Layer**: Receives deployments and lifecycle commands from central. Handles debug view requests. Reports deployment results. - **Site Event Logging**: Records script executions, alarm events, deployment events, instance lifecycle events. 
diff --git a/docs/requirements/Component-StoreAndForward.md b/docs/requirements/Component-StoreAndForward.md index 24297d5..dcc048a 100644 --- a/docs/requirements/Component-StoreAndForward.md +++ b/docs/requirements/Component-StoreAndForward.md @@ -18,9 +18,11 @@ Site clusters only. The central cluster does not buffer messages. - Retry delivery per message according to the configured retry policy. - Park messages that exhaust their retry limit (dead-letter). - Persist buffered messages to local SQLite for durability. +- Maintain a site-local **operation tracking table** holding one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`) — the authoritative status record consulted by `Tracking.Status(id)`. +- Emit cached-call lifecycle telemetry to the central Site Call Audit component on every status transition. - Replicate buffered messages to the standby node via application-level replication over Akka.NET remoting. - On failover, the standby node takes over delivery from its replicated copy. -- Respond to remote queries from central for parked message management (list, retry, discard). +- Respond to remote queries from central for parked message management (list, retry, discard), including central-driven Retry/Discard of parked cached calls. ## Message Lifecycle @@ -44,6 +46,10 @@ Attempt immediate delivery For notifications, "delivery" means forwarding the message to the central cluster via Central–Site Communication; "success" is central's ack, on which the message is cleared. Notifications do not park — they are retried at the fixed forward interval until central acks. Parking applies only to the external-system-call and cached-database-write categories. +For the cached-call categories (`ExternalCall` and `DatabaseWrite`), the operation tracking table is the status record and the S&F buffer is purely the retry mechanism. 
A cached call that succeeds on its first immediate attempt is written directly as a terminal `Delivered` tracking row and never enters the S&F buffer. When immediate delivery fails transiently, the message is buffered and its tracking row moves to `Pending`/`Retrying`; the buffered message carries its `TrackedOperationId` so the tracking row and the retry record stay linked. When immediate delivery fails **permanently** (e.g. HTTP 4xx), the message is not buffered — the error is returned synchronously to the calling script as before — but the tracking row is written directly as a terminal `Failed` row capturing the error. On every tracking-table status transition the site emits `CachedCallTelemetry` to central. + +Every cached-call outcome maps to a tracking-table state: immediate success → `Delivered`; transient failure → `Pending`/`Retrying`, eventually `Delivered` or `Parked`; permanent failure → terminal `Failed`; operator discard of a parked row → terminal `Discarded`. + ## Retry Policy For the external-system-call and cached-database-write categories, retry settings are defined on the **source entity** (not per-message): @@ -54,7 +60,7 @@ The **notification** category retries differently: it has no source-entity setti The retry interval is **fixed** (not exponential backoff). Fixed interval is sufficient for the expected use cases. -**Note**: Only **transient failures** are eligible for store-and-forward buffering. For external system calls, transient failures are connection errors, timeouts, and HTTP 5xx responses. Permanent failures (HTTP 4xx) are returned directly to the calling script and are **not** queued for retry. This prevents the buffer from accumulating requests that will never succeed. +**Note**: Only **transient failures** are eligible for store-and-forward buffering. For external system calls, transient failures are connection errors, timeouts, and HTTP 5xx responses. 
Permanent failures (HTTP 4xx) are returned directly to the calling script and are **not** queued for retry. This prevents the buffer from accumulating requests that will never succeed. For the cached-call categories, a permanent failure additionally sets the operation's tracking-table row to terminal `Failed`, capturing the error — so even a never-buffered cached call has an authoritative status record. `Failed` rows are not operator-actionable: a permanent failure would only fail again, and the error was already returned to the script. ## Buffer Size @@ -68,6 +74,22 @@ There is **no maximum buffer size**. Messages accumulate in the buffer until del - On failover, the new active node has a near-complete copy of the buffer. In rare cases, the most recent operations may not have been replicated (e.g., a message added or removed just before failover). This can result in a few **duplicate deliveries** (message delivered but remove not replicated) or a few **missed retries** (message added but not replicated). Both are acceptable trade-offs for the latency benefit. - On failover, the new active node resumes delivery from its local copy. +### Operation Tracking Table + +Alongside the S&F buffer DB, each site node holds a **site-local operation tracking table** in SQLite. It carries one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`), created the moment the script issues the cached call and kept regardless of outcome. + +- This table is the **status record**; the S&F buffer remains purely the **retry mechanism**. A buffered cached-call message references its `TrackedOperationId` back to its tracking row. +- Each row records the operation kind (`TrackedOperationKind`), a target summary (external system + method, or database connection name), the unified `TrackedOperationStatus`, retry count, last error, source provenance (instance / script), and the created/updated/terminal UTC timestamps. +- `Tracking.Status(id)` reads this table. 
For cached calls the **site is the authoritative source of truth** for status — the query is always answered site-locally, even when central is unreachable. The central Site Call Audit `SiteCalls` table is an eventually-consistent mirror. +- A cached call that succeeds on its first immediate attempt writes a terminal `Delivered` row directly here, with nothing placed in the S&F buffer. +- Terminal rows are purged after a configurable retention window (default 7 days) — the site holds live operational state; central holds long-term audit. + +Notifications are unaffected: they have no tracking table. Their `NotificationId` and status are owned by the central `Notifications` table, and their lifecycle continues to forward to central exactly as before. + +### Telemetry to Central + +On every tracking-table status transition, the site emits a `CachedCallTelemetry` message to the central Site Call Audit component over the site→central channel. Emission is best-effort, at-least-once, and idempotent on `TrackedOperationId`. Because telemetry is best-effort, the site also responds to `CachedCallReconcileRequest` reconciliation pulls — cursor-based per-site reads of tracking rows changed since a cursor — so any missed telemetry self-heals. The site never depends on central; central converges to the site. + ## Parked Message Management - Parked messages remain stored at the site in SQLite. @@ -75,26 +97,29 @@ There is **no maximum buffer size**. Messages accumulate in the buffer until del - Operators can: - **Retry** a parked message (moves it back to the retry queue). - **Discard** a parked message (removes it permanently). -- Store-and-forward messages are **not** automatically cleared when an instance is deleted. Pending and parked messages continue to exist and can be managed via the central UI. 
+- For parked cached calls, Retry/Discard can be driven centrally: the Site Call Audit component relays `RetryParkedOperation` / `DiscardParkedOperation` commands (keyed by `TrackedOperationId`) down to the owning site. The site applies the command to its S&F buffer and tracking table, then emits `CachedCallTelemetry` reflecting the new state (`Retrying` or `Discarded`) — central never mutates its mirror row directly. +- Store-and-forward messages are **not** automatically cleared when an instance is deleted. Pending and parked messages, and their tracking rows, continue to exist and can be managed via the central UI. ## Message Format Each buffered message stores: - **Message ID**: Unique identifier. - **Category**: External system call, notification, or cached database write. +- **Tracked Operation ID**: For the cached-call categories, the `TrackedOperationId` linking the buffered message to its row in the operation tracking table. Not used by the notification category, which is tracked centrally via its `NotificationId`. - **Target**: External system name, the central cluster (for notifications), or database connection name. - **Payload**: Serialized message content (API method + parameters; notification list name + subject + body plus the locally generated `NotificationId` and source provenance; SQL + parameters). - **Retry Count**: Number of attempts so far. - **Created At**: Timestamp when the message was first queued. - **Last Attempt At**: Timestamp of the most recent delivery attempt. -- **Status**: Pending, retrying, or parked. +- **Status**: Pending, retrying, or parked. This is the **buffer message's** retry state, distinct from the operation's `TrackedOperationStatus` lifecycle in the operation tracking table. A buffer message exists only while a cached call is mid-retry, so it never carries the terminal `Delivered`, `Failed`, or `Discarded` states — those live solely on the tracking row. ## Dependencies - **SQLite**: Local persistence on each node. 
 - **Communication Layer**: Application-level replication to standby node; remote query handling from central; carries buffered notifications to the central cluster (ClusterClient) and receives central's acks.
 - **External System Gateway**: Delivers external system API calls.
-- **Central–Site Communication**: The delivery target for the notification category — a buffered notification is forwarded to the central cluster over Central–Site Communication and cleared on central's ack.
+- **Central–Site Communication**: The delivery target for the notification category — a buffered notification is forwarded to the central cluster over Central–Site Communication and cleared on central's ack. Also carries `CachedCallTelemetry` and reconciliation responses to central, and receives `RetryParkedOperation` / `DiscardParkedOperation` commands.
+- **Site Call Audit**: The central audit mirror for cached calls — receives this engine's cached-call telemetry and reconciliation responses, and relays operator Retry/Discard of parked cached calls back as commands.
 - **Database Connections**: Delivers cached database writes.
 - **Site Event Logging**: Logs store-and-forward activity (queued, delivered, retried, parked).
@@ -103,4 +128,5 @@ Each buffered message stores:
 - **Site Runtime (Script Actors)**: Scripts submit messages to the buffer (external calls, notifications, cached DB writes).
 - **Communication Layer**: Handles parked message queries/commands from central; carries buffered notifications to the central cluster.
 - **Notification Outbox**: The central destination for the notification category — central ingests each forwarded notification into the `Notifications` table and acks the site, on which the engine clears the buffered message.
+- **Site Call Audit**: The central observability sibling for the cached-call categories — this engine emits `CachedCallTelemetry` on every tracking-table transition, answers `CachedCallReconcileRequest` pulls, and executes the `RetryParkedOperation` / `DiscardParkedOperation` commands it relays.
 - **Health Monitoring**: Reports buffer depth metrics, including the notification backlog covering the site→central forward leg.
diff --git a/docs/requirements/HighLevelReqs.md b/docs/requirements/HighLevelReqs.md
index 4299a4c..a97e107 100644
--- a/docs/requirements/HighLevelReqs.md
+++ b/docs/requirements/HighLevelReqs.md
@@ -231,7 +231,7 @@ Scripts executing on a site for a given instance can:
 - **Write** attribute values on that instance. For attributes with a data source reference, the write goes to the Data Connection Layer which writes to the physical device; the in-memory value updates when the device confirms the new value via the existing subscription. For static attributes, the write updates the in-memory value and **persists the override to local SQLite** — the value survives restart and failover. Persisted overrides are reset when the instance is redeployed.
 - **Call other scripts** on that instance via `Instance.CallScript("scriptName", params)`. Calls use the Akka ask pattern and return the called script's return value. Script-to-script calls support concurrent execution.
 - **Call shared scripts** via `Scripts.CallShared("scriptName", params)`. Shared scripts execute **inline** in the calling Script Actor's context — they are compiled code libraries, not separate actors.
-- **Call external system API methods** in two modes: `ExternalSystem.Call()` for synchronous request/response, or `ExternalSystem.CachedCall()` for fire-and-forget with store-and-forward on transient failure (see Section 5).
+- **Call external system API methods** in two modes: `ExternalSystem.Call()` for synchronous request/response, or `ExternalSystem.CachedCall()` for deferred delivery — it returns a `TrackedOperationId` tracking handle immediately and store-and-forwards the call on transient failure (see Section 5).
 - **Send notifications** (see Section 6).
 - **Access databases** by requesting an MS SQL client connection by name (see Section 5.5).