ScadaBridge/docs/requirements/Component-SiteCallAudit.md
Joseph Doherty fd618cf1dc fix(review): full code-review remediation — 5 High + Medium/Low across 16 modules
2026-06-20 17:55:12 -04:00

Component: Site Call Audit

Purpose

Provides central, queryable audit and operational visibility for cached calls made by site scripts — ExternalSystem.CachedCall() and Database.CachedWrite(). Each such call carries a TrackedOperationId; sites report lifecycle telemetry to this component, which maintains a central audit record, computes KPIs, and relays Retry/Discard actions back to the owning site.

This is the second centrally-hosted observability component for site store-and-forward activity (the Notification Outbox is the first). Unlike the Notification Outbox, Site Call Audit is not a dispatcher — it never delivers anything. Cached calls are delivered by the site's Store-and-Forward Engine against site-local external systems and databases, which central cannot reach.

Location

Central cluster only. A singleton actor (SiteCallAuditActor) on the active central node. Registered as component #22 in the Host role configuration.

Responsibilities

  • Ingest cached-call lifecycle telemetry from sites into the central SiteCalls table.
  • Run periodic per-site reconciliation pulls so missed telemetry self-heals.
  • Compute point-in-time KPIs (global and per-site) from the SiteCalls table.
  • Relay operator Retry/Discard actions for parked cached calls to the owning site over the command/control channel.
  • Purge terminal audit rows after a configurable retention window.

The SiteCalls Table

Lives in the central MS SQL configuration database — a sibling of the Notifications table. One row per TrackedOperationId:

  • TrackedOperationId — GUID, primary key. Generated site-side at call time.
  • SourceSite — site that issued the call.
  • SourceNode — the cluster node on which the call was issued (node-a / node-b, qualified by SourceSite). Nullable. Stamped site-side at submit time and carried verbatim through the combined CachedCallTelemetry packet, reconciliation pulls, and the central upsert.
  • Kind — TrackedOperationKind enum: ExternalCall or DatabaseWrite.
  • TargetSummary — external system and method name for an ExternalCall; for a DatabaseWrite, only the database connection name. The SQL statement and table are deliberately excluded as a scoping choice.
  • Status — Pending, Retrying, Delivered, Parked, Failed, Discarded.
  • RetryCount — attempts so far.
  • LastError — most recent error detail, if any.
  • Provenance — source instance / script.
  • CreatedAtUtc, UpdatedAtUtc, TerminalAtUtc — key timestamps.
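
As a rough illustration, the row described above could be modeled as follows. This is a Python sketch rather than the project's schema: only the column names and enum values come from the list; the class names SiteCallStatus and SiteCallRow, the field types, and the sample values are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class TrackedOperationKind(Enum):
    EXTERNAL_CALL = "ExternalCall"
    DATABASE_WRITE = "DatabaseWrite"


class SiteCallStatus(Enum):  # hypothetical enum name
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


@dataclass
class SiteCallRow:  # hypothetical class name; field types are assumed
    tracked_operation_id: UUID           # primary key, generated site-side
    source_site: str
    kind: TrackedOperationKind
    target_summary: str
    status: SiteCallStatus
    created_at_utc: datetime
    updated_at_utc: datetime
    source_node: Optional[str] = None    # nullable per the column list
    retry_count: int = 0
    last_error: Optional[str] = None
    provenance: Optional[str] = None
    terminal_at_utc: Optional[datetime] = None


# Sample values below are invented for illustration.
now = datetime(2026, 1, 1, tzinfo=timezone.utc)
row = SiteCallRow(
    tracked_operation_id=uuid4(),
    source_site="site-1",
    kind=TrackedOperationKind.DATABASE_WRITE,
    target_summary="HistorianDb",
    status=SiteCallStatus.PENDING,
    created_at_utc=now,
    updated_at_utc=now,
)
```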

Status Lifecycle

Pending → Retrying → Delivered / Parked / Failed / Discarded

  • Pending — non-terminal: buffered after a transient failure, awaiting its first retry.
  • Retrying — non-terminal: undergoing retry attempts.
  • Delivered — terminal, success. A cached call that succeeds on its first immediate attempt is recorded directly as Delivered.
  • Parked — non-terminal: transient retries exhausted; awaiting manual action.
  • Failed — terminal: permanent failure (e.g. HTTP 4xx). The error was also returned synchronously to the calling script; the record captures it. Failed rows are not operator-actionable — see Retry / Discard Relay.
  • Discarded — terminal, reached only by operator action on a Parked row. The row is kept (not deleted) so the table remains a complete audit record.

The site is the source of truth. The SiteCalls row is an eventually-consistent mirror — never queried by scripts (Tracking.Status() is answered site-locally).

Ingest & Idempotency

Telemetry ingestion is insert-if-not-exists keyed on TrackedOperationId, then upsert-on-newer-status. The lifecycle is monotonic, so status only advances and never regresses; at-least-once and out-of-order telemetry are therefore harmless.
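
A minimal in-memory sketch of that ingest rule, in Python. The document does not define how "newer" is decided, so the ordering here (an assumed status rank, then RetryCount as a tiebreaker, with terminal rows frozen) is an assumption, as are the Telemetry type and the ingest function name.

```python
from dataclasses import dataclass
from typing import Dict

# Assumed ordering: the source only states that status advances.
STATUS_RANK = {"Pending": 0, "Retrying": 1, "Parked": 2,
               "Delivered": 3, "Failed": 3, "Discarded": 3}
TERMINAL = {"Delivered", "Failed", "Discarded"}


@dataclass(frozen=True)
class Telemetry:  # hypothetical message shape
    tracked_operation_id: str
    status: str
    retry_count: int = 0


def ingest(table: Dict[str, Telemetry], event: Telemetry) -> str:
    """Insert if the id is unknown, otherwise overwrite only with a newer state."""
    current = table.get(event.tracked_operation_id)
    if current is None:
        table[event.tracked_operation_id] = event
        return "inserted"
    if current.status in TERMINAL:
        return "ignored"  # terminal rows never change
    incoming = (STATUS_RANK[event.status], event.retry_count)
    existing = (STATUS_RANK[current.status], current.retry_count)
    if incoming > existing:
        table[event.tracked_operation_id] = event
        return "updated"
    return "ignored"  # duplicate or out-of-order telemetry


table: Dict[str, Telemetry] = {}
results = [
    ingest(table, Telemetry("op-1", "Retrying", 1)),
    ingest(table, Telemetry("op-1", "Pending", 0)),    # arrives late
    ingest(table, Telemetry("op-1", "Retrying", 1)),   # duplicate delivery
    ingest(table, Telemetry("op-1", "Delivered", 2)),
]
```

Replaying any of these messages again leaves the row at Delivered, which is the property that makes at-least-once delivery safe.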

From v1.x onward, the CachedCallTelemetry message additively carries the AuditEvent content alongside the existing operational fields. Central's AuditLogIngestActor (Audit Log #23) performs both the immutable AuditLog insert and the SiteCalls upsert in a single transaction. Idempotency keys remain EventId (for AuditLog) and TrackedOperationId (for SiteCalls). See Component-AuditLog.md, Cached Operations — Combined Telemetry, for the dual-write contract.

Reconciliation

Because telemetry is best-effort, SiteCallAuditActor periodically — and on site reconnect — pulls "all tracking rows changed since cursor X" from each site. Gaps left by lost telemetry self-heal. Central converges to the site; the site never depends on central.
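
The pull loop can be sketched as below. This is an illustration under assumptions: the cursor is modeled as a monotonically increasing integer "version" per site row, and reconcile_site, pull_changed_since, and apply_row are hypothetical names. The document does not specify the cursor type or the message contract.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def reconcile_site(site: str,
                   cursor: int,
                   pull_changed_since: Callable[[str, int], List[Row]],
                   apply_row: Callable[[Row], None]) -> int:
    """Pull rows changed after `cursor`, apply each one, and return the advanced cursor."""
    for row in pull_changed_since(site, cursor):
        apply_row(row)  # in practice this would go through the idempotent ingest path
        cursor = max(cursor, row["version"])
    return cursor


# Stand-ins for the site's tracking store and the central table.
site_rows: List[Row] = [
    {"id": "op-1", "status": "Delivered", "version": 1},
    {"id": "op-2", "status": "Parked", "version": 2},
]
central: Dict[str, str] = {}


def fake_pull(site: str, cursor: int) -> List[Row]:
    return [r for r in site_rows if r["version"] > cursor]


def fake_apply(row: Row) -> None:
    central[row["id"]] = row["status"]


cursor = reconcile_site("site-1", 0, fake_pull, fake_apply)
cursor_again = reconcile_site("site-1", cursor, fake_pull, fake_apply)
```

The second call finds nothing newer than the cursor, so it is a no-op. A repeated or overlapping pull is equally harmless as long as apply_row is idempotent.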

Retry / Discard Relay

Parked cached calls live in the owning site's S&F buffer. Operator Retry/Discard from the Central UI is relayed to that site as a RetryParkedOperation / DiscardParkedOperation command over the command/control channel. The site applies the change and emits telemetry reflecting the new state; central never mutates the SiteCalls row directly. If the site is offline the command fails fast and the UI surfaces a "site unreachable" message.

Only Parked rows are operator-actionable. Failed rows offer no Retry or Discard: a permanent failure (e.g. HTTP 4xx) would simply fail again, and the error was already returned synchronously to the calling script — there is nothing for an operator to recover.
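
A small sketch of the relay rules above, in Python. The command names come from this section; relay_action, send_command, the row shape, and the returned strings are hypothetical. The point it illustrates is that central checks the row, forwards the command, and leaves the SiteCalls row untouched.

```python
from typing import Callable, Dict, List, Tuple

COMMANDS = {"Retry": "RetryParkedOperation", "Discard": "DiscardParkedOperation"}


def relay_action(rows: Dict[str, dict],
                 op_id: str,
                 action: str,
                 send_command: Callable[[str, str, str], bool]) -> str:
    """Relay Retry/Discard to the owning site; never mutate the central row."""
    row = rows[op_id]
    if row["status"] != "Parked":
        return "not actionable"      # e.g. Failed rows offer no Retry or Discard
    if not send_command(row["source_site"], COMMANDS[action], op_id):
        return "site unreachable"    # fail fast; the UI surfaces this
    return "relayed"                 # the site's telemetry will update the row later


rows = {
    "op-1": {"status": "Parked", "source_site": "site-1"},
    "op-2": {"status": "Parked", "source_site": "site-2"},
    "op-3": {"status": "Failed", "source_site": "site-1"},
}
online = {"site-1"}
sent: List[Tuple[str, str, str]] = []


def fake_send(site: str, command: str, op_id: str) -> bool:
    # Stand-in for the command/control channel.
    if site not in online:
        return False
    sent.append((site, command, op_id))
    return True


outcomes = [
    relay_action(rows, "op-1", "Retry", fake_send),
    relay_action(rows, "op-2", "Discard", fake_send),
    relay_action(rows, "op-3", "Retry", fake_send),
]
```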

KPIs

Point-in-time, computed from the SiteCalls table, global and per-source-site, mirroring the Notification Outbox KPI shape:

  • Buffered count (Pending + Retrying)
  • Parked count
  • Failed-last-interval
  • Delivered-last-interval
  • Oldest-pending age
  • Stuck count — Pending/Retrying older than a configurable threshold (default 10 minutes); display-only, no escalation.
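
For illustration, the six KPIs can be derived from a list of rows as in the Python sketch below. Several details are assumptions not stated in this document: the interval length (one hour here), measuring age from CreatedAtUtc, taking "oldest pending" over all buffered rows, and the dictionary row shape. The metric key names follow those used elsewhere in this document.

```python
from datetime import datetime, timedelta, timezone

BUFFERED = ("Pending", "Retrying")


def compute_kpis(rows, now, interval=timedelta(hours=1),
                 stuck_after=timedelta(minutes=10)):
    """Point-in-time KPIs over `rows`; pass one site's rows for per-site values."""
    buffered = [r for r in rows if r["status"] in BUFFERED]
    since = now - interval

    def finished_since(status):
        return sum(1 for r in rows
                   if r["status"] == status and r["terminal_at"] >= since)

    oldest = min((r["created_at"] for r in buffered), default=None)
    return {
        "buffered": len(buffered),
        "parked": sum(1 for r in rows if r["status"] == "Parked"),
        "failedLastInterval": finished_since("Failed"),
        "deliveredLastInterval": finished_since("Delivered"),
        "oldestPendingAgeSeconds":
            (now - oldest).total_seconds() if oldest else None,
        "stuck": sum(1 for r in buffered if now - r["created_at"] > stuck_after),
    }


def at(hour, minute):
    return datetime(2026, 1, 1, hour, minute, tzinfo=timezone.utc)


now = at(12, 0)
rows = [
    {"status": "Pending", "created_at": at(11, 55), "terminal_at": None},
    {"status": "Retrying", "created_at": at(11, 30), "terminal_at": None},
    {"status": "Parked", "created_at": at(10, 0), "terminal_at": None},
    {"status": "Delivered", "created_at": at(11, 40), "terminal_at": at(11, 45)},
    {"status": "Failed", "created_at": at(8, 50), "terminal_at": at(9, 0)},
]
kpis = compute_kpis(rows, now)
```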

Retention

Daily purge of terminal rows (Delivered, Failed, Discarded) after a configurable window (default 365 days), matching the Notifications purge.
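
A sketch of the purge predicate, assuming the retention window is measured from TerminalAtUtc (the document does not say which timestamp is used). purge_terminal and the row shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TERMINAL = {"Delivered", "Failed", "Discarded"}


def purge_terminal(rows, now, retention=timedelta(days=365)):
    """Return (kept_rows, removed_count); non-terminal rows are never purged."""
    cutoff = now - retention
    kept = [r for r in rows
            if not (r["status"] in TERMINAL and r["terminal_at"] <= cutoff)]
    return kept, len(rows) - len(kept)


utc = timezone.utc
now = datetime(2026, 6, 1, tzinfo=utc)
rows = [
    {"id": "op-1", "status": "Delivered", "terminal_at": datetime(2025, 1, 1, tzinfo=utc)},
    {"id": "op-2", "status": "Delivered", "terminal_at": datetime(2026, 5, 1, tzinfo=utc)},
    {"id": "op-3", "status": "Parked", "terminal_at": None},  # old but non-terminal
]
kept, removed = purge_terminal(rows, now)
```

Parked rows stay in the table regardless of age, because Parked is non-terminal and still awaits operator action.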

Dependencies

  • Configuration Database: hosts the SiteCalls table and its repository.
  • Central-Site Communication: receives cached-call telemetry and reconciliation responses; sends Retry/Discard commands.
  • Store-and-Forward Engine: the site-side origin of cached-call telemetry and the executor of relayed Retry/Discard commands.
  • Audit Log (#23): shares the CachedCallTelemetry packet — each lifecycle transition (CachedEnqueued, CachedAttempt, CachedTerminal) carries an AuditEvent alongside the operational fields, and central's AuditLogIngestActor performs the AuditLog insert and the SiteCalls upsert in a single transaction (see Component-AuditLog.md, Cached Operations — Combined Telemetry).
  • Commons: TrackedOperationId, status enum, telemetry message contracts.

Interactions

  • Central UI: the Site Calls page queries this component and issues Retry/Discard actions.
  • Health Monitoring: surfaces Site Call Audit KPI tiles on the dashboard.
  • Cluster Infrastructure: hosts the SiteCallAuditActor singleton with active/standby failover.
  • KPI History (#26): exposes an IKpiSampleSource (SiteCallAuditKpiSampleSource; Global, per-Site, and per-Node scopes) that the KpiHistory recorder consumes, reusing the existing KPI reads. All six metrics (buffered, parked, failedLastInterval, deliveredLastInterval, stuck, oldestPendingAgeSeconds) are sampled into the KpiSample history store. Only the three listed in the public KpiMetrics.SiteCallAudit catalog (buffered, parked, failedLastInterval) render as trends on the Site Calls page via KpiTrendChart; the other three are sampled but not yet charted, and remain available for future trend panels or ad-hoc queries. See Component-KpiHistory.md.