ScadaBridge/docs/requirements/Component-SiteCallAudit.md
Joseph Doherty fd618cf1dc fix(review): full code-review remediation — 5 High + Medium/Low across 16 modules
Remediation from the full per-module code review at 4307c381 (findings recorded
separately in code-reviews/).

Highs fixed:
- DeploymentManager-025/SiteRuntime-031: stop broadcasting notification lists + SMTP
  configs (incl. credentials) to sites; site purges already-persisted rows on apply
  (enforces the central-only delivery design; clears plaintext SMTP creds at rest).
- DataConnectionLayer-023: guard the native-alarm subscribe path against the
  mid-flight-unsubscribe adapter-feed leak (mirrors the DCL-021 tag-path fix).
- SiteEventLogging-024: normalize From/To query bounds to UTC (the -016 fix the
  audit trail claimed but never committed).
- KpiHistory-001: add an in-flight guard to the recorder sample tick.
- ScriptAnalysis-001: harden the trust analyzer's TPA-absent fallback (resolve
  forbidden anchors in the minimal reference set; warn on degraded mode) — anchors
  added to validation references only, never the compile gate.
(InboundAPI-026 left to the feat/ipsen-movein effort per owner decision.)

Medium/Low: DM-026 deterministic deploy-status tiebreaker; SR-027/028/029/030
native-alarm leak/phantom-active/delete-during-redeploy fixes; AL-013/014/016;
TE-024 (folder-mutation audit rows now persisted)/025; SF-025 gauge-provider
clear-on-stop; ESG-025/026; SEC-023/024/025; SCA-007/008/009; plus doc/test
accuracy COM-023/024, HOST-025/026, HM-024/025, NS-027/028.

Full-solution build 0 warnings; ~3560 tests across 18 touched suites green.
2026-06-20 17:55:12 -04:00

# Component: Site Call Audit
## Purpose
Provides central, queryable audit and operational visibility for cached calls
made by site scripts — `ExternalSystem.CachedCall()` and `Database.CachedWrite()`.
Each such call carries a `TrackedOperationId`; sites report lifecycle telemetry
to this component, which maintains a central audit record, computes KPIs, and
relays Retry/Discard actions back to the owning site.
This is the second centrally-hosted observability component for site
store-and-forward activity (the Notification Outbox is the first). Unlike the
Notification Outbox, Site Call Audit is **not a dispatcher** — it never delivers
anything. Cached calls are delivered by the site's Store-and-Forward Engine
against site-local external systems and databases, which central cannot reach.
## Location
Central cluster only. A singleton actor (`SiteCallAuditActor`) on the active
central node. Registered as component #22 in the Host role configuration.
## Responsibilities
- Ingest cached-call lifecycle telemetry from sites into the central `SiteCalls`
table.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Compute point-in-time KPIs (global and per-site) from the `SiteCalls` table.
- Relay operator Retry/Discard actions for parked cached calls to the owning
site over the command/control channel.
- Purge terminal audit rows after a configurable retention window.
## The `SiteCalls` Table
Lives in the central MS SQL configuration database — a sibling of the
`Notifications` table. One row per `TrackedOperationId`:
- **TrackedOperationId** — GUID, primary key. Generated site-side at call time.
- **SourceSite** — site that issued the call.
- **SourceNode** — the cluster node on which the call was issued (`node-a` /
`node-b`, qualified by `SourceSite`). Nullable. Stamped site-side at submit
time and carried verbatim through the combined `CachedCallTelemetry` packet,
reconciliation pulls, and the central upsert.
- **Kind** — `TrackedOperationKind` enum: `ExternalCall` or `DatabaseWrite`.
- **TargetSummary** — external system + method name for an `ExternalCall`; for a
`DatabaseWrite`, only the database connection name. The SQL statement and table
are deliberately excluded as a scoping choice.
- **Status** — `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`.
- **RetryCount** — attempts so far.
- **LastError** — most recent error detail, if any.
- **Provenance** — source instance / script.
- **CreatedAtUtc**, **UpdatedAtUtc**, **TerminalAtUtc** — key timestamps.
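
As a rough illustration of the row shape listed above, the following Python sketch models one `SiteCalls` row. The field names mirror the columns; the Python types, the defaults, and the example values (site name, target, provenance string) are assumptions made for the sketch, not the actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class TrackedOperationKind(Enum):
    EXTERNAL_CALL = "ExternalCall"
    DATABASE_WRITE = "DatabaseWrite"


class SiteCallStatus(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


@dataclass
class SiteCallRow:
    tracked_operation_id: UUID             # primary key, generated site-side
    source_site: str
    kind: TrackedOperationKind
    target_summary: str                    # connection name only for a DatabaseWrite
    status: SiteCallStatus
    created_at_utc: datetime
    updated_at_utc: datetime
    source_node: Optional[str] = None      # nullable, e.g. "node-a" / "node-b"
    retry_count: int = 0
    last_error: Optional[str] = None
    provenance: Optional[str] = None       # assumed to be free text
    terminal_at_utc: Optional[datetime] = None  # assumed unset until terminal


# Hypothetical example values, for illustration only.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
row = SiteCallRow(
    tracked_operation_id=uuid4(),
    source_site="site-1",
    kind=TrackedOperationKind.DATABASE_WRITE,
    target_summary="HistorianDb",
    status=SiteCallStatus.PENDING,
    created_at_utc=now,
    updated_at_utc=now,
    source_node="node-a",
)
```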
## Status Lifecycle
`Pending → Retrying → Delivered / Parked / Failed / Discarded`
- **Pending** — non-terminal: buffered after a transient failure, awaiting its
first retry.
- **Retrying** — non-terminal: undergoing retry attempts.
- **Delivered** — terminal, success. A cached call that succeeds on its first
immediate attempt is recorded directly as `Delivered`.
- **Parked** — non-terminal: transient retries exhausted; awaiting manual action.
- **Failed** — terminal: permanent failure (e.g. HTTP 4xx). The error was also
returned synchronously to the calling script; the record captures it. `Failed`
rows are **not operator-actionable** — see Retry / Discard Relay.
- **Discarded** — terminal, reached **only by operator action** on a `Parked`
row. The row is kept (not deleted) so the table remains a complete audit
record.
The site is the source of truth. The `SiteCalls` row is an eventually-consistent
mirror — never queried by scripts (`Tracking.Status()` is answered site-locally).
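
The classification in the lifecycle above can be summarized in a few helper predicates. This is a minimal Python sketch: the status names come from the list above, while the helper names are hypothetical.

```python
# Status names are taken from the lifecycle above; helper names are hypothetical.
NON_TERMINAL = frozenset({"Pending", "Retrying", "Parked"})
TERMINAL = frozenset({"Delivered", "Failed", "Discarded"})


def is_terminal(status: str) -> bool:
    """True for Delivered, Failed, and Discarded; rejects unknown names."""
    if status not in NON_TERMINAL | TERMINAL:
        raise ValueError(f"unknown status: {status}")
    return status in TERMINAL


def is_buffered(status: str) -> bool:
    """Pending and Retrying rows are still being worked by the site."""
    return status in {"Pending", "Retrying"}


def is_operator_actionable(status: str) -> bool:
    """Only Parked rows accept Retry or Discard."""
    return status == "Parked"
```

Note that `Parked` is neither terminal nor buffered: it waits for an operator rather than for another automatic retry.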
## Ingest & Idempotency
Telemetry ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`,
then **upsert-on-newer-status**. The lifecycle is monotonic, so status only
advances and never regresses; at-least-once and out-of-order telemetry are
therefore harmless.
From v1.x onward, the `CachedCallTelemetry` message additively carries the
`AuditEvent` content alongside the existing operational fields. Central's
`AuditLogIngestActor` (Audit Log #23) performs both the immutable `AuditLog`
insert and the `SiteCalls` upsert in a single transaction. Idempotency keys
remain `EventId` (for `AuditLog`) and `TrackedOperationId` (for `SiteCalls`).
See [Component-AuditLog.md](Component-AuditLog.md), Cached Operations —
Combined Telemetry, for the dual-write contract.
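
The insert-if-not-exists then upsert-on-newer-status rule can be sketched in memory as follows. The numeric rank order is an assumption (the text states only that the lifecycle is monotonic), the `Telemetry` type and `ingest` function are hypothetical, and the sketch compares status alone, so it ignores same-status updates such as a changed retry count.

```python
from dataclasses import dataclass
from typing import Dict

# Assumed ordering: a higher rank is "newer". Terminal states share the top rank.
STATUS_RANK = {
    "Pending": 0,
    "Retrying": 1,
    "Parked": 2,
    "Delivered": 3,
    "Failed": 3,
    "Discarded": 3,
}


@dataclass
class Telemetry:
    tracked_operation_id: str
    status: str


def ingest(table: Dict[str, Telemetry], msg: Telemetry) -> bool:
    """Apply one telemetry message; return True if the table changed."""
    existing = table.get(msg.tracked_operation_id)
    if existing is None:
        # Insert-if-not-exists, keyed on TrackedOperationId.
        table[msg.tracked_operation_id] = msg
        return True
    if STATUS_RANK[msg.status] > STATUS_RANK[existing.status]:
        # Upsert only when the incoming status is strictly newer.
        table[msg.tracked_operation_id] = msg
        return True
    # Duplicate or late (out-of-order) message: ignore it.
    return False


table: Dict[str, Telemetry] = {}
results = [
    ingest(table, Telemetry("op-1", "Retrying")),   # first sighting: inserted
    ingest(table, Telemetry("op-1", "Pending")),    # late, older status: ignored
    ingest(table, Telemetry("op-1", "Delivered")),  # newer status: applied
    ingest(table, Telemetry("op-1", "Delivered")),  # duplicate: ignored
]
```

Replaying the same messages in any order leaves `op-1` at `Delivered`, which is why at-least-once delivery is safe here.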
## Reconciliation
Because telemetry is best-effort, `SiteCallAuditActor` periodically — and on site
reconnect — pulls "all tracking rows changed since cursor X" from each site.
Gaps left by lost telemetry self-heal. Central converges to the site; the site
never depends on central.
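
A reconciliation pull might look like the Python sketch below. The cursor is modeled as an integer change sequence, which is an assumption: the text says only "changed since cursor X". `TrackingRow`, `reconcile_site`, and the callback parameters are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TrackingRow:
    tracked_operation_id: str
    status: str
    change_seq: int  # hypothetical site-side change counter, increasing per change


def reconcile_site(
    fetch_changed_since: Callable[[int], List[TrackingRow]],
    cursor: int,
    apply_row: Callable[[TrackingRow], None],
) -> int:
    """Pull rows changed after `cursor`, apply them, and return the new cursor."""
    for row in sorted(fetch_changed_since(cursor), key=lambda r: r.change_seq):
        apply_row(row)  # in the real component this would be the idempotent upsert
        cursor = max(cursor, row.change_seq)
    return cursor


# A fake site holding three tracking rows, for illustration only.
site_rows = [
    TrackingRow("a", "Delivered", 1),
    TrackingRow("b", "Parked", 2),
    TrackingRow("c", "Pending", 3),
]
fetch = lambda since: [r for r in site_rows if r.change_seq > since]

central: Dict[str, str] = {}
store = lambda r: central.__setitem__(r.tracked_operation_id, r.status)

cursor = reconcile_site(fetch, 0, store)            # first pull picks up everything
cursor_again = reconcile_site(fetch, cursor, store)  # nothing new: cursor unchanged
```

Because the cursor only moves forward after a row is applied, a pull that fails partway can simply be repeated from the last stored cursor.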
## Retry / Discard Relay
Parked cached calls live in the owning site's S&F buffer. Operator Retry/Discard
from the Central UI is relayed to that site as a `RetryParkedOperation` /
`DiscardParkedOperation` command over the command/control channel. The site
applies the change and emits telemetry reflecting the new state; central never
mutates the `SiteCalls` row directly. If the site is offline the command fails
fast and the UI surfaces a "site unreachable" message.
Only `Parked` rows are operator-actionable. `Failed` rows offer no Retry or
Discard: a permanent failure (e.g. HTTP 4xx) would simply fail again, and the
error was already returned synchronously to the calling script — there is
nothing for an operator to recover.
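
The relay rules above reduce to two guards before a command is sent. In this Python sketch the command names come from the text; `RelayResult`, `relay_action`, and the `send` callback are hypothetical stand-ins for the command/control channel.

```python
from enum import Enum
from typing import Callable


class RelayResult(Enum):
    SENT = "sent"
    NOT_ACTIONABLE = "not-actionable"
    SITE_UNREACHABLE = "site-unreachable"


# Command names come from the text; the action keys are hypothetical.
COMMANDS = {"retry": "RetryParkedOperation", "discard": "DiscardParkedOperation"}


def relay_action(
    action: str,
    status: str,
    site_online: bool,
    send: Callable[[str], None],
) -> RelayResult:
    """Relay an operator action to the owning site without touching the row."""
    command = COMMANDS[action]  # raises KeyError for an unknown action
    if status != "Parked":
        return RelayResult.NOT_ACTIONABLE      # e.g. Failed rows offer no actions
    if not site_online:
        return RelayResult.SITE_UNREACHABLE    # fail fast; the UI shows the message
    send(command)
    # The row changes only later, when the site's telemetry arrives.
    return RelayResult.SENT


sent = []
ok = relay_action("retry", "Parked", True, sent.append)
refused = relay_action("discard", "Failed", True, sent.append)
offline = relay_action("discard", "Parked", False, sent.append)
```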
## KPIs
Point-in-time, computed from the `SiteCalls` table, global and per-source-site,
mirroring the Notification Outbox KPI shape:
- Buffered count (`Pending` + `Retrying`)
- Parked count
- Failed-last-interval
- Delivered-last-interval
- Oldest-pending age
- Stuck count — `Pending`/`Retrying` older than a configurable threshold
(default 10 minutes); display-only, no escalation.
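
The six KPIs can be derived from the rows in a single pass, as in this Python sketch. Several details are assumptions rather than documented behavior: the one-hour interval, measuring age from `CreatedAtUtc`, counting interval KPIs by `TerminalAtUtc`, and treating "oldest-pending" as the oldest buffered (`Pending` or `Retrying`) row. The `Row` type and `compute_kpis` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass
class Row:
    status: str
    created_at_utc: datetime
    terminal_at_utc: Optional[datetime] = None


def compute_kpis(
    rows: Iterable[Row],
    now: datetime,
    interval: timedelta = timedelta(hours=1),        # assumed interval length
    stuck_after: timedelta = timedelta(minutes=10),  # documented default threshold
) -> dict:
    rows = list(rows)
    buffered = [r for r in rows if r.status in ("Pending", "Retrying")]

    def terminal_in_interval(status: str) -> int:
        return sum(
            1
            for r in rows
            if r.status == status
            and r.terminal_at_utc is not None
            and now - r.terminal_at_utc <= interval
        )

    oldest = max((now - r.created_at_utc for r in buffered), default=None)
    return {
        "buffered": len(buffered),
        "parked": sum(1 for r in rows if r.status == "Parked"),
        "failedLastInterval": terminal_in_interval("Failed"),
        "deliveredLastInterval": terminal_in_interval("Delivered"),
        "stuck": sum(1 for r in buffered if now - r.created_at_utc > stuck_after),
        "oldestPendingAgeSeconds": None if oldest is None else oldest.total_seconds(),
    }


NOW = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
m, h = timedelta(minutes=1), timedelta(hours=1)
kpis = compute_kpis(
    [
        Row("Pending", NOW - 5 * m),
        Row("Retrying", NOW - 30 * m),
        Row("Parked", NOW - 2 * h),
        Row("Failed", NOW - 3 * h, NOW - 20 * m),
        Row("Delivered", NOW - 3 * h, NOW - 2 * h),
        Row("Delivered", NOW - 15 * m, NOW - 10 * m),
    ],
    NOW,
)
```

Per-site KPIs follow from the same function by filtering the rows on `SourceSite` first.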
## Retention
Daily purge of terminal rows (`Delivered`, `Failed`, `Discarded`) after a
configurable window (default 365 days), matching the `Notifications` purge.
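
The purge rule can be expressed as a filter over the rows. In this Python sketch the row age is measured from `TerminalAtUtc`, which is an assumption (the text does not say which timestamp the window applies to); `Row` and `purge_terminal` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

TERMINAL = frozenset({"Delivered", "Failed", "Discarded"})


@dataclass
class Row:
    op_id: str
    status: str
    terminal_at_utc: Optional[datetime] = None


def purge_terminal(
    rows: List[Row],
    now: datetime,
    retention: timedelta = timedelta(days=365),  # documented default window
) -> List[Row]:
    """Return the rows that survive the purge; non-terminal rows always survive."""
    cutoff = now - retention
    return [
        r
        for r in rows
        if not (
            r.status in TERMINAL
            and r.terminal_at_utc is not None
            and r.terminal_at_utc < cutoff
        )
    ]


NOW = datetime(2026, 6, 1, tzinfo=timezone.utc)
day = timedelta(days=1)
kept = purge_terminal(
    [
        Row("old-delivered", "Delivered", NOW - 400 * day),  # purged
        Row("recent-failed", "Failed", NOW - 10 * day),      # kept: inside window
        Row("old-parked", "Parked"),                         # kept: not terminal
        Row("old-discarded", "Discarded", NOW - 366 * day),  # purged
    ],
    NOW,
)
```

A `Parked` row is never purged however old it is, since it still awaits operator action.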
## Dependencies
- **Configuration Database**: hosts the `SiteCalls` table and its repository.
- **Central-Site Communication**: receives cached-call telemetry and reconciliation
responses; sends Retry/Discard commands.
- **Store-and-Forward Engine**: the site-side origin of cached-call telemetry and
the executor of relayed Retry/Discard commands.
- **Audit Log (#23)**: shares the `CachedCallTelemetry` packet — each lifecycle
transition (`CachedEnqueued`, `CachedAttempt`, `CachedTerminal`) carries an
`AuditEvent` alongside the operational fields, and central's
`AuditLogIngestActor` performs the `AuditLog` insert and the `SiteCalls`
upsert in a single transaction (see [Component-AuditLog.md](Component-AuditLog.md),
Cached Operations — Combined Telemetry).
- **Commons**: `TrackedOperationId`, status enum, telemetry message contracts.
## Interactions
- **Central UI**: the Site Calls page queries this component and issues
Retry/Discard actions.
- **Health Monitoring**: surfaces Site Call Audit KPI tiles on the dashboard.
- **Cluster Infrastructure**: hosts the `SiteCallAuditActor` singleton with
active/standby failover.
- **KPI History (#26)**: provides an `IKpiSampleSource` implementation
(`SiteCallAuditKpiSampleSource`, Global + per-Site + per-Node) consumed by the
KpiHistory recorder (#26), reusing the existing KPI reads. All six metrics
(`buffered`, `parked`, `failedLastInterval`, `deliveredLastInterval`, `stuck`,
`oldestPendingAgeSeconds`) are sampled into the `KpiSample` history store. Only
the three listed in the public `KpiMetrics.SiteCallAudit` catalog (`buffered`,
`parked`, `failedLastInterval`) render as trends on the Site Calls page via
`KpiTrendChart`. The other three (`deliveredLastInterval`, `stuck`,
`oldestPendingAgeSeconds`) are sampled but not yet charted, and remain available
for future trend panels or ad-hoc queries. See
[Component-KpiHistory.md](Component-KpiHistory.md).