# Site Call Audit

Site Call Audit (#22) is a central-only observability component that maintains an eventually-consistent mirror of every cached call (`ExternalSystem.CachedCall()` and `Database.CachedWrite()`) issued by site scripts. It ingests lifecycle telemetry from sites into the central `SiteCalls` MS SQL table, computes point-in-time KPIs, and relays operator Retry/Discard actions back to the owning site. It does not deliver anything: cached-call execution stays entirely site-local.

## Overview

The component lives in `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/` and runs only on the central cluster. Its single actor, `SiteCallAuditActor`, is an Akka.NET `ReceiveActor` deployed as a `ClusterSingletonManager`-managed singleton on the active central node.

Telemetry reaches central through the shared `CachedCallTelemetry` packet (see [Audit Log](./AuditLog.md)), which carries both an `AuditEvent` for the `AuditLog` table and a `SiteCallOperational` snapshot for the `SiteCalls` table. The `AuditLogIngestActor` (Audit Log #23) writes both in a single MS SQL transaction when it receives an `IngestCachedTelemetryCommand`; it then tells `SiteCallAuditActor` an `UpsertSiteCallCommand` so the `SiteCalls` row is always consistent with the paired audit row. The `SiteCallAuditActor` is therefore an ingest target, not a transport; it never constructs telemetry packets and never decides what gets delivered.

Sites remain the source of truth. `Tracking.Status()` is answered site-locally from the site SQLite tracking store; the central `SiteCalls` row is what the Central UI Site Calls page reads, and it may lag by one telemetry cycle.

## Key Concepts

### Mirror, not dispatcher

The Notification Outbox (#21) ingests notifications and dispatches them centrally. Site Call Audit is different: the Store-and-Forward Engine on each site performs all retry and delivery attempts against the site's locally reachable external systems and databases. Central cannot reach those systems.
The `SiteCalls` table is read-only from the operator's perspective: operators can view it and trigger Retry/Discard actions, but the actions are forwarded to the site, and central never mutates the mirror row in response to them. The row changes only through telemetry ingest.

### Monotonic upsert idempotency

The `SiteCalls` table holds one row per `TrackedOperationId`. `ISiteCallAuditRepository.UpsertAsync` implements insert-if-not-exists followed by a conditional update that only applies when the incoming status has a strictly higher rank than the stored status:

```text
Submitted=0, Forwarded=1, Attempted=2, Skipped=2, Delivered=3, Failed=3, Parked=3, Discarded=3
```

Out-of-order telemetry, duplicate gRPC packets, and future reconciliation pulls therefore all feed the same writer safely: status never regresses.

### Stuck calls

A non-terminal row (`TerminalAtUtc IS NULL`) created before `now − StuckAgeThreshold` (default 10 minutes) is classified as stuck. Stuck is display-only: surfaced as a `StuckCount` KPI and a row badge in the UI. There is no escalation or alerting.

## Architecture

### `SiteCallAuditActor`

`SiteCallAuditActor` is a `ReceiveActor` with two constructors:

- **Production**: receives `IServiceProvider` and opens a fresh DI scope per message to resolve the scoped EF Core `ISiteCallAuditRepository`. This mirrors `AuditLogIngestActor`'s pattern, because a long-lived singleton cannot hold a scope across messages.
- **Test**: receives a concrete `ISiteCallAuditRepository` and reuses it across all messages, allowing integration tests to run against a real MS SQL fixture without DI scaffolding.

The actor catches all repository exceptions in its write path and replies `Accepted=false` without rethrowing, keeping the singleton alive across storage faults. The `SupervisorStrategy` override (one-for-one, `maxNrOfRetries: 0`) governs any future children; the actor currently has none.
```csharp
private async Task OnUpsertAsync(UpsertSiteCallCommand cmd)
{
    // Capture the sender before the first await; Sender is not valid afterwards.
    var replyTo = Sender;
    var id = cmd.SiteCall.TrackedOperationId;

    IServiceScope? scope = null;
    ISiteCallAuditRepository repository;
    if (_injectedRepository is not null)
    {
        // Test constructor: reuse the injected repository.
        repository = _injectedRepository;
    }
    else
    {
        // Production constructor: fresh DI scope per message.
        scope = _serviceProvider!.CreateScope();
        repository = scope.ServiceProvider.GetRequiredService<ISiteCallAuditRepository>();
    }

    try
    {
        // Stamp the central-side ingest time; never trust a site-provided value.
        var siteCall = cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow };
        await repository.UpsertAsync(siteCall).ConfigureAwait(false);
        replyTo.Tell(new UpsertSiteCallReply(id, Accepted: true));
    }
    catch (Exception ex)
    {
        // Swallow storage faults so the singleton stays alive.
        _logger.LogError(ex, "SiteCallAudit upsert failed for {TrackedOperationId}", id);
        replyTo.Tell(new UpsertSiteCallReply(id, Accepted: false));
    }
    finally
    {
        scope?.Dispose();
    }
}
```

`IngestedAtUtc` is always stamped at central-side persist time, not carried from the site. This ensures the column reflects when central last processed the row, not when the site emitted it.

### Message handlers

| Message | Direction | Handler |
|---|---|---|
| `UpsertSiteCallCommand` | Central ingest → actor | `OnUpsertAsync`: scope-per-message upsert |
| `SiteCallQueryRequest` | Central UI → actor | `HandleQuery`: keyset-paged query, max 200 rows |
| `SiteCallDetailRequest` | Central UI → actor | `HandleDetail`: single-row full detail |
| `SiteCallKpiRequest` | Central UI → actor | `HandleKpi`: global KPI snapshot |
| `PerSiteSiteCallKpiRequest` | Central UI → actor | `HandlePerSiteKpi`: per-site KPI list |
| `RetrySiteCallRequest` | Central UI → actor | `HandleRetrySiteCall`: relay to site |
| `DiscardSiteCallRequest` | Central UI → actor | `HandleDiscardSiteCall`: relay to site |
| `RegisterCentralCommunication` | Host → actor | Wires the `CentralCommunicationActor` transport |

All read handlers capture `Sender` before the first `await` and use `PipeTo` to return the response, which is the standard Akka pattern for async ask-reply handlers.
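The handler table lists `HandleQuery` as a keyset-paged query capped at 200 rows. The following is a minimal in-memory sketch of that idea, not the repository's actual query. The row and cursor shapes (`SiteCallRow`, `KeysetCursor`), the ordering key (`CreatedAtUtc` then `TrackedOperationId`, newest first), and the `KeysetPaging` helper are all assumptions made for illustration; the real `SiteCallPaging` contract may differ.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes for illustration only.
public sealed record SiteCallRow(Guid TrackedOperationId, DateTime CreatedAtUtc);
public sealed record KeysetCursor(DateTime CreatedAtUtc, Guid TrackedOperationId);

public static class KeysetPaging
{
    // Page-size cap taken from the handler table (max 200 rows).
    public const int MaxPageSize = 200;

    // Returns the next page, newest first, containing only rows that sort
    // strictly after the cursor. A null cursor returns the first page.
    public static IReadOnlyList<SiteCallRow> NextPage(
        IEnumerable<SiteCallRow> rows, KeysetCursor? after, int requestedSize)
    {
        var size = Math.Clamp(requestedSize, 1, MaxPageSize);

        // Keyset predicate: (CreatedAtUtc, TrackedOperationId) < cursor.
        IEnumerable<SiteCallRow> candidates = after is null
            ? rows
            : rows.Where(r =>
                r.CreatedAtUtc < after.CreatedAtUtc ||
                (r.CreatedAtUtc == after.CreatedAtUtc &&
                 r.TrackedOperationId.CompareTo(after.TrackedOperationId) < 0));

        return candidates
            .OrderByDescending(r => r.CreatedAtUtc)
            .ThenByDescending(r => r.TrackedOperationId)
            .Take(size)
            .ToList();
    }
}
```

The cursor for the next request is built from the last row of the current page. Unlike offset paging, rows upserted between requests cannot shift the page window. In a real MS SQL query the tiebreaker comparison would have to follow SQL Server's `uniqueidentifier` ordering, which differs from .NET `Guid.CompareTo`.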
### `SiteCalls` table

One row per `TrackedOperationId` in the central MS SQL configuration database. Key columns:

| Column | Type | Notes |
|---|---|---|
| `TrackedOperationId` | `uniqueidentifier` | PK; GUID stamped site-side at call time |
| `Channel` | `varchar` | `"ApiOutbound"` or `"DbOutbound"` |
| `Target` | `varchar` | Human-readable target, e.g. `"ERP.GetOrder"` |
| `SourceSite` | `varchar` | Site that issued the call |
| `SourceNode` | `varchar NULL` | `node-a` / `node-b`; nullable for retired nodes |
| `Status` | `varchar` | String form of `AuditStatus`; monotonic |
| `RetryCount` | `int` | Dispatch attempts so far |
| `LastError` | `varchar NULL` | Most recent error text |
| `HttpStatus` | `int NULL` | Last HTTP response code (API calls only) |
| `CreatedAtUtc` | `datetime2` | First submit timestamp from site |
| `UpdatedAtUtc` | `datetime2` | Latest site-side status mutation |
| `TerminalAtUtc` | `datetime2 NULL` | Set when status reaches a terminal rank |
| `IngestedAtUtc` | `datetime2` | Central-side stamp, updated on every upsert |

Unlike the `AuditLog` table, `SiteCalls` is a standard non-partitioned table on `[PRIMARY]` holding mutable operational state. No DB-role restriction applies; it is updated in place by the upsert.

### Status lifecycle

```text
Submitted → Forwarded → Attempted ──→ Delivered  (terminal, success)
                                  └──→ Parked     (non-terminal, awaiting operator action)
                                  └──→ Failed     (terminal, permanent failure)
                                  └──→ Discarded  (terminal, operator-initiated on Parked row)
```

`Failed` rows are not operator-actionable: a permanent failure (e.g. HTTP 4xx) would fail again, and the error was already returned synchronously to the calling script. Only `Parked` rows support Retry and Discard.

### Retry / Discard relay

The `CentralCommunicationActor` is wired into `SiteCallAuditActor` after both actors exist, via `RegisterCentralCommunication`.
Until registration completes, any relay request receives an immediate `SiteCallRelayOutcome.SiteUnreachable` outcome, because there is genuinely no route to any site.

When `_centralCommunication` is set, the relay handler wraps the command in a `SiteEnvelope` keyed to `SourceSite` and Asks the `CentralCommunicationActor`, which routes it over the per-site `ClusterClient`:

```csharp
private void HandleRetrySiteCall(RetrySiteCallRequest request)
{
    var sender = Sender;

    // No transport registered yet: there is no route to any site.
    if (_centralCommunication is null)
    {
        sender.Tell(UnreachableRetry(request.CorrelationId));
        return;
    }

    var relay = new RetryParkedOperation(
        request.CorrelationId,
        new TrackedOperationId(request.TrackedOperationId));
    var envelope = new SiteEnvelope(request.SourceSite, relay);

    // Ask the transport and pipe the mapped outcome back to the original sender.
    _centralCommunication.Ask(envelope, _options.RelayTimeout)
        .PipeTo(
            sender,
            success: ack => MapRetryResponse(request.CorrelationId, ack),
            failure: ex => MapRetryFailure(request.CorrelationId, request.SourceSite, ex));
}
```

The site applies `RetryParkedOperation` / `DiscardParkedOperation` to its own Store-and-Forward buffer and returns a `ParkedOperationActionAck`. The actor maps the ack to a `SiteCallRelayOutcome`:

| Ack | Outcome |
|---|---|
| `Applied=true` | `Applied` |
| `Applied=false`, no error | `NotParked`: site had nothing to do |
| `Applied=false`, error present | `OperationFailed`: site faulted |
| Ask timeout / no route | `SiteUnreachable` |

`SiteUnreachable` is distinguished from `OperationFailed` because central is a mirror: a relay that never reached the site is a transient transport condition, not an operation failure. The UI surfaces "site unreachable" so operators know to retry once the site is back online.

The corrected cached-call state flows back to central via the normal telemetry path after the site applies the action. Central never writes the `SiteCalls` row to reflect a relay outcome directly.
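The ack-to-outcome table can be read as a small pure function. The sketch below illustrates that mapping only; it is not the body of `MapRetryResponse` or `MapRetryFailure`. `RelayAck` and `RelayOutcome` are hypothetical stand-ins for `ParkedOperationActionAck` and `SiteCallRelayOutcome`, and the `Error` property name is an assumption.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for SiteCallRelayOutcome.
public enum RelayOutcome { Applied, NotParked, OperationFailed, SiteUnreachable }

// Hypothetical stand-in for ParkedOperationActionAck; the Error name is assumed.
public sealed record RelayAck(bool Applied, string? Error);

public static class RelayOutcomeMapper
{
    // The site answered: classify by the Applied flag, then by error presence.
    public static RelayOutcome FromAck(RelayAck ack)
    {
        if (ack.Applied)
        {
            return RelayOutcome.Applied;
        }

        return string.IsNullOrEmpty(ack.Error)
            ? RelayOutcome.NotParked        // reached the site, nothing to do
            : RelayOutcome.OperationFailed; // reached the site, site faulted
    }

    // The Ask itself failed (timeout or no route): no ack was received,
    // so the result is a transport condition, not an operation failure.
    public static RelayOutcome FromAskFailure(Exception ex) =>
        RelayOutcome.SiteUnreachable;
}
```

Keeping the two entry points separate mirrors the `success` and `failure` callbacks passed to `PipeTo`: only an ack that actually arrived can produce `NotParked` or `OperationFailed`.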
### KPI computation

KPIs are computed point-in-time against the `SiteCalls` table by `ISiteCallAuditRepository.ComputeKpisAsync` and `ComputePerSiteKpisAsync`. All aggregation is server-side; no rows are materialised. The actor derives the cutoff timestamps from `SiteCallAuditOptions` before calling the repository:

```csharp
private void HandleKpi(SiteCallKpiRequest request)
{
    var sender = Sender;
    var now = DateTime.UtcNow;

    // Cutoffs are derived once per request from the bound options.
    var stuckCutoff = now - _options.StuckAgeThreshold;
    var intervalSince = now - _options.KpiInterval;

    KpiAsync(request.CorrelationId, stuckCutoff, intervalSince).PipeTo(
        sender,
        success: response => response,
        // On failure, reply with a zeroed snapshot carrying the error message.
        failure: ex => new SiteCallKpiResponse(
            request.CorrelationId,
            Success: false,
            ErrorMessage: ex.GetBaseException().Message,
            BufferedCount: 0,
            ParkedCount: 0,
            FailedLastInterval: 0,
            DeliveredLastInterval: 0,
            OldestPendingAge: null,
            StuckCount: 0));
}
```

The `SiteCallKpiSnapshot` shape mirrors `NotificationKpiSnapshot` so the Central UI dashboard can reuse the same tile layout for both components.

## Usage

The actor accepts only Akka messages; there is no public API beyond the message protocol defined in Commons. The Central UI's Site Calls page sends `SiteCallQueryRequest` / `SiteCallKpiRequest` / `PerSiteSiteCallKpiRequest` / `SiteCallDetailRequest` through `CommunicationService`, which Asks the singleton and awaits `SiteCallQueryResponse` / `SiteCallKpiResponse` / `PerSiteSiteCallKpiResponse` / `SiteCallDetailResponse`.

The ingest path is driven by `AuditLogIngestActor.OnCachedTelemetryAsync`, which tells an `UpsertSiteCallCommand` after committing the dual-write transaction. The `SiteCallAuditActor` does not need to coordinate with `AuditLogIngestActor`: the transaction guarantees the `AuditLog` row always precedes the upsert command.

Registration is via `ServiceCollectionExtensions.AddSiteCallAudit`, which binds `SiteCallAuditOptions` from the `ScadaBridge:SiteCallAudit` configuration section.
The actor `Props` and the `ClusterSingletonManager` registration are wired in the Host's central-role composition.

## Configuration

`SiteCallAuditOptions` is bound from the `ScadaBridge:SiteCallAudit` section.

| Key | Default | Description |
|---|---|---|
| `StuckAgeThreshold` | `00:10:00` | Age past which a non-terminal row is counted as stuck. Display-only; no escalation. |
| `KpiInterval` | `00:01:00` | Trailing window for `DeliveredLastInterval` and `FailedLastInterval` KPIs. |
| `RelayTimeout` | `00:00:10` | Ask timeout for the central→site Retry/Discard relay. Must be less than `CommunicationOptions.QueryTimeout` (default 30 s) so the inner relay times out first and returns the distinct `SiteUnreachable` outcome. |

## Dependencies & Interactions

- [Commons (#16)](./Commons.md): owns `SiteCall`, `SiteCallOperational`, `TrackedOperationId`, `SiteCallAuditOptions`-adjacent types (`SiteCallKpiSnapshot`, `SiteCallSiteKpiSnapshot`, `SiteCallQueryFilter`, `SiteCallPaging`), all message contracts (`UpsertSiteCallCommand`, `UpsertSiteCallReply`, `SiteCallQueryRequest/Response`, `SiteCallDetailRequest/Response`, `SiteCallKpiRequest/Response`, `PerSiteSiteCallKpiRequest/Response`, `RetrySiteCallRequest/Response`, `DiscardSiteCallRequest/Response`, `SiteCallRelayOutcome`), and the `ISiteCallAuditRepository` interface.
- [Configuration Database (#17)](./ConfigurationDatabase.md): implements `ISiteCallAuditRepository` against the central `dbo.SiteCalls` table. Central hosts must call `AddConfigurationDatabase` for the actor to resolve its scoped repository.
- [Audit Log (#23)](./AuditLog.md): shares the `CachedCallTelemetry` packet. `AuditLogIngestActor.OnCachedTelemetryAsync` writes the `AuditLog` row and the `SiteCalls` upsert in a single MS SQL transaction, then tells `UpsertSiteCallCommand` to this actor. The two components coordinate via message-passing, not a shared service.
- [Central–Site Communication (#5)](./Communication.md): the `CentralCommunicationActor` is the transport the relay handlers use. It is registered via `RegisterCentralCommunication` by the Host after both actors are running. `CommunicationService` also provides the async wrappers (`RetrySiteCallAsync`, `DiscardSiteCallAsync`) that the Central UI calls; those methods Ask the `SiteCallAuditActor` with the outer `CommunicationOptions.QueryTimeout`.
- [Store-and-Forward Engine (#6)](./StoreAndForward.md): site-side executor of `RetryParkedOperation` and `DiscardParkedOperation`. The site's S&F buffer is the source of truth for parked cached calls; it emits updated telemetry after applying an operator action.
- [Health Monitoring (#11)](./HealthMonitoring.md): consumes `SiteCallKpiResponse` / `PerSiteSiteCallKpiResponse` to surface buffered count, parked count, stuck count, and throughput KPI tiles on the health dashboard alongside the Notification Outbox tiles.
- [Central UI (#9)](./CentralUI.md): the Site Calls page queries this actor for the paginated list, detail modal, and KPIs; it issues Retry/Discard actions that flow through `CommunicationService` to the relay handlers here.
- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md): hosts the `SiteCallAuditActor` singleton with active/standby failover via `ClusterSingletonManager`.

## Troubleshooting

### Relay returns `SiteUnreachable`

The `CentralCommunicationActor` could not route the command to the site: the site is offline, the `ClusterClient` route has not yet resolved, or the relay timed out waiting for a `ParkedOperationActionAck`. The `_options.RelayTimeout` (default 10 s) is the inner Ask timeout. The action was NOT applied. Retry the operator action once the site is back online; the `SiteCalls` mirror row will self-correct via telemetry after the site applies it.

### Relay returns `NotParked`

The site was reached but reported no parked row for the given `TrackedOperationId`.
The call was likely already delivered, discarded, or transitioned out of `Parked` status between the operator clicking Retry/Discard and the relay arriving. No action is required; the telemetry will update the mirror row shortly.

### Upsert replied `Accepted=false`

The actor caught a repository exception and replied false to the caller without rethrowing. The central singleton remains alive. Check the structured log for a `SiteCallAudit upsert failed for {TrackedOperationId}` error with the exception detail. If the MS SQL configuration database is temporarily unavailable, the telemetry sender will retry on its next cycle (the at-least-once gRPC transport) or the future reconciliation pull will backfill the row.

### `SiteCalls` rows not appearing

Ingest flows through `AuditLogIngestActor.OnCachedTelemetryAsync`, which writes the `AuditLog` row and `SiteCalls` upsert in one transaction before telling `UpsertSiteCallCommand`. If that transaction fails, neither row is written. Check `AuditLog` ingest health first: a missing `AuditLog` row for the same `TrackedOperationId` confirms the telemetry never reached central, not that the `SiteCalls` upsert failed in isolation.

## Related Documentation

- [Site Call Audit design specification](../requirements/Component-SiteCallAudit.md)
- [Audit Log](./AuditLog.md)
- [Notification Outbox](./NotificationOutbox.md)
- [Configuration Database](./ConfigurationDatabase.md)
- [Central–Site Communication](./Communication.md)
- [Store-and-Forward Engine](./StoreAndForward.md)
- [Commons](./Commons.md)
- [Health Monitoring](./HealthMonitoring.md)