ScadaBridge/docs/components/SiteCallAudit.md

# Site Call Audit

Site Call Audit (#22) is a central-only observability component that maintains an eventually-consistent mirror of every cached call — `ExternalSystem.CachedCall()` and `Database.CachedWrite()` — issued by site scripts. It ingests lifecycle telemetry from sites into the central `SiteCalls` MS SQL table, computes point-in-time KPIs, and relays operator Retry/Discard actions back to the owning site. It does not deliver anything: cached-call execution stays entirely site-local.

## Overview

The component lives in `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/` and runs only on the central cluster. Its single class, `SiteCallAuditActor`, is an Akka.NET `ReceiveActor` deployed as a `ClusterSingletonManager`-managed singleton on the active central node.

Telemetry reaches central through the shared `CachedCallTelemetry` packet (see [Audit Log](./AuditLog.md)), which carries both an `AuditEvent` for the `AuditLog` table and a `SiteCallOperational` snapshot for the `SiteCalls` table. The `AuditLogIngestActor` (Audit Log #23) writes both in a single MS SQL transaction when it receives an `IngestCachedTelemetryCommand`; it then tells `SiteCallAuditActor` an `UpsertSiteCallCommand` so the `SiteCalls` row is always consistent with the paired audit row. The `SiteCallAuditActor` is therefore an ingest target, not a transport; it never constructs telemetry packets and never decides what gets delivered.

Sites remain the source of truth. `Tracking.Status()` is answered site-locally from the site SQLite tracking store; the central `SiteCalls` row is what the Central UI Site Calls page reads — it may lag by one telemetry cycle.

## Key Concepts

### Mirror, not dispatcher

The Notification Outbox (#21) ingests notifications and dispatches them centrally. Site Call Audit is different: the Store-and-Forward Engine on each site performs all retry and delivery attempts against the site's locally reachable external systems and databases. Central cannot reach those systems. The `SiteCalls` table is read-only from central's perspective — operators can view it and trigger Retry/Discard actions, but the actions are forwarded to the site; central never mutates the mirror row directly.

### Monotonic upsert idempotency

The `SiteCalls` table holds one row per `TrackedOperationId`. `ISiteCallAuditRepository.UpsertAsync` implements insert-if-not-exists followed by a conditional update that only applies when the incoming status has a strictly higher rank than the stored status:

```text
Submitted=0, Forwarded=1, Attempted=2, Skipped=2,
Delivered=3, Failed=3, Parked=3, Discarded=3
```

Out-of-order telemetry, duplicate gRPC packets, and future reconciliation pulls therefore all feed the same writer safely — status never regresses.

### Stuck calls

A non-terminal row (`TerminalAtUtc IS NULL`) created before `now − StuckAgeThreshold` (default 10 minutes) is classified as stuck. Stuck is display-only: surfaced as a `StuckCount` KPI and a row badge in the UI. There is no escalation or alerting.

## Architecture

### `SiteCallAuditActor`

`SiteCallAuditActor` is a `ReceiveActor` with two constructors:

- **Production**: receives `IServiceProvider` and opens a fresh DI scope per message to resolve the scoped EF Core `ISiteCallAuditRepository`. This mirrors `AuditLogIngestActor`'s pattern — a long-lived singleton cannot hold a scope across messages.
- **Test**: receives a concrete `ISiteCallAuditRepository` and reuses it across all messages, allowing integration tests to run against a real MS SQL fixture without DI scaffolding.

The actor catches all repository exceptions in its write path and replies `Accepted=false` without rethrowing, keeping the singleton alive across storage faults. The `SupervisorStrategy` override (one-for-one, `maxNrOfRetries: 0`) governs any future children — the actor currently has none.

```csharp
private async Task OnUpsertAsync(UpsertSiteCallCommand cmd)
{
    var replyTo = Sender;
    var id = cmd.SiteCall.TrackedOperationId;

    IServiceScope? scope = null;
    ISiteCallAuditRepository repository;
    if (_injectedRepository is not null)
    {
        repository = _injectedRepository;
    }
    else
    {
        scope = _serviceProvider!.CreateScope();
        repository = scope.ServiceProvider.GetRequiredService<ISiteCallAuditRepository>();
    }

    try
    {
        var siteCall = cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow };
        await repository.UpsertAsync(siteCall).ConfigureAwait(false);
        replyTo.Tell(new UpsertSiteCallReply(id, Accepted: true));
    }
    catch (Exception ex)
    {
        _logger.LogError(ex, "SiteCallAudit upsert failed for {TrackedOperationId}", id);
        replyTo.Tell(new UpsertSiteCallReply(id, Accepted: false));
    }
    finally
    {
        scope?.Dispose();
    }
}
```

`IngestedAtUtc` is always stamped at central-side persist time, not carried from the site. This ensures the column reflects when central last processed the row, not when the site emitted it.

### Message handlers

| Message | Direction | Handler |
|---|---|---|
| `UpsertSiteCallCommand` | Central ingest → actor | `OnUpsertAsync` — scope-per-message upsert |
| `SiteCallQueryRequest` | Central UI → actor | `HandleQuery` — keyset-paged query, max 200 rows |
| `SiteCallDetailRequest` | Central UI → actor | `HandleDetail` — single-row full detail |
| `SiteCallKpiRequest` | Central UI → actor | `HandleKpi` — global KPI snapshot |
| `PerSiteSiteCallKpiRequest` | Central UI → actor | `HandlePerSiteKpi` — per-site KPI list |
| `RetrySiteCallRequest` | Central UI → actor | `HandleRetrySiteCall` — relay to site |
| `DiscardSiteCallRequest` | Central UI → actor | `HandleDiscardSiteCall` — relay to site |
| `RegisterCentralCommunication` | Host → actor | Wires the `CentralCommunicationActor` transport |

All read handlers capture `Sender` before the first `await` and use `PipeTo` to return the response — the standard Akka pattern for async ask-reply handlers.

### `SiteCalls` table

One row per `TrackedOperationId` in the central MS SQL configuration database. Key columns:

| Column | Type | Notes |
|---|---|---|
| `TrackedOperationId` | `uniqueidentifier` | PK; GUID stamped site-side at call time |
| `Channel` | `varchar` | `"ApiOutbound"` or `"DbOutbound"` |
| `Target` | `varchar` | Human-readable target, e.g. `"ERP.GetOrder"` |
| `SourceSite` | `varchar` | Site that issued the call |
| `SourceNode` | `varchar NULL` | `node-a` / `node-b`; nullable for retired nodes |
| `Status` | `varchar` | String form of `AuditStatus`; monotonic |
| `RetryCount` | `int` | Dispatch attempts so far |
| `LastError` | `varchar NULL` | Most recent error text |
| `HttpStatus` | `int NULL` | Last HTTP response code (API calls only) |
| `CreatedAtUtc` | `datetime2` | First submit timestamp from site |
| `UpdatedAtUtc` | `datetime2` | Latest site-side status mutation |
| `TerminalAtUtc` | `datetime2 NULL` | Set when status reaches a terminal rank |
| `IngestedAtUtc` | `datetime2` | Central-side stamp, updated on every upsert |

Unlike the `AuditLog` table, `SiteCalls` is a standard non-partitioned table on `[PRIMARY]` holding mutable operational state. No DB-role restriction applies; it is updated in place by the upsert.

### Status lifecycle

```text
Submitted → Forwarded → Attempted ──→ Delivered   (terminal, success)
                                  └──→ Parked      (non-terminal, awaiting operator action)
                                  └──→ Failed      (terminal, permanent failure)
                                  └──→ Discarded   (terminal, operator-initiated on Parked row)
```

`Failed` rows are not operator-actionable — a permanent failure (e.g. HTTP 4xx) would fail again, and the error was already returned synchronously to the calling script. Only `Parked` rows support Retry and Discard.

### Retry / Discard relay

The `CentralCommunicationActor` is wired into `SiteCallAuditActor` after both actors exist, via `RegisterCentralCommunication`. Until registration completes, any relay request receives an immediate `SiteCallRelayOutcome.SiteUnreachable` outcome — there is genuinely no route to any site.

When `_centralCommunication` is set, the relay handler wraps the command in a `SiteEnvelope` keyed to `SourceSite` and Asks the `CentralCommunicationActor`, which routes it over the per-site `ClusterClient`:

```csharp
private void HandleRetrySiteCall(RetrySiteCallRequest request)
{
    var sender = Sender;

    if (_centralCommunication is null)
    {
        sender.Tell(UnreachableRetry(request.CorrelationId));
        return;
    }

    var relay = new RetryParkedOperation(
        request.CorrelationId, new TrackedOperationId(request.TrackedOperationId));
    var envelope = new SiteEnvelope(request.SourceSite, relay);

    _centralCommunication.Ask<ParkedOperationActionAck>(envelope, _options.RelayTimeout)
        .PipeTo(
            sender,
            success: ack => MapRetryResponse(request.CorrelationId, ack),
            failure: ex => MapRetryFailure(request.CorrelationId, request.SourceSite, ex));
}
```

The site applies `RetryParkedOperation` / `DiscardParkedOperation` to its own Store-and-Forward buffer and returns a `ParkedOperationActionAck`. The actor maps the ack to a `SiteCallRelayOutcome`:

| Ack | Outcome |
|---|---|
| `Applied=true` | `Applied` |
| `Applied=false`, no error | `NotParked` — site had nothing to do |
| `Applied=false`, error present | `OperationFailed` — site faulted |
| Ask timeout / no route | `SiteUnreachable` |

`SiteUnreachable` is distinguished from `OperationFailed` because central is a mirror — a relay that never reached the site is a transient transport condition, not an operation failure. The UI surfaces "site unreachable" so operators know to retry once the site is back online.

The corrected cached-call state flows back to central via the normal telemetry path after the site applies the action. Central never writes the `SiteCalls` row to reflect a relay outcome directly.

### KPI computation

KPIs are computed point-in-time against the `SiteCalls` table by `ISiteCallAuditRepository.ComputeKpisAsync` and `ComputePerSiteKpisAsync`. All aggregation is server-side; no rows are materialised. The actor derives the cutoff timestamps from `SiteCallAuditOptions` before calling the repository:

```csharp
private void HandleKpi(SiteCallKpiRequest request)
{
    var sender = Sender;
    var now = DateTime.UtcNow;
    var stuckCutoff = now - _options.StuckAgeThreshold;
    var intervalSince = now - _options.KpiInterval;

    KpiAsync(request.CorrelationId, stuckCutoff, intervalSince).PipeTo(
        sender,
        success: response => response,
        failure: ex => new SiteCallKpiResponse(
            request.CorrelationId, Success: false,
            ErrorMessage: ex.GetBaseException().Message,
            BufferedCount: 0, ParkedCount: 0, FailedLastInterval: 0,
            DeliveredLastInterval: 0, OldestPendingAge: null, StuckCount: 0));
}
```

The `SiteCallKpiSnapshot` shape mirrors `NotificationKpiSnapshot` so the Central UI dashboard can reuse the same tile layout for both components.

## Usage

The actor accepts only Akka messages — there is no public API beyond the message protocol defined in Commons. The Central UI's Site Calls page sends `SiteCallQueryRequest` / `SiteCallKpiRequest` / `PerSiteSiteCallKpiRequest` / `SiteCallDetailRequest` through `CommunicationService`, which Asks the singleton and awaits `SiteCallQueryResponse` / `SiteCallKpiResponse` / `PerSiteSiteCallKpiResponse` / `SiteCallDetailResponse`.

The ingest path is driven by `AuditLogIngestActor.OnCachedTelemetryAsync`, which tells an `UpsertSiteCallCommand` after committing the dual-write transaction. The `SiteCallAuditActor` does not need to coordinate with `AuditLogIngestActor` — the transaction guarantees the `AuditLog` row always precedes the upsert command.

Registration is via `ServiceCollectionExtensions.AddSiteCallAudit`, which binds `SiteCallAuditOptions` from the `ScadaBridge:SiteCallAudit` configuration section. The actor `Props` and the `ClusterSingletonManager` registration are wired in the Host's central-role composition.

## Configuration

`SiteCallAuditOptions` is bound from the `ScadaBridge:SiteCallAudit` section.

| Key | Default | Description |
|---|---|---|
| `StuckAgeThreshold` | `00:10:00` | Age past which a non-terminal row is counted as stuck. Display-only; no escalation. |
| `KpiInterval` | `00:01:00` | Trailing window for `DeliveredLastInterval` and `FailedLastInterval` KPIs. |
| `RelayTimeout` | `00:00:10` | Ask timeout for the central→site Retry/Discard relay. Must be less than `CommunicationOptions.QueryTimeout` (default 30 s) so the inner relay times out first and returns the distinct `SiteUnreachable` outcome. |

## Dependencies & Interactions

- [Commons (#16)](./Commons.md) — owns `SiteCall`, `SiteCallOperational`, `TrackedOperationId`, `SiteCallAuditOptions`-adjacent types (`SiteCallKpiSnapshot`, `SiteCallSiteKpiSnapshot`, `SiteCallQueryFilter`, `SiteCallPaging`), all message contracts (`UpsertSiteCallCommand`, `UpsertSiteCallReply`, `SiteCallQueryRequest/Response`, `SiteCallDetailRequest/Response`, `SiteCallKpiRequest/Response`, `PerSiteSiteCallKpiRequest/Response`, `RetrySiteCallRequest/Response`, `DiscardSiteCallRequest/Response`, `SiteCallRelayOutcome`), and the `ISiteCallAuditRepository` interface.
- [Configuration Database (#17)](./ConfigurationDatabase.md) — implements `ISiteCallAuditRepository` against the central `dbo.SiteCalls` table. Central hosts must call `AddConfigurationDatabase` for the actor to resolve its scoped repository.
- [Audit Log (#23)](./AuditLog.md) — shares the `CachedCallTelemetry` packet. `AuditLogIngestActor.OnCachedTelemetryAsync` writes the `AuditLog` row and the `SiteCalls` upsert in a single MS SQL transaction, then tells `UpsertSiteCallCommand` to this actor. The two components coordinate via message-passing, not a shared service.
- [Central–Site Communication (#5)](./Communication.md) — the `CentralCommunicationActor` is the transport the relay handlers use. It is registered via `RegisterCentralCommunication` by the Host after both actors are running. `CommunicationService` also provides the async wrappers (`RetrySiteCallAsync`, `DiscardSiteCallAsync`) that the Central UI calls; those methods Ask the `SiteCallAuditActor` with the outer `CommunicationOptions.QueryTimeout`.
- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — site-side executor of `RetryParkedOperation` and `DiscardParkedOperation`. The site's S&F buffer is the source of truth for parked cached calls; it emits updated telemetry after applying an operator action.
- [Health Monitoring (#11)](./HealthMonitoring.md) — consumes `SiteCallKpiResponse` / `PerSiteSiteCallKpiResponse` to surface buffered count, parked count, stuck count, and throughput KPI tiles on the health dashboard alongside the Notification Outbox tiles.
- [Central UI (#9)](./CentralUI.md) — the Site Calls page queries this actor for the paginated list, detail modal, and KPIs; it issues Retry/Discard actions that flow through `CommunicationService` to the relay handlers here.
- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — hosts the `SiteCallAuditActor` singleton with active/standby failover via `ClusterSingletonManager`.

## Troubleshooting

### Relay returns `SiteUnreachable`

The `CentralCommunicationActor` could not route the command to the site — the site is offline, the `ClusterClient` route has not yet resolved, or the relay timed out waiting for a `ParkedOperationActionAck`. The `_options.RelayTimeout` (default 10 s) is the inner Ask timeout. The action was NOT applied. Retry the operator action once the site is back online; the `SiteCalls` mirror row will self-correct via telemetry after the site applies it.

### Relay returns `NotParked`

The site was reached but reported no parked row for the given `TrackedOperationId`. The call was likely already delivered, discarded, or transitioned out of `Parked` status between the operator clicking Retry/Discard and the relay arriving. No action is required; the telemetry will update the mirror row shortly.

### Upsert replied `Accepted=false`

The actor caught a repository exception and replied false to the caller without rethrowing. The central singleton remains alive. Check the structured log for a `SiteCallAudit upsert failed for {TrackedOperationId}` error with the exception detail. If the MS SQL configuration database is temporarily unavailable, the telemetry sender will retry on its next cycle (the at-least-once gRPC transport) or the future reconciliation pull will backfill the row.

### `SiteCalls` rows not appearing

Ingest flows through `AuditLogIngestActor.OnCachedTelemetryAsync`, which writes the `AuditLog` row and `SiteCalls` upsert in one transaction before telling `UpsertSiteCallCommand`. If that transaction fails, neither row is written. Check `AuditLog` ingest health first — a missing `AuditLog` row for the same `TrackedOperationId` confirms the telemetry never reached central, not that the `SiteCalls` upsert failed in isolation.

## Related Documentation

- [Site Call Audit design specification](../requirements/Component-SiteCallAudit.md)
- [Audit Log](./AuditLog.md)
- [Notification Outbox](./NotificationOutbox.md)
- [Configuration Database](./ConfigurationDatabase.md)
- [Central–Site Communication](./Communication.md)
- [Store-and-Forward Engine](./StoreAndForward.md)
- [Commons](./Commons.md)
- [Health Monitoring](./HealthMonitoring.md)