9175b0c013
NotificationService (Notify.Send returns string not NotificationId; MaxConcurrentConnections unenforced; AddHttpClient), NotificationOutbox (one Attempted row always, terminal row only on terminal status), SiteCallAudit (direct dual-write, no Tell; KPI tiles consumed by CentralUI), HealthMonitoring (CentralOfflineTimeout 180s = 6x ReportInterval; HealthReportSender gates on IsActiveNode), SiteEventLogging (active-node purge seam not wired; runs on both nodes), InboundAPI (whole System.Diagnostics namespace forbidden).
254 lines
18 KiB
Markdown
254 lines
18 KiB
Markdown
# Site Call Audit
|
||
|
||
Site Call Audit (#22) is a central-only observability component that maintains an eventually-consistent mirror of every cached call — `ExternalSystem.CachedCall()` and `Database.CachedWrite()` — issued by site scripts. It ingests lifecycle telemetry from sites into the central `SiteCalls` MS SQL table, computes point-in-time KPIs, and relays operator Retry/Discard actions back to the owning site. It does not deliver anything: cached-call execution stays entirely site-local.
|
||
|
||
## Overview
|
||
|
||
The component lives in `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/` and runs only on the central cluster. Its single class, `SiteCallAuditActor`, is an Akka.NET `ReceiveActor` deployed as a `ClusterSingletonManager`-managed singleton on the active central node.
|
||
|
||
Telemetry reaches central through the shared `CachedCallTelemetry` packet (see [Audit Log](./AuditLog.md)), which carries both an `AuditEvent` for the `AuditLog` table and a `SiteCallOperational` snapshot for the `SiteCalls` table. The `AuditLogIngestActor` (Audit Log #23) writes both rows directly inside a single MS SQL transaction when it receives an `IngestCachedTelemetryCommand`: `auditRepo.InsertIfNotExistsAsync(...)` followed by `siteCallRepo.UpsertAsync(...)`, committed together or rolled back together. There is no Tell to `SiteCallAuditActor` on this path; the `UpsertSiteCallCommand` / `OnUpsertAsync` handler exists for other callers, not the cached-telemetry hot path. The `SiteCallAuditActor` is therefore an ingest target, not a transport; it never constructs telemetry packets and never decides what gets delivered.
|
||
|
||
Sites remain the source of truth. `Tracking.Status()` is answered site-locally from the site SQLite tracking store; the central `SiteCalls` row is what the Central UI Site Calls page reads — it may lag by one telemetry cycle.
|
||
|
||
## Key Concepts
|
||
|
||
### Mirror, not dispatcher
|
||
|
||
The Notification Outbox (#21) ingests notifications and dispatches them centrally. Site Call Audit is different: the Store-and-Forward Engine on each site performs all retry and delivery attempts against the site's locally reachable external systems and databases. Central cannot reach those systems. The `SiteCalls` table is read-only from central's perspective — operators can view it and trigger Retry/Discard actions, but the actions are forwarded to the site; central never mutates the mirror row directly.
|
||
|
||
### Monotonic upsert idempotency
|
||
|
||
The `SiteCalls` table holds one row per `TrackedOperationId`. `ISiteCallAuditRepository.UpsertAsync` implements insert-if-not-exists followed by a conditional update that only applies when the incoming status has a strictly higher rank than the stored status:
|
||
|
||
```text
|
||
Submitted=0, Forwarded=1, Attempted=2, Skipped=2,
|
||
Delivered=3, Failed=3, Parked=3, Discarded=3
|
||
```
|
||
|
||
Out-of-order telemetry, duplicate gRPC packets, and future reconciliation pulls therefore all feed the same writer safely — status never regresses.
|
||
|
||
### Stuck calls
|
||
|
||
A non-terminal row (`TerminalAtUtc IS NULL`) created before `now − StuckAgeThreshold` (default 10 minutes) is classified as stuck. Stuck is display-only: surfaced as a `StuckCount` KPI and a row badge in the UI. There is no escalation or alerting.
|
||
|
||
## Architecture
|
||
|
||
### `SiteCallAuditActor`
|
||
|
||
`SiteCallAuditActor` is a `ReceiveActor` with two constructors:
|
||
|
||
- **Production**: receives `IServiceProvider` and opens a fresh DI scope per message to resolve the scoped EF Core `ISiteCallAuditRepository`. This mirrors `AuditLogIngestActor`'s pattern — a long-lived singleton cannot hold a scope across messages.
|
||
- **Test**: receives a concrete `ISiteCallAuditRepository` and reuses it across all messages, allowing integration tests to run against a real MS SQL fixture without DI scaffolding.
|
||
|
||
The actor catches all repository exceptions in its write path and replies `Accepted=false` without rethrowing, keeping the singleton alive across storage faults. The `SupervisorStrategy` override (one-for-one, `maxNrOfRetries: 0`) governs any future children — the actor currently has none.
|
||
|
||
```csharp
|
||
private async Task OnUpsertAsync(UpsertSiteCallCommand cmd)
|
||
{
|
||
var replyTo = Sender;
|
||
var id = cmd.SiteCall.TrackedOperationId;
|
||
|
||
IServiceScope? scope = null;
|
||
ISiteCallAuditRepository repository;
|
||
if (_injectedRepository is not null)
|
||
{
|
||
repository = _injectedRepository;
|
||
}
|
||
else
|
||
{
|
||
scope = _serviceProvider!.CreateScope();
|
||
repository = scope.ServiceProvider.GetRequiredService<ISiteCallAuditRepository>();
|
||
}
|
||
|
||
try
|
||
{
|
||
var siteCall = cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow };
|
||
await repository.UpsertAsync(siteCall).ConfigureAwait(false);
|
||
replyTo.Tell(new UpsertSiteCallReply(id, Accepted: true));
|
||
}
|
||
catch (Exception ex)
|
||
{
|
||
_logger.LogError(ex, "SiteCallAudit upsert failed for {TrackedOperationId}", id);
|
||
replyTo.Tell(new UpsertSiteCallReply(id, Accepted: false));
|
||
}
|
||
finally
|
||
{
|
||
scope?.Dispose();
|
||
}
|
||
}
|
||
```
|
||
|
||
`IngestedAtUtc` is always stamped at central-side persist time, not carried from the site. This ensures the column reflects when central last processed the row, not when the site emitted it.
|
||
|
||
### Message handlers
|
||
|
||
| Message | Direction | Handler |
|
||
|---|---|---|
|
||
| `UpsertSiteCallCommand` | Central ingest → actor | `OnUpsertAsync` — scope-per-message upsert |
|
||
| `SiteCallQueryRequest` | Central UI → actor | `HandleQuery` — keyset-paged query, max 200 rows |
|
||
| `SiteCallDetailRequest` | Central UI → actor | `HandleDetail` — single-row full detail |
|
||
| `SiteCallKpiRequest` | Central UI → actor | `HandleKpi` — global KPI snapshot |
|
||
| `PerSiteSiteCallKpiRequest` | Central UI → actor | `HandlePerSiteKpi` — per-site KPI list |
|
||
| `RetrySiteCallRequest` | Central UI → actor | `HandleRetrySiteCall` — relay to site |
|
||
| `DiscardSiteCallRequest` | Central UI → actor | `HandleDiscardSiteCall` — relay to site |
|
||
| `RegisterCentralCommunication` | Host → actor | Wires the `CentralCommunicationActor` transport |
|
||
|
||
All read handlers capture `Sender` before the first `await` and use `PipeTo` to return the response — the standard Akka pattern for async ask-reply handlers.
|
||
|
||
### `SiteCalls` table
|
||
|
||
One row per `TrackedOperationId` in the central MS SQL configuration database. Key columns:
|
||
|
||
| Column | Type | Notes |
|
||
|---|---|---|
|
||
| `TrackedOperationId` | `uniqueidentifier` | PK; GUID stamped site-side at call time |
|
||
| `Channel` | `varchar` | `"ApiOutbound"` or `"DbOutbound"` |
|
||
| `Target` | `varchar` | Human-readable target, e.g. `"ERP.GetOrder"` |
|
||
| `SourceSite` | `varchar` | Site that issued the call |
|
||
| `SourceNode` | `varchar NULL` | `node-a` / `node-b`; nullable for retired nodes |
|
||
| `Status` | `varchar` | String form of `AuditStatus`; monotonic |
|
||
| `RetryCount` | `int` | Dispatch attempts so far |
|
||
| `LastError` | `varchar NULL` | Most recent error text |
|
||
| `HttpStatus` | `int NULL` | Last HTTP response code (API calls only) |
|
||
| `CreatedAtUtc` | `datetime2` | First submit timestamp from site |
|
||
| `UpdatedAtUtc` | `datetime2` | Latest site-side status mutation |
|
||
| `TerminalAtUtc` | `datetime2 NULL` | Set when status reaches a terminal rank |
|
||
| `IngestedAtUtc` | `datetime2` | Central-side stamp, updated on every upsert |
|
||
|
||
Unlike the `AuditLog` table, `SiteCalls` is a standard non-partitioned table on `[PRIMARY]` holding mutable operational state. No DB-role restriction applies; it is updated in place by the upsert.
|
||
|
||
### Status lifecycle
|
||
|
||
```text
|
||
Submitted → Forwarded → Attempted ──→ Delivered (terminal, success)
|
||
└──→ Parked (non-terminal, awaiting operator action)
|
||
└──→ Failed (terminal, permanent failure)
|
||
└──→ Discarded (terminal, operator-initiated on Parked row)
|
||
```
|
||
|
||
`Failed` rows are not operator-actionable — a permanent failure (e.g. HTTP 4xx) would fail again, and the error was already returned synchronously to the calling script. Only `Parked` rows support Retry and Discard.
|
||
|
||
### Retry / Discard relay
|
||
|
||
The `CentralCommunicationActor` is wired into `SiteCallAuditActor` after both actors exist, via `RegisterCentralCommunication`. Until registration completes, any relay request receives an immediate `SiteCallRelayOutcome.SiteUnreachable` outcome — there is genuinely no route to any site.
|
||
|
||
When `_centralCommunication` is set, the relay handler wraps the command in a `SiteEnvelope` keyed to `SourceSite` and Asks the `CentralCommunicationActor`, which routes it over the per-site `ClusterClient`:
|
||
|
||
```csharp
|
||
private void HandleRetrySiteCall(RetrySiteCallRequest request)
|
||
{
|
||
var sender = Sender;
|
||
|
||
if (_centralCommunication is null)
|
||
{
|
||
sender.Tell(UnreachableRetry(request.CorrelationId));
|
||
return;
|
||
}
|
||
|
||
var relay = new RetryParkedOperation(
|
||
request.CorrelationId, new TrackedOperationId(request.TrackedOperationId));
|
||
var envelope = new SiteEnvelope(request.SourceSite, relay);
|
||
|
||
_centralCommunication.Ask<ParkedOperationActionAck>(envelope, _options.RelayTimeout)
|
||
.PipeTo(
|
||
sender,
|
||
success: ack => MapRetryResponse(request.CorrelationId, ack),
|
||
failure: ex => MapRetryFailure(request.CorrelationId, request.SourceSite, ex));
|
||
}
|
||
```
|
||
|
||
The site applies `RetryParkedOperation` / `DiscardParkedOperation` to its own Store-and-Forward buffer and returns a `ParkedOperationActionAck`. The actor maps the ack to a `SiteCallRelayOutcome`:
|
||
|
||
| Ack | Outcome |
|
||
|---|---|
|
||
| `Applied=true` | `Applied` |
|
||
| `Applied=false`, no error | `NotParked` — site had nothing to do |
|
||
| `Applied=false`, error present | `OperationFailed` — site faulted |
|
||
| Ask timeout / no route | `SiteUnreachable` |
|
||
|
||
`SiteUnreachable` is distinguished from `OperationFailed` because central is a mirror — a relay that never reached the site is a transient transport condition, not an operation failure. The UI surfaces "site unreachable" so operators know to retry once the site is back online.
|
||
|
||
The corrected cached-call state flows back to central via the normal telemetry path after the site applies the action. Central never writes the `SiteCalls` row to reflect a relay outcome directly.
|
||
|
||
### KPI computation
|
||
|
||
KPIs are computed point-in-time against the `SiteCalls` table by `ISiteCallAuditRepository.ComputeKpisAsync` and `ComputePerSiteKpisAsync`. All aggregation is server-side; no rows are materialised. The actor derives the cutoff timestamps from `SiteCallAuditOptions` before calling the repository:
|
||
|
||
```csharp
|
||
private void HandleKpi(SiteCallKpiRequest request)
|
||
{
|
||
var sender = Sender;
|
||
var now = DateTime.UtcNow;
|
||
var stuckCutoff = now - _options.StuckAgeThreshold;
|
||
var intervalSince = now - _options.KpiInterval;
|
||
|
||
KpiAsync(request.CorrelationId, stuckCutoff, intervalSince).PipeTo(
|
||
sender,
|
||
success: response => response,
|
||
failure: ex => new SiteCallKpiResponse(
|
||
request.CorrelationId, Success: false,
|
||
ErrorMessage: ex.GetBaseException().Message,
|
||
BufferedCount: 0, ParkedCount: 0, FailedLastInterval: 0,
|
||
DeliveredLastInterval: 0, OldestPendingAge: null, StuckCount: 0));
|
||
}
|
||
```
|
||
|
||
`SiteCallKpiSnapshot` is structurally similar to `NotificationKpiSnapshot` so the Central UI dashboard can reuse the same tile layout for both components. The shapes differ: `SiteCallKpiSnapshot` carries 6 fields (`BufferedCount`, `ParkedCount`, `FailedLastInterval`, `DeliveredLastInterval`, `OldestPendingAge`, `StuckCount`), while `NotificationKpiSnapshot` carries 5 (`QueueDepth`, `StuckCount`, `ParkedCount`, `DeliveredLastInterval`, `OldestPendingAge`) — `BufferedCount` replaces `QueueDepth` and `FailedLastInterval` is an addition with no counterpart in the notification shape.
|
||
|
||
## Usage
|
||
|
||
The actor accepts only Akka messages — there is no public API beyond the message protocol defined in Commons. The Central UI's Site Calls page sends `SiteCallQueryRequest` / `SiteCallKpiRequest` / `PerSiteSiteCallKpiRequest` / `SiteCallDetailRequest` through `CommunicationService`, which Asks the singleton and awaits `SiteCallQueryResponse` / `SiteCallKpiResponse` / `PerSiteSiteCallKpiResponse` / `SiteCallDetailResponse`.
|
||
|
||
The ingest path is driven by `AuditLogIngestActor.OnCachedTelemetryAsync`, which writes both the `AuditLog` row and the `SiteCalls` upsert directly inside a single EF transaction — no message is sent to `SiteCallAuditActor` on this path. Both writes succeed or both roll back; neither component needs to coordinate with the other after the transaction commits.
|
||
|
||
Registration is via `ServiceCollectionExtensions.AddSiteCallAudit`, which binds `SiteCallAuditOptions` from the `ScadaBridge:SiteCallAudit` configuration section. The actor `Props` and the `ClusterSingletonManager` registration are wired in the Host's central-role composition.
|
||
|
||
## Configuration
|
||
|
||
`SiteCallAuditOptions` is bound from the `ScadaBridge:SiteCallAudit` section.
|
||
|
||
| Key | Default | Description |
|
||
|---|---|---|
|
||
| `StuckAgeThreshold` | `00:10:00` | Age past which a non-terminal row is counted as stuck. Display-only; no escalation. |
|
||
| `KpiInterval` | `00:01:00` | Trailing window for `DeliveredLastInterval` and `FailedLastInterval` KPIs. |
|
||
| `RelayTimeout` | `00:00:10` | Ask timeout for the central→site Retry/Discard relay. Must be less than `CommunicationOptions.QueryTimeout` (default 30 s) so the inner relay times out first and returns the distinct `SiteUnreachable` outcome. |
|
||
|
||
## Dependencies & Interactions
|
||
|
||
- [Commons (#16)](./Commons.md) — owns `SiteCall`, `SiteCallOperational`, `TrackedOperationId`, `SiteCallAuditOptions`-adjacent types (`SiteCallKpiSnapshot`, `SiteCallSiteKpiSnapshot`, `SiteCallQueryFilter`, `SiteCallPaging`), all message contracts (`UpsertSiteCallCommand`, `UpsertSiteCallReply`, `SiteCallQueryRequest/Response`, `SiteCallDetailRequest/Response`, `SiteCallKpiRequest/Response`, `PerSiteSiteCallKpiRequest/Response`, `RetrySiteCallRequest/Response`, `DiscardSiteCallRequest/Response`, `SiteCallRelayOutcome`), and the `ISiteCallAuditRepository` interface.
|
||
- [Configuration Database (#17)](./ConfigurationDatabase.md) — implements `ISiteCallAuditRepository` against the central `dbo.SiteCalls` table. Central hosts must call `AddConfigurationDatabase` for the actor to resolve its scoped repository.
|
||
- [Audit Log (#23)](./AuditLog.md) — shares the `CachedCallTelemetry` packet. `AuditLogIngestActor.OnCachedTelemetryAsync` writes the `AuditLog` row and the `SiteCalls` upsert directly in a single MS SQL transaction; it does not send a message to this actor on that path. The two components share a database transaction, not a message exchange.
|
||
- [Central–Site Communication (#5)](./Communication.md) — the `CentralCommunicationActor` is the transport the relay handlers use. It is registered via `RegisterCentralCommunication` by the Host after both actors are running. `CommunicationService` also provides the async wrappers (`RetrySiteCallAsync`, `DiscardSiteCallAsync`) that the Central UI calls; those methods Ask the `SiteCallAuditActor` with the outer `CommunicationOptions.QueryTimeout`.
|
||
- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — site-side executor of `RetryParkedOperation` and `DiscardParkedOperation`. The site's S&F buffer is the source of truth for parked cached calls; it emits updated telemetry after applying an operator action.
|
||
- [Central UI (#9)](./CentralUI.md) — the `Health.razor` page (`SiteCallKpiTiles` component) consumes `SiteCallKpiResponse` to surface buffered count, parked count, stuck count, and throughput KPI tiles on the health dashboard alongside the Notification Outbox tiles; the Site Calls page queries this actor for the paginated list, detail modal, and KPIs and issues Retry/Discard actions that flow through `CommunicationService` to the relay handlers here.
|
||
- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — hosts the `SiteCallAuditActor` singleton with active/standby failover via `ClusterSingletonManager`.
|
||
|
||
## Troubleshooting
|
||
|
||
### Relay returns `SiteUnreachable`
|
||
|
||
The `CentralCommunicationActor` could not route the command to the site — the site is offline, the `ClusterClient` route has not yet resolved, or the relay timed out waiting for a `ParkedOperationActionAck`. The `_options.RelayTimeout` (default 10 s) is the inner Ask timeout. The action was NOT applied. Retry the operator action once the site is back online; the `SiteCalls` mirror row will self-correct via telemetry after the site applies it.
|
||
|
||
### Relay returns `NotParked`
|
||
|
||
The site was reached but reported no parked row for the given `TrackedOperationId`. The call was likely already delivered, discarded, or transitioned out of `Parked` status between the operator clicking Retry/Discard and the relay arriving. No action is required; the telemetry will update the mirror row shortly.
|
||
|
||
### Upsert replied `Accepted=false`
|
||
|
||
The actor caught a repository exception and replied false to the caller without rethrowing. The central singleton remains alive. Check the structured log for a `SiteCallAudit upsert failed for {TrackedOperationId}` error with the exception detail. If the MS SQL configuration database is temporarily unavailable, the telemetry sender will retry on its next cycle (the at-least-once gRPC transport) or the future reconciliation pull will backfill the row.
|
||
|
||
### `SiteCalls` rows not appearing
|
||
|
||
Ingest flows through `AuditLogIngestActor.OnCachedTelemetryAsync`, which writes the `AuditLog` row and `SiteCalls` upsert directly in one EF transaction. If that transaction fails, neither row is written. Check `AuditLog` ingest health first — a missing `AuditLog` row for the same `TrackedOperationId` confirms the telemetry never reached central, not that the `SiteCalls` upsert failed in isolation.
|
||
|
||
## Related Documentation
|
||
|
||
- [Site Call Audit design specification](../requirements/Component-SiteCallAudit.md)
|
||
- [Audit Log](./AuditLog.md)
|
||
- [Notification Outbox](./NotificationOutbox.md)
|
||
- [Configuration Database](./ConfigurationDatabase.md)
|
||
- [Central–Site Communication](./Communication.md)
|
||
- [Store-and-Forward Engine](./StoreAndForward.md)
|
||
- [Commons](./Commons.md)
|
||
- [Health Monitoring](./HealthMonitoring.md)
|