NotificationService (Notify.Send returns string not NotificationId; MaxConcurrentConnections unenforced; AddHttpClient), NotificationOutbox (one Attempted row always, terminal row only on terminal status), SiteCallAudit (direct dual-write, no Tell; KPI tiles consumed by CentralUI), HealthMonitoring (CentralOfflineTimeout 180s = 6x ReportInterval; HealthReportSender gates on IsActiveNode), SiteEventLogging (active-node purge seam not wired; runs on both nodes), InboundAPI (whole System.Diagnostics namespace forbidden).
18 KiB
Site Call Audit
Site Call Audit (#22) is a central-only observability component that maintains an eventually-consistent mirror of every cached call — ExternalSystem.CachedCall() and Database.CachedWrite() — issued by site scripts. It ingests lifecycle telemetry from sites into the central SiteCalls MS SQL table, computes point-in-time KPIs, and relays operator Retry/Discard actions back to the owning site. It does not deliver anything: cached-call execution stays entirely site-local.
Overview
The component lives in src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/ and runs only on the central cluster. Its single class, SiteCallAuditActor, is an Akka.NET ReceiveActor deployed as a ClusterSingletonManager-managed singleton on the active central node.
Telemetry reaches central through the shared CachedCallTelemetry packet (see Audit Log), which carries both an AuditEvent for the AuditLog table and a SiteCallOperational snapshot for the SiteCalls table. The AuditLogIngestActor (Audit Log #23) writes both rows directly inside a single MS SQL transaction when it receives an IngestCachedTelemetryCommand: auditRepo.InsertIfNotExistsAsync(...) followed by siteCallRepo.UpsertAsync(...), committed together or rolled back together. There is no Tell to SiteCallAuditActor on this path; the UpsertSiteCallCommand / OnUpsertAsync handler exists for other callers, not the cached-telemetry hot path. The SiteCallAuditActor is therefore an ingest target, not a transport; it never constructs telemetry packets and never decides what gets delivered.
Sites remain the source of truth. Tracking.Status() is answered site-locally from the site SQLite tracking store; the central SiteCalls row is what the Central UI Site Calls page reads — it may lag by one telemetry cycle.
Key Concepts
Mirror, not dispatcher
The Notification Outbox (#21) ingests notifications and dispatches them centrally. Site Call Audit is different: the Store-and-Forward Engine on each site performs all retry and delivery attempts against the site's locally reachable external systems and databases. Central cannot reach those systems. The SiteCalls table is read-only from central's perspective — operators can view it and trigger Retry/Discard actions, but the actions are forwarded to the site; central never mutates the mirror row directly.
Monotonic upsert idempotency
The SiteCalls table holds one row per TrackedOperationId. ISiteCallAuditRepository.UpsertAsync implements insert-if-not-exists followed by a conditional update that only applies when the incoming status has a strictly higher rank than the stored status:
Submitted=0, Forwarded=1, Attempted=2, Skipped=2,
Delivered=3, Failed=3, Parked=3, Discarded=3
Out-of-order telemetry, duplicate gRPC packets, and future reconciliation pulls therefore all feed the same writer safely — status never regresses.
Stuck calls
A non-terminal row (TerminalAtUtc IS NULL) created before now − StuckAgeThreshold (default 10 minutes) is classified as stuck. Stuck is display-only: surfaced as a StuckCount KPI and a row badge in the UI. There is no escalation or alerting.
Architecture
SiteCallAuditActor
SiteCallAuditActor is a ReceiveActor with two constructors:
- Production: receives
IServiceProviderand opens a fresh DI scope per message to resolve the scoped EF CoreISiteCallAuditRepository. This mirrorsAuditLogIngestActor's pattern — a long-lived singleton cannot hold a scope across messages. - Test: receives a concrete
ISiteCallAuditRepositoryand reuses it across all messages, allowing integration tests to run against a real MS SQL fixture without DI scaffolding.
The actor catches all repository exceptions in its write path and replies Accepted=false without rethrowing, keeping the singleton alive across storage faults. The SupervisorStrategy override (one-for-one, maxNrOfRetries: 0) governs any future children — the actor currently has none.
private async Task OnUpsertAsync(UpsertSiteCallCommand cmd)
{
var replyTo = Sender;
var id = cmd.SiteCall.TrackedOperationId;
IServiceScope? scope = null;
ISiteCallAuditRepository repository;
if (_injectedRepository is not null)
{
repository = _injectedRepository;
}
else
{
scope = _serviceProvider!.CreateScope();
repository = scope.ServiceProvider.GetRequiredService<ISiteCallAuditRepository>();
}
try
{
var siteCall = cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow };
await repository.UpsertAsync(siteCall).ConfigureAwait(false);
replyTo.Tell(new UpsertSiteCallReply(id, Accepted: true));
}
catch (Exception ex)
{
_logger.LogError(ex, "SiteCallAudit upsert failed for {TrackedOperationId}", id);
replyTo.Tell(new UpsertSiteCallReply(id, Accepted: false));
}
finally
{
scope?.Dispose();
}
}
IngestedAtUtc is always stamped at central-side persist time, not carried from the site. This ensures the column reflects when central last processed the row, not when the site emitted it.
Message handlers
| Message | Direction | Handler |
|---|---|---|
UpsertSiteCallCommand |
Central ingest → actor | OnUpsertAsync — scope-per-message upsert |
SiteCallQueryRequest |
Central UI → actor | HandleQuery — keyset-paged query, max 200 rows |
SiteCallDetailRequest |
Central UI → actor | HandleDetail — single-row full detail |
SiteCallKpiRequest |
Central UI → actor | HandleKpi — global KPI snapshot |
PerSiteSiteCallKpiRequest |
Central UI → actor | HandlePerSiteKpi — per-site KPI list |
RetrySiteCallRequest |
Central UI → actor | HandleRetrySiteCall — relay to site |
DiscardSiteCallRequest |
Central UI → actor | HandleDiscardSiteCall — relay to site |
RegisterCentralCommunication |
Host → actor | Wires the CentralCommunicationActor transport |
All read handlers capture Sender before the first await and use PipeTo to return the response — the standard Akka pattern for async ask-reply handlers.
SiteCalls table
One row per TrackedOperationId in the central MS SQL configuration database. Key columns:
| Column | Type | Notes |
|---|---|---|
TrackedOperationId |
uniqueidentifier |
PK; GUID stamped site-side at call time |
Channel |
varchar |
"ApiOutbound" or "DbOutbound" |
Target |
varchar |
Human-readable target, e.g. "ERP.GetOrder" |
SourceSite |
varchar |
Site that issued the call |
SourceNode |
varchar NULL |
node-a / node-b; nullable for retired nodes |
Status |
varchar |
String form of AuditStatus; monotonic |
RetryCount |
int |
Dispatch attempts so far |
LastError |
varchar NULL |
Most recent error text |
HttpStatus |
int NULL |
Last HTTP response code (API calls only) |
CreatedAtUtc |
datetime2 |
First submit timestamp from site |
UpdatedAtUtc |
datetime2 |
Latest site-side status mutation |
TerminalAtUtc |
datetime2 NULL |
Set when status reaches a terminal rank |
IngestedAtUtc |
datetime2 |
Central-side stamp, updated on every upsert |
Unlike the AuditLog table, SiteCalls is a standard non-partitioned table on [PRIMARY] holding mutable operational state. No DB-role restriction applies; it is updated in place by the upsert.
Status lifecycle
Submitted → Forwarded → Attempted ──→ Delivered (terminal, success)
└──→ Parked (non-terminal, awaiting operator action)
└──→ Failed (terminal, permanent failure)
└──→ Discarded (terminal, operator-initiated on Parked row)
Failed rows are not operator-actionable — a permanent failure (e.g. HTTP 4xx) would fail again, and the error was already returned synchronously to the calling script. Only Parked rows support Retry and Discard.
Retry / Discard relay
The CentralCommunicationActor is wired into SiteCallAuditActor after both actors exist, via RegisterCentralCommunication. Until registration completes, any relay request receives an immediate SiteCallRelayOutcome.SiteUnreachable outcome — there is genuinely no route to any site.
When _centralCommunication is set, the relay handler wraps the command in a SiteEnvelope keyed to SourceSite and Asks the CentralCommunicationActor, which routes it over the per-site ClusterClient:
private void HandleRetrySiteCall(RetrySiteCallRequest request)
{
var sender = Sender;
if (_centralCommunication is null)
{
sender.Tell(UnreachableRetry(request.CorrelationId));
return;
}
var relay = new RetryParkedOperation(
request.CorrelationId, new TrackedOperationId(request.TrackedOperationId));
var envelope = new SiteEnvelope(request.SourceSite, relay);
_centralCommunication.Ask<ParkedOperationActionAck>(envelope, _options.RelayTimeout)
.PipeTo(
sender,
success: ack => MapRetryResponse(request.CorrelationId, ack),
failure: ex => MapRetryFailure(request.CorrelationId, request.SourceSite, ex));
}
The site applies RetryParkedOperation / DiscardParkedOperation to its own Store-and-Forward buffer and returns a ParkedOperationActionAck. The actor maps the ack to a SiteCallRelayOutcome:
| Ack | Outcome |
|---|---|
Applied=true |
Applied |
Applied=false, no error |
NotParked — site had nothing to do |
Applied=false, error present |
OperationFailed — site faulted |
| Ask timeout / no route | SiteUnreachable |
SiteUnreachable is distinguished from OperationFailed because central is a mirror — a relay that never reached the site is a transient transport condition, not an operation failure. The UI surfaces "site unreachable" so operators know to retry once the site is back online.
The corrected cached-call state flows back to central via the normal telemetry path after the site applies the action. Central never writes the SiteCalls row to reflect a relay outcome directly.
KPI computation
KPIs are computed point-in-time against the SiteCalls table by ISiteCallAuditRepository.ComputeKpisAsync and ComputePerSiteKpisAsync. All aggregation is server-side; no rows are materialised. The actor derives the cutoff timestamps from SiteCallAuditOptions before calling the repository:
private void HandleKpi(SiteCallKpiRequest request)
{
var sender = Sender;
var now = DateTime.UtcNow;
var stuckCutoff = now - _options.StuckAgeThreshold;
var intervalSince = now - _options.KpiInterval;
KpiAsync(request.CorrelationId, stuckCutoff, intervalSince).PipeTo(
sender,
success: response => response,
failure: ex => new SiteCallKpiResponse(
request.CorrelationId, Success: false,
ErrorMessage: ex.GetBaseException().Message,
BufferedCount: 0, ParkedCount: 0, FailedLastInterval: 0,
DeliveredLastInterval: 0, OldestPendingAge: null, StuckCount: 0));
}
SiteCallKpiSnapshot is structurally similar to NotificationKpiSnapshot so the Central UI dashboard can reuse the same tile layout for both components. The shapes differ: SiteCallKpiSnapshot carries 6 fields (BufferedCount, ParkedCount, FailedLastInterval, DeliveredLastInterval, OldestPendingAge, StuckCount), while NotificationKpiSnapshot carries 5 (QueueDepth, StuckCount, ParkedCount, DeliveredLastInterval, OldestPendingAge) — BufferedCount replaces QueueDepth and FailedLastInterval is an addition with no counterpart in the notification shape.
Usage
The actor accepts only Akka messages — there is no public API beyond the message protocol defined in Commons. The Central UI's Site Calls page sends SiteCallQueryRequest / SiteCallKpiRequest / PerSiteSiteCallKpiRequest / SiteCallDetailRequest through CommunicationService, which Asks the singleton and awaits SiteCallQueryResponse / SiteCallKpiResponse / PerSiteSiteCallKpiResponse / SiteCallDetailResponse.
The ingest path is driven by AuditLogIngestActor.OnCachedTelemetryAsync, which writes both the AuditLog row and the SiteCalls upsert directly inside a single EF transaction — no message is sent to SiteCallAuditActor on this path. Both writes succeed or both roll back; neither component needs to coordinate with the other after the transaction commits.
Registration is via ServiceCollectionExtensions.AddSiteCallAudit, which binds SiteCallAuditOptions from the ScadaBridge:SiteCallAudit configuration section. The actor Props and the ClusterSingletonManager registration are wired in the Host's central-role composition.
Configuration
SiteCallAuditOptions is bound from the ScadaBridge:SiteCallAudit section.
| Key | Default | Description |
|---|---|---|
StuckAgeThreshold |
00:10:00 |
Age past which a non-terminal row is counted as stuck. Display-only; no escalation. |
KpiInterval |
00:01:00 |
Trailing window for DeliveredLastInterval and FailedLastInterval KPIs. |
RelayTimeout |
00:00:10 |
Ask timeout for the central→site Retry/Discard relay. Must be less than CommunicationOptions.QueryTimeout (default 30 s) so the inner relay times out first and returns the distinct SiteUnreachable outcome. |
Dependencies & Interactions
- Commons (#16) — owns
SiteCall,SiteCallOperational,TrackedOperationId,SiteCallAuditOptions-adjacent types (SiteCallKpiSnapshot,SiteCallSiteKpiSnapshot,SiteCallQueryFilter,SiteCallPaging), all message contracts (UpsertSiteCallCommand,UpsertSiteCallReply,SiteCallQueryRequest/Response,SiteCallDetailRequest/Response,SiteCallKpiRequest/Response,PerSiteSiteCallKpiRequest/Response,RetrySiteCallRequest/Response,DiscardSiteCallRequest/Response,SiteCallRelayOutcome), and theISiteCallAuditRepositoryinterface. - Configuration Database (#17) — implements
ISiteCallAuditRepositoryagainst the centraldbo.SiteCallstable. Central hosts must callAddConfigurationDatabasefor the actor to resolve its scoped repository. - Audit Log (#23) — shares the
CachedCallTelemetrypacket.AuditLogIngestActor.OnCachedTelemetryAsyncwrites theAuditLogrow and theSiteCallsupsert directly in a single MS SQL transaction; it does not send a message to this actor on that path. The two components share a database transaction, not a message exchange. - Central–Site Communication (#5) — the
CentralCommunicationActoris the transport the relay handlers use. It is registered viaRegisterCentralCommunicationby the Host after both actors are running.CommunicationServicealso provides the async wrappers (RetrySiteCallAsync,DiscardSiteCallAsync) that the Central UI calls; those methods Ask theSiteCallAuditActorwith the outerCommunicationOptions.QueryTimeout. - Store-and-Forward Engine (#6) — site-side executor of
RetryParkedOperationandDiscardParkedOperation. The site's S&F buffer is the source of truth for parked cached calls; it emits updated telemetry after applying an operator action. - Central UI (#9) — the
Health.razorpage (SiteCallKpiTilescomponent) consumesSiteCallKpiResponseto surface buffered count, parked count, stuck count, and throughput KPI tiles on the health dashboard alongside the Notification Outbox tiles; the Site Calls page queries this actor for the paginated list, detail modal, and KPIs and issues Retry/Discard actions that flow throughCommunicationServiceto the relay handlers here. - Cluster Infrastructure (#13) — hosts the
SiteCallAuditActorsingleton with active/standby failover viaClusterSingletonManager.
Troubleshooting
Relay returns SiteUnreachable
The CentralCommunicationActor could not route the command to the site — the site is offline, the ClusterClient route has not yet resolved, or the relay timed out waiting for a ParkedOperationActionAck. The _options.RelayTimeout (default 10 s) is the inner Ask timeout. The action was NOT applied. Retry the operator action once the site is back online; the SiteCalls mirror row will self-correct via telemetry after the site applies it.
Relay returns NotParked
The site was reached but reported no parked row for the given TrackedOperationId. The call was likely already delivered, discarded, or transitioned out of Parked status between the operator clicking Retry/Discard and the relay arriving. No action is required; the telemetry will update the mirror row shortly.
Upsert replied Accepted=false
The actor caught a repository exception and replied false to the caller without rethrowing. The central singleton remains alive. Check the structured log for a SiteCallAudit upsert failed for {TrackedOperationId} error with the exception detail. If the MS SQL configuration database is temporarily unavailable, the telemetry sender will retry on its next cycle (the at-least-once gRPC transport) or the future reconciliation pull will backfill the row.
SiteCalls rows not appearing
Ingest flows through AuditLogIngestActor.OnCachedTelemetryAsync, which writes the AuditLog row and SiteCalls upsert directly in one EF transaction. If that transaction fails, neither row is written. Check AuditLog ingest health first — a missing AuditLog row for the same TrackedOperationId confirms the telemetry never reached central, not that the SiteCalls upsert failed in isolation.