Files
Joseph Doherty 9175b0c013 docs(components): accuracy fixes from deep review (batch 3)
NotificationService (Notify.Send returns string not NotificationId;
MaxConcurrentConnections unenforced; AddHttpClient), NotificationOutbox
(one Attempted row always, terminal row only on terminal status), SiteCallAudit
(direct dual-write, no Tell; KPI tiles consumed by CentralUI), HealthMonitoring
(CentralOfflineTimeout 180s = 6x ReportInterval; HealthReportSender gates on
IsActiveNode), SiteEventLogging (active-node purge seam not wired; runs on both
nodes), InboundAPI (whole System.Diagnostics namespace forbidden).
2026-06-03 16:37:15 -04:00

18 KiB
Raw Permalink Blame History

Site Call Audit

Site Call Audit (#22) is a central-only observability component that maintains an eventually-consistent mirror of every cached call — ExternalSystem.CachedCall() and Database.CachedWrite() — issued by site scripts. It ingests lifecycle telemetry from sites into the central SiteCalls MS SQL table, computes point-in-time KPIs, and relays operator Retry/Discard actions back to the owning site. It does not deliver anything: cached-call execution stays entirely site-local.

Overview

The component lives in src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/ and runs only on the central cluster. Its single class, SiteCallAuditActor, is an Akka.NET ReceiveActor deployed as a ClusterSingletonManager-managed singleton on the active central node.

Telemetry reaches central through the shared CachedCallTelemetry packet (see Audit Log), which carries both an AuditEvent for the AuditLog table and a SiteCallOperational snapshot for the SiteCalls table. The AuditLogIngestActor (Audit Log #23) writes both rows directly inside a single MS SQL transaction when it receives an IngestCachedTelemetryCommand: auditRepo.InsertIfNotExistsAsync(...) followed by siteCallRepo.UpsertAsync(...), committed together or rolled back together. There is no Tell to SiteCallAuditActor on this path; the UpsertSiteCallCommand / OnUpsertAsync handler exists for other callers, not the cached-telemetry hot path. The SiteCallAuditActor is therefore an ingest target, not a transport; it never constructs telemetry packets and never decides what gets delivered.

Sites remain the source of truth. Tracking.Status() is answered site-locally from the site SQLite tracking store; the central SiteCalls row is what the Central UI Site Calls page reads — it may lag by one telemetry cycle.

Key Concepts

Mirror, not dispatcher

The Notification Outbox (#21) ingests notifications and dispatches them centrally. Site Call Audit is different: the Store-and-Forward Engine on each site performs all retry and delivery attempts against the site's locally reachable external systems and databases. Central cannot reach those systems. The SiteCalls table is read-only from central's perspective — operators can view it and trigger Retry/Discard actions, but the actions are forwarded to the site; central never mutates the mirror row directly.

Monotonic upsert idempotency

The SiteCalls table holds one row per TrackedOperationId. ISiteCallAuditRepository.UpsertAsync implements insert-if-not-exists followed by a conditional update that only applies when the incoming status has a strictly higher rank than the stored status:

Submitted=0, Forwarded=1, Attempted=2, Skipped=2,
Delivered=3, Failed=3, Parked=3, Discarded=3

Out-of-order telemetry, duplicate gRPC packets, and future reconciliation pulls therefore all feed the same writer safely — status never regresses.

Stuck calls

A non-terminal row (TerminalAtUtc IS NULL) created before now StuckAgeThreshold (default 10 minutes) is classified as stuck. Stuck is display-only: surfaced as a StuckCount KPI and a row badge in the UI. There is no escalation or alerting.

Architecture

SiteCallAuditActor

SiteCallAuditActor is a ReceiveActor with two constructors:

  • Production: receives IServiceProvider and opens a fresh DI scope per message to resolve the scoped EF Core ISiteCallAuditRepository. This mirrors AuditLogIngestActor's pattern — a long-lived singleton cannot hold a scope across messages.
  • Test: receives a concrete ISiteCallAuditRepository and reuses it across all messages, allowing integration tests to run against a real MS SQL fixture without DI scaffolding.

The actor catches all repository exceptions in its write path and replies Accepted=false without rethrowing, keeping the singleton alive across storage faults. The SupervisorStrategy override (one-for-one, maxNrOfRetries: 0) governs any future children — the actor currently has none.

private async Task OnUpsertAsync(UpsertSiteCallCommand cmd)
{
    var replyTo = Sender;
    var id = cmd.SiteCall.TrackedOperationId;

    IServiceScope? scope = null;
    ISiteCallAuditRepository repository;
    if (_injectedRepository is not null)
    {
        repository = _injectedRepository;
    }
    else
    {
        scope = _serviceProvider!.CreateScope();
        repository = scope.ServiceProvider.GetRequiredService<ISiteCallAuditRepository>();
    }

    try
    {
        var siteCall = cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow };
        await repository.UpsertAsync(siteCall).ConfigureAwait(false);
        replyTo.Tell(new UpsertSiteCallReply(id, Accepted: true));
    }
    catch (Exception ex)
    {
        _logger.LogError(ex, "SiteCallAudit upsert failed for {TrackedOperationId}", id);
        replyTo.Tell(new UpsertSiteCallReply(id, Accepted: false));
    }
    finally
    {
        scope?.Dispose();
    }
}

IngestedAtUtc is always stamped at central-side persist time, not carried from the site. This ensures the column reflects when central last processed the row, not when the site emitted it.

Message handlers

Message Direction Handler
UpsertSiteCallCommand Central ingest → actor OnUpsertAsync — scope-per-message upsert
SiteCallQueryRequest Central UI → actor HandleQuery — keyset-paged query, max 200 rows
SiteCallDetailRequest Central UI → actor HandleDetail — single-row full detail
SiteCallKpiRequest Central UI → actor HandleKpi — global KPI snapshot
PerSiteSiteCallKpiRequest Central UI → actor HandlePerSiteKpi — per-site KPI list
RetrySiteCallRequest Central UI → actor HandleRetrySiteCall — relay to site
DiscardSiteCallRequest Central UI → actor HandleDiscardSiteCall — relay to site
RegisterCentralCommunication Host → actor Wires the CentralCommunicationActor transport

All read handlers capture Sender before the first await and use PipeTo to return the response — the standard Akka pattern for async ask-reply handlers.

SiteCalls table

One row per TrackedOperationId in the central MS SQL configuration database. Key columns:

Column Type Notes
TrackedOperationId uniqueidentifier PK; GUID stamped site-side at call time
Channel varchar "ApiOutbound" or "DbOutbound"
Target varchar Human-readable target, e.g. "ERP.GetOrder"
SourceSite varchar Site that issued the call
SourceNode varchar NULL node-a / node-b; nullable for retired nodes
Status varchar String form of AuditStatus; monotonic
RetryCount int Dispatch attempts so far
LastError varchar NULL Most recent error text
HttpStatus int NULL Last HTTP response code (API calls only)
CreatedAtUtc datetime2 First submit timestamp from site
UpdatedAtUtc datetime2 Latest site-side status mutation
TerminalAtUtc datetime2 NULL Set when status reaches a terminal rank
IngestedAtUtc datetime2 Central-side stamp, updated on every upsert

Unlike the AuditLog table, SiteCalls is a standard non-partitioned table on [PRIMARY] holding mutable operational state. No DB-role restriction applies; it is updated in place by the upsert.

Status lifecycle

Submitted → Forwarded → Attempted ──→ Delivered   (terminal, success)
                                  └──→ Parked      (non-terminal, awaiting operator action)
                                  └──→ Failed      (terminal, permanent failure)
                                  └──→ Discarded   (terminal, operator-initiated on Parked row)

Failed rows are not operator-actionable — a permanent failure (e.g. HTTP 4xx) would fail again, and the error was already returned synchronously to the calling script. Only Parked rows support Retry and Discard.

Retry / Discard relay

The CentralCommunicationActor is wired into SiteCallAuditActor after both actors exist, via RegisterCentralCommunication. Until registration completes, any relay request receives an immediate SiteCallRelayOutcome.SiteUnreachable outcome — there is genuinely no route to any site.

When _centralCommunication is set, the relay handler wraps the command in a SiteEnvelope keyed to SourceSite and Asks the CentralCommunicationActor, which routes it over the per-site ClusterClient:

private void HandleRetrySiteCall(RetrySiteCallRequest request)
{
    var sender = Sender;

    if (_centralCommunication is null)
    {
        sender.Tell(UnreachableRetry(request.CorrelationId));
        return;
    }

    var relay = new RetryParkedOperation(
        request.CorrelationId, new TrackedOperationId(request.TrackedOperationId));
    var envelope = new SiteEnvelope(request.SourceSite, relay);

    _centralCommunication.Ask<ParkedOperationActionAck>(envelope, _options.RelayTimeout)
        .PipeTo(
            sender,
            success: ack => MapRetryResponse(request.CorrelationId, ack),
            failure: ex => MapRetryFailure(request.CorrelationId, request.SourceSite, ex));
}

The site applies RetryParkedOperation / DiscardParkedOperation to its own Store-and-Forward buffer and returns a ParkedOperationActionAck. The actor maps the ack to a SiteCallRelayOutcome:

Ack Outcome
Applied=true Applied
Applied=false, no error NotParked — site had nothing to do
Applied=false, error present OperationFailed — site faulted
Ask timeout / no route SiteUnreachable

SiteUnreachable is distinguished from OperationFailed because central is a mirror — a relay that never reached the site is a transient transport condition, not an operation failure. The UI surfaces "site unreachable" so operators know to retry once the site is back online.

The corrected cached-call state flows back to central via the normal telemetry path after the site applies the action. Central never writes the SiteCalls row to reflect a relay outcome directly.

KPI computation

KPIs are computed point-in-time against the SiteCalls table by ISiteCallAuditRepository.ComputeKpisAsync and ComputePerSiteKpisAsync. All aggregation is server-side; no rows are materialised. The actor derives the cutoff timestamps from SiteCallAuditOptions before calling the repository:

private void HandleKpi(SiteCallKpiRequest request)
{
    var sender = Sender;
    var now = DateTime.UtcNow;
    var stuckCutoff = now - _options.StuckAgeThreshold;
    var intervalSince = now - _options.KpiInterval;

    KpiAsync(request.CorrelationId, stuckCutoff, intervalSince).PipeTo(
        sender,
        success: response => response,
        failure: ex => new SiteCallKpiResponse(
            request.CorrelationId, Success: false,
            ErrorMessage: ex.GetBaseException().Message,
            BufferedCount: 0, ParkedCount: 0, FailedLastInterval: 0,
            DeliveredLastInterval: 0, OldestPendingAge: null, StuckCount: 0));
}

SiteCallKpiSnapshot is structurally similar to NotificationKpiSnapshot so the Central UI dashboard can reuse the same tile layout for both components. The shapes differ: SiteCallKpiSnapshot carries 6 fields (BufferedCount, ParkedCount, FailedLastInterval, DeliveredLastInterval, OldestPendingAge, StuckCount), while NotificationKpiSnapshot carries 5 (QueueDepth, StuckCount, ParkedCount, DeliveredLastInterval, OldestPendingAge) — BufferedCount replaces QueueDepth and FailedLastInterval is an addition with no counterpart in the notification shape.

Usage

The actor accepts only Akka messages — there is no public API beyond the message protocol defined in Commons. The Central UI's Site Calls page sends SiteCallQueryRequest / SiteCallKpiRequest / PerSiteSiteCallKpiRequest / SiteCallDetailRequest through CommunicationService, which Asks the singleton and awaits SiteCallQueryResponse / SiteCallKpiResponse / PerSiteSiteCallKpiResponse / SiteCallDetailResponse.

The ingest path is driven by AuditLogIngestActor.OnCachedTelemetryAsync, which writes both the AuditLog row and the SiteCalls upsert directly inside a single EF transaction — no message is sent to SiteCallAuditActor on this path. Both writes succeed or both roll back; neither component needs to coordinate with the other after the transaction commits.

Registration is via ServiceCollectionExtensions.AddSiteCallAudit, which binds SiteCallAuditOptions from the ScadaBridge:SiteCallAudit configuration section. The actor Props and the ClusterSingletonManager registration are wired in the Host's central-role composition.

Configuration

SiteCallAuditOptions is bound from the ScadaBridge:SiteCallAudit section.

Key Default Description
StuckAgeThreshold 00:10:00 Age past which a non-terminal row is counted as stuck. Display-only; no escalation.
KpiInterval 00:01:00 Trailing window for DeliveredLastInterval and FailedLastInterval KPIs.
RelayTimeout 00:00:10 Ask timeout for the central→site Retry/Discard relay. Must be less than CommunicationOptions.QueryTimeout (default 30 s) so the inner relay times out first and returns the distinct SiteUnreachable outcome.

Dependencies & Interactions

  • Commons (#16) — owns SiteCall, SiteCallOperational, TrackedOperationId, SiteCallAuditOptions-adjacent types (SiteCallKpiSnapshot, SiteCallSiteKpiSnapshot, SiteCallQueryFilter, SiteCallPaging), all message contracts (UpsertSiteCallCommand, UpsertSiteCallReply, SiteCallQueryRequest/Response, SiteCallDetailRequest/Response, SiteCallKpiRequest/Response, PerSiteSiteCallKpiRequest/Response, RetrySiteCallRequest/Response, DiscardSiteCallRequest/Response, SiteCallRelayOutcome), and the ISiteCallAuditRepository interface.
  • Configuration Database (#17) — implements ISiteCallAuditRepository against the central dbo.SiteCalls table. Central hosts must call AddConfigurationDatabase for the actor to resolve its scoped repository.
  • Audit Log (#23) — shares the CachedCallTelemetry packet. AuditLogIngestActor.OnCachedTelemetryAsync writes the AuditLog row and the SiteCalls upsert directly in a single MS SQL transaction; it does not send a message to this actor on that path. The two components share a database transaction, not a message exchange.
  • CentralSite Communication (#5) — the CentralCommunicationActor is the transport the relay handlers use. It is registered via RegisterCentralCommunication by the Host after both actors are running. CommunicationService also provides the async wrappers (RetrySiteCallAsync, DiscardSiteCallAsync) that the Central UI calls; those methods Ask the SiteCallAuditActor with the outer CommunicationOptions.QueryTimeout.
  • Store-and-Forward Engine (#6) — site-side executor of RetryParkedOperation and DiscardParkedOperation. The site's S&F buffer is the source of truth for parked cached calls; it emits updated telemetry after applying an operator action.
  • Central UI (#9) — the Health.razor page (SiteCallKpiTiles component) consumes SiteCallKpiResponse to surface buffered count, parked count, stuck count, and throughput KPI tiles on the health dashboard alongside the Notification Outbox tiles; the Site Calls page queries this actor for the paginated list, detail modal, and KPIs and issues Retry/Discard actions that flow through CommunicationService to the relay handlers here.
  • Cluster Infrastructure (#13) — hosts the SiteCallAuditActor singleton with active/standby failover via ClusterSingletonManager.

Troubleshooting

Relay returns SiteUnreachable

The CentralCommunicationActor could not route the command to the site — the site is offline, the ClusterClient route has not yet resolved, or the relay timed out waiting for a ParkedOperationActionAck. The _options.RelayTimeout (default 10 s) is the inner Ask timeout. The action was NOT applied. Retry the operator action once the site is back online; the SiteCalls mirror row will self-correct via telemetry after the site applies it.

Relay returns NotParked

The site was reached but reported no parked row for the given TrackedOperationId. The call was likely already delivered, discarded, or transitioned out of Parked status between the operator clicking Retry/Discard and the relay arriving. No action is required; the telemetry will update the mirror row shortly.

Upsert replied Accepted=false

The actor caught a repository exception and replied false to the caller without rethrowing. The central singleton remains alive. Check the structured log for a SiteCallAudit upsert failed for {TrackedOperationId} error with the exception detail. If the MS SQL configuration database is temporarily unavailable, the telemetry sender will retry on its next cycle (the at-least-once gRPC transport) or the future reconciliation pull will backfill the row.

SiteCalls rows not appearing

Ingest flows through AuditLogIngestActor.OnCachedTelemetryAsync, which writes the AuditLog row and SiteCalls upsert directly in one EF transaction. If that transaction fails, neither row is written. Check AuditLog ingest health first — a missing AuditLog row for the same TrackedOperationId confirms the telemetry never reached central, not that the SiteCalls upsert failed in isolation.