From 9175b0c013529c41b912d37dcbefb09b43bc697e Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Wed, 3 Jun 2026 16:37:15 -0400 Subject: [PATCH] docs(components): accuracy fixes from deep review (batch 3) NotificationService (Notify.Send returns string not NotificationId; MaxConcurrentConnections unenforced; AddHttpClient), NotificationOutbox (one Attempted row always, terminal row only on terminal status), SiteCallAudit (direct dual-write, no Tell; KPI tiles consumed by CentralUI), HealthMonitoring (CentralOfflineTimeout 180s = 6x ReportInterval; HealthReportSender gates on IsActiveNode), SiteEventLogging (active-node purge seam not wired; runs on both nodes), InboundAPI (whole System.Diagnostics namespace forbidden). --- docs/components/HealthMonitoring.md | 6 +++--- docs/components/InboundAPI.md | 2 +- docs/components/NotificationOutbox.md | 8 ++++---- docs/components/NotificationService.md | 22 +++++++++++----------- docs/components/SiteCallAudit.md | 13 ++++++------- docs/components/SiteEventLogging.md | 4 ++-- 6 files changed, 27 insertions(+), 28 deletions(-) diff --git a/docs/components/HealthMonitoring.md b/docs/components/HealthMonitoring.md index 4852c64e..3af093bf 100644 --- a/docs/components/HealthMonitoring.md +++ b/docs/components/HealthMonitoring.md @@ -30,7 +30,7 @@ Sequence numbers are seeded at construction with the current Unix epoch in milli Online status is driven by `LastHeartbeatAt`, not by `LastReportReceivedAt`. Heartbeats arrive from `SiteCommunicationActor` every ~5 s (`CommunicationOptions.TransportHeartbeatInterval`), so the 60 s `OfflineTimeout` tolerates roughly twelve missed heartbeats before declaring a site offline. A single-node failover — where the standby is alive but the active cannot produce a full report — therefore does not trigger a false offline transition. -The synthetic `$central` site has no heartbeat source; its only signal is the 30 s `CentralHealthReportLoop` self-report. 
It therefore gets a longer `CentralOfflineTimeout` (default 3 × `ReportInterval` = 90 s), equivalent to one missed self-report. The validator rejects any configuration where `CentralOfflineTimeout < OfflineTimeout`. +The synthetic `$central` site has no heartbeat source; its only signal is the 30 s `CentralHealthReportLoop` self-report. It therefore gets a longer `CentralOfflineTimeout` (default 6 × `ReportInterval` = 180 s / 3 min), equivalent to ~6 missed report periods. The validator rejects any configuration where `CentralOfflineTimeout < OfflineTimeout`. The offline-check `PeriodicTimer` runs at half the shorter of the two timeouts so whichever site class has the tighter window is swept at least twice within it. @@ -138,7 +138,7 @@ Options class: `HealthMonitoringOptions`, bound from the `ScadaBridge:HealthMoni |-----|---------|-----------|-------------| | `ScadaBridge:HealthMonitoring:ReportInterval` | `00:00:30` (30 s) | Must be `> 0` | Interval at which site nodes emit health reports to central. Also the `CentralHealthReportLoop` self-report cadence. | | `ScadaBridge:HealthMonitoring:OfflineTimeout` | `00:01:00` (60 s) | Must be `> 0` | Silence window after which a real site is marked offline. Driven by `LastHeartbeatAt`, not last report time. | -| `ScadaBridge:HealthMonitoring:CentralOfflineTimeout` | `00:03:00` (3 min) | Must be `>= OfflineTimeout` | Grace window for the `$central` synthetic site, which has no heartbeat source. Defaults to 3× `ReportInterval`. | +| `ScadaBridge:HealthMonitoring:CentralOfflineTimeout` | `00:03:00` (3 min) | Must be `>= OfflineTimeout` | Grace window for the `$central` synthetic site, which has no heartbeat source. Defaults to 6× `ReportInterval`. | The offline-check cadence is derived at runtime as `min(OfflineTimeout, CentralOfflineTimeout) / 2` — not directly configurable. 
@@ -149,7 +149,7 @@ The offline-check cadence is derived at runtime as `min(OfflineTimeout, CentralO - [Site Runtime (#3)](./SiteRuntime.md) — Script Actors call `IncrementScriptError`; Alarm Actors call `IncrementAlarmError`; the Deployment Manager singleton ownership check drives `SetActiveNode`. - [Data Connection Layer (#4)](./DataConnectionLayer.md) — connection actors call `UpdateConnectionHealth`, `UpdateTagResolution`, `UpdateConnectionEndpoint`, `UpdateTagQuality`, and `RemoveConnection` on `ISiteHealthCollector`. - [Store-and-Forward Engine (#6)](./StoreAndForward.md) — `HealthReportSender` queries `StoreAndForwardStorage` for `GetParkedMessageCountAsync` and `GetBufferDepthByCategoryAsync`; the results populate `ParkedMessageCount` and `StoreAndForwardBufferDepths` (keyed by `StoreAndForwardCategory` name). -- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — `IClusterNodeProvider` supplies cluster node list and `SelfIsPrimary` flag to both `HealthReportSender` and `CentralHealthReportLoop`. Heartbeat cadence (default 5 s) is owned by Cluster Infrastructure / `SiteCommunicationActor`. +- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — `IClusterNodeProvider` supplies the cluster node list to `HealthReportSender` (for the node-list payload); `HealthReportSender`'s active/standby gate is `_collector.IsActiveNode`, which is set externally by `DeploymentManagerActor.PreStart`/`PostStop`. `CentralHealthReportLoop` reads both `GetClusterNodes()` and `SelfIsPrimary` from `IClusterNodeProvider`. Heartbeat cadence (default 5 s) is owned by Cluster Infrastructure / `SiteCommunicationActor`. - [Audit Log (#23)](./AuditLog.md) — `AddAuditLogHealthMetricsBridge` wires `HealthMetricsAuditWriteFailureCounter` and `HealthMetricsAuditRedactionFailureCounter` into the site collector, and registers `SiteAuditBacklogReporter` to poll the site-local SQLite drain backlog. 
On central, `AuditCentralHealthSnapshot` exposes `CentralAuditWriteFailures`, `AuditRedactionFailure`, and per-site `SiteAuditTelemetryStalled` alongside the aggregated site states on the health dashboard. - [Central UI (#9)](./CentralUI.md) — the health dashboard resolves `ICentralHealthAggregator` and polls `GetAllSiteStates()` on a ~10 s timer. Notification Outbox and Site Call Audit KPIs are computed on demand from their own central tables by those components; Health Monitoring does not own or cache them. - [Host (#15)](./Host.md) — implements `ISiteIdentityProvider` (supplies `SiteId` for report payloads) and `IClusterNodeProvider`, and calls the appropriate `Add*` entry points from the role-specific composition root. diff --git a/docs/components/InboundAPI.md b/docs/components/InboundAPI.md index db0b0845..70c295cb 100644 --- a/docs/components/InboundAPI.md +++ b/docs/components/InboundAPI.md @@ -219,7 +219,7 @@ var result = await Route.To(siteId).Call("GetProductionSummary", new { date = Pa return result; ``` -The `Route.To().Call()` inherits the method-level timeout automatically. A script that needs a tighter per-call bound may pass an explicit `CancellationToken`. Scripts may not access `System.IO`, `System.Diagnostics.Process`, `System.Threading` (except `Tasks`), `System.Reflection`, `System.Net`, or reflection-gateway members — violations are rejected statically at compile time. +The `Route.To().Call()` inherits the method-level timeout automatically. A script that needs a tighter per-call bound may pass an explicit `CancellationToken`. Scripts may not access `System.IO`, the entire `System.Diagnostics` namespace (including `Process`), `System.Threading` (except `Tasks`), `System.Reflection`, `System.Net`, or reflection-gateway members — violations are rejected statically at compile time. 
### Startup compilation and hot-reload diff --git a/docs/components/NotificationOutbox.md b/docs/components/NotificationOutbox.md index 170ac270..dfb876d0 100644 --- a/docs/components/NotificationOutbox.md +++ b/docs/components/NotificationOutbox.md @@ -168,12 +168,12 @@ Every attempt also writes audit rows via `ICentralAuditWriter` (see Audit Integr ### Audit integration -Each delivery attempt emits two `AuditChannel.Notification` / `AuditKind.NotifyDeliver` rows via `ICentralAuditWriter`: +Each delivery attempt emits at least one `AuditChannel.Notification` / `AuditKind.NotifyDeliver` row via `ICentralAuditWriter`: - An `AuditStatus.Attempted` row (always, per attempt), carrying attempt duration in milliseconds. -- A terminal row (`Delivered`, `Parked`, or `Discarded`) when the post-outcome status is terminal. +- A second, terminal row (`Delivered`, `Parked`, or `Discarded`) only when the post-outcome status is terminal — a transient failure that transitions the notification to `Retrying` emits only the `Attempted` row. -`CorrelationId` on both rows is parsed from the `NotificationId` GUID. `ExecutionId` and `ParentExecutionId` are echoed from `Notification.OriginExecutionId` / `Notification.OriginParentExecutionId`, linking the central `NotifyDeliver` rows to the site-emitted `NotifySend` row for the same script run. The `Actor` field is `"system"` — there is no authenticated user at dispatch time. +`CorrelationId` on the emitted row(s) is parsed from the `NotificationId` GUID. `ExecutionId` and `ParentExecutionId` are echoed from `Notification.OriginExecutionId` / `Notification.OriginParentExecutionId`, linking the central `NotifyDeliver` rows to the site-emitted `NotifySend` row for the same script run. The `Actor` field is `"system"` — there is no authenticated user at dispatch time. Manual discard via `HandleDiscard` also emits a terminal `Discarded` row (with a null error, because the discard is operator-driven, not a delivery failure). 
@@ -223,7 +223,7 @@ Delivery retry policy (`MaxRetries`, `RetryDelay`) is read at runtime from `Smtp - [Notification Service (#8)](./NotificationService.md) — supplies `ISmtpClientWrapper`, `OAuth2TokenService`, `NotificationOptions`, `SmtpTlsModeParser`, `SmtpErrorClassifier`, and the `SmtpPermanentException` type. `AddNotificationOutbox` relies on `AddNotificationService` being called by the Host to register these shared SMTP primitives; registering them twice would duplicate them. - [Central–Site Communication (#5)](./Communication.md) — carries `NotificationSubmit` / `NotificationSubmitAck` between sites and central via ClusterClient, and `NotificationStatusQuery` / `NotificationStatusResponse` for the `Notify.Status` round-trip. - [Store-and-Forward Engine (#6)](./StoreAndForward.md) — the site-side component that durably buffers notifications in SQLite and retries forwarding until central acks. The outbox is the receiving end of the S&F handoff. -- [Audit Log (#23)](./AuditLog.md) — the outbox is a central direct-write caller of `ICentralAuditWriter`. It emits `NotifyDeliver` rows (Attempted + terminal) per delivery attempt and per operator Discard. The upstream `NotifySend` row is emitted by the site and arrives at central via standard audit telemetry. +- [Audit Log (#23)](./AuditLog.md) — the outbox is a central direct-write caller of `ICentralAuditWriter`. It emits an `Attempted` `NotifyDeliver` row per delivery attempt, plus a terminal row only when the attempt drives the notification to a terminal status (`Delivered`/`Parked`/`Discarded`); it also emits a terminal row per operator Discard. The upstream `NotifySend` row is emitted by the site and arrives at central via standard audit telemetry. - [Health Monitoring (#11)](./HealthMonitoring.md) — polls `NotificationKpiRequest` / `PerSiteNotificationKpiRequest` for the headline KPI tiles on the health dashboard (queue depth, stuck count, parked count). 
These are central-computed from the `Notifications` table and are separate from the site S&F backlog metric. - [Central UI (#9)](./CentralUI.md) — hosts the Notification Outbox page: KPI tiles, a queryable/filterable notification list, per-row Retry/Discard actions on parked notifications, and a stuck-row badge. diff --git a/docs/components/NotificationService.md b/docs/components/NotificationService.md index 744684e7..f5cfc754 100644 --- a/docs/components/NotificationService.md +++ b/docs/components/NotificationService.md @@ -7,7 +7,7 @@ The Notification Service is the central-only component that owns notification-li Notification Service (#8) runs on the central cluster only. Its responsibilities split cleanly into two layers: - **Definitions** — `NotificationList` and `SmtpConfiguration` entities stored in the central Configuration Database. Notification lists carry a `NotificationType` discriminator (`Email` now; additional types such as `Teams` are planned). Lists and SMTP config are never deployed to sites. -- **Delivery adapters** — stateless, per-type implementations of `INotificationDeliveryAdapter`. The Notification Outbox selects the adapter matching a notification's `Type`, calls `DeliverAsync`, and receives a three-way `DeliveryOutcome` (`Success` / `TransientFailure` / `PermanentFailure`). The adapter owns the full recipient-resolution, connection, authentication, send, and disconnect sequence. +- **Delivery adapters** — per-type implementations of `INotificationDeliveryAdapter`. The Notification Outbox selects the adapter matching a notification's `Type`, calls `DeliverAsync`, and receives a three-way `DeliveryOutcome` (`Success` / `TransientFailure` / `PermanentFailure`). The adapter owns the full recipient-resolution, connection, authentication, send, and disconnect sequence. `EmailNotificationDeliveryAdapter` is registered as scoped (it holds a scoped `INotificationRepository`) and the outbox actor caches a single instance for its lifetime. 
The component code lives in `src/ZB.MOM.WW.ScadaBridge.NotificationService/`. The `EmailNotificationDeliveryAdapter` that consumes these primitives lives in `src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Delivery/`. @@ -15,12 +15,12 @@ The component code lives in `src/ZB.MOM.WW.ScadaBridge.NotificationService/`. Th ### Central-only delivery -Before the current design, site nodes delivered notifications directly over SMTP. That arrangement required SMTP credentials and notification lists to be deployed to every site. The redesign inverts the path: a site script calls `Notify.To("list").Send(subject, body)`, receives a `NotificationId` GUID immediately, and the notification is store-and-forwarded to central. The Notification Outbox on central ingests it and calls the delivery adapter. Sites never open an SMTP connection. +Before the current design, site nodes delivered notifications directly over SMTP. That arrangement required SMTP credentials and notification lists to be deployed to every site. The redesign inverts the path: a site script calls `Notify.To("list").Send(subject, body)`, receives a `string` notification id immediately, and the notification is store-and-forwarded to central. The Notification Outbox on central ingests it and calls the delivery adapter. Sites never open an SMTP connection. This means: - Credential exposure is limited to the central cluster. - List membership is resolved at delivery time, so a list change takes effect for all future deliveries without redeploying to sites. -- The SMTP `MaxConcurrentConnections` limit is enforced at a single point. +- The SMTP `MaxConcurrentConnections` value is configured at a single point, though it is not currently enforced (no connection gate or semaphore). 
### `NotificationType` discriminator @@ -56,7 +56,7 @@ public static IServiceCollection AddNotificationService(this IServiceCollection } ``` -Three things are registered: the `NotificationOptions` fallback values, the `OAuth2TokenService` token cache, and the `ISmtpClientWrapper` factory. The `EmailNotificationDeliveryAdapter` itself is registered by `ZB.MOM.WW.ScadaBridge.NotificationOutbox`, which depends on this project. +Four things are registered: the `NotificationOptions` fallback values, the `HttpClient` infrastructure (required by `OAuth2TokenService`), the `OAuth2TokenService` token cache, and the `ISmtpClientWrapper` factory. The `EmailNotificationDeliveryAdapter` itself is registered by `ZB.MOM.WW.ScadaBridge.NotificationOutbox`, which depends on this project. ### `INotificationDeliveryAdapter` @@ -79,7 +79,7 @@ The `DeliveryOutcome` record carries a `DeliveryResult` (`Success` / `TransientF 1. **Resolve list** — calls `INotificationRepository.GetListByNameAsync`. An unknown list returns `Permanent` immediately (the list was deleted; retrying cannot fix it). 2. **Resolve recipients** — calls `GetRecipientsByListIdAsync`. An empty list returns `Permanent`. 3. **Resolve SMTP config** — calls `GetAllSmtpConfigurationsAsync`, takes the first row. No config returns `Permanent`. -4. **Parse TLS mode** — `SmtpTlsModeParser.Parse(smtpConfig.TlsMode)`. An unrecognised string returns `Permanent` (config fault, not a transient network condition). +4. **Parse TLS mode** — `SmtpTlsModeParser.Parse(smtpConfig.TlsMode)`. An unrecognised string throws `ArgumentException`; `DeliverAsync` catches it and returns `Permanent` (config fault, not a transient network condition). 5. **Validate addresses** — `EmailAddressValidator.ValidateAddresses(fromAddress, recipients)`. A malformed address returns `Permanent`. 6. **Send** — calls the private `SendAsync`, which connect/auth/send/disconnects via a fresh `ISmtpClientWrapper`. 
@@ -130,18 +130,18 @@ A `Permanent` classification inside `SendAsync` is wrapped in `SmtpPermanentExce Site scripts do not interact with this component directly. The script surface is: ```csharp -// Returns a NotificationId immediately — does not block for delivery. -NotificationId id = Notify.To("Shift-Supervisors").Send("Tank overflow", "Tank T-03 is at 98%"); +// Returns a string notification id immediately — does not block for delivery. +string id = await Notify.To("Shift-Supervisors").Send("Tank overflow", "Tank T-03 is at 98%"); // Site-local while still in the S&F buffer; round-trips to central once forwarded. -NotificationDeliveryStatus status = Notify.Status(id); +NotificationDeliveryStatus status = await Notify.Status(id); ``` -`Notify.To("list")` is type-agnostic. The `NotificationId` is a GUID generated at the site. `Notify.Status` returns a `NotificationDeliveryStatus` record with `Status` (`Forwarding` site-local, or `Pending` / `Retrying` / `Delivered` / `Parked` / `Discarded` from central), `RetryCount`, `LastError`, and `DeliveredAt`. +`Notify.To("list")` is type-agnostic. The notification id is a 32-character "N"-format GUID string (`Guid.NewGuid().ToString("N")`) generated at the site. `Notify.Status(string notificationId)` returns a `NotificationDeliveryStatus` record with `Status` (`Forwarding` site-local, `Unknown` if no central row and not in the S&F buffer, or `Pending` / `Retrying` / `Delivered` / `Parked` / `Discarded` from central), `RetryCount`, `LastError`, and `DeliveredAt`. ### Registering the adapter -On the central host, both projects are registered. The Notification Outbox registers `EmailNotificationDeliveryAdapter` as a keyed or enumerated `INotificationDeliveryAdapter` and calls `AddNotificationService` to get its dependencies: +On the central host, both projects are registered. 
The Notification Outbox registers `EmailNotificationDeliveryAdapter` as a scoped concrete type and as a scoped `INotificationDeliveryAdapter`; the outbox actor resolves adapters by enumerating `IEnumerable<INotificationDeliveryAdapter>` (no keyed/named registration). `AddNotificationService` is called to register the shared SMTP primitives: ```csharp // Central composition root (simplified) @@ -156,7 +156,7 @@ services.AddNotificationOutbox(); // registers EmailNotificationDeliveryAdapte | Section | Key | Default | Description | |---------|-----|---------|-------------| | `ScadaBridge:Notification` | `ConnectionTimeoutSeconds` | `30` | SMTP connection/operation timeout in seconds. Applied when `SmtpConfiguration.ConnectionTimeoutSeconds` is zero or negative. | -| `ScadaBridge:Notification` | `MaxConcurrentConnections` | `5` | Maximum concurrent SMTP connections. Used as the documented default; enforcement is in `EmailNotificationDeliveryAdapter`. | +| `ScadaBridge:Notification` | `MaxConcurrentConnections` | `5` | Maximum concurrent SMTP connections. Used as the documented fallback default when the `SmtpConfiguration` row is unset; this limit is not currently enforced by a connection gate or semaphore. | SMTP retry settings (`MaxRetries`, `RetryDelay`) live on the `SmtpConfiguration` entity and are read by the Notification Outbox dispatcher — they are not part of `NotificationOptions`. diff --git a/docs/components/SiteCallAudit.md b/docs/components/SiteCallAudit.md index 403a1ceb..0097a532 100644 --- a/docs/components/SiteCallAudit.md +++ b/docs/components/SiteCallAudit.md @@ -6,7 +6,7 @@ Site Call Audit (#22) is a central-only observability component that maintains a The component lives in `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/` and runs only on the central cluster. Its single class, `SiteCallAuditActor`, is an Akka.NET `ReceiveActor` deployed as a `ClusterSingletonManager`-managed singleton on the active central node. 
-Telemetry reaches central through the shared `CachedCallTelemetry` packet (see [Audit Log](./AuditLog.md)), which carries both an `AuditEvent` for the `AuditLog` table and a `SiteCallOperational` snapshot for the `SiteCalls` table. The `AuditLogIngestActor` (Audit Log #23) writes both in a single MS SQL transaction when it receives an `IngestCachedTelemetryCommand`; it then tells `SiteCallAuditActor` an `UpsertSiteCallCommand` so the `SiteCalls` row is always consistent with the paired audit row. The `SiteCallAuditActor` is therefore an ingest target, not a transport; it never constructs telemetry packets and never decides what gets delivered. +Telemetry reaches central through the shared `CachedCallTelemetry` packet (see [Audit Log](./AuditLog.md)), which carries both an `AuditEvent` for the `AuditLog` table and a `SiteCallOperational` snapshot for the `SiteCalls` table. The `AuditLogIngestActor` (Audit Log #23) writes both rows directly inside a single MS SQL transaction when it receives an `IngestCachedTelemetryCommand`: `auditRepo.InsertIfNotExistsAsync(...)` followed by `siteCallRepo.UpsertAsync(...)`, committed together or rolled back together. There is no Tell to `SiteCallAuditActor` on this path; the `UpsertSiteCallCommand` / `OnUpsertAsync` handler exists for other callers, not the cached-telemetry hot path. The `SiteCallAuditActor` is therefore an ingest target, not a transport; it never constructs telemetry packets and never decides what gets delivered. Sites remain the source of truth. `Tracking.Status()` is answered site-locally from the site SQLite tracking store; the central `SiteCalls` row is what the Central UI Site Calls page reads — it may lag by one telemetry cycle. @@ -193,13 +193,13 @@ private void HandleKpi(SiteCallKpiRequest request) } ``` -The `SiteCallKpiSnapshot` shape mirrors `NotificationKpiSnapshot` so the Central UI dashboard can reuse the same tile layout for both components. 
+`SiteCallKpiSnapshot` is structurally similar to `NotificationKpiSnapshot` so the Central UI dashboard can reuse the same tile layout for both components. The shapes differ: `SiteCallKpiSnapshot` carries 6 fields (`BufferedCount`, `ParkedCount`, `FailedLastInterval`, `DeliveredLastInterval`, `OldestPendingAge`, `StuckCount`), while `NotificationKpiSnapshot` carries 5 (`QueueDepth`, `StuckCount`, `ParkedCount`, `DeliveredLastInterval`, `OldestPendingAge`) — `BufferedCount` replaces `QueueDepth` and `FailedLastInterval` is an addition with no counterpart in the notification shape. ## Usage The actor accepts only Akka messages — there is no public API beyond the message protocol defined in Commons. The Central UI's Site Calls page sends `SiteCallQueryRequest` / `SiteCallKpiRequest` / `PerSiteSiteCallKpiRequest` / `SiteCallDetailRequest` through `CommunicationService`, which Asks the singleton and awaits `SiteCallQueryResponse` / `SiteCallKpiResponse` / `PerSiteSiteCallKpiResponse` / `SiteCallDetailResponse`. -The ingest path is driven by `AuditLogIngestActor.OnCachedTelemetryAsync`, which tells an `UpsertSiteCallCommand` after committing the dual-write transaction. The `SiteCallAuditActor` does not need to coordinate with `AuditLogIngestActor` — the transaction guarantees the `AuditLog` row always precedes the upsert command. +The ingest path is driven by `AuditLogIngestActor.OnCachedTelemetryAsync`, which writes both the `AuditLog` row and the `SiteCalls` upsert directly inside a single EF transaction — no message is sent to `SiteCallAuditActor` on this path. Both writes succeed or both roll back; neither component needs to coordinate with the other after the transaction commits. Registration is via `ServiceCollectionExtensions.AddSiteCallAudit`, which binds `SiteCallAuditOptions` from the `ScadaBridge:SiteCallAudit` configuration section. The actor `Props` and the `ClusterSingletonManager` registration are wired in the Host's central-role composition. 
@@ -217,11 +217,10 @@ Registration is via `ServiceCollectionExtensions.AddSiteCallAudit`, which binds - [Commons (#16)](./Commons.md) — owns `SiteCall`, `SiteCallOperational`, `TrackedOperationId`, `SiteCallAuditOptions`-adjacent types (`SiteCallKpiSnapshot`, `SiteCallSiteKpiSnapshot`, `SiteCallQueryFilter`, `SiteCallPaging`), all message contracts (`UpsertSiteCallCommand`, `UpsertSiteCallReply`, `SiteCallQueryRequest/Response`, `SiteCallDetailRequest/Response`, `SiteCallKpiRequest/Response`, `PerSiteSiteCallKpiRequest/Response`, `RetrySiteCallRequest/Response`, `DiscardSiteCallRequest/Response`, `SiteCallRelayOutcome`), and the `ISiteCallAuditRepository` interface. - [Configuration Database (#17)](./ConfigurationDatabase.md) — implements `ISiteCallAuditRepository` against the central `dbo.SiteCalls` table. Central hosts must call `AddConfigurationDatabase` for the actor to resolve its scoped repository. -- [Audit Log (#23)](./AuditLog.md) — shares the `CachedCallTelemetry` packet. `AuditLogIngestActor.OnCachedTelemetryAsync` writes the `AuditLog` row and the `SiteCalls` upsert in a single MS SQL transaction, then tells `UpsertSiteCallCommand` to this actor. The two components coordinate via message-passing, not a shared service. +- [Audit Log (#23)](./AuditLog.md) — shares the `CachedCallTelemetry` packet. `AuditLogIngestActor.OnCachedTelemetryAsync` writes the `AuditLog` row and the `SiteCalls` upsert directly in a single MS SQL transaction; it does not send a message to this actor on that path. The two components share a database transaction, not a message exchange. - [Central–Site Communication (#5)](./Communication.md) — the `CentralCommunicationActor` is the transport the relay handlers use. It is registered via `RegisterCentralCommunication` by the Host after both actors are running. 
`CommunicationService` also provides the async wrappers (`RetrySiteCallAsync`, `DiscardSiteCallAsync`) that the Central UI calls; those methods Ask the `SiteCallAuditActor` with the outer `CommunicationOptions.QueryTimeout`. - [Store-and-Forward Engine (#6)](./StoreAndForward.md) — site-side executor of `RetryParkedOperation` and `DiscardParkedOperation`. The site's S&F buffer is the source of truth for parked cached calls; it emits updated telemetry after applying an operator action. -- [Health Monitoring (#11)](./HealthMonitoring.md) — consumes `SiteCallKpiResponse` / `PerSiteSiteCallKpiResponse` to surface buffered count, parked count, stuck count, and throughput KPI tiles on the health dashboard alongside the Notification Outbox tiles. -- [Central UI (#9)](./CentralUI.md) — the Site Calls page queries this actor for the paginated list, detail modal, and KPIs; it issues Retry/Discard actions that flow through `CommunicationService` to the relay handlers here. +- [Central UI (#9)](./CentralUI.md) — the `Health.razor` page (`SiteCallKpiTiles` component) consumes `SiteCallKpiResponse` to surface buffered count, parked count, stuck count, and throughput KPI tiles on the health dashboard alongside the Notification Outbox tiles; the Site Calls page queries this actor for the paginated list, detail modal, and KPIs and issues Retry/Discard actions that flow through `CommunicationService` to the relay handlers here. - [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — hosts the `SiteCallAuditActor` singleton with active/standby failover via `ClusterSingletonManager`. ## Troubleshooting @@ -240,7 +239,7 @@ The actor caught a repository exception and replied false to the caller without ### `SiteCalls` rows not appearing -Ingest flows through `AuditLogIngestActor.OnCachedTelemetryAsync`, which writes the `AuditLog` row and `SiteCalls` upsert in one transaction before telling `UpsertSiteCallCommand`. If that transaction fails, neither row is written. 
Check `AuditLog` ingest health first — a missing `AuditLog` row for the same `TrackedOperationId` confirms the telemetry never reached central, not that the `SiteCalls` upsert failed in isolation. +Ingest flows through `AuditLogIngestActor.OnCachedTelemetryAsync`, which writes the `AuditLog` row and `SiteCalls` upsert directly in one EF transaction. If that transaction fails, neither row is written. Check `AuditLog` ingest health first — a missing `AuditLog` row for the same `TrackedOperationId` confirms the telemetry never reached central, not that the `SiteCalls` upsert failed in isolation. ## Related Documentation diff --git a/docs/components/SiteEventLogging.md b/docs/components/SiteEventLogging.md index d99753f9..2a48dd4d 100644 --- a/docs/components/SiteEventLogging.md +++ b/docs/components/SiteEventLogging.md @@ -20,7 +20,7 @@ The DI entry point is `ServiceCollectionExtensions.AddSiteEventLogging`, registe ### Active-node-only writes -Only the active site node generates and stores events. The standby's local SQLite receives no writes, so purging there is unnecessary. `EventLogPurgeService` consults an optional `SiteEventLogActiveNodeCheck` delegate on each tick; the Host registers the real check on site nodes, and the purge early-exits on the standby. When no delegate is registered (tests, non-clustered hosts), the purge runs on every tick, preserving pre-cluster behaviour. +Only the active site node generates and stores events. The standby's local SQLite receives no writes, so purging there is unnecessary. `EventLogPurgeService` consults an optional `SiteEventLogActiveNodeCheck` delegate on each tick and early-exits when the delegate returns `false`. The delegate is an optional seam: `AddSiteEventLogging` resolves it via `sp.GetService<SiteEventLogActiveNodeCheck>()`, so the service compiles and runs without it. The Host does **not** currently register the delegate, so `GetService` returns `null` and the constructor defaults to `() => true`. 
As a result the purge currently runs on every tick on both nodes. When no delegate is registered, the purge runs on every tick, preserving pre-cluster behaviour. On failover, the newly active node starts logging to its own SQLite database. Historical events from the previous active node are not queryable until that node comes back online. This is acceptable because event logs are diagnostic, not transactional — a missing log tail after failover is not a data-integrity concern. @@ -199,7 +199,7 @@ The docker cluster appsettings (`ScadaBridge:SiteEventLog`) sets `RetentionDays: - [Site Runtime (#3)](./SiteRuntime.md) — `ScriptActor` and `ScriptExecutionActor` log `script`-type events: trigger expression failures, script execution errors, and timeouts. `ISiteEventLogger` is resolved from DI inside execution actors. - [Data Connection Layer (#4)](./DataConnectionLayer.md) — `DataConnectionActor` logs `connection`-type events: connection loss, reconnection, and endpoint failover. `DataConnectionManagerActor` may also log connection-category events. - [Store-and-Forward Engine (#6)](./StoreAndForward.md) — logs `store_and_forward`-type events on the site→central notification forward path (forward failures, long-buffered notifications). Routine enqueue and forward-success events are not logged; central's `Notifications` table is the authoritative record. -- [Host (#15)](./Host.md) — `SiteServiceRegistration` calls `AddSiteEventLogging` and binds `SiteEventLogOptions`. `AkkaHostedService` wires `EventLogHandlerActor` as a cluster singleton scoped to `"site-{SiteId}"` and registers the `SiteEventLogActiveNodeCheck` delegate so the purge runs only on the active node. +- [Host (#15)](./Host.md) — `SiteServiceRegistration` calls `AddSiteEventLogging` and binds `SiteEventLogOptions`. `AkkaHostedService` wires `EventLogHandlerActor` as a cluster singleton scoped to `"site-{SiteId}"`. 
The `SiteEventLogActiveNodeCheck` delegate is an optional seam defined in `SiteEventLogging` for the Host to register when it wants to gate the purge to the active node only; the Host does not currently register it, so the purge defaults to always-active and runs on every node. - [Audit Log (#23)](./AuditLog.md) — a distinct component. The Audit Log captures every trust-boundary action (outbound API calls, DB writes, notifications, inbound API) and flows to a central append-only table with monthly partitioning and 365-day retention. The site event log captures internal runtime diagnostics (failures, state transitions) locally with 30-day retention. The two stores are complementary, not overlapping. - [Site Call Audit (#22)](./SiteCallAudit.md) — a distinct component. Site Call Audit mirrors cached-call operational status in the central `SiteCalls` table via gRPC telemetry. Site Event Logging has no role in that flow.
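
Review note on the HealthMonitoring timing values touched by this patch: the arithmetic below is a worked restatement using only the defaults quoted in the diff (`ReportInterval` = 30 s, `OfflineTimeout` = 60 s, heartbeat interval of roughly 5 s). It is a consistency check, not a description of additional behaviour.

```latex
\begin{aligned}
\text{CentralOfflineTimeout} &= 6 \times \text{ReportInterval} = 6 \times 30\,\text{s} = 180\,\text{s} \\
\text{offline-check cadence} &= \frac{\min(\text{OfflineTimeout},\ \text{CentralOfflineTimeout})}{2}
  = \frac{\min(60\,\text{s},\ 180\,\text{s})}{2} = 30\,\text{s} \\
\frac{\text{OfflineTimeout}}{\text{heartbeat interval}} &\approx \frac{60\,\text{s}}{5\,\text{s}} = 12\ \text{heartbeats}
\end{aligned}
```

With these defaults the validator constraint `CentralOfflineTimeout >= OfflineTimeout` holds (180 s against 60 s), and the `00:03:00` default in the configuration table agrees with the corrected 6× multiplier in the prose.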