diff --git a/code-reviews/AuditLog/findings.md b/code-reviews/AuditLog/findings.md
index 4355dd03..38b962ca 100644
--- a/code-reviews/AuditLog/findings.md
+++ b/code-reviews/AuditLog/findings.md
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
-| Open findings | 11 |
+| Open findings | 9 |
## Summary
@@ -111,7 +111,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:99-103`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:109-115`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:315-321` |
**Description**
@@ -144,9 +144,13 @@ as a forward-compat hedge — change the decider to `Decider.From(_ => Directive
or similar to match the comment, AND add a clear note that the per-row catch is what
keeps the actor running across handler throws, not the supervisor strategy.
-**Resolution**
+**Resolution (2026-05-28):**
-_Unresolved._
+Comment-only fix on all three actors (`AuditLogIngestActor`, `AuditLogPurgeActor`,
+`SiteAuditReconciliationActor`). XML doc remarks now correctly attribute alive-on-throw
+to the per-row/per-batch/per-site try/catch blocks, describe the `SupervisorStrategy`
+override as a children-only forward-compat placeholder, and state the actual
+`DefaultDecider` Restart semantics (no more "Resume" claim). Behaviour unchanged.
### AuditLog-003 — `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously
@@ -394,7 +398,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Documentation & comments |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:706-740` |
**Description**
@@ -425,9 +429,12 @@ describe the actual ordering: the channel is completed first, the loop drains
remaining items under the lock, and `_disposed = true` is set only after the loop
exits. The current code is correct; the comment is wrong.
-**Resolution**
+**Resolution (2026-05-28):**
-_Unresolved._
+Comment-only fix on `SqliteAuditWriter.DisposeAsync`. Rewrote the misleading comment
+to describe the actual ordering: completing the channel writer is the shutdown signal,
+the writer loop drains buffered items, and `_disposed` is intentionally set only after
+the loop has drained (in the second lock block). Behaviour unchanged.
### AuditLog-010 — Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream
diff --git a/code-reviews/CLI/findings.md b/code-reviews/CLI/findings.md
index 4fd24af0..fd29173a 100644
--- a/code-reviews/CLI/findings.md
+++ b/code-reviews/CLI/findings.md
@@ -1048,9 +1048,11 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
| Location | `docs/requirements/Component-CLI.md:310-311` (vs. `src/ScadaLink.CLI/Commands/AuditQueryHelpers.cs:186`, `src/ScadaLink.CLI/Commands/AuditExportHelpers.cs:126`, `src/ScadaLink.CLI/ManagementHttpClient.cs:94-156`) |
+**Resolution (2026-05-28):** Updated `Component-CLI.md` Dependencies bullets — the Management Service (#18) bullet now says the `scadalink audit` group rides a parallel REST surface (`GET /api/audit/query` / `GET /api/audit/export`) sharing HTTP Basic Auth with `/management` but bypassing the actor; the Audit Log (#23) bullet names the specific endpoints and the server-side `AuditEndpoints` permission enforcement (`OperationalAudit` / `AuditExport`).
+
**Description**
`Component-CLI.md:310` states: "The `scadalink audit` command group rides this same
diff --git a/code-reviews/ClusterInfrastructure/findings.md b/code-reviews/ClusterInfrastructure/findings.md
index 302924e5..7e90aaa7 100644
--- a/code-reviews/ClusterInfrastructure/findings.md
+++ b/code-reviews/ClusterInfrastructure/findings.md
@@ -733,8 +733,8 @@ that proves the section name flows from this module to the Host._
|--|--|
| Severity | Low |
| Category | Design-document adherence |
-| Status | Open |
-| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs:30-33` |
+| Status | Resolved |
+| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs:30-43` |
**Description**
@@ -775,9 +775,15 @@ list, and add a test case for `SeedNodes.Count == 1` failing validation. Once th
module's validator enforces the rule, `Host.StartupValidator`'s duplicate check
becomes redundant and can be removed in the Host's review.
-**Resolution**
-
-_Open._
+**Resolution (2026-05-28):** Tightened the seed-node check to require
+`SeedNodes.Count >= 2` with a message that cites Component-ClusterInfrastructure.md
+→ Node Configuration ("both nodes are seed nodes"). Added
+`ClusterOptionsValidatorTests.SingleSeedNode_FailsValidation`; the existing
+`ServiceCollectionExtensionsTests.AddClusterInfrastructure_ValidatorRejectsBadOptionsAtResolution`
+still passes because errors accumulate (MinNrOfMembers=2 still fails as before).
+Removing the duplicate check in `Host.StartupValidator` is left for the Host's
+review per the original recommendation. Tests green (19/19 in
+ClusterInfrastructure.Tests).
### ClusterInfrastructure-013 — Test uses catastrophic config values without an inline-intent comment
diff --git a/code-reviews/Commons/findings.md b/code-reviews/Commons/findings.md
index cac9bae9..3d0e8461 100644
--- a/code-reviews/Commons/findings.md
+++ b/code-reviews/Commons/findings.md
@@ -816,9 +816,11 @@ owned by a Transport-component option, document the link.
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
| Location | `docs/requirements/Component-Commons.md:41-44`, `:75-79`, `:88-95`, `:107-117`, `:152-232` |
+**Resolution (2026-05-28):** Refreshed `Component-Commons.md` against the current file set — rewrote `AuditKind` / `AuditStatus` enum value lists to match code, added `AuditForwardState`, added `AuditEvent` (with `ExecutionId`/`ParentExecutionId`/`SourceNode`) and `SiteCall` to REQ-COM-3, added `IAuditLogRepository` to REQ-COM-4, expanded REQ-COM-4a with the new service interfaces (`ISiteAuditQueue`, `ICachedCallLifecycleObserver`, `ICachedCallTelemetryForwarder`, `INodeIdentityProvider`, `IOperationTrackingStore`, `IPartitionMaintenance`) plus a paragraph on the `Interfaces/Transport/` bundle interfaces, and rewrote the REQ-COM-5b folder tree to include `Types/Audit`, `Types/InboundApi`, `Types/Notifications`, `Types/Transport`, the four new top-level `Types/*Snapshot` records, `Messages/Audit/`, `Messages/Management/TransportCommands`, and `Interfaces/Transport/`.
+
**Description**
The Commons design doc has fallen materially behind the code:
@@ -1059,9 +1061,11 @@ behavior.
|--|--|
| Severity | Low |
| Category | Documentation & comments |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.Commons/Interfaces/Transport/IAuditCorrelationContext.cs:11`, `src/ScadaLink.Commons/Types/Transport/ImportPreview.cs:11`, `src/ScadaLink.Commons/Entities/Notifications/Notification.cs:33` |
+**Resolution (2026-05-28):** Replaced the unresolvable `` in `IAuditCorrelationContext` with a plain-text `BundleImporter.ApplyAsync` reference (qualified inline as "in the Transport component") so the XML doc no longer emits a CS1574 warning from Commons, which cannot see the implementation type. The `ImportPreviewItem.FieldDiffJson` / `Notification.ResolvedTargets` JSON-shape sub-point is tracked separately and not in scope for this close — the XML doc on `IAuditCorrelationContext` does not name those columns.
+
**Description**
Two related XML-doc weaknesses, both around the new Transport / Audit surface:
diff --git a/code-reviews/Communication/findings.md b/code-reviews/Communication/findings.md
index 4327b7fa..86c144d5 100644
--- a/code-reviews/Communication/findings.md
+++ b/code-reviews/Communication/findings.md
@@ -908,8 +908,8 @@ caller, so the reply skips the coordinator.)
|--|--|
| Severity | Low |
| Category | Design-document adherence |
-| Status | Open |
-| Location | `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:357-371` |
+| Status | Resolved |
+| Location | `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:376-465` |
**Description**
@@ -939,6 +939,16 @@ consume `IsActive`; or (b) drop the `IsActive` field from `HeartbeatMessage`
(additive-only-evolution: deprecate the field, default to `true`, plan
removal in a major message contract revision).
+**Resolution (2026-05-28):** Took option (a). `SiteCommunicationActor` now
+accepts an optional `Func? isActiveCheck` (default = real `Cluster.Get`
+leader check mirroring `ActiveNodeGate` / `ActiveNodeHealthCheck`) and
+`SendHeartbeatToCentral` stamps `HeartbeatMessage.IsActive` with the result.
+A try/catch keeps the heartbeat tick alive when the cluster state is
+unreadable (warm-up / TestKit without cluster plugin) — falls back to
+`IsActive: false`, the safe non-claiming value. Added parameterised test
+`Heartbeat_StampsIsActive_FromInjectedCheck`. Tests green (199/199 in
+Communication.Tests).
+
---
### Communication-019 — `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository
diff --git a/code-reviews/ConfigurationDatabase/findings.md b/code-reviews/ConfigurationDatabase/findings.md
index a6b83af9..17f5ec41 100644
--- a/code-reviews/ConfigurationDatabase/findings.md
+++ b/code-reviews/ConfigurationDatabase/findings.md
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
-| Open findings | 10 |
+| Open findings | 8 |
## Summary
@@ -1212,7 +1212,7 @@ boundary lookup resolves to the expected partition.
|--|--|
| Severity | Low |
| Category | Documentation & comments |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:8-14` |
**Description**
@@ -1236,13 +1236,21 @@ snapshots, and the Restrict-FK-aware `DeleteInstanceAsync` for the
deployment pipeline. Cross-reference the optimistic-concurrency contract on
`DeploymentRecord.RowVersion`.
+**Resolution (2026-05-28):**
+
+Replaced the stale "WP-24: Stub level sufficient for diff/staleness support" XML
+doc with an accurate one-paragraph summary covering `DeploymentRecord` CRUD
+(plus the `RowVersion` optimistic-concurrency contract), `SystemArtifactDeploymentRecord`
+CRUD, `DeployedConfigSnapshot` CRUD, and the Restrict-FK-aware `DeleteInstanceAsync`.
+No behaviour change.
+
### ConfigurationDatabase-023 — `AuditLog` correlation-index name drifts from design doc (`IX_AuditLog_CorrelationId` vs `IX_AuditLog_Correlation`)
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/AuditLogEntityTypeConfiguration.cs:99-101`, `Migrations/20260520142214_AddAuditLogTable.cs:103-107` |
**Description**
@@ -1268,6 +1276,14 @@ preserves the existing migration; renaming the index in the database requires a
migration that does `sp_rename`. Document-aligning is the lower-cost option and
matches the resolution pattern used for CD-005.
+**Resolution (2026-05-28):**
+
+Doc-aligns-to-code, preserving the existing migration. `docs/requirements/Component-AuditLog.md`
+already lists the index as `IX_AuditLog_CorrelationId` (line 109). The stale
+`IX_AuditLog_Correlation` reference in `docs/requirements/Component-ConfigurationDatabase.md`
+line 64 (AuditLog table prose) was updated to `IX_AuditLog_CorrelationId`. Code unchanged;
+index name remains `IX_AuditLog_CorrelationId`.
+
### ConfigurationDatabase-024 — Missing test coverage for SPLIT-RANGE failure-continuation and production-shape rowversion delete
| | |
diff --git a/code-reviews/DeploymentManager/findings.md b/code-reviews/DeploymentManager/findings.md
index 3457eb28..dce861ac 100644
--- a/code-reviews/DeploymentManager/findings.md
+++ b/code-reviews/DeploymentManager/findings.md
@@ -1012,8 +1012,8 @@ also produces an audit entry.
|--|--|
| Severity | Low |
| Category | Documentation & comments |
-| Status | Open |
-| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:683-686` |
+| Status | Resolved |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:698-712` |
**Description**
@@ -1041,6 +1041,16 @@ Use `user` (the parameter on `DeployInstanceAsync`, threaded through
`OriginalDeployer = prior.DeployedBy` in the detail object so the original
attribution is preserved without misrepresenting who took the action.
+**Resolution (2026-05-28):** Threaded the `user` parameter from
+`DeployInstanceAsync` into `TryReconcileWithSiteAsync` as a new `currentUser`
+argument (consistent with the DeploymentManager-018 `forceEnabledState`
+parameter-threading pattern) and rewrote the audit call to log
+`currentUser` as the actor with `OriginalDeployer = prior.DeployedBy` carried
+in the detail object. Added test
+`DeployInstanceAsync_Reconciled_AuditAttributesCurrentUserNotPriorDeployer`
+that pins the new attribution and asserts the prior deployer is no longer used
+as the actor. Tests green (80/80 in DeploymentManager.Tests).
+
### DeploymentManager-021 — `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing
| | |
diff --git a/code-reviews/ExternalSystemGateway/findings.md b/code-reviews/ExternalSystemGateway/findings.md
index 9e75f03f..1d445288 100644
--- a/code-reviews/ExternalSystemGateway/findings.md
+++ b/code-reviews/ExternalSystemGateway/findings.md
@@ -1261,9 +1261,11 @@ is in the body branch but not the design-doc list; see finding 023).
|--|--|
| Severity | Low |
| Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:241`, `docs/requirements/Component-ExternalSystemGateway.md:43` |
+**Resolution (2026-05-28):** Doc-only fix; confirmed PATCH is wired in `ExternalSystemClient.cs:258-260` alongside POST/PUT for body serialization. Added `PATCH` to the design doc's HTTP-method list (line 42) and updated the body/query-parameter sentence (line 75) so the documented set matches the code's `body = POST/PUT/PATCH; query = GET/DELETE` split.
+
**Description**
The component design doc lists the supported HTTP methods as `GET, POST, PUT, or
diff --git a/code-reviews/HealthMonitoring/findings.md b/code-reviews/HealthMonitoring/findings.md
index 6b99df3e..7bc81d4f 100644
--- a/code-reviews/HealthMonitoring/findings.md
+++ b/code-reviews/HealthMonitoring/findings.md
@@ -914,9 +914,11 @@ collector API once it lands.
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
| Location | `docs/requirements/Component-HealthMonitoring.md:39,40`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs`, `src/ScadaLink.AuditLog/Central/AuditCentralHealthSnapshot.cs:39-58` |
+**Resolution (2026-05-28):** Took the simpler doc-removal path rather than surfacing the metrics through a new HealthMonitoring API. Removed `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` from the Monitored Metrics table, the Audit Log KPIs section, and the Audit Log Dependencies entry in `Component-HealthMonitoring.md`. The two metrics remain internal to `AuditLog`'s `AuditCentralHealthSnapshot` — promoting them to dashboard tiles is a separate feature, out of scope here.
+
**Description**
`Component-HealthMonitoring.md` lists `SiteAuditTelemetryStalled` and
@@ -1104,9 +1106,11 @@ module's tests.
|--|--|
| Severity | Low |
| Category | Documentation & comments |
-| Status | Open |
+| Status | Resolved |
| Location | `tests/ScadaLink.HealthMonitoring.Tests/SiteHealthCollectorTests.cs:117-122` |
+**Resolution (2026-05-28):** Renamed the test method from `StoreAndForwardBufferDepths_IsEmptyPlaceholder` to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled` so the name describes what it actually asserts — the default-state contract of a fresh collector. No behaviour change; the body still constructs a collector without calling `SetStoreAndForwardDepths` and asserts `Empty(report.StoreAndForwardBufferDepths)`.
+
**Description**
The test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` was originally named
diff --git a/code-reviews/Host/findings.md b/code-reviews/Host/findings.md
index 23ec5aa3..bd041c16 100644
--- a/code-reviews/Host/findings.md
+++ b/code-reviews/Host/findings.md
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
-| Open findings | 7 |
+| Open findings | 6 |
## Summary
@@ -871,7 +871,7 @@ _Open._
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.Host/Program.cs:229-265`, `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs` |
**Description**
@@ -910,9 +910,32 @@ integration test under `tests/ScadaLink.Host.Tests` that starts a site host,
opens a stream, triggers shutdown, and asserts the stream completes with
`Cancelled` before the actor system tears down.
-**Resolution**
+**Resolution (2026-05-28):**
-_Open._
+REQ-HOST-7 steps (1)+(2) wired. `SiteStreamGrpcServer` gained:
+- a monotonic `_shuttingDown` flag,
+- `CancelAllStreams()` — flips the flag, cancels every `_activeStreams[*].Cts`
+ (with `ObjectDisposedException` swallow for entries cleaning themselves
+ up concurrently), idempotent on repeat calls,
+- a `SubscribeInstance` guard that returns `Unavailable "Server shutting
+ down"` for new subscriptions arriving after the flag flips.
+
+`Program.cs` site branch now resolves `IHostApplicationLifetime` and the
+`SiteStreamGrpcServer` singleton, then registers
+`ApplicationStopping.Register(() => siteGrpcServer.CancelAllStreams())`.
+`ApplicationStopping` fires before any `IHostedService.StopAsync`, so the
+gRPC server begins refusing new streams and tears down in-flight ones
+BEFORE `AkkaHostedService` runs `CoordinatedShutdown` — matching REQ-HOST-7's
+ordering. Clients observe a clean `Cancelled` and reconnect rather than a
+silent stream that times out via keepalive (~25 s).
+
+Two unit regression tests added to
+`tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcServerTests.cs`:
+`Host017_CancelAllStreams_CancelsActiveStreamsAndRefusesNewOnes` (active
+streams complete, new ones rejected) and `Host017_CancelAllStreams_IsIdempotent`
+(double-call safe). A full site-host integration test was deferred — the
+unit suite covers both server-side invariants and the wiring is a single
+`Register` line in `Program.cs`.
### Host-018 — Shipped per-role configs omit `NodeOptions.NodeName`, leaving `SourceNode` null
diff --git a/code-reviews/InboundAPI/findings.md b/code-reviews/InboundAPI/findings.md
index 9c23c56d..6d7d0003 100644
--- a/code-reviews/InboundAPI/findings.md
+++ b/code-reviews/InboundAPI/findings.md
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
-| Open findings | 8 |
+| Open findings | 7 |
## Summary
@@ -1017,7 +1017,7 @@ regression test posting with `application/JSON` and Transfer-Encoding: chunked.
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:141-143`, `:182-183`, `:225-226`; `src/ScadaLink.Commons/Messages/InboundApi/RouteToInstanceRequest.cs:15-21`, `:36-40`, `:55-59` |
**Description**
@@ -1055,6 +1055,25 @@ from `_parentExecutionId` in `RouteTarget.GetAttributes` and
their emitted audit rows. Add a `RouteHelperTests` regression case asserting
that an attribute read/write carries the inherited `ParentExecutionId`.
+**Resolution (2026-05-28):**
+
+Wire fix landed — `RouteToGetAttributesRequest` and `RouteToSetAttributesRequest`
+now carry a trailing additive `Guid? ParentExecutionId = null`, mirroring
+`RouteToCallRequest`; `RouteTarget.GetAttributes` and `RouteTarget.SetAttributes`
+stamp `_parentExecutionId` onto the request, so the field travels off the
+inbound API box symmetrically across all three `Route.To()` verbs. Four
+regression tests added to `RouteHelperTests` (with/without parent id for both
+verbs).
+
+Site-side audit emission for routed reads/writes is NOT currently wired —
+`DeploymentManagerActor.RouteInboundApiGetAttributes` / `…SetAttributes` and
+`InstanceActor.HandleGetAttribute` / `HandleSetStaticAttribute` /
+`HandleSetDataAttribute` do not call `IAuditWriter` today (only script-driven
+`AuditingDbConnection` / `AuditingDbCommand` paths emit audit rows on the site
+side). The `ParentExecutionId` is now available on the wire so once those
+audit emissions land (tracked separately under the Audit Log site-wiring
+backlog), they can stamp the parent id without any further plumbing.
+
### InboundAPI-022 — `IActiveNodeGate` has no production registration in Host — standby-node gating is silently disabled in production
| | |
diff --git a/code-reviews/ManagementService/findings.md b/code-reviews/ManagementService/findings.md
index 1217d1c0..b04c9ffe 100644
--- a/code-reviews/ManagementService/findings.md
+++ b/code-reviews/ManagementService/findings.md
@@ -1029,9 +1029,11 @@ the actor tests; the bundle round-trip belongs in `Transport` tests).
|--|--|
| Severity | Low |
| Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
| Location | `docs/requirements/Component-ManagementService.md:77`–`:175`, `:205`–`:209` |
+**Resolution (2026-05-28):** Updated `Component-ManagementService.md` — added a "Transport (Bundle Import / Export)" entry under "Message Groups" listing `ExportBundle` (Design), `PreviewBundle`/`ImportBundle` (Admin); inserted a new "HTTP Audit API" section describing `GET /api/audit/query` (`OperationalAudit`) and `GET /api/audit/export` (`AuditExport`) with the site-scope intersection rule; and rewrote the Configuration table to document the now-wired `CommandTimeout` (30 s default, non-positive falls back). Legacy `QueryAuditLog` is now annotated as Admin-gated and superseded by the REST surface for #23.
+
**Description**
`Component-ManagementService.md` does not mention three pieces of shipped functionality:
diff --git a/code-reviews/NotificationOutbox/findings.md b/code-reviews/NotificationOutbox/findings.md
index f2edff0a..23136b66 100644
--- a/code-reviews/NotificationOutbox/findings.md
+++ b/code-reviews/NotificationOutbox/findings.md
@@ -322,7 +322,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs:13`, `:22`, `:25`; `docs/requirements/Component-NotificationOutbox.md:152-160` |
**Description**
@@ -352,9 +352,13 @@ defaults, or remove them from the implementation if they were meant to be fixed
constants. Cross-link `DeliveredKpiWindow` from the §Monitoring "Delivered (last
interval)" KPI bullet so a reader sees what controls the bucket length.
-**Resolution**
+**Resolution (2026-05-28):**
-_Unresolved._
+Resolved the code-side gap by adding clear XML `` docs to every property on
+`NotificationOutboxOptions` (`DispatchInterval`, `DispatchBatchSize`,
+`StuckAgeThreshold`, `TerminalRetention`, `PurgeInterval`, `DeliveredKpiWindow`) —
+each now states what it controls and its default value. The design-doc update
+remains tracked separately.
### NotificationOutbox-008 — `FallbackMaxRetries` / `FallbackRetryDelay` path is unreachable in production AND untested
@@ -402,7 +406,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Documentation & comments |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs:15-16` |
**Description**
@@ -429,9 +433,11 @@ Rewrite the XML to match the design: "Age past which a still-`Pending`/`Retrying
notification is counted as stuck on the KPI tile and the per-row badge.
Display-only — does not affect dispatch."
-**Resolution**
+**Resolution (2026-05-28):**
-_Unresolved._
+Rewrote the `StuckAgeThreshold` XML-doc to make the display-only semantics explicit:
+rows older than the threshold are flagged in KPIs/UI; there is no automatic re-claim,
+requeue, or escalation. Matches `Component-NotificationOutbox.md §Monitoring`.
### NotificationOutbox-010 — Comment claims `PipeTo` is not used "because the writer never throws"; the surrounding try/catch is dead-letter for the documented failure mode
diff --git a/code-reviews/NotificationService/findings.md b/code-reviews/NotificationService/findings.md
index a410dc35..00bd090f 100644
--- a/code-reviews/NotificationService/findings.md
+++ b/code-reviews/NotificationService/findings.md
@@ -756,9 +756,11 @@ Document the per-send lifecycle on `MailKitSmtpClientWrapper` (XML on the class:
|--|--|
| Severity | Low |
| Category | Documentation & comments |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:12-17`, `src/ScadaLink.Commons/Interfaces/Services/INotificationDeliveryService.cs:3-12`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:8-9` |
+**Resolution (2026-05-28):** Closed by NS-019 — both `NotificationDeliveryService.cs` and `INotificationDeliveryService.cs` were removed in commit `ac96b83`, and `ServiceCollectionExtensions.AddNotificationService`'s XML doc was rewritten in the same commit to describe the central-only design (shared SMTP primitives consumed by `EmailNotificationDeliveryAdapter`, with an explicit NS-019 cross-reference and a note that sites no longer deliver notifications). No stale XML docs remain in this module.
+
**Description**
XML comments still claim the dead path is the live path:
diff --git a/code-reviews/README.md b/code-reviews/README.md
index 1aec089c..e925ce0e 100644
--- a/code-reviews/README.md
+++ b/code-reviews/README.md
@@ -41,37 +41,37 @@ module file and counted in **Total**.
|----------|---------------|
| Critical | 0 |
| High | 0 |
-| Medium | 46 |
-| Low | 90 |
-| **Total** | **136** |
+| Medium | 41 |
+| Low | 71 |
+| **Total** | **112** |
## Module Status
| Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
|--------|---------------|--------|----------------|------|-------|
-| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/8 | 11 | 11 |
-| [CLI](CLI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/4 | 6 | 23 |
+| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/6 | 9 | 11 |
+| [CLI](CLI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/3 | 5 | 23 |
| [CentralUI](CentralUI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/5 | 7 | 33 |
-| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/4 | 4 | 14 |
-| [Commons](Commons/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/6 | 9 | 23 |
-| [Communication](Communication/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/5 | 6 | 22 |
-| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/4/5 | 9 | 24 |
+| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/3 | 3 | 14 |
+| [Commons](Commons/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/5 | 7 | 23 |
+| [Communication](Communication/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/4 | 5 | 22 |
+| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/4/3 | 7 | 24 |
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 |
-| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/5 | 6 | 24 |
-| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/3 | 5 | 23 |
-| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/5 | 7 | 23 |
-| [Host](Host/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/5 | 7 | 22 |
-| [InboundAPI](InboundAPI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/4 | 7 | 25 |
-| [ManagementService](ManagementService/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/2 | 4 | 23 |
-| [NotificationOutbox](NotificationOutbox/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/3 | 5 | 10 |
-| [NotificationService](NotificationService/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/3 | 5 | 25 |
+| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/4 | 5 | 24 |
+| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/2 | 4 | 23 |
+| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/4 | 5 | 23 |
+| [Host](Host/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/5 | 6 | 22 |
+| [InboundAPI](InboundAPI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/4 | 6 | 25 |
+| [ManagementService](ManagementService/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/1 | 3 | 23 |
+| [NotificationOutbox](NotificationOutbox/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/2 | 3 | 10 |
+| [NotificationService](NotificationService/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/2 | 4 | 25 |
| [Security](Security/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/2 | 2 | 21 |
-| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/4 | 6 | 6 |
-| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/6 | 8 | 23 |
-| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/3 | 5 | 26 |
+| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/2 | 4 | 6 |
+| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/5 | 7 | 23 |
+| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/2 | 4 | 26 |
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/3 | 6 | 24 |
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/4/1 | 5 | 22 |
-| [Transport](Transport/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/4 | 6 | 12 |
+| [Transport](Transport/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/3 | 5 | 12 |
## Pending Findings
@@ -88,7 +88,7 @@ _None open._
_None open._
-### Medium (46)
+### Medium (41)
| ID | Module | Title |
|----|--------|-------|
@@ -100,7 +100,6 @@ _None open._
| CentralUI-026 | [CentralUI](CentralUI/findings.md) | `AuditFilterBar` From/To filters treat browser-local datetimes as UTC |
| CentralUI-027 | [CentralUI](CentralUI/findings.md) | Same UTC misinterpretation in `SiteCallsReport`, `NotificationReport`, and `EventLogs` |
| Commons-015 | [Commons](Commons/findings.md) | `EncryptionMetadata` accepts any algorithm string and any iteration count |
-| Commons-017 | [Commons](Commons/findings.md) | `Component-Commons.md` is significantly stale (audit enums, new entities, new repositories, new service interfaces, new folders) |
| Commons-019 | [Commons](Commons/findings.md) | New `*Utc`-suffixed `DateTime` columns on `AuditEvent` / `SiteCall` are not enforced as UTC; inconsistent with `Notification`'s `DateTimeOffset` |
| Communication-017 | [Communication](Communication/findings.md) | `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up |
| ConfigurationDatabase-016 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `InboundApiRepository.GetApiKeyByValueAsync` hashes the candidate with the unpeppered `ApiKeyHasher.Default` |
@@ -111,16 +110,12 @@ _None open._
| ExternalSystemGateway-019 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `HttpClient.Timeout` is not set; `DefaultHttpTimeout` > 100s is silently clipped by the framework default |
| ExternalSystemGateway-020 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `JsonElementToParameterValue` silently downcasts non-Int64 JSON numbers to `double`, losing precision for `decimal` SQL parameters on retry |
| HealthMonitoring-017 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthReportSender` resets interval counters before `Send`; transport failures silently drop the interval's error counts |
-| HealthMonitoring-019 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` design-doc metrics have no HealthMonitoring-side surface |
| Host-016 | [Host](Host/findings.md) | Site `CentralContactPoints` second entry targets the site's own remoting port |
-| Host-017 | [Host](Host/findings.md) | Site-shutdown ordering from REQ-HOST-7 is not wired |
| InboundAPI-018 | [InboundAPI](InboundAPI/findings.md) | `AuditWriteMiddleware` fires `WriteAsync` as `_ = task` — faulted async writes are unobserved |
-| InboundAPI-021 | [InboundAPI](InboundAPI/findings.md) | `ParentExecutionId` correlation flows only through `Call`; attribute reads/writes lose the inbound→site execution-tree link |
| InboundAPI-025 | [InboundAPI](InboundAPI/findings.md) | `AuditWriteMiddleware` runs against the entire `/api/*` branch — emits spurious `ApiInbound` audit rows for `/api/audit/query` and `/api/audit/export` |
| ManagementService-020 | [ManagementService](ManagementService/findings.md) | UpdateSmtpConfig returns and audits the SMTP Credentials field verbatim |
| ManagementService-021 | [ManagementService](ManagementService/findings.md) | Transport bundle handlers have zero test coverage |
| NotificationOutbox-005 | [NotificationOutbox](NotificationOutbox/findings.md) | Ingest persistence inherits the CD-015 check-then-act race; under contention the second writer throws and the site retries |
-| NotificationOutbox-007 | [NotificationOutbox](NotificationOutbox/findings.md) | `NotificationOutboxOptions.DispatchBatchSize`, `DeliveredKpiWindow`, and `PurgeInterval` are not in the design document |
| NotificationService-020 | [NotificationService](NotificationService/findings.md) | NS-001 fix superseded; `AkkaHostedService` would register two competing `Notification` S&F handlers if both code paths ran |
| NotificationService-024 | [NotificationService](NotificationService/findings.md) | No test affirms the central-only invariant; the orphaned-path tests give a false coverage signal |
| SiteCallAudit-001 | [SiteCallAudit](SiteCallAudit/findings.md) | SupervisorStrategy override is dead code; XML claims Resume that is not enforced |
@@ -139,60 +134,49 @@ _None open._
| Transport-004 | [Transport](Transport/findings.md) | `MaxUnlockAttemptsPerIpPerHour` option is declared but never enforced |
| Transport-010 | [Transport](Transport/findings.md) | Critical Overwrite + cross-cutting paths uncovered by tests |
-### Low (90)
+### Low (71)
| ID | Module | Title |
|----|--------|-------|
-| AuditLog-002 | [AuditLog](AuditLog/findings.md) | `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider |
| AuditLog-003 | [AuditLog](AuditLog/findings.md) | `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously |
| AuditLog-006 | [AuditLog](AuditLog/findings.md) | `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock |
| AuditLog-007 | [AuditLog](AuditLog/findings.md) | `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations |
| AuditLog-008 | [AuditLog](AuditLog/findings.md) | Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain |
-| AuditLog-009 | [AuditLog](AuditLog/findings.md) | `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't |
| AuditLog-010 | [AuditLog](AuditLog/findings.md) | Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream |
| AuditLog-011 | [AuditLog](AuditLog/findings.md) | `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call |
| CLI-020 | [CLI](CLI/findings.md) | `bundle export` success-envelope parse is unguarded |
| CLI-021 | [CLI](CLI/findings.md) | `CliConfig.Load` crashes the CLI on a malformed config file |
| CLI-022 | [CLI](CLI/findings.md) | `CommandTreeTests` excludes the two new command groups |
-| CLI-023 | [CLI](CLI/findings.md) | `Component-CLI.md` claims audit commands ride `POST /management`; implementation uses REST endpoints |
| CentralUI-029 | [CentralUI](CentralUI/findings.md) | `ConfigurationAuditLog` uses `JS.InvokeAsync("eval", ...)` instead of a dedicated JS module |
| CentralUI-030 | [CentralUI](CentralUI/findings.md) | `SandboxConsoleCapture`'s per-call `StringWriter` is not thread-safe under intra-script concurrency |
| CentralUI-031 | [CentralUI](CentralUI/findings.md) | `TransportImport` buffers the full bundle bytes in component state |
| CentralUI-032 | [CentralUI](CentralUI/findings.md) | `AuditResultsGrid` paging is forward-only, no Previous button |
| CentralUI-033 | [CentralUI](CentralUI/findings.md) | Drill-in / query-string code paths for the new Transport + SiteCalls pages are untested |
| ClusterInfrastructure-011 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `SectionName` constant is decorative — no binding site references it |
-| ClusterInfrastructure-012 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Validator accepts `SeedNodes.Count == 1` despite design requiring both nodes as seeds |
| ClusterInfrastructure-013 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Test uses catastrophic config values without an inline-intent comment |
| ClusterInfrastructure-014 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `AddClusterInfrastructureActors` is dead surface — no caller, no behaviour |
| Commons-016 | [Commons](Commons/findings.md) | `BundleSession.Locked` uses a magic `3` rather than a named constant |
| Commons-018 | [Commons](Commons/findings.md) | `IOperationTrackingStore` and `IPartitionMaintenance` are at the root of `Interfaces/` instead of `Interfaces/Services/` |
| Commons-020 | [Commons](Commons/findings.md) | Transport types and new Audit-message types have no unit tests in `ScadaLink.Commons.Tests` |
| Commons-021 | [Commons](Commons/findings.md) | `ExternalCallResult.Response` has a benign lazy-parse race |
-| Commons-022 | [Commons](Commons/findings.md) | `IAuditCorrelationContext` references an unresolvable `BundleImporter.ApplyAsync` cref; JSON-blob columns have no documented shape |
| Commons-023 | [Commons](Commons/findings.md) | Trailing-optional `SourceNode` on positional records mixes additive evolution patterns |
-| Communication-018 | [Communication](Communication/findings.md) | Site heartbeats hard-code `IsActive: true` regardless of node role |
| Communication-019 | [Communication](Communication/findings.md) | `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository |
| Communication-020 | [Communication](Communication/findings.md) | `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types |
| Communication-021 | [Communication](Communication/findings.md) | `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try |
| Communication-022 | [Communication](Communication/findings.md) | `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber |
| ConfigurationDatabase-020 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `GetPartitionBoundariesOlderThanAsync` returns `DateTime` with `Kind=Unspecified` |
| ConfigurationDatabase-021 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `SwitchOutPartitionAsync` interpolates `monthBoundary` / staging table name into raw SQL |
-| ConfigurationDatabase-022 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Stale "WP-24 Stub level sufficient for diff/staleness support" XML comment on `DeploymentManagerRepository` |
-| ConfigurationDatabase-023 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `AuditLog` correlation-index name drifts from design doc (`IX_AuditLog_CorrelationId` vs `IX_AuditLog_Correlation`) |
| ConfigurationDatabase-024 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Missing test coverage for SPLIT-RANGE failure-continuation and production-shape rowversion delete |
-| DeploymentManager-020 | [DeploymentManager](DeploymentManager/findings.md) | `DeployReconciled` audit attributes the action to the prior deployer, not the current user |
| DeploymentManager-021 | [DeploymentManager](DeploymentManager/findings.md) | `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing |
| DeploymentManager-022 | [DeploymentManager](DeploymentManager/findings.md) | `Pending` and `InProgress` are written back-to-back with no intervening work |
| DeploymentManager-023 | [DeploymentManager](DeploymentManager/findings.md) | `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site |
| DeploymentManager-024 | [DeploymentManager](DeploymentManager/findings.md) | Test probe actors hold mutable static state across tests |
| ExternalSystemGateway-021 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ApplyAuth` silently sends an unauthenticated request on unknown `AuthType`, empty `AuthConfiguration`, or malformed Basic config |
| ExternalSystemGateway-022 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `new HttpMethod(method.HttpMethod)` accepts any string at runtime; an invalid HTTP verb fails only at call time |
-| ExternalSystemGateway-023 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | PATCH HTTP method is supported by code but absent from the design doc; body-vs-query decision drifts from the documented set |
| HealthMonitoring-018 | [HealthMonitoring](HealthMonitoring/findings.md) | Same counter-reset-before-publish hazard in `CentralHealthReportLoop` |
| HealthMonitoring-020 | [HealthMonitoring](HealthMonitoring/findings.md) | `MarkHeartbeat` brings offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt` |
| HealthMonitoring-021 | [HealthMonitoring](HealthMonitoring/findings.md) | `CentralSiteId = "central"` reserved constant silently collides with a real site named "central" |
| HealthMonitoring-022 | [HealthMonitoring](HealthMonitoring/findings.md) | `CentralHealthReportLoopTests` uses real-time `PeriodicTimer` + `Task.Delay`; flake-prone on slow CI |
-| HealthMonitoring-023 | [HealthMonitoring](HealthMonitoring/findings.md) | `StoreAndForwardBufferDepths_IsEmptyPlaceholder` test name is stale; it now covers the default-state contract, not a placeholder |
| Host-018 | [Host](Host/findings.md) | Shipped per-role configs omit `NodeOptions.NodeName`, leaving `SourceNode` null |
| Host-019 | [Host](Host/findings.md) | Migration `StartupRetry` call drops the host `CancellationToken` |
| Host-020 | [Host](Host/findings.md) | `MinimumLevel.Is` silently overrides any operator-set `Serilog:MinimumLevel` |
@@ -202,34 +186,26 @@ _None open._
| InboundAPI-020 | [InboundAPI](InboundAPI/findings.md) | `ContentType.Contains("json")` is case-sensitive; `application/JSON` with no Content-Length skips body parsing |
| InboundAPI-023 | [InboundAPI](InboundAPI/findings.md) | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test coverage |
| InboundAPI-024 | [InboundAPI](InboundAPI/findings.md) | `_knownBadMethods` is unbounded — an attacker can grow the cache by spamming distinct method names against the audit middleware path |
-| ManagementService-022 | [ManagementService](ManagementService/findings.md) | Design doc is stale on Transport bundle commands, /api/audit/* endpoints, and CommandTimeout |
| ManagementService-023 | [ManagementService](ManagementService/findings.md) | HandleQueryDeployments unfiltered branch is N+1 on instance lookup |
| NotificationOutbox-006 | [NotificationOutbox](NotificationOutbox/findings.md) | `ResolveAdapters` rebuilds the `NotificationType → adapter` dictionary on every dispatch sweep |
| NotificationOutbox-008 | [NotificationOutbox](NotificationOutbox/findings.md) | `FallbackMaxRetries` / `FallbackRetryDelay` path is unreachable in production AND untested |
-| NotificationOutbox-009 | [NotificationOutbox](NotificationOutbox/findings.md) | `StuckAgeThreshold` XML-doc says "in-progress notification is re-claimed" — contradicts the design's display-only stuck detection |
| NotificationService-022 | [NotificationService](NotificationService/findings.md) | `MailKitSmtpClientWrapper` holds a long-lived `SmtpClient`; combined with per-send factory, the design comment about pooling is contradicted |
-| NotificationService-023 | [NotificationService](NotificationService/findings.md) | XML docs on the orphaned classes still describe the removed site-delivery flow; misleading to maintainers |
| NotificationService-025 | [NotificationService](NotificationService/findings.md) | `CredentialRedactor` over-masks: any 4-character credential component is masked anywhere it appears, including unrelated log text |
| Security-020 | [Security](Security/findings.md) | `SecurityOptions` has no startup validation for required fields (`LdapServer`, `LdapSearchBase`) |
| Security-021 | [Security](Security/findings.md) | `RequireHttpsCookie=false` dev opt-out has no warning path — an HTTP production deployment silently transmits the JWT bearer credential in cleartext |
| SiteCallAudit-002 | [SiteCallAudit](SiteCallAudit/findings.md) | Singleton failover does not wait for in-flight async upserts |
-| SiteCallAudit-004 | [SiteCallAudit](SiteCallAudit/findings.md) | Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift |
-| SiteCallAudit-005 | [SiteCallAudit](SiteCallAudit/findings.md) | `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing |
| SiteCallAudit-006 | [SiteCallAudit](SiteCallAudit/findings.md) | Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor |
| SiteEventLogging-018 | [SiteEventLogging](SiteEventLogging/findings.md) | `FailedWriteCount` is exposed but never consumed by Health Monitoring |
-| SiteEventLogging-019 | [SiteEventLogging](SiteEventLogging/findings.md) | `EventLogPurgeService` runs on every host node; design says "active node" |
| SiteEventLogging-020 | [SiteEventLogging](SiteEventLogging/findings.md) | `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced |
| SiteEventLogging-021 | [SiteEventLogging](SiteEventLogging/findings.md) | `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales |
| SiteEventLogging-022 | [SiteEventLogging](SiteEventLogging/findings.md) | `Cache=Shared` is redundant for a single-connection logger |
| SiteEventLogging-023 | [SiteEventLogging](SiteEventLogging/findings.md) | Concurrent-stress test uses a non-volatile `stop` flag |
| SiteRuntime-023 | [SiteRuntime](SiteRuntime/findings.md) | `Convert.ToDouble(value)` in trigger and alarm evaluation is locale-sensitive |
| SiteRuntime-025 | [SiteRuntime](SiteRuntime/findings.md) | `HandleSetStaticAttribute` persists unknown attribute names as static overrides |
-| SiteRuntime-026 | [SiteRuntime](SiteRuntime/findings.md) | `ReplicationMessages.cs` public record types have no XML documentation |
| StoreAndForward-022 | [StoreAndForward](StoreAndForward/findings.md) | `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId` |
| StoreAndForward-023 | [StoreAndForward](StoreAndForward/findings.md) | `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation |
| StoreAndForward-024 | [StoreAndForward](StoreAndForward/findings.md) | `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown |
| TemplateEngine-022 | [TemplateEngine](TemplateEngine/findings.md) | `LockEnforcer.ValidateLockChange` enforces "once-locked-stays-locked" for `IsLocked` but not for `LockedInDerived` |
| Transport-008 | [Transport](Transport/findings.md) | `PreviewAsync` issues an N+1 `GetTemplateWithChildrenAsync` per matching template name |
| Transport-009 | [Transport](Transport/findings.md) | `IAuditCorrelationContext.BundleImportId` is mutated on the same scoped instance the AuditService reads |
-| Transport-011 | [Transport](Transport/findings.md) | Design doc's Step-1 manifest preview promises decryption-free preview, but `LoadAsync` reads and validates content before passphrase |
| Transport-012 | [Transport](Transport/findings.md) | "Bundle Import" filter promised in design doc not surfaced in Configuration Audit Log Viewer UI |
diff --git a/code-reviews/SiteCallAudit/findings.md b/code-reviews/SiteCallAudit/findings.md
index 7a28dbec..8ff091ac 100644
--- a/code-reviews/SiteCallAudit/findings.md
+++ b/code-reviews/SiteCallAudit/findings.md
@@ -202,7 +202,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:23-30` (actor XML), `src/ScadaLink.SiteCallAudit/ServiceCollectionExtensions.cs:8-13`, `docs/requirements/Component-SiteCallAudit.md:24-32` |
**Description**
@@ -229,9 +229,13 @@ doc as deferred — the doc reads as if it ships.
listing what's not yet implemented (matches the pattern Audit Log uses for
its tamper-evidence hash chain).
-**Resolution**
+**Resolution (2026-05-28):**
-_Unresolved._
+Updated the class-level XML on `SiteCallAuditActor` to reflect actual state:
+telemetry ingest, query/detail/KPI handlers (Task 4), and the central→site
+Retry/Discard relay (Task 5) are implemented; the periodic reconciliation
+puller and the daily terminal-row purge scheduler remain deferred. The design
+doc update is tracked separately.
### SiteCallAudit-005 — `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing
@@ -239,7 +243,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Documentation & comments |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:548-563` |
**Description**
@@ -281,9 +285,16 @@ SiteCallRelayOutcome.SiteUnreachable =>
— fail fast if the invariant is ever violated by a refactor.
-**Resolution**
+**Resolution (2026-05-28):**
-_Unresolved._
+Behaviour kept (return `ack.ErrorMessage`); `AckErrorMessage` stays total and
+side-effect-free. Expanded the inline comment on the `SiteUnreachable` arm to
+explain WHY it returns rather than throws: site-unreachable is classified as
+transient by the upstream relay (which has already built its
+`SiteUnreachable` response and detail text via `SiteUnreachableMessage`), so a
+defensive fall-through surfaces the ack's message and lets the caller schedule
+a retry — throwing would turn a benign refactor invariant violation into a
+relay-path crash.
### SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor
diff --git a/code-reviews/SiteEventLogging/findings.md b/code-reviews/SiteEventLogging/findings.md
index 219c6cd1..216228d8 100644
--- a/code-reviews/SiteEventLogging/findings.md
+++ b/code-reviews/SiteEventLogging/findings.md
@@ -919,8 +919,8 @@ file a tracking item for the wiring. The current doc claim is misleading.
|--|--|
| Severity | Low |
| Category | Design-document adherence |
-| Status | Open |
-| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:21`, `docs/requirements/Component-SiteEventLogging.md:45` |
+| Status | Resolved |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:57-95`, `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:30-39`, `docs/requirements/Component-SiteEventLogging.md:45` |
**Description**
@@ -949,6 +949,21 @@ the design doc to "the purge runs on every node against its own local database;
on the standby it is a no-op". Pick one; the current mismatch is a doc-vs-code
defect.
+**Resolution (2026-05-28):** Took option (a) at the loop level — registration
+stays unchanged on every host. Introduced a `SiteEventLogActiveNodeCheck`
+delegate that `EventLogPurgeService` consults at the top of every
+`RunPurge()` tick; standby returns early with a debug log. The DI factory
+resolves the delegate from the container so the Host can register the real
+check on a site node, and a null/unregistered delegate falls back to the
+prior "always run" behaviour (backward compatible for non-clustered hosts and
+existing tests). Defensive try/catch around the check defaults to "run" so a
+transient cluster-state read failure cannot stop the purge loop. Added tests
+`RunPurge_OnStandbyNode_SkipsAllWork`,
+`RunPurge_OnActiveNode_RunsTheRetentionPurge`, and
+`RunPurge_WithNullCheck_FallsBackToRunning`. Wiring the real check on the
+Host's site-role branch is left for the Host's review. Tests green (50/50 in
+SiteEventLogging.Tests).
+
### SiteEventLogging-020 — `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced
| | |
diff --git a/code-reviews/SiteRuntime/findings.md b/code-reviews/SiteRuntime/findings.md
index f1dbb6f9..531cb8a8 100644
--- a/code-reviews/SiteRuntime/findings.md
+++ b/code-reviews/SiteRuntime/findings.md
@@ -1307,7 +1307,7 @@ failure response).
|--|--|
| Severity | Low |
| Category | Documentation & comments |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:10`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:13`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:15`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:17`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:19`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:25`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:28`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:30`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:32`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:34` |
**Description**
@@ -1334,3 +1334,13 @@ inbound vs outbound split with a marker base type (currently they're just
named conventionally) so `Receive` vs `Receive` is
expressed at the type level — but that's optional and out of scope for a
docs-only finding.
+
+**Resolution (2026-05-28):**
+
+Added a one-line `` to each of the ten records
+(`ReplicateConfigDeploy`/`Remove`/`SetEnabled`/`Artifacts`/`StoreAndForward` and
+`ApplyConfigDeploy`/`Remove`/`SetEnabled`/`Artifacts`/`StoreAndForward`) naming
+the direction (outbound to peer / inbound from peer) and what is replicated.
+The two pre-existing group-header XML blocks were converted to plain `//`
+comments to avoid orphaned doc-summaries above the first record in each group.
+Marker-base-type idea left out of scope.
diff --git a/code-reviews/Transport/findings.md b/code-reviews/Transport/findings.md
index e48d4ab6..d1d26eee 100644
--- a/code-reviews/Transport/findings.md
+++ b/code-reviews/Transport/findings.md
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
-| Open findings | 12 |
+| Open findings | 11 |
## Summary
@@ -412,7 +412,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Documentation & comments |
-| Status | Open |
+| Status | Resolved |
| Location | `docs/requirements/Component-Transport.md` Import Flow Step 1, `src/ScadaLink.Transport/Import/BundleImporter.cs:124-203` |
**Description**
@@ -437,9 +437,14 @@ skips the content read for the pure preview case, or (b) update the design
doc to clarify the full envelope is read on every `LoadAsync` and the cheap
"peek" is conceptual rather than runtime.
-**Resolution**
+**Resolution (2026-05-28):**
-_Unresolved._
+Took option (b) — `docs/requirements/Component-Transport.md` (§"`manifest.json`
+(plaintext)") now carries an explicit implementation note clarifying that
+`BundleImporter.LoadAsync` reads the full envelope and verifies the content
+hash on every call regardless of passphrase availability, that the encrypted-
+bundle prompt is surfaced AFTER the manifest+hash check, and that a dedicated
+`ReadManifestAsync(Stream)` is a deferred optimisation. Code unchanged.
### Transport-012 — "Bundle Import" filter promised in design doc not surfaced in Configuration Audit Log Viewer UI
diff --git a/docs/requirements/Component-CLI.md b/docs/requirements/Component-CLI.md
index c0025756..434bf2df 100644
--- a/docs/requirements/Component-CLI.md
+++ b/docs/requirements/Component-CLI.md
@@ -307,8 +307,8 @@ Configuration is resolved in the following priority order (highest wins):
- **Commons**: Message contracts (`Messages/Management/`) for command type definitions and registry.
- **System.CommandLine**: Command-line argument parsing.
- **Microsoft.AspNetCore.SignalR.Client**: SignalR client for the `debug stream` command's WebSocket connection.
-- **Management Service (#18)**: The CLI hits the central cluster via the existing HTTP Management API (`POST /management`), which dispatches to the ManagementActor. The `scadalink audit` command group rides this same transport — there is no separate audit endpoint.
-- **Audit Log (#23)**: The `scadalink audit query`, `audit export`, and `audit verify-chain` subcommands target the centralized Audit Log component's query/export/verify surfaces via the Management API. Permission checks (`OperationalAudit`, `AuditExport`) are enforced server-side.
+- **Management Service (#18)**: The CLI hits the central cluster via the existing HTTP Management API (`POST /management`), which dispatches to the ManagementActor. The `scadalink audit` command group rides a parallel REST surface on the same Host (`GET /api/audit/query` and `GET /api/audit/export`), sharing HTTP Basic Auth with `/management` but bypassing the actor for read-only, keyset-paged / streaming workloads.
+- **Audit Log (#23)**: The `scadalink audit query` and `audit export` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`) on the Host's Management API surface; `audit verify-chain` rides `POST /management` until hash-chain verification ships. Permission checks (`OperationalAudit`, `AuditExport`) are enforced server-side by `AuditEndpoints`.
## Interactions
diff --git a/docs/requirements/Component-Commons.md b/docs/requirements/Component-Commons.md
index f3a1111c..14800adc 100644
--- a/docs/requirements/Component-Commons.md
+++ b/docs/requirements/Component-Commons.md
@@ -39,9 +39,11 @@ Commons must define shared primitive and utility types used across multiple comp
- **`TrackedOperationKind` enum**: ExternalCall, DatabaseWrite. Discriminates the two cached-call kinds carried by a tracked operation (notifications are tracked separately via the `NotificationType` enum).
- **`TrackedOperationStatus` enum**: Pending, Retrying, Delivered, Parked, Failed, Discarded. The unified lifecycle state shared by all tracked store-and-forward operations. This is the operation's externally-observable lifecycle status in the site-local tracking table (the status record); it is related to but distinct from the S&F buffer's own `StoreAndForwardMessageStatus`, which tracks a buffered message's retry state within the buffer (the retry mechanism). `Failed` (permanent failure) has no notification analogue — notifications use only the other five states (the `NotificationStatus` enum omits `Failed`).
- **`AuditChannel` enum**: ApiOutbound, DbOutbound, Notification, ApiInbound. Discriminates the script-trust-boundary channel that produced an `AuditEvent`. Owned by the Audit Log component.
-- **`AuditKind` enum**: SyncCall, CachedEnqueued, CachedAttempt, CachedTerminal, SyncWrite, SyncRead, Enqueued, Attempt, Terminal, Completed. Channel-specific event kind — the valid `Kind` values for each `AuditChannel` are listed in the Audit Log component design (`Component-AuditLog.md`).
-- **`AuditStatus` enum**: Success, TransientFailure, PermanentFailure, Enqueued, Retrying, Delivered, Parked, Discarded. Outcome of a single audit event row; superset of `TrackedOperationStatus` to also cover one-shot sync calls.
-- **`AuditEvent`**: A record carrying every column of the central `AuditLog` row — `EventId` (GUID, idempotency key), `OccurredAtUtc`, `IngestedAtUtc`, `Channel` (`AuditChannel`), `Kind` (`AuditKind`), `CorrelationId`, `SourceSiteId`, `SourceInstanceId`, `SourceScript`, `Actor`, `Target`, `Status` (`AuditStatus`), `HttpStatus`, `DurationMs`, `ErrorMessage`, `ErrorDetail`, `RequestSummary`, `ResponseSummary`, `PayloadTruncated`, `Extra` — plus a site-only `ForwardState` (`Pending` | `Forwarded` | `Reconciled`) used by the site SQLite write-buffer's telemetry/reconciliation loop. `IngestedAtUtc` is unset at the site and stamped on central ingest. See `Component-AuditLog.md` for the persistence schema and ingest semantics.
+- **`AuditKind` enum**: ApiCall, ApiCallCached, DbWrite, DbWriteCached, NotifySend, NotifyDeliver, InboundRequest, InboundAuthFailure, CachedSubmit, CachedResolve. Channel-specific event kind — the valid `Kind` values for each `AuditChannel` are listed in the Audit Log component design (`Component-AuditLog.md`).
+- **`AuditStatus` enum**: Submitted, Forwarded, Attempted, Delivered, Failed, Parked, Discarded, Skipped. Lifecycle status of an audit event row; cached operations transit Submitted → Forwarded → Attempted → Delivered/Parked/Discarded. `Skipped` covers short-circuited (e.g. dry-run) actions that should still be audited.
+- **`AuditForwardState` enum**: Pending, Forwarded, Reconciled. Site-local SQLite flag governing the telemetry/reconciliation loop (set on a row but never sent to central).
+- **`AuditEvent`**: A record carrying every column of the central `AuditLog` row — `EventId` (GUID, idempotency key), `OccurredAtUtc`, `IngestedAtUtc`, `Channel` (`AuditChannel`), `Kind` (`AuditKind`), `CorrelationId`, `ExecutionId`, `ParentExecutionId`, `SourceSiteId`, `SourceNode`, `SourceInstanceId`, `SourceScript`, `Actor`, `Target`, `Status` (`AuditStatus`), `HttpStatus`, `DurationMs`, `ErrorMessage`, `ErrorDetail`, `RequestSummary`, `ResponseSummary`, `PayloadTruncated`, `Extra` — plus a site-only `ForwardState` (`AuditForwardState`) used by the site SQLite write-buffer's telemetry/reconciliation loop. `IngestedAtUtc` is unset at the site and stamped on central ingest. See `Component-AuditLog.md` for the persistence schema and ingest semantics.
+- **`SiteCall`**: A record carrying the central `SiteCalls` operational-mirror row — `TrackedOperationId`, `SourceSiteId`, `SourceNode`, `Kind`, `Target`, `Status`, `RetryCount`, key timestamps, and provenance — fed by site `CachedCallTelemetry` and the periodic reconciliation pull.
Types defined here must be immutable and thread-safe.
@@ -76,7 +78,7 @@ Entity classes are organized by domain area:
- **Inbound API**: `ApiKey`, `ApiMethod`.
- **Security**: `LdapGroupMapping`, `SiteScopeRule`.
- **Deployment**: `DeploymentRecord`, `SystemArtifactDeploymentRecord`, `DeployedConfigSnapshot`.
-- **Audit**: `AuditLogEntry`.
+- **Audit**: `AuditLogEntry` (configuration-change audit, owned by Configuration Database), `AuditEvent` (centralized Audit Log row, see REQ-COM-1), `SiteCall` (`SiteCalls` operational-mirror row).
The **`Notification`** entity is the persistence-ignorant POCO for a row of the central `Notifications` table — the durable notification queue owned by the Notification Outbox. It is a plain class with properties for `NotificationId` (GUID, the idempotency key), `Type` (`NotificationType` enum discriminator), `ListName`, `Subject`, `Body`, `TypeData` (a JSON string — the type-agnostic extensibility hook), `Status` (`NotificationStatus` enum), `RetryCount`, `LastError`, `ResolvedTargets`, the provenance fields `SourceSiteId` / `SourceInstanceId` / `SourceScript`, and the UTC timestamps `SiteEnqueuedAt`, `CreatedAt`, `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt`. As with every entity class it has no EF dependency; the Configuration Database component supplies the Fluent API mapping, value conversions, and indexes. The `Type` and `Status` enums (`NotificationType`: `Email`, `Teams`, …; `NotificationStatus`: `Pending`, `Retrying`, `Delivered`, `Parked`, `Discarded`) are defined under `Types/Enums/` per REQ-COM-1.
@@ -92,6 +94,7 @@ Commons must define repository interfaces that consuming components use for data
- `INotificationRepository` — Notification lists (including the `Type` field), recipients, SMTP configuration.
- `INotificationOutboxRepository` — The `Notifications` table: insert-if-not-exists ingest on `NotificationId`, due-row polling (`Pending` rows and `Retrying` rows past `NextAttemptAt`), status transitions, KPI aggregate queries, and the bulk delete of terminal rows used by the daily purge job.
- `ISiteCallAuditRepository` — The `SiteCalls` table: insert-if-not-exists ingest on `TrackedOperationId`, upsert-on-newer-status from telemetry and reconciliation pulls, KPI aggregate queries, and the bulk delete of terminal rows used by the daily purge job.
+- `IAuditLogRepository` — The central `AuditLog` table (Audit Log #23): insert-if-not-exists ingest on `EventId`, keyset-paged query, monthly partition switch-out and boundary inspection, KPI snapshots, recursive execution-tree walks, and distinct-source-node enumeration.
- `ISiteRepository` — Sites, data connections, and their site assignments.
- `ICentralUiRepository` — Read-oriented queries spanning multiple domain areas for display purposes.
@@ -113,6 +116,13 @@ Commons must define service interfaces for cross-cutting concerns that multiple
- **`INotificationDeliveryService`**: Sends notifications to a named notification list, routing transient failures to store-and-forward. Implemented by the Notification Service, consumed by the script runtime context.
- **`IAuditWriter`**: Site-local hot-path interface for appending an `AuditEvent` to the site SQLite `AuditLog`: `Task WriteAsync(AuditEvent evt, CancellationToken ct)`. Single durable INSERT, `ForwardState = Pending`. Consumed by the script-trust-boundary call paths (External System Gateway, Database layer, Store-and-Forward Engine). Implementation lives in the Audit Log component.
- **`ICentralAuditWriter`**: Central direct-write interface for central-originated audit rows (Inbound API request completion, Notification Outbox dispatcher attempts/terminals): `Task WriteAsync(AuditEvent evt, CancellationToken ct)`, with insert-if-not-exists semantics on `EventId` so retried handlers cannot produce duplicates. Implementation lives in the Audit Log component.
+- **`ISiteAuditQueue`**: Site-local queue handing off `AuditEvent` rows from the hot path to the gRPC telemetry forwarder. Implementation lives in the Audit Log component.
+- **`ICachedCallLifecycleObserver`** / **`ICachedCallTelemetryForwarder`**: Bridge between the Store-and-Forward Engine's cached-call lifecycle transitions and the central `CachedCallTelemetry` packet (combined audit + operational state). Implementations live in the Audit Log component.
+- **`INodeIdentityProvider`**: Resolves the writing node's `SourceNode` label (`node-a` / `node-b` / `central-a` / `central-b`) stamped on every audit row, notification, and site-call.
+- **`IOperationTrackingStore`**: Site-local SQLite-backed status record store for tracked store-and-forward operations (`Tracking.Status(id)`).
+- **`IPartitionMaintenance`**: Central monthly partition-switch / retention purge job hook used by the Audit Log partition maintenance service.
+
+Bundle transport interfaces (`IBundleExporter`, `IBundleImporter`, `IBundleSessionStore`, `IAuditCorrelationContext`) live alongside the data types in `Interfaces/Transport/` and are owned by the Transport component (#24); they are defined in Commons so other components (Configuration Database for audit correlation, Central UI for the import workflow) can depend on the abstraction without taking a Transport dependency.
These interfaces are defined in Commons so that consuming components depend only on the abstraction, not on the implementing component.
@@ -159,19 +169,36 @@ ScadaLink.Commons/
│ ├── ValueFormatter.cs # culture-invariant value-to-string helper
│ ├── DynamicJsonElement.cs # dynamic JSON wrapper for scripts
│ ├── TrackedOperationId.cs # tracked store-and-forward operation ID (GUID)
+│ ├── AuditLogKpiSnapshot.cs # central AuditLog KPI tile shape
+│ ├── SiteAuditBacklogSnapshot.cs # per-site audit-forward backlog snapshot
+│ ├── SiteCallOperational.cs # SiteCalls operational-row projection
+│ ├── TrackingStatusSnapshot.cs # site-local Tracking.Status(id) projection
│ ├── Enums/ # InstanceState, DeploymentStatus, AlarmState,
│ │ # AlarmLevel, AlarmTriggerType, ConnectionHealth,
│ │ # DataType, StoreAndForwardCategory,
│ │ # StoreAndForwardMessageStatus,
│ │ # NotificationType, NotificationStatus,
│ │ # TrackedOperationKind, TrackedOperationStatus,
-│ │ # AuditChannel, AuditKind, AuditStatus
-│ ├── Audit/ # AuditEvent record (site + central audit row)
+│ │ # AuditChannel, AuditKind, AuditStatus,
+│ │ # AuditForwardState
+│ ├── Audit/ # AuditLogPaging, AuditLogQueryFilter,
+│ │ # AuditQueryParamParsers, ExecutionTreeNode,
+│ │ # SiteCallKpiSnapshot, SiteCallPaging,
+│ │ # SiteCallQueryFilter, SiteCallSiteKpiSnapshot
│ ├── DataConnections/ # OPC UA endpoint config value objects + enums
│ ├── Flattening/ # FlattenedConfiguration, ConfigurationDiff,
│ │ # DeploymentPackage, ValidationResult
+│ ├── InboundApi/ # ApiKeyHasher, ParameterDefinition
+│ ├── Notifications/ # NotificationKpiSnapshot, NotificationOutboxFilter,
+│ │ # SiteNotificationKpiSnapshot
+│ ├── Transport/ # Transport bundle value objects: BundleManifest,
+│ │ # BundleSession, BundleSummary, EncryptionMetadata,
+│ │ # ExportSelection, ImportPreview, ImportResolution,
+│ │ # ImportResult, ManifestContentEntry
│ └── Scripts/ # AlarmContext, ScriptScope
├── Interfaces/ # Shared interfaces by concern
+│ ├── IOperationTrackingStore.cs # site-local tracked-operation status store
+│ ├── IPartitionMaintenance.cs # central partition-switch / retention purge hook
│ ├── Protocol/ # REQ-COM-2: Protocol abstraction (IDataConnection, etc.)
│ ├── Repositories/ # REQ-COM-4: Per-component repository interfaces
│ │ ├── ITemplateEngineRepository.cs
@@ -182,16 +209,26 @@ ScadaLink.Commons/
│ │ ├── INotificationRepository.cs
│ │ ├── INotificationOutboxRepository.cs
│ │ ├── ISiteCallAuditRepository.cs
+│ │ ├── IAuditLogRepository.cs
│ │ ├── ISiteRepository.cs
│ │ └── ICentralUiRepository.cs
-│ └── Services/ # REQ-COM-4a: Cross-cutting service interfaces
-│ ├── IAuditService.cs
-│ ├── IAuditWriter.cs
-│ ├── ICentralAuditWriter.cs
-│ ├── IDatabaseGateway.cs
-│ ├── IExternalSystemClient.cs
-│ ├── IInstanceLocator.cs
-│ └── INotificationDeliveryService.cs
+│ ├── Services/ # REQ-COM-4a: Cross-cutting service interfaces
+│ │ ├── IAuditService.cs
+│ │ ├── IAuditWriter.cs
+│ │ ├── ICentralAuditWriter.cs
+│ │ ├── ISiteAuditQueue.cs
+│ │ ├── ICachedCallLifecycleObserver.cs
+│ │ ├── ICachedCallTelemetryForwarder.cs
+│ │ ├── INodeIdentityProvider.cs
+│ │ ├── IDatabaseGateway.cs
+│ │ ├── IExternalSystemClient.cs
+│ │ ├── IInstanceLocator.cs
+│ │ └── INotificationDeliveryService.cs
+│ └── Transport/ # Bundle transport interfaces (Transport #24):
+│ ├── IAuditCorrelationContext.cs
+│ ├── IBundleExporter.cs
+│ ├── IBundleImporter.cs
+│ └── IBundleSessionStore.cs
├── Entities/ # REQ-COM-3: Domain entity POCOs, by domain area
│ ├── Templates/ # Template, TemplateAttribute, TemplateAlarm,
│ │ # TemplateScript, TemplateComposition, TemplateFolder
@@ -207,7 +244,9 @@ ScadaLink.Commons/
│ ├── Deployment/ # DeploymentRecord, SystemArtifactDeploymentRecord,
│ │ # DeployedConfigSnapshot
│ ├── Scripts/ # SharedScript
-│ └── Audit/ # AuditLogEntry
+│ └── Audit/ # AuditLogEntry (config-change audit),
+│ # AuditEvent (centralized AuditLog row),
+│ # SiteCall (SiteCalls operational mirror)
├── Messages/ # REQ-COM-5: Cross-component message contracts, by concern
│ ├── Deployment/
│ ├── Lifecycle/
@@ -227,7 +266,13 @@ ScadaLink.Commons/
│ ├── InboundApi/ # Route.To() request messages
│ ├── RemoteQuery/ # event-log and parked-message query messages,
│ │ # parked-operation retry/discard commands
-│ └── Management/ # HTTP/ClusterClient management commands + registry
+│ ├── Audit/ # Audit Log (#23) + Site Call Audit (#22) ingest:
+│ │ # IngestAuditEventsCommand/Reply,
+│ │ # IngestCachedTelemetryCommand/Reply,
+│ │ # UpsertSiteCallCommand/Reply, SiteCallQueries,
+│ │ # SiteCallRelayMessages
+│ └── Management/ # HTTP/ClusterClient management commands + registry,
+│ # including TransportCommands (Export/Preview/Import bundle)
├── Serialization/ # OpcUaEndpointConfigSerializer (typed↔legacy JSON)
└── Validators/ # OpcUaEndpointConfigValidator
```
diff --git a/docs/requirements/Component-ConfigurationDatabase.md b/docs/requirements/Component-ConfigurationDatabase.md
index d41cf346..7cb28577 100644
--- a/docs/requirements/Component-ConfigurationDatabase.md
+++ b/docs/requirements/Component-ConfigurationDatabase.md
@@ -61,7 +61,7 @@ The configuration database stores all central system data, organized by domain a
- **SiteCalls**: The central audit table for cached site calls — `ExternalSystem.CachedCall()` and `Database.CachedWrite()` — owned by the Site Call Audit component and a sibling of the `Notifications` table. One row per cached operation. Columns: `TrackedOperationId` (GUID, primary key — generated site-side at call time, used as the idempotency key), `SourceSite`, `Kind` (a `TrackedOperationKind` enum stored with values `ExternalCall` / `DatabaseWrite`), `TargetSummary` (external system + method for an `ExternalCall`, database connection name for a `DatabaseWrite`), `Status` (a `TrackedOperationStatus` enum stored with values `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`), `RetryCount`, `LastError`, `Provenance` (source instance / script), `CreatedAtUtc`, `UpdatedAtUtc`, `TerminalAtUtc`. The table is populated **only** by Site Call Audit telemetry and reconciliation pulls — sites are the source of truth and the row is an eventually-consistent mirror, never written by a central dispatcher. Ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`, then **upsert-on-newer-status**; the lifecycle is monotonic, so at-least-once and out-of-order telemetry are harmless. Indexed on `Status` and `SourceSite` for KPI computation and the Central UI query page. Terminal rows are removed by a daily purge job — see Scheduled Maintenance below. See Component-SiteCallAudit.md for the full lifecycle.
### Audit Log
-- **AuditLog**: The central, append-only audit table owned by the Audit Log component — one row per script-trust-boundary lifecycle event across all channels (outbound API calls, outbound DB writes/reads, notifications, and inbound API requests). Sibling of the `Notifications` and `SiteCalls` tables but distinct: `AuditLog` is the immutable history that observes the other subsystems, not an operational state store. Columns: `EventId` (`uniqueidentifier` primary key — generated at the originator, used as the idempotency key), `OccurredAtUtc` (`datetime2`), `IngestedAtUtc` (`datetime2`), `Channel` (`varchar(32)` — `ApiOutbound` / `DbOutbound` / `Notification` / `ApiInbound`), `Kind` (`varchar(32)` — channel-specific event kind), `CorrelationId` (`uniqueidentifier` NULL — `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API), `SourceSiteId` (`varchar(64)` NULL), `SourceInstanceId` (`varchar(128)` NULL), `SourceScript` (`varchar(128)` NULL), `Actor` (`varchar(128)` NULL), `Target` (`varchar(256)` NULL), `Status` (`varchar(32)` — outcome of *this event*: `Success`, `TransientFailure`, `PermanentFailure`, `Enqueued`, `Retrying`, `Delivered`, `Parked`, `Discarded`), `HttpStatus` (`int` NULL), `DurationMs` (`int` NULL), `ErrorMessage` (`nvarchar(1024)` NULL), `ErrorDetail` (`nvarchar(max)` NULL), `RequestSummary` (`nvarchar(max)` NULL — truncated request payload, headers redacted), `ResponseSummary` (`nvarchar(max)` NULL — truncated response payload), `PayloadTruncated` (`bit`), `Extra` (`nvarchar(max)` NULL — channel-specific JSON for fields not promoted to columns). Indexes: `IX_AuditLog_OccurredAtUtc` (primary time-range index for global scans), `IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc)` (per-site filters), `IX_AuditLog_Correlation (CorrelationId)` (drilldown from a single operation), `IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc)` (KPI / dashboard tiles), and `IX_AuditLog_Target_Occurred (Target, OccurredAtUtc)` ("what did we send to system X"). The primary key on `EventId` enforces idempotency — central ingest is `INSERT … WHERE NOT EXISTS`, so at-least-once telemetry and reconciliation retries collapse to a single row. **Monthly partitioning** on `OccurredAtUtc` from day one via partition function `pf_AuditLog_Month` and partition scheme `ps_AuditLog_Month`, with a filegroup-per-month rollover so that retention purge is a partition switch rather than a row-level delete. The partition-maintenance job that rolls the scheme forward and switches expired partitions is owned by the Audit Log component, not this component. The table is populated only by Audit Log writers (site telemetry, central direct-write, reconciliation pulls); central ingest is **insert-if-not-exists** keyed on `EventId`. See Component-AuditLog.md for the full lifecycle, payload-capture policy, and ingestion paths.
+- **AuditLog**: The central, append-only audit table owned by the Audit Log component — one row per script-trust-boundary lifecycle event across all channels (outbound API calls, outbound DB writes/reads, notifications, and inbound API requests). Sibling of the `Notifications` and `SiteCalls` tables but distinct: `AuditLog` is the immutable history that observes the other subsystems, not an operational state store. Columns: `EventId` (`uniqueidentifier` primary key — generated at the originator, used as the idempotency key), `OccurredAtUtc` (`datetime2`), `IngestedAtUtc` (`datetime2`), `Channel` (`varchar(32)` — `ApiOutbound` / `DbOutbound` / `Notification` / `ApiInbound`), `Kind` (`varchar(32)` — channel-specific event kind), `CorrelationId` (`uniqueidentifier` NULL — `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API), `SourceSiteId` (`varchar(64)` NULL), `SourceInstanceId` (`varchar(128)` NULL), `SourceScript` (`varchar(128)` NULL), `Actor` (`varchar(128)` NULL), `Target` (`varchar(256)` NULL), `Status` (`varchar(32)` — outcome of *this event*: `Success`, `TransientFailure`, `PermanentFailure`, `Enqueued`, `Retrying`, `Delivered`, `Parked`, `Discarded`), `HttpStatus` (`int` NULL), `DurationMs` (`int` NULL), `ErrorMessage` (`nvarchar(1024)` NULL), `ErrorDetail` (`nvarchar(max)` NULL), `RequestSummary` (`nvarchar(max)` NULL — truncated request payload, headers redacted), `ResponseSummary` (`nvarchar(max)` NULL — truncated response payload), `PayloadTruncated` (`bit`), `Extra` (`nvarchar(max)` NULL — channel-specific JSON for fields not promoted to columns). Indexes: `IX_AuditLog_OccurredAtUtc` (primary time-range index for global scans), `IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc)` (per-site filters), `IX_AuditLog_CorrelationId (CorrelationId)` (drilldown from a single operation), `IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc)` (KPI / dashboard tiles), and `IX_AuditLog_Target_Occurred (Target, OccurredAtUtc)` ("what did we send to system X"). The primary key on `EventId` enforces idempotency — central ingest is `INSERT … WHERE NOT EXISTS`, so at-least-once telemetry and reconciliation retries collapse to a single row. **Monthly partitioning** on `OccurredAtUtc` from day one via partition function `pf_AuditLog_Month` and partition scheme `ps_AuditLog_Month`, with a filegroup-per-month rollover so that retention purge is a partition switch rather than a row-level delete. The partition-maintenance job that rolls the scheme forward and switches expired partitions is owned by the Audit Log component, not this component. The table is populated only by Audit Log writers (site telemetry, central direct-write, reconciliation pulls); central ingest is **insert-if-not-exists** keyed on `EventId`. See Component-AuditLog.md for the full lifecycle, payload-capture policy, and ingestion paths.
### Inbound API
- **API Keys**: Key definitions (name/label, key value, enabled flag).
diff --git a/docs/requirements/Component-ExternalSystemGateway.md b/docs/requirements/Component-ExternalSystemGateway.md
index 03cfa4d0..dd3c4dd6 100644
--- a/docs/requirements/Component-ExternalSystemGateway.md
+++ b/docs/requirements/Component-ExternalSystemGateway.md
@@ -39,7 +39,7 @@ Each external system definition includes:
- **Retry Settings**: Max retry count, fixed time between retries (used by Store-and-Forward Engine for transient failures only).
- **Method Definitions**: List of available API methods, each with:
- Method name.
- - **HTTP method**: GET, POST, PUT, or DELETE.
+ - **HTTP method**: GET, POST, PUT, PATCH, or DELETE.
- **Path**: Relative path appended to the base URL (e.g., `/recipes/{id}`).
- Parameter definitions (name, type). Supports the extended type system (Boolean, Integer, Float, String, Object, List).
- Return type definition. Supports the extended type system for complex response structures.
@@ -72,7 +72,7 @@ Each database connection definition includes:
All external system calls are **HTTP/REST** with **JSON** serialization:
- The ESG acts as an HTTP client. The external system definition provides the base URL; each method definition specifies the HTTP method and relative path.
-- Request parameters are serialized as JSON in the request body (POST/PUT) or as query parameters (GET/DELETE).
+- Request parameters are serialized as JSON in the request body (POST/PUT/PATCH) or as query parameters (GET/DELETE).
- Response bodies are deserialized from JSON into the method's defined return type.
- Credentials (API key header or Basic Auth header) are attached to every request per the system's authentication configuration.
diff --git a/docs/requirements/Component-HealthMonitoring.md b/docs/requirements/Component-HealthMonitoring.md
index e02349af..de2ec2a8 100644
--- a/docs/requirements/Component-HealthMonitoring.md
+++ b/docs/requirements/Component-HealthMonitoring.md
@@ -36,8 +36,6 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
| Notification Outbox parked count | Notification Outbox (central) | Count of `Parked` notifications — central-computed, not site-reported |
| `SiteAuditBacklog` | Audit Log (site) | Count of `Pending` rows in the site-local `AuditLog` plus oldest-pending-age plus on-disk bytes. A configurable threshold drives a Health dashboard warning on the affected site tile. |
| `SiteAuditWriteFailures` | Audit Log (site) | Count of failed hot-path audit appends at the site since the last health report. |
-| `SiteAuditTelemetryStalled` | Audit Log (site) | Boolean flag set when reconciliation reports a non-draining site-local audit backlog over two consecutive cycles. |
-| `CentralAuditWriteFailures` | Audit Log (central) | Count of central direct-write audit failures (Inbound API middleware, Notification Outbox dispatcher, and any other central direct writers) since the last interval. |
| `AuditRedactionFailure` | Audit Log (central) | Count of payload redactor errors (over-redacted payloads, safety-net hit) since the last interval. |
## Reporting Protocol
@@ -86,10 +84,10 @@ Unlike the Notification Outbox, the Site Call Audit is **not a dispatcher** —
The Audit Log spans both sites (hot-path append + telemetry forward) and central (direct-write + ingest + redaction). Its operational health surfaces as three new dashboard tiles grouped under **Audit**:
- **Audit volume** — events/min landing in the central `AuditLog` table, shown global plus per-site sparkline; sourced from the Audit Log component on the active central node.
-- **Audit error rate** — percent of central `AuditLog` rows with `Status` other than `Success` / `Delivered` / `Enqueued` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, transient failures, parked deliveries, etc.) — NOT the audit writer's own health. Audit-writer issues surface separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
-- **Audit backlog** — global aggregate of `SiteAuditBacklog` across reporting sites (count of `Pending` site-local audit rows, oldest pending age, on-disk bytes); click drills into a per-site breakdown. The per-site tile surfaces a warning badge when its `SiteAuditBacklog` crosses the configurable threshold or when `SiteAuditTelemetryStalled` is set.
+- **Audit error rate** — percent of central `AuditLog` rows with `Status` other than `Success` / `Delivered` / `Enqueued` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, transient failures, parked deliveries, etc.) — NOT the audit writer's own health. Audit-writer issues surface separately via `AuditRedactionFailure`.
+- **Audit backlog** — global aggregate of `SiteAuditBacklog` across reporting sites (count of `Pending` site-local audit rows, oldest pending age, on-disk bytes); click drills into a per-site breakdown. The per-site tile surfaces a warning badge when its `SiteAuditBacklog` crosses the configurable threshold.
-These tiles are **point-in-time** like the Notification Outbox and Site Call Audit KPI tiles — no time-series store; consistent with Health Monitoring's "current status only" philosophy. The site-scoped `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics arrive in the existing site health report; the central-scoped `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics are central-computed alongside the existing central KPIs.
+These tiles are **point-in-time** like the Notification Outbox and Site Call Audit KPI tiles — no time-series store; consistent with Health Monitoring's "current status only" philosophy. The site-scoped `SiteAuditBacklog` / `SiteAuditWriteFailures` metrics arrive in the existing site health report; the central-scoped `AuditRedactionFailure` metric is central-computed alongside the existing central KPIs.
## Central Storage
@@ -112,7 +110,7 @@ These tiles are **point-in-time** like the Notification Outbox and Site Call Aud
- **Cluster Infrastructure (site)**: Provides node role status.
- **Notification Outbox (central)**: Provides central-computed outbox KPIs — queue depth, stuck count, parked count — for the headline dashboard tiles.
- **Site Call Audit (central)**: Provides central-computed cached-call KPIs — buffered count, parked count, failed/delivered (last interval), oldest pending age, stuck count — for the headline dashboard tiles.
-- **Audit Log (#23)**: Provides the site-reported `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics (via the site health report) and the central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics, plus the central audit-row rate feeding the **Audit** dashboard tile group (Audit volume, Audit error rate, Audit backlog).
+- **Audit Log (#23)**: Provides the site-reported `SiteAuditBacklog` / `SiteAuditWriteFailures` metrics (via the site health report) and the central-computed `AuditRedactionFailure` metric, plus the central audit-row rate feeding the **Audit** dashboard tile group (Audit volume, Audit error rate, Audit backlog).
## Interactions
diff --git a/docs/requirements/Component-ManagementService.md b/docs/requirements/Component-ManagementService.md
index c56275c4..881dcf77 100644
--- a/docs/requirements/Component-ManagementService.md
+++ b/docs/requirements/Component-ManagementService.md
@@ -74,6 +74,15 @@ Content-Type: application/json
The endpoint performs LDAP authentication and role resolution server-side, collapsing the CLI's previous two-step flow (ResolveRoles + actual command) into a single HTTP round-trip.
+## HTTP Audit API
+
+In addition to `/management`, the Management Service exposes a dedicated REST surface for the centralized Audit Log component (#23). These endpoints live in `AuditEndpoints.cs` and bypass the `ManagementActor` because the query/export workloads are read-only, keyset-paged, and stream large result sets:
+
+- `GET /api/audit/query` — keyset-paged JSON query over the central `AuditLog` table. Authenticated via HTTP Basic Auth (shared with `/management`); gated on the `OperationalAudit` permission (Admin / Audit / AuditReadOnly roles).
+- `GET /api/audit/export` — server-side streaming bulk export (CSV or JSONL) of the filtered rows. Gated on the `AuditExport` permission (Admin / Audit).
+
+Both endpoints honour any site-scope rules attached to the caller's audit role by intersecting the caller-supplied `sourceSiteId` filter with the user's `PermittedSiteIds` (out-of-scope requests yield HTTP 403). Permission denial returns HTTP 403 with the same envelope shape used by `/management`.
+
## Message Groups
### Templates
@@ -145,7 +154,13 @@ The endpoint performs LDAP authentication and role resolution server-side, colla
### Audit Log
-- **QueryAuditLog**: Query audit log entries with filtering by entity type, user, date range, etc.
+- **QueryAuditLog**: Legacy configuration-change audit query (filtered by entity type, user, date range, etc.) routed through `/management`. Gated to the `Admin` role; superseded for the centralized Audit Log component (#23) by the dedicated `/api/audit/*` REST endpoints described below.
+
+### Transport (Bundle Import / Export)
+
+- **ExportBundle**: Build an encrypted bundle (templates, system artifacts, central-only configuration). Gated to the `Design` role. Returns base64-encoded bundle bytes plus a byte count.
+- **PreviewBundle**: Unlock and inspect a previously uploaded bundle session, returning the per-entity preview (adds, modifies, identicals, blockers). Gated to the `Admin` role.
+- **ImportBundle**: Apply a previewed bundle with per-conflict resolutions inside a single audit-correlated session. Gated to the `Admin` role.
### Shared Scripts
@@ -206,7 +221,7 @@ The ManagementActor receives the following services and repositories via DI (inj
| Section | Options Class | Contents |
|---------|--------------|----------|
-| `ScadaLink:ManagementService` | `ManagementServiceOptions` | (Reserved for future configuration — e.g., command timeout overrides) |
+| `ScadaLink:ManagementService` | `ManagementServiceOptions` | `CommandTimeout` (`TimeSpan`, default 30 s) — Ask timeout the HTTP endpoint applies when forwarding to the `ManagementActor`. A non-positive configured value falls back to the 30 s default. |
## Dependencies
diff --git a/docs/requirements/Component-Transport.md b/docs/requirements/Component-Transport.md
index b7576416..2aa2950f 100644
--- a/docs/requirements/Component-Transport.md
+++ b/docs/requirements/Component-Transport.md
@@ -76,7 +76,7 @@ Exactly one of `content.json` or `content.enc` is present.
}
```
-The manifest is plaintext so the import wizard can preview bundle contents and source provenance before the user supplies a passphrase.
+The manifest is plaintext so the import wizard can preview bundle contents and source provenance before the user supplies a passphrase. (Implementation note: `BundleImporter.LoadAsync` always parses the manifest *and* reads the content blob to verify the SHA-256 hash on every call, regardless of whether a passphrase is available — the "manifest peek" is conceptual rather than a cheap O(manifest) operation. For an encrypted bundle without a passphrase the call surfaces the encrypted-bundle prompt via the validated envelope, so the UI gets the manifest + provenance for free, but the cost is O(bundle-size) per `LoadAsync`. A future `ReadManifestAsync(Stream)` that skips the content read is a deferred optimisation.)
### `content.json` / `content.enc`
diff --git a/src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs b/src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs
index ae5e7ed9..3cffa566 100644
--- a/src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs
+++ b/src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs
@@ -26,10 +26,12 @@ namespace ScadaLink.AuditLog.Central;
///
/// Per Bundle D's brief, audit-write failures must NEVER abort the user-facing
/// action. The actor wraps each repository call in its own try/catch so a
-/// single bad row cannot cause the rest of the batch to be lost; the actor's
-/// uses Resume so a thrown exception
-/// inside ReceiveAsync does not restart the actor (which would also
-/// reset any in-flight state).
+/// single bad row cannot cause the rest of the batch to be lost — that
+/// per-row catch is what keeps this actor alive across handler throws, not
+/// the supervisor strategy. The override
+/// returns the Akka default decider (Restart for most exceptions) and
+/// governs children only; this actor has no children today, so the override
+/// is a forward-compat placeholder.
///
///
/// Two constructors exist for a deliberate reason: Bundle D's tests inject a
diff --git a/src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs b/src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs
index a0768513..efb78989 100644
--- a/src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs
+++ b/src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs
@@ -35,9 +35,11 @@ namespace ScadaLink.AuditLog.Central;
/// Continue-on-error. A single boundary that throws (transient SQL
/// failure, contention with backup, missing object) must NOT prevent the
/// other eligible boundaries from being purged on the same tick. Per-boundary
-/// work runs inside its own try/catch; the actor's
-/// uses Resume so any leaked exception keeps
-/// the singleton alive for the next tick.
+/// work runs inside its own try/catch — that per-boundary catch is what
+/// keeps the singleton alive across handler throws. The
+/// override returns the Akka default
+/// decider (Restart) and governs children only; this actor has no children
+/// today, so the override is a forward-compat placeholder.
///
///
/// DI scopes. is a scoped EF Core
diff --git a/src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs b/src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs
index 737be9fb..59f868d8 100644
--- a/src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs
+++ b/src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs
@@ -48,10 +48,13 @@ namespace ScadaLink.AuditLog.Central;
///
/// Failure isolation. A single site that throws (DNS, transport,
/// repository write) must NOT prevent other sites from being polled on the
-/// same tick. The per-site work runs inside its own try/catch; the actor's
-/// supervisor strategy keeps it alive across any leaked exception with
-/// 's Restart
-/// semantics — restart resets the in-memory cursors, but as noted above that's
+/// same tick. The per-site work runs inside its own try/catch — that
+/// per-site catch is what keeps the actor running across handler throws.
+/// The override returns
+/// (Restart
+/// semantics) and governs children only; this actor has no children today,
+/// so the override is a forward-compat placeholder. If it ever did fire,
+/// restart would reset the in-memory cursors — but as noted above that's
/// a safe (over-pull, idempotent) recovery.
///
///
diff --git a/src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs b/src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs
index 9173af6c..a508e31d 100644
--- a/src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs
+++ b/src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs
@@ -706,9 +706,13 @@ public class SqliteAuditWriter : IAuditWriter, ISiteAuditQueue, IAsyncDisposable
lock (_writeLock)
{
if (_disposed) return;
- // Stop accepting new events. Setting _disposed first ensures any
- // FlushBatch entered after we mark disposed will fault its pending
- // events rather than touching the about-to-close connection.
+ // Stop accepting new events. Completing the channel writer is the
+ // shutdown signal: WriteAsync calls observe the completion and
+ // fault, and the writer loop drains any already-buffered items
+ // before exiting. _disposed is intentionally NOT set here — it
+ // flips only after the loop has fully drained (second lock block
+ // below), so FlushBatch's existing _disposed check guards the
+ // post-drain window when the connection is about to close.
_writeQueue.Writer.TryComplete();
writerLoop = _writerLoop;
}
diff --git a/src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs b/src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs
index 61e96a72..6cb9b0c4 100644
--- a/src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs
+++ b/src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs
@@ -27,9 +27,18 @@ public sealed class ClusterOptionsValidator : IValidateOptions
{
var failures = new List();
- if (options.SeedNodes is null || options.SeedNodes.Count == 0)
+ if (options.SeedNodes is null || options.SeedNodes.Count < 2)
{
- failures.Add("ClusterOptions.SeedNodes must contain at least one seed node.");
+ // CI-012: design doc states "both nodes are seed nodes — each node lists
+ // both itself and its partner" so a properly-configured deployment lists
+ // two. Accepting a single-seed configuration silently defeats the
+ // "no startup ordering dependency" guarantee called out by
+ // Component-ClusterInfrastructure.md (Node Configuration).
+ failures.Add(
+ "ClusterOptions.SeedNodes must contain at least 2 seed nodes "
+ + "(Component-ClusterInfrastructure.md → Node Configuration: "
+ + "both nodes are seed nodes); a single-seed configuration defeats "
+ + "the no-startup-ordering-dependency guarantee.");
}
if (string.IsNullOrWhiteSpace(options.SplitBrainResolverStrategy)
diff --git a/src/ScadaLink.Commons/Interfaces/Transport/IAuditCorrelationContext.cs b/src/ScadaLink.Commons/Interfaces/Transport/IAuditCorrelationContext.cs
index bc0007a2..0688313f 100644
--- a/src/ScadaLink.Commons/Interfaces/Transport/IAuditCorrelationContext.cs
+++ b/src/ScadaLink.Commons/Interfaces/Transport/IAuditCorrelationContext.cs
@@ -8,12 +8,13 @@ namespace ScadaLink.Commons.Interfaces.Transport;
/// Re-entrancy / thread-safety: mutating is NOT
/// thread-safe. The service is registered scoped, and the assumed usage is a
/// single Blazor Server circuit (or single API request) at a time — within that
-/// scope is the sole writer, and the
-/// audit service is the sole reader, in a strictly sequential await chain.
-/// Callers that perform concurrent imports within a shared scope (e.g. two
-/// ApplyAsync calls awaited via Task.WhenAll on the same circuit)
-/// MUST serialize access externally — there is no internal lock and the last
-/// writer wins, which would cross-contaminate audit rows between imports.
+/// scope BundleImporter.ApplyAsync (in the Transport component) is the
+/// sole writer, and the audit service is the sole reader, in a strictly
+/// sequential await chain. Callers that perform concurrent imports within a
+/// shared scope (e.g. two ApplyAsync calls awaited via
+/// Task.WhenAll on the same circuit) MUST serialize access externally —
+/// there is no internal lock and the last writer wins, which would
+/// cross-contaminate audit rows between imports.
///
///
public interface IAuditCorrelationContext
diff --git a/src/ScadaLink.Commons/Messages/InboundApi/RouteToInstanceRequest.cs b/src/ScadaLink.Commons/Messages/InboundApi/RouteToInstanceRequest.cs
index 90bc6f1e..59671d85 100644
--- a/src/ScadaLink.Commons/Messages/InboundApi/RouteToInstanceRequest.cs
+++ b/src/ScadaLink.Commons/Messages/InboundApi/RouteToInstanceRequest.cs
@@ -33,11 +33,19 @@ public record RouteToCallResponse(
///
/// Request to read attribute(s) from a remote instance.
///
+///
+/// Audit Log #23 (ParentExecutionId): mirrors .
+/// For an inbound-API-routed read this is the inbound request's per-request execution id;
+/// future site-side audit emission for routed reads can stamp it as ParentExecutionId
+/// so the inbound→site execution-tree link survives the read path. Additive trailing
+/// member — null for the Central UI sandbox path or for callers built before the field existed.
+///
public record RouteToGetAttributesRequest(
string CorrelationId,
string InstanceUniqueName,
IReadOnlyList AttributeNames,
- DateTimeOffset Timestamp);
+ DateTimeOffset Timestamp,
+ Guid? ParentExecutionId = null);
///
/// Response containing attribute values from a remote instance.
@@ -52,11 +60,20 @@ public record RouteToGetAttributesResponse(
///
/// Request to write attribute(s) on a remote instance.
///
+///
+/// Audit Log #23 (ParentExecutionId): mirrors .
+/// For an inbound-API-routed write this is the inbound request's per-request execution id;
+/// site-side audit emission for the underlying device / static-attribute write can stamp
+/// it as ParentExecutionId so the inbound→site execution-tree link survives the
+/// write path. Additive trailing member — null for the Central UI sandbox path or for
+/// callers built before the field existed.
+///
public record RouteToSetAttributesRequest(
string CorrelationId,
string InstanceUniqueName,
IReadOnlyDictionary AttributeValues,
- DateTimeOffset Timestamp);
+ DateTimeOffset Timestamp,
+ Guid? ParentExecutionId = null);
///
/// Response confirming attribute writes on a remote instance.
diff --git a/src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs b/src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs
index feb88f6e..d1849db0 100644
--- a/src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs
+++ b/src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs
@@ -1,4 +1,5 @@
using Akka.Actor;
+using Akka.Cluster;
using Akka.Cluster.Tools.Client;
using Akka.Event;
using ScadaLink.Commons.Messages.Artifacts;
@@ -27,6 +28,15 @@ public class SiteCommunicationActor : ReceiveActor, IWithTimers
private readonly string _siteId;
private readonly CommunicationOptions _options;
+ ///
+ /// Communication-018: predicate that returns true when this node is
+ /// the active member of the local site cluster (used to stamp
+ /// ). Production builds default to
+ /// the Akka leader check; tests inject a stub so they
+ /// do not need a real cluster.
+ ///
+ private readonly Func _isActiveCheck;
+
///
/// Reference to the local Deployment Manager singleton proxy.
///
@@ -54,14 +64,23 @@ public class SiteCommunicationActor : ReceiveActor, IWithTimers
/// The site identifier included in outbound messages.
/// Communication options including heartbeat interval and transport settings.
/// Local reference to the Deployment Manager singleton proxy.
+ ///
+ /// Communication-018: optional override returning true when this node
+ /// is the active member of the site cluster. null uses the real
+ /// Akka leader check (the default for production
+ /// wiring); tests pass a stub so they do not need to load Akka.Cluster
+ /// into the TestKit ActorSystem.
+ ///
public SiteCommunicationActor(
string siteId,
CommunicationOptions options,
- IActorRef deploymentManagerProxy)
+ IActorRef deploymentManagerProxy,
+ Func? isActiveCheck = null)
{
_siteId = siteId;
_options = options;
_deploymentManagerProxy = deploymentManagerProxy;
+ _isActiveCheck = isActiveCheck ?? DefaultIsActiveCheck;
// Registration
Receive(msg =>
@@ -360,16 +379,60 @@ public class SiteCommunicationActor : ReceiveActor, IWithTimers
return;
var hostname = Environment.MachineName;
+
+ // Communication-018: stamp HeartbeatMessage.IsActive with this node's
+ // true active/standby role rather than hard-coding `true`. The field is
+ // part of the wire contract (additive-only-evolution) so a future
+ // central health dashboard can distinguish "active node down, standby
+ // up" from "site fully offline" without a new message type.
+ bool isActive;
+ try
+ {
+ isActive = _isActiveCheck();
+ }
+ catch (Exception ex)
+ {
+ // Defensive: never let a cluster-state read failure abort the
+ // heartbeat itself (heartbeats are health signal — their absence is
+ // already meaningful). Fall back to the safest non-claiming value:
+ // standby. Logged at Debug because this path normally only fires
+ // during ActorSystem warm-up.
+ _log.Debug(ex,
+ "Active-node check threw while sending heartbeat for site {0}; reporting IsActive=false",
+ _siteId);
+ isActive = false;
+ }
+
var heartbeat = new HeartbeatMessage(
_siteId,
hostname,
- IsActive: true,
+ IsActive: isActive,
DateTimeOffset.UtcNow);
_centralClient.Tell(
new ClusterClient.Send("/user/central-communication", heartbeat), Self);
}
+ ///
+ /// Communication-018: default active-node check used when no override is
+ /// supplied. Mirrors ActiveNodeGate in the Host (and
+ /// ActiveNodeHealthCheck): the node is the active member of the
+ /// site cluster when it is the current cluster leader AND its own
+ /// is . Any other
+ /// state (still joining, leaving, no leader yet) reports standby —
+ /// safe-by-default, matching the standby case.
+ ///
+ private bool DefaultIsActiveCheck()
+ {
+ var cluster = Cluster.Get(Context.System);
+ var self = cluster.SelfMember;
+ if (self.Status != MemberStatus.Up)
+ return false;
+
+ var leader = cluster.State.Leader;
+ return leader != null && leader == self.Address;
+ }
+
// ── Internal messages ──
internal record SendHeartbeat;
diff --git a/src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs b/src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs
index 60b90e65..e453f2bc 100644
--- a/src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs
+++ b/src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs
@@ -25,6 +25,11 @@ public class SiteStreamGrpcServer : SiteStreamService.SiteStreamServiceBase
private readonly int _maxConcurrentStreams;
private readonly TimeSpan _maxStreamLifetime;
private volatile bool _ready;
+ // Host-017 / REQ-HOST-7: flipped by CancelAllStreams() when the host enters
+ // CoordinatedShutdown so SubscribeInstance refuses new streams with
+ // Unavailable before the actor system tears down. Strictly monotonic — once
+ // true, never reset (the server is single-lifetime per host).
+ private volatile bool _shuttingDown;
private long _actorCounter;
// Audit Log (#23 M2): central-side ingest actor proxy. Set by the host
// after the cluster singleton starts (see Bundle E wiring). When null the
@@ -131,6 +136,40 @@ public class SiteStreamGrpcServer : SiteStreamService.SiteStreamServiceBase
_siteAuditQueue = queue;
}
+ ///
+ /// Host-017 / REQ-HOST-7: signals the gRPC server to begin its part of the
+ /// site shutdown sequence — refuse new
+ /// streams with and cancel every
+ /// active stream so its await foreach observes
+ /// and the response stream
+ /// completes with Cancelled on the client. Idempotent — safe to call
+ /// more than once. Invoked from the site host's
+ /// IHostApplicationLifetime.ApplicationStopping callback BEFORE
+ /// Akka's CoordinatedShutdown runs, so in-flight clients get a
+ /// clean cancellation they can reconnect on rather than a silent stream
+ /// that only times out via gRPC keepalive.
+ ///
+ public void CancelAllStreams()
+ {
+ _shuttingDown = true;
+ foreach (var entry in _activeStreams.Values)
+ {
+ try
+ {
+ entry.Cts.Cancel();
+ }
+ catch (ObjectDisposedException)
+ {
+ // Already cleaned up by its own finally — nothing to do.
+ }
+ }
+ }
+
+ ///
+ /// Host-017: exposed for test assertions on the shutdown state.
+ ///
+ internal bool IsShuttingDown => _shuttingDown;
+
///
/// Number of currently active streaming subscriptions. Exposed for diagnostics.
///
@@ -151,6 +190,11 @@ public class SiteStreamGrpcServer : SiteStreamService.SiteStreamServiceBase
if (!_ready)
throw new RpcException(new GrpcStatus(StatusCode.Unavailable, "Server not ready"));
+ // Host-017 / REQ-HOST-7: refuse new subscriptions during shutdown so
+ // CoordinatedShutdown can quiesce without racing fresh streams.
+ if (_shuttingDown)
+ throw new RpcException(new GrpcStatus(StatusCode.Unavailable, "Server shutting down"));
+
// Communication-014: correlation_id arrives off the wire on a public gRPC
// endpoint and is used (below) to compose an Akka actor name. Akka actor names
// have a restricted character set — a id containing '/', whitespace, or other
diff --git a/src/ScadaLink.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs b/src/ScadaLink.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs
index 8e897cc4..fe350973 100644
--- a/src/ScadaLink.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs
+++ b/src/ScadaLink.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs
@@ -6,11 +6,12 @@ using ScadaLink.Commons.Interfaces.Repositories;
namespace ScadaLink.ConfigurationDatabase.Repositories;
///
-/// EF Core implementation of IDeploymentManagerRepository.
-/// Provides storage/query of deployed configuration snapshots per instance,
-/// current deployment status, and optimistic concurrency on deployment status records.
-///
-/// WP-24: Stub level sufficient for diff/staleness support.
+/// EF Core implementation of covering
+/// the deployment pipeline's persistence surface: DeploymentRecord CRUD
+/// (with optimistic concurrency via DeploymentRecord.RowVersion),
+/// SystemArtifactDeploymentRecord CRUD, DeployedConfigSnapshot CRUD,
+/// and a Restrict-FK-aware that explicitly
+/// clears dependent deployment-record rows before removing an instance.
///
public class DeploymentManagerRepository : IDeploymentManagerRepository
{
diff --git a/src/ScadaLink.DeploymentManager/DeploymentService.cs b/src/ScadaLink.DeploymentManager/DeploymentService.cs
index d2922ff6..e7aeac06 100644
--- a/src/ScadaLink.DeploymentManager/DeploymentService.cs
+++ b/src/ScadaLink.DeploymentManager/DeploymentService.cs
@@ -170,7 +170,7 @@ public class DeploymentService
// Idempotency"). A clean prior Success or a fresh first-time deploy
// skips this extra round-trip.
var reconciled = await TryReconcileWithSiteAsync(
- instance, revisionHash, configJson, cancellationToken);
+ instance, revisionHash, configJson, user, cancellationToken);
if (reconciled != null)
return Result.Success(reconciled);
@@ -617,6 +617,7 @@ public class DeploymentService
Instance instance,
string targetRevisionHash,
string configJson,
+ string currentUser,
CancellationToken cancellationToken)
{
var prior = await _repository.GetCurrentDeploymentStatusAsync(instance.Id, cancellationToken);
@@ -695,9 +696,19 @@ public class DeploymentService
instance, prior.DeploymentId, targetRevisionHash, configJson,
forceEnabledState: false, cancellationToken);
- await _auditService.LogAsync(prior.DeployedBy, "DeployReconciled", "Instance",
+ // DeploymentManager-020: attribute the audit row to the user driving
+ // THIS redeploy (the caller of DeployInstanceAsync), not the user
+ // who issued the original timed-out / stuck deployment. The original
+ // deployer is preserved in the detail object so forensics can still
+ // see who launched the run that reconciliation rescued.
+ await _auditService.LogAsync(currentUser, "DeployReconciled", "Instance",
instance.Id.ToString(), instance.UniqueName,
- new { DeploymentId = prior.DeploymentId, RevisionHash = targetRevisionHash },
+ new
+ {
+ DeploymentId = prior.DeploymentId,
+ RevisionHash = targetRevisionHash,
+ OriginalDeployer = prior.DeployedBy
+ },
cancellationToken);
return prior;
diff --git a/src/ScadaLink.Host/Program.cs b/src/ScadaLink.Host/Program.cs
index 9651925a..03ca911c 100644
--- a/src/ScadaLink.Host/Program.cs
+++ b/src/ScadaLink.Host/Program.cs
@@ -271,6 +271,18 @@ try
// Map gRPC service — resolves the singleton SiteStreamGrpcServer from DI
app.MapGrpcService();
+ // Host-017 / REQ-HOST-7: site-shutdown ordering. ApplicationStopping
+ // fires BEFORE IHostedService.StopAsync runs, so the gRPC server
+ // refuses new streams (Unavailable) and cancels every active stream
+ // here — clients observe a clean Cancelled and reconnect — and only
+ // THEN does AkkaHostedService run CoordinatedShutdown and tear down
+ // actors. Without this hand-off, in-flight streams go silent and only
+ // time out via gRPC keepalive (~25 s), violating the documented
+ // four-step sequence.
+ var siteLifetime = app.Services.GetRequiredService();
+ var siteGrpcServer = app.Services.GetRequiredService();
+ siteLifetime.ApplicationStopping.Register(() => siteGrpcServer.CancelAllStreams());
+
await app.RunAsync();
}
else
diff --git a/src/ScadaLink.InboundAPI/RouteHelper.cs b/src/ScadaLink.InboundAPI/RouteHelper.cs
index 0a53011e..1b069493 100644
--- a/src/ScadaLink.InboundAPI/RouteHelper.cs
+++ b/src/ScadaLink.InboundAPI/RouteHelper.cs
@@ -179,8 +179,14 @@ public class RouteTarget
var siteId = await ResolveSiteAsync(token);
var correlationId = Guid.NewGuid().ToString();
+ // Audit Log #23 (ParentExecutionId): mirrors the Call path — stamp the
+ // spawning inbound request's ExecutionId so future site-side audit
+ // emission for routed reads can record this read's parent. Symmetric
+ // with RouteToCallRequest so script authors get the same correlation
+ // across Call / GetAttributes / SetAttributes.
var request = new RouteToGetAttributesRequest(
- correlationId, _instanceCode, attributeNames.ToList(), DateTimeOffset.UtcNow);
+ correlationId, _instanceCode, attributeNames.ToList(), DateTimeOffset.UtcNow,
+ _parentExecutionId);
var response = await _instanceRouter.RouteToGetAttributesAsync(siteId, request, token);
@@ -222,8 +228,14 @@ public class RouteTarget
var siteId = await ResolveSiteAsync(token);
var correlationId = Guid.NewGuid().ToString();
+ // Audit Log #23 (ParentExecutionId): mirrors the Call path — stamp the
+ // spawning inbound request's ExecutionId so future site-side audit
+ // emission for routed writes can record this write's parent. Symmetric
+ // with RouteToCallRequest so script authors get the same correlation
+ // across Call / GetAttributes / SetAttributes.
var request = new RouteToSetAttributesRequest(
- correlationId, _instanceCode, attributeValues, DateTimeOffset.UtcNow);
+ correlationId, _instanceCode, attributeValues, DateTimeOffset.UtcNow,
+ _parentExecutionId);
var response = await _instanceRouter.RouteToSetAttributesAsync(siteId, request, token);
diff --git a/src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs b/src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs
index 2e69d04e..3f98a005 100644
--- a/src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs
+++ b/src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs
@@ -6,21 +6,46 @@ namespace ScadaLink.NotificationOutbox;
///
public class NotificationOutboxOptions
{
- /// Interval between dispatch sweeps that pick up pending notifications for delivery.
+ ///
+ /// Interval between dispatch sweeps that pick up pending notifications for delivery.
+ /// Default: 10 seconds.
+ ///
public TimeSpan DispatchInterval { get; set; } = TimeSpan.FromSeconds(10);
- /// Maximum number of notifications claimed for delivery in a single dispatch sweep.
+ ///
+ /// Maximum number of notifications claimed for delivery in a single dispatch sweep.
+ /// Caps per-sweep batch size to bound memory and per-iteration latency.
+ /// Default: 100.
+ ///
public int DispatchBatchSize { get; set; } = 100;
- /// Age past which an in-progress notification is considered stuck and re-claimed.
+ ///
+ /// Age past which a still-Pending/Retrying notification is counted as
+ /// "stuck" on the KPI tile and the per-row badge in the Central UI.
+ /// Display-only: rows older than this threshold are flagged in KPIs/UI; there is no
+ /// automatic re-claim, requeue, or escalation — the dispatcher behaviour is unaffected.
+ /// Default: 10 minutes.
+ ///
public TimeSpan StuckAgeThreshold { get; set; } = TimeSpan.FromMinutes(10);
- /// Retention period for notifications in a terminal state before they are purged.
+ ///
+ /// Retention period for notifications in a terminal state (Delivered,
+ /// Parked, Discarded) before they are purged from the Notifications table.
+ /// Default: 365 days.
+ ///
public TimeSpan TerminalRetention { get; set; } = TimeSpan.FromDays(365);
- /// Interval between background purge sweeps of terminal notifications.
+ ///
+ /// Interval between background purge sweeps of terminal notifications older than
+ /// .
+ /// Default: 1 day.
+ ///
public TimeSpan PurgeInterval { get; set; } = TimeSpan.FromDays(1);
- /// Trailing window used to compute the delivered-notifications throughput KPI.
+ ///
+ /// Trailing window used to compute the "delivered (last interval)" throughput KPI
+ /// surfaced on the Health dashboard and the Notification Outbox page.
+ /// Default: 1 minute.
+ ///
public TimeSpan DeliveredKpiWindow { get; set; } = TimeSpan.FromMinutes(1);
}
diff --git a/src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs b/src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs
index 0cced74d..77bee5ea 100644
--- a/src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs
+++ b/src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs
@@ -23,10 +23,14 @@ namespace ScadaLink.SiteCallAudit;
///
///
///
-/// Query, detail and KPIs (Task 4) and the central→site Retry/Discard relay
-/// (Task 5 — the relay handlers live in this actor) are implemented; only
-/// reconciliation remains deferred (per CLAUDE.md scope discipline — it lands
-/// in a later follow-up).
+/// Implemented: direct telemetry ingest,
+/// query, detail and KPI handlers (Task 4), and the central→site Retry/Discard
+/// relay (Task 5 — the relay handlers live in this actor). Deferred (per
+/// CLAUDE.md scope discipline — both land in a later follow-up): the periodic
+/// per-site reconciliation puller that backfills lost telemetry, and the daily
+/// terminal-row purge scheduler (the repository exposes
+/// PurgeTerminalAsync but nothing in this module currently invokes it
+/// on a schedule).
///
///
/// Per CLAUDE.md "audit-write failure NEVER aborts the user-facing action" —
@@ -556,6 +560,13 @@ public class SiteCallAuditActor : ReceiveActor
// SiteUnreachable is never produced from a ParkedOperationActionAck —
// unreachable responses are built by UnreachableRetry/UnreachableDiscard
// before any ack is classified, so this arm is unreachable by construction.
+ // We deliberately return ack.ErrorMessage (rather than throwing) to keep
+ // AckErrorMessage total and side-effect-free: site-unreachable is classified
+ // as transient by the upstream relay path (which has already constructed the
+ // SiteUnreachable response and detail text via SiteUnreachableMessage), so a
+ // defensive fall-through here just surfaces whatever error text the ack
+ // carries and lets the caller schedule a retry. Throwing would turn a benign
+ // refactor invariant violation into a relay-path crash.
SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage,
_ => throw new ArgumentOutOfRangeException(
nameof(outcome), outcome, "unknown SiteCallRelayOutcome"),
diff --git a/src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs b/src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs
index 53425b1e..8b032bdc 100644
--- a/src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs
+++ b/src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs
@@ -4,6 +4,21 @@ using Microsoft.Extensions.Options;
namespace ScadaLink.SiteEventLogging;
+///
+/// SiteEventLogging-019: predicate the
+/// consults at the top of every purge tick to decide whether THIS node should
+/// run the daily purge. The design states "a daily background job runs on the
+/// active node and deletes all events older than 30 days"; the standby's local
+/// SQLite receives no writes, so purging there is harmless but unnecessary —
+/// and silently doing it anyway diverges from the design.
+///
+/// Registration is the Host's responsibility (it knows the cluster topology);
+/// when no implementation is registered the purge service defaults to "always
+/// active" so non-clustered hosts and unit tests are unaffected — backward
+/// compatible with the prior "run on every host" behaviour.
+///
+public delegate bool SiteEventLogActiveNodeCheck();
+
///
/// Background service that periodically purges old events from the SQLite event log.
/// Enforces both time-based retention (default 30 days) and storage cap (default 1GB).
@@ -17,15 +32,24 @@ public class EventLogPurgeService : BackgroundService
private readonly SiteEventLogger _eventLogger;
private readonly SiteEventLogOptions _options;
private readonly ILogger _logger;
+ private readonly SiteEventLogActiveNodeCheck _isActiveNode;
/// Initializes a new instance of .
/// The concrete event logger providing lock-guarded database access.
/// Site event log options (retention days, storage cap, purge interval).
/// Logger instance.
+ ///
+ /// SiteEventLogging-019: optional active-node check. When null, the
+ /// service runs the purge on every tick (preserves the pre-fix behaviour
+ /// for non-clustered hosts and existing tests). When supplied — e.g. by
+ /// the Host on a site node — each tick early-exits on the standby so the
+ /// daily purge runs only on the active node, matching the design.
+ ///
public EventLogPurgeService(
SiteEventLogger eventLogger,
IOptions options,
- ILogger logger)
+ ILogger logger,
+ SiteEventLogActiveNodeCheck? isActiveNode = null)
{
// Depend on the concrete recorder directly: purge must funnel database access
// through its lock-guarded WithConnection. Taking ISiteEventLogger and
@@ -33,6 +57,7 @@ public class EventLogPurgeService : BackgroundService
_eventLogger = eventLogger;
_options = options.Value;
_logger = logger;
+ _isActiveNode = isActiveNode ?? (static () => true);
}
///
@@ -58,6 +83,31 @@ public class EventLogPurgeService : BackgroundService
{
try
{
+ // SiteEventLogging-019: gate every tick on the active-node check.
+ // The standby's local SQLite receives no writes, so purging there
+ // is harmless but unnecessary; the design (Component-SiteEventLogging
+ // → Storage) explicitly states the purge runs on the active node.
+ // Defensive try/catch: a transient cluster-state read failure must
+ // not stop the purge loop — fall back to running the purge (the
+ // pre-fix behaviour was "always run", which is harmless on standby).
+ bool isActive;
+ try
+ {
+ isActive = _isActiveNode();
+ }
+ catch (Exception checkEx)
+ {
+ _logger.LogDebug(checkEx,
+ "Active-node check threw during purge tick; running purge to be safe");
+ isActive = true;
+ }
+
+ if (!isActive)
+ {
+ _logger.LogDebug("Skipping event log purge tick — this node is not the active site member");
+ return;
+ }
+
PurgeByRetention();
PurgeByStorageCap();
}
diff --git a/src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs b/src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs
index 0ecbed05..66e07aae 100644
--- a/src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs
+++ b/src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs
@@ -1,4 +1,7 @@
using Microsoft.Extensions.DependencyInjection;
+using Microsoft.Extensions.Hosting;
+using Microsoft.Extensions.Logging;
+using Microsoft.Extensions.Options;
namespace ScadaLink.SiteEventLogging;
@@ -18,7 +21,19 @@ public static class ServiceCollectionExtensions
services.AddSingleton();
services.AddSingleton(sp => sp.GetRequiredService());
services.AddSingleton();
- services.AddHostedService();
+
+ // SiteEventLogging-019: the purge service still registers on every host
+ // node, but it consults an optional SiteEventLogActiveNodeCheck on each
+ // tick and early-exits on the standby. The Host registers the real
+ // active-node check on site nodes; tests and non-clustered hosts leave
+ // it unregistered, and the purge defaults to "always run" (the
+ // pre-fix behaviour). Building the service via a factory so the
+ // optional delegate flows from DI rather than the constructor default.
+ services.AddHostedService(sp => new EventLogPurgeService(
+ sp.GetRequiredService(),
+ sp.GetRequiredService>(),
+ sp.GetRequiredService>(),
+ sp.GetService()));
return services;
}
diff --git a/src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs b/src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs
index 080458c1..5b0e0e6a 100644
--- a/src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs
+++ b/src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs
@@ -3,32 +3,40 @@ using ScadaLink.StoreAndForward;
namespace ScadaLink.SiteRuntime.Messages;
-///
-/// Outbound messages — sent by local DeploymentManagerActor/S&F service
-/// to the local SiteReplicationActor for forwarding to the peer node.
-///
+// Outbound messages — sent by local DeploymentManagerActor/S&F service
+// to the local SiteReplicationActor for forwarding to the peer node.
+
+/// Outbound: replicate a deployed instance config (create or update) to the peer node.
public record ReplicateConfigDeploy(
string InstanceName, string ConfigJson, string DeploymentId, string RevisionHash, bool IsEnabled);
+/// Outbound: replicate removal of a deployed instance config to the peer node.
public record ReplicateConfigRemove(string InstanceName);
+/// Outbound: replicate an instance enabled/disabled flag change to the peer node.
public record ReplicateConfigSetEnabled(string InstanceName, bool IsEnabled);
+/// Outbound: replicate a system-wide artifact deployment (shared scripts, external systems, etc.) to the peer node.
public record ReplicateArtifacts(DeployArtifactsCommand Command);
+/// Outbound: replicate a store-and-forward buffer mutation (enqueue/dequeue/park/etc.) to the peer node.
public record ReplicateStoreAndForward(ReplicationOperation Operation);
-///
-/// Inbound messages — received from the peer's SiteReplicationActor
-/// and applied to local SQLite storage.
-///
+// Inbound messages — received from the peer's SiteReplicationActor
+// and applied to local SQLite storage.
+
+/// Inbound: apply a peer-replicated instance config (create or update) to local SQLite.
public record ApplyConfigDeploy(
string InstanceName, string ConfigJson, string DeploymentId, string RevisionHash, bool IsEnabled);
+/// Inbound: apply peer-replicated removal of a deployed instance config to local SQLite.
public record ApplyConfigRemove(string InstanceName);
+/// Inbound: apply a peer-replicated instance enabled/disabled flag change to local SQLite.
public record ApplyConfigSetEnabled(string InstanceName, bool IsEnabled);
+/// Inbound: apply a peer-replicated system-wide artifact deployment to local SQLite.
public record ApplyArtifacts(DeployArtifactsCommand Command);
+/// Inbound: apply a peer-replicated store-and-forward buffer mutation to the local buffer.
public record ApplyStoreAndForward(ReplicationOperation Operation);
diff --git a/tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsValidatorTests.cs b/tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsValidatorTests.cs
index 616d84c3..f972a79d 100644
--- a/tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsValidatorTests.cs
+++ b/tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsValidatorTests.cs
@@ -69,6 +69,23 @@ public class ClusterOptionsValidatorTests
Assert.Contains("SeedNodes", result.FailureMessage);
}
+ [Fact]
+ public void SingleSeedNode_FailsValidation()
+ {
+ // CI-012: design doc says "both nodes are seed nodes" — a single-seed
+ // configuration defeats the no-startup-ordering-dependency guarantee and
+ // must be rejected by the contract owner's validator, not just by the
+ // Host's startup validator.
+ var options = ValidOptions();
+ options.SeedNodes = new List { "akka.tcp://scadalink@node1:8081" };
+
+ var result = new ClusterOptionsValidator().Validate(null, options);
+
+ Assert.True(result.Failed);
+ Assert.Contains("SeedNodes", result.FailureMessage);
+ Assert.Contains("2", result.FailureMessage);
+ }
+
[Fact]
public void HeartbeatNotBelowFailureThreshold_FailsValidation()
{
diff --git a/tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcServerTests.cs b/tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcServerTests.cs
index 438bca65..0af6dfaa 100644
--- a/tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcServerTests.cs
+++ b/tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcServerTests.cs
@@ -138,6 +138,63 @@ public class SiteStreamGrpcServerTests : TestKit
Assert.Equal(0, server.ActiveStreamCount);
}
+ // --- Host-017 / REQ-HOST-7: site-shutdown ordering ---
+
+ [Fact]
+ public async Task Host017_CancelAllStreams_CancelsActiveStreamsAndRefusesNewOnes()
+ {
+ // REQ-HOST-7 step (1)+(2): on CoordinatedShutdown the gRPC server must
+ // stop accepting new streams AND cancel every active stream so the
+ // client observes a clean Cancelled (not a silent stream that only
+ // times out via keepalive). Program.cs registers
+ // ApplicationStopping → CancelAllStreams(); this test exercises the
+ // server-side guarantee in isolation.
+ var server = CreateServer();
+ server.SetReady(Sys);
+
+ var cts1 = new CancellationTokenSource();
+ var context1 = CreateMockContext(cts1.Token);
+ var writer1 = Substitute.For>();
+
+ var stream1Task = Task.Run(() => server.SubscribeInstance(
+ MakeRequest("corr-shutdown-1"), writer1, context1));
+
+ await WaitForConditionAsync(() => server.ActiveStreamCount == 1);
+
+ // Begin shutdown — flip the flag AND cancel the active stream.
+ server.CancelAllStreams();
+
+ Assert.True(server.IsShuttingDown);
+
+ // Active stream's await foreach observes OCE and falls through finally
+ // → entry is removed from _activeStreams.
+ await stream1Task;
+ Assert.Equal(0, server.ActiveStreamCount);
+
+ // A second SubscribeInstance after shutdown is refused immediately
+ // with Unavailable rather than allowed to register a new stream.
+ var writer2 = Substitute.For>();
+ var context2 = CreateMockContext();
+ var ex = await Assert.ThrowsAsync(
+ () => server.SubscribeInstance(MakeRequest("corr-shutdown-2"), writer2, context2));
+ Assert.Equal(StatusCode.Unavailable, ex.StatusCode);
+ Assert.Contains("shutting", ex.Status.Detail, StringComparison.OrdinalIgnoreCase);
+ }
+
+ [Fact]
+ public void Host017_CancelAllStreams_IsIdempotent()
+ {
+ // Repeated calls during a double-fire shutdown sequence must not throw.
+ var server = CreateServer();
+ server.SetReady(Sys);
+
+ server.CancelAllStreams();
+ server.CancelAllStreams();
+
+ Assert.True(server.IsShuttingDown);
+ Assert.Equal(0, server.ActiveStreamCount);
+ }
+
[Fact]
public async Task SubscribesAndRemovesFromStreamManager()
{
diff --git a/tests/ScadaLink.Communication.Tests/SiteCommunicationActorTests.cs b/tests/ScadaLink.Communication.Tests/SiteCommunicationActorTests.cs
index 4bc880bd..caa6af0c 100644
--- a/tests/ScadaLink.Communication.Tests/SiteCommunicationActorTests.cs
+++ b/tests/ScadaLink.Communication.Tests/SiteCommunicationActorTests.cs
@@ -2,6 +2,7 @@ using Akka.Actor;
using Akka.Cluster.Tools.Client;
using Akka.TestKit.Xunit2;
using ScadaLink.Commons.Messages.Deployment;
+using ScadaLink.Commons.Messages.Health;
using ScadaLink.Commons.Messages.Lifecycle;
using ScadaLink.Commons.Messages.Integration;
using ScadaLink.Commons.Messages.Notification;
@@ -282,4 +283,35 @@ public class SiteCommunicationActorTests : TestKit
Assert.False(ack.Applied);
Assert.NotNull(ack.ErrorMessage);
}
+
+ // ── Communication-018: heartbeat IsActive reflects this node's cluster role ──
+
+ [Theory]
+ [InlineData(true)]
+ [InlineData(false)]
+ public void Heartbeat_StampsIsActive_FromInjectedCheck(bool isActive)
+ {
+ // Communication-018: HeartbeatMessage.IsActive must reflect the actual
+ // active/standby role of this node, not a hard-coded `true`. The actor
+ // now takes a Func override (defaulting to a real Akka.Cluster
+ // leader check in production); tests inject a stub so they do not need
+ // to bring up a full cluster in the TestKit ActorSystem.
+ var dmProbe = CreateTestProbe();
+ var centralClientProbe = CreateTestProbe();
+ var fastHeartbeatOptions = new CommunicationOptions
+ {
+ TransportHeartbeatInterval = TimeSpan.FromMilliseconds(50)
+ };
+
+ var siteActor = Sys.ActorOf(Props.Create(() =>
+ new SiteCommunicationActor("site1", fastHeartbeatOptions, dmProbe.Ref, () => isActive)));
+
+ siteActor.Tell(new RegisterCentralClient(centralClientProbe.Ref));
+
+ var send = centralClientProbe.FishForMessage(
+ s => s.Message is HeartbeatMessage, TimeSpan.FromSeconds(3));
+ var heartbeat = Assert.IsType(send.Message);
+ Assert.Equal(isActive, heartbeat.IsActive);
+ Assert.Equal("site1", heartbeat.SiteId);
+ }
}
diff --git a/tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs b/tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs
index d96a51e6..ab6d08bf 100644
--- a/tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs
+++ b/tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs
@@ -909,6 +909,50 @@ public class DeploymentServiceTests : TestKit
Assert.Equal("sha256:target", prior.RevisionHash);
}
+ // ── DeploymentManager-020: reconciliation audit attributes to the CURRENT user, not the prior deployer ──
+
+ [Fact]
+ public async Task DeployInstanceAsync_Reconciled_AuditAttributesCurrentUserNotPriorDeployer()
+ {
+ // DeploymentManager-020: a redeploy that reconciles a timed-out prior
+ // record must be audited as the action of the user driving THIS
+ // redeploy — not the user who originally issued the now-reconciled
+ // deployment. The prior deployer is preserved in the detail object so
+ // forensics still see who started the rescued run.
+ var instance = new Instance("ReconcileAuditUser")
+ {
+ Id = 73, SiteId = 1, State = InstanceState.NotDeployed
+ };
+ _repo.GetInstanceByIdAsync(73, Arg.Any()).Returns(instance);
+ SetupValidPipeline(73, "ReconcileAuditUser", "sha256:target");
+
+ var prior = new DeploymentRecord("dep-prior-73", "originalUser")
+ {
+ InstanceId = 73,
+ Status = DeploymentStatus.InProgress,
+ RevisionHash = "sha256:target"
+ };
+ _repo.GetCurrentDeploymentStatusAsync(73, Arg.Any()).Returns(prior);
+ _repo.GetDeployedSnapshotByInstanceIdAsync(73, Arg.Any())
+ .Returns((DeployedConfigSnapshot?)null);
+
+ var commActor = Sys.ActorOf(Props.Create(() =>
+ new ReconcileProbeActor(siteHash: "sha256:target", failQuery: false)));
+ var service = CreateServiceWithCommActor(commActor);
+
+ var result = await service.DeployInstanceAsync(73, "currentUser");
+
+ Assert.True(result.IsSuccess);
+ // DeploymentManager-020: audit row's actor is the current user.
+ await _audit.Received().LogAsync(
+ "currentUser", "DeployReconciled", "Instance", "73", "ReconcileAuditUser",
+ Arg.Any