fix(api-surface): close Theme 9 — 27 naming / dead-code / config / hygiene findings

The largest themed batch: small mechanical fixes across 11 modules.

API / message hygiene:
- Comm-020: SiteAddressCacheLoaded now carries IReadOnlyDictionary / IReadOnlyList (Akka messages must be immutable).
- Commons-016: BundleSession.MaxUnlockAttempts named constant replaces magic 3.
- Commons-018: IOperationTrackingStore + IPartitionMaintenance moved from Interfaces/ root to Interfaces/Services/ (namespace preserved; 9 consumers exceeded the in-prompt move threshold).
- Commons-023: TrackingStatusSnapshot.SourceNode now consistent with the trailing-optional-with-default pattern used elsewhere.
- SR-022: AuditingDbCommand.DbConnection.set no longer uses reflection; it exposes AuditingDbConnection.Inner via internal API surface.

Dead code / config cleanup:
- ClusterInfra-011: decorative SectionName constant deleted.
- ClusterInfra-014: dead AddClusterInfrastructureActors method + its "throws-when-called" test deleted.
- Host-021: Microsoft Logging:LogLevel block deleted from appsettings.json (dead under Serilog).

Fail-loud over fail-silent:
- DM-021: ResolveSiteIdentifierAsync throws on missing site (was silently substituting a DB id).
- DM-022: dropped transient Pending write; record now lands directly in InProgress (no UI flicker, one fewer DB write).
- Host-020: LoggerConfigurationFactory emits a Console.Error warning when both Serilog:MinimumLevel and ScadaLink:Logging:MinimumLevel are set (ScadaLink remains truth per Host-011).
- SnF-022: NotifyCachedCallObserverAsync logs Warning on unparseable TrackedOperationId (was silently dropping).
- SnF-023: empty siteId default replaced with $unknown-site sentinel + constructor normalisation.

Correctness:
- SCA-001: SupervisorStrategy XML rewritten to match actual DefaultDecider/Restart semantics (was claiming Resume).
- SCA-003: OnUpsertAsync now restamps IngestedAtUtc on every upsert.
- SR-021: HandleDeployArtifacts now dispatches an internal ApplyArtifactDataConnectionsToDcl message after the SQLite write so system-wide artifact-deploy data-connection changes go live immediately (was requiring a site restart).
- SnF-020: RetryParkedMessageAsync captures the parked row BEFORE the local write so a concurrent delete can't skip standby replication.

Sentinels / naming collisions:
- HM-021: CentralSiteId changed from "central" to "$central" (uncollideable: leading $ is forbidden in real SiteIdentifiers).

Doc / surface cleanups:
- SEL-018: FailedWriteCount promoted to ISiteEventLogger; XML softened to "Available for future Health Monitoring integration".
- SnF-019: VERIFY outcome: documented parking-after-DefaultMaxRetries in Component-StoreAndForward.md + DefaultMaxRetries XML (uniform cap; maxRetries:0 is the unbounded escape hatch).
- SnF-021: Component-StoreAndForward.md no longer claims the tracking table lives in SnF; it's in SiteRuntime, the interface is in Commons.
- CLI-020: bundle export response parse guarded with try/catch on JsonException / KeyNotFoundException / FormatException; emits a clean INVALID_RESPONSE exit instead of a stack trace.

Config:
- ClusterInfra-013: intent comment added to "catastrophic config" test.
- Host-016: appsettings.Site.json second CentralContactPoints entry removed (was pointing at the SITE's own port); doc-key explains how to extend.
- Host-018: NodeName added to both shipped per-role configs (was causing SourceNode to be null on audit rows).

UI:
- CentralUI-029: replaced JS.InvokeAsync<int>("eval", …) with an ES module import (new wwwroot/js/browser-time.js).
- CentralUI-032: AuditResultsGrid gains a Previous button backed by a cursor stack.

10+ new regression tests across the affected projects. Build clean; all suites green. README regenerated: 6 open (was 33). Session-to-date: 130 of 136 originally-open Theme findings closed.
2026-05-28 08:39:01 -04:00
parent d190345ef0
commit 77cb0ad0e2
46 changed files with 966 additions and 278 deletions
@@ -18,8 +18,7 @@ Site clusters only. The central cluster does not buffer messages.
 - Retry delivery per message according to the configured retry policy.
 - Park messages that exhaust their retry limit (dead-letter).
 - Persist buffered messages to local SQLite for durability.
-- Maintain a site-local **operation tracking table** holding one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`) — the authoritative status record consulted by `Tracking.Status(id)`.
-- Emit cached-call lifecycle telemetry to the central Site Call Audit component on every status transition.
+- Emit cached-call lifecycle telemetry to the central Site Call Audit component via the `ICachedCallLifecycleObserver` hook (one notification per attempt outcome) so the audit pipeline can record each status transition.
 - Replicate buffered messages to the standby node via application-level replication over Akka.NET remoting.
 - On failover, the standby node takes over delivery from its replicated copy.
 - Respond to remote queries from central for parked message management (list, retry, discard), including central-driven Retry/Discard of parked cached calls.
@@ -44,7 +43,7 @@ Attempt immediate delivery
                    └── Max retries exhausted → Park message
 ```

-For notifications, "delivery" means forwarding the message to the central cluster via Central–Site Communication; "success" is central's ack, on which the message is cleared. Notifications do not park — they are retried at the fixed forward interval until central acks. Parking applies only to the external-system-call and cached-database-write categories.
+For notifications, "delivery" means forwarding the message to the central cluster via Central–Site Communication; "success" is central's ack, on which the message is cleared. Notifications are retried at the fixed forward interval until central acks, but — like every other category — they are bounded by the engine's `DefaultMaxRetries` cap: a sustained central outage that exceeds `DefaultMaxRetries × forward-interval` will park the buffered notification, after which an operator can Retry/Discard it via the parked-message UI. Operationally, the cap is sized so the normal central-recovery window stays well inside it; "do not park" is the design's operational intent on the happy path, not an absolute invariant. Callers that genuinely require unbounded retry pass `maxRetries: 0` on `EnqueueAsync` (the documented "no limit" escape hatch — see `StoreAndForward-015`).

 For the cached-call categories (`ExternalCall` and `DatabaseWrite`), the operation tracking table is the status record and the S&F buffer is purely the retry mechanism. A cached call that succeeds on its first immediate attempt is written directly as a terminal `Delivered` tracking row and never enters the S&F buffer. When immediate delivery fails transiently, the message is buffered and its tracking row moves to `Pending`/`Retrying`; the buffered message carries its `TrackedOperationId` so the tracking row and the retry record stay linked. When immediate delivery fails **permanently** (e.g. HTTP 4xx), the message is not buffered — the error is returned synchronously to the calling script as before — but the tracking row is written directly as a terminal `Failed` row capturing the error. On every tracking-table status transition the site emits `CachedCallTelemetry` to central.

@@ -56,7 +55,7 @@ For the external-system-call and cached-database-write categories, retry setting
 - **External systems**: Each external system definition includes max retry count and time between retries.
 - **Cached database writes**: Each database connection definition includes max retry count and time between retries.

-The **notification** category retries differently: it has no source-entity setting. The site→central forward uses a single fixed retry interval configured in the host `appsettings.json`. This interval is infrastructure config for reaching the central cluster, not a per-notification-list setting. It applies uniformly to every buffered notification regardless of its target list. A buffered notification is retried until central acks it; it is not parked on a retry limit (central, once reachable, owns delivery, retry, and parking from that point on).
+The **notification** category retries differently: it has no source-entity setting. The site→central forward uses a single fixed retry interval configured in the host `appsettings.json`. This interval is infrastructure config for reaching the central cluster, not a per-notification-list setting. It applies uniformly to every buffered notification regardless of its target list. A buffered notification is retried at that interval until central acks it; the engine's `DefaultMaxRetries` cap still applies (matching the cached-call categories) and a notification whose retries are exhausted under a sustained central outage parks like any other buffered message. The cap is sized so the normal central-recovery window stays well inside it — central, once reachable, owns delivery, retry, and parking from the ack point on.

 The retry interval is **fixed** (not exponential backoff). Fixed interval is sufficient for the expected use cases.

@@ -74,14 +73,24 @@ There is **no maximum buffer size**. Messages accumulate in the buffer until del
 - On failover, the new active node has a near-complete copy of the buffer. In rare cases, the most recent operations may not have been replicated (e.g., a message added or removed just before failover). This can result in a few **duplicate deliveries** (message delivered but remove not replicated) or a few **missed retries** (message added but not replicated). Both are acceptable trade-offs for the latency benefit.
 - On failover, the new active node resumes delivery from its local copy.

-### Operation Tracking Table
+### Operation Tracking Table (lives in Site Runtime, not here)

-Alongside the S&F buffer DB, each site node holds a **site-local operation tracking table** in SQLite. It carries one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`), created the moment the script issues the cached call and kept regardless of outcome.
+> **StoreAndForward-021:** the operation tracking table is **not** owned by
+> this component. The `IOperationTrackingStore` interface lives in
+> `src/ScadaLink.Commons/Interfaces/Services/`, and the SQLite-backed
+> implementation (`OperationTrackingStore`, alongside `OperationTrackingOptions`)
+> lives in [`src/ScadaLink.SiteRuntime/Tracking/`](../../src/ScadaLink.SiteRuntime/Tracking/).
+> See [`Component-SiteRuntime.md`](Component-SiteRuntime.md) for the table's
+> semantics, lifecycle, and central-mirror coordination — it is summarised here
+> only because the S&F retry loop carries the `TrackedOperationId` linking a
+> buffered cached-call row to its tracking entry.

-- This table is the **status record**; the S&F buffer remains purely the **retry mechanism**. A buffered cached-call message references its `TrackedOperationId` back to its tracking row.
+For context: each site node also holds a site-local operation tracking table in SQLite (owned by Site Runtime) carrying one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`), created the moment the script issues the cached call and kept regardless of outcome.
+
+- That table is the **status record**; the S&F buffer remains purely the **retry mechanism**. A buffered cached-call message references its `TrackedOperationId` back to its tracking row.
 - Each row records the operation kind (`TrackedOperationKind`), a target summary (external system + method, or database connection name), the unified `TrackedOperationStatus`, retry count, last error, source provenance (instance / script), and the created/updated/terminal UTC timestamps.
-- `Tracking.Status(id)` reads this table. For cached calls the **site is the authoritative source of truth** for status — the query is always answered site-locally, even when central is unreachable. The central Site Call Audit `SiteCalls` table is an eventually-consistent mirror.
-- A cached call that succeeds on its first immediate attempt writes a terminal `Delivered` row directly here, with nothing placed in the S&F buffer.
+- `Tracking.Status(id)` reads that table. For cached calls the **site is the authoritative source of truth** for status — the query is always answered site-locally, even when central is unreachable. The central Site Call Audit `SiteCalls` table is an eventually-consistent mirror.
+- A cached call that succeeds on its first immediate attempt writes a terminal `Delivered` row directly there, with nothing placed in the S&F buffer.
 - Terminal rows are purged after a configurable retention window (default 7 days) — the site holds live operational state; central holds long-term audit.

 Notifications are unaffected: they have no tracking table. Their `NotificationId` and status are owned by the central `Notifications` table, and their lifecycle continues to forward to central exactly as before.