fix(api-surface): close Theme 9 — 27 naming / dead-code / config / hygiene findings
The largest themed batch — small mechanical fixes across 11 modules.
API / message hygiene:
- Comm-020: SiteAddressCacheLoaded now carries IReadOnlyDictionary /
IReadOnlyList — Akka messages must be immutable.
- Commons-016: BundleSession.MaxUnlockAttempts named constant replaces
magic 3.
- Commons-018: IOperationTrackingStore + IPartitionMaintenance moved from
Interfaces/ root to Interfaces/Services/ (namespace preserved — 9
consumers exceeded the in-prompt move threshold).
- Commons-023: TrackingStatusSnapshot.SourceNode now consistent with the
trailing-optional-with-default pattern used elsewhere.
- SR-022: AuditingDbCommand.DbConnection.set no longer uses reflection —
exposes AuditingDbConnection.Inner via internal API surface.
Dead code / config cleanup:
- ClusterInfra-011: decorative SectionName constant deleted.
- ClusterInfra-014: dead AddClusterInfrastructureActors method + its
"throws-when-called" test deleted.
- Host-021: Microsoft Logging:LogLevel block deleted from appsettings.json
(dead under Serilog).
Fail-loud over fail-silent:
- DM-021: ResolveSiteIdentifierAsync throws on missing site (was silently
substituting a DB id).
- DM-022: dropped transient Pending write — record now lands directly in
InProgress (no UI flicker, one fewer DB write).
- Host-020: LoggerConfigurationFactory emits a Console.Error warning when
both Serilog:MinimumLevel and ScadaLink:Logging:MinimumLevel are set
(ScadaLink remains truth per Host-011).
- SnF-022: NotifyCachedCallObserverAsync logs Warning on unparseable
TrackedOperationId (was silently dropping).
- SnF-023: empty siteId default replaced with $unknown-site sentinel
+ constructor normalisation.
Correctness:
- SCA-001: SupervisorStrategy XML rewritten to match actual
DefaultDecider/Restart semantics (was claiming Resume).
- SCA-003: OnUpsertAsync now restamps IngestedAtUtc on every upsert.
- SR-021: HandleDeployArtifacts now dispatches an internal
ApplyArtifactDataConnectionsToDcl message after the SQLite write so
system-wide artifact-deploy data-connection changes go live
immediately (was requiring a site restart).
- SnF-020: RetryParkedMessageAsync captures the parked row BEFORE the
local write so a concurrent delete can't skip standby replication.
Sentinels / naming collisions:
- HM-021: CentralSiteId changed from "central" to "$central"
(uncollideable — leading $ is forbidden in real SiteIdentifiers).
Doc / surface cleanups:
- SEL-018: FailedWriteCount promoted to ISiteEventLogger; XML softened
to "Available for future Health Monitoring integration".
- SnF-019: VERIFY outcome — documented parking-after-DefaultMaxRetries
in Component-StoreAndForward.md + DefaultMaxRetries XML (uniform
cap; maxRetries:0 is the unbounded escape hatch).
- SnF-021: Component-StoreAndForward.md no longer claims the tracking
table lives in SnF — it's in SiteRuntime, the interface is in Commons.
- CLI-020: bundle export response parse guarded with try/catch on
JsonException / KeyNotFoundException / FormatException — emits a
clean INVALID_RESPONSE exit instead of a stack trace.
Config:
- ClusterInfra-013: intent comment added to "catastrophic config" test.
- Host-016: appsettings.Site.json second CentralContactPoints entry
removed (was pointing at the SITE's own port); doc-key explains how
to extend.
- Host-018: NodeName added to both shipped per-role configs (was
causing SourceNode to be null on audit rows).
UI:
- CentralUI-029: replaced JS.InvokeAsync<int>("eval", …) with an ES
module import (new wwwroot/js/browser-time.js).
- CentralUI-032: AuditResultsGrid gains a Previous button backed by a
cursor stack.
10+ new regression tests across the affected projects. Build clean;
all suites green. README regenerated: 6 open (was 33).
Session-to-date: 130 of 136 originally-open Theme findings closed.
This commit is contained in:
@@ -18,8 +18,7 @@ Site clusters only. The central cluster does not buffer messages.
|
||||
- Retry delivery per message according to the configured retry policy.
|
||||
- Park messages that exhaust their retry limit (dead-letter).
|
||||
- Persist buffered messages to local SQLite for durability.
|
||||
- Maintain a site-local **operation tracking table** holding one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`) — the authoritative status record consulted by `Tracking.Status(id)`.
|
||||
- Emit cached-call lifecycle telemetry to the central Site Call Audit component on every status transition.
|
||||
- Emit cached-call lifecycle telemetry to the central Site Call Audit component via the `ICachedCallLifecycleObserver` hook (one notification per attempt outcome) so the audit pipeline can record each status transition.
|
||||
- Replicate buffered messages to the standby node via application-level replication over Akka.NET remoting.
|
||||
- On failover, the standby node takes over delivery from its replicated copy.
|
||||
- Respond to remote queries from central for parked message management (list, retry, discard), including central-driven Retry/Discard of parked cached calls.
|
||||
@@ -44,7 +43,7 @@ Attempt immediate delivery
|
||||
└── Max retries exhausted → Park message
|
||||
```
|
||||
|
||||
For notifications, "delivery" means forwarding the message to the central cluster via Central–Site Communication; "success" is central's ack, on which the message is cleared. Notifications do not park — they are retried at the fixed forward interval until central acks. Parking applies only to the external-system-call and cached-database-write categories.
|
||||
For notifications, "delivery" means forwarding the message to the central cluster via Central–Site Communication; "success" is central's ack, on which the message is cleared. Notifications are retried at the fixed forward interval until central acks, but — like every other category — they are bounded by the engine's `DefaultMaxRetries` cap: a sustained central outage that exceeds `DefaultMaxRetries × forward-interval` will park the buffered notification, after which an operator can Retry/Discard it via the parked-message UI. Operationally, the cap is sized so the normal central-recovery window stays well inside it; "do not park" is the design's operational intent on the happy path, not an absolute invariant. Callers that genuinely require unbounded retry pass `maxRetries: 0` on `EnqueueAsync` (the documented "no limit" escape hatch — see `StoreAndForward-015`).
|
||||
|
||||
For the cached-call categories (`ExternalCall` and `DatabaseWrite`), the operation tracking table is the status record and the S&F buffer is purely the retry mechanism. A cached call that succeeds on its first immediate attempt is written directly as a terminal `Delivered` tracking row and never enters the S&F buffer. When immediate delivery fails transiently, the message is buffered and its tracking row moves to `Pending`/`Retrying`; the buffered message carries its `TrackedOperationId` so the tracking row and the retry record stay linked. When immediate delivery fails **permanently** (e.g. HTTP 4xx), the message is not buffered — the error is returned synchronously to the calling script as before — but the tracking row is written directly as a terminal `Failed` row capturing the error. On every tracking-table status transition the site emits `CachedCallTelemetry` to central.
|
||||
|
||||
@@ -56,7 +55,7 @@ For the external-system-call and cached-database-write categories, retry setting
|
||||
- **External systems**: Each external system definition includes max retry count and time between retries.
|
||||
- **Cached database writes**: Each database connection definition includes max retry count and time between retries.
|
||||
|
||||
The **notification** category retries differently: it has no source-entity setting. The site→central forward uses a single fixed retry interval configured in the host `appsettings.json`. This interval is infrastructure config for reaching the central cluster, not a per-notification-list setting. It applies uniformly to every buffered notification regardless of its target list. A buffered notification is retried until central acks it; it is not parked on a retry limit (central, once reachable, owns delivery, retry, and parking from that point on).
|
||||
The **notification** category retries differently: it has no source-entity setting. The site→central forward uses a single fixed retry interval configured in the host `appsettings.json`. This interval is infrastructure config for reaching the central cluster, not a per-notification-list setting. It applies uniformly to every buffered notification regardless of its target list. A buffered notification is retried at that interval until central acks it; the engine's `DefaultMaxRetries` cap still applies (matching the cached-call categories) and a notification whose retries are exhausted under a sustained central outage parks like any other buffered message. The cap is sized so the normal central-recovery window stays well inside it — central, once reachable, owns delivery, retry, and parking from the ack point on.
|
||||
|
||||
The retry interval is **fixed** (not exponential backoff). Fixed interval is sufficient for the expected use cases.
|
||||
|
||||
@@ -74,14 +73,24 @@ There is **no maximum buffer size**. Messages accumulate in the buffer until del
|
||||
- On failover, the new active node has a near-complete copy of the buffer. In rare cases, the most recent operations may not have been replicated (e.g., a message added or removed just before failover). This can result in a few **duplicate deliveries** (message delivered but remove not replicated) or a few **missed retries** (message added but not replicated). Both are acceptable trade-offs for the latency benefit.
|
||||
- On failover, the new active node resumes delivery from its local copy.
|
||||
|
||||
### Operation Tracking Table
|
||||
### Operation Tracking Table (lives in Site Runtime, not here)
|
||||
|
||||
Alongside the S&F buffer DB, each site node holds a **site-local operation tracking table** in SQLite. It carries one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`), created the moment the script issues the cached call and kept regardless of outcome.
|
||||
> **StoreAndForward-021:** the operation tracking table is **not** owned by
|
||||
> this component. The `IOperationTrackingStore` interface lives in
|
||||
> `src/ScadaLink.Commons/Interfaces/Services/`, and the SQLite-backed
|
||||
> implementation (`OperationTrackingStore`, alongside `OperationTrackingOptions`)
|
||||
> lives in [`src/ScadaLink.SiteRuntime/Tracking/`](../../src/ScadaLink.SiteRuntime/Tracking/).
|
||||
> See [`Component-SiteRuntime.md`](Component-SiteRuntime.md) for the table's
|
||||
> semantics, lifecycle, and central-mirror coordination — it is summarised here
|
||||
> only because the S&F retry loop carries the `TrackedOperationId` linking a
|
||||
> buffered cached-call row to its tracking entry.
|
||||
|
||||
- This table is the **status record**; the S&F buffer remains purely the **retry mechanism**. A buffered cached-call message references its `TrackedOperationId` back to its tracking row.
|
||||
For context: each site node also holds a site-local operation tracking table in SQLite (owned by Site Runtime) carrying one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`), created the moment the script issues the cached call and kept regardless of outcome.
|
||||
|
||||
- That table is the **status record**; the S&F buffer remains purely the **retry mechanism**. A buffered cached-call message references its `TrackedOperationId` back to its tracking row.
|
||||
- Each row records the operation kind (`TrackedOperationKind`), a target summary (external system + method, or database connection name), the unified `TrackedOperationStatus`, retry count, last error, source provenance (instance / script), and the created/updated/terminal UTC timestamps.
|
||||
- `Tracking.Status(id)` reads this table. For cached calls the **site is the authoritative source of truth** for status — the query is always answered site-locally, even when central is unreachable. The central Site Call Audit `SiteCalls` table is an eventually-consistent mirror.
|
||||
- A cached call that succeeds on its first immediate attempt writes a terminal `Delivered` row directly here, with nothing placed in the S&F buffer.
|
||||
- `Tracking.Status(id)` reads that table. For cached calls the **site is the authoritative source of truth** for status — the query is always answered site-locally, even when central is unreachable. The central Site Call Audit `SiteCalls` table is an eventually-consistent mirror.
|
||||
- A cached call that succeeds on its first immediate attempt writes a terminal `Delivered` row directly there, with nothing placed in the S&F buffer.
|
||||
- Terminal rows are purged after a configurable retention window (default 7 days) — the site holds live operational state; central holds long-term audit.
|
||||
|
||||
Notifications are unaffected: they have no tracking table. Their `NotificationId` and status are owned by the central `Notifications` table, and their lifecycle continues to forward to central exactly as before.
|
||||
|
||||
Reference in New Issue
Block a user