From 0ba4e49e118190817c004a7d0070c738d284f473 Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Sun, 17 May 2026 00:51:58 -0400 Subject: [PATCH] =?UTF-8?q?docs(code-reviews):=20re-review=20batch=204=20a?= =?UTF-8?q?t=2039d737e=20=E2=80=94=20SiteEventLogging,=20SiteRuntime,=20St?= =?UTF-8?q?oreAndForward,=20TemplateEngine?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 11 new findings: SiteEventLogging-012..014, SiteRuntime-017..019, StoreAndForward-015..017, TemplateEngine-015..016. --- code-reviews/README.md | 29 +++- code-reviews/SiteEventLogging/findings.md | 146 +++++++++++++++- code-reviews/SiteRuntime/findings.md | 147 ++++++++++++++++- code-reviews/StoreAndForward/findings.md | 192 +++++++++++++++++++++- code-reviews/TemplateEngine/findings.md | 126 +++++++++++++- 5 files changed, 613 insertions(+), 27 deletions(-) diff --git a/code-reviews/README.md b/code-reviews/README.md index 17f5d66..15e8afe 100644 --- a/code-reviews/README.md +++ b/code-reviews/README.md @@ -41,9 +41,9 @@ module file and counted in **Total**. |----------|---------------| | Critical | 0 | | High | 8 | -| Medium | 20 | -| Low | 27 | -| **Total** | **55** | +| Medium | 26 | +| Low | 32 | +| **Total** | **66** | ## Module Status @@ -64,10 +64,10 @@ module file and counted in **Total**. | [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 17 | | [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/2/1/2 | 5 | 18 | | [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/2 | 4 | 15 | -| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 | -| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 | -| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 | -| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 | +| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 14 | +| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 19 | +| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/1 | 3 | 17 | +| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/0 | 2 | 16 | ## Pending Findings @@ -93,7 +93,7 @@ _None open._ | NotificationService-014 | [NotificationService](NotificationService/findings.md) | OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever | | NotificationService-015 | [NotificationService](NotificationService/findings.md) | Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script | -### Medium (20) +### Medium (26) | ID | Module | Title | |----|--------|-------| @@ -117,8 +117,14 @@ _None open._ | NotificationService-016 | [NotificationService](NotificationService/findings.md) | `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials | | Security-012 | [Security](Security/findings.md) | Partial LDAP failure during login yields a roleless authenticated session | | Security-014 | [Security](Security/findings.md) | `RefreshToken` re-issues a token without checking the idle timeout | +| SiteEventLogging-012 | [SiteEventLogging](SiteEventLogging/findings.md) | Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted | +| SiteRuntime-017 | [SiteRuntime](SiteRuntime/findings.md) | Instance Actor's live `_attributes` dictionary is shared by reference into child actor constructors | +| StoreAndForward-015 | [StoreAndForward](StoreAndForward/findings.md) | `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever" | +| StoreAndForward-016 | [StoreAndForward](StoreAndForward/findings.md) | Operator-initiated parked-message retry and discard are not replicated to the standby | +| TemplateEngine-015 | [TemplateEngine](TemplateEngine/findings.md) | `RenameCompositionAsync` does not cascade-rename nested derived templates | +| TemplateEngine-016 | [TemplateEngine](TemplateEngine/findings.md) | Composed-script `ScriptScope.ParentPath` is always empty, breaking `Parent.X` resolution for nested modules | -### Low (27) +### Low (32) | ID | Module | Title | |----|--------|-------| @@ -149,3 +155,8 @@ _None open._ | NotificationService-018 | [NotificationService](NotificationService/findings.md) | Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed | | Security-013 | [Security](Security/findings.md) | `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas | | Security-015 | [Security](Security/findings.md) | Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims | +| SiteEventLogging-013 | [SiteEventLogging](SiteEventLogging/findings.md) | Keyword search does not escape SQL `LIKE` wildcards in user input | +| SiteEventLogging-014 | [SiteEventLogging](SiteEventLogging/findings.md) | Initial purge runs synchronously on the host startup thread | +| SiteRuntime-018 | [SiteRuntime](SiteRuntime/findings.md) | `ScriptExecutionActor` XML doc still claims a "dedicated blocking I/O dispatcher" | +| SiteRuntime-019 | [SiteRuntime](SiteRuntime/findings.md) | Dead `DisableInstanceCommand` / `EnableInstanceCommand` handlers in `InstanceActor` | +| StoreAndForward-017 | [StoreAndForward](StoreAndForward/findings.md) | Retry/Discard activity-log entries hard-code the `ExternalSystem` category | diff --git a/code-reviews/SiteEventLogging/findings.md b/code-reviews/SiteEventLogging/findings.md index 438abf5..b5d1b09 100644 --- a/code-reviews/SiteEventLogging/findings.md +++ b/code-reviews/SiteEventLogging/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.SiteEventLogging` | | Design doc | `docs/requirements/Component-SiteEventLogging.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 | +| Commit reviewed | `39d737e` | +| Open findings | 3 | ## Summary @@ -28,16 +28,33 @@ cluster-singleton placement of the handler actor (which can pin to the standby node), missing indexes for common query filters, retention/cap purge not enforcing the requirement strictly, and several documentation/maintainability issues. +#### Re-review 2026-05-17 (commit `39d737e`) + +Re-reviewed the module at commit `39d737e`. All eleven prior findings remain closed +(SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions +hold up under inspection — the background writer, lock-guarded `WithConnection`, +`auto_vacuum = INCREMENTAL` plus logical-size measurement, the severity index, and +the concrete-recorder DI wiring are all present and correct at this commit. The +module source is byte-identical between `39d737e` and current `HEAD`, so this review +reflects the live code. Three new findings were recorded, all low-to-medium and none +regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness +gap left by the SiteEventLogging-005 background-writer rework: when an event cannot +be persisted because the logger has been disposed, the returned `Task` is completed +*successfully* rather than faulted, so an `await`-ing caller is told a dropped audit +event was written. The other two are minor: unescaped SQL `LIKE` wildcards in the +keyword-search filter (SiteEventLogging-013) and the initial purge running +synchronously on the host startup thread (SiteEventLogging-014). + ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| -| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). | +| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013). | | 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. | | 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). | | 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. | | 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. | -| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). | +| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: initial purge blocks host startup thread (-014). | | 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). | | 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). | | 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). | @@ -529,3 +546,122 @@ explanatory note added to `AddSiteEventLogging` pointing readers to where the ac is actually registered. Documentation/dead-code change only; no regression test was added — the change is a method removal verified by the compiler (no callers) and the full module suite still passing. + +### SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197` | + +**Description** + +`LogEventAsync` returns a `Task` that, per the interface XML doc (corrected under +SiteEventLogging-009), "completes once the event is durably persisted and faults if +the write fails, so callers that `await` it observe success or failure." Two paths +break that contract by signalling **success** for an event that was never written: + +1. In `LogEventAsync`, if `_writeQueue.Writer.TryWrite(pending)` fails (the channel + has been completed because the logger was disposed), the code calls + `pending.Completion.TrySetResult()` — completing the `Task` successfully — even + though the comment immediately above acknowledges "there is nowhere to persist the + event." +2. In `ProcessWriteQueueAsync`, `WithConnection` returns `false` when the logger has + been disposed mid-drain. The code does not inspect the returned `written` flag and + unconditionally calls `pending.Completion.TrySetResult()`, again reporting success + for an event the comment admits "simply cannot be persisted." + +The event log is the site's diagnostic audit trail. A caller that `await`s +`LogEventAsync` to confirm a critical event (deployment applied, alarm activated) was +recorded will observe a *successful* completion for an event that was silently +dropped. This is the same class of defect SiteEventLogging-008 fixed for write +*errors* — but the disposed-drop path was left reporting false success. The window +is the disposal/shutdown interval, during which shutdown-related events (graceful +singleton handover, instance disable) are exactly the events most likely to be +enqueued and lost. + +**Recommendation** + +For both paths, fault the `Task` (or complete it with a sentinel failure) instead of +`TrySetResult()` — e.g. `pending.Completion.TrySetException(new ObjectDisposedException(...))` +— so an `await`-ing caller can distinguish a dropped event from a persisted one. +Inspect the `written` flag returned by `WithConnection` in `ProcessWriteQueueAsync` +and only call `TrySetResult()` when `written` is `true`. Update the XML doc if a +deliberate "drop silently on shutdown" semantics is chosen instead. + +**Resolution** + +_Unresolved._ + +### SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83` | + +**Description** + +The keyword-search filter builds the `LIKE` pattern as `$"%{request.KeywordFilter}%"` +and binds it as a parameter. Parameterisation correctly prevents SQL injection, but +it does **not** neutralise the `LIKE` metacharacters `%` and `_` inside the +user-supplied keyword. A search for a literal `_` (common in event sources and +identifiers such as `store_and_forward`, `PLC_1`, or instance IDs) is interpreted as +"match any single character", and a `%` matches any run of characters. The design +calls keyword search "free-text search on message and source fields ... Useful for +finding events by script name, alarm name, or error message" — users will reasonably +expect a literal substring match, so a query for `store_and_forward` silently returns +events containing `storeXandYforward` and similar false positives. There is no way +for the caller to search for a literal underscore or percent. + +**Recommendation** + +Escape `%`, `_`, and the escape character itself in `request.KeywordFilter` before +wrapping it in `%...%`, and append an `ESCAPE` clause to the `LIKE` expression +(e.g. `... LIKE $keyword ESCAPE '\'`). Alternatively document that the keyword field +accepts `LIKE` wildcard syntax, but a literal-substring match is the behaviour the +design implies. + +**Resolution** + +_Unresolved._ + +### SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48` | + +**Description** + +`EventLogPurgeService.ExecuteAsync` calls `RunPurge()` (a fully synchronous method +that runs `PurgeByRetention` and `PurgeByStorageCap`) *before* the first `await` +(`await timer.WaitForNextTickAsync(...)`). A `BackgroundService`'s `ExecuteAsync` is +invoked from `StartAsync`, and the host's startup pipeline does not proceed past a +`BackgroundService` until its `ExecuteAsync` yields at the first real `await`. Because +`RunPurge()` precedes any `await`, the entire initial purge — including a cap-purge +that deletes rows in 1000-row batches and runs `PRAGMA incremental_vacuum` until a +near-1 GB database is back under the cap — executes inline on the startup thread, +blocking host startup (and therefore the `/health/ready` gate) for as long as the +purge takes. On a site that has accumulated a large log this can be a multi-second +stall during every node start/failover. The class doc states the service "runs on a +background thread and does not block event recording" — the startup-thread block is +inconsistent with that intent. + +**Recommendation** + +Yield before the initial purge so it runs on the background scheduler rather than the +startup thread — e.g. `await Task.Yield();` as the first statement of `ExecuteAsync`, +or move the initial `RunPurge()` to after the first `await timer.WaitForNextTickAsync` +(accepting a one-interval delay), or offload it with `await Task.Run(RunPurge, stoppingToken)`. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/SiteRuntime/findings.md b/code-reviews/SiteRuntime/findings.md index 06b331b..90904c5 100644 --- a/code-reviews/SiteRuntime/findings.md +++ b/code-reviews/SiteRuntime/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.SiteRuntime` | | Design doc | `docs/requirements/Component-SiteRuntime.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 | +| Commit reviewed | `39d737e` | +| Open findings | 3 | ## Summary @@ -28,6 +28,24 @@ in a comment but ships it anyway). Test coverage exists for the coordinator acto persistence and scripting, but the short-lived execution actors, the replication actor, and the repositories are untested. +#### Re-review 2026-05-17 (commit `39d737e`) + +The module was re-reviewed at commit `39d737e`. No source under +`src/ScadaLink.SiteRuntime` has changed since the previous review at `9c60592` +(the only intervening commits are code-review documentation updates), so all of +SiteRuntime-001..013, 015, 016 remain Resolved and SiteRuntime-014 remains +Deferred — its Deferred justification (a trigger-evaluation concurrency design +decision is required before either recommended fix can land in-module) still +holds verbatim against the unchanged `ScriptActor`/`AlarmActor` source. The +re-review nonetheless worked through all 10 checklist categories afresh and +surfaced three new findings that the prior pass did not record: a cross-thread +`Dictionary` enumeration race when the Instance Actor's live `_attributes` +dictionary is handed by reference into child `ScriptActor`/`AlarmActor` +constructors (SiteRuntime-017, Medium); a stale `ScriptExecutionActor` XML doc +that still claims a "dedicated blocking I/O dispatcher" (SiteRuntime-018, Low); +and two dead lifecycle handlers in `InstanceActor` that the Deployment Manager +never routes to (SiteRuntime-019, Low). Open findings: 3. + ## Checklist coverage | # | Category | Examined | Notes | @@ -733,3 +751,126 @@ harness is a larger test-infrastructure task tracked separately and out of scope Low-severity coverage finding; the highest-value untested paths the finding called out (script timeout/failure/reply/self-stop) are now covered. Full module suite: 192 tests green. + +### SiteRuntime-017 — Instance Actor's live `_attributes` dictionary is shared by reference into child actor constructors + +| | | +|--|--| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:625`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:675`, `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:83`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:93` | + +**Description** + +`InstanceActor.CreateChildActors` passes the Instance Actor's own mutable +`_attributes` field (a plain `Dictionary`) by reference into the +`Props.Create(...)` factory for every `ScriptActor` and `AlarmActor` (as the +`initialAttributes` constructor argument). Each child constructor then iterates +that dictionary to seed its `_attributeSnapshot`: + +```csharp +if (initialAttributes != null) + foreach (var kvp in initialAttributes) + _attributeSnapshot[kvp.Key] = kvp.Value; +``` + +`Context.ActorOf` returns immediately; the child actor's constructor runs later on +the *child's* mailbox thread. Meanwhile the Instance Actor's `PreStart` returns and +the Instance Actor begins processing its mailbox — `HandleTagValueUpdate` and +`HandleAttributeValueChanged` both mutate `_attributes` (`_attributes[...] = ...`). +A DCL tag update that arrives before a child has finished its constructor copy +therefore mutates the dictionary on the Instance Actor thread while the child +thread is enumerating it. `Dictionary<,>` is explicitly not safe for concurrent +read/write: the enumeration can throw `InvalidOperationException` ("collection was +modified") — which surfaces as an `ActorInitializationException` and, under the +Instance Actor's `SupervisorStrategy`, **stops** the child (the strategy returns +`Stop` for `ActorInitializationException`). The script or alarm is then silently +absent for the life of the instance. A torn read of an entry is also possible. The +window is small but deterministically reachable on a busy site at startup/failover +— exactly the staggered-startup scenario the design is most concerned about. + +**Recommendation** + +Do not share the live dictionary. Snapshot it on the Instance Actor thread before +constructing the child — e.g. pass `new Dictionary(_attributes)` +(or an immutable copy) into each `Props.Create`. The copy is made on the Instance +Actor thread inside `CreateChildActors`, so it is race-free, and each child gets a +private dictionary to seed from. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-018 — `ScriptExecutionActor` XML doc still claims a "dedicated blocking I/O dispatcher" + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs:17` | + +**Description** + +The class-level XML summary on `ScriptExecutionActor` states "Runs on a dedicated +blocking I/O dispatcher." That is not what the code does. SiteRuntime-009 was +resolved by introducing `ScriptExecutionScheduler` (a bounded dedicated +`TaskScheduler`); the *actor itself and its mailbox* run on the **default** Akka +dispatcher, and only the script body runs on the scheduler's threads via +`Task.Factory.StartNew(..., scheduler)`. The resolution of SiteRuntime-009 +explicitly chose the `TaskScheduler` route *instead of* a HOCON dispatcher and +even removed the "in production, configure a dedicated dispatcher" comments +elsewhere — but this stale summary line was missed. A reader is told the actor is +on a dedicated dispatcher when it is not, which is misleading when reasoning about +mailbox throughput and thread-pool pressure. (`AlarmExecutionActor` does not carry +the equivalent claim — its summary only says "Same pattern as ScriptExecutionActor.") + +**Recommendation** + +Correct the summary to describe the actual model: the actor runs on the default +dispatcher and the script body is dispatched onto the dedicated +`ScriptExecutionScheduler` (SiteRuntime-009). Align the wording with the accurate +comment already present at `ScriptExecutionActor.cs:71-73`. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-019 — Dead `DisableInstanceCommand` / `EnableInstanceCommand` handlers in `InstanceActor` + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:106`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:113` | + +**Description** + +`InstanceActor`'s constructor registers `Receive` and +`Receive` handlers that log and reply with a successful +`InstanceLifecycleResponse`. These handlers are unreachable. The Deployment Manager +is the only sender of those commands, and `DeploymentManagerActor.HandleDisable` / +`HandleEnable` handle the lifecycle entirely themselves — they call +`Context.Stop(actor)` (disable) or `CreateInstanceActor(...)` (enable) directly and +reply to the original sender from the Deployment Manager. Neither command is ever +`Forward`-ed or `Tell`-ed to the Instance Actor. The handlers are dead code, and +they are actively misleading: a maintainer reading `InstanceActor` would reasonably +believe disable/enable is partly an Instance-Actor responsibility, and the no-op +"true" reply implies an instance-side acknowledgement contract that does not exist. +If a future change *did* route these commands here, the disable handler would do +nothing useful (it does not stop children or tear down state — Akka does that when +the parent stops the actor). + +**Recommendation** + +Remove the two `Receive<...>` registrations and their handler bodies from +`InstanceActor`, since the Deployment Manager owns the disable/enable lifecycle. +If the intent is to keep them for a future instance-side hook, add an XML comment +stating that the Deployment Manager currently handles these and the handlers are a +reserved placeholder — but removal is preferred. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/StoreAndForward/findings.md b/code-reviews/StoreAndForward/findings.md index d59d7b4..b1cd34d 100644 --- a/code-reviews/StoreAndForward/findings.md +++ b/code-reviews/StoreAndForward/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.StoreAndForward` | | Design doc | `docs/requirements/Component-StoreAndForward.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) | +| Commit reviewed | `39d737e` | +| Open findings | 3 (3 Deferred: 002, 011, 012 — see notes) | ## Summary @@ -30,20 +30,45 @@ status set, and untested critical paths (retry-due timing, replication-from-acti the actor bridge). None of the findings are blockers for compilation, but the replication and retry-count issues are functional defects against the design. +#### Re-review 2026-05-17 (commit `39d737e`) + +Re-reviewed at commit `39d737e` after the batch-3 fixes. All of findings 001 and +003–010, plus 013–014, are confirmed `Resolved` against the current source: the +replication wiring (`BufferAsync`/`ReplicateRemove`/`ReplicatePark`), the corrected +retry-count semantics, the conditional `UpdateMessageIfStatusAsync` writes, the +transactioned parked-message reads, the `PipeTo` refactor, the `RaiseActivity` +hardening, the `RetryParkedMessageAsync` `last_attempt_at` reset and the database +directory creation are all present as described. Findings 002, 011 and 012 were +re-verified and remain validly `Deferred` — their preconditions are unchanged (002's +residual no-handler gap, 011's Commons-owned enum, 012's Commons-owned entity placement). + +This pass surfaced **three new findings**. StoreAndForward-015 records the +StoreAndForward side of the cross-module `MaxRetries == 0` ambiguity flagged by +ExternalSystemGateway-015: `EnqueueAsync`'s public contract documents `maxRetries` only +as "parked once `MaxRetries` is reached" and never states the `0 = no limit / retry +forever` special case that `RetryMessageAsync` actually enforces, so an ESG caller +passing `0` to mean "never retry" gets the opposite behaviour with no warning from the +S&F API surface. StoreAndForward-016 records that operator-initiated parked-message +retry and discard are not replicated to the standby — only the add/remove/park sweep +paths are — so a failover diverges the standby buffer from the active one. +StoreAndForward-017 records that the Retry/Discard activity-log entries hard-code the +`ExternalSystem` category, mislabelling notification and cached-DB-write messages in +the site event log. + ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| -| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010). | +| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010); Retry/Discard activity log hard-codes the category (017). | | 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. | | 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). | -| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002). | +| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002); operator retry/discard not replicated to standby (016). | | 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. | | 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). | | 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. | | 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). | | 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). | -| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004). | +| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004); `EnqueueAsync` never documents the `maxRetries == 0` = "retry forever" special case (015). | ## Findings @@ -703,3 +728,158 @@ and bare filenames are skipped). Regression test `InitializeAsync_FileInMissingDirectory_CreatesDirectory` fails against the pre-fix code; all six `SiteActorPathTests` now pass. Fixed by the commit whose message references `StoreAndForward-014`. + +### StoreAndForward-015 — `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever" + +| | | +|--|--| +| Severity | Medium | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:114`–`:130`, `:285` | + +**Description** + +The re-review brief asks for the StoreAndForward side of the cross-module ambiguity +recorded as `ExternalSystemGateway-015`. The semantics are split across this module and +its callers, and the StoreAndForward side carries a genuine documentation/API-contract +fault: + +- `RetryMessageAsync` parks a message only when `message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries` + (`StoreAndForwardService.cs:285`). When `MaxRetries == 0` the guard is false on every + sweep, so a `0` value means **"no limit — retry forever"**. The + `StoreAndForwardMessage.MaxRetries` XML doc (`StoreAndForwardMessage.cs:31`) does state + `"0 = no limit"`, so the persistence model is internally consistent. +- But `EnqueueAsync` — the *only* public entry point into the engine — exposes a + `maxRetries` parameter (`StoreAndForwardService.cs:128`) with **no parameter + documentation at all**, and its method summary (lines 114–122) describes the lifecycle + only as "On max retries → park" / "parked once `MaxRetries` is reached" (see also the + `_deliveryHandlers` field doc, line 50–51). Nothing on the public surface tells a + caller that passing `0` flips the meaning from "park immediately / never retry" to + "retry forever". A caller reading only `EnqueueAsync` would reasonably assume `0` + retries means zero retries. +- This is exactly the trap ESG fell into: `ExternalSystemClient.CachedCallAsync` / + `DatabaseGateway.CachedWriteAsync` pass the source entity's `MaxRetries` verbatim, + intending `0` to mean "never retry", and instead get unbounded retry — the + duplicate-delivery / unbounded-buffer-growth hazard the design doc's idempotency note + warns against. The fault is not solely ESG's: the S&F public API silently overloads + `0` with the opposite of its natural reading and does not document it. + +The defect is in this module's API contract and documentation, so it is recorded here +in addition to `ExternalSystemGateway-015`. (Whether `0` *should* mean "no limit" or +"no retry" is the cross-module design decision tracked by ESG-015; this finding is +specifically that the StoreAndForward public surface fails to document whichever +meaning is chosen.) + +**Recommendation** + +Document the `maxRetries` parameter on `EnqueueAsync` explicitly with a `` tag +that states the `0` special case in the same words as `StoreAndForwardMessage.MaxRetries` +(`"0 = no limit — retried on every sweep until delivered, never parked"`), and add the +`0` case to the method summary's lifecycle description. Better still — and consistent +with the resolution of ESG-015 — make the engine reject the ambiguity at the API: accept +a nullable/`enum` retry policy, or treat `0` as an explicit "no retry" (do not buffer, or +park on the first sweep) so the natural reading and the behaviour agree. Either way the +public `EnqueueAsync` contract must state the chosen meaning; today it states nothing. + +**Resolution** + +_Unresolved._ + +### StoreAndForward-016 — Operator-initiated parked-message retry and discard are not replicated to the standby + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:339`–`:362`; `src/ScadaLink.StoreAndForward/ReplicationService.cs:131`–`:136` | + +**Description** + +`StoreAndForward-001`'s fix wired replication into the active *delivery* paths: +`BufferAsync` replicates an `Add`, a successful retry replicates a `Remove`, and a park +replicates a `Park`. But the two *operator* paths — `RetryParkedMessageAsync` (line 339) +and `DiscardParkedMessageAsync` (line 353) — change buffer state and never touch +`_replication`: + +- `RetryParkedMessageAsync` flips a row from `Parked` back to `Pending` (and clears + `retry_count` / `last_attempt_at`) in the local SQLite only. The standby's copy stays + `Parked`. +- `DiscardParkedMessageAsync` `DELETE`s the row from the local SQLite only. The standby's + copy is left in place, still `Parked`. + +The Component design doc ("Persistence") requires the active node to forward "each +buffer operation (add, remove, park)" so that on failover "the new active node has a +near-complete copy of the buffer." An operator retrying a parked message is a buffer +state change; an operator discarding one is a removal. After a failover that follows an +operator action: + +1. A **discarded** message reappears on the new active node — it is still `Parked` + there, so it resurfaces in the central UI's parked-message list and an operator must + discard it a second time. For a message deliberately removed (e.g. a known-bad + payload) this is a correctness regression of the operator's intent. +2. A **retried** message is still `Parked` on the new active node, so the operator's + "move it back to the queue" action is silently lost across the failover and the + message is not re-attempted. + +`ReplicationOperationType` only models `Add`/`Remove`/`Park` — there is no operation for +"un-park / move back to pending", so even a minimal fix needs either a new operation +type or a re-use of `Add` to overwrite the standby row. This is the same class of defect +as the now-resolved `StoreAndForward-001`, for the operator paths rather than the sweep +paths. + +**Recommendation** + +Replicate both operator actions. `DiscardParkedMessageAsync` should call +`_replication?.ReplicateRemove(messageId)` after a successful local delete (the existing +`Remove` op already deletes on the standby). For `RetryParkedMessageAsync`, add a +`Requeue`/`Unpark` `ReplicationOperationType` whose `ApplyReplicatedOperationAsync` case +resets the standby row to `Pending` with `retry_count = 0`, or have the method re-load +the updated message and replicate it as an `Add`-style upsert. Add replication tests for +both operator paths (the existing `StoreAndForwardReplicationTests` only cover the sweep +paths). + +**Resolution** + +_Unresolved._ + +### StoreAndForward-017 — Retry/Discard activity-log entries hard-code the `ExternalSystem` category + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:344`, `:358` | + +**Description** + +`RetryParkedMessageAsync` and `DiscardParkedMessageAsync` raise an S&F activity +notification (consumed by Site Event Logging — WP-14) but pass a hard-coded +`StoreAndForwardCategory.ExternalSystem` as the category argument: + +```csharp +RaiseActivity("Retry", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} moved back to queue"); +RaiseActivity("Discard", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} discarded"); +``` + +Both methods take only a `messageId` and never load the message, so they have no access +to its real category. When an operator retries or discards a parked **Notification** or +**CachedDbWrite** message, the site event log records the activity under +`ExternalSystem`. Every other `RaiseActivity` call in the service passes the message's +true `Category` (`EnqueueAsync`, `RetryMessageAsync`), so the operator paths are +inconsistent and produce mislabelled audit entries — misleading when an operator later +filters or reviews S&F activity by category. + +**Recommendation** + +Load the message (or have `StoreAndForwardStorage.RetryParkedMessageAsync` / +`DiscardParkedMessageAsync` return the affected row's category) and pass the real +`Category` to `RaiseActivity`. If loading the row is considered too costly on these +infrequent operator paths, change the `OnActivity` event / `RaiseActivity` signature to +allow a nullable category for management actions rather than asserting a false one. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/TemplateEngine/findings.md b/code-reviews/TemplateEngine/findings.md index f2de185..9413f07 100644 --- a/code-reviews/TemplateEngine/findings.md +++ b/code-reviews/TemplateEngine/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.TemplateEngine` | | Design doc | `docs/requirements/Component-TemplateEngine.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 | +| Commit reviewed | `39d737e` | +| Open findings | 2 | ## Summary @@ -29,11 +29,30 @@ create, optimistic concurrency on instance state) are claimed but not implemente Themes: validation that is weaker than the design promises, and asymmetric handling of attributes vs. alarms vs. scripts throughout the resolve/flatten/derive paths. +#### Re-review 2026-05-17 (commit `39d737e`) + +Re-reviewed the whole module against all ten checklist categories at commit +`39d737e`. All fourteen prior findings remain closed — the batch-4 fixes +(`bc88a36`/`804697f` and predecessors) hold up: the recursive composition walk, +the per-slot alarm override mechanism, the code-region-aware delimiter scanner, +and the single-source deletion-constraint logic are all correctly in place. Two +new Medium findings surfaced, both in the composition-cascade path and both +affecting **nested** (depth ≥ 2) compositions specifically — the same blind spot +that produced TemplateEngine-001. **TemplateEngine-015**: `RenameCompositionAsync` +renames only the directly slot-owned derived template, leaving cascaded inner +derived templates with a stale dotted-path name. **TemplateEngine-016**: +`FlatteningService` hard-codes `ScriptScope.ParentPath` to the empty string for +every composed script regardless of nesting depth, so a script two or more +levels deep cannot resolve `Parent.X` references to its real parent module. +Both are limited-impact (nested compositions are the less common case and there +is design-time visibility) but represent genuine drift from the recursive-nesting +design promise. + ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| -| 1 | Correctness & logic bugs | ✓ | Multiple real bugs: deep composed-member loss, derived alarms omitted, granularity bypass, no-op create-time collision block. | +| 1 | Correctness & logic bugs | ✓ | Prior bugs (001–005, 013) all resolved and verified. Re-review 2026-05-17 found two new nested-composition defects: rename does not cascade (TemplateEngine-015), composed-script `ParentPath` always empty (TemplateEngine-016). | | 2 | Akka.NET conventions | ✓ | No actors in this module (`AddTemplateEngineActors` is an empty placeholder). Nothing to assess. | | 3 | Concurrency & thread safety | ✓ | Services are stateless, scoped per request; static helpers hold no mutable state. Design says template editing is last-write-wins; that is honoured. See TemplateEngine-010 re: a doc claim of optimistic concurrency that is not implemented. | | 4 | Error handling & resilience | ✓ | `Result` used consistently; repository nulls guarded. `FlatteningService` wraps in try/catch. No store-and-forward or failover surface in this module. | @@ -648,3 +667,102 @@ reports all blocking reasons and uses `TemplateDeletionService`'s phrasing — t affected `TemplateServiceTests` delete tests were updated to the unified messages, and a regression test `DeleteTemplate_MultipleConstraints_ReportsAllNotJustFirst` verifies all three constraint categories are surfaced together. + +### TemplateEngine-015 — `RenameCompositionAsync` does not cascade-rename nested derived templates + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:680` | + +**Description** + +`AddCompositionAsync` builds a cascade of derived templates whose names follow a +dotted path: composing `$Sensor` (which itself composes `$Probe` as `Probe1`) +into `$Pump` as `TempSensor` produces `$Pump.TempSensor` **and** the nested +`$Pump.TempSensor.Probe1` (see `CreateCascadedCompositionAsync` and the +`AddComposition_CascadesChildCompositions` test). `RenameCompositionAsync`, +however, renames only the **directly** slot-owned derived template: + +```csharp +var derived = await _repository.GetTemplateByIdAsync(composition.ComposedTemplateId, ...); +if (derived != null && derived.IsDerived && derived.OwnerCompositionId == compositionId) +{ + var newDerivedName = $"{owner.Name}.{newInstanceName}"; + ... + derived.Name = newDerivedName; + await _repository.UpdateTemplateAsync(derived, ...); +} +``` + +There is no recursion into `derived.Compositions`. After renaming the `TempSensor` +slot to `MainSensor`, the parent derived becomes `$Pump.MainSensor` but the +cascaded child stays `$Pump.TempSensor.Probe1` — its name no longer reflects the +slot path it lives under, breaking the dotted-path naming invariant the cascade +otherwise maintains. `DeleteCompositionAsync` correctly recurses +(`CascadeDeleteDerivedAsync`), so rename is the asymmetric outlier. The +`RenameComposition_RenamesSlotAndDerivedTemplate` test only exercises a +single-level derived, so the gap is untested. The stale name also breaks the +`AddComposition_DerivedNameCollision_Fails` / cascade-name pre-check on any +subsequent compose that walks the now-inconsistent name tree. + +**Recommendation** + +Recurse over `derived.Compositions` (mirroring `CascadeDeleteDerivedAsync`), +re-deriving each cascaded child's name from the renamed parent +(`$"{parentDerivedName}.{childComposition.InstanceName}"`), and run the +existing same-name collision pre-check across every name the cascade will +produce — not just the top-level one. Add a regression test covering a +two-level cascade rename. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-016 — Composed-script `ScriptScope.ParentPath` is always empty, breaking `Parent.X` resolution for nested modules + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:750` | + +**Description** + +`ResolveComposedScriptsRecursive` assigns each composed script a `ScriptScope`: + +```csharp +Scope = new Commons.Types.Scripts.ScriptScope(SelfPath: prefix, ParentPath: "") +``` + +`prefix` is the accumulated path-qualified module path (`Outer` at depth 1, +`Outer.Inner` at depth 2, etc.), so `SelfPath` is correct. `ParentPath`, however, +is hard-coded to the empty string at every depth. Per `ScriptScope`'s own XML +doc, `ParentPath` is "computed at flattening time and seeded into the script's +globals … so `Attributes["X"]` / `Parent.X` can prepend the right path-prefix." +For a script directly composed at depth 1 the parent is the root and `""` is +correct, but for a script in a nested module (`Outer.Inner.Foo`) the parent +module is `Outer` — yet `ParentPath` is still `""`. A nested composed script +that references `Parent.X` will therefore resolve the reference against the root +flat namespace instead of its actual parent module, reading the wrong attribute +(or failing to find one). This is the same depth-≥2 nesting blind spot as +TemplateEngine-001; the recursive walk was added there but the `Scope` +construction was not updated to carry the parent path. `ResolveComposedScripts` +for direct (root-template) scripts leaves `Scope` at the default `ScriptScope.Root`, +which is correct. + +**Recommendation** + +Thread the parent module path through `ResolveComposedScriptsRecursive` (the +caller already knows it — it is the `prefix` of the enclosing recursion frame, +or `""` for a depth-1 composition) and set +`ParentPath` to that value, so `SelfPath = "Outer.Inner"` pairs with +`ParentPath = "Outer"`. Add a flattening test asserting the `Scope` of a +two-level composed script. + +**Resolution** + +_Unresolved._