docs(code-reviews): re-review batch 4 at 39d737e — SiteEventLogging, SiteRuntime, StoreAndForward, TemplateEngine

11 new findings: SiteEventLogging-012..014, SiteRuntime-017..019, StoreAndForward-015..017, TemplateEngine-015..016.
Joseph Doherty
2026-05-17 00:51:58 -04:00
parent 3b3760f026
commit 0ba4e49e11
5 changed files with 613 additions and 27 deletions


@@ -41,9 +41,9 @@ module file and counted in **Total**.
|----------|---------------|
| Critical | 0 |
| High | 8 |
| Medium | 20 |
| Low | 27 |
| **Total** | **55** |
| Medium | 26 |
| Low | 32 |
| **Total** | **66** |
## Module Status
@@ -64,10 +64,10 @@ module file and counted in **Total**.
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 17 |
| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/2/1/2 | 5 | 18 |
| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/2 | 4 | 15 |
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 14 |
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 19 |
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/1 | 3 | 17 |
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/0 | 2 | 16 |
## Pending Findings
@@ -93,7 +93,7 @@ _None open._
| NotificationService-014 | [NotificationService](NotificationService/findings.md) | OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever |
| NotificationService-015 | [NotificationService](NotificationService/findings.md) | Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script |
### Medium (20)
### Medium (26)
| ID | Module | Title |
|----|--------|-------|
@@ -117,8 +117,14 @@ _None open._
| NotificationService-016 | [NotificationService](NotificationService/findings.md) | `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials |
| Security-012 | [Security](Security/findings.md) | Partial LDAP failure during login yields a roleless authenticated session |
| Security-014 | [Security](Security/findings.md) | `RefreshToken` re-issues a token without checking the idle timeout |
| SiteEventLogging-012 | [SiteEventLogging](SiteEventLogging/findings.md) | Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted |
| SiteRuntime-017 | [SiteRuntime](SiteRuntime/findings.md) | Instance Actor's live `_attributes` dictionary is shared by reference into child actor constructors |
| StoreAndForward-015 | [StoreAndForward](StoreAndForward/findings.md) | `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever" |
| StoreAndForward-016 | [StoreAndForward](StoreAndForward/findings.md) | Operator-initiated parked-message retry and discard are not replicated to the standby |
| TemplateEngine-015 | [TemplateEngine](TemplateEngine/findings.md) | `RenameCompositionAsync` does not cascade-rename nested derived templates |
| TemplateEngine-016 | [TemplateEngine](TemplateEngine/findings.md) | Composed-script `ScriptScope.ParentPath` is always empty, breaking `Parent.X` resolution for nested modules |
### Low (27)
### Low (32)
| ID | Module | Title |
|----|--------|-------|
@@ -149,3 +155,8 @@ _None open._
| NotificationService-018 | [NotificationService](NotificationService/findings.md) | Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed |
| Security-013 | [Security](Security/findings.md) | `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas |
| Security-015 | [Security](Security/findings.md) | Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims |
| SiteEventLogging-013 | [SiteEventLogging](SiteEventLogging/findings.md) | Keyword search does not escape SQL `LIKE` wildcards in user input |
| SiteEventLogging-014 | [SiteEventLogging](SiteEventLogging/findings.md) | Initial purge runs synchronously on the host startup thread |
| SiteRuntime-018 | [SiteRuntime](SiteRuntime/findings.md) | `ScriptExecutionActor` XML doc still claims a "dedicated blocking I/O dispatcher" |
| SiteRuntime-019 | [SiteRuntime](SiteRuntime/findings.md) | Dead `DisableInstanceCommand` / `EnableInstanceCommand` handlers in `InstanceActor` |
| StoreAndForward-017 | [StoreAndForward](StoreAndForward/findings.md) | Retry/Discard activity-log entries hard-code the `ExternalSystem` category |


@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 3 |
## Summary
@@ -28,16 +28,33 @@ cluster-singleton placement of the handler actor (which can pin to the standby
node), missing indexes for common query filters, retention/cap purge not enforcing
the requirement strictly, and several documentation/maintainability issues.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed the module at commit `39d737e`. All eleven prior findings remain closed
(SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions
hold up under inspection — the background writer, lock-guarded `WithConnection`,
`auto_vacuum = INCREMENTAL` plus logical-size measurement, the severity index, and
the concrete-recorder DI wiring are all present and correct at this commit. The
module source is byte-identical between `39d737e` and current `HEAD`, so this review
reflects the live code. Three new findings were recorded, all low-to-medium and none
regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness
gap left by the SiteEventLogging-005 background-writer rework: when an event cannot
be persisted because the logger has been disposed, the returned `Task` is completed
*successfully* rather than faulted, so an `await`-ing caller is told a dropped audit
event was written. The other two are minor: unescaped SQL `LIKE` wildcards in the
keyword-search filter (SiteEventLogging-013) and the initial purge running
synchronously on the host startup thread (SiteEventLogging-014).
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). |
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013). |
| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
| 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. |
| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: initial purge blocks host startup thread (-014). |
| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
@@ -529,3 +546,122 @@ explanatory note added to `AddSiteEventLogging` pointing readers to where the ac
is actually registered. Documentation/dead-code change only; no regression test was
added — the change is a method removal verified by the compiler (no callers) and the
full module suite still passing.
### SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197` |
**Description**
`LogEventAsync` returns a `Task` that, per the interface XML doc (corrected under
SiteEventLogging-009), "completes once the event is durably persisted and faults if
the write fails, so callers that `await` it observe success or failure." Two paths
break that contract by signalling **success** for an event that was never written:
1. In `LogEventAsync`, if `_writeQueue.Writer.TryWrite(pending)` fails (the channel
has been completed because the logger was disposed), the code calls
`pending.Completion.TrySetResult()` — completing the `Task` successfully — even
though the comment immediately above acknowledges "there is nowhere to persist the
event."
2. In `ProcessWriteQueueAsync`, `WithConnection` returns `false` when the logger has
been disposed mid-drain. The code does not inspect the returned `written` flag and
unconditionally calls `pending.Completion.TrySetResult()`, again reporting success
for an event the comment admits "simply cannot be persisted."
The event log is the site's diagnostic audit trail. A caller that `await`s
`LogEventAsync` to confirm a critical event (deployment applied, alarm activated) was
recorded will observe a *successful* completion for an event that was silently
dropped. This is the same class of defect SiteEventLogging-008 fixed for write
*errors* — but the disposed-drop path was left reporting false success. The window
is the disposal/shutdown interval, during which shutdown-related events (graceful
singleton handover, instance disable) are exactly the events most likely to be
enqueued and lost.
**Recommendation**
For both paths, fault the `Task` (or complete it with a sentinel failure) instead of
`TrySetResult()` — e.g. `pending.Completion.TrySetException(new ObjectDisposedException(...))`
— so an `await`-ing caller can distinguish a dropped event from a persisted one.
Inspect the `written` flag returned by `WithConnection` in `ProcessWriteQueueAsync`
and only call `TrySetResult()` when `written` is `true`. Update the XML doc if
deliberate "drop silently on shutdown" semantics are chosen instead.
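A minimal sketch of the faulting behaviour recommended above, under stated assumptions: `PendingWrite`, `DropAwareQueue` and `CompleteAfterWrite` are hypothetical stand-ins for the logger's internals, not the module's actual types. Only `TrySetResult()`, `TrySetException(...)` and the channel `TryWrite` call are taken from the description.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical stand-in for the logger's queued write item.
public sealed class PendingWrite
{
    public TaskCompletionSource Completion { get; } =
        new(TaskCreationOptions.RunContinuationsAsynchronously);
}

// Hypothetical stand-in for the enqueue side of the logger.
public sealed class DropAwareQueue
{
    private readonly Channel<PendingWrite> _writeQueue =
        Channel.CreateUnbounded<PendingWrite>();

    // Simulates disposal: the channel stops accepting writes.
    public void Complete() => _writeQueue.Writer.TryComplete();

    public Task EnqueueAsync()
    {
        var pending = new PendingWrite();
        if (!_writeQueue.Writer.TryWrite(pending))
        {
            // Path 1: the channel is completed, so the event is dropped.
            // Fault the Task instead of reporting success.
            pending.Completion.TrySetException(new ObjectDisposedException(
                nameof(DropAwareQueue), "Event dropped: logger disposed."));
        }
        return pending.Completion.Task;
    }

    // Path 2: the drain loop only reports success when the write happened.
    public static void CompleteAfterWrite(PendingWrite pending, bool written)
    {
        if (written)
            pending.Completion.TrySetResult();
        else
            pending.Completion.TrySetException(new ObjectDisposedException(
                nameof(DropAwareQueue), "Event dropped: logger disposed mid-drain."));
    }
}
```

With this shape an `await`-ing caller sees an `ObjectDisposedException` for a dropped event and can decide whether to ignore it during shutdown.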
**Resolution**
_Unresolved._
### SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83` |
**Description**
The keyword-search filter builds the `LIKE` pattern as `$"%{request.KeywordFilter}%"`
and binds it as a parameter. Parameterisation correctly prevents SQL injection, but
it does **not** neutralise the `LIKE` metacharacters `%` and `_` inside the
user-supplied keyword. A search for a literal `_` (common in event sources and
identifiers such as `store_and_forward`, `PLC_1`, or instance IDs) is interpreted as
"match any single character", and a `%` matches any run of characters. The design
calls keyword search "free-text search on message and source fields ... Useful for
finding events by script name, alarm name, or error message" — users will reasonably
expect a literal substring match, so a query for `store_and_forward` silently returns
events containing `storeXandYforward` and similar false positives. There is no way
for the caller to search for a literal underscore or percent.
**Recommendation**
Escape `%`, `_`, and the escape character itself in `request.KeywordFilter` before
wrapping it in `%...%`, and append an `ESCAPE` clause to the `LIKE` expression
(e.g. `... LIKE $keyword ESCAPE '\'`). Alternatively document that the keyword field
accepts `LIKE` wildcard syntax, but a literal-substring match is the behaviour the
design implies.
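One way the escaping could look, as a sketch only: `LikePattern`, `EscapeLiteral` and `Contains` are hypothetical helper names, and the backslash escape character is an assumption that must match the `ESCAPE` clause in the SQL text.

```csharp
using System.Text;

// Hypothetical helper; not part of the module.
public static class LikePattern
{
    // Must match the character named in the SQL ESCAPE clause.
    public const char EscapeChar = '\\';

    // Prefixes %, _ and the escape character itself so LIKE treats them literally.
    public static string EscapeLiteral(string input)
    {
        var sb = new StringBuilder(input.Length + 8);
        foreach (var c in input)
        {
            if (c == EscapeChar || c == '%' || c == '_')
                sb.Append(EscapeChar);
            sb.Append(c);
        }
        return sb.ToString();
    }

    // Builds a literal-substring pattern for use as the bound parameter value.
    public static string Contains(string keyword) =>
        "%" + EscapeLiteral(keyword) + "%";
}
```

The bound value would then be `LikePattern.Contains(request.KeywordFilter)`, with the SQL text carrying `LIKE $keyword ESCAPE '\'` on each filtered column.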
**Resolution**
_Unresolved._
### SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48` |
**Description**
`EventLogPurgeService.ExecuteAsync` calls `RunPurge()` (a fully synchronous method
that runs `PurgeByRetention` and `PurgeByStorageCap`) *before* the first `await`
(`await timer.WaitForNextTickAsync(...)`). A `BackgroundService`'s `ExecuteAsync` is
invoked from `StartAsync`, and the host's startup pipeline does not proceed past a
`BackgroundService` until its `ExecuteAsync` yields at the first real `await`. Because
`RunPurge()` precedes any `await`, the entire initial purge — including a cap-purge
that deletes rows in 1000-row batches and runs `PRAGMA incremental_vacuum` until a
near-1 GB database is back under the cap — executes inline on the startup thread,
blocking host startup (and therefore the `/health/ready` gate) for as long as the
purge takes. On a site that has accumulated a large log this can be a multi-second
stall during every node start/failover. The class doc states the service "runs on a
background thread and does not block event recording" — the startup-thread block is
inconsistent with that intent.
**Recommendation**
Yield before the initial purge so it runs on the background scheduler rather than the
startup thread — e.g. `await Task.Yield();` as the first statement of `ExecuteAsync`,
or move the initial `RunPurge()` to after the first `await timer.WaitForNextTickAsync`
(accepting a one-interval delay), or offload it with `await Task.Run(RunPurge, stoppingToken)`.
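The first option can be sketched as follows. `PurgeLoopSketch` is a hypothetical stand-in that deliberately does not derive from `BackgroundService`, so the example has no package dependency; in the real service the same body would sit in the `ExecuteAsync` override. The injected `runPurge` delegate and the interval parameter are assumptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the purge service's ExecuteAsync loop.
public sealed class PurgeLoopSketch
{
    private readonly Action _runPurge;
    private readonly TimeSpan _interval;

    public PurgeLoopSketch(Action runPurge, TimeSpan interval)
    {
        _runPurge = runPurge;
        _interval = interval;
    }

    public async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Return control to the caller (the host's StartAsync) immediately;
        // everything below continues on a thread-pool thread.
        await Task.Yield();

        _runPurge(); // initial purge, now off the startup thread

        using var timer = new PeriodicTimer(_interval);
        try
        {
            while (await timer.WaitForNextTickAsync(stoppingToken))
                _runPurge();
        }
        catch (OperationCanceledException)
        {
            // Normal shutdown path.
        }
    }
}
```

The sketch relies on there being no `SynchronizationContext` on the calling thread, which is the usual case for a generic-host application; with one present, `Task.Run` is the safer way to offload.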
**Resolution**
_Unresolved._


@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.SiteRuntime` |
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 3 |
## Summary
@@ -28,6 +28,24 @@ in a comment but ships it anyway). Test coverage exists for the coordinator acto
persistence and scripting, but the short-lived execution actors, the replication
actor, and the repositories are untested.
#### Re-review 2026-05-17 (commit `39d737e`)
The module was re-reviewed at commit `39d737e`. No source under
`src/ScadaLink.SiteRuntime` has changed since the previous review at `9c60592`
(the only intervening commits are code-review documentation updates), so all of
SiteRuntime-001..013, 015, 016 remain Resolved and SiteRuntime-014 remains
Deferred — its Deferred justification (a trigger-evaluation concurrency design
decision is required before either recommended fix can land in-module) still
holds verbatim against the unchanged `ScriptActor`/`AlarmActor` source. The
re-review nonetheless worked through all 10 checklist categories afresh and
surfaced three new findings that the prior pass did not record: a cross-thread
`Dictionary` enumeration race when the Instance Actor's live `_attributes`
dictionary is handed by reference into child `ScriptActor`/`AlarmActor`
constructors (SiteRuntime-017, Medium); a stale `ScriptExecutionActor` XML doc
that still claims a "dedicated blocking I/O dispatcher" (SiteRuntime-018, Low);
and two dead lifecycle handlers in `InstanceActor` that the Deployment Manager
never routes to (SiteRuntime-019, Low). Open findings: 3.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -733,3 +751,126 @@ harness is a larger test-infrastructure task tracked separately and out of scope
Low-severity coverage finding; the highest-value untested paths the finding called out
(script timeout/failure/reply/self-stop) are now covered. Full module suite: 192 tests
green.
### SiteRuntime-017 — Instance Actor's live `_attributes` dictionary is shared by reference into child actor constructors
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:625`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:675`, `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:83`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:93` |
**Description**
`InstanceActor.CreateChildActors` passes the Instance Actor's own mutable
`_attributes` field (a plain `Dictionary<string, object?>`) by reference into the
`Props.Create(...)` factory for every `ScriptActor` and `AlarmActor` (as the
`initialAttributes` constructor argument). Each child constructor then iterates
that dictionary to seed its `_attributeSnapshot`:
```csharp
if (initialAttributes != null)
foreach (var kvp in initialAttributes)
_attributeSnapshot[kvp.Key] = kvp.Value;
```
`Context.ActorOf` returns immediately; the child actor's constructor runs later on
the *child's* mailbox thread. Meanwhile the Instance Actor's `PreStart` returns and
the Instance Actor begins processing its mailbox — `HandleTagValueUpdate` and
`HandleAttributeValueChanged` both mutate `_attributes` (`_attributes[...] = ...`).
A DCL tag update that arrives before a child has finished its constructor copy
therefore mutates the dictionary on the Instance Actor thread while the child
thread is enumerating it. `Dictionary<,>` is explicitly not safe for concurrent
read/write: the enumeration can throw `InvalidOperationException` ("collection was
modified") — which surfaces as an `ActorInitializationException` and, under the
Instance Actor's `SupervisorStrategy`, **stops** the child (the strategy returns
`Stop` for `ActorInitializationException`). The script or alarm is then silently
absent for the life of the instance. A torn read of an entry is also possible. The
window is small but realistically reachable on a busy site at startup/failover,
which is exactly the staggered-startup scenario the design is most concerned about.
**Recommendation**
Do not share the live dictionary. Snapshot it on the Instance Actor thread before
constructing the child — e.g. pass `new Dictionary<string, object?>(_attributes)`
(or an immutable copy) into each `Props.Create`. The copy is made on the Instance
Actor thread inside `CreateChildActors`, so it is race-free, and each child gets a
private dictionary to seed from.
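A sketch of the snapshot step, independent of Akka.NET: `AttributeSnapshot.Take` is a hypothetical helper name, and the only technique shown is the `Dictionary` copy constructor named in the recommendation.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical helper; the real change would inline the copy in CreateChildActors.
public static class AttributeSnapshot
{
    // Copies the live dictionary on the calling (Instance Actor) thread.
    // The result is private to the receiver, so later writes to `live`
    // cannot disturb a child that is still enumerating its seed data.
    public static IReadOnlyDictionary<string, object?> Take(
        Dictionary<string, object?> live)
    {
        return new Dictionary<string, object?>(live);
    }
}
```

The copy is shallow: keys and value references are duplicated, not the value objects themselves, so this is sufficient only if attribute values are immutable or treated as such.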
**Resolution**
_Unresolved._
### SiteRuntime-018 — `ScriptExecutionActor` XML doc still claims a "dedicated blocking I/O dispatcher"
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs:17` |
**Description**
The class-level XML summary on `ScriptExecutionActor` states "Runs on a dedicated
blocking I/O dispatcher." That is not what the code does. SiteRuntime-009 was
resolved by introducing `ScriptExecutionScheduler` (a bounded dedicated
`TaskScheduler`); the *actor itself and its mailbox* run on the **default** Akka
dispatcher, and only the script body runs on the scheduler's threads via
`Task.Factory.StartNew(..., scheduler)`. The resolution of SiteRuntime-009
explicitly chose the `TaskScheduler` route *instead of* a HOCON dispatcher and
even removed the "in production, configure a dedicated dispatcher" comments
elsewhere — but this stale summary line was missed. A reader is told the actor is
on a dedicated dispatcher when it is not, which is misleading when reasoning about
mailbox throughput and thread-pool pressure. (`AlarmExecutionActor` does not carry
the equivalent claim — its summary only says "Same pattern as ScriptExecutionActor.")
**Recommendation**
Correct the summary to describe the actual model: the actor runs on the default
dispatcher and the script body is dispatched onto the dedicated
`ScriptExecutionScheduler` (SiteRuntime-009). Align the wording with the accurate
comment already present at `ScriptExecutionActor.cs:71-73`.
**Resolution**
_Unresolved._
### SiteRuntime-019 — Dead `DisableInstanceCommand` / `EnableInstanceCommand` handlers in `InstanceActor`
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:106`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:113` |
**Description**
`InstanceActor`'s constructor registers `Receive<DisableInstanceCommand>` and
`Receive<EnableInstanceCommand>` handlers that log and reply with a successful
`InstanceLifecycleResponse`. These handlers are unreachable. The Deployment Manager
is the only sender of those commands, and `DeploymentManagerActor.HandleDisable` /
`HandleEnable` handle the lifecycle entirely themselves — they call
`Context.Stop(actor)` (disable) or `CreateInstanceActor(...)` (enable) directly and
reply to the original sender from the Deployment Manager. Neither command is ever
`Forward`-ed or `Tell`-ed to the Instance Actor. The handlers are dead code, and
they are actively misleading: a maintainer reading `InstanceActor` would reasonably
believe disable/enable is partly an Instance-Actor responsibility, and the no-op
"true" reply implies an instance-side acknowledgement contract that does not exist.
If a future change *did* route these commands here, the disable handler would do
nothing useful (it does not stop children or tear down state — Akka does that when
the parent stops the actor).
**Recommendation**
Remove the two `Receive<...>` registrations and their handler bodies from
`InstanceActor`, since the Deployment Manager owns the disable/enable lifecycle.
If the intent is to keep them for a future instance-side hook, add an XML comment
stating that the Deployment Manager currently handles these and the handlers are a
reserved placeholder — but removal is preferred.
**Resolution**
_Unresolved._


@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.StoreAndForward` |
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |
| Commit reviewed | `39d737e` |
| Open findings | 3 (3 Deferred: 002, 011, 012 — see notes) |
## Summary
@@ -30,20 +30,45 @@ status set, and untested critical paths (retry-due timing, replication-from-acti
the actor bridge). None of the findings are blockers for compilation, but the
replication and retry-count issues are functional defects against the design.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed at commit `39d737e` after the batch-3 fixes. All of findings 001 and
003–010, plus 013–014, are confirmed `Resolved` against the current source: the
replication wiring (`BufferAsync`/`ReplicateRemove`/`ReplicatePark`), the corrected
retry-count semantics, the conditional `UpdateMessageIfStatusAsync` writes, the
transactioned parked-message reads, the `PipeTo` refactor, the `RaiseActivity`
hardening, the `RetryParkedMessageAsync` `last_attempt_at` reset and the database
directory creation are all present as described. Findings 002, 011 and 012 were
re-verified and remain validly `Deferred` — their preconditions are unchanged (002's
residual no-handler gap, 011's Commons-owned enum, 012's Commons-owned entity placement).
This pass surfaced **three new findings**. StoreAndForward-015 records the
StoreAndForward side of the cross-module `MaxRetries == 0` ambiguity flagged by
ExternalSystemGateway-015: `EnqueueAsync`'s public contract documents `maxRetries` only
as "parked once `MaxRetries` is reached" and never states the `0 = no limit / retry
forever` special case that `RetryMessageAsync` actually enforces, so an ESG caller
passing `0` to mean "never retry" gets the opposite behaviour with no warning from the
S&F API surface. StoreAndForward-016 records that operator-initiated parked-message
retry and discard are not replicated to the standby — only the add/remove/park sweep
paths are — so a failover diverges the standby buffer from the active one.
StoreAndForward-017 records that the Retry/Discard activity-log entries hard-code the
`ExternalSystem` category, mislabelling notification and cached-DB-write messages in
the site event log.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010). |
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010); Retry/Discard activity log hard-codes the category (017). |
| 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. |
| 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). |
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002). |
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002); operator retry/discard not replicated to standby (016). |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. |
| 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). |
| 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. |
| 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). |
| 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). |
| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004). |
| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004); `EnqueueAsync` never documents the `maxRetries == 0` = "retry forever" special case (015). |
## Findings
@@ -703,3 +728,158 @@ and bare filenames are skipped). Regression test
`InitializeAsync_FileInMissingDirectory_CreatesDirectory` fails against the pre-fix code;
all six `SiteActorPathTests` now pass. Fixed by the commit whose message references
`StoreAndForward-014`.
### StoreAndForward-015 — `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever"
| | |
|--|--|
| Severity | Medium |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:114`–`:130`, `:285` |
**Description**
The re-review brief asks for the StoreAndForward side of the cross-module ambiguity
recorded as `ExternalSystemGateway-015`. The semantics are split across this module and
its callers, and the StoreAndForward side carries a genuine documentation/API-contract
fault:
- `RetryMessageAsync` parks a message only when `message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
(`StoreAndForwardService.cs:285`). When `MaxRetries == 0` the guard is false on every
sweep, so a `0` value means **"no limit — retry forever"**. The
`StoreAndForwardMessage.MaxRetries` XML doc (`StoreAndForwardMessage.cs:31`) does state
`"0 = no limit"`, so the persistence model is internally consistent.
- But `EnqueueAsync` — the *only* public entry point into the engine — exposes a
`maxRetries` parameter (`StoreAndForwardService.cs:128`) with **no parameter
documentation at all**, and its method summary (lines 114–122) describes the lifecycle
only as "On max retries → park" / "parked once `MaxRetries` is reached" (see also the
`_deliveryHandlers` field doc, lines 50–51). Nothing on the public surface tells a
caller that passing `0` flips the meaning from "park immediately / never retry" to
"retry forever". A caller reading only `EnqueueAsync` would reasonably assume `0`
retries means zero retries.
- This is exactly the trap ESG fell into: `ExternalSystemClient.CachedCallAsync` /
`DatabaseGateway.CachedWriteAsync` pass the source entity's `MaxRetries` verbatim,
intending `0` to mean "never retry", and instead get unbounded retry — the
duplicate-delivery / unbounded-buffer-growth hazard the design doc's idempotency note
warns against. The fault is not solely ESG's: the S&F public API silently overloads
`0` with the opposite of its natural reading and does not document it.
The defect is in this module's API contract and documentation, so it is recorded here
in addition to `ExternalSystemGateway-015`. (Whether `0` *should* mean "no limit" or
"no retry" is the cross-module design decision tracked by ESG-015; this finding is
specifically that the StoreAndForward public surface fails to document whichever
meaning is chosen.)
**Recommendation**
Document the `maxRetries` parameter on `EnqueueAsync` explicitly with a `<param>` tag
that states the `0` special case in the same words as `StoreAndForwardMessage.MaxRetries`
(`"0 = no limit — retried on every sweep until delivered, never parked"`), and add the
`0` case to the method summary's lifecycle description. Better still — and consistent
with the resolution of ESG-015 — make the engine reject the ambiguity at the API: accept
a nullable/`enum` retry policy, or treat `0` as an explicit "no retry" (do not buffer, or
park on the first sweep) so the natural reading and the behaviour agree. Either way the
public `EnqueueAsync` contract must state the chosen meaning; today it states nothing.
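The two readings of `0` can be set side by side. In this sketch `ShouldParkCurrent` reproduces the guard quoted in the description, while `ShouldParkExplicit` is a hypothetical alternative (nullable limit, `null` meaning "no limit") and not a decision the design has made; `RetryGuard` itself is an illustrative name.

```csharp
// Hypothetical illustration of the parking guard; not the module's code.
public static class RetryGuard
{
    // Behaviour described in the finding: MaxRetries == 0 disables the limit,
    // so the message is retried on every sweep and never parked.
    public static bool ShouldParkCurrent(int maxRetries, int retryCount) =>
        maxRetries > 0 && retryCount >= maxRetries;

    // Assumed alternative: null means "no limit", 0 means "never retry".
    // The natural reading of 0 and the behaviour then agree.
    public static bool ShouldParkExplicit(int? maxRetries, int retryCount) =>
        maxRetries.HasValue && retryCount >= maxRetries.Value;
}
```

Under the current guard a caller passing `0` is never parked; under the assumed alternative the same call parks on the first sweep, and unbounded retry has to be requested explicitly with `null`.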
**Resolution**
_Unresolved._
### StoreAndForward-016 — Operator-initiated parked-message retry and discard are not replicated to the standby
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:339`–`:362`; `src/ScadaLink.StoreAndForward/ReplicationService.cs:131`–`:136` |
**Description**
`StoreAndForward-001`'s fix wired replication into the active *delivery* paths:
`BufferAsync` replicates an `Add`, a successful retry replicates a `Remove`, and a park
replicates a `Park`. But the two *operator* paths — `RetryParkedMessageAsync` (line 339)
and `DiscardParkedMessageAsync` (line 353) — change buffer state and never touch
`_replication`:
- `RetryParkedMessageAsync` flips a row from `Parked` back to `Pending` (and clears
`retry_count` / `last_attempt_at`) in the local SQLite only. The standby's copy stays
`Parked`.
- `DiscardParkedMessageAsync` `DELETE`s the row from the local SQLite only. The standby's
copy is left in place, still `Parked`.
The Component design doc ("Persistence") requires the active node to forward "each
buffer operation (add, remove, park)" so that on failover "the new active node has a
near-complete copy of the buffer." An operator retrying a parked message is a buffer
state change; an operator discarding one is a removal. After a failover that follows an
operator action:
1. A **discarded** message reappears on the new active node — it is still `Parked`
there, so it resurfaces in the central UI's parked-message list and an operator must
discard it a second time. For a message deliberately removed (e.g. a known-bad
payload) the failover silently undoes the operator's decision.
2. A **retried** message is still `Parked` on the new active node, so the operator's
"move it back to the queue" action is silently lost across the failover and the
message is not re-attempted.
`ReplicationOperationType` only models `Add`/`Remove`/`Park` — there is no operation for
"un-park / move back to pending", so even a minimal fix needs either a new operation
type or a re-use of `Add` to overwrite the standby row. This is the same class of defect
as the now-resolved `StoreAndForward-001`, for the operator paths rather than the sweep
paths.
**Recommendation**
Replicate both operator actions. `DiscardParkedMessageAsync` should call
`_replication?.ReplicateRemove(messageId)` after a successful local delete (the existing
`Remove` op already deletes on the standby). For `RetryParkedMessageAsync`, add a
`Requeue`/`Unpark` `ReplicationOperationType` whose `ApplyReplicatedOperationAsync` case
resets the standby row to `Pending` with `retry_count = 0`, or have the method re-load
the updated message and replicate it as an `Add`-style upsert. Add replication tests for
both operator paths (the existing `StoreAndForwardReplicationTests` only cover the sweep
paths).
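
The recommended behaviour can be sketched with a small in-memory model. Everything here is a hypothetical stand-in (`BufferNode`, `ActiveBuffer`, `ReplicationOp`, and the `Requeue` operation are not the real ScadaLink types); the real service works against SQLite and an asynchronous replication channel, so this only shows which operation each operator path would forward.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; not the real ScadaLink types.
public enum MessageStatus { Pending, Parked }

// Requeue is the proposed new operation; Remove and Park mirror existing ones.
public enum ReplicationOp { Remove, Park, Requeue }

public sealed class BufferedMessage
{
    public MessageStatus Status { get; set; }
    public int RetryCount { get; set; }
}

public sealed class BufferNode
{
    public Dictionary<string, BufferedMessage> Rows { get; } = new Dictionary<string, BufferedMessage>();

    // Standby side: apply one replicated operation to the local copy.
    public void Apply(ReplicationOp op, string id)
    {
        if (op == ReplicationOp.Remove) { Rows.Remove(id); return; }
        if (!Rows.TryGetValue(id, out var row)) return; // nothing to update
        if (op == ReplicationOp.Park)
        {
            row.Status = MessageStatus.Parked;
        }
        else // Requeue: back to Pending with a fresh retry count
        {
            row.Status = MessageStatus.Pending;
            row.RetryCount = 0;
        }
    }
}

public sealed class ActiveBuffer
{
    private readonly BufferNode _local;
    private readonly BufferNode _standby;

    public ActiveBuffer(BufferNode local, BufferNode standby)
    {
        _local = local;
        _standby = standby;
    }

    // Operator retry: change the local row, then forward a Requeue.
    public bool RetryParked(string id)
    {
        if (!_local.Rows.TryGetValue(id, out var row) || row.Status != MessageStatus.Parked)
            return false;
        row.Status = MessageStatus.Pending;
        row.RetryCount = 0;
        _standby.Apply(ReplicationOp.Requeue, id);
        return true;
    }

    // Operator discard: delete the local row, then forward a Remove.
    public bool DiscardParked(string id)
    {
        if (!_local.Rows.TryGetValue(id, out var row) || row.Status != MessageStatus.Parked)
            return false;
        _local.Rows.Remove(id);
        _standby.Apply(ReplicationOp.Remove, id);
        return true;
    }
}
```

In this model the standby is only told about an action after the local change succeeds, which matches the ordering the recommendation asks for on the discard path.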
**Resolution**
_Unresolved._
### StoreAndForward-017 — Retry/Discard activity-log entries hard-code the `ExternalSystem` category
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:344`, `:358` |
**Description**
`RetryParkedMessageAsync` and `DiscardParkedMessageAsync` raise an S&F activity
notification (consumed by Site Event Logging — WP-14) but pass a hard-coded
`StoreAndForwardCategory.ExternalSystem` as the category argument:
```csharp
RaiseActivity("Retry", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} moved back to queue");
RaiseActivity("Discard", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} discarded");
```
Both methods take only a `messageId` and never load the message, so they have no access
to its real category. When an operator retries or discards a parked **Notification** or
**CachedDbWrite** message, the site event log records the activity under
`ExternalSystem`. Every other `RaiseActivity` call in the service passes the message's
true `Category` (`EnqueueAsync`, `RetryMessageAsync`), so the operator paths are
inconsistent and produce mislabelled audit entries — misleading when an operator later
filters or reviews S&F activity by category.
**Recommendation**
Load the message (or have `StoreAndForwardStorage.RetryParkedMessageAsync` /
`DiscardParkedMessageAsync` return the affected row's category) and pass the real
`Category` to `RaiseActivity`. If loading the row is considered too costly on these
infrequent operator paths, change the `OnActivity` event / `RaiseActivity` signature to
allow a nullable category for management actions rather than asserting a false one.
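
A minimal sketch of the first option, in which the storage call reports the category of the row it touched. `ParkedStore`, `OperatorActions`, and the tuple-based activity list are hypothetical simplifications; only the `StoreAndForwardCategory` member names and the activity text come from this finding.

```csharp
using System;
using System.Collections.Generic;

public enum StoreAndForwardCategory { ExternalSystem, Notification, CachedDbWrite }

// Hypothetical in-memory stand-in for the parked-message storage.
public sealed class ParkedStore
{
    private readonly Dictionary<string, StoreAndForwardCategory> _parked =
        new Dictionary<string, StoreAndForwardCategory>();

    public void Park(string id, StoreAndForwardCategory category) => _parked[id] = category;

    // Returns the discarded row's category, or null when no parked row matched.
    public StoreAndForwardCategory? Discard(string id)
    {
        if (!_parked.TryGetValue(id, out var category)) return null;
        _parked.Remove(id);
        return category;
    }
}

// Hypothetical stand-in for the operator path of the service.
public sealed class OperatorActions
{
    private readonly ParkedStore _store;

    // Records what would be passed to RaiseActivity.
    public List<(string Action, StoreAndForwardCategory Category, string Detail)> Activity { get; } =
        new List<(string Action, StoreAndForwardCategory Category, string Detail)>();

    public OperatorActions(ParkedStore store) => _store = store;

    public bool DiscardParked(string messageId)
    {
        var category = _store.Discard(messageId);
        if (category is null) return false; // nothing discarded, nothing logged

        // Log under the message's real category, not a hard-coded one.
        Activity.Add(("Discard", category.Value, $"Parked message {messageId} discarded"));
        return true;
    }
}
```

Returning the category from the storage call avoids a second round trip to load the row, at the cost of widening the storage method's return type.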
**Resolution**
_Unresolved._


@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.TemplateEngine` |
| Design doc | `docs/requirements/Component-TemplateEngine.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 2 |
## Summary
@@ -29,11 +29,30 @@ create, optimistic concurrency on instance state) are claimed but not implemente
Themes: validation that is weaker than the design promises, and asymmetric handling
of attributes vs. alarms vs. scripts throughout the resolve/flatten/derive paths.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed the whole module against all ten checklist categories at commit
`39d737e`. All fourteen prior findings remain closed — the batch-4 fixes
(`bc88a36`/`804697f` and predecessors) hold up: the recursive composition walk,
the per-slot alarm override mechanism, the code-region-aware delimiter scanner,
and the single-source deletion-constraint logic are all correctly in place. Two
new Medium findings surfaced, both in the composition-cascade path and both
affecting **nested** (depth ≥ 2) compositions specifically — the same blind spot
that produced TemplateEngine-001. **TemplateEngine-015**: `RenameCompositionAsync`
renames only the directly slot-owned derived template, leaving cascaded inner
derived templates with a stale dotted-path name. **TemplateEngine-016**:
`FlatteningService` hard-codes `ScriptScope.ParentPath` to the empty string for
every composed script regardless of nesting depth, so a script two or more
levels deep cannot resolve `Parent.X` references to its real parent module.
Both are limited-impact (nested compositions are the less common case and there
is design-time visibility) but represent genuine drift from the recursive-nesting
design promise.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Multiple real bugs: deep composed-member loss, derived alarms omitted, granularity bypass, no-op create-time collision block. |
| 1 | Correctness & logic bugs | ✓ | Prior bugs (001–005, 013) all resolved and verified. Re-review 2026-05-17 found two new nested-composition defects: rename does not cascade (TemplateEngine-015), composed-script `ParentPath` always empty (TemplateEngine-016). |
| 2 | Akka.NET conventions | ✓ | No actors in this module (`AddTemplateEngineActors` is an empty placeholder). Nothing to assess. |
| 3 | Concurrency & thread safety | ✓ | Services are stateless, scoped per request; static helpers hold no mutable state. Design says template editing is last-write-wins; that is honoured. See TemplateEngine-010 re: a doc claim of optimistic concurrency that is not implemented. |
| 4 | Error handling & resilience | ✓ | `Result<T>` used consistently; repository nulls guarded. `FlatteningService` wraps in try/catch. No store-and-forward or failover surface in this module. |
@@ -648,3 +667,102 @@ reports all blocking reasons and uses `TemplateDeletionService`'s phrasing — t
affected `TemplateServiceTests` delete tests were updated to the unified messages,
and a regression test `DeleteTemplate_MultipleConstraints_ReportsAllNotJustFirst`
verifies all three constraint categories are surfaced together.
### TemplateEngine-015 — `RenameCompositionAsync` does not cascade-rename nested derived templates
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:680` |
**Description**
`AddCompositionAsync` builds a cascade of derived templates whose names follow a
dotted path: composing `$Sensor` (which itself composes `$Probe` as `Probe1`)
into `$Pump` as `TempSensor` produces `$Pump.TempSensor` **and** the nested
`$Pump.TempSensor.Probe1` (see `CreateCascadedCompositionAsync` and the
`AddComposition_CascadesChildCompositions` test). `RenameCompositionAsync`,
however, renames only the **directly** slot-owned derived template:
```csharp
var derived = await _repository.GetTemplateByIdAsync(composition.ComposedTemplateId, ...);
if (derived != null && derived.IsDerived && derived.OwnerCompositionId == compositionId)
{
var newDerivedName = $"{owner.Name}.{newInstanceName}";
...
derived.Name = newDerivedName;
await _repository.UpdateTemplateAsync(derived, ...);
}
```
There is no recursion into `derived.Compositions`. After renaming the `TempSensor`
slot to `MainSensor`, the parent derived template becomes `$Pump.MainSensor` but the
cascaded child stays `$Pump.TempSensor.Probe1` — its name no longer reflects the
slot path it lives under, breaking the dotted-path naming invariant the cascade
otherwise maintains. `DeleteCompositionAsync` correctly recurses
(`CascadeDeleteDerivedAsync`), so rename is the asymmetric outlier. The
`RenameComposition_RenamesSlotAndDerivedTemplate` test only exercises a
single-level derived template, so the gap is untested. The stale name also
undermines the cascade-name collision pre-check (the one covered by
`AddComposition_DerivedNameCollision_Fails`) on any subsequent compose that
walks the now-inconsistent name tree.
**Recommendation**
Recurse over `derived.Compositions` (mirroring `CascadeDeleteDerivedAsync`),
re-deriving each cascaded child's name from the renamed parent
(`$"{parentDerivedName}.{childComposition.InstanceName}"`), and run the
existing same-name collision pre-check across every name the cascade will
produce — not just the top-level one. Add a regression test covering a
two-level cascade rename.
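
The recursion and the widened collision pre-check can be sketched as follows. This is a simplified in-memory model, not the repository-backed implementation: `DerivedTemplate`, `CascadeRename`, and the `takenNames` set (names of templates outside the subtree being renamed) are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical simplified model of a derived template and its slots.
public sealed class DerivedTemplate
{
    public string Name { get; set; } = "";

    // Slot (instance) name -> derived template owned by that slot.
    public Dictionary<string, DerivedTemplate> Compositions { get; } =
        new Dictionary<string, DerivedTemplate>();
}

public static class CascadeRename
{
    // Every name the cascade would produce, top-level name first.
    public static List<string> PlanNames(DerivedTemplate derived, string newName)
    {
        var names = new List<string> { newName };
        foreach (var slot in derived.Compositions)
            names.AddRange(PlanNames(slot.Value, $"{newName}.{slot.Key}"));
        return names;
    }

    // Check the whole plan for collisions before changing anything.
    public static bool TryRename(DerivedTemplate derived, string newName, ISet<string> takenNames)
    {
        foreach (var name in PlanNames(derived, newName))
            if (takenNames.Contains(name))
                return false; // leave the tree untouched on any collision

        Apply(derived, newName);
        return true;
    }

    // Re-derive each child's name from its renamed parent.
    private static void Apply(DerivedTemplate derived, string newName)
    {
        derived.Name = newName;
        foreach (var slot in derived.Compositions)
            Apply(slot.Value, $"{newName}.{slot.Key}");
    }
}
```

Planning all names first keeps the rename all-or-nothing in this model; the real method would also need to persist each renamed template through the repository.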
**Resolution**
_Unresolved._
### TemplateEngine-016 — Composed-script `ScriptScope.ParentPath` is always empty, breaking `Parent.X` resolution for nested modules
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:750` |
**Description**
`ResolveComposedScriptsRecursive` assigns each composed script a `ScriptScope`:
```csharp
Scope = new Commons.Types.Scripts.ScriptScope(SelfPath: prefix, ParentPath: "")
```
`prefix` is the accumulated path-qualified module path (`Outer` at depth 1,
`Outer.Inner` at depth 2, etc.), so `SelfPath` is correct. `ParentPath`, however,
is hard-coded to the empty string at every depth. Per `ScriptScope`'s own XML
doc, `ParentPath` is "computed at flattening time and seeded into the script's
globals … so `Attributes["X"]` / `Parent.X` can prepend the right path-prefix."
For a script directly composed at depth 1 the parent is the root and `""` is
correct, but for a script in a nested module (`Outer.Inner.Foo`) the parent
module is `Outer` — yet `ParentPath` is still `""`. A nested composed script
that references `Parent.X` will therefore resolve the reference against the root
flat namespace instead of its actual parent module, reading the wrong attribute
(or failing to find one). This is the same depth-≥2 nesting blind spot as
TemplateEngine-001; the recursive walk was added there but the `Scope`
construction was not updated to carry the parent path. `ResolveComposedScripts`
for direct (root-template) scripts leaves `Scope` at the default `ScriptScope.Root`,
which is correct.
**Recommendation**
Thread the parent module path through `ResolveComposedScriptsRecursive` (the
caller already knows it — it is the `prefix` of the enclosing recursion frame,
or `""` for a depth-1 composition) and set
`ParentPath` to that value, so `SelfPath = "Outer.Inner"` pairs with
`ParentPath = "Outer"`. Add a flattening test asserting the `Scope` of a
two-level composed script.
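
A sketch of threading the parent path through the recursion. The `Module` type, the `ScopeWalk` class, and the dictionary keyed by qualified script name are hypothetical; the `ScriptScope` record here only mirrors the two constructor parameters quoted above and is not the real `Commons.Types.Scripts.ScriptScope`.

```csharp
using System;
using System.Collections.Generic;

// Mirrors the two parameters quoted in this finding; otherwise hypothetical.
public sealed record ScriptScope(string SelfPath, string ParentPath);

// Hypothetical simplified module tree: scripts plus named child modules.
public sealed class Module
{
    public List<string> Scripts { get; } = new List<string>();
    public Dictionary<string, Module> Children { get; } = new Dictionary<string, Module>();
}

public static class ScopeWalk
{
    // Returns qualified script name -> scope for every composed script.
    public static Dictionary<string, ScriptScope> Resolve(Module root)
    {
        var result = new Dictionary<string, ScriptScope>();
        foreach (var child in root.Children)
            Walk(child.Value, child.Key, "", result); // depth 1: parent is the root
        return result;
    }

    private static void Walk(
        Module module, string prefix, string parentPath,
        Dictionary<string, ScriptScope> result)
    {
        foreach (var script in module.Scripts)
            result[$"{prefix}.{script}"] = new ScriptScope(prefix, parentPath);

        // The enclosing frame's prefix becomes the child's parent path.
        foreach (var child in module.Children)
            Walk(child.Value, $"{prefix}.{child.Key}", prefix, result);
    }
}
```

The only change relative to the hard-coded version is the extra `parentPath` argument: each frame passes its own `prefix` down as the next level's parent.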
**Resolution**
_Unresolved._