code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.

regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
Joseph Doherty
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
+500
@@ -0,0 +1,500 @@
# Code Review — AuditLog
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.AuditLog` |
| Design doc | `docs/requirements/Component-AuditLog.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 11 |
## Summary
AuditLog is one of the largest and most carefully engineered modules in the codebase.
The site-side hot-path (`SqliteAuditWriter` + `FallbackAuditWriter` + `RingBufferFallback`)
implements a textbook bounded-channel + dedicated-writer pattern with batched transactions,
UTF-8-safe truncation, additive schema migration, and a drop-oldest fallback that
genuinely honours the "audit-write must NEVER abort the user-facing action" contract.
The central side mirrors that with per-row try/catch on batch ingest, a transactional
dual-write for the cached-telemetry path, per-site cursor isolation in reconciliation,
and a partition-switch purge that is metadata-only. The payload filter is well-factored
with a compile-time regex cache, per-stage failure isolation, and per-target overrides.
Test coverage is broad — ~12 000 lines spanning unit, integration, and end-to-end paths.
Themes across findings: (1) the largest issue is a **specced-but-unwired transport path**:
`ISiteStreamAuditClient.IngestCachedTelemetryAsync` and `AuditLogIngestActor.OnCachedTelemetryAsync`
both exist and the protobuf RPC is plumbed, but no production code ever calls the cached-telemetry
client; the cached-call lifecycle audit rows ride the audit-only `IngestAuditEventsAsync` drain
and the central dual-write transaction is dead code (AuditLog-001). (2) Several
**Akka.NET supervisor-strategy comments are inaccurate** — multiple actors document
"`SupervisorStrategy` uses Resume" but the code returns `DefaultDecider` (which Restarts), and
the strategy applies to children, not to the actor itself (AuditLog-002). (3) The
**`SqliteAuditWriter` hot-path lock is contended by the 30 s backlog probe** — `GetBacklogStatsAsync`
takes the same `_writeLock` that serialises every batch INSERT, so a large-backlog scan can
park the hot-path writer (AuditLog-005). (4) **Sync-over-async in `Dispose`** can deadlock under
an ASP.NET sync context (AuditLog-006). (5) A handful of **misleading code comments and minor
configuration drift** (AuditLog-007, AuditLog-008, AuditLog-009). (6) `CancellationToken`
parameters on the actor drain paths are accepted but immediately replaced with
`CancellationToken.None` (AuditLog-010). (7) The `AddAuditLogHealthMetricsBridge` and
`AddAuditLogCentralMaintenance` helpers register hosted services and option bindings
without idempotency guards, although `AddAuditLog` documents a call-exactly-once contract
(AuditLog-011). No Critical-severity issues; three Medium, eight Low.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `AuditLogIngestActor.OnCachedTelemetryAsync` is unreachable production code (AuditLog-001); reconciliation cursor advances on persistent insert failure (AuditLog-004); `Dispose` comment about `_disposed` ordering is misleading (AuditLog-009). |
| 2 | Akka.NET conventions | Yes | `SupervisorStrategy` returned by actors does not do what the surrounding doc says (AuditLog-002); per-actor strategy applies to children only, but comments imply self-protection. |
| 3 | Concurrency & thread safety | Yes | `GetBacklogStatsAsync` contends with hot-path writes on `_writeLock` (AuditLog-005); sync DI scopes block on async EF disposal (AuditLog-003); `_disposed` is set after the wait, contradicting comment (AuditLog-009); no cooperative cancellation through the drain paths (AuditLog-010). |
| 4 | Error handling & resilience | Yes | Best-effort contract is honoured throughout; `Dispose()` sync-over-async is the one remaining hazard (AuditLog-006); reconciliation silently discards permanently-failing rows (AuditLog-004). |
| 5 | Security | Yes | Append-only enforcement, redaction stack, and "never under-redact" safety net all present. Test composition roots that omit the filter SILENTLY pass payloads through unredacted (AuditLog-008). |
| 6 | Performance & resource management | Yes | Hot-path batched + back-pressured. Backlog scan holds the write lock (AuditLog-005); `MarkForwardedAsync` interpolates an `IN (...)` list inside the lock, fine in practice but scales linearly with batch size. |
| 7 | Design-document adherence | Yes | Combined telemetry transport plumbed but never called (AuditLog-001); other than that the implementation closely tracks the design doc. |
| 8 | Code organization & conventions | Yes | Composition root well-segmented; `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` for the same dependency across registrations (AuditLog-007); `AddAuditLog*` helpers register hosted services and option bindings without idempotency guards (AuditLog-011). |
| 9 | Testing coverage | Yes | Excellent surface coverage. Integration tests exist for the dual-write path in `AuditLogIngestActorCombinedTelemetryTests` and `CachedCallCombinedTelemetryTests`, but those drive the actor directly via the test harness — there is no integration test that asserts the production end-to-end emits a `CachedTelemetryBatch` from the site (because nothing does). |
| 10 | Documentation & comments | Yes | Several large XML-doc paragraphs are accurate, but the `SupervisorStrategy` comments (AuditLog-002), the `Dispose` ordering comment (AuditLog-009), and a few stale "Bundle X" references could mislead a new reader. |
## Findings
### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/Telemetry/ISiteStreamAuditClient.cs:45`, `src/ScadaLink.AuditLog/Site/Telemetry/ClusterClientSiteAuditClient.cs:86`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:198` |
**Description**
The design (Component-AuditLog.md §"Cached Operations — Combined Telemetry") specifies a
single `CachedCallTelemetry` packet per lifecycle event that carries BOTH the audit row
AND the operational `SiteCalls` upsert, with central writing both rows in one transaction.
The infrastructure exists: `ISiteStreamAuditClient.IngestCachedTelemetryAsync` is on the
interface; `ClusterClientSiteAuditClient.IngestCachedTelemetryAsync` builds the
`IngestCachedTelemetryCommand`; the proto carries `CachedTelemetryBatch`;
`AuditLogIngestActor.OnCachedTelemetryAsync` performs the dual `InsertIfNotExists` +
`UpsertAsync` inside a `BeginTransactionAsync`. But a `grep` for callers of
`IngestCachedTelemetryAsync` in `src/ScadaLink.AuditLog` shows only the interface
declaration and the two implementations — nothing produces a `CachedTelemetryBatch` for
the site to push. The `SiteAuditTelemetryActor.OnDrainAsync` only calls
`IngestAuditEventsAsync` (the audit-only path); cached-call audit rows written by
`CachedCallTelemetryForwarder` to local SQLite are drained as ordinary audit events,
and the `SiteCalls` operational half rides a separate `UpsertSiteCallCommand` channel
into `SiteCallAuditActor`. The "central writes AuditLog + SiteCalls in one transaction"
guarantee is therefore not delivered — the two writes are now uncorrelated across
actors and can fail independently, and the dual-write path in `AuditLogIngestActor`
is dead production code.
**Recommendation**
Either (a) wire the combined path: build a `CachedTelemetryBatch` from the audit rows
the forwarder writes (alongside the operational half held by `IOperationTrackingStore`),
add a parallel drain loop that calls `IngestCachedTelemetryAsync`, and gate the audit-only
drain so cached-call rows don't double-emit; or (b) update the design doc + the
`AuditLogIngestActor` / `ClusterClientSiteAuditClient` / interface XML comments to
acknowledge that the two halves now flow via separate transports, and delete the
unreachable `OnCachedTelemetryAsync` dual-write code (after confirming the
`AuditLogIngestActorCombinedTelemetryTests` integration tests exercise it via direct
actor injection only).
**Resolution**
_Unresolved._
### AuditLog-002 — `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:99-103`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:109-115`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:315-321` |
**Description**
Three central actors (`AuditLogIngestActor`, `AuditLogPurgeActor`, `SiteAuditReconciliationActor`)
all override `SupervisorStrategy()` and return
`new OneForOneStrategy(maxNrOfRetries: 0, withinTimeRange: TimeSpan.Zero, decider: DefaultDecider)`.
The surrounding XML / inline comments variously claim "uses `Resume` so a thrown exception
inside `ReceiveAsync` does not restart the actor" (AuditLogIngestActor remarks),
"uses Resume so any leaked exception keeps the singleton alive for the next tick"
(AuditLogPurgeActor remarks), and "the actor's supervisor strategy keeps it alive
across any leaked exception with `DefaultDecider`'s Restart semantics — restart resets
the in-memory cursors, but as noted above that's a safe (over-pull, idempotent) recovery"
(SiteAuditReconciliationActor remarks — at least correctly says Restart, but conflicts
with the other two). Two things are wrong: (1) the strategy returned by an actor's
`SupervisorStrategy()` override governs how that actor supervises its CHILDREN, not how
its own parent treats it — so it is not the mechanism that protects these singletons
from their own throws; (2) `DefaultDecider` Restarts for most exceptions, not Resumes.
The actors are in fact protected by the per-row / per-batch try/catch blocks inside
the receive handlers — the supervisor override is effectively unused, since these
actors have no children. The comments mislead a reader into trusting a guarantee
that the code does not deliver.
**Recommendation**
Pick one of two corrections. Either delete the `SupervisorStrategy` override (these
actors have no children, so the override is dead) and rewrite the comments to credit
the try/catch blocks for the alive-on-throw guarantee; or, if the override is kept
as a forward-compat hedge, change the decider to `Decider.From(_ => Directive.Resume)`
or similar to match the comment, AND add a clear note that the per-row catch is what
keeps the actor running across handler throws, not the supervisor strategy.
**Resolution**
_Unresolved._
### AuditLog-003 — `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:133`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:139`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:178` |
**Description**
`OnCachedTelemetryAsync` opens `_serviceProvider!.CreateAsyncScope()` and lets
`await using` dispose it. `OnIngestAsync`, `OnTickAsync` in
`SiteAuditReconciliationActor`, and `OnTickAsync` in `AuditLogPurgeActor` all open
`_services.CreateScope()` (the synchronous variant) and dispose it with a synchronous
`scope.Dispose()` in a `finally` block — even though the per-message work is async and
the scoped `IAuditLogRepository` resolves an EF Core `DbContext`, which implements
`IAsyncDisposable`. The synchronous `Dispose()` on a `DbContext` blocks on any pending
async connection cleanup; under load this can hold the actor thread for the duration
of a connection close, which on SQL Server may include sending a `SET TRANSACTION
ISOLATION LEVEL` reset round-trip. Switching to `CreateAsyncScope()` + `await using`
is the recommended pattern for scoped EF resources.
**Recommendation**
Change `_services.CreateScope()` to `_services.CreateAsyncScope()` in
`OnIngestAsync`, `SiteAuditReconciliationActor.OnTickAsync`, and
`AuditLogPurgeActor.OnTickAsync`, and replace the `try/finally { scope?.Dispose(); }`
pattern with `await using var scope = _services.CreateAsyncScope();`. The DI scope
will dispose asynchronously and the EF Core context will be released without
blocking the actor thread.
**Resolution**
_Unresolved._
### AuditLog-004 — `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:233-265` |
**Description**
`PullSiteAsync` iterates the pulled events, calls `InsertIfNotExistsAsync` inside a
per-row try/catch, and unconditionally updates `maxOccurred = evt.OccurredAtUtc` after
the try/catch — regardless of whether the insert succeeded or threw. The comment at
line 247 acknowledges this: "the cursor still advances based on OccurredAtUtc — the
row was returned by the site, so the next tick won't re-fetch it; if it permanently
fails to persist, that's an operational concern surfaced by the log, not a hot-loop
trigger." For a transient fault that flips to success on the next pull the design
holds. But if a row throws on EVERY central attempt (truly permanent persistence fault —
e.g. column-too-long, FK violation that won't resolve) the cursor advance still moves
past it, and central will simply log on every reconciliation tick. No alert escalates
beyond a log line. Worse, the site keeps the row `Pending` (because `MarkReconciledAsync`
is only called for rows the puller flipped centrally) AND will trip the
`SiteAuditTelemetryStalled` signal because the backlog never drains, but the central
log message is the only place an operator could correlate the stall with the
persistent insert failure.
**Recommendation**
Either (a) only advance the cursor for rows whose `InsertIfNotExistsAsync` returned
cleanly — leave `maxOccurred` at the previous value for the failing row so the next
tick retries; or (b) increment a dedicated `CentralAuditPermanentInsertFailure` health
counter on the per-row catch so the failure is observable on the dashboard instead of
buried in the log. Option (a) needs a guard against the same row throwing forever
(which would saturate the puller): a small per-event retry counter held in the actor's
state with a permanent-skip + `LogCritical` threshold is the standard escape valve.
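
The retry-counter escape valve in option (a) can be sketched as follows. This is a minimal, synchronous sketch under stated assumptions, not the actor's actual code: `PulledEvent`, `ReconciliationCursor`, and the `insert` delegate (standing in for `InsertIfNotExistsAsync`) are hypothetical names, and each batch is assumed to arrive ordered by `OccurredAtUtc`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a pulled audit row; only the fields the cursor logic needs.
public sealed record PulledEvent(Guid EventId, DateTime OccurredAtUtc);

public sealed class ReconciliationCursor
{
    private readonly int _maxAttempts;
    private readonly Dictionary<Guid, int> _attempts = new();

    public ReconciliationCursor(int maxAttempts) => _maxAttempts = maxAttempts;

    // The next pull would ask the site for rows newer than this position.
    public DateTime Position { get; private set; } = DateTime.MinValue;

    // Rows that exhausted their retries. The caller is expected to log these
    // at Critical severity and bump a health counter.
    public List<Guid> PermanentlySkipped { get; } = new();

    // Processes one pulled batch. `insert` may throw for a failing row.
    public void Apply(IEnumerable<PulledEvent> batch, Action<PulledEvent> insert)
    {
        foreach (var evt in batch)
        {
            try
            {
                insert(evt);
                _attempts.Remove(evt.EventId);
            }
            catch (Exception)
            {
                _attempts.TryGetValue(evt.EventId, out var seen);
                var attempts = seen + 1;
                if (attempts < _maxAttempts)
                {
                    // Hold the cursor before the failing row so the next tick re-pulls it.
                    _attempts[evt.EventId] = attempts;
                    return;
                }

                // Escape valve: give up on this row and let the cursor move past it.
                _attempts.Remove(evt.EventId);
                PermanentlySkipped.Add(evt.EventId);
            }

            Position = evt.OccurredAtUtc;
        }
    }
}
```

Stopping at the first failing row delays the rows behind it until the retry succeeds or the threshold is hit. That trade-off only holds because the insert is idempotent, so re-pulling already-persisted rows is harmless.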
**Resolution**
_Unresolved._
### AuditLog-005 — `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:597-657` |
**Description**
`SqliteAuditWriter.GetBacklogStatsAsync` takes `_writeLock` (the same lock that
serialises every batch INSERT in `FlushBatch`) and holds it for the duration of a
`SELECT COUNT(*), MIN(OccurredAtUtc) FROM AuditLog WHERE ForwardState = 'Pending'`.
`SiteAuditBacklogReporter` calls this on a 30-second timer. On a healthy site with
few `Pending` rows the index-only scan is fast; under the scenario the metric exists
to detect — a prolonged central outage growing the backlog "indefinitely" per
Component-AuditLog.md — a `COUNT(*)` over hundreds of thousands of `Pending` rows
on the `IX_SiteAuditLog_ForwardState_Occurred` index is no longer cheap, and the
duration of that scan is added to the hot-path write latency for every concurrent
script. The hot path is supposed to be "durable in microseconds" per the design doc;
a multi-hundred-millisecond probe stall in the same period would not be visible
externally but would back-pressure the bounded write channel. `ReadPendingAsync` and
`ReadPendingSinceAsync` share the same lock for the same reason and have the same
exposure under backlog growth.
**Recommendation**
Either (a) move the SELECT outside the write lock by using a second, dedicated
read-only SQLite connection (SQLite allows multiple connections to the same file,
and with `journal_mode=WAL` readers do not block the writer, which would also
benefit the hot path); or (b) cache the last snapshot inside the writer and recompute it
lazily on a dedicated background tick so the reporter reads a pre-computed snapshot
without acquiring the write lock. Option (a) also unblocks `ReadPendingAsync` /
`ReadPendingSinceAsync` from competing with the writer.
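
Option (b) can be sketched as a small snapshot cache. This is illustrative only: `BacklogSnapshot`, `BacklogSnapshotCache`, and the `compute` delegate are hypothetical names, and the sketch assumes the delegate is where the `COUNT`/`MIN` query runs (ideally on a read-only connection rather than under `_writeLock`).

```csharp
using System;
using System.Threading;

// Hypothetical snapshot shape; the real writer's stats type may differ.
public sealed record BacklogSnapshot(long PendingCount, DateTime? OldestPendingUtc, DateTime ComputedAtUtc);

public sealed class BacklogSnapshotCache
{
    private readonly Func<BacklogSnapshot> _compute;
    private BacklogSnapshot _latest = new(0, null, DateTime.MinValue);

    public BacklogSnapshotCache(Func<BacklogSnapshot> compute) => _compute = compute;

    // Called by the reporter. Reads the last published snapshot without
    // touching the database or any lock.
    public BacklogSnapshot Latest => Volatile.Read(ref _latest);

    // Called from a dedicated background tick. The query cost is paid here,
    // off the reporter's path, and the result is published atomically.
    public void Refresh() => Volatile.Write(ref _latest, _compute());
}
```

The reporter then reads `Latest` on its 30-second timer. How stale the snapshot may get is set by the refresh tick, and `ComputedAtUtc` lets the dashboard show that age.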
**Resolution**
_Unresolved._
### AuditLog-006 — `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:697-700` |
**Description**
```csharp
public void Dispose()
{
DisposeAsync().AsTask().GetAwaiter().GetResult();
}
```
This is the classic sync-over-async anti-pattern. `DisposeAsync` `await`s the
writer-loop task with `.ConfigureAwait(false)`, so on a thread with no synchronization
context (the typical .NET 10 host shutdown path) it's fine. The hazard is a caller that
invokes `Dispose()` from a context that captures (an ASP.NET request, a SynchronizationContext
test runner, an Akka.NET dispatcher in some configurations): if any `await` reached from
`DisposeAsync` resumes on the captured context, `GetResult()` blocks that thread while the
continuation waits to run on it, the classic deadlock.
The writer is registered as a DI singleton, so this is unlikely to bite during the
host's `IAsyncDisposable` shutdown (DI prefers `DisposeAsync` when available), but
an integration test or future code path that constructs the writer manually inside
a sync context will hang.
**Recommendation**
Drop the `IDisposable` interface and rely on `IAsyncDisposable` only: the DI container
will call `DisposeAsync` on singletons that implement it, provided the host disposes the
provider asynchronously (a synchronous `ServiceProvider.Dispose()` throws for services
that implement only `IAsyncDisposable`). If a sync `Dispose` is
required for compatibility with consumers that don't honour `IAsyncDisposable`,
implement it as a best-effort that calls `_writeQueue.Writer.TryComplete()` + a
short wait, without blocking the thread for the full async drain.
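
A best-effort synchronous `Dispose` of that kind might look like the sketch below. `QueueWriter` is a hypothetical, heavily simplified stand-in for `SqliteAuditWriter` (an in-memory list instead of SQLite), and the 500 ms bound is an arbitrary assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class QueueWriter : IAsyncDisposable, IDisposable
{
    private readonly Channel<string> _queue = Channel.CreateUnbounded<string>();
    private readonly Task _writerLoop;

    // Stand-in for the durable store.
    public List<string> Written { get; } = new();

    public QueueWriter() => _writerLoop = Task.Run(() => DrainAsync());

    public bool TryWrite(string item) => _queue.Writer.TryWrite(item);

    private async Task DrainAsync()
    {
        // Ends once the channel is completed and every queued item has been read.
        await foreach (var item in _queue.Reader.ReadAllAsync().ConfigureAwait(false))
        {
            Written.Add(item);
        }
    }

    // Preferred path: completes the channel and awaits the full drain.
    public async ValueTask DisposeAsync()
    {
        _queue.Writer.TryComplete();
        await _writerLoop.ConfigureAwait(false);
    }

    // Best-effort path: completes the channel and waits a bounded time.
    // It never blocks for the whole drain, so a stuck loop cannot hang the caller.
    public void Dispose()
    {
        _queue.Writer.TryComplete();
        try
        {
            _writerLoop.Wait(TimeSpan.FromMilliseconds(500));
        }
        catch (AggregateException)
        {
            // The loop faulted; there is nothing useful to do during disposal.
        }
    }
}
```

The bounded wait still blocks the calling thread briefly, so this reduces the hazard rather than removing it. Items still queued when the timeout elapses are abandoned, which is the price of the best-effort contract.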
**Resolution**
_Unresolved._
### AuditLog-007 — `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:148-218` |
**Description**
`AddAuditLog` registers three components that depend on `INodeIdentityProvider`:
- `CachedCallTelemetryForwarder` — resolves with `sp.GetService<INodeIdentityProvider>()`
(optional, falls back to a null `SourceNode`).
- `CachedCallLifecycleBridge` — resolves with `sp.GetService<INodeIdentityProvider>()`
(optional, same fallback).
- `CentralAuditWriter` — resolves with `sp.GetRequiredService<INodeIdentityProvider>()`
(required, throws at first resolution if unregistered).
The XML comments at lines 153 / 175 / 215 explain the reasoning — the first two are
optional because tests may skip the registration; the third is required because "the
production composition root in `SiteServiceRegistration` registers the provider as a
singleton on both site and central paths". But this is a fragile guarantee — `AddAuditLog`
itself does NOT register the provider, so a future composition root that calls
`AddAuditLog` without first calling whatever registers `INodeIdentityProvider` will fail
on the FIRST resolution of `ICentralAuditWriter` (which is a lazy factory) rather than
at `AddAuditLog` time. The result: site nodes that "happen to work" because they hold
a registered provider, central composition test fixtures that fail at runtime instead
of DI-build time, and a `GetService`/`GetRequiredService` split that gives no clear
contract to the reader.
**Recommendation**
Either (a) make all three optional: `CentralAuditWriter` already handles a null provider
gracefully (lines 113-116, null-coalescing the caller's value), so the asymmetry buys
nothing; or (b) make all three required and either add `services.AddSingleton<INodeIdentityProvider, ...>()`
inside `AddAuditLog` (with a sensible default, e.g. a null node name returns `<unknown>`) or
add an explicit guard at the top of `AddAuditLog` that throws if no provider has been
registered yet (that is, when `services.Any(d => d.ServiceType == typeof(INodeIdentityProvider))` is false).
**Resolution**
_Unresolved._
### AuditLog-008 — Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/FallbackAuditWriter.cs:51-77`, `src/ScadaLink.AuditLog/Central/CentralAuditWriter.cs:77-104`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:125,155` |
**Description**
`FallbackAuditWriter`, `CentralAuditWriter`, and `AuditLogIngestActor` all accept an
`IAuditPayloadFilter` as an optional dependency, defaulting to `null` (pass-through).
The justification in every XML comment is the same: "the M4 test composition roots
that don't pass one keep working (they only ever write small payloads)". This is fine
for size — but the filter also performs HEADER REDACTION (`Authorization`, `Cookie`,
`Set-Cookie`, `X-API-Key`), GLOBAL BODY REDACTORS, and SQL PARAMETER REDACTION. A test
fixture (or any future composition root that bypasses `AddAuditLog`) that injects a
real `RequestSummary` will see secrets written to SQLite / MS SQL with no redaction.
The combination "audit-write must never abort the user-facing action" + "unredacted
secrets must never persist" (Component-AuditLog.md §Payload Capture Policy) makes the
no-filter fallback genuinely dangerous — over-redacting on a missing filter is the
contract the production setup honours, but the code itself defaults to under-redact.
**Recommendation**
Change the three null-coalesce sites to default to a non-null sentinel filter that
performs the header redaction (`HeaderRedactList`) using the hard-coded defaults
from `AuditLogOptions`, even when no `IAuditPayloadFilter` is registered. The
truncation stage can remain optional; the header redaction must not. Alternatively,
make `IAuditPayloadFilter` non-optional and have `AddAuditLog` register the real
filter unconditionally — tests that don't bind the options section will resolve the
default `AuditLogOptions` and get the production-default redact list automatically.
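
The sentinel in the first option could be as small as the sketch below. It is an assumption-laden illustration: `DefaultHeaderRedactor`, its `Redact` signature, and the `<redacted>` mask are hypothetical, since the real `IAuditPayloadFilter` contract is not shown here. Only the four header names come from the finding.

```csharp
using System;
using System.Collections.Generic;

public static class DefaultHeaderRedactor
{
    // Header names listed in the finding. HTTP header names are
    // case-insensitive, so the lookup ignores case.
    private static readonly HashSet<string> RedactList = new(StringComparer.OrdinalIgnoreCase)
    {
        "Authorization", "Cookie", "Set-Cookie", "X-API-Key",
    };

    // Hypothetical mask value.
    public const string Mask = "<redacted>";

    // Returns a copy of the headers with every listed header's value masked.
    public static Dictionary<string, string> Redact(IReadOnlyDictionary<string, string> headers)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var (name, value) in headers)
        {
            result[name] = RedactList.Contains(name) ? Mask : value;
        }
        return result;
    }
}
```

Falling back to this when no filter is registered keeps the failure mode on the over-redacting side, which matches the stated payload policy.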
**Resolution**
_Unresolved._
### AuditLog-009 — `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:706-740` |
**Description**
The first `lock (_writeLock)` block in `DisposeAsync` is commented:
> Stop accepting new events. Setting _disposed first ensures any FlushBatch entered
> after we mark disposed will fault its pending events rather than touching the
> about-to-close connection.
But the block does NOT set `_disposed = true` — it only calls
`_writeQueue.Writer.TryComplete()` and captures `_writerLoop`. The `_disposed` flag is
flipped in the SECOND lock block (line 738), AFTER the 5-second wait on the writer
loop. During the wait window, a concurrent `WriteAsync` that observed the channel
NOT-yet-completed (race: it ran before `TryComplete`) and got past `TryWrite` would
land on the writer loop's `FlushBatch`, which then takes the lock and checks
`_disposed` — and finds it still `false`. The check at the top of `FlushBatch`
(line 265) `if (_disposed) { fault pending; return; }` therefore does NOT fire during
the dispose window. In practice the channel being completed drains the loop cleanly
and the disposal race is benign, but the comment claims a guarantee that the code
does not implement.
**Recommendation**
Either set `_disposed = true` in the first lock block to match the comment (and remove
the duplicate `_disposed` check in the second block); or rewrite the comment to
describe the actual ordering: the channel is completed first, the loop drains
remaining items under the lock, and `_disposed = true` is set only after the loop
exits. The current code is correct; the comment is wrong.
**Resolution**
_Unresolved._
### AuditLog-010 — Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs:92,107,124`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:228` |
**Description**
The drain loops on `SiteAuditTelemetryActor.OnDrainAsync` and the per-site pull on
`SiteAuditReconciliationActor.PullSiteAsync` both pass `CancellationToken.None` to
every async dependency call (queue reads, gRPC client, repository writes). The actor
has no `CancellationToken` field, so there's no in-flight cancellation source —
graceful shutdown relies entirely on `PostStop` being called and the actor's
`Receive` continuation completing naturally. For a healthy gRPC client this is fine,
but a stuck `IngestAuditEventsAsync` call (slow central, partition switch in progress)
holds the actor's continuation indefinitely; the host's coordinated-shutdown will then
time out the actor system and leave the actor in an undefined state. The brief
references "cancellation on stop" in the partition-maintenance comments but
`SiteAuditTelemetryActor` does not implement it.
**Recommendation**
Introduce a per-actor `CancellationTokenSource` populated in `PreStart` and cancelled
in `PostStop`; pass `_lifecycleCts.Token` instead of `CancellationToken.None` in
every async dependency call. Same change for `SiteAuditReconciliationActor`.
`OperationCanceledException` is already swallowed by the existing top-level catch
in `OnDrainAsync` (line 128), so plumbing the token through is a localised change.
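
The lifecycle-token pattern can be sketched without Akka.NET. In this hypothetical `DrainWorker`, `Stop()` plays the role of the actor's `PostStop`, constructing the worker plays the role of `PreStart`, and the `ingest` delegate stands in for a dependency call such as `IngestAuditEventsAsync`. None of these names come from the module.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class DrainWorker
{
    private readonly Func<CancellationToken, Task> _ingest;

    // Created with the worker (the actor would do this in PreStart).
    private readonly CancellationTokenSource _lifecycleCts = new();

    public DrainWorker(Func<CancellationToken, Task> ingest) => _ingest = ingest;

    // The actor would call this from PostStop. Disposal of the source is
    // omitted to keep the sketch short.
    public void Stop() => _lifecycleCts.Cancel();

    // Returns true when the drain completed, false when it was cancelled by Stop().
    public async Task<bool> DrainOnceAsync()
    {
        try
        {
            // The lifecycle token replaces CancellationToken.None.
            await _ingest(_lifecycleCts.Token).ConfigureAwait(false);
            return true;
        }
        catch (OperationCanceledException)
        {
            // Expected during shutdown; mirrors the existing top-level catch.
            return false;
        }
    }
}
```

With the token plumbed through, a call that would otherwise hang is released as soon as `Stop()` runs. Whether `PostStop` fires while a `ReceiveAsync` handler is still in flight depends on Akka.NET's stop sequencing, so that is worth confirming before relying on it.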
**Resolution**
_Unresolved._
### AuditLog-011 — `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:53-55, 263-276, 301-346` |
**Description**
The XML doc on `AddAuditLog` is explicit: "Idempotent re-registration is not supported;
call this exactly once per `IServiceCollection`." But `AddAuditLogHealthMetricsBridge`
calls `services.AddHostedService<SiteAuditBacklogReporter>()` (line 275), which is
NOT idempotent — every call registers another descriptor, and the host will spin up
N reporters and have them all poll SQLite every 30 s, all push the same snapshot into
`ISiteHealthCollector`. The site composition path is supposed to call this exactly
once, but tests or composition refactors that accidentally call it twice will pay 2x the
SQL probe rate and redundantly overwrite the snapshot (no race, just wasted work).
Worse, `AddAuditLogCentralMaintenance` (line 301) is also non-idempotent:
`AddOptions<AuditLogPartitionMaintenanceOptions>` and `AddHostedService<AuditLogPartitionMaintenanceService>`
will pile up.
**Recommendation**
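Either (a) guard each Add* helper with a "has the marker been seen?" sentinel
(register a private marker descriptor on first call, no-op on subsequent calls);
or (b) explicitly document idempotency on the public surface of every helper and
verify with a unit test in `AddAuditLogTests`. Option (a) matches the pattern other
SDK extensions use and removes a foot-gun. The unit test is worth writing first in
either case: in current Microsoft.Extensions.Hosting releases the parameterless
`AddHostedService<T>()` overload registers through `TryAddEnumerable`, so a repeat
call for the same service type may already be a no-op, whereas any `Configure`/`Bind`
calls chained onto `AddOptions<T>()` do accumulate.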
**Resolution**
_Unresolved._
+329 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.CLI` |
| Design doc | `docs/requirements/Component-CLI.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
## Summary
@@ -47,6 +47,36 @@ and `WriteAsTable` derives table columns from only the first array element, sile
dropping columns for any later element with a different shape (CLI-016). No
Critical/High issues; the module remains healthy.
#### Re-review 2026-05-28 (commit `1eb6e97`)
The CLI has grown two substantial new command groups since the last re-review —
`scadalink audit` (Audit Log #23 M8) and `scadalink bundle` (Transport #24) — together
adding ~1,500 lines of new production code. The new `audit` surface is well-tested and
well-factored (pure helpers + a clear `IAuditFormatter` seam), but the new `bundle`
surface is untested, duplicates the URL/credential resolution that already exists in
`CommandHelpers`, and inherits a partial authorization-exit-code regression that also
appears in the audit path. Two longstanding fragility gaps that the prior reviews missed
also surface in this pass: `CliConfig.Load` parses the config file with no try/catch, and
`CommandTreeTests` still pins the old 14-group count so the two new groups are excluded
from the leaf-action and registry-resolution coverage that protected the rest of the
tree. Module health is broadly good but the consolidated count is now seven Open
findings (none Critical, three Medium).
- **CLI-017** — `BundleCommands` duplicates `ExecuteCommandAsync` and skips the
`FORBIDDEN`/`UNAUTHORIZED` exit-code mapping (auth exit 2 contract regression).
- **CLI-018** — `AuditQueryHelpers.RunQueryAsync` / `AuditExportHelpers.RunExportAsync`
return exit 1 on every error, never the documented exit 2 for authorization failure.
- **CLI-019** — `BundleCommands.bundle export` decodes the entire base64 bundle in
memory and writes synchronously — 100 MB bundles double-buffer.
- **CLI-020** — `BundleCommands.bundle export` parses the success body with bare
`JsonDocument.Parse` + `GetProperty` and throws on a malformed/abbreviated envelope.
- **CLI-021** — `CliConfig.Load` crashes the whole CLI when `~/.scadalink/config.json`
is malformed or unreadable, even if `--url` was supplied on the command line.
- **CLI-022** — `AuditCommands` and `BundleCommands` are absent from `CommandTreeTests`;
the test still pins `Equal(14, groups.Count)` and silently excludes the new groups.
- **CLI-023** — `Component-CLI.md` says the audit commands ride `POST /management`,
but the implementation calls a new `GET /api/audit/*` REST endpoint pair.
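
The brittle-parse theme behind CLI-020 and CLI-021 comes down to guarding `JsonDocument.Parse` and preferring `TryGetProperty` over `GetProperty`. The helper below is a hedged sketch, not the CLI's code: `EnvelopeReader` and its method are hypothetical, and the property name in the usage note is invented for illustration.

```csharp
using System.Text.Json;

public static class EnvelopeReader
{
    // Reads a top-level string property without throwing on malformed JSON,
    // a non-object root, a missing property, or a non-string value.
    public static bool TryReadString(string json, string propertyName, out string value)
    {
        value = string.Empty;
        try
        {
            using var doc = JsonDocument.Parse(json);
            var root = doc.RootElement;

            // TryGetProperty throws on non-object elements, so check the kind first.
            if (root.ValueKind == JsonValueKind.Object
                && root.TryGetProperty(propertyName, out var prop)
                && prop.ValueKind == JsonValueKind.String)
            {
                value = prop.GetString() ?? string.Empty;
                return true;
            }

            return false;
        }
        catch (JsonException)
        {
            // Malformed or truncated input: report failure instead of crashing.
            return false;
        }
    }
}
```

A caller such as `bundle export` could then map a `false` result to a clear error message and a non-zero exit code, for example `EnvelopeReader.TryReadString(body, "bundle", out var base64)` with a hypothetical `bundle` property.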
## Checklist coverage
_Original review (2026-05-16, `9c60592`):_
@@ -79,6 +109,21 @@ _Re-review (2026-05-17, `39d737e`):_
| 9 | Testing coverage | ☑ | Substantially expanded (`CommandTreeTests`, `ManagementHttpClientTests`, `DebugStreamTests`). No new gaps. |
| 10 | Documentation & comments | ☑ | XML docs accurate. `Component-CLI.md` drift folded into CLI-015. |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `BundleCommands.BuildExport` unguarded `JsonDocument.Parse` + `GetProperty` (CLI-020); `CliConfig.Load` unguarded JSON parse (CLI-021). |
| 2 | Akka.NET conventions | ☑ | Not applicable — pure HTTP/SignalR/REST client. No issues. |
| 3 | Concurrency & thread safety | ☑ | No new concurrency surface; `debug stream` unchanged since CLI-011/012. No issues. |
| 4 | Error handling & resilience | ☑ | Bundle and audit paths skip the auth exit-code contract (CLI-017, CLI-018); bundle JSON-envelope parse is brittle (CLI-020); config-file parse aborts the process (CLI-021). |
| 5 | Security | ☑ | No new credential or trust-boundary issues. No issues. |
| 6 | Performance & resource management | ☑ | `bundle export` double-buffers the whole bundle in memory (CLI-019). |
| 7 | Design-document adherence | ☑ | `Component-CLI.md` claims audit commands ride `POST /management`; implementation uses new REST endpoints (CLI-023). |
| 8 | Code organization & conventions | ☑ | `BundleCommands.RunBundleCommandAsync` re-implements credential/URL resolution that `CommandHelpers.ExecuteCommandAsync` already provides — drift waiting to happen (CLI-017). |
| 9 | Testing coverage | ☑ | `BundleCommands` has no tests; `CommandTreeTests` pins `Equal(14, …)` and excludes the new `AuditCommands` + `BundleCommands` groups (CLI-022). |
| 10 | Documentation & comments | ☑ | XML docs accurate; doc-vs-code transport drift folded into CLI-023. No other issues. |
## Findings
### CLI-001 — `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken
@@ -741,3 +786,284 @@ list and `OutputFormatter.WriteTable` pads missing cells, so heterogeneous array
render every column. Regression tests added in `TableHeaderUnionTests` (3 tests:
later-element-only column included, first-seen column order preserved,
first-element-extra column still rendered).
### CLI-017 — `BundleCommands.RunBundleCommandAsync` duplicates `ExecuteCommandAsync` and breaks the auth exit-code contract
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/BundleCommands.cs:244-289` (vs. `src/ScadaLink.CLI/Commands/CommandHelpers.cs:20-73`, `:159-174`) |
**Description**
`BundleCommands.RunBundleCommandAsync` re-implements the URL/credential resolution,
validation, and HTTP plumbing that `CommandHelpers.ExecuteCommandAsync` already provides
for every other command group — to attach a 5-minute timeout (`BundleCommandTimeout`)
and a caller-supplied success handler. In duplicating it, two contracts that
`CommandHelpers` carefully establishes were dropped:
1. **Authorization exit code.** `CommandHelpers.HandleResponse` routes through
`IsAuthorizationFailure`, which returns exit 2 for **either** HTTP 403 **or** an
`UNAUTHORIZED`/`FORBIDDEN` error code on any status (resolution of CLI-009). The
bundle path at line 287 uses a bare `if (response.StatusCode == 403) return 2;` — a
server that signals authorization failure via the `code` field on a non-403 status
(the same channel the rest of the CLI honours) will exit 1 instead of 2 from
`bundle export`/`preview`/`import`. `Component-Transport.md:289` explicitly states
"Exit codes follow the project convention: `0` = success, `1` = command failure,
`2` = authorization failure," so this is a contract regression.
2. **Error-message phrasing drift.** The two duplicated error paths
(`bundle:258-260`, `:264-266`) emit shorter messages that omit the
`SCADALINK_MANAGEMENT_URL` / `SCADALINK_USERNAME` env-var hints the canonical paths
give — confusing if the user is trying to debug what's missing.
**Recommendation**
Refactor `CommandHelpers.ExecuteCommandAsync` to accept an optional `TimeSpan` timeout
and an optional success handler, and have `BundleCommands` call it. Failing that,
promote `CommandHelpers.IsAuthorizationFailure` to `internal` and call it from
`RunBundleCommandAsync` in place of the bare 403 check, and copy the canonical error
messages verbatim.
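
A minimal sketch of the shared check, assuming the contract described above (exit 2 for HTTP 403 or for an `UNAUTHORIZED`/`FORBIDDEN` error code on any status). The `ExitCodes` class, its method shapes, and the case-insensitive comparison are illustrative assumptions, not the actual `CommandHelpers` code:

```csharp
#nullable enable
using System;
using System.Net;

// Hypothetical helper: names and signatures are assumptions for illustration only.
public static class ExitCodes
{
    public const int Success = 0;
    public const int CommandFailure = 1;
    public const int AuthorizationFailure = 2;

    // True for HTTP 403, or for an UNAUTHORIZED/FORBIDDEN error code on any status.
    // Case-insensitive matching is an assumption, not confirmed by the source.
    public static bool IsAuthorizationFailure(int statusCode, string? errorCode) =>
        statusCode == (int)HttpStatusCode.Forbidden
        || string.Equals(errorCode, "FORBIDDEN", StringComparison.OrdinalIgnoreCase)
        || string.Equals(errorCode, "UNAUTHORIZED", StringComparison.OrdinalIgnoreCase);

    // Maps a response to the documented 0 / 1 / 2 exit-code convention.
    public static int FromResponse(int statusCode, string? errorCode)
    {
        bool ok = statusCode >= 200 && statusCode < 300 && errorCode is null;
        if (ok) return Success;
        return IsAuthorizationFailure(statusCode, errorCode)
            ? AuthorizationFailure
            : CommandFailure;
    }
}
```

Routing the bundle path and the audit helpers (CLI-018) through one function of this shape would keep the exit-code rule in a single place.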
**Resolution**
_Unresolved._
### CLI-018 — `audit query` and `audit export` never return exit 2 for an authorization failure
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/AuditQueryHelpers.cs:186-193`, `src/ScadaLink.CLI/Commands/AuditExportHelpers.cs:147-153` |
**Description**
The two audit-log subcommands (`audit query`, `audit export`) ride a new REST surface
(`GET /api/audit/query` and `GET /api/audit/export`) — not the `POST /management`
envelope that goes through `CommandHelpers.HandleResponse`. Both helpers map *any*
non-success response to a generic `OutputFormatter.WriteError(...)` + `return 1`:
- `AuditQueryHelpers.RunQueryAsync:186-193` returns 1 unconditionally when `JsonData`
is null (i.e. any error). It never inspects `StatusCode` or `ErrorCode`.
- `AuditExportHelpers.RunExportAsync:147-153` returns 1 for every non-success status,
again with no 403 / `FORBIDDEN` carve-out.
`Component-CLI.md:295-296` documents exit code 2 for "Authorization failure (insufficient
role)". `Component-AuditLog.md` (Security & Tamper-Evidence) and `Component-CLI.md:184-187`
both call out that the audit endpoints are gated by `OperationalAudit` and `AuditExport`
permissions enforced server-side — i.e. these are exactly the commands most likely to
return 403 in routine use. The exit-code regression silently downgrades a 403 to a
generic command failure, breaking the CI/CD scripting contract.
**Recommendation**
Promote `CommandHelpers.IsAuthorizationFailure` to `internal` (or move it to a small
shared helper) and have `RunQueryAsync` / `RunExportAsync` return 2 when it matches.
The check needs to use the `ManagementResponse.StatusCode` / `ErrorCode` pair the
audit `SendGetAsync` already populates.
**Resolution**
_Unresolved._
### CLI-019 — `bundle export` decodes the entire base64 bundle into memory before writing
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/BundleCommands.cs:117-124`, `src/ScadaLink.CLI/ManagementHttpClient.cs:47-92` |
**Description**
`Component-Transport.md:271` caps the raw bundle at 100 MB and notes the
per-request body cap is raised to 200 MB once base64-inflated. The CLI's export path
goes through `ManagementHttpClient.SendCommandAsync`, which reads the entire response
body into a string (`responseBody = await httpResponse.Content.ReadAsStringAsync(...)`)
and returns it as `ManagementResponse.JsonData`. `BundleCommands.BuildExport` then:
1. `JsonDocument.Parse(jsonOk)` re-allocates the JSON DOM (~200 MB string + DOM).
2. `doc.RootElement.GetProperty("base64Bundle").GetString()` materializes the base64
payload as another ~200 MB `string`.
3. `Convert.FromBase64String(base64)` allocates a fresh ~100 MB `byte[]`.
4. `File.WriteAllBytes(output, bytes)` writes synchronously.
Peak working-set for a 100 MB bundle is therefore ~600 MB, all on the LOH, plus the
file-I/O is fully synchronous. The streaming `SendGetStreamAsync` path the audit
export uses (line 155-156) shows the right pattern is already available for plain GETs,
but bundles ride a `POST /management` envelope so they currently can't reuse it.
**Recommendation**
For the export path specifically, add a streaming variant — either a new
`POST /api/bundle/export` REST endpoint mirroring the audit pattern, or a chunk-fetch
follow-up `GET /api/bundle/<exportId>` so the CLI can stream bytes through
`Stream.CopyToAsync` without buffering the whole envelope. If a v1 stop-gap is needed,
at minimum switch to `File.WriteAllBytesAsync` and use `Convert.TryFromBase64Chars`
with a rented buffer to avoid the double-LOH allocation.
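
The stop-gap could look roughly like the sketch below. It assumes the base64 payload has already been extracted as a `string`; the `BundleFileWriter` class and `WriteBase64Async` method are hypothetical names, and the sketch removes only the dedicated `byte[]` allocation and the synchronous write, not the buffering of the response envelope itself:

```csharp
#nullable enable
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper: not part of the reviewed code base.
public static class BundleFileWriter
{
    // Decodes base64 into a pooled buffer and writes it asynchronously.
    // Returns the number of decoded bytes written to the file.
    public static async Task<int> WriteBase64Async(string base64, string path)
    {
        // Every 4 base64 chars decode to at most 3 bytes, so this is an upper bound.
        int maxBytes = checked((int)((long)base64.Length / 4 * 3 + 3));
        byte[] rented = ArrayPool<byte>.Shared.Rent(maxBytes);
        try
        {
            int written = Decode(base64, rented);
            await using var file = new FileStream(
                path, FileMode.Create, FileAccess.Write, FileShare.None,
                bufferSize: 81920, useAsync: true);
            // Only the decoded prefix of the rented buffer is valid data.
            await file.WriteAsync(rented.AsMemory(0, written));
            return written;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }

    // Kept synchronous so the span never has to live inside the async method.
    private static int Decode(string base64, byte[] destination)
    {
        if (!Convert.TryFromBase64Chars(base64.AsSpan(), destination, out int written))
            throw new FormatException("Response did not contain valid base64.");
        return written;
    }
}
```

A true fix still needs the streaming endpoint, because the envelope string and the extracted base64 string remain fully buffered in this version.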
**Resolution**
_Unresolved._
### CLI-020 — `bundle export` success-envelope parse is unguarded
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/BundleCommands.cs:117-126` |
**Description**
The export success handler does:
```csharp
using var doc = JsonDocument.Parse(jsonOk);
var base64 = doc.RootElement.GetProperty("base64Bundle").GetString()!;
var byteCount = doc.RootElement.GetProperty("byteCount").GetInt32();
var bytes = Convert.FromBase64String(base64);
```
None of these calls is wrapped in a `try/catch`. A server-side bug that omits one of
the two properties, returns a `null` `base64Bundle`, sends invalid base64, or sends a
malformed JSON envelope will surface as one of `JsonException` / `KeyNotFoundException` /
`InvalidOperationException` / `FormatException` / `ArgumentNullException` (the last when
`base64Bundle` is JSON `null`): an unhandled stack trace, not a clean
`INVALID_RESPONSE` / exit 1, contradicting the "graceful-degradation" theme that the
prior reviews (CLI-002, CLI-003, CLI-005) repeatedly hardened.
**Recommendation**
Wrap the parse + base64-decode in a `try` block that catches `JsonException`,
`KeyNotFoundException`, `InvalidOperationException`, `FormatException`, and
`ArgumentNullException` (or null-checks `base64Bundle` before decoding) and emits a
clean `OutputFormatter.WriteError(..., "INVALID_RESPONSE")` + `return 1`. Add a
regression test against a malformed-envelope stub `HttpMessageHandler`.
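
One way the guarded parse might look, as a sketch only. The `ExportEnvelope.TryRead` name and its `out` parameters are assumptions; the property names `base64Bundle` and `byteCount` come from the snippet above. A JSON `null` payload is rejected with an explicit check here rather than by catching `ArgumentNullException`:

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: illustrates the guarded parse, not the real BundleCommands code.
public static class ExportEnvelope
{
    // Returns false with an error message instead of letting a malformed
    // envelope escape as an unhandled exception.
    public static bool TryRead(string json, out byte[] bytes, out string? error)
    {
        bytes = Array.Empty<byte>();
        error = null;
        try
        {
            using var doc = JsonDocument.Parse(json);
            string? base64 = doc.RootElement.GetProperty("base64Bundle").GetString();
            // Read byteCount so a missing or non-numeric value is reported too.
            _ = doc.RootElement.GetProperty("byteCount").GetInt32();
            if (base64 is null)
            {
                error = "base64Bundle was null";
                return false;
            }
            bytes = Convert.FromBase64String(base64);
            return true;
        }
        catch (Exception ex) when (ex is JsonException
                                   or KeyNotFoundException
                                   or InvalidOperationException
                                   or FormatException)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

The caller would then translate a `false` result into `OutputFormatter.WriteError(..., "INVALID_RESPONSE")` and exit 1.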
**Resolution**
_Unresolved._
### CLI-021 — `CliConfig.Load` crashes the CLI on a malformed config file
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CLI/CliConfig.cs:41-53` |
**Description**
`CliConfig.Load` is the first thing every command runs (via `ExecuteCommandAsync`,
`AuditCommandHelpers.ResolveConnection`, and `BundleCommands.RunBundleCommandAsync`).
Its config-file branch is:
```csharp
if (File.Exists(configPath))
{
var json = File.ReadAllText(configPath);
var fileConfig = JsonSerializer.Deserialize<CliConfigFile>(json, ...);
...
}
```
Neither call is guarded. If `~/.scadalink/config.json` exists but is malformed
(stale, partial, or someone's `vim` swap), `JsonSerializer.Deserialize` throws
`JsonException`. If the file exists but isn't readable (mode 0000),
`File.ReadAllText` throws `UnauthorizedAccessException`. Either fault aborts every
CLI invocation with an unhandled stack trace — even invocations that supply every
input on the command line and don't need the config file at all (`--url`,
`--username`, `--password`, `--format` all on the CLI).
**Recommendation**
Wrap the file-read and the `JsonSerializer.Deserialize` in a single
`try/catch (Exception)` (or specifically `JsonException` +
`UnauthorizedAccessException` + `IOException`). On failure, write a single one-line
warning to `Console.Error` ("ignoring malformed `~/.scadalink/config.json`: {message}")
and return the default `CliConfig`, so the rest of the precedence chain (env vars +
command-line flags) still works.
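
A sketch of the tolerant load under stated assumptions: the properties on `CliConfigFile` and the `TolerantConfig.Load` signature are invented for illustration, and the real `CliConfig.Load` merges more sources than this shows.

```csharp
#nullable enable
using System;
using System.IO;
using System.Text.Json;

// Hypothetical shape of the config file; the real CliConfigFile may differ.
public sealed class CliConfigFile
{
    public string? ManagementUrl { get; set; }
    public string? Format { get; set; }
}

public static class TolerantConfig
{
    // Returns defaults when the file is missing, unreadable, or malformed,
    // so env vars and command-line flags can still supply every value.
    public static CliConfigFile Load(string configPath)
    {
        if (!File.Exists(configPath)) return new CliConfigFile();
        try
        {
            string json = File.ReadAllText(configPath);
            // Deserialize returns null for a literal JSON "null" document.
            return JsonSerializer.Deserialize<CliConfigFile>(json) ?? new CliConfigFile();
        }
        catch (Exception ex) when (ex is JsonException
                                   or UnauthorizedAccessException
                                   or IOException)
        {
            Console.Error.WriteLine(
                $"warning: ignoring unreadable config '{configPath}': {ex.Message}");
            return new CliConfigFile();
        }
    }
}
```

Catching the three named exception types rather than `Exception` keeps genuine programming errors visible while still covering the malformed-file and permission cases described above.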
**Resolution**
_Unresolved._
### CLI-022 — `CommandTreeTests` excludes the two new command groups
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.CLI.Tests/CommandTreeTests.cs:21-37`, `:55-58` (vs. `src/ScadaLink.CLI/Program.cs:21-36`) |
**Description**
`CommandTreeTests.AllCommandGroups()` builds 14 command groups; `Program.cs` now
registers 16 (`AuditCommands` and `BundleCommands` were added since the last
re-review). Worse, the smoke test pins `Assert.Equal(14, groups.Count)`, so it
only checks the harness's own array and stays green even though the
real production tree is two groups larger. The downstream assertions
(`EveryLeafCommand_HasAnAction`, `CommandPayloadTypes_ResolveViaRegistry`) therefore
also do NOT cover the new audit / bundle leaves — and `BundleCommands` has zero
test coverage of any kind (no parsing tests, no success-handler tests, no
registry-resolution tests).
**Recommendation**
Add `AuditCommands.Build(...)` and `BundleCommands.Build(...)` to the
`AllCommandGroups()` array, bump the assertion to `Equal(16, groups.Count)`, and add
representative payload types to `CommandPayloadTypes_ResolveViaRegistry`
(`ExportBundleCommand`, `PreviewBundleCommand`, `ImportBundleCommand`). Optionally,
add a `BundleCommandsTests` file covering the success-envelope parse and the
`NameListOption` comma-split parser.
**Resolution**
_Unresolved._
### CLI-023 — `Component-CLI.md` claims audit commands ride `POST /management`; implementation uses REST endpoints
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-CLI.md:310-311` (vs. `src/ScadaLink.CLI/Commands/AuditQueryHelpers.cs:186`, `src/ScadaLink.CLI/Commands/AuditExportHelpers.cs:126`, `src/ScadaLink.CLI/ManagementHttpClient.cs:94-156`) |
**Description**
`Component-CLI.md:310` states: "The `scadalink audit` command group rides this same
transport — there is no separate audit endpoint." But the implementation calls a
new REST surface — `GET /api/audit/query` and `GET /api/audit/export` — via two new
methods on `ManagementHttpClient` (`SendGetAsync`, `SendGetStreamAsync`), distinct
from the `POST /management` envelope. The plan document
(`docs/plans/2026-05-20-audit-log-code-roadmap.md:1583`) corroborates the
implementation: "REST endpoints `GET /api/audit/query` (paged) and
`GET /api/audit/export` (streaming)" — i.e. the design doc is the stale one.
A reader following `Component-CLI.md` would expect the audit endpoints to share
the management envelope's authentication + dispatch path and route through
`ManagementActor`, neither of which is true. The auth-exit-code regression
(CLI-018) is itself a direct consequence of this divergence: the audit helpers
duplicate the management envelope's response handling instead of riding it, and
forgot to copy the auth carve-out.
**Recommendation**
Update `Component-CLI.md:310-311` (and the Dependencies bullet at `:311`) to
describe the actual REST surface: `GET /api/audit/query` (paged) and
`GET /api/audit/export` (streaming), with HTTP Basic Auth shared with the
management envelope and permission checks enforced by the server-side
`AuditController`. Optionally cross-link to
`docs/plans/2026-05-20-audit-log-code-roadmap.md` (M8 task list) as the
authoritative source.
**Resolution**
_Unresolved._
+327 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.CentralUI` |
| Design doc | `docs/requirements/Component-CentralUI.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 8 |
## Summary
@@ -73,6 +73,55 @@ cross-thread `Dictionary`; CentralUI-022 unguarded `InvokeAsync`), category 4
claims), category 9 (CentralUI-025 untested `SessionExpiry` poll). Categories
1, 2, 5, 6, 7, 10 produced no new findings.
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | CentralUI-026 (AuditFilterBar UTC), CentralUI-027 (3 other pages with same UTC bug). |
| 2 | Akka.NET conventions | ☑ | No new findings — module is presentation; `DebugStreamService` actor usage unchanged. |
| 3 | Concurrency & thread safety | ☑ | CentralUI-030 (StringWriter capture buffer not thread-safe under intra-script `Task.WhenAll`). |
| 4 | Error handling & resilience | ☑ | No new findings — the prior CentralUI-018/023 patterns hold. |
| 5 | Security | ☑ | CentralUI-028 (NotificationReport + SiteCallsReport not site-scoped — CentralUI-002 regression on new pages). |
| 6 | Performance & resource management | ☑ | CentralUI-031 (TransportImport buffers full bundle bytes in component state). |
| 7 | Design-document adherence | ☑ | CentralUI-032 (`AuditResultsGrid` forward-only paging falls short of the bi-directional navigation implied by "keyset paginated"). |
| 8 | Code organization & conventions | ☑ | CentralUI-029 (`JS.InvokeAsync<int>("eval", ...)` in ConfigurationAuditLog vs the `_content/.../BrowserTime` module pattern). |
| 9 | Testing coverage | ☑ | CentralUI-033 (TransportImport / SiteCallsReport query-string drill-in code paths untested). |
| 10 | Documentation & comments | ☑ | No new findings — code comments accurately describe intent. |
#### Re-review 2026-05-28 (commit `1eb6e97`)
All 25 prior findings remain closed. This re-review re-examined the full
module against the 10-category checklist with attention to the
recently-added Transport export/import wizards (`TransportExport`,
`TransportImport`) and the operational Audit Log page (Bundle B..G). The
most consequential pattern in this pass is that the **CentralUI-008
local-input-treated-as-UTC** bug, fixed for the legacy
`AuditLog.razor` via the `BrowserTime.LocalInputToUtc` helper, has been
silently recreated on every other page that exposes a
`<input type="datetime-local">` filter — `AuditFilterBar` (the new
operational Audit Log filter, CentralUI-026), `SiteCallsReport`,
`NotificationReport`, and `EventLogs` (CentralUI-027). The Audit Log
page CSV export URL therefore mis-shifts the From/To filter window by
the operator's UTC offset, and the same offset bug silently corrupts
audit-style queries on Site Calls / Notification Report / Event Logs.
Second-most consequential is **CentralUI-028**: the new `NotificationReport`
and `SiteCallsReport` pages (both `[Authorize(RequireDeployment)]`) do
NOT filter their site dropdown or row data through `SiteScopeService`,
and the relay actions (`RetryNotification`/`DiscardNotification`,
`RetrySiteCall`/`DiscardSiteCall`) issue no server-side site-scope
re-check before relaying to the owning site — so a site-scoped Deployment
user can read and act on notifications and cached calls for sites
outside their grant, replicating the original CentralUI-002 defect on
the two pages added after the CentralUI-002 fix landed. The remaining
new findings (CentralUI-029..CentralUI-033) cover a residual `JS.InvokeAsync<int>("eval", ...)`
in `ConfigurationAuditLog`, a single-thread `StringWriter` capture buffer
in the Test Run sandbox (a sandboxed script that uses `Task.WhenAll` can
write concurrently), a `using var` `MemoryStream` followed by `ms.ToArray()`
buffering the full bundle in memory in `TransportImport`, the
`AuditResultsGrid` having no Previous-page control (forward-only navigation,
a UX/design adherence gap), and the un-tested `TransportImport` /
`SiteCallsReport` query-string drill-in code paths.
## Findings
### CentralUI-001 — Test Run sandbox executes arbitrary C# with no trust-model enforcement
@@ -1216,3 +1265,278 @@ also forces the CentralUI-020 fix.
**Resolution**
2026-05-17 — added `SessionExpiryComponentTests` (bUnit): an expired ping (401) redirects to `/login`, a live ping (200) and a transient failure (status 0) do not, and on the `/login` route the component neither pings nor redirects; also added `AuthPingEndpointTests` covering the `/auth/ping` endpoint contract.
### CentralUI-026 — `AuditFilterBar` From/To filters treat browser-local datetimes as UTC
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Audit/AuditFilterBar.razor:97-104`; `src/ScadaLink.CentralUI/Components/Audit/AuditQueryModel.cs:56-58,150-178,203-213` |
**Description**
The new operational Audit Log filter bar binds two `<input type="datetime-local">` controls
straight to `AuditQueryModel.CustomFromUtc` / `CustomToUtc` (`DateTime?`), and `ToFilter`
emits those values as `AuditLogQueryFilter.FromUtc` / `ToUtc` without converting from
the browser's local time zone. A `datetime-local` input yields the user's *browser-local*
wall-clock value, so for any non-UTC user the audit query window is shifted by their UTC
offset — returning the wrong rows from the central `AuditLog` table and producing a
mis-shifted CSV export through `AuditLogPage.BuildExportUrl`, which round-trips the
filter's `FromUtc`/`ToUtc` straight into `?from=`/`?to=` query params. This is the same
defect CentralUI-008 fixed for the legacy `Components/Pages/Monitoring/AuditLog.razor`
via the `BrowserTime.LocalInputToUtc(value, _browserUtcOffsetMinutes)` helper — but the
new Audit Log v2 filter bar does not use that helper, so a Bundle B/C/D/E/F regression
re-introduced the bug for the page-replacement target. The CLAUDE.md "all timestamps are
UTC throughout" decision is satisfied at the wire level but violated at the input
boundary, exactly as the original finding called out.
**Recommendation**
Fetch the browser offset once via JS interop (mirroring `ConfigurationAuditLog.OnAfterRenderAsync`
and `AuditLog.razor`'s implementation), pipe both `CustomFromUtc` and `CustomToUtc` through
`BrowserTime.LocalInputToUtc(value, offsetMinutes)` inside `AuditQueryModel.ToFilter`
(or in the filter-bar Apply path before calling `ToFilter`), and add a regression test
that pins the non-UTC behaviour (mirroring `BrowserTimeTests.LocalInputToUtc_NonUtcBrowser_DoesNotEqualNaiveRelabelling`).
The label "Custom From / To" should also be clarified ("UTC" vs "local") in the UI.
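
For reference, the conversion itself is small. The sketch below is not the real `BrowserTime.LocalInputToUtc` implementation (which this review does not show); it assumes the offset follows JavaScript's `Date.getTimezoneOffset()` convention, where the value is UTC minus local time in minutes, so a UTC-4 browser reports `240`:

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the BrowserTime helper; behaviour is an assumption.
public static class BrowserTimeSketch
{
    // offsetMinutes follows Date.getTimezoneOffset(): UTC minus local, in minutes.
    public static DateTime? LocalInputToUtc(DateTime? localInput, int offsetMinutes)
    {
        if (localInput is null) return null;
        // Shift the wall-clock value first, then label it as UTC.
        DateTime shifted = localInput.Value.AddMinutes(offsetMinutes);
        return DateTime.SpecifyKind(shifted, DateTimeKind.Utc);
    }
}
```

The defect in this finding amounts to skipping the `AddMinutes` step and applying only the `SpecifyKind` relabelling.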
### CentralUI-027 — Same UTC misinterpretation in `SiteCallsReport`, `NotificationReport`, and `EventLogs`
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor:74-80`; `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor.cs:421-425`; `src/ScadaLink.CentralUI/Components/Pages/Notifications/NotificationReport.razor:75-81,639-640`; `src/ScadaLink.CentralUI/Components/Pages/Monitoring/EventLogs.razor:62-73,261-262` |
**Description**
The same `datetime-local`-treated-as-UTC bug from CentralUI-008 and CentralUI-026 is
present on three other pages:
- `SiteCallsReport.ToUtc` stamps `DateTimeKind.Utc` on the local-input value
(`DateTime.SpecifyKind(value.Value, DateTimeKind.Utc)`).
- `NotificationReport.ToUtc` does the same — `new DateTimeOffset(DateTime.SpecifyKind(local.Value, DateTimeKind.Utc))`.
- `EventLogs.FetchPage` emits `new DateTimeOffset(_filterFrom.Value, TimeSpan.Zero)`,
which labels the browser-local wall-clock value as UTC (the exact pre-fix shape of
CentralUI-008).
For any non-UTC operator, every Site-Calls / Notification / Event-Log query is silently
shifted by their UTC offset. The bug has been recreated on every page added after
CentralUI-008 landed: the `BrowserTime` helper exists but is only used by the legacy
Audit Log page and `ConfigurationAuditLog`.
**Recommendation**
Plumb the browser offset into each of these pages, fetched as
`ConfigurationAuditLog`/`AuditLog.razor` do but preferably through a dedicated JS module
rather than the `eval` interop flagged in CentralUI-029, and route every
local-input value through `BrowserTime.LocalInputToUtc(value, offsetMinutes)` before
constructing the wire filter. Add regression tests pinning the non-UTC behaviour for
at least one representative page so the helper's continued use is enforced.
### CentralUI-028 — `NotificationReport` and `SiteCallsReport` bypass `SiteScopeService` — Deployment role site-scoping defeated on the two new central-mirror pages
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Notifications/NotificationReport.razor:2,434,472,502`; `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor:2,52-59`; `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor.cs:97-110,201,250-251,278-279` |
**Description**
Both pages are `[Authorize(Policy = RequireDeployment)]` and, per CLAUDE.md "Security &
Auth", the Deployment role must be site-scoped. CentralUI-002 fixed this for every
Deployment/Monitoring page that existed at the time by introducing `SiteScopeService`
and threading `FilterSitesAsync` / `IsSiteAllowedAsync` through the site dropdowns and
mutating calls. The two new central-mirror pages — Notification Report (Notification
Outbox queryable list) and Site Calls Report (Site Call Audit queryable list) — do NOT
inject `SiteScopeService`, do NOT filter their Source-Site `<select>` lists (they
enumerate `await SiteRepository.GetAllSitesAsync()` straight to the dropdown), do NOT
narrow the query results by permitted site, and do NOT re-check the user's grant
before relaying Retry/Discard to the owning site. `NotificationReport.RetryNotificationAsync`,
`NotificationReport.DiscardNotificationAsync`, `SiteCallsReport.RetrySiteCallAsync`,
and `SiteCallsReport.DiscardSiteCallAsync` all dispatch with the row's `SourceSiteId` /
`SourceSite` unchecked. A scoped Deployment user can therefore (a) browse every row in
the central `Notifications` / `SiteCalls` tables, including those for sites outside their
grant, (b) submit Retry/Discard URLs hand-crafted from the row metadata, and (c) have the
site relay complete successfully, because the CommunicationService only sees the
row's source-site identifier, not the user's grant. This is a direct regression of the
CentralUI-002 contract on the two pages that landed after CentralUI-002 was closed.
**Recommendation**
Inject `SiteScopeService` into both pages; filter the source-site dropdown through
`FilterSitesAsync`; default the filter to the permitted-site set so a scoped user sees
only their own rows (or push the predicate into the central query — preferred, so the
filter cannot be bypassed by URL manipulation); and re-check `IsSiteAllowedAsync` in
`RetryNotificationAsync`/`DiscardNotificationAsync`/`RetrySiteCallAsync`/`DiscardSiteCallAsync`
before the CommunicationService call, surfacing a "not permitted for this site" toast
on failure (mirroring `ParkedMessages.razor`'s `SelectedSiteIsPermitted` guard).
Add `Site_ScopedDeploymentUser_OnlySeesPermittedRows` and
`Site_ScopedDeploymentUser_CannotRetryRowOnNonPermittedSite` regression tests modelled
on `TopologyPageTests.SiteScoping_*`.
### CentralUI-029 — `ConfigurationAuditLog` uses `JS.InvokeAsync<int>("eval", ...)` instead of a dedicated JS module
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Audit/ConfigurationAuditLog.razor:248-263` |
**Description**
`OnAfterRenderAsync` fetches the browser's UTC offset with
`JS.InvokeAsync<int>("eval", "new Date().getTimezoneOffset()")`. Calling `eval` over
JS interop is a code-smell: it widens the JS-interop attack surface (any future
attacker who can influence the second argument runs arbitrary JS), it is brittle
under stricter Content-Security-Policy headers (CSP `script-src` directives commonly
forbid `unsafe-eval`), and it bypasses the existing module-import pattern the rest
of the module follows (`session-expiry.js`, `audit-grid.js`, `nav-state.js`,
`transport.js` are all loaded as `IJSObjectReference` modules). The legacy
`AuditLog.razor` (CentralUI-008 fix) and the planned helper exist precisely to avoid
this. Today the eval text is a static string so there is no live bug; the issue is
that the pattern invites a future caller to compose the argument from page state.
**Recommendation**
Move the offset lookup into a small wwwroot JS module (e.g.
`wwwroot/js/browser-time.js` exporting `getTimezoneOffsetMinutes()`) and `import` it
via `IJSObjectReference` like the other helpers. Replace the `eval` call with
`module.InvokeAsync<int>("getTimezoneOffsetMinutes")`. The fix is local and removes
a residual eval surface; the same module can host the rest of the `BrowserTime`
plumbing CentralUI-027 will need.
### CentralUI-030 — `SandboxConsoleCapture`'s per-call `StringWriter` is not thread-safe under intra-script concurrency
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/ScriptAnalysis/SandboxConsoleCapture.cs:31-118`; `src/ScadaLink.CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:401-404` |
**Description**
CentralUI-003 correctly routed console capture through an `AsyncLocal<StringWriter?>`
so concurrent Test Runs cannot cross-contaminate. `BeginCapture` flows the capture
buffer through the call-tree, and `Target` reads it on every `Write`. But a single
script execution can still write to its captured `StringWriter` from multiple threads
within one call-tree: the script trust model allows `System.Threading.Tasks`, so a
user script can `await Task.WhenAll(t1, t2, t3)` where each task is `Task.Run(() => Console.WriteLine(...))`,
and `_current.Value` flows into each `Task.Run`. The capture buffer is a plain
`StringWriter` (`captured = new StringWriter()` in `RunInSandboxAsync`), which is
**not** thread-safe: concurrent `WriteLine` calls can throw or interleave at the
character level. The Akka/gRPC-thread race fixed by CentralUI-003 is gone, but the
intra-script-concurrency race is a residual hazard for any script that exercises
parallel tasks (a realistic shape for a Test Run that calls multiple `External.Call`s
concurrently). Severity is Low because the symptom is a corrupted ConsoleOutput
string, not a security/data-loss issue, and the script must opt into Task-based
concurrency to trigger it.
**Recommendation**
Wrap the capture buffer with `TextWriter.Synchronized(new StringWriter())` (the
BCL's purpose-built thread-safe wrapper), or hold a lock inside `SandboxConsoleCapture.Write*`
on the current scope's `StringWriter`. Add a focused test that runs `await Task.WhenAll(...)`
with `Console.WriteLine` in each task and asserts the resulting `ConsoleOutput` has
the expected line count regardless of thread interleaving.
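
The first option can be sketched as follows. `TextWriter.Synchronized` is the real BCL API; the surrounding `SynchronizedCaptureDemo` class is a hypothetical stand-in for the sandbox capture path and for the focused test suggested above:

```csharp
#nullable enable
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical demo: models several script tasks writing to one capture buffer.
public static class SynchronizedCaptureDemo
{
    // Writes from several tasks at once and returns how many lines were captured.
    public static async Task<int> CountCapturedLinesAsync(int tasks, int linesPerTask)
    {
        using var buffer = new StringWriter();
        // The wrapper serialises every Write/WriteLine call on the inner writer.
        TextWriter captured = TextWriter.Synchronized(buffer);

        var work = Enumerable.Range(0, tasks).Select(t => Task.Run(() =>
        {
            for (int i = 0; i < linesPerTask; i++)
                captured.WriteLine($"task {t} line {i}");
        }));
        await Task.WhenAll(work);

        captured.Flush();
        return buffer.ToString()
            .Split('\n', StringSplitOptions.RemoveEmptyEntries)
            .Length;
    }
}
```

With the synchronized wrapper the captured line count should equal `tasks * linesPerTask`; swapping in a bare `StringWriter` makes the same run prone to lost or corrupted lines.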
### CentralUI-031 — `TransportImport` buffers the full bundle bytes in component state
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Design/TransportImport.razor.cs:72,104-142,160-161` |
**Description**
`OnFileSelectedAsync` reads the uploaded `.scadabundle` into a `MemoryStream`,
calls `ms.ToArray()`, and stores the byte array on the component as
`private byte[]? _bundleBytes`. The bytes live on the Blazor circuit for the
lifetime of the wizard — through the passphrase step, the diff step (which can
take an arbitrary amount of operator time on a large bundle), the confirm step,
and the apply step — and are only cleared in `ResetSessionState` (Done /
re-upload). For an operator who walks away from the diff step mid-review, up to
`MaxBundleSizeMb` worth of bytes (the limit is not enforced here beyond the
file-size check on read) stays pinned on the central node's heap per
open circuit. The page has no `IDisposable` to clear the bytes on tear-down
either. Severity is Low because the cap is checked at upload time and Import
is Admin-only (limited concurrent users), but the lifetime is longer than the
strictly-needed retention.
**Recommendation**
Stream the bundle to a temp file (or to the `IBundleImporter`'s session store)
rather than caching it on the component. Failing that, implement `IDisposable`
on `TransportImport` and clear `_bundleBytes` (`Array.Clear` for sensitivity)
on dispose; also clear the cached passphrase string. Tighten `MaxBundleSizeMb`
docs to call out the in-memory cost per concurrent import session.
### CentralUI-032 — `AuditResultsGrid` paging is forward-only, no Previous button
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Audit/AuditResultsGrid.razor:76-82`; `src/ScadaLink.CentralUI/Components/Audit/AuditResultsGrid.razor.cs:65,196-197,219-220` |
**Description**
The Audit Log results grid (Bundle B / M7-T3) renders a single "Next page" button
and a `Page N · M rows` label, with no Previous control. The design doc says
"Keyset pagination ordered by `(OccurredAtUtc desc, EventId desc)`. Default page
size 100." — keyset paging is naturally forward-only, but a usable audit-triage
workflow needs to step back to the previous page (the `SiteCallsReport` keyset
implementation correctly maintains a `Stack<(...)> _cursorStack` for exactly this).
An operator who clicks Next once and misses a row on the first page cannot return
without re-applying the filter to start a fresh first page. The current shape
also makes the "Page N" label slightly misleading — there is no in-grid affordance
to use it as a navigation target.
**Recommendation**
Mirror the `SiteCallsReport.razor.cs` keyset-paging shape: maintain a
`Stack<(DateTime?, Guid?)> _cursorStack` of previous-page cursors, add a Previous
button gated on `_cursorStack.Count > 0`, push the current cursor on Next and pop
on Previous. Either implement this or update the design doc to acknowledge
forward-only paging on the Audit Log grid.
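
A minimal sketch of the cursor-stack shape described above, assuming the cursor is the `(OccurredAtUtc, EventId)` pair of the last row on a page. `KeysetPager` and its member names are hypothetical and are not taken from `SiteCallsReport.razor.cs`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating Previous/Next over keyset cursors.
public sealed class KeysetPager
{
    // Cursors of the pages behind the one currently on screen.
    private readonly Stack<(DateTime? OccurredAtUtc, Guid? EventId)> _cursorStack = new();

    // Cursor that produced the current page; (null, null) means first page.
    public (DateTime? OccurredAtUtc, Guid? EventId) Current { get; private set; }

    // Gate for the Previous button.
    public bool CanGoPrevious => _cursorStack.Count > 0;

    public int PageNumber => _cursorStack.Count + 1;

    // lastRowOnPage is the keyset of the final row on the current page.
    public void Next((DateTime? OccurredAtUtc, Guid? EventId) lastRowOnPage)
    {
        _cursorStack.Push(Current);
        Current = lastRowOnPage;
    }

    public void Previous()
    {
        if (!CanGoPrevious) return;
        Current = _cursorStack.Pop();
    }

    // Call when the filter changes so paging restarts from the first page.
    public void Reset()
    {
        _cursorStack.Clear();
        Current = default;
    }
}
```

After each `Next` or `Previous` call the grid would re-query with `Current` as the keyset cursor; the query itself is out of scope for this sketch.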
### CentralUI-033 — Drill-in / query-string code paths for the new Transport + SiteCalls pages are untested
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Design/TransportImport.razor.cs:97-238,267-319`; `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor.cs:107-148`; `tests/ScadaLink.CentralUI.Tests/Pages/Design/TransportImportPageTests.cs`; `tests/ScadaLink.CentralUI.Tests/Pages/SiteCallsReportPageTests.cs` |
**Description**
The CentralUI-025 lesson — "a critical drill-in/redirect path was untested, so the
CentralUI-020 defect was not caught" — applies again to the two newest pages.
`SiteCallsReport.ApplyQueryStringFilters` parses `?status=` and `?stuck=true` to
seed the filters from a Health-dashboard KPI tile drill-in; there is no test that
pins this seeding (an unrecognised status, a missing param, the case-insensitive
match). `TransportImport` has a 5-step state machine and a 3-strike passphrase
lockout, both with intricate transition logic
(`GoFromUploadAsync` re-trying `LoadAsync`, the `_failedUnlockAttempts` reset on
success, the audit-row write on failure) — none of the step-machine transition
paths or the lockout reset / lockout-trip behaviours are pinned by tests. The
existing `TransportImportPageTests` exercise rendering shapes, not the lifecycle.
**Recommendation**
Add bUnit tests for `SiteCallsReport.ApplyQueryStringFilters` covering valid /
invalid / case-mismatched `?status=` values and the `?stuck=true` toggle, and
add `TransportImport` lifecycle tests covering: an encrypted-bundle upload
advances to Step 2 without opening a session; a wrong passphrase increments the
counter and writes the `BundleImportUnlockFailed` audit row; the lockout resets
the wizard to Step 1 once `MaxUnlockAttemptsPerSession` is reached; a successful
unlock resets the counter and advances to Step 3.
+233 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.ClusterInfrastructure` |
| Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 4 |
## Summary
@@ -45,6 +45,43 @@ part of the configuration contract but is never consumed — `ScadaLink.Host`'s
does not enforce the design doc's requirement that `down-if-alone` be `on` for the
keep-oldest resolver, so `DownIfAlone = false` is silently accepted (CI-010, Low).
#### Re-review 2026-05-28 (commit `1eb6e97`)
The only change to this module between `39d737e` and `1eb6e97` is the
documentation-only commit `1eb6e97` itself, which added a handful of `<param>`
XML doc tags to `ClusterOptionsValidator.Validate` and to
`AddClusterInfrastructureActors` — no source-of-truth changes. Walked all three
source files and all three test files against the full 10-category checklist
again. Found **four new issues**, all Low severity, that the prior re-review
either did not surface or that have aged into the file:
- **CI-011 (Low, Code organization)** — `ClusterOptions.SectionName` is
documented as "the single source of truth so binding sites do not hard-code
the magic string" (the very justification CI-005's resolution offered), but
`ScadaLink.Host.SiteServiceRegistration.BindSharedOptions:100` and three
references in `ScadaLink.Host.StartupValidator` all hard-code
`"ScadaLink:Cluster"` literals. The constant is decorative — a "single source
of truth" that nothing reads. Same pattern as CI-009 (inert configuration knob).
- **CI-012 (Low, Design-document adherence)** — the validator accepts
`SeedNodes.Count == 1` even though the design doc states "both nodes are seed
nodes" (a properly-configured deployment lists 2). `Host.StartupValidator:45`
already enforces `>= 2`, so this module's own contract validator is the
weaker of the two. Inconsistent enforcement across the two projects that
share ownership of the cluster contract.
- **CI-013 (Low, Documentation & comments)** — `ClusterOptionsTests
.Properties_CanBeSetToCustomValues` deliberately sets
`SplitBrainResolverStrategy = "keep-majority"` and `MinNrOfMembers = 2` — the
exact values the design doc warns are catastrophic. The CI-006 resolution
acknowledged this is intentional (testing the POCO accepts any value; the
validator does the rejecting) but the test has no inline comment saying so,
and a future reader could easily misinterpret it as endorsing those values.
- **CI-014 (Low, Code organization)** — `AddClusterInfrastructureActors` is
dead surface: no caller exists anywhere in the solution (verified via
`grep -rn`), its XML doc instructs callers "do not call", and its body
unconditionally throws. CI-002's resolution chose "fail loudly" over "delete"
but the method now offers nothing — keeping it is API-surface noise that an
IDE will still suggest via auto-complete.
## Checklist coverage
Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
@@ -63,6 +100,21 @@ Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). **Re-review:** CI-006 resolved — 16 tests across three classes covering options, validator, and DI registration. No `DownIfAlone`-wiring test exists, but that wiring lives in the Host (CI-009). No new issue here. |
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). **Re-review:** CI-007/CI-008 resolved — full XML docs on all members; skeleton comments gone. Note: the `DownIfAlone` XML doc calls `true` "the design-doc requirement" yet the value is inert (CI-009) and unenforced (CI-010). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Validator logic and DI registration are correct. No new defects. |
| 2 | Akka.NET conventions | ✓ | No actors in this module (legitimate, per CI-001 resolution). Nothing actor-shaped to evaluate. |
| 3 | Concurrency & thread safety | ✓ | Validator and DI extensions remain stateless. No issues. |
| 4 | Error handling & resilience | ✓ | Validator now rejects every catastrophic value the design doc enumerates. New — it accepts `SeedNodes.Count == 1` even though the design doc requires both nodes as seeds, and `Host.StartupValidator` enforces `>= 2`, so the module's own validator is the weaker check (CI-012). |
| 5 | Security | ✓ | No authn/authz surface, no secret handling, no remoting transport configured here. No issues. |
| 6 | Performance & resource management | ✓ | No resources held; validator allocates a small failure list per call only. No issues. |
| 7 | Design-document adherence | ✓ | `ClusterOptions` contract complete and validated. New — validator's seed-node count check is weaker than the design (CI-012). |
| 8 | Code organization & conventions | ✓ | Options/validator placement and Options pattern correct. New — `SectionName` constant documented as "single source of truth" but never read by any binding site (CI-011); `AddClusterInfrastructureActors` is dead surface that no caller invokes (CI-014). |
| 9 | Testing coverage | ✓ | 16 tests across three classes. New — `ClusterOptionsTests.Properties_CanBeSetToCustomValues` sets the exact catastrophic values the design doc forbids without an inline comment explaining why (CI-013). |
| 10 | Documentation & comments | ✓ | XML docs accurate across all source files (commit `1eb6e97` filled in the remaining `<param>` tags). New — CI-013 (test lacks intent comment); CI-011 (XML doc for `SectionName` claims a property the code does not deliver). |
## Findings
### ClusterInfrastructure-001 — Module implements none of its documented responsibilities
@@ -628,3 +680,181 @@ message explaining the isolated-single-node-cluster hazard, consistent with how
validator already rejects quorum split-brain strategies. Developed test-first:
`ClusterOptionsValidatorTests.DownIfAloneFalse_FailsValidation` was written first,
confirmed failing, then passing after the fix. Module test suite green (18 passed).
### ClusterInfrastructure-011 — `SectionName` constant is decorative — no binding site references it
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:24-27`, `src/ScadaLink.Host/SiteServiceRegistration.cs:100`, `src/ScadaLink.Host/StartupValidator.cs:43`, `src/ScadaLink.Host/StartupValidator.cs:45`, `src/ScadaLink.Host/StartupValidator.cs:75` |
**Description**
`ClusterOptions.SectionName` was added by CI-005 as `public const string SectionName =
"ScadaLink:Cluster";`, with an XML doc declaring it "the single source of truth so
binding sites do not hard-code the magic string". CI-005's resolution likewise framed
the constant as the canonical reference value. In practice, **no caller in the
solution reads it**. `grep -rn "ClusterOptions.SectionName" src/` returns zero hits.
Every site that needs the section name hard-codes the literal:
- `ScadaLink.Host.SiteServiceRegistration.BindSharedOptions:100` —
`services.Configure<ClusterOptions>(config.GetSection("ScadaLink:Cluster"));`
- `ScadaLink.Host.StartupValidator:43,45,75` — three `"ScadaLink:Cluster"` /
`"ScadaLink:Cluster:SeedNodes"` literals.
The `SectionName_IsTheExpectedAppSettingsSection` test pins the constant's value but
does not protect against the underlying drift hazard: if someone changes
`SectionName` to `"ScadaLink:Akka:Cluster"` and updates the test's literal to match,
the test passes (it only compares the constant against a literal), the validator
still registers, and binding
silently goes to whichever string the Host hard-codes. The constant currently
provides none of the safety its XML doc claims. This is the same pattern of "inert
configuration knob" CI-009 flagged for `DownIfAlone`, just with the harm being
configuration drift rather than runtime behaviour.
**Recommendation**
Either (a) replace the hard-coded `"ScadaLink:Cluster"` literals in
`SiteServiceRegistration.cs:100` and `StartupValidator.cs:43,45,75` with
`ClusterOptions.SectionName` (a small Host-module change, to be tracked there), or
(b) if the constant is intentionally decorative, soften the XML doc so it does not
claim to be the source of truth. Do not leave a public constant whose stated
guarantee the code does not deliver.
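
Option (a) might be sketched as follows. The `ClusterOptions` class here is reduced to the one constant quoted in the Description, and `ClusterConfigKeys` is a hypothetical Host-side helper introduced only for illustration; the real binding and validation call sites are not reproduced.

```csharp
// Reduced stand-in for ClusterOptions: only the constant from the finding.
public sealed class ClusterOptions
{
    public const string SectionName = "ScadaLink:Cluster";
}

// Hypothetical Host-side key holder. Every key is derived from the
// constant, so renaming the section changes all readers together.
public static class ClusterConfigKeys
{
    public const string Section = ClusterOptions.SectionName;
    public const string SeedNodes = ClusterOptions.SectionName + ":SeedNodes";
}
```

Because both keys are compile-time constants built from `SectionName`, a test that compares what the Host reads against `ClusterOptions.SectionName` would fail if a literal were reintroduced.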
**Resolution**
_Open — needs a one-line Host-side change to reference the constant, plus a test
that proves the section name flows from this module to the Host._
### ClusterInfrastructure-012 — Validator accepts `SeedNodes.Count == 1` despite design requiring both nodes as seeds
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs:30-33` |
**Description**
`Component-ClusterInfrastructure.md` (Node Configuration) states:
> Cluster seed nodes: **Both nodes** are seed nodes — each node lists both itself and
> its partner. Either node can start first and form the cluster; the other joins when
> it starts. No startup ordering dependency.
A correctly-configured ScadaLink deployment therefore lists **two** seed nodes.
`ClusterOptionsValidator.Validate` only rejects a `SeedNodes` list that is null or
empty (`Count == 0`). A configuration with a single seed node passes validation
silently, which defeats the "no startup ordering dependency" guarantee the
design doc explicitly calls out.
`ScadaLink.Host.StartupValidator:43-46` does enforce the rule:
```csharp
var seedNodes = configuration.GetSection("ScadaLink:Cluster:SeedNodes").Get<List<string>>();
if (seedNodes is null || seedNodes.Count < 2)
errors.Add("ScadaLink:Cluster:SeedNodes must have at least 2 entries");
```
So the rule is enforced, but by the **other** project, after the
`ClusterOptionsValidator` (the contract owner) has already accepted the value. This is
inconsistent (two validators with different rules for the same field), and the
weaker check belongs to the contract owner. The pre-existing test
`ServiceCollectionExtensionsTests.AddClusterInfrastructure_ValidatorRejectsBadOptionsAtResolution`
even constructs a `SeedNodes` list with one entry and expects validation to succeed
on that count — locking in the gap.
**Recommendation**
Tighten the validator: require `SeedNodes.Count >= 2` with a message that references
the "both nodes are seed nodes" design rule. Update
`AddClusterInfrastructure_ValidatorRejectsBadOptionsAtResolution` to use a two-entry
list, and add a test case for `SeedNodes.Count == 1` failing validation. Once this
module's validator enforces the rule, `Host.StartupValidator`'s duplicate check
becomes redundant and can be removed in the Host's review.
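
The tightened rule could be expressed along these lines. This is a self-contained sketch: `SeedNodeRule` is a hypothetical helper that returns failure messages as plain strings, and it makes no claim about the real `ClusterOptionsValidator` signature or its options-validation plumbing.

```csharp
using System.Collections.Generic;

// Hypothetical, dependency-free version of the seed-node check.
public static class SeedNodeRule
{
    // The design doc says both nodes are seed nodes, so two is the floor.
    public const int RequiredSeedNodes = 2;

    // Returns failure messages; an empty list means the value is valid.
    public static IReadOnlyList<string> Validate(IReadOnlyList<string> seedNodes)
    {
        var failures = new List<string>();

        if (seedNodes is null || seedNodes.Count < RequiredSeedNodes)
        {
            failures.Add(
                "SeedNodes must list at least 2 entries: the design requires " +
                "both nodes to be seed nodes.");
        }

        return failures;
    }
}
```

A one-entry list now fails the same way an empty list does, which is the behaviour the recommended `SeedNodes.Count == 1` test case would pin.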
**Resolution**
_Open._
### ClusterInfrastructure-013 — Test uses catastrophic config values without an inline-intent comment
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsTests.cs:47-67` |
**Description**
`ClusterOptionsTests.Properties_CanBeSetToCustomValues` deliberately sets two values
the design doc explicitly warns are catastrophic:
```csharp
SplitBrainResolverStrategy = "keep-majority", // design doc: total cluster shutdown on partition
...
MinNrOfMembers = 2 // design doc: blocks singleton, halts data collection
```
The CI-006 resolution acknowledged this is intentional — the test exercises the POCO
property setter (which by design accepts any string/int because the validator does
the rejecting), and `ClusterOptionsValidatorTests.UnsupportedSplitBrainStrategy_FailsValidation`
+ `MinNrOfMembers_NotOne_FailsValidation` prove the validator rejects them. But this
reasoning is recorded **only** in the CI-006 resolution text in this findings file,
not in the test itself. A reader landing on the test cold has no signal that these
values are forbidden in production; they could reasonably infer the test endorses
them.
**Recommendation**
Add a brief XML-doc / inline comment to `Properties_CanBeSetToCustomValues` stating
that it exercises only the POCO's setter — these values intentionally do **not**
represent a valid runtime configuration, and `ClusterOptionsValidator` rejects them
(with a cross-reference to the relevant validator tests). Two lines is enough; the
goal is to make the test's intent self-documenting.
**Resolution**
_Open._
### ClusterInfrastructure-014 — `AddClusterInfrastructureActors` is dead surface — no caller, no behaviour
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:42-48` |
**Description**
`AddClusterInfrastructureActors` has now reached a curious state: it is a public
extension method with an XML doc that ends "Do not call AddClusterInfrastructureActors()"
and a body that unconditionally throws `NotImplementedException`. CI-002's resolution
chose "throw loudly" over "delete" specifically because CI-001 was still resolving the
ownership-split question. That question is settled — the design doc, the README
component table, and `Component-ClusterInfrastructure.md`'s "Implementation Note — Code
Placement" all permanently locate the Akka actor bootstrap in `ScadaLink.Host`.
A `grep -rn "AddClusterInfrastructureActors" src/ tests/` confirms there is no caller
anywhere in the solution. The method's only consumer is its own test
(`AddClusterInfrastructureActors_ThrowsRatherThanSilentlySucceeding`), which asserts
that the method throws when called. Keeping it costs API surface (IDE auto-complete
suggests it, the docs render it, and a future contributor might re-introduce a call
expecting it to register something), and gives nothing in return.
**Recommendation**
Delete `AddClusterInfrastructureActors`, delete its test, and add a one-line note to
`docs/requirements/Component-ClusterInfrastructure.md`'s code-placement section
explicitly stating that this project exposes no actor-registration extension
(actor wiring lives in `ScadaLink.Host`). If the user prefers to keep the
"fail-fast" trap, mark the method `[Obsolete("<message>", error: true)]` so the
compiler, not the runtime, rejects the call.
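
If the trap is kept, the compile-time variant might look like the sketch below. `ObsoleteAttribute` takes a message string followed by the `error` flag. The parameterless signature, the containing class name, the message text, and the exception type are simplifications assumed for illustration; the real method's signature is not reproduced.

```csharp
using System;

// Hypothetical container; the real method lives in ServiceCollectionExtensions.
public static class ClusterInfrastructureTrap
{
    // error: true turns every call site into compiler error CS0619, so a
    // stray call is rejected at build time instead of at startup.
    [Obsolete("Actor wiring lives in ScadaLink.Host; this method registers nothing.", error: true)]
    public static void AddClusterInfrastructureActors()
    {
        // Still throws for callers that bypass the compiler (reflection).
        throw new NotSupportedException("Actor wiring lives in ScadaLink.Host.");
    }
}
```

With this in place the existing throw-asserting test could no longer call the method directly and would need to be removed or rewritten, which is one more argument for plain deletion.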
**Resolution**
_Open._
+466 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Commons` |
| Design doc | `docs/requirements/Component-Commons.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 9 |
## Summary
@@ -46,6 +46,42 @@ indexer that rejects `long` indices (Commons-013) and an `OpcUaEndpointConfigSer
legacy-fallback path that can mislabel a corrupt new-shape row as `Legacy` (Commons-014).
No Critical, High, or Medium issues were found.
#### Re-review 2026-05-28 (commit `1eb6e97`)
Commons has grown substantially since `39d737e` — 132 changed files (≈ +4 600 lines), driven
by the Audit Log (#23), Site Call Audit (#22), and Transport (#24) work. The new surface
area covers six new entity domain folders (Audit, Transport types under `Types/Transport`),
seven new service interfaces (`IPartitionMaintenance`, `INodeIdentityProvider`,
`ISiteAuditQueue`, `ICachedCallLifecycleObserver`, `ICachedCallTelemetryForwarder`,
`IOperationTrackingStore`, `IBundleExporter` / `IBundleImporter` / `IBundleSessionStore` /
`IAuditCorrelationContext`), a new `IAuditLogRepository`, and three new message folders
(`Messages/Audit/`, `Messages/Integration/` extensions, `Messages/Management/TransportCommands`).
The `SourceNode` thread-through and `ExecutionId` / `ParentExecutionId` additive-evolution
fields are uniformly applied across `AuditEvent`, `SiteCall`, `Notification`,
`NotificationSubmit`, `RouteToCallRequest`, `ScriptCallRequest`, and `SiteHealthReport`,
all as trailing optional parameters, consistent with REQ-COM-5a.
All fourteen prior findings (Commons-001 through Commons-014) remain `Resolved`. Nine new
findings were recorded this pass: one Medium on the lack of UTC-kind enforcement for the
new `DateTime`-typed `*Utc` columns (Commons-019), one Medium on unconstrained
`EncryptionMetadata` (Commons-015), one Medium on the now-substantially-stale design doc
(Commons-017), and six Low findings covering minor convention drift, missing unit tests
for the Transport types, an unresolvable `<see cref>` in `IAuditCorrelationContext`, a
benign lazy-parse race in `ExternalCallResult.Response`, undocumented JSON-blob shapes,
two interfaces parked in the wrong folder, and a magic-number threshold in `BundleSession`.
No Critical or High issues were found.
The architectural-constraint tests still enforce the no-Akka/no-EF/no-ASP.NET rule, the
POCO-entity and message-as-record conventions, and the `ToLocalTime` ban; they do not yet
cover the new `*Utc`-suffixed `DateTime` properties on `AuditEvent` / `SiteCall`. Test
coverage for the new types is uneven — `TrackedOperationId`, `SiteCallOperational`,
`CachedCallTelemetry`, `SiteCallQueries`, `AuditQueryParamParsers`, `ApiKeyHasher`,
`Notification`, and `SiteCall` are all directly tested; the Transport types
(`BundleManifest`, `EncryptionMetadata`, `BundleSession`, `BundleSummary`, `ExportSelection`,
`ImportPreview`, `ImportResolution`, `ImportResult`, `ManifestContentEntry`) have only
integration-level coverage in `tests/ScadaLink.Transport.IntegrationTests/`, with no
shape/serialization tests in `ScadaLink.Commons.Tests`.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -61,6 +97,21 @@ No Critical, High, or Medium issues were found.
| 9 | Testing coverage | ✓ | `ValueFormatter`, `DynamicJsonElement`, `ScriptArgs`, `ManagementCommandRegistry`, `Result<T>`, `ConfigurationDiff`, `AlarmContext`, and the OPC UA serializer round-trip have no tests (Commons-010). |
| 10 | Documentation & comments | ✓ | `OpcUaEndpointConfigSerializer.Deserialize` XML doc does not mention the silent data-loss path (Commons-005). `Component-Commons.md` is stale relative to the actual file set (Commons-009). `ValueFormatter` uses current-culture formatting without documenting it (Commons-012). |
## Checklist coverage — Re-review 2026-05-28 (commit `1eb6e97`)
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `EncryptionMetadata` accepts any algorithm string + any iteration count with no validation (Commons-015). New `*Utc`-suffixed `DateTime` columns on `AuditEvent`/`SiteCall` have no `DateTimeKind.Utc` enforcement and are inconsistent with `Notification`'s `DateTimeOffset` (Commons-019). |
| 2 | Akka.NET conventions | ✓ | Commons has no actors. All new message contracts (`Messages/Audit`, `Messages/Integration` extensions, `RouteToCallRequest`, `ScriptCallRequest`) are records with trailing optional members per REQ-COM-5a. Correlation IDs present on request/response messages. |
| 3 | Concurrency & thread safety | ✓ | `IAuditCorrelationContext` documents its scoped/sequential thread-safety contract explicitly (good). `ExternalCallResult.Response` has a benign lazy-parse race — two concurrent reads can both parse and produce distinct wrappers (Commons-021). |
| 4 | Error handling & resilience | ✓ | The new ingest/upsert command + reply pairs (`UpsertSiteCallReply`, `IngestAuditEventsReply`, `IngestCachedTelemetryReply`) carry idempotency-friendly accepted-id lists and an `Accepted` flag that explicitly does NOT propagate audit-write failure to the user-facing action (alog.md §13). |
| 5 | Security | ✓ | `ApiKeyHasher` correctly fails fast on a missing or weak pepper (at least 16 chars required), uses HMAC-SHA256, never accepts a null plaintext, and provides a clearly-labelled `Default` for tests only. `ApiKey.FromHash` is the production constructor; the plaintext constructor only ever uses the unpeppered `Default` and is documented as such. No script-trust violations in any new file. |
| 6 | Performance & resource management | ✓ | `IBundleSessionStore.EvictExpired` exists for sessions (good). `BundleSession` carries `DecryptedContent` plus `Manifest` per session; the size is bounded by the configured bundle cap, but there is no explicit per-session size accounting. `ExternalCallResult.Response` lazy parse is not thread-safe (Commons-021). |
| 7 | Design-document adherence | ✓ | `Component-Commons.md` is now significantly stale relative to the actual file set: stale enum values for `AuditKind`/`AuditStatus`, missing `AuditEvent`/`SiteCall` entities, missing `IAuditLogRepository`, missing six service interfaces and `Interfaces/Transport/`, missing four `Types/*` folders and `Messages/Audit/` (Commons-017). |
| 8 | Code organization & conventions | ✓ | `IOperationTrackingStore` and `IPartitionMaintenance` live at the root of `Interfaces/` rather than under `Interfaces/Services/` (Commons-018). `BundleSession.Locked` uses a magic `3` rather than a named constant (Commons-016). Message contracts and entities otherwise follow the additive-evolution / POCO / `record` conventions. |
| 9 | Testing coverage | ✓ | Transport types (`BundleManifest`, `EncryptionMetadata`, `BundleSession`, `BundleSummary`, `ExportSelection`, `ImportPreview`, `ImportResolution`, `ImportResult`, `ManifestContentEntry`) have no unit tests in `tests/ScadaLink.Commons.Tests/`; only `tests/ScadaLink.Transport.IntegrationTests/` exercises them (Commons-020). `IngestAuditEventsCommand` / `IngestCachedTelemetryCommand` / `UpsertSiteCallCommand` / `PullAuditEventsRequest` / `PullAuditEventsResponse` / `AuditTelemetryEnvelope` shape tests also absent. |
| 10 | Documentation & comments | ✓ | `IAuditCorrelationContext` references `BundleImporter.ApplyAsync` — an implementation type Commons does not see, so the `<see cref>` is unresolvable (Commons-022b, folded into Commons-022). `ImportPreviewItem.FieldDiffJson` and `Notification.ResolvedTargets` are JSON-string columns with no documented shape contract (Commons-022). |
## Findings
### Commons-001 — `StaleTagMonitor` stale-fire race between timer and `OnValueReceived`
@@ -674,3 +725,415 @@ describe the corrupt-typed-row branch. Regression tests added in
`OpcUaEndpointConfigSerializerTests` (`Deserialize_TypedShapeWithInvalidEnum_ReportsMalformedNotLegacy`,
`Deserialize_TypedShapeWithWrongTypeField_ReportsMalformedNotLegacy`,
`Deserialize_ValidTypedRow_StillReportsTyped`).
### Commons-015 — `EncryptionMetadata` accepts any algorithm string and any iteration count
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/Transport/EncryptionMetadata.cs:3-8` |
**Description**
`EncryptionMetadata` is a positional record that carries the bundle's encryption parameters
over the wire and into the persistence/audit layer:
```csharp
public sealed record EncryptionMetadata(
string Algorithm, // "AES-256-GCM"
string Kdf, // "PBKDF2-SHA256"
int Iterations,
string SaltB64,
string IvB64);
```
The expected values are documented as inline comments only — there is no validation, no
enum, and no constructor invariant. The consequences:
- A bundle manifest that says `Algorithm = "AES-128-CBC"` (or any garbage string) will
deserialize successfully. The mismatch surfaces only when `BundleImporter` tries to
decrypt, where it most likely manifests as a misleading exception (or a silent wrong-key
result, depending on the implementation).
- `Iterations` is unconstrained — `0`, negative, or absurdly large values round-trip. A
zero/negative iteration count weakens the KDF and a billion-iteration count is a DoS
vector against a passphrase-unlock attempt.
- `SaltB64` / `IvB64` are just `string` — there is no length, format, or non-null check.
A null or empty salt/IV silently rides through serialization and surfaces inside the
cipher init.
`EncryptionMetadata` is the integrity contract for the bundle's encryption envelope and
crosses both the file boundary (the on-disk bundle manifest) and the central audit log.
The defense-in-depth principle says malformed values should be rejected at the type
boundary, not at the cipher.
**Recommendation**
Validate in a static factory or constructor: reject unsupported `Algorithm`/`Kdf` (an
enum or a small whitelist of strings), require `Iterations >= 100_000` (or whatever the
documented PBKDF2 minimum is) and `<= 10_000_000`, require non-blank `SaltB64`/`IvB64`,
and Base64-decode them at construction so a malformed encoding fails fast. Document the
accepted values on the record.
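
One way to apply these checks is a validating factory, sketched below. Several details are assumptions: the record is rewritten as non-positional with a private constructor (which changes how a serializer would construct it), the single accepted algorithm and KDF strings come from the inline comments in the Description, and the iteration bounds are the placeholder figures from the Recommendation rather than documented limits.

```csharp
using System;

// Sketch only: a validated variant of the record shown in the Description.
public sealed record EncryptionMetadata
{
    public const string SupportedAlgorithm = "AES-256-GCM";
    public const string SupportedKdf = "PBKDF2-SHA256";
    public const int MinIterations = 100_000;    // assumed floor
    public const int MaxIterations = 10_000_000; // assumed ceiling

    public string Algorithm { get; }
    public string Kdf { get; }
    public int Iterations { get; }
    public string SaltB64 { get; }
    public string IvB64 { get; }

    private EncryptionMetadata(string algorithm, string kdf, int iterations,
        string saltB64, string ivB64)
    {
        Algorithm = algorithm;
        Kdf = kdf;
        Iterations = iterations;
        SaltB64 = saltB64;
        IvB64 = ivB64;
    }

    // Rejects malformed values at the type boundary, before any cipher runs.
    public static EncryptionMetadata Create(string algorithm, string kdf,
        int iterations, string saltB64, string ivB64)
    {
        if (algorithm != SupportedAlgorithm)
            throw new ArgumentException("Unsupported algorithm.", nameof(algorithm));
        if (kdf != SupportedKdf)
            throw new ArgumentException("Unsupported KDF.", nameof(kdf));
        if (iterations < MinIterations || iterations > MaxIterations)
            throw new ArgumentOutOfRangeException(nameof(iterations),
                "Iteration count is outside the accepted range.");

        RequireBase64(saltB64, nameof(saltB64));
        RequireBase64(ivB64, nameof(ivB64));

        return new EncryptionMetadata(algorithm, kdf, iterations, saltB64, ivB64);
    }

    private static void RequireBase64(string value, string paramName)
    {
        if (string.IsNullOrWhiteSpace(value))
            throw new ArgumentException("Value must be non-blank.", paramName);
        try
        {
            // Decode once so a malformed encoding fails here, not in the cipher.
            _ = Convert.FromBase64String(value);
        }
        catch (FormatException ex)
        {
            throw new ArgumentException("Value must be valid Base64.", paramName, ex);
        }
    }
}
```

Keeping the properties get-only means a `with` expression cannot swap in an unvalidated value after construction.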
### Commons-016 — `BundleSession.Locked` uses a magic `3` rather than a named constant
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/Transport/BundleSession.cs:13-16` |
**Description**
`BundleSession` exposes:
```csharp
public int FailedUnlockAttempts { get; set; }
public bool Locked => FailedUnlockAttempts >= 3;
```
The `3` is a magic number with no constant, no XML doc reference, and no symbol to
search for if a future operator wants to change the threshold (or write a test that
deliberately exercises the lockout). The XML comment on `Locked` repeats the literal
("three or more unlock attempts have failed") rather than citing a constant, so a
change to the threshold would have to be made in three places (the comparison, the XML
text, and any caller-side `attempts < 3` checks). The lockout count is also a
security-relevant policy parameter — it deserves a named symbol so a security review
can find it.
**Recommendation**
Promote the threshold to a `public const int MaxUnlockAttempts = 3;` on `BundleSession`
(or to the `IBundleSessionStore`/`BundleImporter` if that is the better home), and rewrite
the `Locked` expression and the XML comment in terms of it. If the threshold is actually
owned by a Transport-component option, document the link.
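
The change amounts to something like the following sketch. `BundleSession` is modelled here as a plain class with only the two members quoted in the Description; its other members and its actual type kind are omitted, and the constant name follows the Recommendation.

```csharp
// Reduced stand-in for BundleSession: only the lockout members are shown.
public sealed class BundleSession
{
    /// <summary>Number of failed unlock attempts at which the session locks.</summary>
    public const int MaxUnlockAttempts = 3;

    public int FailedUnlockAttempts { get; set; }

    /// <summary>
    /// True once <see cref="MaxUnlockAttempts"/> or more unlock attempts have failed.
    /// </summary>
    public bool Locked => FailedUnlockAttempts >= MaxUnlockAttempts;
}
```

Callers and tests can then reference `BundleSession.MaxUnlockAttempts` instead of repeating the literal, so a threshold change touches one line.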
### Commons-017 — `Component-Commons.md` is significantly stale (audit enums, new entities, new repositories, new service interfaces, new folders)
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-Commons.md:41-44`, `:75-79`, `:88-95`, `:107-117`, `:152-232` |
**Description**
The Commons design doc has fallen materially behind the code:
- **REQ-COM-1 audit enums** — the doc's `AuditKind` enum lists
`SyncCall, CachedEnqueued, CachedAttempt, CachedTerminal, SyncWrite, SyncRead, Enqueued,
Attempt, Terminal, Completed`; the actual enum in `Types/Enums/AuditKind.cs` has
*completely different* values: `ApiCall, ApiCallCached, DbWrite, DbWriteCached, NotifySend,
NotifyDeliver, InboundRequest, InboundAuthFailure, CachedSubmit, CachedResolve`.
Likewise `AuditStatus` — doc says `Success, TransientFailure, PermanentFailure, Enqueued,
Retrying, Delivered, Parked, Discarded`; actual values are `Submitted, Forwarded,
Attempted, Delivered, Failed, Parked, Discarded, Skipped`. The doc's enum names cannot
be matched to the code at all.
- **REQ-COM-3 entities** — the Audit bullet still lists only `AuditLogEntry`; the
actual `Entities/Audit/` folder now contains `AuditEvent` and `SiteCall` as well, and
both carry significant additional columns (`SourceNode`, `ExecutionId`,
`ParentExecutionId`) that are core to the M3-M7 work and entirely absent from the doc.
- **REQ-COM-4 repositories**: `IAuditLogRepository` is in the code (with its
`InsertIfNotExistsAsync`, `QueryAsync`, `SwitchOutPartitionAsync`,
`GetPartitionBoundariesOlderThanAsync`, `GetKpiSnapshotAsync`, `GetExecutionTreeAsync`,
`GetDistinctSourceNodesAsync` surface) but missing from the REQ-COM-4 list.
- **REQ-COM-4a services** — the doc lists seven service interfaces. The code adds
`ICachedCallLifecycleObserver`, `ICachedCallTelemetryForwarder`, `INodeIdentityProvider`,
`ISiteAuditQueue`, plus the misplaced `IOperationTrackingStore` and `IPartitionMaintenance`
(see Commons-018), and the `Interfaces/Transport/` folder with four more interfaces
(`IBundleExporter`, `IBundleImporter`, `IBundleSessionStore`, `IAuditCorrelationContext`)
— none of which appear in REQ-COM-4a.
- **REQ-COM-5b folder tree** — missing: `Types/Audit/` (`AuditLogPaging`,
`AuditLogQueryFilter`, `AuditQueryParamParsers`, `ExecutionTreeNode`,
`SiteCallKpiSnapshot`, `SiteCallPaging`, `SiteCallQueryFilter`,
`SiteCallSiteKpiSnapshot`), `Types/Notifications/` (`NotificationKpiSnapshot`,
`NotificationOutboxFilter`, `SiteNotificationKpiSnapshot`), `Types/InboundApi/`
(`ApiKeyHasher`, `ParameterDefinition`), `Types/Transport/` (nine records),
`Messages/Audit/` (seven new message files), `Interfaces/Transport/` (four
interfaces), plus the new `AuditLogKpiSnapshot`, `SiteAuditBacklogSnapshot`,
`SiteCallOperational`, `TrackingStatusSnapshot` directly under `Types/`.
CLAUDE.md's editing rules state design docs and code must travel together. The doc is now
a much less accurate map of the actual file set than it was after the previous
(Commons-009) refresh.
**Recommendation**
Refresh `Component-Commons.md` against the current file set: rewrite the `AuditKind` /
`AuditStatus` enum value lists to match the code, add `AuditEvent` and `SiteCall` to
REQ-COM-3, add `IAuditLogRepository` to REQ-COM-4, expand REQ-COM-4a with the new service
interfaces (and add a sentence on the Transport interfaces in `Interfaces/Transport/`),
and rewrite the REQ-COM-5b folder tree to include the new `Types/*`, `Messages/Audit`,
and `Interfaces/Transport` folders. The same kind of refresh that resolved Commons-009 is
needed again now.
### Commons-018 — `IOperationTrackingStore` and `IPartitionMaintenance` are at the root of `Interfaces/` instead of `Interfaces/Services/`
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.Commons/Interfaces/IOperationTrackingStore.cs`, `src/ScadaLink.Commons/Interfaces/IPartitionMaintenance.cs` |
**Description**
REQ-COM-5b documents the `Interfaces/` folder as having exactly three sub-folders:
`Protocol/` (REQ-COM-2), `Repositories/` (REQ-COM-4), and `Services/` (REQ-COM-4a). Two
new interfaces — `IOperationTrackingStore` and `IPartitionMaintenance` — are filed at
the root of `Interfaces/` (namespace `ScadaLink.Commons.Interfaces`) rather than under
`Interfaces/Services/` (namespace `ScadaLink.Commons.Interfaces.Services`). They are
straightforward cross-cutting service interfaces consumed by the Audit Log component (a
site-local SQLite tracking store; a central partition-maintenance hosted-service helper)
and conceptually belong alongside `ISiteAuditQueue`, `ICachedCallLifecycleObserver`, etc.
The inconsistency is small but it breaks the "every interface lives under a sub-folder"
rule REQ-COM-5b establishes, and it makes the namespace surface inconsistent — every
other recently-added service interface uses `Interfaces.Services`.
**Recommendation**
Move both files into `Interfaces/Services/` and adjust the namespace to
`ScadaLink.Commons.Interfaces.Services`. Update consumers in `ScadaLink.AuditLog`,
`ScadaLink.SiteRuntime`, and `ScadaLink.ConfigurationDatabase`. Add them to the
REQ-COM-4a list (see Commons-017).
### Commons-019 — New `*Utc`-suffixed `DateTime` columns on `AuditEvent` / `SiteCall` are not enforced as UTC; inconsistent with `Notification`'s `DateTimeOffset`
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Commons/Entities/Audit/AuditEvent.cs:15-18`, `src/ScadaLink.Commons/Entities/Audit/SiteCall.cs:59-68`, `tests/ScadaLink.Commons.Tests/Entities/EntityConventionTests.cs:49-69` |
**Description**
CLAUDE.md mandates UTC throughout the system, "DateTime with DateTimeKind.Utc *or*
DateTimeOffset". The pre-existing convention in Commons entities is `DateTimeOffset`,
and the architectural test `AllTimestampProperties_ShouldBeDateTimeOffset` enforces it
on a name-allowlist (`Timestamp`, `DeployedAt`, `CompletedAt`, `GeneratedAt`,
`ReportTimestamp`, `SnapshotTimestamp`). The new audit entities deviate:
- `AuditEvent.OccurredAtUtc` and `IngestedAtUtc` are `DateTime` (nullable on the second).
- `SiteCall.CreatedAtUtc`, `UpdatedAtUtc`, `TerminalAtUtc`, `IngestedAtUtc` are `DateTime`.
The `Notification` entity in the same domain uses `DateTimeOffset` for every timestamp
(`SiteEnqueuedAt`, `CreatedAt`, `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt`). The
architectural test does not catch the `*Utc` columns because those property names are not
on the allowlist. Concretely:
- Nothing prevents a producer from assigning `DateTime.Now` (kind = `Local`) or
`new DateTime(2026,1,1)` (kind = `Unspecified`) to an `OccurredAtUtc` column.
`System.Text.Json` does not normalise such a value to UTC: it writes a `Local` value
with its local offset and an `Unspecified` value with no offset, and reads each back
with that same `Kind`. The `Utc` suffix is convention-only.
- Comparison across the boundary is now ambiguous — the central `AuditLog.OccurredAtUtc`
and the central `Notifications.CreatedAt` are different CLR types, with `DateTimeOffset`
carrying an explicit offset and `DateTime` not.
- The repository query filters (`AuditLogQueryFilter.FromUtc`/`ToUtc`,
`SiteCallQueryFilter.FromUtc`/`ToUtc`) also use bare `DateTime`. A caller building one
from `DateTime.UtcNow.AddHours(-1)` is fine; a caller using `DateTimeOffset.UtcNow.DateTime`
is fine; a caller using `DateTime.Now` is silently wrong.
This is the same defect the architectural test was designed to catch on the
`DateTimeOffset` side — the test just doesn't cover the new column-naming convention.
**Recommendation**
Pick a single rule:
1. Convert the audit entities to `DateTimeOffset` to match every other Commons entity
and the architectural-test allowlist (largest blast radius — touches gRPC proto
types, EF mappings, SQL schemas, query filters).
2. Keep `DateTime` for audit but extend `EntityConventionTests` to recognise the `*Utc`
property-name pattern and assert (a) it is `DateTime` (not `DateTimeOffset`) and
(b) any constant-default has `DateTimeKind.Utc`. Add a runtime assertion at the
write boundary (`SqliteAuditWriter.WriteAsync`, the central upsert) that the
incoming `Kind == DateTimeKind.Utc` and reject otherwise.
Option 2 is the smaller change and is consistent with how `AuditLog` rows are stored in
SQL Server (`datetime2`, no offset). Either way the inconsistency with `Notification`
should be documented in REQ-COM-1 as a deliberate choice.
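The write-boundary assertion in option 2 could look roughly like the sketch below. `UtcGuard` and `RequireUtc` are hypothetical names introduced for illustration; they are not existing ScadaLink types, and where the check is called from (`SqliteAuditWriter.WriteAsync`, the central upsert) is left to the implementer.

```csharp
using System;

// Hypothetical helper: not part of the reviewed codebase.
public static class UtcGuard
{
    // Returns the value unchanged when it is already UTC; otherwise throws,
    // so a Local or Unspecified timestamp never reaches storage.
    public static DateTime RequireUtc(DateTime value, string paramName)
    {
        if (value.Kind != DateTimeKind.Utc)
        {
            throw new ArgumentException(
                $"{paramName} must have DateTimeKind.Utc but was {value.Kind}.",
                paramName);
        }

        return value;
    }
}
```

A writer would call `UtcGuard.RequireUtc(evt.OccurredAtUtc, nameof(evt.OccurredAtUtc))` before persisting. Whether to throw or to log and drop is a policy decision, given the "audit-write must never abort the user-facing action" contract.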
### Commons-020 — Transport types and new Audit-message types have no unit tests in `ScadaLink.Commons.Tests`
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.Commons.Tests/` |
**Description**
The Transport (#24) work adds nine records under `Types/Transport/` (`BundleManifest`,
`EncryptionMetadata`, `BundleSession`, `BundleSummary`, `ExportSelection`,
`ImportPreview` + `ImportPreviewItem`, `ImportResolution`, `ImportResult`,
`ManifestContentEntry`) and four interfaces under `Interfaces/Transport/`. None of them
have a focused test file in `tests/ScadaLink.Commons.Tests/` — coverage is entirely
inside `tests/ScadaLink.Transport.IntegrationTests/`, which exercises the
end-to-end exporter/importer flow but does not pin the Commons-level wire contracts.
Similarly, the new `Messages/Audit/` folder (`IngestAuditEventsCommand`/`Reply`,
`IngestCachedTelemetryCommand`/`Reply`, `UpsertSiteCallCommand`/`Reply`,
`SiteCallRelayMessages`) and the `Messages/Integration/` additions
(`AuditTelemetryEnvelope`, `PullAuditEventsRequest`/`Response`) have no
serialization-shape tests in Commons. The existing `MessageConventionTests`,
`CompatibilityTests`, `ConnectionBindingSerializationTests`, and
`SiteCallQueriesTests` cover some but not all of the new traffic — `PullAuditEvents`
and `AuditTelemetryEnvelope` in particular cross the site→central version-skew
boundary that REQ-COM-5a is designed to enforce, so a JSON round-trip + named-property
assertion is the minimum protection against a future positional/tuple slip.
This is the same pattern as Commons-010 — behavior-bearing types with no Commons-level
test coverage, where the integration suite cannot catch a Commons-only contract
regression.
**Recommendation**
Add focused tests in `tests/ScadaLink.Commons.Tests/Types/Transport/` (round-trip
serialization for each Transport record, named JSON property assertions for
`EncryptionMetadata` / `BundleManifest`, the `BundleSession.Locked` threshold —
see Commons-016, the `ConflictKind`/`ResolutionAction` enum coverage), and in
`tests/ScadaLink.Commons.Tests/Messages/Audit/` (round-trip + named-property assertions
for the seven new message files). Prioritise the contracts that cross the site→central
boundary (`AuditTelemetryEnvelope`, `PullAuditEventsRequest`/`Response`,
`IngestCachedTelemetryCommand`).
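As a rough illustration of the "round-trip + named-property" assertion, the sketch below serialises a record, checks that the JSON carries named properties, and deserialises it back. `SampleEnvelope` is a hypothetical stand-in, not the real `AuditTelemetryEnvelope` shape, and default `System.Text.Json` options are assumed.

```csharp
using System;
using System.Text.Json;

// Hypothetical stand-in for a Commons message; the real record shapes differ.
public sealed record SampleEnvelope(string SiteId, int EventCount);

public static class WireContractCheck
{
    // Serialises the message, exposes the raw JSON for property assertions,
    // and returns the deserialised copy for an equality check.
    public static SampleEnvelope RoundTrip(SampleEnvelope original, out JsonElement wire)
    {
        string json = JsonSerializer.Serialize(original);

        using JsonDocument doc = JsonDocument.Parse(json);
        wire = doc.RootElement.Clone(); // Clone so the element outlives the document.

        return JsonSerializer.Deserialize<SampleEnvelope>(json)
            ?? throw new InvalidOperationException("Deserialization returned null.");
    }
}
```

A test would then assert that the copy equals the original, that `wire` has a `SiteId` property, and that it has no tuple-style `Item1` property.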
### Commons-021 — `ExternalCallResult.Response` has a benign lazy-parse race
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.Commons/Interfaces/Services/IExternalSystemClient.cs:91-104` |
**Description**
`ExternalCallResult` is a `record` returned to scripts after an outbound HTTP call. The
`Response` property lazily parses `ResponseJson` into a `DynamicJsonElement`:
```csharp
public dynamic? Response
{
get
{
if (!_responseParsed)
{
_response = string.IsNullOrEmpty(ResponseJson)
? null
: new DynamicJsonElement(System.Text.Json.JsonDocument.Parse(ResponseJson).RootElement);
_responseParsed = true;
}
return _response;
}
}
```
`_response` and `_responseParsed` are plain mutable fields on a `record` that the
language otherwise treats as immutable. Two threads reading `Response` simultaneously
can both see `_responseParsed == false`, both call `JsonDocument.Parse`, and produce
two distinct `DynamicJsonElement` wrappers — the second write wins, and any reference
the loser thread already held becomes inconsistent with the winner. The race is benign
in the current usage (scripts get the result on one thread and use it on that thread),
and `DynamicJsonElement` after Commons-002 clones the underlying `JsonElement`, so the
duplicate parses do not even leak document handles. But the pattern is fragile — a
future caller that hands the result to a background continuation or `Task.WhenAll` would
introduce a real correctness gap, and the laziness is implicit in `record` semantics
that otherwise suggest immutability.
**Recommendation**
Back the property with a `Lazy<dynamic?>` field (using `LazyThreadSafetyMode.ExecutionAndPublication`,
the default) and drop the mutable backing fields, or replace the property with a method
named `ParseResponse()` so the laziness is explicit and the caller knows to call it once
and cache. Either way, the change is local and preserves the existing `record`-equality
behavior.
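A minimal sketch of the `Lazy<T>` variant follows. It is written as a plain class and returns `JsonElement?` rather than the project's `DynamicJsonElement`, both of which are simplifying assumptions; `LazyParsedResponse` is a hypothetical name.

```csharp
using System;
using System.Text.Json;
using System.Threading;

// Hypothetical illustration; the real ExternalCallResult is a record with more members.
public sealed class LazyParsedResponse
{
    private readonly Lazy<JsonElement?> _response;

    public LazyParsedResponse(string? responseJson)
    {
        ResponseJson = responseJson;
        // ExecutionAndPublication guarantees the factory runs at most once,
        // even when several threads read Response concurrently.
        _response = new Lazy<JsonElement?>(
            () => Parse(responseJson),
            LazyThreadSafetyMode.ExecutionAndPublication);
    }

    public string? ResponseJson { get; }

    public JsonElement? Response => _response.Value;

    private static JsonElement? Parse(string? json)
    {
        if (string.IsNullOrEmpty(json))
        {
            return null;
        }

        using JsonDocument doc = JsonDocument.Parse(json);
        return doc.RootElement.Clone(); // Clone so the element outlives the document.
    }
}
```

If the type stays a `record`, a `Lazy<T>` field would take part in the generated equality and be shared by `with` copies, so that interaction would need checking before adopting this shape.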
### Commons-022 — `IAuditCorrelationContext` references an unresolvable `BundleImporter.ApplyAsync` cref; JSON-blob columns have no documented shape
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.Commons/Interfaces/Transport/IAuditCorrelationContext.cs:11`, `src/ScadaLink.Commons/Types/Transport/ImportPreview.cs:11`, `src/ScadaLink.Commons/Entities/Notifications/Notification.cs:33` |
**Description**
Two related XML-doc weaknesses, both around the new Transport / Audit surface:
1. `IAuditCorrelationContext`'s remarks say
`<see cref="BundleImporter.ApplyAsync"/>`. `BundleImporter` is the concrete
implementation in `ScadaLink.Transport.Import`, which Commons does not (and must
not) reference. The cref is unresolvable from Commons and will surface as a
build-time XML doc warning. The correct reference is the interface method
`IBundleImporter.ApplyAsync`.
2. Two JSON-string columns flow across components without a documented shape:
- `ImportPreviewItem.FieldDiffJson` — described only as "string?" with no remarks on
who produces it, who reads it, or what shape it carries. The Central UI renders it,
so a drift between producer and renderer is a silent UI regression.
- `Notification.ResolvedTargets` — described as "Resolved delivery targets snapshotted
at delivery time, for audit" but the shape (newline-separated emails? a JSON array?
comma-separated?) is undocumented. Audit consumers and the Central UI both read
this field.
Both are wire/persistence-format strings; an undocumented schema invites the same
kind of producer/consumer drift the `ValueTuple` finding in Commons-008 surfaced for
the typed messages.
**Recommendation**
- Fix the `<see cref>` in `IAuditCorrelationContext` to point at `IBundleImporter.ApplyAsync`.
- Add a remarks block to `ImportPreviewItem.FieldDiffJson` describing the JSON shape
(e.g. "a JSON object keyed by field name with `{ existing, incoming }` values") or, if
the shape is meant to be opaque to the wire, document that explicitly.
- Add a remarks block to `Notification.ResolvedTargets` documenting the format.
- Consider replacing both with strong-typed Commons records — `ResolvedTargets` could be
`IReadOnlyList<string>` serialised via EF value converter, and `FieldDiffJson` could
be a `FieldDiff` record. That is a larger change and is left as a follow-up.
### Commons-023 — Trailing-optional `SourceNode` on positional records mixes additive evolution patterns
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.Commons/Messages/Audit/SiteCallQueries.cs:53-66`, `:110-123`, `src/ScadaLink.Commons/Messages/Notification/NotificationOutboxQueries.cs:26-39`, `:104-123`, `src/ScadaLink.Commons/Types/SiteCallOperational.cs:42-54`, `src/ScadaLink.Commons/Types/TrackingStatusSnapshot.cs:33-46` |
**Description**
The `SourceNode` rollout adds an optional trailing parameter to a long list of positional
records. Two minor patterns emerge that are worth flagging:
- `SiteCallSummary` (twelve required positional members plus an optional 13th
`SourceNode = null`) — and the parallel `NotificationSummary` (ten required + optional
`SourceNode = null`) — both push the optional past a `bool IsStuck` flag. A consumer
reading the positional signature is now mixing required and optional members. The
record otherwise works correctly because every consumer constructs it via named
arguments, but a positional constructor call (which the language allows) would silently
miss the new field.
- `TrackingStatusSnapshot` declares `SourceNode` with no default (`string? SourceNode`
without `= null`), so callers must always supply it. `SiteCallOperational` also
declares `string? SourceNode` without a default, but it is purely positional. The mix
of "optional with default" and "optional without default" across the same domain is
fine technically but is the kind of inconsistency that bites a future additive field.
Neither pattern is a defect today — every consumer is updated, and JSON serialization
treats nullable-without-default the same as nullable-with-default. But the conventions
across the Audit / Notifications message surface have drifted enough that REQ-COM-5a's
"additive-only" rule deserves a one-paragraph clarification: do new optional fields take
a `= null` default, or not? The current code is mixed.
**Recommendation**
Add a one-paragraph "How to add a field" sub-section to REQ-COM-5a stating: new optional
fields on positional records MUST be added at the end of the parameter list AND MUST
carry a `= null` (or other safe) default value, so existing positional construction
sites keep compiling. Apply that rule retroactively to `TrackingStatusSnapshot` and any
other recent record that did not adopt it. No behavioral change required.
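The proposed rule can be sketched with a hypothetical record. `SampleSummary` is invented for illustration and is much shorter than `SiteCallSummary` or `TrackingStatusSnapshot`.

```csharp
// Hypothetical record: the new optional member is last and carries a default,
// so call sites written before SourceNode existed keep compiling.
public sealed record SampleSummary(
    string CallId,
    bool IsStuck,
    string? SourceNode = null);
```

A pre-existing positional call such as `new SampleSummary("c1", false)` still compiles and yields `SourceNode == null`, while newer callers can pass `SourceNode: "node-a"` by name.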
+335 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Communication` |
| Design doc | `docs/requirements/Component-Communication.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
## Summary
@@ -42,6 +42,47 @@ gRPC-supplied `correlation_id` flows straight into an Akka actor name
(Communication-014), and the factory's endpoint-reuse defect is masked by the test
mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
#### Re-review 2026-05-28 (commit `1eb6e97`)
All prior findings (Communication-001..015) remain `Resolved` in this commit. The
re-review walked all 10 checklist categories again on the surface that has not
been re-examined before — the central↔site command/control routing surface
(`CentralCommunicationActor`, `SiteCommunicationActor`) rather than the
previously-mined gRPC streaming surface — and uncovered a cluster of defects
around the connection-state-change workflow. The single material finding is
**`HandleConnectionStateChanged` is dead code**: no production code path emits
`ConnectionStateChanged`, so the documented "kill active debug streams for the
disconnected site" + "mark in-progress deployments as failed" workflow never
fires at runtime (Communication-016). The downstream consequence is
**`_inProgressDeployments` grows unboundedly** — entries are inserted on every
deployment but only cleaned via that dead path (Communication-017). Five
smaller items round out the re-review: site heartbeats hard-code
`IsActive: true` regardless of node role (Communication-018), the
60-second-periodic `LoadSiteAddressesFromDb` task has no CancellationToken so a
hung DB query has no upper bound (Communication-019), the
`SiteAddressCacheLoaded` internal message carries a mutable
`Dictionary`/`List` (Communication-020), `SiteStreamGrpcServer.SubscribeInstance`
leaks the StreamRelayActor if `_streamSubscriber.Subscribe` throws between
`ActorOf` and the `try` block (Communication-021), and `_debugSubscriptions`
keyed by caller-supplied `CorrelationId` could orphan a subscriber on ID reuse
(Communication-022). Seven new findings, all Open: one High, one Medium, five
Low.
## Checklist coverage 2026-05-28
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `HandleConnectionStateChanged` and its `_inProgressDeployments` / `_debugSubscriptions` cleanup never fire — the connection-state workflow is dead (Communication-016, Communication-017). `_debugSubscriptions` correlation-ID overwrite risk (Communication-022). |
| 2 | Akka.NET conventions | ✓ | `SiteAddressCacheLoaded` carries mutable `Dictionary<string, List<string>>` — violates message-immutability convention (Communication-020). `Forward`/`PipeTo`/Sender-capture all clean. |
| 3 | Concurrency & thread safety | ✓ | All mutable state mutated on the actor thread. `_subscriptions` ConcurrentDictionary use disciplined. No new issues. |
| 4 | Error handling & resilience | ✓ | `LoadSiteAddressesFromDb` lacks a `CancellationToken` propagation point (Communication-019). `SubscribeInstance` leaks the relay actor if `Subscribe` throws pre-try (Communication-021). |
| 5 | Security | ✓ | Correlation-id validation in place (Communication-014). No new issues. |
| 6 | Performance & resource management | ✓ | `_inProgressDeployments` grows unboundedly (Communication-017). gRPC client/server lifecycles otherwise clean. |
| 7 | Design-document adherence | ✓ | `ConnectionStateChanged` handler is dead code — the doc-stated "kill streams on disconnect, fail in-progress deployments" workflow does not actually run (Communication-016). Site heartbeats always report `IsActive: true` regardless of role (Communication-018). |
| 8 | Code organization & conventions | ✓ | Options pattern correct; mapper placement and proto evolution are additive-only. No new issues. |
| 9 | Testing coverage | ✓ | `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled` exercises a code path that no production caller ever drives — gives false confidence (related to Communication-016). |
| 10 | Documentation & comments | ✓ | Detailed XML docs added in this commit. No new issues. |
## Checklist coverage
| # | Category | Examined | Notes |
@@ -726,3 +767,294 @@ gained `On_GrpcError_Reconnects_To_Other_Node_Endpoint`, which uses a new
per endpoint (instead of one fixed mock regardless of endpoint), so the bridge actor's
NodeA→NodeB reconnect is now verified to actually target the NodeB endpoint rather
than being masked by an endpoint-agnostic mock.
### Communication-016 — `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:169`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:338-375` |
**Description**
`CentralCommunicationActor.HandleConnectionStateChanged` is wired to
`Receive<ConnectionStateChanged>` and implements two important workflows on
`IsConnected == false`: (1) kill every active debug stream for the disconnected
site (`_debugSubscriptions` walk → `DebugStreamTerminated` Tell to each
subscriber); (2) mark every in-progress deployment for that site as failed
(`_inProgressDeployments` walk → entry removal). Both are documented in the
component design doc's "Connection Failure Behavior" section and in WP-5 of the
work plan referenced in the class's own XML doc comment.
A repo-wide search (`grep -rn ConnectionStateChanged src/ tests/`) shows **no
production code ever emits `ConnectionStateChanged`**. The only producers are
the unit test `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
(line 137) and the Commons message-roundtrip test. The
`CentralCommunicationActor` therefore never receives one in production, the
disconnect-cleanup workflow never fires, and `_debugSubscriptions` /
`_inProgressDeployments` are never pruned via this path.
Concrete consequences:
- A site goes down → its active debug streams do **not** get a synchronous
`DebugStreamTerminated` notification from central. The bridge actor must
detect the disconnect itself via gRPC keepalive timing out (~25s) or TCP RST.
Subscribers wait that long for the `OnStreamTerminated` callback instead of
the documented "immediately killed by central" behaviour.
- In-progress deployments to a disconnected site continue to occupy the
Ask-reply window and only fail when the Ask times out at the
`CommunicationService.DeployInstanceAsync` layer (120s). They are never
proactively marked failed.
- The unit test gives a strong false impression that the workflow works — it
exercises a code path that has no production caller.
The design doc and CLAUDE.md mention "ClusterClient handles failover between
NodeA and NodeB internally — there is no application-level NodeA preference /
NodeB fallback logic" — so the ClusterClient mechanism is the documented
failover transport. But that says nothing about *signalling* a fully-down
remote cluster to central's coordinator actor, which is exactly what
`ConnectionStateChanged` was meant to do.
**Recommendation**
Pick one of:
- Wire a producer for `ConnectionStateChanged` — e.g. subscribe to
`ClusterClient`'s contact-point/cluster events (`ClusterClient.ContactPoints`
Refresh / `ContactPointAdded` / `ContactPointRemoved`) or watch the
ClusterClient actor for a "no contact points reachable" state — and have it
publish `ConnectionStateChanged` to `Self` on each transition.
- If the documented "synchronously kill streams on disconnect" behaviour is
intentionally being dropped in favour of the slower keepalive-based
detection, delete the handler, the `ConnectionStateChanged` record, and the
related `_debugSubscriptions` / `_inProgressDeployments` tracking, then
update the design doc's "Connection Failure Behavior" section accordingly.
Either way, replace `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
— at present it asserts a behaviour that no production code triggers.
---
### Communication-017 — `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:73`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:501`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:357-367` |
**Description**
`TrackMessageForCleanup` inserts `_inProgressDeployments[deploy.DeploymentId] =
envelope.SiteId` on every `DeployInstanceCommand` routed to a site (line 501).
The only places that *remove* from `_inProgressDeployments` are:
- `HandleConnectionStateChanged` on `IsConnected == false` (line 366) — which
per Communication-016 never fires in production.
- `PostStop` (line 553) — only on actor death (central failover).
There is **no removal on the normal happy path** — neither when the site replies
`DeploymentStatusResponse` (the reply goes to the Ask's temporary reply actor,
not back through `CentralCommunicationActor`), nor on Ask timeout. Every
successful or failed deployment leaves its entry behind for the lifetime of the
process.
Memory impact is modest (each entry is ~70-100 bytes), but the dictionary grows
monotonically. Over months of operation across all sites a central node could
accumulate tens of thousands of entries — a real, observable leak. More
seriously, the field is *also* the source-of-truth set the
`HandleConnectionStateChanged` walk uses to fail in-progress deployments, so
even if a `ConnectionStateChanged` *were* fired today, the walk would
"fail" thousands of already-completed deployments and Tell their (now stale)
correlation-IDs into the void.
`_debugSubscriptions` (line 67) shares the same shape — but a normal debug
session ends with an `UnsubscribeDebugViewRequest` that *does* drive cleanup
(line 497), so leaks are only realised when a consumer crashes without
unsubscribing.
**Recommendation**
Either remove `_inProgressDeployments` entirely (it has no other consumer once
Communication-016 is fixed by deletion) or, if the disconnect-cleanup workflow
is retained, add a removal hook on the reply path. The simplest fix is to
subscribe `CentralCommunicationActor` to the Ask reply: route
`DeployInstanceCommand` through the actor with the actor as the Ask sender,
forward the reply to the original caller, and `_inProgressDeployments.Remove`
in the same handler. (Today the Ask is taken on the *actor* itself by the
caller, so the reply skips the coordinator.)
---
### Communication-018 — Site heartbeats hard-code `IsActive: true` regardless of node role
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:357-371` |
**Description**
`SiteCommunicationActor.SendHeartbeatToCentral` builds
`new HeartbeatMessage(_siteId, hostname, IsActive: true, DateTimeOffset.UtcNow)`
on every periodic tick (line 366), with no inspection of whether this node is
actually the active site node or a standby. The `HeartbeatMessage.IsActive`
field thus carries the literal value `true` on every heartbeat from every
node, and the field is effectively dead — central's `HandleHeartbeat` doesn't
consume it either (line 297 only passes `SiteId` and `Timestamp` to
`MarkHeartbeat`).
Per CLAUDE.md's Cluster & Failover section the active/standby distinction is
real ("Both nodes are seed nodes", "keep-oldest split-brain resolver",
"automatic dual-node recovery"), so a heartbeat that *could* carry node-role
information would be useful for the central health dashboard distinguishing
"active node down, standby up" from "site fully offline". As shipped, the
field is contract noise and a future implementer might mistakenly assume it
already carries meaningful state.
**Recommendation**
Either (a) resolve the current cluster role at heartbeat-send time and pass it
through — e.g. `Cluster.Get(Context.System).SelfRoles.Contains("active")` or
the project's existing role mechanism — and have the central aggregator
consume `IsActive`; or (b) drop the `IsActive` field from `HeartbeatMessage`
(additive-only-evolution: deprecate the field, default to `true`, plan
removal in a major message contract revision).
---
### Communication-019 — `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:397-431` |
**Description**
`LoadSiteAddressesFromDb` runs `await repo.GetAllSitesAsync()` inside
`Task.Run(async () => ...).PipeTo(self)` with no cancellation token (line 404).
The repository signature accepts `CancellationToken` (the test mock declares
`GetAllSitesAsync(Arg.Any<CancellationToken>())`), but the actor calls the
no-arg overload — so a hung MS SQL connection has no upper bound. The
60-second-periodic refresh keeps firing; each tick spawns a fresh `Task.Run`
that piles up if the database is consistently slow. The actor itself is
unaffected (it's not blocked), but pending tasks and DB connection-pool
resources accumulate, and the `Status.Failure` handler (Communication-006)
never fires because the task never faults — it just sits.
**Recommendation**
Maintain a per-load `CancellationTokenSource` with a deadline (e.g. the same
60s the refresh runs on, or a configurable timeout in `CommunicationOptions`).
Pass its `Token` to `GetAllSitesAsync`. Cancel the prior token before spinning
a new load to avoid task accumulation.
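The deadline half of this recommendation might be sketched as follows. `ISiteSource`, `BoundedLoad`, and `HangingSiteSource` are hypothetical stand-ins (the real repository interface and return type differ), and cancelling the prior load is deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the site repository.
public interface ISiteSource
{
    Task<IReadOnlyList<string>> GetAllSitesAsync(CancellationToken cancellationToken);
}

public static class BoundedLoad
{
    // Each load gets its own token source that cancels itself after `timeout`,
    // so a hung query faults instead of waiting forever.
    public static async Task<IReadOnlyList<string>> LoadSitesAsync(
        ISiteSource source, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);
        return await source.GetAllSitesAsync(cts.Token).ConfigureAwait(false);
    }
}

// Hypothetical source that never completes unless cancelled; simulates a hung query.
public sealed class HangingSiteSource : ISiteSource
{
    public async Task<IReadOnlyList<string>> GetAllSitesAsync(CancellationToken cancellationToken)
    {
        await Task.Delay(Timeout.InfiniteTimeSpan, cancellationToken).ConfigureAwait(false);
        return Array.Empty<string>();
    }
}
```

With a timeout no longer than the refresh interval, at most one load per tick can be outstanding, and the resulting `OperationCanceledException` would reach the existing `Status.Failure` handler through `PipeTo`.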
---
### Communication-020 — `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:567` |
**Description**
The Akka.NET convention is that messages crossing actor boundaries (even
internal Self-messages over an async task boundary) are immutable.
`SiteAddressCacheLoaded(Dictionary<string, List<string>> SiteContacts)` is a
record but its `SiteContacts` payload is a mutable `Dictionary` whose values
are mutable `List<string>`. Constructed inside `Task.Run` and handed off to
the actor, the cache could in principle be mutated by either side; in
practice nothing does, but the type offers no guarantee of that:
CLAUDE.md's "message immutability" rule is being followed only by convention.
**Recommendation**
Change the record signature to use `IReadOnlyDictionary<string, IReadOnlyList<string>>`
(or `ImmutableDictionary` / `ImmutableArray<string>`) and freeze the data
before piping. The cost is negligible — the payload is built and consumed
once per refresh tick.
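One way to freeze the payload before piping, assuming `System.Collections.Immutable` is available, is sketched below. `SiteAddressSnapshot` and `SiteAddressSnapshotFactory` are hypothetical names, not the actual `SiteAddressCacheLoaded` record.

```csharp
using System.Collections.Generic;
using System.Collections.Immutable;

// Hypothetical message shape exposing read-only views only.
public sealed record SiteAddressSnapshot(
    IReadOnlyDictionary<string, IReadOnlyList<string>> SiteContacts);

public static class SiteAddressSnapshotFactory
{
    // Copies the mutable structure into immutable collections so later changes
    // to the source dictionary or its lists cannot reach the message.
    public static SiteAddressSnapshot Freeze(Dictionary<string, List<string>> mutable)
    {
        ImmutableDictionary<string, IReadOnlyList<string>> frozen =
            mutable.ToImmutableDictionary(
                pair => pair.Key,
                pair => (IReadOnlyList<string>)pair.Value.ToImmutableList());

        return new SiteAddressSnapshot(frozen);
    }
}
```

The copy is taken once per refresh tick, which matches the "negligible cost" estimate above.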
---
### Communication-021 — `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:188-200` |
**Description**
`SubscribeInstance` performs these statements in order (lines 189-194), all
*before* the `try` block at line 200:
1. `Interlocked.Increment(ref _actorCounter)`
2. `_actorSystem!.ActorOf(Props.Create(typeof(StreamRelayActor), ...))`
3. `_streamSubscriber.Subscribe(request.InstanceUniqueName, relayActor)`
If step 3 throws (the subscriber is wired but its `Subscribe` faults — a stale
instance name, a temporary index lookup failure, etc.), the exception escapes
the method as an unhandled `RpcException` *and* leaks the freshly-created
`relayActor`. The `finally` block at line 211 is unreachable because the
throw happens before the `try`. The actor's `_activeStreams` entry, the
`StreamEntry.Cts`, and the `Channel<SiteStreamEvent>` are also leaked.
In normal operation `_streamSubscriber.Subscribe` does not throw, so the bug is
latent — but a misbehaving site runtime (e.g. `SiteStreamManager` faulted
because the actor system is shutting down) would surface it.
**Recommendation**
Restructure to either (a) wrap the `Subscribe` call in a `try` whose `catch`
stops the relay actor and disposes the CTS, or (b) move the actor + subscriber
creation *inside* the existing `try` block (the `finally` will then handle
cleanup uniformly). Option (b) is the simplest — just move lines 189-194 down
past the `try {` brace.
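Option (b) can be illustrated without Akka by a small sketch in which `createRelay` and `subscribe` stand in for `ActorOf` and `_streamSubscriber.Subscribe`. `RelayHandle` and `SubscribeFlow` are hypothetical; the real cleanup would stop the actor and dispose the CTS rather than call `Dispose` on a handle.

```csharp
using System;

// Hypothetical stand-in for the relay actor and its associated resources.
public sealed class RelayHandle : IDisposable
{
    public bool Disposed { get; private set; }

    public void Dispose() => Disposed = true;
}

public static class SubscribeFlow
{
    // Creation and subscription both happen inside the try, so the finally
    // block releases the relay even when subscribe throws.
    public static void Run(
        Func<RelayHandle> createRelay,
        Action<RelayHandle> subscribe,
        Action<RelayHandle> pump)
    {
        RelayHandle? relay = null;
        try
        {
            relay = createRelay();
            subscribe(relay);
            pump(relay);
        }
        finally
        {
            relay?.Dispose();
        }
    }
}
```

The null check in `finally` covers the case where creation itself fails before a relay exists.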
---
### Communication-022 — `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:67`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:493` |
**Description**
`TrackMessageForCleanup` on `SubscribeDebugViewRequest` does
`_debugSubscriptions[sub.CorrelationId] = (envelope.SiteId, Sender)` (line 493).
The dictionary indexer silently overwrites any prior entry for the same
`CorrelationId`. If two debug sessions ever reuse the same correlation ID (e.g.
two Blazor users start a stream at the same moment with a non-GUID id, or a
caller bug, or a malicious caller as flagged in the cousin
Communication-014), the first subscriber's entry is overwritten and lost —
on a later `ConnectionStateChanged(false)` (per Communication-016 it never
actually fires today, but the design intent stands), only the *second*
subscriber would be notified of the disconnect.
`DebugStreamService.StartStreamAsync` uses `Guid.NewGuid().ToString("N")` as
the session id (`DebugStreamService.cs:97`), so a real collision is
astronomically unlikely in normal operation. But the central side is not
defending itself: a CLI consumer or a future caller is implicitly trusted to
generate globally-unique ids.
**Recommendation**
When the slot is already occupied, log a Warning and either reject the new
subscription with an error response or evict the prior subscriber via
`DebugStreamTerminated` before installing the new one. This mirrors the
`SiteStreamGrpcServer` defensive behaviour where a duplicate `correlation_id`
cancels the existing stream (line 167).
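
The evict-on-duplicate variant could look roughly like the sketch below. It is an illustration only: `DebugSubscriptionRegistrySketch` and `Register` are hypothetical names, and a plain `string` stands in for the subscriber's `IActorRef`. A plain `Dictionary` is used on the assumption that the state is only touched from within the actor's message loop.

```csharp
using System;
using System.Collections.Generic;

public sealed class DebugSubscriptionRegistrySketch
{
    private readonly Dictionary<string, (string SiteId, string Subscriber)> _subs =
        new Dictionary<string, (string SiteId, string Subscriber)>();

    // Installs the new subscriber and returns the evicted prior subscriber,
    // or null when the slot was free. The caller would notify the evicted
    // subscriber (e.g. with DebugStreamTerminated) instead of dropping it silently.
    public string? Register(string correlationId, string siteId, string subscriber, Action<string> warn)
    {
        string? evicted = null;
        if (_subs.TryGetValue(correlationId, out var prior))
        {
            warn($"Correlation id {correlationId} already in use; evicting prior subscriber.");
            evicted = prior.Subscriber;
        }

        _subs[correlationId] = (siteId, subscriber);
        return evicted;
    }
}
```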
+479 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.ConfigurationDatabase` |
| Design doc | `docs/requirements/Component-ConfigurationDatabase.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 10 |
## Summary
@@ -59,6 +59,59 @@ inconsistency — a redundant cast on one of the three `HasConversion` calls
(`ConfigurationDatabase-014`). The module is otherwise healthy and the prior fixes
hold up well.
#### Re-review 2026-05-28 (commit `1eb6e97`)
Re-reviewed the module at commit `1eb6e97`. All fourteen prior findings remain
`Resolved`; their fixes still hold (encryption converter, fail-fast guard,
peppered API-key hash, ephemeral-fallback hardening, etc.). The module has
grown since the last review — new code includes Audit Log (#23) raw-SQL
paths in `AuditLogRepository` (partition-switch purge, recursive
execution-tree CTE, KPI snapshot, partition-boundary discovery), the
`AuditLogPartitionMaintenance` SPLIT-RANGE roll-forward implementation, the
`AuditCorrelationContext` scoped service that stamps `BundleImportId`, the
`SiteCallAuditRepository` monotonic-rank upsert, and the
`NotificationOutboxRepository` per-site KPI surface — and most of the new
findings are concentrated in those raw-SQL paths and in latent gaps left
behind by the CD-012 hash migration.
Ten new findings were recorded. The most material is
`ConfigurationDatabase-015`: a check-then-act race in
`NotificationOutboxRepository.InsertIfNotExistsAsync` with no duplicate-key
catch — unlike the sibling Audit Log / Site Call ingest paths, a concurrent
ack-after-persist on the same `NotificationId` will surface as an
unhandled `DbUpdateException` and break the at-least-once site→central
handoff. `ConfigurationDatabase-016` flags that
`InboundApiRepository.GetApiKeyByValueAsync` hashes the candidate with
`ApiKeyHasher.Default` (unpeppered) while the production create-path uses
the configured peppered hasher — any future caller (or test that exercises
the method) will silently fail to find a real key; the production
`ApiKeyValidator` happens not to call it, but the method is a publicly
exposed `IInboundApiRepository` member and a latent bug.
`ConfigurationDatabase-017` records that the `DeleteDeploymentRecordAsync`
stub-attach delete bypasses the documented optimistic-concurrency rule on
`DeploymentRecord.RowVersion` — the SQLite tests pass because the test
fixture re-maps `RowVersion` as a nullable concurrency token, but in
production this is likely to throw `DbUpdateConcurrencyException`.
`ConfigurationDatabase-018` records the `DateTime`-typed `*Utc` columns on
`AuditEvent` and `SiteCall` re-emerge as `Kind=Unspecified` on read; the
sibling Commons module flagged the same pattern as Commons-019, and
`AuditLogPartitionMaintenance.GetMaxBoundaryAsync` already defends against
it with an explicit `SpecifyKind(Utc)` — but `GetPartitionBoundariesOlderThanAsync`
does not (`ConfigurationDatabase-020`). `ConfigurationDatabase-019` is the
SPLIT-RANGE loop in `AuditLogPartitionMaintenance.EnsureLookaheadAsync`
swallowing every `SqlException` as a Warning and continuing — a genuine
failure (permissions, deadlock, transient) leaves a missing boundary and
the next iteration cheerfully splits the following month, creating a hole.
`ConfigurationDatabase-021` is a low-severity hardening concern around
`SwitchOutPartitionAsync`'s raw-SQL interpolation of `monthBoundaryStr` /
`stagingTableName` (currently safe by construction, but truncates fractional
seconds). `ConfigurationDatabase-022` is the stale "WP-24 stub" XML comment
on `DeploymentManagerRepository`. `ConfigurationDatabase-023` is a
design-doc-adherence drift on `IX_AuditLog_CorrelationId` (design says
`IX_AuditLog_Correlation`). `ConfigurationDatabase-024` is missing test
coverage for the SPLIT-RANGE failure-continuation behaviour and for the
production-shape stub-attach delete with a real rowversion.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -74,6 +127,21 @@ hold up well.
| 9 | Testing coverage | ✓ | Several repositories and `InstanceLocator` lack direct tests (CD-010). |
| 10 | Documentation & comments | ✓ | `DeploymentManagerRepository` "WP-24 stub" XML comment is stale; noted in module context but not raised as a standalone finding. No issues found beyond items above. |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `GetPartitionBoundariesOlderThanAsync` returns `DateTimeKind.Unspecified` (CD-020). `GetApiKeyByValueAsync` hashes with the unpeppered default (CD-016). |
| 2 | Akka.NET conventions | ✓ | No actors in this module. No issues found. |
| 3 | Concurrency & thread safety | ✓ | `NotificationOutboxRepository.InsertIfNotExistsAsync` check-then-act has no duplicate-key catch (CD-015). Stub-attach delete bypasses documented optimistic concurrency on `DeploymentRecord.RowVersion` (CD-017). |
| 4 | Error handling & resilience | ✓ | `AuditLogPartitionMaintenance.EnsureLookaheadAsync` swallows non-idempotent SPLIT failures and continues (CD-019). |
| 5 | Security | ✓ | `SwitchOutPartitionAsync` interpolates a `DateTime` string and a GUID-suffixed identifier into raw SQL — safe by construction but pattern is risky (CD-021). |
| 6 | Performance & resource management | ✓ | No new issues found. |
| 7 | Design-document adherence | ✓ | Index name drift: design says `IX_AuditLog_Correlation`, code uses `IX_AuditLog_CorrelationId` (CD-023). |
| 8 | Code organization & conventions | ✓ | `DateTime *Utc` columns on `AuditEvent` / `SiteCall` carry no `DateTimeKind` enforcement (CD-018). |
| 9 | Testing coverage | ✓ | No tests for SPLIT failure continuation and no production-shape rowversion stub-attach test (CD-024). |
| 10 | Documentation & comments | ✓ | Stale "WP-24 stub" XML comment on `DeploymentManagerRepository` (CD-022). |
## Findings
### ConfigurationDatabase-001 — `GetTemplateWithChildrenAsync` loads child templates then discards them
@@ -816,3 +884,411 @@ no behavioural regression test is meaningful (cf. CD-005); a forward guard was a
in `SchemaConfigurationTests.cs`
`SecretColumns_AllHaveEncryptedStringConverterApplied` (theory over all three secret
columns) — asserting each column keeps an `EncryptedStringConverter`.
### ConfigurationDatabase-015 — `NotificationOutboxRepository.InsertIfNotExistsAsync` is a check-then-act race with no duplicate-key catch
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs:33-45` |
**Description**
`InsertIfNotExistsAsync` does `AnyAsync(x => x.NotificationId == n.NotificationId)`,
then — if false — `AddAsync` + `SaveChangesAsync`. There is a check-then-act window
between the two operations: two sessions can both pass the `AnyAsync` check and both
attempt the INSERT, and the loser surfaces as a uniqueness violation on the
`NotificationId` primary key wrapped in a `DbUpdateException` / `SqlException` (error
2627). The site→central handoff for notifications is documented as **at-least-once
with ack-after-persist plus insert-if-not-exists**; collisions on the same
`NotificationId` are therefore not a "should never happen" but the *expected* contention
mode. As written, the second concurrent ack throws, fails the site→central
acknowledgement, and the site retries the same row again on its next forward — a
livelock if the contending pair keeps racing.
The sibling raw-SQL `IF NOT EXISTS … INSERT` paths in `AuditLogRepository.InsertIfNotExistsAsync`
(see the `SqlErrorUniqueIndexViolation` / `SqlErrorPrimaryKeyViolation` handling at
`AuditLogRepository.cs:74-89`) and `SiteCallAuditRepository.UpsertAsync`
(`SiteCallAuditRepository.cs:87-96`) explicitly catch errors 2601/2627 and treat the
loser as a no-op — exactly the right pattern for "first-write-wins idempotent ingest".
This repository alone does not.
**Recommendation**
Either (a) rewrite the body as a single raw-SQL `IF NOT EXISTS … INSERT` and apply the
same 2601/2627 catch-and-log-Debug pattern the AuditLog and SiteCall repositories use,
or (b) wrap the existing flow in a try/catch around `SaveChangesAsync` that inspects
the inner `SqlException.Number` and returns `false` (i.e. "another writer won the race")
on 2601/2627. Option (a) is preferable because it collapses the two round-trips to one
and matches the established idempotent-ingest pattern used elsewhere in the module.
Add a regression test that simulates two concurrent `InsertIfNotExistsAsync` calls
(using two open contexts) for the same `NotificationId` and asserts neither call
throws and exactly one row lands.
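
The "loser is a no-op" shape of option (b) can be sketched independently of EF Core. The helper below is hypothetical (`IdempotentInsertSketch`, `InsertFirstWriterWinsAsync`, and the `isDuplicateKey` predicate are names invented here); in the real repository the predicate would presumably inspect the inner `SqlException.Number` for 2601 or 2627, as the sibling repositories are described as doing.

```csharp
using System;
using System.Threading.Tasks;

public static class IdempotentInsertSketch
{
    // Runs the insert and reports whether this caller won the race.
    // isDuplicateKey decides which exceptions mean "another writer already
    // inserted the row"; every other exception still propagates.
    public static async Task<bool> InsertFirstWriterWinsAsync(
        Func<Task> insert,
        Func<Exception, bool> isDuplicateKey)
    {
        try
        {
            await insert();
            return true;
        }
        catch (Exception ex) when (isDuplicateKey(ex))
        {
            // Lost the race: the row exists, so the ack can still succeed.
            return false;
        }
    }
}
```

The exception filter keeps the catch narrow, so a genuine failure (timeout, permissions) is not mistaken for a benign duplicate.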
### ConfigurationDatabase-016 — `InboundApiRepository.GetApiKeyByValueAsync` hashes the candidate with the unpeppered `ApiKeyHasher.Default`
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/InboundApiRepository.cs:35-39` |
**Description**
`GetApiKeyByValueAsync` resolves an API key by its presented plaintext value by hashing
the candidate and looking up `KeyHash`. The hash, however, is computed with the static
`ApiKeyHasher.Default` (the fixed, deployment-independent unpeppered hasher used for
tests). Production key creation uses the DI-registered, *peppered* `IApiKeyHasher`
constructed from `InboundApiOptions.ApiKeyPepper` (see CD-012 resolution and
`ApiKeyHasher.ctor(string pepper)`), so the stored `KeyHash` of any real key was
produced under the deployment pepper. Hashing the candidate with the unpeppered
`Default` yields a different digest, and the `WHERE KeyHash = @hash` lookup will never
match a real key.
The production `ApiKeyValidator` (InboundAPI module) deliberately does NOT call this
method — it fetches all keys and runs a constant-time comparison via the
DI-registered hasher (`ApiKeyValidator.cs:53-64`) — so the immediate
authentication path is unaffected. But `GetApiKeyByValueAsync` remains a publicly
exposed `IInboundApiRepository` member; any new caller (a future admin tool, a CLI
command, a test) that uses it under a peppered configuration will silently get a
`null` result for an existing, valid key, and almost certainly mis-route the failure
as "key not found".
**Recommendation**
Either (a) take `IApiKeyHasher` via constructor injection — alongside the existing
`ScadaLinkDbContext` and optional `ILogger` — and use it here so the repository
participates in the same peppered scheme as the rest of the system; or (b) delete
the method from both the implementation and `IInboundApiRepository` (Commons) on the
grounds that the production authentication path correctly avoids it for timing
reasons and there is no remaining valid caller. Add a regression test that constructs
the repository under a real `ApiKeyHasher("a-strong-pepper-value")`, inserts an
`ApiKey.FromHash(...)` using the same hasher, and asserts `GetApiKeyByValueAsync`
returns the row — under option (a) it should pass; under option (b) the method no
longer exists.
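
The mismatch described above can be demonstrated in isolation. The sketch assumes, purely for illustration, that the unpeppered hasher is a plain SHA-256 and the peppered one is an HMAC-SHA256 keyed with the pepper; the actual `ApiKeyHasher` construction is not shown in this review and may differ. `PepperMismatchSketch` and its methods are hypothetical names.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class PepperMismatchSketch
{
    // Assumed unpeppered scheme: plain SHA-256 of the key text.
    public static string HashUnpeppered(string apiKey) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(apiKey)));

    // Assumed peppered scheme: HMAC-SHA256 keyed with the deployment pepper.
    public static string HashPeppered(string apiKey, string pepper) =>
        Convert.ToHexString(HMACSHA256.HashData(
            Encoding.UTF8.GetBytes(pepper),
            Encoding.UTF8.GetBytes(apiKey)));
}
```

Under these assumptions a key stored via `HashPeppered` can never be found by a lookup that hashes the candidate with `HashUnpeppered`, which is the failure mode the finding describes.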
### ConfigurationDatabase-017 — Stub-attach delete on `DeploymentRecord` bypasses optimistic concurrency
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:83-97` |
**Description**
`DeploymentRecord` carries a SQL Server `rowversion` concurrency token (declared
in `DeploymentConfiguration` and confirmed by `ConcurrencyTests`), per the design
doc's "Optimistic concurrency is used on deployment status records". When
`DeleteDeploymentRecordAsync` falls into its stub-attach branch (no tracked entity
in `_dbContext.DeploymentRecords.Local` for the given id), it constructs
`new DeploymentRecord("stub", "stub") { Id = id }`, `Attach`es it, and `Remove`s it.
The stub's `RowVersion` is left at its default `null` (or `byte[0]`).
EF Core's SQL Server provider generates the delete as
`DELETE FROM DeploymentRecords WHERE Id = @id AND RowVersion = @stubRowVersion` — and
the stub rowversion is not the row's real rowversion, so on a real SQL Server (with
`IsRowVersion()` auto-populating the column) the WHERE never matches and `SaveChanges`
throws `DbUpdateConcurrencyException`. The path is exercised by
`RepositoryCoverageTests.DeleteDeploymentRecord_ViaStubAttachPath_RemovesEntity`
but the test fixture remaps `RowVersion` as a nullable `IsConcurrencyToken()` column
without auto-population (`SqliteTestHelper.ConfigureForTests`), so the stored
RowVersion is null AND the stub's RowVersion is null AND the SQLite delete matches.
Production-shape behaviour is the opposite.
The same stub-attach pattern is used on `SystemArtifactDeploymentRecord`,
`Site`, and `DataConnection`. Those entities have no rowversion token, so the
production behaviour is correct for them — the issue is specific to
`DeploymentRecord`.
**Recommendation**
Replace the stub-attach branch in `DeleteDeploymentRecordAsync` with a real lookup —
`await _dbContext.DeploymentRecords.FindAsync([id], ct)` then `Remove` if non-null —
mirroring `DeleteInstanceAttributeOverrideAsync` and `DeleteDeployedSnapshotAsync`.
This loses the "delete by id without a read" micro-optimisation (a real concern only
in batched-delete loops) but restores the documented concurrency contract. If the
optimisation is genuinely required, attach a `DeploymentRecord` with the *caller's*
known RowVersion (the caller had to fetch the row at some point) and accept the
`DbUpdateConcurrencyException` as the correct concurrency signal. Add a regression
test under MS SQL (extend `RepositoryCoverageTests` with a SQL-Server-flavoured
fixture, or use `MsSqlMigrationFixture`) that asserts the stub-attach delete works
when the real RowVersion is supplied.
### ConfigurationDatabase-018 — `DateTime`-typed `*Utc` columns on `AuditEvent` / `SiteCall` carry no `DateTimeKind` enforcement
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/AuditLogEntityTypeConfiguration.cs`, `Configurations/SiteCallEntityTypeConfiguration.cs` (mappings for `OccurredAtUtc`, `IngestedAtUtc`, `CreatedAtUtc`, `UpdatedAtUtc`, `TerminalAtUtc`) |
**Description**
`AuditEvent.OccurredAtUtc` / `IngestedAtUtc` and `SiteCall.CreatedAtUtc` /
`UpdatedAtUtc` / `TerminalAtUtc` / `IngestedAtUtc` are declared as `DateTime` (not
`DateTimeOffset`) per the Audit Log #23 spec, with a UTC suffix convention. SQL Server's
`datetime2` type stores no `Kind` flag, so values inserted with
`DateTimeKind.Utc` round-trip as `DateTimeKind.Unspecified` on read. The EF mappings
add no `HasConversion(...)` to normalise the kind. The sibling Commons module just
flagged the same pattern as `Commons-019`; in this module the consequence is concrete:
- `AuditLogPartitionMaintenance.GetMaxBoundaryAsync` already defends with an explicit
`DateTime.SpecifyKind(dt, DateTimeKind.Utc)` (see `AuditLogPartitionMaintenance.cs:103-104`).
That defence is necessary precisely because the EF mapping does not enforce it.
- `AuditLogRepository.GetPartitionBoundariesOlderThanAsync` does NOT defend — it
returns `reader.GetDateTime(0)` directly with `Kind=Unspecified` (separate finding
CD-020).
- Downstream comparisons like `DateTime.UtcNow` (Kind=Utc) against a re-read
`OccurredAtUtc` (Kind=Unspecified) do not produce a runtime error, but any code
path that converts via `.ToLocalTime()` or `.ToUniversalTime()` will silently
interpret an unspecified-kind value as local time and produce wrong results.
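
The hazard in the last bullet follows from documented `DateTime` behaviour: `ToUniversalTime()` treats an `Unspecified` value as local time. A small helper of the following shape (hypothetical name `UtcKindSketch.AsUtc`, not code from the module) re-tags such a value without shifting its ticks:

```csharp
using System;

public static class UtcKindSketch
{
    // Normalises a value to Kind=Utc.
    // Unspecified values (as read from datetime2) are re-tagged without shifting
    // ticks, on the assumption that the column really holds UTC.
    public static DateTime AsUtc(DateTime value)
    {
        switch (value.Kind)
        {
            case DateTimeKind.Utc:
                return value;
            case DateTimeKind.Local:
                return value.ToUniversalTime();   // genuine local value: convert
            default:
                return DateTime.SpecifyKind(value, DateTimeKind.Utc);
        }
    }
}
```

By contrast, calling `ToUniversalTime()` directly on the `Unspecified` value would apply the machine's local offset, so the result depends on the server's time zone.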
**Recommendation**
Apply a value converter on every `DateTime`-typed `*Utc` column that re-tags the
`Kind` to `Utc` on read (and asserts/`SpecifyKind` on write to defend against an
accidental local-kind write). An inline `HasConversion` pair is sufficient,
one call per column:
```csharp
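builder.Property(e => e.OccurredAtUtc)
    .HasConversion(
        // Write side: pass the value through unchanged (datetime2 stores no Kind).
        v => v,
        // Read side: re-tag the materialised value as UTC without shifting its ticks.
        v => DateTime.SpecifyKind(v, DateTimeKind.Utc));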
```
Apply uniformly to `AuditEvent` (OccurredAtUtc, IngestedAtUtc), `SiteCall`
(CreatedAtUtc, UpdatedAtUtc, TerminalAtUtc, IngestedAtUtc), and any other
`DateTime *Utc` columns added later. Add a regression test that inserts a UTC row,
re-reads it in a fresh context, and asserts `Kind == DateTimeKind.Utc`. Coordinate
with the sibling `Commons-019` finding so the resolution is consistent across both
modules.
### ConfigurationDatabase-019 — `EnsureLookaheadAsync` swallows non-idempotent SPLIT failures and continues, creating partition holes
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Maintenance/AuditLogPartitionMaintenance.cs:181-199` |
**Description**
`EnsureLookaheadAsync` loops one month at a time from `next` up to `horizon` and
issues `ALTER PARTITION SCHEME … NEXT USED` + `ALTER PARTITION FUNCTION … SPLIT RANGE`
per month. The class doc says idempotency is guaranteed by reading the max-boundary
first and only issuing SPLITs for strictly-greater months — so "boundary already
exists" (SQL Server msg 7708/7711) cannot occur by construction. Yet the loop wraps
each iteration in `catch (SqlException ex) { _logger.LogWarning(...); }` and
continues, with the rationale "the desired end state (boundary present) is satisfied
by either path."
That rationale is correct only for an "already-exists" error — which the pre-check
makes impossible. Any *other* `SqlException` — a permissions failure (the
`scadalink_audit_purger` role's `ALTER ON SCHEMA::dbo` revoked or not granted), a
deadlock victim, a transient connection drop, a transaction log full, an underlying
filegroup full — leaves the boundary genuinely **not** created, logs a Warning
(quiet by default in most appenders), and the next iteration tries to SPLIT the
following month. That split *can* succeed (it is a different range value), creating
a permanent **hole** in the partition layout: month N never had a partition created,
month N+1 does, so any future row in month N lands in the partition that previously
spanned both months and partition-switch purge for month N becomes impossible.
The class is the central singleton's daily-tick partition roll-forward, so the hole
persists until an operator notices it and rebuilds manually — by which point months
of audit retention may be locked behind the unsplit range.
**Recommendation**
Either (a) drop the `try/catch` entirely so any SPLIT failure aborts the loop and
surfaces to the hosted service (the next tick retries — at-least-once with no holes),
or (b) keep the catch but narrow it to ONLY the
"boundary-already-exists" errors (SQL Server msg 7708 and 7711) and log at Debug,
mirroring how `AuditLogRepository.InsertIfNotExistsAsync` narrowly catches 2601/2627.
Option (a) is preferable: by class-doc construction the catch should never fire, so
its only effect is to mask the real-failure case. Add tests that simulate a SPLIT
failure (e.g. a permission denial via a constrained test login) and assert the loop
aborts after the first failure with no further SPLITs.
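
Option (a) reduces to a loop with no catch. The sketch below models it with a delegate in place of the real `ALTER PARTITION` calls; `LookaheadSketch`, `EnsureLookahead`, and `splitRange` are hypothetical names, and treating `horizon` as inclusive is an assumption.

```csharp
using System;

public static class LookaheadSketch
{
    // Issues one split per month from next up to and including horizon.
    // There is deliberately no catch: the first failing split propagates,
    // so no later month is ever split past a missing boundary.
    public static int EnsureLookahead(DateTime next, DateTime horizon, Action<DateTime> splitRange)
    {
        var created = 0;
        for (var month = next; month <= horizon; month = month.AddMonths(1))
        {
            splitRange(month);
            created++;
        }
        return created;
    }
}
```

A failure in month N leaves months before N split and nothing after it, so the next tick resumes from N rather than from a layout with a hole.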
### ConfigurationDatabase-020 — `GetPartitionBoundariesOlderThanAsync` returns `DateTime` with `Kind=Unspecified`
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/AuditLogRepository.cs:378-387` |
**Description**
`GetPartitionBoundariesOlderThanAsync` reads `reader.GetDateTime(0)` and adds the
raw value to the returned list. SQL Server's `datetime2` materialises as
`DateTimeKind.Unspecified` on the ADO.NET side (see CD-018), so every returned
boundary has `Kind=Unspecified`. The sibling `AuditLogPartitionMaintenance.GetMaxBoundaryAsync`
(`AuditLogPartitionMaintenance.cs:103-104`) explicitly defends against this exact
issue by calling `DateTime.SpecifyKind(dt, DateTimeKind.Utc)` — exactly because EF /
ADO.NET strips the kind — but the repository method does not. Callers (the
`AuditLogPurgeActor`) that compare a returned boundary to `DateTime.UtcNow` get a
silently wrong comparison if they ever serialise to/from a string with a local-kind
assumption in between.
**Recommendation**
Wrap the read with `DateTime.SpecifyKind(reader.GetDateTime(0), DateTimeKind.Utc)`,
matching the explicit defensive pattern already in
`AuditLogPartitionMaintenance.GetMaxBoundaryAsync`. Better still: fix CD-018 (a value
converter on the column) so the defence at the read site is no longer required.
### ConfigurationDatabase-021 — `SwitchOutPartitionAsync` interpolates `monthBoundary` / staging table name into raw SQL
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/AuditLogRepository.cs:192-338` |
**Description**
`SwitchOutPartitionAsync` builds two large SQL batches via interpolated strings
(`sampleSql` and `sql`) that include `{monthBoundaryStr}` and `{stagingTableName}`
directly in the SQL text, and executes them via `ExecuteSqlRawAsync` /
`cmd.ExecuteScalarAsync`. Both values are constructed inside the method —
`monthBoundaryStr = monthBoundary.ToUniversalTime().ToString("yyyy-MM-dd HH:mm:ss")`
and `stagingTableName = $"AuditLog_Staging_{Guid.NewGuid():N}"` — and the formats are
fully controlled. SQL injection is therefore not possible as the code stands.
Two related concerns:
1. The format string `"yyyy-MM-dd HH:mm:ss"` truncates fractional seconds. The
partition function is seeded at `T00:00:00` exactly, so truncation happens to
produce the right boundary value today. A future change that adds a sub-second
boundary (or invokes `SwitchOutPartitionAsync` with a non-midnight value) would
silently round to the wrong partition with no error — and SWITCH PARTITION would
either fail loudly or succeed against the wrong month. Use
`"yyyy-MM-dd HH:mm:ss.fffffff"` to match the precision the migration seeds at,
and the rounding ambiguity disappears.
2. The pattern of "build a multi-statement DDL batch by string concatenation" is
robust today only by inspection. A code review tripwire — the CLAUDE.md note "the
data-access layer must not concatenate SQL" — would catch the pattern earlier;
converting the batch to a parameterised `sp_executesql` invocation (the inner
`EXEC sp_executesql @sql` already exists for the SWITCH itself) is the textbook
safe form even when the input is internally controlled.
**Recommendation**
(1) Switch `monthBoundaryStr`'s format to `"yyyy-MM-dd HH:mm:ss.fffffff"`. (2)
Optionally migrate the two batches to fully parameterised `sp_executesql` form so
the `monthBoundary` value flows as a typed `@boundary datetime2(7)` parameter
rather than as interpolated text — the only piece that genuinely *cannot* be
parameterised is the staging table identifier (DDL identifiers are not parameterisable
in T-SQL), but a server-side `QUOTENAME(@stagingTable)` wrapper covers it. Add a
regression test that supplies a non-midnight `monthBoundary` value and asserts the
boundary lookup resolves to the expected partition.
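
The truncation in concern (1) is easy to reproduce with the two format strings side by side. `BoundaryFormatSketch` is a hypothetical helper; `CultureInfo.InvariantCulture` is passed so the `:` separator is not culture-dependent.

```csharp
using System;
using System.Globalization;

public static class BoundaryFormatSketch
{
    // Current format: drops everything below whole seconds.
    public static string Truncating(DateTime boundaryUtc) =>
        boundaryUtc.ToString("yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture);

    // Proposed format: keeps all seven fractional digits of datetime2(7).
    public static string FullPrecision(DateTime boundaryUtc) =>
        boundaryUtc.ToString("yyyy-MM-dd HH:mm:ss.fffffff", CultureInfo.InvariantCulture);
}
```

For a midnight boundary both formats describe the same instant, which is why the issue is latent today.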
### ConfigurationDatabase-022 — Stale "WP-24 Stub level sufficient for diff/staleness support" XML comment on `DeploymentManagerRepository`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:8-14` |
**Description**
The class-level XML doc on `DeploymentManagerRepository` reads "WP-24: Stub level
sufficient for diff/staleness support." WP-24 (Deployment Manager work-package) shipped
long ago; the repository now covers full `DeploymentRecord` CRUD,
`SystemArtifactDeploymentRecord` CRUD, `DeployedConfigSnapshot` CRUD, and an
`Instance` deletion path with explicit Restrict-FK cleanup
(`DeleteInstanceAsync` at lines 210-229). The comment misleads a reader into
thinking the repository is incomplete and tempts them not to investigate further
before adding new behaviour. The same module-context observation was noted but
not raised in the prior review.
**Recommendation**
Remove the WP-24 line and rewrite the class doc to describe what the repository
actually does today: EF Core implementation of `IDeploymentManagerRepository`
covering deployment records, system-artifact deployment records, deployed config
snapshots, and the Restrict-FK-aware `DeleteInstanceAsync` for the
deployment pipeline. Cross-reference the optimistic-concurrency contract on
`DeploymentRecord.RowVersion`.
### ConfigurationDatabase-023 — `AuditLog` correlation-index name drifts from design doc (`IX_AuditLog_CorrelationId` vs `IX_AuditLog_Correlation`)
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/AuditLogEntityTypeConfiguration.cs:99-101`, `Migrations/20260520142214_AddAuditLogTable.cs:103-107` |
**Description**
The Component-ConfigurationDatabase design doc lists the AuditLog indexes by name —
including `IX_AuditLog_Correlation (CorrelationId)` for the "drilldown from a single
operation" use case. The implemented index name is `IX_AuditLog_CorrelationId` (the
fluent-config `HasDatabaseName` call and the matching DDL in the migration both use
the `Id`-suffixed form). The names are syntactically valid SQL Server index names and
the index does the right work; the drift is cosmetic but it breaks scripted
maintenance ops that grep for the documented name (e.g. a runbook reindex script).
The other four documented index names (`IX_AuditLog_OccurredAtUtc`,
`IX_AuditLog_Site_Occurred`, `IX_AuditLog_Channel_Status_Occurred`,
`IX_AuditLog_Target_Occurred`, plus the post-design additions
`IX_AuditLog_Execution`, `IX_AuditLog_ParentExecution`, `IX_AuditLog_Node_Occurred`)
agree with the code.
**Recommendation**
Pick one direction. Updating the design doc to match the code is cheap (one word) and
preserves the existing migration; renaming the index in the database requires a new
migration that does `sp_rename`. Document-aligning is the lower-cost option and
matches the resolution pattern used for CD-005.
### ConfigurationDatabase-024 — Missing test coverage for SPLIT-RANGE failure-continuation and production-shape rowversion delete
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.ConfigurationDatabase.Tests/Maintenance/AuditLogPartitionMaintenanceTests.cs`, `tests/.../RepositoryCoverageTests.cs:855-869` |
**Description**
`AuditLogPartitionMaintenanceTests` exercises the happy-path SPLIT-RANGE behaviour
(no-op, single-month, three-month, already-exists idempotency) but never simulates a
SPLIT *failure* — so the catch-and-continue behaviour flagged in CD-019 is
behaviourally untested. The class is a central singleton driving daily audit purge;
a regression that turned the failure path into a permanent hole would not surface in
the test suite.
Separately, `RepositoryCoverageTests.DeleteDeploymentRecord_ViaStubAttachPath_RemovesEntity`
covers the stub-attach delete path under the SQLite test fixture, but the fixture
remaps `RowVersion` as a nullable concurrency token (`SqliteTestHelper`), so it does
not exercise the production-shape `IsRowVersion()` auto-population — the actual
concurrency-token bug flagged in CD-017 cannot show up. There is an
`MsSqlMigrationFixture` in the test project already (used by the Audit Log migration
tests); the stub-attach delete deserves a parallel MS-SQL-flavoured test.
**Recommendation**
(1) Add an `AuditLogPartitionMaintenanceTests` case that constructs a context against
a constrained login (no `ALTER ON SCHEMA::dbo`), invokes `EnsureLookaheadAsync` for a
three-month gap, and asserts: only the partition boundaries created BEFORE the
permissions failure landed remain, and the call aborts cleanly without continuing to
later months. This pins down the resolution of CD-019. (2) Add a
`RepositoryCoverageTests` case that uses `MsSqlMigrationFixture` to insert a
`DeploymentRecord`, clear the change tracker, call `DeleteDeploymentRecordAsync`,
and assert the row is gone, pinning the resolution of CD-017. Both tests should be
`[SkippableFact]` so the suite still passes when no MS SQL Server is available.
+318 -4
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.DataConnectionLayer` |
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 5 |
## Summary
@@ -30,6 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
heuristic. Test coverage is adequate for the happy paths and failover but absent for
tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
#### Re-review 2026-05-28 (commit `1eb6e97`)
The 2026-05-28 re-review walked all 10 checklist categories against the current
source and found **5 new findings**. All 17 prior findings remain `Resolved` and the
fixes (reverse-index unsubscribe, atomic disconnect guards, real-logger threading,
initial-connect failover, per-tag write-batch results, subscribe-response accuracy)
were verified in place. The new findings cluster around `HandleSubscribe` /
`HandleSubscribeCompleted` race-induced state drift:
- **High** — concurrent subscribes for the same tag from different instances each see
the tag as not-yet-subscribed (the `alreadySubscribed` snapshot was taken before
the Task.Run dispatch), so each Task.Run calls `_adapter.SubscribeAsync` and the
later `HandleSubscribeCompleted` silently discards the second adapter subscription
handle — the orphan never gets `UnsubscribeAsync`'d.
- **Medium** — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>`
mutated from thread-pool continuations of `SubscribeAsync` / `UnsubscribeAsync` /
`DisconnectAsync` running in parallel — the same class of bug DCL-003 fixed in
`RealOpcUaClient` but missed in the layer above.
- **Medium** — `HandleSubscribeCompleted`'s success branch never checks
`_unresolvedTags`, so a tag that previously failed resolution (incrementing
`_totalSubscribed`) and is then successfully subscribed by a different instance gets
`_totalSubscribed++` a second time, double-counting; meanwhile the unresolved entry
lingers until the retry timer also resolves it, creating an orphaned monitored item.
- **Medium** — when an instance is unsubscribed mid-flight,
`HandleSubscribeCompleted` re-creates an empty `_subscriptionsByInstance[name]`
entry and processes the late results, leaking `_tagSubscriberCount` /
`_totalSubscribed` / `_resolvedTags` increments for an instance with no
`_subscribers` entry to deliver values to.
- **Medium** — `HandleSubscribeCompleted` calls `Timers.StartPeriodicTimer` on every
completed subscribe with unresolved tags; in Akka.NET, `StartPeriodicTimer` with the
same key cancels and replaces the existing timer, so a burst of subscribes arriving
faster than `TagResolutionRetryInterval` (10 s default) keeps resetting the timer
and the retry never actually fires.
#### Re-review 2026-05-17 (commit `39d737e`)
All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
@@ -50,7 +84,22 @@ so a mid-batch disconnect aborts the whole write batch (the same class of defect
DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
`DataConnectionLayer-014`.
## Checklist coverage
## Checklist coverage (2026-05-28 re-review, commit `1eb6e97`)
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | Findings 020 (double-count `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance) and 021 (leaked `_subscriptionsByInstance` entry + counter increments when instance unsubscribes mid-flight). |
| 2 | Akka.NET conventions | x | Finding 022 — `Timers.StartPeriodicTimer` reset on every `HandleSubscribeCompleted` for unresolved tags can stall the retry timer indefinitely under a subscribe burst. |
| 3 | Concurrency & thread safety | x | Finding 018 — concurrent subscribes for the same tag from different instances each spawn an adapter subscription and the second handle is orphaned. Finding 019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from thread-pool continuations (same class of bug as DCL-003 one layer above). |
| 4 | Error handling & resilience | x | No new issues; DCL-004 / DCL-007 / DCL-015 / DCL-017 fixes verified in place. |
| 5 | Security | x | No new issues; DCL-012 / DCL-014 fixes verified. The Commons-side `OpcUaEndpointConfig.AutoAcceptUntrustedCerts = true` default surfaced in DCL-012 is still present but is out of this module's scope. |
| 6 | Performance & resource management | x | No new issues; DCL-008 reverse index verified. (Finding 018's orphaned adapter handle is logged under concurrency.) |
| 7 | Design-document adherence | x | No new issues. DCL-009's design-doc action (document unstable-disconnect failover trigger + configurable threshold) is still open at the doc level but out of this module's scope. |
| 8 | Code organization & conventions | x | No issues — POCOs in Commons, options class owned by component, factory + DI registration consistent. |
| 9 | Testing coverage | x | DCL-001 through DCL-017 regression tests present. Gaps remain for finding 018 (concurrent subscribe of same tag from two instances), 019 (concurrent `_subscriptionHandles` mutation), 020 (resolve-via-different-instance), 021 (unsubscribe-mid-flight), 022 (timer-reset starvation). |
| 10 | Documentation & comments | x | No new issues; DCL-013 atomic-guard XML comments verified. |
## Checklist coverage (2026-05-17 re-review, commit `39d737e`)
| # | Category | Examined | Notes |
|---|----------|----------|-------|
@@ -896,3 +945,268 @@ unhandled exception. Regression test
`DCL017_WriteBatch_ReturnsPerTagResults_WhenConnectionDropsMidBatch` fails against the
pre-fix code (the batch throws, no map returned) and passes after;
`DCL017_WriteBatch_CancellationAbortsWholeBatch` guards that cancellation still aborts.
### DataConnectionLayer-018 — Concurrent subscribes for the same tag from different instances orphan an adapter subscription handle
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:557,564-594,653` |
**Description**
`HandleSubscribe` snapshots `_subscriptionIds.Keys` into a local `alreadySubscribed`
set on the actor thread before dispatching the `Task.Run` that performs the adapter
I/O (line 557). The snapshot is the only basis on which the background task decides
whether to call `_adapter.SubscribeAsync` — and it is taken **once**, before the I/O
runs.
If two `SubscribeTagsRequest` messages arrive on the actor thread for different
instances that both reference the same tag path, both `HandleSubscribe` invocations
take a snapshot at a time when neither subscribe has completed, so `alreadySubscribed`
does not contain the shared tag in either snapshot. Both background tasks then call
`_adapter.SubscribeAsync(tagPath, ...)`, the adapter creates **two** monitored items
and returns two distinct subscription ids, and each task pipes a `SubscribeCompleted`
back to the actor with `AlreadySubscribed: false, Success: true`.
`HandleSubscribeCompleted` for the first message takes the success branch and writes
`_subscriptionIds[tagPath] = subId1`. The second message arrives, hits the
"already in `_subscriptionIds`" guard at line 653 (`_subscriptionIds.ContainsKey(...)`)
and `continue`s — but `result.SubscriptionId` (the orphan handle for the second
adapter subscription) is silently discarded. The orphan monitored item stays alive in
the OPC UA session for the lifetime of the adapter, sending duplicate data-change
notifications (whose callbacks were stamped with the captured `generation`) into
`HandleTagValueReceived` for every value change. Across a deploy that creates many
instances sharing a few tags, this leaks N-1 monitored items per shared tag and
doubles/triples the per-tag publish traffic.
DCL-010 fixed an analogous duplicate-dispatch bug for the tag-resolution retry path
via `_resolutionInFlight`; the equivalent guard is missing on the user-subscribe
path.
**Recommendation**
Track in-flight subscribes the same way DCL-010 tracks in-flight retries: maintain a
`HashSet<string> _subscribesInFlight` and add `tagPath` to it on the actor thread
**before** the `Task.Run` dispatch, only for tags not already in
`_subscriptionIds` and not already in `_subscribesInFlight`. Tags that are already
in flight should produce a `SubscribeTagResult(..., AlreadySubscribed: true, ...)`
without touching the adapter. Remove from `_subscribesInFlight` in
`HandleSubscribeCompleted` once the result is applied. Add a regression test that
fans two simultaneous `SubscribeTagsRequest` messages for the same tag and asserts
exactly one `_adapter.SubscribeAsync(tag, ...)` call (and no orphan subscription id).
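The in-flight guard recommended above can be sketched as a small, single-threaded model of the actor-side bookkeeping. This is a minimal sketch, not the module's code: `SubscribeGate`, `ClaimTagsToSubscribe` and `Complete` are hypothetical names, and only `_subscriptionIds` and `_subscribesInFlight` come from the finding. It assumes both methods run on the actor thread, so no locking is shown.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical model of the actor-thread bookkeeping; not the real DataConnectionActor.
public sealed class SubscribeGate
{
    private readonly Dictionary<string, string> _subscriptionIds = new();
    private readonly HashSet<string> _subscribesInFlight = new();

    // Called on the actor thread BEFORE the Task.Run dispatch.
    // Returns only the tags that still need an adapter call.
    public List<string> ClaimTagsToSubscribe(IEnumerable<string> requestedTags)
    {
        var toSubscribe = new List<string>();
        foreach (var tag in requestedTags)
        {
            if (_subscriptionIds.ContainsKey(tag)) continue; // already subscribed
            if (!_subscribesInFlight.Add(tag)) continue;     // another request owns the I/O
            toSubscribe.Add(tag);
        }
        return toSubscribe;
    }

    // Called on the actor thread when the piped completion arrives.
    // A null subscriptionId models a failed subscribe.
    public void Complete(string tag, string? subscriptionId)
    {
        _subscribesInFlight.Remove(tag);
        if (subscriptionId != null)
            _subscriptionIds[tag] = subscriptionId;
    }
}
```

With this shape, a second request for a tag that is still in flight gets an empty claim list and never reaches the adapter; how the second instance is told about a later failure of the first subscribe is left open here.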
### DataConnectionLayer-019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from concurrent thread-pool continuations
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:31,167,177`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:163-164` |
**Description**
`OpcUaDataConnection._subscriptionHandles` is declared as `Dictionary<string,
string>`. It is mutated from:
- `SubscribeAsync` (line 167): `_subscriptionHandles[subscriptionId] = tagPath;`
after an `await _client!.CreateSubscriptionAsync(...)` — i.e. the assignment
executes on the continuation thread (a thread-pool thread).
- `UnsubscribeAsync` (line 177): `_subscriptionHandles.Remove(subscriptionId);`
similarly after an `await`.
- `DisconnectAsync` only delegates to the underlying `_client.DisconnectAsync` and does
**not** touch `_subscriptionHandles`, so it is not itself a writer; the hazard is that
multiple `SubscribeAsync` / `UnsubscribeAsync` calls can run in parallel from the upper layer.
The DCL upper layer calls `_adapter.SubscribeAsync` from multiple places that all
run off the actor thread:
- `DataConnectionActor.HandleSubscribe` inside its `Task.Run` (multiple invocations
can run in parallel — see DCL-018);
- `HandleRetryTagResolution` issues `_adapter.SubscribeAsync` for every tag in
`_unresolvedTags` and pipes the continuation (each subscribe runs concurrently
via the SDK's async machinery);
- `ReSubscribeAll` does the same after a reconnect.
So plain-`Dictionary` mutations occur on multiple thread-pool threads concurrently —
the exact pattern DCL-003 fixed by switching `RealOpcUaClient._monitoredItems` and
`_callbacks` to `ConcurrentDictionary<,>`. Plain `Dictionary` mutations during a
concurrent resize are undefined behaviour: they can throw
`InvalidOperationException`, corrupt the internal hash buckets, or lose entries.
`_subscriptionHandles` is currently dead state (the dictionary is written to
and `Remove`d but **never read**), so a corruption today would not crash the
subscribe path — but the bug is latent and the field will become load-bearing the
moment any code reads it (e.g., to expose a subscription-id-to-tag-path lookup for
diagnostics, which is what the dictionary's name suggests it was intended for).
**Recommendation**
Either (a) change `_subscriptionHandles` to
`ConcurrentDictionary<string, string>` and use `TryAdd` / `TryRemove`, mirroring
DCL-003's fix one layer down, or (b) delete the field entirely since it is never
read — the bookkeeping is fully owned by `RealOpcUaClient._monitoredItems` /
`_callbacks` and `DataConnectionActor._subscriptionIds`. Removing it eliminates the
race and removes dead state in one stroke. Add a regression test (or extend
`DCL003_SharedDictionaryFields_AreConcurrentCollections`) that asserts no
non-concurrent `Dictionary` field is shared across thread boundaries in adapter
state.
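Option (a) above could look roughly like the following. This is a sketch under stated assumptions: `SubscriptionHandleRegistry` and `RegistryDemo` are hypothetical wrappers introduced only to keep the example self-contained, and the real adapter would hold the field directly.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical wrapper; the finding only asks for the field type and TryAdd/TryRemove.
public sealed class SubscriptionHandleRegistry
{
    private readonly ConcurrentDictionary<string, string> _subscriptionHandles = new();

    // Safe to call from any thread-pool continuation.
    public bool Register(string subscriptionId, string tagPath) =>
        _subscriptionHandles.TryAdd(subscriptionId, tagPath);

    public bool Unregister(string subscriptionId) =>
        _subscriptionHandles.TryRemove(subscriptionId, out _);

    public int Count => _subscriptionHandles.Count;
}

public static class RegistryDemo
{
    // Registers 1000 distinct ids in parallel, then removes the first 500 in parallel.
    public static int Run()
    {
        var registry = new SubscriptionHandleRegistry();
        Parallel.For(0, 1000, i => registry.Register($"sub-{i}", $"Tag{i}"));
        Parallel.For(0, 500, i => registry.Unregister($"sub-{i}"));
        return registry.Count;
    }
}
```

If the field really is never read, option (b), deleting it, removes the need for any of this.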
### DataConnectionLayer-020 — `HandleSubscribeCompleted` double-counts `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance's subscribe
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:653-661,670-688` |
**Description**
`HandleSubscribeCompleted`'s success branch (line 656-661) writes
`_subscriptionIds[result.TagPath] = result.SubscriptionId!; _totalSubscribed++;
_resolvedTags++;`. The guard at line 653 only skips when the tag is already in
`_subscriptionIds`; it does **not** check `_unresolvedTags`. So the success branch
runs for a tag that previously failed resolution from an earlier instance's subscribe
(which incremented `_totalSubscribed` and added the tag to `_unresolvedTags` at line
674-676) and is now successfully subscribed by a later instance.
Sequence:
1. Instance A subscribes "Tag1". `_adapter.SubscribeAsync` throws a non-connection-level
exception. `HandleSubscribeCompleted` takes the resolution-failure branch:
`_unresolvedTags.Add("Tag1"); _totalSubscribed++;` (now 1).
2. The device finishes booting. Instance B subscribes "Tag1". `_adapter.SubscribeAsync`
succeeds, returning `subId`. `HandleSubscribeCompleted` takes the success branch:
`_subscriptionIds["Tag1"] = subId; _totalSubscribed++; _resolvedTags++;`
(now `_totalSubscribed = 2`, `_resolvedTags = 1`).
3. `_unresolvedTags` still contains "Tag1". The retry timer fires next tick,
`HandleRetryTagResolution` dispatches `SubscribeAsync("Tag1", ...)` against the
adapter (creating a **second** monitored item for the same tag), and
`HandleTagResolutionSucceeded` runs `_unresolvedTags.Remove("Tag1")` →
`_subscriptionIds["Tag1"] = newSubId` (overwriting Instance B's id, orphaning that
monitored item) → `_resolvedTags++` (now 2, matching `_totalSubscribed`).
Net effect:
- `_totalSubscribed` is over-counted by 1 from step 2 until step 3 reconciles
`_resolvedTags`. During that window the health report's "subscribed / resolved"
ratio is wrong.
- Two adapter subscription handles for the same tag are leaked across this race
(DCL-018's orphan plus the retry's second adapter call); the second leaks
permanently because `_subscriptionIds["Tag1"]` only stores the most recent id.
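The sequence above can be replayed against a small model of the counters. The sketch below assumes one possible fix shape: the success path promotes a tag that is already in `_unresolvedTags` instead of counting it again. `SubscriptionCounters` and its method names are hypothetical; the field names mirror the ones cited in this finding.

```csharp
using System.Collections.Generic;

// Hypothetical counter model; not the real DataConnectionActor.
public sealed class SubscriptionCounters
{
    private readonly Dictionary<string, string> _subscriptionIds = new();
    private readonly HashSet<string> _unresolvedTags = new();
    private readonly HashSet<string> _resolutionInFlight = new();

    public int TotalSubscribed { get; private set; }
    public int ResolvedTags { get; private set; }
    public bool IsUnresolved(string tagPath) => _unresolvedTags.Contains(tagPath);

    // Resolution-failure branch: the tag is counted once, when first marked unresolved.
    public void OnResolutionFailed(string tagPath)
    {
        if (_subscriptionIds.ContainsKey(tagPath)) return;
        if (_unresolvedTags.Add(tagPath)) TotalSubscribed++;
    }

    // Success branch: promote a previously-unresolved tag without re-counting it.
    public void OnSubscribeSucceeded(string tagPath, string subscriptionId)
    {
        if (_subscriptionIds.ContainsKey(tagPath)) return;
        bool wasUnresolved = _unresolvedTags.Remove(tagPath);
        _resolutionInFlight.Remove(tagPath);
        _subscriptionIds[tagPath] = subscriptionId;
        if (!wasUnresolved) TotalSubscribed++;
        ResolvedTags++;
    }
}
```

Because the promotion also removes the tag from `_unresolvedTags`, the retry timer has nothing left to re-subscribe for that tag in this model.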
**Recommendation**
In `HandleSubscribeCompleted`'s success branch, before the `_totalSubscribed++`,
check `_unresolvedTags.Remove(result.TagPath)` — if the tag was already counted as
unresolved, promote it without re-incrementing `_totalSubscribed` (mirror
`HandleTagResolutionSucceeded`'s shape: only increment `_resolvedTags`,
`_subscriptionIds[tag] = subId`, and clear `_resolutionInFlight`). Add a regression
test that asserts `_totalSubscribed` / `_resolvedTags` consistency after the
"resolve via a second instance" sequence.
### DataConnectionLayer-021 — `HandleSubscribeCompleted` re-creates and leaks `_subscriptionsByInstance` entry when the instance unsubscribed mid-flight
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:626-634,642-687` |
**Description**
`HandleSubscribe` dispatches a `Task.Run` that performs adapter I/O off the actor
thread and pipes a `SubscribeCompleted` back. If an `UnsubscribeTagsRequest` for the
same instance is processed on the actor thread between dispatch and completion,
`HandleUnsubscribe` removes the instance from both `_subscriptionsByInstance` and
`_subscribers`. When the late `SubscribeCompleted` arrives,
`HandleSubscribeCompleted` (line 629-634) **re-creates** the
`_subscriptionsByInstance[instanceName] = new HashSet<string>()` entry and proceeds
to apply the results — but `_subscribers[instanceName]` was already removed by the
unsubscribe and is **not** re-added.
Consequences:
1. `_subscriptionsByInstance` keeps a permanently-leaked entry for an instance that
no longer exists. `ReSubscribeAll` derives its tag list from
`_subscriptionsByInstance.Values` and will keep re-subscribing the leaked tags on
every future reconnect.
2. For each tag, `_tagSubscriberCount[tagPath]` is incremented (line 647-649), so the
reverse index treats the leaked instance as a real subscriber. The only way to
drop the count is another `HandleUnsubscribe` for the same instance — which can
never arrive because the Instance Actor that owned the instance is gone.
3. The success branch increments `_totalSubscribed` / `_resolvedTags` (or
`_unresolvedTags` for genuine resolution failures), drifting health counters
permanently above the actual subscribed instance count.
4. Subsequent `HandleTagValueReceived` fanout iterates `_subscriptionsByInstance` and
skips this entry via the `_subscribers.TryGetValue` check (line 1019), so values
are silently dropped — but the work of fanning them out (the iteration and the
tag lookup) is still done for every value update on every leaked tag, forever.
5. The genuine-resolution-failure path at line 682-686 (`subscriber.Tell(new
TagValueUpdate(..., QualityCode.Bad, ...))`) also silently no-ops because
`_subscribers.TryGetValue` is false — so the design doc's "push bad quality on
resolution failure" promise is broken for this case (a minor, edge-case wrinkle).
**Recommendation**
In `HandleSubscribeCompleted`, when `_subscriptionsByInstance.TryGetValue` fails,
treat the result as obsolete: log it and `return` without re-creating the entry or
applying any state mutations. Any successfully-created adapter subscriptions in
`msg.Results` should be cleaned up — iterate the results and
`_adapter.UnsubscribeAsync(result.SubscriptionId!)` (fire-and-forget) for each
successful one so the orphan handles do not leak in the adapter. Add a regression
test that subscribes from instance A, immediately sends an `UnsubscribeTagsRequest`
for A while the subscribe I/O is in flight, completes the subscribe, and asserts
`_subscriptionsByInstance`, `_tagSubscriberCount` and health counters are all clean.
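A minimal sketch of the "treat the late result as obsolete" handling follows. `CompletionHandler`, the simplified `SubscribeTagResult` record, and the `Action<string>` stand-in for a fire-and-forget `_adapter.UnsubscribeAsync` are all assumptions made so the example is self-contained; the real message types carry more fields.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Simplified stand-in for the real result type.
public sealed record SubscribeTagResult(string TagPath, string? SubscriptionId, bool Success);

public sealed class CompletionHandler
{
    private readonly Dictionary<string, HashSet<string>> _subscriptionsByInstance = new();
    private readonly Action<string> _unsubscribeFromAdapter; // stand-in for _adapter.UnsubscribeAsync

    public CompletionHandler(Action<string> unsubscribeFromAdapter) =>
        _unsubscribeFromAdapter = unsubscribeFromAdapter;

    public void RegisterInstance(string instanceName) =>
        _subscriptionsByInstance[instanceName] = new HashSet<string>();

    public void UnregisterInstance(string instanceName) =>
        _subscriptionsByInstance.Remove(instanceName);

    public bool IsTracked(string instanceName) =>
        _subscriptionsByInstance.ContainsKey(instanceName);

    // Returns false when the completion was stale and discarded.
    public bool HandleSubscribeCompleted(string instanceName, IReadOnlyList<SubscribeTagResult> results)
    {
        if (!_subscriptionsByInstance.TryGetValue(instanceName, out var tags))
        {
            // Instance unsubscribed mid-flight: release only the handles this request created.
            foreach (var r in results)
                if (r.Success && r.SubscriptionId != null)
                    _unsubscribeFromAdapter(r.SubscriptionId);
            return false; // no entry re-created, no counters touched
        }

        foreach (var r in results)
            if (r.Success) tags.Add(r.TagPath);
        return true;
    }
}
```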
### DataConnectionLayer-022 — `HandleSubscribeCompleted` and `HandleTagResolutionFailed` reset the tag-resolution retry timer on every call via `StartPeriodicTimer`, starving the retry under subscribe bursts
| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:691-698,991-998` |
**Description**
`HandleSubscribeCompleted` (line 691-698) and `HandleTagResolutionFailed` (line
991-998) both call:
```
Timers.StartPeriodicTimer(
    "tag-resolution-retry",
    new RetryTagResolution(),
    _options.TagResolutionRetryInterval,
    _options.TagResolutionRetryInterval);
```
`Akka.Actor.ITimerScheduler.StartPeriodicTimer(key, ...)` cancels and replaces any
existing timer registered under the same key. So every additional subscribe (or
every additional tag-resolution failure) that produces unresolved tags **resets** the
retry timer's countdown to the full interval — the timer never accumulates
elapsed time across calls.
With the default `TagResolutionRetryInterval = 10s`, an instance-startup burst that
produces a new `SubscribeTagsRequest` every 5s (a not-unusual cadence during
deployment fan-out) will keep cancelling the not-yet-fired retry every 5s, so the
"periodic" retry never actually fires until subscribes go quiet. In a steady-state
site with many instances deploying together this can delay tag resolution by tens
of seconds, leaving attributes at `Bad` quality longer than the documented retry
interval implies.
**Recommendation**
Start the periodic timer once, when the actor first transitions to having
non-empty `_unresolvedTags`, and only re-start it after `Timers.Cancel(...)` has
been called (e.g., when the actor enters `Reconnecting`). The cleanest pattern is to
gate the start with `if (!Timers.IsTimerActive("tag-resolution-retry"))` before
calling `StartPeriodicTimer` (`IsTimerActive` is on `ITimerScheduler`). Apply the
same gate at both call sites. Add a regression test that fires 5 subscribes with
unresolved tags within one retry interval and asserts the retry fires at most one
interval after the first failure (not after the fifth subscribe).
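The starvation and the gated fix can be illustrated with a toy model. `FakeTimerScheduler` is a hypothetical stand-in that copies only the replace-on-same-key behaviour described above; it is not the Akka.NET scheduler and never actually fires. In the real actor the gate would be the `Timers.IsTimerActive("tag-resolution-retry")` check from the recommendation.

```csharp
using System;
using System.Collections.Generic;

// Toy stand-in: starting a timer under an existing key replaces its next-fire time.
public sealed class FakeTimerScheduler
{
    private readonly Dictionary<string, TimeSpan> _nextFire = new();

    public bool IsTimerActive(string key) => _nextFire.ContainsKey(key);

    public void StartPeriodicTimer(string key, TimeSpan now, TimeSpan interval) =>
        _nextFire[key] = now + interval;

    public TimeSpan NextFire(string key) => _nextFire[key];
}

public static class RetryTimerDemo
{
    private const string Key = "tag-resolution-retry";

    // Five subscribes with unresolved tags arrive 5 s apart; the retry interval is 10 s.
    // Returns the simulated time at which the retry is next due after the burst.
    public static TimeSpan FirstFire(bool gated)
    {
        var timers = new FakeTimerScheduler();
        var interval = TimeSpan.FromSeconds(10);
        for (int i = 0; i < 5; i++)
        {
            var now = TimeSpan.FromSeconds(5 * i);
            if (!gated || !timers.IsTimerActive(Key))
                timers.StartPeriodicTimer(Key, now, interval);
        }
        return timers.NextFire(Key);
    }
}
```

In this model the ungated variant is not due until 30 s (each subscribe pushes the deadline out), while the gated variant is due at 10 s, one interval after the first failure.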
+335 -13
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.DeploymentManager` |
| Design doc | `docs/requirements/Component-DeploymentManager.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
## Summary
@@ -53,20 +53,52 @@ DeploymentManager-016). The `GetDeploymentStatusAsync` XML doc is now stale —
it still describes the query-before-redeploy behaviour that actually moved into
`TryReconcileWithSiteAsync` (DeploymentManager-017).
#### Re-review 2026-05-28 (commit `1eb6e97`)
Re-reviewed at commit `1eb6e97` after the DeploymentManager-015/016/017 fixes
and a docs-only XML-comment pass. The three prior findings remain `Resolved`
and verified — `ApplyPostSuccessSideEffectsAsync` is now invoked from both the
normal success path and `TryReconcileWithSiteAsync`, the reconciled-success
branch corrects `prior.RevisionHash` to the target, and `GetDeploymentStatusAsync`'s
XML doc now describes the local-DB-read it actually performs and cross-refs the
reconciliation helper. The DiffService wiring, options binding, ref-counted
operation lock, broadened catch, non-cancellable cleanup, and TestKit-actor
test seam are still in place. The 7 new findings here are not regressions in
the DeploymentManager-015/016 fixes — they are issues uncovered by widening
the lens to the lifecycle paths, reconciliation's interaction with
intentional `Disabled` state, audit semantics, and operational concerns
(per-site artifact-build cost, Pending→InProgress double-write).
The single notable correctness issue is DeploymentManager-018: the
reconciliation shortcut unconditionally sets `instance.State = Enabled` via
`ApplyPostSuccessSideEffectsAsync`. After a central failover that loses the
in-memory operation lock, a user can legitimately `Disable` an instance whose
prior deploy record is still `InProgress`; a subsequent redeploy then reconciles
and silently re-enables the instance against the user's explicit intent.
The remaining six findings are medium/low: lifecycle-timeout audit gap
(DeploymentManager-019), audit-user attribution in reconciliation
(DeploymentManager-020), silent fallback in `ResolveSiteIdentifierAsync`
(DeploymentManager-021), back-to-back `Pending`→`InProgress` writes
(DeploymentManager-022), per-site re-query of system-wide artifacts
(DeploymentManager-023), and shared static state across `*ProbeActor` tests
(DeploymentManager-024).
## Checklist coverage
#### Re-review 2026-05-28 (commit `1eb6e97`)
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Re-review 2026-05-17: reconciliation skips instance-state/snapshot updates (DeploymentManager-015) and keeps a stale `RevisionHash` (DeploymentManager-016). Prior: stuck `InProgress` / cancelled-token write (resolved). |
| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counts and reclaims semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. No issues at re-review. |
| 4 | Error handling & resilience | ✓ | Prior gaps DeploymentManager-001/002/003/004 resolved and verified. No new issues. |
| 5 | Security | ✓ | SMTP credential handling documented as an accepted design decision (DeploymentManager-013). No injection vectors; no authz here (enforced upstream). No new issues. |
| 6 | Performance & resource management | ✓ | Semaphore leak resolved (DeploymentManager-005). No new issues. |
| 7 | Design-document adherence | ✓ | Query-before-redeploy and Diff View implemented (DeploymentManager-006/007). Re-review: reconciliation path breaks the deployed-snapshot/instance-state invariants — see DeploymentManager-015. |
| 8 | Code organization & conventions | ✓ | Options binding resolved (DeploymentManager-008). POCO/repo placement correct. No new issues. |
| 9 | Testing coverage | ✓ | Broad coverage added (success, lifecycle, lock serialization, reconciliation, artifact matrix). Re-review: reconciled-success path's missing side effects (DeploymentManager-015) are untested. |
| 10 | Documentation & comments | ✓ | Prior comment findings resolved. Re-review: `GetDeploymentStatusAsync` XML doc is now stale — DeploymentManager-017. |
| 1 | Correctness & logic bugs | ✓ | New: reconciliation forces `Enabled` even if the user disabled the instance in between (DeploymentManager-018). |
| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. No issues. |
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counting verified. Note: test probes hold static state (DeploymentManager-024) — a test concern, not production code. |
| 4 | Error handling & resilience | ✓ | New: Disable/Enable/Delete timeouts return early without writing any audit entry — deploy has `DeployFailed`, lifecycle has nothing (DeploymentManager-019). |
| 5 | Security | ✓ | No new issues. SMTP credential decision documented (DeploymentManager-013 closed). |
| 6 | Performance & resource management | ✓ | New: `BuildDeployArtifactsCommandAsync` re-queries every system-wide artifact set per site in `DeployToAllSitesAsync` (DeploymentManager-023). |
| 7 | Design-document adherence | ✓ | Reconciliation now performs post-success side effects (DeploymentManager-015 resolved). DeploymentManager-018 surfaces a new gap on `Disabled`-state preservation. |
| 8 | Code organization & conventions | ✓ | New: redundant `Pending`→`InProgress` back-to-back write with no intervening work (DeploymentManager-022). Silent string-fallback in `ResolveSiteIdentifierAsync` (DeploymentManager-021). |
| 9 | Testing coverage | ✓ | New: no coverage for the reconciliation-overwrites-Disabled case (part of DeploymentManager-018); test probes share static state across tests (DeploymentManager-024). |
| 10 | Documentation & comments | ✓ | New: `DeployReconciled` audit uses `prior.DeployedBy` instead of the current `user` parameter — misleading for forensics (DeploymentManager-020). |
## Findings
@@ -873,3 +905,293 @@ database as a pure local read, and cross-references `TryReconcileWithSiteAsync`
as where the query-the-site-before-redeploy reconciliation actually lives.
Documentation-only change; no regression test (a test asserting comment text
would be meaningless).
### DeploymentManager-018 — Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:675-682,721-748` |
**Description**
`TryReconcileWithSiteAsync` calls `ApplyPostSuccessSideEffectsAsync` whenever
the site reports it has the target revision hash, and that helper
unconditionally writes `instance.State = InstanceState.Enabled`. The
reconciliation shortcut only runs when the prior `DeploymentRecord` is
`InProgress` or timeout-`Failed` — exactly the scenarios that survive a central
failover (the in-memory `OperationLockManager` is lost on failover, by design:
*"Lost on central failover (acceptable per design — in-progress treated as
failed)"*).
After such a failover, the per-instance operation lock is gone but the
deployment record is still `InProgress` in the DB. A user can legitimately
issue `DisableInstanceAsync` for the same instance — there is nothing in
`DisableInstanceAsync` that consults the deployment record, only the
`StateTransitionValidator` over `Instance.State`. If the state is `Enabled`
(the typical case when the deploy started), the disable proceeds, the site
honours it (the design states a disabled instance retains its deployed
configuration), and central now persists `Instance.State = Disabled`. The
deployment-record row remains `InProgress` (no one transitioned it). Later the
user retries the deploy: `TryReconcileWithSiteAsync` runs, the site still has
the target revision hash (Disable doesn't change the deployed config), the
prior record is marked `Success`, and `ApplyPostSuccessSideEffectsAsync` writes
`Instance.State = Enabled` — silently overriding the user's explicit Disable.
The same trap exists for any direct DB edit / migration that flipped the state
between the timed-out deploy and the redeploy. The normal deploy path can
defensibly assume `Enabled` after a fresh successful apply, but the
reconciliation path is reconciling *prior* state with *prior* user intent; it
should preserve `Disabled` if that is the current `Instance.State` at the time
of reconciliation, mirroring the design's separation between deploy (config
apply) and disable (subscription/script lifecycle).
**Recommendation**
In the reconciliation branch, do not force `Enabled`. Either:
- Pass a flag/parameter to `ApplyPostSuccessSideEffectsAsync` telling it
whether to touch state, and skip the state write on the reconciliation path
(leaving the current `Instance.State` intact, which is already `Enabled`
for a fresh deploy that timed out and `Disabled` for the user-disabled
follow-up case); or
- Only set `Enabled` when the current `Instance.State` is `NotDeployed` (i.e.
the first-deploy timed-out case), and leave existing `Enabled`/`Disabled`
alone.
Add a regression test where an instance with `Instance.State = Disabled` and a
prior `InProgress` deployment record is reconciled — the resulting
`Instance.State` must remain `Disabled`, and the deployment record must still
be marked `Success`.
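The second option above reduces to a small pure function, sketched here. `ReconciliationStatePolicy` and `StateAfterSuccess` are hypothetical names, and the local `InstanceState` enum is a stand-in that lists only the three states this finding mentions.

```csharp
// Stand-in enum; the real InstanceState may have more members.
public enum InstanceState { NotDeployed, Enabled, Disabled }

public static class ReconciliationStatePolicy
{
    // Decides which state to persist after a deployment is recorded as successful.
    public static InstanceState StateAfterSuccess(InstanceState current, bool isReconciliation)
    {
        // A fresh, successful apply can defensibly assume Enabled.
        if (!isReconciliation) return InstanceState.Enabled;

        // Reconciliation only promotes the first-deploy-timed-out case;
        // an existing Enabled or Disabled state reflects user intent and is kept.
        return current == InstanceState.NotDeployed ? InstanceState.Enabled : current;
    }
}
```

Keeping the decision in one function would also make the regression test a plain table of inputs and expected states.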
### DeploymentManager-019 — Lifecycle command timeout writes no audit entry
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458` |
**Description**
`DisableInstanceAsync`, `EnableInstanceAsync`, and `DeleteInstanceAsync` each
wrap the `CommunicationService` call in a linked CTS with
`LifecycleCommandTimeout` (DeploymentManager-012). On timeout they log a
warning and `return Result<...>.Failure(...)` — and skip the
`_auditService.LogAsync` call entirely. As a result, an operator-initiated
disable/enable/delete that times out at the site leaves **no audit trail**:
the user, the timestamp, the command id, and the failure mode are not
recorded in the audit log. The deploy path goes out of its way to write a
`DeployFailed` audit entry on the same failure mode
(`DeploymentService.cs:274-276`), with `CancellationToken.None` so the write is
durable; the lifecycle commands do not.
The design lists audit logging as a Deployment Manager responsibility for "all
deployment actions, system-wide artifact deployments, and instance lifecycle
changes" — a timed-out lifecycle command **is** an attempted lifecycle change,
and the operator action is exactly the kind of event the audit log exists to
record.
**Recommendation**
In each of the three `catch (Exception ex) when (ex is TimeoutException or
OperationCanceledException)` blocks, write a `DisableTimeout`/`EnableTimeout`/
`DeleteTimeout` (or use the existing operation name with a failure flag)
audit entry with `CancellationToken.None` so a cancelled outer token does not
prevent the audit write, mirroring `DeployFailed`. Add a unit test asserting
that `DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait`
also produces an audit entry.
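A sketch of the timeout branch writing an audit entry is shown below. It is an illustration under assumptions: `IAuditSink`, `InMemoryAuditSink` and `LifecycleCommands.DisableAsync` are hypothetical, the real `_auditService.LogAsync` signature is not shown in this review, and `"DisableTimeout"` is one of the action names the recommendation suggests.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical audit abstraction for the example.
public interface IAuditSink
{
    Task LogAsync(string user, string action, string detail, CancellationToken ct);
}

public sealed class InMemoryAuditSink : IAuditSink
{
    public List<string> Actions { get; } = new();

    public Task LogAsync(string user, string action, string detail, CancellationToken ct)
    {
        Actions.Add(action);
        return Task.CompletedTask;
    }
}

public static class LifecycleCommands
{
    public static async Task<bool> DisableAsync(
        string user,
        Func<CancellationToken, Task> sendToSite,
        IAuditSink audit,
        TimeSpan timeout,
        CancellationToken ct)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        cts.CancelAfter(timeout);
        try
        {
            await sendToSite(cts.Token);
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            // CancellationToken.None: a cancelled outer token must not suppress the audit write.
            await audit.LogAsync(user, "DisableTimeout", ex.GetType().Name, CancellationToken.None);
            return false;
        }

        await audit.LogAsync(user, "Disable", "ok", ct);
        return true;
    }
}
```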
### DeploymentManager-020 — `DeployReconciled` audit attributes the action to the prior deployer, not the current user
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:683-686` |
**Description**
In `TryReconcileWithSiteAsync` the audit call is:
```
await _auditService.LogAsync(prior.DeployedBy, "DeployReconciled", ...)
```
`prior.DeployedBy` is the user who issued the original (timed-out / stuck)
deployment, not the `user` parameter passed into `DeployInstanceAsync`. The
current user — the one who triggered the redeploy that produced the
reconciliation — is dropped on the floor. For audit forensics this is
misleading: the row will read "user A reconciled their own deployment"
when in fact user B initiated the action that reconciled it.
The original deployer is interesting context, but it should be carried in the
audit-detail object (where `DeploymentId` and `RevisionHash` already live), not
substituted for the actor.
**Recommendation**
Use `user` (the parameter on `DeployInstanceAsync`, threaded through
`TryReconcileWithSiteAsync`) as the audit actor, and include
`OriginalDeployer = prior.DeployedBy` in the detail object so the original
attribution is preserved without misrepresenting who took the action.
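The change amounts to swapping the actor and moving the original deployer into the detail payload. The sketch below models that with a hypothetical `AuditEntry` record and `ReconciliationAudit.Build` helper; the real audit call takes arguments that are elided in this review, so the detail is represented as a plain dictionary here.

```csharp
using System.Collections.Generic;

// Hypothetical shape: models only "who is the actor" versus "what goes in the detail".
public sealed record AuditEntry(string Actor, string Action, IReadOnlyDictionary<string, string> Detail);

public static class ReconciliationAudit
{
    public static AuditEntry Build(string user, string priorDeployedBy, string deploymentId, string revisionHash) =>
        new(user, "DeployReconciled", new Dictionary<string, string>
        {
            ["DeploymentId"] = deploymentId,
            ["RevisionHash"] = revisionHash,
            // The original deployer is kept as context, not as the actor.
            ["OriginalDeployer"] = priorDeployedBy,
        });
}
```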
### DeploymentManager-021 — `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:107-111` |
**Description**
```
private async Task<string> ResolveSiteIdentifierAsync(int siteId, CancellationToken cancellationToken)
{
    var site = await _siteRepository.GetSiteByIdAsync(siteId, cancellationToken);
    return site?.SiteIdentifier ?? siteId.ToString();
}
```
If the `Site` row is missing (FK was deleted, race with admin delete, DB
inconsistency), the method silently returns the numeric DB id rendered as a
string. This is then passed to `CommunicationService.{Deploy,Disable,Enable,
Delete}InstanceAsync` and `QueryDeploymentStateAsync` as if it were a real
`SiteIdentifier` (e.g. "site-a"). The communication layer will fail with an
"unknown site" or routing error, producing a confusing diagnostic that hides
the actual problem (no site row).
This is a defensive concern, but every mutating operation in the module goes
through this method, so a stale instance whose site was deleted will produce a
misleading error every time it is touched.
**Recommendation**
Treat a missing site as a hard validation failure: return a
`Result.Failure($"Site with ID {siteId} not found")` early from the calling
operations, instead of fabricating an identifier. The repository already
returns `Site?`, so the null path is type-visible; just don't paper over it.
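One way to surface the missing row is to have the resolver return a result instead of a bare string. This is a sketch only: the `Site` and `Result<T>` records below are simplified stand-ins for the module's own types, and the repository call is modelled as a delegate so the example stays self-contained.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Simplified stand-ins for the real Site entity and Result type.
public sealed record Site(int Id, string SiteIdentifier);

public sealed record Result<T>(bool IsSuccess, T? Value, string? Error)
{
    public static Result<T> Success(T value) => new(true, value, null);
    public static Result<T> Failure(string error) => new(false, default, error);
}

public static class SiteResolution
{
    // getSiteById stands in for _siteRepository.GetSiteByIdAsync.
    public static async Task<Result<string>> ResolveSiteIdentifierAsync(
        int siteId,
        Func<int, CancellationToken, Task<Site?>> getSiteById,
        CancellationToken cancellationToken)
    {
        var site = await getSiteById(siteId, cancellationToken);
        return site is null
            ? Result<string>.Failure($"Site with ID {siteId} not found")
            : Result<string>.Success(site.SiteIdentifier);
    }
}
```

Callers would then return the failure early rather than passing a fabricated identifier to the communication layer.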
### DeploymentManager-022 — `Pending` and `InProgress` are written back-to-back with no intervening work
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:178-194` |
**Description**
`DeployInstanceAsync` does:
```
record.Status = Pending;
AddDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
record.Status = InProgress;
UpdateDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
```
There is no work between the two writes — flattening, validation, and
reconciliation have already completed by line 174. The deploy command is sent
immediately after the `InProgress` write. The `Pending` write therefore costs:
an extra `SaveChangesAsync` round-trip, an extra `IDeploymentStatusNotifier`
invocation (which the CentralUI-006 page renders, so the user briefly sees a
`Pending` flicker before `InProgress`), and an extra row-version bump if EF
optimistic concurrency is enabled on the table.
The design uses `Pending` to mean "queued, not yet sent" and `InProgress` to
mean "sent to site, awaiting response". The code's `Pending` slot has no
queuing — it is set and immediately overwritten — so the state buys nothing
operationally.
**Recommendation**
Either:
- Drop the `Pending` write entirely and create the record directly in
`InProgress` (one row insert, one notification, simpler UI); or
- Move the `Pending` → `InProgress` transition to bracket actual queueing/work
(e.g. set `Pending` *before* flattening + reconciliation, set `InProgress`
immediately before `DeployInstanceAsync` on the comm service) so the two
states carry distinguishable semantics worth a separate write.
### DeploymentManager-023 — `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:82-144,169-173` |
**Description**
`DeployToAllSitesAsync` loops over sites and calls
`BuildDeployArtifactsCommandAsync(site.Id, ...)` for each one. Of the six
artifact sets the method gathers, **only** `dataConnections` is per-site:
- `_templateRepo.GetAllSharedScriptsAsync` — global.
- `_externalSystemRepo.GetAllExternalSystemsAsync` — global, plus
`GetMethodsByExternalSystemIdAsync` per external system per site.
- `_externalSystemRepo.GetAllDatabaseConnectionsAsync` — global.
- `_notificationRepo.GetAllNotificationListsAsync` — global.
- `_notificationRepo.GetAllSmtpConfigurationsAsync` — global.
- `_siteRepo.GetDataConnectionsBySiteIdAsync(siteId, ...)` — **per-site**.
With N sites this issues ≈ 5·N redundant queries on the global sets (plus
M·N method queries, where M is the external-system count). On a hub-and-spoke
deployment with many sites the artifact-deploy path is noticeably slower than
necessary and pins DbContext usage longer than needed. Per CLAUDE.md, the
DbContext is not thread-safe and the per-site commands are already built
sequentially (good); the redundant queries are sequential too, but the
network/round-trip cost is real.
**Recommendation**
Hoist the global queries (shared scripts, external systems + their methods,
DB connections, notification lists, SMTP configurations) out of
`BuildDeployArtifactsCommandAsync`, fetch them once in `DeployToAllSitesAsync`,
and pass them in alongside the site id (or expose a
`BuildDeployArtifactsCommandAsync(siteId, prefetchedGlobals)` overload).
`RetryForSiteAsync` (the single-site path) can keep the convenience-overload
behaviour. Add a test using NSubstitute's `.Received()` to assert
`_templateRepo.GetAllSharedScriptsAsync` is called exactly once for an
N-site deployment.
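The hoisting can be sketched as follows, under simplified assumptions. `GlobalArtifacts`, `DeployArtifactsCommand`, `CountingArtifactSource`, and `ArtifactCommandBuilder` are hypothetical names, and the counting source stands in for the real repositories so the query counts are observable:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical, simplified shapes; the real repositories return richer entities.
public sealed record GlobalArtifacts(
    IReadOnlyList<string> SharedScripts, IReadOnlyList<string> NotificationLists);

public sealed record DeployArtifactsCommand(
    int SiteId, GlobalArtifacts Globals, IReadOnlyList<string> DataConnections);

// Stand-in for the repositories; it only counts how often each kind of query runs.
public sealed class CountingArtifactSource
{
    public int GlobalQueryCount { get; private set; }
    public int PerSiteQueryCount { get; private set; }

    public Task<GlobalArtifacts> GetGlobalsAsync()
    {
        GlobalQueryCount++;
        return Task.FromResult(
            new GlobalArtifacts(new[] { "shared-script" }, new[] { "ops-list" }));
    }

    public Task<IReadOnlyList<string>> GetDataConnectionsBySiteIdAsync(int siteId)
    {
        PerSiteQueryCount++;
        return Task.FromResult<IReadOnlyList<string>>(new[] { $"conn-for-site-{siteId}" });
    }
}

public static class ArtifactCommandBuilder
{
    // Per-site work only: the global sets arrive prefetched.
    public static async Task<DeployArtifactsCommand> BuildAsync(
        CountingArtifactSource source, int siteId, GlobalArtifacts globals) =>
        new(siteId, globals, await source.GetDataConnectionsBySiteIdAsync(siteId));

    public static async Task<List<DeployArtifactsCommand>> BuildForAllSitesAsync(
        CountingArtifactSource source, IEnumerable<int> siteIds)
    {
        var globals = await source.GetGlobalsAsync(); // fetched once, not once per site
        var commands = new List<DeployArtifactsCommand>();
        foreach (var siteId in siteIds) // still sequential, as the DbContext requires
            commands.Add(await BuildAsync(source, siteId, globals));
        return commands;
    }
}
```

For three sites this issues one global fetch and three per-site fetches, which is the property the recommended `.Received()` assertion would pin down against the real repositories.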
### DeploymentManager-024 — Test probe actors hold mutable static state across tests
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:966-1075`, `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:196-217` |
**Description**
`ReconcileProbeActor.QueryCount` / `DeployCount`, `SerializationProbeActor.MaxConcurrent`
/ `_current`, and `ArtifactProbeActor.Received` are all `static` fields.
Each test's actor constructor resets them, but reset-on-construction only
works as long as no two tests that touch the same actor type run concurrently.
xUnit runs the tests within a single test class sequentially by default, so
today's tests pass; reuse one of these actors from a second test class
(separate test collections run in parallel by default), or adopt a runner or
framework setting that parallelizes tests within a class, and the counters
race: a deploy in test A could increment `DeployCount` while test B is
asserting on it.
Static state shared across tests is also why a flaky-test investigation here
will be unusually painful: the offending interaction is invisible from any
single test file.
**Recommendation**
Replace the static counters with instance state, hand the actor a probe
recipient (an `IActorRef` to a TestKit probe), and assert via `ExpectMsg`
in each test. Where the simpler counter shape is preferred, pass a
shared-state object into the actor's constructor so each test owns its own
instance — never reach for `static` mutable test state.
+343 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.ExternalSystemGateway` |
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 6 |
## Summary
@@ -51,6 +51,36 @@ both substantive findings are second-order defects in earlier fixes — the earl
resolutions did not verify the downstream contract of the S&F engine they integrate
with.
#### Re-review 2026-05-28 (commit `1eb6e97`)
All seventeen prior findings (001-017) remain `Resolved`; spot-checks against the
current source confirm the fixes still hold. Between `39d737e` and this re-review the
only source changes to the module are the documentation-only commit `1eb6e97` (XML
doc additions) and the `executionId` / `sourceScript` / `parentExecutionId` plumbing
threaded through `CachedCallAsync` / `CachedWriteAsync` to the S&F enqueue (Audit Log
#23 Tasks 4/6). The re-review walked the full 10-category checklist again and
surfaced **six new findings**, none Critical. The most serious
(`ExternalSystemGateway-018`, High) is that `DeliverBufferedAsync` on both
`ExternalSystemClient` and `DatabaseGateway` lets a `JsonException` from
`JsonSerializer.Deserialize` propagate out of the delivery handler — the S&F engine
treats any thrown exception as a transient retry, so a corrupted or
schema-incompatible buffered row becomes a permanent poison message that is retried
on every sweep forever (the same retry-forever class of hazard `-015` already
addressed for a different cause). `ExternalSystemGateway-019` (Medium) is that
`HttpClient.Timeout` is never set, so any operator-configured `DefaultHttpTimeout`
greater than 100s is silently clipped by `HttpClient`'s built-in 100s default and the
gateway's "timeout applies to the HTTP request round-trip" guarantee no longer
holds — a partial reopen of the `-002` contract for the long-timeout case.
`ExternalSystemGateway-020` (Medium) is a silent precision-loss bug in the cached-DB-write
retry path: `JsonElementToParameterValue` collapses any JSON number that is not
Int64-convertible to `double`, so a script's `decimal` SQL parameter is downcast on
retry and only on retry. The remaining three (`-021`/`-022`/`-023`, Low) are an
unauthenticated-by-default `ApplyAuth` for unknown `AuthType` / malformed Basic config,
runtime-only HTTP-verb validation, and an undocumented PATCH HTTP method (code vs
design-doc drift). Theme: every new finding is in a code path that was added or
touched by the earlier fix bundle but whose error-propagation contract was not
verified end-to-end against the S&F engine or the design doc.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -66,6 +96,21 @@ with.
| 9 | Testing coverage | ☑ | Coverage is broad after finding 014. Re-review note: the `ZeroMaxRetries...` tests assert the persisted column, not the sweep outcome, and so lock in the finding-015 defect. |
| 10 | Documentation & comments | ☑ | Inline comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101` assert a "never retry" semantic that the code does not deliver — see finding 015. |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `JsonException` not caught in either `DeliverBufferedAsync`, so a corrupt buffered payload becomes a permanent poison-message retried forever — finding 018. `JsonElementToParameterValue` collapses a non-Int64 number to `double`, silently losing precision for `decimal` SQL parameters on cached-write retry — finding 020. `new HttpMethod(method.HttpMethod)` accepts any string at runtime, so an invalid HTTP verb is only diagnosed at call time — finding 022. |
| 2 | Akka.NET conventions | ☑ | Still no actors in this module; `AddExternalSystemGatewayActors` remains a no-op. The cached-call lifecycle/audit emission lives in `ScriptRuntimeContext` / `CachedCallTelemetryForwarder` (SiteRuntime / AuditLog), not here, and that boundary is correct. No issues found. |
| 3 | Concurrency & thread safety | ☑ | Services are still stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. The added `executionId` / `sourceScript` / `parentExecutionId` plumbing flows through method arguments only — no shared state introduced. No findings. |
| 4 | Error handling & resilience | ☑ | The poison-payload retry-forever path is the headline resilience issue (finding 018). `HttpClient.Timeout` not being set leaves the gateway's per-call round-trip cap clipped to the framework's 100s default whenever the configured `DefaultHttpTimeout` is larger — finding 019 (partial reopen of the `-002` contract). |
| 5 | Security | ☑ | Auth secrets still never logged; error bodies still truncated. `ApplyAuth` is silent on unknown `AuthType` / empty `AuthConfiguration` / malformed Basic config — finding 021 (fail-open is a real but bounded risk; recorded Low because misconfiguration is the precondition). Connection-string handling in `DatabaseGateway` reads from the entity verbatim and never logs it. |
| 6 | Performance & resource management | ☑ | Disposal paths from findings 005/010 still hold. The `IHttpClientFactory` name-keyed-options registration (finding 016 fix) creates a fresh `SocketsHttpHandler` per primary-handler build — acceptable because `IHttpClientFactory` recycles handlers. No new findings. |
| 7 | Design-document adherence | ☑ | The design doc enumerates GET/POST/PUT/DELETE but the code also serializes a body for PATCH (and accepts arbitrary HTTP verbs at runtime) — finding 023 (drift to be reconciled). The per-call timeout guarantee is partially defeated by the unset `HttpClient.Timeout` for option values > 100s — finding 019. |
| 8 | Code organization & conventions | ☑ | The `-016` fix replaced `ConfigureHttpClientDefaults` with a scoped `IConfigureNamedOptions<HttpClientFactoryOptions>` — verified clean, no new conventions issue. `internal virtual CreateConnection` (DatabaseGateway) and `internal InvokeHttpAsync` (ExternalSystemClient) are exposed via `InternalsVisibleTo` for tests — acceptable. No new findings. |
| 9 | Testing coverage | ☑ | The `JsonException` deserialization path for `DeliverBufferedAsync` is untested; the `JsonElementToParameterValue` `double`-downcast path is untested; `ApplyAuth`'s unknown-AuthType / empty-config / malformed-Basic branches are untested. Recorded under findings 018 / 020 / 021 rather than a standalone coverage finding. |
| 10 | Documentation & comments | ☑ | XML doc additions in `1eb6e97` are accurate and consistent. PATCH support is undocumented in the design doc (finding 023). The inline `ExternalSystemGateway-015` block-comment in `CachedCallAsync` (lines 126-133) and the equivalent in `DatabaseGateway.cs:106-113` now correctly describe the "treat 0 as unset" semantics. |
## Findings
### ExternalSystemGateway-001 — No S&F delivery handler registered; cached calls and writes can never be delivered
@@ -951,3 +996,298 @@ method whose effective parameter set is empty produces a URL identical to the
no-parameters case. Regression test
`Call_GetWithAllNullParameters_DoesNotAppendTrailingQuestionMark` asserts the
captured request URI has no trailing `?`; it was verified to fail before the fix.
### ExternalSystemGateway-018 — `DeliverBufferedAsync` lets `JsonException` propagate, turning a corrupt buffered row into a permanent retry-forever poison message
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:176`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:151` |
**Description**
Both `ExternalSystemClient.DeliverBufferedAsync` and `DatabaseGateway.DeliverBufferedAsync`
begin with an unguarded `JsonSerializer.Deserialize<...>(message.PayloadJson)`:
```csharp
var payload = JsonSerializer.Deserialize<CachedCallPayload>(message.PayloadJson);
if (payload == null || string.IsNullOrEmpty(payload.SystemName) || ...) {
_logger.LogError("... unreadable payload; parking.");
return false;
}
```
The "unreadable payload; parking" branch is only entered when `Deserialize` *succeeds*
and produces a null / partially-empty object. If `PayloadJson` is **malformed JSON**
(the column was truncated mid-write, an older payload schema is being deserialized into a
newer record, or storage corruption occurred), `Deserialize` throws `JsonException`
before that check is ever reached. The exception propagates out of the delivery handler.
The Store-and-Forward retry loop treats *any* thrown exception from a delivery handler
as a transient failure (only a returned `false` parks the message); see
`StoreAndForwardService.RetryMessageAsync`. Combined with the `MaxRetries == 0`
"unset → bounded default" fix from `-015`, the resulting behaviour is:
1. Corrupt payload arrives in the buffer.
2. Every retry sweep deserializes, throws `JsonException`, increments `RetryCount`.
3. The message is parked once `RetryCount >= MaxRetries`, but only when a positive
   retry bound applies (`-015` established that `MaxRetries > 0` is not the default
   site configuration today). With the bounded S&F default the message does
   eventually park, after failing noisily on `DefaultMaxRetries` consecutive sweeps;
   without that bound it retries forever.
4. The script is unaware — the cached call was returned `WasBuffered: true` long ago.
This is the same "poison message buffered forever" class of hazard that
`ExternalSystemGateway-001` (no-handler) and `ExternalSystemGateway-015` (MaxRetries==0)
already removed for their own causes; corrupt JSON is an alternative arrival path into
the same bad state.
The `DatabaseGateway.DeliverBufferedAsync` path has the same shape and the same defect:
`JsonSerializer.Deserialize<CachedWritePayload>` at line 151 is not guarded.
**Recommendation**
Wrap the `Deserialize` call in a `try/catch (JsonException)` block in both
`DeliverBufferedAsync` methods. A `JsonException` is by definition a permanent failure —
re-running the same deserialization against the same payload will produce the same
exception — so the catch should log at `LogError` and **return `false`** so the S&F
engine parks the message rather than retrying. Add regression tests that feed a
malformed `PayloadJson` to each handler and assert `delivered == false` (i.e. the
message parks) and that no exception escapes the handler.
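A minimal sketch of the guarded deserialization, assuming a trimmed-down payload record and a plain `Action<string>` in place of the module's logger. `BufferedDelivery.TryReadPayload` and the two-field `CachedCallPayload` below are illustrative, not the real signatures:

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical, trimmed-down payload; the real CachedCallPayload has more fields.
public sealed record CachedCallPayload(string? SystemName, string? MethodName);

public static class BufferedDelivery
{
    // Returns false when the message should be parked; JsonException never escapes.
    public static bool TryReadPayload(
        string payloadJson, Action<string> logError, out CachedCallPayload? payload)
    {
        payload = null;
        try
        {
            payload = JsonSerializer.Deserialize<CachedCallPayload>(payloadJson);
        }
        catch (JsonException ex)
        {
            // Re-running the same bytes cannot succeed, so this is a permanent failure.
            logError($"Buffered payload is not valid JSON; parking. {ex.Message}");
            return false;
        }

        if (payload is null
            || string.IsNullOrEmpty(payload.SystemName)
            || string.IsNullOrEmpty(payload.MethodName))
        {
            logError("Buffered payload is unreadable; parking.");
            payload = null;
            return false;
        }

        return true;
    }
}
```

A delivery handler built this way would return `false` as soon as `TryReadPayload` does, so both the malformed-JSON case and the null-object case take the same parking path.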
### ExternalSystemGateway-019 — `HttpClient.Timeout` is not set; `DefaultHttpTimeout` > 100s is silently clipped by the framework default
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:226,257-264`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:90-102` |
**Description**
The `-002` fix enforces the per-call timeout via a linked `CancellationTokenSource`
built from `_options.DefaultHttpTimeout` and passed into `SendAsync`. That correctly
caps every call to *at most* the configured value when `DefaultHttpTimeout` ≤ 100s.
However, `HttpClient.Timeout` (the framework default) is never set on either the named
client or its primary handler — the `GatewayHttpClientConfigurator` only sets
`MaxConnectionsPerServer`. `HttpClient.Timeout` defaults to **100 seconds**, and
`SendAsync` enforces it internally by cancelling its own private CTS, raising a
`TaskCanceledException` from `SendAsync` *without* cancelling either the caller's token
or the gateway's `timeoutCts`.
Consequences when an operator configures `DefaultHttpTimeout` to anything > 100s
(a legitimate setting for external systems with long-running endpoints — recipe
exports, large queries):
1. The gateway's `timeoutCts` (e.g. 5 minutes) has not yet fired.
2. `HttpClient.Timeout` fires at 100s, `SendAsync` throws.
3. Neither `when (cancellationToken.IsCancellationRequested)` nor
`when (timeoutCts.IsCancellationRequested)` matches, so the exception falls into
the generic `catch (Exception ex) when (ErrorClassifier.IsTransient(ex))` branch
(line 277) and is re-thrown as a `TransientExternalSystemException` with the
message `"Connection error to {Name}: A task was canceled."` — misattributing a
timeout as a connection error.
4. The configured 5-minute round-trip window the design doc promises ("Each external
system definition specifies a timeout that applies to all method calls on that
system" / "applies to the HTTP request round-trip") is silently overridden.
The opposite case (`DefaultHttpTimeout` < 100s) is the only one the `-002` regression
test exercises (200ms), so the defect is not caught by the existing suite.
**Recommendation**
Set `HttpClient.Timeout = Timeout.InfiniteTimeSpan` on the gateway's named clients via
the existing `GatewayHttpClientConfigurator` (register an `HttpClientActions` delegate in
addition to the existing `HttpMessageHandlerBuilderActions` one), so the
cancellation-token mechanism is the sole timeout source. The linked `timeoutCts` then
reliably enforces `DefaultHttpTimeout` for every value, and the timeout-vs-cancellation
classification at lines 266-276 stays accurate. Add a regression test that configures
`DefaultHttpTimeout` to ~150s, hangs the handler, and asserts the call times out at the
configured value and produces a `"Timeout calling..."` (not `"Connection error to..."`) error.
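The interaction can be sketched with a handler that never responds. This is an illustration under assumptions: `HangingHandler` and `TimeoutSketch` are hypothetical names, the returned strings only mimic the gateway's error classification, and a short timeout stands in for the >100s case so the sketch finishes quickly:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Test double that never responds until its token is cancelled.
public sealed class HangingHandler : HttpMessageHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        await Task.Delay(Timeout.InfiniteTimeSpan, cancellationToken);
        return new HttpResponseMessage();
    }
}

public static class TimeoutSketch
{
    // With HttpClient.Timeout disabled, the linked CTS is the only timeout source.
    public static HttpClient CreateClient(HttpMessageHandler handler) =>
        new(handler) { Timeout = Timeout.InfiniteTimeSpan };

    public static async Task<string> CallAsync(
        HttpClient client, TimeSpan configuredTimeout, CancellationToken callerToken = default)
    {
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        timeoutCts.CancelAfter(configuredTimeout);
        try
        {
            using var request = new HttpRequestMessage(HttpMethod.Get, "http://example.invalid/");
            using var response = await client.SendAsync(request, timeoutCts.Token);
            return "Completed";
        }
        catch (OperationCanceledException) when (callerToken.IsCancellationRequested)
        {
            return "Cancelled by caller";
        }
        catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested)
        {
            return "Timeout calling external system";
        }
    }
}
```

Because the client's own timer is switched off, every cancellation observed here comes from either the caller's token or `timeoutCts`, so one of the two filters always matches and no timeout falls through to a generic handler.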
### ExternalSystemGateway-020 — `JsonElementToParameterValue` silently downcasts non-Int64 JSON numbers to `double`, losing precision for `decimal` SQL parameters on retry
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:185-193` |
**Description**
`DatabaseGateway.JsonElementToParameterValue` materialises the buffered cached-write
SQL parameter values during a retry-sweep delivery:
```csharp
private static object JsonElementToParameterValue(JsonElement element) => element.ValueKind switch
{
JsonValueKind.String => (object?)element.GetString() ?? DBNull.Value,
JsonValueKind.Number => element.TryGetInt64(out var l) ? l : element.GetDouble(),
...
};
```
For a JSON number, the helper attempts `Int64` first and otherwise returns a `double`.
There is no `decimal` branch. The immediate-attempt path is unaffected — `CachedWriteAsync`
on the original call serializes the script-provided typed parameters via
`JsonSerializer.Serialize(new { ConnectionName, Sql, Parameters = parameters })` and
executes the SQL directly outside this code path. But the **retry path** runs through
`DeliverBufferedAsync` → `JsonElementToParameterValue`, so a script that submitted
a `decimal` value (e.g. `123.4567890123m`) gets:
1. Immediate attempt: the typed `decimal` parameter is bound directly at full
   precision; the value never passes through this helper.
2. Retry attempt(s) after a transient failure: the value is deserialized as a JSON
   number, fails `TryGetInt64`, and is downcast to `double`, which has ~15-17
   significant digits of precision against `decimal`'s 28-29. A SQL column of type
   `decimal(18, 6)` or `numeric` receives a value that has been reduced to `double`
   precision before parameter binding.
Two further consequences worth recording:
- The downcast is **silent** — there is no log, no error, and the cached-write
acknowledgement to the script has long since happened. Data drift between a
same-call immediate-success delivery and a same-call retry delivery is the worst
shape of "looks like the right value but isn't" defect.
- For SCADA telemetry (process variables, totals, currency-denominated quality
reports) `decimal` is the correct CLR type and `double`'s representation error
changes the persisted value.
**Recommendation**
Replace the `Number` branch with a precision-preserving cascade — try `Int64`, then
`decimal` (`element.TryGetDecimal(out var d) ? d : element.GetDouble()`), and only
fall back to `double` when even `decimal` fails. Add a regression test against
`DatabaseGateway.DeliverBufferedAsync` that buffers a write with a high-precision
`decimal` parameter, drives the delivery, and asserts the SQL parameter bound is a
`decimal` (or compares the round-tripped value to the original at the parameter level)
rather than a `double` with truncated precision. The same Number-branch decision should
be reviewed against `JsonValueKind.True`/`False`/`Null` (currently fine) and a string
that happens to encode a number (already correctly returns `string`).
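A sketch of the precision-preserving cascade, written with explicit `if` statements so each branch is boxed as its own CLR type. `ParameterValueSketch.ToParameterValue` is a hypothetical name, and because the source elides the remaining `ValueKind` arms this sketch throws for them rather than guessing at their behaviour:

```csharp
#nullable enable
using System;
using System.Text.Json;

public static class ParameterValueSketch
{
    public static object ToParameterValue(JsonElement element)
    {
        switch (element.ValueKind)
        {
            case JsonValueKind.String:
                return (object?)element.GetString() ?? DBNull.Value;

            case JsonValueKind.Number:
                if (element.TryGetInt64(out var integer)) return integer;   // boxed Int64
                if (element.TryGetDecimal(out var exact)) return exact;     // boxed Decimal
                return element.GetDouble();                                 // last resort only

            default:
                // The other arms are elided in the reviewed source; out of scope here.
                throw new NotSupportedException(
                    $"ValueKind {element.ValueKind} is not covered by this sketch.");
        }
    }
}
```

With this ordering a buffered `123.4567890123` is bound as a `decimal` on retry, matching what the immediate attempt would have bound.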
### ExternalSystemGateway-021 — `ApplyAuth` silently sends an unauthenticated request on unknown `AuthType`, empty `AuthConfiguration`, or malformed Basic config
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:385-415` |
**Description**
`ApplyAuth` has three fail-open paths that all result in an HTTP request being sent
**without** the credential the operator configured:
1. Line 387 — `if (string.IsNullOrEmpty(system.AuthConfiguration)) return;` returns
early regardless of `AuthType`. A system entity with `AuthType = "apikey"` but an
empty `AuthConfiguration` (e.g. the secret column failed to deploy, or the
protector key changed and decryption produced `""`) sends every request with no
`X-API-Key` header — the gateway is silent.
2. The `switch` has no `default` arm. A system entity with `AuthType = "bearer"`,
`"oauth2"`, a typo like `"ApiKey "` (trailing space) or even `"none"` falls off the
`switch` and the request is sent without any auth header — again silent.
3. Line 408 — `if (basicParts.Length == 2)` skips the auth attach when
`AuthConfiguration` for `basic` lacks a `:` separator. The request is sent with no
`Authorization` header.
Effectively the gateway treats every misconfiguration as "send anonymously" and
relies on the remote system rejecting it with a 401/403. That is a defensible default
on its own, but combined with `-007`'s 2 KB error-body cap and the fact that no audit
or warning is emitted, an operator debugging "why does my external system always
return 401" has nothing to go on inside ScadaLink — the gateway never says it failed
to apply auth. For `AuthType = "none"` (the design's expected sentinel for
unauthenticated systems) the fall-through is correct; the failure mode is misconfig.
**Recommendation**
Add a `default:` arm to the `switch` that logs `_logger.LogWarning(...)` naming the
unknown `AuthType` and the system, and emit a similar warning when
`AuthConfiguration` is empty for an `AuthType` of `"apikey"` or `"basic"` (those
require a value; `"none"` does not). For Basic auth specifically, the
`basicParts.Length != 2` branch should also warn. Do **not** include the
`AuthConfiguration` value in the log message — secrets must stay out of the log
(consistent with the existing module). A small set of `ApplyAuth` unit tests
verifying the warning emission and that no `Authorization` / `X-API-Key` header is
ever leaked in the warning text would close the test gap as well.
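One possible shape for the warn-on-misconfiguration behaviour, as a sketch. `ExternalSystemDefinition`, `AuthSketch`, the exact-match `AuthType` comparison, and the `Action<string>` logger are all assumptions made for illustration; the warning text deliberately never includes `AuthConfiguration`:

```csharp
#nullable enable
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

// Hypothetical, trimmed-down system definition.
public sealed record ExternalSystemDefinition(
    string Name, string AuthType, string? AuthConfiguration);

public static class AuthSketch
{
    public static void ApplyAuth(
        HttpRequestMessage request, ExternalSystemDefinition system, Action<string> logWarning)
    {
        switch (system.AuthType)
        {
            case "none":
                return; // intentionally unauthenticated

            case "apikey":
                if (string.IsNullOrEmpty(system.AuthConfiguration))
                {
                    logWarning($"System '{system.Name}': AuthType 'apikey' has no configuration; request sent without credentials.");
                    return;
                }
                request.Headers.Add("X-API-Key", system.AuthConfiguration);
                return;

            case "basic":
                var parts = (system.AuthConfiguration ?? string.Empty).Split(':', 2);
                if (parts.Length != 2)
                {
                    logWarning($"System '{system.Name}': AuthType 'basic' configuration is empty or malformed; request sent without credentials.");
                    return;
                }
                request.Headers.Authorization = new AuthenticationHeaderValue(
                    "Basic", Convert.ToBase64String(Encoding.UTF8.GetBytes(system.AuthConfiguration!)));
                return;

            default:
                // Only the AuthType is named; the secret stays out of the log.
                logWarning($"System '{system.Name}': unknown AuthType '{system.AuthType}'; request sent without credentials.");
                return;
        }
    }
}
```

The request is still sent anonymously in each misconfigured case, as today; the only change is that the gateway now says so.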
### ExternalSystemGateway-022 — `new HttpMethod(method.HttpMethod)` accepts any string at runtime; an invalid HTTP verb fails only at call time
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:233` |
**Description**
`InvokeHttpAsync` constructs the request method directly from the string column:
`new HttpRequestMessage(new HttpMethod(method.HttpMethod), url)`. `System.Net.Http.HttpMethod(string)`
performs only a token-character validation (it rejects whitespace and control chars
but accepts arbitrary non-standard tokens like `"FOO"` or `"GIT"`). The body-vs-query
selection at lines 239-250 explicitly checks for POST/PUT/PATCH; for any other
non-standard verb (`"FOO"`) the parameters silently go to neither body nor query
string and the request is dispatched anyway.
The design doc enumerates GET/POST/PUT/DELETE as the supported set. There is no
validation at deployment time, at definition save time, or at gateway
resolution time that `method.HttpMethod` is one of the expected verbs. An operator
who typos `"DLETE"` discovers the issue only when a script invokes that method and
the remote server rejects the request — usually as a 4xx that the gateway classifies
as permanent, which is correct but obscures the root cause.
**Recommendation**
Validate `method.HttpMethod` at gateway entry — either with a small `switch` of
allowed verbs in `InvokeHttpAsync` that throws `PermanentExternalSystemException` for
an unsupported verb (cheap, immediate, surfaces a clear error to the script), or by
adding a validation pass in the Template/Deployment Manager so it can never reach
the gateway. The first option is local to this module and cheaper to land. Either
way, the canonical list should agree with `BuildUrl`'s query-vs-body decision (which
currently knows about POST/PUT/PATCH for body and GET/DELETE for query — note PATCH
is in the body branch but not the design-doc list; see finding 023).
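The first option might look like the sketch below. The exception type here is a local stand-in for the module's `PermanentExternalSystemException`, `HttpVerbSketch` is a hypothetical helper, the case-insensitive match is an assumption, and the allowed set follows the design doc's GET/POST/PUT/DELETE list (whether PATCH joins it depends on the outcome of finding 023):

```csharp
#nullable enable
using System;
using System.Net.Http;

// Local stand-in for the module's PermanentExternalSystemException.
public sealed class PermanentExternalSystemException : Exception
{
    public PermanentExternalSystemException(string message) : base(message) { }
}

public static class HttpVerbSketch
{
    // Maps a configured verb onto a known HttpMethod or fails immediately.
    public static HttpMethod ParseSupportedVerb(string? verb) =>
        verb?.ToUpperInvariant() switch
        {
            "GET" => HttpMethod.Get,
            "POST" => HttpMethod.Post,
            "PUT" => HttpMethod.Put,
            "DELETE" => HttpMethod.Delete,
            _ => throw new PermanentExternalSystemException(
                $"Unsupported HTTP method '{verb}'.")
        };
}
```

A typo such as `"DLETE"` then fails inside the gateway with a message naming the verb, before any request reaches the remote system.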
### ExternalSystemGateway-023 — PATCH HTTP method is supported by code but absent from the design doc; body-vs-query decision drifts from the documented set
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:241`, `docs/requirements/Component-ExternalSystemGateway.md:43` |
**Description**
The component design doc lists the supported HTTP methods as `GET, POST, PUT, or
DELETE` (line 43: `**HTTP method**: GET, POST, PUT, or DELETE.`). `InvokeHttpAsync`'s
body-serialization branch at lines 239-250 explicitly includes `PATCH` alongside POST
and PUT — so PATCH is in fact supported (and routes parameters into the JSON body),
but operators reading the spec would not know it. Conversely, `BuildUrl`'s
query-string branch at lines 364-366 lists only `GET` and `DELETE`, so a PATCH
method's parameters always go to the body, matching the body-branch but not appearing
anywhere in the documented contract.
This is mild drift — the code is more permissive than the spec. It only becomes a
real issue if a future change relies on the documented "only GET/POST/PUT/DELETE"
set and breaks the PATCH path silently, or if PATCH is genuinely out of scope and a
template author defines a PATCH method on purpose only to learn later it is
unsupported.
**Recommendation**
Pick one direction and apply it in the same session, per the project's "design doc +
code travel together" rule:
- If PATCH is intentionally supported, add `PATCH` to the Component-ExternalSystemGateway.md
HTTP-method list (line 43) and add a parameterised test confirming a PATCH method
sends its parameters in the JSON body and resolves like POST/PUT for error
classification.
- If PATCH is not in scope, remove `method.HttpMethod.Equals("PATCH", ...)` from the
body branch in `InvokeHttpAsync` and let finding-022's verb validation reject it.
The design-doc list then remains the single source of truth.
+358 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
## Summary
@@ -51,6 +51,35 @@ HealthMonitoring + CentralUI change), and `CollectReport` reading
`TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
broadly faithful to the design intent.
#### Re-review 2026-05-28 (commit `1eb6e97`)
All sixteen prior findings (HealthMonitoring-001..016) remain `Resolved`. This
baseline re-review applied the full 10-category checklist and produced **7 new
findings** (1 Medium, 6 Low — none crash-class). The most material observation
is a **metric-loss race** in `HealthReportSender.ExecuteAsync`
(HealthMonitoring-017): `CollectReport` resets the per-interval error counters
(`ScriptErrorCount`, `AlarmEvaluationErrorCount`, `DeadLetterCount`,
`SiteAuditWriteFailures`, `AuditRedactionFailure`) **before**
`_transport.Send(...)` is attempted, so a transport failure (the existing
`catch { LogError; }` path) silently discards every error this site recorded in
the failed interval. This inverts the module-specific concern of "metric
counters drifting from raw-per-interval to cumulative": the counts do not
drift, they are lost. A
parallel hazard exists in `CentralHealthReportLoop` (HealthMonitoring-018). The
remaining items are smaller: two Audit Log metrics
(`SiteAuditTelemetryStalled`, `CentralAuditWriteFailures`) listed in the design
doc never make it into a HealthMonitoring surface (HealthMonitoring-019); a
heartbeat with `receivedAt <= existing.LastHeartbeatAt` brings an offline site
back online with a stale heartbeat that can flap right back to offline on the
next check (HealthMonitoring-020); the reserved `CentralSiteId = "central"`
constant collides with any real site named `"central"` and silently extends its
offline grace (HealthMonitoring-021); `CentralHealthReportLoopTests` uses real
wall-clock 50 ms timers + `Task.Delay`, making it timing-sensitive
(HealthMonitoring-022); and one obsolete placeholder test name
(`StoreAndForwardBufferDepths_IsEmptyPlaceholder`) misrepresents what it now
covers (HealthMonitoring-023). All sequence-number and offline-detection
arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs
monotonic mismatch was observed.
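One way to avoid losing the interval on a failed send is to put the snapshot back when the transport throws. The sketch below is an assumption-laden illustration, not the module's code or a recorded recommendation: `IntervalCounter` and `ReportSendSketch` are hypothetical, and a single counter stands in for the whole report.

```csharp
using System;
using System.Threading;

public sealed class IntervalCounter
{
    private long _count;

    public void Increment() => Interlocked.Increment(ref _count);

    // Reads and resets in one atomic step, as a collect-and-reset would.
    public long TakeSnapshot() => Interlocked.Exchange(ref _count, 0);

    // Adds back counts that were collected but never delivered.
    public void Restore(long unsent) => Interlocked.Add(ref _count, unsent);

    public long Current => Interlocked.Read(ref _count);
}

public static class ReportSendSketch
{
    public static bool TrySendInterval(IntervalCounter counter, Action<long> send)
    {
        var snapshot = counter.TakeSnapshot();
        try
        {
            send(snapshot);
            return true;
        }
        catch (Exception)
        {
            // Increments that arrived during the send are kept; the snapshot is added on top.
            counter.Restore(snapshot);
            return false;
        }
    }
}
```

The trade-off is that the report after a failure carries more than one interval's worth of counts, so a consumer that assumes strictly per-interval values would need to tolerate that.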
## Checklist coverage
| # | Category | Examined | Notes |
@@ -66,6 +95,21 @@ broadly faithful to the design intent.
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `HealthReportSender` and `CentralHealthReportLoop` reset per-interval counters before the send/process call — counts lost on transport failure (HealthMonitoring-017, HealthMonitoring-018). `MarkHeartbeat` brings an offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt` — site can flap straight back to offline (HealthMonitoring-020). `CentralSiteId = "central"` reserved constant silently collides with any real site named "central" (HealthMonitoring-021). |
| 2 | Akka.NET conventions | x | Module contains no actors itself. `IHealthReportTransport` cleanly abstracts the Akka-remoting send. `ProcessReport`/`MarkHeartbeat` are called from `CentralCommunicationActor`'s receive — invoked on the actor thread but the aggregator's CAS loops make that safe regardless. No issues found. |
| 3 | Concurrency & thread safety | x | Verified the resolved `SiteHealthState` immutable-record / CAS-loop pattern still holds across `ProcessReport`, `MarkHeartbeat`, `CheckForOfflineSites`. `SiteHealthCollector` uses `volatile` for reference fields (`_clusterNodes`, `_nodeHostname`, `_siteAuditBacklog`, `_isActiveNode`) and `Interlocked` for counters consistently. `CollectReport`'s `new Dictionary<>(concurrentDict)` snapshots are not strictly atomic but acceptable at the documented scale. No new issues found. |
| 4 | Error handling & resilience | x | `try/catch` blocks now log all non-fatal failures (resolved HealthMonitoring-010 still in place). Outer `catch (Exception)` in `ExecuteAsync` keeps the loop alive — sound. New: the counter-reset-before-send issue (HealthMonitoring-017, HealthMonitoring-018) is an error-handling gap — transport failure silently swallows the interval's metric data. |
| 5 | Security | x | No issues found. The module handles only numeric/string operational metrics; no secrets, auth surface, or untrusted input parsing. `MarkHeartbeat` and `ProcessReport` trust the caller (intra-cluster). |
| 6 | Performance & resource management | x | `PeriodicTimer` instances disposed via `using`. CAS retry loops in `ProcessReport`/`MarkHeartbeat` have no retry cap, but contention is confined to writers of the same site's entry (one dictionary entry per site), so the loop normally succeeds on the first attempt. No issues found. |
| 7 | Design-document adherence | x | `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` are listed as required dashboard tiles in `Component-HealthMonitoring.md` but have no HealthMonitoring-side surface — both live only in `AuditLog`'s `AuditCentralHealthSnapshot` with no integration into the health aggregator or its consumers (HealthMonitoring-019). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component, validator registered idempotently across all three `Add*` paths. POCO/messages in Commons. `AddCentralHealthAggregation` implicitly depends on `ISiteHealthCollector` being registered elsewhere (Host calls `AddHealthMonitoring()` first) — works but is a hidden ordering requirement. Minor; not flagged. |
| 9 | Testing coverage | x | Per-interval reset semantics covered for site-side counters but NOT for the failed-send case (no test asserts counters remain accumulated when the transport throws — would catch HealthMonitoring-017). `CentralHealthReportLoopTests` uses real wall-clock 50 ms `PeriodicTimer` + `Task.Delay(250)` for timing — flake-prone on a slow CI runner (HealthMonitoring-022). The placeholder test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` name is stale (HealthMonitoring-023). |
| 10 | Documentation & comments | x | XML docs in the new audit-bridge surfaces (`IncrementSiteAuditWriteFailures`, `IncrementAuditRedactionFailure`, `UpdateSiteAuditBacklog`) are accurate. The stale placeholder test name is the only issue (HealthMonitoring-023). |
## Findings
### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated
@@ -776,3 +820,314 @@ continues to work via the optional parameter. Regression test
asserts the timestamp equals a fixed injected instant exactly (not just a
before/after window); it would not compile against the pre-fix single-arg-less
constructor.
### HealthMonitoring-017 — `HealthReportSender` resets interval counters before `Send`; transport failures silently drop the interval's error counts
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:140-154`, `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:146-153` |
**Description**
`HealthReportSender.ExecuteAsync` calls `_collector.CollectReport(_siteId)` and
then `_transport.Send(reportWithSeq)` inside a single `try` block whose `catch`
logs and continues. `CollectReport` atomically reads and resets the per-interval
counters via `Interlocked.Exchange(ref _scriptErrorCount, 0)` (and the same for
`_alarmErrorCount`, `_deadLetterCount`, `_siteAuditWriteFailures`,
`_auditRedactionFailures`). If `_transport.Send` then throws — Akka remoting
hiccup, transport not yet associated, central side temporarily unavailable,
serialization failure on a malformed metric, etc. — the `catch (Exception ex)`
on line 150 logs an error and the loop simply waits for the next tick. The
report was never delivered, but the counters have already been reset to zero, so
**every error this site recorded in the failed interval is gone**: it is neither
in the (un-sent) report nor in the (zeroed) collector. The very next successful
report will show "0 script errors / 0 alarm errors" for the entire window in
which the transport was broken, masking exactly the period the operator most
needs to triage.
This contradicts the design doc's "raw counts per reporting interval" / "counter
resets **after each report is sent**" wording — current code resets on each
report _attempt_, regardless of outcome. The hazard worsens under sustained
transport failure: every interval's errors are lost; the central dashboard sees
a quiet site while the site is, in fact, failing.
The same shape exists in `CentralHealthReportLoop` (see HealthMonitoring-018) —
`CollectReport` is called before `_aggregator.ProcessReport`. The aggregator
call is in-process and unlikely to throw, but the structural bug is identical.
**Recommendation**
Build the report from a non-destructive read first (`PeekReport(siteId)`,
returning a snapshot without mutating the counters) and only call a dedicated
`ResetIntervalCounters()` after a successful `_transport.Send`. Alternatively,
on a `Send` failure, restore the lost counts via `Interlocked.Add` of the
captured values back into the collector fields — atomically correct as long as
no other thread can read them in between, which is true here because the next
read is the next `CollectReport` on the same loop. The "peek then commit"
shape is the cleaner public API.
A regression test should add a failing-transport scenario:
`Send` throws an `InvalidOperationException`; assert that the next successful
report includes the previously-failed interval's `ScriptErrorCount`.
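
The peek-then-commit shape can be sketched as follows, reduced to a single counter. `IntervalCounterSketch`, `PeekScriptErrorCount`, `CommitSent`, and `SendLoopSketch` are hypothetical names, not the module's API. As an assumption of this sketch, the commit step subtracts only the value that was actually sent rather than resetting to zero, so increments that arrive between the peek and the commit are kept; the real fix may choose a different shape.

```csharp
using System;
using System.Threading;

// Hypothetical, simplified stand-in for SiteHealthCollector: one counter only.
public sealed class IntervalCounterSketch
{
    private int _scriptErrorCount;

    public void IncrementScriptError() => Interlocked.Increment(ref _scriptErrorCount);

    // Non-destructive read: the counter is left untouched.
    public int PeekScriptErrorCount() => Volatile.Read(ref _scriptErrorCount);

    // Commit after a successful send: subtract only what was reported, so
    // increments that raced in after the peek stay for the next interval.
    public void CommitSent(int sentCount) => Interlocked.Add(ref _scriptErrorCount, -sentCount);
}

public static class SendLoopSketch
{
    // Returns true when the report was delivered and the counter was committed.
    public static bool TrySendOnce(IntervalCounterSketch collector, Action<int> send)
    {
        int snapshot = collector.PeekScriptErrorCount();
        try
        {
            send(snapshot);
        }
        catch (Exception)
        {
            // Real code would log here; the counter is deliberately not reset.
            return false;
        }

        collector.CommitSent(snapshot);
        return true;
    }
}
```

With this shape a failed send leaves the counts in the collector, so the next successful report carries them forward, which is the behaviour the proposed regression test would assert.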
### HealthMonitoring-018 — Same counter-reset-before-publish hazard in `CentralHealthReportLoop`
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:87-98` |
**Description**
`CentralHealthReportLoop.ExecuteAsync` calls `_collector.CollectReport(CentralSiteId)`
(which resets the per-interval counters on the shared `SiteHealthCollector`
instance — see HealthMonitoring-017) and then `_aggregator.ProcessReport(reportWithSeq)`
inside the same `try` block. If `ProcessReport` throws, the central node's own
per-interval counters (`ScriptErrorCount`, `AlarmEvaluationErrorCount`,
`DeadLetterCount`, `SiteAuditWriteFailures`, `AuditRedactionFailure`) are lost
for that interval.
In practice `ProcessReport` is a pure in-memory CAS loop and is very unlikely
to throw, so the operational impact is small. However, the structural bug is
identical to HealthMonitoring-017 and would be fixed by the same
"peek then commit" refactor in `SiteHealthCollector`. The Audit-Log-related
metrics matter most here: `AuditRedactionFailure` is genuinely incremented at
central during normal operation (the Notification Outbox dispatcher and
Inbound API middleware both write through `CentralAuditRedactionFailureCounter`
which can fan out to the collector via the bridge), so this is not purely
theoretical.
**Recommendation**
Adopt the same "peek then reset on successful publish" pattern recommended for
HealthMonitoring-017. Reuse the new `PeekReport` / `ResetIntervalCounters`
collector API once it lands.
### HealthMonitoring-019 — `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` design-doc metrics have no HealthMonitoring-side surface
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-HealthMonitoring.md:39,40`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs`, `src/ScadaLink.AuditLog/Central/AuditCentralHealthSnapshot.cs:39-58` |
**Description**
`Component-HealthMonitoring.md` lists `SiteAuditTelemetryStalled` and
`CentralAuditWriteFailures` (and reiterates them under the Audit Log KPIs
section and in the Dependencies section) as required dashboard metrics. The
doc also says they "are central-computed alongside the existing central KPIs"
(Notification Outbox / Site Call Audit) and surface in the **Audit** dashboard
tile group.
Tracing the code:
- `SiteAuditTelemetryStalled` is published by `SiteAuditReconciliationActor`,
picked up by `SiteAuditTelemetryStalledTracker`, and latched into
`AuditCentralHealthSnapshot._stalled` (a `ConcurrentDictionary<string, bool>`
in the `ScadaLink.AuditLog` assembly).
- `CentralAuditWriteFailures` is incremented inside `AuditCentralHealthSnapshot`
via `ICentralAuditWriteFailureCounter.Increment()` (also in `ScadaLink.AuditLog`).
Neither metric is referenced anywhere in `src/ScadaLink.HealthMonitoring/`:
- `ICentralHealthAggregator` does not expose them.
- `SiteHealthCollector` has no central counterpart (it is site-only).
- `SiteHealthReport` has no `SiteAuditTelemetryStalled` / `CentralAuditWriteFailures`
fields (the site-only `SiteAuditWriteFailures`, `AuditRedactionFailure`, and
`SiteAuditBacklog` _are_ wired; the central pair is the gap).
Currently the only consumer of `IAuditCentralHealthSnapshot` is whatever
Central UI page binds to it directly (out of scope for this module), but the
design doc places these metrics under HealthMonitoring's responsibility
("Health Monitoring Dashboard displays aggregated metrics"). At minimum the
Dependencies section's claim that Health Monitoring provides "the
central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics"
is false for `CentralAuditWriteFailures`: nothing under
`src/ScadaLink.HealthMonitoring/` knows about it.
**Recommendation**
Decide whether HealthMonitoring or the consuming UI page owns the
`IAuditCentralHealthSnapshot` integration:
- If HealthMonitoring owns it, expose a `CentralKpis` accessor on
`ICentralHealthAggregator` (e.g. a `GetCentralAuditHealth()` method that
returns a typed DTO derived from the injected `IAuditCentralHealthSnapshot`)
so the dashboard has a single read surface mirroring `GetAllSiteStates`.
- If the UI page binds `IAuditCentralHealthSnapshot` directly, update the
HealthMonitoring design doc's Responsibilities / Dependencies sections to
reflect that and remove the implied integration.
Either way, add a regression test that the chosen surface returns the live
counter and per-site stalled state.
### HealthMonitoring-020 — `MarkHeartbeat` brings offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt`
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:128-147` |
**Description**
The CAS path in `MarkHeartbeat` picks `newHeartbeat = max(receivedAt, existing.LastHeartbeatAt)`,
then short-circuits only when `newHeartbeat == existing.LastHeartbeatAt &&
existing.IsOnline`. That short-circuit is correct, but consider the case where
`existing.IsOnline == false` and `receivedAt <= existing.LastHeartbeatAt`:
1. Suppose a site is marked offline by `CheckForOfflineSites` at time T1.
2. A late/out-of-order heartbeat carrying a `receivedAt` _older_ than the last
stored `LastHeartbeatAt` arrives at T2 (clock skew at the receive site, or a
delayed message that was generated before the offline-marking).
3. `newHeartbeat == existing.LastHeartbeatAt` (kept), but the short-circuit
condition fails because `existing.IsOnline == false`, so the CAS produces a
new record with `IsOnline = true` and the **stale** `LastHeartbeatAt`.
4. On the very next `CheckForOfflineSites` tick (≤ `OfflineTimeout/2` later),
`now - LastHeartbeatAt` is still ≥ `OfflineTimeout`, so the site is
immediately marked offline again — the heartbeat brought it online for less
than the check cadence, producing a "flap" in the dashboard.
In practice `receivedAt` is normally `_timeProvider.GetUtcNow()` at the
`CentralCommunicationActor` receive site, so monotonically increasing — the bug
is latent. But the contract `MarkHeartbeat(string siteId, DateTimeOffset receivedAt)`
makes no guarantee about ordering, and an out-of-order delivery (Akka remoting
ordering across connection re-establishment edge cases) or a small wall-clock
correction at central would expose it.
**Recommendation**
When transitioning offline → online, use `now` (from the injected
`TimeProvider`) rather than the caller-supplied `receivedAt` for
`LastHeartbeatAt`, or take `max(receivedAt, _timeProvider.GetUtcNow())` so the
recovery point is always recent. A unit test driving `MarkHeartbeat` with a
`receivedAt` older than the last stored heartbeat on an offline site, then a
`CheckForOfflineSites` immediately afterwards, would assert the site stays
online.
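
A minimal sketch of the recommended transition rule, written as a pure function so the CAS loop around it can be left out. `SiteStateSketch` and `HeartbeatSketch.Apply` are hypothetical names; the real `SiteHealthState` record has more fields, and `now` is assumed to come from the injected `TimeProvider`.

```csharp
using System;

// Hypothetical, trimmed-down version of the immutable per-site state record.
public sealed record SiteStateSketch(bool IsOnline, DateTimeOffset LastHeartbeatAt);

public static class HeartbeatSketch
{
    public static SiteStateSketch Apply(
        SiteStateSketch existing, DateTimeOffset receivedAt, DateTimeOffset now)
    {
        if (!existing.IsOnline)
        {
            // Offline -> online: never recover with a timestamp older than "now",
            // otherwise the next offline check would mark the site offline again.
            var recovered = receivedAt > now ? receivedAt : now;
            return existing with { IsOnline = true, LastHeartbeatAt = recovered };
        }

        // Already online: keep the newest heartbeat and skip no-op updates.
        var newest = receivedAt > existing.LastHeartbeatAt ? receivedAt : existing.LastHeartbeatAt;
        return newest == existing.LastHeartbeatAt
            ? existing
            : existing with { LastHeartbeatAt = newest };
    }
}
```

In the real aggregator the result of `Apply` would be published through the existing compare-and-swap retry loop; only the choice of `LastHeartbeatAt` changes.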
### HealthMonitoring-021 — `CentralSiteId = "central"` reserved constant silently collides with a real site named "central"
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:22`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:224-226` |
**Description**
`CentralHealthAggregator.CheckForOfflineSites` looks up the per-site offline
timeout with:
```csharp
var timeout = kvp.Key == CentralHealthReportLoop.CentralSiteId
    ? _options.CentralOfflineTimeout
    : _options.OfflineTimeout;
```
`CentralSiteId` is the literal string `"central"`. Site IDs are free-form
strings set in configuration / the Sites repository; there is no validation
that excludes the reserved `"central"` name. An operator who creates a real
site with `SiteId = "central"` runs into two problems:
- The real site's reports arriving via `ProcessReport` are stored in the same
dictionary slot as the central self-report (they share the keyspace), so the
two reports repeatedly overwrite each other via the sequence-number guard:
whichever has the higher Unix-ms seed wins, and the other is silently
rejected as stale. The dashboard alternates between two unrelated payloads.
- The real site gets the longer `CentralOfflineTimeout` (default 3 minutes)
instead of the normal `OfflineTimeout` (60 s), so a genuinely failed real
site named "central" stays falsely online for an extra two minutes.
**Recommendation**
Two options:
1. Reject the reserved name at the Site entity / configuration validation
layer (Configuration Database component, out of this module's scope) and
document `"central"` as reserved. This is the cleaner UX fix.
2. As a defence-in-depth inside HealthMonitoring, store the central
self-report under a key that cannot collide — e.g. prefix it with a
character that is forbidden in real site IDs (`":central"` or `"#central"`)
— and adjust `CheckForOfflineSites` accordingly.
Either fix should include a regression test creating a real `SiteHealthReport`
with `SiteId = "central"` and asserting the central self-report's identity is
preserved.
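
Option 1 could look roughly like the check below. `SiteIdRules` and `IsAllowedSiteId` are hypothetical names, and the case-insensitive, whitespace-trimmed comparison is an assumption of this sketch rather than documented behaviour of the Sites repository.

```csharp
using System;

public static class SiteIdRules
{
    // Mirrors the reserved key used for the central self-report.
    public const string ReservedCentralSiteId = "central";

    // Returns false for empty IDs and for any casing of the reserved name.
    public static bool IsAllowedSiteId(string? siteId)
    {
        if (string.IsNullOrWhiteSpace(siteId))
        {
            return false;
        }

        return !string.Equals(
            siteId.Trim(), ReservedCentralSiteId, StringComparison.OrdinalIgnoreCase);
    }
}
```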
### HealthMonitoring-022 — `CentralHealthReportLoopTests` uses real-time `PeriodicTimer` + `Task.Delay`; flake-prone on slow CI
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.HealthMonitoring.Tests/CentralHealthReportLoopTests.cs:32-42` |
**Description**
`RunLoopBriefly` starts the hosted service with a 50 ms `PeriodicTimer` and
then `await Task.Delay(runForMs, CancellationToken.None)` (with `runForMs`
between 150 ms and 300 ms). `GeneratesCentralReports_WhenSelfIsPrimary` and
`AssignsMonotonicSequenceNumbers` both assert "at least 2 reports were
generated" within the window. On a heavily-contended CI runner where the
hosted-service start-up plus a couple of `PeriodicTimer` ticks can blow past
300 ms, these tests will flake intermittently.
The rest of the suite (`CentralHealthAggregatorTests`, `SiteHealthCollectorTests`,
`HealthReportSenderTests` partially) was deliberately refactored to use the
injected `TimeProvider` precisely to avoid this. `CentralHealthReportLoop` and
`HealthReportSender` already accept a `TimeProvider`, but the loop's
`PeriodicTimer` is still real-time because the injected `TimeProvider` is not
passed through to it.
**Recommendation**
Either (a) accept the timing-sensitivity and bump the delay budget
generously, or (b) refactor the hosted-service loop to use a
`TimeProvider.CreateTimer`-based tick mechanism so the test can advance a
fake clock and assert deterministically how many ticks fire. Option (b) is
the better long-term fix and matches the pattern used elsewhere in the
module's tests.
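
One possible shape for option (b), assuming the project targets .NET 8 or later, where `PeriodicTimer` has a constructor overload that accepts a `TimeProvider`. `TickLoopSketch` and its parameters are hypothetical; the sketch only shows the timer taking its clock from the provider, not the real hosted-service loop.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TickLoopSketch
{
    // Runs onTick once per period until maxTicks is reached or the token is
    // cancelled. Returns the number of ticks that fired.
    public static async Task<int> RunAsync(
        TimeProvider timeProvider,
        TimeSpan period,
        int maxTicks,
        Action onTick,
        CancellationToken cancellationToken = default)
    {
        // The timer takes its notion of time from the supplied provider, so a
        // test can pass a fake provider and advance it manually.
        using var timer = new PeriodicTimer(period, timeProvider);

        int ticks = 0;
        while (ticks < maxTicks && await timer.WaitForNextTickAsync(cancellationToken))
        {
            onTick();
            ticks++;
        }

        return ticks;
    }
}
```

Production code would pass `TimeProvider.System`; a test would pass a controllable provider (for example the `FakeTimeProvider` from the `Microsoft.Extensions.TimeProvider.Testing` package, if the test project references it) and advance it tick by tick.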
### HealthMonitoring-023 — `StoreAndForwardBufferDepths_IsEmptyPlaceholder` test name is stale; it now covers the default-state contract, not a placeholder
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `tests/ScadaLink.HealthMonitoring.Tests/SiteHealthCollectorTests.cs:117-122` |
**Description**
The test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` was originally named
to codify the HealthMonitoring-001 bug ("`SetStoreAndForwardDepths` has no
callers, so `StoreAndForwardBufferDepths` is always empty"). HealthMonitoring-001
is `Resolved`: `HealthReportSender` now populates per-category depths from
the S&F engine, and the same test class has `SetStoreAndForwardDepths_ReflectedInReport`
covering the populated path. The "placeholder" test still passes because it
constructs a fresh collector and never calls the setter, so its assertion
(`Assert.Empty(report.StoreAndForwardBufferDepths)`) is now testing the
**default empty state of an un-configured collector**. The HealthMonitoring-001
resolution note explicitly chose to keep it as "the collector-level
default-state test", but the test method name and the implied semantics no
longer match.
A maintainer reading the test name today will misread it as documentation that
the metric is unimplemented (which it isn't), and may waste time investigating
a non-bug.
**Recommendation**
Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled`
(or similar) and update any comment in the test body so it describes the
default-state contract. This is purely a documentation / maintainability fix
with no behaviour change.
+325 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Host` |
| Design doc | `docs/requirements/Component-Host.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
## Summary
@@ -48,6 +48,38 @@ Serilog sink setup is hard-coded in `Program.cs` rather than configuration-drive
REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
exception type including permanent schema-validation failures (Host-015).
#### Re-review 2026-05-28 (commit `1eb6e97`)
All fifteen prior findings (Host-001..015) remain `Resolved` in the current tree
and the fixes introduced for them (Host-001's predicate, the externalised
secrets, the Site GrpcPort/RemotingPort/seed-port validation rules, the escaped
HOCON builder with `DownIfAlone` and millisecond-precision durations, the
configuration-driven Serilog sinks, the transient-only `StartupRetry`
classifier) are all still in place. This re-review walked the ten checklist
categories over the full module again and recorded seven new findings, none of
them crash/data-loss class. Host-016 (Medium) mirrors the resolved Host-004
shipped-config bug on the **Communication** side: `appsettings.Site.json`'s
second `CentralContactPoints` entry points at the site's own remoting port
(`localhost:8082`) instead of central, an incorrect dev example that is likely
to be copied into multi-central deployments. Host-017 (Medium) flags a partial
REQ-HOST-7
implementation — the documented site-shutdown ordering (stop accepting streams
first, cancel active streams via `IHostApplicationLifetime.ApplicationStopping`,
then tear down actors) is not wired: the site path registers no
`ApplicationStopping` handler that signals `SiteStreamGrpcServer`, and the gRPC
server exposes no cancel-all-streams entry point. The remaining five are Low:
`NodeOptions.NodeName` (the operator-configured value stamped on
`AuditLog.SourceNode`) is absent from both shipped per-role configs even though
the docker per-node configs set it (Host-018); the migration `StartupRetry`
call passes `default` for `CancellationToken`, so a SIGTERM during the
bounded-retry window is ignored for up to ~2 minutes (Host-019);
`LoggerConfigurationFactory` layers `MinimumLevel.Is` over
`ReadFrom.Configuration`, so any `Serilog:MinimumLevel` an operator sets is
silently overridden by `ScadaLink:Logging:MinimumLevel` (Host-020); the
shipped `appsettings.json` carries a Microsoft `Logging:LogLevel` block, but
Serilog is the only logger provider, so the section is dead config (Host-021);
and `ParseLevel` silently swallows an unrecognised `MinimumLevel` value (e.g.
a typo) and falls back to `Information` with no warning (Host-022).
## Checklist coverage
| # | Category | Examined | Notes |
@@ -63,6 +95,21 @@ exception type including permanent schema-validation failures (Host-015).
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Re-review: `appsettings.Site.json` second `CentralContactPoints` entry targets the site's own remoting port instead of central (Host-016) — same defect class as the resolved Host-004 seed-list bug. |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping, role-scoped site singletons, ClusterClient initial-contact wiring all reviewed; no new issues. |
| 3 | Concurrency & thread safety | ☑ | `_trackedDisposables` is locked on both sides of the lifecycle; `_actorSystem` publication is safe via the IHost startup `await` boundary. New Low: `StartupRetry` migration call passes `default` `CancellationToken`, so SIGTERM during the retry window is ignored (Host-019). |
| 4 | Error handling & resilience | ☑ | `IsTransientDatabaseFault` correctly classifies socket / timeout / SqlException; the retry helper itself remains sound. Host-019 is the resilience gap. |
| 5 | Security | ☑ | Secrets stay externalised; the `_secrets` placeholder comment is intact. No new issues. |
| 6 | Performance & resource management | ☑ | No new undisposed resources; gRPC stream lifetime cap remains correct. No new issues. |
| 7 | Design-document adherence | ☑ | Re-review: REQ-HOST-7 site-shutdown ordering — stop accepting new streams, cancel active streams via `ApplicationStopping`, then tear down actors — is not wired in `Program.cs` (Host-017). |
| 8 | Code organization & conventions | ☑ | Re-review: `NodeOptions.NodeName` is absent from the shipped per-role configs even though it stamps `AuditLog.SourceNode` (Host-018); the appsettings `Logging:LogLevel` Microsoft section is dead config under Serilog (Host-021). |
| 9 | Testing coverage | ☑ | Strong existing suite. No coverage for the Site `CentralContactPoints` second-entry rule (Host-016), the site-shutdown ordering (Host-017), the `NodeName`-absent shipped config (Host-018), the unused `CancellationToken` parameter (Host-019), the `MinimumLevel.Is` override semantics (Host-020) or the `ParseLevel` silent fallback (Host-022). |
| 10 | Documentation & comments | ☑ | Re-review: layered `MinimumLevel.Is` / `ReadFrom.Configuration` semantics are not surfaced — an operator-set `Serilog:MinimumLevel` is silently overridden by `ScadaLink:Logging:MinimumLevel` (Host-020); `ParseLevel` silently coerces a misspelled level to `Information` with no warning (Host-022). |
## Findings
### Host-001 — `/health/ready` includes the leader-only `active-node` check
@@ -777,3 +824,278 @@ site now passes it. Regression tests in `StartupRetryTests`:
when `isTransient` returns false) and `ExecuteWithRetry_TransientThenPermanent_StopsAtPermanent`
(retries a `TimeoutException` then stops at a permanent `InvalidOperationException`).
Full Host suite green (182 passed).
### Host-016 — Site `CentralContactPoints` second entry targets the site's own remoting port
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Host/appsettings.Site.json:33-37` |
**Description**
The shipped site config sets `Node:RemotingPort = 8082` and lists
`Communication:CentralContactPoints` as
`["akka.tcp://scadalink@localhost:8081", "akka.tcp://scadalink@localhost:8082"]`.
The second contact point — port `8082` — is the **site's own** remoting endpoint,
not a central node. `SiteCommunicationActor` / `ClusterClient` uses these
addresses as initial contacts when discovering the central
`ClusterClientReceptionist`; a contact pointing at the site itself can never
reach the central receptionist and will be a permanent failure in the
initial-contact rotation. For the single-node dev loopback layout the first
contact (`8081`, central) succeeds and the bug is masked, but this is exactly
the kind of dev-config "example" that gets duplicated into multi-central
deployments — the same failure mode the resolved Host-004 finding called out
for the seed-node list. `StartupValidator` validates seed nodes against the
gRPC port (Host-004) but does not cross-check `CentralContactPoints` against
the site's own `RemotingPort`, so the misconfiguration passes silently.
**Recommendation**
Correct the shipped site example to list two central remoting endpoints (e.g.
`localhost:8081` for `central-a` and a distinct port for `central-b` in a
multi-node layout). Consider extending `StartupValidator` to reject any
`Communication:CentralContactPoints` entry whose host+port matches this site
node's `NodeHostname`+`RemotingPort`. Add a regression test in
`StartupValidatorTests` mirroring `Site_SeedNodeOnGrpcPort_FailsValidation`.
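
The proposed cross-check might be sketched like this. `ContactPointSketch.PointsAtSelf` is a hypothetical helper, not part of `StartupValidator`, and it assumes contact points follow the `akka.tcp://system@host:port` form shown above so that `System.Uri` can extract the host and port.

```csharp
using System;

public static class ContactPointSketch
{
    // True when the contact point's host and port match this node's own
    // remoting endpoint, i.e. the entry can never reach a central node.
    public static bool PointsAtSelf(string contactPoint, string nodeHostname, int remotingPort)
    {
        if (!Uri.TryCreate(contactPoint, UriKind.Absolute, out var uri))
        {
            // Malformed entries are assumed to be rejected by a separate rule.
            return false;
        }

        return uri.Port == remotingPort
            && string.Equals(uri.Host, nodeHostname, StringComparison.OrdinalIgnoreCase);
    }
}
```

A validator rule would then fail startup for any `CentralContactPoints` entry where `PointsAtSelf` returns true for the node's `NodeHostname` and `RemotingPort`.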
**Resolution**
_Open._
### Host-017 — Site-shutdown ordering from REQ-HOST-7 is not wired
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Program.cs:229-265`, `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs` |
**Description**
REQ-HOST-7 documents an explicit four-step shutdown sequence for site nodes:
"(1) On `CoordinatedShutdown`, stop accepting new gRPC streams first.
(2) Cancel all active gRPC streams (triggering client-side reconnect).
(3) Tear down actors.
(4) Use `IHostApplicationLifetime.ApplicationStopping` to signal the gRPC
server." The site path in `Program.cs` (the `role == "Site"` branch) registers
no `IHostApplicationLifetime.ApplicationStopping` callback, and
`SiteStreamGrpcServer` exposes no "stop accepting" / "cancel all streams"
entry point — it has `SetReady` but no corresponding `SetUnavailable` or
`CancelAllStreams`. In practice, on `SIGTERM` Kestrel closes its listener
naturally and `AkkaHostedService.StopAsync` runs Akka `CoordinatedShutdown`,
but there is no explicit, ordered handoff that meets the documented contract:
in-flight streams are not actively cancelled before actors begin tearing down,
so clients see a stream that goes silent (and only times out via gRPC
keepalive) rather than a clean `Cancelled` they can reconnect on. This is a
contract-vs-code drift — either the design doc is overstating what is
implemented, or the implementation is incomplete.
**Recommendation**
Add a `SiteStreamGrpcServer.CancelAllStreams()` method that flips a "shutting
down" flag (so `SubscribeSite` immediately fails new streams with
`StatusCode.Unavailable`) and cancels every entry's `Cts` in the `_streams`
map. In `Program.cs` site branch, resolve `IHostApplicationLifetime` and
register a callback on `ApplicationStopping` that calls `CancelAllStreams()`
before the Akka hosted service runs `CoordinatedShutdown` (or order via
`AkkaHostedService.StopAsync` itself — `IHostedService.StopAsync` runs in
reverse-registration order, so the gRPC server's lifetime can be sequenced
before Akka shutdown). Alternatively, reconcile REQ-HOST-7 with the actual
implementation if the explicit ordering is no longer intended. Add an
integration test under `tests/ScadaLink.Host.Tests` that starts a site host,
opens a stream, triggers shutdown, and asserts the stream completes with
`Cancelled` before the actor system tears down.
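
A rough sketch of the "shutting down" flag plus cancel-all behaviour, with the gRPC specifics left out. `StreamRegistrySketch` and `TryRegister` are hypothetical; in the real `SiteStreamGrpcServer` a rejected registration would presumably surface as `StatusCode.Unavailable`, and the `_streams` entries hold more than a `CancellationTokenSource`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class StreamRegistrySketch
{
    private readonly ConcurrentDictionary<string, CancellationTokenSource> _streams = new();
    private int _shuttingDown; // 0 = accepting streams, 1 = shutting down

    // Returns false once shutdown has begun (or if the ID is already in use).
    public bool TryRegister(string streamId, CancellationTokenSource cts)
    {
        if (Volatile.Read(ref _shuttingDown) == 1) return false;
        if (!_streams.TryAdd(streamId, cts)) return false;

        // Re-check: shutdown may have started between the first check and the add.
        if (Volatile.Read(ref _shuttingDown) == 1)
        {
            if (_streams.TryRemove(streamId, out _)) cts.Cancel();
            return false;
        }

        return true;
    }

    // Stops accepting new streams, then cancels every stream still registered.
    public void CancelAllStreams()
    {
        Interlocked.Exchange(ref _shuttingDown, 1);
        foreach (var streamId in _streams.Keys)
        {
            if (_streams.TryRemove(streamId, out var cts)) cts.Cancel();
        }
    }
}
```

The site branch could then wire it up with something like `lifetime.ApplicationStopping.Register(() => registry.CancelAllStreams())`, where `lifetime` is the resolved `IHostApplicationLifetime`.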
**Resolution**
_Open._
### Host-018 — Shipped per-role configs omit `NodeOptions.NodeName`, leaving `SourceNode` null
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.Host/appsettings.Central.json`, `src/ScadaLink.Host/appsettings.Site.json`, `src/ScadaLink.Host/NodeOptions.cs:10-16` |
**Description**
`NodeOptions.NodeName` is documented as "the operator-configured semantic node
name used to stamp the SourceNode column on audit rows", with conventional
values `node-a`/`node-b` for site nodes and `central-a`/`central-b` for
central nodes. The CLAUDE.md "Centralized Audit Log" key-decision section
calls this out: `SourceNode` is meant to be carried verbatim through audit
telemetry and reconciliation, and is indexed via
`IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc)`. The docker per-node
configs (`docker/central-node-a/appsettings.Central.json`,
`docker/site-a-node-a/appsettings.Site.json`, etc.) all set
`ScadaLink:Node:NodeName`. The **shipped, default** per-role files in
`src/ScadaLink.Host/` — the templates a developer running the binary
directly will use — do not. `NodeIdentityProvider` normalises an empty
`NodeName` to `null`, so dev audit rows carry a null `SourceNode` and the
indexed lookup never narrows. The dev examples should match the docker
examples; at minimum the field should appear in the shipped templates with a
placeholder explaining the convention.
**Recommendation**
Add `"NodeName": "central-a"` (or a placeholder like `"${NODE_NAME}"`) to
`appsettings.Central.json` and `"NodeName": "node-a"` to
`appsettings.Site.json`, with a short comment that the value must be set
per-node in multi-node deployments. Consider validating in `StartupValidator`
that `NodeName` is non-empty, or accept the null and document explicitly that
single-node dev deployments leave `SourceNode` null.
**Resolution**
_Open._
### Host-019 — Migration `StartupRetry` call drops the host `CancellationToken`
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.Host/Program.cs:154-165` |
**Description**
`StartupRetry.ExecuteWithRetryAsync` accepts an optional
`CancellationToken cancellationToken = default` and observes it both at the
top of each attempt and inside the `Task.Delay` between retries. The migration
call site in `Program.cs` passes no token, so the helper runs with
`CancellationToken.None`. With `maxAttempts: 8`, `initialDelay: 2s`, and the
30s cap, a database that stays unreachable can keep the retry loop alive for
~2 minutes before the host process responds to `SIGTERM` / `Ctrl+C` /
Windows-Service stop. The host is built but not yet running at this point in
the `Program.cs` startup pipeline, so no token is currently forwarded;
however, `app.Lifetime.ApplicationStopping` is available the moment
`builder.Build()` returns. Threading it into the retry call honours the host
lifecycle and matches the helper's documented contract.
**Recommendation**
Pass `app.Lifetime.ApplicationStopping` (or `CancellationToken.None`
explicitly with a comment if intentional) into
`StartupRetry.ExecuteWithRetryAsync`. Add a `StartupRetryTests` case
exercising token-cancellation mid-backoff.
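
A minimal sketch of the recommended call shape, assuming a simplified retry helper. `RetrySketch`, its parameter list, and the doubling backoff are hypothetical stand-ins rather than the real `StartupRetry` implementation; only the token-observation points (top of each attempt, inside `Task.Delay`) follow the description above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for StartupRetry; the real helper's signature may differ.
public static class RetrySketch
{
    public static async Task ExecuteWithRetryAsync(
        Func<CancellationToken, Task> operation,
        int maxAttempts,
        TimeSpan initialDelay,
        TimeSpan maxDelay,
        CancellationToken cancellationToken = default)
    {
        var delay = initialDelay;
        for (var attempt = 1; ; attempt++)
        {
            // Observed at the top of each attempt.
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                await operation(cancellationToken);
                return;
            }
            // The final attempt's exception, and any failure after cancellation, propagates.
            catch (Exception) when (attempt < maxAttempts && !cancellationToken.IsCancellationRequested)
            {
                // Task.Delay throws OperationCanceledException as soon as the token fires,
                // so a host stop interrupts the backoff instead of waiting it out.
                await Task.Delay(delay, cancellationToken);
                // Assumed doubling backoff, capped at maxDelay.
                delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
            }
        }
    }
}
```

At the call site this would read roughly `await RetrySketch.ExecuteWithRetryAsync(ct => MigrateAsync(ct), 8, TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(30), app.Lifetime.ApplicationStopping);`, where `MigrateAsync` is a placeholder for the migration delegate.
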
**Resolution**
_Open._
### Host-020 — `MinimumLevel.Is` silently overrides any operator-set `Serilog:MinimumLevel`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.Host/LoggerConfigurationFactory.cs:36-43` |
**Description**
`LoggerConfigurationFactory.Build` reads the `Serilog` configuration section
via `ReadFrom.Configuration(configuration)` (which can include a
`MinimumLevel` block — the standard Serilog way to set the floor) and **then**
calls `.MinimumLevel.Is(minimumLevel)` derived from
`ScadaLink:Logging:MinimumLevel`. Serilog's fluent builder applies the later
call, so any `Serilog:MinimumLevel:Default` an operator sets is silently
overridden by `ScadaLink:Logging:MinimumLevel` (or by its
`Information` fallback when the ScadaLink key is absent). There are now two
documented configuration paths for the same setting with non-obvious
precedence, and the override direction is the opposite of what most Serilog
users would expect (the more-specific `Serilog` section being the authority).
The XML doc on `Build` says "the explicit `MinimumLevel.Is` pins the floor"
but does not warn that the floor *overrides* the Serilog section's own
`MinimumLevel`.
**Recommendation**
Pick one mechanism: either (a) drop the `MinimumLevel.Is` call and let
`ReadFrom.Configuration` consume `Serilog:MinimumLevel`, migrating any docs/
deployments that reference `ScadaLink:Logging:MinimumLevel`; or (b) keep the
current "ScadaLink:Logging" path and reject `Serilog:MinimumLevel` if present
(throw at startup so the operator sees the conflict). At minimum, expand the
XML doc + REQ-HOST-8 to spell out the precedence explicitly.
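
Option (b) could look something like the sketch below. It works over a flattened list of configuration keys so that it stays self-contained; `MinimumLevelConflictSketch` and its method are hypothetical names. In the real Host the equivalent check would more likely be `configuration.GetSection("Serilog:MinimumLevel").Exists()`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical startup guard: fail fast when both level mechanisms are configured.
public static class MinimumLevelConflictSketch
{
    private const string SerilogKey = "Serilog:MinimumLevel";

    public static void EnsureNoSerilogMinimumLevel(IEnumerable<string> flattenedKeys)
    {
        foreach (var key in flattenedKeys)
        {
            // Configuration keys are case-insensitive; match the key itself
            // ("Serilog:MinimumLevel") or any child ("Serilog:MinimumLevel:Default").
            var isExact = key.Equals(SerilogKey, StringComparison.OrdinalIgnoreCase);
            var isChild = key.StartsWith(SerilogKey + ":", StringComparison.OrdinalIgnoreCase);
            if (isExact || isChild)
            {
                throw new InvalidOperationException(
                    $"'{key}' is set, but the minimum level is controlled by " +
                    "'ScadaLink:Logging:MinimumLevel'. Remove one of the two.");
            }
        }
    }
}
```

Throwing at startup makes the precedence conflict visible to the operator instead of leaving one setting silently ignored.
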
**Resolution**
_Open._
### Host-021 — Microsoft `Logging:LogLevel` section in `appsettings.json` is dead config under Serilog
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.Host/appsettings.json:2-6` |
**Description**
`appsettings.json` carries a Microsoft `Logging:LogLevel:Default = Information`
block. The `Logging:LogLevel` map is consumed by the
`Microsoft.Extensions.Logging` filter rules (`LoggerFilterOptions`) and the
per-provider settings bound from the standard `Logging` section. The Host
calls `builder.Host.UseSerilog()`, which replaces the default
`ILoggerFactory` setup with Serilog as the **only** logger provider; Serilog
is configured through `ReadFrom.Configuration(configuration)`, which consumes
the `Serilog` section, **not** `Logging:LogLevel`. The result is that an
operator editing `Logging:LogLevel:Default` (a very natural thing to try,
since it is the .NET convention) sees no behaviour change: the section is
dead config.
**Recommendation**
Either remove the `Logging:LogLevel` block from `appsettings.json` (Serilog
owns logging configuration in this Host), or replace it with a brief comment
explaining it is intentionally retained for non-Serilog tooling. Document the
authoritative location (`Serilog` + `ScadaLink:Logging`) in
`Component-Host.md` REQ-HOST-8 if not already explicit.
**Resolution**
_Open._
### Host-022 — `ParseLevel` silently coerces unrecognised `MinimumLevel` to `Information`
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Host/LoggerConfigurationFactory.cs:50-55` |
**Description**
`LoggerConfigurationFactory.ParseLevel` uses
`Enum.TryParse<LogEventLevel>(level, ignoreCase: true, out var parsed)` and
returns `LogEventLevel.Information` when parsing fails — without logging the
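fallback. An operator who sets
`ScadaLink:Logging:MinimumLevel = "Informaiton"` (a common typo) or any
other unrecognised name gets the default level silently; there is no
warning, no log line, no startup error. (`Enum.TryParse` also accepts
comma-separated names and numeric strings, so `"Verbose,Debug"` or `"42"`
parses "successfully" to a level the operator probably did not intend, which
is equally silent.) Combined with Host-020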
(this is the only mechanism that pins the floor), a misspelt value is
invisible until someone wonders why the level change "didn't take". The
helper is small and could either fail-fast in `StartupValidator` or emit a
console warning before the logger is configured.
**Recommendation**
In `LoggerConfigurationFactory.Build`, when `loggingOptions.MinimumLevel` is
non-null/non-blank but does not parse to a valid `LogEventLevel`, write a
`Console.Error.WriteLine` warning (the logger is not yet built) and proceed
with `Information`. Alternatively, validate the value in `StartupValidator`
and fail fast — that matches the pattern used for other ScadaLink
configuration keys. Add a `LoggerConfigurationTests` case asserting the
behaviour you choose.
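
The warn-and-fall-back variant might be sketched as follows. The `LogEventLevel` enum here is a local stand-in that mirrors Serilog's level names so the example compiles without the Serilog package, and the `warn` callback is an assumed seam (in the Host it would be `Console.Error.WriteLine`).

```csharp
using System;

// Stand-in mirroring Serilog's LogEventLevel names; not the real Serilog type.
public enum LogEventLevel { Verbose, Debug, Information, Warning, Error, Fatal }

public static class LevelParsingSketch
{
    public static LogEventLevel ParseLevel(string level, Action<string> warn)
    {
        // Absent or blank key: the documented default applies, no warning needed.
        if (string.IsNullOrWhiteSpace(level))
            return LogEventLevel.Information;

        // Enum.TryParse accepts numeric strings such as "42"; Enum.IsDefined
        // rejects results that are not a named level.
        if (Enum.TryParse<LogEventLevel>(level, ignoreCase: true, out var parsed)
            && Enum.IsDefined(typeof(LogEventLevel), parsed))
        {
            return parsed;
        }

        warn($"ScadaLink:Logging:MinimumLevel '{level}' is not a valid level; using Information.");
        return LogEventLevel.Information;
    }
}
```

Calling `LevelParsingSketch.ParseLevel("Informaiton", Console.Error.WriteLine)` returns `Information` and writes one warning line, so the typo is no longer invisible.
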
**Resolution**
_Open._
+389 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.InboundAPI` |
| Design doc | `docs/requirements/Component-InboundAPI.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 8 |
## Summary
@@ -64,6 +64,66 @@ statement that the timeout covers routed calls (InboundAPI-016); and (4) `RouteH
| 9 | Testing coverage | ☑ | Re-review: `RouteHelper`/`RouteTarget` (WP-4 routing) entirely untested (InboundAPI-017); validators/executor/filter well covered. |
| 10 | Documentation & comments | ☑ | `ApiKeyValidationResult.NotFound` XML/name says "NotFound" but returns HTTP 400 — misleading (InboundAPI-013). |
#### Re-review 2026-05-28 (commit `1eb6e97`)
All 17 prior findings remain `Resolved`. The module has grown materially since the
last pass — a new `AuditWriteMiddleware` (Audit Log #23 M4 Bundle D) now lives under
`src/ScadaLink.InboundAPI/Middleware/`, the `ApiKeyValidator` was rewired to hash the
candidate with `IApiKeyHasher` (ConfigurationDatabase-012), and an `IInstanceRouter`
seam was introduced. This re-review re-walked all 10 checklist categories against
`1eb6e97` and surfaced **8 new findings** concentrated on the new audit middleware
and a stranded follow-up from InboundAPI-008:
1. The InboundAPI-008 resolution explicitly deferred registering an `IActiveNodeGate`
implementation in `ScadaLink.Host` as a "follow-up outside this module's scope" —
that follow-up is still unfulfilled (no production registration anywhere in
`src/ScadaLink.Host/`), so the design-mandated standby-node gating is silently
disabled in production today (`InboundAPI-022`, High).
2. `AuditWriteMiddleware` is wired in `Program.cs` against `/api/*` rather than the
specific `POST /api/{methodName}` route, so GETs against `/api/audit/query` and
`/api/audit/export` (audit query endpoints — themselves not script invocations)
now emit spurious `AuditChannel.ApiInbound`/`InboundRequest` rows back into the
audit log with `Target` set to the last path segment (`InboundAPI-025`, Medium).
3. The middleware fires its audit write as `_ = _auditWriter.WriteAsync(evt)` — the
wrapping try/catch only catches synchronous throws, so a faulted async writer
task is unobserved and the row silently disappears with no log line
(`InboundAPI-018`, Medium).
4. `ParentExecutionId` correlation flows only through `RouteToCallRequest`;
`RouteToGetAttributesRequest`/`RouteToSetAttributesRequest` have no
`ParentExecutionId` field, so attribute reads/writes from inbound scripts lose
the inbound→site execution-tree link the Audit Log decision in CLAUDE.md
describes (`InboundAPI-021`, Medium).
5. `EndpointExtensions.HandleInboundApiRequest` — the entire wiring composition
that ties validator/executor/route/audit together — has no test coverage; only
the components it composes are tested (`InboundAPI-023`, Low).
6. `EndpointExtensions.HandleInboundApiRequest` does
`ContentType?.Contains("json")` (case-sensitive) so a request with
`application/JSON` and no Content-Length silently skips JSON body parsing
(`InboundAPI-020`, Low).
7. `AuditWriteMiddleware.InvokeAsync` calls `EnableBuffering()` unconditionally
before the empty-body short-circuit, allocating a `FileBufferingReadStream` for
every request including bodyless ones (`InboundAPI-019`, Low).
Severity mix: 1 High, 3 Medium, 4 Low — no Critical. (The eighth finding —
`InboundAPI-024`, Low — is a defensive watch-list item flagging that
`_knownBadMethods` is unbounded; it is bounded *in practice* today by the
configuration database, but the invariant is undocumented.)
## Checklist coverage — 2026-05-28 (commit `1eb6e97`)
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `ContentType?.Contains("json")` is case-sensitive (InboundAPI-020). |
| 2 | Akka.NET conventions | ☑ | ASP.NET-hosted, no actors of its own; routes via `IInstanceRouter`/`CommunicationService`. No new issues. |
| 3 | Concurrency & thread safety | ☑ | `ConcurrentDictionary` handler cache (post-001/002 fix). New audit middleware is per-request scoped, no shared mutable state. No new issues. |
| 4 | Error handling & resilience | ☑ | Audit `WriteAsync` is fire-and-forget; async faults are unobserved (InboundAPI-018). |
| 5 | Security | ☑ | `IActiveNodeGate` not registered in Host — standby-node gating disabled in production (InboundAPI-022). |
| 6 | Performance & resource management | ☑ | `EnableBuffering()` unconditional on bodyless requests (InboundAPI-019); audit middleware wraps `Response.Body` and mints `ExecutionId` for non-script /api routes (InboundAPI-025). |
| 7 | Design-document adherence | ☑ | `ParentExecutionId` not stamped on attribute-read/write routed messages (InboundAPI-021). InboundAPI-008's deferred Host registration still unfulfilled (InboundAPI-022). |
| 8 | Code organization & conventions | ☑ | No new issues. |
| 9 | Testing coverage | ☑ | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test (InboundAPI-023); middleware/filter/validator/executor/route are individually covered. |
| 10 | Documentation & comments | ☑ | No new issues. |
## Findings
### InboundAPI-001 — Singleton script handler cache mutated without synchronization
@@ -844,3 +904,329 @@ now depends on `IInstanceLocator` + `IInstanceRouter` (both substitutable). Adde
for each routed method, `GetAttribute` delegating to the batch `GetAttributes` and
returning `null` for an absent key, `SetAttribute` delegating to `SetAttributes`, and
the InboundAPI-016 deadline-token inheritance behaviour. All 15 pass.
### InboundAPI-018 — `AuditWriteMiddleware` fires `WriteAsync` as `_ = task` — faulted async writes are unobserved
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/Middleware/AuditWriteMiddleware.cs:257` |
**Description**
`EmitInboundAudit` calls `_ = _auditWriter.WriteAsync(evt);` — the returned `Task` is
discarded with the discard operator inside a synchronous `try` block. The wrapping
`try/catch (Exception ex)` (lines 198-266) only catches a *synchronous* throw before
the writer returns a task. Once `WriteAsync` returns a task, any exception that
faults that task (e.g. a DB timeout in the central audit writer, a serialization
failure, a cancellation that bubbles up) is never observed: it is not logged, it
does not increment the `CentralAuditWriteFailures` health-monitoring counter the
design doc references ("Fail-soft semantics" paragraph), and the audit row is
silently lost. In .NET, an unobserved task exception surfaces only through the
`TaskScheduler.UnobservedTaskException` event, which is raised at a
nondeterministic time when the faulted task is finalized by the garbage collector;
either way, the middleware itself has no control over what (if anything) happens
on a fault. The XML doc comment at line 255 claims "the writer itself swallows"
but this is an implicit cross-component contract: the abstraction
`ICentralAuditWriter.WriteAsync` returns `Task` and makes no such guarantee, and
the only test that exercises a throwing writer (`AuditWriter_Throws_*` in
`AuditWriteMiddlewareTests.cs`) uses an `OnWrite` callback that throws
*synchronously*, not asynchronously — so the async fault path is not covered by
tests either.
This matters because Component-InboundAPI.md states that audit-emission failures
must increment `CentralAuditWriteFailures` (Health Monitoring #11) — a counter
that, with the current fire-and-forget, will under-count async-faulted writes.
**Recommendation**
Either (a) await the write and rely on the surrounding try/catch to log the
failure, accepting an extra await on the request hot path; or (b) keep the
fire-and-forget for latency but attach a `ContinueWith(t => ..., OnlyOnFaulted)`
that logs the fault and increments the failure counter, so a faulted async write
is at least observed. Option (b) preserves "audit emission never blocks the HTTP
response" while restoring the visibility the design assumes. Add a regression
test using a writer whose `WriteAsync` returns a faulted `Task` (not a
synchronous throw) to pin the new contract.
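
One way to get the effect of option (b) is sketched below. Instead of a raw `ContinueWith(..., OnlyOnFaulted)`, it uses a small awaiting wrapper whose returned task the caller discards; the observable behaviour is the same (the fault is logged and counted, the request is not delayed), and the wrapper is easier to test. `AuditObserverSketch` and the `onFault` callback are hypothetical names, not part of the middleware.

```csharp
using System;
using System.Threading.Tasks;

public static class AuditObserverSketch
{
    // Runs the write and routes any failure, synchronous or asynchronous,
    // to onFault. The returned task never faults, so discarding it is safe.
    public static async Task ObserveAsync(Func<Task> write, Action<Exception> onFault)
    {
        try
        {
            // A synchronous throw from write() and a faulted returned Task
            // both end up in the catch block below.
            await write().ConfigureAwait(false);
        }
        catch (Exception ex)
        {
            // In the middleware this is where the log line and the
            // CentralAuditWriteFailures increment would go.
            onFault(ex);
        }
    }
}
```

The hot path would then read `_ = AuditObserverSketch.ObserveAsync(() => _auditWriter.WriteAsync(evt), LogAndCount);`, with `LogAndCount` standing in for whatever logging and counter hook the middleware uses. The response is still not blocked, because the wrapper task is not awaited by the request.
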
### InboundAPI-019 — `EnableBuffering()` called unconditionally on every request, including bodyless requests
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/Middleware/AuditWriteMiddleware.cs:141` |
**Description**
`InvokeAsync` always calls `ctx.Request.EnableBuffering()` before the empty-body
short-circuit at `ReadBufferedRequestBodyAsync` line 289 (`if (request.ContentLength
is 0) return (null, false);`). `EnableBuffering()` swaps the request stream for a
`FileBufferingReadStream` whose construction allocates an internal buffer (default
threshold ~30 KB before spilling to a temp file) regardless of whether the request
actually has a body. The /api scope this middleware lives under will see at least
some bodyless requests (e.g. GET `/api/audit/query`, which already sits in the
same branch, see InboundAPI-025; future health checks; misbehaving clients) and each
one pays the buffering allocation cost for no benefit.
**Recommendation**
Defer the `EnableBuffering()` call into `ReadBufferedRequestBodyAsync` after the
`ContentLength is 0` check, or short-circuit in `InvokeAsync` before enabling
buffering when `ContentLength is 0` and `Method is "GET" or "HEAD" or "DELETE"`.
The win is a per-request `FileBufferingReadStream` allocation avoided on every
bodyless request through the middleware.
### InboundAPI-020 — `ContentType.Contains("json")` is case-sensitive; `application/JSON` with no Content-Length skips body parsing
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/EndpointExtensions.cs:70` |
**Description**
`HandleInboundApiRequest` parses the JSON body only when
`httpContext.Request.ContentLength > 0 || httpContext.Request.ContentType?.Contains("json") == true`.
The `string.Contains(string)` overload used here is case-sensitive — a perfectly
valid HTTP header `Content-Type: application/JSON` (uppercase) would yield
`false` (`"application/JSON".Contains("json")` is `false`). With no
Content-Length (e.g. chunked transfer-encoding) and an uppercase content type,
the handler then leaves `body = null` and `ParameterValidator.Validate` runs
against a missing body — so a method that declares any required parameter is
rejected with "Missing required parameters" even though the caller did send a
well-formed JSON body. RFC 7230 §3.2 makes header field names case-insensitive,
and RFC 7231 §3.1.1.1 states that media-type type and subtype tokens are
case-insensitive as well; in practice clients do sometimes uppercase media-type
tokens, and the framework's own `MediaTypeHeaderValue` is case-insensitive.
**Recommendation**
Use the case-insensitive overload —
`httpContext.Request.ContentType?.Contains("json", StringComparison.OrdinalIgnoreCase) == true`
— or rely on the framework's `IsJson` check via
`MediaTypeHeaderValue.TryParse`/`HttpRequest.HasJsonContentType()`. Add a
regression test posting with `application/JSON` and Transfer-Encoding: chunked.
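
The difference between the two overloads can be illustrated with a small helper. `ContentTypeSketch.LooksLikeJson` is a hypothetical name used only for this example.

```csharp
using System;

public static class ContentTypeSketch
{
    // Case-insensitive substring check, matching the recommended overload.
    public static bool LooksLikeJson(string contentType) =>
        contentType != null
        && contentType.Contains("json", StringComparison.OrdinalIgnoreCase);
}

// "application/JSON".Contains("json")                    -> false (ordinal, case-sensitive)
// ContentTypeSketch.LooksLikeJson("application/JSON")    -> true
// ContentTypeSketch.LooksLikeJson("text/plain")          -> false
```

A substring match still accepts unrelated values that happen to contain "json", which is one reason `HttpRequest.HasJsonContentType()` may be the better long-term choice.
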
### InboundAPI-021 — `ParentExecutionId` correlation flows only through `Call`; attribute reads/writes lose the inbound→site execution-tree link
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:141-143`, `:182-183`, `:225-226`; `src/ScadaLink.Commons/Messages/InboundApi/RouteToInstanceRequest.cs:15-21`, `:36-40`, `:55-59` |
**Description**
CLAUDE.md's Centralized Audit Log section describes `ParentExecutionId` as the
cross-execution spawn pointer that "every row of a spawned run carries" and
specifically calls out "the inbound API → routed-site-script case". The current
implementation honours this only on `RouteToCallRequest` — which carries
`ParentExecutionId` as its trailing additive field (line 21 of
`RouteToInstanceRequest.cs`) and is stamped by `RouteTarget.Call` with the
inbound request's execution id at line 143 of `RouteHelper.cs`.
`RouteToGetAttributesRequest` and `RouteToSetAttributesRequest`, however, have
**no `ParentExecutionId` field** and the matching `RouteTarget.GetAttributes` /
`SetAttributes` methods (`RouteHelper.cs:182-183`, `:225-226`) never reference
`_parentExecutionId`. So when an inbound API script reads or writes a site
attribute via `Route.To("inst").GetAttribute(...)` /
`Route.To("inst").SetAttribute(...)`, the site-side audit row for that
trust-boundary action (an outbound-by-the-script DB / OPC write at the site) is
emitted with `ParentExecutionId = null` and the execution-tree walk over the
`IX_AuditLog_ParentExecution` index cannot link it back to the spawning inbound
request. The two-row pair (inbound + spawned site work) reverts to the
"top-level / null" state the design says is the *fallback* for non-spawned runs.
The asymmetry between `Call` and `GetAttributes`/`SetAttributes` is also surprising
— a script author would reasonably expect the same correlation across all
`Route.To(...)` calls.
**Recommendation**
Add a trailing `Guid? ParentExecutionId = null` field to
`RouteToGetAttributesRequest` and `RouteToSetAttributesRequest` (additive
trailing member, matches the message-evolution rule in CLAUDE.md); stamp it
from `_parentExecutionId` in `RouteTarget.GetAttributes` and
`RouteTarget.SetAttributes`; have the site-side handlers thread the field onto
their emitted audit rows. Add a `RouteHelperTests` regression case asserting
that an attribute read/write carries the inherited `ParentExecutionId`.
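
The additive change could look roughly like this. The record shape and `RouteTargetSketch` are simplified, hypothetical versions of the real message and `RouteTarget` types (which carry more fields); only the trailing `Guid? ParentExecutionId = null` member and the `_parentExecutionId` field name follow the text above.

```csharp
using System;
using System.Collections.Generic;

// Simplified message: the new member is trailing and optional, so existing
// construction sites that omit it keep compiling and default to null.
public sealed record RouteToGetAttributesRequest(
    string InstanceName,
    IReadOnlyList<string> AttributeNames,
    Guid? ParentExecutionId = null);

// Simplified stand-in for RouteTarget.
public sealed class RouteTargetSketch
{
    private readonly string _instanceName;
    private readonly Guid? _parentExecutionId;

    public RouteTargetSketch(string instanceName, Guid? parentExecutionId)
    {
        _instanceName = instanceName;
        _parentExecutionId = parentExecutionId;
    }

    // Stamps the inbound request's execution id onto the routed message,
    // the same way the finding describes for Call.
    public RouteToGetAttributesRequest BuildGetAttributes(params string[] names) =>
        new RouteToGetAttributesRequest(_instanceName, names, _parentExecutionId);
}
```

The site-side handler would then copy `ParentExecutionId` from the message onto the audit rows it emits, which is the part this sketch does not cover.
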
### InboundAPI-022 — `IActiveNodeGate` has no production registration in Host — standby-node gating is silently disabled in production
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/IActiveNodeGate.cs`, `src/ScadaLink.InboundAPI/InboundApiEndpointFilter.cs:52-60`; absent from `src/ScadaLink.Host/Program.cs` |
**Description**
InboundAPI-008's resolution adds `IActiveNodeGate` (lines 17-24 of
`IActiveNodeGate.cs`) so a standby central node can refuse to serve the inbound
API. `InboundApiEndpointFilter.InvokeAsync` consults the gate at line 52
(`var gate = httpContext.RequestServices.GetService<IActiveNodeGate>();`), and
when `gate is { IsActiveNode: false }` returns HTTP 503. The filter's behaviour
when **no implementation is registered** (line 51 comment) is to fall through and
serve the request — the resolution paragraph for InboundAPI-008 closes with:
> "Follow-up (outside this module's scope): `ScadaLink.Host` should register an
> `IActiveNodeGate` implementation backed by `ActiveNodeHealthCheck` /
> `Cluster.State.Leader` in the central-role branch of `Program.cs` so the gate is
> actually enforced in production; until then the endpoint defaults to "allow"."
A grep of the entire `src/ScadaLink.Host/` tree at `1eb6e97` finds **zero**
`IActiveNodeGate` registrations: `grep -rn "IActiveNodeGate\|AddSingleton.*ActiveNode"
src/ScadaLink.Host/` returns no matches. The follow-up was never carried out. So
in production today the standby central node still serves the inbound API exactly
as InboundAPI-008 described — executes method scripts, runs `Route.To()` calls,
races the active node, and may operate against stale singleton state. The new
infrastructure (interface + filter check) is present but unwired; from the user's
perspective the original High-severity issue is unresolved in deployed binaries.
The design says the inbound API is "Central cluster only (active node)" and
"fails over with it" — this guarantee is not currently enforced in production.
**Recommendation**
Register an `IActiveNodeGate` implementation in the central-role branch of
`ScadaLink.Host/Program.cs`. The natural backing is the existing
`ActiveNodeHealthCheck` (already wired for `/health/active`) or a direct read of
`Cluster.Get(actorSystem).State.Leader == Cluster.Get(actorSystem).SelfAddress`.
Add an integration test in the Host that spins up the central role and asserts
that the gate is resolvable and returns `IsActiveNode` consistent with cluster
leader state. Until that wiring lands, this finding is the user-facing
realisation of the InboundAPI-008 vulnerability.
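
The "unregistered means allow" behaviour, and what a registered gate changes, can be shown with a small model. The interface shape is inferred from the `gate is { IsActiveNode: false }` check quoted above; `LeaderCheckActiveNodeGate`, its `Func<bool>` constructor, and `GateDecisionSketch` are hypothetical. In the Host the delegate would wrap the cluster-leader comparison from the recommendation.

```csharp
using System;

// Shape inferred from the filter's pattern match; the real interface may differ.
public interface IActiveNodeGate
{
    bool IsActiveNode { get; }
}

// Hypothetical implementation: delegates the decision to a leader check.
public sealed class LeaderCheckActiveNodeGate : IActiveNodeGate
{
    private readonly Func<bool> _isLeader;

    public LeaderCheckActiveNodeGate(Func<bool> isLeader) =>
        _isLeader = isLeader ?? throw new ArgumentNullException(nameof(isLeader));

    public bool IsActiveNode => _isLeader();
}

public static class GateDecisionSketch
{
    // Mirrors the filter's logic: a null gate (nothing registered) does not
    // match the property pattern, so the request is served.
    public static int StatusFor(IActiveNodeGate gate) =>
        gate is { IsActiveNode: false } ? 503 : 200;
}
```

With no registration the filter resolves `null` and every node answers 200, which is the production state this finding describes. Registering any implementation in the central-role branch is what turns the 503 path on.
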
### InboundAPI-023 — `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test coverage
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/EndpointExtensions.cs:31-140`, `tests/ScadaLink.InboundAPI.Tests/` |
**Description**
The endpoint handler `HandleInboundApiRequest` is the wiring composition that
ties the validator → JSON parse → `ParameterValidator` → `InboundScriptExecutor` →
result-serialization path together; it is the single piece of code that maps
validator status codes to HTTP responses, threads the `parentExecutionId` from
`HttpContext.Items` into the executor, stashes the resolved API key name as
`AuditActorItemKey`, and emits the request-aborted short-circuit. The test
project covers each composed component (`ApiKeyValidatorTests`,
`ParameterValidatorTests`, `InboundScriptExecutorTests`, `RouteHelperTests`,
`InboundApiEndpointFilter`, `AuditWriteMiddlewareTests`,
`MiddlewareOrderTests`) but no test exercises `HandleInboundApiRequest` itself —
so regressions in the wiring (e.g. forgetting to stash the actor name on
`HttpContext.Items`, the `Contains("json")` case sensitivity from
InboundAPI-020, or accidentally swapping `validationResult.StatusCode` for a
literal) are not caught.
**Recommendation**
Add an `EndpointExtensionsTests` suite using `TestServer` (the same pattern
`MiddlewareOrderTests` uses) covering: the happy path (200 + body), invalid
JSON (400), validator 401, validator 403, parameter-validation failure (400),
script-failure 500, client-aborted short-circuit (`Results.Empty`), and the
actor-stash invariant (`HttpContext.Items[AuditActorItemKey]` is set with the
resolved key name after successful auth, but is absent on auth failures).
### InboundAPI-024 — `_knownBadMethods` is unbounded — an attacker can grow the cache by spamming distinct method names against the audit middleware path
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:30`, `:77`, `:223`, `:233` |
**Description**
The InboundAPI-009 fix introduced `_knownBadMethods`, a `ConcurrentDictionary<string, byte>`
of method names whose Roslyn compilation failed, to short-circuit lazy
recompilation. It is keyed by `method.Name` and entries are only ever removed
when `CompileAndRegister` succeeds for the same name (line 83). Practically the
key space is bounded by the configured method definitions in the database, so
this is bounded in normal operation. But because the cache is mutated from the
lazy-compile path at `InboundScriptExecutor.cs:233`, and `ExecuteAsync` is called from
`HandleInboundApiRequest` only **after** `ApiKeyValidator.ValidateAsync` has
returned `Valid` (i.e. a real method exists), the entry is keyed by a name that
must have already been resolved through `GetMethodByNameAsync` — so this attack
surface is gated by the configuration database. The finding is therefore mostly
defensive: there is no rate limit on inbound API calls (deliberate design), so
if a future change ever causes `ExecuteAsync` to be called for an unvalidated
caller-supplied method name (e.g. a refactor that moves method-existence
checking later), this cache would become attacker-controllable.
**Recommendation**
Optional / defensive: cap `_knownBadMethods` (e.g. an LRU with a fixed size, or
clear it periodically). At minimum, document the invariant in the executor's
XML comment that `_knownBadMethods` keys must come from validated
`ApiMethod.Name` values, so the safety property survives future refactors. No
immediate change required; this is a watch-list item.
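
If a cap is ever wanted, the crudest form is a set that clears itself when full, sketched below. `CappedNameSet` is a hypothetical type; the assumption is that forgetting a known-bad name only costs one repeated compile attempt, which should be checked against the real executor before adopting it. The bound is approximate under concurrent adds.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical bounded replacement for an unbounded known-bad-name set.
public sealed class CappedNameSet
{
    private readonly ConcurrentDictionary<string, byte> _names =
        new ConcurrentDictionary<string, byte>(StringComparer.Ordinal);
    private readonly int _capacity;

    public CappedNameSet(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    public int Count => _names.Count;

    public bool Contains(string name) => _names.ContainsKey(name);

    public void Add(string name)
    {
        // When full and the name is new, forget everything rather than track
        // recency. Simpler than an LRU, at the cost of occasional re-work.
        if (_names.Count >= _capacity && !_names.ContainsKey(name))
            _names.Clear();
        _names.TryAdd(name, 0);
    }

    // Called when a later compile of the same name succeeds.
    public void Remove(string name) => _names.TryRemove(name, out _);
}
```

An LRU would retain hot entries better, but for a watch-list item the clear-on-full form keeps the change small.
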
### InboundAPI-025 — `AuditWriteMiddleware` runs against the entire `/api/*` branch — emits spurious `ApiInbound` audit rows for `/api/audit/query` and `/api/audit/export`
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Host/Program.cs:183-185`; consumers: `src/ScadaLink.ManagementService/AuditEndpoints.cs:93-94`; emitter: `src/ScadaLink.InboundAPI/Middleware/AuditWriteMiddleware.cs:175-252` |
**Description**
`Program.cs` wires the audit middleware as
`app.UseWhen(ctx => ctx.Request.Path.StartsWithSegments("/api"), branch => branch.UseAuditWriteMiddleware())`
— scoped to the `/api` *prefix*, not to the `POST /api/{methodName}` route.
Meanwhile, `ScadaLink.ManagementService/AuditEndpoints.cs` maps
`MapGet("/api/audit/query", ...)` (line 93) and `MapGet("/api/audit/export", ...)`
(line 94). Both routes therefore inherit `AuditWriteMiddleware`, which emits an
`AuditEvent { Channel = AuditChannel.ApiInbound, Kind = AuditKind.InboundRequest, ... }`
row for every call. The middleware's `ResolveMethodName` falls back to the last
path segment (lines 446-452), so a GET `/api/audit/query?...` is recorded as if a
caller had invoked an inbound API method named "query"; an export is recorded
as method "export". Effects:
1. **Audit log is polluted with non-script rows.** The audit log is now
recording its own query traffic as if it were inbound script invocations,
contradicting Component-AuditLog.md's scope ("script trust boundary actions").
2. **Audit reads recursively emit audit writes.** Every audit-log query (e.g.
from the Central UI Audit Log page or the CLI `audit query` command) writes
an additional row into `AuditLog`, growing the table on read.
3. **`Target` is meaningless.** `/api/audit/query` has no method definition, so
the recorded `Target = "query"` is not joinable to any `ApiMethod` row in
audit-log drill-ins.
4. **Wasted resources on health probes / management calls.** Any future routes
added under `/api/` will inherit the middleware and pay the
`EnableBuffering`, `CapturedResponseStream`, and `JsonSerializer.Serialize`
costs even though they are not inbound script invocations.
Tests for the audit middleware (`AuditWriteMiddlewareTests`) and pipeline order
(`MiddlewareOrderTests`) wire the middleware only against the
`POST /api/{methodName}` route in test hosts, so this production-only
mis-scoping is not exercised.
**Recommendation**
Tighten the predicate so the middleware runs only on the inbound API method
route, not on the `/api/` prefix. Options:
- `app.UseWhen(ctx => ctx.Request.Path.StartsWithSegments("/api") && !ctx.Request.Path.StartsWithSegments("/api/audit") && !ctx.Request.Path.StartsWithSegments("/api/management"), ...)`
— defensive, but fragile to future route additions.
- Move the audit emission from a pipeline middleware to an `IEndpointFilter`
applied via `.AddEndpointFilter<>()` on the `MapInboundAPI` registration
(alongside `InboundApiEndpointFilter`). This makes the scope explicit on the
one route that needs it and survives future `/api/...` route additions
unchanged.
The endpoint-filter form is the recommended fix — it co-locates the audit-emission
scope with the route definition and matches how InboundAPI-006/008 gating is
already wired.
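
To make the scoping difference concrete, the sketch below models the intended match (`POST /api/{methodName}`, exactly one segment after `/api/`) over plain strings. `InboundRouteSketch` is a hypothetical helper for illustration only; the recommended endpoint filter gets this scoping from the route definition itself and needs no such predicate.

```csharp
using System;

public static class InboundRouteSketch
{
    // True only for POST requests whose path is /api/{single-segment}.
    public static bool IsInboundMethodRoute(string method, string path)
    {
        if (!string.Equals(method, "POST", StringComparison.Ordinal))
            return false;

        const string prefix = "/api/";
        if (path == null || !path.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            return false;

        // Whatever follows /api/ must be one non-empty segment.
        var rest = path.Substring(prefix.Length).TrimEnd('/');
        return rest.Length > 0 && !rest.Contains('/');
    }
}

// IsInboundMethodRoute("POST", "/api/StartPump")    -> true
// IsInboundMethodRoute("GET",  "/api/audit/query")  -> false
// IsInboundMethodRoute("POST", "/api/audit/export") -> false
```

A prefix test such as `StartsWithSegments("/api")` returns true for all three paths, which is how the audit query and export endpoints end up audited as inbound method calls.
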
+335 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.ManagementService` |
| Design doc | `docs/requirements/Component-ManagementService.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 (1 Deferred — see ManagementService-012) |
| Commit reviewed | `1eb6e97` |
| Open findings | 6 (1 Deferred — see ManagementService-012) |
## Summary
@@ -46,6 +46,32 @@ that can leave an instance partially modified after an error (015, Medium), raw
messages from unexpected faults being returned verbatim to HTTP callers (016, Low), and
`QueryDeploymentsCommand` having no test coverage at all (017, Low).
#### Re-review 2026-05-28 (commit `1eb6e97`)
All seventeen prior findings remain correctly closed; ManagementService-012 is still the
only Deferred entry (marker-interface on `ManagementEnvelope.Command` still belongs in the
Commons module). The module has grown substantially since the last review (`+1997 lines`):
the Transport (#24) bundle commands (`ExportBundle`/`PreviewBundle`/`ImportBundle`) have
been added to `ManagementActor`, and a new `AuditEndpoints.cs` (`/api/audit/query` and
`/api/audit/export`) ships alongside the existing `/management` endpoint. This re-review
re-ran the full 10-category checklist and surfaced **six new findings**. The dominant
theme is the same authorization gap that findings 001/002/003/014 closed for the
ManagementActor, now resurfacing in the new surfaces:
**QueryAuditLogCommand has no role gate at all** (018, High) — any authenticated user can
read the configuration audit log via `/management`, even though the parallel
`/api/audit/query` requires `OperationalAuditRoles`. The new `/api/audit/{query,export}`
endpoints build an `AuthenticatedUser` with `PermittedSiteIds` but never enforce site scope
(019, Medium) — although audit roles are not site-scoped by design, the user-supplied
`sourceSiteId` filter is honoured verbatim. `HandleUpdateSmtpConfig` returns the full
SmtpConfiguration entity (including the `Credentials` field, which can carry SMTP passwords
/ OAuth2 client secrets) in the response and audit row (020, Medium). The Transport (#24)
bundle commands have zero test coverage in `ManagementActorTests` (021, Medium) — neither
role gating nor success/error paths. The `Component-ManagementService.md` design doc is
stale on three fronts: it does not mention Transport bundle commands, the `/api/audit/*`
endpoints, or the now-wired `CommandTimeout` option (022, Low). Finally,
`HandleQueryDeployments` issues one `GetInstanceByIdAsync` per unique instance ID when
filtering for a site-scoped user — an N+1 read pattern on the unfiltered branch (023, Low).
## Checklist coverage
| # | Category | Examined | Notes |
@@ -61,6 +87,21 @@ messages from unexpected faults being returned verbatim to HTTP callers (016, Lo
| 9 | Testing coverage | + | Authorization is well covered; site-scope enforcement, the HTTP endpoint, `DebugStreamHub`, and remote-query handlers have no tests. See 013. |
| 10 | Documentation & comments | + | XML docs are accurate where present; `ManagementServiceOptions` and `ResolveRolesCommand` paths are undocumented dead code (010, 011). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | + | `HandleImportBundle` correctly dedupes resolutions per (entity,name); `ParseDocument` still allocates a `JsonDocument.Parse("{}")` on the failure path but the caller's `using` disposes it. No new defects. |
| 2 | Akka.NET conventions | + | PipeTo dispatch from 004 is intact; supervision strategy from 005 is intact; `Sender` correctly captured to local before PipeTo. No new findings. |
| 3 | Concurrency & thread safety | + | Bundle handlers `await` cleanly; `BundleSession` is not cleaned up if `PreviewAsync`/`ApplyAsync` throws, but that is an `IBundleImporter` contract concern outside this module. No new findings. |
| 4 | Error handling & resilience | + | `ManagementCommandException` from 016 is applied consistently across the new bundle handlers (curated `CryptographicException`/`ArgumentException` paths). No new findings. |
| 5 | Security | + | `QueryAuditLogCommand` has no role gate (018, High). New `/api/audit/*` endpoints build `PermittedSiteIds` but never enforce them (019, Medium). `HandleUpdateSmtpConfig` returns + audits `Credentials` verbatim (020, Medium). |
| 6 | Performance & resource management | + | `HandleQueryDeployments` unfiltered-with-scope branch is N+1 on instance lookups (023, Low). Request body up to 200 MB read into a single `string` in `HandleRequest` (acceptable per Transport bundle requirement). |
| 7 | Design-document adherence | + | `Component-ManagementService.md` is stale on Transport bundle commands, `/api/audit/*` endpoints, and the now-wired `CommandTimeout` (022, Low). |
| 8 | Code organization & conventions | + | `AuditEndpoints` duplicates the Basic Auth → LDAP → roles flow from `ManagementEndpoints` (~50 lines). Acknowledged in `AuditEndpoints` XML but worth tracking. No new finding raised. |
| 9 | Testing coverage | + | Transport bundle commands have zero `ManagementActorTests` coverage — neither role gating nor handler logic (021, Medium). |
| 10 | Documentation & comments | + | New `AuditEndpoints` XML doc is high quality. `Component-ManagementService.md` not updated for Transport/Audit endpoints (022 covers). |
## Findings
### ManagementService-001 — Remote-query and debug-snapshot handlers bypass site-scope enforcement
@@ -748,3 +789,294 @@ Resolved 2026-05-17 (commit pending). Added seven `QueryDeployments_*` tests to
Deployment user and an Admin user, in- and out-of-scope
(`_FilteredByOutOfScopeInstance_ReturnsUnauthorized`, `_FilteredByInScopeInstance_ReturnsRecords`,
`_UnfilteredForSiteScopedUser_DropsOutOfScopeRecords`, `_UnfilteredForAdminUser_ReturnsAllRecords`).
### ManagementService-018 — QueryAuditLogCommand has no role gate
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:153-207`, `:336`, `:1302` |
**Description**
`QueryAuditLogCommand` is dispatched at line 336 to `HandleQueryAuditLog`, which calls
`ICentralUiRepository.GetAuditLogEntriesAsync(...)` with no role check, no site-scope
check, and no actor filter. `GetRequiredRole` (lines 153-207) does not list
`QueryAuditLogCommand`, so it falls through to the `_ => null` case — i.e. "read-only
queries — any authenticated user". The parallel `/api/audit/query` endpoint in
`AuditEndpoints.HandleQuery` correctly enforces `AuthorizationPolicies.OperationalAuditRoles`
(`{ "Admin", "Audit", "AuditReadOnly" }`), so a CLI authenticated as a user with only the
`Deployment` role — or no roles at all — is rejected at `/api/audit/query` but can read
the *same* audit log table through `/management` by sending `QueryAuditLogCommand`. The
two surfaces enforce different permissions on the same data; the older
ManagementActor-routed path is the looser one. The audit log records every
script-trust-boundary action and is operationally sensitive; it should not be readable
by a default authenticated user.
This is the same authorization-bypass class as findings 001/002/014 and was missed in
that sweep because `QueryAuditLogCommand` (legacy `Action`/`EntityType` filter) is a
separate command from the new keyset-paged `IAuditLogRepository.QueryAsync` path the
`/api/audit/query` endpoint uses.
**Recommendation**
Add `QueryAuditLogCommand` to `GetRequiredRole`. The natural fit is a new
`"OperationalAudit"`-style role group — but `GetRequiredRole` returns a single string and
the project's existing role gates do too (`Admin`/`Design`/`Deployment`). Two equally
defensible options:
1. Add `QueryAuditLogCommand` to the `Admin`-required group. This is strict, and consistent
with `AuditExportRoles`, which includes `Admin`. The CLI's CLI-017/018 audit work uses
`/api/audit/query`, so `QueryAuditLogCommand` may be effectively orphaned anyway.
2. Extend `GetRequiredRole` to return a role *set* and add an `AuditRoles` group equal to
`AuthorizationPolicies.OperationalAuditRoles`, so the two surfaces converge.
Recommended: option 1 plus a deprecation comment on `QueryAuditLogCommand` pointing at
`/api/audit/query` — the legacy command's filter shape is a subset of the new endpoint's,
so the ManagementActor route is redundant. Add a regression test asserting that a
no-role / `Deployment`-only caller gets `ManagementUnauthorized` for `QueryAuditLogCommand`.
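
A minimal sketch of option 1 and its regression check, assuming `GetRequiredRole` is a switch over the command type that returns a single role string or `null`, as described above. The command records, the `RoleGate` class, and the `IsAuthorized` helper are hypothetical stand-ins for illustration, not the real `ManagementActor` code.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Regression check: a Deployment-only caller is rejected, an Admin is accepted,
// and an ungated read-only command stays open to any authenticated user.
Console.WriteLine(RoleGate.IsAuthorized(new QueryAuditLogCommand(), new[] { "Deployment" })); // False
Console.WriteLine(RoleGate.IsAuthorized(new QueryAuditLogCommand(), new[] { "Admin" }));      // True
Console.WriteLine(RoleGate.IsAuthorized(new ListSitesCommand(), Array.Empty<string>()));      // True

// Hypothetical stand-ins for the real command types.
public sealed record QueryAuditLogCommand;
public sealed record ListSitesCommand;

public static class RoleGate
{
    // Returns the single role a command requires, or null for
    // "read-only query, any authenticated user".
    public static string? GetRequiredRole(object command) => command switch
    {
        // Legacy path; callers should prefer GET /api/audit/query.
        QueryAuditLogCommand => "Admin",
        _ => null,
    };

    // True when the command is ungated or the caller holds the required role.
    public static bool IsAuthorized(object command, IReadOnlyCollection<string> roles)
    {
        var required = GetRequiredRole(command);
        return required is null || roles.Contains(required);
    }
}
```

In the real actor the rejection would surface as `ManagementUnauthorized` rather than a boolean; the sketch only shows that adding one switch arm closes the fall-through.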
### ManagementService-019 — AuditEndpoints builds PermittedSiteIds but never enforces them
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/AuditEndpoints.cs:358-368`, `:397-437` |
**Description**
`AuditEndpoints.AuthenticateAsync` resolves the caller's roles AND `PermittedSiteIds` and
wraps them in an `AuthenticatedUser` (lines 358-366), but the returned `AuthenticatedUser`
is then only used for the `HasAnyRole(...)` role check on lines 114 and 163 — its
`PermittedSiteIds` are never read. `ParseFilter` (line 397) accepts the caller-supplied
`sourceSiteId=...` query string verbatim and passes it straight into the
`IAuditLogRepository.QueryAsync` filter. A user whose `Audit` (or `AuditReadOnly`) role
mapping carries scope rules — e.g. `AuditReadOnly` scoped to "plant-a" — can still ask
for `sourceSiteId=plant-b` and get back rows for plant-b.
Today this gap is partially benign because the design treats `Audit`/`AuditReadOnly` as
non-site-scoped roles (`Component-AuditLog.md` does not list site scoping for the audit
permissions, and the LDAP role mapping UI does not currently surface site scope rules
for those roles). But (a) the `RoleMapper` will silently honour scope rules attached to
any role, including `Audit`, so an operator who *does* configure them gets a UI that
says "scoped" and an endpoint that ignores the scope — a contract violation; (b) the
`Admin` role's `PermittedSiteIds` are always empty (system-wide), so enforcing for the
other roles is cheap. The asymmetry with the `/management` endpoint — which routes every
site-targeted command through `EnforceSiteScope` — is also a maintenance hazard.
**Recommendation**
Decide explicitly whether the audit endpoints honour site scope. Two options:
1. **Honour scope** — in `HandleQuery` / `HandleExport`, after the role check, intersect
the caller-supplied `filter.SourceSiteIds` with `user.PermittedSiteIds`. If the
caller supplied no `sourceSiteId` and `PermittedSiteIds` is non-empty, restrict to
`PermittedSiteIds`. If the intersection is empty, return an empty page (or a 403 if
the caller explicitly asked for an out-of-scope site).
2. **Document the intentional bypass** — drop the `PermittedSiteIds` field from the
`AuthenticatedUser` constructed in `AuthenticateAsync` (or comment it as "ignored —
audit roles are not site-scoped") so the code stops carrying a value it does not
read, and add an XML doc note on the endpoint class that audit roles are always
system-wide by design.
Recommended: option 1, mirroring the `ManagementActor` pattern — same security posture
across both surfaces. Add a regression test that a site-scoped `AuditReadOnly` user
filtering on an out-of-scope site gets a 403 (or an empty page).
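
The intersection rule in option 1 can be sketched as follows. This is an illustration under stated assumptions: site IDs are modelled as strings (matching the `sourceSiteId=plant-b` example), an empty permitted set means system-wide access, and an explicit out-of-scope request is answered with a 403. `AuditScope` and `ScopeResult` are hypothetical names, not types from the module.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

var scoped = new[] { "plant-a" };

Show(AuditScope.Apply(requested: new[] { "plant-b" }, permitted: scoped));                // 403
Show(AuditScope.Apply(requested: Array.Empty<string>(), permitted: scoped));              // [plant-a]
Show(AuditScope.Apply(requested: new[] { "plant-b" }, permitted: Array.Empty<string>())); // [plant-b]

static void Show(ScopeResult r) =>
    Console.WriteLine(r.Forbidden ? "403" : $"[{string.Join(",", r.SiteIds)}]");

// Hypothetical result type: either a 403, or the site filter to pass to the repository.
public sealed record ScopeResult(bool Forbidden, IReadOnlyList<string> SiteIds);

public static class AuditScope
{
    public static ScopeResult Apply(
        IReadOnlyCollection<string> requested,
        IReadOnlyCollection<string> permitted)
    {
        // Assumption: an empty permitted set means the caller is system-wide.
        if (permitted.Count == 0)
            return new ScopeResult(false, requested.ToList());

        // No explicit filter: restrict the query to the caller's own sites.
        if (requested.Count == 0)
            return new ScopeResult(false, permitted.ToList());

        // Any explicitly requested out-of-scope site is rejected outright.
        if (requested.Any(site => !permitted.Contains(site, StringComparer.Ordinal)))
            return new ScopeResult(true, Array.Empty<string>());

        return new ScopeResult(false, requested.ToList());
    }
}
```

Returning an empty page instead of a 403 would only change the third branch; the first two branches are the part that keeps `Admin` behaviour unchanged.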
### ManagementService-020 — UpdateSmtpConfig returns and audits the SMTP Credentials field verbatim
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:1136-1153` |
**Description**
`HandleUpdateSmtpConfig` reads the existing `SmtpConfiguration` entity, applies the
incoming command, and then **(a)** passes the full `config` object as the `afterState`
to `AuditAsync` (line 1151) — meaning the SMTP credential string is persisted in the
audit log — and **(b)** returns the full `config` to the caller (line 1152), which is
serialized via `SerializeResult` and sent back over HTTP. `SmtpConfiguration.Credentials`
carries the SMTP-Auth password (for `Basic`) or the OAuth2 client secret (for
`OAuth2ClientCredentials`); `SmtpConfiguration` has no `[JsonIgnore]` on this field
and `SerializeResult`'s `JsonSerializerOptions` does not exclude it. The pattern
parallels what ConfigurationDatabase-012 fixed for inbound API keys: a credential
artifact must not be echoed back through every read/audit path.
The credential is supplied by the operator in `UpdateSmtpConfigCommand.Credentials`,
so the caller already has it. But (1) anyone with read access to the audit log
(`OperationalAuditRoles`) can now retrieve every SMTP credential change verbatim — a
strictly larger blast radius than `Admin`-only `UpdateSmtpConfig`. (2) The serialized
`config` echo means the credential moves over the wire in the response even though the
caller has no need for it. (3) Any future read path that returns
`SmtpConfiguration` will leak the stored credential too, and `ListSmtpConfigsCommand`
already does so at line 1130.
**Recommendation**
Three changes, in order of priority:
1. In `HandleUpdateSmtpConfig` and `HandleListSmtpConfigs`, project to a credential-free
shape before returning — e.g. `new { config.Id, config.Host, config.Port,
config.AuthType, config.FromAddress, config.TlsMode }`. Match the
`HandleListApiKeys` pattern.
2. In `AuditAsync` for the SMTP path, pass a credential-free `afterState` (the same
anonymous shape). The fact that *something* changed is auditable; the secret value
is not.
3. Tag `SmtpConfiguration.Credentials` with `[JsonIgnore]` in Commons (out-of-scope edit
for this module, but worth a follow-up). Alternatively, configure
`ResultSerializerOptions` with a property name policy that skips a known set of
credential field names — but a per-entity projection is cleaner.
Add regression tests: `UpdateSmtpConfig_DoesNotEchoCredentialsInResponse` and
`UpdateSmtpConfig_DoesNotPersistCredentialsInAuditLog`.
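
A sketch of the credential-free projection from recommendations 1 and 2. The `SmtpConfiguration` class here is a simplified stand-in whose property types are assumptions (the real entity may use enums for `AuthType` and `TlsMode`), and `SmtpConfigView` / `SmtpProjection` are hypothetical names.

```csharp
using System;
using System.Text.Json;

var config = new SmtpConfiguration
{
    Id = 1, Host = "smtp.example.com", Port = 587, AuthType = "Basic",
    FromAddress = "alerts@example.com", TlsMode = "StartTls", Credentials = "s3cret",
};

// The same safe view is used for both the HTTP response and the audit afterState.
var json = JsonSerializer.Serialize(SmtpProjection.ToSafeView(config));
Console.WriteLine(json);
Console.WriteLine(json.Contains("s3cret")); // False

// Simplified stand-in for the real entity; property types are assumptions.
public sealed class SmtpConfiguration
{
    public int Id { get; set; }
    public string Host { get; set; } = "";
    public int Port { get; set; }
    public string AuthType { get; set; } = "";
    public string FromAddress { get; set; } = "";
    public string TlsMode { get; set; } = "";
    public string Credentials { get; set; } = "";
}

// Hypothetical response/audit shape: every field except Credentials.
public sealed record SmtpConfigView(
    int Id, string Host, int Port, string AuthType, string FromAddress, string TlsMode);

public static class SmtpProjection
{
    public static SmtpConfigView ToSafeView(SmtpConfiguration c) =>
        new(c.Id, c.Host, c.Port, c.AuthType, c.FromAddress, c.TlsMode);
}
```

A named record rather than an anonymous type makes the shape reusable across `HandleUpdateSmtpConfig`, `HandleListSmtpConfigs`, and the `AuditAsync` call, so the three paths cannot drift apart.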
### ManagementService-021 — Transport bundle handlers have zero test coverage
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.ManagementService.Tests/ManagementActorTests.cs:1`; `src/ScadaLink.ManagementService/ManagementActor.cs:1717-1897` |
**Description**
The three Transport (#24) bundle handlers — `HandleExportBundle`, `HandlePreviewBundle`,
`HandleImportBundle` (~180 lines of handler logic at the bottom of `ManagementActor.cs`)
— have **no tests** in `ManagementActorTests`. Specifically untested:
1. **Role gating.** `ExportBundleCommand` requires `Design`; `PreviewBundleCommand` and
`ImportBundleCommand` require `Admin`. No test asserts that the wrong role gets
`ManagementUnauthorized`. CLI-017 / CLI-018 just landed around bundle plumbing — a
future refactor that moves these commands between role groups in `GetRequiredRole`
would silently regress the gate.
2. **Name resolution in `HandleExportBundle`.** The inner `ResolveIds<T>` helper raises
`ManagementCommandException` for unknown names. The "all entity types" branch
(`cmd.All == true`) and the "missing name" branch are both untested.
3. **`HandleImportBundle` blocker rejection.** The handler aborts before `ApplyAsync`
when any `ConflictKind.Blocker` row is present; the produced error message is
curated and surfaced to the caller, but no test asserts the abort path or that the
importer's `ApplyAsync` was not called.
4. **Resolution dedupe.** `HandleImportBundle` dedupes `(EntityType, Name)` keys
last-write-wins — the dedupe is critical (CLI-014 was about it on the CLI side) but
has no actor-side regression test.
5. **`DecodeBundle` failure modes** (empty/non-base64 input) — both branches return
curated `ManagementCommandException` but neither is exercised.
6. **`ParseConflictPolicy`** for `"skip"`, `"overwrite"`, `"rename"`, and the
invalid-value branch: all untested.
Given the size and reach of the bundle path (cross-cutting central configuration
import), this gap is materially larger than usual for new handler code.
**Recommendation**
Add an `ImportBundleHandlerTests` suite covering:
- role gating for all three commands (`Design`/`Admin` mismatch -> `ManagementUnauthorized`),
- `ExportBundleCommand(All: true)` happy-path,
- `ExportBundleCommand` with an unknown name -> `ManagementError`,
- `ImportBundleCommand` with a `Blocker` row -> `ManagementError` and `ApplyAsync` not called,
- `ImportBundleCommand` with duplicate preview items -> dedupe to one resolution per (type, name),
- `DecodeBundle` empty/invalid base64,
- `ParseConflictPolicy` all four branches.
Use NSubstitute for `IBundleImporter` / `IBundleExporter` (no need for a real bundle in
the actor tests; the bundle round-trip belongs in `Transport` tests).
### ManagementService-022 — Design doc is stale on Transport bundle commands, /api/audit/* endpoints, and CommandTimeout
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-ManagementService.md:77-175`, `:205-209` |
**Description**
`Component-ManagementService.md` does not mention three pieces of shipped functionality:
1. **Transport (#24) bundle commands.** `ExportBundleCommand`, `PreviewBundleCommand`,
and `ImportBundleCommand` are dispatched at `ManagementActor.cs:350-352` and
role-gated in `GetRequiredRole` (Design for Export; Admin for Preview/Import). The
design doc's "Message Groups" section enumerates Templates, Instances, Sites, Data
Connections, Deployments, External Systems, Notifications, Security, Audit Log,
Shared Scripts, Database Connections, Inbound API Methods, Health, and Remote
Queries — but has no "Transport" / "Bundles" group. The CLI now offers `bundle
export`/`preview`/`import` (per the recent CLI-017/018 work) and points
at these commands.
2. **`/api/audit/*` endpoints.** The doc's "HTTP Management API" section (line 52)
describes only `POST /management`. `AuditEndpoints.MapAuditAPI()` adds
`GET /api/audit/query` and `GET /api/audit/export` with their own auth-and-role
path mirroring `ManagementEndpoints` (intentionally — see the `AuditEndpoints` XML
docs), but the design doc gives no signal that the module exposes more than one
route group, no per-endpoint role mapping table, and no mention that the response
shape differs (keyset cursor vs. opaque page).
3. **`CommandTimeout`.** Line 209 still says "Reserved for future configuration —
e.g., command timeout overrides", but ManagementService-010 wired the option through
`ResolveAskTimeout`. The doc is stale.
**Recommendation**
Update `Component-ManagementService.md`:
- Add a "Transport" entry to "Message Groups" listing `ExportBundle`,
`PreviewBundle`, `ImportBundle` with their per-command roles. Cross-reference
`Component-Transport.md`.
- Add an "Audit Log HTTP API" subsection under "HTTP Management API" describing
`GET /api/audit/query` (keyset cursor, `OperationalAuditRoles`) and
`GET /api/audit/export` (csv/jsonl streaming, `AuditExportRoles`, parquet 501).
Note the deliberate divergence in the source-site query-string key
(`sourceSiteId` vs CentralUI's `site`).
- In the "Configuration" table, replace "Reserved for future configuration" with the
actual `CommandTimeout` semantics: "Max time the HTTP endpoint will Ask the
ManagementActor before returning HTTP 504; falls back to 30 s when unset or
non-positive."
### ManagementService-023 — HandleQueryDeployments unfiltered branch is N+1 on instance lookup
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:1276-1295` |
**Description**
The site-scoped unfiltered branch of `HandleQueryDeployments` (added under
ManagementService-014) reads every `DeploymentRecord` via `GetAllDeploymentRecordsAsync`,
then for each *unique* `record.InstanceId` calls
`ITemplateEngineRepository.GetInstanceByIdAsync` to resolve the instance's
`SiteId`. The handler caches results in `instanceSiteCache` so each instance is loaded
at most once per call, but for a fleet with N distinct instances having deployment
history, the handler still issues N round-trips to the configuration database to
authorize a single query. With a large deployment history the cumulative DB hit can be
material; it also runs every time a site-scoped user opens the deployments page.
This is acceptable in steady state today (sites tend to have small fleets and few
deployments) but is a textbook N+1 read pattern, and on a busy day for a site-scoped
operator the cost will dominate the request. Admin and system-wide Deployment users
correctly skip the loop (they hit only `GetAllDeploymentRecordsAsync`).
**Recommendation**
Add a batch-resolve method to `ITemplateEngineRepository` — e.g.
`Task<IDictionary<int, int>> GetInstanceSiteIdsAsync(IEnumerable<int> instanceIds)`
backed by a single EF query
(`Instances.Where(i => instanceIds.Contains(i.Id)).Select(i => new { i.Id, i.SiteId })`).
`HandleQueryDeployments` would then issue exactly two queries on the unfiltered branch
(records + sites) regardless of fleet size. The change is additive to
`ITemplateEngineRepository` and out-of-module for the actual implementation, but the
handler change is local; a quick interim alternative is to project deployment records
to include the instance's `SiteId` at the repo level, which removes the second query
entirely.
Defer until a noticeable hot path emerges, but track it: this is the only N+1 in
`ManagementActor` once 002 / 014 are folded in.
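
The batch-resolve idea can be illustrated with an in-memory stand-in. This is a sketch only: the real method would be asynchronous and EF-backed, and `InMemoryInstanceRepository`, its `QueryCount` counter, and the integer site IDs are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// instanceId -> siteId for three instances across two sites.
var repo = new InMemoryInstanceRepository(new Dictionary<int, int> { [1] = 10, [2] = 10, [3] = 20 });

var recordInstanceIds = new[] { 1, 2, 3, 1, 2 }; // one entry per deployment record
var permittedSiteIds = new HashSet<int> { 10 };  // the site-scoped caller's sites

// One batched lookup for all distinct instances instead of one lookup per instance.
var siteByInstance = repo.GetInstanceSiteIds(recordInstanceIds.Distinct());

var visible = recordInstanceIds
    .Where(id => siteByInstance.TryGetValue(id, out var siteId) && permittedSiteIds.Contains(siteId))
    .ToList();

Console.WriteLine($"visible={visible.Count}, lookups={repo.QueryCount}"); // visible=4, lookups=1

// Hypothetical in-memory repository; QueryCount stands in for database round-trips.
public sealed class InMemoryInstanceRepository
{
    private readonly IReadOnlyDictionary<int, int> _siteByInstance;
    public int QueryCount { get; private set; }

    public InMemoryInstanceRepository(IReadOnlyDictionary<int, int> siteByInstance) =>
        _siteByInstance = siteByInstance;

    // Resolves every requested instance in a single "query".
    public IDictionary<int, int> GetInstanceSiteIds(IEnumerable<int> instanceIds)
    {
        QueryCount++;
        var wanted = instanceIds.ToHashSet();
        return _siteByInstance
            .Where(pair => wanted.Contains(pair.Key))
            .ToDictionary(pair => pair.Key, pair => pair.Value);
    }
}
```

The lookup count stays at one however many distinct instances appear in the deployment history, which is the property the recommendation is after.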
+488
@@ -0,0 +1,488 @@
# Code Review — NotificationOutbox
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.NotificationOutbox` |
| Design doc | `docs/requirements/Component-NotificationOutbox.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 10 |
## Summary
NotificationOutbox is a small, focused module — one ~985-line actor
(`NotificationOutboxActor`), a strongly-typed options class, an
`INotificationDeliveryAdapter` seam, and the single concrete `EmailNotificationDeliveryAdapter`.
The Akka.NET conventions are textbook: every async path is wrapped with `PipeTo`, the
dispatcher uses an in-flight guard cleared on `DispatchComplete`, the sender is captured
before crossing the await, and the actor isolates per-notification failures so one bad row
never aborts a batch. Test coverage is broad — ingest, dispatch, query, retry/discard,
purge, KPI, and the new audit-emission paths (B2 attempts + B3 terminals) all have
dedicated test files — and the audit-write-failure-never-aborts-delivery contract is
explicitly asserted.
The dominant theme is **trust-boundary leakage between Outbox, NotificationService, and
ConfigurationDatabase**. The outbox inherits two known defects from its sibling modules
that are reachable through `EmailNotificationDeliveryAdapter`: the OAuth2 SASL empty-user
bug (NS-021) ships every M365 send with `user=""`, and the
`InsertIfNotExistsAsync` check-then-act race (CD-015) lives on the outbox's ack-after-persist
hot path. Neither is a defect of code under `src/ScadaLink.NotificationOutbox/`, but both
are surfaced here because production dispatch and ingest go through these exact lines.
A secondary theme is **dispatcher-fire-and-forget audit writes** (`_ = _auditWriter.WriteAsync(...)`)
that can race the per-sweep scope dispose under the wrong DI graph, and a few smaller
drifts: the dispatcher passes `CancellationToken.None` to adapter delivery (no graceful
shutdown for in-flight SMTP sends), the `StuckAgeThreshold` XML-doc describes a behavior
the design explicitly forbids (display-only, never reclaim), the `MaxRetries` boundary check
uses `>=` against a config value that can be zero (immediate park on first transient
failure), and several `NotificationOutboxOptions` fields are documented in code but absent
from `Component-NotificationOutbox.md`. No Critical findings; two High, six Medium, two Low.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `MaxRetries` zero/negative immediately parks (NotificationOutbox-002); `StuckAgeThreshold` XML doc contradicts design (NotificationOutbox-009); `Guid.TryParse` accepts compact `"N"` ids emitted by sites. |
| 2 | Akka.NET conventions | Yes | `PipeTo` / sender-capture / in-flight guard pattern is correctly applied throughout. Fire-and-forget `_ = _auditWriter.WriteAsync(...)` raises a scope-lifetime concern (NotificationOutbox-004). |
| 3 | Concurrency & thread safety | Yes | Actor state mutated only on actor thread. Inherited CD-015 race on `InsertIfNotExistsAsync` (NotificationOutbox-005) is the only race; the dispatcher's in-flight guard correctly serializes sweeps. |
| 4 | Error handling & resilience | Yes | Outer try/catch on `RunDispatchPass`/`RunPurgePass` keeps the in-flight guard sane; per-notification isolation is correct. CT not threaded into delivery (NotificationOutbox-003). |
| 5 | Security | Yes | Inherited OAuth2 empty-user (NotificationOutbox-001) reachable through the adapter. No new credential or trust-boundary issues introduced by the outbox itself. |
| 6 | Performance & resource management | Yes | Dispatch interval & batch size are simple polling; `ResolveAdapters` rebuilds the lookup per sweep (NotificationOutbox-006). No leaks. |
| 7 | Design-document adherence | Yes | `NotificationOutboxOptions.DispatchBatchSize`, `DeliveredKpiWindow`, `PurgeInterval` are not in the design doc (NotificationOutbox-007). |
| 8 | Code organization & conventions | Yes | Options class lives in the component project (correct); DI extension lives in the component (correct); adapter is `scoped`, actor singleton — interaction correctly documented in `ServiceCollectionExtensions`. No issues. |
| 9 | Testing coverage | Yes | Solid actor-behaviour coverage. Missing tests for `FallbackMaxRetries` / empty-SMTP-config dispatch path (NotificationOutbox-008). |
| 10 | Documentation & comments | Yes | XML on `StuckAgeThreshold` misleading (NotificationOutbox-009); XML on dispatcher's audit `_ =` fire-and-forget says "writer never throws" but `EmitAttemptAudit` still wraps in try/catch — comment contradicts itself (NotificationOutbox-010). |
## Findings
### NotificationOutbox-001 — `EmailNotificationDeliveryAdapter` inherits the OAuth2 empty-user SASL bug (NS-021) on the M365 send path
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/Delivery/EmailNotificationDeliveryAdapter.cs:185-191` (calls `smtp.AuthenticateAsync("oauth2", token)`); root cause in `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:76-79` |
**Description**
`EmailNotificationDeliveryAdapter.SendAsync` resolves an OAuth2 access token via
`_tokenService.GetTokenAsync(...)` and then calls
`await smtp.AuthenticateAsync(config.AuthType, credentials, cancellationToken);`
on `ISmtpClientWrapper`. The production implementation (`MailKitSmtpClientWrapper`)
constructs `new SaslMechanismOAuth2("", credentials)` — an empty user-name field —
which Microsoft 365 SMTP rejects with `535 5.7.3 Authentication unsuccessful`. The
sibling NotificationService finding NS-021 documents this in full; the outbox is the
*new home* for delivery on central, so every OAuth2 send that the outbox dispatches
hits this code path. The defect is therefore reachable here even though the offending
constructor lives in the NotificationService project, and the central-only redesign
means this is now the only delivery path in production. Existing outbox tests do not
catch it because they all substitute `ISmtpClientWrapper` and assert only that
`AuthenticateAsync` is invoked with `("oauth2", "<token>")` — the real
`SaslMechanismOAuth2` is never instantiated. `OAuth2TokenService.GetTokenAsync` is
explicitly wired to `login.microsoftonline.com/.../oauth2/v2.0/token` with
`scope=https://outlook.office365.com/.default`, so M365 SMTP is the intended target —
and is precisely the relay that requires the user field to be populated.
**Recommendation**
Track the NS-021 fix and add an outbox-side regression test once the wrapper signature
is widened. Concretely, when `ISmtpClientWrapper.AuthenticateAsync` is extended to
accept the sender mailbox (or a dedicated `oauth2UserName` parameter), update
`EmailNotificationDeliveryAdapter.SendAsync` to pass `config.FromAddress`, and add a
test in `EmailNotificationDeliveryAdapterTests` that asserts the OAuth2 path forwards
the sender identity. Until then, surface the same finding here so the outbox is not
treated as resolved when NS-021 fires.
**Resolution**
_Unresolved._
### NotificationOutbox-002 — Dispatcher parks on first transient failure when `SmtpConfiguration.MaxRetries == 0`
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:348-360` |
**Description**
The transient-failure branch increments `RetryCount` then evaluates
`if (notification.RetryCount >= maxRetries) notification.Status = NotificationStatus.Parked;`.
`maxRetries` is read from the central `SmtpConfiguration.MaxRetries` column, which has
no enforced lower bound and is not validated by the outbox. A row whose `MaxRetries`
is `0` (or any negative value) immediately satisfies `1 >= 0` on the very first
transient failure, so the notification is parked without a single retry — directly
contradicting the design doc's "fixed retry interval, reuse central SMTP
max-retry-count" intent, where a configured value of zero would naturally read as
"never retry, fail straight to permanent". `SetupSmtpRetryPolicy` in the dispatch
tests always supplies a positive value, so this path is not exercised.
Additionally, an operator who clears the SMTP config row drops into the
`FallbackMaxRetries = 10` / `FallbackRetryDelay = 1 min` path
(`ResolveRetryPolicyAsync` line 251); that path is also untested — see
NotificationOutbox-008. The operational result is that a single bad SMTP config
value silently removes the outbox's retry guarantee for transient failures.
**Recommendation**
Validate `MaxRetries` at the read point: treat a non-positive value as either the
configured fallback (current `FallbackMaxRetries = 10`) or — preferred — surface the
mis-configuration to the operator via a health metric and refuse to dispatch until
the row is corrected. Either way, add a test that asserts the dispatcher's behaviour
for `MaxRetries == 0` and `MaxRetries < 0`.
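
A minimal sketch of the first option (fall back when the configured value is non-positive), assuming the park check is the `>=` comparison quoted above. `RetryPolicy` and its method names are hypothetical; only `FallbackMaxRetries = 10` comes from the module.

```csharp
using System;

foreach (var configured in new[] { -1, 0, 3 })
{
    var effective = RetryPolicy.EffectiveMaxRetries(configured);
    // retryCount is 1 immediately after the first transient failure.
    Console.WriteLine(
        $"configured={configured} effective={effective} " +
        $"parksOnFirstFailure={RetryPolicy.ShouldPark(1, effective)}");
}

// Without validation, a configured 0 parks on the very first failure.
Console.WriteLine(RetryPolicy.ShouldPark(1, 0)); // True

public static class RetryPolicy
{
    public const int FallbackMaxRetries = 10;

    // Treats zero and negative values as mis-configuration and uses the fallback.
    public static int EffectiveMaxRetries(int configured) =>
        configured > 0 ? configured : FallbackMaxRetries;

    // Same boundary check as the dispatcher: park once the count reaches the limit.
    public static bool ShouldPark(int retryCount, int maxRetries) =>
        retryCount >= maxRetries;
}
```

The preferred option (raise a health metric and refuse to dispatch) would replace the fallback branch with a reported error; the boundary check itself stays the same.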
**Resolution**
_Unresolved._
### NotificationOutbox-003 — Dispatcher does not propagate a `CancellationToken` into delivery; in-flight SMTP sends cannot be cancelled on shutdown
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:334`, `src/ScadaLink.NotificationOutbox/Delivery/INotificationDeliveryAdapter.cs:22` |
**Description**
`DeliverOneAsync` calls `var outcome = await adapter.DeliverAsync(notification);`.
The second `CancellationToken` parameter on `INotificationDeliveryAdapter.DeliverAsync`
is left at its `default(CancellationToken)` value, meaning `CancellationToken.None`.
`EmailNotificationDeliveryAdapter.SendAsync` then threads that `None` token into
`smtp.ConnectAsync`, `smtp.AuthenticateAsync`, and `smtp.SendAsync`. The consequence
is that during a coordinated cluster shutdown (singleton handover, drain) any
in-flight SMTP send is uncancellable and the dispatcher's sweep must wait for the
underlying socket/SMTP timeout (`SmtpConfiguration.ConnectionTimeoutSeconds`) before
the sweep's task completes and `DispatchComplete` lowers the in-flight guard. With
the default connect-timeout values this is on the order of tens of seconds per
notification in the in-progress batch, blocking `CoordinatedShutdown`.
The adapter implementations clearly *expect* a token — the contract type is
`CancellationToken cancellationToken = default` everywhere — so this is a wiring
gap, not a missing interface.
**Recommendation**
Wire a per-sweep `CancellationTokenSource` linked to the actor's lifecycle (cancel
in `PostStop`) and pass its token into `DeliverAsync`. If a more explicit per-attempt
budget is wanted, a linked source per sweep can also bound individual deliveries by the
configured connection timeout. Add a test that cancels mid-`DeliverAsync` and
asserts the dispatcher completes promptly and the row is left non-terminal
(`Pending`/`Retrying` unchanged) for the next sweep.
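
The wiring can be sketched outside Akka.NET with plain tasks. This is an illustration, not the actor code: `FakeAdapter` is a hypothetical stand-in for a slow SMTP send, and the direct `lifecycle.Cancel()` call simulates what `PostStop` would do.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Lifecycle source: in the real actor this would be cancelled from PostStop.
using var lifecycle = new CancellationTokenSource();
// Per-sweep source linked to the lifecycle, so shutdown cancels in-flight deliveries.
using var sweep = CancellationTokenSource.CreateLinkedTokenSource(lifecycle.Token);

var delivery = FakeAdapter.DeliverAsync(sweep.Token);
lifecycle.Cancel(); // simulates PostStop during a coordinated shutdown

try
{
    await delivery;
    Console.WriteLine("delivered");
}
catch (OperationCanceledException)
{
    // The row is left non-terminal so the next sweep picks it up again.
    Console.WriteLine("cancelled; row stays Pending/Retrying");
}

// Hypothetical adapter: a 30-second "send" that honours the token.
public static class FakeAdapter
{
    public static async Task DeliverAsync(CancellationToken cancellationToken = default)
    {
        await Task.Delay(TimeSpan.FromSeconds(30), cancellationToken);
    }
}
```

With `CancellationToken.None` in place of `sweep.Token`, the same program would wait the full 30 seconds, which mirrors the shutdown stall described above.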
**Resolution**
_Unresolved._
### NotificationOutbox-004 — `EmitAttemptAudit`/`EmitTerminalAudit` fire-and-forget pattern can outlive the per-sweep DI scope
| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:425-435`, `463-485` |
**Description**
Both emission helpers issue `_ = _auditWriter.WriteAsync(evt);` — discarding the
returned task. `CentralAuditWriter.WriteAsync` opens its own `await using var scope =
_services.CreateAsyncScope();` and resolves a scoped `IAuditLogRepository` (verified
at `src/ScadaLink.AuditLog/Central/CentralAuditWriter.cs:118-121`), so the writer is
defensively scope-independent. However the dispatcher already holds a per-sweep
`using var scope = _serviceProvider.CreateScope();` and the per-notification
`UpdateAsync` runs in that scope. The fire-and-forget pattern means:
1. The dispatcher's outer scope can be disposed (sweep done, `DispatchComplete`
piped) while the audit `WriteAsync` task is still running on a *different*
scope it owns — works today only because the writer creates its own scope.
2. A faulted unobserved task is silently lost: if `CentralAuditWriter.WriteAsync`
itself were ever made `async void` or refactored to not internally try/catch,
the dispatcher would never see the fault and the audit row would vanish without
the `_logger.LogWarning` reaching the operator.
3. The XML-doc above `EmitAttemptAudit` says "PipeTo is not used because the writer
never throws" — but the surrounding `try { _ = _auditWriter.WriteAsync(evt); }
catch (Exception ex)` will only catch a synchronous throw from the *task
construction*, not the awaited body of `WriteAsync`. The comment understates the
risk: the catch is structurally unreachable for the documented failure mode.
The system actually wants the *invariant* "audit write never affects delivery"
(verified by the `AuditWriter_Throws_…StillSucceeds` tests). That invariant is
better expressed by `await`-ing the writer inside the actor's outer try/catch (the
dispatcher already swallows per-notification exceptions) than by a discard-task,
which couples the lifetime of the dispatcher's scope to that of the audit task
through whatever scope graph the writer happens to use today.
**Recommendation**
Either `await _auditWriter.WriteAsync(evt)` inside the existing `try`/`catch` (the
preferred fix: it preserves the invariant, plays well with the per-sweep scope, and
makes the catch block actually reachable) or, if a true fire-and-forget remains
desired, capture the returned task and attach a continuation that calls
`_logger.LogWarning` when the task faults, to keep diagnostics intact. Either way, fix the
"writer never throws" XML-doc to match the implementation.
**Resolution**
_Unresolved._
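
The difference between the discarded task and the awaited call in the recommendation above can be shown in isolation. A minimal sketch: `FaultingWriter` is a hypothetical writer that always faults, standing in for an audit store outage.

```csharp
using System;
using System.Threading.Tasks;

// Discarded task: the fault lives inside the returned Task, so this catch never runs.
try
{
    _ = FaultingWriter.WriteAsync();
    Console.WriteLine("discard: catch not reached");
}
catch (Exception ex)
{
    Console.WriteLine($"discard: caught {ex.Message}");
}

// Awaited task: the fault is rethrown at the await and the catch can log it.
try
{
    await FaultingWriter.WriteAsync();
}
catch (Exception ex)
{
    Console.WriteLine($"await: caught {ex.Message}");
}

// Hypothetical writer that always fails, standing in for an audit store outage.
public static class FaultingWriter
{
    public static async Task WriteAsync()
    {
        await Task.Yield();
        throw new InvalidOperationException("audit store unavailable");
    }
}
```

Because `WriteAsync` is an `async` method, even an exception thrown before its first `await` is captured in the returned task, so the discard form gives the surrounding `catch` nothing to observe.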
### NotificationOutbox-005 — Ingest persistence inherits the CD-015 check-then-act race; under contention the second writer throws and the site retries
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:127-132` (caller); root cause in `src/ScadaLink.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs:33-45` |
**Description**
`HandleSubmit` (via `PersistAsync`) calls `repository.InsertIfNotExistsAsync(notification)`
on `INotificationOutboxRepository`. The current implementation
(`src/ScadaLink.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs`)
does a check-then-act with no duplicate-key catch — documented as CD-015 (High,
Open). The Notification Outbox's documented contract is "at-least-once handoff with
ack-after-persist plus insert-if-not-exists on `NotificationId`" (CLAUDE.md,
Component-NotificationOutbox.md §Ingest & Idempotency), and the duplicate-insert
race is the **expected contention pattern** — the site retries the same submission
after a lost ack. As written, the loser surfaces a `SqlException` (2627 PK
violation) wrapped in `DbUpdateException`, which propagates through `PipeTo`'s failure
projection as a `NotificationSubmitAck { Accepted: false, Error: "... PRIMARY KEY ..." }`.
The site treats that ack as a forwarding failure and forwards the message **again**,
re-entering the same race. If the contending pair keeps racing, this can livelock.
The actor side is fine — `PipeTo`'s success/failure projection correctly forwards
the exception message. The repository side needs the standard `2601/2627 → no-op`
pattern that AuditLog and SiteCall already use. This finding tracks the outbox-side
visibility of the CD-015 defect so a re-review of NotificationOutbox surfaces it
even if the reader has not yet read the ConfigurationDatabase findings.
**Recommendation**
Track CD-015 to resolution. As a defense-in-depth complement here, consider
treating a duplicate-key `DbUpdateException` in the actor's ingest failure
projection as `Accepted: true` so a lost ack between persisted-by-the-first-writer
and ack-back does not produce a permanent re-forward loop — but the cleanest fix
remains the CD-015 raw-SQL `IF NOT EXISTS … INSERT` with `2601/2627` catch in
`NotificationOutboxRepository`.
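
The defense-in-depth projection described above can be sketched as follows. This is a minimal illustration, not the actor's code: `SubmitAckSketch` and `IngestAckProjection` are hypothetical names, and the sketch assumes the caller has already pulled the SQL Server error number out of the failure (in the real stack that would presumably be `SqlException.Number` on the inner exception of the `DbUpdateException`).

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the ack message; the finding only shows the
// Accepted and Error members of NotificationSubmitAck.
public sealed record SubmitAckSketch(string NotificationId, bool Accepted, string? Error);

public static class IngestAckProjection
{
    // SQL Server reports 2601 for a duplicate row in a unique index and
    // 2627 for a PRIMARY KEY / UNIQUE constraint violation.
    public static bool IsDuplicateKey(int sqlErrorNumber) =>
        sqlErrorNumber is 2601 or 2627;

    // Assumption: a duplicate key means another writer already persisted the
    // same NotificationId, so the submission is acknowledged as accepted.
    // Every other failure is reported back unchanged.
    public static SubmitAckSketch FromFailure(
        string notificationId, int? sqlErrorNumber, string message) =>
        sqlErrorNumber is int number && IsDuplicateKey(number)
            ? new SubmitAckSketch(notificationId, Accepted: true, Error: null)
            : new SubmitAckSketch(notificationId, Accepted: false, Error: message);
}
```

With this shape, `IngestAckProjection.FromFailure("n-1", 2627, "PRIMARY KEY violation")` yields an accepted ack, while a non-duplicate failure such as a deadlock still produces `Accepted: false` and keeps the site's retry behaviour. The repository-level fix tracked under CD-015 remains the primary remedy.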
**Resolution**
_Unresolved._
### NotificationOutbox-006 — `ResolveAdapters` rebuilds the `NotificationType → adapter` dictionary on every dispatch sweep
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:267-277` |
**Description**
Every dispatch sweep calls `ResolveAdapters(scope.ServiceProvider)` which enumerates
`scopedServices.GetServices<INotificationDeliveryAdapter>()` and builds a fresh
`Dictionary<NotificationType, INotificationDeliveryAdapter>`. Adapter registration
is decided at startup (`AddNotificationOutbox` registers
`EmailNotificationDeliveryAdapter`); the registration set does not change at
runtime. With a default `DispatchInterval = 10s` and only ever one entry today, the
allocation overhead is trivial — but the comment "the last adapter registered for a
given type wins, mirroring DI's last-wins resolution semantics" elevates this to a
behaviour contract, and the per-sweep dictionary construction obscures the lookup's
identity from one sweep to the next, making any future stateful adapter (rate
limiter, circuit breaker) silently lose its state.
Caching the adapter instances themselves is not an option, because
`EmailNotificationDeliveryAdapter` is *scoped*: it holds a scoped
`INotificationRepository`. A trivial cache-the-types-but-resolve-the-instance fix
is possible: cache the set of declared `NotificationType` values and, per sweep,
look up each adapter from `GetServices<INotificationDeliveryAdapter>()` filtered by `Type`.
**Recommendation**
Document the per-sweep contract explicitly ("each sweep gets a fresh adapter
instance per the scoped DI contract — adapters must not carry state across
sweeps") in the actor XML, or — preferred — cache only the *types* at startup
(`PreStart`) and resolve the scoped instance per sweep, so future adapters with
stateful intent (timeouts, circuit breakers) cannot accidentally lose state.
**Resolution**
_Unresolved._
### NotificationOutbox-007 — `NotificationOutboxOptions.DispatchBatchSize`, `DeliveredKpiWindow`, and `PurgeInterval` are not in the design document
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs:13`, `:22`, `:25`; `docs/requirements/Component-NotificationOutbox.md:152-160` |
**Description**
`Component-NotificationOutbox.md` §Configuration enumerates three options: dispatch
interval, stuck-age threshold, and terminal-row retention window. The implemented
`NotificationOutboxOptions` adds three additional fields:
- `DispatchBatchSize` (default `100`) — caps the per-sweep claim size, but is invisible
to anyone reading only the spec.
- `PurgeInterval` (default `1 day`) — the design doc says "daily purge" as if the
cadence is fixed; in code it is configurable.
- `DeliveredKpiWindow` (default `1 min`) — the KPI section says "Delivered (last
interval)" without saying how long "last interval" is or that it is configurable.
The design doc also asserts "Delivery max-retry-count and retry interval are not
part of `NotificationOutboxOptions` — they are reused from the central SMTP
configuration" (line 160), and the implementation honours this. The three additions
above, however, appear nowhere in the design doc. The KPI dashboard cadence and the
dispatch batch size are both operationally important values that an operator or
engineer will hunt for; their absence from the spec is design drift.
**Recommendation**
Add the three fields to `Component-NotificationOutbox.md §Configuration` with their
defaults, or remove them from the implementation if they were meant to be fixed
constants. Cross-link `DeliveredKpiWindow` from the §Monitoring "Delivered (last
interval)" KPI bullet so a reader sees what controls the bucket length.
**Resolution**
_Unresolved._
### NotificationOutbox-008 — `FallbackMaxRetries` / `FallbackRetryDelay` path is unreachable in production AND untested
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:29-31`, `:251-259`; tests in `tests/ScadaLink.NotificationOutbox.Tests/NotificationOutboxActorDispatchTests.cs` |
**Description**
`ResolveRetryPolicyAsync` falls back to `FallbackMaxRetries = 10` and
`FallbackRetryDelay = 1 min` when `notificationRepository.GetAllSmtpConfigurationsAsync()`
returns an empty list (no SMTP configuration row). The comment correctly observes
that delivery itself will then return `Permanent("No SMTP configuration available")`
from `EmailNotificationDeliveryAdapter.cs:78-81`, so the fallback retry policy
never actually retries anything — the row is permanently parked on first attempt
regardless of retry count or delay.
This produces three concerns. (1) The fallback is essentially dead code — the retry
policy values are never consulted in practice because delivery always fails
permanently before the retry branch is reached. (2) The fallback can be reached
*after* a previously-deployed SMTP config is deleted, which is precisely the
moment an operator needs accurate audit trails; the row will say `Parked` with
`LastError = "No SMTP configuration available"` but the audit signal "retry policy
fell back to defaults" is invisible. (3) Tests never exercise either the fallback
path or the empty-SMTP-config dispatch path: `SetupSmtpRetryPolicy` always supplies
a config in every dispatch test.
**Recommendation**
Add a regression test that runs a dispatch sweep with no SMTP config row and
asserts the row is parked with the documented error. Optionally remove the fallback
constants if parking-with-no-config is the *intended* operational signal; document
the choice in the actor XML so a maintainer does not "fix" the unreachable code.
**Resolution**
_Unresolved._
### NotificationOutbox-009 — `StuckAgeThreshold` XML-doc says "in-progress notification is re-claimed" — contradicts the design's display-only stuck detection
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs:15-16` |
**Description**
```csharp
/// <summary>Age past which an in-progress notification is considered stuck and re-claimed.</summary>
public TimeSpan StuckAgeThreshold { get; set; } = TimeSpan.FromMinutes(10);
```
The implementation never reclaims anything based on `StuckAgeThreshold`. It is used
only as a cutoff for the stuck-count KPI (`StuckCutoff`/`IsStuck` in
`NotificationOutboxActor.cs:932-942`) and as a `StuckCutoff` filter on paginated
queries. The design doc is explicit: "A notification is **stuck** if it is `Pending`
or `Retrying` and older than a configurable age threshold (default 10 minutes).
Detection is **display-only** — a count KPI and a row badge. There is no automated
escalation or alerting" (`Component-NotificationOutbox.md:143-145`). A maintainer
reading the XML and expecting "re-claim" behaviour will be surprised twice — once
when no re-claim happens, and once when they go looking for the re-claim code and
find none.
**Recommendation**
Rewrite the XML to match the design: "Age past which a still-`Pending`/`Retrying`
notification is counted as stuck on the KPI tile and the per-row badge.
Display-only — does not affect dispatch."
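
The display-only rule quoted from the design doc can be sketched as a pure predicate. This is an illustration under assumptions: the `NotificationStatusSketch` enum, the `createdAt` parameter, and the method signatures are hypothetical, and only the names `StuckCutoff` and `IsStuck` are taken from the finding.

```csharp
using System;

// Hypothetical status enum; the findings mention Pending, Retrying and Parked
// rows and a Delivered KPI, but the real enum may differ.
public enum NotificationStatusSketch { Pending, Retrying, Delivered, Parked }

public static class StuckDetectionSketch
{
    // Rows created before this instant are old enough to count as stuck.
    public static DateTimeOffset StuckCutoff(DateTimeOffset now, TimeSpan stuckAgeThreshold) =>
        now - stuckAgeThreshold;

    // Display-only check: feeds the KPI count and the row badge. It never
    // changes the row and never triggers a re-claim.
    public static bool IsStuck(
        NotificationStatusSketch status,
        DateTimeOffset createdAt,
        DateTimeOffset now,
        TimeSpan stuckAgeThreshold) =>
        (status is NotificationStatusSketch.Pending or NotificationStatusSketch.Retrying)
        && createdAt < StuckCutoff(now, stuckAgeThreshold);
}
```

Because the predicate has no side effects, an XML-doc that promises re-claim behaviour has nothing in the code to back it, which is the mismatch this finding describes.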
**Resolution**
_Unresolved._
### NotificationOutbox-010 — Comment claims `PipeTo` is not used "because the writer never throws"; the surrounding try/catch is dead-letter for the documented failure mode
| | |
|--|--|
| Severity | Medium |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:469-477` |
**Description**
```csharp
try
{
var evt = BuildNotifyDeliverEvent(notification, now, AuditStatus.Attempted, errorMessage)
with { DurationMs = durationMs };
// Fire-and-forget — we do NOT await: the dispatcher loop must not
// be blocked by audit IO, and the writer swallows its own faults.
// PipeTo is not used because the writer never throws.
_ = _auditWriter.WriteAsync(evt);
}
catch (Exception ex)
{
_logger.LogWarning(ex, "Failed to emit Attempted audit row …");
}
```
The XML-doc on `EmitAttemptAudit` is internally inconsistent and structurally
incorrect: (1) if "the writer never throws" then the surrounding try/catch is
unreachable and dead code; (2) if the writer *can* throw (and the catch is
meaningful) then "never throws" is wrong. In practice the catch only ever fires
on a synchronous throw from the writer's *task construction* — never on a fault
in the awaited body — because the discarded task is not observed. The current
behaviour matches the design intent ("audit failure NEVER aborts delivery"), but
the comment misleads the next reader on the *why*.
This is the same root cause as NotificationOutbox-004 — they target the same lines
from different angles (NotificationOutbox-004 is the scope-lifetime /
fire-and-forget Akka concern, NotificationOutbox-010 is the doc/comment-clarity
concern). Closing NotificationOutbox-004 by switching to `await` resolves both.
**Recommendation**
If `await`-ing the writer (recommended fix per NotificationOutbox-004): delete the
"PipeTo is not used because the writer never throws" line entirely and let
the try/catch's behaviour speak for itself. If keeping fire-and-forget: rewrite
the comment to "fire-and-forget by design (the writer is responsible for its
own failure handling); the surrounding try/catch only catches the synchronous
task-construction throw and is otherwise unreachable."
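
The difference between the two options can be seen in a small self-contained sketch. All names here (`AuditEmitSketch`, `FaultingWriteAsync`, `Observe`) are hypothetical stand-ins rather than the module's API; the sketch only assumes that the writer is an ordinary `Task`-returning method that can fault inside its asynchronous body.

```csharp
using System;
using System.Threading.Tasks;

public static class AuditEmitSketch
{
    // Stand-in for the audit writer: it faults inside its async body,
    // after the task has already been handed back to the caller.
    public static async Task FaultingWriteAsync()
    {
        await Task.Yield();
        throw new InvalidOperationException("audit store unavailable");
    }

    // Discard pattern: the fault is stored on the unobserved task, so the
    // catch block only runs if write() throws synchronously.
    public static bool DiscardReachesCatch(Func<Task> write)
    {
        try { _ = write(); return false; }
        catch (Exception) { return true; }
    }

    // Awaiting inside the try rethrows the fault at the await, so the
    // catch block becomes reachable for the asynchronous failure too.
    public static async Task<bool> AwaitReachesCatchAsync(Func<Task> write)
    {
        try { await write(); return false; }
        catch (Exception) { return true; }
    }

    // Fire-and-forget that keeps diagnostics: the continuation reports a
    // faulted task to onFault instead of leaving it unobserved.
    public static Task Observe(Func<Task> write, Action<Exception> onFault)
    {
        Task task;
        try { task = write(); }
        catch (Exception ex) { onFault(ex); return Task.CompletedTask; }

        return task.ContinueWith(
            t => { if (t.Exception is AggregateException agg) onFault(agg.GetBaseException()); },
            TaskScheduler.Default);
    }
}
```

In the dispatcher, the `Observe` variant would be called as `_ = AuditEmitSketch.Observe(() => writer.WriteAsync(evt), ex => logger.LogWarning(ex, "..."))`, which keeps the loop unblocked while still surfacing writer faults. The `await` variant is simpler and is the one NotificationOutbox-004 recommends.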
**Resolution**
_Unresolved._
+254 -13
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.NotificationService` |
| Design doc | `docs/requirements/Component-NotificationService.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
## Summary
@@ -55,20 +55,65 @@ any code (NS-017, dead config — NS-007 sourced the timeout/limit from
outside its lock, is sized once and never resized on redeployment, and is never
disposed (NS-018).
#### Re-review 2026-05-28 (commit `1eb6e97`)
Re-reviewed at commit `1eb6e97` against the **materially-changed design**: per the
updated `Component-NotificationService.md` and CLAUDE.md, the Notification Service
is now **central-only**. Sites no longer deliver notifications over SMTP — a
script's `Notify.Send` enqueues to the site Store-and-Forward Engine and
`NotificationForwarder.DeliverAsync` (S&F handler in StoreAndForward) forwards
the payload to the central Notification Outbox, which dispatches via the
`INotificationDeliveryAdapter` registered for the list's `Type`. Email delivery
on central is performed by `EmailNotificationDeliveryAdapter` in the
NotificationOutbox project — it reuses this module's SMTP machinery
(`ISmtpClientWrapper`, `OAuth2TokenService`, `SmtpErrorClassifier`,
`SmtpTlsModeParser`, `EmailAddressValidator`, `CredentialRedactor`,
`SmtpPermanentException`, `NotificationOptions`) but is the actual production
caller. The intended residual responsibility of this module is to **supply that
shared SMTP machinery** plus list/SMTP-config definition management on central.
The re-review surfaced **seven new findings**. The dominant theme is **dead
code that contradicts the design doc**: `NotificationDeliveryService`, the
`INotificationDeliveryService` interface in Commons, the `NotificationResult`
record, the entire `DeliverBufferedAsync` S&F handler, and the prior NS-001…
NS-018 test fixtures that exercise them are now orphaned — no production code
path resolves `INotificationDeliveryService` on a site (sites no longer register
this module per `SiteServiceRegistration.cs:33-38`) and on central the
NotificationOutbox uses its own `EmailNotificationDeliveryAdapter` (which
duplicates the connect/auth/send/disconnect sequence rather than delegating to
`NotificationDeliveryService`). The class is still registered by
`AddNotificationService` on central (`Program.cs:77`) but no consumer resolves
it (NS-019). The `S&F handler must be registered` workaround that NS-001 added
to `AkkaHostedService` is itself superseded by the `NotificationForwarder`
registered for the same category at `AkkaHostedService.cs:654-660` (NS-020).
Secondary findings: a real-world correctness gap (the OAuth2
`SaslMechanismOAuth2` is constructed with an **empty user id**, so server-side
account binding fails for any provider that requires it, NS-021); the SMTP
client wrapper holds a single `MailKit.SmtpClient` for the lifetime of the
wrapper, but the factory delegate creates a new wrapper per send, so successive
sends share neither a connection nor a wrapper, and each wrapper mutates its own
`_client.Timeout` on every connect (benign because every wrapper has its
own client, but the design comment about pooling is now contradicted, NS-022);
the design-doc retention/maintenance language has no implementation in this
module and there is no test affirming the module is central-only (NS-023, NS-024);
and `CredentialRedactor` masks any component of the credential string that is
≥ 4 characters long, so a 4-character user name like `root` or a 4-char tenant
prefix could be aggressively scrubbed out of unrelated log text (NS-025).
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Double SMTP client construction; `Auto` socket option for non-TLS; `TimeoutException`/`OperationCanceledException` misclassified. |
| 2 | Akka.NET conventions | ☑ | No actors in this module (`AddNotificationServiceActors` is a no-op); delivery is a plain DI service. No Akka-specific issues. |
| 3 | Concurrency & thread safety | ☑ | `OAuth2TokenService` is a singleton with a shared mutable token cache; double-checked locking present but cache key is wrong (NS-006). |
| 4 | Error handling & resilience | ☑ | Critical: no S&F delivery handler registered for `Notification` (NS-001). Fragile substring error classification (NS-002, NS-003). |
| 5 | Security | ☑ | Credentials handled as plaintext strings; OAuth2 client secret in DB credential blob; no recipient address validation. |
| 6 | Performance & resource management | ☑ | Two `ISmtpClientWrapper` instances created per send, one leaked; connection not pooled; `MaxConcurrentConnections` unenforced. |
| 7 | Design-document adherence | ☑ | Connection timeout, max concurrent connections, and TLS `SSL`/`None` modes from the design doc are not implemented. |
| 8 | Code organization & conventions | ☑ | `SmtpPermanentException` in the wrong file; `SmtpConfiguration` POCO has non-nullable strings with no initializer (compiler-warning risk). |
| 9 | Testing coverage | ☑ | Happy path and main error branches covered; OAuth2 delivery path, `DeliverAsync` permanent fallback, and token-cache concurrency untested. |
| 10 | Documentation & comments | ☑ | XML comment on `DeliverAsync` ("Throws on failure") and the misleading "OAuth2 token refresh if needed" comment do not match behaviour. |
| 1 | Correctness & logic bugs | ☑ | Re-review: OAuth2 SASL constructed with empty user id (NS-021); `CredentialRedactor` over-masks short components (NS-025). Earlier NS-005/NS-008 fixes hold. |
| 2 | Akka.NET conventions | ☑ | No actors in this module. `AddNotificationServiceActors` remains a documented no-op. |
| 3 | Concurrency & thread safety | ☑ | `OAuth2TokenService` per-credential locks now correct (NS-006 hold). No new issues. |
| 4 | Error handling & resilience | ☑ | NS-014/NS-015 classification fixes hold but the entire `DeliverBufferedAsync` / `SendAsync` error path is dead (NS-019/NS-020). |
| 5 | Security | ☑ | OAuth2 `SaslMechanismOAuth2` empty user id (NS-021); `CredentialRedactor` aggressiveness (NS-025); at-rest encryption still deferred (NS-013). |
| 6 | Performance & resource management | ☑ | `MailKitSmtpClientWrapper` keeps a single `SmtpClient` for the wrapper lifetime; combined with per-send factory this means no pooling — re-document or fix (NS-022). |
| 7 | Design-document adherence | ☑ | Critical drift: module still exposes site-style S&F sending; the design doc inverted delivery to central months ago (NS-019). Site registration removed but central still wires the dead service. |
| 8 | Code organization & conventions | ☑ | `INotificationDeliveryService` lives in Commons and is now unused — should be retired or relocated to a NotificationService-internal namespace (NS-019). Module-vs-NotificationOutbox boundary unclear. |
| 9 | Testing coverage | ☑ | 56 tests pass but ~40 of them assert behaviour of a code path no production caller exercises (NS-024). No test affirms the central-only design — i.e. that `AddNotificationService` registers no notification-sending service on a site. |
| 10 | Documentation & comments | ☑ | `NotificationDeliveryService` XML doc still claims "WP-11/12: Notification delivery via SMTP" with no warning that the class is orphaned; `INotificationDeliveryService` Commons doc claims "Implemented by NotificationService, consumed by ScriptRuntimeContext" — both consumers are wrong now (NS-023). |
## Findings
@@ -595,3 +640,199 @@ Replace the hand-rolled double-checked init with `Lazy<SemaphoreSlim>` or `LazyI
**Resolution**
Resolved 2026-05-17. All three issues confirmed against source. The hand-rolled double-checked init was replaced with a `Lazy<SemaphoreSlim>` — its publication is correctly synchronised, eliminating the lock-free read of a non-`volatile` reference. `NotificationDeliveryService` now implements `IDisposable` and disposes the limiter (if created) under the existing lock, with idempotent re-entry and an `ObjectDisposedException` guard in `SendAsync`/`GetConcurrencyLimiter`; the scoped DI registration disposes it per scope. The limiter remains scoped (not hoisted to a site singleton) — the design doc deploys one SMTP config per site and the per-instance capture is bounded; the redeploy-resize concern is acknowledged as low-impact and not changed here, since hoisting would require a registration change for marginal benefit. Tests `Service_Dispose_DisposesConcurrencyLimiter` plus the existing `Send_MaxConcurrentConnections_LimitsConcurrentDeliveries`.
### NotificationService-019 — `NotificationDeliveryService` and `INotificationDeliveryService` are orphaned by the central-only redesign
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:18-442`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:20-21`, `src/ScadaLink.Commons/Interfaces/Services/INotificationDeliveryService.cs:1-33`, `src/ScadaLink.Host/Program.cs:77` |
**Description**
The updated `Component-NotificationService.md` (re-read in full at this commit) makes the new design unambiguous: "The Notification Service is the central component that manages notification-list and SMTP definitions and provides the per-type delivery adapters used to send notifications. … Notification delivery has been inverted: a site script's notification is store-and-forwarded to the central cluster, and the central **Notification Outbox** owns dispatch and delivery, calling an `INotificationDeliveryAdapter` supplied by this component." The doc explicitly states the service is "central cluster only", "no longer present at site clusters", and "no longer delivers notifications from sites".
The current source does not match. `NotificationDeliveryService` is a site-shaped notification sender: it accepts `(listName, subject, message)`, performs an immediate SMTP `DeliverAsync`, catches transient failures and **buffers them to a `StoreAndForwardCategory.Notification` row**, and exposes `DeliverBufferedAsync` as the matching S&F handler. That is precisely the old site-side flow the design doc says was removed. The doc explicitly notes "there is no … local SQLite copy" of notification lists at sites, yet `DeliverBufferedAsync` re-resolves the list from a repository expected to be reachable on the buffering node.
Who actually calls it?
- **Sites** do **not**. `SiteServiceRegistration.cs:33-38` documents the deliberate omission: "AddNotificationService() is intentionally NOT registered on the site path." Sites register `NotificationForwarder` (in `ScadaLink.StoreAndForward`) as the S&F handler for `StoreAndForwardCategory.Notification` (`AkkaHostedService.cs:654-660`), which Asks the central comms actor and never touches SMTP. `ScriptRuntimeContext.NotifyHelper` (in `SiteRuntime`) enqueues directly to S&F as a serialized `NotificationSubmit`, **not** via `INotificationDeliveryService.SendAsync`.
- **Central** registers it (`Program.cs:77` calls `AddNotificationService`) but no central component resolves it. The central notification dispatcher is `NotificationOutboxActor` → `INotificationDeliveryAdapter` → `EmailNotificationDeliveryAdapter`. The adapter is a full re-implementation of the connect/auth/send/disconnect sequence (see `EmailNotificationDeliveryAdapter.cs:163-222`); it deliberately does not call `NotificationDeliveryService.DeliverAsync` (XML-doc on the adapter says "Reuses the `ScadaLink.NotificationService` SMTP machinery — `ISmtpClientWrapper`, `SmtpTlsModeParser`, `OAuth2TokenService` and the typed `SmtpPermanentException`", i.e. only the leaf primitives).
The `NotificationDeliveryService` class, its `DeliverBufferedAsync`, the `Func<ISmtpClientWrapper>` registration consumed only by it, and the `INotificationDeliveryService` interface (still in Commons) and `NotificationResult` record are therefore dead code that contradicts the design. Worse, every prior finding NS-001..NS-018 was reviewed and resolved against this dead path. The 56-test green test suite (NS-012 resolution note) exercises behaviour no production caller invokes — it gives a false sense of coverage. The misleading XML doc on `NotificationDeliveryService` ("WP-11/12: Notification delivery via SMTP") tells a maintainer this is *the* delivery path; the registration on central does the same.
Risk: an operator following the design doc will look here for "the central email delivery code" and find a parallel implementation that is never called; a future feature change (e.g. retry policy tweak) made here will silently have no effect; the `Notify` script-API end-to-end behaviour now depends on `NotificationOutbox` + `EmailNotificationDeliveryAdapter` + `NotificationForwarder`, none of which are tested in this module's suite.
**Recommendation**
Decide and execute one of:
1. **Delete `NotificationDeliveryService`, `DeliverBufferedAsync`, the `BufferedNotification` payload type, the `Func<ISmtpClientWrapper>` scoped registration (move it to NotificationOutbox if still needed there — it already has its own), and `INotificationDeliveryService`/`NotificationResult` in Commons.** Reduce `AddNotificationService` to registering the shared primitives — `OAuth2TokenService`, `ISmtpClientWrapper` factory, `NotificationOptions`. Delete the NS-001..NS-018 tests that target the orphaned path; rebase the ones that exercise primitives (`SmtpErrorClassifier`, `SmtpTlsModeParser`, `CredentialRedactor`, `EmailAddressValidator`, `MailKitSmtpClientWrapper`, `OAuth2TokenService`) which remain genuinely shared. Update `CompositionRootTests` (`tests/ScadaLink.Host.Tests/CompositionRootTests.cs:208-209`) and `IntegrationSurfaceTests` (`tests/ScadaLink.IntegrationTests/IntegrationSurfaceTests.cs:122-135`) to drop the stale assertions.
2. **Keep the class as the central-only Email delivery primitive** and rewrite `EmailNotificationDeliveryAdapter` to delegate to it. This is the smaller diff but the larger semantic burden — `NotificationDeliveryService.SendAsync` returns `NotificationResult` (Success / WasBuffered) which cannot encode the three-way `DeliveryOutcome` (Success / Transient / Permanent) the outbox needs, so the contract still has to change.
Recommended path is option 1: the parallel implementation in `EmailNotificationDeliveryAdapter` is already complete and matches the new design's `DeliveryOutcome` model; salvaging the old class would re-introduce the very inversion this redesign removed.
### NotificationService-020 — NS-001 fix superseded; `AkkaHostedService` would register two competing `Notification` S&F handlers if both code paths ran
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:654-660`, NS-001 resolution note (this file) |
**Description**
NS-001 was resolved by registering an `S&F → DeliverBufferedAsync` handler for `StoreAndForwardCategory.Notification` at site startup in `AkkaHostedService`. The current source registers a **different** handler for the same category at `AkkaHostedService.cs:654-660`: `NotificationForwarder.DeliverAsync`, which forwards to central instead of sending SMTP. `StoreAndForwardService.RegisterDeliveryHandler` (verified by reading `StoreAndForward/StoreAndForwardService.cs` around line 109) takes a single handler per category; whether it is last-write-wins or first-write-wins, the two registrations cannot both be active.
The NS-001 resolution note in this file describes a state of the code that no longer exists: it says the handler "is now registered at site startup in `AkkaHostedService`" and points to a handler resolving `NotificationDeliveryService` via a fresh DI scope. That registration is gone from the current `AkkaHostedService` (only `ExternalSystem`, `CachedDbWrite`, and the `NotificationForwarder`-based `Notification` registration are present at the current location). So the NS-001 fix has been silently rolled back / replaced as part of the central-only redesign.
The risk this finding tracks is not the current state per se — `NotificationForwarder` registration is correct under the new design — but the **stale resolution note** plus the fact that `NotificationDeliveryService.DeliverBufferedAsync` still exists in this module and is still tested as an S&F handler. A future merge or revert that re-introduces the NS-001-style registration (because it is what the test suite shape implies) would conflict with `NotificationForwarder`. The two handlers do diametrically opposite things (forward to central vs. send SMTP locally on a site where there is no SMTP config), so a misregistration would cause a silent regression of the design inversion.
**Recommendation**
Mark the NS-001 resolution note in this file as **superseded by NS-019** with a one-line note explaining that the registration was removed when sites stopped delivering. Delete the orphan `DeliverBufferedAsync` and its tests as part of the NS-019 work. Add a comment on `NotificationForwarder` registration in `AkkaHostedService` cross-referencing NS-019/NS-020 so a maintainer searching for the `Notification` S&F handler finds the one canonical registration.
### NotificationService-021 — OAuth2 SASL constructed with empty user identifier; M365 SMTP will reject the auth handshake
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:76-79` |
**Description**
```csharp
case "oauth2":
// OAuth2 token is passed directly as credentials (pre-fetched by token service)
var oauth2 = new SaslMechanismOAuth2("", credentials);
await _client.AuthenticateAsync(oauth2, cancellationToken);
break;
```
`SaslMechanismOAuth2(string userName, string token)` — MailKit's XOAUTH2 mechanism — sends the SASL initial response as `user=<userName>\x01auth=Bearer <token>\x01\x01`. Microsoft 365 (and most OAuth2-enabled SMTP relays) **require the `userName` field to be the From mailbox identity the token was issued for**; an empty string is rejected with a server response like `535 5.7.3 Authentication unsuccessful` ("Either the user identity does not match the principal in the token, or the user is empty"). Office 365's documentation for SMTP AUTH XOAUTH2 calls this out explicitly.
The token-fetch path supports this: `OAuth2TokenService.GetTokenAsync` issues a Client Credentials grant against `login.microsoftonline.com/{tenantId}/oauth2/v2.0/token` with `scope=https://outlook.office365.com/.default`, which is the Microsoft 365 SMTP send scope — meaning the intended target is M365 SMTP, which is precisely the server that rejects an empty user. The `SmtpConfiguration.FromAddress` field is exactly the user identity that should be passed.
This bug is not caught by tests because every existing test uses a fake `ISmtpClientWrapper` (`Substitute.For<ISmtpClientWrapper>()`, `RecordingAuthClient`, etc.) — `MailKitSmtpClientWrapper.AuthenticateAsync` is never exercised against a real `SaslMechanismOAuth2`. The OAuth2 delivery test (NS-012, `Send_OAuth2Config_AuthenticatesWithResolvedAccessToken`) only asserts the wrapper's `AuthenticateAsync` is invoked with `("oauth2", "<access-token>")`; the wrapper itself is mocked out. The same defect is present in `EmailNotificationDeliveryAdapter` only because it routes through this same `AuthenticateAsync` method.
**Recommendation**
Pass the sender mailbox into the wrapper's `AuthenticateAsync` path. The cleanest fix is to thread `config.FromAddress` (or a dedicated `oauth2UserName` parameter) through `ISmtpClientWrapper.AuthenticateAsync` so the OAuth2 branch can construct `new SaslMechanismOAuth2(config.FromAddress, credentials)`. Add an integration-style test that runs `MailKitSmtpClientWrapper.AuthenticateAsync` against a stub `SmtpClient` and asserts the XOAUTH2 initial-response bytes contain the expected `user=<from>` field, so this regression is caught next time.
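
To make the wire-level effect concrete, the XOAUTH2 initial response described above can be built by hand. This is a sketch of the documented payload format only, not MailKit's implementation; `Xoauth2Sketch` and its methods are hypothetical helpers, and the mailbox and token values are placeholders.

```csharp
using System;
using System.Text;

public static class Xoauth2Sketch
{
    // Builds the base64-encoded XOAUTH2 initial client response:
    //   user=<userName> 0x01 auth=Bearer <token> 0x01 0x01
    public static string BuildInitialResponse(string userName, string accessToken)
    {
        var raw = "user=" + userName + "\u0001auth=Bearer " + accessToken + "\u0001\u0001";
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(raw));
    }

    // Decodes a response and returns the value of its user= field,
    // which is what the server compares against the token's principal.
    public static string ExtractUser(string initialResponse)
    {
        var raw = Encoding.UTF8.GetString(Convert.FromBase64String(initialResponse));
        var firstField = raw.Split('\u0001')[0];
        return firstField.StartsWith("user=", StringComparison.Ordinal)
            ? firstField.Substring("user=".Length)
            : throw new FormatException("Missing user= field.");
    }
}
```

Calling `ExtractUser(BuildInitialResponse("", token))` returns an empty string, which is the identity the current `new SaslMechanismOAuth2("", credentials)` call would present. Passing the configured From address instead puts that mailbox in the `user=` field, which is the change the recommendation asks for.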
### NotificationService-022 — `MailKitSmtpClientWrapper` holds a long-lived `SmtpClient`; combined with per-send factory, the design comment about pooling is contradicted
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:14`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:19` |
**Description**
`MailKitSmtpClientWrapper` declares `private readonly SmtpClient _client = new();` — a single `SmtpClient` is constructed when the wrapper is constructed and lives for the wrapper's lifetime. The DI registration is `services.AddSingleton<Func<ISmtpClientWrapper>>(_ => () => new MailKitSmtpClientWrapper());` (`ServiceCollectionExtensions.cs:19`) — every invocation of the factory creates a **new** wrapper and therefore a **new** `SmtpClient`. `NotificationDeliveryService.DeliverAsync` (the orphan, per NS-019) and `EmailNotificationDeliveryAdapter.SendAsync` both invoke the factory per send and dispose the wrapper at end of send. So in practice there is no connection pooling — every send pays a full TCP+TLS handshake.
This is internally consistent (and matches MailKit guidance — `SmtpClient` is not thread-safe and reusing across deliveries needs careful guarding). However:
1. The XML doc comment on the wrapper class says nothing about lifetime; the field initializer `new SmtpClient()` *implies* a reusable connection. A maintainer might "fix" the factory to reuse a single wrapper (singleton), believing they are enabling pooling, and immediately introduce a concurrency bug: `MailKit.SmtpClient` rejects concurrent send calls and the wrapper carries no synchronization.
2. `ConnectAsync` mutates `_client.Timeout` (`MailKitSmtpClientWrapper.cs:39-42`) every time it runs. If a wrapper is ever reused across deliveries with different `SmtpConfiguration.ConnectionTimeoutSeconds` values, the timeout is silently overwritten — not a current bug, but a latent footgun.
3. The design doc requirement "Max concurrent connections (default 5)" is currently honoured by the NS-007 `SemaphoreSlim` on `NotificationDeliveryService`, but `EmailNotificationDeliveryAdapter` has **no equivalent throttle** — see `EmailNotificationDeliveryAdapter.cs:163-222`, no semaphore. So on central, where the actual delivery now happens, the design-doc concurrency limit is no longer enforced. This is a regression introduced by the redesign — the outbox does not carry NS-007's limiter forward.
**Recommendation**
Document the per-send lifecycle on `MailKitSmtpClientWrapper` (XML on the class: "one wrapper per delivery; the wrapper owns a single `SmtpClient` that is connected/authenticated/sent/disconnected/disposed once"). Either move the NS-007 `SemaphoreSlim` into a shared per-site holder consumed by `EmailNotificationDeliveryAdapter`, or accept the loss and update the design doc. Add `[Obsolete]` or `internal` to discourage re-using a wrapper across sends.
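One possible shape for the shared limiter is sketched below. It is not the existing NS-007 code: `SmtpSendThrottle`, `RunAsync`, and `AvailableSlots` are hypothetical names, and the default of 5 simply echoes the design-doc figure quoted above. The sketch assumes the holder would be registered once and injected into whichever class performs the send.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shared limiter; the name and shape are illustrative only.
public sealed class SmtpSendThrottle : IDisposable
{
    private readonly SemaphoreSlim _slots;

    public SmtpSendThrottle(int maxConcurrentConnections = 5)
    {
        if (maxConcurrentConnections < 1)
            throw new ArgumentOutOfRangeException(nameof(maxConcurrentConnections));
        _slots = new SemaphoreSlim(maxConcurrentConnections, maxConcurrentConnections);
    }

    // Number of sends that could start right now without waiting.
    public int AvailableSlots => _slots.CurrentCount;

    // Waits for a free slot, runs the send, and always releases the slot,
    // including when the send throws or is cancelled.
    public async Task<T> RunAsync<T>(
        Func<CancellationToken, Task<T>> send,
        CancellationToken ct = default)
    {
        await _slots.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            return await send(ct).ConfigureAwait(false);
        }
        finally
        {
            _slots.Release();
        }
    }

    public void Dispose() => _slots.Dispose();
}
```

Wrapping the adapter's connect/authenticate/send/disconnect sequence in a single `RunAsync` call would keep the one-wrapper-per-delivery lifecycle intact while restoring an upper bound on simultaneous SMTP connections.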
### NotificationService-023 — XML docs on the orphaned classes still describe the removed site-delivery flow; misleading to maintainers
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:12-17`, `src/ScadaLink.Commons/Interfaces/Services/INotificationDeliveryService.cs:3-12`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:8-9` |
**Description**
XML comments still claim the dead path is the live path:
- `NotificationDeliveryService` class summary: "WP-11: Notification delivery via SMTP. WP-12: Error classification and S&F integration. Transient: connection refused, timeout, SMTP 4xx → hand to S&F. Permanent: SMTP 5xx → returned to script." This is the pre-redesign behaviour. The site-S&F branch in particular is dead (see NS-019), and "returned to script" is no longer accurate — `Notify.Send` is async and never returns a permanent error to the script per the design doc.
- `INotificationDeliveryService` (Commons): "Interface for sending notifications. Implemented by NotificationService, consumed by ScriptRuntimeContext." Verified against source: `ScriptRuntimeContext` does **not** consume this interface — it enqueues directly to `StoreAndForwardService` (see `SiteRuntime/Scripts/ScriptRuntimeContext.cs:1770-1774`). The Commons-level claim therefore documents an interaction that no longer exists.
- `NotificationResult` is a record returned only by the orphaned `SendAsync`. The Notification Outbox uses `DeliveryOutcome` instead, which encodes the Success/Transient/Permanent three-way that `NotificationResult(Success, ErrorMessage, WasBuffered)` cannot.
- `ServiceCollectionExtensions.AddNotificationService` XML doc says "Registers the notification delivery services (SMTP, OAuth2 token, delivery adapter)" — no mention that the central-only redesign means most of what it registers is unused.
A reader following the XML docs from any entry point ends up at a path that does not run. The CLAUDE.md "External Integrations" section and `Component-NotificationService.md` describe the new design; the in-source docs contradict them.
**Recommendation**
Tied to NS-019: if the orphan classes are deleted, this finding closes itself. If they are kept temporarily, prepend each summary with "**Obsolete — superseded by NotificationOutbox's `EmailNotificationDeliveryAdapter`. Retained for transitional compatibility; do not add new callers.**" and update `INotificationDeliveryService`'s summary to reflect the inverted flow or remove the interface.
### NotificationService-024 — No test affirms the central-only invariant; the orphaned-path tests give a false coverage signal
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.NotificationService.Tests/NotificationDeliveryServiceTests.cs`, `tests/ScadaLink.IntegrationTests/IntegrationSurfaceTests.cs:118-136`, `tests/ScadaLink.Host.Tests/CompositionRootTests.cs:207-209` |
**Description**
The module test suite has 56 tests; counting `NotificationDeliveryServiceTests.cs`, ~40 of them exercise `NotificationDeliveryService.SendAsync`/`DeliverBufferedAsync` — code paths that, per NS-019, no production caller resolves. They pass against the orphaned class and so the suite stays green, but the green is a false signal: changing the dead implementation (or deleting it) does not flag any regression in the live notification-delivery flow, which now lives in `EmailNotificationDeliveryAdapter` (covered by NotificationOutbox's own tests) and `NotificationForwarder` (covered, if at all, by StoreAndForward's tests).
In particular there is **no test in this module** that affirms the central-only invariant the design doc requires:
- No test that `AddNotificationService()` registered on a *site* role would be inert / no-op'd, or that `SiteServiceRegistration.Configure` does **not** call `AddNotificationService` (an obvious regression vector — re-adding it would silently restore the orphaned site-delivery path).
- No test that confirms `INotificationDeliveryService` has no production consumer (i.e. an architecture test that fails if anyone re-introduces a constructor parameter or `GetRequiredService<INotificationDeliveryService>()` call).
- The cross-module `CompositionRootTests` (`tests/ScadaLink.Host.Tests/CompositionRootTests.cs:208-209`) still asserts `NotificationDeliveryService` and `INotificationDeliveryService` are registered, locking in the orphan rather than catching it.
- `IntegrationSurfaceTests.cs:122-125` constructs `NotificationDeliveryService` directly to validate "the integration surface" — testing a surface that no script actually crosses.
**Recommendation**
After NS-019 is decided:
1. If the orphan is deleted, remove the orphaned-path tests (NS-001/004/005/007/008/009/010/014/015/016/017/018-style tests targeting `SendAsync`/`DeliverBufferedAsync`). Retain `SmtpErrorClassifierTests`, `SmtpTlsModeParserTests`, `CredentialRedactorTests`, `OAuth2TokenServiceTests`, and `MailKitSmtpClientWrapperTests` (primitives genuinely shared). Update `CompositionRootTests` to drop the stale rows and `IntegrationSurfaceTests` to call the live path via `INotificationDeliveryAdapter`/`EmailNotificationDeliveryAdapter`.
2. Add a one-shot architecture test in `tests/ScadaLink.Architecture.Tests` (if it exists, else this module) that scans for direct references to `INotificationDeliveryService` outside this project and the obsolete-interface declaration in Commons, failing if any new consumer reappears.
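A reflection-based version of the architecture test in item 2 could look roughly like the sketch below. Everything in it is hypothetical: `ForbiddenDependencyScanner` is an illustrative helper, and `ILegacyDelivery`, `LegacyConsumer`, and `CleanService` are stand-ins for `INotificationDeliveryService` and its would-be consumers. The sketch only detects constructor injection; catching `GetRequiredService<T>()` calls would need source or IL scanning, which is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Stand-in types for illustration only.
public interface ILegacyDelivery { }
public sealed class LegacyConsumer { public LegacyConsumer(ILegacyDelivery delivery) { } }
public sealed class CleanService { public CleanService(string name) { } }

// Hypothetical helper: lists types whose constructors take the forbidden type.
public static class ForbiddenDependencyScanner
{
    public static IReadOnlyList<string> FindConstructorConsumers(
        IEnumerable<Assembly> assemblies, Type forbidden, ISet<Type> allowed)
    {
        const BindingFlags flags =
            BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance;

        return assemblies
            .SelectMany(a => a.GetTypes())
            .Where(t => !allowed.Contains(t))           // skip known, tolerated consumers
            .Where(t => t.GetConstructors(flags)
                .SelectMany(c => c.GetParameters())
                .Any(p => p.ParameterType == forbidden))
            .Select(t => t.FullName ?? t.Name)
            .OrderBy(n => n, StringComparer.Ordinal)
            .ToList();
    }
}
```

In a real test the assembly list would be the production assemblies, the allowed set would hold the obsolete implementation itself, and the assertion would be that the returned list is empty.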
### NotificationService-025 — `CredentialRedactor` over-masks: any 4-character credential component is masked anywhere it appears, including unrelated log text
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/CredentialRedactor.cs:34-48` |
**Description**
```csharp
var parts = credentials.Split(':')
    .Where(p => p.Length >= 4)
    .Append(credentials)
    .Distinct()
    .OrderByDescending(p => p.Length);
foreach (var part in parts)
{
    result = result.Replace(part, Mask, StringComparison.Ordinal);
}
```
The threshold `p.Length >= 4` is permissive enough that common short identifiers used by operators become aggressive global redaction tokens:
- A Basic-Auth credential of `root:hunter2` produces components `["root", "hunter2", "root:hunter2"]`. Every literal `root` anywhere in the exception/log text is masked — including unrelated mentions like file paths (`/root/.config`) or default-account names in the server's reply. This obscures legitimate diagnostic information without protecting any additional secret.
- An OAuth2 tenant id is a GUID (long, safe). The client id is typically a GUID. The client secret is the high-entropy part. The full `tenant:client:secret` is the actual sensitive triple. A tenant GUID embedded in unrelated text (a tenant-bound error code, a partial URL) will be masked even when the appearance is non-sensitive.
- The user name in Basic Auth is sometimes the From address (`scada-notifications@company.com`) — masking *the company's notification mailbox* in every log line that mentions it has real operational cost.
The function also calls `String.Replace` with ordinal comparison and no word-boundary awareness, so a 4-character component that happens to be a substring of a longer benign token is masked as well.
The threshold is a defence-in-depth choice; the existing tests assert that `Hunter2pw!` and `Sup3rSecretValue` are masked (good) and that `null` text/credentials are handled (good), but nothing pins the negative behaviour: e.g. a test that a 4-char user name `root` is **not** also masked when it appears in an unrelated path.
**Recommendation**
Tighten the redaction policy: mask only the obviously-secret components — the password (Basic), the client secret (OAuth2), and the whole packed string — not the user name / tenant / client id. The simplest implementation is to redact only the **last** colon-separated component (the secret) plus the full packed string. Bump the per-component minimum length to something high enough that a typical short user name does not match (≥ 12 chars is the usual heuristic for a password). Add a test asserting `Scrub("/root/.config", "root:hunter2")` does not mask `/root/.config`'s `root`.
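The tightened policy could be sketched as follows. This is not the existing `CredentialRedactor`: `SecretOnlyRedactor` is a hypothetical name, the `"***"` mask value is an assumption, and `minSecretLength` is left as a parameter (defaulting to the current threshold of 4) so the higher value suggested above can be chosen separately. The sketch assumes the secret is always the final colon-separated component, which would not hold for a password that itself contains a colon.

```csharp
using System;

// Hypothetical sketch of a secret-only redaction policy.
public static class SecretOnlyRedactor
{
    private const string Mask = "***"; // assumed mask value

    public static string? Scrub(string? text, string? credentials, int minSecretLength = 4)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(credentials))
            return text;

        // Mask the whole packed string first so a later replacement cannot split it.
        var result = text.Replace(credentials, Mask, StringComparison.Ordinal);

        // Treat only the final colon-separated component as the secret;
        // user name, tenant id, and client id are deliberately left visible.
        var lastColon = credentials.LastIndexOf(':');
        var secret = lastColon >= 0 ? credentials.Substring(lastColon + 1) : credentials;
        if (secret.Length >= minSecretLength)
            result = result.Replace(secret, Mask, StringComparison.Ordinal);

        return result;
    }
}
```

With this shape, `Scrub("/root/.config", "root:hunter2")` leaves the path untouched, while text containing `hunter2` or the full `root:hunter2` string is still masked.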
+208 -29
@@ -40,34 +40,38 @@ module file and counted in **Total**.
| Severity | Open findings |
|----------|---------------|
| Critical | 0 |
| High | 0 |
| Medium | 0 |
| Low | 0 |
| **Total** | **0** |
| High | 18 |
| Medium | 62 |
| Low | 92 |
| **Total** | **172** |
## Module Status
| Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
|--------|---------------|--------|----------------|------|-------|
| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 25 |
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 10 |
| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 15 |
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 15 |
| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 18 |
| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 15 |
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 19 |
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/8 | 11 | 11 |
| [CLI](CLI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/4 | 7 | 23 |
| [CentralUI](CentralUI/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/2/5 | 8 | 33 |
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/4 | 4 | 14 |
| [Commons](Commons/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/6 | 9 | 23 |
| [Communication](Communication/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/1/5 | 7 | 22 |
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/4/5 | 10 | 24 |
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/4/0 | 5 | 22 |
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/1/5 | 7 | 24 |
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/2/3 | 6 | 23 |
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/5 | 7 | 23 |
| [Host](Host/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/5 | 7 | 22 |
| [InboundAPI](InboundAPI/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/3/4 | 8 | 25 |
| [ManagementService](ManagementService/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/3/2 | 6 | 23 |
| [NotificationOutbox](NotificationOutbox/findings.md) | 2026-05-28 | `1eb6e97` | 0/2/5/3 | 10 | 10 |
| [NotificationService](NotificationService/findings.md) | 2026-05-28 | `1eb6e97` | 0/2/2/3 | 7 | 25 |
| [Security](Security/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/4 | 6 | 21 |
| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/4 | 6 | 6 |
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/2/6 | 9 | 23 |
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/4/3 | 7 | 26 |
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/3/3 | 7 | 24 |
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/4/1 | 6 | 22 |
| [Transport](Transport/findings.md) | 2026-05-28 | `1eb6e97` | 0/3/5/4 | 12 | 12 |
## Pending Findings
@@ -80,14 +84,189 @@ description, location, recommendation — lives in the module's `findings.md`.
_None open._
### High (0)
### High (18)
_None open._
| ID | Module | Title |
|----|--------|-------|
| CentralUI-028 | [CentralUI](CentralUI/findings.md) | `NotificationReport` and `SiteCallsReport` bypass `SiteScopeService` — Deployment role site-scoping defeated on the two new central-mirror pages |
| Communication-016 | [Communication](Communication/findings.md) | `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires |
| ConfigurationDatabase-015 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `NotificationOutboxRepository.InsertIfNotExistsAsync` is a check-then-act race with no duplicate-key catch |
| DataConnectionLayer-018 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Concurrent subscribes for the same tag from different instances orphan an adapter subscription handle |
| DeploymentManager-018 | [DeploymentManager](DeploymentManager/findings.md) | Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover |
| ExternalSystemGateway-018 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `DeliverBufferedAsync` lets `JsonException` propagate, turning a corrupt buffered row into a permanent retry-forever poison message |
| InboundAPI-022 | [InboundAPI](InboundAPI/findings.md) | `IActiveNodeGate` has no production registration in Host — standby-node gating is silently disabled in production |
| ManagementService-018 | [ManagementService](ManagementService/findings.md) | QueryAuditLogCommand has no role gate |
| NotificationOutbox-001 | [NotificationOutbox](NotificationOutbox/findings.md) | `EmailNotificationDeliveryAdapter` inherits the OAuth2 empty-user SASL bug (NS-021) on the M365 send path |
| NotificationOutbox-002 | [NotificationOutbox](NotificationOutbox/findings.md) | Dispatcher parks on first transient failure when `SmtpConfiguration.MaxRetries == 0` |
| NotificationService-019 | [NotificationService](NotificationService/findings.md) | `NotificationDeliveryService` and `INotificationDeliveryService` are orphaned by the central-only redesign |
| NotificationService-021 | [NotificationService](NotificationService/findings.md) | OAuth2 SASL constructed with empty user identifier; M365 SMTP will reject the auth handshake |
| SiteEventLogging-016 | [SiteEventLogging](SiteEventLogging/findings.md) | `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps |
| StoreAndForward-018 | [StoreAndForward](StoreAndForward/findings.md) | Notification corrupt-payload parks the buffered message, contradicting the "notifications do not park" design invariant |
| TemplateEngine-017 | [TemplateEngine](TemplateEngine/findings.md) | Revision hash and diff both ignore `Description` and `Connections`, defeating staleness detection for real deployment changes |
| Transport-001 | [Transport](Transport/findings.md) | Template Overwrite never syncs attributes / alarms / scripts |
| Transport-002 | [Transport](Transport/findings.md) | ExternalSystem Overwrite never syncs methods |
| Transport-003 | [Transport](Transport/findings.md) | Unlock lockout is enforced only client-side; server session is never marked Locked |
### Medium (0)
### Medium (62)
_None open._
| ID | Module | Title |
|----|--------|-------|
| AuditLog-001 | [AuditLog](AuditLog/findings.md) | Combined-telemetry transport is plumbed end-to-end but never invoked in production |
| AuditLog-004 | [AuditLog](AuditLog/findings.md) | `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows |
| AuditLog-005 | [AuditLog](AuditLog/findings.md) | `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan |
| CLI-017 | [CLI](CLI/findings.md) | `BundleCommands.RunBundleCommandAsync` duplicates `ExecuteCommandAsync` and breaks the auth exit-code contract |
| CLI-018 | [CLI](CLI/findings.md) | `audit query` and `audit export` never return exit 2 for an authorization failure |
| CLI-019 | [CLI](CLI/findings.md) | `bundle export` decodes the entire base64 bundle into memory before writing |
| CentralUI-026 | [CentralUI](CentralUI/findings.md) | `AuditFilterBar` From/To filters treat browser-local datetimes as UTC |
| CentralUI-027 | [CentralUI](CentralUI/findings.md) | Same UTC misinterpretation in `SiteCallsReport`, `NotificationReport`, and `EventLogs` |
| Commons-015 | [Commons](Commons/findings.md) | `EncryptionMetadata` accepts any algorithm string and any iteration count |
| Commons-017 | [Commons](Commons/findings.md) | `Component-Commons.md` is significantly stale (audit enums, new entities, new repositories, new service interfaces, new folders) |
| Commons-019 | [Commons](Commons/findings.md) | New `*Utc`-suffixed `DateTime` columns on `AuditEvent` / `SiteCall` are not enforced as UTC; inconsistent with `Notification`'s `DateTimeOffset` |
| Communication-017 | [Communication](Communication/findings.md) | `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up |
| ConfigurationDatabase-016 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `InboundApiRepository.GetApiKeyByValueAsync` hashes the candidate with the unpeppered `ApiKeyHasher.Default` |
| ConfigurationDatabase-017 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Stub-attach delete on `DeploymentRecord` bypasses optimistic concurrency |
| ConfigurationDatabase-018 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `DateTime`-typed `*Utc` columns on `AuditEvent` / `SiteCall` carry no `DateTimeKind` enforcement |
| ConfigurationDatabase-019 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `EnsureLookaheadAsync` swallows non-idempotent SPLIT failures and continues, creating partition holes |
| DataConnectionLayer-019 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from concurrent thread-pool continuations |
| DataConnectionLayer-020 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` double-counts `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance's subscribe |
| DataConnectionLayer-021 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` re-creates and leaks `_subscriptionsByInstance` entry when the instance unsubscribed mid-flight |
| DataConnectionLayer-022 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` and `HandleTagResolutionFailed` reset the tag-resolution retry timer on every call via `StartPeriodicTimer`, starving the retry under subscribe bursts |
| DeploymentManager-019 | [DeploymentManager](DeploymentManager/findings.md) | Lifecycle command timeout writes no audit entry |
| ExternalSystemGateway-019 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `HttpClient.Timeout` is not set; `DefaultHttpTimeout` > 100s is silently clipped by the framework default |
| ExternalSystemGateway-020 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `JsonElementToParameterValue` silently downcasts non-Int64 JSON numbers to `double`, losing precision for `decimal` SQL parameters on retry |
| HealthMonitoring-017 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthReportSender` resets interval counters before `Send`; transport failures silently drop the interval's error counts |
| HealthMonitoring-019 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` design-doc metrics have no HealthMonitoring-side surface |
| Host-016 | [Host](Host/findings.md) | Site `CentralContactPoints` second entry targets the site's own remoting port |
| Host-017 | [Host](Host/findings.md) | Site-shutdown ordering from REQ-HOST-7 is not wired |
| InboundAPI-018 | [InboundAPI](InboundAPI/findings.md) | `AuditWriteMiddleware` fires `WriteAsync` as `_ = task` — faulted async writes are unobserved |
| InboundAPI-021 | [InboundAPI](InboundAPI/findings.md) | `ParentExecutionId` correlation flows only through `Call`; attribute reads/writes lose the inbound→site execution-tree link |
| InboundAPI-025 | [InboundAPI](InboundAPI/findings.md) | `AuditWriteMiddleware` runs against the entire `/api/*` branch — emits spurious `ApiInbound` audit rows for `/api/audit/query` and `/api/audit/export` |
| ManagementService-019 | [ManagementService](ManagementService/findings.md) | AuditEndpoints builds PermittedSiteIds but never enforces them |
| ManagementService-020 | [ManagementService](ManagementService/findings.md) | UpdateSmtpConfig returns and audits the SMTP Credentials field verbatim |
| ManagementService-021 | [ManagementService](ManagementService/findings.md) | Transport bundle handlers have zero test coverage |
| NotificationOutbox-003 | [NotificationOutbox](NotificationOutbox/findings.md) | Dispatcher does not propagate a `CancellationToken` into delivery; in-flight SMTP sends cannot be cancelled on shutdown |
| NotificationOutbox-004 | [NotificationOutbox](NotificationOutbox/findings.md) | `EmitAttemptAudit`/`EmitTerminalAudit` fire-and-forget pattern can outlive the per-sweep DI scope |
| NotificationOutbox-005 | [NotificationOutbox](NotificationOutbox/findings.md) | Ingest persistence inherits the CD-015 check-then-act race; under contention the second writer throws and the site retries |
| NotificationOutbox-007 | [NotificationOutbox](NotificationOutbox/findings.md) | `NotificationOutboxOptions.DispatchBatchSize`, `DeliveredKpiWindow`, and `PurgeInterval` are not in the design document |
| NotificationOutbox-010 | [NotificationOutbox](NotificationOutbox/findings.md) | Comment claims `PipeTo` is not used "because the writer never throws"; the surrounding try/catch is dead-letter for the documented failure mode |
| NotificationService-020 | [NotificationService](NotificationService/findings.md) | NS-001 fix superseded; `AkkaHostedService` would register two competing `Notification` S&F handlers if both code paths ran |
| NotificationService-024 | [NotificationService](NotificationService/findings.md) | No test affirms the central-only invariant; the orphaned-path tests give a false coverage signal |
| Security-016 | [Security](Security/findings.md) | `RoleMapper` silently drops the system-wide Deployment grant when a user is also in any site-scoped Deployment group |
| Security-017 | [Security](Security/findings.md) | `SiteScopeRequirement` / `SiteScopeAuthorizationHandler` are dead code from production callers — `[Authorize(Policy = RequireDeployment)]` does NOT enforce site scoping |
| SiteCallAudit-001 | [SiteCallAudit](SiteCallAudit/findings.md) | SupervisorStrategy override is dead code; XML claims Resume that is not enforced |
| SiteCallAudit-003 | [SiteCallAudit](SiteCallAudit/findings.md) | `OnUpsertAsync` does not refresh `IngestedAtUtc`; direct-write callers must remember to stamp it |
| SiteEventLogging-015 | [SiteEventLogging](SiteEventLogging/findings.md) | Background write queue is unbounded; can grow without limit under sustained writer slowness |
| SiteEventLogging-017 | [SiteEventLogging](SiteEventLogging/findings.md) | Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale |
| SiteRuntime-020 | [SiteRuntime](SiteRuntime/findings.md) | Second `DeployInstanceCommand` arriving during a pending redeploy races the still-terminating actor on its name |
| SiteRuntime-021 | [SiteRuntime](SiteRuntime/findings.md) | `HandleDeployArtifacts` updates `DataConnections` in SQLite but never sends `CreateConnectionCommand` to the DCL |
| SiteRuntime-022 | [SiteRuntime](SiteRuntime/findings.md) | `AuditingDbCommand.DbConnection.set` uses reflection to read `AuditingDbConnection._inner` |
| SiteRuntime-024 | [SiteRuntime](SiteRuntime/findings.md) | `OperationTrackingStore` serialises all writes through one connection + `SemaphoreSlim`, and `Dispose()` does sync-over-async |
| StoreAndForward-019 | [StoreAndForward](StoreAndForward/findings.md) | Notifications park after `DefaultMaxRetries` exhaustion, contradicting "retried until central acks" |
| StoreAndForward-020 | [StoreAndForward](StoreAndForward/findings.md) | `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load |
| StoreAndForward-021 | [StoreAndForward](StoreAndForward/findings.md) | Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime |
| TemplateEngine-018 | [TemplateEngine](TemplateEngine/findings.md) | `DiffService` reports no entries for added/removed/changed connections |
| TemplateEngine-019 | [TemplateEngine](TemplateEngine/findings.md) | `TemplateResolver.BuildInheritanceChain` still uses the `0`-as-no-parent sentinel that was removed from `CycleDetector` |
| TemplateEngine-020 | [TemplateEngine](TemplateEngine/findings.md) | `Create*` audit entries are written with `EntityId = "0"` before `SaveChangesAsync` populates the real key |
| TemplateEngine-021 | [TemplateEngine](TemplateEngine/findings.md) | `MoveTemplateAsync` skips folder cycle and sibling-name-collision validation |
| Transport-004 | [Transport](Transport/findings.md) | `MaxUnlockAttemptsPerIpPerHour` option is declared but never enforced |
| Transport-005 | [Transport](Transport/findings.md) | Manifest fields outside `ContentHash` are not bound to the encrypted payload |
| Transport-006 | [Transport](Transport/findings.md) | Bundle ZIP read has no per-entry size cap or entry-count cap (zip-bomb / decompression-bomb) |
| Transport-007 | [Transport](Transport/findings.md) | Failed import sessions retain decrypted plaintext for the full 30-minute TTL |
| Transport-010 | [Transport](Transport/findings.md) | Critical Overwrite + cross-cutting paths uncovered by tests |
### Low (0)
### Low (92)
_None open._
| ID | Module | Title |
|----|--------|-------|
| AuditLog-002 | [AuditLog](AuditLog/findings.md) | `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider |
| AuditLog-003 | [AuditLog](AuditLog/findings.md) | `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously |
| AuditLog-006 | [AuditLog](AuditLog/findings.md) | `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock |
| AuditLog-007 | [AuditLog](AuditLog/findings.md) | `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations |
| AuditLog-008 | [AuditLog](AuditLog/findings.md) | Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain |
| AuditLog-009 | [AuditLog](AuditLog/findings.md) | `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't |
| AuditLog-010 | [AuditLog](AuditLog/findings.md) | Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream |
| AuditLog-011 | [AuditLog](AuditLog/findings.md) | `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call |
| CLI-020 | [CLI](CLI/findings.md) | `bundle export` success-envelope parse is unguarded |
| CLI-021 | [CLI](CLI/findings.md) | `CliConfig.Load` crashes the CLI on a malformed config file |
| CLI-022 | [CLI](CLI/findings.md) | `CommandTreeTests` excludes the two new command groups |
| CLI-023 | [CLI](CLI/findings.md) | `Component-CLI.md` claims audit commands ride `POST /management`; implementation uses REST endpoints |
| CentralUI-029 | [CentralUI](CentralUI/findings.md) | `ConfigurationAuditLog` uses `JS.InvokeAsync<int>("eval", ...)` instead of a dedicated JS module |
| CentralUI-030 | [CentralUI](CentralUI/findings.md) | `SandboxConsoleCapture`'s per-call `StringWriter` is not thread-safe under intra-script concurrency |
| CentralUI-031 | [CentralUI](CentralUI/findings.md) | `TransportImport` buffers the full bundle bytes in component state |
| CentralUI-032 | [CentralUI](CentralUI/findings.md) | `AuditResultsGrid` paging is forward-only, no Previous button |
| CentralUI-033 | [CentralUI](CentralUI/findings.md) | Drill-in / query-string code paths for the new Transport + SiteCalls pages are untested |
| ClusterInfrastructure-011 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `SectionName` constant is decorative — no binding site references it |
| ClusterInfrastructure-012 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Validator accepts `SeedNodes.Count == 1` despite design requiring both nodes as seeds |
| ClusterInfrastructure-013 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Test uses catastrophic config values without an inline-intent comment |
| ClusterInfrastructure-014 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `AddClusterInfrastructureActors` is dead surface — no caller, no behaviour |
| Commons-016 | [Commons](Commons/findings.md) | `BundleSession.Locked` uses a magic `3` rather than a named constant |
| Commons-018 | [Commons](Commons/findings.md) | `IOperationTrackingStore` and `IPartitionMaintenance` are at the root of `Interfaces/` instead of `Interfaces/Services/` |
| Commons-020 | [Commons](Commons/findings.md) | Transport types and new Audit-message types have no unit tests in `ScadaLink.Commons.Tests` |
| Commons-021 | [Commons](Commons/findings.md) | `ExternalCallResult.Response` has a benign lazy-parse race |
| Commons-022 | [Commons](Commons/findings.md) | `IAuditCorrelationContext` references an unresolvable `BundleImporter.ApplyAsync` cref; JSON-blob columns have no documented shape |
| Commons-023 | [Commons](Commons/findings.md) | Trailing-optional `SourceNode` on positional records mixes additive evolution patterns |
| Communication-018 | [Communication](Communication/findings.md) | Site heartbeats hard-code `IsActive: true` regardless of node role |
| Communication-019 | [Communication](Communication/findings.md) | `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository |
| Communication-020 | [Communication](Communication/findings.md) | `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types |
| Communication-021 | [Communication](Communication/findings.md) | `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try |
| Communication-022 | [Communication](Communication/findings.md) | `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber |
| ConfigurationDatabase-020 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `GetPartitionBoundariesOlderThanAsync` returns `DateTime` with `Kind=Unspecified` |
| ConfigurationDatabase-021 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `SwitchOutPartitionAsync` interpolates `monthBoundary` / staging table name into raw SQL |
| ConfigurationDatabase-022 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Stale "WP-24 Stub level sufficient for diff/staleness support" XML comment on `DeploymentManagerRepository` |
| ConfigurationDatabase-023 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `AuditLog` correlation-index name drifts from design doc (`IX_AuditLog_CorrelationId` vs `IX_AuditLog_Correlation`) |
| ConfigurationDatabase-024 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Missing test coverage for SPLIT-RANGE failure-continuation and production-shape rowversion delete |
| DeploymentManager-020 | [DeploymentManager](DeploymentManager/findings.md) | `DeployReconciled` audit attributes the action to the prior deployer, not the current user |
| DeploymentManager-021 | [DeploymentManager](DeploymentManager/findings.md) | `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing |
| DeploymentManager-022 | [DeploymentManager](DeploymentManager/findings.md) | `Pending` and `InProgress` are written back-to-back with no intervening work |
| DeploymentManager-023 | [DeploymentManager](DeploymentManager/findings.md) | `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site |
| DeploymentManager-024 | [DeploymentManager](DeploymentManager/findings.md) | Test probe actors hold mutable static state across tests |
| ExternalSystemGateway-021 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ApplyAuth` silently sends an unauthenticated request on unknown `AuthType`, empty `AuthConfiguration`, or malformed Basic config |
| ExternalSystemGateway-022 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `new HttpMethod(method.HttpMethod)` accepts any string at runtime; an invalid HTTP verb fails only at call time |
| ExternalSystemGateway-023 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | PATCH HTTP method is supported by code but absent from the design doc; body-vs-query decision drifts from the documented set |
| HealthMonitoring-018 | [HealthMonitoring](HealthMonitoring/findings.md) | Same counter-reset-before-publish hazard in `CentralHealthReportLoop` |
| HealthMonitoring-020 | [HealthMonitoring](HealthMonitoring/findings.md) | `MarkHeartbeat` brings offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt` |
| HealthMonitoring-021 | [HealthMonitoring](HealthMonitoring/findings.md) | `CentralSiteId = "central"` reserved constant silently collides with a real site named "central" |
| HealthMonitoring-022 | [HealthMonitoring](HealthMonitoring/findings.md) | `CentralHealthReportLoopTests` uses real-time `PeriodicTimer` + `Task.Delay`; flake-prone on slow CI |
| HealthMonitoring-023 | [HealthMonitoring](HealthMonitoring/findings.md) | `StoreAndForwardBufferDepths_IsEmptyPlaceholder` test name is stale; it now covers the default-state contract, not a placeholder |
| Host-018 | [Host](Host/findings.md) | Shipped per-role configs omit `NodeOptions.NodeName`, leaving `SourceNode` null |
| Host-019 | [Host](Host/findings.md) | Migration `StartupRetry` call drops the host `CancellationToken` |
| Host-020 | [Host](Host/findings.md) | `MinimumLevel.Is` silently overrides any operator-set `Serilog:MinimumLevel` |
| Host-021 | [Host](Host/findings.md) | Microsoft `Logging:LogLevel` section in `appsettings.json` is dead config under Serilog |
| Host-022 | [Host](Host/findings.md) | `ParseLevel` silently coerces unrecognised `MinimumLevel` to `Information` |
| InboundAPI-019 | [InboundAPI](InboundAPI/findings.md) | `EnableBuffering()` called unconditionally on every request, including bodyless requests |
| InboundAPI-020 | [InboundAPI](InboundAPI/findings.md) | `ContentType.Contains("json")` is case-sensitive; `application/JSON` with no Content-Length skips body parsing |
| InboundAPI-023 | [InboundAPI](InboundAPI/findings.md) | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test coverage |
| InboundAPI-024 | [InboundAPI](InboundAPI/findings.md) | `_knownBadMethods` is unbounded — an attacker can grow the cache by spamming distinct method names against the audit middleware path |
| ManagementService-022 | [ManagementService](ManagementService/findings.md) | Design doc is stale on Transport bundle commands, /api/audit/* endpoints, and CommandTimeout |
| ManagementService-023 | [ManagementService](ManagementService/findings.md) | HandleQueryDeployments unfiltered branch is N+1 on instance lookup |
| NotificationOutbox-006 | [NotificationOutbox](NotificationOutbox/findings.md) | `ResolveAdapters` rebuilds the `NotificationType → adapter` dictionary on every dispatch sweep |
| NotificationOutbox-008 | [NotificationOutbox](NotificationOutbox/findings.md) | `FallbackMaxRetries` / `FallbackRetryDelay` path is unreachable in production AND untested |
| NotificationOutbox-009 | [NotificationOutbox](NotificationOutbox/findings.md) | `StuckAgeThreshold` XML-doc says "in-progress notification is re-claimed" — contradicts the design's display-only stuck detection |
| NotificationService-022 | [NotificationService](NotificationService/findings.md) | `MailKitSmtpClientWrapper` holds a long-lived `SmtpClient`; combined with per-send factory, the design comment about pooling is contradicted |
| NotificationService-023 | [NotificationService](NotificationService/findings.md) | XML docs on the orphaned classes still describe the removed site-delivery flow; misleading to maintainers |
| NotificationService-025 | [NotificationService](NotificationService/findings.md) | `CredentialRedactor` over-masks: any 4-character credential component is masked anywhere it appears, including unrelated log text |
| Security-018 | [Security](Security/findings.md) | Role names are hard-coded magic strings duplicated across `RoleMapper`, `SiteScopeAuthorizationHandler`, and `AuthorizationPolicies` |
| Security-019 | [Security](Security/findings.md) | Service-account rebind failure is reported as "Invalid username or password" — masks misconfiguration as a user-credential error |
| Security-020 | [Security](Security/findings.md) | `SecurityOptions` has no startup validation for required fields (`LdapServer`, `LdapSearchBase`) |
| Security-021 | [Security](Security/findings.md) | `RequireHttpsCookie=false` dev opt-out has no warning path — an HTTP production deployment silently transmits the JWT bearer credential in cleartext |
| SiteCallAudit-002 | [SiteCallAudit](SiteCallAudit/findings.md) | Singleton failover does not wait for in-flight async upserts |
| SiteCallAudit-004 | [SiteCallAudit](SiteCallAudit/findings.md) | Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift |
| SiteCallAudit-005 | [SiteCallAudit](SiteCallAudit/findings.md) | `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing |
| SiteCallAudit-006 | [SiteCallAudit](SiteCallAudit/findings.md) | Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor |
| SiteEventLogging-018 | [SiteEventLogging](SiteEventLogging/findings.md) | `FailedWriteCount` is exposed but never consumed by Health Monitoring |
| SiteEventLogging-019 | [SiteEventLogging](SiteEventLogging/findings.md) | `EventLogPurgeService` runs on every host node; design says "active node" |
| SiteEventLogging-020 | [SiteEventLogging](SiteEventLogging/findings.md) | `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced |
| SiteEventLogging-021 | [SiteEventLogging](SiteEventLogging/findings.md) | `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales |
| SiteEventLogging-022 | [SiteEventLogging](SiteEventLogging/findings.md) | `Cache=Shared` is redundant for a single-connection logger |
| SiteEventLogging-023 | [SiteEventLogging](SiteEventLogging/findings.md) | Concurrent-stress test uses a non-volatile `stop` flag |
| SiteRuntime-023 | [SiteRuntime](SiteRuntime/findings.md) | `Convert.ToDouble(value)` in trigger and alarm evaluation is locale-sensitive |
| SiteRuntime-025 | [SiteRuntime](SiteRuntime/findings.md) | `HandleSetStaticAttribute` persists unknown attribute names as static overrides |
| SiteRuntime-026 | [SiteRuntime](SiteRuntime/findings.md) | `ReplicationMessages.cs` public record types have no XML documentation |
| StoreAndForward-022 | [StoreAndForward](StoreAndForward/findings.md) | `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId` |
| StoreAndForward-023 | [StoreAndForward](StoreAndForward/findings.md) | `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation |
| StoreAndForward-024 | [StoreAndForward](StoreAndForward/findings.md) | `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown |
| TemplateEngine-022 | [TemplateEngine](TemplateEngine/findings.md) | `LockEnforcer.ValidateLockChange` enforces "once-locked-stays-locked" for `IsLocked` but not for `LockedInDerived` |
| Transport-008 | [Transport](Transport/findings.md) | `PreviewAsync` issues an N+1 `GetTemplateWithChildrenAsync` per matching template name |
| Transport-009 | [Transport](Transport/findings.md) | `IAuditCorrelationContext.BundleImportId` is mutated on the same scoped instance the AuditService reads |
| Transport-011 | [Transport](Transport/findings.md) | Design doc's Step-1 manifest preview promises decryption-free preview, but `LoadAsync` reads and validates content before passphrase |
| Transport-012 | [Transport](Transport/findings.md) | "Bundle Import" filter promised in design doc not surfaced in Configuration Audit Log Viewer UI |
+271 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Security` |
| Design doc | `docs/requirements/Component-Security.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 (1 deferred — Security-008) |
| Commit reviewed | `1eb6e97` |
| Open findings | 6 (1 deferred — Security-008) |
## Summary
@@ -48,6 +48,36 @@ omits the separate idle check (Security-014). The two Low findings concern fragi
DN parsing of group names containing escaped commas and an un-trimmed username flowing
into the LDAP filter, fallback DN, and JWT claims.
#### Re-review 2026-05-28 (commit `1eb6e97`)
Re-reviewed the module on a fresh baseline. All Security-001..007, 009..015 fixes remain
in place; the only Open carry-over is Security-008 (still correctly **Deferred**:
`ISecurityRepository` still exposes no per-set scope-rule query, so the N+1 in
`RoleMapper` cannot be removed from within this module). The original
Security-014 fix is now load-bearing: `RefreshToken` calls `IsIdleTimedOut` before
re-issuing, and the new cookie sliding-expiry tests in `SecurityReviewRegressionTests`
pin CentralUI-005's Security-side contract. This pass surfaced **6 new findings**
(Security-016..021): one Medium correctness/security defect, one Medium design-adherence
defect, and four Low. The most consequential is **Security-016** — when a user is
mapped to *both* a system-wide Deployment LDAP group (e.g. `SCADA-Deploy-All`) and a
site-scoped Deployment LDAP group (e.g. `SCADA-Deploy-SiteA`), `RoleMapper` silently
treats the union as site-scoped (the system-wide grant is dropped); the design's
"multiple groups grant multiple independent roles" intent is not honoured for this
mix-and-match case. **Security-017** is the cross-module partner of CentralUI-028:
`SiteScopeRequirement` / `SiteScopeAuthorizationHandler` are declared and registered
but no production caller ever instantiates them, so `[Authorize(Policy = RequireDeployment)]`
*does not* enforce the documented site scoping; callers must remember to inject
`SiteScopeService` and re-check `IsSiteAllowedAsync` themselves (which the two new
report pages flagged by CentralUI-028 forgot to do). The remaining Lows are: role names
are magic strings duplicated across `RoleMapper`, `SiteScopeAuthorizationHandler`, and
`AuthorizationPolicies` (Security-018); a service-account-rebind failure is reported
to the user as "Invalid username or password" — masking a misconfiguration as a
user-credential error (Security-019); required `SecurityOptions` fields
(`LdapServer`, `LdapSearchBase`) have no `IValidateOptions` startup check, so empty
values silently surface only on first login (Security-020); and the
`RequireHttpsCookie=false` dev opt-out emits no warning, so an HTTP production
deployment silently transmits the JWT bearer credential in cleartext (Security-021).
## Checklist coverage
| # | Category | Examined | Notes |
@@ -63,6 +93,21 @@ into the LDAP filter, fallback DN, and JWT claims.
| 9 | Testing coverage | ☑ | No tests for `RoleMapper` N+1 behavior, DN-injection inputs, StartTLS path, or idle-timeout-after-refresh. Insecure-config combinations under-tested (Security-011). |
| 10 | Documentation & comments | ☑ | `SecurityOptions` XML docs say direct bind uses `cn={username}` while the search filter uses `uid=` — comment is misleading (covered under Security-004). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `RoleMapper` drops a system-wide Deployment grant when the user is also in any site-scoped Deployment group (Security-016); hard-coded role-name string `"Deployment"` in two separate places allows a refactor to silently break site scoping (Security-018). |
| 2 | Akka.NET conventions | ☑ | No actors. `AddSecurityActors` is still a registration placeholder. No issues. |
| 3 | Concurrency & thread safety | ☑ | Services stateless; LDAP sync calls wrapped in `Task.Run` with the now-bounded timeout (Security-009 resolution holds). No issues found. |
| 4 | Error handling & resilience | ☑ | A service-account-rebind failure inside `AuthenticateAsync` is reported as "Invalid username or password", masking a misconfiguration as a user-credential error (Security-019). LDAP-failure rule + partial-outage path remain correctly enforced post-Security-012. |
| 5 | Security | ☑ | `SiteScopeRequirement` / `SiteScopeAuthorizationHandler` are dead code — no policy is registered that uses them and no production caller instantiates them, so declarative `[Authorize]` does not enforce site scoping (Security-017, cross-module partner of CentralUI-028). `RequireHttpsCookie=false` dev opt-out has no warning path — a production misconfiguration silently transmits the JWT bearer credential over HTTP (Security-021). |
| 6 | Performance & resource management | ☑ | Security-008 N+1 remains correctly Deferred (still gated on `ISecurityRepository`). No new perf issues. |
| 7 | Design-document adherence | ☑ | `RoleMapper`'s drop-system-wide-on-any-scoped behaviour (Security-016) contradicts the design's "A user can hold multiple roles simultaneously … roles are independent — there is no implied hierarchy" rule for the union case; `SiteScopeRequirement` advertises a site-scope authorization pattern the implementation does not actually wire up (Security-017). |
| 8 | Code organization & conventions | ☑ | Role-name strings are duplicated as magic literals across `RoleMapper.cs`, `SiteScopeAuthorizationHandler.cs`, and `AuthorizationPolicies.cs` — only the audit roles have a single source of truth via `OperationalAuditRoles` / `AuditExportRoles` (Security-018). `SecurityOptions` defaults pass through to runtime with no `IValidateOptions` for required fields like `LdapServer` / `LdapSearchBase` (Security-020). |
| 9 | Testing coverage | ☑ | No test covers a user mapped to both a system-wide AND a site-scoped Deployment LDAP group (the Security-016 case). No test covers the `SiteScopeRequirement` cross-page integration — tests evaluate the handler in isolation, not the absence of a policy that uses it (Security-017). |
| 10 | Documentation & comments | ☑ | `SiteScopeAuthorizationHandler` XML doc describes a permission model no caller actually invokes (Security-017). Otherwise stable. |
## Findings
### Security-001 — StartTLS upgrade path is unreachable dead code
@@ -654,3 +699,226 @@ use the single canonical identity. Regression tests
`NormalizeUsername_TrimsLeadingAndTrailingWhitespace`,
`BuildFallbackUserDn_TrimmedUsername_NoLeadingTrailingSpace`,
`AuthenticateAsync_UsernameWithSurroundingWhitespace_StillRejectedForInsecure`.
### Security-016 — `RoleMapper` silently drops the system-wide Deployment grant when a user is also in any site-scoped Deployment group
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Security/RoleMapper.cs:30-31`, `:41-55`, `:59` |
**Description**
`MapGroupsToRolesAsync` resolves the Deployment role's site scope as a single
`isSystemWide = hasDeploymentRole && !hasDeploymentWithScopeRules` flag computed across
ALL matched Deployment mappings. If a user is a member of both a system-wide Deployment
group (e.g. `SCADA-Deploy-All`, no scope rules) AND a site-scoped Deployment group
(e.g. `SCADA-Deploy-SiteA`, one scope rule for Site A), the second mapping sets
`hasDeploymentWithScopeRules = true`, so the final `isSystemWide` becomes `false` and
the returned `PermittedSiteIds` is just `[SiteA]`. The system-wide grant from
`SCADA-Deploy-All` is silently dropped — the user loses access to every other site, even
though one of their LDAP groups was intended to grant them system-wide reach. This
contradicts the design's "A user can hold multiple roles simultaneously … roles are
independent — there is no implied hierarchy" intent: the union of grants should be the
broadest grant in the set, not the narrowest. The mistake is also non-obvious to an
operator: from the Admin → LDAP Mappings page nothing flags that adding a site-scoped
Deployment mapping for a user already in `SCADA-Deploy-All` *removes* sites from their
effective grant. The downstream `SiteScopeService.IsSystemWideAsync` / `FilterSitesAsync`
faithfully reproduce this narrowing, so the user can no longer see or act on sites
outside `[SiteA]`.
**Recommendation**
Track the union semantics explicitly: if any matched Deployment mapping has no scope
rules, the user is system-wide regardless of what the other mappings have. The simplest
change is to add a flag `hasUnscopedDeploymentMapping`, set whenever a matched Deployment
mapping has no scope rules, and compute `isSystemWide = hasUnscopedDeploymentMapping`
(the existing `hasDeploymentRole && !hasDeploymentWithScopeRules` term is then implied).
Equivalently: collect per-mapping `(hasRules, scopedSiteIds)` first, then
`isSystemWide = any mapping has hasRules==false`, and `permittedSiteIds = union of all
scopedSiteIds` (left empty for system-wide users). Add a regression test
`MapGroupsToRoles_UserInBothSystemWideAndScopedDeploymentGroup_IsSystemWide` covering
the design's example pair `SCADA-Deploy-All` + `SCADA-Deploy-SiteA`.
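
The union rule recommended above can be sketched as a small pure function. This is a minimal illustration, not the actual `RoleMapper` code: the `DeploymentMapping` and `DeploymentScope` records, the `DeploymentScopeResolver` class, and the use of `int` for site ids are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape: one entry per matched Deployment LDAP mapping.
// ScopedSiteIds is empty when the mapping carries no scope rules.
public sealed record DeploymentMapping(string LdapGroup, IReadOnlyCollection<int> ScopedSiteIds);

// Hypothetical result type: system-wide flag plus the explicit site list.
public sealed record DeploymentScope(bool IsSystemWide, IReadOnlyCollection<int> PermittedSiteIds);

public static class DeploymentScopeResolver
{
    public static DeploymentScope Resolve(IEnumerable<DeploymentMapping> matched)
    {
        var mappings = matched.ToList();

        // Any mapping without scope rules grants system-wide reach,
        // regardless of what the other mappings say.
        bool isSystemWide = mappings.Any(m => m.ScopedSiteIds.Count == 0);

        // System-wide users carry no explicit site list; otherwise take the
        // union of every scoped mapping's sites (sorted for stable output).
        IReadOnlyCollection<int> permitted = isSystemWide
            ? Array.Empty<int>()
            : mappings.SelectMany(m => m.ScopedSiteIds).Distinct().OrderBy(id => id).ToArray();

        return new DeploymentScope(isSystemWide, permitted);
    }
}
```

With no matched Deployment mappings at all, the sketch returns a non-system-wide scope with an empty site list, which corresponds to the user not holding the Deployment role.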
### Security-017 — `SiteScopeRequirement` / `SiteScopeAuthorizationHandler` are dead code from production callers — `[Authorize(Policy = RequireDeployment)]` does NOT enforce site scoping
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Security/SiteScopeAuthorizationHandler.cs:8-58`; `src/ScadaLink.Security/AuthorizationPolicies.cs:113-143` |
**Description**
The module declares `SiteScopeRequirement` (an `IAuthorizationRequirement` carrying a
`TargetSiteId`) and the matching `SiteScopeAuthorizationHandler` that combines the
Deployment role claim with the `SiteId` claims to enforce the design's site-scoping
rule. The handler is registered in `AddScadaLinkAuthorization`
(`services.AddSingleton<IAuthorizationHandler, SiteScopeAuthorizationHandler>()`). But
no `AddPolicy` call ever wires the requirement to a named policy, and a grep across
`src/ScadaLink.CentralUI` and `src/ScadaLink.ManagementService` confirms that **no
production code ever instantiates `new SiteScopeRequirement(...)` or calls
`AuthorizeAsync(...)` with one** — the only callers are the unit tests in
`SecurityTests.cs:1146,1166,1185,1203`. The design doc and CLAUDE.md state that "Deployment
and Monitoring pages must filter every site/instance list through `FilterSitesAsync`
and re-check `IsSiteAllowedAsync` before any cross-site command", and the
CentralUI-028 finding (High, Open) confirms this is exactly the contract two new
report pages forgot. Because there is no declarative `[Authorize(Policy = ...)]`
shortcut, callers must remember to inject `SiteScopeService` and write the check by
hand; any new page that forgets is a silent regression with no compile-time or
test-time signal. The module's published surface advertises an authorization-handler
pattern that is, in practice, unwired plumbing.
**Recommendation**
Either (a) **delete** `SiteScopeRequirement` and `SiteScopeAuthorizationHandler` (and
the dead `IAuthorizationHandler` registration) and document `SiteScopeService` as the
sole site-scoping mechanism — this is the smaller change and matches what the codebase
actually does today; or, preferably, (b) **finish the wiring**: add a `RequireSiteScope`
policy that uses `SiteScopeRequirement` and provide a small helper / source generator
or analyzer that flags Deployment-policy-attributed pages without a site-scope check.
Either way, address the cross-module gap: CentralUI-028 stays open until production
pages reliably enforce the rule. If (b) is chosen, a route-parameter-aware
`IAuthorizationPolicyProvider` is needed so the policy can read the target site id from
the request — that is a meaningful design extension and would need to be planned
alongside the Central UI's existing `SiteScopeService` usage rather than replacing it
piecemeal.
### Security-018 — Role names are hard-coded magic strings duplicated across `RoleMapper`, `SiteScopeAuthorizationHandler`, and `AuthorizationPolicies`
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.Security/RoleMapper.cs:41`; `src/ScadaLink.Security/SiteScopeAuthorizationHandler.cs:36`; `src/ScadaLink.Security/AuthorizationPolicies.cs:118,121,124,95,107` |
**Description**
The role-name literals `"Admin"`, `"Design"`, `"Deployment"`, `"Audit"`, and
`"AuditReadOnly"` are duplicated as magic strings across three separate files:
`RoleMapper.cs:41` hard-codes `"Deployment"` to detect the site-scope branch;
`SiteScopeAuthorizationHandler.cs:36` independently hard-codes `"Deployment"` to gate
the handler; and `AuthorizationPolicies.cs:118,121,124` hard-code the role names
as the policy `RequireClaim` values. Only the audit roles have a single source of truth
(via the `OperationalAuditRoles` / `AuditExportRoles` arrays on
`AuthorizationPolicies`). A future rename or addition of a role that misses any one of
these call sites silently breaks the system: e.g. renaming "Deployment" → "Deployer"
in `RoleMapper` alone would leave the policy still requiring `"Deployment"` (logins
get the new role name but the policy never matches), while changing it in the policy
alone would leave `RoleMapper` failing to populate scope rules for the renamed role.
The bug class is "string drift" — exactly the kind the `OperationalAuditRoles` constant
was introduced to prevent.
**Recommendation**
Introduce a `public static class Roles { public const string Admin = "Admin"; public const
string Design = "Design"; public const string Deployment = "Deployment"; public const string
Audit = "Audit"; public const string AuditReadOnly = "AuditReadOnly"; }` in the Security
project and replace every magic-string occurrence — including the elements of
`OperationalAuditRoles` and `AuditExportRoles` — with the constants. A single rename
will then either succeed everywhere or fail to compile.
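
Laid out as a block, the proposed constants might look like the sketch below. The `Roles` class mirrors the recommendation; the `AuditRoleSets` holder and the membership of the two arrays are purely illustrative assumptions (the real arrays live on `AuthorizationPolicies`, and their contents are not shown in this review).

```csharp
using System;

// Single source of truth for role names, as proposed in the recommendation.
public static class Roles
{
    public const string Admin = "Admin";
    public const string Design = "Design";
    public const string Deployment = "Deployment";
    public const string Audit = "Audit";
    public const string AuditReadOnly = "AuditReadOnly";
}

// Hypothetical holder; array membership here is an assumption for illustration only.
public static class AuditRoleSets
{
    public static readonly string[] OperationalAuditRoles = { Roles.Audit, Roles.AuditReadOnly };
    public static readonly string[] AuditExportRoles = { Roles.Audit };
}
```

Because the array elements reference the constants, renaming `Roles.Audit` through a refactoring tool updates every consumer, and deleting it breaks the build rather than failing at runtime.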
### Security-019 — Service-account rebind failure is reported as "Invalid username or password" — masks misconfiguration as a user-credential error
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Security/LdapAuthService.cs:85-89`, `:147-151` |
**Description**
After the user's credentials bind successfully, `AuthenticateAsync` re-binds as the
configured service account to perform the group/attribute search
(`connection.Bind(_options.LdapServiceAccountDn, _options.LdapServiceAccountPassword)`).
A failure of this second bind (wrong service-account password, deleted or disabled
service account, locked-out service account) throws `LdapException`, which is caught by
the broad outer `catch (LdapException)` and returned as
`new LdapAuthResult(false, null, username, null, "Invalid username or password.")`.
The user sees an "invalid credentials" message for *their* credentials even though
their bind succeeded and the failure was in the system's own service-account
configuration. Worse, every user attempting to log in sees the same incorrect message
during a service-account outage, which routes operators down the wrong incident path
(reset the user's password) instead of the right one (check the service-account
credentials). The successful user bind itself is also not auditable as a discrete
event because the result is "Invalid username or password" — indistinguishable from a
genuine bad-password attempt.
**Recommendation**
Wrap the service-account rebind in its own `try`/`catch (LdapException)` and surface a
distinct error: log `_logger.LogError(ex, "Service-account rebind failed; check
LdapServiceAccountDn / LdapServiceAccountPassword configuration")` and return
`new LdapAuthResult(false, null, username, null, "Authentication service is misconfigured. Contact an administrator.")`.
Add a regression test that exercises the service-account-bind failure path (a mocked
or seamed `LdapConnection.Bind` that throws on the second call) and asserts the
distinct error message.
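
A minimal sketch of the split error handling follows, assuming a seam over the connection so the second bind can be faked. Everything here is hypothetical scaffolding: `IDirectoryConnection`, `BindFailedException` (a stand-in for the library's `LdapException`), `AuthResult`, `TwoStageBind`, and the `FailOnNthBind` test fake are not names from the module.

```csharp
using System;

// Stand-in for the LDAP library's bind failure (LdapException in the finding).
public sealed class BindFailedException : Exception
{
    public BindFailedException(string message) : base(message) { }
}

// Hypothetical seam over the connection so the second bind can be faked in tests.
public interface IDirectoryConnection
{
    void Bind(string dn, string password);
}

public sealed record AuthResult(bool Success, string Error);

public static class TwoStageBind
{
    public const string BadCredentials = "Invalid username or password.";
    public const string Misconfigured =
        "Authentication service is misconfigured. Contact an administrator.";

    public static AuthResult Authenticate(
        IDirectoryConnection connection,
        string userDn, string userPassword,
        string serviceDn, string servicePassword,
        Action<Exception, string> logError)
    {
        // Stage 1: the user's own bind. A failure here really is a credential error.
        try { connection.Bind(userDn, userPassword); }
        catch (BindFailedException) { return new AuthResult(false, BadCredentials); }

        // Stage 2: the service-account rebind gets its own catch and its own message.
        try { connection.Bind(serviceDn, servicePassword); }
        catch (BindFailedException ex)
        {
            logError(ex, "Service-account rebind failed; check LdapServiceAccountDn / LdapServiceAccountPassword configuration");
            return new AuthResult(false, Misconfigured);
        }

        return new AuthResult(true, string.Empty);
    }
}

// Test fake: throws on the Nth call to Bind (N = 2 simulates a broken service account).
public sealed class FailOnNthBind : IDirectoryConnection
{
    private readonly int _failOn;
    private int _calls;
    public FailOnNthBind(int failOn) => _failOn = failOn;

    public void Bind(string dn, string password)
    {
        if (++_calls == _failOn) throw new BindFailedException("bind rejected");
    }
}
```

The regression test described above would construct `FailOnNthBind(2)` and assert that the result carries the misconfiguration message rather than the bad-credentials one.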
### Security-020 — `SecurityOptions` has no startup validation for required fields (`LdapServer`, `LdapSearchBase`)
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Security/SecurityOptions.cs:6-7`, `:36-37`; `src/ScadaLink.Security/ServiceCollectionExtensions.cs:13-30` |
**Description**
`SecurityOptions.JwtSigningKey` correctly fails fast at `JwtTokenService` construction
(Security-003 fix), but the LDAP-side required fields — `LdapServer` (default
`string.Empty`) and `LdapSearchBase` (default `string.Empty`) — have no equivalent
guard. `AddSecurity` does not register an `IValidateOptions<SecurityOptions>`. A
deployment that fails to set `LdapServer` (a typo in the appsettings.json section name,
a missing environment-variable substitution, a misconfigured Docker Compose file)
starts cleanly — the Central UI comes up, the login page loads, and only the first
authentication attempt fails with `LdapConnection.Connect("")` throwing a low-level
exception that bubbles up as the generic "An unexpected error occurred during
authentication." message. The misconfiguration surfaces minutes or hours into the
deploy, on the first real user login, rather than at startup where it is cheap to
diagnose.
**Recommendation**
Register an `IValidateOptions<SecurityOptions>` implementation and call
`services.AddOptions<SecurityOptions>().ValidateOnStart()` so that startup fails when
`LdapServer` is null/whitespace, `LdapSearchBase` is null/whitespace, or
`LdapPort <= 0`. Combine with the existing `JwtTokenService` constructor check so
every required `SecurityOptions` field is enforced at startup, not at first use.
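
The three rules can be sketched as a dependency-free check. This is an illustration under stated assumptions: `SecurityOptionsSketch` is a stand-in carrying only the fields named in this finding, `SecurityOptionsRules` is a hypothetical helper, and the `LdapPort` default of 389 is assumed rather than taken from the source.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the real SecurityOptions; only the fields named in the finding.
public sealed class SecurityOptionsSketch
{
    public string LdapServer { get; set; } = string.Empty;
    public string LdapSearchBase { get; set; } = string.Empty;
    public int LdapPort { get; set; } = 389; // assumed default, not taken from the source
}

public static class SecurityOptionsRules
{
    // Returns one message per violated rule; an empty list means the options are valid.
    public static IReadOnlyList<string> Validate(SecurityOptionsSketch options)
    {
        var failures = new List<string>();

        if (string.IsNullOrWhiteSpace(options.LdapServer))
            failures.Add("SecurityOptions:LdapServer is required.");

        if (string.IsNullOrWhiteSpace(options.LdapSearchBase))
            failures.Add("SecurityOptions:LdapSearchBase is required.");

        if (options.LdapPort <= 0)
            failures.Add("SecurityOptions:LdapPort must be greater than zero.");

        return failures;
    }
}
```

An `IValidateOptions<SecurityOptions>.Validate` implementation could delegate to a check like this and return `ValidateOptionsResult.Fail(failures)` when the list is non-empty, or `ValidateOptionsResult.Success` otherwise. Collecting every failure before returning reports all missing fields in a single startup error.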
### Security-021 — `RequireHttpsCookie=false` dev opt-out has no warning path — an HTTP production deployment silently transmits the JWT bearer credential in cleartext
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Security/SecurityOptions.cs:100-108`; `src/ScadaLink.Security/ServiceCollectionExtensions.cs:54-59` |
**Description**
The Security-002 fix added `RequireHttpsCookie` (default `true`) so the auth cookie's
`SecurePolicy` is `Always` in production. The current Docker dev cluster sets
`RequireHttpsCookie=false` in both central nodes' `appsettings.Central.json`, downgrading
to `SameAsRequest` so the local HTTP cluster works. The downgrade is documented in the
XML doc but is silent at runtime: no log line warns that the cookie carrying the JWT
bearer credential is being sent over a plain-HTTP path. A production deployment that
inherits a dev-derived appsettings — or that copy-pastes the docker config and forgets
to flip the flag — transmits the session token in cleartext with no diagnostic signal.
The default is correct; the gap is that the unsafe override has no operational guard.
**Recommendation**
In the `PostConfigure` block in `AddSecurity`, when `RequireHttpsCookie == false`, log
a single startup warning along the lines of `_logger.LogWarning("RequireHttpsCookie is
DISABLED — auth cookie SecurePolicy is SameAsRequest. The cookie-embedded JWT will be
transmitted over plain HTTP. This setting is intended for local dev only — set
SecurityOptions:RequireHttpsCookie=true in production.")`. Optionally, also fail
startup when `RequireHttpsCookie=false` AND `ASPNETCORE_ENVIRONMENT=Production`. Add a
regression test that asserts the warning is emitted when the flag is disabled and not
when it is enabled.
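
The warn-or-fail decision can be isolated into a pure function so the regression test does not need a host. This is a sketch only: `CookieGuardOutcome`, `HttpsCookieGuard`, and the `failInProduction` switch are hypothetical names, and comparing the environment name case-insensitively is an assumption of the example.

```csharp
using System;

public enum CookieGuardOutcome { Ok, Warn, FailStartup }

public static class HttpsCookieGuard
{
    // requireHttpsCookie: the SecurityOptions flag.
    // environmentName:    e.g. the value of ASPNETCORE_ENVIRONMENT (may be null or empty).
    // failInProduction:   opt-in for the stricter "refuse to start" behaviour.
    public static CookieGuardOutcome Evaluate(
        bool requireHttpsCookie, string environmentName, bool failInProduction)
    {
        // The safe default needs no diagnostic.
        if (requireHttpsCookie) return CookieGuardOutcome.Ok;

        bool isProduction = string.Equals(
            environmentName, "Production", StringComparison.OrdinalIgnoreCase);

        // Flag disabled in Production: refuse to start when the strict mode is on.
        if (isProduction && failInProduction) return CookieGuardOutcome.FailStartup;

        // Flag disabled anywhere else (or strict mode off): emit the startup warning.
        return CookieGuardOutcome.Warn;
    }
}
```

The startup code would then map `Warn` to a single `LogWarning` call and `FailStartup` to a thrown exception, keeping the policy itself trivially testable.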
+322
@@ -0,0 +1,322 @@
# Code Review — SiteCallAudit
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.SiteCallAudit` |
| Design doc | `docs/requirements/Component-SiteCallAudit.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 6 |
## Summary
The module is small (one actor + DI extension + options class). The actor is a
central cluster singleton that exposes three responsibility groups: direct
`UpsertSiteCallCommand` ingest, paginated/KPI read handlers, and the central→site
Retry/Discard relay. Ingest idempotency is delegated to the repository's
monotonic-upsert (the CD-015 check-then-act window is mitigated by the
duplicate-key swallow on the insert leg). Findings cluster around two themes:
(a) the `SupervisorStrategy` override is dead code that contradicts the XML
doc comment: it governs children, and this actor has none, so the documented
"Resume on leaked exception" promise is unenforced; (b) several smaller drifts
between the design doc and the code (reconciliation puller + daily purge
schedule are still deferred; `OnUpsertAsync` does not stamp `IngestedAtUtc`
unlike the dual-write path). The relay path is well covered by Akka TestKit
unit tests; the ingest + KPI paths are covered by MSSQL-backed integration
tests using a shared `MsSqlMigrationFixture`.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `OnUpsertAsync` does not refresh `IngestedAtUtc` (Finding 003). |
| 2 | Akka.NET conventions | Yes | `SupervisorStrategy()` override is dead code (Finding 001). `Sender` correctly captured before first await on every handler. `PipeTo` used for read replies. |
| 3 | Concurrency & thread safety | Yes | `_centralCommunication` mutated only on actor thread via `RegisterCentralCommunication`. DI scope-per-message disposed in `try/finally`. No issues found. |
| 4 | Error handling & resilience | Yes | Ingest catches all + replies `Accepted=false`. Relay distinguishes `SiteUnreachable` vs `OperationFailed`. Failover handover does not wait for in-flight async work (Finding 002). |
| 5 | Security | Yes | All SQL is parameterised at the repository (FromSqlInterpolated). Relay carries no user-controlled strings beyond `SourceSite`. No issues found. |
| 6 | Performance & resource management | Yes | DI scope-per-message correctly disposed. `MaxPageSize=200` clamp present. No issues found. |
| 7 | Design-document adherence | Yes | Reconciliation puller and daily terminal-purge scheduler still deferred; design doc reads as if they ship (Finding 004). |
| 8 | Code organization & conventions | Yes | `RegisterCentralCommunication` is a top-level record colocated with the actor — by design (carries `IActorRef`, cannot live in Commons). No issues found. |
| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
| 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). |
## Findings
### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:32-46`, `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:147-151` |
**Description**
The XML remarks block (lines 32-46) states:
> "The `SupervisorStrategy` uses `Resume` so an unexpected throw before the catch (defence in depth) does not restart the actor and reset in-flight state."
The override at lines 147-151 returns a `OneForOneStrategy` with `DefaultDecider`
and `maxNrOfRetries: 0`. Two problems compound:
1. `ActorBase.SupervisorStrategy()` governs the actor's **children**, not the
actor itself. `SiteCallAuditActor` creates no children, so this override is
dead code.
2. The returned strategy uses `DefaultDecider` (Restart for most exceptions),
**not** `Directive.Resume`. So even if the actor did have children, the
strategy would not be Resume — it would be the default Restart-on-most-faults
behaviour with `maxNrOfRetries: 0` (which forces a Stop after the first
failure).
Net effect: the actor's own self-supervision is whatever the parent supplies
(`SupervisorStrategy.DefaultDecider` from the singleton manager / user
guardian in tests), which Restarts on most exceptions. If the `try/catch` in
`OnUpsertAsync` ever leaked (e.g. a synchronous throw constructing `replyTo`),
the actor would Restart, reset `_centralCommunication` to null, and silently
break the relay until `RegisterCentralCommunication` runs again.
This same pattern (with the same misleading XML doc) exists in
`AuditLogIngestActor`, `AuditLogPurgeActor`, and `SiteAuditReconciliationActor`
— they were likely cargo-culted; this finding documents the local instance.
**Recommendation**
Either:
- Remove the `SupervisorStrategy()` override entirely (it does nothing useful)
and revise the XML comment to drop the "Resume" claim. Self-supervision is
the parent's concern (the cluster singleton manager); the `try/catch` in
`OnUpsertAsync` is what actually keeps the actor alive.
- Or, if Resume-on-self-throw is actually desired, that requires wiring a
custom supervisor in the parent (`ClusterSingletonManager`) — not overriding
`SupervisorStrategy()` here. Simpler path: keep the `try/catch`, drop the
override.
The CLAUDE.md "Resume for coordinator actors" decision applies to actors with
children (Site Runtime hierarchy) — not to leaf cluster singletons.
**Resolution**
_Unresolved._
### SiteCallAudit-002 — Singleton failover does not wait for in-flight async upserts
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:455-462` (singleton wiring), `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:153-193` |
**Description**
The singleton is created with `terminationMessage: PoisonPill.Instance`. On
failover the active node's singleton stops as soon as the mailbox is drained
of normal messages and the PoisonPill is processed. An `OnUpsertAsync` task
that started before the PoisonPill arrived is allowed to run to completion
(the message handler runs synchronously from the mailbox's view): stopping
the actor does NOT cancel the underlying EF Core
`ExecuteSqlInterpolatedAsync` call.
Two consequences:
1. The new singleton on the other node may begin accepting
`UpsertSiteCallCommand` for the same `TrackedOperationId` while the old
singleton's in-flight upsert is still running. The repository's
monotonic-upsert and the SQL duplicate-key swallow protect storage state.
2. The original `replyTo` sender may receive its `Accepted=true` after the new
singleton has already returned a different reply. Idempotency keys protect
correctness; wire-level ordering is best-effort by design.
This is consistent with the design ("eventually-consistent mirror, sites are
source of truth"), but worth documenting as an explicit invariant. The
Notification Outbox sibling has the same pattern.
**Recommendation**
- Document the failover/handover semantics in the actor's XML remarks: "On
cluster singleton handover, in-flight `OnUpsertAsync` tasks complete on the
old node and may produce a late `Accepted=true` reply; the repository's
monotonic upsert ensures storage state is consistent."
- Add an integration test that deliberately races two concurrent upserts on
the same `TrackedOperationId` to verify the duplicate-key swallow +
monotonic rank check (the CD-015 race-pattern check the parent task
flagged).
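The race in the second bullet can be modelled in memory before committing to an MSSQL-backed test. The sketch below is only an illustration under stated assumptions: `SiteCallState`, its integer `StatusRank`, and `InMemoryMonotonicStore` are hypothetical stand-ins for the repository's monotonic upsert, not the real `UpsertAsync` or schema.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical, trimmed-down row: only the key and a comparable status rank.
public sealed record SiteCallState(Guid TrackedOperationId, int StatusRank);

public sealed class InMemoryMonotonicStore
{
    private readonly ConcurrentDictionary<Guid, SiteCallState> _rows = new();

    // Insert if absent; otherwise keep whichever row has the higher rank,
    // so a late write carrying an older status can never move a row backwards.
    public void Upsert(SiteCallState incoming) =>
        _rows.AddOrUpdate(
            incoming.TrackedOperationId,
            incoming,
            (_, existing) => incoming.StatusRank > existing.StatusRank ? incoming : existing);

    public SiteCallState Get(Guid id) => _rows[id];

    // Fires two upserts for the same key concurrently and returns the survivor.
    public static SiteCallState RaceTwoUpserts(Guid id, int rankA, int rankB)
    {
        var store = new InMemoryMonotonicStore();
        Parallel.Invoke(
            () => store.Upsert(new SiteCallState(id, rankA)),
            () => store.Upsert(new SiteCallState(id, rankB)));
        return store.Get(id);
    }
}
```

Whichever upsert wins the insert leg, the surviving row should carry the higher rank. A real integration test would assert the same property against the SQL repository.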
**Resolution**
_Unresolved._
### SiteCallAudit-003 — `OnUpsertAsync` does not refresh `IngestedAtUtc`; direct-write callers must remember to stamp it
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:153-193` |
**Description**
The combined-telemetry hot path (`AuditLogIngestActor.OnCachedTelemetryAsync`)
stamps `IngestedAtUtc = DateTime.UtcNow` on both the `AuditLog` row and the
`SiteCall` row at central-side persist time
(`src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:238-239`). The design
doc treats `IngestedAtUtc` as "central ingested (or last refreshed) this row"
— a central-side timestamp.
`SiteCallAuditActor.OnUpsertAsync` writes the supplied `SiteCall` as-is, with
whatever `IngestedAtUtc` the caller stamped. The only current callers are the
unit tests (which use `DateTime.UtcNow` at command-construction time). Once
the deferred reconciliation puller lands and starts emitting
`UpsertSiteCallCommand`s, the puller (running on central) is responsible for
stamping a central timestamp — but if a future direct-write caller forgets,
or constructs from a site DTO, the value could drift (e.g. become a site
clock value).
This is currently latent because no production caller exists, but it's
inconsistent with the dual-write code path and undocumented.
**Recommendation**
- Either: stamp `IngestedAtUtc = DateTime.UtcNow` inside `OnUpsertAsync`
before calling `UpsertAsync` (matching `AuditLogIngestActor`'s behaviour),
using `cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }`.
- Or: document in the `UpsertSiteCallCommand` XML that callers MUST stamp
`IngestedAtUtc` to a central-side `DateTime.UtcNow` immediately before
sending.
Preferred: stamp inside the actor — same as the combined-telemetry path —
because callers cannot in general know the actor is colocated on central.
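A minimal sketch of the preferred option, assuming `SiteCall` is a record (the finding's `with` expression implies this). The three-field `SiteCall` below and the `IngestStamp` helper are hypothetical; the injected clock is there only to make the behaviour testable.

```csharp
using System;

// Hypothetical stand-in for the real SiteCall record; only IngestedAtUtc matters here.
public sealed record SiteCall(Guid TrackedOperationId, string SourceSite, DateTime IngestedAtUtc);

public static class IngestStamp
{
    // Returns a copy whose IngestedAtUtc is the central-side clock reading,
    // discarding whatever value the caller supplied on the command.
    public static SiteCall StampCentral(SiteCall incoming, Func<DateTime> utcNow) =>
        incoming with { IngestedAtUtc = utcNow() };
}
```

Inside `OnUpsertAsync` this would reduce to stamping the command's `SiteCall` with `DateTime.UtcNow` immediately before the repository call, so callers no longer need to remember to do it.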
**Resolution**
_Unresolved._
### SiteCallAudit-004 — Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:23-30` (actor XML), `src/ScadaLink.SiteCallAudit/ServiceCollectionExtensions.cs:8-13`, `docs/requirements/Component-SiteCallAudit.md:24-32` |
**Description**
The design doc (`Component-SiteCallAudit.md` lines 24-32) lists five
responsibilities, including:
- "Run periodic per-site reconciliation pulls so missed telemetry self-heals."
- "Purge terminal audit rows after a configurable retention window."
The repository exposes `PurgeTerminalAsync` but nothing in this module
schedules a daily call (Notification Outbox owns a `MaintenanceService` for
its equivalent; no `SiteCallAuditMaintenanceService` exists). The
reconciliation puller is acknowledged in the actor XML
(`only reconciliation remains deferred`) but is not surfaced in the design
doc as deferred — the doc reads as if it ships.
**Recommendation**
- Either: implement the deferred pieces (a hosted service that wakes daily
and calls `repo.PurgeTerminalAsync(now - retentionWindow)`, plus a per-site
reconciliation puller with a cursor + an `IPullCachedTelemetryClient`).
- Or: add a "Status" / "Deferred" subsection to the design doc explicitly
listing what's not yet implemented (matches the pattern Audit Log uses for
its tamper-evidence hash chain).
**Resolution**
_Unresolved._
### SiteCallAudit-005 — `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:548-563` |
**Description**
```csharp
return outcome switch
{
SiteCallRelayOutcome.Applied => null,
SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site (...)",
SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage,
// SiteUnreachable is never produced from a ParkedOperationActionAck —
// unreachable responses are built by UnreachableRetry/UnreachableDiscard
// before any ack is classified, so this arm is unreachable by construction.
SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage,
_ => throw new ArgumentOutOfRangeException(...)
};
```
The comment correctly states the `SiteUnreachable` arm is unreachable when
called from `ClassifyAck`. The arm therefore exists only to satisfy
exhaustiveness, but instead of throwing or returning a sentinel, it returns
`ack.ErrorMessage`, which is indistinguishable from the `OperationFailed`
arm above. If any future caller *does* feed `SiteUnreachable` into
`AckErrorMessage` (e.g. via refactor), the result will be a silent
wrong-detail-text bug rather than an immediate crash. The default arm
correctly throws `ArgumentOutOfRangeException`, so the `SiteUnreachable` arm
is the inconsistent one.
**Recommendation**
Replace the `SiteUnreachable => ack.ErrorMessage` arm with:
```csharp
SiteCallRelayOutcome.SiteUnreachable =>
throw new InvalidOperationException(
"AckErrorMessage cannot be called for SiteUnreachable — those responses "
+ "are built by UnreachableRetry/UnreachableDiscard before classification."),
```
— fail fast if the invariant is ever violated by a refactor.
**Resolution**
_Unresolved._
### SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.SiteCallAudit.Tests/SiteCallAuditActorTests.cs:335-392` |
**Description**
`SiteCallQueryRequest_StuckOnly_PagesAreFull_NoEmptyPagesWithCursor` covers
the case where stuck rows are interleaved with non-stuck rows (page-1 returns
2 stuck rows, page-2 returns the third). It does not cover the edge where
the row at the keyset cursor boundary (`AfterCreatedAtUtc + AfterId`) is
itself a non-stuck row — i.e. the cursor points at a row the next page must
SKIP through to find more stuck rows. The repository's SQL composes the
cursor predicate (`CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND id <
...)`) with the stuck predicate, so it should be honest, but the test only
asserts row counts and `IsStuck`, not that the second-page query specifically
skipped non-stuck rows between the cursor and the next stuck row.
Lower priority because the SQL composition is straightforward, but adding a
direct test would lock the invariant.
**Recommendation**
Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
page-size-1 query; (c) asserts each page returns exactly the stuck row, with
no overlap and all 3 stuck rows visited.
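The shape of that test can be sketched against an in-memory model of the predicate composition. This is an assumption-laden illustration, not the repository's SQL: `Row`, `StuckPaging`, and the newest-first ordering are hypothetical, chosen to match the `CreatedAtUtc < cursor` keyset predicate described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape: just the keyset columns plus the stuck flag.
public sealed record Row(long Id, DateTime CreatedAtUtc, bool IsStuck);

public static class StuckPaging
{
    // One newest-first page: the stuck filter and the keyset cursor are applied
    // together, so non-stuck rows between the cursor and the next stuck row are skipped.
    public static List<Row> Page(IEnumerable<Row> rows, int pageSize, (DateTime CreatedAtUtc, long Id)? after) =>
        rows.Where(r => r.IsStuck)
            .Where(r => after is null
                || r.CreatedAtUtc < after.Value.CreatedAtUtc
                || (r.CreatedAtUtc == after.Value.CreatedAtUtc && r.Id < after.Value.Id))
            .OrderByDescending(r => r.CreatedAtUtc)
            .ThenByDescending(r => r.Id)
            .Take(pageSize)
            .ToList();

    // Follows the cursor until an empty page and returns every visited Id in order.
    public static List<long> WalkAllPages(IReadOnlyList<Row> rows, int pageSize)
    {
        var visited = new List<long>();
        (DateTime CreatedAtUtc, long Id)? cursor = null;
        while (true)
        {
            var page = Page(rows, pageSize, cursor);
            if (page.Count == 0) break;
            visited.AddRange(page.Select(r => r.Id));
            var last = page[^1];
            cursor = (last.CreatedAtUtc, last.Id);
        }
        return visited;
    }

    // Six rows, oldest first: stuck, not-stuck, stuck, not-stuck, stuck, not-stuck.
    public static List<Row> InterleavedFixture()
    {
        var t0 = new DateTime(2026, 5, 28, 0, 0, 0, DateTimeKind.Utc);
        return Enumerable.Range(0, 6)
            .Select(i => new Row(i + 1, t0.AddMinutes(i), i % 2 == 0))
            .ToList();
    }
}
```

With a page size of 1, walking the fixture should visit each stuck row exactly once and never return an empty intermediate page. The real test would assert the same against `SiteCallQueryRequest` responses.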
**Resolution**
_Unresolved._
+381 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 9 |
## Summary
@@ -46,6 +46,31 @@ keyword-search filter (SiteEventLogging-013) and a claimed initial-purge block o
host startup thread (SiteEventLogging-014 — later re-triaged to Won't Fix, the
premise does not hold on .NET 8+).
#### Re-review 2026-05-28 (commit `1eb6e97`)
Re-reviewed the module at commit `1eb6e97`. All fourteen prior findings remain closed
and their resolutions hold up under inspection: the lock-guarded `WithConnection`
overloads, the background-writer `Channel<T>` with disposed-mid-drain fault
propagation, the `auto_vacuum = INCREMENTAL` schema + logical-size measurement, the
severity index, the `LIKE` keyword-search escaping, and the concrete-recorder DI
wiring are all present and correct at this commit. Nine new findings were recorded —
none are regressions of prior fixes. The most notable (SiteEventLogging-016, **High**)
is a correctness defect in the query path: timestamps are stored as ISO 8601 strings
generated from `DateTimeOffset.UtcNow` (so they always have a `+00:00` offset suffix),
but the `From`/`To` filters are stringified verbatim via `request.From.Value.ToString("o")`
without normalising to UTC, so a central client that sends a non-UTC `DateTimeOffset`
gets a broken lexicographic comparison and either spuriously includes or excludes
events. The next-most-notable findings are SiteEventLogging-015 (unbounded background
write queue can grow without limit under sustained writer slowness — sister
`SqliteAuditWriter` uses a bounded channel) and SiteEventLogging-017 (the central
client's `PageSize` is used verbatim with no upper-bound clamp, defeating the design's
"prevents broad queries from overwhelming the communication channel" rationale). The
remaining findings are low-severity hygiene / documentation: an unused
`FailedWriteCount` metric, untyped severity/event-type fields, non-invariant culture
parsing, the purge service running on the standby node, the redundant `Cache=Shared`
on a single-connection logger, and a non-volatile stop flag in a concurrency stress
test.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -61,6 +86,21 @@ premise does not hold on .NET 8+).
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
| 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps (-016); `DateTimeOffset.Parse` without invariant culture is culture-sensitive (-021); severity/event-type accept any non-empty string with no schema enforcement (-020). |
| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` is a simple `Receive`/`Tell` bridge with no supervision concerns of its own; no new findings. |
| 3 | Concurrency & thread safety | ☑ | Concurrent-write stress test uses a non-volatile `stop` flag (-023). The shared-connection lock pattern is correct post-SiteEventLogging-003. |
| 4 | Error handling & resilience | ☑ | `FailedWriteCount` is exposed but nothing in Health Monitoring polls it — the metric is unobserved (-018). |
| 5 | Security | ☑ | Queries are fully parameterised. `PageSize` and `KeywordFilter` from the central client are not bounded (-017) — a hostile or buggy central could request `int.MaxValue` rows or multi-MB `LIKE` patterns. |
| 6 | Performance & resource management | ☑ | Background write queue is unbounded (-015); `Cache=Shared` is redundant for a single-connection logger (-022); upper-bound on `PageSize` missing (-017). |
| 7 | Design-document adherence | ☑ | `EventLogPurgeService` is registered as a per-host `BackgroundService` and runs on the standby too, but the design says "the daily background job runs on the active node" (-019). |
| 8 | Code organization & conventions | ☑ | `FailedWriteCount` is on the concrete `SiteEventLogger`, not on `ISiteEventLogger`, so any future non-concrete consumer cannot read it (-018). |
| 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
| 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |
## Findings
### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
@@ -706,3 +746,341 @@ re-triage note). No code change made. A verification test
`StartAsync_DoesNotBlock_OnTheInitialPurge` was added to pin this behaviour
(asserts `StartAsync` returns in under 1 s and the initial purge still runs on the
background scheduler).
### SiteEventLogging-015 — Background write queue is unbounded; can grow without limit under sustained writer slowness
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:58-63` |
**Description**
`SiteEventLogger` creates its background-writer feeder as
`Channel.CreateUnbounded<PendingEvent>(...)`. The writer thread funnels every write
through the shared `_writeLock` (acquired by `WithConnection`), so any condition that
makes a single iteration slow — a long-running query in `EventLogQueryService`
holding the lock, a `PurgeByStorageCap` run that takes the lock for batched
`DELETE` + `PRAGMA incremental_vacuum`, a disk stall, or a sustained event burst
from an alarm storm / script failure loop — drives the queue arbitrarily large.
Every queued `PendingEvent` retains its `TaskCompletionSource` and its payload
strings, so there is no upper bound on how much memory the recorder can hold.
The sister centralized-audit component `ScadaLink.AuditLog/Site/SqliteAuditWriter.cs`
addresses the same hot-path-writer problem with
`Channel.CreateBounded<...>(new BoundedChannelOptions(_options.ChannelCapacity) { ..., FullMode = BoundedChannelFullMode.Wait })`,
giving back-pressure to producers. Site event logging picked the riskier choice for
a component that — per the design — is fed by every site subsystem (script, alarm,
deployment, DCL, store-and-forward, instance lifecycle, notification) and has both
a 30-day retention sweep and a 1 GB cap-purge competing for the same lock.
**Recommendation**
Switch to `Channel.CreateBounded<PendingEvent>(...)` with a configurable capacity
(default in the order of 10 000 — large enough to absorb a normal alarm burst,
small enough to bound memory). Pick a `FullMode` that matches policy: `Wait` for
back-pressure (callers `await` and serialise their actor thread on the queue —
defeats some of the SiteEventLogging-005 win but is safe), or `DropOldest` /
`DropWrite` with a counter (drop-and-count is closer to "best-effort audit"). Add
the dropped-event counter to `FailedWriteCount` or a sibling metric. Document the
chosen policy on `ISiteEventLogger.LogEventAsync`.
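A minimal sketch of the drop-and-count variant, assuming .NET 6 or later (for the `Channel.CreateBounded` overload that takes an item-dropped callback). The `PendingEvent` shape and the `BoundedEventQueue` wrapper are hypothetical; the real recorder also carries a `TaskCompletionSource` per event, which would need completing or faulting when an event is dropped.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;

// Hypothetical, simplified payload; the real PendingEvent carries more state.
public sealed record PendingEvent(string EventType, string Message);

public sealed class BoundedEventQueue
{
    private readonly Channel<PendingEvent> _channel;
    private long _droppedCount;

    public BoundedEventQueue(int capacity)
    {
        var options = new BoundedChannelOptions(capacity)
        {
            // When full, evict the oldest queued event to make room for the new one.
            FullMode = BoundedChannelFullMode.DropOldest,
            SingleReader = true
        };
        // The callback runs once per evicted item, so the counter tracks exact loss.
        _channel = Channel.CreateBounded<PendingEvent>(
            options,
            _ => Interlocked.Increment(ref _droppedCount));
    }

    public long DroppedCount => Interlocked.Read(ref _droppedCount);

    public bool TryEnqueue(PendingEvent e) => _channel.Writer.TryWrite(e);

    public bool TryDequeue(out PendingEvent? e) => _channel.Reader.TryRead(out e);
}
```

`DropOldest` keeps producers non-blocking, which preserves the hot-path behaviour, at the cost of losing the oldest events during a burst. `Wait` would trade that for back-pressure on callers.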
### SiteEventLogging-016 — `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:67-77`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:159`, `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:72-78` |
**Description**
Event rows are persisted with `timestamp` = `DateTimeOffset.UtcNow.ToString("o")`,
which always emits the round-trip ISO 8601 form ending in the literal offset
`+00:00` (e.g. `2026-05-28T12:34:56.7890123+00:00`). The query path filters by
range using a direct string comparison:
```csharp
whereClauses.Add("timestamp >= $from");
parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o")));
```
`request.From` is a `DateTimeOffset?` and `ToString("o")` preserves whatever offset
the caller passed in. If a central client passes a non-UTC `DateTimeOffset` — for
example the result of `DateTimeOffset.Now` in a `UTC+05:00` timezone — the produced
string is `"2026-05-28T17:34:56.0000000+05:00"`, which is lexicographically *greater*
than the equivalent UTC instant string `"2026-05-28T12:34:56.0000000+00:00"`. The
comparison `timestamp >= $from` is then evaluated as a byte-by-byte string compare
(SQLite default `BINARY` collation), so the query either spuriously excludes events
that genuinely occurred in the range, or spuriously includes events from a wholly
different hour. The same defect applies to `To`. The retention purge does
`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")` (UTC) so it is safe; only the
central query path is vulnerable.
The design explicitly states "All timestamps are UTC throughout the system" but the
boundary between a central `DateTimeOffset` and the SQLite store is not enforced.
A central UI rendered in a non-UTC timezone is the most likely trigger, and the
defect silently corrupts every query that filters by time range — exactly the
filter most likely to be set on a "show me what happened around the failover" query.
**Recommendation**
Normalise `From` / `To` to UTC before serialising:
`request.From.Value.ToUniversalTime().ToString("o")`, so the produced offset is
always `+00:00`. Avoid `.UtcDateTime.ToString("o")` here: formatting a UTC
`DateTime` with `"o"` emits a `Z` suffix rather than `+00:00`, which no longer
matches the stored form byte-for-byte. Add a regression test that filters with a
`DateTimeOffset` carrying a non-zero offset and asserts the matching events are
returned. Optionally also store timestamps as Unix-epoch `INTEGER` and let SQLite
compare numerically, eliminating the lexicographic-comparison hazard structurally.
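The hazard and the fix can be shown without SQLite, since the comparison is a plain ordinal string compare. The helper name `ToFilterBound` is hypothetical; the sketch assumes stored timestamps are produced by `DateTimeOffset.UtcNow.ToString("o")` as described above.

```csharp
using System;
using System.Globalization;

public static class TimestampFilter
{
    // Serialises a range bound in the same shape as the stored column:
    // converted to UTC first, so the suffix is always "+00:00".
    public static string ToFilterBound(DateTimeOffset value) =>
        value.ToUniversalTime().ToString("o", CultureInfo.InvariantCulture);

    // Mirrors SQLite's BINARY collation for "timestamp >= $from".
    public static bool StoredIsAtOrAfter(string storedTimestamp, string fromBound) =>
        string.CompareOrdinal(storedTimestamp, fromBound) >= 0;
}
```

A bound created at `17:34:56+05:00` is the same instant as a row stored at `12:34:56+00:00`. Serialised verbatim, the bound sorts after the row and excludes it; serialised through `ToFilterBound`, the two strings are identical and the row matches.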
### SiteEventLogging-017 — Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:55`, `src/ScadaLink.Commons/Messages/RemoteQuery/EventLogQueryRequest.cs:18` |
**Description**
`EventLogQueryService.ExecuteQuery` resolves the effective page size as
`var pageSize = request.PageSize > 0 ? request.PageSize : _options.QueryPageSize;`
and uses it directly as the SQL `LIMIT $limit` (passing `pageSize + 1` to detect
"has more"). There is no upper bound. A central client — buggy or hostile — can
send `PageSize = int.MaxValue`, in which case the query attempts to materialise the
entire (up to 1 GB) event log into a single `List<EventLogEntry>` while holding the
shared write lock. This:
- Builds a worst-case ~1 GB managed allocation that, depending on Akka.NET cluster
message serialisation limits, will then be serialised into an
`EventLogQueryResponse` and pushed over the ClusterClient pipe.
- Blocks all writes (purge, recorder hot path) for the duration of the scan
because the read holds `_writeLock`.
- Stalls the singleton `EventLogHandlerActor`, also blocking subsequent legitimate
queries.
The design explicitly justifies pagination as preventing exactly this — "Results
are paginated with a configurable page size (default: 500 events) ... This prevents
broad queries from overwhelming the communication channel." The code honours the
*default* but does not enforce an *upper bound* on a client-supplied override.
**Recommendation**
Clamp `pageSize` to a configurable maximum (e.g. `SiteEventLogOptions.MaxQueryPageSize`,
default 5000) before using it. Also bound `KeywordFilter.Length` (e.g. 256 chars) —
a leading-wildcard `LIKE` of an unbounded pattern is itself an expensive operation
that runs under the same lock. Add a `Success: false, ErrorMessage: "PageSize
exceeds maximum"` reject path so a misbehaving central is told why its query is
refused.
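One possible shape for the guard, written as a pure function so it can be unit-tested without a database. `QueryLimits`, `TryResolve`, and the limit parameters are hypothetical names; the sketch takes the reject path from the recommendation rather than clamping silently.

```csharp
using System;

public static class QueryLimits
{
    // Resolves the effective page size and validates client-supplied bounds.
    // Returns false with an error message when the request should be refused.
    public static bool TryResolve(
        int requestedPageSize,
        string? keywordFilter,
        int defaultPageSize,
        int maxPageSize,
        int maxKeywordLength,
        out int pageSize,
        out string? error)
    {
        // Non-positive means "use the configured default", as in the existing code.
        pageSize = requestedPageSize > 0 ? requestedPageSize : defaultPageSize;
        error = null;

        if (pageSize > maxPageSize)
        {
            error = "PageSize exceeds maximum";
            return false;
        }

        if (keywordFilter != null && keywordFilter.Length > maxKeywordLength)
        {
            error = "KeywordFilter exceeds maximum length";
            return false;
        }

        return true;
    }
}
```

If clamping is preferred over rejecting, the first check becomes `pageSize = Math.Min(pageSize, maxPageSize)`, but the central client then has no signal that its request was altered.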
### SiteEventLogging-018 — `FailedWriteCount` is exposed but never consumed by Health Monitoring
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:67-71,225-226` |
**Description**
`SiteEventLogger.FailedWriteCount` was added under SiteEventLogging-008 with the
XML doc statement "Surfaced so Health Monitoring can detect a logging outage
instead of relying on a local log line nobody is watching." The implementation is
correct (`Interlocked.Increment` on write failure, `Interlocked.Read` getter), but
a repo-wide search shows **no** caller anywhere in `src/` reads the property —
neither `ScadaLink.HealthMonitoring`, the central health collector, nor the host's
`/health` endpoint. The metric is effectively dead: a logging outage still goes
unnoticed in production, contradicting the original finding's resolution claim.
The property is also exposed only on the concrete `SiteEventLogger`, not on
`ISiteEventLogger`, so even if Health Monitoring were wired up it would have to
take a concrete-type dependency (the `internal Connection` member was removed, but
`FailedWriteCount` remained concrete-only).
**Recommendation**
Either (a) wire `FailedWriteCount` into the existing Health Monitoring metric
pipeline (e.g. publish it alongside other 30-second-interval site metrics, and
promote a sustained non-zero value to a Warning), and add it to `ISiteEventLogger`
so the consumer doesn't downcast; or (b) acknowledge the metric is unobserved by
softening the XML doc to "Available for future Health Monitoring integration" and
file a tracking item for the wiring. The current doc claim is misleading.
### SiteEventLogging-019 — `EventLogPurgeService` runs on every host node; design says "active node"
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:21`, `docs/requirements/Component-SiteEventLogging.md:45` |
**Description**
`AddSiteEventLogging` calls `services.AddHostedService<EventLogPurgeService>()`,
which registers the purge `BackgroundService` per host. On a 2-node site cluster
both `node-a` and `node-b` start the service independently, so each runs its own
30-day retention purge and 1 GB cap purge against its own local
`site_events.db`. The design states only "A daily background job runs on the
active node and deletes all events older than 30 days." (Component-SiteEventLogging,
Storage section). In practice the standby node receives no writes, so its purge
finds nothing to delete and is harmless — but the implementation does not match the
documented "active node" gating, and the resolution note on SiteEventLogging-004
already flagged that the *writer* runs on the standby too. The purge has the same
shape.
Aligning to the design is also a defence against a future change that does write
to the standby (e.g. local heartbeats), and removes the per-node wake-ups that
contribute to `Microsoft.Extensions.Hosting` shutdown latency.
**Recommendation**
Either (a) gate the purge service on "this node is the active member of `siteRole`"
(check the cluster singleton ownership before each `RunPurge()`, or host the
purge inside the same cluster singleton as `EventLogHandlerActor`), or (b) reword
the design doc to "the purge runs on every node against its own local database;
on the standby it is a no-op". Pick one; the current mismatch is a doc-vs-code
defect.
### SiteEventLogging-020 — `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:144-156`, `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:14-15` |
**Description**
`LogEventAsync` validates `eventType` and `severity` only for non-empty/non-whitespace.
The XML doc enumerates the allowed values: `eventType` ∈ {script, alarm,
deployment, connection, store_and_forward, instance_lifecycle}, `severity` ∈
{Info, Warning, Error}. Nothing in the code enforces either set. Any caller can
pass `"SCRIPT"`, `"Script"`, `"warn"`, `"ERR"`, or a typo and the row is inserted
verbatim. Two follow-on consequences:
1. The `EventLogQueryService.Severity` filter is `severity = $severity` (exact
match, case-sensitive by SQLite default `BINARY` collation). A row stored as
`"error"` will not be returned for a query filtering on `"Error"`. The design
lists severity as a first-class filter and the central UI will reasonably
normalise to one casing — every row stored with a different casing is silently
invisible to that filter.
2. The `Events Logged` table in the design implicitly relies on a stable
`event_type` enumeration to drive UI grouping; a typo'd `event_type` slips in
silently and is hard to detect later.
**Recommendation**
Validate `eventType` and `severity` against a known set (or accept `enum`s on the
interface, converting to canonical string at the call site). Reject unknown values
with `ArgumentException` and log a single-shot warning during construction if a
deployment is found to be using an unexpected value. Alternatively, normalise
casing (`severity = severity.ToLowerInvariant()`) so the query filter is
case-insensitive. Update the XML doc to match the enforced contract.
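A minimal sketch of the validate-and-canonicalise option for `severity`, assuming the three values listed in the XML doc are the full set. `EventFieldValidation` and `CanonicalSeverity` are hypothetical names; the same pattern would apply to `eventType`.

```csharp
using System;
using System.Collections.Generic;

public static class EventFieldValidation
{
    // Case-insensitive lookup that maps any accepted spelling to one canonical casing,
    // so the exact-match query filter always sees the same stored value.
    private static readonly Dictionary<string, string> Severities =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["Info"] = "Info",
            ["Warning"] = "Warning",
            ["Error"] = "Error",
        };

    public static string CanonicalSeverity(string? severity)
    {
        if (severity is not null && Severities.TryGetValue(severity.Trim(), out var canonical))
            return canonical;

        // Abbreviations and typos ("warn", "ERR") are rejected rather than stored verbatim.
        throw new ArgumentException($"Unknown severity '{severity}'.", nameof(severity));
    }
}
```

Canonicalising on write keeps the stored column uniform. Rows written before the change would still need a one-off normalisation to become visible to the filter.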
### SiteEventLogging-021 — `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:138` |
**Description**
`ExecuteQuery` materialises rows via
`DateTimeOffset.Parse(reader.GetString(1))`. `DateTimeOffset.Parse(string)` uses
`CultureInfo.CurrentCulture` and `DateTimeStyles.None`. The stored format is ISO
8601 round-trip (`"o"`), which is *usually* parseable in any culture — but a
production node running with a non-default culture (e.g. Turkish "tr-TR", which
has historically broken case-insensitive ASCII comparisons via the
"Turkish-I" issue, or any culture that overrides the date/time separators) can
parse incorrectly or throw `FormatException`. The exception is caught by the outer
`try`, so the entire query is converted to a `Success: false` response — but the
failure mode is silent and culture-dependent.
The recorder side stores via `DateTimeOffset.UtcNow.ToString("o")`. The `"o"`
specifier is documented as producing the same output regardless of culture, so
the stored text is stable; the round-trip between insert and query is therefore
only as reliable as the parse side, which is not culture-pinned today.
**Recommendation**
Parse with explicit invariant culture and round-trip style:
`DateTimeOffset.Parse(reader.GetString(1), CultureInfo.InvariantCulture,
DateTimeStyles.RoundtripKind)`. Passing `CultureInfo.InvariantCulture` to the
`ToString("o")` emitters in `SiteEventLogger.LogEventAsync` and
`EventLogPurgeService.PurgeByRetention` does not change their output but makes the
intent explicit. Alternatively, switch the schema to store `timestamp` as a
Unix-epoch `INTEGER` and avoid string parsing altogether.
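As a hedged illustration of the culture-pinned round trip, the sketch below pairs the emitter and the parser in one place. `EventTimestamp` is a hypothetical helper name, not an existing type in the module.

```csharp
using System;
using System.Globalization;

// Hypothetical helper that keeps the write and read formats in one place.
public static class EventTimestamp
{
    // ISO 8601 round-trip text, with the culture stated explicitly.
    public static string Format(DateTimeOffset value) =>
        value.ToString("o", CultureInfo.InvariantCulture);

    // Parses the stored text without consulting CultureInfo.CurrentCulture.
    public static DateTimeOffset Parse(string stored) =>
        DateTimeOffset.Parse(stored, CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind);
}
```

Keeping both directions behind one helper means the recorder, the purge service, and the query service cannot drift apart in how they treat the `timestamp` column.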
### SiteEventLogging-022 — `Cache=Shared` is redundant for a single-connection logger
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:52` |
**Description**
The connection string is built as
`$"Data Source={options.Value.DatabasePath};Cache=Shared"`. SQLite's
shared-cache mode is a *cross-connection* optimisation: it lets multiple
`SqliteConnection`s in the same process share an in-process page cache. This
logger owns exactly one `SqliteConnection` and serialises all access through
`_writeLock`, so `Cache=Shared` cannot share with anything — the mode is dormant.
At best it is dead configuration; at worst it adds (very small) per-statement
lock overhead inside SQLite. The sister `SqliteAuditWriter` carries the same
unused option, so the smell is a copy-and-paste pattern.
Shared-cache mode also subtly changes the semantics of `PRAGMA busy_timeout` and
`PRAGMA locking_mode`, so leaving it on while *not* using it is a small future
foot-gun if anyone later opens a second connection to the same file from another
component on the same host (e.g. a tooling read-only viewer).
**Recommendation**
Drop `Cache=Shared` from the connection string — the logger is single-connection
and gains nothing from it. If a future need to share the DB across connections in
the same process arises, reintroduce it deliberately together with the busy_timeout
and locking_mode review that should accompany it.
### SiteEventLogging-023 — Concurrent-stress test uses a non-volatile `stop` flag
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.SiteEventLogging.Tests/EventLogPurgeServiceTests.cs:282-308` |
**Description**
`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` uses a plain `bool stop = false;`
that the main test thread mutates after the purge task completes
(`stop = true;`) while four background writer tasks are spin-checking `while (!stop)`.
The flag is not declared `volatile`, not wrapped in `Volatile.Read/Volatile.Write`,
and not behind a memory barrier. On a release build with a relaxed memory model
the writer threads are permitted to cache the `stop = false` read indefinitely,
which means in theory the test can hang past xUnit's per-test timeout instead of
asserting `Empty(exceptions)`. The test relies on observed JIT/runtime behaviour
that today happens to refresh the field across the `await _eventLogger.LogEventAsync`
boundary, but that is an implementation detail rather than a contract.
The test is a regression test for SiteEventLogging-003; a flaky / hang-prone
version of it can mask the very behaviour it is meant to pin.
**Recommendation**
Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or hoist
`stop` into a `volatile bool` field (C# does not allow `volatile` on locals), or
access it through `Volatile.Read` / `Volatile.Write`. `CancellationTokenSource` is
the canonical .NET pattern and, via `CancelAfter`, also gives the writer loops a
hard upper bound if the test stalls.
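A minimal sketch of the `CancellationTokenSource` variant, assuming a hypothetical `StopSignalSketch.RunAsync` harness. The real test would call `_eventLogger.LogEventAsync` inside the loop; the counter here is only a stand-in for that work.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class StopSignalSketch
{
    // Spins up `writerCount` background loops, awaits `work`, then signals
    // the loops to stop and waits for them. Returns the total loop iterations.
    public static async Task<int> RunAsync(int writerCount, Func<Task> work)
    {
        using var cts = new CancellationTokenSource();
        cts.CancelAfter(TimeSpan.FromSeconds(30)); // safety net if `work` stalls

        int iterations = 0;
        var writers = new Task[writerCount];
        for (int i = 0; i < writerCount; i++)
        {
            writers[i] = Task.Run(async () =>
            {
                while (!cts.IsCancellationRequested)
                {
                    Interlocked.Increment(ref iterations); // stand-in for a log write
                    await Task.Yield();
                }
            });
        }

        await work();      // e.g. the purge under test
        cts.Cancel();      // visible to every writer, no volatile flag needed
        await Task.WhenAll(writers);
        return iterations;
    }
}
```

The token replaces the captured `bool`, so the stop signal no longer depends on how the JIT treats the closure field.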
+407 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.SiteRuntime` |
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
## Summary
@@ -47,6 +47,36 @@ and two dead lifecycle handlers in `InstanceActor` that the Deployment Manager
never routes to (SiteRuntime-019, Low). All three were subsequently resolved on
2026-05-17. Open findings: 0.
#### Re-review 2026-05-28 (commit `1eb6e97`)
The module was re-reviewed at commit `1eb6e97` as part of the new baseline
review. The SiteRuntime source surface has grown materially since the prior
pass — primarily by threading `ExecutionId`/`ParentExecutionId`/`SourceNode`
through the script-trust-boundary helpers and the cached-call telemetry
emitters, and by adding `OperationTrackingStore`, the
`AuditingDbConnection`/`AuditingDbCommand`/`AuditingDbDataReader` decorators,
and `ScriptExecutionScheduler`. All 10 checklist categories were walked afresh.
Seven new findings were recorded: a race that throws
`InvalidActorNameException` when a second `DeployInstanceCommand` arrives for
the same instance while a redeployment is still terminating its predecessor
(SiteRuntime-020, Medium); an artifact-only data-connection update that never
reaches the DCL (SiteRuntime-021, Medium); `AuditingDbCommand.DbConnection.set`
reaching into `AuditingDbConnection._inner` via reflection, the same
anti-pattern SiteRuntime-006 eliminated for the repositories, now reintroduced and
in direct tension with the script trust model that forbids `System.Reflection`
(SiteRuntime-022, Medium); `Convert.ToDouble(value)` in `ScriptActor` /
`AlarmActor` running under `CurrentCulture` so a string attribute value
becomes locale-sensitive (SiteRuntime-023, Low); `OperationTrackingStore`
serialising every cached-call write through a single connection +
`SemaphoreSlim` and using sync-over-async in `Dispose()` (SiteRuntime-024,
Medium); inbound-API `SetAttribute` (and any future caller) accepting unknown
attribute names and persisting them as overrides, polluting both `_attributes`
and the SQLite override table (SiteRuntime-025, Low); and the
`ReplicationMessages.cs` outbound/inbound record types still missing public XML
docs (SiteRuntime-026, Low). Prior findings 001–019 remain
Resolved/Deferred — no regressions observed in any of their fixed call sites.
Open findings: 7.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -62,6 +92,21 @@ never routes to (SiteRuntime-019, Low). All three were subsequently resolved on
| 9 | Testing coverage | ✓ | No tests for ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, or the two repositories. |
| 10 | Documentation & comments | ✓ | Several XML comments describe behaviour the code does not implement (see findings). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Second-deploy race vs. pending redeploy (020); artifact-only data-connection update never reaches DCL (021); unknown-name SetAttribute persists bogus overrides (025). |
| 2 | Akka.NET conventions | ✓ | Trigger-eval blocking on coordinator mailbox remains Deferred (014); short-lived execution actors and replication actor otherwise conform. |
| 3 | Concurrency & thread safety | ✓ | DM's `_instanceActors` cache and `_pendingRedeploys` map shifted from old race; new ordering race surfaced (020). `OperationTrackingStore` single-connection + SemaphoreSlim serialises all cached writes (024). |
| 4 | Error handling & resilience | ✓ | `Task.Run` fire-and-forget replication paths log on faulted (acceptable, per "best-effort replication" design). DM's deploy persistence rollback path (resolved as SiteRuntime-005) intact. |
| 5 | Security | ✓ | Trust-model semantic analysis (SiteRuntime-011 fix) intact. `AuditingDbCommand` reflects into `AuditingDbConnection._inner` — same anti-pattern as SiteRuntime-006 (022). Audit emitter captures SQL parameter values verbatim per M4 design (M5 will redact). |
| 6 | Performance & resource management | ✓ | Per-call SQLite connections on hot paths in `SiteStorageService` (existing pattern, acceptable). `OperationTrackingStore` `Dispose()` does sync-over-async (024). `ScriptExecutionScheduler` bounded threads as expected. |
| 7 | Design-document adherence | ✓ | Artifact-only data-connection update path is silently inert (021) — contradicts the "site is self-contained after artifact deployment" intent. |
| 8 | Code organization & conventions | ✓ | Repository reflection-via-private-field anti-pattern reintroduced in `AuditingDbCommand` (022). `ReplicationMessages.cs` public records still undocumented (026). |
| 9 | Testing coverage | ✓ | `SiteReplicationActor` remains uncovered (SiteRuntime-016 deferred that gap to a clustered-ActorSystem harness, still outstanding). New findings have no targeted coverage yet. |
| 10 | Documentation & comments | ✓ | `ReplicationMessages.cs` records lack XML docs (026); other XML doc surface materially expanded in `1eb6e97`. |
## Findings
### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer
@@ -902,3 +947,362 @@ stating the Deployment Manager owns this lifecycle. Regression test:
`InstanceActorTests.InstanceActor_DoesNotHandleDisableOrEnableCommands` asserts the
Instance Actor produces no `InstanceLifecycleResponse` for either command
(confirmed to fail against the pre-fix dead handlers and pass after removal).
### SiteRuntime-020 — Second `DeployInstanceCommand` arriving during a pending redeploy races the still-terminating actor on its name
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:285`, `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:971` |
**Description**
The SiteRuntime-003 fix makes `HandleDeploy` watch + stop a running Instance
Actor and buffer the in-flight `DeployInstanceCommand` in `_pendingRedeploys`
until `Terminated` arrives. The handler also removes the instance from
`_instanceActors` synchronously, in step with the stop request:
```csharp
if (_instanceActors.TryGetValue(instanceName, out var existing))
{
_instanceActors.Remove(instanceName);
_pendingRedeploys[existing] = new PendingRedeploy(command, Sender);
Context.Watch(existing);
Context.Stop(existing);
UpdateInstanceCounts();
return;
}
// Fresh deployment — no existing actor to replace.
ApplyDeployment(command, Sender, isRedeploy: false);
```
If a *second* `DeployInstanceCommand` for the same `instanceName` arrives on
the singleton's mailbox while the predecessor is still terminating, the
`_instanceActors.TryGetValue` lookup correctly reports "no existing actor" —
because the first deploy already removed it — and execution falls through to
`ApplyDeployment(..., isRedeploy: false)`. `ApplyDeployment` immediately calls
`CreateInstanceActor`, which calls `Context.ActorOf(props, instanceName)`. But
the predecessor's Akka child name **is still registered** in the parent's
child registry: that name is only released after the predecessor's `Terminated`
signal — exactly the asynchronous gap SiteRuntime-003 was created to plug for
the *first* redeploy. `Context.ActorOf` therefore throws
`InvalidActorNameException`, which Akka rethrows as
`ActorInitializationException` — and the supervisor's `Stop` directive on that
exception (DeploymentManagerActor.cs:179) silently stops the just-created
child. The second deploy is then quietly lost: `_instanceActors` doesn't
contain it (the throw aborted the bookkeeping after `CreateInstanceActor`'s
own `ContainsKey` guard but before `_instanceActors[instanceName] = actorRef`
would have run), `_totalDeployedCount` was incremented, and the deployer is
never told the deployment failed (the persistence `Task.Run` is also dropped
on the throw path). The race is real on a busy site where central retries a
deploy because the prior attempt timed out — exactly the scenario the
DeploymentManager-006 query-then-deploy idempotency mechanism was designed for.
The first-redeploy case (SiteRuntime-003) does NOT exhibit this because at
that point the predecessor is still in `_instanceActors`, so the branch
correctly buffers. The bug is specific to any further deploy that arrives for
the same instance while a redeploy is already pending.
**Recommendation**
The pending-redeploy bookkeeping needs to be authoritative for "we are
mid-redeploy on this instance", not just the `_instanceActors` cache. Add a second
keyed lookup — e.g. a `Dictionary<string, IActorRef> _terminatingActorsByName`
populated when the predecessor is stopped — and check it BEFORE
`ApplyDeployment(isRedeploy: false)`. On a hit, overwrite (or stash) the
buffered `PendingRedeploy` for that terminating actor so the latest command
wins on the `Terminated` signal. Alternatively, defer the deploy by stashing
all messages for that `instanceName` until the predecessor terminates (Akka
`Stash` pattern). Either way, the fall-through to "fresh deployment" needs to
be gated on "no instance with this name is currently terminating".
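The name-keyed gate can be sketched without Akka as a small bookkeeping model. `RedeployLedgerSketch` and `DeployDecision` are hypothetical names; in the real actor, the `Buffered` branch for a live instance is where `Context.Watch` / `Context.Stop` would run, and `OnTerminated` corresponds to the `Terminated` handler.

```csharp
using System.Collections.Generic;

public enum DeployDecision { StartNow, Buffered }

// Hypothetical model of "is this instance name free?" bookkeeping.
public sealed class RedeployLedgerSketch<TCommand> where TCommand : class
{
    private readonly HashSet<string> _running = new();
    private readonly Dictionary<string, TCommand> _pendingByName = new();

    public DeployDecision OnDeploy(string instanceName, TCommand command)
    {
        // A predecessor with this name is still terminating: latest command wins.
        if (_pendingByName.ContainsKey(instanceName))
        {
            _pendingByName[instanceName] = command;
            return DeployDecision.Buffered;
        }

        // A live actor exists: the caller stops it and waits for Terminated.
        if (_running.Remove(instanceName))
        {
            _pendingByName[instanceName] = command;
            return DeployDecision.Buffered;
        }

        // The name is free, so creating the child cannot collide.
        _running.Add(instanceName);
        return DeployDecision.StartNow;
    }

    // Called when Terminated arrives; returns the buffered command to apply, if any.
    public TCommand? OnTerminated(string instanceName)
    {
        if (_pendingByName.Remove(instanceName, out var command))
        {
            _running.Add(instanceName);
            return command;
        }

        _running.Remove(instanceName);
        return null;
    }
}
```

Keying the pending map by instance name rather than by actor reference is what lets a later deploy see that the name is still occupied.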
### SiteRuntime-021 — `HandleDeployArtifacts` updates `DataConnections` in SQLite but never sends `CreateConnectionCommand` to the DCL
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:931` |
**Description**
`HandleDeployArtifacts` persists the artifact bundle (shared scripts, external
systems, database connections, notification lists, SMTP configs, and
**data connection definitions**) into local SQLite. For data connection
definitions specifically (`DataConnections`), the handler calls
`_storage.StoreDataConnectionDefinitionAsync(...)` — but does NOT issue a
`CreateConnectionCommand` (or any other DCL command) to the `_dclManager`
actor. The only path that pushes DCL configuration to the DCL is
`EnsureDclConnections`, called exclusively from the deploy / startup-batch
paths against the **flattened instance configuration's** inline `Connections`
dictionary. There is no equivalent for an artifact-only update.
Concretely: an artifact deployment that changes a data connection's endpoint
URL, credentials, backup endpoint, or failover retry count is stored
durably in the site SQLite (so on the *next* node restart the site loads the
new config and `EnsureDclConnections` picks it up) but is silently inert until
either an instance using that connection is redeployed or the node restarts.
This contradicts the design's "after artifact deployment, the site is fully
self-contained" intent (Component-SiteRuntime.md, "System-Wide Artifact
Handling") — the runtime DCL keeps using the stale connection until a much
heavier trigger event occurs. It is also asymmetric with how
`SharedScripts` are handled in the same method: shared scripts are both
stored *and* recompiled into `_sharedScriptLibrary` on update so the change is
live immediately.
(SiteRuntime-010 fixed a related defect inside `EnsureDclConnections` — the
config-hash cache — but that's only consulted on the inline-config path; the
artifact-deployment path never reaches `EnsureDclConnections`.)
**Recommendation**
In the `DataConnections` branch of `HandleDeployArtifacts`, after the
`StoreDataConnectionDefinitionAsync` call, also send a
`CreateConnectionCommand` to `_dclManager` for each updated definition,
re-using the SiteRuntime-010 config hash so unchanged connections are skipped.
Alternatively, refactor `EnsureDclConnections` to accept a flat list of
`(name, protocol, configurationJson, backupConfigurationJson,
failoverRetryCount)` tuples that both the inline (`FlattenedConfiguration`)
and artifact paths can drive through it.
### SiteRuntime-022 — `AuditingDbCommand.DbConnection.set` uses reflection to read `AuditingDbConnection._inner`
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Scripts/AuditingDbCommand.cs:138` |
**Description**
The `DbConnection` setter on `AuditingDbCommand` unwraps an
`AuditingDbConnection` value by reading its private `_inner` field via
reflection:
```csharp
set
{
_wrappingConnection = value;
_inner.Connection = value switch
{
AuditingDbConnection auditing => auditing.GetType()
.GetField("_inner", BindingFlags.Instance | BindingFlags.NonPublic)
!.GetValue(auditing) as DbConnection,
_ => value
};
}
```
This is the same encapsulation-violating anti-pattern that SiteRuntime-006
called out for the site repositories. A rename or refactor of
`AuditingDbConnection._inner` breaks the audit decorator at runtime with no
compile-time signal: the `!` null-forgiving operator only silences the nullable
warning, so a failed `GetField` lookup surfaces as a `NullReferenceException`,
and the reflective access trips static analyzers and IL trimming. More
problematically, the script trust model the same module enforces in
`ScriptCompilationService.ValidateTrustModel` explicitly forbids
`System.Reflection` in scripts, yet the auditing helper that scripts run
through itself uses reflection to reach into a sibling class. Both
classes are `internal sealed` in the same assembly, so this is purely a
self-imposed contract violation.
A second smaller concern in the same property: the getter returns
`_wrappingConnection ?? _inner.Connection`. If the caller obtains a command
via `AuditingDbConnection.CreateDbCommand()` and immediately reads
`cmd.Connection`, the getter returns the raw inner connection (not the
auditing wrapper), because `_wrappingConnection` is only populated when the
setter is later invoked. That's surprising and at odds with the class's
audit-everything intent — a script that round-trips a command through
`cmd.Connection` re-enters the un-audited path.
**Recommendation**
Expose the wrapped connection through a proper API surface. The simplest fix
that matches the SiteRuntime-006 precedent: add an
`internal DbConnection Inner { get; }` property to `AuditingDbConnection`
(both classes are `internal sealed`, so the property stays out of the public
surface) and replace the reflection switch with `auditing.Inner`. While
touching the property, also have the getter return `_wrappingConnection` even
on the synthesised CreateDbCommand path (e.g. set `_wrappingConnection` to
the parent connection inside `AuditingDbConnection.CreateDbCommand`).
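A simplified sketch of the accessor-based unwrap, assuming stand-in types (`AuditingDbConnectionSketch`, `AuditingUnwrap`) rather than the module's real classes. The auditing hooks themselves are omitted; only the `Inner` property and the reflection-free unwrap are shown.

```csharp
using System.Data;
using System.Data.Common;

// Simplified stand-in for AuditingDbConnection (auditing hooks omitted).
internal sealed class AuditingDbConnectionSketch : DbConnection
{
    public AuditingDbConnectionSketch(DbConnection inner) => Inner = inner;

    // Same-assembly accessor that replaces the reflective read of `_inner`.
    internal DbConnection Inner { get; }

    public override string ConnectionString
    {
        get => Inner.ConnectionString;
        set => Inner.ConnectionString = value;
    }

    public override string Database => Inner.Database;
    public override string DataSource => Inner.DataSource;
    public override string ServerVersion => Inner.ServerVersion;
    public override ConnectionState State => Inner.State;

    public override void ChangeDatabase(string databaseName) => Inner.ChangeDatabase(databaseName);
    public override void Open() => Inner.Open();
    public override void Close() => Inner.Close();

    protected override DbTransaction BeginDbTransaction(IsolationLevel isolationLevel) =>
        Inner.BeginTransaction(isolationLevel);

    protected override DbCommand CreateDbCommand() => Inner.CreateCommand();
}

internal static class AuditingUnwrap
{
    // Returns the raw provider connection for a wrapper, or the value unchanged.
    public static DbConnection? Unwrap(DbConnection? value) =>
        value is AuditingDbConnectionSketch auditing ? auditing.Inner : value;
}
```

With this shape, a rename of the property is a compile error at the call site instead of a runtime failure inside the setter.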
### SiteRuntime-023 — `Convert.ToDouble(value)` in trigger and alarm evaluation is locale-sensitive
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:446`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:340`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:356`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:444` |
**Description**
`ScriptActor.EvaluateCondition` and the three `AlarmActor` evaluators
(`EvaluateRangeViolation`, `EvaluateRateOfChange`, `EvaluateHiLo`) call
`Convert.ToDouble(value)` without specifying a culture. When `value` is a
string (a path that exists today: attribute values that arrive as
JSON-deserialized numbers can still surface as strings on some code paths,
particularly array values that are JSON-stringified at
`InstanceActor.HandleTagValueUpdate:377`), `Convert.ToDouble` parses against
`CultureInfo.CurrentCulture`. On a host whose locale uses a comma decimal
separator (German, French, most of continental Europe), `"1.5"` either throws
`FormatException`, in which case the condition / alarm silently degrades to its
catch-fallthrough (returns `false` for range/rate-of-change, keeps current level
for HiLo, falls back to string-compare for conditionals), or, where `.` is the
culture's group separator (e.g. de-DE), parses without error as `15`. The
CLAUDE.md "All timestamps are UTC" discipline is the equivalent rule for time;
there is no equivalent invariant-culture discipline applied to numeric parsing.
The exposure is bounded — most attribute values arrive as numeric primitives
from `TagValueUpdate.Value` or static `FlattenedConfiguration.Attributes`
(also typed) so the implicit-cast `Convert.ToDouble` path is hit. But the
string path is reachable via inbound API writes
(`RouteToSetAttributesRequest.AttributeValues` is `IReadOnlyDictionary<string,
string>`), via the JSON-array stringification at `HandleTagValueUpdate:377`,
and via static-override values loaded from SQLite (which are persisted as
strings — see `SetStaticOverrideAsync`).
**Recommendation**
Replace each `Convert.ToDouble(value)` with `Convert.ToDouble(value,
CultureInfo.InvariantCulture)`, or front-load a typed-numeric extraction
helper (`if (value is double d) return d; if (value is string s && double.TryParse(s,
NumberStyles.Float, CultureInfo.InvariantCulture, out var p)) return p;
return Convert.ToDouble(value, CultureInfo.InvariantCulture);`). The site is a
deterministic machine-control surface; condition evaluation must not depend
on the host's regional settings.
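The typed-extraction helper described above could look roughly like this. `NumericCoercion.TryToDouble` is a hypothetical name, and the `Try` shape is an assumption chosen so callers can keep their existing fallback branches instead of catching exceptions.

```csharp
using System;
using System.Globalization;

public static class NumericCoercion
{
    // Converts an attribute value to double without consulting the host culture.
    // Returns false (and result = 0) when the value is not numeric.
    public static bool TryToDouble(object? value, out double result)
    {
        switch (value)
        {
            case double d:
                result = d;
                return true;

            // Strings are handled before IConvertible because string implements it.
            case string s:
                return double.TryParse(
                    s, NumberStyles.Float, CultureInfo.InvariantCulture, out result);

            // int, long, decimal, float, bool and similar primitives.
            case IConvertible convertible:
                try
                {
                    result = convertible.ToDouble(CultureInfo.InvariantCulture);
                    return true;
                }
                catch (Exception ex) when (ex is FormatException
                                           or InvalidCastException
                                           or OverflowException)
                {
                    result = 0;
                    return false;
                }

            default:
                result = 0;
                return false;
        }
    }
}
```

`NumberStyles.Float` deliberately excludes thousands separators, so `"1,5"` is rejected rather than being read as `15`.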
### SiteRuntime-024 — `OperationTrackingStore` serialises all writes through one connection + `SemaphoreSlim`, and `Dispose()` does sync-over-async
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:39`, `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:360` |
**Description**
`OperationTrackingStore` owns exactly one `SqliteConnection` and gates every
public method through a single `SemaphoreSlim(1, 1)`. The class XML comment
calls this out as deliberate ("the M3 brief calls out as 'cleaner than the M2
Channel<T> pipeline given the volume'"), and the *write* volume is genuinely
low — at most a handful of lifecycle rows per cached call. But on a busy site
the *read* path (`GetStatusAsync`) is called by every `Tracking.Status(id)`
invocation from every executing script, and reads are serialised through the
same gate as writes. A long-running write (e.g. a Roslyn-script-driven
`RecordTerminalAsync` competing with an SQLite checkpoint) holds the gate and
stalls every concurrent status query. SQLite supports concurrent readers with
a single writer in WAL mode; the gate forfeits that capability.
A separate concern in the same class: `Dispose()` calls
`DisposeAsyncCore().AsTask().GetAwaiter().GetResult()`. That is
sync-over-async, the very pattern SiteRuntime-008 was a finding for. If a caller
disposes the store from a synchronization context that does not allow
re-entrance (e.g. an `IHostedService.StopAsync` continuation observed on the
host's sync context, or a finalizer pumping on the thread pool with a stuck
continuation), the `.WaitAsync()` inside `DisposeAsyncCore` waits for a
continuation that will never run, and the dispose deadlocks. The async path
itself is correct; only the sync `Dispose()` wrapper is risky.
**Recommendation**
For the single-connection gate: split reads and writes into separate gates,
or — better — keep the writer single-connection and open a fresh read
connection (or pool of read connections) per `GetStatusAsync` call. SQLite
connections are cheap; the `SiteStorageService` precedent already uses
per-call connections on the read path. For `Dispose()`: prefer
`Dispose() { GC.SuppressFinalize(this); _connection.Dispose(); _gate.Dispose(); }`
without an awaited disposal, and have the `IAsyncDisposable.DisposeAsync`
path do the awaiting. If a synchronous disposable is genuinely needed, do
not bridge it through the async core — duplicate the dispose-once flag check
into a sync path that calls `_connection.Dispose()` directly.
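A hedged sketch of the two independent dispose paths. `TrackingStoreSketch` is a hypothetical stand-in, and its `_connection` is typed as `IDisposable` so the example does not depend on a SQLite provider.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class TrackingStoreSketch : IDisposable, IAsyncDisposable
{
    private readonly IDisposable _connection;          // stand-in for SqliteConnection
    private readonly SemaphoreSlim _gate = new(1, 1);
    private int _disposed;                             // 0 = live, 1 = disposed

    public TrackingStoreSketch(IDisposable connection) => _connection = connection;

    // Synchronous path: no awaiting, so nothing can block on a continuation.
    public void Dispose()
    {
        if (Interlocked.Exchange(ref _disposed, 1) != 0) return;
        _connection.Dispose();
        _gate.Dispose();
    }

    // Asynchronous path: waits for an in-flight operation before tearing down.
    public async ValueTask DisposeAsync()
    {
        if (Interlocked.Exchange(ref _disposed, 1) != 0) return;
        await _gate.WaitAsync().ConfigureAwait(false);

        if (_connection is IAsyncDisposable asyncConnection)
            await asyncConnection.DisposeAsync().ConfigureAwait(false);
        else
            _connection.Dispose();

        _gate.Dispose();
    }
}
```

The shared `_disposed` flag makes whichever path runs first the only one that releases resources, which is the dispose-once check the recommendation asks to duplicate.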
### SiteRuntime-025 — `HandleSetStaticAttribute` persists unknown attribute names as static overrides
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:223`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:246` |
**Description**
`HandleSetStaticAttribute` resolves the target attribute against
`_configuration.Attributes` to decide whether to route the write to the DCL or
treat it as a static-override write. If the lookup fails (`resolved == null`),
`isDataSourced` is false, and execution falls through to
`HandleSetStaticAttributeCore` — which unconditionally:
1. inserts the bogus key into the in-memory `_attributes` dictionary,
2. publishes an `AttributeValueChanged` for the bogus key to the site stream
and to every child Script/Alarm actor,
3. persists a row in `static_attribute_overrides` for the bogus key, and
4. replies `Success = true` to the caller.
Concretely, an inbound API `Route.To().SetAttribute("notARealAttr", "x")`
returns success, pollutes the in-memory state with a key that no script can
legitimately observe (canonical-name lookup will not produce it), persists a
durable SQLite override row that survives restart, and (on every restart)
re-injects the polluting key via `HandleOverridesLoaded` at line 608. Nothing
short of a full redeploy removes the row: `ClearStaticOverridesAsync` clears by
`instance_unique_name`, so a redeploy does eventually clean it up, but in the
meantime each restart resurrects it. The publish-to-stream
side effect also lets a hostile or buggy inbound caller spam debug-view
subscribers with synthetic attribute changes.
Worth flagging at Low: the inbound API surface is already authenticated and
the design assumes its callers are trusted. But the no-validation behaviour
contradicts the design doc's "Scripts can only read/write attributes on their
own instance" framing — an inbound API call inherits the same instance-scope
authority as a script, and the script trust model wouldn't sanction this.
**Recommendation**
In `HandleSetStaticAttribute`, when `resolved == null`, reply
`SetStaticAttributeResponse(Success: false,
ErrorMessage: $"Attribute '{command.AttributeName}' not found on instance
'{_instanceUniqueName}'")` instead of falling through to the override path.
Optionally also surface the existence check on the `RouteInboundApiSetAttributes`
fan-out so a multi-attribute write reports the offending key without rolling
back the others (the per-attribute `Ask` shape already supports a partial
failure response).
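The existence check can be sketched in isolation as follows. `AttributeGateSketch` and `SetAttributeResult` are hypothetical stand-ins for the actor state and the `SetStaticAttributeResponse` message; the DCL routing and stream publication are left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for SetStaticAttributeResponse.
public sealed record SetAttributeResult(bool Success, string? ErrorMessage);

public sealed class AttributeGateSketch
{
    private readonly string _instanceUniqueName;
    private readonly HashSet<string> _configured;

    // Stand-in for the persisted static overrides.
    public Dictionary<string, string> Overrides { get; } = new(StringComparer.Ordinal);

    public AttributeGateSketch(string instanceUniqueName, IEnumerable<string> configuredAttributes)
    {
        _instanceUniqueName = instanceUniqueName;
        _configured = new HashSet<string>(configuredAttributes, StringComparer.Ordinal);
    }

    public SetAttributeResult Set(string attributeName, string value)
    {
        // Reject names that are not part of the deployed configuration
        // before any in-memory or persistent state is touched.
        if (!_configured.Contains(attributeName))
        {
            return new SetAttributeResult(
                false,
                $"Attribute '{attributeName}' not found on instance '{_instanceUniqueName}'");
        }

        Overrides[attributeName] = value;
        return new SetAttributeResult(true, null);
    }
}
```

Placing the check ahead of every side effect is what keeps an unknown name out of the in-memory state, the site stream, and the override table at once.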
### SiteRuntime-026 — `ReplicationMessages.cs` public record types have no XML documentation
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:10`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:13`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:15`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:17`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:19`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:25`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:28`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:30`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:32`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:34` |
**Description**
The ten public record types in `ReplicationMessages.cs`
(`ReplicateConfigDeploy`, `ReplicateConfigRemove`, `ReplicateConfigSetEnabled`,
`ReplicateArtifacts`, `ReplicateStoreAndForward`, `ApplyConfigDeploy`,
`ApplyConfigRemove`, `ApplyConfigSetEnabled`, `ApplyArtifacts`,
`ApplyStoreAndForward`) carry no XML documentation. The file header comment
groups them as "outbound" vs "inbound" but the individual records have no
`<summary>` and no parameter docs. The XML-doc baseline `1eb6e97` rolled out
across the rest of the module (the commit being reviewed is literally `docs:
add XML doc comments across src + Sister Projects section in CLAUDE.md`), so
this file is now the conspicuous outlier — and the `CommentChecker` skill
relied on by the `fixdocs` workflow will flag every record as missing docs.
**Recommendation**
Add a `<summary>` per record naming the direction (outbound → peer / inbound
from peer) and what the operation replicates, and `<param>` docs for each
record parameter. Mirror the precedent in
`src/ScadaLink.Commons/Messages/.../*.cs`. While there, consider sealing the
inbound vs outbound split with a marker base type (currently they're just
named conventionally) so `Receive<ReplicateXxx>` vs `Receive<ApplyXxx>` is
expressed at the type level — but that's optional and out of scope for a
docs-only finding.
+544 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.StoreAndForward` |
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 (3 Deferred: 002, 011, 012; 7 new Open: 018–024; see Re-review 2026-05-28) |
## Summary
@@ -55,6 +55,76 @@ StoreAndForward-017 records that the Retry/Discard activity-log entries hard-cod
`ExternalSystem` category, mislabelling notification and cached-DB-write messages in
the site event log.
#### Re-review 2026-05-28 (commit `1eb6e97`)
Full re-review against commit `1eb6e97` with the same 10-category checklist. The
batch-3 / batch-4 resolutions (001, 003–010, 013–017) are still present and intact; no
regressions detected on prior fixes. Findings 002, 011 and 012 remain validly
`Deferred` (their preconditions are unchanged) and findings 005, 006, 010, 013, 014,
016, 017 are confirmed `Resolved` against the current source.
This pass surfaced **seven new findings** clustered around two themes:
The first theme is **design-doc drift on the notification path**, which has acquired
two now-real defects since the engine became central-targeted. `StoreAndForward-018`
(High) records that a corrupt notification payload, which
`NotificationForwarder.DeliverAsync` handles by returning `false`, parks a
notification on its first retry-sweep encounter, despite the design doc stating
"Notifications do not park" (line 47, "Parking applies only to the
external-system-call and cached-database-write categories"). The same path would
become a poison-payload retry-forever trap on the active node if the engine ever
softened the `false` semantics. `StoreAndForward-019` (Medium) records the
sibling defect: notifications are enqueued with `MaxRetries` defaulting to
`StoreAndForwardOptions.DefaultMaxRetries` (50), and the legacy SMTP path
(`NotificationDeliveryService.SendAsync`) passes a positive bounded
`smtpConfig.MaxRetries`, so an unreachable central will silently park notifications
after a finite retry budget rather than "retry at the fixed forward interval until
central acks" as the design requires. The contract `0 = no limit` is not enforced
for the notification category.
The second theme is **subtle correctness and contract gaps around the operator paths**
that survived the StoreAndForward-016/017 batch. `StoreAndForward-020` (Medium) records
that `RetryParkedMessageAsync` skips replication entirely if `GetMessageByIdAsync`
returns null after a successful local requeue (a narrow but real race window with a
concurrent discard / sweep delete), re-introducing the StoreAndForward-016 standby
divergence in that corner. `StoreAndForward-021` (Medium) is a design-doc-vs-code drift
that should be reconciled in the doc: the **operation tracking table** is documented
inside Component-StoreAndForward.md as an S&F responsibility (lines 21, 49, 77–87, 108,
114), but the actual `OperationTrackingStore` lives in
`src/ScadaLink.SiteRuntime/Tracking/` and is not consumed by S&F at all, as the
brief's own note flags. Either the design doc should be updated to point at
SiteRuntime, or the store should be moved to StoreAndForward.
`StoreAndForward-022` (Low) records that `_cachedCallObserver` silently drops audit
telemetry when a buffered cached-call's `Id` is not a parseable `TrackedOperationId`
GUID: the engine returns from `NotifyCachedCallObserverAsync` before emitting anything.
The engine's own default minting produces "N"-formatted GUIDs, which
`TrackedOperationId.TryParse` accepts, but any caller passing a custom non-GUID id
silently bypasses the entire
`Submitted/Forwarded/Attempted/Delivered/Parked/Discarded` audit lifecycle. `StoreAndForward-023` (Low)
records that `siteId` is silently defaulted to `string.Empty` when no
`IStoreAndForwardSiteContext` is registered, so a misconfigured host produces audit
telemetry with `SourceSite = ""` and the central audit-log's `(SourceSite,
TrackedOperationId)` correlation degrades to a per-id-only index. `StoreAndForward-024`
(Low) is a stop-time ordering defect: `StopAsync` disposes the timer but a
mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
`_replication` after `StopAsync` returns; downstream resources disposed by the host
shutdown sequence (the DI container) can then throw inside the still-running sweep (for example `ObjectDisposedException` from a disposed `SqliteConnection`).
## Checklist coverage — Re-review 2026-05-28
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Notification corrupt-payload parks contrary to design (018); RetryParkedMessageAsync skips replication when message reload races a deletion (020). |
| 2 | Akka.NET conventions | ☑ | `ParkedMessageHandlerActor` uses `PipeTo` correctly with success/failure projections (007 resolution preserved). No new findings. |
| 3 | Concurrency & thread safety | ☑ | Sweep-vs-stop race: a timer callback running while `StopAsync` returns can touch disposed dependencies (024). |
| 4 | Error handling & resilience | ☑ | Notifications park after `DefaultMaxRetries` exhaustion (019) — contradicts the design doc's "retried until central acks". |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout, payload JSON opaque, no secret material handled. |
| 6 | Performance & resource management | ☑ | No new findings — the connection-per-call documented trade-off and pooled `OpenAsync` remain acceptable. |
| 7 | Design-document adherence | ☑ | Operation Tracking Table documented in StoreAndForward but actually lives in SiteRuntime (021); notification non-parking guarantee broken by 018 + 019. |
| 8 | Code organization & conventions | ☑ | `IStoreAndForwardSiteContext` silently defaults `SiteId` to empty (023) — a configuration hole rather than an entity placement issue. |
| 9 | Testing coverage | ☑ | The seven new findings have no regression tests in `tests/ScadaLink.StoreAndForward.Tests/` — particularly the notification-doesn't-park invariant (018, 019), the requeue-after-reload-null replication gap (020), and the stop-during-sweep behaviour (024). |
| 10 | Documentation & comments | ☑ | `CachedCallAttemptOutcome.ParkedMaxRetries` XML doc says "S&F semantics" but the code applies it to notifications too if 018/019 fire — minor drift, captured under 018. The `TrackedOperationId.TryParse` silent-skip behaviour in `NotifyCachedCallObserverAsync` is documented in the source but not on the public observer contract (022). |
## Checklist coverage
| # | Category | Examined | Notes |
@@ -914,3 +984,474 @@ the StoreAndForward-016 replication) — and pass it to `RaiseActivity` (falling
`RetryParkedMessageAsync_ActivityUsesMessageRealCategory` and
`DiscardParkedMessageAsync_ActivityUsesMessageRealCategory` assert the activity carries
`Notification` / `CachedDbWrite` respectively; both fail against the pre-fix code.
### StoreAndForward-018 — Notification corrupt-payload parks the buffered message, contradicting the "notifications do not park" design invariant
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/NotificationForwarder.cs:62`–`:69`, `:105`–`:122`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:369`–`:397` |
**Description**
The Component design doc explicitly carves out notifications from the parking lifecycle:
> "Notifications do not park — they are retried at the fixed forward interval until
> central acks." (`docs/requirements/Component-StoreAndForward.md:47`)
> "Parking applies only to the external-system-call and cached-database-write
> categories." (same line)
`NotificationForwarder.DeliverAsync` violates this. When `TryBuildSubmit` fails to
deserialize the buffered payload — either because `JsonSerializer.Deserialize` throws a
`JsonException` (line 114) or because it returns `null` (line 119) — `DeliverAsync`
returns `false` (line 68). On the **retry path** the S&F engine treats handler `false`
as a permanent failure and **parks the message immediately** via the conditional
`UpdateMessageIfStatusAsync(... Parked)` write at `StoreAndForwardService.cs:373`–`:385`.
Result: a notification with a corrupt buffered payload (a row that the engine itself
treats as opaque: "Payload: Serialized message content…",
`Component-StoreAndForward.md:110`) enters the parked state and surfaces in the central UI's parked-message list
under the `Notification` category, contradicting the doc's invariant and the resolved
StoreAndForward-017's "Notification / CachedDbWrite" Retry/Discard category mapping.
The defect is real today: the inline comment on `NotificationForwarder.cs:64` even
documents the violation ("An unreadable payload cannot be fixed by retrying — park it
(return false)") as the intended behaviour, but that behaviour is what the design doc
forbids. Either the doc needs to acknowledge a poison-payload parking exception for
notifications, or the forwarder needs a different escape hatch (discard? log + drop?
permanent-failure-as-`true` to clear the buffer?). Today there is no consistent answer
between code and design.
Additionally, on the **immediate-delivery** path (a fresh enqueue followed by a
`DeliverAsync` returning `false`), the engine returns `WasBuffered: false` and the row
is never persisted — so the corrupt-payload "park" only occurs on the retry path, where
the message has already been buffered (and replicated to the standby). The
**inconsistency between the two paths** ("not buffered" vs "parked") for the same
permanent-failure outcome is itself a contract surprise; the resolved StoreAndForward-004
documents the immediate vs retry asymmetry, but does not anticipate that the retry
asymmetry will violate a per-category invariant.
**Recommendation**
Choose one consistent reconciliation. Preferred option: change
`NotificationForwarder.DeliverAsync` to **discard** a corrupt payload rather than park it: delete the
buffered row directly, log a Site Event Log entry under `Discard`, and return `true` so
the engine clears the buffer. This preserves the design's "notifications do not park"
invariant. Alternatives: (a) update the design doc to acknowledge a poison-payload
parking exception specifically for notifications, and revise the resolved
StoreAndForward-017 wording; (b) treat `JsonException` as transient (would retry-forever
on a corrupt payload — bad); (c) introduce a per-category park-allowed flag on the
engine and gate the retry-path park behind it for the Notification category.
Add a regression test in
`tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` asserting that a
corrupt-payload notification reaches a terminal **non-Parked** state; today the
corrupt-payload behaviour is uncovered.
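The preferred "discard instead of park" option hinges on classifying an unreadable payload before any delivery attempt. A minimal sketch of that classification follows, assuming a hypothetical `NotificationPayload` record and a hypothetical `NotificationAttempt` enum (neither is taken from the reviewed source); the real forwarder would still need to delete the buffered row and log the discard.

```csharp
using System.Text.Json;

// Hypothetical payload shape; the real buffered notification type is not shown in this review.
public sealed record NotificationPayload(string ListName, string Subject, string Body);

// Illustrative outcomes for one delivery attempt (assumed names).
public enum NotificationAttempt { Forward, DiscardCorrupt }

public static class CorruptPayloadPolicy
{
    // Classifies a buffered payload. Malformed JSON and a literal JSON null both
    // map to DiscardCorrupt, mirroring the two failure modes described for
    // TryBuildSubmit. Neither case is reported as a parkable failure.
    public static NotificationAttempt Classify(string payloadJson, out NotificationPayload? payload)
    {
        try
        {
            payload = JsonSerializer.Deserialize<NotificationPayload>(payloadJson);
        }
        catch (JsonException)
        {
            payload = null;
        }

        return payload is null ? NotificationAttempt.DiscardCorrupt : NotificationAttempt.Forward;
    }
}
```

A handler built on this would return `true` for `DiscardCorrupt` so the engine clears the buffer, which keeps the notification out of the parked list.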
**Resolution**
_Unresolved._
### StoreAndForward-019 — Notifications park after `DefaultMaxRetries` exhaustion, contradicting "retried until central acks"
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:229`, `:407`–`:437`; `src/ScadaLink.StoreAndForward/StoreAndForwardOptions.cs:18`; `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:1773`–`:1778`; `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:149`–`:156` |
**Description**
The design doc requires a buffered notification to be retried indefinitely until
central acks:
> "The **notification** category retries differently: it has no source-entity setting.
> The site→central forward uses a single fixed retry interval configured in the host
> `appsettings.json`. … A buffered notification is retried until central acks it; it is
> not parked on a retry limit (central, once reachable, owns delivery, retry, and
> parking from that point on)." (`Component-StoreAndForward.md:55`–`:59`)
The current engine cannot honour that. `RetryMessageAsync` enforces parking at
`message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
(`StoreAndForwardService.cs:407`); a `MaxRetries == 0` is the documented "no limit"
escape hatch (now correctly explained by the resolved StoreAndForward-015). But the two
notification enqueue paths both supply a positive bounded `MaxRetries`:
- `ScriptRuntimeContext.cs:1773`–`:1778` (the `Notify.Send` site script path) calls
`EnqueueAsync` without supplying the `maxRetries` argument, so the engine
defaults to `StoreAndForwardOptions.DefaultMaxRetries = 50`
(`StoreAndForwardOptions.cs:18`). After 50 retry sweeps with central unreachable, the notification is parked.
- `NotificationDeliveryService.cs:149`–`:156` (the legacy SMTP-style path retained for
the central-side `INotificationDeliveryService` callers) passes
`smtpConfig.MaxRetries > 0 ? smtpConfig.MaxRetries : null`; `null` falls back to the
same 50-retry default, and any positive `smtpConfig.MaxRetries` still bounds the
retry budget. Either way, a long central outage parks the notification.
A parked notification cannot be cleared by a central recovery: it stays parked until an
operator clicks **Retry** in the parked-message UI. The design's invariant — that
notification delivery converges automatically as soon as central is reachable — is
broken: an extended central outage requires manual intervention to clear the backlog,
which is exactly the behaviour the central-only outbox redesign was meant to remove
from the site.
This is closely related to (but distinct from) StoreAndForward-018: 018 is the
*permanent-failure-path* parking violation; 019 is the *transient-failure-path*
parking violation under the engine's normal max-retries policy.
**Recommendation**
Make the notification enqueue paths pass `maxRetries: 0` so the documented "no limit /
never parked" semantics apply, and guard against regression by adding an integration
test in `tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` that runs
a sweep many more times than `DefaultMaxRetries` against an always-failing handler and
asserts the buffered notification's status stays `Pending` (not `Parked`). A cleaner
alternative is to special-case the `Notification` category inside
`RetryMessageAsync`'s max-retries guard (treat it as `MaxRetries == 0` regardless of
the field value) so the invariant is enforced at the single chokepoint rather than
relying on every caller to pass the right value — this also fixes the legacy
`NotificationDeliveryService` path without editing the consumer.
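The single-chokepoint alternative can be sketched as a pure predicate. This is an illustration under stated assumptions, not the engine's actual code: `Category` and `RetryPolicy.ShouldPark` are hypothetical stand-ins for `StoreAndForwardCategory` and the guard inside `RetryMessageAsync`.

```csharp
// Hypothetical stand-in for StoreAndForwardCategory; member names are assumed.
public enum Category { ExternalSystem, CachedDbWrite, Notification }

public static class RetryPolicy
{
    // Returns true when a message should be parked after a failed retry.
    // Notifications are treated as unbounded regardless of the stored
    // MaxRetries, so no caller has to remember to pass maxRetries: 0.
    public static bool ShouldPark(Category category, int maxRetries, int retryCount)
    {
        if (category == Category.Notification)
            return false; // retried until central acks, never parked on a limit

        // MaxRetries == 0 keeps its documented "no limit" meaning elsewhere.
        return maxRetries > 0 && retryCount >= maxRetries;
    }
}
```

Keeping the rule in one predicate also gives the regression test a cheap unit-level target alongside the sweep-based integration test.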
**Resolution**
_Unresolved._
### StoreAndForward-020 — `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:599`–`:616` |
**Description**
The StoreAndForward-016 resolution wired Requeue replication into operator-initiated
retry. The fix uses a two-step pattern:
```csharp
public async Task<bool> RetryParkedMessageAsync(string messageId)
{
    var success = await _storage.RetryParkedMessageAsync(messageId);   // step 1
    if (success)
    {
        var message = await _storage.GetMessageByIdAsync(messageId);   // step 2 (no txn)
        var category = message?.Category ?? StoreAndForwardCategory.ExternalSystem;
        if (message != null)
        {
            _replication?.ReplicateRequeue(message);                   // step 3
        }
        RaiseActivity("Retry", category, ...);
    }
    return success;
}
```
The two storage calls are on separate connections with no surrounding transaction. A
concurrent writer between step 1 (which moved the row from Parked → Pending) and step 2
(which re-reads the row) can delete or mutate the row:
- An operator who issues `DiscardParkedMessageAsync` immediately after retry — the
`DiscardParkedMessageAsync` storage call is conditional on `status = Parked`, so it
will be a no-op (correct), but a sweep that succeeds in delivering the just-requeued
row will then call `_storage.RemoveMessageAsync` (unconditional), which deletes it.
In a single retry-sweep cycle this race is real because `DefaultRetryInterval = Zero`
is the standard test default and the operator action and a sweep tick can overlap.
- A `RemoveMessageAsync` runs in step 1's wake; `GetMessageByIdAsync` returns null;
step 3 (`_replication?.ReplicateRequeue`) is **skipped entirely**, but step 1
already requeued the row locally. The standby is now left in `Parked` state while
the active node has Pending-then-Deleted, exactly the standby-divergence StoreAndForward-016
was supposed to fix. (A subsequent failover away from the active node then lands on a
standby copy that is still Parked, the same regression 016 already documented.)
The category-fallback path (`StoreAndForwardCategory.ExternalSystem` when message is
null) silently mislabels the activity log entry too — the same defect that
StoreAndForward-017 fixed for the non-racy path, except this branch handed back a
hard-coded fallback rather than re-loading. The activity log entry is a minor side
effect; the missing replication is the real defect.
**Recommendation**
Capture the message **once**, before the local Parked → Pending storage update, so the
replication path has the row in hand even if a concurrent writer deletes it
afterwards:
```csharp
var message = await _storage.GetMessageByIdAsync(messageId); // before the update
if (message == null || message.Status != StoreAndForwardMessageStatus.Parked)
    return false;
var success = await _storage.RetryParkedMessageAsync(messageId);
if (!success) return false;
// `message` was the parked row; the active node just wrote it back to Pending with
// retry_count = 0 — construct the replicated state from those known mutations.
message.Status = StoreAndForwardMessageStatus.Pending;
message.RetryCount = 0;
message.LastError = null;
message.LastAttemptAt = null;
_replication?.ReplicateRequeue(message);
RaiseActivity("Retry", message.Category, $"Parked message {messageId} moved back to queue");
return true;
```
Add a regression test in `StoreAndForwardReplicationTests` that simulates the
delete-between-update-and-reload race and asserts the `Requeue` replication
operation is still emitted with the correct category.
**Resolution**
_Unresolved._
### StoreAndForward-021 — Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-StoreAndForward.md:21`, `:49`–`:51`, `:77`–`:87`, `:108`, `:114`; `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:37`; `src/ScadaLink.StoreAndForward/` (whole module) |
**Description**
Component-StoreAndForward.md repeatedly assigns the **Operation Tracking Table** to
this component:
- **Responsibilities** (line 21): "Maintain a site-local **operation tracking table**
holding one row per `TrackedOperationId` for cached calls … the authoritative status
record consulted by `Tracking.Status(id)`."
- **Message Lifecycle** (lines 49–51): "the operation tracking table is the status
record and the S&F buffer is purely the retry mechanism. A cached call that succeeds
on its first immediate attempt is written directly as a terminal `Delivered` tracking
row and never enters the S&F buffer."
- **Operation Tracking Table** section (lines 77–87): "Alongside the S&F buffer DB,
each site node holds a **site-local operation tracking table** in SQLite. … Each row
records the operation kind (`TrackedOperationKind`) …"
The actual implementation lives outside this module:
`src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs` (and `IOperationTrackingStore`, `OperationTrackingOptions`).
The StoreAndForward project contains no references to the tracking store, owns no
`operation_tracking` table, and `StoreAndForwardService.NotifyCachedCallObserverAsync`
is only a hook handing telemetry context to an `ICachedCallLifecycleObserver` — the
audit bridge wired in `ScadaLink.AuditLog`. The S&F module is **not** the table's
owner; SiteRuntime is.
This is a real design-doc drift, not a code defect, and is flagged explicitly in the
brief's "Module-specific notes". The drift matters because the design doc's
discussion of the lifecycle — "immediate success writes a terminal Delivered tracking
row directly here", "operator discard sets terminal `Discarded`", "central never
mutates the mirror row directly" — places coordination responsibilities on the wrong
component. A reader looking for the source of truth for `Tracking.Status(id)` would
read `Component-StoreAndForward.md` and search `src/ScadaLink.StoreAndForward/` in
vain. The doc also lists Site Call Audit / Audit Log telemetry-emission as a S&F
responsibility (line 22), but the emission actually happens via the `AuditLog` site
component subscribing to `ICachedCallLifecycleObserver`.
**Recommendation**
Reconcile the doc with the code. The simplest fix is doc-side: update
Component-StoreAndForward.md to scope its responsibilities back to the retry
mechanism + replication + parked-message management, and add a cross-reference to a
new (or existing) component doc for Operation Tracking (Component-SiteRuntime.md, or
a new Component-OperationTracking.md). The code is internally consistent — the audit
bridge subscribes to the observer hook, the SiteRuntime store writes the rows, the S&F
engine emits attempt telemetry on the cached-call hot path — but the design doc is
several refactors out of date. The hierarchical map should be:
- `Component-StoreAndForward.md` → S&F buffer + Replication + Parked-message
management + Notification forwarding to central + cached-call telemetry **hook**.
- New doc / SiteRuntime doc → Operation Tracking Table semantics and lifecycle.
- `Component-SiteCallAudit.md` / `Component-AuditLog.md` → telemetry emission +
central-side mirror.
**Resolution**
_Unresolved._
### StoreAndForward-022 — `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:484`–`:515` |
**Description**
`NotifyCachedCallObserverAsync` (the per-attempt observer notifier wired by the M3
Bundle E rollout) bails out with no audit emission when
`TrackedOperationId.TryParse(message.Id, out var trackedId)` returns false
(`StoreAndForwardService.cs:510`–`:515`). The inline comment justifies the behaviour as
back-compat for "pre-M3 message (random GUID-N id from S&F itself, no
TrackedOperationId threaded in)", but the documented contract is broken in two ways:
1. **Silent dropping of every audit row, not just the first one.** The skip means no
`Attempted` row, no `CachedResolve` terminal row, no audit trail at all for that
operation's S&F lifecycle — yet the rest of the system (script trust boundary,
parked-message UI, etc.) still treats the operation as audit-tracked. The drop is
not surfaced via a metric, log warning (the path is a silent `return`), or counter,
so a misconfigured caller bypasses the audit hot path with zero feedback.
2. **The contract is hidden in field-level XML.** The `ICachedCallLifecycleObserver`
public interface contract (defined in `ScadaLink.Commons`) does not document that
the observer will be silently skipped when the underlying S&F message id is not a
GUID. A consumer reading the interface contract reasonably expects every cached-call
attempt to surface — the audit pipeline depends on it. The silent-drop is an
implementation detail of the S&F bridge that should be either lifted onto the
contract or removed.
The engine itself mints GUID-N ids via `Guid.NewGuid().ToString("N")` (line 224), which
`TrackedOperationId.TryParse` accepts, so the skip path is unreachable for engine-minted
ids. It is reachable only for callers that supply their own `messageId` argument with a
non-GUID format. The current callers (`NotificationOutbox` enqueue path with
NotificationId, cached-call enqueue path with `TrackedOperationId.ToString()`) all
supply GUID-shaped ids. The defect is latent — a future caller passing a non-GUID id
would silently bypass audit.
**Recommendation**
Two options. The cheap fix: change the skip to a `_logger.LogWarning` with the offending
id so a misconfigured caller is observable, and update the
`ICachedCallLifecycleObserver` XML doc to mention the "non-GUID id → no telemetry"
contract explicitly. The more correct fix: emit a still-audited row for the
non-GUID case (e.g. synthesise a `TrackedOperationId` from the underlying id, or emit a
distinguished "tracking-id-missing" audit row) so the audit pipeline never has silent
holes. Add a regression test in `CachedCallAttemptEmissionTests` capturing the chosen
contract — the existing
`Attempt_MessageIdNotAGuid_NoObserverNotification` test pins today's silent-skip; if
the fix is "log + skip", that test should be updated to also assert the log emission;
if the fix is "emit anyway", the test should be replaced.
**Resolution**
_Unresolved._
### StoreAndForward-023 — `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/ServiceCollectionExtensions.cs:43`–`:53`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:99`, `:524` |
**Description**
`AddStoreAndForward`'s service-collection factory resolves the optional
`IStoreAndForwardSiteContext` and falls back to `string.Empty` when not registered:
```csharp
var siteContext = sp.GetService<IStoreAndForwardSiteContext>();
var siteId = siteContext?.SiteId ?? string.Empty;
return new StoreAndForwardService(storage, options, logger, replication,
    cachedCallObserver, siteId);
```
The constructor's parameter is even defaulted to `""`. The empty-string `siteId` flows
straight into every emitted `CachedCallAttemptContext.SourceSite`, which the central
audit pipeline uses as part of the `(SourceSite, TrackedOperationId)` correlation key.
A host that registers an `ICachedCallLifecycleObserver` (the audit observer wired by
`AddAuditLog`) but forgets to register `IStoreAndForwardSiteContext` will produce a
stream of telemetry rows with `SourceSite = ""` — the central audit mirror cannot
distinguish them by site, and the central-site routing of
`RetryParkedOperation`/`DiscardParkedOperation` commands keyed on `SourceSite` will
fail to find the owning site.
The Host's `IStoreAndForwardSiteContext` adapter and the audit observer registration
are wired in lock-step, so the current configuration is correct, but the silent
empty-string fallback is a contract hazard for future hosts (CLI test harness, second
site cluster topology, etc.) and for tests that wire one without the other.
**Recommendation**
Make the contract explicit: when `cachedCallObserver` is non-null, require
`IStoreAndForwardSiteContext` to be registered — throw an `InvalidOperationException`
with a clear "Audit observer registered without a site context — register
IStoreAndForwardSiteContext" message at construction time. When the audit observer is
absent (no `AddAuditLog`), keep the empty-string default since `_siteId` is unused.
Alternatively, change `siteId` from a parameter to a `Func<string>` resolved lazily
from the service provider so a late-registered context still takes effect.
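The fail-fast variant of this recommendation might look like the sketch below. `SiteIdGuard.Resolve` is a hypothetical helper introduced only for illustration; in the real module the check would sit in the `AddStoreAndForward` factory or the `StoreAndForwardService` constructor.

```csharp
using System;

public static class SiteIdGuard
{
    // Hypothetical helper: resolves the site id stamped onto audit telemetry.
    // observerRegistered is true when an ICachedCallLifecycleObserver is wired.
    public static string Resolve(bool observerRegistered, string? siteId)
    {
        // Without an observer the site id is never emitted, so an empty
        // default is harmless and keeps existing hosts working.
        if (!observerRegistered)
            return siteId ?? string.Empty;

        // With an observer, a missing site id would corrupt the
        // (SourceSite, TrackedOperationId) correlation key, so fail fast.
        if (string.IsNullOrWhiteSpace(siteId))
            throw new InvalidOperationException(
                "Audit observer registered without a site context; register IStoreAndForwardSiteContext.");

        return siteId;
    }
}
```

Throwing at construction time surfaces the misconfiguration at host startup rather than as unattributed rows in the central audit log.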
**Resolution**
_Unresolved._
### StoreAndForward-024 — `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:122`–`:127`, `:136`–`:143`, `:303`–`:329` |
**Description**
`StartAsync` arms `_retryTimer` with `_ => _ = RetryPendingMessagesAsync()` (line 123).
The `_ =` discards the returned `Task`, so when the timer fires the sweep runs **fire
and forget** on a thread-pool thread. `StopAsync` (lines 136–143) disposes the timer:
```csharp
if (_retryTimer != null)
{
    await _retryTimer.DisposeAsync();
    _retryTimer = null;
}
```
`Timer.DisposeAsync()` returns once any in-flight timer **callback** has completed —
but the timer callback in this service is a one-line `_ = RetryPendingMessagesAsync()`
that synchronously returns immediately and leaves the actual sweep running on the
thread pool. So `Timer.DisposeAsync` does not wait for the sweep; only for the
synchronous `_ = ...` discarding step. `StopAsync` returns while a sweep is potentially
still running, touching `_storage` (which the host will dispose), `_replication`
(which the host will tear down), and `_cachedCallObserver` (whose downstream gRPC
channel the host will shut down).
The host shutdown sequence (`AkkaHostedService`) tears down the actor system and the
DI container after this service's `StopAsync` completes — meaning a sweep that runs
past `StopAsync` can call into disposed `SqliteConnection`s (yielding
`ObjectDisposedException`, caught by the sweep's outer `try/catch` as a log) or, more
seriously, push a replication operation into a half-disposed Akka actor pipeline and
trigger noisy dead-letter warnings during a clean shutdown.
The race window is small (the sweep typically finishes in <100 ms in tests) but it is
real, particularly when shutting down a site under load with a non-empty buffer.
**Recommendation**
Track in-flight sweep tasks and `await` them in `StopAsync`:
```csharp
private Task? _currentSweep;

public async Task StopAsync()
{
    if (_retryTimer != null)
    {
        await _retryTimer.DisposeAsync();
        _retryTimer = null;
    }
    if (_currentSweep is { } sweep)
    {
        try { await sweep; } catch { /* logged inside RetryPendingMessagesAsync */ }
    }
}
```
Change the timer callback to:
```csharp
_retryTimer = new Timer(_ => _currentSweep = RetryPendingMessagesAsync(), ...);
```
Add a `CancellationTokenSource` so a long sweep can be cooperatively aborted on stop;
plumb the cancellation token into `_storage` / `_replication` / `_cachedCallObserver`
calls. Add a regression test in `StoreAndForwardServiceTests` that calls `StopAsync`
mid-sweep and asserts no further storage activity occurs after `StopAsync` returns.
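For the cooperative-cancellation part, one possible shape is sketched below. It deliberately swaps the `Timer` callback for a `PeriodicTimer` loop, which is a different technique from the one shown above, and `SweepLoop` is a hypothetical class rather than anything in the reviewed source. It assumes .NET 6 or later and a strictly positive interval (`PeriodicTimer` rejects `TimeSpan.Zero`).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sweep host: owns the loop task so that stopping can await it.
public sealed class SweepLoop : IAsyncDisposable
{
    private readonly CancellationTokenSource _cts = new();
    private readonly Func<CancellationToken, Task> _sweep;
    private readonly Task _loop;

    public SweepLoop(TimeSpan interval, Func<CancellationToken, Task> sweep)
    {
        _sweep = sweep;
        _loop = RunAsync(interval);
    }

    private async Task RunAsync(TimeSpan interval)
    {
        using var timer = new PeriodicTimer(interval);
        try
        {
            // Sweeps run one at a time; the next tick is awaited only after
            // the previous sweep has finished.
            while (await timer.WaitForNextTickAsync(_cts.Token))
            {
                try { await _sweep(_cts.Token); }
                catch (OperationCanceledException) { break; }
                catch (Exception) { /* a real service would log here */ }
            }
        }
        catch (OperationCanceledException)
        {
            // Cancelled while waiting for the next tick: normal shutdown.
        }
    }

    // Cancels any in-flight sweep and waits for the loop to exit, so no
    // sweep code runs after this method returns. Not safe to call twice.
    public async ValueTask DisposeAsync()
    {
        _cts.Cancel();
        await _loop;
        _cts.Dispose();
    }
}
```

Because the loop task is owned by the class, the "no storage activity after stop" assertion in the suggested regression test reduces to checking that the sweep delegate is not invoked once `DisposeAsync` has completed.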
**Resolution**
_Unresolved._
+369 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.TemplateEngine` |
| Design doc | `docs/requirements/Component-TemplateEngine.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 6 |
## Summary
@@ -48,8 +48,49 @@ Both are limited-impact (nested compositions are the less common case and there
is design-time visibility) but represent genuine drift from the recursive-nesting
design promise.
#### Re-review 2026-05-28 (commit `1eb6e97`)
Re-reviewed the whole module against all ten checklist categories at commit
`1eb6e97`. All sixteen prior findings remain closed. Six new findings surfaced,
clustered in three themes:
1. **Revision-hash / diff coverage gaps**: `RevisionHashService` and
`DiffService` both omit `Attributes.Description`, `Alarms.Description`, and
the entire `Connections` map. A change that only edits an attribute/alarm
description, or a data-connection endpoint, will deploy a new flattened
configuration but be invisible to staleness detection and the diff view —
the very gap the revision hash was introduced to close (TemplateEngine-017,
TemplateEngine-018). Severity Medium/High.
2. **TemplateEngine-013 fix only partially applied** — the `0`-as-no-parent
sentinel was removed from `CycleDetector` but `TemplateResolver
.BuildInheritanceChain` still uses `currentId != 0` / `ParentTemplateId ?? 0`.
A template with a real Id of 0 is treated as "no template" and silently
excluded from its own inheritance chain, so every flatten/resolve through
that template loses its members. The fix from `adb5e75` did not propagate
into the resolver (TemplateEngine-019). Severity Medium.
3. **Audit log integrity / drift** — every `Create` audit entry in
`TemplateService` and `SharedScriptService` is written with `EntityId = "0"`
*before* `SaveChangesAsync` populates the real key, so the audit trail loses
the link back to the created row (TemplateEngine-020); `MoveTemplateAsync`
never validates folder-acyclicity / sibling-name-uniqueness even though
`TemplateFolderService.MoveFolderAsync` does (TemplateEngine-021); and the
advertised `IS NOT_locked & not_LockedInDerived & not_IsInherited`
self-reference loop is intact, but `LockEnforcer.ValidateLockChange` permits
downgrading a `LockedInDerived` flag on a base template — there is no
equivalent of the once-locked-stays-locked rule for the `LockedInDerived`
flag (TemplateEngine-022). Severity Low/Medium.
Themes: hash/diff drift from the deployment payload, asymmetric application of
the duplicate-Id / null-sentinel fix from the last batch, and audit-write
ordering inconsistency between `TemplateService` (logs then saves) and
`InstanceService` (saves then logs).
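To illustrate the second theme, the inheritance walk can be written without any `0` sentinel by carrying the parent as a nullable id. This is a sketch under assumptions: `InheritanceChain.Build` and the `parentOf` lookup are hypothetical and do not reflect the actual `TemplateResolver.BuildInheritanceChain` signature.

```csharp
using System.Collections.Generic;

public static class InheritanceChain
{
    // Walks from startId up through its ancestors. parentOf maps a template
    // id to its parent id, or null when the template has no parent. A real
    // id of 0 is an ordinary key here; only null ends the walk.
    public static List<int> Build(int startId, IReadOnlyDictionary<int, int?> parentOf)
    {
        var chain = new List<int>();
        var seen = new HashSet<int>(); // defensive stop if the data contains a cycle
        int? current = startId;

        while (current is int id && seen.Add(id))
        {
            chain.Add(id);
            current = parentOf.TryGetValue(id, out var parent) ? parent : null;
        }

        return chain;
    }
}
```

With `parentOf = { 1 → 0, 0 → null }`, a walk from template 1 includes template 0 in the chain, which is the case the sentinel-based loop is described as dropping.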
## Checklist coverage
_Re-review (2026-05-17, `39d737e`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Prior bugs (001–005, 013) all resolved and verified. Re-review 2026-05-17 found two new nested-composition defects: rename does not cascade (TemplateEngine-015), composed-script `ParentPath` always empty (TemplateEngine-016). |
@@ -63,6 +104,21 @@ design promise.
| 9 | Testing coverage | ✓ | Tests exist for every file, but the dead/placeholder paths (TemplateEngine-004, 005) and deep nesting (TemplateEngine-001) are not exercised. |
| 10 | Documentation & comments | ✓ | Mostly accurate; a misleading converter comment (TemplateEngine-011) and a stale enum/doc mismatch (TemplateEngine-012). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | New: `TemplateResolver.BuildInheritanceChain` still uses the `0`-as-no-parent sentinel that was removed from `CycleDetector` in `adb5e75` (TemplateEngine-019). `TemplateService.MoveTemplateAsync` performs no folder-acyclicity or sibling-name-uniqueness check (TemplateEngine-021). |
| 2 | Akka.NET conventions | ✓ | No actors. `AddTemplateEngineActors` is still an empty placeholder. Nothing to assess. |
| 3 | Concurrency & thread safety | ✓ | Services remain stateless, scoped per request. No new findings. |
| 4 | Error handling & resilience | ✓ | `Result<T>` used consistently. `MoveTemplateAsync` is missing target-folder validation found elsewhere — see TemplateEngine-021. |
| 5 | Security | ✓ | No new findings. Forbidden-API limitations still tracked under the closed TemplateEngine-006 (resolved as advisory). |
| 6 | Performance & resource management | ✓ | `MergeHiLoConfig` / `PrefixTriggerAttribute` allocate a `MemoryStream` + `Utf8JsonWriter` + `Encoding.UTF8.GetString` per call — fine for the per-flatten frequency, no finding. No new resource leaks. |
| 7 | Design-document adherence | ✓ | New drift: `RevisionHashService` and `DiffService` both omit `Description` fields and the `Connections` map from the deployable payload (TemplateEngine-017, TemplateEngine-018), so the revision hash and diff do not reflect every committed deployment input. |
| 8 | Code organization & conventions | ✓ | Audit-write ordering asymmetric: `TemplateService.Create*` and `SharedScriptService.CreateSharedScriptAsync` log with `EntityId = "0"` before `SaveChangesAsync`, while `InstanceService.CreateInstanceAsync` saves first then logs with the real Id (TemplateEngine-020). |
| 9 | Testing coverage | ✓ | New finding paths exercised in part — `RevisionHashServiceTests` does not assert that Description / Connections changes change the hash; no test for `BuildInheritanceChain` with a real Id of 0; no test for `MoveTemplateAsync` rejecting a target folder. |
| 10 | Documentation & comments | ✓ | New: `LockEnforcer.ValidateLockChange` is documented as enforcing the once-locked-stays-locked rule but has no equivalent for `LockedInDerived` (TemplateEngine-022). |
## Findings
### TemplateEngine-001 — Deeply nested composed members are dropped during flattening
@@ -780,3 +836,313 @@ passes the enclosing module's `prefix` — and the `ScriptScope` now sets
`SelfPath = "Outer.Inner"` pairs with `ParentPath = "Outer"` and `Parent.X`
resolves against the real parent module. Regression test:
`Flatten_NestedComposedScript_ScopeCarriesCorrectParentPath`.
### TemplateEngine-017 — Revision hash and diff both ignore `Description` and `Connections`, defeating staleness detection for real deployment changes
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Flattening/RevisionHashService.cs:128`, `src/ScadaLink.TemplateEngine/Flattening/RevisionHashService.cs:156`, `src/ScadaLink.TemplateEngine/Flattening/RevisionHashService.cs:42`, `src/ScadaLink.TemplateEngine/Flattening/DiffService.cs:110`, `src/ScadaLink.TemplateEngine/Flattening/DiffService.cs:118` |
**Description**
The design states the revision hash is "computed from the resolved content" and
backs both staleness detection and diff correlation. The `Hashable*` records,
however, omit fields that are part of the deployed `FlattenedConfiguration`:
- `HashableAttribute` skips `ResolvedAttribute.Description` and the resolved
connection name/protocol (`BoundDataConnectionName`/`BoundDataConnectionProtocol`).
- `HashableAlarm` skips `ResolvedAlarm.Description`.
- The top-level `HashableConfiguration` skips the entire `Connections` map —
the `ConnectionConfig` per connection name carries the protocol, the primary
endpoint JSON, the backup endpoint JSON, and the failover retry count, all
of which travel in the deployment package.
The same gaps exist in `DiffService.AttributesEqual`, `AlarmsEqual`, and there
is no entry for `Connections` at all. Concrete consequences:
1. A Design user edits an attribute's `Description` (an authoring-time
concern) → the flattened payload changes → no hash change, no diff entry.
2. A Deployment user edits the primary endpoint JSON of a data connection
bound to an instance → the deployment package now ships a different
`ConnectionConfig` → no hash change, no diff entry, so the staleness
indicator says the instance is up to date and the diff view shows no
pending change. The site quietly receives different connection
credentials/host on the next redeploy.
The Description case is mostly cosmetic. The `Connections` case is a deployment
correctness gap — staleness detection is the mechanism that tells operators
"this instance has drifted from its template and needs redeployment", and a
connection-endpoint change is exactly the kind of drift it must catch.
**Recommendation**
Add `Description` to `HashableAttribute` and `HashableAlarm` (alphabetically
placed, per the determinism contract) and to `AttributesEqual` / `AlarmsEqual`.
Add a `HashableConnections : SortedDictionary<string, HashableConnection>`
field (or equivalent) to `HashableConfiguration` that includes Protocol,
ConfigurationJson, BackupConfigurationJson, and FailoverRetryCount, and mirror
it in `DiffService`. Add tests:
`Hash_DescriptionEditChangesHash`,
`Hash_ConnectionEndpointEditChangesHash`,
`Diff_ConnectionEndpointEdit_ProducesEntry`.
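A minimal sketch of how the connection map could feed a deterministic hash. The `HashableConnection` record and the `ConnectionHashSketch` helper are hypothetical names for illustration; the real `Hashable*` types and serializer options in `RevisionHashService` are not shown in this review and may differ. The property names follow the fields listed in the recommendation.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text.Json;

// Hypothetical shape: properties declared alphabetically so the
// serialized property order is stable.
public sealed record HashableConnection(
    string? BackupConfigurationJson,
    string? ConfigurationJson,
    int FailoverRetryCount,
    string Protocol);

public static class ConnectionHashSketch
{
    // Hashes the connection map. A SortedDictionary with an ordinal comparer
    // fixes the key order, so the same logical map always yields the same bytes.
    public static string ComputeHash(IDictionary<string, HashableConnection> connections)
    {
        var ordered = new SortedDictionary<string, HashableConnection>(
            connections, StringComparer.Ordinal);
        byte[] json = JsonSerializer.SerializeToUtf8Bytes(ordered);
        return Convert.ToHexString(SHA256.HashData(json));
    }
}
```

With this shape, an edit to a connection's endpoint JSON changes the hash, while reordering the source dictionary does not. In the real service the connection map would presumably be one field of `HashableConfiguration` rather than hashed separately.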
**Resolution**
_Unresolved._
### TemplateEngine-018 — `DiffService` reports no entries for added/removed/changed connections
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Flattening/DiffService.cs:19` |
**Description**
`DiffService.ComputeDiff` returns a `ConfigurationDiff` with `AttributeChanges`,
`AlarmChanges`, and `ScriptChanges` only. The `FlattenedConfiguration` it diffs
also carries a `Connections` dictionary (per-attribute connection bindings
collapsed to one connection-config-per-name during flattening — see
`FlatteningService:99-118`), and this dictionary materially affects what the
site receives at deploy time. A connection added to or removed from the
flattened configuration (e.g., an instance gains its first data-sourced
attribute, or its last binding is cleared) produces no diff entry. Operators
inspecting the diff view to decide whether to redeploy see "no changes" when
the site will in fact receive a structurally different deployment package.
This is the diff-view counterpart of TemplateEngine-017's hash gap; they are
separable because the `ConfigurationDiff` data shape would have to be extended
even after the hash is fixed.
**Recommendation**
Add `ConnectionChanges` (or equivalent) to `ConfigurationDiff` in `Commons`
(`Types/Flattening/ConfigurationDiff.cs`), populate it in
`DiffService.ComputeDiff` via a new `ComputeEntityDiff` over
`Connections.Keys`, and add a `ConnectionsEqual` helper. Update the Central UI
diff display to render the new section. Add regression tests:
`Diff_NewConnectionBinding_ReportedAsAdded`,
`Diff_ClearedBinding_ReportedAsRemoved`,
`Diff_EndpointEdit_ReportedAsChanged`.
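The added/removed/changed classification could look roughly like the sketch below. `ChangeKind`, `ConnectionChange`, and `ConnectionDiffSketch` are hypothetical names; the actual `ConfigurationDiff` and `ComputeEntityDiff` shapes are not shown in this review, so treat this as an illustration of the comparison only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum ChangeKind { Added, Removed, Changed }

public sealed record ConnectionChange(string Name, ChangeKind Kind);

public static class ConnectionDiffSketch
{
    // Compares two name-keyed maps and reports added, removed, and changed entries.
    // TConfig needs value equality (for example a record) for "Changed" to be meaningful.
    public static IReadOnlyList<ConnectionChange> Compute<TConfig>(
        IReadOnlyDictionary<string, TConfig> oldConnections,
        IReadOnlyDictionary<string, TConfig> newConnections)
    {
        var changes = new List<ConnectionChange>();
        var names = oldConnections.Keys
            .Union(newConnections.Keys)
            .OrderBy(n => n, StringComparer.Ordinal);

        foreach (var name in names)
        {
            bool inOld = oldConnections.TryGetValue(name, out var before);
            bool inNew = newConnections.TryGetValue(name, out var after);

            if (!inOld)
                changes.Add(new ConnectionChange(name, ChangeKind.Added));
            else if (!inNew)
                changes.Add(new ConnectionChange(name, ChangeKind.Removed));
            else if (!EqualityComparer<TConfig>.Default.Equals(before, after))
                changes.Add(new ConnectionChange(name, ChangeKind.Changed));
        }
        return changes;
    }
}
```

Ordering the names ordinally keeps the output stable between runs, which makes the regression tests listed above easier to assert against.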
**Resolution**
_Unresolved._
### TemplateEngine-019 — `TemplateResolver.BuildInheritanceChain` still uses the `0`-as-no-parent sentinel that was removed from `CycleDetector`
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/TemplateResolver.cs:117`, `src/ScadaLink.TemplateEngine/TemplateResolver.cs:123` |
**Description**
TemplateEngine-013 removed the `0`-as-no-parent sentinel from `CycleDetector`
(`adb5e75`) — `ParentTemplateId` is `int?`, so a missing value means "no
parent" and a real Id of 0 must walk the chain like any other node. The fix
did not propagate into `TemplateResolver.BuildInheritanceChain`:
```csharp
var currentId = templateId;
...
while (currentId != 0 && lookup.TryGetValue(currentId, out var current))
{
...
currentId = current.ParentTemplateId ?? 0;
}
```
The seeded `currentId = templateId` is treated as "no template" when
`templateId == 0`, so `ResolveAllMembers(0, ...)` returns an empty chain even
when a template with Id 0 exists. Walking up, `current.ParentTemplateId ?? 0`
then `currentId != 0` collapses a real parent of Id 0 onto the "no parent"
exit, silently truncating the chain. The chain is the input to every
flatten/resolve/validate path through `FlatteningService`, `TemplateService
.ResolveTemplateMembersAsync`, and `InstanceService.SetAlarmOverrideAsync` — a
template with a real Id of 0 (which EF identity sequences avoid in production
but which any in-memory test or import-staging path can produce) silently
loses its inheritance contribution. The duplicate-tolerant `BuildLookup` added
in `adb5e75` is used here, so the test gap is one half of the same fix.
**Recommendation**
Switch the walk to the `int?` form, mirroring `CycleDetector
.DetectInheritanceCycle`:
```csharp
int? currentId = templateId;
while (currentId.HasValue && lookup.TryGetValue(currentId.Value, out var current))
{
if (!visited.Add(currentId.Value)) break;
chain.Add(current);
currentId = current.ParentTemplateId;
}
```
Add regression test
`TemplateResolverTests.BuildInheritanceChain_RealIdZero_StillResolves`.
**Resolution**
_Unresolved._
### TemplateEngine-020 — `Create*` audit entries are written with `EntityId = "0"` before `SaveChangesAsync` populates the real key
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:77`, `src/ScadaLink.TemplateEngine/TemplateService.cs:256`, `src/ScadaLink.TemplateEngine/TemplateService.cs:407`, `src/ScadaLink.TemplateEngine/TemplateService.cs:556`, `src/ScadaLink.TemplateEngine/TemplateService.cs:734`, `src/ScadaLink.TemplateEngine/SharedScriptService.cs:71` |
**Description**
`IAuditService.LogAsync` takes a `string entityId` argument and `TemplateService
.CreateTemplateAsync`, `AddAttributeAsync`, `AddAlarmAsync`, `AddScriptAsync`,
`AddCompositionAsync`, and `SharedScriptService.CreateSharedScriptAsync` all
hard-code it to `"0"`:
```csharp
await _repository.AddTemplateAsync(template, cancellationToken);
await _auditService.LogAsync(user, "Create", "Template", "0", name, template, cancellationToken);
await _repository.SaveChangesAsync(cancellationToken);
```
EF Core populates `template.Id` only when `SaveChangesAsync` runs, but the
audit row is written and queued in the change tracker *before* the save with a
literal `"0"`. The single save then commits the audit row with `EntityId =
"0"` and the new template/attribute/alarm/script with its real Id. Every
"Create" entry in the audit trail therefore loses the link back to the row it
describes — searching the audit log by entity id of a created row finds
nothing, only the subsequent Update/Delete rows are findable.
Note that `InstanceService.CreateInstanceAsync` uses the opposite order
(`AddInstanceAsync` → `SaveChangesAsync` → `LogAsync(... instance.Id ...)`,
lines 90–94) and gets the real Id. The asymmetry is the smoking gun: half the
module audits Create correctly, half does not.
A separate consideration: writing the audit row in the same `SaveChangesAsync`
as the entity gives transactional all-or-nothing behaviour, but the generated
key is not yet available at that point. The fix is to save the entity first,
then log, then save the audit row (two-phase, like `InstanceService` and
`TemplateService.DeleteTemplateAsync` already do), accepting that the entity
and its audit row are then committed by separate saves.
**Recommendation**
For every `Create*` path in `TemplateService` and `SharedScriptService`, swap
the order to `AddXxxAsync` → `SaveChangesAsync` → `LogAsync(... newId
.ToString() ...)` → `SaveChangesAsync`, matching `InstanceService
.CreateInstanceAsync` and `TemplateService.DeleteTemplateAsync`. Add regression
tests that assert the `EntityId` recorded on the audit row matches the
created row's Id.
**Resolution**
_Unresolved._
### TemplateEngine-021 — `MoveTemplateAsync` skips folder cycle and sibling-name-collision validation
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:173` |
**Description**
`TemplateService.MoveTemplateAsync` validates only that the target folder
exists, then unconditionally assigns `template.FolderId = newFolderId`.
`TemplateFolderService.MoveFolderAsync` (the sibling for folder-to-folder
moves) by contrast validates:
- the target folder is not the folder being moved (self-parent);
- the target folder is not a descendant of the folder being moved (cycle);
- no sibling at the destination has the same name (case-insensitive).
The first two are folder-graph concerns and don't apply to template moves, but
the third does — two templates with the same name in the same folder is the
authoring-time scenario the design's "naming collisions are design-time
errors" rule was meant to cover. Today, two templates named "Pump" can be
moved into the same folder with no error, breaking any UI that locates a
template by `(FolderId, Name)` and producing a worse user experience than the
folder-rename path which does check.
Separately, the design doc states folders carry "no semantic meaning for
template resolution, flattening, validation, or inheritance" — so this is
strictly a UI-organization invariant, but it is documented elsewhere
(`TemplateFolderService` enforces it for folders) and the asymmetry is
surprising.
**Recommendation**
After resolving the target folder, run a sibling-name-uniqueness check across
templates with the same `FolderId == newFolderId` and the same `Name`
(case-insensitive), mirroring `TemplateFolderService.MoveFolderAsync` lines
130–142. Add a regression test `MoveTemplate_NameCollisionAtDestination_Fails`.
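A minimal sketch of the sibling-name check, assuming a hypothetical `TemplateStub` projection with a nullable `FolderId` (the real entity shape and the repository query are not shown in this review).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal projection of a template row.
public sealed record TemplateStub(int Id, string Name, int? FolderId);

public static class MoveTemplateSketch
{
    // Returns true when another template in the destination folder already
    // uses the moving template's name (case-insensitive comparison).
    public static bool HasNameCollision(
        TemplateStub moving, int? newFolderId, IEnumerable<TemplateStub> allTemplates)
    {
        return allTemplates.Any(t =>
            t.Id != moving.Id
            && t.FolderId == newFolderId
            && string.Equals(t.Name, moving.Name, StringComparison.OrdinalIgnoreCase));
    }
}
```

In `MoveTemplateAsync` the check would run after the target folder is resolved and before `FolderId` is assigned, returning a failed `Result<T>` on collision. Excluding the moving template's own Id keeps a no-op move into the current folder from failing.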
**Resolution**
_Unresolved._
### TemplateEngine-022 — `LockEnforcer.ValidateLockChange` enforces "once-locked-stays-locked" for `IsLocked` but not for `LockedInDerived`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/LockEnforcer.cs:109`, `src/ScadaLink.TemplateEngine/TemplateService.cs:323`, `src/ScadaLink.TemplateEngine/TemplateService.cs:476`, `src/ScadaLink.TemplateEngine/TemplateService.cs:623` |
**Description**
`LockEnforcer.ValidateLockChange` documents and enforces the rule that an
already-locked member cannot be unlocked downstream (`originalIsLocked &&
!proposedIsLocked` → error). The class-level XML doc describes locking as
covering both fields:
> Locking rules: ... Once locked, a member stays locked — it cannot be
> unlocked downstream.
But the `LockedInDerived` field has no equivalent guard. `UpdateAttributeAsync`,
`UpdateAlarmAsync`, and `UpdateScriptAsync` all let the proposed
`LockedInDerived` flag flip in either direction on a base-template member.
This is a subtle correctness gap with two failure modes:
1. A base template originally marked an attribute `LockedInDerived = true` to
protect derived templates from overriding it. A subsequent edit can clear
the flag while leaving existing derived-template overrides intact — those
overrides become legal retroactively even though the design intent was
that they were always blocked.
2. The XML doc on `LockEnforcer` and the class summary on `TemplateService`
describe a one-way ratchet that the code does not implement for one of the
two lock flags. A reader of the documentation cannot tell which rules are
actually enforced.
The defect is "Low" because the design doc for the Template Engine itself
does not explicitly call out a once-locked-stays-locked rule for
`LockedInDerived`. The most likely fix is therefore to (a) correct the
`LockEnforcer` XML doc to describe only `IsLocked`, or (b) add the equivalent
guard for `LockedInDerived` and a regression test. The choice is a design
question — pick one and align the code and docs.
**Recommendation**
Decide the policy. If `LockedInDerived` is intended to be once-set-stays-set
like `IsLocked`, extend `ValidateLockChange` (or add a sibling
`ValidateLockedInDerivedChange`) and reject the downgrade in
`UpdateAttributeAsync` / `UpdateAlarmAsync` / `UpdateScriptAsync`. If it is
intended to be mutable, update the `LockEnforcer` summary to scope the rule
to `IsLocked` only. Either way, add a test pinning the chosen behaviour.
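If the policy chosen is once-set-stays-set, the sibling guard could be as small as the sketch below. The signature and the string-or-null return convention are assumptions; the real `ValidateLockChange` signature is not shown in this review.

```csharp
#nullable enable
public static class LockedInDerivedSketch
{
    // Returns an error message when a set LockedInDerived flag would be cleared,
    // or null when the change is allowed. Mirrors the IsLocked ratchet.
    public static string? ValidateLockedInDerivedChange(
        bool originalLockedInDerived, bool proposedLockedInDerived, string memberName)
    {
        if (originalLockedInDerived && !proposedLockedInDerived)
        {
            return $"Member '{memberName}' is locked in derived templates and the lock cannot be cleared.";
        }
        return null;
    }
}
```

Setting the flag (false to true) and leaving it unchanged both pass; only the true-to-false transition is rejected.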
**Resolution**
_Unresolved._
+456
@@ -0,0 +1,456 @@
# Code Review — Transport
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.Transport` |
| Design doc | `docs/requirements/Component-Transport.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 12 |
## Summary
The Transport module is structurally clean, follows the design doc's pipeline
layout (Encryption → Serialization → Export / Import), and has solid lower-tier
coverage (encryptor round-trips, manifest validator, dependency resolver,
session store, diff engine). The big surface-area concerns cluster around two
themes. First, the `Overwrite` resolution path is structurally incomplete: it
updates only the parent entity's scalar fields (e.g. `Template.Description /
FolderId`, `ExternalSystem.EndpointUrl / AuthType / AuthConfiguration`) and
never replaces child collections (attributes, alarms, scripts, external-system
methods), silently diverging from both the design doc's audit-row table and
operator intent. Second, the 3-strike / per-IP unlock-rate-limit story declared
in `TransportOptions` and the design doc isn't wired into the import service —
the only counter is a local field on `TransportImport.razor.cs`, and
`MaxUnlockAttemptsPerIpPerHour` is referenced nowhere. There are also some
smaller integrity-and-resource issues (manifest fields outside `ContentHash`
aren't bound to the encryption envelope, decrypted plaintext lives in the
in-memory session for the full TTL on the failure path, and ZIP reads have no
entry-count / per-entry decompression cap).
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | Overwrite paths miss child sync (Transport-001, Transport-002); composition Overwrite intentionally clears (good). |
| 2 | Akka.NET conventions | Yes | No issues found — Transport is service-only, no actors / messages. |
| 3 | Concurrency & thread safety | Yes | `IAuditCorrelationContext` mutation is documented as not thread-safe (Transport-009); singleton `BundleSessionStore` with `ConcurrentDictionary` is fine. |
| 4 | Error handling & resilience | Yes | Rollback-failure path is well-considered, but failed sessions are never evicted (Transport-007). |
| 5 | Security | Yes | Unlock lockout + per-IP cap not enforced server-side (Transport-003, Transport-004); manifest fields outside ContentHash are unauthenticated (Transport-005); zip-bomb / per-entry decompression cap missing (Transport-006); secrets travel in plaintext in unencrypted bundles by design, flagged only by a UI warning (acceptable per doc). |
| 6 | Performance & resource management | Yes | `BundleSession.DecryptedContent` retained in memory for 30 min even on failure (Transport-007); `PreviewAsync` issues N+1 calls to `GetTemplateWithChildrenAsync` (Transport-008); `BundleSerializer.Pack` serializes content twice. |
| 7 | Design-document adherence | Yes | Overwrite-doesn't-sync-children contradicts the design doc's audit row table (Transport-001); per-IP-per-hour lockout in §11 not implemented (Transport-004); design says "bundles are not retained server-side after ApplyAsync commits" — but failed bundles are retained until TTL (Transport-007). |
| 8 | Code organization & conventions | Yes | No major issues found — clean separation, POCO DTOs in `Serialization/`, scoped vs singleton service lifetimes appropriate. |
| 9 | Testing coverage | Yes | Critical gap: no Overwrite-with-modified-children test for Templates or ExternalSystems (Transport-010); no test exercising failed-bundle session retention or per-IP lockout. |
| 10 | Documentation & comments | Yes | XML comments are extensive and accurate; design doc has some staleness (Transport-011, Transport-012). |
## Findings
### Transport-001 — Template Overwrite never syncs attributes / alarms / scripts
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:844-851` |
**Description**
The `ResolutionAction.Overwrite` branch in `ApplyTemplatesAsync` only writes
`Description` and `FolderId` on the existing template and calls
`UpdateTemplateAsync(ex, …)`. The bundle DTO's `Attributes`, `Alarms`, and
`Scripts` collections are never copied onto the existing entity, so an Overwrite
of a template whose child collections changed silently leaves the target's
existing children in place. `ResolveAlarmScriptLinksAsync` then runs against the
unmodified existing alarms/scripts and does nothing useful for the Overwrite
case. This contradicts the design doc's Configuration Audit Trail table
("Template overwritten → `TemplateUpdated` + per-field rows
(`TemplateAttributeAdded`, `TemplateScriptUpdated`, …)") and the operator's
mental model — an Overwrite that produces no diff is a footgun. The only
integration test (`ConflictResolutionTests.Overwrite_replaces_existing_template_description`)
asserts only on `Description`, so the regression is not caught.
**Recommendation**
For the Overwrite branch, replace the existing template's children to match the
bundle DTO (delete-then-add or diff-and-merge), then re-run the alarm-script and
composition rewire passes against the post-merge state. Emit the per-field audit
rows the design doc enumerates. Add an integration test that overwrites a
template whose Scripts / Attributes / Alarms differ.
**Resolution**
_Unresolved._
### Transport-002 — ExternalSystem Overwrite never syncs methods
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:1213-1221` |
**Description**
`ApplyExternalSystemsAsync` Overwrite path writes `EndpointUrl`, `AuthType`, and
`AuthConfiguration` on the existing `ExternalSystemDefinition` and calls
`UpdateExternalSystemAsync`. The DTO's `Methods` collection is never written —
any added, removed, or modified method on the incoming bundle silently does
not land. Same shape of bug as Transport-001 but on a different entity. The
design doc's audit-row table says
"External system overwritten → `ExternalSystemDefinitionUpdated` + per-method
rows", confirming methods are expected to round-trip.
**Recommendation**
Sync `Methods` on Overwrite via add / update / delete by name (mirroring the
diff classification in `ArtifactDiff.CompareExternalSystem`) and emit the
per-method audit rows. Add a test that overwrites an external system whose
methods differ.
**Resolution**
_Unresolved._
### Transport-003 — Unlock lockout is enforced only client-side; server session is never marked Locked
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:184-203`, `src/ScadaLink.CentralUI/Components/Pages/Design/TransportImport.razor.cs:267-309`, `src/ScadaLink.Commons/Types/Transport/BundleSession.cs:14-16` |
**Description**
`BundleSession` exposes `FailedUnlockAttempts` and a `Locked` computed property,
and `PreviewAsync` / `ApplyAsync` correctly refuse to proceed when
`session.Locked == true`. But for an encrypted bundle, `LoadAsync` throws
`CryptographicException` before any session is opened, so no session ever holds
a non-zero `FailedUnlockAttempts`. The 3-strike counter lives only in the
Blazor page's local `_failedUnlockAttempts` field; a second tab / circuit / CLI
caller bypassing the UI can retry the same uploaded bytes indefinitely
because the importer accepts a passphrase against a stream and runs PBKDF2 each
call (600 000 iterations / call). The Locked invariant on `BundleSession` is
effectively unreachable — the field is dead code.
**Recommendation**
Move the lockout into `IBundleImporter`. Two viable shapes:
(a) open a session on the first `LoadAsync` call (skip the decryption step until
a separate `UnlockAsync` is called) and increment / lock there;
(b) keep a per-content-hash counter in the session store, scoped by bundle SHA,
so retries against the same bundle bytes are throttled regardless of the UI
client. Either way, emit `BundleImportUnlockFailed` from the service, not from
the Razor page. Test that a second concurrent caller cannot side-step the
lockout.
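Option (b) could be sketched as a small tracker keyed by the SHA-256 of the uploaded bytes. `UnlockAttemptTracker` is a hypothetical type, not part of the reviewed code, and this version never expires entries; a real implementation would presumably tie entry lifetime to the session store's TTL.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;

public sealed class UnlockAttemptTracker
{
    private readonly ConcurrentDictionary<string, int> _failures = new();
    private readonly int _maxAttempts;

    public UnlockAttemptTracker(int maxAttempts = 3) => _maxAttempts = maxAttempts;

    // Key by the SHA-256 of the uploaded bytes so every caller (tab, circuit,
    // CLI) shares one counter per bundle.
    public static string KeyFor(ReadOnlySpan<byte> bundleBytes) =>
        Convert.ToHexString(SHA256.HashData(bundleBytes));

    public bool IsLocked(string key) =>
        _failures.TryGetValue(key, out var count) && count >= _maxAttempts;

    // Atomically records a failed attempt; returns true once the bundle is locked.
    public bool RecordFailure(string key) =>
        _failures.AddOrUpdate(key, 1, (_, current) => current + 1) >= _maxAttempts;
}
```

The importer would check `IsLocked` before running PBKDF2 and call `RecordFailure` when decryption throws, so the expensive key derivation is skipped entirely for a locked bundle.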
**Resolution**
_Unresolved._
### Transport-004 — `MaxUnlockAttemptsPerIpPerHour` option is declared but never enforced
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Transport/TransportOptions.cs:12`, `docs/requirements/Component-Transport.md` §11 |
**Description**
`TransportOptions.MaxUnlockAttemptsPerIpPerHour` defaults to 10 and is
documented in the design doc (§11, "Failed-unlock rate limit: per-session
3-strike lockout; per-IP-per-hour cap (default 10, configurable) to deter brute
force against a stolen bundle"). A repo-wide grep finds zero readers of the
field. There is no IP-keyed rate limiter, no `IHttpContextAccessor` in the
importer, no middleware in Central UI guarding the import endpoints. The
documented brute-force defence does not exist in code.
**Recommendation**
Either implement the per-IP cap (e.g. via `Microsoft.AspNetCore.RateLimiting`
on the `TransportImport` page and the `ManagementActor` import command path,
keyed on remote-IP for the UI and on authenticated principal for the CLI), or
drop the option and the design-doc paragraph if the project is intentionally
deferring this. Don't leave a dead option in place that promises a security
control that isn't there.
**Resolution**
_Unresolved._
### Transport-005 — Manifest fields outside `ContentHash` are not bound to the encrypted payload
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Transport/Encryption/BundleSecretEncryptor.cs:31-49`, `src/ScadaLink.Transport/Serialization/ManifestValidator.cs:29-53` |
**Description**
AES-GCM is called with no additional authenticated data (AAD). The `manifest`
fields — `SourceEnvironment`, `ExportedBy`, `ScadaLinkVersion`, `Summary`,
`Contents`, `CreatedAtUtc`, etc. — are plaintext and only the `ContentHash`
field is checked against the content bytes. An attacker who obtains a bundle
can edit any non-`ContentHash` manifest field (e.g. rewrite the
`SourceEnvironment` displayed in the Step-4 typo-resistant confirmation gate,
forge a more recent `CreatedAtUtc`, lie about `ExportedBy`) without breaking
decryption. The Step-4 confirmation gate the design doc relies on
("User types the source environment name to confirm — typo-resistant gate at
the prod boundary") is therefore tamperable.
**Recommendation**
Pass the SHA-256 of the manifest's canonical bytes (excluding `ContentHash` and
`Encryption`, or simply the whole manifest minus those two fields) as the
`associatedData` argument to `AesGcm.Encrypt` / `AesGcm.Decrypt`. Any
tampering of the manifest's other fields then yields an authentication-tag
mismatch on decrypt. For the plaintext path, the same binding can be
approximated by extending the hash domain (compute a manifest-and-content hash,
or sign the manifest, depending on how far you want to go).
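A minimal sketch of binding the manifest to the ciphertext through the associated-data argument. `ManifestBoundEncryptionSketch` is a hypothetical helper, not the real `BundleSecretEncryptor`. It assumes `manifestBytes` is already a canonical serialization of the manifest with `ContentHash` and `Encryption` excluded, and it uses the two-argument `AesGcm` constructor, which requires .NET 8 or later.

```csharp
using System;
using System.Security.Cryptography;

public static class ManifestBoundEncryptionSketch
{
    private const int NonceSize = 12; // AES-GCM standard nonce length in bytes
    private const int TagSize = 16;   // 128-bit authentication tag

    // Encrypts the content and authenticates the manifest hash as associated data.
    public static (byte[] Nonce, byte[] Ciphertext, byte[] Tag) Encrypt(
        byte[] key, byte[] plaintext, byte[] manifestBytes)
    {
        byte[] nonce = RandomNumberGenerator.GetBytes(NonceSize);
        byte[] ciphertext = new byte[plaintext.Length];
        byte[] tag = new byte[TagSize];
        byte[] aad = SHA256.HashData(manifestBytes);

        using var aes = new AesGcm(key, TagSize);
        aes.Encrypt(nonce, plaintext, ciphertext, tag, aad);
        return (nonce, ciphertext, tag);
    }

    // Throws CryptographicException if the content or the manifest was altered.
    public static byte[] Decrypt(
        byte[] key, byte[] nonce, byte[] ciphertext, byte[] tag, byte[] manifestBytes)
    {
        byte[] plaintext = new byte[ciphertext.Length];
        byte[] aad = SHA256.HashData(manifestBytes);

        using var aes = new AesGcm(key, TagSize);
        aes.Decrypt(nonce, ciphertext, tag, plaintext, aad);
        return plaintext;
    }
}
```

Because the associated data feeds the authentication tag, editing `SourceEnvironment` or any other covered manifest field makes decryption fail in the same way as a wrong passphrase or corrupted ciphertext would.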
**Resolution**
_Unresolved._
### Transport-006 — Bundle ZIP read has no per-entry size cap or entry-count cap (zip-bomb / decompression-bomb)
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Transport/Serialization/BundleSerializer.cs:121-156`, `src/ScadaLink.Transport/Import/BundleImporter.cs:132-143` |
**Description**
`LoadAsync` caps the raw bundle bytes at `MaxBundleSizeMb` (default 100 MB)
before opening the ZIP. But `ReadContentBytes` calls `entry.Open()` and
`CopyTo(MemoryStream)` with no per-entry size limit and no defence against
compression ratios — a 100 MB DEFLATE-compressed bundle can decompress to
gigabytes. There is also no cap on the number of entries iterated; only two
known entries are read (`manifest.json` + `content.json`/`content.enc`), but
`ReadContentBytes` does not validate that no extra entries exist or that the
expected entry's `Length` is bounded. A malicious importer-with-RequireAdmin
(or a stolen bundle delivered to an admin) can OOM the central node.
**Recommendation**
Cap each entry's decompressed length explicitly (compare `ZipArchiveEntry.Length`
against a configurable max, or copy into a length-limited stream). Reject
bundles whose entry list contains anything other than the known manifest +
content entries. Consider also rejecting any compression ratio over ~50x as a
defence-in-depth measure.
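The per-entry cap could be sketched as a bounded copy loop like the one below. `BoundedZipReadSketch` is a hypothetical helper and `maxBytes` stands in for a configurable option that does not exist yet; the real `ReadContentBytes` signature is not shown in this review.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class BoundedZipReadSketch
{
    // Reads one entry fully into memory, refusing to exceed maxBytes.
    public static byte[] ReadEntry(ZipArchive archive, string entryName, long maxBytes)
    {
        ZipArchiveEntry entry = archive.GetEntry(entryName)
            ?? throw new InvalidDataException($"Missing entry '{entryName}'.");

        // The declared size comes from the archive itself, so treat this
        // check as an early exit rather than the real defence.
        if (entry.Length > maxBytes)
            throw new InvalidDataException($"Entry '{entryName}' declares {entry.Length} bytes.");

        using Stream source = entry.Open();
        using var buffer = new MemoryStream();
        byte[] chunk = new byte[81920];
        long total = 0;
        int read;
        while ((read = source.Read(chunk, 0, chunk.Length)) > 0)
        {
            total += read;
            // Enforce the cap on bytes actually produced by decompression.
            if (total > maxBytes)
                throw new InvalidDataException($"Entry '{entryName}' exceeds {maxBytes} bytes.");
            buffer.Write(chunk, 0, read);
        }
        return buffer.ToArray();
    }
}
```

Counting bytes as they come out of the decompressor is the part that matters, since the size recorded in the ZIP header is under the attacker's control. An entry-count check over `archive.Entries` would sit alongside this before any entry is opened.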
**Resolution**
_Unresolved._
### Transport-007 — Failed import sessions retain decrypted plaintext for the full 30-minute TTL
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:614-696`, `src/ScadaLink.Transport/Import/BundleSessionStore.cs:67-93` |
**Description**
`ApplyAsync` calls `_sessionStore.Remove(sessionId)` only on the success path
(line 614). The catch block re-throws without removing the session, so a failed
apply leaves the `BundleSession` (with `DecryptedContent` up to ~100 MB) in the
in-memory dictionary until the TTL elapses 30 min later (or `Get` lazily evicts
on a separate lookup). Decrypted secrets — DB connection strings, SMTP
credentials, external-system auth configs — sit in process memory for that
window, accessible to anyone holding the session id. Multiplied across repeated
import attempts on the same circuit, this can produce significant memory
pressure (10 failed 100 MB imports = 1 GB) and contradicts the design doc's
"Bundles are not retained server-side after ApplyAsync commits" statement.
**Recommendation**
In the `ApplyAsync` catch block, call `_sessionStore.Remove(sessionId)` (or at
least zero out `session.DecryptedContent`) before re-throwing. Also clear
`DecryptedContent` from the session on the success path before removing — the
buffer is potentially still rooted by a caller-held reference. Consider
shortening the TTL when a session is in a known-stuck state. The session
store's `EvictExpired` exists but is only called on demand — wire it to a
periodic timer so abandoned sessions clear even without traffic.
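A minimal sketch of the cleanup shape, using hypothetical `SessionSketch` and `SessionStoreSketch` types rather than the real `BundleSession` and `BundleSessionStore`. It uses `finally` instead of a `catch` block so that the success and failure paths share one cleanup routine, which is a design choice of this sketch and not something the reviewed code does today.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Threading.Tasks;

public sealed class SessionSketch
{
    public byte[] DecryptedContent { get; set; } = Array.Empty<byte>();
}

public sealed class SessionStoreSketch
{
    private readonly ConcurrentDictionary<Guid, SessionSketch> _sessions = new();

    public void Add(Guid id, SessionSketch session) => _sessions[id] = session;
    public bool Contains(Guid id) => _sessions.ContainsKey(id);

    // Removes the session and overwrites its plaintext buffer with zeros, so
    // a caller-held reference to the array no longer exposes the secrets.
    public void RemoveAndScrub(Guid id)
    {
        if (_sessions.TryRemove(id, out var session))
        {
            CryptographicOperations.ZeroMemory(session.DecryptedContent);
            session.DecryptedContent = Array.Empty<byte>();
        }
    }

    // Runs the apply step; the finally block cleans up on success and failure alike.
    public async Task ApplyAsync(Guid id, Func<SessionSketch, Task> apply)
    {
        if (!_sessions.TryGetValue(id, out var session))
            throw new InvalidOperationException("Unknown session.");
        try
        {
            await apply(session);
        }
        finally
        {
            RemoveAndScrub(id);
        }
    }
}
```

Scrubbing on failure means the operator has to re-upload the bundle to retry, which trades some convenience for a shorter plaintext lifetime.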
**Resolution**
_Unresolved._
### Transport-008 — `PreviewAsync` issues an N+1 `GetTemplateWithChildrenAsync` per matching template name
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:252-272` |
**Description**
Building the per-template diff loops over every existing stub returned by
`GetAllTemplatesAsync` and, for any name that matches an incoming DTO, calls
`GetTemplateWithChildrenAsync(stub.Id)` to re-fetch with children. On a target
DB with many templates that overlap the bundle this is one round-trip per
matching template (often the whole bundle), each query carrying the full
attributes/alarms/scripts/compositions joins. The diff itself is read-only and
fits a single eager-loaded `GetAllTemplatesWithChildrenAsync` query.
**Recommendation**
Add a `GetAllTemplatesWithChildrenAsync` (or extend `GetAllTemplatesAsync` with
an `includeChildren` flag) on `ITemplateEngineRepository` and use it here. The
same N+1 appears in `ResolveCompositionEdgesAsync` (line 1093) for the
just-imported templates, but that loop is bounded by the bundle's size and is
less of a concern.
**Resolution**
_Unresolved._
### Transport-009 — `IAuditCorrelationContext.BundleImportId` is mutated on the same scoped instance the AuditService reads
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:528, 668, 703`, `src/ScadaLink.ConfigurationDatabase/Services/AuditCorrelationContext.cs` |
**Description**
The XML doc on `IAuditCorrelationContext` correctly notes that mutating
`BundleImportId` is not thread-safe and that concurrent imports inside a single
scope would cross-contaminate audit rows. The contract is "Blazor circuit / API
request — sequential await chain — single writer". The risk is that this
invariant is documentation-only — there is no enforcement (e.g. a mutex on set,
or an `AsyncLocal<Guid?>` impl) and no test exercising a concurrent-callers
scenario. A future change that schedules audit writes on a different
synchronization context inside the apply transaction (e.g. `Task.WhenAll` over
the Apply helpers) would silently start leaking the id across rows.
**Recommendation**
Either (a) back `BundleImportId` with an `AsyncLocal<Guid?>` so each logical
call chain inherits the value and concurrent chains can't trample it, or
(b) wrap the apply in a try/finally that snapshots and restores. (b) is closer
to the current design. Either way, add an integration test that fires two
overlapping `ApplyAsync` calls and asserts each bundle's rows carry only that
bundle's id.
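Options (a) and (b) can be combined, as in the sketch below: the value lives in an `AsyncLocal<Guid?>` and a disposable handle restores the previous value. `AsyncLocalCorrelationContext` and its `Begin` method are hypothetical; the real `IAuditCorrelationContext` members beyond `BundleImportId` are not shown in this review.

```csharp
using System;
using System.Threading;

public sealed class AsyncLocalCorrelationContext
{
    // Static so that every scoped instance reads the value of the current
    // async flow rather than a shared mutable field.
    private static readonly AsyncLocal<Guid?> Current = new();

    public Guid? BundleImportId
    {
        get => Current.Value;
        set => Current.Value = value;
    }

    // Sets the id and returns a handle that restores the previous value on dispose.
    public IDisposable Begin(Guid bundleImportId)
    {
        Guid? previous = Current.Value;
        Current.Value = bundleImportId;
        return new Restore(previous);
    }

    private sealed class Restore : IDisposable
    {
        private readonly Guid? _previous;
        public Restore(Guid? previous) => _previous = previous;
        public void Dispose() => Current.Value = _previous;
    }
}
```

One caveat worth pinning in a test: a value set inside an `async` method does not flow back to its caller, so the id has to be set in the same method (or an ancestor of the method) that performs the audit writes.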
**Resolution**
_Unresolved._
### Transport-010 — Critical Overwrite + cross-cutting paths uncovered by tests
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.Transport.IntegrationTests/ConflictResolutionTests.cs`, `tests/ScadaLink.Transport.IntegrationTests/Import/BundleImporterApplyTests.cs` |
**Description**
The existing tests cover the happy path well (round-trip, semantic-validator
gating, rollback even when `RollbackAsync` itself throws, composition imports),
but the per-entity Overwrite resolutions are only spot-tested:
`ConflictResolutionTests.Overwrite_replaces_existing_template_description`
asserts on `Description` only. Specifically missing:
- Overwrite of a `Template` whose `Attributes` / `Alarms` / `Scripts` /
`Compositions` diverged from the existing row (would catch Transport-001).
- Overwrite of an `ExternalSystem` whose `Methods` diverged (would catch
Transport-002).
- Overwrite of a `NotificationList` whose `Recipients` collection diverged
(NotificationList Overwrite does sync recipients via clear+add — needs an
asserting test).
- Concurrent `ApplyAsync` calls on a shared scope to exercise the
`IAuditCorrelationContext` mutation contract (would catch Transport-009).
- Per-IP unlock-throttle behaviour (would catch Transport-004).
- A session that survives a failed Apply (would catch Transport-007).
**Recommendation**
Add the missing integration tests above. Most can be modelled after
`ConflictResolutionTests`' export-then-mutate-target-then-apply pattern.
**Resolution**
_Unresolved._
### Transport-011 — Design doc's Step-1 manifest preview promises decryption-free preview, but `LoadAsync` reads and validates content before passphrase
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `docs/requirements/Component-Transport.md` Import Flow Step 1, `src/ScadaLink.Transport/Import/BundleImporter.cs:124-203` |
**Description**
The design doc says: "The manifest is plaintext so the import wizard can
preview bundle contents and source provenance before the user supplies a
passphrase." `LoadAsync` honours that — but does so by ALWAYS reading and
hashing the content blob (encrypted or not) on the first call, regardless of
whether the caller has a passphrase. For an encrypted bundle with no
passphrase, the code path that surfaces the encrypted-bundle prompt is the
`ArgumentException` thrown at line 195, which has already performed the full
manifest parse + content-hash check + read of the encrypted blob. That's fine,
but it means there is no cheap "manifest peek": the UI's "let the user see
the manifest before deciding whether to type a passphrase" step costs at least
O(bundle-size) and consumes the full upload buffer on each call. The design doc
gives a misleading impression of that cost.
**Recommendation**
Either (a) add an explicit `ReadManifestAsync(Stream)` interface method that
skips the content read for the pure preview case, or (b) update the design
doc to clarify that the full envelope is read on every `LoadAsync` call and
that the cheap "peek" is conceptual rather than runtime.
**Resolution**
_Unresolved._
### Transport-012 — "Bundle Import" filter promised in design doc not surfaced in Configuration Audit Log Viewer UI
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `docs/requirements/Component-Transport.md` §Audit Trail, `src/ScadaLink.ConfigurationDatabase/Repositories/CentralUiRepository.cs:148` |
**Description**
The design doc says: "The existing Configuration Audit Log Viewer gains a
**Bundle Import** filter that surfaces all rows for a given import. The
`BundleImported` summary row links to the filtered view." A repository filter
on `BundleImportId` is wired into `CentralUiRepository` (line 148), but no UI
filter control surfaces it, and the `BundleImported` summary row does not carry
a hyperlink in the Configuration Audit Log Viewer. This is a gap between the
documentation and the code, not a bug in Transport itself, but the spec lives
in the Transport doc, so it is flagged here.
**Recommendation**
Either implement the filter dropdown + summary-row link in the Configuration
Audit Log Viewer, or note the deferral in the design doc.
**Resolution**
_Unresolved._
+21 -8
@@ -33,9 +33,21 @@ def discover_modules():
     return modules
+def parse_header(module, text):
+    """Extract (last_reviewed, commit) from the module's header table.
+    Falls back to the historical baseline when the field is absent or templated."""
+    last = re.search(r"\|\s*Last reviewed\s*\|\s*([0-9]{4}-[0-9]{2}-[0-9]{2})", text)
+    commit = re.search(r"\|\s*Commit reviewed\s*\|\s*`([^`]+)`", text)
+    return (
+        last.group(1) if last else "2026-05-16",
+        commit.group(1) if commit else "9c60592",
+    )
 def parse_findings(module):
-    """Parse one module's findings.md into (module, id, severity, title, status) tuples."""
+    """Parse one module's findings.md into ((last_reviewed, commit), [(module, id, severity, title, status), ...])."""
     text = open(os.path.join(BASE, module, "findings.md")).read()
+    header = parse_header(module, text)
     findings = []
     for block in re.split(r"^### ", text, flags=re.M)[1:]:
         head = block.splitlines()[0].strip()
@@ -49,7 +61,7 @@ def parse_findings(module):
         if not sev or not status:
             raise SystemExit(f"{module}/findings.md: {fid} is missing a Severity or Status field")
         findings.append((module, fid, sev.group(1), title, status.group(1).strip()))
-    return findings
+    return header, findings
 def finding_number(finding):
@@ -58,7 +70,7 @@ def finding_number(finding):
 def build_readme(modules, per_module):
     pending = sorted(
-        (f for fs in per_module.values() for f in fs if f[4] in PENDING_STATUSES),
+        (f for fs in per_module.values() for f in fs[1] if f[4] in PENDING_STATUSES),
         key=lambda f: (SEVERITY_ORDER.get(f[2], 9), f[0], finding_number(f)),
     )
@@ -66,7 +78,7 @@ def build_readme(modules, per_module):
         return sum(1 for f in pending if f[2] == sev)
     def open_count(module, sev):
-        return sum(1 for f in per_module[module]
+        return sum(1 for f in per_module[module][1]
                    if f[2] == sev and f[4] in PENDING_STATUSES)
     lines = []
@@ -123,9 +135,10 @@ def build_readme(modules, per_module):
     add("|--------|---------------|--------|----------------|------|-------|")
     for module in modules:
         counts = [open_count(module, s) for s in ("Critical", "High", "Medium", "Low")]
-        add(f"| [{module}]({module}/findings.md) | 2026-05-16 | `9c60592` "
+        last_reviewed, commit = per_module[module][0]
+        add(f"| [{module}]({module}/findings.md) | {last_reviewed} | `{commit}` "
             f"| {counts[0]}/{counts[1]}/{counts[2]}/{counts[3]} "
-            f"| {sum(counts)} | {len(per_module[module])} |")
+            f"| {sum(counts)} | {len(per_module[module][1])} |")
     add("")
     add("## Pending Findings")
     add("")
@@ -159,8 +172,8 @@ def main():
     readme_path = os.path.join(BASE, "README.md")
     pending = sum(1 for fs in per_module.values()
-                  for f in fs if f[4] in PENDING_STATUSES)
-    total = sum(len(fs) for fs in per_module.values())
+                  for f in fs[1] if f[4] in PENDING_STATUSES)
+    total = sum(len(fs[1]) for fs in per_module.values())
     if check:
         current = open(readme_path).read() if os.path.exists(readme_path) else ""
current = open(readme_path).read() if os.path.exists(readme_path) else ""