code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97
Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.
regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
@@ -0,0 +1,500 @@
# Code Review — AuditLog

| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.AuditLog` |
| Design doc | `docs/requirements/Component-AuditLog.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 11 |

## Summary

AuditLog is one of the largest and most carefully engineered modules in the codebase.
The site-side hot path (`SqliteAuditWriter` + `FallbackAuditWriter` + `RingBufferFallback`)
implements a textbook bounded-channel + dedicated-writer pattern with batched transactions,
UTF-8-safe truncation, additive schema migration, and a drop-oldest fallback that
genuinely honours the "audit-write must NEVER abort the user-facing action" contract.
The central side mirrors that with per-row try/catch on batch ingest, a transactional
dual-write for the cached-telemetry path, per-site cursor isolation in reconciliation,
and a partition-switch purge that is metadata-only. The payload filter is well-factored
with a compile-time regex cache, per-stage failure isolation, and per-target overrides.
Test coverage is broad: ~12 000 lines spanning unit, integration, and end-to-end paths.

Themes across findings: (1) the largest issue is a **specced-but-unwired transport path**:
`ISiteStreamAuditClient.IngestCachedTelemetryAsync` and `AuditLogIngestActor.OnCachedTelemetryAsync`
both exist and the protobuf RPC is plumbed, but no production code ever calls the cached-telemetry
client; the cached-call lifecycle audit rows ride the audit-only `IngestAuditEventsAsync` drain
and the central dual-write transaction is dead code (AuditLog-001). (2) Several
**Akka.NET supervisor-strategy comments are inaccurate**: multiple actors document
"`SupervisorStrategy` uses Resume" but the code returns `DefaultDecider` (which Restarts), and
the strategy applies to children, not to the actor itself (AuditLog-002). (3) The
**`SqliteAuditWriter` hot-path lock is contended by the 30 s backlog probe**: `GetBacklogStatsAsync`
takes the same `_writeLock` that serialises every batch INSERT, so a large-backlog scan can
park the hot-path writer (AuditLog-005). (4) **Sync-over-async in `Dispose`** can deadlock under
an ASP.NET sync context (AuditLog-006). (5) A handful of **misleading code comments and minor
configuration drift** (AuditLog-007, AuditLog-008, AuditLog-009). (6) `CancellationToken`
parameters on the actor drain paths are accepted but immediately replaced with
`CancellationToken.None` (AuditLog-010). (7) The site-only `AddAuditLogHealthMetricsBridge`
registers the `SiteAuditBacklogReporter` hosted service but the `AddAuditLog` registration
chain doesn't reject a central composition root that mistakenly calls the site bridge
(AuditLog-011). No Critical-severity issues; three Medium, eight Low.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `AuditLogIngestActor.OnCachedTelemetryAsync` is unreachable production code (AuditLog-001); reconciliation cursor advances on persistent insert failure (AuditLog-004); `DisposeAsync` comment about `_disposed` ordering is misleading (AuditLog-009). |
| 2 | Akka.NET conventions | Yes | `SupervisorStrategy` returned by actors does not do what the surrounding doc says (AuditLog-002); per-actor strategy applies to children only, but comments imply self-protection. |
| 3 | Concurrency & thread safety | Yes | `GetBacklogStatsAsync` contends with hot-path writes on `_writeLock` (AuditLog-005); sync DI scopes block on async EF disposal (AuditLog-003); `_disposed` is set after the wait, contradicting comment (AuditLog-009); no cooperative cancellation through the drain paths (AuditLog-010). |
| 4 | Error handling & resilience | Yes | Best-effort contract is honoured throughout; `Dispose()` sync-over-async is the one remaining hazard (AuditLog-006); reconciliation silently discards permanently-failing rows (AuditLog-004). |
| 5 | Security | Yes | Append-only enforcement, redaction stack, and "never under-redact" safety net all present. Test composition roots that omit the filter SILENTLY pass payloads through unredacted (AuditLog-008). |
| 6 | Performance & resource management | Yes | Hot-path batched + back-pressured. Backlog scan holds the write lock (AuditLog-005); `MarkForwardedAsync` interpolates an `IN (...)` list inside the lock, fine in practice but scales linearly with batch size. |
| 7 | Design-document adherence | Yes | Combined telemetry transport plumbed but never called (AuditLog-001); other than that the implementation closely tracks the design doc. |
| 8 | Code organization & conventions | Yes | Composition root well-segmented; `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` for the same dependency across registrations (AuditLog-007); `AddAuditLog*` helpers register hosted services and option bindings without idempotency guards (AuditLog-011). |
| 9 | Testing coverage | Yes | Excellent surface coverage. Integration tests exist for the dual-write path in `AuditLogIngestActorCombinedTelemetryTests` and `CachedCallCombinedTelemetryTests`, but those drive the actor directly via the test harness; there is no integration test that asserts the production end-to-end emits a `CachedTelemetryBatch` from the site (because nothing does). |
| 10 | Documentation & comments | Yes | Several large XML-doc paragraphs are accurate, but the `SupervisorStrategy` comments (AuditLog-002), the `DisposeAsync` ordering comment (AuditLog-009), and a few stale "Bundle X" references could mislead a new reader. |

## Findings

### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/Telemetry/ISiteStreamAuditClient.cs:45`, `src/ScadaLink.AuditLog/Site/Telemetry/ClusterClientSiteAuditClient.cs:86`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:198` |

**Description**

The design (Component-AuditLog.md §"Cached Operations — Combined Telemetry") specifies a
single `CachedCallTelemetry` packet per lifecycle event that carries BOTH the audit row
AND the operational `SiteCalls` upsert, with central writing both rows in one transaction.
The infrastructure exists: `ISiteStreamAuditClient.IngestCachedTelemetryAsync` is on the
interface; `ClusterClientSiteAuditClient.IngestCachedTelemetryAsync` builds the
`IngestCachedTelemetryCommand`; the proto carries `CachedTelemetryBatch`;
`AuditLogIngestActor.OnCachedTelemetryAsync` performs the dual `InsertIfNotExists` +
`UpsertAsync` inside a `BeginTransactionAsync`. But a `grep` for callers of
`IngestCachedTelemetryAsync` in `src/ScadaLink.AuditLog` shows only the interface
declaration and the two implementations; nothing produces a `CachedTelemetryBatch` for
the site to push. The `SiteAuditTelemetryActor.OnDrainAsync` only calls
`IngestAuditEventsAsync` (the audit-only path); cached-call audit rows written by
`CachedCallTelemetryForwarder` to local SQLite are drained as ordinary audit events,
and the `SiteCalls` operational half rides a separate `UpsertSiteCallCommand` channel
into `SiteCallAuditActor`. The "central writes AuditLog + SiteCalls in one transaction"
guarantee is therefore not delivered: the two writes are now uncorrelated across
actors and can fail independently, and the dual-write path in `AuditLogIngestActor`
is dead production code.

**Recommendation**

Either (a) wire the combined path: build a `CachedTelemetryBatch` from the audit rows
the forwarder writes (alongside the operational half held by `IOperationTrackingStore`),
add a parallel drain loop that calls `IngestCachedTelemetryAsync`, and gate the audit-only
drain so cached-call rows don't double-emit; or (b) update the design doc + the
`AuditLogIngestActor` / `ClusterClientSiteAuditClient` / interface XML comments to
acknowledge that the two halves now flow via separate transports, and delete the
unreachable `OnCachedTelemetryAsync` dual-write code (after confirming the
`AuditLogIngestActorCombinedTelemetryTests` integration tests exercise it via direct
actor injection only).

**Resolution**

_Unresolved._

### AuditLog-002 — `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider

| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:99-103`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:109-115`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:315-321` |

**Description**

Three central actors (`AuditLogIngestActor`, `AuditLogPurgeActor`, `SiteAuditReconciliationActor`)
all override `SupervisorStrategy()` and return
`new OneForOneStrategy(maxNrOfRetries: 0, withinTimeRange: TimeSpan.Zero, decider: DefaultDecider)`.
The surrounding XML / inline comments variously claim "uses `Resume` so a thrown exception
inside `ReceiveAsync` does not restart the actor" (AuditLogIngestActor remarks),
"uses Resume so any leaked exception keeps the singleton alive for the next tick"
(AuditLogPurgeActor remarks), and "the actor's supervisor strategy keeps it alive
across any leaked exception with `DefaultDecider`'s Restart semantics — restart resets
the in-memory cursors, but as noted above that's a safe (over-pull, idempotent) recovery"
(SiteAuditReconciliationActor remarks; this one at least correctly says Restart, but it
conflicts with the other two). Two things are wrong: (1) the strategy returned by an actor's
`SupervisorStrategy()` override governs how that actor supervises its CHILDREN, not how
its own parent treats it, so it is not the mechanism that protects these singletons
from their own throws; (2) `DefaultDecider` Restarts for most exceptions, not Resumes.
The actors are in fact protected by the per-row / per-batch try/catch blocks inside
the receive handlers; the supervisor override is effectively unused, since these
actors have no children. The comments mislead a reader into trusting a guarantee
that the code does not deliver.

**Recommendation**

Pick one of two corrections: either delete the `SupervisorStrategy` override (these
actors have no children, so the override is dead) and rewrite the comments to credit
the try/catch blocks for the alive-on-throw guarantee; or, if the override is kept
as a forward-compat hedge, change the decider to `Decider.From(_ => Directive.Resume)`
or similar to match the comment, AND add a clear note that the per-row catch is what
keeps the actor running across handler throws, not the supervisor strategy.

**Resolution**

_Unresolved._

### AuditLog-003 — `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously

| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:133`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:139`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:178` |

**Description**

`OnCachedTelemetryAsync` opens `_serviceProvider!.CreateAsyncScope()` and lets
`await using` dispose it. `OnIngestAsync`, `OnTickAsync` in
`SiteAuditReconciliationActor`, and `OnTickAsync` in `AuditLogPurgeActor` all open
`_services.CreateScope()` (the synchronous variant) and dispose it with a synchronous
`scope.Dispose()` in a `finally` block, even though the per-message work is async and
the scoped `IAuditLogRepository` resolves an EF Core `DbContext`, which implements
`IAsyncDisposable`. The synchronous `Dispose()` on a `DbContext` blocks on any pending
async connection cleanup; under load this can hold the actor thread for the duration
of a connection close, which on SQL Server may include sending a `SET TRANSACTION
ISOLATION LEVEL` reset round-trip. Switching to `CreateAsyncScope()` + `await using`
is the recommended pattern for scoped EF resources.

**Recommendation**

Change `_services.CreateScope()` to `_services.CreateAsyncScope()` in
`OnIngestAsync`, `SiteAuditReconciliationActor.OnTickAsync`, and
`AuditLogPurgeActor.OnTickAsync`, and replace the `try/finally { scope?.Dispose(); }`
pattern with `await using var scope = _services.CreateAsyncScope();`. The DI scope
will dispose asynchronously and the EF Core context will be released without
blocking the actor thread.

**Resolution**

_Unresolved._

### AuditLog-004 — `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:233-265` |

**Description**

`PullSiteAsync` iterates the pulled events, calls `InsertIfNotExistsAsync` inside a
per-row try/catch, and unconditionally updates `maxOccurred = evt.OccurredAtUtc` after
the try/catch, regardless of whether the insert succeeded or threw. The comment at
line 247 acknowledges this: "the cursor still advances based on OccurredAtUtc — the
row was returned by the site, so the next tick won't re-fetch it; if it permanently
fails to persist, that's an operational concern surfaced by the log, not a hot-loop
trigger." For a transient fault that flips to success on the next pull the design
holds. But if a row throws on EVERY central attempt (a truly permanent persistence fault,
e.g. column-too-long, or an FK violation that won't resolve) the cursor advance still moves
past it, and central will simply log on every reconciliation tick. No alert escalates
beyond a log line. Worse, the site keeps the row `Pending` (because `MarkReconciledAsync`
is only called for rows the puller flipped centrally) AND will trip the
`SiteAuditTelemetryStalled` signal because the backlog never drains, but the central
log message is the only place an operator could correlate the stall with the
persistent insert failure.

**Recommendation**

Either (a) only advance the cursor for rows whose `InsertIfNotExistsAsync` returned
cleanly, leaving `maxOccurred` at the previous value for the failing row so the next
tick retries; or (b) increment a dedicated `CentralAuditPermanentInsertFailure` health
counter on the per-row catch so the failure is observable on the dashboard instead of
buried in the log. Option (a) needs a guard against the same row throwing forever
(which would saturate the puller): a small per-event retry counter held in the actor's
state with a permanent-skip + `LogCritical` threshold is the standard escape valve.
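
Option (a) with its retry guard could be sketched roughly as below. This is a minimal illustration, not the actor's real code: `PulledEvent`, `ReconciliationCursor`, the `insert` delegate, and the `maxAttempts` threshold are all hypothetical names, and the sketch assumes each batch arrives ordered by `OccurredAtUtc` and that re-inserting an already-persisted row is idempotent.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a pulled audit row; only the fields the sketch needs.
public sealed record PulledEvent(Guid EventId, DateTime OccurredAtUtc);

// Hypothetical per-site cursor state that an actor could hold in memory.
public sealed class ReconciliationCursor
{
    private readonly int _maxAttempts;
    private readonly Dictionary<Guid, int> _failures = new();

    public ReconciliationCursor(int maxAttempts) => _maxAttempts = maxAttempts;

    public DateTime Position { get; private set; } = DateTime.MinValue;

    public List<Guid> PermanentlySkipped { get; } = new();

    // Processes one pulled batch (assumed ordered by OccurredAtUtc ascending).
    public void ProcessBatch(IEnumerable<PulledEvent> batch, Action<PulledEvent> insert)
    {
        foreach (var evt in batch)
        {
            try
            {
                insert(evt);
                _failures.Remove(evt.EventId);
            }
            catch (Exception)
            {
                var attempts = _failures.GetValueOrDefault(evt.EventId) + 1;
                if (attempts < _maxAttempts)
                {
                    // Leave the cursor before the failing row so the next tick re-pulls it.
                    _failures[evt.EventId] = attempts;
                    return;
                }

                // Threshold reached: record the skip (a LogCritical call would go here)
                // and fall through so the cursor moves past the row.
                _failures.Remove(evt.EventId);
                PermanentlySkipped.Add(evt.EventId);
            }

            Position = evt.OccurredAtUtc;
        }
    }
}
```

Stopping at the first failing row keeps the sketch simple; rows after it are re-pulled on the next tick, which is only acceptable because the insert is assumed idempotent.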

**Resolution**

_Unresolved._

### AuditLog-005 — `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:597-657` |

**Description**

`SqliteAuditWriter.GetBacklogStatsAsync` takes `_writeLock` (the same lock that
serialises every batch INSERT in `FlushBatch`) and holds it for the duration of a
`SELECT COUNT(*), MIN(OccurredAtUtc) FROM AuditLog WHERE ForwardState = 'Pending'`.
`SiteAuditBacklogReporter` calls this on a 30-second timer. On a healthy site with
few `Pending` rows the index-only scan is fast; under the scenario the metric exists
to detect (a prolonged central outage growing the backlog "indefinitely" per
Component-AuditLog.md) a `COUNT(*)` over hundreds of thousands of `Pending` rows
on the `IX_SiteAuditLog_ForwardState_Occurred` index is no longer cheap, and the
duration of that scan is added to the hot-path write latency for every concurrent
script. The hot path is supposed to be "durable in microseconds" per the design doc;
a multi-hundred-millisecond probe stall in the same period would not be visible
externally but would back-pressure the bounded write channel. `ReadPendingAsync` and
`ReadPendingSinceAsync` share the same lock for the same reason and have the same
exposure under backlog growth.

**Recommendation**

Either (a) move the SELECT outside the write lock by using a second, dedicated
read-only SQLite connection (Microsoft.Data.Sqlite supports concurrent connections
to the same file when `journal_mode=WAL` is enabled, which would also benefit the
hot path); or (b) cache the last snapshot inside the writer and recompute it
lazily on a dedicated background tick so the reporter reads a pre-computed snapshot
without acquiring the write lock. Option (a) also unblocks `ReadPendingAsync` /
`ReadPendingSinceAsync` from competing with the writer.
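
The publish/read half of option (b) might look something like the following. It is a sketch under stated assumptions: `BacklogSnapshot`, `BacklogSnapshotCache`, and the `scan` delegate are hypothetical names, and where the scan itself runs (for example on a separate read connection) is deliberately left open.

```csharp
using System;
using System.Threading;

// Hypothetical immutable snapshot of the backlog probe result.
public sealed record BacklogSnapshot(long PendingCount, DateTime? OldestPendingUtc, DateTime ComputedAtUtc);

// Hypothetical cache: a background tick publishes, the reporter reads without any lock.
public sealed class BacklogSnapshotCache
{
    private BacklogSnapshot _current = new(0, null, DateTime.MinValue);

    // Reporter side: a lock-free read of the most recently published snapshot.
    public BacklogSnapshot Current => Volatile.Read(ref _current);

    // Background-tick side: run the scan (passed in as a delegate) and publish the
    // result by swapping the reference, so readers never see a half-written value.
    public void Refresh(Func<(long Count, DateTime? Oldest)> scan, DateTime nowUtc)
    {
        var (count, oldest) = scan();
        Volatile.Write(ref _current, new BacklogSnapshot(count, oldest, nowUtc));
    }
}
```

Because the snapshot is an immutable record swapped by reference, the reporter can poll `Current` as often as it likes without touching `_writeLock`; the trade-off is that the numbers are as stale as the refresh interval.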

**Resolution**

_Unresolved._

### AuditLog-006 — `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock

| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:697-700` |

**Description**

```csharp
public void Dispose()
{
    DisposeAsync().AsTask().GetAwaiter().GetResult();
}
```

This is the classic sync-over-async anti-pattern. `DisposeAsync` `await`s the
writer-loop task with `.ConfigureAwait(false)`, so on a thread with no synchronization
context (the typical .NET 10 host shutdown path) it's fine; but if any caller invokes
`Dispose()` from a context that captures (an ASP.NET request, a SynchronizationContext
test runner, an Akka.NET dispatcher in some configurations) the `GetResult()` blocks
the captured thread while the continuation tries to resume on it, a classic deadlock.
The writer is registered as a DI singleton, so this is unlikely to bite during the
host's `IAsyncDisposable` shutdown (DI prefers `DisposeAsync` when available), but
an integration test or future code path that constructs the writer manually inside
a sync context will hang.

**Recommendation**

Drop the `IDisposable` interface and rely on `IAsyncDisposable` only; the DI container
will call `DisposeAsync` on singletons that implement it. If a sync `Dispose` is
required for compatibility with consumers that don't honour `IAsyncDisposable`,
implement it as a best-effort that calls `_writeQueue.Writer.TryComplete()` + a
short wait, without blocking the thread for the full async drain.
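
A best-effort synchronous `Dispose` along those lines might be shaped as follows. `SketchQueueWriter` is a hypothetical stand-in for the real writer (only the channel and writer loop are modelled), and the one-second bound is an arbitrary assumption rather than a value from the codebase.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical stand-in for SqliteAuditWriter; only the queue and writer loop are modelled.
public sealed class SketchQueueWriter : IAsyncDisposable, IDisposable
{
    private readonly Channel<string> _writeQueue = Channel.CreateBounded<string>(64);
    private readonly Task _writerLoop;
    private int _written;

    // The loop is started on the thread pool, so it never depends on the caller's context.
    public SketchQueueWriter() => _writerLoop = Task.Run(() => DrainAsync());

    public int Written => Volatile.Read(ref _written);

    public bool TryWrite(string item) => _writeQueue.Writer.TryWrite(item);

    private async Task DrainAsync()
    {
        await foreach (var item in _writeQueue.Reader.ReadAllAsync().ConfigureAwait(false))
        {
            Interlocked.Increment(ref _written); // a real writer would INSERT the item here
        }
    }

    // Preferred path: complete the channel, then await the full drain.
    public async ValueTask DisposeAsync()
    {
        _writeQueue.Writer.TryComplete();
        await _writerLoop.ConfigureAwait(false);
    }

    // Best-effort path: complete the channel and wait a bounded time for the loop,
    // instead of blocking on the whole asynchronous dispose.
    public void Dispose()
    {
        _writeQueue.Writer.TryComplete();
        _writerLoop.Wait(TimeSpan.FromSeconds(1));
    }
}
```

The bounded `Wait` means a slow drain can leave items unflushed when `Dispose` returns; that is the price of never blocking indefinitely, and it is why `DisposeAsync` stays the preferred path.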

**Resolution**

_Unresolved._

### AuditLog-007 — `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:148-218` |

**Description**

`AddAuditLog` registers three components that depend on `INodeIdentityProvider`:
- `CachedCallTelemetryForwarder` resolves with `sp.GetService<INodeIdentityProvider>()`
  (optional, falls back to a null `SourceNode`).
- `CachedCallLifecycleBridge` resolves with `sp.GetService<INodeIdentityProvider>()`
  (optional, same fallback).
- `CentralAuditWriter` resolves with `sp.GetRequiredService<INodeIdentityProvider>()`
  (required, throws at first resolution if unregistered).

The XML comments at lines 153 / 175 / 215 explain the reasoning: the first two are
optional because tests may skip the registration; the third is required because "the
production composition root in `SiteServiceRegistration` registers the provider as a
singleton on both site and central paths". But this is a fragile guarantee: `AddAuditLog`
itself does NOT register the provider, so a future composition root that calls
`AddAuditLog` without first calling whatever registers `INodeIdentityProvider` will fail
on the FIRST resolution of `ICentralAuditWriter` (which is a lazy factory) rather than
at `AddAuditLog` time. The result: site nodes that "happen to work" because they hold
a registered provider, central composition test fixtures that fail at runtime instead
of DI-build time, and a `GetService`/`GetRequiredService` split that gives no clear
contract to the reader.

**Recommendation**

Either (a) make all three optional: `CentralAuditWriter` already handles a null provider
gracefully (lines 113-116, null-coalescing the caller's value), so the asymmetry buys
nothing. Or (b) make all three required and either add `services.AddSingleton<INodeIdentityProvider, ...>()`
inside `AddAuditLog` (with a sensible default, e.g. a null node name returns `<unknown>`) or
add an explicit guard at the top of `AddAuditLog` that throws if no provider has been
registered yet (`services.Any(d => d.ServiceType == typeof(INodeIdentityProvider))`).

**Resolution**

_Unresolved._

### AuditLog-008 — Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain

| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/FallbackAuditWriter.cs:51-77`, `src/ScadaLink.AuditLog/Central/CentralAuditWriter.cs:77-104`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:125,155` |

**Description**

`FallbackAuditWriter`, `CentralAuditWriter`, and `AuditLogIngestActor` all accept an
`IAuditPayloadFilter` as an optional dependency, defaulting to `null = pass-through`.
The justification in every XML comment is the same: "the M4 test composition roots
that don't pass one keep working (they only ever write small payloads)". This is fine
for size, but the filter also performs HEADER REDACTION (`Authorization`, `Cookie`,
`Set-Cookie`, `X-API-Key`), GLOBAL BODY REDACTORS, and SQL PARAMETER REDACTION. A test
fixture (or any future composition root that bypasses `AddAuditLog`) that injects a
real `RequestSummary` will see secrets written to SQLite / MS SQL with no redaction.
The combination "audit-write must never abort the user-facing action" + "unredacted
secrets must never persist" (Component-AuditLog.md §Payload Capture Policy) makes the
no-filter fallback genuinely dangerous: over-redacting on a missing filter is the
contract the production setup honours, but the code itself defaults to under-redact.

**Recommendation**

Change the three null-coalesce sites to default to a non-null sentinel filter that
performs the header redaction (`HeaderRedactList`) using the hard-coded defaults
from `AuditLogOptions`, even when no `IAuditPayloadFilter` is registered. The
truncation stage can remain optional; the header redaction must not. Alternatively,
make `IAuditPayloadFilter` non-optional and have `AddAuditLog` register the real
filter unconditionally; tests that don't bind the options section will resolve the
default `AuditLogOptions` and get the production-default redact list automatically.
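
The sentinel-filter idea could be sketched as below. The real `IAuditPayloadFilter` signature is not shown in this review, so `IHeaderFilter`, `DefaultHeaderRedactionFilter`, `SketchAuditWriter`, and the `[REDACTED]` mask are hypothetical simplifications; only the four header names come from the finding above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified filter shape (not the real IAuditPayloadFilter contract).
public interface IHeaderFilter
{
    IReadOnlyDictionary<string, string> Apply(IReadOnlyDictionary<string, string> headers);
}

// Sentinel used when no filter is registered: it always masks the sensitive headers.
public sealed class DefaultHeaderRedactionFilter : IHeaderFilter
{
    public const string Mask = "[REDACTED]";

    // Header names taken from the finding; matching is case-insensitive.
    private static readonly HashSet<string> RedactList = new(StringComparer.OrdinalIgnoreCase)
    {
        "Authorization", "Cookie", "Set-Cookie", "X-API-Key",
    };

    public IReadOnlyDictionary<string, string> Apply(IReadOnlyDictionary<string, string> headers) =>
        headers.ToDictionary(
            kv => kv.Key,
            kv => RedactList.Contains(kv.Key) ? Mask : kv.Value);
}

// Hypothetical writer: a missing filter falls back to redaction, never to pass-through.
public sealed class SketchAuditWriter
{
    private readonly IHeaderFilter _filter;

    public SketchAuditWriter(IHeaderFilter? filter = null) =>
        _filter = filter ?? new DefaultHeaderRedactionFilter();

    public IReadOnlyDictionary<string, string> Prepare(IReadOnlyDictionary<string, string> headers) =>
        _filter.Apply(headers);
}
```

The point of the sketch is the direction of the default: a composition root that forgets to supply a filter gets over-redaction rather than unredacted secrets.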

**Resolution**

_Unresolved._

### AuditLog-009 — `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:706-740` |

**Description**

The first `lock (_writeLock)` block in `DisposeAsync` is commented:

> Stop accepting new events. Setting _disposed first ensures any FlushBatch entered
> after we mark disposed will fault its pending events rather than touching the
> about-to-close connection.

But the block does NOT set `_disposed = true`; it only calls
`_writeQueue.Writer.TryComplete()` and captures `_writerLoop`. The `_disposed` flag is
flipped in the SECOND lock block (line 738), AFTER the 5-second wait on the writer
loop. During the wait window, a concurrent `WriteAsync` that observed the channel
NOT-yet-completed (race: it ran before `TryComplete`) and got past `TryWrite` would
land on the writer loop's `FlushBatch`, which then takes the lock and checks
`_disposed`, and finds it still `false`. The check at the top of `FlushBatch`
(line 265) `if (_disposed) { fault pending; return; }` therefore does NOT fire during
the dispose window. In practice the channel being completed drains the loop cleanly
and the disposal race is benign, but the comment claims a guarantee that the code
does not implement.

**Recommendation**

Either set `_disposed = true` in the first lock block to match the comment (and remove
the duplicate `_disposed` check in the second block); or rewrite the comment to
describe the actual ordering: the channel is completed first, the loop drains
remaining items under the lock, and `_disposed = true` is set only after the loop
exits. The current code is correct; the comment is wrong.

**Resolution**

_Unresolved._

### AuditLog-010 — Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream

| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs:92,107,124`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:228` |

**Description**

The drain loops on `SiteAuditTelemetryActor.OnDrainAsync` and the per-site pull on
`SiteAuditReconciliationActor.PullSiteAsync` both pass `CancellationToken.None` to
every async dependency call (queue reads, gRPC client, repository writes). The actor
has no `CancellationToken` field, so there's no in-flight cancellation source;
graceful shutdown relies entirely on `PostStop` being called and the actor's
`Receive` continuation completing naturally. For a healthy gRPC client this is fine,
but a stuck `IngestAuditEventsAsync` call (slow central, partition switch in progress)
holds the actor's continuation indefinitely; the host's coordinated-shutdown will then
time out the actor system and leave the actor in an undefined state. The brief
references "cancellation on stop" in the partition-maintenance comments but
`SiteAuditTelemetryActor` does not implement it.

**Recommendation**

Introduce a per-actor `CancellationTokenSource` populated in `PreStart` and cancelled
in `PostStop`; pass `_lifecycleCts.Token` instead of `CancellationToken.None` in
every async dependency call. Same change for `SiteAuditReconciliationActor`. The
existing `OperationCanceledException` is already swallowed by the top-level catch
in `OnDrainAsync` (line 128), so plumbing the token through is a localised change.
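
The lifecycle-token pattern can be illustrated without Akka.NET. In the sketch below `SketchDrainLoop` is a hypothetical plain class whose `PreStart` / `PostStop` methods merely mimic the actor hooks of the same name, and the `push` delegate stands in for a call such as `IngestAuditEventsAsync`.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical plain-class model of an actor's lifecycle; not an Akka.NET actor.
public sealed class SketchDrainLoop
{
    private CancellationTokenSource? _lifecycleCts;

    // Mimics the actor's PreStart hook: create the cancellation source.
    public void PreStart() => _lifecycleCts = new CancellationTokenSource();

    // Mimics the actor's PostStop hook: cancel in-flight work, then release the source.
    public void PostStop()
    {
        _lifecycleCts?.Cancel();
        _lifecycleCts?.Dispose();
        _lifecycleCts = null;
    }

    // One drain pass; returns false when the pass was cut short by shutdown.
    public async Task<bool> DrainOnceAsync(Func<CancellationToken, Task> push)
    {
        var token = _lifecycleCts?.Token
            ?? throw new InvalidOperationException("PreStart has not been called.");
        try
        {
            // The lifecycle token replaces CancellationToken.None on every dependency call.
            await push(token).ConfigureAwait(false);
            return true;
        }
        catch (OperationCanceledException)
        {
            return false;
        }
    }
}
```

With the token threaded through, a dependency call that would otherwise hang is released as soon as `PostStop` cancels the source, so shutdown no longer depends on the remote side answering.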

**Resolution**

_Unresolved._

### AuditLog-011 — `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:53-55, 263-276, 301-346` |

**Description**

The XML doc on `AddAuditLog` is explicit: "Idempotent re-registration is not supported;
call this exactly once per `IServiceCollection`." But `AddAuditLogHealthMetricsBridge`
calls `services.AddHostedService<SiteAuditBacklogReporter>()` (line 275), which is
NOT idempotent: every call registers another descriptor, and the host will spin up
N reporters and have them all poll SQLite every 30 s, all push the same snapshot into
`ISiteHealthCollector`. The site composition path is supposed to call this exactly
once, but tests or composition refactors that accidentally call twice will pay 2x the
SQL probe rate and overwrite the snapshot with conflicting numbers (no race, just
wasted work). Worse, `AddAuditLogCentralMaintenance` (line 301) is also non-idempotent:
`AddOptions<AuditLogPartitionMaintenanceOptions>` and `AddHostedService<AuditLogPartitionMaintenanceService>`
will pile up.

**Recommendation**

Either (a) guard each Add* helper with a "has the marker been seen?" sentinel
(register a private marker descriptor on first call, no-op on subsequent calls);
or (b) explicitly document idempotency on the public surface of every helper and
verify with a unit test in `AddAuditLogTests`. Option (a) matches the pattern other
SDK extensions use and removes a foot-gun.

**Resolution**

_Unresolved._

@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.CLI` |
|
||||
| Design doc | `docs/requirements/Component-CLI.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 7 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -47,6 +47,36 @@ and `WriteAsTable` derives table columns from only the first array element, sile
|
||||
dropping columns for any later element with a different shape (CLI-016). No
|
||||
Critical/High issues; the module remains healthy.
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
The CLI has grown two substantial new command groups since the last re-review —
|
||||
`scadalink audit` (Audit Log #23 M8) and `scadalink bundle` (Transport #24) — together
|
||||
adding ~1,500 lines of new production code. The new `audit` surface is well-tested and
|
||||
well-factored (pure helpers + a clear `IAuditFormatter` seam), but the new `bundle`
|
||||
surface is untested, duplicates the URL/credential resolution that already exists in
|
||||
`CommandHelpers`, and inherits a partial authorization-exit-code regression that also
|
||||
appears in the audit path. Two longstanding fragility gaps that the prior reviews missed
|
||||
also surface in this pass: `CliConfig.Load` parses the config file with no try/catch, and
|
||||
`CommandTreeTests` still pins the old 14-group count so the two new groups are excluded
|
||||
from the leaf-action and registry-resolution coverage that protected the rest of the
|
||||
tree. Module health is broadly good but the consolidated count is now seven Open
|
||||
findings (none Critical, three Medium).
|
||||
|
||||
- **CLI-017** — `BundleCommands` duplicates `ExecuteCommandAsync` and skips the
  `FORBIDDEN`/`UNAUTHORIZED` exit-code mapping (auth exit 2 contract regression).
- **CLI-018** — `AuditQueryHelpers.RunQueryAsync` / `AuditExportHelpers.RunExportAsync`
  return exit 1 on every error, never the documented exit 2 for authorization failure.
- **CLI-019** — `BundleCommands.bundle export` decodes the entire base64 bundle in
  memory and writes synchronously — 100 MB bundles double-buffer.
- **CLI-020** — `BundleCommands.bundle export` parses the success body with bare
  `JsonDocument.Parse` + `GetProperty` and throws on a malformed/abbreviated envelope.
- **CLI-021** — `CliConfig.Load` crashes the whole CLI when `~/.scadalink/config.json`
  is malformed or unreadable, even if `--url` was supplied on the command line.
- **CLI-022** — `AuditCommands` and `BundleCommands` are absent from `CommandTreeTests`;
  the test still pins `Equal(14, groups.Count)` and silently excludes the new groups.
- **CLI-023** — `Component-CLI.md` says the audit commands ride `POST /management`,
  but the implementation calls a new `GET /api/audit/*` REST endpoint pair.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
_Original review (2026-05-16, `9c60592`):_
|
||||
@@ -79,6 +109,21 @@ _Re-review (2026-05-17, `39d737e`):_
|
||||
| 9 | Testing coverage | ☑ | Substantially expanded (`CommandTreeTests`, `ManagementHttpClientTests`, `DebugStreamTests`). No new gaps. |
|
||||
| 10 | Documentation & comments | ☑ | XML docs accurate. `Component-CLI.md` drift folded into CLI-015. |
|
||||
|
||||
_Re-review (2026-05-28, `1eb6e97`):_
|
||||
|
||||
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `BundleCommands.BuildExport` unguarded `JsonDocument.Parse` + `GetProperty` (CLI-020); `CliConfig.Load` unguarded JSON parse (CLI-021). |
| 2 | Akka.NET conventions | ☑ | Not applicable — pure HTTP/SignalR/REST client. No issues. |
| 3 | Concurrency & thread safety | ☑ | No new concurrency surface; `debug stream` unchanged since CLI-011/012. No issues. |
| 4 | Error handling & resilience | ☑ | Bundle and audit paths skip the auth exit-code contract (CLI-017, CLI-018); bundle JSON-envelope parse is brittle (CLI-020); config-file parse aborts the process (CLI-021). |
| 5 | Security | ☑ | No new credential or trust-boundary issues. No issues. |
| 6 | Performance & resource management | ☑ | `bundle export` double-buffers the whole bundle in memory (CLI-019). |
| 7 | Design-document adherence | ☑ | `Component-CLI.md` claims audit commands ride `POST /management`; implementation uses new REST endpoints (CLI-023). |
| 8 | Code organization & conventions | ☑ | `BundleCommands.RunBundleCommandAsync` re-implements credential/URL resolution that `CommandHelpers.ExecuteCommandAsync` already provides — drift waiting to happen (CLI-017). |
| 9 | Testing coverage | ☑ | `BundleCommands` has no tests; `CommandTreeTests` pins `Equal(14, …)` and excludes the new `AuditCommands` + `BundleCommands` groups (CLI-022). |
| 10 | Documentation & comments | ☑ | XML docs accurate; doc-vs-code transport drift folded into CLI-023. No other issues. |
|
||||
|
||||
## Findings
|
||||
|
||||
### CLI-001 — `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken
|
||||
@@ -741,3 +786,284 @@ list and `OutputFormatter.WriteTable` pads missing cells, so heterogeneous array
|
||||
render every column. Regression tests added in `TableHeaderUnionTests` (3 tests:
|
||||
later-element-only column included, first-seen column order preserved,
|
||||
first-element-extra column still rendered).
|
||||
|
||||
### CLI-017 — `BundleCommands.RunBundleCommandAsync` duplicates `ExecuteCommandAsync` and breaks the auth exit-code contract
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/BundleCommands.cs:244-289` (vs. `src/ScadaLink.CLI/Commands/CommandHelpers.cs:20-73`, `:159-174`) |
|
||||
|
||||
**Description**
|
||||
|
||||
`BundleCommands.RunBundleCommandAsync` re-implements the URL/credential resolution,
|
||||
validation, and HTTP plumbing that `CommandHelpers.ExecuteCommandAsync` already provides
|
||||
for every other command group — to attach a 5-minute timeout (`BundleCommandTimeout`)
|
||||
and a caller-supplied success handler. In duplicating it, two contracts that
|
||||
`CommandHelpers` carefully establishes were dropped:
|
||||
|
||||
1. **Authorization exit code.** `CommandHelpers.HandleResponse` routes through
|
||||
`IsAuthorizationFailure`, which returns exit 2 for **either** HTTP 403 **or** an
|
||||
`UNAUTHORIZED`/`FORBIDDEN` error code on any status (resolution of CLI-009). The
|
||||
bundle path at line 287 uses a bare `if (response.StatusCode == 403) return 2;` — a
|
||||
server that signals authorization failure via the `code` field on a non-403 status
|
||||
(the same channel the rest of the CLI honours) will exit 1 instead of 2 from
|
||||
`bundle export`/`preview`/`import`. `Component-Transport.md:289` explicitly states
|
||||
"Exit codes follow the project convention: `0` = success, `1` = command failure,
|
||||
`2` = authorization failure," so this is a contract regression.
|
||||
2. **Error-message phrasing drift.** The two duplicated error paths
(`BundleCommands.cs:258-260`, `:264-266`) emit shorter messages that omit the
`SCADALINK_MANAGEMENT_URL` / `SCADALINK_USERNAME` env-var hints the canonical paths
give, which is confusing when the user is trying to work out what is missing.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Refactor `CommandHelpers.ExecuteCommandAsync` to accept an optional `TimeSpan` timeout
and an optional success handler, and have `BundleCommands` call it. Failing that,
promote `CommandHelpers.IsAuthorizationFailure` to `internal`, call it from
`RunBundleCommandAsync` in place of the bare 403 check, and copy the canonical error
messages verbatim.
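
The shared check could look roughly like the sketch below. It assumes the response exposes an integer status code and a nullable error-code string; `AuthorizationCheck` and `ToExitCode` are hypothetical names, the case-insensitive comparison is an assumption, and the real `CommandHelpers.IsAuthorizationFailure` signature may differ.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the shared helper; not the actual CommandHelpers code.
public static class AuthorizationCheck
{
    public const int Success = 0;
    public const int CommandFailure = 1;
    public const int AuthorizationFailure = 2;

    // True for HTTP 403, or for an UNAUTHORIZED/FORBIDDEN error code on any status.
    public static bool IsAuthorizationFailure(int statusCode, string? errorCode) =>
        statusCode == 403
        || string.Equals(errorCode, "FORBIDDEN", StringComparison.OrdinalIgnoreCase)
        || string.Equals(errorCode, "UNAUTHORIZED", StringComparison.OrdinalIgnoreCase);

    // Maps a response to the documented 0 / 1 / 2 exit-code convention.
    public static int ToExitCode(int statusCode, string? errorCode)
    {
        bool ok = statusCode >= 200 && statusCode < 300 && errorCode is null;
        if (ok) return Success;
        return IsAuthorizationFailure(statusCode, errorCode)
            ? AuthorizationFailure
            : CommandFailure;
    }
}
```

Routing every command group through one such function keeps the exit-code contract in a single place, so a new command path cannot silently fall back to a bare status check.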
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CLI-018 — `audit query` and `audit export` never return exit 2 for an authorization failure
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/AuditQueryHelpers.cs:186-193`, `src/ScadaLink.CLI/Commands/AuditExportHelpers.cs:147-153` |
|
||||
|
||||
**Description**
|
||||
|
||||
The two audit-log subcommands (`audit query`, `audit export`) ride a new REST surface
|
||||
(`GET /api/audit/query` and `GET /api/audit/export`) — not the `POST /management`
|
||||
envelope that goes through `CommandHelpers.HandleResponse`. Both helpers map *any*
|
||||
non-success response to a generic `OutputFormatter.WriteError(...)` + `return 1`:
|
||||
|
||||
- `AuditQueryHelpers.RunQueryAsync:186-193` returns 1 unconditionally when `JsonData`
|
||||
is null (i.e. any error). It never inspects `StatusCode` or `ErrorCode`.
|
||||
- `AuditExportHelpers.RunExportAsync:147-153` returns 1 for every non-success status,
|
||||
again with no 403 / `FORBIDDEN` carve-out.
|
||||
|
||||
`Component-CLI.md:295-296` documents exit code 2 for "Authorization failure (insufficient
|
||||
role)". `Component-AuditLog.md` (Security & Tamper-Evidence) and `Component-CLI.md:184-187`
|
||||
both call out that the audit endpoints are gated by `OperationalAudit` and `AuditExport`
|
||||
permissions enforced server-side — i.e. these are exactly the commands most likely to
|
||||
return 403 in routine use. The exit-code regression silently downgrades a 403 to a
|
||||
generic command failure, breaking the CI/CD scripting contract.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Promote `CommandHelpers.IsAuthorizationFailure` to `internal` (or move it to a small
|
||||
shared helper) and have `RunQueryAsync` / `RunExportAsync` return 2 when it matches.
|
||||
The check needs to use the `ManagementResponse.StatusCode` / `ErrorCode` pair the
|
||||
audit `SendGetAsync` already populates.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CLI-019 — `bundle export` decodes the entire base64 bundle into memory before writing
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/BundleCommands.cs:117-124`, `src/ScadaLink.CLI/ManagementHttpClient.cs:47-92` |
|
||||
|
||||
**Description**
|
||||
|
||||
`Component-Transport.md:271` caps the raw bundle at 100 MB and notes the
per-request body cap is raised to 200 MB once base64-inflated. The CLI's export path
goes through `ManagementHttpClient.SendCommandAsync`, which reads the entire response
body into a string (`responseBody = await httpResponse.Content.ReadAsStringAsync(...)`)
and returns it as `ManagementResponse.JsonData`. `BundleCommands.BuildExport` then:
|
||||
|
||||
1. `JsonDocument.Parse(jsonOk)` re-allocates the JSON DOM (~200 MB string + DOM).
|
||||
2. `doc.RootElement.GetProperty("base64Bundle").GetString()` materializes the base64
|
||||
payload as another ~200 MB `string`.
|
||||
3. `Convert.FromBase64String(base64)` allocates a fresh ~100 MB `byte[]`.
|
||||
4. `File.WriteAllBytes(output, bytes)` writes synchronously.
|
||||
|
||||
Peak working-set for a 100 MB bundle is therefore ~600 MB, all on the LOH, plus the
|
||||
file-I/O is fully synchronous. The streaming `SendGetStreamAsync` path the audit
|
||||
export uses (line 155-156) shows the right pattern is already available for plain GETs,
|
||||
but bundles ride a `POST /management` envelope so they currently can't reuse it.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
For the export path specifically, add a streaming variant — either a new
|
||||
`POST /api/bundle/export` REST endpoint mirroring the audit pattern, or a chunk-fetch
|
||||
follow-up `GET /api/bundle/<exportId>` so the CLI can stream bytes through
|
||||
`Stream.CopyToAsync` without buffering the whole envelope. If a v1 stop-gap is needed,
|
||||
at minimum switch to `File.WriteAllBytesAsync` and use `Convert.TryFromBase64Chars`
|
||||
with a rented buffer to avoid the double-LOH allocation.
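
A minimal sketch of the streaming shape, assuming the server can expose the base64 text as a stream of its own (for example from a dedicated endpoint) rather than inside the JSON envelope. `Base64Streaming` and `DecodeToFileAsync` are hypothetical names; `FromBase64Transform` and `CryptoStream` are the standard `System.Security.Cryptography` types.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Hypothetical helper: decodes a stream of base64 text straight to a file
// in small chunks, so neither the text nor the bytes are held in full.
public static class Base64Streaming
{
    public static async Task<long> DecodeToFileAsync(Stream base64Text, string outputPath)
    {
        using var transform = new FromBase64Transform();

        // Reading from the CryptoStream yields decoded bytes.
        // Disposing it also disposes the underlying input stream.
        await using var decoded = new CryptoStream(base64Text, transform, CryptoStreamMode.Read);

        await using var file = new FileStream(
            outputPath, FileMode.Create, FileAccess.Write, FileShare.None,
            bufferSize: 81920, useAsync: true);

        await decoded.CopyToAsync(file);
        await file.FlushAsync();
        return file.Length; // number of decoded bytes written
    }
}
```

With this shape the peak memory cost is bounded by the copy buffer instead of scaling with the bundle size, and the file write is asynchronous.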
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CLI-020 — `bundle export` success-envelope parse is unguarded
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/BundleCommands.cs:117-126` |
|
||||
|
||||
**Description**
|
||||
|
||||
The export success handler does:
|
||||
|
||||
```csharp
using var doc = JsonDocument.Parse(jsonOk);
var base64 = doc.RootElement.GetProperty("base64Bundle").GetString()!;
var byteCount = doc.RootElement.GetProperty("byteCount").GetInt32();
var bytes = Convert.FromBase64String(base64);
```
|
||||
|
||||
None of these calls is wrapped in a `try/catch`. A server-side bug that omits one of
the two properties, returns a `null` `base64Bundle`, sends invalid base64, or sends a
malformed JSON envelope will surface as a `KeyNotFoundException`,
`ArgumentNullException`, `FormatException`, or `JsonException` respectively (or as an
`InvalidOperationException` when a property has the wrong JSON type): an unhandled
stack trace, not a clean `INVALID_RESPONSE` / exit 1, contradicting the
"graceful-degradation" theme that the prior reviews (CLI-002, CLI-003, CLI-005)
repeatedly hardened.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Wrap the parse + base64-decode in a `try` block that catches `JsonException`,
`KeyNotFoundException`, `InvalidOperationException`, `ArgumentNullException`, and
`FormatException` and emits a clean `OutputFormatter.WriteError(..., "INVALID_RESPONSE")`
+ `return 1`. Add a regression test against a malformed-envelope stub
`HttpMessageHandler`.
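
An alternative to catching every exception type is to validate the envelope shape up front with the `Try*` members of `JsonElement`. The sketch below assumes the two property names quoted above; `ExportEnvelope` and `TryDecode` are hypothetical names, and the caller is assumed to translate a `false` result into the `INVALID_RESPONSE` error and exit 1.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical helper: validates and decodes the export success envelope
// without letting a malformed response escape as an unhandled exception.
public static class ExportEnvelope
{
    public static bool TryDecode(string json, out byte[] bytes, out string? error)
    {
        bytes = Array.Empty<byte>();
        error = null;
        try
        {
            using var doc = JsonDocument.Parse(json);
            var root = doc.RootElement;

            // Check kinds before reading so no accessor can throw.
            if (root.ValueKind != JsonValueKind.Object
                || !root.TryGetProperty("base64Bundle", out var b64)
                || b64.ValueKind != JsonValueKind.String
                || !root.TryGetProperty("byteCount", out var count)
                || count.ValueKind != JsonValueKind.Number
                || !count.TryGetInt32(out _))
            {
                error = "envelope is missing a valid base64Bundle or byteCount";
                return false;
            }

            bytes = Convert.FromBase64String(b64.GetString()!);
            return true;
        }
        catch (JsonException ex)   // malformed JSON
        {
            error = ex.Message;
            return false;
        }
        catch (FormatException ex) // invalid base64
        {
            error = ex.Message;
            return false;
        }
    }
}
```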
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CLI-021 — `CliConfig.Load` crashes the CLI on a malformed config file
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CLI/CliConfig.cs:41-53` |
|
||||
|
||||
**Description**
|
||||
|
||||
`CliConfig.Load` is the first thing every command runs (via `ExecuteCommandAsync`,
|
||||
`AuditCommandHelpers.ResolveConnection`, and `BundleCommands.RunBundleCommandAsync`).
|
||||
Its config-file branch is:
|
||||
|
||||
```csharp
if (File.Exists(configPath))
{
    var json = File.ReadAllText(configPath);
    var fileConfig = JsonSerializer.Deserialize<CliConfigFile>(json, ...);
    ...
}
```
|
||||
|
||||
Neither call is guarded. If `~/.scadalink/config.json` exists but is malformed
|
||||
(stale, partial, or someone's `vim` swap), `JsonSerializer.Deserialize` throws
|
||||
`JsonException`. If the file exists but isn't readable (mode 0000),
|
||||
`File.ReadAllText` throws `UnauthorizedAccessException`. Either fault aborts every
|
||||
CLI invocation with an unhandled stack trace — even invocations that supply every
|
||||
input on the command line and don't need the config file at all (`--url`,
|
||||
`--username`, `--password`, `--format` all on the CLI).
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Wrap the file-read and the `JsonSerializer.Deserialize` in a single
|
||||
`try/catch (Exception)` (or specifically `JsonException` +
|
||||
`UnauthorizedAccessException` + `IOException`). On failure, write a single one-line
|
||||
warning to `Console.Error` ("ignoring malformed `~/.scadalink/config.json`: {message}")
|
||||
and return the default `CliConfig`, so the rest of the precedence chain (env vars +
|
||||
command-line flags) still works.
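
The fallback described above can be sketched as follows. `ConfigFileSketch` is a hypothetical stand-in whose properties are assumptions (the real `CliConfigFile` shape is not shown in this review), and `SafeConfigLoader.LoadOrDefault` is likewise an illustrative name.

```csharp
#nullable enable
using System;
using System.IO;
using System.Text.Json;

// Hypothetical config shape; the real CliConfigFile may have different members.
public sealed class ConfigFileSketch
{
    public string? ManagementUrl { get; set; }
    public string? Format { get; set; }
}

public static class SafeConfigLoader
{
    // Returns the parsed file, or an empty config plus a one-line warning
    // when the file is missing, unreadable, or malformed.
    public static ConfigFileSketch LoadOrDefault(string configPath, TextWriter warnings)
    {
        if (!File.Exists(configPath)) return new ConfigFileSketch();
        try
        {
            var json = File.ReadAllText(configPath);
            // Deserialize returns null for a literal JSON "null" document.
            return JsonSerializer.Deserialize<ConfigFileSketch>(json) ?? new ConfigFileSketch();
        }
        catch (Exception ex) when (ex is JsonException
                                   or UnauthorizedAccessException
                                   or IOException)
        {
            warnings.WriteLine($"warning: ignoring unusable config file {configPath}: {ex.Message}");
            return new ConfigFileSketch();
        }
    }
}
```

Passing `Console.Error` as `warnings` keeps the notice out of the command's standard output, so scripted consumers of JSON or table output are not affected.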
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CLI-022 — `CommandTreeTests` excludes the two new command groups
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.CLI.Tests/CommandTreeTests.cs:21-37`, `:55-58` (vs. `src/ScadaLink.CLI/Program.cs:21-36`) |
|
||||
|
||||
**Description**
|
||||
|
||||
`CommandTreeTests.AllCommandGroups()` builds 14 command groups; `Program.cs` now
|
||||
registers 16 (`AuditCommands` and `BundleCommands` were added since the last
|
||||
re-review). Worse, the smoke test pins `Assert.Equal(14, groups.Count)`, so the
|
||||
test list intentionally matches the harness's array and stays green even though the
|
||||
real production tree is two groups larger. The downstream assertions
|
||||
(`EveryLeafCommand_HasAnAction`, `CommandPayloadTypes_ResolveViaRegistry`) therefore
|
||||
also do NOT cover the new audit / bundle leaves — and `BundleCommands` has zero
|
||||
test coverage of any kind (no parsing tests, no success-handler tests, no
|
||||
registry-resolution tests).
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add `AuditCommands.Build(...)` and `BundleCommands.Build(...)` to the
|
||||
`AllCommandGroups()` array, bump the assertion to `Equal(16, groups.Count)`, and add
|
||||
representative payload types to `CommandPayloadTypes_ResolveViaRegistry`
|
||||
(`ExportBundleCommand`, `PreviewBundleCommand`, `ImportBundleCommand`). Optionally,
|
||||
add a `BundleCommandsTests` file covering the success-envelope parse and the
|
||||
`NameListOption` comma-split parser.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CLI-023 — `Component-CLI.md` claims audit commands ride `POST /management`; implementation uses REST endpoints
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-CLI.md:310-311` (vs. `src/ScadaLink.CLI/Commands/AuditQueryHelpers.cs:186`, `src/ScadaLink.CLI/Commands/AuditExportHelpers.cs:126`, `src/ScadaLink.CLI/ManagementHttpClient.cs:94-156`) |
|
||||
|
||||
**Description**
|
||||
|
||||
`Component-CLI.md:310` states: "The `scadalink audit` command group rides this same
|
||||
transport — there is no separate audit endpoint." But the implementation calls a
|
||||
new REST surface — `GET /api/audit/query` and `GET /api/audit/export` — via two new
|
||||
methods on `ManagementHttpClient` (`SendGetAsync`, `SendGetStreamAsync`), distinct
|
||||
from the `POST /management` envelope. The plan document
|
||||
(`docs/plans/2026-05-20-audit-log-code-roadmap.md:1583`) corroborates the
|
||||
implementation: "REST endpoints `GET /api/audit/query` (paged) and
|
||||
`GET /api/audit/export` (streaming)" — i.e. the design doc is the stale one.
|
||||
|
||||
A reader following `Component-CLI.md` would expect the audit endpoints to share
the management envelope's authentication + dispatch path and route through
`ManagementActor`, neither of which is true. The auth-exit-code regression
(CLI-018) is itself a direct consequence of this divergence: the audit helpers
duplicate the management envelope's response handling instead of riding it, and
omit the auth carve-out.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Update `Component-CLI.md:310-311` (and the Dependencies bullet at `:311`) to
|
||||
describe the actual REST surface: `GET /api/audit/query` (paged) and
|
||||
`GET /api/audit/export` (streaming), with HTTP Basic Auth shared with the
|
||||
management envelope and permission checks enforced by the server-side
|
||||
`AuditController`. Optionally cross-link to
|
||||
`docs/plans/2026-05-20-audit-log-code-roadmap.md` (M8 task list) as the
|
||||
authoritative source.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.CentralUI` |
|
||||
| Design doc | `docs/requirements/Component-CentralUI.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 8 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -73,6 +73,55 @@ cross-thread `Dictionary`; CentralUI-022 unguarded `InvokeAsync`), category 4
|
||||
claims), category 9 (CentralUI-025 untested `SessionExpiry` poll). Categories
|
||||
1, 2, 5, 6, 7, 10 produced no new findings.
|
||||
|
||||
_Re-review (2026-05-28, `1eb6e97`):_
|
||||
|
||||
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | CentralUI-026 (AuditFilterBar UTC), CentralUI-027 (3 other pages with same UTC bug). |
| 2 | Akka.NET conventions | ☑ | No new findings — module is presentation; `DebugStreamService` actor usage unchanged. |
| 3 | Concurrency & thread safety | ☑ | CentralUI-030 (StringWriter capture buffer not thread-safe under intra-script `Task.WhenAll`). |
| 4 | Error handling & resilience | ☑ | No new findings — the prior CentralUI-018/023 patterns hold. |
| 5 | Security | ☑ | CentralUI-028 (NotificationReport + SiteCallsReport not site-scoped — CentralUI-002 regression on new pages). |
| 6 | Performance & resource management | ☑ | CentralUI-031 (TransportImport buffers full bundle bytes in component state). |
| 7 | Design-document adherence | ☑ | CentralUI-032 (AuditResultsGrid forward-only paging diverges from the bi-directional paging implied by "keyset paginated"). |
| 8 | Code organization & conventions | ☑ | CentralUI-029 (`JS.InvokeAsync<int>("eval", ...)` in ConfigurationAuditLog vs the `_content/.../BrowserTime` module pattern). |
| 9 | Testing coverage | ☑ | CentralUI-033 (TransportImport / SiteCallsReport query-string drill-in code paths untested). |
| 10 | Documentation & comments | ☑ | No new findings — code comments accurately describe intent. |
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
All 25 prior findings remain closed. This re-review re-examined the full
|
||||
module against the 10-category checklist with attention to the
|
||||
recently-added Transport export/import wizards (`TransportExport`,
|
||||
`TransportImport`) and the operational Audit Log page (Bundle B..G). The
|
||||
most consequential pattern in this pass is that the **CentralUI-008
|
||||
local-input-treated-as-UTC** bug, fixed for the legacy
|
||||
`AuditLog.razor` via the `BrowserTime.LocalInputToUtc` helper, has been
|
||||
silently recreated on every other page that exposes a
|
||||
`<input type="datetime-local">` filter — `AuditFilterBar` (the new
|
||||
operational Audit Log filter, CentralUI-026), `SiteCallsReport`,
|
||||
`NotificationReport`, and `EventLogs` (CentralUI-027). The Audit Log
|
||||
page CSV export URL therefore mis-shifts the From/To filter window by
|
||||
the operator's UTC offset, and the same offset bug silently corrupts
|
||||
audit-style queries on Site Calls / Notification Report / Event Logs.
|
||||
Second-most consequential is **CentralUI-028**: the new `NotificationReport`
|
||||
and `SiteCallsReport` pages (both `[Authorize(RequireDeployment)]`) do
|
||||
NOT filter their site dropdown or row data through `SiteScopeService`,
|
||||
and the relay actions (`RetryNotification`/`DiscardNotification`,
|
||||
`RetrySiteCall`/`DiscardSiteCall`) issue no server-side site-scope
|
||||
re-check before relaying to the owning site — so a site-scoped Deployment
|
||||
user can read and act on notifications and cached calls for sites
|
||||
outside their grant, replicating the original CentralUI-002 defect on
|
||||
the two pages added after the CentralUI-002 fix landed. The remaining
|
||||
new findings (CentralUI-029..CentralUI-033) cover a residual `JS.InvokeAsync<int>("eval", ...)`
|
||||
in `ConfigurationAuditLog`, a single-thread `StringWriter` capture buffer
|
||||
in the Test Run sandbox (a sandboxed script that uses `Task.WhenAll` can
|
||||
write concurrently), a `using var` `MemoryStream` followed by `ms.ToArray()`
|
||||
buffering the full bundle in memory in `TransportImport`, the
|
||||
`AuditResultsGrid` having no Previous-page control (forward-only navigation,
|
||||
a UX/design adherence gap), and the un-tested `TransportImport` /
|
||||
`SiteCallsReport` query-string drill-in code paths.
|
||||
|
||||
## Findings
|
||||
|
||||
### CentralUI-001 — Test Run sandbox executes arbitrary C# with no trust-model enforcement
|
||||
@@ -1216,3 +1265,278 @@ also forces the CentralUI-020 fix.
|
||||
**Resolution**
|
||||
|
||||
2026-05-17 — added `SessionExpiryComponentTests` (bUnit): an expired ping (401) redirects to `/login`, a live ping (200) and a transient failure (status 0) do not, and on the `/login` route the component neither pings nor redirects; also added `AuthPingEndpointTests` covering the `/auth/ping` endpoint contract.
|
||||
|
||||
### CentralUI-026 — `AuditFilterBar` From/To filters treat browser-local datetimes as UTC
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Audit/AuditFilterBar.razor:97-104`; `src/ScadaLink.CentralUI/Components/Audit/AuditQueryModel.cs:56-58,150-178,203-213` |
|
||||
|
||||
**Description**
|
||||
|
||||
The new operational Audit Log filter bar binds two `<input type="datetime-local">` controls
|
||||
straight to `AuditQueryModel.CustomFromUtc` / `CustomToUtc` (`DateTime?`), and `ToFilter`
|
||||
emits those values as `AuditLogQueryFilter.FromUtc` / `ToUtc` without converting from
|
||||
the browser's local time zone. A `datetime-local` input yields the user's *browser-local*
|
||||
wall-clock value, so for any non-UTC user the audit query window is shifted by their UTC
|
||||
offset — returning the wrong rows from the central `AuditLog` table and producing a
|
||||
mis-shifted CSV export through `AuditLogPage.BuildExportUrl`, which round-trips the
|
||||
filter's `FromUtc`/`ToUtc` straight into `?from=`/`?to=` query params. This is the same
|
||||
defect CentralUI-008 fixed for the legacy `Components/Pages/Monitoring/AuditLog.razor`
|
||||
via the `BrowserTime.LocalInputToUtc(value, _browserUtcOffsetMinutes)` helper — but the
|
||||
new Audit Log v2 filter bar does not use that helper, so a Bundle B/C/D/E/F regression
|
||||
re-introduced the bug for the page-replacement target. The CLAUDE.md "all timestamps are
|
||||
UTC throughout" decision is satisfied at the wire level but violated at the input
|
||||
boundary, exactly as the original finding called out.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Fetch the browser offset once via JS interop (mirroring `ConfigurationAuditLog.OnAfterRenderAsync`
|
||||
and `AuditLog.razor`'s implementation), pipe both `CustomFromUtc` and `CustomToUtc` through
|
||||
`BrowserTime.LocalInputToUtc(value, offsetMinutes)` inside `AuditQueryModel.ToFilter`
|
||||
(or in the filter-bar Apply path before calling `ToFilter`), and add a regression test
|
||||
that pins the non-UTC behaviour (mirroring `BrowserTimeTests.LocalInputToUtc_NonUtcBrowser_DoesNotEqualNaiveRelabelling`).
|
||||
The label "Custom From / To" should also be clarified ("UTC" vs "local") in the UI.
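
To make the failure mode concrete, the sketch below contrasts naive relabelling with an offset-aware conversion. It assumes the offset is the raw value of JavaScript's `getTimezoneOffset()` (minutes to add to local time to reach UTC, so positive west of UTC); `BrowserTimeSketch` is a hypothetical stand-in, not the repository's `BrowserTime` helper, whose implementation is not shown here.

```csharp
using System;

// Hypothetical illustration of the conversion; not the actual BrowserTime helper.
public static class BrowserTimeSketch
{
    // The pre-fix shape: keeps the wall-clock digits and only changes the label.
    public static DateTime NaiveRelabel(DateTime localInput) =>
        DateTime.SpecifyKind(localInput, DateTimeKind.Utc);

    // Offset-aware conversion: UTC = local wall-clock + getTimezoneOffset() minutes.
    public static DateTime LocalInputToUtc(DateTime localInput, int browserOffsetMinutes) =>
        DateTime.SpecifyKind(localInput.AddMinutes(browserOffsetMinutes), DateTimeKind.Utc);
}
```

For an operator at UTC-5 (`getTimezoneOffset()` returns 300), a filter value of 09:00 local corresponds to 14:00 UTC, whereas the naive relabel queries from 09:00 UTC and shifts the window by five hours. Note that an offset sampled once at page load reflects the current date, so a filter value on the other side of a daylight-saving transition can still be off by the DST difference.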
|
||||
|
||||
### CentralUI-027 — Same UTC misinterpretation in `SiteCallsReport`, `NotificationReport`, and `EventLogs`
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor:74-80`; `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor.cs:421-425`; `src/ScadaLink.CentralUI/Components/Pages/Notifications/NotificationReport.razor:75-81,639-640`; `src/ScadaLink.CentralUI/Components/Pages/Monitoring/EventLogs.razor:62-73,261-262` |
|
||||
|
||||
**Description**
|
||||
|
||||
The same `datetime-local`-treated-as-UTC bug from CentralUI-008 and CentralUI-026 is
|
||||
present on three other pages:
|
||||
|
||||
- `SiteCallsReport.ToUtc` stamps `DateTimeKind.Utc` on the local-input value
|
||||
(`DateTime.SpecifyKind(value.Value, DateTimeKind.Utc)`).
|
||||
- `NotificationReport.ToUtc` does the same — `new DateTimeOffset(DateTime.SpecifyKind(local.Value, DateTimeKind.Utc))`.
|
||||
- `EventLogs.FetchPage` emits `new DateTimeOffset(_filterFrom.Value, TimeSpan.Zero)`,
|
||||
which labels the browser-local wall-clock value as UTC (the exact pre-fix shape of
|
||||
CentralUI-008).
|
||||
|
||||
For any non-UTC operator, every Site-Calls / Notification / Event-Log query is silently
|
||||
shifted by their UTC offset. The bug is mass-recreated on every page added after
|
||||
CentralUI-008 landed — the `BrowserTime` helper exists but is only used by the legacy
|
||||
Audit Log page and `ConfigurationAuditLog`.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Plumb the browser offset into each of these pages (mirroring
`ConfigurationAuditLog`/`AuditLog.razor`, but preferably via a dedicated JS module
rather than the `eval` interop flagged in CentralUI-029) and route every
local-input value through `BrowserTime.LocalInputToUtc(value, offsetMinutes)` before
constructing the wire filter. Add regression tests pinning the non-UTC behaviour for
at least one representative page so the helper's continued use is enforced.
|
||||
|
||||
### CentralUI-028 — `NotificationReport` and `SiteCallsReport` bypass `SiteScopeService` — Deployment role site-scoping defeated on the two new central-mirror pages
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Notifications/NotificationReport.razor:2,434,472,502`; `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor:2,52-59`; `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor.cs:97-110,201,250-251,278-279` |
|
||||
|
||||
**Description**
|
||||
|
||||
Both pages are `[Authorize(Policy = RequireDeployment)]` and, per CLAUDE.md "Security &
|
||||
Auth", the Deployment role must be site-scoped. CentralUI-002 fixed this for every
|
||||
Deployment/Monitoring page that existed at the time by introducing `SiteScopeService`
|
||||
and threading `FilterSitesAsync` / `IsSiteAllowedAsync` through the site dropdowns and
|
||||
mutating calls. The two new central-mirror pages — Notification Report (Notification
|
||||
Outbox queryable list) and Site Calls Report (Site Call Audit queryable list) — do NOT
|
||||
inject `SiteScopeService`, do NOT filter their Source-Site `<select>` lists (they
|
||||
enumerate `await SiteRepository.GetAllSitesAsync()` straight to the dropdown), do NOT
|
||||
narrow the query results by permitted site, and do NOT re-check the user's grant
|
||||
before relaying Retry/Discard to the owning site. `NotificationReport.RetryNotificationAsync`,
|
||||
`NotificationReport.DiscardNotificationAsync`, `SiteCallsReport.RetrySiteCallAsync`,
|
||||
and `SiteCallsReport.DiscardSiteCallAsync` all dispatch with the row's `SourceSiteId` /
|
||||
`SourceSite` unchecked. A scoped Deployment user can therefore (a) browse every row in
the central `Notifications` / `SiteCalls` table, including rows for sites outside their
grant, and (b) submit Retry/Discard URLs hand-crafted from the row metadata; (c) the
site relay then completes successfully because the CommunicationService sees only the
row's source-site identifier, not the user's grant. This is a direct regression of the
CentralUI-002 contract on the two pages that landed after CentralUI-002 was closed.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Inject `SiteScopeService` into both pages; filter the source-site dropdown through
|
||||
`FilterSitesAsync`; default the filter to the permitted-site set so a scoped user sees
|
||||
only their own rows (or push the predicate into the central query — preferred, so the
|
||||
filter cannot be bypassed by URL manipulation); and re-check `IsSiteAllowedAsync` in
|
||||
`RetryNotificationAsync`/`DiscardNotificationAsync`/`RetrySiteCallAsync`/`DiscardSiteCallAsync`
|
||||
before the CommunicationService call, surfacing a "not permitted for this site" toast
|
||||
on failure (mirroring `ParkedMessages.razor`'s `SelectedSiteIsPermitted` guard).
|
||||
Add `Site_ScopedDeploymentUser_OnlySeesPermittedRows` and
|
||||
`Site_ScopedDeploymentUser_CannotRetryRowOnNonPermittedSite` regression tests modelled
|
||||
on `TopologyPageTests.SiteScoping_*`.
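The re-check before the relay could look roughly like the sketch below. `ISiteScope`, `FixedSiteScope`, and `SiteScopedRelay` are hypothetical stand-ins introduced only for illustration; the real `SiteScopeService.IsSiteAllowedAsync` signature and site-identifier type may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for SiteScopeService; only the one method the guard
// needs is modelled, and the string site id is an assumption.
public interface ISiteScope
{
    Task<bool> IsSiteAllowedAsync(string siteId);
}

// Test double that permits a fixed set of sites.
public sealed class FixedSiteScope : ISiteScope
{
    private readonly HashSet<string> _permitted;

    public FixedSiteScope(IEnumerable<string> permitted) =>
        _permitted = new HashSet<string>(permitted);

    public Task<bool> IsSiteAllowedAsync(string siteId) =>
        Task.FromResult(_permitted.Contains(siteId));
}

public static class SiteScopedRelay
{
    // Invokes the relay only when the caller's grant covers the row's source
    // site. Returns false without invoking the relay otherwise, so the page
    // can surface a "not permitted for this site" toast.
    public static async Task<bool> TryRelayAsync(
        ISiteScope scope, string sourceSiteId, Func<Task> relay)
    {
        if (!await scope.IsSiteAllowedAsync(sourceSiteId))
            return false;

        await relay();
        return true;
    }
}
```

Routing all four Retry/Discard handlers through one guard of this shape would keep the permission check in a single place instead of four.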
|
||||
|
||||
### CentralUI-029 — `ConfigurationAuditLog` uses `JS.InvokeAsync<int>("eval", ...)` instead of a dedicated JS module
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Pages/Audit/ConfigurationAuditLog.razor:248-263` |
|
||||
|
||||
**Description**
|
||||
|
||||
`OnAfterRenderAsync` fetches the browser's UTC offset with
|
||||
`JS.InvokeAsync<int>("eval", "new Date().getTimezoneOffset()")`. Calling `eval` over
|
||||
JS interop is a code-smell: it widens the JS-interop attack surface (any future
|
||||
attacker who can influence the second argument runs arbitrary JS), it is brittle
|
||||
under stricter Content-Security-Policy headers (CSP `script-src` directives commonly
|
||||
forbid `unsafe-eval`), and it bypasses the existing module-import pattern the rest
|
||||
of the module follows (`session-expiry.js`, `audit-grid.js`, `nav-state.js`,
|
||||
`transport.js` are all loaded as `IJSObjectReference` modules). The legacy
|
||||
`AuditLog.razor` (CentralUI-008 fix) and the planned helper exist precisely to avoid
|
||||
this. Today the eval text is a static string so there is no live bug; the issue is
|
||||
that the pattern invites a future caller to compose the argument from page state.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Move the offset lookup into a small wwwroot JS module (e.g.
|
||||
`wwwroot/js/browser-time.js` exporting `getTimezoneOffsetMinutes()`) and `import` it
|
||||
via `IJSObjectReference` like the other helpers. Replace the `eval` call with
|
||||
`module.InvokeAsync<int>("getTimezoneOffsetMinutes")`. The fix is local and removes
|
||||
a residual eval surface; the same module can host the rest of the `BrowserTime`
|
||||
plumbing CentralUI-027 will need.
|
||||
|
||||
### CentralUI-030 — `SandboxConsoleCapture`'s per-call `StringWriter` is not thread-safe under intra-script concurrency
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/ScriptAnalysis/SandboxConsoleCapture.cs:31-118`; `src/ScadaLink.CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:401-404` |
|
||||
|
||||
**Description**
|
||||
|
||||
CentralUI-003 correctly routed console capture through an `AsyncLocal<StringWriter?>`
|
||||
so concurrent Test Runs cannot cross-contaminate. `BeginCapture` flows the capture
|
||||
buffer through the call-tree, and `Target` reads it on every `Write`. But a single
|
||||
script execution can still write to its captured `StringWriter` from multiple threads
|
||||
within one call-tree: the script trust model allows `System.Threading.Tasks`, so a
|
||||
user script can `await Task.WhenAll(t1, t2, t3)` where each task is `Task.Run(() => Console.WriteLine(...))`,
|
||||
and `_current.Value` flows into each `Task.Run`. The capture buffer is a plain
|
||||
`StringWriter` (`captured = new StringWriter()` in `RunInSandboxAsync`), which is
|
||||
**not** thread-safe — concurrent `WriteLine` calls can throw or interleave
|
||||
at the character level. The Akka/gRPC-thread race fixed by CentralUI-003 is gone, but the
|
||||
intra-script-concurrency race is a residual hazard for any script that exercises
|
||||
parallel tasks (a realistic shape for a Test Run that calls multiple `External.Call`s
|
||||
concurrently). Severity is Low because the symptom is a corrupted ConsoleOutput
|
||||
string, not a security/data-loss issue, and the script must opt into Task-based
|
||||
concurrency to trigger it.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Wrap the capture buffer with `TextWriter.Synchronized(new StringWriter())` (the
|
||||
BCL's purpose-built thread-safe wrapper), or hold a lock inside `SandboxConsoleCapture.Write*`
|
||||
on the current scope's `StringWriter`. Add a focused test that runs `await Task.WhenAll(...)`
|
||||
with `Console.WriteLine` in each task and asserts the resulting `ConsoleOutput` has
|
||||
the expected line count regardless of thread interleaving.
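A minimal sketch of the `TextWriter.Synchronized` option and the line-count assertion it enables. `SynchronizedCaptureDemo` and its method are illustrative names, not part of `SandboxConsoleCapture`.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class SynchronizedCaptureDemo
{
    // Writes one line from each of taskCount parallel tasks into a shared
    // capture buffer and returns how many lines the buffer ends up holding.
    public static async Task<int> CountCapturedLinesAsync(int taskCount)
    {
        var buffer = new StringWriter();

        // TextWriter.Synchronized returns a wrapper that serializes every
        // call onto the underlying writer.
        TextWriter captured = TextWriter.Synchronized(buffer);

        var tasks = Enumerable.Range(0, taskCount)
            .Select(i => Task.Run(() => captured.WriteLine($"line {i}")));
        await Task.WhenAll(tasks);

        // All writers have completed, so reading the inner buffer is safe.
        return buffer.ToString()
            .Split('\n', StringSplitOptions.RemoveEmptyEntries)
            .Length;
    }
}
```

The order of the captured lines still depends on scheduling; only the line count and the integrity of each line are stable, which is why the suggested test asserts on count rather than content order.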
|
||||
|
||||
### CentralUI-031 — `TransportImport` buffers the full bundle bytes in component state
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Performance & resource management |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Pages/Design/TransportImport.razor.cs:72,104-142,160-161` |
|
||||
|
||||
**Description**
|
||||
|
||||
`OnFileSelectedAsync` reads the uploaded `.scadabundle` into a `MemoryStream`,
|
||||
calls `ms.ToArray()`, and stores the byte array on the component as
|
||||
`private byte[]? _bundleBytes`. The bytes live on the Blazor circuit for the
|
||||
lifetime of the wizard — through the passphrase step, the diff step (which can
|
||||
take an arbitrary amount of operator time on a large bundle), the confirm step,
|
||||
and the apply step — and are only cleared in `ResetSessionState` (Done /
|
||||
re-upload). For an operator who walks away from the diff step mid-review, up to
|
||||
`MaxBundleSizeMb` of bundle bytes (the cap is enforced only by the file-size
|
||||
check on read) stays pinned on the central node's heap per
|
||||
open circuit. The page has no `IDisposable` to clear the bytes on tear-down
|
||||
either. Severity is Low because the cap is checked at upload time and Import
|
||||
is Admin-only (limited concurrent users), but the lifetime is longer than the
|
||||
strictly-needed retention.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Stream the bundle to a temp file (or to the `IBundleImporter`'s session store)
|
||||
rather than caching it on the component. Failing that, implement `IDisposable`
|
||||
on `TransportImport` and clear `_bundleBytes` (`Array.Clear` for sensitivity)
|
||||
on dispose; also clear the cached passphrase string. Tighten `MaxBundleSizeMb`
|
||||
docs to call out the in-memory cost per concurrent import session.
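The fallback option (clear on dispose) might be sketched as follows. `BundleBuffer` is a hypothetical holder type used only to keep the example self-contained; in the real page the same logic would sit in `TransportImport.Dispose` and operate on `_bundleBytes`.

```csharp
#nullable enable
using System;

// Hypothetical wrapper around the uploaded bundle bytes.
public sealed class BundleBuffer : IDisposable
{
    private byte[]? _bytes;

    public BundleBuffer(byte[] bytes) => _bytes = bytes;

    public bool HasContent => _bytes is not null;

    // Safe to call more than once: the second call finds nothing to clear.
    public void Dispose()
    {
        if (_bytes is null) return;

        Array.Clear(_bytes, 0, _bytes.Length); // zero the payload first
        _bytes = null;                         // then drop the reference
    }
}
```

Blazor disposes components that implement `IDisposable` when they are removed from the render tree or the circuit ends, so this would bound the retention to the circuit's lifetime even when the operator never reaches Done.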
|
||||
|
||||
### CentralUI-032 — `AuditResultsGrid` paging is forward-only, no Previous button
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Audit/AuditResultsGrid.razor:76-82`; `src/ScadaLink.CentralUI/Components/Audit/AuditResultsGrid.razor.cs:65,196-197,219-220` |
|
||||
|
||||
**Description**
|
||||
|
||||
The Audit Log results grid (Bundle B / M7-T3) renders a single "Next page" button
|
||||
and a `Page N · M rows` label, with no Previous control. The design doc says
|
||||
"Keyset pagination ordered by `(OccurredAtUtc desc, EventId desc)`. Default page
|
||||
size 100." — keyset paging is naturally forward-only, but a usable audit-triage
|
||||
workflow needs to step back to the previous page (the `SiteCallsReport` keyset
|
||||
implementation correctly maintains a `Stack<(...)> _cursorStack` for exactly this).
|
||||
An operator who clicks Next once and misses a row on the first page cannot return
|
||||
without re-applying the filter to start a fresh first page. The current shape
|
||||
also makes the "Page N" label slightly misleading — there is no in-grid affordance
|
||||
to use it as a navigation target.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Mirror the `SiteCallsReport.razor.cs` keyset-paging shape: maintain a
|
||||
`Stack<(DateTime?, Guid?)> _cursorStack` of previous-page cursors, add a Previous
|
||||
button gated on `_cursorStack.Count > 0`, push the current cursor on Next and pop
|
||||
on Previous. Either implement this or update the design doc to acknowledge
|
||||
forward-only paging on the Audit Log grid.
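The cursor-stack bookkeeping described above can be sketched in isolation. `KeysetPager` is a hypothetical helper, and the tuple element names are assumptions taken from the design doc's ordering columns; the actual `SiteCallsReport` implementation may differ in detail.

```csharp
using System;
using System.Collections.Generic;

public sealed class KeysetPager
{
    // Cursors of the pages already visited, most recent on top.
    private readonly Stack<(DateTime?, Guid?)> _cursorStack = new();

    // Cursor that produced the page currently shown; (null, null) is page one.
    public (DateTime? OccurredAtUtc, Guid? EventId) Current { get; private set; }

    public bool CanGoPrevious => _cursorStack.Count > 0;

    public int PageNumber => _cursorStack.Count + 1;

    // Next: remember the current page's cursor, then move forward.
    public void Next((DateTime?, Guid?) nextCursor)
    {
        _cursorStack.Push(Current);
        Current = nextCursor;
    }

    // Previous: restore the most recently remembered cursor, if any.
    public bool Previous()
    {
        if (!CanGoPrevious) return false;
        Current = _cursorStack.Pop();
        return true;
    }

    // Applying a new filter starts a fresh first page.
    public void Reset()
    {
        _cursorStack.Clear();
        Current = default;
    }
}
```

With this shape the "Page N" label and the Previous button's enabled state both derive from the stack depth, so they cannot drift apart.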
|
||||
|
||||
### CentralUI-033 — Drill-in / query-string code paths for the new Transport + SiteCalls pages are untested
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Pages/Design/TransportImport.razor.cs:97-238,267-319`; `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor.cs:107-148`; `tests/ScadaLink.CentralUI.Tests/Pages/Design/TransportImportPageTests.cs`; `tests/ScadaLink.CentralUI.Tests/Pages/SiteCallsReportPageTests.cs` |
|
||||
|
||||
**Description**
|
||||
|
||||
The CentralUI-025 lesson — "a critical drill-in/redirect path was untested, so the
|
||||
CentralUI-020 defect was not caught" — applies again to the two newest pages.
|
||||
`SiteCallsReport.ApplyQueryStringFilters` parses `?status=` and `?stuck=true` to
|
||||
seed the filters from a Health-dashboard KPI tile drill-in; there is no test that
|
||||
pins this seeding (an unrecognised status, a missing param, the case-insensitive
|
||||
match). `TransportImport` has a 5-step state machine and a 3-strike passphrase
|
||||
lockout, both with intricate transition logic
|
||||
(`GoFromUploadAsync` re-trying `LoadAsync`, the `_failedUnlockAttempts` reset on
|
||||
success, the audit-row write on failure) — none of the step-machine transition
|
||||
paths or the lockout reset / lockout-trip behaviours are pinned by tests. The
|
||||
existing `TransportImportPageTests` exercise rendering shapes, not the lifecycle.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add bUnit tests for `SiteCallsReport.ApplyQueryStringFilters` covering valid /
|
||||
invalid / case-mismatched `?status=` values and the `?stuck=true` toggle, and
|
||||
add `TransportImport` lifecycle tests covering: an encrypted-bundle upload
|
||||
advances to Step 2 without opening a session; a wrong passphrase increments the
|
||||
counter and writes the `BundleImportUnlockFailed` audit row; the lockout resets
|
||||
the wizard to Step 1 once `MaxUnlockAttemptsPerSession` is reached; a successful
|
||||
unlock resets the counter and advances to Step 3.
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.ClusterInfrastructure` |
|
||||
| Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 4 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -45,6 +45,43 @@ part of the configuration contract but is never consumed — `ScadaLink.Host`'s
|
||||
does not enforce the design doc's requirement that `down-if-alone` be `on` for the
|
||||
keep-oldest resolver, so `DownIfAlone = false` is silently accepted (CI-010, Low).
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
The only change to this module between `39d737e` and `1eb6e97` is the
|
||||
documentation-only commit `1eb6e97` itself, which added a handful of `<param>`
|
||||
XML doc tags to `ClusterOptionsValidator.Validate` and to
|
||||
`AddClusterInfrastructureActors` — no source-of-truth changes. Walked all three
|
||||
source files and all three test files against the full 10-category checklist
|
||||
again. Found **four new issues**, all Low severity, that the prior re-review
|
||||
either did not surface or that have aged into the file:
|
||||
|
||||
- **CI-011 (Low, Code organization)** — `ClusterOptions.SectionName` is
|
||||
documented as "the single source of truth so binding sites do not hard-code
|
||||
the magic string" (the very justification CI-005's resolution offered), but
|
||||
`ScadaLink.Host.SiteServiceRegistration.BindSharedOptions:100` and three
|
||||
references in `ScadaLink.Host.StartupValidator` all hard-code
|
||||
`"ScadaLink:Cluster"` literals. The constant is decorative — a "single source
|
||||
of truth" that nothing reads. Same pattern as CI-009 (inert configuration knob).
|
||||
- **CI-012 (Low, Design-document adherence)** — the validator accepts
|
||||
`SeedNodes.Count == 1` even though the design doc states "both nodes are seed
|
||||
nodes" (a properly-configured deployment lists 2). `Host.StartupValidator:45`
|
||||
already enforces `>= 2`, so this module's own contract validator is the
|
||||
weaker of the two. Inconsistent enforcement across the two projects that
|
||||
share ownership of the cluster contract.
|
||||
- **CI-013 (Low, Documentation & comments)** — `ClusterOptionsTests
|
||||
.Properties_CanBeSetToCustomValues` deliberately sets
|
||||
`SplitBrainResolverStrategy = "keep-majority"` and `MinNrOfMembers = 2` — the
|
||||
exact values the design doc warns are catastrophic. The CI-006 resolution
|
||||
acknowledged this is intentional (testing the POCO accepts any value; the
|
||||
validator does the rejecting) but the test has no inline comment saying so,
|
||||
and a future reader could easily misinterpret it as endorsing those values.
|
||||
- **CI-014 (Low, Code organization)** — `AddClusterInfrastructureActors` is
|
||||
dead surface: no caller exists anywhere in the solution (verified via
|
||||
`grep -rn`), its XML doc instructs callers "do not call", and its body
|
||||
unconditionally throws. CI-002's resolution chose "fail loudly" over "delete"
|
||||
but the method now offers nothing — keeping it is API-surface noise that an
|
||||
IDE will still suggest via auto-complete.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
|
||||
@@ -63,6 +100,21 @@ Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
|
||||
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). **Re-review:** CI-006 resolved — 16 tests across three classes covering options, validator, and DI registration. No `DownIfAlone`-wiring test exists, but that wiring lives in the Host (CI-009). No new issue here. |
|
||||
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). **Re-review:** CI-007/CI-008 resolved — full XML docs on all members; skeleton comments gone. Note: the `DownIfAlone` XML doc calls `true` "the design-doc requirement" yet the value is inert (CI-009) and unenforced (CI-010). |
|
||||
|
||||
_Re-review (2026-05-28, `1eb6e97`):_
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | Validator logic and DI registration are correct. No new defects. |
|
||||
| 2 | Akka.NET conventions | ✓ | No actors in this module (legitimate, per CI-001 resolution). Nothing actor-shaped to evaluate. |
|
||||
| 3 | Concurrency & thread safety | ✓ | Validator and DI extensions remain stateless. No issues. |
|
||||
| 4 | Error handling & resilience | ✓ | Validator now rejects every catastrophic value the design doc enumerates. New — it accepts `SeedNodes.Count == 1` even though the design doc requires both nodes as seeds, and `Host.StartupValidator` enforces `>= 2`, so the module's own validator is the weaker check (CI-012). |
|
||||
| 5 | Security | ✓ | No authn/authz surface, no secret handling, no remoting transport configured here. No issues. |
|
||||
| 6 | Performance & resource management | ✓ | No resources held; validator allocates a small failure list per call only. No issues. |
|
||||
| 7 | Design-document adherence | ✓ | `ClusterOptions` contract complete and validated. New — validator's seed-node count check is weaker than the design (CI-012). |
|
||||
| 8 | Code organization & conventions | ✓ | Options/validator placement and Options pattern correct. New — `SectionName` constant documented as "single source of truth" but never read by any binding site (CI-011); `AddClusterInfrastructureActors` is dead surface that no caller invokes (CI-014). |
|
||||
| 9 | Testing coverage | ✓ | 16 tests across three classes. New — `ClusterOptionsTests.Properties_CanBeSetToCustomValues` sets the exact catastrophic values the design doc forbids without an inline comment explaining why (CI-013). |
|
||||
| 10 | Documentation & comments | ✓ | XML docs accurate across all source files (commit `1eb6e97` filled in the remaining `<param>` tags). New — CI-013 (test lacks intent comment); CI-011 (XML doc for `SectionName` claims a property the code does not deliver). |
|
||||
|
||||
## Findings
|
||||
|
||||
### ClusterInfrastructure-001 — Module implements none of its documented responsibilities
|
||||
@@ -628,3 +680,181 @@ message explaining the isolated-single-node-cluster hazard, consistent with how
|
||||
validator already rejects quorum split-brain strategies. Developed test-first:
|
||||
`ClusterOptionsValidatorTests.DownIfAloneFalse_FailsValidation` was written first,
|
||||
confirmed failing, then passing after the fix. Module test suite green (18 passed).
|
||||
|
||||
### ClusterInfrastructure-011 — `SectionName` constant is decorative — no binding site references it
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:24-27`, `src/ScadaLink.Host/SiteServiceRegistration.cs:100`, `src/ScadaLink.Host/StartupValidator.cs:43`, `src/ScadaLink.Host/StartupValidator.cs:45`, `src/ScadaLink.Host/StartupValidator.cs:75` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ClusterOptions.SectionName` was added by CI-005 as `public const string SectionName =
|
||||
"ScadaLink:Cluster";`, with an XML doc declaring it "the single source of truth so
|
||||
binding sites do not hard-code the magic string". CI-005's resolution likewise framed
|
||||
the constant as the canonical reference value. In practice, **no caller in the
|
||||
solution reads it**. `grep -rn "ClusterOptions.SectionName" src/` returns zero hits.
|
||||
Every site that needs the section name hard-codes the literal:
|
||||
|
||||
- `ScadaLink.Host.SiteServiceRegistration.BindSharedOptions:100` —
|
||||
`services.Configure<ClusterOptions>(config.GetSection("ScadaLink:Cluster"));`
|
||||
- `ScadaLink.Host.StartupValidator:43,45,75` — three `"ScadaLink:Cluster"` /
|
||||
`"ScadaLink:Cluster:SeedNodes"` literals.
|
||||
|
||||
The `SectionName_IsTheExpectedAppSettingsSection` test pins the constant's value but
|
||||
does not protect against the underlying drift hazard: if someone changes
|
||||
`SectionName` to `"ScadaLink:Akka:Cluster"`, the test still passes (because it tests
|
||||
the constant against the same literal), the validator still registers, and binding
|
||||
silently goes to whichever string the Host hard-codes. The constant currently
|
||||
provides none of the safety its XML doc claims. This is the same pattern of "inert
|
||||
configuration knob" CI-009 flagged for `DownIfAlone`, just with the harm being
|
||||
configuration drift rather than runtime behaviour.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) replace the hard-coded `"ScadaLink:Cluster"` literals in
|
||||
`SiteServiceRegistration.cs:100` and `StartupValidator.cs:43,45,75` with
|
||||
`ClusterOptions.SectionName` (a small Host-module change, to be tracked there), or
|
||||
(b) if the constant is intentionally decorative, soften the XML doc so it does not
|
||||
claim to be the source of truth. Do not leave a public constant whose stated
|
||||
guarantee the code does not deliver.
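For option (a), one way to keep the nested `SeedNodes` key from drifting is to derive it from the constant. This is a sketch only: `ClusterOptionsSketch` and `SeedNodesKey` are hypothetical names, and whether the derived key belongs on `ClusterOptions` or in the Host is an open choice.

```csharp
// Hypothetical mirror of ClusterOptions showing only the configuration keys.
public static class ClusterOptionsSketch
{
    // Same value the finding quotes for ClusterOptions.SectionName.
    public const string SectionName = "ScadaLink:Cluster";

    // Derived at compile time, so renaming the section renames this key too.
    public const string SeedNodesKey = SectionName + ":SeedNodes";
}
```

A binding site would then pass `ClusterOptions.SectionName` to `config.GetSection(...)` instead of repeating the literal, and a rename would fail to compile rather than silently bind to the wrong section.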
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Open — needs a one-line Host-side change to reference the constant, plus a test
|
||||
that proves the section name flows from this module to the Host._
|
||||
|
||||
### ClusterInfrastructure-012 — Validator accepts `SeedNodes.Count == 1` despite design requiring both nodes as seeds
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs:30-33` |
|
||||
|
||||
**Description**
|
||||
|
||||
`Component-ClusterInfrastructure.md` (Node Configuration) states:
|
||||
|
||||
> Cluster seed nodes: **Both nodes** are seed nodes — each node lists both itself and
|
||||
> its partner. Either node can start first and form the cluster; the other joins when
|
||||
> it starts. No startup ordering dependency.
|
||||
|
||||
A correctly-configured ScadaLink deployment therefore lists **two** seed nodes.
|
||||
`ClusterOptionsValidator.Validate` only checks that `SeedNodes` is non-null and
|
||||
non-empty (it rejects only `Count == 0`). A configuration with a single seed node passes validation
|
||||
silently — but that defeats the "no startup ordering dependency" guarantee the
|
||||
design doc explicitly calls out.
|
||||
|
||||
`ScadaLink.Host.StartupValidator:43-46` does enforce the rule:
|
||||
|
||||
```csharp
|
||||
var seedNodes = configuration.GetSection("ScadaLink:Cluster:SeedNodes").Get<List<string>>();
|
||||
if (seedNodes is null || seedNodes.Count < 2)
|
||||
    errors.Add("ScadaLink:Cluster:SeedNodes must have at least 2 entries");
|
||||
```
|
||||
|
||||
So the rule is enforced — but by the **other** project, after the
|
||||
`ClusterOptionsValidator` (the contract owner) already accepted the value. This is
|
||||
inconsistent (two validators with different rules for the same field), and the
|
||||
weaker check is the contract-owner's. The pre-existing test
|
||||
`ServiceCollectionExtensionsTests.AddClusterInfrastructure_ValidatorRejectsBadOptionsAtResolution`
|
||||
even constructs a `SeedNodes` list with one entry and expects validation to succeed
|
||||
on that count — locking in the gap.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Tighten the validator: require `SeedNodes.Count >= 2` with a message that references
|
||||
the "both nodes are seed nodes" design rule. Update
|
||||
`AddClusterInfrastructure_ValidatorRejectsBadOptionsAtResolution` to use a two-entry
|
||||
list, and add a test case for `SeedNodes.Count == 1` failing validation. Once this
|
||||
module's validator enforces the rule, `Host.StartupValidator`'s duplicate check
|
||||
becomes redundant and can be removed in the Host's review.
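The tightened rule might look like the sketch below. `SeedNodeRule` is a hypothetical standalone helper so the example runs without the Options packages; in the module the check would live inside `ClusterOptionsValidator.Validate`, and the message wording is an assumption.

```csharp
#nullable enable
using System.Collections.Generic;

public static class SeedNodeRule
{
    // The design doc requires both nodes of the pair to be listed as seeds.
    public const int RequiredSeedNodes = 2;

    // Returns the validation failures for a SeedNodes list; empty means valid.
    public static IReadOnlyList<string> Validate(IReadOnlyList<string>? seedNodes)
    {
        var failures = new List<string>();

        if (seedNodes is null || seedNodes.Count < RequiredSeedNodes)
        {
            failures.Add(
                $"SeedNodes must list at least {RequiredSeedNodes} entries: " +
                "both nodes are seed nodes, each listing itself and its partner.");
        }

        return failures;
    }
}
```

The null, empty, and single-entry cases all fail with the same message here, which matches the three test cases the recommendation implies.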
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Open._
|
||||
|
||||
### ClusterInfrastructure-013 — Test uses catastrophic config values without an inline-intent comment
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Documentation & comments |
|
||||
| Status | Open |
|
||||
| Location | `tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsTests.cs:47-67` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ClusterOptionsTests.Properties_CanBeSetToCustomValues` deliberately sets two values
|
||||
the design doc explicitly warns are catastrophic:
|
||||
|
||||
```csharp
|
||||
SplitBrainResolverStrategy = "keep-majority", // design doc: total cluster shutdown on partition
|
||||
...
|
||||
MinNrOfMembers = 2 // design doc: blocks singleton, halts data collection
|
||||
```
|
||||
|
||||
The CI-006 resolution acknowledged this is intentional — the test exercises the POCO
|
||||
property setter (which by design accepts any string/int because the validator does
|
||||
the rejecting), and `ClusterOptionsValidatorTests.UnsupportedSplitBrainStrategy_FailsValidation`
|
||||
and `MinNrOfMembers_NotOne_FailsValidation` prove the validator rejects them. But this
|
||||
reasoning is recorded **only** in the CI-006 resolution text in this findings file,
|
||||
not in the test itself. A reader landing on the test cold has no signal that these
|
||||
values are forbidden in production; they could reasonably infer the test endorses
|
||||
them.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a brief XML-doc / inline comment to `Properties_CanBeSetToCustomValues` stating
|
||||
that it exercises only the POCO's setter — these values intentionally do **not**
|
||||
represent a valid runtime configuration, and `ClusterOptionsValidator` rejects them
|
||||
(with a cross-reference to the relevant validator tests). Two lines is enough; the
|
||||
goal is to make the test's intent self-documenting.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Open._
|
||||
|
||||
### ClusterInfrastructure-014 — `AddClusterInfrastructureActors` is dead surface — no caller, no behaviour
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:42-48` |
|
||||
|
||||
**Description**
|
||||
|
||||
`AddClusterInfrastructureActors` has now reached a curious state: it is a public
|
||||
extension method with an XML doc that ends "Do not call AddClusterInfrastructureActors()"
|
||||
and a body that unconditionally throws `NotImplementedException`. CI-002's resolution
|
||||
chose "throw loudly" over "delete" specifically because CI-001 was still resolving the
|
||||
ownership-split question. That question is settled — the design doc, the README
|
||||
component table, and `Component-ClusterInfrastructure.md`'s "Implementation Note — Code
|
||||
Placement" all permanently locate the Akka actor bootstrap in `ScadaLink.Host`.
|
||||
|
||||
A `grep -rn "AddClusterInfrastructureActors" src/ tests/` confirms there is no caller
|
||||
anywhere in the solution. The method's only consumer is its own test
|
||||
(`AddClusterInfrastructureActors_ThrowsRatherThanSilentlySucceeding`), which asserts
|
||||
that the method throws when called. Keeping it costs API surface (IDE auto-complete
|
||||
suggests it, the docs render it, and a future contributor might re-introduce a call
|
||||
expecting it to register something), and gives nothing in return.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Delete `AddClusterInfrastructureActors`, delete its test, and add a one-line note to
|
||||
`docs/requirements/Component-ClusterInfrastructure.md`'s code-placement section
|
||||
explicitly stating that this project exposes no actor-registration extension
|
||||
(actor wiring lives in `ScadaLink.Host`). If the user prefers to keep the
|
||||
"fail-fast" trap, mark the method `[Obsolete("...", error: true)]` so the compiler,
|
||||
not the runtime, rejects the call.
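If the trap is kept, the attribute form could look like this. `ClusterRegistrationSketch` and the message text are illustrative, and the method is shown parameterless rather than as the real `IServiceCollection` extension.

```csharp
using System;

public static class ClusterRegistrationSketch
{
    // ObsoleteAttribute takes (string message, bool error). With error set to
    // true, any direct call site fails to compile (CS0619) instead of warning.
    [Obsolete(
        "Actor wiring lives in ScadaLink.Host; this method registers nothing.",
        error: true)]
    public static void AddClusterInfrastructureActors() =>
        throw new NotSupportedException(
            "Actor wiring lives in ScadaLink.Host.");
}
```

A side effect worth noting: the existing test that calls the method to assert it throws would no longer compile, so it would have to be removed or rewritten to inspect the attribute via reflection.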
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Open._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.Commons` |
|
||||
| Design doc | `docs/requirements/Component-Commons.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 9 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -46,6 +46,42 @@ indexer that rejects `long` indices (Commons-013) and an `OpcUaEndpointConfigSer
|
||||
legacy-fallback path that can mislabel a corrupt new-shape row as `Legacy` (Commons-014).
|
||||
No Critical, High, or Medium issues were found.
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
Commons has grown substantially since `39d737e` — 132 changed files (≈ +4 600 lines), driven
|
||||
by the Audit Log (#23), Site Call Audit (#22), and Transport (#24) work. The new surface
|
||||
area covers six new entity domain folders (Audit, Transport types under `Types/Transport`),
|
||||
seven new service interfaces (`IPartitionMaintenance`, `INodeIdentityProvider`,
|
||||
`ISiteAuditQueue`, `ICachedCallLifecycleObserver`, `ICachedCallTelemetryForwarder`,
|
||||
`IOperationTrackingStore`, `IBundleExporter` / `IBundleImporter` / `IBundleSessionStore` /
|
||||
`IAuditCorrelationContext`), a new `IAuditLogRepository`, and three new message folders
|
||||
(`Messages/Audit/`, `Messages/Integration/` extensions, `Messages/Management/TransportCommands`).
|
||||
The `SourceNode` thread-through and `ExecutionId` / `ParentExecutionId` additive-evolution
|
||||
fields are uniformly applied across `AuditEvent`, `SiteCall`, `Notification`,
|
||||
`NotificationSubmit`, `RouteToCallRequest`, `ScriptCallRequest`, and `SiteHealthReport` —
|
||||
all as trailing optional parameters, consistent with REQ-COM-5a.
|
||||
|
||||
All fourteen prior findings (Commons-001 through Commons-014) remain `Resolved`. Nine new
|
||||
findings were recorded this pass: one Medium on the lack of UTC-kind enforcement for the
|
||||
new `DateTime`-typed `*Utc` columns (Commons-019), one Medium on unconstrained
|
||||
`EncryptionMetadata` (Commons-015), one Medium on the now-substantially-stale design doc
|
||||
(Commons-017), and six Low findings covering minor convention drift, missing unit tests
|
||||
for the Transport types, an unresolvable `<see cref>` in `IAuditCorrelationContext`, a
|
||||
benign lazy-parse race in `ExternalCallResult.Response`, undocumented JSON-blob shapes,
|
||||
two interfaces parked in the wrong folder, and a magic-number threshold in `BundleSession`.
|
||||
No Critical or High issues were found.
|
||||
|
||||
The architectural-constraint tests still enforce the no-Akka/no-EF/no-ASP.NET rule, the
|
||||
POCO-entity and message-as-record conventions, and the `ToLocalTime` ban; they do not yet
|
||||
cover the new `*Utc`-suffixed `DateTime` properties on `AuditEvent` / `SiteCall`. Test
|
||||
coverage for the new types is uneven — `TrackedOperationId`, `SiteCallOperational`,
|
||||
`CachedCallTelemetry`, `SiteCallQueries`, `AuditQueryParamParsers`, `ApiKeyHasher`,
|
||||
`Notification`, and `SiteCall` are all directly tested; the Transport types
|
||||
(`BundleManifest`, `EncryptionMetadata`, `BundleSession`, `BundleSummary`, `ExportSelection`,
|
||||
`ImportPreview`, `ImportResolution`, `ImportResult`, `ManifestContentEntry`) have only
|
||||
integration-level coverage in `tests/ScadaLink.Transport.IntegrationTests/`, with no
|
||||
shape/serialization tests in `ScadaLink.Commons.Tests`.
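The missing UTC-kind check on `*Utc`-suffixed `DateTime` properties could be expressed with a small guard such as the one below. This is a sketch under assumptions: `UtcGuard` is a hypothetical helper, and where the guard would be applied (entity setters, repository boundary, or an architectural test) is not settled by this review.

```csharp
using System;

public static class UtcGuard
{
    // Rejects Local and Unspecified kinds rather than silently converting,
    // so a mis-kinded value is caught where it is produced.
    public static DateTime EnsureUtc(DateTime value, string paramName)
    {
        if (value.Kind != DateTimeKind.Utc)
        {
            throw new ArgumentException(
                $"Expected DateTimeKind.Utc but got {value.Kind}.", paramName);
        }

        return value;
    }
}
```

Throwing instead of calling `ToUniversalTime()` is a deliberate choice in this sketch: conversion of an `Unspecified` value assumes it was local time, which would hide the original defect.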
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -61,6 +97,21 @@ No Critical, High, or Medium issues were found.
|
||||
| 9 | Testing coverage | ✓ | `ValueFormatter`, `DynamicJsonElement`, `ScriptArgs`, `ManagementCommandRegistry`, `Result<T>`, `ConfigurationDiff`, `AlarmContext`, and the OPC UA serializer round-trip have no tests (Commons-010). |
|
||||
| 10 | Documentation & comments | ✓ | `OpcUaEndpointConfigSerializer.Deserialize` XML doc does not mention the silent data-loss path (Commons-005). `Component-Commons.md` is stale relative to the actual file set (Commons-009). `ValueFormatter` uses current-culture formatting without documenting it (Commons-012). |
|
||||
|
||||
## Checklist coverage — Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | `EncryptionMetadata` accepts any algorithm string + any iteration count with no validation (Commons-015). New `*Utc`-suffixed `DateTime` columns on `AuditEvent`/`SiteCall` have no `DateTimeKind.Utc` enforcement and are inconsistent with `Notification`'s `DateTimeOffset` (Commons-019). |
|
||||
| 2 | Akka.NET conventions | ✓ | Commons has no actors. All new message contracts (`Messages/Audit`, `Messages/Integration` extensions, `RouteToCallRequest`, `ScriptCallRequest`) are records with trailing optional members per REQ-COM-5a. Correlation IDs present on request/response messages. |
|
||||
| 3 | Concurrency & thread safety | ✓ | `IAuditCorrelationContext` documents its scoped/sequential thread-safety contract explicitly (good). `ExternalCallResult.Response` has a benign lazy-parse race — two concurrent reads can both parse and produce distinct wrappers (Commons-021). |
|
||||
| 4 | Error handling & resilience | ✓ | The new ingest/upsert command + reply pairs (`UpsertSiteCallReply`, `IngestAuditEventsReply`, `IngestCachedTelemetryReply`) carry idempotency-friendly accepted-id lists and an `Accepted` flag that explicitly does NOT propagate audit-write failure to the user-facing action (alog.md §13). |
|
||||
| 5 | Security | ✓ | `ApiKeyHasher` correctly fails-fast on missing / weak pepper (≥16 chars), uses HMAC-SHA256, never accepts a null plaintext, and provides a clearly-labelled `Default` for tests only. `ApiKey.FromHash` is the production constructor; the plaintext constructor only ever uses the unpeppered `Default` and is documented as such. No script-trust violations in any new file. |
|
||||
| 6 | Performance & resource management | ✓ | `IBundleSessionStore.EvictExpired` exists for sessions, which is good. `BundleSession` carries `DecryptedContent` plus `Manifest` per session; the size is bounded by the configured bundle cap, but there is no explicit per-session size accounting. The `ExternalCallResult.Response` lazy parse is not thread-safe (Commons-021). |
|
||||
| 7 | Design-document adherence | ✓ | `Component-Commons.md` is now significantly stale relative to the actual file set: stale enum values for `AuditKind`/`AuditStatus`, missing `AuditEvent`/`SiteCall` entities, missing `IAuditLogRepository`, missing six service interfaces and `Interfaces/Transport/`, missing four `Types/*` folders and `Messages/Audit/` (Commons-017). |
|
||||
| 8 | Code organization & conventions | ✓ | `IOperationTrackingStore` and `IPartitionMaintenance` live at the root of `Interfaces/` rather than under `Interfaces/Services/` (Commons-018). `BundleSession.Locked` uses a magic `3` rather than a named constant (Commons-016). Message contracts and entities otherwise follow the additive-evolution / POCO / `record` conventions. |
|
||||
| 9 | Testing coverage | ✓ | Transport types (`BundleManifest`, `EncryptionMetadata`, `BundleSession`, `BundleSummary`, `ExportSelection`, `ImportPreview`, `ImportResolution`, `ImportResult`, `ManifestContentEntry`) have no unit tests in `tests/ScadaLink.Commons.Tests/`; only `tests/ScadaLink.Transport.IntegrationTests/` exercises them (Commons-020). `IngestAuditEventsCommand` / `IngestCachedTelemetryCommand` / `UpsertSiteCallCommand` / `PullAuditEventsRequest` / `PullAuditEventsResponse` / `AuditTelemetryEnvelope` shape tests also absent. |
|
||||
| 10 | Documentation & comments | ✓ | `IAuditCorrelationContext` references `BundleImporter.ApplyAsync` — an implementation type Commons does not see, so the `<see cref>` is unresolvable (Commons-022b, folded into Commons-022). `ImportPreviewItem.FieldDiffJson` and `Notification.ResolvedTargets` are JSON-string columns with no documented shape contract (Commons-022). |
|
||||
|
||||
## Findings
|
||||
|
||||
### Commons-001 — `StaleTagMonitor` stale-fire race between timer and `OnValueReceived`
|
||||
@@ -674,3 +725,415 @@ describe the corrupt-typed-row branch. Regression tests added in
|
||||
`OpcUaEndpointConfigSerializerTests` (`Deserialize_TypedShapeWithInvalidEnum_ReportsMalformedNotLegacy`,
|
||||
`Deserialize_TypedShapeWithWrongTypeField_ReportsMalformedNotLegacy`,
|
||||
`Deserialize_ValidTypedRow_StillReportsTyped`).
|
||||
|
||||
### Commons-015 — `EncryptionMetadata` accepts any algorithm string and any iteration count
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/Transport/EncryptionMetadata.cs:3-8` |
|
||||
|
||||
**Description**
|
||||
|
||||
`EncryptionMetadata` is a positional record that carries the bundle's encryption parameters
|
||||
over the wire and into the persistence/audit layer:
|
||||
|
||||
```csharp
public sealed record EncryptionMetadata(
    string Algorithm, // "AES-256-GCM"
    string Kdf, // "PBKDF2-SHA256"
    int Iterations,
    string SaltB64,
    string IvB64);
```
|
||||
|
||||
The expected values are documented as inline comments only — there is no validation, no
|
||||
enum, and no constructor invariant. The consequences:
|
||||
|
||||
- A bundle manifest that says `Algorithm = "AES-128-CBC"` (or any garbage string) will
|
||||
deserialize successfully. The mismatch surfaces only when `BundleImporter` tries to
|
||||
decrypt, where it most likely manifests as a misleading exception (or a silent wrong-key
|
||||
result, depending on the implementation).
|
||||
- `Iterations` is unconstrained — `0`, negative, or absurdly large values round-trip. A
|
||||
zero/negative iteration count weakens the KDF and a billion-iteration count is a DoS
|
||||
vector against a passphrase-unlock attempt.
|
||||
- `SaltB64` / `IvB64` are just `string` — there is no length, format, or non-null check.
|
||||
A null or empty salt/IV silently rides through serialization and surfaces inside the
|
||||
cipher init.
|
||||
|
||||
`EncryptionMetadata` is the integrity contract for the bundle's encryption envelope and
|
||||
crosses both the file boundary (the on-disk bundle manifest) and the central audit log.
|
||||
The defense-in-depth principle says malformed values should be rejected at the type
|
||||
boundary, not at the cipher.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Validate in a static factory or constructor: reject unsupported `Algorithm`/`Kdf` (an
|
||||
enum or a small whitelist of strings), require `Iterations >= 100_000` (or whatever the
|
||||
documented PBKDF2 minimum is) and `<= 10_000_000`, require non-blank `SaltB64`/`IvB64`,
|
||||
and Base64-decode them at construction so a malformed encoding fails fast. Document the
|
||||
accepted values on the record.
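
A minimal sketch of the validating factory the recommendation describes, under stated
assumptions. `EncryptionMetadataSketch`, `Create`, and the bounds constants are illustrative
names, not the module's API. The whitelist assumes only the two values named in the inline
comments are supported, and the 100,000 floor is the placeholder the recommendation itself
uses. How a JSON deserializer would bind to the factory is not addressed here.

```csharp
using System;

// Hypothetical stand-in for EncryptionMetadata; names and bounds are assumptions.
public sealed record EncryptionMetadataSketch
{
    public const string SupportedAlgorithm = "AES-256-GCM";
    public const string SupportedKdf = "PBKDF2-SHA256";
    public const int MinIterations = 100_000;     // placeholder floor, not a confirmed policy
    public const int MaxIterations = 10_000_000;  // upper bound taken from the recommendation

    public string Algorithm { get; }
    public string Kdf { get; }
    public int Iterations { get; }
    public string SaltB64 { get; }
    public string IvB64 { get; }

    // Private so that every instance has passed through Create.
    private EncryptionMetadataSketch(
        string algorithm, string kdf, int iterations, string saltB64, string ivB64)
    {
        Algorithm = algorithm;
        Kdf = kdf;
        Iterations = iterations;
        SaltB64 = saltB64;
        IvB64 = ivB64;
    }

    public static EncryptionMetadataSketch Create(
        string algorithm, string kdf, int iterations, string saltB64, string ivB64)
    {
        if (!string.Equals(algorithm, SupportedAlgorithm, StringComparison.Ordinal))
            throw new ArgumentException($"Unsupported algorithm '{algorithm}'.", nameof(algorithm));
        if (!string.Equals(kdf, SupportedKdf, StringComparison.Ordinal))
            throw new ArgumentException($"Unsupported KDF '{kdf}'.", nameof(kdf));
        if (iterations < MinIterations || iterations > MaxIterations)
            throw new ArgumentOutOfRangeException(
                nameof(iterations), iterations, "Iteration count out of range.");
        RequireBase64(saltB64, nameof(saltB64));
        RequireBase64(ivB64, nameof(ivB64));
        return new EncryptionMetadataSketch(algorithm, kdf, iterations, saltB64, ivB64);
    }

    // Rejects null/blank input and anything that does not decode as Base64.
    private static void RequireBase64(string value, string paramName)
    {
        if (string.IsNullOrWhiteSpace(value))
            throw new ArgumentException("Value must be non-blank Base64.", paramName);
        try { _ = Convert.FromBase64String(value); }
        catch (FormatException ex)
        {
            throw new ArgumentException("Value is not valid Base64.", paramName, ex);
        }
    }
}
```

With this shape a manifest carrying `Algorithm = "AES-128-CBC"` fails at construction
rather than inside the cipher, which is the behaviour the finding asks for.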
|
||||
|
||||
### Commons-016 — `BundleSession.Locked` uses a magic `3` rather than a named constant
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/Transport/BundleSession.cs:13-16` |
|
||||
|
||||
**Description**
|
||||
|
||||
`BundleSession` exposes:
|
||||
|
||||
```csharp
public int FailedUnlockAttempts { get; set; }
public bool Locked => FailedUnlockAttempts >= 3;
```
|
||||
|
||||
The `3` is a magic number with no constant, no XML doc reference, and no symbol to
|
||||
search for if a future operator wants to change the threshold (or write a test that
|
||||
deliberately exercises the lockout). The XML comment on `Locked` repeats the literal
|
||||
("three or more unlock attempts have failed") rather than citing a constant, so a
|
||||
change to the threshold would have to be made in three places (the comparison, the XML
|
||||
text, and any caller-side `attempts < 3` checks). The lockout count is also a
|
||||
security-relevant policy parameter — it deserves a named symbol so a security review
|
||||
can find it.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Promote the threshold to a `public const int MaxUnlockAttempts = 3;` on `BundleSession`
|
||||
(or to the `IBundleSessionStore`/`BundleImporter` if that is the better home), and rewrite
|
||||
the `Locked` expression and the XML comment in terms of it. If the threshold is actually
|
||||
owned by a Transport-component option, document the link.
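
The promotion could look roughly like the sketch below. `BundleSessionSketch` is a
hypothetical reduction of `BundleSession` to the two members quoted above; only
`MaxUnlockAttempts` is new, and its placement on the session type is one of the options
the recommendation lists, not a settled decision.

```csharp
// Hypothetical reduction of BundleSession; only the lockout members are shown.
public sealed class BundleSessionSketch
{
    /// <summary>Number of failed unlock attempts at which the session locks.</summary>
    public const int MaxUnlockAttempts = 3;

    public int FailedUnlockAttempts { get; set; }

    /// <summary>
    /// True once <see cref="MaxUnlockAttempts"/> or more unlock attempts have failed.
    /// </summary>
    public bool Locked => FailedUnlockAttempts >= MaxUnlockAttempts;
}
```

Callers and tests can then compare against `BundleSessionSketch.MaxUnlockAttempts` instead
of repeating the literal, so the threshold has a single searchable definition.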
|
||||
|
||||
### Commons-017 — `Component-Commons.md` is significantly stale (audit enums, new entities, new repositories, new service interfaces, new folders)
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-Commons.md:41-44`, `:75-79`, `:88-95`, `:107-117`, `:152-232` |
|
||||
|
||||
**Description**
|
||||
|
||||
The Commons design doc has fallen materially behind the code:
|
||||
|
||||
- **REQ-COM-1 audit enums** — the doc's `AuditKind` enum lists
|
||||
`SyncCall, CachedEnqueued, CachedAttempt, CachedTerminal, SyncWrite, SyncRead, Enqueued,
|
||||
Attempt, Terminal, Completed`; the actual enum in `Types/Enums/AuditKind.cs` has
|
||||
*completely different* values: `ApiCall, ApiCallCached, DbWrite, DbWriteCached, NotifySend,
|
||||
NotifyDeliver, InboundRequest, InboundAuthFailure, CachedSubmit, CachedResolve`.
|
||||
Likewise `AuditStatus` — doc says `Success, TransientFailure, PermanentFailure, Enqueued,
|
||||
Retrying, Delivered, Parked, Discarded`; actual values are `Submitted, Forwarded,
|
||||
Attempted, Delivered, Failed, Parked, Discarded, Skipped`. The doc's enum names cannot
|
||||
be matched to the code at all.
|
||||
- **REQ-COM-3 entities** — the Audit bullet still lists only `AuditLogEntry`; the
|
||||
actual `Entities/Audit/` folder now contains `AuditEvent` and `SiteCall` as well, and
|
||||
both carry significant additional columns (`SourceNode`, `ExecutionId`,
|
||||
`ParentExecutionId`) that are core to the M3-M7 work and entirely absent from the doc.
|
||||
- **REQ-COM-4 repositories** — `IAuditLogRepository` is in the code (with its
|
||||
`InsertIfNotExistsAsync`, `QueryAsync`, `SwitchOutPartitionAsync`,
|
||||
`GetPartitionBoundariesOlderThanAsync`, `GetKpiSnapshotAsync`, `GetExecutionTreeAsync`,
|
||||
`GetDistinctSourceNodesAsync` surface) but missing from the REQ-COM-4 list.
|
||||
- **REQ-COM-4a services** — the doc lists seven service interfaces. The code adds
|
||||
`ICachedCallLifecycleObserver`, `ICachedCallTelemetryForwarder`, `INodeIdentityProvider`,
|
||||
`ISiteAuditQueue`, plus the misplaced `IOperationTrackingStore` and `IPartitionMaintenance`
|
||||
(see Commons-018), and the `Interfaces/Transport/` folder with four more interfaces
|
||||
(`IBundleExporter`, `IBundleImporter`, `IBundleSessionStore`, `IAuditCorrelationContext`)
|
||||
— none of which appear in REQ-COM-4a.
|
||||
- **REQ-COM-5b folder tree** — missing: `Types/Audit/` (`AuditLogPaging`,
|
||||
`AuditLogQueryFilter`, `AuditQueryParamParsers`, `ExecutionTreeNode`,
|
||||
`SiteCallKpiSnapshot`, `SiteCallPaging`, `SiteCallQueryFilter`,
|
||||
`SiteCallSiteKpiSnapshot`), `Types/Notifications/` (`NotificationKpiSnapshot`,
|
||||
`NotificationOutboxFilter`, `SiteNotificationKpiSnapshot`), `Types/InboundApi/`
|
||||
(`ApiKeyHasher`, `ParameterDefinition`), `Types/Transport/` (nine records),
|
||||
`Messages/Audit/` (seven new message files), `Interfaces/Transport/` (four
|
||||
interfaces), plus the new `AuditLogKpiSnapshot`, `SiteAuditBacklogSnapshot`,
|
||||
`SiteCallOperational`, `TrackingStatusSnapshot` directly under `Types/`.
|
||||
|
||||
CLAUDE.md's editing rules state design docs and code must travel together. The doc is now
|
||||
much less useful as a map of the actual file set than after the previous (Commons-009)
|
||||
refresh.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Refresh `Component-Commons.md` against the current file set: rewrite the `AuditKind` /
|
||||
`AuditStatus` enum value lists to match the code, add `AuditEvent` and `SiteCall` to
|
||||
REQ-COM-3, add `IAuditLogRepository` to REQ-COM-4, expand REQ-COM-4a with the new service
|
||||
interfaces (and add a sentence on the Transport interfaces in `Interfaces/Transport/`),
|
||||
and rewrite the REQ-COM-5b folder tree to include the new `Types/*`, `Messages/Audit`,
|
||||
and `Interfaces/Transport` folders. The same kind of refresh that resolved Commons-009 is
|
||||
needed again now.
|
||||
|
||||
### Commons-018 — `IOperationTrackingStore` and `IPartitionMaintenance` are at the root of `Interfaces/` instead of `Interfaces/Services/`
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.Commons/Interfaces/IOperationTrackingStore.cs`, `src/ScadaLink.Commons/Interfaces/IPartitionMaintenance.cs` |
|
||||
|
||||
**Description**
|
||||
|
||||
REQ-COM-5b documents the `Interfaces/` folder as having exactly three sub-folders:
|
||||
`Protocol/` (REQ-COM-2), `Repositories/` (REQ-COM-4), and `Services/` (REQ-COM-4a). Two
|
||||
new interfaces — `IOperationTrackingStore` and `IPartitionMaintenance` — are filed at
|
||||
the root of `Interfaces/` (namespace `ScadaLink.Commons.Interfaces`) rather than under
|
||||
`Interfaces/Services/` (namespace `ScadaLink.Commons.Interfaces.Services`). They are
|
||||
straightforward cross-cutting service interfaces consumed by the Audit Log component (a
|
||||
site-local SQLite tracking store; a central partition-maintenance hosted-service helper)
|
||||
and conceptually belong alongside `ISiteAuditQueue`, `ICachedCallLifecycleObserver`, etc.
|
||||
The inconsistency is small but it breaks the "every interface lives under a sub-folder"
|
||||
rule REQ-COM-5b establishes, and it makes the namespace surface inconsistent — every
|
||||
other recently-added service interface uses `Interfaces.Services`.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Move both files into `Interfaces/Services/` and adjust the namespace to
|
||||
`ScadaLink.Commons.Interfaces.Services`. Update consumers in `ScadaLink.AuditLog`,
|
||||
`ScadaLink.SiteRuntime`, and `ScadaLink.ConfigurationDatabase`. Add them to the
|
||||
REQ-COM-4a list (see Commons-017).
|
||||
|
||||
### Commons-019 — New `*Utc`-suffixed `DateTime` columns on `AuditEvent` / `SiteCall` are not enforced as UTC; inconsistent with `Notification`'s `DateTimeOffset`
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Commons/Entities/Audit/AuditEvent.cs:15-18`, `src/ScadaLink.Commons/Entities/Audit/SiteCall.cs:59-68`, `tests/ScadaLink.Commons.Tests/Entities/EntityConventionTests.cs:49-69` |
|
||||
|
||||
**Description**
|
||||
|
||||
CLAUDE.md mandates UTC throughout the system, "DateTime with DateTimeKind.Utc *or*
|
||||
DateTimeOffset". The pre-existing convention in Commons entities is `DateTimeOffset`,
|
||||
and the architectural test `AllTimestampProperties_ShouldBeDateTimeOffset` enforces it
|
||||
on a name-allowlist (`Timestamp`, `DeployedAt`, `CompletedAt`, `GeneratedAt`,
|
||||
`ReportTimestamp`, `SnapshotTimestamp`). The new audit entities deviate:
|
||||
|
||||
- `AuditEvent.OccurredAtUtc` and `IngestedAtUtc` — `DateTime` (nullable on the second).
|
||||
- `SiteCall.CreatedAtUtc`, `UpdatedAtUtc`, `TerminalAtUtc`, `IngestedAtUtc` — `DateTime`.
|
||||
|
||||
The `Notification` entity in the same domain uses `DateTimeOffset` for every timestamp
|
||||
(`SiteEnqueuedAt`, `CreatedAt`, `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt`). The
|
||||
architectural test does not catch the `*Utc` columns because those property names are not
|
||||
on the allowlist. Concretely:
|
||||
|
||||
- Nothing prevents a producer from assigning `DateTime.Now` (kind = `Local`) or
  `new DateTime(2026,1,1)` (kind = `Unspecified`) to an `OccurredAtUtc` column.
  `System.Text.Json` writes the value as given (an offset for `Local`, no suffix for
  `Unspecified`) and never normalises it to UTC, and a value read back from a
  `datetime2` column typically arrives as `Unspecified`. The `Utc` suffix is
  convention-only.
|
||||
- Comparison across the boundary is now ambiguous — the central `AuditLog.OccurredAtUtc`
|
||||
and the central `Notifications.CreatedAt` are different CLR types, with `DateTimeOffset`
|
||||
carrying an explicit offset and `DateTime` not.
|
||||
- The repository query filters (`AuditLogQueryFilter.FromUtc`/`ToUtc`,
|
||||
`SiteCallQueryFilter.FromUtc`/`ToUtc`) also use bare `DateTime`. A caller building one
|
||||
from `DateTime.UtcNow.AddHours(-1)` is fine; a caller using `DateTimeOffset.UtcNow.DateTime`
|
||||
is fine; a caller using `DateTime.Now` is silently wrong.
|
||||
|
||||
This is the same defect the architectural test was designed to catch on the
|
||||
`DateTimeOffset` side — the test just doesn't cover the new column-naming convention.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Pick a single rule:
|
||||
|
||||
1. Convert the audit entities to `DateTimeOffset` to match every other Commons entity
|
||||
and the architectural-test allowlist (largest blast radius — touches gRPC proto
|
||||
types, EF mappings, SQL schemas, query filters).
|
||||
2. Keep `DateTime` for audit but extend `EntityConventionTests` to recognise the `*Utc`
|
||||
property-name pattern and assert (a) it is `DateTime` (not `DateTimeOffset`) and
|
||||
(b) any constant-default has `DateTimeKind.Utc`. Add a runtime assertion at the
|
||||
write boundary (`SqliteAuditWriter.WriteAsync`, the central upsert) that the
|
||||
incoming `Kind == DateTimeKind.Utc` and reject otherwise.
|
||||
|
||||
Option 2 is the smaller change and is consistent with how `AuditLog` rows are stored in
|
||||
SQL Server (`datetime2`, no offset). Either way the inconsistency with `Notification`
|
||||
should be documented in REQ-COM-1 as a deliberate choice.
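
Option 2 can be sketched as two small helpers: a runtime guard for the write boundary and
a reflection scan an architectural test could assert on. `UtcConventionSketch` and both
method names are hypothetical; where the guard is called from (`SqliteAuditWriter.WriteAsync`,
the central upsert) and how `EntityConventionTests` would consume the scan are left open.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical helpers; not part of the reviewed codebase.
public static class UtcConventionSketch
{
    // Runtime guard: rejects Local and Unspecified values at the write boundary.
    public static DateTime RequireUtc(DateTime value, string paramName)
    {
        if (value.Kind != DateTimeKind.Utc)
            throw new ArgumentException(
                $"Expected DateTimeKind.Utc but got {value.Kind}.", paramName);
        return value;
    }

    // Convention scan: names of public "*Utc" properties that are not DateTime or DateTime?.
    public static IReadOnlyList<string> FindNonDateTimeUtcProperties(Type entityType) =>
        entityType.GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => p.Name.EndsWith("Utc", StringComparison.Ordinal))
            .Where(p => (Nullable.GetUnderlyingType(p.PropertyType) ?? p.PropertyType)
                        != typeof(DateTime))
            .Select(p => p.Name)
            .ToList();
}
```

The guard catches what a type-level test cannot: a `DateTime` property has the right CLR
type even when a producer assigns `DateTime.Now` to it.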
|
||||
|
||||
### Commons-020 — Transport types and new Audit-message types have no unit tests in `ScadaLink.Commons.Tests`
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.Commons.Tests/` |
|
||||
|
||||
**Description**
|
||||
|
||||
The Transport (#24) work adds nine records under `Types/Transport/` (`BundleManifest`,
|
||||
`EncryptionMetadata`, `BundleSession`, `BundleSummary`, `ExportSelection`,
|
||||
`ImportPreview` + `ImportPreviewItem`, `ImportResolution`, `ImportResult`,
|
||||
`ManifestContentEntry`) and four interfaces under `Interfaces/Transport/`. None of them
|
||||
have a focused test file in `tests/ScadaLink.Commons.Tests/` — coverage is entirely
|
||||
inside `tests/ScadaLink.Transport.IntegrationTests/`, which exercises the
|
||||
end-to-end exporter/importer flow but does not pin the Commons-level wire contracts.
|
||||
|
||||
Similarly, the new `Messages/Audit/` folder (`IngestAuditEventsCommand`/`Reply`,
|
||||
`IngestCachedTelemetryCommand`/`Reply`, `UpsertSiteCallCommand`/`Reply`,
|
||||
`SiteCallRelayMessages`) and the `Messages/Integration/` additions
|
||||
(`AuditTelemetryEnvelope`, `PullAuditEventsRequest`/`Response`) have no
|
||||
serialization-shape tests in Commons. The existing `MessageConventionTests`,
|
||||
`CompatibilityTests`, `ConnectionBindingSerializationTests`, and
|
||||
`SiteCallQueriesTests` cover some but not all of the new traffic — `PullAuditEvents`
|
||||
and `AuditTelemetryEnvelope` in particular cross the site→central version-skew
|
||||
boundary that REQ-COM-5a is designed to enforce, so a JSON round-trip + named-property
|
||||
assertion is the minimum protection against a future positional/tuple slip.
|
||||
|
||||
This is the same pattern as Commons-010 — behavior-bearing types with no Commons-level
|
||||
test coverage, where the integration suite cannot catch a Commons-only contract
|
||||
regression.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add focused tests in `tests/ScadaLink.Commons.Tests/Types/Transport/` (round-trip
|
||||
serialization for each Transport record, named JSON property assertions for
|
||||
`EncryptionMetadata` / `BundleManifest`, the `BundleSession.Locked` threshold —
|
||||
see Commons-016, the `ConflictKind`/`ResolutionAction` enum coverage), and in
|
||||
`tests/ScadaLink.Commons.Tests/Messages/Audit/` (round-trip + named-property assertions
|
||||
for the seven new message files). Prioritise the contracts that cross the site→central
|
||||
boundary (`AuditTelemetryEnvelope`, `PullAuditEventsRequest`/`Response`,
|
||||
`IngestCachedTelemetryCommand`).
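
A round-trip plus named-property check of the kind recommended above might look like the
following sketch. `PullRequestSketch` and its members are invented for illustration and do
not mirror the real `PullAuditEventsRequest`; the check is written as a plain method that
throws, where the real suite would presumably use its own test framework's assertions.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical message contract; member names are assumptions.
public sealed record PullRequestSketch(string SiteId, long AfterSequence, string? SourceNode = null);

public static class ContractCheckSketch
{
    public static void RoundTripPreservesNamedProperties()
    {
        var original = new PullRequestSketch("site-a", 42);
        string json = JsonSerializer.Serialize(original);

        // The wire shape must be an object with named members, not a positional array.
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement root = doc.RootElement;
            if (root.ValueKind != JsonValueKind.Object)
                throw new InvalidOperationException("Expected a JSON object.");
            foreach (string name in new[] { "SiteId", "AfterSequence", "SourceNode" })
            {
                if (!root.TryGetProperty(name, out _))
                    throw new InvalidOperationException($"Missing property '{name}'.");
            }
        }

        // Record value equality makes the round-trip comparison a single call.
        PullRequestSketch? copy = JsonSerializer.Deserialize<PullRequestSketch>(json);
        if (!original.Equals(copy))
            throw new InvalidOperationException("Round-trip changed the value.");
    }
}
```

Pinning the property names is the part the integration suite does not cover: a rename or
a switch to a tuple-shaped payload would fail here even if exporter and importer still
agree with each other.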
|
||||
|
||||
### Commons-021 — `ExternalCallResult.Response` has a benign lazy-parse race
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.Commons/Interfaces/Services/IExternalSystemClient.cs:91-104` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ExternalCallResult` is a `record` returned to scripts after an outbound HTTP call. The
|
||||
`Response` property lazily parses `ResponseJson` into a `DynamicJsonElement`:
|
||||
|
||||
```csharp
public dynamic? Response
{
    get
    {
        if (!_responseParsed)
        {
            _response = string.IsNullOrEmpty(ResponseJson)
                ? null
                : new DynamicJsonElement(System.Text.Json.JsonDocument.Parse(ResponseJson).RootElement);
            _responseParsed = true;
        }
        return _response;
    }
}
```
|
||||
|
||||
`_response` and `_responseParsed` are plain mutable fields on a `record` that the
|
||||
language otherwise treats as immutable. Two threads reading `Response` simultaneously
|
||||
can both see `_responseParsed == false`, both call `JsonDocument.Parse`, and produce
|
||||
two distinct `DynamicJsonElement` wrappers — the second write wins, and any reference
|
||||
the loser thread already held becomes inconsistent with the winner. The race is benign
|
||||
in the current usage (scripts get the result on one thread and use it on that thread),
|
||||
and `DynamicJsonElement` after Commons-002 clones the underlying `JsonElement`, so the
|
||||
duplicate parses do not even leak document handles. But the pattern is fragile — a
|
||||
future caller that hands the result to a background continuation or `Task.WhenAll` would
|
||||
introduce a real correctness gap, and the laziness is implicit in `record` semantics
|
||||
that otherwise suggest immutability.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Use `Lazy<dynamic?>` initialised in the property (with `LazyThreadSafetyMode.ExecutionAndPublication`,
|
||||
the default) and drop the mutable backing fields, or replace the property with a method
|
||||
named `ParseResponse()` so the laziness is explicit and the caller knows to call it once
|
||||
and cache. Either way, the change is local and preserves the existing `record`-equality
|
||||
behavior.
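
The `Lazy<T>` variant could be sketched as below. This is an illustration under
assumptions, not the module's code: `ExternalCallResultSketch` is a plain class rather
than a `record`, and it exposes a cloned `JsonElement?` instead of the `DynamicJsonElement`
wrapper, because neither the wrapper nor the record's other members are visible here. One
point worth checking before adopting it on the real type: a `Lazy<T>` field declared on a
`record` takes part in the compiler-generated equality and compares by reference.

```csharp
#nullable enable
using System;
using System.Text.Json;
using System.Threading;

// Hypothetical stand-in for ExternalCallResult; not the real type.
public sealed class ExternalCallResultSketch
{
    private readonly Lazy<JsonElement?> _response;

    public ExternalCallResultSketch(string? responseJson)
    {
        ResponseJson = responseJson;
        // ExecutionAndPublication: the factory runs at most once, and every
        // concurrent reader observes the same published value.
        _response = new Lazy<JsonElement?>(
            () => ParseOrNull(responseJson),
            LazyThreadSafetyMode.ExecutionAndPublication);
    }

    public string? ResponseJson { get; }

    public JsonElement? Response => _response.Value;

    private static JsonElement? ParseOrNull(string? json)
    {
        if (string.IsNullOrEmpty(json)) return null;
        using JsonDocument doc = JsonDocument.Parse(json);
        // Clone so the element stays valid after the document is disposed.
        return doc.RootElement.Clone();
    }
}
```

Because the value is published once, two threads reading `Response` can no longer end up
holding different parsed instances.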
|
||||
|
||||
### Commons-022 — `IAuditCorrelationContext` references an unresolvable `BundleImporter.ApplyAsync` cref; JSON-blob columns have no documented shape
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.Commons/Interfaces/Transport/IAuditCorrelationContext.cs:11`, `src/ScadaLink.Commons/Types/Transport/ImportPreview.cs:11`, `src/ScadaLink.Commons/Entities/Notifications/Notification.cs:33` |
|
||||
|
||||
**Description**
|
||||
|
||||
Two related XML-doc weaknesses, both around the new Transport / Audit surface:
|
||||
|
||||
1. `IAuditCorrelationContext`'s remarks say
|
||||
`<see cref="BundleImporter.ApplyAsync"/>`. `BundleImporter` is the concrete
|
||||
implementation in `ScadaLink.Transport.Import`, which Commons does not (and must
|
||||
not) reference. The cref is unresolvable from Commons and will surface as a
|
||||
build-time XML doc warning. The correct reference is the interface method
|
||||
`IBundleImporter.ApplyAsync`.
|
||||
|
||||
2. Two JSON-string columns flow across components without a documented shape:
|
||||
- `ImportPreviewItem.FieldDiffJson` — described only as "string?" with no remarks on
|
||||
who produces it, who reads it, or what shape it carries. The Central UI renders it,
|
||||
so a drift between producer and renderer is a silent UI regression.
|
||||
- `Notification.ResolvedTargets` — described as "Resolved delivery targets snapshotted
|
||||
at delivery time, for audit" but the shape (newline-separated emails? a JSON array?
|
||||
comma-separated?) is undocumented. Audit consumers and the Central UI both read
|
||||
this field.
|
||||
|
||||
Both are wire/persistence-format strings; an undocumented schema invites the same
|
||||
kind of producer/consumer drift the `ValueTuple` finding in Commons-008 surfaced for
|
||||
the typed messages.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
- Fix the `<see cref>` in `IAuditCorrelationContext` to point at `IBundleImporter.ApplyAsync`.
|
||||
- Add a remarks block to `ImportPreviewItem.FieldDiffJson` describing the JSON shape
|
||||
(e.g. "a JSON object keyed by field name with `{ existing, incoming }` values") or, if
|
||||
the shape is meant to be opaque to the wire, document that explicitly.
|
||||
- Add a remarks block to `Notification.ResolvedTargets` documenting the format.
|
||||
- Consider replacing both with strong-typed Commons records — `ResolvedTargets` could be
|
||||
`IReadOnlyList<string>` serialised via EF value converter, and `FieldDiffJson` could
|
||||
be a `FieldDiff` record. That is a larger change and is left as a follow-up.
|
||||
|
||||
### Commons-023 — Trailing-optional `SourceNode` on positional records mixes additive evolution patterns
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.Commons/Messages/Audit/SiteCallQueries.cs:53-66`, `:110-123`, `src/ScadaLink.Commons/Messages/Notification/NotificationOutboxQueries.cs:26-39`, `:104-123`, `src/ScadaLink.Commons/Types/SiteCallOperational.cs:42-54`, `src/ScadaLink.Commons/Types/TrackingStatusSnapshot.cs:33-46` |
|
||||
|
||||
**Description**
|
||||
|
||||
The `SourceNode` rollout adds an optional trailing parameter to a long list of positional
|
||||
records. Two minor patterns emerge that are worth flagging:
|
||||
|
||||
- `SiteCallSummary` (twelve required positional members plus an optional 13th
|
||||
`SourceNode = null`) — and the parallel `NotificationSummary` (ten required + optional
|
||||
`SourceNode = null`) — both push the optional past a `bool IsStuck` flag. A consumer
|
||||
reading the positional signature is now mixing required and optional members. The
|
||||
record otherwise works correctly because every consumer constructs it via named
|
||||
arguments, but a positional constructor call (which the language allows) would silently
|
||||
miss the new field.
|
||||
- `TrackingStatusSnapshot` declares `string? SourceNode` without a `= null` default, so the
  nullable member is still required at every construction site. `SiteCallOperational`
  also omits the default (though it is purely positional), while `SiteCallSummary` and
  `NotificationSummary` supply one. Mixing "nullable with default" and "nullable without
  default" across the same domain is technically fine, but it is the kind of
  inconsistency that bites a future additive field.
|
||||
|
||||
Neither pattern is a defect today — every consumer is updated, and JSON serialization
|
||||
treats nullable-without-default the same as nullable-with-default. But the conventions
|
||||
across the Audit / Notifications message surface have drifted enough that REQ-COM-5a's
|
||||
"additive-only" rule deserves a one-paragraph clarification: do new optional fields take
|
||||
a `= null` default, or not? The current code is mixed.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a one-paragraph "How to add a field" sub-section to REQ-COM-5a stating: new optional
|
||||
fields on positional records MUST be added at the end of the parameter list AND MUST
|
||||
carry a `= null` (or other safe) default value, so existing positional construction
|
||||
sites keep compiling. Apply that rule retroactively to `TrackingStatusSnapshot` and any
|
||||
other recent record that did not adopt it. No behavioral change required.
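
The effect of the proposed rule can be shown on a deliberately small record. `SummarySketch`
is hypothetical (the real `SiteCallSummary` has many more members); the point is only that
a trailing member with a default leaves older positional call sites compiling.

```csharp
#nullable enable

// Hypothetical record; SourceNode is the member added last, with a safe default.
public sealed record SummarySketch(string Id, bool IsStuck, string? SourceNode = null);

public static class SummarySketchCallSites
{
    // Written before SourceNode existed; still compiles because the new member has a default.
    public static SummarySketch Legacy() => new SummarySketch("op-1", false);

    // Updated call site passes the new member by name.
    public static SummarySketch Updated() => new SummarySketch("op-1", false, SourceNode: "node-a");
}
```

Without the `= null` default, `Legacy()` would stop compiling as soon as the member was
appended, which is the break the rule is meant to prevent.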
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.Communication` |
|
||||
| Design doc | `docs/requirements/Component-Communication.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 7 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -42,6 +42,47 @@ gRPC-supplied `correlation_id` flows straight into an Akka actor name
|
||||
(Communication-014), and the factory's endpoint-reuse defect is masked by the test
|
||||
mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
All prior findings (Communication-001..015) remain `Resolved` in this commit. The
|
||||
re-review walked all 10 checklist categories again on the surface that has not
|
||||
been re-examined before — the central↔site command/control routing surface
|
||||
(`CentralCommunicationActor`, `SiteCommunicationActor`) rather than the
|
||||
previously-mined gRPC streaming surface — and uncovered a cluster of defects
|
||||
around the connection-state-change workflow. The single material finding is
|
||||
**`HandleConnectionStateChanged` is dead code**: no production code path emits
|
||||
`ConnectionStateChanged`, so the documented "kill active debug streams for the
|
||||
disconnected site" + "mark in-progress deployments as failed" workflow never
|
||||
fires at runtime (Communication-016). The downstream consequence is
|
||||
**`_inProgressDeployments` grows unboundedly** — entries are inserted on every
|
||||
deployment but only cleaned via that dead path (Communication-017). Five
|
||||
smaller items round out the re-review: site heartbeats hard-code
|
||||
`IsActive: true` regardless of node role (Communication-018), the
|
||||
60-second-periodic `LoadSiteAddressesFromDb` task has no CancellationToken so a
|
||||
hung DB query has no upper bound (Communication-019), the
|
||||
`SiteAddressCacheLoaded` internal message carries a mutable
|
||||
`Dictionary`/`List` (Communication-020), `SiteStreamGrpcServer.SubscribeInstance`
|
||||
leaks the StreamRelayActor if `_streamSubscriber.Subscribe` throws between
|
||||
`ActorOf` and the `try` block (Communication-021), and `_debugSubscriptions`
|
||||
keyed by caller-supplied `CorrelationId` could orphan a subscriber on ID reuse
|
||||
(Communication-022). Seven new findings, all Open: one High, one Medium, five
|
||||
Low.
|
||||
|
||||
## Checklist coverage 2026-05-28
|
||||
|
||||
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `HandleConnectionStateChanged` and its `_inProgressDeployments` / `_debugSubscriptions` cleanup never fire; the connection-state workflow is dead (Communication-016, Communication-017). `_debugSubscriptions` correlation-ID overwrite risk (Communication-022). |
| 2 | Akka.NET conventions | ✓ | `SiteAddressCacheLoaded` carries mutable `Dictionary<string, List<string>>`, which violates the message-immutability convention (Communication-020). `Forward`/`PipeTo`/Sender-capture all clean. |
| 3 | Concurrency & thread safety | ✓ | All mutable state mutated on the actor thread. `_subscriptions` ConcurrentDictionary use disciplined. No new issues. |
| 4 | Error handling & resilience | ✓ | `LoadSiteAddressesFromDb` lacks a `CancellationToken` propagation point (Communication-019). `SubscribeInstance` leaks the relay actor if `Subscribe` throws pre-try (Communication-021). |
| 5 | Security | ✓ | Correlation-id validation in place (Communication-014). No new issues. |
| 6 | Performance & resource management | ✓ | `_inProgressDeployments` grows unboundedly (Communication-017). gRPC client/server lifecycles otherwise clean. |
| 7 | Design-document adherence | ✓ | `ConnectionStateChanged` handler is dead code; the doc-stated "kill streams on disconnect, fail in-progress deployments" workflow does not actually run (Communication-016). Site heartbeats always report `IsActive: true` regardless of role (Communication-018). |
| 8 | Code organization & conventions | ✓ | Options pattern correct; mapper placement and proto evolution are additive-only. No new issues. |
| 9 | Testing coverage | ✓ | `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled` exercises a code path that no production caller ever drives, which gives false confidence (related to Communication-016). |
| 10 | Documentation & comments | ✓ | Detailed XML docs added in this commit. No new issues. |
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -726,3 +767,294 @@ gained `On_GrpcError_Reconnects_To_Other_Node_Endpoint`, which uses a new
|
||||
per endpoint (instead of one fixed mock regardless of endpoint), so the bridge actor's
|
||||
NodeA→NodeB reconnect is now verified to actually target the NodeB endpoint rather
|
||||
than being masked by an endpoint-agnostic mock.
|
||||
|
||||
### Communication-016 — `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:169`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:338-375` |
|
||||
|
||||
**Description**
|
||||
|
||||
`CentralCommunicationActor.HandleConnectionStateChanged` is wired to
|
||||
`Receive<ConnectionStateChanged>` and implements two important workflows on
|
||||
`IsConnected == false`: (1) kill every active debug stream for the disconnected
|
||||
site (`_debugSubscriptions` walk → `DebugStreamTerminated` Tell to each
|
||||
subscriber); (2) mark every in-progress deployment for that site as failed
|
||||
(`_inProgressDeployments` walk → entry removal). Both are documented in the
|
||||
component design doc's "Connection Failure Behavior" section and in WP-5 of the
|
||||
work plan referenced in the class's own XML doc comment.
|
||||
|
||||
A repo-wide search (`grep -rn ConnectionStateChanged src/ tests/`) shows **no
|
||||
production code ever emits `ConnectionStateChanged`**. The only producers are
|
||||
the unit test `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
|
||||
(line 137) and the Commons message-roundtrip test. The
|
||||
`CentralCommunicationActor` therefore never receives one in production, the
|
||||
disconnect-cleanup workflow never fires, and `_debugSubscriptions` /
|
||||
`_inProgressDeployments` are never pruned via this path.
|
||||
|
||||
Concrete consequences:
|
||||
- A site goes down → its active debug streams do **not** get a synchronous
|
||||
`DebugStreamTerminated` notification from central. The bridge actor must
|
||||
detect the disconnect itself via gRPC keepalive timing out (~25s) or TCP RST.
|
||||
Subscribers wait that long for the `OnStreamTerminated` callback instead of
|
||||
the documented "immediately killed by central" behaviour.
|
||||
- In-progress deployments to a disconnected site continue to occupy the
|
||||
Ask-reply window and only fail when the Ask times out at the
|
||||
`CommunicationService.DeployInstanceAsync` layer (120s). They are never
|
||||
proactively marked failed.
|
||||
- The unit test gives a strong false impression that the workflow works — it
|
||||
exercises a code path that has no production caller.
|
||||
|
||||
The design doc and CLAUDE.md mention "ClusterClient handles failover between
|
||||
NodeA and NodeB internally — there is no application-level NodeA preference /
|
||||
NodeB fallback logic" — so the ClusterClient mechanism is the documented
|
||||
failover transport. But that says nothing about *signalling* a fully-down
|
||||
remote cluster to central's coordinator actor, which is exactly what
|
||||
`ConnectionStateChanged` was meant to do.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Pick one of:
|
||||
- Wire a producer for `ConnectionStateChanged` — e.g. subscribe to
|
||||
`ClusterClient`'s contact-point/cluster events (`ClusterClient.ContactPoints`
|
||||
Refresh / `ContactPointAdded` / `ContactPointRemoved`) or watch the
|
||||
ClusterClient actor for a "no contact points reachable" state — and have it
|
||||
publish `ConnectionStateChanged` to `Self` on each transition.
|
||||
- If the documented "synchronously kill streams on disconnect" behaviour is
|
||||
intentionally being dropped in favour of the slower keepalive-based
|
||||
detection, delete the handler, the `ConnectionStateChanged` record, and the
|
||||
related `_debugSubscriptions` / `_inProgressDeployments` tracking, then
|
||||
update the design doc's "Connection Failure Behavior" section accordingly.
|
||||
|
||||
Either way, replace `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
|
||||
— at present it asserts a behaviour that no production code triggers.
|
||||
|
||||
---
|
||||
|
||||
### Communication-017 — `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:73`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:501`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:357-367` |
|
||||
|
||||
**Description**
|
||||
|
||||
`TrackMessageForCleanup` inserts `_inProgressDeployments[deploy.DeploymentId] =
|
||||
envelope.SiteId` on every `DeployInstanceCommand` routed to a site (line 501).
|
||||
The only places that *remove* from `_inProgressDeployments` are:
|
||||
- `HandleConnectionStateChanged` on `IsConnected == false` (line 366) — which
|
||||
per Communication-016 never fires in production.
|
||||
- `PostStop` (line 553) — only on actor death (central failover).
|
||||
|
||||
There is **no removal on the normal happy path** — neither when the site replies
|
||||
`DeploymentStatusResponse` (the reply goes to the Ask's temporary reply actor,
|
||||
not back through `CentralCommunicationActor`), nor on Ask timeout. Every
|
||||
successful or failed deployment leaves its entry behind for the lifetime of the
|
||||
process.
|
||||
|
||||
Memory impact is modest (each entry is ~70-100 bytes), but the dictionary grows
|
||||
monotonically. Over months of operation across all sites a central node could
|
||||
accumulate tens of thousands of entries — a real, observable leak. More
|
||||
seriously, the field is *also* the source-of-truth set the
|
||||
`HandleConnectionStateChanged` walk uses to fail in-progress deployments, so
|
||||
even if a `ConnectionStateChanged` *were* fired today, the walk would
|
||||
"fail" thousands of already-completed deployments and Tell their (now stale)
|
||||
correlation-IDs into the void.
|
||||
|
||||
`_debugSubscriptions` (line 67) shares the same shape — but a normal debug
|
||||
session ends with an `UnsubscribeDebugViewRequest` that *does* drive cleanup
|
||||
(line 497), so leaks are only realised when a consumer crashes without
|
||||
unsubscribing.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either remove `_inProgressDeployments` entirely (it has no other consumer once
|
||||
Communication-016 is fixed by deletion) or, if the disconnect-cleanup workflow
|
||||
is retained, add a removal hook on the reply path. The simplest fix is to
|
||||
subscribe `CentralCommunicationActor` to the Ask reply: route
|
||||
`DeployInstanceCommand` through the actor with the actor as the Ask sender,
|
||||
forward the reply to the original caller, and `_inProgressDeployments.Remove`
|
||||
in the same handler. (Today the Ask is taken on the *actor* itself by the
|
||||
caller, so the reply skips the coordinator.)
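
A minimal sketch of the removal hook, assuming the coordinator observes both the routed command and the reply (or its timeout). `InProgressDeployments` and its method names are hypothetical stand-ins for the actor's private dictionary, not the module's actual code:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the actor's private _inProgressDeployments state.
// Intended to be touched only from the actor thread, so it is not synchronised.
public sealed class InProgressDeployments
{
    private readonly Dictionary<string, string> _siteByDeployment = new();

    public int Count => _siteByDeployment.Count;

    // Called when a DeployInstanceCommand is routed to a site.
    public void Track(string deploymentId, string siteId) =>
        _siteByDeployment[deploymentId] = siteId;

    // Called from the reply or timeout handler.
    // Returns false when the id was not being tracked.
    public bool Complete(string deploymentId) =>
        _siteByDeployment.Remove(deploymentId);

    // Called on disconnect: removes and returns only the deployments
    // that are still in flight for the given site.
    public IReadOnlyList<string> FailAllForSite(string siteId)
    {
        var failed = _siteByDeployment
            .Where(kv => kv.Value == siteId)
            .Select(kv => kv.Key)
            .ToList();

        foreach (var id in failed)
        {
            _siteByDeployment.Remove(id);
        }

        return failed;
    }
}
```

With `Complete` wired to the reply path, a later disconnect walk would only see deployments that are genuinely unfinished.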
|
||||
|
||||
---
|
||||
|
||||
### Communication-018 — Site heartbeats hard-code `IsActive: true` regardless of node role
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:357-371` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SiteCommunicationActor.SendHeartbeatToCentral` builds
|
||||
`new HeartbeatMessage(_siteId, hostname, IsActive: true, DateTimeOffset.UtcNow)`
|
||||
on every periodic tick (line 366), with no inspection of whether this node is
|
||||
actually the active site node or a standby. The `HeartbeatMessage.IsActive`
|
||||
field thus carries the literal value `true` on every heartbeat from every
|
||||
node, and the field is effectively dead — central's `HandleHeartbeat` doesn't
|
||||
consume it either (line 297 only passes `SiteId` and `Timestamp` to
|
||||
`MarkHeartbeat`).
|
||||
|
||||
Per CLAUDE.md's Cluster & Failover section the active/standby distinction is
|
||||
real ("Both nodes are seed nodes", "keep-oldest split-brain resolver",
|
||||
"automatic dual-node recovery"), so a heartbeat that *could* carry node-role
|
||||
information would be useful for the central health dashboard distinguishing
|
||||
"active node down, standby up" from "site fully offline". As shipped, the
|
||||
field is contract noise and a future implementer might mistakenly assume it
|
||||
already carries meaningful state.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) resolve the current cluster role at heartbeat-send time and pass it
|
||||
through — e.g. `Cluster.Get(Context.System).SelfRoles.Contains("active")` or
|
||||
the project's existing role mechanism — and have the central aggregator
|
||||
consume `IsActive`; or (b) drop the `IsActive` field from `HeartbeatMessage`
|
||||
(additive-only-evolution: deprecate the field, default to `true`, plan
|
||||
removal in a major message contract revision).
|
||||
|
||||
---
|
||||
|
||||
### Communication-019 — `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:397-431` |
|
||||
|
||||
**Description**
|
||||
|
||||
`LoadSiteAddressesFromDb` runs `await repo.GetAllSitesAsync()` inside
|
||||
`Task.Run(async () => ...).PipeTo(self)` with no cancellation token (line 404).
|
||||
The repository signature accepts `CancellationToken` (the test mock declares
|
||||
`GetAllSitesAsync(Arg.Any<CancellationToken>())`), but the actor calls the
|
||||
no-arg overload, so a hung MS SQL call has no upper bound on its duration. The
periodic 60-second refresh keeps firing; each tick spawns a fresh `Task.Run`
|
||||
that piles up if the database is consistently slow. The actor itself is
|
||||
unaffected (it's not blocked), but pending tasks and DB connection-pool
|
||||
resources accumulate, and the `Status.Failure` handler (Communication-006)
|
||||
never fires because the task never faults — it just sits.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Maintain a per-load `CancellationTokenSource` with a deadline (e.g. the same
|
||||
60s the refresh runs on, or a configurable timeout in `CommunicationOptions`).
|
||||
Pass its `Token` to `GetAllSitesAsync`. Cancel the prior token before spinning
|
||||
a new load to avoid task accumulation.
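
The recommendation could look roughly like the sketch below. `SiteAddressLoader` is a hypothetical helper, the delegate stands in for `repo.GetAllSitesAsync(token)`, and the timeout value would come from wherever the project keeps it (for example `CommunicationOptions`, which is an assumption here):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper that owns the token source for the load in progress.
// Meant to be driven from a single thread (the actor), so no locking.
public sealed class SiteAddressLoader : IDisposable
{
    private readonly Func<CancellationToken, Task<IReadOnlyList<string>>> _getAllSites;
    private readonly TimeSpan _timeout;
    private CancellationTokenSource? _current;

    public SiteAddressLoader(
        Func<CancellationToken, Task<IReadOnlyList<string>>> getAllSites,
        TimeSpan timeout)
    {
        _getAllSites = getAllSites;
        _timeout = timeout;
    }

    // Cancels any load still running, then starts a new one that is
    // cancelled automatically once the timeout elapses.
    public Task<IReadOnlyList<string>> StartLoad()
    {
        CancelCurrent();
        _current = new CancellationTokenSource(_timeout);
        return _getAllSites(_current.Token);
    }

    public void Dispose() => CancelCurrent();

    private void CancelCurrent()
    {
        _current?.Cancel();
        _current?.Dispose();
        _current = null;
    }
}
```

Because a cancelled load faults or cancels its task instead of hanging, a failure handler on the piped result gets a chance to run.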
|
||||
|
||||
---
|
||||
|
||||
### Communication-020 — `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:567` |
|
||||
|
||||
**Description**
|
||||
|
||||
The Akka.NET convention is that messages crossing actor boundaries (even
|
||||
internal Self-messages over an async task boundary) are immutable.
|
||||
`SiteAddressCacheLoaded(Dictionary<string, List<string>> SiteContacts)` is a
|
||||
record but its `SiteContacts` payload is a mutable `Dictionary` whose values
|
||||
are mutable `List<string>`. Constructed inside `Task.Run` and handed off to
|
||||
the actor, the cache could in principle be mutated by either side; in
|
||||
practice nothing does, but the type itself guarantees nothing, so
|
||||
CLAUDE.md's "message immutability" rule is being followed only by convention.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Change the record signature to use `IReadOnlyDictionary<string, IReadOnlyList<string>>`
|
||||
(or `ImmutableDictionary` / `ImmutableArray<string>`) and freeze the data
|
||||
before piping. The cost is negligible — the payload is built and consumed
|
||||
once per refresh tick.
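
A small sketch of the "freeze before piping" step, using `System.Collections.Immutable`. The record shape mirrors the recommendation; `SiteContactsFreezer` is a hypothetical helper name:

```csharp
using System.Collections.Generic;
using System.Collections.Immutable;

// Message with a read-only payload type, as the recommendation suggests.
public sealed record SiteAddressCacheLoaded(
    IReadOnlyDictionary<string, IReadOnlyList<string>> SiteContacts);

public static class SiteContactsFreezer
{
    // Copies the mutable structure into immutable collections, so later
    // changes to the source dictionary or its lists cannot leak through.
    public static IReadOnlyDictionary<string, IReadOnlyList<string>> Freeze(
        Dictionary<string, List<string>> mutable) =>
        mutable.ToImmutableDictionary(
            kv => kv.Key,
            kv => (IReadOnlyList<string>)kv.Value.ToImmutableArray());
}
```

Exposing only `IReadOnlyDictionary` over the original `Dictionary` would hide the mutators but still share the underlying storage; copying into immutable collections avoids that.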
|
||||
|
||||
---
|
||||
|
||||
### Communication-021 — `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:188-200` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SubscribeInstance` performs these statements in order (lines 189-194), all
|
||||
*before* the `try` block at line 200:
|
||||
1. `Interlocked.Increment(ref _actorCounter)`
|
||||
2. `_actorSystem!.ActorOf(Props.Create(typeof(StreamRelayActor), ...))`
|
||||
3. `_streamSubscriber.Subscribe(request.InstanceUniqueName, relayActor)`
|
||||
|
||||
If step 3 throws (the subscriber is wired but its `Subscribe` faults — a stale
|
||||
instance name, a temporary index lookup failure, etc.), the exception escapes
|
||||
the method as an unhandled `RpcException` *and* leaks the freshly-created
|
||||
`relayActor`. The `finally` block at line 211 is unreachable because the
|
||||
throw happens before the `try`. The actor's `_activeStreams` entry, the
|
||||
`StreamEntry.Cts`, and the `Channel<SiteStreamEvent>` are also leaked.
|
||||
|
||||
In normal operation `_streamSubscriber.Subscribe` does not throw, so the bug is
|
||||
latent — but a misbehaving site runtime (e.g. `SiteStreamManager` faulted
|
||||
because the actor system is shutting down) would surface it.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Restructure to either (a) wrap the `Subscribe` call in a `try` whose `catch`
|
||||
stops the relay actor and disposes the CTS, or (b) move the actor + subscriber
|
||||
creation *inside* the existing `try` block (the `finally` will then handle
|
||||
cleanup uniformly). Option (b) is the simpler of the two: move lines 189-194 down
|
||||
past the `try {` brace.
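
The shape of option (b), reduced to plain .NET types. `Relay` is a hypothetical disposable standing in for the relay actor and its stream entry, and the delegates stand in for `ActorOf`, `_streamSubscriber.Subscribe`, and the streaming loop; none of this is the server's real code:

```csharp
using System;

// Hypothetical resource representing the relay actor plus its stream state.
public sealed class Relay : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;
}

public static class SubscribeSketch
{
    // Creation and subscription happen inside the try, so the finally
    // block releases the relay even when subscribe throws.
    public static void Run(
        Func<Relay> createRelay,
        Action<Relay> subscribe,
        Action<Relay> pump)
    {
        Relay? relay = null;
        try
        {
            relay = createRelay();
            subscribe(relay);
            pump(relay);
        }
        finally
        {
            // Null only when createRelay itself failed.
            relay?.Dispose();
        }
    }
}
```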
|
||||
|
||||
---
|
||||
|
||||
### Communication-022 — `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:67`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:493` |
|
||||
|
||||
**Description**
|
||||
|
||||
`TrackMessageForCleanup` on `SubscribeDebugViewRequest` does
|
||||
`_debugSubscriptions[sub.CorrelationId] = (envelope.SiteId, Sender)` (line 493).
|
||||
The dictionary indexer silently overwrites any prior entry for the same
|
||||
`CorrelationId`. If two debug sessions ever reuse the same correlation ID (e.g.
|
||||
two Blazor users start a stream at the same moment with a non-GUID id, or a
|
||||
caller bug, or a malicious caller as flagged in the related finding
|
||||
Communication-014), the first subscriber's entry is overwritten and lost —
|
||||
on a later `ConnectionStateChanged(false)` (per Communication-016 it never
|
||||
actually fires today, but the design intent stands), only the *second*
|
||||
subscriber would be notified of the disconnect.
|
||||
|
||||
`DebugStreamService.StartStreamAsync` uses `Guid.NewGuid().ToString("N")` as
|
||||
the session id (`DebugStreamService.cs:97`), so a real collision is
|
||||
astronomically unlikely in normal operation. But the central side is not
|
||||
defending itself: a CLI consumer or a future caller is implicitly trusted to
|
||||
generate globally-unique ids.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
When the slot is already occupied, log a Warning and either reject the new
|
||||
subscription with an error response or evict the prior subscriber via
|
||||
`DebugStreamTerminated` before installing the new one. Mirrors the
|
||||
`SiteStreamGrpcServer` defensive behaviour where a duplicate `correlation_id`
|
||||
cancels the existing stream (line 167).
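
A sketch of the "reject" variant, assuming the actor swaps the indexer assignment for `Dictionary.TryAdd`. `DebugSubscriptions<TSubscriber>` is a hypothetical wrapper; in the actor the subscriber type would be `IActorRef`:

```csharp
using System.Collections.Generic;

// Hypothetical wrapper around the _debugSubscriptions dictionary.
public sealed class DebugSubscriptions<TSubscriber>
{
    private readonly Dictionary<string, (string SiteId, TSubscriber Subscriber)> _byCorrelationId = new();

    // Returns false and leaves the existing entry untouched when the
    // correlation id is already in use; the caller can then log a
    // warning and reject or evict, instead of silently overwriting.
    public bool TryRegister(string correlationId, string siteId, TSubscriber subscriber) =>
        _byCorrelationId.TryAdd(correlationId, (siteId, subscriber));

    public bool TryGet(
        string correlationId,
        out (string SiteId, TSubscriber Subscriber) entry) =>
        _byCorrelationId.TryGetValue(correlationId, out entry);

    public bool Remove(string correlationId) =>
        _byCorrelationId.Remove(correlationId);
}
```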
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.ConfigurationDatabase` |
|
||||
| Design doc | `docs/requirements/Component-ConfigurationDatabase.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 10 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -59,6 +59,59 @@ inconsistency — a redundant cast on one of the three `HasConversion` calls
|
||||
(`ConfigurationDatabase-014`). The module is otherwise healthy and the prior fixes
|
||||
hold up well.
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
Re-reviewed the module at commit `1eb6e97`. All fourteen prior findings remain
|
||||
`Resolved`; their fixes still hold (encryption converter, fail-fast guard,
|
||||
peppered API-key hash, ephemeral-fallback hardening, etc.). The module has
|
||||
grown since the last review — new code includes Audit Log (#23) raw-SQL
|
||||
paths in `AuditLogRepository` (partition-switch purge, recursive
|
||||
execution-tree CTE, KPI snapshot, partition-boundary discovery), the
|
||||
`AuditLogPartitionMaintenance` SPLIT-RANGE roll-forward implementation, the
|
||||
`AuditCorrelationContext` scoped service that stamps `BundleImportId`, the
|
||||
`SiteCallAuditRepository` monotonic-rank upsert, and the
|
||||
`NotificationOutboxRepository` per-site KPI surface — and most of the new
|
||||
findings are concentrated in those raw-SQL paths and in latent gaps left
|
||||
behind by the CD-012 hash migration.
|
||||
|
||||
Ten new findings were recorded. The most material is
|
||||
`ConfigurationDatabase-015`: a check-then-act race in
|
||||
`NotificationOutboxRepository.InsertIfNotExistsAsync` with no duplicate-key
|
||||
catch — unlike the sibling Audit Log / Site Call ingest paths, a concurrent
|
||||
ack-after-persist on the same `NotificationId` will surface as an
|
||||
unhandled `DbUpdateException` and break the at-least-once site→central
|
||||
handoff. `ConfigurationDatabase-016` flags that
|
||||
`InboundApiRepository.GetApiKeyByValueAsync` hashes the candidate with
|
||||
`ApiKeyHasher.Default` (unpeppered) while the production create-path uses
|
||||
the configured peppered hasher — any future caller (or test that exercises
|
||||
the method) will silently fail to find a real key; the production
|
||||
`ApiKeyValidator` happens not to call it, but the method is a publicly
|
||||
exposed `IInboundApiRepository` member and a latent bug.
|
||||
`ConfigurationDatabase-017` records that the `DeleteDeploymentRecordAsync`
|
||||
stub-attach delete bypasses the documented optimistic-concurrency rule on
|
||||
`DeploymentRecord.RowVersion` — the SQLite tests pass because the test
|
||||
fixture re-maps `RowVersion` as a nullable concurrency token, but in
|
||||
production this is likely to throw `DbUpdateConcurrencyException`.
|
||||
`ConfigurationDatabase-018` records that the `DateTime`-typed `*Utc` columns on
|
||||
`AuditEvent` and `SiteCall` re-emerge as `Kind=Unspecified` on read; the
|
||||
sibling Commons module flagged the same pattern as Commons-019, and
|
||||
`AuditLogPartitionMaintenance.GetMaxBoundaryAsync` already defends against
|
||||
it with an explicit `SpecifyKind(Utc)` — but `GetPartitionBoundariesOlderThanAsync`
|
||||
does not (`ConfigurationDatabase-020`). `ConfigurationDatabase-019` is the
|
||||
SPLIT-RANGE loop in `AuditLogPartitionMaintenance.EnsureLookaheadAsync`
|
||||
swallowing every `SqlException` as a Warning and continuing — a genuine
|
||||
failure (permissions, deadlock, transient) leaves a missing boundary and
|
||||
the next iteration cheerfully splits the following month, creating a hole.
|
||||
`ConfigurationDatabase-021` is a low-severity hardening concern around
|
||||
`SwitchOutPartitionAsync`'s raw-SQL interpolation of `monthBoundaryStr` /
|
||||
`stagingTableName` (currently safe by construction, but truncates fractional
|
||||
seconds). `ConfigurationDatabase-022` is the stale "WP-24 stub" XML comment
|
||||
on `DeploymentManagerRepository`. `ConfigurationDatabase-023` is a
|
||||
design-doc-adherence drift on `IX_AuditLog_CorrelationId` (design says
|
||||
`IX_AuditLog_Correlation`). `ConfigurationDatabase-024` is missing test
|
||||
coverage for the SPLIT-RANGE failure-continuation behaviour and for the
|
||||
production-shape stub-attach delete with a real rowversion.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -74,6 +127,21 @@ hold up well.
|
||||
| 9 | Testing coverage | ✓ | Several repositories and `InstanceLocator` lack direct tests (CD-010). |
|
||||
| 10 | Documentation & comments | ✓ | `DeploymentManagerRepository` "WP-24 stub" XML comment is stale; noted in module context but not raised as a standalone finding. No issues found beyond items above. |
|
||||
|
||||
_Re-review (2026-05-28, `1eb6e97`):_
|
||||
|
||||
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `GetPartitionBoundariesOlderThanAsync` returns `DateTimeKind.Unspecified` (CD-020). `GetApiKeyByValueAsync` hashes with the unpeppered default (CD-016). |
| 2 | Akka.NET conventions | ✓ | No actors in this module. No issues found. |
| 3 | Concurrency & thread safety | ✓ | `NotificationOutboxRepository.InsertIfNotExistsAsync` check-then-act has no duplicate-key catch (CD-015). Stub-attach delete bypasses documented optimistic concurrency on `DeploymentRecord.RowVersion` (CD-017). |
| 4 | Error handling & resilience | ✓ | `AuditLogPartitionMaintenance.EnsureLookaheadAsync` swallows non-idempotent SPLIT failures and continues (CD-019). |
| 5 | Security | ✓ | `SwitchOutPartitionAsync` interpolates a `DateTime` string and a GUID-suffixed identifier into raw SQL; safe by construction but the pattern is risky (CD-021). |
| 6 | Performance & resource management | ✓ | No new issues found. |
| 7 | Design-document adherence | ✓ | Index name drift: design says `IX_AuditLog_Correlation`, code uses `IX_AuditLog_CorrelationId` (CD-023). |
| 8 | Code organization & conventions | ✓ | `DateTime *Utc` columns on `AuditEvent` / `SiteCall` carry no `DateTimeKind` enforcement (CD-018). |
| 9 | Testing coverage | ✓ | No tests for SPLIT failure continuation and no production-shape rowversion stub-attach test (CD-024). |
| 10 | Documentation & comments | ✓ | Stale "WP-24 stub" XML comment on `DeploymentManagerRepository` (CD-022). |
|
||||
|
||||
## Findings
|
||||
|
||||
### ConfigurationDatabase-001 — `GetTemplateWithChildrenAsync` loads child templates then discards them
|
||||
@@ -816,3 +884,411 @@ no behavioural regression test is meaningful (cf. CD-005); a forward guard was a
|
||||
in `SchemaConfigurationTests.cs` —
|
||||
`SecretColumns_AllHaveEncryptedStringConverterApplied` (theory over all three secret
|
||||
columns) — asserting each column keeps an `EncryptedStringConverter`.
|
||||
|
||||
### ConfigurationDatabase-015 — `NotificationOutboxRepository.InsertIfNotExistsAsync` is a check-then-act race with no duplicate-key catch
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs:33-45` |
|
||||
|
||||
**Description**
|
||||
|
||||
`InsertIfNotExistsAsync` does `AnyAsync(x => x.NotificationId == n.NotificationId)`,
|
||||
then — if false — `AddAsync` + `SaveChangesAsync`. There is a check-then-act window
|
||||
between the two operations: two sessions can both pass the `AnyAsync` check and both
|
||||
attempt the INSERT, and the loser surfaces as a uniqueness violation on the
|
||||
`NotificationId` primary key wrapped in a `DbUpdateException` / `SqlException` (error
|
||||
2627). The site→central handoff for notifications is documented as **at-least-once
|
||||
with ack-after-persist plus insert-if-not-exists**; collisions on the same
|
||||
`NotificationId` are therefore not a "should never happen" but the *expected* contention
|
||||
mode. As written, the second concurrent ack throws, fails the site→central
|
||||
acknowledgement, and the site retries the same row again on its next forward — a
|
||||
livelock if the contending pair keeps racing.
|
||||
|
||||
The sibling raw-SQL `IF NOT EXISTS … INSERT` paths in `AuditLogRepository.InsertIfNotExistsAsync`
|
||||
(see SqlErrorUniqueIndexViolation / SqlErrorPrimaryKeyViolation handling at
|
||||
`AuditLogRepository.cs:74-89`) and `SiteCallAuditRepository.UpsertAsync`
|
||||
(`SiteCallAuditRepository.cs:87-96`) explicitly catch errors 2601/2627 and treat the
|
||||
loser as a no-op — exactly the right pattern for "first-write-wins idempotent ingest".
|
||||
This repository alone does not.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) rewrite the body as a single raw-SQL `IF NOT EXISTS … INSERT` and apply the
|
||||
same 2601/2627 catch-and-log-Debug pattern the AuditLog and SiteCall repositories use,
|
||||
or (b) wrap the existing flow in a try/catch around `SaveChangesAsync` that inspects
|
||||
the inner `SqlException.Number` and returns `false` (i.e. "another writer won the race")
|
||||
on 2601/2627. Option (a) is preferable because it collapses the two round-trips to one
|
||||
and matches the established idempotent-ingest pattern used elsewhere in the module.
|
||||
Add a regression test that simulates two concurrent `InsertIfNotExistsAsync` calls
|
||||
(using two open contexts) for the same `NotificationId` and asserts neither call
|
||||
throws and exactly one row lands.
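
Option (b) can be sketched as below. It assumes the EF Core SQL Server provider surfaces the violation as a `DbUpdateException` wrapping a `Microsoft.Data.SqlClient.SqlException` (which client library the project references is an assumption), and it needs the `Microsoft.EntityFrameworkCore` and `Microsoft.Data.SqlClient` packages. `IdempotentInsert.TrySaveAsync` is a hypothetical helper; the delegate stands in for the `SaveChangesAsync` call:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;
using Microsoft.EntityFrameworkCore;

public static class IdempotentInsert
{
    // SQL Server error numbers for duplicate-key failures.
    private const int UniqueIndexViolation = 2601;
    private const int PrimaryKeyViolation = 2627;

    // Returns true when the save succeeded, false when another writer
    // inserted the same key first. Any other failure still propagates.
    public static async Task<bool> TrySaveAsync(Func<Task> saveChanges)
    {
        try
        {
            await saveChanges();
            return true;
        }
        catch (DbUpdateException ex)
            when (ex.InnerException is SqlException
            {
                Number: UniqueIndexViolation or PrimaryKeyViolation
            })
        {
            return false;
        }
    }
}
```

The exception filter keeps unrelated `DbUpdateException`s (for example constraint failures on other columns) visible to the caller rather than reporting them as a lost race.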
|
||||
|
||||
### ConfigurationDatabase-016 — `InboundApiRepository.GetApiKeyByValueAsync` hashes the candidate with the unpeppered `ApiKeyHasher.Default`
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/InboundApiRepository.cs:35-39` |
|
||||
|
||||
**Description**
|
||||
|
||||
`GetApiKeyByValueAsync` resolves an API key by its presented plaintext value by hashing
|
||||
the candidate and looking up `KeyHash`. The hash, however, is computed with the static
|
||||
`ApiKeyHasher.Default` (the fixed, deployment-independent unpeppered hasher used for
|
||||
tests). Production key creation uses the DI-registered, *peppered* `IApiKeyHasher`
|
||||
constructed from `InboundApiOptions.ApiKeyPepper` (see CD-012 resolution and
|
||||
`ApiKeyHasher.ctor(string pepper)`), so the stored `KeyHash` of any real key was
|
||||
produced under the deployment pepper. Hashing the candidate with the unpeppered
|
||||
`Default` yields a different digest, and the `WHERE KeyHash = @hash` lookup will never
|
||||
match a real key.
|
||||
|
||||
The production `ApiKeyValidator` (InboundAPI module) deliberately does NOT call this
|
||||
method — it fetches all keys and runs a constant-time comparison via the
|
||||
DI-registered hasher (`ApiKeyValidator.cs:53-64`) — so the immediate
|
||||
authentication path is unaffected. But `GetApiKeyByValueAsync` remains a publicly
|
||||
exposed `IInboundApiRepository` member; any new caller (a future admin tool, a CLI
|
||||
command, a test) that uses it under a peppered configuration will silently get a
|
||||
`null` result for an existing, valid key, and almost certainly misreport the failure
|
||||
as "key not found".
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) take `IApiKeyHasher` via constructor injection — alongside the existing
|
||||
`ScadaLinkDbContext` and optional `ILogger` — and use it here so the repository
|
||||
participates in the same peppered scheme as the rest of the system; or (b) delete
|
||||
the method from both the implementation and `IInboundApiRepository` (Commons) on the
|
||||
grounds that the production authentication path correctly avoids it for timing
|
||||
reasons and there is no remaining valid caller. Add a regression test that constructs
|
||||
the repository under a real `ApiKeyHasher("a-strong-pepper-value")`, inserts an
|
||||
`ApiKey.FromHash(...)` using the same hasher, and asserts `GetApiKeyByValueAsync`
|
||||
returns the row — under option (a) it should pass; under option (b) the method no
|
||||
longer exists.
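
To make the mismatch concrete, the sketch below compares a plain digest with a peppered one. The algorithms are assumptions chosen for illustration (SHA-256 for the unpeppered case, HMAC-SHA-256 keyed by the pepper for the peppered case); the real `ApiKeyHasher` may combine key and pepper differently, but any peppered scheme produces a different digest from the unpeppered one:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Illustrative only: not the project's ApiKeyHasher.
public static class HashSketch
{
    // Digest computed without any deployment secret.
    public static string Unpeppered(string apiKey) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(apiKey)));

    // Digest keyed by the deployment pepper.
    public static string Peppered(string apiKey, string pepper) =>
        Convert.ToHexString(
            HMACSHA256.HashData(
                Encoding.UTF8.GetBytes(pepper),
                Encoding.UTF8.GetBytes(apiKey)));
}
```

A row stored under `Peppered(key, pepper)` can never be found by a lookup that computes `Unpeppered(key)`, which is the silent `null` described above.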
|
||||
|
||||
### ConfigurationDatabase-017 — Stub-attach delete on `DeploymentRecord` bypasses optimistic concurrency
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:83-97` |
|
||||
|
||||
**Description**
|
||||
|
||||
`DeploymentRecord` carries a SQL Server `rowversion` concurrency token (declared
|
||||
in `DeploymentConfiguration` and confirmed by `ConcurrencyTests`), per the design
|
||||
doc's "Optimistic concurrency is used on deployment status records". When
|
||||
`DeleteDeploymentRecordAsync` falls into its stub-attach branch (no tracked entity
|
||||
in `_dbContext.DeploymentRecords.Local` for the given id), it constructs
|
||||
`new DeploymentRecord("stub", "stub") { Id = id }`, `Attach`es it, and `Remove`s it.
|
||||
The stub's `RowVersion` is left at its default `null` (or `byte[0]`).
|
||||
|
||||
EF Core's SQL Server provider generates the delete as
|
||||
`DELETE FROM DeploymentRecords WHERE Id = @id AND RowVersion = @stubRowVersion` — and
|
||||
the stub rowversion is not the row's real rowversion, so on a real SQL Server (with
|
||||
`IsRowVersion()` auto-populating the column) the WHERE never matches and `SaveChanges`
|
||||
throws `DbUpdateConcurrencyException`. The path is exercised by
|
||||
`RepositoryCoverageTests.DeleteDeploymentRecord_ViaStubAttachPath_RemovesEntity` —
|
||||
but the test fixture remaps `RowVersion` as a nullable `IsConcurrencyToken()` column
|
||||
without auto-population (`SqliteTestHelper.ConfigureForTests`), so the stored
|
||||
RowVersion is null AND the stub's RowVersion is null AND the SQLite delete matches.
|
||||
Production-shape behaviour is the opposite.
|
||||
|
||||
The same stub-attach pattern is used on `SystemArtifactDeploymentRecord`,
|
||||
`Site`, and `DataConnection`. Those entities have no rowversion token, so the
|
||||
production behaviour is correct for them — the issue is specific to
|
||||
`DeploymentRecord`.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Replace the stub-attach branch in `DeleteDeploymentRecordAsync` with a real lookup —
|
||||
`await _dbContext.DeploymentRecords.FindAsync([id], ct)` then `Remove` if non-null —
|
||||
mirroring `DeleteInstanceAttributeOverrideAsync` and `DeleteDeployedSnapshotAsync`.
|
||||
This loses the "delete by id without a read" micro-optimisation (a real concern only
|
||||
in batched-delete loops) but restores the documented concurrency contract. If the
|
||||
optimisation is genuinely required, attach a `DeploymentRecord` with the *caller's*
|
||||
known RowVersion (the caller had to fetch the row at some point) and accept the
|
||||
`DbUpdateConcurrencyException` as the correct concurrency signal. Add a regression
|
||||
test under MS SQL (extend `RepositoryCoverageTests` with a SQL-Server-flavoured
|
||||
fixture, or use `MsSqlMigrationFixture`) that asserts the stub-attach delete works
|
||||
when the real RowVersion is supplied.
|
||||
|
||||
### ConfigurationDatabase-018 — `DateTime`-typed `*Utc` columns on `AuditEvent` / `SiteCall` carry no `DateTimeKind` enforcement
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/AuditLogEntityTypeConfiguration.cs`, `Configurations/SiteCallEntityTypeConfiguration.cs` (mappings for `OccurredAtUtc`, `IngestedAtUtc`, `CreatedAtUtc`, `UpdatedAtUtc`, `TerminalAtUtc`) |
|
||||
|
||||
**Description**
|
||||
|
||||
`AuditEvent.OccurredAtUtc` / `IngestedAtUtc` and `SiteCall.CreatedAtUtc` /
|
||||
`UpdatedAtUtc` / `TerminalAtUtc` / `IngestedAtUtc` are declared as `DateTime` (not
|
||||
`DateTimeOffset`) per the Audit Log #23 spec, with a UTC suffix convention. SQL Server's
|
||||
`datetime2` provider strips the `Kind` flag on the wire — values inserted with
|
||||
`DateTimeKind.Utc` round-trip as `DateTimeKind.Unspecified` on read. The EF mappings
|
||||
add no `HasConversion(...)` to normalise the kind. The sibling Commons module just
|
||||
flagged the same pattern as `Commons-019`; in this module the consequence is concrete:
|
||||
|
||||
- `AuditLogPartitionMaintenance.GetMaxBoundaryAsync` already defends with an explicit
|
||||
`DateTime.SpecifyKind(dt, DateTimeKind.Utc)` (see `AuditLogPartitionMaintenance.cs:103-104`).
|
||||
That defence is necessary precisely because the EF mapping does not enforce it.
|
||||
- `AuditLogRepository.GetPartitionBoundariesOlderThanAsync` does NOT defend — it
|
||||
returns `reader.GetDateTime(0)` directly with `Kind=Unspecified` (separate finding
|
||||
CD-020).
|
||||
- Downstream comparisons like `DateTime.UtcNow` (Kind=Utc) against a re-read
|
||||
`OccurredAtUtc` (Kind=Unspecified) do not produce a runtime error, but any code
|
||||
path that converts via `.ToLocalTime()` or `.ToUniversalTime()` will silently
|
||||
interpret an unspecified-kind value as local time and produce wrong results.
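
The `Kind` loss and its consequence can be reproduced without a database. The sketch below is illustrative only: `SimulateDatetime2RoundTrip` is a hypothetical stand-in for the provider read (it rebuilds the value from its ticks, which yields `Kind=Unspecified`) and is not the SqlClient code path itself; `AsUtc` is likewise a hypothetical helper name.

```csharp
using System;

public static class KindRoundTripDemo
{
    // Hypothetical stand-in for a datetime2 read: only the ticks survive,
    // so the reconstructed value carries Kind=Unspecified.
    public static DateTime SimulateDatetime2RoundTrip(DateTime value) =>
        new DateTime(value.Ticks, DateTimeKind.Unspecified);

    // The re-tagging that a value converter (or a defensive read site) applies.
    public static DateTime AsUtc(DateTime value) =>
        DateTime.SpecifyKind(value, DateTimeKind.Utc);

    public static void Main()
    {
        var written = new DateTime(2026, 5, 28, 12, 0, 0, DateTimeKind.Utc);
        var read = SimulateDatetime2RoundTrip(written);

        Console.WriteLine(read.Kind);        // Unspecified
        Console.WriteLine(read == written);  // True: equality compares ticks, not Kind

        // ToUniversalTime() treats Unspecified as local time, so on a host whose
        // time zone is not UTC this comparison fails by the zone offset.
        Console.WriteLine(read.ToUniversalTime() == written);

        Console.WriteLine(AsUtc(read).ToUniversalTime() == written); // True
    }
}
```

The third line is deliberately left without an expected value: its result depends on the host time zone, which is exactly the hazard described above.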
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Apply a value converter on every `DateTime`-typed `*Utc` column that re-tags the
|
||||
`Kind` to `Utc` on read (and asserts/`SpecifyKind` on write to defend against an
|
||||
accidental local-kind write). EF Core has no dedicated UTC converter type, but an
inline `HasConversion` pair takes only a few lines per column:
|
||||
|
||||
```csharp
builder.Property(e => e.OccurredAtUtc)
    .HasConversion(
        v => v,                                           // write: stored unchanged
        v => DateTime.SpecifyKind(v, DateTimeKind.Utc));  // read: re-tag as UTC
```
|
||||
|
||||
Apply uniformly to `AuditEvent` (OccurredAtUtc, IngestedAtUtc), `SiteCall`
|
||||
(CreatedAtUtc, UpdatedAtUtc, TerminalAtUtc, IngestedAtUtc), and any other
|
||||
`DateTime *Utc` columns added later. Add a regression test that inserts a UTC row,
|
||||
re-reads it in a fresh context, and asserts `Kind == DateTimeKind.Utc`. Coordinate
|
||||
with the sibling `Commons-019` finding so the resolution is consistent across both
|
||||
modules.
|
||||
|
||||
### ConfigurationDatabase-019 — `EnsureLookaheadAsync` swallows non-idempotent SPLIT failures and continues, creating partition holes
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Maintenance/AuditLogPartitionMaintenance.cs:181-199` |
|
||||
|
||||
**Description**
|
||||
|
||||
`EnsureLookaheadAsync` loops one month at a time from `next` up to `horizon` and
|
||||
issues `ALTER PARTITION SCHEME … NEXT USED` + `ALTER PARTITION FUNCTION … SPLIT RANGE`
|
||||
per month. The class doc says idempotency is guaranteed by reading the max-boundary
|
||||
first and only issuing SPLITs for strictly-greater months — so "boundary already
|
||||
exists" (SQL Server msg 7708/7711) cannot occur by construction. Yet the loop wraps
|
||||
each iteration in `catch (SqlException ex) { _logger.LogWarning(...); }` and
|
||||
continues, with the rationale "the desired end state (boundary present) is satisfied
|
||||
by either path."
|
||||
|
||||
That rationale is correct only for an "already-exists" error — which the pre-check
|
||||
makes impossible. Any *other* `SqlException` — a permissions failure (the
|
||||
`scadalink_audit_purger` role's `ALTER ON SCHEMA::dbo` revoked or not granted), a
|
||||
deadlock victim, a transient connection drop, a transaction log full, an underlying
|
||||
filegroup full — leaves the boundary genuinely **not** created, logs a Warning
|
||||
(quiet by default in most appenders), and the next iteration tries to SPLIT the
|
||||
following month. That split *can* succeed (it is a different range value), creating
|
||||
a permanent **hole** in the partition layout: month N never had a partition created,
|
||||
month N+1 does, so any future row in month N lands in the partition that previously
|
||||
spanned both months and partition-switch purge for month N becomes impossible.
|
||||
|
||||
The class is the central singleton's daily-tick partition roll-forward, so the hole
|
||||
persists until an operator notices it and rebuilds manually — by which point months
|
||||
of audit retention may be locked behind the unsplit range.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) drop the `try/catch` entirely so any SPLIT failure aborts the loop and
|
||||
surfaces to the hosted service (the next tick retries — at-least-once with no holes),
|
||||
or (b) keep the catch but narrow it to ONLY the
|
||||
"boundary-already-exists" errors (SQL Server msg 7708 and 7711) and log at Debug,
|
||||
mirroring how `AuditLogRepository.InsertIfNotExistsAsync` narrowly catches 2601/2627.
|
||||
Option (a) is preferable: by class-doc construction the catch should never fire, so
|
||||
its only effect is to mask the real-failure case. Add tests that simulate a SPLIT
|
||||
failure (e.g. a permission denial via a constrained test login) and assert the loop
|
||||
aborts after the first failure with no further SPLITs.
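
The difference between catch-and-continue and option (a) can be sketched without SQL Server. Everything here is hypothetical: `RollForward` is a simplified model of the loop, the `split` delegate stands in for the `NEXT USED` + `SPLIT RANGE` pair, and `InvalidOperationException` stands in for `SqlException`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class LookaheadSketch
{
    // Hypothetical roll-forward loop. `swallowFailures` toggles between the
    // catch-and-continue behaviour and option (a), abort on first failure.
    public static List<DateTime> RollForward(
        DateTime next, DateTime horizon, Action<DateTime> split, bool swallowFailures)
    {
        var created = new List<DateTime>();
        for (var month = next; month <= horizon; month = month.AddMonths(1))
        {
            try
            {
                split(month);
                created.Add(month);
            }
            catch (InvalidOperationException) when (swallowFailures)
            {
                // Swallowed: the loop moves on to the next month, leaving a hole.
            }
        }
        return created;
    }

    public static void Main()
    {
        var june = new DateTime(2026, 6, 1, 0, 0, 0, DateTimeKind.Utc);
        var august = june.AddMonths(2);

        // Simulated failure for July only.
        Action<DateTime> split = m =>
        {
            if (m.Month == 7) throw new InvalidOperationException("simulated SPLIT failure");
        };

        var created = RollForward(june, august, split, swallowFailures: true);
        Console.WriteLine(string.Join(", ",
            created.ConvertAll(d => d.ToString("yyyy-MM", CultureInfo.InvariantCulture))));
        // prints "2026-06, 2026-08": July is a hole

        try
        {
            RollForward(june, august, split, swallowFailures: false);
        }
        catch (InvalidOperationException)
        {
            Console.WriteLine("aborted at first failure"); // August was never attempted
        }
    }
}
```

With the failure propagating, the next daily tick would re-read the max boundary and retry from July, so no month is skipped.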
|
||||
|
||||
### ConfigurationDatabase-020 — `GetPartitionBoundariesOlderThanAsync` returns `DateTime` with `Kind=Unspecified`
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/AuditLogRepository.cs:378-387` |
|
||||
|
||||
**Description**
|
||||
|
||||
`GetPartitionBoundariesOlderThanAsync` reads `reader.GetDateTime(0)` and adds the
|
||||
raw value to the returned list. SQL Server's `datetime2` materialises as
|
||||
`DateTimeKind.Unspecified` on the ADO.NET side (see CD-018), so every returned
|
||||
boundary has `Kind=Unspecified`. The sibling `AuditLogPartitionMaintenance.GetMaxBoundaryAsync`
|
||||
(`AuditLogPartitionMaintenance.cs:103-104`) explicitly defends against this exact
|
||||
issue by calling `DateTime.SpecifyKind(dt, DateTimeKind.Utc)` — exactly because EF /
|
||||
ADO.NET strips the kind — but the repository method does not. Callers (the
|
||||
`AuditLogPurgeActor`) that compare a returned boundary to `DateTime.UtcNow` get a
|
||||
silently wrong comparison if they ever serialise to/from a string with a local-kind
|
||||
assumption in between.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Wrap the read with `DateTime.SpecifyKind(reader.GetDateTime(0), DateTimeKind.Utc)`,
|
||||
matching the explicit defensive pattern already in
|
||||
`AuditLogPartitionMaintenance.GetMaxBoundaryAsync`. Better still: fix CD-018 (a value
|
||||
converter on the column) so the defence at the read site is no longer required.
|
||||
|
||||
### ConfigurationDatabase-021 — `SwitchOutPartitionAsync` interpolates `monthBoundary` / staging table name into raw SQL
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/AuditLogRepository.cs:192-338` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SwitchOutPartitionAsync` builds two large SQL batches via interpolated strings
|
||||
(`sampleSql` and `sql`) that include `{monthBoundaryStr}` and `{stagingTableName}`
|
||||
directly in the SQL text, and executes them via `ExecuteSqlRawAsync` /
|
||||
`cmd.ExecuteScalarAsync`. Both values are constructed inside the method —
|
||||
`monthBoundaryStr = monthBoundary.ToUniversalTime().ToString("yyyy-MM-dd HH:mm:ss")`
|
||||
and `stagingTableName = $"AuditLog_Staging_{Guid.NewGuid():N}"` — and the formats are
|
||||
fully controlled. SQL injection is therefore not possible as the code stands.
|
||||
|
||||
Two related concerns:
|
||||
|
||||
1. The format string `"yyyy-MM-dd HH:mm:ss"` truncates fractional seconds. The
|
||||
partition function is seeded at `T00:00:00` exactly, so truncation happens to
|
||||
produce the right boundary value today. A future change that adds a sub-second
|
||||
boundary (or invokes `SwitchOutPartitionAsync` with a non-midnight value) would
|
||||
silently round to the wrong partition with no error — and SWITCH PARTITION would
|
||||
either fail loudly or succeed against the wrong month. Use
|
||||
`"yyyy-MM-dd HH:mm:ss.fffffff"` to match the precision the migration seeds at,
|
||||
and the rounding ambiguity disappears.
|
||||
2. The pattern of "build a multi-statement DDL batch by string concatenation" is
|
||||
robust today only by inspection. A code review tripwire — the CLAUDE.md note "the
|
||||
data-access layer must not concatenate SQL" — would catch the pattern earlier;
|
||||
converting the batch to a parameterised `sp_executesql` invocation (the inner
|
||||
`EXEC sp_executesql @sql` already exists for the SWITCH itself) is the textbook
|
||||
safe form even when the input is internally controlled.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
(1) Switch `monthBoundaryStr`'s format to `"yyyy-MM-dd HH:mm:ss.fffffff"`. (2)
|
||||
Optionally migrate the two batches to fully parameterised `sp_executesql` form so
|
||||
the `monthBoundary` value flows as a typed `@boundary datetime2(7)` parameter
|
||||
rather than as interpolated text — the only piece that genuinely *cannot* be
|
||||
parameterised is the staging table identifier (DDL identifiers are not parameterisable
|
||||
in T-SQL), but a server-side `QUOTENAME(@stagingTable)` wrapper covers it. Add a
|
||||
regression test that supplies a non-midnight `monthBoundary` value and asserts the
|
||||
boundary lookup resolves to the expected partition.
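
The truncation in concern 1 is easy to see in isolation. In this minimal sketch the half-second boundary is a made-up value for illustration, and `Format` is a hypothetical helper that pins the invariant culture so the output does not vary by host:

```csharp
using System;
using System.Globalization;

public static class BoundaryFormatDemo
{
    public static string Format(DateTime boundary, string format) =>
        boundary.ToString(format, CultureInfo.InvariantCulture);

    public static void Main()
    {
        // A hypothetical boundary half a second past midnight (1 tick = 100 ns).
        var boundary = new DateTime(2026, 6, 1, 0, 0, 0, DateTimeKind.Utc).AddTicks(5_000_000);

        Console.WriteLine(Format(boundary, "yyyy-MM-dd HH:mm:ss"));
        // prints "2026-06-01 00:00:00": the fractional part is silently dropped

        Console.WriteLine(Format(boundary, "yyyy-MM-dd HH:mm:ss.fffffff"));
        // prints "2026-06-01 00:00:00.5000000": full datetime2(7) precision
    }
}
```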
|
||||
|
||||
### ConfigurationDatabase-022 — Stale "WP-24 Stub level sufficient for diff/staleness support" XML comment on `DeploymentManagerRepository`
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:8-14` |
|
||||
|
||||
**Description**
|
||||
|
||||
The class-level XML doc on `DeploymentManagerRepository` reads "WP-24: Stub level
|
||||
sufficient for diff/staleness support." WP-24 (Deployment Manager work-package) shipped
|
||||
long ago; the repository now covers full `DeploymentRecord` CRUD,
|
||||
`SystemArtifactDeploymentRecord` CRUD, `DeployedConfigSnapshot` CRUD, and an
|
||||
`Instance` deletion path with explicit Restrict-FK cleanup
|
||||
(`DeleteInstanceAsync` at lines 210-229). The comment misleads a reader into
|
||||
thinking the repository is incomplete and tempts them not to investigate further
|
||||
before adding new behaviour. The same module-context observation was noted but
|
||||
not raised in the prior review.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Remove the WP-24 line and rewrite the class doc to describe what the repository
|
||||
actually does today: EF Core implementation of `IDeploymentManagerRepository`
|
||||
covering deployment records, system-artifact deployment records, deployed config
|
||||
snapshots, and the Restrict-FK-aware `DeleteInstanceAsync` for the
|
||||
deployment pipeline. Cross-reference the optimistic-concurrency contract on
|
||||
`DeploymentRecord.RowVersion`.
|
||||
|
||||
### ConfigurationDatabase-023 — `AuditLog` correlation-index name drifts from design doc (`IX_AuditLog_CorrelationId` vs `IX_AuditLog_Correlation`)
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/AuditLogEntityTypeConfiguration.cs:99-101`, `Migrations/20260520142214_AddAuditLogTable.cs:103-107` |
|
||||
|
||||
**Description**
|
||||
|
||||
The Component-ConfigurationDatabase design doc lists the AuditLog indexes by name —
|
||||
including `IX_AuditLog_Correlation (CorrelationId)` for the "drilldown from a single
|
||||
operation" use case. The implemented index name is `IX_AuditLog_CorrelationId` (the
|
||||
fluent-config `HasDatabaseName` call and the matching DDL in the migration both use
|
||||
the `Id`-suffixed form). The names are syntactically valid SQL Server index names and
|
||||
the index does the right work; the drift is cosmetic but it breaks scripted
|
||||
maintenance ops that grep for the documented name (e.g. a runbook reindex script).
|
||||
|
||||
The other four documented index names (`IX_AuditLog_OccurredAtUtc`,
|
||||
`IX_AuditLog_Site_Occurred`, `IX_AuditLog_Channel_Status_Occurred`,
|
||||
`IX_AuditLog_Target_Occurred`, plus the post-design additions
|
||||
`IX_AuditLog_Execution`, `IX_AuditLog_ParentExecution`, `IX_AuditLog_Node_Occurred`)
|
||||
agree with the code.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Pick one direction. Updating the design doc to match the code is cheap (one word) and
|
||||
preserves the existing migration; renaming the index in the database requires a new
|
||||
migration that does `sp_rename`. Document-aligning is the lower-cost option and
|
||||
matches the resolution pattern used for CD-005.
|
||||
|
||||
### ConfigurationDatabase-024 — Missing test coverage for SPLIT-RANGE failure-continuation and production-shape rowversion delete
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.ConfigurationDatabase.Tests/Maintenance/AuditLogPartitionMaintenanceTests.cs`, `tests/.../RepositoryCoverageTests.cs:855-869` |
|
||||
|
||||
**Description**
|
||||
|
||||
`AuditLogPartitionMaintenanceTests` exercises the happy-path SPLIT-RANGE behaviour
|
||||
(no-op, single-month, three-month, already-exists idempotency) but never simulates a
|
||||
SPLIT *failure* — so the catch-and-continue behaviour flagged in CD-019 is
|
||||
behaviourally untested. The class is a central singleton driving daily audit purge;
|
||||
a regression that turned the failure path into a permanent hole would not surface in
|
||||
the test suite.
|
||||
|
||||
Separately, `RepositoryCoverageTests.DeleteDeploymentRecord_ViaStubAttachPath_RemovesEntity`
|
||||
covers the stub-attach delete path under the SQLite test fixture, but the fixture
|
||||
remaps `RowVersion` as a nullable concurrency token (`SqliteTestHelper`), so it does
|
||||
not exercise the production-shape `IsRowVersion()` auto-population — the actual
|
||||
concurrency-token bug flagged in CD-017 cannot show up. There is an
|
||||
`MsSqlMigrationFixture` in the test project already (used by the Audit Log migration
|
||||
tests); the stub-attach delete deserves a parallel MS-SQL-flavoured test.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
(1) Add an `AuditLogPartitionMaintenanceTests` case that constructs a context against
|
||||
a constrained login (no `ALTER ON SCHEMA::dbo`), invokes `EnsureLookaheadAsync` for a
|
||||
three-month gap, and asserts: only the partition boundaries created BEFORE the
|
||||
permissions failure landed remain, and the call aborts cleanly without continuing to
|
||||
later months. This pins down the resolution of CD-019. (2) Add a
|
||||
`RepositoryCoverageTests` case that uses `MsSqlMigrationFixture` to insert a
|
||||
`DeploymentRecord`, clear the change tracker, call `DeleteDeploymentRecordAsync`,
|
||||
and assert the row is gone, pinning the resolution of CD-017. Both tests should be
|
||||
`[SkippableFact]` so the suite still passes when no MS SQL Server is available.
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.DataConnectionLayer` |
|
||||
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 5 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -30,6 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
|
||||
heuristic. Test coverage is adequate for the happy paths and failover but absent for
|
||||
tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
The 2026-05-28 re-review walked all 10 checklist categories against the current
|
||||
source and found **5 new findings**. All 17 prior findings remain `Resolved` and the
|
||||
fixes (reverse-index unsubscribe, atomic disconnect guards, real-logger threading,
|
||||
initial-connect failover, per-tag write-batch results, subscribe-response accuracy)
|
||||
were verified in place. The new findings cluster around `HandleSubscribe` /
|
||||
`HandleSubscribeCompleted` race-induced state drift:
|
||||
|
||||
- **High** — concurrent subscribes for the same tag from different instances each see
|
||||
the tag as not-yet-subscribed (the `alreadySubscribed` snapshot was taken before
|
||||
the Task.Run dispatch), so each Task.Run calls `_adapter.SubscribeAsync` and the
|
||||
later `HandleSubscribeCompleted` silently discards the second adapter subscription
|
||||
handle — the orphan never gets `UnsubscribeAsync`'d.
|
||||
- **Medium** — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>`
|
||||
mutated from thread-pool continuations of `SubscribeAsync` / `UnsubscribeAsync` /
|
||||
`DisconnectAsync` running in parallel — the same class of bug DCL-003 fixed in
|
||||
`RealOpcUaClient` but missed in the layer above.
|
||||
- **Medium** — `HandleSubscribeCompleted`'s success branch never checks
|
||||
`_unresolvedTags`, so a tag that previously failed resolution (incrementing
|
||||
`_totalSubscribed`) and is then successfully subscribed by a different instance gets
|
||||
`_totalSubscribed++` a second time, double-counting; meanwhile the unresolved entry
|
||||
lingers until the retry timer also resolves it, creating an orphaned monitored item.
|
||||
- **Medium** — when an instance is unsubscribed mid-flight,
|
||||
`HandleSubscribeCompleted` re-creates an empty `_subscriptionsByInstance[name]`
|
||||
entry and processes the late results, leaking `_tagSubscriberCount` /
|
||||
`_totalSubscribed` / `_resolvedTags` increments for an instance with no
|
||||
`_subscribers` entry to deliver values to.
|
||||
- **Medium** — `HandleSubscribeCompleted` calls `Timers.StartPeriodicTimer` on every
|
||||
completed subscribe with unresolved tags; in Akka.NET, `StartPeriodicTimer` with the
|
||||
same key cancels and replaces the existing timer, so a burst of subscribes arriving
|
||||
faster than `TagResolutionRetryInterval` (10 s default) keeps resetting the timer
|
||||
and the retry never actually fires.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
|
||||
@@ -50,7 +84,22 @@ so a mid-batch disconnect aborts the whole write batch (the same class of defect
|
||||
DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
|
||||
`DataConnectionLayer-014`.
|
||||
|
||||
## Checklist coverage
|
||||
## Checklist coverage (2026-05-28 re-review, commit `1eb6e97`)
|
||||
|
||||
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | Findings 020 (double-count `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance) and 021 (leaked `_subscriptionsByInstance` entry + counter increments when instance unsubscribes mid-flight). |
| 2 | Akka.NET conventions | x | Finding 022: `Timers.StartPeriodicTimer` reset on every `HandleSubscribeCompleted` for unresolved tags can stall the retry timer indefinitely under a subscribe burst. |
| 3 | Concurrency & thread safety | x | Finding 018: concurrent subscribes for the same tag from different instances each spawn an adapter subscription and the second handle is orphaned. Finding 019: `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from thread-pool continuations (same class of bug as DCL-003 one layer above). |
| 4 | Error handling & resilience | x | No new issues; DCL-004 / DCL-007 / DCL-015 / DCL-017 fixes verified in place. |
| 5 | Security | x | No new issues; DCL-012 / DCL-014 fixes verified. The Commons-side `OpcUaEndpointConfig.AutoAcceptUntrustedCerts = true` default surfaced in DCL-012 is still present but is out of this module's scope. |
| 6 | Performance & resource management | x | No new issues; DCL-008 reverse index verified. (Finding 018's orphaned adapter handle is logged under concurrency.) |
| 7 | Design-document adherence | x | No new issues. DCL-009's design-doc action (document unstable-disconnect failover trigger + configurable threshold) is still open at the doc level but out of this module's scope. |
| 8 | Code organization & conventions | x | No issues: POCOs in Commons, options class owned by component, factory + DI registration consistent. |
| 9 | Testing coverage | x | DCL001–017 regression tests present. Gaps remain for finding 018 (concurrent subscribe of same tag from two instances), 019 (concurrent `_subscriptionHandles` mutation), 020 (resolve-via-different-instance), 021 (unsubscribe-mid-flight), 022 (timer-reset starvation). |
| 10 | Documentation & comments | x | No new issues; DCL-013 atomic-guard XML comments verified. |
|
||||
|
||||
## Checklist coverage (2026-05-17 re-review, commit `39d737e`)
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
@@ -896,3 +945,268 @@ unhandled exception. Regression test
|
||||
`DCL017_WriteBatch_ReturnsPerTagResults_WhenConnectionDropsMidBatch` fails against the
|
||||
pre-fix code (the batch throws, no map returned) and passes after;
|
||||
`DCL017_WriteBatch_CancellationAbortsWholeBatch` guards that cancellation still aborts.
|
||||
|
||||
### DataConnectionLayer-018 — Concurrent subscribes for the same tag from different instances orphan an adapter subscription handle
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:557,564-594,653` |
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleSubscribe` snapshots `_subscriptionIds.Keys` into a local `alreadySubscribed`
|
||||
set on the actor thread before dispatching the `Task.Run` that performs the adapter
|
||||
I/O (line 557). The snapshot is the only basis on which the background task decides
|
||||
whether to call `_adapter.SubscribeAsync` — and it is taken **once**, before the I/O
|
||||
runs.
|
||||
|
||||
If two `SubscribeTagsRequest` messages arrive on the actor thread for different
|
||||
instances that both reference the same tag path, both `HandleSubscribe` invocations
|
||||
take a snapshot at a time when neither subscribe has completed, so `alreadySubscribed`
|
||||
does not contain the shared tag in either snapshot. Both background tasks then call
|
||||
`_adapter.SubscribeAsync(tagPath, ...)`, the adapter creates **two** monitored items
|
||||
and returns two distinct subscription ids, and each task pipes a `SubscribeCompleted`
|
||||
back to the actor with `AlreadySubscribed: false, Success: true`.
|
||||
|
||||
`HandleSubscribeCompleted` for the first message takes the success branch and writes
|
||||
`_subscriptionIds[tagPath] = subId1`. The second message arrives, hits the
|
||||
"already in `_subscriptionIds`" guard at line 653 (`_subscriptionIds.ContainsKey(...)`)
|
||||
and `continue`s — but `result.SubscriptionId` (the orphan handle for the second
|
||||
adapter subscription) is silently discarded. The orphan monitored item stays alive in
|
||||
the OPC UA session for the lifetime of the adapter, sending duplicate data-change
|
||||
notifications (whose callbacks were stamped with the captured `generation`) into
|
||||
`HandleTagValueReceived` for every value change. Across a deploy that creates many
|
||||
instances sharing a few tags, this leaks N-1 monitored items per shared tag and
|
||||
doubles/triples the per-tag publish traffic.
|
||||
|
||||
DCL-010 fixed an analogous duplicate-dispatch bug for the tag-resolution retry path
|
||||
via `_resolutionInFlight`; the equivalent guard is missing on the user-subscribe
|
||||
path.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Track in-flight subscribes the same way DCL-010 tracks in-flight retries: maintain a
|
||||
`HashSet<string> _subscribesInFlight` and add `tagPath` to it on the actor thread
|
||||
**before** the `Task.Run` dispatch, only for tags not already in
|
||||
`_subscriptionIds` and not already in `_subscribesInFlight`. Tags that are already
|
||||
in flight should produce a `SubscribeTagResult(..., AlreadySubscribed: true, ...)`
|
||||
without touching the adapter. Remove from `_subscribesInFlight` in
|
||||
`HandleSubscribeCompleted` once the result is applied. Add a regression test that
|
||||
fans two simultaneous `SubscribeTagsRequest` messages for the same tag and asserts
|
||||
exactly one `_adapter.SubscribeAsync(tag, ...)` call (and no orphan subscription id).
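
The recommended guard can be sketched as a small piece of actor-thread bookkeeping. This is a simplified, hypothetical model: `SubscribeGate`, `BeginSubscribe`, and `Complete` are illustrative names, and the real `DataConnectionActor` fields and message types are not reproduced. It relies on the same assumption as the recommendation, namely that both methods only ever run on the actor thread, so plain collections are sufficient.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public sealed class SubscribeGate
{
    private readonly Dictionary<string, string> _subscriptionIds = new();
    private readonly HashSet<string> _subscribesInFlight = new();

    // Called on the actor thread before dispatching adapter I/O.
    // Returns only the tags that still need an adapter subscribe.
    public List<string> BeginSubscribe(IEnumerable<string> tagPaths)
    {
        var toDispatch = new List<string>();
        foreach (var tag in tagPaths)
        {
            if (_subscriptionIds.ContainsKey(tag)) continue; // already subscribed
            if (!_subscribesInFlight.Add(tag)) continue;      // another request owns it
            toDispatch.Add(tag);
        }
        return toDispatch;
    }

    // Called on the actor thread when the adapter result is piped back.
    // A null id models a failed subscribe.
    public void Complete(string tagPath, string? subscriptionId)
    {
        _subscribesInFlight.Remove(tagPath);
        if (subscriptionId is not null) _subscriptionIds[tagPath] = subscriptionId;
    }
}

public static class SubscribeGateDemo
{
    public static void Main()
    {
        var gate = new SubscribeGate();
        var first = gate.BeginSubscribe(new[] { "Tag1" });  // instance A
        var second = gate.BeginSubscribe(new[] { "Tag1" }); // instance B, before A completes
        Console.WriteLine($"{first.Count} {second.Count}"); // prints "1 0"
    }
}
```

Because the in-flight set is updated before any `Task.Run` dispatch, the second request never reaches the adapter, so there is no second subscription id to orphan.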
|
||||
|
||||
### DataConnectionLayer-019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from concurrent thread-pool continuations
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:31,167,177`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:163-164` |
|
||||
|
||||
**Description**
|
||||
|
||||
`OpcUaDataConnection._subscriptionHandles` is declared as `Dictionary<string,
|
||||
string>`. It is mutated from:
|
||||
|
||||
- `SubscribeAsync` (line 167): `_subscriptionHandles[subscriptionId] = tagPath;`
|
||||
after an `await _client!.CreateSubscriptionAsync(...)` — i.e. the assignment
|
||||
executes on the continuation thread (a thread-pool thread).
|
||||
- `UnsubscribeAsync` (line 177): `_subscriptionHandles.Remove(subscriptionId);`
|
||||
similarly after an `await`.
|
||||
- `DisconnectAsync`, which goes through the underlying `_client.DisconnectAsync`,
  does **not** touch `_subscriptionHandles`; the exposure comes from multiple
  `SubscribeAsync` / `UnsubscribeAsync` calls running in parallel from the upper layer.
|
||||
|
||||
The DCL upper layer calls `_adapter.SubscribeAsync` from multiple places that all
|
||||
run off the actor thread:
|
||||
|
||||
- `DataConnectionActor.HandleSubscribe` inside its `Task.Run` (multiple invocations
|
||||
can run in parallel — see DCL-018);
|
||||
- `HandleRetryTagResolution` issues `_adapter.SubscribeAsync` for every tag in
|
||||
`_unresolvedTags` and pipes the continuation (each subscribe runs concurrently
|
||||
via the SDK's async machinery);
|
||||
- `ReSubscribeAll` does the same after a reconnect.
|
||||
|
||||
So plain-`Dictionary` mutations occur on multiple thread-pool threads concurrently —
|
||||
the exact pattern DCL-003 fixed by switching `RealOpcUaClient._monitoredItems` and
|
||||
`_callbacks` to `ConcurrentDictionary<,>`. Plain `Dictionary` mutations during a
|
||||
concurrent resize are undefined behaviour: they can throw
|
||||
`InvalidOperationException`, corrupt the internal hash buckets, or lose entries.
|
||||
|
||||
`_subscriptionHandles` is currently dead state (the dictionary is written to
|
||||
and `Remove`d but **never read**), so a corruption today would not crash the
|
||||
subscribe path — but the bug is latent and the field will become load-bearing the
|
||||
moment any code reads it (e.g., to expose a subscription-id-to-tag-path lookup for
|
||||
diagnostics, which is what the dictionary's name suggests it was intended for).
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) change `_subscriptionHandles` to
|
||||
`ConcurrentDictionary<string, string>` and use `TryAdd` / `TryRemove`, mirroring
|
||||
DCL-003's fix one layer down, or (b) delete the field entirely since it is never
|
||||
read — the bookkeeping is fully owned by `RealOpcUaClient._monitoredItems` /
|
||||
`_callbacks` and `DataConnectionActor._subscriptionIds`. Removing it eliminates the
|
||||
race and removes dead state in one stroke. Add a regression test (or extend
|
||||
`DCL003_SharedDictionaryFields_AreConcurrentCollections`) that asserts no
|
||||
non-concurrent `Dictionary` field is shared across thread boundaries in adapter
|
||||
state.
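
Option (a) amounts to the following shape. This is a standalone sketch rather than the adapter's code: `HandleMapDemo` and the `sub-N` / `tag-N` keys are made up, and `Parallel.For` stands in for thread-pool continuations that run after an `await`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class HandleMapDemo
{
    public static int AddAndRemoveConcurrently(int count)
    {
        var handles = new ConcurrentDictionary<string, string>();

        // Parallel writers: safe on ConcurrentDictionary, unsafe on Dictionary.
        Parallel.For(0, count, i =>
        {
            handles.TryAdd($"sub-{i}", $"tag-{i}");
        });

        // Remove the even-numbered half, again in parallel.
        Parallel.For(0, count, i =>
        {
            if (i % 2 == 0) handles.TryRemove($"sub-{i}", out _);
        });

        return handles.Count;
    }

    public static void Main() =>
        Console.WriteLine(AddAndRemoveConcurrently(10_000)); // prints 5000
}
```

Option (b) needs no code at all, which is part of its appeal while the field remains unread.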
|
||||
|
||||
### DataConnectionLayer-020 — `HandleSubscribeCompleted` double-counts `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance's subscribe
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:653-661,670-688` |
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleSubscribeCompleted`'s success branch (lines 656-661) writes
|
||||
`_subscriptionIds[result.TagPath] = result.SubscriptionId!; _totalSubscribed++;
|
||||
_resolvedTags++;`. The guard at line 653 only skips when the tag is already in
|
||||
`_subscriptionIds`; it does **not** check `_unresolvedTags`. So the success branch
|
||||
runs for a tag that previously failed resolution from an earlier instance's subscribe
|
||||
(which incremented `_totalSubscribed` and added the tag to `_unresolvedTags` at lines
|
||||
674-676) and is now successfully subscribed by a later instance.
|
||||
|
||||
Sequence:
|
||||
|
||||
1. Instance A subscribes "Tag1". `_adapter.SubscribeAsync` throws a non-connection-level
|
||||
exception. `HandleSubscribeCompleted` takes the resolution-failure branch:
|
||||
`_unresolvedTags.Add("Tag1"); _totalSubscribed++;` (now 1).
|
||||
2. The device finishes booting. Instance B subscribes "Tag1". `_adapter.SubscribeAsync`
|
||||
succeeds, returning `subId`. `HandleSubscribeCompleted` takes the success branch:
|
||||
`_subscriptionIds["Tag1"] = subId; _totalSubscribed++; _resolvedTags++;`
|
||||
(now `_totalSubscribed = 2`, `_resolvedTags = 1`).
|
||||
3. `_unresolvedTags` still contains "Tag1". The retry timer fires next tick,
|
||||
`HandleRetryTagResolution` dispatches `SubscribeAsync("Tag1", ...)` against the
|
||||
adapter (creating a **second** monitored item for the same tag), and
|
||||
`HandleTagResolutionSucceeded` runs `_unresolvedTags.Remove("Tag1")` →
|
||||
`_subscriptionIds["Tag1"] = newSubId` (overwriting Instance B's id, orphaning that
|
||||
monitored item) → `_resolvedTags++` (now 2, matching `_totalSubscribed`).
|
||||
|
||||
Net effect:
|
||||
|
||||
- `_totalSubscribed` is over-counted by 1 from step 2 until step 3 reconciles
|
||||
`_resolvedTags`. During that window the health report's "subscribed / resolved"
|
||||
ratio is wrong.
|
||||
- Two adapter subscription handles for the same tag are leaked across this race
|
||||
(DCL-018's orphan plus the retry's second adapter call); the second leaks
|
||||
permanently because `_subscriptionIds["Tag1"]` only stores the most recent id.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
In `HandleSubscribeCompleted`'s success branch, before the `_totalSubscribed++`,
|
||||
check `_unresolvedTags.Remove(result.TagPath)` — if the tag was already counted as
|
||||
unresolved, promote it without re-incrementing `_totalSubscribed` (mirror
|
||||
`HandleTagResolutionSucceeded`'s shape: only increment `_resolvedTags`,
|
||||
`_subscriptionIds[tag] = subId`, and clear `_resolutionInFlight`). Add a regression
|
||||
test that asserts `_totalSubscribed` / `_resolvedTags` consistency after the
|
||||
"resolve via a second instance" sequence.
|
||||
|
||||
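The promotion check recommended above can be sketched in isolation. This is a minimal model, assuming a simplified bookkeeping class: `TagBookkeeping` and its method names are hypothetical stand-ins, not the actor's real members, and only the counter logic is modelled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the actor's subscription bookkeeping (not the real class).
public sealed class TagBookkeeping
{
    private readonly Dictionary<string, string> _subscriptionIds = new();
    private readonly HashSet<string> _unresolvedTags = new();

    public int TotalSubscribed { get; private set; }
    public int ResolvedTags { get; private set; }

    // Resolution-failure branch: count the tag once and park it for retry.
    public void OnResolutionFailed(string tagPath)
    {
        if (_subscriptionIds.ContainsKey(tagPath)) return;
        if (_unresolvedTags.Add(tagPath)) TotalSubscribed++;
    }

    // Success branch with the proposed promotion check.
    public void OnSubscribeSucceeded(string tagPath, string subscriptionId)
    {
        if (_subscriptionIds.ContainsKey(tagPath)) return; // existing guard

        // Remove returns true when the tag was already counted as unresolved,
        // so only a genuinely new tag increments TotalSubscribed.
        bool wasUnresolved = _unresolvedTags.Remove(tagPath);
        if (!wasUnresolved) TotalSubscribed++;

        _subscriptionIds[tagPath] = subscriptionId;
        ResolvedTags++;
    }

    public bool IsAwaitingRetry(string tagPath) => _unresolvedTags.Contains(tagPath);
}
```

With this shape, a failure followed by a success for the same tag leaves both counters at 1 and removes the tag from the retry set, so the retry timer has nothing left to re-subscribe.
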
### DataConnectionLayer-021 — `HandleSubscribeCompleted` re-creates and leaks `_subscriptionsByInstance` entry when the instance unsubscribed mid-flight
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:626-634,642-687` |
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleSubscribe` dispatches a `Task.Run` that performs adapter I/O off the actor
|
||||
thread and pipes a `SubscribeCompleted` back. If an `UnsubscribeTagsRequest` for the
|
||||
same instance is processed on the actor thread between dispatch and completion,
|
||||
`HandleUnsubscribe` removes the instance from both `_subscriptionsByInstance` and
|
||||
`_subscribers`. When the late `SubscribeCompleted` arrives,
|
||||
`HandleSubscribeCompleted` (line 629-634) **re-creates** the
|
||||
`_subscriptionsByInstance[instanceName] = new HashSet<string>()` entry and proceeds
|
||||
to apply the results — but `_subscribers[instanceName]` was already removed by the
|
||||
unsubscribe and is **not** re-added.
|
||||
|
||||
Consequences:
|
||||
|
||||
1. `_subscriptionsByInstance` keeps a permanently-leaked entry for an instance that
|
||||
no longer exists. `ReSubscribeAll` derives its tag list from
|
||||
`_subscriptionsByInstance.Values` and will keep re-subscribing the leaked tags on
|
||||
every future reconnect.
|
||||
2. For each tag, `_tagSubscriberCount[tagPath]` is incremented (line 647-649), so the
|
||||
reverse index treats the leaked instance as a real subscriber. The only way to
|
||||
drop the count is another `HandleUnsubscribe` for the same instance — which can
|
||||
never arrive because the Instance Actor that owned the instance is gone.
|
||||
3. The success branch increments `_totalSubscribed` / `_resolvedTags` (or
|
||||
`_unresolvedTags` for genuine resolution failures), drifting health counters
|
||||
permanently above the actual subscribed instance count.
|
||||
4. Subsequent `HandleTagValueReceived` fanout iterates `_subscriptionsByInstance` and
|
||||
skips this entry via the `_subscribers.TryGetValue` check (line 1019), so values
|
||||
are silently dropped — but the work of fanning them out (the iteration and the
|
||||
tag lookup) is still done for every value update on every leaked tag, forever.
|
||||
5. The genuine-resolution-failure path at line 682-686 (`subscriber.Tell(new
|
||||
TagValueUpdate(..., QualityCode.Bad, ...))`) also silently no-ops because
|
||||
`_subscribers.TryGetValue` is false — so the design doc's "push bad quality on
|
||||
resolution failure" promise is broken for this case (a minor, edge-case wrinkle).
|
||||
|
||||
**Recommendation**
|
||||
|
||||
In `HandleSubscribeCompleted`, when `_subscriptionsByInstance.TryGetValue` fails,
|
||||
treat the result as obsolete: log it and `return` without re-creating the entry or
|
||||
applying any state mutations. Any successfully-created adapter subscriptions in
|
||||
`msg.Results` should be cleaned up — iterate the results and
|
||||
`_adapter.UnsubscribeAsync(result.SubscriptionId!)` (fire-and-forget) for each
|
||||
successful one so the orphan handles do not leak in the adapter. Add a regression
|
||||
test that subscribes from instance A, immediately sends an `UnsubscribeTagsRequest`
|
||||
for A while the subscribe I/O is in flight, completes the subscribe, and asserts
|
||||
`_subscriptionsByInstance`, `_tagSubscriberCount` and health counters are all clean.
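
A minimal sketch of the "treat the late result as obsolete" guard follows. It assumes a simplified state holder: `InstanceSubscriptions`, `SubscribeResult`, and the `releaseAdapterHandle` callback are hypothetical names for illustration. A real fix would also need to avoid releasing a handle that another live instance still relies on.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public sealed record SubscribeResult(string TagPath, string? SubscriptionId);

// Hypothetical stand-in for the actor's per-instance state (not the real class).
public sealed class InstanceSubscriptions
{
    private readonly Dictionary<string, HashSet<string>> _subscriptionsByInstance = new();
    private readonly Action<string> _releaseAdapterHandle; // hypothetical cleanup hook

    public InstanceSubscriptions(Action<string> releaseAdapterHandle) =>
        _releaseAdapterHandle = releaseAdapterHandle;

    public void Register(string instanceName) =>
        _subscriptionsByInstance.TryAdd(instanceName, new HashSet<string>());

    public void Unregister(string instanceName) =>
        _subscriptionsByInstance.Remove(instanceName);

    public bool IsTracked(string instanceName) =>
        _subscriptionsByInstance.ContainsKey(instanceName);

    // Returns false when the completion was obsolete and therefore discarded.
    public bool OnSubscribeCompleted(string instanceName, IReadOnlyList<SubscribeResult> results)
    {
        if (!_subscriptionsByInstance.TryGetValue(instanceName, out var tags))
        {
            // The instance unsubscribed while the I/O was in flight: do not
            // re-create its entry, just hand back any handles that were created.
            foreach (var result in results)
                if (result.SubscriptionId is not null)
                    _releaseAdapterHandle(result.SubscriptionId);
            return false;
        }

        foreach (var result in results)
            tags.Add(result.TagPath);
        return true;
    }
}
```

The key design point is that the guard never writes to `_subscriptionsByInstance` on the miss path, so an unsubscribe that wins the race stays final.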
|
||||
|
||||
### DataConnectionLayer-022 — `HandleSubscribeCompleted` and `HandleTagResolutionFailed` reset the tag-resolution retry timer on every call via `StartPeriodicTimer`, starving the retry under subscribe bursts
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:691-698,991-998` |
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleSubscribeCompleted` (line 691-698) and `HandleTagResolutionFailed` (line
|
||||
991-998) both call:
|
||||
|
||||
```
Timers.StartPeriodicTimer(
    "tag-resolution-retry",
    new RetryTagResolution(),
    _options.TagResolutionRetryInterval,
    _options.TagResolutionRetryInterval);
```
|
||||
|
||||
`Akka.Actor.ITimerScheduler.StartPeriodicTimer(key, ...)` cancels and replaces any
|
||||
existing timer registered under the same key. So every additional subscribe (or
|
||||
every additional tag-resolution failure) that produces unresolved tags **resets** the
|
||||
retry timer's countdown to the full interval — the timer never accumulates
|
||||
elapsed time across calls.
|
||||
|
||||
With the default `TagResolutionRetryInterval = 10s`, an instance-startup burst that
produces a new `SubscribeTagsRequest` with unresolved tags every 5s (a not-unusual
cadence during deployment fan-out) cancels the not-yet-fired retry every 5s, so the
"periodic" retry never fires until subscribes go quiet. On a site with many
instances deploying together this can delay tag resolution by tens of seconds,
leaving attributes at `Bad` quality longer than the documented retry interval
implies.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Start the periodic timer once, when the actor first transitions to having
|
||||
non-empty `_unresolvedTags`, and only re-start it after `Timers.Cancel(...)` has
|
||||
been called (e.g., when the actor enters `Reconnecting`). The cleanest pattern is to
|
||||
gate the start with `if (!Timers.IsTimerActive("tag-resolution-retry"))` before
|
||||
calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply the
|
||||
same gate at both call sites. Add a regression test that fires 5 subscribes with
|
||||
unresolved tags within one retry interval and asserts the retry fires at most one
|
||||
interval after the first failure (not after the fifth subscribe).
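
The starvation effect and the effect of the gate can be checked with a small timing model. This is a sketch only: `RetryTimerModel` is a hypothetical helper that imitates cancel-and-replace semantics with integer seconds; it does not use Akka.NET.

```csharp
using System;
using System.Collections.Generic;

public static class RetryTimerModel
{
    // Returns the time (in seconds) at which the retry first fires, given the
    // ascending times at which subscribes with unresolved tags arrive.
    // gated = false models an unconditional StartPeriodicTimer call per subscribe;
    // gated = true models starting the timer only when none is active.
    public static int FirstRetryFire(IEnumerable<int> subscribeTimes, int interval, bool gated)
    {
        int? dueAt = null; // null means no timer is registered yet
        foreach (int t in subscribeTimes)
        {
            // The pending timer fired before (or exactly when) this subscribe arrived.
            if (dueAt is int due && due <= t) return due;

            // Ungated: every subscribe cancels and replaces the timer.
            // Gated: only the first subscribe starts it.
            if (!gated || dueAt is null) dueAt = t + interval;
        }
        return dueAt ?? -1; // -1 when no subscribe ever started a timer
    }
}
```

For subscribes at 0, 5, 10, 15, and 20 seconds with a 10-second interval, the ungated model first fires at 30 seconds (10 seconds after the last subscribe), while the gated model fires at 10 seconds.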
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.DeploymentManager` |
|
||||
| Design doc | `docs/requirements/Component-DeploymentManager.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 7 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -53,20 +53,52 @@ DeploymentManager-016). The `GetDeploymentStatusAsync` XML doc is now stale —
|
||||
it still describes the query-before-redeploy behaviour that actually moved into
|
||||
`TryReconcileWithSiteAsync` (DeploymentManager-017).
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
Re-reviewed at commit `1eb6e97` after the DeploymentManager-015/016/017 fixes
|
||||
and a docs-only XML-comment pass. The three prior findings remain `Resolved`
|
||||
and verified — `ApplyPostSuccessSideEffectsAsync` is now invoked from both the
|
||||
normal success path and `TryReconcileWithSiteAsync`, the reconciled-success
|
||||
branch corrects `prior.RevisionHash` to the target, and `GetDeploymentStatusAsync`'s
|
||||
XML doc now describes the local-DB-read it actually performs and cross-refs the
|
||||
reconciliation helper. The DiffService wiring, options binding, ref-counted
|
||||
operation lock, broadened catch, non-cancellable cleanup, and TestKit-actor
|
||||
test seam are still in place. The 7 new findings here are not regressions in
|
||||
the DeploymentManager-015/016 fixes — they are issues uncovered by widening
|
||||
the lens to the lifecycle paths, reconciliation's interaction with
|
||||
intentional `Disabled` state, audit semantics, and operational concerns
|
||||
(per-site artifact-build cost, Pending→InProgress double-write).
|
||||
|
||||
The single notable correctness issue is DeploymentManager-018: the
|
||||
reconciliation shortcut unconditionally sets `instance.State = Enabled` via
|
||||
`ApplyPostSuccessSideEffectsAsync`. After a central failover that loses the
|
||||
in-memory operation lock, a user can legitimately `Disable` an instance whose
|
||||
prior deploy record is still `InProgress`; a subsequent redeploy then reconciles
|
||||
and silently re-enables the instance against the user's explicit intent.
|
||||
The remaining six findings are medium/low: lifecycle-timeout audit gap
|
||||
(DeploymentManager-019), audit-user attribution in reconciliation
|
||||
(DeploymentManager-020), silent fallback in `ResolveSiteIdentifierAsync`
|
||||
(DeploymentManager-021), back-to-back `Pending`→`InProgress` writes
|
||||
(DeploymentManager-022), per-site re-query of system-wide artifacts
|
||||
(DeploymentManager-023), and shared static state across `*ProbeActor` tests
|
||||
(DeploymentManager-024).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | Re-review 2026-05-17: reconciliation skips instance-state/snapshot updates (DeploymentManager-015) and keeps a stale `RevisionHash` (DeploymentManager-016). Prior: stuck `InProgress` / cancelled-token write (resolved). |
|
||||
| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
|
||||
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counts and reclaims semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. No issues at re-review. |
|
||||
| 4 | Error handling & resilience | ✓ | Prior gaps DeploymentManager-001/002/003/004 resolved and verified. No new issues. |
|
||||
| 5 | Security | ✓ | SMTP credential handling documented as an accepted design decision (DeploymentManager-013). No injection vectors; no authz here (enforced upstream). No new issues. |
|
||||
| 6 | Performance & resource management | ✓ | Semaphore leak resolved (DeploymentManager-005). No new issues. |
|
||||
| 7 | Design-document adherence | ✓ | Query-before-redeploy and Diff View implemented (DeploymentManager-006/007). Re-review: reconciliation path breaks the deployed-snapshot/instance-state invariants — see DeploymentManager-015. |
|
||||
| 8 | Code organization & conventions | ✓ | Options binding resolved (DeploymentManager-008). POCO/repo placement correct. No new issues. |
|
||||
| 9 | Testing coverage | ✓ | Broad coverage added (success, lifecycle, lock serialization, reconciliation, artifact matrix). Re-review: reconciled-success path's missing side effects (DeploymentManager-015) are untested. |
|
||||
| 10 | Documentation & comments | ✓ | Prior comment findings resolved. Re-review: `GetDeploymentStatusAsync` XML doc is now stale — DeploymentManager-017. |
|
||||
| 1 | Correctness & logic bugs | ✓ | New: reconciliation forces `Enabled` even if the user disabled the instance in between (DeploymentManager-018). |
|
||||
| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. No issues. |
|
||||
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counting verified. Note: test probes hold static state (DeploymentManager-024) — a test concern, not production code. |
|
||||
| 4 | Error handling & resilience | ✓ | New: Disable/Enable/Delete timeouts return early without writing any audit entry — deploy has `DeployFailed`, lifecycle has nothing (DeploymentManager-019). |
|
||||
| 5 | Security | ✓ | No new issues. SMTP credential decision documented (DeploymentManager-013 closed). |
|
||||
| 6 | Performance & resource management | ✓ | New: `BuildDeployArtifactsCommandAsync` re-queries every system-wide artifact set per site in `DeployToAllSitesAsync` (DeploymentManager-023). |
|
||||
| 7 | Design-document adherence | ✓ | Reconciliation now performs post-success side effects (DeploymentManager-015 resolved). DeploymentManager-018 surfaces a new gap on `Disabled`-state preservation. |
|
||||
| 8 | Code organization & conventions | ✓ | New: redundant `Pending`→`InProgress` back-to-back write with no intervening work (DeploymentManager-022). Silent string-fallback in `ResolveSiteIdentifierAsync` (DeploymentManager-021). |
|
||||
| 9 | Testing coverage | ✓ | New: no coverage for the reconciliation-overwrites-Disabled case (part of DeploymentManager-018); test probes share static state across tests (DeploymentManager-024). |
|
||||
| 10 | Documentation & comments | ✓ | New: `DeployReconciled` audit uses `prior.DeployedBy` instead of the current `user` parameter — misleading for forensics (DeploymentManager-020). |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -873,3 +905,293 @@ database as a pure local read, and cross-references `TryReconcileWithSiteAsync`
|
||||
as where the query-the-site-before-redeploy reconciliation actually lives.
|
||||
Documentation-only change; no regression test (a test asserting comment text
|
||||
would be meaningless).
|
||||
|
||||
### DeploymentManager-018 — Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:675-682,721-748` |
|
||||
|
||||
**Description**
|
||||
|
||||
`TryReconcileWithSiteAsync` calls `ApplyPostSuccessSideEffectsAsync` whenever
|
||||
the site reports it has the target revision hash, and that helper
|
||||
unconditionally writes `instance.State = InstanceState.Enabled`. The
|
||||
reconciliation shortcut only runs when the prior `DeploymentRecord` is
|
||||
`InProgress` or timeout-`Failed` — exactly the scenarios that survive a central
|
||||
failover (the in-memory `OperationLockManager` is lost on failover, by design:
|
||||
*"Lost on central failover (acceptable per design — in-progress treated as
|
||||
failed)"*).
|
||||
|
||||
After such a failover, the per-instance operation lock is gone but the
|
||||
deployment record is still `InProgress` in the DB. A user can legitimately
|
||||
issue `DisableInstanceAsync` for the same instance — there is nothing in
|
||||
`DisableInstanceAsync` that consults the deployment record, only the
|
||||
`StateTransitionValidator` over `Instance.State`. If the state is `Enabled`
|
||||
(the typical case when the deploy started), the disable proceeds, the site
|
||||
honours it (the design states a disabled instance retains its deployed
|
||||
configuration), and central now persists `Instance.State = Disabled`. The
|
||||
deployment-record row remains `InProgress` (no one transitioned it). Later the
|
||||
user retries the deploy: `TryReconcileWithSiteAsync` runs, the site still has
|
||||
the target revision hash (Disable doesn't change the deployed config), the
|
||||
prior record is marked `Success`, and `ApplyPostSuccessSideEffectsAsync` writes
|
||||
`Instance.State = Enabled` — silently overriding the user's explicit Disable.
|
||||
|
||||
The same trap exists for any direct DB edit / migration that flipped the state
|
||||
between the timed-out deploy and the redeploy. The normal deploy path can
|
||||
defensibly assume `Enabled` after a fresh successful apply, but the
|
||||
reconciliation path is reconciling *prior* state with *prior* user intent; it
|
||||
should preserve `Disabled` if that is the current `Instance.State` at the time
|
||||
of reconciliation, mirroring the design's separation between deploy (config
|
||||
apply) and disable (subscription/script lifecycle).
|
||||
|
||||
**Recommendation**
|
||||
|
||||
In the reconciliation branch, do not force `Enabled`. Either:
|
||||
- Pass a flag/parameter to `ApplyPostSuccessSideEffectsAsync` telling it
|
||||
whether to touch state, and skip the state write on the reconciliation path
|
||||
(leaving the current `Instance.State` intact, which is already `Enabled`
|
||||
for a fresh deploy that timed out and `Disabled` for the user-disabled
|
||||
follow-up case); or
|
||||
- Only set `Enabled` when the current `Instance.State` is `NotDeployed` (i.e.
|
||||
the first-deploy timed-out case), and leave existing `Enabled`/`Disabled`
|
||||
alone.
|
||||
|
||||
Add a regression test where an instance with `Instance.State = Disabled` and a
|
||||
prior `InProgress` deployment record is reconciled — the resulting
|
||||
`Instance.State` must remain `Disabled`, and the deployment record must still
|
||||
be marked `Success`.
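
The second option above reduces to a small pure function. The sketch below assumes an `InstanceState` enum with the three members named in this finding; `ReconciliationStatePolicy` is a hypothetical helper, not an existing type in the module.

```csharp
// Assumed to mirror the module's enum; only the members named in this finding.
public enum InstanceState { NotDeployed, Enabled, Disabled }

public static class ReconciliationStatePolicy
{
    // State to persist after a reconciled success: promote a never-deployed
    // instance to Enabled, and leave an existing Enabled/Disabled untouched.
    public static InstanceState StateAfterReconciledSuccess(InstanceState current) =>
        current == InstanceState.NotDeployed ? InstanceState.Enabled : current;
}
```

Keeping the decision in one function makes the regression test trivial: a `Disabled` input must come back `Disabled`.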
|
||||
|
||||
### DeploymentManager-019 — Lifecycle command timeout writes no audit entry
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458` |
|
||||
|
||||
**Description**
|
||||
|
||||
`DisableInstanceAsync`, `EnableInstanceAsync`, and `DeleteInstanceAsync` each
|
||||
wrap the `CommunicationService` call in a linked CTS with
|
||||
`LifecycleCommandTimeout` (DeploymentManager-012). On timeout they log a
|
||||
warning and `return Result<...>.Failure(...)` — and skip the
|
||||
`_auditService.LogAsync` call entirely. As a result, an operator-initiated
|
||||
disable/enable/delete that times out at the site leaves **no audit trail**:
|
||||
the user, the timestamp, the command id, and the failure mode are not
|
||||
recorded in the audit log. The deploy path goes out of its way to write a
|
||||
`DeployFailed` audit entry on the same failure mode
|
||||
(`DeploymentService.cs:274-276`), with `CancellationToken.None` so the write is
|
||||
durable; the lifecycle commands do not.
|
||||
|
||||
The design lists audit logging as a Deployment Manager responsibility for "all
|
||||
deployment actions, system-wide artifact deployments, and instance lifecycle
|
||||
changes" — a timed-out lifecycle command **is** an attempted lifecycle change,
|
||||
and the operator action is exactly the kind of event the audit log exists to
|
||||
record.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
In each of the three `catch (Exception ex) when (ex is TimeoutException or
OperationCanceledException)` blocks, write an audit entry before returning:
either a dedicated `DisableTimeout`/`EnableTimeout`/`DeleteTimeout` action or the
existing operation name with a failure flag. Pass `CancellationToken.None` so a
cancelled outer token does not prevent the audit write, mirroring `DeployFailed`.
Add a unit test, alongside
`DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait`,
asserting that the same timeout scenario also produces an audit entry.
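
A minimal sketch of the pattern follows, assuming the site call and the audit write are passed in as delegates. `LifecycleTimeoutAudit`, its parameters, and the `action + "Timeout"` naming are hypothetical; the real code would call `_auditService.LogAsync` with its own arguments.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LifecycleTimeoutAudit
{
    // Runs a lifecycle command under a linked timeout and records an audit
    // entry when the command times out or is cancelled.
    public static async Task<bool> RunWithTimeoutAuditAsync(
        Func<CancellationToken, Task> sendCommand,
        Func<string, CancellationToken, Task> writeAudit,
        string action,
        TimeSpan timeout,
        CancellationToken outerToken)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(outerToken);
        cts.CancelAfter(timeout);
        try
        {
            await sendCommand(cts.Token);
            return true;
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            // CancellationToken.None: the audit write must not be skipped just
            // because the outer token is already cancelled.
            await writeAudit(action + "Timeout", CancellationToken.None);
            return false;
        }
    }
}
```

A test can pass a command that never completes and a short timeout, then assert that exactly one audit entry was recorded.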
|
||||
|
||||
### DeploymentManager-020 — `DeployReconciled` audit attributes the action to the prior deployer, not the current user
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:683-686` |
|
||||
|
||||
**Description**
|
||||
|
||||
In `TryReconcileWithSiteAsync` the audit call is:
|
||||
|
||||
```
await _auditService.LogAsync(prior.DeployedBy, "DeployReconciled", ...)
```
|
||||
|
||||
`prior.DeployedBy` is the user who issued the original (timed-out / stuck)
|
||||
deployment, not the `user` parameter passed into `DeployInstanceAsync`. The
|
||||
current user — the one who triggered the redeploy that produced the
|
||||
reconciliation — is dropped on the floor. For audit forensics this is
|
||||
misleading: the row will read "user A reconciled their own deployment"
|
||||
when in fact user B initiated the action that reconciled it.
|
||||
|
||||
The original deployer is interesting context, but it should be carried in the
|
||||
audit-detail object (where `DeploymentId` and `RevisionHash` already live), not
|
||||
substituted for the actor.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Use `user` (the parameter on `DeployInstanceAsync`, threaded through
|
||||
`TryReconcileWithSiteAsync`) as the audit actor, and include
|
||||
`OriginalDeployer = prior.DeployedBy` in the detail object so the original
|
||||
attribution is preserved without misrepresenting who took the action.
|
||||
|
||||
### DeploymentManager-021 — `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:107-111` |
|
||||
|
||||
**Description**
|
||||
|
||||
```
private async Task<string> ResolveSiteIdentifierAsync(int siteId, CancellationToken cancellationToken)
{
    var site = await _siteRepository.GetSiteByIdAsync(siteId, cancellationToken);
    return site?.SiteIdentifier ?? siteId.ToString();
}
```
|
||||
|
||||
If the `Site` row is missing (FK was deleted, race with admin delete, DB
|
||||
inconsistency), the method silently returns the numeric DB id rendered as a
|
||||
string. This is then passed to `CommunicationService.{Deploy,Disable,Enable,
|
||||
Delete}InstanceAsync` and `QueryDeploymentStateAsync` as if it were a real
|
||||
`SiteIdentifier` (e.g. "site-a"). The communication layer will fail with an
|
||||
"unknown site" or routing error, producing a confusing diagnostic that hides
|
||||
the actual problem (no site row).
|
||||
|
||||
This is a defensive concern, but every mutating operation in the module goes
|
||||
through this method, so a stale instance whose site was deleted will produce a
|
||||
misleading error every time it is touched.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Treat a missing site as a hard validation failure: return a
|
||||
`Result.Failure($"Site with ID {siteId} not found")` early from the calling
|
||||
operations, instead of fabricating an identifier. The repository already
|
||||
returns `Site?`, so the null path is type-visible; just don't paper over it.
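
One possible shape for the hard-failure variant is sketched below. The `Site` record, the minimal `Result<T>` type, and the delegate standing in for the repository call are all assumptions made to keep the example self-contained; the module's real types may differ.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical minimal shapes; the module's real Site and Result types may differ.
public sealed record Site(int Id, string SiteIdentifier);

public sealed record Result<T>(bool IsSuccess, T? Value, string? Error)
{
    public static Result<T> Success(T value) => new(true, value, null);
    public static Result<T> Failure(string error) => new(false, default, error);
}

public static class SiteResolution
{
    // Surfaces a missing site row as a failure instead of fabricating an identifier.
    public static async Task<Result<string>> ResolveSiteIdentifierAsync(
        Func<int, CancellationToken, Task<Site?>> getSiteById,
        int siteId,
        CancellationToken cancellationToken)
    {
        Site? site = await getSiteById(siteId, cancellationToken);
        return site is null
            ? Result<string>.Failure($"Site with ID {siteId} not found")
            : Result<string>.Success(site.SiteIdentifier);
    }
}
```

Callers can then return the failure directly, so the operator sees "site not found" rather than a routing error from the communication layer.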
|
||||
|
||||
### DeploymentManager-022 — `Pending` and `InProgress` are written back-to-back with no intervening work
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:178-194` |
|
||||
|
||||
**Description**
|
||||
|
||||
`DeployInstanceAsync` does:
|
||||
|
||||
```
record.Status = Pending;
AddDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
record.Status = InProgress;
UpdateDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
```
|
||||
|
||||
There is no work between the two writes — flattening, validation, and
|
||||
reconciliation have already completed by line 174. The deploy command is sent
|
||||
immediately after the `InProgress` write. The `Pending` write therefore costs:
|
||||
an extra `SaveChangesAsync` round-trip, an extra `IDeploymentStatusNotifier`
|
||||
invocation (which the CentralUI-006 page renders, so the user briefly sees a
|
||||
`Pending` flicker before `InProgress`), and an extra row-version bump if EF
|
||||
optimistic concurrency is enabled on the table.
|
||||
|
||||
The design uses `Pending` to mean "queued, not yet sent" and `InProgress` to
|
||||
mean "sent to site, awaiting response". The code's `Pending` slot has no
|
||||
queuing — it is set and immediately overwritten — so the state buys nothing
|
||||
operationally.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either:
|
||||
- Drop the `Pending` write entirely and create the record directly in
|
||||
`InProgress` (one row insert, one notification, simpler UI); or
|
||||
- Move the `Pending`→`InProgress` transition to bracket actual queueing/work
|
||||
(e.g. set `Pending` *before* flattening + reconciliation, set `InProgress`
|
||||
immediately before `DeployInstanceAsync` on the comm service) so the two
|
||||
states carry distinguishable semantics worth a separate write.
|
||||
|
||||
### DeploymentManager-023 — `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:82-144,169-173` |
|
||||
|
||||
**Description**
|
||||
|
||||
`DeployToAllSitesAsync` loops over sites and calls
|
||||
`BuildDeployArtifactsCommandAsync(site.Id, ...)` for each one. Of the six
|
||||
artifact sets the method gathers, **only** `dataConnections` is per-site:
|
||||
|
||||
- `_templateRepo.GetAllSharedScriptsAsync` — global.
|
||||
- `_externalSystemRepo.GetAllExternalSystemsAsync` — global, plus
|
||||
`GetMethodsByExternalSystemIdAsync` per external system per site.
|
||||
- `_externalSystemRepo.GetAllDatabaseConnectionsAsync` — global.
|
||||
- `_notificationRepo.GetAllNotificationListsAsync` — global.
|
||||
- `_notificationRepo.GetAllSmtpConfigurationsAsync` — global.
|
||||
- `_siteRepo.GetDataConnectionsBySiteIdAsync(siteId, ...)` — **per-site**.
|
||||
|
||||
With N sites this issues 5·N queries against the global sets where 5 would
suffice, plus M·N method queries where M would suffice (M is the
external-system count). For example, 20 sites and 10 external systems cost 300
queries instead of 15. On a hub-and-spoke deployment with many sites the
artifact-deploy path is noticeably slower than necessary and holds the DbContext
longer than needed. Per CLAUDE.md, the DbContext is not thread-safe, so building
the per-site commands sequentially is correct; the concern is only that each
redundant sequential query adds a database round-trip.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Hoist the global queries (shared scripts, external systems + their methods,
|
||||
DB connections, notification lists, SMTP configurations) out of
|
||||
`BuildDeployArtifactsCommandAsync`, fetch them once in `DeployToAllSitesAsync`,
|
||||
and pass them in alongside the site id (or expose a
|
||||
`BuildDeployArtifactsCommandAsync(siteId, prefetchedGlobals)` overload).
|
||||
`RetryForSiteAsync` (the single-site path) can keep the convenience-overload
|
||||
behaviour. Add a test using NSubstitute's `.Received()` to assert
|
||||
`_templateRepo.GetAllSharedScriptsAsync` is called exactly once for an
|
||||
N-site deployment.
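
The hoisting can be sketched as follows, assuming the repository calls are represented by delegates. `GlobalArtifacts`, `SiteCommand`, and `ArtifactFanOut` are hypothetical names, and the global set is reduced to shared scripts for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical shapes for illustration; the real command carries more artifact sets.
public sealed record GlobalArtifacts(IReadOnlyList<string> SharedScripts);
public sealed record SiteCommand(int SiteId, GlobalArtifacts Globals, IReadOnlyList<string> DataConnections);

public static class ArtifactFanOut
{
    public static async Task<List<SiteCommand>> BuildCommandsAsync(
        IReadOnlyList<int> siteIds,
        Func<Task<GlobalArtifacts>> loadGlobals,
        Func<int, Task<IReadOnlyList<string>>> loadDataConnections)
    {
        // Global sets are identical for every site, so fetch them once.
        GlobalArtifacts globals = await loadGlobals();

        var commands = new List<SiteCommand>(siteIds.Count);
        foreach (int siteId in siteIds) // still sequential: one DbContext, one query at a time
        {
            IReadOnlyList<string> connections = await loadDataConnections(siteId);
            commands.Add(new SiteCommand(siteId, globals, connections));
        }
        return commands;
    }
}
```

Only the per-site data-connection query remains inside the loop, so the query count grows with the number of sites by one per site rather than by the full artifact set.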
|
||||
|
||||
### DeploymentManager-024 — Test probe actors hold mutable static state across tests
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:966-1075`, `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:196-217` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ReconcileProbeActor.QueryCount` / `DeployCount`, `SerializationProbeActor.MaxConcurrent`
|
||||
/ `_current`, and `ArtifactProbeActor.Received` are all `static` fields.
|
||||
Each test's actor constructor resets them, but reset-on-construction only
works as long as no two tests that use the same probe type run concurrently.
xUnit runs the tests within a single class (one test collection) serially by
default, so today's tests pass. That safety is incidental: xUnit runs separate
test classes in parallel by default, so as soon as a second test class reuses
one of these probe actors, or a runner that parallelizes within a class is
adopted, the counters race: a deploy in test A could increment `DeployCount`
while test B is asserting on it.
|
||||
|
||||
Static state shared across tests is also why a flaky-test investigation here
|
||||
will be unusually painful: the offending interaction is invisible from any
|
||||
single test file.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Replace the static counters with instance state, hand the actor a probe
|
||||
recipient (an `IActorRef` to a TestKit probe), and assert via `ExpectMsg`
|
||||
in each test. Where the simpler counter shape is preferred, pass a
|
||||
shared-state object into the actor's constructor so each test owns its own
|
||||
instance — never reach for `static` mutable test state.
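
The "pass a shared-state object into the constructor" variant might look like the sketch below. `ProbeCounters` and `DeployProbe` are hypothetical names, and `DeployProbe` is a plain class standing in for the probe actor so the example needs no Akka.NET reference.

```csharp
using System.Threading;

// Per-test counter object: each test creates its own instance.
public sealed class ProbeCounters
{
    private int _deployCount;

    public int DeployCount => Volatile.Read(ref _deployCount);

    // Interlocked keeps the count correct even if the probe runs on another thread.
    public void RecordDeploy() => Interlocked.Increment(ref _deployCount);
}

// Stand-in for a probe actor: it owns no static state and only touches
// the counters handed to it by the test.
public sealed class DeployProbe
{
    private readonly ProbeCounters _counters;

    public DeployProbe(ProbeCounters counters) => _counters = counters;

    public void OnDeployCommand() => _counters.RecordDeploy();
}
```

Because each test constructs its own `ProbeCounters`, two tests running at the same time cannot observe each other's increments.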
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.ExternalSystemGateway` |
|
||||
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 6 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -51,6 +51,36 @@ both substantive findings are second-order defects in earlier fixes — the earl
|
||||
resolutions did not verify the downstream contract of the S&F engine they integrate
|
||||
with.
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
All seventeen prior findings (001–017) remain `Resolved`; spot-checks against the
|
||||
current source confirm the fixes still hold. Between `39d737e` and this re-review the
only source changes to the module are the documentation-only commit `1eb6e97` (XML
doc additions) and the `executionId` / `sourceScript` / `parentExecutionId` plumbing
threaded through `CachedCallAsync` / `CachedWriteAsync` to the S&F enqueue (Audit Log
#23 Tasks 4/6). The re-review walked the full 10-category checklist again and
surfaced **six new findings**, none Critical. The most serious
(`ExternalSystemGateway-018`, High) is that `DeliverBufferedAsync` on both
`ExternalSystemClient` and `DatabaseGateway` lets a `JsonException` from
`JsonSerializer.Deserialize` propagate out of the delivery handler — the S&F engine
treats any thrown exception as a transient retry, so a corrupted or
schema-incompatible buffered row becomes a permanent poison message that is retried
on every sweep forever (the same retry-forever class of hazard `-015` already
addressed for a different cause). `ExternalSystemGateway-019` (Medium) is that
`HttpClient.Timeout` is never set, so any operator-configured `DefaultHttpTimeout`
greater than 100s is silently clipped by `HttpClient`'s built-in 100s default and the
gateway's "timeout applies to the HTTP request round-trip" guarantee no longer
holds — a partial reopen of the `-002` contract for the long-timeout case.
`ExternalSystemGateway-020` (Medium) is a silent precision-loss bug in the cached-DB-write
retry path: `JsonElementToParameterValue` collapses any JSON number that is not
Int64-convertible to `double`, so a script's `decimal` SQL parameter is downcast on
retry and only on retry. The remaining three (`-021`/`-022`/`-023`, Low) are an
unauthenticated-by-default `ApplyAuth` for unknown `AuthType` / malformed Basic config,
runtime-only HTTP-verb validation, and an undocumented PATCH HTTP method (code vs
design-doc drift). Theme: every new finding is in a code path that was added or
touched by the earlier fix bundle but whose error-propagation contract was not
verified end-to-end against the S&F engine or the design doc.

## Checklist coverage

| # | Category | Examined | Notes |
@@ -66,6 +96,21 @@ with.
| 9 | Testing coverage | ☑ | Coverage is broad after finding 014. Re-review note: the `ZeroMaxRetries...` tests assert the persisted column, not the sweep outcome, and so lock in the finding-015 defect. |
| 10 | Documentation & comments | ☑ | Inline comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101` assert a "never retry" semantic that the code does not deliver — see finding 015. |

_Re-review (2026-05-28, `1eb6e97`):_

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `JsonException` not caught in either `DeliverBufferedAsync`, so a corrupt buffered payload becomes a permanent poison-message retried forever — finding 018. `JsonElementToParameterValue` collapses a non-Int64 number to `double`, silently losing precision for `decimal` SQL parameters on cached-write retry — finding 020. `new HttpMethod(method.HttpMethod)` accepts any string at runtime, so an invalid HTTP verb is only diagnosed at call time — finding 022. |
| 2 | Akka.NET conventions | ☑ | Still no actors in this module; `AddExternalSystemGatewayActors` remains a no-op. The cached-call lifecycle/audit emission lives in `ScriptRuntimeContext` / `CachedCallTelemetryForwarder` (SiteRuntime / AuditLog), not here, and that boundary is correct. No issues found. |
| 3 | Concurrency & thread safety | ☑ | Services are still stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. The added `executionId` / `sourceScript` / `parentExecutionId` plumbing flows through method arguments only — no shared state introduced. No findings. |
| 4 | Error handling & resilience | ☑ | The poison-payload retry-forever path is the headline resilience issue (finding 018). `HttpClient.Timeout` not being set leaves the gateway's per-call round-trip cap clipped to the framework's 100s default whenever the configured `DefaultHttpTimeout` is larger — finding 019 (partial reopen of the `-002` contract). |
| 5 | Security | ☑ | Auth secrets still never logged; error bodies still truncated. `ApplyAuth` is silent on unknown `AuthType` / empty `AuthConfiguration` / malformed Basic config — finding 021 (fail-open is a real but bounded risk; recorded Low because misconfiguration is the precondition). Connection-string handling in `DatabaseGateway` reads from the entity verbatim and never logs it. |
| 6 | Performance & resource management | ☑ | Disposal paths from findings 005/010 still hold. The `IHttpClientFactory` name-keyed-options registration (finding 016 fix) creates a fresh `SocketsHttpHandler` per primary-handler build — acceptable because `IHttpClientFactory` recycles handlers. No new findings. |
| 7 | Design-document adherence | ☑ | The design doc enumerates GET/POST/PUT/DELETE but the code also serializes a body for PATCH (and accepts arbitrary HTTP verbs at runtime) — finding 023 (drift to be reconciled). The per-call timeout guarantee is partially defeated by the unset `HttpClient.Timeout` for option values > 100s — finding 019. |
| 8 | Code organization & conventions | ☑ | The `-016` fix replaced `ConfigureHttpClientDefaults` with a scoped `IConfigureNamedOptions<HttpClientFactoryOptions>` — verified clean, no new conventions issue. `internal virtual CreateConnection` (DatabaseGateway) and `internal InvokeHttpAsync` (ExternalSystemClient) are exposed via `InternalsVisibleTo` for tests — acceptable. No new findings. |
| 9 | Testing coverage | ☑ | The `JsonException` deserialization path for `DeliverBufferedAsync` is untested; the `JsonElementToParameterValue` `double`-downcast path is untested; `ApplyAuth`'s unknown-AuthType / empty-config / malformed-Basic branches are untested. Recorded under findings 018 / 020 / 021 rather than a standalone coverage finding. |
| 10 | Documentation & comments | ☑ | XML doc additions in `1eb6e97` are accurate and consistent. PATCH support is undocumented in the design doc (finding 023). The inline `ExternalSystemGateway-015` block-comment in `CachedCallAsync` (lines 126–133) and the equivalent in `DatabaseGateway.cs:106–113` now correctly describe the "treat 0 as unset" semantics. |

## Findings

### ExternalSystemGateway-001 — No S&F delivery handler registered; cached calls and writes can never be delivered
@@ -951,3 +996,298 @@ method whose effective parameter set is empty produces a URL identical to the
no-parameters case. Regression test
`Call_GetWithAllNullParameters_DoesNotAppendTrailingQuestionMark` asserts the
captured request URI has no trailing `?`; it was verified to fail before the fix.

### ExternalSystemGateway-018 — `DeliverBufferedAsync` lets `JsonException` propagate, turning a corrupt buffered row into a permanent retry-forever poison message

| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:176`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:151` |

**Description**

Both `ExternalSystemClient.DeliverBufferedAsync` and `DatabaseGateway.DeliverBufferedAsync`
begin with an unguarded `JsonSerializer.Deserialize<...>(message.PayloadJson)`:

```csharp
var payload = JsonSerializer.Deserialize<CachedCallPayload>(message.PayloadJson);
if (payload == null || string.IsNullOrEmpty(payload.SystemName) || ...) {
    _logger.LogError("... unreadable payload; parking.");
    return false;
}
```

The "unreadable payload; parking" branch is only entered when `Deserialize` *succeeds*
and produces a null / partially-empty object. If `PayloadJson` is **malformed JSON** —
the column was truncated mid-write, an older payload schema is being deserialized into a
newer record, or storage corruption occurred — `Deserialize` throws `JsonException`
before that check is ever reached. The exception propagates out of the delivery handler.

The Store-and-Forward retry loop treats *any* thrown exception from a delivery handler
as a transient failure (only a returned `false` parks the message); see
`StoreAndForwardService.RetryMessageAsync`. Combined with the `MaxRetries == 0` →
"unset → bounded default" fix from `-015`, the resulting behaviour is:

1. Corrupt payload arrives in the buffer.
2. Every retry sweep deserializes, throws `JsonException`, increments `RetryCount`.
3. The message is retried until `RetryCount >= MaxRetries` and then parked, but *only* if
   `MaxRetries > 0` is configured (which `-015` already established is not the default
   site configuration today). With the bounded S&F default it does eventually park, but
   only after failing noisily for `DefaultMaxRetries` sweeps; without that bound it
   retries forever.
4. The script is unaware — the cached call was returned `WasBuffered: true` long ago.

This is the same "poison message buffered forever" class of hazard that
`ExternalSystemGateway-001` (no-handler) and `ExternalSystemGateway-015` (MaxRetries==0)
already removed for their own causes; corrupt JSON is an alternative arrival path into
the same bad state.

The `DatabaseGateway.DeliverBufferedAsync` path has the same shape and the same defect:
`JsonSerializer.Deserialize<CachedWritePayload>` at line 151 is not guarded.

**Recommendation**

Wrap the `Deserialize` call in a `try/catch (JsonException)` block in both
`DeliverBufferedAsync` methods. A `JsonException` is by definition a permanent failure —
re-running the same deserialization against the same payload will produce the same
exception — so the catch should log at `LogError` and **return `false`** so the S&F
engine parks the message rather than retrying. Add regression tests that feed a
malformed `PayloadJson` to each handler and assert `delivered == false` (i.e. the
message parks) and that no exception escapes the handler.
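
The guard recommended above could look roughly like the sketch below. It is an illustration under stated assumptions, not the gateway's actual code: `CachedCallPayload` here is a hypothetical two-field stand-in for the real record, `BufferedPayloadReader` / `TryReadPayload` are invented names, and logging is left to the caller.

```csharp
using System;
using System.Text.Json;

// Hypothetical stand-in for the buffered payload record; the real type has more fields.
public sealed record CachedCallPayload(string? SystemName, string? MethodName);

public static class BufferedPayloadReader
{
    // Returns false for any payload that can never become deliverable, so the
    // delivery handler can log and return false (park) instead of throwing (retry).
    public static bool TryReadPayload(
        string payloadJson, out CachedCallPayload? payload, out string? error)
    {
        payload = null;
        error = null;
        try
        {
            payload = JsonSerializer.Deserialize<CachedCallPayload>(payloadJson);
        }
        catch (JsonException ex)
        {
            // Malformed JSON is permanent: the same bytes fail the same way on every sweep.
            error = $"malformed JSON: {ex.Message}";
            return false;
        }

        if (payload is null
            || string.IsNullOrEmpty(payload.SystemName)
            || string.IsNullOrEmpty(payload.MethodName))
        {
            payload = null;
            error = "payload is null or missing required fields";
            return false;
        }

        return true;
    }
}
```

A handler built on this helper would return `false` whenever `TryReadPayload` does, which keeps the "unreadable payload; parking" behaviour identical for malformed and merely empty payloads.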

### ExternalSystemGateway-019 — `HttpClient.Timeout` is not set; `DefaultHttpTimeout` > 100s is silently clipped by the framework default

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:226,257-264`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:90-102` |

**Description**

The `-002` fix enforces the per-call timeout via a linked `CancellationTokenSource`
built from `_options.DefaultHttpTimeout` and passed into `SendAsync`. That correctly
caps every call to *at most* the configured value when `DefaultHttpTimeout` ≤ 100s.
However, `HttpClient.Timeout` (the framework default) is never set on either the named
client or its primary handler — the `GatewayHttpClientConfigurator` only sets
`MaxConnectionsPerServer`. `HttpClient.Timeout` defaults to **100 seconds**, and
`SendAsync` enforces it internally by cancelling its own private CTS, raising a
`TaskCanceledException` from `SendAsync` *without* cancelling either the caller's token
or the gateway's `timeoutCts`.

Consequences when an operator configures `DefaultHttpTimeout` to anything > 100s
(a legitimate setting for external systems with long-running endpoints — recipe
exports, large queries):

1. The gateway's `timeoutCts` (e.g. 5 minutes) has not yet fired.
2. `HttpClient.Timeout` fires at 100s, `SendAsync` throws.
3. Neither `when (cancellationToken.IsCancellationRequested)` nor
   `when (timeoutCts.IsCancellationRequested)` matches, so the exception falls into
   the generic `catch (Exception ex) when (ErrorClassifier.IsTransient(ex))` branch
   (line 277) and is re-thrown as a `TransientExternalSystemException` with the
   message `"Connection error to {Name}: A task was canceled."` — misattributing a
   timeout as a connection error.
4. The configured 5-minute round-trip window the design doc promises ("Each external
   system definition specifies a timeout that applies to all method calls on that
   system" / "applies to the HTTP request round-trip") is silently overridden.

The opposite case (`DefaultHttpTimeout` < 100s) is the only one the `-002` regression
test exercises (200ms), so the defect is not caught by the existing suite.

**Recommendation**

Set `HttpClient.Timeout = Timeout.InfiniteTimeSpan` on the gateway's named clients via
the existing `GatewayHttpClientConfigurator` (delegate `HttpClientActions` rather than
just `HttpMessageHandlerBuilderActions`), so the cancellation-token mechanism is the
sole timeout source. The linked `timeoutCts` then reliably enforces
`DefaultHttpTimeout` for every value, and the timeout-vs-cancellation classification at
lines 266–276 stays accurate. Add a regression test that configures `DefaultHttpTimeout`
to ~150s, hangs the handler, and asserts the call times out at the configured value
and produces a `"Timeout calling..."` (not `"Connection error to..."`) error.
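
As a rough illustration of "the cancellation token is the sole timeout source", the sketch below disables the framework timeout and enforces the configured value with a linked `CancellationTokenSource`. It uses only BCL types; `GatewayTimeoutSketch`, `CreateClient`, and `SendWithTimeoutAsync` are hypothetical names, and `TimeoutException` stands in for whatever exception type the gateway actually raises. In the module itself the `Timeout` assignment would go through the named-client configuration rather than a constructor call.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class GatewayTimeoutSketch
{
    // Disable HttpClient's built-in 100 s default so it can never fire first.
    public static HttpClient CreateClient(HttpMessageHandler handler) =>
        new HttpClient(handler) { Timeout = System.Threading.Timeout.InfiniteTimeSpan };

    public static async Task<HttpResponseMessage> SendWithTimeoutAsync(
        HttpClient client,
        HttpRequestMessage request,
        TimeSpan timeout,
        CancellationToken callerToken)
    {
        // Linked source: cancelled by the caller OR when the configured timeout elapses.
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        timeoutCts.CancelAfter(timeout);

        try
        {
            return await client.SendAsync(request, timeoutCts.Token);
        }
        catch (OperationCanceledException) when (callerToken.IsCancellationRequested)
        {
            throw; // genuine caller cancellation, not a timeout
        }
        catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested)
        {
            // Only the configured timeout can reach this branch now.
            throw new TimeoutException($"Timeout calling {request.RequestUri} after {timeout}.");
        }
    }
}
```

With the framework timeout out of the picture, every cancellation observed by `SendAsync` is attributable to one of the two tokens, so neither exception filter can be bypassed.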

### ExternalSystemGateway-020 — `JsonElementToParameterValue` silently downcasts non-Int64 JSON numbers to `double`, losing precision for `decimal` SQL parameters on retry

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:185-193` |

**Description**

`DatabaseGateway.JsonElementToParameterValue` materialises the buffered cached-write
SQL parameter values during a retry-sweep delivery:

```csharp
private static object JsonElementToParameterValue(JsonElement element) => element.ValueKind switch
{
    JsonValueKind.String => (object?)element.GetString() ?? DBNull.Value,
    JsonValueKind.Number => element.TryGetInt64(out var l) ? l : element.GetDouble(),
    ...
};
```

For a JSON number, the helper attempts `Int64` first and otherwise returns a `double`.
There is no `decimal` branch. The immediate-attempt path is unaffected — `CachedWriteAsync`
on the original call serializes the script-provided typed parameters via
`JsonSerializer.Serialize(new { ConnectionName, Sql, Parameters = parameters })` and
executes the SQL directly outside this code path. But the **retry path** runs through
`DeliverBufferedAsync` → `JsonElementToParameterValue`, so a script that submitted
a `decimal` value (e.g. `123.4567890123m`) gets:

1. Immediate attempt: the `decimal` parameter is bound at full precision, because the
   value never passes through this helper on the immediate path.
2. Retry attempt(s) after a transient failure: the value is deserialized as a JSON
   number, fails `TryGetInt64`, and is downcast to `double` — which has ~15–17 digits
   of precision against `decimal`'s 28–29. A SQL column of type `decimal(18, 6)` or
   `numeric` receives a value that has been truncated to `double` precision before
   parameter binding.

Two further consequences worth recording:

- The downcast is **silent** — there is no log, no error, and the cached-write
  acknowledgement to the script has long since happened. Data drift between a
  same-call immediate-success delivery and a same-call retry delivery is the worst
  shape of "looks like the right value but isn't" defect.
- For SCADA telemetry (process variables, totals, currency-denominated quality
  reports) `decimal` is the correct CLR type and `double`'s representation error
  changes the persisted value.

**Recommendation**

Replace the `Number` branch with a precision-preserving cascade — try `Int64`, then
`decimal` (`element.TryGetDecimal(out var d) ? d : element.GetDouble()`), and only
fall back to `double` when even `decimal` fails. Add a regression test against
`DatabaseGateway.DeliverBufferedAsync` that buffers a write with a high-precision
`decimal` parameter, drives the delivery, and asserts the SQL parameter bound is a
`decimal` (or compares the round-tripped value to the original at the parameter level)
rather than a `double` with truncated precision. The same Number-branch decision should
be reviewed against `JsonValueKind.True`/`False`/`Null` (currently fine) and a string
that happens to encode a number (already correctly returns `string`).
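
A minimal sketch of the precision-preserving cascade follows, limited to the `Number` case. `JsonNumberSketch` and `NumberToParameterValue` are hypothetical names, not part of `DatabaseGateway`. Each branch returns its value as `object` on its own statement, so no implicit numeric conversion can widen a `long` or `decimal` to `double` before boxing (mixing `long` and `double` in a single conditional expression would do exactly that).

```csharp
using System;
using System.Text.Json;

public static class JsonNumberSketch
{
    // Converts a JSON number to the narrowest lossless CLR type:
    // long first, then decimal, and double only as a last resort.
    public static object NumberToParameterValue(JsonElement element)
    {
        if (element.ValueKind != JsonValueKind.Number)
            throw new ArgumentException("Expected a JSON number.", nameof(element));

        if (element.TryGetInt64(out long integral))
            return integral;          // boxed as Int64

        if (element.TryGetDecimal(out decimal exact))
            return exact;             // boxed as Decimal, precision preserved

        return element.GetDouble();   // outside decimal's range: accept double precision
    }
}
```

The final `double` fallback still loses precision, so a production version might log there; whether that is worth the noise is a judgment call for the module owners.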

### ExternalSystemGateway-021 — `ApplyAuth` silently sends an unauthenticated request on unknown `AuthType`, empty `AuthConfiguration`, or malformed Basic config

| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:385-415` |

**Description**

`ApplyAuth` has three fail-open paths that all result in an HTTP request being sent
**without** the credential the operator configured:

1. Line 387 — `if (string.IsNullOrEmpty(system.AuthConfiguration)) return;` returns
   early regardless of `AuthType`. A system entity with `AuthType = "apikey"` but an
   empty `AuthConfiguration` (e.g. the secret column failed to deploy, or the
   protector key changed and decryption produced `""`) sends every request with no
   `X-API-Key` header — the gateway is silent.
2. The `switch` has no `default` arm. A system entity with `AuthType = "bearer"`,
   `"oauth2"`, a typo like `"ApiKey "` (trailing space) or even `"none"` falls off the
   `switch` and the request is sent without any auth header — again silent.
3. Line 408 — `if (basicParts.Length == 2)` skips the auth attach when
   `AuthConfiguration` for `basic` lacks a `:` separator. The request is sent with no
   `Authorization` header.

Effectively the gateway treats every misconfiguration as "send anonymously" and
relies on the remote system rejecting it with a 401/403. That is a defensible default
on its own, but combined with `-007`'s 2 KB error-body cap and the fact that no audit
or warning is emitted, an operator debugging "why does my external system always
return 401" has nothing to go on inside ScadaLink — the gateway never says it failed
to apply auth. For `AuthType = "none"` (the design's expected sentinel for
unauthenticated systems) the fall-through is correct; the failure mode is misconfig.

**Recommendation**

Add a `default:` arm to the `switch` that logs `_logger.LogWarning(...)` naming the
unknown `AuthType` and the system, and emit a similar warning when
`AuthConfiguration` is empty for an `AuthType` of `"apikey"` or `"basic"` (those
require a value; `"none"` does not). For Basic auth specifically, the
`basicParts.Length != 2` branch should also warn. Do **not** include the
`AuthConfiguration` value in the log message — secrets must stay out of the log
(consistent with the existing module). A small set of `ApplyAuth` unit tests
verifying the warning emission and that no `Authorization` / `X-API-Key` header is
ever leaked in the warning text would close the test gap as well.
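
One possible shape for the warning paths is sketched below. Several details are assumptions rather than the module's real code: the `Action<string> warn` delegate stands in for `_logger.LogWarning`, `AuthType` matching is exact and case-sensitive, the Basic credential is assumed to be stored as `user:password`, and the `bool` return value (true when the request carries what the `AuthType` asks for) is invented for illustration.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

public static class ApplyAuthSketch
{
    public static bool ApplyAuth(
        HttpRequestMessage request,
        string systemName,
        string? authType,
        string? authConfiguration,
        Action<string> warn)
    {
        switch (authType)
        {
            case "none":
                return true; // intentionally unauthenticated system

            case "apikey":
                if (string.IsNullOrEmpty(authConfiguration))
                {
                    warn($"System '{systemName}': AuthType 'apikey' has no AuthConfiguration; no API key applied.");
                    return false;
                }
                request.Headers.Add("X-API-Key", authConfiguration);
                return true;

            case "basic":
                var parts = authConfiguration?.Split(':', 2) ?? Array.Empty<string>();
                if (parts.Length != 2)
                {
                    warn($"System '{systemName}': AuthType 'basic' is missing the ':' separator; no credential applied.");
                    return false;
                }
                var token = Convert.ToBase64String(Encoding.UTF8.GetBytes(authConfiguration!));
                request.Headers.Authorization = new AuthenticationHeaderValue("Basic", token);
                return true;

            default:
                // Names the AuthType and the system, never the secret itself.
                warn($"System '{systemName}': unknown AuthType '{authType}'; request is unauthenticated.");
                return false;
        }
    }
}
```

Returning `false` leaves the caller free to keep today's "send anyway" behaviour or to fail the call; the finding only asks that the condition stop being silent.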

### ExternalSystemGateway-022 — `new HttpMethod(method.HttpMethod)` accepts any string at runtime; an invalid HTTP verb fails only at call time

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:233` |

**Description**

`InvokeHttpAsync` constructs the request method directly from the string column:
`new HttpRequestMessage(new HttpMethod(method.HttpMethod), url)`. `System.Net.Http.HttpMethod(string)`
performs only a token-character validation (it rejects whitespace and control chars
but accepts arbitrary non-standard tokens like `"FOO"` or `"GIT"`). The body-vs-query
selection at lines 239–250 explicitly checks for POST/PUT/PATCH; for any other
non-standard verb (`"FOO"`) the parameters silently go to neither body nor query
string and the request is dispatched anyway.

The design doc enumerates GET/POST/PUT/DELETE as the supported set. There is no
validation at deployment time, at definition save time, or at gateway
resolution time that `method.HttpMethod` is one of the expected verbs. An operator
who typos `"DLETE"` discovers the issue only when a script invokes that method and
the remote server rejects the request — usually as a 4xx that the gateway classifies
as permanent, which is correct but obscures the root cause.

**Recommendation**

Validate `method.HttpMethod` at gateway entry — either with a small `switch` of
allowed verbs in `InvokeHttpAsync` that throws `PermanentExternalSystemException` for
an unsupported verb (cheap, immediate, surfaces a clear error to the script), or by
adding a validation pass in the Template/Deployment Manager so it can never reach
the gateway. The first option is local to this module and cheaper to land. Either
way, the canonical list should agree with `BuildUrl`'s query-vs-body decision (which
currently knows about POST/PUT/PATCH for body and GET/DELETE for query — note PATCH
is in the body branch but not the design-doc list; see finding 023).
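
The first option (validation at gateway entry) might be sketched as follows. `HttpVerbSketch` and `ParseVerb` are hypothetical names, `InvalidOperationException` stands in for `PermanentExternalSystemException`, and case-insensitive matching is an assumption. The allowed set mirrors the design doc's GET/POST/PUT/DELETE list; whether PATCH belongs in it depends on how finding 023 is resolved.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;

public static class HttpVerbSketch
{
    // Canonical verb list; kept in one place so body-vs-query logic can share it.
    private static readonly HashSet<string> AllowedVerbs =
        new(StringComparer.OrdinalIgnoreCase) { "GET", "POST", "PUT", "DELETE" };

    public static HttpMethod ParseVerb(string? verb)
    {
        if (verb is null || !AllowedVerbs.Contains(verb))
        {
            // Fails before any request is built, with the offending value in the message.
            throw new InvalidOperationException(
                $"Unsupported HTTP method '{verb}'. Expected one of: {string.Join(", ", AllowedVerbs)}.");
        }

        return new HttpMethod(verb.ToUpperInvariant());
    }
}
```

A typo such as `"DLETE"` then fails inside the gateway with a message naming the bad verb, instead of surfacing later as an opaque 4xx from the remote system.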

### ExternalSystemGateway-023 — PATCH HTTP method is supported by code but absent from the design doc; body-vs-query decision drifts from the documented set

| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:241`, `docs/requirements/Component-ExternalSystemGateway.md:43` |

**Description**

The component design doc lists the supported HTTP methods as `GET, POST, PUT, or
DELETE` (line 43: `**HTTP method**: GET, POST, PUT, or DELETE.`). `InvokeHttpAsync`'s
body-serialization branch at lines 239–250 explicitly includes `PATCH` alongside POST
and PUT — so PATCH is in fact supported (and routes parameters into the JSON body),
but operators reading the spec would not know it. Conversely, `BuildUrl`'s
query-string branch at lines 364–366 lists only `GET` and `DELETE`, so a PATCH
method's parameters always go to the body, matching the body-branch but not appearing
anywhere in the documented contract.

This is mild drift — the code is more permissive than the spec. It only becomes a
real issue if a future change relies on the documented "only GET/POST/PUT/DELETE"
set and breaks the PATCH path silently, or if PATCH is genuinely out of scope and a
template author defines a PATCH method on purpose only to learn later it is
unsupported.

**Recommendation**

Pick one direction and apply it in the same session, per the project's "design doc +
code travel together" rule:

- If PATCH is intentionally supported, add `PATCH` to the Component-ExternalSystemGateway.md
  HTTP-method list (line 43) and add a parameterised test confirming a PATCH method
  sends its parameters in the JSON body and resolves like POST/PUT for error
  classification.
- If PATCH is not in scope, remove `method.HttpMethod.Equals("PATCH", ...)` from the
  body branch in `InvokeHttpAsync` and let finding-022's verb validation reject it.
  The design-doc list then remains the single source of truth.

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |

## Summary

@@ -51,6 +51,35 @@ HealthMonitoring + CentralUI change), and `CollectReport` reading
`TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
broadly faithful to the design intent.

#### Re-review 2026-05-28 (commit `1eb6e97`)

All sixteen prior findings (HealthMonitoring-001..016) remain `Resolved`. This
baseline re-review applied the full 10-category checklist and produced **7 new
findings** (1 Medium, 6 Low — none crash-class). The most material observation
is a **metric-loss race** in `HealthReportSender.ExecuteAsync`
(HealthMonitoring-017): `CollectReport` resets the per-interval error counters
(`ScriptErrorCount`, `AlarmEvaluationErrorCount`, `DeadLetterCount`,
`SiteAuditWriteFailures`, `AuditRedactionFailure`) **before**
`_transport.Send(...)` is attempted, so a transport failure (the existing
`catch { LogError; }` path) silently discards every error this site recorded in
the failed interval — the module-specific concern of "metric counters drifting
from raw-per-interval to cumulative" inverted, with counts being _lost_ rather
than accumulated. A
parallel hazard exists in `CentralHealthReportLoop` (HealthMonitoring-018). The
remaining items are smaller: two Audit Log metrics
(`SiteAuditTelemetryStalled`, `CentralAuditWriteFailures`) listed in the design
doc never make it into a HealthMonitoring surface (HealthMonitoring-019); a
heartbeat with `receivedAt <= existing.LastHeartbeatAt` brings an offline site
back online with a stale heartbeat that can flap right back to offline on the
next check (HealthMonitoring-020); the reserved `CentralSiteId = "central"`
constant collides with any real site named `"central"` and silently extends its
offline grace (HealthMonitoring-021); `CentralHealthReportLoopTests` uses real
wall-clock 50 ms timers + `Task.Delay`, making it timing-sensitive
(HealthMonitoring-022); and one obsolete placeholder test name
(`StoreAndForwardBufferDepths_IsEmptyPlaceholder`) misrepresents what it now
covers (HealthMonitoring-023). All sequence-number and offline-detection
arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs
monotonic mismatch was observed.

## Checklist coverage

| # | Category | Examined | Notes |
@@ -66,6 +95,21 @@ broadly faithful to the design intent.
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |

_Re-review (2026-05-28, `1eb6e97`):_

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `HealthReportSender` and `CentralHealthReportLoop` reset per-interval counters before the send/process call — counts lost on transport failure (HealthMonitoring-017, HealthMonitoring-018). `MarkHeartbeat` brings an offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt` — site can flap straight back to offline (HealthMonitoring-020). `CentralSiteId = "central"` reserved constant silently collides with any real site named "central" (HealthMonitoring-021). |
| 2 | Akka.NET conventions | x | Module contains no actors itself. `IHealthReportTransport` cleanly abstracts the Akka-remoting send. `ProcessReport`/`MarkHeartbeat` are called from `CentralCommunicationActor`'s receive — invoked on the actor thread but the aggregator's CAS loops make that safe regardless. No issues found. |
| 3 | Concurrency & thread safety | x | Verified the resolved `SiteHealthState` immutable-record / CAS-loop pattern still holds across `ProcessReport`, `MarkHeartbeat`, `CheckForOfflineSites`. `SiteHealthCollector` uses `volatile` for reference fields (`_clusterNodes`, `_nodeHostname`, `_siteAuditBacklog`, `_isActiveNode`) and `Interlocked` for counters consistently. `CollectReport`'s `new Dictionary<>(concurrentDict)` snapshots are not strictly atomic but acceptable at the documented scale. No new issues found. |
| 4 | Error handling & resilience | x | `try/catch` blocks now log all non-fatal failures (resolved HealthMonitoring-010 still in place). Outer `catch (Exception)` in `ExecuteAsync` keeps the loop alive — sound. New: the counter-reset-before-send issue (HealthMonitoring-017, HealthMonitoring-018) is an error-handling gap — transport failure silently swallows the interval's metric data. |
| 5 | Security | x | No issues found. The module handles only numeric/string operational metrics; no secrets, auth surface, or untrusted input parsing. `MarkHeartbeat` and `ProcessReport` trust the caller (intra-cluster). |
| 6 | Performance & resource management | x | `PeriodicTimer` instances disposed via `using`. CAS retry loops in `ProcessReport`/`MarkHeartbeat` have no bounded retry cap, but contention is bounded by the dictionary size (one entry per site), so the loop is effectively wait-free in the common case. No issues found. |
| 7 | Design-document adherence | x | `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` are listed as required dashboard tiles in `Component-HealthMonitoring.md` but have no HealthMonitoring-side surface — both live only in `AuditLog`'s `AuditCentralHealthSnapshot` with no integration into the health aggregator or its consumers (HealthMonitoring-019). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component, validator registered idempotently across all three `Add*` paths. POCO/messages in Commons. `AddCentralHealthAggregation` implicitly depends on `ISiteHealthCollector` being registered elsewhere (Host calls `AddHealthMonitoring()` first) — works but is a hidden ordering requirement. Minor; not flagged. |
| 9 | Testing coverage | x | Per-interval reset semantics covered for site-side counters but NOT for the failed-send case (no test asserts counters remain accumulated when the transport throws — would catch HealthMonitoring-017). `CentralHealthReportLoopTests` uses real wall-clock 50 ms `PeriodicTimer` + `Task.Delay(250)` for timing — flake-prone on a slow CI runner (HealthMonitoring-022). The placeholder test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` name is stale (HealthMonitoring-023). |
| 10 | Documentation & comments | x | XML docs in the new audit-bridge surfaces (`IncrementSiteAuditWriteFailures`, `IncrementAuditRedactionFailure`, `UpdateSiteAuditBacklog`) are accurate. The stale placeholder test name is the only issue (HealthMonitoring-023). |

## Findings

### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated
@@ -776,3 +820,314 @@ continues to work via the optional parameter. Regression test
asserts the timestamp equals a fixed injected instant exactly (not just a
before/after window); it would not compile against the pre-fix single-arg-less
constructor.

### HealthMonitoring-017 — `HealthReportSender` resets interval counters before `Send`; transport failures silently drop the interval's error counts

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:140-154`, `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:146-153` |

**Description**

`HealthReportSender.ExecuteAsync` calls `_collector.CollectReport(_siteId)` and
then `_transport.Send(reportWithSeq)` inside a single `try` block whose `catch`
logs and continues. `CollectReport` atomically reads and resets the per-interval
counters via `Interlocked.Exchange(ref _scriptErrorCount, 0)` (and the same for
`_alarmErrorCount`, `_deadLetterCount`, `_siteAuditWriteFailures`,
`_auditRedactionFailures`). If `_transport.Send` then throws — Akka remoting
hiccup, transport not yet associated, central side temporarily unavailable,
serialization failure on a malformed metric, etc. — the `catch (Exception ex)`
on line 150 logs an error and the loop simply waits for the next tick. The
report was never delivered, but the counters have already been reset to zero, so
**every error this site recorded in the failed interval is gone**: it is neither
in the (un-sent) report nor in the (zeroed) collector. The very next successful
report will show "0 script errors / 0 alarm errors" for the entire window in
which the transport was broken, masking exactly the period the operator most
needs to triage.

This contradicts the design doc's "raw counts per reporting interval" / "counter
resets **after each report is sent**" wording — current code resets on each
report _attempt_, regardless of outcome. The hazard worsens under sustained
transport failure: every interval's errors are lost; the central dashboard sees
a quiet site while the site is, in fact, failing.

The same shape exists in `CentralHealthReportLoop` (see HealthMonitoring-018) —
`CollectReport` is called before `_aggregator.ProcessReport`. The aggregator
call is in-process and unlikely to throw, but the structural bug is identical.

**Recommendation**

Build the report from a non-destructive read first (`PeekReport(siteId)`,
returning a snapshot without mutating the counters) and only call a dedicated
`ResetIntervalCounters()` after a successful `_transport.Send`. Alternatively,
on a `Send` failure, restore the lost counts via `Interlocked.Add` of the
captured values back into the collector fields — atomically correct as long as
no other thread can read them in between, which is true here because the next
read is the next `CollectReport` on the same loop. The "peek then commit"
shape is the cleaner public API.
|
||||
|
||||
A regression test should add a failing-transport scenario:
|
||||
`Send` throws an `InvalidOperationException`; assert that the next successful
|
||||
report includes the previously-failed interval's `ScriptErrorCount`.
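
A minimal sketch of the "peek then commit" shape, assuming a simplified collector with a single counter. `IntervalCounters`, `Peek`, `Commit`, and `ReportLoop.TrySendOnce` are hypothetical names used for illustration only; they are not the module's actual API. Instead of zeroing the counter, the commit step subtracts the reported value, so increments that arrive between the peek and the commit survive into the next interval.

```csharp
using System;
using System.Threading;

// Hypothetical, simplified stand-in for the collector's per-interval counters.
public sealed class IntervalCounters
{
    private int _scriptErrorCount;

    public void IncrementScriptError() => Interlocked.Increment(ref _scriptErrorCount);

    // Non-destructive read: returns the current value without resetting it.
    public int Peek() => Volatile.Read(ref _scriptErrorCount);

    // Called only after a successful send. Removes exactly what was reported,
    // so increments that arrived after the peek are kept for the next interval.
    public void Commit(int reported) => Interlocked.Add(ref _scriptErrorCount, -reported);
}

public static class ReportLoop
{
    // Returns true when the report was delivered and the counters were committed.
    public static bool TrySendOnce(IntervalCounters counters, Action<int> send)
    {
        int snapshot = counters.Peek();
        try
        {
            send(snapshot);
        }
        catch (Exception)
        {
            // Send failed: leave the counters untouched so the next report carries them.
            return false;
        }

        counters.Commit(snapshot);
        return true;
    }
}
```

With this shape a failed interval is not lost: the next successful call reports the accumulated count, which is the behaviour the proposed regression test would assert.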
|
||||
|
||||
### HealthMonitoring-018 — Same counter-reset-before-publish hazard in `CentralHealthReportLoop`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:87-98` |
|
||||
|
||||
**Description**
|
||||
|
||||
`CentralHealthReportLoop.ExecuteAsync` calls `_collector.CollectReport(CentralSiteId)`
|
||||
(which resets the per-interval counters on the shared `SiteHealthCollector`
|
||||
instance — see HealthMonitoring-017) and then `_aggregator.ProcessReport(reportWithSeq)`
|
||||
inside the same `try` block. If `ProcessReport` throws, the central node's own
|
||||
per-interval counters (`ScriptErrorCount`, `AlarmEvaluationErrorCount`,
|
||||
`DeadLetterCount`, `SiteAuditWriteFailures`, `AuditRedactionFailure`) are lost
|
||||
for that interval.
|
||||
|
||||
In practice `ProcessReport` is a pure in-memory CAS loop and is very unlikely
|
||||
to throw, so the operational impact is small. However, the structural bug is
|
||||
identical to HealthMonitoring-017 and would be fixed by the same
|
||||
"peek then commit" refactor in `SiteHealthCollector`. The Audit-Log-related
|
||||
metrics matter most here: `AuditRedactionFailure` is genuinely incremented at
|
||||
central during normal operation (the Notification Outbox dispatcher and
|
||||
Inbound API middleware both write through `CentralAuditRedactionFailureCounter`
|
||||
which can fan out to the collector via the bridge), so this is not purely
|
||||
theoretical.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Adopt the same "peek then reset on successful publish" pattern recommended for
|
||||
HealthMonitoring-017. Reuse the new `PeekReport` / `ResetIntervalCounters`
|
||||
collector API once it lands.
|
||||
|
||||
### HealthMonitoring-019 — `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` design-doc metrics have no HealthMonitoring-side surface
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `docs/requirements/Component-HealthMonitoring.md:39,40`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs`, `src/ScadaLink.AuditLog/Central/AuditCentralHealthSnapshot.cs:39-58` |
|
||||
|
||||
**Description**
|
||||
|
||||
`Component-HealthMonitoring.md` lists `SiteAuditTelemetryStalled` and
|
||||
`CentralAuditWriteFailures` (and reiterates them under the Audit Log KPIs
|
||||
section and in the Dependencies section) as required dashboard metrics. The
|
||||
doc also says they "are central-computed alongside the existing central KPIs"
|
||||
(Notification Outbox / Site Call Audit) and surface in the **Audit** dashboard
|
||||
tile group.
|
||||
|
||||
Tracing the code:
|
||||
|
||||
- `SiteAuditTelemetryStalled` is published by `SiteAuditReconciliationActor`,
|
||||
picked up by `SiteAuditTelemetryStalledTracker`, and latched into
|
||||
`AuditCentralHealthSnapshot._stalled` (a `ConcurrentDictionary<string, bool>`
|
||||
in the `ScadaLink.AuditLog` assembly).
|
||||
- `CentralAuditWriteFailures` is incremented inside `AuditCentralHealthSnapshot`
|
||||
via `ICentralAuditWriteFailureCounter.Increment()` (also in `ScadaLink.AuditLog`).
|
||||
|
||||
Neither metric is referenced anywhere in `src/ScadaLink.HealthMonitoring/`:
|
||||
- `ICentralHealthAggregator` does not expose them.
|
||||
- `SiteHealthCollector` has no central counterpart (it is site-only).
|
||||
- `SiteHealthReport` has no `SiteAuditTelemetryStalled` / `CentralAuditWriteFailures`
|
||||
fields (the site-only `SiteAuditWriteFailures`, `AuditRedactionFailure`, and
|
||||
`SiteAuditBacklog` _are_ wired; the central pair is the gap).
|
||||
|
||||
Currently the only consumer of `IAuditCentralHealthSnapshot` is whatever
|
||||
Central UI page binds to it directly (out of scope for this module), but the
|
||||
design doc places these metrics under HealthMonitoring's responsibility
|
||||
("Health Monitoring Dashboard displays aggregated metrics"). At minimum the
|
||||
Dependencies section's claim that Health Monitoring provides "the
|
||||
central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics"
|
||||
is false for `CentralAuditWriteFailures`: nothing under
|
||||
`src/ScadaLink.HealthMonitoring/` knows about it.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Decide whether HealthMonitoring or the consuming UI page owns the
|
||||
`IAuditCentralHealthSnapshot` integration:
|
||||
|
||||
- If HealthMonitoring owns it, expose a `CentralKpis` accessor on
|
||||
`ICentralHealthAggregator` (e.g. a `GetCentralAuditHealth()` method that
|
||||
returns a typed DTO derived from the injected `IAuditCentralHealthSnapshot`)
|
||||
so the dashboard has a single read surface mirroring `GetAllSiteStates`.
|
||||
- If the UI page binds `IAuditCentralHealthSnapshot` directly, update the
|
||||
HealthMonitoring design doc's Responsibilities / Dependencies sections to
|
||||
reflect that and remove the implied integration.
|
||||
|
||||
Either way, add a regression test that the chosen surface returns the live
|
||||
counter and per-site stalled state.
|
||||
|
||||
### HealthMonitoring-020 — `MarkHeartbeat` brings offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:128-147` |
|
||||
|
||||
**Description**
|
||||
|
||||
The CAS path in `MarkHeartbeat` picks `newHeartbeat = max(receivedAt, existing.LastHeartbeatAt)`,
|
||||
then short-circuits only when `newHeartbeat == existing.LastHeartbeatAt &&
|
||||
existing.IsOnline`. That short-circuit is correct, but consider the case where
|
||||
`existing.IsOnline == false` and `receivedAt <= existing.LastHeartbeatAt`:
|
||||
|
||||
1. Suppose a site is marked offline by `CheckForOfflineSites` at time T1.
|
||||
2. A late/out-of-order heartbeat carrying a `receivedAt` _older_ than the last
|
||||
stored `LastHeartbeatAt` arrives at T2 (clock skew at the receive site, or a
|
||||
delayed message that was generated before the offline-marking).
|
||||
3. `newHeartbeat == existing.LastHeartbeatAt` (kept), but the short-circuit
|
||||
condition fails because `existing.IsOnline == false`, so the CAS produces a
|
||||
new record with `IsOnline = true` and the **stale** `LastHeartbeatAt`.
|
||||
4. On the very next `CheckForOfflineSites` tick (≤ `OfflineTimeout/2` later),
|
||||
`now - LastHeartbeatAt` is still ≥ `OfflineTimeout`, so the site is
|
||||
immediately marked offline again — the heartbeat brought it online for less
|
||||
than the check cadence, producing a "flap" in the dashboard.
|
||||
|
||||
In practice `receivedAt` is normally `_timeProvider.GetUtcNow()` at the
|
||||
`CentralCommunicationActor` receive site, so monotonically increasing — the bug
|
||||
is latent. But the contract `MarkHeartbeat(string siteId, DateTimeOffset receivedAt)`
|
||||
makes no guarantee about ordering, and an out-of-order delivery (Akka remoting
|
||||
ordering across connection re-establishment edge cases) or a small wall-clock
|
||||
correction at central would expose it.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
When transitioning offline → online, use `now` (from the injected
|
||||
`TimeProvider`) rather than the caller-supplied `receivedAt` for
|
||||
`LastHeartbeatAt`, or take `max(receivedAt, _timeProvider.GetUtcNow())` so the
|
||||
recovery point is always recent. A unit test driving `MarkHeartbeat` with a
|
||||
`receivedAt` older than the last stored heartbeat on an offline site, then a
|
||||
`CheckForOfflineSites` immediately afterwards, would assert the site stays
|
||||
online.
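
The recommended rule can be sketched as a pure state transition. This assumes a simplified `SiteState` record and a `HeartbeatRules.Apply` helper, both hypothetical; the real aggregator stores richer state and applies the update inside its CAS loop, with `now` supplied by the injected `TimeProvider`.

```csharp
using System;

// Hypothetical, reduced view of the per-site state held by the aggregator.
public sealed record SiteState(bool IsOnline, DateTimeOffset LastHeartbeatAt);

public static class HeartbeatRules
{
    public static SiteState Apply(SiteState existing, DateTimeOffset receivedAt, DateTimeOffset now)
    {
        // Never move the stored heartbeat backwards.
        DateTimeOffset newest = receivedAt > existing.LastHeartbeatAt
            ? receivedAt
            : existing.LastHeartbeatAt;

        if (existing.IsOnline)
        {
            // Short-circuit: nothing changes for a late heartbeat on an online site.
            return newest == existing.LastHeartbeatAt
                ? existing
                : existing with { LastHeartbeatAt = newest };
        }

        // Offline -> online: anchor the recovery point at the current time so the
        // next offline check does not immediately mark the site offline again.
        DateTimeOffset recovered = newest > now ? newest : now;
        return new SiteState(true, recovered);
    }
}
```

Anchoring the recovery at `max(receivedAt, now)` trades a slightly optimistic `LastHeartbeatAt` for the removal of the dashboard flap.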
|
||||
|
||||
### HealthMonitoring-021 — `CentralSiteId = "central"` reserved constant silently collides with a real site named "central"
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:22`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:224-226` |
|
||||
|
||||
**Description**
|
||||
|
||||
`CentralHealthAggregator.CheckForOfflineSites` looks up the per-site offline
|
||||
timeout with:
|
||||
|
||||
```csharp
|
||||
var timeout = kvp.Key == CentralHealthReportLoop.CentralSiteId
|
||||
? _options.CentralOfflineTimeout
|
||||
: _options.OfflineTimeout;
|
||||
```
|
||||
|
||||
`CentralSiteId` is the literal string `"central"`. Site IDs are free-form
|
||||
strings set in configuration / the Sites repository; there is no validation
|
||||
that excludes the reserved `"central"` name. An operator who creates a real
|
||||
site with `SiteId = "central"` will hit two problems:
|
||||
|
||||
- The real site's reports arriving via `ProcessReport` are stored in the same
|
||||
dictionary slot as the central self-report (they share the keyspace), so the
|
||||
central self-report and the real-site report repeatedly overwrite each
|
||||
other via the sequence-number guard — whichever has the higher Unix-ms seed
|
||||
wins, and the other is silently rejected as stale. The dashboard alternates
|
||||
between two unrelated payloads.
|
||||
- The real site gets the longer `CentralOfflineTimeout` (default 3 minutes)
|
||||
instead of the normal `OfflineTimeout` (60 s), so a genuinely-failed real
|
||||
site marked "central" stays falsely-online for an extra two minutes.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Two options:
|
||||
|
||||
1. Reject the reserved name at the Site entity / configuration validation
|
||||
layer (Configuration Database component, out of this module's scope) and
|
||||
document `"central"` as reserved. This is the cleaner UX fix.
|
||||
2. As a defence-in-depth inside HealthMonitoring, store the central
|
||||
self-report under a key that cannot collide — e.g. prefix it with a
|
||||
character that is forbidden in real site IDs (`":central"` or `"#central"`)
|
||||
— and adjust `CheckForOfflineSites` accordingly.
|
||||
|
||||
Either fix should include a regression test creating a real `SiteHealthReport`
|
||||
with `SiteId = "central"` and asserting the central self-report's identity is
|
||||
preserved.
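
Both options can be sketched together. The sketch assumes that `:` is not permitted in real site IDs, which the source does not confirm; `SiteIdRules`, `CentralKey`, `IsValidSiteId`, and `TimeoutFor` are hypothetical names.

```csharp
using System;

public static class SiteIdRules
{
    // Assumption: ':' cannot appear in a real site ID, so this key cannot collide.
    public const string CentralKey = ":central";

    // The name the current code reserves for the central self-report.
    public const string ReservedName = "central";

    // Option 1: reject the reserved name (and the reserved prefix character)
    // at the validation layer.
    public static bool IsValidSiteId(string siteId) =>
        !string.IsNullOrWhiteSpace(siteId)
        && !siteId.Contains(':')
        && !string.Equals(siteId, ReservedName, StringComparison.OrdinalIgnoreCase);

    // Option 2: only the non-colliding key receives the longer central timeout.
    public static TimeSpan TimeoutFor(string key, TimeSpan siteTimeout, TimeSpan centralTimeout) =>
        key == CentralKey ? centralTimeout : siteTimeout;
}
```

Under option 2 a real site that happens to be called `central` is treated like any other site and gets the normal offline timeout.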
|
||||
|
||||
### HealthMonitoring-022 — `CentralHealthReportLoopTests` uses real-time `PeriodicTimer` + `Task.Delay`; flake-prone on slow CI
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Location | `tests/ScadaLink.HealthMonitoring.Tests/CentralHealthReportLoopTests.cs:32-42` |
|
||||
|
||||
**Description**
|
||||
|
||||
`RunLoopBriefly` starts the hosted service with a 50 ms `PeriodicTimer` and
|
||||
then `await Task.Delay(runForMs, CancellationToken.None)` (with `runForMs`
|
||||
between 150 ms and 300 ms). `GeneratesCentralReports_WhenSelfIsPrimary` and
|
||||
`AssignsMonotonicSequenceNumbers` both assert "at least 2 reports were
|
||||
generated" within the window. On a heavily-contended CI runner where the
|
||||
hosted-service start-up plus a couple of `PeriodicTimer` ticks can blow past
|
||||
300 ms, these tests will flake intermittently.
|
||||
|
||||
The rest of the suite (`CentralHealthAggregatorTests`, `SiteHealthCollectorTests`,
|
||||
`HealthReportSenderTests` partially) was deliberately refactored to use the
|
||||
injected `TimeProvider` precisely to avoid this. `CentralHealthReportLoop` and
|
||||
`HealthReportSender` already accept a `TimeProvider`, but the loop's
|
||||
`PeriodicTimer` is still real-time because `PeriodicTimer` does not consume
|
||||
the `TimeProvider` parameter.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) accept the timing-sensitivity and bump the delay budget
|
||||
generously, or (b) refactor the hosted-service loop to use a
|
||||
`TimeProvider.CreateTimer`-based tick mechanism so the test can advance a
|
||||
fake clock and assert deterministically how many ticks fire. Option (b) is
|
||||
the better long-term fix and matches the pattern used elsewhere in the
|
||||
module's tests.
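
Option (b) could look roughly like the following. `TickSource` is a hypothetical helper, not a type in the module; the only framework call it relies on is `TimeProvider.CreateTimer`, available from .NET 8.

```csharp
using System;
using System.Threading;

// Hypothetical tick source driven entirely by the injected TimeProvider.
public sealed class TickSource : IDisposable
{
    private readonly ITimer _timer;
    private int _ticks;

    public TickSource(TimeProvider timeProvider, TimeSpan period, Action onTick)
    {
        // The timer is created through the provider, so a fake provider
        // controls exactly when (and how many times) the callback fires.
        _timer = timeProvider.CreateTimer(
            _ =>
            {
                Interlocked.Increment(ref _ticks);
                onTick();
            },
            state: null,
            dueTime: period,
            period: period);
    }

    // Number of ticks observed so far.
    public int Ticks => Volatile.Read(ref _ticks);

    public void Dispose() => _timer.Dispose();
}
```

In production the loop would pass `TimeProvider.System`. In tests, a fake provider such as `FakeTimeProvider` from the `Microsoft.Extensions.TimeProvider.Testing` package can advance the clock by a known number of periods and assert the exact tick count, with no `Task.Delay` budget.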
|
||||
|
||||
### HealthMonitoring-023 — `StoreAndForwardBufferDepths_IsEmptyPlaceholder` test name is stale; it now covers the default-state contract, not a placeholder
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Documentation & comments |
|
||||
| Status | Open |
|
||||
| Location | `tests/ScadaLink.HealthMonitoring.Tests/SiteHealthCollectorTests.cs:117-122` |
|
||||
|
||||
**Description**
|
||||
|
||||
The test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` was originally named
|
||||
to codify the HealthMonitoring-001 bug ("`SetStoreAndForwardDepths` has no
|
||||
callers, so `StoreAndForwardBufferDepths` is always empty"). HealthMonitoring-001
|
||||
is `Resolved` — `HealthReportSender` now populates per-category depths from
|
||||
the S&F engine, and the same test class has `SetStoreAndForwardDepths_ReflectedInReport`
|
||||
covering the populated path. The "placeholder" test still passes because it
|
||||
constructs a fresh collector and never calls the setter, so its assertion
|
||||
(`Assert.Empty(report.StoreAndForwardBufferDepths)`) is now testing the
|
||||
**default empty state of an un-configured collector**. The HealthMonitoring-001
|
||||
resolution note explicitly chose to keep it as "the collector-level
|
||||
default-state test", but the test method name and the implied semantics no
|
||||
longer match.
|
||||
|
||||
A maintainer reading the test name today will misread it as documentation that
|
||||
the metric is unimplemented (which it isn't), and may waste time investigating
|
||||
a non-bug.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled`
|
||||
(or similar) and update the test body's intent — purely a documentation /
|
||||
maintainability fix; no behaviour change.
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.Host` |
|
||||
| Design doc | `docs/requirements/Component-Host.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 7 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -48,6 +48,38 @@ Serilog sink setup is hard-coded in `Program.cs` rather than configuration-drive
|
||||
REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
|
||||
exception type including permanent schema-validation failures (Host-015).
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
All fifteen prior findings (Host-001..015) remain `Resolved` in the current tree
|
||||
and the fixes introduced for them (Host-001's predicate, the externalised
|
||||
secrets, the Site GrpcPort/RemotingPort/seed-port validation rules, the escaped
|
||||
HOCON builder with `DownIfAlone` and millisecond-precision durations, the
|
||||
configuration-driven Serilog sinks, the transient-only `StartupRetry`
|
||||
classifier) are all still in place. This re-review walked the ten checklist
|
||||
categories over the full module again and recorded seven new findings, none of
|
||||
them crash/data-loss class. Host-016 (Medium) mirrors the resolved Host-004
|
||||
shipped-config bug on the **Communication** side: `appsettings.Site.json`'s
|
||||
second `CentralContactPoints` entry points at the site's own remoting port
|
||||
(`localhost:8082`) instead of central, an incorrect dev example that is easily copied
|
||||
into multi-central deployments. Host-017 (Medium) flags a partial REQ-HOST-7
|
||||
implementation — the documented site-shutdown ordering (stop accepting streams
|
||||
first, cancel active streams via `IHostApplicationLifetime.ApplicationStopping`,
|
||||
then tear down actors) is not wired: the site path registers no
|
||||
`ApplicationStopping` handler that signals `SiteStreamGrpcServer`, and the gRPC
|
||||
server exposes no cancel-all-streams entry point. The remaining five are Low:
|
||||
`NodeOptions.NodeName` (the operator-configured value stamped on
|
||||
`AuditLog.SourceNode`) is absent from both shipped per-role configs even though
|
||||
the docker per-node configs set it (Host-018); the migration `StartupRetry`
|
||||
call passes `default` for `CancellationToken`, so a SIGTERM during the
|
||||
bounded-retry window is ignored for up to ~2 minutes (Host-019);
|
||||
`LoggerConfigurationFactory` layers `MinimumLevel.Is` over
|
||||
`ReadFrom.Configuration`, so any `Serilog:MinimumLevel` an operator sets is
|
||||
silently overridden by `ScadaLink:Logging:MinimumLevel` (Host-020); the
|
||||
shipped `appsettings.json` carries a Microsoft `Logging:LogLevel` block but
|
||||
Serilog is the only logger provider, so the section is dead config (Host-021);
|
||||
and `ParseLevel` silently swallows an unrecognised `MinimumLevel` value (e.g.
|
||||
a typo) and falls back to `Information` with no warning (Host-022).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -63,6 +95,21 @@ exception type including permanent schema-validation failures (Host-015).
|
||||
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
|
||||
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |
|
||||
|
||||
_Re-review (2026-05-28, `1eb6e97`):_
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ☑ | Re-review: `appsettings.Site.json` second `CentralContactPoints` entry targets the site's own remoting port instead of central (Host-016) — same defect class as the resolved Host-004 seed-list bug. |
|
||||
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping, role-scoped site singletons, ClusterClient initial-contact wiring all reviewed; no new issues. |
|
||||
| 3 | Concurrency & thread safety | ☑ | `_trackedDisposables` is locked on both sides of the lifecycle; `_actorSystem` publication is safe via the IHost startup `await` boundary. New Low: `StartupRetry` migration call passes `default` `CancellationToken`, so SIGTERM during the retry window is ignored (Host-019). |
|
||||
| 4 | Error handling & resilience | ☑ | `IsTransientDatabaseFault` correctly classifies socket / timeout / SqlException; the retry helper itself remains sound. Host-019 is the resilience gap. |
|
||||
| 5 | Security | ☑ | Secrets stay externalised; the `_secrets` placeholder comment is intact. No new issues. |
|
||||
| 6 | Performance & resource management | ☑ | No new undisposed resources; gRPC stream lifetime cap remains correct. No new issues. |
|
||||
| 7 | Design-document adherence | ☑ | Re-review: REQ-HOST-7 site-shutdown ordering — stop accepting new streams, cancel active streams via `ApplicationStopping`, then tear down actors — is not wired in `Program.cs` (Host-017). |
|
||||
| 8 | Code organization & conventions | ☑ | Re-review: `NodeOptions.NodeName` is absent from the shipped per-role configs even though it stamps `AuditLog.SourceNode` (Host-018); the appsettings `Logging:LogLevel` Microsoft section is dead config under Serilog (Host-021). |
|
||||
| 9 | Testing coverage | ☑ | Strong existing suite. No coverage for the Site `CentralContactPoints` second-entry rule (Host-016), the site-shutdown ordering (Host-017), the `NodeName`-absent shipped config (Host-018), the unused `CancellationToken` parameter (Host-019), the `MinimumLevel.Is` override semantics (Host-020) or the `ParseLevel` silent fallback (Host-022). |
|
||||
| 10 | Documentation & comments | ☑ | Re-review: layered `MinimumLevel.Is` / `ReadFrom.Configuration` semantics are not surfaced — an operator-set `Serilog:MinimumLevel` is silently overridden by `ScadaLink:Logging:MinimumLevel` (Host-020); `ParseLevel` silently coerces a misspelled level to `Information` with no warning (Host-022). |
|
||||
|
||||
## Findings
|
||||
|
||||
### Host-001 — `/health/ready` includes the leader-only `active-node` check
|
||||
@@ -777,3 +824,278 @@ site now passes it. Regression tests in `StartupRetryTests`:
|
||||
when `isTransient` returns false) and `ExecuteWithRetry_TransientThenPermanent_StopsAtPermanent`
|
||||
(retries a `TimeoutException` then stops at a permanent `InvalidOperationException`).
|
||||
Full Host suite green (182 passed).
|
||||
|
||||
### Host-016 — Site `CentralContactPoints` second entry targets the site's own remoting port
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/appsettings.Site.json:33-37` |
|
||||
|
||||
**Description**
|
||||
|
||||
The shipped site config sets `Node:RemotingPort = 8082` and lists
|
||||
`Communication:CentralContactPoints` as
|
||||
`["akka.tcp://scadalink@localhost:8081", "akka.tcp://scadalink@localhost:8082"]`.
|
||||
The second contact point — port `8082` — is the **site's own** remoting endpoint,
|
||||
not a central node. `SiteCommunicationActor` / `ClusterClient` uses these
|
||||
addresses as initial contacts when discovering the central
|
||||
`ClusterClientReceptionist`; a contact pointing at the site itself can never
|
||||
reach the central receptionist and will be a permanent failure in the
|
||||
initial-contact rotation. For the single-node dev loopback layout the first
|
||||
contact (`8081`, central) succeeds and the bug is masked, but this is exactly
|
||||
the kind of dev-config "example" that gets duplicated into multi-central
|
||||
deployments — the same failure mode the resolved Host-004 finding called out
|
||||
for the seed-node list. `StartupValidator` validates seed nodes against the
|
||||
gRPC port (Host-004) but does not cross-check `CentralContactPoints` against
|
||||
the site's own `RemotingPort`, so the misconfiguration passes silently.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Correct the shipped site example to list two central remoting endpoints (e.g.
|
||||
`localhost:8081` for `central-a` and a distinct port for `central-b` in a
|
||||
multi-node layout). Consider extending `StartupValidator` to reject any
|
||||
`Communication:CentralContactPoints` entry whose host+port matches this site
|
||||
node's `NodeHostname`+`RemotingPort`. Add a regression test in
|
||||
`StartupValidatorTests` mirroring `Site_SeedNodeOnGrpcPort_FailsValidation`.
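
A possible shape for the cross-check, written as a standalone helper. `ContactPointRules.FindSelfReferences` is a hypothetical name, and the parsing assumes contact points of the form `akka.tcp://system@host:port` with no trailing path, as in the shipped example.

```csharp
using System;
using System.Collections.Generic;

public static class ContactPointRules
{
    // Returns the contact points whose host:port equals this node's own remoting endpoint.
    public static List<string> FindSelfReferences(
        IEnumerable<string> contactPoints, string nodeHostname, int remotingPort)
    {
        var offenders = new List<string>();
        foreach (string contact in contactPoints)
        {
            // Expected shape: akka.tcp://system@host:port
            int at = contact.LastIndexOf('@');
            int colon = contact.LastIndexOf(':');
            if (at < 0 || colon <= at)
            {
                continue; // Not in the expected shape; left to other validation rules.
            }

            string host = contact.Substring(at + 1, colon - at - 1);
            bool portOk = int.TryParse(contact.Substring(colon + 1), out int port);

            if (portOk
                && port == remotingPort
                && string.Equals(host, nodeHostname, StringComparison.OrdinalIgnoreCase))
            {
                offenders.Add(contact);
            }
        }

        return offenders;
    }
}
```

Applied to the shipped site config (`RemotingPort = 8082`, hostname `localhost`), this flags the second entry and leaves the first untouched.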
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Open._
|
||||
|
||||
### Host-017 — Site-shutdown ordering from REQ-HOST-7 is not wired
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/Program.cs:229-265`, `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs` |
|
||||
|
||||
**Description**
|
||||
|
||||
REQ-HOST-7 documents an explicit four-step shutdown sequence for site nodes:
|
||||
"(1) On `CoordinatedShutdown`, stop accepting new gRPC streams first.
|
||||
(2) Cancel all active gRPC streams (triggering client-side reconnect).
|
||||
(3) Tear down actors.
|
||||
(4) Use `IHostApplicationLifetime.ApplicationStopping` to signal the gRPC
|
||||
server." The site path in `Program.cs` (the `role == "Site"` branch) registers
|
||||
no `IHostApplicationLifetime.ApplicationStopping` callback, and
|
||||
`SiteStreamGrpcServer` exposes no "stop accepting" / "cancel all streams"
|
||||
entry point — it has `SetReady` but no corresponding `SetUnavailable` or
|
||||
`CancelAllStreams`. In practice, on `SIGTERM` Kestrel closes its listener
|
||||
naturally and `AkkaHostedService.StopAsync` runs Akka `CoordinatedShutdown`,
|
||||
but there is no explicit, ordered handoff that meets the documented contract:
|
||||
in-flight streams are not actively cancelled before actors begin tearing down,
|
||||
so clients see a stream that goes silent (and only times out via gRPC
|
||||
keepalive) rather than a clean `Cancelled` they can reconnect on. This is a
|
||||
contract-vs-code drift — either the design doc is overstating what is
|
||||
implemented, or the implementation is incomplete.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a `SiteStreamGrpcServer.CancelAllStreams()` method that flips a "shutting
|
||||
down" flag (so `SubscribeSite` immediately fails new streams with
|
||||
`StatusCode.Unavailable`) and cancels every entry's `Cts` in the `_streams`
|
||||
map. In `Program.cs` site branch, resolve `IHostApplicationLifetime` and
|
||||
register a callback on `ApplicationStopping` that calls `CancelAllStreams()`
|
||||
before the Akka hosted service runs `CoordinatedShutdown` (or order via
|
||||
`AkkaHostedService.StopAsync` itself — `IHostedService.StopAsync` runs in
|
||||
reverse-registration order, so the gRPC server's lifetime can be sequenced
|
||||
before Akka shutdown). Alternatively, reconcile REQ-HOST-7 with the actual
|
||||
implementation if the explicit ordering is no longer intended. Add an
|
||||
integration test under `tests/ScadaLink.Host.Tests` that starts a site host,
|
||||
opens a stream, triggers shutdown, and asserts the stream completes with
|
||||
`Cancelled` before the actor system tears down.
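
The "shutting down" flag plus cancel-all behaviour can be sketched in isolation. `StreamRegistry` is a hypothetical stand-in for the relevant part of `SiteStreamGrpcServer`; the real `_streams` map and its entry type are not shown in this review, so the dictionary below is an assumption.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical registry of active streams, keyed by a stream identifier.
public sealed class StreamRegistry
{
    private readonly ConcurrentDictionary<string, CancellationTokenSource> _streams = new();
    private volatile bool _shuttingDown;

    // Returns false once shutdown has begun; a gRPC handler would then fail the
    // call with StatusCode.Unavailable instead of opening a stream.
    public bool TryRegister(string streamId, out CancellationToken token)
    {
        token = default;
        if (_shuttingDown)
        {
            return false;
        }

        var cts = new CancellationTokenSource();
        _streams[streamId] = cts;

        // Re-check to close the race with a concurrent CancelAllStreams call.
        if (_shuttingDown)
        {
            cts.Cancel();
        }

        token = cts.Token;
        return true;
    }

    // Stops accepting new streams and cancels every active one.
    // Returns the number of streams cancelled by this call.
    public int CancelAllStreams()
    {
        _shuttingDown = true;
        int cancelled = 0;
        foreach (var entry in _streams)
        {
            if (!entry.Value.IsCancellationRequested)
            {
                entry.Value.Cancel();
                cancelled++;
            }
        }

        return cancelled;
    }
}
```

Wiring it to host shutdown would then be a single registration on the lifetime token, for example `app.Lifetime.ApplicationStopping.Register(() => registry.CancelAllStreams())`, where `registry` is whatever instance the site branch resolves.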
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Open._
|
||||
|
||||
### Host-018 — Shipped per-role configs omit `NodeOptions.NodeName`, leaving `SourceNode` null
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/appsettings.Central.json`, `src/ScadaLink.Host/appsettings.Site.json`, `src/ScadaLink.Host/NodeOptions.cs:10-16` |
|
||||
|
||||
**Description**
|
||||
|
||||
`NodeOptions.NodeName` is documented as "the operator-configured semantic node
|
||||
name used to stamp the SourceNode column on audit rows", with conventional
|
||||
values `node-a`/`node-b` for site nodes and `central-a`/`central-b` for
|
||||
central nodes. The CLAUDE.md "Centralized Audit Log" key-decision section
|
||||
calls this out: `SourceNode` is meant to be carried verbatim through audit
|
||||
telemetry and reconciliation, and is indexed via
|
||||
`IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc)`. The docker per-node
|
||||
configs (`docker/central-node-a/appsettings.Central.json`,
|
||||
`docker/site-a-node-a/appsettings.Site.json`, etc.) all set
|
||||
`ScadaLink:Node:NodeName`. The **shipped, default** per-role files in
|
||||
`src/ScadaLink.Host/` — the templates a developer running the binary
|
||||
directly will use — do not. `NodeIdentityProvider` normalises an empty
|
||||
`NodeName` to `null`, so dev audit rows carry a null `SourceNode` and the
|
||||
indexed lookup never narrows. The dev examples should match the docker
|
||||
examples; at minimum the field should appear in the shipped templates with a
|
||||
placeholder explaining the convention.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add `"NodeName": "central-a"` (or a placeholder like `"${NODE_NAME}"`) to
|
||||
`appsettings.Central.json` and `"NodeName": "node-a"` to
|
||||
`appsettings.Site.json`, with a short comment that the value must be set
|
||||
per-node in multi-node deployments. Consider validating in `StartupValidator`
|
||||
that `NodeName` is non-empty, or accept the null and document explicitly that
|
||||
single-node dev deployments leave `SourceNode` null.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Open._
|
||||
|
||||
### Host-019 — Migration `StartupRetry` call drops the host `CancellationToken`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/Program.cs:154-165` |
|
||||
|
||||
**Description**
|
||||
|
||||
`StartupRetry.ExecuteWithRetryAsync` accepts an optional
|
||||
`CancellationToken cancellationToken = default` and observes it both at the
|
||||
top of each attempt and inside the `Task.Delay` between retries. The migration
|
||||
call site in `Program.cs` passes no token, so the helper runs with
|
||||
`CancellationToken.None`. With `maxAttempts: 8`, `initialDelay: 2s`, and the
|
||||
30s cap, a database that stays unreachable can keep the retry loop alive for
|
||||
~2 minutes before the host process responds to `SIGTERM` / `Ctrl+C` /
|
||||
Windows-Service stop. The host is not yet running at this point in the
|
||||
`Program.cs` startup pipeline (the `app` is built but not started), but
|
||||
`app.Lifetime.ApplicationStopping` is already available the moment
|
||||
`builder.Build()` returns. Threading it into the retry call honours the host
|
||||
lifecycle and matches the helper's documented contract.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Pass `app.Lifetime.ApplicationStopping` (or `CancellationToken.None`
|
||||
explicitly with a comment if intentional) into
|
||||
`StartupRetry.ExecuteWithRetryAsync`. Add a `StartupRetryTests` case
|
||||
exercising token-cancellation mid-backoff.
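
To illustrate why forwarding the token matters, here is a reduced retry helper. `RetrySketch.ExecuteWithRetryAsync` is a hypothetical simplification; the real `StartupRetry` signature is only partly described above, and its transient-fault classifier is omitted here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RetrySketch
{
    // Retries the operation with doubling backoff, capped at maxDelay.
    // The token is observed before each attempt and during each backoff delay.
    public static async Task ExecuteWithRetryAsync(
        Func<Task> operation,
        int maxAttempts,
        TimeSpan initialDelay,
        TimeSpan maxDelay,
        CancellationToken cancellationToken = default)
    {
        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                await operation();
                return;
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Task.Delay throws OperationCanceledException as soon as the
                // token fires, which ends the loop instead of sleeping on.
                await Task.Delay(delay, cancellationToken);
                delay = delay * 2 > maxDelay ? maxDelay : delay * 2;
            }
        }
    }
}
```

With `CancellationToken.None` the same call keeps backing off until the attempts run out; with a lifetime token it stops within the current delay.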
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Open._
|
||||
|
||||
### Host-020 — `MinimumLevel.Is` silently overrides any operator-set `Serilog:MinimumLevel`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Documentation & comments |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/LoggerConfigurationFactory.cs:36-43` |
|
||||
|
||||
**Description**
|
||||
|
||||
`LoggerConfigurationFactory.Build` reads the `Serilog` configuration section
|
||||
via `ReadFrom.Configuration(configuration)` (which can include a
|
||||
`MinimumLevel` block — the standard Serilog way to set the floor) and **then**
|
||||
calls `.MinimumLevel.Is(minimumLevel)` derived from
|
||||
`ScadaLink:Logging:MinimumLevel`. Serilog's fluent builder applies the later
|
||||
call, so any `Serilog:MinimumLevel:Default` an operator sets is silently
|
||||
overridden by `ScadaLink:Logging:MinimumLevel` (or by its
|
||||
`Information` fallback when the ScadaLink key is absent). There are now two
|
||||
documented configuration paths for the same setting with non-obvious
|
||||
precedence, and the override direction is the opposite of what most Serilog
|
||||
users would expect (the more-specific `Serilog` section being the authority).
|
||||
The XML doc on `Build` says "the explicit `MinimumLevel.Is` pins the floor"
|
||||
but does not warn that the floor *overrides* the Serilog section's own
|
||||
`MinimumLevel`.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Pick one mechanism: either (a) drop the `MinimumLevel.Is` call and let
|
||||
`ReadFrom.Configuration` consume `Serilog:MinimumLevel`, migrating any docs/
|
||||
deployments that reference `ScadaLink:Logging:MinimumLevel`; or (b) keep the
|
||||
current "ScadaLink:Logging" path and reject `Serilog:MinimumLevel` if present
|
||||
(throw at startup so the operator sees the conflict). At minimum, expand the
|
||||
XML doc + REQ-HOST-8 to spell out the precedence explicitly.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Open._
|
||||
|
||||
### Host-021 — Microsoft `Logging:LogLevel` section in `appsettings.json` is dead config under Serilog
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/appsettings.json:2-6` |
|
||||
|
||||
**Description**
|
||||
|
||||
`appsettings.json` carries a Microsoft `Logging:LogLevel:Default = Information`
|
||||
block. The `Logging:LogLevel` map feeds the Microsoft.Extensions.Logging filter
|
||||
rules (`LoggerFilterOptions`) and the per-provider settings that are
|
||||
bound from the standard `Logging` section. The Host
|
||||
calls `builder.Host.UseSerilog()`, which replaces the default
|
||||
`ILoggerFactory` setup with Serilog as the **only** logger provider; Serilog
|
||||
is configured through `ReadFrom.Configuration(...)`, which consumes the
|
||||
`Serilog` section, **not** `Logging:LogLevel`. The result is that an operator
|
||||
editing `Logging:LogLevel:Default` (a very natural thing to try, since it is
|
||||
the .NET convention) sees no behaviour change — the section is dead config.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either remove the `Logging:LogLevel` block from `appsettings.json` (Serilog
|
||||
owns logging configuration in this Host), or replace it with a brief comment
|
||||
explaining it is intentionally retained for non-Serilog tooling. Document the
|
||||
authoritative location (`Serilog` + `ScadaLink:Logging`) in
|
||||
`Component-Host.md` REQ-HOST-8 if not already explicit.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Open._
|
||||
|
||||
### Host-022 — `ParseLevel` silently coerces unrecognised `MinimumLevel` to `Information`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/LoggerConfigurationFactory.cs:50-55` |
|
||||
|
||||
**Description**
|
||||
|
||||
`LoggerConfigurationFactory.ParseLevel` uses
|
||||
`Enum.TryParse<LogEventLevel>(level, ignoreCase: true, out var parsed)` and
|
||||
returns `LogEventLevel.Information` when parsing fails — without logging the
|
||||
fallback. An operator who sets
|
||||
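`ScadaLink:Logging:MinimumLevel = "Informaiton"` (a common typo) or any other
|
||||
unrecognised name gets the default level silently (comma-separated or numeric
|
||||
input such as `"Verbose,Debug"` does not hit the fallback, because
|
||||
`Enum.TryParse` accepts it and combines the values, here yielding `Debug`);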
|
||||
there is no warning, no log line, no startup error. Combined with Host-020
|
||||
(this is the only mechanism that pins the floor), a misspelt value is
|
||||
invisible until someone wonders why the level change "didn't take". The
|
||||
helper is small and could either fail fast in `StartupValidator` or emit a
|
||||
console warning before the logger is configured.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
In `LoggerConfigurationFactory.Build`, when `loggingOptions.MinimumLevel` is
|
||||
non-null/non-blank but does not parse to a valid `LogEventLevel`, write a
|
||||
`Console.Error.WriteLine` warning (the logger is not yet built) and proceed
|
||||
with `Information`. Alternatively, validate the value in `StartupValidator`
|
||||
and fail fast — that matches the pattern used for other ScadaLink
|
||||
configuration keys. Add a `LoggerConfigurationTests` case asserting the
|
||||
behaviour you choose.
|
||||
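A minimal sketch of the warn-and-fall-back option, under stated assumptions: the local `LogEventLevel` enum only mirrors the member names of Serilog's type so the example has no package dependency, and `LevelParsingSketch` is a hypothetical helper, not the existing `ParseLevel`. It matches exact member names because `Enum.TryParse` on its own also accepts numeric strings and comma-separated names.

```csharp
using System;

// Local stand-in mirroring Serilog's LogEventLevel member names (assumption for the sketch).
public enum LogEventLevel { Verbose, Debug, Information, Warning, Error, Fatal }

public static class LevelParsingSketch
{
    // Accepts only a single, exact (case-insensitive) member name.
    public static bool TryParseStrict(string? raw, out LogEventLevel level)
    {
        level = LogEventLevel.Information;
        if (string.IsNullOrWhiteSpace(raw)) return false;

        var candidate = raw.Trim();
        foreach (var name in Enum.GetNames<LogEventLevel>())
        {
            if (string.Equals(name, candidate, StringComparison.OrdinalIgnoreCase))
            {
                level = Enum.Parse<LogEventLevel>(name);
                return true;
            }
        }
        return false;
    }

    // Warns on stderr (the logger is not built yet) and falls back to Information.
    public static LogEventLevel ParseOrWarn(string? raw)
    {
        if (string.IsNullOrWhiteSpace(raw)) return LogEventLevel.Information; // key absent
        if (TryParseStrict(raw, out var level)) return level;

        Console.Error.WriteLine(
            $"Unrecognised MinimumLevel '{raw}'; falling back to Information.");
        return LogEventLevel.Information;
    }
}
```

The fail-fast alternative would call `TryParseStrict` from the startup validation step and throw when it returns `false` for a non-blank value.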
|
||||
**Resolution**
|
||||
|
||||
_Open._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.InboundAPI` |
|
||||
| Design doc | `docs/requirements/Component-InboundAPI.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 8 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -64,6 +64,66 @@ statement that the timeout covers routed calls (InboundAPI-016); and (4) `RouteH
|
||||
| 9 | Testing coverage | ☑ | Re-review: `RouteHelper`/`RouteTarget` (WP-4 routing) entirely untested (InboundAPI-017); validators/executor/filter well covered. |
|
||||
| 10 | Documentation & comments | ☑ | `ApiKeyValidationResult.NotFound` XML/name says "NotFound" but returns HTTP 400 — misleading (InboundAPI-013). |
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
All 17 prior findings remain `Resolved`. The module has grown materially since the
|
||||
last pass — a new `AuditWriteMiddleware` (Audit Log #23 M4 Bundle D) now lives under
|
||||
`src/ScadaLink.InboundAPI/Middleware/`, the `ApiKeyValidator` was rewired to hash the
|
||||
candidate with `IApiKeyHasher` (ConfigurationDatabase-012), and an `IInstanceRouter`
|
||||
seam was introduced. This re-review re-walked all 10 checklist categories against
|
||||
`1eb6e97` and surfaced **8 new findings** concentrated on the new audit middleware
|
||||
and a stranded follow-up from InboundAPI-008:
|
||||
|
||||
1. The InboundAPI-008 resolution explicitly deferred registering an `IActiveNodeGate`
|
||||
implementation in `ScadaLink.Host` as a "follow-up outside this module's scope" —
|
||||
that follow-up is still unfulfilled (no production registration anywhere in
|
||||
`src/ScadaLink.Host/`), so the design-mandated standby-node gating is silently
|
||||
disabled in production today (`InboundAPI-022`, High).
|
||||
2. `AuditWriteMiddleware` is wired in `Program.cs` against `/api/*` rather than the
|
||||
specific `POST /api/{methodName}` route, so GETs against `/api/audit/query` and
|
||||
`/api/audit/export` (audit query endpoints — themselves not script invocations)
|
||||
now emit spurious `AuditChannel.ApiInbound`/`InboundRequest` rows back into the
|
||||
audit log with `Target` set to the last path segment (`InboundAPI-025`, Medium).
|
||||
3. The middleware fires its audit write as `_ = _auditWriter.WriteAsync(evt)` — the
|
||||
wrapping try/catch only catches synchronous throws, so a faulted async writer
|
||||
task is unobserved and the row silently disappears with no log line
|
||||
(`InboundAPI-018`, Medium).
|
||||
4. `ParentExecutionId` correlation flows only through `RouteToCallRequest` —
|
||||
`RouteToGetAttributesRequest`/`RouteToSetAttributesRequest` have no
|
||||
`ParentExecutionId` field, so attribute reads/writes from inbound scripts lose
|
||||
the inbound→site execution-tree link the Audit Log decision in CLAUDE.md
|
||||
describes (`InboundAPI-021`, Medium).
|
||||
5. `EndpointExtensions.HandleInboundApiRequest` — the entire wiring composition
|
||||
that ties validator/executor/route/audit together — has no test coverage; only
|
||||
the components it composes are tested (`InboundAPI-023`, Low).
|
||||
6. `EndpointExtensions.HandleInboundApiRequest` does
|
||||
`ContentType?.Contains("json")` (case-sensitive) so a request with
|
||||
`application/JSON` and no Content-Length silently skips JSON body parsing
|
||||
(`InboundAPI-020`, Low).
|
||||
7. `AuditWriteMiddleware.InvokeAsync` calls `EnableBuffering()` unconditionally
|
||||
before the empty-body short-circuit, allocating a `FileBufferingReadStream` for
|
||||
every request including bodyless ones (`InboundAPI-019`, Low).
|
||||
|
||||
Severity mix: 1 High, 3 Medium, 4 Low — no Critical. (The eighth finding —
|
||||
`InboundAPI-024`, Low — is a defensive watch-list item flagging that
|
||||
`_knownBadMethods` has no explicit size cap; it is bounded *in practice* today by the
|
||||
configuration database, but the invariant is undocumented.)
|
||||
|
||||
## Checklist coverage — 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ☑ | `ContentType?.Contains("json")` is case-sensitive (InboundAPI-020). |
|
||||
| 2 | Akka.NET conventions | ☑ | ASP.NET-hosted, no actors of its own; routes via `IInstanceRouter`/`CommunicationService`. No new issues. |
|
||||
| 3 | Concurrency & thread safety | ☑ | `ConcurrentDictionary` handler cache (post-001/002 fix). New audit middleware is per-request scoped, no shared mutable state. No new issues. |
|
||||
| 4 | Error handling & resilience | ☑ | Audit `WriteAsync` is fire-and-forget; async faults are unobserved (InboundAPI-018). |
|
||||
| 5 | Security | ☑ | `IActiveNodeGate` not registered in Host — standby-node gating disabled in production (InboundAPI-022). |
|
||||
| 6 | Performance & resource management | ☑ | `EnableBuffering()` unconditional on bodyless requests (InboundAPI-019); audit middleware wraps `Response.Body` and mints `ExecutionId` for non-script /api routes (InboundAPI-025). |
|
||||
| 7 | Design-document adherence | ☑ | `ParentExecutionId` not stamped on attribute-read/write routed messages (InboundAPI-021). InboundAPI-008's deferred Host registration still unfulfilled (InboundAPI-022). |
|
||||
| 8 | Code organization & conventions | ☑ | No new issues. |
|
||||
| 9 | Testing coverage | ☑ | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test (InboundAPI-023); middleware/filter/validator/executor/route are individually covered. |
|
||||
| 10 | Documentation & comments | ☑ | No new issues. |
|
||||
|
||||
## Findings
|
||||
|
||||
### InboundAPI-001 — Singleton script handler cache mutated without synchronization
|
||||
@@ -844,3 +904,329 @@ now depends on `IInstanceLocator` + `IInstanceRouter` (both substitutable). Adde
|
||||
for each routed method, `GetAttribute` delegating to the batch `GetAttributes` and
|
||||
returning `null` for an absent key, `SetAttribute` delegating to `SetAttributes`, and
|
||||
the InboundAPI-016 deadline-token inheritance behaviour. All 15 pass.
|
||||
|
||||
### InboundAPI-018 — `AuditWriteMiddleware` fires `WriteAsync` as `_ = task` — faulted async writes are unobserved
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.InboundAPI/Middleware/AuditWriteMiddleware.cs:257` |
|
||||
|
||||
**Description**
|
||||
|
||||
`EmitInboundAudit` calls `_ = _auditWriter.WriteAsync(evt);` — the returned `Task` is
|
||||
discarded with the discard operator inside a synchronous `try` block. The wrapping
|
||||
`try/catch (Exception ex)` (lines 198–266) only catches a *synchronous* throw before
|
||||
the writer returns a task. Once `WriteAsync` returns a task, any exception that
|
||||
faults that task (e.g. a DB timeout in the central audit writer, a serialization
|
||||
failure, a cancellation that bubbles up) is never observed: it is not logged, it
|
||||
does not increment the `CentralAuditWriteFailures` health-monitoring counter the
|
||||
design doc references ("Fail-soft semantics" paragraph), and the audit row is
|
||||
silently lost. In .NET, an unobserved task exception surfaces only through
|
||||
`TaskScheduler.UnobservedTaskException`, raised when the faulted task is
|
||||
finalized by the garbage collector, and by default it does not terminate the process;
|
||||
either way, the middleware itself has no control over what (if anything) happens
|
||||
on a fault. The XML doc comment at line 255 claims "the writer itself swallows"
|
||||
but this is an implicit cross-component contract: the abstraction
|
||||
`ICentralAuditWriter.WriteAsync` returns `Task` and makes no such guarantee, and
|
||||
the only test that exercises a throwing writer (`AuditWriter_Throws_*` in
|
||||
`AuditWriteMiddlewareTests.cs`) uses an `OnWrite` callback that throws
|
||||
*synchronously*, not asynchronously — so the async fault path is not covered by
|
||||
tests either.
|
||||
|
||||
This matters because Component-InboundAPI.md states that audit-emission failures
|
||||
must increment `CentralAuditWriteFailures` (Health Monitoring #11) — a counter
|
||||
that, with the current fire-and-forget, will under-count async-faulted writes.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) await the write and rely on the surrounding try/catch to log the
|
||||
failure, accepting an extra await on the request hot path; or (b) keep the
|
||||
fire-and-forget for latency but attach a `ContinueWith(t => ..., OnlyOnFaulted)`
|
||||
that logs the fault and increments the failure counter, so a faulted async write
|
||||
is at least observed. Option (b) preserves "audit emission never blocks the HTTP
|
||||
response" while restoring the visibility the design assumes. Add a regression
|
||||
test using a writer whose `WriteAsync` returns a faulted `Task` (not a
|
||||
synchronous throw) to pin the new contract.
|
||||
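A minimal sketch of option (b), assuming the caller still discards the returned task. It uses an `async` wrapper rather than `ContinueWith(..., OnlyOnFaulted)`; the effect is the same for faults, and it also covers synchronous throws. `FireAndForgetSketch`, its counter, and the `log` callback are hypothetical stand-ins for the middleware's writer, the `CentralAuditWriteFailures` counter, and its logger.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class FireAndForgetSketch
{
    private static int _failures; // stand-in for the CentralAuditWriteFailures counter

    public static int Failures => Volatile.Read(ref _failures);

    // Call as `_ = WriteObservedAsync(...)`: the request path is not blocked,
    // but both synchronous throws and faulted tasks are logged and counted.
    public static async Task WriteObservedAsync(Func<Task> write, Action<Exception> log)
    {
        try
        {
            // write() runs synchronously up to its first incomplete await,
            // then control returns to the caller.
            await write().ConfigureAwait(false);
        }
        catch (Exception ex)
        {
            Interlocked.Increment(ref _failures);
            log(ex);
        }
    }
}
```

A regression test for the new contract would pass a writer that returns `Task.FromException(...)` rather than one that throws synchronously, then assert that the failure counter moved.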
|
||||
### InboundAPI-019 — `EnableBuffering()` called unconditionally on every request, including bodyless requests
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Performance & resource management |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.InboundAPI/Middleware/AuditWriteMiddleware.cs:141` |
|
||||
|
||||
**Description**
|
||||
|
||||
`InvokeAsync` always calls `ctx.Request.EnableBuffering()` before the empty-body
|
||||
short-circuit at `ReadBufferedRequestBodyAsync` line 289 (`if (request.ContentLength
|
||||
is 0) return (null, false);`). `EnableBuffering()` swaps the request stream for a
|
||||
`FileBufferingReadStream` whose construction allocates an internal buffer (default
|
||||
threshold ~30 KB before spilling to a temp file) regardless of whether the request
|
||||
actually has a body. The /api scope this middleware lives under will see at least
|
||||
some bodyless requests (e.g. GET `/api/audit/query`, which already shares the
|
||||
branch, see InboundAPI-025; future health checks; misbehaving clients) and each
|
||||
one pays the buffering allocation cost for no benefit.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Defer the `EnableBuffering()` call into `ReadBufferedRequestBodyAsync` after the
|
||||
`ContentLength is 0` check, or short-circuit in `InvokeAsync` before enabling
|
||||
buffering when `ContentLength is 0` and `Method is "GET" or "HEAD" or "DELETE"`.
|
||||
The win is a per-request `FileBufferingReadStream` allocation avoided on every
|
||||
bodyless request through the middleware.
|
||||
|
||||
### InboundAPI-020 — `ContentType.Contains("json")` is case-sensitive; `application/JSON` with no Content-Length skips body parsing
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.InboundAPI/EndpointExtensions.cs:70` |
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleInboundApiRequest` parses the JSON body only when
|
||||
`httpContext.Request.ContentLength > 0 || httpContext.Request.ContentType?.Contains("json") == true`.
|
||||
The `string.Contains(string)` overload used here is case-sensitive — a perfectly
|
||||
valid HTTP header `Content-Type: application/JSON` (uppercase) would yield
|
||||
`false` (`"application/JSON".Contains("json")` is `false`). With no
|
||||
Content-Length (e.g. chunked transfer-encoding) and an uppercase content type,
|
||||
the handler then leaves `body = null` and `ParameterValidator.Validate` runs
|
||||
against a missing body — so a method that declares any required parameter is
|
||||
rejected with "Missing required parameters" even though the caller did send a
|
||||
well-formed JSON body. RFC 7231 §3.1.1.1 defines media-type type and subtype
|
||||
tokens as case-insensitive, so `application/JSON` is equivalent to
|
||||
`application/json`; clients do sometimes uppercase these tokens, and the
|
||||
framework's own `MediaTypeHeaderValue` compares them case-insensitively.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Use the case-insensitive overload —
|
||||
`httpContext.Request.ContentType?.Contains("json", StringComparison.OrdinalIgnoreCase) == true`
|
||||
— or rely on the framework's `IsJson` check via
|
||||
`MediaTypeHeaderValue.TryParse`/`HttpRequest.HasJsonContentType()`. Add a
|
||||
regression test posting with `application/JSON` and Transfer-Encoding: chunked.
|
||||
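A short sketch contrasting the two `Contains` overloads, assuming the substring check is kept rather than replaced by `HasJsonContentType()`. `JsonContentTypeSketch` is a hypothetical helper name.

```csharp
using System;

public static class JsonContentTypeSketch
{
    // The current behaviour: ordinal, case-sensitive substring match.
    public static bool LooksLikeJsonCaseSensitive(string? contentType) =>
        contentType?.Contains("json") == true;

    // The recommended overload: ordinal, case-insensitive substring match.
    public static bool LooksLikeJson(string? contentType) =>
        contentType?.Contains("json", StringComparison.OrdinalIgnoreCase) == true;
}
```

A substring match still accepts any media type that merely contains "json", so parsing the header with the framework's media-type support is the stricter choice.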
|
||||
### InboundAPI-021 — `ParentExecutionId` correlation flows only through `Call`; attribute reads/writes lose the inbound→site execution-tree link
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:141-143`, `:182-183`, `:225-226`; `src/ScadaLink.Commons/Messages/InboundApi/RouteToInstanceRequest.cs:15-21`, `:36-40`, `:55-59` |
|
||||
|
||||
**Description**
|
||||
|
||||
CLAUDE.md's Centralized Audit Log section describes `ParentExecutionId` as the
|
||||
cross-execution spawn pointer that "every row of a spawned run carries" and
|
||||
specifically calls out "the inbound API → routed-site-script case". The current
|
||||
implementation honours this only on `RouteToCallRequest` — which carries
|
||||
`ParentExecutionId` as its trailing additive field (line 21 of
|
||||
`RouteToInstanceRequest.cs`) and is stamped by `RouteTarget.Call` with the
|
||||
inbound request's execution id at line 143 of `RouteHelper.cs`.
|
||||
|
||||
`RouteToGetAttributesRequest` and `RouteToSetAttributesRequest`, however, have
|
||||
**no `ParentExecutionId` field** and the matching `RouteTarget.GetAttributes` /
|
||||
`SetAttributes` methods (`RouteHelper.cs:182-183`, `:225-226`) never reference
|
||||
`_parentExecutionId`. So when an inbound API script reads or writes a site
|
||||
attribute via `Route.To("inst").GetAttribute(...)` /
|
||||
`Route.To("inst").SetAttribute(...)`, the site-side audit row for that
|
||||
trust-boundary action (an outbound-by-the-script DB / OPC write at the site) is
|
||||
emitted with `ParentExecutionId = null` and the execution-tree walk
|
||||
`IX_AuditLog_ParentExecution` cannot link it back to the spawning inbound
|
||||
request. The two-row pair (inbound + spawned site work) reverts to the
|
||||
"top-level / null" state the design says is the *fallback* for non-spawned runs.
|
||||
The asymmetry between `Call` and `GetAttributes`/`SetAttributes` is also surprising
|
||||
— a script author would reasonably expect the same correlation across all
|
||||
`Route.To(...)` calls.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a trailing `Guid? ParentExecutionId = null` field to
|
||||
`RouteToGetAttributesRequest` and `RouteToSetAttributesRequest` (additive
|
||||
trailing member, matches the message-evolution rule in CLAUDE.md); stamp it
|
||||
from `_parentExecutionId` in `RouteTarget.GetAttributes` and
|
||||
`RouteTarget.SetAttributes`; have the site-side handlers thread the field onto
|
||||
their emitted audit rows. Add a `RouteHelperTests` regression case asserting
|
||||
that an attribute read/write carries the inherited `ParentExecutionId`.
|
||||
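A sketch of the additive trailing member and the stamping step. Every member other than `ParentExecutionId` is a hypothetical placeholder, since the real message shapes are not shown here, and `RouteTargetSketch` only illustrates where `_parentExecutionId` would be threaded through.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical message shapes; only the trailing ParentExecutionId member is the point.
public sealed record RouteToGetAttributesRequest(
    string InstanceName,
    IReadOnlyList<string> AttributeNames,
    Guid? ParentExecutionId = null);

public sealed record RouteToSetAttributesRequest(
    string InstanceName,
    IReadOnlyDictionary<string, object?> Values,
    Guid? ParentExecutionId = null);

public sealed class RouteTargetSketch
{
    private readonly string _instanceName;
    private readonly Guid? _parentExecutionId;

    public RouteTargetSketch(string instanceName, Guid? parentExecutionId)
    {
        _instanceName = instanceName;
        _parentExecutionId = parentExecutionId;
    }

    // Stamp the inbound request's execution id on every routed message.
    public RouteToGetAttributesRequest BuildGet(params string[] names) =>
        new(_instanceName, names, _parentExecutionId);

    public RouteToSetAttributesRequest BuildSet(IReadOnlyDictionary<string, object?> values) =>
        new(_instanceName, values, _parentExecutionId);
}
```

Because the new member is trailing and defaults to `null`, existing call sites that construct the request without it keep compiling and keep the top-level behaviour.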
|
||||
### InboundAPI-022 — `IActiveNodeGate` has no production registration in Host — standby-node gating is silently disabled in production
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Security |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.InboundAPI/IActiveNodeGate.cs`, `src/ScadaLink.InboundAPI/InboundApiEndpointFilter.cs:52-60`; absent from `src/ScadaLink.Host/Program.cs` |
|
||||
|
||||
**Description**
|
||||
|
||||
InboundAPI-008's resolution adds `IActiveNodeGate` (lines 17–24 of
|
||||
`IActiveNodeGate.cs`) so a standby central node can refuse to serve the inbound
|
||||
API. `InboundApiEndpointFilter.InvokeAsync` consults the gate at line 52
|
||||
(`var gate = httpContext.RequestServices.GetService<IActiveNodeGate>();`), and
|
||||
when `gate is { IsActiveNode: false }` returns HTTP 503. The filter's behaviour
|
||||
when **no implementation is registered** (line 51 comment) is to fall through and
|
||||
serve the request — the resolution paragraph for InboundAPI-008 closes with:
|
||||
|
||||
> "Follow-up (outside this module's scope): `ScadaLink.Host` should register an
|
||||
> `IActiveNodeGate` implementation backed by `ActiveNodeHealthCheck` /
|
||||
> `Cluster.State.Leader` in the central-role branch of `Program.cs` so the gate is
|
||||
> actually enforced in production; until then the endpoint defaults to "allow"."
|
||||
|
||||
A grep of the entire `src/ScadaLink.Host/` tree at `1eb6e97` finds **zero**
|
||||
`IActiveNodeGate` registrations: `grep -rn "IActiveNodeGate\|AddSingleton.*ActiveNode"
|
||||
src/ScadaLink.Host/` returns no matches. The follow-up was never carried out. So
|
||||
in production today the standby central node still serves the inbound API exactly
|
||||
as InboundAPI-008 described — executes method scripts, runs `Route.To()` calls,
|
||||
races the active node, and may operate against stale singleton state. The new
|
||||
infrastructure (interface + filter check) is present but unwired; from the user's
|
||||
perspective the original High-severity issue is unresolved in deployed binaries.
|
||||
|
||||
The design says the inbound API is "Central cluster only (active node)" and
|
||||
"fails over with it" — this guarantee is not currently enforced in production.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Register an `IActiveNodeGate` implementation in the central-role branch of
|
||||
`ScadaLink.Host/Program.cs`. The natural backing is the existing
|
||||
`ActiveNodeHealthCheck` (already wired for `/health/active`) or a direct read of
|
||||
`Cluster.Get(actorSystem).State.Leader == Cluster.Get(actorSystem).SelfAddress`.
|
||||
Add an integration test in the Host that spins up the central role and asserts
|
||||
that the gate is resolvable and returns `IsActiveNode` consistent with cluster
|
||||
leader state. Until that wiring lands, this finding is the user-facing
|
||||
realisation of the InboundAPI-008 vulnerability.
|
||||
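A sketch of what the missing registration would supply, with the interface shape inferred only from the filter's `gate is { IsActiveNode: false }` check. `DelegateActiveNodeGate` and `GateDecision` are hypothetical; in the Host the delegate would presumably read cluster leader state or the existing health check.

```csharp
using System;

// Shape inferred from the filter's pattern match; the real interface may carry more members.
public interface IActiveNodeGate
{
    bool IsActiveNode { get; }
}

// Hypothetical implementation backed by any "am I the active node?" probe.
public sealed class DelegateActiveNodeGate : IActiveNodeGate
{
    private readonly Func<bool> _isActive;

    public DelegateActiveNodeGate(Func<bool> isActive) =>
        _isActive = isActive ?? throw new ArgumentNullException(nameof(isActive));

    public bool IsActiveNode => _isActive();
}

public static class GateDecision
{
    // Mirrors the filter as described: no gate registered means "allow",
    // a registered gate on a standby node means HTTP 503.
    public static int StatusFor(IActiveNodeGate? gate) =>
        gate is { IsActiveNode: false } ? 503 : 200;
}
```

The `null` case returning 200 is the behaviour this finding is about: without a registration the standby node is indistinguishable from the active one.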
|
||||
### InboundAPI-023 — `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test coverage
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.InboundAPI/EndpointExtensions.cs:31-140`, `tests/ScadaLink.InboundAPI.Tests/` |
|
||||
|
||||
**Description**
|
||||
|
||||
The endpoint handler `HandleInboundApiRequest` is the wiring composition that
|
||||
ties the validator → JSON parse → `ParameterValidator` → `InboundScriptExecutor` →
|
||||
result-serialization path together; it is the single piece of code that maps
|
||||
validator status codes to HTTP responses, threads the `parentExecutionId` from
|
||||
`HttpContext.Items` into the executor, stashes the resolved API key name as
|
||||
`AuditActorItemKey`, and emits the request-aborted short-circuit. The test
|
||||
project covers each composed component (`ApiKeyValidatorTests`,
|
||||
`ParameterValidatorTests`, `InboundScriptExecutorTests`, `RouteHelperTests`,
|
||||
`InboundApiEndpointFilter`, `AuditWriteMiddlewareTests`,
|
||||
`MiddlewareOrderTests`) but no test exercises `HandleInboundApiRequest` itself —
|
||||
so regressions in the wiring (e.g. forgetting to stash the actor name on
|
||||
`HttpContext.Items`, the `Contains("json")` case sensitivity from
|
||||
InboundAPI-020, or accidentally swapping `validationResult.StatusCode` for a
|
||||
literal) are not caught.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add an `EndpointExtensionsTests` suite using `TestServer` (the same pattern
|
||||
`MiddlewareOrderTests` uses) covering: the happy path (200 + body), invalid
|
||||
JSON (400), validator 401, validator 403, parameter-validation failure (400),
|
||||
script-failure 500, client-aborted short-circuit (`Results.Empty`), and the
|
||||
actor-stash invariant (`HttpContext.Items[AuditActorItemKey]` is set with the
|
||||
resolved key name after successful auth, but is absent on auth failures).
|
||||
|
||||
### InboundAPI-024 — `_knownBadMethods` is unbounded — an attacker can grow the cache by spamming distinct method names against the audit middleware path
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Performance & resource management |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:30`, `:77`, `:223`, `:233` |
|
||||
|
||||
**Description**
|
||||
|
||||
The InboundAPI-009 fix introduced `_knownBadMethods`, a `ConcurrentDictionary<string, byte>`
|
||||
of method names whose Roslyn compilation failed, to short-circuit lazy
|
||||
recompilation. It is keyed by `method.Name` and entries are only ever removed
|
||||
when `CompileAndRegister` succeeds for the same name (line 83). Practically the
|
||||
key space is bounded by the configured method definitions in the database, so
|
||||
this is bounded in normal operation. The cache is mutated from the
|
||||
lazy-compile path at `InboundScriptExecutor.cs:233`, and `ExecuteAsync` is called from
|
||||
`HandleInboundApiRequest` only **after** `ApiKeyValidator.ValidateAsync` has
|
||||
returned `Valid` (i.e. a real method exists), so every entry is keyed by a name that
|
||||
must already have been resolved through `GetMethodByNameAsync`, and the attack
|
||||
surface is gated by the configuration database. The finding is therefore mostly
|
||||
defensive: there is no rate limit on inbound API calls (deliberate design), so
|
||||
if a future change ever causes `ExecuteAsync` to be called for an unvalidated
|
||||
caller-supplied method name (e.g. a refactor that moves method-existence
|
||||
checking later), this cache would become attacker-controllable.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Optional / defensive: cap `_knownBadMethods` (e.g. an LRU with a fixed size, or
|
||||
clear it periodically). At minimum, document the invariant in the executor's
|
||||
XML comment that `_knownBadMethods` keys must come from validated
|
||||
`ApiMethod.Name` values, so the safety property survives future refactors. No
|
||||
immediate change required; this is a watch-list item.
|
||||
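A sketch of the simplest defensive cap, assuming that forgetting a known-bad name costs nothing worse than one extra compile attempt. `BoundedKnownBadSet` is a hypothetical type, not the executor's current field, and it clears the set when full rather than implementing a true LRU.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class BoundedKnownBadSet
{
    private readonly ConcurrentDictionary<string, byte> _entries = new(StringComparer.Ordinal);
    private readonly int _capacity;

    public BoundedKnownBadSet(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    public int Count => _entries.Count;

    public bool Contains(string name) => _entries.ContainsKey(name);

    // Called when a later compile succeeds for the same name.
    public bool Remove(string name) => _entries.TryRemove(name, out _);

    public void Add(string name)
    {
        // Crude cap: when full, forget everything. The check and the add are not
        // atomic, so under contention the size can briefly overshoot the cap.
        if (_entries.Count >= _capacity && !_entries.ContainsKey(name))
        {
            _entries.Clear();
        }
        _entries.TryAdd(name, 0);
    }
}
```

Clearing on overflow keeps the code short; an LRU would retain the hottest entries at the price of more bookkeeping.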
|
||||
### InboundAPI-025 — `AuditWriteMiddleware` runs against the entire `/api/*` branch — emits spurious `ApiInbound` audit rows for `/api/audit/query` and `/api/audit/export`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/Program.cs:183-185`; consumers: `src/ScadaLink.ManagementService/AuditEndpoints.cs:93-94`; emitter: `src/ScadaLink.InboundAPI/Middleware/AuditWriteMiddleware.cs:175-252` |
|
||||
|
||||
**Description**
|
||||
|
||||
`Program.cs` wires the audit middleware as
|
||||
`app.UseWhen(ctx => ctx.Request.Path.StartsWithSegments("/api"), branch => branch.UseAuditWriteMiddleware())`
|
||||
— scoped to the `/api` *prefix*, not to the `POST /api/{methodName}` route.
|
||||
Meanwhile, `ScadaLink.ManagementService/AuditEndpoints.cs` maps
|
||||
`MapGet("/api/audit/query", ...)` (line 93) and `MapGet("/api/audit/export", ...)`
|
||||
(line 94). Both routes therefore inherit `AuditWriteMiddleware`, which emits an
|
||||
`AuditEvent { Channel = AuditChannel.ApiInbound, Kind = AuditKind.InboundRequest, ... }`
|
||||
row for every call. The middleware's `ResolveMethodName` falls back to the last
|
||||
path segment (lines 446–452), so a GET `/api/audit/query?...` is recorded as if a
|
||||
caller had invoked an inbound API method named "query"; an export is recorded
|
||||
as method "export". Effects:
|
||||
|
||||
1. **Audit log is polluted with non-script rows.** The audit log is now
|
||||
recording its own query traffic as if it were inbound script invocations,
|
||||
contradicting Component-AuditLog.md's scope ("script trust boundary actions").
|
||||
2. **Audit reads recursively emit audit writes.** Every audit-log query (e.g.
|
||||
from the Central UI Audit Log page or the CLI `audit query` command) writes
|
||||
an additional row into `AuditLog`, growing the table on read.
|
||||
3. **`Target` is meaningless.** `/api/audit/query` has no method definition, so
|
||||
the recorded `Target = "query"` is not joinable to any `ApiMethod` row in
|
||||
audit-log drill-ins.
|
||||
4. **Wasted resources on health probes / management calls.** Any future routes
|
||||
added under `/api/` will inherit the middleware and pay the
|
||||
`EnableBuffering`, `CapturedResponseStream`, and `JsonSerializer.Serialize`
|
||||
costs even though they are not inbound script invocations.
|
||||
|
||||
Tests for the audit middleware (`AuditWriteMiddlewareTests`) and pipeline order
|
||||
(`MiddlewareOrderTests`) wire the middleware only against the
|
||||
`POST /api/{methodName}` route in test hosts, so this production-only
|
||||
mis-scoping is not exercised.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Tighten the predicate so the middleware runs only on the inbound API method
|
||||
route, not on the `/api/` prefix. Options:
|
||||
|
||||
- `app.UseWhen(ctx => ctx.Request.Path.StartsWithSegments("/api") && !ctx.Request.Path.StartsWithSegments("/api/audit") && !ctx.Request.Path.StartsWithSegments("/api/management"), ...)`
|
||||
— defensive, but fragile to future route additions.
|
||||
- Move the audit emission from a pipeline middleware to an `IEndpointFilter`
|
||||
applied via `.AddEndpointFilter<>()` on the `MapInboundAPI` registration
|
||||
(alongside `InboundApiEndpointFilter`). This makes the scope explicit on the
|
||||
one route that needs it and survives future `/api/...` route additions
|
||||
unchanged.
|
||||
|
||||
The endpoint-filter form is the recommended fix — it co-locates the audit-emission
|
||||
scope with the route definition and matches how InboundAPI-006/008 gating is
|
||||
already wired.
|
||||
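If the pipeline-middleware form is kept instead of the endpoint filter, the predicate could match the method route shape rather than the prefix. The sketch below works on plain strings so it stays free of framework types; `InboundRouteScopeSketch` is hypothetical, and it assumes inbound methods are always `POST /api/{methodName}` with exactly one segment after `/api`.

```csharp
using System;

public static class InboundRouteScopeSketch
{
    // True only for POST /api/{methodName}: exactly one non-empty segment after /api.
    public static bool IsInboundMethodRequest(string httpMethod, string path)
    {
        if (!string.Equals(httpMethod, "POST", StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        const string prefix = "/api/";
        if (!path.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        // Tolerate a trailing slash, then require a single segment.
        var rest = path.Substring(prefix.Length).TrimEnd('/');
        return rest.Length > 0 && !rest.Contains('/');
    }
}
```

This still treats any single-segment POST under `/api/` as a method invocation, so it shares some of the fragility noted for the exclusion-list option; the endpoint filter avoids that by construction.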
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.ManagementService` |
|
||||
| Design doc | `docs/requirements/Component-ManagementService.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 (1 Deferred — see ManagementService-012) |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 6 (1 Deferred — see ManagementService-012) |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -46,6 +46,32 @@ that can leave an instance partially modified after an error (015, Medium), raw
|
||||
messages from unexpected faults being returned verbatim to HTTP callers (016, Low), and
|
||||
`QueryDeploymentsCommand` having no test coverage at all (017, Low).
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
All seventeen prior findings remain correctly closed; ManagementService-012 is still the
|
||||
only Deferred entry (marker-interface on `ManagementEnvelope.Command` still belongs in the
|
||||
Commons module). The module has grown substantially since the last review (`+1997 lines`):
|
||||
the Transport (#24) bundle commands (`ExportBundle`/`PreviewBundle`/`ImportBundle`) have
|
||||
been added to `ManagementActor`, and a new `AuditEndpoints.cs` (`/api/audit/query` and
|
||||
`/api/audit/export`) ships alongside the existing `/management` endpoint. This re-review
|
||||
re-ran the full 10-category checklist and surfaced **six new findings**. The dominant
|
||||
theme is the same authorization gap that findings 001/002/003/014 closed for the
|
||||
ManagementActor, now resurfacing in the new surfaces:
|
||||
**QueryAuditLogCommand has no role gate at all** (018, High) — any authenticated user can
|
||||
read the configuration audit log via `/management`, even though the parallel
|
||||
`/api/audit/query` requires `OperationalAuditRoles`. The new `/api/audit/{query,export}`
|
||||
endpoints build an `AuthenticatedUser` with `PermittedSiteIds` but never enforce site scope
|
||||
(019, Medium) — although audit roles are not site-scoped by design, the user-supplied
|
||||
`sourceSiteId` filter is honoured verbatim. `HandleUpdateSmtpConfig` returns the full
|
||||
SmtpConfiguration entity (including the `Credentials` field, which can carry SMTP passwords
|
||||
/ OAuth2 client secrets) in the response and audit row (020, Medium). The Transport (#24)
|
||||
bundle commands have zero test coverage in `ManagementActorTests` (021, Medium) — neither
|
||||
role gating nor success/error paths. The `Component-ManagementService.md` design doc is
|
||||
stale on three fronts: it does not mention Transport bundle commands, the `/api/audit/*`
|
||||
endpoints, or the now-wired `CommandTimeout` option (022, Low). Finally,
|
||||
`HandleQueryDeployments` issues one `GetInstanceByIdAsync` per unique instance ID when
|
||||
filtering for a site-scoped user — an N+1 read pattern on the unfiltered branch (023, Low).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -61,6 +87,21 @@ messages from unexpected faults being returned verbatim to HTTP callers (016, Lo
|
||||
| 9 | Testing coverage | + | Authorization is well covered; site-scope enforcement, the HTTP endpoint, `DebugStreamHub`, and remote-query handlers have no tests. See 013. |
|
||||
| 10 | Documentation & comments | + | XML docs are accurate where present; `ManagementServiceOptions` and `ResolveRolesCommand` paths are undocumented dead code (010, 011). |
|
||||
|
||||
_Re-review (2026-05-28, `1eb6e97`):_
|
||||
|
||||
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | + | `HandleImportBundle` correctly dedupes resolutions per (entity,name); `ParseDocument` still allocates a `JsonDocument.Parse("{}")` on the failure path but the caller's `using` disposes it. No new defects. |
| 2 | Akka.NET conventions | + | PipeTo dispatch from 004 is intact; supervision strategy from 005 is intact; `Sender` correctly captured to local before PipeTo. No new findings. |
| 3 | Concurrency & thread safety | + | Bundle handlers `await` cleanly; `BundleSession` is not cleaned up if `PreviewAsync`/`ApplyAsync` throws, but that is an `IBundleImporter` contract concern outside this module. No new findings. |
| 4 | Error handling & resilience | + | `ManagementCommandException` from 016 is applied consistently across the new bundle handlers (curated `CryptographicException`/`ArgumentException` paths). No new findings. |
| 5 | Security | + | `QueryAuditLogCommand` has no role gate (018, High). New `/api/audit/*` endpoints build `PermittedSiteIds` but never enforce them (019, Medium). `HandleUpdateSmtpConfig` returns + audits `Credentials` verbatim (020, Medium). |
| 6 | Performance & resource management | + | `HandleQueryDeployments` unfiltered-with-scope branch is N+1 on instance lookups (023, Low). Request body up to 200 MB read into a single `string` in `HandleRequest` (acceptable per Transport bundle requirement). |
| 7 | Design-document adherence | + | `Component-ManagementService.md` is stale on Transport bundle commands, `/api/audit/*` endpoints, and the now-wired `CommandTimeout` (022, Low). |
| 8 | Code organization & conventions | + | `AuditEndpoints` duplicates the Basic Auth → LDAP → roles flow from `ManagementEndpoints` (~50 lines). Acknowledged in `AuditEndpoints` XML but worth tracking. No new finding raised. |
| 9 | Testing coverage | + | Transport bundle commands have zero `ManagementActorTests` coverage: neither role gating nor handler logic (021, Medium). |
| 10 | Documentation & comments | + | New `AuditEndpoints` XML doc is high quality. `Component-ManagementService.md` not updated for Transport/Audit endpoints (022 covers). |
|
||||
|
||||
## Findings
|
||||
|
||||
### ManagementService-001 — Remote-query and debug-snapshot handlers bypass site-scope enforcement
|
||||
@@ -748,3 +789,294 @@ Resolved 2026-05-17 (commit pending). Added seven `QueryDeployments_*` tests to
|
||||
Deployment user and an Admin user, in- and out-of-scope
|
||||
(`_FilteredByOutOfScopeInstance_ReturnsUnauthorized`, `_FilteredByInScopeInstance_ReturnsRecords`,
|
||||
`_UnfilteredForSiteScopedUser_DropsOutOfScopeRecords`, `_UnfilteredForAdminUser_ReturnsAllRecords`).
|
||||
|
||||
### ManagementService-018 — QueryAuditLogCommand has no role gate
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:153`–`:207`, `:336`, `:1302` |
|
||||
|
||||
**Description**
|
||||
|
||||
`QueryAuditLogCommand` is dispatched at line 336 to `HandleQueryAuditLog`, which calls
|
||||
`ICentralUiRepository.GetAuditLogEntriesAsync(...)` with no role check, no site-scope
|
||||
check, and no actor filter. `GetRequiredRole` (lines 153–207) does not list
|
||||
`QueryAuditLogCommand`, so it falls through to the `_ => null` case — i.e. "read-only
|
||||
queries — any authenticated user". The parallel `/api/audit/query` endpoint in
|
||||
`AuditEndpoints.HandleQuery` correctly enforces `AuthorizationPolicies.OperationalAuditRoles`
|
||||
(`{ "Admin", "Audit", "AuditReadOnly" }`), so a CLI authenticated as a user with only the
|
||||
`Deployment` role — or no roles at all — is rejected at `/api/audit/query` but can read
|
||||
the *same* audit log table through `/management` by sending `QueryAuditLogCommand`. The
|
||||
two surfaces enforce different permissions on the same data; the older
|
||||
ManagementActor-routed path is the looser one. The audit log records every script-trust-
|
||||
boundary action and is sensitive operationally — it should not be readable by a default
|
||||
authenticated user.
|
||||
|
||||
This is the same authorization-bypass class as findings 001/002/014 and was missed in
|
||||
that sweep because `QueryAuditLogCommand` (legacy `Action`/`EntityType` filter) is a
|
||||
separate command from the new keyset-paged `IAuditLogRepository.QueryAsync` path the
|
||||
`/api/audit/query` endpoint uses.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add `QueryAuditLogCommand` to `GetRequiredRole`. The natural fit is a new
|
||||
`"OperationalAudit"`-style role group — but `GetRequiredRole` returns a single string and
|
||||
the project's existing role gates do too (`Admin`/`Design`/`Deployment`). Two equally
|
||||
defensible options:
|
||||
|
||||
1. Add `QueryAuditLogCommand` to the `Admin`-required group. This is the strict option
and is consistent with `AuditExportRoles` including `Admin`. The CLI-017/018 audit work
in the CLI uses `/api/audit/query`, so `QueryAuditLogCommand` may be effectively
orphaned anyway.
|
||||
2. Extend `GetRequiredRole` to return a role *set* and add an `AuditRoles` group equal to
|
||||
`AuthorizationPolicies.OperationalAuditRoles`, so the two surfaces converge.
|
||||
|
||||
Recommended: option 1 plus a deprecation comment on `QueryAuditLogCommand` pointing at
|
||||
`/api/audit/query` — the legacy command's filter shape is a subset of the new endpoint's,
|
||||
so the ManagementActor route is redundant. Add a regression test asserting that a
|
||||
no-role / `Deployment`-only caller gets `ManagementUnauthorized` for `QueryAuditLogCommand`.
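
Option 2 could look roughly like the sketch below. It is an illustration under stated
assumptions, not the module's code: `RoleGate`, `RequiredRolesFor`, and `IsAuthorized` are
hypothetical names, commands are identified by string only to keep the example
self-contained, and the audit role names are the `OperationalAuditRoles` values quoted in
the description.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: a role *set* per command instead of a single role string.
public static class RoleGate
{
    // Mirrors the OperationalAuditRoles values quoted in this finding.
    private static readonly IReadOnlySet<string> AuditRoles =
        new HashSet<string>(StringComparer.Ordinal) { "Admin", "Audit", "AuditReadOnly" };

    private static readonly IReadOnlySet<string> AdminOnly =
        new HashSet<string>(StringComparer.Ordinal) { "Admin" };

    // Returns the roles allowed to run the command, or null when any authenticated
    // user may run it (the existing "_ => null" fall-through).
    public static IReadOnlySet<string>? RequiredRolesFor(string commandName) => commandName switch
    {
        "QueryAuditLogCommand" => AuditRoles,
        "UpdateSmtpConfigCommand" => AdminOnly,
        _ => null,
    };

    // A caller is authorized when the command is ungated or the caller holds
    // at least one of the required roles.
    public static bool IsAuthorized(string commandName, IEnumerable<string> callerRoles)
    {
        var required = RequiredRolesFor(commandName);
        return required is null || callerRoles.Any(required.Contains);
    }
}
```

With this shape, the regression test above reduces to asserting that
`IsAuthorized("QueryAuditLogCommand", ...)` is false for an empty role list and for a
`Deployment`-only caller.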
|
||||
|
||||
### ManagementService-019 — AuditEndpoints builds PermittedSiteIds but never enforces them
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/AuditEndpoints.cs:358`–`:368`, `:397`–`:437` |
|
||||
|
||||
**Description**
|
||||
|
||||
`AuditEndpoints.AuthenticateAsync` resolves the caller's roles AND `PermittedSiteIds` and
|
||||
wraps them in an `AuthenticatedUser` (lines 358–366), but the returned `AuthenticatedUser`
|
||||
is then only used for the `HasAnyRole(...)` role check on lines 114 and 163 — its
|
||||
`PermittedSiteIds` are never read. `ParseFilter` (line 397) accepts the caller-supplied
|
||||
`sourceSiteId=...` query string verbatim and passes it straight into the
|
||||
`IAuditLogRepository.QueryAsync` filter. A user whose `Audit` (or `AuditReadOnly`) role
|
||||
mapping carries scope rules — e.g. `AuditReadOnly` scoped to "plant-a" — can still ask
|
||||
for `sourceSiteId=plant-b` and get back rows for plant-b.
|
||||
|
||||
Today this gap is partially benign because the design treats `Audit`/`AuditReadOnly` as
|
||||
non-site-scoped roles (`Component-AuditLog.md` does not list site scoping for the audit
|
||||
permissions, and the LDAP role mapping UI does not currently surface site scope rules
|
||||
for those roles). But (a) the `RoleMapper` will silently honour scope rules attached to
|
||||
any role, including `Audit`, so an operator who *does* configure them gets a UI that
|
||||
says "scoped" and an endpoint that ignores the scope — a contract violation; (b) the
|
||||
`Admin` role's `PermittedSiteIds` are always empty (system-wide), so enforcing for the
|
||||
other roles is cheap. The asymmetry with the `/management` endpoint — which routes every
|
||||
site-targeted command through `EnforceSiteScope` — is also a maintenance hazard.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Decide explicitly whether the audit endpoints honour site scope. Two options:
|
||||
|
||||
1. **Honour scope** — in `HandleQuery` / `HandleExport`, after the role check, intersect
|
||||
the caller-supplied `filter.SourceSiteIds` with `user.PermittedSiteIds`. If the
|
||||
caller supplied no `sourceSiteId` and `PermittedSiteIds` is non-empty, restrict to
|
||||
`PermittedSiteIds`. If the intersection is empty, return an empty page (or a 403 if
|
||||
the caller explicitly asked for an out-of-scope site).
|
||||
2. **Document the intentional bypass** — drop the `PermittedSiteIds` field from the
|
||||
`AuthenticatedUser` constructed in `AuthenticateAsync` (or comment it as "ignored —
|
||||
audit roles are not site-scoped") so the code stops carrying a value it does not
|
||||
read, and add an XML doc note on the endpoint class that audit roles are always
|
||||
system-wide by design.
|
||||
|
||||
Recommended: option 1, mirroring the `ManagementActor` pattern — same security posture
|
||||
across both surfaces. Add a regression test that a site-scoped `AuditReadOnly` user
|
||||
filtering on an out-of-scope site gets a 403 (or an empty page).
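
The intersection rule in option 1 can be sketched as a pure function. This is a minimal
sketch, assuming site identifiers are strings (as in the `plant-a` example) and taking
the 403 variant for an explicit out-of-scope request; `AuditSiteScope` and `Resolve` are
hypothetical names, and the real `filter.SourceSiteIds` / `user.PermittedSiteIds` types
may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper implementing the "honour scope" option.
public static class AuditSiteScope
{
    // Returns the site IDs the query may cover, or null for "no site restriction".
    // Throws UnauthorizedAccessException (to be mapped to HTTP 403 by the caller)
    // when an explicitly requested site is outside the caller's scope.
    public static IReadOnlyList<string>? Resolve(
        IReadOnlyCollection<string> requestedSiteIds,
        IReadOnlyCollection<string> permittedSiteIds)
    {
        // Empty PermittedSiteIds means system-wide access (as for Admin).
        if (permittedSiteIds.Count == 0)
            return requestedSiteIds.Count == 0 ? null : requestedSiteIds.ToList();

        // No explicit filter: restrict to everything the caller may see.
        if (requestedSiteIds.Count == 0)
            return permittedSiteIds.ToList();

        var permitted = new HashSet<string>(permittedSiteIds, StringComparer.Ordinal);
        var outOfScope = requestedSiteIds.Where(id => !permitted.Contains(id)).ToList();
        if (outOfScope.Count > 0)
            throw new UnauthorizedAccessException(
                $"Site(s) not permitted: {string.Join(", ", outOfScope)}");

        return requestedSiteIds.ToList();
    }
}
```

Keeping the rule in one function lets `HandleQuery` and `HandleExport` share it, and the
regression test can target the function directly without standing up the HTTP pipeline.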
|
||||
|
||||
### ManagementService-020 — UpdateSmtpConfig returns and audits the SMTP Credentials field verbatim
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:1136`–`:1153` |
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleUpdateSmtpConfig` reads the existing `SmtpConfiguration` entity, applies the
|
||||
incoming command, and then **(a)** passes the full `config` object as the `afterState`
|
||||
to `AuditAsync` (line 1151) — meaning the SMTP credential string is persisted in the
|
||||
audit log — and **(b)** returns the full `config` to the caller (line 1152), which is
|
||||
serialized via `SerializeResult` and sent back over HTTP. `SmtpConfiguration.Credentials`
|
||||
carries the SMTP-Auth password (for `Basic`) or the OAuth2 client secret (for
|
||||
`OAuth2ClientCredentials`); `SmtpConfiguration` has no `[JsonIgnore]` on this field
|
||||
and `SerializeResult`'s `JsonSerializerOptions` does not exclude it. The pattern
|
||||
parallels what ConfigurationDatabase-012 fixed for inbound API keys: a credential
|
||||
artifact must not be echoed back through every read/audit path.
|
||||
|
||||
The credential is supplied by the operator in `UpdateSmtpConfigCommand.Credentials`,
|
||||
so the caller already has it. But (1) anyone with read access to the audit log
|
||||
(`OperationalAuditRoles`) can now retrieve every SMTP credential change verbatim — a
|
||||
strictly larger blast radius than `Admin`-only `UpdateSmtpConfig`. (2) The serialized
|
||||
`config` echo means the credential moves over the wire in the response even though the
|
||||
caller has no need for it. (3) Any future read path that returns
|
||||
`SmtpConfiguration` — `ListSmtpConfigsCommand` already does at line 1130 — will leak
|
||||
the stored credential too.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Three changes, in order of priority:
|
||||
|
||||
1. In `HandleUpdateSmtpConfig` and `HandleListSmtpConfigs`, project to a credential-free
|
||||
shape before returning — e.g. `new { config.Id, config.Host, config.Port,
|
||||
config.AuthType, config.FromAddress, config.TlsMode }`. Match the
|
||||
`HandleListApiKeys` pattern.
|
||||
2. In `AuditAsync` for the SMTP path, pass a credential-free `afterState` (the same
|
||||
anonymous shape). The fact that *something* changed is auditable; the secret value
|
||||
is not.
|
||||
3. Tag `SmtpConfiguration.Credentials` with `[JsonIgnore]` in Commons (out-of-scope edit
for this module, but worth a follow-up). Alternatively, configure
`ResultSerializerOptions` with a type-info resolver modifier that drops a known set of
credential property names; a naming policy can only rename properties, not omit them.
A per-entity projection is still cleaner.
|
||||
|
||||
Add regression tests: `UpdateSmtpConfig_DoesNotEchoCredentialsInResponse` and
|
||||
`UpdateSmtpConfig_DoesNotPersistCredentialsInAuditLog`.
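
A credential-free projection along the lines of changes 1 and 2 might look like this.
It is a sketch only: the `SmtpConfiguration` record below is a hypothetical stand-in
carrying just the fields named in this finding (their types are assumed), and
`SmtpConfigurationView` / `SmtpConfigurationProjection` are illustrative names.

```csharp
using System;
using System.Text.Json;

// Hypothetical stand-in for the Commons entity; field types are assumptions.
public sealed record SmtpConfiguration(
    int Id, string Host, int Port, string AuthType,
    string FromAddress, string TlsMode, string Credentials);

// Credential-free shape used for both the HTTP response and the audit afterState.
public sealed record SmtpConfigurationView(
    int Id, string Host, int Port, string AuthType,
    string FromAddress, string TlsMode);

public static class SmtpConfigurationProjection
{
    // Copies every field except Credentials.
    public static SmtpConfigurationView ToView(SmtpConfiguration config) =>
        new(config.Id, config.Host, config.Port, config.AuthType,
            config.FromAddress, config.TlsMode);

    // Serializes the projection, so the secret never reaches the serializer.
    public static string ToJson(SmtpConfiguration config) =>
        JsonSerializer.Serialize(ToView(config));
}
```

A named view type, rather than an anonymous object, gives the two regression tests a
stable shape to assert against.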
|
||||
|
||||
### ManagementService-021 — Transport bundle handlers have zero test coverage
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.ManagementService.Tests/ManagementActorTests.cs:1`; `src/ScadaLink.ManagementService/ManagementActor.cs:1717`–`:1897` |
|
||||
|
||||
**Description**
|
||||
|
||||
The three Transport (#24) bundle handlers — `HandleExportBundle`, `HandlePreviewBundle`,
|
||||
`HandleImportBundle` (~180 lines of handler logic at the bottom of `ManagementActor.cs`)
|
||||
— have **no tests** in `ManagementActorTests`. Specifically untested:
|
||||
|
||||
1. **Role gating.** `ExportBundleCommand` requires `Design`; `PreviewBundleCommand` and
|
||||
`ImportBundleCommand` require `Admin`. No test asserts that the wrong role gets
|
||||
`ManagementUnauthorized`. CLI-017 / CLI-018 just landed around bundle plumbing — a
|
||||
future refactor that moves these commands between role groups in `GetRequiredRole`
|
||||
would silently regress the gate.
|
||||
2. **Name resolution in `HandleExportBundle`.** The inner `ResolveIds<T>` helper raises
|
||||
`ManagementCommandException` for unknown names. The "all entity types" branch
|
||||
(`cmd.All == true`) and the "missing name" branch are both untested.
|
||||
3. **`HandleImportBundle` blocker rejection.** The handler aborts before `ApplyAsync`
|
||||
when any `ConflictKind.Blocker` row is present; the produced error message is
|
||||
curated and surfaced to the caller, but no test asserts the abort path or that the
|
||||
importer's `ApplyAsync` was not called.
|
||||
4. **Resolution dedupe.** `HandleImportBundle` dedupes `(EntityType, Name)` keys
|
||||
last-write-wins — the dedupe is critical (CLI-014 was about it on the CLI side) but
|
||||
has no actor-side regression test.
|
||||
5. **`DecodeBundle` failure modes** (empty/non-base64 input) — both branches return
|
||||
curated `ManagementCommandException` but neither is exercised.
|
||||
6. **`ParseConflictPolicy`** for `"skip"`, `"overwrite"`, `"rename"`, and the invalid-
|
||||
value branch — all untested.
|
||||
|
||||
Given the size and reach of the bundle path (cross-cutting central configuration
|
||||
import), this gap is materially larger than usual for new handler code.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add an `ImportBundleHandlerTests` suite covering:
|
||||
- role gating for all three commands (`Design`/`Admin` mismatch -> `ManagementUnauthorized`),
|
||||
- `ExportBundleCommand(All: true)` happy-path,
|
||||
- `ExportBundleCommand` with an unknown name -> `ManagementError`,
|
||||
- `ImportBundleCommand` with a `Blocker` row -> `ManagementError` and `ApplyAsync` not called,
|
||||
- `ImportBundleCommand` with duplicate preview items -> dedupe to one resolution per (type, name),
|
||||
- `DecodeBundle` empty/invalid base64,
|
||||
- `ParseConflictPolicy` all four branches.
|
||||
|
||||
Use NSubstitute for `IBundleImporter` / `IBundleExporter` (no need for a real bundle in
|
||||
the actor tests; the bundle round-trip belongs in `Transport` tests).
|
||||
|
||||
### ManagementService-022 — Design doc is stale on Transport bundle commands, /api/audit/* endpoints, and CommandTimeout
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-ManagementService.md:77`–`:175`, `:205`–`:209` |
|
||||
|
||||
**Description**
|
||||
|
||||
`Component-ManagementService.md` does not mention three pieces of shipped functionality:
|
||||
|
||||
1. **Transport (#24) bundle commands.** `ExportBundleCommand`, `PreviewBundleCommand`,
|
||||
and `ImportBundleCommand` are dispatched at `ManagementActor.cs:350`–`:352` and
|
||||
role-gated in `GetRequiredRole` (Design for Export; Admin for Preview/Import). The
|
||||
design doc's "Message Groups" section enumerates Templates, Instances, Sites, Data
|
||||
Connections, Deployments, External Systems, Notifications, Security, Audit Log,
|
||||
Shared Scripts, Database Connections, Inbound API Methods, Health, and Remote
|
||||
Queries — but has no "Transport" / "Bundles" group. The CLI now offers `bundle
|
||||
export`/`preview`/`import` (per the recent CLI-017/018 work) and points
|
||||
at these commands.
|
||||
2. **`/api/audit/*` endpoints.** The doc's "HTTP Management API" section (line 52)
|
||||
describes only `POST /management`. `AuditEndpoints.MapAuditAPI()` adds
|
||||
`GET /api/audit/query` and `GET /api/audit/export` with their own auth-and-role
|
||||
path mirroring `ManagementEndpoints` (intentionally — see the `AuditEndpoints` XML
|
||||
docs), but the design doc gives no signal that the module exposes more than one
|
||||
route group, no per-endpoint role mapping table, and no mention that the response
|
||||
shape differs (keyset cursor vs. opaque page).
|
||||
3. **`CommandTimeout`.** Line 209 still says "Reserved for future configuration —
|
||||
e.g., command timeout overrides", but ManagementService-010 wired the option through
|
||||
`ResolveAskTimeout`. The doc is stale.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Update `Component-ManagementService.md`:
|
||||
|
||||
- Add a "Transport" entry to "Message Groups" listing `ExportBundle`,
|
||||
`PreviewBundle`, `ImportBundle` with their per-command roles. Cross-reference
|
||||
`Component-Transport.md`.
|
||||
- Add an "Audit Log HTTP API" subsection under "HTTP Management API" describing
|
||||
`GET /api/audit/query` (keyset cursor, `OperationalAuditRoles`) and
|
||||
`GET /api/audit/export` (csv/jsonl streaming, `AuditExportRoles`, parquet 501).
|
||||
Note the deliberate divergence in the source-site query-string key
|
||||
(`sourceSiteId` vs CentralUI's `site`).
|
||||
- In the "Configuration" table, replace "Reserved for future configuration" with the
|
||||
actual `CommandTimeout` semantics: "Max time the HTTP endpoint will Ask the
|
||||
ManagementActor before returning HTTP 504; falls back to 30 s when unset or
|
||||
non-positive."
|
||||
|
||||
### ManagementService-023 — HandleQueryDeployments unfiltered branch is N+1 on instance lookup
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:1276`–`:1295` |
|
||||
|
||||
**Description**
|
||||
|
||||
The site-scoped unfiltered branch of `HandleQueryDeployments` (added under
|
||||
ManagementService-014) reads every `DeploymentRecord` via `GetAllDeploymentRecordsAsync`,
|
||||
then for each *unique* `record.InstanceId` calls
|
||||
`ITemplateEngineRepository.GetInstanceByIdAsync` to resolve the instance's
|
||||
`SiteId`. The handler caches results in `instanceSiteCache` so each instance is loaded
|
||||
at most once per call, but for a fleet with N distinct instances having deployment
|
||||
history, the handler still issues N round-trips to the configuration database to
|
||||
authorize a single query. With a large deployment history the cumulative DB hit can be
|
||||
material; it also runs every time a site-scoped user opens the deployments page.
|
||||
|
||||
This is acceptable in steady state today (sites tend to have small fleets and few
|
||||
deployments) but is a textbook N+1 read pattern, and on a busy day for a site-scoped
|
||||
operator the cost will dominate the request. Admin and system-wide Deployment users
|
||||
correctly skip the loop (they hit only `GetAllDeploymentRecordsAsync`).
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a batch-resolve method to `ITemplateEngineRepository` — e.g.
|
||||
`Task<IDictionary<int, int>> GetInstanceSiteIdsAsync(IEnumerable<int> instanceIds)` —
|
||||
backed by a single EF query
|
||||
(`Instances.Where(i => instanceIds.Contains(i.Id)).Select(i => new { i.Id, i.SiteId })`).
|
||||
`HandleQueryDeployments` would then issue exactly two queries on the unfiltered branch
|
||||
(records + sites) regardless of fleet size. The change is additive to
|
||||
`ITemplateEngineRepository` and out-of-module for the actual implementation, but the
|
||||
handler change is local; a quick interim alternative is to project deployment records
|
||||
to include the instance's `SiteId` at the repo level, which removes the second query
|
||||
entirely.
|
||||
|
||||
Defer until a noticeable hot path emerges, but track it: this is the only N+1 in
|
||||
`ManagementActor` once 002 / 014 are folded in.
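
The handler-side shape of the batch resolve can be sketched as follows. This is a
simplified illustration, not the actual handler: the `DeploymentRecord` record is a
hypothetical stand-in, the lookup is passed as a synchronous delegate in place of the
proposed `GetInstanceSiteIdsAsync` repository call, and site IDs are assumed to be
integers, as the proposed `IDictionary<int, int>` return type implies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the real deployment record; only the fields used here.
public sealed record DeploymentRecord(int Id, int InstanceId);

public static class DeploymentScopeFilter
{
    // lookupSiteIds stands in for the proposed batch repository call and is
    // invoked exactly once, regardless of how many distinct instances appear.
    public static IReadOnlyList<DeploymentRecord> FilterToPermittedSites(
        IReadOnlyCollection<DeploymentRecord> records,
        Func<IReadOnlyCollection<int>, IDictionary<int, int>> lookupSiteIds,
        IReadOnlySet<int> permittedSiteIds)
    {
        var instanceIds = records.Select(r => r.InstanceId).Distinct().ToList();
        var siteByInstance = lookupSiteIds(instanceIds); // single round-trip

        // Records whose instance no longer resolves are dropped, not exposed.
        return records
            .Where(r => siteByInstance.TryGetValue(r.InstanceId, out var siteId)
                        && permittedSiteIds.Contains(siteId))
            .ToList();
    }
}
```

The in-memory dictionary takes over the role of `instanceSiteCache`; the difference is
that it is filled by one query instead of one query per instance.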
|
||||
|
||||
@@ -0,0 +1,488 @@
|
||||
# Code Review — NotificationOutbox
|
||||
|
||||
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.NotificationOutbox` |
| Design doc | `docs/requirements/Component-NotificationOutbox.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 10 |
|
||||
|
||||
## Summary
|
||||
|
||||
NotificationOutbox is a small, focused module — one ~985-line actor
|
||||
(`NotificationOutboxActor`), a strongly-typed options class, an
|
||||
`INotificationDeliveryAdapter` seam, and the single concrete `EmailNotificationDeliveryAdapter`.
|
||||
The Akka.NET conventions are textbook: every async path is wrapped with `PipeTo`, the
|
||||
dispatcher uses an in-flight guard cleared on `DispatchComplete`, the sender is captured
|
||||
before crossing the await, and the actor isolates per-notification failures so one bad row
|
||||
never aborts a batch. Test coverage is broad — ingest, dispatch, query, retry/discard,
|
||||
purge, KPI, and the new audit-emission paths (B2 attempts + B3 terminals) all have
|
||||
dedicated test files — and the audit-write-failure-never-aborts-delivery contract is
|
||||
explicitly asserted.
|
||||
|
||||
The dominant theme is **trust-boundary leakage between Outbox, NotificationService, and
|
||||
ConfigurationDatabase**. The outbox inherits two known defects from its sibling modules
|
||||
that are reachable through `EmailNotificationDeliveryAdapter`: the OAuth2 SASL empty-user
|
||||
bug (NS-021) ships every M365 send with `user=""`, and the
|
||||
`InsertIfNotExistsAsync` check-then-act race (CD-015) lives on the outbox's ack-after-persist
|
||||
hot path. Neither is a defect of code under `src/ScadaLink.NotificationOutbox/`, but both
|
||||
are surfaced here because production dispatch and ingest go through these exact lines.
|
||||
A secondary theme is **dispatcher fire-and-forget audit writes**
(`_ = _auditWriter.WriteAsync(...)`) that can race the per-sweep scope dispose under the
wrong DI graph. A few smaller drifts round out the list: the dispatcher passes
`CancellationToken.None` to adapter delivery (no graceful shutdown for in-flight SMTP
sends); the `StuckAgeThreshold` XML doc describes a behavior the design explicitly
forbids (the design treats the threshold as display-only and never reclaims); the
`MaxRetries` boundary check uses `>=` against a config value that can be zero (immediate
park on first transient failure); and several `NotificationOutboxOptions` fields are
documented in code but absent from `Component-NotificationOutbox.md`. No Critical
findings; two High, six Medium, two Low.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `MaxRetries` zero/negative immediately parks (NotificationOutbox-002); `StuckAgeThreshold` XML doc contradicts design (NotificationOutbox-009); `Guid.TryParse` accepts compact `"N"` ids emitted by sites. |
| 2 | Akka.NET conventions | Yes | `PipeTo` / sender-capture / in-flight guard pattern is correctly applied throughout. Fire-and-forget `_ = _auditWriter.WriteAsync(...)` raises a scope-lifetime concern (NotificationOutbox-004). |
| 3 | Concurrency & thread safety | Yes | Actor state mutated only on actor thread. Inherited CD-015 race on `InsertIfNotExistsAsync` (NotificationOutbox-005) is the only race; the dispatcher's in-flight guard correctly serializes sweeps. |
| 4 | Error handling & resilience | Yes | Outer try/catch on `RunDispatchPass`/`RunPurgePass` keeps the in-flight guard sane; per-notification isolation is correct. CT not threaded into delivery (NotificationOutbox-003). |
| 5 | Security | Yes | Inherited OAuth2 empty-user (NotificationOutbox-001) reachable through the adapter. No new credential or trust-boundary issues introduced by the outbox itself. |
| 6 | Performance & resource management | Yes | Dispatch interval & batch size are simple polling; `ResolveAdapters` rebuilds the lookup per sweep (NotificationOutbox-006). No leaks. |
| 7 | Design-document adherence | Yes | `NotificationOutboxOptions.DispatchBatchSize`, `DeliveredKpiWindow`, `PurgeInterval` are not in the design doc (NotificationOutbox-007). |
| 8 | Code organization & conventions | Yes | Options class lives in the component project (correct); DI extension lives in the component (correct); adapter is `scoped`, actor singleton; the interaction is correctly documented in `ServiceCollectionExtensions`. No issues. |
| 9 | Testing coverage | Yes | Solid actor-behaviour coverage. Missing tests for `FallbackMaxRetries` / empty-SMTP-config dispatch path (NotificationOutbox-008). |
| 10 | Documentation & comments | Yes | XML on `StuckAgeThreshold` misleading (NotificationOutbox-009); XML on dispatcher's audit `_ =` fire-and-forget says "writer never throws" but `EmitAttemptAudit` still wraps in try/catch, so the comment contradicts itself (NotificationOutbox-010). |
|
||||
|
||||
## Findings
|
||||
|
||||
### NotificationOutbox-001 — `EmailNotificationDeliveryAdapter` inherits the OAuth2 empty-user SASL bug (NS-021) on the M365 send path
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/Delivery/EmailNotificationDeliveryAdapter.cs:185-191` (calls `smtp.AuthenticateAsync("oauth2", token)`); root cause in `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:76-79` |
|
||||
|
||||
**Description**
|
||||
|
||||
`EmailNotificationDeliveryAdapter.SendAsync` resolves an OAuth2 access token via
|
||||
`_tokenService.GetTokenAsync(...)` and then calls
|
||||
`await smtp.AuthenticateAsync(config.AuthType, credentials, cancellationToken);`
|
||||
on `ISmtpClientWrapper`. The production implementation (`MailKitSmtpClientWrapper`)
|
||||
constructs `new SaslMechanismOAuth2("", credentials)` — an empty user-name field —
|
||||
which Microsoft 365 SMTP rejects with `535 5.7.3 Authentication unsuccessful`. The
|
||||
sibling NotificationService finding NS-021 documents this in full; the outbox is the
|
||||
*new home* for delivery on central, so every OAuth2 send that the outbox dispatches
|
||||
hits this code path. The defect is therefore reachable here even though the offending
|
||||
constructor lives in the NotificationService project, and the central-only redesign
|
||||
means this is now the only delivery path in production. Existing outbox tests do not
|
||||
catch it because they all substitute `ISmtpClientWrapper` and assert only that
|
||||
`AuthenticateAsync` is invoked with `("oauth2", "<token>")` — the real
|
||||
`SaslMechanismOAuth2` is never instantiated. `OAuth2TokenService.GetTokenAsync` is
|
||||
explicitly wired to `login.microsoftonline.com/.../oauth2/v2.0/token` with
|
||||
`scope=https://outlook.office365.com/.default`, so M365 SMTP is the intended target —
|
||||
and is precisely the relay that requires the user field to be populated.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Track the NS-021 fix and add an outbox-side regression test once the wrapper signature
|
||||
is widened. Concretely, when `ISmtpClientWrapper.AuthenticateAsync` is extended to
|
||||
accept the sender mailbox (or a dedicated `oauth2UserName` parameter), update
|
||||
`EmailNotificationDeliveryAdapter.SendAsync` to pass `config.FromAddress`, and add a
|
||||
test in `EmailNotificationDeliveryAdapterTests` that asserts the OAuth2 path forwards
|
||||
the sender identity. Until then, surface the same finding here so the outbox is not
|
||||
treated as resolved when NS-021 fires.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### NotificationOutbox-002 — Dispatcher parks on first transient failure when `SmtpConfiguration.MaxRetries == 0`
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:348-360` |
|
||||
|
||||
**Description**
|
||||
|
||||
The transient-failure branch increments `RetryCount` then evaluates
|
||||
`if (notification.RetryCount >= maxRetries) notification.Status = NotificationStatus.Parked;`.
|
||||
`maxRetries` is read from the central `SmtpConfiguration.MaxRetries` column, which has
|
||||
no enforced lower bound and is not validated by the outbox. A row whose `MaxRetries`
|
||||
is `0` (or any negative value) immediately satisfies `1 >= 0` on the very first
|
||||
transient failure, so the notification is parked without a single retry — directly
|
||||
contradicting the design doc's "fixed retry interval, reuse central SMTP
|
||||
max-retry-count" intent, where a configured value of zero would naturally read as
|
||||
"never retry, fail straight to permanent". `SetupSmtpRetryPolicy` in the dispatch
|
||||
tests always supplies a positive value, so this path is not exercised.
|
||||
|
||||
Additionally, an operator who clears the SMTP config row drops into the
|
||||
`FallbackMaxRetries = 10` / `FallbackRetryDelay = 1 min` path
|
||||
(`ResolveRetryPolicyAsync` line 251); that path is also untested — see
|
||||
NotificationOutbox-008. The operational result is that a single bad SMTP config
|
||||
value silently halves the outbox's delivery guarantees.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Validate `MaxRetries` at the read point: treat a non-positive value as either the
|
||||
configured fallback (current `FallbackMaxRetries = 10`) or — preferred — surface the
|
||||
mis-configuration to the operator via a health metric and refuse to dispatch until
|
||||
the row is corrected. Either way, add a test that asserts the dispatcher's behaviour
|
||||
for `MaxRetries == 0` and `MaxRetries < 0`.
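
The fallback variant of this recommendation can be sketched as a small guard. This is a minimal illustration under stated assumptions, not the module's code: `RetryPolicyGuard`, `EffectiveMaxRetries`, and `ShouldPark` are hypothetical names, and only the `FallbackMaxRetries = 10` value is taken from the finding.

```csharp
using System;

// Hypothetical helper; the names are illustrative and do not exist in the reviewed source.
public static class RetryPolicyGuard
{
    // Mirrors the fallback constant quoted in the finding.
    public const int FallbackMaxRetries = 10;

    // Returns the retry budget the dispatcher would use. A non-positive
    // configured value is treated as a misconfiguration and replaced by the fallback.
    public static int EffectiveMaxRetries(int configuredMaxRetries) =>
        configuredMaxRetries > 0 ? configuredMaxRetries : FallbackMaxRetries;

    // True when a notification that has failed retryCount times should be parked.
    public static bool ShouldPark(int retryCount, int configuredMaxRetries) =>
        retryCount >= EffectiveMaxRetries(configuredMaxRetries);
}
```

With this guard, a first transient failure against `MaxRetries == 0` or a negative value no longer parks the row. The preferred option (refusing to dispatch and raising a health metric) would replace the fallback branch with an error path instead.
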

**Resolution**

_Unresolved._

### NotificationOutbox-003 — Dispatcher does not propagate a `CancellationToken` into delivery; in-flight SMTP sends cannot be cancelled on shutdown

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:334`, `src/ScadaLink.NotificationOutbox/Delivery/INotificationDeliveryAdapter.cs:22` |

**Description**

`DeliverOneAsync` calls `var outcome = await adapter.DeliverAsync(notification);` —
the second parameter of `INotificationDeliveryAdapter.DeliverAsync`, a
`CancellationToken`, is left at its `default(CancellationToken)` value, meaning
`CancellationToken.None`.
`EmailNotificationDeliveryAdapter.SendAsync` then threads that `None` token into
`smtp.ConnectAsync`, `smtp.AuthenticateAsync`, and `smtp.SendAsync`. The consequence
is that during a coordinated cluster shutdown (singleton handover, drain) any
in-flight SMTP send is uncancellable and the dispatcher's sweep must wait for the
underlying socket/SMTP timeout (`SmtpConfiguration.ConnectionTimeoutSeconds`) before
the sweep's task completes and `DispatchComplete` lowers the in-flight guard. With
the default connect-timeout values this is on the order of tens of seconds per
notification in the in-progress batch, blocking `CoordinatedShutdown`.

The adapter implementations clearly *expect* a token — the parameter is declared as
`CancellationToken cancellationToken = default` everywhere — so this is a wiring
gap, not a missing interface.

**Recommendation**

Wire a per-sweep `CancellationTokenSource` linked to the actor's lifecycle (cancel
in `PostStop`) and pass its token into `DeliverAsync`. A linked source per sweep
can also bound individual deliveries by the configured connection timeout when a more
explicit per-attempt budget is wanted. Add a test that cancels mid-`DeliverAsync` and
asserts the dispatcher completes promptly and the row is left non-terminal
(`Pending`/`Retrying` unchanged) for the next sweep.

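
A rough sketch of that wiring, assuming a lifetime-scoped `CancellationTokenSource` that the actor cancels when it stops. `SweepCancellation`, `Stop`, and the `deliver` delegate are hypothetical stand-ins for the actor, its `PostStop` hook, and `adapter.DeliverAsync`; only the `System.Threading` types are real APIs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the dispatcher; not the reviewed actor.
public sealed class SweepCancellation : IDisposable
{
    // Cancelled when the owning actor stops (for example from PostStop).
    private readonly CancellationTokenSource _lifetime = new();

    public void Stop() => _lifetime.Cancel();

    // Runs one delivery with a token that fires on actor stop
    // or once perAttemptBudget has elapsed, whichever comes first.
    public async Task<bool> DeliverOneAsync(
        Func<CancellationToken, Task> deliver, TimeSpan perAttemptBudget)
    {
        using var attempt = CancellationTokenSource.CreateLinkedTokenSource(_lifetime.Token);
        attempt.CancelAfter(perAttemptBudget);
        try
        {
            await deliver(attempt.Token);
            return true;
        }
        catch (OperationCanceledException)
        {
            // Cancelled: report "not delivered" so the caller can leave the
            // row non-terminal for the next sweep.
            return false;
        }
    }

    public void Dispose() => _lifetime.Dispose();
}
```

The linked source is what lets one token serve both purposes: shutdown cancels every in-flight attempt at once, while `CancelAfter` gives each attempt its own budget without touching the lifetime source.
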

**Resolution**

_Unresolved._

### NotificationOutbox-004 — `EmitAttemptAudit`/`EmitTerminalAudit` fire-and-forget pattern can outlive the per-sweep DI scope

| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:425-435`, `463-485` |

**Description**

Both emission helpers issue `_ = _auditWriter.WriteAsync(evt);` — discarding the
returned task. `CentralAuditWriter.WriteAsync` opens its own `await using var scope =
_services.CreateAsyncScope();` and resolves a scoped `IAuditLogRepository` (verified
at `src/ScadaLink.AuditLog/Central/CentralAuditWriter.cs:118-121`), so the writer is
defensively scope-independent. However the dispatcher already holds a per-sweep
`using var scope = _serviceProvider.CreateScope();` and the per-notification
`UpdateAsync` runs in that scope. The fire-and-forget pattern means:

1. The dispatcher's outer scope can be disposed (sweep done, `DispatchComplete`
   piped) while the audit `WriteAsync` task is still running on a *different*
   scope it owns — works today only because the writer creates its own scope.
2. A faulted unobserved task is silently lost: if `CentralAuditWriter.WriteAsync`
   itself were ever made `async void` or refactored to not internally try/catch,
   the dispatcher would never see the fault and the audit row would vanish without
   the `_logger.LogWarning` reaching the operator.
3. The XML-doc above `EmitAttemptAudit` says "PipeTo is not used because the writer
   never throws" — but the surrounding `try { _ = _auditWriter.WriteAsync(evt); }
   catch (Exception ex)` will only catch a synchronous throw from the *task
   construction*, not the awaited body of `WriteAsync`. The comment understates the
   risk: the catch is structurally unreachable for the documented failure mode.

The system actually wants the *invariant* "audit write never affects delivery"
(verified by the `AuditWriter_Throws_…StillSucceeds` tests). That invariant is
better expressed by `await`-ing the writer inside the actor's outer try/catch (the
dispatcher already swallows per-notification exceptions) than by a discard-task,
which couples the lifetime of the dispatcher's scope to that of the audit task
through whatever scope graph the writer happens to use today.

**Recommendation**

Either `await _auditWriter.WriteAsync(evt)` inside the existing `try`/`catch` (the
preferred fix — preserves the invariant, plays well with the per-sweep scope, and
makes the catch block actually reachable), or — if a true fire-and-forget remains
desired — capture the returned task and attach a continuation that calls
`_logger.LogWarning` when the task faults, to keep diagnostics intact. Either way, fix the
"writer never throws" XML-doc to match the implementation.

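
The difference between the two patterns can be shown in isolation. The following is a minimal sketch, assuming a writer whose failure happens inside the returned task; `AuditEmitDemo` and `FaultingWriteAsync` are hypothetical and stand in for the dispatcher helper and `_auditWriter.WriteAsync`.

```csharp
using System;
using System.Threading.Tasks;

public static class AuditEmitDemo
{
    // Hypothetical writer that faults inside the task body, after its first await.
    private static async Task FaultingWriteAsync()
    {
        await Task.Yield();
        throw new InvalidOperationException("audit store unavailable");
    }

    // Discard pattern: the fault is stored on the discarded task,
    // so the catch block never runs and this returns false.
    public static bool DiscardCatches()
    {
        try
        {
            _ = FaultingWriteAsync();
            return false;
        }
        catch (Exception)
        {
            return true;
        }
    }

    // Awaited pattern: the fault is rethrown at the await, so the catch block runs.
    public static async Task<bool> AwaitCatchesAsync()
    {
        try
        {
            await FaultingWriteAsync();
            return false;
        }
        catch (Exception)
        {
            return true;
        }
    }
}
```

In both variants the caller survives the writer's failure, which is the invariant the finding cares about. Only the awaited variant gives the `catch` block a chance to log it.
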

**Resolution**

_Unresolved._

### NotificationOutbox-005 — Ingest persistence inherits the CD-015 check-then-act race; under contention the second writer throws and the site retries

| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:127-132` (caller); root cause in `src/ScadaLink.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs:33-45` |

**Description**

`HandleSubmit` → `PersistAsync` calls `repository.InsertIfNotExistsAsync(notification)`
on `INotificationOutboxRepository`. The current implementation
(`src/ScadaLink.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs`)
does a check-then-act with no duplicate-key catch — documented as CD-015 (High,
Open). The Notification Outbox's documented contract is "at-least-once handoff with
ack-after-persist plus insert-if-not-exists on `NotificationId`" (CLAUDE.md,
Component-NotificationOutbox.md §Ingest & Idempotency), and the duplicate-insert
race is the **expected contention pattern** — the site retries the same submission
after a lost ack. As written, the loser surfaces a `SqlException` (2627 PK
violation) wrapped in `DbUpdateException`, which propagates through `PipeTo`'s failure
projection as a `NotificationSubmitAck { Accepted: false, Error: "... PRIMARY KEY ..." }`;
the site treats the ack as a forwarding failure and forwards the message **again**,
re-entering the same race. If the contending pair keeps racing this can livelock.

The actor side is fine — `PipeTo`'s success/failure projection correctly forwards
the exception message. The repository side needs the standard `2601/2627 → no-op`
pattern that AuditLog and SiteCall already use. This finding tracks the outbox-side
visibility of the CD-015 defect so a re-review of NotificationOutbox surfaces it
even if the reader has not yet read the ConfigurationDatabase findings.

**Recommendation**

Track CD-015 to resolution. As a defense-in-depth complement here, consider
treating a duplicate-key `DbUpdateException` in the actor's ingest failure
projection as `Accepted: true`, so that an ack lost after the first writer has
persisted the row does not produce a permanent re-forward loop — but the cleanest fix
remains the CD-015 raw-SQL `IF NOT EXISTS … INSERT` with `2601/2627` catch in
`NotificationOutboxRepository`.

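
The `2601/2627 → no-op` idea can be illustrated without a database. The sketch below is an in-memory analogy under loud assumptions: `InMemoryOutbox` and `DuplicateKeyException` are hypothetical, with the exception standing in for a `DbUpdateException` that wraps SQL Server error 2601 or 2627. It is not the repository's implementation.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for a duplicate-key database error (2601/2627).
public sealed class DuplicateKeyException : Exception { }

// Hypothetical in-memory table keyed on NotificationId.
public sealed class InMemoryOutbox
{
    private readonly ConcurrentDictionary<Guid, string> _rows = new();

    public int Count => _rows.Count;

    // Simulates a plain INSERT against a table with a primary key:
    // the second writer for the same key gets a duplicate-key error.
    public void Insert(Guid notificationId, string payload)
    {
        if (!_rows.TryAdd(notificationId, payload))
            throw new DuplicateKeyException();
    }

    // Insert-if-not-exists: losing the race counts as success, because the
    // row the caller wanted persisted is already there.
    public bool InsertIfNotExists(Guid notificationId, string payload)
    {
        try
        {
            Insert(notificationId, payload);
            return true;   // this call inserted the row
        }
        catch (DuplicateKeyException)
        {
            return false;  // row already present: a no-op, not a failure
        }
    }
}
```

Either return value would map to an accepted ack on the ingest path, since in both cases the notification is durably stored exactly once.
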

**Resolution**

_Unresolved._

### NotificationOutbox-006 — `ResolveAdapters` rebuilds the `NotificationType → adapter` dictionary on every dispatch sweep

| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:267-277` |

**Description**

Every dispatch sweep calls `ResolveAdapters(scope.ServiceProvider)` which enumerates
`scopedServices.GetServices<INotificationDeliveryAdapter>()` and builds a fresh
`Dictionary<NotificationType, INotificationDeliveryAdapter>`. Adapter registration
is decided at startup (`AddNotificationOutbox` registers
`EmailNotificationDeliveryAdapter`); the registration set does not change at
runtime. With a default `DispatchInterval = 10s` and only ever one entry today, the
allocation overhead is trivial — but the comment "the last adapter registered for a
given type wins, mirroring DI's last-wins resolution semantics" elevates this to a
behaviour contract, and the per-sweep dictionary construction obscures the lookup's
identity from one sweep to the next, making any future stateful adapter (rate
limiter, circuit breaker) silently lose its state.

The per-sweep resolution exists because `EmailNotificationDeliveryAdapter` is
*scoped* — it holds a scoped `INotificationRepository`. A trivial
cache-the-types-but-resolve-the-instance fix is possible: cache the set of declared
`NotificationType` values and look up each adapter by
`GetService<INotificationDeliveryAdapter>()` filtered by `Type` per sweep.

**Recommendation**

Document the per-sweep contract explicitly ("each sweep gets a fresh adapter
instance per the scoped DI contract — adapters must not carry state across
sweeps") in the actor XML, or — preferred — cache only the *types* at startup
(`PreStart`) and resolve the scoped instance per sweep, so future adapters with
stateful intent (timeouts, circuit breakers) cannot accidentally lose state.

**Resolution**

_Unresolved._

### NotificationOutbox-007 — `NotificationOutboxOptions.DispatchBatchSize`, `DeliveredKpiWindow`, and `PurgeInterval` are not in the design document

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs:13`, `:22`, `:25`; `docs/requirements/Component-NotificationOutbox.md:152-160` |

**Description**

`Component-NotificationOutbox.md` §Configuration enumerates three options: dispatch
interval, stuck-age threshold, and terminal-row retention window. The implemented
`NotificationOutboxOptions` adds three additional fields:

- `DispatchBatchSize` (default `100`) — caps the per-sweep claim size, but is invisible
  to anyone reading only the spec.
- `PurgeInterval` (default `1 day`) — the design doc says "daily purge" as if the
  cadence is fixed; in code it is configurable.
- `DeliveredKpiWindow` (default `1 min`) — the KPI section says "Delivered (last
  interval)" without saying how long "last interval" is or that it is configurable.

The design doc also asserts "Delivery max-retry-count and retry interval are not
part of `NotificationOutboxOptions` — they are reused from the central SMTP
configuration" (line 160) — implementation honours this. But the three additions
above are absent from the design doc. The KPI dashboard cadence and the dispatch
batch size are both operationally important values an operator/engineer will hunt
for; their absence from the spec is design drift.

**Recommendation**

Add the three fields to `Component-NotificationOutbox.md §Configuration` with their
defaults, or remove them from the implementation if they were meant to be fixed
constants. Cross-link `DeliveredKpiWindow` from the §Monitoring "Delivered (last
interval)" KPI bullet so a reader sees what controls the bucket length.

**Resolution**

_Unresolved._

### NotificationOutbox-008 — `FallbackMaxRetries` / `FallbackRetryDelay` path is unreachable in production AND untested

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:29-31`, `:251-259`; tests in `tests/ScadaLink.NotificationOutbox.Tests/NotificationOutboxActorDispatchTests.cs` |

**Description**

`ResolveRetryPolicyAsync` falls back to `FallbackMaxRetries = 10` and
`FallbackRetryDelay = 1 min` when `notificationRepository.GetAllSmtpConfigurationsAsync()`
returns an empty list (no SMTP configuration row). The comment correctly observes
that delivery itself will then return `Permanent("No SMTP configuration available")`
from `EmailNotificationDeliveryAdapter.cs:78-81`, so the fallback retry policy
never actually retries anything — the row is permanently parked on first attempt
regardless of retry count or delay.

This produces three concerns. (1) The fallback is essentially dead code — the retry
policy values are never consulted in practice because delivery always fails
permanently before the retry branch is reached. (2) The fallback can be reached
*after* a previously-deployed SMTP config is deleted, which is precisely the
moment an operator needs accurate audit trails; the row will say `Parked` with
`LastError = "No SMTP configuration available"` but the audit signal "retry policy
fell back to defaults" is invisible. (3) Tests never exercise either the fallback
path or the empty-SMTP-config dispatch path: `SetupSmtpRetryPolicy` always supplies
a config in every dispatch test.

**Recommendation**

Add a regression test that runs a dispatch sweep with no SMTP config row and
asserts the row is parked with the documented error. Optionally remove the fallback
constants if parking-with-no-config is the *intended* operational signal; document
the choice in the actor XML so a maintainer does not "fix" the unreachable code.

**Resolution**

_Unresolved._

### NotificationOutbox-009 — `StuckAgeThreshold` XML-doc says "in-progress notification is re-claimed" — contradicts the design's display-only stuck detection

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs:15-16` |

**Description**

```csharp
/// <summary>Age past which an in-progress notification is considered stuck and re-claimed.</summary>
public TimeSpan StuckAgeThreshold { get; set; } = TimeSpan.FromMinutes(10);
```

The implementation never reclaims anything based on `StuckAgeThreshold`. It is used
only as a cutoff for the stuck-count KPI (`StuckCutoff`/`IsStuck` in
`NotificationOutboxActor.cs:932-942`) and as a `StuckCutoff` filter on paginated
queries. The design doc is explicit: "A notification is **stuck** if it is `Pending`
or `Retrying` and older than a configurable age threshold (default 10 minutes).
Detection is **display-only** — a count KPI and a row badge. There is no automated
escalation or alerting" (`Component-NotificationOutbox.md:143-145`). A maintainer
reading the XML and expecting "re-claim" behaviour will be surprised twice — once
when no re-claim happens, and once when they go looking for the re-claim code and
find none.

**Recommendation**

Rewrite the XML to match the design: "Age past which a still-`Pending`/`Retrying`
notification is counted as stuck on the KPI tile and the per-row badge.
Display-only — does not affect dispatch."

**Resolution**

_Unresolved._

### NotificationOutbox-010 — Comment claims `PipeTo` is not used "because the writer never throws"; the surrounding try/catch is dead-letter for the documented failure mode

| | |
|--|--|
| Severity | Medium |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:469-477` |

**Description**

```csharp
try
{
    var evt = BuildNotifyDeliverEvent(notification, now, AuditStatus.Attempted, errorMessage)
        with { DurationMs = durationMs };
    // Fire-and-forget — we do NOT await: the dispatcher loop must not
    // be blocked by audit IO, and the writer swallows its own faults.
    // PipeTo is not used because the writer never throws.
    _ = _auditWriter.WriteAsync(evt);
}
catch (Exception ex)
{
    _logger.LogWarning(ex, "Failed to emit Attempted audit row …");
}
```

The XML-doc on `EmitAttemptAudit` is internally inconsistent and structurally
incorrect: (1) if "the writer never throws" then the surrounding try/catch is
unreachable and dead code; (2) if the writer *can* throw (and the catch is
meaningful) then "never throws" is wrong. In practice the catch only ever fires
on a synchronous throw from the writer's *task construction* — never on a fault
in the awaited body — because the discarded task is not observed. The current
behaviour matches the design intent ("audit failure NEVER aborts delivery"), but
the comment misleads the next reader on the *why*.

This is the same root cause as NotificationOutbox-004 — they target the same lines
from different angles (NotificationOutbox-004 is the scope-lifetime /
fire-and-forget Akka concern, NotificationOutbox-010 is the doc/comment-clarity
concern). Closing NotificationOutbox-004 by switching to `await` resolves both.

**Recommendation**

If `await`-ing the writer (recommended fix per NotificationOutbox-004): delete the
"PipeTo is not used because the writer never throws" line entirely and let
the try/catch's behaviour speak for itself. If keeping fire-and-forget: rewrite
the comment to "fire-and-forget by design (the writer is responsible for its
own failure handling); the surrounding try/catch only catches the synchronous
task-construction throw and is otherwise unreachable."

**Resolution**

_Unresolved._

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.NotificationService` |
| Design doc | `docs/requirements/Component-NotificationService.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |

## Summary

@@ -55,20 +55,65 @@ any code (NS-017, dead config — NS-007 sourced the timeout/limit from
outside its lock, is sized once and never resized on redeployment, and is never
disposed (NS-018).

#### Re-review 2026-05-28 (commit `1eb6e97`)

Re-reviewed at commit `1eb6e97` against the **materially-changed design**: per the
updated `Component-NotificationService.md` and CLAUDE.md, the Notification Service
is now **central-only**. Sites no longer deliver notifications over SMTP — a
script's `Notify.Send` enqueues to the site Store-and-Forward Engine and
`NotificationForwarder.DeliverAsync` (S&F handler in StoreAndForward) forwards
the payload to the central Notification Outbox, which dispatches via the
`INotificationDeliveryAdapter` registered for the list's `Type`. Email delivery
on central is performed by `EmailNotificationDeliveryAdapter` in the
NotificationOutbox project — it reuses this module's SMTP machinery
(`ISmtpClientWrapper`, `OAuth2TokenService`, `SmtpErrorClassifier`,
`SmtpTlsModeParser`, `EmailAddressValidator`, `CredentialRedactor`,
`SmtpPermanentException`, `NotificationOptions`) but is the actual production
caller. The intended residual responsibility of this module is to **supply that
shared SMTP machinery** plus list/SMTP-config definition management on central.

The re-review surfaced **seven new findings**. The dominant theme is **dead
code that contradicts the design doc**: `NotificationDeliveryService`, the
`INotificationDeliveryService` interface in Commons, the `NotificationResult`
record, the entire `DeliverBufferedAsync` S&F handler, and the prior NS-001…
NS-018 test fixtures that exercise them are now orphaned — no production code
path resolves `INotificationDeliveryService` on a site (sites no longer register
this module per `SiteServiceRegistration.cs:33-38`) and on central the
NotificationOutbox uses its own `EmailNotificationDeliveryAdapter` (which
duplicates the connect/auth/send/disconnect sequence rather than delegating to
`NotificationDeliveryService`). The class is still registered by
`AddNotificationService` on central (`Program.cs:77`) but no consumer resolves
it (NS-019). The `S&F handler must be registered` workaround that NS-001 added
to `AkkaHostedService` is itself superseded by the `NotificationForwarder`
registered for the same category at `AkkaHostedService.cs:654-660` (NS-020).
Secondary findings: a real-world correctness gap (the OAuth2
`SaslMechanismOAuth2` is constructed with an **empty user id** so server-side
account binding fails for any provider that requires it — NS-021); the SMTP
client wrapper holds a single `MailKit.SmtpClient` for the lifetime of the
wrapper but the factory delegate creates a new wrapper per send, so successive
sends through the same factory share NO connection but DO share a wrapper that
mutates `_client.Timeout` on every connect (benign because every wrapper has its
own client, but the design comment about pooling is now contradicted — NS-022);
the design-doc retention/maintenance language has no implementation in this
module and there is no test affirming the module is central-only (NS-023, NS-024);
and `CredentialRedactor` masks any component of the credential string that is
≥ 4 characters long — a 4-character user name like `root` or a 4-char tenant
prefix could be aggressively scrubbed out of unrelated log text (NS-025).

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Double SMTP client construction; `Auto` socket option for non-TLS; `TimeoutException`/`OperationCanceledException` misclassified. |
| 2 | Akka.NET conventions | ☑ | No actors in this module (`AddNotificationServiceActors` is a no-op); delivery is a plain DI service. No Akka-specific issues. |
| 3 | Concurrency & thread safety | ☑ | `OAuth2TokenService` is a singleton with a shared mutable token cache; double-checked locking present but cache key is wrong (NS-006). |
| 4 | Error handling & resilience | ☑ | Critical: no S&F delivery handler registered for `Notification` (NS-001). Fragile substring error classification (NS-002, NS-003). |
| 5 | Security | ☑ | Credentials handled as plaintext strings; OAuth2 client secret in DB credential blob; no recipient address validation. |
| 6 | Performance & resource management | ☑ | Two `ISmtpClientWrapper` instances created per send, one leaked; connection not pooled; `MaxConcurrentConnections` unenforced. |
| 7 | Design-document adherence | ☑ | Connection timeout, max concurrent connections, and TLS `SSL`/`None` modes from the design doc are not implemented. |
| 8 | Code organization & conventions | ☑ | `SmtpPermanentException` in the wrong file; `SmtpConfiguration` POCO has non-nullable strings with no initializer (compiler-warning risk). |
| 9 | Testing coverage | ☑ | Happy path and main error branches covered; OAuth2 delivery path, `DeliverAsync` permanent fallback, and token-cache concurrency untested. |
| 10 | Documentation & comments | ☑ | XML comment on `DeliverAsync` ("Throws on failure") and the misleading "OAuth2 token refresh if needed" comment do not match behaviour. |
| 1 | Correctness & logic bugs | ☑ | Re-review: OAuth2 SASL constructed with empty user id (NS-021); `CredentialRedactor` over-masks short components (NS-025). Earlier NS-005/NS-008 fixes hold. |
| 2 | Akka.NET conventions | ☑ | No actors in this module. `AddNotificationServiceActors` remains a documented no-op. |
| 3 | Concurrency & thread safety | ☑ | `OAuth2TokenService` per-credential locks now correct (NS-006 hold). No new issues. |
| 4 | Error handling & resilience | ☑ | NS-014/NS-015 classification fixes hold but the entire `DeliverBufferedAsync` / `SendAsync` error path is dead (NS-019/NS-020). |
| 5 | Security | ☑ | OAuth2 `SaslMechanismOAuth2` empty user id (NS-021); `CredentialRedactor` aggressiveness (NS-025); at-rest encryption still deferred (NS-013). |
| 6 | Performance & resource management | ☑ | `MailKitSmtpClientWrapper` keeps a single `SmtpClient` for the wrapper lifetime; combined with per-send factory this means no pooling — re-document or fix (NS-022). |
| 7 | Design-document adherence | ☑ | Critical drift: module still exposes site-style S&F sending; the design doc inverted delivery to central months ago (NS-019). Site registration removed but central still wires the dead service. |
| 8 | Code organization & conventions | ☑ | `INotificationDeliveryService` lives in Commons and is now unused — should be retired or relocated to a NotificationService-internal namespace (NS-019). Module-vs-NotificationOutbox boundary unclear. |
| 9 | Testing coverage | ☑ | 56 tests pass but ~40 of them assert behaviour of a code path no production caller exercises (NS-024). No test affirms the central-only design — i.e. that `AddNotificationService` registers no notification-sending service on a site. |
| 10 | Documentation & comments | ☑ | `NotificationDeliveryService` XML doc still claims "WP-11/12: Notification delivery via SMTP" with no warning that the class is orphaned; `INotificationDeliveryService` Commons doc claims "Implemented by NotificationService, consumed by ScriptRuntimeContext" — both consumers are wrong now (NS-023). |

## Findings

@@ -595,3 +640,199 @@ Replace the hand-rolled double-checked init with `Lazy<SemaphoreSlim>` or `LazyI
**Resolution**

Resolved 2026-05-17. All three issues confirmed against source. The hand-rolled double-checked init was replaced with a `Lazy<SemaphoreSlim>` — its publication is correctly synchronised, eliminating the lock-free read of a non-`volatile` reference. `NotificationDeliveryService` now implements `IDisposable` and disposes the limiter (if created) under the existing lock, with idempotent re-entry and an `ObjectDisposedException` guard in `SendAsync`/`GetConcurrencyLimiter`; the scoped DI registration disposes it per scope. The limiter remains scoped (not hoisted to a site singleton) — the design doc deploys one SMTP config per site and the per-instance capture is bounded; the redeploy-resize concern is acknowledged as low-impact and not changed here, since hoisting would require a registration change for marginal benefit. Tests `Service_Dispose_DisposesConcurrencyLimiter` plus the existing `Send_MaxConcurrentConnections_LimitsConcurrentDeliveries`.

### NotificationService-019 — `NotificationDeliveryService` and `INotificationDeliveryService` are orphaned by the central-only redesign

| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:18-442`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:20-21`, `src/ScadaLink.Commons/Interfaces/Services/INotificationDeliveryService.cs:1-33`, `src/ScadaLink.Host/Program.cs:77` |

**Description**

The updated `Component-NotificationService.md` (re-read in full at this commit) makes the new design unambiguous: "The Notification Service is the central component that manages notification-list and SMTP definitions and provides the per-type delivery adapters used to send notifications. … Notification delivery has been inverted: a site script's notification is store-and-forwarded to the central cluster, and the central **Notification Outbox** owns dispatch and delivery, calling an `INotificationDeliveryAdapter` supplied by this component." The doc explicitly states the service is "central cluster only", "no longer present at site clusters", and "no longer delivers notifications from sites".

The current source does not match. `NotificationDeliveryService` is a site-shaped notification sender: it accepts `(listName, subject, message)`, performs an immediate SMTP `DeliverAsync`, catches transient failures and **buffers them to a `StoreAndForwardCategory.Notification` row**, and exposes `DeliverBufferedAsync` as the matching S&F handler. That is precisely the old site-side flow the design doc says was removed. The doc explicitly notes "there is no … local SQLite copy" of notification lists at sites, yet `DeliverBufferedAsync` re-resolves the list from a repository expected to be reachable on the buffering node.

Who actually calls it?

- **Sites** do **not**. `SiteServiceRegistration.cs:33-38` documents the deliberate omission: "AddNotificationService() is intentionally NOT registered on the site path." Sites register `NotificationForwarder` (in `ScadaLink.StoreAndForward`) as the S&F handler for `StoreAndForwardCategory.Notification` (`AkkaHostedService.cs:654-660`), which Asks the central comms actor and never touches SMTP. `ScriptRuntimeContext.NotifyHelper` (in `SiteRuntime`) enqueues directly to S&F as a serialized `NotificationSubmit`, **not** via `INotificationDeliveryService.SendAsync`.
- **Central** registers it (`Program.cs:77` calls `AddNotificationService`) but no central component resolves it. The central notification dispatcher is `NotificationOutboxActor` → `INotificationDeliveryAdapter` → `EmailNotificationDeliveryAdapter`. The adapter is a full re-implementation of the connect/auth/send/disconnect sequence (see `EmailNotificationDeliveryAdapter.cs:163-222`) — it deliberately does not call `NotificationDeliveryService.DeliverAsync` (XML-doc on the adapter says "Reuses the `ScadaLink.NotificationService` SMTP machinery — `ISmtpClientWrapper`, `SmtpTlsModeParser`, `OAuth2TokenService` and the typed `SmtpPermanentException`", i.e. only the leaf primitives).

The `NotificationDeliveryService` class, its `DeliverBufferedAsync`, the `Func<ISmtpClientWrapper>` registration consumed only by it, and the `INotificationDeliveryService` interface (still in Commons) and `NotificationResult` record are therefore dead code that contradicts the design. Worse, every prior finding NS-001..NS-018 was reviewed and resolved against this dead path. The 56-test green test suite (NS-012 resolution note) exercises behaviour no production caller invokes — it gives a false sense of coverage. The misleading XML doc on `NotificationDeliveryService` ("WP-11/12: Notification delivery via SMTP") tells a maintainer this is *the* delivery path; the registration on central does the same.

Risk: an operator following the design doc will look here for "the central email delivery code" and find a parallel implementation that is never called; a future feature change (e.g. retry policy tweak) made here will silently have no effect; the `Notify` script-API end-to-end behaviour now depends on `NotificationOutbox` + `EmailNotificationDeliveryAdapter` + `NotificationForwarder`, none of which are tested in this module's suite.

**Recommendation**

Decide and execute one of:

1. **Delete `NotificationDeliveryService`, `DeliverBufferedAsync`, the `BufferedNotification` payload type, the `Func<ISmtpClientWrapper>` scoped registration (move it to NotificationOutbox if still needed there — it already has its own), and `INotificationDeliveryService`/`NotificationResult` in Commons.** Reduce `AddNotificationService` to registering the shared primitives — `OAuth2TokenService`, `ISmtpClientWrapper` factory, `NotificationOptions`. Delete the NS-001..NS-018 tests that target the orphaned path; rebase the ones that exercise primitives (`SmtpErrorClassifier`, `SmtpTlsModeParser`, `CredentialRedactor`, `EmailAddressValidator`, `MailKitSmtpClientWrapper`, `OAuth2TokenService`) which remain genuinely shared. Update `CompositionRootTests` (`tests/ScadaLink.Host.Tests/CompositionRootTests.cs:208-209`) and `IntegrationSurfaceTests` (`tests/ScadaLink.IntegrationTests/IntegrationSurfaceTests.cs:122-135`) to drop the stale assertions.
|
||||
|
||||
2. **Keep the class as the central-only Email delivery primitive** and rewrite `EmailNotificationDeliveryAdapter` to delegate to it. This is the smaller diff but the larger semantic burden — `NotificationDeliveryService.SendAsync` returns `NotificationResult` (Success / WasBuffered) which cannot encode the three-way `DeliveryOutcome` (Success / Transient / Permanent) the outbox needs, so the contract still has to change.
|
||||
|
||||
Recommended path is option 1: the parallel implementation in `EmailNotificationDeliveryAdapter` is already complete and matches the new design's `DeliveryOutcome` model; salvaging the old class would re-introduce the very inversion this redesign removed.
|
||||
|
||||
### NotificationService-020 — NS-001 fix superseded; `AkkaHostedService` would register two competing `Notification` S&F handlers if both code paths ran
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:654-660`, NS-001 resolution note (this file) |
|
||||
|
||||
**Description**
|
||||
|
||||
NS-001 was resolved by registering an `S&F → DeliverBufferedAsync` handler for `StoreAndForwardCategory.Notification` at site startup in `AkkaHostedService`. The current source registers a **different** handler for the same category at `AkkaHostedService.cs:654-660` — `NotificationForwarder.DeliverAsync`, which forwards to central instead of sending SMTP. `StoreAndForwardService.RegisterDeliveryHandler` (verified by reading `StoreAndForward/StoreAndForwardService.cs` around line 109) takes a single handler per category — last-write-wins or first-write-wins, either way the two registrations cannot both be active.
|
||||
|
||||
The NS-001 resolution note in this file describes a state of the code that no longer exists: it says the handler "is now registered at site startup in `AkkaHostedService`" and points to a handler resolving `NotificationDeliveryService` via a fresh DI scope. That registration is gone from the current `AkkaHostedService` (only `ExternalSystem`, `CachedDbWrite`, and the `NotificationForwarder`-based `Notification` registration are present at the current location). So the NS-001 fix has been silently rolled back / replaced as part of the central-only redesign.
|
||||
|
||||
The risk this finding tracks is not the current state per se — `NotificationForwarder` registration is correct under the new design — but the **stale resolution note** plus the fact that `NotificationDeliveryService.DeliverBufferedAsync` still exists in this module and is still tested as an S&F handler. A future merge or revert that re-introduces the NS-001-style registration (because it is what the test suite shape implies) would conflict with `NotificationForwarder`. The two handlers do diametrically opposite things (forward to central vs. send SMTP locally on a site where there is no SMTP config), so a misregistration would cause a silent regression of the design inversion.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Mark the NS-001 resolution note in this file as **superseded by NS-019** with a one-line note explaining that the registration was removed when sites stopped delivering. Delete the orphan `DeliverBufferedAsync` and its tests as part of the NS-019 work. Add a comment on `NotificationForwarder` registration in `AkkaHostedService` cross-referencing NS-019/NS-020 so a maintainer searching for the `Notification` S&F handler finds the one canonical registration.
|
||||
|
||||
### NotificationService-021 — OAuth2 SASL constructed with empty user identifier; M365 SMTP will reject the auth handshake
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:76-79` |
|
||||
|
||||
**Description**
|
||||
|
||||
```csharp
case "oauth2":
    // OAuth2 token is passed directly as credentials (pre-fetched by token service)
    var oauth2 = new SaslMechanismOAuth2("", credentials);
    await _client.AuthenticateAsync(oauth2, cancellationToken);
    break;
```
|
||||
|
||||
`SaslMechanismOAuth2(string userName, string token)` — MailKit's XOAUTH2 mechanism — sends the SASL initial response as `user=<userName>\x01auth=Bearer <token>\x01\x01`. Microsoft 365 (and most OAuth2-enabled SMTP relays) **require the `userName` field to be the From mailbox identity the token was issued for**; an empty string is rejected with a server response like `535 5.7.3 Authentication unsuccessful` ("Either the user identity does not match the principal in the token, or the user is empty"). Office 365's documentation for SMTP AUTH XOAUTH2 calls this out explicitly.
|
||||
|
||||
The token-fetch path supports this: `OAuth2TokenService.GetTokenAsync` issues a Client Credentials grant against `login.microsoftonline.com/{tenantId}/oauth2/v2.0/token` with `scope=https://outlook.office365.com/.default`, which is the Microsoft 365 SMTP send scope — meaning the intended target is M365 SMTP, which is precisely the server that rejects an empty user. The `SmtpConfiguration.FromAddress` field is exactly the user identity that should be passed.
|
||||
|
||||
This bug is not caught by tests because every existing test uses a fake `ISmtpClientWrapper` (`Substitute.For<ISmtpClientWrapper>()`, `RecordingAuthClient`, etc.), so `MailKitSmtpClientWrapper.AuthenticateAsync` is never exercised against a real `SaslMechanismOAuth2`. The OAuth2 delivery test (NS-012, `Send_OAuth2Config_AuthenticatesWithResolvedAccessToken`) only asserts that the wrapper's `AuthenticateAsync` is invoked with `("oauth2", "<access-token>")`; the wrapper itself is mocked out. `EmailNotificationDeliveryAdapter` has the same defect because it routes through this same `AuthenticateAsync` method.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Pass the sender mailbox into the wrapper's `AuthenticateAsync` path. The cleanest fix is to thread `config.FromAddress` (or a dedicated `oauth2UserName` parameter) through `ISmtpClientWrapper.AuthenticateAsync` so the OAuth2 branch can construct `new SaslMechanismOAuth2(config.FromAddress, credentials)`. Add an integration-style test that runs `MailKitSmtpClientWrapper.AuthenticateAsync` against a stub `SmtpClient` and asserts the XOAUTH2 initial-response bytes contain the expected `user=<from>` field, so this regression is caught next time.
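
To make the role of the user field concrete, the following minimal sketch builds the XOAUTH2 initial client response by hand, using the `user=...` / `auth=Bearer ...` layout quoted in the description. `XOAuth2Sketch`, `BuildInitialResponse` and `Encode` are hypothetical names used only for illustration; they are not part of MailKit or of this codebase, and the mailbox and token values in the usage note are placeholders.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of MailKit or ScadaLink.
public static class XOAuth2Sketch
{
    // Ctrl-A (0x01) separates the fields of the XOAUTH2 initial response.
    private const char CtrlA = '\u0001';

    // Builds the raw (pre-base64) initial response:
    // "user=<userName>^Aauth=Bearer <token>^A^A".
    public static string BuildInitialResponse(string userName, string token)
    {
        if (userName is null) throw new ArgumentNullException(nameof(userName));
        if (token is null) throw new ArgumentNullException(nameof(token));
        return $"user={userName}{CtrlA}auth=Bearer {token}{CtrlA}{CtrlA}";
    }

    // SASL transmits the initial response base64-encoded.
    public static string Encode(string raw) =>
        Convert.ToBase64String(Encoding.UTF8.GetBytes(raw));

    // True when the response carries no mailbox identity, i.e. the case this
    // finding describes: "user=" followed immediately by Ctrl-A.
    public static bool HasEmptyUser(string raw) =>
        raw.StartsWith("user=" + CtrlA, StringComparison.Ordinal);
}
```

Calling `BuildInitialResponse("", "placeholder-token")` yields a response for which `HasEmptyUser` is true, whereas passing a placeholder mailbox such as `scada-notifications@company.com` puts that identity in the `user=` field. A wrapper-level test along the lines recommended above could assert on the same field.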
|
||||
|
||||
### NotificationService-022 — `MailKitSmtpClientWrapper` holds a long-lived `SmtpClient`; combined with per-send factory, the design comment about pooling is contradicted
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:14`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:19` |
|
||||
|
||||
**Description**
|
||||
|
||||
`MailKitSmtpClientWrapper` declares `private readonly SmtpClient _client = new();` — a single `SmtpClient` is constructed when the wrapper is constructed and lives for the wrapper's lifetime. The DI registration is `services.AddSingleton<Func<ISmtpClientWrapper>>(_ => () => new MailKitSmtpClientWrapper());` (`ServiceCollectionExtensions.cs:19`) — every invocation of the factory creates a **new** wrapper and therefore a **new** `SmtpClient`. `NotificationDeliveryService.DeliverAsync` (the orphan, per NS-019) and `EmailNotificationDeliveryAdapter.SendAsync` both invoke the factory per send and dispose the wrapper at end of send. So in practice there is no connection pooling — every send pays a full TCP+TLS handshake.
|
||||
|
||||
This is internally consistent (and matches MailKit guidance — `SmtpClient` is not thread-safe and reusing across deliveries needs careful guarding). However:
|
||||
|
||||
1. The XML doc comment on the wrapper class says nothing about lifetime, and the field initializer `new SmtpClient()` *implies* a reusable connection. A maintainer might "fix" the factory to reuse a single wrapper (singleton), believing they are enabling pooling, and immediately introduce a concurrency bug: `MailKit.Net.Smtp.SmtpClient` rejects concurrent send calls and the wrapper carries no synchronization.
|
||||
2. `ConnectAsync` mutates `_client.Timeout` (`MailKitSmtpClientWrapper.cs:39-42`) every time it runs. If a wrapper is ever reused across deliveries with different `SmtpConfiguration.ConnectionTimeoutSeconds` values, the timeout is silently overwritten — not a current bug, but a latent footgun.
|
||||
3. The design doc requirement "Max concurrent connections (default 5)" is currently honoured by the NS-007 `SemaphoreSlim` on `NotificationDeliveryService`, but `EmailNotificationDeliveryAdapter` has **no equivalent throttle** — see `EmailNotificationDeliveryAdapter.cs:163-222`, no semaphore. So on central, where the actual delivery now happens, the design-doc concurrency limit is no longer enforced. This is a regression introduced by the redesign — the outbox does not carry NS-007's limiter forward.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Document the per-send lifecycle on `MailKitSmtpClientWrapper` (XML on the class: "one wrapper per delivery; the wrapper owns a single `SmtpClient` that is connected/authenticated/sent/disconnected/disposed once"). Either move the NS-007 `SemaphoreSlim` into a shared per-site holder consumed by `EmailNotificationDeliveryAdapter`, or accept the loss and update the design doc. Add `[Obsolete]` or `internal` to discourage re-using a wrapper across sends.
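
One possible shape for a shared throttle is sketched below, assuming the design-doc default of 5 concurrent connections. `SmtpConcurrencyLimiter` and `RunAsync` are hypothetical names; how such a holder would be registered and injected into `EmailNotificationDeliveryAdapter` is left open here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shared limiter; name and shape are assumptions for illustration.
public sealed class SmtpConcurrencyLimiter : IDisposable
{
    private readonly SemaphoreSlim _slots;

    public SmtpConcurrencyLimiter(int maxConcurrentConnections = 5)
    {
        if (maxConcurrentConnections < 1)
            throw new ArgumentOutOfRangeException(nameof(maxConcurrentConnections));
        _slots = new SemaphoreSlim(maxConcurrentConnections, maxConcurrentConnections);
    }

    // Runs one delivery while holding a slot. The slot is released even when
    // the send throws, so a failed delivery cannot leak capacity.
    public async Task<T> RunAsync<T>(
        Func<CancellationToken, Task<T>> send,
        CancellationToken cancellationToken = default)
    {
        if (send is null) throw new ArgumentNullException(nameof(send));

        await _slots.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            return await send(cancellationToken).ConfigureAwait(false);
        }
        finally
        {
            _slots.Release();
        }
    }

    public void Dispose() => _slots.Dispose();
}
```

Each delivery would still create, use and dispose its own wrapper inside the `send` delegate, so the per-send lifecycle described above is unchanged; the limiter only bounds how many of those sends are in flight at once.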
|
||||
|
||||
### NotificationService-023 — XML docs on the orphaned classes still describe the removed site-delivery flow; misleading to maintainers
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:12-17`, `src/ScadaLink.Commons/Interfaces/Services/INotificationDeliveryService.cs:3-12`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:8-9` |
|
||||
|
||||
**Description**
|
||||
|
||||
XML comments still claim the dead path is the live path:
|
||||
|
||||
- `NotificationDeliveryService` class summary: "WP-11: Notification delivery via SMTP. WP-12: Error classification and S&F integration. Transient: connection refused, timeout, SMTP 4xx → hand to S&F. Permanent: SMTP 5xx → returned to script." This is the pre-redesign behaviour. The site-S&F branch in particular is dead (see NS-019), and "returned to script" is no longer accurate — `Notify.Send` is async and never returns a permanent error to the script per the design doc.
|
||||
- `INotificationDeliveryService` (Commons): "Interface for sending notifications. Implemented by NotificationService, consumed by ScriptRuntimeContext." Verified against source: `ScriptRuntimeContext` does **not** consume this interface — it enqueues directly to `StoreAndForwardService` (see `SiteRuntime/Scripts/ScriptRuntimeContext.cs:1770-1774`). The Commons-level claim therefore documents an interaction that no longer exists.
|
||||
- `NotificationResult` is a record returned only by the orphaned `SendAsync`. The Notification Outbox uses `DeliveryOutcome` instead, which encodes the Success/Transient/Permanent three-way that `NotificationResult(Success, ErrorMessage, WasBuffered)` cannot.
|
||||
- `ServiceCollectionExtensions.AddNotificationService` XML doc says "Registers the notification delivery services (SMTP, OAuth2 token, delivery adapter)" — no mention that the central-only redesign means most of what it registers is unused.
|
||||
|
||||
A reader following the XML docs from any entry point ends up at a path that does not run. The CLAUDE.md "External Integrations" section and `Component-NotificationService.md` describe the new design; the in-source docs contradict them.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Tied to NS-019: if the orphan classes are deleted, this finding closes itself. If they are kept temporarily, prepend each summary with "**Obsolete — superseded by NotificationOutbox's `EmailNotificationDeliveryAdapter`. Retained for transitional compatibility; do not add new callers.**" and update `INotificationDeliveryService`'s summary to reflect the inverted flow or remove the interface.
|
||||
|
||||
### NotificationService-024 — No test affirms the central-only invariant; the orphaned-path tests give a false coverage signal
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.NotificationService.Tests/NotificationDeliveryServiceTests.cs`, `tests/ScadaLink.IntegrationTests/IntegrationSurfaceTests.cs:118-136`, `tests/ScadaLink.Host.Tests/CompositionRootTests.cs:207-209` |
|
||||
|
||||
**Description**
|
||||
|
||||
The module test suite has 56 tests; counting `NotificationDeliveryServiceTests.cs`, ~40 of them exercise `NotificationDeliveryService.SendAsync`/`DeliverBufferedAsync` — code paths that, per NS-019, no production caller resolves. They pass against the orphaned class and so the suite stays green, but the green is a false signal: changing the dead implementation (or deleting it) does not flag any regression in the live notification-delivery flow, which now lives in `EmailNotificationDeliveryAdapter` (covered by NotificationOutbox's own tests) and `NotificationForwarder` (covered, if at all, by StoreAndForward's tests).
|
||||
|
||||
In particular there is **no test in this module** that affirms the central-only invariant the design doc requires:
|
||||
|
||||
- No test that `AddNotificationService()` registered on a *site* role would be inert / no-op'd, or that `SiteServiceRegistration.Configure` does **not** call `AddNotificationService` (an obvious regression vector — re-adding it would silently restore the orphaned site-delivery path).
|
||||
- No test that confirms `INotificationDeliveryService` has no production consumer (i.e. an architecture test that fails if anyone re-introduces a constructor parameter or `GetRequiredService<INotificationDeliveryService>()` call).
|
||||
- The cross-module `CompositionRootTests` (`tests/ScadaLink.Host.Tests/CompositionRootTests.cs:208-209`) still asserts `NotificationDeliveryService` and `INotificationDeliveryService` are registered, locking in the orphan rather than catching it.
|
||||
- `IntegrationSurfaceTests.cs:122-125` constructs `NotificationDeliveryService` directly to validate "the integration surface" — testing a surface that no script actually crosses.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
After NS-019 is decided:
|
||||
|
||||
1. If the orphan is deleted, remove the orphaned-path tests (NS-001/004/005/007/008/009/010/014/015/016/017/018-style tests targeting `SendAsync`/`DeliverBufferedAsync`). Retain `SmtpErrorClassifierTests`, `SmtpTlsModeParserTests`, `CredentialRedactorTests`, `OAuth2TokenServiceTests`, and `MailKitSmtpClientWrapperTests` (primitives genuinely shared). Update `CompositionRootTests` to drop the stale rows and `IntegrationSurfaceTests` to call the live path via `INotificationDeliveryAdapter`/`EmailNotificationDeliveryAdapter`.
|
||||
2. Add a one-shot architecture test in `tests/ScadaLink.Architecture.Tests` (if it exists, else this module) that scans for direct references to `INotificationDeliveryService` outside this project and the obsolete-interface declaration in Commons, failing if any new consumer reappears.
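
A reflection-based scan is one way such an architecture test might start. The sketch below is an assumption about shape, not an existing helper: `ConsumerScan` and `FindConstructorConsumers` are hypothetical names, and the scan only detects constructor injection. It would not catch a `GetRequiredService<INotificationDeliveryService>()` call, which needs source or IL analysis.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical helper for an architecture test; covers constructor injection only.
public static class ConsumerScan
{
    // Returns the full names of concrete classes, in the given assemblies, that
    // declare a constructor parameter of the forbidden type.
    public static IReadOnlyList<string> FindConstructorConsumers(
        Type forbidden, IEnumerable<Assembly> assemblies)
    {
        const BindingFlags AllInstanceCtors =
            BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance;

        return assemblies
            .SelectMany(a => a.GetTypes()) // may throw ReflectionTypeLoadException for partially loadable assemblies
            .Where(t => t.IsClass && !t.IsAbstract)
            .Where(t => t.GetConstructors(AllInstanceCtors)
                .Any(c => c.GetParameters().Any(p => p.ParameterType == forbidden)))
            .Select(t => t.FullName ?? t.Name)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();
    }
}
```

In a test, the assemblies passed in would be the production assemblies other than this module, and the assertion would be that the returned list is empty (or contains only an explicitly allowed set while the orphan still exists).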
|
||||
|
||||
### NotificationService-025 — `CredentialRedactor` over-masks: any 4-character credential component is masked anywhere it appears, including unrelated log text
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/CredentialRedactor.cs:34-48` |
|
||||
|
||||
**Description**
|
||||
|
||||
```csharp
var parts = credentials.Split(':')
    .Where(p => p.Length >= 4)
    .Append(credentials)
    .Distinct()
    .OrderByDescending(p => p.Length);

foreach (var part in parts)
{
    result = result.Replace(part, Mask, StringComparison.Ordinal);
}
```
|
||||
|
||||
The threshold `p.Length >= 4` is permissive enough that common short identifiers used by operators become aggressive global redaction tokens:
|
||||
|
||||
- A Basic-Auth credential of `root:hunter2` produces components `["root", "hunter2", "root:hunter2"]`. Every literal `root` anywhere in the exception/log text is masked — including unrelated mentions like file paths (`/root/.config`) or default-account names in the server's reply. This obscures legitimate diagnostic information without protecting any additional secret.
|
||||
- An OAuth2 tenant id is a GUID (long, safe). The client id is typically a GUID. The client secret is the high-entropy part. The full `tenant:client:secret` is the actual sensitive triple. A tenant GUID embedded in unrelated text (a tenant-bound error code, a partial URL) will be masked even when the appearance is non-sensitive.
|
||||
- The user name in Basic Auth is sometimes the From address (`scada-notifications@company.com`) — masking *the company's notification mailbox* in every log line that mentions it has real operational cost.
|
||||
|
||||
The function also uses an ordinal `String.Replace` with no word-boundary awareness, so a 4-character component that happens to be a substring of a longer benign token is masked as well.
|
||||
|
||||
The threshold is a defence-in-depth choice; the existing tests assert that `Hunter2pw!` and `Sup3rSecretValue` are masked (good) and that `null` text/credentials are handled (good), but nothing pins the negative behaviour: e.g. a test that a 4-char user name `root` is **not** also masked when it appears in an unrelated path.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Tighten the redaction policy: mask only the obviously-secret components — the password (Basic), the client secret (OAuth2), and the whole packed string — not the user name / tenant / client id. The simplest implementation is to redact only the **last** colon-separated component (the secret) plus the full packed string. Bump the per-component minimum length to something high enough that a typical short user name does not match (≥ 12 chars is the usual heuristic for a password). Add a test asserting `Scrub("/root/.config", "root:hunter2")` does not mask `/root/.config`'s `root`.
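
A minimal sketch of the "secret component only" rule follows. It is not the existing `CredentialRedactor`: `SecretOnlyRedactor` is a hypothetical name, the `"***"` mask string is an assumption, and the sketch keeps the current 4-character floor so that the length threshold discussed above remains a separate decision.

```csharp
using System;

// Hypothetical sketch of the tightened policy; not the existing CredentialRedactor.
public static class SecretOnlyRedactor
{
    private const string Mask = "***";      // assumed mask token
    private const int MinSecretLength = 4;  // current floor; raising it is a separate choice

    public static string Scrub(string text, string credentials)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(credentials))
            return text ?? string.Empty;

        // Mask the whole packed string first, so the longest match wins.
        string result = text.Replace(credentials, Mask, StringComparison.Ordinal);

        // Then mask only the last colon-separated component (password or
        // client secret). User name, tenant id and client id are left alone.
        int lastColon = credentials.LastIndexOf(':');
        string secret = lastColon >= 0 ? credentials[(lastColon + 1)..] : credentials;
        if (secret.Length >= MinSecretLength)
            result = result.Replace(secret, Mask, StringComparison.Ordinal);

        return result;
    }
}
```

Under this rule `Scrub("/root/.config", "root:hunter2")` leaves the path untouched, while text containing `hunter2` or the full `root:hunter2` is still masked.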
|
||||
|
||||
+208
-29
@@ -40,34 +40,38 @@ module file and counted in **Total**.
|
||||
| Severity | Open findings |
|
||||
|----------|---------------|
|
||||
| Critical | 0 |
|
||||
| High | 0 |
|
||||
| Medium | 0 |
|
||||
| Low | 0 |
|
||||
| **Total** | **0** |
|
||||
| High | 18 |
|
||||
| Medium | 62 |
|
||||
| Low | 92 |
|
||||
| **Total** | **172** |
|
||||
|
||||
## Module Status
|
||||
|
||||
| Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
|
||||
|--------|---------------|--------|----------------|------|-------|
|
||||
| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
|
||||
| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 25 |
|
||||
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 10 |
|
||||
| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
|
||||
| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 15 |
|
||||
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
|
||||
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
|
||||
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
|
||||
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
|
||||
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
|
||||
| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 15 |
|
||||
| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
|
||||
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
|
||||
| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 18 |
|
||||
| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 15 |
|
||||
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
|
||||
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 19 |
|
||||
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
|
||||
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
|
||||
| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/8 | 11 | 11 |
|
||||
| [CLI](CLI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/4 | 7 | 23 |
|
||||
| [CentralUI](CentralUI/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/2/5 | 8 | 33 |
|
||||
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/4 | 4 | 14 |
|
||||
| [Commons](Commons/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/6 | 9 | 23 |
|
||||
| [Communication](Communication/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/1/5 | 7 | 22 |
|
||||
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/4/5 | 10 | 24 |
|
||||
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/4/0 | 5 | 22 |
|
||||
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/1/5 | 7 | 24 |
|
||||
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/2/3 | 6 | 23 |
|
||||
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/5 | 7 | 23 |
|
||||
| [Host](Host/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/5 | 7 | 22 |
|
||||
| [InboundAPI](InboundAPI/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/3/4 | 8 | 25 |
|
||||
| [ManagementService](ManagementService/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/3/2 | 6 | 23 |
|
||||
| [NotificationOutbox](NotificationOutbox/findings.md) | 2026-05-28 | `1eb6e97` | 0/2/5/3 | 10 | 10 |
|
||||
| [NotificationService](NotificationService/findings.md) | 2026-05-28 | `1eb6e97` | 0/2/2/3 | 7 | 25 |
|
||||
| [Security](Security/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/4 | 6 | 21 |
|
||||
| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/4 | 6 | 6 |
|
||||
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/2/6 | 9 | 23 |
|
||||
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/4/3 | 7 | 26 |
|
||||
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/3/3 | 7 | 24 |
|
||||
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/4/1 | 6 | 22 |
|
||||
| [Transport](Transport/findings.md) | 2026-05-28 | `1eb6e97` | 0/3/5/4 | 12 | 12 |
|
||||
|
||||
## Pending Findings
|
||||
|
||||
@@ -80,14 +84,189 @@ description, location, recommendation — lives in the module's `findings.md`.
|
||||
|
||||
_None open._
|
||||
|
||||
### High (0)
|
||||
### High (18)
|
||||
|
||||
_None open._
|
||||
| ID | Module | Title |
|
||||
|----|--------|-------|
|
||||
| CentralUI-028 | [CentralUI](CentralUI/findings.md) | `NotificationReport` and `SiteCallsReport` bypass `SiteScopeService` — Deployment role site-scoping defeated on the two new central-mirror pages |
|
||||
| Communication-016 | [Communication](Communication/findings.md) | `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires |
|
||||
| ConfigurationDatabase-015 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `NotificationOutboxRepository.InsertIfNotExistsAsync` is a check-then-act race with no duplicate-key catch |
|
||||
| DataConnectionLayer-018 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Concurrent subscribes for the same tag from different instances orphan an adapter subscription handle |
|
||||
| DeploymentManager-018 | [DeploymentManager](DeploymentManager/findings.md) | Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover |
|
||||
| ExternalSystemGateway-018 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `DeliverBufferedAsync` lets `JsonException` propagate, turning a corrupt buffered row into a permanent retry-forever poison message |
|
||||
| InboundAPI-022 | [InboundAPI](InboundAPI/findings.md) | `IActiveNodeGate` has no production registration in Host — standby-node gating is silently disabled in production |
|
||||
| ManagementService-018 | [ManagementService](ManagementService/findings.md) | QueryAuditLogCommand has no role gate |
|
||||
| NotificationOutbox-001 | [NotificationOutbox](NotificationOutbox/findings.md) | `EmailNotificationDeliveryAdapter` inherits the OAuth2 empty-user SASL bug (NS-021) on the M365 send path |
|
||||
| NotificationOutbox-002 | [NotificationOutbox](NotificationOutbox/findings.md) | Dispatcher parks on first transient failure when `SmtpConfiguration.MaxRetries == 0` |
|
||||
| NotificationService-019 | [NotificationService](NotificationService/findings.md) | `NotificationDeliveryService` and `INotificationDeliveryService` are orphaned by the central-only redesign |
|
||||
| NotificationService-021 | [NotificationService](NotificationService/findings.md) | OAuth2 SASL constructed with empty user identifier; M365 SMTP will reject the auth handshake |
|
||||
| SiteEventLogging-016 | [SiteEventLogging](SiteEventLogging/findings.md) | `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps |
|
||||
| StoreAndForward-018 | [StoreAndForward](StoreAndForward/findings.md) | Notification corrupt-payload parks the buffered message, contradicting the "notifications do not park" design invariant |
|
||||
| TemplateEngine-017 | [TemplateEngine](TemplateEngine/findings.md) | Revision hash and diff both ignore `Description` and `Connections`, defeating staleness detection for real deployment changes |
|
||||
| Transport-001 | [Transport](Transport/findings.md) | Template Overwrite never syncs attributes / alarms / scripts |
|
||||
| Transport-002 | [Transport](Transport/findings.md) | ExternalSystem Overwrite never syncs methods |
|
||||
| Transport-003 | [Transport](Transport/findings.md) | Unlock lockout is enforced only client-side; server session is never marked Locked |
|
||||
|
||||
### Medium (0)
|
||||
### Medium (62)
|
||||
|
||||
_None open._
|
||||
| ID | Module | Title |
|
||||
|----|--------|-------|
|
||||
| AuditLog-001 | [AuditLog](AuditLog/findings.md) | Combined-telemetry transport is plumbed end-to-end but never invoked in production |
| AuditLog-004 | [AuditLog](AuditLog/findings.md) | `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows |
| AuditLog-005 | [AuditLog](AuditLog/findings.md) | `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan |
| CLI-017 | [CLI](CLI/findings.md) | `BundleCommands.RunBundleCommandAsync` duplicates `ExecuteCommandAsync` and breaks the auth exit-code contract |
| CLI-018 | [CLI](CLI/findings.md) | `audit query` and `audit export` never return exit 2 for an authorization failure |
| CLI-019 | [CLI](CLI/findings.md) | `bundle export` decodes the entire base64 bundle into memory before writing |
| CentralUI-026 | [CentralUI](CentralUI/findings.md) | `AuditFilterBar` From/To filters treat browser-local datetimes as UTC |
| CentralUI-027 | [CentralUI](CentralUI/findings.md) | Same UTC misinterpretation in `SiteCallsReport`, `NotificationReport`, and `EventLogs` |
| Commons-015 | [Commons](Commons/findings.md) | `EncryptionMetadata` accepts any algorithm string and any iteration count |
| Commons-017 | [Commons](Commons/findings.md) | `Component-Commons.md` is significantly stale (audit enums, new entities, new repositories, new service interfaces, new folders) |
| Commons-019 | [Commons](Commons/findings.md) | New `*Utc`-suffixed `DateTime` columns on `AuditEvent` / `SiteCall` are not enforced as UTC; inconsistent with `Notification`'s `DateTimeOffset` |
| Communication-017 | [Communication](Communication/findings.md) | `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up |
| ConfigurationDatabase-016 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `InboundApiRepository.GetApiKeyByValueAsync` hashes the candidate with the unpeppered `ApiKeyHasher.Default` |
| ConfigurationDatabase-017 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Stub-attach delete on `DeploymentRecord` bypasses optimistic concurrency |
| ConfigurationDatabase-018 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `DateTime`-typed `*Utc` columns on `AuditEvent` / `SiteCall` carry no `DateTimeKind` enforcement |
| ConfigurationDatabase-019 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `EnsureLookaheadAsync` swallows non-idempotent SPLIT failures and continues, creating partition holes |
| DataConnectionLayer-019 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from concurrent thread-pool continuations |
| DataConnectionLayer-020 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` double-counts `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance's subscribe |
| DataConnectionLayer-021 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` re-creates and leaks `_subscriptionsByInstance` entry when the instance unsubscribed mid-flight |
| DataConnectionLayer-022 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` and `HandleTagResolutionFailed` reset the tag-resolution retry timer on every call via `StartPeriodicTimer`, starving the retry under subscribe bursts |
| DeploymentManager-019 | [DeploymentManager](DeploymentManager/findings.md) | Lifecycle command timeout writes no audit entry |
| ExternalSystemGateway-019 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `HttpClient.Timeout` is not set; `DefaultHttpTimeout` > 100s is silently clipped by the framework default |
| ExternalSystemGateway-020 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `JsonElementToParameterValue` silently downcasts non-Int64 JSON numbers to `double`, losing precision for `decimal` SQL parameters on retry |
| HealthMonitoring-017 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthReportSender` resets interval counters before `Send`; transport failures silently drop the interval's error counts |
| HealthMonitoring-019 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` design-doc metrics have no HealthMonitoring-side surface |
| Host-016 | [Host](Host/findings.md) | Site `CentralContactPoints` second entry targets the site's own remoting port |
| Host-017 | [Host](Host/findings.md) | Site-shutdown ordering from REQ-HOST-7 is not wired |
| InboundAPI-018 | [InboundAPI](InboundAPI/findings.md) | `AuditWriteMiddleware` fires `WriteAsync` as `_ = task` — faulted async writes are unobserved |
| InboundAPI-021 | [InboundAPI](InboundAPI/findings.md) | `ParentExecutionId` correlation flows only through `Call`; attribute reads/writes lose the inbound→site execution-tree link |
| InboundAPI-025 | [InboundAPI](InboundAPI/findings.md) | `AuditWriteMiddleware` runs against the entire `/api/*` branch — emits spurious `ApiInbound` audit rows for `/api/audit/query` and `/api/audit/export` |
| ManagementService-019 | [ManagementService](ManagementService/findings.md) | AuditEndpoints builds PermittedSiteIds but never enforces them |
| ManagementService-020 | [ManagementService](ManagementService/findings.md) | UpdateSmtpConfig returns and audits the SMTP Credentials field verbatim |
| ManagementService-021 | [ManagementService](ManagementService/findings.md) | Transport bundle handlers have zero test coverage |
| NotificationOutbox-003 | [NotificationOutbox](NotificationOutbox/findings.md) | Dispatcher does not propagate a `CancellationToken` into delivery; in-flight SMTP sends cannot be cancelled on shutdown |
| NotificationOutbox-004 | [NotificationOutbox](NotificationOutbox/findings.md) | `EmitAttemptAudit`/`EmitTerminalAudit` fire-and-forget pattern can outlive the per-sweep DI scope |
| NotificationOutbox-005 | [NotificationOutbox](NotificationOutbox/findings.md) | Ingest persistence inherits the CD-015 check-then-act race; under contention the second writer throws and the site retries |
| NotificationOutbox-007 | [NotificationOutbox](NotificationOutbox/findings.md) | `NotificationOutboxOptions.DispatchBatchSize`, `DeliveredKpiWindow`, and `PurgeInterval` are not in the design document |
| NotificationOutbox-010 | [NotificationOutbox](NotificationOutbox/findings.md) | Comment claims `PipeTo` is not used "because the writer never throws"; the surrounding try/catch is dead-letter for the documented failure mode |
| NotificationService-020 | [NotificationService](NotificationService/findings.md) | NS-001 fix superseded; `AkkaHostedService` would register two competing `Notification` S&F handlers if both code paths ran |
| NotificationService-024 | [NotificationService](NotificationService/findings.md) | No test affirms the central-only invariant; the orphaned-path tests give a false coverage signal |
| Security-016 | [Security](Security/findings.md) | `RoleMapper` silently drops the system-wide Deployment grant when a user is also in any site-scoped Deployment group |
| Security-017 | [Security](Security/findings.md) | `SiteScopeRequirement` / `SiteScopeAuthorizationHandler` are dead code from production callers — `[Authorize(Policy = RequireDeployment)]` does NOT enforce site scoping |
| SiteCallAudit-001 | [SiteCallAudit](SiteCallAudit/findings.md) | SupervisorStrategy override is dead code; XML claims Resume that is not enforced |
| SiteCallAudit-003 | [SiteCallAudit](SiteCallAudit/findings.md) | `OnUpsertAsync` does not refresh `IngestedAtUtc`; direct-write callers must remember to stamp it |
| SiteEventLogging-015 | [SiteEventLogging](SiteEventLogging/findings.md) | Background write queue is unbounded; can grow without limit under sustained writer slowness |
| SiteEventLogging-017 | [SiteEventLogging](SiteEventLogging/findings.md) | Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale |
| SiteRuntime-020 | [SiteRuntime](SiteRuntime/findings.md) | Second `DeployInstanceCommand` arriving during a pending redeploy races the still-terminating actor on its name |
| SiteRuntime-021 | [SiteRuntime](SiteRuntime/findings.md) | `HandleDeployArtifacts` updates `DataConnections` in SQLite but never sends `CreateConnectionCommand` to the DCL |
| SiteRuntime-022 | [SiteRuntime](SiteRuntime/findings.md) | `AuditingDbCommand.DbConnection.set` uses reflection to read `AuditingDbConnection._inner` |
| SiteRuntime-024 | [SiteRuntime](SiteRuntime/findings.md) | `OperationTrackingStore` serialises all writes through one connection + `SemaphoreSlim`, and `Dispose()` does sync-over-async |
| StoreAndForward-019 | [StoreAndForward](StoreAndForward/findings.md) | Notifications park after `DefaultMaxRetries` exhaustion, contradicting "retried until central acks" |
| StoreAndForward-020 | [StoreAndForward](StoreAndForward/findings.md) | `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load |
| StoreAndForward-021 | [StoreAndForward](StoreAndForward/findings.md) | Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime |
| TemplateEngine-018 | [TemplateEngine](TemplateEngine/findings.md) | `DiffService` reports no entries for added/removed/changed connections |
| TemplateEngine-019 | [TemplateEngine](TemplateEngine/findings.md) | `TemplateResolver.BuildInheritanceChain` still uses the `0`-as-no-parent sentinel that was removed from `CycleDetector` |
| TemplateEngine-020 | [TemplateEngine](TemplateEngine/findings.md) | `Create*` audit entries are written with `EntityId = "0"` before `SaveChangesAsync` populates the real key |
| TemplateEngine-021 | [TemplateEngine](TemplateEngine/findings.md) | `MoveTemplateAsync` skips folder cycle and sibling-name-collision validation |
| Transport-004 | [Transport](Transport/findings.md) | `MaxUnlockAttemptsPerIpPerHour` option is declared but never enforced |
| Transport-005 | [Transport](Transport/findings.md) | Manifest fields outside `ContentHash` are not bound to the encrypted payload |
| Transport-006 | [Transport](Transport/findings.md) | Bundle ZIP read has no per-entry size cap or entry-count cap (zip-bomb / decompression-bomb) |
| Transport-007 | [Transport](Transport/findings.md) | Failed import sessions retain decrypted plaintext for the full 30-minute TTL |
| Transport-010 | [Transport](Transport/findings.md) | Critical Overwrite + cross-cutting paths uncovered by tests |

### Low (0)
### Low (92)

_None open._
| ID | Module | Title |
|----|--------|-------|
| AuditLog-002 | [AuditLog](AuditLog/findings.md) | `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider |
| AuditLog-003 | [AuditLog](AuditLog/findings.md) | `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously |
| AuditLog-006 | [AuditLog](AuditLog/findings.md) | `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock |
| AuditLog-007 | [AuditLog](AuditLog/findings.md) | `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations |
| AuditLog-008 | [AuditLog](AuditLog/findings.md) | Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain |
| AuditLog-009 | [AuditLog](AuditLog/findings.md) | `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't |
| AuditLog-010 | [AuditLog](AuditLog/findings.md) | Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream |
| AuditLog-011 | [AuditLog](AuditLog/findings.md) | `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call |
| CLI-020 | [CLI](CLI/findings.md) | `bundle export` success-envelope parse is unguarded |
| CLI-021 | [CLI](CLI/findings.md) | `CliConfig.Load` crashes the CLI on a malformed config file |
| CLI-022 | [CLI](CLI/findings.md) | `CommandTreeTests` excludes the two new command groups |
| CLI-023 | [CLI](CLI/findings.md) | `Component-CLI.md` claims audit commands ride `POST /management`; implementation uses REST endpoints |
| CentralUI-029 | [CentralUI](CentralUI/findings.md) | `ConfigurationAuditLog` uses `JS.InvokeAsync<int>("eval", ...)` instead of a dedicated JS module |
| CentralUI-030 | [CentralUI](CentralUI/findings.md) | `SandboxConsoleCapture`'s per-call `StringWriter` is not thread-safe under intra-script concurrency |
| CentralUI-031 | [CentralUI](CentralUI/findings.md) | `TransportImport` buffers the full bundle bytes in component state |
| CentralUI-032 | [CentralUI](CentralUI/findings.md) | `AuditResultsGrid` paging is forward-only, no Previous button |
| CentralUI-033 | [CentralUI](CentralUI/findings.md) | Drill-in / query-string code paths for the new Transport + SiteCalls pages are untested |
| ClusterInfrastructure-011 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `SectionName` constant is decorative — no binding site references it |
| ClusterInfrastructure-012 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Validator accepts `SeedNodes.Count == 1` despite design requiring both nodes as seeds |
| ClusterInfrastructure-013 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Test uses catastrophic config values without an inline-intent comment |
| ClusterInfrastructure-014 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `AddClusterInfrastructureActors` is dead surface — no caller, no behaviour |
| Commons-016 | [Commons](Commons/findings.md) | `BundleSession.Locked` uses a magic `3` rather than a named constant |
| Commons-018 | [Commons](Commons/findings.md) | `IOperationTrackingStore` and `IPartitionMaintenance` are at the root of `Interfaces/` instead of `Interfaces/Services/` |
| Commons-020 | [Commons](Commons/findings.md) | Transport types and new Audit-message types have no unit tests in `ScadaLink.Commons.Tests` |
| Commons-021 | [Commons](Commons/findings.md) | `ExternalCallResult.Response` has a benign lazy-parse race |
| Commons-022 | [Commons](Commons/findings.md) | `IAuditCorrelationContext` references an unresolvable `BundleImporter.ApplyAsync` cref; JSON-blob columns have no documented shape |
| Commons-023 | [Commons](Commons/findings.md) | Trailing-optional `SourceNode` on positional records mixes additive evolution patterns |
| Communication-018 | [Communication](Communication/findings.md) | Site heartbeats hard-code `IsActive: true` regardless of node role |
| Communication-019 | [Communication](Communication/findings.md) | `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository |
| Communication-020 | [Communication](Communication/findings.md) | `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types |
| Communication-021 | [Communication](Communication/findings.md) | `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try |
| Communication-022 | [Communication](Communication/findings.md) | `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber |
| ConfigurationDatabase-020 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `GetPartitionBoundariesOlderThanAsync` returns `DateTime` with `Kind=Unspecified` |
| ConfigurationDatabase-021 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `SwitchOutPartitionAsync` interpolates `monthBoundary` / staging table name into raw SQL |
| ConfigurationDatabase-022 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Stale "WP-24 Stub level sufficient for diff/staleness support" XML comment on `DeploymentManagerRepository` |
| ConfigurationDatabase-023 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `AuditLog` correlation-index name drifts from design doc (`IX_AuditLog_CorrelationId` vs `IX_AuditLog_Correlation`) |
| ConfigurationDatabase-024 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Missing test coverage for SPLIT-RANGE failure-continuation and production-shape rowversion delete |
| DeploymentManager-020 | [DeploymentManager](DeploymentManager/findings.md) | `DeployReconciled` audit attributes the action to the prior deployer, not the current user |
| DeploymentManager-021 | [DeploymentManager](DeploymentManager/findings.md) | `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing |
| DeploymentManager-022 | [DeploymentManager](DeploymentManager/findings.md) | `Pending` and `InProgress` are written back-to-back with no intervening work |
| DeploymentManager-023 | [DeploymentManager](DeploymentManager/findings.md) | `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site |
| DeploymentManager-024 | [DeploymentManager](DeploymentManager/findings.md) | Test probe actors hold mutable static state across tests |
| ExternalSystemGateway-021 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ApplyAuth` silently sends an unauthenticated request on unknown `AuthType`, empty `AuthConfiguration`, or malformed Basic config |
| ExternalSystemGateway-022 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `new HttpMethod(method.HttpMethod)` accepts any string at runtime; an invalid HTTP verb fails only at call time |
| ExternalSystemGateway-023 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | PATCH HTTP method is supported by code but absent from the design doc; body-vs-query decision drifts from the documented set |
| HealthMonitoring-018 | [HealthMonitoring](HealthMonitoring/findings.md) | Same counter-reset-before-publish hazard in `CentralHealthReportLoop` |
| HealthMonitoring-020 | [HealthMonitoring](HealthMonitoring/findings.md) | `MarkHeartbeat` brings offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt` |
| HealthMonitoring-021 | [HealthMonitoring](HealthMonitoring/findings.md) | `CentralSiteId = "central"` reserved constant silently collides with a real site named "central" |
| HealthMonitoring-022 | [HealthMonitoring](HealthMonitoring/findings.md) | `CentralHealthReportLoopTests` uses real-time `PeriodicTimer` + `Task.Delay`; flake-prone on slow CI |
| HealthMonitoring-023 | [HealthMonitoring](HealthMonitoring/findings.md) | `StoreAndForwardBufferDepths_IsEmptyPlaceholder` test name is stale; it now covers the default-state contract, not a placeholder |
| Host-018 | [Host](Host/findings.md) | Shipped per-role configs omit `NodeOptions.NodeName`, leaving `SourceNode` null |
| Host-019 | [Host](Host/findings.md) | Migration `StartupRetry` call drops the host `CancellationToken` |
| Host-020 | [Host](Host/findings.md) | `MinimumLevel.Is` silently overrides any operator-set `Serilog:MinimumLevel` |
| Host-021 | [Host](Host/findings.md) | Microsoft `Logging:LogLevel` section in `appsettings.json` is dead config under Serilog |
| Host-022 | [Host](Host/findings.md) | `ParseLevel` silently coerces unrecognised `MinimumLevel` to `Information` |
| InboundAPI-019 | [InboundAPI](InboundAPI/findings.md) | `EnableBuffering()` called unconditionally on every request, including bodyless requests |
| InboundAPI-020 | [InboundAPI](InboundAPI/findings.md) | `ContentType.Contains("json")` is case-sensitive; `application/JSON` with no Content-Length skips body parsing |
| InboundAPI-023 | [InboundAPI](InboundAPI/findings.md) | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test coverage |
| InboundAPI-024 | [InboundAPI](InboundAPI/findings.md) | `_knownBadMethods` is unbounded — an attacker can grow the cache by spamming distinct method names against the audit middleware path |
| ManagementService-022 | [ManagementService](ManagementService/findings.md) | Design doc is stale on Transport bundle commands, /api/audit/* endpoints, and CommandTimeout |
| ManagementService-023 | [ManagementService](ManagementService/findings.md) | HandleQueryDeployments unfiltered branch is N+1 on instance lookup |
| NotificationOutbox-006 | [NotificationOutbox](NotificationOutbox/findings.md) | `ResolveAdapters` rebuilds the `NotificationType → adapter` dictionary on every dispatch sweep |
| NotificationOutbox-008 | [NotificationOutbox](NotificationOutbox/findings.md) | `FallbackMaxRetries` / `FallbackRetryDelay` path is unreachable in production AND untested |
| NotificationOutbox-009 | [NotificationOutbox](NotificationOutbox/findings.md) | `StuckAgeThreshold` XML-doc says "in-progress notification is re-claimed" — contradicts the design's display-only stuck detection |
| NotificationService-022 | [NotificationService](NotificationService/findings.md) | `MailKitSmtpClientWrapper` holds a long-lived `SmtpClient`; combined with per-send factory, the design comment about pooling is contradicted |
| NotificationService-023 | [NotificationService](NotificationService/findings.md) | XML docs on the orphaned classes still describe the removed site-delivery flow; misleading to maintainers |
| NotificationService-025 | [NotificationService](NotificationService/findings.md) | `CredentialRedactor` over-masks: any 4-character credential component is masked anywhere it appears, including unrelated log text |
| Security-018 | [Security](Security/findings.md) | Role names are hard-coded magic strings duplicated across `RoleMapper`, `SiteScopeAuthorizationHandler`, and `AuthorizationPolicies` |
| Security-019 | [Security](Security/findings.md) | Service-account rebind failure is reported as "Invalid username or password" — masks misconfiguration as a user-credential error |
| Security-020 | [Security](Security/findings.md) | `SecurityOptions` has no startup validation for required fields (`LdapServer`, `LdapSearchBase`) |
| Security-021 | [Security](Security/findings.md) | `RequireHttpsCookie=false` dev opt-out has no warning path — an HTTP production deployment silently transmits the JWT bearer credential in cleartext |
| SiteCallAudit-002 | [SiteCallAudit](SiteCallAudit/findings.md) | Singleton failover does not wait for in-flight async upserts |
| SiteCallAudit-004 | [SiteCallAudit](SiteCallAudit/findings.md) | Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift |
| SiteCallAudit-005 | [SiteCallAudit](SiteCallAudit/findings.md) | `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing |
| SiteCallAudit-006 | [SiteCallAudit](SiteCallAudit/findings.md) | Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor |
| SiteEventLogging-018 | [SiteEventLogging](SiteEventLogging/findings.md) | `FailedWriteCount` is exposed but never consumed by Health Monitoring |
| SiteEventLogging-019 | [SiteEventLogging](SiteEventLogging/findings.md) | `EventLogPurgeService` runs on every host node; design says "active node" |
| SiteEventLogging-020 | [SiteEventLogging](SiteEventLogging/findings.md) | `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced |
| SiteEventLogging-021 | [SiteEventLogging](SiteEventLogging/findings.md) | `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales |
| SiteEventLogging-022 | [SiteEventLogging](SiteEventLogging/findings.md) | `Cache=Shared` is redundant for a single-connection logger |
| SiteEventLogging-023 | [SiteEventLogging](SiteEventLogging/findings.md) | Concurrent-stress test uses a non-volatile `stop` flag |
| SiteRuntime-023 | [SiteRuntime](SiteRuntime/findings.md) | `Convert.ToDouble(value)` in trigger and alarm evaluation is locale-sensitive |
| SiteRuntime-025 | [SiteRuntime](SiteRuntime/findings.md) | `HandleSetStaticAttribute` persists unknown attribute names as static overrides |
| SiteRuntime-026 | [SiteRuntime](SiteRuntime/findings.md) | `ReplicationMessages.cs` public record types have no XML documentation |
| StoreAndForward-022 | [StoreAndForward](StoreAndForward/findings.md) | `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId` |
| StoreAndForward-023 | [StoreAndForward](StoreAndForward/findings.md) | `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation |
| StoreAndForward-024 | [StoreAndForward](StoreAndForward/findings.md) | `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown |
| TemplateEngine-022 | [TemplateEngine](TemplateEngine/findings.md) | `LockEnforcer.ValidateLockChange` enforces "once-locked-stays-locked" for `IsLocked` but not for `LockedInDerived` |
| Transport-008 | [Transport](Transport/findings.md) | `PreviewAsync` issues an N+1 `GetTemplateWithChildrenAsync` per matching template name |
| Transport-009 | [Transport](Transport/findings.md) | `IAuditCorrelationContext.BundleImportId` is mutated on the same scoped instance the AuditService reads |
| Transport-011 | [Transport](Transport/findings.md) | Design doc's Step-1 manifest preview promises decryption-free preview, but `LoadAsync` reads and validates content before passphrase |
| Transport-012 | [Transport](Transport/findings.md) | "Bundle Import" filter promised in design doc not surfaced in Configuration Audit Log Viewer UI |

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Security` |
| Design doc | `docs/requirements/Component-Security.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 (1 deferred — Security-008) |
| Commit reviewed | `1eb6e97` |
| Open findings | 6 (1 deferred — Security-008) |

## Summary

@@ -48,6 +48,36 @@ omits the separate idle check (Security-014). The two Low findings concern fragi
DN parsing of group names containing escaped commas and an un-trimmed username flowing
into the LDAP filter, fallback DN, and JWT claims.

#### Re-review 2026-05-28 (commit `1eb6e97`)

Re-reviewed the module on a fresh baseline. All Security-001..007, 009..015 fixes remain
in place; the only Open carry-over is Security-008 (still correctly **Deferred** —
`ISecurityRepository` still exposes no per-set scope-rule query, so the N+1 in
`RoleMapper` cannot be removed from within this module). The original
Security-014 fix is now load-bearing: `RefreshToken` calls `IsIdleTimedOut` before
re-issuing, and the new cookie sliding-expiry tests in `SecurityReviewRegressionTests`
pin CentralUI-005's Security-side contract. This pass surfaced **6 new findings**
(Security-016..021): one Medium correctness/security defect, one Medium design-adherence
defect, and four Low. The most consequential is **Security-016** — when a user is
mapped to *both* a system-wide Deployment LDAP group (e.g. `SCADA-Deploy-All`) and a
site-scoped Deployment LDAP group (e.g. `SCADA-Deploy-SiteA`), `RoleMapper` silently
treats the union as site-scoped (the system-wide grant is dropped); the design's
"multiple groups grant multiple independent roles" intent is not honoured for this
mix-and-match case. **Security-017** is the cross-module partner of CentralUI-028:
`SiteScopeRequirement` / `SiteScopeAuthorizationHandler` are declared and registered
but no production caller ever instantiates them — `[Authorize(Policy = RequireDeployment)]`
*does not* enforce the documented site scoping, so callers must remember to inject
`SiteScopeService` and re-check `IsSiteAllowedAsync` themselves (which the two new
report pages flagged by CentralUI-028 forgot to do). The remaining Lows are: role names
are magic strings duplicated across `RoleMapper`, `SiteScopeAuthorizationHandler`, and
`AuthorizationPolicies` (Security-018); a service-account-rebind failure is reported
to the user as "Invalid username or password" — masking a misconfiguration as a
user-credential error (Security-019); required `SecurityOptions` fields
(`LdapServer`, `LdapSearchBase`) have no `IValidateOptions` startup check, so empty
values silently surface only on first login (Security-020); and the
`RequireHttpsCookie=false` dev opt-out emits no warning, so an HTTP production
deployment silently transmits the JWT bearer credential in cleartext (Security-021).

## Checklist coverage

| # | Category | Examined | Notes |
@@ -63,6 +93,21 @@ into the LDAP filter, fallback DN, and JWT claims.
| 9 | Testing coverage | ☑ | No tests for `RoleMapper` N+1 behavior, DN-injection inputs, StartTLS path, or idle-timeout-after-refresh. Insecure-config combinations under-tested (Security-011). |
| 10 | Documentation & comments | ☑ | `SecurityOptions` XML docs say direct bind uses `cn={username}` while the search filter uses `uid=` — comment is misleading (covered under Security-004). |

_Re-review (2026-05-28, `1eb6e97`):_

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `RoleMapper` drops a system-wide Deployment grant when the user is also in any site-scoped Deployment group (Security-016); hard-coded role-name string `"Deployment"` in two separate places allows a refactor to silently break site scoping (Security-018). |
| 2 | Akka.NET conventions | ☑ | No actors. `AddSecurityActors` is still a registration placeholder. No issues. |
| 3 | Concurrency & thread safety | ☑ | Services stateless; LDAP sync calls wrapped in `Task.Run` with the now-bounded timeout (Security-009 resolution holds). No issues found. |
| 4 | Error handling & resilience | ☑ | A service-account-rebind failure inside `AuthenticateAsync` is reported as "Invalid username or password", masking a misconfiguration as a user-credential error (Security-019). LDAP-failure rule + partial-outage path remain correctly enforced post-Security-012. |
| 5 | Security | ☑ | `SiteScopeRequirement` / `SiteScopeAuthorizationHandler` are dead code — no policy is registered that uses them and no production caller instantiates them, so declarative `[Authorize]` does not enforce site scoping (Security-017, cross-module partner of CentralUI-028). `RequireHttpsCookie=false` dev opt-out has no warning path — a production misconfiguration silently transmits the JWT bearer credential over HTTP (Security-021). |
| 6 | Performance & resource management | ☑ | Security-008 N+1 remains correctly Deferred (still gated on `ISecurityRepository`). No new perf issues. |
| 7 | Design-document adherence | ☑ | `RoleMapper`'s drop-system-wide-on-any-scoped behaviour (Security-016) contradicts the design's "A user can hold multiple roles simultaneously … roles are independent — there is no implied hierarchy" rule for the union case; `SiteScopeRequirement` advertises a site-scope authorization pattern the implementation does not actually wire up (Security-017). |
| 8 | Code organization & conventions | ☑ | Role-name strings are duplicated as magic literals across `RoleMapper.cs`, `SiteScopeAuthorizationHandler.cs`, and `AuthorizationPolicies.cs` — only the audit roles have a single source of truth via `OperationalAuditRoles` / `AuditExportRoles` (Security-018). `SecurityOptions` defaults pass through to runtime with no `IValidateOptions` for required fields like `LdapServer` / `LdapSearchBase` (Security-020). |
| 9 | Testing coverage | ☑ | No test covers a user mapped to both a system-wide AND a site-scoped Deployment LDAP group (the Security-016 case). No test covers the `SiteScopeRequirement` cross-page integration — tests evaluate the handler in isolation, not the absence of a policy that uses it (Security-017). |
| 10 | Documentation & comments | ☑ | `SiteScopeAuthorizationHandler` XML doc describes a permission model no caller actually invokes (Security-017). Otherwise stable. |

## Findings

### Security-001 — StartTLS upgrade path is unreachable dead code
@@ -654,3 +699,226 @@ use the single canonical identity. Regression tests
`NormalizeUsername_TrimsLeadingAndTrailingWhitespace`,
`BuildFallbackUserDn_TrimmedUsername_NoLeadingTrailingSpace`,
`AuthenticateAsync_UsernameWithSurroundingWhitespace_StillRejectedForInsecure`.

### Security-016 — `RoleMapper` silently drops the system-wide Deployment grant when a user is also in any site-scoped Deployment group

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Security/RoleMapper.cs:30-31`, `:41-55`, `:59` |

**Description**

`MapGroupsToRolesAsync` resolves the Deployment role's site scope as a single
`isSystemWide = hasDeploymentRole && !hasDeploymentWithScopeRules` flag computed across
ALL matched Deployment mappings. If a user is a member of both a system-wide Deployment
group (e.g. `SCADA-Deploy-All`, no scope rules) AND a site-scoped Deployment group
(e.g. `SCADA-Deploy-SiteA`, one scope rule for Site A), the second mapping sets
`hasDeploymentWithScopeRules = true`, so the final `isSystemWide` becomes `false` and
the returned `PermittedSiteIds` is just `[SiteA]`. The system-wide grant from
`SCADA-Deploy-All` is silently dropped — the user loses access to every other site, even
though one of their LDAP groups was intended to grant them system-wide reach. This
contradicts the design's "A user can hold multiple roles simultaneously … roles are
independent — there is no implied hierarchy" intent: the union of grants should be the
broadest grant in the set, not the narrowest. The mistake is also non-obvious to an
operator: from the Admin → LDAP Mappings page nothing flags that adding a site-scoped
Deployment mapping for a user already in `SCADA-Deploy-All` *removes* sites from their
effective grant. The downstream `SiteScopeService.IsSystemWideAsync` / `FilterSitesAsync`
faithfully reproduce this narrowing, so the user can no longer see or act on sites
outside `[SiteA]`.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Track the union semantics explicitly: if any matched Deployment mapping has no scope
rules, the user is system-wide regardless of what the other mappings contain. The
simplest change is to add a second flag, `hasUnscopedDeploymentMapping`, set whenever a
matched Deployment mapping has no scope rules, and then compute
`isSystemWide = hasUnscopedDeploymentMapping || (hasDeploymentRole && !hasDeploymentWithScopeRules)`.
Equivalently: collect per-mapping `(hasRules, scopedSiteIds)` first, then
`isSystemWide = any mapping has hasRules==false`, and `permittedSiteIds = union of all
scopedSiteIds` (left empty for system-wide users). Add a regression test
`MapGroupsToRoles_UserInBothSystemWideAndScopedDeploymentGroup_IsSystemWide` covering
the design's example pair `SCADA-Deploy-All` + `SCADA-Deploy-SiteA`.
|
||||
|
||||
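The union rule recommended for Security-016 can be sketched as a small pure function. This is a minimal illustration, not the actual `RoleMapper` code: `DeploymentMapping`, `DeploymentScope`, and `DeploymentScopeResolver` are hypothetical names, site ids are assumed to be `int`, and an empty site list is assumed to mean "no scope rules".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one matched Deployment mapping: the LDAP group
// name plus the site ids taken from its scope rules (empty = no scope rules).
public sealed record DeploymentMapping(string LdapGroup, IReadOnlyList<int> ScopedSiteIds)
{
    public bool HasScopeRules => ScopedSiteIds.Count > 0;
}

// Hypothetical result shape: PermittedSiteIds is left empty for system-wide users.
public sealed record DeploymentScope(bool IsSystemWide, IReadOnlyList<int> PermittedSiteIds);

public static class DeploymentScopeResolver
{
    // Union semantics: any unscoped mapping makes the user system-wide;
    // otherwise the grant is the union of every mapping's scoped sites.
    public static DeploymentScope Resolve(IReadOnlyCollection<DeploymentMapping> matched)
    {
        // No matched Deployment mapping means no Deployment grant at all.
        if (matched.Count == 0)
            return new DeploymentScope(false, Array.Empty<int>());

        // The broadest grant wins, independent of mapping order.
        if (matched.Any(m => !m.HasScopeRules))
            return new DeploymentScope(true, Array.Empty<int>());

        var sites = matched
            .SelectMany(m => m.ScopedSiteIds)
            .Distinct()
            .OrderBy(id => id)
            .ToArray();
        return new DeploymentScope(false, sites);
    }
}
```

Because the result does not depend on the order in which mappings are visited, the `SCADA-Deploy-All` + `SCADA-Deploy-SiteA` pair resolves to system-wide whichever group is matched first.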
### Security-017 — `SiteScopeRequirement` / `SiteScopeAuthorizationHandler` are dead code from production callers — `[Authorize(Policy = RequireDeployment)]` does NOT enforce site scoping
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Security/SiteScopeAuthorizationHandler.cs:8-58`; `src/ScadaLink.Security/AuthorizationPolicies.cs:113-143` |
|
||||
|
||||
**Description**
|
||||
|
||||
The module declares `SiteScopeRequirement` (an `IAuthorizationRequirement` carrying a
|
||||
`TargetSiteId`) and the matching `SiteScopeAuthorizationHandler` that combines the
|
||||
Deployment role claim with the `SiteId` claims to enforce the design's site-scoping
|
||||
rule. The handler is registered in `AddScadaLinkAuthorization`
|
||||
(`services.AddSingleton<IAuthorizationHandler, SiteScopeAuthorizationHandler>()`). But
|
||||
no `AddPolicy` call ever wires the requirement to a named policy, and a grep across
|
||||
`src/ScadaLink.CentralUI` and `src/ScadaLink.ManagementService` confirms that **no
|
||||
production code ever instantiates `new SiteScopeRequirement(...)` or calls
|
||||
`AuthorizeAsync(...)` with one** — the only callers are the unit tests in
|
||||
`SecurityTests.cs:1146,1166,1185,1203`. The design + CLAUDE.md state that "Deployment
|
||||
and Monitoring pages must filter every site/instance list through `FilterSitesAsync`
|
||||
and re-check `IsSiteAllowedAsync` before any cross-site command", and the
|
||||
CentralUI-028 finding (High, Open) confirms this is exactly the contract two new
|
||||
report pages forgot — because there is no declarative `[Authorize(Policy = ...)]`
|
||||
shortcut, callers must remember to inject `SiteScopeService` and write the check by
|
||||
hand, and any new page that forgets is a silent regression with no compile-time or
|
||||
test-time signal. The module's published surface advertises an authorization-handler
|
||||
pattern that is, in practice, unwired plumbing.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) **delete** `SiteScopeRequirement` and `SiteScopeAuthorizationHandler` (and
|
||||
the dead `IAuthorizationHandler` registration) and document `SiteScopeService` as the
|
||||
sole site-scoping mechanism — this is the smaller change and matches what the codebase
|
||||
actually does today; or, preferably, (b) **finish the wiring**: add a `RequireSiteScope`
|
||||
policy that uses `SiteScopeRequirement` and provide a small helper / source generator
|
||||
or analyzer that flags Deployment-policy-attributed pages without a site-scope check.
|
||||
Either way, address the cross-module gap: CentralUI-028 stays open until production
|
||||
pages reliably enforce the rule. If (b) is chosen, a route-parameter-aware
|
||||
`IAuthorizationPolicyProvider` is needed so the policy can read the target site id from
|
||||
the request — that is a meaningful design extension and would need to be planned
|
||||
alongside the Central UI's existing `SiteScopeService` usage rather than replacing it
|
||||
piecemeal.
|
||||
|
||||
### Security-018 — Role names are hard-coded magic strings duplicated across `RoleMapper`, `SiteScopeAuthorizationHandler`, and `AuthorizationPolicies`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Security/RoleMapper.cs:41`; `src/ScadaLink.Security/SiteScopeAuthorizationHandler.cs:36`; `src/ScadaLink.Security/AuthorizationPolicies.cs:118,121,124,95,107` |
|
||||
|
||||
**Description**
|
||||
|
||||
The role-name literals `"Admin"`, `"Design"`, `"Deployment"`, `"Audit"`, and
|
||||
`"AuditReadOnly"` are duplicated as magic strings across three separate files:
|
||||
`RoleMapper.cs:41` hard-codes `"Deployment"` to detect the site-scope branch;
|
||||
`SiteScopeAuthorizationHandler.cs:36` independently hard-codes `"Deployment"` to gate
|
||||
the handler; and `AuthorizationPolicies.cs:118,121,124` hard-code the four role names
|
||||
as the policy `RequireClaim` values. Only the audit roles have a single source of truth
|
||||
(via the `OperationalAuditRoles` / `AuditExportRoles` arrays on
|
||||
`AuthorizationPolicies`). A future rename or addition of a role that misses any one of
|
||||
these call sites silently breaks the system: e.g. renaming "Deployment" → "Deployer"
|
||||
in `RoleMapper` alone would leave the policy still requiring `"Deployment"` (logins
|
||||
get the new role name but the policy never matches), while changing it in the policy
|
||||
alone would leave `RoleMapper` failing to populate scope rules for the renamed role.
|
||||
The bug class is "string drift" — exactly the kind the `OperationalAuditRoles` constant
|
||||
was introduced to prevent.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Introduce a `public static class Roles { public const string Admin = "Admin"; public const
|
||||
string Design = "Design"; public const string Deployment = "Deployment"; public const string
|
||||
Audit = "Audit"; public const string AuditReadOnly = "AuditReadOnly"; }` in the Security
|
||||
project and replace every magic-string occurrence — including the elements of
|
||||
`OperationalAuditRoles` and `AuditExportRoles` — with the constants. A single rename
|
||||
will then either succeed everywhere or fail to compile.
|
||||
|
||||
### Security-019 — Service-account rebind failure is reported as "Invalid username or password" — masks misconfiguration as a user-credential error
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Security/LdapAuthService.cs:85-89`, `:147-151` |
|
||||
|
||||
**Description**
|
||||
|
||||
After the user's credentials bind successfully, `AuthenticateAsync` re-binds as the
|
||||
configured service account to perform the group/attribute search
|
||||
(`connection.Bind(_options.LdapServiceAccountDn, _options.LdapServiceAccountPassword)`).
|
||||
A failure of this second bind — wrong service-account password, deleted/disabled
|
||||
service-account, locked-out service-account — throws `LdapException` which is caught by
|
||||
the broad outer `catch (LdapException)` and returned as
|
||||
`new LdapAuthResult(false, null, username, null, "Invalid username or password.")`.
|
||||
The user sees an "invalid credentials" message for *their* credentials even though
|
||||
their bind succeeded and the failure was in the system's own service-account
|
||||
configuration. Worse, every user attempting to log in sees the same incorrect message
|
||||
during a service-account outage, which routes operators down the wrong incident path
|
||||
(reset the user's password) instead of the right one (check the service-account
|
||||
credentials). The successful user bind itself is also not auditable as a discrete
|
||||
event because the result is "Invalid username or password" — indistinguishable from a
|
||||
genuine bad-password attempt.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Wrap the service-account rebind in its own `try`/`catch (LdapException)` and surface a
|
||||
distinct error: log `_logger.LogError(ex, "Service-account rebind failed; check
|
||||
LdapServiceAccountDn / LdapServiceAccountPassword configuration")` and return
|
||||
`new LdapAuthResult(false, null, username, null, "Authentication service is misconfigured. Contact an administrator.")`.
|
||||
Add a regression test that exercises the service-account-bind failure path (a mocked
|
||||
or seamed `LdapConnection.Bind` that throws on the second call) and asserts the
|
||||
distinct error message.
|
||||
|
||||
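The split error handling recommended for Security-019 might look roughly like the sketch below. It is written against a hypothetical seam, not the real `LdapAuthService`: the `bind` delegate stands in for `LdapConnection.Bind`, `LdapBindException` stands in for the library's `LdapException`, and `AuthOutcome` is a reduced stand-in for `LdapAuthResult`.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the LDAP library's exception type.
public sealed class LdapBindException : Exception
{
    public LdapBindException(string message) : base(message) { }
}

// Reduced stand-in for LdapAuthResult: only the fields this sketch needs.
public sealed record AuthOutcome(bool Success, string? ErrorMessage);

public static class TwoStageBind
{
    public const string BadCredentials = "Invalid username or password.";
    public const string Misconfigured =
        "Authentication service is misconfigured. Contact an administrator.";

    // 'bind' is a seam over the connection's Bind(dn, password) call so the
    // second (service-account) bind can be made to fail in a test.
    public static AuthOutcome Authenticate(
        Action<string, string> bind,
        string userDn, string userPassword,
        string serviceDn, string servicePassword,
        Action<Exception> logError)
    {
        try
        {
            bind(userDn, userPassword);
        }
        catch (LdapBindException)
        {
            // Genuine user-credential failure.
            return new AuthOutcome(false, BadCredentials);
        }

        try
        {
            bind(serviceDn, servicePassword);
        }
        catch (LdapBindException ex)
        {
            // The user's bind succeeded; this failure is a system-side problem.
            logError(ex);
            return new AuthOutcome(false, Misconfigured);
        }

        return new AuthOutcome(true, null);
    }
}
```

Keeping the two binds in separate `try` blocks is what lets the caller tell a bad password apart from a broken service account; the group search that follows the rebind is omitted here.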
### Security-020 — `SecurityOptions` has no startup validation for required fields (`LdapServer`, `LdapSearchBase`)
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Security/SecurityOptions.cs:6-7`, `:36-37`; `src/ScadaLink.Security/ServiceCollectionExtensions.cs:13-30` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SecurityOptions.JwtSigningKey` correctly fails fast at `JwtTokenService` construction
|
||||
(Security-003 fix), but the LDAP-side required fields — `LdapServer` (default
|
||||
`string.Empty`) and `LdapSearchBase` (default `string.Empty`) — have no equivalent
|
||||
guard. `AddSecurity` does not register an `IValidateOptions<SecurityOptions>`. A
|
||||
deployment that fails to set `LdapServer` (a typo in the appsettings.json section name,
|
||||
a missing environment-variable substitution, a misconfigured Docker compose file)
|
||||
starts cleanly — the Central UI comes up, the login page loads, and only the first
|
||||
authentication attempt fails with `LdapConnection.Connect("")` throwing a low-level
|
||||
exception that bubbles up as the generic "An unexpected error occurred during
|
||||
authentication." message. The misconfiguration surfaces minutes or hours into the
|
||||
deploy, on the first real user login, rather than at startup where it is cheap to
|
||||
diagnose.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add an `IValidateOptions<SecurityOptions>` implementation, register it in `AddSecurity`,
and enable `services.AddOptions<SecurityOptions>().ValidateOnStart()` so that startup
fails when `LdapServer` is null/whitespace, `LdapSearchBase` is null/whitespace, or
`LdapPort <= 0`. Combine with the existing `JwtTokenService` constructor check so
every required `SecurityOptions` field is enforced at startup, not at first use.
|
||||
|
||||
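A minimal sketch of the startup validation recommended for Security-020, assuming the `Microsoft.Extensions.Options` package is referenced. The `SecurityOptions` class below is a reduced stand-in holding only the three fields named in the finding, and `SecurityOptionsValidator` / `AddSecurityOptionsValidation` are hypothetical names.

```csharp
#nullable enable
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Options;

// Reduced stand-in for the real SecurityOptions (assumption: only these fields).
public sealed class SecurityOptions
{
    public string LdapServer { get; set; } = string.Empty;
    public string LdapSearchBase { get; set; } = string.Empty;
    public int LdapPort { get; set; }
}

public sealed class SecurityOptionsValidator : IValidateOptions<SecurityOptions>
{
    public ValidateOptionsResult Validate(string? name, SecurityOptions options)
    {
        // Collect every problem so a single startup failure reports them all.
        var failures = new List<string>();
        if (string.IsNullOrWhiteSpace(options.LdapServer))
            failures.Add("SecurityOptions.LdapServer is required.");
        if (string.IsNullOrWhiteSpace(options.LdapSearchBase))
            failures.Add("SecurityOptions.LdapSearchBase is required.");
        if (options.LdapPort <= 0)
            failures.Add("SecurityOptions.LdapPort must be greater than zero.");

        return failures.Count > 0
            ? ValidateOptionsResult.Fail(failures)
            : ValidateOptionsResult.Success;
    }
}

public static class SecurityOptionsRegistration
{
    // Registers the validator and asks the host to run it at startup
    // instead of on first resolution of IOptions<SecurityOptions>.
    public static IServiceCollection AddSecurityOptionsValidation(this IServiceCollection services)
    {
        services.AddSingleton<IValidateOptions<SecurityOptions>, SecurityOptionsValidator>();
        services.AddOptions<SecurityOptions>().ValidateOnStart();
        return services;
    }
}
```

With this in place, a missing `LdapServer` would be expected to surface as an `OptionsValidationException` when the host starts rather than on the first login attempt.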
### Security-021 — `RequireHttpsCookie=false` dev opt-out has no warning path — an HTTP production deployment silently transmits the JWT bearer credential in cleartext
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Security |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Security/SecurityOptions.cs:100-108`; `src/ScadaLink.Security/ServiceCollectionExtensions.cs:54-59` |
|
||||
|
||||
**Description**
|
||||
|
||||
The Security-002 fix added `RequireHttpsCookie` (default `true`) so the auth cookie's
|
||||
`SecurePolicy` is `Always` in production. The current Docker dev cluster sets
|
||||
`RequireHttpsCookie=false` in both central nodes' `appsettings.Central.json`, downgrading
|
||||
to `SameAsRequest` so the local HTTP cluster works. The downgrade is documented in the
|
||||
XML doc but is silent at runtime: no log line warns that the cookie carrying the JWT
|
||||
bearer credential is being sent over an HTTP-only path. A production deployment that
|
||||
inherits a dev-derived appsettings — or that copy-pastes the docker config and forgets
|
||||
to flip the flag — transmits the session token in cleartext with no diagnostic signal.
|
||||
The default is correct; the gap is that the unsafe override has no operational guard.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
In the `PostConfigure` block in `AddSecurity`, when `RequireHttpsCookie == false`, log
|
||||
a single startup warning along the lines of `_logger.LogWarning("RequireHttpsCookie is
|
||||
DISABLED — auth cookie SecurePolicy is SameAsRequest. The cookie-embedded JWT will be
|
||||
transmitted over plain HTTP. This setting is intended for local dev only — set
|
||||
SecurityOptions:RequireHttpsCookie=true in production.")`. Optionally, also fail
|
||||
startup when `RequireHttpsCookie=false` AND `ASPNETCORE_ENVIRONMENT=Production`. Add a
|
||||
regression test that asserts the warning is emitted when the flag is disabled and not
|
||||
when it is enabled.
|
||||
|
||||
@@ -0,0 +1,322 @@
|
||||
# Code Review — SiteCallAudit
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| Module | `src/ScadaLink.SiteCallAudit` |
|
||||
| Design doc | `docs/requirements/Component-SiteCallAudit.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 6 |
|
||||
|
||||
## Summary
|
||||
|
||||
The module is small (one actor + DI extension + options class). The actor is a
|
||||
central cluster singleton that exposes three responsibility groups: direct
|
||||
`UpsertSiteCallCommand` ingest, paginated/KPI read handlers, and the central→site
|
||||
Retry/Discard relay. Ingest idempotency is delegated to the repository's
|
||||
monotonic-upsert (the CD-015 check-then-act window is mitigated by the
|
||||
duplicate-key swallow on the insert leg). Findings cluster around two themes:
|
||||
(a) the `SupervisorStrategy` override is dead-code that contradicts the XML
|
||||
docstring — it governs children, and this actor has none, so the documented
|
||||
"Resume on leaked exception" promise is unenforced; (b) several smaller drifts
|
||||
between the design doc and the code (reconciliation puller + daily purge
|
||||
schedule are still deferred; `OnUpsertAsync` does not stamp `IngestedAtUtc`
|
||||
unlike the dual-write path). The relay path is well covered by Akka TestKit
|
||||
unit tests; the ingest + KPI paths are covered by MSSQL-backed integration
|
||||
tests using a shared `MsSqlMigrationFixture`.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | Yes | `OnUpsertAsync` does not refresh `IngestedAtUtc` (Finding 003). |
|
||||
| 2 | Akka.NET conventions | Yes | `SupervisorStrategy()` override is dead code (Finding 001). `Sender` correctly captured before first await on every handler. `PipeTo` used for read replies. |
|
||||
| 3 | Concurrency & thread safety | Yes | `_centralCommunication` mutated only on actor thread via `RegisterCentralCommunication`. DI scope-per-message disposed in `try/finally`. No issues found. |
|
||||
| 4 | Error handling & resilience | Yes | Ingest catches all + replies `Accepted=false`. Relay distinguishes `SiteUnreachable` vs `OperationFailed`. Failover handover does not wait for in-flight async work (Finding 002). |
|
||||
| 5 | Security | Yes | All SQL is parameterised at the repository (FromSqlInterpolated). Relay carries no user-controlled strings beyond `SourceSite`. No issues found. |
|
||||
| 6 | Performance & resource management | Yes | DI scope-per-message correctly disposed. `MaxPageSize=200` clamp present. No issues found. |
|
||||
| 7 | Design-document adherence | Yes | Reconciliation puller and daily terminal-purge scheduler still deferred; design doc reads as if they ship (Finding 004). |
|
||||
| 8 | Code organization & conventions | Yes | `RegisterCentralCommunication` is a top-level record colocated with the actor — by design (carries `IActorRef`, cannot live in Commons). No issues found. |
|
||||
| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
|
||||
| 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). |
|
||||
|
||||
## Findings
|
||||
|
||||
### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Akka.NET conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:32-46`, `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:147-151` |
|
||||
|
||||
**Description**
|
||||
|
||||
The XML remarks block (lines 32-46) states:
|
||||
|
||||
> "The `SupervisorStrategy` uses `Resume` so an unexpected throw before the catch (defence in depth) does not restart the actor and reset in-flight state."
|
||||
|
||||
The override at lines 147-151 returns a `OneForOneStrategy` with `DefaultDecider`
|
||||
and `maxNrOfRetries: 0`. Two problems compound:
|
||||
|
||||
1. `ActorBase.SupervisorStrategy()` governs the actor's **children**, not the
|
||||
actor itself. `SiteCallAuditActor` creates no children, so this override is
|
||||
dead code.
|
||||
2. The returned strategy uses `DefaultDecider` (Restart for most exceptions),
|
||||
**not** `Directive.Resume`. So even if the actor did have children, the
|
||||
strategy would not be Resume — it would be the default Restart-on-most-faults
|
||||
behaviour with `maxNrOfRetries: 0` (which forces a Stop after the first
|
||||
failure).
|
||||
|
||||
Net effect: the actor's own self-supervision is whatever the parent supplies
|
||||
(`SupervisorStrategy.DefaultDecider` from the singleton manager / user
|
||||
guardian in tests), which Restarts on most exceptions. If the `try/catch` in
|
||||
`OnUpsertAsync` ever leaked (e.g. a synchronous throw constructing `replyTo`),
|
||||
the actor would Restart, reset `_centralCommunication` to null, and silently
|
||||
break the relay until `RegisterCentralCommunication` runs again.
|
||||
|
||||
This same pattern (with the same misleading XML doc) exists in
|
||||
`AuditLogIngestActor`, `AuditLogPurgeActor`, and `SiteAuditReconciliationActor`
|
||||
— they were likely cargo-culted; this finding documents the local instance.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either:
|
||||
|
||||
- Remove the `SupervisorStrategy()` override entirely (it does nothing useful)
|
||||
and revise the XML comment to drop the "Resume" claim. Self-supervision is
|
||||
the parent's concern (the cluster singleton manager); the `try/catch` in
|
||||
`OnUpsertAsync` is what actually keeps the actor alive.
|
||||
- Or, if Resume-on-self-throw is actually desired, that requires wiring a
|
||||
custom supervisor in the parent (`ClusterSingletonManager`) — not overriding
|
||||
`SupervisorStrategy()` here. Simpler path: keep the `try/catch`, drop the
|
||||
override.
|
||||
|
||||
The CLAUDE.md "Resume for coordinator actors" decision applies to actors with
|
||||
children (Site Runtime hierarchy) — not to leaf cluster singletons.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### SiteCallAudit-002 — Singleton failover does not wait for in-flight async upserts
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:455-462` (singleton wiring), `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:153-193` |
|
||||
|
||||
**Description**
|
||||
|
||||
The singleton is created with `terminationMessage: PoisonPill.Instance`. On
|
||||
failover the active node's singleton stops as soon as the mailbox is drained
|
||||
of normal messages and the PoisonPill is processed. An in-flight
`OnUpsertAsync` Task started before the PoisonPill arrived is allowed to
complete (the message handler runs synchronously from the mailbox's view),
and nothing in the Akka actor model cancels the EF Core
`ExecuteSqlInterpolatedAsync` call it issued.
|
||||
|
||||
Two consequences:
|
||||
|
||||
1. The new singleton on the other node may begin accepting
|
||||
`UpsertSiteCallCommand` for the same `TrackedOperationId` while the old
|
||||
singleton's in-flight upsert is still running. The repository's
|
||||
monotonic-upsert and the SQL duplicate-key swallow protect storage state.
|
||||
2. The original `replyTo` sender may receive its `Accepted=true` after the new
|
||||
singleton has already returned a different reply. Idempotency keys protect
|
||||
correctness; wire-level ordering is best-effort by design.
|
||||
|
||||
This is consistent with the design ("eventually-consistent mirror, sites are
|
||||
source of truth"), but worth documenting as an explicit invariant. The
|
||||
Notification Outbox sibling has the same pattern.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
- Document the failover/handover semantics in the actor's XML remarks: "On
|
||||
cluster singleton handover, in-flight `OnUpsertAsync` tasks complete on the
|
||||
old node and may produce a late `Accepted=true` reply; the repository's
|
||||
monotonic upsert ensures storage state is consistent."
|
||||
- Add an integration test that deliberately races two concurrent upserts on
|
||||
the same `TrackedOperationId` to verify the duplicate-key swallow +
|
||||
monotonic rank check (the CD-015 race-pattern check the parent task
|
||||
flagged).
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### SiteCallAudit-003 — `OnUpsertAsync` does not refresh `IngestedAtUtc`; direct-write callers must remember to stamp it
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:153-193` |
|
||||
|
||||
**Description**
|
||||
|
||||
The combined-telemetry hot path (`AuditLogIngestActor.OnCachedTelemetryAsync`)
|
||||
stamps `IngestedAtUtc = DateTime.UtcNow` on both the `AuditLog` row and the
|
||||
`SiteCall` row at central-side persist time
|
||||
(`src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:238-239`). The design
|
||||
doc treats `IngestedAtUtc` as "central ingested (or last refreshed) this row"
|
||||
— a central-side timestamp.
|
||||
|
||||
`SiteCallAuditActor.OnUpsertAsync` writes the supplied `SiteCall` as-is, with
|
||||
whatever `IngestedAtUtc` the caller stamped. The only current callers are the
|
||||
unit tests (which use `DateTime.UtcNow` at command-construction time). Once
|
||||
the deferred reconciliation puller lands and starts emitting
|
||||
`UpsertSiteCallCommand`s, the puller (running on central) is responsible for
|
||||
stamping a central timestamp — but if a future direct-write caller forgets,
|
||||
or constructs from a site DTO, the value could drift (e.g. become a site
|
||||
clock value).
|
||||
|
||||
This is currently latent because no production caller exists, but it's
|
||||
inconsistent with the dual-write code path and undocumented.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
- Either: stamp `IngestedAtUtc = DateTime.UtcNow` inside `OnUpsertAsync`
|
||||
before calling `UpsertAsync` (matching `AuditLogIngestActor`'s behaviour),
|
||||
using `cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }`.
|
||||
- Or: document in the `UpsertSiteCallCommand` XML that callers MUST stamp
|
||||
`IngestedAtUtc` to a central-side `DateTime.UtcNow` immediately before
|
||||
sending.
|
||||
|
||||
Preferred: stamp inside the actor — same as the combined-telemetry path —
|
||||
because callers cannot in general know the actor is colocated on central.
|
||||
|
||||
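The preferred option for SiteCallAudit-003 (stamping inside the actor) amounts to a one-line `with` expression. The sketch below uses hypothetical reduced record shapes, since the real `SiteCall` and `UpsertSiteCallCommand` carry more fields, and it injects the clock as a delegate purely so the behaviour can be pinned in a test.

```csharp
using System;

// Hypothetical reduced shapes; field names other than IngestedAtUtc are assumptions.
public sealed record SiteCall(Guid TrackedOperationId, string SourceSite, DateTime IngestedAtUtc);
public sealed record UpsertSiteCallCommand(SiteCall SiteCall);

public static class IngestStamp
{
    // Returns a copy of the row carrying a central-side timestamp,
    // ignoring whatever IngestedAtUtc value the caller supplied.
    public static SiteCall StampCentral(UpsertSiteCallCommand cmd, Func<DateTime> utcNow) =>
        cmd.SiteCall with { IngestedAtUtc = utcNow() };
}
```

In the actor, the stamped copy would be what is passed to `UpsertAsync`, with `() => DateTime.UtcNow` as the clock; the original command is left untouched because records are copied, not mutated.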
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### SiteCallAudit-004 — Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:23-30` (actor XML), `src/ScadaLink.SiteCallAudit/ServiceCollectionExtensions.cs:8-13`, `docs/requirements/Component-SiteCallAudit.md:24-32` |
|
||||
|
||||
**Description**
|
||||
|
||||
The design doc (`Component-SiteCallAudit.md` lines 24-32) lists five
|
||||
responsibilities, including:
|
||||
|
||||
- "Run periodic per-site reconciliation pulls so missed telemetry self-heals."
|
||||
- "Purge terminal audit rows after a configurable retention window."
|
||||
|
||||
The repository exposes `PurgeTerminalAsync` but nothing in this module
|
||||
schedules a daily call (Notification Outbox owns a `MaintenanceService` for
|
||||
its equivalent; no `SiteCallAuditMaintenanceService` exists). The
|
||||
reconciliation puller is acknowledged in the actor XML
|
||||
(`only reconciliation remains deferred`) but is not surfaced in the design
|
||||
doc as deferred — the doc reads as if it ships.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
- Either: implement the deferred pieces (a hosted service that wakes daily
|
||||
and calls `repo.PurgeTerminalAsync(now - retentionWindow)`, plus a per-site
|
||||
reconciliation puller with a cursor + an `IPullCachedTelemetryClient`).
|
||||
- Or: add a "Status" / "Deferred" subsection to the design doc explicitly
|
||||
listing what's not yet implemented (matches the pattern Audit Log uses for
|
||||
its tamper-evidence hash chain).
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### SiteCallAudit-005 — `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Documentation & comments |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:548-563` |
|
||||
|
||||
**Description**
|
||||
|
||||
```csharp
return outcome switch
{
    SiteCallRelayOutcome.Applied => null,
    SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site (...)",
    SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage,
    // SiteUnreachable is never produced from a ParkedOperationActionAck —
    // unreachable responses are built by UnreachableRetry/UnreachableDiscard
    // before any ack is classified, so this arm is unreachable by construction.
    SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage,
    _ => throw new ArgumentOutOfRangeException(...)
};
```
|
||||
|
||||
The comment correctly states the `SiteUnreachable` arm is unreachable when
called from `ClassifyAck`. The arm therefore exists only to satisfy
exhaustiveness, but instead of throwing or returning a sentinel, it returns
`ack.ErrorMessage`, which makes it indistinguishable from the `OperationFailed`
arm above. If any future caller *does* feed `SiteUnreachable` into
`AckErrorMessage` (e.g. via refactor), the result will be a silent
wrong-detail-text bug rather than an immediate crash. The default arm
correctly throws `ArgumentOutOfRangeException`, so the `SiteUnreachable` arm
is the inconsistent one.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Replace the `SiteUnreachable => ack.ErrorMessage` arm with:
|
||||
|
||||
```csharp
SiteCallRelayOutcome.SiteUnreachable =>
    throw new InvalidOperationException(
        "AckErrorMessage cannot be called for SiteUnreachable — those responses "
        + "are built by UnreachableRetry/UnreachableDiscard before classification."),
```
|
||||
|
||||
— fail fast if the invariant is ever violated by a refactor.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Location | `tests/ScadaLink.SiteCallAudit.Tests/SiteCallAuditActorTests.cs:335-392` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SiteCallQueryRequest_StuckOnly_PagesAreFull_NoEmptyPagesWithCursor` covers
|
||||
the case where stuck rows are interleaved with non-stuck rows (page-1 returns
|
||||
2 stuck rows, page-2 returns the third). It does not cover the edge where
|
||||
the row at the keyset cursor boundary (`AfterCreatedAtUtc + AfterId`) is
|
||||
itself a non-stuck row — i.e. the cursor points at a row the next page must
|
||||
SKIP through to find more stuck rows. The repository's SQL composes the
|
||||
cursor predicate (`CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND id <
|
||||
...)`) with the stuck predicate, so it should behave correctly, but the test only
|
||||
asserts row counts and `IsStuck`, not that the second-page query specifically
|
||||
skipped non-stuck rows between the cursor and the next stuck row.
|
||||
|
||||
Lower priority because the SQL composition is straightforward, but adding a
|
||||
direct test would lock the invariant.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
|
||||
stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
|
||||
page-size-1 query; (c) asserts each page returns exactly one stuck row, with
|
||||
no overlap and all 3 stuck rows visited.
|
||||
|
||||
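The invariant the recommended test would lock can be illustrated with an in-memory model of keyset paging. This is only a sketch of the predicate composition, not the repository's SQL or the MSSQL-backed test itself: `Row`, `StuckPaging`, the `int` id type, and the newest-first ordering are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape: only the columns the cursor and the filter touch.
public sealed record Row(int Id, DateTime CreatedAtUtc, bool IsStuck);

public static class StuckPaging
{
    // One page, newest first: the stuck predicate composed with the keyset
    // cursor predicate (CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND Id < cursorId)).
    public static List<Row> Page(
        IEnumerable<Row> rows, (DateTime CreatedAtUtc, int Id)? after, int pageSize) =>
        rows.Where(r => r.IsStuck)
            .Where(r => after is null
                     || r.CreatedAtUtc < after.Value.CreatedAtUtc
                     || (r.CreatedAtUtc == after.Value.CreatedAtUtc && r.Id < after.Value.Id))
            .OrderByDescending(r => r.CreatedAtUtc)
            .ThenByDescending(r => r.Id)
            .Take(pageSize)
            .ToList();

    // Walks every page until an empty one is returned and records the ids
    // in the order they were visited.
    public static List<int> VisitAll(IReadOnlyList<Row> rows, int pageSize)
    {
        var visited = new List<int>();
        (DateTime CreatedAtUtc, int Id)? cursor = null;
        while (true)
        {
            var page = Page(rows, cursor, pageSize);
            if (page.Count == 0) break;
            visited.AddRange(page.Select(r => r.Id));
            var last = page[^1];
            cursor = (last.CreatedAtUtc, last.Id);
        }
        return visited;
    }
}
```

With six interleaved rows and a page size of 1, every page boundary has a non-stuck row sitting between the cursor and the next stuck row, which is the case the existing test does not pin down.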
**Resolution**
|
||||
|
||||
_Unresolved._

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 9 |

## Summary

@@ -46,6 +46,31 @@ keyword-search filter (SiteEventLogging-013) and a claimed initial-purge block o
host startup thread (SiteEventLogging-014 — later re-triaged to Won't Fix, the
premise does not hold on .NET 8+).

#### Re-review 2026-05-28 (commit `1eb6e97`)

Re-reviewed the module at commit `1eb6e97`. All fourteen prior findings remain closed
and their resolutions hold up under inspection: the lock-guarded `WithConnection`
overloads, the background-writer `Channel<T>` with disposed-mid-drain fault
propagation, the `auto_vacuum = INCREMENTAL` schema + logical-size measurement, the
severity index, the `LIKE` keyword-search escaping, and the concrete-recorder DI
wiring are all present and correct at this commit. Nine new findings were recorded —
none are regressions of prior fixes. The most notable (SiteEventLogging-016, **High**)
is a correctness defect in the query path: timestamps are stored as ISO 8601 strings
generated from `DateTimeOffset.UtcNow` (so they always have a `+00:00` offset suffix),
but the `From`/`To` filters are stringified verbatim via `request.From.Value.ToString("o")`
without normalising to UTC, so a central client that sends a non-UTC `DateTimeOffset`
gets a broken lexicographic comparison and either spuriously includes or excludes
events. The next-most-notable findings are SiteEventLogging-015 (unbounded background
write queue can grow without limit under sustained writer slowness — sister
`SqliteAuditWriter` uses a bounded channel) and SiteEventLogging-017 (the central
client's `PageSize` is used verbatim with no upper-bound clamp, defeating the design's
"prevents broad queries from overwhelming the communication channel" rationale). The
remaining findings are low-severity hygiene / documentation: an unused
`FailedWriteCount` metric, untyped severity/event-type fields, non-invariant culture
parsing, the purge service running on the standby node, the redundant `Cache=Shared`
on a single-connection logger, and a non-volatile stop flag in a concurrency stress
test.

## Checklist coverage

| # | Category | Examined | Notes |
@@ -61,6 +86,21 @@ premise does not hold on .NET 8+).
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
| 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |

_Re-review (2026-05-28, `1eb6e97`):_

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps (-016); `DateTimeOffset.Parse` without invariant culture is culture-sensitive (-021); severity/event-type accept any non-empty string with no schema enforcement (-020). |
| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` is a simple `Receive`/`Tell` bridge with no supervision concerns of its own; no new findings. |
| 3 | Concurrency & thread safety | ☑ | Concurrent-write stress test uses a non-volatile `stop` flag (-023). The shared-connection lock pattern is correct post-SiteEventLogging-003. |
| 4 | Error handling & resilience | ☑ | `FailedWriteCount` is exposed but nothing in Health Monitoring polls it — the metric is unobserved (-018). |
| 5 | Security | ☑ | Queries are fully parameterised. `PageSize` and `KeywordFilter` from the central client are not bounded (-017) — a hostile or buggy central could request `int.MaxValue` rows or multi-MB `LIKE` patterns. |
| 6 | Performance & resource management | ☑ | Background write queue is unbounded (-015); `Cache=Shared` is redundant for a single-connection logger (-022); upper-bound on `PageSize` missing (-017). |
| 7 | Design-document adherence | ☑ | `EventLogPurgeService` is registered as a per-host `BackgroundService` and runs on the standby too, but the design says "the daily background job runs on the active node" (-019). |
| 8 | Code organization & conventions | ☑ | `FailedWriteCount` is on the concrete `SiteEventLogger`, not on `ISiteEventLogger`, so any future non-concrete consumer cannot read it (-018). |
| 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
| 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |

## Findings

### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
@@ -706,3 +746,341 @@ re-triage note). No code change made. A verification test
`StartAsync_DoesNotBlock_OnTheInitialPurge` was added to pin this behaviour
(asserts `StartAsync` returns in under 1 s and the initial purge still runs on the
background scheduler).

### SiteEventLogging-015 — Background write queue is unbounded; can grow without limit under sustained writer slowness

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:58-63` |

**Description**

`SiteEventLogger` creates its background-writer feeder as
`Channel.CreateUnbounded<PendingEvent>(...)`. The writer thread funnels every write
through the shared `_writeLock` (acquired by `WithConnection`), so any condition that
makes a single iteration slow — a long-running query in `EventLogQueryService`
holding the lock, a `PurgeByStorageCap` run that takes the lock for batched
`DELETE` + `PRAGMA incremental_vacuum`, a disk stall, or a sustained event burst
from an alarm storm / script failure loop — drives the queue arbitrarily large.
Every queued `PendingEvent` retains its `TaskCompletionSource` and its payload
strings, so there is no upper bound on how much memory the recorder can hold.

The sister centralized-audit component `ScadaLink.AuditLog/Site/SqliteAuditWriter.cs`
addresses the same hot-path-writer problem with
`Channel.CreateBounded<...>(new BoundedChannelOptions(_options.ChannelCapacity) { ..., FullMode = BoundedChannelFullMode.Wait })`,
giving back-pressure to producers. Site event logging picked the riskier choice for
a component that — per the design — is fed by every site subsystem (script, alarm,
deployment, DCL, store-and-forward, instance lifecycle, notification) and has both
a 30-day retention sweep and a 1 GB cap-purge competing for the same lock.

**Recommendation**

Switch to `Channel.CreateBounded<PendingEvent>(...)` with a configurable capacity
(default on the order of 10 000 — large enough to absorb a normal alarm burst,
small enough to bound memory). Pick a `FullMode` that matches policy: `Wait` for
back-pressure (callers `await` and serialise their actor thread on the queue —
defeats some of the SiteEventLogging-005 win but is safe), or `DropOldest` /
`DropWrite` with a counter (drop-and-count is closer to "best-effort audit"). Add
the dropped-event counter to `FailedWriteCount` or a sibling metric. Document the
chosen policy on `ISiteEventLogger.LogEventAsync`.
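
The drop-and-count option can be sketched with the BCL alone. This is a minimal illustration under assumptions, not the module's code: it uses `string` in place of `PendingEvent`, a demo capacity of 3, and a local `droppedCount` standing in for whichever metric is chosen.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;

long droppedCount = 0;

// Capacity 3 is only for the demo; the finding suggests something near 10 000.
var options = new BoundedChannelOptions(capacity: 3)
{
    FullMode = BoundedChannelFullMode.DropOldest,
    SingleReader = true,   // one background writer drains the queue
    SingleWriter = false,  // many subsystems enqueue events
};

// The itemDropped callback overload (available since .NET 6) runs once per evicted item.
var channel = Channel.CreateBounded<string>(
    options,
    _ => Interlocked.Increment(ref droppedCount));

// Five writes into a three-slot queue: the two oldest items are evicted.
for (var i = 1; i <= 5; i++)
    channel.Writer.TryWrite($"event-{i}");   // does not block in DropOldest mode

var retained = new List<string>();
while (channel.Reader.TryRead(out var item))
    retained.Add(item);
```

With `DropOldest` the producer never waits, so the actor thread is not stalled; the cost is that the oldest queued events are lost, which is why the counter matters. `Wait` trades that loss for back-pressure on callers.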

### SiteEventLogging-016 — `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps

| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:67-77`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:159`, `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:72-78` |

**Description**

Event rows are persisted with `timestamp` = `DateTimeOffset.UtcNow.ToString("o")`,
which always emits the round-trip ISO 8601 form ending in the literal offset
`+00:00` (e.g. `2026-05-28T12:34:56.7890123+00:00`). The query path filters by
range using a direct string comparison:

```csharp
whereClauses.Add("timestamp >= $from");
parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o")));
```

`request.From` is a `DateTimeOffset?` and `ToString("o")` preserves whatever offset
the caller passed in. If a central client passes a non-UTC `DateTimeOffset` — for
example the result of `DateTimeOffset.Now` in a `UTC+05:00` timezone — the produced
string is `"2026-05-28T17:34:56.0000000+05:00"`, which is lexicographically *greater*
than the equivalent UTC instant string `"2026-05-28T12:34:56.0000000+00:00"`. The
comparison `timestamp >= $from` is then evaluated as a byte-by-byte string compare
(SQLite default `BINARY` collation), so the query either spuriously excludes events
that genuinely occurred in the range, or spuriously includes events from a wholly
different hour. The same defect applies to `To`. The retention purge does
`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")` (UTC) so it is safe; only the
central query path is vulnerable.

The design explicitly states "All timestamps are UTC throughout the system" but the
boundary between a central `DateTimeOffset` and the SQLite store is not enforced.
A central UI rendered in a non-UTC timezone is the most likely trigger, and the
defect silently corrupts every query that filters by time range — exactly the
filter most likely to be set on a "show me what happened around the failover" query.

**Recommendation**

Normalise `From` / `To` to UTC before serialising:
`request.From.Value.ToUniversalTime().ToString("o")`, so the produced offset is
always `+00:00`. Do not use `.UtcDateTime.ToString("o")` for this: a UTC `DateTime`
formats with a `Z` suffix rather than `+00:00`, so the parameter would no longer
match the stored form character for character. Add a
regression test that filters with a `DateTimeOffset` carrying a non-zero offset and
asserts the matching events are returned. Optionally also store timestamps as
Unix-epoch `INTEGER` and let SQLite compare numerically, eliminating the
lexicographic-comparison hazard structurally.
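
The hazard can be reproduced without SQLite. The sketch below assumes ordinal string comparison as a stand-in for `BINARY` collation (equivalent for these ASCII strings); the variable names are illustrative and not taken from the module.

```csharp
using System;
using System.Globalization;

// Same instant, expressed in UTC (as stored) and in UTC+05:00 (as a client might send it).
var storedInstant = new DateTimeOffset(2026, 5, 28, 12, 34, 56, TimeSpan.Zero);
var clientFrom = storedInstant.ToOffset(TimeSpan.FromHours(5));

string stored = storedInstant.ToString("o", CultureInfo.InvariantCulture);
string naiveFrom = clientFrom.ToString("o", CultureInfo.InvariantCulture);
string normalisedFrom = clientFrom.ToUniversalTime().ToString("o", CultureInfo.InvariantCulture);

// Models `timestamp >= $from` for a row whose timestamp equals the filter instant.
bool naiveIncludes = string.CompareOrdinal(stored, naiveFrom) >= 0;
bool normalisedIncludes = string.CompareOrdinal(stored, normalisedFrom) >= 0;

Console.WriteLine($"{naiveFrom} -> included: {naiveIncludes}");
Console.WriteLine($"{normalisedFrom} -> included: {normalisedIncludes}");
```

The row sits exactly on the `From` boundary, so it should be included. With the un-normalised parameter it is excluded because `12:34` sorts before `17:34`; after `ToUniversalTime()` the two strings are identical and the row is returned.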

### SiteEventLogging-017 — Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale

| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:55`, `src/ScadaLink.Commons/Messages/RemoteQuery/EventLogQueryRequest.cs:18` |

**Description**

`EventLogQueryService.ExecuteQuery` resolves the effective page size as
`var pageSize = request.PageSize > 0 ? request.PageSize : _options.QueryPageSize;`
and uses it directly as the SQL `LIMIT $limit` (passing `pageSize + 1` to detect
"has more"). There is no upper bound. A central client — buggy or hostile — can
send `PageSize = int.MaxValue`, in which case the query attempts to materialise the
entire (up to 1 GB) event log into a single `List<EventLogEntry>` while holding the
shared write lock. This:

- Builds a worst-case ~1 GB managed allocation that, depending on Akka.NET cluster
  message serialisation limits, will then be serialised into an
  `EventLogQueryResponse` and pushed over the ClusterClient pipe.
- Blocks all writes (purge, recorder hot path) for the duration of the scan
  because the read holds `_writeLock`.
- Stalls the singleton `EventLogHandlerActor`, also blocking subsequent legitimate
  queries.

The design explicitly justifies pagination as preventing exactly this — "Results
are paginated with a configurable page size (default: 500 events) ... This prevents
broad queries from overwhelming the communication channel." The code honours the
*default* but does not enforce an *upper bound* on a client-supplied override.

**Recommendation**

Clamp `pageSize` to a configurable maximum (e.g. `SiteEventLogOptions.MaxQueryPageSize`,
default 5000) before using it. Also bound `KeywordFilter.Length` (e.g. 256 chars) —
a leading-wildcard `LIKE` of an unbounded pattern is itself an expensive operation
that runs under the same lock. Add a `Success: false, ErrorMessage: "PageSize
exceeds maximum"` reject path so a misbehaving central is told why its query is
refused.
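
A minimal sketch of the clamp, assuming the two limits quoted in this finding. `ResolvePageSize` and the constant names are hypothetical; in the module the values would come from options.

```csharp
using System;

// Hypothetical limits; the finding quotes a default of 500 and suggests a maximum of 5000.
const int DefaultPageSize = 500;
const int MaxPageSize = 5000;

// Falls back to the default for non-positive requests and caps everything else.
int ResolvePageSize(int requested) =>
    requested > 0 ? Math.Min(requested, MaxPageSize) : DefaultPageSize;

int effective = ResolvePageSize(int.MaxValue);

// The "has more" probe row. Once clamped this cannot overflow; unclamped,
// int.MaxValue + 1 wraps to a negative value in C#'s default unchecked context.
int limit = effective + 1;
```

Silent clamping and explicit rejection are both defensible. The reject path recommended above has the advantage that a misconfigured client finds out, whereas clamping quietly returns fewer rows than requested.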

### SiteEventLogging-018 — `FailedWriteCount` is exposed but never consumed by Health Monitoring

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:67-71,225-226` |

**Description**

`SiteEventLogger.FailedWriteCount` was added under SiteEventLogging-008 with the
XML doc statement "Surfaced so Health Monitoring can detect a logging outage
instead of relying on a local log line nobody is watching." The implementation is
correct (`Interlocked.Increment` on write failure, `Interlocked.Read` getter), but
a repo-wide search shows **no** caller anywhere in `src/` reads the property —
neither `ScadaLink.HealthMonitoring`, the central health collector, nor the host's
`/health` endpoint. The metric is effectively dead: a logging outage still goes
unnoticed in production, contradicting the original finding's resolution claim.

The property is also exposed only on the concrete `SiteEventLogger`, not on
`ISiteEventLogger`, so even if Health Monitoring were wired up it would have to
take a concrete-type dependency (the `internal Connection` member was removed, but
`FailedWriteCount` remained concrete-only).

**Recommendation**

Either (a) wire `FailedWriteCount` into the existing Health Monitoring metric
pipeline (e.g. publish it alongside other 30-second-interval site metrics, and
promote a sustained non-zero value to a Warning), and add it to `ISiteEventLogger`
so the consumer doesn't downcast; or (b) acknowledge the metric is unobserved by
softening the XML doc to "Available for future Health Monitoring integration" and
file a tracking item for the wiring. The current doc claim is misleading.

### SiteEventLogging-019 — `EventLogPurgeService` runs on every host node; design says "active node"

| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:21`, `docs/requirements/Component-SiteEventLogging.md:45` |

**Description**

`AddSiteEventLogging` calls `services.AddHostedService<EventLogPurgeService>()`,
which registers the purge `BackgroundService` per host. On a 2-node site cluster
both `node-a` and `node-b` start the service independently, so each runs its own
30-day retention purge and 1 GB cap purge against its own local
`site_events.db`. The design states only "A daily background job runs on the
active node and deletes all events older than 30 days." (Component-SiteEventLogging,
Storage section). In practice the standby node receives no writes, so its purge
finds nothing to delete and is harmless — but the implementation does not match the
documented "active node" gating, and the resolution note on SiteEventLogging-004
already flagged that the *writer* runs on the standby too. The purge has the same
shape.

Aligning to the design is also a defence against a future change that does write
to the standby (e.g. local heartbeats), and removes the per-node wake-ups that
contribute to `Microsoft.Extensions.Hosting` shutdown latency.

**Recommendation**

Either (a) gate the purge service on "this node is the active member of `siteRole`"
(check the cluster singleton ownership before each `RunPurge()`, or host the
purge inside the same cluster singleton as `EventLogHandlerActor`), or (b) reword
the design doc to "the purge runs on every node against its own local database;
on the standby it is a no-op". Pick one; the current mismatch is a doc-vs-code
defect.

### SiteEventLogging-020 — `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:144-156`, `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:14-15` |

**Description**

`LogEventAsync` validates `eventType` and `severity` only for non-empty/non-whitespace.
The XML doc enumerates the allowed values: `eventType` ∈ {script, alarm,
deployment, connection, store_and_forward, instance_lifecycle}, `severity` ∈
{Info, Warning, Error}. Nothing in the code enforces either set. Any caller can
pass `"SCRIPT"`, `"Script"`, `"warn"`, `"ERR"`, or a typo and the row is inserted
verbatim. Two follow-on consequences:

1. The `EventLogQueryService.Severity` filter is `severity = $severity` (exact
   match, case-sensitive by SQLite default `BINARY` collation). A row stored as
   `"error"` will not be returned for a query filtering on `"Error"`. The design
   lists severity as a first-class filter and the central UI will reasonably
   normalise to one casing — every row stored with a different casing is silently
   invisible to that filter.
2. The `Events Logged` table in the design implicitly relies on a stable
   `event_type` enumeration to drive UI grouping; a typo'd `event_type` slips in
   silently and is hard to detect later.

**Recommendation**

Validate `eventType` and `severity` against a known set (or accept `enum`s on the
interface, converting to canonical string at the call site). Reject unknown values
with `ArgumentException` and log a single-shot warning during construction if a
deployment is found to be using an unexpected value. Alternatively, normalise
casing on both the write path and the query parameter (e.g.
`severity = severity.ToLowerInvariant()`) so the filter behaves
case-insensitively. Update the XML doc to match the enforced contract.
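
One way to combine validation with canonical casing is a case-insensitive set lookup. This is a sketch under assumptions: the allowed values are copied from the XML doc quoted in this finding, and `Canonicalise` is a hypothetical helper, not an existing member.

```csharp
using System;
using System.Collections.Generic;

// Canonical spellings taken from the XML doc quoted in the finding.
var severities = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { "Info", "Warning", "Error" };
var eventTypes = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { "script", "alarm", "deployment", "connection", "store_and_forward", "instance_lifecycle" };

// Returns the canonical spelling stored in the set, or throws for unknown values.
string Canonicalise(HashSet<string> allowed, string value, string paramName) =>
    allowed.TryGetValue(value, out var canonical)
        ? canonical
        : throw new ArgumentException($"Unknown value '{value}'.", paramName);

string severity = Canonicalise(severities, "ERROR", "severity");
string eventType = Canonicalise(eventTypes, "Script", "eventType");
```

Because the stored value is always the canonical spelling, the existing exact-match `severity = $severity` filter keeps working as long as the query side passes its parameter through the same helper.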

### SiteEventLogging-021 — `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:138` |

**Description**

`ExecuteQuery` materialises rows via
`DateTimeOffset.Parse(reader.GetString(1))`. `DateTimeOffset.Parse(string)` uses
`CultureInfo.CurrentCulture` and `DateTimeStyles.None`. The stored format is ISO
8601 round-trip (`"o"`), which is *usually* parseable in any culture — but a
production node running with a non-default culture (e.g. Turkish "tr-TR", which
has historically broken case-insensitive ASCII comparisons via the
"Turkish-I" issue, or any culture that overrides the date/time separators) can
parse incorrectly or throw `FormatException`. The exception is caught by the outer
`try`, so the entire query is converted to a `Success: false` response — but the
failure mode is silent and culture-dependent.

The recorder side stores via `DateTimeOffset.UtcNow.ToString("o")`. The `"o"`
specifier is documented as producing the same pattern regardless of culture, so
the emitter is the lower-risk half of the round trip; passing
`CultureInfo.InvariantCulture` there as well makes the intent explicit and keeps
insert and query symmetric.

**Recommendation**

Parse with explicit invariant culture and round-trip style:
`DateTimeOffset.Parse(reader.GetString(1), CultureInfo.InvariantCulture,
DateTimeStyles.RoundtripKind)` (and the same for the `ToString("o", InvariantCulture)`
emitters in `SiteEventLogger.LogEventAsync` and `EventLogPurgeService.PurgeByRetention`).
Alternatively switch the schema to store `timestamp` as Unix-epoch `INTEGER` and
avoid all string-parsing.
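
A small round-trip sketch with the culture pinned on both sides. `Serialise` and `Deserialise` are hypothetical helper names used only for illustration; the calls inside them are the ones named in the recommendation.

```csharp
using System;
using System.Globalization;

// Emit with the culture pinned and the offset normalised to UTC.
string Serialise(DateTimeOffset value) =>
    value.ToUniversalTime().ToString("o", CultureInfo.InvariantCulture);

// Parse with the culture pinned, so the result does not depend on CurrentCulture.
DateTimeOffset Deserialise(string text) =>
    DateTimeOffset.Parse(text, CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind);

var original = new DateTimeOffset(2026, 5, 28, 12, 34, 56, 789, TimeSpan.Zero);
string serialised = Serialise(original);
DateTimeOffset roundTripped = Deserialise(serialised);
```

Keeping both directions in one pair of helpers means the recorder, the purge cutoff and the query reader cannot drift apart on format or culture.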

### SiteEventLogging-022 — `Cache=Shared` is redundant for a single-connection logger

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:52` |

**Description**

The connection string is built as
`$"Data Source={options.Value.DatabasePath};Cache=Shared"`. SQLite's
shared-cache mode is a *cross-connection* optimisation: it lets multiple
`SqliteConnection`s in the same process share an in-process page cache. This
logger owns exactly one `SqliteConnection` and serialises all access through
`_writeLock`, so `Cache=Shared` cannot share with anything — the mode is dormant.
At best it is dead configuration; at worst it adds (very small) per-statement
lock overhead inside SQLite. The sister `SqliteAuditWriter` carries the same
unused option, so the smell is a copy-and-paste pattern.

Shared-cache mode also subtly changes the semantics of `PRAGMA busy_timeout` and
`PRAGMA locking_mode`, so leaving it on while *not* using it is a small future
footgun if anyone later opens a second connection to the same file from another
component on the same host (e.g. a tooling read-only viewer).

**Recommendation**

Drop `Cache=Shared` from the connection string — the logger is single-connection
and gains nothing from it. If a future need to share the DB across connections in
the same process arises, reintroduce it deliberately together with the `busy_timeout`
and `locking_mode` review that should accompany it.

### SiteEventLogging-023 — Concurrent-stress test uses a non-volatile `stop` flag

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.SiteEventLogging.Tests/EventLogPurgeServiceTests.cs:282-308` |

**Description**

`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` uses a plain `bool stop = false;`
that the main test thread mutates after the purge task completes
(`stop = true;`) while four background writer tasks are spin-checking `while (!stop)`.
The flag is not declared `volatile`, not wrapped in `Volatile.Read/Volatile.Write`,
and not behind a memory barrier. On a release build with a relaxed memory model
the writer threads are permitted to cache the `stop = false` read indefinitely,
which means in theory the test can hang past xUnit's per-test timeout instead of
asserting `Empty(exceptions)`. The test relies on observed JIT/runtime behaviour
that today happens to refresh the field across the `await _eventLogger.LogEventAsync`
boundary, but that is an implementation detail rather than a contract.

The test is a regression test for SiteEventLogging-003; a flaky / hang-prone
version of it can mask the very behaviour it is meant to pin.

**Recommendation**

Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or promote
`stop` to a `volatile bool` field (a local cannot be declared `volatile`), or access
it through `Volatile.Read` / `Volatile.Write`.
`CancellationTokenSource` is the canonical .NET pattern and also composes with a
timeout on the `Task.WhenAll` wait.
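
The `CancellationTokenSource` variant could look like the sketch below. It is an assumption-laden stand-in rather than the real test: the loop body only increments a counter where the real test would call the logger, and `Task.Delay(50)` replaces awaiting the purge.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var cts = new CancellationTokenSource();
long writes = 0;

// Four stand-in writers; the real test would call the logger inside the loop.
var writers = Enumerable.Range(0, 4)
    .Select(_ => Task.Run(async () =>
    {
        while (!cts.IsCancellationRequested)
        {
            Interlocked.Increment(ref writes);
            await Task.Yield();
        }
    }))
    .ToArray();

await Task.Delay(50);   // stand-in for awaiting the purge under test
cts.Cancel();           // every writer observes the cancellation request

// WaitAsync (.NET 6+) turns a hang into a TimeoutException instead of a stuck test run.
await Task.WhenAll(writers).WaitAsync(TimeSpan.FromSeconds(5));
cts.Dispose();
```

The explicit timeout on the final wait is the part a plain `bool` flag cannot provide: if a writer ever fails to stop, the test fails quickly with a clear exception.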

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.SiteRuntime` |
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |

## Summary

@@ -47,6 +47,36 @@ and two dead lifecycle handlers in `InstanceActor` that the Deployment Manager
never routes to (SiteRuntime-019, Low). All three were subsequently resolved on
2026-05-17. Open findings: 0.

#### Re-review 2026-05-28 (commit `1eb6e97`)

The module was re-reviewed at commit `1eb6e97` as part of the new baseline
review. The SiteRuntime source surface has grown materially since the prior
pass — primarily by threading `ExecutionId`/`ParentExecutionId`/`SourceNode`
through the script-trust-boundary helpers and the cached-call telemetry
emitters, and by adding `OperationTrackingStore`, the
`AuditingDbConnection`/`AuditingDbCommand`/`AuditingDbDataReader` decorators,
and `ScriptExecutionScheduler`. All 10 checklist categories were walked afresh.
Seven new findings were recorded: a race that throws
`InvalidActorNameException` when a second `DeployInstanceCommand` arrives for
the same instance while a redeployment is still terminating its predecessor
(SiteRuntime-020, Medium); an artifact-only data-connection update that never
reaches the DCL (SiteRuntime-021, Medium); `AuditingDbCommand.DbConnection.set`
reaching into `AuditingDbConnection._inner` via reflection — the same
anti-pattern SiteRuntime-006 eliminated for the repositories, now reintroduced and
in direct tension with the script trust model that forbids `System.Reflection`
(SiteRuntime-022, Medium); `Convert.ToDouble(value)` in `ScriptActor` /
`AlarmActor` running under `CurrentCulture` so a string attribute value
becomes locale-sensitive (SiteRuntime-023, Low); `OperationTrackingStore`
serialising every cached-call write through a single connection +
`SemaphoreSlim` and using sync-over-async in `Dispose()` (SiteRuntime-024,
Medium); inbound-API `SetAttribute` (and any future caller) accepting unknown
attribute names and persisting them as overrides, polluting both `_attributes`
and the SQLite override table (SiteRuntime-025, Low); and the
`ReplicationMessages.cs` outbound/inbound record types still missing public XML
docs (SiteRuntime-026, Low). Prior findings 001–019 remain
Resolved/Deferred — no regressions observed in any of their fixed call sites.
Open findings: 7.

## Checklist coverage

| # | Category | Examined | Notes |
@@ -62,6 +92,21 @@ never routes to (SiteRuntime-019, Low). All three were subsequently resolved on
| 9 | Testing coverage | ✓ | No tests for ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, or the two repositories. |
| 10 | Documentation & comments | ✓ | Several XML comments describe behaviour the code does not implement (see findings). |

_Re-review (2026-05-28, `1eb6e97`):_

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Second-deploy race vs. pending redeploy (020); artifact-only data-connection update never reaches DCL (021); unknown-name SetAttribute persists bogus overrides (025). |
| 2 | Akka.NET conventions | ✓ | Trigger-eval blocking on coordinator mailbox remains Deferred (014); short-lived execution actors and replication actor otherwise conform. |
| 3 | Concurrency & thread safety | ✓ | DM's `_instanceActors` cache and `_pendingRedeploys` map shifted from old race; new ordering race surfaced (020). `OperationTrackingStore` single-connection + SemaphoreSlim serialises all cached writes (024). |
| 4 | Error handling & resilience | ✓ | `Task.Run` fire-and-forget replication paths log on faulted (acceptable, per "best-effort replication" design). DM's deploy persistence rollback path (resolved as SiteRuntime-005) intact. |
| 5 | Security | ✓ | Trust-model semantic analysis (SiteRuntime-011 fix) intact. `AuditingDbCommand` reflects into `AuditingDbConnection._inner` — same anti-pattern as SiteRuntime-006 (022). Audit emitter captures SQL parameter values verbatim per M4 design (M5 will redact). |
| 6 | Performance & resource management | ✓ | Per-call SQLite connections on hot paths in `SiteStorageService` (existing pattern, acceptable). `OperationTrackingStore` `Dispose()` does sync-over-async (024). `ScriptExecutionScheduler` bounded threads as expected. |
| 7 | Design-document adherence | ✓ | Artifact-only data-connection update path is silently inert (021) — contradicts the "site is self-contained after artifact deployment" intent. |
| 8 | Code organization & conventions | ✓ | Repository reflection-via-private-field anti-pattern reintroduced in `AuditingDbCommand` (022). `ReplicationMessages.cs` public records still undocumented (026). |
| 9 | Testing coverage | ✓ | `SiteReplicationActor` remains uncovered (SiteRuntime-016 deferred that gap to a clustered-ActorSystem harness, still outstanding). New findings have no targeted coverage yet. |
| 10 | Documentation & comments | ✓ | `ReplicationMessages.cs` records lack XML docs (026); other XML doc surface materially expanded in `1eb6e97`. |

## Findings

### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer
@@ -902,3 +947,362 @@ stating the Deployment Manager owns this lifecycle. Regression test:
`InstanceActorTests.InstanceActor_DoesNotHandleDisableOrEnableCommands` asserts the
Instance Actor produces no `InstanceLifecycleResponse` for either command
(confirmed to fail against the pre-fix dead handlers and pass after removal).
|
||||
|
||||
### SiteRuntime-020 — Second `DeployInstanceCommand` arriving during a pending redeploy races the still-terminating actor on its name
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:285`, `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:971` |
|
||||
|
||||
**Description**
|
||||
|
||||
The SiteRuntime-003 fix makes `HandleDeploy` watch + stop a running Instance
|
||||
Actor and buffer the in-flight `DeployInstanceCommand` in `_pendingRedeploys`
|
||||
until `Terminated` arrives. The handler also removes the instance from
|
||||
`_instanceActors` synchronously, in step with the stop request:
|
||||
|
||||
```csharp
if (_instanceActors.TryGetValue(instanceName, out var existing))
{
    _instanceActors.Remove(instanceName);
    _pendingRedeploys[existing] = new PendingRedeploy(command, Sender);
    Context.Watch(existing);
    Context.Stop(existing);
    UpdateInstanceCounts();
    return;
}

// Fresh deployment — no existing actor to replace.
ApplyDeployment(command, Sender, isRedeploy: false);
```
|
||||
|
||||
If a *second* `DeployInstanceCommand` for the same `instanceName` arrives on
|
||||
the singleton's mailbox while the predecessor is still terminating, the
|
||||
`_instanceActors.TryGetValue` lookup correctly reports "no existing actor" —
|
||||
because the first deploy already removed it — and execution falls through to
|
||||
`ApplyDeployment(..., isRedeploy: false)`. `ApplyDeployment` immediately calls
|
||||
`CreateInstanceActor`, which calls `Context.ActorOf(props, instanceName)`. But
|
||||
the predecessor's Akka child name **is still registered** in the parent's
|
||||
child registry: that name is only released after the predecessor's `Terminated`
|
||||
signal — exactly the asynchronous gap SiteRuntime-003 was created to plug for
|
||||
the *first* redeploy. `Context.ActorOf` therefore throws
|
||||
`InvalidActorNameException`, which Akka rethrows as
|
||||
`ActorInitializationException` — and the supervisor's `Stop` directive on that
|
||||
exception (DeploymentManagerActor.cs:179) silently stops the just-created
|
||||
child. The second deploy is then quietly lost: `_instanceActors` doesn't
|
||||
contain it (the throw aborted the bookkeeping after `CreateInstanceActor`'s
|
||||
own `ContainsKey` guard but before `_instanceActors[instanceName] = actorRef`
|
||||
would have run), `_totalDeployedCount` was incremented, and the deployer is
|
||||
never told the deployment failed (the persistence `Task.Run` is also dropped
|
||||
on the throw path). The race is real on a busy site where central retries a
|
||||
deploy because the prior attempt timed out — exactly the scenario the
|
||||
DeploymentManager-006 query-then-deploy idempotency mechanism was designed for.
|
||||
|
||||
The first-redeploy case (SiteRuntime-003) does NOT exhibit this because at
|
||||
that point the predecessor's child name was still in `_instanceActors`, so the
|
||||
branch correctly buffers. The bug is specific to the third (and beyond)
|
||||
incoming deploy when two are already in flight for the same instance.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
The pending-redeploy bookkeeping needs to be authoritative for "we are mid-
|
||||
redeploy on this instance", not just the `_instanceActors` cache. Add a second
|
||||
keyed lookup — e.g. a `Dictionary<string, IActorRef> _terminatingActorsByName`
|
||||
populated when the predecessor is stopped — and check it BEFORE
|
||||
`ApplyDeployment(isRedeploy: false)`. On a hit, overwrite (or stash) the
|
||||
buffered `PendingRedeploy` for that terminating actor so the latest command
|
||||
wins on the `Terminated` signal. Alternatively, defer the deploy by stashing
|
||||
all messages for that `instanceName` until the predecessor terminates (Akka
|
||||
`Stash` pattern). Either way, the fall-through to "fresh deployment" needs to
|
||||
be gated on "no instance with this name is currently terminating".
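
The gating described in the recommendation can be sketched without Akka at all. The following is a minimal model, not the module's code: `RedeployGate`, `DeployDecision`, `OnDeploy`, and `TryCompleteTermination` are hypothetical names, and plain collections stand in for `_instanceActors` and the proposed `_terminatingActorsByName`. It only illustrates the ordering of the checks, assuming the latest buffered command should win.

```csharp
using System.Collections.Generic;

public enum DeployDecision { StartFresh, StopAndBuffer, ReplaceBuffered }

// Hypothetical, Akka-free model of the name bookkeeping (illustration only).
public sealed class RedeployGate<TCommand>
{
    // Stands in for _instanceActors: names with a live Instance Actor.
    private readonly HashSet<string> _running = new();

    // Stands in for the proposed terminating-by-name lookup, holding the
    // command to apply once the predecessor's Terminated signal arrives.
    private readonly Dictionary<string, TCommand> _pendingByTerminatingName = new();

    public DeployDecision OnDeploy(string instanceName, TCommand command)
    {
        // Checked FIRST: the child name is still reserved while the
        // predecessor terminates, so a fresh ActorOf would fail here.
        if (_pendingByTerminatingName.ContainsKey(instanceName))
        {
            _pendingByTerminatingName[instanceName] = command; // latest command wins
            return DeployDecision.ReplaceBuffered;
        }

        if (_running.Remove(instanceName))
        {
            _pendingByTerminatingName[instanceName] = command;
            return DeployDecision.StopAndBuffer; // caller would Watch + Stop the actor
        }

        _running.Add(instanceName);
        return DeployDecision.StartFresh; // no child currently holds this name
    }

    // Called when the predecessor has terminated; hands back the buffered command.
    public bool TryCompleteTermination(string instanceName, out TCommand? command)
    {
        if (_pendingByTerminatingName.Remove(instanceName, out command))
        {
            _running.Add(instanceName); // the buffered deploy now owns the name
            return true;
        }
        return false;
    }
}
```

In the real actor the `StopAndBuffer` branch would correspond to the existing watch-and-stop code, and `TryCompleteTermination` to the `Terminated` handler; whether superseded commands should also receive a reply is a design choice this sketch leaves open.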
|
||||
|
||||
### SiteRuntime-021 — `HandleDeployArtifacts` updates `DataConnections` in SQLite but never sends `CreateConnectionCommand` to the DCL
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:931` |
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleDeployArtifacts` persists the artifact bundle (shared scripts, external
|
||||
systems, database connections, notification lists, SMTP configs, and
|
||||
**data connection definitions**) into local SQLite. For data connection
|
||||
definitions specifically (`DataConnections`), the handler calls
|
||||
`_storage.StoreDataConnectionDefinitionAsync(...)` — but does NOT issue a
|
||||
`CreateConnectionCommand` (or any other DCL command) to the `_dclManager`
|
||||
actor. The only path that pushes DCL configuration to the DCL is
|
||||
`EnsureDclConnections`, called exclusively from the deploy / startup-batch
|
||||
paths against the **flattened instance configuration's** inline `Connections`
|
||||
dictionary. There is no equivalent for an artifact-only update.
|
||||
|
||||
Concretely: an artifact deployment that changes a data connection's endpoint
|
||||
URL, credentials, backup endpoint, or failover retry count is stored
|
||||
durably in the site SQLite (so on the *next* node restart the site loads the
|
||||
new config and `EnsureDclConnections` picks it up) but is silently inert until
|
||||
either an instance using that connection is redeployed or the node restarts.
|
||||
This contradicts the design's "after artifact deployment, the site is fully
|
||||
self-contained" intent (Component-SiteRuntime.md, "System-Wide Artifact
|
||||
Handling") — the runtime DCL keeps using the stale connection until a much
|
||||
heavier trigger event occurs. It is also asymmetric with how
|
||||
`SharedScripts` are handled in the same method: shared scripts are both
|
||||
stored *and* recompiled into `_sharedScriptLibrary` on update so the change is
|
||||
live immediately.
|
||||
|
||||
(SiteRuntime-010 fixed a related defect inside `EnsureDclConnections` — the
|
||||
config-hash cache — but that's only consulted on the inline-config path; the
|
||||
artifact-deployment path never reaches `EnsureDclConnections`.)
|
||||
|
||||
**Recommendation**
|
||||
|
||||
In the `DataConnections` branch of `HandleDeployArtifacts`, after the
|
||||
`StoreDataConnectionDefinitionAsync` call, also send a
|
||||
`CreateConnectionCommand` to `_dclManager` for each updated definition,
|
||||
re-using the SiteRuntime-010 config hash so unchanged connections are skipped.
|
||||
Alternatively, refactor `EnsureDclConnections` to accept a flat list of
|
||||
`(name, protocol, configurationJson, backupConfigurationJson,
|
||||
failoverRetryCount)` tuples that both the inline (`FlattenedConfiguration`)
|
||||
and artifact paths can drive through it.
|
||||
|
||||
### SiteRuntime-022 — `AuditingDbCommand.DbConnection.set` uses reflection to read `AuditingDbConnection._inner`
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Scripts/AuditingDbCommand.cs:138` |
|
||||
|
||||
**Description**
|
||||
|
||||
The `DbConnection` setter on `AuditingDbCommand` unwraps an
|
||||
`AuditingDbConnection` value by reading its private `_inner` field via
|
||||
reflection:
|
||||
|
||||
```csharp
set
{
    _wrappingConnection = value;
    _inner.Connection = value switch
    {
        AuditingDbConnection auditing => auditing.GetType()
            .GetField("_inner", BindingFlags.Instance | BindingFlags.NonPublic)
            !.GetValue(auditing) as DbConnection,
        _ => value
    };
}
```
|
||||
|
||||
This is the same encapsulation-violating anti-pattern that SiteRuntime-006
|
||||
called out for the site repositories. A rename or refactor of
`AuditingDbConnection._inner` breaks the audit decorator at runtime (no
compile-time signal); the `!` null-forgiving operator only suppresses the
compiler's nullable warning, so a missing field surfaces as a
`NullReferenceException` at the call site; and the reflective access trips
static analyzers and IL trimming. More
|
||||
problematically, the script trust model the same module enforces in
|
||||
`ScriptCompilationService.ValidateTrustModel` explicitly forbids
|
||||
`System.Reflection` in scripts — yet the auditing helper a script ends up
|
||||
running through itself reaches via reflection into a sibling class. Both
|
||||
classes are `internal sealed` in the same assembly, so this is purely a
|
||||
self-imposed contract violation.
|
||||
|
||||
A second smaller concern in the same property: the getter returns
|
||||
`_wrappingConnection ?? _inner.Connection`. If the caller obtains a command
|
||||
via `AuditingDbConnection.CreateDbCommand()` and immediately reads
|
||||
`cmd.Connection`, the getter returns the raw inner connection (not the
|
||||
auditing wrapper), because `_wrappingConnection` is only populated when the
|
||||
setter is later invoked. That's surprising and at odds with the class's
|
||||
audit-everything intent — a script that round-trips a command through
|
||||
`cmd.Connection` re-enters the un-audited path.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Expose the wrapped connection through a proper API surface. The simplest fix
|
||||
that matches the SiteRuntime-006 precedent: add an
|
||||
`internal DbConnection Inner { get; }` property to `AuditingDbConnection`
|
||||
(both classes are `internal sealed`, so the property stays out of the public
|
||||
surface) and replace the reflection switch with `auditing.Inner`. While
|
||||
touching the property, also have the getter return `_wrappingConnection` even
|
||||
on the synthesised CreateDbCommand path (e.g. set `_wrappingConnection` to
|
||||
the parent connection inside `AuditingDbConnection.CreateDbCommand`).
|
||||
|
||||
### SiteRuntime-023 — `Convert.ToDouble(value)` in trigger and alarm evaluation is locale-sensitive
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:446`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:340`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:356`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:444` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ScriptActor.EvaluateCondition` and the three `AlarmActor` evaluators
|
||||
(`EvaluateRangeViolation`, `EvaluateRateOfChange`, `EvaluateHiLo`) call
|
||||
`Convert.ToDouble(value)` without specifying a culture. When `value` is a
|
||||
string (a path that exists today — attribute values that arrive as JSON-
|
||||
deserialized numbers can still surface as strings on some code paths,
|
||||
particularly array values that are JSON-stringified at
|
||||
`InstanceActor.HandleTagValueUpdate:377`), `Convert.ToDouble` parses against
|
||||
`CultureInfo.CurrentCulture`. On a host whose locale uses a comma decimal
|
||||
separator (German, French, most of continental Europe), `"1.5"` throws and
|
||||
the condition / alarm silently degrades to its catch-fallthrough (returns
|
||||
`false` for range/rate-of-change, keeps current level for HiLo, falls back to
|
||||
string-compare for conditionals). The CLAUDE.md "All timestamps are UTC"
|
||||
discipline is the equivalent rule for time; there is no equivalent invariant-
|
||||
culture discipline applied to numeric parsing.
|
||||
|
||||
The exposure is bounded — most attribute values arrive as numeric primitives
|
||||
from `TagValueUpdate.Value` or static `FlattenedConfiguration.Attributes`
|
||||
(also typed) so the implicit-cast `Convert.ToDouble` path is hit. But the
|
||||
string path is reachable via inbound API writes
|
||||
(`RouteToSetAttributesRequest.AttributeValues` is `IReadOnlyDictionary<string,
|
||||
string>`), via the JSON-array stringification at `HandleTagValueUpdate:377`,
|
||||
and via static-override values loaded from SQLite (which are persisted as
|
||||
strings — see `SetStaticOverrideAsync`).
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Replace each `Convert.ToDouble(value)` with
`Convert.ToDouble(value, CultureInfo.InvariantCulture)`, or front-load a
typed-numeric extraction helper
(`if (value is double d) return d; if (value is string s && double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out var p)) return p; return Convert.ToDouble(value, CultureInfo.InvariantCulture);`).
The site is a deterministic machine-control surface; condition evaluation
must not depend on the host's regional settings.
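
A culture-independent extraction helper along the lines recommended above might look like the following. This is a sketch under assumptions: the class and method names (`NumericParsing`, `TryToDoubleInvariant`) are hypothetical and do not exist in the module, and returning `false` on failure (rather than throwing) is one possible way to keep each evaluator's existing fallback behaviour.

```csharp
using System;
using System.Globalization;

// Hypothetical helper; names are illustrative, not part of the reviewed module.
public static class NumericParsing
{
    public static bool TryToDoubleInvariant(object? value, out double result)
    {
        switch (value)
        {
            case double d:
                result = d;
                return true;

            case string s:
                // Always parse with the invariant culture, so "1.5" means 1.5
                // regardless of the host's regional settings.
                return double.TryParse(
                    s, NumberStyles.Float, CultureInfo.InvariantCulture, out result);

            case IConvertible c:
                // Covers int, long, float, decimal, bool and similar primitives.
                try
                {
                    result = c.ToDouble(CultureInfo.InvariantCulture);
                    return true;
                }
                catch (Exception ex) when (
                    ex is FormatException or InvalidCastException or OverflowException)
                {
                    result = 0;
                    return false;
                }

            default:
                result = 0;
                return false;
        }
    }
}
```

A caller such as a range evaluator could then write `if (!NumericParsing.TryToDoubleInvariant(value, out var v)) return false;`, making the fallback explicit instead of relying on a catch block.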
|
||||
|
||||
### SiteRuntime-024 — `OperationTrackingStore` serialises all writes through one connection + `SemaphoreSlim`, and `Dispose()` does sync-over-async
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:39`, `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:360` |
|
||||
|
||||
**Description**
|
||||
|
||||
`OperationTrackingStore` owns exactly one `SqliteConnection` and gates every
|
||||
public method through a single `SemaphoreSlim(1, 1)`. The class XML comment
|
||||
calls this out as deliberate ("the M3 brief calls out as 'cleaner than the M2
|
||||
Channel<T> pipeline given the volume'"), and the *write* volume is genuinely
|
||||
low — at most a handful of lifecycle rows per cached call. But on a busy site
|
||||
the *read* path (`GetStatusAsync`) is called by every `Tracking.Status(id)`
|
||||
invocation from every executing script, and reads are serialised through the
|
||||
same gate as writes. A long-running write (e.g. a Roslyn-script-driven
|
||||
`RecordTerminalAsync` competing with an SQLite checkpoint) holds the gate and
|
||||
stalls every concurrent status query. SQLite supports concurrent readers with
|
||||
a single writer in WAL mode; the gate forfeits that capability.
|
||||
|
||||
A separate concern in the same class: `Dispose()` calls
|
||||
`DisposeAsyncCore().AsTask().GetAwaiter().GetResult()`. That is sync-over-
|
||||
async — the very pattern SiteRuntime-008 was a finding for. If a caller
|
||||
disposes the store from a synchronization context that does not allow
|
||||
re-entrance (e.g. an `IHostedService.StopAsync` continuation observed on the
|
||||
host's sync context, or a finalizer pumping on the thread pool with a stuck
|
||||
continuation), the `.WaitAsync()` inside `DisposeAsyncCore` waits for a
|
||||
continuation that will never run, and the dispose deadlocks. The async path
|
||||
itself is correct; only the sync `Dispose()` wrapper is risky.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
For the single-connection gate: split reads and writes into separate gates,
|
||||
or — better — keep the writer single-connection and open a fresh read
|
||||
connection (or pool of read connections) per `GetStatusAsync` call. SQLite
|
||||
connections are cheap; the `SiteStorageService` precedent already uses per-
|
||||
call connections on the read path. For `Dispose()`: prefer
|
||||
`Dispose() { GC.SuppressFinalize(this); _connection.Dispose(); _gate.Dispose(); }`
|
||||
without an awaited disposal, and have the `IAsyncDisposable.DisposeAsync`
|
||||
path do the awaiting. If a synchronous disposable is genuinely needed, do
|
||||
not bridge it through the async core — duplicate the dispose-once flag check
|
||||
into a sync path that calls `_connection.Dispose()` directly.
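
The "duplicate the dispose-once flag check into a sync path" option can be sketched as below. This is an assumption-laden illustration, not the store's actual code: `GatedStore` is a hypothetical stand-in for `OperationTrackingStore`, a plain `IDisposable` stands in for the `SqliteConnection`, and the `Interlocked` flag is one possible way to implement dispose-once.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the store; only the disposal shape is shown.
public sealed class GatedStore : IDisposable, IAsyncDisposable
{
    private readonly IDisposable _connection;         // stands in for the SqliteConnection
    private readonly SemaphoreSlim _gate = new(1, 1); // the single-connection gate
    private int _disposed;                            // 0 = live, 1 = disposed

    public GatedStore(IDisposable connection) => _connection = connection;

    // Synchronous path: never blocks on async work, so it cannot deadlock
    // waiting for a continuation that is unable to run.
    public void Dispose()
    {
        if (Interlocked.Exchange(ref _disposed, 1) != 0) return;
        _connection.Dispose();
        _gate.Dispose();
    }

    // Asynchronous path: the only place that awaits the gate, letting an
    // in-flight operation finish before the connection is torn down.
    public async ValueTask DisposeAsync()
    {
        if (Interlocked.Exchange(ref _disposed, 1) != 0) return;
        await _gate.WaitAsync().ConfigureAwait(false);
        try { _connection.Dispose(); }
        finally { _gate.Dispose(); }
    }
}
```

The trade-off is that the synchronous path does not wait for an operation that currently holds the gate; callers that need a drained shutdown would use `await using` and the async path.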
|
||||
|
||||
### SiteRuntime-025 — `HandleSetStaticAttribute` persists unknown attribute names as static overrides
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:223`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:246` |
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleSetStaticAttribute` resolves the target attribute against
|
||||
`_configuration.Attributes` to decide whether to route the write to the DCL or
|
||||
treat it as a static-override write. If the lookup fails (`resolved == null`),
|
||||
`isDataSourced` is false, and execution falls through to
|
||||
`HandleSetStaticAttributeCore` — which unconditionally:
|
||||
|
||||
1. inserts the bogus key into the in-memory `_attributes` dictionary,
2. publishes an `AttributeValueChanged` for the bogus key to the site stream
   and to every child Script/Alarm actor,
3. persists a row in `static_attribute_overrides` for the bogus key, and
4. replies `Success = true` to the caller.
|
||||
|
||||
Concretely, an inbound API `Route.To().SetAttribute("notARealAttr", "x")`
|
||||
returns success, pollutes the in-memory state with a key that no script can
|
||||
legitimately observe (canonical-name lookup will not produce it), persists a
|
||||
durable SQLite override row that survives restart, and (on every restart)
|
||||
re-injects the polluting key via `HandleOverridesLoaded` at line 608. The
|
||||
override is **not** reset on instance redeployment in the same way the
|
||||
"genuine" overrides are — `ClearStaticOverridesAsync` does clear by
|
||||
`instance_unique_name`, so the row is eventually cleaned, but only on a full
|
||||
redeploy; in the meantime each restart resurrects it. The publish-to-stream
|
||||
side effect also lets a hostile or buggy inbound caller spam debug-view
|
||||
subscribers with synthetic attribute changes.
|
||||
|
||||
Worth flagging at Low: the inbound API surface is already authenticated and
|
||||
the design assumes its callers are trusted. But the no-validation behaviour
|
||||
contradicts the design doc's "Scripts can only read/write attributes on their
|
||||
own instance" framing — an inbound API call inherits the same instance-scope
|
||||
authority as a script, and the script trust model wouldn't sanction this.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
In `HandleSetStaticAttribute`, when `resolved == null`, reply
|
||||
`SetStaticAttributeResponse(Success: false,
|
||||
ErrorMessage: $"Attribute '{command.AttributeName}' not found on instance
|
||||
'{_instanceUniqueName}'")` instead of falling through to the override path.
|
||||
Optionally also surface the existence check on the `RouteInboundApiSetAttributes`
|
||||
fan-out so a multi-attribute write reports the offending key without rolling
|
||||
back the others (the per-attribute `Ask` shape already supports a partial
|
||||
failure response).
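
The existence check recommended above amounts to a guard evaluated before any state is touched. A minimal sketch follows, assuming a dictionary keyed by attribute name; `StaticAttributeGuard` and `SetStaticAttributeResult` are hypothetical names chosen for illustration and are not the module's real `SetStaticAttributeResponse` type.

```csharp
using System.Collections.Generic;

// Hypothetical result type mirroring the Success/ErrorMessage shape.
public sealed record SetStaticAttributeResult(bool Success, string? ErrorMessage);

public static class StaticAttributeGuard
{
    // Returns a failure result for names the instance configuration does not
    // define, so the caller can reply without mutating memory or SQLite.
    public static SetStaticAttributeResult Validate(
        IReadOnlyDictionary<string, object?> configuredAttributes,
        string instanceUniqueName,
        string attributeName)
    {
        if (!configuredAttributes.ContainsKey(attributeName))
        {
            return new SetStaticAttributeResult(
                false,
                $"Attribute '{attributeName}' not found on instance '{instanceUniqueName}'");
        }

        return new SetStaticAttributeResult(true, null);
    }
}
```

Running the guard before the in-memory insert, the stream publish, and the override persistence means an unknown name has no side effects at all.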
|
||||
|
||||
### SiteRuntime-026 — `ReplicationMessages.cs` public record types have no XML documentation
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:10`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:13`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:15`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:17`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:19`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:25`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:28`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:30`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:32`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:34` |
|
||||
|
||||
**Description**
|
||||
|
||||
The ten public record types in `ReplicationMessages.cs`
|
||||
(`ReplicateConfigDeploy`, `ReplicateConfigRemove`, `ReplicateConfigSetEnabled`,
|
||||
`ReplicateArtifacts`, `ReplicateStoreAndForward`, `ApplyConfigDeploy`,
|
||||
`ApplyConfigRemove`, `ApplyConfigSetEnabled`, `ApplyArtifacts`,
|
||||
`ApplyStoreAndForward`) carry no XML documentation. The file header comment
|
||||
groups them as "outbound" vs "inbound" but the individual records have no
|
||||
`<summary>` and no parameter docs. The XML-doc baseline `1eb6e97` rolled out
|
||||
across the rest of the module (the commit being reviewed is literally `docs:
|
||||
add XML doc comments across src + Sister Projects section in CLAUDE.md`), so
|
||||
this file is now the conspicuous outlier — and the `CommentChecker` skill
|
||||
relied on by the `fixdocs` workflow will flag every record as missing docs.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a `<summary>` per record naming the direction (outbound → peer / inbound
|
||||
from peer) and what the operation replicates, and `<param>` docs for each
|
||||
record parameter. Mirror the precedent in
|
||||
`src/ScadaLink.Commons/Messages/.../*.cs`. While there, consider sealing the
|
||||
inbound vs outbound split with a marker base type (currently they're just
|
||||
named conventionally) so `Receive<ReplicateXxx>` vs `Receive<ApplyXxx>` is
|
||||
expressed at the type level — but that's optional and out of scope for a
|
||||
docs-only finding.
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.StoreAndForward` |
|
||||
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 7 (3 Deferred: 002, 011, 012; 7 new Open: 018–024 — see Re-review 2026-05-28) |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -55,6 +55,76 @@ StoreAndForward-017 records that the Retry/Discard activity-log entries hard-cod
|
||||
`ExternalSystem` category, mislabelling notification and cached-DB-write messages in
|
||||
the site event log.
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
Full re-review against commit `1eb6e97` with the same 10-category checklist. The
|
||||
batch-3 / batch-4 resolutions (001, 003–010, 013–017) are still present and intact; no
|
||||
regressions detected on prior fixes. Findings 002, 011 and 012 remain validly
|
||||
`Deferred` (their preconditions are unchanged) and findings 005, 006, 010, 013, 014,
|
||||
016, 017 are confirmed `Resolved` against the current source.
|
||||
|
||||
This pass surfaced **seven new findings** clustered around two themes:
|
||||
|
||||
The first theme is **design-doc drift on the notification path**, which has acquired
|
||||
two now-real defects since the engine became central-targeted. `StoreAndForward-018`
|
||||
(High) records that a corrupt notification payload — handled in `NotificationForwarder.
|
||||
DeliverAsync` by returning `false` — parks a notification on its first retry-sweep
|
||||
encounter, despite the design doc stating "Notifications do not park" (line 47, "Parking
|
||||
applies only to the external-system-call and cached-database-write categories"). The
|
||||
same path becomes a poison-payload retry-forever trap on the active node if the engine
|
||||
ever softened the `false` semantics. `StoreAndForward-019` (Medium) records the
|
||||
sibling defect: notifications are enqueued with `MaxRetries` defaulting to
|
||||
`StoreAndForwardOptions.DefaultMaxRetries` (50), and the legacy SMTP path
|
||||
(`NotificationDeliveryService.SendAsync`) passes a positive bounded `smtpConfig.
|
||||
MaxRetries` — so an unreachable central will silently park notifications after a
|
||||
finite retry budget rather than "retry at the fixed forward interval until central acks"
|
||||
as the design requires. The contract `0 = no limit` is not enforced for the
|
||||
notification category.
|
||||
|
||||
The second theme is **subtle correctness and contract gaps around the operator paths**
|
||||
that survived the StoreAndForward-016/017 batch. `StoreAndForward-020` (Medium) records
|
||||
that `RetryParkedMessageAsync` skips replication entirely if `GetMessageByIdAsync`
|
||||
returns null after a successful local requeue (a narrow but real race window with a
|
||||
concurrent discard / sweep delete), re-introducing the StoreAndForward-016 standby
|
||||
divergence in that corner. `StoreAndForward-021` (Medium) is a design-doc-vs-code drift
|
||||
that should be reconciled in the doc: the **operation tracking table** is documented
|
||||
inside Component-StoreAndForward.md as a S&F responsibility (lines 21, 49, 77–87, 108,
|
||||
114), but the actual `OperationTrackingStore` lives in `src/ScadaLink.SiteRuntime/
|
||||
Tracking/` and is not consumed by S&F at all — the brief's own note flags this. The
|
||||
design doc should be updated to point at SiteRuntime, or the store moved to
|
||||
StoreAndForward.
|
||||
|
||||
`StoreAndForward-022` (Low) records that `_cachedCallObserver` silently drops audit
|
||||
telemetry when a buffered cached-call's `Id` is not a parseable `TrackedOperationId`
|
||||
GUID — the engine returns from `NotifyCachedCallObserverAsync` before emitting anything,
|
||||
so a legacy enqueue path that buffered a non-GUID id (the engine's own default minting
|
||||
produces "N"-formatted GUIDs, which TrackedOperationId.TryParse accepts, but any
|
||||
caller passing a custom non-GUID id silently bypasses the entire `Submitted/Forwarded/
|
||||
Attempted/Delivered/Parked/Discarded` audit lifecycle). `StoreAndForward-023` (Low)
|
||||
records that `siteId` is silently defaulted to `string.Empty` when no
|
||||
`IStoreAndForwardSiteContext` is registered, so a misconfigured host produces audit
|
||||
telemetry with `SourceSite = ""` and the central audit-log's `(SourceSite,
|
||||
TrackedOperationId)` correlation degrades to a per-id-only index. `StoreAndForward-024`
|
||||
(Low) is a stop-time ordering defect: `StopAsync` disposes the timer but a
|
||||
mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
|
||||
`_replication` after `StopAsync` returns; downstream resources disposed by the host
|
||||
shutdown sequence (the DI container) can then NRE through the still-running sweep.
|
||||
|
||||
## Checklist coverage — Re-review 2026-05-28
|
||||
|
||||
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Notification corrupt-payload parks contrary to design (018); `RetryParkedMessageAsync` skips replication when message reload races a deletion (020). |
| 2 | Akka.NET conventions | ☑ | `ParkedMessageHandlerActor` uses `PipeTo` correctly with success/failure projections (007 resolution preserved). No new findings. |
| 3 | Concurrency & thread safety | ☑ | Sweep-vs-stop race: a timer callback running while `StopAsync` returns can touch disposed dependencies (024). |
| 4 | Error handling & resilience | ☑ | Notifications park after `DefaultMaxRetries` exhaustion (019) — contradicts the design doc's "retried until central acks". |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout, payload JSON opaque, no secret material handled. |
| 6 | Performance & resource management | ☑ | No new findings — the connection-per-call documented trade-off and pooled `OpenAsync` remain acceptable. |
| 7 | Design-document adherence | ☑ | Operation Tracking Table documented in StoreAndForward but actually lives in SiteRuntime (021); notification non-parking guarantee broken by 018 + 019. |
| 8 | Code organization & conventions | ☑ | `IStoreAndForwardSiteContext` silently defaults `SiteId` to empty (023) — a configuration hole rather than an entity placement issue. |
| 9 | Testing coverage | ☑ | The seven new findings have no regression tests in `tests/ScadaLink.StoreAndForward.Tests/` — particularly the notification-doesn't-park invariant (018, 019), the requeue-after-reload-null replication gap (020), and the stop-during-sweep behaviour (024). |
| 10 | Documentation & comments | ☑ | `CachedCallAttemptOutcome.ParkedMaxRetries` XML doc says "S&F semantics" but the code applies it to notifications too if 018/019 fire — minor drift, captured under 018. The `TrackedOperationId.TryParse` silent-skip behaviour in `NotifyCachedCallObserverAsync` is documented in the source but not on the public observer contract (022). |
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -914,3 +984,474 @@ the StoreAndForward-016 replication) — and pass it to `RaiseActivity` (falling
|
||||
`RetryParkedMessageAsync_ActivityUsesMessageRealCategory` and
|
||||
`DiscardParkedMessageAsync_ActivityUsesMessageRealCategory` assert the activity carries
|
||||
`Notification` / `CachedDbWrite` respectively; both fail against the pre-fix code.
|
||||
|
||||
### StoreAndForward-018 — Notification corrupt-payload parks the buffered message, contradicting the "notifications do not park" design invariant
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/NotificationForwarder.cs:62`–`:69`, `:105`–`:122`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:369`–`:397` |
|
||||
|
||||
**Description**
|
||||
|
||||
The Component design doc explicitly carves out notifications from the parking lifecycle:
|
||||

> "Notifications do not park — they are retried at the fixed forward interval until
> central acks." (`docs/requirements/Component-StoreAndForward.md:47`)
> "Parking applies only to the external-system-call and cached-database-write
> categories." (same line)
|
||||
|
||||
`NotificationForwarder.DeliverAsync` violates this. When `TryBuildSubmit` fails to
|
||||
deserialize the buffered payload — either because `JsonSerializer.Deserialize` throws a
|
||||
`JsonException` (line 114) or because it returns `null` (line 119) — `DeliverAsync`
|
||||
returns `false` (line 68). On the **retry path** the S&F engine treats handler `false`
|
||||
as a permanent failure and **parks the message immediately** via the conditional
|
||||
`UpdateMessageIfStatusAsync(... Parked)` write at `StoreAndForwardService.cs:373`–`:385`.
|
||||
Result: a notification with a corrupt buffered payload — a row that the engine itself
|
||||
treats as opaque ("Payload: Serialized message content…"; `Component-StoreAndForward.md:
|
||||
110`) — enters the parked state and surfaces in the central UI's parked-message list
|
||||
under the `Notification` category, contradicting the doc's invariant and the resolved
|
||||
StoreAndForward-017's "Notification / CachedDbWrite" Retry/Discard category mapping.
|
||||
|
||||
The defect is real today: the inline comment on `NotificationForwarder.cs:64` even
|
||||
documents the violation ("An unreadable payload cannot be fixed by retrying — park it
|
||||
(return false)") as the intended behaviour, but that behaviour is what the design doc
|
||||
forbids. Either the doc needs to acknowledge a poison-payload parking exception for
|
||||
notifications, or the forwarder needs a different escape hatch (discard? log + drop?
|
||||
permanent-failure-as-`true` to clear the buffer?). Today there is no consistent answer
|
||||
between code and design.
|
||||
|
||||
Additionally, on the **immediate-delivery** path (a fresh enqueue followed by a
|
||||
`DeliverAsync` returning `false`), the engine returns `WasBuffered: false` and the row
|
||||
is never persisted — so the corrupt-payload "park" only occurs on the retry path, where
|
||||
the message has already been buffered (and replicated to the standby). The
|
||||
**inconsistency between the two paths** ("not buffered" vs "parked") for the same
|
||||
permanent-failure outcome is itself a contract surprise; the resolved StoreAndForward-004
|
||||
documents the immediate vs retry asymmetry, but does not anticipate that the retry
|
||||
asymmetry will violate a per-category invariant.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Choose one consistent reconciliation. Preferred option: change
`NotificationForwarder.DeliverAsync` to **discard** a corrupt payload rather than park it:
delete the buffered row directly, log a Site Event Log entry under `Discard`, and return
`true` so the engine clears the buffer. This preserves the design's "notifications do not
park" invariant. Alternatives: (a) update the design doc to acknowledge a poison-payload
parking exception specifically for notifications, and revise the resolved
StoreAndForward-017 wording; (b) treat `JsonException` as transient (this would retry
forever on a corrupt payload, which is bad); (c) introduce a per-category park-allowed
flag on the engine and gate the retry-path park behind it for the Notification category.
Add a regression test in
`tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` asserting that a
corrupt-payload notification reaches a terminal **non-Parked** state; today the
corrupt-payload behaviour is uncovered.
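
The preferred option could look roughly like the sketch below. It is an illustration under
stated assumptions, not the real forwarder: the `removeRow`, `logEvent`, and
`sendToCentral` delegates and the `NotificationPayload` record are hypothetical stand-ins
for whatever storage, Site Event Log, and transport dependencies `NotificationForwarder`
actually holds.

```csharp
using System;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical payload shape; the real notification contract is not shown in this review.
public sealed record NotificationPayload(string Subject, string Body);

public sealed class NotificationForwarderSketch
{
    private readonly Func<string, Task> _removeRow;                 // deletes the buffered row (assumed)
    private readonly Action<string, string> _logEvent;              // (category, message) (assumed)
    private readonly Func<NotificationPayload, Task<bool>> _sendToCentral;

    public NotificationForwarderSketch(
        Func<string, Task> removeRow,
        Action<string, string> logEvent,
        Func<NotificationPayload, Task<bool>> sendToCentral)
    {
        _removeRow = removeRow;
        _logEvent = logEvent;
        _sendToCentral = sendToCentral;
    }

    // Returns true when the engine may clear the buffer, false to keep retrying.
    public async Task<bool> DeliverAsync(string messageId, string payloadJson)
    {
        NotificationPayload? payload = null;
        try
        {
            payload = JsonSerializer.Deserialize<NotificationPayload>(payloadJson);
        }
        catch (JsonException)
        {
            // Fall through with payload == null: a corrupt payload is discarded, never parked.
        }

        if (payload is null)
        {
            await _removeRow(messageId);
            _logEvent("Discard", $"Notification {messageId} dropped: payload is not valid JSON");
            return true; // terminal outcome, so the engine clears the buffer
        }

        // A false result here is a transient failure and stays on the retry path.
        return await _sendToCentral(payload);
    }
}
```

Returning `true` for the corrupt case keeps the decision inside the forwarder, so the engine
needs no per-category knowledge; the trade-off is that the Site Event Log entry becomes the
only record of the dropped notification.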
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### StoreAndForward-019 — Notifications park after `DefaultMaxRetries` exhaustion, contradicting "retried until central acks"
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:229`, `:407`–`:437`; `src/ScadaLink.StoreAndForward/StoreAndForwardOptions.cs:18`; `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:1773`–`:1778`; `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:149`–`:156` |
|
||||
|
||||
**Description**
|
||||
|
||||
The design doc requires a buffered notification to be retried indefinitely until
|
||||
central acks:
|
||||
|
||||
> "The **notification** category retries differently: it has no source-entity setting.
|
||||
> The site→central forward uses a single fixed retry interval configured in the host
|
||||
> `appsettings.json`. … A buffered notification is retried until central acks it; it is
|
||||
> not parked on a retry limit (central, once reachable, owns delivery, retry, and
|
||||
> parking from that point on)." (`Component-StoreAndForward.md:55`–`:59`)
|
||||
|
||||
The current engine cannot honour that. `RetryMessageAsync` enforces parking at
|
||||
`message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
|
||||
(`StoreAndForwardService.cs:407`); a `MaxRetries == 0` is the documented "no limit"
|
||||
escape hatch (now correctly explained by the resolved StoreAndForward-015). But the two
|
||||
notification enqueue paths both supply a positive bounded `MaxRetries`:
|
||||
|
||||
- `ScriptRuntimeContext.cs:1773`–`:1778` (the `Notify.Send` site script path) calls
  `EnqueueAsync` without supplying the `maxRetries` argument, so the engine
  defaults to `StoreAndForwardOptions.DefaultMaxRetries = 50`
  (`StoreAndForwardOptions.cs:18`). After 50 retry sweeps with central unreachable, the
  notification is parked.
- `NotificationDeliveryService.cs:149`–`:156` (the legacy SMTP-style path retained for
  the central-side `INotificationDeliveryService` callers) passes
  `smtpConfig.MaxRetries > 0 ? smtpConfig.MaxRetries : null`; `null` falls back to the
  same 50-retry default, and any positive `smtpConfig.MaxRetries` still bounds the
  retry budget. Either way, a long central outage parks the notification.
|
||||
|
||||
A parked notification cannot be cleared by a central recovery: it stays parked until an
|
||||
operator clicks **Retry** in the parked-message UI. The design's invariant — that
|
||||
notification delivery converges automatically as soon as central is reachable — is
|
||||
broken: an extended central outage requires manual intervention to clear the backlog,
|
||||
which is exactly the behaviour the central-only outbox redesign was meant to remove
|
||||
from the site.
|
||||
|
||||
This is closely related to (but distinct from) StoreAndForward-018: 018 is the
|
||||
*permanent-failure-path* parking violation; 019 is the *transient-failure-path*
|
||||
parking violation under the engine's normal max-retries policy.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Make the notification enqueue paths pass `maxRetries: 0` so the documented "no limit /
|
||||
never parked" semantics apply, and guard against regression by adding an integration
|
||||
test in `tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` that runs
|
||||
a sweep many more times than `DefaultMaxRetries` against an always-failing handler and
|
||||
asserts the buffered notification's status stays `Pending` (not `Parked`). A cleaner
|
||||
alternative is to special-case the `Notification` category inside
|
||||
`RetryMessageAsync`'s max-retries guard (treat it as `MaxRetries == 0` regardless of
|
||||
the field value) so the invariant is enforced at the single chokepoint rather than
|
||||
relying on every caller to pass the right value — this also fixes the legacy
|
||||
`NotificationDeliveryService` path without editing the consumer.
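
The chokepoint alternative can be sketched as a pure predicate. This is a hedged
illustration only: `RetryPolicySketch.ShouldPark` is a hypothetical helper, and the enum
below lists just the two category names that appear in this review, not the real
`StoreAndForwardCategory` definition.

```csharp
// Hypothetical subset of the category enum; only these two members are named in the review.
public enum StoreAndForwardCategory { ExternalSystem, Notification }

public static class RetryPolicySketch
{
    // Mirrors the guard `MaxRetries > 0 && RetryCount >= MaxRetries`, but treats the
    // Notification category as "no limit" regardless of the MaxRetries the caller supplied.
    public static bool ShouldPark(StoreAndForwardCategory category, int maxRetries, int retryCount)
    {
        if (category == StoreAndForwardCategory.Notification)
        {
            return false; // notifications are retried until central acks them
        }

        return maxRetries > 0 && retryCount >= maxRetries;
    }
}
```

Enforcing the rule in one place means a caller that forgets `maxRetries: 0` cannot
reintroduce the parking behaviour, at the cost of the engine knowing about one category.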
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### StoreAndForward-020 — `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:599`–`:616` |
|
||||
|
||||
**Description**
|
||||
|
||||
The StoreAndForward-016 resolution wired Requeue replication into operator-initiated
|
||||
retry. The fix uses a two-step pattern:
|
||||
|
||||
```csharp
public async Task<bool> RetryParkedMessageAsync(string messageId)
{
    var success = await _storage.RetryParkedMessageAsync(messageId);   // step 1
    if (success)
    {
        var message = await _storage.GetMessageByIdAsync(messageId);   // step 2 (no txn)
        var category = message?.Category ?? StoreAndForwardCategory.ExternalSystem;
        if (message != null)
        {
            _replication?.ReplicateRequeue(message);                   // step 3
        }
        RaiseActivity("Retry", category, ...);
    }
    return success;
}
```
|
||||
|
||||
The two storage calls are on separate connections with no surrounding transaction. A
|
||||
concurrent writer between step 1 (which moved the row from Parked → Pending) and step 2
|
||||
(which re-reads the row) can delete or mutate the row:
|
||||
|
||||
- An operator who issues `DiscardParkedMessageAsync` immediately after retry — the
|
||||
`DiscardParkedMessageAsync` storage call is conditional on `status = Parked`, so it
|
||||
will be a no-op (correct), but a sweep that succeeds in delivering the just-requeued
|
||||
row will then call `_storage.RemoveMessageAsync` (unconditional), which deletes it.
|
||||
In a single retry-sweep cycle this race is real because `DefaultRetryInterval = Zero`
|
||||
is the standard test default and the operator action and a sweep tick can overlap.
|
||||
- A `RemoveMessageAsync` runs in step 1's wake; `GetMessageByIdAsync` returns null;
  step 3 (`_replication?.ReplicateRequeue`) is **skipped entirely**, but step 1
  already requeued the row locally. The standby is now left in `Parked` state while
  the active node has Pending-then-Deleted, exactly the standby divergence
  StoreAndForward-016 was supposed to fix. (A subsequent failover then lands on a
  Parked standby copy of a message the active node has already removed, the same
  regression 016 documented.)
|
||||
|
||||
The category-fallback path (`StoreAndForwardCategory.ExternalSystem` when message is
|
||||
null) silently mislabels the activity log entry too — the same defect that
|
||||
StoreAndForward-017 fixed for the non-racy path, except this branch handed back a
|
||||
hard-coded fallback rather than re-loading. The activity log entry is a minor side
|
||||
effect; the missing replication is the real defect.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Capture the message **once**, before the local Parked → Pending storage update, so the
|
||||
replication path has the row in hand even if a concurrent writer deletes it
|
||||
afterwards:
|
||||
|
||||
```csharp
var message = await _storage.GetMessageByIdAsync(messageId);   // before the update
if (message == null || message.Status != StoreAndForwardMessageStatus.Parked)
    return false;

var success = await _storage.RetryParkedMessageAsync(messageId);
if (!success) return false;

// `message` was the parked row; the active node just wrote it back to Pending with
// retry_count = 0, so construct the replicated state from those known mutations.
message.Status = StoreAndForwardMessageStatus.Pending;
message.RetryCount = 0;
message.LastError = null;
message.LastAttemptAt = null;
_replication?.ReplicateRequeue(message);
RaiseActivity("Retry", message.Category, $"Parked message {messageId} moved back to queue");
return true;
```
|
||||
|
||||
Add a regression test in `StoreAndForwardReplicationTests` that simulates the
|
||||
delete-between-update-and-reload race and asserts the `Requeue` replication
|
||||
operation is still emitted with the correct category.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### StoreAndForward-021 — Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-StoreAndForward.md:21`, `:49`–`:51`, `:77`–`:87`, `:108`, `:114`; `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:37`; `src/ScadaLink.StoreAndForward/` (whole module) |
|
||||
|
||||
**Description**
|
||||
|
||||
Component-StoreAndForward.md repeatedly assigns the **Operation Tracking Table** to
|
||||
this component:
|
||||
|
||||
- **Responsibilities** (line 21): "Maintain a site-local **operation tracking table**
|
||||
holding one row per `TrackedOperationId` for cached calls … the authoritative status
|
||||
record consulted by `Tracking.Status(id)`."
|
||||
- **Message Lifecycle** (lines 49–51): "the operation tracking table is the status
|
||||
record and the S&F buffer is purely the retry mechanism. A cached call that succeeds
|
||||
on its first immediate attempt is written directly as a terminal `Delivered` tracking
|
||||
row and never enters the S&F buffer."
|
||||
- **Operation Tracking Table** section (lines 77–87): "Alongside the S&F buffer DB,
|
||||
each site node holds a **site-local operation tracking table** in SQLite. … Each row
|
||||
records the operation kind (`TrackedOperationKind`) …"
|
||||
|
||||
The actual implementation lives outside this module:
`src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs` (and
`IOperationTrackingStore`, `OperationTrackingOptions`).
The StoreAndForward project contains no references to the tracking store, owns no
`operation_tracking` table, and `StoreAndForwardService.NotifyCachedCallObserverAsync`
is only a hook handing telemetry context to an `ICachedCallLifecycleObserver` (the
audit bridge wired in `ScadaLink.AuditLog`). The S&F module is **not** the table's
owner; SiteRuntime is.
|
||||
|
||||
This is a real design-doc drift, not a code defect, and is flagged explicitly in the
|
||||
brief's "Module-specific notes". The drift matters because the design doc's
|
||||
discussion of the lifecycle — "immediate success writes a terminal Delivered tracking
|
||||
row directly here", "operator discard sets terminal `Discarded`", "central never
|
||||
mutates the mirror row directly" — places coordination responsibilities on the wrong
|
||||
component. A reader looking for the source of truth for `Tracking.Status(id)` would
|
||||
read `Component-StoreAndForward.md` and search `src/ScadaLink.StoreAndForward/` in
|
||||
vain. The doc also lists Site Call Audit / Audit Log telemetry-emission as an S&F
|
||||
responsibility (line 22), but the emission actually happens via the `AuditLog` site
|
||||
component subscribing to `ICachedCallLifecycleObserver`.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Reconcile the doc with the code. The simplest fix is doc-side: update
|
||||
Component-StoreAndForward.md to scope its responsibilities back to the retry
|
||||
mechanism + replication + parked-message management, and add a cross-reference to a
|
||||
new (or existing) component doc for Operation Tracking (Component-SiteRuntime.md, or
|
||||
a new Component-OperationTracking.md). The code is internally consistent — the audit
|
||||
bridge subscribes to the observer hook, the SiteRuntime store writes the rows, the S&F
|
||||
engine emits attempt telemetry on the cached-call hot path — but the design doc is
|
||||
several refactors out of date. The hierarchical map should be:
|
||||
|
||||
- `Component-StoreAndForward.md` → S&F buffer + Replication + Parked-message
|
||||
management + Notification forwarding to central + cached-call telemetry **hook**.
|
||||
- New doc / SiteRuntime doc → Operation Tracking Table semantics and lifecycle.
|
||||
- `Component-SiteCallAudit.md` / `Component-AuditLog.md` → telemetry emission +
|
||||
central-side mirror.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### StoreAndForward-022 — `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId`
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:484`–`:515` |
|
||||
|
||||
**Description**
|
||||
|
||||
`NotifyCachedCallObserverAsync` (the per-attempt observer notifier wired by the M3
|
||||
Bundle E rollout) bails out with no audit emission when
|
||||
`TrackedOperationId.TryParse(message.Id, out var trackedId)` returns false
|
||||
(`StoreAndForwardService.cs:510`–`:515`). The inline comment justifies the behaviour as
|
||||
back-compat for "pre-M3 message (random GUID-N id from S&F itself, no
|
||||
TrackedOperationId threaded in)", but the documented contract is broken in two ways:
|
||||
|
||||
1. **Silent dropping of every audit row, not just the first one.** The skip means no
|
||||
`Attempted` row, no `CachedResolve` terminal row, no audit trail at all for that
|
||||
operation's S&F lifecycle — yet the rest of the system (script trust boundary,
|
||||
parked-message UI, etc.) still treats the operation as audit-tracked. The drop is
|
||||
not surfaced via a metric, log warning (the path is a silent `return`), or counter,
|
||||
so a misconfigured caller bypasses the audit hot path with zero feedback.
|
||||
|
||||
2. **The contract is hidden in field-level XML.** The `ICachedCallLifecycleObserver`
|
||||
public interface contract (defined in `ScadaLink.Commons`) does not document that
|
||||
the observer will be silently skipped when the underlying S&F message id is not a
|
||||
GUID. A consumer reading the interface contract reasonably expects every cached-call
|
||||
attempt to surface — the audit pipeline depends on it. The silent-drop is an
|
||||
implementation detail of the S&F bridge that should be either lifted onto the
|
||||
contract or removed.
|
||||
|
||||
The engine itself mints GUID-N ids via `Guid.NewGuid().ToString("N")` (line 224), which
|
||||
`TrackedOperationId.TryParse` accepts, so the skip path is unreachable for engine-minted
|
||||
ids. It is reachable only for callers that supply their own `messageId` argument with a
|
||||
non-GUID format. The current callers (`NotificationOutbox` enqueue path with
|
||||
NotificationId, cached-call enqueue path with `TrackedOperationId.ToString()`) all
|
||||
supply GUID-shaped ids. The defect is latent — a future caller passing a non-GUID id
|
||||
would silently bypass audit.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Two options. The cheap fix: change the skip to a `_logger.LogWarning` with the offending
|
||||
id so a misconfigured caller is observable, and update the
|
||||
`ICachedCallLifecycleObserver` XML doc to mention the "non-GUID id → no telemetry"
|
||||
contract explicitly. The more correct fix: emit a still-audited row for the
|
||||
non-GUID case (e.g. synthesise a `TrackedOperationId` from the underlying id, or emit a
|
||||
distinguished "tracking-id-missing" audit row) so the audit pipeline never has silent
|
||||
holes. Add a regression test in `CachedCallAttemptEmissionTests` capturing the chosen
|
||||
contract — the existing
|
||||
`Attempt_MessageIdNotAGuid_NoObserverNotification` test pins today's silent-skip; if
|
||||
the fix is "log + skip", that test should be updated to also assert the log emission;
|
||||
if the fix is "emit anyway", the test should be replaced.
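
A minimal sketch of the "log + skip" option follows. It rests on assumptions:
`Guid.TryParse` stands in for `TrackedOperationId.TryParse` (whose exact signature is not
shown here), and a plain `Action<string>` stands in for the `_logger.LogWarning` call so the
example stays self-contained. `ObserverGuardSketch` is a hypothetical name.

```csharp
using System;

public static class ObserverGuardSketch
{
    // Returns true when the id can be tracked; otherwise emits a warning naming the
    // offending id so the skipped audit emission is observable instead of silent.
    public static bool TryResolveTrackedId(string messageId, Action<string> warn, out Guid trackedId)
    {
        if (Guid.TryParse(messageId, out trackedId))
        {
            return true;
        }

        warn($"Cached-call observer skipped: message id '{messageId}' is not a " +
             "TrackedOperationId, so no audit telemetry will be emitted for it.");
        return false;
    }
}
```

A caller would return early on `false`, exactly as today, but the warning gives a
misconfigured enqueue path something to show up in logs and in the updated regression test.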
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### StoreAndForward-023 — `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/ServiceCollectionExtensions.cs:43`–`:53`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:99`, `:524` |
|
||||
|
||||
**Description**
|
||||
|
||||
`AddStoreAndForward`'s service-collection factory resolves the optional
|
||||
`IStoreAndForwardSiteContext` and falls back to `string.Empty` when not registered:
|
||||
|
||||
```csharp
var siteContext = sp.GetService<IStoreAndForwardSiteContext>();
var siteId = siteContext?.SiteId ?? string.Empty;
return new StoreAndForwardService(storage, options, logger, replication,
    cachedCallObserver, siteId);
```
|
||||
|
||||
The constructor's parameter is even defaulted to `""`. The empty-string `siteId` flows
|
||||
straight into every emitted `CachedCallAttemptContext.SourceSite`, which the central
|
||||
audit pipeline uses as part of the `(SourceSite, TrackedOperationId)` correlation key.
|
||||
A host that registers an `ICachedCallLifecycleObserver` (the audit observer wired by
|
||||
`AddAuditLog`) but forgets to register `IStoreAndForwardSiteContext` will produce a
|
||||
stream of telemetry rows with `SourceSite = ""` — the central audit mirror cannot
|
||||
distinguish them by site, and the central-site routing of
|
||||
`RetryParkedOperation`/`DiscardParkedOperation` commands keyed on `SourceSite` will
|
||||
fail to find the owning site.
|
||||
|
||||
The Host's `IStoreAndForwardSiteContext` adapter and the audit observer registration
|
||||
are wired in lock-step, so the current configuration is correct, but the silent
|
||||
empty-string fallback is a contract hazard for future hosts (CLI test harness, second
|
||||
site cluster topology, etc.) and for tests that wire one without the other.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Make the contract explicit: when `cachedCallObserver` is non-null, require
|
||||
`IStoreAndForwardSiteContext` to be registered — throw an `InvalidOperationException`
|
||||
with a clear "Audit observer registered without a site context — register
|
||||
IStoreAndForwardSiteContext" message at construction time. When the audit observer is
|
||||
absent (no `AddAuditLog`), keep the empty-string default since `_siteId` is unused.
|
||||
Alternatively, change `siteId` from a parameter to a `Func<string>` resolved lazily
|
||||
from the service provider so a late-registered context still takes effect.
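
The fail-fast variant could be sketched as below. This is an assumption-laden illustration:
`SiteIdResolutionSketch.Resolve` is a hypothetical helper that the service-collection
factory might call, with `observerRegistered` standing for "`cachedCallObserver` is
non-null".

```csharp
using System;

public static class SiteIdResolutionSketch
{
    // siteId is the value read from the optional site context (null when none is registered).
    public static string Resolve(string? siteId, bool observerRegistered)
    {
        if (!observerRegistered)
        {
            // No audit observer: the site id is unused, so the empty default is harmless.
            return siteId ?? string.Empty;
        }

        if (string.IsNullOrWhiteSpace(siteId))
        {
            throw new InvalidOperationException(
                "Audit observer registered without a site context; " +
                "register IStoreAndForwardSiteContext.");
        }

        return siteId;
    }
}
```

Failing at construction time turns a silent correlation bug into a startup error, which is
usually the cheaper place to find a missing registration.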
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### StoreAndForward-024 — `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:122`–`:127`, `:136`–`:143`, `:303`–`:329` |
|
||||
|
||||
**Description**
|
||||
|
||||
`StartAsync` arms `_retryTimer` with `_ => _ = RetryPendingMessagesAsync()` (line 123).
|
||||
The `_ =` discards the returned `Task`, so when the timer fires the sweep runs **fire
|
||||
and forget** on a thread-pool thread. `StopAsync` (lines 136–143) disposes the timer:
|
||||
|
||||
```csharp
if (_retryTimer != null)
{
    await _retryTimer.DisposeAsync();
    _retryTimer = null;
}
```
|
||||
|
||||
`Timer.DisposeAsync()` returns once any in-flight timer **callback** has completed —
|
||||
but the timer callback in this service is a one-line `_ = RetryPendingMessagesAsync()`
|
||||
that synchronously returns immediately and leaves the actual sweep running on the
|
||||
thread pool. So `Timer.DisposeAsync` does not wait for the sweep; only for the
|
||||
synchronous `_ = ...` discarding step. `StopAsync` returns while a sweep is potentially
|
||||
still running, touching `_storage` (which the host will dispose), `_replication`
|
||||
(which the host will tear down), and `_cachedCallObserver` (whose downstream gRPC
|
||||
channel the host will shut down).
|
||||
|
||||
The host shutdown sequence (`AkkaHostedService`) tears down the actor system and the
|
||||
DI container after this service's `StopAsync` completes — meaning a sweep that runs
|
||||
past `StopAsync` can call into disposed `SqliteConnection`s (yielding
|
||||
`ObjectDisposedException`, caught by the sweep's outer `try/catch` as a log) or, more
|
||||
seriously, push a replication operation into a half-disposed Akka actor pipeline and
|
||||
trigger noisy dead-letter warnings during a clean shutdown.
|
||||
|
||||
The race window is small (the sweep typically finishes in <100 ms in tests) but it is
|
||||
real, particularly when shutting down a site under load with a non-empty buffer.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Track in-flight sweep tasks and `await` them in `StopAsync`:
|
||||
|
||||
```csharp
private Task? _currentSweep;

public async Task StopAsync()
{
    if (_retryTimer != null)
    {
        await _retryTimer.DisposeAsync();
        _retryTimer = null;
    }
    if (_currentSweep is { } sweep)
    {
        try { await sweep; } catch { /* logged inside RetryPendingMessagesAsync */ }
    }
}
```
|
||||
|
||||
Change the timer callback to:
|
||||
|
||||
```csharp
_retryTimer = new Timer(_ => _currentSweep = RetryPendingMessagesAsync(), ...);
```
|
||||
|
||||
Add a `CancellationTokenSource` so a long sweep can be cooperatively aborted on stop;
|
||||
plumb the cancellation token into `_storage` / `_replication` / `_cachedCallObserver`
|
||||
calls. Add a regression test in `StoreAndForwardServiceTests` that calls `StopAsync`
|
||||
mid-sweep and asserts no further storage activity occurs after `StopAsync` returns.
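
Putting the tracked task and the cancellation token together, a self-contained sketch
might look like this. `SweepRunnerSketch` is a hypothetical class written for illustration;
it is not the service's actual structure, and the `sweep` delegate stands in for
`RetryPendingMessagesAsync` with a token plumbed through.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class SweepRunnerSketch
{
    private readonly Func<CancellationToken, Task> _sweep;
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();
    private readonly object _gate = new object();
    private Timer? _timer;
    private Task _current = Task.CompletedTask;

    public SweepRunnerSketch(Func<CancellationToken, Task> sweep) => _sweep = sweep;

    public void Start(TimeSpan interval) =>
        _timer = new Timer(_ => Tick(), null, interval, interval);

    private void Tick()
    {
        lock (_gate)
        {
            // Skip the tick if we are stopping or the previous sweep is still running,
            // so _current always refers to the only sweep in flight.
            if (_cts.IsCancellationRequested || !_current.IsCompleted) return;
            _current = Task.Run(() => RunAsync());
        }
    }

    private async Task RunAsync()
    {
        try { await _sweep(_cts.Token); }
        catch (OperationCanceledException) { /* expected on stop */ }
        catch (Exception) { /* a real service would log here */ }
    }

    public async Task StopAsync()
    {
        if (_timer != null)
        {
            await _timer.DisposeAsync(); // waits for timer callbacks, not for the sweep
            _timer = null;
        }

        _cts.Cancel();                   // ask a long sweep to stop cooperatively

        Task inFlight;
        lock (_gate) { inFlight = _current; }
        await inFlight;                  // now wait for the sweep itself
    }
}
```

Skipping overlapping ticks avoids the case where a second tick overwrites the tracked task
while the first sweep is still running, which the single-field version would not catch.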
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.TemplateEngine` |
|
||||
| Design doc | `docs/requirements/Component-TemplateEngine.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 6 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -48,8 +48,49 @@ Both are limited-impact (nested compositions are the less common case and there
|
||||
is design-time visibility) but represent genuine drift from the recursive-nesting
|
||||
design promise.
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
Re-reviewed the whole module against all ten checklist categories at commit
|
||||
`1eb6e97`. All sixteen prior findings remain closed. Six new findings surfaced,
|
||||
clustered in three themes:
|
||||
|
||||
1. **Revision-hash / diff coverage gaps**: `RevisionHashService` and
   `DiffService` both omit `Attributes.Description`, `Alarms.Description`, and
   the entire `Connections` map. A change that only edits an attribute/alarm
   description, or a data-connection endpoint, will deploy a new flattened
   configuration but be invisible to staleness detection and the diff view,
   the very gap the revision hash was introduced to close (TemplateEngine-017,
   TemplateEngine-018). Severity Medium/High.

2. **TemplateEngine-013 fix only partially applied**: the `0`-as-no-parent
   sentinel was removed from `CycleDetector` but
   `TemplateResolver.BuildInheritanceChain` still uses `currentId != 0` /
   `ParentTemplateId ?? 0`. A template with a real Id of 0 is treated as
   "no template" and silently excluded from its own inheritance chain, so every
   flatten/resolve through that template loses its members. The fix from
   `adb5e75` did not propagate into the resolver (TemplateEngine-019).
   Severity Medium.

3. **Audit log integrity / drift**: every `Create` audit entry in
   `TemplateService` and `SharedScriptService` is written with `EntityId = "0"`
   *before* `SaveChangesAsync` populates the real key, so the audit trail loses
   the link back to the created row (TemplateEngine-020); `MoveTemplateAsync`
   never validates folder-acyclicity / sibling-name-uniqueness even though
   `TemplateFolderService.MoveFolderAsync` does (TemplateEngine-021); and the
   advertised `IS NOT_locked & not_LockedInDerived & not_IsInherited`
   self-reference loop is intact, but `LockEnforcer.ValidateLockChange` permits
   downgrading a `LockedInDerived` flag on a base template: there is no
   equivalent of the once-locked-stays-locked rule for the `LockedInDerived`
   flag (TemplateEngine-022). Severity Low/Medium.
|
||||
|
||||
Themes: hash/diff drift from the deployment payload, asymmetric application of
|
||||
the duplicate-Id / null-sentinel fix from the last batch, and audit-write
|
||||
ordering inconsistency between `TemplateService` (logs then saves) and
|
||||
`InstanceService` (saves then logs).
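
For the null-sentinel theme (TemplateEngine-019), the difference between a `0` sentinel and
a nullable parent can be shown with a small sketch. It assumes a simplified model in which
templates are reduced to an `id -> parentId` map; `InheritanceChainSketch.Build` is a
hypothetical helper, not the real `TemplateResolver.BuildInheritanceChain`.

```csharp
using System.Collections.Generic;

public static class InheritanceChainSketch
{
    // Walks from startId up to the root. "No parent" is represented by null, so a
    // template whose real Id is 0 is a normal member of the chain.
    public static List<int> Build(int startId, IReadOnlyDictionary<int, int?> parentOf)
    {
        var chain = new List<int>();
        var seen = new HashSet<int>();   // guards against cycles in malformed data
        int? current = startId;

        while (current is int id && parentOf.ContainsKey(id) && seen.Add(id))
        {
            chain.Add(id);
            current = parentOf[id];
        }

        return chain;
    }
}
```

With a `currentId != 0` loop condition the same walk would stop before adding template 0,
which is the member-loss symptom described in item 2.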
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
_Re-review (2026-05-17, `39d737e`):_
|
||||
|
||||
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Prior bugs (001–005, 013) all resolved and verified. Re-review 2026-05-17 found two new nested-composition defects: rename does not cascade (TemplateEngine-015), composed-script `ParentPath` always empty (TemplateEngine-016). |
|
||||
@@ -63,6 +104,21 @@ design promise.
|
||||
| 9 | Testing coverage | ✓ | Tests exist for every file, but the dead/placeholder paths (TemplateEngine-004, 005) and deep nesting (TemplateEngine-001) are not exercised. |
| 10 | Documentation & comments | ✓ | Mostly accurate; a misleading converter comment (TemplateEngine-011) and a stale enum/doc mismatch (TemplateEngine-012). |
|
||||
|
||||
_Re-review (2026-05-28, `1eb6e97`):_
|
||||
|
||||
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | New: `TemplateResolver.BuildInheritanceChain` still uses the `0`-as-no-parent sentinel that was removed from `CycleDetector` in `adb5e75` (TemplateEngine-019). `TemplateService.MoveTemplateAsync` performs no folder-acyclicity or sibling-name-uniqueness check (TemplateEngine-021). |
| 2 | Akka.NET conventions | ✓ | No actors. `AddTemplateEngineActors` is still an empty placeholder. Nothing to assess. |
| 3 | Concurrency & thread safety | ✓ | Services remain stateless, scoped per request. No new findings. |
| 4 | Error handling & resilience | ✓ | `Result<T>` used consistently. `MoveTemplateAsync` is missing target-folder validation found elsewhere; see TemplateEngine-021. |
| 5 | Security | ✓ | No new findings. Forbidden-API limitations still tracked under the closed TemplateEngine-006 (resolved as advisory). |
| 6 | Performance & resource management | ✓ | `MergeHiLoConfig` / `PrefixTriggerAttribute` allocate a `MemoryStream` + `Utf8JsonWriter` + `Encoding.UTF8.GetString` per call; fine for the per-flatten frequency, no finding. No new resource leaks. |
| 7 | Design-document adherence | ✓ | New drift: `RevisionHashService` and `DiffService` both omit `Description` fields and the `Connections` map from the deployable payload (TemplateEngine-017, TemplateEngine-018), so the revision hash and diff do not reflect every committed deployment input. |
| 8 | Code organization & conventions | ✓ | Audit-write ordering asymmetric: `TemplateService.Create*` and `SharedScriptService.CreateSharedScriptAsync` log with `EntityId = "0"` before `SaveChangesAsync`, while `InstanceService.CreateInstanceAsync` saves first then logs with the real Id (TemplateEngine-020). |
| 9 | Testing coverage | ✓ | New finding paths exercised in part: `RevisionHashServiceTests` does not assert that Description / Connections changes change the hash; no test for `BuildInheritanceChain` with a real Id of 0; no test for `MoveTemplateAsync` rejecting a target folder. |
| 10 | Documentation & comments | ✓ | New: `LockEnforcer.ValidateLockChange` is documented as enforcing the once-locked-stays-locked rule but has no equivalent for `LockedInDerived` (TemplateEngine-022). |
|
||||
|
||||
## Findings
|
||||
|
||||
### TemplateEngine-001 — Deeply nested composed members are dropped during flattening
|
||||
@@ -780,3 +836,313 @@ passes the enclosing module's `prefix` — and the `ScriptScope` now sets
|
||||
`SelfPath = "Outer.Inner"` pairs with `ParentPath = "Outer"` and `Parent.X`
|
||||
resolves against the real parent module. Regression test:
|
||||
`Flatten_NestedComposedScript_ScopeCarriesCorrectParentPath`.
|
||||
|
||||
### TemplateEngine-017 — Revision hash and diff both ignore `Description` and `Connections`, defeating staleness detection for real deployment changes
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Flattening/RevisionHashService.cs:128`, `src/ScadaLink.TemplateEngine/Flattening/RevisionHashService.cs:156`, `src/ScadaLink.TemplateEngine/Flattening/RevisionHashService.cs:42`, `src/ScadaLink.TemplateEngine/Flattening/DiffService.cs:110`, `src/ScadaLink.TemplateEngine/Flattening/DiffService.cs:118` |
|
||||
|
||||
**Description**
|
||||
|
||||
The design states the revision hash is "computed from the resolved content" and
|
||||
backs both staleness detection and diff correlation. The `Hashable*` records,
|
||||
however, omit fields that are part of the deployed `FlattenedConfiguration`:
|
||||
|
||||
- `HashableAttribute` skips `ResolvedAttribute.Description` and the resolved
|
||||
connection name/protocol (`BoundDataConnectionName`/`BoundDataConnectionProtocol`).
|
||||
- `HashableAlarm` skips `ResolvedAlarm.Description`.
|
||||
- The top-level `HashableConfiguration` skips the entire `Connections` map —
|
||||
the `ConnectionConfig` per connection name carries the protocol, the primary
|
||||
endpoint JSON, the backup endpoint JSON, and the failover retry count, all
|
||||
of which travel in the deployment package.
|
||||
|
||||
The same gaps exist in `DiffService.AttributesEqual` and `AlarmsEqual`, and there
is no entry for `Connections` at all. Concrete consequences:
|
||||
|
||||
1. A Design user edits an attribute's `Description` (an authoring-time
|
||||
concern) → the flattened payload changes → no hash change, no diff entry.
|
||||
2. A Deployment user edits the primary endpoint JSON of a data connection
|
||||
bound to an instance → the deployment package now ships a different
|
||||
`ConnectionConfig` → no hash change, no diff entry, so the staleness
|
||||
indicator says the instance is up to date and the diff view shows no
|
||||
pending change. The site quietly receives different connection
|
||||
credentials/host on the next redeploy.
|
||||
|
||||
The Description case is mostly cosmetic. The `Connections` case is a deployment
|
||||
correctness gap — staleness detection is the mechanism that tells operators
|
||||
"this instance has drifted from its template and needs redeployment", and a
|
||||
connection-endpoint change is exactly the kind of drift it must catch.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add `Description` to `HashableAttribute` and `HashableAlarm` (alphabetically
|
||||
placed, per the determinism contract) and to `AttributesEqual` / `AlarmsEqual`.
|
||||
Add a `HashableConnections : SortedDictionary<string, HashableConnection>`
|
||||
field (or equivalent) to `HashableConfiguration` that includes Protocol,
|
||||
ConfigurationJson, BackupConfigurationJson, and FailoverRetryCount, and mirror
|
||||
it in `DiffService`. Add tests:
|
||||
`Hash_DescriptionEditChangesHash`,
|
||||
`Hash_ConnectionEndpointEditChangesHash`,
|
||||
`Diff_ConnectionEndpointEdit_ProducesEntry`.
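
A minimal sketch of the connection half of this recommendation, under stated assumptions: the `HashableConnection` record shape and the `ConnectionHashSketch.ComputeHash` helper are hypothetical, and the JSON-plus-SHA-256 encoding is illustrative rather than taken from the real `TemplateEngine` code. It shows only that sorting by connection name keeps the hash independent of insertion order while an endpoint edit still changes it.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

// Hypothetical record: properties are declared alphabetically so the
// serialized form has a stable field order.
public sealed record HashableConnection(
    string BackupConfigurationJson,
    string ConfigurationJson,
    int FailoverRetryCount,
    string Protocol);

public static class ConnectionHashSketch
{
    // Hashes the connections in ordinal key order, so the result does not
    // depend on the insertion order of the input dictionary.
    public static string ComputeHash(IReadOnlyDictionary<string, HashableConnection> connections)
    {
        var sorted = new SortedDictionary<string, HashableConnection>(StringComparer.Ordinal);
        foreach (var pair in connections)
            sorted[pair.Key] = pair.Value;

        var json = JsonSerializer.Serialize(sorted);
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(json)));
    }
}
```

In a real fix this map would be one field of `HashableConfiguration` rather than a separate hash, so that a connection edit flows into the existing revision hash.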
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### TemplateEngine-018 — `DiffService` reports no entries for added/removed/changed connections
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Flattening/DiffService.cs:19` |
|
||||
|
||||
**Description**
|
||||
|
||||
`DiffService.ComputeDiff` returns a `ConfigurationDiff` with `AttributeChanges`,
|
||||
`AlarmChanges`, and `ScriptChanges` only. The `FlattenedConfiguration` it diffs
|
||||
also carries a `Connections` dictionary (per-attribute connection bindings
|
||||
collapsed to one connection-config-per-name during flattening — see
|
||||
`FlatteningService:99-118`), and this dictionary materially affects what the
|
||||
site receives at deploy time. A connection added to or removed from the
|
||||
flattened configuration (e.g., an instance gains its first data-sourced
|
||||
attribute, or its last binding is cleared) produces no diff entry. Operators
|
||||
inspecting the diff view to decide whether to redeploy see "no changes" when
|
||||
the site will in fact receive a structurally different deployment package.
|
||||
|
||||
This is the diff-view counterpart of TemplateEngine-017's hash gap; they are
|
||||
separable because the `ConfigurationDiff` data shape would have to be extended
|
||||
even after the hash is fixed.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add `ConnectionChanges` (or equivalent) to `ConfigurationDiff` in `Commons`
|
||||
(`Types/Flattening/ConfigurationDiff.cs`), populate it in
|
||||
`DiffService.ComputeDiff` via a new `ComputeEntityDiff` over
|
||||
`Connections.Keys`, and add a `ConnectionsEqual` helper. Update the Central UI
|
||||
diff display to render the new section. Add regression tests:
|
||||
`Diff_NewConnectionBinding_ReportedAsAdded`,
|
||||
`Diff_ClearedBinding_ReportedAsRemoved`,
|
||||
`Diff_EndpointEdit_ReportedAsChanged`.
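
The added/removed/changed classification over `Connections.Keys` could look roughly like the sketch below. `ChangeKind`, `ConnectionChange`, and `ConnectionDiffSketch.Diff` are hypothetical names, and the equality check is passed in as a delegate because the real `ConnectionsEqual` helper does not exist yet.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum ChangeKind { Added, Removed, Changed }

// Hypothetical diff entry; the real ConfigurationDiff shape may differ.
public sealed record ConnectionChange(string Name, ChangeKind Kind);

public static class ConnectionDiffSketch
{
    // Walks the union of connection names in ordinal order and classifies
    // each name as added, removed, or changed.
    public static List<ConnectionChange> Diff<T>(
        IReadOnlyDictionary<string, T> oldConnections,
        IReadOnlyDictionary<string, T> newConnections,
        Func<T, T, bool> connectionsEqual)
    {
        var changes = new List<ConnectionChange>();
        var names = oldConnections.Keys
            .Union(newConnections.Keys)
            .OrderBy(n => n, StringComparer.Ordinal);

        foreach (var name in names)
        {
            if (!oldConnections.TryGetValue(name, out var before))
                changes.Add(new ConnectionChange(name, ChangeKind.Added));
            else if (!newConnections.TryGetValue(name, out var after))
                changes.Add(new ConnectionChange(name, ChangeKind.Removed));
            else if (!connectionsEqual(before, after))
                changes.Add(new ConnectionChange(name, ChangeKind.Changed));
        }
        return changes;
    }
}
```

Unchanged connections produce no entry, which matches how the existing attribute, alarm, and script sections are described.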
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### TemplateEngine-019 — `TemplateResolver.BuildInheritanceChain` still uses the `0`-as-no-parent sentinel that was removed from `CycleDetector`
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/TemplateResolver.cs:117`, `src/ScadaLink.TemplateEngine/TemplateResolver.cs:123` |
|
||||
|
||||
**Description**
|
||||
|
||||
TemplateEngine-013 removed the `0`-as-no-parent sentinel from `CycleDetector`
|
||||
(`adb5e75`) — `ParentTemplateId` is `int?`, so a missing value means "no
|
||||
parent" and a real Id of 0 must walk the chain like any other node. The fix
|
||||
did not propagate into `TemplateResolver.BuildInheritanceChain`:
|
||||
|
||||
```csharp
var currentId = templateId;
// ...
while (currentId != 0 && lookup.TryGetValue(currentId, out var current))
{
    // ...
    currentId = current.ParentTemplateId ?? 0;
}
```
|
||||
|
||||
The seeded `currentId = templateId` is treated as "no template" when
|
||||
`templateId == 0`, so `ResolveAllMembers(0, ...)` returns an empty chain even
|
||||
when a template with Id 0 exists. On the walk up, `current.ParentTemplateId ?? 0`
followed by the `currentId != 0` test makes a real parent with Id 0
indistinguishable from "no parent", silently truncating the chain. The chain is the input to every
|
||||
flatten/resolve/validate path through `FlatteningService`, `TemplateService
|
||||
.ResolveTemplateMembersAsync`, and `InstanceService.SetAlarmOverrideAsync` — a
|
||||
template with a real Id of 0 (which EF identity sequences avoid in production
|
||||
but which any in-memory test or import-staging path can produce) silently
|
||||
loses its inheritance contribution. The duplicate-tolerant `BuildLookup` added
|
||||
in `adb5e75` is used here, so the test gap is one half of the same fix.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Switch the walk to the `int?` form, mirroring `CycleDetector
|
||||
.DetectInheritanceCycle`:
|
||||
|
||||
```csharp
int? currentId = templateId;
while (currentId.HasValue && lookup.TryGetValue(currentId.Value, out var current))
{
    if (!visited.Add(currentId.Value)) break;
    chain.Add(current);
    currentId = current.ParentTemplateId;
}
```
|
||||
|
||||
Add regression test
|
||||
`TemplateResolverTests.BuildInheritanceChain_RealIdZero_StillResolves`.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### TemplateEngine-020 — `Create*` audit entries are written with `EntityId = "0"` before `SaveChangesAsync` populates the real key
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:77`, `src/ScadaLink.TemplateEngine/TemplateService.cs:256`, `src/ScadaLink.TemplateEngine/TemplateService.cs:407`, `src/ScadaLink.TemplateEngine/TemplateService.cs:556`, `src/ScadaLink.TemplateEngine/TemplateService.cs:734`, `src/ScadaLink.TemplateEngine/SharedScriptService.cs:71` |
|
||||
|
||||
**Description**
|
||||
|
||||
`IAuditService.LogAsync` takes a `string entityId` argument and `TemplateService
|
||||
.CreateTemplateAsync`, `AddAttributeAsync`, `AddAlarmAsync`, `AddScriptAsync`,
|
||||
`AddCompositionAsync`, and `SharedScriptService.CreateSharedScriptAsync` all
|
||||
hard-code it to `"0"`:
|
||||
|
||||
```csharp
await _repository.AddTemplateAsync(template, cancellationToken);
await _auditService.LogAsync(user, "Create", "Template", "0", name, template, cancellationToken);
await _repository.SaveChangesAsync(cancellationToken);
```
|
||||
|
||||
EF Core populates `template.Id` only when `SaveChangesAsync` runs, but the
|
||||
audit row is written and queued in the change tracker *before* the save with a
|
||||
literal `"0"`. The single save then commits the audit row with `EntityId =
|
||||
"0"` and the new template/attribute/alarm/script with its real Id. Every
|
||||
"Create" entry in the audit trail therefore loses the link back to the row it
|
||||
describes: searching the audit log by the entity id of a created row finds
nothing, and only the subsequent Update/Delete rows are findable.
|
||||
|
||||
Note that `InstanceService.CreateInstanceAsync` uses the opposite order
|
||||
(`AddInstanceAsync` → `SaveChangesAsync` → `LogAsync(... instance.Id ...)`,
|
||||
lines 90–94) and gets the real Id. The asymmetry is the smoking gun: half the
|
||||
module audits Create correctly, half does not.
|
||||
|
||||
A separate consideration: writing the audit row in the same `SaveChangesAsync`
as the entity gives transactional all-or-nothing, but the generated Id is not
available until that save has run. The fix is to save the entity first, then
log, then save the audit row (two-phase, like `InstanceService` and
`TemplateService.DeleteTemplateAsync` already do).
|
||||
|
||||
**Recommendation**
|
||||
|
||||
For every `Create*` path in `TemplateService` and `SharedScriptService`, swap
|
||||
the order to `AddXxxAsync` → `SaveChangesAsync` → `LogAsync(... newId
|
||||
.ToString() ...)` → `SaveChangesAsync`, matching `InstanceService
|
||||
.CreateInstanceAsync` and `TemplateService.DeleteTemplateAsync`. Add regression
|
||||
tests that assert the `EntityId` recorded on the audit row matches the
|
||||
created row's Id.
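
The reordering can be illustrated with an in-memory stand-in. `FakeStore`, `Template`, and `CreateAuditSketch` below are hypothetical and synchronous for brevity; the only behaviour borrowed from the finding is that the key is assigned when the save runs, not when the entity is added.

```csharp
using System;
using System.Collections.Generic;

public sealed class Template
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

// In-memory stand-in for the repository: like a database-generated key,
// the Id is assigned only when SaveChanges runs.
public sealed class FakeStore
{
    private readonly List<Template> _pending = new();
    private int _nextId = 1;

    public List<(string Action, string EntityId)> AuditRows { get; } = new();

    public void Add(Template template) => _pending.Add(template);

    public void SaveChanges()
    {
        foreach (var template in _pending) template.Id = _nextId++;
        _pending.Clear();
    }

    public void LogAudit(string action, string entityId) => AuditRows.Add((action, entityId));
}

public static class CreateAuditSketch
{
    public static Template CreateTemplate(FakeStore store, string name)
    {
        var template = new Template { Name = name };
        store.Add(template);
        store.SaveChanges();                               // key is generated here
        store.LogAudit("Create", template.Id.ToString());  // audit row sees the real Id
        store.SaveChanges();                               // second save for the audit row
        return template;
    }
}
```

One trade-off to keep in mind: two separate saves are no longer atomic by themselves, so a real implementation would likely need an explicit transaction around both if all-or-nothing behaviour matters.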
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### TemplateEngine-021 — `MoveTemplateAsync` skips folder cycle and sibling-name-collision validation
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:173` |
|
||||
|
||||
**Description**
|
||||
|
||||
`TemplateService.MoveTemplateAsync` validates only that the target folder
|
||||
exists, then unconditionally assigns `template.FolderId = newFolderId`.
|
||||
`TemplateFolderService.MoveFolderAsync` (the sibling for folder-to-folder
|
||||
moves) by contrast validates:
|
||||
|
||||
- the target folder is not the folder being moved (self-parent);
- the target folder is not a descendant of the folder being moved (cycle);
- no sibling at the destination has the same name (case-insensitive).
|
||||
|
||||
The first two are folder-graph concerns and don't apply to template moves, but
|
||||
the third does — two templates with the same name in the same folder is the
|
||||
authoring-time scenario the design's "naming collisions are design-time
|
||||
errors" rule was meant to cover. Today, two templates named "Pump" can be
|
||||
moved into the same folder with no error, breaking any UI that locates a
|
||||
template by `(FolderId, Name)` and producing a worse user experience than the
|
||||
folder-rename path, which does check.
|
||||
|
||||
Separately, the design doc states folders carry "no semantic meaning for
|
||||
template resolution, flattening, validation, or inheritance" — so this is
|
||||
strictly a UI-organization invariant, but it is documented elsewhere
|
||||
(`TemplateFolderService` enforces it for folders) and the asymmetry is
|
||||
surprising.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
After resolving the target folder, run a sibling-name-uniqueness check: reject
the move if another template with `FolderId == newFolderId` has the same `Name`
(case-insensitive), mirroring `TemplateFolderService.MoveFolderAsync` lines
130–142. Add a regression test `MoveTemplate_NameCollisionAtDestination_Fails`.
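
A sketch of the uniqueness predicate, assuming a hypothetical `TemplateStub` record and a nullable `FolderId` (the real entity shape and the exact comparison used by `MoveFolderAsync` are not shown in this review). The template being moved is excluded so that moving it into its current folder does not collide with itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical projection of a template: just the fields the check needs.
public sealed record TemplateStub(int Id, int? FolderId, string Name);

public static class MoveTemplateSketch
{
    // True when another template in the destination folder already has the
    // same name, compared case-insensitively.
    public static bool HasNameCollision(
        IEnumerable<TemplateStub> allTemplates,
        TemplateStub moving,
        int? newFolderId)
    {
        return allTemplates.Any(t =>
            t.Id != moving.Id &&
            t.FolderId == newFolderId &&
            string.Equals(t.Name, moving.Name, StringComparison.OrdinalIgnoreCase));
    }
}
```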
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### TemplateEngine-022 — `LockEnforcer.ValidateLockChange` enforces "once-locked-stays-locked" for `IsLocked` but not for `LockedInDerived`
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/LockEnforcer.cs:109`, `src/ScadaLink.TemplateEngine/TemplateService.cs:323`, `src/ScadaLink.TemplateEngine/TemplateService.cs:476`, `src/ScadaLink.TemplateEngine/TemplateService.cs:623` |
|
||||
|
||||
**Description**
|
||||
|
||||
`LockEnforcer.ValidateLockChange` documents and enforces the rule that an
|
||||
already-locked member cannot be unlocked downstream (`originalIsLocked &&
|
||||
!proposedIsLocked` → error). The class-level XML doc describes locking as
|
||||
covering both fields:
|
||||
|
||||
> Locking rules: ... Once locked, a member stays locked — it cannot be
|
||||
> unlocked downstream.
|
||||
|
||||
But the `LockedInDerived` field has no equivalent guard. `UpdateAttributeAsync`,
|
||||
`UpdateAlarmAsync`, and `UpdateScriptAsync` all let the proposed
|
||||
`LockedInDerived` flag flip in either direction on a base-template member.
|
||||
This is a subtle correctness gap with two failure modes:
|
||||
|
||||
1. A base template originally marked an attribute `LockedInDerived = true` to
   protect derived templates from overriding it. A subsequent edit can clear
   the flag while leaving existing derived-template overrides intact; those
   overrides become legal retroactively even though the design intent was
   that they were always blocked.
2. The XML doc on `LockEnforcer` and the class summary on `TemplateService`
   describe a one-way ratchet that the code does not implement for one of the
   two lock flags. A reader of the documentation cannot tell which rules are
   actually enforced.
|
||||
|
||||
The defect is "Low" because the design doc for the Template Engine itself
|
||||
does not explicitly call out a once-locked-stays-locked rule for
|
||||
`LockedInDerived`. The most likely fix is therefore to (a) correct the
|
||||
`LockEnforcer` XML doc to describe only `IsLocked`, or (b) add the equivalent
|
||||
guard for `LockedInDerived` and a regression test. The choice is a design
|
||||
question — pick one and align the code and docs.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Decide the policy. If `LockedInDerived` is intended to be once-set-stays-set
|
||||
like `IsLocked`, extend `ValidateLockChange` (or add a sibling
|
||||
`ValidateLockedInDerivedChange`) and reject the downgrade in
|
||||
`UpdateAttributeAsync` / `UpdateAlarmAsync` / `UpdateScriptAsync`. If it is
|
||||
intended to be mutable, update the `LockEnforcer` summary to scope the rule
|
||||
to `IsLocked` only. Either way, add a test pinning the chosen behaviour.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
|
||||
@@ -0,0 +1,456 @@
|
||||
# Code Review — Transport
|
||||
|
||||
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.Transport` |
| Design doc | `docs/requirements/Component-Transport.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 12 |
|
||||
|
||||
## Summary
|
||||
|
||||
The Transport module is structurally clean, follows the design doc's pipeline
|
||||
layout (Encryption → Serialization → Export / Import), and has solid lower-tier
|
||||
coverage (encryptor round-trips, manifest validator, dependency resolver,
|
||||
session store, diff engine). The big surface-area concerns cluster around two
|
||||
themes. First, the `Overwrite` resolution path is structurally incomplete: it
|
||||
updates only the parent entity's scalar fields (e.g. `Template.Description /
|
||||
FolderId`, `ExternalSystem.EndpointUrl / AuthType / AuthConfiguration`) and
|
||||
never replaces child collections (attributes, alarms, scripts, external-system
|
||||
methods), silently diverging from both the design doc's audit-row table and
|
||||
operator intent. Second, the 3-strike / per-IP unlock-rate-limit story declared
|
||||
in `TransportOptions` and the design doc isn't wired into the import service —
|
||||
the only counter is a local field on `TransportImport.razor.cs`, and
|
||||
`MaxUnlockAttemptsPerIpPerHour` is referenced nowhere. There are also some
|
||||
smaller integrity-and-resource issues (manifest fields outside `ContentHash`
|
||||
aren't bound to the encryption envelope, decrypted plaintext lives in the
|
||||
in-memory session for the full TTL on the failure path, and ZIP reads have no
|
||||
entry-count / per-entry decompression cap).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | Overwrite paths miss child sync (Transport-001, Transport-002); composition Overwrite intentionally clears (good). |
| 2 | Akka.NET conventions | Yes | No issues found: Transport is service-only, with no actors or messages. |
| 3 | Concurrency & thread safety | Yes | `IAuditCorrelationContext` mutation is documented as not thread-safe (Transport-009); the singleton `BundleSessionStore` with `ConcurrentDictionary` is fine. |
| 4 | Error handling & resilience | Yes | Rollback-failure path is well-considered, but failed sessions are never evicted (Transport-007). |
| 5 | Security | Yes | Unlock lockout and per-IP cap are not enforced server-side (Transport-003, Transport-004); manifest fields outside `ContentHash` are unauthenticated (Transport-005); zip-bomb / per-entry decompression cap is missing (Transport-006); secrets travel in plaintext in unencrypted bundles by design, with only a UI warning (acceptable per doc). |
| 6 | Performance & resource management | Yes | `BundleSession.DecryptedContent` is retained in memory for 30 min even on failure (Transport-007); `PreviewAsync` issues N+1 calls to `GetTemplateWithChildrenAsync` (Transport-008); `BundleSerializer.Pack` serializes content twice. |
| 7 | Design-document adherence | Yes | Overwrite not syncing children contradicts the design doc's audit row table (Transport-001); the per-IP-per-hour lockout in §11 is not implemented (Transport-004); the design says "bundles are not retained server-side after ApplyAsync commits", but failed bundles are retained until TTL (Transport-007). |
| 8 | Code organization & conventions | Yes | No major issues found: clean separation, POCO DTOs in `Serialization/`, scoped vs singleton service lifetimes appropriate. |
| 9 | Testing coverage | Yes | Critical gap: no Overwrite-with-modified-children test for Templates or ExternalSystems (Transport-010); no test exercising failed-bundle session retention or per-IP lockout. |
| 10 | Documentation & comments | Yes | XML comments are extensive and accurate; design doc has some staleness (Transport-011, Transport-012). |
|
||||
|
||||
## Findings
|
||||
|
||||
### Transport-001 — Template Overwrite never syncs attributes / alarms / scripts
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:844-851` |
|
||||
|
||||
**Description**
|
||||
|
||||
The `ResolutionAction.Overwrite` branch in `ApplyTemplatesAsync` only writes
|
||||
`Description` and `FolderId` on the existing template and calls
|
||||
`UpdateTemplateAsync(ex, …)`. The bundle DTO's `Attributes`, `Alarms`, and
|
||||
`Scripts` collections are never copied onto the existing entity, so an Overwrite
|
||||
of a template whose child collections changed silently leaves the target's
|
||||
existing children in place. `ResolveAlarmScriptLinksAsync` then runs against the
|
||||
unmodified existing alarms/scripts and does nothing useful for the Overwrite
|
||||
case. This contradicts the design doc's Configuration Audit Trail table
|
||||
("Template overwritten → `TemplateUpdated` + per-field rows
|
||||
(`TemplateAttributeAdded`, `TemplateScriptUpdated`, …)") and the operator's
|
||||
mental model — an Overwrite that produces no diff is a footgun. The only
|
||||
integration test (`ConflictResolutionTests.Overwrite_replaces_existing_template_description`)
|
||||
asserts only on `Description`, so the regression is not caught.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
For the Overwrite branch, replace the existing template's children to match the
|
||||
bundle DTO (delete-then-add or diff-and-merge), then re-run the alarm-script and
|
||||
composition rewire passes against the post-merge state. Emit the per-field audit
|
||||
rows the design doc enumerates. Add an integration test that overwrites a
|
||||
template whose Scripts / Attributes / Alarms differ.
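
For the diff-and-merge variant, the child collections can be reconciled by name before any repository calls are made. The sketch below is generic and hypothetical (`MergePlan<T>` and `ChildSyncSketch.Plan` are not part of the Transport module) and assumes child names are unique and compared ordinally; it only computes what to add, update, and delete.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical result type: what an Overwrite would have to apply.
public sealed record MergePlan<T>(
    List<T> ToAdd,
    List<(T Existing, T Incoming)> ToUpdate,
    List<T> ToDelete);

public static class ChildSyncSketch
{
    // Classifies children by name: only in the bundle (add), in both
    // (update), or only on the existing entity (delete).
    public static MergePlan<T> Plan<T>(
        IEnumerable<T> existing,
        IEnumerable<T> incoming,
        Func<T, string> nameOf)
    {
        var existingByName = existing.ToDictionary(nameOf, StringComparer.Ordinal);
        var incomingByName = incoming.ToDictionary(nameOf, StringComparer.Ordinal);

        var toAdd = incomingByName
            .Where(kv => !existingByName.ContainsKey(kv.Key))
            .Select(kv => kv.Value)
            .ToList();
        var toUpdate = incomingByName
            .Where(kv => existingByName.ContainsKey(kv.Key))
            .Select(kv => (existingByName[kv.Key], kv.Value))
            .ToList();
        var toDelete = existingByName
            .Where(kv => !incomingByName.ContainsKey(kv.Key))
            .Select(kv => kv.Value)
            .ToList();

        return new MergePlan<T>(toAdd, toUpdate, toDelete);
    }
}
```

Each bucket maps naturally onto one of the per-field audit rows the design doc lists, which is one reason to prefer a merge over delete-then-add.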
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Transport-002 — ExternalSystem Overwrite never syncs methods
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:1213-1221` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ApplyExternalSystemsAsync` Overwrite path writes `EndpointUrl`, `AuthType`, and
|
||||
`AuthConfiguration` on the existing `ExternalSystemDefinition` and calls
|
||||
`UpdateExternalSystemAsync`. The DTO's `Methods` collection is never written —
|
||||
any added, removed, or modified method on the incoming bundle silently does
|
||||
not land. Same shape of bug as Transport-001 but on a different entity. The
|
||||
design doc's audit-row table says
|
||||
"External system overwritten → `ExternalSystemDefinitionUpdated` + per-method
|
||||
rows", confirming methods are expected to round-trip.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Sync `Methods` on Overwrite via add / update / delete by name (mirroring the
|
||||
diff classification in `ArtifactDiff.CompareExternalSystem`) and emit the
|
||||
per-method audit rows. Add a test that overwrites an external system whose
|
||||
methods differ.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Transport-003 — Unlock lockout is enforced only client-side; server session is never marked Locked
|
||||
|
||||
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:184-203`, `src/ScadaLink.CentralUI/Components/Pages/Design/TransportImport.razor.cs:267-309`, `src/ScadaLink.Commons/Types/Transport/BundleSession.cs:14-16` |
|
||||
|
||||
**Description**
|
||||
|
||||
`BundleSession` exposes `FailedUnlockAttempts` and a `Locked` computed property,
|
||||
and `PreviewAsync` / `ApplyAsync` correctly refuse to proceed when
|
||||
`session.Locked == true`. But for an encrypted bundle, `LoadAsync` throws
|
||||
`CryptographicException` before any session is opened, so no session ever holds
|
||||
a non-zero `FailedUnlockAttempts`. The 3-strike counter lives only in the
|
||||
Blazor page's local `_failedUnlockAttempts` field; a second tab / circuit / CLI
|
||||
caller bypassing the UI can retry the same uploaded bytes indefinitely
|
||||
because the importer accepts a passphrase against a stream and runs PBKDF2 each
|
||||
call (600 000 iterations / call). The Locked invariant on `BundleSession` is
|
||||
effectively unreachable — the field is dead code.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Move the lockout into `IBundleImporter`. Two viable shapes:
|
||||
(a) open a session on the first `LoadAsync` call (skip the decryption step until
|
||||
a separate `UnlockAsync` is called) and increment / lock there;
|
||||
(b) keep a per-content-hash counter in the session store, scoped by bundle SHA,
|
||||
so retries against the same bundle bytes are throttled regardless of the UI
|
||||
client. Either way, emit `BundleImportUnlockFailed` from the service, not from
|
||||
the Razor page. Test that a second concurrent caller cannot side-step the
|
||||
lockout.
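
Option (b) might be sketched as a small tracker keyed by the SHA-256 of the uploaded bytes. `UnlockAttemptTracker` is a hypothetical type, the default of three attempts mirrors the 3-strike rule described above, and expiry of old entries is deliberately left out, so this is an illustration rather than a drop-in component.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;

// Hypothetical server-side counter, shared by every caller of the importer.
public sealed class UnlockAttemptTracker
{
    private readonly ConcurrentDictionary<string, int> _failures = new();
    private readonly int _maxAttempts;

    public UnlockAttemptTracker(int maxAttempts = 3) => _maxAttempts = maxAttempts;

    // The key is derived from the bundle bytes, so a second tab or a CLI
    // caller retrying the same upload hits the same counter.
    public static string KeyFor(byte[] bundleBytes) =>
        Convert.ToHexString(SHA256.HashData(bundleBytes));

    public bool IsLocked(string key) =>
        _failures.TryGetValue(key, out var count) && count >= _maxAttempts;

    // Records one failed unlock; returns true when this failure reaches the limit.
    public bool RecordFailure(string key) =>
        _failures.AddOrUpdate(key, 1, (_, count) => count + 1) >= _maxAttempts;

    // Clears the counter, for example after a successful unlock.
    public void Reset(string key) => _failures.TryRemove(key, out _);
}
```

The importer would check `IsLocked` before running key derivation, which also avoids paying the PBKDF2 cost for a bundle that is already locked out.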
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Transport-004 — `MaxUnlockAttemptsPerIpPerHour` option is declared but never enforced
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Transport/TransportOptions.cs:12`, `docs/requirements/Component-Transport.md` §11 |
|
||||
|
||||
**Description**
|
||||
|
||||
`TransportOptions.MaxUnlockAttemptsPerIpPerHour` defaults to 10 and is
|
||||
documented in the design doc (§11, "Failed-unlock rate limit: per-session
|
||||
3-strike lockout; per-IP-per-hour cap (default 10, configurable) to deter brute
|
||||
force against a stolen bundle"). A repo-wide grep finds zero readers of the
|
||||
field. There is no IP-keyed rate limiter, no `IHttpContextAccessor` in the
|
||||
importer, no middleware in Central UI guarding the import endpoints. The
|
||||
documented brute-force defence does not exist in code.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either implement the per-IP cap (e.g. via `Microsoft.AspNetCore.RateLimiting`
|
||||
on the `TransportImport` page and the `ManagementActor` import command path,
|
||||
keyed on remote-IP for the UI and on authenticated principal for the CLI), or
|
||||
drop the option and the design-doc paragraph if the project is intentionally
|
||||
deferring this. Don't leave a dead-letter option that promises a security
|
||||
control that isn't there.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Transport-005 — Manifest fields outside `ContentHash` are not bound to the encrypted payload
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Transport/Encryption/BundleSecretEncryptor.cs:31-49`, `src/ScadaLink.Transport/Serialization/ManifestValidator.cs:29-53` |
|
||||
|
||||
**Description**
|
||||
|
||||
AES-GCM is called with no Additional Authenticated Data (AAD). The `manifest`
|
||||
fields — `SourceEnvironment`, `ExportedBy`, `ScadaLinkVersion`, `Summary`,
|
||||
`Contents`, `CreatedAtUtc`, etc. — are plaintext and only the `ContentHash`
|
||||
field is checked against the content bytes. An attacker who obtains a bundle
|
||||
can edit any non-`ContentHash` manifest field (e.g. rewrite the
|
||||
`SourceEnvironment` displayed in the Step-4 typo-resistant confirmation gate,
|
||||
forge a more recent `CreatedAtUtc`, lie about `ExportedBy`) without breaking
|
||||
decryption. The Step-4 confirmation gate the design doc relies on
|
||||
("User types the source environment name to confirm — typo-resistant gate at
|
||||
the prod boundary") is therefore tamperable.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Pass the manifest's canonical bytes with `ContentHash` and `Encryption`
excluded, or their SHA-256, as the `associatedData` argument to
`AesGcm.Encrypt` / `AesGcm.Decrypt`. Any tampering with the manifest's other
fields then yields an authentication-tag mismatch on decrypt. For unencrypted
bundles, the same protection can be approximated by extending the hash domain
(compute a manifest-and-content hash, or sign the manifest, depending on how
far you want to go).
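
A sketch of binding the manifest to the ciphertext through associated data, assuming .NET 8 or later (for the `AesGcm(byte[], int)` constructor), a 12-byte nonce, and a 16-byte tag. `ManifestBoundEncryptionSketch` is hypothetical, and `manifestBytes` stands for whatever canonical encoding of the manifest is chosen; how the real `BundleSecretEncryptor` derives keys and lays out the envelope is not shown here.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ManifestBoundEncryptionSketch
{
    private const int NonceSize = 12; // bytes
    private const int TagSize = 16;   // bytes

    public static (byte[] Nonce, byte[] Ciphertext, byte[] Tag) Encrypt(
        byte[] key, byte[] plaintext, byte[] manifestBytes)
    {
        var nonce = RandomNumberGenerator.GetBytes(NonceSize);
        var ciphertext = new byte[plaintext.Length];
        var tag = new byte[TagSize];

        using var aes = new AesGcm(key, TagSize);
        // The manifest hash is authenticated but not encrypted.
        aes.Encrypt(nonce, plaintext, ciphertext, tag, SHA256.HashData(manifestBytes));
        return (nonce, ciphertext, tag);
    }

    // Throws a CryptographicException if the ciphertext, tag, or manifest
    // differs from what was passed to Encrypt.
    public static byte[] Decrypt(
        byte[] key, byte[] nonce, byte[] ciphertext, byte[] tag, byte[] manifestBytes)
    {
        var plaintext = new byte[ciphertext.Length];
        using var aes = new AesGcm(key, TagSize);
        aes.Decrypt(nonce, ciphertext, tag, plaintext, SHA256.HashData(manifestBytes));
        return plaintext;
    }
}
```

With this shape, an edited `SourceEnvironment` changes the associated data on the import side, so decryption fails before the confirmation step is ever shown.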
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Transport-006 — Bundle ZIP read has no per-entry size cap or entry-count cap (zip-bomb / decompression-bomb)
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Transport/Serialization/BundleSerializer.cs:121-156`, `src/ScadaLink.Transport/Import/BundleImporter.cs:132-143` |
|
||||
|
||||
**Description**
|
||||
|
||||
`LoadAsync` caps the raw bundle bytes at `MaxBundleSizeMb` (default 100 MB)
|
||||
before opening the ZIP. But `ReadContentBytes` calls `entry.Open()` and
|
||||
`CopyTo(MemoryStream)` with no per-entry size limit and no defence against
|
||||
compression ratios — a 100 MB DEFLATE-compressed bundle can decompress to
|
||||
gigabytes. There is also no cap on the number of entries iterated; only two
|
||||
known entries are read (`manifest.json` + `content.json`/`content.enc`), but
|
||||
`ReadContentBytes` does not validate that no extra entries exist or that the
|
||||
expected entry's `Length` is bounded. A malicious importer-with-RequireAdmin
|
||||
(or a stolen bundle delivered to an admin) can OOM the central node.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Cap each entry's decompressed length explicitly (compare `ZipArchiveEntry.Length`
|
||||
against a configurable max, or copy into a length-limited stream). Reject
|
||||
bundles whose entry list contains anything other than the known manifest +
|
||||
content entries. Consider also rejecting any compression ratio over ~50x as a
|
||||
defence-in-depth measure.
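
The per-entry cap could be implemented as a counted copy, roughly as below. `BoundedZipReadSketch.ReadEntry` and its `maxBytes` parameter are hypothetical; the declared `ZipArchiveEntry.Length` is used only as an early reject, because it comes from the archive itself, and the loop enforces the limit on the bytes actually decompressed.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class BoundedZipReadSketch
{
    // Reads one entry into memory, failing as soon as more than maxBytes
    // have been decompressed.
    public static byte[] ReadEntry(ZipArchiveEntry entry, long maxBytes)
    {
        // Declared size is metadata from the bundle: useful for a cheap
        // early reject, not sufficient as the only check.
        if (entry.Length > maxBytes)
            throw new InvalidDataException($"Entry '{entry.FullName}' declares more than {maxBytes} bytes.");

        using var source = entry.Open();
        using var buffer = new MemoryStream();
        var chunk = new byte[81920];
        long total = 0;
        int read;
        while ((read = source.Read(chunk, 0, chunk.Length)) > 0)
        {
            total += read;
            if (total > maxBytes)
                throw new InvalidDataException($"Entry '{entry.FullName}' exceeded {maxBytes} bytes while decompressing.");
            buffer.Write(chunk, 0, read);
        }
        return buffer.ToArray();
    }
}
```

An allow-list on entry names (manifest plus content only) would sit in front of this, before any entry is opened.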
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Transport-007 — Failed import sessions retain decrypted plaintext for the full 30-minute TTL
|
||||
|
||||
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:614-696`, `src/ScadaLink.Transport/Import/BundleSessionStore.cs:67-93` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ApplyAsync` calls `_sessionStore.Remove(sessionId)` only on the success path
|
||||
(line 614). The catch block re-throws without removing the session, so a failed
|
||||
apply leaves the `BundleSession` (with `DecryptedContent` up to ~100 MB) in the
|
||||
in-memory dictionary until the TTL elapses 30 min later (or `Get` lazily evicts
|
||||
on a separate lookup). Decrypted secrets — DB connection strings, SMTP
|
||||
credentials, external-system auth configs — sit in process memory for that
|
||||
window, accessible to anyone holding the session id. Multiplied across repeated
|
||||
import attempts on the same circuit, this can produce significant memory
|
||||
pressure (10 failed 100 MB imports = 1 GB) and contradicts the design doc's
|
||||
"Bundles are not retained server-side after ApplyAsync commits" statement.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
In the `ApplyAsync` catch block, call `_sessionStore.Remove(sessionId)` (or at
|
||||
least zero out `session.DecryptedContent`) before re-throwing. Also clear
|
||||
`DecryptedContent` from the session on the success path before removing — the
|
||||
buffer is potentially still rooted by a caller-held reference. Consider
|
||||
shortening the TTL when a session is in a known-stuck state. The session
|
||||
store's `EvictExpired` exists but is only called on demand — wire it to a
|
||||
periodic timer so abandoned sessions clear even without traffic.
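
The cleanup on both paths can be expressed with a `finally` block. Everything in the sketch below (`SessionSketch`, `SessionStoreSketch`, `ApplySketch`) is a hypothetical stand-in for `BundleSession`, `BundleSessionStore`, and `ApplyAsync`, kept synchronous for brevity; it assumes the session owns its `DecryptedContent` buffer so that zeroing it in place is safe.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;

public sealed class SessionSketch
{
    public byte[] DecryptedContent { get; set; } = Array.Empty<byte>();
}

public sealed class SessionStoreSketch
{
    private readonly ConcurrentDictionary<Guid, SessionSketch> _sessions = new();

    public void Add(Guid id, SessionSketch session) => _sessions[id] = session;

    public bool Contains(Guid id) => _sessions.ContainsKey(id);

    // Removes the session and overwrites its plaintext buffer, so a
    // caller-held reference to the array no longer exposes the secrets.
    public void RemoveAndScrub(Guid id)
    {
        if (_sessions.TryRemove(id, out var session))
        {
            CryptographicOperations.ZeroMemory(session.DecryptedContent);
            session.DecryptedContent = Array.Empty<byte>();
        }
    }
}

public static class ApplySketch
{
    // Runs the apply body and always evicts the session afterwards,
    // whether the body succeeded or threw.
    public static void Apply(SessionStoreSketch store, Guid sessionId, Action applyBody)
    {
        try
        {
            applyBody();
        }
        finally
        {
            store.RemoveAndScrub(sessionId);
        }
    }
}
```

The cost of this choice is that a failed apply forces the operator to upload the bundle again, which seems consistent with the design doc's statement that bundles are not retained server-side.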
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Transport-008 — `PreviewAsync` issues an N+1 `GetTemplateWithChildrenAsync` per matching template name
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:252-272` |
|
||||
|
||||
**Description**
|
||||
|
||||
Building the per-template diff loops over every existing stub returned by
|
||||
`GetAllTemplatesAsync` and, for any name that matches an incoming DTO, calls
|
||||
`GetTemplateWithChildrenAsync(stub.Id)` to re-fetch with children. On a target
|
||||
DB with many templates that overlap the bundle this is one round-trip per
|
||||
matching template (often the whole bundle), each query carrying the full
|
||||
attributes/alarms/scripts/compositions joins. The diff itself is read-only and
|
||||
fits a single eager-loaded `GetAllTemplatesWithChildrenAsync` query.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a `GetAllTemplatesWithChildrenAsync` (or extend `GetAllTemplatesAsync` with
|
||||
an `includeChildren` flag) on `ITemplateEngineRepository` and use it here. The
|
||||
same N+1 appears in `ResolveCompositionEdgesAsync` (line 1093) for the
|
||||
just-imported templates, but that loop is bounded by the bundle's size and is
|
||||
less of a concern.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Transport-009 — `IAuditCorrelationContext.BundleImportId` is mutated on the same scoped instance the AuditService reads
|
||||
|
||||
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:528, 668, 703`, `src/ScadaLink.ConfigurationDatabase/Services/AuditCorrelationContext.cs` |
|
||||
|
||||
**Description**
|
||||
|
||||
The XML doc on `IAuditCorrelationContext` correctly notes that mutating
|
||||
`BundleImportId` is not thread-safe and that concurrent imports inside a single
|
||||
scope would cross-contaminate audit rows. The contract is "Blazor circuit / API
|
||||
request — sequential await chain — single writer". The risk is that this
|
||||
invariant is documentation-only — there is no enforcement (e.g. a mutex on set,
|
||||
or an `AsyncLocal<Guid?>` impl) and no test exercising a concurrent-callers
|
||||
scenario. A future change that schedules audit writes on a different
|
||||
synchronization context inside the apply transaction (e.g. `Task.WhenAll` over
|
||||
the Apply helpers) would silently start leaking the id across rows.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) back `BundleImportId` with an `AsyncLocal<Guid?>` so each logical
|
||||
call chain inherits the value and concurrent chains can't trample it, or
|
||||
(b) wrap the apply in a try/finally that snapshots and restores. (b) is closer
|
||||
to the current design. Either way, add an integration test that fires two
|
||||
overlapping `ApplyAsync` calls and asserts each bundle's rows carry only that
|
||||
bundle's id.
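
Option (a) and the suggested overlapping-apply test can be sketched together. The sketch is in Python, not C#: `contextvars.ContextVar` plays the role that `AsyncLocal<Guid?>` would play in the real implementation, and `apply_bundle` and `write_audit_row` are hypothetical stand-ins for `ApplyAsync` and the audit write, not the module's actual code.

```python
import asyncio
import contextvars
import uuid

# Hypothetical stand-in for IAuditCorrelationContext.BundleImportId.
bundle_import_id = contextvars.ContextVar("bundle_import_id", default=None)

audit_rows = []  # (bundle name, id stamped on the row)


async def write_audit_row(bundle_name):
    """Read the id from the ambient context, as an audit writer would."""
    await asyncio.sleep(0)  # yield so the two applies interleave
    audit_rows.append((bundle_name, bundle_import_id.get()))


async def apply_bundle(bundle_name, import_id, row_count=3):
    """Set the id for this logical call chain and restore it afterwards."""
    token = bundle_import_id.set(import_id)
    try:
        for _ in range(row_count):
            await write_audit_row(bundle_name)
    finally:
        bundle_import_id.reset(token)


async def run_overlapping():
    ids = {"bundle-a": uuid.uuid4(), "bundle-b": uuid.uuid4()}
    # asyncio.gather wraps each coroutine in a Task, and each Task runs in
    # its own copy of the current context, so the two set() calls are isolated.
    await asyncio.gather(*(apply_bundle(name, i) for name, i in ids.items()))
    return ids


expected_ids = asyncio.run(run_overlapping())
```

With a plain shared field in place of the context variable, the interleaved writes would stamp some rows with the other bundle's id, which is the leak described above.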
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Transport-010 — Critical Overwrite + cross-cutting paths uncovered by tests
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Location | `tests/ScadaLink.Transport.IntegrationTests/ConflictResolutionTests.cs`, `tests/ScadaLink.Transport.IntegrationTests/Import/BundleImporterApplyTests.cs` |
|
||||
|
||||
**Description**
|
||||
|
||||
The existing tests cover the happy path well (round-trip, semantic-validator
|
||||
gating, rollback even when `RollbackAsync` itself throws, composition imports),
|
||||
but the per-entity Overwrite resolutions are only spot-tested:
|
||||
`ConflictResolutionTests.Overwrite_replaces_existing_template_description`
|
||||
asserts on `Description` only. Specifically missing:
|
||||
- Overwrite of a `Template` whose `Attributes` / `Alarms` / `Scripts` /
|
||||
`Compositions` diverged from the existing row (would catch Transport-001).
|
||||
- Overwrite of an `ExternalSystem` whose `Methods` diverged (would catch
|
||||
Transport-002).
|
||||
- Overwrite of a `NotificationList` whose `Recipients` collection diverged
|
||||
(NotificationList Overwrite does sync recipients via clear+add — needs an
|
||||
asserting test).
|
||||
- Concurrent `ApplyAsync` calls on a shared scope to exercise the
|
||||
`IAuditCorrelationContext` mutation contract (would catch Transport-009).
|
||||
- Per-IP unlock-throttle behaviour (would catch Transport-004).
|
||||
- A session that survives a failed Apply (would catch Transport-007).
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add the missing integration tests above. Most can be modelled after
|
||||
`ConflictResolutionTests`' export-then-mutate-target-then-apply pattern.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Transport-011 — Design doc's Step-1 manifest preview promises decryption-free preview, but `LoadAsync` reads and validates content before passphrase
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Documentation & comments |
|
||||
| Status | Open |
|
||||
| Location | `docs/requirements/Component-Transport.md` Import Flow Step 1, `src/ScadaLink.Transport/Import/BundleImporter.cs:124-203` |
|
||||
|
||||
**Description**
|
||||
|
||||
The design doc says: "The manifest is plaintext so the import wizard can
|
||||
preview bundle contents and source provenance before the user supplies a
|
||||
passphrase." `LoadAsync` honours that — but does so by ALWAYS reading and
|
||||
hashing the content blob (encrypted or not) on the first call, regardless of
|
||||
whether the caller has a passphrase. For an encrypted bundle with no
|
||||
passphrase, the code path that surfaces the encrypted-bundle prompt is the
|
||||
`ArgumentException` thrown at line 195, which has already performed the full
|
||||
manifest parse + content-hash check + read of the encrypted blob. That's fine,
|
||||
but it means there is no cheap "manifest peek" — the UI's "let the user see
|
||||
the manifest before deciding whether to type a passphrase" is at least
|
||||
O(bundle-size) and consumes the full upload buffer each call. The design doc
|
||||
gives a misleading impression of cost.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) add an explicit `ReadManifestAsync(Stream)` interface method that
|
||||
skips the content read for the pure preview case, or (b) update the design
|
||||
doc to clarify the full envelope is read on every `LoadAsync` and the cheap
|
||||
"peek" is conceptual rather than runtime.
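
Option (a) might look roughly like the sketch below. The bundle's on-disk layout is not described in this review, so the sketch assumes a purely hypothetical envelope: a 4-byte big-endian manifest length, the plaintext manifest as UTF-8 JSON, then the content blob. `read_manifest`, `write_bundle`, and the manifest fields are illustrative names, and the code is Python rather than the module's C#.

```python
import io
import json
import struct


def write_bundle(manifest, content):
    """Build a bundle in the assumed layout: length prefix, manifest, content."""
    manifest_bytes = json.dumps(manifest).encode("utf-8")
    return struct.pack(">I", len(manifest_bytes)) + manifest_bytes + content


def read_exact(stream, n):
    """Read exactly n bytes or fail, so a truncated upload is reported clearly."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("truncated bundle")
    return data


def read_manifest(stream):
    """Parse only the manifest; the content blob is never read or hashed."""
    (length,) = struct.unpack(">I", read_exact(stream, 4))
    return json.loads(read_exact(stream, length).decode("utf-8"))


# Hypothetical manifest fields and a 1 MB placeholder content blob.
sample_manifest = {"source": "site-a", "encrypted": True}
bundle = write_bundle(sample_manifest, b"\x00" * 1_000_000)

stream = io.BytesIO(bundle)
manifest = read_manifest(stream)
bytes_consumed = stream.tell()
```

The preview cost here depends on the manifest size, not the bundle size. A peek of this kind skips the content-hash check, so the full load would still need to verify the hash before anything is applied.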
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Transport-012 — "Bundle Import" filter promised in design doc not surfaced in Configuration Audit Log Viewer UI
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Documentation & comments |
|
||||
| Status | Open |
|
||||
| Location | `docs/requirements/Component-Transport.md` §Audit Trail, `src/ScadaLink.ConfigurationDatabase/Repositories/CentralUiRepository.cs:148` |
|
||||
|
||||
**Description**
|
||||
|
||||
The design doc says: "The existing Configuration Audit Log Viewer gains a
|
||||
**Bundle Import** filter that surfaces all rows for a given import. The
|
||||
`BundleImported` summary row links to the filtered view." A repository filter
|
||||
on `BundleImportId` is wired into `CentralUiRepository` (line 148), but no UI
|
||||
filter control surfaces it and the `BundleImported` summary row does not carry
|
||||
a hyperlink in `Configuration Audit Log Viewer`. This is a documentation-vs-code
|
||||
gap, not a bug in Transport itself, but the spec lives in the Transport doc so
|
||||
it's reasonable to flag.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either implement the filter dropdown + summary-row link in the Configuration
|
||||
Audit Log Viewer, or note the deferral in the design doc.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
@@ -33,9 +33,21 @@ def discover_modules():
|
||||
return modules
|
||||
|
||||
|
||||
def parse_header(module, text):
    """Extract (last_reviewed, commit) from the module's header table.
    Falls back to the historical baseline when the field is absent or templated."""
    last = re.search(r"\|\s*Last reviewed\s*\|\s*([0-9]{4}-[0-9]{2}-[0-9]{2})", text)
    commit = re.search(r"\|\s*Commit reviewed\s*\|\s*`([^`]+)`", text)
    return (
        last.group(1) if last else "2026-05-16",
        commit.group(1) if commit else "9c60592",
    )
|
||||
|
||||
|
||||
def parse_findings(module):
|
||||
"""Parse one module's findings.md into (module, id, severity, title, status) tuples."""
|
||||
"""Parse one module's findings.md into ((last_reviewed, commit), [(module, id, severity, title, status), ...])."""
|
||||
text = open(os.path.join(BASE, module, "findings.md")).read()
|
||||
header = parse_header(module, text)
|
||||
findings = []
|
||||
for block in re.split(r"^### ", text, flags=re.M)[1:]:
|
||||
head = block.splitlines()[0].strip()
|
||||
@@ -49,7 +61,7 @@ def parse_findings(module):
|
||||
if not sev or not status:
|
||||
raise SystemExit(f"{module}/findings.md: {fid} is missing a Severity or Status field")
|
||||
findings.append((module, fid, sev.group(1), title, status.group(1).strip()))
|
||||
return findings
|
||||
return header, findings
|
||||
|
||||
|
||||
def finding_number(finding):
|
||||
@@ -58,7 +70,7 @@ def finding_number(finding):
|
||||
|
||||
def build_readme(modules, per_module):
|
||||
pending = sorted(
|
||||
(f for fs in per_module.values() for f in fs if f[4] in PENDING_STATUSES),
|
||||
(f for fs in per_module.values() for f in fs[1] if f[4] in PENDING_STATUSES),
|
||||
key=lambda f: (SEVERITY_ORDER.get(f[2], 9), f[0], finding_number(f)),
|
||||
)
|
||||
|
||||
@@ -66,7 +78,7 @@ def build_readme(modules, per_module):
|
||||
return sum(1 for f in pending if f[2] == sev)
|
||||
|
||||
def open_count(module, sev):
|
||||
return sum(1 for f in per_module[module]
|
||||
return sum(1 for f in per_module[module][1]
|
||||
if f[2] == sev and f[4] in PENDING_STATUSES)
|
||||
|
||||
lines = []
|
||||
@@ -123,9 +135,10 @@ def build_readme(modules, per_module):
|
||||
add("|--------|---------------|--------|----------------|------|-------|")
|
||||
for module in modules:
|
||||
counts = [open_count(module, s) for s in ("Critical", "High", "Medium", "Low")]
|
||||
add(f"| [{module}]({module}/findings.md) | 2026-05-16 | `9c60592` "
|
||||
last_reviewed, commit = per_module[module][0]
|
||||
add(f"| [{module}]({module}/findings.md) | {last_reviewed} | `{commit}` "
|
||||
f"| {counts[0]}/{counts[1]}/{counts[2]}/{counts[3]} "
|
||||
f"| {sum(counts)} | {len(per_module[module])} |")
|
||||
f"| {sum(counts)} | {len(per_module[module][1])} |")
|
||||
add("")
|
||||
add("## Pending Findings")
|
||||
add("")
|
||||
@@ -159,8 +172,8 @@ def main():
|
||||
|
||||
readme_path = os.path.join(BASE, "README.md")
|
||||
pending = sum(1 for fs in per_module.values()
|
||||
for f in fs if f[4] in PENDING_STATUSES)
|
||||
total = sum(len(fs) for fs in per_module.values())
|
||||
for f in fs[1] if f[4] in PENDING_STATUSES)
|
||||
total = sum(len(fs[1]) for fs in per_module.values())
|
||||
|
||||
if check:
|
||||
current = open(readme_path).read() if os.path.exists(readme_path) else ""
|
||||
|
||||
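
The header parsing added in this diff can be exercised on its own. The sketch below copies `parse_header` from the hunk above and runs it against an abbreviated header table shaped like the one in a module's `findings.md`, and against a file with no header table; the sample strings and the `Legacy` module name are made up for illustration.

```python
import re


def parse_header(module, text):
    """Extract (last_reviewed, commit) from the module's header table.
    Falls back to the historical baseline when the field is absent or templated."""
    last = re.search(r"\|\s*Last reviewed\s*\|\s*([0-9]{4}-[0-9]{2}-[0-9]{2})", text)
    commit = re.search(r"\|\s*Commit reviewed\s*\|\s*`([^`]+)`", text)
    return (
        last.group(1) if last else "2026-05-16",
        commit.group(1) if commit else "9c60592",
    )


# Abbreviated header table in the same shape as a findings.md header.
sample = """\
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.AuditLog` |
| Last reviewed | 2026-05-28 |
| Commit reviewed | `1eb6e97` |
"""

reviewed = parse_header("AuditLog", sample)

# A file without the header table falls back to the historical baseline.
fallback = parse_header("Legacy", "# Code Review\n\nNo header table yet.\n")
```

Because `parse_findings` now returns `(header, findings)`, callers index `per_module[module][0]` for the header pair and `per_module[module][1]` for the findings list, as the remaining hunks show.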