docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
Joseph Doherty
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
+197 -2
@@ -5,9 +5,9 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway` |
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Commit reviewed | `4307c381` |
| Open findings | 0 |
## Summary
@@ -81,6 +81,42 @@ design-doc drift). Theme: every new finding is in a code path that was added or
touched by the earlier fix bundle but whose error-propagation contract was not
verified end-to-end against the S&F engine or the design doc.
#### Re-review 2026-06-20 (commit `4307c381`) — full review
All twenty-three prior findings (001-023) remain `Resolved`; spot-checks against the
current source confirm the fixes still hold. Since `1eb6e97` the module gained a new
SQL error-classification layer (`SqlErrorClassifier` + `TransientDatabaseException` /
`PermanentDatabaseException`, commits `d0527064` / `de375ff7`) and `DatabaseGateway.CachedWriteAsync`
was reshaped to attempt the write immediately and classify the outcome (transient →
buffer, permanent → synchronous `Failed`, mirroring `CachedCallAsync`); the `ScadaLink → ZB.MOM.WW.ScadaBridge`
rename also landed. The full 10-category checklist was re-walked and surfaced **three new
findings**, none Critical/High. The most serious (`ExternalSystemGateway-024`, Medium) is
that `SqlErrorClassifier.IsTransient(Exception)` classifies *every* `InvalidOperationException`
(and `TimeoutException`) as transient, so a DB-layer authoring/driver-misuse bug that surfaces
as `InvalidOperationException` is silently buffered and retried instead of propagating —
contradicting the classifier's own stated "authoring bugs must propagate" contract and the
HTTP-path symmetry it claims. `-025` (Low) is a caller-cancellation that the SQL driver
raises as a `SqlException` (mid-flight cancel) being reclassified as a *permanent* DB error
rather than propagating the cancellation (a narrow asymmetry with the `-008` contract; untested).
`-026` (Low) is a mis-ordered numbered-label comment block in `ExecuteWriteAsync`. Theme:
the new SQL-classification seam faithfully mirrors the HTTP path's *shape* but is more
permissive on the non-typed-exception branch than the HTTP path it cites as its model.
_Re-review (2026-06-20, `4307c381`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `BuildUrl` / `JsonElementToParameterValue` precision cascade / verb validation all still correct (006/017/020/022). New: caller-cancel surfacing as `SqlException` is misclassified permanent — finding 025. |
| 2 | Akka.NET conventions | ☑ | Still no actors; `AddExternalSystemGatewayActors` remains a no-op. Cached-call/write lifecycle + audit emission live in SiteRuntime/AuditLog, correct boundary. No issues. |
| 3 | Concurrency & thread safety | ☑ | Services stateless and DI-scoped; new `ExecuteWriteAsync`/`RunSqlAsync`/`SqlErrorClassifier` seams introduce no shared mutable state (static `HashSet` is read-only). No findings. |
| 4 | Error handling & resilience | ☑ | New SQL transient/permanent split is the headline area. `SqlErrorClassifier.IsTransient(Exception)` is over-broad — all `InvalidOperationException`/`TimeoutException` are transient, sweeping authoring bugs into retry — finding 024. Number-based set (024 aside) is sound; unknown→permanent fail-fast is correct. |
| 5 | Security | ☑ | `ExecuteWriteAsync`/`SqlErrorClassifier` use the connection NAME, never the connection string, in error text (verified — credentials not leaked). `ApplyAuth` secrets still never logged. SQL `CommandText` is script-authored by design (parameterised values bound separately). No new findings. |
| 6 | Performance & resource management | ☑ | `RunSqlAsync` uses `await using`/`using` for connection + command. `EmptyParameters` reused. `client.Timeout` set per rented client (correct — `IHttpClientFactory` clients are per-call). No new findings. |
| 7 | Design-document adherence | ☑ | Updated doc (lines 65-66) now describes the immediate-attempt-then-buffer cached-write model the code implements; doc and code are in sync. PATCH reconciled (023). No new drift. |
| 8 | Code organization & conventions | ☑ | `Transient`/`PermanentDatabaseException` are component-local exception types (parallel to the HTTP ones); acceptable. `SqlErrorClassifier` is a static policy class in the component. No new findings. |
| 9 | Testing coverage | ☑ | New SQL-classification paths well covered (immediate + retry, transient/permanent/non-Sql/cancel/unexpected). Gaps: caller-cancel-as-`SqlException` (025) and the `InvalidOperationException`-authoring-bug case (024) are untested — recorded under those findings. |
| 10 | Documentation & comments | ☑ | XML docs accurate and thorough. The numbered case labels in `ExecuteWriteAsync`'s comment block are mis-ordered vs the catch sequence — finding 026. |
## Checklist coverage
| # | Category | Examined | Notes |
@@ -1312,3 +1348,162 @@ code travel together" rule:
- If PATCH is not in scope, remove `method.HttpMethod.Equals("PATCH", ...)` from the
body branch in `InvokeHttpAsync` and let finding-022's verb validation reject it.
The design-doc list then remains the single source of truth.
### ExternalSystemGateway-024 — `SqlErrorClassifier.IsTransient(Exception)` classifies all `InvalidOperationException`/`TimeoutException` as transient, so DB authoring/driver-misuse bugs are buffered and retried instead of propagating
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/SqlErrorClassifier.cs:123-142`, `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:346-355` |
**Description**
`SqlErrorClassifier.IsTransient(Exception)` is a pure exception-type test:
```csharp
return exception is InvalidOperationException
or IOException
or SocketException
or TimeoutException
or TaskCanceledException
or DbException;
```
`DatabaseGateway.ExecuteWriteAsync` uses it as the third ordered catch
(`catch (Exception ex) when (SqlErrorClassifier.IsTransient(ex))`) to decide whether a
write that did *not* surface as a `SqlException` should be buffered and retried. The
type's own XML doc justifies the set as "a live DB outage does not always surface as a
`SqlException`" and explicitly contrasts it with authoring bugs: *"Authoring bugs
(`ArgumentException`, `NullReferenceException`, etc.) are loud, fixable failures —
silently buffering and retrying them forever would hide the bug."*
The problem is that `InvalidOperationException` and `TimeoutException` are exactly the
two most common shapes of an authoring/driver-misuse bug on the ADO.NET write path, and
both are now swept into the transient set:
1. `Microsoft.Data.SqlClient` throws `InvalidOperationException` for a long list of
*programming* errors that are NOT outages and will fail identically on every retry —
e.g. a `SqlParameter` already belongs to another `SqlParameterCollection`, the
command's `Connection` property is not set, an open `SqlDataReader` already exists on
the connection, or `CommandText` is invalid for the call. The doc treats
`InvalidOperationException` as "the connection is not open / pooled connection
broken", but the type carries no such discrimination — any of those authoring
conditions is classified transient and buffered.
2. A bare `TimeoutException` (or any other type in the set) thrown by a custom/wrapped
provider for a non-outage reason is likewise classified transient.
Consequence: a `CachedWrite` whose payload triggers one of these deterministic
programming errors is returned to the script as `WasBuffered: true` (a lie — it will
never succeed), then re-attempted on every S&F sweep, failing identically each time,
until `RetryCount >= DefaultMaxRetries` and it is finally parked — a noisy retry loop
hiding a fixable bug, which is the exact outcome the doc says the design avoids. It is
also asymmetric with the HTTP path it cites as its model: `ErrorClassifier.IsTransient(Exception)`
(ExternalSystemClient) deliberately does NOT catch `InvalidOperationException`, so the
same class of bug propagates on the API side.
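
The retry behaviour described above can be illustrated with a small simulation. This is a deliberately simplified sketch, not the S&F engine: `DeterministicRetrySketch`, `AttemptsUntilParked`, and the `maxRetries` parameter are hypothetical names, and the real sweep scheduling and parking logic are assumed rather than reproduced. It only shows that a deterministic `InvalidOperationException` consumes every allowed retry before being parked.

```csharp
using System;

// Hypothetical simulation: names and retry accounting are illustrative only
// and do not mirror the real store-and-forward engine.
public static class DeterministicRetrySketch
{
    // Returns how many attempts were made before the write succeeded or was parked.
    public static int AttemptsUntilParked(Action write, int maxRetries)
    {
        int retryCount = 0; // retries performed after the initial immediate attempt
        int attempts = 0;

        while (true)
        {
            attempts++;
            try
            {
                write();
                return attempts; // success: unreachable for a deterministic bug
            }
            catch (InvalidOperationException)
            {
                // Under the broad type test this is "transient", so it is retried
                // even though the same programming error recurs on every attempt.
                if (retryCount >= maxRetries)
                {
                    return attempts; // parked after exhausting the retry budget
                }

                retryCount++;
            }
        }
    }
}
```

With a budget of three retries, a write that always throws is attempted four times in this model (one immediate attempt plus three retries), and each attempt fails for the same reason.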
**Recommendation**
Narrow the non-`SqlException` transient set to genuine transport/outage shapes. Drop
the broad `InvalidOperationException` arm — or, if connection-state errors must stay
transient, narrow it (e.g. only when the message indicates a broken/closed connection,
or only the `DbException`/`SocketException`/`IOException`/timeout shapes) so that
ADO.NET programming `InvalidOperationException`s propagate as the design intends.
Mirror the HTTP path's narrower `ErrorClassifier.IsTransient(Exception)` set unless
there is a documented reason the SQL path must be broader. Add a regression test that
throws a programming-shaped `InvalidOperationException` (e.g. "The SqlParameter is
already contained by another SqlParameterCollection.") from `RunSqlAsync` and asserts
it propagates (not buffered), companion to the existing
`CachedWrite_UnexpectedException_Propagates_NotClassifiedTransient`.
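
One possible shape of the narrowing recommended above, as a minimal sketch. `NarrowedSqlTransientPolicy` is a hypothetical name, not the component's `SqlErrorClassifier`, and the exact set is an assumption: it keeps only the `DbException`/`SocketException`/`IOException`/timeout shapes named in the recommendation and omits `TaskCanceledException`, whose treatment is left to the owner's policy decision.

```csharp
using System;
using System.Data.Common;
using System.IO;
using System.Net.Sockets;

// Hypothetical illustration only: not the component's actual classifier.
public static class NarrowedSqlTransientPolicy
{
    // Treats only transport/outage shapes as transient. InvalidOperationException
    // is deliberately absent, so ADO.NET programming errors (and subclasses such
    // as ObjectDisposedException) are not buffered and can propagate.
    public static bool IsTransient(Exception exception) =>
        exception is IOException
            or SocketException
            or TimeoutException
            or DbException;
}
```

Under this sketch, `IsTransient(new InvalidOperationException(...))` is `false`, so a catch filter built on it would let the exception fall through to the caller instead of buffering the write.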
**Resolution**
Deferred 2026-06-20: narrowing the over-broad SQL transient set (which currently treats every `InvalidOperationException`/`TimeoutException` as transient) to match the HTTP path is a classification-policy decision for the owner; the recommended narrowing is recorded above. There is no data loss today (the failure mode is a bounded, noisy retry that reports a false `WasBuffered`), so the finding is deferred rather than forced.
### ExternalSystemGateway-025 — Caller-token cancellation surfacing from the SQL driver as a `SqlException` is reclassified as a permanent DB error instead of propagating
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:329-345`, `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/SqlErrorClassifier.cs:158-174` |
**Description**
`ExecuteWriteAsync`'s ordered catches handle caller cancellation only via
`catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`.
That covers the common case where the driver observes the token before/around the
network call and raises `OperationCanceledException`/`TaskCanceledException`. However,
when `Microsoft.Data.SqlClient` is cancelled *mid-statement* by the caller's token it
can instead raise a **`SqlException`** (historically error number `0` / message
"Operation cancelled by user", and on some paths `3980` "the request failed to run
because the batch is aborted"). Such a `SqlException` does not derive from
`OperationCanceledException`, so the first filter is skipped; it falls into
`catch (SqlException ex) { throw SqlErrorClassifier.Throw(connectionName, ex); }`, and
because `0`/`3980` are not in the transient set, `SqlErrorClassifier.Throw` raises a
`PermanentDatabaseException`.
The caller-requested cancellation is therefore reported to the script as a *permanent
database failure* (`ExternalCallResult.Success == false`, "Permanent database error: …")
rather than propagating as an `OperationCanceledException`. This contradicts the
`ExternalSystemGateway-008` contract ("the caller asked to abandon the work — do not
reclassify") that the comment block at `DatabaseGateway.cs:333-337` explicitly cites,
and it is the SQL-path analogue of the cancellation/timeout conflation that `-008`
fixed for HTTP. The existing `CachedWrite_CancellationRequested_PropagatesOperationCanceled_NotReclassified`
test only exercises the `OperationCanceledException` shape, so the `SqlException`
cancellation path is uncovered.
**Recommendation**
Before classifying a `SqlException`, re-check `cancellationToken.IsCancellationRequested`
and, if set, rethrow as `OperationCanceledException(cancellationToken)` (or call
`cancellationToken.ThrowIfCancellationRequested()`), so a caller-initiated cancel
propagates regardless of whether the driver surfaced it as `OperationCanceledException`
or as a cancellation-shaped `SqlException`. Add a regression test that cancels the
caller token and throws a cancellation-shaped `SqlException` (or, since `SqlException`
has no public constructor, drive it through `RunSqlAsync` with a token-cancelled
`OperationCanceledException` wrapped to assert the guard) and asserts an
`OperationCanceledException` propagates and nothing is buffered.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `ExecuteWriteAsync` now calls `cancellationToken.ThrowIfCancellationRequested()` at the top of the `catch (SqlException)` block (before `SqlErrorClassifier.Throw`), so a caller-token cancel that surfaces as a `SqlException` propagates as `OperationCanceledException` regardless of the driver's exception shape (version-independent). Test added (verified failing pre-fix).
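
The guard pattern described in this resolution can be sketched in isolation. This is an illustration under stated assumptions, not the code in `DatabaseGateway.cs`: because `SqlException` has no public constructor, `FakeDriverException` stands in for it, and `PermanentFailureException` and `CancellationGuardSketch` are likewise hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for a driver exception such as SqlException (hypothetical type).
public sealed class FakeDriverException : Exception
{
    public FakeDriverException(string message) : base(message) { }
}

// Stand-in for a "permanent database error" result (hypothetical type).
public sealed class PermanentFailureException : Exception
{
    public PermanentFailureException(string message, Exception inner) : base(message, inner) { }
}

public static class CancellationGuardSketch
{
    public static async Task ExecuteAsync(
        Func<CancellationToken, Task> write, CancellationToken cancellationToken)
    {
        try
        {
            await write(cancellationToken);
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            throw; // caller asked to abandon the work: do not reclassify
        }
        catch (FakeDriverException ex)
        {
            // Guard: if the caller's token is cancelled, propagate cancellation
            // even though the driver surfaced it in its own exception shape.
            cancellationToken.ThrowIfCancellationRequested();
            throw new PermanentFailureException("Permanent database error", ex);
        }
    }
}
```

With a cancelled token, a driver-shaped failure leaves `ExecuteAsync` as an `OperationCanceledException`; with a live token, the same failure is still reported as permanent.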
### ExternalSystemGateway-026 — `ExecuteWriteAsync` comment-block case labels are mis-ordered relative to the catch sequence
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:316-355` |
**Description**
The header comment for `ExecuteWriteAsync` (lines 316-328) numbers the classification
cases 1-4 in catch order: `1.` caller-requested cancellation propagates unchanged; `2.`
a `SqlException` is classified by number; `3.` a non-`SqlException` transport/connection
failure is transient; `4.` unexpected exceptions propagate. The inline labels on the
catch bodies, however, are inverted: the cancellation catch (line 333) is annotated
`// [2]` and the non-`SqlException` transient catch (line 348) is annotated `// [1]`.
The labels also do not match the parallel labelling in the test file, where `[1]` is
used for the non-Sql-outage case and `[2]` for cancellation — so the source comment and
the tests disagree with the source's own header list.
This is cosmetic but actively misleading: a reader cross-referencing the numbered
rationale in the header against the `[n]`-tagged catch bodies will map each case to the
wrong rationale, and the discrepancy undermines confidence in an otherwise carefully
documented classification seam.
**Recommendation**
Renumber the inline `// [n]` labels to match the header list (cancellation = `[1]`,
non-`SqlException` transient = `[3]`), or drop the bracketed numbers entirely and keep
the prose, so the comment is self-consistent. Align the test-file annotations to
whatever scheme is chosen.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): renumbered the inline `// [n]` case labels in `ExecuteWriteAsync` to match the method header's rationale list (cancellation=[1], SqlException=[2], non-Sql=[3]), and realigned the three corresponding labels in the test file. Comment-only.