docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked
Full per-module re-review of the 16 stale modules (last seen at 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. The 4 findings still pending are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites; DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
@@ -5,9 +5,9 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway` |
 | Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
 | Open findings | 0 |

 ## Summary
@@ -81,6 +81,42 @@ design-doc drift). Theme: every new finding is in a code path that was added or
touched by the earlier fix bundle but whose error-propagation contract was not
verified end-to-end against the S&F engine or the design doc.

#### Re-review 2026-06-20 (commit `4307c381`) — full review

All twenty-three prior findings (001–023) remain `Resolved`; spot-checks against the
current source confirm the fixes still hold. Since `1eb6e97` the module gained a new
SQL error-classification layer (`SqlErrorClassifier` + `TransientDatabaseException` /
`PermanentDatabaseException`, commits `d0527064` / `de375ff7`), and
`DatabaseGateway.CachedWriteAsync` was reshaped to attempt the write immediately and
classify the outcome (transient → buffer, permanent → synchronous `Failed`, mirroring
`CachedCallAsync`); the `ScadaLink → ZB.MOM.WW.ScadaBridge` rename also landed. The full
10-category checklist was re-walked and surfaced **three new findings**, none
Critical/High. The most serious (`ExternalSystemGateway-024`, Medium) is that
`SqlErrorClassifier.IsTransient(Exception)` classifies *every* `InvalidOperationException`
(and `TimeoutException`) as transient, so a DB-layer authoring/driver-misuse bug that
surfaces as `InvalidOperationException` is silently buffered and retried instead of
propagating. That contradicts the classifier's own stated "authoring bugs must
propagate" contract and the HTTP-path symmetry it claims. `-025` (Low): a caller
cancellation that the SQL driver surfaces as a `SqlException` (mid-flight cancel) is
reclassified as a *permanent* DB error rather than propagating as a cancellation (a
narrow asymmetry with the `-008` contract; untested). `-026` (Low): the numbered-label
comment block in `ExecuteWriteAsync` is mis-ordered. Theme: the new SQL-classification
seam faithfully mirrors the HTTP path's *shape* but is more permissive on the
non-typed-exception branch than the HTTP path it cites as its model.

_Re-review (2026-06-20, `4307c381`):_

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `BuildUrl` / `JsonElementToParameterValue` precision cascade / verb validation all still correct (006/017/020/022). New: caller-cancel surfacing as `SqlException` is misclassified permanent (finding 025). |
| 2 | Akka.NET conventions | ☑ | Still no actors; `AddExternalSystemGatewayActors` remains a no-op. Cached-call/write lifecycle + audit emission live in SiteRuntime/AuditLog, correct boundary. No issues. |
| 3 | Concurrency & thread safety | ☑ | Services stateless and DI-scoped; new `ExecuteWriteAsync`/`RunSqlAsync`/`SqlErrorClassifier` seams introduce no shared mutable state (static `HashSet` is read-only). No findings. |
| 4 | Error handling & resilience | ☑ | New SQL transient/permanent split is the headline area. `SqlErrorClassifier.IsTransient(Exception)` is over-broad: all `InvalidOperationException`/`TimeoutException` are transient, sweeping authoring bugs into retry (finding 024). Number-based set (024 aside) is sound; unknown→permanent fail-fast is correct. |
| 5 | Security | ☑ | `ExecuteWriteAsync`/`SqlErrorClassifier` use the connection NAME, never the connection string, in error text (verified: credentials not leaked). `ApplyAuth` secrets still never logged. SQL `CommandText` is script-authored by design (parameterised values bound separately). No new findings. |
| 6 | Performance & resource management | ☑ | `RunSqlAsync` uses `await using`/`using` for connection + command. `EmptyParameters` reused. `client.Timeout` set per rented client (correct: `IHttpClientFactory` clients are per-call). No new findings. |
| 7 | Design-document adherence | ☑ | Updated doc (lines 65–66) now describes the immediate-attempt-then-buffer cached-write model the code implements; in sync. PATCH reconciled (023). No new drift. |
| 8 | Code organization & conventions | ☑ | `Transient`/`PermanentDatabaseException` are component-local exception types (parallel to the HTTP ones); acceptable. `SqlErrorClassifier` is a static policy class in the component. No new findings. |
| 9 | Testing coverage | ☑ | New SQL-classification paths well covered (immediate + retry, transient/permanent/non-Sql/cancel/unexpected). Gaps: caller-cancel-as-`SqlException` (025) and the `InvalidOperationException`-authoring-bug case (024) are untested; recorded under those findings. |
| 10 | Documentation & comments | ☑ | XML docs accurate and thorough. The numbered case labels in `ExecuteWriteAsync`'s comment block are mis-ordered vs the catch sequence (finding 026). |

## Checklist coverage

| # | Category | Examined | Notes |
@@ -1312,3 +1348,162 @@ code travel together" rule:
- If PATCH is not in scope, remove `method.HttpMethod.Equals("PATCH", ...)` from the
  body branch in `InvokeHttpAsync` and let finding-022's verb validation reject it.
  The design-doc list then remains the single source of truth.

### ExternalSystemGateway-024 — `SqlErrorClassifier.IsTransient(Exception)` classifies all `InvalidOperationException`/`TimeoutException` as transient, so DB authoring/driver-misuse bugs are buffered and retried instead of propagating

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/SqlErrorClassifier.cs:123-142`, `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:346-355` |

**Description**

`SqlErrorClassifier.IsTransient(Exception)` is a pure exception-type test:

```csharp
return exception is InvalidOperationException
    or IOException
    or SocketException
    or TimeoutException
    or TaskCanceledException
    or DbException;
```

`DatabaseGateway.ExecuteWriteAsync` uses it as the third ordered catch
(`catch (Exception ex) when (SqlErrorClassifier.IsTransient(ex))`) to decide whether a
write that did *not* surface as a `SqlException` should be buffered and retried. The
type's own XML doc justifies the set as "a live DB outage does not always surface as a
`SqlException`" and explicitly contrasts it with authoring bugs: *"Authoring bugs
(`ArgumentException`, `NullReferenceException`, etc.) are loud, fixable failures —
silently buffering and retrying them forever would hide the bug."*

The problem is that `InvalidOperationException` and `TimeoutException` are exactly the
two most common shapes of an authoring/driver-misuse bug on the ADO.NET write path, and
both are now swept into the transient set:

1. `Microsoft.Data.SqlClient` throws `InvalidOperationException` for a long list of
*programming* errors that are NOT outages and will fail identically on every retry,
e.g. a `SqlParameter` already belongs to another `SqlParameterCollection`, the
command's `Connection` property is not set, an open `SqlDataReader` already exists on
the connection, or `CommandText` is invalid for the call. The doc treats
`InvalidOperationException` as "the connection is not open / pooled connection
broken", but the type carries no such discrimination: any of those authoring
conditions is classified transient and buffered.
2. A bare `TimeoutException` (or any other type in the set) thrown by a custom/wrapped
provider for a non-outage reason is likewise classified transient.

Consequence: a `CachedWrite` whose payload triggers one of these deterministic
programming errors is returned to the script as `WasBuffered: true` (a lie, since it
will never succeed), then re-attempted on every S&F sweep, failing identically each
time, until `RetryCount >= DefaultMaxRetries` and it is finally parked. The result is a
noisy retry loop hiding a fixable bug, which is the exact outcome the doc says the
design avoids. It is also asymmetric with the HTTP path it cites as its model:
`ErrorClassifier.IsTransient(Exception)` (ExternalSystemClient) deliberately does NOT
include `InvalidOperationException`, so the same class of bug propagates on the API
side.

**Recommendation**

Narrow the non-`SqlException` transient set to genuine transport/outage shapes. Drop
the broad `InvalidOperationException` arm or, if connection-state errors must stay
transient, narrow it (e.g. only when the message indicates a broken/closed connection,
or only the `DbException`/`SocketException`/`IOException`/timeout shapes) so that
ADO.NET programming `InvalidOperationException`s propagate as the design intends.
Mirror the HTTP path's narrower `ErrorClassifier.IsTransient(Exception)` set unless
there is a documented reason the SQL path must be broader. Add a regression test that
throws a programming-shaped `InvalidOperationException` (e.g. "The SqlParameter is
already contained by another SqlParameterCollection.") from `RunSqlAsync` and asserts
it propagates (not buffered), as a companion to the existing
`CachedWrite_UnexpectedException_Propagates_NotClassifiedTransient`.
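
A minimal sketch of the narrowing recommended here, assuming the simplest option: drop the `InvalidOperationException` arm and keep the other types from the predicate quoted in the Description. The class name `NarrowedSqlErrorClassifier` is hypothetical and not part of the module; the real `SqlErrorClassifier` may need a different set once the owner settles the policy.

```csharp
using System;
using System.Data.Common;
using System.IO;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical illustration only; not the module's actual SqlErrorClassifier.
public static class NarrowedSqlErrorClassifier
{
    // Transient = transport/outage-shaped failures that did not surface as a
    // SqlException. InvalidOperationException is deliberately absent, so
    // ADO.NET programming errors fall through to the unexpected-exception path.
    public static bool IsTransient(Exception exception) =>
        exception is IOException
            or SocketException
            or TimeoutException
            or TaskCanceledException
            or DbException;
}
```

Under this sketch a programming-shaped `InvalidOperationException` (for example, an unset `Connection` property) returns `false` and would propagate, while `IOException`, `SocketException`, and `TimeoutException` still classify as transient.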

**Resolution**

Deferred 2026-06-20: narrowing the over-broad SQL transient set (currently all `InvalidOperationException`/`TimeoutException` are treated as transient) to match the HTTP path is a classification-policy decision for the owner; the recommended narrowing is recorded. No data loss today (bounded noisy retry returning a false `WasBuffered`), so deferred rather than forced.

### ExternalSystemGateway-025 — Caller-token cancellation surfacing from the SQL driver as a `SqlException` is reclassified as a permanent DB error instead of propagating

| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:329-345`, `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/SqlErrorClassifier.cs:158-174` |

**Description**

`ExecuteWriteAsync`'s ordered catches handle caller cancellation only via
`catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`.
That covers the common case where the driver observes the token before/around the
network call and raises `OperationCanceledException`/`TaskCanceledException`. However,
when `Microsoft.Data.SqlClient` is cancelled *mid-statement* by the caller's token it
can instead raise a **`SqlException`** (historically error number `0` / message
"Operation cancelled by user", and on some paths `3980` "the request failed to run
because the batch is aborted"). Such a `SqlException` does not derive from
`OperationCanceledException`, so the first filter is skipped; it falls into
`catch (SqlException ex) { throw SqlErrorClassifier.Throw(connectionName, ex); }`, and
because `0`/`3980` are not in the transient set, `SqlErrorClassifier.Throw` raises a
`PermanentDatabaseException`.

The caller-requested cancellation is therefore reported to the script as a *permanent
database failure* (`ExternalCallResult.Success == false`, "Permanent database error: …")
rather than propagating as an `OperationCanceledException`. This contradicts the
`ExternalSystemGateway-008` contract ("the caller asked to abandon the work — do not
reclassify") that the comment block at `DatabaseGateway.cs:333-337` explicitly cites,
and it is the SQL-path analogue of the cancellation/timeout conflation that `-008`
fixed for HTTP. The existing `CachedWrite_CancellationRequested_PropagatesOperationCanceled_NotReclassified`
test only exercises the `OperationCanceledException` shape, so the `SqlException`
cancellation path is uncovered.

**Recommendation**

Before classifying a `SqlException`, re-check `cancellationToken.IsCancellationRequested`
and, if set, rethrow as `OperationCanceledException(cancellationToken)` (or call
`cancellationToken.ThrowIfCancellationRequested()`), so a caller-initiated cancel
propagates regardless of whether the driver surfaced it as `OperationCanceledException`
or as a cancellation-shaped `SqlException`. Add a regression test that cancels the
caller token, has `RunSqlAsync` throw a cancellation-shaped `SqlException`, and asserts
that an `OperationCanceledException` propagates and nothing is buffered. Because
`SqlException` has no public constructor, the test may instead need to exercise the
guard by driving `RunSqlAsync` with a wrapped, token-cancelled
`OperationCanceledException`.
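
A minimal sketch of the recommended guard, under stated assumptions. Because `SqlException` cannot be constructed directly, `FakeSqlException` is a hypothetical stand-in, as are `PermanentDatabaseExceptionSketch` and the `runSql` delegate (which plays the role of `RunSqlAsync`). The real `ExecuteWriteAsync` classifies through `SqlErrorClassifier.Throw`, which this sketch reduces to a single permanent outcome.

```csharp
using System;
using System.Data.Common;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for SqlException, which has no public constructor.
public sealed class FakeSqlException : DbException
{
    public FakeSqlException(string message) : base(message) { }
}

// Hypothetical stand-in for the module's PermanentDatabaseException.
public sealed class PermanentDatabaseExceptionSketch : Exception
{
    public PermanentDatabaseExceptionSketch(string message, Exception inner)
        : base(message, inner) { }
}

public static class CancelGuardSketch
{
    public static async Task ExecuteWriteAsync(
        Func<CancellationToken, Task> runSql, CancellationToken cancellationToken)
    {
        try
        {
            await runSql(cancellationToken);
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // Caller asked to abandon the work: propagate unchanged.
            throw;
        }
        catch (FakeSqlException ex)
        {
            // Guard: a caller cancel that the driver surfaced as a SQL error
            // still propagates as OperationCanceledException.
            cancellationToken.ThrowIfCancellationRequested();
            throw new PermanentDatabaseExceptionSketch("Permanent database error", ex);
        }
    }
}
```

With the token already cancelled, a `runSql` that throws `FakeSqlException` yields an `OperationCanceledException`; with the token not cancelled, the same exception is still reported as permanent.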

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): `ExecuteWriteAsync` now calls `cancellationToken.ThrowIfCancellationRequested()` at the top of the `catch (SqlException)` block (before `SqlErrorClassifier.Throw`), so a caller-token cancel that surfaces as a `SqlException` propagates as `OperationCanceledException` regardless of the driver's exception shape (version-independent). Test added (verified failing pre-fix).

### ExternalSystemGateway-026 — `ExecuteWriteAsync` comment-block case labels are mis-ordered relative to the catch sequence

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:316-355` |

**Description**

The header comment for `ExecuteWriteAsync` (lines 316-328) numbers the classification
cases 1–4 in catch order: `1.` caller-requested cancellation propagates unchanged; `2.`
a `SqlException` is classified by number; `3.` a non-`SqlException` transport/connection
failure is transient; `4.` unexpected exceptions propagate. The inline labels on the
catch bodies, however, do not follow that order: the cancellation catch (line 333) is
annotated `// [2]` and the non-`SqlException` transient catch (line 348) is annotated
`// [1]`. The test file uses the same labelling as the catch bodies (`[1]` for the
non-Sql-outage case and `[2]` for cancellation), so both the inline source comments and
the tests disagree with the source's own header list.

This is cosmetic but actively misleading: a reader cross-referencing the numbered
rationale in the header against the `[n]`-tagged catch bodies will map each case to the
wrong rationale, and the discrepancy undermines confidence in an otherwise carefully
documented classification seam.

**Recommendation**

Renumber the inline `// [n]` labels to match the header list (cancellation = `[1]`,
non-`SqlException` transient = `[3]`), or drop the bracketed numbers entirely and keep
the prose, so the comment is self-consistent. Align the test-file annotations to
whatever scheme is chosen.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): renumbered the inline `// [n]` case labels in `ExecuteWriteAsync` to match the method header's rationale list (cancellation=[1], SqlException=[2], non-Sql=[3]), and realigned the three corresponding labels in the test file. Comment-only.