Files
lmxopcua/code-reviews/Configuration/findings.md
Joseph Doherty b92fea15d4 fix(configuration): resolve Low code-review findings (Configuration-004,005,007,010,011)
- Configuration-004: NodePermissions stored as int to match the EF
  HasConversion<int>() in OtOpcUaConfigDbContext.ConfigureNodeAcl.
- Configuration-005: serialise LiteDbConfigCache.PutAsync so concurrent
  Put for the same (ClusterId, GenerationId) cannot duplicate rows.
- Configuration-007: rethrow OperationCanceledException from
  GenerationApplier.ApplyPass when the caller's token is cancelled.
- Configuration-010: scrub secrets and drop the full exception object
  from the ResilientConfigReader fallback warning log.
- Configuration-011: pin the previously-uncovered GenerationApplier
  cancellation and path-length / publish-validation paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 05:38:18 -04:00

193 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — Configuration
| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Configuration-001, Configuration-002, Configuration-003 |
| 2 | OtOpcUa conventions | Configuration-004 |
| 3 | Concurrency & thread safety | Configuration-005 |
| 4 | Error handling & resilience | Configuration-006, Configuration-007 |
| 5 | Security | Configuration-008, Configuration-009, Configuration-010 |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Configuration-011 |
| 10 | Documentation & comments | No issues found |
## Findings
### Configuration-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Migrations/20260417215224_StoredProcedures.cs:282` |
| Status | Resolved |
**Description:** `sp_PublishGeneration` invokes `EXEC dbo.sp_ValidateDraft @DraftGenerationId = @DraftGenerationId;` and then continues unconditionally to the reservation MERGE and the `Status='Published'` update. `sp_ValidateDraft` signals every failure with `RAISERROR(..., 16, 1)` followed by `RETURN`. A severity-16 `RAISERROR` is not a batch-aborting error and `SET XACT_ABORT ON` does not abort the transaction for it, so control returns to `sp_PublishGeneration`, which publishes the draft even though validation rejected it (cross-cluster namespace binding, dangling tag FKs, duplicate external identifiers, EquipmentUuid immutability all pass through). Pre-publish validation is effectively bypassed.
**Recommendation:** Wrap the `EXEC dbo.sp_ValidateDraft` in `BEGIN TRY ... END TRY BEGIN CATCH ROLLBACK; THROW; END CATCH` so the validation `RAISERROR` propagates and aborts the publish, or have `sp_ValidateDraft` return a result-set/output parameter that `sp_PublishGeneration` inspects and explicitly rolls back on. Add a regression test that publishes a draft with a known violation and asserts it stays `Draft`.
**Resolution:** Resolved 2026-05-22 — wrapped the `EXEC dbo.sp_ValidateDraft` call in `sp_PublishGeneration` in a `BEGIN TRY ... BEGIN CATCH ROLLBACK; THROW; END CATCH` block so a validation `RAISERROR` rolls back the publish transaction and re-raises instead of being silently ignored; added DB-backed regression test `Publish_aborts_when_ValidateDraft_rejects_the_draft`.
### Configuration-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Migrations/20260417215224_StoredProcedures.cs:325` |
| Status | Resolved |
**Description:** `sp_RollbackToGeneration` opens its own `BEGIN TRANSACTION`, clones rows into a new Draft, then `EXEC dbo.sp_PublishGeneration`, which itself runs `BEGIN TRANSACTION` (nesting `@@TRANCOUNT` to 2) and on its failure paths executes a bare `ROLLBACK`. A bare `ROLLBACK` rolls back to the outermost transaction and sets `@@TRANCOUNT` to 0, so when `sp_RollbackToGeneration` later reaches its own `COMMIT` it runs with no open transaction and raises error 3902. The rollback clone is silently discarded and the caller sees a confusing secondary error instead of the real publish failure.
**Recommendation:** Make `sp_PublishGeneration` transaction-nesting aware: capture `@@TRANCOUNT` on entry, only `BEGIN TRANSACTION` when zero (otherwise `SAVE TRANSACTION`), and only `COMMIT`/`ROLLBACK` the level it owns. Alternatively factor the publish body into an inner proc that assumes an ambient transaction.
**Resolution:** Resolved 2026-05-22 — made `sp_PublishGeneration` transaction-nesting aware: captures `@@TRANCOUNT` on entry, issues `BEGIN TRANSACTION` when zero or `SAVE TRANSACTION sp_PublishGeneration` when nested, and uses `ROLLBACK TRANSACTION sp_PublishGeneration` (savepoint rollback) on all failure paths in the nested case so the caller's outer transaction is not wiped; also wrapped `EXEC dbo.sp_ValidateDraft` in `BEGIN TRY ... END CATCH` so validation errors propagate correctly.
### Configuration-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs:73` |
| Status | Resolved |
**Description:** `ValidatePathLength` computes path length with hard-coded constants — it always charges 64 chars for Enterprise+Site (`32 + 32 + ...`) regardless of the cluster's actual values. This over-rejects: a short Enterprise/Site is penalised by up to 64 unused chars, so a legitimately under-200-char path can fail `PathTooLong`. The check also silently `continue`s when an equipment's `UnsLineId`/`UnsAreaId` does not resolve, so an orphaned-line path is never length-checked.
**Recommendation:** Pass the actual `Enterprise` and `Site` strings into the validator (e.g. on `DraftSnapshot`, or as parameters alongside `ValidateClusterTopology`) and compute the real length. If the cluster row cannot be supplied, document the check as a conservative upper bound or emit a lower-severity warning rather than a hard error.
**Resolution:** Resolved 2026-05-22 — added nullable `Enterprise` and `Site` properties to `DraftSnapshot`; `ValidatePathLength` uses actual lengths when set and falls back to the conservative 32-char upper bound per segment with a comment explaining the trade-off; `DraftValidationService` now loads the cluster row and populates both properties; added `PathLength_uses_actual_Enterprise_Site_when_provided` and `PathLength_conservative_fallback_when_Enterprise_Site_absent` unit tests.
### Configuration-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Enums/NodePermissions.cs:8`, `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/OtOpcUaConfigDbContext.cs:417` |
| Status | Resolved |
**Description:** `NodePermissions` is declared `[Flags] enum ... : uint`, while its XML doc and `NodeAcl.PermissionFlags`' doc both say "stored as int", and `ConfigureNodeAcl` uses `HasConversion<int>()` — a `uint``int` conversion. Only bits 011 are used today, but the underlying-type/storage-type mismatch is a latent trap: a future bit-31 flag yields a `uint` value that overflows `int` and the conversion round-trip would corrupt it.
**Recommendation:** Change the enum underlying type to `int` (consistent with the docs and the conversion). No high bit is in use, so this is the smaller change.
**Resolution:** Resolved 2026-05-23 — changed `NodePermissions` underlying type from `uint` to `int` so it matches the documented "stored as int" semantics and the `HasConversion<int>()` mapping in `OtOpcUaConfigDbContext.ConfigureNodeAcl`. Added regression test `NodePermissionsTests` pinning the underlying type and round-trip safety through `int` storage.
### Configuration-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/LiteDbConfigCache.cs:50` |
| Status | Resolved |
**Description:** `PutAsync` performs a non-atomic find-then-insert/update. Two concurrent `PutAsync` calls for the same `(ClusterId, GenerationId)` can both observe `existing is null` and both `Insert`, producing two rows for one generation. The constructor's `EnsureIndex` calls are non-unique, so the storage layer does not prevent the duplicate, and `PruneOldGenerationsAsync`'s `keepLatest` accounting is then off.
**Recommendation:** Declare a unique index on `(ClusterId, GenerationId)` and treat the duplicate-key exception as an idempotent no-op, or guard `PutAsync` with an instance `SemaphoreSlim`/lock. Document the concurrency contract on `ILocalConfigCache`.
**Resolution:** Resolved 2026-05-23 — guarded `PutAsync` with an instance-level `SemaphoreSlim` so the find-then-insert/update window runs atomically for a given cache instance; documented the concurrency contract on `ILocalConfigCache`. Added regression test `PutAsync_concurrent_for_same_cluster_and_generation_does_not_duplicate` that runs 64 concurrent puts and inspects the LiteDB file directly to confirm exactly one row per `(ClusterId, GenerationId)` survives.
### Configuration-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/ResilientConfigReader.cs:79` |
| Status | Resolved |
**Description:** The fallback `catch` filters on `ex is not OperationCanceledException`. A SQL command timeout surfaced by ADO.NET as a `TaskCanceledException` (derives from `OperationCanceledException`) is then treated as caller cancellation and propagates instead of falling back to the sealed cache — the opposite of the documented "fallback on any exception including timeout". The retry `ShouldHandle` predicate has the same shape, so command-timeout cancellations are also not retried consistently.
**Recommendation:** Distinguish caller cancellation from command-timeout cancellation explicitly: inspect `cancellationToken.IsCancellationRequested` to decide whether an `OperationCanceledException` is a genuine cancel (rethrow) or a timeout (fall back). Add unit tests for both a `TimeoutRejectedException` path and a command-timeout `TaskCanceledException` path asserting cache fallback occurs.
**Resolution:** Resolved 2026-05-22 — changed the fallback `catch` filter to `ex is not OperationCanceledException || !cancellationToken.IsCancellationRequested` so a command-timeout `TaskCanceledException` (caller token not cancelled) triggers cache fallback while genuine caller cancellation still propagates; changed the retry `ShouldHandle` predicate to `Handle<Exception>()` (handles all exceptions, relying on Polly's own cancellation-token check to stop retrying on genuine cancellation); added three unit tests: `CommandTimeout_TaskCanceledException_FallsBackToCache`, `PollyTimeout_TimeoutRejectedException_FallsBackToCache`, and `CallerCancellation_Propagates_NotFallback`.
### Configuration-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Apply/GenerationApplier.cs:44` |
| Status | Resolved |
**Description:** `ApplyPass` wraps each callback in `catch (Exception ex)`. This swallows `OperationCanceledException` — a cancellation during a callback is recorded as just another entity error string and the applier keeps walking the remaining passes instead of stopping. It also masks fatal exceptions. The applier continues applying Added/Modified passes even after a Removed callback failed, leaving a partially-applied runtime state.
**Recommendation:** Rethrow `OperationCanceledException` rather than recording it as an entity error; call `ct.ThrowIfCancellationRequested()` between passes. Document or enforce whether a failed Removed pass should abort before the Added/Modified passes run.
**Resolution:** Resolved 2026-05-23 — added `catch (OperationCanceledException) when (ct.IsCancellationRequested) { throw; }` ahead of the generic catch in `ApplyPass` so genuine caller cancellation propagates rather than being recorded as an entity error, and added a `ct.ThrowIfCancellationRequested()` at the top of each Added/Modified pass iteration. The "failed Removed pass keeps walking Added/Modified" behaviour was confirmed as the intended contract (cascades must settle) and pinned by `Apply_continues_to_Added_pass_when_a_Removed_callback_throws`. New regression tests: `Apply_propagates_OperationCanceledException_from_callback_when_token_cancelled`, `Apply_stops_between_passes_when_cancellation_requested`.
### Configuration-008
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Migrations/20260417215224_StoredProcedures.cs:150`, `:373`, `:468` |
| Status | Resolved |
**Description:** Three stored procedures build `ConfigAuditLog.DetailsJson` by raw string concatenation of caller-supplied `nvarchar` parameters: `sp_RegisterNodeGenerationApplied` (`@Status`), `sp_RollbackToGeneration` (`@TargetGenerationId`), `sp_ReleaseExternalIdReservation` (`@Kind`, `@Value`). A value with a double-quote or backslash produces malformed JSON; combined with the `CK_ConfigAuditLog_DetailsJson_IsJson` check constraint, the `INSERT` fails the constraint and aborts the surrounding publish/rollback transaction (denial of operation). It is also a JSON-injection vector that can silently rewrite the audit record's shape.
**Recommendation:** Build the JSON with a safe constructor (`FOR JSON PATH, WITHOUT_ARRAY_WRAPPER` or `JSON_OBJECT(...)` on SQL Server 2022+) so values are properly escaped, or run each interpolated value through `STRING_ESCAPE(@Value, 'json')`. Add tests with quote/backslash-containing inputs.
**Resolution:** Resolved 2026-05-22 — routed every caller-supplied string interpolated into `DetailsJson` through `STRING_ESCAPE(@x, 'json')` (`@Status` in `sp_RegisterNodeGenerationApplied`; `@Kind`/`@Value` in `sp_ReleaseExternalIdReservation`) and emitted `sp_RollbackToGeneration`'s `@TargetGenerationId` as a bare JSON number via explicit `CONVERT(nvarchar(20), CONVERT(bigint, ...))`; added DB-backed regression tests `RegisterNodeGenerationApplied_escapes_quotes_in_audit_DetailsJson` and `ReleaseReservation_escapes_quotes_in_audit_DetailsJson` that round-trip quote/backslash inputs through `ISJSON`/`JSON_VALUE`.
### Configuration-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/DesignTimeDbContextFactory.cs:14` |
| Status | Resolved |
**Description:** `DefaultConnectionString` embeds a plaintext `sa` password with `User Id=sa` directly in source, checked into the repository. Although used only at design time (`dotnet ef`), a checked-in `sa` credential normalises committing DB passwords and, if live for the shared dev SQL Server, grants `sa` to anyone with repo access. `TrustServerCertificate=True` plus `Encrypt=False` additionally disables transport protection for that connection.
**Recommendation:** Drop the embedded credential. Fall back to integrated auth (`Trusted_Connection=True`) or fail fast with a message instructing the developer to set `OTOPCUA_CONFIG_CONNECTION`. Rotate the dev `sa` password if this value is live.
**Resolution:** Resolved 2026-05-22 — removed the embedded `sa` password and `DefaultConnectionString` constant entirely; `CreateDbContext` now throws `InvalidOperationException` with a clear setup message when `OTOPCUA_CONFIG_CONNECTION` is not set, rather than silently falling back to a hardcoded credential; added XML-doc example showing the recommended integrated-auth connection string.
### Configuration-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Security |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/ResilientConfigReader.cs:81` |
| Status | Resolved |
**Description:** On central-DB read failure the warning log records the full exception object. Callers pass arbitrary `centralFetch` delegates; if any delegate closes over a connection string, an exception thrown from it (or a `SqlException` carrying server/credential context) is logged verbatim. There is no scrubbing of connection-string fragments before logging, against the project's no-secret-logging rule.
**Recommendation:** Log `ex.GetType().Name` and `ex.Message` for SQL failures rather than the full exception, or run exception messages through a connection-string scrubber before they reach the log sink.
**Resolution:** Resolved 2026-05-23 — stopped passing the raw exception object to `LogWarning`; the fallback log now records only `ex.GetType().Name` and a `ScrubSecrets`-redacted `ex.Message` so connection-string fragments (Password, User Id, Pwd, Uid, AccessToken, Authorization, ApiKey/Api-Key) are stripped before reaching any sink. Added regression test `FallbackWarning_does_not_log_full_exception_object_or_password_fragment` that captures emitted log records and asserts no raw exception attached and no credential keyword present in the rendered message.
### Configuration-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Apply/GenerationApplier.cs:7`, `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs:60` |
| Status | Resolved |
**Description:** The companion test project covers the cache, schema compliance, stored procedures, and `DraftValidator` well, but two flagged behaviours are not pinned: (a) `GenerationApplier` ordering/cancellation when a Removed callback fails — no test asserts the Added/Modified passes still run or that cancellation aborts; (b) `ValidatePathLength`'s constant 32+32 approximation — no test exercises a long Enterprise/Site. The publish-bypasses-validation bug (Configuration-001) is also untested against the live SQL fixture.
**Recommendation:** Add `GenerationApplierTests` cases for a throwing callback (assert error recorded, assert cancellation propagates) and a `DraftValidatorTests` path-length boundary case. Add a `StoredProceduresTests` case that publishes an invalid draft and asserts it stays `Draft`.
**Resolution:** Resolved 2026-05-23 — all three gaps now covered. (a) `GenerationApplierTests.Apply_continues_to_Added_pass_when_a_Removed_callback_throws` pins ordering; `Apply_propagates_OperationCanceledException_from_callback_when_token_cancelled` and `Apply_stops_between_passes_when_cancellation_requested` (added under Configuration-007) pin cancellation. (b) `DraftValidatorTests.PathLength_uses_actual_Enterprise_Site_when_provided` and `PathLength_conservative_fallback_when_Enterprise_Site_absent` (added under Configuration-003) pin the path-length boundary. (c) `StoredProceduresTests.Publish_aborts_when_ValidateDraft_rejects_the_draft` (added under Configuration-001) pins the publish-bypasses-validation regression against the live SQL fixture.