diff --git a/code-reviews/Configuration/findings.md b/code-reviews/Configuration/findings.md index b62a1bb1..040f864c 100644 --- a/code-reviews/Configuration/findings.md +++ b/code-reviews/Configuration/findings.md @@ -4,10 +4,10 @@ |---|---| | Module | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration` | | Reviewer | Claude Code | -| Review date | 2026-05-22 | -| Commit reviewed | `76d35d1` | +| Review date | 2026-06-19 (re-review; first reviewed 2026-05-22) | +| Commit reviewed | `7286d320` (re-review; was `76d35d1`) | | Status | Reviewed | -| Open findings | 0 | +| Open findings | 2 | ## Checklist coverage @@ -190,3 +190,67 @@ **Recommendation:** Add `GenerationApplierTests` cases for a throwing callback (assert error recorded, assert cancellation propagates) and a `DraftValidatorTests` path-length boundary case. Add a `StoredProceduresTests` case that publishes an invalid draft and asserts it stays `Draft`. **Resolution:** Resolved 2026-05-23 — all three gaps now covered. (a) `GenerationApplierTests.Apply_continues_to_Added_pass_when_a_Removed_callback_throws` pins ordering; `Apply_propagates_OperationCanceledException_from_callback_when_token_cancelled` and `Apply_stops_between_passes_when_cancellation_requested` (added under Configuration-007) pin cancellation. (b) `DraftValidatorTests.PathLength_uses_actual_Enterprise_Site_when_provided` and `PathLength_conservative_fallback_when_Enterprise_Site_absent` (added under Configuration-003) pin the path-length boundary. (c) `StoredProceduresTests.Publish_aborts_when_ValidateDraft_rejects_the_draft` (added under Configuration-001) pins the publish-bypasses-validation regression against the live SQL fixture. + +## Re-review 2026-06-19 (commit 7286d320) + +Re-reviewed the full non-migration C# data layer at `7286d320` (the Configuration source is byte-identical between `7286d320` and the working HEAD `8ac5a2db` — `git diff 7286d320 HEAD` shows no non-migration Configuration changes). Migration files remain off-limits and were not touched. + +Prior findings Configuration-001…011 remain Resolved. Notable since the first review: the v1 generation model and its stored-proc publish path are retired; `DraftValidationService` (named in the Configuration-003 resolution) is replaced by `DraftSnapshotFactory`, which deliberately leaves `Enterprise`/`Site` null so `ValidatePathLength` uses its conservative upper bound — an intentional, documented fallback, not a regression. The DB-backed `StoredProceduresTests` referenced by the Configuration-001/008/011 resolutions no longer exist in the test project (consolidated out when the generation model dropped); the live SQL `SchemaComplianceTests`, `DriverHostStatusTests`, and `AuthorizationTests` still run against the Docker SQL fixture and pass. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Configuration-012 | +| 2 | OtOpcUa conventions | No issues found | +| 3 | Concurrency & thread safety | Configuration-012 | +| 4 | Error handling & resilience | No issues found (Configuration-006/007 fixes verified still in place) | +| 5 | Security | No issues found (Configuration-008/009/010 fixes verified still in place; `DesignTimeDbContextFactory` carries no credential; `ResilientConfigReader.ScrubSecrets` intact) | +| 6 | Performance & resource management | No issues found (factory + query paths use `AsNoTracking`; cluster audit query is a bounded two-step, not N+1) | +| 7 | Design-document adherence | Configuration-013 | +| 8 | Code organization & conventions | No issues found | +| 9 | Testing coverage | Configuration-013 (dead-code path untested in prod wiring), Configuration-014 | +| 10 | Documentation & comments | Configuration-014 | + +### Configuration-012 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Concurrency & thread safety | +| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/LiteDbConfigCache.cs:31`, `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs:71`, `:129` | +| Status | Resolved | + +**Description:** Both LiteDB-backed caches constructed their `LiteDatabase` with the default constructor, which binds the process-wide singleton `BsonMapper.Global`. LiteDB resolves a POCO's members into that mapper **lazily on first use and without thread-safety across concurrently-constructed `LiteDatabase` instances**. When two caches initialise or run in parallel — e.g. a `GenerationSealedCache.SealAsync` racing a `ReadCurrentAsync`, or `LiteDbConfigCache` racing the sealed cache, or simply the test suite running both classes' fixtures in parallel — the global mapper races, surfacing as `NotSupportedException: Member ClusterId not found on BsonMapper for type GenerationSnapshot` (the query predicate can't resolve the member) or `LiteException: Cannot insert duplicate key in unique index '_id'. The duplicate value is '0'` (the `int Id` auto-id mapping was lost, so `Insert` writes a literal `0` twice). The per-instance `_writeGate` added under Configuration-005 serialises one instance's find-then-insert but does nothing about this cross-instance global-mapper hazard. Reproduced deterministically under thread-pool contention (the Configuration-005 regression test `PutAsync_concurrent_…` was intermittently failing in the full suite for exactly this reason). This is a real runtime fault, not test-only: the resilient-read fallback path and the seal path can run concurrently in a live node. + +**Recommendation:** Give each `LiteDatabase` a private `BsonMapper` with the entity pre-registered (`mapper.Entity()`) so member resolution happens once, single-threaded, at construction and never touches the global mapper. + +**Resolution:** Resolved 2026-06-19 — added a private `BuildMapper()` to both `LiteDbConfigCache` and `GenerationSealedCache` that constructs a fresh `BsonMapper`, calls `mapper.Entity()`, and passes it to every `new LiteDatabase(...)` (construction, seal-write, sealed-read). Added regression test `LiteDbConfigCacheTests.Concurrent_cache_instances_do_not_race_the_shared_bson_mapper` (24 cache instances × 16 concurrent puts each); confirmed it fails on the unfixed code (`Member ClusterId not found on BsonMapper`) and passes on the fixed code, and the previously-flaky `PutAsync_concurrent_…` is now stable across repeated full-suite runs. + +### Configuration-013 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Design-document adherence | +| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs:243` (`ValidateClusterTopology`) | +| Status | Open | + +**Description:** `DraftValidator.ValidateClusterTopology` is documented as the managed pre-publish guard that catches cluster-topology drift the SQL `CK_ServerCluster_RedundancyMode_NodeCount` check cannot see — specifically an operator disabling a `ClusterNode` (effective enabled-count = 1) while `RedundancyMode` stays `Hot`/`Warm`, which would boot the runtime into an invalid-topology band. It is fully unit-tested (`DraftValidatorTests` §"ValidateClusterTopology") but **no production code calls it.** The deploy gate in `AdminOperationsActor.StartDeployment` runs `DraftValidator.Validate(...)` (the snapshot rules) but never `ValidateClusterTopology(...)`, so the documented enabled-node-count guard is inert at deploy time — the only thing standing is the row-level SQL CHECK, which the doc explicitly says is insufficient. + +**Recommendation:** Wire `ValidateClusterTopology` into the deploy/publish path — load the `ServerCluster` row(s) + their `ClusterNode`s and run it alongside `Validate`, folding its errors into the same reject summary. The fix belongs in `src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs` (a different module), so it is **deferred from this module's edit scope** and recorded here against the now-dead Configuration-layer method. Cross-module: ControlPlane. + +**Resolution:** _(open — fix is in the ControlPlane module's `AdminOperationsActor`, outside Configuration's edit scope)_ + +### Configuration-014 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Services/ILdapGroupRoleMappingService.cs:23`, `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Services/LdapGroupRoleMappingService.cs:24` | +| Status | Open | + +**Description:** `ILdapGroupRoleMappingService.GetByGroupsAsync`'s XML doc asserts "Case-insensitive per LDAP conventions", but the implementation is `db.LdapGroupRoleMappings.Where(m => groupSet.Contains(m.LdapGroup))`, which translates to a SQL `IN (…)` whose case-sensitivity is entirely determined by the column's collation. On the default case-insensitive server collation (the dev/Docker SQL) the doc holds; on a case-sensitive-collation deployment the lookup would silently miss rows that differ only in case — and the in-memory EF unit tests (which match exact case) would not catch it. This is a hot path: it fires on every Admin-UI sign-in, so a silent miss denies the user their role grant. The doc overstates a guarantee the code does not enforce. + +**Recommendation:** Either (a) soften the doc to state the match is collation-dependent and document the required CI collation as a deployment constraint, or (b) pin the `LdapGroup` column to an explicit `_CI_` collation in the model so the guarantee is enforced regardless of server default. Option (b) is a schema/migration change and must be deferred (no migration edits in this review); option (a) is a one-line doc change but only papers over the gap. Left **Open** pending a decision on which guarantee to commit to — no code change applied because the documented behaviour currently matches the deployment reality (CI server collation). + +**Resolution:** _(open — preferred fix (b) pins the column collation, which needs an EF migration; deferred per the no-migration rule)_ diff --git a/src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs b/src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs index a64ddf3a..e152bfa5 100644 --- a/src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs +++ b/src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs @@ -28,6 +28,18 @@ public sealed class GenerationSealedCache private const string CurrentPointerFileName = "CURRENT"; private readonly string _cacheRoot; + // Private per-database BsonMapper with the entity pre-registered. LiteDB's default + // BsonMapper.Global is a process-wide singleton whose lazy per-type member registration is + // not thread-safe across concurrently-constructed LiteDatabase instances; a seal racing a + // read (or this cache racing LiteDbConfigCache) corrupts the global mapper, surfacing as + // "Member … not found on BsonMapper" or a bogus duplicate-_id insert — Configuration-012. + private static BsonMapper BuildMapper() + { + var mapper = new BsonMapper(); + mapper.Entity(); + return mapper; + } + /// Root directory for all clusters' sealed caches. public string CacheRoot => _cacheRoot; @@ -68,7 +80,7 @@ public sealed class GenerationSealedCache var tmpPath = sealedPath + ".tmp"; try { - using (var db = new LiteDatabase(new ConnectionString { Filename = tmpPath, Upgrade = false })) + using (var db = new LiteDatabase(new ConnectionString { Filename = tmpPath, Upgrade = false }, BuildMapper())) { var col = db.GetCollection(CollectionName); col.Insert(snapshot); @@ -126,7 +138,7 @@ public sealed class GenerationSealedCache try { - using var db = new LiteDatabase(new ConnectionString { Filename = sealedPath, ReadOnly = true }); + using var db = new LiteDatabase(new ConnectionString { Filename = sealedPath, ReadOnly = true }, BuildMapper()); var col = db.GetCollection(CollectionName); var snapshot = col.FindAll().FirstOrDefault() ?? throw new GenerationCacheUnavailableException( diff --git a/src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/LiteDbConfigCache.cs b/src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/LiteDbConfigCache.cs index 39e5dfb4..75b144b0 100644 --- a/src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/LiteDbConfigCache.cs +++ b/src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/LiteDbConfigCache.cs @@ -11,6 +11,21 @@ namespace ZB.MOM.WW.OtOpcUa.Configuration.LocalCache; public sealed class LiteDbConfigCache : ILocalConfigCache, IDisposable { private const string CollectionName = "generations"; + + // LiteDB's default BsonMapper.Global is a process-wide singleton whose per-type member + // registration is lazy and NOT thread-safe across concurrently-constructed LiteDatabase + // instances. When several caches (this one + GenerationSealedCache) initialise in parallel + // the global mapper races, surfacing as "Member ClusterId not found on BsonMapper" or a + // bogus "duplicate key _id = 0" (the int auto-id mapping was lost so Insert writes a literal + // 0 twice) — Configuration-012. Give each database a private, pre-registered mapper so member + // resolution happens once, single-threaded, at construction and never touches the global. + private static BsonMapper BuildMapper() + { + var mapper = new BsonMapper(); + mapper.Entity(); + return mapper; + } + private readonly LiteDatabase _db; private readonly ILiteCollection _col; // PutAsync is a find-then-insert/update; without serialization, two concurrent puts for the @@ -28,7 +43,7 @@ public sealed class LiteDbConfigCache : ILocalConfigCache, IDisposable // the header and "recover"), so we force a write + read probe to fail fast on real corruption. try { - _db = new LiteDatabase(new ConnectionString { Filename = dbPath, Upgrade = true }); + _db = new LiteDatabase(new ConnectionString { Filename = dbPath, Upgrade = true }, BuildMapper()); _col = _db.GetCollection(CollectionName); _col.EnsureIndex(s => s.ClusterId); _col.EnsureIndex(s => s.GenerationId); diff --git a/tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/LiteDbConfigCacheTests.cs b/tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/LiteDbConfigCacheTests.cs index 527c5b66..0da9d589 100644 --- a/tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/LiteDbConfigCacheTests.cs +++ b/tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/LiteDbConfigCacheTests.cs @@ -129,6 +129,51 @@ public sealed class LiteDbConfigCacheTests : IDisposable $"PutAsync must upsert atomically — found {gen42Count} rows for (c-1, gen=42) after 64 concurrent puts"); } + // ------------------------------------------------------------------------------------ + // Configuration-012 — the per-instance _writeGate (Configuration-005) does not protect + // against LiteDB's process-wide BsonMapper.Global lazy-init race. Many cache INSTANCES + // constructed + driven concurrently corrupt the shared global mapper, surfacing as + // "Member ClusterId not found on BsonMapper" or a bogus "duplicate key _id = 0". A private + // per-database mapper with the entity pre-registered fixes it. + // ------------------------------------------------------------------------------------ + /// Verifies that many cache instances constructed and driven concurrently do not + /// corrupt LiteDB's shared global BsonMapper — each Put/Get round-trips its own payload and + /// no insert throws a member-not-found or duplicate-_id exception. + [Fact] + public async Task Concurrent_cache_instances_do_not_race_the_shared_bson_mapper() + { + var paths = new List(); + try + { + var outer = Enumerable.Range(0, 24).Select(i => Task.Run(async () => + { + var path = Path.Combine(Path.GetTempPath(), $"otopcua-cache-mapperrace-{Guid.NewGuid():N}.db"); + lock (paths) paths.Add(path); + + using var cache = new LiteDbConfigCache(path); + // Pre-seed a sentinel, then hammer one (cluster, gen) from many threads. + await cache.PutAsync(Snapshot($"c-{i}", 99)); + var inner = Enumerable.Range(0, 16) + .Select(_ => Task.Run(() => cache.PutAsync(Snapshot($"c-{i}", 42)))) + .ToArray(); + await Task.WhenAll(inner); + + var got = await cache.GetMostRecentAsync($"c-{i}"); + got.ShouldNotBeNull(); + got!.GenerationId.ShouldBe(99); // 99 > 42, latest by GenerationId + })).ToArray(); + + // The unfixed code throws LiteException / NotSupportedException out of these tasks under + // the global-mapper race; the fixed code completes cleanly. + await Task.WhenAll(outer); + } + finally + { + foreach (var p in paths) + if (File.Exists(p)) File.Delete(p); + } + } + /// Verifies that a corrupted cache file surfaces as LocalConfigCacheCorruptException. [Fact] public void Corrupt_file_surfaces_as_LocalConfigCacheCorruptException()