review(Configuration): fix LiteDB global BsonMapper cross-instance race (High)
Re-review at 7286d320. Configuration-012 (High): LiteDbConfigCache/GenerationSealedCache built
LiteDatabase on the process-wide BsonMapper.Global whose lazy member resolution races across
concurrently-constructed DBs (NotSupportedException/duplicate-key under contention; also caused
intermittent suite flakiness). Fix: per-cache fresh BsonMapper + pre-registered entity + TDD.
-013 (dead ValidateClusterTopology, ControlPlane) / -014 (collation case-sensitivity, needs
migration) deferred. No migration touched.
This commit is contained in:
@@ -4,10 +4,10 @@
|
||||
|---|---|
|
||||
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration` |
|
||||
| Reviewer | Claude Code |
|
||||
| Review date | 2026-05-22 |
|
||||
| Commit reviewed | `76d35d1` |
|
||||
| Review date | 2026-06-19 (re-review; first reviewed 2026-05-22) |
|
||||
| Commit reviewed | `7286d320` (re-review; was `76d35d1`) |
|
||||
| Status | Reviewed |
|
||||
| Open findings | 0 |
|
||||
| Open findings | 2 |
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
@@ -190,3 +190,67 @@
|
||||
**Recommendation:** Add `GenerationApplierTests` cases for a throwing callback (assert error recorded, assert cancellation propagates) and a `DraftValidatorTests` path-length boundary case. Add a `StoredProceduresTests` case that publishes an invalid draft and asserts it stays `Draft`.
|
||||
|
||||
**Resolution:** Resolved 2026-05-23 — all three gaps now covered. (a) `GenerationApplierTests.Apply_continues_to_Added_pass_when_a_Removed_callback_throws` pins ordering; `Apply_propagates_OperationCanceledException_from_callback_when_token_cancelled` and `Apply_stops_between_passes_when_cancellation_requested` (added under Configuration-007) pin cancellation. (b) `DraftValidatorTests.PathLength_uses_actual_Enterprise_Site_when_provided` and `PathLength_conservative_fallback_when_Enterprise_Site_absent` (added under Configuration-003) pin the path-length boundary. (c) `StoredProceduresTests.Publish_aborts_when_ValidateDraft_rejects_the_draft` (added under Configuration-001) pins the publish-bypasses-validation regression against the live SQL fixture.
|
||||
|
||||
## Re-review 2026-06-19 (commit 7286d320)
|
||||
|
||||
Re-reviewed the full non-migration C# data layer at `7286d320` (the Configuration source is byte-identical between `7286d320` and the working HEAD `8ac5a2db` — `git diff 7286d320 HEAD` shows no non-migration Configuration changes). Migration files remain off-limits and were not touched.
|
||||
|
||||
Prior findings Configuration-001…011 remain Resolved. Notable since the first review: the v1 generation model and its stored-proc publish path are retired; `DraftValidationService` (named in the Configuration-003 resolution) is replaced by `DraftSnapshotFactory`, which deliberately leaves `Enterprise`/`Site` null so `ValidatePathLength` uses its conservative upper bound — an intentional, documented fallback, not a regression. The DB-backed `StoredProceduresTests` referenced by the Configuration-001/008/011 resolutions no longer exist in the test project (consolidated out when the generation model dropped); the live SQL `SchemaComplianceTests`, `DriverHostStatusTests`, and `AuthorizationTests` still run against the Docker SQL fixture and pass.
|
||||
|
||||
| # | Category | Result |
|
||||
|---|---|---|
|
||||
| 1 | Correctness & logic bugs | Configuration-012 |
|
||||
| 2 | OtOpcUa conventions | No issues found |
|
||||
| 3 | Concurrency & thread safety | Configuration-012 |
|
||||
| 4 | Error handling & resilience | No issues found (Configuration-006/007 fixes verified still in place) |
|
||||
| 5 | Security | No issues found (Configuration-008/009/010 fixes verified still in place; `DesignTimeDbContextFactory` carries no credential; `ResilientConfigReader.ScrubSecrets` intact) |
|
||||
| 6 | Performance & resource management | No issues found (factory + query paths use `AsNoTracking`; cluster audit query is a bounded two-step, not N+1) |
|
||||
| 7 | Design-document adherence | Configuration-013 |
|
||||
| 8 | Code organization & conventions | No issues found |
|
||||
| 9 | Testing coverage | Configuration-013 (dead-code path untested in prod wiring), Configuration-014 |
|
||||
| 10 | Documentation & comments | Configuration-014 |
|
||||
|
||||
### Configuration-012
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | High |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/LiteDbConfigCache.cs:31`, `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs:71`, `:129` |
|
||||
| Status | Resolved |
|
||||
|
||||
**Description:** Both LiteDB-backed caches constructed their `LiteDatabase` with the default constructor, which binds the process-wide singleton `BsonMapper.Global`. LiteDB resolves a POCO's members into that mapper **lazily on first use and without thread-safety across concurrently-constructed `LiteDatabase` instances**. When two caches initialise or run in parallel — e.g. a `GenerationSealedCache.SealAsync` racing a `ReadCurrentAsync`, or `LiteDbConfigCache` racing the sealed cache, or simply the test suite running both classes' fixtures in parallel — the global mapper races, surfacing as `NotSupportedException: Member ClusterId not found on BsonMapper for type GenerationSnapshot` (the query predicate can't resolve the member) or `LiteException: Cannot insert duplicate key in unique index '_id'. The duplicate value is '0'` (the `int Id` auto-id mapping was lost, so `Insert` writes a literal `0` twice). The per-instance `_writeGate` added under Configuration-005 serialises one instance's find-then-insert but does nothing about this cross-instance global-mapper hazard. Reproduced deterministically under thread-pool contention (the Configuration-005 regression test `PutAsync_concurrent_…` was intermittently failing in the full suite for exactly this reason). This is a real runtime fault, not test-only: the resilient-read fallback path and the seal path can run concurrently in a live node.
|
||||
|
||||
**Recommendation:** Give each `LiteDatabase` a private `BsonMapper` with the entity pre-registered (`mapper.Entity<GenerationSnapshot>()`) so member resolution happens once, single-threaded, at construction and never touches the global mapper.
|
||||
|
||||
**Resolution:** Resolved 2026-06-19 — added a private `BuildMapper()` to both `LiteDbConfigCache` and `GenerationSealedCache` that constructs a fresh `BsonMapper`, calls `mapper.Entity<GenerationSnapshot>()`, and passes it to every `new LiteDatabase(...)` (construction, seal-write, sealed-read). Added regression test `LiteDbConfigCacheTests.Concurrent_cache_instances_do_not_race_the_shared_bson_mapper` (24 cache instances × 16 concurrent puts each); confirmed it fails on the unfixed code (`Member ClusterId not found on BsonMapper`) and passes on the fixed code, and the previously-flaky `PutAsync_concurrent_…` is now stable across repeated full-suite runs.
|
||||
|
||||
### Configuration-013
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs:243` (`ValidateClusterTopology`) |
|
||||
| Status | Open |
|
||||
|
||||
**Description:** `DraftValidator.ValidateClusterTopology` is documented as the managed pre-publish guard that catches cluster-topology drift the SQL `CK_ServerCluster_RedundancyMode_NodeCount` check cannot see — specifically an operator disabling a `ClusterNode` (effective enabled-count = 1) while `RedundancyMode` stays `Hot`/`Warm`, which would boot the runtime into an invalid-topology band. It is fully unit-tested (`DraftValidatorTests` §"ValidateClusterTopology") but **no production code calls it.** The deploy gate in `AdminOperationsActor.StartDeployment` runs `DraftValidator.Validate(...)` (the snapshot rules) but never `ValidateClusterTopology(...)`, so the documented enabled-node-count guard is inert at deploy time — the only thing standing is the row-level SQL CHECK, which the doc explicitly says is insufficient.
|
||||
|
||||
**Recommendation:** Wire `ValidateClusterTopology` into the deploy/publish path — load the `ServerCluster` row(s) + their `ClusterNode`s and run it alongside `Validate`, folding its errors into the same reject summary. The fix belongs in `src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs` (a different module), so it is **deferred from this module's edit scope** and recorded here against the now-dead Configuration-layer method. Cross-module: ControlPlane.
|
||||
|
||||
**Resolution:** _(open — fix is in the ControlPlane module's `AdminOperationsActor`, outside Configuration's edit scope)_
|
||||
|
||||
### Configuration-014
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | Low |
|
||||
| Category | Documentation & comments |
|
||||
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Services/ILdapGroupRoleMappingService.cs:23`, `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Services/LdapGroupRoleMappingService.cs:24` |
|
||||
| Status | Open |
|
||||
|
||||
**Description:** `ILdapGroupRoleMappingService.GetByGroupsAsync`'s XML doc asserts "Case-insensitive per LDAP conventions", but the implementation is `db.LdapGroupRoleMappings.Where(m => groupSet.Contains(m.LdapGroup))`, which translates to a SQL `IN (…)` whose case-sensitivity is entirely determined by the column's collation. On the default case-insensitive server collation (the dev/Docker SQL) the doc holds; on a case-sensitive-collation deployment the lookup would silently miss rows that differ only in case — and the in-memory EF unit tests (which match exact case) would not catch it. This is a hot path: it fires on every Admin-UI sign-in, so a silent miss denies the user their role grant. The doc overstates a guarantee the code does not enforce.
|
||||
|
||||
**Recommendation:** Either (a) soften the doc to state the match is collation-dependent and document the required CI collation as a deployment constraint, or (b) pin the `LdapGroup` column to an explicit `_CI_` collation in the model so the guarantee is enforced regardless of server default. Option (b) is a schema/migration change and must be deferred (no migration edits in this review); option (a) is a one-line doc change but only papers over the gap. Left **Open** pending a decision on which guarantee to commit to — no code change applied because the documented behaviour currently matches the deployment reality (CI server collation).
|
||||
|
||||
**Resolution:** _(open — preferred fix (b) pins the column collation, which needs an EF migration; deferred per the no-migration rule)_
|
||||
|
||||
Reference in New Issue
Block a user