review(Configuration): fix LiteDB global BsonMapper cross-instance race (High)

Re-review at 7286d320. Configuration-012 (High): LiteDbConfigCache/GenerationSealedCache built
LiteDatabase on the process-wide BsonMapper.Global whose lazy member resolution races across
concurrently-constructed DBs (NotSupportedException/duplicate-key under contention; also caused
intermittent suite flakiness). Fix: per-cache fresh BsonMapper + pre-registered entity + TDD.
-013 (dead ValidateClusterTopology, ControlPlane) / -014 (collation case-sensitivity, needs
migration) deferred. No migration touched.
This commit is contained in:
Joseph Doherty
2026-06-19 11:06:56 -04:00
parent 145b06bec9
commit c3d148e396
4 changed files with 142 additions and 6 deletions
+67 -3
View File
@@ -4,10 +4,10 @@
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Review date | 2026-06-19 (re-review; first reviewed 2026-05-22) |
| Commit reviewed | `7286d320` (re-review; was `76d35d1`) |
| Status | Reviewed |
| Open findings | 0 |
| Open findings | 2 |
## Checklist coverage
@@ -190,3 +190,67 @@
**Recommendation:** Add `GenerationApplierTests` cases for a throwing callback (assert error recorded, assert cancellation propagates) and a `DraftValidatorTests` path-length boundary case. Add a `StoredProceduresTests` case that publishes an invalid draft and asserts it stays `Draft`.
**Resolution:** Resolved 2026-05-23 — all three gaps now covered. (a) `GenerationApplierTests.Apply_continues_to_Added_pass_when_a_Removed_callback_throws` pins ordering; `Apply_propagates_OperationCanceledException_from_callback_when_token_cancelled` and `Apply_stops_between_passes_when_cancellation_requested` (added under Configuration-007) pin cancellation. (b) `DraftValidatorTests.PathLength_uses_actual_Enterprise_Site_when_provided` and `PathLength_conservative_fallback_when_Enterprise_Site_absent` (added under Configuration-003) pin the path-length boundary. (c) `StoredProceduresTests.Publish_aborts_when_ValidateDraft_rejects_the_draft` (added under Configuration-001) pins the publish-bypasses-validation regression against the live SQL fixture.
## Re-review 2026-06-19 (commit 7286d320)
Re-reviewed the full non-migration C# data layer at `7286d320` (the Configuration source is byte-identical between `7286d320` and the working HEAD `8ac5a2db``git diff 7286d320 HEAD` shows no non-migration Configuration changes). Migration files remain off-limits and were not touched.
Prior findings Configuration-001…011 remain Resolved. Notable since the first review: the v1 generation model and its stored-proc publish path are retired; `DraftValidationService` (named in the Configuration-003 resolution) is replaced by `DraftSnapshotFactory`, which deliberately leaves `Enterprise`/`Site` null so `ValidatePathLength` uses its conservative upper bound — an intentional, documented fallback, not a regression. The DB-backed `StoredProceduresTests` referenced by the Configuration-001/008/011 resolutions no longer exist in the test project (consolidated out when the generation model dropped); the live SQL `SchemaComplianceTests`, `DriverHostStatusTests`, and `AuthorizationTests` still run against the Docker SQL fixture and pass.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Configuration-012 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Configuration-012 |
| 4 | Error handling & resilience | No issues found (Configuration-006/007 fixes verified still in place) |
| 5 | Security | No issues found (Configuration-008/009/010 fixes verified still in place; `DesignTimeDbContextFactory` carries no credential; `ResilientConfigReader.ScrubSecrets` intact) |
| 6 | Performance & resource management | No issues found (factory + query paths use `AsNoTracking`; cluster audit query is a bounded two-step, not N+1) |
| 7 | Design-document adherence | Configuration-013 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Configuration-013 (dead-code path untested in prod wiring), Configuration-014 |
| 10 | Documentation & comments | Configuration-014 |
### Configuration-012
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/LiteDbConfigCache.cs:31`, `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs:71`, `:129` |
| Status | Resolved |
**Description:** Both LiteDB-backed caches constructed their `LiteDatabase` with the default constructor, which binds the process-wide singleton `BsonMapper.Global`. LiteDB resolves a POCO's members into that mapper **lazily on first use and without thread-safety across concurrently-constructed `LiteDatabase` instances**. When two caches initialise or run in parallel — e.g. a `GenerationSealedCache.SealAsync` racing a `ReadCurrentAsync`, or `LiteDbConfigCache` racing the sealed cache, or simply the test suite running both classes' fixtures in parallel — the global mapper races, surfacing as `NotSupportedException: Member ClusterId not found on BsonMapper for type GenerationSnapshot` (the query predicate can't resolve the member) or `LiteException: Cannot insert duplicate key in unique index '_id'. The duplicate value is '0'` (the `int Id` auto-id mapping was lost, so `Insert` writes a literal `0` twice). The per-instance `_writeGate` added under Configuration-005 serialises one instance's find-then-insert but does nothing about this cross-instance global-mapper hazard. Reproduced deterministically under thread-pool contention (the Configuration-005 regression test `PutAsync_concurrent_…` was intermittently failing in the full suite for exactly this reason). This is a real runtime fault, not test-only: the resilient-read fallback path and the seal path can run concurrently in a live node.
**Recommendation:** Give each `LiteDatabase` a private `BsonMapper` with the entity pre-registered (`mapper.Entity<GenerationSnapshot>()`) so member resolution happens once, single-threaded, at construction and never touches the global mapper.
**Resolution:** Resolved 2026-06-19 — added a private `BuildMapper()` to both `LiteDbConfigCache` and `GenerationSealedCache` that constructs a fresh `BsonMapper`, calls `mapper.Entity<GenerationSnapshot>()`, and passes it to every `new LiteDatabase(...)` (construction, seal-write, sealed-read). Added regression test `LiteDbConfigCacheTests.Concurrent_cache_instances_do_not_race_the_shared_bson_mapper` (24 cache instances × 16 concurrent puts each); confirmed it fails on the unfixed code (`Member ClusterId not found on BsonMapper`) and passes on the fixed code, and the previously-flaky `PutAsync_concurrent_…` is now stable across repeated full-suite runs.
### Configuration-013
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs:243` (`ValidateClusterTopology`) |
| Status | Open |
**Description:** `DraftValidator.ValidateClusterTopology` is documented as the managed pre-publish guard that catches cluster-topology drift the SQL `CK_ServerCluster_RedundancyMode_NodeCount` check cannot see — specifically an operator disabling a `ClusterNode` (effective enabled-count = 1) while `RedundancyMode` stays `Hot`/`Warm`, which would boot the runtime into an invalid-topology band. It is fully unit-tested (`DraftValidatorTests` §"ValidateClusterTopology") but **no production code calls it.** The deploy gate in `AdminOperationsActor.StartDeployment` runs `DraftValidator.Validate(...)` (the snapshot rules) but never `ValidateClusterTopology(...)`, so the documented enabled-node-count guard is inert at deploy time — the only thing standing is the row-level SQL CHECK, which the doc explicitly says is insufficient.
**Recommendation:** Wire `ValidateClusterTopology` into the deploy/publish path — load the `ServerCluster` row(s) + their `ClusterNode`s and run it alongside `Validate`, folding its errors into the same reject summary. The fix belongs in `src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs` (a different module), so it is **deferred from this module's edit scope** and recorded here against the now-dead Configuration-layer method. Cross-module: ControlPlane.
**Resolution:** _(open — fix is in the ControlPlane module's `AdminOperationsActor`, outside Configuration's edit scope)_
### Configuration-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Services/ILdapGroupRoleMappingService.cs:23`, `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Services/LdapGroupRoleMappingService.cs:24` |
| Status | Open |
**Description:** `ILdapGroupRoleMappingService.GetByGroupsAsync`'s XML doc asserts "Case-insensitive per LDAP conventions", but the implementation is `db.LdapGroupRoleMappings.Where(m => groupSet.Contains(m.LdapGroup))`, which translates to a SQL `IN (…)` whose case-sensitivity is entirely determined by the column's collation. On the default case-insensitive server collation (the dev/Docker SQL) the doc holds; on a case-sensitive-collation deployment the lookup would silently miss rows that differ only in case — and the in-memory EF unit tests (which match exact case) would not catch it. This is a hot path: it fires on every Admin-UI sign-in, so a silent miss denies the user their role grant. The doc overstates a guarantee the code does not enforce.
**Recommendation:** Either (a) soften the doc to state the match is collation-dependent and document the required CI collation as a deployment constraint, or (b) pin the `LdapGroup` column to an explicit `_CI_` collation in the model so the guarantee is enforced regardless of server default. Option (b) is a schema/migration change and must be deferred (no migration edits in this review); option (a) is a one-line doc change but only papers over the gap. Left **Open** pending a decision on which guarantee to commit to — no code change applied because the documented behaviour currently matches the deployment reality (CI server collation).
**Resolution:** _(open — preferred fix (b) pins the column collation, which needs an EF migration; deferred per the no-migration rule)_
@@ -28,6 +28,18 @@ public sealed class GenerationSealedCache
private const string CurrentPointerFileName = "CURRENT";
private readonly string _cacheRoot;
// Private per-database BsonMapper with the entity pre-registered. LiteDB's default
// BsonMapper.Global is a process-wide singleton whose lazy per-type member registration is
// not thread-safe across concurrently-constructed LiteDatabase instances; a seal racing a
// read (or this cache racing LiteDbConfigCache) corrupts the global mapper, surfacing as
// "Member … not found on BsonMapper" or a bogus duplicate-_id insert — Configuration-012.
private static BsonMapper BuildMapper()
{
var mapper = new BsonMapper();
mapper.Entity<GenerationSnapshot>();
return mapper;
}
/// <summary>Root directory for all clusters' sealed caches.</summary>
public string CacheRoot => _cacheRoot;
@@ -68,7 +80,7 @@ public sealed class GenerationSealedCache
var tmpPath = sealedPath + ".tmp";
try
{
using (var db = new LiteDatabase(new ConnectionString { Filename = tmpPath, Upgrade = false }))
using (var db = new LiteDatabase(new ConnectionString { Filename = tmpPath, Upgrade = false }, BuildMapper()))
{
var col = db.GetCollection<GenerationSnapshot>(CollectionName);
col.Insert(snapshot);
@@ -126,7 +138,7 @@ public sealed class GenerationSealedCache
try
{
using var db = new LiteDatabase(new ConnectionString { Filename = sealedPath, ReadOnly = true });
using var db = new LiteDatabase(new ConnectionString { Filename = sealedPath, ReadOnly = true }, BuildMapper());
var col = db.GetCollection<GenerationSnapshot>(CollectionName);
var snapshot = col.FindAll().FirstOrDefault()
?? throw new GenerationCacheUnavailableException(
@@ -11,6 +11,21 @@ namespace ZB.MOM.WW.OtOpcUa.Configuration.LocalCache;
public sealed class LiteDbConfigCache : ILocalConfigCache, IDisposable
{
private const string CollectionName = "generations";
// LiteDB's default BsonMapper.Global is a process-wide singleton whose per-type member
// registration is lazy and NOT thread-safe across concurrently-constructed LiteDatabase
// instances. When several caches (this one + GenerationSealedCache) initialise in parallel
// the global mapper races, surfacing as "Member ClusterId not found on BsonMapper" or a
// bogus "duplicate key _id = 0" (the int auto-id mapping was lost so Insert writes a literal
// 0 twice) — Configuration-012. Give each database a private, pre-registered mapper so member
// resolution happens once, single-threaded, at construction and never touches the global.
private static BsonMapper BuildMapper()
{
var mapper = new BsonMapper();
mapper.Entity<GenerationSnapshot>();
return mapper;
}
private readonly LiteDatabase _db;
private readonly ILiteCollection<GenerationSnapshot> _col;
// PutAsync is a find-then-insert/update; without serialization, two concurrent puts for the
@@ -28,7 +43,7 @@ public sealed class LiteDbConfigCache : ILocalConfigCache, IDisposable
// the header and "recover"), so we force a write + read probe to fail fast on real corruption.
try
{
_db = new LiteDatabase(new ConnectionString { Filename = dbPath, Upgrade = true });
_db = new LiteDatabase(new ConnectionString { Filename = dbPath, Upgrade = true }, BuildMapper());
_col = _db.GetCollection<GenerationSnapshot>(CollectionName);
_col.EnsureIndex(s => s.ClusterId);
_col.EnsureIndex(s => s.GenerationId);
@@ -129,6 +129,51 @@ public sealed class LiteDbConfigCacheTests : IDisposable
$"PutAsync must upsert atomically — found {gen42Count} rows for (c-1, gen=42) after 64 concurrent puts");
}
// ------------------------------------------------------------------------------------
// Configuration-012 — the per-instance _writeGate (Configuration-005) does not protect
// against LiteDB's process-wide BsonMapper.Global lazy-init race. Many cache INSTANCES
// constructed + driven concurrently corrupt the shared global mapper, surfacing as
// "Member ClusterId not found on BsonMapper" or a bogus "duplicate key _id = 0". A private
// per-database mapper with the entity pre-registered fixes it.
// ------------------------------------------------------------------------------------
/// <summary>Verifies that many cache instances constructed and driven concurrently do not
/// corrupt LiteDB's shared global BsonMapper — each Put/Get round-trips its own payload and
/// no insert throws a member-not-found or duplicate-_id exception.</summary>
[Fact]
public async Task Concurrent_cache_instances_do_not_race_the_shared_bson_mapper()
{
var paths = new List<string>();
try
{
var outer = Enumerable.Range(0, 24).Select(i => Task.Run(async () =>
{
var path = Path.Combine(Path.GetTempPath(), $"otopcua-cache-mapperrace-{Guid.NewGuid():N}.db");
lock (paths) paths.Add(path);
using var cache = new LiteDbConfigCache(path);
// Pre-seed a sentinel, then hammer one (cluster, gen) from many threads.
await cache.PutAsync(Snapshot($"c-{i}", 99));
var inner = Enumerable.Range(0, 16)
.Select(_ => Task.Run(() => cache.PutAsync(Snapshot($"c-{i}", 42))))
.ToArray();
await Task.WhenAll(inner);
var got = await cache.GetMostRecentAsync($"c-{i}");
got.ShouldNotBeNull();
got!.GenerationId.ShouldBe(99); // 99 > 42, latest by GenerationId
})).ToArray();
// The unfixed code throws LiteException / NotSupportedException out of these tasks under
// the global-mapper race; the fixed code completes cleanly.
await Task.WhenAll(outer);
}
finally
{
foreach (var p in paths)
if (File.Exists(p)) File.Delete(p);
}
}
/// <summary>Verifies that a corrupted cache file surfaces as LocalConfigCacheCorruptException.</summary>
[Fact]
public void Corrupt_file_surfaces_as_LocalConfigCacheCorruptException()