CBDDC/surreal.md
Joseph Doherty bd10914828
Harden Surreal migration with retry/coverage fixes and XML docs cleanup
2026-02-22 05:39:00 -05:00

BLite -> SurrealDB (Embedded + RocksDB) Migration Plan

1) Goal and Scope

Replace all BLite-backed persistence in this repository with SurrealDB embedded using RocksDB persistence, while preserving current CBDDC behavior:

  1. Automatic CDC-driven oplog generation for local writes.
  2. Reliable sync across peers (including reconnect and snapshot flows).
  3. Existing storage contracts (IDocumentStore, IOplogStore, IPeerConfigurationStore, IDocumentMetadataStore, ISnapshotMetadataStore, IPeerOplogConfirmationStore) and test semantics.
  4. Full removal of BLite dependencies, APIs, and documentation references.

2) Current-State Inventory (Repository-Specific)

Primary BLite implementation and integration points currently live in:

  1. src/ZB.MOM.WW.CBDDC.Persistence/BLite/CBDDCBLiteExtensions.cs
  2. src/ZB.MOM.WW.CBDDC.Persistence/BLite/CBDDCDocumentDbContext.cs
  3. src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLiteDocumentStore.cs
  4. src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLiteOplogStore.cs
  5. src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLiteDocumentMetadataStore.cs
  6. src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLitePeerConfigurationStore.cs
  7. src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLitePeerOplogConfirmationStore.cs
  8. src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLiteSnapshotMetadataStore.cs
  9. samples/ZB.MOM.WW.CBDDC.Sample.Console/SampleDbContext.cs
  10. samples/ZB.MOM.WW.CBDDC.Sample.Console/SampleDocumentStore.cs
  11. samples/ZB.MOM.WW.CBDDC.Sample.Console/Program.cs
  12. tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests/*.cs (BLite-focused tests)
  13. tests/ZB.MOM.WW.CBDDC.E2E.Tests/ClusterCrudSyncE2ETests.cs
  14. src/ZB.MOM.WW.CBDDC.Persistence/ZB.MOM.WW.CBDDC.Persistence.csproj and sample/test package references
  15. README.md and related docs that currently describe BLite as the embedded provider.

3) Target Architecture

3.1 Provider Surface

Create a Surreal provider namespace and extension entrypoint that mirrors current integration shape:

  1. Add AddCBDDCSurrealEmbedded<...>() in a new file (e.g., src/ZB.MOM.WW.CBDDC.Persistence/Surreal/CBDDCSurrealExtensions.cs).
  2. Register Surreal-backed implementations for all existing persistence interfaces.
  3. Keep singleton lifetime for store services and Surreal client factory (equivalent to current BLite singleton model).
  4. Expose options object including:
     • RocksDB endpoint/path (rocksdb://...)
     • Namespace
     • Database
     • CDC polling interval
     • CDC batch size
     • CDC retention duration
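
The options listed above could be modeled roughly as follows. This is a language-neutral sketch in Python, not the provider's actual API: the class name, field names, and default values are all hypothetical, and the retention check simply restates the Phase 0 exit criterion that retention must exceed the maximum offline peer window.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SurrealEmbeddedOptions:
    """Hypothetical options shape; names and defaults are illustrative only."""
    endpoint: str                      # RocksDB endpoint/path, e.g. rocksdb://data/cbddc.db
    namespace: str
    database: str
    cdc_polling_interval: timedelta = timedelta(milliseconds=250)  # assumed default
    cdc_batch_size: int = 500                                      # assumed default
    cdc_retention: timedelta = timedelta(days=7)                   # assumed default

    def validate(self, max_offline_window: timedelta) -> None:
        """Raise ValueError if the options cannot support durable CDC."""
        if not self.endpoint.startswith("rocksdb://"):
            raise ValueError("endpoint must use the rocksdb:// scheme")
        if not self.namespace or not self.database:
            raise ValueError("namespace and database are required")
        if self.cdc_batch_size <= 0:
            raise ValueError("cdc_batch_size must be positive")
        if self.cdc_polling_interval <= timedelta(0):
            raise ValueError("cdc_polling_interval must be positive")
        # Retention must exceed the longest time a peer may stay offline,
        # otherwise the peer's cursor can fall outside the changefeed.
        if self.cdc_retention <= max_offline_window:
            raise ValueError("cdc_retention must exceed the maximum offline peer window")
```

Validating at startup, rather than on first poll, keeps a misconfigured retention value from surfacing only after a peer has already missed changes.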

3.2 Surreal Connection and Embedded Startup

Use official embedded .NET guidance:

  1. Add Surreal embedded packages.
  2. Use SurrealDbEmbeddedClient/RocksDB embedded client with rocksdb:// endpoint.
  3. Run USE NS <ns> DB <db> at startup.
  4. Dispose/close client on host shutdown.

3.3 Table Design (Schema + Indexing)

Define internal tables as SCHEMAFULL with strongly typed fields to reduce runtime schema drift.

Proposed tables:

  1. oplog_entries
  2. snapshot_metadatas
  3. remote_peer_configurations
  4. document_metadatas
  5. peer_oplog_confirmations
  6. cdc_checkpoints (new: durable cursor per watched table)
  7. Optional: cdc_dedup (new: idempotency window for duplicate/overlapping reads)

Indexes and IDs:

  1. Prefer deterministic record IDs for point lookups (table:id) where possible.
  2. Add unique indexes for business keys currently enforced in BLite:
     • oplog_entries.hash
     • snapshot_metadatas.node_id
     • (document_metadatas.collection, document_metadatas.key)
     • (peer_oplog_confirmations.peer_node_id, peer_oplog_confirmations.source_node_id)
  3. Add composite indexes for hot sync queries:
     • Oplog by (timestamp_physical, timestamp_logical)
     • Oplog by (timestamp_node_id, timestamp_physical, timestamp_logical)
     • Metadata by (hlc_physical, hlc_logical)
  4. Use EXPLAIN FULL during the test/benchmark phase to verify index usage.
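
As a rough illustration of the index plan above, the following Python sketch generates SurrealQL DEFINE TABLE and DEFINE INDEX statements from the listed keys. The index naming convention is hypothetical, and mapping "Metadata" to the document_metadatas table is an assumption. Field definitions (DEFINE FIELD) are omitted because the plan does not specify field types.

```python
# Unique business keys from the plan: table -> indexed fields.
UNIQUE_KEYS = {
    "oplog_entries": ["hash"],
    "snapshot_metadatas": ["node_id"],
    "document_metadatas": ["collection", "key"],
    "peer_oplog_confirmations": ["peer_node_id", "source_node_id"],
}

# Composite, non-unique indexes for hot sync queries.
# Assumption: "Metadata" in the plan refers to document_metadatas.
HOT_INDEXES = [
    ("oplog_entries", ["timestamp_physical", "timestamp_logical"]),
    ("oplog_entries", ["timestamp_node_id", "timestamp_physical", "timestamp_logical"]),
    ("document_metadatas", ["hlc_physical", "hlc_logical"]),
]


def index_ddl(table, fields, unique):
    """Build one DEFINE INDEX statement (the idx_ naming is hypothetical)."""
    name = f"idx_{table}_{'_'.join(fields)}"
    suffix = " UNIQUE" if unique else ""
    return f"DEFINE INDEX {name} ON TABLE {table} FIELDS {', '.join(fields)}{suffix};"


def schema_ddl():
    """Return table definitions first, then unique indexes, then hot-path indexes."""
    tables = sorted(set(UNIQUE_KEYS) | {table for table, _ in HOT_INDEXES})
    statements = [f"DEFINE TABLE {table} SCHEMAFULL;" for table in tables]
    statements += [index_ddl(t, f, unique=True) for t, f in UNIQUE_KEYS.items()]
    statements += [index_ddl(t, f, unique=False) for t, f in HOT_INDEXES]
    return statements


if __name__ == "__main__":
    print("\n".join(schema_ddl()))
```

Keeping the key lists as data makes it straightforward to review the DDL against the index plan and to reuse the same lists in tests that check index usage.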

3.4 CDC Strategy (Durable + Low Latency)

Implement CDC with Surreal Change Feeds as source of truth and Live Queries as optional accelerators.

  1. Enable CHANGEFEED <duration> per watched table (INCLUDE ORIGINAL when old values are required for conflict handling/debug).
  2. Persist checkpoint cursor (versionstamp preferred) in cdc_checkpoints.
  3. Poll with SHOW CHANGES FOR TABLE <table> SINCE <cursor> LIMIT <N>.
  4. Process changes idempotently; tolerate duplicate windows when timestamp cursors overlap.
  5. Commit checkpoint only after oplog + metadata writes commit successfully.
  6. Optionally run LIVE SELECT subscribers for lower-latency wakeups, but never rely on live events alone for durability.
  7. On startup/reconnect, always catch up via SHOW CHANGES from last persisted cursor.
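
The checkpointing rules above can be sketched with an in-memory stand-in for the changefeed. Everything here is illustrative: InMemoryFeed only mimics SHOW CHANGES ... SINCE ... LIMIT, versionstamps are modeled as plain integers, and treating the cursor as "the next versionstamp to read" is an assumption about cursor semantics, not a statement about SurrealDB's behavior.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryFeed:
    """Stand-in for SHOW CHANGES FOR TABLE <table> SINCE <cursor> LIMIT <N>."""
    changes: list = field(default_factory=list)  # (versionstamp, payload), ascending

    def show_changes(self, since, limit):
        return [c for c in self.changes if c[0] >= since][:limit]


class CdcPoller:
    """Polls the feed, writes oplog entries, then advances the checkpoint."""

    def __init__(self, feed, batch_size):
        self.feed = feed
        self.batch_size = batch_size
        self.cursor = 0   # would be persisted in cdc_checkpoints
        self.oplog = {}   # keyed by versionstamp so replayed reads are no-ops

    def poll_once(self, fail_before_checkpoint=False):
        batch = self.feed.show_changes(self.cursor, self.batch_size)
        for versionstamp, payload in batch:
            # Idempotent write: a duplicate read never creates a second entry.
            self.oplog.setdefault(versionstamp, payload)
        if fail_before_checkpoint:
            raise RuntimeError("simulated crash before checkpoint commit")
        if batch:
            # Advance the cursor only after the oplog writes succeeded.
            self.cursor = batch[-1][0] + 1
        return len(batch)

    def catch_up(self):
        """Drain the feed from the last persisted cursor (startup/reconnect path)."""
        while self.poll_once():
            pass


if __name__ == "__main__":
    feed = InMemoryFeed([(i, f"change-{i}") for i in range(1, 6)])
    poller = CdcPoller(feed, batch_size=2)
    try:
        poller.poll_once(fail_before_checkpoint=True)
    except RuntimeError:
        pass  # the cursor was not advanced, so the batch is re-read below
    poller.catch_up()
    print(len(poller.oplog), poller.cursor)
```

Because the checkpoint moves only after the writes, a crash in between produces a re-read rather than a gap, which is why the writes themselves must be idempotent.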

3.5 Transaction Boundaries

Use explicit SurrealQL transactions for atomic state transitions:

  1. Local CDC event -> write oplog entry + document metadata + vector clock backing data in one transaction.
  2. Remote apply batch -> apply documents + merge oplog + metadata updates atomically in bounded batches.
  3. Snapshot replace/merge -> table-level clear/import or merge in deterministic order with rollback on failure.

4) Execution Plan (Phased)

Phase 0: Design Freeze and Safety Rails

  1. Finalize data model and table schema DDL.
  2. Finalize CDC cursor semantics (versionstamp vs timestamp fallback).
  3. Freeze shared contracts in ZB.MOM.WW.CBDDC.Core (no signature churn during provider port).
  4. Add a temporary migration feature flag (UseSurrealPersistence) for cutover control; remove it in final cleanup.

Exit criteria:

  1. Design doc approved.
  2. DDL + index plan reviewed.
  3. CDC retention value chosen (must exceed maximum offline peer window).

Phase 1: Surreal Infrastructure Layer

  1. Add Surreal packages and connection factory.
  2. Implement startup initialization: NS/DB selection, table/index creation, capability checks.
  3. Introduce provider options and DI extension (AddCBDDCSurrealEmbedded).
  4. Add health probe for embedded connection and schema readiness.

Exit criteria:

  1. dotnet build succeeds.
  2. Basic smoke test can connect, create, read, and delete records in RocksDB-backed embedded Surreal.

Phase 2: Port Store Implementations

Port each BLite store to Surreal while preserving interface behavior:

  1. BLiteOplogStore -> SurrealOplogStore
  2. BLiteDocumentMetadataStore -> SurrealDocumentMetadataStore
  3. BLitePeerConfigurationStore -> SurrealPeerConfigurationStore
  4. BLitePeerOplogConfirmationStore -> SurrealPeerOplogConfirmationStore
  5. BLiteSnapshotMetadataStore -> SurrealSnapshotMetadataStore

Implementation requirements:

  1. Keep existing merge/drop/export/import semantics.
  2. Preserve ordering guarantees for hash-chain methods.
  3. Preserve vector clock bootstrap behavior (snapshot metadata first, oplog second).
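
The bootstrap order in the last requirement can be pictured with a small sketch. It assumes, without confirmation from the existing code, that the vector clock maps each node ID to the highest hybrid logical clock timestamp seen for that node, with timestamps compared as (physical, logical) pairs. The function name and data shapes are hypothetical.

```python
def bootstrap_vector_clock(snapshot_metadata, oplog_entries):
    """Seed the clock from snapshot metadata, then advance it from the oplog.

    snapshot_metadata: {node_id: (physical, logical)}
    oplog_entries: iterable of (node_id, (physical, logical))
    """
    # Step 1: snapshot metadata provides the baseline per node.
    clock = dict(snapshot_metadata)
    # Step 2: oplog entries only ever move a node's timestamp forward.
    for node_id, timestamp in oplog_entries:
        current = clock.get(node_id)
        if current is None or timestamp > current:
            clock[node_id] = timestamp
    return clock


if __name__ == "__main__":
    snapshot = {"node-a": (100, 0), "node-b": (50, 2)}
    oplog = [("node-a", (90, 5)), ("node-b", (50, 3)), ("node-c", (10, 0))]
    print(bootstrap_vector_clock(snapshot, oplog))
```

In this model, an oplog entry older than the snapshot baseline leaves the clock unchanged, so a pruned or partial oplog cannot move a node's position backwards.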

Exit criteria:

  1. Store-level unit tests pass with Surreal backend.
  2. No BLite store classes used in DI path.

Phase 3: Document Store + CDC Engine

  1. Replace BLiteDocumentStore<TDbContext> with Surreal-aware document store base.
  2. Implement collection registration + watched table catalog.
  3. Implement CDC worker:
     • Poll SHOW CHANGES
     • Map CDC events to OperationType
     • Generate oplog + metadata
     • Enforce remote-sync suppression/idempotency
  4. Keep equivalent remote apply guard semantics to prevent CDC loopback during sync replay.
  5. Add graceful start/stop lifecycle hooks for CDC worker.
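
One possible shape for the event mapping and loopback guard is sketched below. The OperationType members, the event dictionary layout, and the marker-based suppression approach are all assumptions made for illustration; the existing remote apply guard may work differently.

```python
from collections import Counter
from enum import Enum


class OperationType(Enum):
    # Hypothetical members; the real CBDDC enum may differ.
    PUT = "put"
    DELETE = "delete"


# Assumed change kinds reported by the CDC source.
KIND_TO_OPERATION = {
    "create": OperationType.PUT,
    "update": OperationType.PUT,
    "delete": OperationType.DELETE,
}


class CdcWorker:
    """Maps CDC events to oplog entries, skipping writes made by sync replay."""

    def __init__(self):
        self.pending_remote = Counter()  # (collection, key) -> expected remote writes
        self.oplog = []

    def mark_remote_apply(self, collection, key):
        """Called by the remote apply path before it writes a document."""
        self.pending_remote[(collection, key)] += 1

    def handle(self, event):
        """Return the emitted OperationType, or None if the event was suppressed."""
        ident = (event["collection"], event["key"])
        if self.pending_remote[ident] > 0:
            # This change came from sync replay; consume the marker, emit nothing.
            self.pending_remote[ident] -= 1
            return None
        operation = KIND_TO_OPERATION[event["kind"]]
        self.oplog.append((operation, *ident))
        return operation


if __name__ == "__main__":
    worker = CdcWorker()
    worker.handle({"kind": "create", "collection": "users", "key": "u1"})
    worker.mark_remote_apply("users", "u2")
    worker.handle({"kind": "update", "collection": "users", "key": "u2"})  # suppressed
    print(worker.oplog)
```

A counter is used instead of a set so that two remote writes to the same key suppress exactly two CDC events. Because polling is asynchronous, the markers would need to be durable in a real worker to survive a restart between the remote write and the next poll.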

Exit criteria:

  1. Local direct writes produce expected oplog entries.
  2. Remote replay does not create duplicate local oplog entries.
  3. Restart resumes CDC from persisted checkpoint without missing changes.

Phase 4: Sample App and E2E Harness Migration

  1. Replace sample BLite context usage with Surreal-backed sample persistence.
  2. Replace AddCBDDCBLite usage in sample and tests.
  3. Update ClusterCrudSyncE2ETests internals that currently access BLite collections directly.
  4. Refactor fallback CDC assertion logic to Surreal-based observability hooks.

Exit criteria:

  1. Sample runs two-node sync with Surreal embedded RocksDB.
  2. The bidirectional E2E CRUD test passes with unchanged behavior.

Phase 5: Data Migration Tooling and Cutover

  1. Build one-time migration utility:
     • Read BLite data via existing stores
     • Write to Surreal tables
     • Preserve hashes/timestamps exactly
  2. Add verification routine comparing counts, hashes, and key spot checks.
  3. Document migration command and rollback artifacts.
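
A verification routine along these lines might compare exported record sets as follows. This is a sketch under stated assumptions: both sides are assumed to be exported as lists of dictionaries that carry a "key" and a "hash" field, and the function names are hypothetical.

```python
import hashlib


def table_digest(records):
    """Order-independent digest over the records' stored hashes."""
    hashes = sorted(r["hash"] for r in records)
    return hashlib.sha256("\n".join(hashes).encode("utf-8")).hexdigest()


def verify_parity(source, target, spot_keys):
    """Compare counts, hash digests, and selected records per table.

    source/target: {table: [record dicts]}; spot_keys: {table: [key, ...]}.
    Returns a list of human-readable problems; an empty list means parity.
    """
    problems = []
    for table in sorted(set(source) | set(target)):
        src, dst = source.get(table, []), target.get(table, [])
        if len(src) != len(dst):
            problems.append(f"{table}: count {len(src)} != {len(dst)}")
        elif table_digest(src) != table_digest(dst):
            problems.append(f"{table}: hash digest mismatch")
        src_by_key = {r["key"]: r for r in src}
        dst_by_key = {r["key"]: r for r in dst}
        for key in spot_keys.get(table, []):
            # Spot checks compare full records, including timestamps.
            if src_by_key.get(key) != dst_by_key.get(key):
                problems.append(f"{table}: record {key!r} differs")
    return problems


if __name__ == "__main__":
    blite = {"oplog_entries": [{"key": "e1", "hash": "abc", "ts": 1}]}
    surreal = {"oplog_entries": [{"key": "e1", "hash": "abc", "ts": 1}]}
    print(verify_parity(blite, surreal, {"oplog_entries": ["e1"]}))
```

Returning a list of problems instead of failing on the first mismatch makes the output usable as the parity report mentioned in the rollout and rollback sections.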

Exit criteria:

  1. Dry-run migration succeeds on fixture DB.
  2. Post-migration parity checks are clean.

Phase 6: Remove BLite Completely

  1. Delete src/ZB.MOM.WW.CBDDC.Persistence/BLite/* after Surreal parity is proven.
  2. Remove BLite package references and BLite source generators from project files.
  3. Remove .blite path assumptions from sample/tests/docs.
  4. Update docs and READMEs to SurrealDB terminology.
  5. Ensure rg -n "BLite|blite|AddCBDDCBLite|CBDDCDocumentDbContext" returns no functional references (except historical notes if intentionally retained).

Exit criteria:

  1. Solution builds/tests pass with zero BLite runtime dependency.
  2. Docs reflect Surreal-only provider path.

5) Safe Parallel Subagent Plan

Use parallel subagents only with strict ownership boundaries and integration gates.

5.1 Subagent Work Split

  1. Subagent A (Infrastructure/DI)
     • Owns: new Surreal options, connection factory, DI extension, startup schema init.
     • Files: new src/.../Surreal/* infra files, *.csproj package refs.
  2. Subagent B (Core Stores)
     • Owns: oplog/document metadata/snapshot metadata/peer config/peer confirmation Surreal stores.
     • Files: src/ZB.MOM.WW.CBDDC.Persistence/Surreal/*Store.cs.
  3. Subagent C (CDC + DocumentStore)
     • Owns: Surreal document store base, CDC poller, checkpoint persistence, remote-sync suppression (loopback prevention).
     • Files: src/ZB.MOM.WW.CBDDC.Persistence/Surreal/*DocumentStore*, CDC worker files.
  4. Subagent D (Tests)
     • Owns: unit/integration/E2E tests migrated to Surreal.
     • Files: tests/* touched by provider swap.
  5. Subagent E (Sample + Docs)
     • Owns: sample console migration and doc rewrites.
     • Files: samples/*, README.md, docs/* provider docs.

5.2 Parallel Safety Rules

  1. No overlapping file ownership between active subagents.
  2. Shared contract files are locked unless explicitly assigned to one subagent.
  3. Each subagent must submit:
     • changed file list
     • rationale
     • commands run
     • test evidence
  4. Integrator rebases/merges sequentially, never blindly squashing conflicting edits.
  5. If a subagent encounters unrelated dirty changes, it must stop and escalate before editing.

5.3 Integration Order

  1. Merge A -> B -> C -> D -> E.
  2. Run full verification after each merge step, not only at the end.

6) Required Unit/Integration Test Matrix

6.1 Store Contract Tests

  1. Oplog append/export/import/merge/drop parity.
  2. GetChainRangeAsync correctness by hash chain ordering.
  3. GetLastEntryHashAsync behavior with oplog hit and snapshot fallback.
  4. Pruning respects cutoff and confirmations.
  5. Document metadata upsert/mark-deleted/get-after ordering.
  6. Peer config save/get/remove/merge semantics.
  7. Peer confirmation registration/update/deactivate/merge semantics.
  8. Snapshot metadata insert/update/merge and hash lookup.

6.2 CDC Tests

  1. Local write on watched table emits exactly one oplog entry.
  2. Delete mutation emits delete oplog + metadata tombstone.
  3. Remote apply path does not re-emit local CDC oplog entries.
  4. CDC checkpoint persists only after atomic write success.
  5. Restart from checkpoint catches missed changes.
  6. Duplicate window replay is idempotent.
  7. Changefeed retention boundary behavior is explicit and logged.

6.3 Snapshot and Recovery Tests

  1. CreateSnapshotAsync includes docs/oplog/peers/confirmations.
  2. ReplaceDatabaseAsync restores full state.
  3. MergeSnapshotAsync conflict behavior unchanged.
  4. Recovery after process restart retains Surreal RocksDB data.

6.4 E2E Sync Tests

  1. Two peers replicate create/update/delete bidirectionally.
  2. Peer reconnect performs incremental catch-up from CDC cursor.
  3. Multi-change burst preserves deterministic final state.
  4. Optional fault-injection test: crash between oplog write and checkpoint update should replay safely on restart.

7) Verification After Each Subagent Completion

Run this checklist after each merged subagent contribution:

  1. dotnet restore
  2. dotnet build CBDDC.slnx -c Release
  3. Targeted tests for modified projects (fast gate)
  4. Full test suite before moving to next major phase:
     • dotnet test CBDDC.slnx -c Release
  5. Regression grep checks:
     • rg -n "BLite|AddCBDDCBLite|\.blite|CBDDCDocumentDbContext" src samples tests README.md docs
  6. Surreal smoke test:
     • create temp RocksDB path
     • start sample node
     • perform write/update/delete
     • restart process and verify persisted state
  7. CDC durability test:
     • stop node
     • mutate source
     • restart node
     • confirm catch-up via SHOW CHANGES cursor
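
Where rg is not available on a build agent, the regression grep check could be approximated by a small script such as the one below. It is a sketch, not part of the plan: the function name is hypothetical, and the pattern simply mirrors the rg expression in the checklist.

```python
import re
from pathlib import Path

# Same alternation as the rg command in the checklist.
PATTERN = re.compile(r"BLite|AddCBDDCBLite|\.blite|CBDDCDocumentDbContext")


def find_blite_references(roots):
    """Return (path, line_number, line) for every match under the given roots."""
    hits = []
    for root in roots:
        path = Path(root)
        if not path.exists():
            continue  # e.g. a docs folder that is absent in this checkout
        files = [path] if path.is_file() else sorted(f for f in path.rglob("*") if f.is_file())
        for file in files:
            try:
                text = file.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
            for number, line in enumerate(text.splitlines(), start=1):
                if PATTERN.search(line):
                    hits.append((str(file), number, line.strip()))
    return hits


if __name__ == "__main__":
    for file, number, line in find_blite_references(["src", "samples", "tests", "README.md", "docs"]):
        print(f"{file}:{number}: {line}")
```

A CI gate could fail when the returned list is non-empty, with an allowlist for any historical notes that are intentionally retained.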

8) Rollout and Rollback

Rollout

  1. Internal canary branch with Surreal-only provider.
  2. Run full CI + extended E2E soak (long-running sync/reconnect).
  3. Migrate one test dataset from BLite to Surreal and validate parity.
  4. Promote after acceptance criteria are met.

Rollback

  1. Keep BLite export snapshots until Surreal cutover is accepted.
  2. If a severe defect appears, restore from the pre-cutover snapshot and redeploy the previous BLite-tagged build.
  3. Preserve migration logs and parity reports for audit.

9) Definition of Done

  1. No runtime BLite dependency remains.
  2. All store contracts pass with Surreal backend.
  3. CDC is durable (checkpointed), idempotent, and restart-safe.
  4. Sample + E2E prove sync parity.
  5. Documentation and onboarding instructions updated to Surreal embedded RocksDB.
  6. Migration utility + validation report available for production cutover.

10) SurrealDB Best-Practice Notes Applied in This Plan

This plan explicitly applies official Surreal guidance:

  1. Embedded .NET with RocksDB endpoint (rocksdb://) and explicit NS/DB usage.
  2. Schema-first design with strict table/field definitions and typed record references.
  3. Query/index discipline (EXPLAIN FULL, indexed lookups, avoid broad scans).
  4. CDC durability with changefeeds and checkpointed SHOW CHANGES replay.
  5. Live queries used as low-latency signals, not as sole durable CDC transport.
  6. Security hardening (authentication, encryption/backups, restricted capabilities) for any non-embedded server deployments used in tooling/CI.

References (Primary Sources)

  1. SurrealDB .NET embedded engine docs: https://surrealdb.com/docs/surrealdb/embedding/dotnet
  2. SurrealDB .NET SDK embedding guide: https://surrealdb.com/docs/sdk/dotnet/embedding
  3. SurrealDB connection strings (protocol formats incl. RocksDB): https://surrealdb.com/docs/surrealdb/reference-guide/connection-strings
  4. SurrealDB schema best practices: https://surrealdb.com/docs/surrealdb/reference-guide/schema-creation-best-practices
  5. SurrealDB performance best practices: https://surrealdb.com/docs/surrealdb/reference-guide/performance-best-practices
  6. SurrealDB real-time/events best practices: https://surrealdb.com/docs/surrealdb/reference-guide/realtime-best-practices
  7. SurrealQL DEFINE TABLE (changefeed options): https://surrealdb.com/docs/surrealql/statements/define/table
  8. SurrealQL SHOW CHANGES (durable CDC read): https://surrealdb.com/docs/surrealql/statements/show
  9. SurrealQL LIVE SELECT behavior and caveats: https://surrealdb.com/docs/surrealql/statements/live
  10. SurrealDB security best practices: https://surrealdb.com/docs/surrealdb/security/security-best-practices
  11. SurrealQL transactions (BEGIN/COMMIT): https://surrealdb.com/docs/surrealql/statements/begin, https://surrealdb.com/docs/surrealql/statements/commit