BLite -> SurrealDB (Embedded + RocksDB) Migration Plan
1) Goal and Scope
Replace all BLite-backed persistence in this repository with SurrealDB embedded using RocksDB persistence, while preserving current CBDDC behavior:
- Automatic CDC-driven oplog generation for local writes.
- Reliable sync across peers (including reconnect and snapshot flows).
- Existing storage contracts (`IDocumentStore`, `IOplogStore`, `IPeerConfigurationStore`, `IDocumentMetadataStore`, `ISnapshotMetadataStore`, `IPeerOplogConfirmationStore`) and test semantics.
- Full removal of BLite dependencies, APIs, and documentation references.
2) Current-State Inventory (Repository-Specific)
Primary BLite implementation and integration points currently live in:
- `src/ZB.MOM.WW.CBDDC.Persistence/BLite/CBDDCBLiteExtensions.cs`
- `src/ZB.MOM.WW.CBDDC.Persistence/BLite/CBDDCDocumentDbContext.cs`
- `src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLiteDocumentStore.cs`
- `src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLiteOplogStore.cs`
- `src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLiteDocumentMetadataStore.cs`
- `src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLitePeerConfigurationStore.cs`
- `src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLitePeerOplogConfirmationStore.cs`
- `src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLiteSnapshotMetadataStore.cs`
- `samples/ZB.MOM.WW.CBDDC.Sample.Console/SampleDbContext.cs`
- `samples/ZB.MOM.WW.CBDDC.Sample.Console/SampleDocumentStore.cs`
- `samples/ZB.MOM.WW.CBDDC.Sample.Console/Program.cs`
- `tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests/*.cs` (BLite-focused tests)
- `tests/ZB.MOM.WW.CBDDC.E2E.Tests/ClusterCrudSyncE2ETests.cs`
- `src/ZB.MOM.WW.CBDDC.Persistence/ZB.MOM.WW.CBDDC.Persistence.csproj` and sample/test package references
- `README.md` and related docs that currently describe BLite as the embedded provider.
3) Target Architecture
3.1 Provider Surface
Create a Surreal provider namespace and extension entrypoint that mirrors current integration shape:
- Add `AddCBDDCSurrealEmbedded<...>()` in a new file (e.g., `src/ZB.MOM.WW.CBDDC.Persistence/Surreal/CBDDCSurrealExtensions.cs`).
- Register Surreal-backed implementations for all existing persistence interfaces.
- Keep singleton lifetime for store services and Surreal client factory (equivalent to current BLite singleton model).
- Expose an options object including:
  - RocksDB endpoint/path (`rocksdb://...`)
  - Namespace
  - Database
  - CDC polling interval
  - CDC batch size
  - CDC retention duration
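As a minimal sketch of the options object listed above, the following shows one possible shape with basic validation. The type name `CBDDCSurrealEmbeddedOptions`, its property names, and all default values are assumptions made for illustration; none of them come from the repository.

```csharp
using System;
using System.Collections.Generic;

// Usage: build options and validate them before registering the provider.
var options = new CBDDCSurrealEmbeddedOptions
{
    Endpoint = "rocksdb://data/cbddc.db",
    Namespace = "cbddc",
    Database = "node1",
};
foreach (var error in options.Validate())
{
    Console.WriteLine(error);
}

// Hypothetical options type; property names and defaults are assumptions.
public sealed class CBDDCSurrealEmbeddedOptions
{
    private const string Scheme = "rocksdb://";

    public string Endpoint { get; set; } = "";
    public string Namespace { get; set; } = "";
    public string Database { get; set; } = "";
    public TimeSpan CdcPollingInterval { get; set; } = TimeSpan.FromMilliseconds(250);
    public int CdcBatchSize { get; set; } = 500;
    public TimeSpan CdcRetention { get; set; } = TimeSpan.FromDays(7);

    // Returns human-readable problems; an empty list means the options are usable.
    public IReadOnlyList<string> Validate()
    {
        var errors = new List<string>();
        if (!Endpoint.StartsWith(Scheme, StringComparison.Ordinal) || Endpoint.Length == Scheme.Length)
            errors.Add("Endpoint must be a rocksdb:// path.");
        if (string.IsNullOrWhiteSpace(Namespace)) errors.Add("Namespace is required.");
        if (string.IsNullOrWhiteSpace(Database)) errors.Add("Database is required.");
        if (CdcPollingInterval <= TimeSpan.Zero) errors.Add("CdcPollingInterval must be positive.");
        if (CdcBatchSize <= 0) errors.Add("CdcBatchSize must be positive.");
        if (CdcRetention <= CdcPollingInterval) errors.Add("CdcRetention must exceed the polling interval.");
        return errors;
    }
}
```

Validating at registration time would surface a bad endpoint or an empty namespace before the embedded engine is opened, rather than at the first query.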
3.2 Surreal Connection and Embedded Startup
Use official embedded .NET guidance:
- Add Surreal embedded packages.
- Use `SurrealDbEmbeddedClient`/RocksDB embedded client with a `rocksdb://` endpoint.
- Run `USE NS <ns> DB <db>` at startup.
- Dispose/close client on host shutdown.
3.3 Table Design (Schema + Indexing)
Define internal tables as SCHEMAFULL and strongly typed fields to reduce runtime drift.
Proposed tables:
- `oplog_entries`
- `snapshot_metadatas`
- `remote_peer_configurations`
- `document_metadatas`
- `peer_oplog_confirmations`
- `cdc_checkpoints` (new: durable cursor per watched table)
- Optional: `cdc_dedup` (new: idempotency window for duplicate/overlapping reads)
Indexes and IDs:
- Prefer deterministic record IDs for point lookups (`table:id`) where possible.
- Add unique indexes for business keys currently enforced in BLite:
  - `oplog_entries.hash`
  - `snapshot_metadatas.node_id`
  - `(document_metadatas.collection, document_metadatas.key)`
  - `(peer_oplog_confirmations.peer_node_id, peer_oplog_confirmations.source_node_id)`
- Add composite indexes for hot sync queries:
  - Oplog by `(timestamp_physical, timestamp_logical)`
  - Oplog by `(timestamp_node_id, timestamp_physical, timestamp_logical)`
  - Metadata by `(hlc_physical, hlc_logical)`
- Use `EXPLAIN FULL` during test/benchmark phase to verify index usage.
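The table and index plan above could be expressed as startup DDL along these lines. This is a partial sketch covering two of the proposed tables. The holder class `SurrealSchemaSketch`, the index names, and the field types are assumptions; the table and field names are taken from this plan. Guards for re-running the DDL on an existing database are left out because their form depends on the SurrealDB version in use.

```csharp
using System;
using System.Collections.Generic;

foreach (var statement in SurrealSchemaSketch.Statements)
{
    Console.WriteLine(statement);
}

// Hypothetical holder for startup DDL. Field types and index names are assumptions.
public static class SurrealSchemaSketch
{
    public static readonly IReadOnlyList<string> Statements = new[]
    {
        // Oplog table: unique hash plus composite indexes for hot sync queries.
        "DEFINE TABLE oplog_entries SCHEMAFULL;",
        "DEFINE FIELD hash ON TABLE oplog_entries TYPE string;",
        "DEFINE FIELD timestamp_physical ON TABLE oplog_entries TYPE int;",
        "DEFINE FIELD timestamp_logical ON TABLE oplog_entries TYPE int;",
        "DEFINE FIELD timestamp_node_id ON TABLE oplog_entries TYPE string;",
        "DEFINE INDEX oplog_hash_unique ON TABLE oplog_entries FIELDS hash UNIQUE;",
        "DEFINE INDEX oplog_ts ON TABLE oplog_entries FIELDS timestamp_physical, timestamp_logical;",
        "DEFINE INDEX oplog_node_ts ON TABLE oplog_entries FIELDS timestamp_node_id, timestamp_physical, timestamp_logical;",

        // Document metadata table: unique business key (collection, key).
        "DEFINE TABLE document_metadatas SCHEMAFULL;",
        "DEFINE FIELD collection ON TABLE document_metadatas TYPE string;",
        "DEFINE FIELD key ON TABLE document_metadatas TYPE string;",
        "DEFINE INDEX docmeta_collection_key_unique ON TABLE document_metadatas FIELDS collection, key UNIQUE;",
    };

    // Probe query for the benchmark phase; the plan output should mention an index, not a table scan.
    public const string ExplainProbe =
        "SELECT * FROM oplog_entries WHERE timestamp_physical > $since EXPLAIN FULL;";
}
```

Keeping the DDL in one ordered list makes it straightforward to review alongside the index plan and to execute statement by statement during schema initialization.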
3.4 CDC Strategy (Durable + Low Latency)
Implement CDC with Surreal Change Feeds as source of truth and Live Queries as optional accelerators.
- Enable `CHANGEFEED <duration>` per watched table (`INCLUDE ORIGINAL` when old values are required for conflict handling/debug).
- Persist checkpoint cursor (`versionstamp` preferred) in `cdc_checkpoints`.
- Poll with `SHOW CHANGES FOR TABLE <table> SINCE <cursor> LIMIT <N>`.
- Process changes idempotently; tolerate duplicate windows when timestamp cursors overlap.
- Commit checkpoint only after oplog + metadata writes commit successfully.
- Optionally run `LIVE SELECT` subscribers for lower-latency wakeups, but never rely on live events alone for durability.
- On startup/reconnect, always catch up via `SHOW CHANGES` from last persisted cursor.
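The checkpoint-after-commit rule and the idempotency requirement can be illustrated with an in-memory simulation. This is a sketch only: the list `feed` stands in for `SHOW CHANGES ... SINCE <cursor> LIMIT <N>`, the `HashSet` stands in for a durable dedup mechanism (for example the unique oplog hash index or `cdc_dedup`), and `ChangeEvent` and `CdcPollerSketch` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var feed = new List<ChangeEvent>
{
    new(1, "docs:a", "create"),
    new(2, "docs:a", "update"),
    new(3, "docs:b", "create"),
};
var worker = new CdcPollerSketch(batchSize: 2);

// Simulated crash after the oplog write but before the checkpoint commit.
worker.PollOnce(feed, crashBeforeCheckpoint: true);
// After restart the same window is re-read; duplicates are skipped, then the cursor advances.
worker.PollOnce(feed);
worker.PollOnce(feed);
Console.WriteLine($"checkpoint={worker.Checkpoint}, oplog={worker.Oplog.Count}");

public sealed record ChangeEvent(ulong Version, string RecordId, string Kind);

public sealed class CdcPollerSketch
{
    private readonly int _batchSize;
    // Stand-in for durable dedup state; a real worker would persist this.
    private readonly HashSet<ulong> _appliedVersions = new();

    public CdcPollerSketch(int batchSize) => _batchSize = batchSize;

    public ulong Checkpoint { get; private set; }
    public List<ChangeEvent> Oplog { get; } = new();

    // One poll cycle: read a bounded batch after the cursor, apply it idempotently,
    // and only then advance the cursor. Returns the number of new oplog entries.
    public int PollOnce(IEnumerable<ChangeEvent> feed, bool crashBeforeCheckpoint = false)
    {
        var batch = feed.Where(e => e.Version > Checkpoint)
                        .OrderBy(e => e.Version)
                        .Take(_batchSize)
                        .ToList();

        var written = 0;
        foreach (var change in batch)
        {
            // Add returns false for a version that was already applied.
            if (_appliedVersions.Add(change.Version))
            {
                Oplog.Add(change);
                written++;
            }
        }

        if (crashBeforeCheckpoint || batch.Count == 0) return written;

        Checkpoint = batch[^1].Version;
        return written;
    }
}
```

Because the cursor moves only after the writes succeed, a crash can cause a window to be read twice but never skipped, which is why the apply step has to be idempotent.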
3.5 Transaction Boundaries
Use explicit SurrealQL transactions for atomic state transitions:
- Local CDC event -> write oplog entry + document metadata + vector clock backing data in one transaction.
- Remote apply batch -> apply documents + merge oplog + metadata updates atomically in bounded batches.
- Snapshot replace/merge -> table-level clear/import or merge in deterministic order with rollback on failure.
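One way to keep these state transitions atomic is to submit the related statements as a single `BEGIN TRANSACTION ... COMMIT TRANSACTION` script. The helper below is a sketch; `SurrealTransactionSketch` is a hypothetical name, and the example statements and parameter names (`$entry`, `$metadata_id`, `$metadata`) are illustrative assumptions. `UPSERT` in particular assumes a SurrealDB version that provides that statement.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

var script = SurrealTransactionSketch.Wrap(new[]
{
    "CREATE oplog_entries CONTENT $entry",
    "UPSERT $metadata_id CONTENT $metadata",
});
Console.WriteLine(script);

public static class SurrealTransactionSketch
{
    // Joins statements into one script so they commit or roll back together.
    public static string Wrap(IEnumerable<string> statements)
    {
        var sb = new StringBuilder();
        sb.AppendLine("BEGIN TRANSACTION;");

        var count = 0;
        foreach (var statement in statements)
        {
            // Normalize so every statement ends with exactly one semicolon.
            var trimmed = statement.Trim().TrimEnd(';');
            if (trimmed.Length == 0) continue;
            sb.Append(trimmed).AppendLine(";");
            count++;
        }

        if (count == 0)
            throw new ArgumentException("At least one statement is required.", nameof(statements));

        sb.Append("COMMIT TRANSACTION;");
        return sb.ToString();
    }
}
```

Values should be passed as query parameters rather than concatenated into the statement text, so the script itself stays constant per operation type.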
4) Execution Plan (Phased)
Phase 0: Design Freeze and Safety Rails
- Finalize data model and table schema DDL.
- Finalize CDC cursor semantics (`versionstamp` vs timestamp fallback).
- Freeze shared contracts in `ZB.MOM.WW.CBDDC.Core` (no signature churn during provider port).
- Add migration feature flag for temporary cutover control (`UseSurrealPersistence`), removed in final cleanup.
Exit criteria:
- Design doc approved.
- DDL + index plan reviewed.
- CDC retention value chosen (must exceed maximum offline peer window).
Phase 1: Surreal Infrastructure Layer
- Add Surreal packages and connection factory.
- Implement startup initialization: NS/DB selection, table/index creation, capability checks.
- Introduce provider options and DI extension (`AddCBDDCSurrealEmbedded`).
- Add health probe for embedded connection and schema readiness.
Exit criteria:
- `dotnet build` succeeds.
- Basic smoke test can connect, create, read, and delete records in RocksDB-backed embedded Surreal.
Phase 2: Port Store Implementations
Port each BLite store to Surreal while preserving interface behavior:
- `BLiteOplogStore` -> `SurrealOplogStore`
- `BLiteDocumentMetadataStore` -> `SurrealDocumentMetadataStore`
- `BLitePeerConfigurationStore` -> `SurrealPeerConfigurationStore`
- `BLitePeerOplogConfirmationStore` -> `SurrealPeerOplogConfirmationStore`
- `BLiteSnapshotMetadataStore` -> `SurrealSnapshotMetadataStore`
Implementation requirements:
- Keep existing merge/drop/export/import semantics.
- Preserve ordering guarantees for hash-chain methods.
- Preserve vector clock bootstrap behavior (snapshot metadata first, oplog second).
Exit criteria:
- Store-level unit tests pass with Surreal backend.
- No BLite store classes used in DI path.
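To make the hash-chain ordering requirement concrete, the sketch below walks a chain backwards from the end hash and returns the range oldest first, independent of storage order. The range semantics (exclusive start, inclusive end), the `ChainEntry` shape, and the class name `HashChainSketch` are assumptions; the actual contract of the existing hash-chain methods should be taken from the current store tests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Entries arrive in arbitrary storage order.
var entries = new[]
{
    new ChainEntry("h3", "h2"),
    new ChainEntry("h1", ""),
    new ChainEntry("h2", "h1"),
};
var range = HashChainSketch.GetChainRange(entries, startHash: "h1", endHash: "h3");
Console.WriteLine(string.Join(",", range.Select(e => e.Hash)));

public sealed record ChainEntry(string Hash, string PreviousHash);

public static class HashChainSketch
{
    // Returns entries after startHash up to and including endHash, oldest first.
    // Returns an empty list if the chain from endHash never reaches startHash.
    public static IReadOnlyList<ChainEntry> GetChainRange(
        IEnumerable<ChainEntry> entries, string startHash, string endHash)
    {
        var byHash = new Dictionary<string, ChainEntry>();
        foreach (var e in entries) byHash[e.Hash] = e;

        var result = new List<ChainEntry>();
        var cursor = endHash;
        while (cursor != startHash)
        {
            // A missing link or a cycle means the range cannot be served.
            if (!byHash.TryGetValue(cursor, out var entry) || result.Count > byHash.Count)
                return Array.Empty<ChainEntry>();
            result.Add(entry);
            cursor = entry.PreviousHash;
        }

        result.Reverse();
        return result;
    }
}
```

A Surreal-backed store would need to produce the same order even though query results are not guaranteed to come back in chain order, which is what the store contract tests should pin down.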
Phase 3: Document Store + CDC Engine
- Replace `BLiteDocumentStore<TDbContext>` with Surreal-aware document store base.
- Implement collection registration + watched table catalog.
- Implement CDC worker:
  - Poll `SHOW CHANGES`
  - Map CDC events to `OperationType`
  - Generate oplog + metadata
  - Enforce remote-sync suppression/idempotency
- Keep equivalent remote apply guard semantics to prevent CDC loopback during sync replay.
- Add graceful start/stop lifecycle hooks for CDC worker.
Exit criteria:
- Local direct writes produce expected oplog entries.
- Remote replay does not create duplicate local oplog entries.
- Restart resumes CDC from persisted checkpoint without missing changes.
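Because the CDC worker polls asynchronously, loopback suppression cannot rely on a flag that is only set while the remote write is in flight. One possible approach, sketched here, is for the remote-apply path to leave a marker that the CDC worker consumes when it later sees the matching change. All names are hypothetical, the change-kind strings are assumptions about how feed entries are labeled, and `SketchOperationType` is a stand-in for the repository's `OperationType`, whose members are not shown in this plan.

```csharp
using System;
using System.Collections.Generic;

var suppression = new RemoteApplySuppressionSketch();

// Local write: nothing was marked, so the CDC worker emits an oplog entry.
Console.WriteLine(suppression.ShouldEmitOplog("docs:a", "hash-1"));

// Remote replay: the sync path marks the write first, so the CDC worker skips it once.
suppression.MarkRemoteWrite("docs:a", "hash-2");
Console.WriteLine(suppression.ShouldEmitOplog("docs:a", "hash-2"));

Console.WriteLine(CdcEventMapperSketch.Map("delete"));

// Stand-in for the repository's OperationType; members are assumptions.
public enum SketchOperationType { Put, Delete }

public static class CdcEventMapperSketch
{
    // Maps a change-feed entry kind to an operation type.
    public static SketchOperationType Map(string changeKind) => changeKind switch
    {
        "create" or "update" => SketchOperationType.Put,
        "delete" => SketchOperationType.Delete,
        _ => throw new ArgumentOutOfRangeException(nameof(changeKind), changeKind, "Unknown change kind."),
    };
}

public sealed class RemoteApplySuppressionSketch
{
    private readonly object _gate = new();
    private readonly HashSet<(string RecordId, string ContentHash)> _pending = new();

    // Called by the remote-apply path just before it writes a replayed document.
    public void MarkRemoteWrite(string recordId, string contentHash)
    {
        lock (_gate) { _pending.Add((recordId, contentHash)); }
    }

    // Called by the CDC worker for every change; true means "emit a local oplog entry".
    // A matching marker is consumed, so only the replayed write itself is suppressed.
    public bool ShouldEmitOplog(string recordId, string contentHash)
    {
        lock (_gate) { return !_pending.Remove((recordId, contentHash)); }
    }
}
```

In this sketch the markers live in memory, so a restart between the remote write and the CDC poll would lose them; a durable variant would write the marker in the same transaction as the replayed document.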
Phase 4: Sample App and E2E Harness Migration
- Replace sample BLite context usage with Surreal-backed sample persistence.
- Replace `AddCBDDCBLite` usage in sample and tests.
- Update `ClusterCrudSyncE2ETests` internals that currently access BLite collections directly.
- Refactor fallback CDC assertion logic to Surreal-based observability hooks.
Exit criteria:
- Sample runs two-node sync with Surreal embedded RocksDB.
- E2E CRUD bidirectional test passes unchanged in behavior.
Phase 5: Data Migration Tooling and Cutover
- Build one-time migration utility:
  - Read BLite data via existing stores
  - Write to Surreal tables
  - Preserve hashes/timestamps exactly
- Add verification routine comparing counts, hashes, and key spot checks.
- Document migration command and rollback artifacts.
Exit criteria:
- Dry-run migration succeeds on fixture DB.
- Post-migration parity checks are clean.
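The verification routine could compare the two sides by key and hash and report every discrepancy rather than stopping at the first one. In this sketch each side is reduced to a map from record key to content hash; `ParityReport` and `MigrationParitySketch` are hypothetical names, and how keys and hashes are extracted from BLite and Surreal is left open.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var source = new Dictionary<string, string> { ["a"] = "h1", ["b"] = "h2", ["c"] = "h3" };
var target = new Dictionary<string, string> { ["a"] = "h1", ["b"] = "hX" };

var report = MigrationParitySketch.Compare(source, target);
Console.WriteLine($"clean={report.IsClean}");
Console.WriteLine($"missing={string.Join(",", report.Missing)}");
Console.WriteLine($"mismatched={string.Join(",", report.Mismatched)}");

public sealed record ParityReport(
    int SourceCount,
    int TargetCount,
    IReadOnlyList<string> Missing,      // in source, absent from target
    IReadOnlyList<string> Mismatched,   // present in both, hashes differ
    IReadOnlyList<string> Unexpected)   // in target, absent from source
{
    public bool IsClean =>
        SourceCount == TargetCount && Missing.Count == 0 && Mismatched.Count == 0 && Unexpected.Count == 0;
}

public static class MigrationParitySketch
{
    public static ParityReport Compare(
        IReadOnlyDictionary<string, string> source,
        IReadOnlyDictionary<string, string> target)
    {
        var missing = source.Keys
            .Where(k => !target.ContainsKey(k))
            .OrderBy(k => k, StringComparer.Ordinal)
            .ToList();
        var mismatched = source
            .Where(kv => target.TryGetValue(kv.Key, out var hash) && hash != kv.Value)
            .Select(kv => kv.Key)
            .OrderBy(k => k, StringComparer.Ordinal)
            .ToList();
        var unexpected = target.Keys
            .Where(k => !source.ContainsKey(k))
            .OrderBy(k => k, StringComparer.Ordinal)
            .ToList();

        return new ParityReport(source.Count, target.Count, missing, mismatched, unexpected);
    }
}
```

Sorting the key lists keeps the report deterministic, which makes parity reports from repeated dry runs easy to diff.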
Phase 6: Remove BLite Completely
- Delete `src/ZB.MOM.WW.CBDDC.Persistence/BLite/*` after Surreal parity is proven.
- Remove BLite package references and BLite source generators from project files.
- Remove `.blite` path assumptions from sample/tests/docs.
- Update docs and READMEs to SurrealDB terminology.
- Ensure `rg -n "BLite|blite|AddCBDDCBLite|CBDDCDocumentDbContext"` returns no functional references (except historical notes if intentionally retained).
Exit criteria:
- Solution builds/tests pass with zero BLite runtime dependency.
- Docs reflect Surreal-only provider path.
5) Safe Parallel Subagent Plan
Use parallel subagents only with strict ownership boundaries and integration gates.
5.1 Subagent Work Split
- Subagent A (Infrastructure/DI)
  - Owns: new Surreal options, connection factory, DI extension, startup schema init.
  - Files: new `src/.../Surreal/*` infra files, `*.csproj` package refs.
- Subagent B (Core Stores)
  - Owns: oplog/document metadata/snapshot metadata/peer config/peer confirmation Surreal stores.
  - Files: `src/ZB.MOM.WW.CBDDC.Persistence/Surreal/*Store.cs`.
- Subagent C (CDC + DocumentStore)
  - Owns: Surreal document store base, CDC poller, checkpoint persistence, suppression loop prevention.
  - Files: `src/ZB.MOM.WW.CBDDC.Persistence/Surreal/*DocumentStore*`, CDC worker files.
- Subagent D (Tests)
  - Owns: unit/integration/E2E tests migrated to Surreal.
  - Files: `tests/*` touched by provider swap.
- Subagent E (Sample + Docs)
  - Owns: sample console migration and doc rewrites.
  - Files: `samples/*`, `README.md`, `docs/*` provider docs.
5.2 Parallel Safety Rules
- No overlapping file ownership between active subagents.
- Shared contract files are locked unless explicitly assigned to one subagent.
- Each subagent must submit:
  - changed file list
  - rationale
  - commands run
  - test evidence
- Integrator rebases/merges sequentially, never blindly squashing conflicting edits.
- If a subagent encounters unrelated dirty changes, it must stop and escalate before editing.
5.3 Integration Order
- Merge A -> B -> C -> D -> E.
- Run full verification after each merge step, not only at the end.
6) Required Unit/Integration Test Matrix
6.1 Store Contract Tests
- Oplog append/export/import/merge/drop parity.
- `GetChainRangeAsync` correctness by hash chain ordering.
- `GetLastEntryHashAsync` behavior with oplog hit and snapshot fallback.
- Pruning respects cutoff and confirmations.
- Document metadata upsert/mark-deleted/get-after ordering.
- Peer config save/get/remove/merge semantics.
- Peer confirmation registration/update/deactivate/merge semantics.
- Snapshot metadata insert/update/merge and hash lookup.
6.2 CDC Tests
- Local write on watched table emits exactly one oplog entry.
- Delete mutation emits delete oplog + metadata tombstone.
- Remote apply path does not re-emit local CDC oplog entries.
- CDC checkpoint persists only after atomic write success.
- Restart from checkpoint catches missed changes.
- Duplicate window replay is idempotent.
- Changefeed retention boundary behavior is explicit and logged.
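The retention-boundary case can be made explicit with a small classification step before polling. This sketch assumes the checkpoint row stores a wall-clock time alongside the cursor, which this plan does not specify; `CheckpointStatus` and `ChangefeedBoundarySketch` are hypothetical names. A checkpoint exactly at the boundary is treated as beyond retention, a deliberately conservative choice.

```csharp
using System;

var now = new DateTimeOffset(2025, 1, 10, 0, 0, 0, TimeSpan.Zero);
var retention = TimeSpan.FromDays(7);

// A checkpoint eight days old falls outside a seven-day changefeed.
var status = ChangefeedBoundarySketch.Classify(now.AddDays(-8), now, retention);
if (status == CheckpointStatus.BeyondRetention)
{
    Console.WriteLine("CDC checkpoint is older than changefeed retention; incremental catch-up is not safe.");
}

public enum CheckpointStatus { WithinRetention, BeyondRetention }

public static class ChangefeedBoundarySketch
{
    // Compares checkpoint age against the changefeed retention duration.
    public static CheckpointStatus Classify(DateTimeOffset checkpointTime, DateTimeOffset now, TimeSpan retention)
        => now - checkpointTime >= retention
            ? CheckpointStatus.BeyondRetention
            : CheckpointStatus.WithinRetention;
}
```

A test for this behavior can pass fixed timestamps, as above, so the boundary is exercised without waiting for real retention to elapse.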
6.3 Snapshot and Recovery Tests
- `CreateSnapshotAsync` includes docs/oplog/peers/confirmations.
- `ReplaceDatabaseAsync` restores full state.
- `MergeSnapshotAsync` conflict behavior unchanged.
- Recovery after process restart retains Surreal RocksDB data.
6.4 E2E Sync Tests
- Two peers replicate create/update/delete bidirectionally.
- Peer reconnect performs incremental catch-up from CDC cursor.
- Multi-change burst preserves deterministic final state.
- Optional fault-injection test: crash between oplog write and checkpoint update should replay safely on restart.
7) Verification After Each Subagent Completion
Run this checklist after each merged subagent contribution:
- `dotnet restore`
- `dotnet build CBDDC.slnx -c Release`
- Targeted tests for modified projects (fast gate)
- Full test suite before moving to next major phase:
  - `dotnet test CBDDC.slnx -c Release`
- Regression grep checks:
  - `rg -n "BLite|AddCBDDCBLite|\.blite|CBDDCDocumentDbContext" src samples tests README.md docs`
- Surreal smoke test:
  - create temp RocksDB path
  - start sample node
  - perform write/update/delete
  - restart process and verify persisted state
- CDC durability test:
  - stop node
  - mutate source
  - restart node
  - confirm catch-up via `SHOW CHANGES` cursor
8) Rollout and Rollback
Rollout
- Internal canary branch with Surreal-only provider.
- Run full CI + extended E2E soak (long-running sync/reconnect).
- Migrate one test dataset from BLite to Surreal and validate parity.
- Promote after acceptance criteria are met.
Rollback
- Keep BLite export snapshots until Surreal cutover is accepted.
- If severe defect appears, restore from pre-cutover snapshot and redeploy previous BLite-tagged build.
- Preserve migration logs and parity reports for audit.
9) Definition of Done
- No runtime BLite dependency remains.
- All store contracts pass with Surreal backend.
- CDC is durable (checkpointed), idempotent, and restart-safe.
- Sample + E2E prove sync parity.
- Documentation and onboarding instructions updated to Surreal embedded RocksDB.
- Migration utility + validation report available for production cutover.
10) SurrealDB Best-Practice Notes Applied in This Plan
This plan explicitly applies official Surreal guidance:
- Embedded .NET with RocksDB endpoint (`rocksdb://`) and explicit NS/DB usage.
- Schema-first design with strict table/field definitions and typed record references.
- Query/index discipline (`EXPLAIN FULL`, indexed lookups, avoid broad scans).
- CDC durability with changefeeds and checkpointed `SHOW CHANGES` replay.
- Live queries used as low-latency signals, not as sole durable CDC transport.
- Security hardening (authentication, encryption/backups, restricted capabilities) for any non-embedded server deployments used in tooling/CI.
References (Primary Sources)
- SurrealDB .NET embedded engine docs: https://surrealdb.com/docs/surrealdb/embedding/dotnet
- SurrealDB .NET SDK embedding guide: https://surrealdb.com/docs/sdk/dotnet/embedding
- SurrealDB connection strings (protocol formats incl. RocksDB): https://surrealdb.com/docs/surrealdb/reference-guide/connection-strings
- SurrealDB schema best practices: https://surrealdb.com/docs/surrealdb/reference-guide/schema-creation-best-practices
- SurrealDB performance best practices: https://surrealdb.com/docs/surrealdb/reference-guide/performance-best-practices
- SurrealDB real-time/events best practices: https://surrealdb.com/docs/surrealdb/reference-guide/realtime-best-practices
- SurrealQL `DEFINE TABLE` (changefeed options): https://surrealdb.com/docs/surrealql/statements/define/table
- SurrealQL `SHOW CHANGES` (durable CDC read): https://surrealdb.com/docs/surrealql/statements/show
- SurrealQL `LIVE SELECT` behavior and caveats: https://surrealdb.com/docs/surrealql/statements/live
- SurrealDB security best practices: https://surrealdb.com/docs/surrealdb/security/security-best-practices
- SurrealQL transactions (`BEGIN`/`COMMIT`): https://surrealdb.com/docs/surrealql/statements/begin, https://surrealdb.com/docs/surrealql/statements/commit