Harden Surreal migration with retry/coverage fixes and XML docs cleanup
surreal.md
# BLite -> SurrealDB (Embedded + RocksDB) Migration Plan

## 1) Goal and Scope

Replace all BLite-backed persistence in this repository with SurrealDB embedded using RocksDB persistence, while preserving current CBDDC behavior:

1. Automatic CDC-driven oplog generation for local writes.
2. Reliable sync across peers (including reconnect and snapshot flows).
3. Existing storage contracts (`IDocumentStore`, `IOplogStore`, `IPeerConfigurationStore`, `IDocumentMetadataStore`, `ISnapshotMetadataStore`, `IPeerOplogConfirmationStore`) and test semantics.
4. Full removal of BLite dependencies, APIs, and documentation references.

## 2) Current-State Inventory (Repository-Specific)

Primary BLite implementation and integration points currently live in:

1. `src/ZB.MOM.WW.CBDDC.Persistence/BLite/CBDDCBLiteExtensions.cs`
2. `src/ZB.MOM.WW.CBDDC.Persistence/BLite/CBDDCDocumentDbContext.cs`
3. `src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLiteDocumentStore.cs`
4. `src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLiteOplogStore.cs`
5. `src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLiteDocumentMetadataStore.cs`
6. `src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLitePeerConfigurationStore.cs`
7. `src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLitePeerOplogConfirmationStore.cs`
8. `src/ZB.MOM.WW.CBDDC.Persistence/BLite/BLiteSnapshotMetadataStore.cs`
9. `samples/ZB.MOM.WW.CBDDC.Sample.Console/SampleDbContext.cs`
10. `samples/ZB.MOM.WW.CBDDC.Sample.Console/SampleDocumentStore.cs`
11. `samples/ZB.MOM.WW.CBDDC.Sample.Console/Program.cs`
12. `tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests/*.cs` (BLite-focused tests)
13. `tests/ZB.MOM.WW.CBDDC.E2E.Tests/ClusterCrudSyncE2ETests.cs`
14. `src/ZB.MOM.WW.CBDDC.Persistence/ZB.MOM.WW.CBDDC.Persistence.csproj` and sample/test package references
15. `README.md` and related docs that currently describe BLite as the embedded provider.

## 3) Target Architecture

### 3.1 Provider Surface

Create a Surreal provider namespace and extension entrypoint that mirrors the current integration shape:

1. Add `AddCBDDCSurrealEmbedded<...>()` in a new file (e.g., `src/ZB.MOM.WW.CBDDC.Persistence/Surreal/CBDDCSurrealExtensions.cs`).
2. Register Surreal-backed implementations for all existing persistence interfaces.
3. Keep singleton lifetime for store services and the Surreal client factory (equivalent to the current BLite singleton model).
4. Expose an options object including:
   - RocksDB endpoint/path (`rocksdb://...`)
   - Namespace
   - Database
   - CDC polling interval
   - CDC batch size
   - CDC retention duration
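
The option list above can be modelled as a small validated value object. The sketch below is written in Python only as a compact, language-neutral model (the provider itself would be C#); the class name, field names, and default values are all hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass
from datetime import timedelta

ROCKSDB_SCHEME = "rocksdb://"


@dataclass(frozen=True)
class SurrealEmbeddedOptions:
    """Hypothetical options shape; names and defaults are illustrative assumptions."""

    endpoint: str                                                   # e.g. rocksdb://<path>
    namespace: str
    database: str
    cdc_polling_interval: timedelta = timedelta(milliseconds=250)  # assumed default
    cdc_batch_size: int = 500                                      # assumed default
    cdc_retention: timedelta = timedelta(days=7)                   # assumed default

    def __post_init__(self) -> None:
        # Reject anything that is not a rocksdb:// endpoint with a non-empty path.
        if not self.endpoint.startswith(ROCKSDB_SCHEME) or self.endpoint == ROCKSDB_SCHEME:
            raise ValueError("endpoint must look like rocksdb://<path>")
        if not self.namespace or not self.database:
            raise ValueError("namespace and database are required")
        if self.cdc_polling_interval <= timedelta(0):
            raise ValueError("cdc_polling_interval must be positive")
        if self.cdc_batch_size <= 0:
            raise ValueError("cdc_batch_size must be positive")
        # Retention shorter than one polling cycle could never be consumed in time.
        if self.cdc_retention <= self.cdc_polling_interval:
            raise ValueError("cdc_retention must exceed cdc_polling_interval")
```

Validating at construction time means a misconfigured endpoint or a non-positive batch size fails at startup rather than inside the CDC worker.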

### 3.2 Surreal Connection and Embedded Startup

Follow the official embedded .NET guidance:

1. Add Surreal embedded packages.
2. Use `SurrealDbEmbeddedClient`/RocksDB embedded client with `rocksdb://` endpoint.
3. Run `USE NS <ns> DB <db>` at startup.
4. Dispose/close the client on host shutdown.

### 3.3 Table Design (Schema + Indexing)

Define internal tables as `SCHEMAFULL` with strongly typed fields to reduce runtime drift.

Proposed tables:

1. `oplog_entries`
2. `snapshot_metadatas`
3. `remote_peer_configurations`
4. `document_metadatas`
5. `peer_oplog_confirmations`
6. `cdc_checkpoints` (new: durable cursor per watched table)
7. Optional: `cdc_dedup` (new: idempotency window for duplicate/overlapping reads)

Indexes and IDs:

1. Prefer deterministic record IDs for point lookups (`table:id`) where possible.
2. Add unique indexes for business keys currently enforced in BLite:
   - `oplog_entries.hash`
   - `snapshot_metadatas.node_id`
   - `(document_metadatas.collection, document_metadatas.key)`
   - `(peer_oplog_confirmations.peer_node_id, peer_oplog_confirmations.source_node_id)`
3. Add composite indexes for hot sync queries:
   - Oplog by `(timestamp_physical, timestamp_logical)`
   - Oplog by `(timestamp_node_id, timestamp_physical, timestamp_logical)`
   - Metadata by `(hlc_physical, hlc_logical)`
4. Use `EXPLAIN FULL` during the test/benchmark phase to verify index usage.
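
One way to obtain the deterministic record IDs mentioned above is to derive the ID from the composite business key. The following Python sketch is an assumption, not the plan's chosen scheme: the hashing approach and both function names are hypothetical, and how the resulting key is rendered as a `table:id` literal is left to the provider.

```python
import hashlib


def deterministic_record_key(*parts: str) -> str:
    """Derive a stable hex key from an ordered composite business key."""
    digest = hashlib.sha256()
    for part in parts:
        encoded = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") cannot collide.
        digest.update(len(encoded).to_bytes(4, "big"))
        digest.update(encoded)
    return digest.hexdigest()


def document_metadata_id(collection: str, key: str) -> tuple[str, str]:
    """Return (table, id) for a document metadata row; the pairing is illustrative."""
    return ("document_metadatas", deterministic_record_key(collection, key))
```

Because the same `(collection, key)` pair always maps to the same ID, an upsert becomes a point lookup, and the unique index on the pair acts as a safety net rather than the primary lookup path.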

### 3.4 CDC Strategy (Durable + Low Latency)

Implement CDC with Surreal Change Feeds as the source of truth and Live Queries as optional accelerators.

1. Enable `CHANGEFEED <duration>` per watched table (add `INCLUDE ORIGINAL` when old values are required for conflict handling or debugging).
2. Persist the checkpoint cursor (`versionstamp` preferred) in `cdc_checkpoints`.
3. Poll with `SHOW CHANGES FOR TABLE <table> SINCE <cursor> LIMIT <N>`.
4. Process changes idempotently; tolerate duplicate windows when timestamp cursors overlap.
5. Commit the checkpoint only after oplog + metadata writes commit successfully.
6. Optionally run `LIVE SELECT` subscribers for lower-latency wakeups, but never rely on live events alone for durability.
7. On startup/reconnect, always catch up via `SHOW CHANGES` from the last persisted cursor.
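
The ordering in steps 2 to 5 (write first, checkpoint last, skip duplicates on replay) can be illustrated with an in-memory model. This Python sketch is not the provider implementation: `InMemoryStore`, `poll_once`, and the integer `versionstamp` are stand-ins, and the model treats the cursor as exclusive, which should be confirmed against the actual `SHOW CHANGES ... SINCE` semantics.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Change:
    versionstamp: int
    table: str
    key: str
    payload: str


@dataclass
class InMemoryStore:
    """Stand-in for the Surreal tables; real code would use SurrealQL transactions."""

    feed: list = field(default_factory=list)         # simulated changefeed
    oplog: dict = field(default_factory=dict)        # idempotency key -> Change
    checkpoints: dict = field(default_factory=dict)  # table -> last versionstamp

    def show_changes(self, table: str, since: int, limit: int) -> list:
        # Models SHOW CHANGES ... SINCE ... LIMIT with an exclusive cursor (assumption).
        matching = [c for c in self.feed if c.table == table and c.versionstamp > since]
        return sorted(matching, key=lambda c: c.versionstamp)[:limit]


def poll_once(store: InMemoryStore, table: str, batch_size: int,
              fail_before_checkpoint: bool = False) -> int:
    """Process one batch; return how many new oplog entries were written."""
    since = store.checkpoints.get(table, 0)
    batch = store.show_changes(table, since, batch_size)
    written = 0
    for change in batch:
        dedup_key = (change.table, change.versionstamp, change.key)
        if dedup_key in store.oplog:
            continue  # already processed in an earlier, interrupted attempt
        store.oplog[dedup_key] = change
        written += 1
    if fail_before_checkpoint:
        raise RuntimeError("simulated crash before checkpoint commit")
    if batch:
        # The checkpoint advances only after the writes above succeeded.
        store.checkpoints[table] = batch[-1].versionstamp
    return written
```

If the process dies after the oplog write but before the checkpoint update, the next poll re-reads the same window, writes nothing new, and then advances the cursor, which is the restart behavior the plan requires.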

### 3.5 Transaction Boundaries

Use explicit SurrealQL transactions for atomic state transitions:

1. Local CDC event -> write oplog entry + document metadata + vector clock backing data in one transaction.
2. Remote apply batch -> apply documents + merge oplog + metadata updates atomically in bounded batches.
3. Snapshot replace/merge -> table-level clear/import or merge in deterministic order with rollback on failure.
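
The "atomic in bounded batches" rule for remote apply can be modelled as follows. This is a Python sketch under stated assumptions: the deep copy stands in for a SurrealQL `BEGIN`/`COMMIT` pair, and `apply_in_batches` and its parameters are hypothetical names.

```python
import copy
from typing import Callable


def apply_in_batches(state: dict, entries: list, batch_size: int,
                     apply_entry: Callable[[dict, object], None]) -> int:
    """Apply entries in bounded batches; each batch is all-or-nothing.

    Returns the number of committed entries. If apply_entry raises, the
    current batch is discarded, earlier batches stay committed, and the
    exception propagates to the caller.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    committed = 0
    for start in range(0, len(entries), batch_size):
        batch = entries[start:start + batch_size]
        staged = copy.deepcopy(state)      # stand-in for BEGIN
        for entry in batch:
            apply_entry(staged, entry)     # an exception here abandons `staged`
        state.clear()
        state.update(staged)               # stand-in for COMMIT
        committed += len(batch)
    return committed
```

Bounding the batch keeps each transaction small, at the cost that a failure leaves earlier batches applied; the caller therefore needs to resume from the first uncommitted entry rather than restart from the beginning.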

## 4) Execution Plan (Phased)

## Phase 0: Design Freeze and Safety Rails

1. Finalize data model and table schema DDL.
2. Finalize CDC cursor semantics (`versionstamp` vs timestamp fallback).
3. Freeze shared contracts in `ZB.MOM.WW.CBDDC.Core` (no signature churn during provider port).
4. Add migration feature flag for temporary cutover control (`UseSurrealPersistence`), removed in final cleanup.

Exit criteria:

1. Design doc approved.
2. DDL + index plan reviewed.
3. CDC retention value chosen (must exceed maximum offline peer window).

## Phase 1: Surreal Infrastructure Layer

1. Add Surreal packages and connection factory.
2. Implement startup initialization: NS/DB selection, table/index creation, capability checks.
3. Introduce provider options and DI extension (`AddCBDDCSurrealEmbedded`).
4. Add health probe for embedded connection and schema readiness.

Exit criteria:

1. `dotnet build` succeeds.
2. Basic smoke test can connect, create, read, and delete records in RocksDB-backed embedded Surreal.

## Phase 2: Port Store Implementations

Port each BLite store to Surreal while preserving interface behavior:

1. `BLiteOplogStore` -> `SurrealOplogStore`
2. `BLiteDocumentMetadataStore` -> `SurrealDocumentMetadataStore`
3. `BLitePeerConfigurationStore` -> `SurrealPeerConfigurationStore`
4. `BLitePeerOplogConfirmationStore` -> `SurrealPeerOplogConfirmationStore`
5. `BLiteSnapshotMetadataStore` -> `SurrealSnapshotMetadataStore`

Implementation requirements:

1. Keep existing merge/drop/export/import semantics.
2. Preserve ordering guarantees for hash-chain methods.
3. Preserve vector clock bootstrap behavior (snapshot metadata first, oplog second).

Exit criteria:

1. Store-level unit tests pass with the Surreal backend.
2. No BLite store classes are used in the DI path.
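
The vector clock bootstrap order required in Phase 2 (snapshot metadata first, oplog second) might look like the Python sketch below. It assumes max-merge semantics per node, which the plan does not state; `Hlc`, `bootstrap_vector_clock`, and the `(node_id, timestamp)` input shape are hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple


class Hlc(NamedTuple):
    """Hybrid logical clock value; tuples compare by physical, then logical."""

    physical: int
    logical: int


def _advance(clock: dict, node_id: str, timestamp: Hlc) -> None:
    # Keep the highest timestamp seen for each node.
    current = clock.get(node_id)
    if current is None or timestamp > current:
        clock[node_id] = timestamp


def bootstrap_vector_clock(snapshot_metadata: Iterable[Tuple[str, Hlc]],
                           oplog_entries: Iterable[Tuple[str, Hlc]]) -> dict:
    """Rebuild node_id -> Hlc from persisted state."""
    clock: dict = {}
    # Step 1: seed from snapshot metadata, which covers pruned oplog history.
    for node_id, timestamp in snapshot_metadata:
        _advance(clock, node_id, timestamp)
    # Step 2: let surviving oplog entries advance the seeded values.
    for node_id, timestamp in oplog_entries:
        _advance(clock, node_id, timestamp)
    return clock
```

Under max-merge the result does not depend on the order of the two steps; the order matters only if the ported store overwrites rather than merges, which is worth pinning down with a store-level test.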

## Phase 3: Document Store + CDC Engine

1. Replace `BLiteDocumentStore<TDbContext>` with Surreal-aware document store base.
2. Implement collection registration + watched table catalog.
3. Implement CDC worker:
   - Poll `SHOW CHANGES`
   - Map CDC events to `OperationType`
   - Generate oplog + metadata
   - Enforce remote-sync suppression/idempotency
4. Keep equivalent remote apply guard semantics to prevent CDC loopback during sync replay.
5. Add graceful start/stop lifecycle hooks for CDC worker.

Exit criteria:

1. Local direct writes produce expected oplog entries.
2. Remote replay does not create duplicate local oplog entries.
3. Restart resumes CDC from persisted checkpoint without missing changes.
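
The remote apply guard from Phase 3 can be pictured as a set of markers that sync replay registers and the CDC worker consumes. The Python sketch below is one possible shape, not the existing guard's semantics: `RemoteApplyGuard`, `events_needing_oplog`, and the `(table, key, content_hash)` event tuple are all hypothetical.

```python
class RemoteApplyGuard:
    """Tracks writes made by sync replay so the CDC worker can skip them."""

    def __init__(self) -> None:
        self._pending: dict = {}  # (table, key, content_hash) -> outstanding count

    def register(self, table: str, key: str, content_hash: str) -> None:
        """Called by the remote apply path just before it writes a document."""
        marker = (table, key, content_hash)
        self._pending[marker] = self._pending.get(marker, 0) + 1

    def consume(self, table: str, key: str, content_hash: str) -> bool:
        """Return True, and forget one marker, if this CDC event came from replay."""
        marker = (table, key, content_hash)
        count = self._pending.get(marker, 0)
        if count == 0:
            return False
        if count == 1:
            del self._pending[marker]
        else:
            self._pending[marker] = count - 1
        return True


def events_needing_oplog(events: list, guard: RemoteApplyGuard) -> list:
    """Keep only CDC events that originated from local writes."""
    return [event for event in events if not guard.consume(*event)]
```

An in-memory guard like this is lost on restart, so a durable variant (or reliance on the idempotency key in the oplog) would be needed to satisfy the restart exit criterion.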

## Phase 4: Sample App and E2E Harness Migration

1. Replace sample BLite context usage with Surreal-backed sample persistence.
2. Replace `AddCBDDCBLite` usage in sample and tests.
3. Update `ClusterCrudSyncE2ETests` internals that currently access BLite collections directly.
4. Refactor fallback CDC assertion logic to Surreal-based observability hooks.

Exit criteria:

1. Sample runs two-node sync with Surreal embedded RocksDB.
2. The bidirectional E2E CRUD test passes with unchanged behavior.

## Phase 5: Data Migration Tooling and Cutover

1. Build one-time migration utility:
   - Read BLite data via existing stores
   - Write to Surreal tables
   - Preserve hashes/timestamps exactly
2. Add verification routine comparing counts, hashes, and key spot checks.
3. Document migration command and rollback artifacts.

Exit criteria:

1. Dry-run migration succeeds on fixture DB.
2. Post-migration parity checks are clean.
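
The Phase 5 verification routine (counts, hashes, spot checks) could be sketched as a comparison of per-table key-to-hash maps. This Python model is an assumption about the shape of that routine: `parity_report` and the `{table: {key: hash}}` input format are hypothetical, and reading those maps out of BLite and Surreal is out of scope here.

```python
def parity_report(source: dict, target: dict) -> dict:
    """Compare {table: {key: hash}} maps; an empty result means parity is clean."""
    report = {}
    for table in sorted(set(source) | set(target)):
        src = source.get(table, {})
        dst = target.get(table, {})
        missing = sorted(set(src) - set(dst))        # in source, absent in target
        unexpected = sorted(set(dst) - set(src))     # in target, absent in source
        mismatched = sorted(k for k in set(src) & set(dst) if src[k] != dst[k])
        if missing or unexpected or mismatched:
            report[table] = {
                "source_count": len(src),
                "target_count": len(dst),
                "missing": missing,
                "unexpected": unexpected,
                "mismatched": mismatched,
            }
    return report
```

Reporting keys rather than only counts makes a failed dry run actionable, since equal counts can still hide a swapped or corrupted record.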

## Phase 6: Remove BLite Completely

1. Delete `src/ZB.MOM.WW.CBDDC.Persistence/BLite/*` after Surreal parity is proven.
2. Remove BLite package references and BLite source generators from project files.
3. Remove `.blite` path assumptions from sample/tests/docs.
4. Update docs and READMEs to SurrealDB terminology.
5. Ensure `rg -n "BLite|blite|AddCBDDCBLite|CBDDCDocumentDbContext"` returns no functional references (except historical notes if intentionally retained).

Exit criteria:

1. Solution builds/tests pass with zero BLite runtime dependency.
2. Docs reflect Surreal-only provider path.

## 5) Safe Parallel Subagent Plan

Use parallel subagents only with strict ownership boundaries and integration gates.

## 5.1 Subagent Work Split

1. Subagent A (Infrastructure/DI)
   - Owns: new Surreal options, connection factory, DI extension, startup schema init.
   - Files: new `src/.../Surreal/*` infra files, `*.csproj` package refs.

2. Subagent B (Core Stores)
   - Owns: oplog/document metadata/snapshot metadata/peer config/peer confirmation Surreal stores.
   - Files: `src/ZB.MOM.WW.CBDDC.Persistence/Surreal/*Store.cs`.

3. Subagent C (CDC + DocumentStore)
   - Owns: Surreal document store base, CDC poller, checkpoint persistence, loopback suppression.
   - Files: `src/ZB.MOM.WW.CBDDC.Persistence/Surreal/*DocumentStore*`, CDC worker files.

4. Subagent D (Tests)
   - Owns: unit/integration/E2E tests migrated to Surreal.
   - Files: `tests/*` touched by provider swap.

5. Subagent E (Sample + Docs)
   - Owns: sample console migration and doc rewrites.
   - Files: `samples/*`, `README.md`, `docs/*` provider docs.

## 5.2 Parallel Safety Rules

1. No overlapping file ownership between active subagents.
2. Shared contract files are locked unless explicitly assigned to one subagent.
3. Each subagent must submit:
   - changed file list
   - rationale
   - commands run
   - test evidence
4. The integrator rebases/merges sequentially and never blindly squashes conflicting edits.
5. If a subagent encounters unrelated dirty changes, it must stop and escalate before editing.

## 5.3 Integration Order

1. Merge A -> B -> C -> D -> E.
2. Run full verification after each merge step, not only at the end.

## 6) Required Unit/Integration Test Matrix

## 6.1 Store Contract Tests

1. Oplog append/export/import/merge/drop parity.
2. `GetChainRangeAsync` correctness by hash chain ordering.
3. `GetLastEntryHashAsync` behavior with oplog hit and snapshot fallback.
4. Pruning respects cutoff and confirmations.
5. Document metadata upsert/mark-deleted/get-after ordering.
6. Peer config save/get/remove/merge semantics.
7. Peer confirmation registration/update/deactivate/merge semantics.
8. Snapshot metadata insert/update/merge and hash lookup.

## 6.2 CDC Tests

1. Local write on watched table emits exactly one oplog entry.
2. Delete mutation emits delete oplog + metadata tombstone.
3. Remote apply path does not re-emit local CDC oplog entries.
4. CDC checkpoint persists only after atomic write success.
5. Restart from checkpoint catches missed changes.
6. Duplicate window replay is idempotent.
7. Changefeed retention boundary behavior is explicit and logged.
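
For the retention boundary case in the CDC tests, one possible explicit behavior is to refuse incremental catch-up when the checkpoint is older than the changefeed retention and to log that decision. The Python sketch below is an assumption, not the plan's specified behavior: `choose_catch_up`, the logger name, and the snapshot fallback itself are hypothetical.

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger("cdc.retention")  # hypothetical logger name


def choose_catch_up(checkpoint_time: datetime, now: datetime,
                    retention: timedelta) -> str:
    """Return "incremental" if the changefeed still covers the checkpoint, else "snapshot"."""
    age = now - checkpoint_time
    if age >= retention:
        # Changes older than the retention window may already be gone,
        # so replaying from the cursor could silently skip them.
        logger.warning(
            "checkpoint age %s exceeds changefeed retention %s; snapshot required",
            age, retention,
        )
        return "snapshot"
    return "incremental"
```

A test for this boundary would pin `now`, vary the checkpoint age around the retention value, and assert both the returned mode and the emitted warning.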

## 6.3 Snapshot and Recovery Tests

1. `CreateSnapshotAsync` includes docs/oplog/peers/confirmations.
2. `ReplaceDatabaseAsync` restores full state.
3. `MergeSnapshotAsync` conflict behavior unchanged.
4. Recovery after process restart retains Surreal RocksDB data.

## 6.4 E2E Sync Tests

1. Two peers replicate create/update/delete bidirectionally.
2. Peer reconnect performs incremental catch-up from CDC cursor.
3. Multi-change burst preserves deterministic final state.
4. Optional fault-injection test: crash between oplog write and checkpoint update should replay safely on restart.

## 7) Verification After Each Subagent Completion

Run this checklist after each merged subagent contribution:

1. `dotnet restore`
2. `dotnet build CBDDC.slnx -c Release`
3. Targeted tests for modified projects (fast gate)
4. Full test suite before moving to next major phase:
   - `dotnet test CBDDC.slnx -c Release`
5. Regression grep checks:
   - `rg -n "BLite|AddCBDDCBLite|\.blite|CBDDCDocumentDbContext" src samples tests README.md docs`
6. Surreal smoke test:
   - create temp RocksDB path
   - start sample node
   - perform write/update/delete
   - restart process and verify persisted state
7. CDC durability test:
   - stop node
   - mutate source
   - restart node
   - confirm catch-up via `SHOW CHANGES` cursor

## 8) Rollout and Rollback

## Rollout

1. Internal canary branch with Surreal-only provider.
2. Run full CI + extended E2E soak (long-running sync/reconnect).
3. Migrate one test dataset from BLite to Surreal and validate parity.
4. Promote after acceptance criteria are met.

## Rollback

1. Keep BLite export snapshots until Surreal cutover is accepted.
2. If a severe defect appears, restore from the pre-cutover snapshot and redeploy the previous BLite-tagged build.
3. Preserve migration logs and parity reports for audit.

## 9) Definition of Done

1. No runtime BLite dependency remains.
2. All store contracts pass with Surreal backend.
3. CDC is durable (checkpointed), idempotent, and restart-safe.
4. Sample + E2E prove sync parity.
5. Documentation and onboarding instructions updated to Surreal embedded RocksDB.
6. Migration utility + validation report available for production cutover.

## 10) SurrealDB Best-Practice Notes Applied in This Plan

This plan explicitly applies official Surreal guidance:

1. Embedded .NET with RocksDB endpoint (`rocksdb://`) and explicit NS/DB usage.
2. Schema-first design with strict table/field definitions and typed record references.
3. Query/index discipline (`EXPLAIN FULL`, indexed lookups, avoid broad scans).
4. CDC durability with changefeeds and checkpointed `SHOW CHANGES` replay.
5. Live queries used as low-latency signals, not as sole durable CDC transport.
6. Security hardening (authentication, encryption/backups, restricted capabilities) for any non-embedded server deployments used in tooling/CI.

## References (Primary Sources)

1. SurrealDB .NET embedded engine docs: [https://surrealdb.com/docs/surrealdb/embedding/dotnet](https://surrealdb.com/docs/surrealdb/embedding/dotnet)
2. SurrealDB .NET SDK embedding guide: [https://surrealdb.com/docs/sdk/dotnet/embedding](https://surrealdb.com/docs/sdk/dotnet/embedding)
3. SurrealDB connection strings (protocol formats incl. RocksDB): [https://surrealdb.com/docs/surrealdb/reference-guide/connection-strings](https://surrealdb.com/docs/surrealdb/reference-guide/connection-strings)
4. SurrealDB schema best practices: [https://surrealdb.com/docs/surrealdb/reference-guide/schema-creation-best-practices](https://surrealdb.com/docs/surrealdb/reference-guide/schema-creation-best-practices)
5. SurrealDB performance best practices: [https://surrealdb.com/docs/surrealdb/reference-guide/performance-best-practices](https://surrealdb.com/docs/surrealdb/reference-guide/performance-best-practices)
6. SurrealDB real-time/events best practices: [https://surrealdb.com/docs/surrealdb/reference-guide/realtime-best-practices](https://surrealdb.com/docs/surrealdb/reference-guide/realtime-best-practices)
7. SurrealQL `DEFINE TABLE` (changefeed options): [https://surrealdb.com/docs/surrealql/statements/define/table](https://surrealdb.com/docs/surrealql/statements/define/table)
8. SurrealQL `SHOW CHANGES` (durable CDC read): [https://surrealdb.com/docs/surrealql/statements/show](https://surrealdb.com/docs/surrealql/statements/show)
9. SurrealQL `LIVE SELECT` behavior and caveats: [https://surrealdb.com/docs/surrealql/statements/live](https://surrealdb.com/docs/surrealql/statements/live)
10. SurrealDB security best practices: [https://surrealdb.com/docs/surrealdb/security/security-best-practices](https://surrealdb.com/docs/surrealdb/security/security-best-practices)
11. SurrealQL transactions (`BEGIN`/`COMMIT`): [https://surrealdb.com/docs/surrealql/statements/begin](https://surrealdb.com/docs/surrealql/statements/begin), [https://surrealdb.com/docs/surrealql/statements/commit](https://surrealdb.com/docs/surrealql/statements/commit)