Implement in-process multi-dataset sync isolation across core, network, persistence, and tests

2026-02-22 11:58:34 -05:00
parent c06b56172a
commit 8e97061ab8
60 changed files with 4519 additions and 559 deletions
@@ -30,6 +30,13 @@ To optimize reconnection, each node maintains a **Snapshot** of the last known s
 - If the chain hash matches, they only exchange the delta.
 - This avoids re-processing the entire operation history and ensures efficient gap recovery.

+### Multi-Dataset Sync
+CBDDC supports per-dataset sync pipelines in one process.
+
+- Dataset identity (`datasetId`) is propagated in protocol and persistence records.
+- Each dataset has independent oplog reads, confirmation state, and maintenance cadence.
+- Legacy peers that omit dataset fields interoperate on the `primary` dataset.
+
 ### Peer-Confirmed Oplog Pruning
 CBDDC maintenance pruning now uses a two-cutoff model:

@@ -8,6 +8,7 @@ This index tracks CBDDC major functionality. Each feature has one canonical docu
 - [Peer-to-Peer Gossip Sync](peer-to-peer-gossip-sync.md)
 - [Secure Peer Transport](secure-peer-transport.md)
 - [Peer-Confirmed Pruning](peer-confirmed-pruning.md)
+- [Multi-Dataset Sync](multi-dataset-sync.md)

 ## Maintenance Rules

@@ -0,0 +1,67 @@
+# Multi-Dataset Sync
+
+## Summary
+
+CBDDC can run multiple sync pipelines inside one process by assigning each pipeline a `datasetId` (for example `primary`, `logs`, `timeseries`).
+Each dataset pipeline has independent oplog state, vector-clock reads, peer confirmation watermarks, and maintenance scheduling.
+
+## Why Use It
+
+- Keep sync latency for primary business data stable during high telemetry volume.
+- Isolate append-only streams (`logs`, `timeseries`) from CRUD-heavy collections.
+- Roll out incrementally using runtime flags and per-dataset enablement.
+
+## Configuration
+
+Register dataset options and enable the runtime coordinator:
+
+```csharp
+services.AddCBDDCSurrealEmbedded<SampleDocumentStore>(sp => options)
+    .AddCBDDCSurrealEmbeddedDataset("primary", o =>
+    {
+        o.InterestingCollections = ["Users", "TodoLists"];
+    })
+    .AddCBDDCSurrealEmbeddedDataset("logs", o =>
+    {
+        o.InterestingCollections = ["Logs"];
+        o.SyncLoopDelay = TimeSpan.FromMilliseconds(500);
+    })
+    .AddCBDDCSurrealEmbeddedDataset("timeseries", o =>
+    {
+        o.InterestingCollections = ["Timeseries"];
+        o.SyncLoopDelay = TimeSpan.FromMilliseconds(500);
+    })
+    .AddCBDDCNetwork<StaticPeerNodeConfigurationProvider>();
+
+services.AddCBDDCMultiDataset(options =>
+{
+    options.EnableMultiDatasetSync = true;
+    options.EnableDatasetPrimary = true;
+    options.EnableDatasetLogs = true;
+    options.EnableDatasetTimeseries = true;
+});
+```
+
+## Wire and Storage Compatibility
+
+- Protocol messages include optional `dataset_id` fields.
+- A message with no `dataset_id` is treated as belonging to `primary`.
+- Surreal persistence records include `datasetId`; legacy rows without `datasetId` are read as `primary`.
+
+## Operational Notes
+
+- Each dataset runs its own `SyncOrchestrator` instance.
+- Maintenance pruning is dataset-scoped (`datasetId` + cutoff).
+- Snapshot APIs accept a dataset scope (for example, `CreateSnapshotAsync(stream, datasetId)`).
+
+## Migration
+
+1. Deploy with `EnableMultiDatasetSync = false`.
+2. Enable multi-dataset mode with only `primary` enabled.
+3. Enable `logs`, then verify the primary sync SLO.
+4. Enable `timeseries`, then verify the primary sync SLO again.
+
+## Rollback
+
+- Set `EnableDatasetLogs = false` and `EnableDatasetTimeseries = false` first.
+- If needed, set `EnableMultiDatasetSync = false` to return to the single `primary` sync path.
@@ -221,6 +221,14 @@ services.AddCBDDCCore()
    });
 ```

+### Multi-Dataset Partitioning
+
+Surreal persistence now stores `datasetId` on oplog, metadata, snapshot metadata, confirmation, and CDC checkpoint records.
+
+- Composite indexes include `datasetId` to prevent cross-dataset reads.
+- Legacy rows missing `datasetId` are interpreted as `primary` during reads.
+- Dataset-scoped store APIs (`ExportAsync(datasetId)`, `GetOplogAfterAsync(..., datasetId, ...)`) enforce isolation.
+
 ### CDC Durability Notes

 1. **Checkpoint semantics**: each consumer id has an independent durable cursor (`timestamp + hash`).
@@ -27,6 +27,15 @@ Capture these artifacts before remediation:
 - Current runtime configuration (excluding secrets).
 - Most recent deployment identifier and change window.

+## Multi-Dataset Gates
+
+Before enabling telemetry datasets in production:
+
+1. Enable `primary` only and record baseline primary sync lag.
+2. Enable `logs`; confirm primary lag remains within SLO.
+3. Enable `timeseries`; confirm primary lag remains within SLO.
+4. If the primary SLO regresses, disable the telemetry datasets before attempting a broader rollback.
+
 ## Recovery Plays

 ### Peer unreachable or lagging