CBDDC/separate.md
Joseph Doherty 8e97061ab8
Implement in-process multi-dataset sync isolation across core, network, persistence, and tests
2026-02-22 11:58:34 -05:00


In-Process Multi-Dataset Sync Plan (Worktree Execution)

Goal

Add true in-process multi-dataset sync so primary business data can sync independently from high-volume append-only datasets (logs, timeseries), with separate state, scheduling, and backpressure behavior.

Desired Outcome

  1. Primary dataset sync throughput/latency is not materially impacted by telemetry dataset volume.
  2. Log and timeseries datasets use independent sync pipelines in the same process.
  3. Existing single-dataset apps continue to work with no or minimal code changes.
  4. Test coverage explicitly verifies isolation and no cross-dataset leakage.

Current Baseline (Why This Change Is Needed)

  1. Current host wiring registers a single IDocumentStore, IOplogStore, and ISyncOrchestrator graph.
  2. Collection filtering exists, but all collections still share one orchestrator/sync loop and one oplog/vector clock lifecycle.
  3. Protocol filters by collection only; there is no dataset identity boundary.
  4. Surreal schema objects have fixed names per configured namespace/database and are not dataset-aware by design.

Proposed Target Architecture

New Concepts

  1. DatasetId:
    • Stable identifier (e.g. primary, logs, timeseries).
    • Included in all sync-state-bearing entities and wire messages.
  2. DatasetSyncContext:
    • Encapsulates one dataset's services: document store adapter, oplog store, snapshot metadata, peer confirmation state, orchestrator configuration.
  3. IMultiDatasetSyncOrchestrator:
    • Host-level coordinator that starts/stops one ISyncOrchestrator per dataset.
  4. DatasetSyncOptions:
    • Per-dataset scheduling and limits (loop delay, max peers, optional bandwidth/entry caps, maintenance interval override).
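
A minimal C# sketch of how these contracts could hang together; everything beyond the names listed above (member signatures, property names, defaults) is an illustrative assumption, not a final API:

```csharp
// Sketch only: members and defaults beyond the plan are assumptions.
public static class DatasetId
{
    public const string Primary = "primary";
    public const string Logs = "logs";
    public const string Timeseries = "timeseries";
}

public sealed class DatasetSyncOptions
{
    public TimeSpan LoopDelay { get; set; } = TimeSpan.FromSeconds(5);
    public int MaxPeers { get; set; } = 8;
    public int? MaxEntriesPerCycle { get; set; }    // optional entry cap
    public long? MaxBytesPerCycle { get; set; }     // optional bandwidth cap
    public TimeSpan? MaintenanceIntervalOverride { get; set; }
}

public interface IDatasetSyncContext
{
    string DatasetId { get; }
    DatasetSyncOptions Options { get; }
    // the dataset's document store adapter, oplog store, snapshot metadata,
    // and peer confirmation state are resolved through this context
}

public interface IMultiDatasetSyncOrchestrator
{
    IReadOnlyCollection<string> ActiveDatasets { get; }
    Task StartAsync(CancellationToken cancellationToken);
    Task StopAsync(CancellationToken cancellationToken);
}
```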

Isolation Model

  1. Independent per-dataset oplog stream and vector clock.
  2. Independent per-dataset peer confirmation watermarks for pruning.
  3. Independent per-dataset transport filtering (handshake and pull/push include dataset id).
  4. Independent per-dataset observability counters.

Compatibility Strategy

  1. Backward compatible wire changes:
    • Add optional dataset_id fields; default to "primary" when absent.
  2. Backward compatible storage:
    • Add datasetId columns/fields where needed.
    • Existing rows default to "primary" during migration/read fallback.
  3. API defaults:
    • Existing single-store registration maps to dataset "primary" with no functional change.
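
All three compatibility rules reduce to one normalization step. A sketch, assuming a shared helper class that does not yet exist:

```csharp
// Hypothetical helper: a single place that maps absent/empty dataset ids
// to "primary" for wire messages, stored rows, and API registrations alike.
public static class DatasetIdDefaults
{
    public const string Primary = "primary";

    public static string Normalize(string? datasetId) =>
        string.IsNullOrWhiteSpace(datasetId) ? Primary : datasetId;
}
```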

Git Worktree Execution Plan

0. Worktree Preparation

  1. Create worktree and branch:
    • git worktree add ../CBDDC-multidataset -b codex/multidataset-sync
  2. Build baseline in worktree:
    • dotnet build CBDDC.slnx
  3. Capture baseline tests (save output artifact in worktree):
    • dotnet test CBDDC.slnx

Deliverable:

  1. Clean baseline build/test result captured before changes.

1. Design and Contract Layer

Code Changes

  1. Add dataset contracts in src/ZB.MOM.WW.CBDDC.Core:
    • DatasetId value object or constants.
    • DatasetSyncOptions.
    • IDatasetSyncContext/IMultiDatasetSyncOrchestrator.
  2. Extend domain models where sync identity is required:
    • OplogEntry add DatasetId (constructor defaults to "primary").
    • Any metadata types used for causal state/pruning that need dataset partitioning.
  3. Extend store interfaces (minimally invasive):
    • Keep existing methods as compatibility overloads.
    • Add dataset-aware variants where cross-dataset ambiguity exists.
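
As a sketch of the OplogEntry change, assuming a record with constructor defaulting (the real field layout will differ):

```csharp
// Sketch: DatasetId added with a compatibility constructor so existing
// call sites keep compiling and default to "primary".
public sealed record OplogEntry
{
    public string DatasetId { get; }
    public string Collection { get; }
    public string DocumentId { get; }
    // ...existing fields (operation, payload, vector clock, hash) unchanged

    // compatibility overload: legacy callers get the "primary" dataset
    public OplogEntry(string collection, string documentId)
        : this(collection, documentId, "primary") { }

    public OplogEntry(string collection, string documentId, string datasetId)
    {
        Collection = collection;
        DocumentId = documentId;
        DatasetId = datasetId;
    }
}
```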

Test Work

  1. Add Core unit tests:
    • OplogEntry hash stability with DatasetId.
    • Defaulting behavior to "primary".
    • Equality/serialization behavior for dataset-aware records.
  2. Update existing Core tests that construct OplogEntry directly.

Exit Criteria:

  1. Core tests compile and pass with default dataset behavior unchanged.

2. Persistence Partitioning (Surreal)

Code Changes

  1. Add dataset partition key to persistence records:
    • Oplog rows.
    • Document metadata rows.
    • Snapshot metadata rows (if used in dataset-scoped recoveries).
    • Peer confirmation records.
    • CDC checkpoints (either embed the dataset id in the consumer id or add a dedicated field).
  2. Update schema initializer:
    • Add datasetId fields and composite indexes (datasetId + existing key dimensions).
  3. Update queries in all Surreal stores:
    • Enforce dataset filter in every select/update/delete path.
    • Guard against full-table scans that omit dataset filter.
  4. Add migration/read fallback:
    • If datasetId missing on older records, treat as "primary" during transitional reads.
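
An illustrative SurrealQL shape for the partitioning; the table and field names here are assumptions, not the actual schema:

```sql
-- Composite index so dataset-scoped oplog reads never full-scan:
DEFINE INDEX oplog_dataset_seq ON TABLE oplog COLUMNS datasetId, sequence;

-- Every query carries the dataset predicate; the second branch is the
-- transitional read fallback that treats legacy rows as "primary":
SELECT * FROM oplog
WHERE datasetId = $dataset
   OR (datasetId = NONE AND $dataset = "primary")
ORDER BY sequence;
```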

Test Work

  1. Extend SurrealStoreContractTests:
    • Write records in two datasets and verify strict isolation.
    • Verify prune/merge/export/import scoped by dataset.
  2. Add regression tests:
    • Legacy records without datasetId load as "primary" only.
  3. Update durability tests:
    • CDC checkpoints do not collide between datasets.

Exit Criteria:

  1. Persistence tests prove no cross-dataset reads/writes.

3. Network Protocol Dataset Awareness

Code Changes

  1. Update sync.proto (backward compatible):
    • Add dataset_id to HandshakeRequest, HandshakeResponse, PullChangesRequest, PushChangesRequest, and optionally snapshot requests.
  2. Regenerate protocol classes and adapt transport handlers:
    • TcpPeerClient sends dataset id for every dataset pipeline.
    • TcpSyncServer routes requests to correct dataset context.
  3. Defaulting rules:
    • Missing/empty dataset_id => "primary".
  4. Add explicit rejection semantics:
    • If the remote peer does not support a requested dataset, either accept the handshake and signal a per-dataset capability mismatch, or reject that dataset's connection outright.
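
The wire change could look like the fragment below; the field numbers are placeholders and must be chosen to avoid collisions with the existing sync.proto:

```protobuf
message HandshakeRequest {
  // ...existing fields unchanged...
  string dataset_id = 15;  // empty => "primary" (proto3 default)
}

message PullChangesRequest {
  // ...existing fields unchanged...
  string dataset_id = 15;  // empty => "primary" (proto3 default)
}
```

Because proto3 strings default to the empty string, legacy peers that never set the field naturally land on "primary".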

Test Work

  1. Add protocol-level unit tests:
    • Message parse/serialize with and without dataset field.
  2. Update network tests:
    • Handshake stores remote interests per dataset.
    • Pull/push operations do not cross datasets.
    • Backward compatibility with no dataset id present.

Exit Criteria:

  1. Network tests pass for both new and legacy message shapes.

4. Multi-Orchestrator Runtime and DI

Code Changes

  1. Add multi-dataset DI registration extensions:
    • AddCBDDCSurrealEmbeddedDataset(...)
    • AddCBDDCMultiDataset(...)
  2. Build MultiDatasetSyncOrchestrator:
    • Start/stop orchestrators for configured datasets.
    • Isolated cancellation tokens, loops, and failure handling per dataset.
  3. Ensure hosting services (CBDDCNodeService, TcpSyncServerHostedService) initialize dataset contexts deterministically.
  4. Add per-dataset knobs:
    • Sync interval, max entries per cycle, maintenance interval, optional parallelism limits.
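
Host wiring might then read as follows; the AddDataset builder and the option names are assumptions layered on the extension methods named above:

```csharp
services.AddCBDDCMultiDataset(multi =>
{
    // primary business data: tight loop for low sync latency
    multi.AddDataset("primary", o => o.LoopDelay = TimeSpan.FromSeconds(2));

    // telemetry datasets: slower cadence and capped batch sizes so they
    // cannot starve the primary pipeline
    multi.AddDataset("logs", o =>
    {
        o.LoopDelay = TimeSpan.FromSeconds(15);
        o.MaxEntriesPerCycle = 10_000;
    });
    multi.AddDataset("timeseries", o => o.LoopDelay = TimeSpan.FromSeconds(30));
});
```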

Test Work

  1. Add Hosting tests:
    • Multiple datasets register/start/stop cleanly.
    • Failure in one dataset does not stop others.
  2. Add orchestrator tests:
    • Scheduling fairness and per-dataset failure backoff isolation.
  3. Update NoOp/fallback tests for multi-dataset mode.

Exit Criteria:

  1. Runtime starts N dataset pipelines with independent lifecycle behavior.

5. Snapshot and Recovery Semantics

Code Changes

  1. Define snapshot scope options:
    • Per-dataset snapshot and full multi-dataset snapshot.
  2. Update snapshot service APIs and implementations to support:
    • Export/import/merge by dataset id.
  3. Ensure emergency recovery paths in orchestrator are dataset-scoped.
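
One possible dataset-scoped snapshot surface (interface name and members are illustrative):

```csharp
public enum SnapshotImportMode { Replace, Merge }

public interface IDatasetSnapshotService
{
    // export one dataset without touching the others
    Task ExportAsync(string datasetId, Stream destination, CancellationToken ct);

    // import is likewise scoped: Replace on "logs" must leave
    // "primary" and "timeseries" untouched
    Task ImportAsync(string datasetId, Stream source,
                     SnapshotImportMode mode, CancellationToken ct);
}
```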

Test Work

  1. Add snapshot tests:
    • Replace/merge for one dataset leaves others untouched.
  2. Update reconnect regression tests:
    • Snapshot-required flow only affects targeted dataset pipeline.

Exit Criteria:

  1. Recovery operations preserve dataset isolation.

6. Sample App and Developer Experience

Code Changes

  1. Add sample configuration for three datasets:
    • primary, logs, timeseries.
  2. Implement append-only sample stores for logs and timeseries.
  3. Expose sample CLI commands to emit load independently per dataset.
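
The sample configuration could take a shape like this (section and key names are assumptions):

```json
{
  "CBDDC": {
    "Datasets": {
      "primary":    { "SyncIntervalSeconds": 2 },
      "logs":       { "SyncIntervalSeconds": 15, "MaxEntriesPerCycle": 10000 },
      "timeseries": { "SyncIntervalSeconds": 30, "MaxEntriesPerCycle": 10000 }
    }
  }
}
```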

Test Work

  1. Add sample integration tests:
    • Heavy append load on logs/timeseries does not significantly delay primary data convergence.
  2. Add benchmark harness cases:
    • Single-dataset baseline vs multi-dataset under telemetry load.

Exit Criteria:

  1. Demonstrable isolation in sample workload.

7. Documentation and Migration Guides

Code/Docs Changes

  1. New doc: docs/features/multi-dataset-sync.md.
  2. Update:
    • docs/architecture.md
    • docs/persistence-providers.md
    • docs/runbook.md
  3. Add migration notes:
    • From single pipeline to multi-dataset configuration.
    • Backward compatibility and rollout toggles.

Test Work

  1. Doc examples compile check (if applicable).
  2. Add config parsing tests for dataset option sections.

Exit Criteria:

  1. Operators have explicit rollout and rollback steps.

8. Rollout Strategy (Safe Adoption)

  1. Feature flags:
    • EnableMultiDatasetSync (global).
    • EnableDatasetPrimary/Logs/Timeseries.
  2. Rollout sequence:
    • Stage 1: Deploy with flag off.
    • Stage 2: Enable primary only in new runtime path.
    • Stage 3: Enable logs, then timeseries.
  3. Observability gates:
    • Primary sync latency SLO must remain within threshold before enabling telemetry datasets.
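
Stage 2 of the rollout, expressed with the flags above (the configuration shape is an assumption):

```json
{
  "EnableMultiDatasetSync": true,
  "EnableDatasetPrimary": true,
  "EnableDatasetLogs": false,
  "EnableDatasetTimeseries": false
}
```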

9. Test Plan (Comprehensive Coverage Matrix)

Unit Tests

  1. Core model defaults and hash behavior with dataset id.
  2. Dataset routing logic in orchestrator dispatcher.
  3. Protocol adapters default dataset_id to "primary" when absent.
  4. Persistence query builders always include dataset predicate.
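
The defaulting checks in this matrix could start from an xUnit sketch like the one below; the OplogEntry constructor shape is assumed, not final:

```csharp
public class DatasetDefaultTests
{
    [Fact]
    public void MissingDatasetIdDefaultsToPrimary()
    {
        // legacy construction path: no dataset supplied
        var entry = new OplogEntry("orders", "order-1");
        Assert.Equal("primary", entry.DatasetId);
    }

    [Fact]
    public void ExplicitDatasetIdIsPreserved()
    {
        var entry = new OplogEntry("metrics", "m-1", "timeseries");
        Assert.Equal("timeseries", entry.DatasetId);
    }
}
```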

Integration Tests

  1. Surreal stores:
    • Same key/collection in different datasets remains isolated.
  2. Network:
    • Pull/push with mixed datasets never cross-stream.
  3. Hosting:
    • Independent orchestrator lifecycle and failure isolation.

E2E Tests

  1. Multi-node cluster:
    • Primary converges under heavy append-only telemetry load.
  2. Snapshot/recovery:
    • Dataset-scoped restore preserves other datasets.
  3. Backward compatibility:
    • Legacy node (no dataset id) interoperates on "primary".

Non-Functional Tests

  1. Throughput and latency benchmarks:
    • Compare primary p95 sync lag before/after.
  2. Resource isolation:
    • CPU/memory pressure from telemetry datasets should not break primary SLO.

Test Update Checklist (Existing Tests to Modify)

  1. tests/ZB.MOM.WW.CBDDC.Core.Tests:
    • Update direct OplogEntry constructions.
  2. tests/ZB.MOM.WW.CBDDC.Network.Tests:
    • Handshake/connection/vector-clock tests for dataset-aware flows.
  3. tests/ZB.MOM.WW.CBDDC.Hosting.Tests:
    • Add multi-dataset startup/shutdown/failure cases.
  4. tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests:
    • Extend Surreal contract and durability tests for dataset partitioning.
  5. tests/ZB.MOM.WW.CBDDC.E2E.Tests:
    • Add multi-dataset convergence + interference tests.

Worktree Task Breakdown (Execution Order)

  1. Phase-A: Contracts + Core model updates + unit tests.
  2. Phase-B: Surreal schema/store partitioning + persistence tests.
  3. Phase-C: Protocol and network routing + network tests.
  4. Phase-D: Multi-orchestrator DI/runtime + hosting tests.
  5. Phase-E: Snapshot/recovery updates + regression tests.
  6. Phase-F: Sample/bench/docs + end-to-end verification.

Each phase should be committed separately in the worktree to keep deltas small and reviewable.

Validation Commands (Run in Worktree)

  1. dotnet build CBDDC.slnx (run from the worktree root, not the original checkout)
  2. dotnet test CBDDC.slnx
  3. Focused suites during implementation:
    • dotnet test tests/ZB.MOM.WW.CBDDC.Core.Tests/ZB.MOM.WW.CBDDC.Core.Tests.csproj
    • dotnet test tests/ZB.MOM.WW.CBDDC.Network.Tests/ZB.MOM.WW.CBDDC.Network.Tests.csproj
    • dotnet test tests/ZB.MOM.WW.CBDDC.Hosting.Tests/ZB.MOM.WW.CBDDC.Hosting.Tests.csproj
    • dotnet test tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests.csproj
    • dotnet test tests/ZB.MOM.WW.CBDDC.E2E.Tests/ZB.MOM.WW.CBDDC.E2E.Tests.csproj

Definition of Done

  1. Multi-dataset mode runs primary, logs, and timeseries in one process with independent sync paths.
  2. No cross-dataset data movement in persistence, protocol, or runtime.
  3. Single-dataset existing usage still works via default "primary" dataset.
  4. Added/updated unit, integration, and E2E tests pass in CI.
  5. Docs include migration and operational guidance.