Joseph Doherty 8e97061ab8 · 2026-02-22 11:58:34 -05:00
Implement in-process multi-dataset sync isolation across core, network, persistence, and tests
# In-Process Multi-Dataset Sync Plan (Worktree Execution)
## Goal
Add true in-process multi-dataset sync so primary business data can sync independently from high-volume append-only datasets (logs, timeseries), with separate state, scheduling, and backpressure behavior.
## Desired Outcome
1. Primary dataset sync throughput/latency is not materially impacted by telemetry dataset volume.
2. Log and timeseries datasets use independent sync pipelines in the same process.
3. Existing single-dataset apps continue to work with minimal/no code changes.
4. Test coverage explicitly verifies isolation and no cross-dataset leakage.
## Current Baseline (Why This Change Is Needed)
1. Current host wiring registers a single `IDocumentStore`, `IOplogStore`, and `ISyncOrchestrator` graph.
2. Collection filtering exists, but all collections still share one orchestrator/sync loop and one oplog/vector clock lifecycle.
3. Protocol filters by collection only; there is no dataset identity boundary.
4. Surreal schema objects are fixed names per configured namespace/database and are not dataset-aware by design.
## Proposed Target Architecture
### New Concepts
1. `DatasetId`:
- Stable identifier (`primary`, `logs`, `timeseries`, etc.).
- Included in all sync-state-bearing entities and wire messages.
2. `DatasetSyncContext`:
- Encapsulates one dataset's services: document store adapter, oplog store, snapshot metadata, peer confirmation state, orchestrator configuration.
3. `IMultiDatasetSyncOrchestrator`:
- Host-level coordinator that starts/stops one `ISyncOrchestrator` per dataset.
4. `DatasetSyncOptions`:
- Per-dataset scheduling and limits (loop delay, max peers, optional bandwidth/entry caps, maintenance interval override).
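Under the naming above, the new contracts might look like the following sketch. All member shapes, defaults, and the static dataset constants are assumptions, not settled API:

```csharp
// Sketch only: types follow the plan's contract list; members not named
// in the plan (Primary/Logs/Timeseries constants, default values) are
// illustrative assumptions.
public readonly record struct DatasetId(string Value)
{
    public static readonly DatasetId Primary = new("primary");
    public static readonly DatasetId Logs = new("logs");
    public static readonly DatasetId Timeseries = new("timeseries");
    public override string ToString() => Value;
}

public sealed class DatasetSyncOptions
{
    public TimeSpan LoopDelay { get; init; } = TimeSpan.FromSeconds(5);
    public int MaxPeers { get; init; } = 8;
    public int? MaxEntriesPerCycle { get; init; }            // optional entry cap
    public long? MaxBytesPerCycle { get; init; }             // optional bandwidth cap
    public TimeSpan? MaintenanceIntervalOverride { get; init; }
}

public interface IMultiDatasetSyncOrchestrator
{
    IReadOnlyCollection<DatasetId> Datasets { get; }
    Task StartAsync(CancellationToken cancellationToken);
    Task StopAsync(CancellationToken cancellationToken);
}
```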
### Isolation Model
1. Independent per-dataset oplog stream and vector clock.
2. Independent per-dataset peer confirmation watermarks for pruning.
3. Independent per-dataset transport filtering (handshake and pull/push include dataset id).
4. Independent per-dataset observability counters.
### Compatibility Strategy
1. Backward compatible wire changes:
- Add optional `dataset_id` fields; default to `"primary"` when absent.
2. Backward compatible storage:
- Add `datasetId` columns/fields where needed.
- Existing rows default to `"primary"` during migration/read fallback.
3. API defaults:
- Existing single-store registration maps to dataset `"primary"` with no functional change.
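The defaulting rule shared by points 1–3 can be centralized in one helper so the wire, storage, and API layers agree; a minimal sketch, assuming a plain string representation at those boundaries:

```csharp
// Illustrative fallback used at wire and storage boundaries:
// a missing or empty dataset id is always read as "primary".
public static class DatasetDefaults
{
    public const string Primary = "primary";

    public static string Normalize(string? datasetId) =>
        string.IsNullOrEmpty(datasetId) ? Primary : datasetId;
}
```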
## Git Worktree Execution Plan
## 0. Worktree Preparation
1. Create worktree and branch:
- `git worktree add ../CBDDC-multidataset -b codex/multidataset-sync`
2. Build baseline in worktree:
- `dotnet build CBDDC.slnx`
3. Capture baseline tests (save output artifact in worktree):
- `dotnet test CBDDC.slnx`
Deliverable:
1. Clean baseline build/test result captured before changes.
## 1. Design and Contract Layer
### Code Changes
1. Add dataset contracts in `src/ZB.MOM.WW.CBDDC.Core`:
- `DatasetId` value object or constants.
- `DatasetSyncOptions`.
- `IDatasetSyncContext`/`IMultiDatasetSyncOrchestrator`.
2. Extend domain models where sync identity is required:
- `OplogEntry` add `DatasetId` (constructor defaults to `"primary"`).
- Any metadata types used for causal state/pruning that need dataset partitioning.
3. Extend store interfaces (minimally invasive):
- Keep existing methods as compatibility overloads.
- Add dataset-aware variants where cross-dataset ambiguity exists.
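The compatibility-overload approach in point 3 could use C# default interface members, so existing single-dataset callers keep compiling unchanged; a sketch, assuming an `AppendAsync` method and a `DatasetId.Primary` constant (both illustrative, not the actual store surface):

```csharp
// Sketch of the "compatibility overload" pattern: the existing method keeps
// its signature and forwards to a dataset-aware variant with "primary".
public interface IOplogStore
{
    // Existing shape, unchanged for single-dataset callers (C# 8+ default member).
    Task AppendAsync(OplogEntry entry, CancellationToken ct) =>
        AppendAsync(entry, DatasetId.Primary, ct);

    // New dataset-aware variant; implementations provide this one.
    Task AppendAsync(OplogEntry entry, DatasetId dataset, CancellationToken ct);
}
```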
### Test Work
1. Add Core unit tests:
- `OplogEntry` hash stability with `DatasetId`.
- Defaulting behavior to `"primary"`.
- Equality/serialization behavior for dataset-aware records.
2. Update existing Core tests that construct `OplogEntry` directly.
Exit Criteria:
1. Core tests compile and pass with default dataset behavior unchanged.
## 2. Persistence Partitioning (Surreal)
### Code Changes
1. Add dataset partition key to persistence records:
- Oplog rows.
- Document metadata rows.
- Snapshot metadata rows (if used in dataset-scoped recoveries).
- Peer confirmation records.
- CDC checkpoints (the consumer id should include the dataset id, or a dedicated field should be added).
2. Update schema initializer:
- Add `datasetId` fields and composite indexes (`datasetId + existing key dimensions`).
3. Update queries in all Surreal stores:
- Enforce dataset filter in every select/update/delete path.
- Guard against full-table scans that omit dataset filter.
4. Add migration/read fallback:
- If `datasetId` missing on older records, treat as `"primary"` during transitional reads.
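A possible SurrealQL shape for the schema and query changes above; table, field, and index names are assumptions, not the initializer's actual identifiers:

```sql
-- Illustrative only: real table/index names come from the schema initializer.
DEFINE FIELD datasetId ON TABLE oplog TYPE string DEFAULT "primary";
DEFINE INDEX oplog_dataset_seq ON TABLE oplog FIELDS datasetId, sequence;

-- Every select/update/delete path carries the dataset predicate:
SELECT * FROM oplog WHERE datasetId = $dataset AND sequence > $from;
```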
### Test Work
1. Extend `SurrealStoreContractTests`:
- Write records in two datasets and verify strict isolation.
- Verify prune/merge/export/import scoped by dataset.
2. Add regression tests:
- Legacy records without `datasetId` load as `"primary"` only.
3. Update durability tests:
- CDC checkpoints do not collide between datasets.
Exit Criteria:
1. Persistence tests prove no cross-dataset reads/writes.
## 3. Network Protocol Dataset Awareness
### Code Changes
1. Update `sync.proto` (backward compatible):
- Add `dataset_id` to `HandshakeRequest`, `HandshakeResponse`, `PullChangesRequest`, `PushChangesRequest`, and optionally snapshot requests.
2. Regenerate protocol classes and adapt transport handlers:
- `TcpPeerClient` sends dataset id for every dataset pipeline.
- `TcpSyncServer` routes requests to correct dataset context.
3. Defaulting rules:
- Missing/empty `dataset_id` => `"primary"`.
4. Add explicit rejection semantics:
- If the remote peer does not support the requested dataset, accept the handshake but return a dataset-capability-mismatch response (or reject that dataset's connection individually).
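The wire change in point 1 stays backward compatible because proto3 strings default to empty when absent; a sketch of the additions (field numbers are placeholders — pick unused numbers in the real `sync.proto`):

```protobuf
message HandshakeRequest {
  // ...existing fields unchanged...
  string dataset_id = 15;  // empty => treated as "primary"
}

message PullChangesRequest {
  // ...existing fields unchanged...
  string dataset_id = 15;  // empty => treated as "primary"
}
```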
### Test Work
1. Add protocol-level unit tests:
- Message parse/serialize with and without dataset field.
2. Update network tests:
- Handshake stores remote interests per dataset.
- Pull/push operations do not cross datasets.
- Backward compatibility with no dataset id present.
Exit Criteria:
1. Network tests pass for both new and legacy message shapes.
## 4. Multi-Orchestrator Runtime and DI
### Code Changes
1. Add multi-dataset DI registration extensions:
- `AddCBDDCSurrealEmbeddedDataset(...)`
- `AddCBDDCMultiDataset(...)`
2. Build `MultiDatasetSyncOrchestrator`:
- Start/stop orchestrators for configured datasets.
- Isolated cancellation tokens, loops, and failure handling per dataset.
3. Ensure hosting services (`CBDDCNodeService`, `TcpSyncServerHostedService`) initialize dataset contexts deterministically.
4. Add per-dataset knobs:
- Sync interval, max entries per cycle, maintenance interval, optional parallelism limits.
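Host wiring for the registration extensions named above might look like this sketch; the builder callback shape and option names are assumptions layered on the plan's `AddCBDDCMultiDataset(...)` name:

```csharp
// Hypothetical host wiring; AddDataset and the option names are assumptions.
// Single-dataset apps keep their existing registration call untouched.
builder.Services.AddCBDDCMultiDataset(multi =>
{
    multi.AddDataset(DatasetId.Primary, o => o.LoopDelay = TimeSpan.FromSeconds(2));
    multi.AddDataset(DatasetId.Logs, o =>
    {
        o.LoopDelay = TimeSpan.FromSeconds(15);   // slower cadence for telemetry
        o.MaxEntriesPerCycle = 10_000;            // backpressure cap
    });
    multi.AddDataset(DatasetId.Timeseries);
});
```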
### Test Work
1. Add Hosting tests:
- Multiple datasets register/start/stop cleanly.
- Failure in one dataset does not stop others.
2. Add orchestrator tests:
- Scheduling fairness and per-dataset failure backoff isolation.
3. Update `NoOp`/fallback tests for multi-dataset mode.
Exit Criteria:
1. Runtime starts N dataset pipelines with independent lifecycle behavior.
## 5. Snapshot and Recovery Semantics
### Code Changes
1. Define snapshot scope options:
- Per-dataset snapshot and full multi-dataset snapshot.
2. Update snapshot service APIs and implementations to support:
- Export/import/merge by dataset id.
3. Ensure emergency recovery paths in orchestrator are dataset-scoped.
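A possible dataset-scoped shape for the snapshot service API; method and enum names are illustrative, not the current interface:

```csharp
// Sketch of dataset-scoped snapshot operations (names are assumptions).
public interface ISnapshotService
{
    Task ExportAsync(DatasetId dataset, Stream destination, CancellationToken ct);
    Task ImportAsync(DatasetId dataset, Stream source, SnapshotMergeMode mode, CancellationToken ct);

    // Full multi-dataset snapshot for whole-node backup/restore.
    Task ExportAllAsync(Stream destination, CancellationToken ct);
}

public enum SnapshotMergeMode { Replace, Merge }
```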
### Test Work
1. Add snapshot tests:
- Replace/merge for one dataset leaves others untouched.
2. Update reconnect regression tests:
- Snapshot-required flow only affects targeted dataset pipeline.
Exit Criteria:
1. Recovery operations preserve dataset isolation.
## 6. Sample App and Developer Experience
### Code Changes
1. Add sample configuration for three datasets:
- `primary`, `logs`, `timeseries`.
2. Implement append-only sample stores for `logs` and `timeseries`.
3. Expose sample CLI commands to emit load independently per dataset.
### Test Work
1. Add sample integration tests:
- Heavy append load on logs/timeseries does not significantly delay primary data convergence.
2. Add benchmark harness cases:
- Single-dataset baseline vs multi-dataset under telemetry load.
Exit Criteria:
1. Demonstrable isolation in sample workload.
## 7. Documentation and Migration Guides
### Code/Docs Changes
1. New doc: `docs/features/multi-dataset-sync.md`.
2. Update:
- `docs/architecture.md`
- `docs/persistence-providers.md`
- `docs/runbook.md`
3. Add migration notes:
- From single pipeline to multi-dataset configuration.
- Backward compatibility and rollout toggles.
### Test Work
1. Doc examples compile check (if applicable).
2. Add config parsing tests for dataset option sections.
Exit Criteria:
1. Operators have explicit rollout and rollback steps.
## 8. Rollout Strategy (Safe Adoption)
1. Feature flags:
- `EnableMultiDatasetSync` (global).
- `EnableDatasetPrimary/Logs/Timeseries`.
2. Rollout sequence:
- Stage 1: Deploy with flag off.
- Stage 2: Enable `primary` only in new runtime path.
- Stage 3: Enable `logs`, then `timeseries`.
3. Observability gates:
- Primary sync latency SLO must remain within threshold before enabling telemetry datasets.
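One hypothetical configuration shape for the flags and per-dataset toggles above; section and key names are assumptions, not a final schema:

```json
{
  "CBDDC": {
    "EnableMultiDatasetSync": false,
    "Datasets": {
      "primary":    { "Enabled": true,  "LoopDelaySeconds": 2 },
      "logs":       { "Enabled": false, "LoopDelaySeconds": 15 },
      "timeseries": { "Enabled": false, "LoopDelaySeconds": 15 }
    }
  }
}
```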
## 9. Test Plan (Comprehensive Coverage Matrix)
### Unit Tests
1. Core model defaults and hash behavior with dataset id.
2. Dataset routing logic in orchestrator dispatcher.
3. Protocol adapters default `dataset_id` to `"primary"` when absent.
4. Persistence query builders always include dataset predicate.
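Two of the unit tests above could be sketched in xUnit as follows; the `OplogEntry` constructor shape and the `DatasetResolver` helper are illustrative assumptions, not existing code:

```csharp
using Xunit;

public class DatasetDefaultingTests
{
    [Fact]
    public void OplogEntry_Defaults_To_Primary_Dataset()
    {
        // Constructor shape is assumed; the plan only requires the default.
        var entry = new OplogEntry(collection: "orders", documentId: "o-1", payload: "{}");
        Assert.Equal("primary", entry.DatasetId);
    }

    [Fact]
    public void Missing_Wire_DatasetId_Maps_To_Primary()
    {
        var request = new PullChangesRequest(); // proto3 string defaults to ""
        Assert.Equal("primary", DatasetResolver.Resolve(request.DatasetId));
    }
}
```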
### Integration Tests
1. Surreal stores:
- Same key/collection in different datasets remains isolated.
2. Network:
- Pull/push with mixed datasets never cross-stream.
3. Hosting:
- Independent orchestrator lifecycle and failure isolation.
### E2E Tests
1. Multi-node cluster:
- Primary converges under heavy append-only telemetry load.
2. Snapshot/recovery:
- Dataset-scoped restore preserves other datasets.
3. Backward compatibility:
- Legacy node (no dataset id) interoperates on `"primary"`.
### Non-Functional Tests
1. Throughput and latency benchmarks:
- Compare primary p95 sync lag before/after.
2. Resource isolation:
- CPU/memory pressure from telemetry datasets should not break primary SLO.
### Test Update Checklist (Existing Tests to Modify)
1. `tests/ZB.MOM.WW.CBDDC.Core.Tests`:
- Update direct `OplogEntry` constructions.
2. `tests/ZB.MOM.WW.CBDDC.Network.Tests`:
- Handshake/connection/vector-clock tests for dataset-aware flows.
3. `tests/ZB.MOM.WW.CBDDC.Hosting.Tests`:
- Add multi-dataset startup/shutdown/failure cases.
4. `tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests`:
- Extend Surreal contract and durability tests for dataset partitioning.
5. `tests/ZB.MOM.WW.CBDDC.E2E.Tests`:
- Add multi-dataset convergence + interference tests.
## Worktree Task Breakdown (Execution Order)
1. `Phase-A`: Contracts + Core model updates + unit tests.
2. `Phase-B`: Surreal schema/store partitioning + persistence tests.
3. `Phase-C`: Protocol and network routing + network tests.
4. `Phase-D`: Multi-orchestrator DI/runtime + hosting tests.
5. `Phase-E`: Snapshot/recovery updates + regression tests.
6. `Phase-F`: Sample/bench/docs + end-to-end verification.
Each phase should be committed separately in the worktree to keep deltas reviewable.
## Validation Commands (Run in Worktree)
1. `dotnet build /Users/dohertj2/Desktop/CBDDC/CBDDC.slnx`
2. `dotnet test /Users/dohertj2/Desktop/CBDDC/CBDDC.slnx`
3. Focused suites during implementation:
- `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.Core.Tests/ZB.MOM.WW.CBDDC.Core.Tests.csproj`
- `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.Network.Tests/ZB.MOM.WW.CBDDC.Network.Tests.csproj`
- `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.Hosting.Tests/ZB.MOM.WW.CBDDC.Hosting.Tests.csproj`
- `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests.csproj`
- `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.E2E.Tests/ZB.MOM.WW.CBDDC.E2E.Tests.csproj`
## Definition of Done
1. Multi-dataset mode runs `primary`, `logs`, and `timeseries` in one process with independent sync paths.
2. No cross-dataset data movement in persistence, protocol, or runtime.
3. Single-dataset existing usage still works via default `"primary"` dataset.
4. Added/updated unit, integration, and E2E tests pass in CI.
5. Docs include migration and operational guidance.