Joseph Doherty 8e97061ab8 · 2026-02-22 11:58:34 -05:00
Implement in-process multi-dataset sync isolation across core, network, persistence, and tests
# In-Process Multi-Dataset Sync Plan (Worktree Execution)
## Goal
Add true in-process multi-dataset sync so primary business data can sync independently from high-volume append-only datasets (logs, timeseries), with separate state, scheduling, and backpressure behavior.
## Desired Outcome
1. Primary dataset sync throughput/latency is not materially impacted by telemetry dataset volume.
2. Log and timeseries datasets use independent sync pipelines in the same process.
3. Existing single-dataset apps continue to work with minimal/no code changes.
4. Test coverage explicitly verifies isolation and no cross-dataset leakage.
## Current Baseline (Why This Change Is Needed)
1. Current host wiring registers a single `IDocumentStore`, `IOplogStore`, and `ISyncOrchestrator` graph.
2. Collection filtering exists, but all collections still share one orchestrator/sync loop and one oplog/vector clock lifecycle.
3. Protocol filters by collection only; there is no dataset identity boundary.
4. Surreal schema objects are fixed names per configured namespace/database and are not dataset-aware by design.
## Proposed Target Architecture
### New Concepts
1. `DatasetId`:
- Stable identifier (`primary`, `logs`, `timeseries`, etc.).
- Included in all sync-state-bearing entities and wire messages.
2. `DatasetSyncContext`:
- Encapsulates one dataset's services: document store adapter, oplog store, snapshot metadata, peer confirmation state, orchestrator configuration.
3. `IMultiDatasetSyncOrchestrator`:
- Host-level coordinator that starts/stops one `ISyncOrchestrator` per dataset.
4. `DatasetSyncOptions`:
- Per-dataset scheduling and limits (loop delay, max peers, optional bandwidth/entry caps, maintenance interval override).
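Under the naming above, the new contracts might look like the following sketch. All member shapes, defaults, and the static dataset constants are assumptions, not settled API:

```csharp
// Sketch only: types follow the plan's contract list; members not named
// in the plan (Primary/Logs/Timeseries constants, default values) are
// illustrative assumptions.
public readonly record struct DatasetId(string Value)
{
    public static readonly DatasetId Primary = new("primary");
    public static readonly DatasetId Logs = new("logs");
    public static readonly DatasetId Timeseries = new("timeseries");
    public override string ToString() => Value;
}

public sealed class DatasetSyncOptions
{
    public TimeSpan LoopDelay { get; init; } = TimeSpan.FromSeconds(5);
    public int MaxPeers { get; init; } = 8;
    public int? MaxEntriesPerCycle { get; init; }            // optional entry cap
    public long? MaxBytesPerCycle { get; init; }             // optional bandwidth cap
    public TimeSpan? MaintenanceIntervalOverride { get; init; }
}

public interface IMultiDatasetSyncOrchestrator
{
    IReadOnlyCollection<DatasetId> Datasets { get; }
    Task StartAsync(CancellationToken cancellationToken);
    Task StopAsync(CancellationToken cancellationToken);
}
```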
### Isolation Model
1. Independent per-dataset oplog stream and vector clock.
2. Independent per-dataset peer confirmation watermarks for pruning.
3. Independent per-dataset transport filtering (handshake and pull/push include dataset id).
4. Independent per-dataset observability counters.
### Compatibility Strategy
1. Backward compatible wire changes:
- Add optional `dataset_id` fields; default to `"primary"` when absent.
2. Backward compatible storage:
- Add `datasetId` columns/fields where needed.
- Existing rows default to `"primary"` during migration/read fallback.
3. API defaults:
- Existing single-store registration maps to dataset `"primary"` with no functional change.
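The defaulting rule shared by points 1–3 can be centralized in one helper so the wire, storage, and API layers agree; a minimal sketch, assuming a plain string representation at those boundaries:

```csharp
// Illustrative fallback used at wire and storage boundaries:
// a missing or empty dataset id is always read as "primary".
public static class DatasetDefaults
{
    public const string Primary = "primary";

    public static string Normalize(string? datasetId) =>
        string.IsNullOrEmpty(datasetId) ? Primary : datasetId;
}
```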
## Git Worktree Execution Plan
## 0. Worktree Preparation
1. Create worktree and branch:
- `git worktree add ../CBDDC-multidataset -b codex/multidataset-sync`
2. Build baseline in worktree:
- `dotnet build CBDDC.slnx`
3. Capture baseline tests (save output artifact in worktree):
- `dotnet test CBDDC.slnx`
Deliverable:
1. Clean baseline build/test result captured before changes.
## 1. Design and Contract Layer
### Code Changes
1. Add dataset contracts in `src/ZB.MOM.WW.CBDDC.Core`:
- `DatasetId` value object or constants.
- `DatasetSyncOptions`.
- `IDatasetSyncContext`/`IMultiDatasetSyncOrchestrator`.
2. Extend domain models where sync identity is required:
- `OplogEntry` add `DatasetId` (constructor defaults to `"primary"`).
- Any metadata types used for causal state/pruning that need dataset partitioning.
3. Extend store interfaces (minimally invasive):
- Keep existing methods as compatibility overloads.
- Add dataset-aware variants where cross-dataset ambiguity exists.
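The compatibility-overload approach in point 3 could use C# default interface members, so existing single-dataset callers keep compiling unchanged; a sketch, assuming an `AppendAsync` method and a `DatasetId.Primary` constant (both illustrative, not the actual store surface):

```csharp
// Sketch of the "compatibility overload" pattern: the existing method keeps
// its signature and forwards to a dataset-aware variant with "primary".
public interface IOplogStore
{
    // Existing shape, unchanged for single-dataset callers (C# 8+ default member).
    Task AppendAsync(OplogEntry entry, CancellationToken ct) =>
        AppendAsync(entry, DatasetId.Primary, ct);

    // New dataset-aware variant; implementations provide this one.
    Task AppendAsync(OplogEntry entry, DatasetId dataset, CancellationToken ct);
}
```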
### Test Work
1. Add Core unit tests:
- `OplogEntry` hash stability with `DatasetId`.
- Defaulting behavior to `"primary"`.
- Equality/serialization behavior for dataset-aware records.
2. Update existing Core tests that construct `OplogEntry` directly.
Exit Criteria:
1. Core tests compile and pass with default dataset behavior unchanged.
## 2. Persistence Partitioning (Surreal)
### Code Changes
1. Add dataset partition key to persistence records:
- Oplog rows.
- Document metadata rows.
- Snapshot metadata rows (if used in dataset-scoped recoveries).
- Peer confirmation records.
- CDC checkpoints (the consumer id should include the dataset id, or a dedicated field should be added).
2. Update schema initializer:
- Add `datasetId` fields and composite indexes (`datasetId + existing key dimensions`).
3. Update queries in all Surreal stores:
- Enforce dataset filter in every select/update/delete path.
- Guard against full-table scans that omit dataset filter.
4. Add migration/read fallback:
- If `datasetId` missing on older records, treat as `"primary"` during transitional reads.
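A possible SurrealQL shape for the schema and query changes above; table, field, and index names are assumptions, not the initializer's actual identifiers:

```sql
-- Illustrative only: real table/index names come from the schema initializer.
DEFINE FIELD datasetId ON TABLE oplog TYPE string DEFAULT "primary";
DEFINE INDEX oplog_dataset_seq ON TABLE oplog FIELDS datasetId, sequence;

-- Every select/update/delete path carries the dataset predicate:
SELECT * FROM oplog WHERE datasetId = $dataset AND sequence > $from;
```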
### Test Work
1. Extend `SurrealStoreContractTests`:
- Write records in two datasets and verify strict isolation.
- Verify prune/merge/export/import scoped by dataset.
2. Add regression tests:
- Legacy records without `datasetId` load as `"primary"` only.
3. Update durability tests:
- CDC checkpoints do not collide between datasets.
Exit Criteria:
1. Persistence tests prove no cross-dataset reads/writes.
## 3. Network Protocol Dataset Awareness
### Code Changes
1. Update `sync.proto` (backward compatible):
- Add `dataset_id` to `HandshakeRequest`, `HandshakeResponse`, `PullChangesRequest`, `PushChangesRequest`, and optionally snapshot requests.
2. Regenerate protocol classes and adapt transport handlers:
- `TcpPeerClient` sends dataset id for every dataset pipeline.
- `TcpSyncServer` routes requests to correct dataset context.
3. Defaulting rules:
- Missing/empty `dataset_id` => `"primary"`.
4. Add explicit rejection semantics:
- If the remote peer does not support the requested dataset, accept the handshake but return a dataset-capability-mismatch response (or reject that dataset's connection individually).
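The wire change in point 1 stays backward compatible because proto3 strings default to empty when absent; a sketch of the additions (field numbers are placeholders — pick unused numbers in the real `sync.proto`):

```protobuf
message HandshakeRequest {
  // ...existing fields unchanged...
  string dataset_id = 15;  // empty => treated as "primary"
}

message PullChangesRequest {
  // ...existing fields unchanged...
  string dataset_id = 15;  // empty => treated as "primary"
}
```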
### Test Work
1. Add protocol-level unit tests:
- Message parse/serialize with and without dataset field.
2. Update network tests:
- Handshake stores remote interests per dataset.
- Pull/push operations do not cross datasets.
- Backward compatibility with no dataset id present.
Exit Criteria:
1. Network tests pass for both new and legacy message shapes.
## 4. Multi-Orchestrator Runtime and DI
### Code Changes
1. Add multi-dataset DI registration extensions:
- `AddCBDDCSurrealEmbeddedDataset(...)`
- `AddCBDDCMultiDataset(...)`
2. Build `MultiDatasetSyncOrchestrator`:
- Start/stop orchestrators for configured datasets.
- Isolated cancellation tokens, loops, and failure handling per dataset.
3. Ensure hosting services (`CBDDCNodeService`, `TcpSyncServerHostedService`) initialize dataset contexts deterministically.
4. Add per-dataset knobs:
- Sync interval, max entries per cycle, maintenance interval, optional parallelism limits.
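Host wiring for the registration extensions named above might look like this sketch; the builder callback shape and option names are assumptions layered on the plan's `AddCBDDCMultiDataset(...)` name:

```csharp
// Hypothetical host wiring; AddDataset and the option names are assumptions.
// Single-dataset apps keep their existing registration call untouched.
builder.Services.AddCBDDCMultiDataset(multi =>
{
    multi.AddDataset(DatasetId.Primary, o => o.LoopDelay = TimeSpan.FromSeconds(2));
    multi.AddDataset(DatasetId.Logs, o =>
    {
        o.LoopDelay = TimeSpan.FromSeconds(15);   // slower cadence for telemetry
        o.MaxEntriesPerCycle = 10_000;            // backpressure cap
    });
    multi.AddDataset(DatasetId.Timeseries);
});
```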
### Test Work
1. Add Hosting tests:
- Multiple datasets register/start/stop cleanly.
- Failure in one dataset does not stop others.
2. Add orchestrator tests:
- Scheduling fairness and per-dataset failure backoff isolation.
3. Update `NoOp`/fallback tests for multi-dataset mode.
Exit Criteria:
1. Runtime starts N dataset pipelines with independent lifecycle behavior.
## 5. Snapshot and Recovery Semantics
### Code Changes
1. Define snapshot scope options:
- Per-dataset snapshot and full multi-dataset snapshot.
2. Update snapshot service APIs and implementations to support:
- Export/import/merge by dataset id.
3. Ensure emergency recovery paths in orchestrator are dataset-scoped.
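A possible dataset-scoped shape for the snapshot service API; method and enum names are illustrative, not the current interface:

```csharp
// Sketch of dataset-scoped snapshot operations (names are assumptions).
public interface ISnapshotService
{
    Task ExportAsync(DatasetId dataset, Stream destination, CancellationToken ct);
    Task ImportAsync(DatasetId dataset, Stream source, SnapshotMergeMode mode, CancellationToken ct);

    // Full multi-dataset snapshot for whole-node backup/restore.
    Task ExportAllAsync(Stream destination, CancellationToken ct);
}

public enum SnapshotMergeMode { Replace, Merge }
```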
### Test Work
1. Add snapshot tests:
- Replace/merge for one dataset leaves others untouched.
2. Update reconnect regression tests:
- Snapshot-required flow only affects targeted dataset pipeline.
Exit Criteria:
1. Recovery operations preserve dataset isolation.
## 6. Sample App and Developer Experience
### Code Changes
1. Add sample configuration for three datasets:
- `primary`, `logs`, `timeseries`.
2. Implement append-only sample stores for `logs` and `timeseries`.
3. Expose sample CLI commands to emit load independently per dataset.
### Test Work
1. Add sample integration tests:
- Heavy append load on logs/timeseries does not significantly delay primary data convergence.
2. Add benchmark harness cases:
- Single-dataset baseline vs multi-dataset under telemetry load.
Exit Criteria:
1. Demonstrable isolation in sample workload.
## 7. Documentation and Migration Guides
### Code/Docs Changes
1. New doc: `docs/features/multi-dataset-sync.md`.
2. Update:
- `docs/architecture.md`
- `docs/persistence-providers.md`
- `docs/runbook.md`
3. Add migration notes:
- From single pipeline to multi-dataset configuration.
- Backward compatibility and rollout toggles.
### Test Work
1. Doc examples compile check (if applicable).
2. Add config parsing tests for dataset option sections.
Exit Criteria:
1. Operators have explicit rollout and rollback steps.
## 8. Rollout Strategy (Safe Adoption)
1. Feature flags:
- `EnableMultiDatasetSync` (global).
- `EnableDatasetPrimary/Logs/Timeseries`.
2. Rollout sequence:
- Stage 1: Deploy with flag off.
- Stage 2: Enable `primary` only in new runtime path.
- Stage 3: Enable `logs`, then `timeseries`.
3. Observability gates:
- Primary sync latency SLO must remain within threshold before enabling telemetry datasets.
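One hypothetical configuration shape for the flags and per-dataset toggles above; section and key names are assumptions, not a final schema:

```json
{
  "CBDDC": {
    "EnableMultiDatasetSync": false,
    "Datasets": {
      "primary":    { "Enabled": true,  "LoopDelaySeconds": 2 },
      "logs":       { "Enabled": false, "LoopDelaySeconds": 15 },
      "timeseries": { "Enabled": false, "LoopDelaySeconds": 15 }
    }
  }
}
```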
## 9. Test Plan (Comprehensive Coverage Matrix)
### Unit Tests
1. Core model defaults and hash behavior with dataset id.
2. Dataset routing logic in orchestrator dispatcher.
3. Protocol adapters default `dataset_id` to `"primary"` when absent.
4. Persistence query builders always include dataset predicate.
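Two of the unit tests above could be sketched in xUnit as follows; the `OplogEntry` constructor shape and the `DatasetResolver` helper are illustrative assumptions, not existing code:

```csharp
using Xunit;

public class DatasetDefaultingTests
{
    [Fact]
    public void OplogEntry_Defaults_To_Primary_Dataset()
    {
        // Constructor shape is assumed; the plan only requires the default.
        var entry = new OplogEntry(collection: "orders", documentId: "o-1", payload: "{}");
        Assert.Equal("primary", entry.DatasetId);
    }

    [Fact]
    public void Missing_Wire_DatasetId_Maps_To_Primary()
    {
        var request = new PullChangesRequest(); // proto3 string defaults to ""
        Assert.Equal("primary", DatasetResolver.Resolve(request.DatasetId));
    }
}
```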
### Integration Tests
1. Surreal stores:
- Same key/collection in different datasets remains isolated.
2. Network:
- Pull/push with mixed datasets never cross-stream.
3. Hosting:
- Independent orchestrator lifecycle and failure isolation.
### E2E Tests
1. Multi-node cluster:
- Primary converges under heavy append-only telemetry load.
2. Snapshot/recovery:
- Dataset-scoped restore preserves other datasets.
3. Backward compatibility:
- Legacy node (no dataset id) interoperates on `"primary"`.
### Non-Functional Tests
1. Throughput and latency benchmarks:
- Compare primary p95 sync lag before/after.
2. Resource isolation:
- CPU/memory pressure from telemetry datasets should not break primary SLO.
### Test Update Checklist (Existing Tests to Modify)
1. `tests/ZB.MOM.WW.CBDDC.Core.Tests`:
- Update direct `OplogEntry` constructions.
2. `tests/ZB.MOM.WW.CBDDC.Network.Tests`:
- Handshake/connection/vector-clock tests for dataset-aware flows.
3. `tests/ZB.MOM.WW.CBDDC.Hosting.Tests`:
- Add multi-dataset startup/shutdown/failure cases.
4. `tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests`:
- Extend Surreal contract and durability tests for dataset partitioning.
5. `tests/ZB.MOM.WW.CBDDC.E2E.Tests`:
- Add multi-dataset convergence + interference tests.
## Worktree Task Breakdown (Execution Order)
1. `Phase-A`: Contracts + Core model updates + unit tests.
2. `Phase-B`: Surreal schema/store partitioning + persistence tests.
3. `Phase-C`: Protocol and network routing + network tests.
4. `Phase-D`: Multi-orchestrator DI/runtime + hosting tests.
5. `Phase-E`: Snapshot/recovery updates + regression tests.
6. `Phase-F`: Sample/bench/docs + end-to-end verification.
Each phase should be committed separately in the worktree to keep deltas reviewable.
## Validation Commands (Run in Worktree)
1. `dotnet build /Users/dohertj2/Desktop/CBDDC/CBDDC.slnx`
2. `dotnet test /Users/dohertj2/Desktop/CBDDC/CBDDC.slnx`
3. Focused suites during implementation:
- `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.Core.Tests/ZB.MOM.WW.CBDDC.Core.Tests.csproj`
- `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.Network.Tests/ZB.MOM.WW.CBDDC.Network.Tests.csproj`
- `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.Hosting.Tests/ZB.MOM.WW.CBDDC.Hosting.Tests.csproj`
- `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests.csproj`
- `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.E2E.Tests/ZB.MOM.WW.CBDDC.E2E.Tests.csproj`
## Definition of Done
1. Multi-dataset mode runs `primary`, `logs`, and `timeseries` in one process with independent sync paths.
2. No cross-dataset data movement in persistence, protocol, or runtime.
3. Single-dataset existing usage still works via default `"primary"` dataset.
4. Added/updated unit, integration, and E2E tests pass in CI.
5. Docs include migration and operational guidance.