# In-Process Multi-Dataset Sync Plan (Worktree Execution)

## Goal

Add true in-process multi-dataset sync so primary business data can sync independently from high-volume append-only datasets (logs, timeseries), with separate state, scheduling, and backpressure behavior.

## Desired Outcome

1. Primary dataset sync throughput/latency is not materially impacted by telemetry dataset volume.
2. Log and timeseries datasets use independent sync pipelines in the same process.
3. Existing single-dataset apps continue to work with minimal or no code changes.
4. Test coverage explicitly verifies isolation and no cross-dataset leakage.

## Current Baseline (Why This Change Is Needed)

1. Current host wiring registers a single `IDocumentStore`, `IOplogStore`, and `ISyncOrchestrator` graph.
2. Collection filtering exists, but all collections still share one orchestrator/sync loop and one oplog/vector clock lifecycle.
3. The protocol filters by collection only; there is no dataset identity boundary.
4. Surreal schema objects have fixed names per configured namespace/database and are not dataset-aware by design.

## Proposed Target Architecture

### New Concepts

1. `DatasetId`:
   - Stable identifier (`primary`, `logs`, `timeseries`, etc.).
   - Included in all sync-state-bearing entities and wire messages.
2. `DatasetSyncContext`:
   - Encapsulates one dataset's services: document store adapter, oplog store, snapshot metadata, peer confirmation state, orchestrator configuration.
3. `IMultiDatasetSyncOrchestrator`:
   - Host-level coordinator that starts/stops one `ISyncOrchestrator` per dataset.
4. `DatasetSyncOptions`:
   - Per-dataset scheduling and limits (loop delay, max peers, optional bandwidth/entry caps, maintenance interval override).
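
As a rough sketch of these contracts: assuming `DatasetId` ends up as string constants and `DatasetSyncOptions` as a plain options record (both names come from this plan; the specific members and defaults shown are illustrative, not the final API), they could look like:

```csharp
// Hypothetical sketch of the new contracts; member names beyond those
// listed above are illustrative, not the final API.
using System;

public static class DatasetId
{
    public const string Primary = "primary";
    public const string Logs = "logs";
    public const string Timeseries = "timeseries";
}

public sealed record DatasetSyncOptions
{
    // Per-dataset scheduling and limits, defaulting to conservative values.
    public TimeSpan LoopDelay { get; init; } = TimeSpan.FromSeconds(5);
    public int MaxPeers { get; init; } = 8;
    public int? MaxEntriesPerCycle { get; init; }          // optional entry cap
    public TimeSpan? MaintenanceInterval { get; init; }    // optional override
}

public static class Demo
{
    public static void Main()
    {
        // A telemetry dataset can run on a slower loop than primary.
        var logs = new DatasetSyncOptions { LoopDelay = TimeSpan.FromSeconds(30) };
        Console.WriteLine($"{DatasetId.Logs}:{logs.LoopDelay.TotalSeconds}");
    }
}
```

Keeping the options a per-dataset value type is what lets each pipeline tune scheduling without touching the others.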

### Isolation Model

1. Independent per-dataset oplog stream and vector clock.
2. Independent per-dataset peer confirmation watermarks for pruning.
3. Independent per-dataset transport filtering (handshake and pull/push include the dataset id).
4. Independent per-dataset observability counters.

### Compatibility Strategy

1. Backward compatible wire changes:
   - Add optional `dataset_id` fields; default to `"primary"` when absent.
2. Backward compatible storage:
   - Add `datasetId` columns/fields where needed.
   - Existing rows default to `"primary"` during migration/read fallback.
3. API defaults:
   - Existing single-store registration maps to dataset `"primary"` with no functional change.

## Git Worktree Execution Plan

## 0. Worktree Preparation

1. Create the worktree and branch:
   - `git worktree add ../CBDDC-multidataset -b codex/multidataset-sync`
2. Build the baseline in the worktree:
   - `dotnet build CBDDC.slnx`
3. Capture baseline tests (save the output artifact in the worktree):
   - `dotnet test CBDDC.slnx`

Deliverable:

1. Clean baseline build/test result captured before changes.

## 1. Design and Contract Layer

### Code Changes

1. Add dataset contracts in `src/ZB.MOM.WW.CBDDC.Core`:
   - `DatasetId` value object or constants.
   - `DatasetSyncOptions`.
   - `IDatasetSyncContext`/`IMultiDatasetSyncOrchestrator`.
2. Extend domain models where sync identity is required:
   - `OplogEntry`: add `DatasetId` (constructor defaults to `"primary"`).
   - Any metadata types used for causal state/pruning that need dataset partitioning.
3. Extend store interfaces (minimally invasive):
   - Keep existing methods as compatibility overloads.
   - Add dataset-aware variants where cross-dataset ambiguity exists.
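
The `OplogEntry` default in step 2 is the key compatibility lever. The real type has more fields than shown here; this minimal sketch only illustrates how a defaulted constructor parameter keeps every existing call site compiling while new code can target other datasets:

```csharp
// Illustrative only: the real OplogEntry carries more fields; this shows
// the compatibility-preserving default, not the actual type.
using System;

public sealed record OplogEntry(string Collection, string Key, string DatasetId = "primary");

public static class Demo
{
    public static void Main()
    {
        // Legacy call sites that do not pass a dataset keep working...
        var legacy = new OplogEntry("orders", "o-1");
        // ...while new call sites can target another dataset explicitly.
        var telemetry = new OplogEntry("events", "e-1", "logs");
        Console.WriteLine($"{legacy.DatasetId},{telemetry.DatasetId}");
    }
}
```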

### Test Work

1. Add Core unit tests:
   - `OplogEntry` hash stability with `DatasetId`.
   - Defaulting behavior to `"primary"`.
   - Equality/serialization behavior for dataset-aware records.
2. Update existing Core tests that construct `OplogEntry` directly.

Exit Criteria:

1. Core tests compile and pass with default dataset behavior unchanged.

## 2. Persistence Partitioning (Surreal)

### Code Changes

1. Add a dataset partition key to persistence records:
   - Oplog rows.
   - Document metadata rows.
   - Snapshot metadata rows (if used in dataset-scoped recoveries).
   - Peer confirmation records.
   - CDC checkpoints (the consumer id should include the dataset id, or add a dedicated field).
2. Update the schema initializer:
   - Add `datasetId` fields and composite indexes (`datasetId` + existing key dimensions).
3. Update queries in all Surreal stores:
   - Enforce the dataset filter in every select/update/delete path.
   - Guard against full-table scans that omit the dataset filter.
4. Add migration/read fallback:
   - If `datasetId` is missing on older records, treat it as `"primary"` during transitional reads.
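
One way to implement the "guard against full-table scans" bullet is a cheap precondition on every query template before execution. This is a hypothetical helper (the name, the `$datasetId` parameter convention, and the textual check are all assumptions; a real implementation would inspect the parsed statement):

```csharp
// Hypothetical guard for the Surreal stores: every query template must
// carry the dataset predicate before it is executed.
using System;

public static class DatasetQueryGuard
{
    public static string Require(string query)
    {
        // Cheap structural check; a real implementation would inspect the
        // parsed statement rather than the raw query text.
        if (!query.Contains("datasetId = $datasetId", StringComparison.Ordinal))
            throw new InvalidOperationException($"Query missing dataset filter: {query}");
        return query;
    }
}

public static class Demo
{
    public static void Main()
    {
        var ok = DatasetQueryGuard.Require(
            "SELECT * FROM oplog WHERE datasetId = $datasetId AND collection = $c");
        Console.WriteLine(ok.StartsWith("SELECT", StringComparison.Ordinal));
    }
}
```

Failing fast at the store boundary turns an accidental cross-dataset scan into a test failure rather than silent leakage.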

### Test Work

1. Extend `SurrealStoreContractTests`:
   - Write records in two datasets and verify strict isolation.
   - Verify prune/merge/export/import scoped by dataset.
2. Add regression tests:
   - Legacy records without `datasetId` load as `"primary"` only.
3. Update durability tests:
   - CDC checkpoints do not collide between datasets.

Exit Criteria:

1. Persistence tests prove no cross-dataset reads/writes.

## 3. Network Protocol Dataset Awareness

### Code Changes

1. Update `sync.proto` (backward compatible):
   - Add `dataset_id` to `HandshakeRequest`, `HandshakeResponse`, `PullChangesRequest`, `PushChangesRequest`, and optionally the snapshot requests.
2. Regenerate protocol classes and adapt the transport handlers:
   - `TcpPeerClient` sends the dataset id for every dataset pipeline.
   - `TcpSyncServer` routes requests to the correct dataset context.
3. Defaulting rules:
   - Missing/empty `dataset_id` => `"primary"`.
4. Add explicit rejection semantics:
   - If the remote peer does not support the requested dataset, accept the handshake but respond with a dataset capability mismatch (or reject the connection for that dataset).
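
The defaulting rule in step 3 is small but easy to get subtly wrong, because proto3 string fields read as `""` when unset, so "missing" and "empty" must collapse to the same value. A sketch of the helper (the type and method names are illustrative):

```csharp
// Sketch of the wire-level defaulting rule. "primary" is the
// compatibility dataset; the helper name is illustrative.
using System;

public static class DatasetWireDefaults
{
    public const string Primary = "primary";

    // proto3 string fields read as "" when unset, so empty and missing
    // collapse to the same default here.
    public static string Resolve(string wireDatasetId) =>
        string.IsNullOrEmpty(wireDatasetId) ? Primary : wireDatasetId;
}

public static class Demo
{
    public static void Main()
    {
        Console.WriteLine(DatasetWireDefaults.Resolve(null));   // legacy peer
        Console.WriteLine(DatasetWireDefaults.Resolve(""));     // proto3 unset
        Console.WriteLine(DatasetWireDefaults.Resolve("logs")); // explicit
    }
}
```

Routing every inbound message through one such helper keeps the legacy-peer behavior in a single testable place.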

### Test Work

1. Add protocol-level unit tests:
   - Message parse/serialize with and without the dataset field.
2. Update network tests:
   - Handshake stores remote interests per dataset.
   - Pull/push operations do not cross datasets.
   - Backward compatibility with no dataset id present.

Exit Criteria:

1. Network tests pass for both new and legacy message shapes.

## 4. Multi-Orchestrator Runtime and DI

### Code Changes

1. Add multi-dataset DI registration extensions:
   - `AddCBDDCSurrealEmbeddedDataset(...)`
   - `AddCBDDCMultiDataset(...)`
2. Build `MultiDatasetSyncOrchestrator`:
   - Start/stop orchestrators for configured datasets.
   - Isolated cancellation tokens, loops, and failure handling per dataset.
3. Ensure hosting services (`CBDDCNodeService`, `TcpSyncServerHostedService`) initialize dataset contexts deterministically.
4. Add per-dataset knobs:
   - Sync interval, max entries per cycle, maintenance interval, optional parallelism limits.
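
The failure-isolation requirement in step 2 can be sketched as one cancellation source per dataset, with each loop's exceptions contained inside its own task. The inner orchestrator is simulated below (the real `ISyncOrchestrator` is not shown, and the class name is taken from this plan, not existing code):

```csharp
// Failure-isolation sketch for the coordinator described above. The inner
// orchestrator is simulated; the real ISyncOrchestrator is not shown here.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class MultiDatasetSyncOrchestratorSketch
{
    private readonly ConcurrentDictionary<string, CancellationTokenSource> _cts = new();

    public Task Start(string datasetId, Func<CancellationToken, Task> loop)
    {
        // One cancellation source per dataset: stopping or crashing one
        // pipeline never touches the tokens of the others.
        var cts = _cts.GetOrAdd(datasetId, _ => new CancellationTokenSource());
        return Task.Run(async () =>
        {
            try { await loop(cts.Token); }
            catch (Exception) { /* per-dataset failure stays contained */ }
        });
    }

    public void Stop(string datasetId)
    {
        if (_cts.TryRemove(datasetId, out var cts)) cts.Cancel();
    }
}

public static class Demo
{
    public static async Task Main()
    {
        var orch = new MultiDatasetSyncOrchestratorSketch();
        var survived = new List<string>();
        var crash = orch.Start("logs", _ => throw new InvalidOperationException("boom"));
        var run = orch.Start("primary", async ct =>
        {
            await Task.Delay(50, CancellationToken.None);
            survived.Add("primary"); // still running after "logs" crashed
        });
        await Task.WhenAll(crash, run);
        Console.WriteLine(string.Join(",", survived));
    }
}
```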

### Test Work

1. Add Hosting tests:
   - Multiple datasets register/start/stop cleanly.
   - Failure in one dataset does not stop the others.
2. Add orchestrator tests:
   - Scheduling fairness and per-dataset failure backoff isolation.
3. Update `NoOp`/fallback tests for multi-dataset mode.

Exit Criteria:

1. The runtime starts N dataset pipelines with independent lifecycle behavior.

## 5. Snapshot and Recovery Semantics

### Code Changes

1. Define snapshot scope options:
   - Per-dataset snapshot and full multi-dataset snapshot.
2. Update snapshot service APIs and implementations to support:
   - Export/import/merge by dataset id.
3. Ensure emergency recovery paths in the orchestrator are dataset-scoped.

### Test Work

1. Add snapshot tests:
   - Replace/merge for one dataset leaves the others untouched.
2. Update reconnect regression tests:
   - The snapshot-required flow only affects the targeted dataset pipeline.

Exit Criteria:

1. Recovery operations preserve dataset isolation.

## 6. Sample App and Developer Experience

### Code Changes

1. Add sample configuration for three datasets:
   - `primary`, `logs`, `timeseries`.
2. Implement append-only sample stores for `logs` and `timeseries`.
3. Expose sample CLI commands to emit load independently per dataset.

### Test Work

1. Add sample integration tests:
   - Heavy append load on logs/timeseries does not significantly delay primary data convergence.
2. Add benchmark harness cases:
   - Single-dataset baseline vs. multi-dataset under telemetry load.

Exit Criteria:

1. Demonstrable isolation in the sample workload.

## 7. Documentation and Migration Guides

### Code/Docs Changes

1. New doc: `docs/features/multi-dataset-sync.md`.
2. Update:
   - `docs/architecture.md`
   - `docs/persistence-providers.md`
   - `docs/runbook.md`
3. Add migration notes:
   - From single pipeline to multi-dataset configuration.
   - Backward compatibility and rollout toggles.

### Test Work

1. Compile-check doc examples (if applicable).
2. Add config parsing tests for the dataset option sections.

Exit Criteria:

1. Operators have explicit rollout and rollback steps.

## 8. Rollout Strategy (Safe Adoption)

1. Feature flags:
   - `EnableMultiDatasetSync` (global).
   - `EnableDatasetPrimary/Logs/Timeseries`.
2. Rollout sequence:
   - Stage 1: Deploy with the flag off.
   - Stage 2: Enable `primary` only in the new runtime path.
   - Stage 3: Enable `logs`, then `timeseries`.
3. Observability gates:
   - The primary sync latency SLO must remain within threshold before the telemetry datasets are enabled.
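
The staged rollout above can be captured in a small options type. This is a hedged sketch: the plan lists individual per-dataset flags, while this version models them as an enabled-set for brevity, and the type name is an assumption. Defaults implement Stage 1, so a plain deploy changes nothing:

```csharp
// Hedged sketch of the rollout flags; property defaults implement Stage 1
// (everything off) so a plain deploy changes nothing.
using System;
using System.Collections.Generic;

public sealed record MultiDatasetFeatureFlags
{
    public bool EnableMultiDatasetSync { get; init; } = false; // global kill switch
    public HashSet<string> EnabledDatasets { get; init; } = new();

    public bool IsEnabled(string datasetId) =>
        EnableMultiDatasetSync && EnabledDatasets.Contains(datasetId);
}

public static class Demo
{
    public static void Main()
    {
        // Stage 2: global flag on, primary only.
        var stage2 = new MultiDatasetFeatureFlags
        {
            EnableMultiDatasetSync = true,
            EnabledDatasets = new HashSet<string> { "primary" },
        };
        Console.WriteLine($"{stage2.IsEnabled("primary")},{stage2.IsEnabled("logs")}");
    }
}
```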

## 9. Test Plan (Comprehensive Coverage Matrix)

### Unit Tests

1. Core model defaults and hash behavior with the dataset id.
2. Dataset routing logic in the orchestrator dispatcher.
3. Protocol adapters default `dataset_id` to `"primary"` when absent.
4. Persistence query builders always include the dataset predicate.

### Integration Tests

1. Surreal stores:
   - The same key/collection in different datasets remains isolated.
2. Network:
   - Pull/push with mixed datasets never cross-stream.
3. Hosting:
   - Independent orchestrator lifecycle and failure isolation.

### E2E Tests

1. Multi-node cluster:
   - Primary converges under heavy append-only telemetry load.
2. Snapshot/recovery:
   - Dataset-scoped restore preserves the other datasets.
3. Backward compatibility:
   - A legacy node (no dataset id) interoperates on `"primary"`.

### Non-Functional Tests

1. Throughput and latency benchmarks:
   - Compare primary p95 sync lag before/after.
2. Resource isolation:
   - CPU/memory pressure from the telemetry datasets should not break the primary SLO.

## Test Update Checklist (Existing Tests to Modify)

1. `tests/ZB.MOM.WW.CBDDC.Core.Tests`:
   - Update direct `OplogEntry` constructions.
2. `tests/ZB.MOM.WW.CBDDC.Network.Tests`:
   - Handshake/connection/vector-clock tests for dataset-aware flows.
3. `tests/ZB.MOM.WW.CBDDC.Hosting.Tests`:
   - Add multi-dataset startup/shutdown/failure cases.
4. `tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests`:
   - Extend Surreal contract and durability tests for dataset partitioning.
5. `tests/ZB.MOM.WW.CBDDC.E2E.Tests`:
   - Add multi-dataset convergence + interference tests.

## Worktree Task Breakdown (Execution Order)

1. `Phase-A`: Contracts + Core model updates + unit tests.
2. `Phase-B`: Surreal schema/store partitioning + persistence tests.
3. `Phase-C`: Protocol and network routing + network tests.
4. `Phase-D`: Multi-orchestrator DI/runtime + hosting tests.
5. `Phase-E`: Snapshot/recovery updates + regression tests.
6. `Phase-F`: Sample/bench/docs + end-to-end verification.

Each phase should be committed separately in the worktree to keep each delta reviewable.

## Validation Commands (Run in Worktree)

1. `dotnet build CBDDC.slnx`
2. `dotnet test CBDDC.slnx`
3. Focused suites during implementation:
   - `dotnet test tests/ZB.MOM.WW.CBDDC.Core.Tests/ZB.MOM.WW.CBDDC.Core.Tests.csproj`
   - `dotnet test tests/ZB.MOM.WW.CBDDC.Network.Tests/ZB.MOM.WW.CBDDC.Network.Tests.csproj`
   - `dotnet test tests/ZB.MOM.WW.CBDDC.Hosting.Tests/ZB.MOM.WW.CBDDC.Hosting.Tests.csproj`
   - `dotnet test tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests.csproj`
   - `dotnet test tests/ZB.MOM.WW.CBDDC.E2E.Tests/ZB.MOM.WW.CBDDC.E2E.Tests.csproj`

All paths are relative so the commands run against the worktree checkout rather than the original clone.

## Definition of Done

1. Multi-dataset mode runs `primary`, `logs`, and `timeseries` in one process with independent sync paths.
2. No cross-dataset data movement in persistence, protocol, or runtime.
3. Existing single-dataset usage still works via the default `"primary"` dataset.
4. Added/updated unit, integration, and E2E tests pass in CI.
5. Docs include migration and operational guidance.