# In-Process Multi-Dataset Sync Plan (Worktree Execution)

## Goal

Add true in-process multi-dataset sync so primary business data can sync independently from high-volume append-only datasets (logs, timeseries), with separate state, scheduling, and backpressure behavior.

## Desired Outcome

- Primary dataset sync throughput/latency is not materially impacted by telemetry dataset volume.
- Log and timeseries datasets use independent sync pipelines in the same process.
- Existing single-dataset apps continue to work with minimal or no code changes.
- Test coverage explicitly verifies isolation and no cross-dataset leakage.
## Current Baseline (Why This Change Is Needed)

- Current host wiring registers a single `IDocumentStore`, `IOplogStore`, and `ISyncOrchestrator` graph.
- Collection filtering exists, but all collections still share one orchestrator/sync loop and one oplog/vector clock lifecycle.
- The protocol filters by collection only; there is no dataset identity boundary.
- Surreal schema objects have fixed names per configured namespace/database and are not dataset-aware by design.
## Proposed Target Architecture

### New Concepts

- `DatasetId`:
  - Stable identifier (`primary`, `logs`, `timeseries`, etc.).
  - Included in all sync-state-bearing entities and wire messages.
- `DatasetSyncContext`:
  - Encapsulates one dataset's services: document store adapter, oplog store, snapshot metadata, peer confirmation state, orchestrator configuration.
- `IMultiDatasetSyncOrchestrator`:
  - Host-level coordinator that starts/stops one `ISyncOrchestrator` per dataset.
- `DatasetSyncOptions`:
  - Per-dataset scheduling and limits (loop delay, max peers, optional bandwidth/entry caps, maintenance interval override).
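A minimal C# sketch of these contracts (names follow the plan; members and defaults are illustrative assumptions, not the actual CBDDC API):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stable dataset identity, carried by sync-state-bearing entities and wire messages.
public readonly record struct DatasetId(string Value)
{
    public static readonly DatasetId Primary = new("primary");
    public override string ToString() => Value;
}

// Per-dataset scheduling and limits; concrete defaults here are illustrative.
public sealed class DatasetSyncOptions
{
    public TimeSpan LoopDelay { get; set; } = TimeSpan.FromSeconds(5);
    public int MaxPeers { get; set; } = 8;
    public int? MaxEntriesPerCycle { get; set; }      // optional entry cap
    public TimeSpan? MaintenanceInterval { get; set; } // optional override
}

// Host-level coordinator: one underlying orchestrator per configured dataset.
public interface IMultiDatasetSyncOrchestrator
{
    Task StartAsync(CancellationToken ct);
    Task StopAsync(CancellationToken ct);
}
```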
### Isolation Model

- Independent per-dataset oplog stream and vector clock.
- Independent per-dataset peer confirmation watermarks for pruning.
- Independent per-dataset transport filtering (handshake and pull/push include the dataset id).
- Independent per-dataset observability counters.
### Compatibility Strategy

- Backward compatible wire changes:
  - Add optional `dataset_id` fields; default to `"primary"` when absent.
- Backward compatible storage:
  - Add `datasetId` columns/fields where needed.
  - Existing rows default to `"primary"` during migration/read fallback.
- API defaults:
  - Existing single-store registration maps to dataset `"primary"` with no functional change.
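The defaulting rule above can be sketched as a small normalization helper (a hypothetical name; the real code may fold this into the dataset id type itself):

```csharp
// Sketch of the compatibility defaulting rule: any absent/empty dataset id
// (legacy rows, legacy peers, legacy registrations) maps to "primary".
public static class DatasetIdDefaults
{
    public const string Primary = "primary";

    public static string Normalize(string? datasetId) =>
        string.IsNullOrWhiteSpace(datasetId) ? Primary : datasetId;
}
```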
## Git Worktree Execution Plan

### 0. Worktree Preparation

- Create worktree and branch:
  `git worktree add ../CBDDC-multidataset -b codex/multidataset-sync`
- Build baseline in worktree:
  `dotnet build CBDDC.slnx`
- Capture baseline tests (save output artifact in worktree):
  `dotnet test CBDDC.slnx`

Deliverable:
- Clean baseline build/test result captured before changes.
### 1. Design and Contract Layer

Code Changes
- Add dataset contracts in `src/ZB.MOM.WW.CBDDC.Core`:
  - `DatasetId` value object or constants.
  - `DatasetSyncOptions`.
  - `IDatasetSyncContext` / `IMultiDatasetSyncOrchestrator`.
- Extend domain models where sync identity is required:
  - `OplogEntry`: add `DatasetId` (constructor defaults to `"primary"`).
  - Any metadata types used for causal state/pruning that need dataset partitioning.
- Extend store interfaces (minimally invasive):
  - Keep existing methods as compatibility overloads.
  - Add dataset-aware variants where cross-dataset ambiguity exists.

Test Work
- Add Core unit tests:
  - `OplogEntry` hash stability with `DatasetId`.
  - Defaulting behavior to `"primary"`.
  - Equality/serialization behavior for dataset-aware records.
- Update existing Core tests that construct `OplogEntry` directly.

Exit Criteria:
- Core tests compile and pass with default dataset behavior unchanged.
### 2. Persistence Partitioning (Surreal)

Code Changes
- Add dataset partition key to persistence records:
  - Oplog rows.
  - Document metadata rows.
  - Snapshot metadata rows (if used in dataset-scoped recoveries).
  - Peer confirmation records.
  - CDC checkpoints (the consumer id should include the dataset id, or add a dedicated field).
- Update schema initializer:
  - Add `datasetId` fields and composite indexes (`datasetId` + existing key dimensions).
- Update queries in all Surreal stores:
  - Enforce the dataset filter in every select/update/delete path.
  - Guard against full-table scans that omit the dataset filter.
- Add migration/read fallback:
  - If `datasetId` is missing on older records, treat it as `"primary"` during transitional reads.

Test Work
- Extend `SurrealStoreContractTests`:
  - Write records in two datasets and verify strict isolation.
  - Verify prune/merge/export/import scoped by dataset.
- Add regression tests:
  - Legacy records without `datasetId` load as `"primary"` only.
- Update durability tests:
  - CDC checkpoints do not collide between datasets.

Exit Criteria:
- Persistence tests prove no cross-dataset reads/writes.
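The schema change above might look like the following SurrealQL sketch (table, field, and index names are illustrative, not the project's actual schema):

```surql
-- Dataset partition key with a read-fallback-friendly default for legacy rows.
DEFINE FIELD datasetId ON TABLE oplog TYPE string DEFAULT "primary";

-- Composite index: dataset id leads, followed by the existing key dimensions,
-- so every dataset-filtered query stays off the full-table-scan path.
DEFINE INDEX oplog_dataset_seq ON TABLE oplog COLUMNS datasetId, nodeId, sequence;
```

Leading the composite index with `datasetId` keeps per-dataset scans cheap even when one dataset (e.g. telemetry) dominates row counts.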
### 3. Network Protocol Dataset Awareness

Code Changes
- Update `sync.proto` (backward compatible):
  - Add `dataset_id` to `HandshakeRequest`, `HandshakeResponse`, `PullChangesRequest`, `PushChangesRequest`, and optionally the snapshot requests.
- Regenerate protocol classes and adapt transport handlers:
  - `TcpPeerClient` sends the dataset id for every dataset pipeline.
  - `TcpSyncServer` routes requests to the correct dataset context.
- Defaulting rules:
  - Missing/empty `dataset_id` => `"primary"`.
- Add explicit rejection semantics:
  - If the remote peer does not support a requested dataset, accept the handshake but respond through a dataset-capability-mismatch path (or reject the per-dataset connection).

Test Work
- Add protocol-level unit tests:
  - Message parse/serialize with and without the dataset field.
- Update network tests:
  - Handshake stores remote interests per dataset.
  - Pull/push operations do not cross datasets.
  - Backward compatibility with no dataset id present.

Exit Criteria:
- Network tests pass for both new and legacy message shapes.
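The wire change could be sketched in `sync.proto` as follows (the field number and surrounding message layout are illustrative; pick an unused number in the real file):

```proto
syntax = "proto3";

message HandshakeRequest {
  // ...existing fields unchanged...

  // New optional field. In proto3 an unset string deserializes as "",
  // which receivers treat as "primary" — so legacy peers that never
  // send it interoperate on the primary dataset without changes.
  string dataset_id = 15;
}
```

Because proto3 ignores unknown fields, a legacy node receiving this message also parses it cleanly, which is what makes the change backward compatible in both directions.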
### 4. Multi-Orchestrator Runtime and DI

Code Changes
- Add multi-dataset DI registration extensions:
  - `AddCBDDCSurrealEmbeddedDataset(...)`
  - `AddCBDDCMultiDataset(...)`
- Build `MultiDatasetSyncOrchestrator`:
  - Start/stop orchestrators for configured datasets.
  - Isolated cancellation tokens, loops, and failure handling per dataset.
- Ensure hosting services (`CBDDCNodeService`, `TcpSyncServerHostedService`) initialize dataset contexts deterministically.
- Add per-dataset knobs:
  - Sync interval, max entries per cycle, maintenance interval, optional parallelism limits.

Test Work
- Add Hosting tests:
  - Multiple datasets register/start/stop cleanly.
  - Failure in one dataset does not stop others.
- Add orchestrator tests:
  - Scheduling fairness and per-dataset failure backoff isolation.
- Update `NoOp`/fallback tests for multi-dataset mode.

Exit Criteria:
- Runtime starts N dataset pipelines with independent lifecycle behavior.
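The per-dataset lifecycle isolation can be sketched like this (a simplified stand-in, assuming the real `MultiDatasetSyncOrchestrator` wraps one `ISyncOrchestrator` loop per dataset):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Simplified coordinator: each dataset loop owns its own CancellationTokenSource,
// so cancelling or faulting one dataset never touches the others.
public sealed class MultiDatasetRunner
{
    private readonly Dictionary<string, CancellationTokenSource> _loops = new();

    public IReadOnlyCollection<string> Running => _loops.Keys;

    public void Start(string datasetId, Func<CancellationToken, Task> loop)
    {
        var cts = new CancellationTokenSource();
        _loops[datasetId] = cts;
        _ = Task.Run(async () =>
        {
            try { await loop(cts.Token); }
            catch { Stop(datasetId); } // failure is isolated to this dataset
        });
    }

    public void Stop(string datasetId)
    {
        if (_loops.Remove(datasetId, out var cts)) cts.Cancel();
    }
}
```

Per-dataset tokens (rather than one linked token for all pipelines) are what lets "stop telemetry" or "telemetry crashed" leave the primary loop running.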
### 5. Snapshot and Recovery Semantics

Code Changes
- Define snapshot scope options:
  - Per-dataset snapshot and full multi-dataset snapshot.
- Update snapshot service APIs and implementations to support:
  - Export/import/merge by dataset id.
- Ensure emergency recovery paths in the orchestrator are dataset-scoped.

Test Work
- Add snapshot tests:
  - Replace/merge for one dataset leaves others untouched.
- Update reconnect regression tests:
  - The snapshot-required flow only affects the targeted dataset pipeline.

Exit Criteria:
- Recovery operations preserve dataset isolation.
### 6. Sample App and Developer Experience

Code Changes
- Add sample configuration for three datasets: `primary`, `logs`, `timeseries`.
- Implement append-only sample stores for `logs` and `timeseries`.
- Expose sample CLI commands to emit load independently per dataset.

Test Work
- Add sample integration tests:
  - Heavy append load on logs/timeseries does not significantly delay primary data convergence.
- Add benchmark harness cases:
  - Single-dataset baseline vs multi-dataset under telemetry load.

Exit Criteria:
- Demonstrable isolation in the sample workload.
### 7. Documentation and Migration Guides

Code/Docs Changes
- New doc: `docs/features/multi-dataset-sync.md`.
- Update:
  - `docs/architecture.md`
  - `docs/persistence-providers.md`
  - `docs/runbook.md`
- Add migration notes:
  - From single pipeline to multi-dataset configuration.
  - Backward compatibility and rollout toggles.

Test Work
- Compile-check doc examples (if applicable).
- Add config parsing tests for dataset option sections.

Exit Criteria:
- Operators have explicit rollout and rollback steps.
### 8. Rollout Strategy (Safe Adoption)

- Feature flags:
  - `EnableMultiDatasetSync` (global).
  - `EnableDatasetPrimary/Logs/Timeseries` (per dataset).
- Rollout sequence:
  - Stage 1: Deploy with the flag off.
  - Stage 2: Enable `primary` only in the new runtime path.
  - Stage 3: Enable `logs`, then `timeseries`.
- Observability gates:
  - The primary sync latency SLO must remain within threshold before enabling telemetry datasets.
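A hypothetical configuration shape for these flags (section and key names are assumptions for illustration, not the project's actual options binding):

```json
{
  "CBDDC": {
    "EnableMultiDatasetSync": false,
    "Datasets": {
      "primary":    { "Enabled": true,  "SyncIntervalSeconds": 5 },
      "logs":       { "Enabled": false, "SyncIntervalSeconds": 30 },
      "timeseries": { "Enabled": false, "SyncIntervalSeconds": 30 }
    }
  }
}
```

This shape supports the staged rollout directly: ship with the global flag off, then flip `primary`, then enable the telemetry datasets one at a time once the SLO gate holds.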
### 9. Test Plan (Comprehensive Coverage Matrix)

Unit Tests
- Core model defaults and hash behavior with dataset id.
- Dataset routing logic in the orchestrator dispatcher.
- Protocol adapters default `dataset_id` to `"primary"` when absent.
- Persistence query builders always include the dataset predicate.

Integration Tests
- Surreal stores:
  - The same key/collection in different datasets remains isolated.
- Network:
  - Pull/push with mixed datasets never cross-stream.
- Hosting:
  - Independent orchestrator lifecycle and failure isolation.

E2E Tests
- Multi-node cluster:
  - Primary converges under heavy append-only telemetry load.
- Snapshot/recovery:
  - Dataset-scoped restore preserves other datasets.
- Backward compatibility:
  - A legacy node (no dataset id) interoperates on `"primary"`.

Non-Functional Tests
- Throughput and latency benchmarks:
  - Compare primary p95 sync lag before/after.
- Resource isolation:
  - CPU/memory pressure from telemetry datasets should not break the primary SLO.
## Test Update Checklist (Existing Tests to Modify)

- `tests/ZB.MOM.WW.CBDDC.Core.Tests`:
  - Update direct `OplogEntry` constructions.
- `tests/ZB.MOM.WW.CBDDC.Network.Tests`:
  - Handshake/connection/vector-clock tests for dataset-aware flows.
- `tests/ZB.MOM.WW.CBDDC.Hosting.Tests`:
  - Add multi-dataset startup/shutdown/failure cases.
- `tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests`:
  - Extend Surreal contract and durability tests for dataset partitioning.
- `tests/ZB.MOM.WW.CBDDC.E2E.Tests`:
  - Add multi-dataset convergence and interference tests.
## Worktree Task Breakdown (Execution Order)

- Phase-A: Contracts + Core model updates + unit tests.
- Phase-B: Surreal schema/store partitioning + persistence tests.
- Phase-C: Protocol and network routing + network tests.
- Phase-D: Multi-orchestrator DI/runtime + hosting tests.
- Phase-E: Snapshot/recovery updates + regression tests.
- Phase-F: Sample/bench/docs + end-to-end verification.

Each phase should be committed separately in the worktree to keep deltas reviewable.
## Validation Commands (Run in Worktree)

- `dotnet build /Users/dohertj2/Desktop/CBDDC/CBDDC.slnx`
- `dotnet test /Users/dohertj2/Desktop/CBDDC/CBDDC.slnx`
- Focused suites during implementation:
  - `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.Core.Tests/ZB.MOM.WW.CBDDC.Core.Tests.csproj`
  - `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.Network.Tests/ZB.MOM.WW.CBDDC.Network.Tests.csproj`
  - `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.Hosting.Tests/ZB.MOM.WW.CBDDC.Hosting.Tests.csproj`
  - `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests/ZB.MOM.WW.CBDDC.Sample.Console.Tests.csproj`
  - `dotnet test /Users/dohertj2/Desktop/CBDDC/tests/ZB.MOM.WW.CBDDC.E2E.Tests/ZB.MOM.WW.CBDDC.E2E.Tests.csproj`
## Definition of Done

- Multi-dataset mode runs `primary`, `logs`, and `timeseries` in one process with independent sync paths.
- No cross-dataset data movement in persistence, protocol, or runtime.
- Existing single-dataset usage still works via the default `"primary"` dataset.
- Added/updated unit, integration, and E2E tests pass in CI.
- Docs include migration and operational guidance.