Initial import of the CBDDC codebase with docs and tests. Add a .NET-focused gitignore to keep generated artifacts out of source control.
# Plan: Peer-Confirmed Oplog Retention and Peer Sync Health

## Objective

Move from time-only oplog pruning to peer-confirmed pruning so entries are not removed until peers have confirmed them, while also adding peer lifecycle management and health visibility.

## Requested Outcomes

- Oplogs are not cleared until confirmed by peers.
- Each node tracks the latest oplog confirmation per peer.
- New peers are automatically registered when they join.
- Deprecated peers can be explicitly removed from tracking.
- Hosting health checks report peer sync status, not only store availability.
## Current Baseline (Codebase)

- Pruning is retention-time based in `SyncOrchestrator` + `IOplogStore.PruneOplogAsync(...)`.
- Push ACK is binary success/fail (`AckResponse`) and does not persist peer confirmation state.
- Peer discovery exists (`IDiscoveryService`, `UdpDiscoveryService`, `CompositeDiscoveryService`).
- Persistent remote peer config exists (`IPeerConfigurationStore`, `PeerManagementService`).
- Hosting health check only validates oplog access (`CBDDCHealthCheck` in the Hosting project).
## Design Decisions

- Track confirmation as a persisted watermark per `(peerNodeId, sourceNodeId)` using HLC and hash.
- Use existing vector-clock exchange and successful push results to advance confirmations (no mandatory wire protocol break required).
- Treat tracked peers as pruning blockers until explicitly removed.
- Keep peer registration idempotent and safe for repeated discovery events.
## Data Model and Persistence Plan

### 1. Add peer confirmation tracking model

Create a new persisted model (example name: `PeerOplogConfirmation`) with the following fields (see the sketch after the list):

- `PeerNodeId`
- `SourceNodeId`
- `ConfirmedWall`
- `ConfirmedLogic`
- `ConfirmedHash`
- `LastConfirmedUtc`
- `IsActive`
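
A minimal sketch of how this model could look in C#; the property names mirror the list above, while the types (and storing the HLC wall/logic parts as separate primitives) are assumptions rather than the final shape:

```csharp
using System;

// Illustrative shape only: property names mirror the field list above, types are assumptions.
public sealed class PeerOplogConfirmation
{
    // Tracked peer that acknowledged the entries.
    public string PeerNodeId { get; set; } = string.Empty;

    // Node whose oplog entries were confirmed (supports multi-source replication).
    public string SourceNodeId { get; set; } = string.Empty;

    // Hybrid logical clock components of the highest confirmed entry,
    // stored as primitives so they can be indexed and compared in range scans.
    public long ConfirmedWall { get; set; }
    public long ConfirmedLogic { get; set; }

    // Hash of the highest confirmed entry, usable as an integrity/tie-break check.
    public string? ConfirmedHash { get; set; }

    // When this watermark last advanced.
    public DateTime LastConfirmedUtc { get; set; }

    // Set to false when the peer is explicitly removed from tracking.
    public bool IsActive { get; set; } = true;
}
```

Keeping wall/logic as separate properties is what makes the `(SourceNodeId, ConfirmedWall, ConfirmedLogic)` index in section 3 usable for cutoff scans.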
### 2. Add store abstraction

Add `IPeerOplogConfirmationStore` with the following operations (see the interface sketch after the list):

- `EnsurePeerRegisteredAsync(peerNodeId, address, type)`
- `UpdateConfirmationAsync(peerNodeId, sourceNodeId, timestamp, hash)`
- `GetConfirmationsAsync()` and `GetConfirmationsForPeerAsync(peerNodeId)`
- `RemovePeerTrackingAsync(peerNodeId)`
- `GetActiveTrackedPeersAsync()`
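
A minimal interface sketch under the same assumptions as the model above (string node IDs, an assumed `HlcTimestamp` value type for the HLC point, and `CancellationToken` plumbing added for illustration):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: parameter and return types are assumptions, not a final contract.
public interface IPeerOplogConfirmationStore
{
    // Idempotent: creates the tracking record if missing, otherwise refreshes address/type.
    Task EnsurePeerRegisteredAsync(string peerNodeId, string address, string type, CancellationToken ct = default);

    // Advances the (peerNodeId, sourceNodeId) watermark; implementations should never move it backwards.
    Task UpdateConfirmationAsync(string peerNodeId, string sourceNodeId, HlcTimestamp timestamp, string hash, CancellationToken ct = default);

    Task<IReadOnlyList<PeerOplogConfirmation>> GetConfirmationsAsync(CancellationToken ct = default);
    Task<IReadOnlyList<PeerOplogConfirmation>> GetConfirmationsForPeerAsync(string peerNodeId, CancellationToken ct = default);

    // Explicit removal path for deprecated peers; unblocks pruning.
    Task RemovePeerTrackingAsync(string peerNodeId, CancellationToken ct = default);

    Task<IReadOnlyList<string>> GetActiveTrackedPeersAsync(CancellationToken ct = default);
}

// Assumed HLC timestamp value type used by the sketches in this plan.
public readonly record struct HlcTimestamp(long Wall, long Logic);
```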
### 3. BLite implementation

- Add entity, mapper, and indexed collection to `CBDDCDocumentDbContext`.
- Index strategy:
  - unique `(PeerNodeId, SourceNodeId)`
  - index `IsActive`
  - index `(SourceNodeId, ConfirmedWall, ConfirmedLogic)` for cutoff scans

### 4. Snapshot support

Include peer-confirmation state in snapshot export/import/merge so pruning-safety state survives backup/restore.

## Sync and Pruning Behavior Plan

### 5. Auto-register peers when discovered

On each orchestrator loop, before sync attempts (see the sketch after the list):

- collect the merged peer list (discovered + known peers)
- call `EnsurePeerRegisteredAsync(...)` for each peer
- skip the local node
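
A minimal sketch of that pass, assuming the orchestrator can hand over a merged peer list and knows its own node ID; the `PeerInfo` record and class name here are illustrative, not existing types:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Assumed shape of a merged discovery/config entry; the real type comes from the discovery services.
public sealed record PeerInfo(string NodeId, string Address, string Type);

public static class PeerRegistrationPass
{
    // Runs at the start of each orchestrator loop, before any sync attempt.
    public static async Task RegisterDiscoveredPeersAsync(
        IEnumerable<PeerInfo> mergedPeers,               // discovered + known peers
        string localNodeId,
        IPeerOplogConfirmationStore confirmations,
        CancellationToken ct)
    {
        foreach (var peer in mergedPeers)
        {
            // Never track the local node itself.
            if (string.Equals(peer.NodeId, localNodeId, StringComparison.Ordinal))
                continue;

            // Idempotent: repeated discovery events for the same peer are harmless.
            await confirmations.EnsurePeerRegisteredAsync(peer.NodeId, peer.Address, peer.Type, ct);
        }
    }
}
```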
### 6. Advance confirmation watermarks

During a sync session with a peer (see the sketch after the list):

- after the vector-clock exchange, advance the watermark for source nodes where the remote is already ahead of or equal to the local clock
- after a successful push batch, advance the watermark to the max pushed timestamp/hash per source node
- persist updates atomically per peer when possible
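
A minimal sketch of the push-side advancement (the vector-clock path would call the same store method); the `OplogEntry` shape shown here is an assumption, standing in for whatever the real oplog entry type exposes:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Assumed per-entry shape; the real oplog entry type already carries source node, HLC, and hash.
public sealed record OplogEntry(string SourceNodeId, HlcTimestamp Timestamp, string Hash);

public static class ConfirmationWatermarks
{
    // After a push batch the peer acknowledged, advance its watermark per source node
    // to the newest entry that was pushed.
    public static async Task AdvanceAfterPushAsync(
        string peerNodeId,
        IReadOnlyList<OplogEntry> acknowledgedBatch,
        IPeerOplogConfirmationStore confirmations,
        CancellationToken ct)
    {
        foreach (var group in acknowledgedBatch.GroupBy(e => e.SourceNodeId))
        {
            var newest = group
                .OrderByDescending(e => e.Timestamp.Wall)
                .ThenByDescending(e => e.Timestamp.Logic)
                .First();

            await confirmations.UpdateConfirmationAsync(
                peerNodeId, group.Key, newest.Timestamp, newest.Hash, ct);
        }
    }
}
```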
### 7. Gate oplog pruning by peer confirmation

Replace the retention-only prune trigger with a safe cutoff computation:

- compute the retention cutoff from `OplogRetentionHours` (existing behavior)
- compute the confirmation cutoff as the minimum confirmed point across active tracked peers
- effective cutoff = minimum(retention cutoff, confirmation cutoff)
- prune only up to the effective cutoff

If any active tracked peer has no confirmation for the relevant source nodes, do not prune those ranges (see the sketch below).
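
A minimal sketch of that calculation per source node, assuming HLC points are compared wall-first then logic, and that an active peer with no confirmation for the source blocks pruning outright; the calculator class name and the no-tracked-peers fallback are assumptions:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PruneCutoffCalculator
{
    // Effective cutoff = min(retention cutoff, confirmation cutoff across active tracked peers).
    // Returns null when pruning for this source node must be skipped entirely.
    public static HlcTimestamp? ComputeEffectiveCutoff(
        string sourceNodeId,
        HlcTimestamp retentionCutoff,                        // derived from OplogRetentionHours
        IReadOnlyList<PeerOplogConfirmation> confirmations)  // persisted confirmation records
    {
        var active = confirmations.Where(c => c.IsActive).ToList();
        if (active.Count == 0)
            return retentionCutoff; // assumption: with no tracked peers, keep retention-only behavior

        HlcTimestamp? confirmationCutoff = null;
        foreach (var peerId in active.Select(c => c.PeerNodeId).Distinct())
        {
            var confirmed = active.FirstOrDefault(
                c => c.PeerNodeId == peerId && c.SourceNodeId == sourceNodeId);
            if (confirmed is null)
                return null; // an active peer has confirmed nothing for this source: do not prune

            var point = new HlcTimestamp(confirmed.ConfirmedWall, confirmed.ConfirmedLogic);
            if (confirmationCutoff is null || Compare(point, confirmationCutoff.Value) < 0)
                confirmationCutoff = point; // keep the minimum confirmed point across peers
        }

        return Compare(retentionCutoff, confirmationCutoff.Value) <= 0
            ? retentionCutoff
            : confirmationCutoff;
    }

    private static int Compare(HlcTimestamp a, HlcTimestamp b)
        => a.Wall != b.Wall ? a.Wall.CompareTo(b.Wall) : a.Logic.CompareTo(b.Logic);
}
```

The orchestrator's maintenance path would then pass the effective cutoff to `IOplogStore.PruneOplogAsync(...)` instead of the raw retention cutoff.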
### 8. Deprecated peer removal path

Provide an explicit management operation to unblock pruning for decommissioned peers (a sketch follows the list):

- add a method to the management service (example: `RemovePeerTrackingAsync(nodeId, removeRemoteConfig = true)`)
- remove the peer from the confirmation tracking store
- optionally remove its static peer configuration
- document the operator workflow for node deprecation
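
A minimal sketch of the operation, assuming it is added next to the existing `PeerManagementService`; the class name below is a placeholder, and the config-removal step is left as a comment because the exact `IPeerConfigurationStore` call is not specified in this plan:

```csharp
using System.Threading.Tasks;

// Illustrative shape of the management-side operation; the real method would live on the
// existing PeerManagementService (the class name here is a placeholder).
public sealed class PeerDeprecationSketch
{
    private readonly IPeerOplogConfirmationStore _confirmations;

    public PeerDeprecationSketch(IPeerOplogConfirmationStore confirmations)
        => _confirmations = confirmations;

    // Removing tracking is what unblocks pruning for a decommissioned node.
    public async Task RemovePeerTrackingAsync(string nodeId, bool removeRemoteConfig = true)
    {
        await _confirmations.RemovePeerTrackingAsync(nodeId);

        if (removeRemoteConfig)
        {
            // Also drop the static peer configuration via the existing IPeerConfigurationStore /
            // PeerManagementService plumbing (the exact call is not specified in this plan).
        }
    }
}
```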
## Hosting Health Check Plan

### 9. Extend hosting health check payload

Update the Hosting `CBDDCHealthCheck` to include peer sync status data (a sketch of the payload shape follows the list):

- tracked peer count
- peers with no confirmation
- max lag (ms) between the local head and peer confirmation
- lagging peer list (node IDs)
- last successful confirmation update per peer
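
A minimal sketch of a payload shape that could back this data, assuming it is surfaced through the health check's data dictionary; the type and property names are illustrative:

```csharp
using System;
using System.Collections.Generic;

// Illustrative payload; the real health check would populate this from the confirmation store.
public sealed class PeerSyncHealthReport
{
    public int TrackedPeerCount { get; init; }

    // Peers that have never confirmed anything (these block pruning entirely).
    public IReadOnlyList<string> PeersWithNoConfirmation { get; init; } = Array.Empty<string>();

    // Worst-case gap between the local oplog head and a peer's confirmed watermark.
    public double MaxLagMilliseconds { get; init; }

    public IReadOnlyList<string> LaggingPeers { get; init; } = Array.Empty<string>();

    // Last time each peer's watermark advanced.
    public IReadOnlyDictionary<string, DateTime> LastConfirmationUtcByPeer { get; init; }
        = new Dictionary<string, DateTime>();
}
```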
### 10. Health status policy

- `Healthy`: persistence OK and all active tracked peers within the lag threshold
- `Degraded`: persistence OK but one or more peers lagging or unconfirmed
- `Unhealthy`: persistence unavailable, or a critical lag breach (configurable)

Add configurable thresholds to the hosting options/cluster options (a classification sketch follows).
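
A minimal classification sketch over the report above, assuming the thresholds come from new hosting/cluster options and that the existing `CBDDCHealthCheck` would call this after its persistence check:

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class PeerSyncHealthPolicy
{
    // Maps persistence state plus peer sync lag onto a health status.
    // Threshold parameters stand in for assumed hosting/cluster options.
    public static HealthCheckResult Classify(
        bool persistenceOk,
        PeerSyncHealthReport report,
        double degradedLagMs,
        double unhealthyLagMs)
    {
        if (!persistenceOk)
            return HealthCheckResult.Unhealthy("Oplog persistence unavailable.");

        if (report.MaxLagMilliseconds >= unhealthyLagMs)
            return HealthCheckResult.Unhealthy("Peer sync lag exceeds the critical threshold.");

        if (report.PeersWithNoConfirmation.Count > 0 || report.MaxLagMilliseconds >= degradedLagMs)
            return HealthCheckResult.Degraded("One or more peers are lagging or unconfirmed.");

        return HealthCheckResult.Healthy("All active tracked peers are within the lag threshold.");
    }
}
```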
## Implementation Phases

### Phase 1: Persistence and contracts

- Add model + store interface + BLite implementation + DI wiring.
- Add tests for CRUD, idempotent register, and explicit remove.

### Phase 2: Sync integration

- Register peers from discovery.
- Update confirmations from vector clock + push success.
- Add sync tests validating watermark advancement.

### Phase 3: Safe pruning

- Implement cutoff calculator service.
- Integrate with orchestrator maintenance path.
- Add two-node tests proving no prune before peer confirmation.

### Phase 4: Management and health

- Expose remove-tracking operation in peer management API.
- Extend hosting healthcheck output and status policy.
- Add hosting healthcheck tests for Healthy/Degraded/Unhealthy.

### Phase 5: Docs and rollout

- Update docs for peer lifecycle and pruning semantics.
- Add upgrade notes and operational runbook for peer deprecation.
## Safe Subagent Usage

- Use subagents only for isolated, low-coupling tasks with clear file ownership boundaries.
- Assign each subagent a narrow scope (one component or one test suite at a time).
- Require explicit task contracts for each subagent, including input files/components, expected output, and forbidden operations.
- Prohibit destructive repository actions by subagents (`reset --hard`, force-push, history rewrite, broad file deletion).
- Require subagents to report what changed, why, and which tests were run.
- Do not accept subagent-authored changes directly into final output without primary-agent review.

## Mandatory Verification After Subagent Work

- Enforce a verification gate after every subagent-delivered change before integration.
- Verification gate checklist:
  - independently review the produced diff
  - run targeted unit/integration tests for touched behavior
  - validate impacted acceptance criteria
  - confirm no regressions in pruning safety, peer lifecycle handling, or healthcheck output
- Reject or rework any subagent output that fails verification.
- Only merge/integrate subagent output after verification evidence is documented.
## Test Plan (Minimum)

- Unit:
  - confirmation store upsert/get/remove behavior
  - auto-register is idempotent
  - safe cutoff computation with mixed peer states
  - removing a peer from tracking immediately changes cutoff eligibility
  - healthcheck classification rules

- Integration (two-node focus):
  - Node B offline: Node A does not prune the range that still requires confirmation
  - Node B catches up: Node A prunes once confirmed
  - A newly joined node auto-registers without a manual call
  - Deprecated node removal unblocks pruning
## Risks and Mitigations

- Risk: indefinite oplog growth if a peer never confirms.
  - Mitigation: explicit removal workflow and degraded health visibility.

- Risk: confirmation drift after restore/restart.
  - Mitigation: snapshot persistence of confirmation records.

- Risk: mixed-version cluster behavior.
  - Mitigation: rely on existing vector-clock exchange first; keep protocol additions backward compatible if introduced later.
## Acceptance Criteria

- Oplog entries are not pruned while any active tracked peer has not confirmed the required ranges.
- Newly discovered peers are automatically present in tracking without operator action.
- Operators can explicitly remove a deprecated peer from tracking, and pruning resumes accordingly.
- The hosting health endpoint exposes peer sync lag/confirmation status and returns degraded/unhealthy when appropriate.