# Plan: Peer-Confirmed Oplog Retention and Peer Sync Health
## Objective
Move from time-only oplog pruning to peer-confirmed pruning, so oplog entries are not removed until tracked peers have confirmed them, and add peer lifecycle management and peer sync health visibility.
## Requested Outcomes
- Oplogs are not cleared until confirmed by peers.
- Each node tracks latest oplog confirmation per peer.
- New peers are automatically registered when they join.
- Deprecated peers can be explicitly removed from tracking.
- Hosting health checks report peer sync status, not only store availability.
## Current Baseline (Codebase)
- Pruning is retention-time based in `SyncOrchestrator` + `IOplogStore.PruneOplogAsync(...)`.
- Push ACK is binary success/fail (`AckResponse`) and does not persist peer confirmation state.
- Peer discovery exists (`IDiscoveryService`, `UdpDiscoveryService`, `CompositeDiscoveryService`).
- Persistent remote peer config exists (`IPeerConfigurationStore`, `PeerManagementService`).
- Hosting health check only validates oplog access (`CBDDCHealthCheck` in Hosting project).
## Design Decisions
- Track confirmation as a persisted watermark per `(peerNodeId, sourceNodeId)` using HLC and hash.
- Use the existing vector-clock exchange and successful push results to advance confirmations (no breaking wire-protocol change is required).
- Treat tracked peers as pruning blockers until explicitly removed.
- Keep peer registration idempotent and safe for repeated discovery events.
## Data Model and Persistence Plan
### 1. Add peer confirmation tracking model
Create a new persisted model (example name: `PeerOplogConfirmation`) with the following fields (a sketch follows the list):
- `PeerNodeId`
- `SourceNodeId`
- `ConfirmedWall`
- `ConfirmedLogic`
- `ConfirmedHash`
- `LastConfirmedUtc`
- `IsActive`
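
A minimal sketch of the persisted model, assuming the HLC timestamp is stored as separate wall/logic components alongside a hash; property types are assumptions to be aligned with the existing oplog entities.

```csharp
using System;

// Sketch only: names follow the plan above; types are assumptions pending
// alignment with the existing oplog/HLC entities.
public sealed class PeerOplogConfirmation
{
    // The confirming peer and the node whose oplog entries it has confirmed.
    public string PeerNodeId { get; set; } = string.Empty;
    public string SourceNodeId { get; set; } = string.Empty;

    // HLC watermark (wall/logic) and hash of the latest confirmed entry.
    public long ConfirmedWall { get; set; }
    public long ConfirmedLogic { get; set; }
    public string? ConfirmedHash { get; set; }

    // Bookkeeping for health reporting and peer lifecycle.
    public DateTime LastConfirmedUtc { get; set; }
    public bool IsActive { get; set; } = true;
}
```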
### 2. Add store abstraction
Add `IPeerOplogConfirmationStore` with the following operations (see the sketch after this list):
- `EnsurePeerRegisteredAsync(peerNodeId, address, type)`
- `UpdateConfirmationAsync(peerNodeId, sourceNodeId, timestamp, hash)`
- `GetConfirmationsAsync()` and `GetConfirmationsForPeerAsync(peerNodeId)`
- `RemovePeerTrackingAsync(peerNodeId)`
- `GetActiveTrackedPeersAsync()`
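
A possible shape for the store contract; the HLC timestamp is flattened into wall/logic components to match the model above, and the `type` parameter stands in for whatever peer-type representation the codebase already uses.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Sketch of the store contract; parameter shapes are assumptions.
public interface IPeerOplogConfirmationStore
{
    // Idempotent: safe to call on every discovery event.
    Task EnsurePeerRegisteredAsync(string peerNodeId, string address, string type,
        CancellationToken ct = default);

    // Upsert the confirmation watermark for (peerNodeId, sourceNodeId).
    Task UpdateConfirmationAsync(string peerNodeId, string sourceNodeId,
        long confirmedWall, long confirmedLogic, string confirmedHash,
        CancellationToken ct = default);

    Task<IReadOnlyList<PeerOplogConfirmation>> GetConfirmationsAsync(
        CancellationToken ct = default);

    Task<IReadOnlyList<PeerOplogConfirmation>> GetConfirmationsForPeerAsync(
        string peerNodeId, CancellationToken ct = default);

    // Lifecycle: stop treating a decommissioned peer as a pruning blocker.
    Task RemovePeerTrackingAsync(string peerNodeId, CancellationToken ct = default);

    Task<IReadOnlyList<string>> GetActiveTrackedPeersAsync(CancellationToken ct = default);
}
```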
### 3. BLite implementation
- Add entity, mapper, and indexed collection to `CBDDCDocumentDbContext`.
- Index strategy:
- unique `(PeerNodeId, SourceNodeId)`
- index `IsActive`
- index `(SourceNodeId, ConfirmedWall, ConfirmedLogic)` for cutoff scans
### 4. Snapshot support
Include peer-confirmation state in snapshot export/import/merge so pruning safety state survives backup/restore.
## Sync and Pruning Behavior Plan
### 5. Auto-register peers when discovered
On each orchestrator loop, before any sync attempt (see the sketch after this list):
- collect merged peer list (discovered + known peers)
- call `EnsurePeerRegisteredAsync(...)` for each peer
- skip local node
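
A hedged sketch of the registration pass; `_discoveryService`, `_peerConfigurationStore`, `_confirmationStore`, and `_localNodeId` are assumed injected dependencies, and the `DiscoverPeersAsync`/`GetKnownPeersAsync` calls and the peer's `NodeId`/`Address`/`Type` properties are assumptions about the existing discovery and configuration APIs.

```csharp
// Sketch: merge discovered and configured peers, then register each one
// idempotently before any sync attempt. Requires System.Linq.
private async Task RegisterTrackedPeersAsync(CancellationToken ct)
{
    var discovered = await _discoveryService.DiscoverPeersAsync(ct);
    var configured = await _peerConfigurationStore.GetKnownPeersAsync(ct);

    var merged = discovered.Concat(configured)
        .Where(p => p.NodeId != _localNodeId)   // skip the local node
        .GroupBy(p => p.NodeId)
        .Select(g => g.First());

    foreach (var peer in merged)
    {
        // Safe to repeat: the store upserts the registration.
        await _confirmationStore.EnsurePeerRegisteredAsync(
            peer.NodeId, peer.Address, peer.Type, ct);
    }
}
```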
### 6. Advance confirmation watermarks
During a sync session with a peer (see the sketch after this list):
- after the vector-clock exchange, advance the watermark for source nodes where the remote is already at or ahead of the local position
- after each successful push batch, advance the watermark to the maximum pushed timestamp/hash per source node
- persist updates atomically per peer when possible
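
A sketch of advancing watermarks after a successful push batch; the `OplogEntry` properties (`SourceNodeId`, `Wall`, `Logic`, `Hash`) and the `_confirmationStore` field are assumptions about the existing types.

```csharp
// Sketch: once a push batch is acknowledged, advance the per-source watermark
// to the highest HLC timestamp that was pushed to this peer.
private async Task AdvanceConfirmationsAsync(
    string peerNodeId, IReadOnlyList<OplogEntry> pushedEntries, CancellationToken ct)
{
    foreach (var group in pushedEntries.GroupBy(e => e.SourceNodeId))
    {
        // Highest (wall, logic) pair in this batch for the given source node.
        var latest = group
            .OrderByDescending(e => e.Wall)
            .ThenByDescending(e => e.Logic)
            .First();

        await _confirmationStore.UpdateConfirmationAsync(
            peerNodeId, group.Key, latest.Wall, latest.Logic, latest.Hash, ct);
    }
}
```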
### 7. Gate oplog pruning by peer confirmation
Replace the retention-only prune trigger with a safe cutoff computation (see the sketch after this list):
- compute retention cutoff from `OplogRetentionHours` (existing behavior)
- compute confirmation cutoff as the minimum confirmed point across active tracked peers
- effective cutoff = minimum(retention cutoff, confirmation cutoff), i.e. the earlier of the two
- prune only to effective cutoff
If any active tracked peer has no confirmation for relevant source nodes, do not prune those ranges.
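
A minimal sketch of the cutoff rule for one source node; the comparison is simplified to the HLC wall component, whereas the real implementation would compare the full (wall, logic) pair.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: prune only up to the older (safer) of the retention cutoff and the
// minimum confirmed point across active tracked peers.
public static class SafeCutoffCalculator
{
    public static long? ComputeEffectiveCutoff(
        long retentionCutoffWall,
        string sourceNodeId,
        IReadOnlyList<PeerOplogConfirmation> confirmations,
        IReadOnlyList<string> activeTrackedPeers)
    {
        long confirmationCutoff = long.MaxValue;

        foreach (var peer in activeTrackedPeers)
        {
            var c = confirmations.FirstOrDefault(
                x => x.PeerNodeId == peer && x.SourceNodeId == sourceNodeId);

            // An active peer with no confirmation blocks pruning for this source.
            if (c is null)
                return null;

            confirmationCutoff = Math.Min(confirmationCutoff, c.ConfirmedWall);
        }

        // Effective cutoff = the earlier of the retention and confirmation cutoffs.
        return Math.Min(retentionCutoffWall, confirmationCutoff);
    }
}
```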
### 8. Deprecated peer removal path
Provide an explicit management operation to unblock pruning for decommissioned peers (see the sketch after this list):
- add method in management service (example: `RemovePeerTrackingAsync(nodeId, removeRemoteConfig = true)`)
- remove from confirmation tracking store
- optionally remove static peer configuration
- document operator workflow for node deprecation
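
A sketch of the management-service operation; `RemovePeerAsync` on the peer configuration store is an assumed method name, not an existing API.

```csharp
// Sketch: unblock pruning for a decommissioned peer by dropping its
// confirmation tracking and, optionally, its static peer configuration.
public async Task RemovePeerTrackingAsync(
    string nodeId, bool removeRemoteConfig = true, CancellationToken ct = default)
{
    await _confirmationStore.RemovePeerTrackingAsync(nodeId, ct);

    if (removeRemoteConfig)
    {
        // Assumed method on the existing peer configuration store.
        await _peerConfigurationStore.RemovePeerAsync(nodeId, ct);
    }
}
```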
## Hosting Health Check Plan
### 9. Extend hosting health check payload
Update the Hosting `CBDDCHealthCheck` to include peer sync status data (see the sketch after this list):
- tracked peer count
- peers with no confirmation
- max lag (ms) between local head and peer confirmation
- lagging peer list (node IDs)
- last successful confirmation update per peer
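
A sketch of the data payload added to the health-check result; key names are suggestions, and the values would be computed from the confirmation store and the local oplog head.

```csharp
// Sketch: peer sync data to attach to the HealthCheckResult alongside the
// existing store-availability outcome. Key names are suggestions only.
private static IReadOnlyDictionary<string, object> BuildPeerSyncData(
    int trackedPeerCount,
    IReadOnlyList<string> peersWithoutConfirmation,
    double maxPeerLagMs,
    IReadOnlyList<string> laggingPeers,
    IReadOnlyDictionary<string, DateTime> lastConfirmationUtcByPeer)
    => new Dictionary<string, object>
    {
        ["trackedPeerCount"] = trackedPeerCount,
        ["peersWithoutConfirmation"] = peersWithoutConfirmation,
        ["maxPeerLagMs"] = maxPeerLagMs,
        ["laggingPeers"] = laggingPeers,
        ["lastConfirmationUtcByPeer"] = lastConfirmationUtcByPeer,
    };
```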
### 10. Health status policy
- `Healthy`: persistence OK and all active tracked peers within lag threshold
- `Degraded`: persistence OK but one or more peers lagging/unconfirmed
- `Unhealthy`: persistence unavailable, or critical lag breach (configurable)
Add configurable thresholds in hosting options/cluster options.
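
A sketch of the classification rule with configurable thresholds; the options class and its property names are assumptions to be folded into the existing hosting/cluster options.

```csharp
using System;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Sketch: threshold options plus the Healthy/Degraded/Unhealthy decision.
public sealed class PeerSyncHealthOptions
{
    public TimeSpan DegradedLagThreshold { get; set; } = TimeSpan.FromMinutes(5);
    public TimeSpan UnhealthyLagThreshold { get; set; } = TimeSpan.FromHours(1);
}

public static class PeerSyncHealthPolicy
{
    public static HealthStatus Classify(
        bool persistenceOk,
        TimeSpan maxPeerLag,
        int peersWithoutConfirmation,
        PeerSyncHealthOptions options)
    {
        // Unhealthy: store unavailable or a critical lag breach.
        if (!persistenceOk || maxPeerLag >= options.UnhealthyLagThreshold)
            return HealthStatus.Unhealthy;

        // Degraded: persistence OK but at least one peer lagging or unconfirmed.
        if (peersWithoutConfirmation > 0 || maxPeerLag >= options.DegradedLagThreshold)
            return HealthStatus.Degraded;

        return HealthStatus.Healthy;
    }
}
```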
## Implementation Phases
### Phase 1: Persistence and contracts
- Add model + store interface + BLite implementation + DI wiring.
- Add tests for CRUD, idempotent register, and explicit remove.
### Phase 2: Sync integration
- Register peers from discovery.
- Update confirmations from vector clock + push success.
- Add sync tests validating watermark advancement.
### Phase 3: Safe pruning
- Implement cutoff calculator service.
- Integrate with orchestrator maintenance path.
- Add two-node tests proving no prune before peer confirmation.
### Phase 4: Management and health
- Expose remove-tracking operation in peer management API.
- Extend the hosting health check output and status policy.
- Add hosting health check tests for Healthy/Degraded/Unhealthy.
### Phase 5: Docs and rollout
- Update docs for peer lifecycle and pruning semantics.
- Add upgrade notes and operational runbook for peer deprecation.
## Safe Subagent Usage
- Use subagents only for isolated, low-coupling tasks with clear file ownership boundaries.
- Assign each subagent a narrow scope (one component or one test suite at a time).
- Require explicit task contracts for each subagent including input files/components, expected output, and forbidden operations.
- Prohibit destructive repository actions by subagents (`reset --hard`, force-push, history rewrite, broad file deletion).
- Require subagents to report what changed, why, and which tests were run.
- Do not accept subagent-authored changes directly into final output without primary-agent review.
## Mandatory Verification After Subagent Work
- Enforce a verification gate after every subagent-delivered change before integration.
- Verification gate checklist:
  - independently review the produced diff
  - run targeted unit/integration tests for the touched behavior
  - validate the impacted acceptance criteria
  - confirm no regressions in pruning safety, peer lifecycle handling, or health check output
- Reject or rework any subagent output that fails verification.
- Only merge/integrate subagent output after verification evidence is documented.
## Test Plan (Minimum)
- Unit:
- confirmation store upsert/get/remove behavior
- auto-register is idempotent
- safe cutoff computation with mixed peer states (see the test sketch at the end of this section)
- removing a peer from tracking immediately changes cutoff eligibility
- healthcheck classification rules
- Integration (two-node focus):
- Node B offline: Node A does not prune confirmed-required range
- Node B catches up: Node A prunes once confirmed
- New node join auto-registers without manual call
- Deprecated node removal unblocks pruning
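
As an illustration of the "safe cutoff computation with mixed peer states" unit case, a minimal xUnit-style sketch against the hypothetical `SafeCutoffCalculator` shown earlier:

```csharp
using System.Collections.Generic;
using Xunit;

public class SafeCutoffCalculatorTests
{
    [Fact]
    public void EffectiveCutoff_IsNull_WhenAnActivePeerHasNoConfirmation()
    {
        var confirmations = new List<PeerOplogConfirmation>
        {
            new() { PeerNodeId = "peer-a", SourceNodeId = "node-1", ConfirmedWall = 1_000 },
            // peer-b is actively tracked but has never confirmed node-1.
        };
        var activePeers = new List<string> { "peer-a", "peer-b" };

        var cutoff = SafeCutoffCalculator.ComputeEffectiveCutoff(
            retentionCutoffWall: 5_000,
            sourceNodeId: "node-1",
            confirmations: confirmations,
            activeTrackedPeers: activePeers);

        // Pruning must be blocked entirely for this source node.
        Assert.Null(cutoff);
    }
}
```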
## Risks and Mitigations
- Risk: indefinite growth if a peer never confirms.
- Mitigation: explicit removal workflow and degraded health visibility.
- Risk: confirmation drift after restore/restart.
- Mitigation: snapshot persistence of confirmation records.
- Risk: mixed-version cluster behavior.
- Mitigation: rely on existing vector clock exchange first; keep protocol additions backward compatible if introduced later.
## Acceptance Criteria
- Oplog entries are not pruned while any active tracked peer has not confirmed required ranges.
- Newly discovered peers are automatically present in tracking without operator action.
- Operators can explicitly remove a deprecated peer from tracking and pruning resumes accordingly.
- Hosting health endpoint exposes peer sync lag/confirmation status and returns degraded/unhealthy when appropriate.